<a href="https://colab.research.google.com/github/farzadmohseni-ir/business-social-network-analysis/blob/main/Car_Sales_Network_Analysis_Gephi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



## 🚗 Car Sales Network Analysis using Gephi

### 📌 Dataset Source

This project uses the dataset available on Kaggle:
🔗 [Car Sales Report – MissionJee on Kaggle](https://www.kaggle.com/datasets/missionjee/car-sales-report/data)



---

### ❓ Key Question

> **Q16: Which product features most influence car purchase decisions?**


---

### 📑 Features Description Table

| 🔢 Feature Name | 📝 Description                                             |
| --------------- | ---------------------------------------------------------- |
| `Car_id`        | Unique identifier for each car in the dataset.             |
| `Date`          | Date of the car sale transaction.                          |
| `Customer Name` | Name of the customer purchasing the car.                   |
| `Gender`        | Gender of the customer (e.g., Male, Female).               |
| `Annual Income` | Annual income of the customer.                             |
| `Dealer_Name`   | Name of the car dealer associated with the sale.           |
| `Company`       | Company or brand of the car.                               |
| `Model`         | Model name of the car.                                     |
| `Engine`        | Specifications of the car's engine.                        |
| `Transmission`  | Type of transmission in the car (e.g., Automatic, Manual). |
| `Color`         | Color of the car's exterior.                               |
| `Price ($)`     | Listed price of the car for sale.                          |
| `Dealer_No`     | Dealer identification number associated with the sale. The raw CSV header carries a trailing space (`'Dealer_No '`). |
| `Body Style`    | Style or design of the car's body (e.g., Sedan, SUV).      |
| `Phone`         | Contact phone number associated with the car sale.         |
| `Dealer_Region` | Geographic region or location of the car dealer.           |

---

In [None]:
# Install gdown
!pip install -q gdown

# Import required libraries
import gdown
import pandas as pd

# Define file ID and desired output file name
file_id = '1d54Ulgj_Mmh7OdVspuN3NaSq-k79oFxF'
output = 'Car_Sales_Report.csv'

# Download CSV file from Google Drive
gdown.download(f'https://drive.google.com/uc?id={file_id}', output, quiet=False)

# Load the dataset using pandas
df = pd.read_csv(output)

# Display the first few rows
df.head()

Downloading...
From: https://drive.google.com/uc?id=1d54Ulgj_Mmh7OdVspuN3NaSq-k79oFxF
To: /content/Car_Sales_Report.csv
100%|██████████| 3.83M/3.83M [00:00<00:00, 78.8MB/s]


,Car_id,Date,Customer Name,Gender,Annual Income,Dealer_Name,Company,Model,Engine,Transmission,Color,Price ($),Dealer_No,Body Style,Phone,Dealer_Region
0,C_CND_000001,1/2/2022,Geraldine,Male,13500,Buddy Storbeck's Diesel Service Inc,Ford,Expedition,DoubleÂ Overhead Camshaft,Auto,Black,26000,06457-3834,SUV,8264678,Middletown
1,C_CND_000002,1/2/2022,Gia,Male,1480000,C & M Motors Inc,Dodge,Durango,DoubleÂ Overhead Camshaft,Auto,Black,19000,60504-7114,SUV,6848189,Aurora
2,C_CND_000003,1/2/2022,Gianna,Male,1035000,Capitol KIA,Cadillac,Eldorado,Overhead Camshaft,Manual,Red,31500,38701-8047,Passenger,7298798,Greenville
3,C_CND_000004,1/2/2022,Giselle,Male,13500,Chrysler of Tri-Cities,Toyota,Celica,Overhead Camshaft,Manual,Pale White,14000,99301-3882,SUV,6257557,Pasco
4,C_CND_000005,1/2/2022,Grace,Male,1465000,Chrysler Plymouth,Acura,TL,DoubleÂ Overhead Camshaft,Auto,Red,24500,53546-9427,Hatchback,7081483,Janesville
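
The `Engine` values in the preview above contain a stray `Â` (as in `DoubleÂ Overhead Camshaft`), which usually points to a non-breaking space decoded with the wrong codec. The following is a minimal sketch of one way to normalise such labels before they become node names. It assumes the stray character is U+00C2, possibly followed by a non-breaking space; `clean_text` is a hypothetical helper and the sample values are toy data, not read from the real file.

```python
import pandas as pd

def clean_text(series: pd.Series) -> pd.Series:
    """Drop the stray 'Â' (U+00C2) and collapse any whitespace run to one space."""
    return (
        series.str.replace("\u00c2", "", regex=False)   # remove the mis-decoded byte
              .str.replace(r"\s+", " ", regex=True)     # \s also matches U+00A0
              .str.strip()
    )

# Toy values imitating the pattern seen in the preview (assumed byte sequence)
engine = pd.Series(["Double\u00c2\u00a0Overhead Camshaft", "Overhead Camshaft"])
cleaned = clean_text(engine)
print(cleaned.tolist())  # ['Double Overhead Camshaft', 'Overhead Camshaft']
```

Applied to the real frame this would look like `df['Engine'] = clean_text(df['Engine'])`. It only changes the label text, not the number of distinct engine types.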


In [None]:
# Display the list of all column names
print("List of all column names in the dataset:\n")
print(df.columns.tolist())

List of all column names in the dataset:

['Car_id', 'Date', 'Customer Name', 'Gender', 'Annual Income', 'Dealer_Name', 'Company', 'Model', 'Engine', 'Transmission', 'Color', 'Price ($)', 'Dealer_No ', 'Body Style', 'Phone', 'Dealer_Region']
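
The list above shows `'Dealer_No '` with a trailing space, so `df['Dealer_No']` would raise a `KeyError`. A minimal sketch of stripping whitespace from the headers, using a toy frame with invented values in place of the real dataset:

```python
import pandas as pd

# Toy frame standing in for the real dataset; note the trailing space in 'Dealer_No '
toy = pd.DataFrame({"Car_id": ["C_1"], "Dealer_No ": ["06457-3834"]})

# Strip leading/trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(toy.columns.tolist())  # ['Car_id', 'Dealer_No']
```

On the notebook's frame the same one-liner would be `df.columns = df.columns.str.strip()`.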


In [None]:
# Retrieve the number of rows (samples) and columns (features) in the dataset
rows, cols = df.shape

# Print dataset shape information in a structured format
print(f"Total number of samples (rows): {rows}\n")
print(f"Total number of features (columns): {cols}")

Total number of samples (rows): 23906

Total number of features (columns): 16


In [None]:
# Get a concise summary of the DataFrame

print("Summary of the dataset:\n")
df.info()

Summary of the dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23906 entries, 0 to 23905
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Car_id         23906 non-null  object
 1   Date           23906 non-null  object
 2   Customer Name  23905 non-null  object
 3   Gender         23906 non-null  object
 4   Annual Income  23906 non-null  int64 
 5   Dealer_Name    23906 non-null  object
 6   Company        23906 non-null  object
 7   Model          23906 non-null  object
 8   Engine         23906 non-null  object
 9   Transmission   23906 non-null  object
 10  Color          23906 non-null  object
 11  Price ($)      23906 non-null  int64 
 12  Dealer_No      23906 non-null  object
 13  Body Style     23906 non-null  object
 14  Phone          23906 non-null  int64 
 15  Dealer_Region  23906 non-null  object
dtypes: int64(3), object(13)
memory usage: 2.9+ MB


In [None]:
# Check for duplicate rows in the dataset
duplicate_rows = df.duplicated()

# Count how many duplicate rows exist
num_duplicates = duplicate_rows.sum()

# Print the result
print(f"Number of duplicate rows in the dataset: {num_duplicates}")

Number of duplicate rows in the dataset: 0


In [None]:
# Check for all types of missing values: NaN, empty string, and '?'

# Count standard missing values (NaN)
nan_count = df.isnull().sum()

# Count empty string values
empty_str_count = (df == '').sum()

# Count cells with question mark '?'
question_mark_count = (df == '?').sum()

# Combine all into a single DataFrame
missing_summary = pd.DataFrame({
    'NaN Count': nan_count,
    'Empty String Count': empty_str_count,
    "'?' Count": question_mark_count
})

# Total suspicious values per column
missing_summary['Total Suspect Values'] = missing_summary.sum(axis=1)

# Display the result
print(" Missing or suspicious values summary:\n")
display(missing_summary)

# Total in the entire dataset
print("\n Total suspicious cells in entire dataset:", missing_summary['Total Suspect Values'].sum())

 Missing or suspicious values summary:



,NaN Count,Empty String Count,'?' Count,Total Suspect Values
Car_id,0,0,0,0
Date,0,0,0,0
Customer Name,1,0,0,1
Gender,0,0,0,0
Annual Income,0,0,0,0
Dealer_Name,0,0,0,0
Company,0,0,0,0
Model,0,0,0,0
Engine,0,0,0,0
Transmission,0,0,0,0



 Total suspicious cells in entire dataset: 1


In [None]:
# Remove the row where 'Customer Name' is missing (NaN)
df = df.dropna(subset=['Customer Name'])

# Reset index after dropping
df.reset_index(drop=True, inplace=True)

print("Row with missing 'Customer Name' removed successfully.")
print("New shape of dataset:", df.shape)

Row with missing 'Customer Name' removed successfully.
New shape of dataset: (23905, 16)
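
The drop above only changes the in-memory `df`. If a later cell reads `Car_Sales_Report.csv` again, the removed row comes back unless the cleaned frame is written to disk first. A minimal sketch of that round trip, using a toy frame and a temporary file (both hypothetical, chosen so the example runs anywhere):

```python
import os
import tempfile
import pandas as pd

# Toy frame with one missing customer name
toy = pd.DataFrame({"Car_id": ["C_1", "C_2"], "Customer Name": ["Gia", None]})
cleaned = toy.dropna(subset=["Customer Name"]).reset_index(drop=True)

# Write the cleaned frame to disk so that a later pd.read_csv sees the same rows
path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.csv")  # hypothetical file name
cleaned.to_csv(path, index=False)

reloaded = pd.read_csv(path)
print(reloaded.shape)  # (1, 2)
```

In the notebook, the equivalent step would be `df.to_csv(output, index=False)` right after the drop.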


In [None]:
import pandas as pd
import numpy as np

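# Reload the dataset from disk.
# Note: the file was never overwritten, so this discards the in-memory dropna above
# and the row with the missing 'Customer Name' is present again (23906 rows).
df = pd.read_csv('Car_Sales_Report.csv')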

# Get max price and define 5K bins dynamically
max_price = df['Price ($)'].max()
bin_max = int(np.ceil(max_price / 5000.0) * 5000) + 5000  # One extra bin for ceiling safety

bins = list(range(0, bin_max + 1, 5000))
labels = [f'{i//1000}K–{(i+5000)//1000}K' for i in bins[:-1]]

# Create the 'Price Range' column
df['Price Range'] = pd.cut(df['Price ($)'], bins=bins, labels=labels, include_lowest=True)

# Save the updated DataFrame back to the same CSV file
df.to_csv('Car_Sales_Report.csv', index=False)

# Preview the output
print("'Price Range' column with clean 5K intervals added.")
df.head(10)

'Price Range' column with clean 5K intervals added.


,Car_id,Date,Customer Name,Gender,Annual Income,Dealer_Name,Company,Model,Engine,Transmission,Color,Price ($),Dealer_No,Body Style,Phone,Dealer_Region,Price Range
0,C_CND_000001,1/2/2022,Geraldine,Male,13500,Buddy Storbeck's Diesel Service Inc,Ford,Expedition,DoubleÂ Overhead Camshaft,Auto,Black,26000,06457-3834,SUV,8264678,Middletown,25K–30K
1,C_CND_000002,1/2/2022,Gia,Male,1480000,C & M Motors Inc,Dodge,Durango,DoubleÂ Overhead Camshaft,Auto,Black,19000,60504-7114,SUV,6848189,Aurora,15K–20K
2,C_CND_000003,1/2/2022,Gianna,Male,1035000,Capitol KIA,Cadillac,Eldorado,Overhead Camshaft,Manual,Red,31500,38701-8047,Passenger,7298798,Greenville,30K–35K
3,C_CND_000004,1/2/2022,Giselle,Male,13500,Chrysler of Tri-Cities,Toyota,Celica,Overhead Camshaft,Manual,Pale White,14000,99301-3882,SUV,6257557,Pasco,10K–15K
4,C_CND_000005,1/2/2022,Grace,Male,1465000,Chrysler Plymouth,Acura,TL,DoubleÂ Overhead Camshaft,Auto,Red,24500,53546-9427,Hatchback,7081483,Janesville,20K–25K
5,C_CND_000006,1/2/2022,Guadalupe,Male,850000,Classic Chevy,Mitsubishi,Diamante,Overhead Camshaft,Manual,Pale White,12000,85257-3102,Hatchback,7315216,Scottsdale,10K–15K
6,C_CND_000007,1/2/2022,Hailey,Male,1600000,Clay Johnson Auto Sales,Toyota,Corolla,Overhead Camshaft,Manual,Pale White,14000,78758-7841,Passenger,7727879,Austin,10K–15K
7,C_CND_000008,1/2/2022,Graham,Male,13500,U-Haul CO,Mitsubishi,Galant,DoubleÂ Overhead Camshaft,Auto,Pale White,42000,78758-7841,Passenger,6206512,Austin,40K–45K
8,C_CND_000009,1/2/2022,Naomi,Male,815000,Rabun Used Car Sales,Chevrolet,Malibu,Overhead Camshaft,Manual,Pale White,82000,85257-3102,Hardtop,7194857,Pasco,80K–85K
9,C_CND_000010,1/2/2022,Grayson,Female,13500,Rabun Used Car Sales,Ford,Escort,DoubleÂ Overhead Camshaft,Auto,Pale White,15000,85257-3102,Passenger,7836892,Scottsdale,10K–15K
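
In the preview above, a price of exactly 15000 is labelled `10K–15K`. This follows from `pd.cut` using right-closed intervals by default, so each bin is `(low, high]`. The sketch below reproduces that behaviour on a few hand-picked prices and shows `right=False` as one possible alternative. The shortened bin range is an assumption made to keep the example small.

```python
import pandas as pd

# Same 5K bin construction as in the cell above, shortened to 0..35K
bins = list(range(0, 35001, 5000))
labels = [f"{i // 1000}K–{(i + 5000) // 1000}K" for i in bins[:-1]]

prices = pd.Series([14000, 15000, 15001, 26000])

# Default: right-closed bins, so 15000 belongs to (10000, 15000]
right_closed = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)
print(right_closed.astype(str).tolist())
# ['10K–15K', '10K–15K', '15K–20K', '25K–30K']

# Alternative: left-closed bins, so 15000 belongs to [15000, 20000)
left_closed = pd.cut(prices, bins=bins, labels=labels, right=False)
print(left_closed.astype(str).tolist())
# ['10K–15K', '15K–20K', '15K–20K', '25K–30K']
```

Either convention works for the network, as long as the labels are read with the chosen boundary rule in mind.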


In [None]:
import pandas as pd
from google.colab import files

# Dataset already has 'Price Range' column created before
df = pd.read_csv('Car_Sales_Report.csv')

# STEP 1: Define product feature columns (including existing 'Price Range')
feature_columns = ['Company', 'Model', 'Engine', 'Transmission', 'Color', 'Body Style', 'Price Range']

# STEP 2: Create Car nodes
car_nodes = pd.DataFrame({
    'Id': df['Car_id'],
    'Label': df['Car_id'],
    'Type': 'Car_id'
})

# STEP 3: Create feature nodes
feature_nodes = []
for col in feature_columns:
    unique_values = df[col].dropna().unique()
    temp = pd.DataFrame({
        'Id': unique_values,
        'Label': unique_values,
        'Type': col
    })
    feature_nodes.append(temp)

# STEP 4: Combine and deduplicate all nodes
all_nodes = pd.concat([car_nodes] + feature_nodes, ignore_index=True)
all_nodes.drop_duplicates(subset='Id', inplace=True)

# STEP 5: Save to CSV
all_nodes.to_csv('nodes.csv', index=False)

# STEP 6: Download file in Colab
files.download('nodes.csv')
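
Because nodes are deduplicated on the raw `Id`, a value that appeared in two feature columns would collapse into a single node carrying the `Type` of its first occurrence. One possible safeguard, sketched below on toy data, is to prefix each feature `Id` with its column name. `build_feature_nodes` is a hypothetical helper, and the `'Black'` model is an invented collision used only for illustration. The edge list would need the same prefix on its `Target` values.

```python
import pandas as pd

def build_feature_nodes(df, feature_columns):
    """Return one node per (column, value) pair, with the column name as an Id prefix."""
    frames = []
    for col in feature_columns:
        values = df[col].dropna().astype(str).unique()
        frames.append(pd.DataFrame({
            "Id": [f"{col}:{v}" for v in values],  # e.g. 'Color:Black'
            "Label": values,                        # human-readable label stays unprefixed
            "Type": col,
        }))
    return pd.concat(frames, ignore_index=True)

toy = pd.DataFrame({
    "Color": ["Black", "Red", "Black"],
    "Model": ["Black", "Celica", "TL"],  # 'Black' is an invented collision
})
nodes = build_feature_nodes(toy, ["Color", "Model"])
print(nodes)
```

With the prefix, `Color:Black` and `Model:Black` remain separate nodes while Gephi can still display the plain `Label`.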


In [None]:
import pandas as pd
from google.colab import files

# Load the cleaned dataset (already contains 'Price Range' column)
df = pd.read_csv('Car_Sales_Report.csv')

# Define the product feature columns to be connected to each Car_id
feature_columns = ['Company', 'Model', 'Color', 'Transmission', 'Engine', 'Body Style', 'Price Range']

# STEP 1: Build the edge list
edges = []
for idx, row in df.iterrows():
    car_id = row['Car_id']
    for col in feature_columns:
        feature_value = row[col]
        if pd.notnull(feature_value):
            edges.append({
                'Source': car_id,
                'Target': feature_value,
                'Weight': 1
            })

# STEP 2: Convert to DataFrame
edges_df = pd.DataFrame(edges)

# STEP 3: Save to CSV
edges_df.to_csv('edges.csv', index=False)

# STEP 4: Download the file in Colab
files.download('edges.csv')
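
The row-by-row loop above is easy to follow but slow on roughly 24K rows. As a hedged alternative, the same Source/Target/Weight table can be built with `DataFrame.melt`. The sketch uses a toy frame with invented values. The resulting rows are grouped by feature column rather than by car, which should not matter for an undirected edge import.

```python
import pandas as pd

toy = pd.DataFrame({
    "Car_id": ["C_1", "C_2"],
    "Company": ["Ford", "Dodge"],
    "Color": ["Black", None],  # missing value: no edge should be created
})
feature_columns = ["Company", "Color"]

edges_df = (
    toy.melt(id_vars="Car_id", value_vars=feature_columns, value_name="Target")
       .dropna(subset=["Target"])            # skip missing feature values
       .rename(columns={"Car_id": "Source"})
       .assign(Weight=1)[["Source", "Target", "Weight"]]
       .reset_index(drop=True)
)
print(edges_df)
```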


In [None]:
import pandas as pd

# Load nodes and edges
nodes = pd.read_csv('nodes.csv')
edges = pd.read_csv('edges.csv')

# Total counts
num_nodes = nodes.shape[0]
num_edges = edges.shape[0]

# Count of nodes by type
node_type_counts = nodes['Type'].value_counts()

# Display results
print("Network Summary\n")

print(f"Total number of nodes: {num_nodes}")
print(f"Total number of edges: {num_edges}\n")

print("Node breakdown by type:")
for node_type, count in node_type_counts.items():
    print(f"  • {node_type}: {count}")

Network Summary

Total number of nodes: 24120
Total number of edges: 167342

Node breakdown by type:
  • Car_id: 23906
  • Model: 154
  • Company: 30
  • Price Range: 18
  • Body Style: 5
  • Color: 3
  • Engine: 2
  • Transmission: 2
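
As a first look at the key question before opening Gephi, feature nodes can be ranked by degree, which in this car-to-feature graph is simply the number of cars sold with that feature value. The sketch below uses toy nodes and edges with invented values. The notebook does not state which Gephi metric is used, so degree is an assumption here, and it measures popularity among sold cars rather than causal influence.

```python
import pandas as pd

# Toy stand-ins for nodes.csv and edges.csv
nodes = pd.DataFrame({
    "Id": ["C_1", "C_2", "C_3", "Auto", "Manual", "SUV"],
    "Type": ["Car_id", "Car_id", "Car_id", "Transmission", "Transmission", "Body Style"],
})
edges = pd.DataFrame({
    "Source": ["C_1", "C_1", "C_2", "C_2", "C_3"],
    "Target": ["Auto", "SUV", "Auto", "SUV", "Manual"],
})

# Degree of each feature node = number of edges pointing at it
degree = edges["Target"].value_counts().rename_axis("Id").reset_index(name="Degree")

# Attach the feature type and express the degree as a share of all cars
n_cars = (nodes["Type"] == "Car_id").sum()
ranked = degree.merge(nodes, on="Id").sort_values("Degree", ascending=False)
ranked["Share"] = ranked["Degree"] / n_cars
print(ranked)
```

Comparing `Share` within one `Type` (for example `Auto` versus `Manual`) is more meaningful than comparing across types, since every car has exactly one value per feature.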
