# Exploratory Data Analysis (EDA) for Customer Segmentation

#### Description: 
TechElectro Inc. is a prominent electronics retailer with a widespread customer base. They are keen on gaining deeper insights into their customers' preferences and behaviors to optimize their marketing strategies and enhance customer satisfaction. Glowingsoft Technologies has been selected to conduct an exploratory data analysis (EDA) project that will help TechElectro Inc. discover meaningful patterns and segment their customers based on their characteristics.

### Client Name: TechElectro Inc.
### Company Name: Glowingsoft Technologies
### <span style="color:blue;">Data Scientist: Tausif Ul Rahman</span>
### Contact Now: cwtausif@gmail.com

 

### Steps:

Data Collection: Obtain the dataset from TechElectro Inc., containing customer information, purchase history, demographics, and preferences.

Data Cleaning: Clean the dataset by handling missing values, duplicates, and any inconsistencies.

Data Preprocessing: Perform feature scaling, normalization, and encode categorical variables if needed.

Exploratory Data Analysis: Utilize Python libraries (e.g., Pandas, Matplotlib, Seaborn) to visualize and explore the data, uncovering patterns and insights about customer behavior.

Customer Segmentation: Apply clustering algorithms (e.g., K-means) to segment customers based on their buying patterns, demographics, and preferences.
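
The segmentation step can be sketched as follows. This is only a minimal illustration: the toy values, the choice of three clusters, and the `random_state` are assumptions made to keep the example runnable, not results from the TechElectro data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Toy customers (hypothetical values, only to make the sketch runnable)
toy = pd.DataFrame({
    "Age": [25, 27, 45, 48, 62, 65],
    "AnnualIncome": [30000, 32000, 60000, 62000, 90000, 95000],
    "TotalPurchases": [5, 6, 20, 22, 40, 42],
})

# Scale features to [0, 1] so income does not dominate the distance metric
features = MinMaxScaler().fit_transform(toy)

# n_clusters=3 is an assumption; in practice it would be chosen with
# an elbow plot or silhouette scores
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy["Segment"] = kmeans.fit_predict(features)

print(toy)
```

The cluster labels themselves are arbitrary integers; only the grouping of customers carries meaning.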

### Tools: Python, Jupyter Notebook, Pandas, Matplotlib, Seaborn, Scikit-learn

### Deployment:
After completing the EDA and customer segmentation, Glowingsoft Technologies will create an interactive dashboard using Dash or Streamlit. This dashboard will allow TechElectro Inc. to explore the customer segments and access visualizations that reveal customer preferences, helping them optimize marketing campaigns and tailor their offerings to specific customer groups.

### Project Outcome: 
The final deliverables will include a Jupyter Notebook containing the EDA process, a report detailing the insights and identified customer segments, and an interactive dashboard deployed for TechElectro Inc. to use in their business decisions. The project aims to provide information that will enable TechElectro Inc. to improve customer satisfaction and increase sales through targeted marketing strategies. Glowingsoft Technologies will also provide recommendations for further analysis and potential areas of improvement.

Dataset Sample: TechElectro_Customer_Data.csv

| CustomerID | Age | Gender | MaritalStatus | AnnualIncome (USD) | TotalPurchases | PreferredCategory |
|------------|-----|--------|---------------|-------------------|----------------|-------------------|
| 1001       | 33  | Male   | Married       | 65000             | 18             | Electronics       |
| 1002       | 28  | Female | Single        | 45000             | 15             | Appliances        |
| 1003       | 42  | Male   | Single        | 55000             | 20             | Electronics       |
| 1004       | 51  | Female | Married       | 80000             | 12             | Electronics       |
| 1005       | 37  | Male   | Divorced      | 58000             | 10             | Appliances        |

### Description of the columns:

CustomerID: Unique identifier for each customer.

Age: Age of the customer.

Gender: Gender of the customer (Male/Female).

MaritalStatus: Marital status of the customer (Married/Single/Divorced).

AnnualIncome: Annual income of the customer in USD.

TotalPurchases: Total number of purchases made by the customer.

PreferredCategory: The category of products the customer prefers (e.g., Electronics, Appliances).

(Note: The dataset should contain a total of 500 customers.)
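
The column descriptions above can be turned into a simple sanity check. The sketch below is an assumption about how such a check might look; `check_schema` is a hypothetical helper, not part of pandas or of the original notebook.

```python
import pandas as pd

# Expected schema, taken from the column descriptions above
EXPECTED_COLUMNS = [
    "CustomerID", "Age", "Gender", "MaritalStatus",
    "AnnualIncome", "TotalPurchases", "PreferredCategory",
]

def check_schema(df, expected_rows=None):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "CustomerID" in df.columns and not df["CustomerID"].is_unique:
        problems.append("CustomerID is not unique")
    if expected_rows is not None and len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems

# The first two rows of the sample table
sample = pd.DataFrame({
    "CustomerID": [1001, 1002],
    "Age": [33, 28],
    "Gender": ["Male", "Female"],
    "MaritalStatus": ["Married", "Single"],
    "AnnualIncome": [65000, 45000],
    "TotalPurchases": [18, 15],
    "PreferredCategory": ["Electronics", "Appliances"],
})
print(check_schema(sample))                     # no problems for the sample rows
print(check_schema(sample, expected_rows=500))  # flags the row-count mismatch
```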


### Step: Import Libraries

In [1]:
import pandas as pd
import random
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

### Step: Create CSV File

In [2]:
# create an empty Data Frame
df = pd.DataFrame()

# File path for csv
file_path = "TechElectro_Customer_Data.csv"

# Create CSV
df.to_csv(file_path,index=False)

### Step:  Read CSV File

In [3]:
try:
    df = pd.read_csv(file_path)
    # Display the DataFrame
    print(df)
except pd.errors.EmptyDataError:
    print("The dataset is empty or has no columns")
except FileNotFoundError:
    print(f"File '{file_path}' not found")
except Exception as e:
    print(f"An unexpected error occurred: '{e}'")

The dataset is empty or has no columns
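
The message above appears because `pd.read_csv` raises `pd.errors.EmptyDataError` when a file has no header row at all. A header-only file, by contrast, loads as an empty DataFrame. A minimal sketch of the difference, using in-memory buffers instead of files; `load_customers` is a hypothetical helper name:

```python
import io
import pandas as pd

def load_customers(source):
    """Read a CSV and return a DataFrame, or None if it has no columns."""
    try:
        return pd.read_csv(source)
    except pd.errors.EmptyDataError:
        # Raised when the input contains no header row at all
        return None

# In-memory buffers stand in for files on disk
empty = load_customers(io.StringIO(""))
header_only = load_customers(io.StringIO("CustomerID,Age\n"))

print(empty)                      # None
print(list(header_only.columns))  # ['CustomerID', 'Age']
print(len(header_only))           # 0
```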


### Step: Create Columns

In [4]:
# Define column names list for the csv
columns = ["CustomerID","Age","Gender","Marital Status","AnnualIncome","TotalPurchases"]

# Create an empty DataFrame with the specified columns
df = pd.DataFrame(columns=columns)
df.to_csv(file_path,index=False)
print(f"Empty csv with columns\n\n'{','.join(columns)}'\n\nhas been created successfully")
print(df)

Empty csv with columns

'CustomerID,Age,Gender,Marital Status,AnnualIncome,TotalPurchases'

has been created successfully
Empty DataFrame
Columns: [CustomerID, Age, Gender, Marital Status, AnnualIncome, TotalPurchases]
Index: []


### Step: Add Another Column

In [5]:
# Add another column, PreferredCategory
df = pd.read_csv(file_path)

df["PreferredCategory"] = None
df.to_csv(file_path,index=False)

print(f"A new column PreferredCategory added to csv successfully")
df

A new column PreferredCategory added to csv successfully


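|  | CustomerID | Age | Gender | Marital Status | AnnualIncome | TotalPurchases | PreferredCategory |
|--|------------|-----|--------|----------------|--------------|----------------|-------------------|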


### Step: Remove space from column names

In [6]:
# Read CSV File
df = pd.read_csv(file_path)

# Remove Space
df.columns = df.columns.str.replace(' ','')
df.to_csv(file_path,index=False)

# Verify
df = pd.read_csv(file_path)
df.columns

Index(['CustomerID', 'Age', 'Gender', 'MaritalStatus', 'AnnualIncome',
       'TotalPurchases', 'PreferredCategory'],
      dtype='object')

### Step: Data Creation

#### 1. CustomerID

In [7]:
# Create 500 customer IDs, from 1001 to 1500 inclusive
customerIds = list(range(1001, 1501))

print(customerIds[495:]) # Last Few
len(customerIds)

[1496, 1497, 1498, 1499, 1500]


500

#### 2. Age

In [8]:
# Create 500 ages of the people ranging from 20-75 using list comprehension

ages = [random.randint(20,75) for _ in range(500)]

## Print last few
print(ages[495:])
## Total Verify
len(ages)

[50, 52, 30, 52, 73]


500

#### 3. Gender

In [9]:
# Create 500 Gender data
genders_choice = ["Male","Female"]
genders = [random.choice(genders_choice) for _ in range(500)]

# Print few
print(genders[:4])
# Verify length
len(genders)

['Male', 'Male', 'Female', 'Female']


500

#### 4. MaritalStatus

In [10]:
#MaritalStatus
marital_status = ["Married","Single","Divorced"]
maritalStatuses = [random.choice(marital_status) for _ in range(500)]

# Print few
print(maritalStatuses[:10])
len(maritalStatuses)

['Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Single', 'Divorced', 'Single', 'Single']


500

#### 5. AnnualIncome

In [11]:
# AnnualIncome
annualIncome = [random.randint(100000,1000000) for _ in range(500)]

# Print few
print(annualIncome[:9])
# Verify length
len(annualIncome)

[907420, 821270, 265107, 156757, 526468, 938413, 251092, 198980, 311026]


500

#### 6. TotalPurchases

In [12]:
# TotalPurchases
totalPurchases = [random.randint(1,50) for _ in range(500)]

# Print few
print(totalPurchases[:5])
# Verify length
len(totalPurchases)

[21, 24, 9, 49, 10]


500

#### 7. PreferredCategory

In [13]:
#PreferredCategory
preferredcategory_choices = ["Electronics","Appliances"]
preferredCategory = [random.choice(preferredcategory_choices) for _ in range(500)]

# Print few
print(preferredCategory[:15])
len(preferredCategory)

['Appliances', 'Appliances', 'Electronics', 'Electronics', 'Appliances', 'Electronics', 'Appliances', 'Appliances', 'Electronics', 'Appliances', 'Appliances', 'Electronics', 'Electronics', 'Appliances', 'Electronics']


500
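
Because the lists above come from the unseeded global `random` module, every run of the notebook produces different data. One way to make the generation repeatable is a private, seeded generator. The sketch below assumes a seed of 42 and a hypothetical helper `make_customers`; the value ranges are the ones used in the cells above.

```python
import random

def make_customers(n, seed):
    """Generate n synthetic customer records with a private, seeded RNG."""
    rng = random.Random(seed)  # local generator; does not touch global state
    return [
        {
            "CustomerID": 1001 + i,
            "Age": rng.randint(20, 75),
            "Gender": rng.choice(["Male", "Female"]),
            "MaritalStatus": rng.choice(["Married", "Single", "Divorced"]),
            "AnnualIncome": rng.randint(100000, 1000000),
            "TotalPurchases": rng.randint(1, 50),
            "PreferredCategory": rng.choice(["Electronics", "Appliances"]),
        }
        for i in range(n)
    ]

first_run = make_customers(500, seed=42)
second_run = make_customers(500, seed=42)

print(len(first_run))           # 500
print(first_run == second_run)  # True: same seed, same data
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(...)` to obtain the same column layout.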

#### Step: Save Data to csv

In [14]:
# Assemble the generated lists into a DataFrame and overwrite the placeholder CSV

data = {
    'CustomerID':customerIds,
    'Age':ages,
    'Gender':genders,
    'MaritalStatus':maritalStatuses,
    'AnnualIncome':annualIncome,
    'TotalPurchases':totalPurchases,
    'PreferredCategory':preferredCategory
}

df = pd.DataFrame(data)
df.to_csv(file_path,index=False)
df.head(500)

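|     | CustomerID | Age | Gender | MaritalStatus | AnnualIncome | TotalPurchases | PreferredCategory |
|-----|------------|-----|--------|---------------|--------------|----------------|-------------------|
| 0   | 1001 | 32 | Male | Divorced | 907420 | 21 | Appliances |
| 1   | 1002 | 50 | Male | Divorced | 821270 | 24 | Appliances |
| 2   | 1003 | 59 | Female | Divorced | 265107 | 9 | Electronics |
| 3   | 1004 | 68 | Female | Married | 156757 | 49 | Electronics |
| 4   | 1005 | 36 | Female | Married | 526468 | 10 | Appliances |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 1496 | 50 | Male | Single | 363556 | 37 | Appliances |
| 496 | 1497 | 52 | Female | Divorced | 604517 | 23 | Appliances |
| 497 | 1498 | 30 | Female | Divorced | 210840 | 7 | Appliances |
| 498 | 1499 | 52 | Male | Single | 404554 | 2 | Appliances |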


### Data Cleaning: Handle missing values using Pandas, remove duplicates, and resolve inconsistencies.

#### Step: Missing Values

In [15]:
df.isnull()

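|     | CustomerID | Age | Gender | MaritalStatus | AnnualIncome | TotalPurchases | PreferredCategory |
|-----|------------|-----|--------|---------------|--------------|----------------|-------------------|
| 0   | False | False | False | False | False | False | False |
| 1   | False | False | False | False | False | False | False |
| 2   | False | False | False | False | False | False | False |
| 3   | False | False | False | False | False | False | False |
| 4   | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | False | False | False | False | False | False | False |
| 496 | False | False | False | False | False | False | False |
| 497 | False | False | False | False | False | False | False |
| 498 | False | False | False | False | False | False | False |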


In [16]:
# Check total null values
df.isnull().sum()

CustomerID           0
Age                  0
Gender               0
MaritalStatus        0
AnnualIncome         0
TotalPurchases       0
PreferredCategory    0
dtype: int64
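
The synthetic dataset has no missing values, so there is nothing to fix here. For real data, one common approach is to fill numeric gaps with the median and categorical gaps with the most frequent value. The sketch below uses a small hypothetical frame with deliberately missing entries; the imputation strategy is an assumption, not a requirement of the project.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (hypothetical values)
toy = pd.DataFrame({
    "Age": [33, np.nan, 42, 51],
    "AnnualIncome": [65000, 45000, np.nan, 80000],
    "PreferredCategory": ["Electronics", None, "Electronics", "Appliances"],
})

cleaned = toy.copy()
# Numeric columns: fill gaps with the column median
for col in ["Age", "AnnualIncome"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())
# Categorical column: fill gaps with the most frequent value
most_common = cleaned["PreferredCategory"].mode()[0]
cleaned["PreferredCategory"] = cleaned["PreferredCategory"].fillna(most_common)

print(toy.isnull().sum())
print(cleaned)
```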

#### Step: Find duplicates

In [17]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
495    False
496    False
497    False
498    False
499    False
Length: 500, dtype: bool

In [18]:
# Check Total Duplicates
df.duplicated().sum()

0
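
No duplicates were found in the generated data. If there were any, they could be removed with `drop_duplicates`. A minimal sketch on a hypothetical frame in which one customer appears twice, assuming `CustomerID` is the key that should be unique:

```python
import pandas as pd

# Toy frame where customer 1002 appears twice (hypothetical values)
toy = pd.DataFrame({
    "CustomerID": [1001, 1002, 1002, 1003],
    "Age": [33, 28, 28, 42],
})

print(toy.duplicated().sum())  # count of fully duplicated rows

# Keep the first occurrence of each CustomerID and renumber the index
deduped = toy.drop_duplicates(subset="CustomerID", keep="first").reset_index(drop=True)
print(deduped)
```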

In [19]:
df.columns

Index(['CustomerID', 'Age', 'Gender', 'MaritalStatus', 'AnnualIncome',
       'TotalPurchases', 'PreferredCategory'],
      dtype='object')

In [20]:
df.dtypes

CustomerID            int64
Age                   int64
Gender               object
MaritalStatus        object
AnnualIncome          int64
TotalPurchases        int64
PreferredCategory    object
dtype: object
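
Resolving inconsistencies usually means normalizing text columns so that the same label is always spelled the same way. The generated data is already consistent, so the sketch below uses a hypothetical column with stray whitespace and mixed capitalization.

```python
import pandas as pd

# Toy column with inconsistent spellings of the same labels (hypothetical)
toy = pd.DataFrame({"Gender": [" male", "Female", "FEMALE ", "Male"]})

# Trim whitespace and normalize capitalization
toy["Gender"] = toy["Gender"].str.strip().str.title()

# Flag anything outside the documented set of values
allowed = {"Male", "Female"}
unexpected = toy.loc[~toy["Gender"].isin(allowed), "Gender"]

print(toy["Gender"].tolist())
print(len(unexpected))
```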

### Data Preprocessing: Encode categorical variables with Scikit-learn, perform feature scaling using MinMaxScaler.

#### Step:  Encode categorical variables

In [21]:
df.head(10)

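|   | CustomerID | Age | Gender | MaritalStatus | AnnualIncome | TotalPurchases | PreferredCategory |
|---|------------|-----|--------|---------------|--------------|----------------|-------------------|
| 0 | 1001 | 32 | Male | Divorced | 907420 | 21 | Appliances |
| 1 | 1002 | 50 | Male | Divorced | 821270 | 24 | Appliances |
| 2 | 1003 | 59 | Female | Divorced | 265107 | 9 | Electronics |
| 3 | 1004 | 68 | Female | Married | 156757 | 49 | Electronics |
| 4 | 1005 | 36 | Female | Married | 526468 | 10 | Appliances |
| 5 | 1006 | 50 | Male | Divorced | 938413 | 26 | Electronics |
| 6 | 1007 | 68 | Male | Single | 251092 | 34 | Appliances |
| 7 | 1008 | 20 | Female | Divorced | 198980 | 17 | Appliances |
| 8 | 1009 | 40 | Female | Single | 311026 | 18 | Electronics |
| 9 | 1010 | 75 | Male | Single | 630257 | 23 | Appliances |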


#### Step: Gender Encoding

In [22]:
label_encoder = LabelEncoder()
label_encoder.fit_transform(df['Gender'])

array([1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,

In [23]:
# Store the encoded Gender values in a new column
df["Gender_Encoded"] = label_encoder.fit_transform(df['Gender'])
df.head()

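|   | CustomerID | Age | Gender | MaritalStatus | AnnualIncome | TotalPurchases | PreferredCategory | Gender_Encoded |
|---|------------|-----|--------|---------------|--------------|----------------|-------------------|----------------|
| 0 | 1001 | 32 | Male | Divorced | 907420 | 21 | Appliances | 1 |
| 1 | 1002 | 50 | Male | Divorced | 821270 | 24 | Appliances | 1 |
| 2 | 1003 | 59 | Female | Divorced | 265107 | 9 | Electronics | 0 |
| 3 | 1004 | 68 | Female | Married | 156757 | 49 | Electronics | 0 |
| 4 | 1005 | 36 | Female | Married | 526468 | 10 | Appliances | 0 |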
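
`LabelEncoder` assigns codes in sorted label order, which is why `Female` becomes 0 and `Male` becomes 1 above. The sketch below, using the first five gender values from the table, shows how to inspect that mapping and reverse it. Note that calling `fit_transform` again on another column refits the encoder and replaces `classes_`, so a separate encoder per column is needed if the codes must be decoded later.

```python
from sklearn.preprocessing import LabelEncoder

genders = ["Male", "Male", "Female", "Female", "Female"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ is sorted, so the index of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```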


In [24]:
# Store the encoded MaritalStatus values in a new column
df["MaritalStatus_Encoded"] = label_encoder.fit_transform(df['MaritalStatus'])
df.head()

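|   | CustomerID | Age | Gender | MaritalStatus | AnnualIncome | TotalPurchases | PreferredCategory | Gender_Encoded | MaritalStatus_Encoded |
|---|------------|-----|--------|---------------|--------------|----------------|-------------------|----------------|-----------------------|
| 0 | 1001 | 32 | Male | Divorced | 907420 | 21 | Appliances | 1 | 0 |
| 1 | 1002 | 50 | Male | Divorced | 821270 | 24 | Appliances | 1 | 0 |
| 2 | 1003 | 59 | Female | Divorced | 265107 | 9 | Electronics | 0 | 0 |
| 3 | 1004 | 68 | Female | Married | 156757 | 49 | Electronics | 0 | 1 |
| 4 | 1005 | 36 | Female | Married | 526468 | 10 | Appliances | 0 | 1 |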


In [25]:
# Store the encoded PreferredCategory values in a new column
df["PreferredCategory_Encoded"] = label_encoder.fit_transform(df['PreferredCategory'])
df.head()

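|   | CustomerID | Age | Gender | MaritalStatus | AnnualIncome | TotalPurchases | PreferredCategory | Gender_Encoded | MaritalStatus_Encoded | PreferredCategory_Encoded |
|---|------------|-----|--------|---------------|--------------|----------------|-------------------|----------------|-----------------------|---------------------------|
| 0 | 1001 | 32 | Male | Divorced | 907420 | 21 | Appliances | 1 | 0 | 0 |
| 1 | 1002 | 50 | Male | Divorced | 821270 | 24 | Appliances | 1 | 0 | 0 |
| 2 | 1003 | 59 | Female | Divorced | 265107 | 9 | Electronics | 0 | 0 | 1 |
| 3 | 1004 | 68 | Female | Married | 156757 | 49 | Electronics | 0 | 1 | 1 |
| 4 | 1005 | 36 | Female | Married | 526468 | 10 | Appliances | 0 | 1 | 0 |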
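
Label encoding maps `MaritalStatus` to 0, 1, and 2, which implies an ordering (Divorced < Married < Single) that has no real meaning and can distort distance-based methods such as K-means. One-hot encoding is a common alternative. The sketch below uses pandas' `get_dummies` on a hypothetical four-row column; it is an optional variation, not the approach taken in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"MaritalStatus": ["Divorced", "Married", "Single", "Married"]})

# One indicator column per category instead of a single 0/1/2 code
one_hot = pd.get_dummies(toy["MaritalStatus"], prefix="MaritalStatus", dtype=int)
print(one_hot)
```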


#### Perform feature scaling using MinMaxScaler

In [26]:
df.head()

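|   | CustomerID | Age | Gender | MaritalStatus | AnnualIncome | TotalPurchases | PreferredCategory | Gender_Encoded | MaritalStatus_Encoded | PreferredCategory_Encoded |
|---|------------|-----|--------|---------------|--------------|----------------|-------------------|----------------|-----------------------|---------------------------|
| 0 | 1001 | 32 | Male | Divorced | 907420 | 21 | Appliances | 1 | 0 | 0 |
| 1 | 1002 | 50 | Male | Divorced | 821270 | 24 | Appliances | 1 | 0 | 0 |
| 2 | 1003 | 59 | Female | Divorced | 265107 | 9 | Electronics | 0 | 0 | 1 |
| 3 | 1004 | 68 | Female | Married | 156757 | 49 | Electronics | 0 | 1 | 1 |
| 4 | 1005 | 36 | Female | Married | 526468 | 10 | Appliances | 0 | 1 | 0 |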
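
Min-max scaling matters most for columns whose ranges differ widely, such as `Age` (tens) and `AnnualIncome` (hundreds of thousands). The sketch below applies `MinMaxScaler` to the raw numeric columns of the first five rows shown above and leaves `CustomerID` untouched, since it is an identifier. The choice of which columns to scale is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# First five rows of the generated data shown above
toy = pd.DataFrame({
    "CustomerID": [1001, 1002, 1003, 1004, 1005],
    "Age": [32, 50, 59, 68, 36],
    "AnnualIncome": [907420, 821270, 265107, 156757, 526468],
    "TotalPurchases": [21, 24, 9, 49, 10],
})

# Scale only the measured quantities; CustomerID is an identifier
to_scale = ["Age", "AnnualIncome", "TotalPurchases"]
scaled_values = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[to_scale]),
    columns=to_scale,
    index=toy.index,
)

# Reattach the identifier next to the scaled features
scaled = pd.concat([toy[["CustomerID"]], scaled_values], axis=1)
print(scaled)
```

Each scaled column now spans 0 to 1, computed as `(x - min) / (max - min)` per column.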


In [27]:
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Define the two column groups. Despite the variable names, `numerical_columns`
# holds the label-encoded categorical features, and `categorical_columns` holds
# the identifier and the raw numeric features.
numerical_columns = ['Gender_Encoded', 'MaritalStatus_Encoded', 'PreferredCategory_Encoded']
categorical_columns = ['CustomerID','Age','AnnualIncome','TotalPurchases']

# Scale only the label-encoded columns to the [0, 1] range
scaled_features = scaler.fit_transform(df[numerical_columns])

# Create a new DataFrame with the scaled encoded features
scaled_df = pd.DataFrame(scaled_features, columns=numerical_columns)

# Concatenate the scaled features with the remaining, unscaled columns
scaled_df = pd.concat([scaled_df, df[categorical_columns]], axis=1)

print("Original DataFrame:")
print(df)
print("\nScaled DataFrame:")
print(scaled_df)

Original DataFrame:
     CustomerID  Age  Gender MaritalStatus  AnnualIncome  TotalPurchases  \
0          1001   32    Male      Divorced        907420              21   
1          1002   50    Male      Divorced        821270              24   
2          1003   59  Female      Divorced        265107               9   
3          1004   68  Female       Married        156757              49   
4          1005   36  Female       Married        526468              10   
..          ...  ...     ...           ...           ...             ...   
495        1496   50    Male        Single        363556              37   
496        1497   52  Female      Divorced        604517              23   
497        1498   30  Female      Divorced        210840               7   
498        1499   52    Male        Single        404554               2   
499        1500   73  Female       Married        586821               1   

    PreferredCategory  Gender_Encoded  MaritalStatus_Encoded  \
0  