### Project 2: Customer Segmentation Data Prep
#### Objective: Prepare a customer dataset for clustering by handling missing values and performing one-hot encoding on categorical data.

##### Step 1: Import pandas library

In [1]:
import pandas as pd

##### Step 2: Create a sample customer dataset

In [2]:
data = {
    'Customer_ID': [1, 2, 3, 4, 5],
    'Customer_Type': ['New', 'Regular', None, 'Regular', 'New'],  # Categorical with a missing value
    'Age': [25, None, 40, 35, None]  # Numerical with some missing values
}

In [3]:
# Create DataFrame
df = pd.DataFrame(data)

print("🧾 Original Customer Data:")
print(df)

🧾 Original Customer Data:
   Customer_ID Customer_Type   Age
0            1           New  25.0
1            2       Regular   NaN
2            3          None  40.0
3            4       Regular  35.0
4            5           New   NaN


##### Step 3: Impute Missing Numerical Data

In [4]:
# Fill missing 'Age' values with the median age
median_age = df['Age'].median()
df.fillna({'Age': median_age}, inplace=True)

print("\nAfter Filling Missing 'Age' Values with Median:")
print(df)


After Filling Missing 'Age' Values with Median:
   Customer_ID Customer_Type   Age
0            1           New  25.0
1            2       Regular  35.0
2            3          None  40.0
3            4       Regular  35.0
4            5           New  35.0
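
The median used above is computed only from the non-missing ages (25, 40, 35), because pandas skips NaN by default. The sketch below checks this and counts the missing values per column. It rebuilds the sample data so it runs on its own; `sample` and `missing_counts` are illustrative names that do not appear in the original cells.

```python
import pandas as pd

# Rebuild the sample data so this block is self-contained
sample = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 4, 5],
    'Customer_Type': ['New', 'Regular', None, 'Regular', 'New'],
    'Age': [25, None, 40, 35, None],
})

# Count missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)

# median() skips NaN by default, so only 25, 40 and 35 are used
median_age = sample['Age'].median()
print(median_age)
```

The median is often preferred over the mean for a column like age because it is less sensitive to a few extreme values.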


##### Step 4: Analyzing and Filling Missing Categorical Data

In [5]:
# Fill missing 'Customer_Type' with a new category 'Unknown'
df.fillna({'Customer_Type': 'Unknown'}, inplace=True)

print("\nAfter Filling Missing 'Customer_Type' with 'Unknown':")
print(df)


After Filling Missing 'Customer_Type' with 'Unknown':
   Customer_ID Customer_Type   Age
0            1           New  25.0
1            2       Regular  35.0
2            3       Unknown  40.0
3            4       Regular  35.0
4            5           New  35.0
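
One way to analyze the column is to tally its categories before and after filling. This is a minimal sketch that assumes the same sample values; `types`, `filled`, `counts_before`, and `counts_after` are illustrative names.

```python
import pandas as pd

# Same 'Customer_Type' values as the sample dataset
types = pd.Series(['New', 'Regular', None, 'Regular', 'New'], name='Customer_Type')

# dropna=False keeps the missing entry in the tally
counts_before = types.value_counts(dropna=False)
print(counts_before)

# Replace the missing entry with an explicit 'Unknown' category
filled = types.fillna('Unknown')
counts_after = filled.value_counts()
print(counts_after)
```

Using a separate 'Unknown' category keeps the fact that the value was missing visible to later steps. Filling with the most frequent category would hide it.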


##### Step 5: One-Hot Encode the 'Customer_Type' Column

In [6]:
# Convert categorical column into multiple binary columns
df_encoded = pd.get_dummies(df, columns=['Customer_Type'], prefix='Type')

print("\nFinal Cleaned and Encoded DataFrame (Ready for Machine Learning):")
print(df_encoded)

# One-hot encoding converts the text values in 'Customer_Type' into separate binary columns, one per category.
# Most machine learning models need numeric input and cannot use labels like "New" or "Regular" directly.
# Each row gets True (1) in the column that matches the customer's type and False (0) in the others.



Final Cleaned and Encoded DataFrame (Ready for Machine Learning):
   Customer_ID   Age  Type_New  Type_Regular  Type_Unknown
0            1  25.0      True         False         False
1            2  35.0     False          True         False
2            3  40.0     False         False          True
3            4  35.0     False          True         False
4            5  35.0      True         False         False
