# 1. Data Loading and Initial Exploration

In this step, we load the marketing campaign dataset into a pandas DataFrame. Although the file has a `.csv` extension, its fields are tab-separated, so we pass `sep='\t'` to `pd.read_csv()`. We then inspect the first few rows to get a general sense of the data's structure and content.

We then check each column's data type and non-null count with the `info()` method. This shows which features are numerical and which are categorical, and points to data cleaning steps that may be required.

Lastly, we check for missing values using `isnull().sum()`, which counts the null entries in each column. Identifying missing data early lets us decide whether to impute or remove incomplete entries in the next stages.

### Column Descriptions

- **`ID`**: Unique identifier for each customer in the dataset.
- **`Year_Birth`**: The year the customer was born. This can be used to calculate the age of the customer, which is important for demographic analysis.
- **`Education`**: The highest level of education the customer has achieved (e.g., Graduation, PhD). This is a categorical variable that can affect consumer behavior.
- **`Marital_Status`**: The marital status of the customer (e.g., Single, Married, Together). This helps understand household composition and spending habits.
- **`Income`**: The annual income of the customer. Higher income levels may correlate with higher spending on luxury goods.
- **`Kidhome`**: The number of small children (under 12) living in the customer's home. This might influence certain types of purchases, such as toys or food.
- **`Teenhome`**: The number of teenagers (aged 12 to 18) living in the customer's home. Teenagers might impact the type and frequency of purchases.
- **`Dt_Customer`**: The date the customer enrolled with the company. This can help measure customer loyalty.
- **`Recency`**: The number of days since the customer last made a purchase. This is important for understanding customer engagement.
- **`MntWines`**: The amount of money the customer spent on wine in the last two years.
- **`MntFruits`**: The amount of money the customer spent on fruits in the last two years.
- **`MntMeatProducts`**: The amount of money the customer spent on meat products in the last two years.
- **`MntFishProducts`**: The amount of money the customer spent on fish products in the last two years.
- **`MntSweetProducts`**: The amount of money the customer spent on sweets in the last two years.
- **`MntGoldProds`**: The amount of money the customer spent on gold products in the last two years.
- **`NumDealsPurchases`**: The number of purchases made using a discount or deal.
- **`NumWebPurchases`**: The number of purchases made through the company's website.
- **`NumCatalogPurchases`**: The number of purchases made using a catalog.
- **`NumStorePurchases`**: The number of purchases made in-store.
- **`NumWebVisitsMonth`**: The number of times the customer visited the company's website in the last month.
- **`AcceptedCmp1`, `AcceptedCmp2`, `AcceptedCmp3`, `AcceptedCmp4`, `AcceptedCmp5`**: Binary columns indicating whether the customer accepted marketing campaigns 1 through 5.
- **`Complain`**: Whether the customer has ever filed a complaint with the company (binary: 1 for yes, 0 for no).
- **`Z_CostContact`**: A feature with constant values, possibly related to customer contact cost.
- **`Z_Revenue`**: Another feature with constant values, likely related to revenue estimation.
- **`Response`**: Whether the customer responded positively to the last marketing campaign (binary: 1 for yes, 0 for no).
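
The descriptions above hint at a few derived features, such as age from `Year_Birth` or total spending from the `Mnt*` columns. The following is a minimal sketch on a tiny hand-made frame. The sample values, the reference year, and the names `Age`, `Total_Spend`, and `Children` are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

# Tiny hand-made sample; values are illustrative, not the full dataset
sample = pd.DataFrame({
    "Year_Birth": [1957, 1984],
    "Kidhome": [0, 1],
    "Teenhome": [0, 0],
    "MntWines": [635, 11],
    "MntFruits": [88, 4],
})

REFERENCE_YEAR = 2024  # assumption: the same reference year used for Customer_Since

# Age derived from the birth year
sample["Age"] = REFERENCE_YEAR - sample["Year_Birth"]

# Total spending over every column whose name starts with "Mnt"
spend_cols = [c for c in sample.columns if c.startswith("Mnt")]
sample["Total_Spend"] = sample[spend_cols].sum(axis=1)

# Number of children and teenagers living at home
sample["Children"] = sample["Kidhome"] + sample["Teenhome"]

print(sample[["Age", "Total_Spend", "Children"]])
```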


In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the dataset
file_path = 'data/marketing_campaign.csv'
data = pd.read_csv(file_path, sep='\t')

In [3]:
# Preview the first few rows of the dataset
data.head()

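| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |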


In [4]:
# Check the structure of the dataset (columns, data types, etc.)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

In [5]:
# Check for missing values in the dataset
data.isnull().sum()

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64
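
In the output above, only `Income` has gaps: 24 of 2240 rows, or roughly 1.07%. Expressing missing values as a share of rows can make the drop-or-impute decision easier. Below is a minimal sketch on a toy frame; the frame `toy` and the name `summary` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Recency": [58, 38, 26, 26],
})

missing_count = toy.isnull().sum()
# The mean of a boolean mask is the fraction of True values, i.e. the share of missing rows
missing_percent = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_percent})

# Keep only the columns that actually contain gaps
summary = summary[summary["missing"] > 0]
print(summary)
```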

# 2. Data Preprocessing

In this step, we preprocess the dataset to ensure it is clean and ready for analysis. The preprocessing involves several tasks such as handling missing values, encoding categorical variables, standardizing numerical data, and ensuring consistency in data types.

1. **Handling Missing Values**:
   - First, we remove rows with missing values in the `Income` column, as this feature is critical for analysis. Using the `dropna()` function, we remove any rows that contain a null value in the `Income` column.
   
2. **One-Hot Encoding of Categorical Variables**:
   - We apply one-hot encoding to the `Education` and `Marital_Status` columns to convert categorical data into numerical format. This results in binary columns for each category, and to avoid multicollinearity, we drop the first category using the `drop_first=True` argument.
   
3. **Handling Date Columns**:
   - The `Dt_Customer` column is converted into a `datetime` format using the `pd.to_datetime()` function with `dayfirst=True`, since the dates are stored as day-month-year. Then, a new feature `Customer_Since` is created as the difference between 2024 (the reference year used here) and the enrollment year, which approximates the number of years each customer has been with the company. After extracting this information, the `Dt_Customer` column is dropped.
   
4. **Standardizing Numerical Data**:
   - Numerical columns such as `Income`, `Kidhome`, `Teenhome`, `Recency`, and the product spending and purchase-count columns are standardized using the `StandardScaler`. Each of these features then has a mean of 0 and a standard deviation of 1, so features measured on very different scales (for example, income versus number of children) contribute comparably to scale-sensitive models.
   
5. **Imputation of Missing Values**:
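   - For any remaining missing values in the numerical columns, we apply mean imputation, which fills each gap with the column’s mean. Because the rows with missing `Income` were already dropped in the first step and no other column had missing values, this step acts as a safeguard and does not change the data here.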

6. **Converting Integer Columns to Floats**:
   - To maintain consistency across numerical data, we convert all `int32` and `int64` columns to `float64`. Note that this also converts the `ID` column and the binary flags such as `Response`, while the one-hot encoded columns remain `bool`. A uniform floating-point type may avoid implicit conversions and dtype-related issues with certain algorithms.

By the end of this step, the dataset is ready for further analysis or model training: missing values have been handled, categorical variables encoded, numerical features standardized, and numerical data types made consistent.


In [6]:
# Remove rows with missing values in the 'Income' column
data = data.dropna(subset=['Income'])
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2216 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2216 non-null   int64  
 1   Year_Birth           2216 non-null   int64  
 2   Education            2216 non-null   object 
 3   Marital_Status       2216 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2216 non-null   int64  
 6   Teenhome             2216 non-null   int64  
 7   Dt_Customer          2216 non-null   object 
 8   Recency              2216 non-null   int64  
 9   MntWines             2216 non-null   int64  
 10  MntFruits            2216 non-null   int64  
 11  MntMeatProducts      2216 non-null   int64  
 12  MntFishProducts      2216 non-null   int64  
 13  MntSweetProducts     2216 non-null   int64  
 14  MntGoldProds         2216 non-null   int64  
 15  NumDealsPurchases    2216 non-null   int64 

In [7]:
# One-hot encode the categorical variables; drop_first=True drops the first category of each
data = pd.get_dummies(data, columns=['Education', 'Marital_Status'], drop_first=True)
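
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column. The frame `toy` and its values are illustrative assumptions. `pd.get_dummies` orders categories alphabetically, so the dropped category is the one that sorts first, and a row belonging to it is encoded as all zeros.

```python
import pandas as pd

# Toy column for illustration only
toy = pd.DataFrame({"Education": ["Graduation", "PhD", "Basic", "PhD"]})

encoded = pd.get_dummies(toy, columns=["Education"], drop_first=True)

# "Basic" sorts first, so it becomes the implicit baseline category
print(encoded.columns.tolist())

# The row that was "Basic" has no indicator set
print(encoded.iloc[2].any())
```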

In [8]:
# Convert 'Dt_Customer' to datetime and extract useful information
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], dayfirst=True)
data['Customer_Since'] = 2024 - data['Dt_Customer'].dt.year  # Calculate how long they have been a customer

# Optional: Drop the original date column if it's no longer needed
data.drop('Dt_Customer', axis=1, inplace=True)
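
As a variation on the cell above, the dates can also be parsed with an explicit format string, and tenure can be measured in days against a fixed reference date. This is only a sketch: the dates come from the preview shown earlier, while `REFERENCE_DATE` and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Dates taken from the first rows of the preview (day-month-year)
raw_dates = pd.Series(["04-09-2012", "08-03-2014", "21-08-2013"])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y")

# Year-based tenure, as computed in the notebook
tenure_years = 2024 - parsed.dt.year

# Finer-grained alternative: days elapsed up to an assumed reference date
REFERENCE_DATE = pd.Timestamp("2024-01-01")  # assumption, not from the notebook
tenure_days = (REFERENCE_DATE - parsed).dt.days

print(pd.DataFrame({"parsed": parsed, "years": tenure_years, "days": tenure_days}))
```

Pinning the reference date, rather than using the current date, keeps the feature reproducible across runs.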

In [9]:
# Standardize numerical columns (e.g., Income, spending data)
numerical_columns = ['Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits', 
                     'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                     'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 
                     'NumStorePurchases', 'NumWebVisitsMonth']

scaler = StandardScaler()
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])
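
`StandardScaler` standardizes each column as (x - mean) / std, where std is the population standard deviation (`ddof=0`). The sketch below reproduces that formula by hand with pandas on a toy frame, as a sanity check one might run; the frame `toy` and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
    "Recency": [10.0, 20.0, 30.0, 40.0],
})

# Same formula StandardScaler applies: subtract the mean, divide by the population std
scaled = (toy - toy.mean()) / toy.std(ddof=0)

col_means = scaled.mean()        # approximately 0 for every column
pop_std = scaled.std(ddof=0)     # 1 for every column
sample_std = scaled.std()        # pandas defaults to ddof=1, so slightly above 1

print(col_means, pop_std, sample_std, sep="\n")
```

When verifying scaled data with pandas, pass `ddof=0` to `std()`; otherwise the result will differ slightly from 1 even though the scaling is correct.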

In [10]:
# Fill missing values in the dataset with the mean value of each column
data.fillna(data.mean(), inplace=True)
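
Since no missing values remain at this point in the notebook, the effect of mean imputation is easier to see on a toy frame that does contain gaps. The frame `toy` is an illustrative assumption; `numeric_only=True` is added so the sketch would also work if non-numeric columns were present.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column; values are illustrative
toy = pd.DataFrame({
    "Income": [10.0, np.nan, 30.0],
    "Kidhome": [0.0, 1.0, np.nan],
})

# Column means are computed while skipping NaN, then used to fill the gaps
column_means = toy.mean(numeric_only=True)
filled = toy.fillna(column_means)

print(filled)
```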

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2216 entries, 0 to 2239
Data columns (total 38 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       2216 non-null   int64  
 1   Year_Birth               2216 non-null   int64  
 2   Income                   2216 non-null   float64
 3   Kidhome                  2216 non-null   float64
 4   Teenhome                 2216 non-null   float64
 5   Recency                  2216 non-null   float64
 6   MntWines                 2216 non-null   float64
 7   MntFruits                2216 non-null   float64
 8   MntMeatProducts          2216 non-null   float64
 9   MntFishProducts          2216 non-null   float64
 10  MntSweetProducts         2216 non-null   float64
 11  MntGoldProds             2216 non-null   float64
 12  NumDealsPurchases        2216 non-null   float64
 13  NumWebPurchases          2216 non-null   float64
 14  NumCatalogPurchases      2216

In [12]:
# Convert int32 and int64 columns to float64 for consistency
data = data.astype({col: 'float64' for col in data.select_dtypes(include=['int32', 'int64']).columns})

# Check data types again
data.dtypes

ID                         float64
Year_Birth                 float64
Income                     float64
Kidhome                    float64
Teenhome                   float64
Recency                    float64
MntWines                   float64
MntFruits                  float64
MntMeatProducts            float64
MntFishProducts            float64
MntSweetProducts           float64
MntGoldProds               float64
NumDealsPurchases          float64
NumWebPurchases            float64
NumCatalogPurchases        float64
NumStorePurchases          float64
NumWebVisitsMonth          float64
AcceptedCmp3               float64
AcceptedCmp4               float64
AcceptedCmp5               float64
AcceptedCmp1               float64
AcceptedCmp2               float64
Complain                   float64
Z_CostContact              float64
Z_Revenue                  float64
Response                   float64
Education_Basic               bool
Education_Graduation          bool
Education_Master    
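
The dtype listing above shows that the conversion touches only integer columns, which is why the one-hot columns stay `bool`. A small sketch of the same selection logic on a toy frame follows; the frame `toy` and its column values are illustrative assumptions.

```python
import pandas as pd

# Toy frame mixing integer, float, and boolean columns
toy = pd.DataFrame({
    "ID": [1, 2],
    "Income": [1.5, 2.5],
    "Education_PhD": [True, False],
})

# Select only integer columns, as in the notebook cell
int_cols = toy.select_dtypes(include=["int32", "int64"]).columns

# Cast those columns to float64; other columns are left untouched
converted = toy.astype({col: "float64" for col in int_cols})

print(converted.dtypes)
```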

# 3. Feature Selection and Correlation Analysis