# **Task 2**

---

## **Building a Machine Learning Model**

This task involves choosing an appropriate algorithm, training the model, and evaluating its performance. The goal is to create a model that can be easily understood and acted on by business stakeholders. 

The challenge lies in selecting a machine learning algorithm and fine-tuning it to accurately predict which customers are at risk of leaving. The model should provide actionable insights that enable the team to develop targeted interventions to retain valuable customers, so the chosen algorithm must balance predictive accuracy with interpretability.

### **Exploratory data analysis**

First, we load all sheets to understand their contents and the statistical properties of the dataset.

In [1]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import data preprocessing libraries
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Ignore Warning
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load Excel file
file_path = "Data/Customer_Churn_Data_Large.xlsx"
xlsx = pd.ExcelFile(file_path)

In [3]:
# Load sheets into separate DataFrames
demographics = xlsx.parse("Customer_Demographics")
transactions = xlsx.parse("Transaction_History")
service = xlsx.parse("Customer_Service")
online = xlsx.parse("Online_Activity")
churn = xlsx.parse("Churn_Status")

Now we can merge all the sheets on the `CustomerID` column. The `Transaction_History` and `Customer_Service` sheets are in **transactional format** (multiple rows per customer), so I first aggregate them using `.groupby('CustomerID')` with functions like _sum_, _count_, and _mean_.

This turns the transactional sheets into **customer-level summaries** (i.e., each row = one customer), making it possible to:

- Merge them safely with `Customer_Demographics`, `Online_Activity`, and `Churn_Status`.

- Do EDA with one observation per customer.

- Fit a proper model. 
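
As a minimal sketch of the aggregation and merge described above, here is the idea on toy data. The column names `AmountSpent` and `ResolutionStatus`, and the derived names `TotalSpent`, `TransactionCount`, and `AvgSpent`, are assumptions for illustration and may not match the actual sheets; `ServiceInteractions` and `ResolutionRate` follow the names used later in this notebook. All toy objects carry a `toy_` prefix so they do not overwrite the real DataFrames.

```python
import pandas as pd

# Toy stand-ins for the real sheets (column names are assumed)
toy_demographics = pd.DataFrame({"CustomerID": [1, 2, 3], "Age": [25, 41, 63]})
toy_transactions = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "AmountSpent": [10.0, 30.0, 50.0, 5.0, 15.0, 40.0],
})
toy_service = pd.DataFrame({
    "CustomerID": [1, 1, 3],
    "ResolutionStatus": ["Resolved", "Unresolved", "Resolved"],
})
toy_churn = pd.DataFrame({"CustomerID": [1, 2, 3], "ChurnStatus": [0, 1, 0]})

# One row per customer: total, count, and mean of transaction amounts
toy_txn_summary = (
    toy_transactions.groupby("CustomerID")["AmountSpent"]
    .agg(TotalSpent="sum", TransactionCount="count", AvgSpent="mean")
    .reset_index()
)

# One row per customer: number of interactions and share that were resolved
toy_service_summary = (
    toy_service.assign(Resolved=toy_service["ResolutionStatus"].eq("Resolved"))
    .groupby("CustomerID")
    .agg(ServiceInteractions=("Resolved", "count"), ResolutionRate=("Resolved", "mean"))
    .reset_index()
)

# Left joins keep every customer, including those with no service records
toy_df = (
    toy_demographics
    .merge(toy_txn_summary, on="CustomerID", how="left")
    .merge(toy_service_summary, on="CustomerID", how="left")
    .merge(toy_churn, on="CustomerID", how="left")
)
print(toy_df)
```

Customer 2 has no rows in the toy service sheet, so the left join leaves its service columns empty. That is the same pattern that produces missing values in the merged dataset.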

Let's check if missing values still exist in the dataset.
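
A quick way to run that check is to count nulls per column. The sketch below uses a small hypothetical frame in place of the merged `base_df`; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged customer-level table
toy_df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "ServiceInteractions": [2.0, np.nan, 1.0],
    "ResolutionRate": [0.5, np.nan, 1.0],
})

# Count missing values per column and show only the columns that have any
missing_counts = toy_df.isna().sum()
print(missing_counts[missing_counts > 0])
```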

### **Data Preprocessing** 

Before building any predictive model, it’s critical to ensure that the dataset is clean, balanced, and properly preprocessed. 

**1. Handle Missing Values**

- `ServiceInteractions` and `ResolutionRate` have 332 missing values. These most likely come from customers who have no customer service records.

    → Treat missing values as **meaningful absence**:

In [13]:
# Fill missing service-related fields with 0 (no interaction)
base_df['ServiceInteractions'] = base_df['ServiceInteractions'].fillna(0)
base_df['ResolutionRate'] = base_df['ResolutionRate'].fillna(0)

**2. Feature Engineering**

- Convert `LastLoginDate` to the number of days since the last login, measured from the most recent login date in the dataset, and drop the original column.

- Create a new feature `AgeGroup` from `Age` to support customer segmentation.

In [14]:
# Convert LastLoginDate to days since last login
base_df['LastLoginDate'] = pd.to_datetime(base_df['LastLoginDate'])
ref_date = base_df['LastLoginDate'].max()
base_df['DaysSinceLastLogin'] = (ref_date - base_df['LastLoginDate']).dt.days
base_df.drop(columns=['LastLoginDate'], inplace=True)

In [15]:
# Create age bins; pd.cut is right-inclusive by default, so age 30 falls in '<30' and age 50 in '30-50'
base_df['AgeGroup'] = pd.cut(base_df['Age'], bins=[0, 30, 50, 100], labels=['<30', '30-50', '>50'])

**3. Outliers Detection and Treatment**

- Use the `clip` method to cap outliers in the numeric features: values below the lower bound are set to the lower bound, and values above the upper bound are set to the upper bound. Here the bounds are the 1st and 99th percentiles of each feature.

In [16]:
# Cap outliers at 1st and 99th percentile for numeric features
for col in numeric_cols:
    lower = base_df[col].quantile(0.01)
    upper = base_df[col].quantile(0.99)
    base_df[col] = base_df[col].clip(lower, upper)
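
The cell above relies on a `numeric_cols` list. One way such a list could be built, shown here on toy data, is to take all numeric columns and exclude the identifier and the target so that they are not clipped. The exclusion list and the `TotalSpent` column are assumptions for illustration, not necessarily how `numeric_cols` was defined in this notebook.

```python
import pandas as pd

# Toy frame: 100 customers, one of them an extreme spender
toy_df = pd.DataFrame({
    "CustomerID": range(1, 101),
    "TotalSpent": [float(v) for v in range(1, 100)] + [10_000.0],
    "ChurnStatus": [0, 1] * 50,
})

# Numeric columns, minus the ID and the target (assumed exclusions)
toy_numeric_cols = (
    toy_df.select_dtypes(include="number")
    .columns.difference(["CustomerID", "ChurnStatus"])
    .tolist()
)

for col in toy_numeric_cols:
    lower = toy_df[col].quantile(0.01)
    upper = toy_df[col].quantile(0.99)
    # Values below `lower` become `lower`; values above `upper` become `upper`
    toy_df[col] = toy_df[col].clip(lower, upper)

print(toy_df["TotalSpent"].describe())
```

After clipping, the extreme value is pulled down to the 99th percentile while the bulk of the distribution is unchanged.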

**4. Encoding and Scaling**

- Use **One-Hot Encoding** for all categorical features. It turns each category into a binary feature (dummy variable), avoiding any unintended ordinal interpretation.

    - This method is ideal when categories have no natural order, such as `Gender` or `MaritalStatus`. `AgeGroup` and `IncomeLevel` are ordered, but one-hot encoding them is still a safe choice because it does not assume equal spacing between levels.

- Standardize numerical features so they share a common scale and no feature dominates simply because of its units.

In [17]:
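# Drop the ID column, then list the categorical and numerical feature columns
data_for_export = base_df.drop(columns=["CustomerID"])
# AgeGroup comes from pd.cut and has 'category' dtype, so select it alongside 'object' columns
categorical = data_for_export.select_dtypes(include=['object', 'category']).columns.tolist()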
numerical = data_for_export.select_dtypes(include='number').drop(columns=['ChurnStatus']).columns.tolist()

In [18]:
# One-hot encode categorical variables
encoded_df = pd.get_dummies(data_for_export, columns=categorical, drop_first=True)

In [19]:
# Scale numerical features
scaler = StandardScaler()
encoded_df[numerical] = scaler.fit_transform(encoded_df[numerical])
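
To illustrate the encoding and scaling steps end to end, here is a small sketch on toy data. One detail worth noting: `pd.cut` produces a `category` dtype rather than `object`, so the sketch selects both dtypes when collecting the categorical columns. The toy values and the `toy_` names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Age": [22.0, 35.0, 47.0, 68.0],
    "TotalSpent": [120.0, 80.0, 300.0, 50.0],
    "ChurnStatus": [0, 1, 0, 1],
})
toy_df["AgeGroup"] = pd.cut(toy_df["Age"], bins=[0, 30, 50, 100],
                            labels=["<30", "30-50", ">50"])

# AgeGroup has 'category' dtype, so request it in addition to 'object'
toy_categorical = toy_df.select_dtypes(include=["object", "category"]).columns.tolist()
toy_numerical = (
    toy_df.select_dtypes(include="number").drop(columns=["ChurnStatus"]).columns.tolist()
)

# drop_first=True drops one level per feature to avoid a redundant column
toy_encoded = pd.get_dummies(toy_df, columns=toy_categorical, drop_first=True)

# StandardScaler centers each column at 0 and rescales it to unit variance
toy_encoded[toy_numerical] = StandardScaler().fit_transform(toy_encoded[toy_numerical])
print(toy_encoded.columns.tolist())
```

The target `ChurnStatus` is left out of the scaled columns, matching the cells above.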

### **Export Processed Data**

In [20]:
# Export processed data for modeling task
encoded_df.to_csv("Data/processed_churn.csv", index=False)