# Phase 3 – Data Exploration & Feature Engineering
## Objectives
Analyze the collected data to understand its structure, uncover meaningful insights, and engineer useful features for modeling.

## Tasks

### 1. Exploratory Data Analysis (EDA)
- Compute descriptive statistics (mean, median, standard deviation, etc.).
- Visualize data distribution and relationships using charts (histograms, scatter plots, correlation heatmaps, box plots, etc.).
- Identify trends, outliers, anomalies, and potential issues.
### 2. Insight Extraction
- Highlight important patterns or relationships relevant to your research problem.
- Describe how these findings guide your next steps (modeling or deeper analysis).
### 3. Feature Engineering
- Create new attributes from existing raw features (e.g., ratios, aggregated variables, domain-based transformations).
- Encode categorical variables if needed.
- Scale/normalize features where appropriate.
- Justify why each engineered feature might improve performance.
## Deliverables
A concise EDA and Feature Engineering Report including:
- Key statistics and summary tables
- Visualizations with clear explanations
- A list of extracted insights
- A table of selected features with description and justification

### 1. Exploratory Data Analysis (EDA)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
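
The `ValueError` above usually indicates that a compiled extension (commonly pandas) was built against a different NumPy version than the one installed. A minimal sketch for inspecting the installed versions without importing the packages themselves is shown here; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "pandas", "matplotlib", "seaborn")):
    """Return {package: version string or None} without importing the packages."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

If the versions look mismatched, reinstalling pandas and NumPy together in a fresh environment is a common remedy, though the right fix depends on how the environment was built.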

In [None]:
# Load the dataset

data_file = "datasets/telco_churn_clean_stage_2.csv"

df = pd.read_csv(data_file)

print("The shape of the dataset is: ", df.shape, "\n")
df.head()

### Step 2 – Basic data understanding & cleaning

In [None]:
df.info()

### Convert TotalCharges to a numeric type
`df.info()` shows `TotalCharges` as `object` rather than a numeric type, so convert it to numeric and handle the missing values that result.

In [None]:
# Convert TotalCharges to numeric
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Check how many became NaN
df["TotalCharges"].isna().sum(), df.loc[df["TotalCharges"].isna(), ["tenure", "TotalCharges"]].head()


In [None]:
# Impute TotalCharges = 0 for tenure = 0 & NaN TotalCharges
mask_new = (df["tenure"] == 0) & (df["TotalCharges"].isna())
df.loc[mask_new, "TotalCharges"] = 0

# Confirm no more NaNs in TotalCharges
df["TotalCharges"].isna().sum()
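
To see what the two cells above do in isolation, here is a small sketch on a made-up frame. The values in `toy` are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy frame: a blank string and "n/a" stand in for unparseable entries
toy = pd.DataFrame({
    "tenure": [0, 5, 12, 3],
    "TotalCharges": [" ", "350.5", "1200", "n/a"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing_before = int(toy["TotalCharges"].isna().sum())

# Only rows with tenure == 0 are imputed with 0; any other NaN stays visible
mask_new = (toy["tenure"] == 0) & toy["TotalCharges"].isna()
toy.loc[mask_new, "TotalCharges"] = 0
n_missing_after = int(toy["TotalCharges"].isna().sum())

print("NaN before imputation:", n_missing_before)
print("NaN after imputation: ", n_missing_after)
```

A customer with zero tenure has not been billed yet, which is why 0 is a defensible fill value for those rows; a NaN on a row with positive tenure would need separate investigation.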

### Step 3 – Descriptive statistics (EDA Task 1: statistics)

In [None]:
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

desc_stats = df[num_cols].describe().T.round(2)
desc_stats
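
`describe()` reports the median only as the `50%` row and says nothing about skewness. A possible extension, sketched on hypothetical toy values, adds an explicit median, the interquartile range, and skew; the helper `extended_stats` is an illustration, not part of the original notebook.

```python
import pandas as pd


def extended_stats(frame, cols):
    """describe() plus explicit median, IQR, and skewness for the given columns."""
    stats = frame[cols].describe().T
    stats["median"] = frame[cols].median()
    stats["iqr"] = stats["75%"] - stats["25%"]
    stats["skew"] = frame[cols].skew()
    return stats.round(2)


# Hypothetical values: tenure has one extreme value, MonthlyCharges is symmetric
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 100],
    "MonthlyCharges": [20.0, 30.0, 40.0, 50.0, 60.0],
})

toy_stats = extended_stats(toy, ["tenure", "MonthlyCharges"])
print(toy_stats[["mean", "median", "iqr", "skew"]])
```

A mean far above the median together with a positive skew, as in the toy `tenure` column, is a quick signal of a right-skewed distribution.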


### Step 4 – Target variable: Churn distribution

In [None]:
churn_counts = df["Churn"].value_counts()
churn_props = df["Churn"].value_counts(normalize=True).round(3)

print("Counts:\n", churn_counts)
print("\nProportions:\n", churn_props)
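
The proportions matter for modeling: a classifier that always predicts the majority class already reaches that class's share as accuracy. A small sketch of that baseline, using hypothetical labels and a hypothetical helper `majority_baseline`:

```python
import pandas as pd


def majority_baseline(labels):
    """Return (majority class, accuracy of always predicting that class)."""
    props = labels.value_counts(normalize=True)
    return props.idxmax(), float(props.max())


# Hypothetical labels: 7 non-churners and 3 churners
toy_churn = pd.Series(["No"] * 7 + ["Yes"] * 3, name="Churn")

majority_class, baseline_accuracy = majority_baseline(toy_churn)
print(f"Always predicting {majority_class!r} gives accuracy {baseline_accuracy:.2f}")
```

Any model in the next phase should be judged against this baseline, and metrics such as recall or F1 on the churn class are usually more informative than raw accuracy when classes are imbalanced.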


### Plot churn distribution

In [None]:
plt.figure(figsize=(5, 4))
churn_counts.plot(kind="bar")
plt.title("Customer Churn Distribution")
plt.xlabel("Churn")
plt.ylabel("Number of Customers")
plt.tight_layout()
plt.show()


### Step 5 – Numeric relationships & correlation (EDA Task 1: relationships)
### 5.1 Create numeric target ChurnFlag

In [None]:
df["ChurnFlag"] = df["Churn"].map({"Yes": 1, "No": 0})
df["ChurnFlag"].value_counts()


### 5.2 Correlation matrix

In [None]:
corr_cols = ["tenure", "MonthlyCharges", "TotalCharges", "ChurnFlag"]
corr_matrix = df[corr_cols].corr()
corr_matrix
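
`DataFrame.corr` defaults to Pearson correlation, which measures linear association only. As a sketch of why a rank-based check can be worth adding, the toy example below (hypothetical data) compares Pearson and Spearman on a relationship that is monotonic but not linear.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3  # increases with x, but not linearly

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, the same comparison would be `df[corr_cols].corr(method="spearman")`; a large gap between the two matrices hints at non-linear relationships.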


### 5.3 Simple correlation heatmap

In [None]:
plt.figure(figsize=(12, 8))
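# Fix the color scale to [-1, 1] so that 0 maps to the neutral midpoint of "coolwarm"
im = plt.imshow(corr_matrix, cmap="coolwarm", vmin=-1, vmax=1, interpolation="nearest")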

# Add colorbar
plt.colorbar(im)

# Tick labels
plt.xticks(range(len(corr_cols)), corr_cols, rotation=45, ha="right")
plt.yticks(range(len(corr_cols)), corr_cols)

# Annotate values inside the heatmap
for i in range(len(corr_cols)):
    for j in range(len(corr_cols)):
        value = corr_matrix.iloc[i, j]
        plt.text(j, i, f"{value:.2f}", ha="center", va="center", color="black")

plt.title("Correlation Heatmap (with values)")
plt.tight_layout()
plt.show()

### Step 6 – Distribution plots & outliers (EDA Task 1: distributions)
### 6.1 Histograms for numeric features

In [None]:
plt.figure(figsize=(12, 4))

for i, col in enumerate(num_cols, 1):
    plt.subplot(1, 3, i)
    df[col].hist(bins=30)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()


### 6.2 Boxplots for outlier inspection

In [None]:
plt.figure(figsize=(12, 4))

for i, col in enumerate(num_cols, 1):
    plt.subplot(1, 3, i)
    df.boxplot(column=col)
    plt.title(f"Boxplot of {col}")

plt.tight_layout()
plt.show()
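
Boxplots show outliers visually; the same rule can be applied numerically. The sketch below counts values outside the 1.5 × IQR whiskers (matplotlib's default whisker length). The helper `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame, cols, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in cols:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (frame[col] < q1 - k * iqr) | (frame[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="n_outliers")


# Hypothetical values: one extreme MonthlyCharges entry, no extreme tenure
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5],
    "MonthlyCharges": [20, 22, 21, 23, 500],
})

outlier_counts = iqr_outlier_counts(toy, ["tenure", "MonthlyCharges"])
print(outlier_counts)
```

On the real data this would be called as `iqr_outlier_counts(df, num_cols)`, which gives a table that can go straight into the report next to the boxplots.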


### Step 7 – Categorical features vs Churn (EDA Task 1 & 2)

We’ll define a helper function to compute churn rate by category.

In [None]:
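from IPython.display import display  # built into notebooks; imported to make the dependency explicit


def churn_rate_by(df, col):
    """Return the customer count and mean ChurnFlag (churn rate) for each level of `col`."""
    return (
        df.groupby(col)["ChurnFlag"]
          .agg(["count", "mean"])
          .rename(columns={"mean": "ChurnRate"})
          .reset_index()
    )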

cat_cols_to_check = ["Contract", "InternetService", "PaymentMethod", "SeniorCitizen"]

for col in cat_cols_to_check:
    print(f"\n=== {col} ===")
    display(churn_rate_by(df, col).round(3))
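
A per-category churn rate is easier to interpret next to the overall rate. The sketch below, on hypothetical toy data, adds a `Lift` column (category rate divided by overall rate); the column name is an assumption introduced here.

```python
import pandas as pd

# Hypothetical toy data: 3 of 4 month-to-month customers churn, 1 of 4 two-year customers
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "ChurnFlag": [1, 1, 1, 0, 0, 0, 0, 1],
})

overall_rate = toy["ChurnFlag"].mean()

by_contract = (
    toy.groupby("Contract")["ChurnFlag"]
       .agg(count="count", ChurnRate="mean")
       .reset_index()
)
# Lift > 1 means the category churns more than average, < 1 means less
by_contract["Lift"] = by_contract["ChurnRate"] / overall_rate
by_contract = by_contract.sort_values("ChurnRate", ascending=False).reset_index(drop=True)

print(by_contract)
```

Sorting by churn rate puts the riskiest categories first, which is a convenient order for the insight list in the report.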


### Example bar chart: churn rate by Contract

In [None]:
contract_stats = churn_rate_by(df, "Contract")

plt.figure(figsize=(6, 4))
plt.bar(contract_stats["Contract"], contract_stats["ChurnRate"])
plt.title("Churn Rate by Contract Type")
plt.xlabel("Contract Type")
plt.ylabel("Churn Rate")
plt.tight_layout()
plt.show()


### Step 8 – Feature Engineering (Task 3)

Now we create new features, encode categoricals, and prepare a modeling-ready dataset.

### 8.1 Tenure bands

In [None]:
def tenure_band(t):
    if t <= 6:
        return "0-6"
    elif t <= 12:
        return "7-12"
    elif t <= 24:
        return "13-24"
    elif t <= 48:
        return "25-48"
    else:
        return "49+"

df["TenureBand"] = df["tenure"].apply(tenure_band)
df["TenureBand"].value_counts()
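
The same banding can be expressed with `pd.cut`, which avoids a row-by-row `apply` and returns an ordered categorical. This is an alternative sketch, not the notebook's method; the bin edges are chosen to reproduce the `if`/`elif` thresholds above, and the sample tenures are hypothetical.

```python
import pandas as pd

# Hypothetical tenures covering every band boundary
tenure = pd.Series([0, 6, 7, 12, 13, 24, 25, 48, 49, 72], name="tenure")

# Right-closed intervals: (-1, 6], (6, 12], (12, 24], (24, 48], (48, inf)
bins = [-1, 6, 12, 24, 48, float("inf")]
labels = ["0-6", "7-12", "13-24", "25-48", "49+"]

bands = pd.cut(tenure, bins=bins, labels=labels)

# sort=False keeps the bands in their natural order instead of sorting by count
print(bands.value_counts(sort=False))
```

Because the result is ordered, plots and tables list the bands from shortest to longest tenure without extra sorting.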


### 8.2 Service count (how many services a customer has)

In [None]:
service_cols = [
    "PhoneService", "MultipleLines", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"
]

df["ServicesCount"] = (df[service_cols] == "Yes").sum(axis=1)
df[["ServicesCount"]].describe()
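
The comparison `== "Yes"` counts only exact `"Yes"` values, so any other label contributes zero. The toy sketch below assumes labels such as `"No phone service"` and `"No internet service"` may occur in these columns; that is an assumption about the data, and the rows are made up.

```python
import pandas as pd

# Hypothetical rows; the non-Yes labels are assumptions about possible values
toy = pd.DataFrame({
    "PhoneService":   ["Yes", "No", "Yes"],
    "MultipleLines":  ["Yes", "No phone service", "No"],
    "OnlineSecurity": ["No", "Yes", "No internet service"],
})
service_cols = ["PhoneService", "MultipleLines", "OnlineSecurity"]

# Boolean frame (True where the value is exactly "Yes"), summed across each row
toy["ServicesCount"] = (toy[service_cols] == "Yes").sum(axis=1)

print(toy)
```

It is worth checking `df[service_cols].stack().unique()` on the real data first, since a differently spelled label (for example `"yes"`) would silently be counted as zero.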


### 8.3 Contract & payment related binary flags

In your report, justify how each of these flags may help model performance (e.g., customers on month-to-month contracts are at higher churn risk).

In [None]:
df["IsNewCustomer"] = (df["tenure"] <= 6).astype(int)

df["HasFiber"] = (df["InternetService"] == "Fiber optic").astype(int)

df["IsMonthToMonth"] = (df["Contract"] == "Month-to-month").astype(int)

df["IsLongTermContract"] = df["Contract"].isin(["One year", "Two year"]).astype(int)

df["IsElectronicCheck"] = (df["PaymentMethod"] == "Electronic check").astype(int)

df["AutoPayment"] = df["PaymentMethod"].isin(
    ["Bank transfer (automatic)", "Credit card (automatic)"]
).astype(int)

df["SeniorWithNoDependents"] = (
    (df["SeniorCitizen"] == 1) & (df["Dependents"] == "No")
).astype(int)

df["TechSupportOrSecurity"] = (
    (df["TechSupport"] == "Yes") | (df["OnlineSecurity"] == "Yes")
).astype(int)

df["StreamingBundle"] = (
    (df["StreamingTV"] == "Yes") | (df["StreamingMovies"] == "Yes")
).astype(int)
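
One way to support the justification of each flag is to compare the churn rate when the flag is 1 against the rate when it is 0. The helper `flag_churn_rates` below is a hypothetical sketch run on made-up data.

```python
import pandas as pd


def flag_churn_rates(frame, flag_cols, target="ChurnFlag"):
    """Churn rate for flag == 0 and flag == 1, plus their difference, per flag."""
    rows = []
    for col in flag_cols:
        rates = frame.groupby(col)[target].mean()
        rows.append({
            "flag": col,
            "rate_if_0": rates.get(0, float("nan")),
            "rate_if_1": rates.get(1, float("nan")),
        })
    out = pd.DataFrame(rows).set_index("flag")
    out["difference"] = out["rate_if_1"] - out["rate_if_0"]
    return out


# Hypothetical data: IsMonthToMonth separates churners, AutoPayment does not
toy = pd.DataFrame({
    "IsMonthToMonth": [1, 1, 1, 1, 0, 0, 0, 0],
    "AutoPayment":    [0, 0, 1, 1, 0, 0, 1, 1],
    "ChurnFlag":      [1, 1, 1, 0, 0, 0, 0, 1],
})

flag_summary = flag_churn_rates(toy, ["IsMonthToMonth", "AutoPayment"])
print(flag_summary)
```

A difference near zero suggests the flag carries little signal on its own, although it may still matter in combination with other features.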


### 8.4 One-hot encoding for remaining categorical variables

Select categorical columns:

In [None]:
cat_cols = [
    "gender", "SeniorCitizen", "Partner", "Dependents",
    "PhoneService", "MultipleLines", "InternetService",
    "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies",
    "Contract", "PaperlessBilling", "PaymentMethod",
    "TenureBand"
]

# SeniorCitizen is stored as 0/1 but is categorical; it could be left as-is, but here it is one-hot encoded with the rest
df[cat_cols].head()


In [None]:
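# Dataset after one-hot encoding
df_model = df.copy()

# One-hot encode categorical columns; drop_first=True drops one level per
# feature so the dummy columns are not perfectly collinear
df_model = pd.get_dummies(df_model, columns=cat_cols, drop_first=True)

# Note: the text column "Churn" is still present alongside ChurnFlag;
# exclude it from the feature matrix when modeling.
print("Shape after encoding:", df_model.shape)
df_model.head()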
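
To make the effect of `drop_first=True` concrete, the sketch below encodes a single hypothetical `Contract` column both ways.

```python
import pandas as pd

toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# Full encoding: one column per level, and every row sums to exactly 1
full = pd.get_dummies(toy, columns=["Contract"])

# drop_first=True removes the first level (alphabetically), which becomes the baseline
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the reduced encoding, a row of all zeros means the dropped level (here `Month-to-month`), so no information is lost and linear models avoid perfectly collinear inputs.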


### 8.5 (Optional) Scale numeric features

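Scaling matters for models that are sensitive to feature magnitude (e.g., logistic regression, k-NN, SVM); tree-based models generally do not need it. The cell below fits the scaler on the full dataset, which lets test-set statistics influence the transformation, so for Phase 4 consider fitting it on the training split only: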

In [None]:
from sklearn.preprocessing import StandardScaler

numeric_for_scaling = ["tenure", "MonthlyCharges", "TotalCharges", "ServicesCount"]

scaler = StandardScaler()
df_model[numeric_for_scaling] = scaler.fit_transform(df_model[numeric_for_scaling])

df_model[numeric_for_scaling].head()
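
A sketch of leakage-free scaling: split first, fit the scaler on the training rows, then reuse it on the test rows. The data is random and hypothetical, and the split parameters (`test_size=0.25`, `random_state=42`) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical random data standing in for the engineered dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=200),
    "MonthlyCharges": rng.uniform(20, 120, size=200),
    "ChurnFlag": rng.integers(0, 2, size=200),
})
numeric_cols = ["tenure", "MonthlyCharges"]

# Stratify so both splits keep the same churn proportion
X_train, X_test, y_train, y_test = train_test_split(
    toy[numeric_cols], toy["ChurnFlag"],
    test_size=0.25, stratify=toy["ChurnFlag"], random_state=42,
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # apply the training mean/std unchanged

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(3))
print("Test means after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The test means are close to, but not exactly, zero, which is expected: the test set is transformed with statistics it did not contribute to.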


### Step 9 – Save the engineered dataset

Finally, save a clean, feature-engineered CSV for Phase 4 modeling.

In [None]:
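import os

output_path = "datasets/outputs/telco_churn_phase3_features.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # create the output folder if it is missing
df_model.to_csv(output_path, index=False)
print("Saved:", output_path)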
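
Before handing the file to Phase 4, a quick round-trip check can confirm that it reloads with the same shape and columns. The sketch below uses a temporary directory and a hypothetical toy frame so it does not touch the real output path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_model with a numeric, a flag, and a dummy column
toy = pd.DataFrame({
    "tenure": [1, 24],
    "HasFiber": [1, 0],
    "Contract_Two year": [False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
n_missing = int(reloaded.isna().sum().sum())

print("Same shape:", same_shape)
print("Same columns:", same_columns)
print("Missing values after reload:", n_missing)
```

Running the same three checks against `pd.read_csv(output_path)` and `df_model` would catch a truncated write or an accidental index column.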
