<a href="https://colab.research.google.com/github/catebarry/xai-assignments/blob/main/assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIPI 590 - XAI | Assignment 1
### Interpretable ML
### Catie Barry




In [2]:
# installations

!pip install pandas==1.3.0

Collecting pandas==1.3.0
  Downloading pandas-1.3.0.tar.gz (4.7 MB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 2
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.


In [3]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Please use this to connect your GitHub repository to your Google Colab notebook
# Connects to any needed files from GitHub and Google Drive
import os

# Remove Colab default sample_data
!rm -r ./sample_data

# Clone GitHub files to colab workspace
repo_name = "xai-assignments" # Change to your repo name
git_path = 'https://github.com/catebarry/xai-assignments.git' #Change to your path
!git clone "{git_path}"

# Install dependencies from requirements.txt file
#!pip install -r "{os.path.join(repo_name,'requirements.txt')}" #Add if using requirements.txt

# Change working directory to location of notebook
notebook_dir = 'assignments'
path_to_notebook = os.path.join(repo_name,notebook_dir)
%cd "{path_to_notebook}"
%ls

Cloning into 'xai-assignments'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 30 (delta 6), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (30/30), 7.49 KiB | 1.87 MiB/s, done.
Resolving deltas: 100% (6/6), done.
/content/xai-assignments/assignments
assignment-1.ipynb


In this assignment, you will work with a dataset from a telecommunications company (https://www.kaggle.com/datasets/blastchar/telco-customer-churn/code). The company is interested in understanding the factors that contribute to customer churn (customers leaving the company for a competitor) and developing interpretable models to predict which customers are at risk of churning.

# Exploratory Data Analysis to Check Assumptions
Exploratory analysis of dataset to understand relationships between different features and the target variable (churn), using appropriate visualizations and statistical methods to determine whether assumptions about linear, logistic, and GAM models are met.


In this section, we explore the Telco Customer Churn dataset to understand feature distributions, relationships, and potential issues before modeling.
We will:
- Inspect the dataset and clean missing/invalid values
- Explore class balance of the target variable `Churn`
- Visualize numeric and categorical features
- Identify potential violations of linear, logistic, and GAM model assumptions:
  - **Linearity** (in the outcome for linear regression; in the log-odds for logistic regression)
  - **Independence of observations**
  - **Homoscedasticity** (equal variance of errors; linear regression)
  - **Normality of residuals** (linear regression)
  - **No multicollinearity** (features not too correlated)
  - **No influential outliers**

### Load and Inspect the Dataset

We will use the [Telco Customer Churn dataset](https://www.kaggle.com/datasets/blastchar/telco-customer-churn).  
This dataset contains information about customers of a telecommunications company and whether they churned (left the service).

The dataset includes:
- **Customer demographics** (e.g., gender, senior citizen, partner, dependents)
- **Account information** (e.g., tenure, contract type, payment method)
- **Service usage** (e.g., internet service, phone service, streaming options)
- **Charges** (monthly charges and total charges)
- **Churn** (Yes/No) — the target variable

In [6]:
# load and inspect dataset
url = "https://raw.githubusercontent.com/blastchar/telco-customer-churn/master/WA_Fn-UseC_-Telco-Customer-Churn.csv"
df = pd.read_csv(url)

print("Shape:", df.shape)
print("\nData types and non-null counts:")
print(df.info())
print("\nFirst 5 rows:")
display(df.head())

HTTPError: HTTP Error 404: Not Found
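
Because the URL above returns a 404, one option is to wrap the load in a helper that tries several sources in order and reports every failure. The following is a minimal sketch, not the notebook's actual loader; `load_first_available` is a hypothetical helper name, and any fallback path passed to it is an assumption about where a copy of the Kaggle CSV has been saved.

```python
import pandas as pd


def load_first_available(sources):
    """Return the first CSV in `sources` that loads, trying them in order."""
    errors = []
    for src in sources:
        try:
            return pd.read_csv(src)
        except (OSError, pd.errors.ParserError, pd.errors.EmptyDataError) as exc:
            # HTTP errors and missing local files are both OSError subclasses
            errors.append(f"{src}: {exc}")
    raise RuntimeError("No source could be loaded:\n" + "\n".join(errors))
```

It could be called as `df = load_first_available([url, "data/telco_churn.csv"])`, where the second entry is a hypothetical local copy of the file downloaded from Kaggle.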

### Target Variable and Data Cleaning

The target variable `Churn` is categorical with values **Yes** or **No**.  
For modeling, we will convert it into a binary variable:  
- `Yes` → 1  
- `No` → 0  

Additionally, the column `TotalCharges` should be numeric but sometimes contains blank values.  
We will coerce it to numeric and replace missing values with the median.

In [None]:
# clean target variable and features
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# convert TotalCharges (some values are blank)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# assign the result back: inplace=True on a selected column is deprecated in pandas 2.x
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

### Churn Class Balance

Class balance is critical for classification problems. If the dataset is highly imbalanced (e.g., very few churn cases compared to non-churn), many models may perform poorly without rebalancing techniques.

We visualize the distribution of churn vs. non-churn customers.

In [None]:
# class balance
plt.figure(figsize=(5,4))
sns.countplot(x='Churn', data=df)
plt.title("Churn Distribution")
plt.show()
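
The count plot shows the imbalance visually; a small table of counts and shares makes it easy to quote exact numbers. This is a minimal sketch that assumes `Churn` has already been mapped to 0/1. `class_balance` is a hypothetical helper, and the toy Series below is made-up data, not the Telco dataset.

```python
import pandas as pd


def class_balance(target):
    """Counts and proportions for each class of a binary target Series."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Hypothetical toy target: 3 churners out of 10 customers
toy_churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Churn")
balance = class_balance(toy_churn)
print(balance)
```

On the real data the call would be `class_balance(df['Churn'])`.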

### Distribution of Numeric Features

We examine the distributions of the key numeric variables:
- **tenure** (number of months with the company)
- **MonthlyCharges** (monthly bill amount)
- **TotalCharges** (total billed amount)

This helps us understand skewness, ranges, and potential outliers in the data.

In [None]:
# univariate analysis
num_features = ['tenure', 'MonthlyCharges', 'TotalCharges']

df[num_features].hist(bins=30, figsize=(12,6))
plt.suptitle("Distribution of Numeric Features")
plt.show()
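
Histograms suggest skewness and outliers, but both can also be quantified. The sketch below reports sample skewness and the number of points outside the usual 1.5 × IQR fences. `skew_and_outliers` is a hypothetical helper, the 1.5 multiplier is a conventional choice rather than something the assignment prescribes, and the toy data are invented.

```python
import pandas as pd


def skew_and_outliers(df, columns, k=1.5):
    """Sample skewness and count of values outside the k * IQR fences."""
    rows = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        rows[col] = {"skew": df[col].skew(), "n_outliers": int(outside.sum())}
    return pd.DataFrame.from_dict(rows, orient="index")


# Hypothetical toy column with one extreme value
toy = pd.DataFrame({"charges": [20, 22, 25, 24, 23, 21, 26, 200]})
summary = skew_and_outliers(toy, ["charges"])
print(summary)
```

On the real data the call would be `skew_and_outliers(df, num_features)`.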

### Numeric Features by Churn Status

We compare the distributions of numeric variables between churned and non-churned customers using boxplots.  
This allows us to see whether the central tendency and spread differ between the two groups, which can suggest predictive value.

⚠️ Example interpretation — check your plots:  
- Customers with lower **tenure** appear more likely to churn.  
- Customers with higher **MonthlyCharges** are somewhat more likely to churn.  
- **TotalCharges** is lower for churned customers, likely because they left earlier.  

In [None]:
# churn vs numeric features
for col in num_features:
    plt.figure(figsize=(6,4))
    sns.boxplot(x='Churn', y=col, data=df)
    plt.title(f"{col} vs Churn")
    plt.show()

### Categorical Features by Churn Status

We now explore the relationship between categorical features and churn.  
This includes demographics (e.g., gender, partner, dependents) and account information (e.g., contract type, payment method, internet service).

⚠️ Example interpretation — check your plots:  
- **Contract type** shows a strong effect: customers with month-to-month contracts churn more often.  
- **Payment method** also matters: electronic check users churn more frequently.  
- **Gender** does not show a major difference.  

In [None]:
# churn vs categorical features
cat_features = ['gender','Partner','Dependents','PhoneService','MultipleLines',
                'InternetService','Contract','PaymentMethod']

for col in cat_features:
    plt.figure(figsize=(7,4))
    sns.countplot(x=col, hue='Churn', data=df)
    plt.title(f"{col} vs Churn")
    plt.xticks(rotation=45)
    plt.show()
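
Count plots show raw counts, so categories with few customers can hide a high churn rate. A per-category churn rate table complements the plots. This sketch assumes `Churn` is already encoded as 0/1; `churn_rate_by` is a hypothetical helper and the toy frame is made-up data.

```python
import pandas as pd


def churn_rate_by(df, feature, target="Churn"):
    """Group size and mean of a 0/1 target for each level of `feature`."""
    grouped = df.groupby(feature)[target].agg(n="count", churn_rate="mean")
    return grouped.sort_values("churn_rate", ascending=False)


# Hypothetical toy data, not the Telco dataset
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": [1, 1, 1, 0, 0, 0, 0, 1],
})
rates = churn_rate_by(toy, "Contract")
print(rates)
```

On the real data, looping `churn_rate_by(df, col)` over `cat_features` gives one table per feature.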

### Correlation Between Numerical Features

Correlation analysis helps us check for **multicollinearity**, which inflates the variance of coefficient estimates in linear and logistic regression and makes individual coefficients harder to interpret.  
Highly correlated predictors may provide redundant information.  

⚠️ Example interpretation — check your heatmap:  
- **MonthlyCharges** and **TotalCharges** are moderately correlated.  
- **Tenure** and **TotalCharges** are also correlated (longer tenure → higher total charges).  
- Correlation with `Churn` is relatively weak overall, suggesting multiple features contribute in combination.

In [None]:
# correlation matrix
plt.figure(figsize=(8,6))
sns.heatmap(df[num_features + ['Churn']].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap (Numerical Features)")
plt.show()
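
Pairwise correlations can miss collinearity that involves several predictors at once. The variance inflation factor (VIF) addresses this, and it can be read off the diagonal of the inverse correlation matrix of the predictors. The sketch below uses only NumPy and pandas; `vif_from_correlation` is a hypothetical helper, and the toy data are synthetic, built so that `TotalCharges` is roughly `tenure * MonthlyCharges`.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(predictors):
    """VIF for each column, taken from the diagonal of the inverse correlation matrix."""
    corr = predictors.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=predictors.columns, name="VIF")


# Hypothetical synthetic data, not the Telco dataset
rng = np.random.default_rng(0)
tenure = rng.uniform(1, 72, size=300)
monthly = rng.uniform(20, 120, size=300)
toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
})
vif = vif_from_correlation(toy)
print(vif)
```

On the real data the call would be `vif_from_correlation(df[num_features])`. Pass only the predictors, not `Churn`. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.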

## Assumption Checks Summary

Based on the exploratory analysis:

- **Linearity**: The relationship between `tenure` and churn looks non-linear (sharp drop in churn risk after the first few months).  
- **Independence**: Observations are customers, not time-series data, so independence is reasonable.  
- **Homoscedasticity**: To be tested later with residuals after model fitting.  
- **Normality**: Numeric features like `MonthlyCharges` and `TotalCharges` are skewed, but the assumption applies to the residuals of the linear model rather than to the features; residuals will be checked after fitting.  
- **Multicollinearity**: Some correlation between `tenure`, `MonthlyCharges`, and `TotalCharges`, but not extreme.  
- **Outliers**: Some customers have unusually high charges; worth monitoring.  

These findings will guide which models are appropriate (e.g., a GAM can capture the non-linear `tenure` effect directly, while linear and logistic regression need transformed or binned features to do so).
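
The linearity point above can be checked more directly for logistic regression by binning a numeric feature and plotting the empirical log-odds of churn per bin: roughly equal steps between bins suggest the logit is close to linear in that feature. The following is a sketch under stated assumptions. `empirical_logit` is a hypothetical helper, the number of bins and the 0.5 continuity correction are arbitrary choices, and the toy data are simulated with a logit that is linear in `tenure` by construction.

```python
import numpy as np
import pandas as pd


def empirical_logit(df, feature, target="Churn", bins=5):
    """Empirical log-odds of a 0/1 target within quantile bins of `feature`."""
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")
    stats = df.groupby(binned, observed=True)[target].agg(n="count", events="sum")
    # Add 0.5 to both cells so bins with 0% or 100% churn stay finite
    stats["logit"] = np.log((stats["events"] + 0.5) / (stats["n"] - stats["events"] + 0.5))
    stats["mid"] = [interval.mid for interval in stats.index]
    return stats


# Hypothetical simulated data: true logit is 0.5 - 0.05 * tenure
rng = np.random.default_rng(0)
toy = pd.DataFrame({"tenure": rng.uniform(1, 72, size=2000)})
p = 1 / (1 + np.exp(-(0.5 - 0.05 * toy["tenure"].to_numpy())))
toy["Churn"] = rng.binomial(1, p)
logits = empirical_logit(toy, "tenure")
print(logits[["n", "mid", "logit"]])
```

Plotting `logits["mid"]` against `logits["logit"]` for the real `tenure` column would show whether the drop in churn risk is concentrated in the first months, as the boxplots suggest.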

# Linear Regression
Treat the churn variable as a continuous variable (e.g., 0 for staying, 1 for churning) and build a linear regression model to predict churn. Interpret the coefficients and assess the model's performance.
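
Regressing a 0/1 outcome on features with ordinary least squares gives a linear probability model: each slope is the change in predicted churn probability per unit of the feature, but predictions are not constrained to [0, 1]. The sketch below illustrates this with NumPy only. `fit_ols` and `predict_ols` are hypothetical helpers, and the data are simulated, not the Telco dataset.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit; returns [intercept, slope_1, ..., slope_p]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Linear prediction for rows of X using coefficients from fit_ols."""
    return coef[0] + np.asarray(X) @ coef[1:]


# Hypothetical simulated data: churn probability falls linearly with tenure
rng = np.random.default_rng(0)
tenure = rng.uniform(1, 72, size=500)
churn = rng.binomial(1, np.clip(0.6 - 0.008 * tenure, 0, 1))

coef = fit_ols(tenure.reshape(-1, 1), churn)
pred = predict_ols(coef, np.array([[1.0], [72.0], [120.0]]))
print("intercept, slope:", coef)
print("predictions at tenure 1, 72, 120:", pred)
```

The prediction at a tenure of 120 months falls below zero, which is the main interpretability cost of treating churn as continuous.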

# Logistic Regression
Treat churn as a binary variable and build a logistic regression model to predict the probability of churn. Interpret the coefficients.
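
Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of churn per unit increase in the feature. The sketch below fits an unregularized model by Newton-Raphson using NumPy only, as an illustration of the mechanics. `fit_logistic` is a hypothetical helper and the data are simulated; in practice a library implementation such as scikit-learn or statsmodels would normally be used.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of an unregularized logistic regression.

    Returns [intercept, coef_1, ..., coef_p] on the log-odds scale.
    """
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


# Hypothetical simulated data: true log-odds are 0.5 - 0.05 * tenure
rng = np.random.default_rng(0)
tenure = rng.uniform(1, 72, size=2000)
churn = rng.binomial(1, 1 / (1 + np.exp(-(0.5 - 0.05 * tenure))))

beta = fit_logistic(tenure.reshape(-1, 1), churn)
odds_ratio = np.exp(beta[1])
print("coefficients:", beta)
print("odds ratio per extra month of tenure:", odds_ratio)
```

An odds ratio below 1 for `tenure` means each additional month multiplies the odds of churn by that factor, holding other features fixed.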

# Generalized Additive Model (GAM)
Build a GAM to model the non-linear relationships between customer features and churn. Interpret the GAM model.
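
For reference, a logistic GAM for this problem can be written as follows. Which features receive smooth terms is an assumption made here for illustration; $f_1$ and $f_2$ denote smooth functions estimated from the data, and $z_k$ denote dummy-encoded categorical features.

```latex
\operatorname{logit} P(\text{Churn} = 1 \mid x)
  = \beta_0
  + f_1(\text{tenure})
  + f_2(\text{MonthlyCharges})
  + \sum_{k} \beta_k z_k
```

Setting every $f_j$ to a straight line recovers ordinary logistic regression, so the GAM is interpreted by plotting each fitted $f_j$ rather than by reading a single coefficient.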

# Model Comparison
Compare the performance and interpretability of the different models you built. Discuss the strengths and weaknesses of each approach and provide recommendations for which model(s) the telecommunications company should use to address their customer churn problem.
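
To compare the three models on equal footing, the same metrics can be computed from each model's predicted churn scores on a held-out set. The sketch below computes accuracy, precision, and recall at a threshold, plus ROC AUC as the share of churner/non-churner pairs ranked correctly. Both function names are hypothetical, and the 0.5 threshold is an arbitrary default rather than a recommendation.

```python
import numpy as np


def metrics_at_threshold(y_true, y_score, threshold=0.5):
    """Accuracy, precision, and recall after thresholding scores."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return {
        "accuracy": float((y_pred == y_true).mean()),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    # Pairwise comparison: simple, but memory grows with len(pos) * len(neg)
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))
```

Because AUC depends only on ranking, it also applies to the linear regression model, whose outputs are not valid probabilities.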