## Final Project Submission

Please fill out:
* Student name: Cindy Minyade
* Student pace: Part-time 
* Instructor name: Brian Chacha

## SYRIATEL CUSTOMER CHURN PREDICTION

### 1. Business Understanding

SyriaTel is losing customers, a problem known as 'churn'. This hurts the business because acquiring new customers costs more than retaining existing ones, so each churned customer reduces revenue.
 
### 2. Business Goal

The goal of this project is to develop a model that helps SyriaTel predict which customers are likely to churn, using historical data: [SyriaTel Customer Churn](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset)

By identifying potential churners early, SyriaTel can take the action needed to retain those customers.

In this project we will:
- Build multiple predictive models
- Evaluate and compare their performance
- Explain model insights that will benefit the business



### 3. Data Understanding

We will begin by loading and exploring the data.

In [47]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [48]:
#load the dataset
df = pd.read_csv("data/bigml_59c28831336c6604c800002a.csv")

In [49]:
#basic overview of the dataset
df.head()


| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


Next, we familiarize ourselves with the data:  
- Its dimensions  
- The types of data it contains  
- Whether there are missing values  

In [50]:
#check data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

### Interpretation of the data information
- The dataset has 3,333 rows and 21 columns; each row is a customer and each column is a customer feature.
- The `churn` column is the target of this project and is boolean (True/False)
- Data has no missing values
- `account length`: Duration of the customer relationship
- `international plan`, `voice mail plan`: Categorical service flags
- `total day minutes`, `total intl calls`, `customer service calls`: Customer usage behavior
- `state`, `area code`, and `phone number`: Likely not useful for prediction
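
The checks behind these observations (dimensions, missing values, and the type of the target) can also be run individually. The sketch below is illustrative only: `sample` is a hypothetical five-row frame built from a few columns of the `df.head()` output above, standing in for the full `df`.

```python
import pandas as pd

# Hypothetical mini-sample copied from the df.head() rows shown earlier;
# in the notebook itself these checks would run on the full `df`.
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH", "OK"],
    "account length": [128, 107, 137, 84, 75],
    "international plan": ["no", "no", "no", "yes", "yes"],
    "customer service calls": [1, 1, 0, 2, 3],
    "churn": [False, False, False, False, False],
})

n_rows, n_cols = sample.shape                # dimensions of the frame
missing_per_column = sample.isnull().sum()   # count of missing values per column
churn_dtype = sample["churn"].dtype          # data type of the target column

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
print("churn dtype:", churn_dtype)
```

On the full dataset, `df.shape` should agree with the 3,333 rows and 21 columns reported by `df.info()`.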


### 4. Data Preparation

In this section we will prepare the data for machine learning by:

- Dropping the irrelevant columns (`state`, `phone number`)
- Converting the categorical variables to numeric
- Separating the features and target variable
- Splitting the data into training and testing sets

In [51]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

#Drop the irrelevant columns
df = df.drop(columns=['state', 'phone number'])

# Convert 'churn' from boolean to numeric (0/1)
df['churn'] = df['churn'].astype(int)

# Encode the categorical variables: international plan, voice mail plan
categorical_cols = ['international plan', 'voice mail plan']
df[categorical_cols] = df[categorical_cols].apply(lambda col: LabelEncoder().fit_transform(col))

# Split features and target
X = df.drop('churn', axis=1)  # input features used to predict churn
y = df['churn']               # target variable (churn) we are trying to predict

# Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
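
To make the encoding step easier to interpret, it helps to see which integer each label receives. The sketch below uses a hypothetical four-value series (`plans`) rather than the real column. `LabelEncoder` assigns codes in sorted label order, so `no` becomes 0 and `yes` becomes 1. An explicit `map` is shown as an alternative that yields the same codes; it is not the approach used in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the 'international plan' column
plans = pd.Series(["no", "yes", "no", "yes"], name="international plan")

encoder = LabelEncoder()
encoded = encoder.fit_transform(plans)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Alternative (not used in the notebook): spell the mapping out explicitly
explicit = plans.map({"no": 0, "yes": 1})
print(explicit.tolist())
```

Spelling out the mapping keeps the meaning of 0 and 1 visible in the code, which can help when interpreting model coefficients later.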


### Scaling
Scaling puts all features on a comparable scale. Here the feature values are standardized with scikit-learn's `StandardScaler`.
In the SyriaTel churn data, some features have much larger ranges than others:

- **total day minutes**: ~0–350  
- **total eve minutes**: ~0–350  
- **customer service calls**: 0–9  
- **international plan**, **voice mail plan**: 0 or 1 

We therefore standardize each feature so that, on the training data, it has:

- **Mean = 0**  
- **Standard deviation = 1**  

This keeps all features on a comparable scale when the model is trained.
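
Standardization replaces each value x with z = (x - mean) / std, where the mean and standard deviation are computed from the training data. As a minimal sketch of that arithmetic, the example below uses a handful of `total day minutes` values from the `df.head()` output; the assignment of rows to `train` and `test` is hypothetical and only for illustration.

```python
import numpy as np

# 'total day minutes' values from df.head(); the train/test assignment
# here is hypothetical and only illustrates the calculation.
train = np.array([265.1, 161.6, 243.4, 299.4])
test = np.array([166.7])

mean = train.mean()
std = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics on the test data

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The test values are scaled with the training mean and standard deviation, so their own mean and standard deviation will generally not be exactly 0 and 1.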


In [52]:

from sklearn.preprocessing import StandardScaler


# Instantiate a scaler object
scaler = StandardScaler()

# Fit the scaler on X_train and transform X_train
X_train_scaled = scaler.fit_transform(X_train)

# Transform X_test
X_test_scaled = scaler.transform(X_test)
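
By default, `fit_transform` and `transform` return plain NumPy arrays, so the column names of `X_train` and `X_test` are not carried over. If named columns are needed later (for example, to read feature importances), one option is to wrap the arrays back into DataFrames. The sketch below uses hypothetical two-column frames (`X_train_demo`, `X_test_demo`) in place of the real data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for X_train and X_test, with values from df.head()
X_train_demo = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
    "customer service calls": [1, 1, 0, 2],
})
X_test_demo = pd.DataFrame({
    "total day minutes": [166.7],
    "customer service calls": [3],
})

scaler_demo = StandardScaler()
train_arr = scaler_demo.fit_transform(X_train_demo)  # fit on training data only
test_arr = scaler_demo.transform(X_test_demo)        # reuse the training statistics

# Wrap the arrays back into DataFrames to keep column names and row index
X_train_scaled_df = pd.DataFrame(
    train_arr, columns=X_train_demo.columns, index=X_train_demo.index
)
X_test_scaled_df = pd.DataFrame(
    test_arr, columns=X_test_demo.columns, index=X_test_demo.index
)

print(X_train_scaled_df.round(3))
```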