## Preprocessing and Modeling
For the model to understand the data:
1. All features must be numeric before training.
2. The data should be neutral and algorithm-ready.
3. String columns must be encoded before training.

This stage usually has fewer steps than EDA, since you are mostly encoding the non-numeric data, training different models, measuring and comparing their performance, and saving the ones you want.

You can then load the saved models somewhere else and use them.

### Preprocessing

In [5]:
import pandas as pd

data = pd.read_csv("Telco-Customer-Churn.csv")
data.head(3)

| Unnamed: 0 | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |


#### Overview of the cell below:
1. Encoding the categorical columns.
2. Splitting the data into training and testing parts.

In [6]:
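from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

clean_df = data.copy()

# Replace every string column with integer codes (one code per distinct value).
# Note: this also encodes customerID, which is an identifier rather than a feature.
for col in clean_df.select_dtypes(include='object').columns:
    clean_df[col] = LabelEncoder().fit_transform(clean_df[col])

# Separate the features from the target column
X = clean_df.drop('Churn', axis=1)
y = clean_df['Churn']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)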
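
The cell above label-encodes every string column, which gives nominal categories such as `Contract` an arbitrary numeric order and also turns `customerID` into a feature. One possible alternative, shown here only as a sketch, is to drop the identifier and one-hot encode the categorical columns with an encoder fitted on the training split alone. The toy DataFrame and the names `toy`, `X_tr`, `X_te` and `encoder` are made up for illustration; they are not the real Telco file or part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Telco data (hypothetical values, not the real file)
toy = pd.DataFrame({
    "customerID": [f"id-{i}" for i in range(10)],
    "Contract": ["Month-to-month", "One year", "Two year",
                 "Month-to-month", "One year"] * 2,
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Drop the identifier and the target; map the target to 0/1
X_toy = toy.drop(columns=["customerID", "Churn"])
y_toy = (toy["Churn"] == "Yes").astype(int)

# Split first, so the encoder never sees the test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# One-hot encode the listed categorical columns, pass numeric ones through
categorical = ["Contract"]
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
X_tr_enc = encoder.fit_transform(X_tr)  # learn the categories on the training split
X_te_enc = encoder.transform(X_te)      # reuse the same categories on the test split

print(X_tr_enc.shape, X_te_enc.shape)
```

`handle_unknown="ignore"` makes the encoder emit all zeros for a category that appears only in the test split instead of raising an error. Whether one-hot encoding actually improves the scores on this dataset would have to be checked by rerunning the models.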

### Train the Models

> I thought a logistic regression model and a random forest classifier would be good enough for this job, so I trained both and compared their performance.

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score

# Fit the scaler on the training split only, then apply the same scaling to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

LR_model = LogisticRegression(max_iter=2000, class_weight='balanced')
LR_model.fit(X_train_scaled, y_train)

y_pred = LR_model.predict(X_test_scaled)
y_pred_proba = LR_model.predict_proba(X_test_scaled)[:,1]

print(classification_report(y_test, y_pred))
print("ROC AUC: ", roc_auc_score(y_test, y_pred_proba))
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy : {accuracy:.2%}")


              precision    recall  f1-score   support

           0       0.92      0.72      0.81      1036
           1       0.52      0.83      0.64       373

    accuracy                           0.75      1409
   macro avg       0.72      0.78      0.72      1409
weighted avg       0.81      0.75      0.77      1409

ROC AUC:  0.860827890318507
Accuracy : 75.16%
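
The logistic regression above uses `class_weight='balanced'`, which weights each class by `n_samples / (n_classes * class_count)` so that the rarer churn class counts more during fitting. The sketch below shows that formula using the class counts from the test-set report (1036 and 373) purely as stand-in numbers; the model itself derives its weights from `y_train`, whose counts are not shown in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels: counts taken from the "support" column of the report above
y_demo = np.array([0] * 1036 + [1] * 373)

# Same rule that class_weight='balanced' applies internally
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

for label, weight in zip([0, 1], weights):
    print(f"class {label}: weight {weight:.3f}")
```

The minority class receives the larger weight, which is consistent with the high recall on class 1 in the report.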


In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score

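# Simple Random Forest model with default settings.
# No random_state is set, so the scores can vary slightly between runs.
# Tree-based models do not need feature scaling, so the unscaled data is used.
RF_model = RandomForestClassifier()
RF_model.fit(X_train, y_train)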

# Predictions
y_pred = RF_model.predict(X_test)
y_pred_proba = RF_model.predict_proba(X_test)[:, 1]

# Evaluation
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_pred_proba))
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy : {accuracy:.2%}")


              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1036
           1       0.64      0.49      0.55       373

    accuracy                           0.79      1409
   macro avg       0.74      0.69      0.71      1409
weighted avg       0.78      0.79      0.78      1409

ROC AUC: 0.839809744635482
Accuracy : 79.21%
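
In the two reports, the random forest has the higher accuracy (79.21% vs 75.16%), while the logistic regression has the higher ROC AUC (0.861 vs 0.840) and much higher recall on the churn class (0.83 vs 0.49). Part of that gap may come from `class_weight='balanced'` being set only on the logistic regression. To compare models on the same metrics without repeating the evaluation code, a small helper can be used. The sketch below runs on synthetic data, and the names `summarize`, `models` and `results` are hypothetical, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def summarize(model, X_eval, y_eval):
    """Return accuracy, positive-class recall and ROC AUC for a fitted binary classifier."""
    pred = model.predict(X_eval)
    proba = model.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, pred),
        "recall_pos": recall_score(y_eval, pred),
        "roc_auc": roc_auc_score(y_eval, proba),
    }


# Synthetic, imbalanced stand-in data (not the Telco dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.73, 0.27], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

models = {
    "logistic": LogisticRegression(max_iter=2000, class_weight="balanced"),
    "forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the same training split and score it on the same test split
results = {name: summarize(m.fit(X_a, y_a), X_b, y_b) for name, m in models.items()}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Which model is "better" depends on the goal: if missing a churner is costly, recall on class 1 matters more than overall accuracy.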


### Save the Models

In [9]:
import joblib

joblib.dump(RF_model, "RF_Churn_model.pkl")
joblib.dump(LR_model, "LR_Churn_model.pkl")

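# To load a model later, use: model = joblib.load("RF_Churn_model.pkl")
# The logistic regression model expects inputs scaled with the same fitted `scaler`, which is not saved here.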

['LR_Churn_model.pkl']
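
The logistic regression model was trained on scaled features, so it needs the same fitted `StandardScaler` wherever it is loaded. One way to keep the two together, sketched here on synthetic data, is to wrap them in a scikit-learn `Pipeline` and save that single object. The file name `LR_Churn_pipeline.pkl` and the variables `pipe` and `restored` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Telco dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaler and model are fitted together and travel as one object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)
pipe.fit(X_syn, y_syn)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "LR_Churn_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline scales raw inputs itself before predicting
same = np.array_equal(pipe.predict(X_syn), restored.predict(X_syn))
print("Predictions match after reload:", same)
```

Files written by `joblib.dump` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.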