**Importing the libraries**


In [None]:
import pandas as pd
import numpy as np

**Importing the dataset**

In [None]:
dataset = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

**Drop the `customerID` column. `TotalCharges` is stored as strings, so convert it to numeric. `errors='coerce'` tells pandas that if it cannot convert a value (for example an empty string, a space, or invalid text), it should replace that value with NaN (missing value) instead of raising an error.**

In [None]:
dataset = dataset.drop('customerID', axis=1)
dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges'], errors='coerce')
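
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up series. The values in `raw` are illustrative only and are not taken from the Telco file.

```python
import pandas as pd

# Toy values imitating a TotalCharges column that was read as strings (made-up data).
raw = pd.Series(["29.85", " ", "1889.5", ""])

# Unparseable entries (the blank and the empty string) become NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("Missing after conversion:", converted.isna().sum())
```

Counting the NaN values right after the conversion is a quick way to see how many rows will need imputation later.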

**This cell counts how many rows fall into each `Churn` class. The class balance is important for model evaluation.**

In [106]:
print(dataset['Churn'].value_counts())


Churn
0    5174
1    1869
Name: count, dtype: int64
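
One way to see why these counts matter for evaluation is to work out the class shares and the accuracy a model would reach by always predicting the majority class. The sketch below uses only the two counts printed above; the name `majority_baseline` is introduced here for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 5174, 1: 1869}
total = sum(counts.values())

# Share of each class in the dataset.
for label, n in counts.items():
    print(f"class {label}: {n} rows ({n / total:.1%})")

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should be judged against this baseline, and metrics such as precision and recall on the minority class are usually more informative than accuracy alone when the classes are this uneven.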


**Replace the missing values in `TotalCharges` with the column mean.**

In [None]:
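from sklearn.impute import SimpleImputer
# Fill NaN entries with the mean of the non-missing values
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it back into one column
dataset['TotalCharges'] = imputer.fit_transform(dataset[['TotalCharges']]).ravel()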
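
A small sketch of mean imputation on a toy frame, to show what the imputer actually writes into the gaps. The frame `toy` and the name `imputer_demo` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column with one missing value.
toy = pd.DataFrame({"TotalCharges": [10.0, np.nan, 30.0]})

imputer_demo = SimpleImputer(strategy="mean")
# The mean of the observed values (10 and 30) replaces the NaN.
toy["TotalCharges"] = imputer_demo.fit_transform(toy[["TotalCharges"]]).ravel()

print(toy)
print("learned mean:", imputer_demo.statistics_[0])
```

The learned value is stored in `statistics_`, so the same fitted imputer can later be applied to new data with `transform`.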

**Apply label encoding to the binary columns, so that each column's two categories become 0 and 1.**

In [None]:
from sklearn.preprocessing import LabelEncoder
target_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
le = LabelEncoder()
for col in target_cols:
    # Refit the encoder on each column; its two categories become 0 and 1
    dataset[col] = le.fit_transform(dataset[col])
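
To check which category ends up as 0 and which as 1, the fitted encoder's `classes_` attribute can be inspected. This is a sketch on made-up "Yes"/"No" values; `le_demo` and `mapping` are names introduced for the example.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["No", "Yes", "No", "Yes"])

# classes_ is sorted, and a label's position in it is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(le_demo.classes_)}

print(encoded)
print(mapping)
```

Because the loop above reuses a single encoder object, after the loop it only remembers the mapping of the last column it was fitted on. Keeping one encoder per column would be needed if the codes must be converted back to labels later.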

**Apply one-hot encoding to the categorical columns that are not binary. Instead of writing the result back into `dataset`, the encoding is applied when building X. This is the preferred approach because it keeps the original DataFrame intact and preserves its pandas structure.**

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

onehot_cols = ['InternetService', 'Contract', 'PaymentMethod', 
               'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 
               'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

ct = ColumnTransformer(transformers=[('encode', OneHotEncoder(), onehot_cols)], remainder='passthrough')



In [None]:
X = ct.fit_transform(dataset.drop('Churn', axis=1))
y = dataset["Churn"].values

In [None]:
print(X)
print(y)

**Split the data into training and test sets. Without `random_state`, the split is different on every run, so results are not reproducible. The value 42 has no special meaning; any fixed integer works.**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
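
A quick sketch of what a fixed seed buys: two calls with the same `random_state` return identical splits. The arrays are made up, and the `stratify=` argument is an optional extra that the original cell does not use; it keeps the class proportions the same in both parts, which can help with imbalanced targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 15 of class 0 and 5 of class 1.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# Same seed twice gives exactly the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)

# Optional (not in the original cell): preserve class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("class 1 rows in test set:", int(y_te.sum()), "of", len(y_te))
```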

**Apply feature scaling so that features with large values do not dominate features with small values. Scaling only applies to numeric columns. If the data is fully numeric, everything can be scaled without specifying columns. If the data is mixed (text and numbers), the numeric columns have to be selected first. After encoding, every column is numeric, so normally there is no need to list the columns.**

In [None]:
from sklearn.preprocessing import StandardScaler
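sc = StandardScaler()
# Learn the mean and standard deviation from the training set only
X_train = sc.fit_transform(X_train)
# Reuse the training statistics on the test set
X_test = sc.transform(X_test)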
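
The cell above scales every column, including the one-hot indicators. If only the continuous columns should be scaled, one possible approach is to wrap the scaler in a `ColumnTransformer`. This is a sketch on made-up arrays, not the notebook's method; the column positions in `numeric_idx` are assumptions for this toy layout.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up layout: [one-hot flag, tenure, MonthlyCharges]
X_train_demo = np.array([
    [1.0, 1.0, 20.0],
    [0.0, 10.0, 40.0],
    [1.0, 19.0, 60.0],
])
X_test_demo = np.array([[0.0, 10.0, 40.0]])

numeric_idx = [1, 2]  # assumed positions of the continuous columns

scaler_demo = ColumnTransformer(
    transformers=[("scale", StandardScaler(), numeric_idx)],
    remainder="passthrough",  # leave the one-hot flag untouched
)

# Fit on the training rows only, then reuse the statistics on the test rows.
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# Note: the output order is scaled columns first, then passthrough columns.
print(X_train_scaled)
print(X_test_scaled)
```

Whether scaling the indicator columns matters depends on the model used afterwards, so this variant is an option to compare rather than a required change.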