# Experiments with Telco Churn data

This notebook demonstrates how to build an ML model for telco churn data. We are going to build a basic model using scikit-learn. The purpose is to experiment with feature engineering and different models. Once we are happy with the result, we are going to package our code and run it on the AI Platform Training service.

The following cells read the default project ID from your gcloud configuration. If you want to use a different project, replace the PROJECT variable with your project ID, which can be found in your GCP console.

In [None]:
PROJECT=!gcloud config get-value project # returns default project id 

In [None]:
PROJECT=PROJECT[0]

## Loading data from BigQuery to Pandas Dataframe

In the following cell we pull data from BigQuery and load it into a dataframe. Keep in mind that the data might not fit in your instance's memory, in which case we would need to bring in only a sample. That is not a big problem while we are only experimenting. When we run our training job on AI Platform Training, we will need to pick an instance with enough memory.

In this case our telco dataset fits in memory, so we will go ahead and load everything.
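
If the table were too large to load in full, one option is to sample rows inside the query itself. The snippet below is a minimal sketch, not part of the original notebook: `build_sample_query` is a hypothetical helper, and it assumes that keeping a random fraction of rows with BigQuery's `RAND()` function is acceptable for experimentation.

```python
def build_sample_query(table: str, fraction: float) -> str:
    """Return a query that keeps roughly `fraction` of the rows of `table`.

    Hypothetical helper for illustration; RAND() returns a value in [0, 1),
    so each row is kept with probability `fraction`.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return f"SELECT * FROM {table} WHERE RAND() < {fraction}"


# Roughly 10% of the rows of the churn table
sample_query = build_sample_query("telco.churn", 0.1)
print(sample_query)
```

The resulting string could be passed to `bqclient.query(...)` in place of `query_string`. The sample is different on every run because `RAND()` is not seeded.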

In [None]:
import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage

# Make clients.
bqclient = bigquery.Client(project=PROJECT)
bqstorageclient = bigquery_storage.BigQueryReadClient()
query_string = """
SELECT * from telco.churn
"""

df = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

Let's have a look at what the data loaded into the dataframe looks like.

In [None]:
df.head()

## Data Cleaning 

It seems that there are some invalid values in the TotalCharges column, where TotalCharges is missing. Look at the first record below; the data is ordered by TotalCharges.

Why is that?

In [None]:
df.sort_values("TotalCharges", ascending=True)

Hmm... I suspect the reason is that new customers do not have TotalCharges yet, as this is their first month. Let's check the customers whose tenure is 0.

In [None]:
df.loc[df['tenure']==0, ['tenure', 'MonthlyCharges', 'TotalCharges']]

Okay, let's fix this by setting TotalCharges to the value of MonthlyCharges for these rows. For new customers, at the end of the first month the total charges should be the same as that month's charges.

In [None]:
df.loc[df.tenure == 0, 'TotalCharges'] = df.loc[df.tenure == 0, 'MonthlyCharges']
df.loc[df.tenure==0, ['tenure', 'MonthlyCharges', 'TotalCharges']]
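
The same fix can be tried on a tiny hand-made dataframe to see exactly what it does. This is a sketch under assumptions: the values in `toy` are made up for illustration, and missing charges are represented as NaN, which may differ from how they appear in your table.

```python
import numpy as np
import pandas as pd

# Made-up rows: one brand-new customer (tenure 0) with no TotalCharges yet
toy = pd.DataFrame({
    "tenure": [0, 1, 24],
    "MonthlyCharges": [52.55, 29.85, 70.0],
    "TotalCharges": [np.nan, 29.85, 1680.0],
})

# Check the hypothesis first: every missing TotalCharges belongs to a new customer
missing = toy["TotalCharges"].isna()
print(toy.loc[missing, "tenure"].eq(0).all())

# Apply the fix: first-month total equals the monthly charge
new_customers = toy["tenure"] == 0
toy.loc[new_customers, "TotalCharges"] = toy.loc[new_customers, "MonthlyCharges"]
print(toy)
```

Running the hypothesis check on the real `df` before overwriting values is a cheap way to confirm that the missing values really are limited to new customers.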

## Feature Engineering
We have columns in multiple formats. Some are numerical, some are binary categorical (two possible values, such as 0 or 1) and some are categorical with multiple options.
There is also customerID, which we do not really need. It is unique to each customer and should not be part of the prediction equation.
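
A quick way to decide which group a column belongs to is to look at its dtype and its number of distinct values. The snippet below is a sketch on a made-up dataframe (`toy_cols` and its values are assumptions for illustration); the same two calls can be run on `df`.

```python
import pandas as pd

# Made-up sample with one identifier, one binary, one numeric and one multi-valued column
toy_cols = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "tenure": [1, 34, 2, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

# One row per column: its dtype and how many distinct values it holds
summary = pd.DataFrame({
    "dtype": toy_cols.dtypes.astype(str),
    "n_unique": toy_cols.nunique(),
})
print(summary)
```

A column with as many distinct values as there are rows, like customerID here, is an identifier rather than a feature. A text column with only a handful of distinct values is a candidate for one-hot encoding.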

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

BINARY_FEATURES = ['gender',
            'SeniorCitizen',
            'Partner',
            'Dependents',
            'PhoneService',
            'MultipleLines',
            'PaperlessBilling']

NUMERIC_FEATURES = [
            'tenure',
            'MonthlyCharges',
            'TotalCharges']

CATEGORICAL_FEATURES = [
            'InternetService',
            'OnlineSecurity',
            'DeviceProtection',
            'TechSupport',
            'StreamingTV',
            'StreamingMovies',
            'Contract',
            'PaymentMethod']

# all rows but only selected features/columns
X = df.loc[:, BINARY_FEATURES+NUMERIC_FEATURES+CATEGORICAL_FEATURES]

# We create a series with the prediction label
y = df.Churn


Now we are going to transform our features: one-hot encoding for the binary and categorical columns, and scaling to zero mean and unit variance for the numeric columns.
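
To see what these transformations produce, here is a minimal sketch on a made-up three-column dataframe. The names `toy_X` and `toy_preprocessor` are assumptions for illustration, and `sparse_threshold=0` is used only so that the result is a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample: one binary, one numeric and one multi-valued categorical column
toy_X = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "tenure": [1, 34, 2, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("bin", OneHotEncoder(), ["Partner"]),
        ("num", StandardScaler(), ["tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_Xt = toy_preprocessor.fit_transform(toy_X)

# 2 columns for Partner + 1 scaled tenure column + 3 columns for Contract
print(toy_Xt.shape)
print(np.round(toy_Xt, 2))
```

Each category becomes its own 0/1 column, and the scaled tenure column ends up with zero mean and unit variance.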

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Define a preprocessing step for our pipeline.
# It specifies how the features are going to be transformed
preprocessor = ColumnTransformer(
    transformers=[
        ('bin', OneHotEncoder(), BINARY_FEATURES),
        ('num', StandardScaler(), NUMERIC_FEATURES),
        ('cat', OneHotEncoder(handle_unknown='ignore'), CATEGORICAL_FEATURES)])


# We now create a full pipeline, for preprocessing and training.
# for training we selected a linear SVM classifier
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', SVC(kernel='linear'))])

We are going to split our data into 80% training and 20% test sets.
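
Churn labels are usually imbalanced, so it can help to keep the class proportions the same in both sets. The snippet below is a sketch on made-up data: the `stratify` argument is an optional addition that the notebook's own split does not use, and `toy_features` and `toy_labels` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 customers, 5 of whom churned
toy_features = pd.DataFrame({"tenure": range(20)})
toy_labels = pd.Series(["No"] * 15 + ["Yes"] * 5, name="Churn")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features,
    toy_labels,
    test_size=0.2,        # 20% of the rows go to the test set
    random_state=1,       # fixed seed so the split is reproducible
    stratify=toy_labels,  # keep the churn ratio the same in both sets
)

print(len(X_tr), len(X_te))
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With stratification, a quarter of the customers in each set are churners, matching the full dataset.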

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

## Training ML model
In the next step we are going to train our model and predict on the test data. We will then use the predictions to evaluate our model performance.

In [None]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
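
Because the classifier is a named step in the pipeline, trying a different model does not require rebuilding the preprocessing. The snippet below is a minimal sketch on synthetic data. The data from `make_classification`, the `demo_clf` pipeline and the choice of `LogisticRegression` as the alternative are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

demo_clf = Pipeline(steps=[('preprocessor', StandardScaler()),
                           ('classifier', SVC(kernel='linear'))])
demo_clf.fit(X_demo, y_demo)
svm_score = demo_clf.score(X_demo, y_demo)

# Swap only the 'classifier' step; the preprocessing step stays as it is
demo_clf.set_params(classifier=LogisticRegression(max_iter=1000))
demo_clf.fit(X_demo, y_demo)
logreg_score = demo_clf.score(X_demo, y_demo)

print(f"SVM accuracy: {svm_score:.3f}")
print(f"Logistic regression accuracy: {logreg_score:.3f}")
```

The same `set_params` call works on `clf`. Note that the scores here are computed on the training data only to keep the sketch short; a real comparison should use the held-out test set.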

## Evaluating model
What do you think of this model? Is it accurate enough? Shall we move this into production?

In [None]:
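from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

print(classification_report(y_test, y_pred))

print("\n Confusion Matrix")
# ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix,
# which was removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()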
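
When judging whether the model is good enough, it helps to know how the numbers in the report relate to the confusion matrix. The snippet below is a sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical) that treats "Yes" as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions for six customers
y_true_demo = ["No", "No", "No", "Yes", "Yes", "No"]
y_pred_demo = ["No", "Yes", "No", "Yes", "No", "No"]

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

# Precision: of the customers predicted to churn, how many actually did
precision = precision_score(y_true_demo, y_pred_demo, pos_label="Yes")
# Recall: of the customers who churned, how many the model caught
recall = recall_score(y_true_demo, y_pred_demo, pos_label="Yes")

print(cm)
print(f"precision={precision:.2f} (tp / (tp + fp) = {tp / (tp + fp):.2f})")
print(f"recall={recall:.2f} (tp / (tp + fn) = {tp / (tp + fn):.2f})")
```

For churn, recall on the "Yes" class is often the number to watch, because a missed churner is a customer nobody tried to retain.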