# Install Comet

In [None]:
!pip install comet_ml --quiet

# Initialize Comet
Set your API key to enable logging to Comet from this notebook.

In [None]:
import comet_ml

comet_ml.init(project_name="comet-churn-prediction")

COMET INFO: Comet API key is valid


# Download The Data using Artifacts

[Comet Artifacts](https://www.comet.ml/site/artifacts/) help you conveniently track your datasets throughout the experimentation process. Here we're going to fetch the churn prediction dataset using just two lines of code!

You can take a look at the dataset [here](https://www.comet.ml/team-comet-ml/artifacts/telco-churn-dataset/1.0.0).

In [None]:
experiment = comet_ml.Experiment()
artifact = experiment.get_artifact("team-comet-ml/telco-churn-dataset:latest")

COMET INFO: Couldn't find a Git repository in '/content' and lookings in parents. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/team-comet-ml/comet-churn-prediction/23731d13e7ee4a858490ad124a38ef82



In [None]:
artifact.download("./")

COMET INFO: Artifact 'team-comet-ml/telco-churn-dataset:1.0.0' download has been started asynchronously
COMET INFO: Still downloading 1 file(s), remaining 1.69 MB/1.69 MB
COMET INFO: Artifact 'team-comet-ml/telco-churn-dataset:1.0.0' has been successfully downloaded


Artifact(name='telco-churn-dataset', artifact_type='dataset', version=None, aliases=set(), version_tags=set())

# Basic EDA and Dataset Profiling with Sweetviz

In [None]:
!pip install sweetviz --quiet

In [None]:
import pandas as pd
import sweetviz

df = pd.read_csv("./telco-churn-dataset.csv", index_col=0)

In [None]:
report = sweetviz.analyze(df, target_feat="Churn Label")
report.log_comet(experiment)


# Training a Model

We're going to build a very basic baseline model for this problem. Let's start by taking a look at our data, beginning with the column names.

In [None]:
df.columns

Index(['CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code',
       'Lat Long', 'Latitude', 'Longitude', 'Gender', 'Senior Citizen',
       'Partner', 'Dependents', 'Tenure Months', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Online Security',
       'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
       'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
       'Monthly Charges', 'Total Charges', 'Churn Label', 'Churn Value',
       'Churn Score', 'CLTV', 'Churn Reason'],
      dtype='object')

Next, let's look at the data type of each column to confirm that it makes sense.

In [None]:
df.dtypes

CustomerID            object
Count                  int64
Country               object
State                 object
City                  object
Zip Code               int64
Lat Long              object
Latitude             float64
Longitude            float64
Gender                object
Senior Citizen        object
Partner               object
Dependents            object
Tenure Months          int64
Phone Service         object
Multiple Lines        object
Internet Service      object
Online Security       object
Online Backup         object
Device Protection     object
Tech Support          object
Streaming TV          object
Streaming Movies      object
Contract              object
Paperless Billing     object
Payment Method        object
Monthly Charges      float64
Total Charges         object
Churn Label           object
Churn Value            int64
Churn Score            int64
CLTV                   int64
Churn Reason          object
dtype: object

From this quick exploration, we can identify two preprocessing steps that are needed to make this data usable by the model.

First, we're going to convert the "Total Charges" column to a `float` data type. This column is typed as `object`, which implies that its values were recorded as strings. Some entries in this column are empty strings, so we'll drop those rows once we've converted the column to a numeric type.

Next, we're going to drop the columns that might leak information about the target to our model (these are all the columns that start with "Churn"). We'll also drop high-cardinality features such as "CustomerID" and "Zip Code", features that add no information such as the "State" and "Country" columns, and redundant columns like "Lat Long".

  

In [None]:
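def convert_to_float(x):
    """Return x as a float, or None if it cannot be parsed (e.g. a blank string)."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return None


# Unparseable entries become missing values, which are then dropped.
df["Total Charges"] = df["Total Charges"].apply(convert_to_float)
df.dropna(subset=["Total Charges"], inplace=True)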
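
As an alternative to a hand-written converter, pandas can coerce unparseable strings to missing values directly. The sketch below uses a small made-up frame (the values are illustrative, not taken from the Telco dataset) to show the same convert-then-drop pattern with `pd.to_numeric(errors="coerce")`.

```python
import pandas as pd

# Toy data (hypothetical values): one entry is a blank string.
toy = pd.DataFrame({"Total Charges": ["108.15", " ", "820.5"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy["Total Charges"] = pd.to_numeric(toy["Total Charges"], errors="coerce")

# Drop the rows where the conversion failed.
toy_clean = toy.dropna(subset=["Total Charges"])

print(toy_clean["Total Charges"].tolist())
```

This avoids a Python-level function call per row, which may matter on larger datasets.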

In [None]:
y = df.pop("Churn Value")
X = df.drop(
    [
        "CustomerID",
        "Churn Label",
        "Churn Reason",
        "Churn Score",
        "Lat Long",
        "State",
        "Country",
        "Zip Code",
    ],
    axis=1,
)
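
One way to spot ID-like and constant columns before dropping them by hand is to count the distinct values in each column. The sketch below runs on a tiny made-up frame (column names borrowed from the dataset, values invented), and the 0.9 cutoff is an arbitrary assumption for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with an ID column, a real categorical, and a constant.
toy = pd.DataFrame(
    {
        "CustomerID": ["a1", "b2", "c3", "d4"],
        "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
        "Country": ["United States"] * 4,
    }
)

# Ratio of distinct values to rows: close to 1.0 suggests an ID-like column.
unique_ratio = toy.nunique() / len(toy)
high_cardinality = unique_ratio[unique_ratio > 0.9].index.tolist()

# A single distinct value means the column carries no information.
constant = toy.columns[toy.nunique() == 1].tolist()

print(high_cardinality)
print(constant)
```

Running a check like this on the real data would also show how many distinct values "City" has, which is worth knowing since that column stays in the feature set.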

## Feature Engineering

Our dataset contains a bunch of categorical variables, so let's one-hot encode them.

In [None]:
X_features = pd.get_dummies(X)

In [None]:
X_features.head()

Unnamed: 0,Count,Latitude,Longitude,Tenure Months,Monthly Charges,Total Charges,CLTV,City_Acampo,City_Acton,City_Adelanto,City_Adin,City_Agoura Hills,City_Aguanga,City_Ahwahnee,City_Alameda,City_Alamo,City_Albany,City_Albion,City_Alderpoint,City_Alhambra,City_Aliso Viejo,City_Alleghany,City_Alpaugh,City_Alpine,City_Alta,City_Altadena,City_Alturas,City_Alviso,City_Amador City,City_Amboy,City_Anaheim,City_Anderson,City_Angels Camp,City_Angelus Oaks,City_Angwin,City_Annapolis,City_Antelope,City_Antioch,City_Anza,City_Apple Valley,...,Senior Citizen_Yes,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,Phone Service_No,Phone Service_Yes,Multiple Lines_No,Multiple Lines_No phone service,Multiple Lines_Yes,Internet Service_DSL,Internet Service_Fiber optic,Internet Service_No,Online Security_No,Online Security_No internet service,Online Security_Yes,Online Backup_No,Online Backup_No internet service,Online Backup_Yes,Device Protection_No,Device Protection_No internet service,Device Protection_Yes,Tech Support_No,Tech Support_No internet service,Tech Support_Yes,Streaming TV_No,Streaming TV_No internet service,Streaming TV_Yes,Streaming Movies_No,Streaming Movies_No internet service,Streaming Movies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,Paperless Billing_No,Paperless Billing_Yes,Payment Method_Bank transfer (automatic),Payment Method_Credit card (automatic),Payment Method_Electronic check,Payment Method_Mailed check
0,1,33.964131,-118.272783,2,53.85,108.15,3239,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1
1,1,34.059281,-118.30742,2,70.7,151.65,2701,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0
2,1,34.048013,-118.293953,8,99.65,820.5,5372,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0
3,1,34.062125,-118.315709,28,104.8,3046.05,5003,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0
4,1,34.039224,-118.266293,49,103.7,5036.3,5340,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0,0
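
To make the column naming in the output above easier to follow, here is a minimal sketch of `pd.get_dummies` on a made-up two-column frame. The values are illustrative, and `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Contract": ["Month-to-month", "One year", "Two year"],
        "Monthly Charges": [53.85, 70.70, 99.65],
    }
)

# Only string (object) columns are expanded into "<column>_<value>" indicators;
# numeric columns pass through unchanged and come first in the result.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
```

Each categorical column contributes one new column per distinct value, which is why a column with many distinct values, such as "City", makes the encoded frame very wide.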


## Splitting the Data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_features, y, test_size=0.2, random_state=42
)
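
The split above is purely random, so the share of churned customers can differ slightly between the two sets (in the outputs further down, positives account for 1474 of 5625 training rows and 395 of 1407 test rows). If matching proportions matter, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 25% positives, loosely mimicking an imbalanced target.
y_toy = np.array([1] * 25 + [0] * 75)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the 25% positive rate.
print(y_tr.mean(), y_te.mean())
```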

## Fit a Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
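
The output above shows `random_state=None`, so each run of the notebook grows a different forest and the logged metrics can vary from run to run. A minimal sketch, on synthetic stand-in data rather than the Telco dataset, of how fixing the seed makes the fit repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the Telco dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

clf_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)
clf_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

# Same seed and same data give the same forest, hence identical probabilities.
same = bool((clf_a.predict_proba(X_toy) == clf_b.predict_proba(X_toy)).all())
print(same)
```

The seed value 42 and `n_estimators=20` are arbitrary choices for the example.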

# Evaluating the Model

## Logging The Classification Report

In this block, we're going to make use of Experiment Context to automatically add the appropriate prefixes to our metrics.

In [None]:
from sklearn.metrics import classification_report


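def log_classification_report(y_true, y_pred):
    """Log every entry of sklearn's classification report to the experiment."""
    report = classification_report(y_true, y_pred, output_dict=True)
    for key, value in report.items():
        if key == "accuracy":
            # Accuracy is a single number rather than a dict of metrics.
            experiment.log_metric(key, value)
        else:
            # Per-class and averaged entries are dicts; prefix them with their key.
            experiment.log_metrics(value, prefix=key)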


with experiment.train():
    log_classification_report(y_train, clf.predict(X_train))

with experiment.test():
    log_classification_report(y_test, clf.predict(X_test))

{'0': {'precision': 0.9997591522157996, 'recall': 1.0, 'f1-score': 0.9998795616042394, 'support': 4151}, '1': {'precision': 1.0, 'recall': 0.9993215739484396, 'f1-score': 0.999660671869698, 'support': 1474}, 'accuracy': 0.9998222222222222, 'macro avg': {'precision': 0.9998795761078998, 'recall': 0.9996607869742198, 'f1-score': 0.9997701167369687, 'support': 5625}, 'weighted avg': {'precision': 0.999822265039606, 'recall': 0.9998222222222222, 'f1-score': 0.9998222027653569, 'support': 5625}}
{'0': {'precision': 0.8246869409660107, 'recall': 0.9110671936758893, 'f1-score': 0.8657276995305164, 'support': 1012}, '1': {'precision': 0.6885813148788927, 'recall': 0.5037974683544304, 'f1-score': 0.5818713450292397, 'support': 395}, 'accuracy': 0.7967306325515281, 'macro avg': {'precision': 0.7566341279224518, 'recall': 0.7074323310151598, 'f1-score': 0.7237995222798781, 'support': 1407}, 'weighted avg': {'precision': 0.786476761645178, 'recall': 0.7967306325515281, 'f1-score': 0.78603810462788
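
To see how the nested report turns into flat metric names such as `test_0_precision`, the sketch below rebuilds the naming scheme in plain Python, without Comet. The `flatten_report` helper is hypothetical and only approximates the names shown in the experiment summary; the names Comet actually produces may differ in detail, for example in how keys containing spaces are handled.

```python
from sklearn.metrics import classification_report

# Toy labels and predictions (hypothetical).
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
report = classification_report(y_true, y_pred, output_dict=True)


def flatten_report(report, context):
    """Hypothetical helper: build '<context>_<key>_<metric>' style names."""
    flat = {}
    for key, value in report.items():
        if key == "accuracy":
            # Accuracy is a bare float, so it only gets the context prefix.
            flat[f"{context}_{key}"] = value
        else:
            for metric, score in value.items():
                flat[f"{context}_{key}_{metric}"] = score
    return flat


flat = flatten_report(report, "test")
print(sorted(flat))
```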

## Logging Precision-Recall Curves

In [None]:
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(
    y_test, clf.predict_proba(X_test)[:, 1]
)
pr_df = pd.DataFrame([precision, recall, thresholds]).T
pr_df.columns = ["precision", "recall", "thresholds"]
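
One detail worth knowing here: `precision_recall_curve` returns one fewer threshold than precision and recall values, because the final point (precision 1, recall 0) has no corresponding threshold. Building the frame from the three arrays as above therefore appears to leave a missing value in the last row of the thresholds column. The sketch below, on made-up scores, shows the length mismatch and one possible way to keep every row fully populated by dropping that final point.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve

# Toy labels and scores (hypothetical).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# thresholds is one element shorter than precision and recall.
print(len(precision), len(recall), len(thresholds))

# Dropping the final (precision=1, recall=0) point aligns the three arrays.
pr_toy = pd.DataFrame(
    {"precision": precision[:-1], "recall": recall[:-1], "thresholds": thresholds}
)
print(pr_toy)
```

Whether to drop the last point or keep it with an empty threshold is a presentation choice; both describe the same curve.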

In [None]:
experiment.log_table(
    filename="precision-recall-data.csv", tabular_data=pr_df, headers=True
)

{'api': 'https://www.comet.ml/api/rest/v2/experiment/asset/get-asset?assetId=55b49bb4a07a447499f8e73a0f9b4f68&experimentKey=f598691808cd450788e66c8e19d353a7',
 'assetId': '55b49bb4a07a447499f8e73a0f9b4f68',
 'web': 'https://www.comet.ml/api/asset/download?assetId=55b49bb4a07a447499f8e73a0f9b4f68&experimentKey=f598691808cd450788e66c8e19d353a7'}

## Logging Confusion Matrix

In [None]:
def index_to_example(index):
    return X_test.iloc[index, :][["CLTV", "Monthly Charges", "Total Charges"]].to_json()


experiment.log_confusion_matrix(
    y_test.tolist(),
    clf.predict(X_test).tolist(),
    index_to_example_function=index_to_example,
)

{'api': 'https://www.comet.ml/api/rest/v2/experiment/asset/get-asset?assetId=1aa491a9f9414927915e5f47147fadb2&experimentKey=f598691808cd450788e66c8e19d353a7',
 'assetId': '1aa491a9f9414927915e5f47147fadb2',
 'web': 'https://www.comet.ml/api/asset/download?assetId=1aa491a9f9414927915e5f47147fadb2&experimentKey=f598691808cd450788e66c8e19d353a7'}
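
For readers who want to check the logged matrix locally, the same counts can be computed with scikit-learn. This is a minimal sketch on made-up labels, not the notebook's results; in scikit-learn's convention, rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (hypothetical).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
print(cm.tolist())
print(tn, fp, fn, tp)
```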

# Ending an Experiment

In [None]:
experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/team-comet-ml/comet-churn-prediction/f598691808cd450788e66c8e19d353a7
COMET INFO:   Downloads:
COMET INFO:     artifact assets : 1 (1.69 MB)
COMET INFO:     artifacts       : 1
COMET INFO:   Metrics:
COMET INFO:     test_0_f1-score              : 0.8686679174484053
COMET INFO:     test_0_precision             : 0.8267857142857142
COMET INFO:     test_0_recall                : 0.9150197628458498
COMET INFO:     test_0_support               : 1012
COMET INFO:     test_1_f1-score              : 0.5894428152492669
COMET INFO:     test_1_precision             : 0.7003484320557491
COMET INFO:     test_1_recall                : 0.5088607594936709
COMET INFO:     test_1_support               : 395
COMET INFO:     test_accuracy                : 0.800995024