In [1]:
#%autosave 0  # Uncomment to disable autosave; otherwise Jupyter Notebook autosaves every 2 minutes or so.

start of video: [ML Zoomcamp 4.1 - Evaluation Metrics: Session Overview](https://www.youtube.com/watch?v=gmg5jw1bM8A&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=40)

# 4. Evaluation Metrics for Classification

In the previous session we trained a model for predicting churn. We built a logistic regression model that scores existing customers, assigning each one a probability of leaving the company. The model's accuracy was 80%. In this module we'll find out what that number actually means: is it a good score or not, and are there other ways of evaluating binary classification models? We'll continue with the same dataset as in the previous module.

## 4.1 Evaluation metrics: session overview

* Dataset: [https://www.kaggle.com/blastchar/telco-customer-churn](https://www.kaggle.com/blastchar/telco-customer-churn)

* [https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv)

*Metric* - a function that compares the predictions with the actual values and outputs a single number that tells how good the predictions are.
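To make this definition concrete, here is a minimal sketch of one such metric, accuracy, written as a plain function. The helper name `accuracy` and the toy arrays are invented for illustration and are not part of the course code.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the actual values."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return (y_true == y_pred).mean()


# Toy labels, made up for this example only (not taken from the churn dataset)
y_true_toy = np.array([1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 1, 0])

# 4 of the 5 predictions match the actual values
toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print(toy_accuracy)
```

The function takes two arrays and returns one number, which is the shape of a metric as described above.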

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [11]:
df = pd.read_csv('data-week-3.csv')

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

df.churn = (df.churn == 'yes').astype(int)

In [13]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

del df_train['churn']
del df_val['churn']
del df_test['churn']

In [15]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']

categorical = [
    'gender',
    'seniorcitizen',
    'partner',
    'dependents',
    'phoneservice',
    'multiplelines',
    'internetservice',
    'onlinesecurity',
    'onlinebackup',
    'deviceprotection',
    'techsupport',
    'streamingtv',
    'streamingmovies',
    'contract',
    'paperlessbilling',
    'paymentmethod',
]

In [16]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression()
model.fit(X_train, y_train)

In [17]:
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

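# Soft predictions: probability of churn (the positive class) for each validation customer
y_pred = model.predict_proba(X_val)[:, 1]
# Hard predictions: predict churn when the probability is at least 0.5
churn_decision = (y_pred >= 0.5)
# Accuracy: fraction of validation customers whose decision matches the actual label
(y_val == churn_decision).mean()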

0.8034066713981547

This week we'll look at this number (accuracy) and compare it with a baseline model to see whether 80% is good or not. Then we'll look at the different types of errors (e.g. we think a customer is churning but they are not, or the other way around) and the different types of correct decisions. Then we'll see how to put these numbers in a table, called a confusion table. Then we'll look at precision and recall, which are good evaluation metrics for binary classification problems. Then we'll talk about ROC (Receiver Operating Characteristic) curves, which let us evaluate the quality of soft predictions. Then we'll look at the area under the ROC curve, which is one of the most important metrics for binary classification problems. We'll finish this week by talking about cross-validation, which is a more involved way of validating a model and has both advantages and disadvantages.

end of video: [ML Zoomcamp 4.1 - Evaluation Metrics: Session Overview](https://www.youtube.com/watch?v=gmg5jw1bM8A&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=40)

## 4.2 Accuracy and dummy model

Here we'll look at accuracy and discuss if 80% accuracy is good enough or not.

* Evaluate the model on different thresholds
* Check the accuracy of dummy baselines
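The two steps above can be sketched as follows. To keep the block self-contained it uses synthetic labels and scores in place of `y_val` and `y_pred`; the names `y_true_demo` and `y_score_demo`, the share of positives, and the noise level are all assumptions made for illustration, not values from the churn data.

```python
import numpy as np

# Synthetic stand-ins for y_val and y_pred (assumed values, for illustration only)
rng = np.random.default_rng(1)
y_true_demo = (rng.random(1000) < 0.27).astype(int)
y_score_demo = np.clip(
    0.25 + 0.4 * y_true_demo + rng.normal(0, 0.2, size=1000), 0, 1
)

# Step 1: evaluate accuracy at thresholds 0.00, 0.05, ..., 1.00
thresholds = np.linspace(0, 1, 21)
accuracies = []
for t in thresholds:
    decision = (y_score_demo >= t)  # hard predictions at this threshold
    accuracies.append((y_true_demo == decision).mean())

best_idx = int(np.argmax(accuracies))
best_threshold = thresholds[best_idx]
best_accuracy = accuracies[best_idx]

# Step 2: dummy baseline that predicts "no churn" for every customer
dummy_accuracy = (y_true_demo == 0).mean()

print(f"best threshold: {best_threshold:.2f}, accuracy: {best_accuracy:.3f}")
print(f"dummy baseline accuracy: {dummy_accuracy:.3f}")
```

With the real data, replacing the synthetic arrays by `y_val` and `y_pred` should give the corresponding comparison for the churn model. The point of the dummy baseline is that when one class dominates, a model that always predicts that class already scores high on accuracy.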