# Churn
**Goal**: Our goal is to develop a model to predict the class of an output variable, *y*, based on 20 input features (*x1, …, x20*).  

**Data**: The dataset contains labeled samples: 8000 for training and 2000 for testing. Each sample contains 20 input features. Its label, *y*, can take one of 3 possible values (0, 1, 2). In addition, there are 30 unlabeled samples. The training, test, and unlabeled samples are in CSV files "*churn.train.csv*", "*churn.test.csv*", and "*churn.new.csv*", respectively.

**Approach**: We shall choose a good model for this three-class classification task based on results from *k*-fold cross-validation on the 8000 training samples. The selected model will be trained on all 8000 training samples and evaluated on the 2000 test samples. We shall then use the trained model to classify the unlabeled samples.

## Import modules

In [1]:
import pandas as pd # for data handling
from sklearn.model_selection import cross_val_score # for cross-validation
from sklearn.metrics import accuracy_score, classification_report # evaluation metrics
import matplotlib.pyplot as plt # for plotting

# scikit-learn classifiers evaluated (change as desired)
from sklearn.naive_bayes import GaussianNB 
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

## Get data
We shall extract the following 3 *CSV* files with data from the *zip* file "*churn.data.zip*":
1. *churn.train.csv*  with 8000 rows and 21 columns.   
2. *churn.test.csv*   with  2000 rows and 21 columns.   
3. *churn.new.csv*  with 30 rows and 21 columns.

The first column, **y**, in the files "*churn.train.csv*" and "*churn.test.csv*" contains the output labels (0, 1, or 2). The file "*churn.new.csv*" contains unlabeled samples; its first column, **ID**, is an identifier. The next 20 columns (*x1*, ..., *x20*) contain input features.

We shall read data from these *CSV* files into *pandas* dataframes.

### Extract *CSV* files from *zip* file
We shall use *unzip* to extract the CSV files from the zip file containing the data.
- “**!**” allows us to run command line commands from a code cell in a notebook.


In [2]:
! unzip '/content/drive/MyDrive/Colab Notebooks/courses/sklearn_classifiers/data/churn.data.zip'

Archive:  /content/drive/MyDrive/Colab Notebooks/courses/sklearn_classifiers/data/churn.data.zip
  inflating: churn.new.csv           
  inflating: churn.test.csv          
  inflating: churn.train.csv         
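
If the `unzip` command is not available (for example, outside Colab), the same extraction can be sketched with Python's standard `zipfile` module. The helper name `extract_csv_files` and its arguments are assumptions made for illustration; they are not part of the original notebook.

```python
import zipfile

def extract_csv_files(zip_path, dest_dir="."):
    """Extract every .csv member of zip_path into dest_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the CSV members; other files in the archive are skipped
        csv_names = [name for name in archive.namelist() if name.endswith(".csv")]
        archive.extractall(path=dest_dir, members=csv_names)
    return sorted(csv_names)
```

For example, `extract_csv_files('churn.data.zip')` would extract the CSV files into the working directory, assuming the archive had been copied there (a hypothetical location; the notebook reads it from Google Drive).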


### Read data into *pandas* dataframes
We shall use the *pandas* **read_csv** function to read data from the CSV files "*churn.train.csv*", "*churn.test.csv*", and "*churn.new.csv*" into the *pandas* dataframes **train**, **test**, and **new**, respectively.

In [3]:
# Read data from CSV files into pandas dataframes
train = pd.read_csv('churn.train.csv') # training data
test = pd.read_csv('churn.test.csv') # test data
new = pd.read_csv('churn.new.csv') # unlabeled data
# Show number of rows and columns in each dataframe
print('train contains %d rows and %d columns' %train.shape)
print('test contains %d rows and %d columns' %test.shape)
print('new contains %d rows and %d columns' %new.shape)
print('First 3 rows in train:') 
train.head(3) # display first 3 training samples 

train contains 8000 rows and 21 columns
test contains 2000 rows and 21 columns
new contains 30 rows and 21 columns
First 3 rows in train:


Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20
0,0.0,2.444,-2.329,1.478,-0.728,-1.175,-4.542,-0.581,1.265,0.702,1.354,-1.273,3.469,1.151,-2.121,-1.229,4.518,1.318,-0.253,2.79,3.519
1,1.0,-0.584,1.923,4.892,-0.281,-2.749,0.955,-3.017,-1.15,3.13,1.612,-4.394,-2.225,0.792,-1.083,-2.672,0.759,1.118,-0.069,6.413,-1.608
2,2.0,-2.587,1.851,-0.028,-4.569,-2.763,-0.18,1.691,-3.493,-1.696,1.676,-4.444,2.828,0.271,0.108,0.154,0.671,-0.467,-1.908,3.769,-2.186


In [4]:
print('Last 2 rows in new:') 
new.tail(2) # display last 2 unlabeled samples

Last 2 rows in new:


Unnamed: 0,ID,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20
28,ID_029,-2.839,-2.248,2.086,-5.58,-2.757,0.161,2.322,-1.6,1.542,-4.021,2.491,0.411,-2.039,-3.103,-0.302,-0.018,-3.063,-2.129,-2.152,-5.939
29,ID_030,-0.803,2.87,-0.363,-3.411,-1.058,-0.139,-1.646,-1.414,1.927,-1.077,-1.43,3.216,-2.766,-0.612,-4.65,-3.347,-3.29,-1.257,-3.598,-0.109


### Specify inputs and outputs
- **features**: List of the 20 input feature names.
- **X_train**: $8000 \times 20$ array containing input values for training samples.
- **y_train**: Array containing labels for the 8000 training samples.
- **X_test**: $2000 \times 20$ array containing input values for test samples.
- **y_test**: Array containing labels for the 2000 test samples.
- **X_new**: $30 \times 20$ array containing input values for unlabeled samples.






In [5]:
features = list(train)[1:] # all but the first column header are feature names
print("features:", features)
X_train, X_test, X_new = train[features], test[features], new[features]
y_train, y_test = train.y, test.y
print('Shapes:')
print(f'X_train: {X_train.shape}, X_test: {X_test.shape}, X_new: {X_new.shape}')
print(f'y_train: {y_train.shape}, y_test: {y_test.shape}')

features: ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20']
Shapes:
X_train: (8000, 20), X_test: (2000, 20), X_new: (30, 20)
y_train: (8000,), y_test: (2000,)


## Evaluate models using *k*-fold cross-validation
We shall use **4**-fold cross-validation so that 6000 of the 8000 training samples are used for training and the remaining 2000 samples are used for validation in each fold. The mean cross-validation accuracy of each model, with its chosen hyperparameters, over the 4 folds will be computed using the command:
- **score = cross_val_score(model, X_train, y_train, cv=4).mean()**
> - model: classifier object with specified hyperparameters
> - X_train, y_train: Inputs and output labels for training
> - cv: number of folds in cross-validation
> - mean(): computes mean accuracy from the *cv* runs 

You can look up the documentation for each classifier, change hyper-parameter values, and observe the results. We shall also observe the time it takes to train and evaluate each model 4 times in this process.
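
To make the `cross_val_score` call less of a black box, here is a minimal sketch of the same computation written as an explicit loop. It assumes the scikit-learn behaviour for classifiers given an integer `cv` (stratified folds without shuffling) and uses a small synthetic dataset from `make_classification` as a stand-in for the churn data. The names `X_demo`, `y_demo`, and `manual_cv_accuracy` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class stand-in for the churn data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=5,
                                     n_classes=3, random_state=0)

def manual_cv_accuracy(model, X, y, n_folds=4):
    """Mean validation accuracy over n_folds stratified folds."""
    fold_scores = []
    for train_idx, valid_idx in StratifiedKFold(n_splits=n_folds).split(X, y):
        fold_model = clone(model)                    # fresh, unfitted copy for each fold
        fold_model.fit(X[train_idx], y[train_idx])   # train on the other folds
        fold_scores.append(fold_model.score(X[valid_idx], y[valid_idx]))  # accuracy on held-out fold
    return float(np.mean(fold_scores))

manual = manual_cv_accuracy(GaussianNB(), X_demo, y_demo)
builtin = cross_val_score(GaussianNB(), X_demo, y_demo, cv=4).mean()
print(f'manual = {manual:0.4f}, cross_val_score = {builtin:0.4f}')
```

Both numbers should agree, since each model is fitted and scored on the same folds.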


### GaussianNB

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

In [6]:
%%time
model = GaussianNB() # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.8587
CPU times: user 35.2 ms, sys: 2.86 ms, total: 38 ms
Wall time: 43.1 ms


### DecisionTreeClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [7]:
%%time
model = DecisionTreeClassifier(max_leaf_nodes=10) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.6831
CPU times: user 322 ms, sys: 2.65 ms, total: 325 ms
Wall time: 334 ms


### RandomForestClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [8]:
%%time
model = RandomForestClassifier(n_estimators=100) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9433
CPU times: user 8.89 s, sys: 11.6 ms, total: 8.9 s
Wall time: 8.91 s


### ExtraTreesClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

In [9]:
%%time
model = ExtraTreesClassifier(n_estimators=100) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9585
CPU times: user 3.18 s, sys: 46 ms, total: 3.22 s
Wall time: 3.23 s


### KNeighborsClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [13]:
%%time
model = KNeighborsClassifier(n_neighbors=9, algorithm='brute') # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9759
CPU times: user 1.49 s, sys: 393 ms, total: 1.89 s
Wall time: 1.35 s


### LogisticRegression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [14]:
%%time
model = LogisticRegression(max_iter=10000) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.8681
CPU times: user 705 ms, sys: 551 ms, total: 1.26 s
Wall time: 693 ms


### SVC

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [16]:
%%time
model = SVC(C=1.0) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9794
CPU times: user 2.97 s, sys: 6.78 ms, total: 2.98 s
Wall time: 2.99 s


### MLPClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [17]:
%%time
model = MLPClassifier(hidden_layer_sizes=100, max_iter=1000) # change hyperparameters as desired
score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
print(f'Mean cross-validation accuracy = {score:0.4f}')

Mean cross-validation accuracy = 0.9649
CPU times: user 54.1 s, sys: 41.6 s, total: 1min 35s
Wall time: 49.1 s
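
The per-model cells above all follow the same pattern, so the comparison can also be condensed into a single loop. The sketch below is one possible way to do this, not the notebook's own method. It uses a synthetic stand-in dataset and only three of the classifiers to keep it quick; `X_demo`, `y_demo`, `candidates`, and `results` are hypothetical names.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the churn data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=5,
                                     n_classes=3, random_state=0)

candidates = {
    'GaussianNB': GaussianNB(),
    'DecisionTreeClassifier': DecisionTreeClassifier(max_leaf_nodes=10, random_state=0),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=9),
}

results = []
for name, model in candidates.items():
    start = time.perf_counter()
    score = cross_val_score(model, X_demo, y_demo, cv=4).mean()  # mean 4-fold accuracy
    elapsed = time.perf_counter() - start                        # wall time for the 4 fits
    results.append((name, score, elapsed))

# Show the models from most to least accurate
for name, score, elapsed in sorted(results, key=lambda r: r[1], reverse=True):
    print(f'{name:<24s} accuracy = {score:0.4f}  time = {elapsed:0.2f} s')
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the remaining classifiers could be added to `candidates`.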


## Select a good model
Since the Support Vector Classifier has the highest cross-validation accuracy among the models tried, we shall search for good hyperparameter values for an SVC model using cross-validation. In this example we shall vary the regularization parameter C.

In [18]:
for c in [0.01, 0.1]: # candidate values of the regularization parameter C
    model = SVC(C=c)
    score = cross_val_score(model, X_train, y_train, cv=4).mean() # mean cross-validation accuracy
    print(f'Mean cross-validation accuracy with C = {c:g} = {score:0.4f}')

Mean cross-validation accuracy with C = 0.01 = 0.9215
Mean cross-validation accuracy with C = 0.1 = 0.9635
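
The same kind of search over C can be sketched with scikit-learn's `GridSearchCV`, which runs the cross-validation for each candidate value and records the best one. This is an alternative to the explicit loop, not the notebook's own method. The data below is a synthetic stand-in, and the names `X_demo`, `y_demo`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the churn data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=5,
                                     n_classes=3, random_state=0)

# Evaluate each candidate C with 4-fold cross-validation
search = GridSearchCV(SVC(), param_grid={'C': [0.01, 0.1, 1.0]}, cv=4)
search.fit(X_demo, y_demo)

for params, score in zip(search.cv_results_['params'], search.cv_results_['mean_test_score']):
    print(f"C = {params['C']:g}: mean cross-validation accuracy = {score:0.4f}")
print('Best C:', search.best_params_['C'])
```

By default `GridSearchCV` also refits the best model on all the data passed to `fit`, so `search.best_estimator_` is ready for prediction afterwards.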


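Since C = 1.0 gave the highest mean cross-validation accuracy (0.9794, compared with 0.9635 for C = 0.1 and 0.9215 for C = 0.01), we shall choose the following model for this classification problem: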
- SVC(C=1.0) 

In [19]:
chosen_model = SVC(C=1.0) # chosen model
print(chosen_model) # display model parameters

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)


## Train and test selected model

In [20]:
%%time
chosen_model.fit(X_train, y_train) # train selected model on ALL training examples
predicted = chosen_model.predict(X_test) # predicted classes for test examples
acc = accuracy_score(y_test, predicted) # accuracy on test samples
print(f'Accuracy on test samples = {acc:0.4f}') # show test accuracy
print("Classification report on test samples:") # for precision, recall, F1-score
print(classification_report(y_test, predicted, digits=4)) # rounded to 4 decimal places

Accuracy on test samples = 0.9770
Classification report on test samples:
              precision    recall  f1-score   support

         0.0     0.9819    0.9716    0.9767       670
         1.0     0.9810    0.9810    0.9810       633
         2.0     0.9688    0.9785    0.9736       697

    accuracy                         0.9770      2000
   macro avg     0.9772    0.9771    0.9771      2000
weighted avg     0.9770    0.9770    0.9770      2000

CPU times: user 1.13 s, sys: 7.21 ms, total: 1.14 s
Wall time: 1.14 s
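
The per-class precision and recall in the classification report can be derived from a confusion matrix. The sketch below illustrates the arithmetic on a small, made-up set of labels (not the churn test data); `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (assumption: not the churn test data)
y_true_demo = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class
correct = np.diag(cm)                            # correctly classified samples per class
precision = correct / cm.sum(axis=0)             # correct / number predicted as each class
recall = correct / cm.sum(axis=1)                # correct / number truly in each class
accuracy = correct.sum() / cm.sum()              # overall fraction classified correctly

print(cm)
print('precision:', np.round(precision, 4))
print('recall:   ', np.round(recall, 4))
print(f'accuracy = {accuracy:0.4f}')
```

Calling `confusion_matrix(y_test, predicted)` on the real test predictions would show which of the three classes the chosen model confuses with each other.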


## Predict class for unlabeled samples
We shall use our trained model to predict the output class for the unlabeled samples.

In [21]:
predicted_new = chosen_model.predict(X_new) # predicted classes for unlabeled samples
churn_prediction = pd.DataFrame() # dataframe with predicted classes
churn_prediction['ID'] = new.ID # identifiers for unlabeled samples
churn_prediction['y'] = predicted_new # predicted classes for unlabeled samples
churn_prediction.to_csv('churn.prediction.csv', index=False) # save as CSV file
churn_prediction # display results

Unnamed: 0,ID,y
0,ID_001,0.0
1,ID_002,1.0
2,ID_003,0.0
3,ID_004,0.0
4,ID_005,0.0
5,ID_006,0.0
6,ID_007,0.0
7,ID_008,0.0
8,ID_009,0.0
9,ID_010,0.0
