<p align="center">
<img src="https://github.com/datacamp/python-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
</p>
<br><br>

## **Machine Learning with scikit-learn**

Welcome to this hands-on training where you will immerse yourself in machine learning with Python. Using both `pandas` and `scikit-learn`, we'll learn how to process data for machine learning and create predictions on a churn case study. In this session you will learn:

- The different types of machine learning and when to use them.
- How to apply data preprocessing for machine learning including feature engineering. 
- How to apply supervised machine learning models to generate predictions!

## **The Dataset**

The dataset to be used in this webinar is a CSV file named `telco.csv`, which contains data on telecom customers churning and some of their key behaviors. It contains the following columns:

**Features**:

- `customerID`: Unique identifier of a customer.
- `gender`: Gender of customer.
- `SeniorCitizen`: Binary variable indicating if customer is senior citizen.
- `Partner`: Binary variable indicating if customer has a partner.
- `Dependents`: Binary variable indicating if customer has dependents.
- `tenure`: Number of weeks as a customer.
- `PhoneService`: Whether customer has phone service.
- `MultipleLines`: Whether customer has multiple lines.
- `InternetService`: What type of internet service customer has (`"DSL"`, `"Fiber optic"`, `"No"`).
- `OnlineSecurity`: Whether customer has online security service.
- `OnlineBackup`: Whether customer has online backup service.
- `DeviceProtection`: Whether customer has device protection service.
- `TechSupport`: Whether customer has tech support service.
- `StreamingTV`: Whether customer has TV streaming service.
- `StreamingMovies`: Whether customer has movies streaming service.
- `Contract`: Customer Contract Type (`'Month-to-month'`, `'One year'`, `'Two year'`).
- `PaperlessBilling`: Whether paperless billing is enabled.
- `PaymentMethod`: Payment method.
- `MonthlyCharges`: Amount of monthly charges in $.
- `TotalCharges`: Amount of total charges so far.

**Target Variable**:

- `Churn`: Whether customer `'Stayed'` or `'Churned'`.


In [None]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn



## **Data Exploration**

In [None]:
# Read in dataset
telco = pd.read_csv('https://raw.githubusercontent.com/datacamp/machine-learning-with-scikit-learn-live-training/master/data/telco_churn.csv', index_col = "Unnamed: 0")
pd.set_option('display.max_columns', None)

In [None]:
# Print header


In [None]:
# Print info


In [None]:
# Take a look at unique values in telco


In [None]:
# Unique values of internet service

The **null model** is a reference model for judging classification accuracy - its **null accuracy** is the accuracy we would get by always predicting the most frequent class *(or outcome)*. Accuracy is defined here as follows:

<br>


$$\large{accuracy = \frac{\# \space times \space model \space is \space right}{total \space number \space of \space predictions}}$$



In [None]:
# Find the null model


$$\large{null \space accuracy = \frac{\# \space times \space model \space predicted \space "Stayed"}{total \space number \space of \space predictions}}$$

In [None]:
# Find the null model


In this particular instance, the null accuracy (always predicting `"Stayed"`) is 73.4% - and any meaningful model will have to beat that accuracy score. 
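
As a minimal sketch of the idea, here is how null accuracy could be computed on a small made-up target column. The `churn` Series below is hypothetical toy data, not the `telco` dataset, so its result differs from the 73.4% quoted above.

```python
import pandas as pd

# Hypothetical toy target column; the values are made up for illustration
churn = pd.Series(["Stayed", "Stayed", "Churned", "Stayed", "Churned",
                   "Stayed", "Stayed", "Churned", "Stayed", "Stayed"])

# Share of each class; the largest share is the null accuracy
class_shares = churn.value_counts(normalize=True)
most_frequent_class = class_shares.idxmax()
null_accuracy = class_shares.max()

print(most_frequent_class, round(null_accuracy, 4))
```

Here 7 of the 10 toy customers stayed, so always predicting `"Stayed"` is right 70% of the time.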

---
<center><h1> Q&A 1</h1> </center>

---

## **Data Cleaning**

**Task 1: Dropping** `customerID` **column**

To drop a column from a DataFrame - we can use the `.drop()` method alongside the following arguments:

- Name of `column` dropped - in this example `'customerID'`
- `axis`: Whether to drop row (`0`), or column (`1`).
- `inplace`: Boolean whether to drop in place and overwrite change in DataFrame.
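
A small sketch of these arguments on a made-up DataFrame (the `df` below is a toy example, not `telco`):

```python
import pandas as pd

# Toy DataFrame for illustration only
df = pd.DataFrame({"customerID": ["A1", "B2", "C3"], "tenure": [1, 24, 60]})

# axis=1 targets columns; inplace=True modifies df directly and returns None
df.drop("customerID", axis=1, inplace=True)

print(df.columns.tolist())
```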

In [None]:
# Drop customer ID column


**Task 2: Converting** `TotalCharges` **column**

To convert a column from string to numeric - we can use the `pd.to_numeric()` function - which takes the following arguments:

- Name of `column` to convert - in this example `'TotalCharges'`
- `errors`: Whether to `'raise'` an error if cannot convert or to `'coerce'` it to `NaN`.
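
To see what `errors='coerce'` does, here is a sketch on a toy Series of strings (the values are made up, with one blank entry standing in for a value that cannot be converted):

```python
import pandas as pd

# Toy string column; the blank entry cannot be parsed as a number
charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable entries become NaN instead of raising an error
charges_numeric = pd.to_numeric(charges, errors="coerce")

print(charges_numeric.dtype)
print(charges_numeric.isna().sum())
```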

In [None]:
# Convert TotalCharges to numeric


In [None]:
# Print info


In [None]:
# Print # of missing values


In [None]:
# Visualize distribution of TotalCharges


In [None]:
# Get distribution of TotalCharges


As a reminder, `.loc[]` lets us select a group of rows and columns of a DataFrame by labels or boolean arrays - meaning we can subset a DataFrame `df` as such:

```
df.loc[row condition, column label]
```
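
As a sketch of this pattern, the toy example below uses a boolean row condition together with a column label to fill a missing value with the column median (the `df` here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
df = pd.DataFrame({"TotalCharges": [10.0, np.nan, 30.0, 50.0]})

# .median() skips NaN by default
median_charges = df["TotalCharges"].median()

# Row condition: rows where the value is missing; column label: 'TotalCharges'
df.loc[df["TotalCharges"].isna(), "TotalCharges"] = median_charges

print(df["TotalCharges"].tolist())
```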

In [None]:
# Replace NA of TotalCharges with median


In [None]:
# Make sure no more missing values


**Task 3: Collapsing values of** `InternetService` **column**

To replace the values of a column `col_A` in a DataFrame `df` - we can use the `.replace()` method, which takes in a dictionary mapping the `old_value` to the `new_value` as such:

```
df['col_A'] = df['col_A'].replace({old_value : new_value})
```
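
A concrete sketch on a toy column, assuming an inconsistent lowercase `'dsl'` label that should be collapsed into `'DSL'`:

```python
import pandas as pd

# Toy column with an inconsistent label
df = pd.DataFrame({"InternetService": ["DSL", "dsl", "Fiber optic", "No"]})

# Values not listed in the dictionary are left unchanged
df["InternetService"] = df["InternetService"].replace({"dsl": "DSL"})

print(df["InternetService"].unique())
```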


In [None]:
# Collapse 'dsl' into 'DSL'



---
<center><h1> Q&A 2</h1> </center>

---

## **Exploratory Analysis for Machine Learning**

In order to understand which models have predictive power, which variables to use in feature engineering, and to build a common sense understanding of what is driving churn to understand results, it is essential to explore the data and observe how **features** interact with the **target** variable. 

There are broadly 3 types of data: 

- Continuous data _(e.g. age)_. 
- Categorical data _(e.g. marriage status)_. 
- Other *(e.g. image, tweets, etc...)*

Let's visualize how continuous and categorical data in `telco` behave with `Churn`.

In [None]:
# Grab a look at the header


> *A note on list comprehensions:*
>
> List comprehensions provide an elegant way to iteratively produce lists without using a traditional `for` loop. For example, here's how we can create a list of numbers from 0 to 9:
>
> ```
> my_list = [i for i in range(0,10)]
>
> print(my_list)
>
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
> ```
>
> It is also possible to add an `if` statement to create a condition while creating the list - here is an example where we create a list of numbers bigger than 3 from values ranging from 0 to 9:
>
> ```
> # Create a list of the values from 0 to 9 that are bigger than 3
> my_list = [i for i in range(0,10) if i > 3]
>
> print(my_list)
>
> [4, 5, 6, 7, 8, 9]
> ```

In [None]:
# Get dtypes of column


In [None]:
# Get all features


# Get all categorical features


# Get all numeric columns


In [None]:
# Print them out and make sure


> #### **Data Visualization Refresher** 
> 
> A `matplotlib` visualization is made of 3 components:
> - A **figure** which houses one or many subplots (or axes).
> - The **axes** objects ~ the subplots within the figure.
> - The plot inside each subplot or axes.
>
> We can generate a figure with subplots using the following function:
>
> `fig, axes = plt.subplots(nrows, ncols)`
> 
> <p align="left">
<img src="https://github.com/adelnehme/intro-to-data-visualization-Python-live-training/blob/master/images/subplots.gif?raw=true" width="55%">
</p>
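
As a small sketch of how the figure and axes fit together, the example below creates a 2 by 3 grid and draws a made-up line in each subplot. The grid size, figure size and plotted values are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# One figure holding a 2 x 3 grid of subplots
fig, axes = plt.subplots(2, 3, figsize=(12, 6))

# axes is a 2D array of Axes objects; flatten() yields them row by row
for i, ax in enumerate(axes.flatten()):
    ax.plot([0, 1, 2], [0, i, 2 * i])
    ax.set_title(f"Subplot {i}")
```

Looping over `axes.flatten()` is the same pattern used in the plotting cells of this notebook.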


#### **Visualizing target variable relationship with categorical features**

To visualize the count of different categorical values by `Churn`, we can use the `sns.countplot(x, hue, data, ax)` function which takes in:
- `x`: The column name being counted.
- `hue`: The column name used for grouping the data.
- `data`: The DataFrame being visualized.
- `ax`: Which axes in the figure to assign the plot.
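
Here is a minimal sketch of a single countplot on a toy DataFrame (the rows below are made up and only reuse the column names from `telco`):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data for illustration only
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["Churned", "Stayed", "Churned", "Stayed", "Stayed", "Stayed"],
})

fig, ax = plt.subplots(figsize=(6, 4))

# Count rows per Contract value, split into one bar per Churn value
sns.countplot(x="Contract", hue="Churn", data=df, ax=ax)
ax.set_title("Contract")
```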

In [None]:
# Setting aesthetics for better viewing
plt.rcParams["axes.labelsize"] = 5
sns.set(font_scale=5) 

# Create figure and axes
fig, axes = plt.subplots(5, 3, figsize = (100, 100))

# Iterate over each axes, and plot a countplot with categorical columns
for ax, column in zip(axes.flatten(), categorical):
    
    # Create countplot
    
    
    # Set the title of each subplot
    

    # Improve legends
    handles, labels = ax.get_legend_handles_labels()
    fig.legend(handles, labels, loc='right', fontsize = 48)
    ax.get_legend().remove()

**Observation 6:** Gender seems to have a 50-50 split in values and does not seem to affect Churn.

**Observation 7:** `Fiber optic` Internet Service seems to be a driver of Churn.

**Observation 8:** `OnlineBackup`, `DeviceProtection`, `TechSupport` and `OnlineSecurity` users tend to churn less.

#### **Visualizing target variable relationship with continuous features**

A great way to observe the differences between two groups (or categories) of data according to a numeric value is a boxplot, which visualizes the following:

<p align="left">
<img src="https://github.com/adelnehme/intro-to-data-visualization-Python-live-training/blob/master/images/boxplot.png?raw=true" alt = "DataCamp icon" width="80%">
</p>

It can be visualized as such:

- `sns.boxplot(x=, y=, data=)`
  - `x`: Categorical variable we want to group our data by.
  - `y`: Numeric variable being observed by group.
  - `data`: The DataFrame being used.
  

In [None]:
# Setting aesthetics for better viewing
plt.rcParams["axes.labelsize"] = 1
sns.set(font_scale=1) 
 
# Create figure and axes
fig, axes = plt.subplots(1, 3, figsize = (20, 8))

# Iterate over each axes, and plot a boxplot with numeric columns
for ax, column in zip(axes.flatten(), numeric):
    
    # Create a boxplot
    
    
    # Set title
    

**Observation 9:** Higher monthly charges tend to be related to Churn.

**Observation 10:** Tenure may seem predictive, but it could very well be that churners have low tenure by nature because they are churning.

---
<center><h1> Q&A 3</h1> </center>

---

## **Data pre-processing for machine learning**

Many machine learning algorithms require data to be processed before it is passed into them - and the processing strategy differs depending on whether the data is numeric or categorical.

**Continuous or numeric data**

Many machine learning models make assumptions about the distribution of numeric features when modeling (most commonly data is assumed to be normally distributed). Also, many numeric columns have different scales _(e.g. Age vs Salary)_. 

A common way to process numeric columns is through **Standardization** - where we subtract their mean and divide by their standard deviation so that they are centered around a mean of 0 and have a standard deviation of 1:

$$\large{x_{scaled} = \frac{x - mean}{std}}$$
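
To make the formula concrete, here is a sketch that applies it by hand with `numpy` on a few made-up values. It uses the population standard deviation, which is NumPy's default.

```python
import numpy as np

# Made-up values for illustration
x = np.array([10.0, 20.0, 30.0, 40.0])

# Subtract the mean, then divide by the standard deviation (ddof=0 by default)
x_scaled = (x - x.mean()) / x.std()

print(x_scaled.round(4))
print(round(x_scaled.mean(), 4), round(x_scaled.std(), 4))
```

After scaling, the values have a mean of 0 and a standard deviation of 1, whatever their original scale was.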

<br>

<p align="left">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/standard_scaler.gif?raw=true?resize" width="45%">
</p>


<br>

We can do this easily in `sklearn` by using the `StandardScaler()` class. Many operations in `sklearn` follow the `.fit()` $\rightarrow$ `.transform()` paradigm and `StandardScaler()` is no different:

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on data (double brackets keep the 2D shape the scaler expects)
scaler.fit(df[[my_column]])

# Transform
column_scaled = scaler.transform(df[[my_column]])

# Replace column
df[[my_column]] = column_scaled
```

However, it is very important to **first split** your data before scaling your features since we do not want to scale our data according to the distribution of both the training data and test data. Failing to do so results in **data leakage** and could lead to "too good to be true" results on testing data with relatively weaker results on unseen data. 

<font color=00AAFF>Ideally, scalers should be fit on **training data only** - and be used to transform both training and testing data.</font>
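
The sketch below illustrates this order of operations on randomly generated data. The arrays, the 25% test size and the random seeds are assumptions made for the example, not values taken from this session.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = rng.integers(0, 2, size=100)

# 1. Split first
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit the scaler on training data only
scaler = StandardScaler()
scaler.fit(train_X)

# 3. Transform both sets with the training mean and standard deviation
train_X_scaled = scaler.transform(train_X)
test_X_scaled = scaler.transform(test_X)

print(train_X_scaled.mean(axis=0).round(4))
print(test_X_scaled.mean(axis=0).round(4))
```

The scaled training columns have a mean of 0 by construction, while the test columns are only approximately centered, since the test data played no part in fitting the scaler.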

In [None]:
# Split data between X and label
X = telco[features]
y = telco['Churn'].replace({'Stayed': 0, 'Churned':1})

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split data into train test splits


In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Initialize a scaler


# Fit on training data


# Transform training and test data


In [None]:
# Replace columns in training and testing data accordingly



**Categorical data**

While categorical variables like country, marriage status and more are easily interpretable by humans - they need to be properly encoded to be understood by machine learning algorithms. We will be using dummy encoding *(highly similar to one-hot encoding)* where categorical variables are converted to binary (`1`,`0`) columns to indicate whether they have a certain value or not. Note that dummy encoding generates `n-1` columns for a variable with `n` categories. Using a country example - `0` on all columns encodes it as France.

<br>


<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/onehot_dummy.gif?raw=true" width="80%">
</p>

Using dummy encoding in `pandas` is actually very easy - we can use the `pd.get_dummies()` function which takes:

- The DataFrame being converted.
- `columns`: The name of the categorical columns to be converted.
- `drop_first`: Boolean to indicate one-hot encoding (`False`) or dummy encoding (`True`).
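
Here is a sketch of both encodings on a toy `country` column (made-up data, following the country example above):

```python
import pandas as pd

# Toy DataFrame for illustration only
df = pd.DataFrame({"country": ["France", "Germany", "Spain", "France"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(df, columns=["country"], drop_first=False)

# Dummy encoding: the first category (France) is dropped
dummy = pd.get_dummies(df, columns=["country"], drop_first=True)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```

In the dummy-encoded version, a row with `0` in both remaining columns stands for France.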




In [None]:
# One hot encode cat variables


**Feature Engineering**

Generating new predictive features from existing features is an important aspect of machine learning. New features could be engineered using:

- Binning numeric values _(e.g. `age_category` column from `age` column)._
- Interaction of 2 columns _(e.g. `total_salary`/`tenure`)._
- Features from domain knowledge.

We learned while visualizing categorical columns that being subscribed to `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, and `TechSupport` tend to drive less churn. Let's visualize this further with a new feature called `in_ecosystem` which counts the number of services a given customer is subscribed to.
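
One possible way to build such a count, sketched on a toy DataFrame: sum the dummy-encoded service columns row by row. The `_Yes` column names assume the output of dummy encoding, and the rows are made up.

```python
import pandas as pd

# Toy dummy-encoded training data for illustration only
train_X = pd.DataFrame({
    "OnlineSecurity_Yes": [1, 0, 1, 0],
    "OnlineBackup_Yes": [1, 0, 0, 0],
    "DeviceProtection_Yes": [1, 0, 1, 0],
    "TechSupport_Yes": [0, 0, 1, 1],
})

service_columns = ["OnlineSecurity_Yes", "OnlineBackup_Yes",
                   "DeviceProtection_Yes", "TechSupport_Yes"]

# Number of services each customer is subscribed to
train_X["in_ecosystem"] = train_X[service_columns].sum(axis=1)

print(train_X["in_ecosystem"].tolist())
```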


In [None]:
# Re-add the target variable to training and test data



In [None]:
# Check out header again


In [None]:
# Service columns
service_columns = ['OnlineSecurity_Yes', 'OnlineBackup_Yes', 'DeviceProtection_Yes', 'TechSupport_Yes']

# Create in_ecosystem column


# Visualize churn by number of services subscribed


In [None]:
# Create feature that is 1 if 2 or more services subscribed, 0 otherwise


# Apply the same on test_X



In [None]:
# Drop target variable from training and test data again 



---
<center><h1> Q&A 4</h1> </center>

---

## **Modeling**

Most machine learning models for classification aim at creating a decision boundary between data points to generate predictions. For example, here is a decision boundary where the target variable is whether a tumour is benign or cancerous based on tumour height and width:

<br>

<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/decision_boundary.gif?raw=true?" width="50%">
</p>



#### **Using K-Nearest Neighbors to Generate Predictions**

The K-Nearest Neighbors algorithm predicts the label of unseen data by choosing the most common label among the `K` closest points to it. Using our cancerous/benign tumour example, K-Nearest Neighbors would behave like this:


<br>

<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/knn.gif?raw=true?" width="50%">
</p>

Just like almost all algorithms on `sklearn` - the `KNeighborsClassifier()` needs to be instantiated and follows the `.fit()` $\rightarrow$ `.predict()` paradigm as such:

```
# Import algorithm
from sklearn.neighbors import KNeighborsClassifier

# Instantiate it
knn = KNeighborsClassifier(n_neighbors = k)

# Fit on training data
knn.fit(train_X, train_Y)

# Create predictions
predictions = knn.predict(test_X)
```

In [None]:
# Import K-Nearest Neighbor Classifier and accuracy_score
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score

# Instantiate K Nearest Neighbors with 6 neighbors


# Fit on training data


# Create Predictions



# Calculate accuracy score on testing and training data



# Print test and train accuracy scores rounded to 4 decimals
print('Test accuracy:', round(test_accuracy, 4))
print('\nTrain accuracy:', round(train_accuracy, 4))

#### **Using Decision Trees and Random Forests to Generate Predictions**

A **decision tree** is a recursive algorithm that sequentially asks if-else questions about the data using a set of cutoff points designed to maximize the purity (homogeneity) of the resulting data points. 

Taking the tumour example, this would mean asking a series of questions about tumour height and width to determine whether a tumour is cancerous or not. Splits are made so that the resulting data points are as homogeneous as possible to predict the class on unseen data.


<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/decision_tree.png?raw=true" width="60%">
</p>

<br>

Just like `KNeighborsClassifier()` - the `DecisionTreeClassifier()` also uses the `.fit()` $\rightarrow$ `.predict()` paradigm.

A **Random Forest** pools the predictions of many decision trees, each fit on a random subset of features and samples from the training data, and returns the most common class for each sample of test data.


<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/forest.gif?raw=true" width="70%">
</p>

It is available through the `RandomForestClassifier()` class - and also follows the `.fit()` $\rightarrow$ `.predict()` paradigm.
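
As a sketch of the shared `.fit()` $\rightarrow$ `.predict()` workflow, the example below trains both classifiers on synthetic data from `make_classification`. The dataset, the number of trees and the random seeds are assumptions for illustration, so the scores say nothing about `telco`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Both models share the same fit / predict interface
for name, model in [("Tree", tree), ("Forest", forest)]:
    model.fit(train_X, train_Y)
    train_accuracy = accuracy_score(train_Y, model.predict(train_X))
    test_accuracy = accuracy_score(test_Y, model.predict(test_X))
    print(name, "train:", round(train_accuracy, 4), "test:", round(test_accuracy, 4))
```

Comparing the train and test scores side by side is a quick first check for the overfitting discussed in the next section.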

In [None]:
# Import relevant packages
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Instantiate decision tree and random forest classifiers



# Fit decision tree and random forest on data



# Create Predictions on test and train data using decision tree



# Create Predictions on test and train data using random forest



# Calculate test and train accuracy score on decision tree



# Calculate test and train accuracy score on random forest



# Print test accuracy score rounded to 4 decimals
print('Tree test accuracy:', round(test_accuracy_tree, 4))
print('Tree train accuracy:', round(train_accuracy_tree, 4))

# Print test accuracy score rounded to 4 decimals
print('\nForest test accuracy:', round(test_accuracy_forest, 4))
print('Forest train accuracy:', round(train_accuracy_forest, 4))

#### **Overfitting, the bias-variance tradeoff and cross validation**

Checking out the results of the decision tree and random forest classifiers, the training accuracy far exceeds the testing accuracy score, suggesting that the model is fitting really well (a bit too well) on the training data and does not generalize to unseen data. 

This is called overfitting and can be illustrated by a highly complex decision boundary that results from fitting the model on the training data.

<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/overfittinig_new.gif?raw=true" width="50%">
</p>



**Model Variance**

A model is said to have high variance if it creates an elaborate decision boundary around data points that changes substantially across different sets of training data. 

<ins> It can be diagnosed if **training accuracy** >>> **test accuracy**. </ins>


**Model Bias**

A model underfits the data, or is said to have high bias, if the decision boundary does not fit the data - and generates inaccurate predictions on both training and testing data.

<ins> It can be diagnosed if both **training accuracy** and **test accuracy** are low. </ins>

<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/high_bias.png?raw=true" width="60%">
</p>


**Cross Validation**

Cross validation is considered best practice for assessing a model's performance. It essentially divides the training data `n` times into a training set and a hold-out set - iteratively fitting the model on the training set and validating on the hold-out set, storing each validation result separately. Finally, the `n` results are averaged to get a mean validation score. 

<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/cross_validation.png?raw=true" width="60%">
</p>


Cross-validation can be done by using the `cross_val_score()` function in `sklearn` - it takes in as arguments the following:

- The instantiated model in question.
- The training data and label.
- `cv`: The number of cross validation folds.
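
A minimal sketch of these arguments on synthetic data, where the dataset and the choice of 5 folds are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# One accuracy score per fold; each fold is held out exactly once
cv_scores = cross_val_score(tree, X, y, cv=5)

print(cv_scores)
print("Mean cross-val score:", round(np.mean(cv_scores), 4))
```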


In [None]:
# Import relevant modules 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Instantiate decision tree


# Get cross validation scores


# Fit on training data and get predictions



# Print accuracy scores
print(cv_scores)
print("\nMean cross-val score:", round(np.mean(cv_scores), 4))
print("\nTest score:", round(accuracy_score(y_pred, test_Y), 4))

---
<center><h1> Q&A 5</h1> </center>

---

#### **Hyperparameter Tuning and grid-search**

Almost all algorithms have hyperparameters that can be tuned to fine-tune their performance, reduce over-fitting and better capture the patterns in the dataset. Having a good understanding and intuition of how algorithms work is essential to fully utilize hyperparameter tuning for the purposes of improving model performance and testing different modeling strategies. Here we will tune the `max_depth` and `max_features` hyperparameters of the decision tree classifier to improve its performance.

In [None]:
# Get all parameters of a decision tree



**Tuning maximum depth**

From the `sklearn` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html):

> The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than 2 samples.


<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/max_depth.png?raw=true" width="50%">
</p>


The higher this number is, the more likely the model is to overfit. Let's try a `max_depth` of 4.

In [None]:
# Import relevant modules
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score

# Instantiate a decision tree with max_depth = 4


# Get cross validation scores


# Fit on training data and get predictions



# Print accuracy scores
print(cv_scores)
print("\nMean cross-val score:", round(np.mean(cv_scores), 4))
print("\nTest score:", round(accuracy_score(y_pred, test_Y), 4))

**Tuning maximum features**

From the `sklearn` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html):

> The number of features to consider when looking for the best split.

It can take several kinds of values, including:
- `"sqrt"` so that `max_num_features = sqrt(num_features)`.
- A float between 0 and 1 so that it is the fraction of features considered.
- Or `int` - considering the exact number of features. 
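
To see both hyperparameters in action, here is a sketch on synthetic data with 16 features. The dataset and the specific values are assumptions chosen so the effect is easy to read.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 16 features for illustration only
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, max_features="sqrt", random_state=0)

# get_params() returns every hyperparameter as a dictionary
params = tree.get_params()
print(params["max_depth"], params["max_features"])

tree.fit(X, y)

# The fitted tree never grows deeper than max_depth
print(tree.get_depth())

# With "sqrt" and 16 features, 4 features are considered at each split
print(tree.max_features_)
```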

In [None]:
# Import relevant modules
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import cross_val_score

# Instantiate a decision tree with max_depth = 4 and max_features = 25


# Get cross validation scores


# Fit on training data and get predictions


# Print accuracy scores
print(cv_scores)
print("\nMean cross-val score:", round(np.mean(cv_scores), 4))
print("\nTest score:", round(accuracy_score(y_pred, test_Y), 4))

**Using grid-search**

Grid-search is a hyperparameter tuning algorithm that sequentially goes through every possible combination of the hyperparameter values it is given. For example, for hyperparameters `parameter 1` and `parameter 2` - it would mean testing out all possible combinations of their values:


<p align="center">
<img src="https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/assets/grid-search.gif?raw=true" width="50%">
</p>


Grid-search can be done using the `GridSearchCV()` class - it takes in as arguments:

- The model being used.
- The possible parameters to test - inputted as a dictionary. 
- `cv`: The number of cross-validation folds.
- `verbose`: More detailed output if `2`.

**Note**: Grid-search can be very time-consuming if you are testing many different combinations using a complex learning model. [`RandomizedSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) could be a better alternative.
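
Below is a sketch of a small grid-search on synthetic data. The parameter grid, the dataset and the use of 5 folds are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 3 values x 3 values = 9 combinations to evaluate
param_grid = {
    "max_depth": [2, 4, 6],
    "max_features": ["sqrt", 0.5, None],
}

tree = DecisionTreeClassifier(random_state=0)

# Each combination is scored with 5-fold cross-validation
clf = GridSearchCV(tree, param_grid, cv=5, verbose=0)
clf.fit(X, y)

print(clf.best_params_)
print("Best mean cross-val score:", round(clf.best_score_, 4))
```

After fitting, `clf` can be used like any other model, since by default it is refit on the full training data with the best parameters found.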




In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define parameter grid



# Instantiate a decision tree classifier 


# Instantiate a GridSearchCV classifier with 10 fold cross-validation


# Fit clf on training data


In [None]:
# Generate predictions and calculate accuracy score


# Get best parameters and accuracy score



---
<center><h1> Q&A 6</h1> </center>

---


<center><h1>Homework</h1> </center>

Try to break the **80%** accuracy threshold on the test data.

*Tips:* <br>

- Use different models (Random Forest, logistic regression, SVM and more)
- Try hyperparameter-tuning these models - make sure you read the sklearn documentation for each model.
- Investigate engineering new features for your model.

*Submission details:*<br>

Share with us a code snippet with your output on LinkedIn or Twitter <br>
Tag us `@DataCamp` with the hashtag `#datacamplive`<br>
Or reach out on [LinkedIn](https://www.linkedin.com/in/adelnehme/) or [Twitter](https://twitter.com/Adel_Nehme)

