<p align="center">
<img src="https://github.com/adelnehme/data-upskilling-learning-club-IV/blob/master/assets/dc_amazon_logo.png?raw=True" alt = "DataCamp icon" width="50%">
</p>


## **Data UpSkilling Learning Club: Introduction to Machine Learning with Python**

<br>

#### **Key Session Takeaways**

- Understand the different types of machine learning and when to use them
- Know when machine learning is applicable and when it’s not
- Discuss the different components of the machine learning workflow
- Apply a simple supervised learning workflow to classify customer churn

<br>

#### **The Dataset**

The dataset to be used in this session is a CSV file named `telco.csv`, which contains data on telecom customers churning and some of their key behaviors. It contains the following columns:

**Features**:

- `customerID`: Unique identifier of a customer.
- `gender`: Gender of customer.
- `SeniorCitizen`: Binary variable indicating if customer is senior citizen.
- `Partner`: Binary variable if customer has a partner.
- `Dependents`: Binary variable if customer has dependent.
- `tenure`: Number of weeks as a customer.
- `PhoneService`: Whether customer has phone service.
- `MultipleLines`: Whether customer has multiple lines.
- `InternetService`: What type of internet service customer has (`"DSL"`, `"Fiber optic"`, `"No"`).
- `OnlineSecurity`: Whether customer has online security service.
- `OnlineBackup`: Whether customer has online backup service.
- `DeviceProtection`: Whether customer has device protection service.
- `TechSupport`: Whether customer has tech support service.
- `StreamingTV`: Whether customer has TV streaming service.
- `StreamingMovies`: Whether customer has movies streaming service.
- `Contract`: Customer Contract Type (`'Month-to-month'`, `'One year'`, `'Two year'`).
- `PaperlessBilling`: Whether paperless billing is enabled.
- `PaymentMethod`: Payment method.
- `MonthlyCharges`: Amount of monthly charges in $.
- `TotalCharges`: Amount of total charges so far.

**Target Variable**:

- `Churn`: Whether customer `'Stayed'` or `'Churned'`.


## **Data Import & Cleaning**

In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn

In [None]:
# Read in dataset
telco = pd.read_csv('https://raw.githubusercontent.com/adelnehme/machine-learning-with-scikit-learn-live-training/master/data/telco_churn.csv')
pd.set_option('display.max_columns', None)

In [None]:
# Print header


In [None]:
# Print info


The **null model** is a reference model for classification accuracy: it always predicts the most frequent class *(or outcome)*, and its accuracy is called the **null accuracy**. Accuracy is defined as follows:

<br>


$$\large{accuracy = \frac{\# \space times \space model \space is \space right}{total \space number \space of \space predictions}}$$



In [None]:
# Find the null model


$$\large{null \space accuracy = \frac{\# \space times \space model \space predicted \space "Stayed"}{total \space number \space of \space predictions}}$$

In [None]:
# Find the null model


In this particular instance, the null accuracy (always predicting `"Stayed"`) is 73.4%, so any meaningful model will have to beat that accuracy score.
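
As a minimal sketch of the null accuracy calculation, the snippet below uses a small made-up `Churn` column (an assumption for illustration, not the real `telco` data) and takes the share of the most frequent class with `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy target column; in the session this would be telco["Churn"]
churn = pd.Series(
    ["Stayed", "Stayed", "Churned", "Stayed", "Churned",
     "Stayed", "Stayed", "Churned", "Stayed", "Stayed"],
    name="Churn",
)

# Share of each class in the target
class_shares = churn.value_counts(normalize=True)

# The null model always predicts the most frequent class,
# so the null accuracy is simply that class's share
most_frequent_class = class_shares.idxmax()
null_accuracy = class_shares.max()

print(most_frequent_class, round(null_accuracy, 4))
```

The same two lines should apply to the real target column once the dataset is loaded.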

In [None]:
# Drop customer ID column


---
<center><h1> Q&A 1</h1> </center>

---

## **Exploratory Analysis for Machine Learning**

To understand which features have predictive power, which variables to use in feature engineering, and what is driving churn, it is essential to explore the data and observe how **features** interact with the **target** variable. To make this analysis convenient, we will separate features into:

- Numeric columns _(e.g., Age)_
- Categorical columns _(e.g., Marriage Status)_

In [None]:
# Take a look at the header


#### **Isolating categorical and numeric column names**

We can get all the data types of a DataFrame using the `.dtypes` attribute.

In [None]:
# Get dtypes of columns


We can extract column names from a DataFrame using the `.columns` attribute, and extract columns with specific data types using the `.select_dtypes()` method.
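
The sketch below shows one way to combine `.columns` and `.select_dtypes()` to separate column names by type. The `toy` DataFrame is a made-up stand-in for `telco`, and the names `categorical_names` and `numeric_names` mirror the ones used in the plotting cells later on.

```python
import pandas as pd

# Hypothetical miniature of the telco data, for illustration only
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["Stayed", "Stayed", "Churned"],
})

# Keep features only (drop the target)
features = toy.drop(columns="Churn")

# All feature names
feature_names = features.columns.tolist()

# Numeric columns, and everything else treated as categorical
numeric_names = features.select_dtypes(include="number").columns.tolist()
categorical_names = features.select_dtypes(exclude="number").columns.tolist()

print(categorical_names)
print(numeric_names)
```

Treating every non-numeric column as categorical is a simplifying assumption that suits this dataset; other data may need dates or free text handled separately.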

In [None]:
# Get all feature names


In [None]:
# Get categorical and numeric column names



> #### **Data Visualization Refresher** 
> 
> A `matplotlib` visualization is made of 3 components:
> - A **figure** which houses one or many subplots (or axes).
> - The **axes** objects ~ the subplots within the figure.
> - The plot inside each subplot or axes.
>
> We can generate a figure with subplots using the following function:
>
> `fig, axes = plt.subplots(nrow, ncol)`
> 
> <p align="left">
<img src="https://github.com/adelnehme/intro-to-data-visualization-Python-live-training/blob/master/images/subplots.gif?raw=true" width="55%">
</p>


#### **Visualizing target variable (Churn) relationship with categorical features (columns)**

To visualize the count of different categorical values by `Churn`, we can use the `sns.countplot(x, hue, data, ax)` function which takes in:
- `x`: The column name being counted.
- `hue`: The column name used for grouping the data.
- `data`: The DataFrame being visualized.
- `ax`: Which axes in the figure to assign the plot.
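
As a rough sketch of how these arguments fit together, the example below draws one countplot per categorical column on a small made-up DataFrame. The `toy` data and the two-column `categorical_names` list are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for telco
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "PaperlessBilling": ["Yes", "No", "Yes", "No"],
    "Churn": ["Churned", "Stayed", "Churned", "Stayed"],
})
categorical_names = ["Contract", "PaperlessBilling"]

# One subplot per categorical column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

for ax, column in zip(axes.flatten(), categorical_names):
    # Count each category, split by Churn, and draw on this subplot
    sns.countplot(x=column, hue="Churn", data=toy, ax=ax)
    ax.set_title(column)
```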

In [None]:
# Setting aesthetics for better viewing
plt.rcParams["axes.labelsize"] = 5
sns.set(font_scale=5) 

# Create figure and axes
fig, axes = plt.subplots(5, 3, figsize = (100, 100))

# Iterate over each axes, and plot a countplot with categorical columns
for ax, column in zip(axes.flatten(), categorical_names):
    
    # Create countplot
    
    

#### **Visualizing target variable relationship with continuous features**

A great way to observe the differences between two groups (or categories) of data according to a numeric value is a boxplot, which visualizes the following:

<p align="left">
<img src="https://github.com/adelnehme/intro-to-data-visualization-Python-live-training/blob/master/images/boxplot.png?raw=true" alt = "DataCamp icon" width="80%">
</p>

It can be visualized as such:

- `sns.boxplot(x=, y=, data=)`
  - `x`: Categorical variable we want to group our data by.
  - `y`: Numeric variable being observed by group.
  - `data`: The DataFrame being used.
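
A minimal sketch of the same loop pattern with boxplots follows. The `toy` DataFrame and the `numeric_names` list are made up for illustration; the `ax` argument places each plot on its own subplot.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for telco
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10],
    "Churn": ["Churned", "Stayed", "Churned", "Stayed", "Churned", "Stayed"],
})
numeric_names = ["tenure", "MonthlyCharges"]

# One subplot per numeric column
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for ax, column in zip(axes.flatten(), numeric_names):
    # Compare the distribution of this numeric column across Churn groups
    sns.boxplot(x="Churn", y=column, data=toy, ax=ax)
    ax.set_title(column)
```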
  

In [None]:
# Setting aesthetics for better viewing
plt.rcParams["axes.labelsize"] = 1
sns.set(font_scale=1) 
 
# Create figure and axes
fig, axes = plt.subplots(1, 3, figsize = (20, 8))

# Iterate over each axes, and plot a boxplot with numeric columns
for ax, column in zip(axes.flatten(), numeric_names):
    
    # Create a boxplot
    

---
<center><h1> Q&A 2</h1> </center>

---

## **Data pre-processing for machine learning**

Many machine learning algorithms require data to be processed before it is passed into them - and the processing strategy differs depending on whether the data is numeric or categorical.


**Train-test split**

Since we have one DataFrame that is fully labelled, we want to evaluate accuracy on unseen data. To do so, we can split our data into training data, i.e. the data (with its labels) that a machine learning algorithm is trained on, and test data, i.e. the data treated as "unseen data" and used to evaluate the accuracy of our model.

<br>

<p align="center">
<img src="https://github.com/adelnehme/data-upskilling-learning-club-IV/blob/master/assets/telco.png?raw=true" width="68%">
</p>


<br>






In [None]:
# Split data between X (features) and y (target)


We can split features (`X`) and target (`y`) using the `train_test_split()` function from `sklearn.model_selection` — the arguments it takes are:

- `X`: The features 
- `y`: The target variable
- `test_size`: Size of the test set, here `0.25`
- `random_state`: Takes in any number and makes the split reproducible
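
The arguments above can be sketched on a small made-up DataFrame as follows. The `toy` data and the `random_state` value of `123` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for telco
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10, 29.75, 104.80],
    "Churn": ["Stayed", "Stayed", "Churned", "Stayed",
              "Churned", "Stayed", "Churned", "Stayed"],
})

# Split data between X (features) and y (target)
X = toy.drop(columns="Churn")
y = toy["Churn"]

# Hold out 25% of the rows as test data; random_state fixes the shuffle
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.25, random_state=123
)

print(train_X.shape, test_X.shape)
```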

In [None]:
# Import train_test_split


# Split data into train test splits




**Continuous or numeric data**

Many machine learning models make assumptions about the distribution of numeric features when modeling (most commonly data is assumed to be normally distributed). Also, many numeric columns have different scales _(e.g. Age vs Salary)_. 

A common way to process numeric columns is through **standardization** - where we subtract their mean and divide by their standard deviation so that each column has a mean of 0 and a standard deviation of 1:

<br>

$$\large{x_{scaled} = \frac{x - mean}{std}}$$

<br>

<p align="center">
<img src="https://github.com/adelnehme/data-upskilling-learning-club-IV/blob/master/assets/standard_scaler.gif?raw=true?resize" width="45%">
</p>


<br>

We can do this easily in `sklearn` by using the `StandardScaler` class. Many operations in `sklearn` follow the `.fit()` $\rightarrow$ `.transform()` paradigm and `StandardScaler` is no different:

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

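# Fit on data (double brackets keep the input 2-D, as sklearn expects)
scaler.fit(df[[my_column]])

# Transform
column_scaled = scaler.transform(df[[my_column]])

# Replace column (flatten the (n, 1) array back to 1-D)
df[my_column] = column_scaled.ravel()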
```

However, it is very important to **split your data first**, before scaling your features, since the scaler should not learn anything from the distribution of the test data. Failing to do so results in **data leakage** and could lead to "too good to be true" results on testing data with relatively weaker results on truly unseen data.

<font color=00AAFF>Ideally, scalers should be fit on **training data only** - and be used to transform both training and testing data.</font>
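
A minimal sketch of this fit-on-train, transform-both pattern is shown below. The `train_X` and `test_X` frames are made up and assumed to be already split, and `numeric_names` is the assumed list of numeric columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical, already-split toy data
train_X = pd.DataFrame({
    "tenure": [1.0, 10.0, 20.0, 45.0],
    "MonthlyCharges": [30.0, 50.0, 70.0, 90.0],
})
test_X = pd.DataFrame({
    "tenure": [5.0, 60.0],
    "MonthlyCharges": [40.0, 100.0],
})
numeric_names = ["tenure", "MonthlyCharges"]

# Initialize a scaler and fit it on training data only
scaler = StandardScaler()
scaler.fit(train_X[numeric_names])

# Transform both sets with the mean and std learned from the training data
train_X[numeric_names] = scaler.transform(train_X[numeric_names])
test_X[numeric_names] = scaler.transform(test_X[numeric_names])

print(train_X.round(2))
print(test_X.round(2))
```

After scaling, the training columns have mean 0 and standard deviation 1, while the test columns generally do not, which is expected since they were scaled with the training statistics.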

In [None]:
# Import StandardScaler


# Initialize a scaler


# Fit on training data


# Transform training and test data


In [None]:
# Replace columns in training and testing data accordingly


In [None]:
# See changes


**Categorical data**

While categorical variables like country, marriage status and more are easily interpretable by humans - they need to be properly encoded to be understood by machine learning algorithms. We will be using dummy encoding *(highly similar to one-hot encoding)* where categorical variables are converted to binary (`1`,`0`) columns to indicate whether they have a certain value or not. Note that dummy encoding generates `n-1` columns for a variable with `n` categories. Using a country example - `0` on all columns encodes it as France.

<br>


<p align="center">
<img src="https://github.com/adelnehme/data-upskilling-learning-club-IV/blob/master/assets/onehot_dummy.gif?raw=true" width="80%">
</p>

Using dummy encoding in `pandas` is actually very easy - we can use the `pd.get_dummies()` function which takes:

- The DataFrame being converted.
- `columns`: The name of the categorical columns to be converted.
- `drop_first`: Boolean to indicate onehot encoding (`False`) or dummy encoding (`True`).
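
To make the difference between the two settings concrete, here is a small sketch on a made-up `Country` column (the data is an assumption for illustration). With `drop_first=True`, the first category (`France`) is dropped and is encoded as `0` in all remaining columns.

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({
    "Country": ["France", "Germany", "Spain", "France"],
    "tenure": [1, 34, 2, 45],
})

# One-hot encoding: one column per category (n columns)
one_hot = pd.get_dummies(toy, columns=["Country"], drop_first=False)

# Dummy encoding: first category dropped (n-1 columns)
dummy = pd.get_dummies(toy, columns=["Country"], drop_first=True)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```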




In [None]:
# One hot encode cat variables


In [None]:
# See changes


**Feature Engineering**

Generating new predictive features from existing features is an important aspect of machine learning. New features could be engineered using:

- Binning numeric values _(e.g. `age_category` column from `age` column)._
- Interaction of 2 columns _(e.g. `total_salary`/`tenure`)._
- Features from domain knowledge.

We learned while visualizing categorical columns that being subscribed to `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, and `TechSupport` tends to be associated with less churn. Let's explore this further with a new feature called `in_ecosystem`, based on the number of services a given customer is subscribed to.
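
One possible way to build such a feature is sketched below on made-up dummy-encoded columns. The intermediate `service_count` name and the threshold of 2 services are assumptions for illustration; the column names follow the `_Yes` pattern produced by dummy encoding.

```python
import pandas as pd

# Hypothetical dummy-encoded training data
train_X = pd.DataFrame({
    "OnlineSecurity_Yes": [1, 0, 1, 0],
    "OnlineBackup_Yes": [1, 0, 0, 0],
    "DeviceProtection_Yes": [0, 0, 1, 0],
    "TechSupport_Yes": [1, 1, 0, 0],
})
service_columns = ["OnlineSecurity_Yes", "OnlineBackup_Yes",
                   "DeviceProtection_Yes", "TechSupport_Yes"]

# Count how many of the services each customer is subscribed to
service_count = train_X[service_columns].sum(axis=1)

# 1 if the customer has 2 or more services, 0 otherwise
train_X["in_ecosystem"] = (service_count >= 2).astype(int)

print(train_X["in_ecosystem"].tolist())
```

The same two steps would then be repeated on `test_X` so that both sets have identical columns.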


In [None]:
# Service columns
service_columns = ['OnlineSecurity_Yes', 'OnlineBackup_Yes', 'DeviceProtection_Yes', 'TechSupport_Yes']

# Create in_ecosystem column


# Create feature that is 1 if 2 or more services subscribed, 0 otherwise

# Apply the same on test_X


In [None]:
# See changes


---
<center><h1> Q&A 3</h1> </center>

---

## **Modeling**

Most machine learning models for classification aim at creating a decision boundary between data points to generate predictions. For example, here is a decision line where the target variable is whether a tumor is benign or cancerous based on tumor height and width:

<br>

<p align="center">
<img src="https://github.com/adelnehme/data-upskilling-learning-club-IV/blob/master/assets/decision_boundary.gif?raw=true?" width="50%">
</p>



#### **Using K-Nearest Neighbors to Generate Predictions**

The K-Nearest Neighbors algorithm predicts the label of an unseen data point by taking the most common label among the `K` closest points to it. Using our cancerous/benign tumor example, K-Nearest Neighbors would behave like this:


<br>

<p align="center">
<img src="https://github.com/adelnehme/data-upskilling-learning-club-IV/blob/master/assets/knn.gif?raw=true?" width="50%">
</p>

Just like almost all algorithms in `sklearn` - `KNeighborsClassifier()` needs to be instantiated and follows the `.fit()` $\rightarrow$ `.predict()` paradigm as such:

```
# Import algorithm
from sklearn.neighbors import KNeighborsClassifier

# Instantiate it
knn = KNeighborsClassifier(n_neighbors = k)

# Fit on training data
knn.fit(train_X, train_Y)

# Create predictions
predictions = knn.predict(test_X)
```
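
For a version of the template above that runs end to end, the sketch below swaps in synthetic data from `make_classification` (an assumption for illustration, not the telco data) and scores the predictions with `accuracy_score`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 4 numeric features, binary target
X, y = make_classification(n_samples=200, n_features=4, random_state=123)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.25, random_state=123
)

# Instantiate K Nearest Neighbors with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit on training data
knn.fit(train_X, train_Y)

# Create predictions on the held-out data
predictions = knn.predict(test_X)

# Share of test rows where the prediction matches the true label
test_accuracy = accuracy_score(test_Y, predictions)
print(round(test_accuracy, 4))
```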

In [None]:
# Import K-Nearest Neighbor Classifier and accuracy_score
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score

# Instantiate K Nearest Neighbors with 6 neighbors


# Fit on training data

# Create Predictions

# Calculate accuracy score on testing data

# Print test accuracy score rounded to 4 decimals


---
<center><h1> Q&A 4</h1> </center>

---

#### **Hyperparameter Tuning**

Almost all algorithms have hyperparameters that can be tuned to improve their performance and better capture the patterns in the dataset. Having a good understanding and intuition of how algorithms work is essential to fully utilize hyperparameter tuning for the purposes of improving model performance and testing different modeling strategies.

**Tuning the number of neighbors**
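
As a rough sketch, several values of `n_neighbors` can be compared in a loop. The synthetic data and the candidate values below are assumptions for illustration. In practice, comparing candidates on a separate validation set or with cross-validation is usually preferable to picking `k` from the test score.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 4 numeric features, binary target
X, y = make_classification(n_samples=200, n_features=4, random_state=123)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.25, random_state=123
)

# Fit one model per candidate k and record its accuracy on held-out data
scores = {}
for k in [2, 4, 6, 8, 10]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_X, train_Y)
    scores[k] = accuracy_score(test_Y, knn.predict(test_X))

# Candidate with the highest held-out accuracy
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```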


In [None]:
# Instantiate K Nearest Neighbors with 8 neighbors


# Fit on training data

# Create Predictions

# Calculate accuracy score on testing data

# Print test accuracy score rounded to 4 decimals


---
<center><h1> Q&A 5</h1> </center>

---