## 1) EDA

In [None]:
telco.info()
telco['Churn'].value_counts()

2850 No, 483 Yes (Churners)

The .groupby() method is incredibly useful when you want to investigate specific columns of your dataset. Here, you're going to explore the 'Churn' column further to see if there are differences between churners and non-churners. A subset version of the telco DataFrame, consisting of the columns 'Churn', 'CustServ_Calls', and 'Vmail_Message' is available in your workspace.

In [None]:
print(telco.groupby(['Churn']).std())

CHURN        CustServ_Calls       Vmail_Message

Yes          1.16                 13.91
No           1.85                 11.96

In [None]:
print(telco.groupby(['Churn']).mean())

When dealing with customer data, geographic regions may play an important part in determining whether a customer will cancel their service or not. You may have noticed that there is a 'State' column in the dataset. In this exercise, you'll group 'State' and 'Churn' to count the number of churners and non-churners by state. 

For example, if you wanted to group by x and aggregate by y, you could use .groupby() as follows:

df.groupby('x')['y'].value_counts()

In [None]:
print(telco.groupby('State')['Churn'].value_counts())
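
To make the shape of the result easier to see, here is a minimal sketch on a hypothetical toy DataFrame (the `toy` data and the `counts_wide` name are assumptions, not part of the telco dataset). It also shows how `.unstack()` can turn the grouped counts into one row per state.

```python
import pandas as pd

# Hypothetical toy data standing in for the telco DataFrame
toy = pd.DataFrame({
    "State": ["KS", "KS", "KS", "OH", "OH", "NJ"],
    "Churn": ["no", "yes", "no", "no", "no", "yes"],
})

# Count churners and non-churners within each state
counts = toy.groupby("State")["Churn"].value_counts()

# Pivot the inner index level into columns; missing combinations become 0
counts_wide = counts.unstack(fill_value=0)
print(counts_wide)
```

The wide layout is often easier to scan when there are many states, since each state occupies a single row.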

In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of 'Day_Mins'
# (histplot with kde=True replaces the deprecated distplot)
sns.histplot(telco['Day_Mins'], kde=True)

# Display the plot
plt.show()

All of these features ('Day_Mins','Eve_Mins','Night_Mins','Intl_Mins') appear to be well approximated by the normal distribution. If this were not the case, we would have to consider applying a feature transformation of some kind.
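
As an illustration of what such a feature transformation could look like, the sketch below applies a log transform to a synthetic right-skewed feature. The data is generated here purely for demonstration and is an assumption; it is not taken from the telco dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature (log-normal), not taken from the telco data
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# log1p compresses the long right tail
transformed = np.log1p(skewed)

print("skew before:", skewed.skew())
print("skew after: ", transformed.skew())
```

A skewness close to 0 after the transform suggests the feature is closer to symmetric, which is usually what we are after here.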

In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create the box plot. sym="" hides the outliers; hue="Vmail_Plan" shows whether
# having a voice mail plan affects the number of customer service calls or churn
sns.boxplot(x='Churn',
            y='CustServ_Calls',
            data=telco,
            sym="",
            hue="Vmail_Plan")

# Display the plot
plt.show()

## 2) Data Preprocessing

It is preferable to have features like 'Churn' encoded as 0 and 1 instead of no and yes, so that you can then feed it into machine learning algorithms that only accept numeric values.

In [None]:
# Replace 'no' with 0 and 'yes' with 1 in 'Vmail_Plan'
telco['Vmail_Plan'] = telco['Vmail_Plan'].replace({'no': 0 , 'yes': 1})

# Replace 'no' with 0 and 'yes' with 1 in 'Churn'
telco['Churn'] = telco['Churn'].replace({'no': 0 , 'yes': 1})

# Print the results to verify
print(telco['Vmail_Plan'].head())
print(telco['Churn'].head())

#### One hot encoding

'State' has many distinct labels, such as KS, OH, and NJ. Encoding them as integers (1, 2, 3, 4, 5, ...) can hurt the model, because the model may interpret the numbers as an ordered, increasing quantity.

One-hot encoding instead gives each label its own 0/1 column, like this:

KS    OH    NJ    FR
0     1     0     0
0     0     0     1


Doing this manually would be quite tedious, especially when you have 50 states and over 3000 customers! Fortunately, pandas has a get_dummies() function which automatically applies one hot encoding over the selected feature.

In [None]:
import pandas as pd

# Perform one hot encoding on 'State'
telco_state = pd.get_dummies(telco['State'])
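
The following is a small sketch of the same idea on a hypothetical four-row DataFrame (the `toy` data is an assumption). It also shows one way to attach the encoded columns back to the remaining features with `pd.concat`.

```python
import pandas as pd

# Hypothetical toy data standing in for the telco DataFrame
toy = pd.DataFrame({"State": ["KS", "OH", "NJ", "OH"],
                    "Intl_Calls": [3, 5, 2, 4]})

# One 0/1 column per distinct state; dtype=int gives 0/1 instead of booleans
state_dummies = pd.get_dummies(toy["State"], dtype=int)

# Replace the original column with its one-hot columns
toy_encoded = pd.concat([toy.drop("State", axis=1), state_dummies], axis=1)
print(toy_encoded)
```

Each row has exactly one 1 among the state columns, so no artificial ordering between states is introduced.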

#### Feature scaling

In [None]:
print(telco['Intl_Calls'].describe())
print(telco['Night_Mins'].describe())

These two features are on very different scales: for example, the mean of 'Night_Mins' is about 200, while the mean of 'Intl_Calls' is about 4. So we will scale these features.

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Scale the two features using StandardScaler
telco_scaled = StandardScaler().fit_transform(telco[["Intl_Calls", "Night_Mins"]])

# Add column names back for readability
telco_scaled_df = pd.DataFrame(telco_scaled, columns=["Intl_Calls", "Night_Mins"])

# Print summary statistics
print(telco_scaled_df.describe())
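
To see what StandardScaler does to each column, here is a minimal sketch on hypothetical values (the numbers in `toy` are made up for illustration). After scaling, each column should have a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales
toy = pd.DataFrame({"Intl_Calls": [3, 5, 2, 4, 6],
                    "Night_Mins": [180.0, 220.0, 200.0, 240.0, 160.0]})

scaled = StandardScaler().fit_transform(toy)
scaled_df = pd.DataFrame(scaled, columns=toy.columns)

# Each column now has mean ~0 and (population) standard deviation ~1
print(scaled_df.mean())
print(scaled_df.std(ddof=0))
```

Note that StandardScaler uses the population standard deviation (`ddof=0`), so the `std` row printed by `.describe()`, which uses `ddof=1`, will be slightly above 1 on small samples.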

#### Feature Selection and Engineering

Feature selection is deciding which features will be used in the model.
Feature engineering is creating new features that will help improve our model.

In [None]:
# Drop the unnecessary features
telco = telco.drop(['Area_Code','Phone'],axis=1)

# Verify dropped features
print(telco.columns)

Here, axis=1 indicates that you want to drop 'Area_Code' and 'Phone' from the columns.

In [None]:
# Create a new feature: the average length of the night calls made by each customer
telco['Avg_Night_Calls'] = telco['Night_Mins']/telco['Night_Calls']
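
One thing to watch with ratio features is division by zero when a customer made no calls. The sketch below shows one possible guard on hypothetical data; filling the result with 0 is an assumption about how such customers should be treated, not something the dataset dictates.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last customer made no night calls
toy = pd.DataFrame({"Night_Mins": [200.0, 150.0, 0.0],
                    "Night_Calls": [100, 50, 0]})

# Treat zero calls as missing so the ratio is NaN instead of inf or 0/0
calls = toy["Night_Calls"].replace(0, np.nan)
toy["Avg_Night_Calls"] = toy["Night_Mins"] / calls

# Assumption: a customer with no night calls has an average call length of 0
toy["Avg_Night_Calls"] = toy["Avg_Night_Calls"].fillna(0.0)
print(toy)
```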

## 3) Churn Prediction by Supervised Learning - Sklearn

#### by SVM

In [None]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(telco[features], telco['Churn'])

Note: both the data and the target must be pandas objects (DataFrame or Series) or NumPy arrays.

#### by LogisticRegression

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate the classifier
clf = LogisticRegression()

# Fit the classifier
# The features are contained in the features variable, and the target variable of interest is 'Churn'.
clf.fit(telco[features], telco['Churn'])

# Predict the label of new_customer
print(clf.predict(new_customer))

#### by DecisionTreeClassifier

In [None]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate the classifier
clf = DecisionTreeClassifier()

# Fit the classifier
# The features are contained in the features variable, and the target variable of interest is 'Churn'.
clf.fit(telco[features], telco['Churn'])

# Predict the label of new_customer
print(clf.predict(new_customer))

## 4) Evaluating Model Performance, now by RandomForestClassifier

#### Creating training and test sets

* Before you create any model, it is important to split your dataset into two: a training set which will be used to build your churn model, and a test set which will be used to validate your model.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Create feature variable by dropping the target variable 'Churn'
X = telco.drop('Churn', axis=1)

# Create target variable
y = telco['Churn']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit on the training set, then make predictions for the held-out test set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
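
Because churners are a minority (483 of 3333 customers), a purely random split can give the training and test sets slightly different churn rates. The sketch below shows the `stratify` and `random_state` parameters of `train_test_split` on hypothetical arrays with the same class balance; the `X_toy` and `y_toy` names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with the same class balance as the notebook (2850 vs. 483)
y_toy = np.array([0] * 2850 + [1] * 483)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# The churn rate is (almost) identical in both splits
print(y_tr.mean(), y_te.mean())
```

Setting `random_state` makes the split reproducible, which helps when comparing models across runs.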

#### Checking the datasets

In [None]:
print(len(X_train), len(X_test))

print(X_train.shape, X_test.shape)

#### Computing accuracy

* Having split our data into training and testing sets, we can now fit our model to the training data and then predict the labels of the test data.

* So far we've used Logistic Regression and Decision Trees. Here, we'll use a RandomForestClassifier, which we can think of as an ensemble of Decision Trees that generally outperforms a single Decision Tree.

In [None]:
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train,y_train)

# Compute accuracy
print(clf.score(X_test, y_test))

[0.93]

#### Model Metrics - CONFUSION MATRIX

* True positive (TP): actual is churn, predicted as churn
* True negative (TN): actual is non-churn, predicted as non-churn
* False positive (FP): actual is non-churn, predicted as churn
* False negative (FN): actual is churn, predicted as non-churn

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

* Using scikit-learn's confusion_matrix() function, you can easily create your classifier's confusion matrix and gain a more nuanced understanding of its performance. It takes in two arguments: The actual labels of your test set - y_test - and your predicted labels.

The predicted labels of your Random Forest classifier from the previous exercise are stored in y_pred and were computed as follows:

In [None]:
y_pred = clf.predict(X_test)

In [None]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))

[[842  13]
 [ 53  92]]

842 = TN
92  = TP
53  = FN
13  = FP
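
As a worked example, the precision and recall formulas can be applied directly to the four counts above. This is only a sketch of the arithmetic; `cm` is typed in by hand here rather than computed from a fitted model.

```python
import numpy as np

# Confusion matrix from above, in scikit-learn's layout:
# rows are actual labels, columns are predicted labels
cm = np.array([[842, 13],
               [53, 92]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / cm.sum()

print(round(precision, 3))  # 0.876
print(round(recall, 3))     # 0.634
print(round(accuracy, 3))   # 0.934
```

Accuracy looks high, but recall shows that the model misses more than a third of the actual churners, which is why accuracy alone can be misleading on imbalanced data.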

#### Varying training set size to get rid of overfitting and underfitting

Models learn better when they have more training data. However, there's a risk that they overfit to the training data and don't generalize well to new data. Setting test_size=0.2 instead of 0.3 gives the following confusion matrix:

In [None]:
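X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Refit on the new training set and predict again before evaluating
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))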

In [None]:
[[550   8]
 [ 38  71]]

This classifier has higher precision than the previous one: 71 / (71 + 8) ≈ 0.90, compared with 92 / (92 + 13) ≈ 0.88.

#### Computing precision and recall

In [None]:
# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Import precision_score
from sklearn.metrics import precision_score

# Print the precision
print(precision_score(y_test, y_pred))

# Import recall_score
from sklearn.metrics import recall_score

# Print the recall
print(recall_score(y_test, y_pred))

#### ROC Curve and computing AUC score

When we need to check or visualize the performance of a binary classification problem such as churn prediction, we use the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve). It is one of the most important evaluation metrics for checking any classification model's performance.

In [None]:
# Generate the probabilities
y_pred_prob = clf.predict_proba(X_test)[:, 1]

# Import roc_curve
from sklearn.metrics import roc_curve

# Calculate the roc metrics
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot the ROC curve, 
# fpr = false positive rate, tpr = true positive rate
plt.plot(fpr, tpr)

# Add labels and diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot([0, 1], [0, 1], "k--")
plt.show()

In [None]:
# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Print the AUC
print(roc_auc_score(y_test, y_pred_prob))
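
One way to build intuition for the AUC is that it equals the fraction of (churner, non-churner) pairs in which the churner receives the higher predicted probability. The sketch below checks this on four hypothetical predictions; `y_true` and `y_scores` are made-up values, not model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted churn probabilities
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_scores)

# Same number by hand: fraction of (churner, non-churner) pairs ranked correctly
pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)  # 0.75 0.75
```

An AUC of 0.5 corresponds to random ranking (the dashed diagonal in the ROC plot), and 1.0 to a perfect ranking.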

#### Precision-Recall Curve - F1 Score

Another way to evaluate model performance is using a precision-recall curve, which shows the tradeoff between precision and recall for different thresholds.
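
As a rough sketch of how the points of such a curve can be obtained, scikit-learn provides `precision_recall_curve`. The labels and scores below are hypothetical; in the notebook they would be `y_test` and the predicted probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted churn probabilities
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# One (precision, recall) pair per threshold, plus a final (1, 0) endpoint
for p, r in zip(precision, recall):
    print(round(p, 2), round(r, 2))
```

Plotting `recall` on the x-axis against `precision` on the y-axis with `plt.plot(recall, precision)` gives the precision-recall curve.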

AUC is one metric you can use to quantify model performance. Another is the F1 score, which is calculated as follows:

F1 = 2 * (precision * recall) / (precision + recall)

A high F1 score is a sign of a well-performing model.

In [None]:
# Import f1_score
from sklearn.metrics import f1_score

# Print the F1 score
print(f1_score(y_test, y_pred))
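
For a worked example of the F1 formula, the sketch below plugs in the counts from the first confusion matrix above (TP = 92, FP = 13, FN = 53). The numbers are typed in by hand, so this only illustrates the arithmetic.

```python
tp, fp, fn = 92, 13, 53  # from the first confusion matrix above

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# Equivalent closed form in terms of the raw counts
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1, 3))  # 0.736
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, here the recall.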

## 5) Tuning The Model

Each model has hyperparameters. The default hyperparameters used by your models are not optimized for your data, but we can modify them.

### Tuning the number of features with GridSearchCV

The goal of grid search cross-validation is to identify those hyperparameters that lead to optimal model performance. 

The n_estimators hyperparameter controls the number of trees to use in the forest, while the max_features hyperparameter controls the number of features the random forest should consider when looking for the best split in each decision tree.

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
# ('auto', an alias for 'sqrt' in classifiers, was removed in scikit-learn 1.3)
param_grid = {'max_features': ['sqrt', 'log2']}

# Call GridSearchCV
grid_search = GridSearchCV(clf, param_grid)

# Fit the model
grid_search.fit(X, y)

# Print the optimal parameters
print(grid_search.best_params_)

{'max_features': 'log2'}

It looks like using the base-2 logarithm of the number of features leads to optimal model performance. By default, the model uses the square root of the number of features.

#### Tuning multiple hyperparameters

The power of GridSearchCV really comes into play when you're tuning multiple hyperparameters, as then the algorithm tries out all possible combinations of hyperparameters to identify the best combination. Here, you'll tune the following random forest hyperparameters:



Hyperparameter : Purpose

criterion      : Quality of split
max_features   : Number of features for best split
max_depth      : Max depth of tree
bootstrap      : Whether bootstrap samples are used

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Call GridSearchCV
grid_search = GridSearchCV(clf, param_grid)

# Fit the model
grid_search.fit(X,y)

# Print the optimal parameters
print(grid_search.best_params_)
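
To get a feel for how quickly the search grows, the sketch below counts the combinations in this grid using scikit-learn's `ParameterGrid`. The fold count of 5 is an assumption based on the default `cv` in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # GridSearchCV's default cv in recent scikit-learn versions

print(n_candidates)            # 24
print(n_candidates * n_folds)  # 120 model fits, plus one final refit
```

Adding one more hyperparameter with three values would triple the number of fits, which is the cost that motivates cheaper search strategies.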

### Tuning with Randomized Search

As the hyperparameter grid gets larger, grid search becomes slower. To solve this problem, instead of trying out every single combination of values, we can randomly jump around the grid and try different combinations. In scikit-learn, you can do this using RandomizedSearchCV.

Note that max_features below is a distribution, randint(1, 11), rather than a fixed list of values.

In [None]:
# Import RandomizedSearchCV and randint (used to sample max_features)
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Create the hyperparameter grid
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Call RandomizedSearchCV
random_search = RandomizedSearchCV(clf, param_dist)

# Fit the model
random_search.fit(X,y)

# Print best parameters
print(random_search.best_params_)
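
The number of sampled combinations is controlled by `n_iter`, and `random_state` makes the sampling reproducible. Below is a small self-contained sketch on synthetic data from `make_classification`; the data, the reduced forest size, and the chosen `n_iter` and `cv` values are assumptions made to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the telco features and target
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),  # integers 1..10 inclusive
              "criterion": ["gini", "entropy"]}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_dist,
    n_iter=5,         # number of sampled combinations
    cv=3,
    random_state=0,   # makes the sampled combinations reproducible
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

With `n_iter=5` only five combinations are evaluated, regardless of how large the search space is.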

### Visualising Feature importances

In [None]:
# Calculate feature importances
importances = clf.feature_importances_

# Create plot
plt.barh(range(X.shape[1]), importances)
plt.show()

#### Improving the plot to understand better

In order to make the plot more readable, we need to achieve two goals:

* Re-order the bars in ascending order.
* Add labels to the plot that correspond to the feature names.

The Numpy .argsort() method sorts an array and returns the indices.

In [None]:
# Sort importances (np.argsort returns the indices in ascending order)
import numpy as np
sorted_index = np.argsort(importances)

# Create labels
labels = X.columns[sorted_index]

# Clear current plot
plt.clf()

# Create plot
plt.barh(range(X.shape[1]), importances[sorted_index], tick_label=labels)
plt.show()
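
To make the role of `np.argsort` concrete, here is a minimal sketch with hypothetical importances and feature names (both arrays are made up for illustration and do not come from the fitted model).

```python
import numpy as np

# Hypothetical importances and feature names
importances = np.array([0.10, 0.45, 0.05, 0.40])
names = np.array(["Intl_Calls", "Day_Mins", "Night_Calls", "CustServ_Calls"])

# argsort returns the indices that would sort the array in ascending order
sorted_index = np.argsort(importances)

print(sorted_index)         # [2 0 3 1]
print(names[sorted_index])
```

Since `plt.barh` draws bars from the bottom up, ascending order places the most important feature at the top of the plot.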

The plot tells us that CustServ_Calls, Day_Mins and Day_Charge are the most important drivers of churn.