[Note: Run the cells in the Data Preparation part first, then start answering the questions.]

### Data Preparation
For this task, you will perform the following steps:
- Load all the necessary packages for this exercise
- Load the data
- Split the data into input features and the target variable
- Set categorical columns as "Categorical" in the input dataset
- Split the data into training and validation datasets
- Standardize numeric variables in the datasets

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, mean_squared_error

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing the dataset
telecom = pd.read_csv("telecom_churn_dataset.csv")

In [3]:
# Check the dimensions of the dataset
telecom.shape

(7032, 21)

In [4]:
non_categorical_columns = ['tenure','MonthlyCharges','TotalCharges']
for column in telecom.columns:
    if column not in non_categorical_columns:
        telecom[column] = pd.Categorical(telecom[column])

In [5]:
X = telecom.drop(['Churn','customerID'], axis=1)
y = telecom['Churn']

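# Encode the raw categorical columns. X2 must be built before X is overwritten,
# because get_dummies leaves already-encoded columns unchanged
X2 = pd.get_dummies(X, drop_first=True) #for logistic regression
X = pd.get_dummies(X, drop_first=False) #for kNN and trees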

X_train, X_val, X2_train, X2_val, y_train, y_val = train_test_split(X, X2, y, test_size=0.3, random_state = 1)

# Standardize our non-dummy variables
scaler = StandardScaler()
X_train[['tenure','MonthlyCharges','TotalCharges']]= scaler.fit_transform(X_train[['tenure','MonthlyCharges','TotalCharges']])
X_val[['tenure','MonthlyCharges','TotalCharges']]= scaler.transform(X_val[['tenure','MonthlyCharges','TotalCharges']])

X2_train[['tenure','MonthlyCharges','TotalCharges']]= scaler.fit_transform(X2_train[['tenure','MonthlyCharges','TotalCharges']])
X2_val[['tenure','MonthlyCharges','TotalCharges']]= scaler.transform(X2_val[['tenure','MonthlyCharges','TotalCharges']])
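
The difference between the two encodings, and the reason the encoding has to be applied to the raw categorical columns, can be seen on a toy frame. This is a minimal sketch; the `toy` frame and its column values are made up for illustration and are not taken from the telecom dataset.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    "tenure": [1, 24, 60],
    "Contract": pd.Categorical(["Monthly", "One year", "Two year"]),
})

full = pd.get_dummies(toy, drop_first=False)    # one column per category level
reduced = pd.get_dummies(toy, drop_first=True)  # first level dropped as the baseline

print(list(full.columns))
print(list(reduced.columns))

# get_dummies only converts object, string, or category columns, so calling it
# again on an already-encoded frame changes nothing, even with drop_first=True
again = pd.get_dummies(full, drop_first=True)
print(list(again.columns) == list(full.columns))
```

Dropping the first level avoids perfectly collinear dummy columns, which matters for an unpenalized logistic regression but not for kNN or trees.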

### Q1 - Value of k: Validation Set

First, we will build a k-NN model for this problem statement. What is the optimal k for fitting a k-NN model using the validation set? (Iterate the k value from 1 to 35)


For this task, you will perform the following steps:
- Find the optimal k value for which the kNN model gives the maximum validation set accuracy

In [8]:
# Define the parameter range of k. Note: np.arange(1, 35) stops at 34; use np.arange(1, 36) to include k=35
knn_clf = KNeighborsClassifier()
param_grid = { 'n_neighbors': np.arange(1, 35) } # Parameter range.

val_acc = []

# Fit a kNN model for each k value, find the validation set accuracy and store them in a list
for k in param_grid['n_neighbors']:
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    y_pred = knn_clf.predict(X_val)
    val_acc.append(accuracy_score(y_val, y_pred))


# Find the k value which gives the maximum validation set accuracy in the list
# (index + 1 because k starts at 1). Despite its name, max_acc holds the best k, not an accuracy
max_acc = np.argmax(val_acc)+1
print("The maximum accuracy is at k=" + str(max_acc))

The maximum accuracy is at k=26
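
The `+ 1` in the cell above works only because the k range starts at 1 with a step of 1. A slightly more robust pattern, sketched here with made-up accuracy values, indexes back into the candidate array instead. The names `ks` and `toy_acc` are illustrative and do not appear in the notebook.

```python
import numpy as np

ks = np.arange(1, 35)                        # candidate k values: 1, 2, ..., 34
toy_acc = np.linspace(0.70, 0.78, len(ks))   # hypothetical accuracies, not real results
toy_acc[25] = 0.80                           # pretend the 26th candidate is the best

best_idx = int(np.argmax(toy_acc))           # position of the first maximum
best_k = int(ks[best_idx])                   # map the position back to its k value
print(best_k)
```

Because `np.argmax` returns the first maximum, ties are resolved in favour of the smallest k.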


### Q2 - Value of k: Cross Validation

What is the optimal k for fitting a k-NN model using cross validation? (Iterate the k value from 1 to 35 and use 5 folds of cross validation)

For this task, you will perform the following steps:
- Find the optimal k value for which the kNN model gives the maximum mean cross-validated accuracy (the mean test score reported by GridSearchCV)

In [9]:
# Initialize the kNN classifier model
knn_clf = KNeighborsClassifier()

# defining Grid search cv with parameter range
grid = GridSearchCV(knn_clf, param_grid, cv=5, scoring='accuracy', return_train_score=True, verbose=1)
grid_search = grid.fit(X_train, y_train)
grid_search.best_params_


Fitting 5 folds for each of 34 candidates, totalling 170 fits


{'n_neighbors': 34}
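
The best k here sits at the upper edge of the searched range, which can be a sign that the range is too narrow. One way to check, sketched below on synthetic data rather than the telecom dataset, is to look at the mean cross-validated accuracy of every candidate in `cv_results_`. The data from `make_classification` and the names `X_demo`, `y_demo`, `demo_grid` and `cv_table` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the telecom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

demo_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": np.arange(1, 36)},  # the stop value 36 is excluded, so k runs 1..35
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate k with its mean accuracy across the 5 folds
cv_table = pd.DataFrame({
    "k": [p["n_neighbors"] for p in demo_grid.cv_results_["params"]],
    "mean_cv_accuracy": demo_grid.cv_results_["mean_test_score"],
})
print(cv_table.sort_values("mean_cv_accuracy", ascending=False).head())
print(demo_grid.best_params_)
```

If the accuracy is still rising at the largest k, widening the range is worth trying before settling on a value.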

### Q3 - Accuracy

From the optimal k found using the validation set and using cross validation, which one gives the highest accuracy on the validation set?

For this task, you will perform the following steps:
- Find the approach whose optimal k gives the maximum validation set accuracy

In [29]:
# Fit a kNN model using the optimal k value obtained in Q1 and find the validation set accuracy
knn_clf_best_val_set = KNeighborsClassifier(n_neighbors=max_acc)
knn_clf_best_val_set.fit(X_train, y_train)
y_pred_val_set = knn_clf_best_val_set.predict(X_val)
y_pred_train_set = knn_clf_best_val_set.predict(X_train)

# Fit a kNN model using the optimal k value obtained in Q2 and find the validation set accuracy
knn_clf_best_cv_set = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn_clf_best_cv_set.fit(X_train, y_train)
y_pred_cv_set = knn_clf_best_cv_set.predict(X_val)

# Find which task's optimal k gives the highest validation set accuracy
print("Validation data Accuracy Score: %.3f" % accuracy_score(y_val, y_pred_val_set))
print("CV data Accuracy Score: %.3f" % accuracy_score(y_val, y_pred_cv_set))

Validation data Accuracy Score: 0.796
CV data Accuracy Score: 0.789
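
To put the gap between the two scores in perspective, a rough binomial standard error can be computed for an accuracy measured on the validation set. This is a back-of-the-envelope sketch: it assumes the validation set holds `ceil(0.3 * 7032)` rows, as implied by `telecom.shape` and `test_size=0.3`, and it treats each prediction as an independent trial, ignoring that both models are scored on the same rows.

```python
import math

n_total = 7032                      # rows in the dataset, from telecom.shape
n_val = math.ceil(0.3 * n_total)    # validation rows implied by test_size=0.3

acc_val_k = 0.796                   # accuracy with k chosen on the validation set
acc_cv_k = 0.789                    # accuracy with k chosen by cross validation

# Binomial standard error of a single accuracy estimate
se = math.sqrt(acc_val_k * (1 - acc_val_k) / n_val)
gap = acc_val_k - acc_cv_k

print(n_val, round(se, 4), round(gap, 3))
```

Under these assumptions the gap of about 0.007 is smaller than one standard error (about 0.009), so the two k values perform similarly on this split.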


### Q4 - Model Performance

Explore the performance of a logistic regression model and a decision tree model for this dataset and select the correct statements from the options given below.

Note:
- Use the optimal k obtained using the validation set for the kNN model
- Use a CCP alpha of 0.0048016 for the decision tree model
- Use no penalty, the lbfgs solver, a random state of 0 and a maximum of 200 iterations for the logistic regression model

For this task, you will perform the following steps:
- Analyze the training and validation accuracies obtained for the logistic regression model, decision tree model and k-NN model

#### Logistic Regression Model

In [16]:
# Fit a logistic regression model on the training dataset and find the accuracy for training and validation datasets
# Hint: You need to set the 'penalty' parameter to None (the string 'none' in scikit-learn versions before 1.2) and 'solver' to 'lbfgs'
# Note: Use 'max_iter = 200' and 'random_state = 0' for the model
# Note: Use X2_train and X2_val for logistic regression
log_cls = LogisticRegression(random_state=0, max_iter=200, penalty=None, solver='lbfgs')
log_cls.fit(X2_train, y_train)
y_pred = log_cls.predict(X2_val)


In [18]:
#Compute the training accuracy
log_cls_train_acc = log_cls.score(X2_train, y_train)

In [19]:
#Compute the validation accuracy
log_cls_val_acc = log_cls.score(X2_val, y_val)

In [20]:
(log_cls_train_acc, log_cls_val_acc)

(0.8086143843965867, 0.7981042654028436)

#### Decision Tree Model

In [24]:
# Fit a decision tree model on the training dataset and find the accuracy for training and validation datasets
# Note: Use 'ccp_alpha = 0.0048016' and 'random_state = 0' for the model
tree_clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.0048016)
tree_clf.fit(X_train, y_train)
# Compute the training and validation accuracies of the decision tree model
tree_train_acc = tree_clf.score(X_train, y_train)
tree_val_acc = tree_clf.score(X_val, y_val)

# Tabulate your results and answer the question
(tree_train_acc, tree_val_acc)

(0.7878911011783828, 0.7976303317535545)
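
The notebook takes `ccp_alpha = 0.0048016` as given. For context, candidate alphas for cost-complexity pruning can be listed with `cost_complexity_pruning_path`. The sketch below does this on synthetic data, so the alphas it prints are unrelated to the value used above; `X_demo`, `y_demo` and `leaves` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the telecom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Effective alphas at which the fully grown tree loses one more subtree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)

# Refit one tree for every fifth alpha; larger alphas prune more and leave fewer leaves
leaves = []
for alpha in path.ccp_alphas[::5]:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    leaves.append(tree.get_n_leaves())
    print(f"alpha={alpha:.5f}  leaves={leaves[-1]}")
```

A value such as the one used in this notebook would typically be picked from this list by comparing validation or cross-validated accuracy across the candidates.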

In [31]:
# Collect validation and training accuracies for the three models
results = pd.DataFrame(
    [[log_cls_val_acc, log_cls_train_acc],
     [tree_val_acc, tree_train_acc],
     [accuracy_score(y_val, y_pred_val_set), accuracy_score(y_train, y_pred_train_set)]],
    columns=['Validation accuracy', 'Training accuracy'],
    index=['Linear', 'Tree', 'kNN'])  # 'Linear' is the logistic regression model

In [33]:
results.sort_values(by=["Validation accuracy", "Training accuracy"], ascending=False)

| Model | Validation accuracy | Training accuracy |
|---|---|---|
| Linear | 0.798104 | 0.808614 |
| Tree | 0.797630 | 0.787891 |
| kNN | 0.795735 | 0.809224 |
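
A quick way to read these results is to add the gap between training and validation accuracy, since a large positive gap hints at overfitting. The sketch below rebuilds the table from the rounded values printed above; the `results_demo` name and the `Gap` column are additions for illustration.

```python
import pandas as pd

# Rounded accuracies copied from the results table
results_demo = pd.DataFrame(
    {"Validation accuracy": [0.798104, 0.797630, 0.795735],
     "Training accuracy": [0.808614, 0.787891, 0.809224]},
    index=["Linear", "Tree", "kNN"],
)

# Positive gap: the model does better on training data than on validation data
results_demo["Gap"] = results_demo["Training accuracy"] - results_demo["Validation accuracy"]
print(results_demo.round(4))
```

The tree is the only model whose validation accuracy exceeds its training accuracy, and all three gaps stay within about 1.5 percentage points.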

