**2. Consider the churn-bigml-80.csv and churn-bigml-20.csv data files for this question. Orange Telecom's churn dataset, which consists of cleaned customer activity data (features) along with a churn label specifying whether a customer canceled the subscription, will be used to develop predictive models. Each row represents a customer; each column contains a customer attribute. The goal is to build models that can help Orange Telecom flag customers who are likely to churn.**

**(a) (5 points) Load the data files to your S3 bucket. Using the pandas library, read the CSV data files and create two data-frames called telecom_train (for churn-bigml-80.csv) and telecom_test (for churn-bigml-20.csv).**

In [1]:
import boto3
import pandas as pd; pd.set_option('display.max_columns', 100)
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, accuracy_score 
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,  GradientBoostingClassifier 
from sklearn.tree import DecisionTreeClassifier
from itertools import product

#defining the s3 bucket
s3 = boto3.resource('s3')
bucket_name = 'craig-shaffer-data-445-bucket'
bucket = s3.Bucket(bucket_name)

#defining the file to be read from s3 bucket
file_key1 = 'churn-bigml-80.csv'
file_key2 = 'churn-bigml-20.csv'

bucket_object1 = bucket.Object(file_key1)
file_object1 = bucket_object1.get()
file_content_stream1 = file_object1.get('Body')

bucket_object2 = bucket.Object(file_key2)
file_object2 = bucket_object2.get()
file_content_stream2 = file_object2.get('Body')

#reading the datafiles
telecom_train = pd.read_csv(file_content_stream1)
telecom_test = pd.read_csv(file_content_stream2)

In [2]:
telecom_train.head()

Unnamed: 0,State,Account_length,Area_code,International_plan,Voice_mail_plan,Number_vmail_messages,Total_day_minutes,Total_day_calls,Total_day_charge,Total_eve_minutes,Total_eve_calls,Total_eve_charge,Total_night_minutes,Total_night_calls,Total_night_charge,Total_intl_minutes,Total_intl_calls,Total_intl_charge,Customer_service_calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [3]:
telecom_test.head()

Unnamed: 0,State,Account_length,Area_code,International_plan,Voice_mail_plan,Number_vmail_messages,Total_day_minutes,Total_day_calls,Total_day_charge,Total_eve_minutes,Total_eve_calls,Total_eve_charge,Total_night_minutes,Total_night_calls,Total_night_charge,Total_intl_minutes,Total_intl_calls,Total_intl_charge,Customer_service_calls,Churn
0,LA,117,408,No,No,0,184.5,97,31.37,351.6,80,29.89,215.8,90,9.71,8.7,4,2.35,1,False
1,IN,65,415,No,No,0,129.1,137,21.95,228.5,83,19.42,208.8,111,9.4,12.7,6,3.43,4,True
2,NY,161,415,No,No,0,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,True
3,SC,111,415,No,No,0,110.4,103,18.77,137.3,102,11.67,189.6,105,8.53,7.7,6,2.08,2,False
4,HI,49,510,No,No,0,119.3,117,20.28,215.1,109,18.28,178.7,90,8.04,11.1,1,3.0,1,False


**(b) (12 points) Conduct the following feature engineering:**
- **Using the numpy library, create a variable in telecom_train called Churn_numb that takes the value of 1 when Churn = True and 0 when Churn = False.**
- **Change the International_plan variable from a categorical variable to a numerical variable. That is, change Yes to 1 and No to 0 in both data-frames: telecom_train and telecom_test.**
- **Change the Voice_mail_plan variable from a categorical variable to a numerical variable. That is, change Yes to 1 and No to 0 in both data-frames: telecom_train and telecom_test.**
- **Create a new variable called: total_charge as the sum of Total_day_charge, Total_eve_charge, Total_night_charge, and Total_intl_charge in both data-frames: telecom_train and telecom_test.**

In [4]:
#add Churn_numb to 1/0 where churn is True/False
telecom_train['Churn_numb'] = np.where(telecom_train['Churn']==True,1,0)
telecom_test['Churn_numb'] = np.where(telecom_test['Churn']==True,1,0)

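#change International_plan yes/no to 1/0
#(assigning the mapped column back avoids calling replace in place on a selected column, which recent pandas versions may not propagate to the data-frame)
telecom_train['International_plan'] = telecom_train['International_plan'].map({'Yes': 1, 'No': 0})
telecom_test['International_plan'] = telecom_test['International_plan'].map({'Yes': 1, 'No': 0})

#change Voice_mail_plan yes/no to 1/0
telecom_train['Voice_mail_plan'] = telecom_train['Voice_mail_plan'].map({'Yes': 1, 'No': 0})
telecom_test['Voice_mail_plan'] = telecom_test['Voice_mail_plan'].map({'Yes': 1, 'No': 0})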

#create total_charge
telecom_train = telecom_train.assign(total_charge = telecom_train['Total_day_charge'] + telecom_train['Total_eve_charge'] + telecom_train['Total_night_charge']+ telecom_train['Total_intl_charge'])
telecom_test = telecom_test.assign(total_charge = telecom_test['Total_day_charge'] + telecom_test['Total_eve_charge'] + telecom_test['Total_night_charge']+ telecom_test['Total_intl_charge'])

In [5]:
telecom_train.head()

Unnamed: 0,State,Account_length,Area_code,International_plan,Voice_mail_plan,Number_vmail_messages,Total_day_minutes,Total_day_calls,Total_day_charge,Total_eve_minutes,Total_eve_calls,Total_eve_charge,Total_night_minutes,Total_night_calls,Total_night_charge,Total_intl_minutes,Total_intl_calls,Total_intl_charge,Customer_service_calls,Churn,Churn_numb,total_charge
0,KS,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False,0,75.56
1,OH,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False,0,59.24
2,NJ,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False,0,62.29
3,OH,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False,0,66.8
4,OK,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False,0,52.09


**(c) (5 points) In both data-frames, telecom_train and telecom_test, keep only the following variables: Account_length, International_plan, Voice_mail_plan, total_charge, Customer_service_calls, and Churn_numb.**

In [6]:
telecom_train = telecom_train[['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls','Churn_numb']]
telecom_test = telecom_test[['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls','Churn_numb']]

In [7]:
telecom_train.head()

Unnamed: 0,Account_length,International_plan,Voice_mail_plan,total_charge,Customer_service_calls,Churn_numb
0,128,0,1,75.56,1,0
1,107,0,1,59.24,1,0
2,137,0,0,62.29,0,0
3,84,1,0,66.8,2,0
4,75,1,0,52.09,3,0


**(d) (20 points) Consider the telecom_train dataset. Use Account_length, International_plan, Voice_mail_plan, total_charge, and Customer_service_calls as the input variables and Churn_numb as the target variable. Do the following:**
- **(1) Split the data into train (80%) and test (20%), taking into account the proportion of 0s and 1s in the data. That is, if Y is the target variable, add the extra argument stratify = Y in the train_test_split function.**
- **(2) Using the train dataset:**
    - **(i) Fit a random forest model with 500 trees and depth equal to 3 to the train dataset. Extract the importance of variables.**
    - **(ii) Fit an AdaBoost model with 500 trees, depth equal to 3, and learning rate equal to 0.01 to the train dataset. Extract the importance of variables.**
    - **(iii) Fit a gradient boosting model with 500 trees, depth equal to 3, and learning rate equal to 0.01 to the train dataset. Extract the importance of variables.**
    
**Repeat steps (1)-(2) 1000 times. Compute the average importance of each of the variables across the 1000 splits and the three models. After that, select the top 4 variables (the ones with top 4 average importance) as the predictor variables.**

In [20]:
rf_importance = list()
ada_importance = list()
gb_importance = list()
importance = list()

#defining input and target variables
x = telecom_train[['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls']]
y = telecom_train['Churn_numb']

for i in range(0,1000):
    #Splitting the Data
    x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, stratify= y)
    
    #random forest
    rf_md = RandomForestClassifier(n_estimators = 500, max_depth = 3).fit(x_train, y_train)
    #extract the feature importances
    rf_importance.append(rf_md.feature_importances_)
    importance.append(rf_md.feature_importances_)
    
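    #adaboost (the weak learner is passed positionally: its keyword was base_estimator before scikit-learn 1.2 and is estimator from 1.2 onwards)
    ada_md = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 3), n_estimators = 500, learning_rate = 0.01).fit(x_train, y_train)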
    #extract the feature importances
    ada_importance.append(ada_md.feature_importances_)
    importance.append(ada_md.feature_importances_)
    
    #gradient boost
    gb_md = GradientBoostingClassifier(max_depth = 3, n_estimators = 500, learning_rate = 0.01).fit(x_train, y_train)
    #extract the feature importances
    gb_importance.append(gb_md.feature_importances_)
    importance.append(gb_md.feature_importances_)

KeyboardInterrupt: 

In [None]:
rf_importance = pd.DataFrame(rf_importance)
rf_importance.columns= ['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls']
rf_importance.apply(np.mean, axis = 0)

In [None]:
ada_importance = pd.DataFrame(ada_importance)
ada_importance.columns= ['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls']
ada_importance.apply(np.mean, axis = 0)

In [None]:
gb_importance = pd.DataFrame(gb_importance)
gb_importance.columns= ['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls']
gb_importance.apply(np.mean, axis = 0)

In [None]:
importance = pd.DataFrame(importance)
importance.columns= ['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls']
importance.apply(np.mean, axis = 0)

In [None]:
#removing the least important variable from both data sets
telecom_train = telecom_train.drop(['Account_length'], axis = 1)
telecom_test = telecom_test.drop(['Account_length'], axis = 1)
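
The drop above hard-codes Account_length as the least important variable. As a hedged alternative, the top 4 variables can be picked programmatically from the averaged importances. The sketch below is a scaled-down illustration only: it assumes a synthetic stand-in for telecom_train (generated with `make_classification`, not the real churn data), 3 repetitions instead of 1000, and 50 trees instead of 500. The names `rows`, `avg_importance`, and `top4` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features = ['Account_length', 'International_plan', 'Voice_mail_plan',
            'total_charge', 'Customer_service_calls']

# Synthetic stand-in for telecom_train (assumption: NOT the real churn data)
X_arr, y_arr = make_classification(n_samples=300, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=features)
y = pd.Series(y_arr, name='Churn_numb')

rows = []
for i in range(3):  # the assignment asks for 1000 repetitions
    # stratified 80/20 split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=i)
    models = [
        RandomForestClassifier(n_estimators=50, max_depth=3, random_state=i),
        AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50,
                           learning_rate=0.01, random_state=i),
        GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                   learning_rate=0.01, random_state=i),
    ]
    # one row of importances per (split, model) pair
    for md in models:
        rows.append(md.fit(X_tr, y_tr).feature_importances_)

# average across all splits and all three models, then keep the 4 largest
avg_importance = pd.DataFrame(rows, columns=features).mean(axis=0)
top4 = avg_importance.nlargest(4).index.tolist()
print(avg_importance.sort_values(ascending=False))
print(top4)
```

With the real data, the two data-frames could then be reduced with `telecom_train[top4 + ['Churn_numb']]` and `telecom_test[top4 + ['Churn_numb']]` instead of dropping a column by name.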

**(e) (45 points) Consider the telecom_train dataset. Use Churn_numb as the target variable and the remaining variables as the input variables. Do the following:**
 - **(i) Split the data into train (80%) and test (20%), taking into account the proportion of 0s and 1s in the data. That is, if Y is the target variable, add the extra argument stratify = Y in the train_test_split function.**
 - **(ii)**
   - **Using the train dataset, build random forest models with the following setting: n_tree = [100, 500, 1000, 1500, 2000] and depth = [3, 5, 7]. In order to create a data-frame that contains all the combinations of trees and depths, you can use the following code: (refer to pdf) For each random forest model that is built, use it to predict the likelihood of churn on the test dataset. Using 10% as the cut-off value, compute the accuracy and recall of each of the models.**
   - **Using the train dataset, build AdaBoost models with the following setting: n_tree = [100, 500, 1000, 1500, 2000], depth = [3, 5, 7], and learning rate = [0.1, 0.01, 0.001]. In order to create a data-frame that contains all the combinations of trees, depths, and learning rates, you can use the following code: (refer to pdf) For each AdaBoost model that is built, use it to predict the likelihood of churn on the test dataset. Using 10% as the cut-off value, compute the accuracy and recall of each of the models.**
   - **Using the train dataset, build gradient boosting models with the following setting: n_tree = [100, 500, 1000, 1500, 2000], depth = [3, 5, 7], and learning rate = [0.1, 0.01, 0.001]. In order to create a data-frame that contains all the combinations of trees, depths, and learning rates, you can use the following code: (refer to the pdf) For each gradient boosting model that is built, use it to predict the likelihood of churn on the test dataset. Using 10% as the cut-off value, compute the accuracy and recall of each of the models.**
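
The code the question refers to lives in the PDF and is not reproduced here. As a minimal sketch of one way to approach the random forest part, the cell below builds the grid of combinations with `itertools.product` and scores each model at the 10% cut-off. It assumes a synthetic, imbalanced stand-in for the data (via `make_classification`) and a reduced grid so that it runs quickly; `rf_results`, `likelihood`, and `label` are illustrative names.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Reduced grid for illustration; the assignment uses
# n_tree = [100, 500, 1000, 1500, 2000] and depth = [3, 5, 7]
n_tree = [50, 100]
depth = [3, 5]

# data-frame with one row per (n_tree, depth) combination
rf_results = pd.DataFrame(list(product(n_tree, depth)), columns=['n_tree', 'depth'])
rf_results['accuracy'] = np.nan
rf_results['recall'] = np.nan

# Synthetic, imbalanced stand-in for the churn data (assumption)
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

for i, row in rf_results.iterrows():
    md = RandomForestClassifier(n_estimators=int(row['n_tree']),
                                max_depth=int(row['depth']),
                                random_state=0).fit(X_tr, y_tr)
    # predicted likelihood of the positive class (churn)
    likelihood = md.predict_proba(X_te)[:, 1]
    # 10% cut-off: flag as churn when the likelihood is at least 0.10
    label = np.where(likelihood < 0.10, 0, 1)
    rf_results.loc[i, 'accuracy'] = accuracy_score(y_te, label)
    rf_results.loc[i, 'recall'] = recall_score(y_te, label)

print(rf_results)
```

The AdaBoost and gradient boosting grids would follow the same pattern with a third list for the learning rate, for example `product(n_tree, depth, learning_rate)`.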

**(f) (30 points) Repeat part (e) 100 times. Identify the best model of each of the frameworks (based on the average accuracy and recall); that is, identify the best random forest model, the best AdaBoost model, and the best gradient boosting model.**
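
One possible way to organize the repetition is to collect one record per (repetition, setting) pair and then average with a group-by. The sketch below does this for gradient boosting only, on a synthetic stand-in for the data, with 3 repetitions and a reduced grid. The question does not say how accuracy and recall should be combined, so ranking by the simple mean of the two averages (the `score` column) is an assumption made here for illustration.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (assumption)
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.85, 0.15], random_state=0)

# Reduced grid: (n_tree, depth, learning_rate)
grid = list(product([50, 100], [3, 5], [0.1, 0.01]))

records = []
for rep in range(3):  # the assignment asks for 100 repetitions
    # a new stratified split on every repetition
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=rep)
    for n, d, lr in grid:
        md = GradientBoostingClassifier(n_estimators=n, max_depth=d,
                                        learning_rate=lr,
                                        random_state=rep).fit(X_tr, y_tr)
        # 10% cut-off on the predicted likelihood of churn
        label = (md.predict_proba(X_te)[:, 1] >= 0.10).astype(int)
        records.append({'n_tree': n, 'depth': d, 'learning_rate': lr,
                        'accuracy': accuracy_score(y_te, label),
                        'recall': recall_score(y_te, label)})

# average accuracy and recall of each setting across repetitions
summary = (pd.DataFrame(records)
           .groupby(['n_tree', 'depth', 'learning_rate'], as_index=False)
           [['accuracy', 'recall']].mean())

# assumed selection rule: highest mean of average accuracy and average recall
summary['score'] = summary[['accuracy', 'recall']].mean(axis=1)
best_gb = summary.sort_values('score', ascending=False).iloc[0]
print(best_gb)
```

A different rule, such as sorting by recall first and breaking ties on accuracy, would only change the last two statements.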

**(g) (35 points) Using telecom_train, build three models: the best random forest model from part (f), the best AdaBoost model from part (f), and the best gradient boosting model from part (f). Using these three models, predict the likelihood of churn on the telecom_test data-frame. After that, aggregate those likelihoods using the weighted average formula (use the average recall of the models as weights). Using 10% as the cut-off value, report the accuracy and recall of the aggregated predictions.**
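
The weighted average described here is (r_rf * p_rf + r_ada * p_ada + r_gb * p_gb) / (r_rf + r_ada + r_gb), where each p is a model's predicted likelihood and each r is that model's average recall from part (f). The sketch below shows only the aggregation step with `np.average`. All numbers in it are made-up placeholders for illustration, not results from the churn data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Placeholder likelihoods from the three models (made-up values, NOT real results)
rf_pred = np.array([0.05, 0.20, 0.08, 0.60])
ada_pred = np.array([0.12, 0.30, 0.04, 0.55])
gb_pred = np.array([0.02, 0.25, 0.09, 0.70])

# Placeholder average recalls from part (f), used as weights (made-up values)
recalls = np.array([0.90, 0.80, 0.85])

# Placeholder true labels standing in for telecom_test['Churn_numb']
y_true = np.array([0, 1, 0, 1])

# np.average divides by the sum of the weights, so the weights need not sum to 1
ensemble_pred = np.average(np.vstack([rf_pred, ada_pred, gb_pred]),
                           axis=0, weights=recalls)

# 10% cut-off on the aggregated likelihood
ensemble_label = np.where(ensemble_pred < 0.10, 0, 1)

print('accuracy:', accuracy_score(y_true, ensemble_label))
print('recall:', recall_score(y_true, ensemble_label))
```

With the real models, the three placeholder arrays would be replaced by `predict_proba(...)[:, 1]` from each fitted model on the telecom_test inputs.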