**2. Consider the churn-bigml-80.csv and churn-bigml-20.csv data files for this question. Orange Telecom’s churn dataset, which consists of cleaned customer activity data (features) along with a churn label specifying whether a customer canceled the subscription, will be used to develop predictive models. Each row represents a customer; each column contains one of the customer’s attributes. The goal is to build models that can help Orange Telecom flag customers who are likely to churn.**

**(a) (5 points) Load the data files to your S3 bucket. Using the pandas library, read the csv data files and create two data-frames called telecom_train (for churn-bigml-80.csv) and telecom_test (for churn-bigml-20.csv).**

In [1]:
import boto3
import pandas as pd
pd.set_option('display.max_columns', 100)
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, accuracy_score 
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,  GradientBoostingClassifier 
from sklearn.tree import DecisionTreeClassifier
from itertools import product
from tqdm import tqdm #adds progress bar!

#defining the s3 bucket
s3 = boto3.resource('s3')
bucket_name = 'craig-shaffer-data-445-bucket'
bucket = s3.Bucket(bucket_name)

#defining the file to be read from s3 bucket
file_key1 = 'churn-bigml-80.csv'
file_key2 = 'churn-bigml-20.csv'

bucket_object1 = bucket.Object(file_key1)
file_object1 = bucket_object1.get()
file_content_stream1 = file_object1.get('Body')

bucket_object2 = bucket.Object(file_key2)
file_object2 = bucket_object2.get()
file_content_stream2 = file_object2.get('Body')

#reading the datafiles
telecom_train = pd.read_csv(file_content_stream1)
telecom_test = pd.read_csv(file_content_stream2)

In [2]:
telecom_train.head()

Unnamed: 0,State,Account_length,Area_code,International_plan,Voice_mail_plan,Number_vmail_messages,Total_day_minutes,Total_day_calls,Total_day_charge,Total_eve_minutes,Total_eve_calls,Total_eve_charge,Total_night_minutes,Total_night_calls,Total_night_charge,Total_intl_minutes,Total_intl_calls,Total_intl_charge,Customer_service_calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [3]:
telecom_test.head()

Unnamed: 0,State,Account_length,Area_code,International_plan,Voice_mail_plan,Number_vmail_messages,Total_day_minutes,Total_day_calls,Total_day_charge,Total_eve_minutes,Total_eve_calls,Total_eve_charge,Total_night_minutes,Total_night_calls,Total_night_charge,Total_intl_minutes,Total_intl_calls,Total_intl_charge,Customer_service_calls,Churn
0,LA,117,408,No,No,0,184.5,97,31.37,351.6,80,29.89,215.8,90,9.71,8.7,4,2.35,1,False
1,IN,65,415,No,No,0,129.1,137,21.95,228.5,83,19.42,208.8,111,9.4,12.7,6,3.43,4,True
2,NY,161,415,No,No,0,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4,True
3,SC,111,415,No,No,0,110.4,103,18.77,137.3,102,11.67,189.6,105,8.53,7.7,6,2.08,2,False
4,HI,49,510,No,No,0,119.3,117,20.28,215.1,109,18.28,178.7,90,8.04,11.1,1,3.0,1,False


**(b) (12 points) Conduct the following feature engineering:**
- **Using the numpy library, create a variable in telecom_train called Churn_numb that takes the value of 1 when Churn = True and 0 when Churn = False.**
- **Change the International_plan variable from a categorical variable to a numerical variable. That is, change Yes to 1 and No to 0 in both data-frames: telecom_train and telecom_test.**
- **Change the Voice_mail_plan variable from a categorical variable to a numerical variable. That is, change Yes to 1 and No to 0 in both data-frames: telecom_train and telecom_test.**
- **Create a new variable called: total_charge as the sum of Total_day_charge, Total_eve_charge, Total_night_charge, and Total_intl_charge in both data-frames: telecom_train and telecom_test.**

In [2]:
#create Churn_numb: 1 when Churn is True, 0 when Churn is False
telecom_train['Churn_numb'] = np.where(telecom_train['Churn'] == True, 1, 0)
telecom_test['Churn_numb'] = np.where(telecom_test['Churn'] == True, 1, 0)

#change International_plan Yes/No to 1/0
#(assigning the result back avoids inplace replacement on a column selection, which recent pandas versions warn about)
telecom_train['International_plan'] = telecom_train['International_plan'].map({'Yes': 1, 'No': 0})
telecom_test['International_plan'] = telecom_test['International_plan'].map({'Yes': 1, 'No': 0})

#change Voice_mail_plan Yes/No to 1/0
telecom_train['Voice_mail_plan'] = telecom_train['Voice_mail_plan'].map({'Yes': 1, 'No': 0})
telecom_test['Voice_mail_plan'] = telecom_test['Voice_mail_plan'].map({'Yes': 1, 'No': 0})

#create total_charge
telecom_train = telecom_train.assign(total_charge = telecom_train['Total_day_charge'] + telecom_train['Total_eve_charge'] + telecom_train['Total_night_charge']+ telecom_train['Total_intl_charge'])
telecom_test = telecom_test.assign(total_charge = telecom_test['Total_day_charge'] + telecom_test['Total_eve_charge'] + telecom_test['Total_night_charge']+ telecom_test['Total_intl_charge'])

In [27]:
telecom_train.head()

Unnamed: 0,State,Account_length,Area_code,International_plan,Voice_mail_plan,Number_vmail_messages,Total_day_minutes,Total_day_calls,Total_day_charge,Total_eve_minutes,Total_eve_calls,Total_eve_charge,Total_night_minutes,Total_night_calls,Total_night_charge,Total_intl_minutes,Total_intl_calls,Total_intl_charge,Customer_service_calls,Churn,Churn_numb,total_charge
0,KS,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False,0,75.56
1,OH,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False,0,59.24
2,NJ,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False,0,62.29
3,OH,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False,0,66.8
4,OK,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False,0,52.09
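As a quick check on the feature engineering, the first two rows of this output can be reproduced by hand. The sketch below uses plain Python rather than pandas and takes its values from the `telecom_train.head()` output of part (a); the `rows` list, the `yes_no` mapping, and the `encode` helper are illustrative names, not part of the notebook.

```python
# First two rows of telecom_train.head(), restricted to the columns used here
rows = [
    {"International_plan": "No", "Voice_mail_plan": "Yes", "Churn": False,
     "Total_day_charge": 45.07, "Total_eve_charge": 16.78,
     "Total_night_charge": 11.01, "Total_intl_charge": 2.7},
    {"International_plan": "No", "Voice_mail_plan": "Yes", "Churn": False,
     "Total_day_charge": 27.47, "Total_eve_charge": 16.62,
     "Total_night_charge": 11.45, "Total_intl_charge": 3.7},
]

yes_no = {"Yes": 1, "No": 0}
charge_cols = ["Total_day_charge", "Total_eve_charge",
               "Total_night_charge", "Total_intl_charge"]

def encode(row):
    """Return a copy of the row with the engineered features (hypothetical helper)."""
    out = dict(row)
    out["International_plan"] = yes_no[row["International_plan"]]
    out["Voice_mail_plan"] = yes_no[row["Voice_mail_plan"]]
    out["Churn_numb"] = 1 if row["Churn"] else 0
    # Round to cents so floating-point noise does not show up in the total
    out["total_charge"] = round(sum(row[c] for c in charge_cols), 2)
    return out

encoded = [encode(r) for r in rows]
for r in encoded:
    print(r["International_plan"], r["Voice_mail_plan"],
          r["Churn_numb"], r["total_charge"])
```

The totals agree with the `total_charge` column shown for rows 0 and 1 (75.56 and 59.24).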


**(c) (5 points) In both data-frames, telecom_train and telecom_test, keep only the following variables: Account_length, International_plan, Voice_mail_plan, total_charge, Customer_service_calls, and Churn_numb.**

In [3]:
telecom_train = telecom_train[['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls','Churn_numb']]
telecom_test = telecom_test[['Account_length', 'International_plan', 'Voice_mail_plan', 'total_charge', 'Customer_service_calls','Churn_numb']]

In [29]:
telecom_train.head()

Unnamed: 0,Account_length,International_plan,Voice_mail_plan,total_charge,Customer_service_calls,Churn_numb
0,128,0,1,75.56,1,0
1,107,0,1,59.24,1,0
2,137,0,0,62.29,0,0
3,84,1,0,66.8,2,0
4,75,1,0,52.09,3,0


**(d) (20 points) Consider the telecom_train dataset. Use Account_length, International_plan, Voice_mail_plan, total_charge, and Customer_service_calls as the input variables and Churn_numb as the target variable. Do the following:**
- **(1) Split the data into train (80%) and test (20%), taking into account the proportion of 0s and 1s in the data. That is, if Y is the target variable, you need to add the extra argument stratify = Y in the train_test_split function.**
- **(2) Using the train dataset:**
    - **(i) Fit a random forest model with 500 trees and depth equal to 3 to the train dataset. Extract the importance of variables.**
    - **(ii) Fit an AdaBoost model with 500 trees, depth equal to 3, and learning rate equal to 0.01 to the train dataset. Extract the importance of variables.**
    - **(iii) Fit a gradient boosting model with 500 trees, depth equal to 3, and learning rate equal to 0.01 to the train dataset. Extract the importance of variables.**
    
**Repeat steps (1)-(2) 1000 times. Compute the average importance of each of the variables across the 1000 splits and the three models. After that, select the top 4 variables (the ones with top 4 average importance) as the predictor variables.**
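To make the effect of `stratify` concrete, here is a simplified, pure-Python illustration of a stratified split. It is a sketch of the idea only, not scikit-learn's implementation; the `stratified_split` function and the 15% churn rate in the toy labels are assumptions made for the example.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test_size of each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 150 churners (1) and 850 non-churners (0)
labels = [1] * 150 + [0] * 850
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Because each class is sampled separately, both parts keep the same share of churners as the full data, which matters when the positive class is a small minority.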

In [35]:
#empty list to store variable importances
importance = []

#tqdm() added around range() to show progress for the 1000 iterations
for i in tqdm(range(0, 1000)):
    
    #defining input and target variables
    x = telecom_train[['Account_length', 'International_plan', 'Voice_mail_plan','total_charge', 'Customer_service_calls']]
    y = telecom_train['Churn_numb']

    #split into train and test
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,stratify = y)

    #random forest model
    rf_md = RandomForestClassifier(n_estimators = 500, max_depth = 3).fit(x_train, y_train)
    #extract variable importance
    importance.append(rf_md.feature_importances_)

    #adaboost model
    ada_md = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth = 3), n_estimators = 500, learning_rate = 0.01).fit(x_train, y_train)
    #extract variable importance
    importance.append(ada_md.feature_importances_)

    #gradient boosting model
    gb_md = GradientBoostingClassifier(max_depth = 3, n_estimators = 500, learning_rate = 0.01).fit(x_train, y_train)
    #extract variable importance
    importance.append(gb_md.feature_importances_)
    
#average importance scores 
importance = pd.DataFrame(importance, columns = x.columns)
importance = pd.DataFrame(importance.mean()).T
importance

100%|██████████| 1000/1000 [50:07<00:00,  3.01s/it]


Unnamed: 0,Account_length,International_plan,Voice_mail_plan,total_charge,Customer_service_calls
0,0.111084,0.16957,0.08051,0.490098,0.148739


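Voice_mail_plan has the lowest average importance (0.08051), so it is removed. The top 4 predictors are total_charge, International_plan, Customer_service_calls, and Account_length.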
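The ranking behind this choice can be reproduced from the averaged importances reported above. A minimal sketch in plain Python, with the values copied from that output (the variable names are illustrative):

```python
# Average importances across the 1000 splits and three models (from the output above)
avg_importance = {
    "Account_length": 0.111084,
    "International_plan": 0.16957,
    "Voice_mail_plan": 0.08051,
    "total_charge": 0.490098,
    "Customer_service_calls": 0.148739,
}

# Sort variable names from most to least important
ranked = sorted(avg_importance, key=avg_importance.get, reverse=True)
top4 = ranked[:4]
dropped = ranked[4:]

print("keep:", top4)
print("drop:", dropped)
```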

**(e) (45 points) Consider the telecom_train dataset. Use Churn_numb as the target variable and the remaining variables as the input variables. Do the following:**
 - **(i) Split the data into train (80%) and test (20%), taking into account the proportion of 0s and 1s in the data. That is, if Y is the target variable, you need to add the extra argument stratify = Y in the train_test_split function.**
 - **(ii)**
   - **Using the train dataset, build random forest models with the following settings: n_tree = [100, 500, 1000, 1500, 2000] and depth = [3, 5, 7]. To create a data-frame that contains all the combinations of trees and depths, you can use the following code (refer to the pdf). For each random forest model that is built, use it to predict the likelihood of churn on the test dataset. Using 10% as the cut-off value, compute the accuracy and recall of each of the models.**
   - **Using the train dataset, build AdaBoost models with the following settings: n_tree = [100, 500, 1000, 1500, 2000], depth = [3, 5, 7], and learning rate = [0.1, 0.01, 0.001]. To create a data-frame that contains all the combinations of trees, depths, and learning rates, you can use the following code (refer to the pdf). For each AdaBoost model that is built, use it to predict the likelihood of churn on the test dataset. Using 10% as the cut-off value, compute the accuracy and recall of each of the models.**
   - **Using the train dataset, build gradient boosting models with the following settings: n_tree = [100, 500, 1000, 1500, 2000], depth = [3, 5, 7], and learning rate = [0.1, 0.01, 0.001]. To create a data-frame that contains all the combinations of trees, depths, and learning rates, you can use the following code (refer to the pdf). For each gradient boosting model that is built, use it to predict the likelihood of churn on the test dataset. Using 10% as the cut-off value, compute the accuracy and recall of each of the models.**

In [4]:
def expand_grid(dictionary):
    return pd.DataFrame([row for row in product(*dictionary.values())],columns = dictionary.keys())

#random forest model parameters
dictionary_rf = {'n_tree': [100, 500, 1000, 1500, 2000], 'depth': [3, 5, 7]}
rf_parameters = expand_grid(dictionary_rf)
rf_parameters['accuracy'] = np.nan
rf_parameters['recall'] = np.nan

#adaboost model parameters
dictionary_ada = {'n_tree': [100, 500, 1000, 1500, 2000], 'depth': [3, 5, 7], 'learning_rate': [0.1, 0.01, 0.001]}
ada_parameters = expand_grid(dictionary_ada)
ada_parameters['accuracy'] = np.nan
ada_parameters['recall'] = np.nan

#gradient boosting model parameters
dictionary_gb = {'n_tree': [100, 500, 1000, 1500, 2000], 'depth': [3, 5, 7], 'learning_rate': [0.1, 0.01, 0.001]}
gb_parameters = expand_grid(dictionary_gb)
gb_parameters['accuracy'] = np.nan
gb_parameters['recall'] = np.nan
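To see what `expand_grid` produces without pandas, the following sketch builds the same Cartesian product as a list of dictionaries. The helper name `expand_grid_rows` is hypothetical; the grids are the ones defined in the cell above.

```python
from itertools import product

def expand_grid_rows(grid):
    """Plain-Python analogue of expand_grid: one dict per hyper-parameter combination."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*grid.values())]

rf_grid = {"n_tree": [100, 500, 1000, 1500, 2000], "depth": [3, 5, 7]}
boost_grid = {"n_tree": [100, 500, 1000, 1500, 2000], "depth": [3, 5, 7],
              "learning_rate": [0.1, 0.01, 0.001]}

rf_rows = expand_grid_rows(rf_grid)
boost_rows = expand_grid_rows(boost_grid)

print(len(rf_rows), len(boost_rows))
print(rf_rows[0], rf_rows[1])
```

The random forest grid has 5 × 3 = 15 combinations, and the AdaBoost and gradient boosting grids have 5 × 3 × 3 = 45 each, so one pass over all three grids fits 105 models. `product` varies the last key fastest, which is why depth changes before n_tree in the resulting rows.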

In [43]:
#defining input and target variables
x = telecom_train[['Account_length', 'International_plan', 'total_charge', 'Customer_service_calls']]
y = telecom_train['Churn_numb']

#empty lists to store results
rf_results = []
ada_results = []
gb_results = []

#split into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y)

#random forest model
#number of models
n = rf_parameters.shape[0]
for i in range(0, n):

    #building random forest model
    rf_md = RandomForestClassifier(n_estimators = rf_parameters.loc[i, 'n_tree'], max_depth = rf_parameters.loc[i, 'depth']).fit(x_train, y_train)
    #predicting on test set
    rf_preds = rf_md.predict_proba(x_test)[:, 1]
    #changing likelihoods to label w/ 10% cutoff
    rf_preds = np.where(rf_preds < 0.1, 0, 1)
    #compute recall & accuracy and store them
    rf_results.append([rf_parameters.loc[i, 'n_tree'], rf_parameters.loc[i, 'depth'], accuracy_score(y_test, rf_preds), recall_score(y_test, rf_preds)])

#adaboost model
#number of models
n = ada_parameters.shape[0]
for i in range(0, n):

    #building adaboost model
    ada_md = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth = ada_parameters.loc[i, 'depth']), n_estimators = ada_parameters.loc[i, 'n_tree'], learning_rate = ada_parameters.loc[i, 'learning_rate']).fit(x_train, y_train)
    #predicting on test set
    ada_preds = ada_md.predict_proba(x_test)[:, 1]
    #changing likelihoods to label w/ 10% cutoff
    ada_preds = np.where(ada_preds < 0.1, 0, 1)
    #compute recall & accuracy and store them
    ada_results.append([ada_parameters.loc[i, 'n_tree'], ada_parameters.loc[i, 'depth'], ada_parameters.loc[i, 'learning_rate'], accuracy_score(y_test, ada_preds), recall_score(y_test, ada_preds)])

#gradient boosting model
#number of models
n = gb_parameters.shape[0]
for i in range(0, n):

    #building gradient boosting model
    gb_md = GradientBoostingClassifier(max_depth = gb_parameters.loc[i, 'depth'], n_estimators = gb_parameters.loc[i, 'n_tree'], learning_rate = gb_parameters.loc[i, 'learning_rate']).fit(x_train, y_train)
    #predicting on test set
    gb_preds = gb_md.predict_proba(x_test)[:, 1]
    #changing likelihoods to label w/ 10% cutoff
    gb_preds = np.where(gb_preds < 0.1, 0, 1)
    #compute recall & accuracy and store them
    gb_results.append([gb_parameters.loc[i, 'n_tree'], gb_parameters.loc[i, 'depth'], gb_parameters.loc[i, 'learning_rate'], accuracy_score(y_test, gb_preds), recall_score(y_test, gb_preds)])



In [44]:
#random forest
rf_results_df = pd.DataFrame(columns = ['trees', 'depth', 'accuracy', 'recall'], data = rf_results)
rf_results_df

Unnamed: 0,trees,depth,accuracy,recall
0,100,3,0.859551,0.846154
1,100,5,0.874532,0.846154
2,100,7,0.874532,0.846154
3,500,3,0.859551,0.846154
4,500,5,0.874532,0.846154
5,500,7,0.870787,0.846154
6,1000,3,0.859551,0.846154
7,1000,5,0.874532,0.846154
8,1000,7,0.868914,0.846154
9,1500,3,0.859551,0.846154


In [45]:
#adaboost
ada_results_df = pd.DataFrame(columns = ['trees', 'depth', 'learning rate', 'accuracy', 'recall'], data = ada_results)
ada_results_df

Unnamed: 0,trees,depth,learning rate,accuracy,recall
0,100,3,0.1,0.146067,1.0
1,100,3,0.01,0.857678,0.833333
2,100,3,0.001,0.857678,0.820513
3,100,5,0.1,0.638577,0.871795
4,100,5,0.01,0.801498,0.833333
5,100,5,0.001,0.882022,0.833333
6,100,7,0.1,0.913858,0.74359
7,100,7,0.01,0.913858,0.74359
8,100,7,0.001,0.900749,0.74359
9,500,3,0.1,0.146067,1.0


In [46]:
#gradient boost
gb_results_df = pd.DataFrame(columns = ['trees', 'depth', 'learning rate', 'accuracy', 'recall'], data = gb_results)
gb_results_df

Unnamed: 0,trees,depth,learning rate,accuracy,recall
0,100,3,0.1,0.874532,0.833333
1,100,3,0.01,0.891386,0.846154
2,100,3,0.001,0.146067,1.0
3,100,5,0.1,0.868914,0.833333
4,100,5,0.01,0.88764,0.846154
5,100,5,0.001,0.146067,1.0
6,100,7,0.1,0.889513,0.807692
7,100,7,0.01,0.882022,0.846154
8,100,7,0.001,0.146067,1.0
9,500,3,0.1,0.863296,0.833333
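Several rows in the AdaBoost and gradient boosting tables pair an accuracy of 0.146067 with a recall of 1.0. That is the pattern one would expect if every predicted likelihood were at or above the 10% cut-off, so that every customer is flagged as a churner. The sketch below illustrates this in plain Python. It assumes a held-out split with 78 churners among 534 customers, which is an inference from the reported accuracy (78/534 ≈ 0.146067), not a number stated in the notebook; the helper functions and the constant likelihood of 0.15 are also illustrative.

```python
def apply_cutoff(probs, cutoff=0.1):
    """Label 1 (churn) when the likelihood is at or above the cut-off, else 0."""
    return [0 if p < cutoff else 1 for p in probs]

def accuracy(y_true, y_pred):
    """Share of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of actual churners that were flagged."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / sum(y_true)

# Assumed split: 78 churners and 456 non-churners
y_true = [1] * 78 + [0] * 456

# Hypothetical model whose likelihoods never drop below the cut-off
probs = [0.15] * len(y_true)
labels = apply_cutoff(probs)

print(round(accuracy(y_true, labels), 6), recall(y_true, labels))
```

With everyone flagged, recall is perfect but accuracy collapses to the churn rate, which is why both metrics are tracked together when comparing models.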


**(f) (30 points) Repeat part (e) 100 times. Identify the best model of each of the frameworks (based on the average accuracy and recall); that is, identify the best random forest model, the best AdaBoost model, and the best gradient boosting model.**

In [5]:
#defining input and target variables
x = telecom_train[['Account_length', 'International_plan', 'total_charge', 'Customer_service_calls']]
y = telecom_train['Churn_numb']

#empty lists to store results
rf_results = []
ada_results = []
gb_results = []

#repeating (e) 100 times
for j in range(0, 100):
    
    #split into train and test
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y)

    #random forest model
    #number of models
    n = rf_parameters.shape[0]
    for i in range(0, n):
        
        #building random forest model
        rf_md = RandomForestClassifier(n_estimators = rf_parameters.loc[i, 'n_tree'], max_depth = rf_parameters.loc[i, 'depth']).fit(x_train, y_train)
        #predicting on test set
        rf_preds = rf_md.predict_proba(x_test)[:, 1]
        #changing likelihoods to label w/ 10% cutoff
        rf_preds = np.where(rf_preds < 0.1, 0, 1)
        #compute recall & accuracy and store them
        rf_results.append([rf_parameters.loc[i, 'n_tree'], rf_parameters.loc[i, 'depth'], accuracy_score(y_test, rf_preds), recall_score(y_test, rf_preds)])

    #adaboost model
    #number of models
    n = ada_parameters.shape[0]
    for i in range(0, n):

        #building adaboost model
        ada_md = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth = ada_parameters.loc[i, 'depth']), n_estimators = ada_parameters.loc[i, 'n_tree'], learning_rate = ada_parameters.loc[i, 'learning_rate']).fit(x_train, y_train)
        #predicting on test set
        ada_preds = ada_md.predict_proba(x_test)[:, 1]
        #changing likelihoods to label w/ 10% cutoff
        ada_preds = np.where(ada_preds < 0.1, 0, 1)
        #compute recall & accuracy and store them
        ada_results.append([ada_parameters.loc[i, 'n_tree'], ada_parameters.loc[i, 'depth'], ada_parameters.loc[i, 'learning_rate'], accuracy_score(y_test, ada_preds), recall_score(y_test, ada_preds)])
        
    #gradient boosting model
    #number of models
    n = gb_parameters.shape[0]
    for i in range(0, n):

        #building gradient boosting model
        gb_md = GradientBoostingClassifier(max_depth = gb_parameters.loc[i, 'depth'], n_estimators = gb_parameters.loc[i, 'n_tree'], learning_rate = gb_parameters.loc[i, 'learning_rate']).fit(x_train, y_train)
        #predicting on test set
        gb_preds = gb_md.predict_proba(x_test)[:, 1]
        #changing likelihoods to label w/ 10% cutoff
        gb_preds = np.where(gb_preds < 0.1, 0, 1)
        #compute recall & accuracy and store them
        gb_results.append([gb_parameters.loc[i, 'n_tree'], gb_parameters.loc[i, 'depth'], gb_parameters.loc[i, 'learning_rate'], accuracy_score(y_test, gb_preds), recall_score(y_test, gb_preds)])
        


In [8]:
#random forest best models
rf_results = pd.DataFrame(columns = ['trees', 'depth', 'accuracy', 'recall'], data = rf_results)
rf_results['accuracy + recall'] = (rf_results['accuracy'] + rf_results['recall'])
rf_results = rf_results.groupby(['trees', 'depth']).mean().reset_index(drop = False)
rf_results.sort_values('accuracy + recall', ascending = False).reset_index(drop = True).head()

Unnamed: 0,trees,depth,accuracy,recall,accuracy + recall
0,2000,7,0.899126,0.870085,1.769212
1,1500,7,0.899001,0.870085,1.769087
2,100,7,0.898127,0.87094,1.769068
3,1000,7,0.898752,0.870085,1.768837
4,500,7,0.898752,0.868376,1.767128


In [9]:
#adaboost best models
ada_results = pd.DataFrame(columns = ['trees', 'depth', 'learning rate', 'accuracy', 'recall'], data = ada_results)
ada_results['accuracy + recall'] = (ada_results['accuracy'] + ada_results['recall'])
ada_results = ada_results.groupby(['trees', 'depth', 'learning rate']).mean().reset_index(drop = False)
ada_results.sort_values('accuracy + recall', ascending = False).reset_index(drop = True).head()

Unnamed: 0,trees,depth,learning rate,accuracy,recall,accuracy + recall
0,100,3,0.001,0.874282,0.865812,1.740094
1,100,3,0.01,0.873908,0.864957,1.738865
2,1000,3,0.001,0.873908,0.864957,1.738865
3,500,3,0.001,0.873783,0.863248,1.737031
4,100,5,0.001,0.90412,0.806838,1.710957


In [10]:
#gradient boosting best models
gb_results = pd.DataFrame(columns = ['trees', 'depth', 'learning rate', 'accuracy', 'recall'], data = gb_results)
gb_results['accuracy + recall'] = (gb_results['accuracy'] + gb_results['recall']) 
gb_results = gb_results.groupby(['trees', 'depth', 'learning rate']).mean().reset_index(drop = False)
gb_results.sort_values('accuracy + recall', ascending = False).reset_index(drop = True).head()

Unnamed: 0,trees,depth,learning rate,accuracy,recall,accuracy + recall
0,1000,5,0.001,0.905493,0.864957,1.77045
1,100,5,0.01,0.905493,0.864957,1.77045
2,1500,5,0.001,0.905493,0.864957,1.77045
3,2000,5,0.001,0.904869,0.864957,1.769826
4,1000,3,0.001,0.909863,0.857265,1.767128


Best models (based on the average accuracy + recall over the 100 splits):
 - rf: 2000 trees, depth 7
 - ada: 100 trees, depth 3, learning rate 0.001
 - gb: 100 trees, depth 5, learning rate 0.01 (tied on accuracy + recall with the 1000- and 1500-tree models at learning rate 0.001, but we prefer the simpler model)
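The selection rule used here (highest average accuracy + recall, with ties going to the model with fewer trees) can be sketched in plain Python. The per-split numbers below are hypothetical and only illustrate the grouping and tie-breaking logic; they are not results from the notebook.

```python
from collections import defaultdict

# Hypothetical per-split results: (trees, depth, learning_rate, accuracy, recall)
results = [
    (100, 5, 0.01, 0.90, 0.86),
    (100, 5, 0.01, 0.91, 0.87),
    (1000, 5, 0.001, 0.90, 0.86),
    (1000, 5, 0.001, 0.91, 0.87),
    (100, 3, 0.1, 0.88, 0.84),
    (100, 3, 0.1, 0.89, 0.83),
]

# Collect accuracy + recall for each hyper-parameter combination
scores = defaultdict(list)
for trees, depth, lr, acc, rec in results:
    scores[(trees, depth, lr)].append(acc + rec)

# Average over the repeated splits
averages = {params: sum(vals) / len(vals) for params, vals in scores.items()}

# Highest average first; among ties, prefer the model with fewer trees
best = min(averages, key=lambda p: (-round(averages[p], 6), p[0]))
print(best, round(averages[best], 6))
```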

**(g) (35 points) Using telecom_train, build three models: the best random forest model from part (f), the best AdaBoost model from part (f), and the best gradient boosting model from part (f). Using these three models, predict the likelihood of churn on the telecom_test data-frame. After that, aggregate those likelihoods using the weighted average formula (use the average recall of the models as weights). Using 10% as the cut-off value, report the accuracy and recall of the aggregated predictions.**

In [11]:
#split into train(telecom_train) and test(telecom_test)
x_train = telecom_train[['Account_length', 'International_plan', 'total_charge', 'Customer_service_calls']]
x_test = telecom_test[['Account_length', 'International_plan', 'total_charge', 'Customer_service_calls']]
y_train = telecom_train['Churn_numb']
y_test = telecom_test['Churn_numb']

#building random forest model
rf_md = RandomForestClassifier(n_estimators = 2000, max_depth = 7).fit(x_train, y_train)
#predicting on test set
rf_preds = rf_md.predict_proba(x_test)[:, 1]

#building adaboost model
ada_md = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth = 3), n_estimators = 100, learning_rate = 0.001).fit(x_train, y_train)
#predicting on test set
ada_preds = ada_md.predict_proba(x_test)[:, 1]

#building gradient boosting model
gb_md = GradientBoostingClassifier(max_depth = 5, n_estimators = 100, learning_rate = 0.01).fit(x_train, y_train)
#predicting on test set
gb_preds = gb_md.predict_proba(x_test)[:, 1]

#average recall of the best models from part (f), used as weights
rf_recall = 0.870085
ada_recall = 0.865812
gb_recall = 0.864957

#sum of recalls to use for weighted average
total_recall = rf_recall + ada_recall + gb_recall

#aggregate the likelihoods using the weighted average formula
overall_preds = (rf_recall/total_recall)*(rf_preds) + (ada_recall/total_recall)*(ada_preds) + (gb_recall/total_recall)*(gb_preds)

#changing likelihoods to label w/ 10% cutoff
overall_labels = np.where(overall_preds < 0.1, 0, 1)

#compute and print accuracy and recall
print('the accuracy score is', accuracy_score(y_test, overall_labels))
print('the recall score is', recall_score(y_test, overall_labels))

the accuracy score is 0.8665667166416792
the recall score is 0.8842105263157894
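The aggregation step is a recall-weighted average: each model's likelihood is multiplied by its recall divided by the sum of the three recalls, and the weighted likelihoods are added. The sketch below reproduces that arithmetic in plain Python. The recall values are the ones used in the cell above; the three customers and their likelihoods are hypothetical.

```python
# Average recalls of the best models from part (f)
recalls = {"rf": 0.870085, "ada": 0.865812, "gb": 0.864957}
total_recall = sum(recalls.values())

# Normalized weights, one per model; they sum to 1
weights = {name: r / total_recall for name, r in recalls.items()}

# Hypothetical churn likelihoods for three customers, one list per model
likelihoods = {
    "rf": [0.02, 0.30, 0.11],
    "ada": [0.05, 0.45, 0.14],
    "gb": [0.03, 0.25, 0.09],
}

n_customers = 3
combined = [
    sum(weights[m] * likelihoods[m][i] for m in recalls)
    for i in range(n_customers)
]

# Apply the 10% cut-off to the aggregated likelihoods
labels = [0 if p < 0.1 else 1 for p in combined]

print({name: round(w, 3) for name, w in weights.items()})
print(labels)  # [0, 1, 1]
```

Because the three recalls are so close, the weights are nearly equal (roughly 0.335, 0.333, and 0.333), so the aggregate behaves much like a simple mean of the three likelihoods. The third customer shows the practical effect: one model puts the likelihood below 10%, but the aggregate stays above the cut-off and the customer is flagged.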
