# DS Automation Assignment

Using our prepared churn data from week 2:
- use TPOT to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
    - REMEMBER: TPOT only finds the optimized processing pipeline and model. It doesn't create the model. 
        - You can use `tpot.export('my_model_name.py')` (assuming you called your TPOT object tpot) and it will save a Python template with an example of the optimized pipeline. 
        - Use the template code saved from the `export()` function in your program.
- create a Python script/file/module using code from the exported template above that
    - create a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
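
To make the metric choice in the first task concrete, the sketch below scores the five true labels listed above against a set of predicted probabilities. The `predicted_proba` values are hypothetical (made up for illustration, not produced by any model in this notebook), and the 0.5 threshold is an assumption. TPOT's `scoring` argument accepts scorer names such as `'roc_auc'` or `'recall'`, so the same metrics can drive the pipeline search.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# True labels for new_churn_data.csv, as given in the assignment text.
y_true = [1, 0, 0, 1, 0]

# Hypothetical predicted churn probabilities (made up for illustration).
predicted_proba = [0.70, 0.20, 0.60, 0.40, 0.10]

# Threshold at 0.5 to turn probabilities into class labels (an assumed cutoff).
y_pred = [int(p >= 0.5) for p in predicted_proba]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, predicted_proba),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy, precision, and recall depend on the chosen threshold, while AUC is computed from the probabilities directly, which is one reason it is often preferred for imbalanced targets such as churn.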

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as pycaret, H2O, MLBox, etc., and compare performance and features with TPOT
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
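
The percentile challenge above can be sketched with plain NumPy. This is a minimal illustration, assuming the percentile is defined as the share of training-set probabilities that are less than or equal to the new prediction; the function name `probability_percentile` and the `train_probs` values are hypothetical.

```python
import numpy as np

def probability_percentile(train_probs, new_prob):
    """Percent of training-set probabilities that are <= new_prob."""
    train_probs = np.asarray(train_probs, dtype=float)
    return float((train_probs <= new_prob).mean() * 100)

# Hypothetical training-set churn probabilities (made up for illustration).
train_probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.50, 0.60, 0.80, 0.90]

# Where does a predicted churn probability of 0.78 fall in that distribution?
print(probability_percentile(train_probs, 0.78))
```

In practice `train_probs` would come from calling `predict_proba` on the training features with the fitted pipeline.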

<span style="color:purple">Import necessary libraries</span>.

In [1]:
import pandas as pd
from pandas import read_csv 
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

<span style="color:purple">Import sklearn</span>.

In [2]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDRegressor
from sklearn.svm import LinearSVR
from sklearn.pipeline import make_pipeline, make_union

<span style="color:purple">Import TPOTClassifier</span>.

In [3]:
from tpot import TPOTClassifier
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
# Any results you write to the current directory are saved as output.
import timeit 

<span style="color:purple">Import other necessary libraries</span>.

In [4]:
# (no further imports are required for the cells below)

<span style="color:purple">Loading data file</span>.

In [5]:
df = pd.read_csv('prepped_churn_data.csv',index_col='customerID')
df #df could be dangerous if file too big
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   tenure                     7032 non-null   int64  
 1   PhoneService               7032 non-null   int64  
 2   Contract                   7032 non-null   int64  
 3   PaymentMethod              7032 non-null   int64  
 4   MonthlyCharges             7032 non-null   float64
 5   TotalCharges               7032 non-null   float64
 6   Churn                      7032 non-null   int64  
 7   tenure_TotalCharges_ratio  7032 non-null   float64
dtypes: float64(3), int64(5)
memory usage: 494.4+ KB


In [6]:
#might not need this -- but just in case
# astype(float) keeps the values and only changes the dtype
df['tenure'] = df['tenure'].astype(float)
df['PhoneService'] = df['PhoneService'].astype(float)
df['Contract'] = df['Contract'].astype(float)
df['PaymentMethod'] = df['PaymentMethod'].astype(float)
df['MonthlyCharges'] = df['MonthlyCharges'].astype(float)

In [7]:
print(df.dtypes)

tenure                       float64
PhoneService                 float64
Contract                     float64
PaymentMethod                float64
MonthlyCharges               float64
TotalCharges                 float64
Churn                          int64
tenure_TotalCharges_ratio    float64
dtype: object
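
A point worth checking when converting dtypes: `float()` called with no argument returns `0.0`, so a pattern like `apply(lambda x: float())` replaces every value in a column with zero, whereas `astype(float)` keeps the values and only changes the dtype. The sketch below shows the difference on a small hypothetical frame (the `demo` data is made up for illustration).

```python
import pandas as pd

# Small hypothetical frame standing in for the churn data.
demo = pd.DataFrame({"tenure": [1, 34, 2], "Contract": [0, 1, 0]})

# float() with no argument always returns 0.0, so this zeroes the column.
zeroed = demo["tenure"].apply(lambda x: float())

# astype with a dict converts several columns at once and keeps the values.
converted = demo.astype({"tenure": float, "Contract": float})

print(zeroed.tolist())
print(converted["tenure"].tolist())
print(converted.dtypes)
```

Both results report `float64`, so `dtypes` alone does not reveal the difference; printing a few values does.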


<span style="color:purple">TPOT: loading the data file, printing its contents, and dividing the data into train and test</span>.

In [8]:
tpot_data =  pd.read_csv('prepped_churn_data.csv')
print(tpot_data)
features = tpot_data.drop(['customerID','Churn'], axis=1)
target = tpot_data['Churn']
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, target, random_state=1)

      customerID  tenure  PhoneService  Contract  PaymentMethod  \
0     7590-VHVEG       1             0         0              0   
1     5575-GNVDE      34             1         1              1   
2     3668-QPYBK       2             1         0              1   
3     7795-CFOCW      45             0         1              2   
4     9237-HQITU       2             1         0              0   
...          ...     ...           ...       ...            ...   
7027  6840-RESVB      24             1         1              1   
7028  2234-XADUH      72             1         1              3   
7029  4801-JZAZL      11             0         0              0   
7030  8361-LTMKD       4             1         0              1   
7031  3186-AJIEK      66             1         2              2   

      MonthlyCharges  TotalCharges  Churn  tenure_TotalCharges_ratio  
0              29.85         29.85      0                   0.033501  
1              56.95       1889.50      0            

<span style="color:purple">Initiating TPOT Classification and exporting parameters</span>. 

In [9]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, random_state=42)
tpot.fit(training_features, training_target)
print(tpot.score(testing_features, testing_target))
tpot.export('churn_data_best_class_model.py')

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.796171119724453

Generation 2 - Current best internal CV score: 0.7967398401035999

Generation 3 - Current best internal CV score: 0.7969299531462181

Generation 4 - Current best internal CV score: 0.7969299531462181

Generation 5 - Current best internal CV score: 0.7980668543216094

Best pipeline: RandomForestClassifier(FastICA(input_matrix, tol=0.2), bootstrap=True, criterion=gini, max_features=0.2, min_samples_leaf=8, min_samples_split=4, n_estimators=100)
0.7918088737201365
CPU times: total: 1min 13s
Wall time: 7min 49s


<span style="color:purple">Using parameters from the exported classification .py file</span>. 

In [10]:
import numpy as np
import pandas as pd
from sklearn.decomposition import FastICA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive

In [11]:
# Average CV score on the training set was: 0.7980668543216094
exported_pipeline = make_pipeline(
    FastICA(tol=0.2),
    RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.2, min_samples_leaf=8, min_samples_split=4, n_estimators=100)
)

In [12]:
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

In [13]:
exported_pipeline.fit(training_features, training_target)

Pipeline(steps=[('fastica', FastICA(random_state=42, tol=0.2)),
                ('randomforestclassifier',
                 RandomForestClassifier(max_features=0.2, min_samples_leaf=8,
                                        min_samples_split=4,
                                        random_state=42))])

In [14]:
results = exported_pipeline.predict(testing_features)
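
The assignment asks for a function that takes a dataframe and returns the probability of churn for each row. `predict` returns class labels, while `predict_proba` returns one probability column per class. The sketch below is a minimal illustration under stated assumptions: it uses the same two pipeline steps as the exported pipeline, but fits them on synthetic data so the block is self-contained, and the function name `churn_probabilities` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FastICA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the prepared churn data (assumption, not the real file).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y = (X["a"] + X["b"] > 0).astype(int)

# Same two steps as the exported pipeline, with a fixed random_state.
pipeline = make_pipeline(
    FastICA(tol=0.2, random_state=42),
    RandomForestClassifier(max_features=0.2, min_samples_leaf=8,
                           min_samples_split=4, n_estimators=100,
                           random_state=42),
)
pipeline.fit(X, y)

def churn_probabilities(model, df):
    """Return P(churn = 1) for each row of df as a Series sharing df's index."""
    # Find the predict_proba column that corresponds to class label 1.
    positive_col = list(model.classes_).index(1)
    proba = model.predict_proba(df)[:, positive_col]
    return pd.Series(proba, index=df.index, name="churn_probability")

probs = churn_probabilities(pipeline, X.head(5))
print(probs)
```

Returning a Series that reuses the input index keeps each probability attached to its row, which is convenient when the dataframe is indexed by `customerID`.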

<span style="color:purple">Import TPOT Regressor</span>. 

In [15]:
# import the usual packages
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
import timeit 

<span style="color:purple">Load data and divide the data into train and test</span>. 

In [16]:
tpot_data =  pd.read_csv('prepped_churn_data.csv')
X_train, X_test, y_train, y_test = train_test_split(training_features, training_target, train_size=0.8, test_size=0.2, random_state=42)

<span style="color:purple">Initializing TPOT regression and exporting parameters</span>. 

In [17]:
%%time
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, n_jobs=-1, scoring='r2', random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('churn_data_best_reg_model.py')

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.2860959679646081

Generation 2 - Current best internal CV score: 0.2879225707996487

Generation 3 - Current best internal CV score: 0.28924936746591035

Generation 4 - Current best internal CV score: 0.2896377923505735

Generation 5 - Current best internal CV score: 0.2896377923505735

Best pipeline: ExtraTreesRegressor(input_matrix, bootstrap=True, max_features=0.7500000000000001, min_samples_leaf=7, min_samples_split=19, n_estimators=100)
0.2703730539144147
CPU times: total: 48.3 s
Wall time: 5min 23s


<span style="color:purple">Using parameters from the exported regression .py file</span>. 

In [55]:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union  # not in the exported file
from tpot.builtins import StackingEstimator  # not in the exported file
from tpot.export_utils import set_param_recursive  # not in the exported file
from xgboost import XGBRegressor  # not in the exported file

<span style="color:purple">Loading data</span>. 

In [56]:
tpot_data =  pd.read_csv('new_churn_data.csv')

In [65]:
#might not need this -- but just in case
# NOTE: float() with no argument returns 0.0, so the next line overwrites every
# customerID with 0.0 instead of converting the IDs
tpot_data['customerID'] = tpot_data['customerID'].apply(lambda x: float())
tpot_data['tenure'] = tpot_data['tenure'].astype(float)
tpot_data['PhoneService'] = tpot_data['PhoneService'].astype(float)
tpot_data['Contract'] = tpot_data['Contract'].astype(float)
tpot_data['PaymentMethod'] = tpot_data['PaymentMethod'].astype(float)
tpot_data['MonthlyCharges'] = tpot_data['MonthlyCharges'].astype(float)

<span style="color:purple">Training & Testing </span>. 

In [66]:
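# NOTE: new_churn_data.csv has no Churn column, so customerID (all 0.0 at this
# point) is used as the target here; a model fitted on it can only predict 0.0
features = tpot_data.drop('customerID', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['customerID'], random_state=42)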

<span style="color:purple">Using exported pipeline automation parameters</span>. 

In [67]:
tpot_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customerID         5 non-null      float64
 1   tenure             5 non-null      float64
 2   PhoneService       5 non-null      float64
 3   Contract           5 non-null      float64
 4   PaymentMethod      5 non-null      float64
 5   MonthlyCharges     5 non-null      float64
 6   TotalCharges       5 non-null      float64
 7   charge_per_tenure  5 non-null      float64
dtypes: float64(8)
memory usage: 448.0 bytes


In [68]:
# Average CV score on the training set was: 0.2896377923505735
exported_pipeline = ExtraTreesRegressor(bootstrap=True, max_features=0.7500000000000001, min_samples_leaf=7, min_samples_split=19, n_estimators=100)

<span style="color:purple">Just making sure of data integrity</span>. 

In [61]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   tenure             5 non-null      int64  
 1   PhoneService       5 non-null      int64  
 2   Contract           5 non-null      int64  
 3   PaymentMethod      5 non-null      int64  
 4   MonthlyCharges     5 non-null      float64
 5   TotalCharges       5 non-null      float64
 6   charge_per_tenure  5 non-null      float64
dtypes: float64(3), int64(4)
memory usage: 408.0 bytes


In [49]:
tpot_data['customerID'] = tpot_data['customerID'].astype(float)

In [63]:
#might not need this -- but just in case
#df['customerID'] = df['customerID'].apply(lambda x: float())
tpot_data['tenure'] = tpot_data['tenure'].astype(float)
tpot_data['PhoneService'] = tpot_data['PhoneService'].astype(float)
tpot_data['Contract'] = tpot_data['Contract'].astype(float)
tpot_data['PaymentMethod'] = tpot_data['PaymentMethod'].astype(float)
tpot_data['MonthlyCharges'] = tpot_data['MonthlyCharges'].astype(float)

In [50]:
tpot_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customerID         5 non-null      float64
 1   tenure             5 non-null      float64
 2   PhoneService       5 non-null      float64
 3   Contract           5 non-null      float64
 4   PaymentMethod      5 non-null      float64
 5   MonthlyCharges     5 non-null      float64
 6   TotalCharges       5 non-null      float64
 7   charge_per_tenure  5 non-null      float64
dtypes: float64(8)
memory usage: 448.0 bytes


In [54]:
tpot_data = tpot_data.astype({"customerID": float})
print(tpot_data.dtypes)

customerID           float64
tenure               float64
PhoneService         float64
Contract             float64
PaymentMethod        float64
MonthlyCharges       float64
TotalCharges         float64
charge_per_tenure    float64
dtype: object


In [69]:
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

In [70]:
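# spent 14+ hours on this... still think it is wrong -- HELP!!!
# training_features is already 3 rows x 7 columns, so this reshape leaves it unchanged
row = np.reshape(training_features, (3,7))
results = exported_pipeline.predict(row)
print(row)
# prints the whole training_target Series, not just the value for row 0
print(f'Row 0 Predicted: {results[0]}, target: {training_target}')
# prints the testing_features rows under the label "target"
print(f'Row 0 Predicted: {results[0]}, target: {testing_features}')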

   tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
2    28.0           1.0       0.0            0.0           28.25   
0    22.0           1.0       0.0            2.0           97.40   
3    62.0           1.0       0.0            2.0          101.70   

   TotalCharges  charge_per_tenure  
2        250.90           8.960714  
0        811.70          36.895455  
3       3106.56          50.105806  
Row 0 Predicted: 0.0, target: 2    0.0
0    0.0
3    0.0
Name: customerID, dtype: float64
Row 0 Predicted: 0.0, target:    tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
1     8.0           0.0       1.0            1.0           77.30   
4    10.0           0.0       0.0            1.0           51.15   

   TotalCharges  charge_per_tenure  
1       1701.95          212.74375  
4       3440.97          344.09700  
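
The all-zero predictions above are consistent with the pipeline having been fitted on the new rows with the zeroed `customerID` as the target. A minimal sketch of an alternative workflow follows: fit on labeled training data where `Churn` is known, then score the new rows. Judging from the printed rows, `tenure_TotalCharges_ratio` is tenure divided by TotalCharges (1 / 29.85 ≈ 0.0335), while `charge_per_tenure` is TotalCharges divided by tenure (250.90 / 28 ≈ 8.96), so the sketch rebuilds the ratio for the new data. Everything here is an assumption for illustration: the training data is synthetic, only a subset of columns is used, the customer IDs are hypothetical, and a plain `RandomForestClassifier` stands in for the exported pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic labeled training data (assumption; stands in for prepped_churn_data.csv).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=300),
    "MonthlyCharges": rng.uniform(20, 110, size=300),
})
train["TotalCharges"] = train["tenure"] * train["MonthlyCharges"]
train["tenure_TotalCharges_ratio"] = train["tenure"] / train["TotalCharges"]
train["Churn"] = (train["tenure"] < 12).astype(int)  # toy labeling rule

feature_cols = [c for c in train.columns if c != "Churn"]
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(train[feature_cols], train["Churn"])

# Two rows shaped like the new data: indexed by (hypothetical) IDs, no Churn column.
new = pd.DataFrame(
    {"tenure": [28, 62], "MonthlyCharges": [28.25, 101.70],
     "TotalCharges": [250.90, 3106.56], "charge_per_tenure": [8.960714, 50.105806]},
    index=pd.Index(["id-1", "id-2"], name="customerID"),
)

# Rebuild the training-style ratio feature and match the training column order.
new["tenure_TotalCharges_ratio"] = new["tenure"] / new["TotalCharges"]
new_features = new[feature_cols]

# Column 1 of predict_proba is the probability of class 1 (churn).
churn_proba = pd.Series(model.predict_proba(new_features)[:, 1],
                        index=new.index, name="churn_probability")
print(churn_proba)
```

Selecting `new[feature_cols]` both drops the extra `charge_per_tenure` column and puts the columns in the order the model was fitted on.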


# Summary

The processes for TPOT classification and TPOT regression were clear.  
The train and test data were accurate.  
I really like the ability to export a template file for automation.  
The only challenge was getting the data types correct.  
Even after changing the data type from string to float, the pipeline still recognized a cell of a dropped column as a string. 
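
A dtype problem like the one described above can often be spotted before fitting by listing the non-numeric columns. The sketch below is a minimal illustration on a hypothetical frame (the IDs and values are made up): it reports the string column and then moves the identifier into the index rather than casting it to float.

```python
import pandas as pd

# Hypothetical rows; customerID is a string identifier, not a feature.
data = pd.DataFrame({
    "customerID": ["0001-AAAA", "0002-BBBB"],
    "tenure": [28, 62],
    "MonthlyCharges": [28.25, 101.70],
})

# List the columns that are not numeric and so cannot be fed to the model as-is.
non_numeric = data.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Move the identifier into the index so it stays attached to each row
# without being passed to the model as a feature.
features = data.set_index("customerID")
print(features.dtypes)
```

Reading the file with `pd.read_csv(..., index_col='customerID')`, as in the first loading cell of this notebook, has the same effect as `set_index` here.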