# DS Automation Assignment

Using our prepared churn data from week 2:
- use TPOT to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
    - REMEMBER: TPOT only finds the optimized processing pipeline and model. It doesn't create the model. 
        - You can use `tpot.export('my_model_name.py')` (assuming you called your TPOT object tpot) and it will save a Python template with an example of the optimized pipeline. 
        - Use the template code saved from the `export()` function in your program.
- create a Python script/file/module using code from the exported template above that
    - create a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as pycaret, H2O, MLBox, etc., and compare performance and features with TPOT
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
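
As a rough illustration of the percentile challenge above, here is a minimal sketch that uses only the standard library. The function name `percentile_of` and the list of training probabilities are hypothetical; in practice the list would come from running the fitted model's `predict_proba` on the training set.

```python
from bisect import bisect_right


def percentile_of(value, training_probs):
    """Return the percentage of training probabilities that are <= value."""
    ordered = sorted(training_probs)
    if not ordered:
        raise ValueError("training_probs must not be empty")
    # bisect_right counts how many sorted values are <= value
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical churn probabilities predicted on the training set
train_probs = [0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.70, 0.91]

pct = percentile_of(0.78, train_probs)
print(f"A churn probability of 0.78 sits at percentile {pct:.0f} of the training predictions")
```

Sorting once and using `bisect_right` keeps the lookup cheap if many new predictions need to be ranked against the same training distribution.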

In [55]:
# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Utilities
import timeit
import warnings
warnings.filterwarnings("ignore")

# scikit-learn pieces used by the exported TPOT template
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA

# TPOT
from tpot import TPOTClassifier, TPOTRegressor
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive


# Loading Data

In [68]:
df = pd.read_csv('../week2/prepped_churned_data.csv', index_col='customerID')

# Adjust column names to match the course-provided test data used later on
df.rename(columns = {'TotalCharges_by_tenure':'charge_per_tenure'}, inplace = True)
df.drop('TotalCharges_by_tenure_log', axis=1, inplace=True)


## Splitting data for training


In [69]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

## training data

In [70]:
X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.7, test_size=0.3, random_state=42)

## Running TPOT to get the best model parameters to use

In [71]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))


Generation 1 - Current best internal CV score: 0.8017089678510999

Generation 2 - Current best internal CV score: 0.8017089678510999

Generation 3 - Current best internal CV score: 0.8017089678510999

Generation 4 - Current best internal CV score: 0.8017091741983411

Generation 5 - Current best internal CV score: 0.8017091741983411

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=entropy, max_features=0.2, min_samples_leaf=8, min_samples_split=4, n_estimators=100)
0.7853080568720379
CPU times: total: 30.7 s
Wall time: 4min 8s
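
The run above relies on TPOT's default metric, accuracy. `TPOTClassifier` also accepts a `scoring` argument (for example `scoring='roc_auc'`), which may matter here because churners are the minority class. The sketch below, in plain Python with hypothetical labels, shows why accuracy alone can hide the difference between a model that never flags a churner and one that finds some of them.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = churn)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 2 churners among 10 customers
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_stay = [0] * 10                        # never predicts churn
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]      # one correct flag, one false alarm

print(classification_metrics(y_true, always_stay))
print(classification_metrics(y_true, one_hit))
```

Both sets of predictions reach the same accuracy, but only the second has any recall, which is the kind of difference a churn model is usually expected to capture.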


In [73]:
tpot.export('tpotmodel.py')

## Loading `newdf`, the new churn data

In [131]:
newdf = pd.read_csv('new_churn_data.csv', index_col='customerID')
# Rename the Churn column of the training data to 'target', the name the exported TPOT code expects
df.rename(columns = {'Churn':'target'}, inplace = True)


## Testing the exported pipeline in a function

`TPOT_Pipeline` is the function I used to prototype the .py module, based on the tpotmodel.py file generated above. It fits the model on the training data from previous weeks and prints the predicted probabilities ("proba") for the five data points provided this week.

In [296]:
def TPOT_Pipeline(trainingdata, testdata):
    # The churn column was renamed to 'target' in the cell above
    features = trainingdata.drop('target', axis=1)
    targets = trainingdata['target']

    training_features, testing_features, training_target, testing_target = \
            train_test_split(features, targets, random_state=42)

    # Average CV score on the training set was: 0.8017091741983411
    # Note: TPOT's reported best pipeline used max_features=0.2; this cell uses 0.5
    exported_pipeline = RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=.5, min_samples_leaf=8, min_samples_split=4, n_estimators=100)
    # Fix random state in exported estimator
    if hasattr(exported_pipeline, 'random_state'):
        setattr(exported_pipeline, 'random_state', 3)

    # Fit the pipeline on the training split
    exported_pipeline.fit(training_features, training_target)

    # Predict class probabilities for the new data once, then report each row.
    # Column 0 of predict_proba corresponds to exported_pipeline.classes_[0];
    # if churn is encoded as 1, the churn probability is in column 1 instead.
    testing_features = testdata
    results = exported_pipeline.predict_proba(testing_features)
    for row in range(len(testing_features)):
        prob = (results[row][0]*100).round(decimals = 2)
        print(f'There is a {prob} probability for customer {testing_features.index[row]} to Churn' )
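
The assignment asks for a function that returns the churn probability for each row rather than only printing it. The following is a minimal sketch of one way that could look, not the code in predict_Churn.py. The function name `predict_churn_proba`, the `positive_label` parameter, and the synthetic two-column dataset are all assumptions made for illustration. It looks up the churn column through `classes_` instead of hard-coding a column index.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def predict_churn_proba(model, new_data, positive_label=1):
    """Return P(churn) for each row of new_data as a Series indexed like new_data."""
    # predict_proba columns follow the order of model.classes_
    positive_col = list(model.classes_).index(positive_label)
    proba = model.predict_proba(new_data)[:, positive_col]
    return pd.Series(proba, index=new_data.index, name="churn_probability")


# Synthetic stand-in data (hypothetical columns, not the course dataset)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=200),
    "MonthlyCharges": rng.uniform(20, 120, size=200),
})
y = (X["MonthlyCharges"] / X["tenure"] > 3).astype(int)  # toy churn rule

model = RandomForestClassifier(
    n_estimators=100, criterion="entropy",
    min_samples_leaf=8, min_samples_split=4, random_state=42,
).fit(X, y)

new_rows = X.head(5)
probs = predict_churn_proba(model, new_rows)
print(probs.round(2))
```

Returning a Series keeps the customer index attached to each probability, so the caller can decide whether to print, rank, or save the results.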

    


### here it is:

In [297]:
TPOT_Pipeline(df, newdf)

There is a 69.01 probability for customer 9305-CKSKC to Churn
There is a 68.91 probability for customer 1452-KNGVK to Churn
There is a 83.93 probability for customer 6723-OKKJM to Churn
There is a 84.54 probability for customer 7832-POPKP to Churn
There is a 59.38 probability for customer 6348-TACGU to Churn


### Following the assignment, I then created a .py file (predict_Churn.py)

In [298]:
import predict_Churn

In [307]:
run predict_Churn

predictions:
There is a 69.01 probability for customer 9305-CKSKC to Churn
There is a 68.91 probability for customer 1452-KNGVK to Churn
There is a 83.93 probability for customer 6723-OKKJM to Churn
There is a 84.54 probability for customer 7832-POPKP to Churn
There is a 59.38 probability for customer 6348-TACGU to Churn


# It worked!

# Summary

In this lesson we used TPOTClassifier to identify the best model for our dataset. TPOT selected a RandomForestClassifier with an internal CV score of about 0.802 and an accuracy of about 0.785 on the held-out test split. When we applied the TPOT pipeline to the new data, the true positive cases were not easy to pick out. Customers 1 and 4 of the 5 sample customers churned. Customer 1, 9305-CKSKC, was scored at only 69% likely to churn, yet they churned anyway. Customer 4, 7832-POPKP, scored the highest probability of churning, and they did in fact churn.

Forty percent of the customers in the sample churned (2 of 5), and 2 of our 3 highest-scoring customers were among them. Given the variance in the data, I think this is a pretty good model, and it could be very useful for identifying roughly the top quartile of customers to target with better deals so that they stay with our service.
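
The "2 of the top 3" observation can be checked with a short sketch. The scores are copied from the printed output above and taken at face value, and the true labels come from the assignment text. The helper `top_k_hits` is a hypothetical name, and the labels are assumed to be in the same row order as new_churn_data.csv.

```python
# Scores as printed by the notebook, keyed by customerID
reported = {
    "9305-CKSKC": 69.01,
    "1452-KNGVK": 68.91,
    "6723-OKKJM": 83.93,
    "7832-POPKP": 84.54,
    "6348-TACGU": 59.38,
}
# True churn values from the assignment (assumed to match the row order above)
true_churn = [1, 0, 0, 1, 0]


def top_k_hits(scores, labels, k):
    """Count how many of the k highest-scoring customers actually churned."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k])


hits = top_k_hits(list(reported.values()), true_churn, k=3)
print(f"{hits} of the top 3 scored customers churned")
```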
