# Logistic Regression Consulting Project

## Binary Customer Churn

A marketing agency has many customers that use its service to produce ads for their websites. The agency has noticed quite a bit of churn among these clients. Account managers are currently assigned more or less at random, but the agency wants you to create a machine learning model that predicts which customers will churn (stop buying the service), so that the customers most at risk can be assigned an account manager. Luckily, the agency has some historical data. Can you help them out? Create a classification algorithm that classifies whether or not a customer churned. The company can then apply it to incoming data for future customers to predict which of them will churn, and assign those customers an account manager.

The data is saved as customer_churn.csv. Here are the fields and their definitions:

    Name: Name of the latest contact at Company
    Age: Customer Age
    Total_Purchase: Total Ads Purchased
    Account_Manager: Binary 0=No manager, 1=Account manager assigned
    Years: Total Years as a customer
    Num_sites: Number of websites that use the service
    Onboard_date: Date that the latest contact was onboarded
    Location: Client HQ Address
    Company: Name of Client Company
    
Once you've created the model and evaluated it, test out the model on some new data (you can think of this almost like a hold-out set) that your client has provided, saved under new_customers.csv. The client wants to know which customers are most likely to churn given this data (they don't have the label yet).

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('logregdoc').getOrCreate()

The setup above stays the same throughout debugging. The list of imports below changes as the work progresses, so it is kept in its own cell.

In [2]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator


This function standardizes the data setup. It is used for both the training set and the evaluation set.

In [3]:
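def inhale_data(csv):
    # Read the CSV with a header row and let Spark infer the column types.
    df = spark.read.csv(csv,inferSchema=True,header=True)
    # Numeric columns used as model inputs.
    fields = ["Age","Total_Purchase","Years","Num_Sites"]
    # Pack the inputs into the single vector column that Spark ML expects.
    assembler = VectorAssembler(inputCols=fields, outputCol="features")
    df = assembler.transform(df)
    return df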


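Here, several models are tried at different training/test set ratios. The function returns a list where each element is a tuple of the form (train_fraction, AUC_unvalidated, AUC_crossvalidated). AUC_unvalidated is the AUC on the training data itself; AUC_crossvalidated is the AUC on the held-out test split (a single split, not k-fold cross-validation).

NB: 

Every run with train_fraction=1.0 should give the same AUC_unvalidated, because the logistic regression is fit on the identical full dataset each time.

AUC_crossvalidated is undefined at train_fraction=1.0, since no test split exists (manually set to -0.001 to avoid NaN errors).

train_fraction cannot be 0.0, since the model would have no data to train on.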



In [4]:
def progressive_challenges(df,train_fractions):
    logregmod = LogisticRegression(featuresCol='features',labelCol='Churn')
    # NOTE: the evaluator reads the hard 0/1 'prediction' column, so the AUC
    # values below are computed from thresholded labels, not from scores.
    my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                            labelCol='Churn')
    AUC = []
    for train_fraction in train_fractions:
        test_fraction = 1.00 - train_fraction

        # Split only when some data is left over for testing.
        if train_fraction != 1.0:
            train_df,test_df = df.randomSplit([train_fraction,test_fraction])
        else:
            train_df = df

        fit_model = logregmod.fit(train_df)

        # AUC on the same data the model was trained on.
        AUC_unvalidated = my_eval.evaluate(fit_model.transform(train_df))

        # AUC on the held-out split, or a sentinel when there is none.
        if train_fraction != 1.0:
            AUC_crossvalidated = my_eval.evaluate(fit_model.transform(test_df))
        else:
            AUC_crossvalidated = -0.001

        AUC.append((train_fraction,AUC_unvalidated,AUC_crossvalidated))
    return AUC
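
Because the evaluator above is given `rawPredictionCol='prediction'`, the AUC is computed from hard 0/1 labels rather than from the model's scores. The following is a minimal pure-Python sketch, on made-up data, of how much the two can differ. The `auc` helper and the toy values are illustrative assumptions, not part of the original notebook or of Spark.

```python
# Illustrative only: AUC as the fraction of (positive, negative) pairs that
# are ranked correctly, with ties counted as half a win.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data, not taken from customer_churn.csv.
labels = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.40, 0.60, 0.45, 0.70, 0.90]       # made-up P(churn) values
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]   # thresholded at 0.5

auc_from_scores = auc(labels, probs)   # 8/9, about 0.89
auc_from_hard = auc(labels, hard)      # 2/3, about 0.67
```

With hard labels the result collapses to the average of the true-positive and true-negative rates at the 0.5 threshold. In Spark, leaving `rawPredictionCol` at its default of `rawPrediction` would score the ranking instead, which would change the numbers reported in this notebook.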

In [5]:
def train_predict(train_df, eval_df):
    logregmod = LogisticRegression(featuresCol='features',labelCol='Churn')
    fit_model = logregmod.fit(train_df)
    return fit_model.transform(eval_df)


In [6]:
df = inhale_data("customer_churn.csv")

Do a simple trial model based on the numeric fields only.

Each tuple is (train_fraction, AUC_unvalidated, AUC_crossvalidated).

The results below are consistent with the results in the example presented. Overall performance is not great. On the plus side, the held-out AUC when training on 10% and predicting on 90% is not too different from training on 90% and predicting on 10%.

If this were a real engagement, we would break out the zip code from the address, pull the year from the onboarding date, and try to improve the model. The example presented does not do this.
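
The two extra features mentioned above could be derived along these lines. This is a plain-Python sketch of the parsing logic only, under two assumptions that are not confirmed by the source: that `Location` ends in a US-style 5-digit ZIP code, and that `Onboard_date` looks like `YYYY-MM-DD HH:MM:SS`. The helper names and sample values are hypothetical.

```python
import re
from datetime import datetime

# Assumption: the address ends in a 5-digit ZIP, optionally followed by "-NNNN".
ZIP_AT_END = re.compile(r"(\d{5})(?:-\d{4})?\s*$")

def zip_code(location):
    """Return the trailing 5-digit ZIP code as a string, or None if absent."""
    match = ZIP_AT_END.search(location)
    return match.group(1) if match else None

def onboard_year(onboard_date, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the year of an onboarding timestamp given as a string."""
    return datetime.strptime(onboard_date, fmt).year

# Hypothetical values, not rows from customer_churn.csv.
sample_location = "123 Example Street Sampletown, AK 89518"
sample_date = "2013-08-30 07:00:40"

sample_zip = zip_code(sample_location)     # "89518"
sample_year = onboard_year(sample_date)    # 2013
```

In Spark itself, `regexp_extract` and `year` from `pyspark.sql.functions` would likely be the more natural tools. A ZIP code is categorical, so it would need to be encoded before being added to the `fields` list passed to `VectorAssembler`.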



In [7]:
train_fractions = [0.1,0.1,0.3,0.3,0.5,0.5,0.7,0.7,0.9,1.0,1.0]
progressive_challenges(df,train_fractions)

[(0.1, 0.8081791626095424, 0.7966744387289073),
 (0.1, 0.8453441295546559, 0.7824873237867052),
 (0.3, 0.7785117056856188, 0.7604199372056514),
 (0.3, 0.7474628274722681, 0.7143958389807535),
 (0.5, 0.7841130604288499, 0.7354534986113934),
 (0.5, 0.721175992412952, 0.7342711145951281),
 (0.7, 0.7770079435127979, 0.772340425531915),
 (0.7, 0.7831683168316832, 0.7612244897959183),
 (0.9, 0.7625238469307598, 0.8279411764705883),
 (1.0, 0.7633333333333333, -0.001),
 (1.0, 0.7633333333333333, -0.001)]
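
Since each split is random, repeated runs at the same train_fraction give different AUC values. One way to read the list above is to average the repeats per fraction while skipping the -0.001 placeholder. The sketch below assumes the tuple layout returned by `progressive_challenges`; the helper name `mean_auc_by_fraction` is hypothetical.

```python
from collections import defaultdict

SENTINEL = -0.001  # placeholder used above when there is no test split

def mean_auc_by_fraction(auc_rows):
    """Average training and held-out AUC per train_fraction.

    Held-out values equal to the sentinel are ignored; a fraction with no
    real held-out value gets None in that slot.
    """
    train, held_out = defaultdict(list), defaultdict(list)
    for fraction, auc_train, auc_test in auc_rows:
        train[fraction].append(auc_train)
        if auc_test != SENTINEL:
            held_out[fraction].append(auc_test)
    summary = {}
    for fraction in sorted(train):
        tests = held_out[fraction]
        summary[fraction] = (
            sum(train[fraction]) / len(train[fraction]),
            sum(tests) / len(tests) if tests else None,
        )
    return summary

# Three of the rows above, rounded to two decimals for readability.
rows = [(0.5, 0.78, 0.74), (0.5, 0.72, 0.73), (1.0, 0.76, -0.001)]
summary = mean_auc_by_fraction(rows)
```

Passing the full list returned by `progressive_challenges(df, train_fractions)` instead of `rows` would summarize every fraction that was tried.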

Now train on all of the labeled data and predict churn for the new, unlabeled customers.

In [8]:
eval_df = inhale_data("new_customers.csv")
eval_df = train_predict(df, eval_df)

In [9]:
eval_df.select(['Company','prediction']).show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+

