# Logistic Regression Consulting Project

## Binary Customer Churn

A marketing agency has many customers that use its service to produce ads for their websites. The agency has noticed quite a bit of churn among these clients. Account managers are currently assigned more or less at random, but the agency wants a machine learning model that predicts which customers will churn (stop buying the service), so that the customers most at risk can be assigned an account manager. Luckily there is some historical data. Can you help them out? Create a classification algorithm that classifies whether or not a customer churned. The company can then apply it to incoming data for future customers to predict which of them will churn and assign them an account manager.

The data is saved as customer_churn.csv. Here are the fields and their definitions:

    Names: Name of the latest contact at the company
    Age: Customer age
    Total_Purchase: Total ads purchased
    Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
    Years: Total years as a customer
    Num_Sites: Number of websites that use the service
    Onboard_date: Date on which the latest contact was onboarded
    Location: Client HQ address
    Company: Name of the client company
    
Once you've created the model and evaluated it, test out the model on some new data (you can think of this almost like a hold-out set) that your client has provided, saved under new_customers.csv. The client wants to know which customers are most likely to churn given this data (they don't have the label yet).

## Load Spark and Data

In [2]:
# start spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logreg_project').getOrCreate()

In [3]:
# read in the input csv file.
data = spark.read.csv('customer_churn.csv', inferSchema=True, header=True)

In [4]:
data.count()

900

In [5]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



## Clean Data

Select only the columns we are interested in and check how many nulls each one contains.

In [7]:
# select only desired columns
mycols_data = data.select('Age','Total_Purchase','Years','Num_Sites','Company','Churn')

In [8]:
# count the NaNs or NULLs in each column
from pyspark.sql.functions import count, when, isnan, col
mycols_data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in mycols_data.columns]).show()

+---+--------------+-----+---------+-------+-----+
|Age|Total_Purchase|Years|Num_Sites|Company|Churn|
+---+--------------+-----+---------+-------+-----+
|  0|             0|    0|        0|      0|    0|
+---+--------------+-----+---------+-------+-----+



In [24]:
# nothing to drop
cleaned_data = mycols_data

In [79]:
# check stats of each column with a pretty printed describe()

from pyspark.sql.functions import format_number
cdd = cleaned_data.describe()
cdd.select([cdd['summary'],
            format_number(cdd['Age'].cast('double'), 2).alias('Age'),
            format_number(cdd['Total_Purchase'].cast('double'), 2).alias('Total_Purchase'),
            format_number(cdd['Years'].cast('double'), 2).alias('Years'),
            format_number(cdd['Num_Sites'].cast('double'), 2).alias('Num_Sites'),
            cdd['Company'],
            format_number(cdd['Churn'].cast('double'), 2).alias('Churn'),
           ]).show()

+-------+------+--------------+------+---------+--------------------+------+
|summary|   Age|Total_Purchase| Years|Num_Sites|             Company| Churn|
+-------+------+--------------+------+---------+--------------------+------+
|  count|900.00|        900.00|900.00|   900.00|                 900|900.00|
|   mean| 41.82|     10,062.82|  5.27|     8.59|                null|  0.17|
| stddev|  6.13|      2,408.64|  1.27|     1.76|                null|  0.37|
|    min| 22.00|        100.00|  1.00|     3.00|     Abbott-Thompson|  0.00|
|    max| 65.00|     18,026.01|  9.15|    14.00|Zuniga, Clark and...|  1.00|
+-------+------+--------------+------+---------+--------------------+------+



In [87]:
# how many churns?
cleaned_data.filter(cleaned_data['Churn'] == 1).count()

150
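
The count above means the label is imbalanced: 150 of the 900 customers churned. As a rough sanity check, the plain-Python arithmetic below (using only the two counts shown above) gives the churn rate and the accuracy that a trivial "nobody churns" model would reach. It is only a baseline to keep in mind when reading the later metrics, not part of the Spark workflow.

```python
# Counts taken from the notebook outputs above
total_customers = 900
churned = 150

churn_rate = churned / total_customers
# Accuracy of a trivial model that always predicts "no churn"
majority_baseline_accuracy = 1 - churn_rate

print(f"churn rate: {churn_rate:.3f}")
print(f"always-predict-0 accuracy: {majority_baseline_accuracy:.3f}")
```

Because such a trivial model already gets most rows right, a ranking metric such as the area under the ROC curve is more informative here than raw accuracy.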

## Prepare Data

All columns are numerical except 'Company', to which we will apply string indexing. (One-hot encoding might be needed here, but I'll skip it.)

Then we will assemble the features vector.

All of this is configured in a pipeline.
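
To make the indexing versus one-hot question concrete, here is a minimal plain-Python sketch of the two encodings. It mimics the default behaviour of StringIndexer as I understand it (most frequent label gets index 0), but the helper names `fit_string_index` and `one_hot` and the toy company names are hypothetical. Unlike Spark's OneHotEncoder, which drops the last category by default, this sketch keeps every category.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot(index, size):
    """Expand an integer-valued index into a 0/1 vector of length `size`."""
    return [1.0 if i == int(index) else 0.0 for i in range(size)]

# Hypothetical toy column, not taken from customer_churn.csv
companies = ["Acme", "Bolt", "Acme", "Core", "Bolt", "Acme"]

mapping = fit_string_index(companies)
indexed = [mapping[c] for c in companies]
encoded = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)
print(encoded[0], encoded[1])
```

The bare index puts the categories on a numeric scale that a linear model will treat as ordered, while one-hot encoding gives each category its own coefficient at the cost of one column per category.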

In [68]:
from pyspark.ml.feature import (VectorAssembler,VectorIndexer,StringIndexer)
from pyspark.ml import Pipeline

In [69]:
# list all string columns where we will apply the transformations
string_columns = ['Company']

In [70]:
# add the transformations on categorical columns as stages in the Pipeline
# (the loop variable is named `column` so it does not shadow `col` imported from pyspark.sql.functions)
stages = []
for column in string_columns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=column, outputCol=column + "Index")

    # Optional: OneHotEncoder would convert the indexed column into binary SparseVectors
    # (skipped here; it would also need to be imported from pyspark.ml.feature)
    #encoder = OneHotEncoder(inputCol=column + "Index", outputCol=column + "Vec")

    # Add into stages
    stages += [stringIndexer]

In [71]:
# list all numerical features that need no transformation
numeric_columns = ['Age','Total_Purchase','Years','Num_Sites']

In [72]:
# What about normalization of numerical columns with a normalizer?

In [73]:
# Configure the vector assembler and add it as a pipeline stage
assembler_inputs = [c + "Index" for c in string_columns] + numeric_columns
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
stages += [assembler]
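
On the normalization question raised in the comment above: the sketch below shows, in plain Python, what standardizing a column means (subtract the mean, divide by the sample standard deviation). The helper `standardize` is hypothetical, and the values are copied from the first five rows of `final_data` printed further down. Note that `LogisticRegression` in pyspark.ml has a `standardization` parameter that defaults to True, so an explicit scaling stage may not be needed for this model.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(x - mu) / sigma for x in values]

# Total_Purchase and Num_Sites values from the first five rows of final_data
total_purchase = [11066.8, 11916.22, 12884.75, 8010.76, 9191.58]
num_sites = [8.0, 11.0, 12.0, 10.0, 9.0]

z_purchase = standardize(total_purchase)
z_sites = standardize(num_sites)

# After scaling, both columns live on a comparable scale
print([round(z, 2) for z in z_purchase])
print([round(z, 2) for z in z_sites])
```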

In [74]:
# the stages so far...
stages

[StringIndexer_40fdaf0947592adb3a13, VectorAssembler_4f96a395ebf9de348253]

In [75]:
# Create a Pipeline with all previous actions
features_pipeline = Pipeline(stages=stages)

In [76]:
# Run the feature transformations
final_data = features_pipeline.fit(cleaned_data).transform(cleaned_data)

# select only features and label columns
final_data = final_data.select('features', 'Churn')

In [77]:
final_data.head(5)

[Row(features=DenseVector([824.0, 42.0, 11066.8, 7.22, 8.0]), Churn=1),
 Row(features=DenseVector([1.0, 41.0, 11916.22, 6.5, 11.0]), Churn=1),
 Row(features=DenseVector([272.0, 38.0, 12884.75, 6.67, 12.0]), Churn=1),
 Row(features=DenseVector([21.0, 42.0, 8010.76, 6.71, 10.0]), Churn=1),
 Row(features=DenseVector([524.0, 37.0, 9191.58, 5.56, 9.0]), Churn=1)]

In [78]:
# Randomly split data into training and test sets; set the seed for reproducibility
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=100)
print(train_data.count())
print(test_data.count())

618
282
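
The split is 618/282 rather than exactly 630/270 because the weights act as per-row probabilities, not as exact proportions. The plain-Python sketch below illustrates that idea under the assumption that each row is assigned independently; the helper `random_split` is hypothetical and is not Spark's implementation, so its counts will differ from the ones above.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point slack
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(900))  # stand-ins for the 900 customers
train, test = random_split(rows, [0.7, 0.3], seed=100)
print(len(train), len(test))
```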


## Fit Logistic Regression model

Just fit the model with the train_data

In [121]:
from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
log_reg = LogisticRegression(labelCol='Churn', featuresCol='features', maxIter=20)

# Train model with Training Data
log_reg_model = log_reg.fit(train_data)

In [122]:
# Observe history of objective (optimizer)
log_reg_model.summary.objectiveHistory

[0.45315607565403704,
 0.4502984696058951,
 0.4320218479136546,
 0.3712197048313319,
 0.3645315593854564,
 0.36315719238575317,
 0.36296060505013067,
 0.3626645236647848,
 0.36189973320469954,
 0.3600359063097216,
 0.3556504981285021,
 0.346548428413119,
 0.33091381368477735,
 0.32206015312551034,
 0.3053964968577326,
 0.2937744858197606,
 0.28883723135422795,
 0.28362863146181316,
 0.25963878894581965,
 0.2572683493594377,
 0.2570045126178582]

In [136]:
# stats on predictions vs labels (on train data)
log_reg_model.summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              Churn|         prediction|
+-------+-------------------+-------------------+
|  count|                618|                618|
|   mean|0.16828478964401294|0.13268608414239483|
| stddev| 0.3744220438216086|0.33950994595753775|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+
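
A quick check on the objective history: if the model assigned every training row the same churn probability (the training churn rate), the mean log loss would be the entropy of the label distribution. The sketch below computes that value in plain Python. The count of 104 churners is an assumption inferred from the Churn mean in the summary above (104 / 618 = 0.16828...).

```python
import math

# Assumption: 104 churners among the 618 training rows, inferred from the
# Churn mean 0.16828478964401294 in the summary above (104 / 618).
n_train = 618
n_churn = 104
p = n_churn / n_train

# Mean log loss when every row is assigned the same churn probability p
intercept_only_loss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
print(round(intercept_only_loss, 4))  # ≈ 0.4532
```

This agrees with the first entry of `objectiveHistory` (0.4531...), which is consistent with the optimizer starting from an intercept-only model, although the notebook itself does not state this. The later entries show how much the features improve on that starting point.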



## Evaluate on test data

Now evaluate the fitted model on the test data

In [123]:
test_results = log_reg_model.transform(test_data)

In [124]:
test_results.select('Churn','prediction','probability').show()

+-----+----------+--------------------+
|Churn|prediction|         probability|
+-----+----------+--------------------+
|    0|       0.0|[0.94539994899230...|
|    0|       0.0|[0.94025049244233...|
|    0|       0.0|[0.94922647221510...|
|    0|       0.0|[0.90177178049092...|
|    1|       1.0|[0.15486082733329...|
|    0|       0.0|[0.99789117379955...|
|    0|       0.0|[0.92654019294094...|
|    0|       0.0|[0.94066404686597...|
|    0|       0.0|[0.90740094916262...|
|    1|       1.0|[0.17999244510170...|
|    0|       0.0|[0.98552435893210...|
|    0|       0.0|[0.90126407756561...|
|    0|       0.0|[0.83898347775944...|
|    0|       0.0|[0.99577290879145...|
|    0|       0.0|[0.98274591902442...|
|    0|       0.0|[0.94615249194316...|
|    0|       0.0|[0.65823913194111...|
|    0|       0.0|[0.63800169380683...|
|    0|       0.0|[0.99275343419509...|
|    0|       0.0|[0.96465727920270...|
+-----+----------+--------------------+
only showing top 20 rows
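
The `probability` column holds a two-element vector, and the table above suggests the layout [P(Churn=0), P(Churn=1)]: rows with a small first component are the ones predicted as 1.0. The sketch below shows how a prediction follows from that vector with a 0.5 threshold, which is the default for LogisticRegression as far as I know. The helper `predict_from_probability` is hypothetical, and the second components are reconstructed as 1 minus the first because the output above truncates them.

```python
def predict_from_probability(probability, threshold=0.5):
    """Return 1.0 when P(churn) exceeds the threshold, else 0.0."""
    p_churn = probability[1]  # assumed layout: [P(Churn=0), P(Churn=1)]
    return 1.0 if p_churn > threshold else 0.0

# First components rounded from three rows of the table above; the second
# component is reconstructed as 1 - first, which is an assumption.
first_components = [0.945, 0.155, 0.658]
for p0 in first_components:
    probability = [p0, 1 - p0]
    print(probability, predict_from_probability(probability))
```

With a lower threshold, borderline customers are flagged as well: `predict_from_probability([0.658, 0.342], threshold=0.3)` returns 1.0. That may suit this use case, since assigning an account manager to a customer who would have stayed is probably cheaper than missing one who leaves.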



In [125]:
# Evaluate with BinaryClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
test_evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',labelCol='Churn')

In [126]:
# Get the area under ROC curve
test_evaluator.evaluate(test_results)

0.8866985998526159
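
To interpret the number above: the area under the ROC curve equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The plain-Python sketch below computes it that way on a tiny made-up example. The function `roc_auc` and the data are hypothetical and only illustrate the metric; they do not reproduce the evaluator's internals.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and churn scores, for illustration only
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
print(roc_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so roughly 0.89 on the test set means churners are ranked above non-churners in most pairs.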

## Predict on unlabeled data

Import the file of new customers and apply the features transformation pipeline to construct the features vector. There are no labels here, so the model can only predict, not be scored.

In [141]:
# read in the input csv file.
unlabeled_data = spark.read.csv('new_customers.csv', inferSchema=True, header=True)

In [146]:
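# Run the feature transformations
# (note: this refits the pipeline on the new file, so CompanyIndex is not the mapping learned from the training data)
unlabeled_final_data = features_pipeline.fit(unlabeled_data).transform(unlabeled_data)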

In [147]:
unlabeled_final_data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- CompanyIndex: double (nullable = true)
 |-- features: vector (nullable = true)
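
One caveat about the cell above: it fits the pipeline again on the new file, so `CompanyIndex` is derived from the new company names rather than from the mapping the model was trained on. The plain-Python sketch below contrasts reusing a learned mapping with refitting it. The helpers `fit_index` and `apply_index` and the company names are hypothetical. Sending unseen labels to one extra index is similar in spirit to StringIndexer's `handleInvalid='keep'` option.

```python
from collections import Counter

def fit_index(values):
    """Learn label -> index from training data, most frequent label first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def apply_index(mapping, values):
    """Reuse a learned mapping; all unseen labels share one extra index."""
    unseen_index = len(mapping)
    return [mapping.get(v, unseen_index) for v in values]

# Hypothetical company names, for illustration only
train_companies = ["Acme", "Bolt", "Acme", "Core"]
new_companies = ["Core", "Dune", "Acme"]

mapping = fit_index(train_companies)                          # learned once, on training data
reused = apply_index(mapping, new_companies)                  # consistent with training
refit = apply_index(fit_index(new_companies), new_companies)  # indices change

print(reused, refit)
```

In Spark terms this would mean keeping the fitted pipeline model from the training step and calling only `transform` on the new data, with a policy for company names that were never seen during training.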



In [143]:
# apply the model and predict 'Churn'
churn_predictions = log_reg_model.transform(unlabeled_final_data)

In [145]:
churn_predictions.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+

