# Logistic Regression

Scenario:

A marketing agency has many customers that use its service to produce ads for their websites.
The agency has noticed quite a bit of churn among its clients.
Account managers are currently assigned to clients at random, and this approach does not work well.

Goal:

In this project I use logistic regression to build a machine learning model that predicts which customers will churn, so that the agency can assign an account manager to the customers most at risk.

Dataset description:

Here we use two datasets. The first one, "customer_churn.csv", is the one we build our model on, because it is already labeled. The second one, "new_customers.csv", is the dataset on which we actually use the model to get predictions.

So let's start!

In [1]:
# Basic imports (explicit names instead of a wildcard, so Python built-ins are not shadowed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp
import pyspark

In [2]:
# Create a Spark session
spark = SparkSession.builder.appName('churn_analysis').getOrCreate()

In [3]:
# Read the CSV file, inferring the column types and using the first row as the header
data = spark.read.csv('customer_churn.csv', inferSchema=True, header=True)

In [4]:
# Display the data frame
data.show()

+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|              Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|   Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|
|      Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|
|        Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 06:20:07|1331 Keith Court ...|Miller, Johnson a...|    1|
|      Phillip White|42.0|       8010.76|              0| 6.71|     10.0|2014-04-22 12:43:12|13120 Daniel Moun...|           Smith Inc|    1|
|     

In [5]:
# Inspect the schema
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



# Data preparation

As a first step, I work on the "Onboard_date" column to create a new feature for the predictive model.
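
As a rough illustration of the idea, the same transformation can be sketched in plain Python: keep only the year and month of the onboarding date and express the start of that month as a Unix timestamp. The helper name `month_start_epoch` and the choice of UTC are assumptions of this sketch, not part of the notebook. Spark converts using its own session time zone, so the values in the notebook output differ from these by a fixed offset (7200 seconds for the August 2013 rows).

```python
from datetime import datetime, timezone


def month_start_epoch(onboard_date: str) -> int:
    """Return the Unix timestamp (seconds) of the first day of the month of `onboard_date`."""
    # Parse the full timestamp, e.g. "2013-08-30 07:00:40"
    parsed = datetime.strptime(onboard_date, "%Y-%m-%d %H:%M:%S")
    # Keep only the year and month; fixing the time zone to UTC is an assumption of this sketch
    month_start = datetime(parsed.year, parsed.month, 1, tzinfo=timezone.utc)
    return int(month_start.timestamp())


print(month_start_epoch("2013-08-30 07:00:40"))  # 1375315200
```

Two customers onboarded in the same month get the same value, so the feature captures when a customer joined at monthly resolution.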

In [6]:
# Split the 'Onboard_date' column on the space; for each row the result looks like
# ["2011-08-29", "18:37:54"]
split_col = pyspark.sql.functions.split(data['Onboard_date'], ' ')

# From the date part I keep only the year and month (e.g. "2011-08-"); it is converted to an integer in the next cell
df = data.withColumn('Date', split_col.getItem(0)[0:8])

In [7]:
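# Convert the year-month string to a Unix timestamp (seconds) stored in the 'DATA' column
# Note: 'Date' ends with a trailing '-' (e.g. "2013-08-"), which the 'yyyy-MM' pattern does not describe;
# it was accepted in the run shown here, but a stricter date parser may reject it
df = df.withColumn('DATA', unix_timestamp(col('Date'), format='yyyy-MM'))
df.show()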

+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------+----------+
|              Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|    Date|      DATA|
+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------+----------+
|   Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|2013-08-|1375308000|
|      Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|2013-08-|1375308000|
|        Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 06:20:07|1331 Keith Court ...|Miller, Johnson a...|    1|2016-06-|1464732000|
|      Phillip White|42.0|  

From here I start to shape a new data frame for the model. The final df will have a "label" column and a "features" column holding the vectorized feature variables.
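
To picture the target shape, here is a minimal plain-Python sketch of what "vectorizing" one row means: the selected numeric columns are concatenated, in order, into a single list of floats, and the churn flag becomes the label. The function `assemble` is a hypothetical stand-in used only for illustration; the notebook itself relies on Spark's VectorAssembler.

```python
def assemble(row: dict, input_cols: list) -> list:
    """Concatenate the chosen columns of one row into a single list of floats."""
    return [float(row[name]) for name in input_cols]


# Numeric columns of the first row shown in the data frame output above
row = {"Age": 42.0, "Total_Purchase": 11066.8, "Account_Manager": 0,
       "Years": 7.22, "Num_Sites": 8.0, "DATA": 1375308000, "Churn": 1}
input_cols = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites", "DATA"]

features = assemble(row, input_cols)  # one vector per customer
label = row["Churn"]                  # the value the model learns to predict
print(label, features)
```

The order of `input_cols` matters: the same order has to be used later when preparing the new customers, otherwise the model's coefficients would be applied to the wrong variables.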

In [8]:
# Imports for vectorizing the features
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [9]:
# Instantiate the VectorAssembler, passing as input columns all the variables I want to use as features
assembler = VectorAssembler(inputCols=['Age','Total_Purchase','Account_Manager','Years','Num_Sites','DATA'],
                             outputCol= 'features')

In [10]:
# Transform the data frame, which adds the 'features' column
output = assembler.transform(df)
output.show()

+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------+----------+--------------------+
|              Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|    Date|      DATA|            features|
+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------+----------+--------------------+
|   Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|2013-08-|1375308000|[42.0,11066.8,0.0...|
|      Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|2013-08-|1375308000|[41.0,11916.22,0....|
|        Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 0

In [11]:
# Selecting only the columns I need 
final_data = output.select('features', 'Churn')

In [12]:
# Rename 'Churn' to 'label' (the default label column name used by LogisticRegression) and keep 'features'
final_data = final_data.selectExpr("Churn as label", "features")
final_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|[42.0,11066.8,0.0...|
|    1|[41.0,11916.22,0....|
|    1|[38.0,12884.75,0....|
|    1|[42.0,8010.76,0.0...|
|    1|[37.0,9191.58,0.0...|
|    1|[48.0,10356.02,0....|
|    1|[44.0,11331.58,1....|
|    1|[32.0,9885.12,1.0...|
|    1|[43.0,14062.6,1.0...|
|    1|[40.0,8066.94,1.0...|
|    1|[30.0,11575.37,1....|
|    1|[45.0,8771.02,1.0...|
|    1|[45.0,8988.67,1.0...|
|    1|[40.0,8283.32,1.0...|
|    1|[41.0,6569.87,1.0...|
|    1|[38.0,10494.82,1....|
|    1|[45.0,8213.41,1.0...|
|    1|[43.0,11226.88,0....|
|    1|[53.0,5515.09,0.0...|
|    1|[46.0,8046.4,1.0,...|
+-----+--------------------+
only showing top 20 rows



In [13]:
# Create the training set (about 70%) and the test set (about 30%)
lgr_training, lgr_test = final_data.randomSplit([0.7, 0.3])

In [14]:
# Checking the split
lgr_training.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                624|
|   mean|0.16025641025641027|
| stddev| 0.3671379894939084|
|    min|                  0|
|    max|                  1|
+-------+-------------------+



In [15]:
# Checking the split
lgr_test.describe().show()

+-------+-------------------+
|summary|              label|
+-------+-------------------+
|  count|                276|
|   mean|0.18115942028985507|
| stddev|0.38584984825945506|
|    min|                  0|
|    max|                  1|
+-------+-------------------+
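
The two counts above (624 and 276) add up to the 900 labeled rows, but they are not exactly 630 and 270. A plausible explanation, sketched below in plain Python, is that the split is random per row rather than an exact cut, so the sizes only approximate the requested weights. The function `random_split` is a hypothetical illustration of that idea, not Spark's implementation. In PySpark, `randomSplit` also accepts a `seed` argument, which makes the split reproducible between runs.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row independently to train (probability `train_fraction`) or test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(900))  # stand-in for the 900 labeled customers
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))  # close to 630 / 270, but usually not exact
```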



# Creation of the model

In [16]:
# Import the logistic regression estimator
from pyspark.ml.classification import LogisticRegression

In [17]:
# Instantiate the LogisticRegression estimator
lg_regress = LogisticRegression()

# Fit the model on the training data
lg_model = lg_regress.fit(lgr_training)

In [18]:
# Use the model's summary to inspect the predictions on the training set
training_summary = lg_model.summary

# Comparing the mean and the standard deviation of the predictions with those of the labels
# gives a first glance at the performance of the model
training_summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              label|         prediction|
+-------+-------------------+-------------------+
|  count|                624|                624|
|   mean|0.16025641025641027|              0.125|
| stddev| 0.3671379894939084|0.33098423194731325|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



OK, now it is time to use the model on the test set.

In [19]:
# Let's see how the model performs on the test set
predictLabels = lg_model.evaluate(lgr_test)

# Here I can compare the model's predictions directly with the labeled records
predictLabels.predictions.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|[27.0,8628.8,1.0,...|[5.28380959390407...|[0.99495253850763...|       0.0|
|    0|[28.0,9090.43,1.0...|[1.42771330057017...|[0.80654477144594...|       0.0|
|    0|[29.0,5900.78,1.0...|[3.83962197482202...|[0.97895086477677...|       0.0|
|    0|[29.0,10203.18,1....|[3.62060384389362...|[0.97393126033138...|       0.0|
|    0|[29.0,11274.46,1....|[4.49181676622733...|[0.98892377970641...|       0.0|
|    0|[29.0,13255.05,1....|[4.16455892764550...|[0.98470112583151...|       0.0|
|    0|[30.0,6744.87,0.0...|[3.34076089897675...|[0.96580098328716...|       0.0|
|    0|[30.0,8677.28,1.0...|[3.75136540764072...|[0.97705326272971...|       0.0|
|    0|[31.0,5304.6,0.0,...|[3.17700285396611...|[0.95995962299042...|       0.0|
|    0|[31.0,100
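
The three output columns are related: in the rows above, applying the logistic (sigmoid) function to the first element of "rawPrediction" reproduces the first element of "probability", and the prediction is the class whose probability passes the decision threshold. The sketch below checks this for the first row in plain Python. The values are copied from the truncated display, and the 0.5 threshold is assumed to be the default in use.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


raw_class0 = 5.28380959390407   # first element of rawPrediction, first row (display is truncated)
p_class0 = sigmoid(raw_class0)  # probability of label 0 (no churn)
p_class1 = 1.0 - p_class0       # probability of label 1 (churn)

# Assumed decision rule: predict churn when its probability exceeds 0.5
prediction = 1.0 if p_class1 > 0.5 else 0.0
print(p_class0, prediction)
```

The computed probability agrees with the 0.99495... shown in the "probability" column, and the prediction is 0.0, as in the table.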

Now we need to numerically evaluate the performance of our model.

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# For more information:
# https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator
# https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator

In [21]:
# Instantiate the BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='label')

In [22]:
# The "evaluate" function returns the AUC (Area Under the ROC Curve), the evaluator's default metric
# A good value should be well above 0.5 (1.0 would be a perfect ranking)
# A value of 0.5 means that the model does no better than assigning predictions at random,
# i.e. the model doesn't work.

auc = evaluator.evaluate(predictLabels.predictions)

In [23]:
print('AUC = ', auc)

AUC =  0.9163716814159271
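
One way to read this number: the AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The plain-Python sketch below computes it directly from that definition by comparing every positive/negative pair. The function name `auc_pairwise` and the four toy scores are made up for illustration, and this is not necessarily how Spark computes the metric internally.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (made up): two non-churners and two churners with their scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_pairwise(labels, scores))  # 0.75
```

With roughly 16-18% churners in the data, a model that always predicts "no churn" would already be right most of the time, which is why a ranking metric such as AUC is more informative here than plain accuracy.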


# Prediction on new data

In [24]:
# Load the new (unlabeled) dataset
newData = spark.read.csv('new_customers.csv', header=True, inferSchema=True)

In [25]:
newData.show()

+--------------+----+--------------+---------------+-----+---------+-------------------+--------------------+----------------+
|         Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|         Company|
+--------------+----+--------------+---------------+-----+---------+-------------------+--------------------+----------------+
| Andrew Mccall|37.0|       9935.53|              1| 7.71|      8.0|2011-08-29 18:37:54|38612 Johnny Stra...|        King Ltd|
|Michele Wright|23.0|       7526.94|              1| 9.28|     15.0|2013-07-22 18:19:54|21083 Nicole Junc...|   Cannon-Benson|
|  Jeremy Chang|65.0|         100.0|              1|  1.0|     15.0|2006-12-11 07:48:13|085 Austin Views ...|Barron-Robertson|
|Megan Ferguson|32.0|        6487.5|              0|  9.4|     14.0|2016-10-28 05:32:13|922 Wright Branch...|   Sexton-Golden|
|  Taylor Young|32.0|      13147.71|              1| 10.0|      8.0|2012-03-20 00:36:46|Unit 0789 Box 073...|  

From now on I must pay attention, because I need to shape a data frame with the same structure as the previous one. So I follow the same steps...

In [26]:
# Split the 'Onboard_date' column on the space; for each row the result looks like
# ["2011-08-29", "18:37:54"]
split_col = pyspark.sql.functions.split(newData['Onboard_date'], ' ')

# Here I keep only the year and month
df2 = newData.withColumn('Date', split_col.getItem(0)[0:8])

# Convert the year-month string to an integer (Unix timestamp in seconds)
df2 = df2.withColumn('DATA', unix_timestamp(col('Date'), format='yyyy-MM'))

# Check the conversion
df2.show()

+--------------+----+--------------+---------------+-----+---------+-------------------+--------------------+----------------+--------+----------+
|         Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|         Company|    Date|      DATA|
+--------------+----+--------------+---------------+-----+---------+-------------------+--------------------+----------------+--------+----------+
| Andrew Mccall|37.0|       9935.53|              1| 7.71|      8.0|2011-08-29 18:37:54|38612 Johnny Stra...|        King Ltd|2011-08-|1312149600|
|Michele Wright|23.0|       7526.94|              1| 9.28|     15.0|2013-07-22 18:19:54|21083 Nicole Junc...|   Cannon-Benson|2013-07-|1372629600|
|  Jeremy Chang|65.0|         100.0|              1|  1.0|     15.0|2006-12-11 07:48:13|085 Austin Views ...|Barron-Robertson|2006-12-|1164927600|
|Megan Ferguson|32.0|        6487.5|              0|  9.4|     14.0|2016-10-28 05:32:13|922 Wright Branch...|   Sexton

At this step I fit the logistic regression model on the "final_data" dataset (see above). Pay attention! final_data is the two-column dataset ("label", "features") created before the split into training and test sets.
After the evaluation on the test set, it is recommended to refit the model on all of the labeled data in order to improve the performance of the model itself.

In [27]:
# Fitting on the final_data
final_lgr_model = lg_regress.fit(final_data)

In [28]:
# Here I use the same assembler object in order to transform the new dataframe
test_new_costumers = assembler.transform(df2)

In [29]:
# Checking the creation of the features column
test_new_costumers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- DATA: long (nullable = true)
 |-- features: vector (nullable = true)



In [30]:
test_new_costumers.show()

+--------------+----+--------------+---------------+-----+---------+-------------------+--------------------+----------------+--------+----------+--------------------+
|         Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|         Company|    Date|      DATA|            features|
+--------------+----+--------------+---------------+-----+---------+-------------------+--------------------+----------------+--------+----------+--------------------+
| Andrew Mccall|37.0|       9935.53|              1| 7.71|      8.0|2011-08-29 18:37:54|38612 Johnny Stra...|        King Ltd|2011-08-|1312149600|[37.0,9935.53,1.0...|
|Michele Wright|23.0|       7526.94|              1| 9.28|     15.0|2013-07-22 18:19:54|21083 Nicole Junc...|   Cannon-Benson|2013-07-|1372629600|[23.0,7526.94,1.0...|
|  Jeremy Chang|65.0|         100.0|              1|  1.0|     15.0|2006-12-11 07:48:13|085 Austin Views ...|Barron-Robertson|2006-12-|1164927600|[65.0,100.0,1.

Now it is time to apply the "final" logistic regression model to the shaped data frame.

In [31]:
# Applying the model
# By default the model reads the 'features' column and appends the 'rawPrediction', 'probability' and 'prediction' columns
final_result = final_lgr_model.transform(test_new_costumers)

# Checking the schema
final_result.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- DATA: long (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



OK, here we are at the end of the job. At this step our model should predict which companies have a higher churn risk. So let's see the result!

In [32]:
final_result.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+
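
The table above shows only the hard 0/1 prediction. Since the goal is to decide which customers get an account manager first, it may be more useful to rank them by churn probability (the second element of the "probability" vector, assuming index 1 corresponds to label 1). The plain-Python sketch below shows that ranking step on hypothetical client names and probabilities; none of these values come from the model.

```python
def rank_by_risk(churn_probabilities: dict, threshold: float = 0.5) -> list:
    """Return (name, probability) pairs above `threshold`, highest risk first."""
    ranked = sorted(churn_probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(name, p) for name, p in ranked if p > threshold]


# Hypothetical churn probabilities, for illustration only
example = {"Client A": 0.91, "Client B": 0.12, "Client C": 0.64, "Client D": 0.48}
print(rank_by_risk(example))  # [('Client A', 0.91), ('Client C', 0.64)]
```

Lowering `threshold` flags more customers at the cost of more false alarms, so the value can be tuned to the number of account managers available.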

