<a href="https://colab.research.google.com/github/cagrmo11/logreg-customer-churn/blob/master/LogisticRegression_CustomerChurn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Customer Churn with Logistic Regression in PySpark

Install `pyspark` in the environment. It does not come preinstalled (out of the box) with Colab.


In [2]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/37/98/244399c0daa7894cdf387e7007d5e8b3710a79b67f3fd991c0b0b644822d/pyspark-2.4.3.tar.gz (215.6MB)
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/8d/20/f0/b30e2024226dc112e256930dd2cd4f06d00ab053c86278dcf3
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.3


**Import pyspark libraries**

In [0]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

**Create Spark session and read in CSV data**

In [0]:
spark = SparkSession.builder.appName('logreg').getOrCreate()

In [0]:
data = spark.read.csv('customer_churn.csv',inferSchema=True,header=True)

**Explore Customer Churn data**

In [14]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [40]:
data.show()

+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|              Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|   Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|
|      Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|
|        Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 06:20:07|1331 Keith Court ...|Miller, Johnson a...|    1|
|      Phillip White|42.0|       8010.76|              0| 6.71|     10.0|2014-04-22 12:43:12|13120 Daniel Moun...|           Smith Inc|    1|
|     

In [19]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

**Use VectorAssembler to combine the predictor columns into a single feature vector for the churn prediction, then transform the data to add that `features` column.**

In [0]:
assembler = VectorAssembler(inputCols=['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'],outputCol='features')

In [0]:
output = assembler.transform(data)
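
As a rough illustration of what the assembler produces, the following plain-Python sketch builds the same kind of feature vector for the first row of the `data.show()` output. It does not use Spark: the helper name `assemble_features` is hypothetical, and the real `VectorAssembler` returns a Spark ML vector rather than a Python list.

```python
# Plain-Python illustration only; VectorAssembler itself returns a Spark ML vector.
INPUT_COLS = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']

def assemble_features(row, input_cols=INPUT_COLS):
    """Hypothetical helper: concatenate the chosen columns, in order, as floats."""
    return [float(row[col]) for col in input_cols]

# First row of the data.show() output (date, location and company columns omitted).
first_row = {
    'Names': 'Cameron Williams',
    'Age': 42.0,
    'Total_Purchase': 11066.8,
    'Account_Manager': 0,
    'Years': 7.22,
    'Num_Sites': 8.0,
    'Churn': 1,
}

features = assemble_features(first_row)
print(features)  # [42.0, 11066.8, 0.0, 7.22, 8.0]
```

Columns that are not listed in `inputCols` (such as `Names` or `Company`) are left out of the vector, which is why only five numbers appear.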

**Save the transformed dataframe after selecting the new features column and the churn column that you want to predict**

In [0]:
final_data = output.select('features','churn')

**Split the data into test and training data sets**

In [0]:
train_churn,test_churn = final_data.randomSplit([0.7,0.3])
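
`randomSplit` assigns rows to the splits at random, so 70/30 is a target rather than an exact ratio, and the split changes between runs unless a seed is supplied (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea on row indices; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def random_split(rows, train_fraction=0.7, seed=None):
    """Hypothetical helper: send each row to train with probability train_fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for DataFrame rows
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

# Same seed, same split; sizes are close to 70/30 but not exact.
print(len(train_a), len(test_a), train_a == train_b)
```

Fixing the seed makes the evaluation numbers further down repeatable from one run to the next.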

**Create a logistic regression object using 'churn' as the column you want to predict**

In [0]:
lr_churn = LogisticRegression(labelCol='churn')

**Fit the logistic regression model on the training data**

In [0]:
fitted_churn_model = lr_churn.fit(train_churn)
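
To give a feel for what fitting involves, the sketch below trains a one-feature logistic regression with plain batch gradient descent on made-up data. This is a stand-in technique for illustration only: Spark's `LogisticRegression` uses its own optimizer and supports options (such as regularization) that are not shown here, and `fit_logistic` is a hypothetical helper.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, n_iter=1000):
    """Hypothetical helper: one-feature logistic regression via batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up toy data: negative x -> label 0, positive x -> label 1.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(predictions)  # [0, 0, 0, 1, 1, 1]
```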

**Check output of the fitted model and see the rawPrediction, probability, and prediction values**

In [0]:
training_sum = fitted_churn_model.summary

In [39]:
training_sum.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[25.0,9672.03,0.0...|  0.0|[4.53777904738992...|[0.98941607962991...|       0.0|
|[26.0,8939.61,0.0...|  0.0|[6.26678604710152...|[0.99810527689047...|       0.0|
|[27.0,8628.8,1.0,...|  0.0|[5.22570644675506...|[0.99465219244579...|       0.0|
|[28.0,8670.98,0.0...|  0.0|[7.68680041323758...|[0.99954136650464...|       0.0|
|[28.0,9090.43,1.0...|  0.0|[1.34824805349724...|[0.79384305879075...|       0.0|
|[28.0,11128.95,1....|  0.0|[3.97140564433774...|[0.98150171333656...|       0.0|
|[28.0,11204.23,0....|  0.0|[1.83611546300779...|[0.86248864202563...|       0.0|
|[28.0,11245.38,0....|  0.0|[3.58375708387202...|[0.97297923467449...|       0.0|
|[29.0,5900.78,1.0...|  0.0|[3.93243332794652...|[0.98078068912841...|       0.0|
|[29.0,8688.17,1
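
For binary logistic regression, the first entry of `probability` is the logistic (sigmoid) function applied to the first entry of `rawPrediction`. The short check below, in plain Python, reproduces the first two rows of the table above. The values are copied from the truncated output, so the comparison only holds to the digits shown.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# (rawPrediction[0], probability[0]) pairs copied from the first two rows above.
rows = [
    (4.53777904738992, 0.98941607962991),
    (6.26678604710152, 0.99810527689047),
]

for raw, shown_probability in rows:
    computed = sigmoid(raw)
    # Both numbers agree to the precision displayed in the table.
    print(round(computed, 6), round(shown_probability, 6))
```

Both entries are well above 0.5 for class 0, which is why the `prediction` column shows 0.0 for these rows.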

**Now use the fitted model on the test data and evaluate the predictions**

In [0]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

In [67]:
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,1....|    0|[4.35687913401306...|[0.98734390047568...|       0.0|
|[26.0,8787.39,1.0...|    1|[0.46981080864297...|[0.61533897635686...|       0.0|
|[29.0,11274.46,1....|    0|[4.32041624637542...|[0.98688007222813...|       0.0|
|[29.0,12711.15,0....|    0|[5.21705004738516...|[0.99460594953352...|       0.0|
|[30.0,8677.28,1.0...|    0|[3.89121228685363...|[0.97998807950078...|       0.0|
|[30.0,10960.52,1....|    0|[2.20472541645675...|[0.90067305298035...|       0.0|
|[30.0,12788.37,0....|    0|[2.46012346675768...|[0.92129861555817...|       0.0|
|[31.0,9574.89,0.0...|    0|[3.15609104287833...|[0.95914805611776...|       0.0|
|[31.0,12264.68,1....|    0|[3.31660205003806...|[0.96499398789653...|       0.0|
|[32.0,7896.65,0

**Use the BinaryClassificationEvaluator to check how well your model made correct predictions**

In [0]:
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='churn')

**The AUC evaluation does not work in this Colab session and fails with the exception shown below**


In [69]:
auc = churn_eval.evaluate(pred_and_labels.predictions)

IllegalArgumentException: ignored
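
Area under the ROC curve (AUC) is a ranking metric: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. It is therefore normally computed from a continuous score such as `rawPrediction` (the evaluator's default `rawPredictionCol`) rather than from the hard 0/1 `prediction` column passed above. The sketch below is plain Python on made-up scores, with a hypothetical `auc` helper. It is not the Spark evaluator and does not address the exception shown above.

```python
def auc(labels, scores):
    """Hypothetical helper: AUC as the fraction of (positive, negative) pairs
    ranked correctly, counting ties as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and churn scores, for illustration only.
labels = [0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.35, 0.40, 0.60, 0.80]
hard_predictions = [1.0 if s > 0.5 else 0.0 for s in scores]

print(auc(labels, scores))            # 0.875
print(auc(labels, hard_predictions))  # 0.625
```

Thresholding the scores first discards the ranking information, so the two numbers differ even though they describe the same model.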

**Fit the logistic regression model on the entire data set**

In [0]:
final_lr_model = lr_churn.fit(final_data)

**Import and explore new unlabeled customer dataset - no churn column**

In [0]:
new_customers = spark.read.csv('new_customers.csv',inferSchema=True,header=True)

In [52]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



**Apply same VectorAssembler transformation to create features column**

In [0]:
test_new_customers = assembler.transform(new_customers)

In [55]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



**Apply model to newly transformed unlabeled customer data**

In [0]:
final_results = final_lr_model.transform(test_new_customers)

**See churn predictions for each company based on historical data.**

**0 = Will not churn, 1 = Will churn**


In [60]:
final_results.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+

