# Logistic Regression Consulting Project

## Binary Customer Churn

A marketing agency has many customers that use its service to produce ads for their websites. The agency has noticed considerable churn among its clients. The goal is to build a machine learning model that predicts which customers will churn, so that the agency can assign an account manager to the customers most at risk.
Here are the fields and their definitions:

    Names: Name of the latest contact at the company
    Age: Customer age
    Total_Purchase: Total ads purchased
    Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
    Years: Total years as a customer
    Num_Sites: Number of websites that use the service
    Onboard_date: Date on which the latest contact was onboarded
    Location: Client HQ address
    Company: Name of the client company

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317145 sha256=516cce52ae25376de89f1e4cac10fc8fa64c373b04f84bfe1883969b14145a15
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logistic_reg_project').getOrCreate()

In [3]:
# Import the data, inferring column types and reading the header row
data = spark.read.csv('customer_churn.csv', inferSchema=True, header=True)

In [4]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [5]:
data.show()

+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|              Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|   Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|
|      Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|
|        Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 06:20:07|1331 Keith Court ...|Miller, Johnson a...|    1|
|      Phillip White|42.0|       8010.76|              0| 6.71|     10.0|2014-04-22 12:43:12|13120 Daniel Moun...|           Smith Inc|    1|
|     

In [6]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [24]:
data.groupBy('Location').count().show()
data.groupBy('Company').count().show()

+--------------------+-----+
|            Location|count|
+--------------------+-----+
|062 Trevor Falls ...|    1|
|066 Jenkins Walks...|    1|
|45946 Day Springs...|    1|
|143 Andrea Flat L...|    1|
|Unit 2093 Box 153...|    1|
|399 Herbert Key P...|    1|
|104 Ruben Rapid A...|    1|
|930 Carrie Harbor...|    1|
|8202 Jade Unions ...|    1|
|USCGC Bailey FPO ...|    1|
|893 Carla Trace S...|    1|
|446 Rodney Ridge ...|    1|
|30668 Isabella Fr...|    1|
|911 Kent Point An...|    1|
|078 Nunez Haven S...|    1|
|PSC 5667, Box 831...|    1|
|4972 Michael Vill...|    1|
|567 Ian Loop Lamb...|    1|
|482 Wells Mountai...|    1|
|7259 Brown Street...|    1|
+--------------------+-----+
only showing top 20 rows

+--------------------+-----+
|             Company|count|
+--------------------+-----+
|Miller, Johnson a...|    1|
|Hunter, Reyes and...|    1|
|          Obrien PLC|    1|
|            Soto PLC|    2|
|            Todd LLC|    1|
|Smith, Marshall a...|    1|
|           Smith

We don't need the name or the onboard date. Moreover, Company and Location are strings with too many distinct categories to be useful as features, so we can exclude them as well.
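
One way to make the "too many categories" judgment concrete is to compare the number of distinct values in a column with the number of rows. The sketch below does this in plain Python on a small made-up list; the helper name `cardinality_ratio` and the sample values are illustrative assumptions, not part of the notebook or of PySpark.

```python
def cardinality_ratio(values):
    """Return distinct-value count divided by row count (1.0 means every row is unique)."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return len(set(values)) / len(values)

# Hypothetical sample: company names where almost every row is unique
sample_companies = ["Acme LLC", "Beta PLC", "Gamma Inc", "Acme LLC", "Delta Ltd"]
ratio = cardinality_ratio(sample_companies)
print(f"distinct/rows = {ratio:.2f}")
```

A ratio close to 1.0, as the `Location` counts above suggest, means a one-hot encoding would add nearly one column per row, which gives the model little to generalize from.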

In [25]:
my_cols = data.select(['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'])

In [26]:
# Drop rows with missing values
my_final_data = my_cols.na.drop()

## Format for MLlib

In [27]:
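from pyspark.ml.feature import VectorAssembler

# Combine the numeric feature columns into a single vector column
assembler = VectorAssembler(inputCols=['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'], outputCol='features')
# Transform the full DataFrame so the Churn label column is still available
output = assembler.transform(data)
# Spark resolves column names case-insensitively by default, so 'churn' matches 'Churn'
final_data = output.select('features', 'churn')
final_data.show()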

+--------------------+-----+
|            features|churn|
+--------------------+-----+
|[42.0,11066.8,0.0...|    1|
|[41.0,11916.22,0....|    1|
|[38.0,12884.75,0....|    1|
|[42.0,8010.76,0.0...|    1|
|[37.0,9191.58,0.0...|    1|
|[48.0,10356.02,0....|    1|
|[44.0,11331.58,1....|    1|
|[32.0,9885.12,1.0...|    1|
|[43.0,14062.6,1.0...|    1|
|[40.0,8066.94,1.0...|    1|
|[30.0,11575.37,1....|    1|
|[45.0,8771.02,1.0...|    1|
|[45.0,8988.67,1.0...|    1|
|[40.0,8283.32,1.0...|    1|
|[41.0,6569.87,1.0...|    1|
|[38.0,10494.82,1....|    1|
|[45.0,8213.41,1.0...|    1|
|[43.0,11226.88,0....|    1|
|[53.0,5515.09,0.0...|    1|
|[46.0,8046.4,1.0,...|    1|
+--------------------+-----+
only showing top 20 rows
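
Conceptually, the assembler concatenates the listed columns, in order, into one numeric vector per row. The plain-Python sketch below mimics that for the first row shown by `data.show()`; the `assemble` helper is a hypothetical stand-in for illustration, not how `VectorAssembler` is implemented.

```python
FEATURE_COLS = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites"]

def assemble(row, cols=FEATURE_COLS):
    """Concatenate the named fields of a row (a dict) into a list of floats."""
    return [float(row[c]) for c in cols]

# First row from the data.show() output above
first_row = {
    "Names": "Cameron Williams",
    "Age": 42.0,
    "Total_Purchase": 11066.8,
    "Account_Manager": 0,
    "Years": 7.22,
    "Num_Sites": 8.0,
    "Churn": 1,
}
features = assemble(first_row)
print(features)  # [42.0, 11066.8, 0.0, 7.22, 8.0]
```

The integer `Account_Manager` flag becomes `0.0`, which matches the start of the first `features` entry in the table above.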



In [28]:
# Split into training (80%) and test (20%) sets
train_churn, test_churn = final_data.randomSplit([0.8, 0.2])
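
`randomSplit` draws a different partition on every run unless its optional `seed` argument is supplied, so row counts and metrics can vary slightly between runs. The sketch below illustrates the idea of a seeded 80/20 split in plain Python; `seeded_split` is a hypothetical helper written for this illustration and is not Spark's implementation.

```python
import random

def seeded_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with a per-row draw from a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(900))  # stand-in row ids; the size is arbitrary
train_a, test_a = seeded_split(rows, seed=42)
train_b, test_b = seeded_split(rows, seed=42)
print(len(train_a), len(test_a))
```

Because each row is drawn independently, the split sizes are only approximately 80/20, but the same seed reproduces the same partition.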

## Fit the model

In [29]:
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression(labelCol='churn')
log_reg_model = log_reg.fit(train_churn)

In [30]:
training_sum = log_reg_model.summary
training_sum.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                727|                727|
|   mean|0.16781292984869325|0.12379642365887207|
| stddev| 0.3739573614828893| 0.3295759063688023|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+
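
The means in this summary are worth unpacking, because for a 0/1 column the mean is the fraction of ones. The arithmetic below uses only the numbers printed above to recover the underlying counts and the accuracy of a trivial "never churns" baseline on the training rows.

```python
n_rows = 727
mean_churn = 0.16781292984869325       # mean of the churn label column
mean_prediction = 0.12379642365887207  # mean of the prediction column

actual_churners = round(mean_churn * n_rows)
predicted_churners = round(mean_prediction * n_rows)

# Accuracy of always predicting "no churn" on the training rows
baseline_accuracy = 1 - actual_churners / n_rows

print(actual_churners, predicted_churners)  # 122 90
print(f"{baseline_accuracy:.3f}")           # 0.832
```

With roughly 17% churners in the training split, accuracy above 83% is achievable without learning anything, so per-class rates or a ranking metric say more about the model than plain accuracy.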



## Evaluation

In [31]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Score the held-out test set; .predictions holds the labels alongside the model outputs
pred_and_labels = log_reg_model.evaluate(test_churn)
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[25.0,9672.03,0.0...|    0|[4.31998558406955...|[0.98687449493140...|       0.0|
|[28.0,8670.98,0.0...|    0|[7.44701593839036...|[0.99941716063870...|       0.0|
|[28.0,11204.23,0....|    0|[1.49150669374831...|[0.81630431137394...|       0.0|
|[28.0,11245.38,0....|    0|[3.4753536229146,...|[0.96997831095851...|       0.0|
|[31.0,8829.83,1.0...|    0|[4.12150404563998...|[0.98403879207823...|       0.0|
|[31.0,10058.87,1....|    0|[4.19623517440571...|[0.98517106818392...|       0.0|
|[32.0,8575.71,0.0...|    0|[3.53402197561094...|[0.97164044947947...|       0.0|
|[32.0,10716.75,0....|    0|[4.18774116240494...|[0.98504646612671...|       0.0|
|[32.0,11540.86,0....|    0|[6.48546864251690...|[0.99847687772994...|       0.0|
|[32.0,12403.6,0
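
For binary logistic regression the `rawPrediction` and `probability` columns are related by the logistic (sigmoid) function: the first entry of `probability` is the sigmoid of the first entry of `rawPrediction`. The check below uses the first and third rows printed above; since the display truncates the values, the comparison is only meaningful to a few decimal places.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# (first rawPrediction entry, first probability entry) as displayed above
displayed = [
    (4.31998558406955, 0.98687449493140),
    (1.49150669374831, 0.81630431137394),
]
for raw, prob in displayed:
    print(f"sigmoid({raw}) = {sigmoid(raw):.6f}, displayed {prob:.6f}")
```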

In [32]:
# 'prediction' holds hard 0/1 labels; use 'rawPrediction' to score the continuous output
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                           labelCol='churn')
auc = churn_eval.evaluate(pred_and_labels.predictions)
auc

0.751231527093596
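
Because the evaluator above was given the hard 0/1 `prediction` column rather than a continuous score, the resulting ROC curve has a single interior point, and the area under it reduces to the average of the true positive rate and the true negative rate. The sketch below demonstrates this on a small made-up example with a plain-Python pairwise AUC; the data and the `pairwise_auc` helper are illustrative assumptions, not the notebook's results.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, purely for illustration
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

tpr = sum(h for y, h in zip(labels, hard) if y == 1) / labels.count(1)
tnr = sum(1 - h for y, h in zip(labels, hard) if y == 0) / labels.count(0)

auc_from_scores = pairwise_auc(labels, scores)
auc_from_hard = pairwise_auc(labels, hard)
print(auc_from_scores, auc_from_hard, (tpr + tnr) / 2)
```

Under that reading, the 0.75 reported above is closer to a balanced accuracy at the default threshold than to a threshold-free ranking score.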

## Predict on brand new unlabeled data

In [37]:
# Refit on all labeled data before scoring the new, unlabeled customers
final_log_reg = log_reg.fit(final_data)
new_customers = spark.read.csv('new_customers.csv', inferSchema=True,
                               header=True)
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [38]:
# Add the features column
test_new_customers = assembler.transform(new_customers)
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [39]:
final_results = final_log_reg.transform(test_new_customers)
final_results.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+
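
To act on these predictions, the companies flagged with `1.0` are the ones to consider for an account manager. A minimal sketch, transcribing the table above into plain Python:

```python
# (Company, prediction) pairs copied from the output above
final_predictions = [
    ("King Ltd", 0.0),
    ("Cannon-Benson", 1.0),
    ("Barron-Robertson", 1.0),
    ("Sexton-Golden", 1.0),
    ("Wood LLC", 0.0),
    ("Parks-Robbins", 1.0),
]

# Keep only the companies predicted to churn
at_risk = [company for company, prediction in final_predictions if prediction == 1.0]
print(at_risk)
```

In the notebook the equivalent step would be a filter on the `prediction` column of `final_results`.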

