# Customer Churn Prediction


## Problem Statement

A marketing agency has a number of customers who use its service to produce ads for their websites. The agency has noticed considerable churn among these clients. It therefore needs a machine learning model that predicts which customers are likely to churn, that is, stop buying the service in the future, so that it can take corrective measures and improve business efficiency.

### -> The training data is in 'customer_churn.csv'. The fields and their definitions in the dataset are listed below.

---

    Names: Name of the latest contact at the company
    Age: Customer age
    Total_Purchase: Total ads purchased
    Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
    Years: Total years as a customer
    Num_Sites: Number of websites that use the service
    Onboard_date: Date on which the latest contact was onboarded
    Location: Client HQ address
    Company: Name of the client company
    Churn: Binary label, 0 = customer retained, 1 = customer churned


### i) Installing Spark

In [None]:
!apt update > /dev/null
!apt install openjdk-8-jdk-headless -qq > /dev/null







In [None]:
# Get latest and correct version of Spark
# if the current version of Spark is not used, there may be errors
# check here for current versions http://apache.osuosl.org/spark
#!wget -q http://apache.osuosl.org/spark/spark-2.2.2/spark-2.2.2-bin-hadoop2.7.tgz
#!wget -q http://apache.osuosl.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
#!wget -q http://apache.osuosl.org/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
#!wget -q http://apache.osuosl.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
#!wget -q http://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
#!wget -q http://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
!wget -q http://apache.osuosl.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz

#!tar xf spark-2.4.5-bin-hadoop2.7.tgz
#!tar xf spark-3.0.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
#!pip install -q findspark
!pip install -q pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"
#os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
#os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
#os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"



In [None]:
from pyspark.sql import SparkSession
#spark = SparkSession.builder.master("local[*]").getOrCreate()
# note UI port switched from default 4040 to 4050 to avoid clash with ngrok
spark = SparkSession.builder.master("local[*]").config('spark.ui.port', '4050').getOrCreate()

#### ii) Loading training data

In [None]:
data= spark.read.csv('/content/customer_churn.csv',inferSchema=True,header=True)

#### iii) The inferred schema can be visualized as below:

In [None]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



#### iv) Computing summary statistics on the DataFrame:

In [None]:
data.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|       Onboard_date|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|               null|                null|                null|0.16666666666666666|
| stddev| 
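Because `Churn` is a 0/1 column, its mean in the summary is also the churn rate, and it is worth converting that into counts before modelling. The short calculation below uses only the count and mean printed above; the variable names are illustrative.

```python
# Values copied from the describe() output above.
n_rows = 900
churn_mean = 0.16666666666666666  # mean of the 0/1 Churn column

# For a 0/1 column the mean is the fraction of ones.
n_churned = round(n_rows * churn_mean)
n_retained = n_rows - n_churned

# Accuracy of a trivial model that always predicts "no churn".
baseline_accuracy = n_retained / n_rows

print(n_churned, n_retained, round(baseline_accuracy, 4))
```

So 150 of the 900 customers churned, and a model that never predicts churn would already be right about 83% of the time. This is why a ranking metric such as AUC is more informative than plain accuracy for this dataset.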

#### v) Getting all column names as a list:

In [None]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

#### vi) Importing Libraries for VectorAssembler

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

#### vii) Setting up the VectorAssembler:

In [None]:
assembler = VectorAssembler(
    inputCols=['Age',
               'Total_Purchase',
               'Account_Manager',
               'Years',
               'Num_Sites'],
    outputCol='features')

#### viii) Utilizing Assembler to Transform the Data 

In [None]:
output = assembler.transform(data)

#### ix) Selecting the Features Vector and the Binary Churn Label:

In [None]:
final_data = output.select('features','churn')

#### x) Show Data:

In [None]:
final_data.show()

+--------------------+-----+
|            features|churn|
+--------------------+-----+
|[42.0,11066.8,0.0...|    1|
|[41.0,11916.22,0....|    1|
|[38.0,12884.75,0....|    1|
|[42.0,8010.76,0.0...|    1|
|[37.0,9191.58,0.0...|    1|
|[48.0,10356.02,0....|    1|
|[44.0,11331.58,1....|    1|
|[32.0,9885.12,1.0...|    1|
|[43.0,14062.6,1.0...|    1|
|[40.0,8066.94,1.0...|    1|
|[30.0,11575.37,1....|    1|
|[45.0,8771.02,1.0...|    1|
|[45.0,8988.67,1.0...|    1|
|[40.0,8283.32,1.0...|    1|
|[41.0,6569.87,1.0...|    1|
|[38.0,10494.82,1....|    1|
|[45.0,8213.41,1.0...|    1|
|[43.0,11226.88,0....|    1|
|[53.0,5515.09,0.0...|    1|
|[46.0,8046.4,1.0,...|    1|
+--------------------+-----+
only showing top 20 rows



#### xi) Creating the Train/Test Split (70:30):

In [None]:
train_data,test_data = final_data.randomSplit([0.7,0.3], seed=42)
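
`randomSplit` treats the weights as sampling probabilities rather than exact quotas, so the split is only approximately 70:30 (the training summary further down reports 667 rows, not 630). The plain-Python sketch below illustrates the idea by assigning each row with an independent uniform draw. It is an illustration only: `bernoulli_split` is a hypothetical helper, and it does not reproduce Spark's sampler or the rows Spark selects for `seed=42`.

```python
import random

def bernoulli_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(900))  # stand-in for the 900 customer rows
train, test = bernoulli_split(rows)

# Sizes are close to, but generally not exactly, 630 / 270.
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which matters when metrics from different runs are compared.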

#### xii) Importing Library for Logistic Regression:

In [None]:
from pyspark.ml.classification import LogisticRegression

#### xii) Instantiating the Model

In [None]:
lr = LogisticRegression(labelCol='churn')

#### xiii) Fitting the model as 'lrModel':

In [None]:
lrModel = lr.fit(train_data)

#### xiv) Getting the training summary of the model

In [None]:
trainingSummary = lrModel.summary

In [None]:
trainingSummary.featuresCol

'features'

#### xv) Summary Statistics of the Training Predictions:

In [None]:
trainingSummary.predictions.describe().show()



+-------+------------------+-------------------+
|summary|             churn|         prediction|
+-------+------------------+-------------------+
|  count|               667|                667|
|   mean|0.1634182908545727|0.12293853073463268|
| stddev|0.3700243606477147|0.32861306618408714|
|    min|               0.0|                0.0|
|    max|               1.0|                1.0|
+-------+------------------+-------------------+



#### xvi) Prediction Summary:

In [None]:
trainingSummary.predictions.show()



+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,1....|  0.0|[4.55979933582600...|[0.98964420615385...|       0.0|
|[25.0,9672.03,0.0...|  0.0|[4.67536684163721...|[0.99076399423917...|       0.0|
|[26.0,8939.61,0.0...|  0.0|[6.28230375013810...|[0.99813439726403...|       0.0|
|[27.0,8628.8,1.0,...|  0.0|[5.32554193456119...|[0.99515784712679...|       0.0|
|[28.0,8670.98,0.0...|  0.0|[7.59026142971801...|[0.99949490632507...|       0.0|
|[28.0,11128.95,1....|  0.0|[4.09748998252342...|[0.98365719925299...|       0.0|
|[29.0,5900.78,1.0...|  0.0|[4.06733654772172...|[0.98316532508264...|       0.0|
|[29.0,8688.17,1.0...|  1.0|[2.71962043931940...|[0.93817452170582...|       0.0|
|[29.0,9378.24,0.0...|  0.0|[4.73007501034927...|[0.99125140444539...|       0.0|
|[29.0,12711.15,

#### xvii) Importing Library to Evaluate the Model:

In [None]:
from pyspark.mllib.evaluation import MulticlassMetrics

#### xviii) Evaluating the Model on the Test Data:

In [None]:
predictionAndLabels = lrModel.evaluate(test_data)

#### xix) Prediction Data:

In [None]:
predictionAndLabels.predictions.show()



+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[26.0,8787.39,1.0...|    1|[0.45556621649245...|[0.61196183386580...|       0.0|
|[28.0,8670.98,0.0...|    0|[7.42728848152859...|[0.99940555544108...|       0.0|
|[28.0,11204.23,0....|    0|[1.66600109078847...|[0.84104193421074...|       0.0|
|[29.0,8688.17,1.0...|    1|[2.48412942065942...|[0.92302171705931...|       0.0|
|[29.0,9378.24,0.0...|    0|[4.45490353219201...|[0.98851206543727...|       0.0|
|[29.0,9617.59,0.0...|    0|[4.12434942126402...|[0.98408342132510...|       0.0|
|[29.0,12711.15,0....|    0|[5.08391036969074...|[0.99384251516006...|       0.0|
|[29.0,13255.05,1....|    0|[4.02401999680907...|[0.98243317242450...|       0.0|
|[30.0,8403.78,1.0...|    0|[5.70126594447149...|[0.99666939765053...|       0.0|
|[30.0,12788.37,

#### -> Some predictions on the test data differ from the actual churn labels

In [None]:
predictionAndLabels = predictionAndLabels.predictions.select('churn','prediction')



In [None]:
predictionAndLabels.show()

+-----+----------+
|churn|prediction|
+-----+----------+
|    1|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
+-----+----------+
only showing top 20 rows



#### xx) Importing the Evaluator Classes

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

#### xxi) Evaluator for Binary Classification:

In [None]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='churn')

#### xxii) Calculating AUC (Area Under the Curve):

In [None]:
AUC = evaluator.evaluate(predictionAndLabels)

In [None]:
AUC

0.7696666666666667

The Area Under the Curve (AUC) summarizes the ROC curve and measures how well a classifier separates the positive and negative classes. It ranges from 0 to 1: 0.5 corresponds to random guessing, and values closer to 1 indicate better separation. Here the AUC is about 0.77, which is decent. Note that the evaluator above was given the thresholded `prediction` column as `rawPredictionCol`, so this value reflects the single default decision threshold rather than the ranking quality of the predicted probabilities.
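
Because the evaluator above is configured with `rawPredictionCol='prediction'`, it scores hard 0/1 labels. With only two distinct score values the ROC curve has a single interior point, and the area reduces to the average of the true-positive and true-negative rates. The sketch below illustrates this in plain Python with made-up labels and scores (not the notebook's data). `roc_auc` is a hypothetical helper, not a Spark API; it computes AUC as the fraction of positive/negative pairs that are ordered correctly.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a correct ordering.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up example: churn label and predicted churn probability.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.3, 0.6, 0.2, 0.1, 0.05, 0.02]

# Thresholding at 0.5 discards the ranking information.
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]

auc_from_probs = roc_auc(labels, probs)
auc_from_hard = roc_auc(labels, hard)

# With hard predictions, AUC equals (TPR + TNR) / 2.
tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1.0) / labels.count(1)
tnr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 0.0) / labels.count(0)

print(auc_from_probs, auc_from_hard, (tpr + tnr) / 2)
```

In this toy example the probability-based AUC is 13/15, while the thresholded version drops to 8.5/15, which equals (TPR + TNR) / 2. In the notebook, a threshold-independent AUC could be obtained by evaluating the DataFrame returned by `lrModel.evaluate(test_data).predictions` with `rawPredictionCol='rawPrediction'` (the evaluator's default); that value is not reported here because it has not been computed.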

## Predicting on New Customer Dataset - Loading Data

#### xxiii) Fitting the model on the full dataset:

In [None]:
final_lrModel = lr.fit(final_data)

In [None]:
new_customers = spark.read.csv('/content/new_customers.csv',inferSchema=True,header=True)

#### xxiv) The inferred schema can be visualized as below:

In [None]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



#### xxv) Computing summary statistics on the DataFrame:

In [None]:
new_customers.describe().show()

+-------+-------------+------------------+-----------------+------------------+-----------------+------------------+----------------+--------------------+----------------+
|summary|        Names|               Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|    Onboard_date|            Location|         Company|
+-------+-------------+------------------+-----------------+------------------+-----------------+------------------+----------------+--------------------+----------------+
|  count|            6|                 6|                6|                 6|                6|                 6|               6|                   6|               6|
|   mean|         null|35.166666666666664|7607.156666666667|0.8333333333333334|6.808333333333334|12.333333333333334|            null|                null|            null|
| stddev|         null| 15.71517313511584|4346.008232825459| 0.408248290463863|3.708737880555414|3.3862466931200785|            null|       

#### xxvi) Utilizing Assembler to Transform the Data:

In [None]:
test_new_customers = assembler.transform(new_customers)

#### xxvii) The schema after the transformation is shown below:

In [None]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



--> This new customer dataset is real-time data, so it does not contain a Churn column.
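
Before scoring, it can help to confirm that the new data differs from the training data only by the missing label. The sketch below uses the column lists copied from the two `printSchema()` outputs in this notebook. `missing_columns` is a hypothetical helper; it compares names case-insensitively, matching the way the notebook selects `'churn'` and `'company'` in lowercase.

```python
# Column lists copied from the two printSchema() outputs in this notebook.
train_columns = ['Names', 'Age', 'Total_Purchase', 'Account_Manager', 'Years',
                 'Num_Sites', 'Onboard_date', 'Location', 'Company', 'Churn']
new_columns = ['Names', 'Age', 'Total_Purchase', 'Account_Manager', 'Years',
               'Num_Sites', 'Onboard_date', 'Location', 'Company']
feature_columns = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']

def missing_columns(required, available):
    """Return the required column names that are absent, ignoring case."""
    available_lower = {name.lower() for name in available}
    return [name for name in required if name.lower() not in available_lower]

# Every assembler input must exist in the new data; the label need not.
print(missing_columns(feature_columns, new_columns))  # []
print(missing_columns(train_columns, new_columns))    # ['Churn']
```

The two schemas also differ in type: `Age` and `Num_Sites` are inferred as `double` in the training file and as `integer` in the new file. The assembler accepts both here, but an explicit schema would make the two reads consistent.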

#### xxviii) Running Final Tests

In [None]:
test_new_customers.count()

6

In [None]:
final_test = test_new_customers.select('features','company')

In [None]:
results = final_lrModel.transform(final_test)

#### xxix) Prediction Outcome

In [None]:
results.show()

+--------------------+----------------+--------------------+--------------------+----------+
|            features|         company|       rawPrediction|         probability|prediction|
+--------------------+----------------+--------------------+--------------------+----------+
|[37.0,9935.53,1.0...|        King Ltd|[2.22168705251434...|[0.90218018099704...|       0.0|
|[23.0,7526.94,1.0...|   Cannon-Benson|[-6.2207530595013...|[0.00198380445829...|       1.0|
|[65.0,100.0,1.0,1...|Barron-Robertson|[-3.7691621189411...|[0.02255110110411...|       1.0|
|[32.0,6487.5,0.0,...|   Texton-Golden|[-5.0956222016513...|[0.00608622642085...|       1.0|
|[32.0,13147.71,1....|        Wood LLC|[1.10475867224171...|[0.75115067517478...|       0.0|
|[22.0,8445.26,1.0...|   Parks-Robbins|[-1.6896019277060...|[0.15582819767641...|       1.0|
+--------------------+----------------+--------------------+--------------------+----------+



As per the results, four of the six new customers (Cannon-Benson, Barron-Robertson, Texton-Golden and Parks-Robbins) are projected to churn. The marketing agency should take corrective measures for these customers, for example by assigning account managers to them.
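
The `prediction` column in the results table follows from the other two. For binary logistic regression, the first entry of `probability` is the logistic function of the first entry of `rawPrediction`, that is, the estimated probability of class 0 (no churn). With the default threshold of 0.5, a customer is flagged as churning when the complementary probability exceeds one half. The check below is a minimal sketch in plain Python, using the values displayed for two rows of the table (the digits are truncated by `show()`); `sigmoid` and `predicted_label` are illustrative helpers, not Spark functions.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predicted_label(raw_no_churn, threshold=0.5):
    """Return 1.0 (churn) when the churn probability exceeds the threshold."""
    p_churn = 1.0 - sigmoid(raw_no_churn)
    return 1.0 if p_churn > threshold else 0.0

# First rawPrediction entry as displayed in the results table.
rows = {
    'King Ltd': 2.22168705251434,
    'Cannon-Benson': -6.2207530595013,
}
for company, raw in rows.items():
    print(company, round(sigmoid(raw), 6), predicted_label(raw))
```

The computed probabilities agree with the `probability` column to the displayed precision. Lowering or raising `threshold` is one way to trade missed churners against false alarms if the agency's cost of intervention changes.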