# Customer Churn Prediction - Logistic Regression
### (Using Logistic Regression to see whether the model performance is better than the Optimized Decision Tree, Abhi-Churn-DT-Optimized-1.0.pynb)

Retaining existing customers is an important aspect of running a successful business. Acquiring a new customer is costly, so it is imperative for any business to retain its existing customers. Loyal customers are more profitable and bring in steady streams of revenue.

## Research Goal

If customers with a higher probability of attrition can be identified, various strategies to retain them can be implemented. In this analysis, we use a machine learning algorithm (Logistic Regression) to build a model that predicts customer churn.

In this study we use Orange Telecom’s dataset. It is a cleansed dataset, with customer activity information as features and a label specifying whether the customer cancelled the subscription (churn).

https://github.com/caroljmcdonald/mapr-sparkml-churn/tree/master/data

There are two data sets:

churn80 (training)

churn20 (test)

Reference – https://dzone.com/articles/churn-prediction-with-apache-spark-machine-learnin

### SCHEMA

State: string

Account length: integer

Area code: integer

International plan: string

Voice mail plan: string

Number vmail messages: integer

Total day minutes: double

Total day calls: integer

Total day charge: double

Total eve minutes: double

Total eve calls: integer

Total eve charge: double

Total night minutes: double

Total night calls: integer

Total night charge: double

Total international minutes: double

Total international calls: integer

Total international charge: double

Customer service calls: integer

Churn: string


In the step below, the data is loaded from the file location into DataFrames.

In [1]:
from pyspark.sql.types import *

schema = StructType([
    StructField("state", StringType(), True),
    StructField("len", IntegerType(), True),
    StructField("acode", StringType(), True),
    StructField("intlplan", StringType(), True),
    StructField("vplan", StringType(), True),
    StructField("numvmail", DoubleType(), True),
    StructField("tdmins", DoubleType(), True),
    StructField("tdcalls", DoubleType(), True),
    StructField("tdcharge", DoubleType(), True),
    StructField("temins", DoubleType(), True),
    StructField("tecalls", DoubleType(), True),
    StructField("techarge", DoubleType(), True),
    StructField("tnmins", DoubleType(), True),
    StructField("tncalls", DoubleType(), True),
    StructField("tncharge", DoubleType(), True),
    StructField("timins", DoubleType(), True),
    StructField("ticalls", DoubleType(), True),
    StructField("ticharge", DoubleType(), True),
    StructField("numcs", DoubleType(), True),
    StructField("churn", StringType(), True)
                    ])

churn20_df = spark.read.csv("/home/training/bdpgstudent/data/churn/churn-bigml-20.csv", schema=schema, header = True)

churn80_df = spark.read.csv("/home/training/bdpgstudent/data/churn/churn-bigml-80.csv",  schema=schema, header = True)
churn80_df.count()

#churn80_df.show()

2666

In [2]:
churn20_df.count()


667

### Data Exploration
In the steps below, we explore the data. Aggregates of call minutes and charges are computed and printed.

Also, to find the correlation between fields, a correlation matrix is printed in tabular form. The intention is to identify closely correlated fields and remove the redundant ones from the model.
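As a reference for reading the correlation matrix, here is a minimal pure-Python sketch of the Pearson coefficient, which is the measure `DataFrame.corr` computes by default. The sample values and the 0.17 multiplier are hypothetical and chosen only for illustration. A column that is a fixed multiple of another yields a coefficient of 1 (up to floating-point error).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not taken from the dataset.
minutes = [120.0, 265.1, 161.6, 243.4, 299.4]
charges = [0.17 * m for m in minutes]   # assumed fixed rate per minute
unrelated = [3.0, 1.0, 4.0, 1.0, 5.0]   # arbitrary values with no linear link

print(pearson(minutes, charges))    # very close to 1
print(pearson(minutes, unrelated))  # strictly between -1 and 1
```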


In [3]:
#kdd10_df.show()


import pyspark.sql.functions as fn
test_df = churn80_df.agg( fn.sum('tdmins').alias('tdmintotal'), fn.sum('tdcharge').alias('tdchargetotal'),
                         fn.sum('temins').alias('temintotal'),fn.sum('techarge').alias('techargetotal')
                        )
test_df.show()

numerical = ['tdmins', 'tdcharge', 'temins', 'techarge', 'tnmins', 'tncharge', 'timins', 'ticharge']

n_numerical = len(numerical)
corr = []

for i in range(0, n_numerical):
    temp = [None] * i
    for j in range(i, n_numerical):
        temp.append(churn80_df.corr(numerical[i], numerical[j]))
    corr.append(temp)

# print the correlation matrix in a nicely formatted table
from tabulate import tabulate
print(tabulate(corr, headers=numerical, showindex=numerical, tablefmt="fancy_grid", numalign="center"))



+------------------+-----------------+-----------------+-----------------+
|        tdmintotal|    tdchargetotal|       temintotal|    techargetotal|
+------------------+-----------------+-----------------+-----------------+
|478498.00000000023|81346.07000000011|534229.5000000003|45410.17000000004|
+------------------+-----------------+-----------------+-----------------+

╒══════════╤══════════╤════════════╤════════════╤════════════╤════════════╤════════════╤═════════════╤═════════════╕
│          │  tdmins  │  tdcharge  │   temins   │  techarge  │   tnmins   │  tncharge  │   timins    │  ticharge   │
╞══════════╪══════════╪════════════╪════════════╪════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ tdmins   │    1     │     1      │ 0.00399863 │ 0.00399248 │  0.013491  │ 0.0134638  │  -0.011042  │  -0.010934  │
├──────────┼──────────┼────────────┼────────────┼────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ tdcharge │          │     1      │ 0.

The output above shows that call minutes and call charges are perfectly correlated (for example, tdmins and tdcharge have a correlation of 1), so one field of each pair can be removed from the model. In the step below, we remove the call charge features and a few other features (state, area code, voice mail plan).
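One way to see why minutes and charges track each other so closely is to divide the aggregate totals printed above. The sketch below uses only those totals, rounded to two decimals. That each charge is a flat per-minute rate is an inference from the numbers, not something the dataset documentation states.

```python
# Totals copied from the aggregate output above (rounded to two decimals).
td_min_total, td_charge_total = 478498.00, 81346.07
te_min_total, te_charge_total = 534229.50, 45410.17

# Implied average charge per minute for day and evening calls.
day_rate = td_charge_total / td_min_total
eve_rate = te_charge_total / te_min_total

print(round(day_rate, 3), round(eve_rate, 3))
```

If the charge is the minutes multiplied by a constant, the charge columns carry no information beyond the minutes columns, which supports dropping them.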

In the non-optimized Decision Tree model, we removed customer service calls, number of voice mail messages, and account length. Here we keep those three features, so that the predictions of the Logistic Regression model can be compared with those of the Decision Tree model.

In [4]:
#remove correlated columns and area code state code

churn80_df_d = churn80_df.drop("state").drop("acode").drop("vplan")\
                            .drop("tdcharge").drop("techarge")\
                            .drop("tncharge").drop("ticharge")
churn20_df_d = churn20_df.drop("state").drop("acode").drop("vplan")\
                        .drop("tdcharge").drop("techarge")\
                        .drop("tncharge").drop("ticharge")

churn80_df_d.printSchema()
churn80_df_d.groupBy("churn").count().show()

root
 |-- len: integer (nullable = true)
 |-- intlplan: string (nullable = true)
 |-- numvmail: double (nullable = true)
 |-- tdmins: double (nullable = true)
 |-- tdcalls: double (nullable = true)
 |-- temins: double (nullable = true)
 |-- tecalls: double (nullable = true)
 |-- tnmins: double (nullable = true)
 |-- tncalls: double (nullable = true)
 |-- timins: double (nullable = true)
 |-- ticalls: double (nullable = true)
 |-- numcs: double (nullable = true)
 |-- churn: string (nullable = true)

+-----+-----+
|churn|count|
+-----+-----+
|False| 2278|
| True|  388|
+-----+-----+



### Stratified Sampling
The counts above show roughly six times as many Churn=False samples as Churn=True samples (2278 vs. 388). We want the model to be sensitive to Churn=True, so that it identifies the customers likely to leave. We use stratified sampling, as shown below, to keep all the Churn=True samples and randomly select a subset of the Churn=False samples.

In [5]:

churn80_df_d_s = churn80_df_d.sampleBy("churn", fractions={"False": 0.17, "True": 1}, seed=0)
churn80_df_d_s.groupBy("churn").count().orderBy("churn").show()

+-----+-----+
|churn|count|
+-----+-----+
|False|  377|
| True|  388|
+-----+-----+
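The notebook does not say how the 0.17 fraction was chosen. One plausible reading, sketched here as an assumption, is that it is the ratio of the two class counts, which brings the majority class down to roughly the size of the minority class.

```python
# Class counts from the groupBy("churn") output of the training set.
n_false, n_true = 2278, 388

# Fraction of the majority class that would match the minority class size.
false_fraction = n_true / n_false
print(round(false_fraction, 2))

# sampleBy decides row by row, so the sampled size is only approximately
# fraction * count; this is the expected value, not a guaranteed count.
expected_false = 0.17 * n_false
print(round(expected_false))
```

Because the sampling is random per row, the 377 Churn=False rows actually drawn differ slightly from the expected value.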



### Creating Transformers
In the step below, we create transformers to transform the data.

The model accepts only numeric values, so we use a StringIndexer to convert string fields to numeric indices. We index the churn field, which indicates whether the customer is likely to leave, and the international plan field, which is also a string field.

Then we use a VectorAssembler to create a features vector from the input columns required by the model.



In [6]:
import pyspark.ml.feature as ft

label_indexer = ft.StringIndexer(inputCol="churn", outputCol="label")
ip_indexer = ft.StringIndexer(inputCol="intlplan", outputCol="iplanIndex")


input_cols = ["len", "numvmail","numcs","iplanIndex", "tdmins",
     "tdcalls", "temins", "tecalls", "tnmins", "tncalls", "timins",
     "ticalls"]

featuresCreator = ft.VectorAssembler(inputCols=input_cols, outputCol="features")
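To make the role of the features vector concrete, the following pure-Python sketch mimics the assembly step on a single row. It illustrates the idea only and is not Spark's implementation. The row values and the `iplan_index` mapping are hypothetical.

```python
input_cols = ["len", "numvmail", "numcs", "iplanIndex", "tdmins",
              "tdcalls", "temins", "tecalls", "tnmins", "tncalls",
              "timins", "ticalls"]

def assemble(row, cols):
    """Concatenate the named numeric fields of a row, in order, into one list."""
    return [float(row[c]) for c in cols]

# Hypothetical index mapping and hypothetical row, for illustration only.
iplan_index = {"No": 0.0, "Yes": 1.0}
row = {"len": 100, "numvmail": 0, "numcs": 2, "intlplan": "No",
       "tdmins": 180.5, "tdcalls": 95, "temins": 210.0, "tecalls": 88,
       "tnmins": 190.2, "tncalls": 101, "timins": 9.5, "ticalls": 4}

# Indexing step: replace the string field with its numeric index.
row["iplanIndex"] = iplan_index[row["intlplan"]]

# Assembly step: one flat numeric vector in the order given by input_cols.
features = assemble(row, input_cols)
print(features)
```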

### Create an Estimator

We create a classifier with a label column. The classifier uses this label column as the target when learning to classify the dataset.
Here we are using a Logistic Regression classifier.



In [7]:

#Creating an estimator
import pyspark.ml.classification as cl
lr = cl.LogisticRegression(regParam=0.01, maxIter=50, elasticNetParam=0.01, labelCol="label")
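The `regParam` and `elasticNetParam` arguments control the regularization penalty added to the loss. The sketch below computes the elastic-net penalty in the form given in the Spark MLlib documentation, `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The weight vector is hypothetical. With `elasticNetParam=0.01` the penalty is almost entirely L2 (ridge).

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [0.5, -1.0, 2.0]  # hypothetical coefficients, for illustration only

# Same settings as the estimator above.
penalty = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=0.01)

# Pure L2 (ridge) penalty for comparison.
ridge_only = elastic_net_penalty(weights, reg_param=0.01, elastic_net_param=0.0)

print(penalty, ridge_only)
```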


### Creating a pipeline
Now, create a Pipeline to pull the different transformations together:

In [8]:
#Creating Pipeline
from pyspark.ml import Pipeline
pipeline = Pipeline( stages = [ label_indexer, ip_indexer, featuresCreator, lr] )

### Fitting the Model
Now we run the pipeline to fit the model on the stratified training sample.


In [11]:
#train the model
model = pipeline.fit(churn80_df_d_s)

#test the model with 20% test data
predictions = model.transform(churn20_df_d)

#printing the indexed label value of Churn Feature
label_mapping = predictions.select("churn","label").distinct()
label_mapping.show()


+-----+-----+
|churn|label|
+-----+-----+
|False|  1.0|
| True|  0.0|
+-----+-----+
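The mapping above, with Churn=True indexed as 0.0, follows from how StringIndexer assigns indices by default: the most frequent value gets index 0. The pure-Python sketch below mimics that ordering using the class counts of the stratified sample. The alphabetical tie-break is an assumption of this sketch and does not matter here, since the counts differ.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; ties broken alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Class counts from the stratified training sample above.
sample_labels = ["True"] * 388 + ["False"] * 377
mapping = frequency_index(sample_labels)
print(mapping)
```

So in this model, label 0.0 means the customer churned, which matters when reading per-class metrics.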



### Evaluating the Model
Now we use a MulticlassClassificationEvaluator to evaluate the accuracy of the model's predictions.

In [13]:
import pyspark.ml.evaluation as ev

evaluator = ev.MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", 
        metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(accuracy)



0.7466266866566716


The accuracy metric shows about 74.7% for the Logistic Regression model. For the Decision Tree, we got about 86.05% accuracy on the same dataset and the same features (Abhi-Churn-DT-Optimized-1.0.pynb). On accuracy, the Decision Tree model therefore performs better for this use case, with the same data set and feature set.
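Accuracy counts every correct prediction equally, so on a test set where most customers do not churn it can hide how well the churners themselves are identified. As a complement, the sketch below computes accuracy together with recall for the churn class. The labels and predictions are hypothetical, and 0.0 stands for Churn=True as in the label mapping above.

```python
def accuracy(labels, predictions):
    """Share of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def recall(labels, predictions, positive):
    """Share of actual positives that were predicted as positive."""
    actual = [(y, p) for y, p in zip(labels, predictions) if y == positive]
    return sum(1 for _, p in actual if p == positive) / len(actual)

# Hypothetical data: 0.0 = churned, 1.0 = did not churn.
labels      = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
predictions = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]

acc = accuracy(labels, predictions)
churn_recall = recall(labels, predictions, positive=0.0)
print(acc, churn_recall)
```

In this toy example the accuracy is higher than the churn recall, which is the kind of gap that accuracy alone would not reveal.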
