# Support Vector Machines: Churn Analysis

Let's look at a classification example in Spark MLlib.  We are going to use telecom customer data to predict whether or not a customer "churned".


In [1]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
spark

Initializing Spark...
Spark found in :  /home/ubuntu/spark
Spark config:
	 spark.app.name=TestApp
	spark.master=local[*]
	executor.memory=2g
	spark.sql.warehouse.dir=/tmp/tmp_mao9uz9
	some_property=some_value
Spark UI running on port 4041


## Step 1: Load the data

In [2]:
%%time
dataset = spark.read.csv("/data/churn/telco.csv.gz", header=True, inferSchema=True)

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.9 s


In [3]:
print("read {:,} records".format(dataset.count()))

dataset.printSchema()

read 7,043 records
root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: integer (nullable = true)
 |-- Dependents: integer (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: integer (nullable = true)
 |-- MultipleLines: integer (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: integer (nullable = true)
 |-- OnlineBackup: integer (nullable = true)
 |-- DeviceProtection: integer (nullable = true)
 |-- TechSupport: integer (nullable = true)
 |-- StreamingTV: integer (nullable = true)
 |-- StreamingMovies: integer (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: integer (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)



In [4]:
## Dataframe show output is not easy to read
# dataset.show()

## pretty print with pandas
## horizontally
dataset.limit(5).toPandas()

## vertically
# dataset.limit(5).toPandas().T

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,1,0,1,0,0,DSL,0,...,0,0,0,0,Month-to-month,1,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,0,0,34,1,0,DSL,1,...,1,0,0,0,One year,0,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,0,0,2,1,0,DSL,1,...,0,0,0,0,Month-to-month,1,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,0,0,45,0,0,DSL,1,...,1,1,0,0,One year,0,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,0,0,2,1,0,Fiber optic,0,...,0,0,0,0,Month-to-month,1,Electronic check,70.7,151.65,Yes


## Step 2 : Basic Analytics of Data

In [5]:
## describe

## following output is hard to read
# dataset.describe().show() 

## use pandas for pretty print
## TODO : convert to pandas ('toPandas')
dataset.describe().toPandas().T

Unnamed: 0,0,1,2,3,4
summary,count,mean,stddev,min,max
customerID,7043,,,0002-ORFBO,9995-HOTOH
gender,7043,,,Female,Male
SeniorCitizen,7043,0.1621468124378816,0.3686116056100135,0,1
Partner,7043,0.4830327985233565,0.49974751071998735,0,1
Dependents,7043,0.2995882436461735,0.4581101675100144,0,1
tenure,7043,32.37114865824223,24.559481023094442,0,72
PhoneService,7043,0.9031662643759761,0.29575223178363513,0,1
MultipleLines,7043,0.42183728524776376,0.49388786554556857,0,1
InternetService,7043,0.0,0.0,0,Fiber optic


In [6]:
## TODO : Distribution by 'Churn'
dataset.groupBy('Churn').count().show()

+-----+-----+
|Churn|count|
+-----+-----+
|   No| 5174|
|  Yes| 1869|
+-----+-----+
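
The classes are imbalanced, so it helps to know how well a trivial model would score before judging any accuracy number later on. A minimal sketch in plain Python, using only the counts printed above (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the groupBy('Churn') output above.
churn_counts = {"No": 5174, "Yes": 1869}

total = sum(churn_counts.values())
majority_label = max(churn_counts, key=churn_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = churn_counts[majority_label] / total

print("majority class    :", majority_label)
print("baseline accuracy : {:.4f}".format(baseline_accuracy))
```

Always predicting 'No' would already be right roughly 73% of the time, so a useful model has to beat that figure.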



In [8]:
## TODO : Distribution by 'Contract'
dataset.groupBy('Contract').count().show()

+--------------+-----+
|      Contract|count|
+--------------+-----+
|Month-to-month| 3875|
|      One year| 1473|
|      Two year| 1695|
+--------------+-----+



In [9]:
## TODO : Distribution by 'gender'
dataset.groupBy('gender').count().show()

+------+-----+
|gender|count|
+------+-----+
|Female| 3488|
|  Male| 3555|
+------+-----+



## Step 3 : Categorical Data

In [10]:
## Define columns
prediction_column = ['Churn']
categorical_columns = ['gender',  'InternetService','Contract','PaymentMethod']
categorical_index = ['gender_index',  'InternetService_index','Contract_index','PaymentMethod_index']


columns = ['SeniorCitizen','PhoneService','Partner','Dependents','tenure','MultipleLines',
           'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport',
           'StreamingTV','StreamingMovies','PaperlessBilling',
           'MonthlyCharges','TotalCharges']

In [12]:
dataset.select(categorical_columns).show(5)
# dataset.select(categorical_index).show(5)
dataset.select(prediction_column).show(5)


+------+---------------+--------------+--------------------+
|gender|InternetService|      Contract|       PaymentMethod|
+------+---------------+--------------+--------------------+
|Female|            DSL|Month-to-month|    Electronic check|
|  Male|            DSL|      One year|        Mailed check|
|  Male|            DSL|Month-to-month|        Mailed check|
|  Male|            DSL|      One year|Bank transfer (au...|
|Female|    Fiber optic|Month-to-month|    Electronic check|
+------+---------------+--------------+--------------------+
only showing top 5 rows

+-----+
|Churn|
+-----+
|   No|
|   No|
|  Yes|
|   No|
|  Yes|
+-----+
only showing top 5 rows



## Step 4: Deal with Categorical Columns

Let's deal with the categorical columns, including the label column ('Churn').

Workflow:
- **Feature Indexers** :  ( category columns --> '*_index' columns)
- **Label indexer** : 'Churn' --> 'indexedLabel'
- **Vector Assembler** : numeric columns + '*_index' columns --> 'features' 
- **Scaler** :  'features' --> 'scaledFeatures'

In [13]:
## handy function to pretty print indexers, scalers, assemblers

from pyspark.ml.feature import StringIndexer, StandardScaler, VectorAssembler, MinMaxScaler

def pretty_print_transformer(transformer):
    """Return a one-line 'Name : input -> output' description of a feature transformer."""
    name = transformer.__class__.__name__

    # transformers with a single input column
    if isinstance(transformer, (StringIndexer, StandardScaler, MinMaxScaler)):
        return name + " : " + transformer.getInputCol() + ' -> ' + transformer.getOutputCol()

    # VectorAssembler takes a list of input columns
    if isinstance(transformer, VectorAssembler):
        return name + " : " + str(transformer.getInputCols()) + ' -> ' + transformer.getOutputCol()

    # fall back to the class name for any other stage
    return name
    


In [14]:
## 4.1 - Feature Indexers

# from pyspark.ml.feature import StringIndexer

print("indexing categorical columns : ", categorical_columns)

## TODO : create indexers in a loop
## loop through 'categorical_columns'
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index", handleInvalid="keep")\
            for column in categorical_columns ]

for indexer in indexers:
    print(pretty_print_transformer(indexer))


indexing categorical columns :  ['gender', 'InternetService', 'Contract', 'PaymentMethod']
StringIndexer : gender -> gender_index
StringIndexer : InternetService -> InternetService_index
StringIndexer : Contract -> Contract_index
StringIndexer : PaymentMethod -> PaymentMethod_index
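
To make the indexing step concrete, here is a rough plain-Python imitation of what a `StringIndexer` does with its default ordering (most frequent label gets index 0) and with `handleInvalid="keep"` (unseen labels share one extra index). The helper names `fit_string_indexer` and `transform_keep` are made up for this sketch, and Spark emits the indices as doubles rather than Python ints.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def transform_keep(mapping, value):
    """Imitate handleInvalid='keep': every unseen label maps to one extra index."""
    return mapping.get(value, len(mapping))

# Counts taken from the 'Contract' distribution in Step 2.
contract_values = (["Month-to-month"] * 3875
                   + ["One year"] * 1473
                   + ["Two year"] * 1695)

contract_index = fit_string_indexer(contract_values)
print(contract_index)

# 'Three year' is a hypothetical label that never appears in the data.
print(transform_keep(contract_index, "Three year"))
```

Note that the resulting numbers are arbitrary codes: 'Two year' gets a smaller index than 'One year' only because it is more frequent.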


In [15]:
## 4.2 - label indexer

from pyspark.ml.feature import StringIndexer

## TODO : we need to index 'Churn' column too
## Create a String Indexer with inputColumn='Churn' and outputCol='indexedLabel'
labelIndexer = StringIndexer(inputCol="Churn", outputCol="indexedLabel")

print(pretty_print_transformer(labelIndexer))


StringIndexer : Churn -> indexedLabel


In [16]:
## 4.3 - Vector assembler 
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=columns + categorical_index, outputCol="features")

print (pretty_print_transformer(assembler))


VectorAssembler : ['SeniorCitizen', 'PhoneService', 'Partner', 'Dependents', 'tenure', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'gender_index', 'InternetService_index', 'Contract_index', 'PaymentMethod_index'] -> features


In [17]:
## 4.4 - Scaler
from pyspark.ml.feature import StandardScaler

## TODO : scale 'features' column into 'scaledFeatures'
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")

print (pretty_print_transformer(scaler))

StandardScaler : features -> scaledFeatures
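
With its default settings, `StandardScaler` scales each feature to unit standard deviation but does not subtract the mean. The sketch below imitates that on a single column in plain Python; the function name is illustrative, and the sample values are the 'tenure' entries from the five rows shown in Step 1.

```python
import statistics

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (no mean-centering)."""
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [v / std for v in values]

# 'tenure' values from the first five rows displayed in Step 1.
tenure = [1, 34, 2, 45, 2]

scaled_tenure = scale_to_unit_std(tenure)
scaled_std = statistics.stdev(scaled_tenure)

print("scaled values :", [round(v, 3) for v in scaled_tenure])
print("std after scaling :", round(scaled_std, 3))
```

Because nothing is subtracted, zeros stay zero, which keeps sparse feature vectors sparse.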


## Step 5: Build the Pipeline
We are going to transform the data using Spark pipeline.

In [23]:
from pyspark.ml import Pipeline

##  with scaler
stages = indexers + [labelIndexer, assembler,  scaler] 

## without scaler
#stages = indexers + [assembler, labelIndexer] 

for i, stage in enumerate(stages, start=1):
    print ("stage ", i , " : ", pretty_print_transformer(stage))
print()

## TODO : Create a 'Pipeline' passing 'stages' as input
pipeline = Pipeline(stages=stages)

print ("pipeline : ", pipeline.explainParams())

stage  1  :  StringIndexer : gender -> gender_index
stage  2  :  StringIndexer : InternetService -> InternetService_index
stage  3  :  StringIndexer : Contract -> Contract_index
stage  4  :  StringIndexer : PaymentMethod -> PaymentMethod_index
stage  5  :  StringIndexer : Churn -> indexedLabel
stage  6  :  VectorAssembler : ['SeniorCitizen', 'PhoneService', 'Partner', 'Dependents', 'tenure', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'gender_index', 'InternetService_index', 'Contract_index', 'PaymentMethod_index'] -> features
stage  7  :  StandardScaler : features -> scaledFeatures

pipeline :  stages: a list of pipeline stages (current: [StringIndexer_1cd671da3360, StringIndexer_9c939b99246d, StringIndexer_25e807aad05c, StringIndexer_4220cb781d0a, StringIndexer_f9b5c6071792, VectorAssembler_29aaa43db4b9, StandardScaler_52d6b23b837d])


In [24]:
%%time
## TODO : Run data through the pipeline
## Hint : first call 'fit' and then 'transform'
processed_data = pipeline.fit(dataset).transform(dataset)

print ("processed data count ", processed_data.count())

processed data count  7043
CPU times: user 68 ms, sys: 16 ms, total: 84 ms
Wall time: 2.04 s


In [25]:
## pretty print transformed data using pandas
x = processed_data.limit(2).toPandas()
# print horizontally
# x
# print vertically
x.T

Unnamed: 0,0,1
customerID,7590-VHVEG,5575-GNVDE
gender,Female,Male
SeniorCitizen,0,0
Partner,1,0
Dependents,0,0
tenure,1,34
PhoneService,0,1
MultipleLines,0,0
InternetService,DSL,DSL
OnlineSecurity,0,1


## Step 6: Split into training and test.

In [26]:
## TODO : training=80%,  test=20%
(training, test) = processed_data.randomSplit([80., 20.])

print("training set count : ", training.count())
print("testing set count : ", test.count())

training set count :  5655
testing set count :  1388
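
The weights `[80., 20.]` do not sum to 1; `randomSplit` normalizes them, and because rows are assigned at random the resulting sizes only approximate the requested fractions. A small plain-Python check, using the counts printed above (`normalize_weights` is a hypothetical helper for illustration):

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

split_fractions = normalize_weights([80.0, 20.0])

# Row counts copied from the output above.
training_count, test_count = 5655, 1388
observed_training_fraction = training_count / (training_count + test_count)

print("requested fractions :", split_fractions)
print("observed training fraction : {:.4f}".format(observed_training_fraction))
```

Passing a `seed` argument to `randomSplit` makes the split repeatable between runs; without it the counts above will differ each time the notebook is executed.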


## Step 7 - Create SVM Model

In [27]:
from pyspark.ml.classification import LinearSVC

## TODO : create 'LinearSVC' model
##    with labelCol='indexedLabel'
##    with featuresCol='scaledFeatures'
##    with maxIter=100 and regParam=0.1
lsvc = LinearSVC(labelCol="indexedLabel", featuresCol="scaledFeatures", maxIter=100, regParam=0.1)
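
For intuition, a linear SVM scores a row as a weighted sum of its features plus an intercept, predicts class 1 when that score exceeds a threshold (0.0 by default in `LinearSVC`), and is trained by minimizing hinge loss with an L2 penalty whose strength is set by `regParam`. The sketch below shows the scoring rule and the hinge loss in plain Python; the weights and the feature row are invented toy values, not taken from the trained model.

```python
def raw_score(weights, intercept, features):
    """Linear decision function: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def predict(weights, intercept, features, threshold=0.0):
    """Predict class 1 when the raw score exceeds the threshold, else class 0."""
    return 1 if raw_score(weights, intercept, features) > threshold else 0

def hinge_loss(label, score):
    """Hinge loss for a 0/1 label: max(0, 1 - y * score) with y in {-1, +1}."""
    signed_label = 2 * label - 1
    return max(0.0, 1.0 - signed_label * score)

# Toy values for illustration only.
toy_weights = [0.5, -1.0]
toy_intercept = 0.25
toy_row = [2.0, 0.5]

toy_score = raw_score(toy_weights, toy_intercept, toy_row)
print("score      :", toy_score)
print("prediction :", predict(toy_weights, toy_intercept, toy_row))
print("loss if the true label is 1 :", hinge_loss(1, toy_score))
print("loss if the true label is 0 :", hinge_loss(0, toy_score))
```

A correctly classified row still incurs some loss when its score falls inside the margin (between -1 and 1), which is what pushes the model toward a wide margin.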

## Step 8: Train  Linear SVM model

In [28]:
print ("training starting on ", training.count() , " records")

training starting on  5655  records


In [29]:
%%time 

## TODO : train the model
## Hint :    call 'fit' on 'training' data
lsvcModel = lsvc.fit(training)
print ("training done")

training done
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 24.2 s


In [30]:
import pandas as pd

# Print the coefficients and intercept for LinearSVC
coef = lsvcModel.coefficients

df = pd.DataFrame({'input' : columns + categorical_index, 'coefficient': coef.toArray()})
print("Intercept: " + str(lsvcModel.intercept))

df
#df.sort_values(by=['input'])

Intercept: -0.07548510505921968


Unnamed: 0,input,coefficient
0,SeniorCitizen,0.09389
1,PhoneService,-0.030488
2,Partner,-0.015586
3,Dependents,-0.050076
4,tenure,-0.315892
5,MultipleLines,0.108757
6,OnlineSecurity,-0.151361
7,OnlineBackup,-0.085512
8,DeviceProtection,-0.043025
9,TechSupport,-0.153441
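
Since the features were scaled to unit standard deviation, coefficient magnitudes are roughly comparable, so sorting by absolute value gives a quick (if crude) view of which inputs the model leans on. The sketch below does this in plain Python for the ten coefficients visible above only; the remaining rows of the table are not shown here and are therefore left out.

```python
# Coefficients copied from the ten rows displayed above.
shown_coefficients = {
    "SeniorCitizen": 0.09389,
    "PhoneService": -0.030488,
    "Partner": -0.015586,
    "Dependents": -0.050076,
    "tenure": -0.315892,
    "MultipleLines": 0.108757,
    "OnlineSecurity": -0.151361,
    "OnlineBackup": -0.085512,
    "DeviceProtection": -0.043025,
    "TechSupport": -0.153441,
}

# Largest absolute coefficient first.
ranked = sorted(shown_coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    direction = "raises" if value > 0 else "lowers"
    print("{:<17} {:>10.6f}  ({} the churn score)".format(name, value, direction))
```

Among these ten inputs, 'tenure' has by far the largest magnitude and a negative sign: longer-standing customers get a lower churn score.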


## Step 9 : Predict on Test Data

In [31]:
print ("predicting on " , test.count() , " records")

predicting on  1388  records


In [32]:
%%time

## TODO : predict on test data
## Hint : 'transform' on 'test'
predictions_test = lsvcModel.transform(test)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 56.9 ms


In [33]:
predictions_test.show()

+----------+------+-------------+-------+----------+------+------------+-------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------------+----------------+--------------------+--------------+------------+-----+------------+---------------------+--------------+-------------------+------------+--------------------+--------------------+--------------------+----------+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|MultipleLines|InternetService|OnlineSecurity|OnlineBackup|DeviceProtection|TechSupport|StreamingTV|StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|gender_index|InternetService_index|Contract_index|PaymentMethod_index|indexedLabel|            features|      scaledFeatures|       rawPrediction|prediction|
+----------+------+-------------+-------+----------+------+------------+-------------+---------------+--------------+------------+------

## Step 10: See the evaluation metrics

In [34]:
predictions_test = lsvcModel.transform(test)
predictions_train = lsvcModel.transform(training)

### 10.1 Model Accuracy

In [35]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction",
                                              metricName="accuracy")

print("Training set accuracy = " , evaluator.evaluate(predictions_train))
print("Test set accuracy = " , evaluator.evaluate(predictions_test))


Training set accuracy =  0.8035366931918656
Test set accuracy =  0.7982708933717579


### 10.2 : Confusion matrix

**Interpret the confusion matrix output**

In [36]:
# Confusion matrix
cm = predictions_test.groupBy('Churn').pivot('prediction', [0,1]).count().na.fill(0).orderBy('Churn')
cm.show()

+-----+---+---+
|Churn|  0|  1|
+-----+---+---+
|   No|923| 92|
|  Yes|188|185|
+-----+---+---+
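
One way to read this matrix is to turn the four cells into the usual metrics. The sketch below uses plain Python and the counts printed above, and it assumes that prediction 1 corresponds to 'Yes' (the label indexer assigns index 0 to the more frequent label, 'No').

```python
# Cells copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 923, 92    # actual 'No'  : predicted 0, predicted 1
fn, tp = 188, 185   # actual 'Yes' : predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the customers flagged as churners, how many churned
recall = tp / (tp + fn)      # of the customers who churned, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print("accuracy  : {:.4f}".format(accuracy))
print("precision : {:.4f}".format(precision))
print("recall    : {:.4f}".format(recall))
print("f1        : {:.4f}".format(f1))
```

The accuracy matches the test-set accuracy reported in 10.1, but recall shows that the model catches only about half of the customers who actually churned.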



In [37]:
import seaborn as sns

cm_pd = cm.toPandas()
cm_pd.set_index("Churn", inplace=True)
# print(cm_pd)

# colormaps : cmap="YlGnBu" , cmap="Greens", cmap="Blues",  cmap="Reds"
sns.heatmap(cm_pd, annot=True,  fmt=',', cmap="Blues").plot()

[]

### 10.3 - AUC

In [38]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# default metric for BinaryClassificationEvaluator is 'areaUnderROC'
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol='indexedLabel' )
# print ("default metrics : " ,evaluator.getMetricName())

print("AUC for training: " , evaluator.evaluate(predictions_train))
print("AUC for test : " , evaluator.evaluate(predictions_test))

AUC for training:  0.8364188770439221
AUC for test :  0.8437565208204134


**=> What does AUC mean?** 
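
As a hint: the area under the ROC curve equals the probability that a randomly chosen positive example (a churner) receives a higher raw score than a randomly chosen negative one. The sketch below computes it that way in plain Python on invented toy scores; `pairwise_auc` is an illustrative helper, not a Spark function.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy scores for illustration only.
toy_positive_scores = [0.9, 0.4, 0.7]   # rows whose true label is 1
toy_negative_scores = [0.1, 0.5, 0.3]   # rows whose true label is 0

toy_auc = pairwise_auc(toy_positive_scores, toy_negative_scores)
print("AUC on toy scores : {:.4f}".format(toy_auc))
```

An AUC of 0.5 means the scores rank no better than chance and 1.0 means every positive outranks every negative. Unlike accuracy, it does not depend on the prediction threshold.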

## Step 11: Try running without scaling features

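In Step 5 we added a scaler as the last pipeline stage to standardize the feature vector.  
Try running without the scaler.  

In Step 5, uncomment the following line (and comment out the `stages` line that includes `scaler`):  
```
#stages = indexers + [assembler, labelIndexer] 
```

Without the scaler there is no 'scaledFeatures' column, so in Step 7 also change `featuresCol="scaledFeatures"` to `featuresCol="features"`.  

Then run the whole notebook (Cell --> Run All).  
Do you see any improvement or degradation in accuracy / AUC?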