# Support Vector Machines: Churn Analysis

Let's look at a classification example in Spark MLlib.  We are going to use some telecom data to predict whether or not a customer "churned".


In [None]:
%matplotlib inline

import time
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import pandas as pd



## Step 1: Load the data

In [None]:
t1 = time.perf_counter()
dataset = spark.read.csv("/data/churn/telco.csv.gz", header=True, inferSchema=True)
t2 = time.perf_counter()

print("read {:,} records in {:,.2f} ms".format(dataset.count(), (t2-t1)*1000))

dataset.printSchema()

## Step 2: Basic Data Analytics
Let's see how the data is distributed across a few columns: Churn, gender, Contract.

Do you think the data is skewed?

In [None]:
## distribution by Churn
dataset.groupBy('Churn').count().show()

In [None]:
## TODO : Distribution by gender
dataset.groupBy('???').count().show()

In [None]:
## TODO : distribution by 'Contract'
???

In [None]:
## basic describe
## TODO : Feel free to add more attributes to describe
dataset.describe(['MultipleLines', 'MonthlyCharges']).show()

## Step 3: Categorical Data

In [None]:
## Define columns
prediction = ['Churn']
categorical = ['gender',  'InternetService','Contract','PaymentMethod']
categorical_index = ['gender_index',  'InternetService_index','Contract_index','PaymentMethod_index']


columns = ['SeniorCitizen','PhoneService','Partner','Dependents','tenure','MultipleLines',
           'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport',
           'StreamingTV','StreamingMovies','PaperlessBilling',
           'MonthlyCharges','TotalCharges']

In [None]:
dataset.select(categorical).show(5)
dataset.select(prediction).show(5)
dataset.select(columns).show(1)

## Step 4: Deal with Categorical Columns

Let's index the categorical columns, including the output label (`Churn`), so that each string value becomes a numeric index.
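
To make the indexing concrete, here is a minimal pure-Python sketch of what `StringIndexer` does with its default settings: labels are ordered by descending frequency, so the most frequent value gets index 0. With `handleInvalid="keep"`, a value that was not seen during fitting is mapped to one extra index after the known labels. The helper names (`fit_string_index`, `index_value`) and the sample values are made up for illustration; they are not part of the Spark API.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically in this sketch
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def index_value(mapping, value):
    """Look up a label; unseen labels get one extra index (like handleInvalid='keep')."""
    return mapping.get(value, len(mapping))

# Hypothetical sample of a 'Contract'-like column
contracts = ["Month-to-month", "Two year", "Month-to-month",
             "One year", "Month-to-month", "Two year"]

contract_index = fit_string_index(contracts)
print(contract_index)
print(index_value(contract_index, "Three year"))  # label never seen during fitting
```

Note that the resulting indices are arbitrary category codes, not magnitudes: index 2 is not "twice" index 1.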

In [None]:
print(categorical)
dataset.select(categorical).show(5)

indexers = [StringIndexer(inputCol=column, outputCol=column + "_index", handleInvalid="keep")
            .fit(dataset) for column in categorical]

labelIndexer = StringIndexer(inputCol="Churn", outputCol="indexedLabel")


## Step 5: Build the Vector

In [None]:
assembler = VectorAssembler(inputCols=columns + categorical_index, outputCol="features")


In [None]:
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)



**=> TODO: Scale the features and output column "scaledFeatures"**
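
As a reference for what the scaler computes: with its default parameters (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its sample standard deviation and does not center it. The sketch below reproduces that arithmetic in plain Python for a single column. The function `scale_to_unit_std` and the sample charges are illustrative assumptions, not part of the notebook's data or the Spark API.

```python
import statistics

def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(column)  # sample (n-1) standard deviation
    if std == 0:
        # Assumption for this sketch: a constant column is mapped to 0.0
        return [0.0 for _ in column]
    return [value / std for value in column]

# Hypothetical monthly charges
monthly_charges = [20.0, 50.0, 80.0, 110.0]
scaled = scale_to_unit_std(monthly_charges)

print(scaled)
print(statistics.stdev(scaled))  # close to 1.0 after scaling
```

Scaling matters for a linear SVM because the regularization term penalizes all coefficients equally, so a feature measured in large units would otherwise dominate.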

In [None]:
# Scaler

scaler = StandardScaler(inputCol="indexedFeatures", outputCol="???") #TODO: Fix this.

## Step 6: Split into training and test.

**=> Split into training/test with an 80/20 split**
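
A random split assigns each row independently, so the resulting counts are close to 80/20 but rarely exact, and they change between runs unless a seed is fixed. The plain-Python sketch below imitates that behavior. It is an illustration only, not how Spark implements `randomSplit`, and the `random_split` helper and its `seed` default are assumptions made for this example.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test independently with the given probability."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))
train_rows, test_rows = random_split(rows)

print("training set count : ", len(train_rows))
print("testing set count : ", len(test_rows))
```

In Spark, `randomSplit` accepts an optional `seed` argument for the same purpose.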

In [None]:
## Split into training and test
## TODO: create training and test with an 80/20 split
(training, test) = dataset.randomSplit([.8, .2])

print("training set count : ", training.count())
print("testing set count : ", test.count())

## Step 7: Build the Pipeline

In [None]:
## TODO : set maxIter to 50

lsvc = LinearSVC(labelCol="indexedLabel", featuresCol="scaledFeatures", maxIter=???, regParam=0.1)

##  with scaler
stages = indexers + [assembler, featureIndexer, labelIndexer, scaler] 

## without scaler
#stages = indexers + [assembler, featureIndexer, labelIndexer] 


i = 0
for stage in stages:
    i = i+1
    print ("stage ", i, " : ", stage)
print()

# Chain the indexers, assembler, and scaler in a Pipeline
pipeline = Pipeline(stages=stages)

print ("pipeline : ", pipeline.explainParams())



## Step 8: Train the Linear SVM model

In [None]:
# Fit the model
t1 = time.perf_counter()
scaledTraining = pipeline.fit(training).transform(training)
t2 = time.perf_counter()

print("ran pipeline on {:,} records using {:,} stages in {:,.2f} ms".\
      format(training.count(), len(stages), (t2-t1)*1000))




In [None]:
t1 = time.perf_counter()
## TODO : supply 'scaledTraining' for fitting
lsvcModel = lsvc.fit(???)
t2 = time.perf_counter()

print("trained on {:,} records using {:,} features in {:,.2f} ms".\
      format(scaledTraining.count(), len(columns + categorical_index), (t2-t1)*1000))

In [None]:
# Print the coefficients and intercept for LinearSVC
# Convert the Spark vector to a NumPy array so pandas gets one row per feature
coef = lsvcModel.coefficients.toArray()

df = pd.DataFrame({'input' : columns + categorical_index, 'coefficient': coef})
print("Intercept: " + str(lsvcModel.intercept))

df
#df.sort_values(by=['input'])
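
The coefficients and intercept define a linear decision function: the model computes the margin `w . x + b` and, with the default threshold of 0, predicts class 1 when the margin is positive. Here is a minimal sketch with made-up weights; the numbers and the `margin` / `predict` helpers are hypothetical and are not output from this notebook.

```python
def margin(weights, intercept, features):
    """Signed score w . x + b of a feature vector relative to the separating hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def predict(weights, intercept, features, threshold=0.0):
    """Return 1.0 if the margin exceeds the threshold, otherwise 0.0."""
    return 1.0 if margin(weights, intercept, features) > threshold else 0.0

# Hypothetical model with three (already scaled) features
weights = [0.5, -1.0, 0.25]
intercept = -0.5

customer_a = [2.0, 0.5, 2.0]
customer_b = [0.0, 1.0, 0.0]

print(margin(weights, intercept, customer_a), predict(weights, intercept, customer_a))
print(margin(weights, intercept, customer_b), predict(weights, intercept, customer_b))
```

When the features are scaled, a larger absolute coefficient roughly indicates a feature with more influence on the decision, and the sign shows which class it pushes toward.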

## Step 9: Predict on Test Data

**=> TODO: Transform the test dataset to get the scaled vector**



In [None]:
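t1 = time.perf_counter()
# NOTE: this refits the pipeline (indexers, scaler) on the test set.
# To keep test statistics out of the preprocessing, a common alternative is to
# keep the model returned by pipeline.fit(training) and call its transform() on test.
scaledTest = pipeline.fit(test).transform(test)
t2 = time.perf_counter()

print("ran pipeline on {:,} records using {:,} stages in {:,.2f} ms".\
      format(test.count(), len(stages), (t2-t1)*1000))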

In [None]:
t1 = time.perf_counter()

## TODO : create predictions on 'scaledTest' dataset
predictions = lsvcModel.transform(???)

t2 = time.perf_counter()
print("predicted on {:,} records in {:,.2f} ms".\
      format(test.count(),  (t2-t1)*1000))

## Step 10: See the evaluation metrics

### 10.1 AUC

In [None]:
evaluator = BinaryClassificationEvaluator(labelCol='indexedLabel', rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)  #AUC


**=> What does AUC mean?** 
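
One way to read AUC (area under the ROC curve, the default metric of `BinaryClassificationEvaluator`): it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A value of 0.5 is what random scores would give; 1.0 means every positive is ranked above every negative. The sketch below computes AUC from that pairwise definition using invented labels and scores. Spark derives the value from the ROC curve instead, so treat this as an explanation rather than the evaluator's implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = churned) and model scores
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]

auc = pairwise_auc(labels, scores)
print("AUC:", auc)
```

Because AUC depends only on the ranking of the scores, it is not affected by the decision threshold, unlike accuracy.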

### 10.2 Model Accuracy

In [None]:

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print ("accuracy ", accuracy)
print("Test Error = %g" % (1.0 - accuracy))


### 10.3 Confusion Matrix

In [None]:
# Confusion matrix
predictions.groupBy('Churn').pivot('prediction', [0,1]).count().na.fill(0).orderBy('Churn').show()

**=> TODO: What is the meaning of the confusion matrix?**
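
As a hint, a binary confusion matrix has four cells: true negatives, false positives, false negatives, and true positives. In the table above, each row is an actual `Churn` value and each column is a predicted class. The sketch below counts the four cells for a small invented example and derives accuracy, precision, and recall from them; the data and the `confusion_counts` helper are assumptions made for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs for a binary problem with labels 0 and 1."""
    pairs = Counter(zip(actual, predicted))
    return {
        "tn": pairs[(0, 0)],  # stayed, predicted to stay
        "fp": pairs[(0, 1)],  # stayed, predicted to churn
        "fn": pairs[(1, 0)],  # churned, predicted to stay
        "tp": pairs[(1, 1)],  # churned, predicted to churn
    }

# Hypothetical labels: 1 = churned, 0 = stayed
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_counts(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
precision = cm["tp"] / (cm["tp"] + cm["fp"])  # of predicted churners, how many churned
recall = cm["tp"] / (cm["tp"] + cm["fn"])     # of actual churners, how many were caught

print(cm)
print("accuracy", accuracy, "precision", precision, "recall", recall)
```

With imbalanced classes, accuracy alone can look good even when recall on the minority class (churners) is poor, which is why the full matrix is worth inspecting.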



## Step 11: Try running without scaling features

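In Step 5 we defined a scaler to normalize the feature vector, and in Step 7 we added it as the last pipeline stage.  
Now try without the scaler.  

In Step 7, uncomment the following line:   
`#stages = indexers + [assembler, featureIndexer, labelIndexer]`

Also change `featuresCol` in `LinearSVC` from `"scaledFeatures"` to `"indexedFeatures"`, because the `scaledFeatures` column is no longer produced.  

Then run the whole notebook (Cell --> Run All).  
Do you see any improvement or degradation in accuracy or AUC?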