## Spark and Batch to predict customer churn

You will use the **Telco Customer Churn** data set, which contains anonymized customer data from a telecommunications company. Use this data set to predict customer churn, which is critical to the business because it is easier to retain existing customers than to acquire new ones.

## Learning goals

In this notebook, you will learn how to:

-  Load a CSV file into an Apache Spark DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create an Apache Spark machine learning pipeline.
-  Train and evaluate a model.


<a id="load"></a>
## 1. Load and explore data

In [None]:
# Treat blank strings as missing values so that columns such as TotalCharges are inferred as numeric.
df_data = spark.read.load("customer_churn.csv", format="csv", sep=",", inferSchema="true", header="true", nanValue=" ", nullValue=" ")

Explore the loaded data by using the following Apache Spark DataFrame methods:
-  print schema
-  count all records
-  show distribution of label classes

In [None]:
df_data.printSchema()

print("Number of fields: %3g" % len(df_data.schema))

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)

Number of fields:  21


As you can see, the data contains 21 fields. The `Churn` field is the one you want to predict (the label).

In [None]:
print("Total number of records: " + str(df_data.count()))

Total number of records: 7043


The data set contains 7043 records.

Now you will check if all records have complete data.

In [None]:
df_complete = df_data.dropna()

print("Number of records with complete data: %3g" % df_complete.count())

Number of records with complete data: 7032


The counts differ (7043 versus 7032), so 11 records have missing values. You can verify that all of the missing values are in the `TotalCharges` column. For training and evaluation, you will use the data set with those records removed.

Inspect the class distribution in the label column.

In [None]:
df_complete.groupBy('Churn').count().show()

+-----+-----+
|Churn|count|
+-----+-----+
|   No| 5163|
|  Yes| 1869|
+-----+-----+
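The classes are imbalanced, which affects how an accuracy score should be read. As a rough illustration in plain Python (using only the counts shown above), the share of each class and the accuracy of a trivial model that always predicts the majority class can be computed as follows:

```python
# Class counts copied from the groupBy output above.
counts = {"No": 5163, "Yes": 1869}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the most frequent class reaches this accuracy.
baseline_accuracy = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

About 73% of the customers did not churn, so a useful model should score clearly above that baseline.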



<a id="model"></a>
## 2. Create an Apache Spark machine learning model

In this section you will learn how to:

- [2.1 Prepare data](#prep)
- [2.2 Create an Apache Spark machine learning pipeline](#pipe)
- [2.3 Train a model](#train)

### 2.1 Prepare data<a id="prep"></a>

In this subsection you will split your data into: 
- train data set
- test data set
- predict data set

In [None]:
(train_data, test_data, predict_data) = df_complete.randomSplit([0.8, 0.18, 0.02], 24)

print("Number of records for training: " + str(train_data.count()))
print("Number of records for evaluation: " + str(test_data.count()))
print("Number of records for prediction: " + str(predict_data.count()))

Number of records for training: 5638
Number of records for evaluation: 1261
Number of records for prediction: 133


As you can see, your data has been successfully split into three data sets:

-  The train data set, which is the largest group, is used for training.
-  The test data set will be used for model evaluation and to test the assumptions of the model.
-  The predict data set will be used for prediction.
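`randomSplit` assigns each row at random, so the resulting sizes only approximate the requested weights. A small plain-Python check, using the record counts printed above, compares the requested and observed fractions:

```python
# Weights passed to randomSplit and the record counts reported above.
weights = {"train": 0.8, "test": 0.18, "predict": 0.02}
observed = {"train": 5638, "test": 1261, "predict": 133}

total = sum(observed.values())
fractions = {name: count / total for name, count in observed.items()}

for name in weights:
    print(f"{name:8s} requested {weights[name]:.3f}  observed {fractions[name]:.3f}")
```

The observed fractions stay within about one percentage point of the requested weights. Passing a fixed seed (24 in this notebook) is intended to make the split repeatable across runs.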

### 2.2 Create the pipeline<a id="pipe"></a>

In this section you will create an Apache Spark machine learning pipeline and then train the model.

First, import the Apache Spark machine learning packages that are needed in the subsequent steps.

In [None]:
from pyspark.ml.feature import StringIndexer, IndexToString, RFormula
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, combine all the predictors into a single feature vector and convert the `Churn` label to a numeric index.

In [None]:
lab = StringIndexer(inputCol = 'Churn', outputCol = 'label')
features = RFormula(formula = "~ gender + SeniorCitizen +  Partner + Dependents + tenure + PhoneService + MultipleLines + InternetService + OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport + StreamingTV + StreamingMovies + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges")
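To see which numeric value each label is likely to receive, it helps to know that `StringIndexer` by default orders labels by descending frequency, with ties broken alphabetically. The helper below (`frequency_desc_index`, a hypothetical name) is a plain-Python imitation of that ordering, not Spark's own code. It uses the class counts of the complete data set and assumes the training split has the same majority class.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct string to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Class counts reported earlier for the complete data set.
churn_column = ["No"] * 5163 + ["Yes"] * 1869

mapping = frequency_desc_index(churn_column)
print(mapping)  # {'No': 0.0, 'Yes': 1.0}
```

Under that assumption, `label` is 0.0 for customers who stayed and 1.0 for customers who churned.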

Check the formula that `RFormula` will use to build the feature vector.

In [None]:
features.getFormula()

'~ gender + SeniorCitizen +  Partner + Dependents + tenure + PhoneService + MultipleLines + InternetService + OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport + StreamingTV + StreamingMovies + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges'

Next, define the estimator you want to use for classification. Logistic Regression is used in this example.

In [None]:
lr = LogisticRegression(maxIter = 10)

Now build the pipeline. A pipeline consists of transformers and an estimator.

In [None]:
pipeline_lr = Pipeline(stages = [features, lab, lr])

### 2.3 Train the model<a id="train"></a>

Now, you can train your Logistic Regression model by using the previously defined **pipeline** and **train data**.

In [None]:
model_lr = pipeline_lr.fit(train_data)

You can check your **model accuracy** now. Use **test data** to evaluate the model.

In [None]:
predictions = model_lr.transform(test_data)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print("Test dataset:")
print("Accuracy = %3.2f" % accuracy)

Test dataset:
Accuracy = 0.80


You can now tune your model to achieve better accuracy. For simplicity, tuning is omitted from this example.