#Assignment 6 - Logistic Regression

## Binary Customer Churn

A company has a lot of  customers that use their service to produce ads for the customer websites. They've noticed that they have quite a bit of churn in clients. They basically randomly assign account managers right now, but want you to create a machine learning model that will help predict which customers will churn (stop buying their service) so that they can correctly assign the customers most at risk to churn an account manager. Luckily they have some historical data, can you help them out? Create a classification algorithm that will help classify whether or not a customer churned. 

The data is saved as historical_data.csv. Here are the fields and their definitions:

    Name : Name of the latest contact at Company
    Age: Customer Age
    Total_Purchase: Total Ads Purchased
    Account_Manager: Binary 0=No manager, 1= Account manager assigned
    Years: Totaly Years as a customer
    Num_sites: Number of websites that use the service.
    Onboard_date: Date that the name of the latest contact was onboarded
    Location: Client HQ Address
    Company: Name of Client Company
    
**NB:Create the model and evaluated it.**

In [1]:
#Step 1: Install Dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark

#Step 2: Add environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

#Step 3: Initialize Pyspark
import findspark
findspark.init()

#creating spark context
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
sc

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('historical').getOrCreate()

#importing the logistic regression library
from pyspark.ml.classification import LogisticRegression

In [28]:
# Load training data
training = spark.read.csv('historical_data.csv',inferSchema=True,header=True)

#lr = LogisticRegression()

# Fit the model
#lrModel = lr.fit(training)

#showing the schema
training.printSchema()
training.show()

#trainingSummary = lrModel.summary

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)

+-------------------+---+--------------+---------------+-----+---------+----------------+--------------------+--------------------+-----+
|              Names|Age|Total_Purchase|Account_Manager|Years|Num_Sites|    Onboard_date|            Location|             Company|Churn|
+-------------------+---+--------------+---------------+-----+---------+----------------+--------------------+--------------------+-----+
|   Cameron Williams| 42|       11066.8|              0| 7.22|        8|  8/30/2013 7:00|10265 Elizabeth M...|          Harvey LLC|    1|
|      Kevin Mueller| 41|      1191

In [34]:
training.select('Company').show(30)
#checking if there are companies that can be grouped together and discretized, but they're unique
#same problem with names, dates, and addresses

+--------------------+
|             Company|
+--------------------+
|          Harvey LLC|
|          Wilson PLC|
|Miller, Johnson a...|
|           Smith Inc|
|          Love-Jones|
|        Kelly-Warren|
|   Reynolds-Sheppard|
|          Singh-Cole|
|           Lopez PLC|
|       Reed-Martinez|
|Briggs, Lamb and ...|
|    Figueroa-Maynard|
|     Abbott-Thompson|
|Smith, Kim and Ma...|
|Snyder, Lee and M...|
|      Sanders-Pierce|
|Andrews, Adams an...|
|Morgan, Phillips ...|
|      Villanueva LLC|
|Berry, Orr and Ca...|
|       Parks-Bradley|
|           Olsen LLC|
|Clark, Campbell a...|
|          Dalton LLC|
|Thompson, Hansen ...|
|Yates, Martinez a...|
|       Reeves-Curtis|
|           Gates Ltd|
|     Dunlap and Sons|
|Taylor, Allen and...|
+--------------------+
only showing top 30 rows



In [9]:
training.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [12]:
#Columns selection 
my_cols = training.select([
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Churn'])

In [13]:
#feature engineering/data cleanup
my_final_data = my_cols.na.drop()

In [14]:
from pyspark.ml.feature import (VectorAssembler,VectorIndexer,
                                OneHotEncoder,StringIndexer)

assembler = VectorAssembler(inputCols=
['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Churn'],outputCol='features')

In [16]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

In [17]:
log_reg_historical = LogisticRegression(featuresCol='features',labelCol='Churn')

In [18]:
pipeline = Pipeline(stages=[assembler,log_reg_historical])

In [20]:
train_historical_data, test_historical_data = my_final_data.randomSplit([0.7,.3])
fit_model = pipeline.fit(train_historical_data)
results = fit_model.transform(test_historical_data)

In [22]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                       labelCol='Churn')

results.select('Churn','prediction').show()

+-----+----------+
|Churn|prediction|
+-----+----------+
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
+-----+----------+
only showing top 20 rows



In [23]:
AUC = my_eval.evaluate(results)
AUC

1.0