### Telco Customer Churn
Focused customer retention programs

#### Context
"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

#### Content
Each row represents a customer; each column contains one of the customer's attributes, as described in the column metadata.

The data set includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they've been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and whether they have partners and dependents

#### Inspiration
To explore this type of model and learn more about the subject.

#### New version from IBM:
https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

In [1]:
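from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for this notebook
spark = SparkSession.builder.appName("churn").getOrCreate()
# Use Apache Arrow when converting between Spark and pandas DataFrames
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")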

In [2]:
# Read the CSV with a header row and let Spark infer the column types
churn_df = spark.read.csv("Telco-Customer-Churn.csv", inferSchema=True, header=True)
churn_df.show()

+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|   MultipleLines|InternetService|     OnlineSecurity|       OnlineBackup|   DeviceProtection|        TechSupport|        StreamingTV|    StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|7590-VHVEG|Female|            0|    Yes|        No|     1|  

In [3]:
churn_df.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)
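
`TotalCharges` is inferred as `string` even though it holds amounts, which usually means at least one value in that column could not be parsed as a number. The sketch below shows, in plain Python, one way such values could be normalised before modelling. `parse_charge` is a hypothetical helper, and the assumption that the offending entries are blank strings is not confirmed by the output above.

```python
from typing import Optional


def parse_charge(raw: Optional[str]) -> Optional[float]:
    """Convert a raw TotalCharges string to a float, or None if blank or malformed."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        # Blank entries (assumed cause of the string type) become missing values
        return None
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical sample values, not rows taken from the dataset
samples = ["29.85", " ", "1889.5", "n/a"]
parsed = [parse_charge(s) for s in samples]
print(parsed)  # [29.85, None, 1889.5, None]
```

With a numeric column in place, `TotalCharges` could be used as a continuous feature instead of being string-indexed as a category. In Spark the same idea is usually expressed as a cast of the column to `double`.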



In [4]:
# customerID is a unique identifier, so it is dropped before modelling
churn_df = churn_df.drop("customerID")

In [5]:
from pyspark.sql.types import StringType

# 문자변수 ("string variables"): names of all string-typed columns
문자변수 = [변수.name for 변수 in churn_df.schema.fields if isinstance(변수.dataType, StringType)]
문자변수

['gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'TotalCharges',
 'Churn']

In [6]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

In [7]:
## StringIndexer
# Map each distinct string in a column to a numeric index, written to <column>_SI
indexer = StringIndexer(inputCols=문자변수,
                        outputCols=["{}_SI".format(c) for c in 문자변수])
encode_df = indexer.fit(churn_df).transform(churn_df)
encode_df.printSchema()

root
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)
 |-- gender_SI: double (nullable = false)
 |-- Partner_SI: double (nullable = false)
 |-- Dependents_SI: double (nullable = 
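
By default `StringIndexer` orders labels by descending frequency, so the most common string in a column receives index 0.0. The plain-Python sketch below imitates that rule on a tiny hypothetical column. `fit_string_index` is an illustrative helper, not part of PySpark, and the alphabetical tie-break is an assumption about how equal frequencies are ordered.

```python
from collections import Counter


def fit_string_index(values):
    """Return a {label: index} mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties (assumed rule)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical Churn values, not rows taken from the dataset
churn = ["No", "Yes", "No", "No", "Yes"]
mapping = fit_string_index(churn)
print(mapping)                      # {'No': 0.0, 'Yes': 1.0}
print([mapping[v] for v in churn])  # [0.0, 1.0, 0.0, 0.0, 1.0]
```

Under this rule `Churn_SI` is 0.0 for whichever of "Yes" and "No" is more common, which is worth checking before interpreting the label.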

In [18]:
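# 설명변수 ("explanatory variables"): numeric columns plus the indexed string columns.
# The last entry, Churn_SI, is the label and still has to be removed.
설명변수 = ["SeniorCitizen", "tenure", "MonthlyCharges"] + ["{}_SI".format(c) for c in 문자변수]
설명변수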

['SeniorCitizen',
 'tenure',
 'MonthlyCharges',
 'gender_SI',
 'Partner_SI',
 'Dependents_SI',
 'PhoneService_SI',
 'MultipleLines_SI',
 'InternetService_SI',
 'OnlineSecurity_SI',
 'OnlineBackup_SI',
 'DeviceProtection_SI',
 'TechSupport_SI',
 'StreamingTV_SI',
 'StreamingMovies_SI',
 'Contract_SI',
 'PaperlessBilling_SI',
 'PaymentMethod_SI',
 'TotalCharges_SI',
 'Churn_SI']

In [8]:
# Drop the last entry (Churn_SI): it is the label and must not be used as a feature
설명변수 = 설명변수[0:-1]
설명변수

['SeniorCitizen',
 'MonthlyCharges',
 'gender_SI',
 'Partner_SI',
 'Dependents_SI',
 'PhoneService_SI',
 'MultipleLines_SI',
 'InternetService_SI',
 'OnlineSecurity_SI',
 'OnlineBackup_SI',
 'DeviceProtection_SI',
 'TechSupport_SI',
 'StreamingTV_SI',
 'StreamingMovies_SI',
 'Contract_SI',
 'PaperlessBilling_SI',
 'PaymentMethod_SI',
 'TotalCharges_SI']
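
Slicing off the last element only works while `Churn_SI` happens to be last, and in a notebook it also depends on the cells being run in order. A minimal sketch of selecting the features by name instead is shown below. `label_col`, `numeric_cols`, `string_cols` and `feature_cols` are hypothetical names, and the column lists are copied from the outputs above.

```python
label_col = "Churn_SI"
numeric_cols = ["SeniorCitizen", "tenure", "MonthlyCharges"]
string_cols = [
    "gender", "Partner", "Dependents", "PhoneService", "MultipleLines",
    "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies", "Contract",
    "PaperlessBilling", "PaymentMethod", "TotalCharges", "Churn",
]

indexed_cols = ["{}_SI".format(c) for c in string_cols]

# Exclude the label by name, so the result does not depend on list order
feature_cols = [c for c in numeric_cols + indexed_cols if c != label_col]

# Guard against label leakage before the features are assembled
assert label_col not in feature_cols
print(len(feature_cols))  # 19
```

Re-running this cell any number of times gives the same list, which a repeated positional slice would not.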

In [22]:
from pyspark.ml.feature import VectorAssembler

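# 변수묶음 ("variable bundle"): packs the feature columns into a single vector column
변수묶음 = VectorAssembler(inputCols=설명변수, outputCol="features")
# 변환자료 ("transformed data"): encode_df with the added features column
변환자료 = 변수묶음.transform(encode_df)
변환자료.select("features", "Churn_SI").show()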

+--------------------+--------+
|            features|Churn_SI|
+--------------------+--------+
|(20,[1,2,3,4,6,7,...|     0.0|
|(20,[1,2,8,9,11,1...|     0.0|
|(20,[1,2,8,9,10,1...|     1.0|
|(20,[1,2,6,7,8,9,...|     0.0|
|(20,[1,2,3,18,19]...|     1.0|
|(20,[1,2,3,7,11,1...|     1.0|
|(20,[1,2,5,7,10,1...|     0.0|
|(20,[1,2,3,6,7,8,...|     0.0|
|(20,[1,2,3,4,7,11...|     1.0|
|(20,[1,2,5,8,9,10...|     0.0|
|(20,[1,2,4,5,8,9,...|     0.0|
|[0.0,16.0,18.95,0...|     0.0|
|(20,[1,2,4,7,11,1...|     0.0|
|(20,[1,2,7,10,11,...|     1.0|
|(20,[1,2,9,11,12,...|     0.0|
|[0.0,69.0,113.25,...|     0.0|
|[0.0,52.0,20.65,1...|     0.0|
|(20,[1,2,5,7,9,11...|     0.0|
|(20,[1,2,3,4,5,8,...|     1.0|
|(20,[1,2,3,10,11,...|     0.0|
+--------------------+--------+
only showing top 20 rows
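
The `features` column mixes two notations: `(20,[1,2,3,...],[...])` is a sparse vector (size, indices of the non-zero entries, their values), while `[0.0,16.0,18.95,...]` is a dense vector listing every entry. Both describe a vector of length 20. The sketch below expands the sparse form into the dense one in plain Python. `to_dense` is a hypothetical helper and the example values are made up.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


# Hypothetical 5-element vector whose entries 1 and 2 are non-zero
print(to_dense(5, [1, 2], [34.0, 56.95]))  # [0.0, 34.0, 56.95, 0.0, 0.0]
```

In the rows above the sparse vectors start at index 1 because entry 0 (`SeniorCitizen`) is zero for those customers.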



In [23]:
# 분류자료 ("classification data"): keep only the feature vector and the label
분류자료 = 변환자료.select(["features", "Churn_SI"])
분류자료.show()

+--------------------+--------+
|            features|Churn_SI|
+--------------------+--------+
|(20,[1,2,3,4,6,7,...|     0.0|
|(20,[1,2,8,9,11,1...|     0.0|
|(20,[1,2,8,9,10,1...|     1.0|
|(20,[1,2,6,7,8,9,...|     0.0|
|(20,[1,2,3,18,19]...|     1.0|
|(20,[1,2,3,7,11,1...|     1.0|
|(20,[1,2,5,7,10,1...|     0.0|
|(20,[1,2,3,6,7,8,...|     0.0|
|(20,[1,2,3,4,7,11...|     1.0|
|(20,[1,2,5,8,9,10...|     0.0|
|(20,[1,2,4,5,8,9,...|     0.0|
|[0.0,16.0,18.95,0...|     0.0|
|(20,[1,2,4,7,11,1...|     0.0|
|(20,[1,2,7,10,11,...|     1.0|
|(20,[1,2,9,11,12,...|     0.0|
|[0.0,69.0,113.25,...|     0.0|
|[0.0,52.0,20.65,1...|     0.0|
|(20,[1,2,5,7,9,11...|     0.0|
|(20,[1,2,3,4,5,8,...|     1.0|
|(20,[1,2,3,10,11,...|     0.0|
+--------------------+--------+
only showing top 20 rows



In [24]:
from pyspark.ml.classification import LogisticRegression

# 70/30 train/test split with a fixed seed for reproducibility
train_data, test_data = 분류자료.randomSplit([0.7, 0.3], seed=316)

In [25]:
# 분석모형 ("analysis model"): logistic regression fitted on the training split
분석모형 = LogisticRegression(labelCol="Churn_SI").fit(train_data)
분석모형.summary



<pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary at 0x1d755c3cdf0>

In [26]:
분석모형.summary.predictions.show()

+--------------------+--------+--------------------+--------------------+----------+
|            features|Churn_SI|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|(20,[0,1,2,3,4,5,...|     0.0|[19.3403516169458...|[0.99999999601349...|       0.0|
|(20,[0,1,2,3,4,5,...|     0.0|[20.0279718703656...|[0.99999999799570...|       0.0|
|(20,[0,1,2,3,4,5,...|     0.0|[19.8583148344511...|[0.99999999762511...|       0.0|
|(20,[0,1,2,3,4,5,...|     0.0|[19.1711541456363...|[0.99999999527857...|       0.0|
|(20,[0,1,2,3,4,5,...|     0.0|[19.2311451861710...|[0.99999999555348...|       0.0|
|(20,[0,1,2,3,4,5,...|     0.0|[19.7596754198973...|[0.99999999737890...|       0.0|
|(20,[0,1,2,3,4,5,...|     0.0|[19.3154920582169...|[0.99999999591315...|       0.0|
|(20,[0,1,2,3,4,5,...|     1.0|[-19.369107281707...|[3.87350023347706...|       1.0|
|(20,[0,1,2,3,4,5,...|     0.0|[18.8308566126647...|[0.9999999933
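
For a binary logistic regression, the `probability` column is the logistic (sigmoid) function applied to `rawPrediction`. The sketch below checks this against the first entries of two rows printed above. `sigmoid` is a hypothetical helper, and the inputs are the truncated values as displayed, so the match is only approximate.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# First component of rawPrediction for a row predicted 0.0
p_no_churn = sigmoid(19.3403516169458)
# First component of rawPrediction for the row predicted 1.0
p_churn_row = sigmoid(-19.369107281707)

print(p_no_churn)   # close to 0.99999999601349
print(p_churn_row)  # close to 3.8735e-09
```

Margins near ±19 correspond to probabilities within about 4e-9 of 0 or 1, which is far more certainty than a churn model usually reaches.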

In [27]:
분석모형.summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|           Churn_SI|         prediction|
+-------+-------------------+-------------------+
|  count|               4861|               4861|
|   mean|0.27196050195433036|0.27196050195433036|
| stddev| 0.4450154240671786| 0.4450154240671786|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



In [28]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [29]:
# 예측 ("prediction"): evaluation summary of the fitted model on the held-out test split
예측 = 분석모형.evaluate(test_data)
예측.predictions.show()

+--------------------+--------+--------------------+--------------------+----------+
|            features|Churn_SI|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|(20,[0,1,2,3,4,5,...|     0.0|[19.2646612333412...|[0.99999999570004...|       0.0|
|(20,[0,1,2,3,4,5,...|     1.0|[-19.383891550606...|[3.81665461056438...|       1.0|
|(20,[0,1,2,3,4,5,...|     1.0|[-19.921920326882...|[2.22853742670308...|       1.0|
|(20,[0,1,2,3,4,6,...|     0.0|[19.1453914182629...|[0.99999999515535...|       0.0|
|(20,[0,1,2,3,4,6,...|     1.0|[-19.543793422330...|[3.25265768488868...|       1.0|
|(20,[0,1,2,3,4,6,...|     0.0|[18.8254039921970...|[0.99999999332838...|       0.0|
|(20,[0,1,2,3,4,6,...|     0.0|[18.9956233663884...|[0.99999999437262...|       0.0|
|(20,[0,1,2,3,4,6,...|     1.0|[-20.022207902330...|[2.01588424942380...|       1.0|
|(20,[0,1,2,3,4,7,...|     0.0|[19.8736439839356...|[0.9999999976

In [30]:
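# 평가 ("evaluation"): area under the ROC curve, computed from the continuous
# rawPrediction scores rather than the hard 0/1 prediction column
평가 = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="Churn_SI")
auc = 평가.evaluate(예측.predictions)
auc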

1.0
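
An AUC of exactly 1.0 is unusual on real data and is often a sign that the label has leaked into the features. Here the `features` vectors have 20 entries, the same length as the 20-name list that still contains `Churn_SI`. AUC is also a ranking metric, so it is normally computed from a continuous score such as `rawPrediction` or `probability` rather than from hard 0/1 predictions. The plain-Python sketch below, using a hypothetical `pairwise_auc` helper and made-up scores, shows how thresholding the scores first can change the result.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration only
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.45, 0.4, 0.9]
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

auc_scores = pairwise_auc(labels, scores)
auc_hard = pairwise_auc(labels, hard)
print(auc_scores)  # 5/6, about 0.833
print(auc_hard)    # 0.75
```

Thresholding discards the ordering among rows on the same side of the cut-off, which is why the two numbers differ.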