# Machine Learning Project with Classification

<br>

This project analyzes data from a telecommunications company that wants to understand its customer profile and its churn (customer turnover). The goal is to predict whether or not a customer will cancel the contracted service. I used PySpark and the Spark ML library to build the models.

# Preparing the Environment

This section prepares the environment in Google Colab: it installs the required dependency and loads the data from a .csv file stored in Google Drive.

### Installation

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     317.0/317.0 MB 1.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=3b14b07de3ceb1fea4e3d071193715ee20ad9741ed8a9e03386eda8842b94ce5
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
                    .master('local[*]')\
                    .appName("Classification with Spark")\
                    .getOrCreate()

spark

### Loading Data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data = spark.read.csv('/content/drive/MyDrive/Área de Estudos/Alura/Formações/Apache Spark com Python/spark_database/curso-classificacao/dados_clientes.csv',\
                      sep=',',\
                      header=True,\
                      inferSchema=True
)

data

DataFrame[id: int, Churn: string, Mais65anos: int, Conjuge: string, Dependentes: string, MesesDeContrato: int, TelefoneFixo: string, MaisDeUmaLinhaTelefonica: string, Internet: string, SegurancaOnline: string, BackupOnline: string, SeguroDispositivo: string, SuporteTecnico: string, TVaCabo: string, StreamingFilmes: string, TipoContrato: string, ContaCorreio: string, MetodoPagamento: string, MesesCobrados: double]

In [5]:
data.show()

+---+-----+----------+-------+-----------+---------------+------------+------------------------+-----------+------------------+------------------+------------------+------------------+------------------+------------------+------------+------------+----------------+-------------+
| id|Churn|Mais65anos|Conjuge|Dependentes|MesesDeContrato|TelefoneFixo|MaisDeUmaLinhaTelefonica|   Internet|   SegurancaOnline|      BackupOnline| SeguroDispositivo|    SuporteTecnico|           TVaCabo|   StreamingFilmes|TipoContrato|ContaCorreio| MetodoPagamento|MesesCobrados|
+---+-----+----------+-------+-----------+---------------+------------+------------------------+-----------+------------------+------------------+------------------+------------------+------------------+------------------+------------+------------+----------------+-------------+
|  0|  Nao|         0|    Sim|        Nao|              1|         Nao|    SemServicoTelefonico|        DSL|               Nao|               Sim|              

In [6]:
data.count()

10348

In [7]:
data.groupBy('Churn').count().show()

+-----+-----+
|Churn|count|
+-----+-----+
|  Sim| 5174|
|  Nao| 5174|
+-----+-----+



In [8]:
data.printSchema()

root
 |-- id: integer (nullable = true)
 |-- Churn: string (nullable = true)
 |-- Mais65anos: integer (nullable = true)
 |-- Conjuge: string (nullable = true)
 |-- Dependentes: string (nullable = true)
 |-- MesesDeContrato: integer (nullable = true)
 |-- TelefoneFixo: string (nullable = true)
 |-- MaisDeUmaLinhaTelefonica: string (nullable = true)
 |-- Internet: string (nullable = true)
 |-- SegurancaOnline: string (nullable = true)
 |-- BackupOnline: string (nullable = true)
 |-- SeguroDispositivo: string (nullable = true)
 |-- SuporteTecnico: string (nullable = true)
 |-- TVaCabo: string (nullable = true)
 |-- StreamingFilmes: string (nullable = true)
 |-- TipoContrato: string (nullable = true)
 |-- ContaCorreio: string (nullable = true)
 |-- MetodoPagamento: string (nullable = true)
 |-- MesesCobrados: double (nullable = true)



# Preparing the Data

Most of the columns were stored as strings, so I converted them to integers using dummy variables: "Sim" (Yes) becomes 1 and "Nao" (No) becomes 0.

Dummy Variable

Dummy variables are binary variables (0 or 1) generated from a variable with two or more categories. A record has the value 1 if it belongs to the class and 0 if it does not.
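As a minimal illustration of the idea in plain Python (not the Spark code used in this project), the sketch below maps a two-class column to 0 and 1. The helper name `to_dummy` and the sample values are hypothetical.

```python
# Plain-Python sketch of a dummy variable (illustrative only; the project itself uses PySpark).
def to_dummy(values, positive='Sim'):
    """Return 1 where the value equals the positive class and 0 otherwise."""
    return [1 if value == positive else 0 for value in values]

conjuge = ['Sim', 'Nao', 'Nao', 'Sim']   # hypothetical sample of a Sim/Nao column
print(to_dummy(conjuge))                 # [1, 0, 0, 1]
```

Any value other than "Sim" becomes 0 in this sketch, so a value such as "SemServicoTelefonico" is also encoded as 0.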

In [9]:
binaryColumns = [
    'Churn',
    'Conjuge',
    'Dependentes',
    'TelefoneFixo',
    'MaisDeUmaLinhaTelefonica',
    'SegurancaOnline',
    'BackupOnline',
    'SeguroDispositivo',
    'SuporteTecnico',
    'TVaCabo',
    'StreamingFilmes',
    'ContaCorreio'
]

In [10]:
from pyspark.sql import functions as f

In [11]:
allColumns = [f.when(f.col(c)=='Sim', 1)\
                .otherwise(0).alias(c) for c in binaryColumns]                  # Maps "Sim" to 1 and any other value (such as "Nao") to 0

In [12]:
for column in reversed(data.columns):                                           # reversed: walks the columns from last to first, so insert(0, ...) keeps their original order
    if column not in binaryColumns:                                             # Only the columns that were not transformed above
        allColumns.insert(0, column)                                            # Untransformed columns go first; the binary columns stay at the end

allColumns

['id',
 'Mais65anos',
 'MesesDeContrato',
 'Internet',
 'TipoContrato',
 'MetodoPagamento',
 'MesesCobrados',
 Column<'CASE WHEN (Churn = Sim) THEN 1 ELSE 0 END AS Churn'>,
 Column<'CASE WHEN (Conjuge = Sim) THEN 1 ELSE 0 END AS Conjuge'>,
 Column<'CASE WHEN (Dependentes = Sim) THEN 1 ELSE 0 END AS Dependentes'>,
 Column<'CASE WHEN (TelefoneFixo = Sim) THEN 1 ELSE 0 END AS TelefoneFixo'>,
 Column<'CASE WHEN (MaisDeUmaLinhaTelefonica = Sim) THEN 1 ELSE 0 END AS MaisDeUmaLinhaTelefonica'>,
 Column<'CASE WHEN (SegurancaOnline = Sim) THEN 1 ELSE 0 END AS SegurancaOnline'>,
 Column<'CASE WHEN (BackupOnline = Sim) THEN 1 ELSE 0 END AS BackupOnline'>,
 Column<'CASE WHEN (SeguroDispositivo = Sim) THEN 1 ELSE 0 END AS SeguroDispositivo'>,
 Column<'CASE WHEN (SuporteTecnico = Sim) THEN 1 ELSE 0 END AS SuporteTecnico'>,
 Column<'CASE WHEN (TVaCabo = Sim) THEN 1 ELSE 0 END AS TVaCabo'>,
 Column<'CASE WHEN (StreamingFilmes = Sim) THEN 1 ELSE 0 END AS StreamingFilmes'>,
 Column<'CASE WHEN (ContaCorr
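To make the role of `reversed` and `insert(0, ...)` easier to follow, here is a small plain-Python sketch of the same loop. The column names are a shortened, hypothetical subset, and the strings ending in `_as_0_1` only stand in for the transformed `Column` objects.

```python
# Plain-Python sketch of the reordering loop (hypothetical, shortened column list).
columns = ['id', 'Churn', 'Mais65anos', 'Conjuge', 'Internet']
binary = ['Churn', 'Conjuge']

ordered = [c + '_as_0_1' for c in binary]   # stand-ins for the transformed columns
for column in reversed(columns):
    if column not in binary:
        ordered.insert(0, column)           # each insert pushes the earlier ones to the right

print(ordered)
# ['id', 'Mais65anos', 'Internet', 'Churn_as_0_1', 'Conjuge_as_0_1']
```

Walking the list backwards while inserting at position 0 leaves the untransformed columns in their original relative order, followed by the transformed ones.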

In [13]:
data.select(allColumns).show()

+---+----------+---------------+-----------+------------+----------------+-------------+-----+-------+-----------+------------+------------------------+---------------+------------+-----------------+--------------+-------+---------------+------------+
| id|Mais65anos|MesesDeContrato|   Internet|TipoContrato| MetodoPagamento|MesesCobrados|Churn|Conjuge|Dependentes|TelefoneFixo|MaisDeUmaLinhaTelefonica|SegurancaOnline|BackupOnline|SeguroDispositivo|SuporteTecnico|TVaCabo|StreamingFilmes|ContaCorreio|
+---+----------+---------------+-----------+------------+----------------+-------------+-----+-------+-----------+------------+------------------------+---------------+------------+-----------------+--------------+-------+---------------+------------+
|  0|         0|              1|        DSL| Mensalmente|BoletoEletronico|        29.85|    0|      1|          0|           0|                       0|              0|           1|                0|             0|      0|              0|      

In [14]:
dataset = data.select(allColumns)

The columns "Internet", "TipoContrato" and "MetodoPagamento" are still strings. Ideally every column should be an integer type, so I will also convert them with dummy variables, creating one 0/1 column per category.

In [15]:
dataset.printSchema()

root
 |-- id: integer (nullable = true)
 |-- Mais65anos: integer (nullable = true)
 |-- MesesDeContrato: integer (nullable = true)
 |-- Internet: string (nullable = true)
 |-- TipoContrato: string (nullable = true)
 |-- MetodoPagamento: string (nullable = true)
 |-- MesesCobrados: double (nullable = true)
 |-- Churn: integer (nullable = false)
 |-- Conjuge: integer (nullable = false)
 |-- Dependentes: integer (nullable = false)
 |-- TelefoneFixo: integer (nullable = false)
 |-- MaisDeUmaLinhaTelefonica: integer (nullable = false)
 |-- SegurancaOnline: integer (nullable = false)
 |-- BackupOnline: integer (nullable = false)
 |-- SeguroDispositivo: integer (nullable = false)
 |-- SuporteTecnico: integer (nullable = false)
 |-- TVaCabo: integer (nullable = false)
 |-- StreamingFilmes: integer (nullable = false)
 |-- ContaCorreio: integer (nullable = false)



In [16]:
data.select(['Internet', 'TipoContrato', 'MetodoPagamento']).show()

+-----------+------------+----------------+
|   Internet|TipoContrato| MetodoPagamento|
+-----------+------------+----------------+
|        DSL| Mensalmente|BoletoEletronico|
|        DSL|       UmAno|          Boleto|
|        DSL| Mensalmente|          Boleto|
|        DSL|       UmAno|   DebitoEmConta|
|FibraOptica| Mensalmente|BoletoEletronico|
|FibraOptica| Mensalmente|BoletoEletronico|
|FibraOptica| Mensalmente|   CartaoCredito|
|        DSL| Mensalmente|          Boleto|
|FibraOptica| Mensalmente|BoletoEletronico|
|        DSL|       UmAno|   DebitoEmConta|
|        DSL| Mensalmente|          Boleto|
|        Nao|    DoisAnos|   CartaoCredito|
|FibraOptica|       UmAno|   CartaoCredito|
|FibraOptica| Mensalmente|   DebitoEmConta|
|FibraOptica| Mensalmente|BoletoEletronico|
|FibraOptica|    DoisAnos|   CartaoCredito|
|        Nao|       UmAno|          Boleto|
|FibraOptica|    DoisAnos|   DebitoEmConta|
|        DSL| Mensalmente|   CartaoCredito|
|FibraOptica| Mensalmente|Boleto

In [17]:
dataset.groupBy('id')\
       .pivot('Internet')\
       .agg(f.lit(1))\
       .na\
       .fill(0)\
       .show()

# pivot: turns each category of the column into a new column
# lit: creates a literal value, in this case 1, which marks the category each id belongs to
# fill: replaces the remaining null values with 0

+----+---+-----------+---+
|  id|DSL|FibraOptica|Nao|
+----+---+-----------+---+
|7982|  1|          0|  0|
|9465|  0|          1|  0|
|2122|  1|          0|  0|
|3997|  1|          0|  0|
|6654|  0|          1|  0|
|7880|  0|          1|  0|
|4519|  0|          1|  0|
|6466|  0|          1|  0|
| 496|  1|          0|  0|
|7833|  0|          1|  0|
|1591|  0|          0|  1|
|2866|  0|          1|  0|
|8592|  0|          1|  0|
|1829|  0|          1|  0|
| 463|  0|          1|  0|
|4900|  0|          1|  0|
|4818|  0|          1|  0|
|7554|  1|          0|  0|
|1342|  0|          0|  1|
|5300|  0|          1|  0|
+----+---+-----------+---+
only showing top 20 rows
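The result above is a one-hot encoding of the "Internet" column. As a rough plain-Python sketch of what the pivot, lit and fill combination produces (this is not how Spark computes it, and the three sample rows are hypothetical):

```python
# Plain-Python sketch of the table produced by groupBy('id').pivot('Internet').agg(lit(1)).na.fill(0).
rows = [(0, 'DSL'), (1, 'FibraOptica'), (2, 'Nao')]        # hypothetical (id, Internet) pairs

categories = sorted({internet for _, internet in rows})    # one new column per category

one_hot = {
    row_id: {category: int(internet == category) for category in categories}
    for row_id, internet in rows
}

for row_id, encoded in one_hot.items():
    print(row_id, encoded)
```

Each id ends up with exactly one column set to 1, because every customer has a single value in the original column.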



In [18]:
Internet = dataset\
           .groupBy('id')\
           .pivot('Internet')\
           .agg(f.lit(1))\
           .na\
           .fill(0)

TypeContract = dataset\
               .groupBy('id')\
               .pivot('TipoContrato')\
               .agg(f.lit(1))\
               .na\
               .fill(0)

PaymentMethod = dataset\
                  .groupBy('id')\
                  .pivot('MetodoPagamento')\
                  .agg(f.lit(1))\
                  .na\
                  .fill(0)


In [19]:
# Joining the data from the variables treated with the dataset
dataset\
    .join(Internet, 'id', how='inner')\
    .join(TypeContract, 'id', how='inner')\
    .join(PaymentMethod, 'id', how='inner')\
    .select(
        '*',
        # Renaming columns
        f.col('DSL').alias('Internet_DSL'),
        f.col('FibraOptica').alias('Internet_FibraOptica'),
        f.col('Nao').alias('Internet_Nao'),
        f.col('Mensalmente').alias('TipoContrato_Mensalmente'),
        f.col('UmAno').alias('TipoContrato_UmAno'),
        f.col('DoisAnos').alias('TipoContrato_DoisAnos'),
        f.col('DebitoEmConta').alias('MetodoPagamento_DebitoEmConta'),
        f.col('CartaoCredito').alias('MetodoPagamento_CartaoCredito'),
        f.col('BoletoEletronico').alias('MetodoPagamento_BoletoEletronico'),
        f.col('Boleto').alias('MetodoPagamento_Boleto')
    )\
    .drop(
        'Internet', 'TipoContrato', 'MetodoPagamento', 'DSL',
        'FibraOptica', 'Nao', 'Mensalmente', 'UmAno', 'DoisAnos',
        'DebitoEmConta', 'CartaoCredito', 'BoletoEletronico', 'Boleto'
    )\
    .show()

# Removing columns that I no longer need from my dataset

+----+----------+---------------+-----------------+-----+-------+-----------+------------+------------------------+---------------+------------+-----------------+--------------+-------+---------------+------------+------------+--------------------+------------+------------------------+------------------+---------------------+-----------------------------+-----------------------------+--------------------------------+----------------------+
|  id|Mais65anos|MesesDeContrato|    MesesCobrados|Churn|Conjuge|Dependentes|TelefoneFixo|MaisDeUmaLinhaTelefonica|SegurancaOnline|BackupOnline|SeguroDispositivo|SuporteTecnico|TVaCabo|StreamingFilmes|ContaCorreio|Internet_DSL|Internet_FibraOptica|Internet_Nao|TipoContrato_Mensalmente|TipoContrato_UmAno|TipoContrato_DoisAnos|MetodoPagamento_DebitoEmConta|MetodoPagamento_CartaoCredito|MetodoPagamento_BoletoEletronico|MetodoPagamento_Boleto|
+----+----------+---------------+-----------------+-----+-------+-----------+------------+----------------------

In [20]:
dataset = dataset\
    .join(Internet, 'id', how='inner')\
    .join(TypeContract, 'id', how='inner')\
    .join(PaymentMethod, 'id', how='inner')\
    .select(
        '*',
        f.col('DSL').alias('Internet_DSL'),
        f.col('FibraOptica').alias('Internet_FibraOptica'),
        f.col('Nao').alias('Internet_Nao'),
        f.col('Mensalmente').alias('TipoContrato_Mensalmente'),
        f.col('UmAno').alias('TipoContrato_UmAno'),
        f.col('DoisAnos').alias('TipoContrato_DoisAnos'),
        f.col('DebitoEmConta').alias('MetodoPagamento_DebitoEmConta'),
        f.col('CartaoCredito').alias('MetodoPagamento_CartaoCredito'),
        f.col('BoletoEletronico').alias('MetodoPagamento_BoletoEletronico'),
        f.col('Boleto').alias('MetodoPagamento_Boleto')
    )\
    .drop(
        'Internet', 'TipoContrato', 'MetodoPagamento', 'DSL',
        'FibraOptica', 'Nao', 'Mensalmente', 'UmAno', 'DoisAnos',
        'DebitoEmConta', 'CartaoCredito', 'BoletoEletronico', 'Boleto'
    )

In [21]:
dataset.show()

+----+----------+---------------+-----------------+-----+-------+-----------+------------+------------------------+---------------+------------+-----------------+--------------+-------+---------------+------------+------------+--------------------+------------+------------------------+------------------+---------------------+-----------------------------+-----------------------------+--------------------------------+----------------------+
|  id|Mais65anos|MesesDeContrato|    MesesCobrados|Churn|Conjuge|Dependentes|TelefoneFixo|MaisDeUmaLinhaTelefonica|SegurancaOnline|BackupOnline|SeguroDispositivo|SuporteTecnico|TVaCabo|StreamingFilmes|ContaCorreio|Internet_DSL|Internet_FibraOptica|Internet_Nao|TipoContrato_Mensalmente|TipoContrato_UmAno|TipoContrato_DoisAnos|MetodoPagamento_DebitoEmConta|MetodoPagamento_CartaoCredito|MetodoPagamento_BoletoEletronico|MetodoPagamento_Boleto|
+----+----------+---------------+-----------------+-----+-------+-----------+------------+----------------------

In [22]:
dataset.printSchema()

root
 |-- id: integer (nullable = true)
 |-- Mais65anos: integer (nullable = true)
 |-- MesesDeContrato: integer (nullable = true)
 |-- MesesCobrados: double (nullable = true)
 |-- Churn: integer (nullable = false)
 |-- Conjuge: integer (nullable = false)
 |-- Dependentes: integer (nullable = false)
 |-- TelefoneFixo: integer (nullable = false)
 |-- MaisDeUmaLinhaTelefonica: integer (nullable = false)
 |-- SegurancaOnline: integer (nullable = false)
 |-- BackupOnline: integer (nullable = false)
 |-- SeguroDispositivo: integer (nullable = false)
 |-- SuporteTecnico: integer (nullable = false)
 |-- TVaCabo: integer (nullable = false)
 |-- StreamingFilmes: integer (nullable = false)
 |-- ContaCorreio: integer (nullable = false)
 |-- Internet_DSL: integer (nullable = true)
 |-- Internet_FibraOptica: integer (nullable = true)
 |-- Internet_Nao: integer (nullable = true)
 |-- TipoContrato_Mensalmente: integer (nullable = true)
 |-- TipoContrato_UmAno: integer (nullable = true)
 |-- TipoContr

# Logistic Regression

I started with Logistic Regression: I vectorized the data, fit the model, generated predictions, checked the metrics and built a confusion matrix.

### Separating Data for Training

Data Vectorization

Vectorization is the process of transforming data into vectors or matrices so that it can be processed by machine learning algorithms. It is needed here because the Spark ML algorithms expect all the features of a record to be combined into a single vector column, which is what VectorAssembler produces.

In [23]:
from pyspark.ml.feature import VectorAssembler

In [24]:
dataset = dataset.withColumnRenamed('Churn', 'label')

In [25]:
X = dataset.columns
X.remove('label')
X.remove('id')
X

# 'label' is the target and 'id' is only an identifier, so neither of them is used as a feature

['Mais65anos',
 'MesesDeContrato',
 'MesesCobrados',
 'Conjuge',
 'Dependentes',
 'TelefoneFixo',
 'MaisDeUmaLinhaTelefonica',
 'SegurancaOnline',
 'BackupOnline',
 'SeguroDispositivo',
 'SuporteTecnico',
 'TVaCabo',
 'StreamingFilmes',
 'ContaCorreio',
 'Internet_DSL',
 'Internet_FibraOptica',
 'Internet_Nao',
 'TipoContrato_Mensalmente',
 'TipoContrato_UmAno',
 'TipoContrato_DoisAnos',
 'MetodoPagamento_DebitoEmConta',
 'MetodoPagamento_CartaoCredito',
 'MetodoPagamento_BoletoEletronico',
 'MetodoPagamento_Boleto']

In [26]:
assembler = VectorAssembler(inputCols=X, outputCol='features')

In [27]:
dataset_prep = assembler.transform(dataset).select('features', 'label')

In [28]:
dataset_prep.show(10, truncate=False)

+-----------------------------------------------------------------------------------------------------------+-----+
|features                                                                                                   |label|
+-----------------------------------------------------------------------------------------------------------+-----+
|(24,[1,2,11,12,13,14,17,22],[1.0,45.30540797610398,1.0,1.0,1.0,1.0,1.0,1.0])                               |1    |
|(24,[1,2,3,5,6,8,9,11,12,13,15,17,22],[60.0,103.6142230120257,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|1    |
|(24,[1,2,5,6,10,11,12,13,14,18,23],[12.0,75.85,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                       |0    |
|(24,[1,2,3,5,8,12,13,14,19,21],[69.0,61.45,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                               |0    |
|(24,[1,2,3,5,6,11,13,15,17,22],[7.0,86.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                                 |1    |
|(24,[1,2,5,6,12,13,15,17,22],[14.0,85.03742670311915,1.0,1.0,1.0,1.0,1.
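The `features` column above is printed in sparse notation: the vector size, then the indices of the non-zero entries, then their values. The plain-Python sketch below shows that representation; the helper names `to_sparse` and `to_dense` and the 5-feature row are hypothetical, not part of Spark.

```python
# Plain-Python sketch of the (size, [indices], [values]) notation shown in the output above.
def to_sparse(dense):
    """Keep only the non-zero entries of a dense list."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the dense list from the sparse representation."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense

row = [0, 12, 75.85, 0, 1]      # hypothetical row with 5 features
print(to_sparse(row))           # (5, [1, 2, 4], [12.0, 75.85, 1.0])
```

Read this way, the first row of the output has 24 features, with index 1 (MesesDeContrato) equal to 1.0 and index 2 (MesesCobrados) close to 45.31; the indices follow the order of the list `X`.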

### Fitting and Prediction

In [29]:
train, test = dataset_prep.randomSplit([0.7, 0.3], seed=101)                    # seed: fixes the random number generator so that the same split can be reproduced

In [30]:
train.count()

7206

In [31]:
test.count()

3142
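The counts above (7206 and 3142) are close to 70% and 30% of the 10348 rows, but not exact. As far as I understand, this is because the split samples rows at random instead of cutting the data at a fixed position. The plain-Python sketch below illustrates the idea under that assumption; `random_split` is a hypothetical helper, not Spark's sampler, so its counts will not match the ones above.

```python
import random

# Plain-Python sketch of a seeded random split (illustrative; not Spark's implementation).
def random_split(rows, weights, seed):
    """Send each row to the first split with probability weights[0] / sum(weights)."""
    rng = random.Random(seed)
    first_share = weights[0] / sum(weights)
    first, second = [], []
    for row in rows:
        (first if rng.random() < first_share else second).append(row)
    return first, second

rows = list(range(10348))
train_a, test_a = random_split(rows, [0.7, 0.3], seed=101)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=101)

print(len(train_a), len(test_a))   # roughly 70% / 30%, not exactly
print(train_a == train_b)          # True: the same seed gives the same split
```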

In [32]:
from pyspark.ml.classification import LogisticRegression

In [33]:
lr = LogisticRegression()

In [34]:
model_lr = lr.fit(train)

In [35]:
prediction_lr_test = model_lr.transform(test)

In [36]:
prediction_lr_test.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    0|[3.02174179751551...|[0.95354674000282...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[-0.0922192966076...|[0.47696150091605...|       1.0|
|(24,[0,1,2,3,4,5,...|    1|[0.18744121711361...|[0.54672358463156...|       0.0|
|(24,[0,1,2,3,4,5,...|    1|[0.91716501260103...|[0.71446410549163...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[-0.1495904711610...|[0.46267196467801...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[-0.1680594619286...|[0.45808374494006...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[-1.4170949608173...|[0.19511740608882...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[0.14194260698794...|[0.53542619200881...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[0.67046644011599...|[0.66160759507905...|       0.0|
|(24,[0,1,2,3,4,
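In the output above, the first value of `probability` matches the logistic (sigmoid) function applied to the first value of `rawPrediction`. The sketch below checks this for the first two rows, using the values as printed (they are truncated in the output, so the comparison is approximate). The 0.5 cut-off used for the prediction is an assumption that is consistent with the rows shown.

```python
import math

# Sketch: relating rawPrediction, probability and prediction for a binary logistic regression.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

raw_class_0 = 3.02174179751551             # first rawPrediction value of the first row (truncated in the output)
prob_class_0 = sigmoid(raw_class_0)        # probability of class 0 (Non-Churn)
prediction = 0.0 if prob_class_0 >= 0.5 else 1.0

print(round(prob_class_0, 4))              # 0.9535
print(prediction)                          # 0.0

# Second row: a negative raw value gives a class 0 probability below 0.5, so the prediction is 1.0
print(round(sigmoid(-0.0922192966076), 4))
```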

### Metrics

In [37]:
summary_lr_train = model_lr.summary

In [38]:
summary_lr_train.accuracy

0.7849014709963918

In [39]:
print("Accuracy: %f" % summary_lr_train.accuracy)                              # accuracy: share of all predictions that match the real value.
print("Precision: %f" % summary_lr_train.precisionByLabel[1])                  # precision: of the customers the model predicted as positive (Churn), the share that is in fact positive.
print("Recall: %f" % summary_lr_train.recallByLabel[1])                        # recall: of the customers that are in fact positive, the share that the model predicted as positive.
print("F1: %f" % summary_lr_train.fMeasureByLabel()[1])                        # f1-score: harmonic mean of precision and recall, so it takes both false positives and false negatives into consideration.

Accuracy: 0.784901
Precision: 0.770686
Recall: 0.812517
F1: 0.791049


### Confusion Matrix

In [40]:
prediction_lr_test.select('label', 'prediction')\
                  .where((f.col('label') == 1) &\
                         (f.col('prediction') == 1))\
                         .count()

# True positives: customers who canceled and were predicted to cancel

1256

In [41]:
tp = prediction_lr_test.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 1)).count()         # True positives: 1256
tn = prediction_lr_test.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 0)).count()         # True negatives: 1179
fp = prediction_lr_test.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 1)).count()         # False positives: 400
fn = prediction_lr_test.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 0)).count()         # False negatives: 307

print(tp, tn, fp, fn)

1256 1179 400 307


In [42]:
def calculate_show_confusion_matrix(df_transform_model, normalize=False, percentage=True):
  # Count each cell of the confusion matrix (label 1 = Churn, label 0 = Non-Churn)
  tp = df_transform_model.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 1)).count()
  tn = df_transform_model.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 0)).count()
  fp = df_transform_model.select('label', 'prediction').where((f.col('label') == 0) & (f.col('prediction') == 1)).count()
  fn = df_transform_model.select('label', 'prediction').where((f.col('label') == 1) & (f.col('prediction') == 0)).count()

  valueP = 1
  valueN = 1

  if normalize:
    valueP = tp + fn                                                            # Total of real Churn
    valueN = fp + tn                                                            # Total of real Non-Churn

  if percentage and normalize:
    valueP = valueP / 100
    valueN = valueN / 100

  def fmt(value, total):
    # Fractions (normalize=True, percentage=False) keep two decimals instead of being truncated to 0
    return round(value / total, 2) if normalize and not percentage else int(value / total)

  print(' '*20, 'Predicted')
  print(' '*15, 'Churn', ' '*5 ,'Non-Churn')
  print(' '*4, 'Churn', ' '*6, fmt(tp, valueP), ' '*7, fmt(fn, valueP))
  print('Real')
  print(' '*4, 'Non-Churn', ' '*2, fmt(fp, valueN), ' '*7, fmt(tn, valueN))

In [43]:
calculate_show_confusion_matrix(prediction_lr_test, normalize=False)

                     Predicted
                Churn       Non-Churn
     Churn        1256         307
Real
     Non-Churn    400         1179
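The metrics printed in the previous section come from `model_lr.summary`, which I stored as `summary_lr_train`, so they describe the training data. The same metrics can be derived for the test data from the four counts of the confusion matrix above. This is a plain-Python sketch of the standard formulas, with Churn as the positive class:

```python
# Metrics for the positive class (Churn) from the test confusion matrix of the Logistic Regression.
tp, fn = 1256, 307    # real Churn: predicted Churn / predicted Non-Churn
fp, tn = 400, 1179    # real Non-Churn: predicted Churn / predicted Non-Churn

accuracy = (tp + tn) / (tp + tn + fp + fn)            # correct predictions over all predictions
precision = tp / (tp + fp)                            # predicted Churn that really churned
recall = tp / (tp + fn)                               # real Churn that the model found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print("Accuracy: %f" % accuracy)
print("Precision: %f" % precision)
print("Recall: %f" % recall)
print("F1: %f" % f1)
```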


# Decision Tree

With the Decision Tree I repeated the same steps (training, metrics and confusion matrix) and compared the results with those of the Logistic Regression.

### Fitting and Prediction

In [44]:
from pyspark.ml.classification import DecisionTreeClassifier

In [45]:
dtc = DecisionTreeClassifier(seed=101)

In [46]:
model_dtc = dtc.fit(train)

In [47]:
prediction_dtc_train = model_dtc.transform(train)

In [48]:
prediction_dtc_train.show()

+--------------------+-----+--------------+--------------------+----------+
|            features|label| rawPrediction|         probability|prediction|
+--------------------+-----+--------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    0|[2056.0,334.0]|[0.86025104602510...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[2056.0,334.0]|[0.86025104602510...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|    [22.0,3.0]|         [0.88,0.12]|       0.0|
|(24,[0,1,2,3,4,5,...|    0|    [22.0,3.0]|         [0.88,0.12]|       0.0|
|(24,[0,1,2,3,4,5,...|    0|    [22.0,3.0]|         [0.88,0.12]|       0.0|
|(24,[0,1,2,3,4,5,...|    1|[331.0,1951.0]|[0.14504820333041...|       1.0|
|(24,[0,1,2,3,4,5,...|    0| [239.0,205.0]|[0.53828828828828...|       0.0|
|(24,[0,1,2,3,4,5,...|    1|[331.0,1951.0]|[0.14504820333041...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[331.0,1951.0]|[0.14504820333041...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[331.0,1951.0]|[0.14504820333041...|       1.0|
|(24,[0,1,2,
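In the output above, `rawPrediction` appears to hold the number of training rows of each class in the leaf that a record falls into, and `probability` is that pair divided by its total. The sketch below reproduces two of the rows shown; `leaf_probability` is a hypothetical helper name.

```python
# Sketch: turning the class counts of a leaf into the probabilities shown in the output above.
def leaf_probability(class_counts):
    total = sum(class_counts)
    return [count / total for count in class_counts]

print(leaf_probability([22.0, 3.0]))        # [0.88, 0.12]
print(leaf_probability([2056.0, 334.0]))    # first value is about 0.8603
```

The predicted class is the one with the highest probability, which is why the leaf `[331.0, 1951.0]` gives the prediction 1.0.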

### Metrics

Training Object

In [49]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [50]:
evaluator = MulticlassClassificationEvaluator()

In [51]:
evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: 'accuracy'})

# Accuracy is the share of predictions that match the real value.
# The model classified about 79% of the training data correctly.

0.7917013599777962

In [52]:
print("Accuracy: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))

Accuracy: 0.791701
Precision: 0.805090
Recall: 0.770978
F1: 0.787664


In [53]:
prediction_dtc_test = model_dtc.transform(test)

In [54]:
prediction_dtc_test.show()

+--------------------+-----+--------------+--------------------+----------+
|            features|label| rawPrediction|         probability|prediction|
+--------------------+-----+--------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    0|[2056.0,334.0]|[0.86025104602510...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|  [62.0,128.0]|[0.32631578947368...|       1.0|
|(24,[0,1,2,3,4,5,...|    1| [239.0,205.0]|[0.53828828828828...|       0.0|
|(24,[0,1,2,3,4,5,...|    1| [239.0,205.0]|[0.53828828828828...|       0.0|
|(24,[0,1,2,3,4,5,...|    0| [239.0,205.0]|[0.53828828828828...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|  [51.0,141.0]| [0.265625,0.734375]|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[331.0,1951.0]|[0.14504820333041...|       1.0|
|(24,[0,1,2,3,4,5,...|    0| [239.0,205.0]|[0.53828828828828...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|  [63.0,118.0]|[0.34806629834254...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[2056.0,334.0]|[0.86025104602510...|       0.0|
|(24,[0,1,2,

Test Object

In [55]:
evaluator.evaluate(prediction_dtc_test, {evaluator.metricName: 'accuracy'})

# Accuracy is the share of predictions that match the real value.
# The model classified about 77% of the test data correctly.

0.7714831317632082

### Summary Table - Decision Tree

In [56]:
print('Decision Tree Classifier')
print("="*40)
print("Training Data")
print("="*40)
print("Confusion Matrix")
print("-"*40)
calculate_show_confusion_matrix(prediction_dtc_train, normalize=False)
print("-"*40)
print("Metrics")
print("-"*40)
print("Accuracy: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(prediction_dtc_train, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))
print("")
print("="*40)
print("Test Data")
print("="*40)
print("Confusion Matrix")
print("-"*40)
calculate_show_confusion_matrix(prediction_dtc_test, normalize=False)
print("-"*40)
print("Metrics")
print("-"*40)
print("Accuracy: %f" % evaluator.evaluate(prediction_dtc_test, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(prediction_dtc_test, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(prediction_dtc_test, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(prediction_dtc_test, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))

Decision Tree Classifier
========================================
Training Data
========================================
Confusion Matrix
----------------------------------------
                     Predicted
                Churn       Non-Churn
     Churn        2784         827
Real
     Non-Churn    674         2921
----------------------------------------
Metrics
----------------------------------------
Accuracy: 0.791701
Precision: 0.805090
Recall: 0.770978
F1: 0.787664

========================================
Test Data
========================================
Confusion Matrix
----------------------------------------
                     Predicted
                Churn       Non-Churn
     Churn        1181         382
Real
     Non-Churn    336         1243
----------------------------------------
Metrics
----------------------------------------
Accuracy: 0.771483
Precision: 0.778510
Recall: 0.755598
F1: 0.766883


**Training Data**

Looking at the errors in the Confusion Matrix, the model predicted that 827 customers would not cancel the service when they in fact canceled (false negatives). It also predicted that 674 customers would cancel the service when they in fact did not (false positives).

The metrics are good, but I will still explore other algorithms in search of better results.

<br>

**Test Data**

Looking at the errors in the Confusion Matrix again, the model predicted that 382 customers would not cancel the service when they actually canceled (false negatives). It also predicted that 336 customers would cancel when they actually did not (false positives).

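Logistic Regression (errors on the test data):
* False negatives (wrongly predicted as Non-Churn) -> 307
* False positives (wrongly predicted as Churn) -> 400

Decision Tree (errors on the test data):
* False negatives (wrongly predicted as Non-Churn) -> 382
* False positives (wrongly predicted as Churn) -> 336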

The test metrics follow the same pattern as the training metrics. Although the results are relatively good, I will explore other algorithms in search of better results.
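To compare the two models on equal terms, the error counts listed above can be turned into rates over the real totals of the test data (1563 customers who churned and 1579 who did not, taken from the confusion matrices). The names `miss_rate` and `false_alarm_rate` are my own labels for this sketch.

```python
# Sketch: error rates on the test data for the two models trained so far.
test_errors = {
    'Logistic Regression': {'fn': 307, 'fp': 400},
    'Decision Tree': {'fn': 382, 'fp': 336},
}
real_churn = 1563        # 1256 + 307 = 1181 + 382
real_non_churn = 1579    # 400 + 1179 = 336 + 1243

rates = {}
for name, errors in test_errors.items():
    rates[name] = {
        'miss_rate': errors['fn'] / real_churn,             # churners the model failed to flag
        'false_alarm_rate': errors['fp'] / real_non_churn,  # loyal customers flagged as churn
    }
    print("%s: miss rate %.3f, false alarm rate %.3f"
          % (name, rates[name]['miss_rate'], rates[name]['false_alarm_rate']))
```

In these numbers the Logistic Regression misses fewer churners, while the Decision Tree raises fewer false alarms. Which error matters more depends on the cost of losing a customer compared with the cost of a retention action.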

# Random Forest

I used Random Forest to check whether its results are in line with those of the Decision Tree.

### Fitting and Prediction

In [57]:
from pyspark.ml.classification import RandomForestClassifier

In [58]:
rfc = RandomForestClassifier(seed=101)

In [59]:
model_rfc = rfc.fit(train)

In [60]:
prediction_rfc_train = model_rfc.transform(train)

In [61]:
prediction_rfc_train.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    0|[15.0052773466704...|[0.75026386733352...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[16.9295040273249...|[0.84647520136624...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[9.13052909106814...|[0.45652645455340...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[9.13052909106814...|[0.45652645455340...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[8.59288938528764...|[0.42964446926438...|       1.0|
|(24,[0,1,2,3,4,5,...|    1|[5.59647122885698...|[0.27982356144284...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[9.33276328267787...|[0.46663816413389...|       1.0|
|(24,[0,1,2,3,4,5,...|    1|[5.21616013157118...|[0.26080800657855...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[5.45640255581361...|[0.27282012779068...|       1.0|
|(24,[0,1,2,3,4,

In [62]:
prediction_rfc_test = model_rfc.transform(test)

In [63]:
prediction_rfc_test.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(24,[0,1,2,3,4,5,...|    0|[16.7433871675615...|[0.83716935837807...|       0.0|
|(24,[0,1,2,3,4,5,...|    0|[7.27313214599648...|[0.36365660729982...|       1.0|
|(24,[0,1,2,3,4,5,...|    1|[7.46885072161585...|[0.37344253608079...|       1.0|
|(24,[0,1,2,3,4,5,...|    1|[9.33276328267787...|[0.46663816413389...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[7.79829004739264...|[0.38991450236963...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[7.13263407834549...|[0.35663170391727...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[4.45872635511159...|[0.22293631775557...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[7.84691519125130...|[0.39234575956256...|       1.0|
|(24,[0,1,2,3,4,5,...|    0|[9.94796150783366...|[0.49739807539168...|       1.0|
|(24,[0,1,2,3,4,

### Summary Table - Random Forest

In [64]:
print('Random Forest Classifier')
print("="*40)
print("Training Data")
print("="*40)
print("Confusion Matrix")
print("-"*40)
calculate_show_confusion_matrix(prediction_rfc_train, normalize=False)
print("-"*40)
print("Metrics")
print("-"*40)
print("Accuracy: %f" % evaluator.evaluate(prediction_rfc_train, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(prediction_rfc_train, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(prediction_rfc_train, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(prediction_rfc_train, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))
print("")
print("="*40)
print("Test Data")
print("="*40)
print("Confusion Matrix")
print("-"*40)
calculate_show_confusion_matrix(prediction_rfc_test, normalize=False)
print("-"*40)
print("Metrics")
print("-"*40)
print("Accuracy: %f" % evaluator.evaluate(prediction_rfc_test, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(prediction_rfc_test, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(prediction_rfc_test, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(prediction_rfc_test, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))

Random Forest Classifier
Training Data
Confusion Matrix
----------------------------------------
                     Predicted
                Churn       Non-Churn
     Churn        2950         661
Real
     Non-Churn    884         2711
----------------------------------------
Metrics
----------------------------------------
Accuracy: 0.785595
Precision: 0.769431
Recall: 0.816948
F1: 0.792478

Test Data
Confusion Matrix
----------------------------------------
                     Predicted
                Churn       Non-Churn
     Churn        1257         306
Real
     Non-Churn    416         1163
----------------------------------------
Metrics
----------------------------------------
Accuracy: 0.770210
Precision: 0.751345
Recall: 0.804223
F1: 0.776885


**Training Data**

Among the prediction errors in the Confusion Matrix, the model predicted that 661 customers would not cancel the service when in fact they canceled (false negatives). It also predicted that 884 customers would cancel the service when in fact they did not (false positives).

Decision tree:
* Non-Churn -> 827
* Churn -> 674

Random Forest:
* Non-Churn -> 661
* Churn -> 884

This shows that I need to dig deeper and try further techniques to get closer to the reality of this data. The Random Forest trades false negatives for false positives (1,545 errors in total against 1,501 for the Decision Tree), so according to the metrics it does not fit the data as well as the Decision Tree.

<br>

**Test Data**

Among the prediction errors in the Confusion Matrix, the model predicted that 306 customers would not cancel the service when in fact they canceled (false negatives). It also predicted that 416 customers would cancel the service when in fact they did not (false positives).

Decision tree:
* Non-Churn -> 382
* Churn -> 336

Random Forest:
* Non-Churn -> 306
* Churn -> 416

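As with the training data, the test data shows a shift from false negatives to false positives. According to the metrics, the Random Forest does not clearly outperform the Decision Tree: recall rises from 0.756 to 0.804 and F1 from 0.767 to 0.777, but precision falls from 0.779 to 0.751 and accuracy stays at about 0.77.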
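
As a sanity check, the reported metrics can be recomputed by hand from the four cells of the Random Forest test Confusion Matrix. This is a minimal sketch in plain Python; `binary_metrics` is a hypothetical helper, not a Spark function, and it treats Churn (label 1) as the positive class, as the evaluator calls do.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)   # of all predicted Churn, how many really churned
    recall = tp / (tp + fn)      # of all real Churn, how many were found
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Cells taken from the Random Forest test Confusion Matrix above.
rf_test = binary_metrics(tp=1257, fn=306, fp=416, tn=1163)
for name, value in rf_test.items():
    print(f"{name}: {value:.6f}")
```

The values agree with the Accuracy, Precision, Recall and F1 lines printed by the evaluator for the test data.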

# Optimization Techniques

I applied Cross Validation to both the Decision Tree and the Random Forest. Of the two, the Random Forest presented the best results according to the metrics.

### Decision Tree with Cross Validation

In [65]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [66]:
dtc = DecisionTreeClassifier(seed=101)

In [67]:
grid = ParamGridBuilder()\
        .addGrid(dtc.maxDepth, [2, 5, 10])\
        .addGrid(dtc.maxBins, [10, 32, 45])\
        .build()

# ParamGridBuilder: builds the grid of parameter combinations that cross-validation will evaluate.
# maxDepth: maximum depth of the decision tree.
# maxBins: maximum number of bins used to discretize continuous features when searching for the split at each node; more bins allow finer-grained split candidates.
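
To make the size of this grid concrete, the sketch below enumerates the same combinations in plain Python. It only illustrates what the grid contains; `expand` is a hypothetical helper and not how `ParamGridBuilder` is implemented.

```python
from itertools import product

# Same candidate values as the Decision Tree grid above.
dt_grid = {"maxDepth": [2, 5, 10], "maxBins": [10, 32, 45]}


def expand(grid: dict) -> list:
    """Return every combination of the candidate values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


dt_combos = expand(dt_grid)
print(len(dt_combos), "combinations")
for combo in dt_combos:
    print(combo)
```

Three values for each of the two parameters give nine candidate models for cross-validation to compare.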

In [68]:
evaluator = MulticlassClassificationEvaluator()

In [69]:
dtc_cv = CrossValidator(
    estimator=dtc,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    numFolds=3,
    seed=101
)

# estimator: the estimator to tune, for example RandomForestClassifier (rfc) or DecisionTreeClassifier (dtc).
# estimatorParamMaps: the parameter combinations to try during cross-validation.
# evaluator: the object responsible for scoring each model.
# numFolds: number of folds, that is, how many parts the training set is split into; in each iteration one part is held out for validation and the rest is used for training.
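
The role of `numFolds` can be sketched in plain Python. This is an illustration only: `k_fold_indices` is a hypothetical helper that assigns rows to folds by striding, whereas Spark assigns rows to folds randomly (controlled by `seed`).

```python
def k_fold_indices(n_rows: int, num_folds: int):
    """Yield one (train_indices, validation_indices) pair per fold."""
    indices = list(range(n_rows))
    for fold in range(num_folds):
        # Every num_folds-th row, starting at `fold`, is held out for validation.
        validation = indices[fold::num_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


# Toy example: 9 rows split into 3 folds.
for train_idx, val_idx in k_fold_indices(n_rows=9, num_folds=3):
    print("train:", train_idx, "validation:", val_idx)
```

Each row is used for validation exactly once and for training in the other two iterations, which is why the score of each parameter combination is an average over three fits.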

In [70]:
model_dtc_cv = dtc_cv.fit(train)

In [71]:
prediction_dtc_cv_test = model_dtc_cv.transform(test)

### Summary Table - Decision Tree with Cross Validation

In [72]:
print('Decision Tree Classifier - Tuning')
print("="*40)
print("Test Data")
print("="*40)
print("Confusion Matrix")
print("-"*40)
calculate_show_confusion_matrix(prediction_dtc_cv_test, normalize=False)
print("-"*40)
print("Metrics")
print("-"*40)
print("Accuracy: %f" % evaluator.evaluate(prediction_dtc_cv_test, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(prediction_dtc_cv_test, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(prediction_dtc_cv_test, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(prediction_dtc_cv_test, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))

Decision Tree Classifier - Tuning
Test Data
Confusion Matrix
----------------------------------------
                     Predicted
                Churn       Non-Churn
     Churn        1319         244
Real
     Non-Churn    430         1149
----------------------------------------
Metrics
----------------------------------------
Accuracy: 0.785487
Precision: 0.754145
Recall: 0.843890
F1: 0.796498


**Test Data**

Among the prediction errors in the Confusion Matrix, the model predicted that 244 customers would not cancel the service when in fact they canceled (false negatives). It also predicted that 430 customers would cancel the service when in fact they did not (false positives).

Decision tree:
* Non-Churn -> 382
* Churn -> 336

Decision Tree with Cross Validation:
* Non-Churn -> 244
* Churn -> 430

Looking at the metrics, accuracy, recall and F1 all improved; only precision dropped, by about 2 percentage points (from 0.779 to 0.754). Overall, the current results are much more promising than those of the Decision Tree without Cross Validation.
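
The size of each change can be computed directly from the two sets of test metrics. The sketch below is plain Python with the values copied from the Decision Tree test outputs in this notebook (without and with Cross Validation); the variable names are illustrative.

```python
# Test metrics of the Decision Tree without Cross Validation.
baseline = {"accuracy": 0.771483, "precision": 0.778510,
            "recall": 0.755598, "f1": 0.766883}
# Test metrics of the Decision Tree with Cross Validation.
tuned = {"accuracy": 0.785487, "precision": 0.754145,
         "recall": 0.843890, "f1": 0.796498}

# Difference in percentage points (positive means the tuned model is better).
delta_pp = {name: (tuned[name] - baseline[name]) * 100 for name in baseline}
for name, delta in delta_pp.items():
    print(f"{name}: {delta:+.2f} pp")
```

Recall shows by far the largest gain, which matches the drop in false negatives, while the loss in precision reflects the additional false positives.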


### Random Forest with Cross Validation

In [73]:
rfc = RandomForestClassifier(seed=101)

In [74]:
grid = ParamGridBuilder()\
        .addGrid(rfc.maxDepth, [2, 5, 10])\
        .addGrid(rfc.maxBins, [10, 32, 45])\
        .addGrid(rfc.numTrees, [10, 20, 50])\
        .build()

In [75]:
evaluator = MulticlassClassificationEvaluator()

In [76]:
rfc_cv = CrossValidator(
    estimator=rfc,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    numFolds=3,
    seed=101
)

In [77]:
model_rfc_cv = rfc_cv.fit(train)

In [78]:
prediction_rfc_cv_test = model_rfc_cv.transform(test)
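
The Random Forest search is considerably more expensive than the Decision Tree one, and a quick count shows why. The sketch below is plain Python using the grid values from In [74] and `numFolds=3` from In [76]. The final "+ 1" assumes that `CrossValidator` refits the best combination once on the full training set, which is my understanding of its behavior.

```python
from math import prod

# Candidate values from the Random Forest grid.
rf_grid = {
    "maxDepth": [2, 5, 10],
    "maxBins": [10, 32, 45],
    "numTrees": [10, 20, 50],
}
num_folds = 3

combinations = prod(len(values) for values in rf_grid.values())
cv_fits = combinations * num_folds   # one fit per combination per fold
total_fits = cv_fits + 1             # assumed final refit of the best combination

print(f"{combinations} combinations, {cv_fits} cross-validation fits, {total_fits} fits in total")
```

Adding `numTrees` to the grid triples the number of combinations compared with the Decision Tree search, and each individual fit also trains up to 50 trees.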

### Summary Table - Random Forest with Cross Validation

In [79]:
print('Random Forest Classifier - Tuning')
print("="*40)
print("Test Data")
print("="*40)
print("Confusion Matrix")
print("-"*40)
calculate_show_confusion_matrix(prediction_rfc_cv_test, normalize=False)
print("-"*40)
print("Metrics")
print("-"*40)
print("Accuracy: %f" % evaluator.evaluate(prediction_rfc_cv_test, {evaluator.metricName: "accuracy"}))
print("Precision: %f" % evaluator.evaluate(prediction_rfc_cv_test, {evaluator.metricName: "precisionByLabel", evaluator.metricLabel: 1}))
print("Recall: %f" % evaluator.evaluate(prediction_rfc_cv_test, {evaluator.metricName: "recallByLabel", evaluator.metricLabel: 1}))
print("F1: %f" % evaluator.evaluate(prediction_rfc_cv_test, {evaluator.metricName: "fMeasureByLabel", evaluator.metricLabel: 1}))

Random Forest Classifier - Tuning
Test Data
Confusion Matrix
----------------------------------------
                     Predicted
                Churn       Non-Churn
     Churn        1333         230
Real
     Non-Churn    337         1242
----------------------------------------
Metrics
----------------------------------------
Accuracy: 0.819542
Precision: 0.798204
Recall: 0.852847
F1: 0.824621


**Test Data**

Among the prediction errors in the Confusion Matrix, the model predicted that 230 customers would not cancel the service when in fact they canceled (false negatives). It also predicted that 337 customers would cancel the service when in fact they did not (false positives).

Random Forest:
* Non-Churn -> 306
* Churn -> 416

Random Forest with Cross Validation:
* Non-Churn -> 230
* Churn -> 337

Looking at the metrics, all of them improved and now sit at roughly 80% or above. The current results are therefore much more promising than the previous Random Forest results without Cross Validation.
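
To compare the four tree-based models side by side, the test metrics reported in this notebook can be collected in one structure and ranked. This is a plain-Python sketch; ranking by F1 is an illustrative choice, not something the notebook prescribes.

```python
# Test metrics copied from the summary tables of each model.
test_metrics = {
    "Decision Tree": {"accuracy": 0.771483, "precision": 0.778510,
                      "recall": 0.755598, "f1": 0.766883},
    "Random Forest": {"accuracy": 0.770210, "precision": 0.751345,
                      "recall": 0.804223, "f1": 0.776885},
    "Decision Tree + CV": {"accuracy": 0.785487, "precision": 0.754145,
                           "recall": 0.843890, "f1": 0.796498},
    "Random Forest + CV": {"accuracy": 0.819542, "precision": 0.798204,
                           "recall": 0.852847, "f1": 0.824621},
}

# Rank the models from best to worst F1.
ranking = sorted(test_metrics, key=lambda name: test_metrics[name]["f1"], reverse=True)
best = ranking[0]

# Check whether the best model also wins on every individual metric.
wins_everywhere = all(
    test_metrics[best][metric] == max(m[metric] for m in test_metrics.values())
    for metric in test_metrics[best]
)
print(ranking)
print(best, "is best on every metric:", wins_everywhere)
```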

# Predicting Results

Finally, I used the Random Forest model tuned with Cross Validation: given a customer's characteristics, it predicts whether the customer will cancel the service or not.

In [80]:
best_model_rfc_cv = model_rfc_cv.bestModel

In [81]:
print(best_model_rfc_cv.getMaxDepth())
print(best_model_rfc_cv.getMaxBins())
# On the fitted model, getNumTrees is a property, so it is read without parentheses.
print(best_model_rfc_cv.getNumTrees)

10
45
50


In [82]:
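# Note: the best cross-validated model above reports numTrees=50, but this cell trains with numTrees=10;
# the outputs recorded below were produced with this setting.
rfc_tunning = RandomForestClassifier(maxDepth=10, maxBins=45, numTrees=10, seed=101)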

In [83]:
model_rfc_tunning = rfc_tunning.fit(dataset_prep)

In [84]:
X

['Mais65anos',
 'MesesDeContrato',
 'MesesCobrados',
 'Conjuge',
 'Dependentes',
 'TelefoneFixo',
 'MaisDeUmaLinhaTelefonica',
 'SegurancaOnline',
 'BackupOnline',
 'SeguroDispositivo',
 'SuporteTecnico',
 'TVaCabo',
 'StreamingFilmes',
 'ContaCorreio',
 'Internet_DSL',
 'Internet_FibraOptica',
 'Internet_Nao',
 'TipoContrato_Mensalmente',
 'TipoContrato_UmAno',
 'TipoContrato_DoisAnos',
 'MetodoPagamento_DebitoEmConta',
 'MetodoPagamento_CartaoCredito',
 'MetodoPagamento_BoletoEletronico',
 'MetodoPagamento_Boleto']

In [85]:
new_customer = [{
    'Mais65anos': 0,
    'MesesDeContrato': 1,
    'MesesCobrados': 45.30540797610398,
    'Conjuge': 0,
    'Dependentes': 0,
    'TelefoneFixo': 0,
    'MaisDeUmaLinhaTelefonica': 0,
    'SegurancaOnline': 0,
    'BackupOnline': 0,
    'SeguroDispositivo': 0,
    'SuporteTecnico': 0,
    'TVaCabo': 1,
    'StreamingFilmes': 1,
    'ContaCorreio': 1,
    'Internet_DSL': 1,
    'Internet_FibraOptica': 0,
    'Internet_Nao': 0,
    'TipoContrato_Mensalmente': 1,
    'TipoContrato_UmAno': 0,
    'TipoContrato_DoisAnos': 0,
    'MetodoPagamento_DebitoEmConta': 0,
    'MetodoPagamento_CartaoCredito': 0,
    'MetodoPagamento_BoletoEletronico': 1,
    'MetodoPagamento_Boleto': 0
}]

In [86]:
new_customer = spark.createDataFrame(new_customer)
new_customer.show()

+------------+-------+------------+-----------+------------+--------------------+------------+----------+------------------------+-----------------+---------------+----------------------+--------------------------------+-----------------------------+-----------------------------+---------------+-----------------+---------------+--------------+-------+------------+---------------------+------------------------+------------------+
|BackupOnline|Conjuge|ContaCorreio|Dependentes|Internet_DSL|Internet_FibraOptica|Internet_Nao|Mais65anos|MaisDeUmaLinhaTelefonica|    MesesCobrados|MesesDeContrato|MetodoPagamento_Boleto|MetodoPagamento_BoletoEletronico|MetodoPagamento_CartaoCredito|MetodoPagamento_DebitoEmConta|SegurancaOnline|SeguroDispositivo|StreamingFilmes|SuporteTecnico|TVaCabo|TelefoneFixo|TipoContrato_DoisAnos|TipoContrato_Mensalmente|TipoContrato_UmAno|
+------------+-------+------------+-----------+------------+--------------------+------------+----------+------------------------+----

In [87]:
assembler = VectorAssembler(inputCols = X, outputCol = 'features')

In [88]:
new_customer_prep = assembler.transform(new_customer).select('features')

In [89]:
new_customer_prep.show(truncate=False)

+----------------------------------------------------------------------------+
|features                                                                    |
+----------------------------------------------------------------------------+
|(24,[1,2,11,12,13,14,17,22],[1.0,45.30540797610398,1.0,1.0,1.0,1.0,1.0,1.0])|
+----------------------------------------------------------------------------+
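
The `features` column is shown in sparse notation: the vector size, then the indices of the non-zero entries, then their values. The plain-Python sketch below expands that notation and maps each index back to its column name, using the `X` list from In [84]; it does not use Spark's vector classes.

```python
# Column order used by the VectorAssembler (the list X above).
feature_names = [
    "Mais65anos", "MesesDeContrato", "MesesCobrados", "Conjuge", "Dependentes",
    "TelefoneFixo", "MaisDeUmaLinhaTelefonica", "SegurancaOnline", "BackupOnline",
    "SeguroDispositivo", "SuporteTecnico", "TVaCabo", "StreamingFilmes",
    "ContaCorreio", "Internet_DSL", "Internet_FibraOptica", "Internet_Nao",
    "TipoContrato_Mensalmente", "TipoContrato_UmAno", "TipoContrato_DoisAnos",
    "MetodoPagamento_DebitoEmConta", "MetodoPagamento_CartaoCredito",
    "MetodoPagamento_BoletoEletronico", "MetodoPagamento_Boleto",
]

# The three parts of the sparse vector printed above.
size = 24
indices = [1, 2, 11, 12, 13, 14, 17, 22]
values = [1.0, 45.30540797610398, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Dense form: every position not listed in `indices` is zero.
dense = [0.0] * size
for index, value in zip(indices, values):
    dense[index] = value

# Name of each non-zero feature and its value.
non_zero = {feature_names[index]: value for index, value in zip(indices, values)}
for name, value in non_zero.items():
    print(f"{name}: {value}")
```

The non-zero entries are exactly the fields set to a non-zero value in the `new_customer` dictionary, which confirms that the assembler kept the column order of `X`.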



In [90]:
model_rfc_tunning.transform(new_customer_prep).show()

+--------------------+--------------------+--------------------+----------+
|            features|       rawPrediction|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|(24,[1,2,11,12,13...|[1.47984630190316...|[0.14798463019031...|       1.0|
+--------------------+--------------------+--------------------+----------+



Since "0" means Non-Churn and "1" means Churn, the model predicts, for the specified characteristics, that this customer is likely to cancel the service: the probability it assigns to Non-Churn is only about 0.15.
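
The relationship between the three output columns can be sketched in plain Python. As far as I understand Spark's Random Forest, `rawPrediction` accumulates the class votes of the individual trees, `probability` normalizes them to sum to 1, and `prediction` is the class with the highest probability. The output above is truncated, so the second component below is an assumption: it is taken to be the number of trees (10, as set in In [82]) minus the first component, and the first component is rounded.

```python
num_trees = 10                     # numTrees used when fitting rfc_tunning
non_churn_votes = 1.4798463        # first rawPrediction component (rounded)
churn_votes = num_trees - non_churn_votes   # assumed: not visible in the truncated output

raw_prediction = [non_churn_votes, churn_votes]

# Normalize the votes so that they sum to 1.
total = sum(raw_prediction)
probability = [votes / total for votes in raw_prediction]

# The predicted class is the index with the highest probability (0 = Non-Churn, 1 = Churn).
prediction = float(probability.index(max(probability)))

print("probability:", [round(p, 4) for p in probability])
print("prediction:", prediction)
```

Under these assumptions the model assigns roughly 85% probability to Churn, which is why the `prediction` column shows 1.0.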