# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:120%;text-align:center;border-radius:5px 5px;">BINARY CLASSIFICATION WITH PYTHON AND APACHE SPARK</p>

### Breast Cancer Wisconsin: Breast Cancer Prediction  [dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data)

Goal: build a predictive model that, given the measurements from a patient's exams, classifies the tumor as malignant or benign, based on what the algorithm learns during training.

# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">TABLE OF CONTENTS</p>
 
* [1. APPROACH](#1)

* [2. LIBRARY IMPORTS](#2)

* [3. SPARK CONTEXT, SESSION, AND DATA LOADING](#3)

* [4. PREPROCESSING](#4)

 - Label Encoding (StringIndexer)
 - Vector Assembler
 - Standard Scaler (Standardization)
 - Train/test split

* [5. MODELS](#5)
 - Logistic Regression (accuracy and confusion matrix)
 - Decision Tree Classifier (accuracy and confusion matrix)
 - Random Forest Classifier (accuracy and confusion matrix)
* [6. PIPELINE: WORKFLOW (IN PROGRESS)](#6)

<a id="1"></a>
# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">1. APPROACH</p>


As in the notebook [Binary Classification with Python](https://www.kaggle.com/code/engvictorfarias/classifica-o-breast-cancer-winsconsin-python-ml), several classification models (supervised learning) are built with different preprocessing techniques and settings in order to find the one with the best accuracy. This time, Apache Spark does the processing.

From historical patient data, the algorithm learns to predict whether the tumor of a new patient is benign or malignant. This is a binary classifier.

<a id="2"></a>
# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">2. LIBRARY IMPORTS</p>

In [1]:
# Import findspark and initialize it

import findspark
findspark.init()

In [56]:
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.sql import Row
from pyspark.sql.functions import col, count, isnan, when  # explicit names avoid shadowing Python builtins
from pyspark.ml.feature import PCA
from pyspark.ml.feature import StringIndexer # Label Encoding
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

<a id="3"></a>
# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">3. SPARK CONTEXT, SESSION, AND DATA LOADING</p>

In [3]:
sc = SparkContext(appName = 'ModeloClassificacao')

In [4]:
# Spark Session - used to work with DataFrames
# https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html

spSession = SparkSession.builder.master('local').getOrCreate()

### Loading the Data

In [5]:
# Spark DataFrame

SparkDF = spSession.read.csv('C:/Users/vmfb9/Downloads/dataset-winsconsin/data.csv', header = True, inferSchema = True) 

In [6]:
# Type

type(SparkDF) # pyspark.sql.dataframe.DataFrame

pyspark.sql.dataframe.DataFrame

In [7]:
SparkDF.printSchema()

root
 |-- id: integer (nullable = true)
 |-- diagnosis: string (nullable = true)
 |-- radius_mean: double (nullable = true)
 |-- texture_mean: double (nullable = true)
 |-- perimeter_mean: double (nullable = true)
 |-- area_mean: double (nullable = true)
 |-- smoothness_mean: double (nullable = true)
 |-- compactness_mean: double (nullable = true)
 |-- concavity_mean: double (nullable = true)
 |-- concave points_mean: double (nullable = true)
 |-- symmetry_mean: double (nullable = true)
 |-- fractal_dimension_mean: double (nullable = true)
 |-- radius_se: double (nullable = true)
 |-- texture_se: double (nullable = true)
 |-- perimeter_se: double (nullable = true)
 |-- area_se: double (nullable = true)
 |-- smoothness_se: double (nullable = true)
 |-- compactness_se: double (nullable = true)
 |-- concavity_se: double (nullable = true)
 |-- concave points_se: double (nullable = true)
 |-- symmetry_se: double (nullable = true)
 |-- fractal_dimension_se: double (nullable = true)
 |-- radi

In [8]:
# Viewing the data in pandas format

SparkDF.limit(5).toPandas()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,_c32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


<a id="4"></a>
# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">4. PREPROCESSING</p>

### Dropping the id and _c32 columns

In [9]:
SparkDF = SparkDF.drop('id', '_c32')  # names as they appear in the schema

In [10]:
SparkDF.limit(5).toPandas()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Checking Dataset Balance

In [11]:
SparkDF.groupby('diagnosis').count().show()

+---------+-----+
|diagnosis|count|
+---------+-----+
|        B|  357|
|        M|  212|
+---------+-----+
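
The counts above give the class proportions and the accuracy that a trivial classifier would reach by always predicting the majority class, which is a useful floor for judging the models in section 5. A minimal sketch in plain Python, using only the two counts from the output (the variable names are illustrative):

```python
# Class counts copied from the groupby output above
counts = {"B": 357, "M": 212}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a baseline that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                         # 569
print(majority_label)                # B
print(round(baseline_accuracy, 4))   # 0.6274
```

The dataset is moderately imbalanced: roughly 63% benign and 37% malignant.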



### Checking for Missing Data

In [12]:
# Checking for missing values

SparkDF.select([count(when(isnan(coluna) | col(coluna).isNull(), coluna)).alias(coluna) for coluna in SparkDF.columns]).show(truncate=False, vertical=True)

-RECORD 0----------------------
 diagnosis               | 0   
 radius_mean             | 0   
 texture_mean            | 0   
 perimeter_mean          | 0   
 area_mean               | 0   
 smoothness_mean         | 0   
 compactness_mean        | 0   
 concavity_mean          | 0   
 concave points_mean     | 0   
 symmetry_mean           | 0   
 fractal_dimension_mean  | 0   
 radius_se               | 0   
 texture_se              | 0   
 perimeter_se            | 0   
 area_se                 | 0   
 smoothness_se           | 0   
 compactness_se          | 0   
 concavity_se            | 0   
 concave points_se       | 0   
 symmetry_se             | 0   
 fractal_dimension_se    | 0   
 radius_worst            | 0   
 texture_worst           | 0   
 perimeter_worst         | 0   
 area_worst              | 0   
 smoothness_worst        | 0   
 compactness_worst       | 0   
 concavity_worst         | 0   
 concave points_worst    | 0   
 symmetry_worst          | 0   
 fractal

### Statistical Summary: describe

In [13]:
SparkDF.describe(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean']).show()

+-------+------------------+-----------------+-----------------+-----------------+--------------------+
|summary|       radius_mean|     texture_mean|   perimeter_mean|        area_mean|     smoothness_mean|
+-------+------------------+-----------------+-----------------+-----------------+--------------------+
|  count|               569|              569|              569|              569|                 569|
|   mean|14.127291739894563|19.28964850615117|91.96903339191566|654.8891036906857|   0.096360281195079|
| stddev|3.5240488262120793|4.301035768166948| 24.2989810387549|351.9141291816529|0.014064128137673616|
|    min|             6.981|             9.71|            43.79|            143.5|             0.05263|
|    max|             28.11|            39.28|            188.5|           2501.0|              0.1634|
+-------+------------------+-----------------+-----------------+-----------------+--------------------+



### Label Encoding: StringIndexer

Label encoding with StringIndexer: creating an index for the 'diagnosis' column.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html#pyspark.ml.feature.StringIndexer

From the Spark documentation: "A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType. Its default value is 'frequencyDesc'."

In this dataset the more frequent class, B (357 rows), receives index 0.0 and M (212 rows) receives index 1.0.
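
The frequency-based ordering can be emulated in a few lines of plain Python. This is only an illustration of the `frequencyDesc` rule, not Spark's implementation; the helper name `frequency_desc_index` is hypothetical, and the label list is rebuilt from the class counts of this dataset.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# Rebuild the label column from the class counts (357 benign, 212 malignant)
labels = ["B"] * 357 + ["M"] * 212
mapping = frequency_desc_index(labels)
print(mapping)  # {'B': 0.0, 'M': 1.0}
```

So `diag = 1.0` means malignant, which is the convention used when reading the confusion matrices later on.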

In [14]:
stringIndexer = StringIndexer(inputCol = 'diagnosis', outputCol = 'diag')

In [15]:
# Fitting the StringIndexer

si_model = stringIndexer.fit(SparkDF)

In [16]:
# Applying the StringIndexer

df_si = si_model.transform(SparkDF)

In [17]:
df_si.limit(5).toPandas()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diag
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1.0
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1.0
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1.0
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1.0
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1.0


### Correlation with the Target Variable


In [18]:
# Correlation between each numeric variable and the target
for i in df_si.columns:
    if not(isinstance(df_si.select(i).take(1)[0][0], str)) :
        print( "Correlation of DIAGNOSIS with:", i, df_si.stat.corr('diag',i))

Correlation of DIAGNOSIS with: radius_mean 0.7300285113754562
Correlation of DIAGNOSIS with: texture_mean 0.4151852998452047
Correlation of DIAGNOSIS with: perimeter_mean 0.7426355297258332
Correlation of DIAGNOSIS with: area_mean 0.7089838365853901
Correlation of DIAGNOSIS with: smoothness_mean 0.3585599650859332
Correlation of DIAGNOSIS with: compactness_mean 0.5965336775082527
Correlation of DIAGNOSIS with: concavity_mean 0.6963597071719051
Correlation of DIAGNOSIS with: concave points_mean 0.776613840020437
Correlation of DIAGNOSIS with: symmetry_mean 0.3304985542625468
Correlation of DIAGNOSIS with: fractal_dimension_mean -0.012837602698431884
Correlation of DIAGNOSIS with: radius_se 0.5671338208247174
Correlation of DIAGNOSIS with: texture_se -0.008303332973877036
Correlation of DIAGNOSIS with: perimeter_se 0.556140703431483
Correlation of DIAGNOSIS with: area_se 0.548235940278024
Co
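
`df_si.stat.corr` computes the Pearson correlation coefficient. Because `diag` only takes the values 0.0 and 1.0, each number above is a point-biserial correlation: a positive value means the feature tends to be larger for malignant tumors (`diag = 1.0`). For a feature $x$ and the label $y$ over $n$ rows:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Values close to zero, such as those of `fractal_dimension_mean` and `texture_se`, indicate almost no linear relationship with the diagnosis.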

#### Feature Selection

In [19]:
feat_cols = df_si.columns[1:-1]

In [20]:
# Set up the VectorAssembler

Assemb = VectorAssembler(inputCols = feat_cols, 
                        outputCol = 'features')

In [21]:
Assembled = Assemb.transform(df_si)

In [22]:
Assembled.show(1, vertical = True)

-RECORD 0---------------------------------------
 diagnosis               | M                    
 radius_mean             | 17.99                
 texture_mean            | 10.38                
 perimeter_mean          | 122.8                
 area_mean               | 1001.0               
 smoothness_mean         | 0.1184               
 compactness_mean        | 0.2776               
 concavity_mean          | 0.3001               
 concave points_mean     | 0.1471               
 symmetry_mean           | 0.2419               
 fractal_dimension_mean  | 0.07871              
 radius_se               | 1.095                
 texture_se              | 0.9053               
 perimeter_se            | 8.589                
 area_se                 | 153.4                
 smoothness_se           | 0.006399             
 compactness_se          | 0.04904              
 concavity_se            | 0.05373              
 concave points_se       | 0.01587              
 symmetry_se        
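
The slice `df_si.columns[1:-1]` keeps every column except the first (`diagnosis`) and the last (`diag`), and the VectorAssembler then packs those columns, in order, into a single vector. A plain-Python sketch of the same idea, using the first preview row restricted to three of the thirty feature columns (the dictionary is an illustrative stand-in for a DataFrame row):

```python
# First preview row, cut down to three feature columns for brevity
row = {
    "diagnosis": "M",
    "radius_mean": 17.99,
    "texture_mean": 10.38,
    "perimeter_mean": 122.8,
    "diag": 1.0,
}

columns = list(row.keys())
feat_cols = columns[1:-1]               # same slicing as df_si.columns[1:-1]
features = [row[c] for c in feat_cols]  # values in column order

print(feat_cols)  # ['radius_mean', 'texture_mean', 'perimeter_mean']
print(features)   # [17.99, 10.38, 122.8]
```

The slice depends on column position, so it only works while `diagnosis` stays first and `diag` stays last.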

### Standardization

In [23]:
# Set up the scaler
std = StandardScaler(inputCol = 'features', outputCol = 'standardized')

# Fit the scaler
scale = std.fit(dataset = Assembled)

# DataFrame with the standardized data
df = scale.transform(Assembled)

In [24]:
df.select('diag','features').show(5)

+----+--------------------+
|diag|            features|
+----+--------------------+
| 1.0|[17.99,10.38,122....|
| 1.0|[20.57,17.77,132....|
| 1.0|[19.69,21.25,130....|
| 1.0|[11.42,20.38,77.5...|
| 1.0|[20.29,14.34,135....|
+----+--------------------+
only showing top 5 rows
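
With its default parameters (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation without subtracting the mean. The sketch below shows both variants in plain Python on the five `radius_mean` values from the preview. It is an illustration only: the fitted scaler uses statistics from all 569 rows, so the real scaled values differ.

```python
import statistics

# radius_mean of the five preview rows shown earlier (illustration only)
radius_mean = [17.99, 20.57, 19.69, 11.42, 20.29]

# Corrected (n - 1) sample standard deviation
std = statistics.stdev(radius_mean)

# Default behaviour: divide by the standard deviation, keep the mean
scaled = [x / std for x in radius_mean]

# With centering enabled as well (withMean=True in StandardScaler)
mean = statistics.mean(radius_mean)
centered_scaled = [(x - mean) / std for x in radius_mean]

print(round(statistics.stdev(scaled), 6))            # 1.0
print(abs(statistics.mean(centered_scaled)) < 1e-9)  # True
```

Note that the scaled vectors are written to the `standardized` column, while the models in section 5 are configured with `featuresCol = 'features'` and therefore train on the unscaled values.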



### Train/Test Split

In [25]:
dados_treino, dados_teste = df.randomSplit([0.7,0.3])

In [26]:
dados_treino.count()

394

In [27]:
dados_teste.count()

175
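
The weights passed to `randomSplit` are expected proportions, not exact ones, and without a `seed` argument the partition changes on every run (394/175 here, 414/155 in section 6). The sketch below is a rough plain-Python emulation of a per-row random assignment, meant only to show why sizes fluctuate and why a fixed seed makes the split repeatable; `bernoulli_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def bernoulli_split(n_rows, train_weight, seed):
    """Assign each row index to train or test independently at random."""
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n_rows):
        (train if rng.random() < train_weight else test).append(i)
    return train, test

train_a, test_a = bernoulli_split(569, 0.7, seed=42)
train_b, test_b = bernoulli_split(569, 0.7, seed=42)
train_c, test_c = bernoulli_split(569, 0.7, seed=7)

print(len(train_a) + len(test_a))   # 569
print(train_a == train_b)           # True: same seed, same split
print(len(train_a), len(train_c))   # sizes vary around 0.7 * 569
```

In the notebook itself, the equivalent step would be to pass a seed, for example `df.randomSplit([0.7, 0.3], seed = 42)`, so that the reported metrics can be reproduced.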

<a id="5"></a>
# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">5. MODELOS</p>


### Building the First Model: V1
Logistic Regression [Link](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html#pyspark.ml.classification.LogisticRegression)

In [28]:
LR_Classifier = LogisticRegression(featuresCol = 'features',labelCol = 'diag')

In [29]:
lrc = LR_Classifier.fit(dados_treino)

#### Predictions on Test Data

In [30]:
predictions_lrc = lrc.transform(dados_teste)

In [31]:
predictions_lrc.show(1, vertical = True)

-RECORD 0---------------------------------------
 diagnosis               | B                    
 radius_mean             | 7.691                
 texture_mean            | 25.44                
 perimeter_mean          | 48.34                
 area_mean               | 170.4                
 smoothness_mean         | 0.08668              
 compactness_mean        | 0.1199               
 concavity_mean          | 0.09252              
 concave points_mean     | 0.01364              
 symmetry_mean           | 0.2037               
 fractal_dimension_mean  | 0.07751              
 radius_se               | 0.2196               
 texture_se              | 1.479                
 perimeter_se            | 1.445                
 area_se                 | 11.73                
 smoothness_se           | 0.01547              
 compactness_se          | 0.06457              
 concavity_se            | 0.09252              
 concave points_se       | 0.01364              
 symmetry_se        

In [32]:
predictions_lrc.select('diag','features','rawPrediction','prediction','probability').toPandas().head(5)

Unnamed: 0,diag,features,rawPrediction,prediction,probability
0,0.0,"[7.691, 25.44, 48.34, 170.4, 0.08668, 0.1199, ...","[1244.1925386735306, -1244.1925386735306]",0.0,"[1.0, 0.0]"
1,0.0,"[8.671, 14.45, 54.42, 227.2, 0.09138, 0.04276,...","[1870.1735935298636, -1870.1735935298636]",0.0,"[1.0, 0.0]"
2,0.0,"[9.042, 18.9, 60.07, 244.5, 0.09968, 0.1972, 0...","[1421.2520670213075, -1421.2520670213075]",0.0,"[1.0, 0.0]"
3,0.0,"[9.173, 13.86, 59.2, 260.9, 0.07721, 0.08751, ...","[1915.3408111862898, -1915.3408111862898]",0.0,"[1.0, 0.0]"
4,0.0,"[9.268, 12.87, 61.49, 248.7, 0.1634, 0.2239, 0...","[1894.8310544848177, -1894.8310544848177]",0.0,"[1.0, 0.0]"
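
For binary logistic regression, `rawPrediction` holds the pair `[-margin, margin]` and `probability` is obtained by applying the logistic function to it. The sketch below reproduces the first row of the table in plain Python; `stable_sigmoid` is a hypothetical helper written to avoid overflow, not a Spark function.

```python
import math

def stable_sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# rawPrediction of the first test row in the table above
raw_prediction = [1244.1925386735306, -1244.1925386735306]

p_class0 = stable_sigmoid(raw_prediction[0])
probability = [p_class0, 1.0 - p_class0]
prediction = float(probability.index(max(probability)))

print(probability)  # [1.0, 0.0]
print(prediction)   # 0.0
```

Margins with magnitudes above a thousand saturate the sigmoid to exactly 0 or 1 in floating point. One plausible reading is that the unregularized model fits the unscaled features with very large coefficients, so these probabilities should not be taken as calibrated confidence.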


#### Model Evaluation

In [33]:
evaluator_lrc = MulticlassClassificationEvaluator(predictionCol = 'prediction', labelCol = 'diag', metricName = 'accuracy')

In [34]:
evaluator_lrc.evaluate(predictions_lrc)

0.9485714285714286

#### Confusion Matrix

In [53]:
predictions_lrc.groupBy('diag', 'prediction').count().show()

+----+----------+-----+
|diag|prediction|count|
+----+----------+-----+
| 1.0|       1.0|   68|
| 0.0|       1.0|    3|
| 1.0|       0.0|    6|
| 0.0|       0.0|   98|
+----+----------+-----+



In [54]:
predictions_lrc.groupBy('diag', 'prediction').count().toPandas()

Unnamed: 0,diag,prediction,count
0,1.0,1.0,68
1,0.0,1.0,3
2,1.0,0.0,6
3,0.0,0.0,98
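
Accuracy alone does not show which kind of error the model makes, and here a malignant tumor predicted as benign (`diag = 1.0`, `prediction = 0.0`) is arguably the costlier mistake. The four counts above are enough to derive precision and recall for the malignant class. A minimal sketch in plain Python, with the counts copied from the table:

```python
# Counts from the logistic regression confusion matrix above
# (positive class = malignant, diag = 1.0)
tp = 68   # diag 1.0, prediction 1.0
fp = 3    # diag 0.0, prediction 1.0
fn = 6    # diag 1.0, prediction 0.0
tn = 98   # diag 0.0, prediction 0.0

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(total)                 # 175
print(round(accuracy, 4))    # 0.9486
print(round(precision, 4))   # 0.9577
print(round(recall, 4))      # 0.9189
print(round(f1, 4))          # 0.9379
```

The accuracy matches the evaluator output (0.9485714285714286), while the recall shows that 6 of the 74 malignant cases in the test set were missed.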


### Building the Second Model: V2
Decision Tree Classifier [Link](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html#pyspark.ml.classification.DecisionTreeClassifier)

In [35]:
dtc = DecisionTreeClassifier(featuresCol = 'features',labelCol = 'diag')

In [36]:
modelo = dtc.fit(dados_treino) # Training

In [37]:
modelo.depth

5

In [38]:
modelo.numNodes

29
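
A depth of 5 equals the default `maxDepth` of `DecisionTreeClassifier`, so the tree may have stopped growing because of that limit rather than because the data was exhausted. The node count also fixes the number of leaves, since every internal node of the tree has exactly two children. A small arithmetic sketch with the two values reported above:

```python
num_nodes = 29  # modelo.numNodes from the output above
depth = 5       # modelo.depth from the output above

# In a binary tree whose internal nodes all have two children:
# leaves = internal + 1 and nodes = leaves + internal
num_leaves = (num_nodes + 1) // 2
num_internal = num_nodes - num_leaves

# Upper bound on the node count for a complete tree of this depth
max_nodes = 2 ** (depth + 1) - 1

print(num_leaves, num_internal, max_nodes)  # 15 14 63
```

The tree uses 29 of the 63 nodes that a complete tree of depth 5 could hold, which means many branches ended before reaching the depth limit.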

In [39]:
predictions_dtc = modelo.transform(dados_teste)

In [40]:
predictions_dtc.show(1, vertical = True)

-RECORD 0---------------------------------------
 diagnosis               | B                    
 radius_mean             | 7.691                
 texture_mean            | 25.44                
 perimeter_mean          | 48.34                
 area_mean               | 170.4                
 smoothness_mean         | 0.08668              
 compactness_mean        | 0.1199               
 concavity_mean          | 0.09252              
 concave points_mean     | 0.01364              
 symmetry_mean           | 0.2037               
 fractal_dimension_mean  | 0.07751              
 radius_se               | 0.2196               
 texture_se              | 1.479                
 perimeter_se            | 1.445                
 area_se                 | 11.73                
 smoothness_se           | 0.01547              
 compactness_se          | 0.06457              
 concavity_se            | 0.09252              
 concave points_se       | 0.01364              
 symmetry_se        

In [41]:
predictions_dtc.select('diag','prediction','features','rawPrediction','probability').show(5)

+----+----------+--------------------+-------------+--------------------+
|diag|prediction|            features|rawPrediction|         probability|
+----+----------+--------------------+-------------+--------------------+
| 0.0|       0.0|[7.691,25.44,48.3...|  [220.0,2.0]|[0.99099099099099...|
| 0.0|       0.0|[8.671,14.45,54.4...|  [220.0,2.0]|[0.99099099099099...|
| 0.0|       0.0|[9.042,18.9,60.07...|  [220.0,2.0]|[0.99099099099099...|
| 0.0|       0.0|[9.173,13.86,59.2...|  [220.0,2.0]|[0.99099099099099...|
| 0.0|       0.0|[9.268,12.87,61.4...|  [220.0,2.0]|[0.99099099099099...|
+----+----------+--------------------+-------------+--------------------+
only showing top 5 rows
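
For the decision tree, `rawPrediction` appears to hold the number of training rows of each class that fell into the leaf reached by the test row, and `probability` is that vector normalized to sum to 1. The sketch below checks this against the first rows of the table in plain Python:

```python
# rawPrediction shared by the first five rows above:
# training rows of class 0.0 and class 1.0 in the leaf
raw_prediction = [220.0, 2.0]

leaf_total = sum(raw_prediction)
probability = [count / leaf_total for count in raw_prediction]
prediction = float(probability.index(max(probability)))

print(round(probability[0], 6))  # 0.990991
print(prediction)                # 0.0
```

All five rows share the same `rawPrediction` because they end up in the same leaf, so the tree can only output as many distinct probabilities as it has leaves.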



#### Model Evaluation

In [42]:
avaliador = MulticlassClassificationEvaluator(predictionCol = 'prediction', labelCol = 'diag', metricName = 'accuracy')
avaliador.evaluate(predictions_dtc)

0.92

#### Confusion Matrix

In [52]:
predictions_dtc.groupBy('diag', 'prediction').count().show()

+----+----------+-----+
|diag|prediction|count|
+----+----------+-----+
| 1.0|       1.0|   67|
| 0.0|       1.0|    7|
| 1.0|       0.0|    7|
| 0.0|       0.0|   94|
+----+----------+-----+



### Building the Third Model: V3
Random Forest Classifier [Link](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier)

In [43]:
rfClassifier = RandomForestClassifier(featuresCol = 'features', labelCol = 'diag')

In [44]:
# Training the model

modelo_rf = rfClassifier.fit(dados_treino)

In [45]:
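# Making predictions with the random forest (modelo_rf, not the V2 tree modelo);
# the outputs recorded below match V2 because they came from modelo.

previsoes_rf = modelo_rf.transform(dados_teste)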

In [51]:
previsoes_rf.show(1, vertical = True)

-RECORD 0---------------------------------------
 diagnosis               | B                    
 radius_mean             | 7.691                
 texture_mean            | 25.44                
 perimeter_mean          | 48.34                
 area_mean               | 170.4                
 smoothness_mean         | 0.08668              
 compactness_mean        | 0.1199               
 concavity_mean          | 0.09252              
 concave points_mean     | 0.01364              
 symmetry_mean           | 0.2037               
 fractal_dimension_mean  | 0.07751              
 radius_se               | 0.2196               
 texture_se              | 1.479                
 perimeter_se            | 1.445                
 area_se                 | 11.73                
 smoothness_se           | 0.01547              
 compactness_se          | 0.06457              
 concavity_se            | 0.09252              
 concave points_se       | 0.01364              
 symmetry_se        

In [47]:
### Evaluating the model

evaluator_rf = MulticlassClassificationEvaluator(predictionCol = 'prediction', labelCol = 'diag', metricName = 'accuracy')

In [48]:
evaluator_rf.evaluate(previsoes_rf)

0.92

#### Confusion Matrix

In [50]:
previsoes_rf.groupBy('diag', 'prediction').count().show()

+----+----------+-----+
|diag|prediction|count|
+----+----------+-----+
| 1.0|       1.0|   67|
| 0.0|       1.0|    7|
| 1.0|       0.0|    7|
| 0.0|       0.0|   94|
+----+----------+-----+



<a id="6"></a>
# <p style="background-color:#714e86;font-family:Segoe UI Semibold;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">6. PIPELINE: WORKFLOW (IN PROGRESS)</p>

In [58]:
treino_pipe, teste_pipe = SparkDF.randomSplit([0.7,0.3])
treino_pipe.count(), teste_pipe.count()

(414, 155)

In [None]:
# Label encoding (target variable), VectorAssembler, StandardScaler, and a classifier.
# A classifier must be the last stage for the pipeline to output predictions;
# LR_Classifier from section 5 is used here as one option.

pipeline = Pipeline(stages = [stringIndexer, Assemb, std, LR_Classifier])

# Training with the Pipeline

modelo_pipeline = pipeline.fit(treino_pipe)

# Predictions on the test data

previsoes_pipe = modelo_pipeline.transform(teste_pipe)

previsoes_pipe.select('diag', 'probability', 'prediction').show(5, False)

In [None]:
pipe_evaluator = MulticlassClassificationEvaluator(predictionCol = 'prediction', labelCol = 'diag', metricName = 'accuracy')

pipe_evaluator.evaluate(previsoes_pipe)

# Summarizing the results: Confusion Matrix

previsoes_pipe.groupBy('diag','prediction').count().show()