<a href="https://colab.research.google.com/github/cruz-marco/pyspark_course/blob/main/pyspark_MachineLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [28]:
#@title
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
!tar xf spark-3.2.3-bin-hadoop3.2.tgz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" 
os.environ["SPARK_HOME"] = '/content/spark-3.2.3-bin-hadoop3.2'

!pip install -q findspark

import findspark
findspark.init()
findspark.find()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Machine Learning with Spark

- PySpark ships libraries for training and evaluating machine learning models;

- The independent variables (features) must all be combined into a single vector column;

## Regression Models

> Importing the libraries used:

In [29]:
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as f

> Creating the DataFrame carros. The goal of this mini-project is to predict the car's horsepower (the HP column).

In [30]:
carros = spark.read.load(('/content/drive/MyDrive/Datasets/pyspark_course/'
                              'Carros.csv'), format='csv', sep=';', header=True,
                              inferSchema=True)

carros.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110|
|    187|        8|        360|            315| 344| 1702|        0|          0|      3|          2|175|
|    181|        6|        225|            276| 346| 2022|        1|          0|      3|          1|105|
|    143|        8|        360|            321| 357| 15

> Assembling the feature columns (every column except HP) into a single vector column.

In [31]:
vect_feats = VectorAssembler(inputCols=carros.columns[:-1], outputCol='vected_feats')

In [32]:
carros = vect_feats.transform(carros)
carros.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|        vected_feats|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|[21.0,6.0,160.0,3...|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|[21.0,6.0,160.0,3...|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93|[228.0,4.0,108.0,...|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110|[214.0,6.0,258.0,...|
|    187|        8|        360|            315| 344| 1702|        0|          0|      3|          2|175|[187.0,8.0,360

> Splitting the data into training and test sets.

In [33]:
carrosTreino, carrosTeste = carros.randomSplit([0.7, 0.3])

display(
    carrosTreino.count(),
    carrosTeste.count()
)

21

11
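
The counts above (21 and 11) are not exactly 70% and 30% of the 32 rows, and they change between runs because no seed is passed. A minimal pure-Python sketch of why, assuming each row is assigned to a split independently at random; `random_split` is a hypothetical helper written for illustration, not Spark's implementation:

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability
    proportional to the corresponding weight."""
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. roughly [0.7, 1.0] for [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall in bounds
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(32))  # 32 rows, matching 21 + 11 above
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # sizes are only approximately 70/30

# Repeating the split with the same seed reproduces it exactly
train2, test2 = random_split(rows, [0.7, 0.3], seed=42)
print(train == train2 and test == test2)
```

In PySpark, `randomSplit` takes an optional `seed` argument, for example `carros.randomSplit([0.7, 0.3], seed=42)`, which makes the split easier to reproduce.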

> Instantiating, training, and evaluating the classic linear regression algorithm.

In [34]:
reglin = LinearRegression(featuresCol='vected_feats', labelCol='HP')
model_RL = reglin.fit(carrosTreino)

In [35]:
carrosPred = model_RL.transform(carrosTeste)
carrosPred.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+------------------+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|        vected_feats|        prediction|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+------------------+
|    143|        8|        360|            321| 357| 1584|        0|          0|      3|          4|245|[143.0,8.0,360.0,...|218.50008548836007|
|    155|        8|        318|            276| 352| 1687|        0|          0|      3|          2|150|[155.0,8.0,318.0,...|173.12829107831396|
|    158|        8|        351|            422| 317|  145|        0|          1|      5|          4|264|[158.0,8.0,351.0,...|222.07795439592053|
|    173|        8|       2758|            307| 373|  176|        0|          0|      3|          3|180|[173.0,8.0,2758.0...|  169

In [36]:
evaluate = RegressionEvaluator(predictionCol='prediction', labelCol='HP',
                               metricName='rmse')

rmse_RL = evaluate.evaluate(carrosPred)
rmse_RL

41.98439696261191
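
The RMSE reported above is the square root of the mean squared difference between HP and prediction over the test rows. A minimal sketch of that formula in plain Python, using only the first three (HP, prediction) pairs visible in the table above with predictions rounded to two decimals, so the result will not match the full test-set value; `rmse` is a helper defined here for illustration:

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First three rows of the prediction table (predictions rounded)
labels = [245, 150, 264]
predictions = [218.50, 173.13, 222.08]

print(rmse(labels, predictions))
```

Because the errors are squared before averaging, a single large miss (such as 264 vs. 222.08) weighs more than several small ones.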

> Instantiating, training, and evaluating the random forest regression algorithm.

In [37]:
rfreg = RandomForestRegressor(featuresCol='vected_feats', labelCol='HP')
model_RF = rfreg.fit(carrosTreino)

In [38]:
carrosPredRF = model_RF.transform(carrosTeste)
carrosPredRF.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+-----------------+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|        vected_feats|       prediction|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+-----------------+
|    143|        8|        360|            321| 357| 1584|        0|          0|      3|          4|245|[143.0,8.0,360.0,...|           215.95|
|    155|        8|        318|            276| 352| 1687|        0|          0|      3|          2|150|[155.0,8.0,318.0,...|          164.075|
|    158|        8|        351|            422| 317|  145|        0|          1|      5|          4|264|[158.0,8.0,351.0,...|          176.975|
|    173|        8|       2758|            307| 373|  176|        0|          0|      3|          3|180|[173.0,8.0,2758.0...|           

In [39]:
rmse_RF = evaluate.evaluate(carrosPredRF)
rmse_RF

32.172481025770075

## Classification Model

> Importing the libraries used

In [40]:
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

> Creating the DataFrame

In [41]:
churn = spark.read.load(('/content/drive/MyDrive/Datasets/pyspark_course/Churn'
        '.csv'), format='csv', sep=';', header=True, inferSchema=True)

churn.show(5)

+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|CreditScore|Geography|Gender|Age|Tenure| Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|        619|   France|Female| 42|     2|       0|            1|        1|             1|       10134888|     1|
|        608|    Spain|Female| 41|     1| 8380786|            1|        0|             1|       11254258|     0|
|        502|   France|Female| 42|     8| 1596608|            3|        1|             0|       11393157|     1|
|        699|   France|Female| 39|     1|       0|            2|        0|             0|        9382663|     0|
|        850|    Spain|Female| 43|     2|12551082|            1|        1|             1|         790841|     0|
+-----------+---------+------+---+------+--------+-------------+---------+--------------+-------

> Preparing/transforming the data:

- RFormula takes an R-style formula and transforms the data (one-hot encoding of the string columns and assembly into a feature vector). It follows the same fit/transform logic as a machine learning model. The formula parameter has the form target_variable ~ features; a dot '.' after the tilde means that all other columns are used as features.
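
To make the one-hot encoding step concrete, here is a minimal pure-Python sketch. It assumes that string categories are indexed by descending frequency and that the last category is dropped, so it becomes the all-zeros vector; this is consistent with France appearing as `1.0,0.0` and Spain as `0.0,0.0` in the features column printed below. The helper names and the toy column are hypothetical:

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first
    (ties are broken alphabetically in this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}

def one_hot_drop_last(value, index):
    """One-hot encode a category, omitting the last index so that
    the least frequent category maps to the all-zeros vector."""
    vec = [0.0] * (len(index) - 1)
    i = index[value]
    if i < len(vec):
        vec[i] = 1.0
    return vec

# Hypothetical toy column; the frequencies are made up for illustration
geography = ["France", "France", "France", "Germany", "Germany", "Spain"]
index = fit_string_index(geography)

for country in ["France", "Germany", "Spain"]:
    print(country, one_hot_drop_last(country, index))
```

Dropping one category avoids a redundant column: with three categories, two indicator values are enough to identify which one a row belongs to.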

In [42]:
formula = RFormula(formula='Exited ~ .', featuresCol='features', 
                   labelCol='label', handleInvalid='skip')

churn_rdy = formula.fit(churn).transform(churn).select('features', 'label')

churn_rdy.show(truncate=False)

+----------------------------------------------------------------+-----+
|features                                                        |label|
+----------------------------------------------------------------+-----+
|[619.0,1.0,0.0,0.0,42.0,2.0,0.0,1.0,1.0,1.0,1.0134888E7]        |1.0  |
|[608.0,0.0,0.0,0.0,41.0,1.0,8380786.0,1.0,0.0,1.0,1.1254258E7]  |0.0  |
|[502.0,1.0,0.0,0.0,42.0,8.0,1596608.0,3.0,1.0,0.0,1.1393157E7]  |1.0  |
|(11,[0,1,4,5,7,10],[699.0,1.0,39.0,1.0,2.0,9382663.0])          |0.0  |
|[850.0,0.0,0.0,0.0,43.0,2.0,1.2551082E7,1.0,1.0,1.0,790841.0]   |0.0  |
|[645.0,0.0,0.0,1.0,44.0,8.0,1.1375578E7,2.0,1.0,0.0,1.4975671E7]|1.0  |
|[822.0,1.0,0.0,1.0,50.0,7.0,0.0,2.0,1.0,1.0,100628.0]           |0.0  |
|[376.0,0.0,1.0,0.0,29.0,4.0,1.1504674E7,4.0,1.0,0.0,1.1934688E7]|1.0  |
|[501.0,1.0,0.0,1.0,44.0,4.0,1.4205107E7,2.0,0.0,1.0,749405.0]   |0.0  |
|[684.0,1.0,0.0,1.0,27.0,2.0,1.3460388E7,1.0,1.0,1.0,7172573.0]  |0.0  |
|[528.0,1.0,0.0,1.0,31.0,6.0,1.0201672E7,2.0,0.0,0.

> Splitting into training and test sets

In [43]:
churnTreino, churnTeste = churn_rdy.randomSplit([0.7, 0.3])

display(
    churnTreino.count(),
    churnTeste.count()
)

7082

2918

> Instantiating and training the classifier

In [44]:
clf = DecisionTreeClassifier(labelCol='label', featuresCol='features')
model_clf = clf.fit(churnTreino)

> Model predictions

In [45]:
churnPred = model_clf.transform(churnTeste)

churnPred.show()

+--------------------+-----+--------------+--------------------+----------+
|            features|label| rawPrediction|         probability|prediction|
+--------------------+-----+--------------+--------------------+----------+
|(11,[0,1,3,4,7,10...|  0.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,5,7,10...|  1.0|  [181.0,37.0]|[0.83027522935779...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,5,7,10...|  1.0| [119.0,114.0]|[0.51072961373390...|       0.0|
|(11,[0,1,4,5,7,10...|  1.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4402.0,504.0]|[0.89726865063187...|       0.0|
|(11,[0,1,4,

> Evaluating the model:

> The model does better than chance (area under the ROC curve above 0.5)

In [46]:
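# Note: rawPredictionCol points at the hard 0/1 'prediction' column, so the AUC
# is computed from class labels only. The evaluator's default, 'rawPrediction',
# would rank rows by score and would generally give a different value.
evaluate_clf = BinaryClassificationEvaluator(rawPredictionCol='prediction', 
                labelCol='label', metricName='areaUnderROC')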

auc = evaluate_clf.evaluate(churnPred)

auc

0.7037557198247631
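
The evaluator cell above reads the hard 0/1 `prediction` column as `rawPredictionCol`. A minimal pure-Python sketch of how that choice can affect the metric, using the pairwise definition of AUC (the probability that a random positive row outscores a random negative one, with ties counting as half). The labels and scores are made up for illustration, and `auc` is a helper defined here, not Spark's implementation:

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of
    positive and negative scores (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of class 1
labels = [1, 1, 1, 0, 0, 0]
probability_of_1 = [0.9, 0.6, 0.4, 0.45, 0.3, 0.1]

# Thresholding at 0.5 throws away the ranking within each class
hard_prediction = [1.0 if p >= 0.5 else 0.0 for p in probability_of_1]

print(auc(labels, probability_of_1))  # uses the full ranking
print(auc(labels, hard_prediction))   # uses only the 0/1 decisions
```

In this toy example the score-based AUC is 8/9 and the AUC from hard predictions is 7.5/9, because the thresholded column produces many ties. The same effect may apply to the value reported above.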

## Pipelines

- Pipelines automate the sequence of transformations and model training: they chain the steps of the model workflow into a single object that can be fitted and applied as a whole.
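
A minimal pure-Python sketch of the idea, assuming the usual estimator/transformer pattern: a stage with `fit` is trained on the data produced by the previous stages and replaced by its fitted model, while a stage with only `transform` is applied directly. All class names below are hypothetical and only mimic the pattern; they are not PySpark classes:

```python
class Doubler:
    """Hypothetical stateless transformer: needs no training."""
    def transform(self, data):
        return [2 * x for x in data]

class MeanCenterer:
    """Hypothetical estimator: learns the mean and returns a fitted model."""
    def fit(self, data):
        return MeanCentererModel(sum(data) / len(data))

class MeanCentererModel:
    """Fitted transformer produced by MeanCenterer.fit."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class SimplePipeline:
    """Chains stages; fit() trains the estimators in order."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(data)   # estimator -> fitted model
            fitted.append(stage)
            data = stage.transform(data)  # feed the next stage
        return SimplePipelineModel(fitted)

class SimplePipelineModel:
    """Holds the fitted stages and applies them in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

# Transformer first, estimator second, as in a vectorize-then-train workflow
model = SimplePipeline([Doubler(), MeanCenterer()]).fit([1, 2, 3])
print(model.transform([1, 2, 3]))  # [-2.0, 0.0, 2.0]
print(model.transform([10]))       # [16.0]
```

The fitted pipeline remembers what was learned during `fit` (here, the mean 4.0), so new data goes through exactly the same steps as the training data.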

> Importing the Pipeline class

In [58]:
from pyspark.ml import Pipeline

> Recreating the cars DataFrame

In [59]:
carros_2 = spark.read.load(('/content/drive/MyDrive/Datasets/pyspark_course/'
                              'Carros.csv'), format='csv', sep=';', header=True,
                              inferSchema=True)

carros_2.show(5)

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110|
|    187|        8|        360|            315| 344| 1702|        0|          0|      3|          2|175|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
only showing top 5 rows



> Selecting the features of interest and the dependent variable (label/target)

In [60]:
feats = ['Consumo', 'Cilindros', 'Cilindradas', 'HP']

carros_ftd = carros_2.select(feats)

carros_ftd.show(5)

+-------+---------+-----------+---+
|Consumo|Cilindros|Cilindradas| HP|
+-------+---------+-----------+---+
|     21|        6|        160|110|
|     21|        6|        160|110|
|    228|        4|        108| 93|
|    214|        6|        258|110|
|    187|        8|        360|175|
+-------+---------+-----------+---+
only showing top 5 rows



> Assembling the features into a new vector column named caracteristicas:

In [61]:
vctzd_feats = VectorAssembler(inputCols=feats[:-1], outputCol='caracteristicas')

v_carros_ftd = vctzd_feats.transform(carros_ftd)

v_carros_ftd.show(5)

+-------+---------+-----------+---+-----------------+
|Consumo|Cilindros|Cilindradas| HP|  caracteristicas|
+-------+---------+-----------+---+-----------------+
|     21|        6|        160|110| [21.0,6.0,160.0]|
|     21|        6|        160|110| [21.0,6.0,160.0]|
|    228|        4|        108| 93|[228.0,4.0,108.0]|
|    214|        6|        258|110|[214.0,6.0,258.0]|
|    187|        8|        360|175|[187.0,8.0,360.0]|
+-------+---------+-----------+---+-----------------+
only showing top 5 rows



> Instantiating and training the linear regression model

In [62]:
reglin2 = LinearRegression(featuresCol='caracteristicas', labelCol='HP')
modelo = reglin2.fit(v_carros_ftd)

> Building the pipeline with 2 stages: vectorization (the vctzd_feats object) and the model configuration (reglin2)

In [64]:
pipeline = Pipeline(stages=[vctzd_feats, reglin2])

pipeModel = pipeline.fit(carros_2)

> Predicting with the pipeModel object (which holds the trained model)

In [65]:
prediction = pipeModel.transform(carros_2)
prediction.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+------------------+------------------+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|   caracteristicas|        prediction|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+------------------+------------------+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|  [21.0,6.0,160.0]|162.32154816816643|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|  [21.0,6.0,160.0]|162.32154816816643|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93| [228.0,4.0,108.0]| 82.51715587712931|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110| [214.0,6.0,258.0]|141.86680518718757|