<a href="https://colab.research.google.com/github/cruz-marco/pyspark_course/blob/main/pyspark_MachineLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
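# If this download fails: older Spark releases are typically moved to https://archive.apache.org/dist/spark/
!wget -q https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz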
!tar xf spark-3.2.3-bin-hadoop3.2.tgz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" 
os.environ["SPARK_HOME"] = '/content/spark-3.2.3-bin-hadoop3.2'

!pip install -q findspark

import findspark
findspark.init()
findspark.find()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Machine Learning with Spark

- PySpark provides libraries for training and evaluating machine learning models;

- The independent variables (features) must all be combined into a single vector column;


> Import the libraries to be used:

In [2]:
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as f

## Regression Models

> Create the DataFrame `carros`. The goal of this mini-project is to predict each car's horsepower (the `HP` column).

In [3]:
carros = spark.read.load(('/content/drive/MyDrive/Datasets/pyspark_course/'
                              'Carros.csv'), format='csv', sep=';', header=True,
                              inferSchema=True)

carros.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110|
|    187|        8|        360|            315| 344| 1702|        0|          0|      3|          2|175|
|    181|        6|        225|            276| 346| 2022|        1|          0|      3|          1|105|
|    143|        8|        360|            321| 357| 15

> Assemble the feature columns into a single vector column.

In [4]:
vect_feats = VectorAssembler(inputCols=carros.columns[:-1], outputCol='vected_feats')
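
As a rough illustration of what the assembler does to each row, the plain-Python sketch below concatenates the selected columns into one list of floats. It does not call Spark; `assemble_row` is a hypothetical helper, and the sample row is the first row of the table shown earlier.

```python
# Plain-Python illustration only; VectorAssembler itself runs inside Spark.
columns = ["Consumo", "Cilindros", "Cilindradas", "RelEixoTraseiro", "Peso",
           "Tempo", "TipoMotor", "Transmissao", "Marchas", "Carburadors", "HP"]

# First row of the table displayed above.
row = dict(zip(columns, [21, 6, 160, 39, 262, 1646, 0, 1, 4, 4, 110]))


def assemble_row(row, input_cols):
    """Hypothetical helper: gather the chosen columns into one float vector."""
    return [float(row[name]) for name in input_cols]


# columns[:-1] drops the last column (HP), which is the label to predict.
features = assemble_row(row, columns[:-1])
label = row["HP"]
print(features, label)
```

Leaving `HP` out of the inputs matters: if the label were part of the feature vector, the model would be trained on the value it is supposed to predict.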

In [5]:
carros = vect_feats.transform(carros)
carros.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|        vected_feats|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|[21.0,6.0,160.0,3...|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|[21.0,6.0,160.0,3...|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93|[228.0,4.0,108.0,...|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110|[214.0,6.0,258.0,...|
|    187|        8|        360|            315| 344| 1702|        0|          0|      3|          2|175|[187.0,8.0,360

> Split the data into training and test sets.

In [6]:
carrosTreino, carrosTeste = carros.randomSplit([0.7, 0.3])

display(
    carrosTreino.count(),
    carrosTeste.count()
)

22

10
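
The counts are 22 and 10 rather than an exact 70/30 cut of the 32 rows because the split is random per row, and without a `seed` argument it changes from run to run. The sketch below imitates that behaviour in plain Python. It is a simplified illustration, not Spark's actual sampling code, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight, seed):
    """Hypothetical helper: send each row to train with probability train_weight."""
    rng = random.Random(seed)
    train, test = [], []
    for r in rows:
        (train if rng.random() < train_weight else test).append(r)
    return train, test


rows = list(range(32))  # stand-in for the 32 rows of the DataFrame
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))  # sizes are only approximately 70/30

# The same seed reproduces the same split.
assert random_split(rows, 0.7, seed=42) == (train, test)
```

In PySpark the analogous call would be `carros.randomSplit([0.7, 0.3], seed=42)`, where 42 is an arbitrary value chosen here for illustration.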

> Instantiate, train, and evaluate the classic linear regression algorithm.

In [8]:
reglin = LinearRegression(featuresCol='vected_feats', labelCol='HP')
model_RL = reglin.fit(carrosTreino)

In [9]:
carrosPred = model_RL.transform(carrosTeste)
carrosPred.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+------------------+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|        vected_feats|        prediction|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+------------------+
|    104|        8|        460|              3|5424| 1782|        0|          0|      3|          4|215|[104.0,8.0,460.0,...| 173.5649974544997|
|    133|        8|        350|            373| 384| 1541|        0|          0|      3|          4|245|[133.0,8.0,350.0,...|240.80817059157295|
|    152|        8|        304|            315|3435|  173|        0|          0|      3|          2|150|[152.0,8.0,304.0,...|151.90544420558822|
|    152|        8|       2758|            307| 378|   18|        0|          0|      3|          3|180|[152.0,8.0,2758.0...|177.6

In [13]:
evaluate = RegressionEvaluator(predictionCol='prediction', labelCol='HP',
                               metricName='rmse')

rmse_RL = evaluate.evaluate(carrosPred)
rmse_RL

56.514656844206385
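
RMSE is the square root of the mean squared difference between label and prediction, expressed in the same unit as `HP`. A minimal plain-Python sketch follows, applied only to the three complete rows visible in the truncated output above, so its result is not the full test-set value. `rmse` is a hypothetical helper.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# HP and prediction for the three fully visible rows of the output above.
hp = [215, 245, 150]
pred = [173.5649974544997, 240.80817059157295, 151.90544420558822]
print(rmse(hp, pred))
```

Because the errors are squared before averaging, a single large miss (such as the first row) dominates the result.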

> Instantiate, train, and evaluate the random forest regression algorithm.

In [14]:
rfreg = RandomForestRegressor(featuresCol='vected_feats', labelCol='HP')
model_RF = rfreg.fit(carrosTreino)

In [16]:
carrosPredRF = model_RF.transform(carrosTeste)
carrosPredRF.show()

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+------------------+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|        vected_feats|        prediction|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+--------------------+------------------+
|    104|        8|        460|              3|5424| 1782|        0|          0|      3|          4|215|[104.0,8.0,460.0,...|            198.25|
|    133|        8|        350|            373| 384| 1541|        0|          0|      3|          4|245|[133.0,8.0,350.0,...|205.91666666666669|
|    152|        8|        304|            315|3435|  173|        0|          0|      3|          2|150|[152.0,8.0,304.0,...|             184.5|
|    152|        8|       2758|            307| 378|   18|        0|          0|      3|          3|180|[152.0,8.0,2758.0...|200.1

In [17]:
rmse_RF = evaluate.evaluate(carrosPredRF)
rmse_RF

40.36708314924993

## Classification Model

> Import the libraries to be used.

In [18]:
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

> Create the DataFrame.

In [19]:
churn = spark.read.load(('/content/drive/MyDrive/Datasets/pyspark_course/Churn'
        '.csv'), format='csv', sep=';', header=True, inferSchema=True)

churn.show(5)

+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|CreditScore|Geography|Gender|Age|Tenure| Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|        619|   France|Female| 42|     2|       0|            1|        1|             1|       10134888|     1|
|        608|    Spain|Female| 41|     1| 8380786|            1|        0|             1|       11254258|     0|
|        502|   France|Female| 42|     8| 1596608|            3|        1|             0|       11393157|     1|
|        699|   France|Female| 39|     1|       0|            2|        0|             0|        9382663|     0|
|        850|    Spain|Female| 43|     2|12551082|            1|        1|             1|         790841|     0|
+-----------+---------+------+---+------+--------+-------------+---------+--------------+-------

> Data preparation and transformation:

- RFormula takes an R-style formula and transforms the data (one-hot encoding and vectorization) through the same fit/transform interface used by a machine learning model. The formula has the form `target ~ features`; a dot (`.`) after the tilde means that all other columns are used as features.
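
The categorical part of that transformation can be sketched in plain Python. Judging from the output that follows, each string column becomes a set of indicator values with one level left out as the baseline (France maps to `1.0, 0.0` and Spain to `0.0, 0.0`). The helper `dummy_encode`, the level order, and the placeholder name `ThirdCountry` are assumptions made for illustration; the real work is done by RFormula inside Spark.

```python
def dummy_encode(value, levels):
    """Hypothetical helper: one indicator per level, with the last level dropped.

    The dropped level acts as the baseline and is encoded as all zeros.
    """
    return [1.0 if value == level else 0.0 for level in levels[:-1]]


# Assumed level order; "ThirdCountry" is a placeholder for the level
# whose name is not visible in the sample rows.
geography_levels = ["France", "ThirdCountry", "Spain"]

print(dummy_encode("France", geography_levels))  # [1.0, 0.0]
print(dummy_encode("Spain", geography_levels))   # [0.0, 0.0]
```

Dropping one level keeps the indicators from being redundant: once all the others are zero, the baseline is implied.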

In [21]:
formula = RFormula(formula='Exited ~ .', featuresCol='features', 
                   labelCol='label', handleInvalid='skip')

churn_rdy = formula.fit(churn).transform(churn).select('features', 'label')

churn_rdy.show(truncate=False)

+----------------------------------------------------------------+-----+
|features                                                        |label|
+----------------------------------------------------------------+-----+
|[619.0,1.0,0.0,0.0,42.0,2.0,0.0,1.0,1.0,1.0,1.0134888E7]        |1.0  |
|[608.0,0.0,0.0,0.0,41.0,1.0,8380786.0,1.0,0.0,1.0,1.1254258E7]  |0.0  |
|[502.0,1.0,0.0,0.0,42.0,8.0,1596608.0,3.0,1.0,0.0,1.1393157E7]  |1.0  |
|(11,[0,1,4,5,7,10],[699.0,1.0,39.0,1.0,2.0,9382663.0])          |0.0  |
|[850.0,0.0,0.0,0.0,43.0,2.0,1.2551082E7,1.0,1.0,1.0,790841.0]   |0.0  |
|[645.0,0.0,0.0,1.0,44.0,8.0,1.1375578E7,2.0,1.0,0.0,1.4975671E7]|1.0  |
|[822.0,1.0,0.0,1.0,50.0,7.0,0.0,2.0,1.0,1.0,100628.0]           |0.0  |
|[376.0,0.0,1.0,0.0,29.0,4.0,1.1504674E7,4.0,1.0,0.0,1.1934688E7]|1.0  |
|[501.0,1.0,0.0,1.0,44.0,4.0,1.4205107E7,2.0,0.0,1.0,749405.0]   |0.0  |
|[684.0,1.0,0.0,1.0,27.0,2.0,1.3460388E7,1.0,1.0,1.0,7172573.0]  |0.0  |
|[528.0,1.0,0.0,1.0,31.0,6.0,1.0201672E7,2.0,0.0,0.
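
Some rows in this output are printed in sparse form, `(size, [indices], [values])`, which lists only the non-zero entries; the other rows are printed densely. A small plain-Python sketch of the expansion follows, where `to_dense` is a hypothetical helper and not part of PySpark.

```python
def to_dense(size, indices, values):
    """Hypothetical helper: expand a (size, indices, values) triple into a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Fourth row of the output above, written in sparse notation.
row4 = to_dense(11, [0, 1, 4, 5, 7, 10],
                [699.0, 1.0, 39.0, 1.0, 2.0, 9382663.0])
print(row4)
```

Both notations describe the same 11-element feature vector, so downstream estimators treat them identically.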

> Split into training and test sets.

In [24]:
churnTreino, churnTeste = churn_rdy.randomSplit([0.7, 0.3])

display(
    churnTreino.count(),
    churnTeste.count()
)

7082

2918

> Instantiate and train the classifier.

In [25]:
clf = DecisionTreeClassifier(labelCol='label', featuresCol='features')
model_clf = clf.fit(churnTreino)

> Model predictions.

In [26]:
churnPred = model_clf.transform(churnTeste)

churnPred.show()

+--------------------+-----+--------------+--------------------+----------+
|            features|label| rawPrediction|         probability|prediction|
+--------------------+-----+--------------+--------------------+----------+
|(11,[0,1,3,4,7,10...|  0.0| [149.0,220.0]|[0.40379403794037...|       1.0|
|(11,[0,1,3,4,7,10...|  0.0|[4395.0,517.0]|[0.89474755700325...|       0.0|
|(11,[0,1,4,5,7,10...|  1.0|  [31.0,232.0]|[0.11787072243346...|       1.0|
|(11,[0,1,4,5,7,10...|  1.0|  [186.0,45.0]|[0.80519480519480...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4395.0,517.0]|[0.89474755700325...|       0.0|
|(11,[0,1,4,5,7,10...|  0.0|[4395.0,517.0]|[0.89474755700325...|       0.0|
|(11,[0,1,4,5,7,10...|  1.0| [149.0,220.0]|[0.40379403794037...|       1.0|
|(11,[0,1,4,5,7,10...|  1.0|     [1.0,9.0]|           [0.1,0.9]|       1.0|
|(11,[0,1,4,5,7,10...|  0.0|[4395.0,517.0]|[0.89474755700325...|       0.0|
|(11,[0,1,4,5,7,10...|  1.0| [149.0,220.0]|[0.40379403794037...|       1.0|
|(11,[0,1,4,
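
In the rows above, `probability` appears to be `rawPrediction` normalized to sum to 1 (for example, 149 / (149 + 220) ≈ 0.4038), and `prediction` is the index of the larger entry. The plain-Python sketch below reproduces that arithmetic; `normalize` and `predict` are hypothetical helpers, not PySpark functions.

```python
def normalize(raw):
    """Scale non-negative class counts so they sum to 1."""
    total = sum(raw)
    return [count / total for count in raw]


def predict(raw):
    """Return the index of the class with the largest count, as a float."""
    return float(max(range(len(raw)), key=lambda i: raw[i]))


# rawPrediction values taken from the first rows of the output above.
for raw in ([149.0, 220.0], [4395.0, 517.0], [31.0, 232.0]):
    print(raw, normalize(raw), predict(raw))
```

Rows that share the same `rawPrediction` landed in the same leaf of the tree, which is why identical values repeat across the output.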

> Model evaluation:

> The model performs better than chance (area under the ROC curve above 0.5).

In [27]:
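# Note: rawPredictionCol='prediction' feeds the evaluator the hard 0/1 labels,
# so the ROC curve has a single operating point; 'rawPrediction' would rank by score.
evaluate_clf = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                labelCol='label', metricName='areaUnderROC')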

auc = evaluate_clf.evaluate(churnPred)

auc

0.691766299393087
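
One common way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The plain-Python sketch below uses that pairwise definition on made-up toy data; it is not how Spark computes the metric, and `pairwise_auc` is a hypothetical helper. It also shows why the score column matters: the evaluator above was given the hard 0/1 `prediction` column, which creates many ties, whereas a graded score such as `rawPrediction` or `probability` lets the ranking separate examples more finely.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [1, 1, 0, 0]
graded = [0.9, 0.4, 0.5, 0.1]                      # probability-like scores
hard = [1.0 if s >= 0.5 else 0.0 for s in graded]  # thresholded 0/1 labels

print(pairwise_auc(labels, graded))  # 0.75
print(pairwise_auc(labels, hard))    # 0.5
```

On this toy data the thresholded scores lose ranking information and the AUC drops; on real data the size and direction of the difference would need to be measured.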