# Bootcamp: Data Scientist - Challenge

- Date: June 2022.

## Author

Made by [Alexsander Lopes Camargos](https://github.com/alexcamargos). Get in touch!

[![GitHub](https://img.shields.io/badge/-AlexCamargos-1ca0f1?style=flat-square&labelColor=1ca0f1&logo=github&logoColor=white&link=https://github.com/alexcamargos)](https://github.com/alexcamargos)
[![Twitter Badge](https://img.shields.io/badge/-@alcamargos-1ca0f1?style=flat-square&labelColor=1ca0f1&logo=twitter&logoColor=white&link=https://twitter.com/alcamargos)](https://twitter.com/alcamargos)
[![Linkedin Badge](https://img.shields.io/badge/-alexcamargos-1ca0f1?style=flat-square&logo=Linkedin&logoColor=white&link=https://www.linkedin.com/in/alexcamargos/)](https://www.linkedin.com/in/alexcamargos/)
[![Gmail Badge](https://img.shields.io/badge/-alcamargos@vivaldi.net-1ca0f1?style=flat-square&labelColor=1ca0f1&logo=Gmail&logoColor=white&link=mailto:alcamargos@vivaldi.net)](mailto:alcamargos@vivaldi.net)

## License

[MIT License](https://choosealicense.com/licenses/mit/)

# Module 2 - Developing Solutions with Spark

## Objectives

Practice the following concepts covered in the module:

- Practice the Spark SQL module of Apache Spark.
- Practice the Spark MLlib module of Apache Spark.

## Problem statement

Heart-related diseases affect millions of people around the world and, according to the World Health Organization (WHO), are the second leading cause of death in the world population. As a data scientist, you have been hired to build a predictive model that, from patient data such as age, gender, glucose level, and whether or not the patient smokes, predicts whether the patient will have a stroke.

You have access to a file containing patient attributes and a “stroke” attribute, which indicates whether that patient had a stroke event.

The dataset is available at: [https://dcc.ufmg.br/~pcalais/stroke_data.csv](https://dcc.ufmg.br/~pcalais/stroke_data.csv)

For a description of the columns, see the “Attribute information” section at [https://www.kaggle.com/fedesoriano/stroke-prediction-dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).

The objective questions will guide you through the exploratory analysis and the predictive model you will build from the data.

### Useful links:
- [https://spark.apache.org/docs/latest/sql-getting-started.html](https://spark.apache.org/docs/latest/sql-getting-started.html)
- [https://spark.apache.org/docs/latest/ml-classificationregression.html#decision-tree-classifier](https://spark.apache.org/docs/latest/ml-classificationregression.html#decision-tree-classifier)

## Column descriptions

- *id*: unique identifier
- *gender*: "Male", "Female" or "Other"
- *age*: age of the patient
- *hypertension*: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- *heart_disease*: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- *ever_married*: "No" or "Yes"
- *work_type*: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- *Residence_type*: "Rural" or "Urban"
- *avg_glucose_level*: average glucose level in blood
- *bmi*: body mass index
- *smoking_status*: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- *stroke*: 1 if the patient had a stroke or 0 if not

_*Note: "Unknown" in smoking_status means that the information is unavailable for this patient_

In [None]:
# Creating the SparkSession
from pyspark import SparkFiles
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('Desafio do Módulo 2').getOrCreate()
spark

In [None]:
!wget --no-check-certificate https://homepages.dcc.ufmg.br/~pcalais/stroke_data.csv

--2022-06-18 17:59:08--  https://homepages.dcc.ufmg.br/~pcalais/stroke_data.csv
Resolving homepages.dcc.ufmg.br (homepages.dcc.ufmg.br)... 150.164.0.136
Connecting to homepages.dcc.ufmg.br (homepages.dcc.ufmg.br)|150.164.0.136|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 4161594 (4.0M) [text/csv]
Saving to: ‘stroke_data.csv.3’


2022-06-18 17:59:13 (1.07 MB/s) - ‘stroke_data.csv.3’ saved [4161594/4161594]



In [None]:
# Telling Spark where the downloaded file is located.
spark.sparkContext.addFile('/databricks/driver/stroke_data.csv')

In [None]:
# Loading the dataset. Our CSV file has a header row and uses Windows-1252 encoding.
# header=True - Indicates that the first line of the file is the header.
# inferSchema=True - Tries to infer the schema from the data.
df = spark.read.option('header', True) \
               .option('inferSchema', True) \
               .csv('file:/databricks/driver/stroke_data.csv')

In [None]:
df.printSchema()

root
 |-- 0: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- ever_married: string (nullable = true)
 |-- work_type: string (nullable = true)
 |-- Residence_type: string (nullable = true)
 |-- avg_glucose_level: double (nullable = true)
 |-- bmi: double (nullable = true)
 |-- smoking_status: string (nullable = true)
 |-- stroke: integer (nullable = true)



In [None]:
# Number of rows in the dataframe.
df.count()

Out[7]: 67135

In [None]:
# Number of columns in the dataframe.
len(df.columns)

Out[8]: 12

In [None]:
display(df)

0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
1,Female,18.0,0,0,No,Private,Urban,94.19,12.12,smokes,1
2,Male,58.0,1,0,Yes,Private,Rural,154.24,33.7,never_smoked,0
3,Female,36.0,0,0,Yes,Govt_job,Urban,72.63,24.7,smokes,0
4,Female,62.0,0,0,Yes,Self-employed,Rural,85.52,31.2,formerly smoked,0
5,Female,82.0,0,0,Yes,Private,Rural,59.32,33.2,smokes,1
6,Female,82.0,0,0,No,Govt_job,Urban,234.5,24.0,formerly smoked,0
7,Female,33.0,0,0,Yes,Self-employed,Urban,193.42,29.9,smokes,0
8,Female,37.0,0,0,Yes,Private,Rural,156.7,36.9,smokes,1
9,Female,41.0,0,0,Yes,Govt_job,Rural,64.06,33.8,smokes,1
10,Female,70.0,0,0,Yes,Self-employed,Rural,76.34,24.4,formerly smoked,1


## Descriptive analysis

In [None]:
display(df.summary())

summary,0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
count,67135.0,67135,67135.0,67135.0,67135.0,67135,67135,67135,67135.0,67135.0,67135,67135.0
mean,33568.0,,51.95950845311693,0.1641021821702539,0.101422506889104,,,,113.41439606762462,29.16154047813857,,0.600089372160572
stddev,19380.349498052576,,23.413054156327917,0.3703710291636695,0.3018896147748789,,,,51.25881719094036,7.1020570070927205,,0.4898833455566829
min,1.0,Female,0.08,0.0,0.0,No,Govt_job,Rural,55.0,10.1,formerly smoked,0.0
25%,16782.0,,35.0,0.0,0.0,,,,78.37,24.4,,0.0
50%,33562.0,,56.0,0.0,0.0,,,,94.16,28.8,,1.0
75%,50346.0,,73.0,0.0,0.0,,,,126.46,32.6,,1.0
max,67135.0,Other,82.0,1.0,1.0,Yes,children,Urban,291.05,97.6,smokes,1.0


# Question 1

How many records are in the file?

In [None]:
df.count()

Out[11]: 67135

# Question 2

*How many columns are in the file?*

*How many of them are numeric?*

When reading the file with `spark.read.csv`, enable `inferSchema=True`. Use the `printSchema()` function of the DataFrame API.

In [None]:
df.printSchema()

root
 |-- 0: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: double (nullable = true)
 |-- hypertension: integer (nullable = true)
 |-- heart_disease: integer (nullable = true)
 |-- ever_married: string (nullable = true)
 |-- work_type: string (nullable = true)
 |-- Residence_type: string (nullable = true)
 |-- avg_glucose_level: double (nullable = true)
 |-- bmi: double (nullable = true)
 |-- smoking_status: string (nullable = true)
 |-- stroke: integer (nullable = true)



In [None]:
# Number of columns in the dataframe.
print(f'Number of columns in the dataframe: {len(df.columns)}')
# Number of numeric columns (int or double).
print('Number of numeric columns in the dataframe: ', end='')
print(len([column[1] for column in df.dtypes if column[1] in ['int', 'double']]))

Number of columns in the dataframe: 12
Number of numeric columns in the dataframe: 7
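
The filter above matches only the `int` and `double` type names. As a hedged sketch, the same count can be written against the `(name, type)` pairs that `df.dtypes` returns, using a broader set of numeric type names. `NUMERIC_TYPES` and `count_numeric` are illustrative names that do not appear in the notebook, and the `dtypes` list is transcribed by hand from the `printSchema()` output.

```python
# Spark SQL type names treated as numeric here (an assumption for illustration).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def count_numeric(dtypes):
    """Count columns whose type string is numeric, including decimal(p,s)."""
    return sum(
        1
        for _, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    )


# (column, type) pairs transcribed from the printSchema() output.
dtypes = [
    ("0", "int"), ("gender", "string"), ("age", "double"),
    ("hypertension", "int"), ("heart_disease", "int"),
    ("ever_married", "string"), ("work_type", "string"),
    ("Residence_type", "string"), ("avg_glucose_level", "double"),
    ("bmi", "double"), ("smoking_status", "string"), ("stroke", "int"),
]

print(len(dtypes), count_numeric(dtypes))
```

Note that the count of numeric columns includes the identifier column `0` and the label `stroke`, which are numeric in type but are not predictors.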


# Question 3

*In the dataset, how many patients did and did not have a stroke, respectively?*

In [None]:
# Using df.groupby
# stroke: 1 if the patient had a stroke or 0 if not
strokes = df.groupby('stroke').count()
print(f"Number of patients who had a stroke: {strokes.filter('stroke == 1').select('count').collect()[0][0]}")
print(f"Number of patients who did not have a stroke: {strokes.filter('stroke == 0').select('count').collect()[0][0]}")

Number of patients who had a stroke: 40287
Number of patients who did not have a stroke: 26848


In [None]:
# Using df.filter
# stroke: 1 if the patient had a stroke or 0 if not
print(f"Number of patients who had a stroke: {df.filter('stroke == 1').count()}")
print(f"Number of patients who did not have a stroke: {df.filter('stroke == 0').count()}")

Number of patients who had a stroke: 40287
Number of patients who did not have a stroke: 26848
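
These two counts also give a reference point for any classifier trained on this data: the accuracy of always predicting the most frequent class. The sketch below only redoes that arithmetic on the counts printed above. `majority_baseline` is an illustrative helper, not something the notebook defines.

```python
def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


# Class counts taken from the output above (1 = stroke, 0 = no stroke).
stroke_counts = {1: 40287, 0: 26848}
baseline = majority_baseline(stroke_counts)
print(f"{baseline:.4f}")
```

With roughly 60% of rows labeled `stroke = 1`, a model needs to score clearly above that level for its accuracy to be informative.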


# Question 4

From the dataframe, create a temporary table using `df.createOrReplaceTempView('table')`, then use `spark.sql` to write an SQL query that returns how many patients had a stroke per work type (_work_type_).

*How many patients who had a stroke worked, respectively, in the private sector, as self-employed, and in government, and how many are children?*

In [None]:
# Creating a temporary view to query.
df.createOrReplaceTempView('strokes')

In [None]:
# Spark SQL API
# stroke: 1 if the patient had a stroke or 0 if not
spark.sql('SELECT work_type, COUNT(work_type) AS Stroke_Count FROM strokes WHERE stroke = 1 GROUP BY work_type ORDER BY Stroke_Count').show()

+-------------+------------+
|    work_type|Stroke_Count|
+-------------+------------+
| Never_worked|          85|
|     children|         520|
|     Govt_job|        5164|
|Self-employed|       10807|
|      Private|       23711|
+-------------+------------+



In [None]:
# Spark DataFrame API
# stroke: 1 if the patient had a stroke or 0 if not
df.filter('stroke == 1').groupBy('work_type').count().sort('count').show()

+-------------+-----+
|    work_type|count|
+-------------+-----+
| Never_worked|   85|
|     children|  520|
|     Govt_job| 5164|
|Self-employed|10807|
|      Private|23711|
+-------------+-----+



# Question 5

Write a query with `spark.sql` to determine the proportion of study participants by gender. Most participants are:

In [None]:
# Spark SQL API
spark.sql('SELECT gender, COUNT(gender) AS Gender_Count FROM strokes GROUP BY gender ORDER BY Gender_Count DESC').show()

+------+------------+
|gender|Gender_Count|
+------+------------+
|Female|       39530|
|  Male|       27594|
| Other|          11|
+------+------------+



In [None]:
# Spark DataFrame API
df.groupBy('gender').count().sort('count', ascending=False).show()

+------+-----+
|gender|count|
+------+-----+
|Female|39530|
|  Male|27594|
| Other|   11|
+------+-----+
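
The question asks for proportions, while the queries return raw counts. A minimal sketch of the remaining step, dividing each count by the total, using the three counts shown in the table above. `proportions` is an illustrative helper name, not part of the notebook.

```python
def proportions(counts):
    """Convert absolute counts into shares of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


# Counts copied from the query output above.
gender_counts = {"Female": 39530, "Male": 27594, "Other": 11}
shares = proportions(gender_counts)

for gender, share in shares.items():
    print(f"{gender}: {share:.2%}")
```

The three counts add up to the 67135 rows reported earlier, so the shares cover the whole dataset.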



# Question 6

Write a query with `spark.sql` to determine who is more likely to have a stroke: hypertensive or non-hypertensive patients. You may write one query per group. From the probabilities you obtained, you conclude that:

In [None]:
# Spark SQL API
# hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension.
spark.sql('SELECT hypertension, COUNT(hypertension) as Hypertension_Count FROM strokes WHERE stroke = 1 GROUP BY hypertension').show()

+------------+------------------+
|hypertension|Hypertension_Count|
+------------+------------------+
|           1|              8817|
|           0|             31470|
+------------+------------------+



In [None]:
# Spark DataFrame API
# hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension.
df.filter('stroke == 1').groupby('hypertension').count().show()

+------------+-----+
|hypertension|count|
+------------+-----+
|           1| 8817|
|           0|31470|
+------------+-----+
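
The counts above show how many stroke patients fall in each group, but a probability also needs the size of each group. The sketch below divides the stroke counts by group totals. Those totals are an assumption: the notebook never prints them, so the figure of 11017 hypertensive patients is inferred from the `summary()` output (mean of `hypertension` ≈ 0.1641 multiplied by 67135 rows).

```python
TOTAL_ROWS = 67135

# Assumption: 11017 hypertensive patients, inferred from mean(hypertension) * rows.
group_totals = {1: 11017, 0: TOTAL_ROWS - 11017}

# Stroke counts per hypertension group, copied from the output above.
stroke_counts = {1: 8817, 0: 31470}

# Conditional stroke rate: strokes in the group divided by the group size.
stroke_rate = {
    group: stroke_counts[group] / group_totals[group] for group in group_totals
}

for group, rate in stroke_rate.items():
    print(f"hypertension={group}: {rate:.4f}")
```

Under that assumption the rate is roughly 0.80 for hypertensive patients versus roughly 0.56 for the others. Because `stroke` is coded 0/1, grouping by `hypertension` and taking `AVG(stroke)` in Spark SQL should give the same rates directly, without inferring the totals.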



# Question 7

*Write a query with `spark.sql` that determines the number of people who had a stroke by age. At which age did the largest number of people in the dataset have a stroke?*

In [None]:
# Spark SQL API
# stroke: 1 if the patient had a stroke or 0 if not.
spark.sql('SELECT age, COUNT(age) AS Age_Count FROM strokes WHERE stroke = 1 GROUP BY age ORDER BY Age_Count DESC').show(5)

+----+---------+
| age|Age_Count|
+----+---------+
|79.0|     2916|
|78.0|     2279|
|80.0|     1858|
|81.0|     1738|
|82.0|     1427|
+----+---------+
only showing top 5 rows



In [None]:
# Spark DataFrame API
# stroke: 1 if the patient had a stroke or 0 if not
df.filter('stroke == 1').groupby('age').count().sort('count', ascending=False).show(5)

+----+-----+
| age|count|
+----+-----+
|79.0| 2916|
|78.0| 2279|
|80.0| 1858|
|81.0| 1738|
|82.0| 1427|
+----+-----+
only showing top 5 rows



# Question 8

Using the DataFrame API, determine how many people had a stroke after age 50.

In [None]:
# Spark DataFrame API
df.filter('age > 50').where('stroke == 1').count()

Out[25]: 28938

In [None]:
# Spark SQL API
spark.sql('SELECT stroke, COUNT(stroke) as Stroke_Count from strokes WHERE age > 50 and stroke == 1 GROUP BY stroke').show()

+------+------------+
|stroke|Stroke_Count|
+------+------------+
|     1|       28938|
+------+------------+



# Question 9

Using `spark.sql`, determine the average glucose level for people who, respectively, had and did not have a stroke.

In [None]:
# Spark SQL API
# stroke: 1 if the patient had a stroke or 0 if not
spark.sql('SELECT stroke, avg(avg_glucose_level) as Glucose_AVG FROM strokes GROUP BY stroke').show()

+------+------------------+
|stroke|       Glucose_AVG|
+------+------------------+
|     1|119.95307046938272|
|     0|103.60273130214506|
+------+------------------+



In [None]:
# Spark DataFrame API
# stroke: 1 if the patient had a stroke or 0 if not
df.groupBy('stroke').agg({'avg_glucose_level' : 'avg'}).show()

+------+----------------------+
|stroke|avg(avg_glucose_level)|
+------+----------------------+
|     1|    119.95307046938272|
|     0|    103.60273130214506|
+------+----------------------+



# Question 10

*What is the average _BMI (body mass index)_ of those who had and did not have a stroke?*

In [None]:
# Spark SQL API
# stroke: 1 if the patient had a stroke or 0 if not
spark.sql('SELECT stroke, ROUND(AVG(bmi), 4) as BMI_AVG from strokes GROUP BY stroke').show()

+------+-------+
|stroke|BMI_AVG|
+------+-------+
|     1|29.9425|
|     0|27.9897|
+------+-------+



In [None]:
# Spark DataFrame API
# stroke: 1 if the patient had a stroke or 0 if not
df.groupBy('stroke').agg({'bmi' : 'avg'}).show()

+------+------------------+
|stroke|          avg(bmi)|
+------+------------------+
|     1|29.942490629729495|
|     0|27.989678933253657|
+------+------------------+



# Question 11

Create a decision tree model that predicts the chance of stroke from the continuous/categorical variables: age, BMI, hypertension, heart disease, and average glucose level.

_Use the content of the second interactive class to build and evaluate the model._

In [None]:
# Importing the libraries for machine learning.
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import VectorAssembler

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
# Numeric predictor variables.
predictions_columns = ['age', 'bmi', 'hypertension', 'heart_disease', 'avg_glucose_level']

In [None]:
# Combining the predictor columns into a single vector called features.
vector_assembler = VectorAssembler(inputCols=predictions_columns, outputCol='features')

In [None]:
# Creating the features column by joining the predictor elements.
df_transform = vector_assembler.transform(df)

In [None]:
# Viewing the response variable and the predictor variables.
df_model = df_transform.select('stroke', 'features')
df_model.show(10, truncate=False)

+------+--------------------------+
|stroke|features                  |
+------+--------------------------+
|1     |[18.0,12.12,0.0,0.0,94.19]|
|0     |[58.0,33.7,1.0,0.0,154.24]|
|0     |[36.0,24.7,0.0,0.0,72.63] |
|0     |[62.0,31.2,0.0,0.0,85.52] |
|1     |[82.0,33.2,0.0,0.0,59.32] |
|0     |[82.0,24.0,0.0,0.0,234.5] |
|0     |[33.0,29.9,0.0,0.0,193.42]|
|1     |[37.0,36.9,0.0,0.0,156.7] |
|1     |[41.0,33.8,0.0,0.0,64.06] |
|1     |[70.0,24.4,0.0,0.0,76.34] |
+------+--------------------------+
only showing top 10 rows



In [None]:
# Creating the training (75% of the data) and test (25% of the data) dataframes.
df_training, df_test = df_model.randomSplit([.75, .25])

# Size of the training dataframe.
print(df_training.count())
# Size of the test dataframe.
print(df_test.count())

50343
16792
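
Because `randomSplit` is called without a seed, the split differs from run to run, and the second split made later in this notebook produces different sizes. The two models are therefore scored on different test rows. PySpark's `randomSplit` accepts an optional `seed` argument for reproducibility. The block below is a plain-Python analogy of a seeded split, not Spark code: `seeded_split` is a hypothetical helper and the integer rows are placeholder data.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Assign each row to train or test with a reproducible random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # placeholder data for illustration
first = seeded_split(rows, 0.75, seed=42)
second = seeded_split(rows, 0.75, seed=42)

# The same seed gives the same partition; sizes are only approximately 75/25.
print(first == second)
print(len(first[0]) + len(first[1]) == len(rows))
```

As with `randomSplit`, each row is assigned independently, so the fractions are approximate and not exact.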


In [None]:
# Applying the classification algorithm.
df_classifier = DecisionTreeClassifier(labelCol='stroke').fit(df_training)
df_prediction = df_classifier.transform(df_test)
df_prediction.show(truncate=False)

+------+--------------------------+-------------+------------------------------------------+----------+
|stroke|features                  |rawPrediction|probability                               |prediction|
+------+--------------------------+-------------+------------------------------------------+----------+
|0     |[0.08,12.1,0.0,0.0,125.11]|[3211.0,1.0] |[0.9996886674968867,3.1133250311332503E-4]|0.0       |
|0     |[0.08,12.2,0.0,0.0,111.09]|[3211.0,1.0] |[0.9996886674968867,3.1133250311332503E-4]|0.0       |
|0     |[0.08,31.2,0.0,0.0,93.04] |[3211.0,1.0] |[0.9996886674968867,3.1133250311332503E-4]|0.0       |
|0     |[0.16,13.1,0.0,0.0,117.56]|[3211.0,1.0] |[0.9996886674968867,3.1133250311332503E-4]|0.0       |
|0     |[0.16,13.9,0.0,0.0,109.52]|[3211.0,1.0] |[0.9996886674968867,3.1133250311332503E-4]|0.0       |
|0     |[0.16,14.0,0.0,0.0,84.53] |[3211.0,1.0] |[0.9996886674968867,3.1133250311332503E-4]|0.0       |
|0     |[0.16,14.7,0.0,0.0,70.87] |[3211.0,1.0] |[0.999688667496

In [None]:
# Computing the model's accuracy.
accuracy = MulticlassClassificationEvaluator(labelCol='stroke', metricName='accuracy').evaluate(df_prediction)

# Computing the model's weighted precision.
precision = MulticlassClassificationEvaluator(labelCol='stroke', metricName='weightedPrecision').evaluate(df_prediction)

print(f'Model accuracy: {accuracy:.4f}')
print(f'Model precision: {precision:.4f}')

Model accuracy: 0.6847
Model precision: 0.6834
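
To make the two metrics concrete, here is a small plain-Python sketch of how they can be computed from `(label, prediction)` pairs. It assumes that weighted precision means the per-class precision averaged with weights equal to each class's share of the true labels, which is how I understand Spark's `weightedPrecision`. The helper names and the six toy pairs are hypothetical and unrelated to the notebook's data.

```python
from collections import Counter


def accuracy(pairs):
    """Fraction of (label, prediction) pairs where both agree."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)


def weighted_precision(pairs):
    """Per-class precision, weighted by each class's share of true labels."""
    label_counts = Counter(label for label, _ in pairs)
    predicted_counts = Counter(pred for _, pred in pairs)
    correct = Counter(label for label, pred in pairs if label == pred)
    score = 0.0
    for label, count in label_counts.items():
        predicted = predicted_counts.get(label, 0)
        precision = correct.get(label, 0) / predicted if predicted else 0.0
        score += precision * count / len(pairs)
    return score


# Toy (label, prediction) pairs, invented for illustration only.
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
print(f"{accuracy(pairs):.4f} {weighted_precision(pairs):.4f}")
```

In the toy data, class 1 has precision 3/4 and class 0 has precision 1/2, and the true labels are split 4 to 2, which gives a weighted precision of 2/3.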


In [None]:
# Visualizing the decision tree graphically.
display(df_classifier)

treeNode
"{""index"":9,""featureType"":""continuous"",""prediction"":null,""threshold"":56.5,""categories"":null,""feature"":0,""overflow"":false}"
"{""index"":1,""featureType"":""continuous"",""prediction"":null,""threshold"":9.5,""categories"":null,""feature"":0,""overflow"":false}"
"{""index"":0,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":3,""featureType"":""continuous"",""prediction"":null,""threshold"":15.5,""categories"":null,""feature"":0,""overflow"":false}"
"{""index"":2,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":7,""featureType"":""continuous"",""prediction"":null,""threshold"":0.5,""categories"":null,""feature"":2,""overflow"":false}"
"{""index"":5,""featureType"":""continuous"",""prediction"":null,""threshold"":32.5,""categories"":null,""feature"":0,""overflow"":false}"
"{""index"":4,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":6,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":8,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"


# Question 12

Add the categorical variables gender and smoking status to the model.

_Use the content of the interactive class to handle the categorical variables._

*The model's accuracy (quality) increased to:*

In [None]:
category_variables = ['gender', 'smoking_status']

In [None]:
category_variables_indexer = StringIndexer(inputCols=category_variables,
                                           outputCols=[column + '_Indexer' for column in category_variables])

In [None]:
# Converting the categorical variables gender and smoking_status to numeric indices.
df_transform_category = category_variables_indexer.fit(df).transform(df)
all_predictions_columns = predictions_columns + ['gender_Indexer', 'smoking_status_Indexer']
vector_assembler_category = VectorAssembler(inputCols=all_predictions_columns, outputCol='features')
df_transform_category = vector_assembler_category.transform(df_transform_category)
df_transform_category.show()

+---+------+----+------------+-------------+------------+-------------+--------------+-----------------+-----+---------------+------+--------------+----------------------+--------------------+
|  0|gender| age|hypertension|heart_disease|ever_married|    work_type|Residence_type|avg_glucose_level|  bmi| smoking_status|stroke|gender_Indexer|smoking_status_Indexer|            features|
+---+------+----+------------+-------------+------------+-------------+--------------+-----------------+-----+---------------+------+--------------+----------------------+--------------------+
|  1|Female|18.0|           0|            0|          No|      Private|         Urban|            94.19|12.12|         smokes|     1|           0.0|                   0.0|(7,[0,1,4],[18.0,...|
|  2|  Male|58.0|           1|            0|         Yes|      Private|         Rural|           154.24| 33.7|   never_smoked|     0|           1.0|                   2.0|[58.0,33.7,1.0,0....|
|  3|Female|36.0|           0|     

In [None]:
# Viewing the response variable and the predictor variables.
df_model_category = df_transform_category.select('stroke', 'features')
df_model_category.show(10, truncate=False)

+------+----------------------------------+
|stroke|features                          |
+------+----------------------------------+
|1     |(7,[0,1,4],[18.0,12.12,94.19])    |
|0     |[58.0,33.7,1.0,0.0,154.24,1.0,2.0]|
|0     |(7,[0,1,4],[36.0,24.7,72.63])     |
|0     |[62.0,31.2,0.0,0.0,85.52,0.0,1.0] |
|1     |(7,[0,1,4],[82.0,33.2,59.32])     |
|0     |[82.0,24.0,0.0,0.0,234.5,0.0,1.0] |
|0     |(7,[0,1,4],[33.0,29.9,193.42])    |
|1     |(7,[0,1,4],[37.0,36.9,156.7])     |
|1     |(7,[0,1,4],[41.0,33.8,64.06])     |
|1     |[70.0,24.4,0.0,0.0,76.34,0.0,1.0] |
+------+----------------------------------+
only showing top 10 rows



In [None]:
# Creating the training (75% of the data) and test (25% of the data) dataframes.
df_training_category, df_test_category = df_model_category.randomSplit([.75, .25])

# Size of the training dataframe.
print(df_training_category.count())
# Size of the test dataframe.
print(df_test_category.count())

50278
16857


In [None]:
df_classifier_category = DecisionTreeClassifier(labelCol='stroke').fit(df_training_category)
df_prediction_category = df_classifier_category.transform(df_test_category)
df_prediction_category.show(truncate=False)

+------+------------------------------+----------------+----------------------------------------+----------+
|stroke|features                      |rawPrediction   |probability                             |prediction|
+------+------------------------------+----------------+----------------------------------------+----------+
|0     |(7,[0,1,4],[11.0,20.1,85.08]) |[4213.0,12977.0]|[0.24508435136707388,0.7549156486329262]|1.0       |
|0     |(7,[0,1,4],[12.0,20.8,95.99]) |[4213.0,12977.0]|[0.24508435136707388,0.7549156486329262]|1.0       |
|0     |(7,[0,1,4],[15.0,31.2,86.84]) |[4213.0,12977.0]|[0.24508435136707388,0.7549156486329262]|1.0       |
|0     |(7,[0,1,4],[16.0,22.8,74.51]) |[4213.0,12977.0]|[0.24508435136707388,0.7549156486329262]|1.0       |
|0     |(7,[0,1,4],[16.0,24.8,78.0])  |[4213.0,12977.0]|[0.24508435136707388,0.7549156486329262]|1.0       |
|0     |(7,[0,1,4],[16.0,40.2,80.5])  |[4213.0,12977.0]|[0.24508435136707388,0.7549156486329262]|1.0       |
|0     |(7,[0,1,4],

In [None]:
# Computing the model's accuracy.
accuracy_category = MulticlassClassificationEvaluator(labelCol='stroke', metricName='accuracy').evaluate(df_prediction_category)

# Computing the model's weighted precision.
precision_category = MulticlassClassificationEvaluator(labelCol='stroke', metricName='weightedPrecision').evaluate(df_prediction_category)
print(f'Model accuracy: {accuracy_category:.4f}')
print(f'Model precision: {precision_category:.4f}')

Model accuracy: 0.8312
Model precision: 0.8487


In [None]:
df_classifier_category.featureImportances

Out[47]: SparseVector(7, {0: 0.1605, 1: 0.0016, 6: 0.8379})

In [None]:
# Visualizing the decision tree graphically.
display(df_classifier_category)

treeNode
"{""index"":1,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[2.0],""feature"":6,""overflow"":false}"
"{""index"":0,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":5,""featureType"":""categorical"",""prediction"":null,""threshold"":null,""categories"":[1.0],""feature"":6,""overflow"":false}"
"{""index"":3,""featureType"":""continuous"",""prediction"":null,""threshold"":56.5,""categories"":null,""feature"":0,""overflow"":false}"
"{""index"":2,""featureType"":null,""prediction"":0.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":4,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":7,""featureType"":""continuous"",""prediction"":null,""threshold"":66.5,""categories"":null,""feature"":0,""overflow"":false}"
"{""index"":6,""featureType"":null,""prediction"":1.0,""threshold"":null,""categories"":null,""feature"":null,""overflow"":false}"
"{""index"":11,""featureType"":""continuous"",""prediction"":null,""threshold"":73.5,""categories"":null,""feature"":0,""overflow"":false}"
"{""index"":9,""featureType"":""continuous"",""prediction"":null,""threshold"":20.65,""categories"":null,""feature"":1,""overflow"":false}"


# Question 13

Which of these variables is the most important in the decision tree model you built in question (12)?

In [None]:
all_predictions_columns

Out[49]: ['age',
 'bmi',
 'hypertension',
 'heart_disease',
 'avg_glucose_level',
 'gender_Indexer',
 'smoking_status_Indexer']

In [None]:
df_classifier_category.featureImportances

Out[50]: SparseVector(7, {0: 0.1605, 1: 0.0016, 6: 0.8379})

In [None]:
for feature in zip(all_predictions_columns, df_classifier_category.featureImportances):
    print(f'Variable: {feature[0].upper()}\tContribution: {feature[1]:.4f}')

Variable: AGE	Contribution: 0.1605
Variable: BMI	Contribution: 0.0016
Variable: HYPERTENSION	Contribution: 0.0000
Variable: HEART_DISEASE	Contribution: 0.0000
Variable: AVG_GLUCOSE_LEVEL	Contribution: 0.0000
Variable: GENDER_INDEXER	Contribution: 0.0000
Variable: SMOKING_STATUS_INDEXER	Contribution: 0.8379
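
The same pairing can be sketched without Spark by expanding the sparse importances into a dense list and sorting it, which makes the ranking explicit. The dictionary below is copied from the `SparseVector` output above, where indices not listed have importance 0. The variable names `sparse_importances`, `dense`, and `ranked` are illustrative.

```python
feature_names = [
    "age", "bmi", "hypertension", "heart_disease",
    "avg_glucose_level", "gender_Indexer", "smoking_status_Indexer",
]

# Non-zero entries copied from SparseVector(7, {0: 0.1605, 1: 0.0016, 6: 0.8379}).
sparse_importances = {0: 0.1605, 1: 0.0016, 6: 0.8379}

# Expand to one value per feature; missing indices mean an importance of 0.
dense = [sparse_importances.get(i, 0.0) for i in range(len(feature_names))]

# Sort features from most to least important.
ranked = sorted(zip(feature_names, dense), key=lambda pair: pair[1], reverse=True)

for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```

The importances sum to 1, so each value can be read as the share of the tree's total impurity reduction attributed to that feature.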


# Question 14

What is the depth of the decision tree from question (12)?

In [None]:
# Depth of the decision tree.
df_classifier_category.depth

Out[52]: 5

# Question 15

How many nodes does the decision tree have?

In [None]:
# Number of nodes in the decision tree.
df_classifier_category.numNodes

Out[53]: 13
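
The depth and node count can be related with a little arithmetic. Spark's decision trees are binary, so every internal node has exactly two children, and a tree with n nodes has (n + 1) / 2 leaves. The sketch below applies that to the reported values. `leaf_count` and `max_nodes` are illustrative helpers, not Spark APIs.

```python
def leaf_count(num_nodes):
    """Leaves of a binary tree in which every internal node has two children."""
    return (num_nodes + 1) // 2


def max_nodes(depth):
    """Node count of a complete binary tree of the given depth."""
    return 2 ** (depth + 1) - 1


# Values reported by df_classifier_category.numNodes and .depth above.
num_nodes, depth = 13, 5
leaves = leaf_count(num_nodes)
internal = num_nodes - leaves

print(leaves, internal, max_nodes(depth))
```

With 13 nodes out of a possible 63 at depth 5, the tree is sparse. The reported depth also equals the default `maxDepth` of `DecisionTreeClassifier`, which is 5, so the tree may have been stopped by that limit and not by a lack of useful splits.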