# **1. Introduction**

Logistic regression is a popular method for predicting a categorical response. It is a special case of generalized linear models that predicts the probability of each outcome. In `spark.ml`, binomial logistic regression predicts a binary outcome and multinomial logistic regression predicts a multiclass outcome. Use the `family` parameter to select between these two algorithms, or leave it unset and Spark will infer the correct variant.

<font size=2>**Source:** [MLlib](https://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression)</font>
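
As a rough illustration of that inference, the rule documented for `family="auto"` picks the binomial variant when the label has one or two classes and the multinomial variant otherwise. The plain-Python sketch below mirrors that rule; `resolve_family` is a hypothetical helper written for this note, not part of `pyspark`.

```python
def resolve_family(family: str, num_classes: int) -> str:
    """Return the variant that would be trained for a given `family` setting.

    Hypothetical helper that mirrors the documented behaviour of the
    `family` parameter ("auto", "binomial", "multinomial").
    """
    if family not in ("auto", "binomial", "multinomial"):
        raise ValueError(f"unsupported family: {family!r}")
    if family == "auto":
        # Documented rule: 1 or 2 classes -> binomial, otherwise multinomial.
        return "binomial" if num_classes <= 2 else "multinomial"
    if family == "binomial" and num_classes > 2:
        # The binomial variant only handles binary labels.
        raise ValueError("binomial family requires at most 2 classes")
    return family


print(resolve_family("auto", 2))         # binomial
print(resolve_family("auto", 5))         # multinomial
print(resolve_family("multinomial", 2))  # multinomial
```

The label in this notebook is binary (churn vs. non-churn), so leaving `family` unset would select the binomial variant.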

 <img src="https://miro.medium.com/max/1400/0*1KnKYuv0UDu_1-qM.gif?width=1191&height=670" alt="My Figure">

## **1.1 Loading pyspark**

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("Classificação com Spark 1.2").getOrCreate()

In [2]:
spark

## **1.2 Loading the main functions**

In [3]:
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
import pandas as pd
from datetime import datetime

start_time = datetime.now()

import sys  # needed for julab; not needed in VS Code
sys.path.append('../../../')  # needed for julab; not needed in VS Code
julab = '../../../'  # path prefix to the project root, reused when reading data
from work.src.utils import *


n = 'best_lr_model'
caminho_modelo = 'work/models'

In [4]:
n_bootstrap = 1000

## **2.3 Metrics: VALIDATION SET**

<font size=2>**Documentation:**</font>
<font size=2>[LogisticRegressionTrainingSummary](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegressionTrainingSummary.html)</font>

In [5]:
predictions_val = spark.read.orc(f'{julab}work/data/final/predictions_val_{n}.orc')

In [6]:
# predictions_val.show(truncate=False)
# +-----------------------------------------------------------------------------------------------------------+-----+------------------------------------------+----------------------------------------+----------+
# |features                                                                                                   |label|rawPrediction                             |probability                             |prediction|
# +-----------------------------------------------------------------------------------------------------------+-----+------------------------------------------+----------------------------------------+----------+
# |(24,[1,2,5,6,10,11,12,13,14,18,23],[12.0,75.85,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                       |0    |[0.05103567311346513,-0.05103567311346513]|[0.5127561496338285,0.48724385036617146]|0.0       |
# |(24,[1,2,3,5,8,12,13,14,19,21],[69.0,61.45,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                               |0    |[2.382789551862372,-2.382789551862372]    |[0.9155054708840569,0.08449452911594313]|0.0       |
# |(24,[1,2,3,5,12,13,15,17,22],[46.0,80.8824189403559,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                          |1    |[-1.1704365714764247,1.1704365714764247]  |[0.23677608099237907,0.7632239190076209]|1.0       |
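
In the rows above, `rawPrediction` holds the margin for each class and `probability` is the logistic function of that margin. The sketch below checks this on the three displayed rows using only the standard library; the 0.5 cut-off is an assumption (it is the usual default threshold), since the notebook does not show the model's settings.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps a margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# rawPrediction values copied from the three rows shown above.
raw_predictions = [
    (0.05103567311346513, -0.05103567311346513),
    (2.382789551862372, -2.382789551862372),
    (-1.1704365714764247, 1.1704365714764247),
]

for raw_class0, raw_class1 in raw_predictions:
    p1 = sigmoid(raw_class1)               # probability of class 1
    p0 = 1.0 - p1                          # probability of class 0
    prediction = 1.0 if p1 > 0.5 else 0.0  # assumed 0.5 threshold
    print(f"p0={p0:.6f}  p1={p1:.6f}  prediction={prediction}")
```

The printed probabilities agree with the `probability` column to the displayed precision, and the thresholded value agrees with the `prediction` column.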

In [7]:
print('VALIDATION SET SUMMARY METRICS')

auc_roc = calculate_auc_roc(predictions_val)
print(f"AUC ROC: {auc_roc}")
auc_pr = calculate_auc_pr(predictions_val)
print(f"AUC PR: {auc_pr}")
ks = calculate_ks(predictions_val)
print(f"KS: {ks}")

VALIDATION SET SUMMARY METRICS
AUC ROC: 0.7897105323794532
AUC PR: 0.7617664882695159
KS: 0.5794210647589065
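
The helpers `calculate_auc_roc` and `calculate_ks` come from `work.src.utils` and are not shown in this notebook. As a reference for what these two metrics usually mean, here is a minimal plain-Python sketch on toy data: the ROC curve is built by sweeping the threshold over every distinct score, AUC ROC is the area under it, and KS is the largest gap between the true positive rate and the false positive rate. The function names are illustrative, not the author's implementation.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_roc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def ks_statistic(points):
    """Largest gap between TPR and FPR over all thresholds."""
    return max(tpr - fpr for fpr, tpr in points)


# Toy data, not the notebook's predictions.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
print(auc_roc(points))       # 0.75
print(ks_statistic(points))  # 0.5
```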


In [8]:
calculate_confusion_matrix(predictions_val)

{'TP': 870, 'TN': 774, 'FP': 235, 'FN': 201}
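
The four counts above are enough to recover the usual rates. The sketch below does that in plain Python; `rates_from_confusion` is a hypothetical helper written for this note, not part of `work.src.utils`.

```python
def rates_from_confusion(cm):
    """Derive standard rates from a dict with TP, TN, FP and FN counts."""
    tp, tn, fp, fn = cm["TP"], cm["TN"], cm["FP"], cm["FN"]
    tpr = tp / (tp + fn)  # recall / sensitivity
    fpr = fp / (fp + tn)  # false positive rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tpr,
        "fpr": fpr,
        "tpr_minus_fpr": tpr - fpr,
        "balanced_accuracy": (tpr + (1.0 - fpr)) / 2.0,
    }


validation_cm = {"TP": 870, "TN": 774, "FP": 235, "FN": 201}
for name, value in rates_from_confusion(validation_cm).items():
    print(f"{name}: {value:.4f}")
```

For the validation counts, `tpr_minus_fpr` comes out at about 0.5794 and `balanced_accuracy` at about 0.7897, the same values reported as KS and AUC ROC for this set. That is what one would expect if those helpers score the hard `prediction` column at a single threshold rather than the `probability` column. The source does not show their implementation, so treat this as an observation to check rather than a conclusion.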

In [9]:
calcula_mostra_matriz_confusao(predictions_val, normalize=False)

                     Predicted
                Churn       Non-Churn
     Churn        870         201
Actual
     Non-Churn    235         774


## **2.4 Metrics: TEST SET**

In [10]:
predictions_test = spark.read.orc(f'{julab}work/data/final/predictions_test_{n}.orc')

In [11]:
print('TEST SET SUMMARY METRICS')

auc_roc = calculate_auc_roc(predictions_test)
print(f"AUC ROC: {auc_roc}")
auc_pr = calculate_auc_pr(predictions_test)
print(f"AUC PR: {auc_pr}")
ks = calculate_ks(predictions_test)
print(f"KS: {ks}")

TEST SET SUMMARY METRICS
AUC ROC: 0.7630023396167441
AUC PR: 0.7168174012626157
KS: 0.5260046792334883


In [12]:
calculate_confusion_matrix(predictions_test)

{'TP': 830, 'TN': 761, 'FP': 290, 'FN': 205}

In [13]:
calcula_mostra_matriz_confusao(predictions_test, normalize=False)

                     Predicted
                Churn       Non-Churn
     Churn        830         205
Actual
     Non-Churn    290         761


# **3. Bootstrap, confidence intervals, and permutation test**

## **3.1 Bootstrap: VALIDATION**

In [14]:
predictions_val.count()

2080

In [None]:
# Compute the confidence intervals and the mean
scores, resultados = bootstrap_metric_spark(data = predictions_val, n_bootstrap = n_bootstrap)

In [16]:
pd.DataFrame(resultados).T
# 	confidence_interval	mean_score	std_dev
# ks	[0.5587163142695343, 0.5940801326675024]	0.579402	0.014293
# auc	[0.7793581571347672, 0.7970400663337511]	0.789701	0.007146
# auc_pr	[0.7587768523408239, 0.7755809192338347]	0.765438	0.007362

| | confidence_interval | mean_score | std_dev |
|---|---|---|---|
| ks | [0.5442817972833823, 0.6141448137404748] | 0.579064 | 0.017854 |
| auc | [0.7721408986416911, 0.8070724068702374] | 0.789532 | 0.008927 |
| auc_pr | [0.7382538714599085, 0.7850480036077823] | 0.761565 | 0.012205 |
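
The implementation of `bootstrap_metric_spark` is not shown in this notebook. As a point of reference, a percentile bootstrap can be sketched in plain Python as below: rows are resampled with replacement, the metric is recomputed on each resample, and the interval is read off the sorted scores. The percentile method, the function names, and the toy data are all assumptions made for this illustration; the returned keys simply mirror the columns of the table above.

```python
import random
import statistics


def bootstrap_ci(labels, predictions, metric, n_bootstrap=1000, alpha=0.05, seed=42):
    """Percentile bootstrap: resample rows with replacement and recompute the metric."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]  # row indices drawn with replacement
        scores.append(metric([labels[i] for i in idx], [predictions[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_bootstrap)]
    upper = scores[int((1 - alpha / 2) * n_bootstrap) - 1]
    return {
        "confidence_interval": [lower, upper],
        "mean_score": statistics.mean(scores),
        "std_dev": statistics.stdev(scores),
    }


def accuracy(labels, predictions):
    """Share of rows where the prediction equals the label."""
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)


# Toy data, not the notebook's predictions: 200 rows, roughly 80% predicted correctly.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
predictions = [y if rng.random() < 0.8 else 1 - y for y in labels]

result = bootstrap_ci(labels, predictions, accuracy, n_bootstrap=500)
print(result)
```

Any function that takes labels and predictions and returns a number can be passed as `metric`, so the same loop would serve for KS or AUC.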


In [17]:
scores_df = df_scores(scores)
scores_df

# ks.scores	auc.scores	auc_pr.scores
# 0	0.583983	0.791992	0.769682
# 1	0.576596	0.788298	0.758596
# 2	0.556730	0.778365	0.762273
# 3	0.595138	0.797569	0.776236
# 4	0.584564	0.792282	0.760402

| | ks.scores | auc.scores | auc_pr.scores |
|---|---|---|---|
| 0 | 0.583983 | 0.791992 | 0.769682 |
| 1 | 0.576596 | 0.788298 | 0.758596 |
| 2 | 0.556730 | 0.778365 | 0.762273 |
| 3 | 0.595138 | 0.797569 | 0.776236 |
| 4 | 0.584564 | 0.792282 | 0.760402 |
| ... | ... | ... | ... |
| 995 | 0.600717 | 0.800358 | 0.776899 |
| 996 | 0.597139 | 0.798569 | 0.774559 |
| 997 | 0.605765 | 0.802883 | 0.766239 |
| 998 | 0.556320 | 0.778160 | 0.753546 |


## **3.2 Bootstrap and permutation test**

* SET: VALIDATION
* SET: TEST

In [18]:
print(predictions_val.count())
print(predictions_test.count())
# 2080
# 2086

2080
2086


In [None]:
resultados_1_2_permutacion, resultados_1_2 = bootstrap_metric_spark_permutacion(data1 = predictions_val, data2 = predictions_test, n_bootstrap = n_bootstrap)
# Iteration 1/5
# Sample1 count: 2129
# Sample2 count: 2136
# ----------
# Iteration 2/5
# Sample1 count: 2060
# Sample2 count: 2066
# ----------
# Iteration 3/5
# Sample1 count: 2062
# Sample2 count: 2066
# ----------
# Iteration 4/5
# Sample1 count: 2069
# Sample2 count: 2075
# ----------
# Iteration 5/5
# Sample1 count: 2091
# Sample2 count: 2095
# ----------
# ########################################
# ks
# ------------------------------
# ['\n Significance test ', '**$H_0$:** The difference between the metric means is zero. \n', ' Arrays sizes: 5, 5 ', '* Difference between averages: 0.5794 - 0.5276 = 0.0518', '* p_val = 49.0000 ', 'The model seems to produce similar results with CI-0.95 (fail to reject H0).\n']
# ########################################
# auc
# ------------------------------
# ['\n Significance test ', '**$H_0$:** The difference between the metric means is zero. \n', ' Arrays sizes: 5, 5 ', '* Difference between averages: 0.7897 - 0.7638 = 0.0259', '* p_val = 49.0000 ', 'The model seems to produce similar results with CI-0.95 (fail to reject H0).\n']
# ########################################
# auc_pr
# ------------------------------
# ['\n Significance test ', '**$H_0$:** The difference between the metric means is zero. \n', ' Arrays sizes: 5, 5 ', '* Difference between averages: 0.7654 - 0.7160 = 0.0494', '* p_val = 18.0000 ', 'The model seems to produce similar results with CI-0.95 (fail to reject H0).\n']

Iteration 1/1000
Sample1 count: 2129
Sample2 count: 2136
----------
Iteration 2/1000
Sample1 count: 2060
Sample2 count: 2066
----------
Iteration 3/1000
Sample1 count: 2062
Sample2 count: 2066
----------
Iteration 4/1000
Sample1 count: 2069
Sample2 count: 2075
----------
Iteration 5/1000
Sample1 count: 2091
Sample2 count: 2095
----------
Iteration 6/1000
Sample1 count: 2114
Sample2 count: 2119
----------
Iteration 7/1000
Sample1 count: 2099
Sample2 count: 2101
----------
Iteration 8/1000
Sample1 count: 2096
Sample2 count: 2104
----------
Iteration 9/1000
Sample1 count: 2053
Sample2 count: 2059
----------
Iteration 10/1000
Sample1 count: 2069
Sample2 count: 2073
----------
Iteration 11/1000
Sample1 count: 2034
Sample2 count: 2041
----------
Iteration 12/1000
Sample1 count: 2125
Sample2 count: 2136
----------
Iteration 13/1000
Sample1 count: 2074
Sample2 count: 2086
----------
Iteration 14/1000
Sample1 count: 2076
Sample2 count: 2082
----------
Iteration 15/1000
Sample1 count: 2056
Sample2 count: 2059


In [None]:
pd.DataFrame(resultados_1_2).T
# 	confidence_interval1	mean_score1	std_dev1	confidence_interval2	mean_score2	std_dev2
# ks	[0.5587163142695343, 0.5940801326675024]	0.579402	0.014293	[0.4926559689045311, 0.5510315509616689]	0.527572	0.02525
# auc	[0.7793581571347672, 0.7970400663337511]	0.789701	0.007146	[0.7463279844522657, 0.7755157754808344]	0.763786	0.012625
# auc_pr	[0.7587768523408239, 0.7755809192338347]	0.765438	0.007362	[0.6988262164456484, 0.7300824389636984]	0.716001	0.01349

In [None]:
df_scores_result = df_scores_1_2(resultados_1_2_permutacion)
df_scores_result
# ks p_value: 49.0
# ks mean_diff: 0.05183035078404763
# auc p_value: 49.0
# auc mean_diff: 0.025915175392023593
# auc_pr p_value: 18.0
# auc_pr mean_diff: 0.04943703014541301
# ks.scores1	ks.scores2	auc.scores1	auc.scores2	auc_pr.scores1	auc_pr.scores2
# 0	0.583983	0.490242	0.791992	0.745121	0.769682	0.706142
# 1	0.576596	0.551584	0.788298	0.775792	0.758596	0.723848
# 2	0.556730	0.514383	0.778365	0.757191	0.762273	0.721227
# 3	0.595138	0.535587	0.797569	0.767793	0.776236	0.730775
# 4	0.584564	0.546064	0.792282	0.773032	0.760402	0.698013
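
The helper `bootstrap_metric_spark_permutacion` is not shown in this notebook. For reference, a two-sided permutation test on the difference between two sets of bootstrap scores can be sketched in plain Python as follows. It uses the five `ks.scores1` and `ks.scores2` values listed in the comment above; `permutation_test` and its parameters are hypothetical names, and the p-value is returned as a proportion between 0 and 1.

```python
import random


def permutation_test(scores1, scores2, n_permutations=10000, seed=42):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the share of shufflings whose
    absolute difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n1 = len(scores1)
    observed = sum(scores1) / n1 - sum(scores2) / len(scores2)
    pooled = list(scores1) + list(scores2)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign scores to the two groups at random
        diff = sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / (len(pooled) - n1)
        if abs(diff) >= abs(observed) - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# KS scores from the five-iteration run listed above (validation vs. test).
ks_scores1 = [0.583983, 0.576596, 0.556730, 0.595138, 0.584564]
ks_scores2 = [0.490242, 0.551584, 0.514383, 0.535587, 0.546064]

diff, p_value = permutation_test(ks_scores1, ks_scores2)
print(f"mean difference = {diff:.4f}, p-value = {p_value:.4f}")
```

In these ten scores every validation value exceeds every test value, so only 2 of the 252 possible 5-vs-5 splits are as extreme as the observed one and the estimated proportion is small. The log above reports `p_val = 49.0000`, which cannot be a proportion; the source does not show how the helper computes or scales it, so the two numbers should not be compared directly.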


# **4. Execution time**

In [None]:
end_time = datetime.now()
execution_time = end_time - start_time

print(f"Execution time: {execution_time}")
