# Spurious Correlation Analysis

In [1]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('../../data/final_features_df.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Age,Income,faves_pca0,faves_pca1,unfaves_pca0,unfaves_pca1,accessories,alcohol,animamted,...,Drama.2,Entertainment (Variety Shows),Factual,Learning,Music,News,Religion & Ethics,Sport.1,Weather,Rating_bin
0,0,62,1,-0.321485,0.0786,-0.19967,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1,1,62,1,-0.321485,0.0786,-0.19967,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
2,2,62,1,-0.321485,0.0786,-0.19967,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
3,3,62,1,-0.321485,0.0786,-0.19967,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
4,4,62,1,-0.321485,0.0786,-0.19967,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0


## Training the baseline model with the correlation

In [3]:
df_0 = df.fillna(0)
y = df_0['Rating_bin']
X = df_0.drop(columns=['Rating_bin', 'Gender_F'])

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42, train_size=0.85)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
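y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred))
# Not displayed: the value is discarded because this is not the cell's last expression
confusion_matrix(y_val, y_pred)

# Reuse y_pred instead of predicting on X_val a second time
print('F1-score:', f1_score(y_val, y_pred))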

              precision    recall  f1-score   support

           0       0.90      0.91      0.91      4635
           1       0.43      0.40      0.41       783

    accuracy                           0.84      5418
   macro avg       0.66      0.65      0.66      5418
weighted avg       0.83      0.84      0.83      5418

F1-score: 0.4127405441274054


## Training the baseline model without the correlation

In [4]:
df_1 = df.fillna(0)
y = df_1['Rating_bin']
X = df_1.drop(columns=['Unnamed: 0', 'Rating_bin', 'Gender_F'])
# Note that here we also drop the 'Unnamed: 0' column, which is a sequential index

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42, train_size=0.85)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
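y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred))
# Not displayed: the value is discarded because this is not the cell's last expression
confusion_matrix(y_val, y_pred)

# Reuse y_pred instead of predicting on X_val a second time
print('F1-score:', f1_score(y_val, y_pred))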

              precision    recall  f1-score   support

           0       0.86      1.00      0.92      4635
           1       0.55      0.03      0.06       783

    accuracy                           0.86      5418
   macro avg       0.70      0.51      0.49      5418
weighted avg       0.81      0.86      0.80      5418

F1-score: 0.05575757575757576
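
The comparison above can be wrapped in a small helper. The sketch below is not part of the original analysis: the helper name `f1_with_and_without`, the toy column names, and the synthetic data are all assumptions made for illustration. The toy rows are stored in consecutive blocks, each block having its own positive rate, which is one way a sequential index can become predictive under a random split. Whether the same mechanism applies to `final_features_df.csv` is not established here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def f1_with_and_without(df, target, suspect, seed=42):
    """Validation F1 with and without the `suspect` column (hypothetical helper)."""
    scores = {}
    for label, dropped in (("with", [target]), ("without", [target, suspect])):
        X = df.drop(columns=dropped)
        y = df[target]
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, random_state=seed, train_size=0.85
        )
        # Fixed seed so that repeated runs give the same tree
        clf = DecisionTreeClassifier(random_state=seed)
        clf.fit(X_train, y_train)
        scores[label] = f1_score(y_val, clf.predict(X_val))
    return scores


# Synthetic data (NOT the notebook's dataset): 20 blocks of 100 consecutive rows.
# Blocks alternate between a low and a high positive rate.
rng = np.random.default_rng(0)
n_blocks, rows_per_block = 20, 100
n_rows = n_blocks * rows_per_block
block_rate = np.where(np.arange(n_blocks) % 2 == 0, 0.05, 0.8)
row_rate = np.repeat(block_rate, rows_per_block)

toy = pd.DataFrame({
    "row_id": np.arange(n_rows),                       # sequential index, like 'Unnamed: 0'
    "noise": rng.normal(size=n_rows),                  # uninformative feature
    "target": (rng.random(n_rows) < row_rate).astype(int),
})

scores = f1_with_and_without(toy, "target", "row_id")
print(scores)
```

In this toy setting the tree can use `row_id` to recover which block a validation row belongs to, so the score with the index is higher than the score without it.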


## Analysis of a possible cause of the correlation

The dataset follows an ordering pattern in its ads. For each theme, 15 ads are presented, and within each theme the ads always follow the same order:
- 5 text-only ads
- 5 ads with text and a product image
- 5 image-only ads (internet banners)

Given this, the model may have picked up a correlation with the order. For example, the probability that a person clicks on the last 5 ads of each theme might be higher.
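
The ordering described above can be written as a small lookup. This is a minimal sketch: the function name `ad_format` and the format labels are ours, and the 16th position is treated as an image ad because one theme in the data has an extra image.

```python
def ad_format(ad_number: int) -> str:
    """Map an ad's position within its theme to its format (hypothetical helper)."""
    if 1 <= ad_number <= 5:
        return "text"
    if 6 <= ad_number <= 10:
        return "text+image"
    if 11 <= ad_number <= 16:  # one theme has a 16th ad, which is an image
        return "image"
    raise ValueError(f"unexpected ad number: {ad_number}")


for n in (1, 7, 15):
    print(n, ad_format(n))
```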

In [5]:
df_3 = pd.read_csv('../../data/AllUsers_Ads_Ratings_df.csv', low_memory=False)
df_3.head()

Unnamed: 0,UserId,AdId,Age,Cap/Zip-Code,Countries visited,Fave Sports,Gender,Home country,Home town,Income,...,fave8,fave9,unfave1,unfave2,unfave3,unfave4,unfave5,unfave6,AdFilePath,Rating
0,U0001,A01_01,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0
1,U0001,A01_02,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0
2,U0001,A01_03,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0
3,U0001,A01_04,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0
4,U0001,A01_05,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0


In [6]:
df_3['Rating_bin'] = df_3['Rating'].apply(lambda x: 1 if x >= 4 else 0)
df_3.head()

Unnamed: 0,UserId,AdId,Age,Cap/Zip-Code,Countries visited,Fave Sports,Gender,Home country,Home town,Income,...,fave9,unfave1,unfave2,unfave3,unfave4,unfave5,unfave6,AdFilePath,Rating,Rating_bin
0,U0001,A01_01,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0,0
1,U0001,A01_02,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0,0
2,U0001,A01_03,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0,0
3,U0001,A01_04,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0,0
4,U0001,A01_05,62,15613,United States of America,I do not like Sports,F,United States of America,Apollo,1,...,,news headlines,homelessness,violence,war,human rights,,../../data/ads16-dataset/ADS16_Benchmark_part1...,1.0,0


In [7]:
df_4 = df_3[['AdFilePath', 'Rating_bin']]
df_4.head()

Unnamed: 0,AdFilePath,Rating_bin
0,../../data/ads16-dataset/ADS16_Benchmark_part1...,0
1,../../data/ads16-dataset/ADS16_Benchmark_part1...,0
2,../../data/ads16-dataset/ADS16_Benchmark_part1...,0
3,../../data/ads16-dataset/ADS16_Benchmark_part1...,0
4,../../data/ads16-dataset/ADS16_Benchmark_part1...,0


In [8]:
# Work on an explicit copy so that the assignment below does not raise SettingWithCopyWarning
df_4 = df_4.copy()
df_4['ad-number'] = df_4['AdFilePath'].apply(lambda x: x.split('/')[-1].split('.')[0])
df_4.head()


Unnamed: 0,AdFilePath,Rating_bin,ad-number
0,../../data/ads16-dataset/ADS16_Benchmark_part1...,0,1
1,../../data/ads16-dataset/ADS16_Benchmark_part1...,0,2
2,../../data/ads16-dataset/ADS16_Benchmark_part1...,0,3
3,../../data/ads16-dataset/ADS16_Benchmark_part1...,0,4
4,../../data/ads16-dataset/ADS16_Benchmark_part1...,0,5


In [9]:
# Text ads
group_0 = df_4.loc[df_4['ad-number'].isin(['1','2','3','4','5'])]

# Product ads (text and product image)
group_1 = df_4.loc[df_4['ad-number'].isin(['6','7','8','9','10'])]

# Image ads
group_2 = df_4.loc[df_4['ad-number'].isin(['11','12','13','14','15', '16'])] # One theme has 16 ads; the extra one is an image

In [10]:
import numpy as np

p0_1 = np.mean(group_0['Rating_bin'])
p0_0 = 1 - p0_1

print('Priors for group 0')
print('1:', p0_1)
print('0:', p0_0)

Priors for group 0
1: 0.1325
0: 0.8674999999999999


In [11]:
p1_1 = np.mean(group_1['Rating_bin'])
p1_0 = 1 - p1_1

print('Priors for group 1')
print('1:', p1_1)
print('0:', p1_0)

Priors for group 1
1: 0.133
0: 0.867


In [12]:
p2_1 = np.mean(group_2['Rating_bin'])
p2_0 = 1 - p2_1

print('Priors for group 2')
print('1:', p2_1)
print('0:', p2_0)

Priors for group 2
1: 0.13655115511551155
0: 0.8634488448844885


As shown above, the correlation does not appear to be related to the ad format (text, product, or image): the prior of class 1 is roughly 0.13 in all three groups. It may instead be related to the ad's position, so we now check the priors for each ad number individually.
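
One way to judge whether two priors differ by more than sampling noise is a pooled two-proportion z-test. The sketch below is an illustration only: the function name `two_proportion_z` is ours, and the counts passed to it are hypothetical, not taken from the dataset. As a rule of thumb, an absolute z value below about 1.96 is not significant at the 5% level.

```python
import math


def two_proportion_z(pos_a: int, n_a: int, pos_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for the rates pos_a/n_a and pos_b/n_b."""
    p_a = pos_a / n_a
    p_b = pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err


# Hypothetical counts: 130 positives out of 1000 versus 140 positives out of 1000
z = two_proportion_z(130, 1000, 140, 1000)
print('z:', z)
```

With the real data, the counts would come from `group['Rating_bin'].sum()` and `len(group)` for each pair of groups.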

In [13]:
possible_ads = df_4['ad-number'].unique()

for ad_number in possible_ads:
    print(f'Evaluating ad {ad_number}')
    group = df_4.loc[df_4['ad-number'] == ad_number]
    p1 = np.mean(group['Rating_bin'])
    p0 = 1 - p1

    print(f'Priors for ads numbered {ad_number}')
    print('1:', p1)
    print('0:', p0)

    print('---')

Evaluating ad 1
Priors for ads numbered 1
1: 0.12875
0: 0.87125
---
Evaluating ad 2
Priors for ads numbered 2
1: 0.13833333333333334
0: 0.8616666666666667
---
Evaluating ad 3
Priors for ads numbered 3
1: 0.12916666666666668
0: 0.8708333333333333
---
Evaluating ad 4
Priors for ads numbered 4
1: 0.14083333333333334
0: 0.8591666666666666
---
Evaluating ad 5
Priors for ads numbered 5
1: 0.12541666666666668
0: 0.8745833333333333
---
Evaluating ad 6
Priors for ads numbered 6
1: 0.13208333333333333
0: 0.8679166666666667
---
Evaluating ad 7
Priors for ads numbered 7
1: 0.12916666666666668
0: 0.8708333333333333
---
Evaluating ad 8
Priors for ads numbered 8
1: 0.14166666666666666
0: 0.8583333333333334
---
Evaluating ad 9
Priors for ads numbered 9
1: 0.13291666666666666
0: 0.8670833333333333
---
Evaluating ad 10
Priors for ads numbered 10
1: 0.12916666666666668
0: 0.8708333333333333
---
Evaluating ad 11
Priors for ads numbered 11
1

With the exception of ad 16, all ad numbers show similar results, so the ad number alone does not reveal an explicit cause.
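
The per-ad priors can also be collected into a single table with `groupby`, together with the number of rows behind each estimate. This is a minimal sketch on a toy frame: the data is made up, and only the column names `ad-number` and `Rating_bin` follow the notebook. The row count matters for ad 16, which exists in only one theme and therefore rests on far fewer ratings than the other ad numbers.

```python
import pandas as pd

# Toy data (NOT the notebook's dataset), using the same column names as df_4
toy = pd.DataFrame({
    "ad-number": ["1", "1", "2", "2", "2", "16"],
    "Rating_bin": [0, 1, 0, 0, 1, 1],
})

priors = (
    toy.groupby("ad-number")["Rating_bin"]
    .agg(prior_1="mean", n="size")                     # share of 1s and row count per ad number
    .assign(prior_0=lambda t: 1 - t["prior_1"])
)

# 'ad-number' holds strings, so sort the index numerically rather than lexically
priors = priors.loc[sorted(priors.index, key=int)]
print(priors)
```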

## Conclusion

Based on the preliminary analysis above, we could not find an explicit correlation between the ad number and a higher prior for class 1. That was our initial hypothesis to explain the large impact of adding the sequential column to the model. The spurious correlation clearly exists (the class 1 F1-score drops from about 0.41 to about 0.06 when the sequential column is removed), but we were unable to identify its cause.