# Activity 6

In this activity we run some tests with the Naive Bayes classifier on our dataset, adapting it to the characteristics of our data.


## 0. Environment setup
For this activity we use the dataset previously uploaded to GitHub, [at this link](https://github.com/danielbdias/pattern-recognition-studies.git), together with some helper scripts.

In [1]:
# prepares the local Google Colab machine to receive the dataset (when needed)
# and downloads the helper scripts used to build the notebook
# !rm -rf ./*

# !git clone https://github.com/danielbdias/pattern-recognition-studies.git
# !mv ./pattern-recognition-studies/* ./
# !rm -rf ./pattern-recognition-studies
# !rm -rf ./sample_data

# !pip install -r requirements.txt

In [2]:
# imports of the libraries needed for the analyses
from scripts.database import load_datapoints_with_targets
from scripts.preprocessing import centralize_observations, principal_component_analysis

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold

## 1. Dataset

For our work we use the [Grammatical Facial Expressions](https://archive.ics.uci.edu/ml/datasets/Grammatical+Facial+Expressions) dataset, which describes grammatical facial expressions of Brazilian Sign Language (Libras).

The dataset has `27965 instances`, split across 9 expressions: `WH question`, `Yes/no question`, `Doubt question`, `Negative`, `Affirmative`, `Conditional`, `Relative`, `Topic` and `Focus`.

Each instance has `300 features`, which represent 100 face points with (x, y, z) coordinates, in the order below:

| Points (x,y,z) | Face region | Points (x,y,z) | Face region |
| --- | --- | --- | --- |
| 0 - 7 (x,y,z) | left eye | 68 - 86 (x,y,z) | face contour |
| 8 - 15 (x,y,z) | right eye | 87 (x,y,z) | left iris |
| 16 - 25 (x,y,z) | left eyebrow | 88 (x,y,z) | right iris |
| 26 - 35 (x,y,z) | right eyebrow | 89 (x,y,z) | nose tip |
| 36 - 47 (x,y,z) | nose | 90 - 94 (x,y,z) | line above the left eyebrow |
| 48 - 67 (x,y,z) | mouth | 95 - 99 (x,y,z) | line above the right eyebrow |

There are no missing values in these points, and they have not gone through any normalization process. Each instance has a binary label: "With Expression" (`"Expression"`, the points represent the facial expression) or "Without Expression" (`"Not Expression"`, the points do not represent an expression).

We normalized the dataset by centering each frame (observation) on a common reference point within the frame.
We chose point 89 (`nose tip`) as the reference and applied the process in the following steps:

1.   We compute the mean `x` and `y` values of point 89 (we ignore `z` because it is measured in millimeters rather than pixels, [reference](https://archive.ics.uci.edu/ml/datasets/Grammatical+Facial+Expressions#));
2.   We compute, for each observation, the deltas between its point 89 and those mean values;
3.   We subtract that delta from every point of the observation.

In this activity we use only the observations of the `negative` facial expression in their preprocessed form (with centered points), in four different variants:

- Dataset with all features
- Dataset transformed by PCA (from Activity 03)
- Dataset with the features selected by Relief-F (from Activity 04)
- Dataset with the features selected by the Genetic Algorithm (from Activity 04)
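
The three steps above can be sketched with NumPy as follows. This is only an illustration, not the code of `centralize_observations` from the repository: the function name `centralize_on_point` and the column layout (flattened `x, y, z` triples for points 0 to 99, in order) are assumptions made for the example.

```python
import numpy as np

def centralize_on_point(frames, ref_point=89):
    """Shift each frame so its reference point lands on the mean reference position.

    frames: array of shape (n_frames, 300), assumed to be laid out as
    x0, y0, z0, x1, y1, z1, ... Only x and y are shifted; z is left untouched.
    """
    frames = np.asarray(frames, dtype=float)
    points = frames.reshape(len(frames), -1, 3).copy()  # (n_frames, n_points, 3)

    # Step 1: mean x and y of the reference point over all frames.
    mean_xy = points[:, ref_point, :2].mean(axis=0)

    # Step 2: per-frame delta between the reference point and that mean.
    delta_xy = points[:, ref_point, :2] - mean_xy  # (n_frames, 2)

    # Step 3: subtract the delta from the x and y of every point in the frame.
    points[:, :, :2] -= delta_xy[:, None, :]

    return points.reshape(len(frames), -1)
```

After this shift, the reference point has the same `(x, y)` position in every frame, so the remaining variation in the other points reflects movement relative to the nose tip rather than movement of the whole face in the image.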


### 1.1 Full dataset (all features)

The dataset with all features has 300 dimensions, structured as follows:

In [7]:
category = 'negative'
raw_data = load_datapoints_with_targets(category)

dataset = centralize_observations(raw_data)
dataset.describe()

| | 0x | 0y | 0z | 1x | 1y | 1z | 2x | 2y | 2z | 3x | ... | 97x | 97y | 97z | 98x | 98y | 98z | 99x | 99y | 99z | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | ... | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 |
| mean | 309.397827 | 219.230181 | 1060.284922 | 306.747875 | 217.450552 | 1131.164819 | 303.378426 | 216.894622 | 1193.764597 | 300.074385 | ... | 336.2139 | 205.329225 | 1272.370658 | 341.628362 | 207.474289 | 1184.382114 | 344.936643 | 210.218903 | 986.881744 | 0.458241 |
| std | 1.625597 | 2.741896 | 439.370887 | 1.766325 | 2.647455 | 361.55821 | 1.909747 | 2.545192 | 266.93004 | 2.06836 | ... | 3.90509 | 5.389459 | 89.417381 | 3.914957 | 6.521451 | 351.531081 | 3.385256 | 7.330508 | 555.178027 | 0.498345 |
| min | 303.950854 | 209.79712 | 0.0 | 300.704854 | 207.54012 | 0.0 | 296.182854 | 206.90412 | 0.0 | 291.035854 | ... | 324.399854 | 191.58812 | 0.0 | 329.433854 | 190.30512 | 0.0 | 333.625854 | 190.42912 | 0.0 | 0.0 |
| 25% | 308.392854 | 217.15362 | 1216.0 | 305.627854 | 215.37112 | 1221.0 | 302.096854 | 214.94862 | 1225.0 | 298.725104 | ... | 332.981354 | 200.48687 | 1265.0 | 338.390354 | 201.33112 | 1275.0 | 342.261354 | 203.14512 | 1208.0 | 0.0 |
| 50% | 309.606354 | 219.17962 | 1238.0 | 306.972854 | 217.45062 | 1243.0 | 303.588354 | 217.09362 | 1256.0 | 300.290354 | ... | 335.892854 | 206.15562 | 1284.0 | 341.439354 | 208.10312 | 1294.0 | 345.075854 | 210.95462 | 1294.0 | 0.0 |
| 75% | 310.441854 | 221.67462 | 1261.0 | 307.857104 | 219.73387 | 1270.0 | 304.494854 | 219.01862 | 1280.0 | 301.115854 | ... | 340.044854 | 210.09087 | 1294.0 | 345.416854 | 213.09812 | 1299.0 | 348.082604 | 216.89537 | 1309.0 | 1.0 |
| max | 314.504854 | 224.50212 | 1299.0 | 312.710854 | 222.92012 | 1304.0 | 310.230854 | 222.41312 | 1314.0 | 307.853854 | ... | 348.181854 | 214.91212 | 1329.0 | 356.601854 | 219.07712 | 1542.0 | 360.888854 | 223.07312 | 1563.0 | 1.0 |


### 1.2 Dataset with PCA

The PCA-transformed dataset, keeping only the components that together account for up to 80% of the variability in the data, is structured as follows:

In [15]:
features = list(dataset.columns)

pca_results = principal_component_analysis(features, dataset.values)
number_of_chosen_components = len(pca_results.principal_components_under_threshold)
dataset_pca = pd.DataFrame(data=pca_results.X_transformed[:, 0:number_of_chosen_components], columns=pca_results.principal_components_under_threshold)
dataset_pca.insert(number_of_chosen_components, 'target', dataset["target"].values)

dataset_pca.describe()

| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 |
| mean | 0.0 | -9.679766e-14 | 0.0 | 2.688824e-15 | 6.453177e-14 | -1.07553e-13 | -4.571001e-14 | 1.814956e-14 | -8.100082e-14 | 6.72206e-14 | 1.613294e-14 | 1.344412e-14 | -5.175986e-14 | 5.444869e-14 | 0.458241 |
| std | 1529.487923 | 828.8783 | 754.072116 | 626.0129 | 546.234 | 534.0938 | 495.7597 | 443.5333 | 442.3319 | 417.6105 | 401.5236 | 387.808 | 371.7674 | 362.1114 | 0.498345 |
| min | -2494.036515 | -1817.949 | -1412.855471 | -1434.314 | -1577.976 | -1700.658 | -1627.37 | -1685.62 | -1570.395 | -1468.439 | -1255.571 | -1326.306 | -1290.745 | -1335.36 | 0.0 |
| 25% | -1444.534931 | -499.244 | -583.204813 | -365.5466 | -230.4158 | -181.5366 | -260.631 | -186.634 | -214.5056 | -244.7499 | -209.8125 | -232.5568 | -134.6116 | -168.5787 | 0.0 |
| 50% | 357.944939 | -49.97479 | -42.891897 | -15.93658 | -6.683547 | -12.47349 | 43.15854 | -3.389786 | -12.53438 | -8.60183 | -33.32489 | -3.781624 | -10.65275 | 20.84432 | 0.0 |
| 75% | 1319.487879 | 404.2879 | 602.397126 | 363.5263 | 274.3019 | 180.3365 | 235.3098 | 188.0283 | 172.1725 | 230.1877 | 199.1701 | 227.5928 | 117.118 | 132.0544 | 1.0 |
| max | 3017.776144 | 3035.289 | 2526.584373 | 1814.297 | 1674.262 | 1884.368 | 2005.806 | 1862.771 | 1639.981 | 1543.912 | 1848.362 | 1512.133 | 1503.722 | 1508.445 | 1.0 |
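
The 80% variability cut-off used in this section can be illustrated with a small NumPy sketch. It is not the implementation of `principal_component_analysis` from the repository: `pca_under_threshold` is a hypothetical helper, and how the component that crosses the threshold is treated is a choice made here (the sketch keeps only the components whose cumulative share of variance stays at or below the threshold, and always keeps at least one).

```python
import numpy as np

def pca_under_threshold(X, threshold=0.80):
    """Project X onto the leading principal components under a variance threshold."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)

    # SVD of the centered data: the rows of Vt are the principal axes and the
    # squared singular values are proportional to the variance along each axis.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = S**2 / np.sum(S**2)
    cumulative = np.cumsum(explained_ratio)

    # Number of leading components whose cumulative share is within the threshold.
    n_components = max(1, int(np.sum(cumulative <= threshold)))

    X_transformed = X_centered @ Vt[:n_components].T
    return X_transformed, explained_ratio[:n_components]
```

Because the data is centered before projection, every projected column has a mean of zero up to floating-point error, which is consistent with the near-zero values in the `mean` row of the table above.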


### 1.3 Dataset with features selected by Relief-F

The Relief-F selection kept the following features:

In [19]:
features_chosen_by_relief = ['41x', '40x', '86x', '42x', '39x', '85x', '44x', '53y', '84x', '68x', '60x', '69x', '15x', '63y', '83x', '70x', '14x', 'target']
dataset_with_relief = dataset[features_chosen_by_relief]
dataset_with_relief.describe()

| | 41x | 40x | 86x | 42x | 39x | 85x | 44x | 53y | 84x | 68x | 60x | 69x | 15x | 63y | 83x | 70x | 14x | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 |
| mean | 315.351416 | 309.820054 | 349.447013 | 319.569119 | 308.938044 | 349.895513 | 326.612842 | 242.286661 | 349.721362 | 289.396139 | 309.200498 | 289.235498 | 329.038068 | 243.888143 | 348.805798 | 289.878882 | 332.324666 | 0.458241 |
| std | 0.492467 | 0.543975 | 3.408376 | 0.574203 | 0.611865 | 3.360874 | 0.480513 | 1.163002 | 3.364616 | 4.203998 | 1.904017 | 4.026562 | 1.264786 | 1.345394 | 3.425465 | 4.065845 | 1.345895 | 0.498345 |
| min | 313.976854 | 307.048854 | 335.942854 | 317.704854 | 305.724854 | 336.153854 | 325.002854 | 239.63812 | 336.030854 | 270.164854 | 303.034854 | 271.046854 | 323.388854 | 240.95912 | 335.005854 | 272.561854 | 326.322854 | 0.0 |
| 25% | 315.099854 | 309.461104 | 347.459104 | 319.268854 | 308.556854 | 347.919104 | 326.376854 | 241.48737 | 347.467104 | 287.664604 | 308.210854 | 287.761104 | 328.031104 | 242.96537 | 346.491854 | 288.168104 | 331.285604 | 0.0 |
| 50% | 315.339854 | 309.801854 | 348.869854 | 319.511854 | 308.962354 | 349.444854 | 326.611854 | 242.04262 | 349.403854 | 289.111354 | 309.081854 | 289.020854 | 329.034354 | 243.67212 | 348.531354 | 289.636854 | 332.358854 | 0.0 |
| 75% | 315.680604 | 310.104604 | 351.472854 | 319.917854 | 309.283604 | 351.934604 | 326.923854 | 242.86387 | 351.836604 | 291.092854 | 310.050854 | 290.785604 | 329.924604 | 244.83337 | 351.170604 | 291.877104 | 333.259104 | 1.0 |
| max | 317.691854 | 311.745854 | 360.176854 | 321.897854 | 310.900854 | 360.316854 | 329.570854 | 248.53512 | 359.380854 | 304.734854 | 314.551854 | 303.302854 | 334.273854 | 250.78412 | 357.605854 | 303.659854 | 339.373854 | 1.0 |


### 1.4 Dataset with features selected by the Genetic Algorithm

The genetic algorithm selection kept the following features:

In [21]:
features_chosen_by_genetic_algorithm = ['1x', '1y', '2x', '2y', '2z', '3x', '4z', '5y', '6z', '7z', '8x', '8y', '8z', '9x', '9z', '11x', '11y', '11z', '12x', '12z', '13y', '14x', '14y', '15y', '16y', '17y', '18x', '18y', '18z', '19x', '19z', '20x', '20y', '21x', '21z', '22z', '23y', '23z', '24x', '24y', '24z', '25x', '25z', '27y', '28x', '28y', '28z', '31y', '32z', '33x', '33y', '33z', '34y', '35x', '35z', '36x', '37x', '37y', '37z', '38x', '39y', '39z', '40z', '41y', '41z', '42x', '42y', '42z', '43x', '43y', '43z', '44x', '44y', '45z', '46y', '46z', '47z', '48y', '49x', '49y', '50x', '50y', '52z', '53x', '53y', '54x', '54z', '55y', '56x', '56y', '56z', '57x', '57y', '57z', '58x', '58y', '58z', '59x', '60z', '62x', '62y', '62z', '63y', '63z', '66z', '67y', '67z', '68x', '69x', '69y', '70z', '71y', '71z', '72y', '73y', '74y', '74z', '75y', '76x', '76y', '78z', '79z', '80y', '80z', '81y', '81z', '82x', '82y', '83x', '83z', '84x', '84y', '84z', '85y', '85z', '86y', '87z', '88x', '89y', '90x', '90y', '91z', '92y', '92z', '93y', '93z', '94x', '95x', '95z', '96y', '96z', '97x', '97y', '97z', '98y', '99z', 'target']
dataset_with_genetic_algorithm = dataset[features_chosen_by_genetic_algorithm]
dataset_with_genetic_algorithm.describe()

| | 1x | 1y | 2x | 2y | 2z | 3x | 4z | 5y | 6z | 7z | ... | 95x | 95z | 96y | 96z | 97x | 97y | 97z | 98y | 99z | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | ... | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 | 2706.0 |
| mean | 306.747875 | 217.450552 | 303.378426 | 216.894622 | 1193.764597 | 300.074385 | 1230.535107 | 220.322486 | 1038.365484 | 1009.759793 | ... | 323.691057 | 1226.466741 | 205.52479 | 1264.941981 | 336.2139 | 205.329225 | 1272.370658 | 207.474289 | 986.881744 | 0.458241 |
| std | 1.766325 | 2.647455 | 1.909747 | 2.545192 | 266.93004 | 2.06836 | 189.524172 | 2.627748 | 467.862472 | 486.391245 | ... | 2.336497 | 214.587202 | 4.355729 | 85.920339 | 3.90509 | 5.389459 | 89.417381 | 6.521451 | 555.178027 | 0.498345 |
| min | 300.704854 | 207.54012 | 296.182854 | 206.90412 | 0.0 | 291.035854 | 0.0 | 211.01812 | 0.0 | 0.0 | ... | 316.851854 | 0.0 | 193.75012 | 0.0 | 324.399854 | 191.58812 | 0.0 | 190.30512 | 0.0 | 0.0 |
| 25% | 305.627854 | 215.37112 | 302.096854 | 214.94862 | 1225.0 | 298.725104 | 1234.0 | 218.37762 | 1221.0 | 1212.0 | ... | 321.777604 | 1243.0 | 201.72037 | 1252.0 | 332.981354 | 200.48687 | 1265.0 | 201.33112 | 1208.0 | 0.0 |
| 50% | 306.972854 | 217.45062 | 303.588354 | 217.09362 | 1256.0 | 300.290354 | 1261.0 | 220.78712 | 1243.0 | 1234.0 | ... | 323.702354 | 1270.0 | 206.27512 | 1275.0 | 335.892854 | 206.15562 | 1284.0 | 208.10312 | 1294.0 | 0.0 |
| 75% | 307.857104 | 219.73387 | 304.494854 | 219.01862 | 1280.0 | 301.115854 | 1280.0 | 222.49687 | 1270.0 | 1261.0 | ... | 325.720354 | 1280.0 | 209.26587 | 1289.0 | 340.044854 | 210.09087 | 1294.0 | 213.09812 | 1309.0 | 1.0 |
| max | 312.710854 | 222.92012 | 310.230854 | 222.41312 | 1314.0 | 307.853854 | 1319.0 | 225.91912 | 1314.0 | 1309.0 | ... | 330.071854 | 1319.0 | 213.00912 | 1319.0 | 348.181854 | 214.91212 | 1329.0 | 219.07712 | 1563.0 | 1.0 |


## 2. Revisiting Cross-Validation

In this activity we updated our cross-validation routine to use the new classifier and to compute the classification statistics:

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.naive_bayes import MultinomialNB

def Stratified_KFold(parameters, X, y, K):
  # Evaluates MultinomialNB with stratified K-fold cross-validation for each
  # smoothing value in `parameters` and returns the mean scores per value.
  precision_mean = []
  recall_mean = []
  accuracy_mean = []
  alpha = []

  for c in parameters:
    print(c)
    clf = MultinomialNB(alpha=c)  # model under evaluation
    kf = StratifiedKFold(n_splits=K, shuffle=False)

    # fresh score lists for each parameter, so that the scores obtained
    # with different alpha values are never mixed together
    precision = []
    recall = []
    accuracy = []

    for train_index, test_index in kf.split(X, y):
      X1_train, X1_test = X[train_index], X[test_index]
      y1_train, y1_test = y[train_index], y[test_index]

      clf.fit(X1_train, y1_train)
      y_pred = clf.predict(X1_test)

      precision.append(precision_score(y1_test, y_pred))
      recall.append(recall_score(y1_test, y_pred))
      accuracy.append(accuracy_score(y1_test, y_pred))

    # mean of each metric over the K folds for this alpha
    precision_mean.append(np.mean(precision))
    recall_mean.append(np.mean(recall))
    accuracy_mean.append(np.mean(accuracy))
    alpha.append(c)

  d = {'precision': precision_mean, 'recall': recall_mean, 'accuracy': accuracy_mean, 'alpha': alpha}
  return pd.DataFrame(d)
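
When scores are collected per parameter, as in `Stratified_KFold`, each parameter needs its own list. The short, self-contained illustration below shows why an expression such as `[[]] * n` is not suitable for that (it repeats a reference to a single list) and what a list comprehension does instead. The variable names are made up for the example.

```python
n_params = 3

# Repeating a list with * copies the reference, not the inner list,
# so every slot points at the same object.
aliased = [[]] * n_params
aliased[0].append(0.9)

# A comprehension builds a fresh inner list on every iteration.
independent = [[] for _ in range(n_params)]
independent[0].append(0.9)

print(aliased)      # [[0.9], [0.9], [0.9]]
print(independent)  # [[0.9], [], []]
```

With aliased lists, the mean reported for each `alpha` would be computed over the scores of every `alpha` evaluated so far, not only over its own folds.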

## 4. Training the model

### 4.1 Model with all features

In [None]:
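# assumption: `normalized_data` is the centered DataFrame `dataset` from section 1.1
normalized_data = dataset
X = normalized_data.values[:, :-1]          # every column except `target`
y = normalized_data.values[:, -1:].ravel()  # `target` column as a 1-D array
param = [0.1, 0.5, 1]                       # alpha (smoothing) values to evaluate
Stratified_KFold(param, X, y, K=10)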
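
The values in `param` are passed to `MultinomialNB` as `alpha`, its additive (Laplace/Lidstone) smoothing parameter. As a rough sketch of what that parameter controls, the smoothed per-class feature probabilities, $(N_{ci} + \alpha) / (N_c + \alpha n)$, can be computed by hand with NumPy. `smoothed_feature_probs` is a hypothetical helper written for this illustration and is not part of scikit-learn; the tiny `X_demo` and `y_demo` arrays are made-up data.

```python
import numpy as np

def smoothed_feature_probs(X, y, alpha):
    """Per-class feature probabilities with additive smoothing."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_features = X.shape[1]

    probs = np.empty((len(classes), n_features))
    for row, c in enumerate(classes):
        feature_totals = X[y == c].sum(axis=0)  # N_ci: total of feature i in class c
        class_total = feature_totals.sum()      # N_c: total over all features in class c
        probs[row] = (feature_totals + alpha) / (class_total + alpha * n_features)
    return classes, probs

# Made-up data: feature 1 never occurs in class 0.
X_demo = np.array([[2, 0, 1], [3, 0, 0], [0, 4, 1]])
y_demo = np.array([0, 0, 1])

for alpha in (0.1, 0.5, 1):
    _, probs = smoothed_feature_probs(X_demo, y_demo, alpha)
    print(alpha, probs.round(3))
```

Larger `alpha` values pull the probabilities of each class towards a uniform distribution, while smaller values stay closer to the raw proportions but leave unseen features with probabilities near zero.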

### 4.2 Model after applying PCA


In [None]:
#principal_component_analysis(normalized_data.columns,normalized_data.values[:,:-1])
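
`MultinomialNB` expects non-negative feature values, while the principal components from section 1.2 take negative values (see the `min` row of that table). One possible way to reconcile the two, sketched below, is to min-max scale each column using statistics taken from the training fold only. This is not necessarily what the authors did: `fit_min_max` and `apply_min_max` are hypothetical helpers, and switching to a classifier that accepts real-valued inputs, such as `GaussianNB`, is another option.

```python
import numpy as np

def fit_min_max(X_train):
    """Learn per-column minimum and range from the training fold."""
    X_train = np.asarray(X_train, dtype=float)
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero on constant columns
    return col_min, col_range

def apply_min_max(X, col_min, col_range):
    """Scale X with the training statistics and clip anything below zero."""
    scaled = (np.asarray(X, dtype=float) - col_min) / col_range
    # Test values below the training minimum would come out negative, so clip them.
    return np.clip(scaled, 0.0, None)
```

Inside a cross-validation loop, `fit_min_max` would be called on the training rows of each fold and `apply_min_max` on both the training and the test rows, so that no information from the test fold leaks into the scaling.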