![Cabec%CC%A7alho_notebook.png](cabecalho_notebook.png)

# Classificação de Atividade Humana com PCA

Vamos trabalhar com a base da demonstração feita em aula, mas vamos explorar um pouco melhor como é o desempenho da árvore variando o número de componentes principais.

In [1]:
########## IMPORTS BLOCK ##########

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.graph_objs as go

from sklearn.tree import DecisionTreeClassifier

from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, plot_confusion_matrix
from plotly.subplots import make_subplots

In [2]:
########## FILE OPENING BLOCK (USING JUPYTER NOTEBOOK) ##########

base_dir = "C:/Users/User/Desktop/Usuário/Documents/Victor/Ciência dos Dados/Python/Exercícios/Entregues/Avaliados/Profissao Ciência de Dados/MOD27/Dados/UCI HAR Dataset/"

filename_features = base_dir + "features.txt"
filename_labels = base_dir + "activity_labels.txt"
filename_subtrain = base_dir + "train/subject_train.txt"
filename_xtrain = base_dir + "train/X_train.txt"
filename_ytrain = base_dir + "train/y_train.txt"
filename_subtest = base_dir + "test/subject_test.txt"
filename_xtest = base_dir + "test/X_test.txt"
filename_ytest = base_dir + "test/y_test.txt"

########## FILE READING BLOCK ##########

features = pd.read_csv(filename_features, header=None, names=['nome_var'], sep="#").squeeze("columns")
labels = pd.read_csv(filename_labels, delim_whitespace=True, header=None, names=['cod_label', 'label'])

subject_train = pd.read_csv(filename_subtrain, header=None, names=['subject_id']).squeeze("columns")
subject_test = pd.read_csv(filename_subtest, header=None, names=['subject_id']).squeeze("columns")

X_train = pd.read_csv(filename_xtrain, delim_whitespace=True, header=None, names=features.tolist())
y_train = pd.read_csv(filename_ytrain, header=None, names=['cod_label'])
X_test = pd.read_csv(filename_xtest, delim_whitespace=True, header=None, names=features.tolist())
y_test = pd.read_csv(filename_ytest, header=None, names=['cod_label'])

## PCA com variáveis padronizadas

Reflexão sobre a escala das variáveis:

**Variáveis em métricas muito diferentes** podem interferir na análise de componentes principais. Lembra que variância é informação pra nós? Pois bem, tipicamente se há uma variável monetária como salário, vai ter uma ordem de variabilidade bem maior que número de filhos, tempo de emprego ou qualquer variável dummy. Assim, as variáveis de maior variância tendem a "dominar" a análise. Nesses casos é comum usar a padronização das variáveis.

Faça duas análises de componentes principais para a base do HAR - com e sem padronização e compare:

- A variância explicada por componente
- A variância explicada acumulada por componente
- A variância percentual por componente
- A variância percentual acumulada por componente
- Quantas componentes você escolheria, em cada caso para explicar 90% da variância?

In [13]:
%%time

########## FUNCTION CONSTRUCTION BLOCK ##########

def padroniza(s):
    if s.std() > 0:
        s = (s - s.mean())/s.std()
    return s

########## STANDARDIZATION ##########

X_train_pad = pd.DataFrame(X_train).apply(padroniza, axis=0)
X_test_pad = pd.DataFrame(X_test).apply(padroniza, axis=0)

########## PCA ANALYSIS BLOCK ##########

pca = PCA()
pca.fit_transform(X_train) # Regular

pca_stded = PCA()
pca_stded.fit_transform(X_train_pad) # Standardized (Stded)

########## EXTRACTING NEEDED INFO TO COMPARE FROM SETS ##########

explained_variance = pca.explained_variance_
cumulative_explained_variance = np.cumsum(explained_variance)
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance_ratio = np.cumsum(explained_variance_ratio)

explained_variance_scaled = pca_stded.explained_variance_
cumulative_explained_variance_scaled = np.cumsum(explained_variance_scaled)
explained_variance_ratio_scaled = pca_stded.explained_variance_ratio_
cumulative_explained_variance_ratio_scaled = np.cumsum(explained_variance_ratio_scaled)

data = {
    'Explained Var (Reg)': explained_variance,
    'Cumulative Explained Var (Reg)': cumulative_explained_variance,
    'Explained Var Ratio (Reg)': explained_variance_ratio,
    'Cumulative Explained Var Ratio (Reg)': cumulative_explained_variance_ratio,
    'Explained Var (Stded)': explained_variance_scaled,
    'Cumulative Explained Var (Stded)': cumulative_explained_variance_scaled,
    'Explained Var Ratio (Stded)': explained_variance_ratio_scaled,
    'Cumulative Explained Var Ratio (Stded)': cumulative_explained_variance_ratio_scaled,
}

########## STEP TO FIND THE 90% ACCURACY NEEDED FOR BOTH CASES ##########

index_regular = np.argmax(cumulative_explained_variance_ratio > 0.9) + 1 # adding one finds the correct
index_stded = np.argmax(cumulative_explained_variance_ratio_scaled > 0.9) + 1

########## PLOTTING A VISUAL COMPARISON ##########

x_max = list(range(1, len(cumulative_explained_variance_ratio)+1))

fig = make_subplots(rows=1, cols=1)
fig.add_trace(go.Scatter(x=x_max,
                         y=cumulative_explained_variance_ratio,
                         mode='lines',
                         name='Regular'))
fig.add_trace(go.Scatter(x=x_max,
                         y=cumulative_explained_variance_ratio_scaled,
                         mode='lines',
                         name='Standardized'))
# Fine adjustments
fig.add_shape(type="line", x0=0, y0=0.9, x1=max(x_max), y1=0.9, line=dict(color="black", width=1, dash="dash"))
fig.add_shape(type="line", x0=index_regular, y0=0, x1=index_regular, y1=1, line=dict(color="black", width=1, dash="dash"))
fig.add_shape(type="line", x0=index_stded, y0=0, x1=index_stded, y1=1, line=dict(color="black", width=1, dash="dash"))
fig.update_layout(title='C.E.V. by Components - Standardized X Regular',
                  font=dict(size=24,
                            color='black',
                            family='Calibri'),
                  xaxis_title='Number of Components (-)',
                  yaxis_title='Cumulative Explained Variance (-)',
                  legend=dict(x=0.325, y=1.1, orientation='h'),
                  template='simple_white',
                  title_x=0.5)

########## EXHIBITING THE DATA USING PLOTLY & PANDAS ##########

fig.show()
print(f'\nNeeded components for Regular Set: {index_regular} & for Standardzied Set: {index_stded}.\n')
df_pca_info = pd.DataFrame(data)
df_pca_info.head(10).style.format(precision=2, decimal=',')


Needed components for Regular Set: 34 & for Standardzied Set: 63.

Wall time: 2.58 s


Unnamed: 0,Explained Var (Reg),Cumulative Explained Var (Reg),Explained Var Ratio (Reg),Cumulative Explained Var Ratio (Reg),Explained Var (Stded),Cumulative Explained Var (Stded),Explained Var Ratio (Stded),Cumulative Explained Var Ratio (Stded)
0,3482,3482,63,63,28488,28488,51,51
1,274,3756,5,67,3692,32180,7,57
2,229,3985,4,72,1574,33754,3,60
3,104,4090,2,73,1405,35159,3,63
4,94,4184,2,75,1059,36218,2,65
5,71,4255,1,76,967,37186,2,66
6,66,4320,1,78,769,37955,1,68
7,60,4380,1,79,673,38627,1,69
8,54,4434,1,80,559,39186,1,70
9,48,4482,1,81,541,39728,1,71


## Árvore com PCA

Faça duas uma árvore de decisão com 10 componentes principais - uma com base em dados padronizados e outra sem padronizar. Utilize o ```ccp_alpha=0.001```.

Compare a acurácia na base de treino e teste.

In [6]:
########## PERFORMING PCA WITH 10 COMPONENTS ##########

pca_10 = PCA(n_components=10)

X_train_pca = pca_10.fit_transform(X_train)
X_test_pca = pca_10.transform(X_test)

X_train_pca_std = pca_10.fit_transform(X_train_pad)
X_test_pca_std = pca_10.transform(X_test_pad)

########## CONSTRUCTING THE TREE ##########

dt = DecisionTreeClassifier(ccp_alpha=0.001, random_state=42)

########## TIMMING & TRAINING TREES ##########

%time
dt = dt.fit(X_train_pca, y_train)
dt_std = dt.fit(X_train_pca_std, y_train)

########## PREDICTING & EVALUATING ##########

y_train_pred = dt.predict(X_train_pca)
y_test_pred = dt.predict(X_test_pca)
accuracy_train = accuracy_score(y_train, y_train_pred) * 100
accuracy_test = accuracy_score(y_test, y_test_pred) * 100

y_train_pred_std = dt_std.predict(X_train_pca_std)
y_test_pred_std = dt_std.predict(X_test_pca_std)
accuracy_train_std = accuracy_score(y_train, y_train_pred_std) * 100
accuracy_test_std = accuracy_score(y_test, y_test_pred_std) * 100

########## EXHIBITING THE DATA USING PANDAS ##########

data = {'Dataset Config.': ['Regular', 'Standardized'],
        'Train Acc. (%)': [accuracy_train, accuracy_train_std],
        'Test Acc. (%)': [accuracy_test, accuracy_test_std]}

df_resultados = pd.DataFrame(data)
df_resultados.set_index('Dataset Config.', inplace=True)
df_resultados.style.format(precision=1, decimal=',')

Wall time: 0 ns


Unnamed: 0_level_0,Train Acc. (%),Test Acc. (%)
Dataset Config.,Unnamed: 1_level_1,Unnamed: 2_level_1
Regular,357,345
Standardized,858,776
