# Example of a machine learning workflow

### *Human Activity Recognition Using Smartphones dataset*

Dataset description:
https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones

*The experiments have been carried out with a group of 30 volunteers (…). Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (…). Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity (…). The experiments have been video-recorded to label the data manually. The dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data*

Download the data from the following link and unzip the archive: https://archive.ics.uci.edu/ml/machine-learning-databases/00240/

**Data structure** (main files):
* Activity codes: “activity_labels.txt” (2 columns)
* Features: “features.txt” (561 rows, 2 columns)
* Subjects (train; test): “train/subject_train.txt” (7352 rows), “test/subject_test.txt” (2947 rows), both with 1 column
* Input features – X (train; test): “train/X_train.txt” (7352 rows), “test/X_test.txt” (2947 rows), both with 561 columns
* Output attribute (activity) – y (train; test): “train/y_train.txt” (7352 rows), “test/y_test.txt” (2947 rows), both with 1 column


**Variables:**
For each record in the dataset it is provided: 
* A 561-feature vector with time and frequency domain variables. 
* Its activity label. 
* An identifier of the subject who carried out the experiment.


### Loading the data

Unzipping the archive creates a base folder named `UCI HAR Dataset`. Set the `folder` variable below to the absolute path of that folder. If it sits in the same folder as the notebook, `folder = "./UCI HAR Dataset/"` is enough.

In [1]:
folder = "/Users/miguelrocha/Dropbox/Programming/Python3/scipy-examples/UCI HAR Dataset"

In [2]:
import pandas as pd

In [3]:
activities = pd.read_csv(folder+'/activity_labels.txt', sep=' ', header=None, names=('ID','Activity'))
print(activities)

   ID            Activity
0   1             WALKING
1   2    WALKING_UPSTAIRS
2   3  WALKING_DOWNSTAIRS
3   4             SITTING
4   5            STANDING
5   6              LAYING


In [4]:
features = pd.read_csv(folder+"/features.txt", sep = " ", header = None, names=('ID','Sensor'))
print(features.shape)
features.head()

(561, 2)


   ID             Sensor
0   1  tBodyAcc-mean()-X
1   2  tBodyAcc-mean()-Y
2   3  tBodyAcc-mean()-Z
3   4   tBodyAcc-std()-X
4   5   tBodyAcc-std()-Y


In [5]:
subjects_tr = pd.read_csv(folder+"/train/subject_train.txt", header = None, names=['SubjectID'])
subjects_tr.head()

   SubjectID
0          1
1          1
2          1
3          1
4          1


In [6]:
subjects_tst = pd.read_csv(folder+"/test/subject_test.txt", header = None, names=['SubjectID'])
print(subjects_tr.shape, subjects_tst.shape)

(7352, 1) (2947, 1)


In [7]:
# Raw strings keep "\s" from being read as an invalid string escape
x_train = pd.read_csv(folder+"/train/X_train.txt", sep = r"\s+", header = None)
x_test = pd.read_csv(folder+"/test/X_test.txt", sep = r"\s+", header = None)
print(x_train.shape, x_test.shape)

(7352, 561) (2947, 561)


In [8]:
y_train = pd.read_csv(folder+"/train/y_train.txt", header=None, names=['ActivityID'])
y_test = pd.read_csv(folder+"/test/y_test.txt", header=None, names=['ActivityID'])
print(y_train.shape, y_test.shape)

(7352, 1) (2947, 1)


### Data preparation

Merge the training and test sets

In [9]:
subjects_all = pd.concat([subjects_tr, subjects_tst], ignore_index=True)
print(subjects_all.shape)

(10299, 1)


In [10]:
x_all = pd.concat([x_train, x_test], ignore_index = True)
print(x_all.shape)

(10299, 561)


In [11]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
y_all = pd.concat([y_train, y_test], ignore_index=True)
print(y_all.shape)

(10299, 1)


Use the feature names as the column names of X

In [12]:
sensorNames = features['Sensor']
x_all.columns = sensorNames
x_all.head()

Sensor,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-meanFreq(),fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)"
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.074323,-0.298676,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,0.158075,-0.595051,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,0.414503,-0.390748,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,0.404573,-0.11729,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,0.087753,-0.351471,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892


Replace the activity codes with their names (strings)

In [13]:
# Map each activity ID to its name using the activity_labels table
id_to_name = dict(zip(activities['ID'], activities['Activity']))
y_all = y_all.iloc[:, 0].map(id_to_name).to_frame('Activity')
y_all.head()

   Activity
0  STANDING
1  STANDING
2  STANDING
3  STANDING
4  STANDING


In [14]:
y_all.tail()

               Activity
10294  WALKING_UPSTAIRS
10295  WALKING_UPSTAIRS
10296  WALKING_UPSTAIRS
10297  WALKING_UPSTAIRS
10298  WALKING_UPSTAIRS


Combine everything into a single DataFrame and save it to a CSV file

In [15]:
x_all = pd.concat([x_all, subjects_all], axis=1)
allXy = pd.concat([x_all, y_all], axis=1)
print(allXy.shape)

allXy.to_csv("HAR_clean.csv")

(10299, 563)


Aggregate the data into a smaller dataset (per subject and per activity)

In [16]:
import numpy as np
# Mean of every feature for each (subject, activity) pair
grouped = allXy.groupby(['SubjectID', 'Activity']).mean()

print(grouped.shape)
grouped.head()

grouped.to_csv("HAR_grouped.csv")

(180, 561)


### Exploring the dataset

Characterize the value distributions of the input variables

In [17]:
input_data = allXy.iloc[:,:-2]
#...
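
One possible way to fill in this step is with `DataFrame.describe()`. The sketch below runs on a small synthetic frame (the `feat_*` column names and the random values are assumptions used only for illustration); in the notebook, `input_data` would be the 561-column frame built above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for input_data (assumption: the real frame is numeric)
rng = np.random.default_rng(0)
input_data = pd.DataFrame(
    rng.uniform(-1, 1, size=(100, 5)),
    columns=[f"feat_{i}" for i in range(5)],
)

# Per-column summary statistics, transposed so each feature is a row
summary = input_data.describe().T
print(summary[["mean", "std", "min", "max"]])
```

With 561 features, looking at the range of `summary["mean"]` and `summary["std"]` is usually more practical than reading every row.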

Characterize the value distribution of the output variable

In [18]:
output_data = allXy.iloc[:,-1]
# ...
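
For a categorical output such as the activity label, class counts and proportions are a natural summary. A minimal sketch, using a tiny hand-made Series as a stand-in for `output_data` (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative stand-in for output_data (assumption: a Series of activity names)
output_data = pd.Series(
    ["WALKING"] * 3 + ["SITTING"] * 2 + ["LAYING"], name="Activity"
)

# Absolute and relative frequency of each class
counts = output_data.value_counts()
proportions = output_data.value_counts(normalize=True)
print(counts)
print(proportions.round(3))
```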

Check for null values
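
A minimal sketch of a null check, assuming a pandas DataFrame; the frame `df` below is a hypothetical example with one missing value, standing in for `allXy`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value (stand-in for allXy)
df = pd.DataFrame({"a": [0.1, np.nan, 0.3], "b": [1.0, 2.0, 3.0]})

# Number of nulls per column, and the overall total
nulls_per_column = df.isnull().sum()
total_nulls = int(nulls_per_column.sum())
print(nulls_per_column)
print("Total nulls:", total_nulls)
```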

Standardize the input data

In [None]:
from sklearn import preprocessing
# sc_input = ...
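
One way to complete the `sc_input` line is with `preprocessing.StandardScaler`, which rescales each column to zero mean and unit variance. The sketch below uses random data as a stand-in for `input_data` (an assumption for illustration only).

```python
import numpy as np
from sklearn import preprocessing

# Synthetic stand-in for input_data (assumption)
rng = np.random.default_rng(0)
input_data = rng.normal(loc=5.0, scale=2.0, size=(200, 4))

# Fit the scaler and transform the data in one step
scaler = preprocessing.StandardScaler()
sc_input = scaler.fit_transform(input_data)

print(sc_input.mean(axis=0).round(6))
print(sc_input.std(axis=0).round(6))
```

Keeping the fitted `scaler` object around makes it possible to apply the same transformation to new data later with `scaler.transform`.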

### Unsupervised analysis

Run a PCA that explains at least 80% of the variability
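
In scikit-learn, passing a fraction as `n_components` asks `PCA` to keep the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on synthetic correlated data (the data generation is an assumption; in the notebook the input would be the standardized `sc_input`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data: 20 features driven by 3 latent factors (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 20))
sc_input = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain at least 80% of the variance
pca = PCA(n_components=0.80, svd_solver="full")
scores = pca.fit_transform(sc_input)
explained = pca.explained_variance_ratio_

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", explained.cumsum().round(3))
```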

Plot the variance explained by each of the first 10 PCs using an appropriate chart

Build a scores plot from the PCA results and compare it with the *Activity* variable

Run k-means clustering; compare the clusters with the *Activity* variable
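
A contingency table is one simple way to compare cluster assignments with known labels. The sketch below uses `make_blobs` data and three made-up activity labels as stand-ins (both are assumptions); with the real data, `n_clusters=6` would match the six activities.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: 3 groups of points with known labels (assumption)
X, labels = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
activity = pd.Series(labels).map({0: "WALKING", 1: "SITTING", 2: "LAYING"})

# Fit k-means and get the cluster index of each sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Rows: true activity; columns: assigned cluster
table = pd.crosstab(activity, clusters, rownames=["Activity"], colnames=["Cluster"])
print(table)
```

Cluster indices are arbitrary, so a good match shows up as one dominant cell per row rather than as a diagonal.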

Run hierarchical clustering on the aggregated data and visualize the resulting tree, coloring the leaves by the *Activity* variable

### Machine learning - supervised models

Split the data into training and test partitions (keeping 30% in the test set); check the label distribution in the training and test sets
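
A minimal sketch of the split with `train_test_split`. The data comes from `make_classification` as a stand-in, and the use of `stratify=y` is an assumption: it is one common way to keep the label distribution similar in both partitions, not something the exercise prescribes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the inputs and labels (assumption)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Hold out 30% for testing; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Relative class frequencies in each partition
train_dist = pd.Series(y_train).value_counts(normalize=True).sort_index()
test_dist = pd.Series(y_test).value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_dist, "test": test_dist}).round(3))
```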

Train baseline models with several classifiers on the training set. Evaluate each of these models with cross-validation.
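
The sketch below compares three classifiers with 5-fold cross-validation. The choice of classifiers, their settings, and the synthetic data are all assumptions made for illustration; any scikit-learn classifier can be dropped into the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption)
X_train, y_train = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Candidate baseline models (illustrative choice)
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Mean 5-fold cross-validated accuracy for each model
cv_results = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_results.items():
    print(f"{name}: {score:.3f}")
```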

For the most promising model, try feature selection based on univariate statistical tests (ANOVA), reducing the number of variables by half.
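
In scikit-learn, the ANOVA F-test is available as `f_classif` and can be combined with `SelectKBest`. A minimal sketch, assuming logistic regression as the "most promising" model and synthetic data (both assumptions); placing the selector inside a pipeline keeps the selection within each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Keep half of the features, ranked by the ANOVA F-statistic
k = X.shape[1] // 2
pipe = make_pipeline(SelectKBest(f_classif, k=k), LogisticRegression(max_iter=1000))

# Cross-validated accuracy of selection + classifier
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean().round(3))

# Fit once on all data to inspect how many features were kept
pipe.fit(X, y)
n_selected = int(pipe.named_steps["selectkbest"].get_support().sum())
print("Features kept:", n_selected)
```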

For the most promising model, run a hyperparameter optimization process
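
A grid search with cross-validation is one common way to do this. The sketch below tunes a k-nearest-neighbors classifier; the model, the grid values, and the synthetic data are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Illustrative grid of hyperparameter values to try
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refit on all the data passed to `fit` with the best parameters found.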

Create an ensemble model based on the 3 best models you tried, and evaluate it.
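
One possible ensemble is a majority-vote `VotingClassifier`. In the sketch below, the three base models and the synthetic data are assumptions; in practice they would be the three best models found in the previous steps.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Majority vote over three base classifiers (illustrative choice)
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",
)

# Evaluate the ensemble with 5-fold cross-validation
ensemble_scores = cross_val_score(ensemble, X, y, cv=5)
print("Ensemble mean CV accuracy:", ensemble_scores.mean().round(3))
```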

Estimate the error of the best model obtained on the test set.
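
A minimal sketch of this estimate, assuming accuracy as the metric (so the error is one minus the accuracy) and logistic regression as the best model; both are assumptions, as is the synthetic data. The key point is that the test set is used only once, after all model choices have been made.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data and the split (assumption)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit on the training partition only, then score on the held-out test set
best_model = LogisticRegression(max_iter=1000)
best_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
test_error = 1.0 - test_accuracy

print(f"Test accuracy: {test_accuracy:.3f}, test error: {test_error:.3f}")
```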

Train the final model in the most appropriate way