# Padronizador de views de HAR

Este notebook auxilia a padronizar as views dos datasets de HAR

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import hashlib

In [2]:
root_views_dir = Path("../data/processed/")    # Local onde as views de encontram (entrada)
output_dir = Path("../data/processed_standard/")      # Local onde as views padronizadas serão colocadas

In [3]:
# Código de atividades padronizado
standartized_codes = {
    0: "sit",
    1: "stand",
    2: "walk",
    3: "stair up",
    4: "stair down",
    5: "run",
    6: "stair up and down"
}

## Descrição das views a serem processadas

A variavel `views` é um dicionário, onde cada chave é o nome do dataset (nome da pasta do dataset raíz, onde possui as views dentro) e o valor é uma lista de dicionários com meta informações (pode ter várias meta-informações de processamento. Elas serão processadas em ordem).

Cada meta-informação é um dicionário deve conter as seguintes informações:

* **view**: nome da pasta com a view
* **output**: nome da pasta de saída (será criada)
* **train**: nome do arquivo csv de treino
* **validation**: nome do arquivo csv de validação (None, se não houver)
* **test**: nome do arquivo csv de teste (None, se não houver)
* **activity code**: Dicionário com o mapeamento entre os nomes das atividades originais e seus respectivos códigos
* **select activities**: Lista com quais serão as atividades selecionadas (em relação ao activity code)
* **standard activity code map**: mapeamento do código das atividades originais (chave) para o código de atividade padronizado (valor) 
* **brief**: resumo para o README.md


In [4]:
views = {

    "MotionSense": [
        {
            "view": "balanced",
            "output": "standard_balanced",
            "train": "train.csv",
            "validation": "validation.csv",
            "test": "test.csv",
            "activity code": {
                0: "downstairs",
                1: "upstairs",
                2: "sitting",
                3: "standing",
                4: "walking",
                5: "jogging"
            },
            "select activities": [
                0, 1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                0: 4,
                1: 3,
                2: 0,
                3: 1,
                4: 2,
                5: 5
            },
            "brief": """# Balanced MotionSense View Resampled to 20Hz - Multiplied acc by 9.81m/s²

This is a view from [MotionSense] that was spllited into 3s windows and was resampled to 20Hz using the [FFT method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample). 

The data was first splitted in three sets: train, validation and test. Each one with the following proportions:
- Train: 70% of samples
- Validation: 10% of samples
- Test: 20% of samples

After splits, the datasets were balanced in relation to the activity code column, that is, each subset have the same number of activitiy samples.

**NOTE**: Each subset contain samples from distinct users, that is, samples of one user belongs exclusivelly to one of three subsets.

"""
        },
        
#         Podes adicionar outras views do MotionSense aqui   
        
        {
            "view": "balanced_normalized",
            "output": "standard_balanced_normalized",
            "train": "train.csv",
            "validation": "validation.csv",
            "test": "test.csv",
            "activity code": {
                0: "downstairs",
                1: "upstairs",
                2: "sitting",
                3: "standing",
                4: "walking",
                5: "jogging"
            },
            "select activities": [
                0, 1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                0: 4,
                1: 3,
                2: 0,
                3: 1,
                4: 2,
                5: 5
            },
            "brief": """# Raw balanced MotionSense View

This is a view from [MotionSense] that was spllited into 3s windows. 

The data was first splitted in three sets: train, validation and test. Each one with the following proportions:
- Train: 70% of samples
- Validation: 10% of samples
- Test: 20% of samples

After splits, the datasets were balanced in relation to the activity code column, that is, each subset have the same number of activitiy samples.

**NOTE**: Each subset contain samples from distinct users, that is, samples of one user belongs exclusivelly to one of three subsets.

"""
        }
    ]
}

O código abaixo processa as views (às padroniza) e gera na pasta de saída

In [5]:
backslash = "\n"

for dataset_name, values_list in views.items():
    split_counts = {}
    root_output_path = output_dir / dataset_name
    for values in values_list:
        for split in ("train", "validation", "test"):
            if values[split] is None:
                continue

            path = root_views_dir / dataset_name / values["view"] / values[split]
            df = pd.read_csv(path)
            df = df.loc[df["activity code"].isin(values["select activities"])]
            df["standard activity code"] = df["activity code"].replace(values["standard activity code map"])
            if "normalized activity code" in df.columns:
                df = df.drop(columns="normalized activity code")
                
            df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
            df = df.dropna()
            
            path = root_output_path / values["output"] / f"{split}.csv"
            path.parent.mkdir(exist_ok=True, parents=True)
            df.to_csv(path, index=False)
            
            md5sum = hashlib.md5(path.open('rb').read()).hexdigest()
            path = path.parent / (path.name + '.md5')
            with path.open("w") as f:
                f.write(md5sum)

            split_counts[split] = {
                "standard activity code": df["standard activity code"].value_counts(),
                "activity code": df["activity code"].value_counts()
            }

        readme = values["brief"]
        readme = readme + f"""## Activity codes
- {f"- ".join(f'{k}: {v} ({split_counts["train"]["activity code"][k]} train, {split_counts["validation"]["activity code"][k] if "validation" in split_counts else 0} validation, {split_counts["test"]["activity code"][k] if "test" in split_counts else 0} test) {backslash}' for k, v in values['activity code'].items() if k in split_counts["train"]["activity code"])} 

## Standartized activity codes
- {f"- ".join(f'{k}: {v} ({split_counts["train"]["standard activity code"][k]} train, {split_counts["validation"]["standard activity code"][k] if "validation" in split_counts else 0} validation, {split_counts["test"]["standard activity code"][k] if "test" in split_counts else 0} test) {backslash}' for k, v in standartized_codes.items() if k in split_counts["train"]["standard activity code"]  )    }      


"""
        readme_path = root_output_path / values["output"] / "README.md"
        with readme_path.open("w") as f:
            f.write(readme)
        print(f"Processed {dataset_name}")

Processed MotionSense
Processed MotionSense


Processamento especifico para o MotionSense. Troca o nome das colunas: `userAcceleration` para `accel` e `rotationRate` para `gyro`

In [6]:
files = list(output_dir.rglob("train.csv")) + list(output_dir.rglob("validation.csv")) + list(output_dir.rglob("test.csv"))
for path in files:
    df = pd.read_csv(path)
    num_cols = max([int(c.split("-")[1]) for c in df.columns if c.startswith("userAcceleration.x")])
    replace_map = {
        f"{c}-{i}": f"{new_c}-{i}"
        for c, new_c in [("userAcceleration.x", "accel-x"), 
                         ("userAcceleration.y", "accel-y"), 
                         ("userAcceleration.z", "accel-z"),
                         ("rotationRate.x", "gyro-x"), 
                         ("rotationRate.y", "gyro-y"),
                         ("rotationRate.z", "gyro-z")]
        for i in range(num_cols)
    }
    df.rename(columns=replace_map, inplace=True)
    df.to_csv(path, index=False)
    print(f"Dataframe saved to {path}")

Dataframe saved to ../data/processed_standard/MotionSense/standard_balanced/train.csv
Dataframe saved to ../data/processed_standard/MotionSense/standard_balanced_normalized/train.csv
Dataframe saved to ../data/processed_standard/MotionSense/standard_balanced/validation.csv
Dataframe saved to ../data/processed_standard/MotionSense/standard_balanced_normalized/validation.csv
Dataframe saved to ../data/processed_standard/MotionSense/standard_balanced/test.csv
Dataframe saved to ../data/processed_standard/MotionSense/standard_balanced_normalized/test.csv


In [7]:
df

Unnamed: 0,attitude.roll-0,attitude.roll-1,attitude.roll-2,attitude.roll-3,attitude.roll-4,attitude.roll-5,attitude.roll-6,attitude.roll-7,attitude.roll-8,attitude.roll-9,...,accel-z-56,accel-z-57,accel-z-58,userAcceleration.z-59,activity code,length,trial_code,index,user,standard activity code
0,0.459486,2.010055,1.690739,2.011734,2.678560,2.004338,-3.023557,-2.613010,-2.838432,-2.782050,...,3.525640,-0.004362,-1.834985,1.297758,0,60,2,0,24,4
1,-2.885092,-3.215592,-2.785755,-3.495716,1.390879,3.088453,1.663313,2.031326,2.170790,-2.478387,...,2.426194,0.056155,-2.098584,-3.678155,0,60,1,300,5,4
2,0.629593,0.679241,0.680821,0.665189,0.650472,0.709087,0.710647,0.821341,0.930563,0.888926,...,-0.021874,-1.446251,-0.290899,-1.910267,0,60,2,720,8,4
3,-0.549137,-0.751263,-0.997039,-1.485333,-2.214510,-2.437384,-2.287579,-1.784931,-1.661485,-1.449583,...,-0.038779,-0.062287,1.351608,-0.232767,0,60,2,1140,18,4
4,-1.616488,-1.227771,-0.874226,-0.981324,-0.894324,-1.138450,-0.916488,-0.876108,-0.400426,-1.010380,...,0.032942,2.497021,1.125488,1.977079,0,60,2,840,18,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1009,-2.298168,-2.292712,-2.233460,-1.479755,-0.811095,-0.346581,-0.215786,-0.705519,-1.599768,-2.343966,...,1.416281,-3.585038,-2.100037,-4.560880,5,60,9,60,8,5
1010,-2.547119,-2.716967,-2.525013,-2.509815,-1.172907,-0.540750,-0.647906,-1.500023,-2.289281,-2.788637,...,11.736866,3.597550,6.320432,-2.605721,5,60,9,600,5,5
1011,-0.583672,1.958751,2.565088,-3.307028,-1.800610,-0.198927,-0.100970,-0.267889,-0.268892,-0.162512,...,7.167679,-0.428993,-3.842913,-0.663975,5,60,16,60,18,5
1012,-1.479468,-1.101888,-1.339976,-2.086903,-2.747720,-2.932873,-2.824491,-2.780553,-2.527721,-2.061352,...,-1.249928,1.959698,1.891035,5.699832,5,60,16,360,18,5


In [8]:
# for path in Path("data/views/").rglob("*.csv"):
#     df = pd.read_csv(path)
#     print(path, list(df.columns)[:10])

In [9]:
pd.DataFrame(standartized_codes.items(), columns=["standard code", "standard label"]).to_csv(output_dir / "standard_codes.csv", index=False)