
The dataset contains smartphone sensor records for 4 fall-related activities and 9 normal activities.

The fall-related activities are:
* FOL: falling forward
* FKL: falling onto the knees
* SDL: falling sideways
* BSC: falling from a chair

The normal activities are:
* STD: standing
* WAL: walking
* JOG: jogging
* JUM: jumping
* STU: going up stairs
* STN: going down stairs
* SCH: sitting down
* CSI: getting into a car
* CSO: getting out of a car

The records in the dataset were collected from 11 individuals.

Each record covers a 6-second time window of accelerometer and gyroscope data,
which yields the following columns:

* acc_max:        maximum acceleration during the 4th second.
* acc_kurtosis:   kurtosis of the acceleration over the full 6 seconds.
* acc_skewness:   skewness of the acceleration over the full 6 seconds.
* gyro_max:       maximum gyroscope reading during the 4th second.
* gyro_kurtosis:  kurtosis of the gyroscope signal over the full 6 seconds.
* gyro_skewness:  skewness of the gyroscope signal over the full 6 seconds.
* lin_max:        maximum linear acceleration (excluding gravity) during the 4th second.
* post_lin_max:   maximum linear acceleration during the 6th second.
* post_gyro_max:  maximum gyroscope reading during the 6th second.
* fall:           1 if the record corresponds to a fall, 0 otherwise.
* label:          activity code.
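
As a rough illustration of how window features of this kind can be derived, here is a minimal sketch. It is not the dataset authors' extraction code. The sampling rate `fs`, the use of a single magnitude signal, and the helper names are all assumptions. The kurtosis is computed as excess kurtosis, which is at least consistent with the negative minima that appear in the summary statistics later in the notebook.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

def excess_kurtosis(x):
    """Excess kurtosis: fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0

def window_features(magnitude, fs):
    """Compute max/kurtosis/skewness features for one 6-second window.

    magnitude: 1-D signal magnitude with 6 * fs samples
    fs: sampling rate in Hz (assumed; not stated in the dataset description)
    """
    fourth_second = magnitude[3 * fs : 4 * fs]  # samples from t = 3 s to t = 4 s
    return {
        "max": float(fourth_second.max()),
        "kurtosis": float(excess_kurtosis(magnitude)),
        "skewness": float(skewness(magnitude)),
    }

# Synthetic window: gravity-level baseline with one spike in the 4th second.
fs = 50  # hypothetical sampling rate
rng = np.random.default_rng(0)
signal = 9.8 + 0.1 * rng.standard_normal(6 * fs)
signal[3 * fs + 10] = 25.0  # simulated impact
features = window_features(signal, fs)
print(features)
```

A single sharp impact pushes both skewness and kurtosis well above zero, which may be part of why these statistics correlate with falls.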

The dataset contains 1784 records: 1017 correspond to normal activities and 767 to falls.

In [1]:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from xgboost import XGBClassifier
import warnings
warnings.filterwarnings('ignore')
from sklearn import set_config
set_config(display="diagram")


In [2]:
df1 = pd.read_csv('../tp_final_no_anda_la_clase_outlier/Train.csv')
df2 = pd.read_csv('../tp_final_no_anda_la_clase_outlier/Test.csv')

In [3]:
df1.shape

(1428, 12)

In [4]:
df2.shape

(356, 12)

In [5]:
df = pd.concat([df1, df2])

In [6]:
df.shape

(1784, 12)

In [7]:
df.isnull().sum()

Unnamed: 0       0
acc_max          0
gyro_max         0
acc_kurtosis     0
gyro_kurtosis    0
label            0
lin_max          0
acc_skewness     0
gyro_skewness    0
post_gyro_max    0
post_lin_max     0
fall             0
dtype: int64

In [8]:
df.describe()

,Unnamed: 0,acc_max,gyro_max,acc_kurtosis,gyro_kurtosis,lin_max,acc_skewness,gyro_skewness,post_gyro_max,post_lin_max,fall
count,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0
mean,891.5,21.768998,5.028728,10.031186,3.916387,7.976308,1.732918,1.629258,3.191397,5.228546,0.429933
std,515.140757,5.47998,2.943876,11.836305,5.489329,4.258842,1.529711,0.999016,3.429678,5.004165,0.495205
min,0.0,9.787964,0.026257,-1.743347,-1.532044,0.043625,-14.066208,-0.46016,-4.984168,-5.382828,0.0
25%,445.75,18.751488,3.104216,0.469997,0.186524,4.832765,0.458187,0.811557,0.286294,0.907965,0.0
50%,891.5,22.924268,4.568088,8.423476,2.028413,8.282902,1.520431,1.542694,2.452813,3.727967,0.0
75%,1337.25,25.865634,6.428771,15.717815,5.582912,11.100896,2.912764,2.291739,5.22624,9.629489,1.0
max,1783.0,32.885551,17.288546,231.134385,34.163811,25.382307,6.782592,5.174101,16.204944,23.972115,1.0


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1784 entries, 0 to 355
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     1784 non-null   int64  
 1   acc_max        1784 non-null   float64
 2   gyro_max       1784 non-null   float64
 3   acc_kurtosis   1784 non-null   float64
 4   gyro_kurtosis  1784 non-null   float64
 5   label          1784 non-null   object 
 6   lin_max        1784 non-null   float64
 7   acc_skewness   1784 non-null   float64
 8   gyro_skewness  1784 non-null   float64
 9   post_gyro_max  1784 non-null   float64
 10  post_lin_max   1784 non-null   float64
 11  fall           1784 non-null   int64  
dtypes: float64(9), int64(2), object(1)
memory usage: 181.2+ KB


In [10]:
df.sample()

,Unnamed: 0,acc_max,gyro_max,acc_kurtosis,gyro_kurtosis,label,lin_max,acc_skewness,gyro_skewness,post_gyro_max,post_lin_max,fall
1044,879,22.960623,6.481883,4.701671,2.504065,CSI,12.424865,1.209656,1.738483,4.721564,10.974288,0


In [11]:
df.shape

(1784, 12)

In [12]:
df['Unnamed: 0'].value_counts().mean()  # a mean count of 1.0 means every value in this column appears exactly once, so it is just a row index
# it will be dropped

1.0
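
A more direct way to express the same check is `Series.is_unique`. The sketch below uses a small made-up frame (the values are hypothetical) to show the check followed by the drop.

```python
import pandas as pd

# Toy frame with a leftover index column, as written by to_csv() without index=False.
toy = pd.DataFrame({"Unnamed: 0": [0, 1, 2, 3], "acc_max": [21.5, 18.2, 25.1, 30.4]})

# A column behaves like an index if every value appears exactly once.
is_index_like = toy["Unnamed: 0"].is_unique

if is_index_like:
    toy = toy.drop(columns=["Unnamed: 0"])
print(toy.columns.tolist())
```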

In [13]:
df['label'].value_counts()

FOL    192
SDL    192
FKL    192
BSC    191
CSO    113
STD    113
SCH    113
STU    113
CSI    113
STN    113
JUM    113
WAL    113
JOG    113
Name: label, dtype: int64

In [14]:
df['fall'].value_counts()

0    1017
1     767
Name: fall, dtype: int64

In [15]:
# here we see that the categories 'BSC', 'FKL', 'FOL' and 'SDL' map to the value 1 of the 'fall' column, so they represent falls,
# while the remaining categories map to 0, so they represent movements that are not falls
grouped = df.groupby('label').agg({'fall': 'mean'}) 
grouped

label,fall
BSC,1
CSI,0
CSO,0
FKL,1
FOL,1
JOG,0
JUM,0
SCH,0
SDL,1
STD,0


In [16]:
# check (currently disabled) that the records in the non-fall categories add up to the number of 0s in the target column
#df.loc[df['label'].isin(grupo[grupo < 120].index.tolist())]['label'].value_counts().sum() == df['fall'].value_counts()[0]
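
The disabled check above depends on a variable that is not defined in this notebook. One possible way to write an equivalent consistency check is sketched here on a made-up frame; the data and variable names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["FOL", "FOL", "SDL", "WAL", "WAL", "STD"],
    "fall":  [1, 1, 1, 0, 0, 0],
})

# Each activity code should map to exactly one value of 'fall'.
values_per_label = toy.groupby("label")["fall"].nunique()
consistent = bool((values_per_label == 1).all())

# Rows whose label is a non-fall activity should match the count of fall == 0.
non_fall_labels = toy.loc[toy["fall"] == 0, "label"].unique()
n_non_fall_rows = int(toy["label"].isin(non_fall_labels).sum())
n_zeros = int((toy["fall"] == 0).sum())
print(consistent, n_non_fall_rows == n_zeros)
```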

In [17]:
corr_matrix = df.select_dtypes(include='number').corr()  # pairwise correlations between the numeric features ('label' is a string column and is excluded)

In [18]:
#fig, ax = plt.subplots(figsize=(10, 6))
#sns.heatmap(corr_matrix, cmap="Blues", annot=True)

In [19]:
corr_matrix['fall'].sort_values(ascending=False)  # sort the correlations with 'fall' from highest to lowest

fall             1.000000
post_lin_max     0.864964
post_gyro_max    0.765410
acc_skewness     0.713811
gyro_skewness    0.685179
acc_max          0.609653
lin_max          0.581044
gyro_kurtosis    0.550182
acc_kurtosis     0.547179
gyro_max         0.468947
Unnamed: 0      -0.857480
Name: fall, dtype: float64

In [20]:
corr_matrix['fall'][corr_matrix['fall'] < 0.5]  # here we see that 'gyro_max' correlates only weakly with 'fall' ('Unnamed: 0' shows up only because its correlation is negative)

Unnamed: 0   -0.857480
gyro_max      0.468947
Name: fall, dtype: float64
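
Filtering on the raw correlation treats strongly negative values as "low". A threshold on the absolute value avoids that. The following sketch uses synthetic data and a threshold of 0.5, chosen only to mirror the cell above; the column names are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "strong": target + 0.3 * rng.standard_normal(n),  # tracks the target closely
    "weak": 0.2 * target + rng.standard_normal(n),    # mostly noise
    "target": target,
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["target"].drop("target")

threshold = 0.5
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(selected)
```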

In [21]:
df.columns

Index(['Unnamed: 0', 'acc_max', 'gyro_max', 'acc_kurtosis', 'gyro_kurtosis',
       'label', 'lin_max', 'acc_skewness', 'gyro_skewness', 'post_gyro_max',
       'post_lin_max', 'fall'],
      dtype='object')

Since the columns 'Unnamed: 0', 'gyro_max', and 'label' are not needed, we use a preprocessing class that keeps only the remaining feature columns of the dataframe.

In [22]:
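class FeatureSelection(BaseEstimator, TransformerMixin):
    """Keep only the columns listed in selected_features."""

    def __init__(self, selected_features):
        self.selected_features = selected_features

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, X, y=None):
        # X is expected to be a DataFrame containing the selected columns
        return X[self.selected_features]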

In [23]:
class OutlierRemover(BaseEstimator, TransformerMixin):
    """Drop rows with any feature outside mean ± n_std standard deviations.

    Because it removes rows (and returns y alongside X when y is given), it is
    meant to be applied before the pipeline, not as a step inside it.
    """

    def __init__(self, n_std=3):
        self.n_std = n_std

    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self

    def transform(self, X, y=None):
        # Keep only the rows where every feature lies within the limits
        lower_limit = self.mean_ - self.n_std * self.std_
        upper_limit = self.mean_ + self.n_std * self.std_
        mask = np.asarray(np.all((X > lower_limit) & (X < upper_limit), axis=1))

        X_filtered = X[mask]
        if y is None:
            return X_filtered
        return X_filtered, y[mask]

    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X, y).transform(X, y)
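
A scikit-learn `Pipeline` passes only `X` from one step to the next, so a step that drops rows leaves `y` out of sync. One common workaround is to filter the training data before fitting. This is a sketch of that idea with a plain function on synthetic data, not a drop-in replacement for the class above; `remove_outliers` is a hypothetical helper.

```python
import numpy as np

def remove_outliers(X, y, n_std=3.0):
    """Drop rows where any feature lies outside mean ± n_std * std."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # True for rows whose every feature is within the allowed band
    mask = np.all(np.abs(X - mean) < n_std * std, axis=1)
    return X[mask], y[mask]

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((200, 3))
y_toy = rng.integers(0, 2, size=200)
X_toy[0] = [50.0, 0.0, 0.0]  # inject one obvious outlier

X_clean, y_clean = remove_outliers(X_toy, y_toy)
print(X_toy.shape, X_clean.shape)
```

Applied only to the training split, this keeps the test set untouched, so reported test metrics still reflect unfiltered data.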

In [24]:
# Separate the independent variables from the target
X=df.drop(columns=['fall'])
y=df['fall']

# Split the data into a train set and a test set:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=100, stratify=y)
display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
display(type(X_train), type(X_test), type(y_train), type(y_test))

(1427, 11)

(357, 11)

(1427,)

(357,)

pandas.core.frame.DataFrame

pandas.core.frame.DataFrame

pandas.core.series.Series

pandas.core.series.Series
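
The `stratify=y` argument keeps the proportion of falls roughly equal in both splits. The sketch below reproduces the split on placeholder features, using only the class counts reported for this dataset (1017 non-falls, 767 falls); the feature values are dummies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 1017 non-falls, 767 falls.
y_all = np.array([0] * 1017 + [1] * 767)
X_all = np.arange(len(y_all)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=100, stratify=y_all
)

# Fraction of falls overall, in train, and in test
print(y_all.mean(), y_tr.mean(), y_te.mean())
```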

In [25]:
X_train.sample()

,Unnamed: 0,acc_max,gyro_max,acc_kurtosis,gyro_kurtosis,label,lin_max,acc_skewness,gyro_skewness,post_gyro_max,post_lin_max
765,931,17.310921,5.78264,5.979438,-0.16566,CSO,4.717529,1.367272,0.811601,5.699724,4.569499


In [26]:
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # set up the cross-validation splitter
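
To see what the stratified splitter does, the sketch below runs it on made-up labels (60 negatives, 40 positives) and prints the share of positives in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fall_rates = [
    float(y_toy[val_idx].mean()) for _, val_idx in folds.split(X_toy, y_toy)
]
print(fall_rates)
```

Every fold ends up with the same class ratio as the full set, which keeps per-fold accuracy scores comparable.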

In [27]:
# default steps for the pipeline; the grid search below overrides 'scaler' and 'model'
pipeline = Pipeline([('FeatureSelection', FeatureSelection(['acc_max', 'acc_kurtosis', 'gyro_kurtosis',
       'lin_max', 'acc_skewness', 'gyro_skewness', 'post_gyro_max', 'post_lin_max'])), 
#      ('OutlierRemover', OutlierRemover()),
       ('scaler', StandardScaler()), 
       ('model', LogisticRegression())], verbose = False) 


In [28]:
# pipeline.steps[0][1].fit_transform(X_train, y_train)

In [29]:
# this list of dictionaries holds the options the cross-validated search should try
# the pipeline will be evaluated with 4 models and several hyperparameters
param_grid = [ {'model': [KNeighborsClassifier()], "model__n_neighbors": [2, 3, 4, 5, 6, 7, 8], 'model__weights' : ['uniform', 'distance'], 'scaler' : [StandardScaler(), MinMaxScaler(), None]}, 
               {'model': [LogisticRegression()], 'model__C': [0.01, 0.1, 1, 10, 100, 1000], 'model__penalty': ['l2', None], 'scaler' : [StandardScaler(), MinMaxScaler(), None]} ,
               {'model': [RandomForestClassifier()], 'model__criterion': ['gini', 'entropy'], 'scaler' : [StandardScaler(), MinMaxScaler(), None]},
               {'model': [XGBClassifier(objective='binary:logistic', eval_metric='logloss')], 'model__learning_rate': [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2], 'scaler' : [StandardScaler(), MinMaxScaler(), None] }
               ]
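
Before launching the search it can help to know how many candidates it will evaluate. The sketch below mirrors the shape of the grid above with `ParameterGrid`, using strings as stand-ins for the estimator and scaler objects (an assumption made only to keep the example free of extra imports).

```python
from sklearn.model_selection import ParameterGrid

# Strings stand in for the estimator and scaler objects; only the counts matter.
scalers = ["standard", "minmax", None]
grid_shape = [
    {"model": ["knn"], "model__n_neighbors": [2, 3, 4, 5, 6, 7, 8],
     "model__weights": ["uniform", "distance"], "scaler": scalers},
    {"model": ["logreg"], "model__C": [0.01, 0.1, 1, 10, 100, 1000],
     "model__penalty": ["l2", None], "scaler": scalers},
    {"model": ["rf"], "model__criterion": ["gini", "entropy"], "scaler": scalers},
    {"model": ["xgb"],
     "model__learning_rate": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2],
     "scaler": scalers},
]

n_candidates = len(ParameterGrid(grid_shape))
n_fits = n_candidates * 5  # one fit per candidate per cross-validation fold
print(n_candidates, n_fits)  # 117 585
```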

In [30]:
grid = GridSearchCV(pipeline, param_grid, cv=folds)

In [31]:
grid.fit(X_train, y_train)  # the rendered diagram shows the pipeline's default steps

In [32]:
grid.best_estimator_  # the best model found

In [33]:
print("Mean cross-validated accuracy of the best model on the training set: ", grid.best_score_)  # best_score_ is averaged over the validation folds

Mean cross-validated accuracy of the best model on the training set:  0.9824880382775121


In [34]:
grid.best_params_  # the hyperparameters of the best model

{'model': XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
               colsample_bynode=None, colsample_bytree=None,
               eval_metric='logloss', gamma=None, gpu_id=None,
               importance_type='gain', interaction_constraints=None,
               learning_rate=0.5, max_delta_step=None, max_depth=None,
               min_child_weight=None, missing=nan, monotone_constraints=None,
               n_estimators=100, n_jobs=None, num_parallel_tree=None,
               random_state=None, reg_alpha=None, reg_lambda=None,
               scale_pos_weight=None, subsample=None, tree_method=None,
               validate_parameters=None, verbosity=None),
 'model__learning_rate': 0.5,
 'scaler': StandardScaler()}
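
Besides the single best configuration, `cv_results_` holds the scores of every candidate, which shows how close the runners-up were. The sketch below runs a deliberately tiny search on synthetic data; the data and grid are illustrative and unrelated to the fall dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
search = GridSearchCV(
    pipe,
    {"scaler": [StandardScaler(), MinMaxScaler()], "model__C": [0.1, 1, 10]},
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate, ordered from best to worst mean validation score.
results = pd.DataFrame(search.cv_results_)
ranking = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranking.head())
```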

In [35]:
print("Accuracy of the best model on the test set: ", accuracy_score(y_test, grid.best_estimator_.predict(X_test)))

Accuracy of the best model on the test set:  0.9803921568627451


In [36]:
y_pred = grid.best_estimator_.predict(X_test)

In [37]:
from sklearn.metrics import classification_report

# Per-class precision, recall and F1 on the test set
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.98      0.99      0.98       204
           1       0.99      0.97      0.98       153

    accuracy                           0.98       357
   macro avg       0.98      0.98      0.98       357
weighted avg       0.98      0.98      0.98       357
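
For fall detection, missed falls (false negatives) usually matter more than false alarms, and a confusion matrix makes both counts explicit. The sketch below uses hypothetical labels, not the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not the notebook's actual predictions.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
recall_falls = tp / (tp + fn)  # share of real falls that were detected
print(tn, fp, fn, tp, recall_falls)
```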



In [38]:
import pickle

In [39]:
best_model = grid.best_estimator_


with open('mejor_modelo_tp4.pkl', 'wb') as f:
    pickle.dump(best_model, f)
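
Loading the model back works the same way in reverse. The sketch below does a round trip with a small stand-in pipeline and a temporary file (both hypothetical). Note that a pickled pipeline containing a custom class such as `FeatureSelection` can only be loaded where that class definition is importable.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=100, n_features=8, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_toy, y_toy)

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```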