# Challenge 2

## Introducción

El sueño es un proceso fisiológico esencial para la salud y el bienestar. La clasificación de los estados de sueño es importante para el diagnóstico y tratamiento de trastornos del sueño. El presente trabajo tiene como objetivo desarrollar e implementar algoritmos de aprendizaje automático para la clasificación automática de estados de sueño. Se hará uso de la base de datos CAP Sleep database[1]. El análisis deseñales EEG ya ha sido utilizdo para la clasificación de estados del sueño[2]. En esta ocasión, se analizarán tres algoritmos: regresión lineal, K-Nearest Neighbors (KNN) y Naive Bayes. La clasificación se basará en tres tipos de datos: posición corporal, eventos de sueño y duración de los eventos.

[1] MG Terzano, L Parrino, A Sherieri, R Chervin, S Chokroverty, C Guilleminault, M Hirshkowitz, M Mahowald, H Moldofsky, A Rosa, R Thomas, A Walters. Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep. Sleep Med 2001 Nov; 2(6):537-553.

[2] Tsinalis, O., Matthews, P. M., Guo, Y., & Zafeiriou, S. (2016). Automatic sleep stage scoring with single-channel EEG using convolutional neural networks. arXiv preprint arXiv:1610.01683.

## Metodología

Primero se realizará un preprocesamiento de los datos para luego utilizar la clasificación.

In [42]:
# download the db
import requests
from urllib.parse import urlparse

url = 'https://physionet.org/files/capslpdb/1.0.0/n4.txt'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Extract the filename from the URL
    parsed_url = urlparse(url)
    filename = parsed_url.path.split('/')[-1]

    # Open a file in binary mode and write the downloaded content to it
    with open(filename, 'wb') as file:
        file.write(response.content)
    print(f"File downloaded successfully as '{filename}'")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

File downloaded successfully as 'n4.txt'


In [43]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
limit=23
data=[]

with open("n4.txt", "r") as f:
    current_row=0
    for line in f:
        current_row+=1
        if current_row>=limit:
            columns=line.strip().split("\t")
            data.append(columns)

df=pd.DataFrame(data)
# columnames
df.columns=["Sleep Stage","Position","Time [hh:mm:ss]","Event","Duration[s]","Location"]
# datatypes
df.dtypes


Sleep Stage        object
Position           object
Time [hh:mm:ss]    object
Event              object
Duration[s]        object
Location           object
dtype: object

In [44]:
# cast datetime values
df['Time [hh:mm:ss]'] = df['Time [hh:mm:ss]'].astype(str)
df["Time [hh:mm:ss]"] = pd.to_datetime('01/01/2004 '+df["Time [hh:mm:ss]"], format='%d/%m/%Y %H:%M:%S')
df.loc[df['Time [hh:mm:ss]'].dt.hour < 8, 'Time [hh:mm:ss]'] += pd.DateOffset(days=1)

df["Duration[s]"] = df["Duration[s]"].astype(int)

df

Unnamed: 0,Sleep Stage,Position,Time [hh:mm:ss],Event,Duration[s],Location
0,W,Prone,2004-01-01 22:36:37,SLEEP-S0,30,EOG-Left-A2
1,W,Prone,2004-01-01 22:37:07,SLEEP-S0,30,EOG-Left-A2
2,W,Prone,2004-01-01 22:37:37,SLEEP-S0,30,EOG-Left-A2
3,W,Prone,2004-01-01 22:38:07,SLEEP-S0,30,EOG-Left-A2
4,W,Prone,2004-01-01 22:38:37,SLEEP-S0,30,EOG-Left-A2
...,...,...,...,...,...,...
1261,W,Prone,2004-01-02 06:59:37,SLEEP-S0,30,EOG-Left-A2
1262,W,Unknown Position,2004-01-02 07:00:07,SLEEP-S0,30,EOG-Left-A2
1263,W,Unknown Position,2004-01-02 07:00:37,SLEEP-S0,30,EOG-Left-A2
1264,W,Supine,2004-01-02 07:01:07,SLEEP-S0,30,EOG-Left-A2


In [45]:
# Delete a column if all values are the same
for col in df.columns:
    if df[col].nunique() == 1:
        df.drop(col, axis=1, inplace=True)

print(df.head())

  Sleep Stage Position     Time [hh:mm:ss]     Event  Duration[s]     Location
0           W    Prone 2004-01-01 22:36:37  SLEEP-S0           30  EOG-Left-A2
1           W    Prone 2004-01-01 22:37:07  SLEEP-S0           30  EOG-Left-A2
2           W    Prone 2004-01-01 22:37:37  SLEEP-S0           30  EOG-Left-A2
3           W    Prone 2004-01-01 22:38:07  SLEEP-S0           30  EOG-Left-A2
4           W    Prone 2004-01-01 22:38:37  SLEEP-S0           30  EOG-Left-A2


In [46]:
df["Sleep Stage"].value_counts() # 7 sleep stages

Sleep Stage
S2    565
W     223
R     199
S4    119
S3    117
S1     22
MT     21
Name: count, dtype: int64

In [47]:
df["Event"].value_counts() # 9 events

Event
SLEEP-S2     401
SLEEP-S0     223
SLEEP-REM    198
MCAP-A1      192
SLEEP-S4      76
SLEEP-S3      63
MCAP-A3       52
MCAP-A2       43
SLEEP-S1      18
Name: count, dtype: int64

In [48]:
df["Position"].value_counts() # 4 positions + unknown position

Position
Supine              586
Right               494
Prone               126
Left                 47
Unknown Position     13
Name: count, dtype: int64

In [49]:
# replace 'Unknown Position' with the last non unknown position because it's
# maybe due to a sensor issue and position isn't that important when predicting
# the sleep stage
df['Position'] = df['Position'].replace('Unknown Position', method='ffill')
df["Position"].value_counts()

Position
Supine    590
Right     494
Prone     135
Left       47
Name: count, dtype: int64

In [50]:
df.dtypes

Sleep Stage                object
Position                   object
Time [hh:mm:ss]    datetime64[ns]
Event                      object
Duration[s]                 int64
Location                   object
dtype: object

In [51]:
from sklearn.preprocessing import MinMaxScaler

df['Time [hh:mm:ss]']=pd.to_numeric(df['Time [hh:mm:ss]'])

# Scale the 'Time [hh:mm:ss]' column from 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))
df['Time [hh:mm:ss]'] = scaler.fit_transform(df[['Time [hh:mm:ss]']])

In [52]:
from sklearn.preprocessing import LabelEncoder
# Convert category variables to dummy variables
df = pd.get_dummies(df, columns=['Position','Event'])

df["Sleep Stage"]=LabelEncoder().fit_transform(df["Sleep Stage"])
df["Location"]=LabelEncoder().fit_transform(df["Location"])

print(df.head())

   Sleep Stage  Time [hh:mm:ss]  Duration[s]  Location  Position_Left  \
0            6          0.00000           30         1          False   
1            6          0.00099           30         1          False   
2            6          0.00198           30         1          False   
3            6          0.00297           30         1          False   
4            6          0.00396           30         1          False   

   Position_Prone  Position_Right  Position_Supine  Event_MCAP-A1  \
0            True           False            False          False   
1            True           False            False          False   
2            True           False            False          False   
3            True           False            False          False   
4            True           False            False          False   

   Event_MCAP-A2  Event_MCAP-A3  Event_SLEEP-REM  Event_SLEEP-S0  \
0          False          False            False            True   
1         

In [53]:
# check datatypes
df.dtypes

Sleep Stage          int64
Time [hh:mm:ss]    float64
Duration[s]          int64
Location             int64
Position_Left         bool
Position_Prone        bool
Position_Right        bool
Position_Supine       bool
Event_MCAP-A1         bool
Event_MCAP-A2         bool
Event_MCAP-A3         bool
Event_SLEEP-REM       bool
Event_SLEEP-S0        bool
Event_SLEEP-S1        bool
Event_SLEEP-S2        bool
Event_SLEEP-S3        bool
Event_SLEEP-S4        bool
dtype: object

In [54]:
valores_nulos=df.isnull().sum().sum()
valores_nulos #No existen valores nulos.

0

In [55]:
# see the df
df

Unnamed: 0,Sleep Stage,Time [hh:mm:ss],Duration[s],Location,Position_Left,Position_Prone,Position_Right,Position_Supine,Event_MCAP-A1,Event_MCAP-A2,Event_MCAP-A3,Event_SLEEP-REM,Event_SLEEP-S0,Event_SLEEP-S1,Event_SLEEP-S2,Event_SLEEP-S3,Event_SLEEP-S4
0,6,0.00000,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
1,6,0.00099,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
2,6,0.00198,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
3,6,0.00297,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
4,6,0.00396,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1261,6,0.99604,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
1262,6,0.99703,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
1263,6,0.99802,30,1,False,True,False,False,False,False,False,False,True,False,False,False,False
1264,6,0.99901,30,1,False,False,False,True,False,False,False,False,True,False,False,False,False


In [56]:
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

## Resultados

### Logistic Regression

In [57]:
# Define features and target variable
features = df[['Time [hh:mm:ss]','Duration[s]','Location','Position_Left','Position_Prone','Position_Right','Position_Supine','Event_MCAP-A1','Event_MCAP-A2','Event_MCAP-A3','Event_SLEEP-REM','Event_SLEEP-S0','Event_SLEEP-S1','Event_SLEEP-S2','Event_SLEEP-S3','Event_SLEEP-S4']]
target = df['Sleep Stage']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=42)

# Initialize and train logistic regression model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)

# Evaluate the model
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9409448818897638


### KNN

In [58]:
# Define features and target variable
features = df[['Time [hh:mm:ss]','Duration[s]','Location','Position_Left','Position_Prone','Position_Right','Position_Supine','Event_MCAP-A1','Event_MCAP-A2','Event_MCAP-A3','Event_SLEEP-REM','Event_SLEEP-S0','Event_SLEEP-S1','Event_SLEEP-S2','Event_SLEEP-S3','Event_SLEEP-S4']]
target = df['Sleep Stage']

# Preprocess categorical variables (e.g., one-hot encoding)
features_encoded = pd.get_dummies(features)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.2, random_state=42)

# Scale numeric features if needed
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train KNN classifier
k = 10  # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")


Accuracy: 0.9212598425196851


### Naive Bayes

In [59]:
# Define features and target variable
features = df[['Time [hh:mm:ss]','Duration[s]','Location','Position_Left','Position_Prone','Position_Right','Position_Supine','Event_MCAP-A1','Event_MCAP-A2','Event_MCAP-A3','Event_SLEEP-REM','Event_SLEEP-S0','Event_SLEEP-S1','Event_SLEEP-S2','Event_SLEEP-S3','Event_SLEEP-S4']]
target = df['Sleep Stage']

# Preprocess categorical variables (e.g., label encoding)
label_encoder = LabelEncoder()
target_encoded = label_encoder.fit_transform(target)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target_encoded, test_size=0.2, random_state=42)

# Initialize and train Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8543307086614174


## Discusiones

El buen desempeño (acc=94.09%) de la regresión lineal tiene buenos resultados debido a la relación linear que tienen principalmente los eventos con los estados de sueño.

KNN tiene un buen desempeño (acc=92.13%) con un número de vecinos bajo. Cabe aclarar que la data no tiene relaciones tan complejas y es por ello que KNN no presenta una mejora con respecto al resto.

Naive Bayes no tiene tan buenos resultados (acc=85.43%), lo cual podría deberse a que la condición "naive" no aplica para los datos.

## Conclusiones

Es posible la clasificación de estados del sueño mediante técnicas de clasificación. Es importante setear los parámetros adecuados, como por ejemplo el número de vecinos.

En esta ocasión, la regresión lineal obtuvo los mejores resultados (acc=94.09%).