# COGS 118B - Project Proposal

# Names

Antara Sengupta 

Dhathry Doppalapudi 

Abhinav Chandra 

Austin Calza 


# Abstract 

Our objective is to employ unsupervised machine learning techniques to cluster EEG signal data into distinct sleep stages. The dataset comprises EEG signals from different brain lobes, representing various frequency subbands associated with sleep stages. The unsupervised approach involves clustering similar patterns without predefined labels, allowing the model to identify inherent structures in the data. By using techniques such as hierarchical, k-means, or Gaussian mixture model (GMM) clustering, we aim to group EEG signals into clusters corresponding to different sleep stages. We will use model selection to decide which clustering model to pursue. Success will be evaluated by the coherence and meaningfulness of the identified clusters, providing insight into whether sleep stages in the dataset can be categorized without supervision.

# Background

Applying machine learning algorithms to neurological data is a rapidly growing area of research. Machine learning methods allow us to sift through billions of neurobiological connections, uncover correlations, and build tools for health care, pharmaceuticals, and technology development. The field of automatic sleep scoring, as addressed in the provided articles, has advanced significantly in response to the challenges of traditional manual sleep staging. Prior research has explored diverse approaches to improving the accuracy and efficiency of sleep staging, motivated by its importance in diagnosing and treating sleep disorders.

One avenue of exploration involves the utilization of machine learning algorithms for sleep staging, aiming to automate a process that is traditionally time-consuming and subjective when conducted by human experts. Sinha (Reference 9) employed a combination of wavelet transform and artificial neural network (ANN) procedures to classify three sleep stages—sleep spindles, rapid eye movement (REM), and awake (AW). The study reported a commendable 95.35% accuracy rate, demonstrating the potential of machine learning techniques in sleep staging. However, this approach was limited to a specific set of sleep stages.

In a different study, Chapotot and Becq (Reference 10) focused on feature extraction methods for sleep scoring, emphasizing the importance of robust features in the classification process. While the specific details of the feature extraction method are not outlined, the work contributes to the broader understanding of feature engineering in the context of sleep staging.

These prior works underscore the ongoing efforts to streamline and enhance the sleep staging process. However, a common limitation is the focus on specific sleep stages or the lack of a comprehensive hybrid approach. The articles provided for this study address this gap by proposing a novel hybrid model—CVNF+CVANN—for automatic sleep scoring, combining complex-valued nonlinear features (CVNF) with a complex-valued neural network (CVANN). This hybrid approach, as presented by the authors (Reference Fourth Article), introduces innovations in feature representation and classification techniques, aiming to overcome the limitations of previous methodologies.

The hybrid model leverages nine nonlinear features commonly used in EEG signal classification, transforming them into a complex-valued number format using a phase encoding method. This novel feature representation is a departure from conventional real-valued features and adds a layer of complexity to the analysis. The resulting complex-valued feature set is then fed into the CVANN algorithm, contributing to a comprehensive and promising methodology for automatic sleep scoring. The study reports encouraging accuracy rates of 91.57% and 93.84% according to Rechtschaffen & Kales (R&K) and American Academy of Sleep Medicine (AASM) standards, respectively.

In summary, prior research has explored machine learning techniques and feature extraction methods for sleep staging, setting the foundation for the current study's innovative hybrid model. The proposed CVNF+CVANN approach builds upon these prior works, introducing complex-valued features and demonstrating notable improvements in accuracy, which positions it as a promising advancement in the field of automatic sleep scoring.

# Problem Statement

Studying human sleep patterns has revealed that sleep can be divided into several stages, each characterized by different brain activity. The wake, N1, N2, N3, and REM stages differ in their levels of delta, theta, alpha, beta, and gamma waves. Because subjects in the same sleep stage should show similar brain activity, unsupervised machine learning methods could be used to cluster similar EEG observations and thereby group the stages of sleep together. Since brain activity can be observed through EEG, this project will take EEG data recorded from subjects in different stages of sleep and build a model that clusters these data points into groups reflecting those stages.

# Data

The data can be found at https://www.kaggle.com/datasets/rafsanjany44/rem-and-nrem-sleep-classification. The data come from a Dutch sleep center, Haaglanden Medisch Centrum, and consist of a series of EEG observations of individuals in different stages of sleep. The dataset has 75 variables and around 89,100 observations. With over 1,000 times as many observations as variables, the data should be well suited to producing a solution that generalizes well. Each observation contains summary measurements of the alpha, beta, gamma, theta, and delta waves, such as the mean, median, peak value, and spectral edge of each band. There are three sets of these measurements, one from each of three electrode sites: F4 over the frontal lobe, C4 over the central region, and O2 over the occipital lobe. No variable should carry more weight than another, since before the investigation it is unclear whether a particular type of brain wave or brain area is more important. The variables are on different scales, so each will be normalized to ensure that no single variable has a disproportionate influence on the model.
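
As a rough illustration of the normalization step, the sketch below standardizes each column to zero mean and unit variance (z-scoring). This is only one possible choice of normalization, and the synthetic array `X_demo` and the helper `zscore_columns` are hypothetical stand-ins rather than the actual dataset or our final code.

```python
import numpy as np


def zscore_columns(X):
    """Standardize each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std[std == 0] = 1.0
    return (X - mean) / std


# Synthetic stand-in for the feature table: three hypothetical features
# on very different scales (NOT the real Kaggle data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.5, 40.0, 1e3], scale=[0.1, 5.0, 250.0], size=(1000, 3))

X_scaled = zscore_columns(X_demo)
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))
```

After this step every feature contributes on a comparable scale to distance-based methods such as k-means.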


# Proposed Solution

EEG signal analysis requires careful preprocessing and feature extraction to prepare the data for unsupervised algorithms. Our solution will begin with a brief exploratory data analysis (EDA) to better understand the data and to inform our preprocessing, namely signal denoising and artifact reduction. Once the data are preprocessed, we will extract features and then group the observations with an unsupervised learning model. Assuming the EDA reveals no issues, we will start by applying a band-pass filter to the EEG data to retain only the relevant frequency range, typically 1-55 Hz, excluding high-frequency noise and very low-frequency drift. It may also be necessary to downsample the signal to reduce the computational load; following the Nyquist theorem, the new sampling rate must remain above twice the highest frequency retained. Next, Independent Component Analysis (ICA) will be used to decompose the signal into independent components, so that components associated with muscle movements, eye blinks, and other artifacts can be identified and removed. After filtering and artifact rejection, the data will be segmented into epochs (separate time windows), providing manageable segments for further analysis.
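
A minimal sketch of the filtering and epoching steps is shown below, using SciPy's Butterworth filter design. The sampling rate (256 Hz), the filter order, the 30-second epoch length, and the helper names `bandpass` and `segment_epochs` are all assumptions made for illustration; none of them is fixed by this proposal, and the ICA step is not shown.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass(signal, fs, low_hz=1.0, high_hz=55.0, order=4):
    """Zero-phase Butterworth band-pass filter (assumed design choice)."""
    if high_hz >= fs / 2:
        raise ValueError("high_hz must be below the Nyquist frequency fs / 2")
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)


def segment_epochs(signal, fs, epoch_seconds=30.0):
    """Split a 1-D signal into non-overlapping epochs, dropping any partial tail."""
    samples_per_epoch = int(round(fs * epoch_seconds))
    n_epochs = len(signal) // samples_per_epoch
    return signal[: n_epochs * samples_per_epoch].reshape(n_epochs, samples_per_epoch)


# Synthetic test signal: a 10 Hz rhythm plus slow drift (0.1 Hz)
# and high-frequency noise (100 Hz). Not real EEG.
fs = 256.0  # assumed sampling rate in Hz
t = np.arange(0, 95, 1 / fs)
raw = (
    np.sin(2 * np.pi * 10 * t)
    + 0.5 * np.sin(2 * np.pi * 0.1 * t)
    + 0.5 * np.sin(2 * np.pi * 100 * t)
)

filtered = bandpass(raw, fs)
epochs = segment_epochs(filtered, fs)
print(epochs.shape)  # (number of epochs, samples per epoch)
```

The filter is applied forward and backward (`sosfiltfilt`) so that it introduces no phase shift, which matters if timing within an epoch is later used as a feature.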

Principal Component Analysis (PCA) will then be used to reduce the dimensionality of the dataset, keeping the components that capture the most variance. This step will be complemented by additional feature engineering to extract time-domain, frequency-domain, and time-frequency features, giving a comprehensive representation of the EEG signals. To group the signals, we will explore unsupervised learning approaches such as k-means and hierarchical clustering. The number of clusters will be chosen using methods such as the elbow method or the gap statistic, so that the EEG signals are partitioned into meaningful, distinct groups.
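
The following sketch shows how PCA and the elbow method could fit together using scikit-learn. It runs on synthetic blobs rather than the EEG features, and the 95% explained-variance threshold and the range of candidate cluster counts are illustrative assumptions, not decisions we have made.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 600 points, 20 features, 3 well-separated groups.
X, _ = make_blobs(n_samples=600, n_features=20, centers=3,
                  cluster_std=1.0, random_state=0)

# Standardize, then keep enough principal components to explain
# 95% of the variance (assumed threshold).
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# Elbow method: record the within-cluster sum of squares (inertia)
# for each candidate number of clusters.
candidate_ks = list(range(1, 8))
inertias = []
for k in candidate_ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    inertias.append(km.inertia_)

for k, inertia in zip(candidate_ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The "elbow" is the value of k after which inertia stops dropping sharply. On real EEG features the bend may be much less clear than on this toy data, which is one reason to also consider the gap statistic.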

Given that our dataset is pre-segmented into NREM and REM sleep categories, we can leverage these classifications to validate our unsupervised learning model. This initial step will enable us to refine our data processing pipeline for these two broad sleep categories. Our ultimate objective is to expand this model to classify all distinct sleep stages without relying on predefined labels or external knowledge, such as the established sleep stage categories in neuroscience. This approach ensures that our model independently discovers the underlying patterns in the data.
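
One way to compare cluster assignments against the existing NREM/REM labels is a contingency table together with the adjusted Rand index, which does not depend on how the clusters happen to be numbered. The sketch below uses a tiny hand-made example; the labels, the cluster IDs, and the helper `contingency_table` are hypothetical, and the adjusted Rand index is our suggestion here rather than a metric fixed by this proposal.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def contingency_table(true_labels, cluster_ids):
    """Count how many observations of each true label fall in each cluster."""
    true_vals, true_idx = np.unique(true_labels, return_inverse=True)
    cluster_vals, cluster_idx = np.unique(cluster_ids, return_inverse=True)
    table = np.zeros((len(true_vals), len(cluster_vals)), dtype=int)
    np.add.at(table, (true_idx, cluster_idx), 1)
    return true_vals, cluster_vals, table


# Hypothetical example: 8 epochs with known labels and cluster assignments.
true_labels = np.array(["NREM", "NREM", "NREM", "REM", "REM", "NREM", "REM", "REM"])
cluster_ids = np.array([1, 1, 1, 0, 0, 1, 0, 1])

true_vals, cluster_vals, table = contingency_table(true_labels, cluster_ids)
ari = adjusted_rand_score(true_labels, cluster_ids)

print("rows:", true_vals, "columns:", cluster_vals)
print(table)
print(f"adjusted Rand index: {ari:.3f}")
```

An adjusted Rand index of 1 means the clusters reproduce the labels exactly (up to renaming), while values near 0 indicate agreement no better than chance.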

For the validation of our more refined classification model, we will employ a Fourier Transform on the epochs. This technique will allow us to analyze the frequency components of the EEG signals. By examining these frequency patterns, the model can algorithmically assign a sleep stage to each epoch, effectively labeling it based on its spectral characteristics. This method will provide an objective basis to assess the model's performance in distinguishing between the various stages of sleep, relying solely on the intrinsic information contained within the EEG signals.
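
To make the spectral analysis concrete, the sketch below computes the relative power in each classical frequency band for a single epoch using NumPy's FFT. The band edges are conventional approximate values that vary between sources, the sampling rate is assumed, and `band_powers` is a hypothetical helper. How band powers would be mapped to a sleep stage is not specified here and is left open.

```python
import numpy as np

# Conventional approximate band edges in Hz (assumed; sources differ).
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 55.0),
}


def band_powers(epoch, fs):
    """Return the fraction of spectral power falling in each band."""
    epoch = np.asarray(epoch, dtype=float)
    # Remove the mean and apply a Hann window to limit spectral leakage.
    windowed = (epoch - epoch.mean()) * np.hanning(len(epoch))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)

    powers = {}
    for name, (low, high) in BANDS.items():
        in_band = (freqs >= low) & (freqs < high)
        powers[name] = spectrum[in_band].sum()

    total = sum(powers.values())
    if total == 0:
        return {name: 0.0 for name in powers}
    return {name: p / total for name, p in powers.items()}


# Synthetic 30-second epoch dominated by a 10 Hz rhythm (not real EEG).
fs = 128.0  # assumed sampling rate in Hz
t = np.arange(0, 30, 1 / fs)
rng = np.random.default_rng(0)
epoch = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(len(t))

powers = band_powers(epoch, fs)
dominant = max(powers, key=powers.get)
print(dominant, {name: round(p, 3) for name, p in powers.items()})
```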

# Evaluation Metrics

For the unsupervised clustering of sleep stages from EEG signal data, the Silhouette Score is a relevant evaluation metric. It measures how well defined and well separated the clusters are, taking into account both the cohesion within clusters and the separation between them. The score ranges from -1 to 1, with higher values indicating better-defined, more distinct clusters, which aligns with our goal of accurately identifying sleep stages in EEG signals. The overall Silhouette Score for a dataset is the average of the scores of its individual data points. The Silhouette Score for a single data point $i$ is given by:
$$S(i) = \frac{b(i) - a(i)}{\max\big(a(i),\, b(i)\big)}$$
where $a(i)$ is the average distance from the $i$-th data point to the other points in the same cluster, and $b(i)$ is the smallest average distance from the $i$-th data point to the points in any other cluster.
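
The definition above can be translated almost directly into NumPy. The sketch below is a from-scratch illustration on four hand-picked points, assuming Euclidean distance and at least two clusters; `silhouette_values` is a hypothetical helper, and in practice a library implementation such as scikit-learn's `silhouette_score` would likely be used instead.

```python
import numpy as np


def silhouette_values(X, labels):
    """Per-point silhouette values with Euclidean distance.

    Assumes at least two distinct clusters. Points that are alone in
    their cluster get a value of 0 by convention.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)

    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: leave the value at 0
        # a(i): mean distance to the OTHER points in the same cluster
        # (the distance from i to itself is 0, so divide by n_same - 1).
        a = dists[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Two tight, well-separated pairs of points.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_demo = np.array([0, 0, 1, 1])

per_point = silhouette_values(X_demo, labels_demo)
overall = per_point.mean()  # overall score is the average over all points
print(per_point.round(3), round(overall, 3))
```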

# Ethics & Privacy

In our EEG signal classification project, we are aware of potential ethical concerns and data privacy issues that demand our careful consideration. Firstly, utilizing medical data, especially from sleep recordings, requires us to prioritize patient privacy. Our dataset, collected from Haaglanden Medisch Centrum in The Netherlands, is publicly available on Kaggle. It is anonymized, and any personally identifiable information has been handled.

Furthermore, classifying sleep stages from EEG data carries the risk of revealing sensitive health information. We acknowledge the risk of stigmatization or discrimination based on the model's results, and we will ensure that only authorized individuals, such as our instructor and TA, can access the model's results. In addition, this project will be made private upon completion to ensure security.

We understand that biased outcomes can arise if the training data does not adequately represent a diverse population. Given that our dataset is specific to a sleep center in The Netherlands, we understand that it may not be representative of people from all over the world and will make sure not to claim that it is. To mitigate ethical concerns and potential unintended consequences, we prioritize transparency in our model development process. Documenting our data preprocessing steps, model architectures, and evaluation metrics serves to enhance accountability and facilitates external scrutiny.

# Team Expectations 

* *Everybody does their assigned parts on time as planned.*
* *Everyone communicates clearly with team members - if anything comes up, they let the team know promptly.*
* *All members attend decided meetings, and if they are not able to, they complete tasks remotely.*
* *All team members are respectful of each other.*
* *Everybody contributes equally, and puts in their best effort.*


# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/11  |  2 PM |  Brainstorm topics/questions  | Determine best form of communication; discuss possible topics; find potential datasets; discuss potential hypotheses | 
| 2/20  |  2 PM |  Find more datasets, narrow down potential datasets | Decide on a dataset, problem, and hypothesis; discuss expectations, ethics, and schedule; complete project proposal together | 
| 2/27  | 2 PM  | Brainstorm and split up EDA, Import & Wrangle Data, do some EDA  | Review/Edit wrangling/EDA, split up any further EDA and programming   |
| 3/5  | 2 PM  | Finalize wrangling/EDA; Begin programming for project | Discuss/edit project code; Complete project, Split up work for the analysis   |
| 3/12  | 2 PM  | Complete analysis; Draft results/conclusion/discussion | Complete results/conclusion/discussion sections, Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project  |

