# <span style='color:#A80808'>Problem description</span>

You've been provided with thousands of sixty-second sequences of biological sensor data recorded from several hundred participants who could have been in either of two possible activity states. Can you determine what state a participant was in from the sensor data?

![](https://cdn.usharama.edu.in/blog/biomedical-signal-processing/bms-usha-rama-blog.PNG)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['axes.facecolor'] = 'gray'

import warnings
warnings.simplefilter('ignore')

# <span style='color:#A80808'>Data</span>

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
train_labels = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')
# Attach the per-sequence label to every row of its sequence
train = train.merge(train_labels, on='sequence', how='left')

test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')

train.head(3)

There are 25968 sequences, each with 60 steps (one step per second), so the dataset has 25968 × 60 = 1558080 rows in total. No sequence has a missing step.
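The counts above can be checked directly. The following is a minimal sketch, assuming the `sequence` and `step` columns used elsewhere in this notebook and steps numbered from 0; `check_sequences` is a hypothetical helper, and the toy frame only stands in for `train`.

```python
import pandas as pd

def check_sequences(df, n_steps=60):
    """Return (number of sequences, True if every sequence has steps 0..n_steps-1 exactly once)."""
    steps = df.groupby('sequence')['step']
    complete = (
        (steps.size() == n_steps)
        & (steps.nunique() == n_steps)
        & (steps.min() == 0)
        & (steps.max() == n_steps - 1)
    )
    return len(complete), bool(complete.all())

# Toy stand-in for `train`: 3 sequences of 60 steps each (not real data)
toy = pd.DataFrame({
    'sequence': [s for s in range(3) for _ in range(60)],
    'step': list(range(60)) * 3,
})
n_sequences, all_complete = check_sequences(toy)
print(n_sequences, all_complete, len(toy))
```

On the real data, `check_sequences(train)` should report the number of sequences and whether all of them are complete.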

# <span style='color:#A80808'>Target: state</span>

The target is evenly distributed between the two categories.
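A histogram shows the balance visually; the exact shares can also be printed. This is a small sketch on made-up labels, assuming a label table with one row per sequence and a `state` column, as in `train_labels`.

```python
import pandas as pd

# Toy stand-in for `train_labels` (values are illustrative, not from the dataset)
toy_labels = pd.DataFrame({'sequence': range(6), 'state': [0, 1, 1, 0, 1, 0]})

# Share of each state among the sequences
share = toy_labels['state'].value_counts(normalize=True).sort_index()
print(share)
```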

In [None]:
plt.figure(figsize=(5,3))

train_labels.state.hist(color='yellow')
plt.xlabel('State', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Target', fontsize=16)
plt.tight_layout()
plt.show()

# <span style='color:#A80808'>Subject</span>

Each participant in the experiments has a unique subject id. There are 991 subjects in total: 672 in the train set and 319 in the test set.

Subjects with a high number of sequences tend to be in state 1 more often. Specifically, subjects with more than 100 sequences have more than 80% of their sequences in state 1.
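The two quantities compared here, sequences per subject and the percentage of them in state 1, can be put in one table. This is a minimal sketch on toy data; `subject_summary` is a hypothetical helper, and it assumes the merged frame carries `sequence`, `subject` and `state` columns with the label repeated on every step.

```python
import pandas as pd

def subject_summary(df):
    """One row per subject: number of sequences and percentage of them in state 1."""
    # The label is repeated on every step, so keep a single row per sequence
    per_seq = df.drop_duplicates('sequence')
    grouped = per_seq.groupby('subject')['state']
    return pd.DataFrame({
        'n_sequences': grouped.size(),
        'pct_state_1': grouped.mean() * 100,
    })

# Toy stand-in for the merged `train` frame (not real data)
toy = pd.DataFrame({
    'sequence': [0, 0, 1, 1, 2, 2],
    'subject':  [7, 7, 7, 7, 9, 9],
    'step':     [0, 1, 0, 1, 0, 1],
    'state':    [1, 1, 0, 0, 1, 1],
})
summary = subject_summary(toy)
print(summary)
```

On the real data, `summary[summary.n_sequences > 100]` would list the subjects the paragraph above refers to.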

In [None]:
plt.figure(figsize=(10,7))
# x: sequences per subject (60 rows per sequence); y: percentage of rows in state 1
by_subject = train.groupby('subject')
plt.plot(by_subject.size()/60, by_subject.state.sum()/by_subject.size()*100, 'w.')
plt.xlabel('Number of sequences per subject', fontsize=16)
plt.ylabel('% state 1', fontsize=16)
plt.title('The influence of number of sequences per subject on the % of state 1', fontsize=16)
plt.show()

Besides, the data for subjects with fewer than 50 sequences show an artificial-looking pattern: the points fall on smooth curves. For example, the curves y=100/x, y=200/x, y=300/x, etc. fit part of the data, as shown below. We can also observe that all subjects with 100% state 0 have fewer than 40 sequences.

In [None]:
x = np.arange(5,40)

In [None]:
plt.figure(figsize=(10,7))
by_subject = train.groupby('subject')
plt.plot(by_subject.size()/60, by_subject.state.sum()/by_subject.size()*100, 'w.', markersize=10)
# Overlay the curves y = 100k/x for k = 1..6
for k, color in enumerate(['r', 'b', 'g', 'y', 'k', 'orange'], start=1):
    plt.plot(x, 100*k/x, color=color, label=f'y={100*k}/x')
plt.xlabel('Number of sequences per subject', fontsize=16)
plt.ylabel('% state 1', fontsize=16)
plt.xlim(0,50)
plt.ylim(-1,40)
plt.legend()
plt.title('Data follow artificial smooth curves', fontsize=16)
plt.show()
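The curves y=100/x, y=200/x, ... have a simple arithmetic reading, offered here as one possible explanation rather than a finding from the data: a subject with n sequences, k of which are in state 1, has exactly 100k/n % state 1. Subjects sharing the same k therefore lie on the same curve y=100k/x. The sketch below checks this identity on illustrative values of n (not taken from the dataset).

```python
import numpy as np

# n: sequences per subject (illustrative values only)
n = np.array([5, 10, 20, 25, 40])

for k in (1, 2, 3):
    pct = k / n * 100      # % state 1 for a subject with k state-1 sequences
    curve = 100 * k / n    # the curve y = 100k/x evaluated at x = n
    assert np.allclose(pct, curve)
    print(k, np.round(pct, 2))
```

Under this reading, the curve a subject falls on simply counts how many of its sequences are in state 1.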

# <span style='color:#A80808'>Subjects with 100% state 0</span>

The subjects with 100% state 0 are listed below.

In [None]:
val = train.groupby('subject').state.sum()
list_subs_0 = val[val==0].index
list_subs_0

In [None]:
subs_0 = train[train.subject.isin(list_subs_0)]
subs_0 = subs_0.set_index('step')
subs_0.head(10)

In [None]:
plt.figure(figsize=(10,7))
plt.plot(subs_0.sensor_02, 'w.')
plt.title('Subjects with 100% state 0', fontsize=16)
plt.xlabel('Steps', fontsize=16)
plt.ylabel('Sensor_02', fontsize=16)
plt.tight_layout()

We can observe that, for the subjects with 100% state 0, sensor_02 is constant within each sequence.
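The observation can be quantified instead of read off the plot. This is a minimal sketch, assuming sensor columns are named `sensor_XX` as in this notebook; `constant_fraction_per_sensor` is a hypothetical helper, and the toy frame is made up.

```python
import pandas as pd

def constant_fraction_per_sensor(df):
    """For each sensor column, the fraction of sequences in which it takes a single value."""
    sensor_cols = [c for c in df.columns if c.startswith('sensor_')]
    n_values = df.groupby('sequence')[sensor_cols].nunique()
    return (n_values == 1).mean()

# Toy stand-in: sensor_02 is constant within each sequence, sensor_00 is not
toy = pd.DataFrame({
    'sequence':  [0, 0, 0, 1, 1, 1],
    'sensor_00': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'sensor_02': [0.5, 0.5, 0.5, -1.0, -1.0, -1.0],
})
fractions = constant_fraction_per_sensor(toy)
print(fractions)
```

Applied to any subset of `train` that still has its `sequence` column, a value of 1.0 for a sensor means it is constant in every sequence of that subset.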

# <span style='color:#A80808'>Subjects following the curve y=100/x</span>

The subjects whose % of state 1 is linked to their number of sequences by the smooth curve y=100/x are listed below.

In [None]:
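val1 = train.groupby('subject').state.sum()/train.groupby('subject').size()*100
val2 = train.groupby('subject').size()/60

# Compare with a tolerance: exact float equality can miss points on the curve
# (for example, 1/3*100 and 100/3 differ in the last digit)
list_subs_1 = val1[np.isclose(val1, 100/val2)].index
list_subs_1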

In [None]:
subs_1 = train[train.subject.isin(list_subs_1)]
subs_1 = subs_1.set_index('step')
subs_1.head(3)

Many sensors have constant values in this case: sensor_00, sensor_01, etc.

In [None]:
plt.figure(figsize=(10,7))
plt.plot(subs_1.sensor_02, 'w.')
plt.title('Subjects following smooth curve y=100/x', fontsize=16)
plt.xlabel('Steps', fontsize=16)
plt.ylabel('Sensor_02', fontsize=16)
plt.tight_layout()

# <span style='color:#A80808'>Subject sequences distribution</span>

In [None]:
plt.figure(figsize=(10,5))
(train.groupby('subject').size()/60).hist(bins=200, color='yellow', label='Train')
(test.groupby('subject').size()/60).hist(bins=100, color='black', label='Test')
plt.xlabel('Number of sequences per subject', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.legend(fontsize=16)
plt.show()

# <span style='color:#A80808'>Subjects with a high number of sequences</span>

Most subjects in the train set have around 30 sequences, but some have more than 100. Subject 437 has the highest number of sequences (199), followed by subject 1 with 175 sequences and subject 635 with 158 sequences.
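A ranking like this one can be produced with a group-by and `nlargest`. The sketch below uses a made-up frame with the same `subject` and `sequence` columns; the numbers it prints are not those of the dataset.

```python
import pandas as pd

# Toy stand-in: subject 1 has 3 sequences, subject 2 has 2, subject 3 has 1
toy = pd.DataFrame({
    'subject':  [1, 1, 1, 2, 2, 3],
    'sequence': [10, 11, 12, 20, 21, 30],
})

# Count distinct sequences per subject and keep the two largest
top = toy.groupby('subject')['sequence'].nunique().nlargest(2)
print(top)
```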

Most sequences of the subjects with a high number of sequences (for example, subjects 437, 1 and 635) are in state 1, as shown below.

In [None]:
plt.figure(figsize=(15,6))

# One state histogram per subject; step==0 keeps a single row per sequence
for i, subject in enumerate([1, 87, 207, 421, 437, 635], start=1):
    plt.subplot(2,3,i)
    train.state[(train.subject==subject) & (train.step==0)].hist(color='yellow')
    plt.xlabel('State', fontsize=16)
    plt.ylabel('Count', fontsize=16)
    plt.title(f'Subject {subject}', fontsize=16)
    plt.tight_layout()

plt.show()

# <span style='color:#A80808'>Subjects with a low number of sequences</span>

Conversely, subjects 45, 73, 265, 472, 486 and 519 have very few sequences. Most sequences of these subjects are in state 0, as shown below.

In [None]:
plt.figure(figsize=(15,6))

# One state histogram per subject; step==0 keeps a single row per sequence
for i, subject in enumerate([45, 73, 265, 472, 486, 519], start=1):
    plt.subplot(2,3,i)
    train.state[(train.subject==subject) & (train.step==0)].hist(color='yellow')
    plt.xlabel('State', fontsize=16)
    plt.ylabel('Count', fontsize=16)
    plt.title(f'Subject {subject}', fontsize=16)
    plt.tight_layout()

plt.show()

# <span style='color:#A80808'>Subject 472</span>

This particular subject has the lowest number of sequences (only 2).

In [None]:
sub472 = train[train.subject==472]

In [None]:
plt.figure(figsize=(15,7))

# Overlay the two sequences of subject 472 for the first six sensors
for i in range(6):
    sensor = f'sensor_{i:02d}'
    plt.subplot(2,3,i+1)
    for sequence, color in [(4275, 'yellow'), (9285, 'white')]:
        mask = sub472.sequence==sequence
        plt.plot(sub472.step[mask], sub472[sensor][mask], color=color, label=f'sequence {sequence}')
    plt.title(sensor, fontsize=16)
    plt.legend()
    plt.tight_layout()

plt.show()

It is hard to find a correlation between the two sequences of subject 472, even though both correspond to state 0.
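The visual impression can be backed by a number. This is a minimal sketch of a per-sensor Pearson correlation between two sequences aligned on `step`; `sequence_correlation` is a hypothetical helper, and the toy frame is constructed so that the expected signs are obvious.

```python
import numpy as np
import pandas as pd

def sequence_correlation(df, seq_a, seq_b):
    """Pearson correlation between two sequences, computed sensor by sensor over the steps."""
    sensor_cols = [c for c in df.columns if c.startswith('sensor_')]
    a = df[df.sequence == seq_a].set_index('step')[sensor_cols]
    b = df[df.sequence == seq_b].set_index('step')[sensor_cols]
    # corrwith aligns the two frames on their index (step) and column names
    return a.corrwith(b)

# Toy stand-in: sensor_00 moves together in both sequences, sensor_01 moves oppositely
steps = np.arange(5)
toy = pd.DataFrame({
    'sequence':  [1] * 5 + [2] * 5,
    'step':      np.tile(steps, 2),
    'sensor_00': np.concatenate([steps, 2.0 * steps + 1.0]),
    'sensor_01': np.concatenate([steps, -1.0 * steps]),
})
corr = sequence_correlation(toy, 1, 2)
print(corr)
```

On the real data this would be called as `sequence_correlation(sub472, 4275, 9285)`; values near 0 would support the observation above. A sensor that is constant within a sequence has no defined correlation and yields NaN.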