# CMI – Detect Behavior with Sensor Data
Exploratory Data Analysis (EDA) starter notebook.

This notebook shows how to download, load and take a first look at the competition data.

## 1. Setup & Data Download
This notebook expects **Kaggle API** credentials in the environment variables `KAGGLE_USERNAME` and `KAGGLE_KEY`. If you are running inside a Kaggle Notebook, these are already configured. Otherwise, uncomment the first lines of the cell below and fill in your credentials. You must also have accepted the competition rules on Kaggle, or the download will be refused.

In [None]:

# Uncomment and fill in your credentials if running locally:
# import os
# os.environ['KAGGLE_USERNAME'] = "YOUR_USERNAME"
# os.environ['KAGGLE_KEY'] = "YOUR_KEY"

!pip -q install kaggle -U
!kaggle competitions download -c cmi-detect-behavior-with-sensor-data -p ./data -q
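
Before running the download cell above, it can help to confirm that credentials are visible at all. The check below is a minimal sketch: it looks for the two environment variables named earlier, or for a `kaggle.json` file in `~/.kaggle`, the CLI's default credentials location. The helper name `kaggle_credentials_found` is hypothetical, and a custom config directory is not covered.

```python
import os
from pathlib import Path


def kaggle_credentials_found(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    `env` and `home` default to the real environment and home directory;
    they are parameters only so the check can be exercised in isolation.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Both variables must be present and non-empty.
    has_env = bool(env.get("KAGGLE_USERNAME")) and bool(env.get("KAGGLE_KEY"))
    # Alternatively, a credentials file in the default location is enough.
    has_file = (home / ".kaggle" / "kaggle.json").is_file()
    return has_env or has_file


if not kaggle_credentials_found():
    print("No Kaggle credentials found; the download cell is likely to fail.")
```

The check only tests for presence; it does not verify that the username and key are valid.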


In [None]:

# Extract all downloaded zip files
import zipfile, pathlib, warnings
warnings.filterwarnings('ignore')  # silences ALL warnings for the rest of the notebook

data_dir = pathlib.Path('./data')
data_dir.mkdir(exist_ok=True)
for zip_path in data_dir.glob('*.zip'):
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall(data_dir)
    zip_path.unlink()  # remove the zip after extraction
print('Extraction finished. Sample files:', list(data_dir.iterdir())[:10])


## 2. Libraries

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
plt.rcParams['figure.figsize'] = (10, 6)
sns.set_style('whitegrid')


## 3. Load tabular data

In [None]:

# Candidate file names; only those actually present in ./data are loaded
possible_files = [
    'train.csv',
    'train_series.parquet',
    'train_events.csv',
    'train_meta.csv',
    'test.csv',
    'sample_submission.csv'
]

available = [f for f in possible_files if (data_dir/f).exists()]
print('Detected files:', available)

dfs = {}
for fname in available:
    path = data_dir / fname
    if fname.endswith('.csv'):
        dfs[fname] = pd.read_csv(path)
    elif fname.endswith('.parquet'):
        dfs[fname] = pd.read_parquet(path)
    print(f'{fname}:', dfs[fname].shape)
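
The hard-coded list above silently skips any file whose name it does not anticipate. As a minimal sketch of an alternative, assuming every table is a CSV or Parquet file somewhere under `./data`, the directory can be scanned instead. `discover_tables` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def discover_tables(data_dir, suffixes=(".csv", ".parquet")):
    """Return every file under `data_dir` whose suffix is in `suffixes`, sorted by path."""
    root = Path(data_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


# List what is available, with sizes, before deciding what to load.
for path in discover_tables("./data"):
    print(path, f"{path.stat().st_size / 1e6:.1f} MB")
```

Printing the file sizes first makes it easier to decide which tables are safe to load in full and which should be sampled.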


## 4. Quick inspection

In [None]:

for name, df in dfs.items():
    print(f"\n{'#'*40}\n{name}")
    display(df.head())
    df.info()  # info() prints its report itself and returns None
    display(df.describe(include='all').T)


## 5. Label distribution

In [None]:

label_candidates = ['behavior_id', 'event', 'label', 'target', 'class']
for name, df in dfs.items():
    for col in label_candidates:
        if col in df.columns:
            df[col].value_counts().sort_index().plot(kind='bar')
            plt.title(f'{col} distribution in {name}')
            plt.show()
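
Bar charts make class imbalance visible but not easy to quantify. The sketch below tabulates counts and shares per class for any label column. `class_balance` is a hypothetical helper, and the toy labels are made up for illustration.

```python
import pandas as pd


def class_balance(labels):
    """Return per-class counts and shares, most frequent class first."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


toy_labels = pd.Series(["a", "a", "a", "b"])  # made-up labels, not competition data
balance = class_balance(toy_labels)
print(balance)
# Ratio between the most and least frequent class as a one-number imbalance summary.
print("largest / smallest class:", balance["count"].max() / balance["count"].min())
```

On the real data the same call would be, for example, `class_balance(df[col])` inside the loop above.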


## 6. Time-series visualisation

In [None]:

if 'train_series.parquet' in dfs:
    series_df = dfs['train_series.parquet']
    # pick the first series
    first_series_id = series_df['series_id'].iloc[0]
    subset = series_df[series_df['series_id'] == first_series_id]
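    # keep numeric columns only, so the first four are plottable signals
    sensor_cols = [c for c in subset.select_dtypes(include='number').columns
                   if c not in ['series_id', 'step']]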
    subset.set_index('step')[sensor_cols[:4]].plot(subplots=True, sharex=True,
                                                   title=f'Sensor signals for series {first_series_id}')
    plt.tight_layout()
    plt.show()


## 7. Missing values

In [None]:

!pip -q install missingno
import missingno as msno
for name, df in dfs.items():
    msno.matrix(df.sample(min(1000, len(df))), fontsize=8)
    plt.title(f'Missing value pattern — {name}')
    plt.show()
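
The matrix plot shows where values are missing but not how many. A numeric summary per column complements it. This is a minimal sketch with a hypothetical helper `missing_summary`; it uses only `DataFrame.isna` and does not depend on `missingno`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and fraction of missing values per column, worst column first.

    Columns without any missing value are left out.
    """
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "frac_missing": n_missing / len(df),
    })
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("frac_missing", ascending=False)


# Made-up frame for illustration only.
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [1.0, 2.0, 3.0, np.nan],
    "z": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

Unlike the plot, the summary can be computed on the full frame rather than on a 1000-row sample.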


## 8. Correlation analysis

In [None]:

for name, df in dfs.items():
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 1:
        corr = df[numeric_cols].corr()
        sns.heatmap(corr, center=0, square=True)
        plt.title(f'Correlation matrix for {name}')
        plt.show()
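
With many sensor columns the heatmap becomes hard to read. One option, sketched here with a hypothetical helper `top_correlated_pairs`, is to list the column pairs with the largest absolute correlation. The toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, n=10):
    """Return the `n` column pairs with the highest absolute Pearson correlation."""
    corr = df.select_dtypes(include=[np.number]).corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and self-correlations are excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exact multiple of "a"
    "c": [1.0, -1.0, 1.0, -1.0],
})
print(top_correlated_pairs(toy, n=3))
```

Pairs with a correlation close to 1 are candidates for dropping one of the two columns or for combining them into a single feature.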


---

### Next steps
* Engineer windowed features (mean, std, skew, etc.).
* Build cross-validation splits by `subject_id` or `series_id`.
* Implement baseline models (gradient boosting, CNNs, transformers).
* Optimise the custom metric used in the leaderboard.
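
The first two bullets can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes a long-format frame with a `series_id` column and an integer `step` column, as in section 6, and the helper names `window_features` and `assign_group_folds` are hypothetical.

```python
import numpy as np
import pandas as pd


def window_features(df, sensor_cols, window=50, group_col="series_id", step_col="step"):
    """Mean, std and skew of each sensor column over non-overlapping windows per series."""
    # Integer-divide the step counter to get a window number for every row.
    window_id = (df[step_col] // window).rename("window")
    feats = df.groupby([df[group_col], window_id])[sensor_cols].agg(["mean", "std", "skew"])
    # Flatten the (column, statistic) MultiIndex into names such as "acc_x_mean".
    feats.columns = [f"{col}_{stat}" for col, stat in feats.columns]
    return feats.reset_index()


def assign_group_folds(groups, n_splits=5, seed=0):
    """Give every row a fold number such that all rows of one group share a fold."""
    groups = pd.Series(groups).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(groups.unique())
    # Deal the shuffled groups out to the folds in round-robin order.
    fold_of_group = {g: i % n_splits for i, g in enumerate(shuffled)}
    return groups.map(fold_of_group).to_numpy()


# Made-up data: two series of ten steps each and one sensor column.
toy = pd.DataFrame({
    "series_id": ["a"] * 10 + ["b"] * 10,
    "step": list(range(10)) * 2,
    "acc_x": np.arange(20, dtype=float),
})
feats = window_features(toy, ["acc_x"], window=5)
feats["fold"] = assign_group_folds(feats["series_id"], n_splits=2)
print(feats)
```

Splitting by group keeps windows from the same series out of both the training and the validation side of a fold, which would otherwise inflate validation scores. scikit-learn's `GroupKFold` provides the same guarantee and is the more common choice when scikit-learn is available.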