# High-Dimensional Data Analysis

**Course**: Computational Data Analysis (02582)  
**Case**: Case 1 - The High-Dimensional Standoff

## Overview

This notebook walks through an end-to-end workflow for exploring and modeling a high-dimensional dataset: loading, exploratory analysis, preprocessing, PCA, and a Random Forest classifier.

### Objectives

1. Load and explore the dataset
2. Perform exploratory data analysis (EDA)
3. Preprocess and clean the data
4. Apply dimensionality reduction techniques
5. Build and evaluate machine learning models
6. Visualize results and draw conclusions

## Setup

Import required libraries and configure settings.

In [None]:
# Standard library imports
import os
import sys
from pathlib import Path

# Third-party imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

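# Local imports (custom modules in ../src); not needed for the demo data below
sys.path.append(str(Path.cwd().parent / 'src'))
try:
    from data_processing import load_data, preprocess_data
    from visualization import plot_correlation_matrix, plot_pca_variance
    from models import train_model, evaluate_model
except ImportError as exc:
    print(f"Custom modules not found ({exc}); continuing without them")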

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

print("✓ All libraries imported successfully")

## 1. Data Loading

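Load the dataset from the data directory. The `load_data` call is commented out, so the cell generates a synthetic placeholder dataset instead.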

In [None]:
# Define data paths
DATA_DIR = Path.cwd().parent / 'data'
RAW_DATA_PATH = DATA_DIR / 'raw' / 'dataset.csv'
PROCESSED_DATA_PATH = DATA_DIR / 'processed' / 'cleaned_data.csv'

# Load data
# df = load_data(RAW_DATA_PATH)

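# For demonstration, create a synthetic dataset of pure noise.
# The target is drawn independently of the features, so any model
# trained on it should score near chance (about 50% accuracy).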
n_samples = 1000
n_features = 50
df = pd.DataFrame(
    np.random.randn(n_samples, n_features),
    columns=[f'feature_{i}' for i in range(n_features)]
)
df['target'] = np.random.choice([0, 1], size=n_samples)

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()
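
The placeholder dataset draws the target independently of the features, so no model can beat chance on it. If a demo dataset with learnable structure is preferred, scikit-learn's `make_classification` is one option. The values `n_informative=5` and `n_redundant=10` and the name `df_demo` are illustrative assumptions, not part of the original case.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Illustrative assumption: 5 informative and 10 redundant features out of 50
n_samples, n_features = 1000, 50
X_demo, y_demo = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=5,
    n_redundant=10,
    random_state=42,
)

# Same column layout as the placeholder: feature_0 ... feature_49 plus target
df_demo = pd.DataFrame(
    X_demo, columns=[f'feature_{i}' for i in range(n_features)]
)
df_demo['target'] = y_demo

print(f"Dataset shape: {df_demo.shape}")
```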

## 2. Exploratory Data Analysis

Explore the dataset structure, distributions, and relationships.

In [None]:
# Basic statistics
print("Dataset Information:")
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes.value_counts()}")
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"\nTarget distribution:\n{df['target'].value_counts()}")

# Summary statistics
df.describe()

In [None]:
# Visualize the distributions of the first four features
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('Feature Distributions (first four features)', fontsize=16)

feature_cols = df.columns.drop('target')
for ax, col in zip(axes.flat, feature_cols):
    df[col].hist(bins=30, ax=ax, edgecolor='black')
    ax.set_title(col)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

## 3. Data Preprocessing

Clean and prepare the data for modeling.

In [None]:
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features (fit on the training set only, then reuse on the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print("✓ Data preprocessing complete")
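
To make the fit/transform split concrete, the following sketch reproduces the standardization by hand with NumPy: the mean and standard deviation come from the training rows only and are then reused for the test rows. The arrays `train` and `test` are made-up stand-ins. The result should match `StandardScaler` with default settings, which also uses the population standard deviation (`ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(loc=5.0, scale=2.0, size=(800, 3))  # stand-in for X_train
test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))   # stand-in for X_test

# Statistics are estimated on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, the population standard deviation

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(train_scaled.std(axis=0).round(6))   # approximately 1 per column
print(test_scaled.mean(axis=0).round(3))   # near 0, but generally not exactly 0
```

The test set is never used to estimate `mean` or `std`, which is why its scaled columns are only approximately centered.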

## 4. Dimensionality Reduction

Apply PCA to the standardized features, keeping enough components to explain 95% of the variance, and plot the cumulative explained variance.

In [None]:
# Apply PCA
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(f"Original number of features: {X_train.shape[1]}")
print(f"Reduced number of features: {X_train_pca.shape[1]}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.2%}")

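# Plot cumulative variance explained (x-axis starts at 1 component, not 0)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(10, 5))
plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA - Cumulative Explained Variance')
plt.grid(True)
plt.show()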
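
Passing a float to `n_components` asks PCA for the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below works through that selection rule with a plain NumPy SVD on made-up correlated data. The names `data`, `threshold`, and `k` are illustrative, and this is a simplified illustration of the idea rather than scikit-learn's exact code path.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
data = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# PCA via SVD of the centered data
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest k whose cumulative explained variance reaches the threshold
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

print(f"Components needed for {threshold:.0%} variance: {k}")
print(f"Variance explained by {k} components: {cumulative[k - 1]:.2%}")
```

On data with a few dominant directions, `k` is much smaller than the number of features. On isotropic noise like the placeholder dataset, nearly all components are needed.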

## 5. Model Training

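Train a Random Forest classifier on the standardized features. The PCA components from the previous section are not used here, so the feature importances reported later refer to the original columns.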

In [None]:
# Train Random Forest classifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)

model.fit(X_train_scaled, y_train)
print("✓ Model training complete")

## 6. Model Evaluation

Evaluate model performance on the held-out test set.

In [None]:
# Make predictions
y_pred = model.predict(X_test_scaled)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
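
The per-class numbers in the classification report can be read directly off the confusion matrix. The sketch below uses a made-up 2x2 matrix (not results from this notebook) in scikit-learn's layout, where rows are true labels and columns are predicted labels, and derives accuracy, precision, recall, and F1 for the positive class.

```python
import numpy as np

# Made-up counts for illustration; rows = true label, columns = predicted label
cm_example = np.array([[50, 10],
                       [5, 35]])
tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```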

In [None]:
# Impurity-based feature importances from the Random Forest
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 20 features
plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(20), x='importance', y='feature')
plt.title('Top 20 Most Important Features')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

## 7. Conclusions

### Key Findings

1. **Data characteristics**: [Describe the main characteristics of your data]
2. **Model performance**: [Summarize model performance metrics]
3. **Important features**: [Discuss the most important features]
4. **Recommendations**: [Provide recommendations based on findings]

### Next Steps

- Try additional feature engineering
- Experiment with different models
- Perform hyperparameter tuning
- Collect more data if needed
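
As a starting point for the hyperparameter tuning step, the sketch below wraps scaling and the classifier in a `Pipeline` and searches a small grid with cross-validation. The synthetic data from `make_classification` and the grid values are illustrative assumptions. A real search for this case would use the actual training split and likely a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Scaling inside the pipeline is refit on each training fold, avoiding leakage
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Small illustrative grid; keys follow the '<step>__<param>' naming convention
param_grid = {
    'model__max_depth': [3, 10, None],
    'model__min_samples_leaf': [1, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Swapping the `'model'` step for another estimator is one way to cover the "different models" item with the same search setup.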

---

**End of Analysis**