# Stage 6: Data Preprocessing Homework

**Objective**: Apply modular cleaning functions to raw stock market data, document assumptions, and compare original vs cleaned datasets.

## Table of Contents
1. [Data Loading](#data-loading)
2. [Initial Data Exploration](#initial-exploration)
3. [Data Cleaning](#data-cleaning)
4. [Data Comparison](#data-comparison)
5. [Save Processed Data](#save-processed-data)
6. [Assumptions and Tradeoffs](#assumptions-and-tradeoffs)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add src directory to path to import our cleaning functions
sys.path.append('../src')
from cleaning import fill_missing_median, drop_missing, normalize_data, get_data_summary, print_cleaning_report

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 1. Data Loading {#data-loading}

Load the raw stock market dataset and perform initial inspection.
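
One detail worth checking at load time: `pd.read_csv` keeps a date column as plain strings unless it is asked to parse it. The following is a minimal sketch, assuming the file has a `date` column (the name comes from the critical-column list used later in this notebook). The in-memory CSV and its values are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for stock_data_raw.csv; the real file's columns may differ.
sample_csv = io.StringIO(
    "date,symbol,close\n"
    "2024-01-02,AAA,10.0\n"
    "2024-01-03,AAA,10.5\n"
)

# parse_dates converts the named column to datetime64 while reading
df_sample = pd.read_csv(sample_csv, parse_dates=["date"])
print(df_sample.dtypes)
```

With a datetime column, sorting by date and time-based indexing behave as expected, which matters for any rolling indicator.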

In [None]:
# Load the raw dataset
raw_data_path = '../homework/data/raw/stock_data_raw.csv'
df_original = pd.read_csv(raw_data_path)

print("Dataset loaded successfully!")
print(f"Shape: {df_original.shape}")
print(f"\nColumns: {list(df_original.columns)}")

## 2. Initial Data Exploration {#initial-exploration}

Examine the structure, data types, and missing values in the original dataset.

In [None]:
# Display basic information about the dataset
print("=== DATASET OVERVIEW ===")
print(f"Shape: {df_original.shape}")
print(f"Memory usage: {df_original.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\n=== DATA TYPES ===")
print(df_original.dtypes)
print("\n=== FIRST 5 ROWS ===")
df_original.head()

In [None]:
# Check for missing values
print("=== MISSING VALUES ANALYSIS ===")
missing_values = df_original.isnull().sum()
missing_percentage = (missing_values / len(df_original)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values found!")

In [None]:
# Statistical summary of numerical columns
print("=== STATISTICAL SUMMARY ===")
df_original.describe()

In [None]:
# Visualize missing values pattern
if df_original.isnull().sum().sum() > 0:
    plt.figure(figsize=(12, 6))
    
    # Missing values heatmap
    plt.subplot(1, 2, 1)
    sns.heatmap(df_original.isnull(), cbar=True, yticklabels=False, cmap='viridis')
    plt.title('Missing Values Pattern')
    plt.xlabel('Columns')
    plt.xticks(rotation=45)
    
    # Missing values bar chart
    plt.subplot(1, 2, 2)
    missing_counts = df_original.isnull().sum()
    missing_counts = missing_counts[missing_counts > 0]
    if len(missing_counts) > 0:
        missing_counts.plot(kind='bar')
        plt.title('Missing Values Count by Column')
        plt.xlabel('Columns')
        plt.ylabel('Missing Count')
        plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()
else:
    print("No missing values to visualize.")

## 3. Data Cleaning {#data-cleaning}

Apply our modular cleaning functions to handle missing values and normalize the data.

### 3.1 Handle Missing Values

**Strategy**: 
- Fill missing values in technical indicators (volatility_20d, sma_20, sma_50) with median values
- These columns likely have missing values at the beginning of the time series due to insufficient historical data for calculation
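
The source of `fill_missing_median` lives in `src/cleaning.py` and is not shown here, so the following is only a sketch of what median filling could look like. The function name `fill_missing_median_sketch`, the toy `sma_3` column, and the data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fill_missing_median_sketch(df, columns):
    """Return a copy of df with NaNs in `columns` replaced by each column's median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy series: a 3-day moving average is undefined for the first two rows
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0]})
toy["sma_3"] = toy["close"].rolling(3).mean()

filled = fill_missing_median_sketch(toy, ["sma_3"])
print(filled)
```

In this toy case the two leading gaps receive the median of the later values, which shows the main cost of the approach: early rows are filled with information from later in the series.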

In [None]:
# Step 1: Fill missing values with median for technical indicators
print("=== STEP 1: FILLING MISSING VALUES ===")

# Identify columns with missing values
columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()
print(f"Columns with missing values: {columns_with_missing}")

# Apply median filling to technical indicator columns
technical_columns = ['volatility_20d', 'sma_20', 'sma_50']
existing_technical_cols = [col for col in technical_columns if col in df_original.columns]

if existing_technical_cols:
    df_step1 = fill_missing_median(df_original, columns=existing_technical_cols)
    print(f"\nApplied median filling to: {existing_technical_cols}")
else:
    df_step1 = df_original.copy()
    print("No technical indicator columns found for median filling.")

### 3.2 Drop Remaining Missing Values

**Strategy**: 
- Drop any remaining rows with missing values in critical columns
- Focus on core price and volume data which should be complete
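
As a rough illustration of the idea, dropping rows only when a critical column is missing can be written with `DataFrame.dropna(subset=...)`. The helper name `drop_missing_sketch` and the toy data are hypothetical, and the real `drop_missing` in `src/cleaning.py` may behave differently.

```python
import numpy as np
import pandas as pd

def drop_missing_sketch(df, columns):
    """Return a copy of df without rows that have NaN in any of `columns`."""
    return df.dropna(subset=columns).reset_index(drop=True)

toy = pd.DataFrame({
    "close": [10.0, np.nan, 12.0],
    "volume": [100.0, 200.0, 300.0],
    "sma_20": [np.nan, 10.5, 11.0],  # not critical, so its NaN is tolerated
})

kept = drop_missing_sketch(toy, ["close", "volume"])
print(kept)
```

Restricting the check to a subset keeps rows whose only gaps are in non-critical columns.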

In [None]:
# Step 2: Drop rows with missing values in critical columns
print("=== STEP 2: DROPPING MISSING VALUES ===")

# Define critical columns that must have complete data
critical_columns = ['date', 'symbol', 'open', 'high', 'low', 'close', 'volume']
existing_critical_cols = [col for col in critical_columns if col in df_step1.columns]

print(f"Critical columns to check: {existing_critical_cols}")

# Check if there are missing values in critical columns
critical_missing = df_step1[existing_critical_cols].isnull().sum().sum()
print(f"Missing values in critical columns: {critical_missing}")

if critical_missing > 0:
    df_step2 = drop_missing(df_step1, columns=existing_critical_cols)
else:
    df_step2 = df_step1.copy()
    print("No missing values in critical columns - no rows dropped.")

### 3.3 Normalize Numerical Data

**Strategy**: 
- Apply standard scaling (z-score normalization) to price and volume data
- Exclude date and symbol, which are not numeric features, and leave the technical indicators on their original scale
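
For reference, z-score scaling replaces each value x with (x - mean) / std. The sketch below uses plain pandas rather than the notebook's `normalize_data`, whose implementation is not shown. The name `zscore_sketch` and the toy values are assumptions. It uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows.

```python
import numpy as np
import pandas as pd

def zscore_sketch(df, columns):
    """Scale `columns` to zero mean and unit variance; also return the fitted stats."""
    out = df.copy()
    mean = out[columns].mean()
    std = out[columns].std(ddof=0)  # population std, as StandardScaler uses
    out[columns] = (out[columns] - mean) / std
    return out, mean, std

toy = pd.DataFrame({"close": [10.0, 12.0, 14.0], "volume": [100.0, 300.0, 200.0]})
scaled, mean, std = zscore_sketch(toy, ["close", "volume"])

# Keeping mean and std makes the transform reversible
restored = scaled[["close", "volume"]] * std + mean
print(scaled.round(3))
```

Returning the fitted statistics (or a fitted scaler object) is what lets the original units be recovered later. A constant column would have a standard deviation of zero and needs separate handling.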

In [None]:
# Step 3: Normalize numerical columns
print("=== STEP 3: NORMALIZING DATA ===")

# Define columns to normalize (price and volume data)
normalize_columns = ['open', 'high', 'low', 'close', 'volume']
existing_normalize_cols = [col for col in normalize_columns if col in df_step2.columns]

print(f"Columns to normalize: {existing_normalize_cols}")

if existing_normalize_cols:
    df_cleaned, scaler = normalize_data(df_step2, columns=existing_normalize_cols, method='standard')
    print("\nNormalization completed using standard scaling.")
else:
    df_cleaned = df_step2.copy()
    scaler = None
    print("No columns found for normalization.")

## 4. Data Comparison {#data-comparison}

Compare the original and cleaned datasets to understand the impact of our cleaning operations.
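
The comparison boils down to row counts and per-column missing counts before and after cleaning. The sketch below shows one way to collect them. `compare_frames` is a hypothetical helper written for illustration and is not the `print_cleaning_report` function imported above.

```python
import numpy as np
import pandas as pd

def compare_frames(before, after):
    """Summarize row counts and per-column missing values before and after cleaning."""
    per_column = pd.DataFrame({
        "missing_before": before.isnull().sum(),
        "missing_after": after.isnull().sum(),
    })
    totals = {
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_removed": len(before) - len(after),
    }
    return totals, per_column

# Toy data: one row is dropped, one gap is filled
before = pd.DataFrame({"close": [1.0, np.nan, 3.0], "sma_20": [np.nan, 2.0, 3.0]})
after = before.dropna(subset=["close"]).fillna({"sma_20": 2.5})

totals, per_column = compare_frames(before, after)
print(totals)
print(per_column)
```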

In [None]:
# Generate comprehensive cleaning report
print_cleaning_report(df_original, df_cleaned)

In [None]:
# Compare statistical summaries
print("=== ORIGINAL DATA SUMMARY ===")
print(df_original.describe())

print("\n=== CLEANED DATA SUMMARY ===")
print(df_cleaned.describe())

In [None]:
# Visualize the impact of normalization
if existing_normalize_cols:
    # squeeze=False keeps `axes` 2-D even when only one column is plotted
    fig, axes = plt.subplots(2, len(existing_normalize_cols), figsize=(15, 8), squeeze=False)
    
    for i, col in enumerate(existing_normalize_cols):
        # Original data distribution
        axes[0, i].hist(df_original[col].dropna(), bins=30, alpha=0.7, color='blue')
        axes[0, i].set_title(f'Original {col}')
        axes[0, i].set_ylabel('Frequency')
        
        # Normalized data distribution
        axes[1, i].hist(df_cleaned[col].dropna(), bins=30, alpha=0.7, color='red')
        axes[1, i].set_title(f'Normalized {col}')
        axes[1, i].set_ylabel('Frequency')
        axes[1, i].set_xlabel('Value')
    
    plt.tight_layout()
    plt.suptitle('Distribution Comparison: Original vs Normalized Data', y=1.02)
    plt.show()
else:
    print("No normalized columns to visualize.")

In [None]:
# Compare missing values before and after cleaning
print("=== MISSING VALUES COMPARISON ===")
comparison_df = pd.DataFrame({
    'Original_Missing': df_original.isnull().sum(),
    'Cleaned_Missing': df_cleaned.isnull().sum()
})
comparison_df['Difference'] = comparison_df['Original_Missing'] - comparison_df['Cleaned_Missing']
comparison_df = comparison_df[comparison_df['Original_Missing'] > 0]

if len(comparison_df) > 0:
    print(comparison_df)
else:
    print("No missing values in either dataset.")

## 5. Save Processed Data {#save-processed-data}

Save the cleaned dataset to the processed data directory.

In [None]:
# Ensure the processed data directory exists
processed_dir = '../homework/data/processed/'
os.makedirs(processed_dir, exist_ok=True)

# Save the cleaned dataset
output_path = os.path.join(processed_dir, 'stock_data_cleaned.csv')
df_cleaned.to_csv(output_path, index=False)

print(f"Cleaned dataset saved to: {output_path}")
print(f"File size: {os.path.getsize(output_path) / 1024:.2f} KB")

# Also save a summary of the cleaning process
summary_path = os.path.join(processed_dir, 'cleaning_summary.txt')
with open(summary_path, 'w') as f:
    f.write("DATA CLEANING SUMMARY\n")
    f.write("=" * 50 + "\n\n")
    f.write(f"Original dataset shape: {df_original.shape}\n")
    f.write(f"Cleaned dataset shape: {df_cleaned.shape}\n")
    f.write(f"Rows removed: {df_original.shape[0] - df_cleaned.shape[0]}\n")
    f.write(f"Missing values removed: {df_original.isnull().sum().sum() - df_cleaned.isnull().sum().sum()}\n\n")
    
    f.write("CLEANING OPERATIONS PERFORMED:\n")
    f.write("1. Filled missing values in technical indicators with median\n")
    f.write("2. Dropped rows with missing values in critical columns\n")
    f.write("3. Normalized price and volume data using standard scaling\n\n")
    
    if existing_normalize_cols:
        f.write(f"Normalized columns: {existing_normalize_cols}\n")
    if existing_technical_cols:
        f.write(f"Median-filled columns: {existing_technical_cols}\n")

print(f"Cleaning summary saved to: {summary_path}")

## 6. Assumptions and Tradeoffs {#assumptions-and-tradeoffs}

### Key Assumptions Made:

1. **Missing Value Imputation**:
   - **Assumption**: Missing values in technical indicators (volatility_20d, sma_20, sma_50) are due to insufficient historical data at the beginning of the time series
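   - **Rationale**: Judging by their names, these rolling indicators need a full window of history before the first value can be computed (20 days for volatility_20d and sma_20, 50 days for sma_50)
   - **Tradeoff**: Median imputation preserves more rows than dropping, but it fills early rows with a value computed from the whole series, which can bias those rows and leaks later information into them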

2. **Data Completeness**:
   - **Assumption**: Core price data (OHLCV) should be complete and accurate
   - **Rationale**: These are fundamental market data points that should always be available
   - **Tradeoff**: Dropping incomplete records reduces dataset size but ensures data quality

3. **Normalization Strategy**:
   - **Assumption**: Price and volume data follow approximately normal distributions
   - **Rationale**: Standard scaling works well for normally distributed data
   - **Tradeoff**: Loses original scale interpretation but enables better model performance

4. **Feature Selection**:
   - **Assumption**: All numerical columns are relevant for analysis
   - **Rationale**: Stock market data typically has interconnected features
   - **Tradeoff**: May include noisy features but preserves potentially useful information
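
The normality assumption in item 3 can be checked rather than taken on faith. Here is a small sketch using sample skewness on synthetic data. The column names and distributions are made up for illustration and say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins, not real market data
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=100.0, scale=5.0, size=5000),
    "right_skewed": rng.lognormal(mean=10.0, sigma=1.0, size=5000),
})

# Skewness near 0 suggests symmetry; large positive values indicate a long right tail
skew = toy.skew()

# A log transform is one common way to reduce right skew before scaling
log_skew = np.log(toy["right_skewed"]).skew()

print(skew.round(2))
print(round(log_skew, 2))
```

Z-scoring is a linear transform, so it does not change a column's skewness. If a column such as volume turns out to be strongly skewed, a log transform before scaling may be worth considering.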

### Alternative Approaches Considered:

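1. **Forward-fill imputation** for technical indicators (would preserve trends, but cannot fill gaps at the very start of a series because there is no earlier value to carry forward)
2. **Min-max scaling** instead of standard scaling (would map each column to a fixed [0, 1] range, but is more sensitive to outliers)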
3. **Complete case analysis** (dropping all rows with any missing values)
4. **Interpolation methods** for time series data
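
To make these alternatives concrete, the sketch below applies forward-fill, linear interpolation, and min-max scaling to a toy series with a leading gap. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3.0, 4.0, np.nan, 6.0])

forward = s.ffill()             # carries the last seen value forward
interpolated = s.interpolate()  # linear interpolation between known points
both = s.ffill().bfill()        # back-fill afterwards to cover the leading gap

# Min-max scaling maps the observed range onto [0, 1]
known = s.dropna()
min_max = (known - known.min()) / (known.max() - known.min())

print(pd.DataFrame({"raw": s, "ffill": forward, "interp": interpolated, "ffill+bfill": both}))
print(min_max)
```

With default settings, neither forward-fill nor interpolation fills the two leading gaps, which is the pattern assumed for the technical indicators. Covering them requires a back-fill or a different strategy such as the median fill used here.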

### Impact Assessment:

- **Data Loss**: Minimal rows removed, preserving dataset size
- **Information Preservation**: Technical indicators maintained through imputation
- **Model Readiness**: Normalized features ready for machine learning algorithms
- **Reproducibility**: All operations documented and parameterized in reusable functions

In [None]:
# Final validation of cleaned dataset
print("=== FINAL VALIDATION ===")
print(f"✓ Dataset shape: {df_cleaned.shape}")
print(f"✓ Missing values: {df_cleaned.isnull().sum().sum()}")
print(f"✓ Data types consistent: {df_cleaned.dtypes.nunique()} unique types")
print(f"✓ No infinite values: {np.isinf(df_cleaned.select_dtypes(include=[np.number])).sum().sum() == 0}")

# Check for any remaining data quality issues
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    print("✓ Numeric columns range check:")
    for col in numeric_cols[:5]:  # Show first 5 numeric columns
        min_val, max_val = df_cleaned[col].min(), df_cleaned[col].max()
        print(f"  {col}: [{min_val:.4f}, {max_val:.4f}]")

print("\n🎉 Data preprocessing completed successfully!")
print(f"📁 Cleaned dataset saved to {output_path}")