# Predictive Analytics - Exploratory Data Analysis

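This notebook is a template for exploratory data analysis and predictive modeling built on the project's custom modules (`data_preprocessing`, `model_training`, `model_evaluation`, `prediction`). It runs end to end on a synthetic regression dataset until you point it at your own data.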

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading and Overview](#data-loading)
3. [Data Preprocessing](#preprocessing)
4. [Exploratory Data Analysis](#eda)
5. [Feature Engineering](#feature-engineering)
6. [Model Training](#model-training)
7. [Model Evaluation](#model-evaluation)
8. [Predictions](#predictions)
9. [Results and Conclusions](#results)

## 1. Setup and Imports <a id="setup"></a>

In [None]:
# Standard library imports
import sys
import os
import warnings
warnings.filterwarnings('ignore')  # hides all warnings for cleaner output; comment out when debugging

# Add the src directory to the import path (relative to the notebook's working directory)
sys.path.append('../src')

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Project modules
from data_preprocessing import DataPreprocessor
from model_training import ModelTrainer
from model_evaluation import ModelEvaluator
from prediction import Predictor

# Configuration
plt.style.use('seaborn-v0_8')  # requires matplotlib >= 3.6; older releases name this style 'seaborn'
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("All imports successful!")

## 2. Data Loading and Overview <a id="data-loading"></a>

In [None]:
# Load your dataset
# Replace 'your_dataset.csv' with your actual data file
data_path = '../data/raw/your_dataset.csv'

# Initialize preprocessor
preprocessor = DataPreprocessor()

# Uncomment when you have data
# df = preprocessor.load_data(data_path)
# print(f"Dataset shape: {df.shape}")
# df.head()

# For demonstration, let's create a sample dataset
np.random.seed(42)
n_samples = 1000

# Create synthetic data for demonstration
df = pd.DataFrame({
    'feature_1': np.random.normal(50, 15, n_samples),
    'feature_2': np.random.normal(100, 25, n_samples),
    'feature_3': np.random.exponential(2, n_samples),
    'feature_4': np.random.uniform(0, 10, n_samples),
    'category': np.random.choice(['A', 'B', 'C'], n_samples),
})

# Create the target as a noisy linear combination of feature_1..feature_3;
# feature_4 and category do not influence it
df['target'] = (0.5 * df['feature_1'] +
                0.3 * df['feature_2'] +
                0.2 * df['feature_3'] +
                np.random.normal(0, 10, n_samples))

print(f"Sample dataset shape: {df.shape}")
df.head()

In [None]:
# Basic information about the dataset
info = preprocessor.basic_info(df)
print("Dataset Information:")
print(f"Shape: {info['shape']}")
print(f"Columns: {info['columns']}")
print(f"Missing values: {info['missing_values']}")
print(f"Duplicates: {info['duplicates']}")
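
The keys read from `info` above suggest a summary of roughly the following form. This is a minimal sketch in plain pandas, assuming `basic_info` behaves similarly; `summarize_frame` and the `demo` frame are hypothetical names, and the project's actual implementation may differ.

```python
import numpy as np
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Hypothetical stand-in for DataPreprocessor.basic_info."""
    return {
        "shape": frame.shape,
        "columns": frame.columns.tolist(),
        # Count of missing cells per column
        "missing_values": frame.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicates": int(frame.duplicated().sum()),
    }


# Tiny illustrative frame: one missing value, one repeated row
demo = pd.DataFrame({"a": [1.0, np.nan, 1.0, 3.0], "b": ["x", "y", "x", "z"]})
summary = summarize_frame(demo)
print(summary)
```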

In [None]:
# Display basic statistics
df.describe()

## 3. Data Preprocessing <a id="preprocessing"></a>

In [None]:
# Handle missing values if any
df_processed = preprocessor.handle_missing_values(df)

# Encode categorical features
df_processed = preprocessor.encode_categorical_features(df_processed)

print("Data preprocessing completed!")
print(f"Processed data shape: {df_processed.shape}")
df_processed.head()
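
`handle_missing_values` and `encode_categorical_features` are project methods whose strategies are not shown in this notebook. One common approach is sketched below, under the assumption of median/mode imputation followed by one-hot encoding; the `raw` frame and all variable names are illustrative only.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, 4.0],
    "category": ["A", "B", None, "A"],
})

numeric_cols = raw.select_dtypes(include="number").columns
categorical_cols = raw.columns.difference(numeric_cols)

filled = raw.copy()
# Numeric gaps: fill with the column median
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
# Categorical gaps: fill with the most frequent value
for col in categorical_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

# One indicator column per category level
encoded = pd.get_dummies(filled, columns=list(categorical_cols), dtype=int)
print(encoded)
```

With real data, imputation statistics are best learned from the training split only, so a step like this would normally run after the train/test split rather than before it.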

## 4. Exploratory Data Analysis <a id="eda"></a>

In [None]:
# Distribution of target variable
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(df['target'], bins=30, alpha=0.7, edgecolor='black')
plt.title('Distribution of Target Variable')
plt.xlabel('Target')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.boxplot(df['target'])
plt.title('Box Plot of Target Variable')
plt.ylabel('Target')

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix
numeric_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

In [None]:
# Scatter plots of each numeric feature against the target (first four features)
feature_cols = [col for col in numeric_cols if col != 'target']

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for i, feature in enumerate(feature_cols[:4]):
    axes[i].scatter(df[feature], df['target'], alpha=0.6)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Target')
    axes[i].set_title(f'{feature} vs Target')
    
    # Add trend line
    z = np.polyfit(df[feature], df['target'], 1)
    p = np.poly1d(z)
    axes[i].plot(df[feature], p(df[feature]), "r--", alpha=0.8)

plt.tight_layout()
plt.show()
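
The dashed trend lines above come from a degree-1 `np.polyfit`. A minimal sketch on noise-free toy data (hypothetical values, not the notebook's dataset) shows what the call returns and how `np.poly1d` turns the coefficients into a callable:

```python
import numpy as np

# Noise-free toy data so the fitted line is known exactly: y = 0.5 * x + 2
x = np.linspace(0.0, 10.0, 50)
y = 0.5 * x + 2.0

# Degree-1 polyfit returns coefficients highest power first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
trend = np.poly1d([slope, intercept])

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"trend(4.0)={trend(4.0):.3f}")
```

`np.polyfit` does not skip missing values, so with real data it is safer to fit on rows where both the feature and the target are non-null.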

## 5. Feature Engineering <a id="feature-engineering"></a>

In [None]:
# Create additional features
df_engineered = preprocessor.create_features(df_processed)

print(f"Columns before feature engineering: {df_processed.shape[1]}")
print(f"Columns after feature engineering: {df_engineered.shape[1]}")
print(f"New features created: {df_engineered.shape[1] - df_processed.shape[1]}")

# Display new feature names
new_features = [col for col in df_engineered.columns if col not in df_processed.columns]
print(f"\nNew features (first 10): {new_features[:10]}")

## 6. Model Training <a id="model-training"></a>

In [None]:
# Split the data
X_train, X_test, y_train, y_test = preprocessor.split_data(df_engineered, 'target', test_size=0.2)

# Scale features
X_train_scaled = preprocessor.scale_features(X_train, fit=True)
X_test_scaled = preprocessor.scale_features(X_test, fit=False)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
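
`split_data` and `scale_features` are project methods; the `fit=True` / `fit=False` arguments suggest the scaler is fitted on the training rows and then reused on the test rows. A minimal sketch of that pattern with scikit-learn, assuming the project wraps something similar (the synthetic data and all names below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=15.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and unit variance, while the test columns are only approximately centered. That small mismatch is expected: refitting the scaler on the test rows would leak test-set statistics into the pipeline.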

In [None]:
# Initialize model trainer
trainer = ModelTrainer(random_state=42)

# Train multiple models
trained_models = trainer.train_multiple_models(X_train_scaled, y_train)

print(f"Trained {len(trained_models)} models:")
for model_name in trained_models.keys():
    print(f"- {model_name}")

## 7. Model Evaluation <a id="model-evaluation"></a>

In [None]:
# Initialize evaluator
evaluator = ModelEvaluator()

# Evaluate all models
comparison_df = evaluator.evaluate_multiple_models(trained_models, X_test_scaled, y_test)

# Display results
print("Model Performance Comparison:")
comparison_df.round(4)

In [None]:
# Find the best model
best_model_name, best_model, best_score = trainer.find_best_model(X_test_scaled, y_test, metric='r2')
print(f"Best model: {best_model_name} with R² score: {best_score:.4f}")
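
The call above ranks models on the test set, so the winning R² is likely to be somewhat optimistic. A common alternative is to choose the model by cross-validation on the training data and touch the test set only once at the end. The sketch below uses scikit-learn directly on synthetic data; it is an illustration under those assumptions, not the behavior of `find_best_model`, and the candidate models are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([0.5, 0.3, 0.2, 0.0]) + rng.normal(scale=0.5, size=300)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean 5-fold R² computed on the training data only
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
selected = max(cv_scores, key=cv_scores.get)

for name, score in cv_scores.items():
    print(f"{name}: mean CV R² = {score:.3f}")
print(f"Selected: {selected}")
```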

In [None]:
# Plot model comparison
evaluator.plot_model_comparison()

In [None]:
# Detailed analysis of best model
evaluator.plot_predictions_vs_actual(best_model_name)

In [None]:
# Residuals analysis
evaluator.plot_residuals(best_model_name)

In [None]:
# Feature importance (if supported)
if hasattr(best_model, 'feature_importances_'):
    evaluator.feature_importance_plot(best_model, X_train_scaled.columns.tolist(), best_model_name)
else:
    print(f"{best_model_name} doesn't support feature importance visualization")

## 8. Predictions <a id="predictions"></a>

In [None]:
# Initialize predictor with best model
predictor = Predictor(model=best_model)

# Make predictions
predictions = predictor.predict(X_test_scaled)

print(f"Made predictions for {len(predictions)} samples")
print(f"Sample predictions: {predictions[:5]}")

In [None]:
# Single prediction example
sample_features = X_test_scaled.iloc[0].tolist()
single_prediction = predictor.predict_single(sample_features)

print(f"Single prediction: {single_prediction:.2f}")
print(f"Actual value: {y_test.iloc[0]:.2f}")
print(f"Difference: {abs(single_prediction - y_test.iloc[0]):.2f}")

In [None]:
# Predictions with confidence intervals (for ensemble models)
if hasattr(best_model, 'estimators_'):
    pred, lower, upper = predictor.predict_with_confidence(X_test_scaled.iloc[:10])
    
    # Visualize confidence intervals
    plt.figure(figsize=(12, 6))
    x_range = range(len(pred))
    
    plt.plot(x_range, y_test.iloc[:10].values, 'o-', label='Actual', markersize=8)
    plt.plot(x_range, pred, 's-', label='Predicted', markersize=8)
    plt.fill_between(x_range, lower, upper, alpha=0.3, label='95% Confidence Interval')
    
    plt.xlabel('Sample Index')
    plt.ylabel('Target Value')
    plt.title('Predictions with Confidence Intervals')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print(f"{best_model_name} doesn't support confidence intervals")
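
How `predict_with_confidence` derives its bounds is not shown in this notebook. For a random forest, one common heuristic is to take percentiles of the individual trees' predictions; the sketch below illustrates that idea on synthetic data and is an assumption, not the project's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(300, 3))
y_fit = X_fit @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.5, size=300)
X_new = rng.normal(size=(10, 3))

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_fit, y_fit)

# One row per tree, one column per sample
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

center = per_tree.mean(axis=0)  # matches forest.predict(X_new)
tree_lower, tree_upper = np.percentile(per_tree, [2.5, 97.5], axis=0)

print(np.column_stack([tree_lower, center, tree_upper]).round(2))
```

A band built this way reflects disagreement between trees rather than a calibrated 95% interval, so it is best read as a rough uncertainty indicator. Note also that `hasattr(model, 'estimators_')` is true for scikit-learn's gradient boosting models, whose stage-wise trees are not independent predictors, so this percentile approach does not carry over to them.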

## 9. Results and Conclusions <a id="results"></a>

In [None]:
# Generate evaluation report
report = evaluator.generate_evaluation_report()
print(report)

In [None]:
# Save the best model
model_save_path = '../models/best_model.pkl'
trainer.save_model(best_model, model_save_path)
print(f"Best model saved to {model_save_path}")
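
The `.pkl` extension suggests the model is serialized with `pickle` or `joblib`, though `save_model` itself is not shown here. A minimal round-trip sketch with the standard library, assuming plain pickle (the file name and toy model are hypothetical):

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.0, -2.0])

model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    with path.open("rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(f"Round-trip predictions match: {same}")
```

Because the models in this notebook are trained on scaled features, the fitted scaler needs to be saved alongside the model so new data can be transformed the same way. Pickle files should only be loaded from trusted sources.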

### Key Findings:

1. **Best Performing Model**: [To be filled based on results]
2. **Performance Metrics**: [To be filled based on results]
3. **Important Features**: [To be filled based on results]
4. **Model Insights**: [To be filled based on results]

### Next Steps:

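1. **Hyperparameter Tuning**: Search the best model's key hyperparameters with a cross-validated grid or random search
2. **Feature Selection**: Drop features with negligible importance or near-zero correlation with the target
3. **Cross-Validation**: Replace the single train/test split with k-fold cross-validation for more stable estimates
4. **Deployment**: Package the saved model together with the fitted preprocessing steps for production use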
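
As a starting point for the tuning and cross-validation steps, the sketch below runs a small cross-validated grid search with scikit-learn. The estimator (`Ridge`), the parameter grid, and the synthetic data are placeholders chosen for illustration; swap in the best model from this notebook and a grid that suits it.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([0.5, 0.3, 0.2, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=KFold(n_splits=5, shuffle=True, random_state=7),
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(f"Best alpha: {search.best_params_['alpha']}")
print(f"Mean CV R²: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the model refitted on all of the supplied data with the winning parameters.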

### Recommendations:

- [Add specific recommendations based on your analysis]
- [Include business insights if applicable]
- [Suggest improvements for model performance]