# Data Exploration Notebook

## Assignment Task 1: Dataset Summary
## Assignment Task 2: Data Exploration Plan

This notebook covers the initial exploration of the dataset, including:
- Loading and examining the dataset
- Creating a comprehensive dataset summary
- Developing a data exploration plan
- Initial data quality assessment

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import sys
import os

# Add src directory to path
sys.path.append('../src')

# Import our custom modules
from data.data_loader import DataLoader
from analysis.eda import ExploratoryDataAnalysis
from visualization.plots import DataVisualizer

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 1. Dataset Loading and Initial Exploration
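
The custom `DataLoader` class is not shown in this notebook, so its internals are unknown. As a rough stand-in, a loader that picks a reader by file extension for the three formats named in the cell below (CSV, Excel, JSON) might look like the following minimal sketch. The function name `load_table` and the `_READERS` mapping are hypothetical and use only standard pandas readers; they are not the project's actual implementation.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a standard pandas reader.
# This is an illustration, not the project's DataLoader.
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
}


def load_table(path):
    """Load a tabular file into a DataFrame, choosing the reader by extension."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"No such file: {path}")
    reader = _READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported file format: {path.suffix!r}")
    return reader(path)
```

Raising `FileNotFoundError` for a missing path keeps the sketch compatible with the `except FileNotFoundError` handler used in the loading cell.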

In [None]:
# Initialize data loader
loader = DataLoader('../data/raw')

# Load dataset (replace 'your_dataset.csv' with the actual filename)
try:
    # Example: loader.load_dataset('your_dataset.csv')
    print("Please load your dataset using loader.load_dataset('filename.csv')")
    print("Supported formats: CSV, Excel, JSON")
    
    # For demonstration, create a synthetic dataset (seeded so reruns give the same data)
    rng = np.random.default_rng(42)
    n_rows = 1000
    sample_data = {
        'id': range(1, n_rows + 1),
        'age': rng.normal(35, 10, n_rows),
        'income': rng.lognormal(10, 1, n_rows),
        'education': rng.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_rows),
        'city': rng.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_rows),
        'target': rng.choice([0, 1], n_rows, p=[0.7, 0.3])
    }
    
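    dataset = pd.DataFrame(sample_data)
    loader.dataset = dataset
    # The sample data bypasses load_dataset(), so the loader's metadata is refreshed
    # manually through its private helper (presumably what load_dataset() does internally).
    loader._extract_dataset_info()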
    
    print("Sample dataset created for demonstration purposes.")
    print("Replace this with your actual dataset loading.")
    
except FileNotFoundError as e:
    # Only reachable once the loader.load_dataset(...) call above is enabled
    print(f"Dataset not found: {e}")
    print("Please ensure your dataset is in the data/raw directory")

## 2. Dataset Summary (Task 1)

This section summarizes the dataset: its size, memory usage, numeric and categorical variables, missing values, and candidate target variables.
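
Because `loader.get_dataset_summary()` and `loader.identify_target_variables()` live in a custom module that is not shown here, the sketch below illustrates how comparable figures could be computed with plain pandas. The function names, dictionary keys, and the low-cardinality heuristic for target candidates are assumptions made for illustration, not the loader's actual behavior.

```python
import pandas as pd


def summarize_dataset(df):
    """Return basic facts about a DataFrame (hypothetical stand-in for the loader's summary)."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    # Every non-numeric column is treated as categorical in this sketch.
    categorical_cols = [col for col in df.columns if col not in numeric_cols]
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,
        "numeric_variables": numeric_cols,
        "categorical_variables": categorical_cols,
        "total_missing_values": int(df.isna().sum().sum()),
    }


def guess_target_candidates(df, max_unique=10):
    """Flag low-cardinality columns as possible classification targets (a heuristic only)."""
    return [
        col for col in df.columns
        if 2 <= df[col].nunique(dropna=True) <= max_unique
    ]
```

On the demonstration dataset, a heuristic like this would flag `education` and `city` alongside `target`, so the candidates still need a manual check against the problem statement.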

In [None]:
# Get dataset summary
dataset_summary = loader.get_dataset_summary()

print("=" * 50)
print("DATASET SUMMARY")
print("=" * 50)
print(f"Dataset Size: {dataset_summary['dataset_size']}")
print(f"Memory Usage: {dataset_summary['memory_usage']}")
print(f"Numeric Variables: {dataset_summary['numeric_variables']}")
print(f"Categorical Variables: {dataset_summary['categorical_variables']}")
print(f"Total Missing Values: {dataset_summary['total_missing_values']}")
print()

# Display columns information
print("COLUMNS INFORMATION:")
print("-" * 30)
columns_df = pd.DataFrame(dataset_summary['columns_info'])
print(columns_df[['name', 'dtype', 'missing_count', 'missing_percentage']].to_string(index=False))

# Identify potential target variables
target_candidates = loader.identify_target_variables()
print(f"\nPotential Target Variables: {target_candidates}")

## 3. Data Exploration Plan (Task 2)

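This section lays out a structured plan for the analysis, moving from an initial assessment through univariate, bivariate, and multivariate analysis to data quality checks and visualization.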
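
Two of the planned steps, duplicate record detection and outlier identification, can be sketched with plain pandas as shown here. The helper names are hypothetical, and the 1.5 × IQR rule is one common convention for flagging outliers rather than a method prescribed by this notebook.

```python
import pandas as pd


def count_duplicate_rows(df):
    """Number of rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)
```

For the cross-tabulation step, `pd.crosstab(df['education'], df['target'])` would give the count of each target class per education level in the demonstration dataset.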

In [None]:
print("=" * 50)
print("DATA EXPLORATION PLAN")
print("=" * 50)

print("""
1. INITIAL DATA ASSESSMENT
   - Dataset size and structure verification
   - Data types and memory usage analysis
   - Missing data patterns identification
   - Duplicate records detection

2. UNIVARIATE ANALYSIS
   - Distribution analysis of numerical variables
   - Frequency analysis of categorical variables
   - Identification of outliers and anomalies
   - Statistical summary generation

3. BIVARIATE ANALYSIS
   - Correlation analysis between numerical variables
   - Relationship analysis between categorical and numerical variables
   - Target variable relationship exploration
   - Cross-tabulation of categorical variables

4. MULTIVARIATE ANALYSIS
   - Multi-dimensional correlation analysis
   - Clustering tendency assessment
   - Feature interaction identification
   - Dimensionality reduction considerations

5. DATA QUALITY ASSESSMENT
   - Data consistency verification
   - Data integrity checks
   - Domain knowledge validation
   - Business rule compliance verification

6. EXPLORATORY VISUALIZATION
   - Distribution plots and histograms
   - Correlation heatmaps
   - Box plots and violin plots
   - Scatter plots and pair plots
""")

## 4. Initial Data Quality Assessment
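
The `ExploratoryDataAnalysis` class used in the cell below is a custom module whose code is not shown. As a point of reference, a per-column missing-data table could be built with plain pandas roughly as follows; the function name and column labels are assumptions chosen to mirror the summary printed earlier, not the class's real output.

```python
import pandas as pd


def missing_data_table(df):
    """Per-column missing counts and percentages, limited to columns that have gaps."""
    counts = df.isna().sum()
    table = pd.DataFrame({
        "missing_count": counts,
        "missing_percentage": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first.
    table = table[table["missing_count"] > 0]
    return table.sort_values("missing_count", ascending=False)
```

Returning an empty DataFrame when nothing is missing matches the `missing_data.empty` check used in the cell below.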

In [None]:
# Initialize EDA class
eda = ExploratoryDataAnalysis(loader.dataset)

# Analyze missing data
missing_data = eda.analyze_missing_data()
print("MISSING DATA ANALYSIS:")
print("-" * 25)
if missing_data.empty:
    print("No missing data found.")
else:
    print(missing_data)

# Analyze data types
dtypes_analysis = eda.analyze_data_types()
print("\nDATA TYPES ANALYSIS:")
print("-" * 20)
print(dtypes_analysis)

# Basic statistics
basic_stats = eda.get_basic_statistics()
print("\nBASIC STATISTICS:")
print("-" * 17)
print(basic_stats)

## 5. Visual Exploration
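
The `DataVisualizer` methods called in the cell below come from a custom module that is not shown. For readers without that module, a minimal matplotlib-only sketch of the same idea (one histogram per numeric column plus a correlation heatmap) might look like this. The function name `plot_numeric_overview` and its layout choices are assumptions for illustration, not the visualizer's actual implementation.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_numeric_overview(df, bins=30):
    """Draw one histogram per numeric column and a correlation heatmap in a single row."""
    numeric = df.select_dtypes(include="number")
    n = numeric.shape[1]
    if n == 0:
        raise ValueError("DataFrame has no numeric columns to plot")

    fig, axes = plt.subplots(1, n + 1, figsize=(4 * (n + 1), 4), squeeze=False)

    # Histograms: one panel per numeric column, ignoring missing values.
    for ax, col in zip(axes[0], numeric.columns):
        ax.hist(numeric[col].dropna(), bins=bins)
        ax.set_title(col)

    # Correlation heatmap in the last panel, on a fixed [-1, 1] color scale.
    corr = numeric.corr()
    heat = axes[0, -1]
    image = heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    heat.set_xticks(range(n))
    heat.set_xticklabels(corr.columns, rotation=90)
    heat.set_yticks(range(n))
    heat.set_yticklabels(corr.columns)
    heat.set_title("Correlation")
    fig.colorbar(image, ax=heat)

    fig.tight_layout()
    return fig
```

In the notebook, `fig = plot_numeric_overview(loader.dataset)` followed by `plt.show()` would render the panels inline.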

In [None]:
# Initialize visualizer
visualizer = DataVisualizer(loader.dataset)

# Plot distributions of numerical variables
if visualizer.numeric_columns:
    fig1 = visualizer.plot_distribution(figsize=(15, 8))
    plt.show()

# Plot categorical distributions
if visualizer.categorical_columns:
    fig2 = visualizer.plot_categorical_distributions(figsize=(15, 8))
    plt.show()

# Plot correlation matrix
if len(visualizer.numeric_columns) > 1:
    fig3 = visualizer.plot_correlation_matrix(figsize=(10, 8))
    plt.show()

## 6. Summary and Next Steps

In [None]:
print("=" * 50)
print("EXPLORATION SUMMARY")
print("=" * 50)
print(f"- Dataset loaded successfully with {len(loader.dataset)} rows and {len(loader.dataset.columns)} columns")
print(f"- Identified {len(visualizer.numeric_columns)} numerical variables")
print(f"- Identified {len(visualizer.categorical_columns)} categorical variables")
print(f"- Found {loader.dataset.isnull().sum().sum()} missing values")

print("\nNEXT STEPS:")
print("-" * 12)
print("1. Proceed to detailed EDA in the next notebook")
print("2. Perform data cleaning and preprocessing")
print("3. Conduct feature engineering")
print("4. Formulate and test hypotheses")

# Generate EDA report
eda_report = eda.generate_eda_report()
print(f"\n{eda_report}")