# 📊 Data Loading and Exploration

Learn how to load data from multiple sources and perform comprehensive data exploration.

## Topics Covered
1. Loading data from various sources (CSV, JSON, Parquet, Excel)
2. Connecting to databases
3. Exploratory Data Analysis (EDA)
4. Data quality checks
5. Visualization techniques

**Time Required**: ~20 minutes

In [None]:
import sys
# Add the directory two levels up to the import path so the package can be imported
sys.path.insert(0, '../../')

from data_science_master_system import DataLoader, Plotter
from data_science_master_system.data.processing import ProcessingEngine
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
# Silence all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

print("✅ Imports ready!")

## 1. Loading Data from Files

`DataLoader.read` detects the file format automatically, so the CSV and JSON files below load through the same call.

In [None]:
# Initialize the DataLoader
loader = DataLoader()

# Load CSV file
df_churn = loader.read('../data/csv/customer_churn.csv')
print(f"📁 CSV loaded: {df_churn.shape}")

# Load JSON file
df_reviews = loader.read('../data/json/product_reviews.json')
print(f"📁 JSON loaded: {df_reviews.shape}")

In [None]:
# Preview the data
print("\n🔍 Customer Churn Data:")
display(df_churn.head())

print("\n🔍 Product Reviews Data:")
display(df_reviews.head())
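
Format detection of the kind described above can be sketched in plain pandas by dispatching on the file extension. This is only an illustration: the `READERS` table and the `read_any` helper are hypothetical names, and how `DataLoader` actually detects formats is not shown in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
# An assumption for illustration, not DataLoader's real implementation.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
}


def read_any(path):
    """Pick a pandas reader based on the file extension and load the file."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix!r}") from None
    return reader(path)


# Usage: write a tiny CSV to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    csv_path.write_text("customer_id,monthly_charges\n1,50.0\n2,120.5\n")
    df_sample = read_any(csv_path)

print(df_sample.shape)
```

A lookup table keeps the dispatch logic in one place, so supporting another format means adding one entry rather than another `if` branch.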

## 2. Using the Processing Engine

The `ProcessingEngine` provides a unified interface for data manipulation. Here it is configured with the pandas backend.

In [None]:
# Initialize processing engine
engine = ProcessingEngine(backend='pandas')

# Filter data
high_value_customers = engine.filter(df_churn, 'monthly_charges > 100')
print(f"High-value customers: {len(high_value_customers)}")

# Group and aggregate
by_contract = engine.group_by(df_churn, 'contract_type').agg({
    'monthly_charges': ['mean', 'sum'],
    'churn': 'mean'
})
print("\n📊 Stats by Contract Type:")
display(by_contract)
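
For comparison, the same filter and aggregation can be written directly in pandas. The sketch below uses toy data whose column names mirror the churn dataset. Whether `ProcessingEngine` delegates to `DataFrame.query` and `groupby(...).agg` internally is an assumption, not something this notebook confirms.

```python
import pandas as pd

# Toy data; column names mirror the churn dataset used in this notebook
df = pd.DataFrame({
    "contract_type": ["monthly", "monthly", "annual", "annual"],
    "monthly_charges": [120.0, 80.0, 150.0, 60.0],
    "churn": [1, 0, 0, 0],
})

# Filter rows with a string expression
high_value = df.query("monthly_charges > 100")

# Group by contract type and apply per-column aggregations
by_contract = df.groupby("contract_type").agg({
    "monthly_charges": ["mean", "sum"],
    "churn": "mean",
})

print(f"High-value customers: {len(high_value)}")
print(by_contract)
```

Because one column receives a list of aggregations, the result has two-level column labels such as `("monthly_charges", "mean")`.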

## 3. Comprehensive EDA

In [None]:
# Basic statistics
print("📈 Numerical Statistics:")
display(df_churn.describe())

In [None]:
# Data types and missing values
print("\n🔧 Data Info:")
info_df = pd.DataFrame({
    'Column': df_churn.columns,
    'Type': df_churn.dtypes.values,
    'Non-Null': df_churn.count().values,
    'Null %': (df_churn.isnull().sum() / len(df_churn) * 100).round(2).values,
    'Unique': df_churn.nunique().values
})
display(info_df)

In [None]:
# Visualizations
plotter = Plotter()

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Distribution of monthly charges
df_churn['monthly_charges'].hist(bins=30, ax=axes[0, 0], color='steelblue', edgecolor='white')
axes[0, 0].set_title('Monthly Charges Distribution')
axes[0, 0].set_xlabel('Monthly Charges ($)')

# Churn by contract type
churn_by_contract = df_churn.groupby('contract_type')['churn'].mean().sort_values()
churn_by_contract.plot(kind='barh', ax=axes[0, 1], color='coral')
axes[0, 1].set_title('Churn Rate by Contract Type')
axes[0, 1].set_xlabel('Churn Rate')

# Age distribution
df_churn['age'].hist(bins=25, ax=axes[1, 0], color='green', edgecolor='white')
axes[1, 0].set_title('Age Distribution')

# Tenure distribution
df_churn['tenure_months'].hist(bins=20, ax=axes[1, 1], color='purple', edgecolor='white')
axes[1, 1].set_title('Tenure Distribution (Months)')

plt.tight_layout()
plt.show()

## 4. Correlation Analysis

In [None]:
# Select numeric columns
numeric_cols = df_churn.select_dtypes(include=[np.number]).columns
numeric_data = df_churn[numeric_cols]

# Plot correlation matrix
fig = plotter.correlation_matrix(numeric_data, title='Feature Correlations')
plt.show()
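
A heatmap shows the overall pattern, but a ranked list of pairs is often easier to scan. The sketch below computes Pearson correlations with plain pandas on synthetic data (the values are generated for illustration only and are not the churn dataset) and lists each pair once, strongest first.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; monthly_charges is built to depend on tenure
rng = np.random.default_rng(0)
tenure = rng.integers(1, 72, size=200)
toy = pd.DataFrame({
    "tenure_months": tenure,
    "monthly_charges": 30 + 0.8 * tenure + rng.normal(0, 5, size=200),
    "age": rng.integers(18, 80, size=200),
})

# Pearson correlation matrix of the numeric columns
corr = toy.select_dtypes(include=[np.number]).corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```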

## 5. Data Quality Checks

In [None]:
from data_science_master_system.utils.validators import validate_dataframe

# Validate the dataframe
try:
    validate_dataframe(
        df_churn,
        required_columns=['customer_id', 'churn', 'monthly_charges'],
        min_rows=100,
        allow_empty=False
    )
    print("✅ Data validation passed!")
except Exception as e:
    print(f"❌ Validation failed: {e}")

In [None]:
# Check for duplicates
duplicates = df_churn.duplicated().sum()
print(f"\n🔍 Duplicate rows: {duplicates}")

# Check for outliers using the 1.5 * IQR rule
charges = df_churn['monthly_charges']
Q1 = charges.quantile(0.25)
Q3 = charges.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = ((charges < lower) | (charges > upper)).sum()
print(f"🔍 Monthly charges outliers: {outliers}")
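
The same IQR rule extends to every numeric column at once. The sketch below wraps it in a helper. `iqr_outlier_counts` is a name introduced for this example, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare each column against its own bounds
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


# Made-up values: one extreme monthly charge, no extreme ages
toy = pd.DataFrame({
    "monthly_charges": [50, 55, 60, 65, 70, 500],
    "age": [30, 35, 40, 45, 50, 55],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

The multiplier `k` is exposed as a parameter because 1.5 is a convention rather than a fixed threshold. A larger value flags only more extreme points.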

## 🎯 Key Takeaways

1. **DataLoader** - Universal data loading from files, databases, APIs
2. **ProcessingEngine** - Unified data manipulation interface
3. **Plotter** - Easy visualization creation
4. **validators** - Data quality validation

### Next: Feature Engineering →