# Data Exploration Template

This notebook is a template for exploratory data analysis (EDA). The cells use a synthetic sample dataset; replace it in the Data Loading section with your own data.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading](#loading)
3. [Data Overview](#overview)
4. [Data Quality Assessment](#quality)
5. [Univariate Analysis](#univariate)
6. [Bivariate Analysis](#bivariate)
7. [Multivariate Analysis](#multivariate)
8. [Key Insights](#insights)
9. [Next Steps](#next-steps)

## 1. Setup and Imports {#setup}

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path

# Custom utilities
import sys
sys.path.append('../scripts')
from data_utils import load_config, load_data, explore_data, plot_correlation_matrix, plot_missing_values

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette('Set2')

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Setup completed successfully!")

## 2. Data Loading {#loading}

In [None]:
# Load configuration
config = load_config('../config/config.yaml')

# Load your data here
# Example: df = load_data('../data/raw/your_dataset.csv')
# For demonstration, we'll create sample data

# Create sample dataset
np.random.seed(42)
n_samples = 1000

df = pd.DataFrame({
    'age': np.random.randint(18, 80, n_samples),
    'income': np.random.normal(50000, 15000, n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'experience': np.random.randint(0, 40, n_samples),
    'satisfaction': np.random.randint(1, 11, n_samples),
    'department': np.random.choice(['Sales', 'Engineering', 'Marketing', 'HR'], n_samples),
    'performance_score': np.random.normal(75, 10, n_samples)
})

# Introduce some missing values
missing_indices = np.random.choice(df.index, size=50, replace=False)
df.loc[missing_indices, 'income'] = np.nan

print(f"Dataset loaded successfully! Shape: {df.shape}")

## 3. Data Overview {#overview}

In [None]:
# Basic information about the dataset
explore_data(df)
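
`explore_data` comes from the project's `data_utils` module, which is not shown here. As a rough stand-in when that helper is unavailable, a per-column overview can be built with plain pandas. The function name `quick_overview` and the small `demo` frame below are hypothetical and only illustrate the idea; this is not the helper's actual implementation.

```python
import numpy as np
import pandas as pd


def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, non-null count, distinct values, missing share."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })


# Tiny illustrative frame (not the notebook's dataset)
demo = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [42000.0, np.nan, 58000.0, 61000.0],
    "department": ["Sales", "HR", "Sales", "Engineering"],
})

overview = quick_overview(demo)
print(overview)
```

In the notebook itself, `quick_overview(df)` would give one row per column of the sample dataset.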

In [None]:
# Display first few rows
print("First 5 rows:")
display(df.head())

print("\nLast 5 rows:")
display(df.tail())

## 4. Data Quality Assessment {#quality}

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Check data types
print("\nData types:")
print(df.dtypes)

# Identify numerical columns (reused by the plots in later sections)
numerical_cols = df.select_dtypes(include=[np.number]).columns
print(f"\nNumerical columns: {list(numerical_cols)}")

In [None]:
# Visualize missing values
plot_missing_values(df)
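
The quality checks above list the numerical columns but do not flag outliers. One common option is the interquartile-range (IQR) rule, sketched below. The helper name `iqr_outlier_counts`, the multiplier `k=1.5`, and the `demo` frame are assumptions made for illustration; other rules (z-scores, domain thresholds) may suit your data better.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons against NaN evaluate to False, so missing values are never flagged
    is_outlier = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return pd.DataFrame({
        "lower_bound": lower,
        "upper_bound": upper,
        "n_outliers": is_outlier.sum(),
    })


# Tiny illustrative frame (not the notebook's dataset)
demo = pd.DataFrame({
    "score": [10, 11, 12, 13, 14, 100],
    "level": [1, 2, 3, 4, 5, 6],
})

outliers = iqr_outlier_counts(demo)
print(outliers)
```

Calling `iqr_outlier_counts(df)` on the sample dataset would report the bounds and outlier count for every numerical column.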

## 5. Univariate Analysis {#univariate}

In [None]:
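# Distribution of numerical variables (one panel per column)
n_rows = int(np.ceil(len(numerical_cols) / 2))
fig, axes = plt.subplots(n_rows, 2, figsize=(15, 5 * n_rows), squeeze=False)
axes = axes.ravel()

for ax, col in zip(axes, numerical_cols):
    df[col].hist(bins=30, ax=ax, alpha=0.7)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')

# Hide any unused panel when the column count is odd
for ax in axes[len(numerical_cols):]:
    ax.set_visible(False)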

plt.tight_layout()
plt.show()

In [None]:
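# Box plots for numerical variables (one panel per column)
n_rows = int(np.ceil(len(numerical_cols) / 2))
fig, axes = plt.subplots(n_rows, 2, figsize=(15, 5 * n_rows), squeeze=False)
axes = axes.ravel()

for ax, col in zip(axes, numerical_cols):
    df.boxplot(column=col, ax=ax)
    ax.set_title(f'Box Plot of {col}')

# Hide any unused panel when the column count is odd
for ax in axes[len(numerical_cols):]:
    ax.set_visible(False)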

plt.tight_layout()
plt.show()
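
Histograms and box plots show skewness visually; a numeric summary can make it easier to record in the Key Insights section. The sketch below compares mean, median, and sample skewness per numeric column. The helper name `shape_summary` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def shape_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, and sample skewness for each numeric column."""
    numeric = frame.select_dtypes(include=[np.number])
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),  # > 0 suggests a longer right tail, < 0 a longer left tail
    })


# Tiny illustrative frame (not the notebook's dataset)
demo = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 20.0],
})

shape = shape_summary(demo)
print(shape)
```

A mean well above the median together with a positive skew value is a hint that a log or similar transform may be worth considering later.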

In [None]:
# Categorical variables
categorical_cols = df.select_dtypes(include=['object']).columns

# squeeze=False always returns a 2-D array of axes, even for a single column
fig, axes = plt.subplots(1, len(categorical_cols), figsize=(15, 5), squeeze=False)
axes = axes.ravel()

for i, col in enumerate(categorical_cols):
    df[col].value_counts().plot(kind='bar', ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 6. Bivariate Analysis {#bivariate}

In [None]:
# Correlation matrix
plot_correlation_matrix(df)
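
`plot_correlation_matrix` is a project helper whose implementation is not shown here. When a dataset has many columns, a ranked list of the strongest pairwise correlations can complement the plot. The sketch below is one way to build such a list with pandas; the helper name `top_correlations` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Tiny illustrative frame (not the notebook's dataset)
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [3, 1, 4, 1, 5],
    "label": list("abcde"),  # non-numeric columns are ignored by numeric_only=True
})

top = top_correlations(demo)
print(top)
```

Correlation only captures linear association, so the scatter plots in this section remain useful for spotting non-linear patterns.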

In [None]:
# Scatter plots for key relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Example relationships
df.plot.scatter(x='age', y='income', ax=axes[0,0], alpha=0.6)
axes[0,0].set_title('Age vs Income')

df.plot.scatter(x='experience', y='performance_score', ax=axes[0,1], alpha=0.6)
axes[0,1].set_title('Experience vs Performance Score')

df.plot.scatter(x='age', y='experience', ax=axes[1,0], alpha=0.6)
axes[1,0].set_title('Age vs Experience')

df.plot.scatter(x='satisfaction', y='performance_score', ax=axes[1,1], alpha=0.6)
axes[1,1].set_title('Satisfaction vs Performance Score')

plt.tight_layout()
plt.show()

In [None]:
# Group analysis
print("Average performance score by department:")
dept_performance = df.groupby('department')['performance_score'].agg(['mean', 'std', 'count'])
display(dept_performance)

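# Visualize
# Pass ax explicitly: with `by=`, df.boxplot otherwise opens its own figure
# and the figure created here stays empty.
fig, ax = plt.subplots(figsize=(10, 6))
df.boxplot(column='performance_score', by='department', ax=ax)
ax.set_title('Performance Score by Department')
fig.suptitle('')  # Remove the automatic "Boxplot grouped by ..." title
ax.tick_params(axis='x', rotation=45)
plt.show()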

## 7. Multivariate Analysis {#multivariate}

In [None]:
# Pair plot for numerical variables
numerical_subset = df[numerical_cols]  # numerical_cols already holds only numeric columns
if len(numerical_subset.columns) <= 5:  # Only if not too many variables
    sns.pairplot(numerical_subset)
    plt.show()
else:
    print("Too many numerical variables for pair plot. Consider selecting a subset.")

In [None]:
# Grouped analysis
plt.figure(figsize=(12, 8))
for dept in df['department'].unique():
    dept_data = df[df['department'] == dept]
    plt.scatter(dept_data['age'], dept_data['income'], label=dept, alpha=0.6)

plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs Income by Department')
plt.legend()
plt.show()
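
A grouped scatter plot can be paired with a table that crosses two categorical variables. The sketch below uses `DataFrame.pivot_table` to compute mean income by department and education level. The `demo` frame is made up for illustration and only reuses the notebook's column names.

```python
import pandas as pd

# Tiny illustrative frame (not the notebook's dataset)
demo = pd.DataFrame({
    "department": ["Sales", "Sales", "HR", "HR", "HR"],
    "education": ["Bachelor", "Master", "Bachelor", "Bachelor", "Master"],
    "income": [40000.0, 55000.0, 42000.0, 46000.0, 60000.0],
})

# Rows: department, columns: education, cells: mean income of that combination
mean_income = demo.pivot_table(
    index="department",
    columns="education",
    values="income",
    aggfunc="mean",
)
print(mean_income)
```

Applied to the notebook's `df`, the same call would give a department-by-education grid; combinations with no rows appear as NaN.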

## 8. Key Insights {#insights}

Based on the exploratory data analysis, here are the key insights:

### Data Quality
- Dataset contains [X] rows and [Y] columns
- Missing values found in: [list columns with missing values]
- Duplicate rows: [number of duplicate rows found]
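
The placeholders in this list can be filled from the data rather than by hand. The sketch below collects the values into a dictionary; the helper name `quality_summary` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def quality_summary(frame: pd.DataFrame) -> dict:
    """Collect dataset-level facts for the Data Quality bullets."""
    missing = frame.isna().sum()
    return {
        "n_rows": frame.shape[0],
        "n_columns": frame.shape[1],
        "columns_with_missing": missing[missing > 0].index.tolist(),
        "n_duplicate_rows": int(frame.duplicated().sum()),
    }


# Tiny illustrative frame (not the notebook's dataset)
demo = pd.DataFrame({
    "age": [30, 30, 45],
    "income": [50000.0, 50000.0, np.nan],
})

summary = quality_summary(demo)
print(summary)
```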

### Distribution Insights
- [Describe key patterns in distributions]
- [Note any skewness or outliers]

### Relationships
- [Describe key correlations found]
- [Note interesting bivariate relationships]

### Business Insights
- [Translate statistical findings to business context]
- [Highlight actionable insights]


## 9. Next Steps {#next-steps}

Based on this analysis, the recommended next steps are:

1. **Data Cleaning**: 
   - Handle missing values in [specific columns]
   - Address outliers in [specific variables]

2. **Feature Engineering**:
   - Create new features based on [insights found]
   - Transform variables as needed

3. **Modeling**:
   - Consider [specific modeling approaches] based on the data characteristics
   - Focus on [key variables] as predictors

4. **Further Analysis**:
   - Investigate [specific patterns] in more detail
   - Collect additional data on [specific aspects]
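
As one possible starting point for the data cleaning step, the sketch below imputes missing values in a numeric column with the median and keeps a flag column recording which rows were filled. Median imputation is an assumption chosen for illustration, not a recommendation for every dataset; the helper name `impute_median` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_median(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy with NaN in `column` replaced by the column median."""
    out = frame.copy()
    # Record which rows were missing before filling, so the information is not lost
    out[f"{column}_was_missing"] = out[column].isna()
    out[column] = out[column].fillna(out[column].median())
    return out


# Tiny illustrative frame (not the notebook's dataset)
demo = pd.DataFrame({"income": [40000.0, np.nan, 60000.0, 80000.0]})

cleaned = impute_median(demo, "income")
print(cleaned)
```

The function works on a copy, so the original frame keeps its missing values and the raw data stays available for comparison.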


In [None]:
# Save processed data for next steps
# df.to_csv('../data/processed/explored_data.csv', index=False)
print("Analysis completed!")