# Example Data Analysis with dbutils_batch_query

This notebook demonstrates how to use the `dbutils_batch_query` package for data analysis tasks. We'll walk through a complete workflow including:

1. Data import and preprocessing
2. Exploratory data analysis
3. Statistical testing
4. Time series analysis and forecasting

This example notebook can serve as a starting point for your own analysis projects.

## 1. Setup and Data Import

First, we'll import the necessary packages and prepare our environment.

In [ ]:
# Import standard data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100
sns.set_palette('viridis')

# Import functions from our package
from dbutils_batch_query.data_import import load_csv, clean_data, transform_data
from dbutils_batch_query.example_analysis import descriptive_stats, correlation_analysis, time_series_analysis

# Display package version
import dbutils_batch_query
print(f"dbutils_batch_query version: {dbutils_batch_query.__version__}")

### Generate Sample Data

Let's create some synthetic data for our examples.

In [ ]:
# Set a random seed for reproducibility
np.random.seed(42)

# Create sample tabular data
n_samples = 500
data = {
    'numeric_feature_1': np.random.normal(loc=50, scale=10, size=n_samples),
    'numeric_feature_2': np.random.normal(loc=25, scale=5, size=n_samples),
    'category': np.random.choice(['A', 'B', 'C'], size=n_samples, p=[0.5, 0.3, 0.2])
}

# Add some correlation between features
data['numeric_feature_3'] = data['numeric_feature_1'] * 0.6 + data['numeric_feature_2'] * 0.4 + np.random.normal(0, 5, n_samples)

# Add some missing values
missing_indices = np.random.choice(n_samples, size=int(n_samples * 0.05), replace=False)
data['numeric_feature_1'][missing_indices] = np.nan

# Create a DataFrame
df = pd.DataFrame(data)

# Display the first few rows
print(f"Sample data shape: {df.shape}")
df.head()

### Data Cleaning and Preprocessing

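Now we'll clean the data with `clean_data` from our `data_import` module: it drops duplicate rows and fills the missing values in `numeric_feature_1` with the column mean. `transform_data` then normalizes the numeric columns and encodes the `category` column.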

In [ ]:
# Check missing values before cleaning
print("Missing values before cleaning:")
print(df.isnull().sum())

# Clean the data
cleaned_df = clean_data(
    df,
    drop_duplicates=True,
    fill_na={'numeric_feature_1': df['numeric_feature_1'].mean()}
)

# Check missing values after cleaning
print("\nMissing values after cleaning:")
print(cleaned_df.isnull().sum())

# Transform the data
transformed_df = transform_data(
    cleaned_df,
    normalize_cols=['numeric_feature_1', 'numeric_feature_2', 'numeric_feature_3'],
    categorical_cols=['category']
)

print("\nTransformed data:")
transformed_df.head()
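
The implementation of `clean_data` and `transform_data` is not shown in this notebook. As a rough illustration only, the same steps can be sketched in plain pandas, assuming that "normalize" means z-scoring and that categorical columns are one-hot encoded. The helper names `clean_sketch` and `transform_sketch` and the small `demo` frame are hypothetical and not part of the package.

```python
import numpy as np
import pandas as pd


def clean_sketch(frame: pd.DataFrame, fill_na: dict) -> pd.DataFrame:
    """Drop exact duplicate rows, then fill missing values column by column."""
    return frame.drop_duplicates().fillna(value=fill_na)


def transform_sketch(frame, normalize_cols, categorical_cols):
    """Z-score the numeric columns and one-hot encode the categorical ones.

    Assumption: the package may normalize or encode differently.
    """
    out = frame.copy()
    for col in normalize_cols:
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return pd.get_dummies(out, columns=categorical_cols)


# Small hypothetical frame: one missing value and one duplicated row
demo = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 4.0],
    "group": ["A", "B", "A", "C", "C"],
})
cleaned_demo = clean_sketch(demo, fill_na={"x": demo["x"].mean()})
transformed_demo = transform_sketch(cleaned_demo, ["x"], ["group"])
print(transformed_demo)
```

After z-scoring, each normalized column has mean 0 and sample standard deviation 1, which is a quick way to check that a normalization step did what you expect.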

## 2. Exploratory Data Analysis

Let's perform some basic exploratory data analysis using our `example_analysis` module.

In [ ]:
# Calculate descriptive statistics
stats_df, plots = descriptive_stats(
    cleaned_df,
    numeric_cols=['numeric_feature_1', 'numeric_feature_2', 'numeric_feature_3'],
    include_plots=True
)

# Display the statistics
print("Descriptive Statistics:")
stats_df.round(2)

In [ ]:
# Show distribution plots for feature 1
plots['numeric_feature_1']

In [ ]:
# Perform correlation analysis
corr_df, corr_plot = correlation_analysis(
    cleaned_df,
    numeric_cols=['numeric_feature_1', 'numeric_feature_2', 'numeric_feature_3'],
    method='pearson',
    threshold=0.0,
    include_plot=True
)

print("Correlation Matrix:")
print(corr_df.round(3))

# Display the correlation plot
corr_plot
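
The correlation matrix can be sanity-checked against the values implied by the data-generation cell. The sketch below derives the theoretical Pearson correlations from the parameters used there (standard deviations 10 and 5, weights 0.6 and 0.4, noise standard deviation 5), assuming the two features and the noise are independent. The variable names are illustrative.

```python
import math

# Parameters taken from the data-generation cell
sd_1, sd_2, sd_noise = 10.0, 5.0, 5.0  # standard deviations
w_1, w_2 = 0.6, 0.4                    # weights used to build feature 3

# Features 1, 2 and the noise are independent, so their variances add
var_3 = (w_1 * sd_1) ** 2 + (w_2 * sd_2) ** 2 + sd_noise ** 2
sd_3 = math.sqrt(var_3)

# Cov(f1, f3) = w_1 * Var(f1) and Cov(f2, f3) = w_2 * Var(f2)
corr_13 = w_1 * sd_1 ** 2 / (sd_1 * sd_3)
corr_23 = w_2 * sd_2 ** 2 / (sd_2 * sd_3)

print(f"Var(feature 3) = {var_3:.1f}")
print(f"corr(feature 1, feature 3) = {corr_13:.3f}")
print(f"corr(feature 2, feature 3) = {corr_23:.3f}")
```

This gives roughly 0.744 for features 1 and 3 and 0.248 for features 2 and 3. The sample values will differ somewhat because of sampling noise, and mean-imputing 5% of `numeric_feature_1` should pull its correlation with feature 3 slightly lower.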

### Group Analysis by Category

Let's analyze our data grouped by the categorical variable.

In [ ]:
# Group statistics by category
group_stats = cleaned_df.groupby('category')[['numeric_feature_1', 'numeric_feature_2', 'numeric_feature_3']].agg(['mean', 'std'])
print("Group Statistics by Category:")
group_stats

In [ ]:
# Visualize distributions by category
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for i, feature in enumerate(['numeric_feature_1', 'numeric_feature_2', 'numeric_feature_3']):
    sns.boxplot(x='category', y=feature, data=cleaned_df, ax=axes[i])
    axes[i].set_title(f'Distribution of {feature} by Category')
    axes[i].grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

## 3. Statistical Testing

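Let's run a two-sample t-test comparing `numeric_feature_1` between categories A and B. Because the synthetic categories were drawn independently of the numeric features, we don't expect a significant difference.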

In [ ]:
from dbutils_batch_query.example_analysis import run_t_test

# Extract data for groups A and B for feature 1
group_a = cleaned_df[cleaned_df['category'] == 'A']['numeric_feature_1']
group_b = cleaned_df[cleaned_df['category'] == 'B']['numeric_feature_1']

# Run t-test between groups A and B
t_test_results, t_test_plot = run_t_test(
    group_a,
    group_b,
    equal_var=True,
    alternative='two-sided',
    include_plot=True
)

print("T-test Results for Feature 1 (Group A vs. Group B):")
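for key, value in t_test_results.items():
    # Format numeric entries to 4 decimals; print anything else as-is
    text = f"{value:.4f}" if isinstance(value, (int, float, np.number)) else str(value)
    print(f"{key}: {text}")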

# Show the t-test plot
t_test_plot
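
How `run_t_test` computes its result is not shown here. For reference, the textbook statistic that `equal_var=True` usually refers to (Student's t-test with pooled variance) can be sketched with NumPy alone. The function name `pooled_t_statistic` and the two short samples are hypothetical.

```python
import numpy as np


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees of freedom) for a two-sample t-test with pooled variance."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    n_a, n_b = a.size, b.size
    dof = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    # Standard error of the difference between the two means
    std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (a.mean() - b.mean()) / std_err, dof


t_stat, dof = pooled_t_statistic([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

For these two samples the statistic is -1 with 8 degrees of freedom. Turning it into a p-value requires the t-distribution; `scipy.stats.ttest_ind` does that and accepts `equal_var` and `alternative` arguments.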

## 4. Time Series Analysis and Forecasting

Now let's create and analyze some time series data.

In [ ]:
# Create a time series with trend, seasonality, and noise
periods = 48  # 4 years of monthly data
# 'MS' = month start; the older 'M' (month end) alias is deprecated in pandas 2.2+
dates = pd.date_range(start='2021-01-01', periods=periods, freq='MS')

# Components
time_idx = np.arange(periods)
trend = 0.2 * time_idx  # Linear trend
seasonality = 5 * np.sin(2 * np.pi * time_idx / 12)  # Yearly seasonality
noise = np.random.normal(0, 1, periods)  # Random noise

# Combine components
ts_values = trend + seasonality + noise
time_series = pd.Series(ts_values, index=dates)

# Plot the time series
plt.figure(figsize=(14, 6))
time_series.plot()
plt.title('Sample Time Series Data (Monthly, 4 Years)')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [ ]:
# Perform time series analysis
ts_results, ts_plots = time_series_analysis(
    time_series,
    periods=12,  # Forecast next 12 months
    seasonal=True,
    seasonal_periods=12,  # Monthly data
    include_plots=True
)

# Display the decomposition and forecast plots
ts_plots
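
The internals of `time_series_analysis` are not shown in this notebook. As one possible illustration of what an additive decomposition does, the sketch below applies the classical approach (centered moving-average trend, per-month seasonal averages) to a noise-free series built like the one above. This is an assumption about the general technique, not a description of the package, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild a series like the one above, without noise so the result is easy to check
n_periods, season_len = 48, 12
t = np.arange(n_periods)
index = pd.date_range("2021-01-01", periods=n_periods, freq="MS")
series = pd.Series(0.2 * t + 5 * np.sin(2 * np.pi * t / season_len), index=index)

# Trend: 2x12 centered moving average (the first and last 6 points are NaN)
trend_est = (
    series.rolling(season_len).mean()
    .rolling(2).mean()
    .shift(-(season_len // 2))
)

# Seasonal component: average detrended value per calendar month, centered on zero
detrended = series - trend_est
seasonal_means = detrended.groupby(detrended.index.month).mean()
seasonal_means -= seasonal_means.mean()
seasonal_est = pd.Series(
    seasonal_means.reindex(series.index.month).to_numpy(), index=index
)

# Residual: whatever trend and seasonality do not explain
residual = series - trend_est - seasonal_est
print(residual.dropna().abs().max())
```

Because the input has no noise, the residual is zero up to floating-point error. On the noisy series from the previous cell, the residual would instead carry the random component.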

### Forecast Analysis

Let's examine the forecast values in more detail.

In [ ]:
# Display the forecast
print("Forecast for the next 12 months:")
forecast = ts_results['forecast']
forecast

In [ ]:
# Create a more detailed forecast plot
plt.figure(figsize=(14, 6))

# Plot historical data
time_series.plot(label='Historical Data', color='blue')

# Plot forecast
forecast.plot(label='Forecast', color='red', linestyle='--')

# Add trend line from decomposition if available
if 'decomposition' in ts_results:
    # Use a distinct name so the synthetic `trend` array defined earlier is not overwritten
    trend_component = ts_results['decomposition'].trend.dropna()
    trend_component.plot(label='Trend', color='green', alpha=0.7)

    # Extend trend line for forecast period (simple linear projection)
    last_trend_value = trend_component.iloc[-1]
    # Average monthly change over the last 12 months of the estimated trend
    trend_diff = (trend_component.iloc[-1] - trend_component.iloc[-13]) / 12
    forecast_trend = pd.Series(
        [last_trend_value + trend_diff * (i+1) for i in range(len(forecast))],
        index=forecast.index
    )
    forecast_trend.plot(color='green', linestyle='--', alpha=0.7)

plt.title('Time Series Forecast')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.show()

## 5. Conclusion

In this notebook, we've demonstrated the use of the `dbutils_batch_query` package for data analysis tasks:

1. **Data Import and Processing**:
   - Created and cleaned sample data
   - Transformed features with normalization and encoding

2. **Exploratory Data Analysis**:
   - Calculated descriptive statistics
   - Analyzed correlations between variables
   - Summarized and visualized the features by category

3. **Statistical Testing**:
   - Compared groups A and B with a t-test

4. **Time Series Analysis and Forecasting**:
   - Created a synthetic time series with trend and seasonality
   - Decomposed the series into trend, seasonal, and residual components
   - Generated forecasts for future periods

This workflow can serve as a starting point for your own analysis projects using the `dbutils_batch_query` package.