# Exploratory Data Analysis for Mint Replica Project

This notebook performs exploratory data analysis (EDA) on the Mint Replica project's financial data. It uses the loading and preprocessing functions from the `data_loader` module to analyze and visualize the data, with the goal of informing feature engineering and model development.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from src.utils.data_loader import load_and_prepare_data

# Load the data
data = load_and_prepare_data('database', target_column='category')
df = data[0]  # Assuming the first returned item is the full dataset

## Data Overview

In [None]:
df.info()  # info() prints its summary directly and returns None
print(df.describe())
print(df.head())

## Missing Value Analysis

In [None]:
missing_values = df.isnull().sum()
missing_percentages = 100 * missing_values / len(df)
missing_table = pd.concat([missing_values, missing_percentages], axis=1, keys=['Missing Values', 'Percentage'])
print(missing_table)

plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title('Missing Value Heatmap')
plt.show()

## Transaction Amount Distribution

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(df['amount'], kde=True)
plt.title('Distribution of Transaction Amounts')
plt.xlabel('Amount')
plt.ylabel('Frequency')
plt.show()

# Box plot for amount by category
plt.figure(figsize=(14, 8))
sns.boxplot(x='category', y='amount', data=df)
plt.title('Transaction Amounts by Category')
plt.xticks(rotation=45)
plt.show()
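
Transaction amounts are often heavily skewed, in which case the histogram above is dominated by a few large values. Whether that holds for this dataset is an assumption, not something the notebook establishes. A minimal sketch of a signed log transform that could be applied before re-plotting; `signed_log1p` is a hypothetical helper and the sample values are made up for illustration:

```python
import numpy as np
import pandas as pd


def signed_log1p(amounts: pd.Series) -> pd.Series:
    """Compress large magnitudes while preserving the sign of each amount."""
    return np.sign(amounts) * np.log1p(np.abs(amounts))


# Made-up sample; in the notebook this would be df['amount'].
sample_amounts = pd.Series([-250.0, -12.5, 0.0, 4.99, 1800.0])
log_amounts = signed_log1p(sample_amounts)
print(log_amounts.round(3))
```

Passing the transformed series to `sns.histplot` in place of `df['amount']` may show the bulk of the distribution more clearly.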

## Transaction Frequency Analysis

In [None]:
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.day_name()

# Transaction frequency by day of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.figure(figsize=(12, 6))
sns.countplot(x='day_of_week', data=df, order=day_order)
plt.title('Transaction Frequency by Day of Week')
plt.xticks(rotation=45)
plt.show()

# Transaction frequency by category
plt.figure(figsize=(14, 8))
category_counts = df['category'].value_counts()
sns.barplot(x=category_counts.index, y=category_counts.values)
plt.title('Transaction Frequency by Category')
plt.xticks(rotation=90)
plt.show()

## Time Series Analysis

In [None]:
df['month'] = df['date'].dt.to_period('M')
monthly_spending = df.groupby('month')['amount'].sum().reset_index()
monthly_spending['month'] = monthly_spending['month'].astype(str)

fig = px.line(monthly_spending, x='month', y='amount', title='Monthly Spending Over Time')
fig.show()
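
Monthly totals can be noisy, so a rolling average sometimes makes the trend easier to read. The following is a sketch under assumptions: `add_rolling_mean` is a hypothetical helper, the three-month window is an arbitrary choice, and the figures are invented sample data shaped like `monthly_spending`:

```python
import pandas as pd


def add_rolling_mean(monthly: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Return a copy with a trailing rolling mean of the 'amount' column."""
    result = monthly.copy()
    # min_periods=1 keeps the first months instead of leaving them as NaN.
    result['rolling_mean'] = result['amount'].rolling(window, min_periods=1).mean()
    return result


# Invented sample with the same columns as monthly_spending.
sample_monthly = pd.DataFrame({
    'month': ['2023-01', '2023-02', '2023-03', '2023-04'],
    'amount': [100.0, 200.0, 300.0, 400.0],
})
smoothed = add_rolling_mean(sample_monthly)
print(smoothed)
```

Both series could then be drawn together, for example with `px.line(smoothed, x='month', y=['amount', 'rolling_mean'])`.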

## Correlation Analysis

In [None]:
numerical_columns = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numerical_columns].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
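
With many numerical columns, the annotated heatmap can be hard to scan. One possible complement is a ranked list of the strongest pairwise correlations. This is a sketch, not part of the project code: `top_correlated_pairs` is a hypothetical helper and the small frame is made-up data:

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)


# Made-up data: 'b' is exactly twice 'a', 'c' is only weakly related.
sample = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8],
    'c': [4, 1, 3, 2],
})
strongest = top_correlated_pairs(sample)
print(strongest)
```

In the notebook, the same function could be called on `df[numerical_columns]`.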

## Categorical Feature Analysis

In [None]:
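# Limit to low-cardinality columns: a box plot with hundreds of groups
# (for example, a free-text field) is slow to draw and unreadable.
max_categories = 30  # arbitrary readability cutoff; adjust as needed
categorical_columns = [
    column for column in df.select_dtypes(include=['object']).columns
    if df[column].nunique() <= max_categories
]

for column in categorical_columns:
    plt.figure(figsize=(12, 6))
    sns.boxplot(x=column, y='amount', data=df)
    plt.title(f'Transaction Amounts by {column}')
    plt.xticks(rotation=90)
    plt.show()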

## Insights and Conclusions

Use this cell to summarize the key insights and conclusions from the analysis above. Areas to focus on include:

1. Distribution of transaction amounts and any outliers
2. Most common transaction categories and their characteristics
3. Spending patterns over time (daily, weekly, monthly)
4. Correlations between numerical features
5. Relationships between categorical features and transaction amounts
6. Any notable missing data patterns
7. Potential features for machine learning models
8. Areas that require further investigation or data collection
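
For the first item, one common way to quantify outliers is the interquartile range (IQR) rule, the same rule box plots typically use for their whiskers. The sketch below is an illustration only: `flag_iqr_outliers` is a hypothetical helper, the multiplier of 1.5 is a conventional default rather than a project decision, and the amounts are made up:

```python
import pandas as pd


def flag_iqr_outliers(amounts: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = amounts.quantile(0.25)
    q3 = amounts.quantile(0.75)
    iqr = q3 - q1
    return (amounts < q1 - k * iqr) | (amounts > q3 + k * iqr)


# Made-up sample; in the notebook this would be df['amount'].
sample_amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
outlier_mask = flag_iqr_outliers(sample_amounts)
print(sample_amounts[outlier_mask])
```

Applying the mask per category, rather than to the whole column, may be more meaningful if typical amounts differ widely between categories.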