# Exploratory Data Analysis: Rossmann Pharmaceuticals Sales Forecasting

This notebook contains the exploratory data analysis for the Rossmann Pharmaceuticals sales forecasting project. We'll analyze the data to understand customer purchasing behavior and prepare it for modeling.

In [68]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Drop cached project modules so edits under ../src are picked up on re-run
for module_name in ('data_loader', 'data_cleaner', 'eda_analyzer', 'utils'):
    sys.modules.pop(module_name, None)

from data_loader import DataLoader
from data_cleaner import DataCleaner
from eda_analyzer import EDAAnalyzer
import utils

%matplotlib inline
plt.style.use('seaborn-v0_8-dark')

## 1. Data Loading and Cleaning

In [None]:
# Load the data
loader = DataLoader()
loader.load_data('../resources/Data/train.csv', '../resources/Data/test.csv', '../resources/Data/store.csv')
merged_train, merged_test = loader.merge_data()

# Clean the data
cleaner = DataCleaner()
cleaned_train = cleaner.preprocess_data(merged_train)
cleaned_test = cleaner.preprocess_data(merged_test)

print("Train data shape:", cleaned_train.shape)
print("Test data shape:", cleaned_test.shape)
print("\nTrain data columns:", cleaned_train.columns.tolist())
print("\nTrain data info:")
cleaned_train.info()
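
The internals of `DataLoader.merge_data` are not shown in this notebook. As a rough sketch only, assuming the sales rows and the store metadata share a `Store` key, the merge could look like the following. The toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv and store.csv (hypothetical values)
train = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5000, 5200, 7000]})
store = pd.DataFrame({"Store": [1, 2], "StoreType": ["a", "c"]})

# Left join keeps every sales row; validate= raises if a store id
# appears more than once in the store table
merged = train.merge(store, on="Store", how="left", validate="many_to_one")

print(merged)
```

A left join with `validate="many_to_one"` is one way to make sure the merge neither drops nor duplicates sales rows.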

## 2. Distribution Analysis: Promotions in Training and Test Sets

In [None]:
analyzer = EDAAnalyzer(cleaned_train, cleaned_test)
analyzer.analyze_promotions()
plt.show()

The bar plot above compares how promotions are distributed in the training and test sets. If the share of promotion days differs markedly between the two, performance measured on the training data may not carry over to the test set.
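
The comparison can also be expressed as numbers. The sketch below is not the `analyze_promotions` implementation (which is not shown here); it assumes both frames have a binary `Promo` column and uses toy data in place of `cleaned_train` and `cleaned_test`. The helper name `promo_share` is hypothetical.

```python
import pandas as pd

# Toy frames standing in for cleaned_train / cleaned_test (hypothetical values)
train = pd.DataFrame({"Promo": [1, 0, 0, 1, 0, 0, 1, 0]})
test = pd.DataFrame({"Promo": [1, 0, 1, 0]})

def promo_share(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Return the fraction of promo / non-promo rows in each dataset."""
    shares = pd.DataFrame({
        "train": train_df["Promo"].value_counts(normalize=True),
        "test": test_df["Promo"].value_counts(normalize=True),
    })
    # A value missing from one dataset shows up as NaN after alignment
    return shares.fillna(0.0).sort_index()

shares = promo_share(train, test)
print(shares)
```

Comparing shares rather than raw counts keeps the two datasets comparable even though they differ in size.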

## 3. Holiday Effects on Sales

In [None]:
analyzer.analyze_holiday_effects()
plt.show()

This line plot illustrates the average sales during different state holidays. It helps us understand how holidays impact sales patterns, which can inform promotional strategies around these periods.
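
A minimal sketch of the aggregation behind such a plot, assuming the cleaned data keeps the public Rossmann coding for `StateHoliday` (`0` = none, `a` = public holiday, `b` = Easter, `c` = Christmas) and an `Open` flag. The actual `analyze_holiday_effects` code is not shown here and may differ; the values below are toy data.

```python
import pandas as pd

# Toy data (hypothetical values); one closed holiday row with zero sales
df = pd.DataFrame({
    "StateHoliday": ["0", "0", "a", "a", "b", "0"],
    "Open": [1, 1, 1, 0, 1, 1],
    "Sales": [5000, 6000, 7000, 0, 9000, 4000],
})

# Restrict to open days so closed stores do not drag the averages to zero
open_days = df[df["Open"] == 1]

# Report the row count next to the mean: holiday groups are usually small
holiday_sales = open_days.groupby("StateHoliday")["Sales"].agg(["mean", "count"])
print(holiday_sales)
```

Showing the count alongside the mean makes it easier to judge how much weight an average over only a few holiday rows deserves.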

## 4. Seasonality in Purchasing Behaviors

In [None]:
analyzer.analyze_seasonality()
plt.show()

The line plot above shows average monthly sales, helping us identify any seasonal patterns in purchasing behavior. This information can be crucial for inventory management and promotional planning.
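
As an illustration of how monthly averages can be derived from a `Date` column, here is a small sketch on toy data. It is an assumption about the general approach, not the code inside `analyze_seasonality`.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "Date": ["2015-01-05", "2015-01-20", "2015-02-10", "2015-12-21", "2015-12-22"],
    "Sales": [4000, 6000, 5500, 9000, 11000],
})

# Extract the calendar month (1-12) and average sales within each month
month = pd.to_datetime(df["Date"]).dt.month
monthly_sales = df.groupby(month)["Sales"].mean()
monthly_sales.index.name = "Month"

print(monthly_sales)
```

Grouping by calendar month pools all years together; grouping by year and month instead would show whether the pattern repeats from one year to the next.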

## 5. Correlation between Sales and Number of Customers

In [None]:
analyzer.analyze_sales_customers_correlation()
plt.show()

This scatter plot visualizes the relationship between sales and the number of customers. The correlation coefficient provides a quantitative measure of this relationship, which can help in understanding how customer traffic relates to sales performance.
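
For reference, a correlation coefficient between two columns can be computed with `Series.corr`. The sketch below uses toy data and assumes `Sales` and `Customers` columns; which coefficient `analyze_sales_customers_correlation` reports is not shown in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "Customers": [100, 200, 300, 400, 500],
    "Sales": [1100, 1900, 3200, 3900, 5100],
})

# Pearson (the default) measures linear association
r = df["Sales"].corr(df["Customers"])

# Spearman is rank-based and less sensitive to outliers
rho = df["Sales"].corr(df["Customers"], method="spearman")

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Reporting both is a cheap check: a large gap between them hints at outliers or a non-linear relationship.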

## 6. Promotional Impact on Sales and Customers

In [None]:
analyzer.analyze_promotional_impact()
plt.show()

This bar plot compares average sales and customer numbers during promotional and non-promotional periods. It helps us understand how promotions influence both sales and customer traffic, which is crucial for optimizing promotional strategies.
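
One way to quantify the comparison is a relative uplift per metric. The following is a sketch on toy data, assuming a binary `Promo` column; it is not the `analyze_promotional_impact` implementation.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "Promo": [0, 0, 1, 1],
    "Sales": [4000, 6000, 8000, 10000],
    "Customers": [500, 700, 800, 1000],
})

# Mean sales and customers for non-promo (0) and promo (1) rows
impact = df.groupby("Promo")[["Sales", "Customers"]].mean()

# Ratio of the group means (not the mean of per-row ratios)
impact["SalesPerCustomer"] = impact["Sales"] / impact["Customers"]

# Relative change of each metric on promo days versus non-promo days
uplift = impact.loc[1] / impact.loc[0] - 1

print(impact)
print(uplift)
```

Splitting the uplift this way separates extra customer traffic from a higher spend per customer.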

## 7. Additional Analyses

In [None]:
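# Analyze sales patterns by day of the week
# Derive the weekday from Date (0=Monday ... 6=Sunday) without overwriting
# any existing DayOfWeek column in cleaned_train
day_of_week = pd.to_datetime(cleaned_train['Date']).dt.dayofweek
day_sales = cleaned_train.groupby(day_of_week)['Sales'].mean().reindex(range(7))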

plt.figure(figsize=(10, 6))
day_sales.plot(kind='bar')
plt.title('Average Sales by Day of Week')
plt.xlabel('Day of Week (0=Monday, 6=Sunday)')
plt.ylabel('Average Sales')
plt.tight_layout()
plt.show()

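This bar plot shows average sales by day of the week, helping us identify any weekly patterns in sales. If days on which stores were closed (zero sales) remain in the data, they pull the affected weekday averages down.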

In [None]:
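# Analyze the impact of competition distance on sales
# (a fixed-seed sample of up to 1,000 rows keeps the plot readable and reproducible)
plt.figure(figsize=(10, 6))
sample = cleaned_train.sample(n=min(1000, len(cleaned_train)), random_state=42)
sns.scatterplot(x='CompetitionDistance', y='Sales', data=sample)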
plt.title('Sales vs Competition Distance')
plt.xlabel('Competition Distance')
plt.ylabel('Sales')
plt.tight_layout()
plt.show()

This scatter plot shows sales against the distance to the nearest competitor for a random sample of training rows. It indicates whether competitor proximity is associated with sales; the plot alone does not establish a causal effect.
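
Scatter plots of skewed variables can be hard to read, so a binned summary is a useful complement. The sketch below uses toy data; the bin edges and labels are arbitrary choices for illustration, not values from the project.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "CompetitionDistance": [50, 300, 900, 1500, 4000, 12000],
    "Sales": [7000, 6500, 6000, 5800, 6200, 6400],
})

# Hypothetical distance bands; pd.cut uses right-closed intervals by default
bins = [0, 500, 2000, 5000, float("inf")]
labels = ["0-500", "500-2000", "2000-5000", ">5000"]
df["DistanceBand"] = pd.cut(df["CompetitionDistance"], bins=bins, labels=labels)

# Median sales per band; observed=True skips bands with no rows
band_sales = df.groupby("DistanceBand", observed=True)["Sales"].median()
print(band_sales)
```

The median is used here because a few very high-sales rows would otherwise dominate the mean within a band.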

## 8. Summary and Key Findings

Based on our exploratory data analysis, we can draw the following conclusions:

1. Promotion Distribution: [Insert findings about promotion distribution in train and test sets]
2. Holiday Effects: [Insert findings about how holidays affect sales]
3. Seasonality: [Insert findings about seasonal patterns in sales]
4. Sales and Customers Correlation: [Insert findings about the relationship between sales and number of customers]
5. Promotional Impact: [Insert findings about how promotions affect sales and customer numbers]
6. Weekly Sales Patterns: [Insert findings about sales patterns by day of week]
7. Competition Impact: [Insert findings about how competition distance affects sales]

These insights will be valuable for building our sales forecasting model and for informing business strategies related to promotions, holiday planning, and competitive positioning.