# Task 1: Exploratory Data Analysis (EDA) on Rossmann Sales Data

## Introduction

In this notebook, we will conduct an Exploratory Data Analysis (EDA) on the Rossmann Sales dataset. The primary objective of this analysis is to understand the underlying patterns and trends in the data, which will aid in building predictive models for future sales.

### Dataset Overview

The Rossmann dataset contains historical sales data from various Rossmann stores across different locations. The dataset includes several features such as:

- **Store**: Unique identifier for each store.
- **DayOfWeek**: Day of the week (e.g., 1 = Monday, 7 = Sunday).
- **Date**: The date of the sales record.
- **Sales**: Total sales for the store on that day.
- **Customers**: Number of customers who visited the store.
- **Open**: Indicates whether the store was open on that day (1 = Yes, 0 = No).
- **Promo**: Indicates whether a promotion was active (1 = Yes, 0 = No).
- **StateHoliday**: Indicates whether the day is a state holiday (0 = No holiday, a = public holiday, b = Easter holiday, c = Christmas).
- **SchoolHoliday**: Indicates whether the day is a school holiday (1 = Yes, 0 = No).
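
Because `StateHoliday` mixes the numeric code `0` with the letters `a`, `b`, and `c`, pandas may infer inconsistent types for it when reading the file. The following is a minimal sketch, assuming the column names listed above; the inline sample stands in for `train.csv`, its second row is illustrative only, and `IsStateHoliday` is a hypothetical helper column.

```python
import io

import pandas as pd

# Tiny inline sample standing in for train.csv (second row is illustrative only).
sample_csv = io.StringIO(
    "Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday\n"
    "1,5,2015-07-31,5263,555,1,1,0,1\n"
    "1,4,2015-01-01,0,0,0,0,a,1\n"
)

# Read StateHoliday as text so "0", "a", "b", and "c" share one dtype,
# and parse Date into a datetime column.
df = pd.read_csv(sample_csv, dtype={"StateHoliday": str}, parse_dates=["Date"])

# Hypothetical helper column: True for any kind of state holiday.
df["IsStateHoliday"] = df["StateHoliday"] != "0"

print(df.dtypes)
```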

### Objectives

The main objectives of this EDA are to:

1. **Data Cleaning**: Identify and handle missing values, duplicates, and outliers.
2. **Descriptive Statistics**: Generate summary statistics to understand the distribution of numerical variables.
3. **Data Visualization**: Create visualizations to identify trends, seasonal patterns, and relationships between variables.
4. **Feature Engineering**: Explore potential new features that could enhance predictive modeling.
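
As one illustration of the feature engineering objective, calendar features can be derived from the `Date` column. The sketch below uses only the standard library; the function name `date_features` and the feature names other than `DayOfWeek` are assumptions, not part of the dataset.

```python
from datetime import date


def date_features(d: date) -> dict:
    """Derive simple calendar features from a single date."""
    # isocalendar() returns (ISO year, ISO week number, ISO weekday).
    _, iso_week, iso_weekday = d.isocalendar()
    return {
        "Year": d.year,
        "Month": d.month,
        "WeekOfYear": iso_week,
        "DayOfWeek": iso_weekday,  # 1 = Monday ... 7 = Sunday, as in the dataset
        "IsWeekend": iso_weekday >= 6,
    }


features = date_features(date(2015, 7, 31))
print(features)
```

The ISO weekday numbering matches the dataset's `DayOfWeek` convention (1 = Monday, 7 = Sunday), so the derived value can double as a consistency check against the existing column.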

### Structure of the Notebook

This notebook is organized into the following sections:

1. **Data Loading**: Load the necessary libraries and the dataset.
2. **Data Cleaning**: Handle missing values and prepare the data for analysis.
3. **Descriptive Statistics**: Provide summary statistics for numerical features.
4. **Data Visualization**: Generate various plots to visualize the data.
5. **Feature Engineering**: Discuss potential new features based on insights gained from the analysis.
6. **Conclusion**: Summarize findings and outline next steps for modeling.

Let's get started by loading the required libraries and the dataset.


In [1]:
import os
import sys

# Add the parent directory to the import path so the `scripts` package resolves.
sys.path.append(os.path.abspath('..'))

In [2]:
from scripts.EDA import DataExplorer

In [3]:
train_path = "C:\\Users\\nadew\\10x\\week4\\Rosmann\\rossmann-store-sales\\train.csv"
test_path = "C:\\Users\\nadew\\10x\\week4\\Rosmann\\rossmann-store-sales\\test.csv"
store_path = "C:\\Users\\nadew\\10x\\week4\\Rosmann\\rossmann-store-sales\\store.csv"
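
Absolute paths like these only resolve on one machine. As an alternative, the sketch below builds the same three paths relative to a data directory. `ROSSMANN_DATA_DIR` is a hypothetical environment variable, and the fallback location is an assumption about where the `rossmann-store-sales` folder sits relative to the notebook.

```python
import os
from pathlib import Path

# ROSSMANN_DATA_DIR is a hypothetical environment variable; the fallback
# assumes the data folder sits one level above the notebook.
data_dir = Path(os.environ.get("ROSSMANN_DATA_DIR", "../rossmann-store-sales"))

train_path = data_dir / "train.csv"
test_path = data_dir / "test.csv"
store_path = data_dir / "store.csv"

# Report missing files early instead of failing later inside the loader.
missing = [p.name for p in (train_path, test_path, store_path) if not p.exists()]
if missing:
    print(f"Missing data files in {data_dir}: {missing}")
```

If the loader expects plain strings rather than `Path` objects, wrapping each path in `str()` should be enough.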

In [4]:
eda = DataExplorer(train_path, test_path, store_path)

In [5]:
eda.load_data()

## Descriptive Statistics of the Datasets

In [11]:
eda.test_data.head()

|  | Id | Store | DayOfWeek | Date | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1.0 | 4.0 | 2015-09-17 | 1.0 | 1.0 | 0 | 0.0 | c | a | 1270.0 | 9.0 | 2008.0 | 0.0 | 22.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 1 | 2 | 3.0 | 4.0 | 2015-09-17 | 1.0 | 1.0 | 0 | 0.0 | a | a | 14130.0 | 12.0 | 2006.0 | 1.0 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 2 | 3 | 7.0 | 4.0 | 2015-09-17 | 1.0 | 1.0 | 0 | 0.0 | a | c | 24000.0 | 4.0 | 2013.0 | 0.0 | 22.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 3 | 4 | 8.0 | 4.0 | 2015-09-17 | 1.0 | 1.0 | 0 | 0.0 | a | a | 7520.0 | 10.0 | 2014.0 | 0.0 | 22.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 4 | 5 | 9.0 | 4.0 | 2015-09-17 | 1.0 | 1.0 | 0 | 0.0 | a | c | 2030.0 | 8.0 | 2000.0 | 0.0 | 22.0 | 2012.0 | Jan,Apr,Jul,Oct |


In [12]:
eda.test_data.describe()

|  | Id | Store | DayOfWeek | Open | Promo | SchoolHoliday | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 | 41088.0 |
| mean | 20544.5 | 555.899533 | 3.979167 | 0.854361 | 0.395833 | 0.443487 | 5082.13785 | 7.392523 | 2009.14486 | 0.580607 | 23.408879 | 2011.896028 |
| std | 11861.228267 | 320.274496 | 2.015481 | 0.352748 | 0.489035 | 0.496802 | 7218.27018 | 2.537164 | 5.484756 | 0.493466 | 10.856721 | 1.292403 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 20.0 | 1.0 | 1900.0 | 0.0 | 1.0 | 2009.0 |
| 25% | 10272.75 | 279.75 | 2.0 | 1.0 | 0.0 | 0.0 | 720.0 | 6.0 | 2008.0 | 0.0 | 18.0 | 2011.0 |
| 50% | 20544.5 | 553.5 | 4.0 | 1.0 | 0.0 | 0.0 | 2410.0 | 8.0 | 2010.0 | 1.0 | 22.0 | 2012.0 |
| 75% | 30816.25 | 832.25 | 6.0 | 1.0 | 1.0 | 1.0 | 6435.0 | 9.0 | 2011.0 | 1.0 | 31.0 | 2012.0 |
| max | 41088.0 | 1115.0 | 7.0 | 1.0 | 1.0 | 1.0 | 75860.0 | 12.0 | 2015.0 | 1.0 | 49.0 | 2015.0 |


In [13]:
eda.store_data.head()

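|  | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 1 | 2 | a | a | 570.0 | 11.0 | 2007.0 | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 2 | 3 | a | a | 14130.0 | 12.0 | 2006.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 3 | 4 | c | c | 620.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 4 | 5 | a | a | 29910.0 | 4.0 | 2015.0 | 0 | NaN | NaN | NaN |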


In [14]:
eda.store_data.describe()

|  | Store | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1115.0 | 1112.0 | 761.0 | 761.0 | 1115.0 | 571.0 | 571.0 |
| mean | 558.0 | 5404.901079 | 7.224704 | 2008.668857 | 0.512108 | 23.595447 | 2011.763573 |
| std | 322.01708 | 7663.17472 | 3.212348 | 6.195983 | 0.500078 | 14.141984 | 1.674935 |
| min | 1.0 | 20.0 | 1.0 | 1900.0 | 0.0 | 1.0 | 2009.0 |
| 25% | 279.5 | 717.5 | 4.0 | 2006.0 | 0.0 | 13.0 | 2011.0 |
| 50% | 558.0 | 2325.0 | 8.0 | 2010.0 | 1.0 | 22.0 | 2012.0 |
| 75% | 836.5 | 6882.5 | 10.0 | 2013.0 | 1.0 | 37.0 | 2013.0 |
| max | 1115.0 | 75860.0 | 12.0 | 2015.0 | 1.0 | 50.0 | 2015.0 |
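
The `count` row above shows that several store columns are incomplete. The short sketch below turns those non-null counts into missing-value counts and shares; the numbers are copied from the output above, and the variable names are illustrative.

```python
# Non-null counts copied from the store_data.describe() output above.
total_rows = 1115
non_null = {
    "CompetitionDistance": 1112,
    "CompetitionOpenSinceMonth": 761,
    "CompetitionOpenSinceYear": 761,
    "Promo2SinceWeek": 571,
    "Promo2SinceYear": 571,
}

# Missing values per column = total rows minus non-null entries.
missing_counts = {column: total_rows - count for column, count in non_null.items()}

for column, missing in missing_counts.items():
    print(f"{column}: {missing} missing ({missing / total_rows:.1%})")
```

The `Promo2` mean of 0.512108 over 1115 stores corresponds to about 571 participating stores, which equals the non-null count of `Promo2SinceWeek` and `Promo2SinceYear`. This is consistent with those fields being empty only for stores where `Promo2` is 0. On the live DataFrame, `eda.store_data.isna().sum()` would give the same counts directly.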


In [15]:
eda.train_data.head()

|  | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 5.0 | 2015-07-31 | 5263 | 555 | 1.0 | 1.0 | 0 | 1.0 | c | a | 1270.0 | 9.0 | 2008.0 | 0.0 | 22.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 1 | 2.0 | 5.0 | 2015-07-31 | 6064 | 625 | 1.0 | 1.0 | 0 | 1.0 | a | a | 570.0 | 11.0 | 2007.0 | 1.0 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 2 | 3.0 | 5.0 | 2015-07-31 | 8314 | 821 | 1.0 | 1.0 | 0 | 1.0 | a | a | 14130.0 | 12.0 | 2006.0 | 1.0 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 3 | 4.0 | 5.0 | 2015-07-31 | 13995 | 1498 | 1.0 | 1.0 | 0 | 1.0 | c | c | 620.0 | 9.0 | 2009.0 | 0.0 | 22.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 5 | 6.0 | 5.0 | 2015-07-31 | 5651 | 589 | 1.0 | 1.0 | 0 | 1.0 | a | a | 310.0 | 12.0 | 2013.0 | 0.0 | 22.0 | 2012.0 | Jan,Apr,Jul,Oct |


In [16]:
eda.train_data.describe()

|  | Store | DayOfWeek | Sales | Customers | Open | Promo | SchoolHoliday | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 | 970900.0 |
| mean | 558.081937 | 4.010365 | 5570.332094 | 600.430271 | 0.826499 | 0.37817 | 0.178281 | 4789.013338 | 7.513335 | 2009.336002 | 0.51594 | 22.593393 | 2011.867761 |
| std | 322.890286 | 1.99629 | 3518.951901 | 387.768543 | 0.37868 | 0.484931 | 0.382749 | 5864.239882 | 2.665871 | 3.427028 | 0.499746 | 10.120274 | 1.20431 |
| min | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 1.0 | 1995.0 | 0.0 | 1.0 | 2009.0 |
| 25% | 277.0 | 2.0 | 3674.0 | 399.0 | 1.0 | 0.0 | 0.0 | 710.0 | 6.0 | 2008.0 | 0.0 | 18.0 | 2012.0 |
| 50% | 559.0 | 4.0 | 5681.0 | 603.0 | 1.0 | 0.0 | 0.0 | 2320.0 | 8.0 | 2010.0 | 1.0 | 22.0 | 2012.0 |
| 75% | 841.0 | 6.0 | 7720.0 | 822.0 | 1.0 | 1.0 | 0.0 | 6470.0 | 9.0 | 2011.0 | 1.0 | 22.0 | 2012.0 |
| max | 1115.0 | 7.0 | 17323.0 | 2026.0 | 1.0 | 1.0 | 1.0 | 27650.0 | 12.0 | 2015.0 | 1.0 | 50.0 | 2015.0 |
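
The objectives list outlier handling, but the rule applied by this notebook's cleaning step is not shown. One common choice, assumed here purely for illustration, is Tukey's 1.5 × IQR fence. The sketch computes it from the `Sales` quartiles reported above.

```python
# Sales quartiles copied from the train_data.describe() output above.
q1, q3 = 3674.0, 7720.0

# Tukey's rule: values beyond 1.5 * IQR from the quartiles are flagged.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, fences = [{lower_fence}, {upper_fence}]")
```

Since sales cannot be negative, only the upper fence matters in practice. The reported maximum of 17323 lies above it, so a filter of this kind would drop some of the busiest days; whether that is desirable is a modeling decision rather than a given.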


### Data Analysis


In [6]:
eda.merge_data()
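
The internals of `merge_data()` are not shown in this notebook. A common approach, and the one assumed in this sketch, is a left join of the daily sales rows with the store metadata on `Store`. The two small frames below reuse values from the outputs above.

```python
import pandas as pd

# Two sales rows and their store metadata, taken from the outputs above.
train = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2015-07-31", "2015-07-31"],
    "Sales": [5263, 6064],
})
store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})

# Left join keeps every sales row; validate raises if a store appears
# more than once in the metadata, which would duplicate sales rows.
merged = train.merge(store, on="Store", how="left", validate="many_to_one")
print(merged)
```

Checking that `len(merged) == len(train)` after the join is a cheap way to confirm that no sales rows were duplicated or lost.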

In [7]:
eda.clean_data()
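
How `clean_data()` treats missing values is not shown. The sketch below illustrates one simple strategy, assumed here rather than taken from the class: drop duplicate rows, fill numeric gaps with the column median, and fill text gaps with the most frequent value. The sample rows come from the `store_data.head()` output above.

```python
import pandas as pd

# Rows copied from the store_data.head() output above (None marks a missing value).
store = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "Promo2SinceWeek": [None, 13.0, 14.0, None, None],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", None, None],
})

cleaned = store.drop_duplicates().copy()
for column in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[column]):
        # Numeric columns: fill gaps with the median of the observed values.
        cleaned[column] = cleaned[column].fillna(cleaned[column].median())
    else:
        # Text columns: fill gaps with the most frequent observed value.
        cleaned[column] = cleaned[column].fillna(cleaned[column].mode().iloc[0])

print(cleaned)
```

Filling `Promo2SinceWeek` and `PromoInterval` for stores that never joined `Promo2` creates values with no real meaning. An alternative worth considering is a sentinel such as 0 combined with the existing `Promo2` flag.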

In [8]:
eda.analyze_data()

Please click <a href="https://github.com/chapi1420/Rosmann_Pharmaceuticals/tree/master/notebooks/results%20analyse_data">here</a> to see the results of this notebook.