In [2]:
import pandas as pd
sales = pd.read_csv("sales_subset.csv")

In [4]:
# Print the head of the sales DataFrame
print(sales.head())

# Print info about the sales DataFrame (info() prints directly and returns None)
sales.info()

# Print the mean of weekly_sales
print(sales["weekly_sales"].mean())

# Print the median of weekly_sales
print(sales["weekly_sales"].median())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------

In [5]:
# Print the maximum of the date column
print(sales['date'].max())

# Print the minimum of the date column
print(sales['date'].min())


2012-10-26
2010-02-05


# Exploring Sales Data

The dataset contains retail sales information with 10774 entries spanning from 2010-02-05 to 2012-10-26. The main columns are:

| Feature | Description |
|---------|-------------|
| store | Store identifier number |
| type | Store classification type |
| department | Department identifier number |
| date | Date of sales record |
| weekly_sales | Sales amount for the week |
| is_holiday | Whether the week includes a holiday |
| temperature_c | Average temperature in Celsius |
| fuel_price_usd_per_l | Fuel price in USD per liter |
| unemployment | Unemployment rate |

Weekly sales vary widely: even the first five rows, all from store 1, department 1, range from about $17,400 to over $57,000. This variation may reflect seasonality or store-specific patterns, which can be examined with aggregation methods like `.agg()`.
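
As a minimal sketch of how `.agg()` can summarize that spread, the snippet below rebuilds the five rows printed by `sales.head()` (two columns only) and computes several statistics of `weekly_sales` in one call. The name `sample` is a hypothetical stand-in for `sales`, so the results describe those five rows, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for `sales`: the five rows printed by sales.head()
sample = pd.DataFrame({
    "date": ["2010-02-05", "2010-03-05", "2010-04-02", "2010-05-07", "2010-06-04"],
    "weekly_sales": [24924.50, 21827.90, 57258.43, 17413.94, 17558.09],
})

# A list of function names returns one labelled result per statistic
summary = sample["weekly_sales"].agg(["min", "max", "mean", "median"])
print(summary)
```

Passing names as strings keeps the call short; a custom function can be mixed into the same list when no built-in statistic fits.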

In [6]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales["temperature_c"].agg(iqr))

16.583333333333336
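
One way to sanity-check a custom aggregation like `iqr` is to compare it with an independent calculation. The sketch below does this on the five `temperature_c` values shown by `sales.head()`, using `np.percentile` as the reference. The name `temps` is hypothetical, and the result applies to those five values only, not to the full column.

```python
import numpy as np
import pandas as pd

def iqr(column):
    # Interquartile range: 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical sample: temperature_c values from the first five rows
temps = pd.Series([5.727778, 8.055556, 16.816667, 22.527778, 27.050000],
                  name="temperature_c")

# .agg() passes the whole Series to the custom function
iqr_pandas = temps.agg(iqr)

# Independent check with NumPy (same default linear interpolation)
q75, q25 = np.percentile(temps, [75, 25])
iqr_numpy = q75 - q25

print(iqr_pandas, iqr_numpy)
```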


In [7]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64


In [8]:
# Custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
# The string "median" replaces np.median, which triggers a FutureWarning in newer pandas
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, "median"]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099




# Cumulative Statistics

Cumulative statistics help track summary statistics over time, providing insights into data evolution and trends. With cumulative calculations, we can:

- Identify total sales to date (cumulative sum)
- Track highest weekly sales reached (cumulative max)
- Monitor performance patterns over time
- Detect seasonal trends or anomalies

For retail data analysis, these metrics are particularly valuable for understanding sales momentum, identifying peak periods, and forecasting future performance.
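
To make the two cumulative statistics concrete, here is a small sketch on the five rows printed by `sales.head()`, which all belong to store 1, department 1. The name `sample` is a hypothetical stand-in for a filtered DataFrame, and the figures cover those five rows only.

```python
import pandas as pd

# Hypothetical stand-in: the five store 1, department 1 rows from sales.head()
sample = pd.DataFrame({
    "date": ["2010-02-05", "2010-03-05", "2010-04-02", "2010-05-07", "2010-06-04"],
    "weekly_sales": [24924.50, 21827.90, 57258.43, 17413.94, 17558.09],
})

# Sort first: cumulative results depend on row order
# (ISO-formatted date strings sort chronologically)
sample = sample.sort_values("date")

# Running total of sales to date
sample["cum_weekly_sales"] = sample["weekly_sales"].cumsum()

# Highest weekly sales seen so far
sample["cum_max_sales"] = sample["weekly_sales"].cummax()

print(sample)
```

Both columns are computed from the sorted frame itself, so each value lines up with the row's position in date order.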

In the next cell, we'll analyze department 1 of store 1 by filtering the sales data and calculating cumulative statistics to reveal sales patterns over time.

In [12]:
# Filter for department 1 of store 1, then sort by date
sales_1_1 = sales[(sales["store"] == 1) & (sales["department"] == 1)].sort_values("date")

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
# (computed on sales_1_1 so the running total follows date order)
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
