# Homework Assignment: Data Aggregation with Pandas


## Introduction
In this assignment, you will apply data aggregation techniques using the Pandas library. You will perform groupings and apply window functions to a sales dataset.



## Objectives
- Practice using the `groupby` function for data aggregation.
- Understand and apply rolling window functions.
- Explore expanding window functions for cumulative statistics.


**Load Data**: `sales_region_hw.csv`

In [35]:
import pandas as pd
sales_df = pd.read_csv('sales_region_hw.csv')

In [36]:
sales_df.head()

,Date,Region,Category,Sales,Quantity
0,2021-02-14,East,Furniture,715,33
1,2021-02-17,South,Electronics,59,15
2,2021-04-28,West,Furniture,955,62
3,2021-03-06,North,Furniture,353,88
4,2021-03-09,South,Electronics,579,33


In [37]:
# 500 entries with no missing values. 'Sales' and 'Quantity' are int64; 'Date' is stored as object (string), not datetime.
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Date      500 non-null    object
 1   Region    500 non-null    object
 2   Category  500 non-null    object
 3   Sales     500 non-null    int64 
 4   Quantity  500 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 19.7+ KB
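
The `info()` output lists `Date` as `object`, meaning the dates are plain strings. Below is a minimal sketch, on a small made-up frame rather than the assignment file, of how such a column could be converted so that chronological sorting and date-based windows become possible. The frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy data standing in for the assignment file (values are illustrative only).
toy = pd.DataFrame({
    "Date": ["2021-02-14", "2021-01-10", "2021-03-09"],
    "Sales": [715, 484, 730],
})

# Parse the date strings into datetime values.
toy["Date"] = pd.to_datetime(toy["Date"])

# With a datetime column, rows sort chronologically and the .dt accessor is available.
toy = toy.sort_values("Date").reset_index(drop=True)
print(toy.dtypes)
print(toy["Date"].dt.month.tolist())
```

The same conversion can also be requested at load time with the `parse_dates` argument of `pd.read_csv`.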


In [38]:
sales_df.describe()

,Sales,Quantity
count,500.0,500.0
mean,490.934,50.952
std,280.336231,28.596665
min,10.0,1.0
25%,251.0,25.75
50%,476.0,50.0
75%,725.5,76.0
max,995.0,99.0


### Task 1: Group By Region and Category
- Perform the following operations and answer the questions below:

In [39]:
# 1. Group the data by 'Region' and 'Category'. What is the total sales amount for each group?
# Note: the line below groups by 'Region' only, so region_sales holds one total per region (4 rows).
# Per-combination totals would need both keys: groupby(['Region', 'Category']).
region_sales = sales_df.groupby('Region')['Sales'].sum()
region_sales

Region,Sales
East,63757
North,61575
South,60250
West,59885
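
Question 1 asks for a total per 'Region' and 'Category' combination. A minimal sketch of grouping on both keys is shown here on a made-up frame (`toy` and its numbers are assumptions, not the assignment data):

```python
import pandas as pd

# Illustrative data only; not taken from sales_region_hw.csv.
toy = pd.DataFrame({
    "Region": ["East", "East", "East", "West", "West"],
    "Category": ["Furniture", "Furniture", "Clothing", "Clothing", "Clothing"],
    "Sales": [100, 50, 30, 20, 40],
})

# Passing a list of columns groups on every observed (Region, Category) pair.
combo_sales = toy.groupby(["Region", "Category"])["Sales"].sum()
print(combo_sales)

# unstack() pivots the inner level into columns, which is often easier to read.
print(combo_sales.unstack(fill_value=0))
```

The result is a Series with a two-level index, so a single total can be read with a tuple such as `combo_sales[("East", "Furniture")]`.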


In [40]:
# 2. Which 'Region' and 'Category' combination has the highest average quantity sold?
region_category_avg = sales_df.groupby(['Region', 'Category'])['Quantity'].mean()
region_category_avg

Region,Category,Quantity
East,Clothing,46.162162
East,Electronics,55.0
East,Furniture,53.583333
East,Groceries,49.380952
North,Clothing,63.935484
North,Electronics,57.148148
North,Furniture,49.447368
North,Groceries,44.482759
South,Clothing,52.793103
South,Electronics,46.102564


In [41]:
# .idxmax() returns the index label of the largest value, here the ('Region', 'Category') pair with the highest average quantity
region_category_avg.idxmax()

('North', 'Clothing')
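
To see why `idxmax()` returns a tuple here, the following sketch repeats the pattern on a made-up frame (all names and values are assumptions for illustration). With a two-level index, the label of the maximum is the pair of group keys.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Category": ["Clothing", "Clothing", "Electronics", "Clothing"],
    "Quantity": [60, 70, 40, 50],
})

avg_qty = toy.groupby(["Region", "Category"])["Quantity"].mean()

best_combo = avg_qty.idxmax()   # label of the largest mean, a (Region, Category) tuple
best_value = avg_qty.max()      # the largest mean itself

# Sorting shows the full ranking, not just the winner.
ranked = avg_qty.sort_values(ascending=False)
print(best_combo, best_value)
print(ranked)
```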

In [42]:
# 3. How many unique 'Category' entries are there for each 'Region'?
# nunique() counts the distinct 'Category' values within each 'Region'. Every region has 4.
region_category_unique = sales_df.groupby('Region')['Category'].nunique()

In [43]:
region_category_unique

Region,Category
East,4
North,4
South,4
West,4


In [44]:
# 4. For each 'Region', what is the maximum sales value for 'Clothing'?
# The boolean mask sales_df['Category'] == 'Clothing' keeps only the Clothing rows; groupby('Region') then gives the maximum 'Sales' per region.
sales_df[sales_df['Category'] == 'Clothing'].groupby('Region')['Sales'].max()


Region,Sales
East,970
North,994
South,995
West,961
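
The filter-then-group pattern above can be cross-checked by grouping on both keys first and then selecting one category. This is a sketch on made-up data (the frame and its values are assumptions); both routes should give the same per-region maxima.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Category": ["Clothing", "Clothing", "Clothing", "Furniture", "Furniture"],
    "Sales": [10, 90, 55, 999, 500],
})

# Route 1: keep only Clothing rows, then take the max per region.
mask = toy["Category"] == "Clothing"
clothing_max = toy[mask].groupby("Region")["Sales"].max()

# Route 2: max for every (Region, Category) pair, then pick the Clothing slice.
all_max = toy.groupby(["Region", "Category"])["Sales"].max()
clothing_max_alt = all_max.xs("Clothing", level="Category")

print(clothing_max)
print(clothing_max_alt)
```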


In [45]:
# 5. Calculate the total 'Quantity' for each 'Category' across all 'Regions'. Which 'Category' has the highest total quantity?
# Summing 'Quantity' per 'Category' shows that Electronics has the highest total quantity, 6529.
region_category_quantity = sales_df.groupby('Category')['Quantity'].sum()

In [46]:
region_category_quantity

Category,Quantity
Clothing,6316
Electronics,6529
Furniture,6220
Groceries,6411


In [47]:
region_category_quantity.idxmax()

'Electronics'

### Task 2: Rolling Window Function
- Perform the following operations and answer the questions below:

In [48]:
# 1. Calculate a 7-day rolling average of 'Sales'. On which date does the East region reach its highest 7-day rolling average of sales?

# Create a new column '7_day_rolling_sales_avg' holding the per-region rolling mean of 'Sales'.
# min_periods=1 means the first row of each region has a rolling average equal to its own 'Sales' (see row 0).
# Caveats: window=7 counts 7 rows, not 7 calendar days, and the data is not sorted by 'Date'.
# reset_index(drop=True) also discards the original row labels, so the values are written back in
# region-grouped order and do not line up with their rows (row 1 is South but holds an East value).
sales_df['7_day_rolling_sales_avg'] = sales_df.groupby('Region')['Sales'].rolling(window=7, min_periods=1).mean().reset_index(drop=True)

In [49]:
sales_df.head(10)

,Date,Region,Category,Sales,Quantity,7_day_rolling_sales_avg
0,2021-02-14,East,Furniture,715,33,715.0
1,2021-02-17,South,Electronics,59,15,722.5
2,2021-04-28,West,Furniture,955,62,549.666667
3,2021-03-06,North,Furniture,353,88,434.5
4,2021-03-09,South,Electronics,579,33,364.4
5,2021-03-09,East,Furniture,730,57,450.5
6,2021-04-14,North,Furniture,51,29,519.428571
7,2021-01-10,North,Furniture,484,44,465.0
8,2021-03-25,North,Groceries,762,98,391.142857
9,2021-01-22,East,Furniture,204,51,427.714286
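
A per-region rolling mean only lines up with its rows if the original row labels are kept when the result is written back. The sketch below uses made-up data (the frame, column names `rolling_7_rows` and `day_roll`, and all values are assumptions) to show two things: dropping only the group level of the index so the assignment aligns, and a calendar-based `"7D"` window, which needs a sorted datetime index.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05", "2021-01-20"],
    "Region": ["East", "West", "East", "West", "East"],
    "Sales": [100, 10, 200, 30, 600],
})
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.sort_values("Date").reset_index(drop=True)

# Row-based window: the result is indexed by (Region, original row label).
row_roll = toy.groupby("Region")["Sales"].rolling(window=7, min_periods=1).mean()

# Drop only the Region level; the remaining labels match toy's index,
# so each value lands on the row it was computed for.
toy["rolling_7_rows"] = row_roll.reset_index(level=0, drop=True)

# Calendar-based window: "7D" covers the 7 days ending at each row's date.
day_roll = (
    toy.set_index("Date")
       .groupby("Region")["Sales"]
       .rolling("7D")
       .mean()
)

print(toy)
print(day_roll)
```

In this toy data the two windows disagree for the last East row: the row-based window still averages all three East sales, while the `"7D"` window contains only the sale on 2021-01-20.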


In [50]:
# Narrow the data down to the East region only
east_region = sales_df[sales_df['Region'] == 'East']

In [51]:
# Sort East rows by rolling average, highest first. Note that .head() runs before the sort, so only the first five East rows are ranked.
east_region.head().sort_values(by='7_day_rolling_sales_avg', ascending=False)


,Date,Region,Category,Sales,Quantity,7_day_rolling_sales_avg
0,2021-02-14,East,Furniture,715,33,715.0
18,2021-03-07,East,Electronics,89,94,528.142857
19,2021-04-13,East,Groceries,84,23,460.571429
5,2021-03-09,East,Furniture,730,57,450.5
9,2021-01-22,East,Furniture,204,51,427.714286


In [52]:
max_rolling_avg_date = east_region.loc[east_region['7_day_rolling_sales_avg'] == east_region['7_day_rolling_sales_avg'].max()]['Date']


In [53]:
# The East region's highest rolling average is 715.0, on 2021-02-14 (row 0).
# Because min_periods=1, that value is the 'Sales' of a single row rather than an average over 7 observations.
max_rolling_avg_date

,Date
0,2021-02-14
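
Looking up the date of the maximum can also be done with `idxmax()` followed by `.loc`, which avoids building a boolean mask. The sketch below uses a made-up frame (`east`, the column name `rolling_avg`, and all values are assumptions), and also shows sorting the whole frame before taking the top rows.

```python
import pandas as pd

# Illustrative data only; index labels mimic non-contiguous row numbers.
east = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-01-22", "2021-02-14", "2021-03-07", "2021-03-09"]),
        "rolling_avg": [200.0, 450.0, 300.0, 400.0],
    },
    index=[10, 11, 12, 13],
)

# idxmax() gives the row label of the largest rolling average ...
peak_label = east["rolling_avg"].idxmax()

# ... and .loc reads the matching date and value from that row.
peak_date = east.loc[peak_label, "Date"]
peak_value = east.loc[peak_label, "rolling_avg"]

# Sorting first and then calling head() ranks every row, not just the first few.
top3 = east.sort_values("rolling_avg", ascending=False).head(3)

print(peak_label, peak_date.date(), peak_value)
print(top3)
```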


In [54]:
# 2. What is the overall average of the 7-day rolling sales amounts for each region?
# Taking .mean() of '7_day_rolling_sales_avg' within each region gives the overall average of the rolling values per region.
sales_df.groupby('Region')['7_day_rolling_sales_avg'].mean()

Region,7_day_rolling_sales_avg
East,500.035714
North,501.299371
South,485.847804
West,483.728614
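
The objectives also list expanding windows for cumulative statistics. As a minimal sketch on made-up data (the frame and the column names `expanding_avg` and `cumulative_sales` are assumptions), an expanding window grows from the first row of each group up to the current row, so its mean is a running average.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    "Region": ["East", "West", "East", "West"],
    "Sales": [100, 10, 300, 30],
}).sort_values("Date")

# Running average of Sales within each region.
exp = toy.groupby("Region")["Sales"].expanding().mean()

# Drop the Region level so the values align with toy's original row labels.
toy["expanding_avg"] = exp.reset_index(level=0, drop=True)

# A running total per region; cumsum() already returns values aligned to the rows.
toy["cumulative_sales"] = toy.groupby("Region")["Sales"].cumsum()

print(toy)
```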

