<a href="https://www.kaggle.com/code/ameymane/portfolio-project-sales-analysis-around-holidays?scriptVersionId=115295918" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

Scenario:
We have a dataset of major holidays around the world and sales data from an e-commerce business. <br>
We need to devise a marketing strategy for next year based on holidays that bring in the most revenue. 

# Initializing the Data

## Importing and Reading

In [1]:
print("Importing all the useful libraries and packages.")
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import calendar
from datetime import datetime as dt
pd.options.plotting.backend = "plotly"
from IPython.core.interactiveshell import InteractiveShell

Importing all the useful libraries and packages.




In [2]:
print("Loading the CSV files into pandas DF.")
ecom_df = pd.read_csv("/kaggle/input/salesandholidaydata/Ecommerce_Data.csv")
holiday_df = pd.read_csv("/kaggle/input/salesandholidaydata/US_Holiday_Dates_(2004-2021).csv")

Loading the CSV files into pandas DF.


In [3]:
# Display output from all lines in the code
InteractiveShell.ast_node_interactivity = "all"

print("Ecommerce DF: \n")
ecom_df.head()
ecom_df.info()
ecom_df.describe()

Ecommerce DF: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81601 entries, 0 to 81600
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   81601 non-null  int64  
 1   InvoiceNo    81601 non-null  object 
 2   StockCode    81601 non-null  object 
 3   Description  81601 non-null  object 
 4   Quantity     81601 non-null  int64  
 5   UnitPrice    81601 non-null  float64
 6   CustomerID   81601 non-null  int64  
 7   Country      81601 non-null  object 
 8   Date         81601 non-null  object 
 9   Hour         81601 non-null  int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 6.2+ MB


| | Unnamed: 0 | Quantity | UnitPrice | CustomerID | Hour |
|---|---|---|---|---|---|
| count | 81601.0 | 81601.0 | 81601.0 | 81601.0 | 81601.0 |
| mean | 278271.366772 | 11.965736 | 3.168721 | 15283.816215 | 12.729783 |
| std | 152483.054308 | 45.782018 | 18.731668 | 1713.292081 | 2.288777 |
| min | 2.0 | -3114.0 | 0.0 | 12347.0 | 6.0 |
| 25% | 148283.0 | 2.0 | 1.25 | 13949.0 | 11.0 |
| 50% | 284742.0 | 5.0 | 1.95 | 15144.0 | 13.0 |
| 75% | 409445.0 | 12.0 | 3.75 | 16790.0 | 14.0 |
| max | 541908.0 | 3186.0 | 4287.63 | 18287.0 | 20.0 |


The `Unnamed: 0` column conveys no useful information and it is not possible to tell what it stands for, so I've decided to drop it.

In [4]:
ecom_df.drop(labels="Unnamed: 0", axis=1, inplace=True)
print("The columns now in Ecommerce DF are: \n")
ecom_df.columns

The columns now in Ecommerce DF are: 



Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice',
       'CustomerID', 'Country', 'Date', 'Hour'],
      dtype='object')
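An `Unnamed: 0` column typically appears when a DataFrame was saved together with its index and then read back, so it may simply be a leftover row index from the source data. The sketch below uses a small made-up CSV (not the real file) to show how the column arises and how `index_col=0` avoids it at load time.

```python
import io
import pandas as pd

# Made-up stand-in for the real CSV. The leading empty header is how
# DataFrame.to_csv() writes an unnamed index.
csv_text = ",InvoiceNo,Quantity\n2,536551,1\n7,536412,2\n"

# Default read: the unnamed first column surfaces as "Unnamed: 0"
with_unnamed = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead of dropping it afterwards
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_unnamed.columns.tolist())
print(as_index.columns.tolist())
```

Whether to keep the values as an index or drop them depends on whether they are needed later; this notebook drops them.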

In [5]:
print("Holiday DF: \n")
holiday_df.head()
holiday_df.info()
holiday_df.describe()

Holiday DF: 



| | Date | Holiday | WeekDay | Month | Day | Year |
|---|---|---|---|---|---|---|
| 0 | 2004-07-04 | 4th of July | Sunday | 7 | 4 | 2004 |
| 1 | 2005-07-04 | 4th of July | Monday | 7 | 4 | 2005 |
| 2 | 2006-07-04 | 4th of July | Tuesday | 7 | 4 | 2006 |
| 3 | 2007-07-04 | 4th of July | Wednesday | 7 | 4 | 2007 |
| 4 | 2008-07-04 | 4th of July | Friday | 7 | 4 | 2008 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 342 entries, 0 to 341
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     342 non-null    object
 1   Holiday  342 non-null    object
 2   WeekDay  342 non-null    object
 3   Month    342 non-null    int64 
 4   Day      342 non-null    int64 
 5   Year     342 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 16.2+ KB


| | Month | Day | Year |
|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 |
| mean | 7.263158 | 15.853801 | 2012.5 |
| std | 3.899889 | 9.65333 | 5.195729 |
| min | 1.0 | 1.0 | 2004.0 |
| 25% | 4.0 | 6.0 | 2008.0 |
| 50% | 9.0 | 16.5 | 2012.5 |
| 75% | 11.0 | 24.0 | 2017.0 |
| max | 12.0 | 31.0 | 2021.0 |


## Checking for null values

In [6]:
print("For Ecommerce DF: \n")
ecom_df.isna().any()
print("\n\nFor Holiday DF: \n")
holiday_df.isna().any()

For Ecommerce DF: 



InvoiceNo      False
StockCode      False
Description    False
Quantity       False
UnitPrice      False
CustomerID     False
Country        False
Date           False
Hour           False
dtype: bool



For Holiday DF: 



Date       False
Holiday    False
WeekDay    False
Month      False
Day        False
Year       False
dtype: bool

Neither dataset contains null values, so we can proceed with the analysis.
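`isna().any()` only says whether a column has missing values. If the count matters, `isna().sum()` gives it per column. A minimal sketch on a made-up frame (the column names mirror the dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, for illustration only
toy_df = pd.DataFrame({"Quantity": [1, 2, 3], "UnitPrice": [4.95, np.nan, 2.10]})

null_counts = toy_df.isna().sum()      # missing values per column
total_nulls = int(null_counts.sum())   # missing values overall

print(null_counts)
print("Total missing values:", total_nulls)
```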

## Fixing the date formatting

In [7]:
ecom_df["Date"] = pd.to_datetime(ecom_df["Date"], format='%Y-%m-%d')
holiday_df["Date"] = pd.to_datetime(holiday_df["Date"], format='%Y-%m-%d')
print("For Ecommerce DF: ")
ecom_df[["Date"]].dtypes

print("\n\nFor Holiday DF: ")
holiday_df[["Date"]].dtypes

For Ecommerce DF: 


Date    datetime64[ns]
dtype: object



For Holiday DF: 


Date    datetime64[ns]
dtype: object
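With an explicit `format`, `pd.to_datetime` raises an error on any value that does not match. When the input is less clean, `errors="coerce"` can be used to turn bad values into `NaT` and count them afterwards. The sketch below uses made-up date strings, one of them deliberately invalid.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid
raw_dates = pd.Series(["2010-12-01", "2011-02-14", "2011-13-45"])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)
print("Unparseable rows:", int(parsed.isna().sum()))
```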

# Analysing the Overall Data

## ROCCC Analysis

The ROCCC analysis is done to validate the authenticity of the data and understand how reliable and comprehensive it is to form meaningful conclusions from it.

1. Reliability: The data originates from an e-commerce platform and was collected internally through various tools, so it is highly reliable.
2. Originality: Since it is internal data, we can be confident that it is original and specific to this business.
3. Comprehensive: The data spans roughly one year, from December 2010 to December 2011. More years would be needed to distinguish one-off trends from established ones.
4. Current: The data is more than 11 years old and hence may not reflect current trends accurately.
5. Cited: No citation is needed as the data is internal.
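The time span mentioned under "Comprehensive" can be checked directly from the `Date` column. The sketch below uses three made-up dates as a stand-in for `ecom_df["Date"]`; on the real data the same `min()` and `max()` calls would apply.

```python
import pandas as pd

# Made-up dates standing in for ecom_df["Date"]
toy_dates = pd.Series(pd.to_datetime(["2010-12-01", "2011-06-15", "2011-12-08"]))

first_day, last_day = toy_dates.min(), toy_dates.max()
span_days = (last_day - first_day).days

print(f"{first_day.date()} to {last_day.date()} ({span_days} days)")
```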

## Order Distribution Across the Globe

In [8]:
print("Total number of countries in the dataset: ", ecom_df["Country"].nunique())

Total number of countries in the dataset:  37


There seem to be orders from many different countries. It looks like the E-commerce business operates worldwide. Let us find out the distribution of orders by country so we can narrow our analysis to specific regions for a more effective marketing strategy.

In [9]:
# Display output from only the last expression in the code (Default Behaviour)
InteractiveShell.ast_node_interactivity = "last_expr"

In [10]:
# Distribution of Orders by Country Bar Chart
orders_by_country = ecom_df.groupby("Country")["Quantity"].count().sort_values(ascending=False)

orders_graph = px.histogram(x=orders_by_country.index, y=orders_by_country.values)

orders_graph.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Distribution of Orders Worldwide",
                            y = 0.95,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'
                      ))
orders_graph.update_xaxes(title=None, tickangle=45)
orders_graph.update_yaxes(title="Total Number of Orders")
orders_graph.show()

In [11]:
# Distribution of Orders by Country Pie Chart
orders_pie = px.pie(orders_by_country, values=orders_by_country.values, 
                    names=orders_by_country.index)
orders_pie.update_traces(textposition='inside', textinfo='percent+label')
orders_pie.update_layout(uniformtext_minsize=16, uniformtext_mode='hide')
orders_pie.update_layout(titlefont=dict(size=20, color='black'), showlegend=False,
                        title=dict(
                            text = "Distribution of Orders Worldwide",
                            y = 0.95,
                            x = 0.5,
                            xanchor = 'center',
                            yanchor = 'top'))

orders_pie.show()

The e-commerce company does most of its business in the UK: the vast majority of orders come from there, with only a few from other countries. <br> <br>
Since the company's presence is concentrated in and around the UK (European region), we will focus our analysis on the top countries there and drop the rest.
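The charts show the split visually. To put numbers on it, `value_counts(normalize=True)` returns each country's share of rows. A minimal sketch on a made-up country column (the 80/10/10 split is invented for illustration and is not the real distribution):

```python
import pandas as pd

# Made-up country column standing in for ecom_df["Country"]
toy_countries = pd.Series(["United Kingdom"] * 8 + ["Germany", "France"])

# normalize=True returns fractions of rows instead of raw counts
country_share = toy_countries.value_counts(normalize=True) * 100

print(country_share.round(1))
```

Note that each row in the dataset is one invoice line, so these shares describe line items rather than distinct invoices.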

In [12]:
top_list = ecom_df.value_counts("Country").index[:5].tolist()
print("The top 5 countries in which the business has a significant presence are: ", ", ".join(top_list), "\n\n")

# Filtering orders from top countries; .copy() avoids a SettingWithCopyWarning
# when new columns are added to eu_ecom_df later
eu_ecom_df = ecom_df[ecom_df["Country"].isin(top_list)].copy()
eu_ecom_df.info()

The top 5 countries in which the business has a significant presence are:  United Kingdom, Germany, France, EIRE, Spain 


<class 'pandas.core.frame.DataFrame'>
Int64Index: 78208 entries, 0 to 81600
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    78208 non-null  object        
 1   StockCode    78208 non-null  object        
 2   Description  78208 non-null  object        
 3   Quantity     78208 non-null  int64         
 4   UnitPrice    78208 non-null  float64       
 5   CustomerID   78208 non-null  int64         
 6   Country      78208 non-null  object        
 7   Date         78208 non-null  datetime64[ns]
 8   Hour         78208 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 6.0+ MB


## Orders per Country

In [13]:
country_df = eu_ecom_df.groupby(["Date", "Country", "Description"]).agg({
    "Quantity": pd.Series.sum
})
country_df.reset_index(inplace=True)

In [14]:
orders_from_countries = px.scatter(country_df, x="Date", y="Quantity", 
                                color="Description",
                                color_discrete_sequence=px.colors.qualitative.G10,
                                facet_row="Country", facet_col_wrap=2,
                                  category_orders={
                                      "Country": top_list
                                  })

orders_from_countries.for_each_annotation(lambda a: a.update(text=a.text.replace("Country=", "")))

orders_from_countries.update_layout(titlefont=dict(size=20, color='black'),
                      autosize=False, height=1600, width=1800, showlegend=False,
                      title=dict(
                          text = "Orders For Items Per Country",
                          y = 0.995,
                          x = 0.5,
                          xanchor = 'center',
                          yanchor = 'top'
                          ))

orders_from_countries.show()

**NOTE:** Hover over the dots to reveal the item description, quantity and date of purchase from each country.

### Insights:
1. As stated previously, it is clear that most orders are from the UK.
2. We see a rise in Christmas item orders for Ireland, France, Spain and Germany starting from November.
3. Summer season orders for Spain mostly include Greeting Cards and Cake/Baking related items.
4. For Germany and France, summer season orders mostly include fancy Cutlery and Cake/Baking related items.

## Trends Related to Customer Spending

We will add a column named `MoneySpent` to show the amount of money spent by the customer for a particular item. <br>`MoneySpent` = `Quantity` * `UnitPrice` <br>This will also help us quantify data regarding customer spending habits.

In [15]:
eu_ecom_df["MoneySpent"] = eu_ecom_df["Quantity"] * eu_ecom_df["UnitPrice"]
eu_ecom_df.head()

| | InvoiceNo | StockCode | Description | Quantity | UnitPrice | CustomerID | Country | Date | Hour | MoneySpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536551 | 22112 | CHOCOLATE HOT WATER BOTTLE | 1 | 4.95 | 17346 | United Kingdom | 2010-12-01 | 14 | 4.95 |
| 1 | 536412 | 22900 | SET 2 TEA TOWELS I LOVE LONDON | 2 | 2.95 | 17920 | United Kingdom | 2010-12-01 | 11 | 5.9 |
| 2 | 536562 | 22313 | OFFICE MUG WARMER PINK | 6 | 2.95 | 13468 | United Kingdom | 2010-12-01 | 15 | 17.7 |
| 3 | 536528 | 22865 | HAND WARMER OWL DESIGN | 1 | 2.1 | 15525 | United Kingdom | 2010-12-01 | 13 | 2.1 |
| 4 | 536378 | 21975 | PACK OF 60 DINOSAUR CAKE CASES | 24 | 0.55 | 14688 | United Kingdom | 2010-12-01 | 9 | 13.2 |
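Adding a column to a filtered DataFrame can trigger pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the filtered frame is a view or a copy. A minimal sketch of the same `MoneySpent` calculation on made-up rows, taking an explicit copy first (`toy_orders` and `toy_eu` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Made-up rows mirroring the Country, Quantity and UnitPrice columns
toy_orders = pd.DataFrame({
    "Country": ["United Kingdom", "Germany", "Japan"],
    "Quantity": [2, -6, 10],
    "UnitPrice": [2.95, 1.65, 0.55],
})

# .copy() makes the filtered frame independent of toy_orders, so adding a
# column to it is unambiguous
toy_eu = toy_orders[toy_orders["Country"].isin(["United Kingdom", "Germany"])].copy()
toy_eu["MoneySpent"] = toy_eu["Quantity"] * toy_eu["UnitPrice"]

print(toy_eu)
```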


Below are plots which show trends in customer spending by month, week, day of the week and hour of the day.

In [16]:
# Building the DF for the box plot
sorted_ecom_df = eu_ecom_df.groupby("Date").agg({"MoneySpent": pd.Series.sum,
                                                 "Quantity": pd.Series.sum})
sorted_ecom_df.reset_index(inplace=True)

# Box Plot for Money Spent Per Month
money_spent_per_month = px.box(sorted_ecom_df, 
                               x=sorted_ecom_df["Date"].dt.month, 
                               y="MoneySpent")

money_spent_per_month.add_vrect(x0=8.5, x1=12.5, 
              annotation_text="Jump in Spending", annotation_position="top",
              fillcolor="green", opacity=0.25, line_width=0)

money_spent_per_month.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Money Spent Per Month",
                            y = 0.95,
                            x = 0.5,
                            xanchor = 'center',
                            yanchor = 'top'
                      ))

money_spent_per_month.update_layout(
                    xaxis = dict(
                        tickmode = 'array',
                        tickvals = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                        ticktext = list(calendar.month_name)
                    ))

money_spent_per_month.update_xaxes(title=None)
money_spent_per_month.update_yaxes(title="Daily Money Spent")
money_spent_per_month.show()

In [17]:
# Average daily spend for January-August vs September-December 2011.
# Explicit ISO dates avoid relying on how strings such as "Dec, 2011" are parsed.
dates = sorted_ecom_df["Date"]
jan_to_aug_spend = sorted_ecom_df.loc[(dates >= "2011-01-01") & (dates < "2011-09-01"),
                                      "MoneySpent"].mean()
sep_to_dec_spend = sorted_ecom_df.loc[(dates >= "2011-09-01") & (dates <= "2011-12-31"),
                                      "MoneySpent"].mean()

# Percentage increase, computed from the data instead of hard-coded in the legend
spend_increase = (sep_to_dec_spend - jan_to_aug_spend) / jan_to_aug_spend * 100

periods = ["January to August", "September to December"]
spend_jump = px.bar(x=periods, y=[jan_to_aug_spend, sep_to_dec_spend], text_auto=True)

spend_jump.add_scatter(x=periods, y=[jan_to_aug_spend, sep_to_dec_spend],
                       opacity=0.8, name=f"Increase in spending by {spend_increase:.0f}%",
                       line=dict(
                           dash = "dash",
                           width = 1,
                           color = "green")
                      )

spend_jump.update_layout(titlefont=dict(size=20, color='black'),
                         autosize=False, width=800, height=500,
                         title=dict(
                             text = "Comparing Average Spendings",
                             y = 0.95,
                             x = 0.5,
                             xanchor =  'center',
                             yanchor = 'top'),
                         legend=dict(
                            orientation = "h", 
                            yanchor = "top", 
                            y = 0.9, 
                            xanchor = "left",
                            x = 0)
                        )


spend_jump.update_xaxes(title=None)
spend_jump.update_yaxes(title="Average Money Spent")

spend_jump.show()

In [18]:
# Box Plot for Money Spent Per Week
money_spent_per_week = px.box(sorted_ecom_df, 
                              x=sorted_ecom_df["Date"].dt.isocalendar().week, 
                              y="MoneySpent")

money_spent_per_week.add_vrect(x0=37.5, x1=40.5,
                              fillcolor="green", opacity=0.25, line_width=0)

money_spent_per_week.add_vrect(x0=43.5, x1=48.5, 
                              annotation_text="Jump in Spending", annotation_position="top",
                              fillcolor="green", opacity=0.25, line_width=0)

money_spent_per_week.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Money Spent Per Week",
                            y = 0.95,
                            x = 0.5,
                            xanchor = 'center',
                            yanchor = 'top'))
#                       xaxis = dict(
#                               tickmode = 'linear',
#                               tick0 = 0,
#                               dtick = 1
#                       ))

money_spent_per_week.update_xaxes(title="Week Number")
money_spent_per_week.update_yaxes(title="Daily Money Spent")
money_spent_per_week.show()

In [19]:
# Box Plot for Money Spent Per Weekday
money_spent_per_weekday = px.box(sorted_ecom_df, 
                               x=sorted_ecom_df["Date"].dt.weekday, 
                               y="MoneySpent")

money_spent_per_weekday.add_vrect(x0=2.5, x1=3.5, 
              annotation_text="Most Money Spent", annotation_position="top",
              fillcolor="green", opacity=0.25, line_width=0)

money_spent_per_weekday.add_vrect(x0=5.5, x1=6.5, 
              annotation_text="Least Money Spent", annotation_position="top",
              fillcolor="red", opacity=0.25, line_width=0)

money_spent_per_weekday.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Money Spent Per Day of the Week",
                            y = 0.95,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'),
                      xaxis=dict(
                            tickmode = 'array',
                            tickvals = [0, 1, 2, 3, 4, 5, 6],
                            ticktext = list(calendar.day_name)
                      ))

money_spent_per_weekday.update_xaxes(title=None)
money_spent_per_weekday.update_yaxes(title="Daily Money Spent")
money_spent_per_weekday.show()

In [20]:
# Histogram for hourly order distribution
hourly_stats = eu_ecom_df.groupby("Hour").agg({"MoneySpent": pd.Series.sum})
hourly_stats.reset_index(inplace=True)

hourly_spending = px.histogram(hourly_stats, x="Hour",
                               y="MoneySpent", nbins=48, histfunc="avg",
                               hover_data=hourly_stats.columns)

hourly_spending.add_scatter(x=hourly_stats["Hour"], y=hourly_stats["MoneySpent"],
                            mode="lines+markers", name="Trend Line", hoverinfo="skip",
                            line=dict(shape = 'spline', 
                                      smoothing =  1.0))

hourly_spending.update_layout(titlefont=dict(size=20, color='black'),
                              title=dict(
                                    text = "Hourly Customer Spending",
                                    y = 0.95,
                                    x = 0.5,
                                    xanchor = 'center',
                                    yanchor = 'top'
                             ),
                             legend=dict(
                                    yanchor = "top",
                                    y = 0.99,
                                    xanchor = "right",
                                    x = 0.99),
                             hovermode='x')

hourly_spending.update_yaxes(title="Total Money Spent")
hourly_spending.update_xaxes(title="Hour of the Day")
hourly_spending.show()

### Insights:

1. There is a jump in customer spending from September to December, with the highest spending in November.
2. Average daily spending is fairly consistent at around `$4,000` from January to August and rises to approximately `$7,000` from September to December.
3. Weeks 38-40 and 44-48 (the ranges highlighted in the weekly plot) see a jump in spending. Data is missing for Week 52 (last week), which may be due to the business being shut for the winter holidays.
4. There is no data for any Saturday. It seems like the business is closed on Saturdays, which is quite strange for an e-commerce business.
5. Thursdays see the most spending while Sundays see the least. This may be due to shops closing early on Sundays.
6. Spending rises steadily through the morning, peaks at 12 PM and declines for the rest of the day, roughly following a bell curve.
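The observation about Saturdays can be confirmed programmatically by listing the weekdays that never appear in the `Date` column. The sketch below uses a made-up week of trading days with the Saturday left out. On the real data, `toy_days` would be replaced by `sorted_ecom_df["Date"]`.

```python
import calendar
import pandas as pd

# Made-up trading days: Sunday 6 March to Sunday 13 March 2011,
# with Saturday 12 March left out
toy_days = pd.Series(pd.to_datetime([
    "2011-03-06", "2011-03-07", "2011-03-08", "2011-03-09",
    "2011-03-10", "2011-03-11", "2011-03-13",
]))

present = set(toy_days.dt.weekday)  # Monday=0 ... Sunday=6
missing = [calendar.day_name[d] for d in range(7) if d not in present]

print("Weekdays with no orders:", missing)
```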

## Investigating Top Products Overall

In [21]:
top_products = eu_ecom_df.groupby("Description").agg({
    "Quantity": pd.Series.sum,
    "MoneySpent": pd.Series.sum})
top_products.reset_index(inplace=True)
top_products.sort_values("Quantity", ascending=True, inplace=True)

# Taking the last 10 items in the top_products DF which is sorted in ascending order
most_popular_items = px.bar(top_products[-10:], y="Description", x="Quantity", text_auto=True)

most_popular_items.update_layout(showlegend=False,
                              titlefont=dict(size=20, color='black'),
                              title=dict(
                                    text = "Top 10 Most Popular Items at the Business",
                                    y = 0.97,
                                    x = 0.5,
                                    xanchor =  'center',
                                    yanchor = 'top'))

most_popular_items.update_yaxes(title="Item Names")
most_popular_items.update_xaxes(title="Total Items Sold")

most_popular_items.show()

In [22]:
top_products.sort_values("MoneySpent", ascending=True, inplace=True)

# Taking the last 10 items in the top_products DF which is sorted in ascending order
most_revenue_items = px.bar(top_products[-10:], y="Description", x="MoneySpent", text_auto=True)

most_revenue_items.update_layout(showlegend=False,
                              titlefont=dict(size=20, color='black'),
                              title=dict(
                                    text = "Top 10 Most Revenue Generating Items at the Business",
                                    y = 0.97,
                                    x = 0.5,
                                    xanchor =  'center',
                                    yanchor = 'top'))

most_revenue_items.update_yaxes(title="Item Names")
most_revenue_items.update_xaxes(title="Total Money Spent")

most_revenue_items.show()

### Insights:
1. The most popular product overall is `WORLD WAR 2 GLIDERS ASSTD DESIGNS`, with 8902 items sold. <br>
2. The product with the highest revenue is `REGENCY CAKESTAND 3 TIER`, with $23,449 in revenue.
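The same rankings can be obtained without sorting the DataFrame in place by using `nlargest`. A minimal sketch on made-up products (names and numbers are invented for illustration):

```python
import pandas as pd

# Made-up per-product totals standing in for top_products
toy_products = pd.DataFrame({
    "Description": ["GLIDER", "CAKESTAND", "MUG", "TISSUES"],
    "Quantity": [90, 12, 30, 60],
    "MoneySpent": [27.0, 150.0, 88.5, 17.4],
})

# nlargest returns the top-n rows by the given column, largest first
top_by_quantity = toy_products.nlargest(2, "Quantity")
top_by_revenue = toy_products.nlargest(2, "MoneySpent")

print(top_by_quantity["Description"].tolist())
print(top_by_revenue["Description"].tolist())
```

The two lists differ, which mirrors the insight above: the best-selling item is not necessarily the one that earns the most.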



## Returns

In the `Orders For Items Per Country` graph, some points have negative quantities, which translate into negative `MoneySpent` values. I'm assuming that these rows are returned orders. In the rows shown below, the invoice numbers also start with `C`.

In [23]:
eu_ecom_df[eu_ecom_df["MoneySpent"] < 0]

| | InvoiceNo | StockCode | Description | Quantity | UnitPrice | CustomerID | Country | Date | Hour | MoneySpent |
|---|---|---|---|---|---|---|---|---|---|---|
| 96 | C536548 | 22168 | ORGANISER WOOD ANTIQUE WHITE | -2 | 8.50 | 12472 | Germany | 2010-12-01 | 14 | -17.00 |
| 194 | C536548 | 22077 | 6 RIBBONS RUSTIC CHARM | -6 | 1.65 | 12472 | Germany | 2010-12-01 | 14 | -9.90 |
| 230 | C536548 | 20914 | SET/5 RED RETROSPOT LID GLASS BOWLS | -1 | 2.95 | 12472 | Germany | 2010-12-01 | 14 | -2.95 |
| 231 | C536391 | 21984 | PACK OF 12 PINK PAISLEY TISSUES | -24 | 0.29 | 17548 | United Kingdom | 2010-12-01 | 10 | -6.96 |
| 266 | C536548 | 22654 | DELUXE SEWING KIT | -1 | 5.95 | 12472 | Germany | 2010-12-01 | 14 | -5.95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81315 | C581330 | 22959 | WRAP CHRISTMAS VILLAGE | -25 | 0.42 | 15877 | United Kingdom | 2011-12-08 | 11 | -10.50 |
| 81366 | C581468 | 22098 | BOUDOIR SQUARE TISSUE BOX | -12 | 0.39 | 13599 | United Kingdom | 2011-12-08 | 19 | -4.68 |
| 81382 | C581316 | 21531 | RED RETROSPOT SUGAR JAM BOWL | -1 | 2.55 | 12523 | France | 2011-12-08 | 11 | -2.55 |
| 81398 | C581465 | 22171 | 3 HOOK PHOTO SHELF ANTIQUE WHITE | -1 | 8.50 | 15755 | United Kingdom | 2011-12-08 | 18 | -8.50 |


The dataset does not contain any column that explains why these items were returned. We need more data to understand the reasons and to minimise the number of returned items.
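Even without a reason column, the size of the problem can be estimated. The sketch below is a rough illustration on made-up rows. It assumes, as the text does, that a negative quantity marks a return, and it cross-checks that assumption against the `C` invoice prefix seen in the table.

```python
import pandas as pd

# Made-up invoice lines; two of them are returns
toy_sales = pd.DataFrame({
    "InvoiceNo": ["536551", "C536548", "536412", "C536391"],
    "Quantity": [1, -2, 2, -24],
    "UnitPrice": [4.95, 8.50, 2.95, 0.29],
})
toy_sales["MoneySpent"] = toy_sales["Quantity"] * toy_sales["UnitPrice"]

# Assumption: a negative quantity marks a returned line
is_return = toy_sales["Quantity"] < 0
# Cross-check against the "C" prefix on the invoice number
has_c_prefix = toy_sales["InvoiceNo"].str.startswith("C")

return_rate = is_return.mean() * 100                      # % of lines returned
refunded = -toy_sales.loc[is_return, "MoneySpent"].sum()  # value of returns

print(f"Return rate: {return_rate:.1f}% of lines, value refunded: {refunded:.2f}")
print("Prefix agrees with sign:", bool((is_return == has_c_prefix).all()))
```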

# Holidays vs Purchases

Now let us try to visualize customer spending around the holidays. We will seek to find if and how major holidays influence what items are sold more than others.

In [24]:
print("Major Holidays: \n")
new_holiday = holiday_df[(holiday_df["Date"] >= "2010-12-01") & (holiday_df["Date"] <= "2011-12-07")].sort_values("Date")
new_holiday.reset_index(inplace=True)
new_holiday

Major Holidays: 



| | index | Date | Holiday | WeekDay | Month | Day | Year |
|---|---|---|---|---|---|---|---|
| 0 | 42 | 2010-12-24 | Christmas Eve | Friday | 12 | 24 | 2010 |
| 1 | 24 | 2010-12-25 | Christmas Day | Saturday | 12 | 25 | 2010 |
| 2 | 222 | 2010-12-31 | New Year’s Eve | Friday | 12 | 31 | 2010 |
| 3 | 207 | 2011-01-01 | New Year's Day | Saturday | 1 | 1 | 2011 |
| 4 | 168 | 2011-01-17 | Martin Luther King, Jr. Day | Monday | 1 | 17 | 2011 |
| 5 | 277 | 2011-02-14 | Valentine’s Day | Monday | 2 | 14 | 2011 |
| 6 | 315 | 2011-02-21 | Washington's Birthday | Monday | 2 | 21 | 2011 |
| 7 | 341 | 2011-04-24 | Western Easter | Sunday | 4 | 24 | 2011 |
| 8 | 83 | 2011-04-24 | Eastern Easter | Sunday | 4 | 24 | 2011 |
| 9 | 193 | 2011-05-30 | Memorial Day | Monday | 5 | 30 | 2011 |


In [25]:
spending_vs_holidays = px.bar(data_frame=sorted_ecom_df, x="Date", y="MoneySpent")

# Draw a dotted vertical line on each holiday date
for holiday_date in new_holiday["Date"]:
    spending_vs_holidays.add_vline(x=holiday_date, line_width=1, line_dash="dot")

spending_vs_holidays.update_layout(showlegend=False,
                      titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Customer Spending Around the Holidays",
                            y = 0.95,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'
                      ))

spending_vs_holidays.update_yaxes(title="Total Money Spent Every Day")
spending_vs_holidays.update_xaxes(title="Date")

spending_vs_holidays.show()

## Valentine's Day and Christmas
We will now track the popularity of Christmas and Valentine's Day items throughout the year since these two holidays are quite popular.




In [26]:
# Building the DF for Christmas items
xmas_items = eu_ecom_df[eu_ecom_df["Description"].str.
                        contains("CHRISTMAS|xmas", 
                                 case=False, 
                                 regex=True)].groupby("Date").agg({
                                     "Quantity": pd.Series.sum,
                                     "MoneySpent": pd.Series.sum
                                     })
xmas_items.reset_index(inplace=True)

# Building the DF for Valentine items
val_items = eu_ecom_df[eu_ecom_df["Description"].str.
                       contains("heart|love|valentine", 
                                case=False, 
                                regex=True)].groupby(["Date"]).agg({
                                    "Quantity": pd.Series.sum,
                                    "MoneySpent": pd.Series.sum
                                    })
val_items.reset_index(inplace=True)

# Plotting the comparison chart
xmas_vs_valentine_items = go.Figure()
xmas_vs_valentine_items.add_trace(go.Scatter(x=xmas_items["Date"], y=xmas_items["Quantity"],
                    mode='lines', name='Christmas Items'))
xmas_vs_valentine_items.add_trace(go.Scatter(x=val_items["Date"], y=val_items["Quantity"],
                    mode='lines', name='Valentine Items'))

xmas_vs_valentine_items.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                          text = "Comparing Item Sales for Valentine's Day and Christmas",
                          y = 0.95,
                          x = 0.5,
                          xanchor =  'center',
                          yanchor = 'top'
                          ),
                      legend=dict(
                          orientation = "h", 
                          yanchor = "bottom", 
                          y = 1.02, 
                          xanchor = "right",
                          x = 1))

xmas_vs_valentine_items.update_yaxes(title="Quantity of Items")
xmas_vs_valentine_items.update_xaxes(title="Date")

xmas_vs_valentine_items.show()

In [27]:
xmas_vs_valentine_spend = px.bar(x=["Christmas", "Valentine"], 
                                 y=[xmas_items["MoneySpent"].sum(), 
                                    val_items["MoneySpent"].sum()],
                                 text_auto=True)

xmas_vs_valentine_spend.add_scatter(x=["Christmas", "Valentine"], 
                                    y=[xmas_items["MoneySpent"].sum(), 
                                       val_items["MoneySpent"].sum()],
                                    opacity=0.8, name="147% more spent on Valentine Vs Christmas Items",
                                    line=dict(
                                           dash = "dash",
                                           width = 1,
                                           color = "green"))

xmas_vs_valentine_spend.update_layout(autosize=False, width=800, height=500,
                                      titlefont=dict(size=20, color='black'),
                                      title=dict(
                                                text = "Total Money Spent on Christmas vs Valentine's Day Items Throughout the Year",
                                                y = 0.95,
                                                x = 0.5,
                                                xanchor =  'center',
                                                yanchor = 'top'),
                                      legend=dict(
                                                orientation = "h", 
                                                yanchor = "top", 
                                                y = 0.9, 
                                                xanchor = "left",
                                                x = 0))

xmas_vs_valentine_spend.update_yaxes(title="Total Money Spent")
xmas_vs_valentine_spend.update_xaxes(title=None)


xmas_vs_valentine_spend.show()

### Insights:

1. There is a steady rise in sales of Christmas items from mid-August, which keeps increasing until the peak Christmas season.
2. We expected Valentine's Day gifts to be popular mostly around Valentine's Day, but the data does not show that. They keep selling throughout the year.
3. Valentine items bring in more money, and more consistently, than Christmas items (about 147% more), so the marketing team should focus more of their attention on them. Note that the `heart|love|valentine` filter also matches items that are not Valentine-specific, such as `SET 2 TEA TOWELS I LOVE LONDON`, so this figure is likely an overestimate.
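Because `str.contains` matches substrings, a keyword such as `love` also matches any description containing the word `GLOVES`. One way to tighten the filter is to add word boundaries, as in this sketch on made-up descriptions (only the tea towel item is taken from the dataset; the others are hypothetical):

```python
import pandas as pd

toy_descriptions = pd.Series([
    "HEART OF WICKER SMALL",           # hypothetical item
    "SET 2 TEA TOWELS I LOVE LONDON",  # appears in the dataset
    "GARDENING GLOVES PINK",           # hypothetical item
    "WRAP CHRISTMAS VILLAGE",          # appears in the dataset
])

# Substring match, as used in the notebook: "GLOVES" contains "love"
substring_match = toy_descriptions.str.contains(
    "heart|love|valentine", case=False, regex=True)

# Whole-word match: \b requires a word boundary on both sides
word_match = toy_descriptions.str.contains(
    r"\b(?:heart|love|valentine)\b", case=False, regex=True)

print(pd.DataFrame({"description": toy_descriptions,
                    "substring": substring_match,
                    "whole_word": word_match}))
```

Word boundaries remove accidental hits like `GLOVES`, but `I LOVE LONDON` still matches, so a keyword filter would probably need manual curation before the Valentine total is used for decisions.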

## Other Holidays

In [28]:
# Grouping item sales every single day
items_df = eu_ecom_df.groupby(["Date", "Description"]).agg({"Quantity": pd.Series.sum,
                                                 "UnitPrice": pd.Series.sum,
                                                 "MoneySpent": pd.Series.sum})
items_df.sort_values(["Date", "Quantity"], ascending=[True,False], inplace=True)
items_df.reset_index(inplace=True)


# Columbus DF
columbus_df = items_df[(items_df["Date"] > "2011-10-05") & (items_df["Date"] < "2011-10-11")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
columbus_df.sort_values("MoneySpent", ascending=False, inplace=True)
columbus_df.reset_index(inplace=True)


# 4th of July DF
july_4_df = items_df[(items_df["Date"] > "2011-06-29") & (items_df["Date"] <= "2011-07-04")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
july_4_df.sort_values("MoneySpent", ascending=False, inplace=True)
july_4_df.reset_index(inplace=True)


# Labor Day DF
labor_day_df = items_df[(items_df["Date"] >= "2011-09-01") & (items_df["Date"] <= "2011-09-05")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
labor_day_df.sort_values("MoneySpent", ascending=False, inplace=True)
labor_day_df.reset_index(inplace=True)


# Thanksgiving DF
thanksgiving_df = items_df[(items_df["Date"] >= "2011-11-20") & (items_df["Date"] <= "2011-11-24")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
thanksgiving_df.sort_values("MoneySpent", ascending=False, inplace=True)
thanksgiving_df.reset_index(inplace=True)


# Memorial Day DF
memorial_day_df = items_df[(items_df["Date"] >= "2011-05-26") & (items_df["Date"] <= "2011-05-30")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
memorial_day_df.sort_values("MoneySpent", ascending=False, inplace=True)
memorial_day_df.reset_index(inplace=True)


# Easter DF
easter_df = items_df[(items_df["Date"] >= "2011-04-20") & (items_df["Date"] <= "2011-04-24")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
easter_df.sort_values("MoneySpent", ascending=False, inplace=True)
easter_df.reset_index(inplace=True)


# Juneteenth DF
juneteenth_df = items_df[(items_df["Date"] >= "2011-06-15") & (items_df["Date"] <= "2011-06-19")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
juneteenth_df.sort_values("MoneySpent", ascending=False, inplace=True)
juneteenth_df.reset_index(inplace=True)


# Veterans DF
veterans_df = items_df[(items_df["Date"] >= "2011-11-07") & (items_df["Date"] <= "2011-11-11")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
veterans_df.sort_values("MoneySpent", ascending=False, inplace=True)
veterans_df.reset_index(inplace=True)
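Each filter above keeps a five-day window that ends on the holiday itself (for example, 2011-10-06 through 2011-10-10 for Columbus Day). A minimal sketch of how those bounds could be derived from the holiday date instead of being typed by hand; `holiday_window` and its `days` parameter are hypothetical names, not part of the notebook:

```python
from datetime import date, timedelta
from typing import Tuple


def holiday_window(holiday: str, days: int = 5) -> Tuple[str, str]:
    """Return inclusive (start, end) ISO dates for a window ending on the holiday."""
    if days < 1:
        raise ValueError("days must be at least 1")
    end = date.fromisoformat(holiday)
    start = end - timedelta(days=days - 1)
    return start.isoformat(), end.isoformat()


# 2011 holiday dates matching three of the windows used in the cell above
holidays_2011 = {
    "Columbus Day": "2011-10-10",
    "4th of July": "2011-07-04",
    "Labor Day": "2011-09-05",
}

for name, day in holidays_2011.items():
    start, end = holiday_window(day)
    print(f"{name}: {start} to {end}")
```

The returned pair could then feed the same kind of filter, e.g. `items_df[(items_df["Date"] >= start) & (items_df["Date"] <= end)]`, assuming `Date` holds day-level values as in the cell above.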

In [29]:
# Pie charts for each major holiday and the items sold around it
specs = [[{'type':'domain'}, {'type':'domain'}],
         [{'type':'domain'}, {'type':'domain'}],
         [{'type':'domain'}, {'type':'domain'}],
         [{'type':'domain'}, {'type':'domain'}]]

most_popular_items = make_subplots(rows=4, cols=2, specs=specs, vertical_spacing=0.05, horizontal_spacing=0.1,
                    subplot_titles=["Columbus Day", "4th of July", "Labor Day", "Thanksgiving", "Memorial Day", "Easter", "Juneteenth", "Veteran's Day"])

most_popular_items.add_trace(go.Pie(labels=columbus_df["Description"][:5], 
                     values=columbus_df["MoneySpent"][:5], name='Columbus Day',
                     marker_colors=px.colors.sequential.ice), 1, 1)

most_popular_items.add_trace(go.Pie(labels=july_4_df["Description"][:5], 
                     values=july_4_df["MoneySpent"][:5], name='4th of July',
                     marker_colors=px.colors.sequential.ice), 1, 2)

most_popular_items.add_trace(go.Pie(labels=labor_day_df["Description"][:5], 
                     values=labor_day_df["MoneySpent"][:5], name='Labor Day',
                     marker_colors=px.colors.sequential.ice), 2, 1)

most_popular_items.add_trace(go.Pie(labels=thanksgiving_df["Description"][:5], 
                     values=thanksgiving_df["MoneySpent"][:5], name='Thanksgiving',
                     marker_colors=px.colors.sequential.ice), 2, 2)

most_popular_items.add_trace(go.Pie(labels=memorial_day_df["Description"][:5], 
                     values=memorial_day_df["MoneySpent"][:5], name='Memorial Day',
                     marker_colors=px.colors.sequential.ice), 3, 1)

most_popular_items.add_trace(go.Pie(labels=easter_df["Description"][:5], 
                     values=easter_df["MoneySpent"][:5], name='Easter',
                     marker_colors=px.colors.sequential.ice), 3, 2)

most_popular_items.add_trace(go.Pie(labels=juneteenth_df["Description"][:5], 
                     values=juneteenth_df["MoneySpent"][:5], name='Juneteenth',
                     marker_colors=px.colors.sequential.ice), 4, 1)

most_popular_items.add_trace(go.Pie(labels=veterans_df["Description"][:5], 
                     values=veterans_df["MoneySpent"][:5], name="Veteran's Day",
                     marker_colors=px.colors.sequential.ice), 4, 2)

most_popular_items.update_layout(autosize=False,
                      width=800,
                      height=1600,
                      title=dict(
                          text = "<b>Comparing the Top 5 Most Revenue Generating Items During Holidays</b>",
                          # title.font replaces the deprecated top-level titlefont argument
                          font = dict(size=20, color='black'),
                          y = 0.99,
                          x = 0.5,
                          xanchor = 'center',
                          yanchor = 'top'
                          ),
                      showlegend=False)

most_popular_items.show()

#### Insights:

1. After checking the items purchased around these holidays, there seems to be no relationship between the holidays and the purchases.
2. The holiday list contains holidays mostly celebrated in the USA, but most orders come from the UK. Hence, apart from Valentine's Day and Christmas, the holidays in the given dataset show no correlation with the items purchased.
3. New Year's is also a major holiday, but there is no data around that time, possibly because the store was closed for the holidays.
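One way to check the third point is to list the calendar days that have no recorded orders. A minimal sketch using only the standard library; `missing_days` is a hypothetical helper, and the sample dates are illustrative rather than taken from the dataset:

```python
from datetime import date, timedelta
from typing import Iterable, List


def missing_days(observed: Iterable[date], start: date, end: date) -> List[date]:
    """Return every day in [start, end] that does not appear in observed."""
    seen = set(observed)
    total = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(total)
            if start + timedelta(days=i) not in seen]


# Illustrative order dates only (not taken from the dataset)
order_dates = [date(2010, 12, 22), date(2010, 12, 23), date(2011, 1, 4)]

gaps = missing_days(order_dates, date(2010, 12, 22), date(2011, 1, 4))
print(f"{len(gaps)} days without orders, from {gaps[0]} to {gaps[-1]}")
```

With the notebook's data, the observed dates could come from something like `set(eu_ecom_df["Date"])`, assuming that column converts cleanly to `date` objects.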