
# Problem Statement

Having a great marketing strategy in place is key to the success of any business. Without a marketing strategy, you lack focus. And without focus, you will, quite simply, fail to reach any of the goals and objectives that you have set. Any information about customers allows marketers to gain a laser-sharp understanding of their target audience. The marketing budget is being set for the year 2023. The marketing director would like to know which holiday brings in the most money so the team can adjust the marketing dollars. <br><br>
### Objective

What holidays should the marketing team invest more marketing dollars in? Also, find out whatever other insights you can from the dataset. <br><br>

### Instructions

Answer the questions (using SQL, Python) and explain your rationale in the write-up:

1. How would you segment holidays based on customer expenditure?
2. Which of these segments / sub-segments would you propose be approved?
    - For example: does a certain holiday drive sales for a particular segment of people, and how can the marketing team use that to optimize their plan? Does the duration of a holiday season affect the number of sales, and how should the marketing team strategize around that?
3. What other insights can you share about these segments?
4. What did you observe about the data itself (completeness, skew), and how would you treat anomalies such as missing data?

# Importing, reading and cleaning the data

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/salesandholidaydata/US_Holiday_Dates_(2004-2021).csv
/kaggle/input/salesandholidaydata/Ecommerce_Data.csv


In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import calendar
from datetime import datetime as dt
pd.options.plotting.backend = "plotly"



In [3]:
ecom_df = pd.read_csv("/kaggle/input/salesandholidaydata/Ecommerce_Data.csv")
holiday_df = pd.read_csv("/kaggle/input/salesandholidaydata/US_Holiday_Dates_(2004-2021).csv")

In [4]:
ecom_df.head()
ecom_df.info()
ecom_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81601 entries, 0 to 81600
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   81601 non-null  int64  
 1   InvoiceNo    81601 non-null  object 
 2   StockCode    81601 non-null  object 
 3   Description  81601 non-null  object 
 4   Quantity     81601 non-null  int64  
 5   UnitPrice    81601 non-null  float64
 6   CustomerID   81601 non-null  int64  
 7   Country      81601 non-null  object 
 8   Date         81601 non-null  object 
 9   Hour         81601 non-null  int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 6.2+ MB


,Unnamed: 0,Quantity,UnitPrice,CustomerID,Hour
count,81601.0,81601.0,81601.0,81601.0,81601.0
mean,278271.366772,11.965736,3.168721,15283.816215,12.729783
std,152483.054308,45.782018,18.731668,1713.292081,2.288777
min,2.0,-3114.0,0.0,12347.0,6.0
25%,148283.0,2.0,1.25,13949.0,11.0
50%,284742.0,5.0,1.95,15144.0,13.0
75%,409445.0,12.0,3.75,16790.0,14.0
max,541908.0,3186.0,4287.63,18287.0,20.0


The `Unnamed: 0` column has no documented meaning (it looks like a leftover row index from an earlier export, though that is an assumption), so I drop it.

In [5]:
ecom_df.drop(labels="Unnamed: 0", axis=1, inplace=True)
ecom_df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice',
       'CustomerID', 'Country', 'Date', 'Hour'],
      dtype='object')

In [6]:
holiday_df.head()
holiday_df.info()
holiday_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 342 entries, 0 to 341
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     342 non-null    object
 1   Holiday  342 non-null    object
 2   WeekDay  342 non-null    object
 3   Month    342 non-null    int64 
 4   Day      342 non-null    int64 
 5   Year     342 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 16.2+ KB


,Month,Day,Year
count,342.0,342.0,342.0
mean,7.263158,15.853801,2012.5
std,3.899889,9.65333,5.195729
min,1.0,1.0,2004.0
25%,4.0,6.0,2008.0
50%,9.0,16.5,2012.5
75%,11.0,24.0,2017.0
max,12.0,31.0,2021.0


## Checking for null values

In [7]:
ecom_df.isna().any()

InvoiceNo      False
StockCode      False
Description    False
Quantity       False
UnitPrice      False
CustomerID     False
Country        False
Date           False
Hour           False
dtype: bool

In [8]:
holiday_df.isna().any()

Date       False
Holiday    False
WeekDay    False
Month      False
Day        False
Year       False
dtype: bool

Neither dataset contains null values, so no imputation is needed and we can proceed with the analysis.
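
Null checks alone do not catch every anomaly: the `describe()` output above shows a minimum `Quantity` of -3114 and a minimum `UnitPrice` of 0.0. A minimal sketch of a few extra checks follows, run on a small invented frame that stands in for `ecom_df`; the helper name `quality_report` is hypothetical.

```python
import pandas as pd

# Toy rows standing in for ecom_df; the values are invented for illustration.
sample = pd.DataFrame({
    "InvoiceNo": ["536551", "536551", "C536548", "536600"],
    "StockCode": ["22112", "22112", "22168", "21975"],
    "Quantity": [1, 1, -2, 24],
    "UnitPrice": [4.95, 4.95, 8.50, 0.0],
})


def quality_report(df):
    """Count fully duplicated rows, negative quantities and zero unit prices."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "negative_quantity": int((df["Quantity"] < 0).sum()),
        "zero_unit_price": int((df["UnitPrice"] == 0).sum()),
    }


report = quality_report(sample)
print(report)
```

Whether such rows should be dropped, kept or analysed separately depends on what they represent, so the sketch only counts them.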

## Fixing the date formatting

In [9]:
ecom_df["Date"] = pd.to_datetime(ecom_df["Date"], format='%Y-%m-%d')
holiday_df["Date"] = pd.to_datetime(holiday_df["Date"], format='%Y-%m-%d')

In [10]:
ecom_df[["Date"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81601 entries, 0 to 81600
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    81601 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 637.6 KB


## ROCCC Analysis

1. Reliability: The data comes from an e-commerce platform and was collected internally through various tools; hence, it is highly reliable.
2. Originality: As mentioned above, since it is internal data, we can be confident that it is original and specific to this business.
3. Comprehensive: The data spans roughly one year (December 2010 to December 2011). More years would be needed to distinguish one-off trends from established ones.
4. Current: The data is more than 11 years old and hence may not reflect current trends accurately.
5. Cited: No citation is needed because the data is internal.

# Analysing the Overall Data

### Analysing the distribution of orders

In [11]:
print("Total number of countries in the dataset: ", ecom_df["Country"].nunique())

Total number of countries in the dataset:  37


There seem to be orders from many different countries. It looks like the E-commerce business operates worldwide. Let us find out the distribution of orders by country so we can narrow our analysis to specific regions for a more effective marketing strategy.

In [12]:
orders_by_country = ecom_df.groupby("Country")["Quantity"].count().sort_values(ascending=False)
orders_by_country.head()

Country
United Kingdom    72617
Germany            1898
France             1689
EIRE               1500
Spain               504
Name: Quantity, dtype: int64
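
Note that `count()` on `Quantity` counts rows, and each row is one line item of an invoice, so a single order with several products is counted several times. A minimal sketch of the difference, assuming `InvoiceNo` identifies an order (the sample rows are invented):

```python
import pandas as pd

# Invented sample: two UK invoices (one with two line items),
# one German invoice with two line items, one French invoice.
sample = pd.DataFrame({
    "InvoiceNo": ["536551", "536551", "536412", "C536548", "C536548", "536370"],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "Germany", "Germany", "France"],
    "Quantity": [1, 6, 2, -2, -6, 24],
})

# Rows per country (what the cell above computes)
line_items = sample.groupby("Country")["Quantity"].count()
# Distinct invoices per country (closer to "number of orders")
invoices = sample.groupby("Country")["InvoiceNo"].nunique()

comparison = pd.DataFrame({"line_items": line_items, "invoices": invoices})
comparison = comparison.sort_values("line_items", ascending=False)
print(comparison)
```

The ranking of countries is unlikely to change, but the absolute numbers describe different things.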

In [13]:
orders_graph = px.histogram(x=orders_by_country.index, y=orders_by_country.values,
                            labels=dict(
                                x = "",
                                y = "Number of orders"
                                ))

orders_graph.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Distribution of Orders Worldwide",
                            y = 0.95,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'
                      ))
orders_graph.update_xaxes(tickangle=45)
orders_graph.show()

In [14]:
orders_pie = px.pie(orders_by_country, values=orders_by_country.values, 
                    names=orders_by_country.index)
orders_pie.update_traces(textposition='inside', textinfo='percent+label')
orders_pie.update_layout(uniformtext_minsize=16, uniformtext_mode='hide')
orders_pie.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Distribution of Orders Worldwide",
                            y = 0.95,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'
                      ))
orders_pie.update_layout(legend=dict(
                            yanchor="top",
                            y=0.99,
                            xanchor="left",
                            x=0.01
                        ))
orders_pie.show()

The e-commerce company does most of its business in the UK: the large majority of rows come from there, with comparatively few from other countries. <br> <br>
Since the company's presence is concentrated in and around the UK, we will focus the analysis on the eight countries with the most orders (the UK and seven other European countries) and drop the rest.

In [15]:
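# Keep the eight countries with the most rows
top_list = ecom_df["Country"].value_counts().index[:8].tolist()
print("The top countries are: ", ", ".join(top_list), "\n\n")

# .copy() so that adding columns later does not raise SettingWithCopyWarning
eu_ecom_df = ecom_df[ecom_df["Country"].isin(top_list)].copy()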
eu_ecom_df.info()

The top countries are:  United Kingdom, Germany, France, EIRE, Spain, Netherlands, Belgium, Switzerland 


<class 'pandas.core.frame.DataFrame'>
Int64Index: 79414 entries, 0 to 81600
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    79414 non-null  object        
 1   StockCode    79414 non-null  object        
 2   Description  79414 non-null  object        
 3   Quantity     79414 non-null  int64         
 4   UnitPrice    79414 non-null  float64       
 5   CustomerID   79414 non-null  int64         
 6   Country      79414 non-null  object        
 7   Date         79414 non-null  datetime64[ns]
 8   Hour         79414 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 6.1+ MB


### Analysing Trends Related to Customer Spending

We will add a column named `MoneySpent` to show the amount of money spent on each line item of an order. <br>`MoneySpent` = `Quantity` * `UnitPrice` <br>This also lets us quantify customer spending habits.

In [16]:
eu_ecom_df["MoneySpent"] = eu_ecom_df["Quantity"] * eu_ecom_df["UnitPrice"]
eu_ecom_df.head()

,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Hour,MoneySpent
0,536551,22112,CHOCOLATE HOT WATER BOTTLE,1,4.95,17346,United Kingdom,2010-12-01,14,4.95
1,536412,22900,SET 2 TEA TOWELS I LOVE LONDON,2,2.95,17920,United Kingdom,2010-12-01,11,5.9
2,536562,22313,OFFICE MUG WARMER PINK,6,2.95,13468,United Kingdom,2010-12-01,15,17.7
3,536528,22865,HAND WARMER OWL DESIGN,1,2.1,15525,United Kingdom,2010-12-01,13,2.1
4,536378,21975,PACK OF 60 DINOSAUR CAKE CASES,24,0.55,14688,United Kingdom,2010-12-01,9,13.2


In [17]:
sorted_ecom_df = eu_ecom_df.groupby("Date").agg({"MoneySpent": pd.Series.sum,
                                                 "Quantity": pd.Series.sum})
sorted_ecom_df.reset_index(inplace=True)
sorted_ecom_df.head()

,Date,MoneySpent,Quantity
0,2010-12-01,8025.85,3729
1,2010-12-02,7660.42,7154
2,2010-12-03,3362.06,1535
3,2010-12-05,6620.81,2916
4,2010-12-06,6375.55,3025


In [18]:
# Line Graph
# money_spent = go.Figure()
# money_spent.add_scatter(x=sorted_ecom_df["Date"], y=sorted_ecom_df["MoneySpent"], 
#                         fill='tozeroy', name="MoneySpent")
# money_spent.add_scatter(x=sorted_ecom_df["Date"], y=sorted_ecom_df["Quantity"],
#                         name="Quantity")

# money_spent.update_layout(titlefont=dict(size=20, color='black'),
#                       title=dict(
#                           text = "Customer Spending Throughout the Year",
#                           y = 0.95,
#                           x = 0.5,
#                           xanchor = 'center',
#                           yanchor = 'top'
#                           ),
#                       legend=dict(
#                           orientation = "h", 
#                           yanchor = "bottom", 
#                           y = 1.02, 
#                           xanchor = "right",
#                           x = 1))
# money_spent.show()

country_df = eu_ecom_df.groupby(["Date", "Country"]).agg({
    "MoneySpent": pd.Series.sum
})
country_df.reset_index(inplace=True)

orders_from_countries = px.area(country_df, x="Date", y="MoneySpent", 
                                color="Country",
                                color_discrete_sequence=px.colors.qualitative.G10,
                                line_shape='linear')

orders_from_countries.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                          text = "Customer Spending Throughout the Year Per Country",
                          y = 0.95,
                          x = 0.5,
                          xanchor = 'center',
                          yanchor = 'top'
                          ),
                      legend=dict(
                          orientation = "h", 
                          yanchor = "bottom", 
                          y = 0.9, 
                          xanchor = "left",
                          x = 0))

orders_from_countries.show()

In [19]:
# Box Plot for Money Spent Per Month
money_spent_per_month = px.box(sorted_ecom_df, 
                               x=sorted_ecom_df["Date"].dt.month, 
                               y="MoneySpent", 
                               labels={
                                      "x": ""
                                      })

money_spent_per_month.add_vrect(x0=9, x1=12, 
              annotation_text="Jump in Spending", annotation_position="top left",
              fillcolor="green", opacity=0.25, line_width=0)

money_spent_per_month.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Money Spent Per Month",
                            y = 0.95,
                            x = 0.5,
                            xanchor = 'center',
                            yanchor = 'top'
                      ))

money_spent_per_month.update_layout(
                    xaxis = dict(
                        tickmode = 'array',
                        tickvals = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                        ticktext = list(calendar.month_name)
                    )
                )

money_spent_per_month.show()

In [20]:
# Box Plot for Money Spent Per Week
money_spent_per_week = px.box(sorted_ecom_df, 
                               x=sorted_ecom_df["Date"].dt.isocalendar().week, 
                               y="MoneySpent", 
                               labels=dict(
                                      x = "Week Number"
                                      ))

money_spent_per_week.add_vrect(x0=35, x1=40, 
              annotation_text="Jump in Spending", annotation_position="top left",
              fillcolor="green", opacity=0.25, line_width=0)

money_spent_per_week.add_vrect(x0=44, x1=48, 
              annotation_text="Jump in Spending", annotation_position="top left",
              fillcolor="green", opacity=0.25, line_width=0)


money_spent_per_week.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Money Spent Per Week",
                            y = 0.95,
                            x = 0.5,
                            xanchor = 'center',
                            yanchor = 'top'),
                      xaxis = dict(
                              tickmode = 'linear',
                              tick0 = 0,
                              dtick = 1
    ))

money_spent_per_week.show()

In [21]:
# Box Plot for Money Spent Per Weekday
money_spent_per_weekday = px.box(sorted_ecom_df, 
                               x=sorted_ecom_df["Date"].dt.weekday, 
                               y="MoneySpent", 
                               labels={
                                      "x": ""
                                      })

money_spent_per_weekday.add_vrect(x0=5.5, x1=6.5, 
              annotation_text="Least Money Spent", annotation_position="top left",
              fillcolor="red", opacity=0.25, line_width=0)

money_spent_per_weekday.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Money Spent Per Day of the Week",
                            y = 0.95,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'),
                      xaxis = dict(
                            tickmode = 'array',
                            tickvals = [0, 1, 2, 3, 4, 5, 6],
                            ticktext = list(calendar.day_name)))

money_spent_per_weekday.show()

In [22]:
hourly_stats = eu_ecom_df.groupby("Hour").agg({"MoneySpent": pd.Series.sum,
                                               "Quantity": pd.Series.sum})
hourly_stats.sort_values("MoneySpent", ascending=False, inplace=True)

hourly_spending = px.histogram(hourly_stats, x=hourly_stats.index,
                               y="MoneySpent", nbins=48, histfunc="avg",
                               hover_data=hourly_stats.columns, text_auto=True)
hourly_spending.update_layout(titlefont=dict(size=20, color='black'),
                              title={
                                    'text': "Hourly Customer Spending",
                                    'y':0.95,
                                    'x':0.5,
                                    'xanchor': 'center',
                                    'yanchor': 'top'},
                              )
hourly_spending.show()

#### Insights:

1. The graphs above show spikes in both the number of orders and overall customer spending throughout the year.
2. Customer spending jumps from September to December, peaking in November.
3. Data is missing for week 52 (the last week of the year). This may be because the business shuts for the winter holidays.
4. There is no data for any Saturday. The business appears to be closed on Saturdays, which is unusual for an e-commerce business.
5. Thursday is the most popular day, while Sunday is the least.
6. The most active spending hours are from 10 AM to 3 PM.
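
The Saturday gap can be confirmed directly rather than read off the box plot. A minimal sketch, using a few invented daily rows in the shape of `sorted_ecom_df` (the first five dates mirror the `head()` output shown earlier; the amounts are made up):

```python
import pandas as pd

# 2010-12-01 was a Wednesday; 2010-12-04 (a Saturday) is absent.
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-01", "2010-12-02", "2010-12-03",
                            "2010-12-05", "2010-12-06", "2010-12-07"]),
    "MoneySpent": [100.0, 120.0, 90.0, 60.0, 110.0, 95.0],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Number of trading days observed for each weekday, including zero counts
days_per_weekday = (
    daily["Date"].dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)
missing_weekdays = days_per_weekday[days_per_weekday == 0].index.tolist()
print(missing_weekdays)
```

Running the same steps on the real `sorted_ecom_df` would list every weekday with no recorded sales.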

### Investigating Top Products Overall

In [23]:
top_spends = eu_ecom_df.groupby("Description").agg({
    "Quantity": pd.Series.sum,
    "MoneySpent": pd.Series.sum})
top_spends.reset_index(inplace=True)
top_spends.sort_values("Quantity", ascending=False, inplace=True)

most_popular_items = px.bar(top_spends[:30], y="Description", x="Quantity")
most_popular_items.update_layout(autosize=False,
                                 width=1800,
                                 height=800)

most_popular_items.update_layout(showlegend=False,
                      titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Top 30 Most Popular Items at the Business",
                            y = 0.97,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'
                      ))
most_popular_items.show()

The most popular product overall is `WORLD WAR 2 GLIDERS ASSTD DESIGNS`, with 8902 items sold. <br>
The product with the highest spend is `REGENCY CAKESTAND 3 TIER`, with $23500.14 in revenue.



### Returns

From the graphs above, we can see that the `Quantity` data contains some negative values. I'm assuming these rows represent returned items; the affected invoice numbers shown below all start with `C`, which is consistent with cancellations. <br>
On some days `MoneySpent` is negative overall, which means the items returned that day were worth more than the items purchased.

Why returns exceed purchases on those days would be interesting to investigate, but it is out of scope for now.

In [24]:
eu_ecom_df[eu_ecom_df["MoneySpent"] < 0]

,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Hour,MoneySpent
96,C536548,22168,ORGANISER WOOD ANTIQUE WHITE,-2,8.50,12472,Germany,2010-12-01,14,-17.00
194,C536548,22077,6 RIBBONS RUSTIC CHARM,-6,1.65,12472,Germany,2010-12-01,14,-9.90
230,C536548,20914,SET/5 RED RETROSPOT LID GLASS BOWLS,-1,2.95,12472,Germany,2010-12-01,14,-2.95
231,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,0.29,17548,United Kingdom,2010-12-01,10,-6.96
266,C536548,22654,DELUXE SEWING KIT,-1,5.95,12472,Germany,2010-12-01,14,-5.95
...,...,...,...,...,...,...,...,...,...,...
81315,C581330,22959,WRAP CHRISTMAS VILLAGE,-25,0.42,15877,United Kingdom,2011-12-08,11,-10.50
81366,C581468,22098,BOUDOIR SQUARE TISSUE BOX,-12,0.39,13599,United Kingdom,2011-12-08,19,-4.68
81382,C581316,21531,RED RETROSPOT SUGAR JAM BOWL,-1,2.55,12523,France,2011-12-08,11,-2.55
81398,C581465,22171,3 HOOK PHOTO SHELF ANTIQUE WHITE,-1,8.50,15755,United Kingdom,2011-12-08,18,-8.50
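
The returned rows above all carry invoice numbers starting with `C`. Assuming that prefix marks a cancellation or return (the dataset itself does not document this), gross sales and refunds can be separated as in the sketch below. The sample rows are copied from the outputs above.

```python
import pandas as pd

sample = pd.DataFrame({
    "InvoiceNo": ["536551", "C536548", "C536548", "536412"],
    "Quantity": [1, -2, -6, 2],
    "UnitPrice": [4.95, 8.50, 1.65, 2.95],
})
sample["MoneySpent"] = sample["Quantity"] * sample["UnitPrice"]

# Assumption: a leading "C" in InvoiceNo marks a cancelled/returned line
is_return = sample["InvoiceNo"].astype(str).str.startswith("C")

gross_sales = sample.loc[~is_return, "MoneySpent"].sum()
refunds = -sample.loc[is_return, "MoneySpent"].sum()  # as a positive amount
return_rate = refunds / gross_sales

print(f"Gross sales: {gross_sales:.2f}, refunds: {refunds:.2f}")
```

Keeping the two series apart makes it possible to chart gross demand without the dips that returns introduce on individual days.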


# Holidays vs Purchases

Now let us try to visualize customer spending around the holidays. The holidays have been marked with a vertical dotted line so we can clearly spot any trends.

In [25]:
new_holiday = holiday_df[(holiday_df["Date"] >= "2010-12-01") & (holiday_df["Date"] <= "2011-12-07")].sort_values("Date")
new_holiday.reset_index(inplace=True)
new_holiday

,index,Date,Holiday,WeekDay,Month,Day,Year
0,42,2010-12-24,Christmas Eve,Friday,12,24,2010
1,24,2010-12-25,Christmas Day,Saturday,12,25,2010
2,222,2010-12-31,New Year’s Eve,Friday,12,31,2010
3,207,2011-01-01,New Year's Day,Saturday,1,1,2011
4,168,2011-01-17,"Martin Luther King, Jr. Day",Monday,1,17,2011
5,277,2011-02-14,Valentine’s Day,Monday,2,14,2011
6,315,2011-02-21,Washington's Birthday,Monday,2,21,2011
7,341,2011-04-24,Western Easter,Sunday,4,24,2011
8,83,2011-04-24,Eastern Easter,Sunday,4,24,2011
9,193,2011-05-30,Memorial Day,Monday,5,30,2011


In [26]:
# color_discrete_sequence = ['#00ABB3']*len(sorted_ecom_df)
# sorted_ecom_df['Category'] = [str(i) for i in sorted_ecom_df.index]

# # If date is a holiday, mark it in a different colour
# bar_counter = 0
# for date in sorted_ecom_df.Date:
#   for hol_date in holiday_df.Date:
#     if date == hol_date:
#       color_discrete_sequence[bar_counter] = "#CF0A0A"
#   bar_counter += 1

spending_vs_holidays = px.bar(data_frame=sorted_ecom_df,
              x="Date", 
              y="MoneySpent",)
              # color=sorted_ecom_df['Category'],
              # color_discrete_sequence=color_discrete_sequence)

# Mark each holiday with a dotted vertical line
for holiday_date in new_holiday["Date"]:
    spending_vs_holidays.add_vline(x=holiday_date, line_width=1, line_dash="dot")

# def add_vertical_lines(date, holiday_name):
#   spending_vs_holidays.add_vline(x=date, line_width=1, line_dash="dot", annotation_text=holiday_name)

# add_vertical_lines("2011-10-10", "asdc")

spending_vs_holidays.update_layout(showlegend=False,
                      titlefont=dict(size=20, color='black'),
                      title=dict(
                            text = "Customer Spending Around the Holidays",
                            y = 0.95,
                            x = 0.5,
                            xanchor =  'center',
                            yanchor = 'top'
                      ))
spending_vs_holidays.show()

The graph above suggests that spending responds to holidays, not on the exact date but in the days leading up to it. This makes sense: people buy gifts in advance and allow for shipping delays. <br><br>
Insights:


1.   September to December is the busy season; a large share of spending is concentrated there.
2.   There is a spike in spending before most holidays (some more than others), especially before holidays where gifting is a tradition. For example, there is a pronounced spike just before Valentine's Day.
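
The visual impression can be turned into a ranking by summing spending in a fixed window before each holiday. This is a minimal sketch on invented daily totals. The function name `pre_holiday_spend` and the 7-day window are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Invented daily totals in the shape of sorted_ecom_df
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2011-02-08", "2011-02-10", "2011-02-13",
                            "2011-02-14", "2011-05-27", "2011-05-29"]),
    "MoneySpent": [100.0, 200.0, 300.0, 50.0, 80.0, 20.0],
})
# Two holidays from the holiday table above
holidays = pd.DataFrame({
    "Date": pd.to_datetime(["2011-02-14", "2011-05-30"]),
    "Holiday": ["Valentine's Day", "Memorial Day"],
})


def pre_holiday_spend(daily_df, holiday_df, days_before=7):
    """Sum MoneySpent over the `days_before` days preceding each holiday."""
    rows = []
    for holiday_date, name in zip(holiday_df["Date"], holiday_df["Holiday"]):
        start = holiday_date - pd.Timedelta(days=days_before)
        # The holiday itself is excluded; only the run-up is counted
        in_window = (daily_df["Date"] >= start) & (daily_df["Date"] < holiday_date)
        rows.append({
            "Holiday": name,
            "PreHolidaySpend": daily_df.loc[in_window, "MoneySpent"].sum(),
        })
    ranked = pd.DataFrame(rows).sort_values("PreHolidaySpend", ascending=False)
    return ranked.reset_index(drop=True)


result = pre_holiday_spend(daily, holidays)
print(result)
```

The window length is a judgment call; comparing a few values (for example 5, 7 and 14 days) would show how sensitive the ranking is to it.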



In [27]:
# Explicit ISO dates; this assumes the original "Jan, 2011"-style strings
# were parsed as the first day of the month, so the bounds are unchanged.
jan_to_aug_mask = ((sorted_ecom_df["Date"] > "2011-01-01") & 
                   (sorted_ecom_df["Date"] < "2011-09-01"))
sep_to_dec_mask = ((sorted_ecom_df["Date"] >= "2011-09-01") & 
                   (sorted_ecom_df["Date"] <= "2011-12-01"))

# Average daily spending in each period
jan_to_aug_spend = sorted_ecom_df.loc[jan_to_aug_mask, "MoneySpent"].mean()
sep_to_dec_spend = sorted_ecom_df.loc[sep_to_dec_mask, "MoneySpent"].mean()

spend_jump = go.Figure()
spend_jump.add_bar(x=["Jan-Aug", "Sept-Dec"], 
                   y=[jan_to_aug_spend, sep_to_dec_spend])
spend_jump.update_layout(titlefont=dict(size=20, color='black'),
                         autosize=False, width=800, height=500,
                         title=dict(
                             text = "Comparing Average Spendings from Jan-Aug to Sep-Dec",
                             y = 0.95,
                             x = 0.5,
                             xanchor =  'center',
                             yanchor = 'top'),
                         yaxis=dict(title_text="Average Daily Money Spent"))
spend_jump.show()
                                  
# print(f"January to August average spending: {round(jan_to_aug_spend, 2)}")
# print(f"September to December average spending: {round(sep_to_dec_spend, 2)}")
# print(f"It equates to nearly {round(sep_to_dec_spend/jan_to_aug_spend, 1)}x more.")

September to December is clearly busier than the first eight months.
From January to August, average daily spending is fairly consistent at around 4k, whereas from September to December it increases significantly.

The increase in average daily spending is almost 1.8x. Hence, more of the marketing budget should be allotted to advertising products from September to December.


### Valentine's Day and Christmas
We will now track the popularity of Christmas and Valentine's Day items throughout the year, since these two holidays are quite popular.




In [28]:
# Building the DF for Christmas items
xmas_items = eu_ecom_df[eu_ecom_df["Description"].str.
                        contains("CHRISTMAS|xmas", 
                                 case=False, 
                                 regex=True)].groupby("Date").agg({
                                     "Quantity": pd.Series.sum,
                                     "MoneySpent": pd.Series.sum
                                     })
xmas_items.reset_index(inplace=True)
xmas_items.sort_values(["Date", "Quantity"], ascending=[True, False], 
                       inplace=True)

# Building the DF for Valentine items
val_items = eu_ecom_df[eu_ecom_df["Description"].str.
                       contains("heart|love|valentine", 
                                case=False, 
                                regex=True)].groupby(["Date"]).agg({
                                    "Quantity": pd.Series.sum,
                                    "MoneySpent": pd.Series.sum
                                    })
val_items.reset_index(inplace=True)
val_items.sort_values(["Date", "Quantity"], ascending=[True, False], 
                      inplace=True)

# Plotting the comparison chart
xmas_vs_valentine_items = go.Figure()
xmas_vs_valentine_items.add_trace(go.Scatter(x=xmas_items["Date"], y=xmas_items["MoneySpent"],
                    mode='lines', fill='tozeroy', name='Christmas Items',
                    opacity=1))
xmas_vs_valentine_items.add_trace(go.Scatter(x=val_items["Date"], y=val_items["MoneySpent"],
                    mode='lines', fill='tozeroy', name='Valentine Items'))

xmas_vs_valentine_items.update_layout(titlefont=dict(size=20, color='black'),
                      title=dict(
                          text = "Comparing Item Sales for Valentine's Vs Christmas",
                          y = 0.95,
                          x = 0.5,
                          xanchor =  'center',
                          yanchor = 'top'
                          ),
                      legend=dict(
                          orientation = "h", 
                          yanchor = "bottom", 
                          y = 1.02, 
                          xanchor = "right",
                          x = 1))

xmas_vs_valentine_items.show()

In [29]:
xmas_vs_valentine_spend = go.Figure()
xmas_vs_valentine_spend.add_bar(x=["Christmas", "Valentine"], 
                          y=[xmas_items["MoneySpent"].sum(), 
                             val_items["MoneySpent"].sum()])
xmas_vs_valentine_spend.update_layout(autosize=False, width=800, height=500,
                                      titlefont=dict(size=20, color='black'),
                                      title=dict(
                                        text = "Total Money Spent on Christmas Items vs Valentine's Items Throughout the Year",
                                        y = 0.95,
                                        x = 0.5,
                                        xanchor =  'center',
                                        yanchor = 'top'),
                                      yaxis=dict(title_text="Total Money Spent"))
xmas_vs_valentine_spend.show()

#### Insights:

1. Sales of Christmas items rise steadily from mid-August and keep increasing until the peak Christmas season.
2. We expected Valentine's Day gifts to sell mostly around Valentine's Day, but the data does not show that: they keep selling throughout the year.
3. Items matched by the Valentine keywords bring in more money, and more consistently, than Christmas items, so the marketing team should consider giving them more attention. Note that the keywords (`heart`, `love`, `valentine`) also match general gift items such as `SET 2 TEA TOWELS I LOVE LONDON`, which likely explains part of the year-round sales.
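
Because the Valentine filter is a plain substring match, it is worth checking what it actually catches. The sketch below compares the substring pattern with a word-bounded variant on a few descriptions. The first two come from the data shown earlier; the last two are invented for illustration.

```python
import pandas as pd

descriptions = pd.Series([
    "SET 2 TEA TOWELS I LOVE LONDON",  # from the data above
    "HAND WARMER OWL DESIGN",          # from the data above
    "GARDENING GLOVES PINK",           # invented: contains "love" inside "GLOVES"
    "VALENTINE HEART CARD",            # invented
])

# Substring match, as used in the cell above
broad = descriptions.str.contains("heart|love|valentine", case=False, regex=True)

# Word-bounded match: keywords must appear as whole words (optional plural "s")
bounded = descriptions.str.contains(
    r"\b(?:heart|love|valentine)s?\b", case=False, regex=True
)

check = pd.DataFrame({"description": descriptions, "broad": broad, "bounded": bounded})
print(check)
```

Word boundaries remove accidental hits such as `GLOVES`, but a souvenir item like the London tea towels still matches, so keyword filters remain a heuristic and the matched descriptions deserve a manual review.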

### Other Holidays

In [30]:
items_df = eu_ecom_df.groupby(["Date", "Description"]).agg({"Quantity": pd.Series.sum,
                                                 "UnitPrice": pd.Series.sum,
                                                 "MoneySpent": pd.Series.sum})
items_df.sort_values(["Date", "Quantity"], ascending=[True,False], inplace=True)
items_df.reset_index(inplace=True)

In [31]:
# Columbus DF
columbus_df = items_df[(items_df["Date"] > "2011-10-05") & (items_df["Date"] < "2011-10-11")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
columbus_df.sort_values("MoneySpent", ascending=False, inplace=True)
columbus_df.reset_index(inplace=True)


# 4th of July DF
july_4_df = items_df[(items_df["Date"] > "2011-06-29") & (items_df["Date"] <= "2011-07-04")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
july_4_df.sort_values("MoneySpent", ascending=False, inplace=True)
july_4_df.reset_index(inplace=True)


# Labor Day DF
labor_day_df = items_df[(items_df["Date"] >= "2011-09-01") & (items_df["Date"] <= "2011-09-05")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
labor_day_df.sort_values("MoneySpent", ascending=False, inplace=True)
labor_day_df.reset_index(inplace=True)


# Thanksgiving DF
thanksgiving_df = items_df[(items_df["Date"] >= "2011-11-20") & (items_df["Date"] <= "2011-11-24")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
thanksgiving_df.sort_values("MoneySpent", ascending=False, inplace=True)
thanksgiving_df.reset_index(inplace=True)


# Memorial Day DF
memorial_day_df = items_df[(items_df["Date"] >= "2011-05-26") & (items_df["Date"] <= "2011-05-30")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
memorial_day_df.sort_values("MoneySpent", ascending=False, inplace=True)
memorial_day_df.reset_index(inplace=True)


# Easter DF
easter_df = items_df[(items_df["Date"] >= "2011-04-20") & (items_df["Date"] <= "2011-04-24")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
easter_df.sort_values("MoneySpent", ascending=False, inplace=True)
easter_df.reset_index(inplace=True)


# Juneteenth DF
juneteenth_df = items_df[(items_df["Date"] >= "2011-06-15") & (items_df["Date"] <= "2011-06-19")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
juneteenth_df.sort_values("MoneySpent", ascending=False, inplace=True)
juneteenth_df.reset_index(inplace=True)


# Veterans DF
veterans_df = items_df[(items_df["Date"] >= "2011-11-07") & (items_df["Date"] <= "2011-11-11")].groupby("Description").agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum
})
veterans_df.sort_values("MoneySpent", ascending=False, inplace=True)
veterans_df.reset_index(inplace=True)
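The eight blocks above repeat the same filter, group, sort and reset steps with different date windows. As a minimal sketch, they could be folded into one helper. The function name `top_items_in_window`, the `holiday_windows` dictionary and the toy data below are hypothetical and not part of the notebook. The sketch also assumes `Date` is a datetime column and that each window is inclusive on both ends.

```python
import pandas as pd

def top_items_in_window(items_df, start, end):
    """Sum MoneySpent and Quantity per item for start <= Date <= end,
    sorted by MoneySpent in descending order."""
    in_window = items_df["Date"].between(pd.Timestamp(start), pd.Timestamp(end))
    return (
        items_df[in_window]
        .groupby("Description")
        .agg({"MoneySpent": "sum", "Quantity": "sum"})
        .sort_values("MoneySpent", ascending=False)
        .reset_index()
    )

# Tiny made-up frame standing in for items_df (illustrative values only).
toy_items = pd.DataFrame({
    "Date": pd.to_datetime(["2011-10-06", "2011-10-07", "2011-10-10", "2011-10-12"]),
    "Description": ["MUG", "LAMP", "MUG", "LAMP"],
    "Quantity": [2, 1, 3, 5],
    "MoneySpent": [4.0, 15.0, 6.0, 75.0],
})

# Hypothetical mapping of holiday name to (start, end) window.
holiday_windows = {"Columbus Day": ("2011-10-06", "2011-10-10")}

holiday_frames = {
    name: top_items_in_window(toy_items, start, end)
    for name, (start, end) in holiday_windows.items()
}
print(holiday_frames["Columbus Day"])
```

Keeping the windows in one dictionary makes it easier to see, and keep consistent, which days each holiday covers.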

In [32]:
specs = [[{'type':'domain'}, {'type':'domain'}],
         [{'type':'domain'}, {'type':'domain'}],
         [{'type':'domain'}, {'type':'domain'}],
         [{'type':'domain'}, {'type':'domain'}]]

most_popular_items = make_subplots(rows=4, cols=2, specs=specs, vertical_spacing=0.05, horizontal_spacing=0.1,
                    subplot_titles=["Columbus Day", "4th of July", "Labor Day", "Thanksgiving", "Memorial Day", "Easter", "Juneteenth", "Veteran's Day"])

most_popular_items.add_trace(go.Pie(labels=columbus_df["Description"][:5], 
                     values=columbus_df["MoneySpent"][:5], name='Columbus Day',
                     marker_colors=px.colors.sequential.ice), 1, 1)

most_popular_items.add_trace(go.Pie(labels=july_4_df["Description"][:5], 
                     values=july_4_df["MoneySpent"][:5], name='4th of July',
                     marker_colors=px.colors.sequential.ice), 1, 2)

most_popular_items.add_trace(go.Pie(labels=labor_day_df["Description"][:5], 
                     values=labor_day_df["MoneySpent"][:5], name='Labor Day',
                     marker_colors=px.colors.sequential.ice), 2, 1)

most_popular_items.add_trace(go.Pie(labels=thanksgiving_df["Description"][:5], 
                     values=thanksgiving_df["MoneySpent"][:5], name='Thanksgiving',
                     marker_colors=px.colors.sequential.ice), 2, 2)

most_popular_items.add_trace(go.Pie(labels=memorial_day_df["Description"][:5], 
                     values=memorial_day_df["MoneySpent"][:5], name='Memorial Day',
                     marker_colors=px.colors.sequential.ice), 3, 1)

most_popular_items.add_trace(go.Pie(labels=easter_df["Description"][:5], 
                     values=easter_df["MoneySpent"][:5], name='Easter',
                     marker_colors=px.colors.sequential.ice), 3, 2)

most_popular_items.add_trace(go.Pie(labels=juneteenth_df["Description"][:5], 
                     values=juneteenth_df["MoneySpent"][:5], name='Juneteenth',
                     marker_colors=px.colors.sequential.ice), 4, 1)

most_popular_items.add_trace(go.Pie(labels=veterans_df["Description"][:5], 
                     values=veterans_df["MoneySpent"][:5], name="Veteran's Day",
                     marker_colors=px.colors.sequential.ice), 4, 2)

most_popular_items.update_layout(autosize=False,
                      width=1000,
                      height=2150,
                      title=dict(
                          text = "<b>Comparing the Top 5 Most Revenue Generating Items During Holidays</b>",
                          # Title font set via title.font (layout.titlefont is deprecated)
                          font = dict(size=20, color='black'),
                          y = 0.99,
                          x = 0.5,
                          xanchor = 'center',
                          yanchor = 'top'
                          ),
                      showlegend=False)

most_popular_items.show()

#### Insights:

1. The items that generated the most revenue around these holidays show no apparent relation to the holidays themselves.
2. The holiday list consists mostly of holidays celebrated in the USA, while most orders come from the UK. Hence, with the exception of Valentine's Day and Christmas, there is no apparent correlation between the holidays in the dataset and the items purchased.
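One rough way to put a number on the first insight is to measure how much the top-item lists overlap between holidays. The sketch below computes the Jaccard overlap (shared items divided by all distinct items) for every pair of holidays. The function `top_item_overlap` and the toy item lists are hypothetical. A high overlap would suggest that the same items sell regardless of the holiday, which is only indirect evidence and not a formal test.

```python
from itertools import combinations

def top_item_overlap(top_items):
    """Return the Jaccard overlap of the top-item sets for each holiday pair."""
    overlaps = {}
    for (name_a, items_a), (name_b, items_b) in combinations(top_items.items(), 2):
        set_a, set_b = set(items_a), set(items_b)
        union = set_a | set_b
        overlaps[(name_a, name_b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps

# Made-up top-item lists; in the notebook these could come from
# e.g. columbus_df["Description"][:5] for each holiday frame.
toy_top_items = {
    "Holiday A": ["MUG", "LAMP", "CLOCK"],
    "Holiday B": ["MUG", "BAG", "CLOCK"],
    "Holiday C": ["TEA SET"],
}

overlaps = top_item_overlap(toy_top_items)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```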

# Insights About the Dataset

1. Both datasets are clean and contain no null values.
2. The e-commerce data spans roughly a year, while the holiday data spans multiple years.
3. There is no data for any Saturday, so we assume the business is closed on Saturdays.
4. Data is also missing for most UK public holidays, so we assume the business was closed on those days.
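The completeness observations above can be checked with a few lines of pandas. The following is a minimal sketch on a made-up frame, not the notebook's actual data. The name `toy_sales` and its values are hypothetical, and it assumes the sales table has a datetime `Date` column.

```python
import pandas as pd

# Made-up sales rows covering Monday to Friday and Sunday, but no Saturday.
toy_sales = pd.DataFrame({
    "Date": pd.to_datetime([
        "2011-10-03", "2011-10-04", "2011-10-05",
        "2011-10-06", "2011-10-07", "2011-10-09",
    ]),
    "MoneySpent": [10.0, 12.5, 8.0, 20.0, 15.0, 5.0],
})

# Null values per column.
null_counts = toy_sales.isnull().sum()

# Weekdays that never appear in the data.
all_days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
observed_days = set(toy_sales["Date"].dt.day_name())
missing_weekdays = [day for day in all_days if day not in observed_days]

print(null_counts)
print("Weekdays with no data:", missing_weekdays)
```

Running the same checks on the real sales frame would show whether the Saturday gap holds across the whole year or only for part of it.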