<a href="https://colab.research.google.com/github/ameymane09/Holidays-analysis-for-marketing/blob/main/Assignment_2_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

Having a great marketing strategy in place is key to the success of any business. Without a marketing strategy, you lack focus. And without focus, you will, quite simply, fail to reach any of the goals and objectives that you have set. Any information about customers allows marketers to gain a laser-sharp understanding of their target audience. The marketing budget is being set for the year 2023. The marketing director would like to know which holiday brings in the most money so the team can adjust the marketing dollars. <br><br>
### Objective

What holidays should the marketing team invest more marketing dollars in? Also, find out whatever other insights you can from the dataset. <br><br>

### Instructions

Answer the questions (using SQL, Python) and explain your rationale in the write-up:

1. How would you segment holidays based on the expenditure of customers?
2. Which of these segments / sub-segments would you propose be approved?
    - For example, does a certain holiday drive the sales of a particular segment of people, and how can the marketing team use that to optimize their plan? Would a holiday season’s duration affect the number of sales, and how should the marketing team strategize around that?
3. What other insights in general can you share about these segments?
4. Tell us what your observations were on the data itself (completeness, skews) and how you would treat any anomalies (for example, missing data).

# Importing, reading and cleaning the data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns
import plotly.figure_factory as ff


In [None]:
pd.options.plotting.backend = "plotly"

In [None]:
ecom_df = pd.read_csv("/content/Ecommerce_Data.csv")
holiday_df = pd.read_csv("/content/US_Holiday_Dates_(2004-2021).csv")

In [None]:
ecom_df.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Hour
0,439570,574477,22591,CARDHOLDER GINGHAM CHRISTMAS TREE,1,3.25,15453.0,United Kingdom,2011-11-04,12
1,387281,570275,23541,WALL ART CLASSIC PUDDINGS,12,7.45,13098.0,United Kingdom,2011-10-10,10
2,337863,566482,22508,DOORSTOP RETROSPOT HEART,12,3.75,16609.0,United Kingdom,2011-09-13,9
3,57628,541215,22662,LUNCH BAG DOLLY GIRL DESIGN,10,1.65,14329.0,United Kingdom,2011-01-14,13
4,330897,565930,POST,POSTAGE,5,18.0,12685.0,France,2011-09-08,10


In [None]:
ecom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81601 entries, 0 to 81600
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   81601 non-null  int64  
 1   InvoiceNo    81601 non-null  object 
 2   StockCode    81601 non-null  object 
 3   Description  81601 non-null  object 
 4   Quantity     81601 non-null  int64  
 5   UnitPrice    81601 non-null  float64
 6   CustomerID   81601 non-null  float64
 7   Country      81601 non-null  object 
 8   Date         81601 non-null  object 
 9   Hour         81601 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 6.2+ MB


In [None]:
ecom_df.describe()

Unnamed: 0.1,Unnamed: 0,Quantity,UnitPrice,CustomerID,Hour
count,81601.0,81601.0,81601.0,81601.0,81601.0
mean,278271.366772,11.965736,3.168721,15283.816215,12.729783
std,152483.054308,45.782018,18.731668,1713.292081,2.288777
min,2.0,-3114.0,0.0,12347.0,6.0
25%,148283.0,2.0,1.25,13949.0,11.0
50%,284742.0,5.0,1.95,15144.0,13.0
75%,409445.0,12.0,3.75,16790.0,14.0
max,541908.0,3186.0,4287.63,18287.0,20.0


The `Unnamed: 0` column carries no information that the analysis needs. It looks like a row index left over from an earlier export, but the dataset does not document it, so I drop it.

In [None]:
ecom_df.drop(labels="Unnamed: 0", axis=1, inplace=True)
ecom_df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice',
       'CustomerID', 'Country', 'Date', 'Hour'],
      dtype='object')
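
If `Unnamed: 0` is a row index written out by an earlier `to_csv` call (an assumption; the source does not say), it could also be absorbed at load time instead of being dropped afterwards. A minimal sketch on a small inline CSV standing in for `Ecommerce_Data.csv`:

```python
import io
import pandas as pd

# Two-row stand-in for the real file. The empty first header cell is what
# pandas would otherwise report as the "Unnamed: 0" column.
csv_text = """,InvoiceNo,Quantity,UnitPrice
439570,574477,1,3.25
387281,570275,12,7.45
"""

# index_col=0 uses the first column as the index instead of a data column.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample_df.columns.tolist())
```

Whether this is preferable depends on whether the original row numbers are ever needed again; here they are kept as the index rather than discarded.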

In [None]:
holiday_df.head()

Unnamed: 0,Date,Holiday,WeekDay,Month,Day,Year
0,2004-07-04,4th of July,Sunday,7,4,2004
1,2005-07-04,4th of July,Monday,7,4,2005
2,2006-07-04,4th of July,Tuesday,7,4,2006
3,2007-07-04,4th of July,Wednesday,7,4,2007
4,2008-07-04,4th of July,Friday,7,4,2008


In [None]:
holiday_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 342 entries, 0 to 341
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     342 non-null    object
 1   Holiday  342 non-null    object
 2   WeekDay  342 non-null    object
 3   Month    342 non-null    int64 
 4   Day      342 non-null    int64 
 5   Year     342 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 16.2+ KB


In [None]:
ecom_df.isna().any()

InvoiceNo      False
StockCode      False
Description    False
Quantity       False
UnitPrice      False
CustomerID     False
Country        False
Date           False
Hour           False
dtype: bool

In [None]:
holiday_df.isna().any()

Date       False
Holiday    False
WeekDay    False
Month      False
Day        False
Year       False
dtype: bool

Neither dataset contains null values, so we can proceed with the analysis.
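
Null checks alone do not reveal every anomaly: the `describe()` output above shows a negative minimum `Quantity` and a minimum `UnitPrice` of zero. A minimal sketch of a few further sanity checks, run on a small made-up frame that reuses the column names of `ecom_df` (the values are illustrative only):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "InvoiceNo": ["574477", "C569959", "570275", "570275"],
    "Quantity": [1, -3, 12, 12],
    "UnitPrice": [3.25, 0.85, 0.0, 0.0],
})

checks = {
    # Rows with a negative quantity (possible returns)
    "negative_quantity_rows": int((toy_df["Quantity"] < 0).sum()),
    # Rows priced at zero (possible freebies or data-entry gaps)
    "zero_price_rows": int((toy_df["UnitPrice"] == 0).sum()),
    # Rows that exactly repeat an earlier row
    "duplicate_rows": int(toy_df.duplicated().sum()),
}
print(checks)
```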

# Solution

### Grouping Data from European Countries

The data is from many different countries. It looks like the E-commerce business operates worldwide.

In [None]:
ecom_df.groupby("Country")["Quantity"].count().nlargest(10)

Country
United Kingdom    72617
Germany            1898
France             1689
EIRE               1500
Spain               504
Netherlands         469
Belgium             379
Switzerland         358
Portugal            284
Australia           254
Name: Quantity, dtype: int64

In [None]:
hist1 = px.histogram(ecom_df, x="Country", y="Quantity")

hist1.update_layout(titlefont=dict(size=20, color='black'),
                      title={
                            'text': "Distribution of Orders Worldwide",
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'})
hist1.update_xaxes(tickangle=45)
hist1.update_yaxes(title="Total Quantity Ordered")  # the histogram sums Quantity per country
hist1.show()

The company appears to be based in the UK: 72,617 of the 81,601 rows (about 89%) are UK orders, with comparatively few from other countries. <br> <br>
Since the business is concentrated in and around the UK, we keep only the ten countries with the most orders (nine European countries plus Australia) and drop the rest, because holidays observed in Europe may not be celebrated elsewhere and would distort the results.
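
How dominant a single country is can be quantified with normalised counts. A minimal sketch on made-up rows; the real shares would come from `ecom_df["Country"]`:

```python
import pandas as pd

# Illustrative data: 8 of 10 rows are from the United Kingdom.
countries = pd.Series(["United Kingdom"] * 8 + ["Germany", "France"], name="Country")

# normalize=True returns each country's fraction of all rows.
share = countries.value_counts(normalize=True)
print(share.round(2))
```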

In [None]:
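# Keep the ten countries with the most rows
top_list = ecom_df["Country"].value_counts().index[:10].tolist()
print(top_list)

# .copy() gives an independent frame, so columns can be added later
# without triggering SettingWithCopyWarning
eu_ecom_df = ecom_df[ecom_df["Country"].isin(top_list)].copy()
eu_ecom_df.info()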

['United Kingdom', 'Germany', 'France', 'EIRE', 'Spain', 'Netherlands', 'Belgium', 'Switzerland', 'Portugal', 'Australia']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79952 entries, 0 to 81600
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    79952 non-null  object 
 1   StockCode    79952 non-null  object 
 2   Description  79952 non-null  object 
 3   Quantity     79952 non-null  int64  
 4   UnitPrice    79952 non-null  float64
 5   CustomerID   79952 non-null  float64
 6   Country      79952 non-null  object 
 7   Date         79952 non-null  object 
 8   Hour         79952 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 6.1+ MB


<br><br>

We will add a column named `MoneySpent` (`Quantity` × `UnitPrice`) to show the amount a customer spent on each invoice line. This will also help us quantify customer spending habits.

In [None]:
eu_ecom_df["MoneySpent"] = eu_ecom_df["Quantity"] * eu_ecom_df["UnitPrice"]
eu_ecom_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Hour,MoneySpent
0,574477,22591,CARDHOLDER GINGHAM CHRISTMAS TREE,1,3.25,15453.0,United Kingdom,2011-11-04,12,3.25
1,570275,23541,WALL ART CLASSIC PUDDINGS,12,7.45,13098.0,United Kingdom,2011-10-10,10,89.4
2,566482,22508,DOORSTOP RETROSPOT HEART,12,3.75,16609.0,United Kingdom,2011-09-13,9,45.0
3,541215,22662,LUNCH BAG DOLLY GIRL DESIGN,10,1.65,14329.0,United Kingdom,2011-01-14,13,16.5
4,565930,POST,POSTAGE,5,18.0,12685.0,France,2011-09-08,10,90.0


In [None]:
sorted_ecom_df = eu_ecom_df.groupby("Date").agg({"MoneySpent": pd.Series.sum,
                                                 "Quantity": pd.Series.sum})
sorted_ecom_df.reset_index(inplace=True)
sorted_ecom_df.head()

Unnamed: 0,Date,MoneySpent,Quantity
0,2010-12-01,8093.25,3771
1,2010-12-02,7660.42,7154
2,2010-12-03,3396.26,1551
3,2010-12-05,6620.81,2916
4,2010-12-06,6395.05,3035


In [None]:
fig1 = px.line(sorted_ecom_df, x="Date", y="MoneySpent")
fig1.update_layout(titlefont=dict(size=20, color='black'),
                   title={
                        'text': "Customer Spending Throughout the Year",
                        'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                   )
fig1.show()

fig2 = px.line(sorted_ecom_df, x="Date", y="Quantity")
fig2.update_layout(titlefont=dict(size=20, color='black'),
                   title={
                        'text': "Amount of Items Purchased Throughout the Year",
                        'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                   )
fig2.show()

The graphs above show clear spikes in both customer spending and the number of items purchased over the course of the year. We will investigate these spikes further.

<br>


#### Understanding Returns

From the graphs above, we can see that the `Quantity` data contains some negative values. I'm assuming these rows are returns; in the sample below their invoice numbers start with `C`, which would be consistent with cancellations. <br>
There are also some days where `MoneySpent` is negative, meaning the value of the items returned that day exceeded the value of the items purchased.

It would be interesting to find out why returns outweigh purchases on those days, but that is out of scope for now.

In [None]:
eu_ecom_df[eu_ecom_df["MoneySpent"] < 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Hour,MoneySpent
22,C569959,22197,POPCORN HOLDER,-3,0.85,13983.0,United Kingdom,2011-10-06,18,-2.55
62,C538701,22644,CERAMIC CHERRY CAKE MONEY BANK,-1,1.45,16863.0,United Kingdom,2010-12-14,10,-1.45
72,C543706,22423,REGENCY CAKESTAND 3 TIER,-1,12.75,15044.0,United Kingdom,2011-02-11,11,-12.75
129,C559309,22727,ALARM CLOCK BAKELIKE RED,-4,3.75,17719.0,United Kingdom,2011-07-07,13,-15.00
281,C542091,22442,GROW YOUR OWN FLOWERS SET OF 3,-1,7.95,17091.0,United Kingdom,2011-01-25,12,-7.95
...,...,...,...,...,...,...,...,...,...,...
81363,C547763,23182,TOILET SIGN OCCUPIED OR VACANT,-1,0.83,14194.0,United Kingdom,2011-03-25,11,-0.83
81397,C540788,21231,SWEETHEART CERAMIC TRINKET BOX,-2,1.25,13576.0,United Kingdom,2011-01-11,11,-2.50
81446,C562559,21231,SWEETHEART CERAMIC TRINKET BOX,-13,1.25,16743.0,United Kingdom,2011-08-05,17,-16.25
81567,C560540,23242,TREASURE TIN BUFFALO BILL,-1,2.08,12415.0,Australia,2011-07-19,12,-2.08
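
In the rows shown, negative quantities coincide with invoice numbers that start with `C`. Assuming that prefix marks a cancellation or return (the dataset does not document it), gross sales and returns can be separated as in this sketch on made-up rows:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "InvoiceNo": ["574477", "C569959", "570275", "C538701"],
    "Quantity": [1, -3, 12, -1],
    "UnitPrice": [3.25, 0.85, 7.45, 1.45],
})
toy_df["MoneySpent"] = toy_df["Quantity"] * toy_df["UnitPrice"]

# Assumption: an invoice number beginning with "C" denotes a return.
is_return = toy_df["InvoiceNo"].astype(str).str.startswith("C")

gross_sales = toy_df.loc[~is_return, "MoneySpent"].sum()
returned_value = -toy_df.loc[is_return, "MoneySpent"].sum()
net_sales = gross_sales - returned_value
print(f"gross={gross_sales:.2f} returned={returned_value:.2f} net={net_sales:.2f}")
```

Keeping the two figures apart makes it possible to report revenue and return volume separately instead of only their net.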


### Working Around Missing Data

We will merge the daily totals with the holiday table to check which holidays saw the most spending.

In [None]:
sorted_ecom_df["Date"] = pd.to_datetime(sorted_ecom_df["Date"], format='%Y-%m-%d')
holiday_df["Date"] = pd.to_datetime(holiday_df["Date"], format='%Y-%m-%d')
eu_ecom_df["Date"] = pd.to_datetime(eu_ecom_df["Date"], format='%Y-%m-%d')

In [None]:
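# Join on the shared Date column explicitly; the inner join keeps only holidays that have sales data
df_merged = holiday_df.merge(sorted_ecom_df, on="Date", how="inner").sort_values("MoneySpent", ascending=False)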
df_merged.head(20)

Unnamed: 0,Date,Holiday,WeekDay,Month,Day,Year,MoneySpent,Quantity
7,2011-11-23,Thanksgiving Eve,Wednesday,11,23,2011,13208.58,7448
1,2011-10-10,Columbus Day,Monday,10,10,2011,8524.41,4666
9,2011-11-11,Veterans Day,Friday,11,11,2011,7014.75,4225
0,2011-07-04,4th of July,Monday,7,4,2011,6776.75,2346
3,2011-09-05,Labor Day,Monday,9,5,2011,6647.82,4100
6,2011-11-24,Thanksgiving Day,Thursday,11,24,2011,6232.85,3630
10,2011-02-21,Washington's Birthday,Monday,2,21,2011,5724.2,3713
8,2011-02-14,Valentine’s Day,Monday,2,14,2011,4681.92,3414
2,2011-06-19,Juneteenth,Sunday,6,19,2011,4431.4,2812
4,2011-09-04,Labor Day Weekend,Sunday,9,4,2011,3636.26,2799
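
An inner merge silently drops holidays that have no matching sales date. A sketch, on a few illustrative rows, of a left merge with `indicator=True` that lists those holidays instead (the holiday name `Christmas Day` is assumed here for illustration):

```python
import pandas as pd

holidays = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-25", "2011-02-14", "2011-07-04"]),
    "Holiday": ["Christmas Day", "Valentine's Day", "4th of July"],
})
daily_sales = pd.DataFrame({
    "Date": pd.to_datetime(["2011-02-14", "2011-07-04"]),
    "MoneySpent": [4681.92, 6776.75],
})

# how="left" keeps every holiday; the _merge column records whether a
# matching sales row was found.
merged = holidays.merge(daily_sales, on="Date", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "Holiday"].tolist()
print(missing)
```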


`df_merged` has no Christmas row: there is no sales data for 24th and 25th December 2010. We can't ignore Christmas, though, as it is one of the most important holidays of the year.

In [None]:
sorted_ecom_df[(sorted_ecom_df["Date"] >= "2010-12-24") & 
               (sorted_ecom_df["Date"] <= "2010-12-25")]

Unnamed: 0,Date,MoneySpent,Quantity


In [None]:
sorted_ecom_df[sorted_ecom_df["Date"].dt.year == 2011].head(5)

Unnamed: 0,Date,MoneySpent,Quantity
20,2011-01-04,2107.4,1381
21,2011-01-05,4905.26,4007
22,2011-01-06,7002.99,3359
23,2011-01-07,4250.22,2523
24,2011-01-09,2878.84,1428


Upon further inspection, there is no data from 24th December 2010 to 3rd January 2011. We will have to work around this gap by adding the Christmas date to the holiday list manually.
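
Gaps like this can be located programmatically by comparing the dates present with a full daily calendar. A minimal sketch on four made-up dates around the same period; on the real data any other absent days would be listed as well:

```python
import pandas as pd

dates = pd.to_datetime(["2010-12-22", "2010-12-23", "2011-01-04", "2011-01-05"])

# Every calendar day between the first and last observed date
expected = pd.date_range(dates.min(), dates.max(), freq="D")

# Days in the calendar that never occur in the data
missing_days = expected.difference(dates)
print(missing_days.min().date(), missing_days.max().date(), len(missing_days))
```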

## Understanding the Data as a Whole

Now let us try to visualize customer spending around the holidays. I have highlighted the holidays in a different colour so as to make it easy to spot them.

In [None]:
# Teal for ordinary days, red for days that appear in the holiday table
is_holiday = sorted_ecom_df["Date"].isin(holiday_df["Date"])
color_discrete_sequence = ["#CF0A0A" if hol else "#00ABB3" for hol in is_holiday]

# One category per row, so each bar takes its own colour from the sequence
sorted_ecom_df['Category'] = [str(i) for i in sorted_ecom_df.index]

bar1 = px.bar(data_frame=sorted_ecom_df,
              x="Date", 
              y="MoneySpent", 
              color=sorted_ecom_df['Category'], 
              color_discrete_sequence=color_discrete_sequence)

bar1.update_layout(showlegend=False, 
                   titlefont=dict(size=20, color='black'),
                   title={
                        'text': "Customer Spending Around the Holidays",
                        'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                   )
bar1.show()

The graph above suggests that holidays do influence spending, although the increase comes some time before the holiday rather than on the date itself. This makes sense: people tend to buy gifts in advance and allow for shipping delays. <br><br>
Insights:


1.   September to December is the busy season, as a lot of spending is concentrated there.
2.   There is a spike in spending before most holidays (some more than others), especially before holidays where gifting is a tradition. For example, we see a marked spike just before Valentine's Day.
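
The pre-holiday spike described in the second insight can be measured rather than read off the chart. A rough sketch, assuming a 7-day lead-up window (an arbitrary choice, not something the data prescribes) and made-up daily totals; `pre_holiday_spend` is a hypothetical helper:

```python
import pandas as pd

# Made-up daily totals: 100 on 1 Feb, 200 on 2 Feb, and so on.
daily = pd.DataFrame({
    "Date": pd.date_range("2011-02-01", periods=20, freq="D"),
    "MoneySpent": range(100, 2100, 100),
})
holidays = pd.DataFrame({
    "Date": pd.to_datetime(["2011-02-14"]),
    "Holiday": ["Valentine's Day"],
})


def pre_holiday_spend(daily, holidays, window_days=7):
    """Total spending in the window_days days before each holiday (holiday excluded)."""
    rows = []
    for hol in holidays.itertuples(index=False):
        start = hol.Date - pd.Timedelta(days=window_days)
        in_window = (daily["Date"] >= start) & (daily["Date"] < hol.Date)
        rows.append({
            "Holiday": hol.Holiday,
            "PreHolidaySpend": daily.loc[in_window, "MoneySpent"].sum(),
        })
    return pd.DataFrame(rows)


result = pre_holiday_spend(daily, holidays)
print(result)
```

Ranking holidays by this lead-up total, rather than by spending on the day itself, would match the buying pattern described above more closely.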



Also, we can see that September to December is busier than the first eight months.
Spending from January to August is fairly steady, whereas from September to December the average daily spending increases significantly.

In [None]:
# Mean daily spending in each window of 2011
jan_to_aug_mask = (sorted_ecom_df["Date"] > "Jan, 2011") & (sorted_ecom_df["Date"] < "Sep, 2011")
jan_to_aug_spend = sorted_ecom_df.loc[jan_to_aug_mask, "MoneySpent"].mean()

sep_to_dec_mask = (sorted_ecom_df["Date"] >= "Sep, 2011") & (sorted_ecom_df["Date"] <= "Dec, 2011")
sep_to_dec_spend = sorted_ecom_df.loc[sep_to_dec_mask, "MoneySpent"].mean()

print(f"January to August average spending: {round(jan_to_aug_spend, 2)}")
print(f"September to December average spending: {round(sep_to_dec_spend, 2)}")
print(f"It equates to nearly {round(sep_to_dec_spend/jan_to_aug_spend, 1)}x more.")

January to August average spending: 4257.46
September to December average spending: 7339.73
It equates to nearly 1.7x more.


Average daily spending from September to December is almost 1.7x the January to August average. Hence, more of the marketing budget should be allotted to advertising products from September to December.
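
String bounds such as `"Dec, 2011"` depend on how pandas parses partial dates. A sketch of the same comparison that selects rows by year and month number instead, so that every day of each month is unambiguously included; the figures are made up:

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2011-03-01", "2011-08-31", "2011-09-01", "2011-12-09"]),
    "MoneySpent": [4000.0, 5000.0, 7000.0, 8000.0],
})

in_2011 = daily["Date"].dt.year == 2011
late_season = daily["Date"].dt.month >= 9  # September to December

jan_to_aug = daily.loc[in_2011 & ~late_season, "MoneySpent"].mean()
sep_to_dec = daily.loc[in_2011 & late_season, "MoneySpent"].mean()
ratio = sep_to_dec / jan_to_aug
print(f"{jan_to_aug:.2f} {sep_to_dec:.2f} {ratio:.2f}")
```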


### Comparing Spending by Weekday, Week of the Year and Hour of the Day

In [None]:
eu_ecom_df["Weekday"] = eu_ecom_df.Date.dt.day_name()

weekday_stats = eu_ecom_df.groupby("Weekday").agg({"MoneySpent": pd.Series.sum,
                                                   "Quantity": pd.Series.sum})

hourly_stats = eu_ecom_df.groupby("Hour").agg({"MoneySpent": pd.Series.sum,
                                                   "Quantity": pd.Series.sum})
hourly_stats.sort_values("MoneySpent", ascending=False, inplace=True)

hol_group = eu_ecom_df.groupby(eu_ecom_df.Date.dt.isocalendar().week).agg({
    "MoneySpent": pd.Series.sum,
    "Quantity": pd.Series.sum})


In [None]:
hist2 = px.histogram(eu_ecom_df, x="Weekday", y="MoneySpent", nbins=7)
hist2.update_layout(titlefont=dict(size=20, color='black'),
                   title={
                        'text': "Customer Spending Trend on Weekdays",
                        'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                   ) 
hist2.show()

hist3 = px.histogram(hourly_stats, hourly_stats.index,
                     "MoneySpent", nbins=48, hover_data=hourly_stats.columns,
                     histfunc="avg", text_auto=True)
hist3.update_layout(titlefont=dict(size=20, color='black'),
                   title={
                        'text': "Hourly Customer Spending",
                        'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                   ) 
hist3.show()

bar2 = px.bar(hol_group, x=hol_group.index, y="MoneySpent")
bar2.update_layout(titlefont=dict(size=20, color='black'),
                   title={
                        'text': "Weekly Customer Spending",
                        'y':0.95,
                        'x':0.5,
                        'xanchor': 'center',
                        'yanchor': 'top'},
                   ) 
bar2.show()

**Insights:**


1.   Thursdays see the most customer spending.
2.   10 AM to 3 PM is the most active time for customers.
3.   There is a rise in customer spending from week 40 to week 52 (holiday season).



Our data contains no records at all for Saturdays. Perhaps the site does not take orders on Saturdays, but it is quite strange.

In [None]:
eu_ecom_df[eu_ecom_df["Weekday"] == "Saturday"]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Date,Hour,MoneySpent,Weekday
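
Absent weekdays are easier to spot when every weekday is listed with its row count, including those with zero rows. A minimal sketch on four made-up dates:

```python
import pandas as pd

# 14 Jan 2011 is a Friday, 16 Jan a Sunday, 17 Jan a Monday.
dates = pd.to_datetime(["2011-01-14", "2011-01-16", "2011-01-17", "2011-01-17"])

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# reindex adds the weekdays that never occur, with a count of 0.
counts = (pd.Series(dates.day_name())
            .value_counts()
            .reindex(weekday_order, fill_value=0))
absent = counts[counts == 0].index.tolist()
print(counts)
print(absent)
```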


### Investigating Top Products Overall

In [None]:
top_spends = eu_ecom_df.groupby("Description").agg({"MoneySpent": pd.Series.sum})
top_spends.sort_values("MoneySpent", ascending=False)

Unnamed: 0_level_0,MoneySpent
Description,Unnamed: 1_level_1
REGENCY CAKESTAND 3 TIER,23500.14
WHITE HANGING HEART T-LIGHT HOLDER,17389.69
PARTY BUNTING,15676.31
JUMBO BAG RED RETROSPOT,12416.70
ASSORTED COLOUR BIRD ORNAMENT,11359.06
...,...
WHITE TALL PORCELAIN T-LIGHT HOLDER,-454.50
Discount,-995.22
CRUK Commission,-2387.48
PANTRY CHOPPING BOARD,-2604.21


In [None]:
top_purchases = eu_ecom_df.groupby("Description").agg({"Quantity": pd.Series.sum})
top_purchases.sort_values("Quantity", ascending=False)

Unnamed: 0_level_0,Quantity
Description,Unnamed: 1_level_1
WORLD WAR 2 GLIDERS ASSTD DESIGNS,8902
POPCORN HOLDER,8210
ASSORTED COLOUR BIRD ORNAMENT,6994
JUMBO BAG RED RETROSPOT,6684
WHITE HANGING HEART T-LIGHT HOLDER,6491
...,...
ASSTD DESIGN RACING CAR PEN,-77
GLASS JAR MARMALADE,-97
WHITE TALL PORCELAIN T-LIGHT HOLDER,-246
PANTRY CHOPPING BOARD,-511


The most popular product overall is `WORLD WAR 2 GLIDERS ASSTD DESIGNS`, with 8902 items sold. <br>
The product with the highest revenue is `REGENCY CAKESTAND 3 TIER`, with $23500.14.



## Most Purchased Items in the Holiday Season

In [None]:
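# Caution: UnitPrice is summed per (Date, Description), so the MoneySpent
# computed below is exact only when an item appears on a single row that day.
items_df = eu_ecom_df.groupby(["Date", "Description"]).agg({"Quantity": pd.Series.sum,
                                                 "UnitPrice": pd.Series.sum})
items_df["MoneySpent"] = items_df["Quantity"] * items_df["UnitPrice"]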

items_df.reset_index(inplace=True)
items_df.sort_values(["Date", "Quantity"], ascending=[True,False], inplace=True)
items_df["Date"] = pd.to_datetime(items_df["Date"], format='%Y-%m-%d')
items_df.head(20)

Unnamed: 0,Date,Description,Quantity,UnitPrice,MoneySpent
55,2010-12-01,CHILLI LIGHTS,224,13.02,2916.48
285,2010-12-01,WOODEN OWLS LIGHT GARLAND,192,3.37,647.04
261,2010-12-01,STRAWBERRY CERAMIC TRINKET BOX,156,2.5,390.0
277,2010-12-01,WHITE HANGING HEART T-LIGHT HOLDER,143,11.0,1573.0
115,2010-12-01,HAND WARMER SCOTTY DOG DESIGN,136,8.15,1108.4
133,2010-12-01,JUMBO BAG BAROQUE BLACK WHITE,100,1.65,165.0
193,2010-12-01,PINK HEART SHAPE EGG FRYING PAN,96,1.25,120.0
206,2010-12-01,RED HARMONICA IN BOX,96,3.75,360.0
16,2010-12-01,ANTIQUE SILVER TEA GLASS ENGRAVED,72,1.06,76.32
51,2010-12-01,CHARLOTTE BAG SUKI DESIGN,60,1.7,102.0


In [None]:
items_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67790 entries, 55 to 67784
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         67790 non-null  datetime64[ns]
 1   Description  67790 non-null  object        
 2   Quantity     67790 non-null  int64         
 3   UnitPrice    67790 non-null  float64       
 4   MoneySpent   67790 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 3.1+ MB
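
When the same item appears on several invoice lines on one day, multiplying a summed `Quantity` by a summed `UnitPrice` overstates revenue. A sketch of an alternative that sums the per-line `MoneySpent` instead, shown on two made-up lines; `AvgUnitPrice` is a name introduced here for illustration:

```python
import pandas as pd

lines = pd.DataFrame({
    "Date": pd.to_datetime(["2010-12-01", "2010-12-01"]),
    "Description": ["CHILLI LIGHTS", "CHILLI LIGHTS"],
    "Quantity": [2, 3],
    "UnitPrice": [5.0, 4.0],
})
lines["MoneySpent"] = lines["Quantity"] * lines["UnitPrice"]

# Sum the per-line totals; derive an average price afterwards if needed.
per_item = lines.groupby(["Date", "Description"], as_index=False).agg(
    Quantity=("Quantity", "sum"),
    MoneySpent=("MoneySpent", "sum"),
)
per_item["AvgUnitPrice"] = per_item["MoneySpent"] / per_item["Quantity"]

# For comparison: summed quantity times summed unit price.
overstated = lines["Quantity"].sum() * lines["UnitPrice"].sum()
print(per_item)
print(overstated)
```

In this toy case the per-line total is 22.0, while the product of the two sums is 45.0.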


Next, we filter for the holiday season.

In [None]:
holiday_list = df_merged["Date"]
holiday_list.head(20)

7    2011-11-23
1    2011-10-10
9    2011-11-11
0    2011-07-04
3    2011-09-05
6    2011-11-24
10   2011-02-21
8    2011-02-14
2    2011-06-19
4    2011-09-04
5    2011-01-17
Name: Date, dtype: datetime64[ns]

Christmas falls inside the gap in our sales data, so we add its date to the list manually.

In [None]:
# Assign a Timestamp rather than a string so the Series keeps its datetime64 dtype
holiday_list.loc[len(holiday_list)] = pd.Timestamp("2010-12-25")
holiday_list




7    2011-11-23
1    2011-10-10
9    2011-11-11
0    2011-07-04
3    2011-09-05
6    2011-11-24
10   2011-02-21
8    2011-02-14
2    2011-06-19
4    2011-09-04
5    2011-01-17
11   2010-12-25
Name: Date, dtype: datetime64[ns]

Keeping only the data from months that contain a holiday.

In [None]:
items_hol = items_df[items_df["Date"].dt.month.isin(holiday_list.dt.month)]
items_hol.head()

Unnamed: 0,Date,Description,Quantity,UnitPrice,MoneySpent
55,2010-12-01,CHILLI LIGHTS,224,13.02,2916.48
285,2010-12-01,WOODEN OWLS LIGHT GARLAND,192,3.37,647.04
261,2010-12-01,STRAWBERRY CERAMIC TRINKET BOX,156,2.5,390.0
277,2010-12-01,WHITE HANGING HEART T-LIGHT HOLDER,143,11.0,1573.0
115,2010-12-01,HAND WARMER SCOTTY DOG DESIGN,136,8.15,1108.4


In [None]:
scatter1 = px.scatter(items_hol, x="Date", y="Quantity", color="Description")
scatter1.update_layout(showlegend=False, 
                       titlefont=dict(size=20, color='black'),
                        title={
                          'text': "Number of Items Bought During the Holidays",
                          'y':0.95,
                          'x':0.5,
                          'xanchor': 'center',
                          'yanchor': 'top'},
                   ) 
scatter1.show()

From September to December the scatter plot is more spread out than in earlier months, which indicates that particular items become much more popular in that period.
<br>
Also, some items like `PAPER POCKET TRAVELLING FAN` and `TRINKET BOX` have massive return orders. This may be due to poor product quality, and these items may need to be discontinued.
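
Return-heavy items can be ranked by the share of sold units that come back. A minimal sketch on made-up rows, assuming (as earlier in this notebook) that negative quantities are returns:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Description": ["POPCORN HOLDER", "POPCORN HOLDER",
                    "PARTY BUNTING", "PARTY BUNTING"],
    "Quantity": [10, -3, 8, -6],
})

# Units sold and units returned per item
sold = toy_df[toy_df["Quantity"] > 0].groupby("Description")["Quantity"].sum()
returned = -toy_df[toy_df["Quantity"] < 0].groupby("Description")["Quantity"].sum()

# Items with no returns get a rate of 0 after alignment.
return_rate = (returned / sold).fillna(0).sort_values(ascending=False)
print(return_rate)
```

A high rate on a small sales volume can be noise, so in practice a minimum-units threshold would probably be sensible before discontinuing anything.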

## Holidays vs Items

In [None]:
max_selling = items_hol.groupby([items_hol.Date.dt.month, items_hol.Description]).agg({"Quantity": pd.Series.sum})
max_selling.reset_index(inplace=True)
max_selling.sort_values(["Date", "Quantity"], ascending=[True, False], inplace=True)
max_selling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13971 entries, 415 to 12211
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Date         13971 non-null  int64 
 1   Description  13971 non-null  object
 2   Quantity     13971 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 436.6+ KB


Selecting top 10 items from every holiday month to track their selling performance throughout the year.

In [None]:
top_selling_items = pd.DataFrame(columns=max_selling.columns)

# DataFrame.append was removed in pandas 2.0, so the monthly top-10
# slices are concatenated with pd.concat instead
for month in range(1, 13):
  top_selling_items = pd.concat(
      [top_selling_items, max_selling[max_selling["Date"] == month][:10]])

top_selling_items.describe()

Unnamed: 0,Date,Description,Quantity
count,80,80,80
unique,8,64,78
top,1,ASSORTED COLOUR BIRD ORNAMENT,720
freq,10,3,2
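
The same top-N-per-month selection can be written without a loop. A sketch using `groupby(...).head(n)` on a small made-up frame; the column is called `Month` here for readability, whereas the notebook's `max_selling` stores the month number in `Date`:

```python
import pandas as pd

monthly = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2],
    "Description": ["A", "B", "C", "A", "D"],
    "Quantity": [5, 9, 7, 4, 6],
})

# Sort by month and descending quantity, then keep the first 2 rows per month.
top2 = (monthly.sort_values(["Month", "Quantity"], ascending=[True, False])
               .groupby("Month")
               .head(2))
print(top2)
```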


In [None]:
# Create subplots, using 'domain' type for pie charts
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}], 
         [{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]]
fig = make_subplots(rows=4, cols=2, specs=specs, 
                    subplot_titles=["January", "February", "June", "July", 
                                    "September", "October", "November", "December"])

# Define pie charts
fig.add_trace(go.Pie(labels=top_selling_items["Description"][:10], 
                     values=top_selling_items["Quantity"][:10], name='January',
                     marker_colors=px.colors.sequential.RdBu), 1, 1)

fig.add_trace(go.Pie(labels=top_selling_items["Description"][10:20], 
                     values=top_selling_items["Quantity"][10:20], name='February',
                     marker_colors=px.colors.sequential.Viridis), 1, 2)

fig.add_trace(go.Pie(labels=top_selling_items["Description"][20:30], 
                     values=top_selling_items["Quantity"][20:30], name='June',
                     marker_colors=px.colors.sequential.Magenta), 2, 1)

fig.add_trace(go.Pie(labels=top_selling_items["Description"][30:40], 
                     values=top_selling_items["Quantity"][30:40], name='July',
                     marker_colors=px.colors.sequential.Plasma), 2, 2)

fig.add_trace(go.Pie(labels=top_selling_items["Description"][40:50],
                     values=top_selling_items["Quantity"][40:50], name='September',
                     marker_colors=px.colors.sequential.YlOrRd), 3, 1)

fig.add_trace(go.Pie(labels=top_selling_items["Description"][50:60], 
                     values=top_selling_items["Quantity"][50:60], name='October',
                     marker_colors=px.colors.sequential.ice), 3, 2)

fig.add_trace(go.Pie(labels=top_selling_items["Description"][60:70], 
                     values=top_selling_items["Quantity"][60:70], name='November',
                     marker_colors=px.colors.sequential.RdPu), 4, 1)

fig.add_trace(go.Pie(labels=top_selling_items["Description"][70:80], 
                     values=top_selling_items["Quantity"][70:80], name='December',
                     marker_colors=px.colors.sequential.solar), 4, 2)

# Tune layout, figure size and hover info
fig.update_traces(hoverinfo='label+percent+name', textinfo='none')

fig.update_layout(autosize=False, width=1000, height=2000,
                  showlegend=False, 
                  titlefont=dict(size=30, color='black'),
                  title={
                    'text': "<b>Number of Items Bought During the Holidays</b>",
                    'y':0.99,
                    'x':0.5,
                    'xanchor': 'center',
                    'yanchor': 'top'},
                   ) 

fig = go.Figure(fig)
fig.show()

The pie charts show us the top 10 most purchased items in the holiday months. Hover over them to find out the names and quantities of these items. <br>
**Insights:**


1.   Christmas items start gaining popularity as early as September.
2.   Other decorative items maintain their popularity throughout the year.


<br> <br>



### Christmas Items Throughout the Year
We will now track the popularity of items that have `CHRISTMAS` in their names.




In [None]:
xmas_items = eu_ecom_df[eu_ecom_df["Description"].str.contains("CHRISTMAS")].groupby(["Date", "Description"]).agg({"Quantity": pd.Series.sum})
xmas_items.reset_index(inplace=True)
xmas_items.sort_values(["Date", "Quantity"], ascending=[True, False], inplace=True)

scatter2 = px.scatter(xmas_items, x="Date", y="Quantity", color="Description")
scatter2.update_layout(showlegend=False, 
                       titlefont=dict(size=20, color='black'),
                        title={
                          'text': "Popularity of Christmas Items Throughout the Year",
                          'y':0.95,
                          'x':0.5,
                          'xanchor': 'center',
                          'yanchor': 'top'},
                   ) 
scatter2.show()

We can see that sales of Christmas items rise steadily from mid-August and keep increasing until the peak Christmas season. <br> <br>
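
The trend can also be summarised as monthly totals, which are easier to compare than individual scatter points. A minimal sketch on made-up rows (`CHRISTMAS STAR` is an invented product name used only for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2011-08-15", "2011-09-10", "2011-11-04",
                            "2011-11-20", "2011-11-21"]),
    "Description": ["CHRISTMAS STAR", "PARTY BUNTING",
                    "CARDHOLDER GINGHAM CHRISTMAS TREE",
                    "CHRISTMAS STAR", "POPCORN HOLDER"],
    "Quantity": [2, 5, 1, 10, 4],
})

# Plain substring match on the product description
is_xmas = toy_df["Description"].str.contains("CHRISTMAS", regex=False)
xmas = toy_df[is_xmas]

# Total units of matching items per calendar month
monthly_xmas = xmas.groupby(xmas["Date"].dt.to_period("M"))["Quantity"].sum()
print(monthly_xmas)
```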

### Popularity of Valentine's Day Gifts Throughout the Year

In [None]:
val_items = eu_ecom_df[(eu_ecom_df["Description"].str.contains("HEART")) | 
                        (eu_ecom_df["Description"].str.contains("LOVE"))].groupby(
                            ["Date", "Description"]).agg({
                                "Quantity": pd.Series.sum})
val_items.reset_index(inplace=True)
val_items.sort_values(["Date", "Quantity"], ascending=[True, False], inplace=True)

scatter3 = px.scatter(val_items, x="Date", y="Quantity", color="Description")
scatter3.update_layout(showlegend=False, 
                       titlefont=dict(size=20, color='black'),
                        title={
                          'text': "Popularity of Valentine's Day Gifts Throughout the Year",
                          'y':0.95,
                          'x':0.5,
                          'xanchor': 'center',
                          'yanchor': 'top'},
                   )
scatter3.show()

We expected Valentine's Day gifts to be popular mostly around Valentine's Day, but the data doesn't support that. They are not extremely popular at any point, but they keep selling all year round.
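
One way to put a number on this is the share of these items' yearly quantity that is sold in February. A rough sketch on made-up rows, using the same `HEART` and `LOVE` keywords as above, combined into one regular expression:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2011-02-10", "2011-05-03", "2011-09-13", "2011-02-11"]),
    "Description": ["SWEETHEART CERAMIC TRINKET BOX",
                    "WHITE HANGING HEART T-LIGHT HOLDER",
                    "DOORSTOP RETROSPOT HEART",
                    "PARTY BUNTING"],
    "Quantity": [4, 6, 10, 3],
})

# "HEART|LOVE" matches descriptions containing either keyword.
is_val = toy_df["Description"].str.contains("HEART|LOVE", regex=True)
val = toy_df[is_val]

feb_quantity = val.loc[val["Date"].dt.month == 2, "Quantity"].sum()
feb_share = feb_quantity / val["Quantity"].sum()
print(round(feb_share, 2))
```

A February share close to one twelfth would support the reading that these items sell evenly through the year; keyword matching is only a proxy for "Valentine's Day gift", so the result should be treated as indicative.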