Data Cleaning

In [126]:
#read in the csv 
import pandas as pd
df=pd.read_csv("google_trends_competitors.csv")
#get number of rows and columns along with the data types
df.info()
#check for missing values 
df.isnull().values.any()
#okay so we have no missing values !

df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Date      608 non-null    object
 1   Sephora   608 non-null    int64 
 2   Ulta      608 non-null    int64 
 3   Fenty     608 non-null    int64 
 4   Glossier  608 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 23.9+ KB


Unnamed: 0,Date,Sephora,Ulta,Fenty,Glossier
0,2021-01-01,88,100,70,35
1,2021-01-02,97,96,100,41
2,2021-01-03,100,96,84,44
3,2021-01-04,85,74,72,32
4,2021-01-05,84,70,58,35


In [127]:
#Now we need to turn our attention to the data types, since the date isn't a date object
#df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
#However, be careful: in plotly the datetime object confuses the line chart, so we left it the way it is!
#df.info()
#df['Date']
#now check for missing values
df.isnull().sum()
#no missing values, so that's good
#rename date column
df.rename(columns = {'Date':'date'}, inplace = True)
df.head(20)

Unnamed: 0,date,Sephora,Ulta,Fenty,Glossier
0,2021-01-01,88,100,70,35
1,2021-01-02,97,96,100,41
2,2021-01-03,100,96,84,44
3,2021-01-04,85,74,72,32
4,2021-01-05,84,70,58,35
5,2021-01-06,77,64,57,38
6,2021-01-07,80,63,51,29
7,2021-01-08,84,67,54,34
8,2021-01-09,87,80,57,34
9,2021-01-10,81,81,52,36
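
A note on converting the date column: in strptime-style format strings, `%m` is the month and `%M` is the minute, so the two are easy to mix up. The sketch below uses made-up toy strings (not the real CSV) to show the difference; `toy_dates`, `parsed` and `wrong` are names introduced here for illustration only.

```python
from datetime import datetime
import pandas as pd

# Toy dates for illustration only; not taken from google_trends_competitors.csv
toy_dates = pd.Series(["2021-03-05", "2021-12-25"])

# %m = month: the strings parse to the dates we expect
parsed = pd.to_datetime(toy_dates, format="%Y-%m-%d")
print(parsed.dt.month.tolist())

# %M = minute: "03" is read as minutes and the month silently defaults to January
wrong = datetime.strptime("2021-03-05", "%Y-%M-%d")
print(wrong)
```

Because the wrong format does not raise an error, it is worth printing a few parsed values before relying on them.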


In [128]:
#Summary statistics
#only numeric columns first 
df2=df[["Sephora","Ulta","Fenty","Glossier"]]
df2



Unnamed: 0,Sephora,Ulta,Fenty,Glossier
0,88,100,70,35
1,97,96,100,41
2,100,96,84,44
3,85,74,72,32
4,84,70,58,35
...,...,...,...,...
603,95,78,33,43
604,93,90,24,38
605,88,67,24,41
606,95,65,33,36


In [129]:
#mean, max, min,q1,q3
import numpy as np 
df_mean = df2.mean()
df_mean.to_frame()
df_min=df2.min()
df_min.to_frame()
df_max=df2.max()
df_max.to_frame()
df_q1_Sephora= np.quantile(df2["Sephora"], 0.25)
df_q3_Sephora= np.quantile(df2["Sephora"], 0.75)
df_q1_Ulta= np.quantile(df2["Ulta"], 0.25)
df_q3_Ulta= np.quantile(df2["Ulta"], 0.75)
df_q1_Fenty= np.quantile(df2["Fenty"], 0.25)
df_q3_Fenty= np.quantile(df2["Fenty"], 0.75)
df_q1_Glossier= np.quantile(df2["Glossier"], 0.25)
df_q3_Glossier= np.quantile(df2["Glossier"], 0.75)

In [130]:
#move all the values into a table 
#concat everything together 
df_sum1=pd.concat([df_mean, df_min, df_max],axis=1)
df_sum1.columns = ['mean', 'min', 'max']
df_sum1.reset_index(drop=True, inplace=True)
df_sum1.head()
#add q1 and q3 
data = [["Sephora",df_q1_Sephora, df_q3_Sephora], ["Ulta",df_q1_Ulta, df_q3_Ulta], ["Fenty",df_q1_Fenty, df_q3_Fenty], ["Glossier",df_q1_Glossier,df_q3_Glossier]]
# Create the pandas DataFrame
df_q1_q3 = pd.DataFrame(data, columns=['brand','q1', 'q3'])
df_q1_q3
#concat the 2 dataframes 
df_summary=pd.concat([ df_q1_q3, df_sum1],axis=1)
new_cols= ["brand","min","max","mean","q1","q3"]
df_summary=df_summary[new_cols]
df_summary

Unnamed: 0,brand,min,max,mean,q1,q3
0,Sephora,25,100,61.779605,36.0,81.0
1,Ulta,21,100,51.328947,33.75,64.0
2,Fenty,14,100,37.78125,27.0,45.0
3,Glossier,12,100,31.116776,24.0,34.0
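
The same kind of summary table can probably be built with fewer intermediate variables. Below is a minimal sketch, assuming a dataframe with only numeric columns; the helper name `summarize` and the toy data are hypothetical and not part of the analysis above. `DataFrame.quantile` uses linear interpolation by default, as `np.quantile` does.

```python
import pandas as pd

def summarize(frame):
    """Return min, max, mean, q1 and q3 per column, one row per column."""
    # agg gives one row per statistic, so transpose to get one row per column
    summary = frame.agg(["min", "max", "mean"]).T
    # quantile returns a Series indexed by column name, which aligns on assignment
    summary["q1"] = frame.quantile(0.25)
    summary["q3"] = frame.quantile(0.75)
    return summary.rename_axis("brand").reset_index()

# Toy data for illustration only; not the real search data
toy = pd.DataFrame({"brand_a": [1, 2, 3, 4, 5], "brand_b": [10, 20, 30, 40, 50]})
toy_summary = summarize(toy)
print(toy_summary)
```

On the real data this would be called as `summarize(df2)`.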


EDA

In [186]:
#viz 1
#distributions of the 4 different brands
import plotly.figure_factory as ff
import numpy as np

# Add histogram data
x1 = df2['Sephora'].to_list()
x2 = df2['Ulta'].to_list()
x3 = df2['Fenty'].to_list()
x4 = df2['Glossier'].to_list()

# Group data together
hist_data = [x1, x2, x3, x4]

group_labels = ['Sephora', 'Ulta', 'Fenty', 'Glossier']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, 
                         group_labels, 
                         bin_size=1,
                         colors=['navy', 'blue','cornflowerblue','pink'],
                         show_rug=False)
fig.update_layout(plot_bgcolor = "white", 
                  title="Google Search Trends Activity for Sephora, Ulta, Fenty and Glossier  <br><sup>January 2021 - August 2022</sup>",
                  xaxis_title="Normalized Count", 
                  yaxis_title="Probability Density", title_x=0.5)
fig.show()

Takeaways: The density plot above maps the individual distributions of Google search trends for Glossier and its 3 main competitors. While Fenty and Glossier have similar right-skewed density distributions with a clear peak, the Ulta and Sephora distributions do not. From this result, it seems that Glossier and Fenty searches tend to be more consistent, meaning they hover around a similar number of searches, whereas Sephora and Ulta show a lot more variability. We plan to build on this EDA and determine whether Google search trends are correlated with overall subreddit activity. 
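
One way to back up the visual read of skew and variability is to compute the sample skewness and standard deviation per column. This is only a sketch on made-up toy values (the column names are hypothetical); on the real data the same two calls would be applied to `df2`.

```python
import pandas as pd

# Toy values for illustration only; not the real search data
toy = pd.DataFrame({
    "peaked_right_skewed": [20, 22, 23, 24, 25, 26, 28, 60],
    "spread_out": [20, 35, 45, 55, 65, 75, 90, 100],
})

# Positive skew means a longer right tail; a larger std means more variability
shape = pd.DataFrame({
    "skew": toy.skew(),
    "std": toy.std(),
})
print(shape)
```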

In [185]:
#viz 1
#time series of Glossier searches over time 
import plotly.express as px
fig = px.line(data_frame = df
            ,x = 'date'
            ,y = 'Glossier'
            ,color_discrete_sequence = ['pink'],
            title="Google Search Trends Activity for Glossier  <br><sup>January 2021 - August 2022</sup>")
fig.update_layout(plot_bgcolor = "white",  xaxis_title="Date", yaxis_title="Normalized Count", title_x=0.5)

Takeaways: The time series plot above depicts the normalized number of searches for the term Glossier over time. There are clear spikes in the data which, as expected, seem to track with typically high sales periods (such as Christmas and the 4th of July) for retail and cosmetics brands. The March 2021 spike seemed like an anomaly, but after some research it became clear that a promo code allowing anyone to buy Glossier products for 50% off went viral and created a lot of interest in the brand. This could potentially explain the spike in searches.
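
The spikes were identified by eye here. A simple z-score rule is one possible way to flag them programmatically; it is an assumption on our part, not the method used in this notebook. The threshold of 2 standard deviations and the toy series below are both hypothetical.

```python
import pandas as pd

# Toy daily series for illustration only: flat at 30 with one spike of 100
values = [30] * 10 + [100] + [30] * 10
toy = pd.Series(values, index=pd.date_range("2021-03-01", periods=len(values)))

# z-score: how many standard deviations each day sits from the overall mean
z_scores = (toy - toy.mean()) / toy.std()

# Keep only the days that are unusually high (hypothetical cutoff of 2)
spikes = toy[z_scores > 2]
print(spikes)
```

On the real data the series would be `df.set_index('date')['Glossier']`, and the cutoff would need tuning.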

In [176]:
#viz2
#since we are dealing with temporal data, it makes sense to make a simple time series chart
#to do that I have to alter the dataframe a little bit and melt the 4 brand columns into one column
df_EDA=pd.melt(df, id_vars=['date'], value_vars=['Sephora', 'Ulta','Fenty','Glossier'])
#df_EDA.head(1000)
df_EDA.count()
#rename the columns 
df_EDA.rename(columns = {'variable':'brand'}, inplace = True)
df_EDA.rename(columns = {'value':'normalized_count'}, inplace = True)
df_EDA.head(20)
#I am also going to summarize this by month so the plot is easier to read
#source https://stackoverflow.com/questions/45281297/group-by-week-in-pandas
df_EDA['date'] = pd.to_datetime(df_EDA['date'])
df_EDA.info()
#freq='M' groups by month end; pandas 2.2+ prefers the alias 'ME' for this
df_EDA_month=df_EDA.groupby(['brand', pd.Grouper(key='date', freq='M')])['normalized_count'].mean().reset_index().sort_values('date')
df_EDA_month
#Now we can make our time series plot



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2432 entries, 0 to 2431
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              2432 non-null   datetime64[ns]
 1   brand             2432 non-null   object        
 2   normalized_count  2432 non-null   int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 57.1+ KB


Unnamed: 0,brand,date,normalized_count
0,Fenty,2021-01-31,63.741935
40,Sephora,2021-01-31,82.451613
20,Glossier,2021-01-31,33.354839
60,Ulta,2021-01-31,68.064516
1,Fenty,2021-02-28,54.107143
...,...,...,...
18,Fenty,2022-07-31,41.741935
39,Glossier,2022-08-31,47.645161
19,Fenty,2022-08-31,37.483871
59,Sephora,2022-08-31,89.129032
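
To make the reshaping step easier to follow, here is a minimal sketch of the same wide-to-long melt followed by a monthly mean, run on a tiny made-up table. The toy column names and values are hypothetical. This version groups on a monthly period (`dt.to_period("M")`) rather than `pd.Grouper`, so the `date` column in the result holds periods such as 2021-01 instead of month-end timestamps.

```python
import pandas as pd

# Toy wide table for illustration only; not the real search data
wide = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-02-01"],
    "brand_a": [10, 20, 30],
    "brand_b": [1, 3, 5],
})

# Wide to long: one row per (date, brand) pair
long = wide.melt(id_vars="date", var_name="brand", value_name="normalized_count")
long["date"] = pd.to_datetime(long["date"])

# Average the daily values within each brand and calendar month
monthly = (
    long.groupby(["brand", long["date"].dt.to_period("M")])["normalized_count"]
    .mean()
    .reset_index()
)
print(monthly)
```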


In [184]:

import plotly.express as px
fig = px.line(data_frame = df_EDA_month
            ,x = 'date'
            ,y = 'normalized_count'
            ,color = 'brand'
            ,color_discrete_sequence = ['cornflowerblue', 'navy','pink','blue'],
            title="Google Search Trends Activity for Sephora, Ulta, Fenty and Glossier  <br><sup>January 2021 - August 2022</sup>")
fig.update_layout(plot_bgcolor = "white",  xaxis_title="Date (Months)", yaxis_title="Normalized Count", title_x=0.5)

Takeaways: The time series above helps compare the normalized number of searches for the term Glossier with those for its competitors. Similar to what we saw in the density distribution plot (Figure ...), Sephora and Ulta tend to have far more variability in search term popularity than Glossier and Fenty. It is also clear that Sephora and Ulta are more popular than Glossier and Fenty in terms of Google searches, and that they seem to mirror each other. 
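
Whether two series "mirror each other" can be checked with a pairwise Pearson correlation. The sketch below uses made-up toy columns (names and values are hypothetical); on the real data the call would be `df2.corr()`.

```python
import pandas as pd

# Toy values for illustration only; not the real search data
toy = pd.DataFrame({
    "brand_a": [10, 20, 30, 40],
    "brand_b": [12, 22, 29, 41],   # moves closely with brand_a
    "brand_c": [40, 10, 30, 20],   # does not
})

# Pearson correlation between every pair of columns (values near 1 move together)
corr = toy.corr()
print(corr.round(2))
```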