## **Introduction: Analyzing User Metrics and Keyword Performance**

## Demographics

In this analysis, I explored the relationship between user demographics (gender and age) and impressions. The goal was to determine whether men and women differed significantly in the number of impressions they received, and whether age had any effect on impressions.

To achieve this, I applied a series of hypothesis tests and analysis of variance (ANOVA) techniques, including:

1. **Hypothesis Test for Gender Differences**: I performed a two-sample z-test for proportions to determine whether women and men received different numbers of impressions. With a p-value far below the alpha of 0.05, the test indicates a statistically significant difference: women received more impressions (595) than men (323).

2. **One-Way ANOVA for Age Effect with Post-Hoc Tukey Test**: I conducted a one-way ANOVA to assess whether age, like gender, influenced impressions. The p-value (about 0.078) is below 0.1 but above 0.05, so the evidence for an age effect is only moderate. Age may still be relevant, but further investigation is needed before drawing conclusions.
  - The post-hoc test found age groups 25-34 and 65+ to be the most different (adjusted p-value of about 0.11), with 25-34 having the most impressions and 65+ having the least.

3. **Two-Way ANOVA**: To refine the analysis, I fitted a two-way ANOVA to assess the combined effect of age and gender on impressions. Every p-value came back as NaN, so no conclusion can be drawn at an alpha of 0.05. This does not mean there is no difference: the data contain only one observation per combination of age range and gender, which leaves zero residual degrees of freedom once the interaction term is included. More data needs to be gathered to get a meaningful result.
  - Because this test produced no usable results, it is hard to say which combination of gender and age group gets the most impressions. However, women received significantly more impressions than men, and the 25-34 age group had the most impressions, so it is plausible that women in the 25-34 age group have the most impressions.
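The NaN p-values follow from counting degrees of freedom. Below is a minimal sketch, assuming the balanced layout that the ANOVA table implies (6 age ranges, 2 genders, one row per combination). The helper name `residual_df` is hypothetical and not part of the notebook.

```python
def residual_df(n_age, n_gender, obs_per_cell, interaction):
    """Residual degrees of freedom for a two-way ANOVA on a balanced design."""
    n_obs = n_age * n_gender * obs_per_cell
    # Intercept plus the main-effect parameters
    n_params = 1 + (n_age - 1) + (n_gender - 1)
    if interaction:
        # Each interaction cell adds one more parameter
        n_params += (n_age - 1) * (n_gender - 1)
    return n_obs - n_params


print(residual_df(6, 2, 1, interaction=True))   # nothing left to estimate the error
print(residual_df(6, 2, 1, interaction=False))  # some residual df remain
```

With the interaction term the model has as many parameters as observations, so the error variance cannot be estimated and every F-test is undefined.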

## Time Analysis

1. **Start Hour ANOVA with Post-Hoc Tukey Test**: I conducted an ANOVA test for impressions by hour of the day. The results showed no significant difference between hours, even at the looser threshold of 0.1 (the p-value was about 0.39). However, the top 5 hours for impressions fell during lunch and dinner hours, with the peak hour having over 100 impressions. The top 5 hours, in order, were: 7 PM, 8 PM, 1 PM, 2 PM, 11 AM. These all occur either in the evening, when people are generally more available after work and school, or around a lunch break.

2. **Weekday ANOVA Test with Post-Hoc Tukey Test**: I performed a one-way ANOVA test to determine if there was a significant difference in impressions for different days of the week. The results revealed no significant difference at an alpha of 0.05 (the p-value was about 0.18). Overall, Tuesday had the most impressions, followed by Friday, while the Tukey mean differences place Sunday lowest, with Thursday next. To further explore the differences between the days, I used a post-hoc Tukey test, which compared all the days in pairs. This again revealed no significant difference between days, as no pair had an adjusted p-value below 0.1.
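Tukey's HSD compares every pair of groups, which is why the hourly comparison table is far longer than the weekday one. A quick count using only the standard library (the `pair_counts` name is illustrative):

```python
from math import comb

# Number of groups in each analysis in this notebook
group_sizes = {"age ranges": 6, "weekdays": 7, "start hours": 24}

# Tukey's HSD makes one comparison per unordered pair: k choose 2
pair_counts = {label: comb(k, 2) for label, k in group_sizes.items()}

for label, pairs in pair_counts.items():
    print(f"{label}: {group_sizes[label]} groups -> {pairs} pairwise comparisons")
```

The more pairs there are, the more the adjusted p-values are pushed toward 1, so small hour-to-hour differences are hard to detect with one week of data.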

## Keyword Analysis

**Keyword Search ANOVA Test with Post-Hoc Tukey Test**: To evaluate whether there was any significant difference between keywords, I conducted an ANOVA test on the clicks received by each keyword. This revealed no significant difference between keywords, even at the looser threshold of 0.1 (the p-value was about 0.86). The three leading keywords were 'things to do', 'Things to do with family', and 'Things to do with kids', with 54, 19, and 14 clicks respectively. The fourth keyword, 'things to do near me', showed a steep drop-off with only 6 clicks, and the rest received fewer than 5. From this, I gather the following:
 - Two of the top keywords contain family-oriented words, such as family or kids. This suggests that the link was clicked by people looking for something fun to do with their children, and they should be considered as a future target demographic.
 - Two other top keywords contained relationship-oriented words, such as date idea or couples idea. While these keywords received fewer clicks than those above, this still suggests the link was clicked by people with a partner looking for something fun to do, and they should be considered as a future target demographic as well.
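The click totals quoted above combine match types for the same keyword (for example, 'things to do' appears as both a phrase match and a broad match). A minimal sketch of that aggregation, using only the five rows printed by `keyword_data.head()` further down; the full file contains more keywords, and the variable names here are illustrative:

```python
import pandas as pd

# The five rows shown by keyword_data.head(); not the complete dataset
sample = pd.DataFrame({
    "Search Keyword": [
        "things to do", "things to do", "Things to do with family",
        "Things to do with kids", "things to do near me",
    ],
    "Match type": [
        "Phrase match", "Broad match", "Phrase match",
        "Phrase match", "Phrase match",
    ],
    "Clicks": [37, 17, 19, 14, 6],
})

# Sum clicks across match types so each keyword appears once, then rank
clicks_by_keyword = (
    sample.groupby("Search Keyword")["Clicks"].sum().sort_values(ascending=False)
)
print(clicks_by_keyword)
```

Note that the grouping is case-sensitive, so keywords that differ only in capitalization would be counted separately unless they are normalized first.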





In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
camp_data = pd.read_csv('google ads campaign data.csv')

In [4]:
time_data = pd.read_csv('Day_&_hour(Day_Hour).csv')

In [5]:
dem_data = pd.read_csv('Demographics(Gender_Age_2024.09.01-2024.09.30).csv')

In [7]:
keyword_data = pd.read_csv('google ads keyword search.csv')

In [8]:
dem_data.head()

Unnamed: 0,Gender,Age Range,Impressions,Percent of known total
0,Male,18-24,73,7.95%
1,Male,25-34,98,10.68%
2,Male,35-44,87,9.48%
3,Male,45-54,36,3.92%
4,Male,55-64,16,1.74%


In [11]:
dem_data.groupby('Gender')['Impressions'].sum()

Gender,Impressions
Female,595
Male,323


In [13]:
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
# Use a two-sample z-test for proportions to see if the number of impressions differs significantly between women and men.

female_impressions = 595  # Total number of female impressions (from the groupby above)
male_impressions = 323  # Total number of male impressions (from the groupby above)
total_impressions = female_impressions + male_impressions  # Combined total impressions

# Impression counts for each gender
imp_counts = np.array([female_impressions, male_impressions])

# Number of observations for each group (the combined total is used for both)
impressions_n = np.array([total_impressions, total_impressions])

# Perform a z-test for proportions
stat_impressions, p_value_impressions = proportions_ztest(imp_counts, impressions_n)

# Print results

print(f"Z-statistic: {stat_impressions}")
print(f"P-value: {p_value_impressions}")

# Interpret the result
alpha = 0.05  # Significance level of 5%

if p_value_impressions < alpha:
    print("Reject the null hypothesis: There is a significant difference between male and female impressions.")
else:
    print("Fail to reject the null hypothesis: No significant difference between male and female impressions.")

Z-statistic: 12.695872761853956
P-value: 6.233119277895775e-37
Reject the null hypothesis: There is a significant difference between male and female impressions.
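The two-sample test above uses the same 918 impressions as the sample size for both groups. An alternative worth considering, sketched here as an assumption and not as the method used in this notebook, is a one-sample test of whether the female share of impressions differs from an even 50% split. It needs only the standard library:

```python
import math

female, male = 595, 323   # totals from the groupby output
n = female + male         # impressions with a known gender
p0 = 0.5                  # null hypothesis: impressions are split evenly

# z-statistic for a one-sample proportion test under the null
z = (female / n - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}, p = {p_value:.3g}")
```

This version gives a smaller z-statistic than the two-sample setup but leads to the same conclusion at an alpha of 0.05.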


In [48]:
import scipy.stats as stats

# One-way ANOVA: Testing impressions across different Age Groups
age_groups = dem_data['Age Range']
impressions = dem_data['Impressions']

# Perform the one-way ANOVA
f_stat, p_value = stats.f_oneway(
    *[impressions[age_groups == group] for group in dem_data['Age Range'].unique()]
)

# Output the results
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between age groups.")
elif p_value < 0.1:
    print("Moderate evidence against the null hypothesis: There is a moderate difference between age groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference between age groups.")


F-statistic: 3.5288814691151917
P-value: 0.07823528439346275
Moderate evidence against the null hypothesis: There is a moderate difference between age groups.
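The same three-tier decision rule is repeated in every ANOVA cell of this notebook. One way to avoid the repetition is a small helper; `interpret_p_value` and its parameter names are hypothetical, and the thresholds are the 0.05 and 0.1 values used throughout:

```python
def interpret_p_value(p_value, label, alpha=0.05, weak_alpha=0.1):
    """Return the three-tier verdict used in this notebook for a given p-value."""
    if p_value < alpha:
        return f"Reject the null hypothesis: There is a significant difference between {label}."
    if p_value < weak_alpha:
        return f"Moderate evidence against the null hypothesis: There is a moderate difference between {label}."
    return f"Fail to reject the null hypothesis: No significant difference between {label}."


# p-values reported by the age and start-hour ANOVA cells
print(interpret_p_value(0.07823528439346275, "age groups"))
print(interpret_p_value(0.3932769358453878, "start times"))
```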


In [17]:
# Perform post-hoc test using tukey HSD to see if any pairs have a p value of under 0.1 or close
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(dem_data['Impressions'], dem_data['Age Range'], alpha=0.05)

# Display Tukey's test results
print(tukey_results)

  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj    lower    upper   reject
-------------------------------------------------------
 18-24  25-34     69.5 0.6059  -99.2098 238.2098  False
 18-24  35-44     45.0 0.8804 -123.7098 213.7098  False
 18-24  45-54    -25.0  0.988 -193.7098 143.7098  False
 18-24  55-64    -62.5 0.6904 -231.2098 106.2098  False
 18-24    65+    -69.0 0.6119 -237.7098  99.7098  False
 25-34  35-44    -24.5  0.989 -193.2098 144.2098  False
 25-34  45-54    -94.5 0.3431 -263.2098  74.2098  False
 25-34  55-64   -132.0  0.129 -300.7098  36.7098  False
 25-34    65+   -138.5 0.1087 -307.2098  30.2098  False
 35-44  45-54    -70.0 0.5999 -238.7098  98.7098  False
 35-44  55-64   -107.5 0.2461 -276.2098  61.2098  False
 35-44    65+   -114.0 0.2076 -282.7098  54.7098  False
 45-54  55-64    -37.5 0.9373 -206.2098 131.2098  False
 45-54    65+    -44.0 0.8891 -212.7098 124.7098  False
 55-64    65+     -6.5    1.0 -175.2098 162.2098  False
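To find the most different pair without scanning the table by eye, the rows can be sorted by adjusted p-value. The sketch below copies the values from the table above into plain tuples; in the notebook itself the same information is available from `tukey_results.pvalues`.

```python
# (group1, group2, meandiff, p_adj) rows copied from the Tukey table above
rows = [
    ("18-24", "25-34", 69.5, 0.6059), ("18-24", "35-44", 45.0, 0.8804),
    ("18-24", "45-54", -25.0, 0.988), ("18-24", "55-64", -62.5, 0.6904),
    ("18-24", "65+", -69.0, 0.6119), ("25-34", "35-44", -24.5, 0.989),
    ("25-34", "45-54", -94.5, 0.3431), ("25-34", "55-64", -132.0, 0.129),
    ("25-34", "65+", -138.5, 0.1087), ("35-44", "45-54", -70.0, 0.5999),
    ("35-44", "55-64", -107.5, 0.2461), ("35-44", "65+", -114.0, 0.2076),
    ("45-54", "55-64", -37.5, 0.9373), ("45-54", "65+", -44.0, 0.8891),
    ("55-64", "65+", -6.5, 1.0),
]

# Pair with the smallest adjusted p-value, i.e. the strongest evidence of a difference
closest = min(rows, key=lambda r: r[3])
print(closest)

# Pairs that clear the looser 0.1 threshold used in this notebook
below_threshold = [r for r in rows if r[3] < 0.1]
print(below_threshold)
```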

In [18]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Perform Two-Way ANOVA for age and gender
model = ols('Q("Impressions") ~ C(Q("Age Range")) + C(Gender) + C(Q("Age Range")):C(Gender)', data=dem_data).fit()

# Get the ANOVA table
anova_table = anova_lm(model)

# Output the results
print(anova_table)

                              df        sum_sq      mean_sq    F  PR(>F)
C(Q("Age Range"))            5.0  3.170700e+04  6341.400000  0.0     NaN
C(Gender)                    1.0  6.165333e+03  6165.333333  0.0     NaN
C(Q("Age Range")):C(Gender)  5.0  4.616667e+03   923.333333  0.0     NaN
Residual                     0.0  1.798603e-26          inf  NaN     NaN
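Although the F and p-value columns are unusable, the sums of squares in this table are still informative. As a cross-check, they reproduce the F-statistic from the one-way age ANOVA. The second half of the sketch is an assumption, not something this notebook does: it drops the interaction term and treats its variation as the error term, which is one common way to analyze a design with a single observation per cell.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_age, df_age = 31707.0, 5
ss_gender, df_gender = 6165.333333, 1
ss_inter, df_inter = 4616.666667, 5

# One-way ANOVA on age: gender and interaction variation both count as within-group error
f_one_way = (ss_age / df_age) / ((ss_gender + ss_inter) / (df_gender + df_inter))
print(f"One-way F for age: {f_one_way:.4f}")

# Additive two-way model (assumption: interaction dropped and used as the error term)
ms_error = ss_inter / df_inter
f_age = (ss_age / df_age) / ms_error
f_gender = (ss_gender / df_gender) / ms_error
print(f"Additive model: F_age = {f_age:.2f}, F_gender = {f_gender:.2f}")
```

The p-values for the additive model could then be obtained from the F distribution, for example with `scipy.stats.f.sf(f_age, df_age, df_inter)`.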




In [19]:
time_data.head()

Unnamed: 0,Day,Start Hour,Impressions
0,Sunday,12 AM,0
1,Sunday,1 AM,0
2,Sunday,2 AM,0
3,Sunday,3 AM,0
4,Sunday,4 AM,0


In [47]:
import scipy.stats as stats

# One-way ANOVA: Testing impressions across different start times
time_groups = time_data['Start Hour']
impressions = time_data['Impressions']

# Perform the one-way ANOVA
f_stat, p_value = stats.f_oneway(
    *[impressions[time_groups == group] for group in time_data['Start Hour'].unique()]
)

# Output the results
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between start times.")
elif p_value < 0.1:
    print("Moderate evidence against the null hypothesis: There is a moderate difference between start times.")
else:
    print("Fail to reject the null hypothesis: No significant difference between start times.")


F-statistic: 1.063306158076331
P-value: 0.3932769358453878
Fail to reject the null hypothesis: No significant difference between start times.


In [27]:
# Perform post-hoc test using tukey HSD to see if any pairs have a p value of under 0.1 or close
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(time_data['Impressions'], time_data['Start Hour'], alpha=0.05)

# Display Tukey's test results
print(tukey_results)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
 10 AM  10 PM  -1.7143    1.0 -27.8559 24.4273  False
 10 AM  11 AM   3.2857    1.0 -22.8559 29.4273  False
 10 AM  11 PM  -5.2857    1.0 -31.4273 20.8559  False
 10 AM  12 AM  -1.1429    1.0 -27.2845 24.9988  False
 10 AM  12 PM   4.8571    1.0 -21.2845 30.9988  False
 10 AM   1 AM  -2.2857    1.0 -28.4273 23.8559  False
 10 AM   1 PM      5.0    1.0 -21.1416 31.1416  False
 10 AM   2 AM  -4.1429    1.0 -30.2845 21.9988  False
 10 AM   2 PM   3.1429    1.0 -22.9988 29.2845  False
 10 AM   3 AM     -3.0    1.0 -29.1416 23.1416  False
 10 AM   3 PM  -0.4286    1.0 -26.5702 25.7131  False
 10 AM   4 AM  -3.2857    1.0 -29.4273 22.8559  False
 10 AM   4 PM   2.8571    1.0 -23.2845 28.9988  False
 10 AM   5 AM  -2.7143    1.0 -28.8559 23.4273  False
 10 AM   5 PM   1.2857    1.0 -24.8559 27.4273  False
 10 AM   6 AM     -4.0    1.

In [28]:
tukey_results.pvalues < 0.1

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [46]:
import scipy.stats as stats

# One-way ANOVA: Testing impressions across different weekdays
day_groups = time_data['Day']
impressions = time_data['Impressions']

# Perform the one-way ANOVA
f_stat, p_value = stats.f_oneway(
    *[impressions[day_groups == group] for group in time_data['Day'].unique()]
)

# Output the results
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between weekdays.")
elif p_value < 0.1:
    print("Moderate evidence against the null hypothesis: There is a moderate difference between weekdays.")
else:
    print("Fail to reject the null hypothesis: No significant difference between weekdays.")


F-statistic: 1.4897284909539539
P-value: 0.18477399423346366
Fail to reject the null hypothesis: No significant difference between weekdays.


In [34]:
# Perform post-hoc test using tukey HSD to see if any pairs have a p value of under 0.1 or close
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(time_data['Impressions'], time_data['Day'], alpha=0.05)

# Display Tukey's test results
print(tukey_results)

   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2  meandiff p-adj   lower    upper  reject
----------------------------------------------------------
  Friday    Monday  -0.7917    1.0 -12.1071 10.5238  False
  Friday  Saturday   -2.125 0.9978 -13.4405  9.1905  False
  Friday    Sunday  -9.2083  0.193 -20.5238  2.1071  False
  Friday  Thursday   -2.875 0.9884 -14.1905  8.4405  False
  Friday   Tuesday   0.2083    1.0 -11.1071 11.5238  False
  Friday Wednesday  -0.7917    1.0 -12.1071 10.5238  False
  Monday  Saturday  -1.3333 0.9998 -12.6488  9.9821  False
  Monday    Sunday  -8.4167 0.2902 -19.7321  2.8988  False
  Monday  Thursday  -2.0833  0.998 -13.3988  9.2321  False
  Monday   Tuesday      1.0    1.0 -10.3155 12.3155  False
  Monday Wednesday      0.0    1.0 -11.3155 11.3155  False
Saturday    Sunday  -7.0833 0.5037 -18.3988  4.2321  False
Saturday  Thursday    -0.75    1.0 -12.0655 10.5655  False
Saturday   Tuesday   2.3333 0.9962  -8.9821 13.6488  Fal

In [36]:
keyword_data.head()

Unnamed: 0,Search Keyword,Match type,Criterion Status,Campaign Status,Ad Group Status,Cost,Clicks,Avg. CPC
0,things to do,Phrase match,Enabled,Paused,Enabled,$17.11,37,$0.46
1,things to do,Broad match,Removed,Paused,Enabled,$11.51,17,$0.68
2,Things to do with family,Phrase match,Enabled,Paused,Enabled,$10.13,19,$0.53
3,Things to do with kids,Phrase match,Enabled,Paused,Enabled,$7.15,14,$0.51
4,things to do near me,Phrase match,Enabled,Paused,Enabled,$3.17,6,$0.53


In [45]:
import scipy.stats as stats

# One-way ANOVA: Testing clicks across different search keywords
word_groups = keyword_data['Search Keyword']
clicks = keyword_data['Clicks']

# Perform the one-way ANOVA
f_stat, p_value = stats.f_oneway(
    *[clicks[word_groups == group] for group in keyword_data['Search Keyword'].unique()]
)

# Output the results
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between keyword search clicks.")
elif p_value < 0.1:
    print("Moderate evidence against the null hypothesis: There is a moderate difference between keyword search clicks.")
else:
    print("Fail to reject the null hypothesis: No significant difference between keyword search clicks.")


F-statistic: 0.6061694099419933
P-value: 0.8610203126042437
Fail to reject the null hypothesis: No significant difference between keyword search clicks.


In [41]:
# Perform post-hoc test using tukey HSD to see if any pairs have a p value of under 0.1 or close
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(keyword_data['Clicks'], keyword_data['Search Keyword'], alpha=0.05)

# Display Tukey's test results
print(tukey_results)

                                Multiple Comparison of Means - Tukey HSD, FWER=0.05                                
                group1                                group2                meandiff p-adj   lower    upper  reject
-------------------------------------------------------------------------------------------------------------------
             Things to do with family  Things to do with family on vacation     -9.5    1.0 -54.9535 35.9535  False
             Things to do with family Things to do with family this weekend     -9.5    1.0 -54.9535 35.9535  False
             Things to do with family                Things to do with kids     -2.5    1.0 -39.6127 34.6127  False
             Things to do with family   Things to do with kids this weekend     -9.5    1.0 -54.9535 35.9535  False
             Things to do with family                   attractions near me     -9.5 0.9998 -46.6127 27.6127  False
             Things to do with family           couple date idea fort my

In [44]:
tukey_results.pvalues < 0.1

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,