![alt text](https://www.business.unsw.edu.au/style%20library/asb/assets/images/logo-unsw.png)

# MARK5828 Week 3: Google Trends Comparative Advertising

# Introduction

How can marketers analyse the effectiveness of an advertisement? By asking peers how good it was? By scanning people's brainwaves while they watch it (this is called neuromarketing)? Or can we *compare web search volume data to see how much interest the advertisement has generated*?


<font color='Purple'>**In this tutorial, you will analyse the change in general consumer interest in the Samsung Galaxy S II after a major marketing campaign. Although we could simply plot search interest before and after the campaign, we will go further and examine how the product is positioned relative to its competitors in web searches.**</font>

## Interesting information:

Market share in US: Q4 2011 -> Q1 2012

*   Apple 45.3% -> 38.1%
*   Samsung 19.2% -> 21.7%

[reference](https://www.statista.com/statistics/242388/market-share-of-smartphone-vendors-in-the-united-states-usa/)



# [Google Trends](https://trends.google.com/trends)

## What is [Google Trends](https://trends.google.com/trends/explore?q=%2Fm%2F02y17j,%2Fm%2F09gms,%2Fm%2F012y1_&date=all)?

- Explore what the world is searching for by entering a keyword or a topic in the Explore bar
- See Daily Search Trends
- See Year in Search data

## How Trends data is adjusted (What is the data telling us?)

- Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest.
- The resulting numbers are then scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics.
- Different regions that show the same search interest for a term don't always have the same total search volumes
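
The adjustment described above can be sketched in a few lines. The search counts below are made up for illustration (Google does not publish raw volumes), and the function name `trends_index` is hypothetical. The sampling step that Google also applies is not modelled here.

```python
def trends_index(term_searches, total_searches):
    """Turn raw counts into a 0-100 index, mimicking the adjustment described above."""
    # Step 1: share of all searches in each period (relative popularity).
    shares = [term / total for term, total in zip(term_searches, total_searches)]
    # Step 2: scale so that the period with the highest share gets 100.
    peak = max(shares)
    return [round(100 * share / peak) for share in shares]


# Made-up numbers: the term's raw count grows, but total search volume grows faster.
term_searches = [200, 300, 360]
total_searches = [10_000, 20_000, 36_000]
index = trends_index(term_searches, total_searches)
print(index)  # [100, 75, 50]
```

Although the raw count rises every period, the index falls, because the term makes up a shrinking share of all searches.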

## Where Trends data comes from 
*There are 2 types of Trends data:*

- Realtime data is a random sample of searches from the last 7 days
- Non-realtime data is a random sample of Google search data that can be pulled from as far back as 2004 and up to 36 hours before your search

# Aim

Use an open-source Python library to extract Google Trends data so that we can later analyse the effectiveness of Samsung's comparative ad ["The Next Big Thing"](https://www.youtube.com/watch?v=GWnunavN4bQ) from 2011.


# Installation


Let's install the libraries required for this exercise!

For Google Trends, we will be using the [pyTrends library](https://github.com/GeneralMills/pytrends).
To install and import the library, we run the following code:

In [1]:
!pip install pytrends

Collecting pytrends
  Downloading https://files.pythonhosted.org/packages/68/ef/f2428c5333ad5c9c5161ce62c05a357213bafc269cb03e8bf2fd0f9d124d/pytrends-4.4.0-py2.py3-none-any.whl
Installing collected packages: pytrends
Successfully installed pytrends-4.4.0


In [2]:
from google.colab import files

# Upload week3_helpers.py
files.upload()

Saving week3_helpers.py to week3_helpers.py




In [0]:
from pytrends.request import TrendReq        # Google Trends API

import numpy as np
import pandas as pd

from week3_helpers import chunker, get_keywords_combination

from functools import reduce

# Getting your Variables

For each brand, we make unique brand pairs: 

Brand Pair | Google Search Queries
--|--
(Apple, Blackberry) | Apple Blackberry OR iPhone Blackberry 
(Apple, HTC) | Apple HTC OR Apple Evo OR iPhone HTC OR iPhone Evo
(Blackberry, HTC) | Blackberry HTC OR Blackberry Evo
(Blackberry, Samsung) | Blackberry Samsung OR Blackberry Galaxy
**(Apple, Samsung)** | **Apple Samsung OR Apple Galaxy OR iPhone Samsung OR iPhone Galaxy**
(HTC, Samsung) | HTC Samsung OR HTC Galaxy OR Evo Samsung OR Evo Galaxy

In [0]:
BRANDS= ['Apple', 'Blackberry', 'HTC', 'Samsung']
BRANDS_SYNONYMS = ['Apple,iPhone', 'Blackberry', 'HTC,Evo', 'Samsung,Galaxy']
TIMEFRAME = "2011-10-27 2012-01-22"
GEO = 'US'
CATEGORY = 0

In [0]:
keywords = get_keywords_combination(
  brands=BRANDS,
  brands_synonyms=BRANDS_SYNONYMS
)

Take the first row for example:

(Apple, Blackberry):	Apple Blackberry OR iPhone Blackberry

**Apple-Blackberry** is the brand pair. When people search for something along the lines of **Apple Blackberry** or **iPhone Blackberry**, it counts as a search query towards the Apple-Blackberry pair.

Now if we account for all of the other brand pairs, we get a **daily relative Google search interest for each brand pair**.

Your starting data should look something like:

Apple Blackberry + iPhone Blackberry | Apple HTC + Apple Evo + iPhone HTC + iPhone Evo | Blackberry HTC + Blackberry Evo | Blackberry Samsung + Blackberry Galaxy | Apple Samsung + Apple Galaxy + iPhone Samsung + iPhone Galaxy | HTC Samsung + HTC Galaxy + Evo Samsung + Evo Galaxy | date
--|--|--|--|--|--|--
35 | 26 | 6 | 7 | 49 | 24.43665768 | 27-10-2011
26 | 31 | 3 | 8 | 51 | 33.60040431 | 28-10-2011
27 | 25 | 6 | 8 | 51 | 31.56401617 | 29-10-2011
...|...|...|...|...|...|...
25 | 18 | 1 | 4 | 38 | 22.40026954 | 22-01-2012

The data means:

Brand Pair | Relative Interest (for 27-10-2011)
--|--
Apple Blackberry | 35
Apple HTC | 26
Blackberry HTC | 6
Blackberry Samsung | 7
Apple Samsung | 49
HTC Samsung | 24.4

Compared to all other brand pairs, **Apple Samsung** has the highest relative search interest on this day: 49 versus 7 for Blackberry Samsung, which makes it 7x more popular.
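
The comparison can be checked with a short calculation. The values are copied from the table for 27-10-2011. The variable names are illustrative only.

```python
# Relative interest on 27-10-2011, copied from the table above.
interest = {
    'Apple Blackberry': 35,
    'Apple HTC': 26,
    'Blackberry HTC': 6,
    'Blackberry Samsung': 7,
    'Apple Samsung': 49,
    'HTC Samsung': 24.4,
}

# Brand pair with the highest relative interest on that day.
top_pair = max(interest, key=interest.get)
# How many times more interest the top pair drew than each pair.
ratios = {pair: interest[top_pair] / value for pair, value in interest.items()}
print(top_pair, ratios['Blackberry Samsung'])  # Apple Samsung 7.0
```

These ratios are only meaningful because all six columns sit on one common scale. Values taken from separate Google Trends requests cannot be compared directly.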

In [8]:
print(list(keywords.keys()))
print(list(keywords.values()))

[('Apple', 'Blackberry'), ('Apple', 'HTC'), ('Apple', 'Samsung'), ('Blackberry', 'HTC'), ('Blackberry', 'Samsung'), ('HTC', 'Samsung')]
['Apple Blackberry + iPhone Blackberry', 'Apple HTC + Apple Evo + iPhone HTC + iPhone Evo', 'Apple Samsung + Apple Galaxy + iPhone Samsung + iPhone Galaxy', 'Blackberry HTC + Blackberry Evo', 'Blackberry Samsung + Blackberry Galaxy', 'HTC Samsung + HTC Galaxy + Evo Samsung + Evo Galaxy']
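
The implementation of `get_keywords_combination` lives in `week3_helpers.py` and is not shown in this notebook. The sketch below is a hypothetical stand-in (`build_pair_queries` is an invented name) that reproduces the keys and strings printed above, assuming the helper pairs every brand with every other brand and crosses their synonyms. Google Trends treats `+` between terms as OR, which is why the strings use `+`.

```python
from itertools import combinations, product


def build_pair_queries(brands, brands_synonyms):
    """Hypothetical stand-in for get_keywords_combination (real implementation not shown)."""
    # Map each brand to its search terms, e.g. 'Apple' -> ['Apple', 'iPhone'].
    terms = {brand: synonyms.split(',') for brand, synonyms in zip(brands, brands_synonyms)}
    queries = {}
    # Every unordered pair of distinct brands, in the order the brands are listed.
    for first, second in combinations(brands, 2):
        # Cross every term of the first brand with every term of the second.
        phrases = [f"{a} {b}" for a, b in product(terms[first], terms[second])]
        # Joining with ' + ' matches the strings printed above.
        queries[(first, second)] = " + ".join(phrases)
    return queries


pairs = build_pair_queries(
    brands=['Apple', 'Blackberry', 'HTC', 'Samsung'],
    brands_synonyms=['Apple,iPhone', 'Blackberry', 'HTC,Evo', 'Samsung,Galaxy'],
)
print(pairs[('Apple', 'Samsung')])
# Apple Samsung + Apple Galaxy + iPhone Samsung + iPhone Galaxy
```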


# Google Trends API

## Understanding pyTrends (the unofficial Google Trend API)

**keywords** - list of keywords to be searched 

**category** - [Category](https://github.com/pat310/google-trends-api/wiki/Google-Trends-Categories) of the keyword search. <font color='Purple'>We are using category 0, which is the index for 'all categories'.</font> More specific categories might suit smartphone queries better, but we use 'all categories' because the category classification algorithm in 2011 was not as advanced as it is now.

**geo** - Geographical location. <font color='Purple'>We are using 'US' because we are only interested in the effectiveness of the ad campaign, which was broadcast in the US.</font>

**timeframe** - Time frame that we are interested in, here "2011-10-27 2012-01-22". <font color='Purple'>This gives us Google Trends data from before, during and after Samsung's ad campaign.</font>

*Note: If the timeframe is less than 3 months, the API will return daily Google Trends Data. However, if it's more than 3 months, it will return weekly Google Trends Data.*

***`This returns historical, indexed data showing when the keyword was searched most, as shown in Google Trends' Interest Over Time section`***
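
The note on daily versus weekly data can be checked against our timeframe with a small calculation. This is only a sketch: `timeframe_days` is a hypothetical helper, and reading "3 months" as roughly 90 days is an assumption, not a documented threshold.

```python
from datetime import date


def timeframe_days(timeframe):
    """Number of calendar days covered by a 'YYYY-MM-DD YYYY-MM-DD' string, both ends included."""
    start, end = (date.fromisoformat(part) for part in timeframe.split())
    return (end - start).days + 1


days = timeframe_days("2011-10-27 2012-01-22")
# Assumption: "3 months" in the note is read here as roughly 90 days.
print(days, "daily" if days <= 90 else "weekly")  # 88 daily
```

With 88 days, the chosen timeframe stays under three months, so we expect one row per day.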

More info available on:
- pyTrends Library: https://github.com/GeneralMills/pytrends
- Google Trends: https://support.google.com/trends/answer/6248105?hl=en&ref_topic=6248052  and
https://support.google.com/trends/answer/4359582?hl=en


# Function Definitions

In [0]:
def join_mean(df1, df2):
  """
  Joins two dataframes by rescaling the second one on a shared column
  
  Source:
  http://digitaljobstobedone.com/2017/07/10/how-do-you-compare-large-numbers-of-items-in-google-trends/

  Assumption: The column with the highest peak value in dataframe 1 also
  exists in dataframe 2.
  1. Find the column with the highest peak value in dataframe 1.
  2. Calculate the average of that column in both dataframes.
  3. Scale the values of dataframe 2 so that the column's average is the same
  as in dataframe 1, then drop the duplicate column from dataframe 2.
  """
  mean1 = df1.mean()
  mean2 = df2.mean()
  maxKey = df1.max().idxmax()
  res = (df2 / mean2[maxKey] * mean1[maxKey])
  res = res.drop(columns=[maxKey])
  return pd.concat([df1.astype(np.float64), res], axis=1)
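
To see what the rescaling in `join_mean` does, here is a worked example with plain Python lists instead of dataframes. The two batches and their values are made up. `'B'` plays the role of the shared column, and it is also the column with the highest peak in the first batch.

```python
from statistics import mean

# Two made-up batches that share the anchor keyword 'B'. Each batch is on its own scale.
batch1 = {'A': [10, 20, 30], 'B': [20, 40, 60]}
batch2 = {'B': [10, 20, 30], 'C': [50, 100, 60]}

anchor = 'B'
# Factor that brings batch2's anchor mean (20) up to batch1's anchor mean (40).
factor = mean(batch1[anchor]) / mean(batch2[anchor])

# Rescale batch2 and drop its copy of the anchor, then merge with batch1.
rescaled = {key: [v * factor for v in values] for key, values in batch2.items() if key != anchor}
joined = {**batch1, **rescaled}
print(factor, joined['C'])  # 2.0 [100.0, 200.0, 120.0]
```

After the join, `'C'` can exceed 100 because it is now expressed on the scale of the first batch. This is why the final step of the pipeline divides everything by the overall maximum again.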

In [0]:
def google_trends_payload(keywords, timeframe, geo='', category=0):
  """
  Get Google Trends interest over time for list of keywords provided.
  Make sure this function is called via "google_trends_payload_nitems" if you
  want to analyse more than 5 keywords.
  
  Inputs:
    keywords: List of keywords to get interest for (eg, ["Apple", "Samsung", "HTC"]). 
    timeframe: Start-End string timeframe (eg, "2011-10-27 2012-01-22")
    geo: Geographical region: https://en.wikipedia.org/wiki/ISO_3166-1
    category: Category ID (use "get_google_trend_categories" to see the categories)

  Output:
    Dataframe of relative interest values per keyword per day in the timeframe.
  """
  
  pytrends = TrendReq(hl='en-US', tz=360)
  pytrends.build_payload(keywords, cat=category, timeframe=timeframe, geo=geo, gprop='')
  data = pytrends.interest_over_time()
  data = data.drop(columns="isPartial")
  return data

In [0]:
def google_trends_payload_nitems(keywords, timeframe, geo='', category=0):
  """
  Get Google Trends interest over time for list of keywords provided.
  
  Inputs:
    keywords: List of keywords to get interest for (eg, ["Apple", "Samsung", "HTC"]). 
    timeframe: Start-End string timeframe (eg, "2011-10-27 2012-01-22")
    geo: Geographical region: https://en.wikipedia.org/wiki/ISO_3166-1
    category: Category ID (use "get_google_trend_categories" to see the categories)

  Output:
    Dataframe of relative interest values per keyword per day in the timeframe.
  """
  data_list = []
  max_key = keywords[0]
  for key_batch in chunker(keywords[1:], 4):
    key_batch.append(max_key)
    data = google_trends_payload(keywords=key_batch, timeframe=timeframe, geo=geo, category=category)
    data_list.append(data)
    max_key = data.max().idxmax()
        
  result = reduce(join_mean, data_list)
  max_key = result.max().idxmax()
  
  data_list = []
  keywords = [key for key in keywords if key != max_key]  # new list, so the caller's list is left untouched
  for key_batch in chunker(keywords, 4):
    key_batch.append(max_key)
    data = google_trends_payload(keywords=key_batch, timeframe=timeframe, geo=geo, category=category)
    data_list.append(data)
        
  result = reduce(join_mean, data_list)
            
  return result * 100 / result.max().max()
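
Google Trends compares at most five terms per request, which is why the function sends batches of four keywords plus one shared anchor. The sketch below shows only the batching. `chunker` here is a hypothetical version of the helper imported from `week3_helpers.py`, whose real implementation is not shown, and the anchor is kept fixed for simplicity, whereas the function above switches to the strongest keyword of the previous batch.

```python
def chunker(seq, size):
    """Hypothetical version of the helper: yield consecutive slices of at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


keywords = ['kw1', 'kw2', 'kw3', 'kw4', 'kw5', 'kw6']
anchor = keywords[0]

batches = []
for batch in chunker(keywords[1:], 4):
    # Each request holds up to 4 new keywords plus the shared anchor (5 in total).
    batch.append(anchor)
    batches.append(batch)
print(batches)
# [['kw2', 'kw3', 'kw4', 'kw5', 'kw1'], ['kw6', 'kw1']]
```

Because every batch contains the anchor, the batches can later be put on one scale by matching the anchor's average.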

# Code

Let's extract data!

In [0]:
result = google_trends_payload_nitems(
  keywords=list(keywords.values()),
  timeframe=TIMEFRAME,
  geo=GEO,
  category=CATEGORY
)

In [14]:
result

date | Apple Blackberry + iPhone Blackberry | Apple HTC + Apple Evo + iPhone HTC + iPhone Evo | Blackberry HTC + Blackberry Evo | Blackberry Samsung + Blackberry Galaxy | Apple Samsung + Apple Galaxy + iPhone Samsung + iPhone Galaxy | HTC Samsung + HTC Galaxy + Evo Samsung + Evo Galaxy
--|--|--|--|--|--|--
2011-10-27 | 42.0 | 32.0 | 5.0 | 7.0 | 63.0 | 20.749949
2011-10-28 | 24.0 | 35.0 | 2.0 | 7.0 | 58.0 | 38.387406
2011-10-29 | 27.0 | 34.0 | 6.0 | 8.0 | 52.0 | 29.049929
2011-10-30 | 17.0 | 26.0 | 6.0 | 7.0 | 58.0 | 24.899939
2011-10-31 | 34.0 | 21.0 | 5.0 | 5.0 | 50.0 | 39.424903
2011-11-01 | 19.0 | 26.0 | 4.0 | 9.0 | 47.0 | 33.199918
2011-11-02 | 21.0 | 21.0 | 4.0 | 3.0 | 54.0 | 49.799878
2011-11-03 | 21.0 | 31.0 | 3.0 | 4.0 | 44.0 | 37.349908
2011-11-04 | 27.0 | 26.0 | 3.0 | 7.0 | 52.0 | 47.724883
2011-11-05 | 21.0 | 37.0 | 3.0 | 7.0 | 54.0 | 33.199918


# Store results in a file

In [0]:
result['date'] = result.index
result.to_csv("brand_" + TIMEFRAME + ".csv", index=False)