# Lab 1: Playing with Google Trends


The goal of this lab is to collect Google Trends data using [PyTrends](https://pypi.org/project/pytrends/).

This lab is written by Dr. Jisun AN (jisunan@smu.edu.sg), Dr. Haewoon KWAK (hkwak@smu.edu.sg) and Ms Michelle Kan (michellekan@smu.edu.sg).

# Install

<b>[pip](https://realpython.com/what-is-pip/)</b> is the standard package manager for Python. It allows you to install and manage additional packages that are not part of the Python standard library. 

Let's use pip to install the required packages for this lab.

In [None]:
!pip install pytrends 

In [None]:
!pip install matplotlib

In [None]:
!pip install plotly

In [None]:
!pip install pandas

In [None]:
!pip install seaborn

# Add Google Drive as an accessible path

The following code sets the working folder for this Python notebook to a folder in our Google Drive, so that the saved data is easy to find.
Skip this step if you are running Jupyter Notebook locally, as the `google.colab` module is only available in Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# change path to the designated Google Drive folder
# otherwise, data will be saved in the /content folder, which may be hard to locate later
%cd /content/drive/My Drive/Colab Notebooks

# Set logger

The [Python Logging](https://docs.python.org/3/library/logging.html) module lets us see what is happening inside third-party libraries such as Pytrends.

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                    datefmt='%m-%d %H:%M:%S')
logger = logging.getLogger(__name__)

The above code imports the logging module and calls the `basicConfig` method which does basic configuration for the logging system. The `format` string argument defines the format of the logger output with the following [LogRecord attributes](https://docs.python.org/3/library/logging.html#logrecord-attributes):

|Attribute name|Format|Description|
|:----|:-----:|:----|
|asctime|%(asctime)s|Human-readable time when the LogRecord was created.|
|name|%(name)s|Name of the logger used to log the call.|
|levelname|%(levelname)s|Text logging level for the message ('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL').|
|message|%(message)s|The logged message, computed as msg % args. This is set when [Formatter.format()](https://docs.python.org/3/library/logging.html#logging.Formatter.format) is invoked.|
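
To see what this format string produces, here is a minimal sketch that writes one record to an in-memory buffer instead of the notebook output. The logger name `demo` and the message text are made up for illustration.

```python
import io
import logging

# Send records to an in-memory buffer so the formatted text can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(
    fmt='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
    datefmt='%m-%d %H:%M:%S'))

demo_logger = logging.getLogger('demo')  # hypothetical logger name
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False            # keep the demo out of the root logger
demo_logger.addHandler(handler)

demo_logger.debug('hello from the lab')
formatted_line = buffer.getvalue().strip()
print(formatted_line)  # timestamp, padded logger name, padded level, message
```

The `-12s` and `-8s` parts left-align the logger name and level in fixed-width columns, which keeps the messages lined up when many libraries log at once.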

# Connect to Google

We connect with language `en-US` and a timezone offset of `-480` minutes, which corresponds to Singapore (UTC+8) according to Google's convention.

In [None]:
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=-480)

Below is a sample output of the logger stating that an HTTP request has been successfully completed (i.e. [response status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) 200):
<img align="center" src="https://docs.google.com/uc?id=1VrpE6JCuzWNHcjMnJBdcjw7IXEViigZE"  style="height: 15px;"/>
(Note: The background colour of the message may differ based on the application you are using)
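
The `tz=-480` value used when connecting can be derived rather than memorised. The sketch below assumes the convention is "UTC minus local time, in minutes", which matches -480 for Singapore; `google_tz_offset` is a hypothetical helper and not part of Pytrends.

```python
def google_tz_offset(utc_offset_hours):
    """Return UTC minus local time in minutes (hypothetical helper)."""
    return int(-60 * utc_offset_hours)

print(google_tz_offset(8))    # Singapore, UTC+8
print(google_tz_offset(-6))   # US Central Standard Time, UTC-6
```

For example, `TrendReq(hl='en-US', tz=google_tz_offset(8))` would be equivalent to the call used in this lab.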

# Collect the Google Trends query's response using Pytrends

We collect all the data that is accessible through the web interface using the following methods:

1. Interest over time
2. Interest by city (region)
3. Related topics
4. Related queries
5. [Optional] Trending searches

<img align="center" src="https://docs.google.com/uc?id=1de3mEalTjfyb685stdgvhRdtb9GJ63mw" width="550" style="vertical-align:middle;margin:0px 10px"/>

https://trends.google.com/trends/explore?date=2021-12-05%202022-01-04&geo=SG&q=new%20year&hl=en

## Setting common parameters

The `build_payload` method allows us to define the key parameters for retrieving Google Trends data.

In this example, we store the Google Trends search term in a list named `keywords` with the value "new year". Check out the purpose of the remaining parameters `geo`, `timeframe` and `cat` [here](https://pypi.org/project/pytrends/).

In [None]:
keywords = ["new year"]
pytrends.build_payload(keywords, geo='SG', timeframe='2021-12-05 2022-01-04', cat=0)
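
The `timeframe` string above is simply two dates in `YYYY-MM-DD` form separated by a space. If you prefer to work with date objects, a small helper can build the string for you. `make_timeframe` is a hypothetical name used only for this sketch.

```python
from datetime import date

def make_timeframe(start, end):
    """Build a 'YYYY-MM-DD YYYY-MM-DD' string from two dates (hypothetical helper)."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{start:%Y-%m-%d} {end:%Y-%m-%d}"

timeframe = make_timeframe(date(2021, 12, 5), date(2022, 1, 4))
print(timeframe)
```

The result can be passed directly as `timeframe=timeframe` to `build_payload`.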

## 1. Interest over time

Let's retrieve the Google Trends' <b>Interest Over Time</b> section data based on the keyword defined above.

In [None]:
df = pytrends.interest_over_time()
df.tail(n=10)

In [None]:
df.to_csv('1-over-time.csv')

## Let's plot

### To draw a plot, we need to reset the index of the dataframe so that `date` becomes a regular column

In [None]:
# reset_index() -- Reset the index of the DataFrame, and use the default one instead. 
# If the DataFrame has a MultiIndex, this method can remove one or more levels.
df.reset_index(inplace=True)
df.head()
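
To make the effect of `reset_index` concrete without calling Google, here is a sketch on a tiny hand-made dataframe. The values are invented; only the shape (a `date` index plus data columns) is meant to resemble the output above.

```python
import pandas as pd

# Invented values, shaped like an interest-over-time result.
toy = pd.DataFrame(
    {"new year": [12, 35, 100], "isPartial": [False, False, False]},
    index=pd.DatetimeIndex(["2021-12-30", "2021-12-31", "2022-01-01"], name="date"),
)
print(list(toy.columns))   # 'date' is the index, not a column

flat = toy.reset_index()   # returns a new frame; toy itself is unchanged
print(list(flat.columns))  # 'date' is now an ordinary column
```

Once `date` is a column, it can be passed by name as the `x` argument of a plotting function.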

### Load Backup data -- if Google blocked our IP (Optional)

In the event that our IP is temporarily blocked by Google due to too many queries being detected at once, carry out the following:<br>
a) Download the following files containing backup data for this lab [here](https://drive.google.com/drive/folders/1J8kq5aUbEZWxxT04q3rkKriNhbez2evh?usp=sharing).<br>
<img align="center" src="https://docs.google.com/uc?id=1dxtOro2MYdJC9Q7zAAEWilOTr_oxjbGb" width="300" style="vertical-align:middle;margin:0px 10px"/>

b) Place the backup files in the <b>same folder</b> where you save the current notebook<br>
c) Uncomment and run the code below (you can toggle comments on all lines by selecting them with Ctrl+A and pressing Ctrl+/ on Windows, or Cmd+A and Cmd+/ on Mac)

In [None]:
# # Our IP may be temporarily blocked by Google if we send too many queries at once.
# # For that case, we have backup data.

# import pandas as pd # add pandas library to the notebook
# df = pd.read_csv('backup-1-over-time.csv')
# df.head()
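
Instead of commenting code in and out by hand, the fallback can also be written as a small wrapper. This is only a sketch: `fetch_or_backup` is a hypothetical helper, the failing request is simulated, and the backup is an in-memory CSV with invented values.

```python
import io
import pandas as pd

def fetch_or_backup(fetch, backup):
    """Call fetch(); if it raises, read the backup CSV instead (hypothetical helper)."""
    try:
        return fetch()
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        print(f"Live request failed ({exc!r}); using backup data.")
        return pd.read_csv(backup)

def always_blocked():
    # Stand-in for a live request that Google has rejected.
    raise RuntimeError("simulated block")

fake_backup = io.StringIO("date,new year\n2021-12-31,80\n2022-01-01,100\n")
demo_df = fetch_or_backup(always_blocked, fake_backup)
print(demo_df)
```

In the lab, a call such as `fetch_or_backup(pytrends.interest_over_time, 'backup-1-over-time.csv')` would follow the same pattern.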

### Plotly is a Python library for interactive visualization of data

See more chart examples in the Plotly for Python documentation: https://plotly.com/python/


In [None]:
import plotly.express as px # add library to the notebook!

In [None]:
fig = px.line(df, x="date", y="new year", title='New Year popularity in Google Search')
fig.show()

### Exercise 1. Change the country from Singapore to 3 other countries that you are curious about

Update the `geo` parameter to retrieve Google Trends data for 3 other countries. The two-letter country codes are available via https://en.wikipedia.org/wiki/ISO_3166-2#:~:text=It%20was%20first%20published%20in,form%20than%20their%20full%20names.

Don't forget to replace XX, YY, and ZZ in both the `geo` parameter and the filenames with your chosen countries.

In [None]:
keywords = ["new year"]

In [None]:
pytrends.build_payload(keywords, geo='XX', timeframe='2021-12-05 2022-01-04', cat=0)
df = pytrends.interest_over_time()

df.to_csv('1-1-XX.csv')

In [None]:
pytrends.build_payload(keywords, geo='YY', timeframe='2021-12-05 2022-01-04', cat=0)
df = pytrends.interest_over_time()

df.to_csv('1-2-YY.csv')

In [None]:
pytrends.build_payload(keywords, geo='ZZ', timeframe='2021-12-05 2022-01-04', cat=0)
df = pytrends.interest_over_time()

df.to_csv('1-3-ZZ.csv')

## 2. Interest by city (region)

<img align="center" src="https://docs.google.com/uc?id=1dzQpQoVfxh1qwAOStg7jh37MIdCk9k_I" width="600" style="vertical-align:middle;margin:0px 10px"/><br>

The `interest_by_region` method returns the Google Trends search interest broken down by location (i.e. city or sub-region), showing where in the selected country the keyword is searched most.
Unfortunately, Google does not provide a fine-grained sub-region view for Singapore.
Let's try with Australia.

<img align="left" src="https://docs.google.com/uc?id=1IegynNxVgb3GxQoXFD_HPJMRJcx8Rlmk" width="30" style="vertical-align:middle;margin:0px 5px"/> The `resolution` parameter allows us to define the granularity of our data based on the following values:
- 'CITY' returns city level data
- 'COUNTRY' returns country level data
- 'DMA' returns Metro level data
- 'REGION' returns Region level data

Let's retrieve 'REGION' data for Australia in the following example:



In [None]:
from pytrends.request import TrendReq
pytrends = TrendReq(hl='en-US', tz=-480)

keywords = ["new year"]
pytrends.build_payload(keywords, geo='AU', timeframe='2021-12-05 2022-01-04', cat=0)

In [None]:
df = pytrends.interest_by_region(resolution='REGION', inc_low_vol=True, inc_geo_code=False)
df.head()

In [None]:
df.to_csv('2-by-region.csv')

### Backup data -- Use below code if Google blocked our IP

In [None]:
# df = pd.read_csv("backup-2-by-region.csv")
# df.head()

### Let's plot using Matplotlib, another Python library for visualization!


In [None]:
import matplotlib.pyplot as plt

In [None]:
# Matplotlib prints many DEBUG messages, so let's raise the root logger's level to show only ERROR and above
logging.getLogger().setLevel(logging.ERROR)

In [None]:
# reset_index() -- Reset the index of the DataFrame, and use the default one instead. 
# If the DataFrame has a MultiIndex, this method can remove one or more levels.
df.reset_index(inplace=True)
df.head()

In [None]:
df.plot.bar(x="geoName", y="new year", rot=70, title="New Year popularity by various regions in Australia")

plt.show(block=True)

See more bar chart examples: https://pythontic.com/pandas/dataframe-plotting/bar%20chart

## Exercise 2. Read the API doc and try a different resolution

Check https://github.com/GeneralMills/pytrends#interest-by-region and try another resolution by changing XX to one of the other options, e.g. 'CITY'

In [None]:
df = pytrends.interest_by_region(resolution=XX, inc_low_vol=True, inc_geo_code=False)
df.head()

### Bugs
<img align="left" src="https://docs.google.com/uc?id=1nhCz5zbFKKD4KD-kPxrMIvGBbgET_QZG" width="45" style="vertical-align:middle;margin:0px 10px"/> You might realize that PyTrends does not support the city-level data for AU. When the `resolution` parameter is set as `CITY`, you will observe that the data returned is region-specific instead of city-specific which is incorrect.

It is due to the following source code in the Pytrends library. 

    # make the request
    region_payload = dict()
    if self.geo == '':
        self.interest_by_region_widget['request']['resolution'] = resolution
    elif self.geo == 'US' and resolution in ['DMA', 'CITY', 'REGION']:
        self.interest_by_region_widget['request']['resolution'] = resolution

See https://github.com/GeneralMills/pytrends/blob/master/pytrends/request.py#L273

The code above, taken from the Pytrends library, applies the requested `resolution` only when `geo` is empty or 'US'. Hence, 'CITY' resolution is ignored and city-level data is not retrieved when `geo` is set to 'AU' in our example. A workaround for this issue is available at https://github.com/haewoon/pytrends. <br/>
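
Before applying the workaround, it may help to mimic the quoted branch in isolation. The function below is a simplified model, not the library code: `effective_resolution` is a hypothetical name, and `widget_default` stands for whatever resolution the request already carried, which we assume here to be 'REGION'.

```python
def effective_resolution(geo, requested, widget_default):
    """Model of the quoted branch: which resolution ends up in the request."""
    if geo == '':
        return requested
    if geo == 'US' and requested in ['DMA', 'CITY', 'REGION']:
        return requested
    # Any other geo: the requested value is never applied.
    return widget_default

print(effective_resolution('US', 'CITY', widget_default='REGION'))
print(effective_resolution('AU', 'CITY', widget_default='REGION'))
```

For 'AU', the requested 'CITY' is dropped and the assumed default is used instead, which is consistent with the region-level data we observed.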

<br>If you wish to <b>retrieve city-level data for AU</b>, carry out the following steps: 
1. Download the __pytrends__ folder from the above github link <img align="center" src="https://docs.google.com/uc?id=1Vqi2cFkLFLCxWIJMtdD4MHC9ji4BfclG"  style="height: 100px;"/><br><br>
2. Place the __pytrends__ folder in the same folder as the current python notebook file. <br><br>
3. Go to menu of this notebook, select <b>Runtime -> select Restart runtime</b> to restart the notebook kernel (explanation: by restarting the kernel, the current notebook will be able to recognize the presence of the newly added 'pytrends' folder)<br><br>
4. <i>[Optional if running locally from Jupyter notebook]</i> If you are running this notebook from Colab, you will need to add Google Drive as an accessible path:<br>
```python
from google.colab import drive
drive.mount('/content/drive')
# update the path below based on the location of your current notebook
%cd /content/drive/My Drive/Colab Notebooks/
```
5. Reimport pytrends.request (this now imports from the local 'pytrends' folder instead of the pip-installed 'pytrends' package) and set parameters:<br>
```python
from pytrends.request import TrendReq
pytrends = TrendReq(hl='en-US', tz=-480)
pytrends.build_payload(["new year"], geo='AU', timeframe='2021-12-05 2022-01-04', cat=0)
```
6. Rerun the `interest_by_region` API call by setting `resolution` as `CITY`<br>
```python
df = pytrends.interest_by_region(resolution='CITY', inc_low_vol=True, inc_geo_code=False)
df.head()
```

You should now see city-level data for AU, because `pytrends.interest_by_region` now runs the source code in the newly added 'pytrends' folder instead of the pip-installed 'pytrends' package.
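
If you are unsure which copy of a package Python will pick up, you can ask the import system where it would load it from. The sketch below uses the standard `importlib` module; `module_location` is a hypothetical helper, and `json` is used only because it is always available.

```python
import importlib.util

def module_location(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

print(module_location('json'))
print(module_location('no_such_module_for_demo'))
```

After completing the steps above, `module_location('pytrends')` would be expected to point inside your notebook folder rather than the `site-packages` or `dist-packages` directory.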


In [None]:
# save the resolution data into a csv file
df.to_csv('2-by-XX.csv')

## 3. Related topics
Back to Singapore. 
Users searching for a search term (e.g., 'new year') also searched for these topics. 

Google Trends provides two options:
* <b>Top</b> - The most popular topics. Scoring is on a relative scale where a value of 100 is the most commonly searched topic and a value of 50 is a topic searched half as often as the most popular term, and so on.

* <b>Rising</b> - Related topics with the biggest increase in search frequency since the last time period. Results marked "Breakout" had a tremendous increase, probably because these topics are new and had few (if any) prior searches.
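
The relative scale used by <b>Top</b> can be illustrated with a few invented numbers. Google does not expose raw search counts, and its exact rounding is not documented here, so treat this purely as a sketch of the idea; `relative_scores` is a hypothetical helper.

```python
def relative_scores(counts):
    """Scale raw counts so that the largest becomes 100 (illustrative only)."""
    top = max(counts.values())
    return {term: round(100 * n / top) for term, n in counts.items()}

made_up_counts = {"Topic A": 400, "Topic B": 200, "Topic C": 100}
scores = relative_scores(made_up_counts)
print(scores)
```

Here "Topic B" is searched half as often as "Topic A", so it receives a score of 50.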

In [None]:
keywords = ["new year"]
pytrends.build_payload(keywords, geo='SG', timeframe='2021-12-05 2022-01-04', cat=0)

In [None]:
related_topics = pytrends.related_topics()

In [None]:
# let's take a look at the output
related_topics

In [None]:
# let's check what's the data type of the output
type(related_topics)

### Backup data -- Use below code if Google blocked our IP

In [None]:
# import pickle
# filename = 'backup-3-related-topics.pickle'
# with open(filename, 'rb') as infile:  # the file is closed automatically
#     related_topics = pickle.load(infile)
# print(type(related_topics))
# related_topics


### Access to top related topics

In [None]:
related_topics['new year']['top']

### Access to rising related topics

In [None]:
related_topics['new year']['rising']

## Exercise 3. Compare related topics between different periods.

Related topics are also changing over time.
Compare related topics of 'covid-19' in the **United Kingdom** (1) between 2021/11/1 and 2021/12/1 with (2) between 2021/12/1 and 2022/1/1.

Tip: United Kingdom's code is not UK.


In [None]:
# enter your code below




## 4. Related queries

Users searching for a term (here 'new year') also searched for these queries. 

Similarly, Google provides two options:
* <b>Top</b> - The most popular search queries. Scoring is on a relative scale where a value of 100 is the most commonly searched query, 50 is a query searched half as often as the most popular query, and so on.

* <b>Rising</b> - Queries with the biggest increase in search frequency since the last time period. Results marked "Breakout" had a tremendous increase, probably because these queries are new and had few (if any) prior searches.

In [None]:
keywords = ["new year"]
pytrends.build_payload(keywords, geo='SG', timeframe='2021-12-05 2022-01-04', cat=0)

In [None]:
related_queries = pytrends.related_queries()

In [None]:
# let's take a look at the output
related_queries

### Backup data -- Use below code if Google blocked our IP

In [None]:
# import pickle
# filename = 'backup-4-related-queries.pickle'
# with open(filename, 'rb') as infile:  # the file is closed automatically
#     related_queries = pickle.load(infile)
# print(type(related_queries))
# related_queries

### Access to top related queries

In [None]:
related_queries['new year']['top']

### Access to rising related queries

In [None]:
related_queries['new year']['rising']

## Exercise 4. Compare related queries between different periods.

Related queries are also changing over time.
Compare related queries of 'covid-19' in the **United States** (1) between 2020/11/1 and 2020/12/1 with (2) between 2020/12/1 and 2021/1/1.
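
One simple way to compare two periods is to look at which queries appear only in one of them. The sketch below uses two invented dataframes and assumes the query text sits in a column named `query`; check the column names of your own results before reusing it. `compare_queries` is a hypothetical helper.

```python
import pandas as pd

def compare_queries(earlier, later, column="query"):
    """Split queries into those unique to each period and those in both."""
    before, after = set(earlier[column]), set(later[column])
    return {
        "only_earlier": sorted(before - after),
        "only_later": sorted(after - before),
        "both": sorted(before & after),
    }

# Invented placeholder data, not real Google Trends results.
period_1 = pd.DataFrame({"query": ["alpha", "beta", "gamma"], "value": [100, 60, 30]})
period_2 = pd.DataFrame({"query": ["beta", "delta"], "value": [100, 45]})
diff = compare_queries(period_1, period_2)
print(diff)
```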


In [None]:
# enter your code below




<br>

## 5. [Optional] Trending Searches and Top Searches

Besides retrieving trend data by keyword(s) search, we are also able to obtain <i><b>Trending Searches</b></i> and <i><b>Year in Search</b></i> which return the most popular topics searched. These functions can be accessed via the left-hand dropdown menu in the web interface. In the Pytrends library, the following options are available:
* <b>Daily Search Trends</b> - Daily Google trending searches that jumped significantly in popularity among all searches. Results are updated hourly.

* <b>Real-time Search Trends</b> - The latest topics trending across Google search surfaces within the last 24 hours, updated in real time.

* <b>Top searches for specific year/month</b> - Search topics that were trending historically in a specific year (or month).

<img align="center" src="https://docs.google.com/uc?id=1I_esqQ62Z5y_yXLPdZs3q_Gcb1t95jWG" width="550" style="vertical-align:middle;margin:0px 10px"/>

https://trends.google.com/trends/trendingsearches/daily?geo=SG

<br>

### a) Daily Search Trends

Let's try to answer <i>"What are the popular terms people search these days?"</i> by taking a look at daily trending searches.

The `pn` argument specifies the geographical location, i.e. the United Kingdom in the following example.
Note that the argument only accepts fully spelled out country names in <i>lowercase</i>, with words joined by underscores.
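
The naming rule for `pn` can be captured in a small helper. `to_pn` is a hypothetical function for this sketch; it only reformats the text and does not check whether Google actually offers trending searches for that country.

```python
def to_pn(country_name):
    """Lowercase a country name and join its words with underscores."""
    return "_".join(country_name.strip().lower().split())

print(to_pn("United Kingdom"))
print(to_pn("  Australia "))
```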

In [None]:
# daily trending searches in the United Kingdom
country = 'united_kingdom'
df = pytrends.trending_searches(pn = country)
df.head()

In [None]:
# Let's give the trending search topic column a name
df = df.rename(columns={0: f"Daily Trending Searches of {country.replace('_', ' ').title()}"})
df.head(10)

Compare your results with the Daily Search Trends on the Google Trends page for the United Kingdom:
https://trends.google.com/trends/trendingsearches/daily?geo=GB


If you omit the `pn` parameter, Pytrends falls back to its default value. In the versions we are aware of, this default is `'united_states'`, so the call below returns the daily trending searches of the United States rather than a worldwide list.

In [None]:
# Retrieving daily trending searches with the default pn
df = pytrends.trending_searches()
df = df.rename(columns={0: "Daily Trending Searches (default pn)"})

df.head()

### b) Real-time Search Trends

We can use a similar approach to retrieve the real-time search trends in Australia for the past 24 hours.


The `pn` argument specifies the geographical location, and here it takes the [ISO-2 country code](https://en.wikipedia.org/wiki/ISO_3166-2#:~:text=It%20was%20first%20published%20in,form%20than%20their%20full%20names). Note the difference from the daily search trends method in Section 5a), which takes fully spelled out country names.

Let's limit our search count to 10 in this example.

In [None]:
df = pytrends.realtime_trending_searches(pn = 'AU', count = 10)

df.head()

**Note**: If you encounter <font color = 'purple'> *AttributeError: 'TrendReq' object has no attribute 'realtime_trending_searches'* </font>, rename the *pytrends* folder that you have saved in Exercise 2 to '*pytrends_backup*' so that the original Pytrends library is used. Restart runtime and retry.

Compare your results with the Real-time Search Trends on the Google Trends page for Australia:
https://trends.google.com/trends/trendingsearches/realtime?geo=AU&category=all

<br>

### c) Top searches for specific year/month

Instead of real-time data, we may also be interested in the historical trending searches of a specific year for a specific country.
Using the `top_charts` method, let's find the trending searches in Australia in 2021.

The `date` argument should be passed as an integer year in the format YYYY. Be careful not to write an unquoted `2021-12`: Python evaluates it as the subtraction 2021 - 12 = 2009, so the call would silently request the wrong year. In addition, recent versions of the Pytrends README note that Google no longer supports monthly (YYYY-MM) queries, so only yearly requests are expected to work.
The `geo` argument specifies the geographical location and the country code here also uses the [ISO-2 country code](https://en.wikipedia.org/wiki/ISO_3166-2#:~:text=It%20was%20first%20published%20in,form%20than%20their%20full%20names). 

In [None]:
df = pytrends.top_charts(date=2021, hl='en-US', tz=-480, geo='AU')
df

Besides country-specific historical trending searches, you can expand your scope to <b>worldwide</b> for a specific year.

In [None]:
# Retrieving global historical trending searches for year 2021
df = pytrends.top_charts(date = 2021, hl='en-US', tz=-480, geo='GLOBAL')
df