An upgraded Google Trends API and learning algorithm.

Upgrading Google Trends's API: Viral Trends Clustering

MEDIUM: Read more about our results in our Medium post: https://medium.com/harvard-open-data-project/analyzing-college-prestige-and-virality-through-google-trends-218b9ea767e6.

API: Install the "InfiniteTrends" package from PyPI: https://pypi.org/project/InfiniteTrends/ or run pip install InfiniteTrends.

Google Trends provides robust data about the cultural interests of human populations, but it does not offer a public API, and its two private research APIs are limited to 5 and 30 keywords per query, respectively.
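A per-query keyword cap like this is commonly worked around by splitting a long keyword list into batches that all share one anchor keyword, then rescaling each batch against the anchor so the values become comparable. The sketch below illustrates that idea only. It is not taken from this repository's code: `fetch_interest` is a hypothetical stand-in for a real Trends request, and the interest numbers are made up for the demonstration.

```python
def batch_keywords(keywords, anchor, batch_size=5):
    """Split keywords into batches of at most batch_size, each led by the anchor."""
    others = [k for k in keywords if k != anchor]
    step = batch_size - 1  # one slot per batch is reserved for the anchor
    return [[anchor] + others[i:i + step] for i in range(0, len(others), step)]


def merge_batches(batches, fetch_interest):
    """Rescale each batch so its anchor score matches the first batch's anchor score."""
    merged = {}
    reference = None
    for batch in batches:
        scores = fetch_interest(batch)  # hypothetical: returns {keyword: relative interest}
        anchor_score = scores[batch[0]]
        if anchor_score == 0:
            raise ValueError("anchor keyword needs non-zero interest to rescale against")
        if reference is None:
            reference = anchor_score
        scale = reference / anchor_score
        for keyword, value in scores.items():
            merged[keyword] = value * scale
    return merged


# Made-up "true" interest, used only to fake a Trends-like response.
TRUE_INTEREST = {"mit": 40, "harvard": 80, "yale": 20, "stanford": 60,
                 "princeton": 10, "brown": 5, "duke": 8}


def fake_fetch_interest(batch):
    """Mimic relative scoring: the largest keyword in each batch is scaled to 100."""
    peak = max(TRUE_INTEREST[k] for k in batch)
    return {k: TRUE_INTEREST[k] / peak * 100 for k in batch}


batches = batch_keywords(list(TRUE_INTEREST), anchor="mit")
merged = merge_batches(batches, fake_fetch_interest)
print(batches)
print(merged)
```

The anchor should be a keyword with steady, non-zero interest, since every other batch is rescaled through it.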

To remedy this, we first upgraded a Google Trends scraper to accept, in theory, an unlimited number of keywords per query. We then created a function that returns a list of all “viral” topics related to one query over a time period, where a “viral” topic is one whose search traffic volume increased by at least a specified percentage over a smaller, configurable time interval. Finally, we clustered these keywords into groups by applying a k-means learning algorithm to text scraped from Google Search and vectorized with term frequency-inverse document frequency (TF-IDF), which reflects the relative importance of each word.
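The "viral" definition above can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the repository's function: `find_viral_terms`, its parameters, and the sample interest values are all hypothetical, and the window is counted in samples rather than calendar time.

```python
def find_viral_terms(interest_by_term, window, min_pct_increase):
    """Return {term: largest percentage rise} for terms whose interest rose by at
    least min_pct_increase percent between two samples `window` steps apart."""
    viral = {}
    for term, series in interest_by_term.items():
        best = None
        for start in range(len(series) - window):
            before, after = series[start], series[start + window]
            if before <= 0:
                continue  # percentage change is undefined from a zero baseline
            pct = (after - before) / before * 100
            if best is None or pct > best:
                best = pct
        if best is not None and best >= min_pct_increase:
            viral[term] = best
    return viral


# Made-up weekly interest values for two related topics.
weekly_interest = {
    "harvard admissions": [20, 22, 21, 60, 25],
    "harvard football": [30, 31, 29, 30, 32],
}
viral = find_viral_terms(weekly_interest, window=1, min_pct_increase=100)
print(viral)
```

Skipping zero baselines is a design choice in this sketch; a real series with many zero readings may need a small floor value instead.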

To apply our upgraded Google Trends API, we analyzed the time-series interest of colleges and then the clustered viral keywords related to Harvard and MIT specifically. We reported that Harvard’s search volume can justify choosing Harvard simply because it is “Harvard,” and that the clustered viral topics show a divergence in the cultures of the two Cambridge institutions.

Getting Started

Prerequisites

Python 3.7 or higher. Packages:

  • Infrastructure: Anaconda, Jupyter Notebook, os, string
  • Analysis: pandas, NumPy, scikit-learn, NLTK
  • Data Acquisition: pytrends, Beautiful Soup, fake-useragent, urllib, requests
  • Visualization: Plotly, Tableau, Matplotlib

Installing

Run "git clone URL" in a terminal to clone the repository onto your computer. Open the main directory in any IDE.

Understand the Directory

To Run:

  • collegeMain.py : Contains "get_pie_topics," the main function that returns clustered data given the proper inputs.
  • PlotlyGraphing.ipynb : A Jupyter Notebook that plots and saves interest over time data for a list of keywords.

To House Functions and Data:

  • collegeData.py : Holds college names, arrays, and locations.
  • googleSearchScraper.py : A module of functions that scrape Google Search results for the input keywords.
  • googleTrendsUpgrade.py : The upgraded Google Trends API that can search an unlimited number of keywords and find related viral data.
  • kmeansWordsAlgorithm.py : A module of functions that cluster the results of googleSearchScraper.py.
  • /output/ : The destination for raw CSV files.
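To give a feel for the clustering step that kmeansWordsAlgorithm.py is described as performing, here is a dependency-free sketch of TF-IDF weighting followed by k-means. It is an illustration under assumptions, not the module's code: the function names, the farthest-point initialisation, and the sample snippets are hypothetical. With the scikit-learn package listed in the prerequisites, `TfidfVectorizer` and `KMeans` would be the usual way to do the same thing.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    doc_count = len(documents)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF: words found in few documents get a larger weight.
    idf = {w: math.log((1 + doc_count) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocabulary, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda i: squared_distance(v, centroids[i]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up search-result snippets standing in for scraped text.
snippets = [
    "admissions tuition financial aid admissions",
    "robotics lab engineering research",
    "tuition financial aid scholarship",
    "engineering robotics competition",
]
_, vectors = tfidf_vectors(snippets)
labels = kmeans(vectors, k=2)
print(labels)
```

Snippets that share vocabulary end up with the same label, which is the grouping behaviour the pie-chart topics rely on.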

To Visualize:

  • /images/ : This holds graphics and data visualizations.
  • /Visualization-Tableau workbooks/ : The two workbooks used for the pie charts.
  • netflixPractice.py : Some quick visualizations of Netflix's virality results.

The Stack


Example Visualizations


Built With
