COVID19 Tweets

⭕️ Capstone project for Udacity's Data Engineering Nanodegree

Purpose and Scope

Over the past few weeks, COVID-19 has dramatically impacted all realms of public life - from where we go to how we interact with one another - everywhere in the world. In North America, while the growing spread of the novel coronavirus had been widely known since the beginning of February, its extreme repercussions began in the week of March 9th, after the suspension of the NBA season and Trump's ban on European travel on March 12. The crisis came into full force in the following week in Canada and the US, leading to a stock market crash, the closure of borders to all foreign nationals, mass layoffs and severe social distancing measures.

The purpose of this project is to allow analysts to examine public discourse on Twitter regarding COVID-19 over a period of 7 days leading up to, including and immediately following March 12 - the point at which the crisis truly and fully erupted in North America - and to relate it to the growing number of cases and deaths across the world. Some questions that can be tackled with the data gathered in this project are as follows:

  • What was the relationship between the number of tweets and the number of cases in different countries?
  • How did the volume of coronavirus-related tweets, and interactions with those tweets, change after March 12 in different countries?
  • Did countries with a higher number of cases per capita experience a higher number of deaths per capita?
  • What was the relationship between the number of users publishing tweets and the number of deaths and cases across different countries?

Data Sources

  1. IDs of COVID-19 related tweets

  2. Data on the geographic distribution of COVID-19 cases worldwide

Data Model

ERD (entity-relationship diagram of the data model)

ETL Process and Data Pipeline

  1. Source data was downloaded locally.
  2. Tweet IDs were hydrated (i.e. full details of the tweets were obtained from the Twitter API). A minimal sketch of this step follows the list.
    • Twitter's Terms of Service do not allow the full JSON for datasets of tweets to be distributed to third parties. As such, collections of tweets such as the one used in this project are stored as tweet IDs, and it is up to the user of the data to "hydrate" the tweet IDs to get their full information.
  3. Preprocessing was performed on the hydrated tweet data and the COVID-19 data, including de-duplication, null imputation and data type correction. The processed data was uploaded to S3.
  4. A Redshift cluster was instantiated.
  5. Staging, fact and dimension tables were created in Redshift.
  6. Processed data was extracted from the S3 bucket and loaded into the appropriate tables in Redshift.
  7. Data quality checks were performed to ensure that the loaded data was complete, accurate and consistent.
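
As a rough illustration of step 2, the sketch below hydrates batches of tweet IDs with tweepy's statuses_lookup endpoint. This is a hedged sketch only: the file names are placeholders, and the repo's hydrate.py may use a different client or batching strategy.

```python
# Hedged sketch of tweet hydration (ETL step 2). Assumes tweepy 3.x and
# placeholder file names; the repo's hydrate.py may differ in client and structure.
import json
import os

import tweepy

auth = tweepy.OAuthHandler(os.getenv('CONSUMER_KEY'), os.getenv('CONSUMER_SECRET'))
auth.set_access_token(os.getenv('ACCESS_TOKEN'), os.getenv('ACCESS_SECRET'))
api = tweepy.API(auth, wait_on_rate_limit=True)

# Read the bare tweet IDs distributed by the source dataset
with open('tweet_ids.txt') as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

hydrated = []
for i in range(0, len(tweet_ids), 100):      # statuses_lookup accepts up to 100 IDs per call
    batch = tweet_ids[i:i + 100]
    for status in api.statuses_lookup(batch):
        hydrated.append(status._json)        # full tweet JSON returned by the Twitter API

with open('hydrated_tweets.json', 'w') as f:
    json.dump(hydrated, f)
```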

Steps 5-7 are visualised in the following DAG:

DAG (Airflow DAG graph of these tasks)
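
For orientation, here is a minimal sketch of how a DAG covering steps 5-7 could be wired together. It assumes Airflow 1.10-style imports, and the task IDs, connection ID and SQL file references are placeholders rather than the actual contents of dags/dag.py.

```python
# Skeletal sketch of the pipeline DAG (steps 5-7); identifiers are assumptions,
# not the repo's actual dags/dag.py. Assumes Airflow 1.10-style imports.
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {'owner': 'covid19_tweets', 'start_date': datetime(2020, 3, 9)}

with DAG('covid_tweets_dag', default_args=default_args,
         schedule_interval=None, catchup=False) as dag:

    # Step 5: create staging, fact and dimension tables
    create_tables = PostgresOperator(
        task_id='create_tables',
        postgres_conn_id='redshift',
        sql='create_tables.sql')

    # Step 6: copy processed data from S3 into staging tables (the repo uses a
    # custom operator for this), then insert into fact and dim tables
    load_tables = PostgresOperator(
        task_id='load_tables',
        postgres_conn_id='redshift',
        sql='load_tables.sql')

    # Step 7: data quality checks (custom operators in plugins/operators/)
    quality_checks = DummyOperator(task_id='quality_checks')

    create_tables >> load_tables >> quality_checks
```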

Directory

  • hydrate.py: Hydrates each tweet ID that exists in the tweets_ids folder
  • redshift_conn.py: Creates and deletes a Redshift cluster programmatically
  • staging_transform.py: Processes the raw data
  • dags/dag.py: Collection of all the tasks in the project, organized in a way that reflects their relationships and dependencies
  • dags/create_tables.sql: SQL code that creates staging, fact and dim tables
  • dags/load_tables.sql: SQL code that inserts data from staging tables into fact and dim tables
  • plugins/operators/copy_data_from_s3.py: Custom operator that copies data from S3 to Redshift
  • plugins/operators/check_row_count.py: Custom operator that checks the number of rows in fact and dim tables (a sketch of a similar operator follows this list)
  • plugins/operators/check_context.py: Custom operator that runs a given query and checks whether the expected result matches the actual result
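
To show what a data quality check custom operator can look like, here is a hedged sketch of a row-count check in the spirit of check_row_count.py; the class name and argument names are assumptions, not the repo's actual code.

```python
# Hedged sketch of a row-count data quality operator (Airflow 1.10-style imports).
# Names are assumptions; see plugins/operators/check_row_count.py for the real one.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class CheckRowCountOperator(BaseOperator):
    """Fail the task if any of the given tables is empty after loading."""

    @apply_defaults
    def __init__(self, redshift_conn_id='redshift', tables=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = hook.get_records(f'SELECT COUNT(*) FROM {table}')
            if not records or records[0][0] == 0:
                raise ValueError(f'Data quality check failed: {table} is empty')
            self.log.info(f'{table} passed with {records[0][0]} rows')
```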

Quick Start

  1. Download the raw data from the two sources linked above
  2. git clone https://github.com/dunyaoguz/covid19_tweets
  3. cd covid19_tweets
  4. Obtain a Twitter developer account and get API keys. Create a .env file with your consumer key, consumer secret, Twitter access key and Twitter secret key
  5. Run python hydrate.py
  6. Create an IAM user on your AWS account with full S3 and Redshift access. Add your AWS access key and AWS secret access key to your .env file
  7. Run python redshift_conn.py (a rough sketch of the credential loading and cluster creation follows this list)
  8. Instantiate Airflow with the airflow scheduler and airflow webserver commands
  9. Create connections on Airflow to your Redshift data warehouse (conn type=postgres) and AWS credentials (conn type=aws)
  10. Move the dags and plugins folders to your Airflow home directory
  11. Go to localhost:8080 in your browser and turn on covid_tweets_dag
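
To make the credential handling in steps 4-7 concrete, below is a hedged sketch of how the .env values might be read and used to spin up the cluster. The environment variable names, region and cluster settings are assumptions, not the actual contents of redshift_conn.py.

```python
# Hedged sketch of credential loading and cluster creation (Quick Start steps 4-7).
# Variable names, region and cluster settings are assumptions; see redshift_conn.py
# for the actual implementation.
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from the .env file in the project root

redshift = boto3.client(
    'redshift',
    region_name='us-east-1',                        # placeholder region
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'))

redshift.create_cluster(
    ClusterIdentifier='covid19-tweets-cluster',     # placeholder identifier
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='covid19_tweets',
    MasterUsername=os.getenv('DB_USER'),
    MasterUserPassword=os.getenv('DB_PASSWORD'))
```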

Other Scenarios

Let's address what would have to be done differently under the following scenarios.

  • The data was increased by 100x: The hydrate.py and staging_transform.py scripts could no longer be run on a single machine. Distributed computing would need to be utilized to hydrate the tweets, and data processing would need to be done with Apache Spark on Amazon EMR instead of with pandas dataframes (see the sketch after this list).

  • The database needed to be accessed by 100+ people: The Redshift cluster used was the cheapest one available (4 dc2.large nodes with 160GB of fixed local SSD storage). If the database needed to be accessed by 100+ people, the node type of the cluster would likely need to be changed to a more performant one with better CPU, RAM and storage capacity, and the number of nodes in the cluster would likely need to be increased.

  • The pipelines would have to be run daily by 7 AM: The schedule of the DAG would need to be changed to ensure completion before 7 AM.
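
As a sketch of the first scenario, the pandas preprocessing in staging_transform.py could be re-expressed roughly as follows in PySpark on EMR. The column names and S3 paths are placeholders, not the repo's actual schema.

```python
# Hedged PySpark sketch for the 100x scenario: the pandas preprocessing in
# staging_transform.py re-expressed as Spark transformations on Amazon EMR.
# Column names and S3 paths are placeholders, not the repo's actual schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('covid19_tweets_staging').getOrCreate()

tweets = spark.read.json('s3://<your-bucket>/raw/hydrated_tweets/')

tweets = (tweets
          .dropDuplicates(['id'])                               # de-duplication
          .withColumn('retweet_count',
                      F.col('retweet_count').cast('int'))       # data type correction
          .fillna({'country_code': 'unknown'}))                 # null imputation

tweets.write.mode('overwrite').parquet('s3://<your-bucket>/processed/tweets/')
```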
