<a href="https://colab.research.google.com/github/cagri32/Analyzing-the-Extent-of-Polarization-around-COVID-19-Policies-using-Social-Media/blob/main/Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **COVID-19 Dataset - How to use it?**

## ***Background***

Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets acquired from the Twitter Stream related to COVID-19 chatter. 

The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file, and a cleaned version with no retweets in the full_dataset-clean.tsv file.

The main repository for this dataset (and its latest version) can be found here: https://doi.org/10.5281/zenodo.3723939

Intermediate bi-weekly updates are posted here: https://github.com/thepanacealab/covid19_twitter

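As always, the tweets distributed here are only tweet identifiers (with date and time added), because Twitter's terms and conditions restrict the redistribution of Twitter data. They need to be hydrated (that is, the full tweet objects must be retrieved from the Twitter API by identifier) before they can be used.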

If you are using our dataset, please cite our preprint:
https://arxiv.org/abs/2004.03688

## ***Introduction***


In this tutorial, we explain step by step how to use the datasets generated from this repository (https://github.com/thepanacealab/covid19_twitter/tree/master/dailies). It covers how tweets are hydrated, how they are parsed, and an example that counts the unique words in a dataset of tweets.



Requirements
First, we are going to install the following modules:

*  Twarc
*  Tweepy (v. 3.8.0)
*  Argparse (v. 3.2)
*  Xtract (v. 0.1 a3)
*  Wget (v. 3.2)

In [1]:
from IPython.display import clear_output
!pip install twarc #Twarc
!pip install tweepy==3.8.0 # Tweepy 3.8.0 (the authentication cell below passes arguments that exist only in Tweepy 3.x)
!pip install argparse #Argparse 3.2
!pip install xtract #Xtract 0.1 a3
!pip install wget #Wget 3.2
clear_output()

Selecting the dataset and language
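The dataset originally used for this tutorial was downloaded from here: https://github.com/thepanacealab/covid19_twitter/blob/master/dailies/2020-07-13/2020-07-13-dataset.tsv.gz?raw=true (the dataset_URL field in the download cell below currently points to the 2020-12-21 daily file instead).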

More datasets can be obtained from here: https://github.com/thepanacealab/covid19_twitter/tree/master/dailies

The dataset is made up of the following fields:

*  tweet_id: The integer representation of the unique identifier for this Tweet.
*  date: Date when the tweet was posted (YYYY-MM-DD).
*  time: Time when the tweet was posted (HH:mm:ss).
*  lang: Language in which the text is written, usually represented by a 2-character language code. If the language is unknown, the value is 'und' (undefined).
*  country_code: Two-character string representing the country where the tweet was written. If not known, the field shows NULL.

Filtering a dataset by language is done by specifying the language code. More information about language codes can be found here:

https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages
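To make the field layout concrete, here is a minimal sketch that parses a few rows in this five-column format and keeps only one language. The sample rows and the helper names `read_rows` and `filter_by_language` are made up for illustration, and the header names are assumed to match the field names listed above. It is not the filtering code used later in this tutorial.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the same five-column layout as the daily files.
SAMPLE_TSV = (
    "tweet_id\tdate\ttime\tlang\tcountry_code\n"
    "1001\t2020-07-13\t00:00:01\tes\tNULL\n"
    "1002\t2020-07-13\t00:00:02\ten\tUS\n"
    "1003\t2020-07-13\t00:00:03\tes\tMX\n"
)


def read_rows(handle):
    """Return the rows of a tab-separated file as dictionaries keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_language(rows, lang_code):
    """Keep only rows whose lang field equals lang_code; "all" disables filtering."""
    if lang_code == "all":
        return list(rows)
    return [row for row in rows if row["lang"] == lang_code]


rows = read_rows(io.StringIO(SAMPLE_TSV))
spanish = filter_by_language(rows, "es")

print([row["tweet_id"] for row in spanish])   # IDs of the Spanish rows
print(Counter(row["lang"] for row in rows))   # number of rows per language
```

Tweet IDs are kept as strings here, which avoids any loss of precision when such large integers pass through tools that store numbers as floats.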

In this example, we are going to filter the dataset to obtain only tweets in Spanish (that is, we are going to use the language code "es").

IMPORTANT: In this tutorial, after running the following code, please select the desired language from the dropdown field shown in the cell output. To skip filtering, select "all" in the dropdown field.

In [2]:
import gzip
import shutil
import os
import wget
import csv
import linecache
from shutil import copyfile
import ipywidgets as widgets
import numpy as np
import pandas as pd

dataset_URL = "https://github.com/thepanacealab/covid19_twitter/blob/master/dailies/2020-12-21/2020-12-21-dataset.tsv.gz?raw=true" #@param {type:"string"}


#Downloads the dataset (compressed in a GZ format)
#!wget dataset_URL -O dataset.tsv.gz
wget.download(dataset_URL, out='dataset.tsv.gz')

#Unzips the dataset and gets the TSV dataset
with gzip.open('dataset.tsv.gz', 'rb') as f_in:
    with open('dataset.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

#Deletes the compressed GZ file
os.unlink("dataset.tsv.gz")

#Gets all possible languages from the dataset
df = pd.read_csv('dataset.tsv',sep="\t")
lang_list = df.lang.unique()
lang_list= sorted(np.append(lang_list,'all'))
lang_picker = widgets.Dropdown(options=lang_list, value="en")
lang_picker

Dropdown(options=('all', 'am', 'ar', 'bg', 'bn', 'bo', 'ca', 'ckb', 'cs', 'cy', 'da', 'de', 'dv', 'el', 'en', …
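The download cell above decompresses the archive to dataset.tsv before reading it. As an alternative sketch, the standard gzip module can open the compressed file in text mode, so the language codes can be collected without writing an intermediate copy. The file name, the sample rows, and the helper `languages_in` are assumptions made for this example.

```python
import csv
import gzip
import os
import tempfile

# Build a tiny gzip-compressed TSV so the sketch runs without a download.
# The rows are made up for illustration.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example-dataset.tsv.gz")
with gzip.open(gz_path, "wt", newline="") as f_out:
    writer = csv.writer(f_out, delimiter="\t")
    writer.writerow(["tweet_id", "date", "time", "lang", "country_code"])
    writer.writerow(["1001", "2020-12-21", "00:00:01", "en", "NULL"])
    writer.writerow(["1002", "2020-12-21", "00:00:02", "fr", "FR"])


def languages_in(path):
    """Return the sorted language codes in a gzip-compressed TSV, plus "all" as a no-filter option."""
    with gzip.open(path, "rt", newline="") as f_in:
        reader = csv.DictReader(f_in, delimiter="\t")
        codes = {row["lang"] for row in reader}
    return sorted(codes | {"all"})


print(languages_in(gz_path))
```

The returned list could be passed as the `options` of the dropdown in the same way as `lang_list` above.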
Filtering the dataset by language
After selecting the desired language, the following code filters the dataset so that only records in the selected language remain, and writes them to a new TSV file called dataset-filtered.tsv. If "all" is selected, no filtering takes place, but the output file is still named dataset-filtered.tsv.

In [None]:
#Creates a new dataset restricted to the selected language (if any)
filtered_language = lang_picker.value

#"all" means no filtering: copy every record from the dataset unchanged
if filtered_language == "all":
  copyfile('dataset.tsv', 'dataset-filtered.tsv')

#If a language is selected, create another tsv file with the matching records
else:
  filtered_tw = list()
  with open("dataset.tsv") as tsvfile:
    #Keep the header row so the output has the same columns as the input
    filtered_tw.append(tsvfile.readline())

    for line in tsvfile:
      #The fourth column (index 3) holds the language code
      if line.rstrip("\n").split("\t")[3] == filtered_language:
        filtered_tw.append(line)

  print('\033[1mShowing first 5 tweets from the filtered dataset\033[0m')
  print(filtered_tw[1:6])

  with open('dataset-filtered.tsv', 'w') as f_output:
      f_output.writelines(filtered_tw)

Showing first 5 tweets from the filtered dataset
['1351757472873197568\t2021-01-20\t05:04:30\tes\tNULL\n', '1351757472965353477\t2021-01-20\t05:04:30\tes\tNULL\n', '1351757481723240448\t2021-01-20\t05:04:32\tes\tNULL\n', '1351757481727455234\t2021-01-20\t05:04:32\tes\tNULL\n', '1351757488706760705\t2021-01-20\t05:04:34\tes\tNULL\n']
Introducing our Twitter credentials to authenticate
Accessing the Twitter APIs requires a set of credentials that you must pass with each request. These credentials come in different forms depending on the type of authentication required by the specific endpoint you are using. More information: https://developer.twitter.com/en/docs/apps/overview

The credentials can be obtained from the developer portal (https://developer.twitter.com/en/portal/dashboard). The cell below asks for five values: consumer key, consumer secret, access token, access token secret, and bearer token.

IMPORTANT: The following code also generates an api_keys.json file (containing the Twitter credentials entered), which will be required later.

In [4]:
import json
import sys
import tweepy
from tweepy import OAuthHandler

# Authenticate
CONSUMER_KEY = "" #@param {type:"string"}
CONSUMER_SECRET_KEY = "" #@param {type:"string"}
ACCESS_TOKEN_KEY = "" #@param {type:"string"}
ACCESS_TOKEN_SECRET_KEY = "" #@param {type:"string"}
BEARER_TOKEN_KEY = ""#@param {type:"string"}
#Creates a JSON Files with the API credentials
with open('api_keys.json', 'w') as outfile:
    json.dump({
    "consumer_key":CONSUMER_KEY,
    "consumer_secret":CONSUMER_SECRET_KEY,
    "access_token":ACCESS_TOKEN_KEY,
    "access_token_secret": ACCESS_TOKEN_SECRET_KEY,
    "bearer_token": BEARER_TOKEN_KEY
     }, outfile)

#The lines below are just to test if the twitter credentials are correct
#Authenticate (application-only authentication)
auth = tweepy.AppAuthHandler(CONSUMER_KEY, CONSUMER_SECRET_KEY)

#wait_on_rate_limit_notify is accepted by Tweepy 3.x only (it was removed in 4.0)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

if not api:
    print("Can't Authenticate")
    sys.exit(-1)
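Because api_keys.json is reused by later steps, it can help to check that no field was left empty before continuing. The sketch below is one way to do that. The helper `missing_credentials` is hypothetical, and treating all five keys as required is an assumption, since this tutorial does not state which of them get_metadata.py actually reads.

```python
import json
import os
import tempfile

# Key names taken from the api_keys.json written in the cell above.
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token",
                 "access_token_secret", "bearer_token")


def missing_credentials(path):
    """Return the names of required keys that are absent or empty in the JSON file at path."""
    with open(path) as handle:
        keys = json.load(handle)
    return [name for name in REQUIRED_KEYS if not keys.get(name)]


# Demonstration with a placeholder file; the values are dummies, not real credentials.
demo_path = os.path.join(tempfile.mkdtemp(), "api_keys.json")
with open(demo_path, "w") as handle:
    json.dump({"consumer_key": "abc", "consumer_secret": "", "access_token": "def"}, handle)

print(missing_credentials(demo_path))
```

An empty list means every expected field has a value. It does not prove that the values are accepted by Twitter.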

Hydrating the tweets (filtered dataset)
Before parsing the dataset, a hydration process is required. In this tutorial it is done with the following social media mining tool: https://github.com/thepanacealab/SMMT

This requires a Python file from that tool (get_metadata.py). The cell below downloads a copy of it from this project's repository:

In [5]:
from IPython.display import clear_output

!wget https://raw.githubusercontent.com/cagri32/Analyzing-the-Extent-of-Polarization-around-COVID-19-Policies-using-Social-Media/main/sample-data/get_metadata.py -O get_metadata.py

clear_output()

This utility takes an input file in one of the following forms:
*   A csv file which either contains one tweet id per line or contains at least one column of tweet ids
*   A text file which contains one tweet id per line
*   A tsv file which either contains one tweet id per line or contains at least one column of tweet ids

In this case, the filtered dataset generated before (dataset-filtered.tsv), which is in TSV format, will be used for the hydration process.
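The accepted input forms share one idea: somewhere in each line there is a tweet id. The following sketch shows how ids could be pulled out of such files. `read_tweet_ids` is a hypothetical helper written for this example, not the logic inside get_metadata.py, and the sample contents are made up.

```python
import csv
import io


def read_tweet_ids(handle, delimiter="\t", column=0):
    """Collect the numeric tweet IDs found in one column, skipping headers and blank lines."""
    ids = []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) <= column:
            continue  # blank or short line
        value = row[column].strip()
        if value.isdigit():  # skips a header cell such as "tweet_id"
            ids.append(value)
    return ids


# A TSV with a header and several columns, and a plain text file with one ID per line.
tsv_text = "tweet_id\tdate\ttime\tlang\tcountry_code\n1001\t2021-01-20\t05:04:30\tes\tNULL\n"
txt_text = "1001\n1002\n\n1003\n"

print(read_tweet_ids(io.StringIO(tsv_text)))
print(read_tweet_ids(io.StringIO(txt_text)))
```

For a comma-separated file, the same helper would be called with `delimiter=","`.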

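The arguments for this utility (get_metadata.py) are the following:

*  [Image: arguments.png, not included] The invocation below uses -i for the input file, -o for the base name of the output files, and -k for the credentials file.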

**PLEASE NOTE**: The -k argument refers to the json file with the Twitter credentials generated before (api_keys.json)
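Hydration tools usually request tweets in batches, because the Twitter v1.1 statuses/lookup endpoint accepts at most 100 ids per call. The progress lines in the run below ("currently getting 0 - 100", and so on) suggest that get_metadata.py works this way, although that is an inference from its log rather than something confirmed here. A minimal sketch of such batching, with a hypothetical `batches` helper and made-up ids:

```python
def batches(items, size=100):
    """Yield consecutive slices of items containing at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


tweet_ids = [str(n) for n in range(250)]  # made-up IDs for illustration

for index, batch in enumerate(batches(tweet_ids)):
    start = index * 100
    # Each batch would be sent to the API in a single lookup request.
    print(f"batch {index}: ids {start} to {start + len(batch) - 1}")
```

With 250 ids this produces two full batches and one partial batch of 50.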

In [23]:
!python get_metadata.py -i dataset-filtered.tsv -o hydrated_tweets -k api_keys.json


Your twitter api credentials are valid.
hydrated_tweets
tab seperated file, using \t delimiter
total ids: 1491152
metadata collection complete
creating master json file
currently getting 0 - 100
currently getting 100 - 200
currently getting 200 - 300
currently getting 300 - 400
currently getting 400 - 500
creating ziped master json file
creating minimized json master file
creating CSV version of minimized json master file


The command above generates four files:

*  A hydrated_tweets.json file, which contains the full JSON object for each of the hydrated tweets.
*  A hydrated_tweets.CSV file, which contains selected fields extracted from the tweets.
*  A hydrated_tweets.zip file, which contains a zipped version of the hydrated_tweets.json file.
*  A hydrated_tweets_short.json file, which contains a shortened version of the hydrated tweets.

For this tutorial, we will use the hydrated_tweets_short.json file to parse the tweets.
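The sampling code later in this tutorial reads hydrated_tweets_short.json line by line, which suggests that the file holds one JSON object per line. Under that assumption, the tweets could be loaded as sketched here. The helper `iter_tweets` is hypothetical, and the field names in the sample records are invented, so the real file may use different ones.

```python
import io
import json


def iter_tweets(handle):
    """Yield one dictionary per non-empty line, assuming one JSON object per line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up records; the real field names in hydrated_tweets_short.json may differ.
sample = io.StringIO(
    '{"id_str": "1001", "text": "hola"}\n'
    '\n'
    '{"id_str": "1002", "text": "mundo"}\n'
)
tweets = list(iter_tweets(sample))
print(len(tweets), tweets[0]["text"])
```

Reading lazily with a generator keeps memory use low when the hydrated file is large.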

Parsing the tweets
Requirements
For this process, the following files are required and can be obtained from here:

*  https://raw.githubusercontent.com/thepanacealab/SMMT/master/data_preprocessing/parse_json_lite.py
*  https://raw.githubusercontent.com/thepanacealab/SMMT/master/data_preprocessing/fields.py

The following modules are also required:

*  Emot
*  Emoji

In [7]:
from IPython.display import clear_output

!wget https://raw.githubusercontent.com/thepanacealab/SMMT/master/data_preprocessing/parse_json_lite.py -O parse_json_lite.py
!wget https://raw.githubusercontent.com/thepanacealab/SMMT/master/data_preprocessing/fields.py -O fields.py

!pip install emot --upgrade
!pip install emoji --upgrade

clear_output()

Pulling a sample set from the tweets
For this tutorial, we are going to get a sample of N tweets from the hydrated dataset generated before (hydrated_tweets_short.json). The code below will generate a new JSON file (sample_data.json) with the number of samples specified.

PLEASE NOTE: The code below will extract N samples from the hydrated tweets randomly

**We don't need this part for our project**

In [None]:
# import random 

# no_samples = "1000" #@param {type:"string"}
# list_tweets = None

# with open("hydrated_tweets_short.json", "r") as myfile:
#     list_tweets = list(myfile)

# if int(no_samples) > len(list_tweets):
#     no_samples = len(list_tweets)

# sample = random.sample(list_tweets, int(no_samples))

# file = open("sample_data.json", "w")
# for i in sample:
#   file.write(i)
# file.close() #This close() is important

# Parsing the tweets

**parse_json_lite.py**: The first argument is the JSON file. The second argument is optional; if given, it must be "p", and the script then preprocesses the JSON file. Preprocessing removes URLs, Twitter-specific URLs, emojis, and emoticons.

The following code will extract all fields in a JSON file (sample_data.json in this case). Here is a list of all available fields: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet

Keep in mind that some fields could have null or empty values.
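To illustrate what this kind of preprocessing does to a tweet's text, here is a rough sketch based on regular expressions. It is not the implementation in parse_json_lite.py: the patterns are deliberately narrow assumptions (a simple URL pattern, a few emoji code-point ranges, a handful of ASCII emoticons), and the function name `preprocess` is made up for this example.

```python
import re

# Matches http and https links, including shortened t.co links.
URL_PATTERN = re.compile(r"https?://\S+")
# A deliberately narrow set of emoji code-point ranges; real emoji coverage is wider.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# A few common ASCII emoticons such as :) ;-) =D, chosen for illustration only.
EMOTICON_PATTERN = re.compile(r"(?<!\S)[:;=][-^]?[)(DPp](?!\S)")


def preprocess(text):
    """Strip URLs, some emojis, and some emoticons, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMOJI_PATTERN.sub(" ", text)
    text = EMOTICON_PATTERN.sub(" ", text)
    return " ".join(text.split())


print(preprocess("Hola :) mundo \U0001F637 https://t.co/abc123"))
```

Libraries such as emot and emoji, which this tutorial installs, cover far more cases than these hand-written patterns.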

**We don't need this part for our project**

In [None]:
# !python parse_json_lite.py sample_data.json p

# clear_output()

Counting the unique words from a sample set
Given the previous sample dataset (from sample_data.json; the code below reads it from sample_data.tsv), the following code counts all the unique words (with their frequencies) in a pandas DataFrame.

For practical purposes, only the 20 most frequent words in the dataset sample are shown in the output (this number can be modified below).

**We don't need this part for our project**

In [None]:
# import pandas as pd
# from collections import Counter
# import matplotlib.pyplot as plt

# no_top_unique_words = "20" #@param {type:"string"}

# df = pd.read_csv('sample_data.tsv',sep="\t")

# result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
# df2 = pd.DataFrame(result)
# df2.columns =['Word', 'Frequency']
# df2 = df2[df2.Word != ""] #Deletes the empty spaces counted
# df2 = df2.sort_values(['Frequency'], ascending=[False]) #Sort dataframe by frequency (Descending)

# print('\033[1mTop '+no_top_unique_words+' most unique words used from the dataset\033[0m \n')
# print(df2.head(int(no_top_unique_words)).to_string(index=False)) #Prints the top N unique words used
# print("\n")
# df3 = df2.head(int(no_top_unique_words))
# df3.plot(y='Frequency', kind='pie', labels=df3['Word'], figsize=(9, 9), autopct='%1.1f%%', title='Top '+no_top_unique_words+' most unique words used from the dataset')

Top 20 most unique words used from the dataset 

     Word  Frequency
       de        820
       la        391
       el        359
       en        325
      que        282
        y        266
        a        264
      por        167
      los        167
      del        132
       se        131
      las        127
     para        121
      con        107
       no        103
   contra        100
       un         90
       El         90
 Covid-19         87
       es         77


Out[ ]:
[Figure: pie chart titled "Top 20 most unique words used from the dataset"; image not included]
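The frequency table above lists "el" and "El" as separate words, because the counting is case-sensitive. As a variant, the sketch below folds case before counting. It uses only the standard library, the helper name `word_frequencies` is hypothetical, and the sentences are made up for illustration.

```python
from collections import Counter


def word_frequencies(texts, top_n=20):
    """Count whitespace-separated tokens across texts, ignoring case, and return the top_n pairs."""
    counts = Counter()
    for text in texts:
        # split() with no argument also discards empty tokens from repeated spaces.
        counts.update(token.casefold() for token in text.split())
    return counts.most_common(top_n)


# Made-up sentences for illustration.
texts = ["El virus y la vacuna", "el uso de la mascarilla", "La vacuna contra el virus"]
print(word_frequencies(texts, top_n=3))
```

Using `split()` without an argument also removes the need to filter out empty strings afterwards, which the original counting code does in a separate step.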

In [None]:
# To download a file that is created in Colab:

# from google.colab import files
# files.download("hydrated_tweets.zip")

In [34]:
!python main.py

[(11392912, '2911221'), (4849148157, '18329263'), (42358442, '128429068'), (1319307451200581632, '1249982359'), (1294920757437509632, '1294920757437509632'), (5385802, '21059255'), (246001930, '38978018'), (24816708, '247507520'), (1942592300, '1228754267835707399'), (81644444, '817976607398731776'), (1238835715875557379, '1235970884113850368'), (1170102547798847488, '14335586'), (258564911, '848148994102611969'), (147036773, '1626294277'), (250137471, '42971403'), (1106919002306011136, '140047339'), (1159220630329233409, '24611925'), (17515151, '18831926'), (3615421342, '18831926'), (1345996543, '1507338108'), (26078080, '1323730225067339784'), (1211754137995530240, '358545917'), (291947329, '16927893'), (1134653881454080000, '1131465805097447424'), (223542622, '976849451984867328'), (485497413, '956985015673303040'), (401598422, '20583993'), (980848637453447173, '2921790230'), (1246016557714735105, '553581707'), (1311631033507475457, '792096589191852033'), (257019790, '15495464'), (1