


# **COVID-19 Dataset - How to use it?**

## ***Background***

Given the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets related to COVID-19 chatter, acquired from the Twitter Stream.

The data collected from the stream covers all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file, and a cleaned version with no retweets in the full_dataset-clean.tsv file.

The main repository for this dataset (and its latest version) can be found here: https://doi.org/10.5281/zenodo.3723939

Intermediate bi-weekly updates are posted here: https://github.com/thepanacealab/covid19_twitter

As always, the tweets distributed here are only tweet identifiers (with date and time added), because Twitter's terms and conditions restrict the redistribution of Twitter data. The identifiers need to be hydrated, that is, expanded into full tweet objects through the Twitter API, before they can be used.

If you are using our dataset, please cite our preprint:
https://arxiv.org/abs/2004.03688

## ***Introduction***


In this tutorial, we explain step by step how to use the datasets generated from this repository (https://github.com/thepanacealab/covid19_twitter/tree/master/dailies). It covers how tweets are hydrated, how they are parsed, and an example that counts the unique words in a dataset of tweets.



## ***Requirements***

First, we are going to install the following modules:

*   Twarc
*   Tweepy (v. 3.8.0)
*   Argparse (v. 3.2)
*   Xtract (v. 0.1 a3)
*   Wget (v. 3.2)

In [None]:
from IPython.display import clear_output
!pip install twarc #Twarc
!pip install tweepy # Tweepy 3.8.0
!pip install argparse #Argparse 3.2
!pip install xtract #Xtract 0.1 a3
!pip install wget #Wget 3.2
clear_output()

## ***Selecting the dataset and language***

Daily datasets are downloaded from URLs of the following form (the code in this section builds the URL from its date variable): https://github.com/thepanacealab/covid19_twitter/blob/master/dailies/2020-07-13/2020-07-13-dataset.tsv.gz?raw=true

More datasets can be obtained from here: https://github.com/thepanacealab/covid19_twitter/tree/master/dailies
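
Since every daily dataset follows the same URL pattern, the link can be built from a date. The sketch below is only an illustration: `daily_dataset_url` is a hypothetical helper name, and the pattern is taken from the example URL above, so it may not hold if the repository layout changes.

```python
from datetime import date


def daily_dataset_url(day: date) -> str:
    """Build the download URL for one daily dataset (hypothetical helper)."""
    stamp = day.isoformat()  # YYYY-MM-DD, used as folder name and file prefix
    return (
        "https://github.com/thepanacealab/covid19_twitter/blob/master/"
        "dailies/{0}/{0}-dataset.tsv.gz?raw=true".format(stamp)
    )


url = daily_dataset_url(date(2020, 7, 13))
print(url)
```

Using a `date` object instead of a free-form string means a malformed day (for example month 13) fails early, before any download is attempted.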

The structure of the dataset is made up of the following fields:

*   tweet_id: The integer representation of the unique identifier for this Tweet.
*   date: Date when the tweet was posted (YYYY-MM-DD).
*   time: Time when the tweet was posted (HH:mm:ss).
*   lang: Language in which the text is written, represented by a 2-character language code. If the language is unknown, the value is 'und' (undefined).
*   country_code: Two-character string representing the country where the tweet was written. If not known, the field shows NULL.

Filtering a dataset by language is done by specifying the language code. More information about language codes can be found here:

https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages
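
To make the field layout concrete, the following sketch reads a few rows into a small record type and selects one language. It assumes the TSV header uses the field names listed above; `TweetRecord` and `parse_rows` are hypothetical names introduced only for this illustration, and the two data rows are copied from the outputs shown in this tutorial.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TweetRecord:
    tweet_id: int
    date: str
    time: str
    lang: str
    country_code: Optional[str]  # None when the dataset says NULL


def parse_rows(tsv_text: str) -> List[TweetRecord]:
    """Parse tab-separated text with a header line into records."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        country = row["country_code"]
        records.append(TweetRecord(
            tweet_id=int(row["tweet_id"]),
            date=row["date"],
            time=row["time"],
            lang=row["lang"],
            country_code=None if country == "NULL" else country,
        ))
    return records


sample = (
    "tweet_id\tdate\ttime\tlang\tcountry_code\n"
    "1303906699888201729\t2020-09-10\t04:02:37\ten\tNULL\n"
    "1351757472873197568\t2021-01-20\t05:04:30\tes\tNULL\n"
)
records = parse_rows(sample)
spanish = [r for r in records if r.lang == "es"]  # filter by language code
print(spanish)
```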

In this example, we are going to filter the dataset so that we obtain only tweets in Spanish (that is, we are going to use the language code "es").

IMPORTANT: In this tutorial, after running the following code, please select the desired language from the dropdown field shown in the cell output. If we don't want to filter the dataset, just select "all" in the dropdown field.

In [None]:
import gzip
import shutil
import os
import wget
import csv
import linecache
from shutil import copyfile
import ipywidgets as widgets
import numpy as np
import pandas as pd

date = "2020-09-10"
dataset_URL = "https://github.com/thepanacealab/covid19_twitter/blob/master/dailies/{}/{}-dataset.tsv.gz?raw=true".format(date,date) #@param {type:"string"}


#Downloads the dataset (compressed in a GZ format)
#!wget dataset_URL -O dataset.tsv.gz
wget.download(dataset_URL, out='dataset.tsv.gz')

#Unzips the dataset and gets the TSV dataset
with gzip.open('dataset.tsv.gz', 'rb') as f_in:
    with open('dataset-{}.tsv'.format(date), 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

#Deletes the compressed GZ file
os.unlink("dataset.tsv.gz")

#Gets all possible languages from the dataset
df = pd.read_csv('dataset-{}.tsv'.format(date),sep="\t")
lang_list = df.lang.unique()
lang_list= sorted(np.append(lang_list,'all'))
lang_picker = widgets.Dropdown(options=lang_list, value="en")
lang_picker

Dropdown(index=13, options=('all', 'am', 'ar', 'bg', 'bn', 'ca', 'ckb', 'cs', 'cy', 'da', 'de', 'dv', 'el', 'e…


## ***Filtering the dataset by language***

After a language is selected, the following code keeps only the records in that language and writes them to a new TSV file called dataset-filtered-{date}.tsv. If "all" is selected, no filtering takes place and the dataset is simply copied under that same file name.

In [None]:
#Creates a new clean dataset with the specified language (if specified)
filtered_language = lang_picker.value
source_path = 'dataset-{}.tsv'.format(date)
filtered_path = 'dataset-filtered-{}.tsv'.format(date)

#If "all" is selected in the dropdown, it will get all records from the dataset
if filtered_language == "all":
  copyfile(source_path, filtered_path)

#If a language is selected, it will create another tsv file with the filtered records
else:
  filtered_tw = list()
  with open(source_path) as tsvfile:
    #The first line is the header and is always kept
    filtered_tw.append(tsvfile.readline())

    for line in tsvfile:
      #The language code is the fourth tab-separated field
      fields = line.rstrip("\n").split("\t")
      if len(fields) > 3 and fields[3] == filtered_language:
        filtered_tw.append(line)

  print('\033[1mShowing first 5 tweets from the filtered dataset\033[0m')
  print(filtered_tw[1:6])

  with open(filtered_path, 'w') as f_output:
    f_output.writelines(filtered_tw)

Showing first 5 tweets from the filtered dataset
['1303906699888201729\t2020-09-10\t04:02:37\ten\tNULL\n', '1303906700114698240\t2020-09-10\t04:02:37\ten\tNULL\n', '1303906700165087232\t2020-09-10\t04:02:37\ten\tNULL\n', '1303906700169109505\t2020-09-10\t04:02:37\ten\tNULL\n', '1303906700286623749\t2020-09-10\t04:02:37\ten\tNULL\n']


Showing first 5 tweets from the filtered dataset
['1351757472873197568\t2021-01-20\t05:04:30\tes\tNULL\n', '1351757472965353477\t2021-01-20\t05:04:30\tes\tNULL\n', '1351757481723240448\t2021-01-20\t05:04:32\tes\tNULL\n', '1351757481727455234\t2021-01-20\t05:04:32\tes\tNULL\n', '1351757488706760705\t2021-01-20\t05:04:34\tes\tNULL\n']

## ***Introducing our Twitter credentials to authenticate***

Accessing the Twitter APIs requires a set of credentials that you must pass with each request. These credentials can come in different forms depending on the type of authentication that is required by the specific endpoint that you are using. More information: https://developer.twitter.com/en/docs/apps/overview

The credentials can be obtained from the developer portal (https://developer.twitter.com/en/portal/dashboard) and are entered in the fields of the following code cell.

IMPORTANT: The following code also generates an api_keys.json file (containing the Twitter credentials entered), which is required later.

In [None]:
import json
import sys
import tweepy
from tweepy import OAuthHandler

# Authenticate
CONSUMER_KEY = "" #@param {type:"string"}
CONSUMER_SECRET_KEY = "" #@param {type:"string"}
ACCESS_TOKEN_KEY = "" #@param {type:"string"}
ACCESS_TOKEN_SECRET_KEY = "" #@param {type:"string"}
#Paste your own bearer token here; never leave a real token in a shared notebook
BEARER_TOKEN_KEY = "" #@param {type:"string"}

#Creates a JSON file with the API credentials
with open('api_keys.json', 'w') as outfile:
    json.dump({
        "consumer_key": CONSUMER_KEY,
        "consumer_secret": CONSUMER_SECRET_KEY,
        "access_token": ACCESS_TOKEN_KEY,
        "access_token_secret": ACCESS_TOKEN_SECRET_KEY,
        "bearer_token": BEARER_TOKEN_KEY
    }, outfile)

#The lines below are just to test if the twitter credentials are correct
#AppAuthHandler and wait_on_rate_limit_notify are Tweepy 3.x names (this tutorial lists Tweepy 3.8.0)
auth = tweepy.AppAuthHandler(CONSUMER_KEY, CONSUMER_SECRET_KEY)

api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

if (not api):
    print("Can't Authenticate")
    sys.exit(-1)
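
Because the credential fields start out empty, it can help to check api_keys.json before starting a long hydration run. The sketch below is an assumption-laden illustration: `missing_credentials` is a hypothetical helper, and it checks all five keys written by the cell above, although this tutorial does not state which of them get_metadata.py actually requires. The demo writes a throwaway file to a temporary directory so that a real api_keys.json is not touched.

```python
import json
import os
import tempfile

# The five keys written to api_keys.json by the tutorial cell
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
    "bearer_token",
)


def missing_credentials(path):
    """Return the names of keys that are absent or empty in the JSON file."""
    with open(path) as infile:
        keys = json.load(infile)
    return [name for name in REQUIRED_KEYS if not keys.get(name)]


# Demo with placeholder values (not real credentials)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "api_keys.json")
    with open(demo_path, "w") as outfile:
        json.dump({
            "consumer_key": "abc",
            "consumer_secret": "",
            "access_token": "x",
            "access_token_secret": "y",
        }, outfile)
    missing = missing_credentials(demo_path)

print(missing)
```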

## ***Hydrating the tweets (filtered dataset)***

Before parsing the dataset, a hydration process is required. In this tutorial it is done by using the following social media mining tool: https://github.com/thepanacealab/SMMT

To perform this action, a Python file (get_metadata.py) is required. The following cell downloads it, together with main.py, from this project's GitHub repository:

In [None]:
from IPython.display import clear_output

# Get the required files from GitHub
!wget https://raw.githubusercontent.com/cagri32/Analyzing-the-Extent-of-Polarization-around-COVID-19-Policies-using-Social-Media/main/code/get_metadata.py -O get_metadata.py
!wget https://raw.githubusercontent.com/cagri32/Analyzing-the-Extent-of-Polarization-around-COVID-19-Policies-using-Social-Media/main/code/main.py -O main.py

clear_output()

This utility will take a file which meets the following requirements:
*   A csv file which either contains one tweet id per line or contains at least one column of tweet ids
*   A text file which contains one tweet id per line
*   A tsv file which either contains one tweet id per line or contains at least one column of tweet ids

For this case, the filtered dataset generated before (dataset-filtered-{date}.tsv), which is in TSV format, will be used for the hydration process.
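
The only column the hydration step needs from the filtered file is the tweet identifier. As a rough sketch of that idea, the code below pulls the tweet_id column out of TSV text and groups the identifiers into fixed-size batches. `read_tweet_ids` and `batches` are hypothetical helpers, the batch size is arbitrary, and nothing here describes how get_metadata.py works internally.

```python
import csv
import io


def read_tweet_ids(tsv_file, column="tweet_id"):
    """Collect the identifier column from a TSV file object with a header."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row[column] for row in reader if row[column]]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Sample rows copied from the filtered output shown earlier in this tutorial
sample = io.StringIO(
    "tweet_id\tdate\ttime\tlang\tcountry_code\n"
    "1303906699888201729\t2020-09-10\t04:02:37\ten\tNULL\n"
    "1303906700114698240\t2020-09-10\t04:02:37\ten\tNULL\n"
    "1303906700165087232\t2020-09-10\t04:02:37\ten\tNULL\n"
)
ids = read_tweet_ids(sample)
groups = list(batches(ids, 2))
print(groups)
```

The identifiers are kept as strings here so that they are passed on exactly as they appear in the file.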

The arguments for this utility (get_metadata.py) are the following:

*  [Image: arguments.png, listing the arguments accepted by get_metadata.py]

**PLEASE NOTE**: The -k argument refers to the json file with the Twitter credentials generated before (api_keys.json)

In [None]:
dataset_filtered_input = "dataset-filtered-{}.tsv".format(date)

In [None]:
!python get_metadata.py -i {dataset_filtered_input} -o hydrated_tweets -k api_keys.json -ll 1000000 -ul 1500000

In [None]:
!python main.py

Test whether the graph can be loaded back from a file (graph.txt):

In [None]:
import pickle

with open("graph.txt", 'rb') as f:  # notice the r instead of w
    G_loaded = pickle.load(f)
print(G_loaded)

DiGraph with 172877 nodes and 253312 edges
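
The load step above relies on Python's pickle module, which needs the file opened in binary mode both for writing ('wb') and for reading ('rb'). The round trip can be sketched as follows. This is only an illustration: how main.py writes graph.txt is not shown in this tutorial, and a plain adjacency dictionary with made-up node names stands in for the DiGraph so that the example has no extra dependencies.

```python
import os
import pickle
import tempfile

# Stand-in for a directed graph: each node maps to the nodes it points to
graph = {"user_a": ["user_b", "user_c"], "user_b": ["user_c"], "user_c": []}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph.txt")

    # 'wb': pickle writes bytes, so the file must be opened in binary mode
    with open(path, "wb") as f:
        pickle.dump(graph, f)

    # 'rb': the same applies when reading the object back
    with open(path, "rb") as f:
        restored = pickle.load(f)

edge_count = sum(len(targets) for targets in restored.values())
print(len(restored), "nodes and", edge_count, "edges")
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.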


Download files and then delete

In [None]:
#@title Utility to zip and download a directory
#@markdown Use this method to zip and download a directory. For ex. a TB logs 
#@markdown directory or a checkpoint(s) directory.

from google.colab import files
import os

dir_to_zip = '/content' #@param {type: "string"}
output_filename = "hydrated-0-500.zip"
delete_dir_after_download = "No"  #@param ['Yes', 'No']

os.system( "zip -r {} {}".format( output_filename , dir_to_zip ) )

if delete_dir_after_download == "Yes":
    os.system( "rm -r {}".format( dir_to_zip ) )

files.download( output_filename )
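
Outside Colab, or when the zip command is not available, the same archive step can be done with the standard library. The sketch below is an alternative to the cell above rather than part of the original workflow; the directory and file names are placeholders, and it works in a temporary directory so that nothing under /content is touched.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder directory with one small file to archive
    source_dir = os.path.join(tmp, "results")
    os.makedirs(source_dir)
    with open(os.path.join(source_dir, "hydrated_tweets.csv"), "w") as f:
        f.write("id,text\n")

    # make_archive appends ".zip" to the base name and returns the full path
    archive_path = shutil.make_archive(
        os.path.join(tmp, "hydrated"), "zip", root_dir=source_dir
    )

    # List the archive contents to confirm what was stored
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()

print(os.path.basename(archive_path), names)
```

Writing the archive outside the directory being zipped avoids including the archive in itself.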

In [None]:
# files.download( "hydrated-all.zip" )

In [None]:
from google.colab import files

# files.download("hydrated_tweets.zip")
# files.download("hydrated_tweets.csv")
# files.download("hydrated_tweets_short.json")
# files.download("hydrated_tweets")
os.unlink("hydrated_tweets")
os.unlink("hydrated_tweets.zip")
os.unlink("hydrated_tweets.csv")
os.unlink("hydrated_tweets_short.json")
os.unlink("graph.txt")


The hydration step (get_metadata.py) generates four files:

*  A hydrated_tweets.json file, which contains the full JSON object for each of the hydrated tweets.
*  A hydrated_tweets.CSV file, which contains partial fields extracted from the tweets.
*  A hydrated_tweets.zip file, which contains a zipped version of the hydrated_tweets.json file.
*  A hydrated_tweets_short.json file, which contains a shortened version of the hydrated tweets.
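
Once the tweets are hydrated, the unique-word count mentioned in the introduction could look roughly like the sketch below. It rests on two assumptions that this tutorial does not confirm: that the JSON file holds one tweet object per line, and that the tweet text is stored under a field named "text" (it may be "full_text" or something else in practice). `count_words` is a hypothetical helper, and the two sample tweets are invented for the demonstration.

```python
import io
import json
import re
from collections import Counter


def count_words(json_lines, text_field="text"):
    """Count lowercase word occurrences across one-JSON-object-per-line input."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        words = re.findall(r"[\w']+", tweet.get(text_field, "").lower())
        counts.update(words)
    return counts


# Invented sample data standing in for a hydrated tweets file
sample = io.StringIO(
    '{"id": 1, "text": "Stay home, stay safe"}\n'
    '{"id": 2, "text": "Stay informed about COVID-19"}\n'
)
counts = count_words(sample)
print("unique words:", len(counts))
print(counts.most_common(3))
```

With a real file, the `io.StringIO` sample would be replaced by an open file handle, and `text_field` adjusted to whatever field the hydrated objects actually use.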
