# Covid19 - Twitter data extraction
by Victoria, Maha, Gopi

## Table of contents
- Introduction
- Authenticatications
    - Twitter
    - Google sheets
- Gathering data & storing


## Introduction
This notebook is part of the project developed for the FLT Big Data Hackathon, whose objective is to create interesting and trustworthy analyses and visualizations about the COVID19 situation and its correlation with the stock market. 

In this notebook we use the Twitter API to retrieve the tweets related to COVID19 hashtags and economic tags, to perform a sentimental analysis and store it programatically in a google sheets file. 

In [17]:
#Load important libraries
import gspread 
from df2gspread import df2gspread as d2g
from oauth2client.service_account import ServiceAccountCredentials
import json
import tweepy
from textblob import TextBlob
from tweepy import Stream
from tweepy import StreamListener
import pandas as pd
import re
import csv
import nltk
from  geopy.geocoders import Nominatim
from datetime import datetime
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\v.perez\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Authentication
### Twitter

In [10]:
# Load twitter credentials
with open("covid19-sentanalysis-twitter_credentials.json") as datafile:
  data = json.load(datafile)

# Define the keys
consumer_key= data['consumer_key'] #'API_CONSUMER_KEY_HERE'
consumer_secret=  data['consumer_secret']#'CONSUMER_SECRET_HERE'

access_token= data['access_token_key'] #'ACCESS_TOKEN_HERE'
access_token_secret= data['access_token_secret'] #'ACCESS_TOKEN_SECRET_HERE'


#Crate the auth object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# create API, set limits to avouid errors because of a timeout 
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)

try:
    api.verify_credentials()
    print("Authentication OK")
except:
    print("Error during authentication")

#Print 5 tweets for testing purposes - Should be deleted afterwards
home_tweets = api.home_timeline(count=5)
print("printing tweets from timeline \n ")
for tweet in home_tweets:
    print(tweet.text)
    print("")

Authentication OK
printing tweets from timeline 
 
#Coronavirus Alberto Fernández anunciará hoy una nueva prórroga de la cuarentena 
https://t.co/5HD7s3Oeds

¡Disfruta de "Volver", el NUEVO texto de Paula Román,Prosa Poética! ⏬ ⏬ ⏬  https://t.co/A4EqGFiERu

Mirá el valor que queda un iPhone XR

--&gt; https://t.co/clfEwkBVJZ 📱 https://t.co/9WTgWds1Kz

‼ ESTE VIERNES ‼ @alarconcasanova y @odonnellmaria conversarán en un Instagram Live sobre Aramburu, su último libro… https://t.co/G5aanj3jHc

#Seguridad Andaba en una moto con la numeración suprimida en plena cuarentena https://t.co/hYo5kFjN1V



### Google sheets

In [11]:
scope = [
   'https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']

#authenticate gsheets
google_key_file = 'service_key.json'
credentials = ServiceAccountCredentials.from_json_keyfile_name(google_key_file, scope)
gc = gspread.authorize(credentials)

# Define spreadsheet access
spreadsheet_key = '1auoQ9XanosnM7RUInzqeZi9EIgwtCtmtubNpXrfF6OM' 
wks_name = 'sentimentAnalysis'

# Open the file
book = gc.open_by_key(spreadsheet_key) 
worksheet = book.worksheet(wks_name) 

## Gathering data & storing
**GET Twitter Stream and Do Sentiment Analysis in Real time**

In [21]:
trump = 0
warren = 0

header_name = ['id', 'user_id','Text','created_at','timestamp','location','latitude','longitude','Trump','Warren']

class Listener(StreamListener):
    
    def __init__(self):
        super().__init__()
        self.max_tweets = 10
        self.tweet_count = 0
        self.geolocator = Nominatim()
        self.tweet_list = []
    
    def on_data(self, data):
        raw_twitts = json.loads(data)
        try:
            #  Fields we need: id, created_at, text, coordinates, author_id
            full_tweets = raw_twitts.copy()
            # TO DO: we must drop from full_tweets the tweets that consist only on RT, numbers, etc (see regex used below)
            tweets = raw_twitts['text']
            tweets = ' '.join(re.sub("(@[A-Za-z0-9]+) | ({*0-9A-Za-z \t]) |] (\wt:\/\/\St+)", " ", tweets).split())
            tweets = ' '.join(re.sub('RT',' ', tweets).split())  
  
  
            blob = TextBlob(tweets.strip())
            global trump
            global warren
  
            trump_sentiment = 0
            warren_sentiment = 0
  
            for sent in blob.sentences:
                if "Trump" in sent and "Warren" not in sent:
                    trump_sentiment = trump_sentiment + sent.sentiment.polarity
                else:
                    warren_sentiment = warren_sentiment + sent.sentiment.polarity
    
            trump = trump + trump_sentiment
            warren = warren + warren_sentiment
  
            #get timestamp from created_at
            time_created_at = raw_twitts['created_at']
            t = time_created_at.split('+0000 ')
            time = t[0] +t[1]
            format_time = '%a %b %d %H:%M:%S %Y'
            date_time = datetime.strptime(time,format_time)
            ts = int(date_time.timestamp())
            
            #get lat, long from location
            lat = None
            long = None
            if raw_twitts['user']['location']:
                loc = self.geolocator.geocode(raw_twitts['user']['location'])
                if loc:
                    lat = loc.latitude
                    long = loc.longitude
            
            if lat and long:
                info = {'id':raw_twitts['id'],
                            'user_id':raw_twitts['user']['id'], 
                            'Text':raw_twitts['text'],
                            'created_at':raw_twitts['created_at'],
                            'timestamp':ts,
                            'location':raw_twitts['user']['location'],
                            'latitude':lat,
                            'longitude':long,
                            'Trump': trump,
                            'Warren': warren}

                self.tweet_list.append(info)
  
            print (tweets,'\n')    
        except:
            print('ERROR got')
        else:
            self.tweet_count+=1
                # Once it reaches a fix limit the Write the data into gsheets
            if(self.tweet_count==self.max_tweets):          
                # save to a dataframe for eeasier file upload
                  df_tweet_list = pd.DataFrame(self.tweet_list, columns = header_name)
            
                  d2g.upload(df_tweet_list, spreadsheet_key, wks_name, credentials=credentials, row_names=False)
            
                  print("completed")
                  return(False)
            else:
                decoded = json.loads(data)

        def on_error(self, status):
            print(status)

In [19]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [22]:
twitter_stream = Stream(auth, Listener())
twitter_stream.filter(track = ['Trump','Warren'])

  if sys.path[0] == '':


⁦@realDonaldTrump⁩ More truth from the NY Times. You? Well, as least your a consistent liar. “Coronavirus Live U… https://t.co/e2rxKHWdbf 

Trump: Make America great again! USA with Trump: never been so weak Trump 2020: USA will be bigger… https://t.co/qEl3z44M10 

yes he is 

Brilliant! 

@HuffPostPol: Trump's donation to the Department of Health and Human Services shown off by Kayleigh McEnany included a few too many deta… 

@yonan_ann: The only way we win is down ballot 

@kylegriffin1: Michigan Attorney General Dana Nessel: "[Trump] has risked the health, safety and welfare of everyone who lives in this s… 

I got told “u should come to China” on a tweet that criticised Trump on the high infection numbers i… https://t.co/e4jtnUQQC3 

@BarbMcQuade: Michigan’s ⁦@GovWhitmer⁩, ⁦@dananessel⁩, and ⁦@JocelynBenson⁩ are giving a master class on leadership. Standing strong to… 

@Boyd_2650: 🔴🔵LET’S SUPPO OUR AWESOME WINNING .⁦@POTUS⁩ by supporting Tommy Tuberville for the US Senate from the