# Getting the Twitter data

To extract Twitter data for this study, we used the Twitter API (Application Programming Interface) through the Python library tweepy. The API retrieves tweets that match specific search criteria, such as keywords or hashtags. We first set up a Twitter developer account and obtained the necessary API keys and tokens, which we then supplied to our Python script to access the API.

Next, we defined our search criteria and parameters. We divided the search into four broad categories, with keywords chosen from the biotechnology context: i) relevant news ("covid" and "vaccine"); ii) companies relevant to the field ("Johnson&Johnson", "Eli Lilly", "Novo Nordisk", "AbbVie", "Merck", "Pfizer", "Roche", "AstraZeneca", "Novartis", "Moderna"); iii) competitors (for each company, the data collected for the other companies in the list); and iv) CEOs ("Alex Gorsky", "David A. Ricks", "Lars Fruergaard Jørgensen", "Richard A. Gonzalez", "Kenneth C. Frazier", "Albert Bourla", "Severin Schwan", "Pascal Soriot", "Vas Narasimhan", "Stéphane Bancel", "Werner Baumann").

Using tweepy, we then executed the API calls to retrieve the tweets that matched these criteria. The time criterion was set to retrieve tweets in one-hour windows over the last 7 days (2021-01-14 to 2021-01-20), given the restriction Twitter places on how far back tweets can be downloaded. The tweets were returned in JSON format, which we then parsed to extract the relevant information, such as the tweet text, user information, and creation date.
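The keyword searches described above can be sketched as query strings for the search endpoint. This is a minimal illustration, not the study's method: the notebook passes each bare keyword as the query, whereas the helper below (`build_query`, a hypothetical name) quotes multi-word terms so they match as an exact phrase and optionally adds the `-is:retweet` and `lang:` operators of the Twitter API v2 query syntax. Whether such filters are appropriate for the analysis is an assumption.

```python
def build_query(term: str, lang: str = "en", exclude_retweets: bool = True) -> str:
    """Build a Twitter API v2 search query string for one keyword or phrase.

    Hypothetical helper: the filters are illustrative assumptions,
    not part of the original data collection.
    """
    # Quote multi-word terms so they are matched as an exact phrase;
    # unquoted words separated by spaces are combined with a logical AND.
    query = f'"{term}"' if " " in term else term
    if exclude_retweets:
        # Drop retweets to avoid collecting the same text many times
        query += " -is:retweet"
    if lang:
        # Restrict results to tweets classified in one language
        query += f" lang:{lang}"
    return query


print(build_query("Eli Lilly"))
print(build_query("covid", lang="", exclude_retweets=False))
```

Passing `lang=""` and `exclude_retweets=False` returns the bare keyword, which corresponds to how the queries are issued in the cells below.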

In [22]:
#Imports
import sys
import tweepy
import requests
import json
from datetime import datetime, timedelta
from dateutil import parser
from prettytable import PrettyTable
import pandas as pd
import urllib.parse
import csv

In [34]:
def get_tweets_by_query(query: str, days: int = 7, hours_range=(19, 20, 21, 22, 23, 24, 25, 26, 27), number_tweets: int = 10):
    """
    Retrieve tweets matching a query in one-hour windows over the past days.
    
    Parameters:
    query (str): The search query for the tweets (e.g. "COVID").
    days (int, optional): The number of days to go back in time for the search (default is 7).
    hours_range (sequence of int, optional): The hour offsets to go back from the current UTC time within each day (default is 19 to 27).
    number_tweets (int, optional): The number of tweets to retrieve per time window (default is 10).
    
    Returns: 
    A list of (start time, end time, response) tuples, one for each one-hour window within the days specified.
    """
    tweets_list = []
    # For the asked days
    for i in range(days):
        # For each hour offset. In our case the range has to cover the New York stock market opening hours, expressed relative to Swiss time, because the tweets were downloaded from Switzerland.
        for hour in hours_range:
            # Calculate the end time for the search (current time minus specified number of days and hours)
            date_end = (datetime.utcnow() - timedelta(days=i, hours=hour, seconds=10)).replace(microsecond=0, minute=30)
            # Calculate the start time for the search (subtracting 1 hour from the end time)
            date_start = date_end - timedelta(hours=1) - timedelta(seconds=2)
            # Retrieve tweets based on the query and specified start and end times
            tweets = client.search_recent_tweets(
                query=query,
                max_results=number_tweets,
                start_time=date_start.strftime("%Y-%m-%dT%H:%M:%SZ"),
                end_time=date_end.strftime("%Y-%m-%dT%H:%M:%SZ"),
            )
            # Always record the window, so that empty hours can be written as "NA" rows later
            tweets_list.append((date_start.strftime("%Y-%m-%dT%H:%M:%SZ"),date_end.strftime("%Y-%m-%dT%H:%M:%SZ"),tweets))
            # The response object is never None; its data attribute is None when nothing matched
            if not tweets.data:
                print(f"No tweets found for query '{query}' between {date_start} and {date_end}")
    return tweets_list
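
The window arithmetic in `get_tweets_by_query` can be checked without calling the API. The sketch below repeats the same date calculation in a standalone function; `hourly_windows` and `TIME_FORMAT` are hypothetical names, and the fixed reference time is an assumption used only to make the output reproducible (the function above uses `datetime.utcnow()`).

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hourly_windows(now, days, hours_range):
    """Return the (start, end) timestamp strings that would be searched.

    Hypothetical helper mirroring the date arithmetic of get_tweets_by_query.
    """
    windows = []
    for day in range(days):
        for hour in hours_range:
            # End of the window: go back `day` days and `hour` hours (plus 10 s),
            # then pin the minutes to 30 and drop the microseconds
            end = (now - timedelta(days=day, hours=hour, seconds=10)).replace(
                microsecond=0, minute=30
            )
            # Start of the window: one hour and two seconds before the end
            start = end - timedelta(hours=1, seconds=2)
            windows.append((start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)))
    return windows


# Assumed reference time, chosen only for illustration
reference = datetime(2023, 3, 20, 12, 0, 0)
for start, end in hourly_windows(reference, days=1, hours_range=[19, 20]):
    print(start, "->", end)
```

Printing the windows before a long download makes it easy to confirm that the chosen `hours_range` covers the intended trading hours, and that consecutive windows overlap by two seconds.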

In [24]:
def save_tweets_to_txt(tweets_list, file_name):
    """
    This function takes in a list of tweets and saves it to a txt file.
    
    Parameters:
    tweets_list (list): a list of tuples, one per search window, where each tuple consists of:
        - date (str): the start time of the search window in the format YYYY-MM-DDTHH:MM:SSZ
        - dateEnd (str): the end time of the search window in the format YYYY-MM-DDTHH:MM:SSZ
        - tweets (object): the response object whose data attribute holds the list of tweets
        
    file_name (str): the name of the output file
    
    Returns:
    None
    """
    with open(file_name, "w", encoding="utf-8") as file:
        for item in tweets_list:
            date = item[0]
            dateEnd = item[1]
            tweets = item[2]
            
            # Check if tweets is None or there are no data in tweets
            if tweets is None or not tweets.data:
                # Write "NA" to file
                file.write(f"NA\t{date}\t{dateEnd}\tNA\n")
                continue
            
            # Loop through each tweet in the tweets data
            for tweet in tweets.data:
                # Replace line breaks with spaces
                text = tweet.text.replace("\n", " ")
                # Write tweet data to file
                file.write(f"{tweet.id}\t{date}\t{dateEnd}\t{text}\n")
    return None
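
For later analysis, the tab-separated files written by `save_tweets_to_txt` can be read back into Python. The following is a minimal sketch, assuming the four-column layout written above (tweet id, window start, window end, text) and UTF-8 encoding; `read_tweets_txt` is a hypothetical helper name, and skipping the "NA" placeholder rows is a design choice rather than a requirement.

```python
import csv


def read_tweets_txt(file_name):
    """Read a file written by save_tweets_to_txt into a list of dictionaries.

    Hypothetical helper: assumes UTF-8 text and the column order
    id, window start, window end, tweet text.
    """
    rows = []
    with open(file_name, encoding="utf-8", newline="") as file:
        # QUOTE_NONE keeps quotation marks inside tweets as literal characters
        reader = csv.reader(file, delimiter="\t", quoting=csv.QUOTE_NONE)
        for record in reader:
            # Skip malformed lines and the "NA" rows written for empty windows
            if len(record) < 4 or record[0] == "NA":
                continue
            tweet_id, start, end = record[:3]
            # Re-join the text in case the tweet itself contained tab characters
            text = "\t".join(record[3:])
            rows.append({"id": tweet_id, "start": start, "end": end, "text": text})
    return rows
```

A call such as `read_tweets_txt("./data/round2/tweetVaccine_200323.txt")` would then return one dictionary per tweet, ready to be passed to `pd.DataFrame`.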


In [25]:
# Keys and tokens to access the Twitter API for developers.
# They are read from environment variables rather than hard-coded; the variable names are illustrative.
import os

consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]

client = tweepy.Client(consumer_key=consumer_key, consumer_secret=consumer_secret, access_token=access_token, access_token_secret=access_token_secret, bearer_token=bearer_token)


# Searching the tweets

For the following keywords:

News: vaccine, covid

Companies: Johnson & Johnson, Eli Lilly, Novo Nordisk, AbbVie, Merck, Pfizer, Roche, AstraZeneca, Novartis, Moderna

CEOs: Joaquin Duato, David A. Ricks, Lars Fruergaard Jørgensen, Richard A. Gonzalez, Kenneth C. Frazier, Albert Bourla, Severin Schwan, Pascal Soriot, Vas Narasimhan, Stéphane Bancel



In [8]:
# Always run a small test first to check the hours
tweet = get_tweets_by_query(query= "GMO", days= 6, number_tweets=10)
save_tweets_to_txt(tweet, "./data/round2/GMO_16.txt")

In [9]:
# Run the next 3 blocks whenever possible to collect all data for the 2 + 10 + 10 keywords over the last 7 days.
# The blocks are kept separate because Twitter limits how much data can be downloaded from the same IP within a given time.
# Just change the output file name.

# Get the tweets vaccine
tweetVaccine = get_tweets_by_query(query= "vaccine", days= 6, number_tweets=100)
save_tweets_to_txt(tweetVaccine, "./data/round2/tweetVaccine_200323.txt")


# Get the tweets covid
tweetCovid = get_tweets_by_query(query= "covid", days= 6, number_tweets=100)
save_tweets_to_txt(tweetCovid, "./data/round2/tweetCovid_200323.txt")
    


In [27]:
#CEO
#CEOs=["Joaquin Duato", "David A. Rick", "Lars Fruergaard Jørgensen", "Richard A. Gonzalez", "Kenneth C. Frazier", "Albert Bourla", "Severin Schwan","Pascal Soriot", "Vas Narasimhan", "Stéphane Bancel"]
CEOs=[ "Stéphane Bancel"]

# Loop through the CEOs and save the files
for ceo in CEOs:
    tweets = get_tweets_by_query(query=ceo, days= 6, number_tweets=100)
    save_tweets_to_txt(tweets, f"./data/round2/ceos/{ceo}_tweets_200323.txt")

In [37]:
## TWEET Companies
#companies = ["Johnson&Johnson", "Eli Lilly", "Novo Nordisk", "AbbVie", "Merck", "Pfizer", "Roche","AstraZeneca", "Novartis", "Moderna" ]
companies = ["Eli Lilly"]

# Loop through the companies and save the files
for company in companies:
    tweets = get_tweets_by_query(query=company, days= 4, number_tweets=100)
    save_tweets_to_txt(tweets, f"./data/round2/companies/{company}_tweets_140323.txt")
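
The loops above place each keyword directly in the output file name, so names such as "Johnson&Johnson" or "Stéphane Bancel" end up containing spaces, ampersands, or accented characters. If that causes trouble on a given file system, one option is to normalize the keyword first. The sketch below is an assumption, not what the notebook does, and `safe_file_stem` is a hypothetical helper name.

```python
import re
import unicodedata


def safe_file_stem(keyword: str) -> str:
    """Turn a search keyword into an ASCII-only file name stem.

    Hypothetical helper, not used in the original notebook.
    """
    # Decompose accented characters and drop the combining marks.
    # Characters with no ASCII decomposition (for example "ø") are removed.
    ascii_text = (
        unicodedata.normalize("NFKD", keyword)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single underscore
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_text).strip("_")


print(safe_file_stem("Stéphane Bancel"))
print(safe_file_stem("Johnson&Johnson"))
```

The stem could then be used in the f-string path, for example `f"./data/round2/companies/{safe_file_stem(company)}_tweets_140323.txt"`, while the original keyword is still passed unchanged as the query.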
