# PM2 - Data Collection 

My initial plan was to collect Twitter data using the Python library GetOldTweets3, which scrapes old tweets without requiring the official Twitter API. However, as of Thursday, September 17, 2020, Twitter no longer allows individuals to access tweets via GetOldTweets3 (for more information about this issue, see https://github.com/Mottl/GetOldTweets3/issues/98).

For completeness, I have included some of the code that I wrote to collect older Twitter data with GetOldTweets3.

# Part 1 - Collecting data via GetOldTweets3

In [1]:
import GetOldTweets3 as got
import pandas as pd  # needed for the dataframe built in the query cell below

My collection strategy was to collect tweets via GetOldTweets3 in batches of one month from May 2019 to May 2020 to explore the changes in anti-vaccine sentiment on Twitter. The following is an example of the kind of query I used; the dates shown run from 2020-05-01 to 2020-05-07. Because GetOldTweets3 no longer works, I was unable to collect and save any data from my old queries.

In [None]:
text_query = '#vaccinesarepoison'
since_date = '2020-05-01'
until_date = '2020-05-07'
count = 1000  # only used by the capped variant of the query (commented out below)

# Create the query object. To cap the number of results, chain .setMaxTweets(count):
# tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query).setSince(since_date).setUntil(until_date).setMaxTweets(count)
tweetCriteria = (got.manager.TweetCriteria()
                 .setQuerySearch(text_query)
                 .setSince(since_date)
                 .setUntil(until_date))

# Fetch all tweets that match the query
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# Keep only the chosen fields of each tweet
text_tweets = [[tweet.id, tweet.date, tweet.text, tweet.username, tweet.favorites] for tweet in tweets]

# Build a dataframe from the list, with one named column per field
tweets_df = pd.DataFrame(text_tweets, columns=['id', 'date', 'text', 'username', 'favorites'])
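
The monthly batching described above (one query per month from May 2019 to May 2020) could be driven by a small generator of date windows. The following is only a sketch: `month_windows` is a hypothetical helper, not part of GetOldTweets3, and it assumes that each window ends on the first day of the following month. Whether the library treats the upper bound as inclusive or exclusive would need to be checked.

```python
from datetime import date


def month_windows(start, end):
    """Yield (since, until) ISO date strings, one pair per calendar month.

    `start` and `end` fall in the first and last month to cover.
    Each `until` is the first day of the month after `since`.
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Roll over to January of the next year after December
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield date(year, month, 1).isoformat(), date(next_year, next_month, 1).isoformat()
        year, month = next_year, next_month


windows = list(month_windows(date(2019, 5, 1), date(2020, 5, 1)))
print(len(windows))  # 13 windows, May 2019 through May 2020
print(windows[0])    # ('2019-05-01', '2019-06-01')
```

Each pair could then be passed to `setSince` and `setUntil` in a query like the one above.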

From my initial search I realized that particular accounts were posting on the topic more than others. To get a better sense of what these accounts were posting in general, I decided to look into them individually (see screenshots of code in the data exploration notebook).
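
As an illustration of how the most active accounts could be identified, the sketch below counts tweets per username. The rows are made-up placeholders that follow the same `[id, date, text, username, favorites]` layout as the list built above; this is not the code from the data exploration notebook.

```python
from collections import Counter

# Hypothetical rows in the [id, date, text, username, favorites] layout
rows = [
    [1, '2020-05-01', 'text a', 'account_one', 3],
    [2, '2020-05-01', 'text b', 'account_two', 0],
    [3, '2020-05-02', 'text c', 'account_one', 1],
    [4, '2020-05-03', 'text d', 'account_three', 5],
    [5, '2020-05-03', 'text e', 'account_one', 2],
    [6, '2020-05-04', 'text f', 'account_two', 0],
]

USERNAME_COL = 3  # position of the username in each row

# Count how many tweets each account contributed
posts_per_account = Counter(row[USERNAME_COL] for row in rows)

# The two most active accounts, most active first
top_accounts = posts_per_account.most_common(2)
print(top_accounts)  # [('account_one', 3), ('account_two', 2)]
```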

# Part 2 - Collecting data via Tweepy

After GetOldTweets3 stopped working, I decided to shift my focus to Tweepy, another Python library, which extracts tweets directly through Twitter's API. We begin by instantiating an API object using the consumer key, consumer secret, access token, and access token secret provided via the Twitter developer account.

## 2.1 Authentication

In [2]:
import os
import tweepy as tw
import pandas as pd

In [3]:
consumer_key = '' # actual values omitted for security reasons
consumer_secret = ''
access_token = ''
access_token_secret = ''

In [4]:
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

In [5]:
api

<tweepy.api.API at 0x7fbda6722ad0>
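
Since the actual credential values are omitted above, one common way to keep them out of a notebook altogether is to read them from environment variables. The following is a minimal sketch under that assumption; the variable names and the `load_credentials` helper are hypothetical and not part of Tweepy.

```python
import os

# Hypothetical environment variable names; any naming scheme would do
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any are missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}


# Demonstration with a stand-in mapping instead of the real environment
example_env = {var: 'placeholder' for var in CREDENTIAL_VARS.values()}
creds = load_credentials(example_env)
print(sorted(creds))
```

In the notebook, calling `load_credentials()` without arguments would read the real environment, and the returned values could be passed to `tw.OAuthHandler` and `auth.set_access_token` as above.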

## 2.2 Perform search and store data

After authentication, we can perform our searches. Here I began by searching for the terms "vaccines are poison".

In [16]:
search_term = "vaccines+are+poison -filter:retweets" # search term without retweets (the operator is separated from the terms by a space)

tweets = tw.Cursor(api.search,
                   q=search_term,
                   lang="en",
                   since='2019-09-01').items(1000)

all_tweets = [tweet.text for tweet in tweets]

In [13]:
all_tweets[5:8] # show a small subset of the data 

["@Frank38328140 @marklevinshow There's a reason why Bill Gate's is banned from entering certain countries. His vacci… https://t.co/sYXwN55RSs",
 'Vaccines are poison. &amp;  bio weapons. \nNever take them. https://t.co/NonXieJmKu',
 'Just step away from the vaccines and viruses Gates. Viruses in humans &amp; computers are not the same thing so quit tr… https://t.co/eZ4jYG1cXW']
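
The sample above still contains HTML entities such as `&amp;`. If plain text is wanted before analysis, Python's standard `html` module can decode them. This is a small sketch on a made-up string, not a step that was part of the original collection.

```python
import html

# Hypothetical tweet text containing an HTML entity
raw_text = 'humans &amp; computers'

# Decode HTML entities into the characters they stand for
clean_text = html.unescape(raw_text)
print(clean_text)  # humans & computers
```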

In [15]:
# save the data in a text file for data exploration
fName = "tweets_1.txt"
with open(fName, 'w') as f:  # the with-block closes the file automatically
    for tweet in all_tweets:
        f.write(tweet + '\n')
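
Because tweet text can itself contain line breaks (as in the second sample tweet above), a file with one tweet per line cannot always be split back into the original tweets. One possible alternative, sketched here with made-up data and a hypothetical file name, is to write each tweet as a JSON string on its own line so that embedded line breaks are escaped.

```python
import json
import os
import tempfile

# Hypothetical sample data; the second item contains a line break
sample_tweets = ['first tweet', 'second tweet\nwith a line break']

# Hypothetical output file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'tweets_1.jsonl')

# Write one JSON-encoded tweet per line
with open(path, 'w', encoding='utf-8') as f:
    for tweet in sample_tweets:
        f.write(json.dumps(tweet) + '\n')

# Read the file back and decode each line
with open(path, encoding='utf-8') as f:
    restored = [json.loads(line) for line in f]

print(restored == sample_tweets)  # True
```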

# Part 3 - More sophisticated search

After the basic search performed above, I decided that it would be good not only to analyze sentiment but also to gather broader data that would allow me to do more with it, similar to my original plan of using GetOldTweets3 to analyze anti-vaccination tweets over the course of one year. While searching for a more sophisticated method to extract tweets, I came across some code by the user Leow Griffin on GitHub (https://github.com/leowgriffin/tweets_analysis_hkprotests_2019/blob/master/scraping_tweets.py). I have adapted some of his code to create a CSV that contains relevant tweets for my project.

In [18]:
import json
import pandas as pd
import csv
import re
from textblob import TextBlob
import string
import time

In [45]:
def getTweets(search_words, date_since, numTweets, numRuns):
    """
    Collect tweets that match the search query (search_words) and were posted since date_since.
    Each run retrieves at most numTweets tweets, and the API call is repeated numRuns times.
    The collected data is saved in a CSV file.
    """
    
    # Define a pandas dataframe to store the data:
    db_tweets = pd.DataFrame(columns = ['username', 'acctdesc', 'location', 'following',
                                        'followers', 'totaltweets', 'usercreatedts', 'tweetcreatedts',
                                        'retweetcount', 'text', 'hashtags']
                                )
    
    # Define a for-loop to generate tweets at regular intervals
    for i in range(0, numRuns):
        
        # Collect tweets 
        tweets = tw.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(numTweets)

        # Store these tweets into a list
        tweet_list = [tweet for tweet in tweets]

        # Obtain the following info (methods to call them out):
            # user.screen_name - twitter handle
            # user.description - description of account
            # user.location - location given in the user's profile
            # user.friends_count - no. of other users that user is following (following)
            # user.followers_count - no. of other users who are following this user (followers)
            # user.statuses_count - total tweets by user
            # user.created_at - when the user account was created
            # created_at - when the tweet was created
            # retweet_count - no. of retweets
            # (deprecated) user.favourites_count - probably total no. of tweets favourited by the user
            # retweeted_status.full_text - full text of the original tweet (only present on retweets)
            # tweet.entities['hashtags'] - hashtags in the tweet

        # Begin scraping the tweets individually:
        noTweets = 0

        for tweet in tweet_list:

            # Pull the values
            username = tweet.user.screen_name
            acctdesc = tweet.user.description
            location = tweet.user.location
            following = tweet.user.friends_count
            followers = tweet.user.followers_count
            totaltweets = tweet.user.statuses_count
            usercreatedts = tweet.user.created_at
            tweetcreatedts = tweet.created_at
            retweetcount = tweet.retweet_count
            hashtags = tweet.entities['hashtags']

            try:
                text = tweet.retweeted_status.full_text
            except AttributeError:  # Not a Retweet
                text = tweet.full_text

            # Add the 11 variables to the empty list - ith_tweet:
            ith_tweet = [username, acctdesc, location, following, followers, totaltweets,
                         usercreatedts, tweetcreatedts, retweetcount, text, hashtags]

            # Append to dataframe - db_tweets
            db_tweets.loc[len(db_tweets)] = ith_tweet

            # increase counter - noTweets  
            noTweets += 1

        
        print('no. of tweets scraped for run {} is {}'.format(i, noTweets))
        
        time.sleep(300) #5 minute sleep time


    # Define working path and filename
    path = os.getcwd()
    filename = os.path.join(path, 'data', search_words + '_twitter_data.csv')

    # Store dataframe in CSV (the 'data' directory must already exist)
    db_tweets.to_csv(filename, index = False)
    
    print('Scraping has completed!') # notify when done
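
The function above writes into a `data` folder under the working directory, and `to_csv` does not create missing folders. A minimal sketch of a helper that creates the folder first is shown below; `csv_path_for` is a hypothetical name, and the temporary base directory is used only so the example can run anywhere.

```python
import os
import tempfile


def csv_path_for(search_words, base_dir):
    """Return the CSV path for a query, creating the data folder if needed."""
    data_dir = os.path.join(base_dir, 'data')
    os.makedirs(data_dir, exist_ok=True)  # no error if the folder already exists
    return os.path.join(data_dir, search_words + '_twitter_data.csv')


base = tempfile.mkdtemp()  # stand-in for os.getcwd()
path = csv_path_for('vaccines+are+poison', base)
print(os.path.basename(path))  # vaccines+are+poison_twitter_data.csv
```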

In [None]:
query1 = 'vaccines+are+poison' # testing query to see if our getTweets function works

In [28]:
getTweets(query1, '2020-09-03', 1000, 2)

no. of tweets scraped for run 0 is 1000
no. of tweets scraped for run 1 is 1000
Scraping has completed!


In [39]:
queries = ['vaccines+are+poison', 'vaccines+kill', 'vaccines+cause+aids', 'vaccines+cause+autism']

In [46]:
for query in queries:
    getTweets(query, '2020-09-03', 3000, 2)

no. of tweets scraped for run 0 is 2597
no. of tweets scraped for run 1 is 2597
Scraping has completed!
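
Both runs above report the same number of tweets, and every run repeats the same query, so the saved CSV may contain the same tweet more than once. I have not verified this, so the following is only a sketch of how repeated rows could be dropped, using made-up rows and a hypothetical helper `drop_repeated_rows` that compares a few chosen columns.

```python
def drop_repeated_rows(rows, key_columns):
    """Keep the first occurrence of each row, compared on key_columns only."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows using three of the column names from getTweets
rows = [
    {'username': 'account_one', 'tweetcreatedts': '2020-09-04 10:00:00', 'text': 'a'},
    {'username': 'account_two', 'tweetcreatedts': '2020-09-04 11:00:00', 'text': 'b'},
    # Same tweet collected again in a second run
    {'username': 'account_one', 'tweetcreatedts': '2020-09-04 10:00:00', 'text': 'a'},
]

unique_rows = drop_repeated_rows(rows, ('username', 'tweetcreatedts', 'text'))
print(len(rows), len(unique_rows))  # 3 2
```

With pandas, `DataFrame.drop_duplicates` with a `subset` of the same columns should achieve the same thing; a subset is needed because the `hashtags` column holds lists, which cannot be hashed.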
