# Scraping Tweets from Twitter

Goal: Scrape a specific user's tweets from this year.

## Why not Tweepy?

We're using Dmitry Mottl's GetOldTweets3 instead of Tweepy, a widely used Python library for accessing the Twitter API.

Unfortunately, among other limitations, the standard search API that Tweepy wraps only returns tweets from roughly the past week. We want a dataset covering a much longer timeframe, and we don't need Tweepy's extensive functionality; we are only investigating the text of the tweets.

Using GetOldTweets3, we can obtain the following:
* id (str)
* permalink (str)
* username (str)
* to (str)
* text (str)
* date (datetime) in UTC
* retweets (int)
* favorites (int)
* mentions (str)
* hashtags (str)

There is an open issue with accessing the geo data from a tweet obtained via GetOldTweets3.

First we need to import the library.

In [1]:
import GetOldTweets3 as got

Now let's use the library to get tweets. We first set the search criteria, then scrape the tweets.

Note that GetOldTweets3 doesn't store tweets in a dataframe, so we'll need to extract the information in which we're interested after scraping the tweets.

In [2]:
username = 'realDonaldTrump'
since = '2020-01-01'
# Easter (April 12) is when he said he wanted to reopen things by.
# The results below end on April 11, so `until` appears to be exclusive.
until = '2020-04-12'

# Set criteria for query
tweetCriteria = got.manager.TweetCriteria().setUsername(username).setSince(since).setUntil(until)

# Scrape all tweets matching the criteria (returns a list of Tweet objects)
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
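The scraped results further down end on April 11, which suggests the `until` date is treated as exclusive. Under that assumption, a small helper can turn the last day you want included into the value to pass to `setUntil`. The function `exclusive_until` is hypothetical and not part of GetOldTweets3.

```python
from datetime import date, timedelta


def exclusive_until(last_day_included: date) -> str:
    """Return the day after last_day_included as 'YYYY-MM-DD'.

    Assumes the scraper's `until` bound is exclusive; this helper is
    hypothetical and not part of GetOldTweets3.
    """
    return (last_day_included + timedelta(days=1)).isoformat()


# Easter 2020 fell on April 12, so including it means passing the next day.
until_with_easter = exclusive_until(date(2020, 4, 12))
print(until_with_easter)
```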

You can look at the GetOldTweets3 format by inspecting one of the items in the list `tweets` we generated.

In [39]:
tweets[0]

<GetOldTweets3.models.Tweet.Tweet at 0x1044e4950>
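The default representation only shows the class name and a memory address. One way to see what an object actually holds is `vars()`, which returns its attributes as a dictionary. The sketch below uses a stand-in object built with `types.SimpleNamespace` so it runs without scraping anything; the field values are invented for illustration, and it is an assumption that real `Tweet` objects keep their fields as plain instance attributes.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Stand-in for one scraped tweet; all values are invented for illustration.
sample_tweet = SimpleNamespace(
    username="example_user",
    text="Hello, world!",
    date=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
    retweets=3,
    favorites=7,
)

# vars() maps attribute names to values, which is easier to read than the repr.
for name, value in vars(sample_tweet).items():
    print(f"{name}: {value!r}")
```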

Extracting the data is relatively straightforward. We'll stick it in a list.

In [9]:
# Create a list of chosen tweet data
trump_tweets = []

# Extract the information we care about.
# trump_tweets ends up as a list of [date, text] lists, i.e. a matrix.
for tweet in tweets:
    trump_tweets.append([tweet.date, tweet.text])
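If more fields are needed later, the same loop generalizes. A minimal sketch, using invented stand-in objects rather than real scraped tweets: `operator.attrgetter` pulls a chosen set of attributes from each tweet, so adding a column only means editing the `fields` tuple (a name introduced here for illustration).

```python
from datetime import datetime, timezone
from operator import attrgetter
from types import SimpleNamespace

# Invented stand-ins for scraped tweets, newest first.
fake_tweets = [
    SimpleNamespace(date=datetime(2020, 1, 2, tzinfo=timezone.utc), text="second", retweets=5),
    SimpleNamespace(date=datetime(2020, 1, 1, tzinfo=timezone.utc), text="first", retweets=2),
]

# Attributes to keep, in column order.
fields = ("date", "text", "retweets")
get_fields = attrgetter(*fields)  # returns a tuple when given several names

# One inner list per tweet, matching the list-of-lists shape used above.
rows = [list(get_fields(tweet)) for tweet in fake_tweets]
print(rows[0][1:])
```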

We're accustomed to manipulating data in dataframes, and pandas has a function, which I remember off the top of my head, for writing a dataframe to a csv. Let's turn our list into a dataframe.

First we need to load pandas.

In [3]:
import pandas as pd

Turning a matrix (list of lists) into a dataframe with pandas is relatively simple. Since we know what the columns are, we'll define the headers.

In [4]:
df = pd.DataFrame(trump_tweets, columns=['DateTime', 'Text'])

Look at all our hard work paying off!

In [5]:
df

| | DateTime | Text |
|---|---|---|
| 0 | 2020-04-11 23:51:58+00:00 | Will be interviewed by @JudgeJeanine tonight a... |
| 1 | 2020-04-11 23:34:58+00:00 | So now the Fake News @nytimes is tracing the C... |
| 2 | 2020-04-11 23:28:18+00:00 | Governor @GavinNewsom of California has been v... |
| 3 | 2020-04-11 22:38:29+00:00 | I will be watching. HAVE A GREAT EASTER! |
| 4 | 2020-04-11 22:35:35+00:00 | The Wall Street Journal Editorial Board doesn’... |
| ... | ... | ... |
| 1304 | 2020-01-01 01:30:35+00:00 | HAPPY NEW YEAR! |
| 1305 | 2020-01-01 01:22:28+00:00 | Our fantastic First Lady! |
| 1306 | 2020-01-01 01:16:27+00:00 | Thank you Steve. The greatest Witch Hunt in U.... |
| 1307 | 2020-01-01 01:03:15+00:00 | Thank you to the @dcexaminer Washington Examin... |


Now we can save the dataframe to a csv file so that we can access it later.

In [6]:
df.to_csv('trumpTweets.csv')
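By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again, and the timestamps come back as plain strings. A sketch of a cleaner round trip, using a tiny invented dataframe and an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Tiny invented dataframe with the same two columns as above.
sample_df = pd.DataFrame(
    {
        "DateTime": pd.to_datetime(
            ["2020-01-01 01:30:35+00:00", "2020-01-01 01:22:28+00:00"]
        ),
        "Text": ["first example", "second example"],
    }
)

buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)  # index=False skips the row-number column
buffer.seek(0)

# parse_dates asks pandas to turn the DateTime column back into timestamps.
restored = pd.read_csv(buffer, parse_dates=["DateTime"])
print(restored.columns.tolist())
print(restored.dtypes["DateTime"])
```

With a real file, the buffer would simply be replaced by a path such as `'trumpTweets.csv'` in both calls.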

In case we'd later like to extract more information from the tweets we scraped, I'm going to save `tweets` to a file. I'll save it as both a .csv and a .txt file.

In [16]:
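# Assumes Tweet keeps its fields as plain instance attributes; passing the objects directly would only store their repr
pd.DataFrame([vars(tweet) for tweet in tweets]).to_csv('trumpTweetsRaw.csv')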

In [26]:
with open('trumpTweetsRaw.txt', 'w') as file:
    for tweet in tweets:
        # Note: this writes str(tweet), the default "<... object at 0x...>" text
        file.write('%s\n' % tweet)

Now we have the string representation of each raw GetOldTweets3 object stored in a .txt file, one per line. To read those lines back in, we'll use the following code.

In [40]:
tweetsCopy = []

with open('trumpTweetsRaw.txt','r') as file:
    for line in file:
        tweetsCopy.append(repr(line[:-1]))
        
# This gives ['a'], a list of strings, and I want [a], where a is the got Tweet object
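Writing `'%s\n' % tweet` stores only each object's text representation, so the objects themselves cannot be rebuilt from the .txt file. If the goal is to get the original objects back, Python's `pickle` module is one option. A minimal sketch with invented stand-in objects and an in-memory buffer; it assumes real `Tweet` objects can be pickled, which has not been verified here, and the file name in the comment is hypothetical.

```python
import io
import pickle
from types import SimpleNamespace

# Invented stand-ins for scraped tweets.
original = [
    SimpleNamespace(text="first", retweets=2),
    SimpleNamespace(text="second", retweets=5),
]

# With a real file this would be: open('trumpTweetsRaw.pkl', 'wb')
buffer = io.BytesIO()
pickle.dump(original, buffer)

# Rewind and load; the result is a list of objects, not a list of strings.
buffer.seek(0)
restored_tweets = pickle.load(buffer)
print(restored_tweets[0].text)
```

Only unpickle files you created yourself, since loading a pickle can run arbitrary code.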

In [41]:
tweetsCopy

["'<GetOldTweets3.models.Tweet.Tweet object at 0x1044e4950>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x10491f250>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x10491ff50>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x1049113d0>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x104971610>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x104911890>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x10491fa90>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x10490bd90>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x1049719d0>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x104911190>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x10498fbd0>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x104971c50>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x1044e4c90>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x10491f450>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x104911ed0>'",
 "'<GetOldTweets3.models.Tweet.Tweet object at 0x104911