# 01. Twitter Scraping
---

The purpose of this notebook is to begin the project's data collection by gathering posts from the social media app Twitter.  Twitter's standard search API only returns tweets from the past 7 days, which would limit our data collection.  Therefore, we chose an alternative route: GetOldTweets3.  GetOldTweets3 builds on GetOldTweets-python, developed by Jefferson Henrique ([GitHub](https://github.com/Jefferson-Henrique/GetOldTweets-python)), as a work-around to the Twitter limit.  It works by mimicking a Twitter search on the desktop site and scraping the requested tweets.

Using GetOldTweets3, we are able to collect the following for our process:
- tweets from New Jersey 511 accounts and its partners
- the aforementioned tweets filtered by the below keywords:
    - 'clos' (in order to account for words such as "close", "closure")
    - '' (an empty string, in order to pull a wide range of tweets to analyze)
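
To illustrate why a truncated keyword such as 'clos' is useful, here is a minimal sketch of substring matching. The sample texts and the `matches_keyword` helper are hypothetical and are not part of GetOldTweets3; the real matching is done by Twitter's search and may behave differently.

```python
# Hypothetical sample texts, not taken from the scraped data.
sample_tweets = [
    "Road closed on Route 1 northbound",
    "Lane closure on I-80 westbound",
    "Delays on Garden State Parkway",
]


def matches_keyword(text, keyword):
    """Return True if keyword occurs anywhere in text, ignoring case."""
    return keyword.lower() in text.lower()


# 'clos' is a prefix of both "closed" and "closure", so one keyword covers both.
matched = [t for t in sample_tweets if matches_keyword(t, "clos")]
print(matched)
```

Under this simple test, an empty keyword matches every text, which is the idea behind the wider pull.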
 
After a sufficient number of tweets is collected, we will export the results to the csv file [scraped_tweets](../datasets/scraped_tweets.csv).  The csv format will allow us to pull the data into a separate notebook in order to clean the data and prepare it for modeling.
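
As a rough sketch of that hand-off, the snippet below writes a small stand-in dataframe to a temporary csv file and reads it back the way a later notebook might. The rows are made up, the temporary path is used only so the example is self-contained, and `index_col=0` and `parse_dates` are one reasonable choice for reading the file, not necessarily what the cleaning notebook does.

```python
import os
import tempfile

import pandas as pd

# Stand-in rows using the same column names as the scraped data (values are made up).
scraped = pd.DataFrame({
    "username": ["example_a", "example_b"],
    "tweet": ["Road closed on Route 1", "Delays on I-80"],
    "date_posted": pd.to_datetime(
        ["2019-11-06 23:59:56+00:00", "2019-11-06 23:58:57+00:00"]
    ),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scraped_tweets.csv")
    # By default to_csv also writes the index as an unnamed first column.
    scraped.to_csv(path)
    # index_col=0 restores that column as the index instead of "Unnamed: 0".
    loaded = pd.read_csv(path, index_col=0, parse_dates=["date_posted"])

print(loaded.head())
```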

---
## Table of Contents
---


 - [Import Resources](#Import-Resources)
 - [Scrape Twitter](#Scrape-Twitter)
 - [Organize and Save](#Organize-and-Save)

---
### Import Resources
---

Here we import the libraries needed for Twitter scraping.  We also read in the csv file of Twitter accounts for 511NJ.

In [1]:
import pandas as pd
import GetOldTweets3 as got

Read in the csv file

In [2]:
df = pd.read_csv("../datasets/njtwitteraccounts.csv")
df.head()

| | source | twitter_handle |
|---|---|---|
| 0 | 511NJ | 511njgsp |
| 1 | 511NJ | 511njace |
| 2 | 511NJ | 511njtpk |
| 3 | 511NJ | 511nj55 |
| 4 | 511NJ | 511nj42 |


---
### Scrape Twitter
---
Search for tweets posted from the above list of Twitter accounts.

- for our first pull (*training data*), we used the following parameters:
    - keywords = ['clos']
    - max_tweets = 10000
    - .setSince('2019-10-24')
    - .setUntil('2019-11-07')
    
    
- for our second pull (*testing data*), we used the following parameters:
    - keywords = ['']
    - max_tweets = 50000
    - .setSince('2019-10-24')
    - .setUntil('2019-11-07')

***WARNING:*** *this process takes about 1 hour to run, given the number of tweets we are scraping*
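
One way to keep the two parameter sets side by side is a small configuration object. This is only a sketch: `PullConfig` and `days_covered` are hypothetical names introduced for illustration and do not belong to GetOldTweets3, and the snippet does not run a scrape.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PullConfig:
    """Hypothetical container for the parameters of one scrape."""
    label: str
    keyword: str
    max_tweets: int
    since: str = "2019-10-24"
    until: str = "2019-11-07"

    def days_covered(self) -> int:
        """Number of days between the since and until dates."""
        return (date.fromisoformat(self.until) - date.fromisoformat(self.since)).days


training_pull = PullConfig(label="training", keyword="clos", max_tweets=10000)
testing_pull = PullConfig(label="testing", keyword="", max_tweets=50000)

for pull in (training_pull, testing_pull):
    print(pull.label, repr(pull.keyword), pull.max_tweets, pull.days_covered())
```

Both pulls share the same date window, so only the keyword and the tweet cap differ between them.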

In [3]:
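# Twitter handles to scrape, taken from the accounts csv
users = df['twitter_handle'].tolist()
# '' pulls a wide range of tweets (second pull); use ['clos'] for the first pull
keywords = ['']
max_tweets = 50000

# Parallel lists that will become the dataframe columns
username = []
text = []
date = []

for keyword in keywords:
    # Build the search: these accounts, this date window, this keyword
    tweetCriteria = got.manager.TweetCriteria().setUsername(users)\
                                               .setSince('2019-10-24')\
                                               .setUntil('2019-11-07')\
                                               .setQuerySearch(keyword)\
                                               .setMaxTweets(max_tweets)

    # Run the scrape (this is the slow step)
    tweets_collected = got.manager.TweetManager.getTweets(tweetCriteria)
    print(f"Total tweets scraped: {len(tweets_collected)}")

    # Keep only the fields needed downstream
    for tweet in tweets_collected:
        username.append(tweet.username)
        text.append(tweet.text)
        date.append(tweet.date)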
        

Total tweets scraped: 8015


---
### Organize and Save
---
We will organize the results of the scrape into a usable dataframe and save it as a csv file for use by other notebooks.

In [4]:
# Combining lists into dataframe
scraped_tweets = pd.DataFrame({'username': username,
                                'tweet': text,
                                'date_posted': date})
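
As an alternative sketch, the same dataframe shape can be built from a list of per-tweet records rather than three parallel lists. The records below are made up, and converting `date_posted` with `utc=True` is an assumption about how the dates should be handled rather than something this notebook does.

```python
import pandas as pd

# Made-up records standing in for scraped tweets.
records = [
    {"username": "example_a", "tweet": "Road closed on Route 1",
     "date_posted": "2019-11-06 23:59:56+00:00"},
    {"username": "example_b", "tweet": "Delays on I-80",
     "date_posted": "2019-11-06 23:58:57+00:00"},
]

# Each dict keeps one tweet's fields together, so columns cannot drift out of alignment.
frame = pd.DataFrame.from_records(records)

# Parse the timestamp strings into timezone-aware datetimes.
frame["date_posted"] = pd.to_datetime(frame["date_posted"], utc=True)

print(frame.shape)
```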

In [5]:
scraped_tweets.head()

| | username | tweet | date_posted |
|---|---|---|---|
| 0 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:59:56+00:00 |
| 1 | 511njbt | Delays on George Washington Bridge westbound f... | 2019-11-06 23:58:57+00:00 |
| 2 | 511njtpk | Crash on New Jersey Turnpike - Eastern Spur so... | 2019-11-06 23:58:56+00:00 |
| 3 | 511nji295 | Crash on I-295 southbound South of Exit 29 - U... | 2019-11-06 23:56:56+00:00 |
| 4 | 511njace | Construction, bridge painting on Atlantic City... | 2019-11-06 23:52:57+00:00 |


In [6]:
# Export dataframe to csv file
scraped_tweets.to_csv('../datasets/scraped_tweets.csv')

---