# Twitterscraper
> Author: [Dawn Graham](https://dawngraham.github.io/)

Use twitterscraper to get historical tweets.

Documentation:
- https://pypi.org/project/twitterscraper/0.2.7/
- https://github.com/taspinar/twitterscraper

The Tweepy API (http://www.tweepy.org/) is used to append each user's location.

Versions used:
- Python 3.6.6
- pandas 0.23.4
- tweepy 3.7.0
- twitterscraper 0.9.3

### Import libraries

In [1]:
# For Tweepy API
import pickle
import os
import time
from tweepy import OAuthHandler
from tweepy import API

# For Twitterscraper
from twitterscraper import query_tweets

# For dataframes
import datetime as dt
import pandas as pd

### Authenticate Tweepy API

The first time you run the notebook, enter all credentials so that they are saved to the pkl file. After that, you can remove the secret keys from the notebook, because they will be loaded from the pkl file.

The pkl file contains sensitive information that can be used to take control of your Twitter account. Do not share it.

In [2]:
# Enter Twitter API info the first time running this notebook, then delete.
# Credentials will be saved into and loaded from separate pkl file.
if not os.path.exists('secret_twitter_credentials.pkl'):
    Twitter={}
    Twitter['Consumer Key'] = ''
    Twitter['Consumer Secret'] = ''
    Twitter['Access Token'] = ''
    Twitter['Access Token Secret'] = ''
    with open('secret_twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    # Load saved credentials and close the file when done
    with open('secret_twitter_credentials.pkl','rb') as f:
        Twitter = pickle.load(f)

auth = OAuthHandler(Twitter['Consumer Key'], Twitter['Consumer Secret'])
auth.set_access_token(Twitter['Access Token'], Twitter['Access Token Secret'])

api = API(auth)

# If the authentication was successful, you should
# see the name of the account print out
print(api.me().name)

Dawn Graham
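
The save-once, load-later pattern used in the cell above can also be wrapped in a small helper. This is a minimal sketch and not part of the original notebook: `load_or_create_credentials`, `REQUIRED_KEYS`, and the check that refuses to save blank values are all assumptions added for illustration.

```python
import os
import pickle

# Keys the notebook's credentials dictionary is expected to contain
REQUIRED_KEYS = ('Consumer Key', 'Consumer Secret',
                 'Access Token', 'Access Token Secret')

def load_or_create_credentials(path, new_credentials=None):
    """Load credentials from `path`, or save `new_credentials` there first.

    Hypothetical helper for illustration only.
    """
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    # First run: refuse to write a file with missing or blank values
    new_credentials = new_credentials or {}
    missing = [k for k in REQUIRED_KEYS if not new_credentials.get(k)]
    if missing:
        raise ValueError(f'Missing credentials: {missing}')

    with open(path, 'wb') as f:
        pickle.dump(new_credentials, f)
    return new_credentials
```

With a check like this, running the cell before filling in the keys raises an error instead of silently saving empty strings that every later run would load.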


### Set up dictionary and function to collect tweets

In [3]:
def get_query(query, begin, end):
    
    # Set up dictionary to collect tweets
    tweets_dict = {'timestamp':[],
                   'id':[],
                   'text':[],
                   'user':[],
                   'user_location':[],
                   'likes':[],
                   'replies':[],
                   'retweets':[],
                   'query':[]
                  }
    
    for tweet in query_tweets(query, begindate=begin, enddate=end, limit=10):
        # Append info to tweets_dict
        tweets_dict['timestamp'].append(tweet.timestamp)
        tweets_dict['id'].append(tweet.id)
        tweets_dict['text'].append(tweet.text)
        tweets_dict['user'].append(tweet.user)
        tweets_dict['user_location'].append(api.get_user(tweet.user).location) # Get with Tweepy API
        tweets_dict['likes'].append(tweet.likes)
        tweets_dict['replies'].append(tweet.replies)
        tweets_dict['retweets'].append(tweet.retweets)
        tweets_dict['query'].append(query)
        
    tweets = pd.DataFrame(tweets_dict)
    tweets.set_index('timestamp', inplace=True)
    return tweets
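
`get_query` calls `api.get_user` once per tweet, even when the same user appears several times. A minimal sketch of caching those lookups follows. `make_cached_lookup` is a hypothetical helper, and `fake_lookup` stands in for something like `lambda name: api.get_user(name).location` so that the example runs without credentials.

```python
def make_cached_lookup(lookup):
    """Wrap `lookup` so each username is only fetched once."""
    cache = {}

    def cached(username):
        if username not in cache:
            cache[username] = lookup(username)
        return cache[username]

    return cached


# Stand-in for the Tweepy call; records how often it is actually invoked
calls = []

def fake_lookup(username):
    calls.append(username)
    return f'location of {username}'

get_location = make_cached_lookup(fake_lookup)
locations = [get_location(u) for u in ['alice', 'bob', 'alice']]
print(calls)  # ['alice', 'bob']
```

Whether this helps depends on how often users repeat within a result set. It reduces the number of requests but does not remove the API's own limits.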

### Create and run desired query
Compile query here: https://twitter.com/search-advanced  

Notes:
- The end date is NOT included in the search.
- The scraper seems to return at most 20 tweets per day, or 400 per search.
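
Because the end date is excluded, collecting a single day means passing the following day as the end date. A minimal sketch, where `single_day_window` is a hypothetical helper name:

```python
import datetime as dt

def single_day_window(day):
    """Return (begin, end) dates covering exactly `day`, with end exclusive."""
    return day, day + dt.timedelta(days=1)

begin, end = single_day_window(dt.date(2010, 12, 31))
print(begin, end)  # 2010-12-31 2011-01-01
```

Using `timedelta` handles month and year boundaries without any special cases.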

In [4]:
# Enter desired values for query
query = 'power outage'

# Begin and end dates - (yyyy, m, d)
begin = dt.date(2010, 1, 1)
end = dt.date(2010, 1, 2)

In [5]:
# Run query
tweets = get_query(query, begin, end)

INFO: queries: ['power outage since:2010-01-01 until:2010-01-02']
INFO: Querying power outage since:2010-01-01 until:2010-01-02
INFO: Got 0 tweets for power%20outage%20since%3A2010-01-01%20until%3A2010-01-02.
INFO: Got 0 tweets (0 new).


In [6]:
tweets.head()

| timestamp | id | text | user | user_location | likes | replies | retweets | query |
|---|---|---|---|---|---|---|---|---|


### Append results to csv

In [7]:
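# Append results to scrapedtweets.csv
# Write the header row only when the file does not exist yet
csv_path = './data/scrapedtweets.csv'
write_header = not os.path.exists(csv_path)
with open(csv_path, 'a') as f:
    tweets.to_csv(f, header=write_header)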

### Get URL of specific tweet

In [8]:
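# Create function to print URL
def get_url(timestamp):
    # Select rows whose index matches the timestamp, then use the first one
    rows = tweets.loc[tweets.index == pd.Timestamp(timestamp)]
    user = rows['user'].iloc[0]
    tweet_id = rows['id'].iloc[0]
    print(f'https://twitter.com/{user}/status/{tweet_id}')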

In [9]:
## Request URL by timestamp
# get_url('2010-01-01 23:48:44')

---
### Scraper with for loop

Notes:
- This function does not fetch 'user location' with Tweepy because of Twitter limitations. If that information is needed later, a separate function could add it.
- Query by location is only available for September 2010 or later.

In [10]:
# Set up dictionary to collect tweets
tweets_dict = {'timestamp':[],
               'id':[],
               'text':[],
               'user':[],
               'likes':[],
               'replies':[],
               'retweets':[],
               'query':[]
              }

In [11]:
def query_by_month(query, y, m, limit=None):
    
    # Get number of days in the month
    # Check for leap years
    if (y%4==0 and y%100!=0 or y%400==0) and (m == 2):
        total_d = 29
    elif m == 2:
        total_d = 28
    elif m in (4, 6, 9, 11):
        total_d = 30
    else:
        total_d = 31
    
    # Set first start & end day
    d = 1
    end_d = d + 1
    
    # Run for number of days in month
    for day in range(total_d):
        
        # Set search begin date
        begin = dt.date(y, m, d)
        
        # Set search end date
        # Enables setting to 1st day of next month to get results from last day of search month
        if (end_d > total_d) and (m == 12):
            end = dt.date(y+1, 1, 1)
        elif end_d > total_d:
            end = dt.date(y, m+1, 1)
        else:
            end = dt.date(y, m, end_d)
        
        # Run twitterscraper query
        for tweet in query_tweets(query, begindate=begin, enddate=end, limit=limit):
            # Append info to tweets_dict
            tweets_dict['timestamp'].append(tweet.timestamp)
            tweets_dict['id'].append(tweet.id)
            tweets_dict['text'].append(tweet.text)
            tweets_dict['user'].append(tweet.user)
            tweets_dict['likes'].append(tweet.likes)
            tweets_dict['replies'].append(tweet.replies)
            tweets_dict['retweets'].append(tweet.retweets)
            tweets_dict['query'].append(query)
            
        # Pause
        time.sleep(1)
        
        # Increase begin and end search date by 1
        d += 1
        end_d += 1
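
The per-day begin and end dates can also be produced with the standard library instead of manual month-length and leap-year logic. This is a minimal sketch of that alternative, not the notebook's method; `day_windows` is a hypothetical helper name.

```python
import calendar
import datetime as dt

def day_windows(year, month):
    """Yield (begin, end) date pairs, one per day of the month, end exclusive."""
    # monthrange returns (weekday of the first day, number of days in month)
    _, days_in_month = calendar.monthrange(year, month)
    for day in range(1, days_in_month + 1):
        begin = dt.date(year, month, day)
        yield begin, begin + dt.timedelta(days=1)

windows = list(day_windows(2012, 2))
print(len(windows))   # 29
print(windows[-1])    # (datetime.date(2012, 2, 29), datetime.date(2012, 3, 1))
```

Each pair could then be passed as `begindate` and `enddate` to `query_tweets`, with the last pair rolling over into the next month or year automatically.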

In [12]:
# Enter desired values for query
query = 'power outage'
year = 2013
month = 7
limit = 10

# Run query (results are collected in tweets_dict; the function returns nothing)
query_by_month(query, year, month, limit=limit)

INFO: queries: ['power outage since:2013-07-01 until:2013-07-02']
INFO: Querying power outage since:2013-07-01 until:2013-07-02
INFO: Got 0 tweets for power%20outage%20since%3A2013-07-01%20until%3A2013-07-02.
INFO: Got 0 tweets (0 new).
INFO: queries: ['power outage since:2013-07-02 until:2013-07-03']
INFO: Querying power outage since:2013-07-02 until:2013-07-03
INFO: Got 0 tweets for power%20outage%20since%3A2013-07-02%20until%3A2013-07-03.
INFO: Got 0 tweets (0 new).
INFO: queries: ['power outage since:2013-07-03 until:2013-07-04']
INFO: Querying power outage since:2013-07-03 until:2013-07-04
INFO: Got 0 tweets for power%20outage%20since%3A2013-07-03%20until%3A2013-07-04.
INFO: Got 0 tweets (0 new).
INFO: queries: ['power outage since:2013-07-04 until:2013-07-05']
INFO: Querying power outage since:2013-07-04 until:2013-07-05
INFO: Got 0 tweets for power%20outage%20since%3A2013-07-04%20until%3A2013-07-05.
INFO: Got 0 tweets (0 new).
INFO: queries: ['power outage since:2013-07-05 until

In [13]:
tweets = pd.DataFrame(tweets_dict)
tweets.set_index('timestamp', inplace=True)
print(tweets.shape)
print(tweets['id'].nunique())

(0, 7)
0
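
Comparing the row count with the number of unique ids reveals duplicates, which can appear when the same query is run more than once. A minimal sketch of removing them, using made-up sample rows rather than scraped data:

```python
import pandas as pd

# Made-up rows for illustration; the first two share the same id
sample = pd.DataFrame({
    'timestamp': pd.to_datetime(['2013-07-01 10:00:00',
                                 '2013-07-01 10:00:00',
                                 '2013-07-02 08:30:00']),
    'id': ['1', '1', '2'],
    'text': ['power outage downtown', 'power outage downtown', 'lights back on'],
}).set_index('timestamp')

# Keep the first occurrence of each tweet id
deduped = sample.drop_duplicates(subset='id', keep='first')
print(deduped.shape)  # (2, 2)
```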


In [14]:
# # Get number of observations containing word
# tweets['text'].str.contains('power', case=False, regex=True).value_counts()