# Planning

### Things that need to be done    

- [x] Set up skeleton for a nice object oriented approach. 
- [x] Figure out best way to get tweets.
- [x] Data format
- [x] Build class and funcs
- [ ] Figure out how to best count all requests in a session and make sure it's functional
- [ ] Make oneDay() fail-safe (it's a bad idea to get the data and combine it within a single for loop, because if something goes wrong in one iteration, all the data from previous iterations will be lost).
- [ ] Save metadata for the payloads (i.e. self.most_recent, self.oldest, self.timeCovered, rs.total_results + other?)
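
One possible shape for the fail-safe item above, as a minimal sketch rather than the final design: keep each hour's dataframe as soon as it arrives, record the hours that failed, and only concatenate at the end. `collect_hours` and `fake_fetch` are hypothetical names used for illustration; in the real class the fetch step would be the `build_query` / `get_data` / `get_df` sequence.

```python
import pandas as pd


def collect_hours(fetch_hour, hours):
    """Fetch one DataFrame per hour and keep what succeeded if a call fails."""
    frames = {}   # hour -> DataFrame, filled as we go
    failed = []   # (hour, error) for every hour whose request raised
    for hour in hours:
        try:
            frames[hour] = fetch_hour(hour)
        except Exception as err:  # broad on purpose: one bad hour must not lose the rest
            failed.append((hour, repr(err)))
    if frames:
        combined = pd.concat(list(frames.values()), ignore_index=True)
    else:
        combined = pd.DataFrame()
    return combined, failed


# Hypothetical stand-in for a real request: hour 2 fails, the others return one row.
def fake_fetch(hour):
    if hour == 2:
        raise RuntimeError("simulated network error")
    return pd.DataFrame({"hour": [hour], "text": ["tweet at {}:00".format(hour)]})


day_df, failed_hours = collect_hours(fake_fetch, range(4))
print(len(day_df), [h for h, _ in failed_hours])
```

The list of failed hours can then be retried on its own, without repeating the requests that already succeeded.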

### Rate Limits

- 10,000,000 Tweets per month (resets on the 19th of each month). 
- 300 requests/15 minute window, with 500 Tweets/request:
    - 150,000 tweets/15min 
    - 600,000 tweets/hour
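
The figures above follow directly from the two limits. A small check of the arithmetic (the constant names are just for illustration):

```python
REQUESTS_PER_WINDOW = 300      # requests allowed per window
TWEETS_PER_REQUEST = 500       # maximum Tweets returned by one request
WINDOW_MINUTES = 15
MONTHLY_CAP = 10_000_000       # Tweets per month

tweets_per_window = REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST
tweets_per_hour = tweets_per_window * (60 // WINDOW_MINUTES)

# At full speed, how many hours until the monthly cap is used up?
hours_to_hit_cap = MONTHLY_CAP / tweets_per_hour

print(tweets_per_window, tweets_per_hour, round(hours_to_hit_cap, 1))
```

This assumes every request returns the full 500 Tweets, so it is an upper bound on throughput rather than an expected rate.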

### How many tweets to get?
- Period covered is: Oct 23rd - July 30th (-ish)
    - ~ 280 days
    - ~ 6720 hours
    - If we get 1000 tweets per hour: *6,720,000*. 
    - That's very little in terms of space, but might take quite a while for it to go through sentiment analysis.
    - It would take ~11.2 hours to get the whole data (due to rate limits)
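
The same estimate can be reached by counting requests instead of tweets. This is a rough sketch that assumes one query per hour of data, 500 tweets per request, and that every hour actually has 1000 matching tweets:

```python
import math

days = 280
tweets_per_hour_wanted = 1000
tweets_per_request = 500
requests_per_window = 300
window_minutes = 15

hours = days * 24
total_tweets = hours * tweets_per_hour_wanted

# Each hourly query needs ceil(1000 / 500) = 2 requests.
requests_needed = hours * math.ceil(tweets_per_hour_wanted / tweets_per_request)
windows_needed = requests_needed / requests_per_window
collection_hours = windows_needed * window_minutes / 60

print(hours, total_tweets, requests_needed, round(collection_hours, 1))
```

With these inputs the total stays below the 10,000,000 monthly cap, so the whole period would fit into a single month's quota.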

### Best way to get tweets
- Period covered is: Oct 23rd - July 30th (-ish)
    - $n_h$ tweets per hour of each day
    - $n_d$ tweets per day (where $n_d \approx 24 \, n_h$)
    

- While it makes the wrangling a bit more difficult, I think it might be interesting to look at hour/day. 
- We can still combine all tweets for a given day. 
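
Collecting by hour does not prevent a per-day view later. A minimal sketch with made-up rows, assuming the `created_at` strings have the format Twitter returns:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "created_at": ["2020-10-23T00:02:40.000Z", "2020-10-23T00:45:10.000Z",
                   "2020-10-23T01:05:01.000Z", "2020-10-24T13:00:00.000Z"],
    "text": ["a", "b", "c", "d"],
})

# Parse the ISO strings, then derive the grouping keys.
df["created_at"] = pd.to_datetime(df["created_at"])
df["day"] = df["created_at"].dt.date
df["hour"] = df["created_at"].dt.hour

per_hour = df.groupby(["day", "hour"]).size()   # tweets per hour of each day
per_day = df.groupby("day").size()              # tweets per day

print(per_hour)
print(per_day)
```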

### Data format

- A single results call: **JSON to pd**.
    - This is relatively straightforward with one minor complication, i.e. entries such as this:
    <blockquote>{'newest_id': '1402310241992183808',
  'oldest_id': '1402310139630211083',
  'result_count': 100,
  'next_token': 'b26v89c19zqg8o3fpdg7rbcqdq8stpgmibslekg3kxail'}
    </blockquote>
    - This is used by the wrapper to get the next lot of tweets if max_tweets > results_per_call, but will also always be the last entry in a result.
    
    
- Multiple results calls: **pds in dict-of-dict**. 
    - I am thinking the best way to store all the data would be a dict-of-dict, but will see how it works  
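
Both ideas can be sketched on a made-up payload: drop the metadata entry, flatten the rest with `pd.json_normalize`, and file the resulting dataframe under day and hour. The payload contents and the `store` layout are assumptions for illustration, not real API output.

```python
import pandas as pd

# Made-up payload in the shape described above: tweet dicts followed by a metadata dict.
payload = [
    {"id": "1", "created_at": "2020-10-23T00:10:00.000Z", "text": "first",
     "public_metrics": {"retweet_count": 0, "like_count": 2}},
    {"id": "2", "created_at": "2020-10-23T00:05:00.000Z", "text": "second",
     "public_metrics": {"retweet_count": 1, "like_count": 0}},
    {"newest_id": "1", "oldest_id": "2", "result_count": 2},
]

# Separate tweets from the metadata entry (identified by its 'newest_id' key).
tweets = [entry for entry in payload if "newest_id" not in entry]
meta = [entry for entry in payload if "newest_id" in entry]

# Nested public_metrics become public_metrics.* columns.
df = pd.json_normalize(tweets)

# dict-of-dict: day -> hour -> DataFrame
store = {}
store.setdefault("2020-10-23", {})["00:00"] = df

print(list(df.columns))
print(meta[0]["result_count"], list(store["2020-10-23"].keys()))
```

Keeping `meta` around rather than discarding it would also cover part of the "save metadata" item in the to-do list.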


# Dev

To start with, get 10 tweets/hour from the start period (23.10.2020) to the most recent available HPS data. As of now (08.06.2021) this is 24.05.2021 for the Axios-Ipsos survey. This means: 

In [1]:
from datetime import date, datetime
import time
from os import path
from searchtweets import ResultStream, gen_request_parameters, load_credentials, collect_results, convert_utc_time
import pandas as pd
import numpy as np

In [2]:
def countTweets(startDate, endDate, tweets_per_hour):
    '''
    Print the number of days, hours and tweets between two dates.

    Specify dates in DD.MM.YYYY format (leading zeros for days and months are optional).
    '''
    
    s_d, s_m, s_y = [int(i) for i in startDate.split('.')]
    e_d, e_m, e_y = [int(i) for i in endDate.split('.')]

    endDate = date(e_y, e_m, e_d)
    startDate = date(s_y, s_m, s_d)
    days = (endDate - startDate).days
    hours = days * 24
    print("From {} to {} we have {} days, {} hours, and {} tweets (with {} tweets per hour)".format(
        startDate, endDate, days, hours, hours * tweets_per_hour, tweets_per_hour))
    
    
countTweets('23.10.2020', '24.05.2021', 10)     

From 2020-10-23 to 2021-05-24 we have 213 days, 5112 hours, and 51120 tweets (with 10 tweets per hour)


Which is quite manageable.

However, it might be better to get a more complete set for a shorter period of time. So let's do 23.10.2020 - 01.11.2020 at 1000 tweets/hour. This would mean 216,000 tweets, and the data can be reused going forward, so that future data collection can simply start on 01.11.2020.

In [138]:
countTweets('23.10.2020', '1.11.2020', 1000) 

From 2020-10-23 to 2020-11-01 we have 9 days, 216 hours, and 216000 tweets (with 1000 tweets per hour)


In [137]:
class twitterData():
    '''
    The acronym from the full title of the project is UCBSMASD. If we ignore the "C" this 
    can be unscrambled to DUMBASS, so I couldn't help it.
    
    A class for holding all the Twitter search related elements, from validating credentials
    to getting/cleaning the data.
    '''
        
    def __init__(self, main_path):
        '''
        
        '''
        self.main_path = main_path # /Volumes/Survey_Social_Media_Compare/Methods/Scripts/Twitter/
    
    def validate_credentials(self):
        '''
        Load the API credentials from twitter_keys.yaml in main_path and
        reset the request counter.
        '''
        c_path = path.join(self.main_path, 'twitter_keys.yaml')
        self.credentials = load_credentials(c_path, env_overwrite=True)
        self.all_requests = 0

        return "Credentials validated successfully"
    
    
    
    def build_query(self,
                    mainTerms, 
                    startDate,
                    endDate,
                    inQuotes = True, 
                    language = 'en', 
                    country = 'US',
                    excludeRT = False,
                    results_per_call = 500,
                    return_fields = 'id,created_at,text,public_metrics',
                    otherTerms = []):
        
        '''
        Builds the query that is used to make the requests and get payloads.
        
        Parameters:
            mainTerms (str): The search terms we want, e.g. 'jobs'
            startDate (str): The lower end of the period we are interested in, in YYYY-MM-DD HH:MM format, 
                             e.g. '2020-10-23 13:00'
            endDate (str): The higher end of the period we are interested in, in YYYY-MM-DD HH:MM format, 
                             e.g. '2020-10-23 14:00'
            inQuotes (bool): Do we want an exact phrase match? If true the terms will be put in quotes
            language (str): Language used in the query (only languages supported by Twitter + 
                            has to be in the correct format, see https://bit.ly/2RBwmGa)
            country (str): Country where Tweet/User is located (has to be in the correct format, see
                            https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2)
            excludeRT (bool): Exclude retweets from the payload? Default False
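            results_per_call (int): How many results per request? Max is 500 for the academic API.
            return_fields (str): Comma-separated tweet fields to include in the payload, 
                            e.g. 'id,created_at,text,public_metrics'
            otherTerms (list): List of other search terms, e.g. ['#COVID', 'is:reply']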
        
        Notes:
            - tweets are fetched in reverse chronological order, i.e. starting at endDate 
            and continuing until a limit is reached.
            - endDate refers to previous day until 23:59
        '''
        
        # If excluding retweets, add the negated retweet operator
        rt = '-is:retweet' if excludeRT else ''
        
        # Put the terms in quotes for an exact phrase match
        mainTerms = '"{}"'.format(mainTerms) if inQuotes else mainTerms
        
        # Build query text
        queryText = '{} lang: {} place_country:{}'.format(mainTerms,
                                                         language,
                                                         country)
        
        # Append the retweet filter and any other terms to the queryText
        extraTerms = ([rt] if rt else []) + list(otherTerms)
        if extraTerms:
            queryText = ' '.join([queryText] + extraTerms)
        
        # Save these as will be used to determine limits
        self.results_per_call = results_per_call
        self.startDate = startDate
        self.endDate = endDate
            
        # Build query
        self.query = gen_request_parameters(queryText,
                                      start_time = self.startDate,
                                      end_time = self.endDate,
                                      tweet_fields = return_fields,
                                      results_per_call = self.results_per_call)
        
        return self.query
    
    def get_data(self, nTweets = 500):
        '''
        
        '''
        
        #
        self.rs = ResultStream(request_parameters = self.query,
                                  max_tweets = nTweets,
                                  output_format = "a",
                                  **self.credentials)
        
        self.result = list(self.rs.stream())
        
        # We can get the total requests made for a payload using:
        # twitterData_instance.rs.n_requests
        # twitterData_instance.rs.session_request_counter
        
        # This can be used to get the overall requests made
        self.all_requests += self.rs.session_request_counter       
        
        
    
    def wait_time(self):
        '''
        Figure out how long you have to wait, and after how many requests, to avoid 
        rate limit issues.
        '''
        pass
    
    def get_df(self):
        '''
        Turn the payload into a dataframe and record the time it covers.
        '''
        # Remove the entries (i.e. dictionaries) that contain
        # the key 'newest_id' from the payload, i.e. the result 
        # of our query (which is a list of dictionaries).        
        clean_json_list = [x for x in self.result if 'newest_id' not in x]        
        
        # Keep the dataframe on the instance so it can be inspected later
        self.df = pd.json_normalize(clean_json_list)

        # Calculate the time covered in a payload.
        # Most recent and oldest date/time in the df in datetime format
        self.most_recent = twitterData.toDatetime(max(self.df['created_at']))
        self.oldest = twitterData.toDatetime(min(self.df['created_at']))
        
        # total_seconds() also counts whole days, unlike .seconds
        self.timeCovered = (self.most_recent - self.oldest).total_seconds()

        return self.df
    
        
    def oneDay(self, date):
        '''
        Get the df for every hour and combine into a single dataframe. 

        Parameters:
            date (str): The day to collect, in YYYY-MM-DD format, e.g. '2020-10-23'
        '''
        
        init_request_session = time.time()
        
        # Hours in the day
        t = ['{}:00'.format(x) for x in range(0,24)]
        
        # For every hour (of 24)
        for i in range(24):
            
            # Determine start and end time, e.g. '2020-10-23 00:00' and '2020-10-23 01:00' 
            startTime = '{} {}'.format(date, t[i]) 
            
            if i == 23:
                endTime = '{} {}'.format(date, '23:59') 
            else:
                endTime = '{} {}'.format(date, t[i+1]) 

            # Build the query
            self.build_query('jobs', startTime, endTime, results_per_call=500)
            
            
            # If the next request is not the 300th (or a multiple thereof)
            # and we are still within the same 15 min window, go ahead.
            # TODO: This is no good, as all_requests is not incremented by 1 (but by 3-30ish on every call)
            if ((self.all_requests + 1) % 300 != 0) and (time.time() - init_request_session < 900):
                
                # Get the data (up to 1000 results per hour)
                self.get_data(nTweets = 1000)
                
            else:
                # Sleep for 15 minutes minus however long we had in this session
                # (never a negative amount, which time.sleep would reject)
                time.sleep(max(0, 900 - (time.time() - init_request_session)))
                
                # Then get the data
                self.get_data(nTweets = 1000)
                
                # And reset the session timer
                init_request_session = time.time()
                
            
            # Clean data
            current_df = self.get_df()
            
            # Add to dataframe containing the data for a single day.
            if i == 0:
                all_day_df = current_df
                
            else:
                all_day_df = pd.concat([all_day_df, current_df])

        return all_day_df

        
    
    def exportOneDay(self):
        '''
        
        '''
        pass
    
    @staticmethod
    def toDatetime(dateStr):
        '''
        Take a date in the ISO format that we get from twitter "%Y-%m-%dT%H:%M:%S.000Z"
        and transform to a datetime for calculations.

        Parameters:
            dateStr (str): A date string (ISO format)
        
        Returns:
            dateDT (datetime): A datetime object  
        '''
        
        dateDT = datetime.strptime(dateStr, "%Y-%m-%dT%H:%M:%S.000Z")
        
        return dateDT
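
The TODO in `oneDay()` notes that the `% 300` check cannot work when the request counter jumps by several requests per call. One alternative, sketched here under assumptions, is to reserve an upper bound of requests before each call and sleep when the budget for the window would be exceeded. `RequestBudget` is a hypothetical helper, not part of searchtweets, and it assumes a fixed 15-minute window that starts at the first tracked request, which may not match exactly how the API counts.

```python
import time


class RequestBudget:
    """Track requests in a fixed window (hypothetical helper, not part of searchtweets)."""

    def __init__(self, limit=300, window=900, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit            # requests allowed per window
        self.window = window          # window length in seconds
        self.clock = clock            # injectable for testing
        self.sleep = sleep            # injectable for testing
        self.window_start = clock()
        self.used = 0

    def reserve(self, n_requests):
        """Block until n_requests more requests fit into the current window."""
        now = self.clock()
        # The window has already expired: start a fresh one.
        if now - self.window_start >= self.window:
            self.window_start, self.used = now, 0
        # Not enough budget left: wait for the window to end, then start a fresh one.
        if self.used + n_requests > self.limit:
            self.sleep(max(0.0, self.window - (now - self.window_start)))
            self.window_start, self.used = self.clock(), 0
        self.used += n_requests


# Simulated run with a fake clock, so nothing actually sleeps.
fake_now = [0.0]
naps = []


def fake_sleep(seconds):
    naps.append(seconds)
    fake_now[0] += seconds


budget = RequestBudget(clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(11):        # 11 calls reserving 30 requests each = 330 requests
    budget.reserve(30)
    fake_now[0] += 10      # pretend each call takes 10 seconds

print(naps, budget.used)
```

In the class, the number to reserve before `get_data()` could be an upper bound such as `ceil(nTweets / results_per_call)`; whether that bound matches what `ResultStream` really issues would need checking against `rs.session_request_counter`.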
    
    
    

In [130]:
t = ['{}:00'.format(x) for x in range(0,24)]

for i in range(24):
    
    if i == 23:
        print(t[i], '23:59')
    else:
        print(t[i], t[i+1])
    
    
    
    #print(t[i-1], t[i])

0:00 1:00
1:00 2:00
2:00 3:00
3:00 4:00
4:00 5:00
5:00 6:00
6:00 7:00
7:00 8:00
8:00 9:00
9:00 10:00
10:00 11:00
11:00 12:00
12:00 13:00
13:00 14:00
14:00 15:00
15:00 16:00
16:00 17:00
17:00 18:00
18:00 19:00
19:00 20:00
20:00 21:00
21:00 22:00
22:00 23:00
23:00 23:59


In [104]:
#00:00
#01:00
#02:00
#03:00
#...
#10:00

t = ['{}:00'.format(x) for x in range(0,24)]
    
    

In [58]:
search1 = twitterData('/Volumes/Survey_Social_Media_Compare/Methods/Scripts/Twitter/')
search1.validate_credentials()

'Credentials validated successfully'

In [59]:
search1.build_query('jobs','2020-10-23 00:00', '2020-10-23 01:00', results_per_call=500)
search1.get_data(nTweets = 1000)
df1 = search1.get_df()

In [106]:
search1.build_query('jobs','2020-10-23 01:00', '2020-10-23 2:00', results_per_call=500)
search1.get_data(nTweets = 1000)
df2 = search1.get_df()

In [110]:
search1.build_query('jobs','2020-10-23 23:00', '2020-10-23 23:59', results_per_call=500)
search1.get_data(nTweets = 1000)
df3 = search1.get_df()

In [111]:
df3

Unnamed: 0,text,id,created_at,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,Yes we are! Thankful and blessed for all our B...,1319789833443811328,2020-10-23T23:56:31.000Z,3,0,23,0
1,White people always saying that Mexicans are s...,1319789469516455938,2020-10-23T23:55:04.000Z,1,0,0,0
2,"Interested in a job in #Aurora, COLORADO? This...",1319788560044552195,2020-10-23T23:51:27.000Z,0,0,0,0
3,@DangerousDC40 @washingtonpost There will be o...,1319788306390016002,2020-10-23T23:50:27.000Z,0,0,0,0
4,"Y’all, we are hiring! Head to our website htt...",1319787829606686728,2020-10-23T23:48:33.000Z,1,0,2,0
5,@MSNBC He’s 100% correct. Transitioning out of...,1319787776125112325,2020-10-23T23:48:20.000Z,5,2,9,1
6,"@realDonaldTrump New energy saving jobs, bette...",1319787715152416768,2020-10-23T23:48:06.000Z,0,0,0,0
7,@GeorgeTakei Rural voters fear transitioning t...,1319787231989669888,2020-10-23T23:46:11.000Z,0,0,1,0
8,I just had a whole conversation with myself .\...,1319786495130038272,2020-10-23T23:43:15.000Z,0,0,1,0
9,"Want to work in #Orange, CA? View our latest o...",1319785407962255360,2020-10-23T23:38:56.000Z,0,0,0,0


In [64]:
df2

Unnamed: 0,id,created_at,text,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1319458440633339904,2020-10-23T01:59:41.000Z,Many of the jobs lost this year will never com...,0,1,2,0
1,1319458438943019009,2020-10-23T01:59:40.000Z,I have a 401K and investments. I like seeing t...,0,1,1,0
2,1319458252460085249,2020-10-23T01:58:56.000Z,You’re all sitting here while you’re still all...,0,0,0,0
3,1319458198932381697,2020-10-23T01:58:43.000Z,"In case you didn’t know this, the stock market...",0,0,1,0
4,1319458079759568896,2020-10-23T01:58:15.000Z,"@Jaycaleb8 Bro what are you saying, I live dow...",0,1,0,0
...,...,...,...,...,...,...,...
58,1319444682720661504,2020-10-23T01:05:01.000Z,Looking forward to hearing @JoeBiden talk abou...,0,0,2,0
59,1319444365014609920,2020-10-23T01:03:45.000Z,If Biden thinks Trump mishandled Covid 19 he n...,0,0,0,0
60,1319444125972914176,2020-10-23T01:02:48.000Z,@rrt003 @MSNBC @DailyNewsSA Well no. Also how ...,0,1,1,0
61,1319443992451248129,2020-10-23T01:02:16.000Z,Gays dont not liking Trump as a person. Ooohh ...,2,4,40,1


In [65]:
search1.rs.session_request_counter

30

In [66]:
search1.all_requests

59

In [67]:
search1.all_requests2

2

For vaccines, requesting 1650 tweets appears to be enough to get all the results from a 24-hour window.

In [3]:
search2 = twitterData('/Volumes/Survey_Social_Media_Compare/Methods/Scripts/Twitter/')
search2.validate_credentials()
search2.build_query('jobs','2021-2-16', '2021-2-17', results_per_call=500)

'{"query": "\\"jobs\\" lang: en place_country:US", "max_results": 500, "start_time": "2021-02-16T00:00:00Z", "end_time": "2021-02-17T00:00:00Z", "tweet.fields": "id,created_at,text,public_metrics"}'

In [4]:
search2.get_data(nTweets=1500)

In execute_request



In execute_request



In stream





In [247]:
search2.get_df()

In [248]:
len(search2.df)

661

In [244]:
search2.oldest

datetime.datetime(2021, 2, 15, 0, 1, 14)

In [238]:
search2.most_recent

datetime.datetime(2021, 2, 15, 23, 59, 44)

In [177]:
search2.df

Unnamed: 0,id,created_at,text,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1341896166171140103,2020-12-23T23:59:11.000Z,Every one of our district employees is an #ess...,0,0,0,0
1,1341895880853618688,2020-12-23T23:58:03.000Z,@DarraTheExplora Where would you inject the va...,0,0,1,0
2,1341895694085451776,2020-12-23T23:57:19.000Z,@thechrishan You might like this! \n\nAbout so...,0,1,1,0
3,1341895303297978373,2020-12-23T23:55:46.000Z,Post Vaccine Sex Fest new band name I call it ...,0,0,0,0
4,1341895270901149696,2020-12-23T23:55:38.000Z,What if instead of a vaccine we just were able...,0,1,10,0
...,...,...,...,...,...,...,...
1495,1341555060891930627,2020-12-23T01:23:46.000Z,"got the vaccine before the ps5, what a world h...",0,2,2,0
1496,1341554906579275780,2020-12-23T01:23:09.000Z,There prioritizing conducted felons with the v...,0,1,0,0
1497,1341554821858471942,2020-12-23T01:22:49.000Z,"@JaniceDean Amen, Janice. I’m 72 with pulmonar...",0,0,0,0
1498,1341554805051940867,2020-12-23T01:22:45.000Z,@Khamarupa @TheRickyDavila What’s being critic...,0,0,0,0


In [150]:
??searchtweets.ResultStream

Object `searchtweets.ResultStream` not found.


In [68]:
import time

s = time.time()

In [136]:
(time.time() - s) 

2823.917551755905

1623244061.923844

In [91]:
print(len(df1))
print(len(df2))

39
63


In [94]:
bla = pd.concat([df1, df2])

In [96]:
min(bla['created_at'])

'2020-10-23T00:02:40.000Z'