# Planning

### Things that need to be done    

- [x] Set up skeleton for a nice object oriented approach. 
- [x] Figure out best way to get tweets.
- [x] Data format
- [x] Build class and funcs
- [x] Figure out how to best count all requests in a session and make sure it's functional
    - This is already built-in to searchtweets to some extent, so will use that.
- [x] Make oneDay() fail-safe (it's a bad idea to get the data and combine it within a single for loop, because if something goes wrong in one iteration, all the data from previous iterations will be lost).
    - Getting a whole week instead
- [ ] Change oneDay() to oneWeek().    
- [ ] Save metadata for the payloads (i.e. self.most_recent, self.oldest, self.timeCovered, rs.total_results + other?)
    - *searchtweets* already saves some sort of log; print this to file with an appropriate name.  
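
A minimal sketch of how the fail-safe and metadata items could fit together, assuming each chunk (e.g. one week) is written to disk as soon as it is fetched. The helper `save_chunk`, the file naming scheme and the metadata keys are hypothetical and not part of *searchtweets*; the example values are made up.

```python
import json
import tempfile
from pathlib import Path


def save_chunk(out_dir, label, tweets, metadata):
    """Write one chunk of tweets and its metadata to two JSON files.

    `label` (e.g. the week start date) is a hypothetical naming scheme.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    tweets_path = out_dir / f"{label}_tweets.json"
    meta_path = out_dir / f"{label}_meta.json"

    tweets_path.write_text(json.dumps(tweets), encoding="utf-8")
    meta_path.write_text(json.dumps(metadata), encoding="utf-8")
    return tweets_path, meta_path


# Made-up example: if a later chunk fails, this one is already on disk.
example_dir = tempfile.mkdtemp()
example_tweets = [{"id": "1", "created_at": "2020-10-23T00:56:47.000Z", "text": "example"}]
example_meta = {
    "most_recent": "2020-10-23T00:56:47.000Z",
    "oldest": "2020-10-23T00:56:47.000Z",
    "n_tweets": len(example_tweets),
}
tweets_file, meta_file = save_chunk(example_dir, "2020-10-19", example_tweets, example_meta)
```

Writing per chunk means a failure only costs the chunk in progress, not everything fetched before it.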

### Rate Limits

- 10,000,000 Tweets per month (resets on the 19th of each month). 
- 300 requests/15 minute window, with 500 Tweets/request:
    - 150,000 tweets/15min 
    - 600,000 tweets/hour
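
The throughput figures above follow directly from the two limits; the short check below reproduces the arithmetic (the constant names are only illustrative).

```python
REQUESTS_PER_WINDOW = 300   # requests allowed per 15-minute window
TWEETS_PER_REQUEST = 500    # maximum Tweets returned by one request
WINDOW_MINUTES = 15

# Maximum Tweets per window and per hour (four windows in an hour)
tweets_per_window = REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST
tweets_per_hour = tweets_per_window * (60 // WINDOW_MINUTES)

print(tweets_per_window)  # 150000
print(tweets_per_hour)    # 600000
```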

### How many tweets to get?
- Period covered is: Oct 23rd - July 30th (-ish)
    - ~ 280 days
    - ~ 6720 hours
    - If we get 1,000 tweets per hour: $6{,}720{,}000 \times 2$ tweets. 
    - That's very little in terms of space, but might take quite a while for it to go through sentiment analysis.
    - At the rate limit of 600,000 tweets/hour, it would take ~22 hours to get all the data.
    - However, it's unlikely that our queries would return anywhere near 1,000 results/hour.
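
The estimate above can be reproduced as a rough sketch. The factor of 2 is taken from the estimate as written, and the monthly cap comes from the rate limits section; the variable names are illustrative only.

```python
from datetime import date

# Period covered: Oct 23rd 2020 to July 30th 2021
start, end = date(2020, 10, 23), date(2021, 7, 30)
days = (end - start).days
hours = days * 24

TWEETS_PER_HOUR_WANTED = 1000
FACTOR = 2                      # the factor of 2 used in the estimate
MAX_TWEETS_PER_HOUR = 600_000   # from the rate limits
MONTHLY_CAP = 10_000_000        # Tweets per month

total_tweets = hours * TWEETS_PER_HOUR_WANTED * FACTOR
download_hours = total_tweets / MAX_TWEETS_PER_HOUR
exceeds_monthly_cap = total_tweets > MONTHLY_CAP

print(days, hours)          # 280 6720
print(total_tweets)         # 13440000
print(download_hours)       # 22.4
print(exceeds_monthly_cap)  # True
```

As an upper bound this would also exceed the monthly cap, although the actual number of results per hour is expected to be much lower.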

### Best way to get tweets
- Period covered is: Oct 23rd - July 30th (-ish)
    - $n_h$ per hour
    - $n_d$ per day (where $n_d$ would be ~ $n_h \times 24$)
    - $n_w$ per week (where $n_w$ would be ~ $n_h \times 24 \times 7$)  
    

- $n_w$ is probably the best option: 
    - can leverage functions built into *searchtweets* to avoid rate limit violations (e.g. exponential back-off).
    - it's easy to select tweets in any given day/hour from these data.
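
To illustrate the last point, here is a small sketch of selecting the tweets for a given day or hour from a weekly payload. It assumes the payload is a list of dicts with a `created_at` field in the format Twitter returns; `select_tweets` is a hypothetical helper and the example entries are made up.

```python
from datetime import datetime


def parse_created_at(value):
    """Parse the timestamp format returned by Twitter."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.000Z")


def select_tweets(tweets, day, hour=None):
    """Return tweets created on `day` ('YYYY-MM-DD'), optionally in one hour (0-23)."""
    selected = []
    for tweet in tweets:
        created = parse_created_at(tweet["created_at"])
        if created.strftime("%Y-%m-%d") != day:
            continue
        if hour is not None and created.hour != hour:
            continue
        selected.append(tweet)
    return selected


week = [
    {"id": "1", "created_at": "2020-10-23T00:56:47.000Z"},
    {"id": "2", "created_at": "2020-10-23T01:59:41.000Z"},
    {"id": "3", "created_at": "2020-10-24T00:10:00.000Z"},
]

print([t["id"] for t in select_tweets(week, "2020-10-23")])          # ['1', '2']
print([t["id"] for t in select_tweets(week, "2020-10-23", hour=1)])  # ['2']
```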

### Data format

- A single results call: **JSON to pd**.
    - This is relatively straightforward with one minor complication, i.e. entries such as this:
    <blockquote>{'newest_id': '1402310241992183808',
  'oldest_id': '1402310139630211083',
  'result_count': 100,
  'next_token': 'b26v89c19zqg8o3fpdg7rbcqdq8stpgmibslekg3kxail'}
    </blockquote>
    - This is used by the wrapper to get the next lot of tweets if max_tweets > results_per_call, but will also always be the last entry in a result.
    
    
- Multiple result calls: **pds in dict/dict-of-dict**. 
    - I am thinking the best way to store all the data would be a dict of dataframes, but will see how it works  
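
A sketch of the two ideas above, using plain Python structures: the metadata entries are separated from the tweets, and each cleaned payload is stored in a dict under a label. `split_payload` and the week-start label are hypothetical, and the example payload is made up; the tweet lists could be turned into dataframes before being stored.

```python
def split_payload(payload):
    """Separate tweet entries from metadata entries (those with a 'newest_id' key)."""
    tweets = [entry for entry in payload if "newest_id" not in entry]
    meta = [entry for entry in payload if "newest_id" in entry]
    return tweets, meta


# Made-up payload: two tweets followed by one metadata entry
payload = [
    {"id": "1319442612487479301", "created_at": "2020-10-23T00:56:47.000Z"},
    {"id": "1319442601444020225", "created_at": "2020-10-23T00:56:44.000Z"},
    {"newest_id": "1319442612487479301",
     "oldest_id": "1319442601444020225",
     "result_count": 2},
]

tweets, meta = split_payload(payload)

# One entry per result call, keyed here by a hypothetical week-start label
by_week = {"2020-10-19": tweets}

print(len(tweets), len(meta))  # 2 1
```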
    
### Survey periods

| Period | A_I_start_date | A_I_end_date | A_I_week | HPS_start_date | HPS_end_date | HPS_Week | HPS Topic |
|--------|----------------|--------------|----------|----------------|--------------|----------|-----------|
| P1     | 23.10.2020     | 26.10.2020   | W29*     | 28.10.2020     | 09.11.2020   | W18      |     E    |
| P2     | 13.11.2020     | 16.11.2020   | W30      | 11.11.2020     | 23.11.2020   | W19      |     E    |
| P2     | 20.11.2020     | 23.11.2020   | W31      | 11.11.2020     | 23.11.2020   | W19      |     E    |
| P3     | 04.12.2020     | 07.12.2020   | W32      | 25.11.2020     | 07.12.2020   | W20      |     E    |
| P4     | 11.12.2020     | 14.12.2020   | W33      | 09.12.2020     | 21.12.2020   | W21      |     E    |
| P4     | 18.12.2020     | 21.12.2020   | W34      | 09.12.2020     | 21.12.2020   | W21      |     E    |
| P5     | 08.01.2021     | 11.01.2021   | W35      | 06.01.2021     | 18.01.2021   | W22      |    E,V   |
| P6     | 22.01.2021     | 25.01.2021   | W36      | 20.01.2021     | 01.02.2021   | W23      |    E,V   |
| P6     | 29.01.2021     | 01.02.2021   | W37      | 20.01.2021     | 01.02.2021   | W23      |    E,V   |
| P7     | 05.02.2021     | 08.02.2021   | W38      | 03.02.2021     | 15.02.2021   | W24      |    E,V   |
| P8     | 19.02.2021     | 22.02.2021   | W39      | 17.02.2021     | 01.03.2021   | W25      |    E,V   |
| P8     | 28.02.2021     | 01.03.2021   | W40      | 17.02.2021     | 01.03.2021   | W25      |    E,V   |
| P9     | 05.03.2021     | 08.03.2021   | W41      | 03.03.2021     | 15.03.2021   | W26      |    E,V   |
| P10    | 19.03.2021     | 22.03.2021   | W42      | 17.03.2021     | 29.03.2021   | W27      |    E,V   |
| P11    | 02.04.2021     | 05.04.2021   | W43      | 14.04.2021     | 26.04.2021   | W28      |    E,V   |
| P11    | 16.04.2021     | 19.04.2021   | W44      | 14.04.2021     | 26.04.2021   | W28      |    E,V   |
| P12    | 07.05.2021     | 10.05.2021   | W45      | 28.04.2021     | 10.05.2021   | W29      |    E,V   |
| P13    | 21.05.2021     | 24.05.2021   | W46      |                |              | W30      |    E,V   |


e.g. P1: 23.10.20 - 26.10.20 (Fri - Mon)
* Corresponding week 19.10.20 - 25.10.20 or 26.10.20 - 01.11.20?
* For now let's say the former. 
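
Taking the former option, the Monday to Sunday week containing a survey start date can be worked out as in the sketch below. `week_bounds` is a hypothetical helper that takes the DD.MM.YYYY format used in the table.

```python
from datetime import datetime, timedelta


def week_bounds(day_str):
    """Return the Monday and Sunday of the week containing a DD.MM.YYYY date."""
    day = datetime.strptime(day_str, "%d.%m.%Y").date()
    monday = day - timedelta(days=day.weekday())  # weekday() is 0 for Monday
    sunday = monday + timedelta(days=6)
    return monday, sunday


monday, sunday = week_bounds("23.10.2020")
print(monday, sunday)  # 2020-10-19 2020-10-25
```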

# Dev

In [1]:
from datetime import date, datetime, timedelta
import time
from os import path
from searchtweets import ResultStream, gen_request_parameters, load_credentials, collect_results, convert_utc_time
import pandas as pd
import numpy as np

In [2]:
def countTweets(startDate, endDate, tweets_per_hour):
    '''
    Print the number of days, hours and tweets between two dates.
    Specify dates in DD.MM.YYYY format (leading zeros for days and months are optional).
    '''
    
    s_d, s_m, s_y = [int(i) for i in startDate.split('.')]
    e_d, e_m, e_y = [int(i) for i in endDate.split('.')]

    endDate = date(e_y, e_m, e_d)
    startDate = date(s_y, s_m, s_d)
    days = (endDate - startDate).days
    hours = days * 24
    print("From {} to {} we have {} days, {} hours, and {} tweets (with {} tweets per hour)".format(
        startDate, endDate, days, hours, hours * tweets_per_hour, tweets_per_hour))
    
    
countTweets('23.10.2020', '24.05.2021', 10)     

From 2020-10-23 to 2021-05-24 we have 213 days, 5112 hours, and 51120 tweets (with 10 tweets per hour)


In [138]:
countTweets('23.10.2020', '1.11.2020', 1000) 

From 2020-10-23 to 2020-11-01 we have 9 days, 216 hours, and 216000 tweets (with 1000 tweets per hour)


In [5]:
class twitterData():
    '''
    A class for holding all the Twitter search related elements, from validating credentials
    to cleaning the data.
    '''
        
    def __init__(self, main_path):
        '''
        Parameters:
            main_path (str): Folder that contains twitter_keys.yaml.
        '''
        self.main_path = main_path
    
    def validate_credentials(self):
        '''
        Load the API credentials from twitter_keys.yaml in main_path
        and initialise the request counter.
        '''
        # Use the path built from main_path rather than a hard-coded one
        c_path = path.join(self.main_path, 'twitter_keys.yaml')
        self.credentials = load_credentials(c_path, env_overwrite=True)
        self.all_requests = 0

        return "Credentials validated successfully"
    
    
    
    def build_query(self,
                    mainTerms, 
                    startDate,
                    endDate,
                    inQuotes = True, 
                    language = 'en', 
                    country = 'US',
                    excludeRT = False,
                    results_per_call = 500,
                    return_fields = 'id,created_at,text,public_metrics',
                    otherTerms = []):
        
        '''
        Builds the query that is used to make the requests and get payloads.
        
        Parameters:
            mainTerms (str): The search terms we want, e.g. 'jobs'
            startDate (str): The lower end of the period we are interested in, in YYYY-MM-DD HH:MM format, 
                             e.g. '2020-10-23 13:00'
            endDate (str): The upper end of the period we are interested in, in YYYY-MM-DD HH:MM format, 
                             e.g. '2020-10-23 14:00'
            inQuotes (bool): Do we want an exact phrase match? If True the terms will be put in quotes
            language (str): Language used in the query (only languages supported by Twitter + 
                            has to be in the correct format, see https://bit.ly/2RBwmGa)
            country (str): Country where Tweet/User is located (has to be in the correct format, see
                            https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2)
            excludeRT (bool): Exclude retweets from the payload? Default False
            results_per_call (int): How many results per request? Max is 500 for the academic API.
            return_fields (str): Comma-separated tweet fields to return, e.g. 'id,created_at,text'
            otherTerms (list): List of other search terms, e.g. ['#COVID', 'is:reply']
        
        Notes:
            - More notes on building queries here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query.
            - Tweets are fetched in reverse chronological order, i.e. starting at endDate 
            and working backwards until a limit is reached.
            - If endDate is given without a time, results run up to 23:59 of the previous day.
        '''
        
        # If excluding retweets, add the '-is:retweet' operator to the query
        rt = ' -is:retweet' if excludeRT else ''
        
        # Put the terms in quotes for an exact phrase match, otherwise keep them as they are
        mainTerms = '"{}"'.format(mainTerms) if inQuotes else mainTerms
        
        # Build query text (no space after the colon in operators such as lang:)
        queryText = '{} lang:{} place_country:{}{}'.format(mainTerms,
                                                           language,
                                                           country,
                                                           rt)
        
        # If there are other terms, append them to the queryText
        if otherTerms:
            queryText = ' '.join([queryText] + list(otherTerms))
        
        # Save these as will be used to determine limits
        self.results_per_call = results_per_call
        self.startDate = startDate
        self.endDate = endDate
            
        # Build query
        self.query = gen_request_parameters(queryText,
                                      start_time = self.startDate,
                                      end_time = self.endDate,
                                      tweet_fields = return_fields,
                                      results_per_call = self.results_per_call)
        
    
    def get_data(self, nTweets = 500):
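        '''
        Make the request(s) for the current query and save the payload in self.result.
        
        Parameters:
            nTweets (int): Maximum number of tweets to fetch for this query.
        '''
        
        # Stream the results for the query built in build_query()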
        self.rs = ResultStream(request_parameters = self.query,
                                  max_tweets = nTweets,
                                  output_format = "a",
                                  **self.credentials)
        
        self.result = list(self.rs.stream())
        
        # We can get the total requests made for a payload using:
        # twitterData_instance.rs.n_requests
        # twitterData_instance.rs.session_request_counter
        
        # This can be used to get the overall requests made
        self.all_requests += self.rs.session_request_counter       
        
        
    
    def get_df(self):
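        '''
        Clean the payload in self.result and return it as a pd.DataFrame.
        Also saves the most recent and oldest timestamps and the time covered.
        '''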
        # Remove the entries (i.e. dictionaries) that contain
        # the key 'newest_id' from the payload, i.e. the result 
        # of our query (which is a list of dictionaries).        
        clean_json_list = [x for x in self.result if 'newest_id' not in x]        
        
        df = pd.json_normalize(clean_json_list)

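        # Calculate the time covered in a payload (in seconds).
        # Most recent and oldest date/time in the df in datetime format
        self.most_recent = twitterData.toDatetime(max(df['created_at']))
        self.oldest = twitterData.toDatetime(min(df['created_at']))
        
        # Use total_seconds(): .seconds drops whole days, which matters
        # for payloads that cover more than 24 hours.
        self.timeCovered = (self.most_recent - self.oldest).total_seconds()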
    
        return df
    
    def oneWeek(self, mainTerms, startDate, endDate):
        '''
        Convenience function for getting all the tweets from a specified period.
        The parameters are fed to **build_query()**, which has more parameters with the following default values:
                    inQuotes = True, 
                    language = 'en', 
                    country = 'US',
                    excludeRT = False,
                    results_per_call = 500,
                    return_fields = 'id,created_at,text,public_metrics',
                    otherTerms = []
        These should either be added to the build_query() call within the current function, or the defaults changed in build_query().
        
        Parameters:
            mainTerms (str): The search terms, e.g. 'jobs'
            startDate (str): week starting (format: 'YYYY-MM-DD', e.g. '2020-10-23')
            endDate (str): week ending (same format as startDate)
            
        Returns:
            week_df (pd.DataFrame): Payload returned by the query for the specified period in df format.   
        '''
        
        self.build_query(mainTerms, startDate, endDate, results_per_call=500)
        self.get_data(nTweets = 1000)
        
        week_df = self.get_df()
        
        return week_df
    
        
#     def oneDay(self,dateStart, dateEnd):
#         '''
#         Get the df for a single day (specified by dateStart and dateEnd).
#         '''
        
#         init_request_session = time.time()
        
#         # Hours in the day
#         t = ['{}:00'.format(x) for x in range(0,24)]
        
#         # For every hour (of 24)
#         for i in range(24):
            
#             # Determine start and end time, e.g. '2020-10-23 00:00' and '2020-10-23 01:00' 
#             startTime = '{} {}'.format(date, t[i]) 
            
#             if i==23:
#                 endDate = '{} {}'.format(date, '23:59') 
#             else:
#                 endDate = '{} {}'.format(date, t[i+1]) 

#             # Build the query
#             self.build_query('jobs', startDate, endDate, results_per_call=500)
            
            
#             # If the next request is the 300th (or multiple thereof)
#             # and we are within the same 15 min window.
#             # TODO: This is no good, as all_requests is not incremented by 1 (but by 3-30ish on every call)
#             if (self.all_requests+1 % 300 != 0) and (time.time() - init_request_session < 900):
                
#                 # Get the data (up to 1000 results per hour)
#                 self.get_data(nTweets = 1000)
                
#             else:
#                 # Sleep for 15 minutes minus however long we had in this session
#                 time.sleep(900 - (time.time() - init_request_session))
                
#                 # Then get the data
#                 self.get_data(nTweets = 1000)
                
#                 # And reset the session timer
#                 init_request_session = time.time()
                
            
#             # Clean data
#             current_df = self.get_df()
            
#             # Add to dataframe containing the data for a single day.
#             if i == 1:
#                 all_day_df = current_df
                
#             else:
#                 all_day_df = pd.concat([all_day_df, current_df])

        
    
    def exportOneDay(self):
        '''
        Placeholder: export the data for a single day (not implemented yet).
        '''
        pass
    
    @staticmethod
    def toDatetime(dateStr):
        '''
        Take a date in the ISO format that we get from twitter "%Y-%m-%dT%H:%M:%S.000Z"
        and transform to a datetime for calculations.

        Parameters:
            dateStr (str): A date string (ISO format)
        
        Returns:
            dateDT (datetime): A datetime object  
        '''
        
        dateDT = datetime.strptime(dateStr, "%Y-%m-%dT%H:%M:%S.000Z")
        
        return dateDT
    
    @staticmethod
    def weekFromDay(day):
        '''
        Work out the start and end dates of the week that contains a given date.
        Params:
            day (datetime): Can be a Timestamp (pandas/numpy object) or a datetime.datetime object.

        Returns: 
            weekStart (str): The start (i.e. Monday) of the week containing *day*, in 'YYYY-MM-DD' format.
            weekEnd (str): The end (i.e. Sunday) of the week containing *day*, in 'YYYY-MM-DD' format.
        '''

        weekStart = day - timedelta(days=day.weekday())
        weekEnd = weekStart + timedelta(days=6)

        return weekStart.strftime('%Y-%m-%d'), weekEnd.strftime('%Y-%m-%d')
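
The query string can be checked without making a request. The sketch below is a standalone illustration, assuming the Twitter API v2 operators `lang:`, `place_country:` and `-is:retweet`; `build_query_text` is a hypothetical helper and not part of the class or of *searchtweets*.

```python
def build_query_text(main_terms, in_quotes=True, language="en", country="US",
                     exclude_rt=False, other_terms=None):
    """Assemble a search query string from its parts (illustrative only)."""
    # Quote the terms for an exact phrase match
    terms = '"{}"'.format(main_terms) if in_quotes else main_terms
    parts = [terms, "lang:{}".format(language), "place_country:{}".format(country)]
    if exclude_rt:
        parts.append("-is:retweet")
    if other_terms:
        parts.extend(other_terms)
    return " ".join(parts)


print(build_query_text("jobs"))
# "jobs" lang:en place_country:US
print(build_query_text("jobs", in_quotes=False, exclude_rt=True, other_terms=["#COVID"]))
# jobs lang:en place_country:US -is:retweet #COVID
```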
    
    
    

# Testing

In [16]:
search1 = twitterData('/Volumes/Survey_Social_Media_Compare/Methods/Scripts/Twitter/')
search1.validate_credentials()

'Credentials validated successfully'

## Single query: getting data for 1 hour

In [17]:
# Build a query.
search1.build_query('jobs','2020-10-23 00:00', '2020-10-23 01:00', results_per_call=500)

# Getting payload. 
# This is saved in self.result.
search1.get_data(nTweets = 1000)

# Clean data and save in a pd.DataFrame
df1 = search1.get_df()
df1

Unnamed: 0,id,created_at,text,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1319442612487479301,2020-10-23T00:56:47.000Z,Exactly..they should loose their jobs.. https:...,0,0,0,0
1,1319442601444020225,2020-10-23T00:56:44.000Z,@MeidasTouch Is Trump going schizo on us that ...,1,1,3,0
2,1319442291749158915,2020-10-23T00:55:31.000Z,Thank you @connectmeetings for getting meeting...,1,0,3,2
3,1319442241052680193,2020-10-23T00:55:18.000Z,@jecoreyarthur Or another option for jobs,0,0,0,0
4,1319442109917728770,2020-10-23T00:54:47.000Z,“They took our jobs!!” Bro u didnt go to colle...,0,0,0,0
5,1319442103164936193,2020-10-23T00:54:46.000Z,I'm horrified by this. Any health professional...,0,0,1,0
6,1319441826332409856,2020-10-23T00:53:40.000Z,Part of me says “if only Hunter didn’t take th...,0,0,2,0
7,1319441463671967746,2020-10-23T00:52:13.000Z,@JohnDiesattheEn Modern debates are the NFL Bl...,0,0,0,0
8,1319441090911547392,2020-10-23T00:50:44.000Z,You have?\nThat’s all I want your amazing Earl...,0,0,0,0
9,1319440777085419521,2020-10-23T00:49:29.000Z,I’d say that this is a reason some jobs can’t ...,0,1,2,0


* The number of requests in a single query is saved in the instance attribute .rs.n_requests. 
* This is overwritten when a new query is made, but before that, get_data() adds .rs.session_request_counter to the instance's .all_requests attribute. 
    * For example, below we can see that .n_requests = 1 after both the first and the second payload (saved in df1 and df2), while .all_requests is 3 after the first and 7 after the second. 
    * Note that .session_request_counter appears to be cumulative (3, then 4), so adding it on every call seems to overstate the total (3 + 4 = 7).
* The .all_requests attribute will be used for ensuring compliance with rate limits.
    * This could be done directly through *searchtweets*, which has built-in tools (e.g. exponential back-off), by making a single query for the whole period (~280 days).
    * However, since the period we are interested in covering here is quite long, this is probably not a good solution (e.g. if something fails on request 5,000 of 7,000, all data is lost, but all tweets already accessed will still count towards the monthly rate limit).
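
If the requests are counted manually, one possible approach is a small budget tracker for the 300 requests/15 minutes window. This is only a sketch under the assumption of fixed windows that start with the first request; `RequestBudget` is hypothetical and not part of *searchtweets*, and the clock is injectable so the logic can be checked without sleeping.

```python
import time


class RequestBudget:
    """Track requests made in the current rate-limit window (hypothetical helper)."""

    def __init__(self, max_requests=300, window_seconds=900, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def reserve(self, n_requests):
        """Reserve n_requests; return the seconds the caller should sleep first."""
        now = self.clock()
        elapsed = now - self.window_start
        if elapsed >= self.window_seconds:
            # The previous window has expired, so start a new one.
            self.window_start, self.used, elapsed = now, 0, 0.0
        if self.used + n_requests <= self.max_requests:
            self.used += n_requests
            return 0.0
        # Not enough budget left: wait for the window to end,
        # then count these requests against the next window.
        wait = self.window_seconds - elapsed
        self.window_start = now + wait
        self.used = n_requests
        return wait


# Demonstration with a fake clock instead of real time
fake_now = [0.0]
budget = RequestBudget(clock=lambda: fake_now[0])

first_wait = budget.reserve(299)   # fits in the window
fake_now[0] = 100.0
second_wait = budget.reserve(5)    # would exceed 300, so wait for the window to end

print(first_wait, second_wait)  # 0.0 800.0
```

In real use the caller would pass the returned value to `time.sleep()` before making the next batch of requests.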



In [18]:
print(search1.rs.n_requests)
print(search1.rs.session_request_counter)
print(search1.all_requests)

1
3
3


In [19]:
search1.build_query('jobs','2020-10-23 01:00', '2020-10-23 2:00', results_per_call=500)
search1.get_data(nTweets = 1000)
df2 = search1.get_df()
df2

Unnamed: 0,id,created_at,text,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1319458440633339904,2020-10-23T01:59:41.000Z,Many of the jobs lost this year will never com...,0,1,2,0
1,1319458438943019009,2020-10-23T01:59:40.000Z,I have a 401K and investments. I like seeing t...,0,1,1,0
2,1319458252460085249,2020-10-23T01:58:56.000Z,You’re all sitting here while you’re still all...,0,0,0,0
3,1319458198932381697,2020-10-23T01:58:43.000Z,"In case you didn’t know this, the stock market...",0,0,1,0
4,1319458079759568896,2020-10-23T01:58:15.000Z,"@Jaycaleb8 Bro what are you saying, I live dow...",0,1,0,0
...,...,...,...,...,...,...,...
59,1319444682720661504,2020-10-23T01:05:01.000Z,Looking forward to hearing @JoeBiden talk abou...,0,0,2,0
60,1319444365014609920,2020-10-23T01:03:45.000Z,If Biden thinks Trump mishandled Covid 19 he n...,0,0,0,0
61,1319444125972914176,2020-10-23T01:02:48.000Z,@rrt003 @MSNBC @DailyNewsSA Well no. Also how ...,0,1,1,0
62,1319443992451248129,2020-10-23T01:02:16.000Z,Gays dont not liking Trump as a person. Ooohh ...,2,4,40,1


In [21]:
print(search1.rs.n_requests)
print(search1.rs.session_request_counter)
print(search1.all_requests)

1
4
7


The above also gives an indication of how many tweets to expect for our most basic query on 'jobs'. 

## Single query: getting data for 1 week based on the data collection periods in the surveys.

In [22]:
# Load the survey periods.
# These hold the dates on which data was collected for each individual survey.
# We will use them to get Twitter data from the same periods.
surveyPeriods = pd.read_excel('/Volumes/Survey_Social_Media_Compare/Methods/Scripts/Surveys/table_details/surveyPeriods.xlsx', sheet_name='AI+HPS')

In [30]:
# Use:
p1_start, p2_start = twitterData.weekFromDay(surveyPeriods['A_I_start_date'][0])
print("Week start(Monday): {} \nWeek end (Sunday): {}\nVar type: {},{}".format(p1_start, p2_start, type(p1_start), type(p2_start)))

Week start(Monday): 2020-10-19 
Week end (Sunday): 2020-10-25
Var type: <class 'str'>,<class 'str'>


In [31]:
search2 = twitterData('/Volumes/Survey_Social_Media_Compare/Methods/Scripts/Twitter/')
search2.validate_credentials()
search2.build_query('jobs', p1_start, p2_start)

In [32]:
search2.get_data(nTweets=20000)

In [48]:
search2.get_df()

Unnamed: 0,id,text,created_at,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1353128768211013634,@VerySaucySalsa @conniechansf Every city that ...,2021-01-23T23:53:32.000Z,0,0,1,0
1,1353128719250989057,@GetUpESPN @BartScott57 @Realrclark25 Come on ...,2021-01-23T23:53:21.000Z,0,0,0,0
2,1353128517781905408,You’ll raise their pay and the minimum wage fo...,2021-01-23T23:52:33.000Z,0,1,1,0
3,1353128220569329665,@redsteeze @mt_lass Thanks New Mexico! We’re l...,2021-01-23T23:51:22.000Z,1,0,0,0
4,1353127984576794627,Biden has managed to rid this nation of 100's ...,2021-01-23T23:50:26.000Z,0,0,0,0
...,...,...,...,...,...,...,...
5466,1350958697183404043,"@jussbryant Exactly, so with that being said m...",2021-01-18T00:10:27.000Z,0,1,0,0
5467,1350957711375130635,@alavelle07 @laurenboebert So you like Sociali...,2021-01-18T00:06:32.000Z,0,0,0,0
5468,1350957355752681474,"Or is that what you think, my brotha? How many...",2021-01-18T00:05:07.000Z,0,0,0,0
5469,1350956550605725697,@ViolationsGreg @davecokin But the last two OC...,2021-01-18T00:01:55.000Z,0,1,0,0
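
When working out the time covered by a payload like the one above, which spans several days, note that `timedelta.seconds` only holds the remainder within a day, while `total_seconds()` gives the full span. A short check using the newest and oldest timestamps shown above:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.000Z"
most_recent = datetime.strptime("2021-01-23T23:53:32.000Z", FMT)
oldest = datetime.strptime("2021-01-18T00:01:55.000Z", FMT)

covered = most_recent - oldest

print(covered.days)             # 5
print(covered.seconds)          # 85897 (the remainder after the 5 whole days)
print(covered.total_seconds())  # 517897.0 (the full span in seconds)
```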
