# Planning

### Things that need to be done    

- [x] Set up skeleton for a nice object oriented approach. 
- [x] Figure out best way to get tweets.
- [x] Data format
- [x] Build class and funcs
- [x] Figure out how to best count all requests in a session and make sure it's functional
    - This is already built-in to searchtweets to some extent, so will use that.
- [x] Make oneDay() fail safe (it's a bad idea to get the data and combine it within a single for, because if something goes wrong in an interation all the data from previous iterations will be lost if something goes wrong).
    - Getting a whole week instead
- [x] Change oneDay() to oneWeek().    
- [x] Save metadata for the payloads (i.e. self.most_recent, self.oldest, self.timeCovered, rs.total_results + other?)
    - Seach tweets already saves some sort of log, print this to file with the appropriate name.  
- [ ] More tests on oneWeek(): currently it sometimes returns an empty Stream sometimes, it might be due to bad handling of rate limits by searchtweets?
- [ ] Updated export and loading to be compliant with the new folder structure. 
- [ ] See why df3 different size than what we get with oneWeek(). Test in intervals > 15 min, my suspicion is that it's bad handling of rate limits by searchtweets.

### Rate Limits

- 10,000,000 Tweets per month (resets on the 19th of each month). 
- 300 requests/15 minute window, with 500 Tweets/request:
    - 150,000 tweets/15min 
    - 600,000 tweets/hour

### How many tweets to get?
- Period covered is: Oct 23rd - July 30th (-ish)
    - ~ 280 days
    - ~ 6720 hours
    - If we get 1000 tweets per hour: $6,720,000 * 2$. 
    - That's very little in terms of space, but might take quite a while for it to go through sentiment analysis.
    - It would take ~22 hours to get the whole data (due to rate limits).
    - However, it's unlikely that our queries would return anywhere near 1,000 results/hour.

### Best way to get tweets
- Period covered is: Oct 23rd - July 30th (-ish)
    - $n_h$ per hour/day
    - $n_d$ per day (where $n_d$ would be ~ $n_h*24$)
    - $n_w$ per week (where $n_w$ would be ~ $n_h*24 *7$)  
    

- $n_w$ is probably the best options: 
    - can leverage functions built into *searchtweets* to avoid rate limit violations (e.g. exponential back-off).
    - it's easy to select tweets in any given day/hour from these data.

### Data format

- A single results call: **JSON to pd**.
    - This is relatively straightforward with one minor complication, i.e. entries such as this:
    <blockquote>{'newest_id': '1402310241992183808',
  'oldest_id': '1402310139630211083',
  'result_count': 100,
  'next_token': 'b26v89c19zqg8o3fpdg7rbcqdq8stpgmibslekg3kxail'}
    </blockquote>
    - This is used by the wrapper to get the next lot of tweets if max_tweets > results_per_call, but will also always be the last entry in a result.
    
    
- Multiple result calls: **pds in dict/dict-of-dict**. 
    - I am thinking the best way to store all the data would be a dict of dataframes, but will see how it works  
    
### Survey periods

| Period | A_I_start_date | A_I_end_date | A_I_week | HPS_start_date | HPS_end_date | HPS_Week | HPS Topic |
|--------|----------------|--------------|----------|----------------|--------------|----------|-----------|
| P1     | 23.10.2020     | 26.10.2020   | W29*     | 28.10.2020     | 09.11.2020   | W18      |     E    |
| P2     | 13.11.2020     | 16.11.2020   | W30      | 11.11.2020     | 23.11.2020   | W19      |     E    |
| P2     | 20.11.2020     | 23.11.2020   | W31      | 11.11.2020     | 23.11.2020   | W19      |     E    |
| P3     | 04.12.2020     | 07.12.2020   | W32      | 25.11.2020     | 07.12.2020   | W20      |     E    |
| P4     | 11.12.2020     | 14.12.2020   | W33      | 09.12.2020     | 21.12.2020   | W21      |     E    |
| P4     | 18.12.2020     | 21.12.2020   | W34      | 09.12.2020     | 21.12.2020   | W21      |     E    |
| P5     | 08.01.2021     | 11.01.2021   | W35      | 06.01.2021     | 18.01.2021   | W22      |    E,V   |
| P6     | 22.01.2021     | 25.01.2021   | W36      | 20.01.2021     | 01.02.2021   | W23      |    E,V   |
| P6     | 29.01.2021     | 01.02.2021   | W37      | 20.01.2021     | 01.02.2021   | W23      |    E,V   |
| P7     | 05.02.2021     | 08.02.2021   | W38      | 03.02.2021     | 15.02.2021   | W24      |    E,V   |
| P8     | 19.02.2021     | 22.02.2021   | W39      | 17.02.2021     | 01.03.2021   | W25      |    E,V   |
| P8     | 28.02.2021     | 01.03.2021   | W40      | 17.02.2021     | 01.03.2021   | W25      |    E,V   |
| P9     | 05.03.2021     | 08.03.2021   | W41      | 03.03.2021     | 15.03.2021   | W26      |    E,V   |
| P10    | 19.03.2021     | 22.03.2021   | W42      | 17.03.2021     | 29.03.2021   | W27      |    E,V   |
| P11    | 02.04.2021     | 05.04.2021   | W43      | 14.04.2021     | 26.04.2021   | W28      |    E,V   |
| P11    | 16.04.2021     | 19.04.2021   | W44      | 14.04.2021     | 26.04.2021   | W28      |    E,V   |
| P12    | 07.05.2021     | 10.05.2021   | W45      | 28.04.2021     | 10.05.2021   | W29      |    E,V   |
| P13    | 21.05.2021     | 24.05.2021   | W46      |                |              | W30      |    E,V   |


e.g. P1: 23.10.20 - 26.10.20 (Fri - Mon)
* Corresponding week 19.19.20 - 25.10.20 or 26.10.20 - 01.11.20?
* For now let's say the former. 

# Dev

In [1]:
from datetime import date, datetime, timedelta
import time
from os import path
from searchtweets import ResultStream, gen_request_parameters, load_credentials, collect_results, convert_utc_time
import pandas as pd
import numpy as np
import json

In [2]:
def countTweets(startDate, endDate, tweets_per_hour):
    '''
    Specify dates in DD.MM.YYY format (no leading 0 for months or days)
    '''
    
    s_d, s_m, s_y = [ int(i) for i in startDate.split('.')]
    e_d, e_m, e_y = [ int(i) for i in endDate.split('.')]

    endDate = date(e_y, e_m, e_d)
    startDate = date(s_y, s_m, s_d)
    days = endDate-startDate
    print("From {} to {} we have {} days, {} hours, and {} tweets (with {} tweets per hour)".format(startDate, 
                                                                                                    endDate, 
                                                                                                    days.days, 
                                                                                                    days.days*24, 
                                                                                                   days.days*24*tweets_per_hour,
                                                                                                   tweets_per_hour))
    
    
countTweets('23.10.2020', '24.05.2021', 10)     

From 2020-10-23 to 2021-05-24 we have 213 days, 5112 hours, and 51120 tweets (with 10 tweets per hour)


In [138]:
countTweets('23.10.2020', '1.11.2020', 1000) 

From 2020-10-23 to 2020-11-01 we have 9 days, 216 hours, and 216000 tweets (with 1000 tweets per hour)


In [90]:
class twitterData():
    '''
    A class for holding all the Twitter search related elements, from validating credentials
    to cleaning the data.
    '''
        
    def __init__(self, main_path = '/Volumes/Survey_Social_Media_Compare/Methods/'):
        '''
        
        '''
        
        self.main_path = main_path
    
    def validate_credentials(self):
        '''
        
        '''
        c_path = path.join(self.main_path, 'Scripts/Twitter/twitter_keys.yaml')
        self.credentials = load_credentials(c_path, 
                                       env_overwrite=True);
        self.all_requests = 0;
        self.total_results_overall = 0;

        return "Credentials validated successfully"
    
    
    
    def build_query(self,
                    mainTerms, 
                    startDate,
                    endDate,
                    inQuotes = True, 
                    language = 'en', 
                    country = 'US',
                    excludeRT = False,
                    results_per_call = 500,
                    return_fields = 'id,created_at,text,public_metrics',
                    otherTerms = []):
        
        '''
        Builds the query that is used to make the requests and get payloads.
        
        Parameters:
            mainTerms (str): The search terms we want, e.g. 'jobs'
            startDate (str): The lower end of the period we are interested in YYY-MM-DD HH:MM format, 
                             e.g. '2020-10-23 13:00'
            endDate (str): The higher end of the period we are interested in in YYY-MM-DD HH:MM format, 
                             e.g. '2020-10-23 14:00'
            inQuotes (bool): Do we want an exact phrase match? If true the terms will be put in quotes
            language (str): Language used in the query (only languages supported by Twitter + 
                            has to be in the correct format, see https://bit.ly/2RBwmGa)
            country (str): Country where Tweet/User is located (has to be in the correct format, see
                            https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2)
            excludeRT (bool): Exclude retweets from the payload? Default False
            results_per_call (int): How many results per request? Max is 500 for the academic API.
            otherTerms (list): List of other search terms, e.g. ['#COVID', 'is:reply']
        
        Notes:
            - More notes on building queries here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query.
            - Tweets are fetched in reverse chronological order, i.e. starting at endDate 
            and continuing until a limit is reached.
            - endDate refers to previous day until 23:59
        '''
        
        # If excluding retweets, set rt to '-' 
        rt = '-is:retweet' if excludeRT == True else ''
        
        # Are the terms in quotes
        mainTerms = '"{}"'.format(mainTerms) if inQuotes == True else '{}'
        
        # Build query text
        queryText = '{} lang: {} place_country:{}'.format(mainTerms,
                                                         language,
                                                         country)
        
        # If there are other terms, include them in the queryText
        queryText = queryText.extend(other) if otherTerms != [] else queryText
        
        # Save these as will be used to determine limits
        self.results_per_call = results_per_call
        
        print(startDate)
        print(endDate)
            
        # Build query
        self.query = gen_request_parameters(queryText,
                                      start_time = startDate,
                                      end_time = endDate,
                                      tweet_fields = return_fields,
                                      results_per_call = self.results_per_call)
        
    
    def get_data(self, nTweets):
        '''
        
        '''
        
        self.rs = ResultStream(request_parameters = self.query,
                                  max_tweets = nTweets,
                                  output_format = "a",
                                  **self.credentials)
        
        self.result = list(self.rs.stream())
        
        # We can get the total requests made for a payload using:
        # twitterData_instance.rs.n_requests
        # twitterData_instance.rs.session_request_counter
        
        # This can be used to get the overall requests made and saving logs.
        self.all_requests += self.rs.session_request_counter       
        self.total_results_overall += self.rs.total_results
        
    
    def surveyDates(self):
        
        # Path to survey periods file
        s_path = path.join(self.main_path, 'Scripts/Surveys/table_details/surveyPeriods.xlsx')
        
        # Load survey periods
        self.surveyPeriods = pd.read_excel(s_path, sheet_name='AI+HPS')
        
        # Generate tuple of week start and end dates based on the collection dates in the Axios/Ipsos survey. 
        self.AI_weeks = [twitterData.weekFromDay(date) for date in self.surveyPeriods['A_I_start_date']]
        
#         # Get first monday and last sunday from the A/I data collection periods
#         firstDate,_ = twitterData.weekFromDay(surveyPeriods['A_I_start_date'][0])
#         _, lastDate = twitterData.weekFromDay(surveyPeriods['A_I_start_date'].iloc[-1])

#         # Create data ranges for all mondays/sundays starting with the first one covered in A/I.
#         self.mondays = pd.date_range(firstDate, lastDate, freq='W-MON')
#         self.sundays = pd.date_range(firstDate, lastDate, freq='W-SUN')
        
#         # Get strings
#         self.mondays_str = self.mondays.strftime('%Y-%m-%d')
#         self.sundays_str = self.sundays.strftime('%Y-%m-%d')

#         # Saving all the results in tuples
#         self.all_weeks = [(m, s) for m, s in zip(self.mondays, self.sundays)]
#         self.all_weeks_str = [(m, s) for m, s in zip(self.mondays_str, self.sundays_str)]

        # Get first monday and last sunday from the A/I data collection periods
        firstDate,_ = twitterData.weekFromDay(self.surveyPeriods['A_I_start_date'][0])
        _, lastDate = twitterData.weekFromDay(self.surveyPeriods['A_I_start_date'].iloc[-1])

        # Create data ranges for all mondays/sundays starting with the first one covered in A/I.
        self.mondays = pd.date_range(firstDate, lastDate, freq='W-MON')
#         self.sundays = pd.date_range(firstDate, lastDate, freq='W-SUN') # Legacy
        self.leading_mondays = pd.date_range(twitterData.nextMonday(firstDate), twitterData.nextMonday(lastDate), freq='W-MON')

        # Get strings
        self.mondays_str = self.mondays.strftime('%Y-%m-%d')
#         self.sundays_str = self.sundays.strftime('%Y-%m-%d') # Legacy
        self.leading_mondays_str = self.leading_mondays.strftime('%Y-%m-%d')

        # Saving all the results in a tuple
        self.all_weeks = [(m, s) for m, s in zip(self.mondays, self.leading_mondays)]
        self.all_weeks_str = [(m, s) for m, s in zip(self.mondays_str, self.leading_mondays_str)]
    
    
    def createDicts(self):
        
        self.allData = dict.fromkeys(self.mondays_str)
        self.logs = dict.fromkeys(self.mondays_str)
    
    def oneWeek(self, mainTerms, weekNum):
        '''
        TODO: re-write this docstring to reflect changes. 
        Convenience function for getting all the tweets from a specified period.
        The parameters are fed to **build_query()**, which has more parameters with the following default values:
                    inQuotes = False, 
                    language = 'en', 
                    country = 'US',
                    excludeRT = False,
                    results_per_call = 500,
                    return_fields = 'id,created_at,text,public_metrics',
                    otherTerms = []
        These should either be added to the build_query() call within the current function, or the defaults changed in build_query().
        Parameters:
            mainTerms (str): search 
            startDate (str): week starting (format: 'YYYY-MM-DD' w, e.g. '2020-10-23')
            endDate (str): week ending (~)
            
        Returns:
            week_df (pd.DataFrame): Payload returned by the query for the specified period in df format.   
        '''
        
        # Get start and end date from the week number
        startDate = self.all_weeks_str[weekNum - 1][0]
        endDate = self.all_weeks_str[weekNum - 1][1]

        
        # Could have also done
        # startDate = self.mondays_str[weekNum - 1]
        # endDate = self.leading_mondays[weekNum - 1]
        
        # Build the query with the specified terms
        self.build_query(mainTerms, startDate, endDate, results_per_call=500)
        
        # Get the data. 
        self.get_data(nTweets = 20000) # 20,000 is quite conservative, it's unlikely we would get more than ~10,000/week.
        
        
        # Clean data (-> pd.DataFrame) and save into dictionary, with the specified by the startDate (i.e. the date corresponding to Monday of any given week in the entire period covered)
        self.allData[startDate] = twitterData.get_df(self.result)
        
        # Calculate the time covered in a payload.
        # Most recent date/time in the df in datetime format
        self.most_recent = twitterData.toDatetime(max(self.allData[startDate]['created_at']))
        self.oldest = twitterData.toDatetime(min(self.allData[startDate]['created_at']))
        
        self.timeCovered = str(self.most_recent - self.oldest)
        
        self.logs[startDate] = {
            'mostRecent': self.most_recent,
            'oldest': self.oldest,
            'timeCovered': self.timeCovered,
            'sessionRequestCounter': self.rs.session_request_counter,
            'totalRequests': self.all_requests,
            'totalTweets': self.rs.total_results,
            'totalTweetsOverall': self.total_results_overall,
            'requestParams': self.rs.request_parameters
            }
            
        # Save current week's Monday date.
        # Will be used to name the files when saving.
        self.currentWeek = startDate
    
    def exportOneWeek(self, topic, saveDF = True, saveJSON = False):
        '''
        topic (str): "Employment" or "Vaccination"
        
        '''
        
        if saveJSON:
            json_path = path.join(self.main_path, 'Data/Twitter/{}/JSON/{}.json'.format(topic,self.currentWeek))
            
            with open(json_path, 'w') as fout:
                json.dump(self.result, fout)
            
        if saveDF:
            df_path = path.join(self.main_path, 'Data/Twitter/{}/CSV/{}.csv'.format(topic, self.currentWeek))
            self.allData[self.currentWeek].to_csv(df_path)
            
    @staticmethod
    def loadOneWeek(weekStart, topic, loadDF = True, loadJSON = False):
        '''
        weekStart (str): name of file to be loaded; same format as used for currentWeek, e.g. '2020-10-19'
        topic (str): "Employment" or "Vaccination"
        
        '''
        
        if(loadDF and not loadJSON):
            df = pd.read_csv('/Volumes/Survey_Social_Media_Compare/Methods/Data/Twitter/CSV/{}.csv'.format(topic, weekStart), index_col=0, dtype={'id': object})
            return df
        
        if(not loadDF and loadJSON):
            with open('/Volumes/Survey_Social_Media_Compare/Methods/Data/Twitter/JSON/{}.json'.format(topic, weekStart)) as f:
                result = json.load(f)
            
            return result
        
        if(loadDF and loadJSON):
            df = pd.read_csv('/Volumes/Survey_Social_Media_Compare/Methods/Data/Twitter/{}/CSV/{}.csv'.format(topic, weekStart), index_col=0, dtype={'id': object})
            
            with open('/Volumes/Survey_Social_Media_Compare/Methods/Data/Twitter/{}/JSON/{}.json'.format(topic, weekStart)) as f:
                result = json.load(f)
                    
            return df, result                                                       
    
    @staticmethod
    def toDatetime(dateStr):
        '''
        Take a date in the ISO format that we get from twitter "%Y-%m-%dT%H:%M:%S.000Z"
        and transform to a datetime for calculations.

        Parameters:
            dateStr (str): A date string (ISO format)
        
        Returns:
            dateDT (datetime): A datetime object  
        '''
        
        try:
            dateDT = datetime.strptime(dateStr, "%Y-%m-%d")
            
        except: 
            dateDT = datetime.strptime(dateStr, "%Y-%m-%dT%H:%M:%S.000Z")
            
        
        return dateDT
    
    @staticmethod
    def weekFromDay(day):
        '''
        Work the week starting and ending dates given any date.
        Params:
            day (datetime): Can be a Timestamp (pandas/numpy object) or a datetime.datetime object.

        Returns: 
            weekStart (Timestamp): The date corresponding to the start (i.e. Monday) of the date specified by *day* param.
            weekEnd (Timestamp): The date corresponding to the end (i.e. Sunday) of the date specified by *day* param.
        '''

        weekStart = day - timedelta(days=day.weekday())
        weekEnd = weekStart + timedelta(days=6)

        return weekStart.strftime('%Y-%m-%d'), weekEnd.strftime('%Y-%m-%d')
    
    @staticmethod
    def nextMonday(date):
        date = twitterData.toDatetime(date)

        nextM = date + timedelta(days=-date.weekday(), weeks=1)

        return nextM
        
    @staticmethod
    def get_df(dictLS):
        '''
        '''
        # Remove the entries (i.e. dictionaries) that contain
        # the key 'newest_id' from the payload, i.e. the result 
        # of our query (which is a list of dictionaries).        
        clean_json_list = [x for x in dictLS if 'newest_id' not in x]        
        
        df = pd.json_normalize(clean_json_list)
    
        return df
    
    @staticmethod
    def combineWeeks(dataDict):
        pass
    

# Testing

In [3]:
search1 = twitterData('/Volumes/Survey_Social_Media_Compare/Methods')
search1.validate_credentials()

'Credentials validated successfully'

## Single query: getting data for 1 hour

In [65]:
# Build a query.
search1.build_query('jobs','2020-10-23 00:00', '2020-10-23 01:00', results_per_call=500)

# Getting payload. 
# This is saved in self.results.
search1.get_data(nTweets = 1000)

# Clean data and save in a pd.DataFrame
df1 = twitterData.get_df(search1.result)
df1

Unnamed: 0,id,created_at,text,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1319442612487479301,2020-10-23T00:56:47.000Z,Exactly..they should loose their jobs.. https:...,0,0,0,0
1,1319442601444020225,2020-10-23T00:56:44.000Z,@MeidasTouch Is Trump going schizo on us that ...,1,1,3,0
2,1319442291749158915,2020-10-23T00:55:31.000Z,Thank you @connectmeetings for getting meeting...,1,0,3,2
3,1319442241052680193,2020-10-23T00:55:18.000Z,@jecoreyarthur Or another option for jobs,0,0,0,0
4,1319442109917728770,2020-10-23T00:54:47.000Z,“They took our jobs!!” Bro u didnt go to colle...,0,0,0,0
5,1319442103164936193,2020-10-23T00:54:46.000Z,I'm horrified by this. Any health professional...,0,0,1,0
6,1319441826332409856,2020-10-23T00:53:40.000Z,Part of me says “if only Hunter didn’t take th...,0,0,2,0
7,1319441463671967746,2020-10-23T00:52:13.000Z,@JohnDiesattheEn Modern debates are the NFL Bl...,0,0,0,0
8,1319441090911547392,2020-10-23T00:50:44.000Z,You have?\nThat’s all I want your amazing Earl...,0,0,0,0
9,1319440777085419521,2020-10-23T00:49:29.000Z,I’d say that this is a reason some jobs can’t ...,0,1,2,0


* The number of requests in a single query is saved in the instance attributed .rs.n_requests. 
* This is overwritten when a new request is made, but before that, this number (n_request) is added to the instance's .all_requests attribute. 
    * For example, below we can see that .n_requests = 1 after both the first and second payload (saved in df1), but .all_requests is 3. 
* The .all_requests attribute will be used for ensurign compliance with rate limits.
    * This could be done directly through *searchtweets*, which has built-in tools (e.g. exponential back-off), by making a single query for the whole period (~280 days).
    * However, since the period we are interested in covering here is quite big, this is probably not a good solutions (e.g. if something fails on request 5,000/7,000 all data is lost but all tweets already accessed will count towards the monthly rate limit)



In [213]:
print(search1.rs.n_requests)
print(search1.rs.session_request_counter)
print(search1.all_requests)

1
17
33


In [66]:
search1.build_query('jobs','2020-10-23 01:00', '2020-10-23 2:00', results_per_call=500)
search1.get_data(nTweets = 1000)
df2 = twitterData.get_df(search1.result)
df2

Unnamed: 0,created_at,text,id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,2020-10-23T01:59:41.000Z,Many of the jobs lost this year will never com...,1319458440633339904,0,1,2,0
1,2020-10-23T01:59:40.000Z,I have a 401K and investments. I like seeing t...,1319458438943019009,0,1,1,0
2,2020-10-23T01:58:56.000Z,You’re all sitting here while you’re still all...,1319458252460085249,0,0,0,0
3,2020-10-23T01:58:43.000Z,"In case you didn’t know this, the stock market...",1319458198932381697,0,0,1,0
4,2020-10-23T01:58:15.000Z,"@Jaycaleb8 Bro what are you saying, I live dow...",1319458079759568896,0,1,0,0
...,...,...,...,...,...,...,...
59,2020-10-23T01:05:01.000Z,Looking forward to hearing @JoeBiden talk abou...,1319444682720661504,0,0,2,0
60,2020-10-23T01:03:45.000Z,If Biden thinks Trump mishandled Covid 19 he n...,1319444365014609920,0,0,0,0
61,2020-10-23T01:02:48.000Z,@rrt003 @MSNBC @DailyNewsSA Well no. Also how ...,1319444125972914176,0,1,1,0
62,2020-10-23T01:02:16.000Z,Gays dont not liking Trump as a person. Ooohh ...,1319443992451248129,2,4,40,1


In [67]:
print(search1.rs.n_requests)
print(search1.rs.session_request_counter)
print(search1.all_requests)

1
55
110


The above also gives an indication of how many tweets to expect for our most basic query on 'jobs' (in one hour). 

## Single query: getting data for 1 week based on the data collection periods in the surveys.

In [7]:
# Load the survey periods
# These hold the dates on which data was collected for each survey in part
# Will use it to get twitter data from the same period.
surveyPeriods = pd.read_excel('/Volumes/Survey_Social_Media_Compare/Methods/Scripts/Surveys/table_details/surveyPeriods.xlsx', sheet_name='AI+HPS')

In [8]:
# Usage:
p1_start, p2_start = twitterData.weekFromDay(surveyPeriods['A_I_start_date'][0])
print("Week start(Monday): {} \nWeek end (Sunday): {}\nVar type: {},{}".format(p1_start, p2_start, type(p1_start), type(p2_start)))

Week start(Monday): 2020-10-19 
Week end (Sunday): 2020-10-25
Var type: <class 'str'>,<class 'str'>


In [59]:
search2 = twitterData('/Volumes/Survey_Social_Media_Compare/Methods')
search2.validate_credentials()
search2.build_query('jobs', p1_start, p2_start)
search2.get_data(nTweets=20000) # Covering a whole week now -> higher nTweets set as the limit.
df3 = twitterData.get_df(search2.result)
df3

Unnamed: 0,id,created_at,text,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1320151587403288579,2020-10-24T23:54:00.000Z,The need to pick up the pace on transitioning ...,8,0,34,0
1,1320151419857498112,2020-10-24T23:53:20.000Z,We're hiring! Click to apply: Restaurant Manag...,0,0,0,0
2,1320151277926555660,2020-10-24T23:52:46.000Z,‘Even on a rainy day — you can tell the Sunshi...,2,1,7,0
3,1320151038448390144,2020-10-24T23:51:49.000Z,Life is a struggle especially in between jobs ...,0,0,0,0
4,1320150725825941504,2020-10-24T23:50:34.000Z,💯 New York remains desirable for all the same ...,0,0,4,0
...,...,...,...,...,...,...,...
2438,1319303811517812737,2020-10-22T15:45:14.000Z,@MikeEmanuelFox @TeamTrump @CNN @MSNBC @ABC @C...,0,0,0,0
2439,1319303678533259269,2020-10-22T15:44:43.000Z,New claims below 800k for first time in foreve...,1,0,0,0
2440,1319303241998520322,2020-10-22T15:42:59.000Z,@TrumpWarRoom @CNN @MSNBC @ABC @CBS @NBCNews a...,0,0,0,0
2441,1319303155746787328,2020-10-22T15:42:38.000Z,@JoeBiden @CNN @MSNBC @ABC @CBS @NBCNews are y...,0,0,0,0


In [60]:
search2.rs.session_request_counter

48

## Building the weeks from survey data collection periods.

This will be moved to the surveyDates method of twitterData. 

In [61]:
# Get a single week start and end date as tuple. 
bla = twitterData.weekFromDay(surveyPeriods['A_I_start_date'][0])

# Get all the week's start and end date as a list of tuples.
bla2 = [twitterData.weekFromDay(date) for date in surveyPeriods['A_I_start_date']]

# Get the Monday dates.
bla3 = [bla2[i][0] for i in range(len(bla2))]

* This will be used later on for getting the overlap periods. 
* Howver, what we actuall want here is all the the week start and end dates from the first A/I collection period to the last (with no breaks).

In [110]:
# Get first monday and last sunday from the A/I data collection periods
firstDate,_ = twitterData.weekFromDay(surveyPeriods['A_I_start_date'][0])
_, lastDate = twitterData.weekFromDay(surveyPeriods['A_I_start_date'].iloc[-1])

# Create data ranges for all mondays/sundays starting with the first one covered in A/I.
mondays = pd.date_range(firstDate, lastDate, freq='W-MON')
sundays = pd.date_range(firstDate, lastDate, freq='W-SUN')

# Saving results in a tuple
bla4 = (mondays[0].strftime('%Y-%m-%d'), sundays[0].strftime('%Y-%m-%d'))

# Saving all the results in a tuple
bla5 = [(m.strftime('%Y-%m-%d'), s.strftime('%Y-%m-%d')) for m, s in zip(mondays, sundays)]

In [91]:
bla2

[('2020-10-19', '2020-10-25'),
 ('2020-11-09', '2020-11-15'),
 ('2020-11-16', '2020-11-22'),
 ('2020-11-30', '2020-12-06'),
 ('2020-12-07', '2020-12-13'),
 ('2020-12-14', '2020-12-20'),
 ('2021-01-04', '2021-01-10'),
 ('2021-01-18', '2021-01-24'),
 ('2021-01-25', '2021-01-31'),
 ('2021-02-01', '2021-02-07'),
 ('2021-02-15', '2021-02-21'),
 ('2021-02-22', '2021-02-28'),
 ('2021-03-01', '2021-03-07'),
 ('2021-03-15', '2021-03-21'),
 ('2021-03-29', '2021-04-04'),
 ('2021-04-12', '2021-04-18'),
 ('2021-05-03', '2021-05-09'),
 ('2021-05-17', '2021-05-23')]

In [111]:
bla5

[('2020-10-19', '2020-10-25'),
 ('2020-10-26', '2020-11-01'),
 ('2020-11-02', '2020-11-08'),
 ('2020-11-09', '2020-11-15'),
 ('2020-11-16', '2020-11-22'),
 ('2020-11-23', '2020-11-29'),
 ('2020-11-30', '2020-12-06'),
 ('2020-12-07', '2020-12-13'),
 ('2020-12-14', '2020-12-20'),
 ('2020-12-21', '2020-12-27'),
 ('2020-12-28', '2021-01-03'),
 ('2021-01-04', '2021-01-10'),
 ('2021-01-11', '2021-01-17'),
 ('2021-01-18', '2021-01-24'),
 ('2021-01-25', '2021-01-31'),
 ('2021-02-01', '2021-02-07'),
 ('2021-02-08', '2021-02-14'),
 ('2021-02-15', '2021-02-21'),
 ('2021-02-22', '2021-02-28'),
 ('2021-03-01', '2021-03-07'),
 ('2021-03-08', '2021-03-14'),
 ('2021-03-15', '2021-03-21'),
 ('2021-03-22', '2021-03-28'),
 ('2021-03-29', '2021-04-04'),
 ('2021-04-05', '2021-04-11'),
 ('2021-04-12', '2021-04-18'),
 ('2021-04-19', '2021-04-25'),
 ('2021-04-26', '2021-05-02'),
 ('2021-05-03', '2021-05-09'),
 ('2021-05-10', '2021-05-16'),
 ('2021-05-17', '2021-05-23')]

In [119]:
# Updated version of the above.

# Get first monday and last sunday from the A/I data collection periods
firstDate,_ = twitterData.weekFromDay(surveyPeriods['A_I_start_date'][0])
_, lastDate = twitterData.weekFromDay(surveyPeriods['A_I_start_date'].iloc[-1])

# Create data ranges for all mondays/sundays starting with the first one covered in A/I.
mondays = pd.date_range(firstDate, lastDate, freq='W-MON')
sundays = pd.date_range(firstDate, lastDate, freq='W-SUN')

# Get strings
mondays_str = mondays.strftime('%Y-%m-%d')
sundays_str = sundays.strftime('%Y-%m-%d')

# Saving all the results in a tuple
all_weeks = [(m, s) for m, s in zip(mondays, sundays)]
all_weeks_str = [(m, s) for m, s in zip(mondays_str, sundays_str)]

In [127]:
all_weeks[:5]

[(Timestamp('2020-10-19 00:00:00', freq='W-MON'),
  Timestamp('2020-10-25 00:00:00', freq='W-SUN')),
 (Timestamp('2020-10-26 00:00:00', freq='W-MON'),
  Timestamp('2020-11-01 00:00:00', freq='W-SUN')),
 (Timestamp('2020-11-02 00:00:00', freq='W-MON'),
  Timestamp('2020-11-08 00:00:00', freq='W-SUN')),
 (Timestamp('2020-11-09 00:00:00', freq='W-MON'),
  Timestamp('2020-11-15 00:00:00', freq='W-SUN')),
 (Timestamp('2020-11-16 00:00:00', freq='W-MON'),
  Timestamp('2020-11-22 00:00:00', freq='W-SUN'))]

In [128]:
all_weeks_str[:5]

[('2020-10-19', '2020-10-25'),
 ('2020-10-26', '2020-11-01'),
 ('2020-11-02', '2020-11-08'),
 ('2020-11-09', '2020-11-15'),
 ('2020-11-16', '2020-11-22')]

In [133]:
mondays_str

Index(['2020-10-19', '2020-10-26', '2020-11-02', '2020-11-09', '2020-11-16',
       '2020-11-23', '2020-11-30', '2020-12-07', '2020-12-14', '2020-12-21',
       '2020-12-28', '2021-01-04', '2021-01-11', '2021-01-18', '2021-01-25',
       '2021-02-01', '2021-02-08', '2021-02-15', '2021-02-22', '2021-03-01',
       '2021-03-08', '2021-03-15', '2021-03-22', '2021-03-29', '2021-04-05',
       '2021-04-12', '2021-04-19', '2021-04-26', '2021-05-03', '2021-05-10',
       '2021-05-17'],
      dtype='object')

## CORRECTION: Building the weeks from survey data collection periods.

The endate used by the API actually refers to the previous day until 23:59.
* e.g. '2020-10-26' will search everything up until '2020-10-25 23:59'
* So *sundays* are not actually needed, the endDate in each week's query will just be the following monday.

In [291]:
# Final version of the above.

# @staticmethod
def nextMonday(date):
    date = twitterData.toDatetime(date)
    
    nextM = date + timedelta(days=-date.weekday(), weeks=1)
    
    return nextM
    

# Get first monday and last sunday from the A/I data collection periods
firstDate,_ = twitterData.weekFromDay(surveyPeriods['A_I_start_date'][0])
_, lastDate = twitterData.weekFromDay(surveyPeriods['A_I_start_date'].iloc[-1])

# Create data ranges for all mondays/sundays starting with the first one covered in A/I.
mondays = pd.date_range(firstDate, lastDate, freq='W-MON')
leading_mondays = pd.date_range(nextMonday(firstDate), nextMonday(lastDate), freq='W-MON')

# Get strings
mondays_str = mondays.strftime('%Y-%m-%d')
leading_mondays = leading_mondays.strftime('%Y-%m-%d')

# Saving all the results in a tuple
all_weeks = [(m, s) for m, s in zip(mondays, leading_mondays)]
all_weeks_str = [(m, s) for m, s in zip(mondays_str, leading_mondays)]

In [292]:
all_weeks_str

[('2020-10-19', '2020-10-26'),
 ('2020-10-26', '2020-11-02'),
 ('2020-11-02', '2020-11-09'),
 ('2020-11-09', '2020-11-16'),
 ('2020-11-16', '2020-11-23'),
 ('2020-11-23', '2020-11-30'),
 ('2020-11-30', '2020-12-07'),
 ('2020-12-07', '2020-12-14'),
 ('2020-12-14', '2020-12-21'),
 ('2020-12-21', '2020-12-28'),
 ('2020-12-28', '2021-01-04'),
 ('2021-01-04', '2021-01-11'),
 ('2021-01-11', '2021-01-18'),
 ('2021-01-18', '2021-01-25'),
 ('2021-01-25', '2021-02-01'),
 ('2021-02-01', '2021-02-08'),
 ('2021-02-08', '2021-02-15'),
 ('2021-02-15', '2021-02-22'),
 ('2021-02-22', '2021-03-01'),
 ('2021-03-01', '2021-03-08'),
 ('2021-03-08', '2021-03-15'),
 ('2021-03-15', '2021-03-22'),
 ('2021-03-22', '2021-03-29'),
 ('2021-03-29', '2021-04-05'),
 ('2021-04-05', '2021-04-12'),
 ('2021-04-12', '2021-04-19'),
 ('2021-04-19', '2021-04-26'),
 ('2021-04-26', '2021-05-03'),
 ('2021-05-03', '2021-05-10'),
 ('2021-05-10', '2021-05-17'),
 ('2021-05-17', '2021-05-24')]

In [282]:
bla = twitterData.toDatetime(lastDate)

following_monday = bla + timedelta(days=-bla.weekday(), weeks=1)
print(following_monday)

2021-05-24 00:00:00


In [270]:
mondays

DatetimeIndex(['2020-10-19', '2020-10-26', '2020-11-02', '2020-11-09',
               '2020-11-16', '2020-11-23', '2020-11-30', '2020-12-07',
               '2020-12-14', '2020-12-21', '2020-12-28', '2021-01-04',
               '2021-01-11', '2021-01-18', '2021-01-25', '2021-02-01',
               '2021-02-08', '2021-02-15', '2021-02-22', '2021-03-01',
               '2021-03-08', '2021-03-15', '2021-03-22', '2021-03-29',
               '2021-04-05', '2021-04-12', '2021-04-19', '2021-04-26',
               '2021-05-03', '2021-05-10', '2021-05-17'],
              dtype='datetime64[ns]', freq='W-MON')

## Single query: using the oneWeek() method

In [81]:
search4 = twitterData('/Volumes/Survey_Social_Media_Compare/Methods')
search4.validate_credentials()

# Generate dates from survey (Axios-Ipsos) info
search4.surveyDates()

# Create dictionaries for saving the data
search4.createDicts()

# # Get data from week 1
search4.oneWeek('jobs', 1)

# # Show dataframe
search4.allData[search4.currentWeek]

2020-10-19
2020-10-26


Unnamed: 0,id,text,created_at,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count
0,1320515328955293703,@Gregt041 @SeanHouse90 @WPTV Let me get this s...,2020-10-25T23:59:23.000Z,0,2,0,0
1,1320515190572666882,@patrickbetdavid China/Jobs,2020-10-25T23:58:50.000Z,0,0,0,0
2,1320514772975132672,How I did my last fore jobs these 6 months wit...,2020-10-25T23:57:10.000Z,0,0,0,0
3,1320514318778052608,We’re looking for a talented Restaurant Manage...,2020-10-25T23:55:22.000Z,1,0,0,0
4,1320513842644979712,Was @GovWhitmer in on purchase of @HennigesCon...,2020-10-25T23:53:28.000Z,0,0,0,0
...,...,...,...,...,...,...,...
5522,1317981967376207873,@realDonaldTrump You are President we have los...,2020-10-19T00:12:42.000Z,0,0,0,0
5523,1317981499208142849,@AbjornSell @StacyLStiles @JoeBiden Obama gove...,2020-10-19T00:10:50.000Z,0,0,0,0
5524,1317980815842750466,"@KellyO How desperate are ""reporters"" to save ...",2020-10-19T00:08:08.000Z,8,11,43,1
5525,1317979592125276162,So many things wrong with this 4 min clip. \n\...,2020-10-19T00:03:16.000Z,0,0,0,0


In [72]:
# Check if dataframe is the same as what we got with the stacked individual methods.
# all(search4.allData[search4.currentWeek] == df3)

In [82]:
# Check logs
search4.logs[search4.currentWeek]

{'mostRecent': datetime.datetime(2020, 10, 25, 23, 59, 23),
 'oldest': datetime.datetime(2020, 10, 19, 0, 2, 6),
 'timeCovered': '6 days, 23:57:17',
 'sessionRequestCounter': 79,
 'totalRequests': 79,
 'totalTweets': 5527,
 'totalTweetsOverall': 5527,
 'requestParams': {'query': '"jobs" lang: en place_country:US',
  'max_results': 500,
  'start_time': '2020-10-19T00:00:00Z',
  'end_time': '2020-10-26T00:00:00Z',
  'tweet.fields': 'id,created_at,text,public_metrics',
  'next_token': 'b26v89c19zqg8o3fos8vq67afna0t42iw3icscqtbqs8t'}}

In [83]:
# Test saving
search4.exportOneWeek(topic = "Employment", saveDF=True, saveJSON=True)

In [92]:
# Test loading
testDF, testJSON = twitterData.loadOneWeek(weekStart = '2020-10-19', topic= "Employment", loadDF=True, loadJSON=True)

In [94]:
# Check if these are the same
print(all(testDF == search4.allData[search4.currentWeek]))
print(testJSON == search4.result)

True