### 10-30-2020


# TWEETINSIGHTS - PART 1 - Collecting tweets

### ALEX MAZZARELLA

### DATA SCIENCE full time course - BrainStation
### CAPSTONE PROJECT

# =============================================================

## Notebook introduction

The purpose of this notebook is to collect tweets through the Twitter API and store them in a CSV file.

We will create:

* a function to collect tweets
* additional code that calls the collecting function repeatedly.

We will use the **python-twitter** library, a Python wrapper around the Twitter API.

A few points to keep in mind:

* The Twitter API limits both the number of requests sent (currently 180 every 15 minutes) and the total number of tweets that can be downloaded (currently 500,000 every 30 days).

* Each request specifies the search terms. The API returns JSON with the full metadata for each tweet; it is then up to us to keep only what we are interested in for the purpose of our project.

For the most up-to-date API rules and options, see the [Twitter API documentation](https://developer.twitter.com/en/products/twitter-api).
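To make the limits above concrete, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and the 240 calls of 50 tweets each are illustrative values only; the limits themselves are the ones quoted above and may change over time.

```python
# Limits quoted in the text above (they may change; check the Twitter docs)
REQUEST_LIMIT = 180          # requests allowed per rate-limit window
WINDOW_SECONDS = 15 * 60     # length of the window: 15 minutes
MONTHLY_TWEET_CAP = 500_000  # tweets that can be downloaded every 30 days


def min_seconds_between_requests(request_limit=REQUEST_LIMIT,
                                 window_seconds=WINDOW_SECONDS):
    """Shortest evenly spaced interval that stays within the request limit."""
    return window_seconds / request_limit


def tweets_upper_bound(num_api_calls, tweets_per_call):
    """Most tweets a run can return if every call comes back full."""
    return num_api_calls * tweets_per_call


def fits_monthly_cap(num_api_calls, tweets_per_call, cap=MONTHLY_TWEET_CAP):
    """True if a run cannot exceed the 30-day download cap."""
    return tweets_upper_bound(num_api_calls, tweets_per_call) <= cap


print(min_seconds_between_requests())  # 5.0
print(tweets_upper_bound(240, 50))     # 12000
print(fits_monthly_cap(240, 50))       # True
```

Any interval longer than 5 seconds therefore stays inside the request limit, so in practice the binding constraint is more likely to be the 30-day tweet cap.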

## Import statements

In [1]:
import pandas as pd
from datetime import datetime
import time

import twitter

import os
import csv
import json
import requests

## Defining function to collect tweets

In [2]:
def scrape_tweets(tw_api, search_term):
    '''
    Returns a pandas dataframe with tweet data obtained from request.
    
    Input:
    - twitter API instance
    - search term (string). Note, the current version of the function does not verify 
    that the search term passed is a string. Assertion will be implemented at next function update
    
    Output:
    - Up to 50 tweets' metadata information in a pandas dataframe format
    
    Function sends a request to Twitter API, with the specified search string, for the most recent
    50 tweets (in English language).
 
    '''
       
    # collect one dict per tweet and build the DataFrame once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    # setting search parameters for API request
    search_res = tw_api.GetSearch(
                        # search term (passed as function parameter)
                        term = search_term,
                        # selecting language of tweets (en = english)
                        lang = 'en',
                        # most recent N tweets
                        result_type = 'recent',
                        # entities might include additional tweet info,
                        # like hashtags, urls, user mentions, symbols
                        include_entities = True,
                        # specifying format of data returned
                        return_json = True,
                        # max number of tweets to return
                        count = 50
                    )

    # flagging retweets and getting their full original text
    for tw in search_res['statuses']:

        # for each tweet returned, check whether it is a retweet:
        # retweets carry the original tweet under 'retweeted_status'.
        # A missing key raises KeyError, which we catch so that execution does not stop.
        try:
            # storing in a local var the original full text content
            tw_txt = tw['retweeted_status']['full_text']
            retweet = 1
        except KeyError:
            # not a retweet
            tw_txt = tw['full_text']
            retweet = 0

        # store selected tweet metadata content
        rows.append(
            {
            'tweet_id': tw['id'],
            'user_location': tw['user']['location'],
            'created_at': tw['created_at'],
            'handle': tw['user']['screen_name'],
            'tweet': tw_txt,
            'retweet': retweet,
            'hashtags': tw['entities']['hashtags'],
            'symbols': tw['entities']['symbols'],
            'user_mentions': tw['entities']['user_mentions'],
            'followers_count': tw['user']['followers_count'],
            'friends_count': tw['user']['friends_count']
            })

    return pd.DataFrame(rows)
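The retweet handling in the function can be tried without calling the API. The sketch below isolates that step in a hypothetical helper, `full_text_and_retweet_flag`, and runs it on two hand-made, heavily trimmed dictionaries. Their shape is an assumption based only on the keys the function reads; real payloads contain many more fields.

```python
def full_text_and_retweet_flag(tw):
    """Return (text, retweet_flag) for one tweet dict fetched in extended mode."""
    original = tw.get('retweeted_status')
    if original is not None and 'full_text' in original:
        # retweet: use the full text of the original tweet
        return original['full_text'], 1
    # not a retweet: use the tweet's own text
    return tw['full_text'], 0


# Hypothetical sample payloads, for illustration only
plain_tweet = {'full_text': 'Loving the new season!'}
retweeted = {
    'full_text': 'RT @someone: Loving the new sea...',
    'retweeted_status': {'full_text': 'Loving the new season!'},
}

print(full_text_and_retweet_flag(plain_tweet))  # ('Loving the new season!', 0)
print(full_text_and_retweet_flag(retweeted))    # ('Loving the new season!', 1)
```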

After some manual tuning, I found that the following request settings give a good balance between unique and duplicated results:

* 50 tweets per request
* one request every 3 minutes
* two search terms used alternately

Note that these settings are for two brands (Netflix, Xbox).

Since the volume of tweets matching different search terms may vary, it is advisable to run a couple of tests to fine-tune the parameters above.
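One way to run such a test is to measure how many of the collected tweets were already returned by an earlier request. The sketch below is a minimal, assumed approach rather than the procedure used for the tuning above: `duplicate_share` is a hypothetical helper, and the batches are made-up lists of tweet ids standing in for the `tweet_id` column of consecutive requests.

```python
def duplicate_share(batches):
    """Fraction of collected tweet ids that had already been seen before."""
    seen = set()
    total = 0
    duplicates = 0
    for batch in batches:
        for tweet_id in batch:
            total += 1
            if tweet_id in seen:
                duplicates += 1
            else:
                seen.add(tweet_id)
    # avoid dividing by zero when nothing was collected
    return duplicates / total if total else 0.0


# Made-up ids: three consecutive requests that partly overlap
batches = [[1, 2, 3, 4], [3, 4, 5, 6], [6, 7, 8, 9]]
print(duplicate_share(batches))  # 0.25
```

A high share suggests the requests are too frequent for the search term, while a share of zero may mean tweets are being missed between requests.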

## Running multiple request cycles

Here we will set up the instructions for the code to:

* initialize an API connection
* define API search terms
* define the interval between requests

Note: the code below is set up to run requests for two brands (in our case Netflix and Xbox), and it writes one separate CSV of tweet data for each brand.
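The order of requests can be previewed without calling the API. The sketch below uses a hypothetical helper, `plan_requests`, that is not part of the notebook: it assumes exactly two brands, queries both on the first request, then alternates between them while cycling through each brand's search terms with `itertools.cycle`. It is intended to mirror the order the loop in the next cell follows.

```python
from itertools import cycle


def plan_requests(num_requests, terms_by_brand):
    """Return [(request_number, [(brand, search_term), ...]), ...].

    Assumes terms_by_brand holds exactly two brands, in the order
    (brand used on odd requests, brand used on even requests).
    """
    brands = list(terms_by_brand)
    # one endless iterator of search terms per brand
    term_cycles = {brand: cycle(terms) for brand, terms in terms_by_brand.items()}
    plan = []
    for i in range(1, num_requests + 1):
        if i == 1:
            active = brands          # first request: query every brand
        elif i % 2 != 0:
            active = [brands[0]]     # odd requests: first brand
        else:
            active = [brands[1]]     # even requests: second brand
        plan.append((i, [(brand, next(term_cycles[brand])) for brand in active]))
    return plan


terms = {
    'netflix': ['@netflix -from:netflix', '#netflix -from:netflix'],
    'xbox': ['@xbox -from:xbox', '#xbox -from:xbox'],
}
for number, queries in plan_requests(4, terms):
    print(number, queries)
```

Printing the plan for a handful of requests is a cheap way to check that each brand alternates between its two search terms before starting a long run.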

In [3]:
########### Local variables declaration ##########


# total number of requests to send - must be int
num_requests = 240 
# the wait time in between requests (expressed in seconds, must be int)
interval = 180 


# search terms list Netflix - must be at least two!
src_terms_net = ['@netflix -from:netflix','#netflix -from:netflix'] 
# search terms counter - Netflix
src_term_cnt_net = 0
# variable used for storing csv file name (Netflix)
csv_net_name = '' 


# search terms list xbox - must be at least two!
src_terms_xbox= ['@xbox -from:xbox','#xbox -from:xbox'] 
# search terms counter - xbox
src_term_cnt_xbox= 0
# variable used for storing csv file name (xbox)
csv_xbox_name = '' 


# string used to print, for each completed request, which csv file the tweets were appended to
tw_added_to_csv = '' 
# string used to print, for each completed request, which search term was used
tw_src_term_used = '' 


##########  API authentication  ##########

# initializing parameters for authentication to Twitter API - insert your own credentials
tw_api = twitter.Api(
                        consumer_key = '',
                        consumer_secret = '',
                        access_token_key = '',
                        access_token_secret = '',
                        # returns full (non truncated) tweets
                        tweet_mode = 'extended'  
                        ) 

# check my API credentials
try:
    tw_api.VerifyCredentials()
except Exception as err:
    # report the reason for the failure instead of hiding it
    print(f'API authentication error: {err}')


#################### Sending API requests ####################
    
for i in range(1, num_requests+1):
    
    ########## First cycle of loop ##########
    
    # send a request for each of the two brands and create respective CSVs
    
    if i==1:
        # sends request to API with Netflix search terms
        df_req_net = scrape_tweets(tw_api, src_terms_net[src_term_cnt_net])
        # create csv files for the Netflix tweets  and writes tweets returned by request
        csv_net_name += 'net_tweets_pulled/net_Tweets' + datetime.now().strftime('%Y%m%d-%H%M%S') +'.csv'
        df_req_net.to_csv(csv_net_name, index = False)
                
        # same for xbox:
        # sends request to API with Xbox search terms
        df_req_xbox= scrape_tweets(tw_api, src_terms_xbox[src_term_cnt_xbox])
        # create CSV for xbox tweets and writes tweets returned by request
        csv_xbox_name += 'xbox_tweets_pulled/xbox_Tweets' + datetime.now().strftime('%Y%m%d-%H%M%S') +'.csv'
        df_req_xbox.to_csv(csv_xbox_name, index = False)
        
        
        # sets names of CSVs updated in the string variable that will be used for print statement
        tw_added_to_csv = csv_net_name + ' and ' + csv_xbox_name
        # sets search terms used in the string variable that will be used for print statement
        tw_src_term_used = src_terms_net[src_term_cnt_net] + ' and ' + src_terms_xbox[src_term_cnt_xbox]
        
        # increases counter for Netflix search terms - list must be greater than one!
        src_term_cnt_net += 1
        # increases counter for xbox search terms - list must be greater than one!
        src_term_cnt_xbox+= 1
    
    
    
    ########## from the second cycle of the loop ##########
    
    # in the following block we call the scraping function, which returns
    # a pandas dataframe with the tweets found; the calls alternate between
    # Netflix (odd loop cycles) and xbox (even loop cycles)
    
    # Odd loop cycles except the first one: calls scrape function for Netflix 
    if i%2 != 0 and i != 1:
        df_req_net = scrape_tweets(tw_api, src_terms_net[src_term_cnt_net])
        # appends df returned from scrape function to the Netflix csv created in the first loop cycle
        df_req_net.to_csv(csv_net_name, mode = 'a', header = False, index = False)
        
        # sets names of CSV updated in the string variable that will be used for print statement
        tw_added_to_csv = csv_net_name
        # sets search term used in the string variable that will be used for print statement
        tw_src_term_used = src_terms_net[src_term_cnt_net]
        
        # if the counter of search terms for Netflix points at the last term in the list: reset,
        # else increase by one
        if src_term_cnt_net == len(src_terms_net)-1:
            src_term_cnt_net = 0
        else: src_term_cnt_net += 1
    
    # even loop cycles (and excludes first cycle): calls scrape function for xbox
    elif i != 1:
        df_req_xbox= scrape_tweets(tw_api, src_terms_xbox[src_term_cnt_xbox])
        # appends df returned from scrape function to the xbox csv created in the first loop cycle
        df_req_xbox.to_csv(csv_xbox_name, mode = 'a', header = False, index = False)
        
        # sets names of CSVs updated in the string variable that will be used for print statement
        tw_added_to_csv = csv_xbox_name
        # sets search term used in the string variable that will be used for print statement
        tw_src_term_used = src_terms_xbox[src_term_cnt_xbox]
        
        # if the counter of search terms for xbox points at the last term in the list: reset,
        # else increase by one
        if src_term_cnt_xbox== len(src_terms_xbox)-1:
            src_term_cnt_xbox = 0
        else: src_term_cnt_xbox+= 1
    
    
        
    # confirm request and csv create/append with print statement
    print(f'Request n {i} completed - tweets added to {tw_added_to_csv} -- Search term used: {tw_src_term_used}')
    
    # setting a delay timer to pause until the next request
    # and printing countdown 
    print('countdown until next request...')
    for t in reversed(range(1,interval+1)):
        print(f'{t}  \r', end="")
        time.sleep(1)
    print(' \r', end='')
  


Request n 1 completed - tweets added to net_tweets_pulled/net_Tweets20201029-162357.csv and xbox_tweets_pulled/xbox_Tweets20201029-162358.csv -- Search term used: @netflix -from:netflix and @xbox -from:xbox
countdown until next request...
Request n 2 completed - tweets added to xbox_tweets_pulled/xbox_Tweets20201029-162358.csv -- Search term used: #xbox -from:xbox
countdown until next request...
167  

KeyboardInterrupt: 

The cell above was run only to show an example of the print statements, which inform the user about the progress of data collection (search term used and CSV files updated).

**NOTE:** the CSV files listed in those print statements are NOT in any of the delivered folders.
They were created just before submission, so I removed them right after running the example to keep them from interfering with the rest of the analysis.
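The CSV names shown in the example output combine a folder, a prefix and a timestamp. Writing to such a path fails if the folder does not exist yet, so the sketch below creates it first. `timestamped_csv_path` is a hypothetical helper, and a temporary directory stands in for the project folder so that running the example leaves no files behind.

```python
import os
import tempfile
from datetime import datetime


def timestamped_csv_path(folder, prefix, now=None):
    """Create the folder if needed and return folder/prefix + timestamp + .csv."""
    os.makedirs(folder, exist_ok=True)
    stamp = (now or datetime.now()).strftime('%Y%m%d-%H%M%S')
    return os.path.join(folder, f'{prefix}{stamp}.csv')


# Usage: a fixed datetime makes the resulting name predictable
with tempfile.TemporaryDirectory() as base:
    path = timestamped_csv_path(
        os.path.join(base, 'net_tweets_pulled'),
        'net_Tweets',
        now=datetime(2020, 10, 29, 16, 23, 57),
    )
    print(os.path.basename(path))  # net_Tweets20201029-162357.csv
```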