### 10-30-2020


# TWEETINSIGHTS - PART 1bis

### ALEX MAZZARELLA

### DATA SCIENCE full time course - BrainStation
### CAPSTONE PROJECT

# =============================================================

Note for the reader: this is a version of the tweet collection code adapted to return all tweets in a single CSV file.

I had initially planned to run the analysis for two brands, so the original code returned two separate CSVs, one per brand.

Since in the end I ran the sentiment and topic analysis for only one brand, I added this version of the scraping code, to be used when all the tweets are needed in one single file.

(I have not used this notebook for data collection in the capstone project.)

## Notebook introduction

The purpose of the code in this notebook is to collect tweets in a CSV file through the Twitter API.

We will create:

* a function to collect tweets
* additional code that calls the collecting function repeatedly.

We will use the **python-twitter** library, a Python wrapper around the Twitter API.

A few points to keep in mind:

* The Twitter API limits both the number of requests sent (currently 180 every 15 minutes) and the total number of tweets that can be downloaded (currently 500,000 every 30 days).

* With each request we can specify the search terms. The API returns a JSON with the full metadata for each tweet. It is then up to us to keep only the fields we are interested in for the purpose of our project.

To check the most up-to-date API rules and options, click [here](https://developer.twitter.com/en/products/twitter-api).
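
As a rough check of what these limits allow, the arithmetic can be sketched as follows. The limit values are the ones quoted in this notebook and may have changed since; the constant and function names are illustrative only.

```python
# Limits as quoted in this notebook (check the Twitter docs for current values)
REQUESTS_PER_WINDOW = 180      # requests allowed per rate-limit window
WINDOW_SECONDS = 15 * 60       # window length: 15 minutes
MONTHLY_TWEET_CAP = 500_000    # tweets that can be downloaded every 30 days
TWEETS_PER_REQUEST = 50        # tweets requested per call in this notebook


def min_interval_seconds(requests_per_window=REQUESTS_PER_WINDOW,
                         window_seconds=WINDOW_SECONDS):
    """Shortest wait between requests that stays within the request limit."""
    return window_seconds / requests_per_window


def max_requests_per_month(cap=MONTHLY_TWEET_CAP,
                           per_request=TWEETS_PER_REQUEST):
    """Number of full requests that fit under the monthly tweet cap."""
    return cap // per_request


print(min_interval_seconds())     # 5.0
print(max_requests_per_month())   # 10000
```

Under these assumptions, waiting at least 5 seconds between requests respects the request limit, and about 10,000 full requests fit within the monthly tweet cap.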

## Import statements

In [1]:
import pandas as pd
from datetime import datetime
import time

import twitter

import os
import csv
import json
import requests

## Defining function to collect tweets

In [2]:
def scrape_tweets(tw_api, search_term):
    '''
    Returns a pandas dataframe with tweet data obtained from request.
    
    Input:
    - twitter API instance
    - search term (string). Note, the current version of the function does not verify 
    that the search term passed is a string. Assertion will be implemented at next function update
    
    Output:
    - Up to 50 tweets' metadata information in a pandas dataframe format
    
    Function sends a request to Twitter API, with the specified search string, for the most recent
    50 tweets (in English language).
 
    '''
       
    # list used to collect one dictionary of selected metadata per tweet
    rows = []

    # setting search parameters for API request
    search_res = tw_api.GetSearch(
                        # search term (passed as function parameter)
                        term = search_term,
                        # selecting language of tweets (en = english)
                        lang = 'en',
                        # most recent N tweets
                        result_type = 'recent',
                        # entities might include additional tweet info,
                        # like hashtags, urls, user mentions, symbols
                        include_entities = True,
                        # specifying format of data returned
                        return_json = True,
                        # max number of tweets to return
                        count = 50
                    )

    # flagging retweets and getting their full original text
    for tw in search_res['statuses']:

        # a retweet carries the original tweet under 'retweeted_status';
        # its 'full_text' holds the complete (non truncated) original text
        if 'retweeted_status' in tw:
            tw_txt = tw['retweeted_status']['full_text']
            retweet = 1
        else:
            # no retweet
            tw_txt = tw['full_text']
            retweet = 0

        # store selected tweet metadata content
        rows.append(
            {
            'tweet_id': tw['id'],
            'user_location': tw['user']['location'],
            'created_at': tw['created_at'],
            'handle': tw['user']['screen_name'],
            'tweet': tw_txt,
            'retweet': retweet,
            'hashtags': tw['entities']['hashtags'],
            'symbols': tw['entities']['symbols'],
            'user_mentions': tw['entities']['user_mentions'],
            'followers_count': tw['user']['followers_count'],
            'friends_count': tw['user']['friends_count']
            })

    # build the dataframe once, at the end
    # (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows)
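
The retweet flagging in `scrape_tweets` can be checked offline, without API credentials, on hand-made dictionaries that mimic only the fields the function reads. This is a minimal sketch: `extract_text` is a hypothetical helper, not part of python-twitter, and the sample statuses are invented for illustration.

```python
def extract_text(tw):
    """Return (full_text, retweet_flag) for one status dictionary."""
    original = tw.get('retweeted_status')
    if original is not None:
        # retweet: use the full text of the original tweet
        return original['full_text'], 1
    # not a retweet: use the tweet's own full text
    return tw['full_text'], 0


# invented sample statuses, reduced to the keys used above
plain = {'full_text': 'hello world'}
retweet = {
    'full_text': 'RT @someone: hello wor',
    'retweeted_status': {'full_text': 'hello world, in full'},
}

print(extract_text(plain))     # ('hello world', 0)
print(extract_text(retweet))   # ('hello world, in full', 1)
```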

After some manual tuning, I found that a good request rate to balance the number of unique and duplicated results is:

* 50 tweets per request
* one request every 3 minutes
* using two search terms alternately

Since the number of tweets matching different search terms may vary, it is advisable to run a couple of tests to fine-tune the parameters above.
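
One way to make such a test concrete is to measure the share of duplicated tweet ids collected during a trial run. The sketch below uses an invented list of ids; `duplicate_share` is a hypothetical helper and not part of the notebook's collection code.

```python
def duplicate_share(tweet_ids):
    """Fraction of collected tweet ids that repeat an id already seen."""
    if not tweet_ids:
        return 0.0
    unique = len(set(tweet_ids))
    return 1 - unique / len(tweet_ids)


# invented ids standing in for the 'tweet_id' column of a trial run
ids = [101, 102, 103, 102, 104, 101, 105, 106]
print(duplicate_share(ids))   # 0.25
```

A high share suggests that requests are sent too often for the volume of new tweets, so the interval could be lengthened. A share close to zero suggests there may be room to send requests more frequently.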

## Running multiple requests cycles

Here we will set up the instructions for the code to:

* initialize an API connection
* define API search terms
* define the interval between requests
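
The code in this notebook rotates through the search terms with a manual counter. As an alternative sketch of the same rotation, the standard library's `itertools.cycle` can produce the sequence of terms directly. The values of `src_terms` and `num_requests` here are placeholders.

```python
from itertools import cycle, islice

# placeholder search terms and number of requests
src_terms = ['my_first_search_term', 'my_second_search_term', 'another_search_term']
num_requests = 5

# cycle() repeats the list endlessly; islice() stops after num_requests items
schedule = list(islice(cycle(src_terms), num_requests))

for i, term in enumerate(schedule, start=1):
    print(f'Request n {i} -- Search term used: {term}')
```

With three terms and five requests, the first two terms are used twice and the third once.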

In [3]:
########### Local variables declaration ##########


# total number of requests to send - must be int
num_requests = 240 
# the wait time in between requests (expressed in seconds, must be int)
interval = 10 


# search terms list - must be at least one
src_terms = ['my_first_search_term', 'my_second_search_term', 'another_search_term'] 
# search terms counter
src_term_cnt = 0
# variable used for storing csv file name
csv_name = '' 


# string used to print, for each completed request, the csv file the tweets were appended to
tw_added_to_csv = '' 
# string used to print, for each completed request, the search term that was used
tw_src_term_used = '' 


##########  API authentication  ##########

# initializing parameters for authentication to Twitter API - insert your own credentials
tw_api = twitter.Api(
                        consumer_key = '',
                        consumer_secret = '',
                        access_token_key = '',
                        access_token_secret = '',
                        # returns full (non truncated) tweets
                        tweet_mode = 'extended'  
                        ) 

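# verify the API credentials before sending any request
try:
    tw_api.VerifyCredentials()
except twitter.TwitterError as err:
    # report the problem and stop, since no request can succeed without valid credentials
    print(f'API authentication error: {err}')
    raise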


#################### Sending API requests ####################
    
for i in range(1, num_requests+1):
    
    ########## First cycle of loop ##########
    
    # send a first request and create CSV
    
    if i==1:
        # sends request to API with the current search term
        df_req = scrape_tweets(tw_api, src_terms[src_term_cnt])
        # creates the output folder if it does not exist yet, then creates
        # the csv file and writes the tweets returned by the request
        os.makedirs('brand-name_tweets_pulled', exist_ok = True)
        csv_name = 'brand-name_tweets_pulled/brand-name_Tweets' + datetime.now().strftime('%Y%m%d-%H%M%S') + '.csv'
        df_req.to_csv(csv_name, index = False)

        # sets name of the CSV updated in the string variable that will be used for print statement
        tw_added_to_csv = csv_name 
        # sets search term used in the string variable that will be used for print statement
        tw_src_term_used = src_terms[src_term_cnt]
        
        # if the counter of search terms points at the last term in the list,
        # go back to the first term; else increase the counter by one
        if src_term_cnt == len(src_terms)-1:
            src_term_cnt = 0
        else:
            src_term_cnt += 1
    
    
    
    ########## from the second cycle of the loop ##########

    # in the following block, we are calling the scraping function,
    # getting a pandas dataframe with the tweets returned,
    # and appending it to the csv created in the first cycle
    
    else:
        df_req = scrape_tweets(tw_api, src_terms[src_term_cnt])
        # appends df returned from scrape function to the csv created in the first loop cycle
        df_req.to_csv(csv_name, mode = 'a', header = False, index = False)
        
        # sets name of the CSV updated in the string variable that will be used for print statement
        tw_added_to_csv = csv_name
        # sets search term used in the string variable that will be used for print statement
        tw_src_term_used = src_terms[src_term_cnt]
        
        # if the counter of search terms points at the last term in the list,
        # go back to the first term; else increase the counter by one
        if src_term_cnt == len(src_terms)-1:
            src_term_cnt = 0
        else:
            src_term_cnt += 1
    
 
    
    
        
    # confirm request and csv create/append with print statement
    print(f'Request n {i} completed - tweets added to {tw_added_to_csv} -- Search term used: {tw_src_term_used}')
    
    # setting a delay timer to pause until the next request
    # and printing countdown 
    print('countdown until next request...')
    for t in reversed(range(1,interval+1)):
        print(f'{t}  \r', end="")
        time.sleep(1)
    print(' \r', end='')
  


Request n 1 completed - tweets added to brand-name_tweets_pulled/brand-name_Tweets20201106-151919.csv -- Search term used: my_first_search_term
countdown until next request...
Request n 2 completed - tweets added to brand-name_tweets_pulled/brand-name_Tweets20201106-151919.csv -- Search term used: my_second_search_term
countdown until next request...
8   

KeyboardInterrupt: 

The cell above was run, and then interrupted manually, just to show an example of the print statements, which inform the user about the progress of data collection (search term used and CSV file updated).