## Tweet Retrieval

In this notebook we set up all of the necessary imports, keys, and objects to gather our tweets. We used the Python package Twython to retrieve the tweets from Twitter. All of the data has already been gathered and saved as CSV files.

***NOTE:*** *No cell blocks have been commented out. Running this notebook will trigger the tweet retrieval process, which takes 45 minutes to complete.* 

In [None]:
#imports
from twython import Twython
import time
import datetime
import pandas as pd
import csv
import re
import glob
import os

#autocomplete
%config IPCompleter.greedy=True

#create twython object
twitter = Twython()

#input key and secret code
APP_KEY = '8W2Xn6qLQyrj1MFoBHXtXpquL'
APP_SECRET = 'DYGu67FqOBepDoem92tKsoVrpCVI8yGQ0OjwRDqd23mbjMNcWL'
twitter = Twython(APP_KEY, APP_SECRET)

Gathering tweets required a crash course in Twitter's API. To search for and gather tweets based on a specific date and company name, we used Twython's cursor() method, which returns a generator, to page through the results. After our first pull, we quickly realized that the majority of tweets were retweets or company responses to individual tweets, and we decided to omit both from our results. After reviewing the results from several pulls, we narrowed our search parameters to the following:
- Company name (Twitter handle)
- Date
- English tweets
- Original tweets (no retweets)
- Extended tweets (full text)
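
As a rough sketch of how these parameters map onto a search request, the helper below assembles the query string and keyword arguments. The function name `build_search_params` is hypothetical and not part of Twython; the search operators and argument names mirror those used in the retrieval cell of this notebook.

```python
def build_search_params(handle, since_date):
    """Assemble search arguments for one company and start date.

    Hypothetical helper for illustration only; not part of Twython.
    """
    # "-filter:nativeretweets" excludes retweets from the results
    query = f"{handle} since:{since_date} -filter:nativeretweets"
    return {
        "q": query,
        "lang": "en",              # English tweets only
        "tweet_mode": "extended",  # full text instead of truncated text
    }


params = build_search_params("@drpepper", "2018-04-16")
print(params["q"])
```

The resulting dictionary could then be unpacked into a call such as `twitter.search(**params)`.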

Since Twitter limits search requests to 450 every 15 minutes, we decided to gather 1,800 tweets per company per day. Each company's pull takes about 45 minutes (four batches of 450 with a 15-minute pause between batches), for a total search time of 3 hours each day across all 4 companies. Unfortunately, Twython was very temperamental and most pulls resulted in connection errors, especially during peak business hours (9 AM - 5 PM). As such, we opted to run the code below manually for each pull rather than all at once for all 4 companies (see note in code).
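
The timing above can be checked with a few lines of arithmetic. This is only a sketch: `pull_schedule` is a hypothetical helper, and it assumes one batch of 450 tweets per 15-minute window with no pause after the final batch, as in the retrieval loop in this notebook.

```python
import math


def pull_schedule(total_tweets=1800, batch_size=450, pause_minutes=15):
    """Return (number of batches, total minutes spent pausing)."""
    batches = math.ceil(total_tweets / batch_size)
    pauses = batches - 1  # no pause is needed after the last batch
    return batches, pauses * pause_minutes


batches, wait_minutes = pull_schedule()
companies = 4
total_hours = wait_minutes * companies / 60
print(batches, wait_minutes, total_hours)
```

With the defaults, this gives 4 batches and 45 minutes of waiting per company, or 3 hours for all 4 companies.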

Each Twitter object contains *numerous* attributes and sub-objects. In order to make our analysis more interesting, we decided to gather the following attributes for 3 sub-objects:

**Tweet**
- Tweet text
- Tweet ID
- Date
- Number of times retweeted
- Number of times favorited
- Popularity

**User**
- User ID
- Number of followers
- Number of friends
- Number of tweets generated

**Entity** (data in tweet)
- Number of links
- Number of hashtags 
- URL
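
To make the three groups concrete, here is a minimal sketch that flattens one tweet dictionary into a user row, a tweet row, and an entity row. The function `flatten_tweet` is hypothetical and the sample values are made up; the dictionary keys are the ones read by the retrieval code in this notebook.

```python
def flatten_tweet(tweet, company_tag):
    """Split one tweet dict into user, tweet, and entity rows (illustrative)."""
    user = tweet["user"]
    entities = tweet["entities"]
    user_row = [
        user["id"], user["followers_count"], user["friends_count"],
        user["statuses_count"],
        int(bool(user["url"])),  # 1 if the profile lists a URL, else 0
        company_tag,
    ]
    tweet_row = [
        tweet["id"], tweet["full_text"], tweet["retweet_count"],
        tweet["favorite_count"], tweet["created_at"],
        tweet["metadata"]["result_type"], user["id"], company_tag,
    ]
    entity_row = [
        len(entities["hashtags"]),   # number of hashtags in the tweet
        int(bool(entities["urls"])),  # 1 if the tweet contains a link
        tweet["id"], company_tag,
    ]
    return user_row, tweet_row, entity_row


# Made-up sample data, shaped like the fields used above
sample = {
    "id": 1, "full_text": "Love this! #soda", "retweet_count": 2,
    "favorite_count": 5, "created_at": "Mon Apr 16 12:00:00 +0000 2018",
    "metadata": {"result_type": "recent"},
    "user": {"id": 42, "followers_count": 10, "friends_count": 20,
             "statuses_count": 300, "url": None},
    "entities": {"hashtags": [{"text": "soda"}], "urls": []},
}
user_row, tweet_row, entity_row = flatten_tweet(sample, "@drpepper")
print(entity_row)
```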

The following code generates three datasets containing tweet, user and entity data for each company's tweets. 

In [None]:
## GRAB 1800 COMPANY TWEETS
# params to change in query: @company name, date
#company_tags = ['@drpepper', '@MonsterEnergy', '@CocaCola','@pepsi']

company_tag = '@drpepper'

#initialize lists to hold df data
userdf_lists = []
tweetdf_lists = []
entitydf_lists = []

#initialize variables for loop
num_tweets = 450 #corresponds to count parameter in .cursor() function; how many tweets we get at a time
counter = 0
stopper = num_tweets
last_id = 0
query = company_tag+' since:2018-04-16 -filter:nativeretweets' #all company tweets since date, exclude retweets

#get first tweet id for max_id
result = twitter.search(q=query, count=1, lang='en', tweet_mode="extended")
for tweet in result['statuses']:
    last_id = tweet['id']
    
# #manually assign last_id
# last_id = 983199594661150720
    

#NOTE: we could have looped through the list of company handles to retrieve tweets for 
#all companies at once, but connection errors were inevitable and would stop the entire process

#grab 450 tweets every 15 minutes until we've gotten 1800 (this will take 45 mins)
while counter < 1800: 
    results = twitter.cursor(twitter.search, q=query, lang='en', count=num_tweets, tweet_mode="extended", max_id=last_id)
    for tweet in results:
        #collect user data
        userid = tweet['user']['id']
        followers = tweet['user']['followers_count']
        friends = tweet['user']['friends_count']
        statuses = tweet['user']['statuses_count']
        if not tweet['user']['url']:
            url = 0
        else:
            url = 1
        #create list of user info to add to user df
        user_list = [userid, followers, friends, statuses, url, company_tag] 

        #collect tweet data
        last_id = tweet['id']
        text = tweet['full_text']
        retweets = tweet['retweet_count']
        favorited = tweet['favorite_count']
        date = tweet['created_at']
        popularity = tweet['metadata']['result_type']
        #create list of tweet info to add to tweet df
        tweet_list = [last_id, text, retweets, favorited, date, popularity, userid, company_tag]

        #collect entity data
        if not tweet['entities']['hashtags']:
            hashtags = 0
        else:
            hashtags = len(tweet['entities']['hashtags'])

        if not tweet['entities']['urls']:
            ent_url = 0
        else:
            ent_url = 1
        #create list of entity info to add to entity df
        entity_list = [hashtags, ent_url, last_id, company_tag]

        #add data to lists (these will go in csv file when we're done collecting)
        userdf_lists.append(user_list)
        tweetdf_lists.append(tweet_list)
        entitydf_lists.append(entity_list)

        counter += 1
        if counter >= stopper:
            break

    #update stopper
    stopper += num_tweets

    #print to indicate pause
    print(counter, " tweets", 'Timestamp: {:%Y-%m-%d %H:%M:%S}'.format(datetime.datetime.now()))

    #pause 15 mins before getting more tweets
    if counter != 1800:
        print("PAUSE")
        time.sleep(900)
    else:
        break

print("----DONE----")


The functions below generate three file names and convert each dataset into a CSV file.

In [None]:
##FUNCTIONS TO GENERATE CSV

#function that generates three csv filenames for each dataset
#params: name of company (string), date (string, formatted "month-day", e.g. "04-16")
#return: list of 3 string filenames
def make_filenames(name, month_day):
    user_fname = month_day+"_user_data_"+name+".csv"
    tweet_fname = month_day+"_tweet_data_"+name+".csv"
    entity_fname = month_day+"_entity_data_"+name+".csv"
    return [user_fname, tweet_fname, entity_fname]

#function that generates three csv files for each filename
#params: list of 3 filenames, list of datasets (userdf_lists, tweetdf_lists, entitydf_lists)
def make_csv_files(filenames, datasets):
    for name, item in zip(filenames, datasets):
        #newline='' prevents the csv module from writing blank rows on Windows
        with open(name, "w", newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerows(item)
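
As a quick sanity check of this file format, the sketch below writes a couple of made-up user rows to a temporary file named in the same pattern and reads them back. It opens the file with `newline=""`, as the `csv` module documentation recommends. The files have no header row, and every value is read back as a string.

```python
import csv
import os
import tempfile

# Made-up rows in the same column order as the user dataset
rows = [
    [101, 250, 180, 4200, 0, "@drpepper"],
    [102, 12, 40, 310, 1, "@drpepper"],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "04-16_user_data_drp.csv")
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)
    # Reading with utf-8-sig strips the byte order mark again
    with open(path, newline="", encoding="utf-8-sig") as f:
        loaded = list(csv.reader(f))

print(loaded[0])
```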

Here we generate three CSV files from the results of one pull. 

In [None]:
## EXPORT TO CSV

#date for files
date_match = re.search(r"0[3-4]-[0-9]+", query)
#list of datasets
datasets = [userdf_lists, tweetdf_lists, entitydf_lists]

#generate csv files for company based on query
if (re.search(r"^@drpepper", query)):
    make_csv_files(make_filenames("drp", date_match.group()), datasets)
elif (re.search(r"^@MonsterEnergy", query)):
    make_csv_files(make_filenames("monst", date_match.group()), datasets)
elif (re.search(r"^@CocaCola", query)):
    make_csv_files(make_filenames("coke", date_match.group()), datasets)   
else:
    make_csv_files(make_filenames("pepsi", date_match.group()), datasets)
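
One possible alternative to the if/elif chain is a dictionary lookup keyed on the handle at the start of the query. This is only a sketch: `SHORT_NAMES` and `filename_parts` are hypothetical names, and the date pattern is the same one used above, so it only matches dates in March or April.

```python
import re

# Hypothetical mapping from Twitter handle to the short name used in file names
SHORT_NAMES = {
    "@drpepper": "drp",
    "@MonsterEnergy": "monst",
    "@CocaCola": "coke",
    "@pepsi": "pepsi",
}


def filename_parts(query):
    """Return (short company name, "month-day" string) parsed from a query."""
    handle = query.split()[0]  # the handle is the first token of the query
    date_match = re.search(r"0[3-4]-[0-9]+", query)
    if date_match is None:
        raise ValueError("no March or April date found in query")
    return SHORT_NAMES[handle], date_match.group()


parts = filename_parts("@CocaCola since:2018-04-16 -filter:nativeretweets")
print(parts)  # ('coke', '04-16')
```

Unlike the else branch above, an unknown handle raises a `KeyError` here instead of silently being saved under the "pepsi" name.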
    