## Twitter Scraping For Twitch Streamer Recommender System

#### By: Aurelio Barrios

### What is Shown In This Notebook

This notebook accompanies `recommenderSystem.ipynb`, which builds a recommender system from scratch. It contains the pseudocode for how the data was scraped from Twitter using Tweepy, a Python package for the Twitter API. Because the credentials needed to run the API are personal, they are not included, and the notebook is therefore shown without executed cells.

### Imports

First, load the necessary imports, including `tweepy`, the Twitter API package.

In [None]:
import os
import tweepy
import urllib.request
import pandas as pd
from PIL import Image

### Credentials

Insert the credentials needed to access the Twitter API. They are left blank here because they are sensitive.

In [None]:
#store tweepy api credentials
api_key = ''
api_secret = ''

access_token = ''

In [None]:
#log in to use twitter api
auth = tweepy.AppAuthHandler(api_key, api_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)
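One way to keep the credentials out of the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names `TWITTER_API_KEY` and `TWITTER_API_SECRET` and the helper `load_credentials` are assumptions and are not part of the original project.

```python
import os


def load_credentials(env=os.environ):
    """Return (api_key, api_secret) read from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    api_key = env.get('TWITTER_API_KEY', '')
    api_secret = env.get('TWITTER_API_SECRET', '')
    # fail early with a clear message instead of a confusing auth error later
    if not api_key or not api_secret:
        raise RuntimeError('Twitter API credentials are not set')
    return api_key, api_secret
```

The returned pair could then be passed to the auth handler in place of the blank strings above.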

## Twitter Scraping Using Twitter API 

We begin by loading the CSV file `data/twitter_handles.csv`, which contains the Twitch channel name and the corresponding Twitter handle for some of the most prominent Twitch streamers.

In [None]:
#gather twitch and twitter handles
twitter_df = pd.read_csv('data/twitter_handles.csv')
twitter_df.head()

### Twitter Scraping: Step One

To build the recommender system, we must first establish where our connections will come from. This scraper goes through the Twitter handle of each Twitch streamer stored in the `twitter_df` dataframe and scrapes 200 of that streamer's followers. These followers are the basis for the connections needed to build the recommender system.

In [None]:
#make sure the output directory exists
os.makedirs('data/followers', exist_ok=True)
#loop through each twitch streamer twitter handle we have
for streamer in twitter_df['twitter_handle']:
    try:
        #get 200 of their followers
        followers = tweepy.Cursor(api.get_followers, screen_name=streamer, count=200).items(200)
        #store follower information in a list
        f_list = [[follower.id, follower.name, follower.screen_name] for follower in followers]
    except Exception as e:
        #skip this streamer so a missing or stale list is never saved
        print('Failed on streamer:', streamer, e)
        continue
    #build list into dataframe
    curr_df = pd.DataFrame(f_list, columns=['id', 'name', 'screen_name'])
    #save the dataframe into file
    curr_df.to_csv('data/followers/' + streamer + '_200_followers.csv', index=False)

In [None]:
#used to get the limit status of our API
api.rate_limit_status()['resources']['followers']

### Twitter Scraping: Step Two

Now that we have a set of followers for each streamer, we must establish connections between streamers using these followers. The script below loops through all the streamers and the handles of the followers scraped in step one. Because of the API's rate limits, we use only 20 followers per streamer rather than the full 200. For each of those followers we collect up to 200 of the accounts they follow, which lets us see which other streamers they follow and so build the connections between streamers.
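To see why the number of followers per streamer matters, the rough estimate below counts how many rate-limit windows a scrape would consume. It assumes each follower costs one request (one page of up to 200 accounts) and that the endpoint allows 15 requests per 15-minute window; both limits and the figure of 100 streamers are assumptions for illustration, so check `api.rate_limit_status()` for the real values.

```python
import math


def estimate_scrape_minutes(n_streamers, followers_per_streamer,
                            requests_per_window=15, window_minutes=15):
    """Rough scrape time in minutes, assuming one request per follower."""
    total_requests = n_streamers * followers_per_streamer
    # each window allows a fixed number of requests, so round up
    windows = math.ceil(total_requests / requests_per_window)
    return windows * window_minutes


# hypothetical dataset of 100 streamers
print(estimate_scrape_minutes(100, 20))    # 2010
print(estimate_scrape_minutes(100, 200))   # 20010
```

Under these assumptions, 20 followers per streamer takes roughly a day and a half of waiting, while the full 200 would take about two weeks.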

In [None]:
count, max_users = 0, 21
#loop through each of the streamers
for streamer in twitter_df['twitter_handle']:
    #read in the file holding the twitter handles of up to 200 of the streamers followers
    streamer_file = 'data/followers/' + streamer + '_200_followers.csv'
    #skip streamers whose followers could not be scraped in step one
    if not os.path.exists(streamer_file):
        continue
    #shuffle the rows; frac=1 also works when fewer than 200 followers were saved
    curr_df = pd.read_csv(streamer_file).sample(frac=1, random_state=0)
    #create directory to store scraped data for each streamer
    outdir = 'data/streamers/' + streamer
    #streamers that already have a directory are skipped when the cell is rerun
    if not os.path.exists(outdir):
        os.makedirs(outdir)
        curr_index = 1
        #loop through the current streamers followers until 20 are scraped
        for _, row in curr_df.iterrows():
            #create file path to store data
            outfile = outdir + '/f' + str(curr_index) + '_' + row.screen_name + '.csv'
            try:
                #scrape a users data and gather 200 of the people that user is following
                following = tweepy.Cursor(api.get_friends, screen_name=row.screen_name, count=200).items(200)
                #store scraped data into list
                follow_list = [[follow.id, follow.name, follow.screen_name]
                              for follow in following]
                #save current user data into file
                curr_follow_df = pd.DataFrame(follow_list, columns=['id', 'name', 'screen_name'])
                curr_follow_df.to_csv(outfile, index=False)

                #count only the followers that were scraped successfully
                curr_index += 1
                count += 1
            except Exception as e:
                print('Failed on follower:', row.screen_name, e)
            #only want a max of 20 followers for each streamer
            if curr_index == max_users:
                break

In [None]:
#used to get rate limit status of API
api.rate_limit_status()['resources']['friends']
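The scraped files can be turned into streamer-to-streamer connections in several ways. The sketch below shows one simple possibility, counting how many scraped followers follow both streamers in a pair. It is an illustration only and not necessarily the method used in `recommenderSystem.ipynb`; the function `count_co_follows` and all handles in the example are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_co_follows(following_by_user, streamer_handles):
    """Count, for each pair of streamers, how many users follow both.

    following_by_user maps a follower's handle to the handles they follow.
    """
    streamers = {handle.lower() for handle in streamer_handles}
    pair_counts = Counter()
    for followed in following_by_user.values():
        # keep only the followed accounts that are streamers in our dataset
        shared = sorted({handle.lower() for handle in followed} & streamers)
        # every pair of streamers followed by the same user is one connection
        pair_counts.update(combinations(shared, 2))
    return pair_counts


# hypothetical data standing in for the scraped csv files
example = {
    'fan_a': ['streamer_one', 'streamer_two', 'someone_else'],
    'fan_b': ['streamer_one', 'streamer_two', 'streamer_three'],
    'fan_c': ['streamer_three'],
}
counts = count_co_follows(example, ['streamer_one', 'streamer_two', 'streamer_three'])
print(counts[('streamer_one', 'streamer_two')])  # 2
```

Pairs are stored in alphabetical order, so each connection has a single key regardless of the order in which a user followed the two streamers.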

### Twitter Scraping: Step Three

In this final step we scrape the profile image of each streamer in our dataset. This data is not used by the recommender system itself; it is used by the website where the system is deployed.

In [None]:
#collect one row of scraped data per streamer
cols = ['twitter_handle', 'profile_img_normal']
rows = []
#loop through all the streamers of interest
for user in list(twitter_df['twitter_handle']):
    try:
        #for each user we will get the users profile image url
        user_obj = api.get_user(screen_name=user)
        url = user_obj.profile_image_url
    except Exception as e:
        print('Failed with user:', user, e)
        url = ''
    #add scraped data to our rows
    rows.append([user, url])
#DataFrame.append was removed in pandas 2.0, so build the dataframe once
img_data = pd.DataFrame(rows, columns=cols)

In [None]:
#used to get rate limit status of API
api.rate_limit_status()['resources']['users']['/users/:id']

In [None]:
#helper function to get all sizes of the twitter profile image
def image_builder(x, replace_with=''):
    if replace_with == '_original':
        return x.replace('_normal', '')
    return x.replace('_normal', replace_with)

#get all sizes of the twitter profile images
for size_tag in ['_bigger', '_mini', '_original']:
    img_data['profile_img' + size_tag] = img_data['profile_img_normal'].apply(lambda x: 
                                                    image_builder(x, replace_with=size_tag))
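The helper above relies on Twitter's naming convention for profile images, where the size is encoded as a suffix before the file extension and the original image has no suffix. The standalone sketch below shows the resulting URLs for a single input; the function `profile_image_variants` and the example URL are hypothetical.

```python
def profile_image_variants(normal_url):
    """Return the URL of each size variant, keyed by size name."""
    variants = {'normal': normal_url}
    for size_tag in ['_bigger', '_mini']:
        # swap the '_normal' suffix for the requested size suffix
        variants[size_tag.lstrip('_')] = normal_url.replace('_normal', size_tag)
    # the original image is served without any size suffix
    variants['original'] = normal_url.replace('_normal', '')
    return variants


# hypothetical profile image url
example_url = 'https://pbs.twimg.com/profile_images/123/avatar_normal.jpg'
variants = profile_image_variants(example_url)
print(variants['bigger'])    # https://pbs.twimg.com/profile_images/123/avatar_bigger.jpg
print(variants['original'])  # https://pbs.twimg.com/profile_images/123/avatar.jpg
```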

In [None]:
#helper function to return the width and height of each image saved
def get_img_dimensions(img_url):
    #users whose lookup failed have an empty url, so no size is available
    if not img_url:
        return (None, None)
    with urllib.request.urlopen(img_url) as file:
        im = Image.open(file)
        return im.size

img_data = img_data.reset_index(drop=True)
#get image dimensions
img_data['dimensions'] = img_data['profile_img_original'].apply(get_img_dimensions)
#build width and height columns from dimensions
img_data[['width', 'height']] = pd.DataFrame(img_data['dimensions'].tolist(), 
                                             index=img_data.index)

In [None]:
#save our image data into a csv file
img_data.to_csv('data/profile_img.csv', index=False)