<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Retrieving Data from Social Media

## Twitter API

The API lets you retrieve tweet and user data from Twitter in JSON format: each data point exposes all the features that are observable on the social network.

**NOTES**
* The API has **limitations**: each endpoint can be queried only a limited number of times within a fixed 15-minute window.
* Python API Wrapper: _tweepy_ (https://tweepy.readthedocs.io/en/3.7.0/api.html)
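
To make the 15-minute window concrete, here is a minimal sketch of a client-side budget tracker. The class name `WindowBudget`, the per-window limit passed to it, and the injectable clock are assumptions made for illustration; the real per-endpoint limits are defined by Twitter and are not stated in this notebook.

```python
import time

WINDOW_SECONDS = 15 * 60  # fixed window length mentioned in the notes


class WindowBudget:
    """Hypothetical helper: counts the calls made in the current 15-minute window."""

    def __init__(self, max_calls, clock=time.time):
        self.max_calls = max_calls    # assumed per-window limit for one endpoint
        self.clock = clock            # injectable so the logic can be checked offline
        self.window_start = clock()
        self.calls = 0

    def seconds_to_wait(self):
        """Return 0 if a call is allowed now, else the seconds until the window resets."""
        now = self.clock()
        if now - self.window_start >= WINDOW_SECONDS:
            # The previous window has expired: start a fresh one.
            self.window_start = now
            self.calls = 0
        if self.calls < self.max_calls:
            return 0
        return self.window_start + WINDOW_SECONDS - now

    def record_call(self):
        """Register one call against the current window."""
        self.calls += 1
```

A caller would check `seconds_to_wait()` before each request and sleep for the returned duration when it is non-zero.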

## Instructions

1. Generate an API key from Twitter (https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html)

1. Interact with the API: retrieve account information, a sample of tweets, and the list of followers of a given user.


In [None]:
import time
import json
import tweepy
import pandas as pd

Helper function that handles rate-limit errors while iterating over a cursor.

In [None]:
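def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            # Rate limit hit: wait for the 15-minute window to reset, then retry
            print('API Rate Limit exceeded. Waiting...')
            time.sleep(15 * 60)
        except StopIteration:
            # Cursor exhausted: end the generator cleanly (required since Python 3.7)
            return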
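
The retry pattern can be checked offline, without credentials or network access. The sketch below is a retry loop in the same spirit; it also returns explicitly on `StopIteration`, which generators need on Python 3.7 and later. `FakeRateLimitError`, `FakeCursor`, and `limit_handled_sketch` are hypothetical stand-ins written for this test and are not part of tweepy.

```python
class FakeRateLimitError(Exception):
    """Stand-in for tweepy.RateLimitError (hypothetical, for offline testing)."""


class FakeCursor:
    """Hypothetical iterator that raises a rate-limit error once, then resumes."""

    def __init__(self, items, fail_at):
        self._items = list(items)
        self._pos = 0
        self._fail_at = fail_at
        self._failed = False

    def next(self):
        if self._pos == self._fail_at and not self._failed:
            self._failed = True
            raise FakeRateLimitError()
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item


def limit_handled_sketch(cursor, wait=lambda: None):
    """Retry loop with the sleep replaced by an injectable `wait` callable."""
    while True:
        try:
            yield cursor.next()
        except FakeRateLimitError:
            wait()    # the notebook version sleeps for 15 minutes here
        except StopIteration:
            return    # end the generator cleanly when the cursor is exhausted


waits = []
result = list(limit_handled_sketch(FakeCursor([1, 2, 3], fail_at=1),
                                   wait=lambda: waits.append("waited")))
```

No item is lost or duplicated: the failed call is simply retried after the wait, so `result` holds every item in order and `waits` records a single pause.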

## Twitter API Exercise

Twitter API credentials

In [None]:
cred = { "consumer_key" : "",
         "consumer_secret" : "",
         "access_token" : "",
         "access_token_secret" : ""
       }
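
As one possible alternative to typing the keys into the notebook, the same `cred` dictionary could be built from environment variables. This is only a sketch: the variable names in `ENV_NAMES` and the helper `load_credentials` are hypothetical and should be adapted to your own setup.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Build the `cred` dictionary from a mapping, failing early if a value is missing."""
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}
```

With the variables set, `cred = load_credentials()` yields a dictionary with the same four keys used in the cells that follow.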

In [None]:
consumer_key = cred['consumer_key']
consumer_secret = cred['consumer_secret']
access_token = cred['access_token']
access_token_secret = cred['access_token_secret']

In [None]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

Initialize Twitter object

In [None]:
twitter = tweepy.API(auth)

### a) Search for users matching _football players_ and save the result in tabular format

**Note**: set a reasonable maximum number of users to search for, and handle the API rate limit correctly

In [None]:
def save_user(u):
    # Flatten a tweepy User object into a plain dict (one table row)
    return {
        'id_user': u.id,
        'username': u.screen_name,
        'n_followers': u.followers_count,
        'n_following': u.friends_count,
        'lang': u.lang,
        'location': u.location,
        'created_at': u.created_at,
        'profile_pic_url': u.profile_image_url,
        'description': u.description,
        'protected': u.protected,
    }

In [None]:
query = 'football players'
N_MAX = 50

header = ['id_user','username','n_followers','n_following', 'lang','location','created_at','profile_pic_url', 'description','protected']

In [None]:
# Use search_users: twitter.search returns tweets, not user objects
rows = []
for u in limit_handled(tweepy.Cursor(twitter.search_users, q=query).items(N_MAX)):
    rows.append(save_user(u))

# Build the table once instead of appending row by row
users_df = pd.DataFrame(rows, columns=header)

In [None]:
users_df.head()

### b) For each user with fewer than 1000 followers, extract the followers and following lists, storing both in a single table.

In [None]:
users_filtered = users_df[users_df['n_followers'] < 1000]

In [None]:
print('#Users with fewer than 1000 followers: {}'.format(users_filtered.shape[0]))

In [None]:
users_filtered.head()

In [None]:
edges = []

for index, u in users_filtered.iterrows():
    id_user = u['id_user']

    # get all followers: each follower follows id_user
    for follower in limit_handled(tweepy.Cursor(twitter.followers_ids, user_id=id_user).items()):
        edges.append({'id_following': follower, 'id_followed': id_user})

    # get all following: id_user follows each of them
    for following in limit_handled(tweepy.Cursor(twitter.friends_ids, user_id=id_user).items()):
        edges.append({'id_following': id_user, 'id_followed': following})

# Build the table once instead of appending row by row
follow = pd.DataFrame(edges, columns=['id_following', 'id_followed'])
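
The direction of each row is easy to get wrong, so here is a small offline sketch of the convention used in the table: the first id always follows the second. The helper `build_edges` and the ids are made up for illustration.

```python
def build_edges(user_id, follower_ids, friend_ids):
    """Return (id_following, id_followed) pairs for one user."""
    edges = []
    for follower in follower_ids:
        edges.append((follower, user_id))    # follower -> user
    for friend in friend_ids:
        edges.append((user_id, friend))      # user -> friend
    return edges


# Made-up ids: user 10 is followed by 1 and 2, and follows 2 and 3
edges = build_edges(10, follower_ids=[1, 2], friend_ids=[2, 3])

# Accounts that user 10 follows and that follow user 10 back
followed_by_10 = {b for (a, b) in edges if a == 10}
followers_of_10 = {a for (a, b) in edges if b == 10}
mutual = followed_by_10 & followers_of_10
```

Storing both lists in one edge table makes questions such as mutual follows a matter of filtering on the two columns.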


### c) For each user followed by the searched users (a), extract and save their account information

Take care **not to waste API calls**: if a followed user is also a target user, their data is already in the table and should not be requested from the API again.

In [None]:
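# ids whose data is already stored; a set tests membership on values
# (`x in series` would test the index instead)
known_ids = set(users_df['id_user'])
new_rows = []

for index, u in users_filtered.iterrows():
    id_user = u['id_user']

    # rows where id_user is the follower: 'id_followed' holds the followed account
    for _, edge in follow[follow['id_following'] == id_user].iterrows():
        id_followed = edge['id_followed']

        if id_followed not in known_ids:
            following_data = twitter.get_user(user_id=id_followed)
            new_rows.append(save_user(following_data))
            known_ids.add(id_followed)

# append to already defined table
users_df = pd.concat([users_df, pd.DataFrame(new_rows, columns=header)], ignore_index=True)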
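
The idea of skipping users whose data is already stored can be verified offline by counting calls to a stub. In this sketch `fetch_missing` and `fake_fetch` are hypothetical helpers; `fake_fetch` stands in for the real API lookup.

```python
def fetch_missing(candidate_ids, known_ids, fetch):
    """Call `fetch` once for every id not already known; return the new records."""
    known = set(known_ids)
    fetched = []
    for uid in candidate_ids:
        if uid in known:
            continue              # already stored: no API call needed
        fetched.append(fetch(uid))
        known.add(uid)            # also guards against duplicates in candidate_ids
    return fetched


calls = []


def fake_fetch(uid):
    """Hypothetical stand-in for an API lookup; records each call it receives."""
    calls.append(uid)
    return {"id_user": uid}


# id 7 is already known and id 5 appears twice, so only two calls are expected
new_rows = fetch_missing([5, 7, 5, 9], known_ids=[7], fetch=fake_fetch)
```

Tracking the known ids in a set keeps each membership check cheap and ensures that every account is requested at most once.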

In [None]:
# save user data
spark_df = spark.createDataFrame(users_df)
spark_df.write.mode("overwrite").saveAsTable("default.football_players")

# save following and followers
spark_df = spark.createDataFrame(follow)
spark_df.write.mode("overwrite").saveAsTable("default.football_players_social_network")

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.