## Problem statement
You have to write a Python script that fetches all the tweets (as many as the Twitter
API allows) posted by the [midas@IIITD](https://twitter.com/midasIIITD) Twitter handle and dumps the responses into a JSON Lines file.
The other part of your script should parse these JSON Lines files and display the
following for every tweet in a tabular format:
* The text of the tweet.
* Date and time of the tweet.
* The number of favorites/likes.
* The number of retweets.
* The number of images present in the tweet. If the tweet has no image, return None.

### Imports

The first step is to import the libraries required for the task.
* [Tweepy](http://www.tweepy.org/) is a Python library for accessing the Twitter API.
* The other libraries to be imported are:
    * [json](https://docs.python.org/3.7/library/json.html) - For reading and writing JSON files.
    * [os](https://docs.python.org/3.7/library/os.html) - For OS-dependent functionality such as manipulating paths.
    * [pandas](https://pandas.pydata.org/) - A data analysis library. In this notebook, pandas is used to display information in tabular form.
 
    

In [1]:
import tweepy
from tweepy import OAuthHandler
import json
import os
import pandas as pd

In [2]:
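#Path for storing/retrieving JSON file
#The path is absolute, so joining it with os.getcwd() would have no effect:
#os.path.join discards earlier components when a later one is absolute
FILE_PATH = "/Users/Akshay/Desktop/tweets.json"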

### Authorization

To access data from the Twitter API, we have to complete an authorization step, which requires a [Twitter developer](https://developer.twitter.com/) account. After creating the account, we [create a new app](https://developer.twitter.com/en/apps) and follow the guidelines for filling in some details. Once the app is created successfully, its page lets us view or generate the consumer key, consumer secret key, access token and access token secret. These should not be made public, hence `#` is used as a placeholder in the following cell.

In [3]:
ACCESS_TOKEN = "#"
ACCESS_TOKEN_SECRET = "#"
CONSUMER_KEY = "#"
CONSUMER_SECRET_KEY = "#"
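
One way to keep the real values out of the notebook is to read them from environment variables. The following is a minimal sketch under that assumption; the variable names and the helper `load_credentials` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; use whatever matches your own setup.
CREDENTIAL_NAMES = (
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET_KEY",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing or empty."""
    missing = [name for name in CREDENTIAL_NAMES if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in CREDENTIAL_NAMES}
```

With the variables exported in the shell, `load_credentials()["TWITTER_CONSUMER_KEY"]` could stand in for the `"#"` placeholder above.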

Tweepy supports accessing Twitter via Basic Authentication and the newer method, OAuth. Twitter has stopped accepting Basic Authentication, so we have to use OAuth to use the Twitter API. The following cell shows how to get access to the Twitter API using Tweepy with OAuth:

In [4]:
auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

At this point, the authorization step is complete and we can head straight to the task.

## First Part - Fetch Tweets and Dump Them into a JSON Lines File

In the following cell, the function `fetch_and_dump_tweets(screen_name)` fetches the most recent tweets (up to 3200 at most) posted by `screen_name` and dumps them into the JSON file specified earlier by `FILE_PATH`. Specifically, it uses the [user_timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html) API method, which returns a collection of the most recent tweets, up to a maximum of 200 per request. Note that the optional parameter `max_id` is passed on every call to `user_timeline` after the first one. It restricts the results to tweets with an ID less than (that is, older than) or equal to the specified ID.

In [5]:
def fetch_and_dump_tweets(screen_name):
    '''
    Parameters-
        screen_name: Twitter handle of the user
    Variables -
        tweets: List (of length upto 200) of tweets by screen_name
        oldest: ID of the oldest fetched tweet, minus 1
        all_tweets_json: List of _json data from all_tweets
    Returns -   
        all_tweets: List of all tweets by screen_name
    '''
    #all_tweets initialized as an empty list
    all_tweets = []
    
    #most recent tweets, up to a maximum of 200
    tweets = api.user_timeline(screen_name=screen_name, count=200)
    
    #keep requesting older pages until a request returns no tweets
    #(the API only serves roughly the 3200 most recent tweets of a user)
    while len(tweets) > 0:
        
        #add the current page to all_tweets
        all_tweets.extend(tweets)
        
        #max_id is inclusive, so use (ID of the oldest fetched tweet - 1) to avoid fetching it twice
        oldest = all_tweets[-1].id - 1
        tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

    print(f"Total number of tweets from {screen_name} are {len(all_tweets)}")
    
    #all_tweets_json initialized as an empty list
    all_tweets_json = []
    
    #append _json corresponding to each of the tweet to the all_tweets_json
    for tweet in all_tweets:
        all_tweets_json.append(tweet._json)
        
    #dumping all_tweets_json to the file specified by FILE_PATH
    #sort_keys is set to True for sorting dictionaries by key
    #indent = 4, for pretty printing JSON array elements with indent level of 4
    with open(FILE_PATH, 'w', encoding='utf8') as f:
        json.dump(all_tweets_json, f, sort_keys = True,indent = 4)
    return all_tweets

Now we'll call this method to fetch tweets by [midas@IIITD](https://twitter.com/midasIIITD).

In [6]:
tweets_by_MIDAS = fetch_and_dump_tweets("midasIIITD")

Total number of tweets from midasIIITD are 296
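
The `max_id` paging used by `fetch_and_dump_tweets` can be checked without network access. The sketch below replaces `api.user_timeline` with a hypothetical stand-in, `fake_user_timeline`, that serves made-up integer IDs newest first and treats `max_id` as inclusive, which is the behaviour described above. It only illustrates the paging logic and makes no real API calls.

```python
def fake_user_timeline(all_ids, count=200, max_id=None):
    """Hypothetical stand-in for api.user_timeline: newest first, max_id inclusive."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]


def fetch_all_ids(all_ids, count=200):
    """Page backwards through the timeline until a request comes back empty."""
    collected = []
    page = fake_user_timeline(all_ids, count=count)
    while page:
        collected.extend(page)
        # Subtract 1 so the oldest tweet of this page is not returned again.
        oldest = collected[-1] - 1
        page = fake_user_timeline(all_ids, count=count, max_id=oldest)
    return collected


timeline = list(range(296, 0, -1))  # 296 made-up tweet IDs, newest first
fetched = fetch_all_ids(timeline)
print(len(fetched), len(set(fetched)))  # prints: 296 296
```

Two requests return tweets (200, then 96), a third returns nothing and ends the loop, and no ID is collected twice.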


Now, let's load the JSON file into `MIDAS_json`.

In [7]:
with open(FILE_PATH) as json_file:  
    MIDAS_json = json.load(json_file)
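
Note that the cells above store all tweets as a single pretty-printed JSON array. If a strict JSON Lines file (one JSON object per line) is wanted instead, a minimal sketch could look like the following; `write_jsonl` and `read_jsonl` are hypothetical helpers, not functions from the notebook.

```python
import json


def write_jsonl(records, path):
    """Write each record as one compact JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record stays on one line.
            f.write(json.dumps(record, sort_keys=True) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl(MIDAS_json, "tweets.jsonl")` would produce such a file (the file name is an assumption), and `read_jsonl("tweets.jsonl")` would return the same list of tweet dictionaries.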

Let's have a look at what one of these tweets looks like in JSON format.

In [8]:
MIDAS_json[12]

{'contributors': None,
 'coordinates': None,
 'created_at': 'Tue Mar 12 14:37:55 +0000 2019',
 'entities': {'hashtags': [],
  'symbols': [],
  'urls': [{'display_url': 'twitter.com/i/web/status/1…',
    'expanded_url': 'https://twitter.com/i/web/status/1105478029147553792',
    'indices': [110, 133],
    'url': 'https://t.co/XEkcYO8KmW'}],
  'user_mentions': [{'id': 1021355762575073281,
    'id_str': '1021355762575073281',
    'indices': [23, 34],
    'name': 'MIDAS IIITD',
    'screen_name': 'midasIIITD'}]},
 'favorite_count': 16,
 'favorited': False,
 'geo': None,
 'id': 1105478029147553792,
 'id_str': '1105478029147553792',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'possibly_sensitive': False,
 'retweet_count': 4,
 'retweeted': False,
 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web C

It looks like `entities` is the field that provides metadata about a particular tweet. Let's follow the `expanded_url` to see how this tweet actually looks on Twitter. ![Snapshot of tweet](https://i.imgur.com/IOoc3g1.png)<center><i>A snapshot of the tweet made on March 12.</i></center><br>
As we can see, this tweet contains an image, but `entities` has no details about it. Hence, we have to modify the function `fetch_and_dump_tweets` to get the complete information. After going through this [documentation](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object.html), it seems we'll have to include the `extended_entities` of each tweet in the JSON file. This can be done by passing the parameter `tweet_mode='extended'` to the `user_timeline` API method. Now we are good to go:

In [9]:
#Redefining the fetch_and_dump_tweets method.

def fetch_and_dump_tweets(screen_name):
    '''
    Parameters-
        screen_name: Twitter handle of the user
    Variables -
        tweets: List (of length upto 200) of tweets by screen_name
        oldest: ID of the oldest fetched tweet, minus 1
        all_tweets_json: List of _json data from all_tweets
    Returns -   
        all_tweets: List of all tweets by screen_name
    '''
    #all_tweets initialized as an empty list
    all_tweets = []
    
    #most recent tweets, up to a maximum of 200
    #Note that, this time an additional parameter, tweet_mode has been added
    tweets = api.user_timeline(screen_name=screen_name, count=200, tweet_mode='extended')
    
    #keep requesting older pages until a request returns no tweets
    #(the API only serves roughly the 3200 most recent tweets of a user)
    while len(tweets) > 0:
        
        #add the current page to all_tweets
        all_tweets.extend(tweets)
        
        #max_id is inclusive, so use (ID of the oldest fetched tweet - 1) to avoid fetching it twice
        oldest = all_tweets[-1].id - 1
        
        #Note that, this time an additional parameter, tweet_mode has been added
        tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest, tweet_mode='extended')

    print(f"Total number of tweets from {screen_name} are {len(all_tweets)}")
    
    #all_tweets_json initialized as an empty list
    all_tweets_json = []
    
    #append _json corresponding to each of the tweet to the all_tweets_json
    for tweet in all_tweets:
        all_tweets_json.append(tweet._json)
        
    #dumping all_tweets_json to the file specified by FILE_PATH
    #sort_keys is set to True for sorting dictionaries by key
    #indent = 4, for pretty printing JSON array elements with indent level of 4
    with open(FILE_PATH, 'w', encoding='utf8') as f:
        json.dump(all_tweets_json, f, sort_keys = True,indent = 4)
    return all_tweets

In [10]:
tweets_by_MIDAS = fetch_and_dump_tweets("midasIIITD")

Total number of tweets from midasIIITD are 296


In [11]:
with open(FILE_PATH) as json_file:  
    MIDAS_json = json.load(json_file)

Now, let's look at the `entities` of the same tweet.

In [12]:
MIDAS_json[12]['entities']

{'hashtags': [{'indices': [111, 116], 'text': 'team'},
  {'indices': [117, 126], 'text': 'research'},
  {'indices': [127, 130], 'text': 'AI'},
  {'indices': [131, 134], 'text': 'ML'},
  {'indices': [135, 144], 'text': 'projects'}],
 'media': [{'display_url': 'pic.twitter.com/lN7hItwPO9',
   'expanded_url': 'https://twitter.com/midasIIITD/status/1105478029147553792/photo/1',
   'id': 1105477322264772610,
   'id_str': '1105477322264772610',
   'indices': [145, 168],
   'media_url': 'http://pbs.twimg.com/media/D1dxxHzXgAIeNSE.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/D1dxxHzXgAIeNSE.jpg',
   'sizes': {'large': {'h': 1338, 'resize': 'fit', 'w': 2048},
    'medium': {'h': 784, 'resize': 'fit', 'w': 1200},
    'small': {'h': 444, 'resize': 'fit', 'w': 680},
    'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
   'type': 'photo',
   'url': 'https://t.co/lN7hItwPO9'}],
 'symbols': [],
 'urls': [],
 'user_mentions': [{'id': 1021355762575073281,
   'id_str': '1021355762575073281'

In [13]:
MIDAS_json[12]['extended_entities']

{'media': [{'display_url': 'pic.twitter.com/lN7hItwPO9',
   'expanded_url': 'https://twitter.com/midasIIITD/status/1105478029147553792/photo/1',
   'id': 1105477322264772610,
   'id_str': '1105477322264772610',
   'indices': [145, 168],
   'media_url': 'http://pbs.twimg.com/media/D1dxxHzXgAIeNSE.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/D1dxxHzXgAIeNSE.jpg',
   'sizes': {'large': {'h': 1338, 'resize': 'fit', 'w': 2048},
    'medium': {'h': 784, 'resize': 'fit', 'w': 1200},
    'small': {'h': 444, 'resize': 'fit', 'w': 680},
    'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
   'type': 'photo',
   'url': 'https://t.co/lN7hItwPO9'}]}

Success! Now we have the media related information and this completes the first part of our task.
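
The difference between the two outputs can be condensed into a small helper. The sketch below uses a hypothetical function, `media_types`, and two hand-written dictionaries that only mimic the shape of the outputs above (they are not real API responses). It prefers `extended_entities` and falls back to `entities` when that key is absent.

```python
def media_types(tweet):
    """Return the 'type' of every media item attached to a tweet dict."""
    container = tweet.get("extended_entities") or tweet.get("entities") or {}
    return [item["type"] for item in container.get("media", [])]


# Shaped like the first fetch: no media details at all.
compat_tweet = {"entities": {"hashtags": [], "urls": []}}

# Shaped like the fetch with tweet_mode='extended'.
extended_tweet = {
    "entities": {"media": [{"type": "photo"}]},
    "extended_entities": {"media": [{"type": "photo"}]},
}

print(media_types(compat_tweet))    # prints: []
print(media_types(extended_tweet))  # prints: ['photo']
```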

## Second Part - Parse the JSON Lines File and Get the Required Information in Tabular Form

The second part of the task is to parse the JSON file we created in the first part and present the required data in tabular form. This part contains a function, `parse_tweets(FILE_PATH)`, that parses the JSON file specified by `FILE_PATH` and returns the information as a Python list, `tweet_list`. `tweet_list` can then be displayed in a tabular format, as demonstrated in the subsequent cells.

In [37]:
def parse_tweets(FILE_PATH):
    '''
    Parameters-
        FILE_PATH: Path to the JSON file (tweets.json)
    Variables -
        tweets_by_MIDAS: List to load in the data from tweets.json
        tweet_info: Dictionary that will hold the required information for a tweet
        image_count: Counter for number of images in a tweet
        tweet_media: List that will hold all the media related information of a tweet
    Returns -   
        tweet_list: List of tweet_info
    '''
    #load tweets.json to tweets_by_MIDAS
    with open(FILE_PATH) as json_file:
        tweets_by_MIDAS = json.load(json_file)
        
        #tweet_list initialized as an empty list
        tweet_list=[]

        #Go through the json data of each tweet one by one
        for tweet in tweets_by_MIDAS:
            
            #an empty dictionary for a tweet
            tweet_info=dict()
            
            #get the text of tweet
            tweet_info['Text']=tweet['full_text']
            
            #get the date and time of the tweet
            tweet_info['Date and Time']= tweet['created_at']
            
            #get the number of likes and retweets for the tweet
            tweet_info['Number of Likes']=tweet['favorite_count']
            tweet_info['Number of Retweets']=tweet['retweet_count']

            #get the media information about the tweet
            #'extended_entities' is absent when a tweet has no media
            tweet_media = tweet.get('extended_entities', {}).get('media', [])
            
            #go through all the media and count the items that are photos/images
            image_count = sum(1 for media in tweet_media if media['type'] == 'photo')
            
            #No image (no media at all, or only videos/GIFs), hence set number of images to None
            tweet_info['Number of Images'] = image_count if image_count > 0 else None
                
            #append the tweet_info to tweet_list    
            tweet_list.append(tweet_info)
    return tweet_list

In [38]:
MIDAS_info = parse_tweets(FILE_PATH)

In [39]:
MIDAS_info[1]

{'Date and Time': 'Wed Mar 20 02:40:07 +0000 2019',
 'Number of Images': None,
 'Number of Likes': 0,
 'Number of Retweets': 3,
 'Text': 'RT @IEEEBigMM19: BigMM 2019 : IEEE BigMM 2019 – Call for Workshop Proposals  \n\nhttps://t.co/I4vqf8FE6K …  \nWhen: Sep 11, 2019 - Sep 13, 201…'}
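
The `created_at` value is kept as the raw string returned by the API. If a proper date and time object is preferred, the string can be parsed with the standard library. This is a small sketch, assuming the format seen in the outputs above and an English locale for the day and month names; `parse_created_at` is a hypothetical helper.

```python
from datetime import datetime

# Format of strings such as 'Wed Mar 20 02:40:07 +0000 2019'
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(created_at):
    """Convert a 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(created_at, TWITTER_DATE_FORMAT)


when = parse_created_at("Wed Mar 20 02:40:07 +0000 2019")
print(when.isoformat())  # prints: 2019-03-20T02:40:07+00:00
```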

At this point, we are almost done and just have to transform `MIDAS_info` into tabular form. This can easily be done with [pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), as shown in the cell below.

In [52]:
pd.options.display.max_rows = 3200 #To show all the rows
MIDAS_df = pd.DataFrame(MIDAS_info)
MIDAS_df = MIDAS_df[['Text', 'Date and Time', 'Number of Likes', 'Number of Retweets', 'Number of Images']] #Reorder columns
MIDAS_df.where(MIDAS_df.notnull(), None) #Replace NaN with None

| | Text | Date and Time | Number of Likes | Number of Retweets | Number of Images |
|---|---|---|---|---|---|
| 0 | @IEEEBigMM19 is also available on Facebook now... | Wed Mar 20 08:19:24 +0000 2019 | 1 | 1 | |
| 1 | RT @IEEEBigMM19: BigMM 2019 : IEEE BigMM 2019 ... | Wed Mar 20 02:40:07 +0000 2019 | 0 | 3 | |
| 2 | BigMM 2019 : IEEE BigMM 2019 – Call for Worksh... | Mon Mar 18 02:27:47 +0000 2019 | 6 | 3 | |
| 3 | Congratulations @midasIIITD team, Rohan, Prady... | Sun Mar 17 14:22:04 +0000 2019 | 14 | 4 | |
| 4 | We have emailed the task details to all shortl... | Sat Mar 16 14:06:56 +0000 2019 | 6 | 0 | |
| 5 | IEEE BigMM 2019 - Call for Workshop Proposals.... | Sat Mar 16 09:20:29 +0000 2019 | 1 | 1 | |
| 6 | Congratulations! Arijit, Ramit, @debanjanbhucs... | Sat Mar 16 09:14:58 +0000 2019 | 7 | 2 | |
| 7 | We will be releasing a very interesting task t... | Sat Mar 16 05:13:14 +0000 2019 | 7 | 2 | |
| 8 | RT @hcdiiitd: Last day to register for #Portfo... | Wed Mar 13 17:09:44 +0000 2019 | 0 | 2 | |
| 9 | @ACMMM19 @sigmm @TheOfficialACM @acmmmsys @ACM... | Wed Mar 13 04:11:24 +0000 2019 | 1 | 0 | 1.0 |
