# MIDAS IIIT-D Summer Internship Task 1

## Python Problem

### Problem Statement

You have to write a Python script that fetches all the tweets (as many as the Twitter API allows) posted by the 'midas@IIITD' Twitter handle and dumps the responses into a JSON Lines file.

The second part of the script should parse this JSON Lines file and display the following for every tweet in a tabular format:
- The text of the tweet.
- Date and time of the tweet.
- The number of favorites/likes.
- The number of retweets.
- The number of images present in the tweet. If the tweet has no images, return None.


## Index

1. [Importing libraries](#importing_libraries)
2. [API Credentials](#api_credentials)
3. [Getting ready for scraping twitter](#getting_started)
4. [Scrape tweets](#scrape_tweets)
5. [Save JSONLines file](#save_jsonl)
6. [Parse JSONLines file](#parse_jsonl)
7. [Display Table](#display_table)

<a id='importing_libraries'></a>
<hr>

### Importing libraries
- pandas for tabular data formatting 
- tweepy for accessing Twitter API
- json for writing and parsing JSONL files

In [1]:
import os
import pandas as pd
import tweepy 
from tweepy import OAuthHandler
import json

<a id='api_credentials'></a>

### Getting Twitter API keys and credentials from environment variables

In [2]:
ACCESS_TOKEN = os.getenv('ACCESS_TOKEN')
ACCESS_TOKEN_SECRET = os.getenv('ACCESS_TOKEN_SECRET')
CONSUMER_KEY = os.getenv('CONSUMER_KEY')
CONSUMER_SECRET = os.getenv('CONSUMER_SECRET')
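
os.getenv returns None for a variable that is not set, so a missing credential would only show up later as an authentication failure. A minimal sketch of an early check follows; the helper missing_credentials and the tuple REQUIRED_VARS are hypothetical names introduced here, not part of the original notebook.

```python
import os

# Names of the environment variables read in the cell above
REQUIRED_VARS = ('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET', 'CONSUMER_KEY', 'CONSUMER_SECRET')


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    env defaults to os.environ; passing a plain dict makes the check
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a hand-made environment (placeholder values, not real keys)
print(missing_credentials({'ACCESS_TOKEN': 'x', 'CONSUMER_KEY': 'y'}))
# prints ['ACCESS_TOKEN_SECRET', 'CONSUMER_SECRET']
```

Calling missing_credentials() with no argument before constructing the client would report which variables still need to be exported.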

<a id='getting_started'></a>

### Getting started with scraping twitter data

- Define a class that initialises a tweepy client object using the API keys
- The member function get_tweets() takes a Twitter username as input and scrapes as many tweets of that user as the API returns
- Tweets are requested in batches of 100 per call, and the client waits whenever the rate limit is reached, so that the Twitter API is not abused

In [3]:
class fetchTweets():
    def __init__(self):
        try:
            # Initialise an API object from tweepy using our credentials
            auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
            auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
            
            # wait_on_rate_limit makes tweepy sleep until the rate limit resets.
            # Note: this cell targets tweepy 3.x; wait_on_rate_limit_notify and
            # TweepError were removed in tweepy 4.
            self.api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

        except tweepy.TweepError as e:
            print(f'Error: Twitter Authentication Failed - {str(e)}')
            
    def get_tweets(self, screen_name):
        '''
        This function receives a Twitter username as input and 
        scrapes all tweets for that specific user.
        all_tweets is a list containing all tweet objects.
        After every request we save the ID of the oldest scraped tweet
        and use it as a reference (max_id) to download
        tweets which haven't yet been scraped.
        '''
        all_tweets = []
        
        # Most recent batch of tweets
        new_tweets = self.api.user_timeline(screen_name=screen_name, count=100, tweet_mode='extended')
        
        # Nothing to page through if the timeline is empty
        if not new_tweets:
            return all_tweets
        
        all_tweets.extend(new_tweets)
        
        # max_id is inclusive, so subtract 1 to skip the tweet we already have
        oldest = all_tweets[-1].id - 1
        
        # Keep requesting older tweets until a request comes back empty
        while len(new_tweets) > 0:
            print(f"Getting tweets before {oldest}")
            
            new_tweets = self.api.user_timeline(screen_name=screen_name, count=100, max_id=oldest, tweet_mode='extended')
            
            all_tweets.extend(new_tweets)
            
            oldest = all_tweets[-1].id - 1
            
            print(f"{len(all_tweets)} tweets have been scraped")
        return all_tweets
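
The max_id paging used in get_tweets() can be tried without network access. The sketch below replaces the Twitter API with a hypothetical fake_fetch function over a hard-coded list; paginate_by_max_id, fake_fetch and TIMELINE are illustrative names, not tweepy calls.

```python
def paginate_by_max_id(fetch_page, page_size=100):
    """Collect items newest-first by repeatedly asking for IDs below the oldest one seen.

    fetch_page(count, max_id) is a stand-in for api.user_timeline; it is
    assumed to return dicts with an 'id' key, newest first.
    """
    collected = []
    max_id = None
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest item of this page
        max_id = page[-1]['id'] - 1
    return collected


# Fake timeline of 7 "tweets" with IDs 70, 60, ..., 10 (newest first)
TIMELINE = [{'id': i} for i in range(70, 0, -10)]


def fake_fetch(count, max_id=None):
    """Return up to `count` fake tweets whose ID is at most max_id."""
    eligible = [t for t in TIMELINE if max_id is None or t['id'] <= max_id]
    return eligible[:count]


demo_tweets = paginate_by_max_id(fake_fetch, page_size=3)
print([t['id'] for t in demo_tweets])
# prints [70, 60, 50, 40, 30, 20, 10]
```

With a page size of 3 the loop makes four requests: three that return data and a final empty one that ends the paging, which mirrors the repeated "296 tweets have been scraped" line in the real run.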
    

### Initialising our fetchTweets class object

In [4]:
twitter = fetchTweets()

<a id='scrape_tweets'></a>

### Call member function on object to start scraping all tweets of @midasIIITD

In [5]:
tweets = twitter.get_tweets('midasIIITD')

Getting tweets before 1087712199836033023
200 tweets have been scraped
Getting tweets before 1037401364471508991
296 tweets have been scraped
Getting tweets before 1021377705084739583
296 tweets have been scraped


In [6]:
# sanity check - ensuring we're getting right data for a random tweet
tweets[8]._json['entities']

{'hashtags': [{'text': 'PortfolioCreationinDesign', 'indices': [39, 65]}],
 'symbols': [],
 'user_mentions': [{'screen_name': 'hcdiiitd',
   'name': 'Human-Centered Design (IIITDelhi)',
   'id': 1090887196754640896,
   'id_str': '1090887196754640896',
   'indices': [3, 12]}],
 'urls': []}

<a id='save_jsonl'></a>

### Define a function to save tweets in a JSONLines file

- JSON Lines, also called newline-delimited JSON, is a convenient format for storing structured data that may be processed one record at a time.
- Each line is a valid JSON value
- The line separator is '\n'
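
The three properties above can be checked with a small in-memory round trip. This is only a sketch: the two records are made-up placeholders, and io.StringIO stands in for a real file.

```python
import io
import json

# Made-up records; the first one contains a newline inside its text
records = [
    {'id': 1, 'text': 'first\ntweet'},
    {'id': 2, 'text': 'second tweet'},
]

# Write one JSON value per line, separated by '\n'
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + '\n')

# json.dumps escapes the embedded newline, so every record stays on one line
lines = buffer.getvalue().splitlines()
print(len(lines))
# prints 2

# Each line parses on its own, so records can be processed one at a time
restored = [json.loads(line) for line in lines]
print(restored == records)
# prints True
```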


In [7]:
def save_jsonl(tweets):
    '''
    This function takes as input a list of tweets.
    Tweepy represents these tweets as a list of Status objects.
    In every iteration we parse the status object and extract the json data.
    The json data is inserted in the file 'tweets.jsonl' with a '\n' separator 
    to make it JSONLines compatible.
    '''
    with open('tweets.jsonl', 'w') as f:
        for tweet in tweets:
            json.dump(tweet._json, f)
            f.write('\n')

In [8]:
save_jsonl(tweets)

The tweets have been saved in a file 'tweets.jsonl' inside the current directory.
The file can be read back line by line, as shown in the commented-out code below.

In [9]:
# The data of 'tweets.jsonl' file can be parsed as below
# with open('tweets.jsonl') as f:
#     for line in f:
#         print(line)

os.listdir()

['.ipynb_checkpoints', 'tweets_scraper.ipynb', 'tweets.jsonl']

As we can see, a new file has been created which contains all scraped tweets in JSONL format.

<a id='parse_jsonl'></a>

### Define a function to parse jsonl file
- This function parses the jsonl file line by line and saves the required data in a dictionary

In [10]:
def parse_jsonl(filename):
    '''
    This function receives the path of a jsonl file as input.
    It parses this jsonl file line by line and
    saves only the relevant details of each 
    tweet in a dictionary called tweets_dict.
    Each key of tweets_dict maps to a list with one entry per tweet.
    '''
    tweets_dict = {}
    
    with open(filename) as f:
        for line in f:
            tweet = json.loads(line)
            images = tweet['entities'].get('media', [])
            tweets_dict.setdefault('text', []).append(tweet['full_text'])
            tweets_dict.setdefault('datetime', []).append(tweet['created_at'])
            tweets_dict.setdefault('favorite_count', []).append(tweet['favorite_count'])
            tweets_dict.setdefault('retweet_count', []).append(tweet['retweet_count'])
            tweets_dict.setdefault('media', []).append(len(images))

    return tweets_dict
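
parse_jsonl() counts the entries of entities.media. One caveat, stated here as an assumption to verify against the saved 'tweets.jsonl': in the v1.1 tweet format the complete list of attached photos is expected under extended_entities.media, while entities.media may carry only the first one, and media entries can also be videos or GIFs. A hedged sketch of a stricter count, using a hypothetical helper count_images and hand-written sample dicts:

```python
def count_images(tweet):
    """Count photo attachments in a tweet dict, or return None if there are none.

    Prefers 'extended_entities' (assumed to list every attachment) and
    falls back to 'entities' when it is absent.
    """
    media = tweet.get('extended_entities', {}).get('media')
    if media is None:
        media = tweet.get('entities', {}).get('media', [])
    # Keep only still images; videos and animated GIFs have other types
    photos = [item for item in media if item.get('type') == 'photo']
    return len(photos) or None


# Hand-written samples, not real API responses
single = {'entities': {'media': [{'type': 'photo'}]}}
multiple = {
    'entities': {'media': [{'type': 'photo'}]},
    'extended_entities': {'media': [{'type': 'photo'}] * 3},
}
text_only = {'entities': {'hashtags': []}}

print(count_images(single), count_images(multiple), count_images(text_only))
# prints 1 3 None
```

Returning None directly for tweets without images would also make the later 0-to-None mapping in the table step unnecessary, at the cost of a column with mixed types.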

<a id='display_table'></a>

### Define a function to display table 
- This function takes a dictionary as input and creates a dataframe using it
- It then cleans the data and returns the dataframe in a presentable form

In [11]:
def display_table(tweets_dict):
    '''
    This function takes a dictionary as input.
    Data cleaning involves putting the columns in a fixed order,
    so the layout does not depend on how the dictionary was built.
    Image counts of 0 are mapped to None, newline characters
    are replaced by spaces and the datetime strings are parsed.
    '''
    df = pd.DataFrame.from_dict(tweets_dict)
    ordered_columns = ['text', 'datetime', 'favorite_count', 'retweet_count', 'media']
    df = df.reindex(columns=ordered_columns)
    df = df.replace('\n', ' ', regex=True)
    df['media'] = df['media'].map({0: 'None'}).fillna(df['media'])
    df['datetime'] = pd.to_datetime(df['datetime'])
    
    return df
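
pd.to_datetime infers the layout of the created_at strings on its own. To make that step explicit, the sketch below parses one timestamp with the standard library. The format string and the sample value are assumptions: the sample is reconstructed from the first row of the output table, and the layout is the one commonly seen in v1.1 responses.

```python
from datetime import datetime

# Assumed layout of the v1.1 'created_at' field, e.g. 'Wed Mar 20 08:19:24 +0000 2019'
TWITTER_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_FORMAT)


stamp = parse_created_at('Wed Mar 20 08:19:24 +0000 2019')
print(stamp.isoformat())
# prints 2019-03-20T08:19:24+00:00
```

The same format string could be passed to pd.to_datetime through its format parameter if the inferred parsing ever turns out to be slow or ambiguous.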

In [12]:
parsed_tweets = parse_jsonl('tweets.jsonl')

In [13]:
table = display_table(parsed_tweets)

In [14]:
table

| | text | datetime | favorite_count | retweet_count | media |
|---|---|---|---|---|---|
| 0 | @IEEEBigMM19 is also available on Facebook now... | 2019-03-20 08:19:24+00:00 | 1 | 1 | None |
| 1 | RT @IEEEBigMM19: BigMM 2019 : IEEE BigMM 2019 ... | 2019-03-20 02:40:07+00:00 | 0 | 4 | None |
| 2 | BigMM 2019 : IEEE BigMM 2019 – Call for Worksh... | 2019-03-18 02:27:47+00:00 | 6 | 3 | None |
| 3 | Congratulations @midasIIITD team, Rohan, Prady... | 2019-03-17 14:22:04+00:00 | 15 | 4 | None |
| 4 | We have emailed the task details to all shortl... | 2019-03-16 14:06:56+00:00 | 6 | 0 | None |
| 5 | IEEE BigMM 2019 - Call for Workshop Proposals.... | 2019-03-16 09:20:29+00:00 | 1 | 1 | None |
| 6 | Congratulations! Arijit, Ramit, @debanjanbhucs... | 2019-03-16 09:14:58+00:00 | 7 | 2 | None |
| 7 | We will be releasing a very interesting task t... | 2019-03-16 05:13:14+00:00 | 7 | 2 | None |
| 8 | RT @hcdiiitd: Last day to register for #Portfo... | 2019-03-13 17:09:44+00:00 | 0 | 2 | None |
| 9 | @ACMMM19 @sigmm @TheOfficialACM @acmmmsys @ACM... | 2019-03-13 04:11:24+00:00 | 1 | 0 | 1 |
