# MIDAS Summer Internship Task
### Problem 1: Python Problem
_This notebook is divided into two parts. The first part fetches the tweets through the Twitter API and saves them to a JSON file. The second part parses the saved file and displays the results as described in the problem statement._

### Fetching the Tweets
Here, I use the requests library to obtain an application-only bearer token from the Twitter API and then run a search query with that token. The search results are written to a local JSON file, one tweet per line, for use in the second half of this notebook.
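
The token request identifies the application with a Basic authorization header built from the `key:secret` pair encoded in Base64. A minimal sketch of that encoding step, assuming placeholder credentials and a hypothetical helper name `basic_auth_header` (neither is part of the notebook):

```python
import base64


def basic_auth_header(key, secret):
    """Build the Basic authorization header value for a key/secret pair."""
    pair = '{}:{}'.format(key, secret).encode('ascii')
    return 'Basic ' + base64.b64encode(pair).decode('ascii')


# Placeholder credentials for illustration only, not real ones.
header = basic_auth_header('my-key', 'my-secret')

# Decoding the part after "Basic " gives back the original pair.
decoded = base64.b64decode(header.split(' ', 1)[1]).decode('ascii')
print(decoded)  # my-key:my-secret
```

Base64 is an encoding, not encryption, so the header is only safe to send over HTTPS.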

In [1]:
import requests
import pandas as pd
import base64
import json
import csv
from datetime import datetime as dt

In [2]:
base_url = 'https://api.twitter.com/'
client_key = '[CLIENT KEY]'
client_secret = '[CLIENT SECRET KEY]'
file_name = 'data/midas_tweets.json'


def generate_key(client_key, client_secret):
    key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
    b64_encoded_key = base64.b64encode(key_secret)
    b64_encoded_key = b64_encoded_key.decode('ascii')
    
    return b64_encoded_key


def authenticate(encoded_key):
    auth_url = '{}oauth2/token'.format(base_url)
    
    auth_headers = {
        'Authorization': 'Basic {}'.format(encoded_key),
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'
    }
    
    auth_data = {
        'grant_type': 'client_credentials'
    }
    
    auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
    
    if auth_resp.status_code == 200:
        return auth_resp.json()['access_token']
    
    return None


def search(access_token):
    """Search the full archive using the Premium API.

    For the Standard API, use 1.1/search/tweets.json as search_url
    and 'q' as the query parameter.
    """
    
    # Replace <ENV> with the name of your dev environment
    search_url = '{}1.1/tweets/search/fullarchive/<ENV>.json'.format(base_url)
    
    search_headers = {
        'Authorization': 'Bearer {}'.format(access_token)
    }
    
    search_params = {
        'query': 'from:midasIIITD',
        'fromDate': '200603210000'
    }
    
    search_resp = requests.get(search_url, headers=search_headers, params=search_params)
    
    if search_resp.status_code == 200:
        return search_resp.json()
    
    return None


# Driver Code

b64_key = generate_key(client_key, client_secret)
access_token = authenticate(b64_key)

if access_token:
    tweet_data = search(access_token)
    if tweet_data:
        with open(file_name, 'w+') as json_file:
            for tweet in tweet_data['results']:
                json.dump(tweet, json_file)
                json_file.write('\n')
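
The driver code above stores only the first page of results. Premium search responses include a `next` token when more results are available, and sending that token back as the `next` request parameter returns the following page. The sketch below shows one way to loop over pages. The helper `collect_all`, the callable `fetch_page`, and the fake responses are all hypothetical and stand in for the real HTTP request so the example runs offline.

```python
def collect_all(fetch_page, max_pages=10):
    """Gather tweets from every page of a paginated search.

    fetch_page is a hypothetical callable: it takes the pagination token
    (None for the first page) and returns the decoded JSON response.
    """
    tweets = []
    token = None
    for _ in range(max_pages):
        page = fetch_page(token)
        tweets.extend(page.get('results', []))
        token = page.get('next')
        if token is None:
            # No token means this was the last page.
            break
    return tweets


# Stand-in for the real request, so the sketch runs without network access.
fake_pages = {
    None: {'results': [{'id': 1}, {'id': 2}], 'next': 'abc'},
    'abc': {'results': [{'id': 3}]},
}

all_tweets = collect_all(lambda token: fake_pages[token])
print(len(all_tweets))  # 3
```

With the real API, `fetch_page` would issue the same GET request as `search` and add `'next': token` to the parameters whenever the token is not `None`. The `max_pages` cap is a guard against exhausting the request quota.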

### Parsing the Tweets
Now that the fetched tweets are saved in a JSON file, we can parse them by reading the file line by line and saving the output in CSV format. I use the pandas library to build a DataFrame and view the tweets in tabular form.
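
The saved file holds one JSON object per line, which is why it can be read back one line at a time. A minimal sketch of that round trip, assuming made-up records and an in-memory buffer in place of the real file:

```python
import io
import json

# Made-up records standing in for tweets.
records = [
    {'text': 'first', 'favorite_count': 1},
    {'text': 'second', 'favorite_count': 0},
]

# Write one JSON object per line, as the fetching step does.
buffer = io.StringIO()
for record in records:
    json.dump(record, buffer)
    buffer.write('\n')

# Read the objects back, one line at a time.
buffer.seek(0)
restored = [json.loads(line) for line in buffer]
print(restored == records)  # True
```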

In [3]:
def extract_text(tweet):
    """
    Returns the tweet's full text and, for a quoted tweet or a
    retweet, also includes the text of the original tweet.
    """
    if 'quoted_status' in tweet.keys():
        rt = 'RT @{} '.format(tweet['quoted_status']['user']['screen_name'])
        if tweet['truncated']:
            rt = rt + tweet['extended_tweet']['full_text']
        else:
            rt = rt + tweet['text']
        return rt + ' ' + extract_text(tweet['quoted_status'])
    elif 'retweeted_status' in tweet.keys():
        rt = 'RT @{} '.format(tweet['retweeted_status']['user']['screen_name'])
        return rt + extract_text(tweet['retweeted_status'])
    else:
        if tweet['truncated']:
            return tweet['extended_tweet']['full_text']
        else:
            return tweet['text']


def extract_values(tweet):
    # Truncated tweets carry their complete entities under 'extended_tweet'.
    if tweet['truncated']:
        entities = tweet['extended_tweet']['entities']
    else:
        entities = tweet['entities']

    # Count only photos; tweets without media have no 'media' key at all.
    images = len([x for x in entities.get('media', []) if x['type'] == 'photo'])

    values = {}
    values['Date/Time'] = str(dt.strptime(tweet['created_at'], date_time_format))
    values['Likes'] = tweet['favorite_count']
    values['Retweets'] = tweet['retweet_count']
    values['Text'] = extract_text(tweet)
    # Report None rather than 0 when the tweet has no images.
    values['Images'] = images if images else None
    return values

# Driver Code

date_time_format = "%a %b %d %H:%M:%S +0000 %Y"
columns = ['Text', 'Date/Time', 'Likes', 'Retweets', 'Images']

# DataFrame.append was removed in pandas 2.0, so collect the rows in a
# list and build the frame once. dtype=object keeps None from becoming NaN.
rows = []
with open(file_name, 'r') as json_file:
    for line in json_file:
        rows.append(extract_values(json.loads(line)))

df = pd.DataFrame(rows, columns=columns, dtype=object)
df = df.where(df.notnull(), None)
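
The timestamp conversion can be checked on its own. The sketch below parses a `created_at` string with the format used in this notebook and, for comparison, with `%z`, which reads the UTC offset instead of matching it as literal text. The sample string is reconstructed for illustration from the first row of the output, not copied from the API response.

```python
from datetime import datetime as dt

# Illustrative value in the API's created_at layout.
created_at = 'Wed Mar 20 08:19:24 +0000 2019'

# Format used in this notebook: the offset is matched as literal text,
# so the result is a naive datetime.
naive = dt.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')

# Alternative: %z parses the offset and yields a timezone-aware datetime.
aware = dt.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')

print(str(naive))  # 2019-03-20 08:19:24
print(str(aware))  # 2019-03-20 08:19:24+00:00
```

The literal `+0000` form raises an error for any other offset, while `%z` accepts any offset. Both rely on English day and month names in the current locale.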

In [4]:
df

| | Text | Date/Time | Likes | Retweets | Images |
|---|---|---|---|---|---|
| 0 | @IEEEBigMM19 is also available on Facebook now... | 2019-03-20 08:19:24 | 1 | 1 | |
| 1 | RT @IEEEBigMM19 BigMM 2019 : IEEE BigMM 2019 –... | 2019-03-20 02:40:07 | 0 | 0 | |
| 2 | BigMM 2019 : IEEE BigMM 2019 – Call for Worksh... | 2019-03-18 02:27:47 | 6 | 3 | |
| 3 | Congratulations @midasIIITD team, Rohan, Prady... | 2019-03-17 14:22:04 | 15 | 4 | |
| 4 | We have emailed the task details to all shortl... | 2019-03-16 14:06:56 | 6 | 0 | |
| 5 | IEEE BigMM 2019 - Call for Workshop Proposals.... | 2019-03-16 09:20:29 | 1 | 1 | |
| 6 | Congratulations! Arijit, Ramit, @debanjanbhucs... | 2019-03-16 09:14:58 | 7 | 2 | |
| 7 | We will be releasing a very interesting task t... | 2019-03-16 05:13:14 | 7 | 2 | |
| 8 | RT @hcdiiitd Last day to register for #Portfol... | 2019-03-13 17:09:44 | 0 | 0 | |
| 9 | @ACMMM19 @sigmm @TheOfficialACM @acmmmsys @ACM... | 2019-03-13 04:11:24 | 1 | 0 | 1 |


In [5]:
df.to_csv('data/tweets_formatted.csv')
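
By default, `to_csv` also writes the DataFrame index as a first column without a header. When such a file is read back with `read_csv`, that column appears as `Unnamed: 0` unless the index is skipped on write or declared on read. A small sketch of both options, assuming a made-up two-row frame and an in-memory buffer instead of the real file:

```python
import io

import pandas as pd

# Made-up frame standing in for the tweet table.
df_demo = pd.DataFrame({'Likes': [1, 0], 'Retweets': [1, 0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = df_demo.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))  # ['Unnamed: 0', 'Likes', 'Retweets']

# Option 1: skip the index when writing.
without_index = pd.read_csv(io.StringIO(df_demo.to_csv(index=False)))

# Option 2: tell read_csv that the first column is the index.
as_index = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(without_index.columns))  # ['Likes', 'Retweets']
print(list(as_index.columns))       # ['Likes', 'Retweets']
```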