# MIDAS@IIITD Summer Internship Task

## Python Problem

***Note***: The entire work was done on [Google Colaboratory](https://colab.research.google.com/drive/19v50ZRM7wwj1ikRAWPnYJN4tXRXtGIvf).

***Part I: A Python script that fetches all the tweets from the midas@IIITD Twitter handle and dumps the responses into a JSON Lines file.***

***Introduction***: Fetching tweets from Twitter means using an API to mine data from a social media handle. We will use ***Tweepy*** because of its ***ease of use***, ***simplicity***, and ***availability***.

To do that we need a ***consumer key***, ***consumer secret***, ***access key*** and ***access secret***, which can be obtained from the [Twitter developer section](https://developer.twitter.com/en/apps) by registering your app.

These credentials let us access the public tweets of any Twitter user.

In [0]:
#importing tweepy library
import tweepy

# Fill the X's with the credentials obtained by above mentioned procedure. 
consumer_key = "XXXXXXXXXX" 
consumer_secret = "XXXXXXXXX"
access_key = "XXXXXXXXXXXXXXX"
access_secret = "XXXXXXXXXX"
#twitter handle
username = "midasIIITD"


# Authorization to consumer key and consumer secret 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 

# Access to user's access key and access secret 
auth.set_access_token(access_key, access_secret) 

# Calling api 
api = tweepy.API(auth) 
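
Hard-coding the four credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the credentials are exported as environment variables; the variable names and the load_credentials helper are hypothetical and not part of the original notebook:

```python
import os

# Hypothetical variable names; use whatever names you export in your environment.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_KEY",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values could then be passed to tweepy.OAuthHandler and set_access_token in place of the placeholder strings.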

A single user_timeline request returns at most ***200*** tweets. Hence, we will request the 200 most recent tweets from the ***timeline*** of midas@IIITD.

In [0]:
# maximum number of tweets returned by one request
number_of_tweets = 200
# use the API to read the timeline of the given username
tweets = api.user_timeline(screen_name=username, count=number_of_tweets)
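
If an account has more than 200 tweets, the timeline can be walked backwards page by page using the max_id parameter of the timeline endpoint, which serves up to roughly the 3,200 most recent tweets. The sketch below shows only the paging logic; fetch_all, fetch_page and the fake timeline are hypothetical stand-ins so that it runs without credentials or network access:

```python
def fetch_all(fetch_page, page_size=200, limit=3200):
    """Collect tweets page by page, walking backwards through the timeline.

    fetch_page(count, max_id) is a hypothetical callable that returns a list
    of tweet dicts, newest first, with IDs no greater than max_id.
    """
    collected = []
    max_id = None  # None means "start from the newest tweet"
    while len(collected) < limit:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break  # no older tweets left
        collected.extend(page)
        # Next, ask for tweets strictly older than the oldest one seen so far.
        max_id = page[-1]["id"] - 1
    return collected[:limit]


# Stand-in for the API so the sketch runs offline: 450 fake tweets, newest first.
FAKE_TIMELINE = [{"id": i} for i in range(450, 0, -1)]


def fake_fetch_page(count, max_id=None):
    older = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return older[:count]


all_tweets = fetch_all(fake_fetch_page)
print(len(all_tweets))
```

With Tweepy, fetch_page could wrap api.user_timeline(screen_name=username, count=count, max_id=max_id) and return the _json dictionary of each status. Tweepy also provides tweepy.Cursor for the same purpose.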

To dump the responses into a JSON Lines file, we need to import ***json***, a built-in package for encoding and decoding JSON data.

In [0]:
import json

The ***J***ava***S***cript ***O***bject ***N***otation was inspired by the object literal syntax of the JavaScript programming language. It is now a widely used, language-independent data format.

The JSON response from the Twitter API is available in the attribute _json, which is not the raw JSON string, but a dictionary.

In [0]:
# list of dictionaries, one per tweet from the Twitter handle
midas_tweets = [tweet._json for tweet in tweets]





***The process of encoding JSON is usually called serialization; it refers to the transformation of data into a series of bytes to be stored or transmitted across a network.***
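
As a small illustration of serialization, the sketch below turns a dictionary into a JSON string, then into bytes, and back again. The record is made up for the example and is not real tweet data:

```python
import json

# Made-up record shaped like a small part of a tweet; values are illustrative.
record = {"text": "Hello, MIDAS!", "favorite_count": 3, "retweet_count": 1}

as_text = json.dumps(record)                     # dict -> JSON string (serialization)
as_bytes = as_text.encode("utf-8")               # string -> bytes, ready to store or send
restored = json.loads(as_bytes.decode("utf-8"))  # bytes -> string -> dict

print(type(as_text).__name__, type(as_bytes).__name__)
print(restored == record)
```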

Hence, we can create a file using Python's context manager and open it in write mode, here responses.json, to store all the data from the Twitter handle.

This file will be used further to extract information from the tweets.

In [0]:
# dump all the responses into responses.json (json.dump writes the list as a single JSON array)
with open("responses.json", "w") as write_file:
    json.dump(midas_tweets, write_file)
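
Strictly speaking, the JSON Lines format stores one JSON object per line rather than one array for the whole file. A minimal sketch of writing and reading that layout, assuming a list of tweet dictionaries; write_jsonl, read_jsonl, the sample records and the responses.jsonl file name are hypothetical:

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, as the JSON Lines format expects."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]


# Illustrative stand-ins for the tweet dictionaries.
sample = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
path = os.path.join(tempfile.mkdtemp(), "responses.jsonl")
write_jsonl(sample, path)
print(read_jsonl(path) == sample)
```

pandas can read a file in this layout with pd.read_json(path, lines=True).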

***Part II: Parse the JSON Lines file to display the following information:***


*   The text of the tweet.
*   Date and time of the tweet.
*   The number of favorites.
*   The number of retweets.
*   Number of Images present in Tweet.




The JSON file can be parsed either through ***json.load()*** or using ***pandas***. I have used pandas because of its ***data analysis tools***. We can ***load the JSON file*** directly into it and then use it to ***express complex series***, both ***multidimensional*** and ***heterogeneous***.

In [0]:
# loading python package
import pandas as pd
# open existing data file for manipulation in a dataframe
df = pd.read_json("responses.json")
# display first five rows to take a look at data
df.head()

| | contributors | coordinates | created_at | entities | extended_entities | favorite_count | favorited | geo | id | id_str | ... | quoted_status | quoted_status_id | quoted_status_id_str | retweet_count | retweeted | retweeted_status | source | text | truncated | user |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 2019-04-08 07:08:12 | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 13 | False | | 1115149324533542912 | 1115149324533542912 | ... | | | | 2 | False | | `<a href="http://twitter.com" rel="nofollow">Tw...` | Many Congratulations to @midasIIITD student, S... | True | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 1 | | | 2019-04-08 03:27:42 | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 4 | False | | 1115093836341096449 | 1115093836341096448 | ... | | | | 0 | False | | `<a href="http://twitter.com" rel="nofollow">Tw...` | @midasIIITD thanks all students who have appea... | True | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 2 | | | 2019-04-07 14:17:29 | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 0 | False | | 1114894970886983680 | 1114894970886983680 | ... | | | | 0 | False | | `<a href="http://twitter.com" rel="nofollow">Tw...` | @himanchalchandr Meanwhile, complete CV/NLP ta... | False | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 3 | | | 2019-04-07 14:17:09 | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 0 | False | | 1114894886292029440 | 1114894886292029440 | ... | | | | 0 | False | | `<a href="http://twitter.com" rel="nofollow">Tw...` | @sayangdipto123 Submit as per the guideline ag... | False | `{'id': 1021355762575073281, 'id_str': '1021355...` |
| 4 | | | 2019-04-07 11:43:24 | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | 1 | False | | 1114856195335106560 | 1114856195335106560 | ... | | | | 1 | False | | `<a href="http://twitter.com" rel="nofollow">Tw...` | We request all students whose interview are sc... | True | `{'id': 1021355762575073281, 'id_str': '1021355...` |
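
For comparison, the json.load() route mentioned earlier can be sketched with the standard library alone. The sample file below is an illustrative stand-in for responses.json with made-up content, and summarize is a hypothetical helper. The date pattern assumes the raw created_at format used by the Twitter API, for example "Mon Apr 08 07:08:12 +0000 2019":

```python
import json
import os
import tempfile
from datetime import datetime

# Illustrative stand-in for responses.json; the tweet text is made up.
sample = [
    {
        "created_at": "Mon Apr 08 07:08:12 +0000 2019",
        "text": "Example tweet",
        "favorite_count": 13,
        "retweet_count": 2,
    },
]
path = os.path.join(tempfile.mkdtemp(), "responses.json")
with open(path, "w") as write_file:
    json.dump(sample, write_file)


def summarize(path):
    """Load a JSON array of tweets and keep only the fields of interest."""
    with open(path) as read_file:
        tweets = json.load(read_file)
    rows = []
    for tweet in tweets:
        rows.append({
            # Parse the raw timestamp string into a timezone-aware datetime.
            "created_at": datetime.strptime(
                tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"
            ),
            "text": tweet["text"],
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return rows


for row in summarize(path):
    print(row["created_at"], row["favorite_count"], row["retweet_count"], row["text"])
```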


Thus, using pandas we can extract the information we need. Let us list the columns to see what we have:

In [0]:
df.columns.values

array(['contributors', 'coordinates', 'created_at', 'entities',
       'extended_entities', 'favorite_count', 'favorited', 'geo', 'id',
       'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place',
       'possibly_sensitive', 'quoted_status', 'quoted_status_id',
       'quoted_status_id_str', 'retweet_count', 'retweeted',
       'retweeted_status', 'source', 'text', 'truncated', 'user'],
      dtype=object)

We have ***29 columns*** but need only ***5 columns***, as mentioned above:

1.   created_at: date and time of the tweet, already given by the API.
2.   favorite_count: number of favorites, already given by the API.
3.   retweet_count: number of retweets, already given by the API.
4.   text: the text of the tweet, already given by the API.
5.   image_count: number of images present in the tweet, not given by the API; we have to determine it.

We need to remove the other columns of the dataframe.


In [0]:
# list of columns to remove
columns = ['contributors', 'coordinates', 'entities', 'extended_entities', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'place', 'possibly_sensitive', 'quoted_status', 'quoted_status_id', 'quoted_status_id_str', 'retweeted', 'retweeted_status', 'source', 'truncated', 'user']
# deleting columns we don't need
df.drop(columns, inplace=True, axis=1)
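
Instead of listing the 25 columns to drop, the same result can be reached by naming the four columns to keep. A minimal sketch on a tiny made-up frame; df_demo and its values are illustrative, not the real data:

```python
import pandas as pd

# Tiny illustrative frame; the real one has 29 columns.
df_demo = pd.DataFrame({
    "created_at": ["2019-04-08 07:08:12"],
    "favorite_count": [13],
    "retweet_count": [2],
    "text": ["Example tweet"],
    "lang": ["en"],
    "truncated": [True],
})

wanted = ["created_at", "favorite_count", "retweet_count", "text"]
# Selecting with a list of names returns a new frame holding only those columns.
df_small = df_demo[wanted].copy()
print(list(df_small.columns))
```

The keep-list is shorter and does not break if the API response gains extra fields later.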

Now, let us take a look at the data to see what we have obtained so far.

In [0]:
df.head()

| | created_at | favorite_count | retweet_count | text |
|---|---|---|---|---|
| 0 | 2019-04-08 07:08:12 | 13 | 2 | Many Congratulations to @midasIIITD student, S... |
| 1 | 2019-04-08 03:27:42 | 4 | 0 | @midasIIITD thanks all students who have appea... |
| 2 | 2019-04-07 14:17:29 | 0 | 0 | @himanchalchandr Meanwhile, complete CV/NLP ta... |
| 3 | 2019-04-07 14:17:09 | 0 | 0 | @sayangdipto123 Submit as per the guideline ag... |
| 4 | 2019-04-07 11:43:24 | 1 | 1 | We request all students whose interview are sc... |


We already have the ***four columns*** needed, but to display the ***number of images*** present in each tweet we need a Python script that ***extracts it from the tweets returned by the API call***.

In [0]:
# number of images present in each tweet
image_count = []

# scan every tweet returned by the API call
for tweet in tweets:
  # extended_entities lists every attached media item; entities holds only the first
  media_items = getattr(tweet, "extended_entities", {}).get("media", [])
  # count only the media items that are photos
  photos = sum(1 for media in media_items if media.get("type") == "photo")
  # store exactly one value per tweet: the count, or None if there is no image
  image_count.append(photos if photos else None)
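
The same count can also be computed from the saved JSON rather than from the API objects. In the Twitter API response, the entities field carries only the first attached photo, while extended_entities carries all of them. A minimal sketch, assuming tweet dictionaries as returned by json.load(); count_photos and the sample dictionaries are hypothetical:

```python
def count_photos(tweet):
    """Return how many attached media items in a tweet dict are photos."""
    # Prefer extended_entities (all media); fall back to entities (first item only).
    container = tweet.get("extended_entities") or tweet.get("entities") or {}
    return sum(
        1 for media in container.get("media", []) if media.get("type") == "photo"
    )


# Made-up tweets: no media, two photos, and one video.
samples = [
    {"entities": {"hashtags": []}},
    {
        "entities": {"media": [{"type": "photo"}]},
        "extended_entities": {"media": [{"type": "photo"}, {"type": "photo"}]},
    },
    {"extended_entities": {"media": [{"type": "video"}]}},
]
print([count_photos(tweet) for tweet in samples])
```

Because the function returns exactly one number per tweet, the resulting list always has the same length as the dataframe it is added to.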

We have determined the number of images present in each tweet; now we add it to our dataframe to complete it.

In [0]:
# load the package required: NumPy, to convert the list into an array
import numpy as np
new_col = np.asarray(image_count)
df["image_count"] = new_col

In [0]:
df.head()

| | created_at | favorite_count | retweet_count | text | image_count |
|---|---|---|---|---|---|
| 0 | 2019-04-08 07:08:12 | 13 | 2 | Many Congratulations to @midasIIITD student, S... | |
| 1 | 2019-04-08 03:27:42 | 4 | 0 | @midasIIITD thanks all students who have appea... | |
| 2 | 2019-04-07 14:17:29 | 0 | 0 | @himanchalchandr Meanwhile, complete CV/NLP ta... | |
| 3 | 2019-04-07 14:17:09 | 0 | 0 | @sayangdipto123 Submit as per the guideline ag... | |
| 4 | 2019-04-07 11:43:24 | 1 | 1 | We request all students whose interview are sc... | |


***Summary***: The entire solution to the Python problem was done in a Jupyter notebook on Google Colaboratory. Please go to [this link](https://colab.research.google.com/drive/19v50ZRM7wwj1ikRAWPnYJN4tXRXtGIvf) to take a look at it.

P.S. I have implemented all the things that I have learned from the MOOCs available at [CS224n](http://web.stanford.edu/class/cs224n/) and [mlcourse.ai](https://mlcourse.ai/).