<a href="https://colab.research.google.com/github/dax-1895/Twitter-data-mining/blob/master/project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Twitter Data Mining**
## Objective:
Write a Python script that fetches all tweets (as many as the Twitter API allows) posted by the @midasIIITD Twitter handle and dumps the responses into a JSON Lines file.
The second part of the script should parse this JSON Lines file and display the following for every tweet in tabular format:
*   The text of the tweet.
*   The date and time of the tweet.
*   The number of favorites/likes.
*   The number of retweets.
*   The number of images present in the tweet. If there are no images, return None.

## Importing all the packages required
*   **jsonlines**:
The JSON Lines format is straightforward: one valid JSON value per line, encoded as UTF-8. Code to read and write such data is not complex, but it becomes non-trivial enough to warrant a dedicated library once you add data validation, error handling, and support for both binary and text streams. The `jsonlines` library implements all of that so applications using the format do not have to reinvent the wheel. The first step is to install the package and import it.
*   **Tweepy**:
Tweepy is a Python library for accessing the Twitter API. It is well suited to simple automation and creating Twitter bots.
*   **NumPy and pandas**: Widely used libraries for data science.
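
To make the "one JSON value per line" idea concrete, here is a minimal sketch of a JSON Lines round trip. It deliberately uses only the standard library `json` module rather than the `jsonlines` package, and the helper names `write_jsonl` and `read_jsonl` as well as the sample records are hypothetical, chosen for illustration only.

```python
import io
import json


def write_jsonl(records, stream):
    """Write each record as one JSON value per line."""
    for record in records:
        # json.dumps escapes embedded newlines, so every record stays on one line.
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one parsed JSON value per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample records, not real tweets.
sample = [
    {"text": "first tweet", "favorite_count": 3},
    {"text": "second tweet\nwith a newline", "favorite_count": 0},
]

buffer = io.StringIO()
write_jsonl(sample, buffer)
buffer.seek(0)
round_tripped = list(read_jsonl(buffer))
print(round_tripped == sample)
```

The second record contains a newline inside its text, which shows why the format still holds: the newline is escaped within the JSON string, so the file keeps exactly one line per record.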


In [41]:
!pip install jsonlines
import tweepy
import numpy as np
import pandas as pd
import jsonlines



## **Authentication Keys**
The authentication keys come from the Twitter Developers page. Supplying them to Tweepy keeps the OAuth process simple.

In [0]:
consumer_key = "XXXXXXXXXXXXXXXXXXXXXXXXXXXX" 
consumer_secret = "XXXXXXXXXXXXXXXXXXXXXXXXXX"
access_key = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
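
Since keys pasted into a notebook are easy to leak, one possible alternative is to read them from environment variables. The sketch below is an assumption about how this could be arranged, not part of the original script; the variable names in `REQUIRED_VARS` and the helper `load_credentials` are hypothetical.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_KEY",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=None):
    """Return the four keys from env, raising KeyError if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Usage with a stand-in mapping instead of the real environment.
demo_env = {name: "XXXX" for name in REQUIRED_VARS}
credentials = load_credentials(demo_env)
print(sorted(credentials))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the literals.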

## Accessing all tweets
Using the Tweepy API, we request the most recent tweets of @midasIIITD.

In [43]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
userID = '@midasIIITD'
tweets = api.user_timeline(
                           screen_name=userID,
                           # 200 is the maximum count the API returns per request
                           count=200,
                           include_rts=False,
                           # Necessary to keep full_text;
                           # otherwise the text is truncated to 140 characters
                           tweet_mode='extended'
                           )
print(tweets[100])


Status(_api=<tweepy.api.API object at 0x7f962d3a0048>, _json={'created_at': 'Tue Dec 18 14:42:27 +0000 2018', 'id': 1075038589505990656, 'id_str': '1075038589505990656', 'full_text': 'Feel free to contact us if you have any query on @ACMMM19. We look forward to your submissions to ACM Multimedia conference 2019. \n\nCheck more details on submission at https://t.co/xA2jk0KPPC\n\n#Multimedia #Research #ACMMM2019 #Nice', 'truncated': False, 'display_text_range': [0, 231], 'entities': {'hashtags': [{'text': 'Multimedia', 'indices': [193, 204]}, {'text': 'Research', 'indices': [205, 214]}, {'text': 'ACMMM2019', 'indices': [215, 225]}, {'text': 'Nice', 'indices': [226, 231]}], 'symbols': [], 'user_mentions': [{'screen_name': 'ACMMM19', 'name': 'ACMMM19', 'id': 961976587246817280, 'id_str': '961976587246817280', 'indices': [49, 57]}], 'urls': [{'url': 'https://t.co/xA2jk0KPPC', 'expanded_url': 'https://www.acmmm.org/2019/submission/', 'display_url': 'acmmm.org/2019/submissio…', 'indices': [16

## Getting All Previous Tweets

In [44]:
all_tweets = []
all_tweets.extend(tweets)  # each element of tweets is appended to all_tweets
oldest_id = tweets[-1].id  # the oldest ID is that of the last element in tweets
# keep requesting tweets older than the oldest one seen so far
while True:
    tweets = api.user_timeline(screen_name=userID,
                           # 200 is the maximum count the API returns per request
                           count=200,
                           include_rts=False,
                           # max_id is inclusive, so subtract 1 to skip the tweet already downloaded
                           max_id=oldest_id - 1,
                           # Necessary to keep full_text;
                           # otherwise the text is truncated to 140 characters
                           tweet_mode='extended'
                           )
    if len(tweets) == 0:
        break
    oldest_id = tweets[-1].id
    all_tweets.extend(tweets)
    print('N of tweets downloaded till now {}'.format(len(all_tweets)))


N of tweets downloaded till now 171
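
The loop above pages backwards through the timeline by passing `max_id`. The following offline sketch imitates that pattern without calling Twitter: `fetch_page` is a hypothetical stand-in for `api.user_timeline`, and the tweet IDs are made-up integers, so it only illustrates the control flow.

```python
def fetch_page(timeline, count, max_id=None):
    """Stand-in for api.user_timeline: newest-first IDs no greater than max_id."""
    eligible = [tweet_id for tweet_id in timeline if max_id is None or tweet_id <= max_id]
    return eligible[:count]


def fetch_all(timeline, count):
    """Collect every ID by repeatedly requesting items older than the last one seen."""
    collected = []
    max_id = None
    while True:
        page = fetch_page(timeline, count, max_id)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so step just past the oldest ID already collected.
        max_id = page[-1] - 1
    return collected


# Hypothetical tweet IDs, sorted newest first as a timeline would be.
fake_timeline = list(range(1000, 0, -7))
downloaded = fetch_all(fake_timeline, count=20)
print(len(downloaded), downloaded == fake_timeline)
```

Subtracting 1 from the oldest ID is what prevents the last tweet of one page from reappearing as the first tweet of the next.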


## Transform the tweepy tweets into a DataFrame that will populate the csv

In [0]:
# transform the tweepy tweets into rows for a DataFrame that will populate the CSV
from pandas import DataFrame
outtweets = [[# date and time of the tweet
              tweet.created_at,
              # favorite count
              tweet.favorite_count,
              # retweet count
              tweet.retweet_count,
              # full text of the tweet (already a str, so no re-encoding is needed)
              tweet.full_text]
             for tweet in all_tweets]

In [0]:
df = DataFrame(outtweets,columns=["created_at","favorite_count","retweet_count", "text"])

## Counting the number of images
Images cannot be accessed directly. Information about attached media is stored in the tweet's entities under the key "media". Media can be of several types, so we need to check that each item is in fact a photo.

In [0]:
# create a list to store the image count of each tweet
images = []

In [48]:
for tweet in all_tweets:
    count = 0
    for media in tweet.entities.get("media", []):  # access the "media" element, if any
        if media.get("type") == "photo":  # check that its type is photo
            count += 1  # count the image
    images.append(count)  # append exactly one value per tweet

images

[0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0]
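
The counting step above can also be sketched on plain dictionaries, such as the `_json` payload that each tweepy `Status` carries. The helper `count_photos` and the sample payloads are hypothetical. The preference for `extended_entities` over `entities` is an assumption about where Twitter's v1.1 payloads list additional photos, so verify it against your own data.

```python
def count_photos(tweet_json):
    """Count media entries of type 'photo' in one tweet payload (a plain dict)."""
    # Assumption: prefer extended_entities when present, fall back to entities.
    container = tweet_json.get("extended_entities") or tweet_json.get("entities", {})
    media_items = container.get("media", [])
    return sum(1 for item in media_items if item.get("type") == "photo")


# Hypothetical payloads, trimmed to the keys this sketch reads.
no_media = {"entities": {"hashtags": []}}
one_photo = {"entities": {"media": [{"type": "photo"}]}}
mixed = {
    "entities": {"media": [{"type": "photo"}]},
    "extended_entities": {
        "media": [{"type": "photo"}, {"type": "photo"}, {"type": "video"}]
    },
}

counts = [count_photos(payload) for payload in (no_media, one_photo, mixed)]
print(counts)  # [0, 1, 2]
```

Returning a single integer per tweet keeps the result the same length as the list of tweets, which is what the DataFrame column assignment requires.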

In [0]:
df["no_of_images"] = images  # add the image counts to the DataFrame
df["no_of_images"] = df["no_of_images"].replace(0, "None")  # replace every zero in the column with "None"

## Converting the DataFrame into a CSV file and a JSON Lines file

In [0]:
df.to_csv('%s_tweets.csv' % userID, index=False)  # write the DataFrame to CSV

In [0]:
df.to_json('%s_tweets.jsonl' % userID, orient='records', lines=True)  # lines=True writes one JSON object per line, i.e. JSON Lines


In [52]:
print(df.to_json(orient='records',lines= True)) #checking the output

{"created_at":1554886889000,"favorite_count":0,"retweet_count":0,"text":"Clarification: Our earlier post which indicates that the interview has been done is referring to the interviews of the candidates who have submitted applications and tasks in the first round (not through IIITD internship portal). The second round candidates yet to be interviewed.","no_of_images":"None"}
{"created_at":1554707292000,"favorite_count":18,"retweet_count":2,"text":"Many Congratulations to @midasIIITD student, Shagun Uppal @shagunuppls, on getting selected for the summer internship in the BRAIN lab at Singapore University of Technology &amp; Design @sutdsg, Singapore.\nWe wish her the best luck for the internship.\n\n#MIDAS #StudentAchievement https:\/\/t.co\/snX2GkzvQg","no_of_images":1}
{"created_at":1554694062000,"favorite_count":5,"retweet_count":0,"text":"@midasIIITD thanks all students who have appeared for the interview yesterday. We will announce the interview results for MIDAS internship latest 
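
In the output above, `created_at` appears as an integer because the timestamps were serialized as milliseconds since the Unix epoch. A small sketch of converting such a value back into a readable UTC datetime follows; the helper name `epoch_ms_to_utc` is hypothetical, and the input is the first `created_at` value printed above.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(milliseconds):
    """Convert milliseconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)


first_created_at = epoch_ms_to_utc(1554886889000)
print(first_created_at.isoformat())  # 2019-04-10T09:01:29+00:00
```

If human-readable timestamps are preferred in the file itself, passing `date_format='iso'` to `DataFrame.to_json` is one option worth trying.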

In [53]:
jsonlines.open("@midasIIITD_tweets.jsonl")  # open the file written above

<jsonlines.Reader at 0x7f962d932048 wrapping '@midasIIITD_tweets.jsonl'>

In [54]:
# read every line in the file
with jsonlines.open('@midasIIITD_tweets.jsonl') as reader:
    for obj in reader:
        print(obj)

{'created_at': 1554886889000, 'favorite_count': 0, 'retweet_count': 0, 'text': 'Clarification: Our earlier post which indicates that the interview has been done is referring to the interviews of the candidates who have submitted applications and tasks in the first round (not through IIITD internship portal). The second round candidates yet to be interviewed.', 'no_of_images': 'None'}
{'created_at': 1554707292000, 'favorite_count': 18, 'retweet_count': 2, 'text': 'Many Congratulations to @midasIIITD student, Shagun Uppal @shagunuppls, on getting selected for the summer internship in the BRAIN lab at Singapore University of Technology &amp; Design @sutdsg, Singapore.\nWe wish her the best luck for the internship.\n\n#MIDAS #StudentAchievement https://t.co/snX2GkzvQg', 'no_of_images': 1}
{'created_at': 1554694062000, 'favorite_count': 5, 'retweet_count': 0, 'text': '@midasIIITD thanks all students who have appeared for the interview yesterday. We will announce the interview results for MI

In [55]:
#converting it into a DataFrame to view in a tabular format
pd.read_json('@midasIIITD_tweets.jsonl',lines=True)

| | created_at | favorite_count | no_of_images | retweet_count | text |
|---|---|---|---|---|---|
| 0 | 2019-04-10 09:01:29 | 0 | | 0 | Clarification: Our earlier post which indicate... |
| 1 | 2019-04-08 07:08:12 | 18 | 1 | 2 | Many Congratulations to @midasIIITD student, S... |
| 2 | 2019-04-08 03:27:42 | 5 | 1 | 0 | @midasIIITD thanks all students who have appea... |
| 3 | 2019-04-07 14:17:29 | 0 | | 0 | @himanchalchandr Meanwhile, complete CV/NLP ta... |
| 4 | 2019-04-07 14:17:09 | 0 | | 0 | @sayangdipto123 Submit as per the guideline ag... |
| 5 | 2019-04-07 11:43:24 | 1 | | 1 | We request all students whose interview are sc... |
| 6 | 2019-04-07 06:55:19 | 5 | | 2 | Other queries: "none of the Tweeter Apis give ... |
| 7 | 2019-04-07 06:53:38 | 4 | | 1 | Other queries: "do we have to make two differe... |
| 8 | 2019-04-07 05:32:27 | 6 | | 1 | Other queries: "If using Twitter api, it does ... |
| 9 | 2019-04-07 05:29:40 | 7 | | 1 | Response to some queries asked by students on ... |
