Import the dependencies. Tweepy is a Python library that makes the Twitter API easier to use. jsonlines and json handle the file formats we are going to work with.

In [64]:
import tweepy
from tweepy import OAuthHandler
import jsonlines
import json
import pandas as pd

Standard Twitter API authentication. Please fill in the following cell with your own keys.

In [13]:
consumer_key=""
consumer_secret=""
access_token=""
access_secret=""
username="midasIIITD"
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
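
Rather than pasting keys into the notebook, one option is to read them from environment variables. The sketch below assumes four hypothetical variable names (TWITTER_CONSUMER_KEY and so on); they are not defined by Tweepy or Twitter, so use whichever names suit your setup.

```python
import os

# Hypothetical variable names, chosen only for this sketch.
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending empty keys.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


# Possible use with the cell above (not run here):
# creds = load_credentials()
# auth = OAuthHandler(creds["TWITTER_CONSUMER_KEY"], creds["TWITTER_CONSUMER_SECRET"])
# auth.set_access_token(creds["TWITTER_ACCESS_TOKEN"], creds["TWITTER_ACCESS_SECRET"])
```

This keeps the secrets out of the saved notebook, which matters if the notebook is shared.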

In [6]:

def user_tweets(api,user):
    all_tweets = api.user_timeline(screen_name=user, count=1000, tweet_mode="extended")
    return all_tweets
tweets=user_tweets(api,username)


In [8]:
len(tweets)

200

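Even though count=1000 was requested, only 200 tweets come back: the user timeline endpoint returns at most 200 tweets per request, so fetching more requires paging through older tweets.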
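
To go past 200 tweets, the usual approach is to request page after page, each time asking for tweets older than the oldest one already received (the `max_id` parameter of the timeline endpoint). The sketch below shows that loop. `fetch_page` is a hypothetical callable standing in for `api.user_timeline`, so the logic can be read and tried without network access. Tweepy also ships a `tweepy.Cursor` helper that handles this paging for you.

```python
def fetch_timeline(fetch_page, limit=1000, page_size=200):
    """Collect up to `limit` tweets by requesting pages of at most `page_size`.

    `fetch_page(count, max_id)` is a hypothetical callable that returns a list
    of objects with an `id` attribute, newest first, like api.user_timeline.
    """
    collected = []
    max_id = None  # None means "start from the newest tweet"
    while len(collected) < limit:
        count = min(page_size, limit - len(collected))
        page = fetch_page(count=count, max_id=max_id)
        if not page:
            break  # no older tweets left
        collected.extend(page)
        # Next, ask for tweets strictly older than the oldest one seen so far.
        max_id = page[-1].id - 1
    return collected


# With a real client this might be called as follows (an assumption, not run here):
# tweets = fetch_timeline(
#     lambda count, max_id: api.user_timeline(
#         screen_name=username, count=count, max_id=max_id, tweet_mode="extended"),
#     limit=1000)
```

Note that the v1.1 user timeline only exposes roughly the 3,200 most recent tweets of an account, so a larger `limit` will not return more than that.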

In [12]:
type((tweets[0]))

tweepy.models.Status

Each tweet is a Status object, so we need to convert it to a format we can store: json or jsonlines.
Every Status object has a _json field holding the JSON response sent by Twitter. We will extract and store that.
This gives us a list of dictionaries, which will be written to a file in jsonl format.

In [25]:
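# Keep only the raw JSON dictionary that Twitter returned for each tweet
tweet_jsons = [tweet._json for tweet in tweets]

# Note: write() stores the whole list as a single line of the file
with jsonlines.open('tweets.jsonl', mode='w') as writer:
    writer.write(tweet_jsons)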

In [16]:
type(tweets[0]._json)

dict

The first task, writing to the JSON file, is now complete. Next we read the tweets back from that file.

In [28]:
tweet_jsons=[]
with jsonlines.open('tweets.jsonl') as reader:
    # The file holds a single line containing the whole list of tweets
    for main_list in reader:
        tweet_jsons = main_list
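
The cells above store the entire list as one line of the file. A JSON Lines file more conventionally holds one object per line, which lets a reader process tweets one at a time. Here is a minimal sketch of that layout using only the standard json module; the two records and the file name `tweets_sample.jsonl` are made up for illustration.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write each record as its own line of JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read the file back into a list, one dictionary per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up records standing in for the tweet dictionaries.
sample = [{"id": 1, "full_text": "first"}, {"id": 2, "full_text": "second"}]
path = os.path.join(tempfile.mkdtemp(), "tweets_sample.jsonl")
write_jsonl(path, sample)
restored = read_jsonl(path)
```

With the jsonlines library, `writer.write_all(tweet_jsons)` should produce the same one-object-per-line layout; the reading loop would then need to append each item rather than take the single list.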

In [30]:
for key in tweet_jsons[0]:
    print(key)

created_at
id
id_str
full_text
truncated
display_text_range
entities
source
in_reply_to_status_id
in_reply_to_status_id_str
in_reply_to_user_id
in_reply_to_user_id_str
in_reply_to_screen_name
user
geo
coordinates
place
contributors
is_quote_status
retweet_count
favorite_count
favorited
retweeted
lang


The JSON has many keys, but we only need a few of them:

full_text = The text of the tweet

created_at = Date and time of the tweet

favorite_count = The number of likes

retweet_count = The number of retweets
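
The created_at value is a plain string such as "Fri Apr 05 16:08:37 +0000 2019". If a real timestamp is needed later (for sorting or filtering), it can be parsed with the standard library. This is a small sketch, assuming the format seen in this notebook's output holds for every tweet; the function name is made up.

```python
from datetime import datetime

# Format of the created_at strings, e.g. "Fri Apr 05 16:08:37 +0000 2019"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Turn a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


parsed = parse_created_at("Fri Apr 05 16:08:37 +0000 2019")
print(parsed.isoformat())  # 2019-04-05T16:08:37+00:00
```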

Images take a little more work. First look inside 'entities', then check for the 'media' key, which may or may not be present. Each object in 'media' whose 'type' attribute is 'photo' is an image, so counting those objects gives the number of images. Note that in Twitter's v1.1 API, 'entities' usually lists only the first photo of a multi-photo tweet; the full set is reported under 'extended_entities'.
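
The counting rule can be sketched on its own as follows. The helper name `count_photos` and the three sample dictionaries are made up for illustration. The sketch also assumes the v1.1 layout in which 'extended_entities', when present, carries the complete media list, so it checks that key first and falls back to 'entities'.

```python
def count_photos(tweet):
    """Count media objects of type 'photo' in a tweet dictionary."""
    # Prefer 'extended_entities' when present; otherwise use 'entities'.
    container = tweet.get("extended_entities") or tweet.get("entities", {})
    media = container.get("media", [])
    return sum(1 for item in media if item.get("type") == "photo")


# Made-up tweets covering the three situations.
no_media = {"entities": {"hashtags": []}}
one_photo = {"entities": {"media": [{"type": "photo"}]}}
three_photos = {
    "entities": {"media": [{"type": "photo"}]},
    "extended_entities": {"media": [{"type": "photo"}] * 3},
}

counts = [count_photos(t) for t in (no_media, one_photo, three_photos)]
print(counts)  # [0, 1, 3]
```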

There are two cases:
1) The tweet was written by the user
2) The tweet is a retweet
For a retweet, the text, image, like and retweet data must be read from the original tweet, which is available under the ['retweeted_status'] key. The retweet's own text may be truncated, so reading it without 'retweeted_status' can lose part of the text.

In [59]:
def extract_info(tweet):
    """Return the text, timestamp, like/retweet counts and photo count of a tweet."""
    # For a retweet, read text, counts and media from the original tweet,
    # because the retweet's own text may be truncated.
    source = tweet.get('retweeted_status', tweet)
    # Count the media objects of type 'photo'; 'media' may be absent.
    num = sum(1 for media_object in source['entities'].get('media', [])
              if media_object['type'] == 'photo')
    return {
        'text': source['full_text'],
        'date-time': tweet['created_at'],  # time of the tweet or retweet itself
        'likes': source['favorite_count'],
        'retweets': source['retweet_count'],
        'images': str(num) if num > 0 else "None",
    }
        
        

    

In [69]:
list_of_info=[]
for tweet in tweet_jsons:
    info_dict=extract_info(tweet)
    list_of_info.append([info_dict['text'],info_dict['date-time'],info_dict['likes'], info_dict['retweets'],info_dict['images']])

In [70]:
df = pd.DataFrame(list_of_info)
df.columns = ["Tweet's text", "Date-Time", "Likes", "Retweets","Images"]
# None disables column-width truncation in pandas 1.0 and later (-1 is deprecated there)
pd.set_option('display.max_colwidth', None)
df.head()



Unnamed: 0,Tweet's text,Date-Time,Likes,Retweets,Images
0,We have emailed the task details to all candidates who have applied to @midasIIITD internship through IIITD portal. Kindly check your spam folder if you have not received the email. We will evaluate all solutions received until April 10 midnight and announce results by April 14.,Fri Apr 05 16:08:37 +0000 2019,6,1,
1,Our NAACL paper on polarization in language on Twitter surrounding mass shootings is up on arXiv! https://t.co/g7wiegXxDg\nThis is the first lead-author paper from Dora Demszky; she put a huge amount of work into it and I think it turned out extremely well.,Fri Apr 05 04:05:11 +0000 2019,46,15,
2,Effective Transfer Learning For NLP https://t.co/Z1m0AzlfVv https://t.co/ccX4Uhxjn8,Fri Apr 05 04:04:43 +0000 2019,19,10,1.0
3,What’s new in @Stanford CS224N Natural Language Processing with Deep Learning for 2019? Question answering—1D CNNs—subword models—contextual word representations—transformers—generation—bias. YT playlist https://t.co/gFwwXJqYuQ – CS224N online hub https://t.co/HTnMzCAjS3 #NLProc https://t.co/rZKQvfUhiF,Wed Apr 03 18:31:53 +0000 2019,221,55,1.0
4,"Today we're releasing a large-scale extendable dataset of mathematical questions, for training (and evaluating the abilities of) neural models that can reason algebraically. \n\nPaper: https://t.co/D8g477gcQ4\nCode and data: https://t.co/QvR2WkK7j2 https://t.co/EWqNqaOUd5",Wed Apr 03 17:04:32 +0000 2019,2335,841,1.0
