In [57]:
import json
import pandas as pd

Loading and accessing tweets

In the video, we loaded a tweet collected with tweepy into Python (to do!). Tweets arrive from the Streaming API in JSON format and must be converted into a Python data structure before we can work with them.

In this exercise, we'll load a single tweet into Python and print out some fields.
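As a minimal sketch of that conversion, json.loads() turns a JSON string into a Python dict. The payload below is a hypothetical, heavily trimmed stand-in for a tweet, not real API output.

```python
import json

# Hypothetical, trimmed-down tweet payload (assumption: real tweets carry many more fields)
raw_tweet = '{"id": 1, "text": "hello world", "user": {"screen_name": "example_user"}}'

# Parse the JSON string into a Python dictionary
tweet = json.loads(raw_tweet)

# Top-level and nested fields are reached with ordinary dict indexing
print(tweet["text"])
print(tweet["user"]["screen_name"])
```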

In [41]:
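def save_and_load_json(json_file, file_name, save=False, to_dict=False):
    """Optionally save `json_file` to `file_name`, then load and return the file's contents.

    With save=False (the default), `json_file` is ignored and the file is only read.
    With to_dict=True, `json_file` is parsed from a JSON string before it is saved.
    """
    if save:
        if to_dict:
            json_file = json.loads(json_file)

        with open(file_name, "w") as file:
            json.dump(json_file, file, indent=4)

    with open(file_name, "r") as file:
        return json.load(file)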
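The save path of this helper (save=True, to_dict=True) amounts to parsing a JSON string, writing it to disk, and reading it back. The sketch below spells out that round trip with the json module directly; the payload and the file name are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical JSON string standing in for a collected tweet
raw = '{"id": 1, "text": "hello"}'

# Write to a temporary directory so the example leaves no files behind in the project
path = os.path.join(tempfile.mkdtemp(), "example_tweet.json")

# Equivalent of save=True, to_dict=True: parse the string, then dump it to disk
with open(path, "w") as file:
    json.dump(json.loads(raw), file, indent=4)

# Equivalent of the default call: read the file back into a dict
with open(path, "r") as file:
    loaded = json.load(file)

print(loaded)
```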

In [42]:
tweet = save_and_load_json(tweet_data, 'tweet_json.json')
tweet

{'created_at': 'Thu Apr 19 14:25:04 +0000 2018',
 'id': 986973961295720449,
 'id_str': '986973961295720449',
 'text': "Writing out the script of my @DataCamp class and I can't help but mentally read it back to myself in @hugobowne's voice.",
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'DataCamp',
    'name': 'DataCamp',
    'id': 1568606814,
    'id_str': '1568606814',
    'indices': [29, 38]},
   {'screen_name': 'hugobowne',
    'name': 'Hugo Bowne-Anderson',
    'id': 1092509048,
    'id_str': '1092509048',
    'indices': [101, 111]}],
  'urls': []},
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 661613,
  'id_str': '661613',
  'name': 'Alex Hanna, Data Witch',
  'screen_name': 'alexhanna',
  'location': 'Toronto, ON',
  'desc

In [43]:
# Print tweet text
print(tweet['text'])

# Print tweet id
print(tweet['id'])

Writing out the script of my @DataCamp class and I can't help but mentally read it back to myself in @hugobowne's voice.
986973961295720449


In [44]:
# Print user handle
print(tweet['user']['screen_name'])

# Print user follower count
print(tweet['user']['followers_count'])

# Print user location
print(tweet['user']['location'])

# Print user description
print(tweet['user']['description'])

alexhanna
4267
Toronto, ON
Assistant professor @UofT. Protest, media, computation. Trans. Roller derby athlete @TOROLLERDERBY (Kate Silver #538). She/her.


Accessing retweet data

Now we're going to work with a tweet JSON that contains a retweet. A retweet has the same structure as a regular tweet, except that it has another tweet stored in retweeted_status.

The new tweet has been loaded as rt.
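Because only retweets carry the retweeted_status key, code that handles both kinds of tweet usually has to check for it first. Here is a small sketch of that check, using hypothetical trimmed tweets and a hypothetical helper named original_author().

```python
# Hypothetical, heavily trimmed tweets (assumption: real payloads have many more fields)
plain = {"text": "hello", "user": {"screen_name": "alice"}}
retweet = {
    "text": "RT @alice: hello",
    "user": {"screen_name": "bob"},
    "retweeted_status": plain,
}


def original_author(tweet):
    """Return the screen name of whoever wrote the underlying tweet."""
    # Fall back to the tweet itself when there is no nested retweeted_status
    source = tweet.get("retweeted_status", tweet)
    return source["user"]["screen_name"]


for t in (plain, retweet):
    print(t["user"]["screen_name"], "->", original_author(t))
```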

In [45]:
rt = save_and_load_json(tweet_data, 'rt.json')
rt

{'created_at': 'Thu Apr 19 12:45:59 +0000 2018',
 'id': 986949027123154944,
 'id_str': '986949027123154944',
 'text': "RT @hannawallach: ICYMI: NIPS/ICML/ICLR are looking for a full-time programmer to run the conferences' submission/review processes. More in…",
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'hannawallach',
    'name': 'Hanna Wallach',
    'id': 823957466,
    'id_str': '823957466',
    'indices': [3, 16]}],
  'urls': []},
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 661613,
  'id_str': '661613',
  'name': 'Alex Hanna, Data Witch',
  'screen_name': 'alexhanna',
  'location': 'Toronto, ON',
  'description': 'Assistant professor @UofT. Protest, media, computation. Trans. Roller derby athlete @TOROLLERDERBY (Kate Sil

In [46]:
# Print the text of the tweet
print(rt['text'])

# Print the text of the tweet which has been retweeted
print(rt['retweeted_status']['text'])

# Print the user handle of the tweet
print(rt['user']['screen_name'])

# Print the user handle of the tweet which has been retweeted
print(rt['retweeted_status']['user']['screen_name'])

RT @hannawallach: ICYMI: NIPS/ICML/ICLR are looking for a full-time programmer to run the conferences' submission/review processes. More in…
ICYMI: NIPS/ICML/ICLR are looking for a full-time programmer to run the conferences' submission/review processes. M… https://t.co/aB9Y5tTyHT
alexhanna
hannawallach


Tweet Items and Tweet Flattening

There are multiple fields in the Twitter JSON which contain textual data. In a typical tweet, there's the tweet text, the user description, and the user location. In a tweet longer than 140 characters, there's the extended tweet child JSON. And in a quoted tweet, there's the original tweet text and the commentary added to the quoted tweet.

For this exercise, you'll extract textual elements from a single quoted tweet in which the original tweet has more than 140 characters. Then, to analyze tweets at scale, we will want to flatten the tweet JSON into a single level. This will allow us to store the tweets in a DataFrame format.
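Since the complete text of a long tweet lives under extended_tweet rather than in text, a small fallback helper can be handy. This is only a sketch: full_text() is a hypothetical helper, and the quoted tweet below is invented for illustration.

```python
def full_text(tweet):
    """Return the extended text when present, otherwise the regular text."""
    extended = tweet.get("extended_tweet") or {}
    return extended.get("full_text", tweet["text"])


# Hypothetical quoted tweet whose original exceeds 140 characters
quoted = {
    "text": "my commentary",
    "quoted_status": {
        "text": "a long tweet that was cut",
        "extended_tweet": {"full_text": "a long tweet that was cut off in the text field"},
    },
}

# The commentary has no extended_tweet, so the plain text is returned
print(full_text(quoted))

# The quoted original has one, so its full_text is returned
print(full_text(quoted["quoted_status"]))
```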

In [51]:
quoted_tweet = save_and_load_json(quoted_tweet, 'quoted_tweet.json')
quoted_tweet

{'created_at': 'Wed Apr 25 17:20:04 +0000 2018',
 'id': 989192330832891904,
 'id_str': '989192330832891904',
 'text': 'maybe if I quote tweet this lil guy https://t.co/BzbLDz9j6g',
 'display_text_range': [0, 35],
 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 661613,
  'id_str': '661613',
  'name': 'Alex Hanna, Data Witch',
  'screen_name': 'alexhanna',
  'location': 'Toronto, ON',
  'url': 'http://alex-hanna.com',
  'description': 'Assistant professor @UofT. Protest, media, computation. Trans. Roller derby athlete @TOROLLERDERBY (Kate Silver #538). She/her.',
  'translator_type': 'regular',
  'protected': False,
  'verified': False,
  'followers_count': 4275,
  'friends_count': 2806,
  'listed_count': 246,
  'favourites_count': 23526,


In [52]:
# Print the tweet text
print(quoted_tweet['text'])

# Print the quoted tweet text
print(quoted_tweet['quoted_status']['text'])

# Print the quoted tweet's extended (140+) text
print(quoted_tweet['quoted_status']['extended_tweet']['full_text'])

# Print the quoted user location
print(quoted_tweet['quoted_status']['user']['location'])

maybe if I quote tweet this lil guy https://t.co/BzbLDz9j6g
O 280 characters, 280 characters! Wherefore art thou 280 characters?
Deny thy JSON and refuse thy key.
Or, if thou… https://t.co/MlFg4qFnEC
O 280 characters, 280 characters! Wherefore art thou 280 characters?
Deny thy JSON and refuse thy key.
Or, if thou wilt not, be but sworn my love,
And I’ll no longer be a 140 character tweet.
Toronto, ON


In [53]:
# Store the user screen_name in 'user-screen_name'
quoted_tweet['user-screen_name'] = quoted_tweet['user']['screen_name']

# Store the quoted_status text in 'quoted_status-text'
quoted_tweet['quoted_status-text'] = quoted_tweet['quoted_status']['text']

# Store the quoted tweet's extended (140+) text in 
# 'quoted_status-extended_tweet-full_text'
quoted_tweet['quoted_status-extended_tweet-full_text'] = quoted_tweet['quoted_status']['extended_tweet']['full_text']

A tweet flattening function

We are typically interested in hundreds or thousands of tweets. For this purpose, it makes sense to define a function that flattens a JSON file full of tweets. Let's call this function flatten_tweets(). We will use this function multiple times in this course and change it slightly as we deal with different types of data.

In [67]:
def flatten_tweets(tweets_json):
    tweets = []
    
    # Iterate through each tweet
    for tweet in tweets_json:

        tweet_obj = json.loads(tweet)

        # Store the user screen name in 'user-screen_name'
        tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']
    
        # Check if this is a 140+ character tweet
        if 'extended_tweet' in tweet_obj:
            # Store the extended tweet text in 'extended_tweet-full_text'
            tweet_obj['extended_tweet-full_text'] = tweet_obj['extended_tweet']['full_text']
    
        if 'retweeted_status' in tweet_obj:
            # Store the retweet user screen name in 'retweeted_status-user-screen_name'
            tweet_obj['retweeted_status-user-screen_name'] = tweet_obj['retweeted_status']['user']['screen_name']

            # Store the retweet text in 'retweeted_status-text'
            tweet_obj['retweeted_status-text'] = tweet_obj['retweeted_status']['text']
    
            if 'extended_tweet' in tweet_obj['retweeted_status']:
                # Store the extended retweet text in 'retweeted_status-extended_tweet-full_text'
                tweet_obj['retweeted_status-extended_tweet-full_text'] = tweet_obj['retweeted_status']['extended_tweet']['full_text']
            
        tweets.append(tweet_obj)
    return tweets
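flatten_tweets() copies a hand-picked set of nested fields to the top level. A generic alternative, shown here only as a sketch and not as the approach used in this course, is to flatten every nested dictionary recursively with the same '-' separator. The helper name flatten_dict() and the sample tweet are hypothetical.

```python
def flatten_dict(obj, parent_key="", sep="-"):
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into child dictionaries, carrying the key path along
            flat.update(flatten_dict(value, new_key, sep))
        else:
            # Lists and scalar values are kept as they are
            flat[new_key] = value
    return flat


# Hypothetical trimmed tweet used only to exercise the helper
nested = {
    "id": 1,
    "user": {"screen_name": "alice", "location": None},
    "entities": {"hashtags": []},
}

flat = flatten_dict(nested)
print(sorted(flat))
```

Unlike flatten_tweets(), this version drops the nested dictionaries themselves and keeps only their leaves, so it produces many more columns on real tweets.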

Loading tweets into a DataFrame

Now it's time to import data into a pandas DataFrame so we can analyze tweets at scale.

We will work with a dataset of tweets which contain the hashtag '#rstats' or '#python'. This dataset is stored as a list of tweet JSON objects in data_science_json.

In [68]:
data_science_json = save_and_load_json(data_science_json, 'data_science_json.json')

In [69]:
# Flatten the tweets and store in `tweets`
tweets = flatten_tweets(data_science_json)

# Create a DataFrame from `tweets`
ds_tweets = pd.DataFrame(tweets)

ds_tweets

Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,retweeted_status-text,retweeted_status-extended_tweet-full_text,possibly_sensitive,extended_entities,extended_tweet,extended_tweet-full_text,display_text_range,quoted_status_id,quoted_status_id_str,quoted_status
0,Fri Mar 30 13:04:22 +0000 2018,979705897457942528,979705897457942528,RT @Dennboss: Hahahah Efteling maakt Maxi-Cosi...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,Hahahah Efteling maakt Maxi-Cosi's voor in de ...,Hahahah Efteling maakt Maxi-Cosi's voor in de ...,,,,,,,,
1,Fri Mar 16 11:59:09 +0000 2018,974616055006941184,974616055006941184,RT @PythonWeekly: Python Weekly - Issue 338 ht...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,Python Weekly - Issue 338 https://t.co/7gJSoLJ...,Python Weekly - Issue 338 https://t.co/7gJSoLJ...,False,,,,,,,
2,Tue Mar 27 08:34:33 +0000 2018,978550832273805312,978550832273805312,"RT @dataandme: ICYMI, still 💜ing this: ""Where ...","<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,"ICYMI, still 💜ing this: ""Where do things live ...","ICYMI, still 💜ing this: ""Where do things live ...",False,,,,,,,
3,Fri Mar 16 21:26:58 +0000 2018,974758950737448960,974758950737448960,RT @dataandme: 🕴@jaredlander knows how to put ...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,"🕴@jaredlander knows how to put on a show…\n""Ne...","🕴@jaredlander knows how to put on a show…\n""Ne...",False,,,,,,,
4,Thu Mar 15 23:35:05 +0000 2018,974428804494995456,974428804494995456,RT @llanga: I heard it's Py Day today so I mad...,"<a href=""http://twitter.com/download/iphone"" r...",False,,,,,...,I heard it's Py Day today so I made a thing! M...,I heard it's Py Day today so I made a thing! M...,False,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370,Tue Mar 13 08:23:25 +0000 2018,973474600326696960,973474600326696960,RT @tkb: Want to tell stories about global dev...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,Want to tell stories about global development ...,Want to tell stories about global development ...,,,,,,,,
371,Sat Mar 03 17:42:59 +0000 2018,969991541233209344,969991541233209344,Самый дешёвый CDN из всех что я знаю https://t...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",False,,,,,...,,,False,,,,,,,
372,Fri Mar 16 07:46:03 +0000 2018,974552360302272513,974552360302272513,RT @json_stack: python json.loads Unterminated...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,python json.loads Unterminated string error [V...,,False,,,,,,,
373,Sun Mar 25 23:36:39 +0000 2018,978053077403095042,978053077403095042,RT @KKulma: My last blog post on hints how to ...,"<a href=""http://twitter.com/download/android"" ...",False,,,,,...,My last blog post on hints how to run a data p...,My last blog post on hints how to run a data p...,,,,,,,,
