<font size="+3">
    <b>Working with JSON Data</b>
</font>

<h1>Instructions</h1>
<br>
<font size="+1" style="color:dodgerblue">
    <ul>
        <li>There are questions in this assignment that are intentionally vague. You are to use your best judgment to decide the best way to answer the question.</li>
        <br>
        <li>Explain your answers as best as you can.</li>
        <br>
        <li>Note that some of the keys asked about in this assignment, which appeared in our in-class Twitter data, might not be present in this version of the Twitter API-generated dataset.</li>
        <br>
        <ul>
            <li>This is something that often happens in databases within companies.</li>
            <br>
            <li>Your job as a data analyst or data scientist, or someone managing a team of analysts, is to decide on a game plan for answering the questions with the given data that you do have.</li>
            <br>
            <li>It is possible that there isn't a good solution, and the solution is to talk to the data vendor to get different data capable of answering the question. </li>
            <br>
            <li>However, sometimes it is possible to get a sufficient answer if you think outside of the box.</li>
            <br>
            <li>The most important thing is you explain what you did, and how that helps solve the question, at least to a first approximation.</li>
            <br>
            <li>For example, if there isn't data telling you the Tweet has been truncated, how could you use the text to give an answer that is right some percentage of the time? </li>
            <br>
            <li>One might consider if ellipses are at the end of the Tweet, and/or if the length of the Tweet is at its maximal length, and/or if ellipses are at the beginning as well as the end of the Tweet, etc.</li>
            <br>
        </ul>
         <li style="color:red">Submit the <i>.ipynb</i> file on TurnItIn on Blackboard by the deadline. Be sure to refresh the page and double check your submission actually went through. <b>Note that you only need to submit your solutions, not all of the other recommended steps.</b> The recommended steps are meant to serve as a guide for your thinking process.</li>
            <br>
        <br>
        <hr style="border: 10px solid black">
        <br>
        <li style="color:black"><b>Grading</b></li>
        <br>
        <li style="color:black">There are four possible scores you can get from submitting this assignment on time (submitting a blank file or one without any apparent effort does not count). Note that the rubric is designed to incentivize you to go for $100\%$ mastery of the material, as the little details matter in programming.</li>
        <br>
        <ul style="color:black">
            <li>Grade of $5$ out of $5$ - perfect submission with no significant errors</li>
            <br>
            <li>Grade of $4$ out of $5$ - near perfect submission with one or more significant errors</li>
            <br>
            <li>Grade of $2$ out of $5$ - apparent effort but far from perfect</li>
            <br>
            <li>Grade of $0$ out of $5$ - no submission or no apparent effort</li>
            <br>
        </ul>
    </ul>
</font>

$\square$

$\rule{800pt}{20pt}$

<h1>Before You Begin</h1>
<br>
<font size="+1">
    <ul>
        <li>Please read: <b>04_Python_and_Twitter_Data.ipynb</b></li>
        <br>
    </ul>
</font>

$\rule{800pt}{20pt}$

<h1>Imports</h1>

In [2]:
import os
import time

import json
import datetime

import pandas as pd

<h1>Twitter API v2</h1>

<a href="https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet">Twitter API v2 Data Dictionary</a>

In [3]:
from google.colab import drive
drive.mount('/content/drive')

# Set the data path
data_path = '/content/drive/My Drive/DSO559/'

Mounted at /content/drive


In [4]:
# Open the JSON file; the with-block closes it automatically
with open(os.path.join(data_path, 'Copy of twitter_data_v2_api.json'), 'r') as file:
    # Parse the JSON content into a dictionary
    data = json.load(file)

<h2>What are the keys to this data dictionary?</h2>

In [5]:
data


{'data': [{'id': '1212092628029698048',
   'text': 'We believe the best future version of our API will come from building it with YOU. Here’s to another great year with everyone who builds on the Twitter platform. We can’t wait to continue working with you in the new year. https://t.co/yvxdK6aOo2',
   'possibly_sensitive': False,
   'referenced_tweets': [{'type': 'replied_to', 'id': '1212092627178287104'}],
   'entities': {'urls': [{'start': 222,
      'end': 245,
      'url': 'https://t.co/yvxdK6aOo2',
      'expanded_url': 'https://twitter.com/LovesNandos/status/1211797914437259264/photo/1',
      'display_url': 'pic.twitter.com/yvxdK6aOo2'}],
    'annotations': [{'start': 144,
      'end': 150,
      'probability': 0.626,
      'type': 'Product',
      'normalized_text': 'Twitter'}]},
   'author_id': '2244994945',
   'public_metrics': {'retweet_count': 8,
    'reply_count': 2,
    'like_count': 40,
    'quote_count': 1},
   'lang': 'en',
   'created_at': '2019-12-31T19:26:16.000Z',


In [6]:
data.keys()

dict_keys(['data', 'includes'])

In [17]:
# data["data"] is a list of dictionaries; the keys of its first element are:
for key in data["data"][0].keys():
    print("=======================")
    print(key)



id
text
possibly_sensitive
referenced_tweets
entities
author_id
public_metrics
lang
created_at
source
in_reply_to_user_id
attachments
context_annotations


In [54]:
# Examine nested values: keys whose value is itself a dictionary
data_keys=data["data"][0].keys()
keys_with_more_info = [key for key in data_keys if isinstance(data["data"][0][key], dict)]

keys_with_more_info

['entities', 'public_metrics', 'attachments']

In [19]:
# Keys in data["includes"]
for key in data["includes"].keys():
    print("=======================")
    print(key)


tweets


In [24]:
# data["includes"]["tweets"] is a list of dictionaries; the keys of its first element are:
data["includes"]["tweets"][0].keys()

dict_keys(['possibly_sensitive', 'referenced_tweets', 'text', 'entities', 'author_id', 'public_metrics', 'lang', 'created_at', 'source', 'in_reply_to_user_id', 'id'])

In [47]:
# Examine nested values in data["includes"]["tweets"]: keys whose value is itself a dictionary
data_keys2=data["includes"]["tweets"][0].keys()
keys_with_more_info2 = [key for key in data_keys2 if isinstance(data["includes"]["tweets"][0][key], dict)]

keys_with_more_info2

['entities', 'public_metrics']
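
The dictionary checks above only look one level down and skip keys whose value is a list of dictionaries (for example `referenced_tweets`). A minimal sketch of a recursive walk that lists every key path, assuming a small excerpt hand-copied from the first Tweet shown earlier; `key_paths` and `sample` are illustrative names, not part of the dataset or any library:

```python
def key_paths(obj, prefix=""):
    """Return every key path found in nested dicts/lists as dotted strings."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(obj, list):
        for item in obj:
            # "[]" marks a step into a list; skip paths already seen in earlier items
            for path in key_paths(item, prefix + "[]"):
                if path not in paths:
                    paths.append(path)
    return paths


# Excerpt copied by hand from the first Tweet in the output above
sample = {
    "id": "1212092628029698048",
    "public_metrics": {"retweet_count": 8, "reply_count": 2,
                       "like_count": 40, "quote_count": 1},
    "referenced_tweets": [{"type": "replied_to", "id": "1212092627178287104"}],
}

paths = key_paths(sample)
for path in paths:
    print(path)
```

Running the same helper on `data["data"][0]` should give a fuller map of where each field lives before answering the questions below.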

<h2>How many "phrases" are in the Tweet?</h2>
<font size="+1">
    <br>
    I'm thinking of a phrase as a generalized word. More specifically, something separated by a blank space.
</font>

In [36]:
# Number of phrases, including the hyperlink
Tweet_text=data["data"][0]["text"]
Tweet_text
num_of_phrases=len(Tweet_text.split(" "))
num_of_phrases


42

In [40]:
# Number of phrases, excluding a trailing hyperlink
words = Tweet_text.split()
if words[-1].startswith('http://') or words[-1].startswith('https://'):
    words = words[:-1]
Tweet_text_without_link = ' '.join(words)
Tweet_text_without_link
num_of_phrases=len(Tweet_text_without_link.split(" "))
num_of_phrases

41
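
The count depends on how "separated by a blank space" is interpreted: `split(" ")` and `split()` disagree as soon as the text contains repeated spaces. A small illustration on a made-up string (`example` is hypothetical and not taken from the dataset):

```python
# Hypothetical text with repeated spaces and a trailing link (not from the dataset)
example = "Here's to  another   great year https://t.co/xyz"

by_single_space = example.split(" ")  # keeps empty strings between repeated spaces
by_whitespace = example.split()       # collapses any run of whitespace

# Drop links wherever they appear, not only at the end of the text
phrases = [w for w in by_whitespace if not w.startswith(("http://", "https://"))]

print(len(by_single_space), len(by_whitespace), len(phrases))
```

For the Tweet in this assignment both approaches agree, but on messier text `split()` is usually the safer default.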

<h2>Is the Tweet truncated?</h2>
<br>
<font size="+1" style="color:dodgerblue">
    <ul>
        <li>This question requires some judgment.</li>
        <br>
        <li>In this new version of the data, there is no key that directly indicates whether the Tweet is truncated.</li>
        <br>
        <li>You must exercise judgment determining if a Tweet is truncated.</li>
        <br>
        <li><a href="https://arxiv.org/abs/2009.07661#:~:text=In%20November%202017%2C%20Twitter%20doubled,most%20influential%20social%20media%20platforms.">Tweet length doubled from 140 to 280.</a></li>
        <br>
        <li>It is likely we cannot have a definitive 'yes' or 'no' to the question, but rather a likelihood associated with the outcome.</li>
        <br>
    </ul>
</font>

In [95]:
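def if_truncated(text):
  # Heuristic: the 280 limit applies to characters (not words),
  # and truncated text often ends with an ellipsis
  stripped = text.rstrip()
  if len(text) >= 280 or stripped.endswith(("...", "…")):
    print("This Tweet is very likely truncated")
  else:
    print("This Tweet is very likely not truncated")
  return None
if_truncated(Tweet_text)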

This Tweet is very likely not truncated
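
Since the question calls for a likelihood rather than a definitive yes or no, the same signals can be combined into a rough score. This is only a sketch: `truncation_score` is a hypothetical helper, and the weights 0.6 and 0.4 are arbitrary illustrative choices, not values estimated from data.

```python
def truncation_score(text, max_len=280):
    """Return a rough 0-1 score; the weights are arbitrary, illustrative choices."""
    stripped = text.rstrip()
    score = 0.0
    # Signal 1: the text ends with an ellipsis (three dots or the single character)
    if stripped.endswith(("...", "\u2026")):
        score += 0.6
    # Signal 2: the text is at (or beyond) the maximum Tweet length
    if len(text) >= max_len:
        score += 0.4
    return score


print(truncation_score("We believe the best future version of our API..."))
print(truncation_score("A short, complete Tweet."))
```

With labeled examples, the weights could be replaced by frequencies observed in the data.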


<h2>How many Twitter accounts are tagged in the Tweet?</h2>

In [30]:
num_tagged=Tweet_text.count("@")
num_tagged

0
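
Counting every `@` character would also count things like e-mail addresses. A hedged alternative is a regular expression that only accepts an `@` that does not follow a word character. The pattern below assumes handles are 1 to 15 word characters; `find_mentions` and the sample sentence are made up for illustration:

```python
import re

# Assumption: a handle is "@" followed by 1-15 letters, digits, or underscores,
# and the "@" must not directly follow a word character (this rules out e-mails)
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")


def find_mentions(text):
    """Return the handles (without '@') that look like account mentions."""
    return MENTION_RE.findall(text)


# Hypothetical sentence, not from the dataset
handles = find_mentions("Thanks @TwitterDev and @jack, e-mail me at a@b.com")
print(len(handles), handles)
```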

<h2>How many punctuation symbols are in the Tweet?</h2>

In [41]:
# Number of punctuation symbols, excluding the hyperlink
# Note: curly apostrophes (’), as in "Here’s", are not in this list and are not counted
punctuation_symbols = ["'", '"', '.', ',', ':', ';', '?', '!']
num_punctuation=sum([Tweet_text_without_link.count(punctuation) for punctuation in punctuation_symbols])
num_punctuation

3
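
The hand-written list above leaves out many symbols, including the curly apostrophes that appear in this Tweet. One possible broader definition, sketched here as an assumption rather than the expected answer, starts from `string.punctuation` and adds common typographic marks. `count_punctuation` and `sample` are illustrative names:

```python
import string

# ASCII punctuation plus a few typographic marks: ‘ ’ “ ” …
PUNCTUATION = set(string.punctuation) | {"\u2018", "\u2019", "\u201c", "\u201d", "\u2026"}


def count_punctuation(text):
    """Count characters that belong to the PUNCTUATION set."""
    return sum(1 for ch in text if ch in PUNCTUATION)


# Shortened, hypothetical variant of the Tweet text
sample = "Here\u2019s to another great year. We can\u2019t wait!"
print(count_punctuation(sample))
```

Because `string.punctuation` also contains `:`, `/`, and `.`, removing the hyperlink first matters even more with this definition.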

<h2>Has this Tweet been favorited?</h2>

In [52]:
favorite_num=data["data"][0]["public_metrics"]["like_count"]
if favorite_num != 0:
  print(f"This Tweet has been favorited, the favorite count is {favorite_num}.")
else:
  print("This Tweet has not been favorited.")

This Tweet has been favorited, the favorite count is 40.


<h2>Is this Tweet in reply to another Tweet?</h2>

In [55]:
# .get() avoids a KeyError for Tweets that lack the in_reply_to_user_id key
if data["data"][0].get("in_reply_to_user_id") is not None:
  print("The Tweet is replying to another Tweet.")
else:
  print("The Tweet is not replying to another Tweet.")

The Tweet is replying to another Tweet.
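
The `referenced_tweets` key shown in the output earlier offers a second way to answer this, because it records the type of each reference. A minimal sketch, assuming the reference types follow the pattern seen in the data (`replied_to`); `is_reply` is a hypothetical helper and `tweet` is an excerpt copied by hand:

```python
def is_reply(tweet):
    """True if any referenced Tweet is marked as 'replied_to'."""
    refs = tweet.get("referenced_tweets", [])
    return any(ref.get("type") == "replied_to" for ref in refs)


# Excerpt copied by hand from the first Tweet in the output above
tweet = {"referenced_tweets": [{"type": "replied_to", "id": "1212092627178287104"}]}
print(is_reply(tweet))
```

Checking both keys and confirming that they agree is a cheap sanity check on the data.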


<h2>What was the source of this Tweet?</h2>
<font size="+1">
    <br>
    What was the app the user tweeted from: a web browser, a mobile device, or some other app?
</font>

In [56]:
data["data"][0]["source"]

'Twitter Web App'

<h2>Has this Tweet been retweeted?</h2>

In [58]:
retweet_num=data["data"][0]["public_metrics"]["retweet_count"]
if retweet_num != 0:
  print(f"This Tweet has been retweeted, the retweet number is {retweet_num}.")
else:
  print("This Tweet has not been retweeted.")

This Tweet has been retweeted, the retweet number is 8.


<h2>When was this Tweet created?</h2>


In [60]:
time_created=data["data"][0]["created_at"]
time_created

'2019-12-31T19:26:16.000Z'
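
The value is a plain string. If a date object is more useful, it can be parsed with the `datetime` module already imported above. A short sketch, assuming every timestamp follows the same pattern as this one, with milliseconds and a trailing `Z` for UTC:

```python
import datetime

created_at = "2019-12-31T19:26:16.000Z"

# Assumption: all timestamps share this exact layout; the trailing Z denotes UTC
parsed = datetime.datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%S.%fZ")
parsed = parsed.replace(tzinfo=datetime.timezone.utc)

print(parsed.isoformat())
print(parsed.date(), parsed.hour)
```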

<h2>What are the <i>entities</i> of this Tweet?</h2>
<font size="+1">
    <br>
    Entities are JSON objects that provide additional information about hashtags, urls, user mentions, and cashtags associated with the description.
    <br>
    <br>
    <ul>
        <li>How many users does the Tweet mention? What are their names?</li>
        <br>
        <li>How many URLs does the Tweet mention? What are they?</li>
        <br>
        <li>How many hashtags does the Tweet have?</li>
        <br>
    </ul>
</font>

In [None]:
#######################################################################################
# This is very open ended, and subject to interpretation.
# In particular, you could use the 'context_annotations' key, or other keys in the data.
#######################################################################################

In [63]:
data["data"][0]["entities"]

{'urls': [{'start': 222,
   'end': 245,
   'url': 'https://t.co/yvxdK6aOo2',
   'expanded_url': 'https://twitter.com/LovesNandos/status/1211797914437259264/photo/1',
   'display_url': 'pic.twitter.com/yvxdK6aOo2'}],
 'annotations': [{'start': 144,
   'end': 150,
   'probability': 0.626,
   'type': 'Product',
   'normalized_text': 'Twitter'}]}

In [62]:
data["data"][0]["context_annotations"]

[{'domain': {'id': '119',
   'name': 'Holiday',
   'description': 'Holidays like Christmas or Halloween'},
  'entity': {'id': '1186637514896920576', 'name': ' New Years Eve'}},
 {'domain': {'id': '119',
   'name': 'Holiday',
   'description': 'Holidays like Christmas or Halloween'},
  'entity': {'id': '1206982436287963136',
   'name': 'Happy New Year: It’s finally 2020 everywhere!',
   'description': 'Catch fireworks and other celebrations as people across the globe enter the new year.\nPhoto via @GettyImages '}},
 {'domain': {'id': '46',
   'name': 'Brand Category',
   'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
  'entity': {'id': '781974596752842752', 'name': 'Services'}},
 {'domain': {'id': '47',
   'name': 'Brand',
   'description': 'Brands and Companies'},
  'entity': {'id': '10045225402', 'name': 'Twitter'}},
 {'domain': {'id': '119',
   'name': 'Holiday',
   'description': 'Holidays like Christmas or Halloween'},
  'entity': {'id': '

In [68]:
# How many users does the Tweet mention? Who are they?
# In Twitter API v2, mentions live under entities["mentions"]; the key is absent when there are none
mentions = data["data"][0]["entities"].get("mentions", [])
if not mentions:
  print("No user has been mentioned in the Tweet.")
else:
  print('The Tweet mentions {} other user(s).'.format(len(mentions)))
  print('They are: {}.'.format(', '.join(m["username"] for m in mentions)))


No user has been mentioned in the Tweet.


In [66]:
# How many URLs does the Tweet mention? What are they?
url_entities = data["data"][0]['entities'].get('urls', [])
print('The Tweet mentions {} URL(s).'.format(len(url_entities)))
if url_entities:
    # Join every URL, not only the first one
    urls = ', '.join(item["url"] for item in url_entities)
    print('The URL(s) is/are {}.'.format(urls))

The Tweet mentions 1 URL(s).
The URL(s) is/are https://t.co/yvxdK6aOo2.


In [70]:
# Example: [hashtag for hashtag in 'hello #as then #we_have something'.split(' ') if '#' in hashtag]
# How many hashtags does the Tweet have? What are they?
hashtag_list=[hashtag for hashtag in Tweet_text_without_link.split(' ') if '#' in hashtag]
if len(hashtag_list) == 0:
  print("There's no hashtag in the Tweet.")
else:
  print(f"There are {len(hashtag_list)} hashtag(s) in the Tweet: {hashtag_list}.")

There's no hashtag in the Tweet.
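
When the API does attach hashtag entities, reading them is more reliable than scanning the text. The sketch below assumes hashtag entities sit under `entities["hashtags"]` with a `tag` field, which does not appear in this particular Tweet, and falls back to a regular expression otherwise. `find_hashtags` and both sample Tweets are hypothetical:

```python
import re


def find_hashtags(tweet):
    """Prefer the API's hashtag entities; fall back to scanning the text."""
    # Assumption: hashtag entities look like {"start": ..., "end": ..., "tag": ...}
    tagged = tweet.get("entities", {}).get("hashtags")
    if tagged:
        return [item["tag"] for item in tagged]
    # Fallback: '#' not preceded by a word character, followed by word characters
    return re.findall(r"(?<!\w)#(\w+)", tweet.get("text", ""))


# Hypothetical Tweets, not from the dataset
from_text = find_hashtags({"text": "hello #as then #we_have something"})
from_entities = find_hashtags(
    {"text": "Build with the #API", "entities": {"hashtags": [{"start": 15, "end": 19, "tag": "API"}]}}
)
print(from_text, from_entities)
```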


In [94]:
# How many domains do the Tweet's context annotations cover? What are they?
domain_list=data["data"][0]["context_annotations"]
annotation_num=len(domain_list)
domain_counts = {}

for item in domain_list:
    domain_name = item['domain']['name']
    # Count how many annotations fall under each domain name
    domain_counts[domain_name] = domain_counts.get(domain_name, 0) + 1
print(f"There are {annotation_num} context annotations across {len(domain_counts)} distinct domains:")
for key in domain_counts.keys():
  print(f"{domain_counts[key]} {key} domain")

There are 5 context annotations across 3 distinct domains:
3 Holiday domain
1 Brand Category domain
1 Brand domain
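
The same tally can be written more compactly with `collections.Counter`. A minimal sketch on a stripped-down list that keeps only the domain names seen in the output above; `annotations` is a hand-made stand-in for `data["data"][0]["context_annotations"]`:

```python
from collections import Counter

# Stand-in for the context annotations, keeping only the domain names shown above
annotations = [
    {"domain": {"name": "Holiday"}},
    {"domain": {"name": "Holiday"}},
    {"domain": {"name": "Brand Category"}},
    {"domain": {"name": "Brand"}},
    {"domain": {"name": "Holiday"}},
]

domain_counts = Counter(item["domain"]["name"] for item in annotations)

print(f"{len(annotations)} annotations, {len(domain_counts)} distinct domains")
for name, count in domain_counts.most_common():
    print(f"{count} {name} domain")
```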
