<font size="+3">
    <b>Working with JSON Data</b>
</font>

<h1>Instructions</h1>
<br>
<font size="+1" style="color:dodgerblue">
    <ul>
        <li>There are questions in this assignment that are intentionally vague. You are to use your best judgment to decide the best way to answer the question.</li>
        <br>
        <li>Explain your answers as best as you can.</li>
        <br>
        <li>Note that some of the keys asked about in this assignment, which appeared in our in-class Twitter data, might not be present in this version of the Twitter API-generated dataset.</li>
        <br>
        <ul>
            <li>This is something that often happens in databases within companies.</li>
            <br>
            <li>Your job as a data analyst or data scientist, or someone managing a team of analysts, is to decide on a game plan for answering the questions with the given data that you do have.</li>
            <br>
            <li>It is possible that there isn't a good solution, and the solution is to talk to the data vendor to get different data capable of answering the question. </li>
            <br>
            <li>However, sometimes it is possible to get a sufficient answer if you think outside of the box.</li>
            <br>
            <li>The most important thing is that you explain what you did and how it helps answer the question, at least to a first approximation.</li>
            <br>
            <li>For example, if there isn't data telling you the Tweet has been truncated, how could you use the text to give an answer that is right some percentage of the time? </li>
            <br>
            <li>One might consider whether an ellipsis appears at the end of the Tweet, whether the Tweet is at its maximum length, whether ellipses appear at both the beginning and the end of the Tweet, and so on.</li>
            <br>
        </ul>
         <li style="color:red">Submit the <i>.ipynb</i> file on TurnItIn on Blackboard by the deadline. Be sure to refresh the page and double check your submission actually went through. <b>Note that you only need to submit your solutions, not all of the other recommended steps.</b> The recommended steps are meant to serve as a guide for your thinking process.</li>
            <br>
        <br>
        <hr style="border: 10px solid black">
        <br>
        <li style="color:black"><b>Grading</b></li>
        <br>
        <li style="color:black">There are four possible scores you can get from submitting this assignment on time (submitting a blank file or one without any apparent effort does not count). Note that the rubric is designed to incentivize you to go for $100\%$ mastery of the material, as the little details matter in programming.</li>
        <br>
        <ul style="color:black">
            <li>Grade of $5$ out of $5$ - perfect submission with no significant errors</li>
            <br>
            <li>Grade of $4$ out of $5$ - near perfect submission with one or more significant errors</li>
            <br>
            <li>Grade of $2$ out of $5$ - apparent effort but far from perfect</li>
            <br>
            <li>Grade of $0$ out of $5$ - no submission or no apparent effort</li>
            <br>
        </ul>
    </ul>
</font>

<hr style="border: 20px solid black">

<h1>Before You Begin</h1>
<br>
<font size="+1">
    <ul>
        <li>Please read: <b>04_Python_and_Twitter_Data.ipynb</b></li>
        <br>
    </ul>
</font>

<hr style="border: 20px solid black">

<h1>Imports</h1>

In [6]:
import os
import time

import json
import datetime

import pandas as pd

<h1>Twitter API v2</h1>

<a href="https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet">Twitter API v2 Data Dictionary</a>

In [7]:
# The context manager closes the file even if json.load raises an error.
with open("twitter_data_v2_api.json", "r") as file:
    data = json.load(file)

In [8]:
data = data["data"][0]
data

{'id': '1212092628029698048',
 'text': 'We believe the best future version of our API will come from building it with YOU. Here’s to another great year with everyone who builds on the Twitter platform. We can’t wait to continue working with you in the new year. https://t.co/yvxdK6aOo2',
 'possibly_sensitive': False,
 'referenced_tweets': [{'type': 'replied_to', 'id': '1212092627178287104'}],
 'entities': {'urls': [{'start': 222,
    'end': 245,
    'url': 'https://t.co/yvxdK6aOo2',
    'expanded_url': 'https://twitter.com/LovesNandos/status/1211797914437259264/photo/1',
    'display_url': 'pic.twitter.com/yvxdK6aOo2'}],
  'annotations': [{'start': 144,
    'end': 150,
    'probability': 0.626,
    'type': 'Product',
    'normalized_text': 'Twitter'}]},
 'author_id': '2244994945',
 'public_metrics': {'retweet_count': 8,
  'reply_count': 2,
  'like_count': 40,
  'quote_count': 1},
 'lang': 'en',
 'created_at': '2019-12-31T19:26:16.000Z',
 'source': 'Twitter Web App',
 'in_reply_to_user_i

<h2>What are the keys to this data dictionary?</h2>

In [40]:
data.keys()

dict_keys(['id', 'text', 'possibly_sensitive', 'referenced_tweets', 'entities', 'author_id', 'public_metrics', 'lang', 'created_at', 'source', 'in_reply_to_user_id', 'attachments', 'context_annotations'])

In [41]:
for key in data.keys():
    print(key)
print(f"There are {len(data.keys())} keys.")

id
text
possibly_sensitive
referenced_tweets
entities
author_id
public_metrics
lang
created_at
source
in_reply_to_user_id
attachments
context_annotations
There are 13 keys.
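
Since some keys from the in-class data may be missing here, it can help to check for them before indexing. The following is a minimal sketch under stated assumptions: `sample_tweet` is a hypothetical, trimmed-down record, and `truncated` stands in for a key that is assumed to be absent from this dataset.

```python
# Hypothetical, trimmed-down Tweet record used only for illustration.
sample_tweet = {
    "id": "1212092628029698048",
    "lang": "en",
    "public_metrics": {"retweet_count": 8, "like_count": 40},
}

# Keys we would like to use; 'truncated' is assumed to be missing here.
wanted_keys = ["id", "lang", "truncated", "public_metrics"]

present = [key for key in wanted_keys if key in sample_tweet]
missing = [key for key in wanted_keys if key not in sample_tweet]

# dict.get returns a default instead of raising KeyError for a missing key.
truncated_flag = sample_tweet.get("truncated")  # None when the key is absent
like_count = sample_tweet.get("public_metrics", {}).get("like_count", 0)

print(f"Present: {present}")
print(f"Missing: {missing}")
```

Knowing which keys are missing up front makes it clear which questions need a workaround based on the Tweet text.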


<h2>How many "phrases" are in the Tweet?</h2>
<font size="+1">
    <br>
    I'm thinking of a phrase as a generalized word. More specifically, something separated by a blank space.
</font>

In [9]:
tweet_text = data["text"]
tweet_text
total_phrases = len(tweet_text.split(" "))
total_phrases

42
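
The count above depends on how "separated by a blank space" is interpreted. As a small illustration on a made-up string (not the Tweet itself), `split(" ")` and `split()` can disagree when the text contains repeated spaces, a trailing space, or a line break:

```python
# Made-up text with a double space, a newline, and a trailing space.
sample_text = "Happy  new year!\nSee you in 2020. "

# split(" ") cuts at every single space, so consecutive spaces and a trailing
# space yield empty strings, and the newline is not treated as a separator.
by_single_space = sample_text.split(" ")

# split() with no argument cuts on any run of whitespace and drops empty strings.
by_any_whitespace = sample_text.split()

print(len(by_single_space), by_single_space)
print(len(by_any_whitespace), by_any_whitespace)
```

Either definition can be defended for this assignment, as long as the choice is stated.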

<h2>Is the Tweet truncated?</h2>
<br>
<font size="+1" style="color:dodgerblue">
    <ul>
        <li>This question requires some judgment.</li>
        <br>
        <li>In this new version of the data, there isn't a key that directly indicates whether the Tweet is truncated.</li>
        <br>
        <li>You must exercise judgment determining if a Tweet is truncated.</li>
        <br>
        <li><a href="https://arxiv.org/abs/2009.07661#:~:text=In%20November%202017%2C%20Twitter%20doubled,most%20influential%20social%20media%20platforms.">Tweet length doubled from 140 to 280.</a></li>
        <br>
        <li>It is likely we cannot have a definitive 'yes' or 'no' to the question, but rather a likelihood associated with the outcome.</li>
        <br>
    </ul>
</font>

In [10]:
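# A stored Tweet cannot exceed the 280-character limit, so check whether it
# reaches the limit or ends with an ellipsis ("..." or "…") instead.
if len(tweet_text) >= 280 or tweet_text.rstrip().endswith(("...", "…")):
    print("This Tweet is likely truncated.")
else:
    print("This Tweet is not truncated.")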

This Tweet is not truncated.
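
Because the answer is better expressed as a likelihood than as a yes or no, the heuristics listed in the instructions can be combined into a rough score. This is only a sketch: `truncation_score` is a hypothetical helper, and its weights are arbitrary illustrative choices rather than calibrated probabilities.

```python
MAX_TWEET_LENGTH = 280  # character limit after the 2017 change


def truncation_score(text: str, max_length: int = MAX_TWEET_LENGTH) -> float:
    """Return a rough score in [0, 1]; higher means truncation is more plausible."""
    stripped = text.strip()
    ellipses = ("...", "\u2026")  # three periods or the single ellipsis character
    score = 0.0
    if stripped.endswith(ellipses):
        score += 0.6  # strongest hint: the text trails off
    if len(stripped) >= max_length:
        score += 0.3  # the text fills the whole character budget
    if stripped.startswith(ellipses):
        score += 0.1  # the text also looks cut at the front
    return round(score, 2)


print(truncation_score("We can't wait to continue working with you in the new year."))
print(truncation_score("Here is the start of a long thought..."))
```

A threshold on the score (for example, 0.5) would turn it back into a yes-or-no answer if one is required.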


<h2>How many Twitter accounts are tagged in the Tweet?</h2>

In [63]:
num_user_tags = tweet_text.count("@")
num_user_tags

0
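
Counting every `@` also counts e-mail addresses and stray symbols. A stricter sketch uses a regular expression; the pattern below is a simplifying assumption (a handle is `@` followed by 1 to 15 letters, digits, or underscores, not preceded by such a character), and the sample text is made up.

```python
import re

# Simplified assumption about what a handle looks like; the lookbehind
# rejects '@' signs that sit inside a word, such as in e-mail addresses.
MENTION_PATTERN = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})")

sample_text = "Thanks @TwitterDev and @jack! Mail me at someone@example.com"

mentions = MENTION_PATTERN.findall(sample_text)

print(f"'@' characters: {sample_text.count('@')}")
print(f"Tagged accounts: {len(mentions)} {mentions}")
```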

<h2>How many punctuation symbols are in the Tweet?</h2>

In [64]:
punctuation_symbols = ["'", '"', '.', ',', ':', ';', '?', '!']
# Subtract 3 per "..." so that the periods of an ellipsis are not counted.
num_punctuation_symbols = (
    sum(tweet_text.count(symbol) for symbol in punctuation_symbols)
    - 3 * tweet_text.count('...')
)

num_punctuation_symbols

5
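
A fixed list of ASCII symbols misses typographic punctuation such as the curly apostrophe (’) that appears in this Tweet. One alternative, sketched here on a made-up sentence, is to count every character whose Unicode category starts with "P" (punctuation). Note that this definition also counts symbols such as `/`, `#`, and `@`, which may or may not be wanted.

```python
import unicodedata

# Made-up sentence with two curly apostrophes (U+2019), a period, and '!'.
sample_text = "Here\u2019s to another great year. We can\u2019t wait!"

ascii_symbols = ["'", '"', '.', ',', ':', ';', '?', '!']
ascii_count = sum(sample_text.count(symbol) for symbol in ascii_symbols)

# Unicode general categories beginning with 'P' cover all punctuation classes.
unicode_count = sum(
    1 for char in sample_text if unicodedata.category(char).startswith("P")
)

print(f"ASCII list: {ascii_count}, Unicode categories: {unicode_count}")
```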

<h2>Has this Tweet been favorited?</h2>

In [72]:
# A Tweet has been favorited (liked) if at least one user has liked it.
if data["public_metrics"]["like_count"] > 0:
    print("This Tweet has been favorited.")
else:
    print("This Tweet has not been favorited.")

This Tweet has been favorited.


40

<h2>Is this Tweet in reply to another Tweet?</h2>

In [79]:
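# 'reply_count' counts replies *to* this Tweet, so it cannot answer the question.
# A reply lists the Tweet it answers in 'referenced_tweets' with type 'replied_to'.
is_reply = any(ref["type"] == "replied_to" for ref in data.get("referenced_tweets", []))
if is_reply:
    print("This Tweet replies to another Tweet.")
else:
    print("This Tweet does not reply to another Tweet.")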

This Tweet replies to another Tweet.


<h2>What was the source of this Tweet?</h2>
<font size="+1">
    <br>
    What was the app the user tweeted from: a web browser, a mobile device, or some other app?
</font>

In [74]:
data["source"]

'Twitter Web App'

<h2>Has this Tweet been retweeted?</h2>

In [75]:
# A Tweet has been retweeted if its retweet count is at least 1.
if data["public_metrics"]["retweet_count"] > 0:
    print("This Tweet has been retweeted.")
else:
    print("This Tweet has not been retweeted.")

This Tweet has been retweeted.


<h2>When was this Tweet created?</h2>


In [82]:
date_time_stamp = data['created_at']

date_time_stamp

'2019-12-31T19:26:16.000Z'
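
The timestamp is an ISO 8601 string in UTC (the trailing `Z`). A minimal sketch of converting it into a `datetime` object, so that the date, time, and weekday can be queried directly, might look like this; the format string is an assumption based on the single value shown above.

```python
import datetime

date_time_stamp = "2019-12-31T19:26:16.000Z"

# %f consumes the fractional seconds; the trailing 'Z' is matched literally,
# so the UTC timezone is attached explicitly afterwards.
created = datetime.datetime.strptime(
    date_time_stamp, "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=datetime.timezone.utc)

print(created.date())        # calendar date
print(created.time())        # time of day in UTC
print(created.weekday())     # 0 = Monday ... 6 = Sunday
```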

<h2>What are the <i>entities</i> of this Tweet?</h2>
<font size="+1">
    <br>
    Entities are JSON objects that provide additional information about hashtags, urls, user mentions, and cashtags associated with the description.
    <br>
    <br>
    <ul>
        <li>How many users does the Tweet mention? What are their names?</li>
        <br>
        <li>How many URLs does the Tweet mention? What are they?</li>
        <br>
        <li>How many hashtags does the Tweet have?</li>
        <br>
    </ul>
</font>

In [None]:
#######################################################################################
# This is very open ended, and subject to interpretation.
# In particular, you could use the 'context_annotations' key, or other keys in the data.
#######################################################################################

In [169]:
# How many users does the Tweet mention? What are their names?
# 'context_annotations' describes topics (e.g. 'Holiday', 'Brand'), not users.
# User mentions, when present, are listed under data["entities"]["mentions"].
mentions = data["entities"].get("mentions", [])
if mentions:
    print("The Tweet mentions {} other user(s).".format(len(mentions)))
    user_names = [mention["username"] for mention in mentions]
    print("Their usernames are {}.".format(user_names))
else:
    print("The Tweet doesn't mention any other users.")

The Tweet doesn't mention any other users.


In [177]:
# How many URLs does the Tweet mention? What are they?
# .get guards against Tweets whose entities contain no 'urls' list.
urls = [url_entity["url"] for url_entity in data["entities"].get("urls", [])]
print('The Tweet mentions {} URL(s).'.format(len(urls)))
if urls:
    print('The URLs are {}.'.format(urls))

The Tweet mentions 1 URL(s).
The URLs are ['https://t.co/yvxdK6aOo2'].
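
The `url` field holds the shortened `t.co` link, while `expanded_url` shows where it actually points. As a sketch, the two can be paired up; the `entities` dictionary below copies the values shown earlier in this notebook, and the fallback to `url` when `expanded_url` is missing is an assumption.

```python
# URL entity copied from the Tweet shown earlier in this notebook.
entities = {
    "urls": [
        {
            "start": 222,
            "end": 245,
            "url": "https://t.co/yvxdK6aOo2",
            "expanded_url": "https://twitter.com/LovesNandos/status/1211797914437259264/photo/1",
            "display_url": "pic.twitter.com/yvxdK6aOo2",
        }
    ]
}

# Pair each short link with its destination, falling back to the short link.
url_pairs = [
    (entity["url"], entity.get("expanded_url", entity["url"]))
    for entity in entities.get("urls", [])
]

for short_url, expanded_url in url_pairs:
    print(f"{short_url} -> {expanded_url}")
```

Here the expanded URL ends in `/photo/1`, which suggests the link is an attached photo rather than an external web page.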


In [None]:
####################################################################################################################
# Example
# [hashtag for hashtag in 'hello #as then #we_have something'.split(' ') if '#' in hashtag]
####################################################################################################################

In [214]:
# How many hashtags does the Tweet have?
if tweet_text.count("#") > 0:
    print('The Tweet mentions {} hashtag(s).'.format(tweet_text.count("#")))
    hashtags = [hashtag for hashtag in tweet_text.split() if "#" in hashtag]
    print('The hashtags are {}.'.format(hashtags))
else:
    print('The Tweet mentions {} hashtag(s).'.format(tweet_text.count("#")))

The Tweet mentions 0 hashtag(s).
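
Checking for `#` anywhere in a token also matches things like `C#` or a URL fragment. A stricter sketch, extending the example string from the hint above with made-up additions, requires the `#` to start a token and to be followed by at least one word character; the pattern is a simplifying assumption, not Twitter's own hashtag rule.

```python
import re

# '#' must not follow a letter, digit, underscore, or slash, and must be
# followed by at least one word character.
HASHTAG_PATTERN = re.compile(r"(?<![A-Za-z0-9_/])#(\w+)")

sample_text = "hello #as then #we_have something, C# docs at https://example.com/page#top"

naive = [token for token in sample_text.split(" ") if "#" in token]
hashtags = HASHTAG_PATTERN.findall(sample_text)

print(f"Tokens containing '#': {naive}")
print(f"Hashtags: {hashtags}")
```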
