## Assignment 2

By Charlie Mei (UNI: cm3947)

### The Task

Write a Python program that:
- Reads JSON objects of newsfeeds from the data file into a list or an array of Python dictionaries (or a Pandas dataframe)
- Prints the schema of the JSON object
- Prints the number of newsfeeds (JSON objects) in the collection
- Creates a set of unique newsfeeds by title and prints the new total collection size
- Prints the latest 100 article titles and urls

### My Response

In [11]:
import json
import pandas as pd
from genson import SchemaBuilder

#### Loading the data

In [2]:
# Read the data file, which stores one JSON object (feed) per line
with open('webhose_netflix.json') as f:
    data = f.readlines()

# Parse each line into a Python dictionary and collect the feeds in a list
feeds = [json.loads(feed) for feed in data]
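
The loading step above assumes every line of the file is a valid JSON object. A minimal sketch of a slightly more defensive loader follows; the function name `load_json_lines` is hypothetical, and the blank-line handling is an assumption about what the data file might contain, not something observed in it.

```python
import json

def load_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines instead of failing on them
                continue
            records.append(json.loads(line))
    return records
```

Calling `load_json_lines('webhose_netflix.json')` should give the same list as the cell above whenever the file has no blank lines.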

#### Acquiring the Schema of a feed

In [3]:
# Use the first feed as a representative sample of the collection
proxy = feeds[0]

# Infer a JSON Schema from the sample feed with genson
builder = SchemaBuilder()
builder.add_object(proxy)
builder.to_schema()

{'$schema': 'http://json-schema.org/schema#',
 'type': 'object',
 'properties': {'thread': {'type': 'object',
   'properties': {'uuid': {'type': 'string'},
    'url': {'type': 'string'},
    'site_full': {'type': 'string'},
    'site': {'type': 'string'},
    'site_section': {'type': 'string'},
    'site_categories': {'type': 'array', 'items': {'type': 'string'}},
    'section_title': {'type': 'string'},
    'title': {'type': 'string'},
    'title_full': {'type': 'string'},
    'published': {'type': 'string'},
    'replies_count': {'type': 'integer'},
    'participants_count': {'type': 'integer'},
    'site_type': {'type': 'string'},
    'country': {'type': 'string'},
    'spam_score': {'type': 'number'},
    'main_image': {'type': 'string'},
    'performance_score': {'type': 'integer'},
    'domain_rank': {'type': 'integer'},
    'social': {'type': 'object',
     'properties': {'facebook': {'type': 'object',
       'properties': {'likes': {'type': 'integer'},
        'comments': {'typ
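
For a quick look at the structure without genson, the same idea can be sketched with the standard library alone. This is not JSON Schema and not how the output above was produced; `describe` is a hypothetical helper, and the `sample` record is made up from a few key names that appear in the schema above. Lists are summarised by their first element only, which is a simplification.

```python
def describe(value):
    """Return a nested summary of JSON type names for a parsed JSON value."""
    if isinstance(value, dict):
        return {key: describe(val) for key, val in value.items()}
    if isinstance(value, list):
        # Simplification: summarise a list by its first element only
        return [describe(value[0])] if value else []
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if value is None:
        return "null"
    return "string"

# Made-up record reusing key names from the schema above
sample = {
    "thread": {
        "title": "x",
        "replies_count": 0,
        "spam_score": 0.0,
        "site_categories": ["media"],
    }
}
print(describe(sample))
```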

#### What is the total number of feeds?

In [4]:
# The number of feeds in the collection
print("There is a total of " + str(len(feeds)) + " feeds.")

There is a total of 25288 feeds.


#### Creating a set of unique newsfeeds by title and printing the number of unique feeds

In [5]:
# Build a dictionary keyed by title, then keep only its values.
# A feed with a duplicate title overwrites the earlier entry under the same key,
# so the last feed seen for each title is the one that is kept.
unique_feeds = list({feed['title']: feed for feed in feeds}.values())
print("There are " + str(len(unique_feeds)) + " unique feeds.")

There are 19514 unique feeds.
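
The dictionary approach above keeps the last feed seen for each title. If the first occurrence should be kept instead, one option is `dict.setdefault`. The sketch below contrasts the two on a small made-up list (`feeds_sample` and its values are illustrative, not taken from the data file).

```python
feeds_sample = [
    {"title": "A", "url": "first"},
    {"title": "B", "url": "only"},
    {"title": "A", "url": "second"},
]

# Dict comprehension: later duplicates overwrite earlier ones (last wins)
keep_last = list({feed["title"]: feed for feed in feeds_sample}.values())

# setdefault only stores a title the first time it is seen (first wins)
first_seen = {}
for feed in feeds_sample:
    first_seen.setdefault(feed["title"], feed)
keep_first = list(first_seen.values())

print([f["url"] for f in keep_last])   # ['second', 'only']
print([f["url"] for f in keep_first])  # ['first', 'only']
```

Both variants give the same number of unique feeds; they differ only in which duplicate survives.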


#### The titles and URLs of the latest 100 feeds

The latest feeds are defined as those with the most recent *publish date*, taken from each feed's ```published``` field.

In [6]:
# Sort the unique feeds by their published timestamp, newest first, and keep the first 100
latest_feeds = sorted(unique_feeds, key=lambda x: x['published'], reverse=True)
latest_100 = latest_feeds[:100]
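
Sorting the raw ```published``` strings orders feeds correctly only when all timestamps share the same UTC offset, as the ones printed below do. If offsets differ, parsing the strings first is safer. The following is a sketch under two assumptions: every timestamp has the ISO 8601 form shown in the output below, and Python 3.7 or later is used (for `datetime.fromisoformat`). The helper `published_at` and the two sample records are hypothetical.

```python
from datetime import datetime

def published_at(feed):
    """Parse the ISO 8601 'published' string into a timezone-aware datetime."""
    return datetime.fromisoformat(feed["published"])

# Made-up records: the first looks later as a string but is earlier in time
sample = [
    {"title": "earlier", "published": "2020-06-03T22:49:00.000+03:00"},  # 19:49 UTC
    {"title": "later", "published": "2020-06-03T21:00:00.000+00:00"},    # 21:00 UTC
]

by_string = sorted(sample, key=lambda x: x["published"], reverse=True)
by_instant = sorted(sample, key=published_at, reverse=True)

print([f["title"] for f in by_string])   # ['earlier', 'later']
print([f["title"] for f in by_instant])  # ['later', 'earlier']
```

Passing `key=published_at` to the `sorted` call above would apply the same ordering to `unique_feeds`.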

In [8]:
for feed in latest_100:
    print("Feed from: " + feed['published'])
    print(feed['title'] + ": " + feed['url'] + "\n")

Feed from: 2020-06-03T22:49:00.000+03:00
13 Reasons Why: The popular Netflix show's creator teases chance of a hopeful ending: http://omgili.com/ri/.wHSUbtEfZSCvFgWhG.N__Y_kk6rEaYdjsrpI1bEeKmoCc0M_39dynSM56R6HIaYHQb2iaKLcfpCeAIHag3wvcOsuDtZvPdYxN1tBRxA4Hm8FSvJ6kVO10.ObZ3FpwSLKFW7k_06pIBSt.nChTTEKY1WdyGTuTmASa5H4UZHxGGOYFtRgQKk7rwvpg01M7EL

Feed from: 2020-06-03T07:33:00.000+03:00
A TV reboot of Bong Joon-ho's acclaimed film Snowpiercer has landed on Netflix — what's the deal?: http://omgili.com/ri/.wHSUbtEfZSvmJKugzHb_f4zAMZSmc0Tw9Lbg6dKYM_tvS.4yIw9SRIfO51Mo40h6m0sQe61rmlFUDUyIcFLalrWUdui2l2P3dAXZTJ8WgzqSPhKXeQQsD0AXKxkWg9T0GWSeWkd0reGCT8eynVmSw--

Feed from: 2020-06-03T07:00:00.000+03:00
2-Pack: Ideaworks Mosquito Killer Lamps (battery powered) 2 for $15: http://omgili.com/ri/.wHSUbtEfZTqX_2tXJqGoKlaYsSVurRxrDVEoi9Rqy_uXn2yaTd5VXr7oQuo19Jz12otJRKgTofI_TmTuBGdl2De3EZ2uAP8c8GF5JpdPwY-

Feed from: 2020-06-03T06:45:00.000+03:00
Already-Obese Average Americans Have Drunk & Eaten Their Way 