## Assignment 2

By Charlie Mei (UNI: cm3947)

### The Task

Write a Python program that:
- Reads JSON objects of newsfeeds from the data file into a list or an array of Python dictionaries (or a Pandas dataframe)
- Prints the schema of the JSON object
- Prints the number of newsfeeds (JSON objects) in the collection
- Creates a set of unique newsfeeds by title and prints the new total collection size
- Prints the latest 100 article titles and URLs

### My Response

In [1]:
import json
import pandas as pd
from genson import SchemaBuilder
import webhoseio

#### Acquiring feeds for Amazon in the last 10 days and printing the number of newsfeeds

In [88]:
# Configure the webhoseio client and the query parameters
webhoseio.config(token="ee3b1bdb-7d4b-4a0d-8465-59f05d96783f")
query_params = {
    "q": "organization:Amazon language:english",
    "ts": "1591230943381",
    "sort": "published"
}

In [89]:
# Get the first 100 feeds
output = webhoseio.query("filterWebContent", query_params)
feeds = list(output['posts'])
len(feeds)

100

In [97]:
# How many more pages do we need?
(output['totalResults'] - 100) / 100

201.9
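The fractional result has to be rounded up, since a partial page still costs one request. A minimal sketch of that arithmetic follows; the helper name `remaining_pages` is hypothetical, the page size of 100 is assumed from the first response, and the total of 20290 is the value implied by the 201.9 above.

```python
import math

def remaining_pages(total_results, page_size=100):
    """Number of further requests needed after the first page has been fetched."""
    remaining = max(total_results - page_size, 0)
    return math.ceil(remaining / page_size)

# 20290 results in total, 100 already fetched
pages = remaining_pages(20290)
print(pages)
```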

In [94]:
# Fetch the remaining 202 pages (201.9 rounded up)
for _ in range(202):
    output = webhoseio.get_next()
    feeds.extend(output['posts'])

len(feeds)

20290
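The page count above is hard-coded. As an alternative, the loop could stop on its own once the reported total has been reached. The sketch below assumes only that each response is a dictionary with `posts` and `totalResults` keys, as used above; `collect_posts` and `fake_next` are hypothetical names, and the fake pages stand in for the real API so the example runs offline.

```python
def collect_posts(first_page, fetch_next):
    """Gather posts page by page until the reported total is reached
    or a page comes back empty."""
    posts = list(first_page["posts"])
    total = first_page["totalResults"]
    while len(posts) < total:
        page = fetch_next()
        if not page["posts"]:
            break  # nothing more to read; avoid looping forever
        posts.extend(page["posts"])
    return posts

# Hypothetical stand-in for the API: 250 fake posts served 100 at a time.
_fake = [{"title": f"post {i}"} for i in range(250)]
_pages = iter([_fake[100:200], _fake[200:250]])

def fake_next():
    return {"posts": next(_pages, []), "totalResults": len(_fake)}

first = {"posts": _fake[:100], "totalResults": len(_fake)}
all_posts = collect_posts(first, fake_next)
print(len(all_posts))
```

With the real client, `webhoseio.get_next` would presumably be passed as `fetch_next` and the first query result as `first_page`.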

#### Acquiring the Schema of a feed

In [98]:
# Use the first feed as a representative sample
proxy = feeds[0]

# Build a JSON schema from the sample with genson
builder = SchemaBuilder()
builder.add_object(proxy)
builder.to_schema()

{'$schema': 'http://json-schema.org/schema#',
 'type': 'object',
 'properties': {'thread': {'type': 'object',
   'properties': {'uuid': {'type': 'string'},
    'url': {'type': 'string'},
    'site_full': {'type': 'string'},
    'site': {'type': 'string'},
    'site_section': {'type': 'string'},
    'site_categories': {'type': 'array'},
    'section_title': {'type': 'string'},
    'title': {'type': 'string'},
    'title_full': {'type': 'string'},
    'published': {'type': 'string'},
    'replies_count': {'type': 'integer'},
    'participants_count': {'type': 'integer'},
    'site_type': {'type': 'string'},
    'country': {'type': 'string'},
    'spam_score': {'type': 'number'},
    'main_image': {'type': 'null'},
    'performance_score': {'type': 'integer'},
    'domain_rank': {'type': 'null'},
    'social': {'type': 'object',
     'properties': {'facebook': {'type': 'object',
       'properties': {'likes': {'type': 'integer'},
        'comments': {'type': 'integer'},
        'shares': 
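
To illustrate how a schema like this is derived from a single object, here is a simplified standard-library sketch. It is not genson's implementation; `infer_schema` is a hypothetical helper and the sample record is made up. It also shows why a field that happens to be empty in the sampled feed (such as `main_image` above) is reported as `null`.

```python
def infer_schema(value):
    """Return a small JSON-Schema-like description of a decoded JSON value."""
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: infer_schema(v) for k, v in value.items()}}
    if isinstance(value, list):
        return {"type": "array"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    return {"type": "string"}  # remaining JSON scalar type

# Made-up record with the same kinds of fields as a feed
sample = {"title": "example", "spam_score": 0.0, "main_image": None,
          "social": {"facebook": {"likes": 3}}}
schema = infer_schema(sample)
print(schema["properties"]["main_image"])
```

Inferring from one object only describes that object, so a schema built from several feeds would be more representative of the collection.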

#### What is the total number of feeds?

In [99]:
# The number of feeds in the collection
print(f"There are a total of {len(feeds)} feeds.")

There are a total of 20290 feeds.


#### Creating a set of unique newsfeeds by title and printing the number of unique feeds

In [115]:
# Build a dictionary keyed by title, then keep only its values.
# A feed with a duplicate title overwrites the earlier entry under the same key,
# so the last feed seen for each title is the one that is kept.
unique_feeds = list({f['title']: f for f in feeds}.values())
print(f"There are {len(unique_feeds)} unique feeds.")

There are 16246 unique feeds.
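
The dictionary approach keeps the last feed seen for each title. A small sketch on made-up records shows that behaviour next to a first-wins variant using `dict.setdefault`; the sample data and variable names are illustrative only.

```python
feeds_sample = [
    {"title": "A", "url": "u1"},
    {"title": "B", "url": "u2"},
    {"title": "A", "url": "u3"},  # duplicate title
]

# Last occurrence wins: later values overwrite earlier ones for the same key.
last_wins = list({f["title"]: f for f in feeds_sample}.values())

# First occurrence wins: setdefault only stores a value if the key is new.
seen = {}
for f in feeds_sample:
    seen.setdefault(f["title"], f)
first_wins = list(seen.values())

print([f["url"] for f in last_wins])
print([f["url"] for f in first_wins])
```

Either way the count of unique titles is the same; the choice only affects which duplicate's `url` and `published` values survive.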


#### The title and url of the latest 100 feeds

Latest feeds are those with the most recent *publish date*, taken from the `published` field.

In [116]:
# Sort by the published timestamp, newest first, and keep the top 100
latest_feeds = sorted(unique_feeds, key=lambda x: x['published'], reverse=True)
latest_100 = latest_feeds[:100]
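
Sorting the raw `published` strings is chronological only if every timestamp carries the same UTC offset. A hedged sketch of a sturdier variant parses the strings first, assuming Python 3.7+ and that all values follow the ISO 8601 shape seen in this data; the helper `published_at` and the second sample timestamp are made up for illustration.

```python
from datetime import datetime

def published_at(feed):
    """Parse the ISO 8601 `published` string into a timezone-aware datetime."""
    return datetime.fromisoformat(feed["published"])

sample_feeds = [
    {"title": "earlier", "published": "2020-06-14T03:30:00.000+03:00"},  # 00:30 UTC
    {"title": "later", "published": "2020-06-14T02:00:00.000+00:00"},    # 02:00 UTC
]

by_string = sorted(sample_feeds, key=lambda f: f["published"], reverse=True)
by_time = sorted(sample_feeds, key=published_at, reverse=True)

print(by_string[0]["title"])  # string order puts the 03:30+03:00 entry first
print(by_time[0]["title"])    # chronological order puts the 02:00+00:00 entry first
```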

In [117]:
for feed in latest_100:
    print(f"Feed from: {feed['published']}")
    print(f"{feed['title']}: {feed['url']}\n")

Feed from: 2020-06-14T03:30:00.000+03:00
Ninja Air Fryer that Cooks, Crisps and Dehydrates $99: https://freebies2deals.com/2020/06/ninja-air-fryer-that-cooks-crisps-and-dehydrates-99.html

Feed from: 2020-06-14T03:14:00.000+03:00
Anchorseal 2 $28.47 at Woodcraft: https://slickdeals.net/f/14125592-anchorseal-2-28-47-at-woodcraft?utm_source=rss&utm_content=9&utm_medium=RSS2

Feed from: 2020-06-14T03:11:00.000+03:00
INTRODUCCIÓN ¿QUÉ ES EL DROPSHIPPING? en automático con Ebay Amazon y eBot.: https://latiendadelapiar.blogspot.com/2020/06/introduccion-que-es-el-dropshipping-en.html

Feed from: 2020-06-14T03:01:00.000+03:00
27 Products Under $20 That Mean Business: https://www.buzzfeed.com/maitlandquitmeyer/products-under-20-that-mean-business-june-2020

Feed from: 2020-06-14T03:00:00.000+03:00
Star Wars Boba Fett The Black Series Helmet + Free Shipping $99.99: https://slickdeals.net/f/14125070-star-wars-boba-fett-the-black-series-helmet-free-shipping-99-99?utm_source=rss