<h1>Hacker News Stories</h1>

<h4>Read and work with JSON, create and use lambda functions</h4>

The data comes from DataQuest, which states that it was downloaded from the Hacker News API.
It describes stories posted to Hacker News in 2014.

In [3]:
# Reading the Hacker News JSON file
import json

# Use a context manager so the file is closed once it has been loaded
with open("hn_2014.json") as file:
    hn = json.load(file)

print("Number of objects: ", len(hn))
print("Type of objects: ", type(hn))
print("Keys of the first dict: ", hn[0].keys())

Number of objects:  35806
Type of objects:  <class 'list'>
Keys of the first dict:  dict_keys(['author', 'numComments', 'points', 'url', 'storyText', 'createdAt', 'tags', 'createdAtI', 'title', 'objectId'])


In [7]:
# Function that dumps the JSON object and prints it
def jprint(obj):
    # Create a formatted string of the JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(hn[0])

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "createdAtI": 1401350870,
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}


In [10]:
# Return a copy of a dictionary with the given key removed
def del_key(dict_, key):
    # Create a copy so we don't modify the original dict
    modified_dict = dict_.copy()
    del modified_dict[key]
    return modified_dict


# For each story in hn delete createdAtI and store in hn_clean
hn_clean = [del_key(story, "createdAtI") for story in hn]

jprint(hn_clean[0])

{
    "author": "dragongraphics",
    "createdAt": "2014-05-29T08:07:50Z",
    "numComments": 0,
    "objectId": "7815238",
    "points": 2,
    "storyText": "",
    "tags": [
        "story",
        "author_dragongraphics",
        "story_7815238"
    ],
    "title": "Are we getting too Sassy? Weighing up micro-optimisation vs. maintainability",
    "url": "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy"
}
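
A dictionary comprehension is another way to drop a key without touching the original dict. The sketch below is an alternative to del_key, not a replacement for it. The name without_key and the sample story are made up for illustration. Unlike del, this version does not raise a KeyError when the key is absent.

```python
def without_key(dict_, key):
    # Build a new dict from every pair whose key differs from `key`
    return {k: v for k, v in dict_.items() if k != key}


# Illustrative record, trimmed from the first story shown above
story = {"author": "dragongraphics", "createdAtI": 1401350870, "points": 2}

cleaned = without_key(story, "createdAtI")
print(cleaned)
print("createdAtI" in story)  # the original dict is left unchanged
```

Whether a missing key should be silently ignored or raise an error depends on the data. For a field every story is expected to have, the stricter del behaviour can catch malformed records early.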


In [14]:
# Extract url value from each story in hn_clean
urls = [story["url"] for story in hn_clean]

jprint(urls[0:5])

[
    "http://ashleynolan.co.uk/blog/are-we-getting-too-sassy",
    "http://spectrum.ieee.org/automaton/robotics/home-robots/telemba-telepresence-robot",
    "http://online.wsj.com/articles/apple-to-buy-beats-1401308971",
    "http://alexsblog.org/2014/05/29/dont-wait-for-inspiration/",
    "http://techcrunch.com/2014/05/28/hackerone-get-9m-in-series-a-funding-to-build-bug-tracking-bounty-programs/"
]


In [16]:
# Collect stories that have more than 1000 points
thousand_points = [d for d in hn_clean if d["points"] > 1000]

num_thousand_points = len(thousand_points)
print(num_thousand_points)

8
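
Since this notebook is about lambda functions, the same kind of filter can also be written with the built-in filter() and a lambda. This is a minimal sketch on made-up records: sample_stories is hypothetical and only stands in for hn_clean.

```python
# Hypothetical records standing in for hn_clean
sample_stories = [
    {"title": "a", "points": 1500},
    {"title": "b", "points": 1000},
    {"title": "c", "points": 12},
]

# filter() keeps the items for which the lambda returns True.
# It returns a lazy iterator, so wrap it in list() to materialise it.
over_thousand = list(filter(lambda story: story["points"] > 1000, sample_stories))

print(len(over_thousand))
```

The comparison is strict, so a story with exactly 1000 points is excluded, just as in the list comprehension.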


In [25]:
# Sort hn_clean by number of points from highest to lowest
hn_sorted_points = sorted(hn_clean, key=lambda json_dict: json_dict["points"], reverse=True)

# Collect the titles of the 5 stories with the most points
top_5_titles = [d["title"] for d in hn_sorted_points[:5]]
print(top_5_titles)

['2048', 'Today is The Day We Fight Back', 'Wozniak: “Actually, the movie was largely a lie about me”', 'Microsoft Open Sources C# Compiler', 'Elon Musk: To the People of New Jersey']
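
When only the top few items are needed, sorting the whole list is not strictly necessary. The sketch below uses heapq.nlargest from the standard library, with operator.itemgetter in place of the lambda. The records in sample_stories are invented for illustration.

```python
import heapq
from operator import itemgetter

# Hypothetical records standing in for hn_clean
sample_stories = [
    {"title": "a", "points": 40},
    {"title": "b", "points": 900},
    {"title": "c", "points": 7},
    {"title": "d", "points": 310},
]

# itemgetter("points") behaves like: lambda d: d["points"]
by_points = itemgetter("points")

# nlargest returns the n items with the largest key, highest first
top_2 = heapq.nlargest(2, sample_stories, key=by_points)
top_2_titles = [d["title"] for d in top_2]
print(top_2_titles)
```

For a list of this size either approach is fine. The sorted() version has the advantage of keeping the full ranking around for later use.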


In [24]:
# Get the story with the most comments using a key lambda function
most_comments = max(hn_clean, key=lambda json_dict: json_dict["numComments"])

jprint(most_comments)

{
    "author": "platz",
    "createdAt": "2014-04-03T19:02:53Z",
    "numComments": 1208,
    "objectId": "7525198",
    "points": 889,
    "storyText": null,
    "tags": [
        "story",
        "author_platz",
        "story_7525198"
    ],
    "title": "Brendan Eich Steps Down as Mozilla CEO",
    "url": "https://blog.mozilla.org/blog/2014/04/03/brendan-eich-steps-down-as-mozilla-ceo/"
}


<h4>JSON with Pandas</h4>

In [28]:
import pandas as pd

hn_df = pd.DataFrame(hn_clean)
print(hn_df.head())

           author  numComments  points  \
0  dragongraphics            0       2   
1             jcr            0       1   
2        callum85            0       1   
3          d3v3r0            0       1   
4      timmipetit            0       1   

                                                 url storyText  \
0  http://ashleynolan.co.uk/blog/are-we-getting-t...             
1  http://spectrum.ieee.org/automaton/robotics/ho...             
2  http://online.wsj.com/articles/apple-to-buy-be...             
3  http://alexsblog.org/2014/05/29/dont-wait-for-...             
4  http://techcrunch.com/2014/05/28/hackerone-get...             

              createdAt                                           tags  \
0  2014-05-29T08:07:50Z  [story, author_dragongraphics, story_7815238]   
1  2014-05-29T08:05:58Z             [story, author_jcr, story_7815234]   
2  2014-05-29T08:05:06Z        [story, author_callum85, story_7815230]   
3  2014-05-29T08:00:08Z          [story, author_d3v3r0

In [31]:
# Check how many tags there are for each story
tags = hn_df["tags"]

tags_types = tags.apply(len)
type_counts = tags_types.value_counts(dropna=False)
print(type_counts)

3    33459
4     2347
Name: tags, dtype: int64
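
The per-story tag count can also be computed with the Series.str.len() accessor, which returns the length of each element and works on lists as well as strings. A small sketch, assuming tag lists shaped like those in hn_df["tags"]. The sample_tags values are made up.

```python
import pandas as pd

# Hypothetical tag lists in the same shape as hn_df["tags"]
sample_tags = pd.Series([
    ["story", "author_a", "story_1"],
    ["story", "author_b", "story_2", "ask_hn"],
    ["story", "author_c", "story_3"],
])

# .str.len() returns the length of each element, including lists
lengths = sample_tags.str.len()
print(lengths.value_counts())
```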


In [33]:
# Boolean mask to filter stories with 4 tags
has_four_tags = tags.apply(len) == 4

four_tags = tags[has_four_tags]
print(four_tags.head())

43     [story, author_alamgir_mand, story_7813869, sh...
86       [story, author_cweagans, story_7812404, ask_hn]
104    [story, author_nightstrike789, story_7812099, ...
107    [story, author_ISeemToBeAVerb, story_7812048, ...
109       [story, author_Swizec, story_7812018, show_hn]
Name: tags, dtype: object
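
The same boolean mask can filter whole DataFrame rows rather than only the tags column. The sketch below shows the idea on a tiny, invented frame: sample_df and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the same "tags" layout as hn_df
sample_df = pd.DataFrame({
    "author": ["a", "b", "c"],
    "tags": [
        ["story", "author_a", "story_1"],
        ["story", "author_b", "story_2", "ask_hn"],
        ["story", "author_c", "story_3", "show_hn"],
    ],
})

# One True/False per row: does this story carry a fourth tag?
mask = sample_df["tags"].apply(len) == 4

# Indexing with the mask keeps only the rows where it is True
four_tag_rows = sample_df[mask]
print(four_tag_rows["author"].tolist())
```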


In [40]:
# New column with show_hn or ask_hn (the optional 4th tag)
fourth_tags = tags.apply(lambda l: l[-1] if len(l) == 4 else None)
hn_df["show_or_ask_hn"] = fourth_tags

print(hn_df.head())

           author  numComments  points  \
0  dragongraphics            0       2   
1             jcr            0       1   
2        callum85            0       1   
3          d3v3r0            0       1   
4      timmipetit            0       1   

                                                 url storyText  \
0  http://ashleynolan.co.uk/blog/are-we-getting-t...             
1  http://spectrum.ieee.org/automaton/robotics/ho...             
2  http://online.wsj.com/articles/apple-to-buy-be...             
3  http://alexsblog.org/2014/05/29/dont-wait-for-...             
4  http://techcrunch.com/2014/05/28/hackerone-get...             

              createdAt                                           tags  \
0  2014-05-29T08:07:50Z  [story, author_dragongraphics, story_7815238]   
1  2014-05-29T08:05:58Z             [story, author_jcr, story_7815234]   
2  2014-05-29T08:05:06Z        [story, author_callum85, story_7815230]   
3  2014-05-29T08:00:08Z          [story, author_d3v3r0