# Guided Project: Hacker News Pipeline

In this course, we began with the concepts of functional programming and then built our own data pipeline class in Python. We learned about advanced Python concepts such as decorators, closures, and good API design. In the last mission, we also learned how to implement a directed acyclic graph (DAG) as the scheduler for our pipeline.

After completing all these missions, we have finally built a robust data pipeline that schedules our tasks in the correct order! In this guided project, we will take the pipeline we have been building and apply it to a real-world data pipeline project. Starting from JSON API data, we will filter, clean, aggregate, and summarize the data in a sequence of tasks, with each task applying one of these transformations.

The data we will use comes from a Hacker News (HN) API that returns JSON data of the top stories in 2014. If you're unfamiliar with Hacker News, it's a link aggregator website where users vote up stories that are interesting to the community. It is similar to Reddit, but the community revolves mainly around computer science and entrepreneurship posts.

To make things easier, we have already downloaded a list of JSON posts to a file called <code>hn_stories_2014.json</code>. The JSON file contains a single key, <code>stories</code>, which holds a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:

<code>created_at</code>: A timestamp of the story's creation time.<br>
<code>created_at_i</code>: A Unix epoch timestamp of the story's creation time.<br>
<code>url</code>: The URL of the story link.<br>
<code>objectID</code>: The ID of the story.<br>
<code>author</code>: The story's author (username on HN).<br>
<code>points</code>: The number of upvotes the story had.<br>
<code>title</code>: The headline of the post.<br>
<code>num_comments</code>: The number of comments the post has.<br>
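
To make the structure described above concrete, here is a minimal sketch that parses a one-story sample with the same keys. The sample is hypothetical: every value in it is invented for illustration, and the exact format of the <code>created_at</code> string is an assumption rather than something taken from <code>hn_stories_2014.json</code>.

```python
import json
from datetime import datetime, timezone

# Hypothetical one-story sample; every value below is invented for illustration.
sample = """
{
  "stories": [
    {
      "created_at": "2014-01-01T00:00:00Z",
      "created_at_i": 1388534400,
      "url": "https://example.com/post",
      "objectID": "1",
      "author": "example_user",
      "points": 120,
      "title": "An Example Story",
      "num_comments": 15
    }
  ]
}
"""

# The top-level object has a single "stories" key holding a list of posts.
stories = json.loads(sample)["stories"]
first = stories[0]

# created_at_i counts seconds since 1970-01-01 UTC.
created = datetime.fromtimestamp(first["created_at_i"], tz=timezone.utc)

print(first["title"], first["points"], first["num_comments"])
print(created.isoformat())
```

The same indexing pattern (<code>data["stories"]</code>, then one dictionary per post) should carry over to the real file once it is loaded.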

Using this dataset, we will run a sequence of basic natural language processing tasks using our <code>Pipeline</code> class. The goal will be to find the top 100 keywords in the titles of Hacker News posts from 2014. Because Hacker News is one of the most popular technology-focused link aggregators, this will give us a sense of the most talked-about tech topics in 2014!

We have provided a solution to the guided project for you. You can find it at this [link](https://github.com/dataquestio/solutions/blob/master/Mission267Solutions.ipynb).

In [1]:
import json
from pipeline import Pipeline
from pipeline import build_csv
import io
import csv
import string
from stop_words import stop_words
from collections import Counter
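
The <code>pipeline</code> module imported above is the one built in the earlier missions and is not reproduced in this notebook. As a rough reminder of the interface this project relies on, here is a simplified, hypothetical stand-in. It is an assumption for illustration only: it supports a single dependency per task and runs tasks in registration order instead of using the DAG scheduler from the course, and the real <code>build_csv</code> may differ in its details.

```python
import csv
import io


class Pipeline:
    """Hypothetical, simplified stand-in for the course's Pipeline class."""

    def __init__(self):
        # Maps each task function to the task it depends on (or None).
        self.tasks = {}

    def task(self, depends_on=None):
        """Decorator factory that registers a function as a pipeline task."""
        def decorator(func):
            if depends_on is not None and depends_on not in self.tasks:
                raise ValueError("depends_on must be an already registered task")
            self.tasks[func] = depends_on
            return func
        return decorator

    def run(self):
        """Run every task and return a dict mapping task function to output."""
        outputs = {}
        # A dependency must be registered before its dependents, and dicts
        # preserve insertion order, so this order is a valid execution order.
        for func, dependency in self.tasks.items():
            if dependency is None:
                outputs[func] = func()
            else:
                outputs[func] = func(outputs[dependency])
        return outputs


def build_csv(lines, header=None, file=None):
    """Write rows to a file-like object and rewind it so it can be read."""
    if file is None:
        file = io.StringIO()
    writer = csv.writer(file)
    if header:
        writer.writerow(header)
    writer.writerows(lines)
    file.seek(0)
    return file


# Tiny usage example with made-up tasks.
demo = Pipeline()

@demo.task()
def make_rows():
    return [(1, "a"), (2, "b")]

@demo.task(depends_on=make_rows)
def rows_to_csv(rows):
    return build_csv(rows, header=["n", "s"], file=io.StringIO())

demo_output = demo.run()
print(demo_output[rows_to_csv].read())
```

Because <code>run()</code> returns a dictionary keyed by the task functions themselves, the result of any step can be looked up by passing the function as the key.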

We'll start the project by loading the JSON file data into Python.

In [2]:
# Pipeline Class
pipeline = Pipeline()

@pipeline.task()
def file_to_json():
    """
    Takes the path to a JSON file (or an open file object) from pipeline.f
    and returns the list stored under the "stories" key
    """

    # Accept either a file path or an already open file object
    if isinstance(pipeline.f, str):
        pipeline.f = open(pipeline.f, mode='r')

    j = json.load(pipeline.f)

    return j['stories']

@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    """
    Yields (objectID, created_at, url, points, title) for stories with more
    than 50 points and more than 1 comment whose title does not contain "Ask HN"
    """
    for story in stories:
        points = int(story['points'])
        num_comments = int(story['num_comments'])
        # Filter conditions
        if points > 50 and num_comments > 1 and "Ask HN" not in str(story['title']):
            yield (
                story['objectID'], story['created_at'], story['url'], points, story['title']
            )

@pipeline.task(depends_on=filter_stories)
def json_to_csv(f):
    """
    Creates CSV of filtered data
    """

    # Header row and in-memory file for the CSV output
    header = ['objectID', 'created_at', 'url', 'points', 'title']
    io_file = io.StringIO()

    return build_csv(f, header=header, file=io_file)

@pipeline.task(depends_on=json_to_csv)
def extract_titles(io_file):
    """
    Returns a generator of every Hacker News Story title
    """

    reader = csv.reader(io_file)
    header = next(reader)
    idx = header.index('title')

    return (line[idx] for line in reader)

@pipeline.task(depends_on=extract_titles)
def clean_titles(f):
    """
    Returns generator of cleaned titles
    """

    def replace_char(t):
        """
        Removes punctuation and lowers case of string
        """
        for char in string.punctuation:
            if char in t:
                t = t.replace(char,"")
        return t.lower()

    return (replace_char(x) for x in f)

@pipeline.task(depends_on=clean_titles)
def build_keywords_dictonary(f):
    """
    Returns Counter object of words in cleaned titles
    """

    word_freq = Counter()

    for words in f:
        for word in words.split(" "):
            word_freq[word] += 1

    for word in stop_words:
        del word_freq[word]
    
    del word_freq['']

    return word_freq

@pipeline.task(depends_on=build_keywords_dictonary)
def top_100(f):
    """
    Returns Tuple of (Key,Count) of Top 100 Most Common Words
    """

    return tuple(f.most_common(100))
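
The cleaning and counting steps in <code>clean_titles</code> and <code>build_keywords_dictonary</code> can also be tried in isolation, outside the pipeline. The sketch below does this on two invented titles with a tiny invented stop-word set, so none of the data comes from the real dataset. It also swaps in <code>str.translate</code> for the character-by-character <code>replace</code> loop, which is an alternative technique and not what the cell above uses.

```python
import string
from collections import Counter

# Hypothetical titles and stop words, invented for illustration.
titles = ["Show HN: A Python Web Framework", "Why Python? The Web, Explained"]
sample_stop_words = {"a", "the", "why"}

# With three arguments, str.maketrans builds a table that deletes every
# character listed in the third argument.
punctuation_table = str.maketrans("", "", string.punctuation)


def clean(title):
    """Remove punctuation and lowercase the title."""
    return title.translate(punctuation_table).lower()


word_freq = Counter()
for title in titles:
    # split() with no argument splits on any whitespace run,
    # so it never produces empty strings.
    word_freq.update(clean(title).split())

for word in sample_stop_words:
    # Deleting a missing key from a Counter does not raise KeyError.
    del word_freq[word]

print(word_freq.most_common(2))
```

On real titles the counts would of course depend on the stop-word list imported from <code>stop_words</code>, which is much larger than the three words used here.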

In [3]:
with open('hn_stories_2014.json') as f:
    pipeline.f = f  # Set the pipeline's input file
    output = pipeline.run()
output[top_100]

(('new', 185),
 ('google', 167),
 ('bitcoin', 101),
 ('open', 92),
 ('programming', 90),
 ('web', 88),
 ('data', 85),
 ('video', 79),
 ('python', 75),
 ('code', 72),
 ('facebook', 71),
 ('released', 71),
 ('using', 70),
 ('2013', 65),
 ('javascript', 65),
 ('free', 64),
 ('source', 64),
 ('game', 63),
 ('internet', 62),
 ('microsoft', 59),
 ('c', 59),
 ('linux', 58),
 ('app', 57),
 ('pdf', 55),
 ('work', 54),
 ('language', 54),
 ('software', 52),
 ('2014', 52),
 ('startup', 51),
 ('apple', 50),
 ('use', 50),
 ('make', 50),
 ('time', 48),
 ('yc', 48),
 ('security', 48),
 ('nsa', 45),
 ('github', 45),
 ('windows', 44),
 ('world', 41),
 ('way', 41),
 ('like', 41),
 ('1', 40),
 ('project', 40),
 ('computer', 40),
 ('heartbleed', 40),
 ('git', 37),
 ('users', 37),
 ('dont', 37),
 ('design', 37),
 ('ios', 37),
 ('developer', 36),
 ('os', 36),
 ('twitter', 36),
 ('ceo', 36),
 ('vs', 36),
 ('life', 36),
 ('big', 35),
 ('day', 35),
 ('android', 34),
 ('online', 34),
 ('years', 33),
 ('simple', 