# Hacker News Pipeline

In this project, we will build a data pipeline from scratch and apply it to data from [Hacker News](https://news.ycombinator.com). The pipeline is a sequence of functions that filter, clean, aggregate, and summarize posts retrieved from a JSON API, with each function applying one transformation.

Our goal will be to find the top 100 keywords of all Hacker News posts in 2014. This will give us a better understanding of the most popular tech topics in 2014.

## Introduction to the Data

The data has already been downloaded as a list of JSON posts that can be found in the `hn_stories_2014.json` file in this repository.

We'll be using the following keys in our data:

* `created_at`: Timestamp of when the post was created.
* `created_at_i`: The post's creation time as a Unix epoch timestamp.
* `url`: URL of the post link.
* `objectID`: ID of the post.
* `author`: Post's author.
* `points`: Number of upvotes the post had.
* `title`: Headline of the post.
* `num_comments`: Number of comments on the post.

Let's start by importing the libraries we'll be using and instantiating our pipeline class.

```python
import csv
import json
import io
import string
from datetime import datetime

from pipeline import Pipeline
pipeline = Pipeline()
```
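
The `Pipeline` class comes from the repository's own `pipeline` module, which is not shown here. To give a rough idea of how a `task()` decorator with a `depends_on` argument could work, here is a minimal sketch. `SketchPipeline`, its `run()` method, and the two toy tasks are hypothetical names for illustration; the real class may be implemented quite differently.

```python
class SketchPipeline:
    """Hypothetical stand-in for pipeline.Pipeline; not the real class."""

    def __init__(self):
        self.tasks = []    # functions in registration order
        self.parents = {}  # function -> the function it depends on (or None)

    def task(self, depends_on=None):
        # Decorator factory: records the function and its single dependency.
        def register(func):
            self.tasks.append(func)
            self.parents[func] = depends_on
            return func
        return register

    def run(self):
        # Assumption: a task is always registered after the task it depends
        # on, so registration order is already a valid execution order.
        outputs = {}
        for func in self.tasks:
            parent = self.parents[func]
            if parent is None:
                outputs[func] = func()
            else:
                outputs[func] = func(outputs[parent])
        return outputs


sketch = SketchPipeline()


@sketch.task()
def numbers():
    return [1, 2, 3]


@sketch.task(depends_on=numbers)
def doubled(values):
    return [v * 2 for v in values]


results = sketch.run()
print(results[doubled])
```

Each task receives the return value of the task it depends on, which is the pattern the functions in the rest of this project follow.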

## Loading the JSON Data

Next, we'll load the JSON data into Python. The file's top-level JSON object maps keys to values, so the `json` module parses it into a Python dict object. The posts themselves are stored under its `stories` key.

```python
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as file:
        data = json.load(file)
        stories = data['stories']
    return stories
```
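
To illustrate the shape of the parsed result without opening the real file, the sketch below parses a small in-memory string. The two records are made up for illustration and are not taken from `hn_stories_2014.json`.

```python
import json

# Hypothetical two-story payload shaped like the file described above.
raw = '{"stories": [{"title": "Example A", "points": 10}, {"title": "Example B", "points": 75}]}'

data = json.loads(raw)     # top-level JSON object -> dict
stories = data['stories']  # JSON array -> list of dicts

print(type(data).__name__, type(stories).__name__, len(stories))
# prints: dict list 2
```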

## Filtering the Stories

Now that the stories have been loaded as a list of dict objects, we can start working on the data. We'll begin by filtering the list to extract the most popular stories of 2014.

We'll decorate a function called `filter_stories()` with `pipeline.task()` so that it depends on the output of `file_to_json()`. It returns stories that have more than 50 points, more than 1 comment, and a title that does not begin with "Ask HN".

```python
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith('Ask HN')

    return (
        story for story in stories
        if is_popular(story)
    )
```
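
Note that `filter_stories()` returns a generator expression, so stories are filtered lazily and the result can only be iterated once. The sketch below applies the same predicate to a few made-up records to show both the boundary cases and the single-use behavior.

```python
def is_popular(story):
    # Same predicate as in filter_stories() above.
    return (
        story['points'] > 50
        and story['num_comments'] > 1
        and not story['title'].startswith('Ask HN')
    )


# Hypothetical records for illustration only.
sample = [
    {'title': 'Show HN: A tiny pipeline', 'points': 120, 'num_comments': 40},
    {'title': 'Ask HN: Favorite editor?', 'points': 300, 'num_comments': 250},
    {'title': 'Quiet post', 'points': 51, 'num_comments': 1},
    {'title': 'Low score', 'points': 50, 'num_comments': 10},
]

popular = (story for story in sample if is_popular(story))

first_pass = [story['title'] for story in popular]
second_pass = [story['title'] for story in popular]  # generator is already exhausted

print(first_pass)   # ['Show HN: A tiny pipeline']
print(second_pass)  # []
```

If a later task needs to iterate over the filtered stories more than once, it would have to materialize them first, for example with `list()`.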

## Convert to CSV

```python
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    pass
```
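
The body of `json_to_csv()` is still a stub. One possible way to fill it in, using the `csv`, `io`, and `datetime` imports from the first cell, is sketched below. The helper name `stories_to_csv`, the choice of columns, and the timestamp format string are assumptions made for this example and should be checked against the actual data.

```python
import csv
import io
from datetime import datetime


def stories_to_csv(stories):
    # Hypothetical helper: writes selected fields to an in-memory CSV file.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['objectID', 'created_at', 'url', 'points', 'title'])
    for story in stories:
        # Assumed timestamp format; adjust if the data differs.
        created = datetime.strptime(story['created_at'], '%Y-%m-%dT%H:%M:%SZ')
        writer.writerow([
            story['objectID'], created, story['url'],
            story['points'], story['title'],
        ])
    buffer.seek(0)  # rewind so a downstream reader starts at the header
    return buffer


# Made-up record for illustration only.
example = [{
    'objectID': '1',
    'created_at': '2014-05-29T08:25:40Z',
    'url': 'https://example.com',
    'points': 99,
    'title': 'Hello, world',
}]
rows = list(csv.reader(stories_to_csv(example)))
print(rows[1])
```

Writing to an `io.StringIO` buffer keeps the intermediate CSV in memory, so the next task can read it back with `csv.reader` without touching the disk.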

## Extract Title Column

## Clean the Titles
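
One way to approach this step, assuming that cleaning means lowercasing each title and removing punctuation, is sketched below. The function name `clean_title` and the choice of `string.punctuation` as the set of characters to drop are assumptions for this example.

```python
import string


def clean_title(title):
    # Lowercase, then drop every ASCII punctuation character.
    lowered = title.lower()
    return ''.join(ch for ch in lowered if ch not in string.punctuation)


print(clean_title('Show HN: Rust 1.0, finally!'))
# prints: show hn rust 10 finally
```

Note that this also removes punctuation inside tokens, so a version number such as `1.0` becomes `10`.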

## Word Frequency Dictionary
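
A minimal sketch of this step, assuming the input is an iterable of already cleaned title strings, could count words with a plain dict. The function name `word_frequencies` and the optional `stop_words` parameter are hypothetical additions for this example.

```python
def word_frequencies(titles, stop_words=frozenset()):
    # Map each word to the number of times it appears across all titles.
    counts = {}
    for title in titles:
        for word in title.split():
            if word in stop_words:
                continue  # skip words the caller chose to ignore
            counts[word] = counts.get(word, 0) + 1
    return counts


freqs = word_frequencies(
    ['google buys nest', 'google glass review'],
    stop_words={'buys'},
)
print(freqs)
# prints: {'google': 2, 'nest': 1, 'glass': 1, 'review': 1}
```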

## Sort the Top Words
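
Given a word-to-count dictionary, the final ranking can be sketched with `sorted()`. The function name `top_words` and its `limit` parameter are assumptions for this example. The default of 100 matches the project's stated goal.

```python
def top_words(frequencies, limit=100):
    # Sort (word, count) pairs by count, highest first, and keep `limit` of them.
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


print(top_words({'a': 1, 'b': 3, 'c': 2}, limit=2))
# prints: [('b', 3), ('c', 2)]
```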

## Conclusion and Next Steps