# Hacker News Pipeline

In this project, we will build a data pipeline from scratch and apply it to data from [Hacker News](https://news.ycombinator.com). The pipeline will filter, clean, aggregate, and summarize data from a JSON API in a sequence of functions that will apply the transformations for us.

Our goal will be to find the top 100 keywords of all Hacker News posts in 2014. This will give us a better understanding of the most popular tech topics in 2014.

## Introduction to the Data

The data has already been downloaded as a list of JSON posts that can be found in the `hn_stories_2014.json` file in this repository.

We'll be using the following keys in our data:

* `created_at`: Timestamp of the post's creation time.
* `created_at_i`: Unix epoch timestamp.
* `url`: URL of the post link.
* `objectID`: ID of the post.
* `author`: Post's author.
* `points`: Number of upvotes the post had.
* `title`: Headline of the post.
* `num_comments`: Number of comments on the post.

Let's start by instantiating our pipeline class and importing the libraries we'll be using.

In [1]:
import csv
import json
import io
import string
from datetime import datetime

from pipeline import Pipeline, build_csv
from stop_words import stop_words

pipeline = Pipeline()

## Loading the JSON Data

Next, we'll load the JSON data into Python. Since JSON files resemble a key-value dictionary, we'll parse the data into a Python dict object using the `json` module. 

In [2]:
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as file:
        data = json.load(file)
        stories = data['stories']
    return stories

## Filtering the Stories

We can start working on our data now that the stories have been loaded as a list of dict objects. We'll start by filtering the list of stories to extract the most popular stories in 2014.

We'll create a `pipeline.task()` function called `filter_stories()` that will be dependent on the `file_to_json()` function output, and it will return stories that have more than 50 points, at least 1 comment, and do not begin with "Ask HN".

In [3]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith('Ask HN')
    
    return (
        story for story in stories
        if is_popular(story)
    )

## Convert to CSV

Now that we've reduced our set of stories, we can write these dict objects to a CSV file so we can have a consistent data format.

We'll create a `pipeline.task()` function called `json_to_csv()` that will be depended on the `filter_stories()` output, and it will output the formated input to a CSV using the `IO` method.

In [4]:
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append(
            (story['objectID'], datetime.strptime(story['created_at'], '%Y-%m-%dT%H:%M:%SZ'), story['url'], story['points'], story['title'])
        )
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'point', 'title'], file=io.StringIO())

## Extract Title Column

Next, we'll create a `pipeline.task()` function called `extract_titles()` that will be dependent on the `json_to_csv()` output, and it will return a generator of every Hacker News story title. After we have all of the titles, we'll be able to count the word frequency which will help us find the most popular topics.

In [5]:
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    index_num = header.index('title')
    
    return (line[index_num] for line in reader)

## Clean the Titles

In order to have a consistent set of words to use for our word frequency model, we'll need to clean our list of titles and make sure they're all lowercase and without punctuation. The easiest way to remove punctuation from a string is to check each character and only keep the letters. For this, we can use `string.punctuation` to help us.

We'll create a `pipeline.task()` function called `clean_title()` that will be dependent on the `extract_titles()` output, and it will return the cleaned titles.

In [6]:
@pipeline.task(depends_on=extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(t for t in title if t not in string.punctuation)
        yield title

## Word Frequency Dictionary

Now that the data is clean, we can build a word frequency dictionary.

We'll want our frequency dictionary to show us only keywords and not all of the stop words that are frequently used and not useful to us for this analysis. To remove these words, we'll use a module in this repository called `stop_words.py` that contains a tuple of all the words we don't need.

In [7]:
@pipeline.task(depends_on=clean_title)
def build_keyword_dictionary(titles):
    freq_count = {}
    for title in titles:
        for word in title.split(' '):
            if word and word not in stop_words: # Removes the stop words located in our tuple
                if word not in freq_count:
                    freq_count[word] = 1
                freq_count[word] += 1
    return freq_count

## Sort the Top Words

Finally, we can now sort out the top words. We'll output a list of the top 100 tuples sorted from most used to least used.

In [8]:
@pipeline.task(depends_on=build_keyword_dictionary)
def top_keywords(word_freq):
    freq = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq[:100]

## Run the Pipeline

In [9]:
run_pipeline = pipeline.run()
print(run_pipeline[top_keywords])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3

## Conclusion and Next Steps

In this project, we built a data pipeline from scratch. We're able to see some interesting topics that were trending in 2014 like bitcoin. 

To continue our analysis, some next steps we could take to improve this pipeline are:

* Before filtering the data, first convert it to a CSV to keep the stories in a raw file.
* Acquire the data from Hacker News directly from a JSON API to process newer data.
* Use the nltk package for more advanced natural language processing.
* Rewrite the Pipeline class so that output is saved to a file after each task so that tasks don't need to be run more than once.

The idea for this project comes from the [DATAQUEST](https://app.dataquest.io/) **Building a Data Pipeline** course. 