# Hacker News Pipeline

## Introduction

In this project, we'll work with a sequence of basic natural language processing tasks using a created pipeline class. Our goal is to understand the tech topics in 2014 by finding the top 100 keywords of Hacker News posts in 2014. 

Our input data is from Hacker News (HN), a website about computer science and entrepreneurship posts that community vote stories,  API that returns JSON data of the top stories in 2014. The list of JSON posts is in 'hn_stories_2014.json'.

We will deal with the following keys of the posts:

* created_at: A timestamp of the story's creation time.
* created_at_i: A unix epoch timestamp.
* url: The URL of the story link.
* objectID: The ID of the story.
* author: The story's author (username on HN).
* points: The number of upvotes the story had.
* title: The headline of the post.
* num_comments: The number of a comments a post has.

## Pipeline Class

The base of this project is the class below created previously.

In [1]:
from collections import deque
# DAG class to deal with graph
class DAG():
    def __init__(self):
        self.graph = {}
        
    def in_degrees(self):
        in_degrees = {}
        for node in self.graph:
            if node not in in_degrees:
                in_degrees[node] = 0
            for pointed in self.graph[node]:
                if pointed not in in_degrees:
                    in_degrees[pointed] = 0
                in_degrees[pointed] += 1
        return in_degrees
    
    def sort(self):
        in_degrees = self.in_degrees()
        to_visit = deque()
        for node in self.graph:
            if in_degrees[node] == 0:
                to_visit.append(node)
                
        searched = []
        while to_visit:
            node = to_visit.popleft()
            for pointer in self.graph[node]:
                in_degrees[pointer] -= 1
                if in_degrees[pointer] == 0:
                    to_visit.append(pointer)
            searched.append(node)
        return searched
        
    def add(self, node, to=None):
        if node not in self.graph:
            self.graph[node] = []
        if to:
            if to not in self.graph:
                self.graph[to] = []
            self.graph[node].append(to)
            
        if len(self.sort()) != len(self.graph):
            raise Exception
            
class Pipeline:
    def __init__(self):
        self.tasks = DAG()
        
    def task(self, depends_on=None):
        def inner(function):
            self.tasks.add(function)
            if depends_on:
                self.tasks.add(depends_on, function)
            return function
        return inner
    
    def run(self):
        scheduled = self.tasks.sort()
        completed = {}
        for task in scheduled:
            for node, values in self.tasks.graph.items():
                if task in values:
                    completed[task] = task(completed[node])
            if task not in completed:
                completed[task] = task()
        return completed

In [2]:
pipeline = Pipeline()

## Loading the JSON Data

Now we'll load the JSON file into Python.

In [3]:
# Parse the JSON file into a dict object
import json
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json') as file:
        data_dict = json.load(file)
        return data_dict['stories']

## Filtering the Stories

We will start working with the data by filtering the list of stories to get the most popular stories of the year.

We can filter for popular stories by ensuring they are links (not other kinds of posts), have a good number of points, and have some comments.

In [4]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(data):
    for story in data:
        if story['points'] > 50 and story['num_comments'] > 1 and (not story['title'].startswith('Ask HN')):
            yield story  
    return

## Converting to CSV

By converting the dictionary to a CSV file, we have consistent data format for later summarizations.

In [5]:
import csv
import io
import itertools
import datetime as dt
def build_csv(lines, header=None, file=None):
    if header:
        lines = itertools.chain([header], lines)
    writer = csv.writer(file, delimiter = ',')
    writer.writerows(lines)
    file.seek(0)
    return file

@pipeline.task(depends_on=filter_stories)
def json_to_csv(dict_stories):
    lines = []
    for story in dict_stories:
        lines.append((story['objectID'], dt.datetime.strptime(story['created_at'], '%Y-%m-%dT%H:%M:%SZ'), story['url'], story['points'], story['title']))
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'], file=io.StringIO())

## Extracting Title Column

We can extract the titles of each popular post then we can run the next word frequency task.

In [6]:
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    read_file = list(csv.reader(csv_file))
    for row in read_file[1:]:
        yield row[-1]
    return

## Cleaning the Titles

To clean the titles, we need to lower case the titles, and to remove the punctuation.

In [7]:
import string

@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    for title in titles:
        clean_title = title
        for char in string.punctuation:
            clean_title = clean_title.replace('char', '')
        yield clean_title.lower()
    return

## Creating the Word Frequency Dictionary

Our word frequency dictionary will have a word as key and the number of times it appears in a text as value. 

Also, the dictionary won't include words that occur frequently in language, called stop words. For instance, 'the' and 'of' words. To solve this, we will import a module called stop_words that is a tuple of the most commonly used stop words in the English language.

In [8]:
from stop_words import stop_words
@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(titles):
    freq_words = {}
    for title in titles:
        for word in title.split():
            if word and word not in stop_words:
                if word not in freq_words:
                    freq_words[word] = 0
                else:
                    freq_words[word] += 1
    return freq_words

## Sorting the Top Words

In [9]:
@pipeline.task(depends_on=build_keyword_dictionary)
def order_words(dictionary):
    order_dictionary = sorted(dictionary.items(), reverse=True, key=lambda x: x[1])
    return order_dictionary[:100]

## Conclusion: Running the pipeline

Lastly, we can test our pipeline.

In [10]:
result = pipeline.run()
for word in result[order_words]:
    print(word)

('hn:', 192)
('new', 183)
('google', 150)
('bitcoin', 93)
('open', 90)
('programming', 87)
('web', 87)
('data', 81)
('python', 71)
('released', 68)
('using', 68)
('facebook', 64)
('code', 62)
('javascript', 61)
('game', 59)
('[video]', 59)
('source', 59)
('internet', 58)
('free', 56)
('app', 55)
('microsoft', 54)
('linux', 53)
('[pdf]', 52)
('language', 50)
('software', 50)
('use', 49)
('(2013)', 47)
('security', 47)
('apple', 46)
('time', 46)
('startup', 46)
('make', 46)
('2014', 44)
('work', 42)
('github', 41)
('computer', 39)
('heartbleed', 39)
('world', 37)
('windows', 37)
('nsa', 37)
('like', 37)
('way', 37)
('project', 36)
('ios', 36)
('u.s.', 34)
('developer', 33)
("don't", 33)
('online', 33)
('life', 33)
('git', 32)
('users', 32)
('os', 32)
('twitter', 32)
('big', 32)
('guide', 31)
('ceo', 31)
('mt.', 31)
('day', 31)
('android', 30)
('server', 30)
('learning', 30)
('design', 30)
('api', 30)
('says', 30)
('browser', 30)
('introducing', 29)
('firefox', 29)
('apps', 29)
('built', 