# Building a Hacker News Pipeline to Perform Basic NLP Tasks

In this project, we'll build and use a pipeline to filter, aggregate, and summarize data from [Hacker News](https://news.ycombinator.com/). Hacker News is a link aggregator website where users vote up stories that are interesting to the community. It is similar to [Reddit](https://www.reddit.com/), but its community focuses on computer science and entrepreneurship.

The [JSON file](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON) we'll use, `hn_stories_2014.json`, contains a list of top voted posts from 2014. The file has a single key, `stories`, which holds a list of stories (posts). Each post has a set of keys, but we will deal only with the following:

|Key|Description|
|---|-----------|
|created_at|A timestamp of the story's creation time.|
|created_at_i|A unix epoch timestamp.|
|url|The URL of the story link.|
|objectID|The ID of the story.|
|author|The story's author (username on HN).|
|points|The number of upvotes the story had.|
|title|The headline of the post.|
|num_comments|The number of comments a post has.|

<br>

Using this dataset, **we will run a sequence of basic natural language processing (NLP) tasks using our `Pipeline` class**. Our `Pipeline` structure is based on a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) (`DAG`).

The goal will be to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is one of the most popular technology-focused link aggregators, this will give us a sense of the most talked-about tech topics in 2014.

## 1. Building the `Pipeline` class

In [1]:
# Import libraries we'll use
import json
from datetime import datetime
import io
import csv
import string
from collections import deque
import itertools

# Class for the directed acyclic graph (DAG) used in the Pipeline class
class DAG:
    """A directed acyclic graph (DAG) data structure class.
    
    Attributes:
    ----------
        graph (dictionary): Lists nodes (keys) and 
                            the list of node(s) they point to (values), if any.
                            
        degrees (dictionary): Lists nodes (keys) and 
                              the number of "edges" that point to it (values) if any.
                              
    Methods:
    -------
        _in_degrees(): Builds the self.degrees attribute.
        
        sort(): Returns a list of topologically sorted nodes using Kahn's algorithm,
                which repeatedly removes root nodes (nodes with no incoming edges).
                
        add(node, to=None): Adds a node and the node it points to (to), if any, to
                            the self.graph attribute.
        """
    def __init__(self):
        """Constructs graph attribute for the DAG object."""
        self.graph = {}

    def _in_degrees(self):
        """Constructs the degrees attribute for the DAG object."""
        self.degrees = {}
        for node in self.graph:
            if node not in self.degrees:
                self.degrees[node] = 0
            for pointed in self.graph[node]:
                if pointed not in self.degrees:
                    self.degrees[pointed] = 0
                self.degrees[pointed] += 1

    def sort(self):
        """Topologically sorts the nodes of the graph and returns the result."""
        self._in_degrees()
        root_nodes = deque()
        for node in self.graph:
            if self.degrees[node] == 0:
                root_nodes.append(node)
        
        ordered_nodes = []
        while root_nodes:
            node = root_nodes.popleft()
            for pointer in self.graph[node]:
                self.degrees[pointer] -= 1
                if self.degrees[pointer] == 0:
                    root_nodes.append(pointer)
            ordered_nodes.append(node)
        return ordered_nodes

    def add(self, node, to=None):
        """Adds a node and the node it points to, if any, to the graph attribute.
        
        Args:
        ----
            node (immutable object): Any immutable object.
            
            to (immutable object): Any immutable object.
            
        Raises:
        ------
            Exception: If the added edge creates a cycle in the graph.
        """
        if node not in self.graph:
            self.graph[node] = []
        if to:
            if to not in self.graph:
                self.graph[to] = []
            self.graph[node].append(to)
        if len(self.sort()) != len(self.graph): # Check for cycles
            raise Exception("The added edge creates a cycle in the graph.")

# Class for our Pipeline, using the DAG structure upon initialization
class Pipeline:
    """A task scheduler.
    
    Attributes:
    ----------
        tasks (DAG object): A directed acyclic graph that maps tasks to their dependencies.
                            The DAG object itself does this in the form of a dictionary.
                            
    Methods:
    -------
        task(depends_on=None): A decorator that adds the decorated function and its dependencies
                               to the tasks attribute.
                               
        run(): Runs the topologically sorted list of functions in order of dependency in
               a nested fashion. Returns a dictionary of functions and their outputs 
               at each step of the way.
    """
    def __init__(self):
        """Constructs the tasks attribute (a directed acyclic graph) for the Pipeline object."""
        self.tasks = DAG()

    def task(self, depends_on=None):
        """A decorator that adds the decorated task and its dependencies 
           to the tasks attribute.
           
        Args:
        ----
            depends_on (Immutable object): In most uses, it's the function which the
                                           function being decorated depends on for input.
        """
        def inner(f):
            """Adds the decorated function and its dependencies to the task DAG."""
            self.tasks.add(f)
            if depends_on:
                self.tasks.add(depends_on, f)
            return f
        return inner

    def run(self):
        """Runs the topologically sorted list of functions in order of dependency in
           a nested fashion. Returns a dictionary of functions and their outputs 
           at each step of the way. 
           
        Returns:
        -------
            dictionary:  A mapping of functions to their outputs 
                         at each step of the way.    
        """
        scheduled = self.tasks.sort()
        completed = {}

        for task in scheduled:
            for node, values in self.tasks.graph.items():
                if task in values:
                    completed[task] = task(completed[node])
            if task not in completed:
                completed[task] = task()
        return completed

# Instantiate the pipeline class
pipeline = Pipeline()
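
The ordering step inside `DAG.sort()` can also be looked at in isolation. The following is a minimal, standalone sketch of Kahn's algorithm on a plain dictionary; the function name `topological_sort`, the `toy_graph` variable, and its task names are illustrative assumptions and not part of the `DAG` class above.

```python
from collections import deque


def topological_sort(graph):
    """Return the nodes of `graph` in dependency order using Kahn's algorithm.

    `graph` maps each node to the list of nodes it points to.
    Raises ValueError if the graph contains a cycle.
    """
    # Count the incoming edges of every node.
    in_degree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1

    # Start from the root nodes (no incoming edges).
    queue = deque(node for node, degree in in_degree.items() if degree == 0)
    ordered = []
    while queue:
        node = queue.popleft()
        ordered.append(node)
        # "Remove" the node by lowering the in-degree of the nodes it points to.
        for target in graph.get(node, []):
            in_degree[target] -= 1
            if in_degree[target] == 0:
                queue.append(target)

    # If some nodes were never reached, they sit on a cycle.
    if len(ordered) != len(in_degree):
        raise ValueError("graph contains a cycle")
    return ordered


# Hypothetical task names, for illustration only.
toy_graph = {"load": ["filter"], "filter": ["count"], "count": []}
print(topological_sort(toy_graph))  # ['load', 'filter', 'count']
```

A task only enters the queue once every task it depends on has been emitted, which is why the `Pipeline` can safely pass each output to the next function.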

## 2. Loading the data

In [2]:
# Load in json file data
def file_to_json():
    """Loads in JSON file as a list of dictionary objects."""
    with open('hn_stories_2014.json') as file:
        data = json.load(file)
        stories = data['stories']
    return stories

# Load JSON data as a list of dictionaries and display the first 2 entries
stories = file_to_json()
stories[:2]

[{'story_text': '',
  'created_at': '2014-05-29T08:25:40Z',
  'story_title': None,
  'story_id': None,
  'comment_text': None,
  'created_at_i': 1401351940,
  'url': 'https://duckduckgo.com/settings',
  'parent_id': None,
  'objectID': '7815290',
  'author': 'TuxLyn',
  'points': 1,
  'title': 'DuckDuckGo Settings',
  '_tags': ['story', 'author_TuxLyn', 'story_7815290'],
  'num_comments': 0,
  '_highlightResult': {'story_text': {'matchedWords': [],
    'value': '',
    'matchLevel': 'none'},
   'author': {'matchedWords': [], 'value': 'TuxLyn', 'matchLevel': 'none'},
   'url': {'matchedWords': [],
    'value': 'https://duckduckgo.com/settings',
    'matchLevel': 'none'},
   'title': {'matchedWords': [],
    'value': 'DuckDuckGo Settings',
    'matchLevel': 'none'}},
  'story_url': None},
 {'story_text': '',
  'created_at': '2014-05-29T08:23:46Z',
  'story_title': None,
  'story_id': None,
  'comment_text': None,
  'created_at_i': 1401351826,
  'url': 'http://bits.blogs.nytimes.com/2014/

## 3. Selecting the most popular stories

We'll filter the list of stories to get the most popular stories of the year.

Like any social link aggregator site, individual users can post whatever content they want. The reason we want the most popular stories is to ensure that we select stories that were the most talked about during the year. We can filter for popular stories by ensuring they are links (not `Ask HN` posts), have a good number of points, and have some comments.

In [3]:
# The function gets the most popular stories
def filter_stories(stories):
    """Filters popular stories from a list of dictionaries.
    
    Args:
    ----
        stories (list): A list of dictionaries.
    Returns:
    -------
        generator: A generator of filtered stories
    """
    def is_popular(story):
        """A boolean filter that filters popular stories that have more than 50 points,
           more than 1 comment, and do not begin with Ask HN.
        """
        return story['num_comments'] > 1 and story['points'] > 50 and \
               not story['title'].startswith('Ask HN')
    
    return (story for story in stories if is_popular(story))

# Get the most popular stories
popular_stories = filter_stories(stories)

# Display the first three entries of popular_stories
counter = 0
for story in popular_stories:
    if counter < 3:
        print(story)
        counter += 1
    else:
        break
        
# Re-create the generator, because the loop above has partially consumed it
popular_stories = filter_stories(stories)

{'story_text': '', 'created_at': '2014-05-29T04:27:42Z', 'story_title': None, 'story_id': None, 'comment_text': None, 'created_at_i': 1401337662, 'url': 'http://krebsonsecurity.com/2014/05/true-goodbye-using-truecrypt-is-not-secure/', 'parent_id': None, 'objectID': '7814725', 'author': 'panarky', 'points': 60, 'title': 'True Goodbye: ‘Using TrueCrypt Is Not Secure’', '_tags': ['story', 'author_panarky', 'story_7814725'], 'num_comments': 23, '_highlightResult': {'story_text': {'matchedWords': [], 'value': '', 'matchLevel': 'none'}, 'author': {'matchedWords': [], 'value': 'panarky', 'matchLevel': 'none'}, 'url': {'matchedWords': [], 'value': 'http://krebsonsecurity.com/2014/05/true-goodbye-using-truecrypt-is-not-secure/', 'matchLevel': 'none'}, 'title': {'matchedWords': [], 'value': 'True Goodbye: ‘Using TrueCrypt Is Not Secure’', 'matchLevel': 'none'}}, 'story_url': None}
{'story_text': '', 'created_at': '2014-05-29T03:51:01Z', 'story_title': None, 'story_id': None, 'comment_text': None
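
Because `filter_stories` returns a generator, looking at its first items consumes them, which is why the generator is created a second time above. As an alternative, here is a small sketch of a hypothetical `peek` helper (not part of the original notebook) that uses `itertools.islice` to take a few items and then chains them back in front of the rest.

```python
import itertools


def peek(iterable, n=3):
    """Return the first `n` items plus an iterator that still yields everything."""
    iterator = iter(iterable)
    head = list(itertools.islice(iterator, n))
    # Chain the consumed items back in front of the remaining ones.
    return head, itertools.chain(head, iterator)


# Hypothetical stand-in for the generator of popular stories.
numbers = (n * n for n in range(6))
first_three, numbers = peek(numbers)
print(first_three)    # [0, 1, 4]
print(list(numbers))  # [0, 1, 4, 9, 16, 25]
```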

## 4. Writing popular stories to a CSV file

With a reduced set of stories, it's time to write these `dict` objects to a CSV file. Translating the dictionaries to CSV gives us a consistent data format for the later summarization steps. Keeping the format consistent makes each pipeline task easier to adapt to future requirements.

In [4]:
# Function for writing data into a csv file
def build_csv(lines, header=None, file=None):
    """Writes lines into a csv file.
    
    Args:
    ----
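        lines (iterable): The rows to write, each a sequence of column values.
        
        header (list): A list of column names, one per column in each row.
        
        file (file object): A writable, seekable text file object, such as io.StringIO.
        
    Returns:
    -------
        file object: The same file object, rewound to the beginning.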
        
    """
    if header:
        lines = itertools.chain([header], lines)
    writer = csv.writer(file, delimiter=',')
    writer.writerows(lines)
    file.seek(0)
    return file

In [5]:
# The function saves the JSON data to a csv file
def json_to_csv(stories):
    """Saves JSON data to a csv file.
    
    Args:
    ----
        stories (iterable): Dictionaries containing the most popular stories.
    
    Returns:
    -------
        file object: An in-memory CSV file (io.StringIO), rewound to the beginning.
    """
    lines = []
    for story in stories:
        lines.append((story['objectID'], 
                     datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), 
                     story['url'], 
                     story['points'], 
                     story['title']))
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'],
                     file=io.StringIO())

csv_file = json_to_csv(popular_stories)
print(csv_file)
csv_file.seek(0)

<_io.StringIO object at 0x7f128bc1fdc0>


0
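
Printing the `StringIO` object only shows its representation, not the CSV text it holds. The sketch below, which uses hypothetical rows rather than the real stories, shows two ways to inspect an in-memory CSV: `getvalue()` for the raw text and `csv.reader` after rewinding with `seek(0)`.

```python
import csv
import io

# Hypothetical rows standing in for the story tuples.
rows = [("1", "First title"), ("2", "Second, with a comma")]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["objectID", "title"])
writer.writerows(rows)

# getvalue() returns the full contents regardless of the cursor position.
print(buffer.getvalue())

# Rewind before reading, because the cursor sits at the end after writing.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[2])  # ['2', 'Second, with a comma']
```

The CSV writer quotes the title that contains a comma, so it survives the round trip as a single field.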

## 5. Extracting story titles

Using the CSV file format we created in the previous task, we can now extract the title column. Once we have extracted the titles of each popular post, we can then run the next word frequency task.

We'll do the following steps to extract titles:
1. Create a `csv.reader()` object from the file object. 
2. Find the index of the `title` in the header. 
3. Iterate through the reader, and return the item at the title index position from each row.

In [6]:
# The function extracts the titles from the JSON data turned CSV file.
def extract_titles(csv_file):
    """Extracts titles from CSV file.
    
    Args:
    ----
        csv_file (file object or string): An open CSV file object, or the name
                                          of the CSV file, to extract titles from.
        
    Returns:
    -------
        generator or list: The titles in string format.
    """
    if not isinstance(csv_file, str):
        csv_file.seek(0)
        reader = csv.reader(csv_file)
        header = next(reader)
        idx = header.index('title')
        return (line[idx] for line in reader)
    else:
        with open(csv_file, newline='') as file:
            reader = csv.reader(file)
            header = next(reader)
            idx = header.index('title')
            # Build a list here: a generator would be read after the file is closed
            return [line[idx] for line in reader]
        
# Extract titles from CSV file
titles = extract_titles(csv_file)

# Display the first three entries of titles
counter = 0
for title in titles:
    if counter < 3:
        print(title)
        counter += 1   
        
# Reset iteration on csv file       
csv_file.seek(0)

True Goodbye: ‘Using TrueCrypt Is Not Secure’
For Hire: Dedicated Young Man With Down Syndrome
Absolute Zero


0

## 5.1 Cleaning titles

Because we're trying to build a word frequency model from Hacker News titles, we need a consistent set of words to count. For example, `Google`, `google`, `GooGle?`, and `google.` all refer to the same keyword: `google`. If we simply split each title into words, however, they would be counted as four different words.

To clean the titles, we'll convert them to lowercase and remove the punctuation.

In [7]:
# The function converts all titles to lowercase and removes any punctuation.
def clean_titles(titles):
    """Cleans titles by converting each to lowercase and removing punctuation.
    
    Args:
    ----
        titles (generator): A list of titles.
    
    Yields:
    ------
        string: The title with lowercase letters and punctuation removed."""
    return (''.join(c for c in title.lower() if c not in string.punctuation) for title in titles)
        
# Clean titles (stored as cleaned_titles so the clean_titles function is not overwritten)
cleaned_titles = clean_titles(extract_titles(csv_file))

# Display the first three entries of cleaned_titles
counter = 0
for title in cleaned_titles:
    if counter < 3:
        print(title)
        counter += 1   
        
# Reset iteration on csv file       
csv_file.seek(0)

true goodbye ‘using truecrypt is not secure’
for hire dedicated young man with down syndrome
absolute zero


0
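
Note that the curly quotes in the first title survive the cleaning step, because `string.punctuation` only covers ASCII characters. One possible way to handle this, shown here as a sketch with a hypothetical `strip_punctuation` helper, is to also drop any character whose Unicode category starts with `P` (punctuation), using the standard `unicodedata` module.

```python
import string
import unicodedata


def strip_punctuation(text):
    """Lowercase `text` and drop ASCII and Unicode punctuation characters."""
    return "".join(
        c for c in text.lower()
        if c not in string.punctuation
        and not unicodedata.category(c).startswith("P")
    )


title = "True Goodbye: ‘Using TrueCrypt Is Not Secure’"

# ASCII-only filter, as in clean_titles: the curly quotes survive.
ascii_only = "".join(c for c in title.lower() if c not in string.punctuation)
print(ascii_only)

# Unicode-aware filter: the curly quotes are removed too.
print(strip_punctuation(title))
```

With this variant, `‘using` and `using` would be counted as the same word. The `string.punctuation` check is kept because it also removes ASCII symbols such as `$` and `+`, which Unicode classifies as symbols rather than punctuation.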

## 6. Generating a word frequency dictionary

With the titles cleaned, we can now build the **word frequency** dictionary. A word frequency dictionary is a set of key-value pairs that maps each word to the number of times it is used in a text.

To find actual keywords, the word frequency dictionary should exclude **stop words**. Stop words are words that occur frequently in a language, like "the" and "or", and are commonly ignored in keyword searches.

## 6.1 Stop words list

A list of stop words was copied and pasted from [countwordsfree.com](https://countwordsfree.com/stopwords). A few additional stop words were added to the end for the purposes of our keyword search.

In [8]:
# Pasted stop words converted into a frozenset for fast membership tests
stop_words = frozenset("""able
about
above
abroad
according
accordingly
across
actually
adj
after
afterwards
again
against
ago
ahead
ain't
all
allow
allows
almost
alone
along
alongside
already
also
although
always
am
amid
amidst
among
amongst
an
and
another
any
anybody
anyhow
anyone
anything
anyway
anyways
anywhere
apart
appear
appreciate
appropriate
are
aren't
around
as
a's
aside
ask
asking
associated
at
available
away
awfully
back
backward
backwards
be
became
because
become
becomes
becoming
been
before
beforehand
begin
behind
being
believe
below
beside
besides
best
better
between
beyond
both
brief
but
by
came
can
cannot
cant
can't
caption
cause
causes
certain
certainly
changes
clearly
c'mon
co
co.
com
come
comes
concerning
consequently
consider
considering
contain
containing
contains
corresponding
could
couldn't
course
c's
currently
dare
daren't
definitely
described
despite
did
didn't
different
directly
do
does
doesn't
doing
done
don't
down
downwards
during
each
edu
eg
eight
eighty
either
else
elsewhere
end
ending
enough
entirely
especially
et
etc
even
ever
evermore
every
everybody
everyone
everything
everywhere
ex
exactly
example
except
fairly
far
farther
few
fewer
fifth
first
five
followed
following
follows
for
forever
former
formerly
forth
forward
found
four
from
further
furthermore
get
gets
getting
given
gives
go
goes
going
gone
got
gotten
greetings
had
hadn't
half
happens
hardly
has
hasn't
have
haven't
having
he
he'd
he'll
hello
help
hence
her
here
hereafter
hereby
herein
here's
hereupon
hers
herself
he's
hi
him
himself
his
hither
hopefully
how
howbeit
however
hundred
i'd
ie
if
ignored
i'll
i'm
immediate
in
inasmuch
inc
inc.
indeed
indicate
indicated
indicates
inner
inside
insofar
instead
into
inward
is
isn't
it
it'd
it'll
its
it's
itself
i've
just
k
keep
keeps
kept
know
known
knows
last
lately
later
latter
latterly
least
less
lest
let
let's
like
liked
likely
likewise
little
look
looking
looks
low
lower
ltd
made
mainly
make
makes
many
may
maybe
mayn't
me
mean
meantime
meanwhile
merely
might
mightn't
mine
minus
miss
more
moreover
most
mostly
mr
mrs
much
must
mustn't
my
myself
name
namely
nd
near
nearly
necessary
need
needn't
needs
neither
never
neverf
neverless
nevertheless
new
next
nine
ninety
no
nobody
non
none
nonetheless
noone
no-one
nor
normally
not
nothing
notwithstanding
novel
now
nowhere
obviously
of
off
often
oh
ok
okay
old
on
once
one
ones
one's
only
onto
opposite
or
other
others
otherwise
ought
oughtn't
our
ours
ourselves
out
outside
over
overall
own
particular
particularly
past
per
perhaps
placed
please
plus
possible
presumably
probably
provided
provides
que
quite
qv
rather
rd
re
really
reasonably
recent
recently
regarding
regardless
regards
relatively
respectively
right
round
said
same
saw
say
saying
says
second
secondly
see
seeing
seem
seemed
seeming
seems
seen
self
selves
sensible
sent
serious
seriously
seven
several
shall
shan't
she
she'd
she'll
she's
should
shouldn't
since
six
so
some
somebody
someday
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specified
specify
specifying
still
sub
such
sup
sure
take
taken
taking
tell
tends
th
than
thank
thanks
thanx
that
that'll
thats
that's
that've
the
their
theirs
them
themselves
then
thence
there
thereafter
thereby
there'd
therefore
therein
there'll
there're
theres
there's
thereupon
there've
these
they
they'd
they'll
they're
they've
thing
things
think
third
thirty
this
thorough
thoroughly
those
though
three
through
throughout
thru
thus
till
to
together
too
took
toward
towards
tried
tries
truly
try
trying
t's
twice
two
un
under
underneath
undoing
unfortunately
unless
unlike
unlikely
until
unto
up
upon
upwards
us
use
used
useful
uses
using
usually
v
value
various
versus
very
via
viz
vs
want
wants
was
wasn't
way
we
we'd
welcome
well
we'll
went
were
we're
weren't
we've
what
whatever
what'll
what's
what've
when
whence
whenever
where
whereafter
whereas
whereby
wherein
where's
whereupon
wherever
whether
which
whichever
while
whilst
whither
who
who'd
whoever
whole
who'll
whom
whomever
who's
whose
why
will
willing
wish
with
within
without
wonder
won't
would
wouldn't
yes
yet
you
you'd
you'll
your
you're
yours
yourself
yourselves
you've
zero
a
how's
i
when's
why's
b
c
d
e
f
g
h
j
l
m
n
o
p
q
r
s
t
u
uucp
w
x
y
z
I
www
amount
bill
bottom
call
computer
con
couldnt
cry
de
describe
detail
due
eleven
empty
fifteen
fifty
fill
find
fire
forty
front
full
give
hasnt
herse
himse
interest
itse”
mill
move
myse”
part
put
show
side
sincere
sixty
system
ten
thick
thin
top
twelve
twenty
abst
accordance
act
added
adopted
affected
affecting
affects
ah
announce
anymore
apparently
approximately
aren
arent
arise
auth
beginning
beginnings
begins
biol
briefly
ca
date
ed
effect
et-al
ff
fix
gave
giving
heres
hes
hid
home
id
im
immediately
importance
important
index
information
invention
itd
keys
kg
km
largely
lets
line
'll
means
mg
million
ml
mug
na
nay
necessarily
nos
noted
obtain
obtained
omitted
ord
owing
page
pages
poorly
possibly
potentially
pp
predominantly
present
previously
primarily
promptly
proud
quickly
ran
readily
ref
refs
related
research
resulted
resulting
results
run
sec
section
shed
shes
showed
shown
showns
shows
significant
significantly
similar
similarly
slightly
somethan
specifically
state
states
stop
strongly
substantially
successfully
sufficiently
suggest
thered
thereof
therere
thereto
theyd
theyre
thou
thoughh
thousand
throug
til
tip
ts
ups
usefully
usefulness
've
vol
vols
wed
whats
wheres
whim
whod
whos
widely
words
world
youve
youd
youre
–
—
hn""".split('\n'))

## 6.2 Building a keyword dictionary

In [12]:
# The function builds a word frequency dictionary
def build_keyword_dictionary(titles):
    """Counts word frequency within titles that are not found in a list of stop words.
    
    Args:
    ----
        titles (generator): A List of titles in string format.
        
    Returns:
    -------
        Dictionary: A list of words (keys) and their frequencies (values).
    """
    keyword_count = {}
    for title in titles:
        for word in title.split(' '):
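            if word and word not in stop_words:
                # NOTE: a new word is initialised to 1 and then incremented below,
                # so every stored count is one higher than the word's true frequency.
                # The ranking is unaffected; initialise to 0 for exact counts.
                if word not in keyword_count:
                    keyword_count[word] = 1
                keyword_count[word] += 1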
    return keyword_count

# Build word frequency dictionary
word_frequency = build_keyword_dictionary(clean_titles(extract_titles(csv_file)))

# Display 20 entries in word_frequency
print(list(word_frequency.items())[:20])

# Reset iteration on csv file       
csv_file.seek(0)

[('true', 6), ('goodbye', 10), ('‘using', 2), ('truecrypt', 5), ('secure’', 2), ('hire', 13), ('dedicated', 7), ('young', 10), ('man', 25), ('syndrome', 4), ('absolute', 2), ('joshua', 2), ('norton', 2), ('emperor', 2), ('united', 7), ('soylent', 5), ('revolution', 6), ('pleasurable', 2), ('git', 38), ('20', 24)]


0
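
The same kind of count can be expressed with `collections.Counter` from the standard library. This is a sketch on hypothetical titles and a shortened, hypothetical stop-word set, not a drop-in replacement for the function above. `Counter` counts each occurrence exactly once, whereas `build_keyword_dictionary` as written stores 1 for a new word and then increments it, so its figures are one higher.

```python
from collections import Counter

# Hypothetical, shortened stop-word set and titles for illustration.
example_stop_words = {"the", "is", "a", "for"}
example_titles = ["google is the new google", "a guide for python", "python guide"]

# Count every word that is not a stop word, across all titles.
counts = Counter(
    word
    for title in example_titles
    for word in title.split()
    if word not in example_stop_words
)
print(counts["google"], counts["python"], counts["new"])  # 2 2 1
```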

## 6.3 Sorting the word frequency dictionary

Finally, we're ready to sort the top words used in all the titles.

The goal is to output a list of (`word`, `frequency`) tuples sorted from most used to least used.

In [13]:
# The function sorts and returns the top 100 most frequent words
def sort_keywords(keywords):
    """Generates a list of the top 100 most frequent words by sorting a 
       word frequency dictionary by value in decreasing order.
       Displays the top 100 tuples (word, frequency).
    
    Args:
    ----
        keywords (dictionary): A word frequency dictionary.
        
    Returns:
    -------
        list: A list of the top 100 tuples (word, frequency).
        """
    return sorted(keywords.items(), key=lambda x: x[1], reverse=True)[:100]

# Sort word frequency from highest to lowest
sorted_word_freq = sort_keywords(word_frequency)

# Print the top 10 most frequent words
print(sorted_word_freq[:10])

[('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72)]
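
When only the top few entries are needed, a full sort is not strictly required. As a sketch of an alternative, using hypothetical word counts, `heapq.nlargest` from the standard library returns the `n` largest items by a key function.

```python
import heapq

# Hypothetical word counts for illustration.
word_counts = {"google": 5, "python": 3, "bitcoin": 4, "rust": 1}

# nlargest avoids sorting the whole dictionary when only the top n are needed.
top_two = heapq.nlargest(2, word_counts.items(), key=lambda item: item[1])
print(top_two)  # [('google', 5), ('bitcoin', 4)]
```

For a dictionary of this size the difference is negligible, so `sorted(...)[:100]` remains a perfectly reasonable choice here.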


## 7. Putting the pipeline together

In [14]:
# Initialize the pipeline object
pipeline = Pipeline()

# Save function to our pipeline to load in json file data
@pipeline.task()
def file_to_json():
    """Loads in JSON file as a list of dictionary objects."""
    with open('hn_stories_2014.json') as file:
        data = json.load(file)
        stories = data['stories']
    return stories

# Save function which depends on the output from file_to_json
# to our pipeline. The function gets the most popular stories
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    """Filters popular stories from a list of dictionaries.
    
    Args:
    ----
        stories (list): A list of dictionaries.
    Returns:
    -------
        generator: A generator of filtered stories
    """
    def is_popular(story):
        """A boolean filter that filters popular stories that have more than 50 points,
           more than 1 comment, and do not begin with Ask HN.
        """
        return story['num_comments'] > 1 and story['points'] > 50 and \
               not story['title'].startswith('Ask HN')
    
    return (story for story in stories if is_popular(story))

# Save function which depends on the output from filter_stories
# to our pipeline. The function saves the JSON data to a csv file
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    """Saves JSON data to a csv file.
    
    Args:
    ----
        stories (iterable): Dictionaries containing the most popular stories.
    
    Returns:
    -------
        file object: An in-memory CSV file (io.StringIO), rewound to the beginning.
    """
    lines = []
    for story in stories:
        lines.append((story['objectID'], 
                     datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), 
                     story['url'], 
                     story['points'], 
                     story['title']))
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'],
                     file=io.StringIO())

# Save function which depends on the output from json_to_csv
# to our pipeline. The function extracts the titles from the JSON data turned CSV file.
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    """Extracts titles from CSV file.
    
    Args:
    ----
        csv_file (file object or string): An open CSV file object, or the name
                                          of the CSV file, to extract titles from.
        
    Returns:
    -------
        generator or list: The titles in string format.
    """
    if not isinstance(csv_file, str):
        csv_file.seek(0)
        reader = csv.reader(csv_file)
        header = next(reader)
        idx = header.index('title')
        return (line[idx] for line in reader)
    else:
        with open(csv_file, newline='') as file:
            reader = csv.reader(file)
            header = next(reader)
            idx = header.index('title')
            # Build a list here: a generator would be read after the file is closed
            return [line[idx] for line in reader]
        
# Save function which depends on the output from extract_titles
# to our pipeline. The function converts all titles to lowercase and removes any punctuation.
@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    """Cleans titles by converting each to lowercase and removing punctuation.
    
    Args:
    ----
        titles (generator): A list of titles.
    
    Yields:
    ------
        string: The title with lowercase letters and punctuation removed."""
    return (''.join(c for c in title.lower() if c not in string.punctuation) for title in titles)

# Save function which depends on the output from clean_titles
# to our pipeline. The function builds a word frequency dictionary
@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(titles):
    """Counts word frequency within titles that are not found in a list of stop words.
    
    Args:
    ----
        titles (generator): A List of titles in string format.
        
    Returns:
    -------
        Dictionary: A list of words (keys) and their frequencies (values).
    """
    keyword_count = {}
    for title in titles:
        for word in title.split(' '):
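            if word and word not in stop_words:
                # NOTE: a new word is initialised to 1 and then incremented below,
                # so every stored count is one higher than the word's true frequency.
                # The ranking is unaffected; initialise to 0 for exact counts.
                if word not in keyword_count:
                    keyword_count[word] = 1
                keyword_count[word] += 1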
    return keyword_count

# Save function which depends on the output from build_keyword_dictionary
# to our pipeline. The function sorts and returns the top 100 most frequent words
@pipeline.task(depends_on=build_keyword_dictionary)
def sort_keywords(keywords):
    """Generates a list of the top 100 most frequent words by sorting a 
       word frequency dictionary by value in decreasing order.
       Displays the top 100 tuples (word, frequency).
    
    Args:
    ----
        keywords (dictionary): A word frequency dictionary.
        
    Returns:
    -------
        list: A list of the top 100 tuples (word, frequency).
        """
    return sorted(keywords.items(), key=lambda x: x[1], reverse=True)[:100]

# Run the pipeline, saving outputs at every step in a dictionary
pipeline_outputs = pipeline.run()

# Top 100 most frequent words
top_100_words = pipeline_outputs[sort_keywords]
print(top_100_words)

[('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('1', 41), ('project', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 33), ('api', 33), ('apps', 33), ('browser', 33), ('server', 32), ('firefox', 32), ('fast', 32), ('gox', 32), ('problem', 32), ('mozilla', 32

## Conclusion

The final result yielded some interesting keywords, such as `bitcoin` (the cryptocurrency) and `heartbleed` (the OpenSSL vulnerability disclosed in 2014), among many others. Even though this was a basic natural language processing task, it provided some interesting insights into conversations from 2014.

### Next Steps

* Rewrite the `Pipeline` class's `run` method to save the output of each task to a file. This will allow us to "checkpoint" tasks so they don't have to be run twice.
* Use the [`nltk` package](http://www.nltk.org/) for more advanced natural language processing tasks.
* Convert to a CSV before filtering, so you can keep all the stories from 2014 in a raw file.
* Fetch the data from Hacker News directly from [their JSON API](https://hn.algolia.com/api). Instead of reading from an older file, we can perform additional data processing using newer data.
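
As a starting point for the checkpointing idea in the first bullet, here is one possible sketch. The helper `run_with_checkpoint`, its `directory` parameter, and the `.pkl` file naming are all assumptions made for illustration; it uses `pickle` from the standard library and is not wired into the `Pipeline` class.

```python
import inspect
import os
import pickle
import tempfile


def run_with_checkpoint(task, *args, directory="."):
    """Run `task(*args)`, or load its pickled result if a checkpoint exists.

    Hypothetical helper: the checkpoint file is named after the function.
    """
    path = os.path.join(directory, f"{task.__name__}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = task(*args)
    # Generators cannot be pickled, so materialise them into a list first.
    if inspect.isgenerator(result):
        result = list(result)
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result


calls = []


def slow_task():
    calls.append(1)  # Record each real execution.
    return (n * 2 for n in range(3))


with tempfile.TemporaryDirectory() as tmp:
    first = run_with_checkpoint(slow_task, directory=tmp)
    second = run_with_checkpoint(slow_task, directory=tmp)

print(first, second, len(calls))  # [0, 2, 4] [0, 2, 4] 1
```

The second call reads the saved result instead of running the task again. Several tasks in this pipeline return generators, so a real implementation would have to decide whether turning them into lists is acceptable for large inputs.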