In previous files, we began with the concepts of functional programming, and then built our own data pipeline class in Python. We learned about advanced Python concepts such as the `decorators`, `closures`, and good `API design`. In the last file, we also learned how to implement a **directed acyclic graph** as the scheduler for our pipeline.

After completing all these file, we have finally built a robust data pipeline that schedules our tasks in the correct order! In this project, we will use the pipeline we have been building, and apply it to a real world data pipeline project. From a JSON API, we will filter, clean, aggregate, and summarize data in a sequence of tasks that will apply these transformations for us.

The data we will use comes from a [Hacker News](https://news.ycombinator.com/) (HN) API that returns [JSON data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON) of the top stories in 2014. If we're unfamiliar with Hacker News, it's a link aggregator website that users vote up stories that are interesting to the community. It is similar to [Reddit](https://www.reddit.com/), but the community only revolves around on computer science and entrepreneurship posts.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Using this dataset, we will run a sequence of basic natural language processing tasks using our `Pipeline` class. The goal will be to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is the most popular technology social media site, this will give us an understanding of the most talked about tech topics in 2014!

[Solution](https://github.com/dataquestio/solutions/blob/master/Mission267Solutions.ipynb)

**Task**

* Import the `Pipeline` class from the `pipeline` module. We can import it like so: `from pipeline import Pipeline`.
* Instantiate an instance of the `Pipeline` class and assign it to the variable `pipeline`.

In [2]:
from datetime import datetime
import json
import io
import csv
import string

from pipeline import build_csv, Pipeline
from stop_words import stop_words

pipeline = Pipeline()

![image.png](attachment:image.png)

As a reminder, this is how we can parse `JSON` strings:

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [3]:
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)
        stories = data['stories']
    return stories

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [4]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith('Ask HN')
    
    return (
        story for story in stories
        if is_popular(story)
    )

With a reduced set of stories, it's time to write these `dict` objects to a CSV file. The purpose of translating the dictionaries to a CSV is that we want to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of our pipeline tasks will be adaptable with future task requirements.

**Task**

![image.png](attachment:image.png)

**Answer**

In [5]:
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append(
            (story['objectID'], datetime.strptime(story['created_at'],
                                                  "%Y-%m-%dT%H:%M:%SZ"),
             story['url'], story['points'], story['title'])
        )
    return build_csv(lines, header=['objectID', 'created_at', 'url', 
                                    'points', 'title'], file=io.StringIO())

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [6]:
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    
    return (line[idx] for line in reader)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [7]:
@pipeline.task(depends_on=extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title

![image.png](attachment:image.png)

As we can see, the title has been stripped of its punctuation and lower cased. Furthermore, to find actual keywords, we should enforce the word frequency dictionary to not include **stop words***. Stop words are words that occur frequently in language like "the", "or", etc., and are commonly rejected in keyword searches.

We have included a module called `stop_words` with a `tuple` of the most common used stop words in the English language. We can import in our notebook by using `from stop_words import stop_words`. Here's what the sample text would look like without the stop words:

`sample_text = "Wow, the NED Data Engineering track is the best track!"
print(word_freq_no_stop_words(sample_text))`

`{'wow': 1, 'NED': 1, 'data': 1, 'engineering': 1, 'track': 2, 'best': 1}`

**Task**

![image.png](attachment:image.png)

**Answer**

In [8]:
@pipeline.task(depends_on=clean_title)
def build_keyword_dictionary(titles):
    word_freq = {}
    for title in titles:
        for word in title.split(' '):
            if word and word not in stop_words:
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
    return word_freq

Finally, we're ready to sort the top words used in all the titles. In this final task, it's up to us to decide how we want to sort the top words. The goal is to output a list of tuples with (`word`, `frequency`) as the entries sorted from most used, to least most used.

**Task**

![image.png](attachment:image.png)

**Answer**

In [9]:
@pipeline.task(depends_on=build_keyword_dictionary)
def top_keywords(word_freq):
    freq_tuple = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq_tuple[:100]

ran = pipeline.run()
print(ran[top_keywords])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3

The final result yielded some interesting keywords. There were terms like `bitcoin` (the cryptocurrency), `heartbleed` (the 2014 hack), and many others. Even though this was a basic natural language processing task, it did provide some interesting insights into conversations from 2014. Nonetheless, now that we have created the pipeline, there are additional tasks we can perform with the data.

Here are just a few:

* Rewrite the `Pipeline` class' output to save a file of the output for each task. This will allow us to "checkpoint" tasks so they don't have to be run twice.
* Use the [`nltk` package](http://www.nltk.org/) for more advanced natural language processing tasks.
* Convert to a CSV before filtering, so we can keep all the stories from 2014 in a raw file.
* Fetch the data from Hacker News directly from a [JSON API](https://hn.algolia.com/api). Instead of reading from the file, we can perform additional data processing using newer data.