# Transforming Data
In this project, you'll be working with a dataset of submissions to [Hacker News](http://news.ycombinator.com/) from 2006 to 2015.

The dataset you'll be using was compiled by Arnaud Drizard using the Hacker News API, and can be found [here](https://github.com/arnauddri/hn). We've randomly sampled 10,000 rows from the data and removed all extraneous columns, leaving only four:

- **submission_time** - when the story was submitted.
- **upvotes** - number of upvotes the submission got.
- **url** - the base domain of the submission.
- **headline** - the headline of the submission. Users can edit this, and it doesn't have to match the headline of the original article.

We'll be writing scripts to answer some questions:

- What words appear most often in the headlines?
- What domains were submitted most often to Hacker News?
- At what times are the most articles submitted?

## Reading The Data

In [3]:
# -*- coding: utf-8 -*-

'''
Code to load data for the DataQuest Guided Project on Transforming Data.
Reads in the 'data/hn_stories.csv' file and names its four columns.
'''

import pandas as pd

def load_data():
    '''Reads the data file in and names its four columns'''
    hn_stories = pd.read_csv('data/hn_stories.csv')
    hn_stories.columns = ['submission_time', 'upvotes', 'url', 'headline']
    return hn_stories

if __name__ == "__main__":
    STOR = load_data()
    stor1 = load_data() # pylint: disable=locally-disabled, invalid-name
    print(STOR, stor1)

In [4]:
from collections import Counter

df = load_data()
headlines = df.headline

PUNCTUATION = "!@^&*():;<>,.?/[]{}+=|-_ "

# Strip punctuation from both ends of each headline, then join everything
# into one lowercase string and split it into words on single spaces.
headline_text = " ".join(str(h).strip(PUNCTUATION) for h in headlines).lower()
headline_list = headline_text.split(" ")

# Drop empty strings and tokens made up only of punctuation.
headline_clean = [word for word in headline_list if word not in PUNCTUATION]

headline_dict = Counter(headline_clean)
headline_count_list = sorted(headline_dict, key=headline_dict.get, reverse=True)
print(headline_count_list[:100])

['the', 'to', 'a', 'of', 'for', 'in', 'and', 'is', 'on', 'with', 'hn:', 'how', 'your', 'you', 'ask', 'from', 'google', 'new', 'why', 'what', 'an', 'are', 'by', 'at', 'show', 'it', 'web', 'do', 'app', '\xe2\x80\x93', 'i', 'not', 'that', 'as', 'data', 'be', 'startup', 'about', 'facebook', 'my', 'free', 'using', 'apple', 'online', 'get', 'can', 'open', 'android', 'this', 'will', 'out', 'now', 'we', 'its', 'up', 'code', 'best', 'video', 'one', 'have', 'or', 'software', 'twitter', 'more', 'first', 'iphone', 'all', 'make', 'should', 'internet', 'us', 'social', 'mobile', 'use', 'design', 'has', 'world', 'apps', 'business', 'just', '5', 'cloud', 'source', 'into', 'like', 'api', 'top', 'javascript', 'tech', 'programming', 'company', 'windows', 'project', 'when', 'time', 'future', 'game', 'ios', 'news', 'live']
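
The top of this list is dominated by common function words such as 'the' and 'to', and tokens like 'hn:' still carry punctuation. Here is a minimal sketch of one way to get a more informative ranking, assuming a regular-expression tokenizer and a small hand-picked stop-word set. The names `STOP_WORDS` and `top_words` and the sample headlines are hypothetical and not part of the project code.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; extend it as needed.
STOP_WORDS = {"the", "to", "a", "of", "for", "in", "and", "is", "on", "with"}

def top_words(headlines, n=10):
    """Return the n most common non-stop-words as (word, count) pairs."""
    counts = Counter()
    for headline in headlines:
        # Keep runs of letters, digits, and apostrophes; drop other punctuation.
        words = re.findall(r"[a-z0-9']+", str(headline).lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up sample headlines, for illustration only.
sample = ["Show HN: A new app for the web", "Ask HN: Is the web dead?"]
print(top_words(sample, 2))
```

With the project data, this could be called as `top_words(df.headline, 100)`. `Counter.most_common` returns the words together with their counts, which the sorted key list above does not.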


In [None]:
from dateutil.parser import parse

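def extract_hour(x):
    '''Return the hour of day (0-23) parsed from a timestamp string.'''
    return parse(x).hour

def extract_day(x):
    '''Return the day of the month (1-31) parsed from a timestamp string.'''
    return parse(x).day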

df = load_data()
times = df.submission_time

df['submission_hour'] = times.apply(extract_hour)
sub_hour = df.submission_hour
sub_hour_counts = sub_hour.value_counts()
print(sub_hour_counts[:8])

df['submission_day'] = times.apply(extract_day)
sub_day = df.submission_day
sub_day_counts = sub_day.value_counts()
print(sub_day_counts[:8])
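
Note that `parse(x).day` gives the day of the month, not the day of the week. If the weekday is what matters, the sketch below shows one way to count submissions by hour and by weekday using only the standard library. It assumes timestamps shaped like `2014-06-24T05:50:40Z`; that layout, the `TIME_FORMAT` constant, the `count_by` helper, and the sample values are all assumptions for illustration, so adjust them to match the actual `submission_time` values.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; change this if the data uses a different one.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def count_by(timestamps, key):
    """Count timestamps grouped by key(datetime), e.g. hour or weekday."""
    return Counter(key(datetime.strptime(t, TIME_FORMAT)) for t in timestamps)

# Made-up sample timestamps, for illustration only.
sample = ["2014-06-24T05:50:40Z", "2014-06-24T05:10:00Z", "2014-06-29T17:00:00Z"]

by_hour = count_by(sample, lambda d: d.hour)
# datetime.weekday() returns 0 for Monday through 6 for Sunday.
by_weekday = count_by(sample, lambda d: d.weekday())

print(by_hour.most_common(1))
print(by_weekday.most_common())
```

With the project data, `count_by(df.submission_time, ...)` would take the place of the sample list.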

In [None]:
df = load_data()
urls = df.url

urls_counts = urls.value_counts()
top10 = urls_counts[:10]
for name, count in top10.items():
    print("{0}: {1}".format(name, count))
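
`value_counts()` treats every distinct string as its own domain, so variants such as `www.github.com` and `GitHub.com` would be counted separately. The source does not say whether the data actually contains such variants. If it does, a rough normalization step like the one below could be applied first. The `normalize_domain` helper and the sample values are hypothetical.

```python
from collections import Counter

def normalize_domain(url):
    """Lowercase a domain and drop a leading 'www.' so variants count together."""
    domain = str(url).strip().lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return domain

# Made-up sample values, for illustration only.
sample = ["github.com", "www.github.com", "GitHub.com", "medium.com"]

domain_counts = Counter(normalize_domain(u) for u in sample)
for name, count in domain_counts.most_common(2):
    print("{0}: {1}".format(name, count))
```

In the pandas version, the same helper could be applied with `df.url.apply(normalize_domain)` before calling `value_counts()`.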