# Guided Project: Transforming data with Python

### In this project, you'll be working with a dataset of submissions to [Hacker News](http://news.ycombinator.com/) from 2006 to 2015. 
Hacker News is a site where users can submit articles from across the internet (usually about technology and startups), and others can "upvote" the articles, signifying that they like them. The more upvotes a submission gets, the more popular it is in the community. Popular articles get to the "front page" of Hacker News, where they're more likely to be seen by others.

The dataset you'll be using was compiled by Arnaud Drizard using the Hacker News API, and can be found [here](https://github.com/arnauddri/hn). We've randomly sampled 10,000 rows from the data and removed all extraneous columns. **Our dataset only has four columns**:

* submission_time -- when the story was submitted.
* upvotes -- number of upvotes the submission got.
* url -- the base domain of the submission.
* headline -- the headline of the submission. Users can edit this, and it doesn't have to match the headline of the original article.

### You'll be writing scripts to answer some main questions:

* What words appear most often in the headlines?
* What domains were submitted most often to Hacker News?
* At what times are the most articles submitted?

You'll be answering these questions by writing command line scripts instead of using an IPython notebook. IPython notebooks are great for quick data visualization and exploration, but Python scripts are the way to put anything we learn into production. Let's say you want to make a website to help people write headlines that get as many upvotes as possible, and submit articles at the right time. To do this, you'll need scripts.

In the `read.py` file, read the `hn_stories.csv` file into a pandas DataFrame.
* There is no header row in the data, so the columns don't have names. See this [stackoverflow thread for how to add column names](http://stackoverflow.com/questions/11346283/renaming-columns-in-pandas). Add the column names listed above (submission_time, upvotes, url, and headline) to the DataFrame.
* Create a function called `load_data` that takes no inputs, but contains the code to read in and process the dataset. `load_data` should return a pandas DataFrame with the column names set correctly.

As you work on these steps, you should run your script on the command line every so often to verify that things are working. You can run `read.py` from the command line by calling `python read.py`.
* The first verification is to make sure that you don't see any errors. 
* The second one is to call print at key points in your code, and make sure that the output looks like what you expect. 
* You might want to do this after each step above. This is a good general rule of thumb to follow when writing new code.

```bash

/home/dq/scripts$ ls -la
total 840                                   
drwxr-xr-x 2 dq dq   4096 Nov 13 07:36 .    
drwxr-xr-x 1 dq dq   4096 Nov 13 09:04 ..   
-rwxr-xr-x 1 dq dq 851754 Nov 13 07:36 hn_stories.csv
-rwxrwxrwx 1 dq dq      0 Nov 13 09:11 read.py
/home/dq/scripts$ nano read.py

---
import pandas as pd

def load_data():
    # The file has no header row, so header=None keeps the first line as data
    stories = pd.read_csv('hn_stories.csv', header=None)
    stories.columns = ['submission_time', 'upvotes', 'url', 'headline']
    return stories
---

```
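The print-based verification described above can be sketched as follows. This is a minimal illustration, not the project's reference solution: the optional `source` parameter and the two sample rows are made up here so the check runs without the real `hn_stories.csv`, and the timestamp format in the sample is an assumption.

```python
import io

import pandas as pd

COLUMNS = ['submission_time', 'upvotes', 'url', 'headline']


def load_data(source='hn_stories.csv'):
    # `source` is a hypothetical addition so the function can be tried
    # on in-memory data; calling load_data() still reads the real file.
    # header=None tells pandas the first line is data, not column names.
    return pd.read_csv(source, header=None, names=COLUMNS)


# Two made-up rows in the same column order as the dataset
sample = io.StringIO(
    "2014-06-24T05:50:40Z,1,example.com,First made-up headline\n"
    "2010-02-17T16:57:59Z,2,example.org,Second made-up headline\n"
)

stories = load_data(sample)
print(stories.shape)             # (2, 4): no row was consumed as a header
print(stories.columns.tolist())  # the four names from COLUMNS
print(stories.head())
```

If the row count printed for the real file is one less than the number of lines in the CSV, the first row was probably read as a header.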

We now want to figure out which words appear most often in the headlines. We'll be developing another script, called `count.py`, to accomplish this. We'll need to import our `load_data` function from `read.py` into `count.py` so we can use it.

You'll recall that if you have a folder with two files, `read.py` and `count.py`, you can use the function `load_data` in `read.py` from `count.py` by writing the following code in `count.py`:

```python
import read
df = read.load_data()
```
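One thing to keep in mind with this import: Python runs a module's top-level code when it is imported, so any verification `print` calls left at the top level of `read.py` would also fire when `count.py` imports it. A common way around this is the `if __name__ == "__main__":` guard. The sketch below is only an illustration: it writes a stand-in `read.py` (with a placeholder list instead of the real DataFrame) to a temporary folder and imports it, to show that the guarded print does not run on import.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for read.py. The guard keeps the debug print
# from running when another script imports the module.
READ_SOURCE = '''
def load_data():
    return ["row 1", "row 2"]  # placeholder for the real DataFrame

if __name__ == "__main__":
    print(load_data())  # runs only for `python read.py`
'''

with tempfile.TemporaryDirectory() as folder:
    Path(folder, "read.py").write_text(READ_SOURCE)
    sys.path.insert(0, folder)  # mimic running count.py from that folder
    try:
        read = importlib.import_module("read")
        rows = read.load_data()  # nothing is printed by the import itself
    finally:
        sys.path.remove(folder)

print(rows)
```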

```bash

---count.py

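import read
from collections import Counter

# Load the stories and drop rows with a missing headline
data = read.load_data()
data_ = data[data['headline'].notnull()]

# Join all headlines into one lowercase string and strip some punctuation
headlines = ' '.join(data_['headline'])
headlines = headlines.lower()\
.replace('(','')\
.replace(')','')\
.replace('?','')

# Splitting on single spaces leaves '' tokens wherever spaces repeat
head_split = headlines.split(' ')

# Print the 100 most common words with their counts
c = Counter(head_split)
print(c.most_common(100))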

---

[('the', 2051), ('to', 1643), ('a', 1279), ('of', 1174), ('for', 1143),
 ('in', 1042), ('and', 960), ('', 740), ('is', 621), ('on', 573),
 ('with', 541), ('hn:', 537), ('how', 529), ('-', 487), ('your', 480),
 ('you', 401), ('ask', 371), ('from', 314), ('google', 308), ('new', 305),
 ('why', 266), ('what', 262), ('an', 245), ('are', 223), ('by', 222), ...
```
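The output above includes an empty-string token (`''`) and punctuation-only tokens such as `'-'`, because the headlines are split on single spaces. One possible alternative, sketched here as an assumption rather than as the method used above, is to extract words with a regular expression. The `count_words` helper and the sample headlines are made up for illustration. This approach also changes some tokens (for example, `'hn:'` becomes `'hn'`), so the counts would not match the output above exactly.

```python
import re
from collections import Counter


def count_words(headlines):
    """Count lowercase word tokens across an iterable of headline strings."""
    counts = Counter()
    for headline in headlines:
        # Keep runs of letters, digits, and apostrophes; everything else
        # (spaces, brackets, colons, dashes) acts as a separator.
        counts.update(re.findall(r"[a-z0-9']+", headline.lower()))
    return counts


# Made-up headlines, including a double space and some punctuation
sample = ["Ask HN:  How do you learn (fast)?", "How to learn Python"]
c = count_words(sample)
print(c.most_common(2))  # [('how', 2), ('learn', 2)]
```

With the real data, the same helper could be called as `count_words(data_['headline'])`, since a pandas Series of strings is iterable.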