![hacker_news](hacker_news_logo.png)

#  Exploring Hacker News Posts

## 1. Introduction

**Hacker News** (sometimes abbreviated as HN) is a social news website (https://news.ycombinator.com/) focusing on computer science and entrepreneurship.

It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity."

The word hacker in "Hacker News" is used in its original meaning and refers to the hacker culture which consists of people who enjoy tinkering with technology.

The intention was to recreate a community similar to the early days of Reddit. 

However, unlike Reddit where new users can immediately both upvote and downvote content, Hacker News does not allow users to downvote content until they have accumulated 501 "karma" points...

Source: https://en.wikipedia.org/wiki/Hacker_News

We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. 

Users submit `Ask HN` posts to ask the Hacker News community a specific question and users also submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting.

We'll compare these two types of posts to answer the following questions:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time of day receive more comments on average?

### Data dictionary:

- `id`: the unique identifier from Hacker News for the post

- `title`: the title of the post

- `url`: the URL that the post links to, if the post has a URL

- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

- `num_comments`: the number of comments on the post

- `author`: the username of the person who submitted the post

- `created_at`: the date and time of the post's submission

In [1]:
from csv import reader
pathfolder = '/home/ion/Formacion/Dataquest/Data Scientist in Python/Step-1/Python_for_Data_Science_Intermediate/13- Hacker_News'
hn = open(pathfolder + '/' + 'HN_posts_year_to_Sep_26_2016.csv')
hn = reader(hn)
hn = list(hn)
header = hn[0]

This is the header we are going to work with, which corresponds to row 0.

In [2]:
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

These are the first five rows of the file (the header plus the first four data rows), so we can get an idea of what the data looks like.

In [3]:
for filas in hn[0:5]:
    print(filas)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
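As a hedged alternative to the loading cell above, the sketch below reads CSV data inside a `with` block so the file handle is closed automatically. The in-memory `sample_csv` and the names `rows`, `sample_header` and `sample_data` are assumptions made for illustration; they stand in for the real file and are not part of the original notebook.

```python
import csv
import io

# Hypothetical one-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
)

# With a real file this would be: with open(path, newline="", encoding="utf-8") as f:
with sample_csv as f:
    rows = list(csv.reader(f))

sample_header = rows[0]   # first row holds the column names
sample_data = rows[1:]    # remaining rows hold the posts

print(sample_header)
print(sample_data[0][1], sample_data[0][4])
```

Every value comes back as a string, including `num_comments`, which is why later steps convert it with `int()`.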


## 2. Removing Headers from a List of Lists

- 1. Extract the first row of data and assign it to the variable `headers`.

In [4]:
headers = hn[0]

- 2. Remove the first row from `hn`.

In [5]:
del hn[0]

- 3. Display `headers`.

In [6]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


- 4. Display the first four rows of `hn` to verify that the header row was removed properly.

In [7]:
print(hn[0:4])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
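Note that `del hn[0]` modifies the list in place, so running that cell twice would also drop the first data row. A minimal sketch of a non-destructive alternative, using slicing on a hypothetical miniature dataset (`sample`, `sample_headers` and `sample_rows` are illustrative names, not from the original notebook):

```python
# Hypothetical miniature dataset; the first row is the header.
sample = [
    ["id", "title", "num_comments"],
    ["1", "Ask HN: Example question?", "7"],
    ["2", "Show HN: Example project", "0"],
]

# Slicing builds a new list and leaves `sample` untouched, so re-running
# this code never removes a data row by accident.
sample_headers = sample[0]
sample_rows = sample[1:]

print(sample_headers)
print(len(sample_rows))
```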


## 3. Extracting Ask HN and Show HN Posts

Now that we have removed the `hn` headers, we are ready to filter our data.

Since we are only interested in posts whose titles begin with **`Ask HN`** or **`Show HN`**, we will create new lists of lists containing only the data for those two kinds of posts.

To find posts starting with  **`Ask HN`** or **`Show HN`**, we will use the string method `startswith`. 

Example:

Given a string object, say `string1`, we can check whether it starts with `'data'` by calling:

- `string1.startswith('data')`

If `string1` starts with `'data'`, the call returns `True`; otherwise it returns `False`.

        print('dataquest'.startswith('Data'))
        False

        print('dataquest'.startswith('data'))
        True

In the example above, the first call prints `False` because `'dataquest'` does not start with `'Data'` (capital D), whereas the second prints `True` because `'dataquest'` **does** start with `'data'`.

`startswith` is case-sensitive, so if we want to ignore case we can first apply the `lower` method, which returns a lowercase version of the initial string. Here is an example:

`print('DataQuest'.lower())`

`dataquest`
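The two methods can be combined in a single expression. The sketch below also relies on the fact that `str.startswith` accepts a tuple of prefixes; the `titles` list is a made-up example, not data from the file.

```python
# Hypothetical titles used only to illustrate the string methods.
titles = ["Ask HN: Example?", "SHOW HN: Demo", "Something else"]

# startswith accepts a tuple, returning True if any prefix matches.
prefixes = ("ask hn", "show hn")

# Lowercase each title first so the comparison ignores case.
matches = [t for t in titles if t.lower().startswith(prefixes)]

print(matches)
```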

### Filter strategy

The method `object.startswith('string')` we have introduced makes it easier for us to filter the content in our list.

Our strategy is to create three empty lists called **ask_posts**, **show_posts** and **other_posts**.

We will then loop through `hn` and assign the title of each row to a variable called `title`.

- If the lowercase version of the title starts with `ask hn`, add the row to `ask_posts`.
- If the lowercase version of the title starts with `show hn`, add the row to `show_posts`.
- Otherwise, add the row to `other_posts`.

Finally, we check the number of posts in `ask_posts`, `show_posts`, and `other_posts`.

Empty lists called:

- `ask_posts`
- `show_posts`
- `other_posts`

In [8]:
ask_posts = []
show_posts = []
other_posts = []

- Loop through each row in `hn`.
- Assign the title of each row to a variable named `title`.
- The `title` column is the second column, so you need the element at index 1 of each row.

In [9]:
for fila in hn:
    title = fila[1]
    #print(title)

If the lowercase version of title starts with `ask hn`, append the row to `ask_posts`.

Else if the lowercase version of title starts with `show hn`, append the row to `show_posts`.

Else append to `other_posts`.

In [10]:
for fila in hn:
    title = fila[1]
    title = title.lower()   # all in lower_case
    
    if title.startswith('ask hn'):
        ask_posts.append(fila)
    elif title.startswith('show hn'):
        show_posts.append(fila)
    else:
        other_posts.append(fila)

- Check the number of posts in `ask_posts`, `show_posts` and `other_posts`.

In [11]:
len(ask_posts)

9139

In [12]:
len(show_posts)

10158

In [13]:
len(other_posts)

273822
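The same filtering logic can be sketched with a small helper function and a dictionary of lists. The function name `classify`, the `groups` dictionary and the sample rows are assumptions made for illustration; the notebook itself uses three separate lists.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical rows with the same column layout as the dataset.
sample_rows = [
    ["1", "Ask HN: Example question?", "", "4", "7", "user_a", "9/26/2016 2:53"],
    ["2", "Show HN: Example project", "http://example.com", "2", "0", "user_b", "9/26/2016 0:36"],
    ["3", "algorithmic music", "http://example.com/music", "1", "0", "user_c", "9/26/2016 3:16"],
    ["4", "ask hn: lowercase still matches?", "", "1", "1", "user_d", "9/25/2016 22:57"],
]

groups = {"ask": [], "show": [], "other": []}
for row in sample_rows:
    groups[classify(row[1])].append(row)

counts = {name: len(rows) for name, rows in groups.items()}
print(counts)
```

Keeping the test in one function makes it easy to check the rule on a single title before running it over the whole dataset.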

## 4. Calculating the Average Number of Comments for Ask HN and Show HN Posts

Here are samples of the contents of `ask_posts` and `show_posts`:

In [14]:
print(ask_posts[0:3])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']]


In [15]:
print(show_posts[0:3])

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']]


The total number of comments on ask posts will be stored in `total_ask_comments`.
Remember to initialize `total_ask_comments` to 0.

In [16]:
total_ask_comments = 0

- Use a for loop to iterate over the `ask_posts` entries.

- The `num_comments` column is the fifth column of `ask_posts`, so you need the element at index 4 of each row.

- Convert the value to an integer so that the comments can be summed.
    - Add this value to `total_ask_comments`.
    - After the loop, calculate the average number of comments on ask posts and assign it to `avg_ask_comments`.
    
**Header looks like:**

    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

### Computing:

- #### avg number on Ask post
- ####  max of num_comments on Ask post.

In [17]:
maxi = 0
for comment in ask_posts:
    total_ask_comments += int(comment[4])  # convert string to int
    if maxi <= int(comment[4]):
        maxi = int(comment[4])
      
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print('avg number of "Ask_posts" comments:', round(avg_ask_comments,2))
print('max number of "Ask_posts" comment:', maxi)

avg number of "Ask_posts" comments: 10.39
max number of "Ask_posts" comment: 1007
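For comparison, the same average and maximum can be sketched with the built-ins `sum` and `max`. The rows below are hypothetical and only mimic the column layout; this is an illustration of the calculation, not a replacement for the cell above.

```python
# Hypothetical ask posts; index 4 holds num_comments as a string.
sample_ask = [
    ["1", "Ask HN: A?", "", "4", "7", "user_a", "9/26/2016 2:53"],
    ["2", "Ask HN: B?", "", "6", "3", "user_b", "9/26/2016 1:17"],
    ["3", "Ask HN: C?", "", "1", "2", "user_c", "9/25/2016 22:57"],
]

# Convert once, then reuse the integers for both statistics.
comment_counts = [int(row[4]) for row in sample_ask]

avg_comments = sum(comment_counts) / len(comment_counts)
max_comments = max(comment_counts)

print(round(avg_comments, 2), max_comments)
```

Because nothing is accumulated in a variable defined in another cell, re-running this code always gives the same result.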


- #### title of the max of num_comments on Ask post.

In [18]:
most_important_questions = {}
maxi = 0
# Rows are scanned in file order (newest first in this dataset). A post is
# kept only when it sets a new running maximum of num_comments, so the
# result is a list of record holders, not a complete top-N ranking.
for comment in ask_posts:
    if maxi < int(comment[4]):
        maxi = int(comment[4])
        most_important_questions[maxi] = comment[1]

# Comment counts in descending order
most_voted = sorted(most_important_questions, reverse=True)

print("-- Comments --    -- Title -- ")

for score in most_voted:
    texto = "   {points}           {relevance}".format(points=score, relevance=most_important_questions[score])
    print(texto)

-- Comments --    -- Title -- 
   1007           Ask HN: Who is hiring? (June 2016)
   947           Ask HN: Who is hiring? (August 2016)
   910           Ask HN: Who is hiring? (September 2016)
   660           Ask HN: Is web programming a series of hacks on hacks?
   477           Ask HN: What do you wish someone would build?
   97           Ask HN: What are the best practises for using SSH keys?
   22           Ask HN: What is that one deciding factor that makes a website successful?
   7           Ask HN: What TLD do you use for local development?


- #### avg number on Show post
- #### max of num_comments on Show post.

In [19]:
max_show = 0
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    if max_show <= int(row[4]):
        max_show = int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print('avg number of comments: "Show posts" ', round(avg_show_comments,2))

print('max number of "Show posts" comments:', max_show)

avg number of comments: "Show posts"  4.89
max number of "Show posts" comments: 306


- #### title of the max of num_comments on Show post.

In [20]:
most_important_show = {}
max_show = 0
# Same approach as for ask posts: keep each post that sets a new running
# maximum of num_comments while scanning the rows in file order.
for comment in show_posts:
    if max_show < int(comment[4]):
        max_show = int(comment[4])
        most_important_show[max_show] = comment[1]

# Comment counts in descending order
most_voted_show = sorted(most_important_show, reverse=True)

print("-- Comments --    -- Title -- ")

for score in most_voted_show:
    texto = "   {points}           {relevance}".format(points=score, relevance=most_important_show[score])
    print(texto)

-- Comments --    -- Title -- 
   306           Show HN: BitKeeper  Enterprise-ready version control, now open-source
   280           Show HN: I invented a caffeinated toothpaste
   169           Show HN: Primitive Pictures
   167           Show HN: Lemonade  the world's first P2P insurance company
   102           Show HN: InstaPart  Build circuit boards faster with instant parts
   26           Show HN: G9.js  Automatically Interactive Differentiable Graphics
   3           Show HN: Cursor that Screenshot
   1           Show HN: Jumble  Essays on the go #PaulInYourPocket
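The two listings above keep only posts that set a new running maximum while the rows are scanned in order, so a heavily commented post that appears after an even bigger one is skipped. A minimal sketch of a complete top-N ranking, assuming hypothetical rows with the same column layout (`sample_posts`, `running_records` and `top_three` are illustrative names):

```python
# Hypothetical posts; index 1 is the title, index 4 is num_comments.
sample_posts = [
    ["1", "Post A", "", "1", "50", "user_a", "9/26/2016 2:53"],
    ["2", "Post B", "", "1", "40", "user_b", "9/26/2016 1:17"],
    ["3", "Post C", "", "1", "60", "user_c", "9/25/2016 22:57"],
    ["4", "Post D", "", "1", "45", "user_d", "9/25/2016 21:50"],
]

# Running-maximum approach: only A (50) and C (60) are recorded.
running_records = []
best = 0
for row in sample_posts:
    if int(row[4]) > best:
        best = int(row[4])
        running_records.append(row[1])

# Full ranking: sort every row by its comment count, then slice.
top_three = sorted(sample_posts, key=lambda row: int(row[4]), reverse=True)[:3]

print(running_records)
print([(row[1], int(row[4])) for row in top_three])
```

In this example "Post D" has more comments than "Post B", yet only the sorted version reports it.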


In [21]:
%%html
<style>
table {float:left}
</style>

### Observations


- #### avg number of "Ask_posts" comments: 10.39
- #### max number of "Ask_posts" comment: 1007


| Comments | Title |
|---|---|
| 1007 | Ask HN: Who is hiring? (June 2016) |
| 947 | Ask HN: Who is hiring? (August 2016) |
| 910 | Ask HN: Who is hiring? (September 2016) |
| 660 | Ask HN: Is web programming a series of hacks on hacks? |
| 477 | Ask HN: What do you wish someone would build? |
| 97 | Ask HN: What are the best practises for using SSH keys? |
| 22 | Ask HN: What is that one deciding factor that makes a website successful? |
| 7 | Ask HN: What TLD do you use for local development? |

- #### avg number of comments: "Show posts"  4.89
- #### max number of "Show posts" comments: 306


| Comments | Title |
|---|---|
| 306 | Show HN: BitKeeper  Enterprise-ready version control, now open-source |
| 280 | Show HN: I invented a caffeinated toothpaste |
| 169 | Show HN: Primitive Pictures |
| 167 | Show HN: Lemonade  the world's first P2P insurance company |
| 102 | Show HN: InstaPart  Build circuit boards faster with instant parts |
| 26 | Show HN: G9.js  Automatically Interactive Differentiable Graphics |
| 3 | Show HN: Cursor that Screenshot |
| 1 | Show HN: Jumble  Essays on the go #PaulInYourPocket |


On average, Ask HN posts receive about twice as many comments as Show HN posts (10.39 versus 4.89), and the most-commented Ask HN posts also attract far more comments than the most-commented Show HN posts. Note that the numbers in these tables are comment counts (`num_comments`), not points.

The highest comment count among Show HN posts (306) is still below the fifth entry in the Ask HN table (477).

The first three entries in the Ask HN table are all "Ask HN: Who is hiring?" threads, which suggests that finding out which companies are hiring is one of the main things users come to this forum for.

Among those three threads, the June 2016 edition received the most comments (1007), followed by August (947) and September (910). Keep in mind that the tables only list posts that set a new running maximum while the data was scanned in file order, so they are not a complete ranking.

On the Show HN side, the most-commented post announces that BitKeeper, a fast, enterprise-ready distributed SCM that scales up to very large projects and down to tiny ones, is now available as open source under the Apache 2.0 License.

## 5. Finding the Number of Ask Posts and Comments by Hour Created


In the previous section, we determined that, on average, ask posts receive more comments than show posts.

Since ask posts are more likely to receive comments, **we'll focus our remaining analysis just on these posts.**

Next, **we'll determine if ask posts created at a certain time are more likely to attract comments**. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

- Calculate the average number of comments ask posts receive by hour created.

NOTE: we can use the `datetime.strptime()` method to parse dates stored as strings and return datetime objects. For example:

   - `date_1_str = "December 24, 1984"`
   - `date_1_dt = dt.datetime.strptime(date_1_str, "%B %d, %Y")`

Let's use this technique to calculate the number of ask posts created per hour, along with the total number of comments.
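Applied to the timestamp format used in this dataset, the parsing step can be sketched as follows. The `stamp` value is taken from the first ask post shown earlier; the variable names are illustrative.

```python
import datetime as dt

# A created_at value in the dataset's month/day/year hour:minute format.
stamp = "9/26/2016 2:53"

# %m, %d and %H accept values without a leading zero when parsing.
parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")

print(parsed.hour)            # hour as an integer
print(parsed.strftime("%H"))  # hour as a zero-padded string
```

The `hour` attribute yields an integer, while `strftime("%H")` yields a two-character string; either can serve as a dictionary key as long as it is used consistently.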

In [22]:
import datetime as dt

Create an empty list and assign it to `result_list`. It will hold one `(created_at, num_comments)` pair per ask post.

In [23]:
result_list = []

Iterate over `ask_posts`, and append to `result_list` a tuple with two elements:


   - The first element should be the value of the `created_at` column. 
    Because the `created_at` column is the seventh column in `ask_posts`, you'll need to get the element at index 6 in each row.
    
   - The second element should be the number of comments of the post (index 4). It is stored here as a string and converted to an integer in the next step.

In [24]:
for row in ask_posts:
    time_stamp = row[6]    # created_at, e.g. '9/26/2016 2:53'
    num_comment = row[4]   # num_comments, still a string at this point
    tupla = (time_stamp, num_comment)
    result_list.append(tupla)

result_list

[('9/26/2016 2:53', '7'),
 ('9/26/2016 1:17', '3'),
 ('9/25/2016 22:57', '0'),
 ('9/25/2016 22:48', '3'),
 ('9/25/2016 21:50', '2'),
 ('9/25/2016 19:30', '1'),
 ('9/25/2016 19:22', '22'),
 ('9/25/2016 17:55', '3'),
 ('9/25/2016 15:48', '0'),
 ('9/25/2016 15:35', '13'),
 ('9/25/2016 15:28', '0'),
 ('9/25/2016 14:43', '0'),
 ('9/25/2016 14:17', '3'),
 ('9/25/2016 13:08', '2'),
 ('9/25/2016 11:27', '2'),
 ('9/25/2016 10:51', '0'),
 ('9/25/2016 10:47', '6'),
 ('9/25/2016 9:04', '97'),
 ('9/25/2016 7:09', '4'),
 ('9/25/2016 3:00', '1'),
 ('9/24/2016 23:04', '0'),
 ('9/24/2016 22:02', '7'),
 ('9/24/2016 21:18', '2'),
 ('9/24/2016 20:58', '0'),
 ('9/24/2016 19:57', '1'),
 ('9/24/2016 19:02', '0'),
 ('9/24/2016 17:55', '0'),
 ('9/24/2016 17:27', '1'),
 ('9/24/2016 16:50', '0'),
 ('9/24/2016 16:03', '5'),
 ('9/24/2016 15:29', '66'),
 ('9/24/2016 14:03', '1'),
 ('9/24/2016 10:10', '11'),
 ('9/24/2016 8:46', '7'),
 ('9/24/2016 8:39', '1'),
 ('9/24/2016 8:38', '1'),
 ('9/24/2016 8:28', '1'),
 ...

Create two empty dictionaries called `counts_by_hour` and `comments_by_hour`, then loop through each row of `result_list`.

- Extract the hour from the date, which is the first element of the row.

- Use the `datetime.strptime()` method to parse the date and create a datetime object.

- Pass the string we want to parse as the first argument and a string that specifies the format as the second argument.

- Read the `hour` attribute of the datetime object to select just the hour.

If the hour isn't a key in `counts_by_hour`:

- Create the key in `counts_by_hour`, and set it equal to 1.
- Create the key in `comments_by_hour`, and set it equal to the number of comments.

If the hour is already a key in `counts_by_hour`:
- Increment the value in `counts_by_hour` by 1.
- Increment the value in `comments_by_hour` by the number of comments.

In [25]:
counts_by_hour   = {}
comments_by_hour = {}

for row in result_list:
    time_stamp = row[0]
    time_obj = dt.datetime.strptime(time_stamp, "%m/%d/%Y %H:%M")
    num_comments = int(row[1])
    
    hour = time_obj.hour
        
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        
print(counts_by_hour)
print("---")
print(comments_by_hour)

{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
---
{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}



`counts_by_hour`: contains the number of ask posts created during each hour of the day.

`comments_by_hour`: contains the total number of comments received by the ask posts created during each hour.
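The "create the key or increment it" pattern can also be sketched with `collections.defaultdict`, which supplies a starting value of 0 for unseen keys. The pairs below are hypothetical, and `counts` and `comments` are illustrative names rather than the notebook's own dictionaries.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical (created_at, num_comments) pairs in the dataset's format.
sample_pairs = [
    ("9/26/2016 2:53", "7"),
    ("9/26/2016 2:10", "3"),
    ("9/25/2016 22:57", "0"),
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by those posts

for stamp, n in sample_pairs:
    hour = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1         # missing keys start at 0
    comments[hour] += int(n)

print(dict(counts))
print(dict(comments))
```

This removes the explicit membership test while producing the same kind of hour-to-total mapping.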

## 6. Calculating the Average Number of Comments for Ask HN Posts by Hour

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

Example:

To illustrate the technique, let's work with the following dictionary:

    sample_dict = {
        'apple': 2,
        'banana': 4,
        'orange': 6
    }

Suppose we wanted to multiply each of the values by ten and return the results as a list of lists. We can use the following code:

    fruits = []

    for fruit in sample_dict:
        fruits.append([fruit, 10*sample_dict[fruit]])

Below are the results:

    [['apple', 20], ['banana', 40], ['orange', 60]]

Calculate the average number of comments per post, for posts created during each hour of the day.

In [26]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[[2, 11.137546468401487],
 [1, 7.407801418439717],
 [22, 8.804177545691905],
 [21, 8.687258687258687],
 [19, 7.163043478260869],
 [17, 9.449744463373083],
 [15, 28.676470588235293],
 [14, 9.692007797270955],
 [13, 16.31756756756757],
 [11, 8.96474358974359],
 [10, 10.684397163120567],
 [9, 6.653153153153153],
 [7, 7.013274336283186],
 [3, 7.948339483394834],
 [23, 6.696793002915452],
 [20, 8.749019607843136],
 [16, 7.713298791018998],
 [8, 9.190661478599221],
 [0, 7.5647840531561465],
 [18, 7.94299674267101],
 [12, 12.380116959064328],
 [4, 9.7119341563786],
 [6, 6.782051282051282],
 [5, 8.794258373205741]]
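As a quick check of the arithmetic, the sketch below recomputes the averages for three of the hours using the totals printed earlier. The dictionary names `counts_sample`, `comments_sample` and `avg_sample` are illustrative; only the numbers come from the output above.

```python
# Totals for hours 15, 13 and 12, copied from the output shown earlier.
counts_sample = {15: 646, 13: 444, 12: 342}
comments_sample = {15: 18525, 13: 7245, 12: 4234}

# Average comments per post = total comments / number of posts, per hour.
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 2))
```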

## 7. Sorting and Printing Values from a List of Lists

To finish, we'll sort the list of lists and print the five highest values in a format that's easier to read.

In [27]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[11.137546468401487, 2], [7.407801418439717, 1], [8.804177545691905, 22], [8.687258687258687, 21], [7.163043478260869, 19], [9.449744463373083, 17], [28.676470588235293, 15], [9.692007797270955, 14], [16.31756756756757, 13], [8.96474358974359, 11], [10.684397163120567, 10], [6.653153153153153, 9], [7.013274336283186, 7], [7.948339483394834, 3], [6.696793002915452, 23], [8.749019607843136, 20], [7.713298791018998, 16], [9.190661478599221, 8], [7.5647840531561465, 0], [7.94299674267101, 18], [12.380116959064328, 12], [9.7119341563786, 4], [6.782051282051282, 6], [8.794258373205741, 5]]


In [28]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[28.676470588235293, 15],
 [16.31756756756757, 13],
 [12.380116959064328, 12],
 [11.137546468401487, 2],
 [10.684397163120567, 10],
 [9.7119341563786, 4],
 [9.692007797270955, 14],
 [9.449744463373083, 17],
 [9.190661478599221, 8],
 [8.96474358974359, 11],
 [8.804177545691905, 22],
 [8.794258373205741, 5],
 [8.749019607843136, 20],
 [8.687258687258687, 21],
 [7.948339483394834, 3],
 [7.94299674267101, 18],
 [7.713298791018998, 16],
 [7.5647840531561465, 0],
 [7.407801418439717, 1],
 [7.163043478260869, 19],
 [7.013274336283186, 7],
 [6.782051282051282, 6],
 [6.696793002915452, 23],
 [6.653153153153153, 9]]
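Swapping the columns works because lists are compared element by element, so the average decides the order. A possible alternative, sketched here on a few `[hour, average]` pairs rounded from the results above, is to pass a `key` function to `sorted` and leave the column order unchanged (`sample_avg` and `ranked` are illustrative names):

```python
# [hour, average] pairs, with averages rounded from the results above.
sample_avg = [[2, 11.14], [15, 28.68], [13, 16.32], [9, 6.65]]

# Sort by the second element (the average) in descending order.
ranked = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

print(ranked)
```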

- Print the string "Top 5 Hours for Ask Posts Comments".
- Loop through each average and hour in the first five lists of `sorted_swap`.
- Use the `datetime.strptime()` method to parse the hour and create a datetime object.
- Use the `datetime.strftime()` method to format the datetime object into a string that looks like HH:00.
- Print the hour and the average number of comments, making sure to format your output.

In [29]:
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(str(hr), "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
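Since the hour is already an integer, the same output format can also be sketched with an f-string and a zero-padding format specifier, without going through `strptime` and `strftime`. The `top` list below holds three of the rounded results from above, and `lines` is an illustrative name.

```python
# (average, hour) pairs rounded from the results above.
top = [(28.68, 15), (16.32, 13), (11.14, 2)]

# {hr:02d} pads the hour to two digits; {avg:.2f} keeps two decimals.
lines = [f"{hr:02d}:00: {avg:.2f} average comments per post" for avg, hr in top]

for line in lines:
    print(line)
```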
