# Exploring Hacker News Posts

In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Exploring this dataset, we are specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question.

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting.

Our goal is to compare these two types of posts to determine the following:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Table of Contents

> #### 1. Data Source
> #### 2. Open and read dataset
> #### 3. Removing Headers from a List of Lists
> #### 4. Extracting Ask HN and Show HN Posts
> #### 5. Calculating the Average Number of Comments for Ask HN and Show HN Posts
> #### 6. Finding the Number of Ask Posts and Comments by Hour Created
> #### 7. Calculating the Average Number of Comments for Ask HN Posts by Hour
> #### 8. Sorting and Printing Values from a List of Lists
> #### 9. Conclusion

---------------------

### Data Source

You can find the dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

----------------

### Open and read dataset

In [14]:
from csv import reader
opened_file = open("datasets/hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)

In [16]:
# display first 5 rows of dataset
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
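
As a hedged alternative to the cell above, the file can be read inside a `with` block so that it is closed automatically once the rows are loaded. The sketch below is not the notebook's own code: `read_rows` is a hypothetical helper, the in-memory `sample_csv` only stands in for `datasets/hacker_news.csv`, and the `utf-8` encoding mentioned in the comment is an assumption about the file.

```python
import io
from csv import reader


def read_rows(file_obj):
    """Return every row of an open CSV file as a list of lists."""
    return list(reader(file_obj))


# Small sample standing in for datasets/hacker_news.csv (illustrative only).
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# With the real file this could look like (encoding is an assumption):
#     with open("datasets/hacker_news.csv", newline="", encoding="utf-8") as f:
#         hn = read_rows(f)
with io.StringIO(sample_csv) as f:
    sample_rows = read_rows(f)

print(sample_rows[0])  # header row
print(sample_rows[1])  # first data row
```

Every value comes back as a string, which is why later cells convert `num_comments` with `int()` before doing arithmetic.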


### Removing Headers from a List of Lists

In [18]:
# extract header of the dataset
headers = hn[0]

In [19]:
# display the header of the dataset
print(headers)
print("\nNumber of columns: ",len(headers))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Number of columns:  7


From the header, we can see that our dataset is made up of 7 columns:

|Column name|Description|
|:---|:---|
|id|The unique identifier from Hacker News for the post|
|title|The title of the post|
|url|The URL the post links to, if the post has a URL|
|num_points|Number of points the post acquired (total upvotes minus total downvotes)|
|num_comments|The number of comments on the post|
|author|The username of the person who submitted the post|
|created_at|The date and time of the post's submission|

-------------


In [20]:
# remove header from our dataset
hn = hn[1:]

In [22]:
# display first 5 rows of dataset
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting Ask HN and Show HN Posts

Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

In [46]:
# create lists containing posts with the different title categories

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    # convert the title to lower case so the prefix check is case-insensitive
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [47]:
# display first 3 rows of each of the lists
print("Ask hn posts:\n", ask_posts[:3])
print("\nShow hn posts:\n", show_posts[:3])
print("\nOther posts:\n", other_posts[:3])

Ask hn posts:
 [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]

Show hn posts:
 [['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]

Other posts:
 [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source an

In [48]:
# display the number of posts in each of the lists

print("Number of posts in ask_posts:", len(ask_posts))
print("\nNumber of posts in show_posts:", len(show_posts))
print("\nNumber of posts in other_posts:", len(other_posts))

Number of posts in ask_posts: 1744

Number of posts in show_posts: 1162

Number of posts in other_posts: 17194
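
The prefix check used above can also be wrapped in a small function, which makes the rule easy to try on individual titles. This is only a sketch: `classify_title` and `sample_titles` are hypothetical names, and the titles are taken from the outputs shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # make the comparison case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

for title in sample_titles:
    print(classify_title(title), "->", title)
```

Because the title is lower-cased first, variants such as `ASK HN:` or `show hn:` land in the same category as the canonical spelling.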


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now we are going to determine if `ask posts` or `show posts` receive `more comments on average`.

In [57]:
# we find the total number of comments in ask posts list

total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

# compute average number of comments on ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Average number of comments on ask posts: {:.4f}".format(avg_ask_comments))

Average number of comments on ask posts: 14.0384


In [58]:
# we find the total number of comments in show posts list

total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

# compute average number of comments on show posts
avg_show_comments = total_show_comments / len(show_posts)

print("Average number of comments on show posts: {:.4f}".format(avg_show_comments))

Average number of comments on show posts: 10.3167


Looking at the averages we calculated above, we see that ask posts receive more comments (an average of about `14 comments per post`) than show posts (an average of about `10 comments per post`).

This suggests that users tend to interact more on ask posts than on show posts.
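
Since the two cells above repeat the same steps, the calculation could be factored into one function. The following is a minimal sketch under stated assumptions: `average_comments` is a hypothetical helper, it expects a non-empty list of rows, and it assumes `num_comments` sits at index 4 as in this dataset. The sample rows are the three ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column for a non-empty list of rows."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# The first three ask posts shown in the output above.
sample_ask_posts = [
    ["12296411", "Ask HN: How to improve my personal website?", "", "2", "6",
     "ahmedbaracat", "8/16/2016 9:55"],
    ["10610020",
     "Ask HN: Am I the only one outraged by Twitter shutting down share counts?",
     "", "28", "29", "tkfx", "11/22/2015 13:43"],
    ["11610310", "Ask HN: Aby recent changes to CSS that broke mobile?", "",
     "1", "1", "polskibus", "5/2/2016 10:14"],
]

# (6 + 29 + 1) / 3
print("{:.4f}".format(average_comments(sample_ask_posts)))
```

With the full lists, the same function could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.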

-----------

### Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts receive more comments on average than show posts, we'll focus our remaining analysis on just this category of posts.

In this section, we will determine if ask posts created at a certain time are more likely to attract comments.
We will use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [97]:
import datetime as dt

# create a list of lists with the date each post was created and its number of comments
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    dates_and_num_comments = [created_at, num_comments]
    result_list.append(dates_and_num_comments)
    
print(result_list[:10])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17], ['9/26/2015 23:23', 1], ['4/22/2016 12:24', 4], ['11/16/2015 9:22', 1], ['2/24/2016 17:57', 1], ['6/4/2016 17:17', 2]]


In [129]:
# count the ask posts and total their comments for each hour of the day

counts_by_hour = {}
comments_by_hour = {}

for item in result_list:
    date = item[0]
    # parse date and time from string
    date_time = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    # extract time and then hours
    time_hrs_min = date_time.time()
    hour = time_hrs_min.hour
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = item[1]
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += item[1]


In [130]:
print(counts_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


In [131]:
print(comments_by_hour)

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
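
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch of an alternative, not the notebook's method; the `sample_*` names are hypothetical and the four rows are taken from the `result_list` output above.

```python
import datetime as dt
from collections import defaultdict

# A few [created_at, num_comments] pairs from the result_list output above.
sample_result_list = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["11/16/2015 9:22", 1],
]

sample_counts_by_hour = defaultdict(int)    # hour -> number of posts
sample_comments_by_hour = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample_result_list:
    # parse the timestamp and keep only the hour (0-23)
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    sample_counts_by_hour[hour] += 1
    sample_comments_by_hour[hour] += num_comments

print(dict(sample_counts_by_hour))
print(dict(sample_comments_by_hour))
```

Both dictionaries are keyed by the integer hour, so they line up key for key when the averages are computed later.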


From the analysis above, we observe that `3pm` is the hour in which the most ask posts were created `(116 posts)`. Posts created during this hour also received more comments in total `(4477 comments)` than posts created during any other hour. The next highest total belongs to `4pm` `(1814 comments)`, even though 4pm is not the second most popular hour for posting: slightly more posts were created at 6pm, 7pm and 9pm, yet they received fewer comments in total.
This suggests that posts created in the afternoon and evening attract most of the comments. However, these totals depend on how many posts were created in each hour, so to get a better understanding we will compute the average number of comments per post for posts created during each hour of the day.

---------------

### Calculating the Average Number of Comments for Ask HN Posts by Hour

Now we will use the two dictionaries we created above (`counts_by_hour` and `comments_by_hour`) to calculate the average number of comments per post for posts created during each hour of the day.

In [142]:
avg_by_hour = []

for hour in counts_by_hour:
    # both dictionaries share the same keys, so look the values up directly
    num_posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]

    # store the hour together with its average comments per post
    avg_by_hour.append([hour, num_comments / num_posts])

In [143]:
print(avg_by_hour)

[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
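
To make the arithmetic concrete, the sketch below recomputes the average for three hours using the counts and totals printed earlier (for example, 4477 comments over 116 posts at 3pm). The `sample_*` names are hypothetical, and a dictionary comprehension is used here only as one possible way to express the division.

```python
# Post counts and comment totals for three hours, copied from the outputs above.
sample_counts = {15: 116, 2: 58, 20: 80}
sample_comments = {15: 4477, 2: 1381, 20: 1722}

# average comments per post = total comments / number of posts, per hour
sample_avg = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_counts
}

for hour, avg in sample_avg.items():
    print("{:02d}:00 -> {:.4f}".format(hour, avg))
```

The 3pm figure stands out because its comment total is far larger than that of other hours with a similar number of posts.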


### Sorting and Printing Values from a List of Lists

The averages we calculated above give better insight into our analysis. However, this format makes it difficult to identify the hours with the highest values.

We will therefore sort the list of lists and print the six highest values in a format that's easier to read.

In [144]:
# Create a list that equals avg_by_hour with swapped columns.

swap_avg_by_hour = []
for item in avg_by_hour:
    # swap the values
    swap_comment_hour = [item[1], item[0]]
    swap_avg_by_hour.append(swap_comment_hour)
    
print(swap_avg_by_hour)

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]


In [149]:
# sort the swapped averages in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# display the top 6 hours for Ask HN post comments
print(sorted_swap[:6])

[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13]]


In [171]:
for item in sorted_swap[:6]:
    avg_comment = item[0]
    hour = dt.datetime.strptime(str(item[1]), "%H")
    hour_format = hour.strftime("%H:%M")
    template = "{}: {:.2f} average comments per post"
    print(template.format(hour_format, avg_comment))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
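
Swapping the columns is one way to sort by the average; another is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. This is a sketch of that alternative on a few values copied from `avg_by_hour`; `sample_avg_by_hour` and `top_hours` are hypothetical names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# sort by the second element (the average), highest first
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

One difference to keep in mind: if two hours had exactly the same average, the swapped-list approach would break the tie by hour, while `sorted` with a `key` keeps tied items in their original relative order.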


### Conclusion
This final analysis confirms the observation we made above: `3pm` is the most engaging hour, with the highest average number of comments per post `(close to 39 comments)`.

This implies that if you want your ask post to receive many comments, you should consider creating it at 3pm.

Although 2am has the second highest average, the other top hours fall in the `afternoon and evening` (1pm, 4pm, 8pm and 9pm), so there is a good chance that a post created during these hours will engage many users.