# Exploring Hacker News Posts

Hacker News is a site where user-submitted "posts" are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result.


We'll open and read the data set below to see what information it contains:

In [1]:
from csv import reader

# Read the whole file into a list of rows; the with block closes the file afterwards.
with open("hacker_news.csv") as opened:
    read_file = reader(opened)
    hn = list(read_file)

for row in hn[0:5]:
    print(row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




Below are descriptions of the columns:

* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted

In [2]:
# To analyze the data, first separate the header row from the data rows

header = hn[0]
data = hn[1:]
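
Once the header is separated, it can double as a lookup from column name to position, which avoids hard-coded indexes such as `row[4]`. The snippet below is a minimal sketch using two rows copied from the output above; `hn_sample` and `col_index` are hypothetical names that do not appear in the original analysis.

```python
# Two rows copied from the printed output above; the real analysis would use `hn`.
hn_sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
]

header = hn_sample[0]
data = hn_sample[1:]

# Map each column name to its position in a row (hypothetical helper).
col_index = {name: position for position, name in enumerate(header)}

first_row = data[0]
print(first_row[col_index['title']])
print(int(first_row[col_index['num_comments']]))
```

Every value read by `csv.reader` is a string, so numeric columns such as `num_comments` still need an explicit `int()` conversion.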


We're specifically interested in posts whose titles begin with either Ask HN or Show HN. We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll collect those titles into separate lists, with a third list for all remaining titles.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in data:
    title = row[1]
    lowered = title.lower()

    # elif keeps the three categories mutually exclusive, so an ask post
    # is not also counted as an "other" post.
    if lowered.startswith('ask hn'):
        ask_posts.append(title)
    elif lowered.startswith('show hn'):
        show_posts.append(title)
    else:
        other_posts.append(title)


print("Number of ask posts:", len(ask_posts))
print("Number of show posts:", len(show_posts))
print("Number of other posts:", len(other_posts))



Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194
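
The prefix test can also be wrapped in a small function that returns exactly one label per title, so no post can land in two categories. This is only a sketch: `classify_title` is a hypothetical helper, and the sample titles are invented for illustration rather than taken from the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Invented titles for illustration only.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend side project',
    'Interactive Dynamic Video',
]

# Count how many titles fall under each label.
counts = {}
for title in sample_titles:
    label = classify_title(title)
    counts[label] = counts.get(label, 0) + 1

print(counts)
```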


Let's determine if ask posts or show posts receive more comments on average.

In [4]:
#ASK POSTS
ask_comments = 0

for row in data:
    if row[1] in ask_posts:
        num_comments = int(row[4])
        ask_comments += num_comments
    
avg_ask = ask_comments / len(ask_posts)

print("Average number of comments on 'Ask posts':", avg_ask)


#SHOW POSTS
show_comments = 0

for row in data:
    if row[1] in show_posts:
        num_comments = int(row[4])
        show_comments += num_comments
    
avg_show = show_comments / len(show_posts)

print("Average number of comments on 'Show posts':", avg_show)
    
    

Average number of comments on 'Ask posts': 14.038417431192661
Average number of comments on 'Show posts': 10.31669535283993
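
As a rough alternative to checking every row against a list of titles, the total and the count can be accumulated in a single pass over the rows. The sketch below assumes the column layout shown earlier (title at index 1, `num_comments` at index 4); `average_comments` is a hypothetical helper and the sample rows are invented.

```python
def average_comments(rows, prefix):
    """Average num_comments (index 4) over rows whose title starts with prefix."""
    total = 0
    count = 0
    for row in rows:
        if row[1].lower().startswith(prefix):
            total += int(row[4])
            count += 1
    # Return 0.0 rather than dividing by zero when nothing matches.
    return total / count if count else 0.0


# Invented rows for illustration; they follow the data set's column order.
sample_rows = [
    ['1', 'Ask HN: Example question A', '', '5', '6', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Example question B', '', '2', '10', 'user_b', '1/1/2016 11:00'],
    ['3', 'Show HN: Example project', '', '7', '3', 'user_c', '1/1/2016 12:00'],
]

print(average_comments(sample_rows, 'ask hn'))
print(average_comments(sample_rows, 'show hn'))
```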


Since ask posts receive more comments on average (about 14.0 per post versus 10.3 for show posts), we'll focus our remaining analysis on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

In [5]:
import datetime as dt

#time format in created_at column: 8/4/2016 11:52

result_list = []

for row in data:
    if row[1] in ask_posts:
        time = row[6]
        num_comments = int(row[4])
        time_comments = []
        
        
        time_dt = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
        hour = time_dt.hour
        
        time_comments.append(hour)
        time_comments.append(num_comments)
        
        result_list.append(time_comments)
        
        
hour_freq = {}   #hour, freq
hour_comment = {}  #hour, total comments

for result in result_list:
    hour = result[0]
    comments = result[1]
    if hour in hour_freq:
        hour_freq[hour] +=1
        hour_comment[hour] += comments
    else:
        hour_freq[hour] = 1
        hour_comment[hour] = comments
        
print("Hour frequency table:")
for key, value in hour_freq.items():
    print(key, ":", value)

print("\n")

print("Comment frequency by hour table:")
for key, value in hour_comment.items():
    print(key, ":", value)
    

Hour frequency table:
0 : 55
1 : 60
2 : 58
3 : 54
4 : 47
5 : 46
6 : 44
7 : 34
8 : 48
9 : 45
10 : 59
11 : 58
12 : 73
13 : 85
14 : 107
15 : 116
16 : 108
17 : 100
18 : 109
19 : 110
20 : 80
21 : 109
22 : 71
23 : 68


Comment frequency by hour table:
0 : 447
1 : 683
2 : 1381
3 : 421
4 : 337
5 : 464
6 : 397
7 : 267
8 : 492
9 : 251
10 : 793
11 : 641
12 : 687
13 : 1253
14 : 1416
15 : 4477
16 : 1814
17 : 1146
18 : 1439
19 : 1188
20 : 1722
21 : 1745
22 : 479
23 : 543
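
The hour extraction in the cell above relies on `datetime.strptime`, which turns a string into a `datetime` object according to a format pattern. The following short sketch parses the sample timestamp from the code comment (`8/4/2016 11:52`) in isolation; the variable names are illustrative.

```python
import datetime as dt

created_at = "8/4/2016 11:52"

# %m = month, %d = day, %Y = four-digit year, %H = hour (24-hour clock), %M = minute
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)               # hour of the day as an integer
print(parsed.strftime("%H:00"))  # hour formatted as a zero-padded label
```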


Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [16]:
avg_comment_hour = []

for hour in hour_freq:
    tot_comment = hour_comment[hour]
    avg = tot_comment / hour_freq[hour]
    
    temp = []
    temp.append(hour)
    temp.append(avg)
    
    avg_comment_hour.append(temp)
    
print("Average comments by hour:")    
for item in avg_comment_hour:
    print(item)
    
   

Average comments by hour:
[0, 8.127272727272727]
[1, 11.383333333333333]
[2, 23.810344827586206]
[3, 7.796296296296297]
[4, 7.170212765957447]
[5, 10.08695652173913]
[6, 9.022727272727273]
[7, 7.852941176470588]
[8, 10.25]
[9, 5.5777777777777775]
[10, 13.440677966101696]
[11, 11.051724137931034]
[12, 9.41095890410959]
[13, 14.741176470588234]
[14, 13.233644859813085]
[15, 38.5948275862069]
[16, 16.796296296296298]
[17, 11.46]
[18, 13.20183486238532]
[19, 10.8]
[20, 21.525]
[21, 16.009174311926607]
[22, 6.746478873239437]
[23, 7.985294117647059]
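
The same per-hour average can be sketched as a dictionary comprehension, which keeps the hour as the key instead of building a list of pairs. The values below are three entries copied from the two tables above; `avg_by_hour` is a hypothetical name.

```python
# Three entries copied from the hour frequency and comment total tables above.
hour_freq = {15: 116, 2: 58, 20: 80}
hour_comment = {15: 4477, 2: 1381, 20: 1722}

# Average comments per post for each hour: total comments / number of posts.
avg_by_hour = {hour: hour_comment[hour] / hour_freq[hour] for hour in hour_freq}

for hour, avg in avg_by_hour.items():
    print(hour, round(avg, 2))
```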


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [33]:
swap_avg_by_hour = []
for item in avg_comment_hour:
    temp = []
    temp.append(item[1])
    temp.append(item[0])
    swap_avg_by_hour.append(temp)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments:")

top_5 = sorted_swap[0:5]

    
for item in top_5:
    template = "{hour}:00: {num:.2f} average comments per post."
    output = template.format(hour = item[1], num = item[0])
    print(output)

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post.
2:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
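
Swapping the columns is one way to sort by the average. A possible alternative, sketched below, is to pass a `key` function to `sorted` so the `[hour, average]` pairs can stay in their original order. The input is a subset of the averages printed earlier, deliberately listed out of order.

```python
# A subset of the [hour, average] pairs printed above, in unsorted order.
avg_comment_hour = [
    [13, 14.741176470588234],
    [16, 16.796296296296298],
    [2, 23.810344827586206],
    [21, 16.009174311926607],
    [15, 38.5948275862069],
    [20, 21.525],
]

# Sort by the second element (the average), highest first, and keep five.
top_5 = sorted(avg_comment_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_5:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```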


### Conclusion

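To improve the chance of receiving comments, create an Ask HN post during one of the hours listed above. Posts created at 15:00 averaged 38.59 comments, well ahead of every other hour. Each hourly average is based on between 34 and 116 posts, so treat the ranking as indicative rather than definitive. The hours follow the time zone of the `created_at` column, so convert them to your own time zone before applying them.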
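
Shifting the listed hours to another time zone can be sketched with fixed UTC offsets. Both offsets below are assumptions: the source does not state which time zone `created_at` uses, so `SOURCE_TZ` (UTC-5) and `LOCAL_TZ` (UTC+1) are placeholders to replace with the real values. Fixed offsets also ignore daylight saving time.

```python
import datetime as dt

# ASSUMPTION: the data set's time zone is not stated in this analysis.
# UTC-5 is a placeholder; replace it with the data set's actual offset.
SOURCE_TZ = dt.timezone(dt.timedelta(hours=-5))
# ASSUMPTION: a hypothetical reader located at UTC+1.
LOCAL_TZ = dt.timezone(dt.timedelta(hours=1))


def to_local_hour(hour, source_tz=SOURCE_TZ, local_tz=LOCAL_TZ):
    """Convert an hour of the day from source_tz to local_tz."""
    # The date is arbitrary because fixed offsets do not vary by date.
    moment = dt.datetime(2016, 1, 1, hour, tzinfo=source_tz)
    return moment.astimezone(local_tz).hour


for hour in [15, 2, 20, 16, 21]:
    print("{}:00 source time is {}:00 local time".format(hour, to_local_hour(hour)))
```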
