# Exploring Hacker News Posts
In this project, we will be exploring the Ask posts (questions) and Show posts (when users submit a project, product or something interesting) on Hacker News.

We will compare the two to answer the following questions:
1. Do `Ask HN` posts receive more comments than `Show HN` posts?
2. Do posts created at a certain time receive more comments on average?

We will be using a reduced version of this [data set](https://www.kaggle.com/hacker-news/hacker-news-posts). The columns are described below:

`id`: the unique identifier from Hacker News for the post

`title`: the title of the post

`url`: the URL that the post links to, if the post has a URL

`num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

`num_comments`: the number of comments on the post

`author`: the username of the person who submitted the post

`created_at`: the date and time of the post's submission

# Part 1: Analyzing if Ask posts receive more comments than Show posts

In [12]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

print(hn[:4]) # Print the header row and the first 3 data rows

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


We will remove the headers to analyze the remaining data.

In [13]:
headers = hn[0]
hn = hn[1:]

print(headers) # Display header
print("\n")
print(hn[:4]) # Display remaining rows

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Next, we will separate the posts into three lists for our analysis: `ask_posts`, `show_posts` and `other_posts`. Because titles may use different letter cases, we convert each title to lowercase before checking its prefix.

## Count for number of posts and comments

In [14]:
ask_posts = []
show_posts = []    
other_posts = []

for row in hn:
    title = row[1]
    title_lower = title.lower() #Change to lower case
    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask posts:',len(ask_posts))
print('\n')
print('Show Posts:',len(show_posts))
print('\n')
print('Other Posts:',len(other_posts))
print('\n')
print('Total:',len(hn)) #To verify total number

Ask posts: 1744


Show Posts: 1162


Other Posts: 17194


Total: 20100


We can see that there are more `Ask HN` posts than `Show HN` posts.

Next, we want to look at the comments for the posts on Hacker News.

In [15]:
total_ask_comments = 0

# Sum the comments across all Ask posts
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

# Compute the average once, after the loop has finished
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

14.038417431192661


In [16]:
total_show_comments = 0

# Sum the comments across all Show posts
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

# Compute the average once, after the loop has finished
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

10.31669535283993


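We can conclude from the above that Ask posts receive more comments on average: about 14 comments per post, compared with about 10 for Show posts, a difference of roughly 4 comments per post.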
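The two averaging cells above repeat the same logic, so the calculation could be factored into a single helper. The following is a minimal sketch, assuming rows shaped like those in `hn` with the comment count at index 4. The function name `average_comments` and the `comments_index` parameter are illustrative and not part of the original notebook; the sample rows are copied from the data preview earlier and are not Ask or Show posts.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows copied from the data preview shown earlier (illustration only).
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'],
]

# (52 + 1 + 1) / 3
print(average_comments(sample_posts))  # 18.0
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.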

# Part 2: Do posts created at a certain time receive more comments on average?
Next, we will analyze if posts created at a certain hour are more likely to attract comments. We will do the following:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
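Both steps depend on extracting the hour from the `created_at` string. As a small sketch of that single operation, the snippet below parses two timestamps taken from the data preview earlier, using the same format string as the notebook. The variable names are illustrative.

```python
import datetime as dt

# Same format string the notebook uses: month/day/year hour:minute.
date_format = "%m/%d/%Y %H:%M"

# Timestamps copied from rows in the data preview shown earlier.
afternoon = dt.datetime.strptime("8/4/2016 11:52", date_format)
midnight = dt.datetime.strptime("6/17/2016 0:01", date_format)

# strftime("%H") returns a zero-padded string, which is why the
# dictionary keys in the notebook look like '09' or '00'.
print(afternoon.strftime("%H"))  # 11
print(midnight.strftime("%H"))   # 00

# The .hour attribute gives the same information as an integer.
print(midnight.hour)  # 0
```

Using the zero-padded string as the dictionary key keeps the hours sortable as text; the integer from `.hour` would work equally well as a key.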

In [17]:
import datetime as dt
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H") # Extract the hour as a zero-padded string
    if hour in counts_by_hour:
        counts_by_hour[hour] +=1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment

print(counts_by_hour)
print("\n")
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Now that we have the number of posts and the total number of comments for each hour, we can calculate the average number of comments per post by hour.

In [18]:
avg_by_hour = []

for key in comments_by_hour:
    avg_by_hour.append([ key, (comments_by_hour[key]/counts_by_hour[key])])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


To make it easier for us to identify the highest number of comments per post, we will swap the order of the columns as well as sort in descending order.

In [19]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour,reverse =True)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
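The swap step above exists only so that `sorted` compares the averages first. As an alternative sketch, not the approach used in this notebook, the `key` argument of `sorted` can sort the original `[hour, average]` pairs directly. The three pairs below are copied from the `avg_by_hour` output; the names `avg_by_hour_sample` and `sorted_by_avg` are illustrative.

```python
# Three [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first, without swapping columns.
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00 -> {:.2f}".format(hour, avg))
# 15:00 -> 38.59
# 02:00 -> 23.81
# 09:00 -> 5.58
```

Either approach yields the same ordering; the `key` version avoids building a second list.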


In [20]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


From this, we can see that Ask posts created during the 15:00 (3 PM) hour receive the highest average number of comments, at about 38.59 per post.

# Conclusion
In Part 1, we compared the number of comments on Ask and Show posts on Hacker News. In Part 2, we identified the hour at which Ask posts receive the most comments on average.

From the above analysis, we can see that Ask posts on Hacker News receive more comments on average than Show posts. In addition, Ask posts created during the 15:00 (3 PM) hour receive the most comments on average.

Moving forward, we can conduct the same analysis for the Other posts from Part 1 to determine whether any subtypes of posts not covered in this project also generate more comments. We can also repeat the Part 2 analysis by day or month instead of by hour.
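As a starting point for the by-day follow-up, the sketch below groups comment counts by day of the week instead of by hour. It is an illustration under stated assumptions rather than part of the analysis: the `(created_at, num_comments)` pairs are copied from the four preview rows shown earlier (which are not Ask posts), and the names `DAY_NAMES`, `sample_rows` and `avg_by_day` are hypothetical.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"  # same format string used in Part 2

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# (created_at, num_comments) pairs copied from the preview rows above.
sample_rows = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 22:20", 1),
    ("6/17/2016 0:01", 1),
]

counts_by_day = {}
comments_by_day = {}

for created_at, num_comments in sample_rows:
    parsed = dt.datetime.strptime(created_at, date_format)
    day = DAY_NAMES[parsed.weekday()]
    counts_by_day[day] = counts_by_day.get(day, 0) + 1
    comments_by_day[day] = comments_by_day.get(day, 0) + num_comments

# Average comments per post for each day that appears in the sample.
avg_by_day = {day: comments_by_day[day] / counts_by_day[day]
              for day in counts_by_day}

print(avg_by_day)
# {'Thursday': 26.5, 'Tuesday': 10.0, 'Friday': 1.0}
```

Replacing `sample_rows` with pairs built from `ask_posts` would apply the same grouping to the full set of Ask posts.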