# Looking for specific posts in the "Hacker News" dataset

We are going to search for posts whose titles begin with "Ask HN" or "Show HN".


- "Ask HN" posts are those where users are asking the HN community specific questions
- "Show HN" posts show a project or a product or generally something interesting

We will try to answer two questions:
1. Which type of post receives more comments on average?
2. Does the time a post is created affect the average number of comments it receives?




In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards
with open("hacker_news.csv") as data_file:
    hn = list(reader(data_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
for i in range(5):
    print(hn[i])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


The data will be split into three lists of lists:
1. ```ask_posts```
2. ```show_posts```
3. ```other_posts```


In [3]:
ask_posts = []
show_posts = []
other_posts = []

for entry in hn:
    title = entry[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(entry)
    elif title.startswith('show hn'):
        show_posts.append(entry)
    else:
        other_posts.append(entry)

print('Total entries {}'.format(len(hn)))
print('Ask post entries {}'.format(len(ask_posts)))
print('Show post entries {}'.format(len(show_posts)))
print('Other post entries {}'.format(len(other_posts)))
print('The sum of three entries {}'.format(len(other_posts) + len(show_posts) + len(ask_posts)))


Total entries 20100
Ask post entries 1744
Show post entries 1162
Other post entries 17194
The sum of three entries 20100
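
The filtering rule above can be tried on a few titles in isolation. The sketch below is only an illustration: `classify` is a hypothetical helper and the sample titles are made up, not rows from the dataset. It shows why the title is lower-cased before the prefix check.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, not taken from hacker_news.csv
sample_titles = [
    'Ask HN: Best editor?',
    'ASK HN: best editor?',
    'Show HN: My project',
    'I want to ask HN something',  # "ask hn" is not at the start
]
labels = [classify(title) for title in sample_titles]
print(labels)
```

Only titles that start with the prefix are matched, so a title that merely mentions "ask HN" later on ends up in the "other" group.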


In [4]:
print(headers)
for row in ask_posts[:5] + show_posts[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a progra

The function below will return the average number of comments for a list of posts.

The column index for the number of comments is 4.

In [6]:
def get_avg_comment(posts):
    # the column index for the num comments is 4
    total_comments = 0
    for row in posts:
        num_comments = int(row[4])
        total_comments += num_comments
    avg_comments = total_comments / len(posts)
    return avg_comments
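
As an alternative to hard-coding the index 4, the column position could be looked up from the header row. This is a minimal sketch, not the function used in this notebook: `get_avg_column` is a hypothetical helper, and returning 0.0 for an empty list is an assumption made here to avoid a division by zero.

```python
# Header row as printed earlier in this notebook
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

def get_avg_column(posts, column_name, headers=headers):
    """Average an integer column that is looked up by name (hypothetical helper)."""
    if not posts:
        return 0.0  # assumption: an empty list averages to 0
    index = headers.index(column_name)  # 4 for 'num_comments'
    return sum(int(row[index]) for row in posts) / len(posts)

# Two rows copied from the Ask HN output above
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
avg_sample_comments = get_avg_column(sample_posts, 'num_comments')
print(avg_sample_comments)  # (6 + 29) / 2 = 17.5
```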

In [8]:
print('Ask HN posts have an average of {} comments'.format(get_avg_comment(ask_posts)))
print('Show HN posts have an average of {} comments'.format(get_avg_comment(show_posts)))

Ask HN posts have an average of 14.038417431192661 comments
Show HN posts have an average of 10.31669535283993 comments


On average, Ask HN posts receive more comments than Show HN posts: about 14.0 versus 10.3 per post.

The next step is to check whether the post creation time has any influence on the number of comments received.

In [14]:
import datetime as dt

def post_per_hour(posts_list):
    posts_comments_per_hour = {}
    date_format = "%m/%d/%Y %H:%M"
    for row in posts_list:
        hour = dt.datetime.strptime(row[6], date_format).strftime("%H")
        num_comments = int(row[4])
        if hour in posts_comments_per_hour:
            posts_comments_per_hour[hour][0] += 1
            posts_comments_per_hour[hour][1] += num_comments
        else:
            posts_comments_per_hour[hour] = [1, num_comments]
            
    return posts_comments_per_hour    
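
To see what the dictionary keys look like, here is a small sketch of the date parsing step on its own. `hour_key` is a hypothetical helper that repeats the `strptime` / `strftime` calls from the function above; the two timestamps are copied from rows printed earlier.

```python
import datetime as dt

def hour_key(created_at, date_format="%m/%d/%Y %H:%M"):
    """Parse a 'created_at' string and return the hour as a two-digit string."""
    return dt.datetime.strptime(created_at, date_format).strftime("%H")

# Timestamps copied from the rows printed earlier in this notebook
for created_at in ["8/4/2016 11:52", "6/17/2016 0:01"]:
    print(created_at, '->', hour_key(created_at))
```

The hour is always zero-padded, which is why midnight shows up under the key `'00'` in the dictionary.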

In [16]:
ask_posts_comments_per_hour = post_per_hour(ask_posts)
print(ask_posts_comments_per_hour)

{'05': [46, 464], '00': [55, 447], '23': [68, 543], '04': [47, 337], '11': [58, 641], '01': [60, 683], '20': [80, 1722], '13': [85, 1253], '12': [73, 687], '03': [54, 421], '07': [34, 267], '06': [44, 397], '21': [109, 1745], '15': [116, 4477], '18': [109, 1439], '22': [71, 479], '17': [100, 1146], '09': [45, 251], '08': [48, 492], '14': [107, 1416], '19': [110, 1188], '02': [58, 1381], '10': [59, 793], '16': [108, 1814]}


In [41]:
avg_per_hour = [[hour, pc[1]/pc[0]] for hour, pc in ask_posts_comments_per_hour.items()]
avg_per_hour_sorted = sorted(avg_per_hour, key=lambda kv: kv[1], reverse=True)
for item in avg_per_hour_sorted[:5]:
    print('At {}:00 hours there were an average of {:.2f} comments / post'.format(item[0], item[1]))

At 15:00 hours there were an average of 38.59 comments / post
At 02:00 hours there were an average of 23.81 comments / post
At 20:00 hours there were an average of 21.52 comments / post
At 16:00 hours there were an average of 16.80 comments / post
At 21:00 hours there were an average of 16.01 comments / post


On average, Ask HN posts created between 15:00 and 15:59 receive the most comments.

Let's check the Show HN posts too, to see whether they follow the same pattern.
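
As a quick check of the arithmetic, the top averages can be recomputed by hand from the `[post count, comment total]` pairs printed earlier. The three pairs below are copied from that output; the names `counts` and `averages` are only used in this sketch.

```python
# [post count, comment total] pairs copied from the dictionary printed above
counts = {'15': [116, 4477], '02': [58, 1381], '20': [80, 1722]}

# Average comments per post = comment total / post count
averages = {hour: total / n_posts for hour, (n_posts, total) in counts.items()}

for hour, avg in averages.items():
    n_posts, total = counts[hour]
    print('{}:00 -> {} comments / {} posts = {:.2f}'.format(hour, total, n_posts, avg))
```

Note that the 02:00 average is based on 58 posts, half as many as the 116 posts in the 15:00 bucket.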

In [44]:
show_posts_comments_per_hour = post_per_hour(show_posts)
avg_per_hour = [[hour, pc[1]/pc[0]] for hour, pc in show_posts_comments_per_hour.items()]
avg_per_hour_sorted = sorted(avg_per_hour, key=lambda kv: kv[1], reverse=True)
for item in avg_per_hour_sorted[:5]:
    print('At {}:00 hours there were an average of {:.2f} comments / post'.format(item[0], item[1]))

At 18:00 hours there were an average of 15.77 comments / post
At 00:00 hours there were an average of 15.71 comments / post
At 14:00 hours there were an average of 13.44 comments / post
At 23:00 hours there were an average of 12.42 comments / post
At 22:00 hours there were an average of 12.39 comments / post


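Looks like there is no similar pattern: none of the top five hours for Show HN posts match those for Ask HN posts, and the averages are spread more evenly (15.77 down to 12.39, against 38.59 down to 16.01).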
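
One rough way to compare the two rankings is to count the hours they share. The lists below are copied from the two outputs above; this is only an overlap count on the top five hours, not a statistical test.

```python
# Top five hours by average comments, copied from the outputs above
ask_top_comment_hours = ['15', '02', '20', '16', '21']
show_top_comment_hours = ['18', '00', '14', '23', '22']

# Hours that appear in both top-five lists
shared_hours = sorted(set(ask_top_comment_hours) & set(show_top_comment_hours))
print(shared_hours)  # [] : no hour is in both lists
```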

Let's check the number of points on average.

In [34]:
def get_avg_points(posts):
    # the column index for the num points is 3
    total_points = 0
    for row in posts:
        num_points = int(row[3])
        total_points += num_points
    return total_points / len(posts)


In [35]:
ask_avg_points = get_avg_points(ask_posts)
show_avg_points = get_avg_points(show_posts)
print('Ask HN posts have {:.2f} points on average'.format(ask_avg_points))
print('Show HN posts have {:.2f} points on average'.format(show_avg_points))

Ask HN posts have 15.06 points on average
Show HN posts have 27.56 points on average


As with the comments, we will check the hour at which a post is likely to get more points.

In [37]:
def points_per_hour(posts_list):
    posts_points_per_hour = {}
    date_format = "%m/%d/%Y %H:%M"
    for row in posts_list:
        hour = dt.datetime.strptime(row[6], date_format).strftime("%H")
        num_points = int(row[3])
        if hour in posts_points_per_hour:
            posts_points_per_hour[hour][0] += 1
            posts_points_per_hour[hour][1] += num_points
        else:
            posts_points_per_hour[hour] = [1, num_points]
            
    return posts_points_per_hour 
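
This function differs from `post_per_hour` only in the column it sums. A possible generalization is sketched below; `totals_per_hour` and its `column_index` parameter are hypothetical names, not part of the notebook, and the sample rows are copied from the output printed earlier.

```python
import datetime as dt

def totals_per_hour(posts_list, column_index, date_format="%m/%d/%Y %H:%M"):
    """Map hour -> [post count, column total] for any integer column."""
    per_hour = {}
    for row in posts_list:
        hour = dt.datetime.strptime(row[6], date_format).strftime("%H")
        bucket = per_hour.setdefault(hour, [0, 0])  # create the bucket on first use
        bucket[0] += 1
        bucket[1] += int(row[column_index])
    return per_hour

# Rows copied from the Ask HN / Show HN output earlier in this notebook
sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
]
comments_by_hour = totals_per_hour(sample_rows, 4)  # column 4: num_comments
points_by_hour = totals_per_hour(sample_rows, 3)    # column 3: num_points
print(comments_by_hour)
print(points_by_hour)
```

With a helper like this, the comments and points analyses would share one code path and only the column index would change.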

In [45]:
ask_points_per_hour = points_per_hour(ask_posts)
show_points_per_hour = points_per_hour(show_posts)

avg_per_hour = [[hour, pc[1]/pc[0]] for hour, pc in ask_points_per_hour.items()]
avg_per_hour_sorted = sorted(avg_per_hour, key=lambda kv: kv[1], reverse=True)
for item in avg_per_hour_sorted[:5]:
    print('At {}:00 hours there were an average of {:.2f} points / ask post'.format(item[0], item[1]))
print('----------------')
avg_per_hour = [[hour, pc[1]/pc[0]] for hour, pc in show_points_per_hour.items()]
avg_per_hour_sorted = sorted(avg_per_hour, key=lambda kv: kv[1], reverse=True)
for item in avg_per_hour_sorted[:5]:
    print('At {}:00 hours there were an average of {:.2f} points / show post'.format(item[0], item[1]))

At 15:00 hours there were an average of 29.99 points / ask post
At 13:00 hours there were an average of 24.26 points / ask post
At 16:00 hours there were an average of 23.35 points / ask post
At 17:00 hours there were an average of 19.41 points / ask post
At 10:00 hours there were an average of 18.68 points / ask post
----------------
At 23:00 hours there were an average of 42.39 points / show post
At 12:00 hours there were an average of 41.69 points / show post
At 22:00 hours there were an average of 40.35 points / show post
At 00:00 hours there were an average of 37.84 points / show post
At 18:00 hours there were an average of 36.31 points / show post


# Conclusion

To get the most comments, it seems the post should be an "Ask HN" post rather than a "Show HN" post, and it should be created between 15:00 and 15:59.

On average, Show HN posts got more points than Ask HN posts, almost double (27.56 versus 15.06), with the highest average going to posts created from 23:00 to 23:59.

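Looks like the best hour for comments and the best hour for points do not coincide for Show HN posts (18:00 versus 23:00), but for Ask HN posts they do (15:00 in both cases). Four of the top five Show HN hours still appear in both rankings, so this comparison of leading hours is not a measure of correlation.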
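
To put a rough number on how the comment and point rankings relate, we can count the hours shared by the top-five lists printed above. The lists are copied from those outputs; `top_hours` and `overlap` are names used only in this sketch, and an overlap count is not a correlation coefficient.

```python
# Top five hours per post type and metric, copied from the outputs above
top_hours = {
    'ask': {
        'comments': ['15', '02', '20', '16', '21'],
        'points': ['15', '13', '16', '17', '10'],
    },
    'show': {
        'comments': ['18', '00', '14', '23', '22'],
        'points': ['23', '12', '22', '00', '18'],
    },
}

# Hours that appear in both the comments and the points ranking
overlap = {
    kind: sorted(set(ranks['comments']) & set(ranks['points']))
    for kind, ranks in top_hours.items()
}
print(overlap)
```

By this measure Ask HN shares two hours (15 and 16) and Show HN shares four (00, 18, 22 and 23), while the leading hour matches only for Ask HN.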