# Hacker News Site - When and What to post to get the maximum comments

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

**Introduction**
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

*Below are descriptions of the columns:*
* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted

## Import and prepare data

In [6]:
from csv import reader

# read the whole CSV file into a list of rows; the with-block closes the file
with open("hacker_news.csv") as file:
    hn = list(reader(file))

In [7]:
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [8]:
header = hn[0]
print(header)
print('\n') # prints two blank lines between the header and the rows
hn = hn[1:]
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Create lists of filtered data

In [9]:
# split the posts into three lists: 'Ask HN', 'Show HN', and everything else
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = str(title)
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(hn))
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

20100
1744
1162
17194
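
The filtering rule used in the cell above can also be written as a small function, which makes it easy to check against individual titles. This is only a sketch: `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# made-up titles, used only to exercise the function
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]

for title in sample_titles:
    print(classify_title(title), '-', title)
```

Lower-casing the title first means the prefix check is case-insensitive, matching the behavior of the loop above.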


In [10]:
# total number of comments: ask posts
total_ask_comments = 0

for e in ask_posts:
    noc = int(e[4])
    total_ask_comments += noc

print(total_ask_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)
print('\n')
print(avg_ask_comments)

24483


14.038417431192661


In [11]:
# total number of comments: show posts
total_show_comments = 0

for e in show_posts:
    noc = int(e[4])
    total_show_comments += noc

print(total_show_comments)

avg_show_comments = total_show_comments / len(show_posts)
print('\n')
print(avg_show_comments)

11988


10.31669535283993


Ask posts receive more comments on average than show posts (about 14.04 versus 10.32 comments per post), so we'll focus the remaining analysis on ask posts.
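
The two averaging cells above repeat the same loop, so the calculation could be wrapped in one helper. The sketch below assumes the same row layout as `hn` (comment count at index 4). `average_comments` is a hypothetical name, and the three sample rows are copied from the output shown earlier rather than from the full data set.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# three rows taken from the sample output above, not the full data set
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'],
]

print(average_comments(sample_rows))
```

With the notebook's lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.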

## Find the best time to 'Ask HN'

In [16]:
import datetime as dt

# create a list with time and no. of comments
result_list = []

for e in ask_posts:
    col = e[6]
    noc = int(e[4])
    result_list.append([col, noc])

print(result_list[0:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


In [26]:
# create dictionaries of a) no. of posts/hour b) total no. of comments/hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    time = date.hour
    comment = int(row[1])
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
        
print(counts_by_hour)
print('\n')
print(comments_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


In [29]:
# create a list of avg no. of comments/hour
avg_by_hour = []

for e in counts_by_hour:
    post = counts_by_hour[e]
    avg_by_hour.append([e,comments_by_hour[e]/post])

print(avg_by_hour)

[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
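
The counting and averaging steps above can also be combined in one function. The sketch below is one possible variant, not the notebook's method: it uses `collections.defaultdict` instead of the explicit `if time not in ...` check, `avg_comments_by_hour` is a hypothetical name, and the input is only the five pairs printed from `result_list` earlier.

```python
import datetime as dt
from collections import defaultdict

# first five [created_at, num_comments] pairs from the result_list output
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

def avg_comments_by_hour(pairs):
    """Map each hour of the day to the average comments per post."""
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # total comments on those posts
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

print(avg_comments_by_hour(sample))
```

Each of the five sample posts falls in a different hour, so every average here equals that post's own comment count.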


In [45]:
# we want to see which hour has the highest avg no of comments

# step1: swap elements in avg_by_hour
swap_avg_by_hour = []

for e in avg_by_hour:
    swap_avg_by_hour.append([e[1],e[0]])
    
print(swap_avg_by_hour)

# step2: sorting the result
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# top5
print('\n')
print("Top 5 Hours for Ask Posts Comments")
for e in sorted_swap[0:5]:
    hour = dt.datetime.strptime(str(e[1]), "%H")
    time = hour.strftime("%H:%M")
    no = e[0]
    print("{hour}: {no:.2f} average comments per post".format(hour=time, no=no))

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]


Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
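
An alternative to swapping the columns before sorting is to pass a `key` function to `sorted`, which orders the `[hour, average]` pairs by their second element directly. The sketch below assumes the `avg_by_hour` values printed earlier, copied in so the block runs on its own. `top_five` is a hypothetical name.

```python
# [hour, average comments] pairs copied from the avg_by_hour output above
avg_by_hour = [
    [9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696],
    [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059],
    [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069],
    [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206],
    [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913],
    [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437],
    [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727],
    [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034],
]

# sort by the average (second element), largest first, and keep five entries
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

print("Top 5 Hours for Ask Posts Comments")
for hour, avg in top_five:
    # zero-pad the hour so 2 prints as 02:00
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
```

Sorting with a key keeps each pair in its original `[hour, average]` order, so no intermediate swapped list is needed.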
