# Average Comments for Posts on Hacker News

This notebook counts the comments on Hacker News posts, calculates the average number of comments per post, and breaks those averages down by the hour the posts were created.

In [1]:
import csv
import datetime as dt

In [2]:
# Read the whole CSV into a list of rows; the 'with' block closes the file afterwards.
with open("hacker_news.csv") as opened_file:
    tbl_file = list(csv.reader(opened_file))

In [3]:
def explore_data(dataset, start = 1, end = 11):
    for i, s in enumerate(dataset[0]):
        print('Header', i, '-', s)
    print('\nDataset contains ', len(dataset)-1, ' rows and ',
        len(dataset[0]), ' columns.\n')
    for row in dataset[start:end]:
        print(row)
        print('\n') # adds a new (empty) line after each row

In [4]:
def print_rows(rows_list, start = 0, end = None):
    # end = None slices through the last row (end = -1 would drop it)
    for row in rows_list[start:end]:
        print(row, '\n')

In [5]:
def sort_data(dataset, sort_field, reverse = False):
    # Move the sort field to the front of each row so that sorted()
    # compares rows by that field first. The returned rows keep this
    # reordered layout: the sort field is at index 0.
    reordered_list = []
    for row in dataset:
        reordered_row = [row[sort_field]]
        reordered_row.extend(row[:sort_field])
        reordered_row.extend(row[sort_field+1:])
        reordered_list.append(reordered_row)

    return sorted(reordered_list, reverse = reverse)

In [6]:
explore_data(tbl_file, 1, 6)

Header 0 - id
Header 1 - title
Header 2 - url
Header 3 - num_points
Header 4 - num_comments
Header 5 - author
Header 6 - created_at

Dataset contains  20100  rows and  7  columns.

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.ny

The following creates a dictionary that maps each header name to its column index. This lets me access a field by its column name instead of having to look up the index number of the element.

In [14]:
headers = tbl_file[0]
hn = tbl_file[1:]

hd = {}
i = 0
for s in headers:
    hd[s] = i
    i += 1
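
As a side note, the same name-to-index mapping can be built without a manual counter. The following is only a sketch of an alternative, using the header row printed earlier in this notebook; it is not the cell that was actually run.

```python
# Header row as printed earlier in this notebook.
headers = ['id', 'title', 'url', 'num_points', 'num_comments',
           'author', 'created_at']

# enumerate() yields (index, name) pairs, so no separate counter is needed.
hd = {name: index for index, name in enumerate(headers)}

# Look up column positions by name.
print(hd['title'], hd['num_comments'])
```

With this mapping, `row[hd['num_comments']]` reads the comment count of a row in the same way as the cell above.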

In [15]:
print(headers, '\n\n', hd, '\n\n')
print_rows(hn, 0, 5)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

 {'title': 1, 'url': 2, 'author': 5, 'id': 0, 'num_comments': 4, 'created_at': 6, 'num_points': 3} 


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.ny

The following creates separate lists for the Ask HN posts, the Show HN posts, and the other posts.

In [16]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[hd['title']]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    
print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Number of other posts:', len(other_posts))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


The following calculates the total and average number of comments for the Ask HN and Show HN lists. Both the total and the average are higher for Ask HN posts (about 14.04 comments per post) than for Show HN posts (about 10.32).

In [17]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[hd['num_comments']])

print('Total Ask HN comments:', total_ask_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)

print('Average Ask HN comments:', avg_ask_comments)

Total Ask HN comments: 24483
Average Ask HN comments: 14.038417431192661


In [18]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[hd['num_comments']])

print('Total Show HN comments:', total_show_comments)

avg_show_comments = total_show_comments / len(show_posts)

print('Average Show HN comments:', avg_show_comments)

Total Show HN comments: 11988
Average Show HN comments: 10.31669535283993


The following creates a list containing the creation time and the number of comments for each Ask HN post. Then it uses my 'print_rows' function to print the first 10 rows with line breaks for easy reading.

In [22]:
result_list = []
for row in ask_posts:
    result_list.append([row[hd['created_at']], row[hd['num_comments']]])

print_rows(result_list, 0, 10)

['8/16/2016 9:55', '6'] 

['11/22/2015 13:43', '29'] 

['5/2/2016 10:14', '1'] 

['8/2/2016 14:20', '3'] 

['10/15/2015 16:38', '17'] 

['9/26/2015 23:23', '1'] 

['4/22/2016 12:24', '4'] 

['11/16/2015 9:22', '1'] 

['2/24/2016 17:57', '1'] 

['6/4/2016 17:17', '2'] 



The following creates two dictionaries keyed by hour of day (0 to 23): one stores the number of Ask HN posts created in each hour, and the other stores the total number of comments on those posts.

In [23]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    dt_object = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = dt_object.hour  # hour of day as an int, 0-23
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(row[1])
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[1])
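
The same per-hour tally can be written without the `if`/`else` branch. This is a sketch of one alternative, assuming `collections.defaultdict` is acceptable here; the four sample rows are copied from the `result_list` output above.

```python
import datetime as dt
from collections import defaultdict

# Four [created_at, num_comments] pairs copied from the output above.
result_list = [
    ['8/16/2016 9:55', '6'],
    ['11/22/2015 13:43', '29'],
    ['5/2/2016 10:14', '1'],
    ['11/16/2015 9:22', '1'],
]

# defaultdict(int) starts every missing key at 0, so += works directly.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in result_list:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += int(num_comments)

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Two of the sample posts fall in hour 9, so that hour ends up with a count of 2 and a comment total of 7.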

The following uses the posts per hour and comments per hour to calculate the average number of comments per post for each hour. It prints the number of posts, the total comments, and the average comments for each hour.

In [24]:
avg_by_hour = []
for hour in counts_by_hour:
    print('There were', counts_by_hour[hour], 'posts posted during hour',
          hour, '\nTotal comments:', comments_by_hour[hour])
    
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments])
    print('Average comments:', avg_comments)
    print('\n')

There were 55 posts posted during hour 0 
Total comments: 447
Average comments: 8.127272727272727


There were 60 posts posted during hour 1 
Total comments: 683
Average comments: 11.383333333333333


There were 58 posts posted during hour 2 
Total comments: 1381
Average comments: 23.810344827586206


There were 54 posts posted during hour 3 
Total comments: 421
Average comments: 7.796296296296297


There were 47 posts posted during hour 4 
Total comments: 337
Average comments: 7.170212765957447


There were 46 posts posted during hour 5 
Total comments: 464
Average comments: 10.08695652173913


There were 44 posts posted during hour 6 
Total comments: 397
Average comments: 9.022727272727273


There were 34 posts posted during hour 7 
Total comments: 267
Average comments: 7.852941176470588


There were 48 posts posted during hour 8 
Total comments: 492
Average comments: 10.25


There were 45 posts posted during hour 9 
Total comments: 251
Average comments: 5.5777777777777775


There we

The following uses the 'sort_data' function I created, which sorts a list by any field. In this case, it sorts the list of average comments for each hour by the average number of comments, in descending order. Because 'sort_data' moves the sort field to the front of each row, the average is at index 0 and the hour at index 1 in the sorted rows. Then it formats and prints the hour and the average comments for the top 5 rows.

In [25]:
sorted_swap = sort_data(avg_by_hour, 1, True)

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[0:5]:
    f_hour = dt.datetime.strptime(str(row[1]), '%H').strftime('%H:00:')
    print(f_hour, '{avg:.2f} average comments per post'.format(avg = row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
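
For comparison, Python's built-in `sorted` can order the rows by the average directly through its `key` argument, which leaves each row as `[hour, average]`. This is a sketch of an alternative to 'sort_data', not the method used above; the sample values are the first four hourly averages printed earlier.

```python
# [hour, average comments] pairs for hours 0-3, copied from the output above.
avg_by_hour = [
    [0, 8.127272727272727],
    [1, 11.383333333333333],
    [2, 23.810344827586206],
    [3, 7.796296296296297],
]

# Sort by the average (index 1), highest first; rows keep their layout.
top = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in top[:3]:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

Since the hour is already an integer, it can be formatted with `{:02d}` instead of a round trip through `strptime` and `strftime`.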


The timestamps in this data are in Eastern Time, so I need to convert them to my time zone, Central Time, which is one hour behind. According to this data, Ask HN posts created between 3:00 and 4:00 PM Eastern Time, which is 2:00 to 3:00 PM Central Time, receive the most comments on average, so that is the best window for creating a post.
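
The conversion can be sketched in code as well. The example below assumes Central Time is a fixed one hour behind Eastern Time and ignores daylight saving edge cases; `eastern_hour_to_central` is a hypothetical helper name, and the hours in the loop are the top 5 hours printed above.

```python
import datetime as dt

# Assumption: Central Time is treated as a fixed one hour behind Eastern Time.
ET_TO_CT = dt.timedelta(hours=-1)

def eastern_hour_to_central(hour):
    """Return the Central Time 'HH:MM' label for an Eastern Time hour (0-23)."""
    eastern = dt.datetime(2016, 1, 1, hour)  # the date itself is arbitrary
    central = eastern + ET_TO_CT
    return central.strftime('%H:%M')

# Top 5 hours (Eastern Time) from the output above.
for h in [15, 2, 20, 16, 21]:
    print('{:02d}:00 ET -> {} CT'.format(h, eastern_hour_to_central(h)))
```

Using a `timedelta` rather than subtracting 1 from the hour keeps midnight correct: hour 0 Eastern becomes 23:00 Central on the previous day.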