# Analyzing Submissions to Hacker News


For this project, we will use Jupyter Notebook to analyze a dataset of posts submitted to [Hacker News](https://news.ycombinator.com/). Hacker News is a news feed that is populated with posts submitted by users. The posts can receive votes and comments, which impact their rank within the news feed.

The dataset for this analysis was obtained from [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).


In [112]:
# Import the reader function from the CSV module

from csv import reader

In [113]:
# Read the csv file as a list of lists

# The with-statement closes the file handle once the rows have been read
with open(r"C:\Users\awaul\OneDrive\Documents\Data\Hacker_News_Data\hacker_news.csv", newline="") as open_file:
    hn_data = list(reader(open_file))

In [114]:
# Print the first 5 rows of data

print(hn_data[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['10176923', "Why we aren't tempted to use ACLs on our Unix machines", 'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation', '34', '23', 'mjn', '9/6/2015 6:03'], ['10177011', 'Video Poker Hackers Cleared of Federal Charges', 'http://www.wired.com/2013/11/video--poker-case/', '23', '3', 'trengrj', '9/6/2015 7:25'], ['10177048', 'The Microservices Way  Weekly Microserivces Newsletter', 'https://www.getrevue.co/profile/microservices', '1', '1', 'britman', '9/6/2015 7:50'], ['10177077', 'The Hitler at Home stories of the pre-WWII American press', 'http://www.atlasobscura.com/articles/the-american-medias-awkward-fawning-over-hitlers-taste-in-home-decor', '75', '75', 'aaronbrethorst', '9/6/2015 8:05']]


In [115]:
# Extract the header row from data

headers = hn_data[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [116]:
# Remove the header row from hn_data

hn_data = hn_data[1:]
print(hn_data[:5])

[['10176923', "Why we aren't tempted to use ACLs on our Unix machines", 'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation', '34', '23', 'mjn', '9/6/2015 6:03'], ['10177011', 'Video Poker Hackers Cleared of Federal Charges', 'http://www.wired.com/2013/11/video--poker-case/', '23', '3', 'trengrj', '9/6/2015 7:25'], ['10177048', 'The Microservices Way  Weekly Microserivces Newsletter', 'https://www.getrevue.co/profile/microservices', '1', '1', 'britman', '9/6/2015 7:50'], ['10177077', 'The Hitler at Home stories of the pre-WWII American press', 'http://www.atlasobscura.com/articles/the-american-medias-awkward-fawning-over-hitlers-taste-in-home-decor', '75', '75', 'aaronbrethorst', '9/6/2015 8:05'], ['10177103', 'GM crops created superweed, say scientists (2005)', 'http://www.theguardian.com/science/2005/jul/25/gm.food', '58', '27', 'x5n1', '9/6/2015 8:24']]
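The read-and-split steps above can be tried without the original file. The following is a minimal sketch that feeds two rows copied from the output above through `io.StringIO` in place of `hacker_news.csv`; the names `sample_csv`, `rows`, `header`, and `posts` are illustrative and do not appear in the notebook.

```python
import io
from csv import reader

# Header plus two rows taken from the printed output, standing in for the real file.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "10177011,Video Poker Hackers Cleared of Federal Charges,"
    "http://www.wired.com/2013/11/video--poker-case/,23,3,trengrj,9/6/2015 7:25\n"
    "10177103,\"GM crops created superweed, say scientists (2005)\","
    "http://www.theguardian.com/science/2005/jul/25/gm.food,58,27,x5n1,9/6/2015 8:24\n"
)

# The context manager closes the in-memory buffer, just as it would a real file.
with io.StringIO(sample_csv, newline="") as f:
    rows = list(reader(f))

# Separate the header row from the data rows.
header, posts = rows[0], rows[1:]
print(header)
print(len(posts))
```

Note that the second title contains a comma; because it is quoted, `reader` keeps it as a single field, which is why the `csv` module is preferable to splitting lines on commas by hand.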


In [117]:
# Assign posts to a list from the corresponding categories (ask_posts, show_posts or other_posts)

ask_posts = []
show_posts = []
other_posts = []

for post in hn_data:
    title = post[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(post)
        
    elif title.startswith('show hn'):
        show_posts.append(post)
        
    else:
        other_posts.append(post)
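The prefix test in the loop above can be isolated in a small helper, which makes it easy to check against a few titles. This is only a sketch; `categorize`, `sample_titles`, and `labels` are hypothetical names, and the titles are copied from the dataset rows shown in this notebook.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to keep young developers",
    "Show HN: Easiest way to build html tables in React",
    "Video Poker Hackers Cleared of Federal Charges",
]

labels = [categorize(title) for title in sample_titles]
print(labels)
```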

In [118]:
# View the first 3 posts for each category

print(ask_posts[:3])
print('\n')
print(show_posts[:3])
print('\n')
print(other_posts[:3])

[['10177801', 'Ask HN: How to keep young developers', '', '3', '6', 'orangeplus', '9/6/2015 14:53'], ['10182770', 'Ask HN: If you are learning Chinese', '', '1', '2', 'goodcharacters', '9/7/2015 19:54'], ['10182780', 'Ask HN: Are freemium microservices a thing?', '', '1', '2', 'hyperpallium', '9/7/2015 19:56']]


[['10177459', 'Show HN: AppyPaper  Gift wrap with app icons printed on it', 'http://www.appypaper.com/', '6', '4', 'submitstartup', '9/6/2015 12:38'], ['10179920', 'Show HN: Easiest way to build html tables in React', 'https://github.com/legitcode/table', '3', '2', 'zackify', '9/7/2015 3:20'], ['10180369', 'Show HN: Chemozart  molecule editor and visualizer with mechanics calculators', 'https://github.com/mohebifar/chemozart', '34', '17', 'mohebifar', '9/7/2015 6:50']]


[['10176923', "Why we aren't tempted to use ACLs on our Unix machines", 'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation', '34', '23', 'mjn', '9/6/2015 6:03'], ['10177011', 'Video Poker Hacke

In [119]:
# Examine the number of posts for each category 

print("Number of Ask HN posts:", f'{len(ask_posts):,}')
print("Number of Show HN posts:", f'{len(show_posts):,}')
print("Number of General News posts:", f'{len(other_posts):,}')

Number of Ask HN posts: 1,744
Number of Show HN posts: 1,162
Number of General News posts: 17,193
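To put these counts in proportion, the sketch below converts them to shares of the whole dataset. The three counts are copied from the output above; `counts`, `total`, and `shares` are illustrative names.

```python
# Counts copied from the output above.
counts = {"Ask HN": 1744, "Show HN": 1162, "General News": 17193}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total:,} posts")
```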


In [120]:
# Write a function to calculate the average number of comments per category

def avg_comments(posts, index=4):
    # `posts` replaces the parameter name `list`, which shadowed the built-in type
    total_comments = 0
    for row in posts:
        total_comments += int(row[index])

    return total_comments / len(posts)
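One edge case worth noting is that the function above divides by the length of the list, so an empty category would raise `ZeroDivisionError`. A possible guard, sketched here under the assumption that returning `None` is acceptable for an empty list, could look like this; `mean_comments` and `sample_posts` are hypothetical names, and the rows are copied from the Ask HN posts in this dataset.

```python
def mean_comments(posts, index=4):
    """Average of the integer column at `index`, or None for an empty list."""
    if not posts:
        return None
    return sum(int(row[index]) for row in posts) / len(posts)


sample_posts = [
    ["10177801", "Ask HN: How to keep young developers", "", "3", "6", "orangeplus", "9/6/2015 14:53"],
    ["10182770", "Ask HN: If you are learning Chinese", "", "1", "2", "goodcharacters", "9/7/2015 19:54"],
]

print(mean_comments(sample_posts))  # (6 + 2) / 2
print(mean_comments([]))            # empty input is handled without an exception
```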

In [121]:
# Which category of posts receives the most comments on average?

print("Average comments for an Ask HN post:",f'{avg_comments(ask_posts):.1f}')
print('\n')
print("Average comments for a Show HN post:",f'{avg_comments(show_posts):.1f}')
print('\n')
print("Average comments for a General News post:",f'{avg_comments(other_posts):.1f}')
print('\n')

Average comments for an Ask HN post: 14.0


Average comments for a Show HN post: 10.3


Average comments for a General News post: 26.9




In [126]:
# When is the best time to submit posts to attract comments?

import datetime as dt
date_format = "%m/%d/%Y %H:%M"
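As a quick check of this format string, the sketch below parses one `created_at` value copied from the dataset and extracts the hour both as an integer and as the zero-padded label used in the cells that follow. The names `created_at`, `parsed`, and `hour_label` are illustrative.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# A created_at value copied from the first data row.
created_at = "9/6/2015 6:03"

# strptime accepts the non-padded month, day, and hour in this value.
parsed = dt.datetime.strptime(created_at, date_format)

# strftime('%H') gives a zero-padded two-character string, here '06'.
hour_label = parsed.strftime("%H")

print(parsed.hour, hour_label)
```

The zero-padded string form has the convenient property that sorting the labels alphabetically also sorts them chronologically.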

In [127]:
# Write a function to analyze the frequency of posts by hour

def posts_by_hour(category_list):
    
    counts_by_hour = {}
    
    for post in category_list:
        date = post[-1]
        time = dt.datetime.strptime(date, date_format).strftime('%H')
        
        if time in counts_by_hour:
            counts_by_hour[time] += 1
        else:
            counts_by_hour[time] = 1
            
    return counts_by_hour

a = posts_by_hour(ask_posts)
print(a)

{'14': 107, '19': 110, '15': 116, '20': 80, '00': 55, '01': 60, '03': 54, '07': 34, '16': 108, '22': 71, '05': 46, '13': 85, '10': 59, '11': 58, '17': 100, '23': 68, '12': 73, '06': 44, '18': 109, '09': 45, '04': 47, '21': 109, '02': 58, '08': 48}


In [128]:
# Write a function to total the number of comments by hour

def comments_by_hour(category_list):
    com_by_hour = {}
    for post in category_list:
        date = post[-1]
        time = dt.datetime.strptime(date, date_format).strftime('%H')
        num_comments = int(post[4])
        
        if time in com_by_hour:
            com_by_hour[time] += num_comments
        else:
            com_by_hour[time] = num_comments
            
    return com_by_hour

b = comments_by_hour(ask_posts)
print(b)

{'14': 1416, '19': 1188, '15': 4477, '20': 1722, '00': 447, '01': 683, '03': 421, '07': 267, '16': 1814, '22': 479, '05': 464, '13': 1253, '10': 793, '11': 641, '17': 1146, '23': 543, '12': 687, '06': 397, '18': 1439, '09': 251, '04': 337, '21': 1745, '02': 1381, '08': 492}


In [140]:
# Examine the average number of comments for Ask HN posts by hour

ask_hn_com_by_hour = comments_by_hour(ask_posts)
ask_hn_posts_by_hour = posts_by_hour(ask_posts)

def avg_by_hour(category_comments_by_hr, category_posts_by_hour):
    # The result list has its own name so it does not shadow the function
    averages = []

    for hr in category_comments_by_hr:
        averages.append([hr, category_comments_by_hr[hr] / category_posts_by_hour[hr]])

    return averages

ask_hn_avg_by_hr = avg_by_hour(ask_hn_com_by_hour, ask_hn_posts_by_hour)

print(ask_hn_avg_by_hr)

[['14', 13.233644859813085], ['19', 10.8], ['15', 38.5948275862069], ['20', 21.525], ['00', 8.127272727272727], ['01', 11.383333333333333], ['03', 7.796296296296297], ['07', 7.852941176470588], ['16', 16.796296296296298], ['22', 6.746478873239437], ['05', 10.08695652173913], ['13', 14.741176470588234], ['10', 13.440677966101696], ['11', 11.051724137931034], ['17', 11.46], ['23', 7.985294117647059], ['12', 9.41095890410959], ['06', 9.022727272727273], ['18', 13.20183486238532], ['09', 5.5777777777777775], ['04', 7.170212765957447], ['21', 16.009174311926607], ['02', 23.810344827586206], ['08', 10.25]]
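The counting and totalling steps above each parse every timestamp, so the dates are parsed twice. As an alternative sketch, not the approach used in this notebook, both tallies can be kept in a single pass with `collections.defaultdict`. The function name `avg_comments_by_hour`, the constant `DATE_FORMAT`, and the three sample rows (copied from the Ask HN posts shown earlier) are illustrative.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def avg_comments_by_hour(posts):
    """Map each hour label ('00'..'23') to its average comments per post."""
    tallies = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
    for post in posts:
        hour = dt.datetime.strptime(post[-1], DATE_FORMAT).strftime("%H")
        tallies[hour][0] += 1
        tallies[hour][1] += int(post[4])
    return {hour: comments / count for hour, (count, comments) in tallies.items()}


sample_posts = [
    ["10177801", "Ask HN: How to keep young developers", "", "3", "6", "orangeplus", "9/6/2015 14:53"],
    ["10182770", "Ask HN: If you are learning Chinese", "", "1", "2", "goodcharacters", "9/7/2015 19:54"],
    ["10182780", "Ask HN: Are freemium microservices a thing?", "", "1", "2", "hyperpallium", "9/7/2015 19:56"],
]

print(avg_comments_by_hour(sample_posts))
```

Keeping the count and the total together also rules out a mismatch between the two dictionaries when the average is computed.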


In [163]:
# Sort the hourly averages from highest to lowest

def sort_by_avg_comments(category_avg_by_hr):
    swap_columns = []
    for row in category_avg_by_hr:
        swap_columns.append([row[1],row[0]])
    sorted_swap = sorted(swap_columns, reverse=True)
    return sorted_swap
        
best_times_to_post_ask_hn = sort_by_avg_comments(ask_hn_avg_by_hr)

print(best_times_to_post_ask_hn)


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
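Swapping the columns makes `sorted` compare the averages first. An alternative sketch that leaves each `[hour, average]` pair in its original order is to pass a `key` function to `sorted`. The five pairs below are a subset copied from the Ask HN output above, so the ranking shown applies only to that subset; `avg_by_hr_subset` and `ranked` are illustrative names.

```python
# A subset of the [hour, average] pairs printed earlier for Ask HN posts.
avg_by_hr_subset = [
    ["14", 13.233644859813085],
    ["19", 10.8],
    ["15", 38.5948275862069],
    ["20", 21.525],
    ["02", 23.810344827586206],
]

# Sort on the second element (the average), highest first.
ranked = sorted(avg_by_hr_subset, key=lambda pair: pair[1], reverse=True)

for hr, avg in ranked:
    print(f"{hr}:00: {avg:.2f} average comments per post")
```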


In [169]:
# Print the five hours with the highest average comments for Ask HN posts

print("Best Times to Post for 'Ask HN' Comments")
for avg, hr in best_times_to_post_ask_hn[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg))


Best Times to Post for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


In [142]:
# Repeat the hourly average calculation for Show HN posts

show_hn_com_by_hr = comments_by_hour(show_posts)
show_hn_posts_by_hr = posts_by_hour(show_posts)

show_hn_avg_by_hr = avg_by_hour(show_hn_com_by_hr, show_hn_posts_by_hr)

print(show_hn_avg_by_hr)

[['12', 11.80327868852459], ['03', 10.62962962962963], ['06', 8.875], ['21', 5.787234042553192], ['22', 12.391304347826088], ['14', 13.44186046511628], ['16', 11.655913978494624], ['17', 9.795698924731182], ['00', 15.709677419354838], ['18', 15.770491803278688], ['19', 9.8], ['08', 4.852941176470588], ['13', 9.555555555555555], ['15', 8.102564102564102], ['20', 10.2], ['01', 8.785714285714286], ['09', 9.7], ['23', 12.416666666666666], ['05', 3.0526315789473686], ['02', 4.233333333333333], ['11', 11.159090909090908], ['04', 9.5], ['07', 11.5], ['10', 8.25]]
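The Show HN averages can be ranked in the same way as the Ask HN ones. The sketch below copies the 24 pairs from the output above and picks the five highest with `heapq.nlargest`, which is a substitute for the sort-and-slice approach used earlier rather than the notebook's own method. The name `top_five` is illustrative.

```python
import heapq

# [hour, average comments] pairs copied from the Show HN output above.
show_hn_avg_by_hr = [
    ['12', 11.80327868852459], ['03', 10.62962962962963], ['06', 8.875],
    ['21', 5.787234042553192], ['22', 12.391304347826088], ['14', 13.44186046511628],
    ['16', 11.655913978494624], ['17', 9.795698924731182], ['00', 15.709677419354838],
    ['18', 15.770491803278688], ['19', 9.8], ['08', 4.852941176470588],
    ['13', 9.555555555555555], ['15', 8.102564102564102], ['20', 10.2],
    ['01', 8.785714285714286], ['09', 9.7], ['23', 12.416666666666666],
    ['05', 3.0526315789473686], ['02', 4.233333333333333], ['11', 11.159090909090908],
    ['04', 9.5], ['07', 11.5], ['10', 8.25],
]

# Take the five pairs with the largest average, highest first.
top_five = heapq.nlargest(5, show_hn_avg_by_hr, key=lambda pair: pair[1])

print("Best Times to Post for 'Show HN' Comments")
for hr, avg in top_five:
    print(f"{hr}:00: {avg:.2f} average comments per post")
```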
