# Exploring Hacker News posts
In this project, I explore posts on [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted and commented on. The two types of posts I explore have titles that begin with either _Ask HN_ or _Show HN_. Specifically, I want to answer the following:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

# Reading in the data
I'll start by reading in the file and separating the header row from the rest of the dataset.

In [1]:
#Read in the file and display the first few rows. The dataset is a list of lists
from csv import reader
#The with block closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
#Remove the header 
headers=hn[0]
hn=hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
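
As a side note, the read-and-split steps above can be combined into one small helper. The following is only a sketch: `read_rows` is a hypothetical name, and the in-memory `sample_csv` (two rows copied from the output above) stands in for the real file so the example runs on its own.

```python
import csv
import io

# Hypothetical in-memory sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

def read_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = csv.reader(file_obj)
    headers = next(rows)        # the first row is the header
    return headers, list(rows)  # the remaining rows are the data

sample_headers, sample_rows = read_rows(sample_csv)
print(sample_headers)
print(len(sample_rows))
```

With the real dataset, the same helper could be called inside a `with open("hacker_news.csv") as f:` block, which also closes the file automatically.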


# Filter Data
Since I am interested in two different types of posts, I will classify each post as _Ask_, _Show_, or _Other_ based on how its title begins.

In [13]:
ask_posts=[]
show_posts=[]
other_posts=[]

#Check how the title begins (ignoring case) and add the post to the appropriate list
for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
    
print("Total number of 'ask' posts:", len(ask_posts))
print("Total number of 'show' posts:", len(show_posts))
print("Total number of other posts:", len(other_posts))

Total number of 'ask' posts: 1744
Total number of 'show' posts: 1162
Total number of other posts: 17194
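
The classification rule used above can also be written as a small function, which makes it easy to check against a few titles. This is a sketch only: `classify_title` and the sample titles are hypothetical and not part of the dataset.

```python
def classify_title(title):
    """Label a post title as 'ask', 'show', or 'other' (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles used only to illustrate the rule
sample_titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing the title first means that variants such as `ASK HN` or `ask hn` are treated the same way.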


# Calculating the Average Number of Comments for Ask HN and Show HN Posts
In the previous section, I counted the number of posts of each type. There are 1,744 ask posts and 1,162 show posts. Now I will find the average number of comments per post for each type.

In [52]:
#Total the comments on ask posts, then divide by the number of ask posts
total_ask_comments=0
for p in ask_posts:
    num_comments=int(p[4])
    total_ask_comments+=num_comments
avg_ask_comments=total_ask_comments/len(ask_posts)


#Do the same for show posts
total_show_comments=0
for s in show_posts:
    num_scomments=int(s[4])
    total_show_comments+=num_scomments
avg_show_comments=total_show_comments/len(show_posts)

print('Avg ask comments: ',avg_ask_comments)
print('Avg show comments: ',avg_show_comments)






Avg ask comments:  14.038417431192661
Avg show comments:  10.31669535283993


On average, _ask_ posts get about 14 comments while _show_ posts get only about 10 comments.
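
Since the same calculation is done once per post type, it could be factored into a single function. The sketch below assumes the column layout shown earlier (comment count at index 4); `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the dataset
sample_posts = [
    ["1", "Ask HN: A?", "", "3", "10", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: B?", "", "5", "20", "user_b", "1/1/2016 11:00"],
]
print(average_comments(sample_posts))
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, this should give the same two averages as the loops above.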

# Calculating ask posts by hour created and average number of comments by hour
In the previous section, I saw that _ask_ posts receive more comments on average. I'll now calculate the number of ask posts created in each hour of the day and the average number of comments per post for each hour.

In [53]:
import datetime as dt
result_list=[]

#Make a new list of ask posts holding the creation time and number of comments
for p in ask_posts:
    result_list.append([p[6], int(p[4])])

counts_by_hour={}
comments_by_hour={}
date_format = "%m/%d/%Y %H:%M"

#For each post, extract the hour it was created, then count the posts
#and total the comments for that hour
for created_at, comments in result_list:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    if hour in counts_by_hour:
        comments_by_hour[hour] += comments
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = comments
        counts_by_hour[hour] = 1
#Display the total number of comments for each hour
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

I'll now use the two dictionaries I created to calculate the average number of comments per post for each hour.

In [54]:
avg_by_hour=[]
#Divide the total comments for each hour by the number of posts in that hour
for hour in comments_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
print("Average comments within each hour:",avg_by_hour)

 

Average comments within each hour: [['23', 7.985294117647059], ['15', 38.5948275862069], ['11', 11.051724137931034], ['07', 7.852941176470588], ['22', 6.746478873239437], ['14', 13.233644859813085], ['20', 21.525], ['19', 10.8], ['01', 11.383333333333333], ['03', 7.796296296296297], ['05', 10.08695652173913], ['06', 9.022727272727273], ['12', 9.41095890410959], ['13', 14.741176470588234], ['21', 16.009174311926607], ['17', 11.46], ['09', 5.5777777777777775], ['00', 8.127272727272727], ['16', 16.796296296296298], ['02', 23.810344827586206], ['18', 13.20183486238532], ['04', 7.170212765957447], ['10', 13.440677966101696], ['08', 10.25]]
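
The counting and averaging steps can also be done in a single pass with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is a sketch under the same assumptions as the cells above (creation time at index 6, comment count at index 4); `average_comments_by_hour` and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def average_comments_by_hour(posts, date_index=6, comments_index=4):
    """Map each two-digit hour string to the mean comments of posts created in it."""
    totals = defaultdict(int)  # hour -> total comments
    counts = defaultdict(int)  # hour -> number of posts
    for post in posts:
        hour = dt.datetime.strptime(post[date_index], DATE_FORMAT).strftime("%H")
        totals[hour] += int(post[comments_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Hypothetical rows in the same column order as the dataset
sample_posts = [
    ["1", "Ask HN: A?", "", "1", "52", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B?", "", "1", "8", "user_b", "8/5/2016 11:05"],
    ["3", "Ask HN: C?", "", "1", "2", "user_c", "9/30/2015 4:12"],
]
sample_averages = average_comments_by_hour(sample_posts)
print(sample_averages)
```

The result is a dictionary rather than a list of lists, so it would need `sample_averages.items()` before being sorted.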


# Top hours for comments
I now have a list of the average comments per post for each hour, but it is difficult to see which hours rank highest. I'll swap the hour and average in each entry so that sorting the list orders it by average comments, from highest to lowest. Finally, I'll display the top five hours.

In [55]:
swap_avg_by_hour=[]
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
#print(swap_avg_by_hour)

sorted_swap=sorted(swap_avg_by_hour,reverse=True)
#print(sorted_swap)
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(str(hr), "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
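
An alternative to swapping the columns is to pass a `key` function to `sorted`, which sorts on the average directly. The sketch below uses four hour and average pairs copied from the output above; `avg_sample` and `top_two` are hypothetical names.

```python
# Four [hour, average] pairs taken from the averages printed earlier
avg_sample = [
    ["23", 7.985294117647059],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort by the second element (the average), highest first, and keep the top two
top_two = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:2]

for hour, avg in top_two:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Sorting with a key leaves the original `[hour, average]` order intact, so no second list is needed.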


# Conclusion
Looking at the averages, ask posts created at 3 pm EST receive the most comments, about 38.59 per post. Strangely, 2 am is the second highest, at about 23.81 comments per post. Since most users in that time zone are usually asleep at this time, the comments could be coming from users based on the West Coast or in another part of the world.
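
To see what these hours look like elsewhere, they can be converted to other time zones. The sketch below assumes, as the conclusion does, that the timestamps are in EST, and it uses fixed UTC offsets (UTC-5 for Eastern, UTC-8 for Pacific) that ignore daylight saving time. `convert_hour` and the reference date are hypothetical.

```python
import datetime as dt

# Assumption: fixed offsets with no daylight saving adjustment
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")
PACIFIC = dt.timezone(dt.timedelta(hours=-8), "PST")
UTC = dt.timezone.utc

def convert_hour(hour, source_tz, target_tz):
    """Convert an hour of the day from one time zone to another as HH:MM."""
    # The date is arbitrary; only the time of day matters here
    moment = dt.datetime(2016, 1, 15, hour, 0, tzinfo=source_tz)
    return moment.astimezone(target_tz).strftime("%H:%M")

print(convert_hour(2, EASTERN, PACIFIC))  # 2 am Eastern in Pacific time
print(convert_hour(15, EASTERN, UTC))     # 3 pm Eastern in UTC
```

Under these assumptions, 2 am Eastern corresponds to 11 pm the previous evening on the West Coast, which fits the explanation above.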