# Exploring Haker News Posts

Hacker News website where user-submitted stories are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles. 

The goal of this project is to explore posts submitted to Hacker News to see which features make the article more popular. 
* Do Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?

# Open and Explore Data

[This data set](https://www.kaggle.com/hacker-news/hacker-news-posts) contains approximately 20,000 data points of stories submitted to Hacker News. There are 7 columns including num_points - the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes. 

* Ask HN posts ask the Hacker News community a specific question. 

* Show HN posts show the Hacker News community a project, product, or something interesting. 


We want to see the community reaction to these type of posts and if they engage more with the comments. 

We will also analyze the time the user posts to see if a certain time receives more comments on average. 

In [1]:
from csv import reader

In [2]:
opened= open('hacker_news.csv')
read= reader(opened)
hn= list(read)
header= hn[0]
hn= hn[1:]

In [3]:
print(header)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Create separate lists for Ask HN posts and Show HN posts

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title= row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of Ask HN posts:", len(ask_posts))
print("Number of Show HN posts:",len(show_posts))
print("Number of other posts:", len(other_posts))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


Explore the engagement on Ask Posts versus Show Posts by calculating the average number of comments 

In [5]:
total_ask_comments = 0 

for row in ask_posts:
    num_comments= int(row[4])
    total_ask_comments += num_comments 

print("Total number of comments in Ask HN:", total_ask_comments)

Total number of comments in Ask HN: 24483


In [6]:
avg_ask_comments= total_ask_comments / len(ask_posts)
print("Average comments per post in Ask HN:", avg_ask_comments)

Average comments per post in Ask HN: 14.038417431192661


In [7]:
total_show_comments = 0 

for row in show_posts: 
    num_comments = int(row[4])
    total_show_comments += num_comments 
    
print("Total number of comments in Show HN:", total_show_comments)

Total number of comments in Show HN: 11988


In [8]:
avg_show_comments= total_show_comments / len(show_posts)
print("Average comments per post in Show HN:", avg_show_comments)

Average comments per post in Show HN: 10.31669535283993


On average, Ask HN posts receive more comments than Show HN posts. 
Ask HN has 1744 posts with 14.0 comments on average for each post. 
Show HN has 1162 posts with 10.3 comments on average for each post. 

Ask HN are more popular so we will focus our analysis on just these posts. We will determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis: 

* Calculate the amount of ask posts created in each hour of the day, along with the number of comments received
* Calculate the average number of comments ask posts receive by hour created 

In [9]:
import datetime as dt

In [10]:
result_list = []

for row in ask_posts:
    created_at= row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])
    
print(result_list[:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


In [11]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at= row[0]
    created_at= dt.datetime.strptime(created_at,'%m/%d/%Y %H:%M' )
    created_at= created_at.strftime('%H')
    if created_at not in counts_by_hour:
        counts_by_hour[created_at]= 1
        comments_by_hour[created_at]= row[1]
    if created_at in counts_by_hour:
        counts_by_hour[created_at] += 1
        comments_by_hour[created_at] += row[1]    

In [12]:
print(counts_by_hour)
print("\n")
print(comments_by_hour)

{'09': 46, '13': 86, '10': 60, '14': 108, '16': 109, '23': 69, '12': 74, '17': 101, '15': 117, '21': 110, '20': 81, '02': 59, '18': 110, '03': 55, '05': 47, '19': 111, '01': 61, '22': 72, '08': 49, '04': 48, '00': 56, '06': 45, '07': 35, '11': 59}


{'09': 257, '13': 1282, '10': 794, '14': 1419, '16': 1831, '23': 544, '12': 691, '17': 1147, '15': 4478, '21': 1749, '20': 1724, '02': 1384, '18': 1441, '03': 422, '05': 493, '19': 1191, '01': 716, '22': 481, '08': 497, '04': 340, '00': 457, '06': 398, '07': 269, '11': 643}


In [13]:
avg_by_hour = []

for hour in comments_by_hour:
    hour_avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, hour_avg])

Let's format the list avg_by_hour in an easier to read format and more conducive to our analysis. 
* Swap the averages and hour data 
* Sort the averages in decending order
* Print the top 5 hours for ask post comments 

In [14]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [15]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [16]:
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    hour= dt.datetime.strptime(row[1], '%H')
    hour= hour.strftime('%H:00') 
    string= '{h} : {avg:.2f} average comments per post'.format(h= hour, avg= row[0])
    print(string)

Top 5 Hours for Ask Posts Comments
15:00 : 38.27 average comments per post
02:00 : 23.46 average comments per post
20:00 : 21.28 average comments per post
16:00 : 16.80 average comments per post
21:00 : 15.90 average comments per post


The time zone for these results are Eastern Standard Time. 
I will convert the times to my currect time zone in LA of Pacific Standard Time. 

In [17]:
for row in sorted_swap[:5]:
    hour= dt.datetime.strptime(row[1], '%H')
    LA_time= hour + dt.timedelta(hours= -3)
    LA_time= LA_time.strftime('%H:00')
    string= '{h} : {avg:.2f} average comments per post'.format(h = LA_time, avg= row[0])
    print(string)

12:00 : 38.27 average comments per post
23:00 : 23.46 average comments per post
17:00 : 21.28 average comments per post
13:00 : 16.80 average comments per post
18:00 : 15.90 average comments per post


# Conclusion 

In conclusion, the best time to post an Ask HN article is 12:00 PST to receive the most comments and engagement. 

Posts made at 23:00 PST and 17:00 PST are the next best hours to get comments. 