# Exploring Hacker News Posts
***
In this project, we explore [this data set](https://www.kaggle.com/hacker-news/hacker-news-posts) of posts from [Hacker News](https://news.ycombinator.com/) to see which type of post receives more comments on average: posts that ask the community a question (`Ask HN`), or posts that show users something interesting (`Show HN`).

First, we'll start by **importing and reading the data.**

In [8]:
import csv

# Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv', newline='') as f:
    hn = list(csv.reader(f))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

Next, we'll be **removing the headers.**

In [9]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
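
The cells further down read columns by position (for example `row[4]` for `num_comments`). As an optional aid, here is a minimal sketch of how the header row could be turned into a name-to-index lookup. The name `col_index` is our own illustration and is not used elsewhere in this notebook; the sample row is copied from the output above.

```python
# Sketch: look up column positions by name instead of hard-coded numbers.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# col_index is a hypothetical helper name, not part of the original notebook.
col_index = {name: i for i, name in enumerate(headers)}

title = sample_row[col_index['title']]
num_comments = int(sample_row[col_index['num_comments']])  # values are read as strings
created_at = sample_row[col_index['created_at']]

print(col_index['num_comments'], col_index['created_at'])
print(title, num_comments, created_at)
```

With this lookup, `col_index['num_comments']` is 4 and `col_index['created_at']` is 6, which matches the positions used in the rest of the analysis.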


# Extracting Ask HN and Show HN Posts
Now that we have the data, we can start sorting it out.  We'll separate the posts into 3 categories (`Ask HN`, `Show HN`, and everything else), and from there, we'll move on with only the posts we want to look at.

In [25]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)  # keep the full row, as in the other two lists
print("Ask HN:", len(ask_posts))
print("Show HN:", len(show_posts))
print("Other Posts:", len(other_posts))    

Ask HN: 1744
Show HN: 1162
Other Posts: 17194


Now that the `Ask HN` and `Show HN` posts are in their own lists, we can find which type gets more comments on average.  We'll start with the `Ask HN` posts.

In [28]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average Number of Ask Comments:", avg_ask_comments)

Average Number of Ask Comments: 14.038417431192661


Next are the `Show HN` posts.

In [29]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average Number of Show Comments:", avg_show_comments)

Average Number of Show Comments: 10.31669535283993


With the 2 previous code cells, we have found that `Ask HN` posts receive about 14 comments on average, while `Show HN` posts receive about 10, a difference of roughly 4 comments per post.  Since we're looking for what gets the most comments, we'll focus on `Ask HN` posts from here on out.
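
The two averaging cells above follow the same steps, so they could be folded into one small function. The sketch below is one possible way to do that; the function name `average_comments` and the sample rows are made up for illustration and do not come from the data set.

```python
def average_comments(posts, comments_col=4):
    """Return the mean comment count for a list of rows, or 0.0 if it is empty."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_col]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column layout as hacker_news.csv (illustration only).
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '1/2/2016 15:30'],
]

print(average_comments(sample_posts))  # (10 + 4) / 2 = 7.0
```

In the notebook, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to give the same two averages as the cells above.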

# Finding the Number of Ask Posts and Comments by Hour Created
Now that we know which kind of post gets more comments, we want to see whether the hour a post is created affects how many comments it receives.  We'll first build a list that pairs each `Ask HN` post's creation time with its number of comments.  Next, we'll use that list to fill two dictionaries keyed by the hour of the day (in 24-hour format): one that counts the posts created in that hour, and one that sums the comments those posts received.

In [40]:
import datetime as dt
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
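
The hour keys above come from parsing each `created_at` string with `strptime` and then formatting only the hour with `strftime`. To make that step concrete, here is a small sketch on two timestamps taken from the first rows of the data; the variable names are ours.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Two created_at values copied from the sample rows shown earlier.
afternoon = "8/4/2016 11:52"
midnight = "6/17/2016 0:01"

parsed = dt.datetime.strptime(afternoon, date_format)  # string -> datetime
hour_key = parsed.strftime("%H")                       # datetime -> zero-padded hour

print(parsed)
print(hour_key)
print(dt.datetime.strptime(midnight, date_format).strftime("%H"))
```

Because `%H` formats the hour with a leading zero, a post created at `0:01` is stored under the key `'00'`, which is why every key in the dictionary has two digits.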

# Calculating the Average Number of Comments for Ask HN Posts by Hour
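Once that's done, we will find the average number of comments per `Ask HN` post for each hour of creation by dividing the total comments for an hour by the number of posts created in that hour.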

In [41]:
avg_by_hour = []
for row in comments_by_hour:
    avg_by_hour.append([row, comments_by_hour[row] / counts_by_hour[row]])
avg_by_hour    

[['14', 13.233644859813085],
 ['03', 7.796296296296297],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['04', 7.170212765957447],
 ['21', 16.009174311926607],
 ['13', 14.741176470588234],
 ['18', 13.20183486238532],
 ['15', 38.5948275862069],
 ['10', 13.440677966101696],
 ['06', 9.022727272727273],
 ['20', 21.525],
 ['17', 11.46],
 ['02', 23.810344827586206],
 ['11', 11.051724137931034],
 ['00', 8.127272727272727],
 ['07', 7.852941176470588],
 ['08', 10.25],
 ['16', 16.796296296296298],
 ['09', 5.5777777777777775],
 ['22', 6.746478873239437],
 ['19', 10.8],
 ['05', 10.08695652173913],
 ['01', 11.383333333333333]]

# Sorting and Printing Values from a List of Lists
Finally, we'll sort this list by the average number of comments, then print the top 5 creation hours with the highest averages for `Ask HN` posts.

In [53]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour, '\n') 

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr,"%H").strftime("%H:%M"), avg
        )
    )

[[13.233644859813085, '14'], [7.796296296296297, '03'], [7.985294117647059, '23'], [9.41095890410959, '12'], [7.170212765957447, '04'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.20183486238532, '18'], [38.5948275862069, '15'], [13.440677966101696, '10'], [9.022727272727273, '06'], [21.525, '20'], [11.46, '17'], [23.810344827586206, '02'], [11.051724137931034, '11'], [8.127272727272727, '00'], [7.852941176470588, '07'], [10.25, '08'], [16.796296296296298, '16'], [5.5777777777777775, '09'], [6.746478873239437, '22'], [10.8, '19'], [10.08695652173913, '05'], [11.383333333333333, '01']] 

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
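
Swapping the columns works because `sorted` compares lists by their first element. As an alternative sketch, the same ordering can be obtained without the swap by passing a `key` function to `sorted`. The list below is a shortened copy of `avg_by_hour` with values taken from the output above, and `top_hours` is a name chosen for this example.

```python
# A shortened copy of avg_by_hour, in [hour, average] order.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['20', 21.525],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort by the second element (the average), highest first, without swapping.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hr, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hr, avg))
```

One small difference: if two hours had exactly the same average, the swapped version would break the tie by hour string, while the `key` version keeps them in their original order.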


From this, we can see that `Ask HN` posts created at 15:00 (3 PM) receive the most comments on average, at about 38.59 comments per post.  For reference, the timezone used is Eastern Time in the US, according to the [data set documentation](https://www.kaggle.com/hacker-news/hacker-news-posts). So, we could also write 15:00 as 3:00 PM Eastern Time.

# Conclusion
In this project, we analyzed 2 types of user posts on Hacker News to determine which type receives more comments on average, and at what time of day. Based on our analysis, to maximize the number of comments a post receives, we'd recommend posting it as an `Ask HN` post and creating it between 15:00 and 16:00 (3:00 PM to 4:00 PM Eastern Time).

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, `Ask HN` posts received more comments on average, and `Ask HN` posts created between 15:00 and 16:00 (3:00 PM to 4:00 PM Eastern Time) received the most comments on average.