# Exploring Hacker News Posts
We'll compare two types of Hacker News posts, Ask HN and Show HN, to determine the following:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

The dataset can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts "Hacker News Posts | Kaggle").

(Full disclosure: this is a guided project.)

## Load Data

In [1]:
import datetime as dt
from csv import reader

In [2]:
# Read in the data and store as a list of lists
with open("HN_posts_year_to_Sep_26_2016.csv", encoding="utf-8") as f:
    hn = list(reader(f))

# Remove the header from the data
hn_header, hn_data = hn[0], hn[1:]

In [3]:
# Establishing a data summary function as this could be useful later
def explore_data(dataset, start, end, rows_and_columns=True):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))

In [4]:
# Print a preview of the data along with the header
print(hn_header)
explore_data(hn_data, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
Number of rows: 293119
Number of columns: 7

## Question 1: Do Ask HN or Show HN receive more comments on average?

In [5]:
# Separate posts beginning with "Ask HN" and "Show HN"
ask_posts, show_posts, other_posts = [], [], []
for post in hn_data:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

print("  Ask Posts: ", len(ask_posts))
print(" Show Posts: ", len(show_posts))
print("Other Posts: ", len(other_posts))

  Ask Posts:  9139
 Show Posts:  10158
Other Posts:  273822


In [6]:
# Find the average number of comments for ask and show posts, respectively
total_ask_comments, total_show_comments = 0, 0

for post in ask_posts:
    total_ask_comments += int(post[4])

for post in show_posts:
    total_show_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("Average Comments on Ask Posts:", avg_ask_comments)
print("Average Comments on Show Posts:", avg_show_comments)

Average Comments on Ask Posts: 10.393478498741656
Average Comments on Show Posts: 4.886099625910612


On average, "Ask" posts receive more than twice as many comments as "Show" posts.
With this in mind, "Ask" posts will be our focus for the rest of this analysis.
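
As a quick sanity check on the "more than twice" claim, the ratio of the two averages can be computed directly. This is only a sketch: the two values are copied by hand from the printed output, and the variable names are chosen for illustration.

```python
# Averages copied from the printed output of the previous cell
avg_ask_comments = 10.393478498741656
avg_show_comments = 4.886099625910612

# Ratio of average comments: Ask posts relative to Show posts
ratio = avg_ask_comments / avg_show_comments
print("Ask posts receive {:.2f}x the comments of Show posts".format(ratio))
```

A ratio above 2 confirms that "Ask" posts average more than double the comments of "Show" posts.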

Next, we'll determine if "Ask" posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of "Ask" posts created in each hour of the day, along with the number of comments received.
1. Calculate the average number of comments "Ask" posts receive by hour created.
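
The two steps above can be sketched compactly on a few made-up rows. This is an illustrative variant, not the approach used in the cells of this notebook: it assumes each row is a `[created_at, num_comments]` pair, uses `collections.defaultdict` instead of explicit membership checks, and the function name `average_comments_by_hour` and the sample data are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical sample rows in an assumed [created_at, num_comments] shape
sample_rows = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 2:10", 3],
    ["9/25/2016 22:57", 0],
]

def average_comments_by_hour(rows):
    """Return a dict mapping a two-digit hour string to average comments per post."""
    counts = defaultdict(int)    # step 1: posts created in each hour
    comments = defaultdict(int)  # step 1: comments received in each hour
    for created_at, num_comments in rows:
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    # Step 2: average comments per post for each hour
    return {hour: comments[hour] / counts[hour] for hour in counts}

averages = average_comments_by_hour(sample_rows)
print(averages)
```

Using `defaultdict(int)` removes the need for the `if hour in ...` branch, at the cost of hiding the initialization step that the explicit version makes visible.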

## Question 2: Do posts created at a certain time receive more comments on average?

In [7]:
# Create a new list with a column including the time a post was created and a column
# including the number of comments that post received
result_list = [[post[6], int(post[4])] for post in ask_posts]

explore_data(result_list, 0, 5, True)

['9/26/2016 2:53', 7]
['9/26/2016 1:17', 3]
['9/25/2016 22:57', 0]
['9/25/2016 22:48', 3]
['9/25/2016 21:50', 2]
Number of rows: 9139
Number of columns: 2


In [8]:
# Calculate the number of ask posts created per hour and the total number of comments
counts_by_hour, comments_by_hour = {}, {}
for result in result_list:
    date = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]

print(counts_by_hour)
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [9]:
# Calculate the average number of comments on posts created during each hour of the day
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(*sorted(avg_by_hour), sep="\n")

['00', 7.5647840531561465]
['01', 7.407801418439717]
['02', 11.137546468401487]
['03', 7.948339483394834]
['04', 9.7119341563786]
['05', 8.794258373205741]
['06', 6.782051282051282]
['07', 7.013274336283186]
['08', 9.190661478599221]
['09', 6.653153153153153]
['10', 10.684397163120567]
['11', 8.96474358974359]
['12', 12.380116959064328]
['13', 16.31756756756757]
['14', 9.692007797270955]
['15', 28.676470588235293]
['16', 7.713298791018998]
['17', 9.449744463373083]
['18', 7.94299674267101]
['19', 7.163043478260869]
['20', 8.749019607843136]
['21', 8.687258687258687]
['22', 8.804177545691905]
['23', 6.696793002915452]


In [10]:
# Sort the list by average in descending order
avg_by_hour_sorted = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)

print(*avg_by_hour_sorted, sep="\n")

['15', 28.676470588235293]
['13', 16.31756756756757]
['12', 12.380116959064328]
['02', 11.137546468401487]
['10', 10.684397163120567]
['04', 9.7119341563786]
['14', 9.692007797270955]
['17', 9.449744463373083]
['08', 9.190661478599221]
['11', 8.96474358974359]
['22', 8.804177545691905]
['05', 8.794258373205741]
['20', 8.749019607843136]
['21', 8.687258687258687]
['03', 7.948339483394834]
['18', 7.94299674267101]
['16', 7.713298791018998]
['00', 7.5647840531561465]
['01', 7.407801418439717]
['19', 7.163043478260869]
['07', 7.013274336283186]
['06', 6.782051282051282]
['23', 6.696793002915452]
['09', 6.653153153153153]


In [11]:
# Print our results
print("Top 5 Hours for Ask Posts Comments\n")
for hour in avg_by_hour_sorted[:5]:
    print("{}:00 Eastern Time {:.2f} average comments per post".format(hour[0], hour[1]))

Top 5 Hours for Ask Posts Comments

15:00 Eastern Time 28.68 average comments per post
13:00 Eastern Time 16.32 average comments per post
12:00 Eastern Time 12.38 average comments per post
02:00 Eastern Time 11.14 average comments per post
10:00 Eastern Time 10.68 average comments per post


## Conclusion
Thus far this analysis suggests:
- "Ask" posts receive more comments than "Show" posts on average
- "Ask" posts created from 15:00 to 16:00 Eastern Time receive more comments on average than "Ask" posts created during the other hours of the day

Taken together, these insights suggest that an "Ask" post created from 15:00 to 16:00 Eastern Time is more likely to attract comments than an "Ask" post created at another hour, and more likely than a typical "Show" post.
For a brand looking to maximize engagement, this is an actionable insight.

### Potential Next Steps
- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
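
The first next step could start from something like the sketch below. It assumes the column order shown in the header (`num_points` at index 3); the helper name `average_of_column` and the sample rows are hypothetical and stand in for the real `ask_posts` and `show_posts` lists.

```python
NUM_POINTS_INDEX = 3  # position of 'num_points' in the header row

def average_of_column(posts, column_index):
    """Return the mean of an integer column across rows, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(post[column_index]) for post in posts) / len(posts)

# Hypothetical rows laid out like the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ["1", "Ask HN: Example A", "", "4", "2", "user_a", "9/26/2016 2:53"],
    ["2", "Ask HN: Example B", "", "10", "5", "user_b", "9/26/2016 1:17"],
]
sample_show = [
    ["3", "Show HN: Example C", "http://example.com", "3", "0", "user_c", "9/25/2016 22:57"],
    ["4", "Show HN: Example D", "http://example.com", "6", "1", "user_d", "9/25/2016 22:48"],
    ["5", "Show HN: Example E", "http://example.com", "9", "4", "user_e", "9/25/2016 21:50"],
]

avg_ask_points = average_of_column(sample_ask, NUM_POINTS_INDEX)
avg_show_points = average_of_column(sample_show, NUM_POINTS_INDEX)
print("Average Points on Ask Posts:", avg_ask_points)
print("Average Points on Show Posts:", avg_show_points)
```

Passing the column index as a parameter lets the same helper serve both the comments column (index 4) and the points column (index 3).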