In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator [**_Y Combinator_**](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set [**_here_**](https://www.kaggle.com/hacker-news/hacker-news-posts).

We're specifically interested in posts whose titles begin with either __Ask HN__ or __Show HN__. Users submit __Ask HN__ posts to ask the Hacker News community a specific question.

Likewise, users submit __Show HN__ posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader
import datetime as dt

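# The path below is specific to the author's machine; point it at your copy of hacker_news.csv.
with open("C:/Users/ifediorah.kenechukwu/Documents/PythonDA/Datasets/hacker_news.csv", encoding="utf-8") as opened_file:
    # Read every row into a list of lists; the file is closed automatically.
    hn = list(reader(opened_file))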
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [2]:
print(len(hn))

293119


Recall that we are only interested in posts whose titles begin with `Ask HN` or `Show HN`.

We will split the data into separate lists, one for each category of post. The comparison is case-insensitive, so titles such as `ask hn` are also counted.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


Since we are only concerned with `Ask HN` and `Show HN` posts, the data we need to analyze is reduced to 19,297 entries (9,139 ask posts and 10,158 show posts).
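The grouping rule used above can also be written as a small function, which makes it easy to check on a handful of titles. This is only a minimal sketch: `classify_title` is a hypothetical helper that is not part of the original notebook, and the sample titles are adapted from the output shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Sample titles (assumed for illustration only)
sample_titles = [
    "Ask HN: What TLD do you use for local development?",
    "show hn: Finding puns computationally",
    "algorithmic music",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Keeping the rule in one function means the prefix check only has to be changed in one place if the categories are ever extended.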

In [4]:
print(ask_posts[:3])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']]


In [5]:
print(show_posts[:3])

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']]


Now let's determine which set of posts receives more comments on average.

In [6]:
# Calculating average comments for "Ask HN" posts
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [7]:
# Calculating average comments for "Show HN" posts
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

4.886099625910612
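The two cells above repeat the same loop for each list. One possible way to avoid the duplication is a small helper, sketched below. `average_comments` is a hypothetical name that does not appear in the original notebook, and the two sample rows are copied from the `ask_posts` output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count, where counts are stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two rows taken from the ask_posts output above
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

sample_avg = average_comments(sample_posts)
print(sample_avg)  # (7 + 3) / 2 = 5.0
```

With the full data loaded, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same values as the two cells above.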


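#### `Ask HN` posts receive more comments on average than `Show HN` posts

`Ask HN` posts average about 10.4 comments per post, compared with about 4.9 for `Show HN` posts. This is plausible, since a question invites replies more directly than a project announcement does, although the data alone does not establish the cause.

Since `Ask HN` posts receive more comments on average, we'll focus the remaining analysis on these posts.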

In [8]:
print(ask_posts[:2])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']]


Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

To do this, we will use the `datetime` module to parse the creation time accurately, and then build a frequency table of posts created in each hour of the day.

##### Let's write a function that extracts the hour from a timestamp string

In [33]:
def get_hour(string):
    """Return the zero-padded hour ("00" to "23") from a "month/day/year hour:minute" string."""
    parsed_time = dt.datetime.strptime(string, "%m/%d/%Y %H:%M")
    return parsed_time.strftime("%H")

print(get_hour(ask_posts[1][6]))

01
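`get_hour` returns the hour as a zero-padded string such as `"01"`. If an integer is more convenient, for example for numeric sorting, the parsed `datetime` object also exposes an `hour` attribute. The variant below is a sketch only; `get_hour_int` is a hypothetical name and the sample timestamps are taken from the data shown earlier.

```python
import datetime as dt

def get_hour_int(timestamp):
    """Parse a "month/day/year hour:minute" string and return the hour as an int (0-23)."""
    parsed = dt.datetime.strptime(timestamp, "%m/%d/%Y %H:%M")
    return parsed.hour

hours = [get_hour_int(stamp) for stamp in ["9/26/2016 2:53", "9/25/2016 22:57"]]
print(hours)  # [2, 22]

# The integer can be turned back into the zero-padded form when needed.
padded = f"{hours[0]:02d}"
print(padded)  # 02
```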


Now, let's create a frequency table for the number of posts created in each hour of the day, along with a second table for the total number of comments those posts received.

In [34]:
result_list = []
for item in ask_posts:
    time_created = item[6]
    comments = item[4]
    result_list.append([time_created, comments])

In [35]:
print(result_list[:4])

[['9/26/2016 2:53', '7'], ['9/26/2016 1:17', '3'], ['9/25/2016 22:57', '0'], ['9/25/2016 22:48', '3']]


In [36]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hour = get_hour(row[0])
    comments = int(row[1])
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

In [37]:
print(counts_by_hour)
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
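The `if hour not in counts_by_hour` check can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below applies that idea to the four sample rows printed earlier; the variable names `counts` and `comments` are assumptions chosen for this example and are not part of the original notebook.

```python
from collections import defaultdict
import datetime as dt

# First four [created_at, num_comments] pairs from result_list above
sample_rows = [
    ['9/26/2016 2:53', '7'],
    ['9/26/2016 1:17', '3'],
    ['9/25/2016 22:57', '0'],
    ['9/25/2016 22:48', '3'],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += int(num_comments)

print(dict(counts))    # {'02': 1, '01': 1, '22': 2}
print(dict(comments))  # {'02': 7, '01': 3, '22': 3}
```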


Now, we will use these two frequency tables to calculate the average number of comments received by ask posts created in each hour of the day.

In [46]:
avg_by_hour = []
for hour in counts_by_hour:
    # Average = total comments in that hour / number of posts in that hour
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [47]:
for row in sorted(avg_by_hour):
    print(row)

['00', 7.5647840531561465]
['01', 7.407801418439717]
['02', 11.137546468401487]
['03', 7.948339483394834]
['04', 9.7119341563786]
['05', 8.794258373205741]
['06', 6.782051282051282]
['07', 7.013274336283186]
['08', 9.190661478599221]
['09', 6.653153153153153]
['10', 10.684397163120567]
['11', 8.96474358974359]
['12', 12.380116959064328]
['13', 16.31756756756757]
['14', 9.692007797270955]
['15', 28.676470588235293]
['16', 7.713298791018998]
['17', 9.449744463373083]
['18', 7.94299674267101]
['19', 7.163043478260869]
['20', 8.749019607843136]
['21', 8.687258687258687]
['22', 8.804177545691905]
['23', 6.696793002915452]


Although this list contains the values we are looking for, it is hard to see at a glance which hours have the highest average number of comments. Let's sort the list by average in descending order and print the top five hours.

In [48]:
# create a list with the values of avg_by_hour swapped

swap_avg_by_hour = []
for item in avg_by_hour:
    swap_avg_by_hour.append([item[1],item[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [49]:
print("Top 5 Hours for Ask Posts Comments")

for item in sorted_swap[:5]:
    print("{}:00: {:.2f} average comments per post".format(item[1],item[0]))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
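The swap step is needed because `sorted` compares the first element of each inner list by default. An alternative is to pass a `key` function so that the `[hour, average]` pairs can be sorted directly by their second element. This is a sketch on a few values copied from the averages printed earlier; `avg_sample` and `top_two` are names assumed for this example.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
avg_sample = [
    ['00', 7.5647840531561465],
    ['02', 11.137546468401487],
    ['13', 16.31756756756757],
    ['15', 28.676470588235293],
    ['23', 6.696793002915452],
]

# Sort by the average (second element), largest first, and keep the top two.
top_two = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:2]

lines = ["{}:00: {:.2f} average comments per post".format(hour, avg) for hour, avg in top_two]
for line in lines:
    print(line)
# 15:00: 28.68 average comments per post
# 13:00: 16.32 average comments per post
```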


### Conclusion

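From these results, `Ask HN` posts created during the `15:00` (3 p.m.) hour received the most comments on average, at 28.68 per post. The next best hours were `13:00` (1 p.m.), `12:00` (noon), `02:00` (2 a.m.), and `10:00` (10 a.m.).

So if you want an `Ask HN` post to attract comments, these hours are the best candidates according to this data set. The hours are taken directly from the `created_at` column and have not been converted to any particular time zone.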