# Looking At Hacker News Posts 

For this project, we take a look at a dataset of posts from Hacker News, a site where people share and discuss technology-related stories (you can find the dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts)). We are interested in two specific kinds of posts: Ask HN and Show HN. As a brief explanation, Ask HN posts ask the community a specific question (e.g., how to start creating a website), while Show HN posts let people show off projects they have built, products, and so on. The questions we want to answer are whether Ask HN or Show HN posts receive more comments on average, and whether posts created at a certain hour of the day receive more comments on average. 

In [1]:
#Opening up and looking into the dataset (first five rows)
from csv import reader 

#The with block closes the file once all rows have been read
with open('C:/Users/Adity/Onedrive/Documents/Python_data_science/Datasets/HN_posts.csv', 
          encoding='utf8') as open_file:
    hn = list(reader(open_file))

#Separate the header row from the data rows
hn_header = hn[0]
hn = hn[1:] 

def explore_data(dataset, start, end):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
print(hn_header)
print('\n')
explore_data(hn, 0, 5)



['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




Here we can see what the data looks like. Each row has an id, a title, a URL, the number of points (likes) the post received, the number of comments, the author of the post, and the date and time the post was created. 
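As a rough sketch of how the column names map onto each row, the snippet below reads two of the rows shown above with `csv.DictReader`, so each field can be looked up by name rather than by position. The in-memory `sample_csv` string is an assumption standing in for the real file, and the rest of this project keeps using plain lists.

```python
import csv
import io

# Two rows copied from the output above; this string is a stand-in for HN_posts.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader uses the header row as keys, so fields are accessed by column name.
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in sample_rows:
    print(row["title"], "|", row["num_comments"], "comments |", row["created_at"])
```

Note that every value is still a string at this point, which is why the later cells convert `num_comments` with `int()` before doing arithmetic.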

## Filtering Out the Ask HN and Show HN Posts

Now that we have a general understanding of how the data looks, let's start by separating the posts into three lists: Ask HN posts, Show HN posts, and all other posts. 

In [2]:
ask_hn = []
show_hn = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_hn.append(row)
    elif title.startswith('show hn'):
        show_hn.append(row)
    else:
        other_posts.append(row)
        

print("The number of Ask HN posts: " + str(len(ask_hn)))
print("The number of Show HN posts: " + str(len(show_hn)))
print("The number of Other posts: " + str(len(other_posts)))

The number of Ask HN posts: 9139
The number of Show HN posts: 10158
The number of Other posts: 273822


After sorting the posts into categories by title, we now know that there are 9,139 Ask HN posts and 10,158 Show HN posts. Show HN posts are slightly more common, but the post counts alone do not tell us which category receives more comments. Let's look a bit further into the data. 

In [3]:
#Looking at the first five rows of the Ask HN and Show HN datasets
print("Ask HN:")
print("\n")
explore_data(ask_hn,0,5)
print("\n")
print("Show HN:")
print("\n")
explore_data(show_hn,0,5)

Ask HN:


['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']




Show HN:


['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']


['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']


['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html',

## Calculating The Average Number Of Comments Per Post 

To find out which category receives more comments on average, let's calculate the average number of comments per post for each category.

In [4]:
#Looking at average number of ask comments 

total_ask_comments = 0 

for row in ask_hn:
    ask_comments = int(row[4])
    total_ask_comments += ask_comments

avg_ask_comments = total_ask_comments/len(ask_hn)

print("The average amount of ask comments: " + str(avg_ask_comments))

The average amount of ask comments: 10.393478498741656


In [5]:
#Looking at the average number of show comments 

total_show_comments = 0 

#Sum the comments across the Show HN posts
for row in show_hn:
    show_comments = int(row[4])
    total_show_comments += show_comments 
    
avg_show_comments = total_show_comments/len(show_hn)

print("The average amount of show comments: " + str(avg_show_comments))

The average amount of show comments: 4.886099625910612


Looking at the averages for both categories, we can now say that Ask HN posts (about 10 comments per post) receive more comments on average than Show HN posts (about 5 comments per post). 
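Since both averages follow the same steps, the calculation could also live in one small function, which guarantees that the summed rows and the divisor always come from the same list. The sketch below is only an illustration: `average_comments` is a hypothetical helper that does not appear in the cells above, and `sample_ask` holds just the first five Ask HN rows printed earlier rather than the full dataset.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the num_comments column for a list of rows."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# First five Ask HN rows from the output above:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
    ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'],
]

sample_avg = average_comments(sample_ask)
print(sample_avg)
```

With the full lists, the same helper would be called as `average_comments(ask_hn)` and `average_comments(show_hn)`.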

## Looking at Posts and Comments Created By Hour

Let's dig deeper into the Ask HN category by looking at how many posts are created, and how many comments they receive, during each hour of the day.

In [16]:
#Importing the datetime module
import datetime as dt 

ask_result_list = []

ask_counts_by_hour = {}
ask_comments_by_hour = {}


#Keep only the creation time (index 6) and the number of comments (index 4)
for row in ask_hn:
    ask_result_list.append([row[6], row[4]])
    
for row in ask_result_list:
    date = row[0]
    comments = int(row[1])
    date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date_dt.hour
    if hour not in ask_counts_by_hour:
        ask_counts_by_hour[hour] = 1
        ask_comments_by_hour[hour] = comments
    else:
        ask_counts_by_hour[hour] += 1
        ask_comments_by_hour[hour] += comments
        
    
print(ask_comments_by_hour)
print(ask_counts_by_hour)  

{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
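The same tally can be sketched with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch because missing keys start at zero. This is only an alternative illustration under stated assumptions: `sample_pairs`, `counts_by_hour`, and `comments_by_hour` are names introduced here, and the data is limited to the five Ask HN rows printed earlier.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs taken from the Ask HN rows shown earlier.
sample_pairs = [
    ("9/26/2016 2:53", "7"),
    ("9/26/2016 1:17", "3"),
    ("9/25/2016 22:57", "0"),
    ("9/25/2016 22:48", "3"),
    ("9/25/2016 21:50", "2"),
]

# defaultdict(int) returns 0 for a key that has not been seen yet.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_pairs:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += int(num_comments)

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```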


Let's now figure out the average number of comments per post for each hour.

In [19]:
avg_by_hour = []

for hour in ask_counts_by_hour:
    average = [hour, round(ask_comments_by_hour[hour] / ask_counts_by_hour[hour], 2)]
    avg_by_hour.append(average)

print(avg_by_hour)

[[2, 11.14], [1, 7.41], [22, 8.8], [21, 8.69], [19, 7.16], [17, 9.45], [15, 28.68], [14, 9.69], [13, 16.32], [11, 8.96], [10, 10.68], [9, 6.65], [7, 7.01], [3, 7.95], [23, 6.7], [20, 8.75], [16, 7.71], [8, 9.19], [0, 7.56], [18, 7.94], [12, 12.38], [4, 9.71], [6, 6.78], [5, 8.79]]


In [25]:
swapped_avg_by_hour = []

#Swap each pair to [average, hour] so that sorting orders by the average
for row in avg_by_hour:
    swapped_avg_by_hour.append([row[1], row[0]])

sorted_swap=sorted(swapped_avg_by_hour,reverse=True)
print("Top 5 Hours for Ask Comments")

for row in sorted_swap[:5]:
    datetime_obj = dt.datetime.strptime(str(row[1]), "%H")
    time = datetime_obj.strftime("%H:%M")
    print("{}: {} average comments per post".format(time, row[0]))
    

Top 5 Hours for Ask Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
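
As an alternative sketch, the ranking can be done without building a second list by passing a `key` function to `sorted`, so the pairs are ordered by the average (index 1) instead of the hour (index 0). The `sample_avg_by_hour` list below is an assumption for illustration: it holds only a handful of the `[hour, average]` pairs from the `avg_by_hour` output above.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    [2, 11.14], [15, 28.68], [13, 16.32], [10, 10.68],
    [23, 6.7], [12, 12.38], [9, 6.65],
]

# Sort on the average (index 1) rather than the hour (index 0), highest first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_hours:
    print("{:02d}:00: {} average comments per post".format(hour, average))
```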
