# Hacker News

We will be using a [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) of posts submitted to Hacker News (HN), the news site run by Y Combinator. The columns of the dataset are:

`id`: Identifier defined by HN  
`title`: Article title  
`url`: Where the article may be found  
`num_points`: Points, defined as upvotes minus downvotes  
`num_comments`: Number of comments on the post  
`author`: Username of the person who submitted the post  
`created_at`: Date and time at which the post was created


In [3]:
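from csv import reader

# Read the whole CSV into a list of rows; the context manager closes the file.
with open("HN_posts_year_to_Sep_26_2016.csv", encoding="UTF-8", newline="") as opened_file:
    HN_list = list(reader(opened_file))

headers = HN_list[0]  # first row holds the column names
hn = HN_list[1:]      # remaining rows are the posts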

print(headers)

print('The number of entries is', len(hn))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
The number of entries is 293119
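The later cells index each row by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, the positions can be looked up by column name instead; `column_index` and `sample_row` are illustrative names that are not part of the original notebook, and the sample row is the one printed in a later cell's output.

```python
# Header row copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row (illustrative helper).
column_index = {name: position for position, name in enumerate(headers)}

# Sample row taken from an output shown later in this notebook.
sample_row = ['12578335', 'Show HN: Finding puns computationally',
              'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']

print(sample_row[column_index['title']])
print(int(sample_row[column_index['num_comments']]))
```

Looking positions up by name keeps the code readable and makes it less likely that a comment count is read from the wrong column.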


---
Separate the different types of Hacker News posts

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    lower_case_title = title.lower()
    
    if lower_case_title.startswith('ask hn'):
        ask_posts.append(row)
    elif lower_case_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('The number of Ask HN posts is', len(ask_posts))
print('The number of Show HN posts is', len(show_posts))
print('The number of other posts is', len(other_posts))

if (len(ask_posts) + len(show_posts) + len(other_posts)) == len(hn):
    print('\nThe post types have been tabulated correctly')
else:
    print('\nThe post types have not been tabulated correctly')


The number of Ask HN posts is 9139
The number of Show HN posts is 10158
The number of other posts is 273822

The post types have been tabulated correctly
['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']
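The classification above relies on lowercasing the title before checking its prefix, so `Ask HN`, `ASK HN` and `ask hn` are all treated alike. A small sketch of the same rule as a function follows; `classify_title` is a hypothetical helper, and only the first sample title comes from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lower_case_title = title.lower()
    if lower_case_title.startswith('ask hn'):
        return 'ask'
    if lower_case_title.startswith('show hn'):
        return 'show'
    return 'other'

# The first title comes from the output above; the other two are made up.
sample_titles = [
    'Show HN: Finding puns computationally',
    'ASK HN: What are you reading?',
    'A post about something else',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```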


---
Determine the total and average number of comments on Ask HN and Show HN posts

In [14]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    comment_number = int(row[4])
    
    total_ask_comments += comment_number
    
for row in show_posts:
    comment_number = int(row[4])
    
    total_show_comments += comment_number
    
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)

print('The total number of comments on Ask posts:', total_ask_comments)
print('The average number of comments on Ask posts:', avg_ask_comments)

print('The total number of comments on Show posts:', total_show_comments)
print('The average number of comments on Show posts:', avg_show_comments)

The total number of comments on Ask posts: 94986
The average number of comments on Ask posts: 10.393478498741656
The total number of comments on Show posts: 49633
The average number of comments on Show posts: 4.886099625910612


Ask HN posts receive more than twice as many comments as Show HN posts on average (about 10.4 versus 4.9). One plausible explanation is that a simple, straightforward question invites people to share their opinions right away. With a Show HN post, by contrast, readers must visit the project (often a Git repo) and do "actual work" to look at something that might not even interest them.
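The "more than twice" comparison can be checked directly from the totals and post counts printed above. This is only a quick arithmetic sketch; `ratio` is a name introduced here.

```python
# Totals and post counts copied from the outputs above.
avg_ask_comments = 94986 / 9139
avg_show_comments = 49633 / 10158

# How many times more comments an Ask post gets than a Show post, on average.
ratio = avg_ask_comments / avg_show_comments
print('Ask-to-Show ratio of average comments: {:.2f}'.format(ratio))
```

The ratio comes out slightly above 2.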

---
Now we will check whether the hour at which an Ask HN post is created is related to the number of comments it receives.

In [66]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_time = row[6]
    comment_number = int(row[4])
    
    new_row = [created_time, comment_number]
    
    result_list.append(new_row)
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    
    date_time = row[0]
    comment = row[1]
    
    date_time_dt = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M") 
    
    time_hour = date_time_dt.hour
    
    if time_hour not in counts_by_hour:
        counts_by_hour[time_hour] = 1
        comments_by_hour[time_hour] = comment
    else:
        counts_by_hour[time_hour] += 1
        comments_by_hour[time_hour] += comment
        
avg_by_hour = []

for hours in counts_by_hour:
    
    avg_comment = comments_by_hour[hours] / counts_by_hour[hours]
    
    avg_hour = [hours, avg_comment]
    
    avg_by_hour.append(avg_hour)
    
avg_by_hour_swapped = []

for row in avg_by_hour:
    hour = row[0]
    avg_comment = row[1]
    
    new_row = [avg_comment, hour]
    
    avg_by_hour_swapped.append(new_row)
    
sorted_avg = sorted(avg_by_hour_swapped, reverse=True)

print('Top 5 Hours for Ask Posts Comments:')

# Print the five hours with the highest average number of comments.
for avg_comment, hour in sorted_avg[:5]:
    print("{}:00: {:,.2f}".format(hour, avg_comment))

Top 5 Hours for Ask Posts Comments:
15:00: 28.68
13:00: 16.32
12:00: 12.38
2:00: 11.14
10:00: 10.68
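The cell above parses each timestamp, groups the comment counts by hour, and ranks the hours by their average. A more compact sketch of the same idea uses `collections.defaultdict` and a sort key; the sample rows are hypothetical and only mimic the dataset's date format.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the dataset's date format.
sample_posts = [
    ('9/26/2016 0:36', 4),
    ('9/25/2016 15:10', 30),
    ('9/24/2016 15:45', 10),
    ('9/23/2016 0:05', 2),
]

# Collect the comment counts of all posts created in the same hour.
comments_per_hour = defaultdict(list)
for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    comments_per_hour[hour].append(num_comments)

# Average number of comments per post for each hour.
avg_comments_by_hour = {
    hour: sum(counts) / len(counts) for hour, counts in comments_per_hour.items()
}

# Sort by average, highest first, without swapping the columns.
ranked_hours = sorted(avg_comments_by_hour.items(), key=lambda item: item[1], reverse=True)
for hour, avg_comment in ranked_hours:
    print('{}:00: {:,.2f}'.format(hour, avg_comment))
```

Sorting with a `key` avoids building a separate swapped list just to order the hours by their averages.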


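Ask HN posts created around lunchtime and in the early afternoon in the Eastern time zone receive the most comments on average: 15:00 leads with 28.68 comments per post, followed by 13:00 and 12:00. Because I am in Vienna, six hours ahead of Eastern Time, that window starts at around 18:00 for me, so that is when it would be best to post.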
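To translate the top hours into Vienna local time, the conversion can be sketched with fixed UTC offsets. This assumes the timestamps are in US Eastern Time and uses the offsets valid in late September 2016 (EDT is UTC-4, CEST is UTC+2); the names `eastern`, `vienna` and `vienna_hours` are introduced here for illustration.

```python
import datetime as dt

# Assumption: fixed offsets valid for late September 2016
# (US Eastern Daylight Time is UTC-4, Central European Summer Time is UTC+2).
eastern = dt.timezone(dt.timedelta(hours=-4), 'EDT')
vienna = dt.timezone(dt.timedelta(hours=2), 'CEST')

top_hours_eastern = [15, 13, 12]  # top three hours from the output above

vienna_hours = []
for hour in top_hours_eastern:
    eastern_time = dt.datetime(2016, 9, 26, hour, 0, tzinfo=eastern)
    vienna_hours.append(eastern_time.astimezone(vienna).hour)

print(vienna_hours)
```

For dates across the whole year, `zoneinfo.ZoneInfo` (Python 3.9+) with `America/New_York` and `Europe/Vienna` would handle daylight saving changes automatically instead of relying on fixed offsets.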