In [1]:
# Reading files with Python
import os
from csv import reader

# The data directory is stored as an absolute path; change it on other machines
path = 'C:\\Users\\btjos\\Documents\\Git_Data\\'

![Hacker_News](path/hacker_news.jpg)

# Hacker News 

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The version used in this notebook has been reduced: all submissions that did not receive any comments were removed, and the remaining submissions were then randomly sampled.

Columns are:
- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

A few example posts:
- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Aby recent changes to CSS that broke mobile?
- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

## Analyse these two types of posts to determine:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

# Learnings/code snippet repository

1. datetime utilities

In [3]:
# read from file 
opened_file = open(path + 'hacker_news.csv', encoding='utf8')
hn_file = reader(opened_file)
hn = list(hn_file)
hn[0:3]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30']]
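
The cell above never closes the file handle. A minimal sketch of the same read inside a `with` block, which closes the file automatically. The helper name `read_rows` is hypothetical, and a short in-memory sample stands in for `hacker_news.csv` so the snippet runs on any machine:

```python
import io
from csv import reader


def read_rows(file_obj):
    """Return every CSV row from an open text file as a list of lists."""
    return list(reader(file_obj))


# Assumption: this in-memory sample replaces the real file. With the real
# data, open(path + 'hacker_news.csv', encoding='utf8', newline='') could
# be used in the with statement instead.
sample = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

with io.StringIO(sample) as sample_file:
    rows = read_rows(sample_file)

# Split the header from the data rows in one step.
header, data = rows[0], rows[1:]
print(header)
print(data[0])
```

Splitting `rows` into `header` and `data` also avoids mutating the list with `del` later on.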

In [4]:
# Keep the header row separately and remove it from the data, so hn holds only posts
headers = hn[0]
print(headers)
del hn[0]
print(hn[0])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


In [22]:
# filter the posts
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of posts', len(hn))
print('Number of ask_hn posts', len(ask_posts))
print('Number of show_hn posts', len(show_posts))
print('Number of other posts', len(other_posts))

Number of posts 20100
Number of ask_hn posts 1744
Number of show_hn posts 1162
Number of other posts 17194
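
The prefix test can also be pulled into a small function so that the rule lives in one place. This is only a sketch: `classify` and `groups` are hypothetical names, and the titles are taken from the examples listed earlier in this notebook.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Sample titles from this notebook, not the full data set.
titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]

groups = {'ask': [], 'show': [], 'other': []}
for title in titles:
    groups[classify(title)].append(title)

for name, members in groups.items():
    print(name, len(members))
```

Lowercasing once inside the function keeps the comparison case-insensitive without repeating `title.lower()` in every branch.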


### Exploring average comments

pandas should provide shorter, vectorized code for computing these averages; the plain-Python version is used here.

# Average number of comments on Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments = total_ask_comments + float(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments, 2))

# Average number of comments on Show HN posts
total_show_comments = 0
for row in show_posts:
    total_show_comments = total_show_comments + float(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments, 2))
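
The two loops differ only in the list they read, so they could be folded into one helper. A sketch using the standard library's `statistics.mean`; the name `average_comments` and its `comments_index` parameter are hypothetical, and the sample rows are the first three `[created_at, num_comments]` pairs of Ask HN posts shown in this notebook:

```python
from statistics import mean


def average_comments(posts, comments_index=4):
    """Average the comment count stored at comments_index in each row."""
    return mean(int(row[comments_index]) for row in posts)


# Sample rows of the form [created_at, num_comments]; the comment count is
# at index 1 here, whereas full hn rows keep it at index 4 (the default).
sample_rows = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

avg_sample = average_comments(sample_rows, comments_index=1)
print(round(avg_sample, 2))
```

With the full data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.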

Ask HN posts receive more comments on average than Show HN posts. The next step is to check whether Ask HN posts created at certain hours attract more comments.

In [40]:
import datetime as dt
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

result_list[0:3]

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

In [53]:
# Note the date format: %Y (four-digit year) rather than %y (two-digit year)
counts_by_hour = {}
comments_by_hour = {}

for row in result_list: 
    date_format = "%m/%d/%Y %H:%M"
    time = dt.datetime.strptime(row[0], date_format)
    hour = time.hour
    
    if hour in counts_by_hour:
        counts_by_hour[hour] +=1
        comments_by_hour[hour] = comments_by_hour[hour] + row[1]
    else:
        counts_by_hour[hour] =1
        comments_by_hour[hour] = row[1]
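
The `if`/`else` branches exist only to initialise missing keys. A sketch of the same aggregation with `collections.defaultdict`, which supplies a zero for any hour not yet seen. The names `sample`, `sample_counts` and `sample_comments` are hypothetical; the rows are the first three entries of `result_list` printed above.

```python
import datetime as dt
from collections import defaultdict

# %Y matches a four-digit year such as 2016; %y would expect 16.
DATE_FORMAT = "%m/%d/%Y %H:%M"

# First three [created_at, num_comments] pairs from result_list.
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

sample_counts = defaultdict(int)    # hour -> number of posts
sample_comments = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).hour
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Defining the format string once outside the loop also avoids rebuilding it on every iteration.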

In [56]:
avg_by_hour = {}
for hour in counts_by_hour:
    avg_comments = round(comments_by_hour[hour]/counts_by_hour[hour], 2)
    avg_by_hour[hour] = avg_comments
print(avg_by_hour)

{9: 5.58, 13: 14.74, 10: 13.44, 14: 13.23, 16: 16.8, 23: 7.99, 12: 9.41, 17: 11.46, 15: 38.59, 21: 16.01, 20: 21.52, 2: 23.81, 18: 13.2, 3: 7.8, 5: 10.09, 19: 10.8, 1: 11.38, 22: 6.75, 8: 10.25, 4: 7.17, 0: 8.13, 6: 9.02, 7: 7.85, 11: 11.05}


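Dictionaries can't be sorted in place, so the keys and values are swapped and the averages (now the keys) are sorted. Note that the swap would silently drop an hour if two hours shared the same average.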

In [58]:
swap_avg_by_hour = {}
for hour in avg_by_hour:
    swap_avg_by_hour[avg_by_hour[hour]] = hour
print(swap_avg_by_hour)

{5.58: 9, 14.74: 13, 13.44: 10, 13.23: 14, 16.8: 16, 7.99: 23, 9.41: 12, 11.46: 17, 38.59: 15, 16.01: 21, 21.52: 20, 23.81: 2, 13.2: 18, 7.8: 3, 10.09: 5, 10.8: 19, 11.38: 1, 6.75: 22, 10.25: 8, 7.17: 4, 8.13: 0, 9.02: 6, 7.85: 7, 11.05: 11}


In [59]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [61]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[0:5]:
    print('At ', swap_avg_by_hour[row], " ", row, 'avg comments were posted')

Top 5 Hours for Ask Posts Comments
At  15   38.59 avg comments were posted
At  2   23.81 avg comments were posted
At  20   21.52 avg comments were posted
At  16   16.8 avg comments were posted
At  21   16.01 avg comments were posted
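
The ranking can also be produced without swapping keys and values, by sorting the dictionary's `(hour, average)` pairs on the average. This is a sketch rather than the method used in this notebook; `avg_sample` and `ranked` are hypothetical names, and the values are a subset of the `avg_by_hour` output shown earlier.

```python
# Subset of avg_by_hour copied from the output earlier in this notebook.
avg_sample = {9: 5.58, 13: 14.74, 16: 16.8, 15: 38.59, 21: 16.01, 20: 21.52, 2: 23.81}

# Sort (hour, average) pairs by the average, highest first.
ranked = sorted(avg_sample.items(), key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:5]:
    print(f"{hour:02d}:00  {avg:.2f} average comments per post")
```

Because the hours stay as keys, two hours with the same average both survive the sort.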


In [None]:
### Complete using pandas methods