# __Hacker News Analysis__

In this project, I'll be evaluating a dataset from the website Hacker News. The site functions somewhat like Reddit, in that individuals may post questions, projects, products, or other interesting information. Other users may then comment on the content, and up/down vote each other's comments.

I'm interested in the following questions:
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

In [1]:
#Import needed libraries
from csv import reader

#Import the csv file containing our data
#Use a context manager so the file is closed once it has been read
with open('hacker_news.csv', newline='') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

#Print first five lines of hn
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
#Remove the first row (column headers) from the list and save it in a new list
headers = hn[:1]
hn = hn[1:]

#Verify that I've split the headers from the dataset
print(headers, ',\n')
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']] ,

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [3]:
#Separate out posts beginning with 'Ask HN' and 'Show HN' (with case variations)
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = str(row[1])
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("There are {} ask-posts".format(len(ask_posts)))
print("There are {} show-posts".format(len(show_posts)))
print("There are {} other-posts".format(len(other_posts)))

There are 1744 ask-posts
There are 1162 show-posts
There are 17194 other-posts


In [4]:
#Determine if ask posts or show posts receive more comments on average
#Determine average number of comments per ask post
total_ask_comments = 0
for post in ask_posts:
    num_ask_comments = post[4]
    num_ask_comments = int(num_ask_comments)
    total_ask_comments += num_ask_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average number of comments per ask post: ", avg_ask_comments)

#Determine average number of comments per show post
total_show_comments = 0
for post in show_posts:
    num_show_comments = int(post[4])
    total_show_comments += num_show_comments
    
avg_show_comments = total_show_comments/len(show_posts)
print("Average number of comments per show post: ", avg_show_comments)

Average number of comments per ask post:  14.038417431192661
Average number of comments per show post:  10.31669535283993


As the previous cell shows, ask posts receive about 14.0 comments on average versus about 10.3 for show posts, a difference of roughly 4 comments per post.
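
The two loops in the previous cell follow the same pattern, so they could be folded into one helper. The sketch below is only an illustration: the function name `average_comments` and the sample rows are hypothetical, and it assumes the comment count sits at index 4, as in the header row shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the integer column at comments_index (0.0 if there are no posts)."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column layout as hacker_news.csv
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]

sample_average = average_comments(sample_posts)
print("Average number of comments per sample post: ", sample_average)
```

With the real data, the same call could be made once with `ask_posts` and once with `show_posts`.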

Because ask posts receive more comments on average, we'll focus the rest of our analysis on ask posts only.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

In [5]:
import datetime as dt

#Pair each ask post's creation timestamp with its number of comments
result_list = []

for post in ask_posts:
    created_at = post[6]
    num_ask_comments = int(post[4])
    result_list.append([created_at, num_ask_comments])

#Number of ask posts and total comments, keyed by hour of creation
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    comment = row[1]
    #Parse the timestamp and keep only the zero-padded hour, e.g. '15'
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

comments_by_hour, counts_by_hour

({'00': 447,
  '01': 683,
  '02': 1381,
  '03': 421,
  '04': 337,
  '05': 464,
  '06': 397,
  '07': 267,
  '08': 492,
  '09': 251,
  '10': 793,
  '11': 641,
  '12': 687,
  '13': 1253,
  '14': 1416,
  '15': 4477,
  '16': 1814,
  '17': 1146,
  '18': 1439,
  '19': 1188,
  '20': 1722,
  '21': 1745,
  '22': 479,
  '23': 543},
 {'00': 55,
  '01': 60,
  '02': 58,
  '03': 54,
  '04': 47,
  '05': 46,
  '06': 44,
  '07': 34,
  '08': 48,
  '09': 45,
  '10': 59,
  '11': 58,
  '12': 73,
  '13': 85,
  '14': 107,
  '15': 116,
  '16': 108,
  '17': 100,
  '18': 109,
  '19': 110,
  '20': 80,
  '21': 109,
  '22': 71,
  '23': 68})
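
The hour keys above come from parsing each `created_at` string with `strptime` and formatting the result with `strftime`. As a small check of that step, the sketch below parses two timestamps copied from the dataset preview at the top of the notebook. The variable names are my own.

```python
import datetime as dt

# Timestamps copied from rows shown in the dataset preview
afternoon_stamp = '8/4/2016 11:52'
midnight_stamp = '6/17/2016 0:01'

# The format string must match the month/day/year hour:minute layout of the column
date_format = '%m/%d/%Y %H:%M'

afternoon_hour = dt.datetime.strptime(afternoon_stamp, date_format).strftime('%H')
midnight_hour = dt.datetime.strptime(midnight_stamp, date_format).strftime('%H')

# strftime('%H') zero-pads the hour, which is why the keys look like '00' and '09'
print(afternoon_hour, midnight_hour)
```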

I'll now use the dictionaries created in the previous cell to calculate the average number of comments for posts created during each hour of the day.

In [6]:
#Initialize a new list for the results of the calculation
avg_by_hour = []

#Calculate
for key in comments_by_hour:
    avg_by_hour.append([key, comments_by_hour[key]/counts_by_hour[key]]) 

avg_by_hour

[['10', 13.440677966101696],
 ['19', 10.8],
 ['14', 13.233644859813085],
 ['13', 14.741176470588234],
 ['17', 11.46],
 ['04', 7.170212765957447],
 ['03', 7.796296296296297],
 ['15', 38.5948275862069],
 ['00', 8.127272727272727],
 ['20', 21.525],
 ['11', 11.051724137931034],
 ['18', 13.20183486238532],
 ['02', 23.810344827586206],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['21', 16.009174311926607],
 ['09', 5.5777777777777775],
 ['22', 6.746478873239437],
 ['12', 9.41095890410959],
 ['16', 16.796296296296298],
 ['01', 11.383333333333333],
 ['23', 7.985294117647059],
 ['05', 10.08695652173913],
 ['08', 10.25]]

In [15]:
#Swap the two elements of each row in avg_by_hour (hour, average) and store them in a new list (swap_avg_by_hour) so it can be sorted by average
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
# print(swap_avg_by_hour)

#Sort the new list (swap_avg_by_hour) in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
# sorted_swap
print('\n', "Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    average = float(row[0])
    hour = row[1]
    print('\n' "{} : {:.2f} average comments per post".format(hour, average))


 Top 5 Hours for Ask Posts Comments

15 : 38.59 average comments per post

02 : 23.81 average comments per post

20 : 21.52 average comments per post

16 : 16.80 average comments per post

21 : 16.01 average comments per post
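
Swapping the elements works because `sorted` compares lists by their first item. An alternative is to leave each row as `[hour, average]` and pass a `key` function to `sorted`. The sketch below assumes that approach and uses four rows copied from the `avg_by_hour` output above. The names `sample_avg_by_hour` and `ranked` are illustrative.

```python
# Four [hour, average] rows copied from the avg_by_hour output
sample_avg_by_hour = [
    ['10', 13.440677966101696],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average) without swapping the columns
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print("{} : {:.2f} average comments per post".format(hour, average))
```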


# __Results__
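As the previous cell shows, ask posts created during the 15:00 hour (3 pm) receive the most comments on average, about 38.59 per post, well ahead of the 02:00 hour (23.81) and the 20:00 hour (21.52). The hours are taken directly from the dataset's `created_at` column, and each hour holds a fairly small sample (116 ask posts at 15:00), so a few heavily discussed posts could influence the ranking.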

# __Further Study__

* Determine if show or ask posts receive more points on average.
* Determine if posts created at a certain time are more likely to receive more points.
* Compare these results to the average number of comments and points other posts receive.
* Format the project according to Dataquest's data science project style guide.
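
As a starting point for the points questions above, the per-hour averaging could be generalized to any numeric column. This is a minimal sketch, not a finished analysis: the function `average_by_hour` and the sample rows are hypothetical, and it assumes `num_points` is at index 3 and `created_at` at index 6, as in the header row.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, date_index=6):
    """Return {hour: mean of the integer column at value_index} for rows grouped by creation hour."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], '%m/%d/%Y %H:%M').strftime('%H')
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}


# Hypothetical rows in the same column layout as hacker_news.csv
sample_rows = [
    ['1', 'Ask HN: Example A?', '', '10', '2', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B?', '', '20', '4', 'user_b', '8/5/2016 11:05'],
    ['3', 'Ask HN: Example C?', '', '6', '1', 'user_c', '8/5/2016 0:01'],
]

avg_points = average_by_hour(sample_rows, value_index=3)
print(avg_points)
```

Passing `value_index=4` instead would give average comments per hour, which makes the two measures straightforward to compare.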
