# Hacker News Project

- The dataset used here is a collection of posts from the website Hacker News.
- The columns in the dataset are described below:
    - **id**: the unique identifier from Hacker News for the post
    - **title**: the title of the post
    - **url**: the URL that the post links to, if the post has a URL
    - **num_points**: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    - **num_comments**: the number of comments on the post
    - **author**: the username of the person who submitted the post
    - **created_at**: the date and time of the post's submission
- We're specifically interested in posts with titles that begin with either _Ask HN_ or _Show HN_.
- We'll compare the two types of posts to determine:
    - Do _Ask HN_ or _Show HN_ posts receive more comments on average?
    - Do posts created at a certain time receive more comments on average?

In [17]:
from csv import reader
import datetime as dt

In [20]:
with open('hacker_news.csv') as opened_file:  # the file is closed automatically
    hn = list(reader(opened_file))
headers = hn[0]  # the first row holds the column names
hn = hn[1:]  # keep only the data rows
print(hn[1])  # second data row
print(headers)

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
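
- The loading step above can also be wrapped in a small function that works on any open file object. The sketch below is only an illustration: `load_rows` and the two sample rows are hypothetical and are not part of the dataset.

```python
import csv
import io


def load_rows(file_obj):
    """Return (headers, rows) read from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical two-row sample standing in for hacker_news.csv.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,1/26/2016 19:30\n"
    "2,Show HN: Example project,http://example.com,7,1,bob,1/27/2016 8:05\n"
)

headers, rows = load_rows(sample)
print(headers)
print(rows[0])
```

- With the real file, the same function could be called as `with open('hacker_news.csv', newline='') as f: headers, hn = load_rows(f)`. Passing `newline=''` is what the `csv` module documentation recommends when opening files for `csv.reader`.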


In [21]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
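
- The same split can be expressed as a reusable function, which makes it easy to check on a few hand-written titles that the matching is case-insensitive. This is a minimal sketch: `split_posts` and the sample rows are hypothetical names and data.

```python
def split_posts(rows, title_index=1):
    """Split rows into (ask, show, other) lists based on the title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[title_index].lower()  # lowercase so the match ignores case
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Hypothetical rows with only an id and a title.
sample_rows = [
    ['1', 'Ask HN: Example question?'],
    ['2', 'SHOW HN: Example project'],
    ['3', 'Asking for a friend'],
]

ask, show, other = split_posts(sample_rows)
print(len(ask), len(show), len(other))
```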


- Now we determine whether ask posts or show posts receive more comments on average.

In [22]:
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments    

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_ask_comments

14.038417431192661

In [23]:
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
avg_show_comments

10.31669535283993

- As shown above, ask posts receive about 14 comments on average and show posts about 10 (both rounded to the nearest whole number).
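
- The two averaging loops above follow the same pattern, so they could share one helper. The sketch below assumes the comment count sits at index 4, as in this dataset. The name `average_comments`, the sample posts, and the choice to return `0.0` for an empty list are all assumptions made for illustration.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post (0.0 for an empty list)."""
    if not posts:
        return 0.0  # assumption: treat "no posts" as an average of zero
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical posts with 3 and 9 comments.
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '3', 'alice', '1/26/2016 19:30'],
    ['2', 'Ask HN: B', '', '2', '9', 'bob', '1/27/2016 8:05'],
]

print(average_comments(sample_posts))  # 6.0
```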

- Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

    - Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
    - Calculate the average number of comments ask posts receive by hour created.

In [45]:
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}  # number of ask posts created in each hour of the day
comments_by_hour = {}  # total comments on ask posts created in each hour of the day
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    date = row[0]
    comment = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
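
- As an alternative to the `if`/`else` bookkeeping above, `collections.defaultdict` can start each hour at zero automatically. This is a sketch of the same tally, not the method used in this notebook. `tally_by_hour` and the sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'


def tally_by_hour(pairs):
    """Count posts and sum comments per hour from (created_at, num_comments) pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Parse the timestamp, then keep only the zero-padded hour as a string.
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical (created_at, num_comments) pairs.
sample_pairs = [
    ('1/26/2016 19:30', 3),
    ('1/27/2016 19:05', 9),
    ('1/27/2016 8:05', 4),
]

counts, comments = tally_by_hour(sample_pairs)
print(counts)
print(comments)
```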

- Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [47]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [49]:
avg_by_hour = []

for hr in comments_by_hour:
    num_posts = counts_by_hour[hr]
    num_comments = comments_by_hour[hr]
    avg = num_comments / num_posts
    avg_by_hour.append([hr, avg])
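
- The averaging loop above can also be written as a list comprehension. The sketch below uses small made-up dictionaries rather than the real tallies. `average_by_hour` is a hypothetical name.

```python
def average_by_hour(counts_by_hour, comments_by_hour):
    """Return [hour, average comments per post] pairs, one per hour."""
    return [
        [hour, comments_by_hour[hour] / counts_by_hour[hour]]
        for hour in counts_by_hour
    ]


# Hypothetical tallies: 2 posts with 12 comments at 19h, 1 post with 4 comments at 08h.
sample_counts = {'19': 2, '08': 1}
sample_comments = {'19': 12, '08': 4}

print(average_by_hour(sample_counts, sample_comments))
```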

- Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [54]:
#first swap order so that avg comments is in first column
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [55]:
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

- The output above shows that ask posts created at 3pm (hour 15) receive the most comments on average, about 38.6 per post, followed by posts created at 2am (hour 02) with about 23.8.
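
- The easier-to-read printout of the five highest values mentioned earlier could look like the sketch below. It sorts with a `key` function instead of swapping the columns first. `top_hours` is a hypothetical helper, and the sample list holds six of the averages from the output above.

```python
import datetime as dt


def top_hours(avg_by_hour, n=5):
    """Return formatted lines for the n hours with the highest averages."""
    # Sort on the average (second element) without swapping the columns.
    ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
    lines = []
    for hour, avg in ranked[:n]:
        label = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
        lines.append(f'{label}: {avg:.2f} average comments per post')
    return lines


# Six [hour, average] pairs copied from the sorted output above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

for line in top_hours(sample_avg_by_hour):
    print(line)
```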

- Possible next steps for this analysis:

1. Determine whether show or ask posts receive more points on average.
2. Determine whether posts created at a certain time are more likely to receive more points.
3. Compare these results to the average number of comments and points other posts receive.
4. Format the project using Dataquest's data science project style guide.