# Exploring Hacker News

In this project, we'll analyze a data set of submissions to Hacker News, a site where user-submitted posts are voted and commented on. We want to know which kinds of posts, and which posting times, attract the most comments.

Our data source is https://www.kaggle.com/hacker-news/hacker-news-posts

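Note that the data set is described as having been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. However, the file loaded below still contains posts with zero comments, so it appears not to be that reduced sample. Below are descriptions of the columns: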

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
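
Because each row is read as a plain list, the columns are accessed by position rather than by name. A minimal sketch of that mapping, using the header row and one data row from the file; the names `column_index` and `sample_row` are introduced here only for illustration:

```python
# Header row and one data row, copied from the first lines of the data set.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

# Map each column name to its position in a row (hypothetical helper name).
column_index = {name: position for position, name in enumerate(headers)}

print(column_index['title'])          # position of the title column
print(column_index['num_comments'])   # position of the comment count
print(sample_row[column_index['created_at']])
```

Every field comes back from the CSV reader as a string, which is why numeric columns such as num_comments need an `int()` conversion before any arithmetic.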

## Question 1: Is Asking or Showing Generating More Interest?

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [4]:
# open our data file and read it into a list of lists
from csv import reader

# the with block closes the file once it has been read
with open('data/HN_posts_year_to_Sep_26_2016.csv') as opened_file:
    read_file = reader(opened_file)
    # hn = our data
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


In [5]:
# extract our data's column headers (the first row) as a flat list
headers = hn[0]
# resave data without headers
hn = hn[1:]

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

We'll do this by checking whether each lowercased title starts with one of the tags these posts use: 'ask hn' or 'show hn'.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

Now we have all the ask and show posts from Hacker News in their own lists (*ask_posts* and *show_posts* respectively), with everything else in *other_posts* for now.
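
The title check can be tried on a few titles in isolation. This is only a sketch: `classify_title` is a hypothetical helper, not part of the notebook, and the third sample title is made up to show that the match ignores case.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: What TLD do you use for local development?',  # from the data
    'algorithmic music',                                   # from the data
    'SHOW HN: a made-up title for illustration',           # invented example
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing first means that titles such as "ASK HN" or "ask hn" land in the same list as "Ask HN".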

### Next, let's determine if ask posts or show posts receive more comments on average.

In [7]:
#find total number of comments in ask_posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4]) # num_comments is the 5th column
    total_ask_comments += num_comments

# use total to find average number of comments
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

10.393478498741656


In [8]:
# find total number of comments in show_posts
# uses same logic as above block
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4]) 
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

4.886099625910612


It looks like Ask posts receive 10.39 comments on average, while Show posts on Hacker News receive only 4.89.

### Conclusion: *Ask* posts get roughly twice as many comments on average as *Show* posts.
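
The two averaging cells share the same logic, so they could be folded into one function, and the comparison can be checked with a single division. A minimal sketch, assuming the column layout described earlier; `average_comments` and `sample_ask_rows` are illustrative names, and the two averages are copied from the outputs above:

```python
def average_comments(rows):
    """Average of the num_comments column (index 4) over a list of rows."""
    total = sum(int(row[4]) for row in rows)
    return total / len(rows)

# Three ask posts copied from the data set, used only to exercise the helper.
sample_ask_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]
sample_average = average_comments(sample_ask_rows)
print(sample_average)

# Averages printed by the notebook for the full lists.
avg_ask_comments = 10.393478498741656
avg_show_comments = 4.886099625910612
ratio = avg_ask_comments / avg_show_comments
print(round(ratio, 2))  # a little over 2
```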

## Question 2: Do posts created at certain times generate more comments?

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
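
The two steps can be sketched on a handful of (created_at, num_comments) pairs taken from the first ask posts in the data. This sketch uses `collections.defaultdict` as a convenience, which is a swapped-in technique rather than the approach the notebook's own cells take; `sample_results` and `averages` are illustrative names.

```python
import datetime as dt
from collections import defaultdict

# A few [created_at, num_comments] pairs copied from the ask posts.
sample_results = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 21:50', 2],
]

counts = defaultdict(int)     # step 1a: posts per hour
comments = defaultdict(int)   # step 1b: comments per hour

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '02'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Step 2: average comments per post for each hour.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

With only five rows the averages mean little, but the shape of the result (one entry per hour, keyed by a two-digit string) matches what the full analysis produces.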

In [9]:
print(ask_posts[:5])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


In [10]:
# import module to read dates & times
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

print(result_list)

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2], ['9/25/2016 19:30', 1], ['9/25/2016 19:22', 22], ['9/25/2016 17:55', 3], ['9/25/2016 15:48', 0], ['9/25/2016 15:35', 13], ['9/25/2016 15:28', 0], ['9/25/2016 14:43', 0], ['9/25/2016 14:17', 3], ['9/25/2016 13:08', 2], ['9/25/2016 11:27', 2], ['9/25/2016 10:51', 0], ['9/25/2016 10:47', 6], ['9/25/2016 9:04', 97], ['9/25/2016 7:09', 4], ['9/25/2016 3:00', 1], ['9/24/2016 23:04', 0], ['9/24/2016 22:02', 7], ['9/24/2016 21:18', 2], ['9/24/2016 20:58', 0], ['9/24/2016 19:57', 1], ['9/24/2016 19:02', 0], ['9/24/2016 17:55', 0], ['9/24/2016 17:27', 1], ['9/24/2016 16:50', 0], ['9/24/2016 16:03', 5], ['9/24/2016 15:29', 66], ['9/24/2016 14:03', 1], ['9/24/2016 10:10', 11], ['9/24/2016 8:46', 7], ['9/24/2016 8:39', 1], ['9/24/2016 8:38', 1], ['9/24/2016 8:28', 1], ['9/24/2016 3:36', 3], ['9/24/2016 0:21', 2], ['9/23/2016 23:38', 6], ['9/23/2016 23:35', 6], ['9/23/2016 22:13', 4

In [11]:
for result in result_list:
    # parse the timestamp (e.g. '9/26/2016 2:53') and keep the zero-padded hour
    created = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M")
    hour = created.strftime("%H")
    comments = result[1]
    if hour not in counts_by_hour:
        # first post seen for this hour
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    
print(counts_by_hour)
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


We've completed the first step and now have the values stored in two dictionaries: *counts_by_hour* and *comments_by_hour*.

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [12]:
avg_by_hour = []

for hour in comments_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])
    
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [13]:
swap_avg_by_hour = []

for avg in avg_by_hour:
    val1 = avg[0]
    val2 = avg[1]
    swap_avg_by_hour.append([val2, val1])
    
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [15]:
swap_avg_by_hour = sorted(swap_avg_by_hour, reverse=True)
print(swap_avg_by_hour)

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10'], [9.7119341563786, '04'], [9.692007797270955, '14'], [9.449744463373083, '17'], [9.190661478599221, '08'], [8.96474358974359, '11'], [8.804177545691905, '22'], [8.794258373205741, '05'], [8.749019607843136, '20'], [8.687258687258687, '21'], [7.948339483394834, '03'], [7.94299674267101, '18'], [7.713298791018998, '16'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [7.163043478260869, '19'], [7.013274336283186, '07'], [6.782051282051282, '06'], [6.696793002915452, '23'], [6.653153153153153, '09']]


In [18]:
print('Top 5 Best Hours to Post')
template = "{}:00 : {:.2f} average comments per post."
for row in swap_avg_by_hour[:5]:
    result = template.format(row[1], row[0])
    print(result)
    

Top 5 Best Hours to Post
15:00 : 28.68 average comments per post.
13:00 : 16.32 average comments per post.
12:00 : 12.38 average comments per post.
02:00 : 11.14 average comments per post.
10:00 : 10.68 average comments per post.
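
The swap step is one way to make `sorted` order the pairs by average. An alternative sketch passes a `key` function instead, so the [hour, average] pairs can stay in their original order; the list below is a six-entry sample copied from the averages printed earlier, and `avg_by_hour_sample` and `ranked` are illustrative names.

```python
# A sample of [hour, average] pairs copied from the notebook's output.
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['01', 7.407801418439717],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['12', 12.380116959064328],
    ['10', 10.684397163120567],
]

# Sort by the average (second element of each pair), highest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

template = "{}:00 : {:.2f} average comments per post."
lines = [template.format(hour, average) for hour, average in ranked[:3]]
for line in lines:
    print(line)
```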


Ask posts created during the 15:00 hour receive the most comments on average (28.68 per post), followed by the 13:00 hour (16.32) and the 12:00 hour (12.38). These hours are expressed in the data set's time zone, so **refer back to the documentation for the data set to convert the times to the time zone you live in.**
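
Converting the hours to another time zone can be sketched with fixed UTC offsets. Both offsets below are assumptions chosen purely for illustration; the real source offset has to come from the data set documentation, and fixed offsets ignore daylight saving changes. `to_local_hour` is a hypothetical helper.

```python
import datetime as dt

# ASSUMPTION: placeholder offsets for illustration only. Replace them with the
# offset stated in the data set documentation and with your own offset.
source_tz = dt.timezone(dt.timedelta(hours=-4))  # assumed offset of the data
local_tz = dt.timezone(dt.timedelta(hours=1))    # assumed offset of the reader

def to_local_hour(hour_string):
    """Convert a zero-padded hour such as '15' from source_tz to local_tz."""
    # The date is arbitrary; only the hour matters for this conversion.
    source_time = dt.datetime(2016, 9, 26, int(hour_string), 0, tzinfo=source_tz)
    return source_time.astimezone(local_tz).strftime("%H")

for hour in ['15', '13', '12', '02', '10']:
    print(hour, '->', to_local_hour(hour))
```

Under these assumed offsets the local clock is five hours ahead of the data, so the 15:00 hour maps to 20:00.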

Here are some next steps for you to consider:
- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.
- Use Dataquest's data science project style guide to format your project.