# Exploring Hacker News Posts 

In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/news?p=2).

The original dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). For this project, it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing posts that did not receive comments and randomly sampling from the remaining submissions.

We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. `Ask HN` posts are ones that users submit to ask the Hacker News community a specific question. `Show HN` posts are ones that show the Hacker News community a project, a research article, or something else of interest.

We'll compare these two types of posts to determine the following:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

First, let's read the `hacker_news.csv` file into a list of lists and view the first five rows.

In [1]:
import csv

# Open the file in a `with` block so it is closed once the rows are read.
with open('hacker_news.csv') as f:
    hn = list(csv.reader(f))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The first row is the header. Its columns are described as follows:
* `id`: the unique identifier from Hacker News for the post
* `title`: the title of the post
* `url`: the URL that the post links to, if the post has a URL
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission (the time zone is US Eastern Time)
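Because each row is a plain list, columns are addressed by position rather than by name. The following is a minimal sketch, assuming the header shown above, of how a name-to-position lookup could be built. The `col_index` dictionary and `sample_row` are names introduced here for illustration only.

```python
# Header and one sample row copied from the output above.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position (hypothetical helper dictionary).
col_index = {name: position for position, name in enumerate(header)}

# csv.reader yields every field as a string, so numeric columns
# need an explicit conversion before doing arithmetic.
num_comments = int(sample_row[col_index['num_comments']])
print(col_index['title'], num_comments)  # prints: 1 52
```

The rest of the project uses the positions directly: `row[1]` for the title, `row[4]` for the number of comments, and `row[6]` for the creation time.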


### Removing Headers from a List of Lists

Let's remove the header row so that the list contains only data rows.

In [2]:
header = hn[0]

hn = hn[1:]
print(header)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]


### Extracting Ask HN and Show HN Posts

Since we're only interested in rows whose title begins with either `Ask HN` or `Show HN`, we need to separate those rows from the rest. We can identify them with the string method `startswith`, which returns `True` if the string starts with the given prefix and `False` otherwise. Because `startswith` is case-sensitive, we lowercase each title before checking it. We then add each row to one of three lists: ask posts, show posts, or other posts.
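To illustrate the prefix check in isolation, here is a small sketch. The `classify_title` function and the sample titles are made up for this example and are not part of the dataset or the original analysis.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # normalise case so "show hn" and "Show HN" both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Invented sample titles, for illustration only.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "show hn: My weekend project",
    "A post that mentions Ask HN in the middle",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # prints: ['ask', 'show', 'other']
```

The third title is classified as `other` because `startswith` only inspects the beginning of the string, not text that appears later in the title.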


In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    is_ask_hn = (title.lower()).startswith("ask hn")
    is_show_hn = (title.lower()).startswith("show hn")
    if is_ask_hn:
        ask_posts.append(row)
    elif is_show_hn:
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts), len(show_posts), len(other_posts))

1744 1162 17194


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Let's determine if ask posts or show posts receive more comments on average.

In [4]:
# Average number of comments on ask posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
# Average number of comments on show posts
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


As the numbers show, `Ask HN` posts receive about 14 comments on average, compared with about 10 for `Show HN` posts. One possible explanation is that the Hacker News community is active in helping others, so members are more inclined to respond to questions. Since ask posts are more likely to receive comments, we'll focus the remaining analysis on these posts.
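The two averaging cells above repeat the same loop. As a sketch of one way to avoid the repetition, the calculation could be wrapped in a small helper. The `average_comments` function and the sample rows below are hypothetical and only mirror the dataset's column layout.

```python
def average_comments(posts, comments_col=4):
    """Return the mean comment count over `posts` (hypothetical helper)."""
    if not posts:
        return 0.0  # assumption: an empty list is reported as an average of zero
    total = sum(int(row[comments_col]) for row in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order, for illustration only.
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: B?', '', '3', '20', 'user_b', '1/1/2016 10:00'],
]
print(average_comments(sample_posts))  # prints: 15.0
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.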

Next, we'll determine if ask posts created at a certain time are more likely to attract comments:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
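Both steps rely on extracting the hour from the `created_at` string. The sketch below shows the idea on three sample values, assuming the `month/day/year hour:minute` format seen in the data. The comment counts paired with them and the `sample_*` names are made up for illustration.

```python
import datetime as dt

# (created_at, num_comments) pairs; the values are illustrative only.
samples = [("6/17/2016 0:01", 1), ("8/4/2016 11:52", 52), ("1/26/2016 11:30", 10)]

sample_counts = {}    # number of posts per hour
sample_comments = {}  # total comments per hour
for created_at, num_comments in samples:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour string, e.g. "00" or "11"
    sample_counts[hour] = sample_counts.get(hour, 0) + 1
    sample_comments[hour] = sample_comments.get(hour, 0) + num_comments

# Step 2: divide total comments by the number of posts for each hour.
sample_averages = {hour: sample_comments[hour] / sample_counts[hour]
                   for hour in sample_counts}
print(sample_averages)  # prints: {'00': 1.0, '11': 31.0}
```

Using `strftime("%H")` keeps the hour as a two-character string, which is why the dictionary keys in the real output look like `'09'` rather than `9`.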

In [8]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 0
        comments_by_hour[hour] = 0
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += row[1]
    
comments_by_hour  # not displayed: a notebook cell only shows its last expression
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}


### Calculating the Average Number of Comments for Ask HN Posts by Hour


In [7]:
avg_by_hour = []

# Both dictionaries share the same keys, so a single loop is enough.
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]


### Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
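One way to sort a list of `[hour, average]` pairs is to swap the columns first, as done in the next cell. As an alternative sketch, the built-in `sorted` accepts a `key` function, which can sort on the second element without building a swapped list. The `avg_sample` list below holds a few pairs copied from the output above.

```python
# A few [hour, average] pairs copied from the output above.
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the average (index 1), highest first, leaving the pairs unswapped.
sorted_sample = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)
for hour, avg in sorted_sample:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Both approaches give the same ordering. The swap works because lists compare element by element, so putting the average first makes it the primary sort key.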

In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [18]:
# Print the 5 hours with the highest average number of comments.

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


According to the analysis, `Ask HN` posts created during the `15:00` hour receive 38.59 comments per post on average, about 62% more than posts created during the second-place hour (`02:00`, 23.81 comments). The times in the dataset are recorded in US Eastern Time [(documentation for the data set)](https://www.kaggle.com/hacker-news/hacker-news-posts), so in the Central Time zone, where I reside, the best hour corresponds to `14:00`.
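The time-zone shift and the gap between the top two hours can be checked with a short sketch. It assumes US Central Time is always exactly one hour behind US Eastern Time. The `eastern_to_central` function is a hypothetical helper, and the two averages are copied from the output above.

```python
def eastern_to_central(hour_str):
    """Convert an Eastern "HH" hour string to Central "HH:00" (hypothetical helper).

    Assumption: Central Time is a fixed one hour behind Eastern Time.
    """
    central_hour = (int(hour_str) - 1) % 24  # wrap around midnight
    return "{:02d}:00".format(central_hour)

print(eastern_to_central("15"))  # prints: 14:00
print(eastern_to_central("00"))  # prints: 23:00

# Relative gap between the best hour and the second-best hour.
top_avg = 38.5948275862069
second_avg = 23.810344827586206
pct_higher = (top_avg - second_avg) / second_avg * 100
print(round(pct_higher))  # prints: 62
```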


## Conclusion

In this project, we analyzed Hacker News posts, specifically ask posts and show posts. We found that ask posts receive more comments on average than show posts, and that among ask posts, those created between `15:00` and `16:00` (US Eastern Time) receive the most comments on average.