
# Exploring Hacker News Posts

Our aim in this project is to find out whether "Ask HN" or "Show HN" posts receive more comments on average, and whether posts created at a certain time of day receive more comments on average.

## Opening and Exploring the Data

The data set contains Hacker News posts from the 12 months leading up to September 26, 2016.

- You can download [the data set](https://www.kaggle.com/hacker-news/hacker-news-posts) directly from Kaggle.

Let's start by opening the data set and then continue with exploring the data.


In [4]:
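from csv import reader

# The Hacker News Posts Dataset
# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
hn_header = hn[0]  # first row holds the column names
hn = hn[1:]        # remaining rows are the posts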

In [5]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [6]:
# Function to easily explore the data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # prints two blank lines between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [7]:
print(hn_header)
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 20100
Number of columns: 7

The Hacker News Posts data set has 20100 rows and 7 columns. At a quick glance, the columns most useful for our analysis are `'title'`, `'num_comments'`, and `'created_at'`.
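As a small sketch of how these columns could be looked up by name rather than by a hard-coded position, the header row printed above can be turned into a name-to-index mapping. The names `column_index` and `sample_row` are introduced here for illustration only; the row is the first one shown in the output above.

```python
# Header and first data row, copied from the output above
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position (hypothetical helper, not part of the original analysis)
column_index = {name: position for position, name in enumerate(hn_header)}

title = sample_row[column_index['title']]
num_comments = int(sample_row[column_index['num_comments']])  # CSV values are strings
created_at = sample_row[column_index['created_at']]

print(title, num_comments, created_at)
```

The positions this produces (1, 4 and 6) are the same indexes the rest of the analysis uses directly.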

## Extracting "Ask HN" and "Show HN" posts

In [21]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Count of Ask HN posts: ", len(ask_posts))
print("Count of Show HN posts: ", len(show_posts))
print("Count of Other posts: ", len(other_posts))

Count of Ask HN posts:  1744
Count of Show HN posts:  1162
Count of Other posts:  17194


There are 1744 Ask HN posts, 1162 Show HN posts, and 17194 posts of other types.
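The case-insensitive prefix test used above can be tried on a few titles in isolation. The `classify` helper and the sample titles below are illustrative assumptions; only the `lower().startswith(...)` check mirrors the loop above.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Illustrative titles, not a sample of the real Ask HN / Show HN posts
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
    "Why I ask HN for advice",  # contains "ask HN" but does not start with it
]

for title in sample_titles:
    print(classify(title), "-", title)
```

Because the check is anchored to the start of the title, a post that merely mentions "ask HN" later in the title lands in the "other" group.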

## Calculating the average number of comments on "Ask HN" and "Show HN" posts

In [9]:
# Average number of comments on Ask_HN posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

# Average number of comments on Show_HN posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


I see that Ask HN posts receive more comments on average than Show HN posts (about 14.04 versus 10.32). There are also about 50% more Ask HN posts than Show HN posts (1744 versus 1162).
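As a quick check of this comparison, the ratio of the two counts and of the two averages can be computed from the figures printed above. The variable names `count_ratio` and `avg_ratio` are introduced here for illustration.

```python
# Figures copied from the outputs above
ask_count, show_count = 1744, 1162
avg_ask_comments, avg_show_comments = 14.038417431192661, 10.31669535283993

# How many Ask HN posts there are per Show HN post
count_ratio = ask_count / show_count
# How much larger the Ask HN average is than the Show HN average
avg_ratio = avg_ask_comments / avg_show_comments

print("Ask HN posts per Show HN post: {:.2f}".format(count_ratio))
print("Ask HN average / Show HN average: {:.2f}".format(avg_ratio))
```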

Since Ask HN posts are more likely to receive comments, I'll focus on these posts.

Now, let's determine whether posts created at a certain time of the day are more likely to receive comments.

## Calculate the number of posts created in each hour of the day along with the number of comments received

In [24]:
import datetime as dt

# Pair each Ask HN post's creation time with its comment count
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"  # matches values such as '8/4/2016 11:52'
for created_at, num_comments in result_list:
    # Reduce the timestamp to a two-digit hour string, e.g. '11'
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

counts_by_hour


{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

The `counts_by_hour` dictionary maps each hour (as a two-digit string) to the number of Ask HN posts created during that hour.

In [26]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

The `comments_by_hour` dictionary maps each hour (as a two-digit string) to the total number of comments received by Ask HN posts created during that hour.
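The same tallies could also be built with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an alternative sketch, not the approach used above: `tally_by_hour` is a hypothetical name, and the `sample` pairs reuse the `created_at` and `num_comments` values from the five rows printed earlier (which are not Ask HN posts).

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Return (post counts, comment totals), each keyed by a two-digit hour string."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1            # missing keys start at 0
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# (created_at, num_comments) taken from the first five rows of the data set
sample = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 22:20", 1),
    ("6/17/2016 0:01", 1),
    ("9/30/2015 4:12", 2),
]

counts, comments = tally_by_hour(sample)
print(counts)
print(comments)
```

Note that an hour such as `0:01` is normalised to the key `'00'`, which is why the keys in the dictionaries above are always two digits long.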

## Calculate the average number of comments for Ask HN posts per hour

In [27]:
avg_by_hour = []
for hour in counts_by_hour:
    avg = int(comments_by_hour[hour]) / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

In [28]:
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

### Swapping the columns so the average comes before the hour

In [29]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


### Sorting the swapped list by average number of comments

In [31]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
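Swapping the columns is one way to sort by the average; another option is to pass a `key` function to `sorted()` and leave each `[hour, average]` pair as it is. The sketch below uses three pairs copied from `avg_by_hour` above; `subset` and `by_average` are names introduced for illustration.

```python
# Three [hour, average] pairs copied from avg_by_hour above
subset = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first
by_average = sorted(subset, key=lambda pair: pair[1], reverse=True)

for hour, average in by_average:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

With this approach the intermediate swapped list is not needed, at the cost of a slightly less obvious `sorted()` call.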


In [34]:
print("Top 5 for Ask Posts Comments")

for average, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"),average
        )
    )

Top 5 for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Ask HN posts created in the 15:00 hour receive the most comments on average, about 38.59 per post. The second-highest hour, 02:00, averages 23.81 comments per post, so the top hour leads by nearly 15 comments per post.

# Conclusion

In this project, we analyzed Ask HN and Show HN posts to determine which type of post and which posting hour receive the most comments on average. Based on my analysis, to maximize the number of comments a post receives, I'd recommend making it an Ask HN post and creating it between 15:00 and 16:00.

However, it should be noted that the data set I analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, Ask HN posts received more comments on average, and Ask HN posts created between 15:00 and 16:00 received the most comments on average.

# To be continued

That's it for the guided steps! Here's a quick summary of what we accomplished in this guided project:

- We set a goal for the project.
- We collected and sorted the data.
- We reformatted and cleaned the data to prepare it for analysis.
- We analyzed the data.

Guided projects can be used to build a portfolio to showcase to potential employers, so we encourage you to keep working on this. Here are some next steps for you to consider:

- Determine if show or ask posts receive more points on average.
- Determine if posts created at a certain time are more likely to receive more points.
- Compare your results to the average number of comments and points other posts receive.
- Use Dataquest's data science project style guide to format your project.
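
As a possible starting point for the first of these next steps, an average over any numeric column could be computed with one small helper. This is only a sketch: `average_column` is a hypothetical name, and `sample_rows` holds the first three rows of the data set with the URLs left empty for brevity, so the numbers it prints say nothing about Ask HN or Show HN posts.

```python
def average_column(rows, column):
    """Average the integer values found at position `column` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[column]) for row in rows) / len(rows)

# Columns: id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source ...', '', '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', 'Florida DJs May Face Felony ...', '', '2', '1', 'vezycash', '6/23/2016 22:20'],
]

print("Average points:", average_column(sample_rows, 3))
print("Average comments:", average_column(sample_rows, 4))
```

In the notebook itself, the same helper could be called as `average_column(ask_posts, 3)` and `average_column(show_posts, 3)` to compare points between the two groups.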
