# Hacker News Post Analysis

Hello, guys! This is my second data analysis project. Here we will analyze the *Ask HN* and *Show HN* posts, discover which type receives more comments on average, and check whether posts created at a certain time receive more comments on average.

The data set we will be using can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), and has approximately 20,000 rows. 

The columns in the data set are as follows: 

- `id`: the unique identifier from Hacker News for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if the post has a URL
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time at which the post was submitted

Let's take a look at the first few rows of the data set:

In [1]:
from csv import reader

# Read the whole CSV file into a list of rows (each row is a list of strings)
with open('/Catharine/DataSets/hacker_news_reduced.csv', encoding='utf8') as open_file:
    hn = list(reader(open_file))

for row in hn[:4]:
    print(row)
    print('\n')
    
print(len(hn))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


20101


As you can see, the data set contains the header on the first row. Let's separate the header and the actual data so we can further analyze it. 

In [2]:
header_hn = hn[0]  # the header row itself, not a list containing it
hn = hn[1:]

for row in hn[:5]:
    print(row)
    print('\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
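
Once the header is separated, it can be paired with any row to look up fields by name instead of by position. Here is a minimal sketch, using the header and the first row shown above (the `header`, `row`, and `post` names are only for illustration):

```python
# Header and first data row, copied from the output above.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

# zip pairs each column name with the value in the same position.
post = dict(zip(header, row))

# Values read from a CSV are strings, so numeric fields need converting.
num_comments = int(post['num_comments'])

print(post['title'], num_comments)
```

This is optional: the rest of the analysis keeps using positional indexes such as `row[4]`.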




### Cleaning the data

The next step is to filter the data set for only the information we're looking for. Since we're interested in the `Ask HN` and `Show HN` posts, we will separate those from the rest.

The `ask_posts` data set will contain only the posts whose titles begin with `Ask HN`, the `show_posts` data set will contain only the posts whose titles begin with `Show HN`, and finally `other_posts` will contain the rest of the original data set. The comparison is case-insensitive.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else: 
        other_posts.append(row)

### Analyzing the data

Now that we have the data separated by type, we will analyze which type receives the most comments on average.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average Comments on Ask Posts', avg_ask_comments)


total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = round(total_show_comments / len(show_posts),2)
print('Average Comments on Show Posts',avg_show_comments)

Average Comments on Ask Posts 14.038417431192661
Average Comments on Show Posts 10.32


From the output above, we can see that Ask posts receive about 14.04 comments per post on average, while Show posts receive only 10.32.
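
The two loops do the same work, so the calculation could also be wrapped in a helper. This is only a sketch: `average_comments` is a hypothetical name, and the sample rows are made up for illustration (only index 4, `num_comments`, matters here).

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts in a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '20', 'user_b', '8/4/2016 15:10'],
]

print(round(average_comments(sample_posts), 2))
```

With the real data, the same function could be called once with `ask_posts` and once with `show_posts`.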

That being said, we will continue our analysis with only the Ask posts. The next question we want to answer is whether posts created at a certain time receive more comments on average.

In [5]:
import datetime as dt
result_list = [] ## list of lists

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {} ## number of ask posts created during each hour of the day.
comments_by_hour = {} ## corresponding number of comments ask posts created at each hour received.
date_format = "%m/%d/%Y %H:%M"

# Running through the result list to count posts by hour, and comments by hour
for row in result_list:
    date_time = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date_time, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

# creating a list with the average number of comments per hour
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    

# ordering the averages from greatest to least
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Getting top Five Results
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    avg = row[0] 
    time = dt.datetime.strptime(row[1], "%H")
    time = time.strftime("%H:%M")
    print("{t}: {a:.2f} average comments per post".format(t=time, a=avg))
    

    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
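
The swap step is there so that `sorted` compares the averages first. An alternative sketch passes a `key` function to `sorted`, which avoids building the extra list. The values below are made up for illustration and are not taken from the data set:

```python
# [hour, average] pairs in the same shape as avg_by_hour (made-up values).
avg_by_hour = [['09', 5.5], ['15', 38.5], ['02', 23.8], ['20', 21.5]]

# Sort on the second element (the average), highest first.
sorted_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_avg[:3]:
    print("{h}:00: {a:.2f} average comments per post".format(h=hour, a=avg))
```

Both approaches give the same ordering whenever the averages are distinct; the `key` version just keeps each pair in its original `[hour, average]` order.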


According to the output above, *Ask HN* posts created at 15:00, 02:00, 20:00, 16:00, or 21:00 received the most comments on average. One thing to consider is the time zone. The [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts) tells us the time zone is Eastern Time in the US. To convert to my time zone (BRT), I shift the times by three hours, which gives the times below:

In [6]:
for row in sorted_swap[:5]:
    avg = row[0] 
    time = dt.datetime.strptime(row[1], "%H")
    time_plus_3h = time + dt.timedelta(hours = 3)
    time_plus_3h = time_plus_3h.strftime("%H:%M")
    
    print("{t}: {a:.2f} average comments per post".format(t=time_plus_3h, a=avg))

18:00: 38.59 average comments per post
05:00: 23.81 average comments per post
23:00: 21.52 average comments per post
19:00: 16.80 average comments per post
00:00: 16.01 average comments per post
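
The same shift can be written with integer arithmetic, using the modulo operator to wrap past midnight. This is a sketch: `shift_hour` is a hypothetical helper, and the three-hour offset is simply the one used in the cell above (the real difference between two time zones can change with daylight saving time).

```python
def shift_hour(hour, offset_hours):
    """Shift an hour of the day (0-23) by an offset, wrapping around midnight."""
    return (hour + offset_hours) % 24

# The top five hours from the analysis above.
top_hours = [15, 2, 20, 16, 21]

shifted = [shift_hour(h, 3) for h in top_hours]
print(["{:02d}:00".format(h) for h in shifted])
```

The modulo is what turns 21:00 plus three hours into 00:00 rather than 24:00.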


### Conclusion

When we compared *Ask HN* and *Show HN* posts, we saw that *Ask HN* posts get more comments on average. Among the *Ask HN* posts, those created at 15:00, 02:00, 20:00, 16:00, or 21:00 Eastern Time received the most comments on average. That concludes this project!