# Exploring Hacker News (HN) Posts

### In which I ponder whether to ask or to show, and to stay beautiful or to be less youthful.

---

So, I actually have a question that lingers in my mind, and [Hacker News (HN)](https://news.ycombinator.com/) seems like an appropriate place to ask. 

Luckily, this guided project from Dataquest will help me explore the HN posts by analyzing these questions:
1. **Do 'Ask HN' or 'Show HN' posts receive more comments on average?** An 'Ask HN' post asks the community a specific question, for example, ["Side projects that are making money, but you'd not talk about them?"](https://news.ycombinator.com/item?id=23438930) (I bet you'd want to take a peek at that one, haha). A 'Show HN' post shares something interesting, such as a project or product, for example, ["Anytime, a Simple Time Converter"](https://news.ycombinator.com/item?id=23437682). I didn't create either of these; I just happened to stumble upon them and found them interesting!
2. **Do posts created at a certain time receive more comments on average?** I'll refer to that neat 'Anytime' time converter stated above for the timezone name.

Of course, there are other more important factors, such as the relevancy of the posts and whether the topic is general and/or popular or very niche. 

In my case, my question is quite niche and unpopular (trust me, I've searched all of HN and found few to no posts about it), so I don't really expect to get answers if I create an 'Ask HN' post. I've even considered gauging the users' interest in that particular topic first by creating a 'Show HN' post and, if the points are high, proceeding with 'Ask HN'. Anyway! That sounds like me getting ahead of myself. Let's keep things simple, weigh the time factor first, and work from there.

The dataset for this project can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing submissions with no comments and then randomly sampling from the remaining submissions.

And now, since I'm blabbering too much already, let's begin!

*...oh, btw, my question is: "What are the best Massive Open Online Courses (MOOCs) related to clinical/medical/healthcare informatics?" Please contact me if you happen to know the answer, thanks :)*

---
## First, let's see the file...

In [1]:
import csv

# Read the whole file into a list of rows; the with-block closes the file.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(csv.reader(opened_file))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

Before I dive in further, let's separate that distracting header row from the posts.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
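
As a side note, the header can also be split off while reading, without slicing the list afterwards. Here is a minimal sketch of that idea; the in-memory `sample_csv` and its single made-up row are assumptions standing in for `hacker_news.csv`, and `posts` is just an illustrative name.

```python
import csv
import io

# Made-up one-row CSV standing in for hacker_news.csv (illustration only).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Example post,http://example.com/,3,1,user_a,6/17/2016 0:01\n"
)

reader = csv.reader(sample_csv)
headers = next(reader)  # consume the first row as the header
posts = list(reader)    # everything left is data

print(headers)
print(posts)
```

With a real file, the same two lines (`next(reader)` and `list(reader)`) would sit inside the block that opens the file.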


---

## ...and now: to ask or to show?

Since I'm interested in analyzing posts that contain 'Ask HN' and 'Show HN', let's create lists to separate these posts and see the distribution of the posts and their comments.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("There are " + str((len(ask_posts))) + " 'Ask HN' posts, "
      + str((len(show_posts))) + " 'Show HN' posts, and "
      + str((len(other_posts))) + " other HN posts.")

There are 1744 'Ask HN' posts, 1162 'Show HN' posts, and 17194 other HN posts.


In [4]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)

print("Average 'Ask HN' comments is {:.2f}".format(avg_ask_comments)
     + ", while the average 'Show HN' comments is {:.2f}.".format(avg_show_comments) 
)

Average 'Ask HN' comments is 14.04, while the average 'Show HN' comments is 10.32.


It seems like 'Ask HN' posts receive more love than 'Show HN' posts (about 14 and 10 comments on average, respectively). Well, it makes total sense, doesn't it? People ask questions, people give answers. Ok then, this suggests that I should just create the 'Ask HN' post rather than create a 'Show HN' post first.
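
The two loops above do the same thing for different lists, so the averaging could be pulled into one small helper. This is only a sketch: `average_column` and the made-up `sample_rows` are my own illustrative names and data, not part of the guided project.

```python
def average_column(rows, column_index):
    """Return the mean of an integer column across rows (0.0 if rows is empty)."""
    if not rows:
        return 0.0
    return sum(int(row[column_index]) for row in rows) / len(rows)


# Tiny made-up rows in the dataset's column order, for illustration only.
sample_rows = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: example B', '', '7', '20', 'user_b', '8/4/2016 12:10'],
]
NUM_COMMENTS = 4  # index of 'num_comments' in the header row

print(average_column(sample_rows, NUM_COMMENTS))
```

On the real data, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` would replace the two hand-written loops.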

---

## Next: to stay awake, or to rise early?

Let's see if 'Ask HN' posts created at a certain time are more likely to attract comments by performing the following analyses:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    created = row[6]
    comments = int(row[4])
    result_list.append([created, comments])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
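
The "is this hour already in the dictionary?" check can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below assumes the same date format as the dataset; `tally_by_hour` and `sample_pairs` are hypothetical names with made-up values.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(pairs):
    """Count posts and sum a value per hour from (created_at, value) pairs."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for created_at, value in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1      # missing hours start at 0
        totals[hour] += value
    return dict(counts), dict(totals)


# Made-up (created_at, comments) pairs, for illustration only.
sample_pairs = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 11:05", 10),
    ("6/23/2016 22:20", 1),
]
counts, totals = tally_by_hour(sample_pairs)
print(counts)
print(totals)
```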

In [6]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['02', 23.810344827586206],
 ['23', 7.985294117647059],
 ['20', 21.525],
 ['18', 13.20183486238532],
 ['07', 7.852941176470588],
 ['10', 13.440677966101696],
 ['12', 9.41095890410959],
 ['06', 9.022727272727273],
 ['13', 14.741176470588234],
 ['00', 8.127272727272727],
 ['14', 13.233644859813085],
 ['05', 10.08695652173913],
 ['09', 5.5777777777777775],
 ['01', 11.383333333333333],
 ['03', 7.796296296296297],
 ['11', 11.051724137931034],
 ['22', 6.746478873239437],
 ['21', 16.009174311926607],
 ['15', 38.5948275862069],
 ['19', 10.8],
 ['17', 11.46],
 ['04', 7.170212765957447],
 ['08', 10.25],
 ['16', 16.796296296296298]]

Um, this is a little confusing. Let's swap the hour and the average, then sort by the average value.

In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
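
Swapping the columns works because `sorted` compares the first element of each inner list. An alternative sketch, assuming the `[hour, average]` layout is kept, is to tell `sorted` which element to compare via its `key` argument. The three sample values are rounded from the output above.

```python
# [hour, average] pairs, rounded from the averages shown above.
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort by the second element (the average), highest first, without swapping.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(ranked)
```

The pairs stay in `[hour, average]` order, so later loops would unpack them as `for each_hour, each_average in ranked`.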

Now, let's see the top 5 hours for 'Ask HN' posts comments.

In [8]:
print("Top 5 Hours for 'Ask HN' Posts Comments")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average comments per post"
    print(top_5.format(
            dt.datetime.strptime(each_hour, "%H").strftime("%H:%M"),each_average
        )
    )

Top 5 Hours for 'Ask HN' Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Since the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts) points out that the timestamps are in EST (US Eastern time), and I live on the other side of the world (GMT +7), the conversion will determine whether I can get my beauty sleep or not. Hopefully so.
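
Before reaching for a library, the arithmetic itself is simple: EST is UTC-5 and my timezone is UTC+7, so I'm 12 hours ahead. A minimal sketch with fixed offsets follows; `shift_hour` is a hypothetical helper, and it deliberately ignores Daylight Saving Time.

```python
EST_UTC_OFFSET = -5      # Eastern Standard Time
JAKARTA_UTC_OFFSET = 7   # GMT +7


def shift_hour(hour, from_offset, to_offset):
    """Move an hour of the day between two fixed UTC offsets, wrapping at 24."""
    return (hour + (to_offset - from_offset)) % 24


for est_hour in (15, 2, 20):
    local_hour = shift_hour(est_hour, EST_UTC_OFFSET, JAKARTA_UTC_OFFSET)
    print("{:02d}:00 EST -> {:02d}:00 GMT +7".format(est_hour, local_hour))
```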

Now, let's create a function to convert the timezone (I got this awesome code from Stack Overflow).

In [9]:
import datetime
import pytz

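def convert_datetime_timezone(hour_str, tz1, tz2):
    """Convert an hour string such as '15' from timezone tz1 to timezone tz2."""
    tz1 = pytz.timezone(tz1)
    tz2 = pytz.timezone(tz2)

    # strptime fills in the default date 1900-01-01, so the conversion uses
    # the offsets in effect on that date and never applies DST.
    parsed = datetime.datetime.strptime(hour_str, "%H")
    localized = tz1.localize(parsed)
    converted = localized.astimezone(tz2)

    return converted.strftime("%H")

# Smoke test: the returned values are discarded here.
for row in sorted_swap[:5]:
    convert_datetime_timezone(row[1], "US/Eastern", "Asia/Jakarta")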

Perfect. I spent hours understanding this chunk of code and applying it in my for loop. Phew.

Now, let's see the top 5 hours in my local timezone for 'Ask HN' posts comments.

In [10]:
print("Top 5 Hours for 'Ask HN' Posts Comments (GMT +7)")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average comments per post"
    newhr = convert_datetime_timezone(each_hour, "US/Eastern", "Asia/Jakarta")
    print(top_5.format(
            dt.datetime.strptime(newhr, "%H").strftime("%H:%M"),each_average
        )
    )

Top 5 Hours for 'Ask HN' Posts Comments (GMT +7)
03:00: 38.59 average comments per post
14:00: 23.81 average comments per post
08:00: 21.52 average comments per post
04:00: 16.80 average comments per post
09:00: 16.01 average comments per post


Uh, oh. 3 AM?! Bye, beauty sleep :(

---

## Go on, feed your curiosity on 'points'

The guided project actually ends at that 'Top 5 Hours' list. Yay!

But... yeah, I can't just ignore that 'points' data now that I have some thoughts on it.

As I've mentioned before, 'Ask HN' posts may receive more comments because people obviously would give answers. On closer look, the dataset provides us with another interesting data point called 'points'.

Maybe 'points' are to 'Show HN' posts what 'comments' are to 'Ask HN' posts. It's possible that the more interesting the content shared in a 'Show HN' post, the more points it receives. And while we're at it, let's weigh the time factor as well.

Let's check it out!

In [11]:
# Average point on 'Ask HN' posts
total_ask_points = 0

for row in ask_posts:
    total_ask_points += int(row[3])

avg_ask_points = total_ask_points / len(ask_posts)

# Average point on 'Show HN' posts
total_show_points = 0

for row in show_posts:
    total_show_points += int(row[3])

avg_show_points = total_show_points / len(show_posts)

print("Average 'Ask HN' points is {:.2f}".format(avg_ask_points)
     + ", while the average 'Show HN' points is {:.2f}.".format(avg_show_points))

Average 'Ask HN' points is 15.06, while the average 'Show HN' points is 27.56.


Whoa, my guess was correct! 'Show HN' posts have better average points than 'Ask HN' posts.

So, let's focus only on 'Show HN' posts here and check at what hours of the day (in my local time) these posts receive the most points on average.

In [12]:
# Calculate the amount of 'Show HN' posts created during each hour of day 
# and the number of points received.

import datetime as dt

result_list = []

for row in show_posts:
    created = row[6]
    points = int(row[3])
    result_list.append([created, points])
    
counts_by_hour = {}
points_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    point = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        points_by_hour[time] = point
    else:
        counts_by_hour[time] += 1
        points_by_hour[time] += point

# Calculate the average amount of points `Show HN` posts created 
# at each hour of the day receive, sorted and swapped.

avg_by_hour = []

for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
    
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Now print the 5 hours with the highest average points in local time

print("Top 5 Hours for 'Show HN' Posts Points (GMT +7)")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average points per post"
    newhr = convert_datetime_timezone(each_hour, "US/Eastern", "Asia/Jakarta")
    print(top_5.format(
            dt.datetime.strptime(newhr, "%H").strftime("%H:%M"),each_average
        )
    )

Top 5 Hours for 'Show HN' Posts Points (GMT +7)
11:00: 42.39 average points per post
00:00: 41.69 average points per post
10:00: 40.35 average points per post
12:00: 37.84 average points per post
06:00: 36.31 average points per post


This is a relief. Look at those lovely late-morning and midday hours for me to post 'Show HN' to get more points.

---

## What about the 'other HN' posts?

I suspect that 'other HN' posts will score higher in both average comments and points. Let's see.

In [13]:
# Average number of comments `other HN` posts receive.
total_other_comments = 0

for row in other_posts:
    total_other_comments += int(row[4])
    
avg_other_comments = total_other_comments / len(other_posts)

# Average point on 'other HN' posts
total_other_points = 0

for row in other_posts:
    total_other_points += int(row[3])

avg_other_points = total_other_points / len(other_posts)

print("Average 'Ask HN' comments is {:.2f}.".format(avg_ask_comments))
print("Average 'Show HN' comments is {:.2f}.".format(avg_show_comments))
print("Average 'other HN' comments is {:.2f}.".format(avg_other_comments))
print("\n")
print("Average 'Ask HN' points is {:.2f}.".format(avg_ask_points))
print("Average 'Show HN' points is {:.2f}.".format(avg_show_points))
print("Average 'other HN' points is {:.2f}.".format(avg_other_points))

Average 'Ask HN' comments is 14.04.
Average 'Show HN' comments is 10.32.
Average 'other HN' comments is 26.87.


Average 'Ask HN' points is 15.06.
Average 'Show HN' points is 27.56.
Average 'other HN' points is 55.41.


Yup, for 'other HN' posts, both the average comments and the average points are higher than for 'Ask HN' and 'Show HN' posts.

Since the average points for 'other HN' posts are higher than the average comments, let's focus on the top 5 hours for getting more points on 'other HN' posts.

In [14]:
# Calculate the amount of 'other HN' posts created during each hour of day 
# and the number of points received.

import datetime as dt

result_list = []

for row in other_posts:
    created = row[6]
    points = int(row[3])
    result_list.append([created, points])
    
counts_by_hour = {}
points_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    point = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        points_by_hour[time] = point
    else:
        counts_by_hour[time] += 1
        points_by_hour[time] += point

# Calculate the average amount of points `other HN` posts created 
# at each hour of the day receive, sorted and swapped.

avg_by_hour = []

for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
    
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Now print the 5 hours with the highest average points in local time

print("Top 5 Hours for 'other HN' Posts Points (GMT +7)")
for each_average, each_hour in sorted_swap[:5]:
    top_5 = "{}: {:.2f} average points per post"
    newhr = convert_datetime_timezone(each_hour, "US/Eastern", "Asia/Jakarta")
    print(top_5.format(
            dt.datetime.strptime(newhr, "%H").strftime("%H:%M"),each_average
        )
    )

Top 5 Hours for 'other HN' Posts Points (GMT +7)
01:00: 62.53 average points per post
02:00: 61.79 average points per post
03:00: 60.54 average points per post
22:00: 60.48 average points per post
07:00: 60.01 average points per post


Ooookay, the top 5 hours are again the enemy of my beauty sleep LOL
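
Looking back, the last two analyses repeat the same tally, average, and sort steps with only the list of posts changed. A reusable function could fold them together. This is a sketch under my own naming (`top_hours_by_average` and the made-up `sample_rows` are hypothetical), and it returns hours in the dataset's timezone, before any conversion to local time.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"


def top_hours_by_average(rows, value_index, created_index=6, limit=5):
    """Return up to `limit` (hour, average) pairs, highest average first."""
    counts, totals = {}, {}
    for row in rows:
        hour = dt.datetime.strptime(row[created_index], DATE_FORMAT).strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
    averages = [(hour, totals[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:limit]


# Made-up rows in the dataset's column order, for illustration only.
sample_rows = [
    ['1', 'Show HN: demo A', '', '40', '3', 'user_a', '8/4/2016 11:52'],
    ['2', 'Show HN: demo B', '', '20', '1', 'user_b', '8/5/2016 11:10'],
    ['3', 'Show HN: demo C', '', '10', '0', 'user_c', '8/5/2016 22:20'],
]

for hour, average in top_hours_by_average(sample_rows, value_index=3):
    print("{}:00: {:.2f} average points per post".format(hour, average))
```

Passing `value_index=4` instead would rank hours by average comments, so the same function covers both questions.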

---

## Conclusion
### ...less youthful yet stay beautiful.

Well, here's what I need to do if I want more comments and more points on HN posts:
1. Post my question as an 'Ask HN' post to maximize comments, or create an 'other HN' post to maximize points, at 3 AM. No beauty sleep.
2. Create a 'Show HN' post related to my question around midday and expect more points on it than on the 'Ask HN' post (where I expect more comments). Yes beauty sleep.

---

## Extras
### Homework

I need to learn how to incorporate Daylight Saving Time (DST) into the code, perhaps by factoring in the post dates and maybe those fascinating equinoxes and solstices.
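
One possible direction for this homework is to convert each post's full timestamp instead of just its hour, so the date decides whether DST applies. The sketch below uses the standard-library `zoneinfo` module (Python 3.9+, and it needs timezone data from the system or the `tzdata` package). It assumes the dataset's times are US Eastern wall-clock times, which the documentation does not spell out; `to_local_hour` is a hypothetical helper, and the two timestamps are made up.

```python
import datetime as dt
from zoneinfo import ZoneInfo

DATE_FORMAT = "%m/%d/%Y %H:%M"


def to_local_hour(created_at, source_tz="America/New_York", target_tz="Asia/Jakarta"):
    """Convert a full timestamp between timezones and return the local hour."""
    naive = dt.datetime.strptime(created_at, DATE_FORMAT)
    # Assumption: the timestamp is wall-clock time in source_tz.
    aware = naive.replace(tzinfo=ZoneInfo(source_tz))
    return aware.astimezone(ZoneInfo(target_tz)).strftime("%H")


winter_hour = to_local_hour("1/26/2016 15:00")  # January: standard time, UTC-5
summer_hour = to_local_hour("8/4/2016 15:00")   # August: daylight time, UTC-4
print(winter_hour, summer_hour)
```

The same 15:00 post lands in a different local hour depending on the season, which is exactly the detail the hour-only conversion cannot see.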

### So I really did test this on HN!
...but unfortunately, [my 'Ask HN' post](https://news.ycombinator.com/item?id=23441496) received no reply :( This is okay, since I only considered one factor (i.e., time) and the topic itself is not a popular or general one on HN.

Anyway, this was a fun project and well worth the lack of beauty sleep :p