# Exploring Hacker News Posts

In this project I explore data from the [Hacker News](https://news.ycombinator.com/) website, a forum for sharing news and stories about recent software, computer technologies, and other relevant IT topics. <br>
I use only the Python standard library (no pandas, NumPy, etc.).

A typical post on HN is called a story and is composed of the following fields: <br>
<ul>
<li>id: unique identifier of the story</li>
<li>title: title of the post</li>
<li>url: the URL the post links to (many posts have one, but not all)</li>
<li>num_points: number of upvotes the post received</li>
<li>num_comments: number of comments on the post</li>
<li>author: the author of the post</li>
<li>created_at: date and time the post was created</li>
</ul>
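
As a side note on the fields listed above, here is a minimal sketch of how they could be accessed by name instead of by position, using `csv.DictReader` from the standard library. The two sample rows are copied from the dataset and inlined through `io.StringIO` so the snippet runs without the file; the names `sample_csv` and `stories` are illustrative and are not used elsewhere in this notebook.

```python
import csv
import io

# Two rows copied from the dataset, inlined so the sketch runs without the CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader takes the first line as field names, so every row becomes a dict.
stories = list(csv.DictReader(sample_csv))

first = stories[0]
print(first["title"])              # Interactive Dynamic Video
print(int(first["num_comments"]))  # 52
```

The rest of the notebook keeps the plain `reader` approach and indexes columns by position (for example, `row[4]` for `num_comments`).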

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

In [2]:
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [3]:
headers = hn[0]
hn = hn[1:]

In [4]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [5]:
hn[0:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [6]:
print(len(hn))

20100


The dataset contains 20,100 posts (the header row is excluded).

### Extracting Ask HN and Show HN Posts

Some posts on HN are categorized as "Ask HN" or "Show HN". I'm going to separate these two categories from all other posts by checking how each title starts, ignoring case.

In [7]:
ask_posts = []
show_posts = []
other_posts = []

In [8]:
for row in hn:
    # Lowercase the title so the prefix check is case-insensitive
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [9]:
ask_posts[0:3]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14']]

In [10]:
show_posts[0:3]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05']]

In [11]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
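
To put these counts in perspective, the following sketch converts them into shares of the whole dataset. The three counts are the ones printed above; the names `category_counts`, `total_posts`, and `shares` are illustrative.

```python
# Category sizes taken from the output above.
category_counts = {"Ask HN": 1744, "Show HN": 1162, "Other": 17194}
total_posts = sum(category_counts.values())  # 20100, matching len(hn)

# Share of each category as a percentage of all posts.
shares = {name: count / total_posts * 100 for name, count in category_counts.items()}

for name, share in shares.items():
    print("{}: {:.2f}% of posts".format(name, share))
# Ask HN: 8.68% of posts
# Show HN: 5.78% of posts
# Other: 85.54% of posts
```

Ask HN and Show HN posts together make up roughly 14% of the dataset, so the averages in the next sections are computed on fairly small subsets.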


### Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [12]:
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

In [13]:
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [14]:
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

In [15]:
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


On average, an Ask HN post receives about 14 comments, while a Show HN post receives about 10.
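
The two averages above are computed with the same loop, so the calculation could be wrapped in a small helper. This is only a sketch: `average_comments` and its `comments_index` parameter are hypothetical names, and the sample rows are the first three Ask HN posts shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# First three Ask HN rows from the dataset (num_comments is column 4).
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '',
     '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask_posts))  # 12.0
```

Called on the full `ask_posts` and `show_posts` lists, the same helper should reproduce the two averages printed above.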

### Finding the Number of Ask Posts and Comments by Hour Created

In [16]:
import datetime as dt

In [17]:
result_list = []

In [18]:
for row in ask_posts:
    time = row[6]
    comments = int(row[4])
    result_list.append((time, comments))

In [19]:
result_list[0]

('8/16/2016 9:55', 6)

In [20]:
counts_by_hour = {}
comments_by_hour = {}

In [21]:
for item in result_list:
    # Parse the timestamp and keep only the hour of the day (0-23)
    time = dt.datetime.strptime(item[0], "%m/%d/%Y %H:%M")
    hour = time.hour
    comments = item[1]  # already converted to int when result_list was built
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

In [22]:
counts_by_hour

{9: 45,
 13: 85,
 10: 59,
 14: 107,
 16: 108,
 23: 68,
 12: 73,
 17: 100,
 15: 116,
 21: 109,
 20: 80,
 2: 58,
 18: 109,
 3: 54,
 5: 46,
 19: 110,
 1: 60,
 22: 71,
 8: 48,
 4: 47,
 0: 55,
 6: 44,
 7: 34,
 11: 58}

In [23]:
comments_by_hour

{9: 251,
 13: 1253,
 10: 793,
 14: 1416,
 16: 1814,
 23: 543,
 12: 687,
 17: 1146,
 15: 4477,
 21: 1745,
 20: 1722,
 2: 1381,
 18: 1439,
 3: 421,
 5: 464,
 19: 1188,
 1: 683,
 22: 479,
 8: 492,
 4: 337,
 0: 447,
 6: 397,
 7: 267,
 11: 641}

In [24]:
avg_by_hour = []
for i in range(0, 24):
    posts = counts_by_hour[i]
    comments = comments_by_hour[i]
    avg_comm = comments/posts
    avg_by_hour.append([i, avg_comm])

In [25]:
print(avg_by_hour)

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]
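
The counting and averaging steps above can also be sketched more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. The first three entries of `sample_results` come from the Ask HN rows shown earlier; the fourth is a hypothetical entry added only so that one hour has more than one post. The names `counts`, `comment_totals`, and `avg_by_hour_sketch` are illustrative.

```python
import datetime as dt
from collections import defaultdict

sample_results = [
    ('8/16/2016 9:55', 6),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 10:14', 1),
    ('1/1/2016 9:05', 4),  # hypothetical entry, not from the dataset
]

counts = defaultdict(int)          # posts created in each hour
comment_totals = defaultdict(int)  # comments received by those posts

for created_at, comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comment_totals[hour] += comments

# Iterate only over hours that actually occur, in ascending order.
avg_by_hour_sketch = [
    [hour, comment_totals[hour] / counts[hour]] for hour in sorted(counts)
]
print(avg_by_hour_sketch)  # [[9, 5.0], [10, 1.0], [13, 29.0]]
```

Looping over `sorted(counts)` rather than `range(0, 24)` avoids a `KeyError` if some hour has no posts, which does not happen in this dataset but could in a smaller one.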


### Sorting and Printing Values from a List of Lists

In [26]:
swap_avg_by_hour = []
for element in avg_by_hour:
    first = element[0]
    second = element[1]
    swap_avg_by_hour.append([second, first])

In [27]:
swap_avg_by_hour

[[8.127272727272727, 0],
 [11.383333333333333, 1],
 [23.810344827586206, 2],
 [7.796296296296297, 3],
 [7.170212765957447, 4],
 [10.08695652173913, 5],
 [9.022727272727273, 6],
 [7.852941176470588, 7],
 [10.25, 8],
 [5.5777777777777775, 9],
 [13.440677966101696, 10],
 [11.051724137931034, 11],
 [9.41095890410959, 12],
 [14.741176470588234, 13],
 [13.233644859813085, 14],
 [38.5948275862069, 15],
 [16.796296296296298, 16],
 [11.46, 17],
 [13.20183486238532, 18],
 [10.8, 19],
 [21.525, 20],
 [16.009174311926607, 21],
 [6.746478873239437, 22],
 [7.985294117647059, 23]]

In [28]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [29]:
print("Top Commented Hours for Ask HN Posts")
print("------------------------------------")
for i in range(5):
    print("{}:00: {:.2f} average comments per post".format(sorted_swap[i][1], sorted_swap[i][0]))

Top Commented Hours for Ask HN Posts
------------------------------------
15:00: 38.59 average comments per post
2:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
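
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that picks the value to sort on. The sketch below uses a subset of the `[hour, average]` pairs printed earlier; `avg_by_hour_subset` and `top_hours` are illustrative names.

```python
# A subset of the [hour, average comments] pairs computed above.
avg_by_hour_subset = [
    [2, 23.810344827586206],
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [16, 16.796296296296298],
    [20, 21.525],
    [21, 16.009174311926607],
]

# Sort by the average (second element), highest first, and keep the top three.
top_hours = sorted(avg_by_hour_subset, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_hours:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

On this subset the top three hours are 15, 2, and 20, in the same order as in the full ranking above.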


15:00 appears to be the best hour for receiving comments, with an average of about 38.6 comments per Ask HN post.