# Exploring Hacker News Posts 

Hacker News is a website where users submit posts that other users can comment on.  Two types of posts are the focus of this analysis: "Ask HN" and "Show HN".  "Ask HN" posts are questions that users put to the community, while "Show HN" posts share projects or other pieces of work.

With the data we are given, we want to find out which of these two post types receives more comments on average.  We will also determine whether posts created at certain times of the day receive more comments on average.

### Importing and Reading data

In [1]:
# Here we import the data set 'hacker_news.csv', read it as a
# list of lists, and assign it to the variable 'hn'.
from csv import reader

# The with statement closes the file once it has been read.
with open('hacker_news.csv', encoding="utf8") as opened_file:
    hn = list(reader(opened_file))

# Split the data into the header row and the remaining data rows.
headers = hn[0]
hn = hn[1:]

In [2]:
# Displaying the first 5 rows of the list.
print(headers)
print('\n')
hn[0:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']




[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

### Splitting 'Ask HN' + 'Show HN' + other posts

It is time to make three lists: one for posts whose titles start with **'Ask HN'**, one for posts whose titles start with **'Show HN'**, and one for posts whose titles start with neither (**'Other'**).

In [3]:
# Initializing three lists.
ask_posts = []
show_posts = []
other_posts = []

# Appending each row to a list based on its title. Titles are
# in the second column of the dataset.
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [4]:
# Counting the number of posts in our three lists.
print('Total number of posts in Ask HN:', len(ask_posts))
print('Total number of posts in Show HN:', len(show_posts))
print('Total number of other posts:', len(other_posts))

Total number of posts in Ask HN: 9139
Total number of posts in Show HN: 10158
Total number of other posts: 273822
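
The prefix check used in the split above can also be written as a small function, which makes the rule easy to test on its own. This is only a sketch: the name `classify_title` and the sample titles are illustrative assumptions and are not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Illustrative titles used only to exercise the function.
sample_titles = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'algorithmic music']
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```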


In [5]:
# For every post in ask_posts, read the number of comments and add it to total_ask_comments.
# Then find the avg. number of 'Ask HN' comments: total number of 'Ask HN' comments / total number of 'Ask HN' posts.
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_ask_comments

10.393478498741656

In [6]:
# For every post in show_posts, read the number of comments and add it to total_show_comments.
# Then find the avg. number of 'Show HN' comments: total number of 'Show HN' comments / total number of 'Show HN' posts.
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
avg_show_comments

4.886099625910612

In [7]:
print('Average rounded number of comments on Ask HN posts = ', round(avg_ask_comments))
print('Average rounded number of comments on Show HN posts = ', round(avg_show_comments))

Average rounded number of comments on Ask HN posts =  10
Average rounded number of comments on Show HN posts =  5


On average, Ask HN posts receive about 10.4 comments, roughly double the 4.9 comments that Show HN posts receive.
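
The two averaging cells above repeat the same loop, so the calculation could be factored into one helper. The following is a minimal sketch, assuming the comment count stays in column index 4; the function name `average_comments` and the sample rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows that follow the dataset's column order.
sample_posts = [
    ['1', 'Ask HN: A', '', '1', '4', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B', '', '1', '8', 'user_b', '9/26/2016 1:17'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)  # 6.0
```

With the notebook's lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to give the same values as the cells above.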

### Finding the number of Ask HN posts and comments by hour created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1.  Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2.  Calculate the average number of comments ask posts receive by hour created.


In [8]:
# Importing the datetime module calling it dt
import datetime as dt

In [18]:
# Creating an empty list then iterating over ask_posts to append the
# created_at column and number of comments from the dataset.

result_list = []

for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])
result_list[:10]

[['9/26/2016 2:53', 7],
 ['9/26/2016 1:17', 3],
 ['9/25/2016 22:57', 0],
 ['9/25/2016 22:48', 3],
 ['9/25/2016 21:50', 2],
 ['9/25/2016 19:30', 1],
 ['9/25/2016 19:22', 22],
 ['9/25/2016 17:55', 3],
 ['9/25/2016 15:48', 0],
 ['9/25/2016 15:35', 13]]

In [10]:
# Here we parse the date string in column 0 of result_list and extract the hour from it.
# Then we count how many ask posts were created in each hour and how many comments they received.

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    post_created = row[0]
    comments_count = row[1]
    date_created = dt.datetime.strptime(post_created, "%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count

print("Number of posts created during each hour of the day:", counts_by_hour)
print("Number of comments received by ask posts created during each hour:", comments_by_hour)

Number of posts created during each hour of the day: {2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
Number of comments received by ask posts created during each hour: {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 243, 6: 1587, 5: 1838}


In [19]:
# Calculating the average number of comments "Ask HN" posts receive by hour created.

avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour[:10]

[[2, 11.137546468401487],
 [1, 7.407801418439717],
 [22, 8.804177545691905],
 [21, 8.687258687258687],
 [19, 7.163043478260869],
 [17, 9.449744463373083],
 [15, 28.676470588235293],
 [14, 9.692007797270955],
 [13, 16.31756756756757],
 [11, 8.96474358974359]]
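
The counting and averaging steps above can also be combined into a single pass. The sketch below uses `collections.defaultdict` so that no explicit "is this hour already a key" check is needed. The function name `average_comments_by_hour` is an assumption for illustration, and the sample rows are four of the `result_list` rows displayed earlier, not the full data.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each hour of the day to the mean comment count of posts created in that hour."""
    counts = defaultdict(int)   # number of posts per hour
    totals = defaultdict(int)   # total comments per hour
    for created_at, comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        totals[hour] += comments
    return {hour: totals[hour] / counts[hour] for hour in counts}

# Four rows copied from the result_list preview shown earlier.
sample_rows = [
    ['9/25/2016 19:30', 1],
    ['9/25/2016 19:22', 22],
    ['9/25/2016 15:48', 0],
    ['9/25/2016 15:35', 13],
]
sample_averages = average_comments_by_hour(sample_rows)
print(sample_averages)  # {19: 11.5, 15: 6.5}
```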

In [None]:
# We want to swap the elements in avg_by_hour so the average comments are the first column
# and the hour created is the second column.

swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg_comments = row[1]
    swap_avg_by_hour.append([avg_comments, hour])

# Here we sort the swapped list so it runs from the highest average number of comments
# down to the lowest, and then display the top 5.

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for 'Ask HN' Posts Comments:")
sorted_swap[:5]

In [None]:
# Here we print the hour and average in a fixed format to make the
# results easier to read.

print("Top 5 Hours for Ask Posts Comments")
print("-----------------------------------")
time_format = "%H"
template = "{}: {:.2f} average comments per post"

for avg_comments, hour in sorted_swap[:5]:
    # Turn the integer hour into a label such as '15:00'.
    hour_label = dt.datetime.strptime(str(hour), time_format).strftime("%H:%M")
    print(template.format(hour_label, avg_comments))

The hour in which an Ask HN post is created makes a clear difference to the average number of comments it receives.  Posts created at 15:00 (3 pm) average about 28.7 comments, by far the highest of any hour.  The next-best hour, 13:00, averages about 16.3, and the remaining hours in the top 5 average fewer than half as many comments as 15:00.
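
The swap-and-sort step used above can be avoided by passing a `key` function to `sorted`, which orders the `[hour, average]` pairs by their second element directly. This is a sketch of that alternative, not the notebook's own method. The sample values are four rows copied from the `avg_by_hour` preview, so the ranking it prints covers only those four hours.

```python
# Four [hour, average] rows copied from the avg_by_hour preview above.
sample_avg_by_hour = [
    [2, 11.137546468401487],
    [1, 7.407801418439717],
    [15, 28.676470588235293],
    [13, 16.31756756756757],
]

# Sort by the average (index 1), highest first, without swapping columns.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

# Format each hour as HH:00 using zero-padded integer formatting.
lines = [
    "{:02d}:00: {:.2f} average comments per post".format(hour, avg)
    for hour, avg in top_hours[:3]
]
for line in lines:
    print(line)
# 15:00: 28.68 average comments per post
# 13:00: 16.32 average comments per post
# 02:00: 11.14 average comments per post
```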