# Hacker News Posts - Comparing Ask HN and Show HN

The goal of this project is to analyse certain types of posts on the Hacker News website. We will focus specifically on the 'Ask HN' and 'Show HN' posts.

1. Users submit Ask HN posts to ask the Hacker News community a specific question.
2. Users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We will compare these two types of posts to see which type receives more comments on average and whether the time a post is created affects the number of comments it receives.

In [2]:
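# Import the csv reader, open the csv file, and store its rows in the variable hn
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file) # a separate name so the imported reader function is not overwritten
hn = list(read_file)
opened_file.close()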

In [3]:
# checking if the file was opened and read correctly
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [4]:
# assigning the first row to the header variable to separate it from the data we analyse
hn_header = hn[0]
hn = hn[1:]

In [5]:
hn_header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [6]:
# We now look to identify the ASK HN and SHOW HN posts
# We'll use the str.startswith() method
ask_posts = []
show_posts = []
other_posts = []

# loop through all posts in the hn dataset
for i in hn:
    post_title = i[1].lower() # lowercase the title so the check is case-insensitive
    if post_title.startswith('ask hn'):
        ask_posts.append(i)
    elif post_title.startswith('show hn'):
        show_posts.append(i)
    else:
        other_posts.append(i)
        
print('Number of ask posts: ', len(ask_posts))
print('Number of show posts: ', len(show_posts))
print('Number of other posts: ',len(other_posts))

Number of ask posts:  1744
Number of show posts:  1162
Number of other posts:  17194
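
The prefix check used above can also be wrapped in a small function, which makes it easy to try on individual titles. This is only a sketch: the name `classify_title` is hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles taken from the dataset previews in this notebook
print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('Interactive Dynamic Video'))                    # other
```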


In [7]:
ask_posts[0:3] # just to check whether the code works

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14']]

In [8]:
print(type(ask_posts[0][4]))

<class 'str'>


In [9]:
#Counting the amount of comments on the 'ask hn' posts
total_ask_comments = 0

for i in ask_posts:
    comments = int(i[4]) #The number of comments is in string format so convert to int
    total_ask_comments += comments

avg_comments = total_ask_comments / len(ask_posts)   
print('The total number of ask comments is: ', total_ask_comments)
print('''The average number of comments per 'ask hn' post : ''',avg_comments) 
    

The total number of ask comments is:  24483
The average number of comments per 'ask hn' post :  14.038417431192661


In [10]:
#Counting the amount of comments on the 'show hn' posts
total_show_comments = 0

for i in show_posts:
    comments = int(i[4]) #The number of comments is in string format so convert to int
    total_show_comments += comments

avg_comments_show = total_show_comments / len(show_posts) # average over the show posts only
print('The total number of show comments is: ', total_show_comments)
print('''The average number of comments per 'show hn' post : ''', round(avg_comments_show, 2))

The total number of show comments is:  11988
The average number of comments per 'show hn' post :  10.32


In [11]:
print('total posts compared: ',1744 / 1162 )
print('total comments compared: ', 24483 / 11988)

total posts compared:  1.5008605851979346
total comments compared:  2.0422922922922924


As shown above, the 'ask hn' posts outnumber the 'show hn' posts by about 1.5 to 1, and they collect about twice as many comments in total.
On average, an 'ask hn' post receives about 14.04 comments (24483 / 1744), against about 10.32 (11988 / 1162) for a 'show hn' post. This means that an 'ask hn' post draws roughly 36% more comments than a 'show hn' post.
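
The two averaging cells above repeat the same loop. A small helper function, sketched here as an assumption rather than as part of the original notebook, keeps the total and the post count tied to the same list, which makes it harder to divide by the wrong length. The name `average_comments` and its `comments_index` parameter are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments over a list of post rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows copied from the ask_posts preview above
sample = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print(average_comments(sample))  # 17.5
```

With the full lists, the same call would be `average_comments(ask_posts)` and `average_comments(show_posts)`.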

### 'Ask hn' post analysis

Since 'ask hn' posts receive more comments per post, we focus on them first and analyse whether there is a time of day at which posts attract more comments.

In [12]:
from datetime import datetime as dt

result_list = [] # create a list of (created_at, num_comments) tuples
for i in ask_posts:
    created_at = i[-1]
    num_comments = int(i[4])
    result_list.append((created_at, num_comments))
    
print(result_list[0:3])
    

[('8/16/2016 9:55', 6), ('11/22/2015 13:43', 29), ('5/2/2016 10:14', 1)]


In [44]:
counts_by_hour = {}
comments_by_hour = {}

for i in result_list:
    date = i[0]
    comment = int(i[1])
    hour = dt.strptime(date, '%m/%d/%Y %H:%M').strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
        
print('Counts by the hour: ', counts_by_hour)
print('Comments by the hour: ', comments_by_hour)
   

Counts by the hour:  {'19': 110, '09': 45, '14': 107, '08': 48, '13': 85, '02': 58, '06': 44, '22': 71, '23': 68, '21': 109, '04': 47, '03': 54, '10': 59, '12': 73, '18': 109, '11': 58, '15': 116, '16': 108, '00': 55, '20': 80, '17': 100, '05': 46, '07': 34, '01': 60}
Comments by the hour:  {'19': 1188, '09': 251, '14': 1416, '08': 492, '13': 1253, '02': 1381, '06': 397, '22': 479, '23': 543, '21': 1745, '04': 337, '03': 421, '10': 793, '12': 687, '18': 1439, '11': 641, '15': 4477, '16': 1814, '00': 447, '20': 1722, '17': 1146, '05': 464, '07': 267, '01': 683}
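
To make the hour extraction concrete, here is the same `strptime` / `strftime` round trip applied to a single timestamp from `result_list`. The variable names `created_at` and `parsed` are chosen for illustration only.

```python
from datetime import datetime as dt

created_at = '8/16/2016 9:55'  # first entry in result_list
parsed = dt.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed)                 # 2016-08-16 09:55:00
print(parsed.strftime('%H'))  # 09  (zero-padded string, as used for the dictionary keys)
print(parsed.hour)            # 9   (the same hour as an integer)
```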


In [20]:
# Using the max() function, extract the hour where we have the most comments and 'ask hn' posts
print('max count by hour:', max(counts_by_hour, key=counts_by_hour.get))
print('Max comments by the hour: ', max(comments_by_hour, key=comments_by_hour.get))

max count by hour: 15
Max comments by the hour:  15


The code above shows that 15:00 is the hour with both the most 'ask hn' posts and the most comments, which suggests that this is when most people are online. Of course, we should also take into account that when many people are posting, our own post may get lost in the crowd.

In [36]:
# let's look at the average number of comments per post for each hour
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print('\n')
print('The average amount of comments per post, per hour:' ,avg_by_hour)
    



The average amount of comments per post, per hour: [['19', 10.8], ['09', 5.5777777777777775], ['14', 13.233644859813085], ['08', 10.25], ['13', 14.741176470588234], ['02', 23.810344827586206], ['06', 9.022727272727273], ['22', 6.746478873239437], ['23', 7.985294117647059], ['21', 16.009174311926607], ['04', 7.170212765957447], ['03', 7.796296296296297], ['10', 13.440677966101696], ['12', 9.41095890410959], ['18', 13.20183486238532], ['11', 11.051724137931034], ['15', 38.5948275862069], ['16', 16.796296296296298], ['00', 8.127272727272727], ['20', 21.525], ['17', 11.46], ['05', 10.08695652173913], ['07', 7.852941176470588], ['01', 11.383333333333333]]


In [43]:
# the code above gives us the result but not in a clear fashion, so we will now make it more readable by sorting
swap_avg_by_hour = []
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]]) # put the average first so that sorted() orders by it
    
print('The first five swapped elements : ',swap_avg_by_hour[0:5] )
print('\n')
sorted_swap = sorted(swap_avg_by_hour,reverse = True)
print('Top 5 Hours for Ask Posts Comments: ', sorted_swap[0:5])


The first five swapped elements :  [[10.8, '19'], [5.5777777777777775, '09'], [13.233644859813085, '14'], [10.25, '08'], [14.741176470588234, '13']]


Top 5 Hours for Ask Posts Comments:  [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
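
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted()` also accepts a `key` function, so the `[hour, average]` pairs can be ordered without building a second list. The sample below uses three pairs copied from the `avg_by_hour` output; the name `by_average` is hypothetical.

```python
# Three [hour, average] pairs copied from the avg_by_hour output above
avg_sample = [['19', 10.8], ['15', 38.5948275862069], ['02', 23.810344827586206]]

# Sort on the second element (the average), highest first
by_average = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)
print(by_average[0])  # ['15', 38.5948275862069]
```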


We now have the top 5 hours to post on, based upon the average amount of comments a post receives during that hour. 
Let us try to clean it up a bit so that we can present it clearly.

In [57]:


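for avg, hour in sorted_swap[0:5]:
    template = '{}: {:.2f} average comments per post' # avoid naming this str, which would shadow the built-in
    print(template.format(dt.strptime(hour, '%H').strftime('%H:%M'), avg))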

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


We now have a finished list of the best hours to post. Some next questions we can look to answer:

1. Determine whether show or ask posts receive more points on average.
2. Determine whether posts created at a certain time are more likely to receive more points.
3. Compare our results to the average number of comments and points other posts receive.
4. Format the project to make it cleaner.
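
As a starting point for the questions about points, the per-hour logic used for comments could be generalised to any numeric column. The function below is a minimal sketch under that assumption; `average_by_hour` and `value_index` are hypothetical names, and index 3 is the `num_points` column according to the header row.

```python
from datetime import datetime as dt

def average_by_hour(posts, value_index):
    """Map each creation hour ('00'-'23') to the average of one numeric column."""
    totals, counts = {}, {}
    for row in posts:
        hour = dt.strptime(row[-1], '%m/%d/%Y %H:%M').strftime('%H')
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# The three rows from the ask_posts preview; index 3 is num_points
sample = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]
print(average_by_hour(sample, 3))  # {'09': 2.0, '13': 28.0, '10': 1.0}
```

Calling `average_by_hour(ask_posts, 4)` would give the per-hour comment averages, and `average_by_hour(show_posts, 3)` the per-hour point averages for show posts.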
    
