# Hacker News Posting Trends

This project works with data from the popular technology site Hacker News (https://news.ycombinator.com/).

We will analyse the data to see when people tend to post and which types of posts attract the most comments.

First, let's load the dataset and print the first 5 rows (the header row plus four posts).

In [1]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
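
As a variation on the loading step, here is a minimal sketch that reads the CSV inside a `with` block so the file is closed automatically. To keep it runnable without `hacker_news.csv`, it reads from an in-memory string (`sample_csv`, a stand-in built from the header and first row shown above); with the real file, the `io.StringIO(...)` call would be replaced by `open('hacker_news.csv', newline='')`.

```python
import io
from csv import reader

# Stand-in for hacker_news.csv: the header and first data row from the output above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# The with block closes the file-like object once the rows have been read.
with io.StringIO(sample_csv) as opened_file:
    rows = list(reader(opened_file))

print(rows[0])  # header row
print(rows[1])  # first post
```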


Column definitions
- id: The unique post identification number
- title: A short description of the post
- url: The URL that a post links to (if applicable)
- num_points: The number of points a post has (upvotes minus downvotes)
- num_comments: The number of comments made by users on the post
- author: The username of the post creator
- created_at: The date and time the post was created, in M/D/YYYY H:MM format (for example, 8/4/2016 11:52).

The data is a list of lists, which is fine, but the column names are the first list. Let's separate this header row from the rest of the dataset.

In [2]:
headers = hn[0]
hn = hn[1:]
print(hn[:2])


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
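
The analysis in this notebook reads fields by position (for example `row[4]` for the comment count). As an optional illustration of how the header row maps onto a data row, the sketch below pairs the two with `zip`. The variable names `first_row` and `post` are introduced here for illustration only.

```python
# Header and first data row, copied from the outputs above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
first_row = ['12224879', 'Interactive Dynamic Video',
             'http://www.interactivedynamicvideo.com/', '386', '52',
             'ne0phyte', '8/4/2016 11:52']

# Pair each column name with its value so fields can be looked up by name.
post = dict(zip(headers, first_row))

# Every field is read from the CSV as a string, so numeric columns need converting.
num_comments = int(post['num_comments'])
print(post['title'], num_comments)
```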


### Isolating Ask And Show Type Posts

There are different types of posts in the dataset. The posts we are interested in are the Ask HN and Show HN posts, which are community-led rather than news-driven. We will collect these in ask_posts and show_posts, and put the rest into other_posts.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()  # Avoids case sensitivity issues
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):  # elif keeps Ask posts out of other_posts
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask posts:', len(ask_posts))
print('Number of Show posts:', len(show_posts))
print('Number of Other posts:', len(other_posts))

Number of Ask posts: 1744
Number of Show posts: 1162
Number of Other posts: 17194
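
For the three lists to be mutually exclusive, each title has to fall into exactly one branch of an `if` / `elif` / `else` chain. The sketch below isolates that rule in a small function. `classify_post` is a hypothetical helper name, and the two "Ask HN" / "Show HN" titles are made-up examples rather than rows from the dataset.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' for a post title (hypothetical helper)."""
    lowered = title.lower()  # compare case-insensitively
    if lowered.startswith('ask hn'):
        return 'ask'
    elif lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first title is from the dataset; the other two are invented for illustration.
sample_titles = [
    'Interactive Dynamic Video',
    'Ask HN: An example question',
    'SHOW HN: An example project',
]

for title in sample_titles:
    print(classify_post(title), '|', title)
```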


Other posts are the majority in our dataset. For now we are only concerned with Ask and Show posts.

Now that we have the number of posts, let's work out the average number of comments for each type.

In [6]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / (len(ask_posts))
print('Average number of comments in Ask posts:',avg_ask_comments)

Average number of comments in Ask posts: 14.038417431192661


In [7]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / (len(show_posts))
print('Average number of comments in Show posts:',avg_show_comments)

Average number of comments in Show posts: 10.31669535283993


Ask posts receive more comments on average than Show posts (about 14 versus about 10), perhaps because feedback is actively requested.
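
The two averaging cells repeat the same loop, so the calculation could be factored into one function. This is a minimal sketch, not the notebook's own code: `average_comments` is a hypothetical helper, and it is demonstrated on the first two rows of the dataset shown earlier rather than on the full Ask or Show lists.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column (index 4) over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# First two data rows from the dataset output above (52 and 10 comments).
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52',
     'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))
```

With the full lists, the same function would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.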

### Analysing Time Of Posting

We will now focus on the Ask posts as these give the greatest comment response.
The question to ask is: "Is there a particular time that ask posts are more likely to attract comments?"

To do this we will use the datetime module to parse the 'created_at' column.

In [9]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = ([created_at, num_comments])
    result_list.append(result)
print(result_list[:4])
    

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3]]


Here's a quick overview of what the datetime module can do.

In [18]:
time = "11/22/2015 13:43"
# Create a datetime object from a string
time = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
# Create an hour string from the datetime object
time = dt.datetime.strftime(time, "%H")
print(time)

13
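
A parsed `datetime` object also exposes the hour directly through its `.hour` attribute, as an integer rather than a zero-padded string. The short sketch below shows both forms side by side, and checks that `strptime` with `%m`, `%d` and `%H` accepts the non-padded values used in this dataset (such as `8/4/2016 9:55`). The variable names are illustrative.

```python
import datetime as dt

created_at = "11/22/2015 13:43"
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_as_int = parsed.hour            # integer hour
hour_as_str = parsed.strftime("%H")  # zero-padded two-character string

# Single-digit month, day and hour values parse with the same format string.
early = dt.datetime.strptime("8/4/2016 9:55", "%m/%d/%Y %H:%M")

print(hour_as_int, hour_as_str, early.strftime("%H"))
```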


Now that we have seen how datetime works, let's use it to build two dictionaries: the number of posts created in each hour, and the total number of comments those posts received.

In [12]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created = row[0]
    comments = row[1]
    time = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(time, "%H")
    # First post seen for this hour: start its post count at 1 and its comment total at this post's comments
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    # Hour already seen: add 1 to its post count and add this post's comments to its total
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments   
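
The same tally can be written without the `if` / `else` branch by using `collections.defaultdict`, which starts every missing key at 0. This is an alternative sketch rather than the approach used in the rest of the notebook, and it runs on only the first four `[created_at, num_comments]` pairs printed earlier.

```python
import datetime as dt
from collections import defaultdict

# First four [created_at, num_comments] pairs from the result_list output above.
result_list = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29],
               ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3]]

counts_by_hour = defaultdict(int)    # hour -> number of posts
comments_by_hour = defaultdict(int)  # hour -> total comments

for created, comments in result_list:
    hour = dt.datetime.strptime(created, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```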

In [27]:
print('Hour : Counts')
sorted(counts_by_hour.items(), key=lambda x: x[1] ,reverse= True)

Hour : Counts


[('15', 116),
 ('19', 110),
 ('21', 109),
 ('18', 109),
 ('16', 108),
 ('14', 107),
 ('17', 100),
 ('13', 85),
 ('20', 80),
 ('12', 73),
 ('22', 71),
 ('23', 68),
 ('01', 60),
 ('10', 59),
 ('02', 58),
 ('11', 58),
 ('00', 55),
 ('03', 54),
 ('08', 48),
 ('04', 47),
 ('05', 46),
 ('09', 45),
 ('06', 44),
 ('07', 34)]

### Working Out The Number Of Comments Per Hour

Next, let's look at the total number of comments received by posts created in each hour.

In [28]:
print('Hour : Comments')
sorted(comments_by_hour.items(), key=lambda x: x[1] ,reverse= True)

Hour : Comments


[('15', 4477),
 ('16', 1814),
 ('21', 1745),
 ('20', 1722),
 ('18', 1439),
 ('14', 1416),
 ('02', 1381),
 ('13', 1253),
 ('19', 1188),
 ('17', 1146),
 ('10', 793),
 ('12', 687),
 ('01', 683),
 ('11', 641),
 ('23', 543),
 ('08', 492),
 ('22', 479),
 ('05', 464),
 ('00', 447),
 ('03', 421),
 ('06', 397),
 ('04', 337),
 ('07', 267),
 ('09', 251)]

### Calculating Average Number of Comments Per Post In Each Hour

We have the number of posts and the total number of comments for each hour, so now we can calculate the average number of comments per post in each hour.

In [62]:
average_by_hour = []
for hour in comments_by_hour:
    average_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour],2)])
print('Hour : Average Comments Per Post', '\n')
print(average_by_hour)

Hour : Average Comments Per Post 

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]
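
To make the arithmetic concrete, the sketch below recomputes three of these averages from the per-hour totals printed earlier (total comments divided by number of posts, rounded to 2 decimal places). It uses a dictionary comprehension instead of a list of lists, which is a stylistic choice for this example only.

```python
# Per-hour totals for three hours, copied from the two sorted outputs above.
counts_by_hour = {'15': 116, '02': 58, '09': 45}
comments_by_hour = {'15': 4477, '02': 1381, '09': 251}

# Average comments per post = total comments / number of posts, for each hour.
average_by_hour = {
    hour: round(comments_by_hour[hour] / counts_by_hour[hour], 2)
    for hour in comments_by_hour
}

print(average_by_hour)
```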


Now let's swap the two columns so that the average comes first, which lets us sort the list by average comments per post.


In [67]:
swap_avg_by_hour = []
for row in average_by_hour:
    swap_avg_by_hour.append ((row[1], row[0]))
print(swap_avg_by_hour[:])

[(5.58, '09'), (14.74, '13'), (13.44, '10'), (13.23, '14'), (16.8, '16'), (7.99, '23'), (9.41, '12'), (11.46, '17'), (38.59, '15'), (16.01, '21'), (21.52, '20'), (23.81, '02'), (13.2, '18'), (7.8, '03'), (10.09, '05'), (10.8, '19'), (11.38, '01'), (6.75, '22'), (10.25, '08'), (7.17, '04'), (8.13, '00'), (9.02, '06'), (7.85, '07'), (11.05, '11')]


Next we will sort the list to find the top 5 posting times, and convert them to Central Standard Time.

In [74]:
swap_avg_by_hour = sorted(swap_avg_by_hour, reverse = True)
print('Here is the list sorted by average comments:',swap_avg_by_hour, '\n\n')
print("These are the top 5 times to post at:")
for row in swap_avg_by_hour[:5]:
    hour = row[1]
    avg_comments = row[0]  # already rounded to 2 decimal places
    # The original time zone was Eastern Standard.
    # Here I am converting to Central Standard by subtracting an hour.
    convert_to_cst = dt.datetime.strptime(hour, '%H')
    cst = convert_to_cst - dt.timedelta(hours=1)
    cst = cst.strftime("%H:%M")
    print("At {0} Central Standard Time there are on average {1} comments per post".format(cst, avg_comments))

Here is the list sorted by average comments: [(38.59, '15'), (23.81, '02'), (21.52, '20'), (16.8, '16'), (16.01, '21'), (14.74, '13'), (13.44, '10'), (13.23, '14'), (13.2, '18'), (11.46, '17'), (11.38, '01'), (11.05, '11'), (10.8, '19'), (10.25, '08'), (10.09, '05'), (9.41, '12'), (9.02, '06'), (8.13, '00'), (7.99, '23'), (7.85, '07'), (7.8, '03'), (7.17, '04'), (6.75, '22'), (5.58, '09')] 


These are the top 5 times to post at:
At 14:00 Central Standard Time there are on average 38.59 comments per post
At 01:00 Central Standard Time there are on average 23.81 comments per post
At 19:00 Central Standard Time there are on average 21.52 comments per post
At 15:00 Central Standard Time there are on average 16.8 comments per post
At 20:00 Central Standard Time there are on average 16.01 comments per post
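
The swap step can be skipped by passing a `key` function to `sorted`, and the one-hour shift can be done with modular arithmetic so that hour 00 wraps round to 23. The sketch below is an alternative under those assumptions. `to_central` is a hypothetical helper, and it treats the offset as a fixed one hour, as the cell above does.

```python
# Hour (Eastern) and average comments per post, copied from the output above.
average_by_hour = [
    ['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8],
    ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01],
    ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09],
    ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17],
    ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05],
]

def to_central(eastern_hour):
    """Shift a two-digit Eastern hour string back one hour, wrapping past midnight."""
    return "{:02d}:00".format((int(eastern_hour) - 1) % 24)

# Sort on the average (index 1) directly, so no swapped copy of the list is needed.
top_five = sorted(average_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("At {} Central Standard Time there are on average {} comments per post"
          .format(to_central(hour), avg))
```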


### Conclusion

The top 5 posting times for getting a community response are spread across the day: 1AM, 2PM, 3PM, 7PM and 8PM Central Standard Time. The 2PM hour stands out, with an average of 38.59 comments per post.
Assuming the majority of posters are from the USA, there will still be some variation in the local times of the posters.

The worst time is 8AM Central Standard Time, with an average of 5.58 comments per post, presumably because most people in the USA will be commuting to work or at work.