## About this project

### Goals
This project aims to understand two things about Hacker News: (1) how popular the __Ask HN__ and __Show HN__ sections are compared with other Hacker News posts, and (2) whether there is a pattern in when people comment, that is, whether posts created at certain times attract more comments than others.


### Questions

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?


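### Methodology
We'll approach the problem by comparing the average number of comments per post, first across post types and then across the hour of the day in which posts were created.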

### Data

You can find the source data in this [Kaggle project](https://www.kaggle.com/hacker-news/hacker-news-posts).
The original data set contains all Hacker News posts from a 12-month period ending on September 26, 2016. For our project, the data has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

Below are descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted
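
Every value read from the CSV file arrives as a string, so numeric columns need an explicit conversion before any arithmetic. The following is a minimal sketch of reading the columns by name with `csv.DictReader`; it parses a two-line in-memory sample (the header plus one row from the data set) instead of opening `hacker_news.csv`, and the variable names are illustrative rather than part of the project code.

```python
import csv
import io

# Two lines in the same layout as hacker_news.csv: the header plus one row.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader uses the header row as keys, so columns are read by name.
posts = list(csv.DictReader(sample_csv))
first = posts[0]

# CSV values are strings; convert the numeric columns explicitly.
num_points = int(first["num_points"])
num_comments = int(first["num_comments"])

print(first["title"], num_points, num_comments)
```

The rest of this project uses `csv.reader` and positional indexes such as `row[4]`; reading by name is an alternative that makes the column being used explicit.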


In [1]:
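# initialization: load the CSV file into a list of rows

from csv import reader

# the context manager closes the file once all rows have been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

hn_header = hn[:1]  # header kept as a one-row list of lists
hn_data = hn[1:]    # every remaining row is a post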


## Step 1: Data clean-up

In [2]:
# a first look at the data
print(type(hn))
print('\n','Length of the data set:',len(hn))
print('\n','Data Header', '\n', hn_header)

for row in hn_data[:5]: 
    print(row)
    

<class 'list'>

 Length of the data set: 20101

 Data Header 
 [['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',

### Clean-up to-do list

1. Normalize capitalization: convert all titles to lowercase
2. Parse dates and times
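
Both clean-up items can be sketched on a single row. The helper below, `clean_row`, is a hypothetical name introduced only for illustration; it assumes the title is in column 1 and `created_at` is the last column, as in the header shown above, and that dates follow the `month/day/year hour:minute` layout seen in the sample rows.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"  # matches values such as "6/17/2016 0:01"


def clean_row(row):
    """Return a copy of a raw CSV row with a lowercase title and parsed date.

    Hypothetical helper: column positions are assumed from the header.
    """
    cleaned = list(row)               # copy, so the raw row stays untouched
    cleaned[1] = cleaned[1].lower()   # item 1: normalize capitalization
    cleaned[-1] = dt.datetime.strptime(cleaned[-1], DATE_FORMAT)  # item 2
    return cleaned


raw = ['11919867', 'Technology ventures: From Idea to Enterprise',
       'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
       '3', '1', 'hswarna', '6/17/2016 0:01']

cleaned = clean_row(raw)
print(cleaned[1])
print(cleaned[-1].hour)
```

Returning a copy keeps the original strings available, which helps when checking the cleaned values against the raw file.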

In [3]:

# count titles containing 'Ask HN' vs. 'ask HN' to check how consistent the capitalization is
upperCaseNum = 0
lowerCaseNum = 0
for row in hn_data:
    
    titleName = row[1]
    if "Ask HN" in titleName: 
        upperCaseNum += 1
    elif "ask HN" in titleName:
        lowerCaseNum += 1
        
print('number of uppercase:',upperCaseNum, 'lowercase:',lowerCaseNum)


# convert every title to lowercase so that later matching is case-insensitive
for row in hn_data:
    titleName = row[1]
    row[1] = titleName.lower()
# print(hn_data[1][1])
# print(hn_data[300][1])

        

number of uppercase: 1742 lowercase: 1


## Step 2: Analyze `Ask HN` and `Show HN` comments
`title` : row[1]

`num_comments` : row[4]

In [4]:
# split the posts into ask_posts, show_posts, and other_posts based on the title
ask_posts = []
show_posts = []
other_posts = []

for row in hn_data:
    titleName = row[1]
    if "ask hn" in titleName: 
        ask_posts.append(row)
    elif "show hn" in titleName:
        show_posts.append(row)
    else: 
        other_posts.append(row)
print('number of ask_hn:', len(ask_posts))
print('number of show_hn:',len(show_posts))
print('number of other_hn:',len(other_posts))

# print(ask_posts[1][4])
# print(show_posts[5][4])

print('ask posts percentage:',len(ask_posts)/(len(ask_posts)+len(show_posts)+len(other_posts))*100,'%')
print('show posts percentage:',len(show_posts)/(len(ask_posts)+len(show_posts)+len(other_posts))*100,'%')


number of ask_hn: 1745
number of show_hn: 1165
number of other_hn: 17190
ask posts percentage: 8.681592039800995 %
show posts percentage: 5.796019900497512 %


In [5]:
# analyze the average number of comments for each section

def avgComments(postList):
    """Return the mean of num_comments (column index 4) over postList."""
    totalComments = 0
    for row in postList:
        totalComments += float(row[4])
    return totalComments / len(postList)

print('ASK hn average comments per post:',avgComments(ask_posts))
print('SHOW hn average comments per post:',avgComments(show_posts))
print('OTHER posts average comments per post:',avgComments(other_posts))

ASK hn average comments per post: 14.031518624641834
SHOW hn average comments per post: 10.302145922746782
OTHER posts average comments per post: 26.878359511343806


#### Summary:
Ask HN and Show HN posts together make up around 14% of the Hacker News posts in the sample (about 8.7% and 5.8%, respectively). They also receive fewer comments on average: about 14 per Ask HN post and 10 per Show HN post, compared with about 27 per other post, so roughly half or less.
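
The split above uses a substring test, so a title that merely mentions "ask hn" somewhere also lands in `ask_posts`. A stricter variant, sketched here as an assumption and not as the method used in this project, looks only at the start of the title. The `classify` helper and the sample titles are made up for illustration.

```python
from statistics import mean


def classify(title):
    """Label a title by its prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up (title, num_comments) pairs, for illustration only.
sample = [
    ("Ask HN: How do you learn a new language?", 12),
    ("Show HN: A tiny static site generator", 4),
    ("Why I stopped reading Ask HN threads", 30),
    ("ask hn: Best books of the year?", 8),
]

groups = {"ask": [], "show": [], "other": []}
for title, num_comments in sample:
    groups[classify(title)].append(num_comments)

# Skip empty groups so mean() is never called on an empty list.
averages = {name: mean(values) for name, values in groups.items() if values}
print(averages)
```

With the prefix test, the third title counts as an ordinary post even though it contains "Ask HN". Whether that changes the averages reported above would need to be checked against the full data set.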

## Step 3: Analyze the most popular time of comments
`num_comments`: row[4]

`created_at`: row[-1]

Step 2 compared Ask HN and Show HN posts with other posts. Next, we'll determine whether posts created at a certain time of day are more likely to attract comments. The code in this step iterates over `hn_data`, so the results cover every post in the sample, not only Ask HN posts. We'll use the following steps to perform this analysis:

1. Calculate the number of posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments posts receive by hour created.
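
The two steps can be sketched compactly on a handful of rows. The pairs below are `created_at` and `num_comments` values in the data set's format; the use of `defaultdict` and the variable names are choices made for this sketch, not the code used in the cells that follow.

```python
import datetime as dt
from collections import defaultdict

# A few (created_at, num_comments) pairs in the data set's format.
sample = [
    ("8/4/2016 11:52", "52"),
    ("10/31/2015 9:48", "22"),
    ("8/16/2016 9:55", "6"),
    ("10/13/2015 9:30", "10"),
    ("6/17/2016 0:01", "1"),
    ("11/13/2015 0:45", "4"),
]

counts = defaultdict(int)    # hour -> number of posts created in that hour
comments = defaultdict(int)  # hour -> total comments on those posts

# Step 1: accumulate post counts and comment totals per hour.
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += int(num_comments)

# Step 2: divide totals by counts to get the average per hour.
average_by_hour = {hour: comments[hour] / counts[hour] for hour in counts}

for hour in sorted(average_by_hour):
    print(f"{hour:02d}:00 -> {average_by_hour[hour]:.2f}")
```

A `defaultdict` removes the need for the "is this hour already a key?" branch, at the cost of hiding the initial value from the reader.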


In [6]:
# collect the creation time and comment count of every post
print('sample date format:',hn_data[2][-1])

import datetime as dt
# for row in hn_data:
#     timeCreated = row[-1]
#     timeCreatedParsed = dt.datetime.strptime(timeCreated, "%m/%d/%Y %H:%M")

# created a new list with two elements: created_at & number of comments
result_list = []
for row in hn_data:
    timeCreated = row[-1]
    numComments = row[4]
    result_list.append([timeCreated,numComments])
print(result_list[:10])



sample date format: 6/23/2016 22:20
[['8/4/2016 11:52', '52'], ['1/26/2016 19:30', '10'], ['6/23/2016 22:20', '1'], ['6/17/2016 0:01', '1'], ['9/30/2015 4:12', '2'], ['10/31/2015 9:48', '22'], ['11/13/2015 0:45', '4'], ['8/16/2016 9:55', '6'], ['3/22/2016 16:18', '7'], ['10/13/2015 9:30', '10']]


In [7]:
# total number of posts per hour - `counts_per_hour`
# total number of comments on those posts per hour - `comments_per_hour`

counts_per_hour = {}
comments_per_hour = {}
for row in result_list:
    createdTime = row[0]
    numComments = float(row[1])
    dtObject = dt.datetime.strptime(createdTime, "%m/%d/%Y %H:%M")
    createdHour = dtObject.hour
    if createdHour not in counts_per_hour:
        counts_per_hour[createdHour] = 1
        comments_per_hour[createdHour] = numComments
    else: 
        counts_per_hour[createdHour] += 1
        comments_per_hour[createdHour] += numComments

print(counts_per_hour)
print('\n')
print(comments_per_hour)   


{11: 762, 19: 1145, 22: 875, 0: 697, 4: 527, 9: 609, 16: 1302, 18: 1254, 14: 1151, 10: 686, 12: 923, 13: 1102, 20: 1051, 3: 488, 17: 1362, 1: 588, 23: 778, 8: 578, 2: 529, 21: 1030, 15: 1234, 6: 468, 7: 508, 5: 453}


{11: 20664.0, 19: 27894.0, 22: 18684.0, 0: 17478.0, 4: 11537.0, 9: 15274.0, 16: 30857.0, 18: 31587.0, 14: 33545.0, 10: 16818.0, 12: 25351.0, 13: 30562.0, 20: 23414.0, 3: 11626.0, 17: 34784.0, 1: 12465.0, 23: 17582.0, 8: 14062.0, 2: 13762.0, 21: 22652.0, 15: 35809.0, 6: 9253.0, 7: 12576.0, 5: 10290.0}


Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [8]:
for hourKey in counts_per_hour:
    # named hourlyAvg so the avgComments() function from In [5] is not overwritten
    hourlyAvg = comments_per_hour[hourKey]/counts_per_hour[hourKey]
    
    print(hourKey, 'o\'clock:', hourlyAvg)
    print('\n')

11 o'clock: 27.118110236220474


19 o'clock: 24.361572052401748


22 o'clock: 21.353142857142856


0 o'clock: 25.076040172166426


4 o'clock: 21.891840607210625


9 o'clock: 25.080459770114942


16 o'clock: 23.69969278033794


18 o'clock: 25.188995215311003


14 o'clock: 29.14422241529105


10 o'clock: 24.516034985422742


12 o'clock: 27.465872156013003


13 o'clock: 27.733212341197824


20 o'clock: 22.27783063748811


3 o'clock: 23.82377049180328


17 o'clock: 25.53891336270191


1 o'clock: 21.198979591836736


23 o'clock: 22.59897172236504


8 o'clock: 24.32871972318339


2 o'clock: 26.015122873345934


21 o'clock: 21.992233009708738


15 o'clock: 29.01863857374392


6 o'clock: 19.771367521367523


7 o'clock: 24.755905511811022


5 o'clock: 22.71523178807947




In [9]:
# create a list with hour and comments number so we can sort later.
avg_by_hour = []

for hour in counts_per_hour:
    avg_by_hour.append([hour,comments_per_hour[hour]/counts_per_hour[hour]])
print(avg_by_hour)

[[11, 27.118110236220474], [19, 24.361572052401748], [22, 21.353142857142856], [0, 25.076040172166426], [4, 21.891840607210625], [9, 25.080459770114942], [16, 23.69969278033794], [18, 25.188995215311003], [14, 29.14422241529105], [10, 24.516034985422742], [12, 27.465872156013003], [13, 27.733212341197824], [20, 22.27783063748811], [3, 23.82377049180328], [17, 25.53891336270191], [1, 21.198979591836736], [23, 22.59897172236504], [8, 24.32871972318339], [2, 26.015122873345934], [21, 21.992233009708738], [15, 29.01863857374392], [6, 19.771367521367523], [7, 24.755905511811022], [5, 22.71523178807947]]


In [10]:
# formatting

#swap the two elements in the list so we can sort by first element (avg comments)
swap_avg_by_hour = []
for row in avg_by_hour:
    firstElement = row[0]
    secondElement = row[1]
    swap_avg_by_hour.append([secondElement, firstElement])
# print(swap_avg_by_hour)

sortedByComments = sorted(swap_avg_by_hour,reverse = True)
print('Sorted table for average comments per hour')
print(sortedByComments)

print('\n')
print('Top five hacker news commenting hour')
for data in sortedByComments[:5]:
    formatTime = dt.datetime.strptime(str(data[1]),"%H").strftime("%H:%M")
    template = "{time}: {number:.2f} average comments per post".format(time= formatTime, number = data[0] )
    print(template)

Sorted table for average comments per hour
[[29.14422241529105, 14], [29.01863857374392, 15], [27.733212341197824, 13], [27.465872156013003, 12], [27.118110236220474, 11], [26.015122873345934, 2], [25.53891336270191, 17], [25.188995215311003, 18], [25.080459770114942, 9], [25.076040172166426, 0], [24.755905511811022, 7], [24.516034985422742, 10], [24.361572052401748, 19], [24.32871972318339, 8], [23.82377049180328, 3], [23.69969278033794, 16], [22.71523178807947, 5], [22.59897172236504, 23], [22.27783063748811, 20], [21.992233009708738, 21], [21.891840607210625, 4], [21.353142857142856, 22], [21.198979591836736, 1], [19.771367521367523, 6]]


Top five hacker news commenting hour
14:00: 29.14 average comments per post
15:00: 29.02 average comments per post
13:00: 27.73 average comments per post
12:00: 27.47 average comments per post
11:00: 27.12 average comments per post


### Summary: 
The average number of comments per post is fairly evenly distributed across the hours of the day, ranging from about 19.8 to 29.1. Posts created at 14:00 and 15:00 (2pm and 3pm) receive the most comments on average, about 29 per post, while posts created at 6:00 (6am) receive the fewest.
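
The ranking behind this summary can also be produced without swapping the list elements, by passing a `key` function to `sorted`. The sketch below uses four of the hourly averages printed above; the variable names are illustrative.

```python
# A subset of the hourly averages printed above (hour -> average comments).
avg_by_hour = {
    14: 29.14422241529105,
    15: 29.01863857374392,
    11: 27.118110236220474,
    6: 19.771367521367523,
}

# Sort (hour, average) pairs by the average, highest first.
ranked = sorted(avg_by_hour.items(), key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print(f"{hour:02d}:00: {average:.2f} average comments per post")

# The gap between the busiest and quietest hour shows how much the
# averages vary across the day.
busiest_hour, busiest_avg = ranked[0]
quietest_hour, quietest_avg = ranked[-1]
spread = busiest_avg - quietest_avg
print(f"spread: {spread:.2f}")
```

Sorting on a key keeps each pair in `(hour, average)` order, so the formatting loop can unpack the values by name.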