# Hacker News Post

***

## Import and open the file

**Store the header and the data in separate variables**

In [8]:
from csv import reader

with open('hacker_news.csv') as file:
    read = reader(file)
    hn = list(read)

# separate the header and the data
header = hn[0]
hn = hn[1:]


# print data to see if the reading went well
print(header) 
print("\n")
for item in hn[:5]:
    print(item)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


***

## Meaning of each column

1. **id**: the unique identifier from Hacker News for the post
2. **title**: the title of the post
3. **url**: the URL that the post links to, if the post has a URL
4. **num_points**: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
5. **num_comments**: the number of comments on the post
6. **author**: the username of the person who submitted the post
7. **created_at**: the date and time of the post's submission
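
The cells in this notebook access columns by position (for example `post[1]` for the title and `post[4]` for the comment count). As a minimal sketch, assuming the header row printed in the first cell, those positions can be looked up from the header instead of being hard-coded. The names `col` and `sample_row` are illustrative and do not appear in the original notebook.

```python
# Header row as printed in the first cell of this notebook.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper mapping: column name -> position within each row.
col = {name: index for index, name in enumerate(header)}

# First data row from the output above, used only for illustration.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Positions used later in the notebook: title, num_comments, created_at.
print(col['title'], col['num_comments'], col['created_at'])

# Access a field by name rather than by a bare index.
print(sample_row[col['title']])
```

Looking columns up by name keeps later code readable and guards against mixing up indices.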

***

## Extract the 'Ask HN' and 'Show HN' posts

In [26]:
# Loop over hn and test whether the title (index 1) starts with 'Ask HN' or 'Show HN' (case-insensitive).
# Store the rows in ask_posts, show_posts and other_posts.

show_posts, ask_posts, other_posts = [], [], []

for post in hn:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

# Print the number of posts in each group: show_posts, ask_posts, other_posts
print(len(show_posts), len(ask_posts), len(other_posts))



1162 1744 17194


___

## Calculate the average number of comments for 'Ask' and 'Show'

In [25]:
number_ask = 0
number_show = 0

# ask average comments
for post in ask_posts:
    number_ask += int(post[4])
average_comm_ask = round(number_ask / len(ask_posts))
    
# show average comments
for post in show_posts:
    number_show += int(post[4])
average_comm_show = round(number_show / len(show_posts))

# print the average comments per ask post and per show post
print(f"{average_comm_ask} average comments per ask post.")
print(f"{average_comm_show} average comments per show post.")

14 average comments per ask post.
10 average comments per show post.


> **Ask posts receive more comments on average (about 14 per post) than show posts (about 10 per post). Because ask posts are more likely to receive comments, the remaining analysis focuses on them.**
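
The two loops above repeat the same computation for each list. A minimal sketch of a reusable helper follows, assuming rows shaped like those in `hn`, with the comment count stored as a string at index 4. The function name `average_comments` and the variable `sample_posts` are illustrative, not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post (0.0 for an empty list)."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Three rows copied from the sample output earlier in this notebook.
# They are not Ask HN posts; they only exercise the helper.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52',
     'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)',
     'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2',
     'walterbell', '9/30/2015 4:12'],
]

print(round(average_comments(sample_posts), 1))
```

Returning the unrounded mean and rounding only when printing keeps the decimals available for later comparisons.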

___

## Comments-per-post ratio by posting hour for 'Ask' posts

In [56]:
# In this code block we loop over the ask_posts list and group the posts
# by the hour of the day in which they were created.
# For each hour we accumulate the total number of comments and the number of posts.


import datetime as dt

hour_comm_dict = {}
hour_post_dict = {}

for post in ask_posts:
    time_key = dt.datetime.strptime(post[6], "%m/%d/%Y %H:%M").hour

    if time_key not in hour_comm_dict:
        hour_comm_dict[time_key] = int(post[4])
    else:
        hour_comm_dict[time_key] += int(post[4])

    if time_key not in hour_post_dict:
        hour_post_dict[time_key] = 1
    else:
        hour_post_dict[time_key] += 1



# Convert the dict into a list of (hour, comments) tuples and sort it by hour in ascending order
hour_comm_list = []
for key, value in hour_comm_dict.items():
    hour_comm_list.append((key, value))
hour_comm_list = sorted(hour_comm_list)




# Also create comm_hour_list, sorted in descending order by total number of comments

comm_hour_list = []
for key, value in hour_comm_dict.items():
    comm_hour_list.append((value, key))
comm_hour_list = sorted(comm_hour_list, reverse = True)




# Print both comm_hour_list and hour_post_dict.
# comm_hour_list shows which posting hours collect the most comments in total.
print("The values below show the number of commentaries left per hour in descending order")
for item in comm_hour_list:
    print(item)
print("\n")    


print("The values below show the number of posts per hour, not sorted")
for hour, post_count in hour_post_dict.items():
    print(hour, post_count)
print("\n")     


# What we are actually looking for is the hour with the greatest number of comments per post
hours_comm_per_pc = []

for comm_count, hour in comm_hour_list:
    hours_comm_per_pc.append((hour, round(comm_count / hour_post_dict[hour], 1)))

hours_comm_per_pc = sorted(hours_comm_per_pc, key=lambda x: x[1], reverse=True)

print("The data from below shows the hours with the greatest ratio of comments per posts.")
for item in hours_comm_per_pc:
    print(item)

    

The values below show the number of commentaries left per hour in descending order
(4477, 15)
(1814, 16)
(1745, 21)
(1722, 20)
(1439, 18)
(1416, 14)
(1381, 2)
(1253, 13)
(1188, 19)
(1146, 17)
(793, 10)
(687, 12)
(683, 1)
(641, 11)
(543, 23)
(492, 8)
(479, 22)
(464, 5)
(447, 0)
(421, 3)
(397, 6)
(337, 4)
(267, 7)
(251, 9)


The values below show the number of posts per hour, not sorted
9 45
13 85
10 59
14 107
16 108
23 68
12 73
17 100
15 116
21 109
20 80
2 58
18 109
3 54
5 46
19 110
1 60
22 71
8 48
4 47
0 55
6 44
7 34
11 58


The data from below shows the hours with the greatest ratio of comments per posts.
(15, 38.6)
(2, 23.8)
(20, 21.5)
(16, 16.8)
(21, 16.0)
(13, 14.7)
(10, 13.4)
(18, 13.2)
(14, 13.2)
(17, 11.5)
(1, 11.4)
(11, 11.1)
(19, 10.8)
(8, 10.2)
(5, 10.1)
(12, 9.4)
(6, 9.0)
(0, 8.1)
(23, 8.0)
(7, 7.9)
(3, 7.8)
(4, 7.2)
(22, 6.7)
(9, 5.6)


> **The first block of data shows that ask posts created during hours 2, 14, 15, 16, 18, 20 and 21 collected the most comments in total. The highest comments-per-post ratios belong to hours 15, 2, 20, 16 and 21, in that order.**
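
The per-hour aggregation above can also be written more compactly. The following is a sketch only, assuming the same `"%m/%d/%Y %H:%M"` date format used in the cell above. It swaps the explicit `if key not in dict` checks for `collections.defaultdict`. The function name `comments_per_post_by_hour` and the variable `sample_pairs` are illustrative.

```python
import datetime as dt
from collections import defaultdict


def comments_per_post_by_hour(posts, date_index=6, comments_index=4):
    """Return (hour, average comments per post) pairs, highest average first."""
    comment_totals = defaultdict(int)  # hour -> total comments
    post_counts = defaultdict(int)     # hour -> number of posts
    for post in posts:
        hour = dt.datetime.strptime(post[date_index], "%m/%d/%Y %H:%M").hour
        comment_totals[hour] += int(post[comments_index])
        post_counts[hour] += 1
    averages = [(hour, comment_totals[hour] / post_counts[hour])
                for hour in post_counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# (num_comments, created_at) pairs taken from the first five rows printed
# at the top of this notebook; they only exercise the function.
sample_pairs = [
    ('52', '8/4/2016 11:52'),
    ('10', '1/26/2016 19:30'),
    ('1', '6/23/2016 22:20'),
    ('1', '6/17/2016 0:01'),
    ('2', '9/30/2015 4:12'),
]

for hour, average in comments_per_post_by_hour(sample_pairs, date_index=1, comments_index=0):
    print(hour, round(average, 1))
```

On the full data, calling `comments_per_post_by_hour(ask_posts)` with the default indices should rank the hours the same way as `hours_comm_per_pc`, except that the averages are left unrounded.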

___

## Printing the comments-to-posts ratio distribution in a readable way

In [58]:
for (hour, ratio) in hours_comm_per_pc:
    print(f" The ratio of comments-per-post is {ratio} during the {hour}:00s hours")

 The ratio of comments-per-post is 38.6 during the 15:00s hours
 The ratio of comments-per-post is 23.8 during the 2:00s hours
 The ratio of comments-per-post is 21.5 during the 20:00s hours
 The ratio of comments-per-post is 16.8 during the 16:00s hours
 The ratio of comments-per-post is 16.0 during the 21:00s hours
 The ratio of comments-per-post is 14.7 during the 13:00s hours
 The ratio of comments-per-post is 13.4 during the 10:00s hours
 The ratio of comments-per-post is 13.2 during the 18:00s hours
 The ratio of comments-per-post is 13.2 during the 14:00s hours
 The ratio of comments-per-post is 11.5 during the 17:00s hours
 The ratio of comments-per-post is 11.4 during the 1:00s hours
 The ratio of comments-per-post is 11.1 during the 11:00s hours
 The ratio of comments-per-post is 10.8 during the 19:00s hours
 The ratio of comments-per-post is 10.2 during the 8:00s hours
 The ratio of comments-per-post is 10.1 during the 5:00s hours
 The ratio of comments-per-post is 9.4 durin
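
As one possible way to make these lines easier to scan, the hour can be shown as a zero-padded clock time. This is a sketch only: the helper name `format_hour_line` and the variable `top_hours` are illustrative, and the values are the top five pairs from the output above.

```python
import datetime as dt

# (hour, comments per post) pairs copied from the top of the output above.
top_hours = [(15, 38.6), (2, 23.8), (20, 21.5), (16, 16.8), (21, 16.0)]


def format_hour_line(hour, ratio):
    """Render one hour as a zero-padded clock time with its average."""
    clock = dt.time(hour=hour).strftime("%H:%M")
    return f"{clock}: {ratio:.1f} comments per post"


for hour, ratio in top_hours:
    print(format_hour_line(hour, ratio))
```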

___