# Most popular post

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit.

You can find the data set <a href="https://www.kaggle.com/hacker-news/hacker-news-posts">here</a>, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. 

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

#### Conclusions
On average, `Ask HN` posts receive more comments than `Show HN` posts. Among `Ask HN` posts, those created in the afternoon during the 15:00 hour receive the most comments on average.

## Exploring Hacker News dataset
We are going to use the [open](https://docs.python.org/3/library/functions.html#open) built-in function to open the `hacker_news.csv` file, together with the [with](https://docs.python.org/3/reference/compound_stmts.html#with) compound statement, which ensures that the file is closed once we finish using it.

Finally, we assign to the `hn` variable a list of lists that represents the dataset.

In [1]:
from csv import reader
with open('hacker_news.csv') as file:
    file_reader = reader(file)
    hn = list(file_reader)

We display the first five rows of `hn`, including the header row.

In [2]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
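
To see what `reader` produces without needing the real file, here is a minimal sketch that parses a small in-memory sample. The sample rows and the names `sample_csv` and `sample_rows` are made up for illustration; only the column layout follows `hacker_news.csv`.

```python
import csv
import io

# Hypothetical two-line sample in the same column layout as hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    '1,"Ask HN: Example, with a comma?",,5,3,alice,8/4/2016 11:52\n'
)

# io.StringIO stands in for the file object that open() would return.
with io.StringIO(sample_csv) as file:
    sample_rows = list(csv.reader(file))

print(sample_rows[0])  # header row
print(sample_rows[1])  # every field is a string, including the numbers
```

Note that the quoted title keeps its comma and that numeric fields arrive as strings. When reading a real file, the `csv` documentation recommends opening it with `newline=''`.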


The columns in the `hn` dataset are:

- `id`: The unique identifier from Hacker News for the post
- `title`:  The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted

We are going to extract the header row from the `hn` dataset. The header row is the first row (index 0), and we assign it to the `headers` variable.

In [3]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Next, we are going to remove the header row from `hn`.

In [4]:
hn = hn[1:]
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Finding Ask HN and Show HN



We are going to define an [IntEnum](https://docs.python.org/3/library/enum.html?highlight=intenum#enum.IntEnum) subclass called `Columns`. It maps column names to their indexes when we work with the `hn` dataset, which makes our code more readable.

In [5]:
from enum import IntEnum

class Columns(IntEnum):
    ID = 0
    TITLE = 1
    URL = 2
    NUM_POINTS = 3
    NUM_COMMENTS = 4
    AUTHOR = 5
    CREATED_AT = 6
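
As a quick illustration of why an `IntEnum` works here, the sketch below indexes a list with enum members. `SampleColumns` and `sample_row` are hypothetical names used only for this example; the idea is the same as with `Columns`.

```python
from enum import IntEnum


class SampleColumns(IntEnum):
    # Hypothetical two-column layout, only for this illustration.
    TITLE = 0
    NUM_COMMENTS = 1


sample_row = ['Ask HN: Example question?', '7']

# IntEnum members are integers, so they can be used directly as list indexes.
title = sample_row[SampleColumns.TITLE]
num_comments = int(sample_row[SampleColumns.NUM_COMMENTS])

print(title, num_comments)
```

Compared with writing `sample_row[1]`, the named member documents which column is being read.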

To find the posts that begin with either `Ask HN` or `Show HN`, we'll use the string method [startswith](https://docs.python.org/3/library/stdtypes.html#str.startswith), which returns `True` if the string starts with the given prefix and `False` otherwise.

To handle case variations, we'll use the string method [lower](https://docs.python.org/3/library/stdtypes.html?highlight=lower#str.lower), which returns a copy of the string with all the cased characters converted to lowercase.
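
The combination of `lower` and `startswith` can be sketched on a few made-up titles. The `classify` helper and `sample_titles` are hypothetical and exist only to show the matching behavior.

```python
# Made-up titles: mixed case, a near miss, and an unrelated post.
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Asking for a friend',
    'Interactive Dynamic Video',
]


def classify(title):
    """Return 'ask', 'show', or 'other' for a post title (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


labels = [classify(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other', 'other']
```

Lowercasing first means `SHOW HN` still matches, while `Asking for a friend` does not, because the prefix must match character by character. `startswith` also accepts a tuple of prefixes when a single yes/no answer is enough.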

We'll start by creating three lists:

- `ask_posts` for `Ask HN` posts
- `show_posts` for `Show HN` posts
- `other_posts` for posts that match neither prefix

In [6]:
ask_posts = []
show_posts = []
other_posts = []

We'll loop through each row in `hn`. In each iteration we'll check the lowercased title of the post:
- if it starts with `ask hn`, append the row to `ask_posts`
- else if it starts with `show hn`, append the row to `show_posts`
- else append the row to `other_posts`

In [7]:
for row in hn:
    title = row[Columns.TITLE]
    title = title.lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

We'll check the number of posts in `ask_posts`, `show_posts`, and `other_posts`.


In [8]:
print('Ask HN: ', len(ask_posts))
print('Show HN: ', len(show_posts))
print('Other posts: ', len(other_posts))

Ask HN:  1744
Show HN:  1162
Other posts:  17194


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

We'll determine whether ask posts or show posts receive more comments on average.

To calculate the average number of comments we need to:

- Define `total_comments` with value `0`; we'll use this variable to accumulate the `num_comments` values.
- Iterate over a list of lists of posts.
- Extract `num_comments` from each post. Because the `num_comments` value is a string, we use the [int](https://docs.python.org/3/library/functions.html?highlight=int#int) built-in function, which returns an integer object from a numeric string.
- Finally, compute the average by dividing `total_comments` by the number of posts, which we get with the [len](https://docs.python.org/3/library/functions.html?highlight=len#len) built-in function.

Because this process needs to be executed twice, once for `Ask HN` posts and once for `Show HN` posts, we'll define the `calculate_avg_comments` function to encapsulate it. The function receives a list of lists of posts as its parameter.

In [9]:
def calculate_avg_comments(posts):
    total_comments = 0
    
    for post in posts:
        num_comments = post[Columns.NUM_COMMENTS]
        num_comments = int(num_comments)
        total_comments += num_comments
    
    return total_comments / len(posts)
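
A more compact variant is sketched below. It is not the function used in the rest of this analysis: `average_comments`, its `comments_index` parameter (defaulting to `4`, the position of `num_comments`), and the choice to return `0.0` for an empty list are all assumptions made for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments, or 0.0 for an empty list (assumed convention)."""
    if not posts:
        # Avoid a ZeroDivisionError when there are no posts to average.
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same column layout as the dataset.
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '3', 'alice', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '9', 'bob', '8/4/2016 12:10'],
]

print(average_comments(sample_posts))  # 6.0
print(average_comments([]))            # 0.0
```

The guard matters only if one of the lists could be empty; with this dataset both `ask_posts` and `show_posts` contain rows.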
        

We'll use `calculate_avg_comments` to compute the average number of comments for `ask_posts` and `show_posts`.

In [10]:
avg_ask_comments = calculate_avg_comments(ask_posts)
avg_show_comments = calculate_avg_comments(show_posts)

print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993


On average, `Ask HN` posts receive more comments (about 14.04 per post) than `Show HN` posts (about 10.32 per post).

## Finding the Amount of Ask Posts and Comments by Hour Created

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

We'll find out whether ask posts created at certain times are more likely to attract comments. To do that, we'll calculate the number of ask posts and comments by hour created, using the following steps:

- Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

For the first step, we'll iterate over `ask_posts` to extract each post's `created_at` date and its `num_comments` value.

- We'll define `result_list` as a list of lists to store the `created_at` and `num_comments` values.
- We'll define `created_at_format`, which holds the [strftime format codes](https://www.programiz.com/python-programming/datetime/strftime) that describe the `created_at` column values, for example `'6/17/2016 17:10'`.
- We'll iterate over `ask_posts` and append to `result_list` a list with two elements.
- The first element is the `created_at` value stored as a [datetime](https://docs.python.org/3/library/datetime.html) object. We'll use [strptime](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior), which creates a datetime object from a string in a specific format.
- The second element is `num_comments` converted to an integer.

In [11]:
import datetime as dt

result_list = []
# strftime code for created_at
created_at_format = '%m/%d/%Y %H:%M'
for post in ask_posts:
    created_at = post[Columns.CREATED_AT]
    created_at = dt.datetime.strptime(created_at, created_at_format)
    
    num_comments = post[Columns.NUM_COMMENTS]
    num_comments = int(num_comments)
    result_list.append([created_at, num_comments])
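
To make the parsing step concrete, here is a minimal sketch that applies the same format string to two `created_at` values of the kind found in the dataset. The variable names `afternoon` and `early` are chosen only for this example.

```python
import datetime as dt

created_at_format = '%m/%d/%Y %H:%M'

afternoon = dt.datetime.strptime('6/17/2016 17:10', created_at_format)
early = dt.datetime.strptime('9/30/2015 4:12', created_at_format)

print(afternoon)                # 2016-06-17 17:10:00
print(afternoon.hour)           # 17, as an int
print(early.strftime('%H'))     # 04, as a zero-padded string
```

`strptime` accepts months, days, and hours without a leading zero, while `strftime('%H')` always returns a two-character string, which is why the hour keys later look like `'04'` rather than `4`.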

Then we'll create frequency tables for the number of posts by hour and the number of comments by hour. Frequency refers to the number of times an event or a value occurs.

- We'll define two empty dictionaries: `counts_by_hour` for the number of ask posts created during each hour of the day, and `comments_by_hour` for the number of comments received by ask posts created at each hour.
- We'll iterate over `result_list`.
- We'll extract the hour from the first element of each row using [datetime.strftime()](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior), which creates a string representation of the time in a specific format. In this case we'll use the `%H` format code to extract only the hour.
- We'll build the frequency tables using the hour as the key.
- For `counts_by_hour`, we'll increment the number of posts created at that hour by `1`.
- For `comments_by_hour`, we'll increment the number of comments for that hour by `num_comments` (the second element of the row).

In [12]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at_hour = row[0].strftime('%H')
    counts_post = counts_by_hour.get(created_at_hour, 0)
    counts_comments = comments_by_hour.get(created_at_hour, 0)
    
    counts_by_hour[created_at_hour] = counts_post + 1
    comments_by_hour[created_at_hour] = counts_comments + row[1]

In [13]:
print(counts_by_hour, comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
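
The same kind of frequency table can also be built with `collections.defaultdict`, which removes the need for `dict.get(key, 0)`. This is an alternative sketch on made-up `(hour, num_comments)` pairs, not the code used above; `sample`, `sample_counts`, and `sample_comments` are hypothetical names.

```python
from collections import defaultdict

# Made-up (hour, num_comments) pairs standing in for result_list.
sample = [('15', 10), ('15', 4), ('09', 2)]

# defaultdict(int) starts every missing key at 0.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for hour, num_comments in sample:
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'15': 2, '09': 1}
print(dict(sample_comments))  # {'15': 14, '09': 2}
```

Both approaches produce the same tables; `dict.get` keeps plain dictionaries, while `defaultdict` keeps the loop body shorter.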


## Calculating the Average Number of Comments for Ask HN Posts by Hour

We'll create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

- We'll define `avg_by_hour`, where we'll store the average number of comments per post together with the hour of the day.
- We'll iterate over `comments_by_hour`.
- For every hour, we'll get the number of posts from the `counts_by_hour` dict.
- We'll also get the number of comments received by ask posts created at that hour from the `comments_by_hour` dict.
- With these values, we'll calculate the average number of comments per post for that hour.

In [15]:
avg_by_hour = []
for hour in comments_by_hour:
    num_post = counts_by_hour.get(hour, 0)
    num_comments = comments_by_hour.get(hour, 0)
    avg = num_comments/num_post
    avg_by_hour.append([avg, hour])
    

In [16]:
print(avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
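
As a small check of the arithmetic, the sketch below recomputes the averages for three hours using the post and comment totals printed earlier for `'15'`, `'02'`, and `'09'`. The names `sample_counts`, `sample_comments`, and `sample_avg` are used only in this example.

```python
# Totals copied from the frequency tables printed above.
sample_counts = {'15': 116, '02': 58, '09': 45}
sample_comments = {'15': 4477, '02': 1381, '09': 251}

# Average comments per post = total comments / number of posts, per hour.
sample_avg = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_counts
}

for hour, avg in sample_avg.items():
    print(hour, round(avg, 2))
```

For the 15:00 hour this gives 4477 / 116, roughly 38.59, which matches the first value in the sorted results.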


We'll sort and format the `avg_by_hour` list, because in its current form it is hard to identify the hours with the highest values.

- We'll use the [sorted](https://docs.python.org/3/library/functions.html#sorted) built-in function, which returns a new sorted list from the items in an iterable. We'll set the `reverse` argument to `True` so the values are returned from highest to lowest.
- We'll use [str.format()](https://docs.python.org/3/library/stdtypes.html#str.format) to print each row in the following format: `15:00: 38.59 average comments per post`
- To format the hour, we'll use `datetime.strptime()` to get a datetime object and its `strftime()` method to specify the format of the time.
- To format the average value, we'll use `{:.2f}` to indicate that just two decimal places should be shown.

In [22]:
sorted_avg_by_hour = sorted(avg_by_hour, reverse=True)
print('*** Hours for Ask Posts Comments ***')
for row in sorted_avg_by_hour:
    time = dt.datetime.strptime(row[1], '%H')
    time = time.strftime('%H:%M')
    result = '{}: {:.2f} average comments per post'.format(time, row[0])
    print(result)

*** Hours for Ask Posts Comments ***
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post
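
The sorting and formatting step could also be written with an explicit sort key and f-strings. This is an alternative sketch on three rounded values from the results, not the code that produced the output above; `sample_avg`, `ranked`, and `lines` are hypothetical names.

```python
# [average, hour] pairs, rounded from the results above.
sample_avg = [[5.58, '09'], [38.59, '15'], [23.81, '02']]

# Sort by the average only, highest first.
ranked = sorted(sample_avg, key=lambda row: row[0], reverse=True)

lines = [
    f'{hour}:00: {avg:.2f} average comments per post'
    for avg, hour in ranked
]

for line in lines:
    print(line)
```

An explicit `key` makes it clear that only the average drives the ordering, instead of relying on list comparison falling through to the hour string when two averages are equal.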


## Conclusions
`Ask HN` posts created in the afternoon, during the 15:00 hour, receive the most comments on average (38.59 per post).
Posts created in the morning during the 07:00 and 09:00 hours, or in the evening from 22:00 onward, receive fewer comments on average.