## Exploring Hacker News Posts

In this project, we compare different types of posts on Hacker News. We are particularly interested in posts whose titles begin with either **Ask HN** or **Show HN**.

Users submit **Ask HN** posts to ask the Hacker News community a specific question, such as *"Ask HN: How to improve my personal website?"*. Likewise, users submit **Show HN** posts to show the community a project, a product, or just something interesting, such as *"Show HN: Shanhu.io, a programming playground powered by e8vm"*.

We want to compare these two types of posts to determine the following:
* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

It should be noted that the [dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) we are using has been reduced from almost 300,000 rows to 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
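
To illustrate the kind of reduction described above, here is a minimal sketch on made-up rows. The function name `reduce_dataset`, the seed, and the sample rows are assumptions for illustration only; the people who prepared the dataset may have used a different procedure.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample from the rest.

    Assumes each row is a list whose index 4 holds num_comments as a string.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows in the same column order as the dataset.
rows = [
    ['1', 'Post A', '', '10', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'Post B', '', '5', '3', 'u2', '1/1/2016 11:00'],
    ['3', 'Post C', '', '7', '1', 'u3', '1/1/2016 12:00'],
    ['4', 'Post D', '', '2', '8', 'u4', '1/1/2016 13:00'],
]
sample = reduce_dataset(rows, sample_size=2)
print(len(sample))
```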

### Introduction 

We start by importing the CSV file. The code below does the following:
* imports `reader` from the `csv` module
* opens the `hacker_news.csv` file using the `open()` function
* passes `opened_file` to the `reader()` function to read it
* converts `read_file` to a list of lists with `list()` and assigns it to the variable `hn`
* extracts the first row and assigns it to `headers`, the header of the CSV file
* assigns the remaining rows back to `hn`

In [2]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]

print(hn[:5])
print('\n')
print(headers)

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


As shown above, the header row has been separated from the rest of the data.
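
As a variation on the loading step, the header split can be wrapped in a small function so that the file can be opened in a `with` block and closed automatically. This is only a sketch: the function name `load_rows`, the in-memory sample, and the `utf-8` encoding in the commented usage are assumptions, not part of the original notebook.

```python
import csv
import io

def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Tiny in-memory sample standing in for hacker_news.csv (made-up row).
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/4/2016 11:52\n"
)
headers, rows = load_rows(sample)
print(headers)
print(rows)

# With the real file, a `with` block closes the file automatically:
# with open('hacker_news.csv', encoding='utf-8', newline='') as f:
#     headers, hn = load_rows(f)
```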

Each row in the data set contains the following columns:
* `id`: the ID of the post
* `title`: the title of the post
* `url`: the URL the post links to
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time the post was created
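
Because the rows are plain lists, the columns are accessed by position (for example, `row[4]` for `num_comments`). One optional way to make this more readable is to build a name-to-index mapping from `headers`. The name `col` is an assumption introduced for this sketch; the sample row is the first row printed above.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
col = {name: index for index, name in enumerate(headers)}

row = ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

num_comments = int(row[col['num_comments']])  # same as int(row[4])
created_at = row[col['created_at']]           # same as row[6]
print(num_comments, created_at)
```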

### Extracting 'Ask HN' and 'Show HN' Posts

**Methodology**

* **Identify** posts that begin with either `Ask HN` or `Show HN`
* **Filter** the data for those two types of posts into separate lists.

We will assign the **Ask HN** posts to the list `ask_posts`, the **Show HN** posts to `show_posts`, and all remaining posts to `other_posts`.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask HN:',len(ask_posts))
print('Show HN:',len(show_posts))
print('Others:',len(other_posts))

Ask HN: 1744
Show HN: 1162
Others: 17194


As shown above, there are **1,744** `Ask HN` posts, **1,162** `Show HN` posts, and **17,194** other posts.
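
The prefix check used above can also be wrapped in a small helper, which makes the case-insensitive matching easy to test on individual titles. The function name `classify_title` and its return labels are assumptions made for this sketch.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How to improve my personal website?'))
print(classify_title('SHOW HN: Shanhu.io, a programming playground powered by e8vm'))
print(classify_title('Interactive Dynamic Video'))
```

Note that this is a plain prefix match, so any title that starts with the letters `ask hn` or `show hn` is counted, whether or not a colon follows.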

In [7]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print('Average Ask HN comments:', avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments/len(show_posts)
print('Average Show HN comments:', avg_show_comments)

Average Ask HN comments: 14.038417431192661
Average Show HN comments: 10.31669535283993


On average, Ask HN posts receive more comments than Show HN posts: about 14 comments per post versus about 10. Since Ask HN posts attract more comments in general, we will focus the remaining analysis on Ask HN posts.
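
The two averaging loops above follow the same pattern, so they could be expressed as one reusable function. The sketch below uses `sum()` with a generator expression; the function name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)

# Made-up rows in the same column order as the dataset.
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '6', 'u1', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '3', '10', 'u2', '8/4/2016 12:10'],
]
print(average_comments(sample_posts))
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.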

### Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

In [10]:
import datetime as dt

result_list = []
for row in ask_posts:
    # First element is the `created_at` column, second element is the `num_comments` column
    result_list.append([row[6], int(row[4])])
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = time.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment

sorted_comments_by_hour = dict(sorted(comments_by_hour.items(), reverse = True))

sorted_comments_by_hour

{'23': 543,
 '22': 479,
 '21': 1745,
 '20': 1722,
 '19': 1188,
 '18': 1439,
 '17': 1146,
 '16': 1814,
 '15': 4477,
 '14': 1416,
 '13': 1253,
 '12': 687,
 '11': 641,
 '10': 793,
 '09': 251,
 '08': 492,
 '07': 267,
 '06': 397,
 '05': 464,
 '04': 337,
 '03': 421,
 '02': 1381,
 '01': 683,
 '00': 447}

The two dictionaries created above are:

* `counts_by_hour`: contains the number of ask posts created during each hour of the day.
* `comments_by_hour`: contains the corresponding number of comments that ask posts created at each hour received.
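
To make the parsing and tallying step concrete, here is a small worked example on made-up `[created_at, num_comments]` pairs. It uses `collections.defaultdict` instead of the `if`/`else` check in the cell above, which is a substitution made for brevity; the variable names `counts` and `comments` are assumptions for this sketch.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the dataset's date format.
sample = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 4],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # number of comments per hour

for created_at, num_comments in sample:
    # Parse the date string, then keep only the zero-padded hour, e.g. '11'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```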


Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.


### Calculating the Average Number of Comments for Ask HN Posts by Hour



In [12]:
avg_by_hour = []
for hour in comments_by_hour:
    avg = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

sorted_avg_by_hour = sorted(avg_by_hour, reverse = True)

sorted_avg_by_hour

[['23', 7.985294117647059],
 ['22', 6.746478873239437],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['19', 10.8],
 ['18', 13.20183486238532],
 ['17', 11.46],
 ['16', 16.796296296296298],
 ['15', 38.5948275862069],
 ['14', 13.233644859813085],
 ['13', 14.741176470588234],
 ['12', 9.41095890410959],
 ['11', 11.051724137931034],
 ['10', 13.440677966101696],
 ['09', 5.5777777777777775],
 ['08', 10.25],
 ['07', 7.852941176470588],
 ['06', 9.022727272727273],
 ['05', 10.08695652173913],
 ['04', 7.170212765957447],
 ['03', 7.796296296296297],
 ['02', 23.810344827586206],
 ['01', 11.383333333333333],
 ['00', 8.127272727272727]]

The highest average number of comments per post occurs at 15:00 (3 p.m.), while the lowest occurs at 09:00 (9 a.m.). However, a list ordered by hour is a little difficult to read. We will sort the list of lists by average and print only the five highest values in a more readable format.

In [14]:
swap_avg_by_hour = [] #swapping `value` to first entry and `key` to second entry

for row in avg_by_hour:
    key = row[0]
    value = row[1]
    swap_avg_by_hour.append([value,key])

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]
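
The swap step is one way to sort by the average. An alternative, sketched below, is to leave the pairs as `[hour, average]` and pass a `key` function to `sorted()`. The sample uses a few pairs taken from the results above.

```python
# A few [hour, average] pairs taken from the results above.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print("{}:00 : {:.2f} average comments per post".format(hour, avg))
```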

In [15]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hr = row[1]
    avg = row[0]
    time1 = dt.datetime.strptime(hr, '%H').strftime('%H:%M')
    output = "{t} : {a:.2f}, average comments per post".format(t = time1, a = avg)
    print(output)

Top 5 Hours for Ask Posts Comments
15:00 : 38.59, average comments per post
02:00 : 23.81, average comments per post
20:00 : 21.52, average comments per post
16:00 : 16.80, average comments per post
21:00 : 16.01, average comments per post


On average, the posts that gather the most comments are created at 15:00 (3 p.m. EST), with an average of 38.59 comments per post.

The top hour, 15:00, averages about 62% more comments per post than the second-highest hour, 02:00 (38.59 versus 23.81).
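
The 62% figure comes from dividing the two averages reported above. A quick check of the arithmetic:

```python
top_avg = 38.5948275862069       # average comments per post at 15:00
second_avg = 23.810344827586206  # average comments per post at 02:00

# Relative difference between the top hour and the runner-up, in percent.
pct_more = (top_avg / second_avg - 1) * 100
print("{:.1f}% more comments per post".format(pct_more))
```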

According to the [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), the timezone used is US Eastern Standard Time.
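
Readers in other time zones may want to convert the peak hour. The sketch below converts 15:00 to UTC using only the standard library. It assumes a fixed UTC-5 offset for Eastern Standard Time (no daylight saving adjustment), and the date is hypothetical; only the 15:00 hour comes from the analysis.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset (Eastern Standard Time, no daylight saving).
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")

# Hypothetical date; only the 15:00 hour comes from the analysis.
peak = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)
peak_utc = peak.astimezone(dt.timezone.utc)
print(peak_utc.strftime("%H:%M UTC"))
```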

## Conclusion

In this project, we analysed the [Hacker News posts](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) dataset with the goal of **finding out whether '*Ask HN*' or '*Show HN*' posts receive more comments on average, and whether the hour a post is created affects the number of comments it receives**. To do so, we loaded the data, separated the two post types, grouped the Ask HN posts by hour created, and sorted the resulting averages.

Based on our analysis, a post is likely to receive the most comments on average if it is:
* categorised as an *Ask HN* post
* created between 15:00 and 16:00 US Eastern Standard Time.