# Guided Project: Exploring Hacker News Posts

## Introduction
Hello everyone! This is my second guided project, in which we explore Hacker News posts. Hacker News is a community similar to Reddit where people publish posts and get reactions in the form of comments and votes. Users can prefix a title with `Ask HN` to ask the community a question, or with `Show HN` to show the community a project, a product, or something else they find interesting.

## The aim
The main goal is to find out whether `Ask HN` or `Show HN` posts receive more comments on average, and whether the time of day a post is created affects the average number of comments it receives.

In [35]:
import csv

# Reading the file
with open("hacker_news.csv") as opened_file:
    hn = list(csv.reader(opened_file))

# Printing the first few rows
for row in hn[:5]:
    print(row, end="\n\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']



## Removing the header
The following code block separates the header row from the data and prints the first five data rows.

In [36]:
# Removing the first row 
headers = hn[0]
hn = hn[1:]
print(headers)

# Printing the first 5 rows
for row in hn[:5]:
    print(row, end="\n\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

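As a small self-contained illustration of the read-then-split pattern used above, here is a sketch that parses CSV text from memory instead of `hacker_news.csv`. The sample rows and the name `sample_csv` are made up for this example and are not part of the dataset.

```python
import csv
import io

# Hypothetical in-memory CSV text standing in for a real file
sample_csv = (
    'id,title,num_comments\n'
    '1,"Ask HN: Tabs, or spaces?",12\n'
    '2,Show HN: My app,3\n'
)

# csv.reader handles quoted fields that contain commas
rows = list(csv.reader(io.StringIO(sample_csv)))

# Separate the header row from the data rows
header, data = rows[0], rows[1:]

print(header)
for row in data:
    print(row)
```

Note that every value comes back as a string, which is why the comment counts are converted with `int()` later in the project.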


## Extracting Ask HN and Show HN Posts
Below, we filter our data into three categories: `Ask HN`, `Show HN`, and others.


In [37]:
# Setting the empty lists
ask_posts = []
show_posts = []
other_posts = []

# Filtering into three types of posts
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Checking the number of each type of post
print(len(ask_posts), len(show_posts), len(other_posts), sep="\n")

1744
1162
17194

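The same prefix test could also be wrapped in a small helper, which makes the rule easy to check on its own. This is only a sketch: `classify_post` and the sample titles are hypothetical names and values introduced for illustration.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical sample titles, not taken from the dataset
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]

labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Because the check uses `startswith`, a title that merely mentions "Ask HN" somewhere in the middle still falls into the "other" category.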

## Number of comments for Ask HN and Show HN Posts on average
We calculate which of the first two post types receives more comments on average.

In [38]:
# Setting the total number of comments for Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

# Computing the average 
avg_ask_comments = total_ask_comments / len(ask_posts)
print(f"On average, Ask HN posts receive {round(avg_ask_comments)} comments")

# Setting the total number of comments for Show HN posts
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

# Computing the average 
avg_show_comments = total_show_comments / len(show_posts)
print(f"On average, Show HN posts receive {round(avg_show_comments)} comments")

On average, Ask HN posts receive 14 comments
On average, Show HN posts receive 10 comments

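Since the two loops above differ only in the list they read, the calculation could be factored into one function. The sketch below assumes the comment count sits at index 4, as in this dataset; `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the dataset
sample_posts = [
    ["1", "Ask HN: A", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "3", "4", "user_b", "8/4/2016 12:10"],
]

avg = average_comments(sample_posts)
print(f"Average comments: {avg}")
```

The guard for an empty list avoids a `ZeroDivisionError` if a category happens to contain no posts.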

## Finding the number of Ask HN comments based on the hour created
We have already found that `Ask HN` posts receive more comments than `Show HN` posts on average. As a result, we will focus on the former and find at what time of the day `Ask HN` gets the most comments.

In [39]:
import datetime as dt

# Building a list of [created_at, num_comments] pairs
result_list = []
for row in ask_posts:
    time = row[6]
    comments = int(row[4])
    temp_list = [time, comments]
    result_list.append(temp_list)

# Setting empty dictionaries
counts_by_hour = {}
comments_by_hour = {}

# Counting the posts and summing the comments for each hour of the day
for row in result_list:
    hour = row[0]
    date_hour_object = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    select_hour = date_hour_object.strftime("%H")
    if select_hour not in counts_by_hour:
        counts_by_hour[select_hour] = 1
        comments_by_hour[select_hour] = row[1]
    else:
        counts_by_hour[select_hour] += 1
        comments_by_hour[select_hour] += row[1]
        
# Showing the results
print(counts_by_hour, comments_by_hour, sep="\n\n")    

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}

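The if/else bookkeeping above can be written more compactly with `collections.defaultdict`, which starts every missing key at zero. This is a sketch of an alternative, not the approach used in the cell above; the sample pairs are made up, with timestamps in the dataset's `month/day/year hour:minute` format.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs
sample_pairs = [
    ["8/4/2016 11:52", 6],
    ["8/5/2016 11:05", 4],
    ["6/17/2016 0:01", 1],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_pairs:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "00" or "11"
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```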

Here we find the average number of Ask HN comments in each hour.

In [40]:
# Setting an empty list
avg_by_hour = []

# Calculating the average number of comments per post for each hour
for hour in comments_by_hour:
    avg = round(comments_by_hour[hour] / counts_by_hour[hour], 2)
    avg_by_hour.append([hour, avg])

# Showing the results
for i in avg_by_hour:
    print(i)

['09', 5.58]
['13', 14.74]
['10', 13.44]
['14', 13.23]
['16', 16.8]
['23', 7.99]
['12', 9.41]
['17', 11.46]
['15', 38.59]
['21', 16.01]
['20', 21.52]
['02', 23.81]
['18', 13.2]
['03', 7.8]
['05', 10.09]
['19', 10.8]
['01', 11.38]
['22', 6.75]
['08', 10.25]
['04', 7.17]
['00', 8.13]
['06', 9.02]
['07', 7.85]
['11', 11.05]

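To make the arithmetic concrete, the sketch below recomputes three of the averages from the counts and totals printed earlier (hours `15`, `02`, and `09`). The names `counts_sample`, `comments_sample`, and `avg_sample` are introduced only for this example.

```python
# Post counts and comment totals for three hours, copied from the output above
counts_sample = {"15": 116, "02": 58, "09": 45}
comments_sample = {"15": 4477, "02": 1381, "09": 251}

# Average comments per post = total comments / number of posts, rounded to 2 decimals
avg_sample = [
    [hour, round(comments_sample[hour] / counts_sample[hour], 2)]
    for hour in comments_sample
]

for pair in avg_sample:
    print(pair)
```

For example, hour `15` has 4477 comments across 116 posts, which gives 38.59 comments per post.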

## Top 5 hours to create an Ask HN post 
Lastly, we find the hours in which an `Ask HN` post is likely to receive the most comments.

In [41]:
# Creating a swapped list
swap_avg_by_hour = []
for cell in avg_by_hour:
    swap_avg_by_hour.append([cell[1], cell[0]])

# Sorting the list
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Printing the results
print("Top 5 Hours for Ask Posts Comments")
for el in sorted_swap[:5]:
    original_time = dt.datetime.strptime(el[1], "%H")
    # Shifting by 6 hours converts to CET, assuming the dataset's timestamps are in US Eastern Time
    updated_time_be = original_time + dt.timedelta(hours=6)
    updated_time_be = updated_time_be.strftime("%H:%M")
    print(f"{updated_time_be} (CET): {el[0]} average comments per post")

Top 5 Hours for Ask Posts Comments
21:00 (CET): 38.59 average comments per post
08:00 (CET): 23.81 average comments per post
02:00 (CET): 21.52 average comments per post
22:00 (CET): 16.8 average comments per post
03:00 (CET): 16.01 average comments per post

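Swapping the columns before sorting is not strictly necessary: `sorted` accepts a `key` function that picks the value to sort on. The sketch below shows that alternative together with the hour shift, using a few averages from the results above. `shift_hour` is a hypothetical helper, and the 6-hour offset is the same fixed shift used in the cell above rather than a real time zone conversion.

```python
import datetime as dt

# A few [hour, average] pairs taken from the results above
avg_sample = [["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.52]]

# Sort by the average (second item) in descending order, without swapping columns
top_hours = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)


def shift_hour(hour_label, offset_hours=6):
    """Shift a zero-padded hour label by a fixed offset and format it as HH:MM."""
    shifted = dt.datetime.strptime(hour_label, "%H") + dt.timedelta(hours=offset_hours)
    return shifted.strftime("%H:%M")


for hour, avg in top_hours[:3]:
    print(f"{shift_hour(hour)}: {avg} average comments per post")
```

Adding a `timedelta` wraps past midnight correctly, so hour `20` becomes `02:00` rather than an invalid `26:00`.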

## Conclusion
In this project we analysed data from the Hacker News online platform and found that `Ask HN` posts receive more comments on average than `Show HN` posts (about 14 versus 10). In addition, we calculated the hours at which `Ask HN` posts receive the most comments on average; the top slot, 21:00, averages 38.59 comments per post. The times are presented in Central European Time (CET, UTC+01:00).