## Guided Project: Exploring Hacker News Posts 
*This is the guided project for the 'Python for Data Science: Intermediate' course of Dataquest.*

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

You can find the data set at https://www.kaggle.com/hacker-news/hacker-news-posts. Note that the version used here has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, such as 'Ask HN: How to improve my personal website?'
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting, such as 'Show HN: Shanhu.io, a programming playground powered by e8vm'.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [9]:
from csv import reader

# Read the data set into a list of lists; the with block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

To analyze the data, we first need to remove the row containing the column headers.
Let's remove that row, then print it along with the first five rows of the remaining data set.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
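As an alternative to slicing, the header row can be split off while the file is being read. The sketch below is only an illustration: it uses a small in-memory sample (`sample_csv`, an assumption standing in for hacker_news.csv, with two rows copied from the output above) and calls `next()` on the reader to consume the header before collecting the data rows.

```python
import io
from csv import reader

# In-memory stand-in for hacker_news.csv (two rows copied from the output above)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

read_file = reader(sample_csv)
sample_headers = next(read_file)  # the first row holds the column names
sample_rows = list(read_file)     # every remaining row is data

print(sample_headers)
print(len(sample_rows))
```

Both approaches give the same result; `next()` simply avoids building the full list before removing its first element.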


# Extracting Ask HN and Show HN Posts
Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

In [3]:
# Create three empty lists called ask_posts, show_posts, and other_posts
ask_posts = []
show_posts = []
other_posts = []

# For each row in hn, assign the lowercase version of the title to a variable named title.
# If title starts with 'ask hn', append the row to ask_posts.
# If title starts with 'show hn', append the row to show_posts.
# Otherwise, append the row to other_posts.
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Let's check the lengths: the three lists should add up to len(hn)
print(len(hn))
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

20100
1744
1162
17194
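The prefix check can also be wrapped in a small helper, which makes it easy to try on individual titles. This is a minimal sketch; the function name `post_type` is hypothetical and not part of the project, and the sample titles are the ones quoted earlier in this notebook.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # lowercase first so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Shanhu.io, a programming playground powered by e8vm',
    'Interactive Dynamic Video',
]
for sample_title in sample_titles:
    print(post_type(sample_title), '-', sample_title)
```

Lowercasing before calling `startswith()` is what makes the match case-insensitive, exactly as in the loop above.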


# Calculating the Average Number of Comments for Ask HN and Show HN Posts
Next, let's determine if ask posts or show posts receive more comments on average.

In [4]:
# Find the total number of comments in ask posts
# For each row in ask_posts:
# get the number of comments from each row and assign it to a variable named 'comments'
# Add this value to total_ask_comments
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

# Compute the average number of comments on ask posts    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
# Find the total number of comments in show posts
# For each row in show_posts:
# get the number of comments from each row and assign it to a variable named 'comments'
# Add this value to total_show_comments
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

# Compute the average number of comments on show posts    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


As you can see, ask posts receive about 14 comments on average, whereas show posts receive about 10.
I think this makes sense, as people are more inclined to respond to a question to help someone out.

Since ask posts receive more comments on average, we'll focus our remaining analysis on just these posts.

# Finding the Number of Ask Posts and Comments by Hour Created
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
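Both steps depend on extracting the hour from a created_at string. As a minimal sketch, the snippet below parses two timestamps copied from the sample output earlier in this notebook; the variable names are chosen for illustration only.

```python
import datetime as dt

# Two created_at values copied from the sample rows printed earlier
created_at_values = ['8/4/2016 11:52', '6/17/2016 0:01']

hours = []
for created_at in created_at_values:
    # strptime() turns the string into a datetime object using the given format
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    # strftime('%H') formats the hour as a zero-padded two-digit string
    hours.append(parsed.strftime('%H'))

print(hours)  # ['11', '00']
```

Note that the hour comes back as a string such as '00' rather than an integer, which is why the dictionary keys in the analysis are strings.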


In [8]:
import datetime as dt

# Create an empty list and assign it to result_list.
result_list = []

# Iterate over ask_posts and append to result_list a list with two elements:
# The first element shall be the column created_at.
# The second element shall be the number of comments of the post.
for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])

# Create two empty dictionaries called counts_by_hour and comments_by_hour.
counts_by_hour = {}
comments_by_hour = {}

# counts_by_hour: the number of ask posts created during each hour of the day.
# comments_by_hour: the total number of comments received by ask posts created in each hour.

# Loop through each row of result_list.
# The date string is the first element of the row; the comment count is the second.
# Use the datetime.strptime() method to parse the date and create a datetime object.
# Use the strftime() method to select just the hour from the datetime object.
for row in result_list:
    date = row[0]
    comment = row[1]
    date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date_dt.strftime("%H")

    if hour not in counts_by_hour:
        # First post seen for this hour: create the key in both dictionaries.
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        # Hour already present: add one post and this post's comments.
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
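The same tallying can be written without the explicit membership test by using `collections.defaultdict`, which starts every missing key at 0. This is a sketch of an alternative, not the approach used above; `sample_results` is illustrative data (the second row is hypothetical, added so that one hour appears twice).

```python
import datetime as dt
from collections import defaultdict

# Illustrative [created_at, comments] pairs; the second row is hypothetical
sample_results = [
    ['8/4/2016 11:52', 52],
    ['8/5/2016 11:10', 8],
    ['6/17/2016 0:01', 1],
]

sample_counts = defaultdict(int)    # posts per hour, missing keys start at 0
sample_comments = defaultdict(int)  # total comments per hour

for created_at, comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += comments

print(dict(sample_counts))    # {'11': 2, '00': 1}
print(dict(sample_comments))  # {'11': 60, '00': 1}
```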

# Calculating the Average Number of Comments for Ask HN Posts by Hour
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [19]:
avg_by_hour = []
for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

avg_by_hour

[['12', 9.41095890410959],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['03', 7.796296296296297],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['00', 8.127272727272727],
 ['09', 5.5777777777777775],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['01', 11.383333333333333],
 ['08', 10.25],
 ['19', 10.8],
 ['22', 6.746478873239437],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['06', 9.022727272727273],
 ['13', 14.741176470588234],
 ['05', 10.08695652173913],
 ['11', 11.051724137931034],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['04', 7.170212765957447],
 ['07', 7.852941176470588]]
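The loop above can also be expressed as a list comprehension. In the sketch below, the comment totals are copied from the comments_by_hour output, while the post counts are back-computed from those totals and the averages just printed (counts_by_hour itself is not displayed in this notebook, so treat them as an assumption).

```python
# Comment totals copied from the comments_by_hour output above
sample_comments_by_hour = {'15': 4477, '17': 1146, '09': 251}
# Post counts back-computed from the totals and the printed averages (assumption)
sample_counts_by_hour = {'15': 116, '17': 100, '09': 45}

# One [hour, average] pair per hour, built in a single expression
sample_avg_by_hour = [
    [hour, sample_comments_by_hour[hour] / sample_counts_by_hour[hour]]
    for hour in sample_counts_by_hour
]

for hour, avg in sample_avg_by_hour:
    print(hour, round(avg, 2))
```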

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [31]:
# Create a list that equals avg_by_hour with swapped columns
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# Use the sorted() function to sort swap_avg_by_hour in descending order.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
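Swapping the columns is one way to sort by the average. Another option, sketched here on four rows copied from avg_by_hour, is to pass a `key` function to `sorted()` so the original [hour, average] layout is kept.

```python
# Four [hour, average] rows copied from the avg_by_hour output
sample_avg_by_hour = [
    ['12', 9.41095890410959],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the second element (the average), highest first
sample_sorted = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in sample_sorted:
    print(hour, avg)
```

One difference to be aware of: with a `key`, rows with equal averages keep their original relative order, whereas sorting the swapped lists breaks ties by comparing the hour strings.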

In [40]:
print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    date = dt.datetime.strptime(hour, "%H")
    date_str = date.strftime("%H:%M")
    print("{a}: {b:.2f} average comments per post".format(a=date_str, b=avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


To have a higher chance of receiving comments, you should create a post in the 15:00 hour. Ask posts created in that hour receive an average of 38.59 comments.
The next-best hour is 2:00, with an average of 23.81 comments, which is considerably fewer.
The times in this data set are in US Eastern Time.

# Conclusion
In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average.
Note that all submissions that did not receive any comments were removed from the data set we started with, so these results only describe posts that received at least one comment. Within that data, ask posts received more comments on average than show posts.
So if you want to receive more comments on Hacker News, submit an Ask HN post between 15:00 and 16:00 US Eastern Time.