## Guided Project: Exploring Hacker News Posts

**Hacker News** is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Removing Headers from a List of Lists
1. Extract the first row of data, and assign it to the variable headers.
2. Remove the first row from hn.
3. Display headers.
4. Display the first five rows of hn to verify that you removed the header row properly.

In [11]:
from csv import reader

open_file = open("hacker_news.csv", encoding="utf-8")
read_file = reader(open_file)
hn = list(read_file)

In [12]:
headers = hn[0]
hn = hn[1:]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
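
As an alternative sketch of the same step, the header can be taken off the reader with `next()` before the remaining rows are collected, and a `with` block closes the file automatically. The example below uses an in-memory `io.StringIO` holding two made-up rows in place of `hacker_news.csv`, so the sample values and the `col` lookup dictionary are assumptions for illustration, not part of the data set or the original solution.

```python
import io
from csv import reader

# Hypothetical two-row sample standing in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/16/2016 9:55\n"
    "2,Show HN: Example project,http://example.com,10,7,bob,9/26/2016 3:26\n"
)

with sample_csv as open_file:
    read_file = reader(open_file)
    headers = next(read_file)   # first row: the column names
    hn = list(read_file)        # remaining rows: the posts

# Map each column name to its index so rows can be read by name.
col = {name: index for index, name in enumerate(headers)}

print(headers)
print(len(hn))
print(hn[0][col["title"]])
```

Looking columns up by name this way avoids having to remember that, for example, num_comments sits at index 4.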

### Extracting Ask HN and Show HN Posts
1. Create three empty lists called ask_posts, show_posts, and other_posts.
2. Loop through each row in hn.
    - Assign the title in each row to a variable named title.
        - Because the title column is the second column, you'll need to get the element at index 1 in each row.
3. Implement the following steps:
    - If the lowercase version of title starts with ask hn, append the row to ask_posts.
    - Else if the lowercase version of title starts with show hn, append the row to show_posts.
    - Else append the row to other_posts.
4. Check the number of posts in ask_posts, show_posts, and other_posts.

In [13]:
ask_posts = []
show_posts = []
other_posts = []

for element in hn:
    title = element[1]
    l_title = title.lower()
    if l_title.startswith('ask hn'):
        ask_posts.append(element)
    elif l_title.startswith('show hn'):
        show_posts.append(element)
    else:
        other_posts.append(element)
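
Step 4 asks for the size of each list. A minimal sketch of that check follows, using a small made-up `hn` list (the rows are hypothetical, not taken from the data set) so that it runs on its own:

```python
# Hypothetical rows in the same column order as the data set.
hn = [
    ["1", "Ask HN: How do you learn?", "", "5", "3", "alice", "8/16/2016 9:55"],
    ["2", "Show HN: My side project", "http://example.com", "10", "7", "bob", "9/26/2016 3:26"],
    ["3", "ask hn: lowercase still counts", "", "1", "2", "carol", "1/1/2016 15:00"],
    ["4", "An ordinary story", "http://example.org", "4", "1", "dave", "2/2/2016 20:10"],
]

ask_posts, show_posts, other_posts = [], [], []

for row in hn:
    title = row[1].lower()  # lowercase so "Ask HN" and "ask hn" both match
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Step 4: report how many posts landed in each list.
print(len(ask_posts), len(show_posts), len(other_posts))
```

On the real data, the same three `len()` calls applied to the lists built above give the counts for the full sample.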

### Calculating the Average Number of Comments for Ask HN and Show HN Posts

1. Find the total number of comments in ask posts and assign it to total_ask_comments.
    - Set total_ask_comments to 0.
2. Use a for loop to iterate over the ask posts.
    - Because the num_comments column is the fifth column in ask_posts, you'll need to get the element at index 4 in each row.
        - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
        - Add this value to total_ask_comments.
3. Compute the average number of comments on ask posts and assign it to avg_ask_comments.
4. Print avg_ask_comments.
5. Find the total number of comments in show posts and assign it to total_show_comments.
    - Set total_show_comments to 0.
6. Use a for loop to iterate over the show posts.
    - Because the num_comments column is the fifth column in show_posts, you'll need to get the element at index 4 in each row.
        - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments.
        - Add this value to total_show_comments.
7. Compute the average number of comments on show posts and assign it to avg_show_comments.
8. Print avg_show_comments.
9. Do show posts or ask posts receive more comments on average? Write a markdown cell explaining your findings.

In [14]:
total_ask_comments = 0
for elements in ask_posts:
    total_ask_comments += int(elements[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for elements in show_posts:
    total_show_comments += int(elements[4])
avg_show_comments = total_show_comments / len(show_posts)

print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993
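
Because the same loop is written twice, the calculation could be factored into a helper. The sketch below is one possible shape, not the project's prescribed solution: `average_comments` is a hypothetical name, the sample rows are made up, and returning 0.0 for an empty list is an assumption chosen to avoid a `ZeroDivisionError`.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # assumption: treat "no posts" as an average of zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ["1", "Ask HN: A?", "", "5", "3", "alice", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "2", "9", "bob", "8/17/2016 10:05"],
]

print(average_comments(sample_posts))  # (3 + 9) / 2
print(average_comments([]))
```

With the lists built earlier, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two explicit loops.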


### Finding the Number of Ask Posts and Comments by Hour Created

1. Import the datetime module as dt.
2. Create an empty list and assign it to result_list. This will be a list of lists.
3. Iterate over ask_posts and append to result_list a list with two elements:
    - The first element shall be the column created_at.
        - Because the created_at column is the seventh column in ask_posts, you'll need to get the element at index 6 in each row.
    - The second element shall be the number of comments of the post.
        - You'll also need to convert the value to an integer.
4. Create two empty dictionaries called counts_by_hour and comments_by_hour.
5. Loop through each row of result_list.
    - Extract the hour from the date, which is the first element of the row.
        - Use the datetime.strptime() method to parse the date and create a datetime object.
        - Use the string we want to parse as the first argument and a string that specifies the format as the second argument.
        - Use the datetime.strftime() method to select just the hour from the datetime object.
    - If the hour isn't a key in counts_by_hour:
        - Create the key in counts_by_hour and set it equal to 1.
        - Create the key in comments_by_hour and set it equal to the comment number.
    - If the hour is already a key in counts_by_hour:
        - Increment the value in counts_by_hour by 1.
        - Increment the value in comments_by_hour by the comment number.

In [17]:
import datetime as dt

result_list = []
for element in ask_posts:
    result_list.append([element[6], int(element[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for element in result_list:
    # Parse the timestamp, then keep only the zero-padded hour ("00" to "23").
    created = dt.datetime.strptime(element[0], "%m/%d/%Y %H:%M")
    hour = created.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += element[1]  # already converted to int above
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = element[1]
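
To make the parsing step concrete, the sketch below runs `strptime()` and `strftime()` on a single made-up timestamp, then repeats the counting with `dict.get()`, which is an alternative to the if/else branches and not the approach the steps prescribe. All timestamps and comment counts here are hypothetical.

```python
import datetime as dt

created_at = "8/16/2016 9:55"  # hypothetical value in the created_at format

parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
hour = parsed.strftime("%H")   # zero-padded hour as a string
print(hour)

# dict.get supplies 0 for unseen hours, so one statement covers both cases.
counts_by_hour = {}
comments_by_hour = {}
sample_rows = [[created_at, 6], ["9/1/2016 9:05", 4], ["9/2/2016 15:30", 10]]
for stamp, n_comments in sample_rows:
    h = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[h] = counts_by_hour.get(h, 0) + 1
    comments_by_hour[h] = comments_by_hour.get(h, 0) + n_comments

print(counts_by_hour)
print(comments_by_hour)
```

Note that the hour is stored as a two-character string such as "09", which is why the dictionary keys are strings and not integers.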

### Calculating the Average Number of Comments for Ask HN Posts by Hour

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

1. Use counts_by_hour and comments_by_hour to calculate the average number of comments per post for posts created during each hour of the day.
2. The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post. Assign the result to a variable named avg_by_hour. Display the results.

In [19]:
avg_by_hour = []
for element in counts_by_hour:
    avg_by_hour.append([element, comments_by_hour[element]/counts_by_hour[element]])
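
As a small worked example of this division, the sketch below uses two made-up hours (the totals are hypothetical) and builds the same list of lists with a list comprehension:

```python
# Hypothetical totals for two hours of the day.
counts_by_hour = {"09": 2, "15": 4}
comments_by_hour = {"09": 10, "15": 50}

# For each hour: total comments divided by number of posts.
avg_by_hour = [
    [hour, comments_by_hour[hour] / counts_by_hour[hour]]
    for hour in counts_by_hour
]

print(avg_by_hour)
```

Here hour "09" averages 10 / 2 = 5.0 comments per post and hour "15" averages 50 / 4 = 12.5.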

### Sorting and Printing Values from a List of Lists
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

1. Create a list that equals avg_by_hour with swapped columns.
    - Create an empty list and assign it to swap_avg_by_hour.
    - Iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.
2. Print swap_avg_by_hour.
3. Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
    - Set the reverse argument to True, so that the highest value in the first column appears first in the list.
    - Assign the result to sorted_swap.
4. Print the string "Top 5 Hours for Ask Posts Comments".
5. Loop through each average and each hour (in this order) in the first five lists of sorted_swap.
6. Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.
    - To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time.
    - To format the average, you can use {:.2f} to indicate that just two decimal places should be used.
7. Which hours should you create a post during to have a higher chance of receiving comments? Refer back to the documentation for the data set to convert the times to the time zone you live in. Write a markdown cell explaining your findings.
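
The column swap in step 1 is needed because `sorted()` compares lists element by element, starting with the first. As an alternative sketch, not the approach these steps ask for, the `key` argument of `sorted()` can point at the second column directly. The averages below are made up, and `top_hours` is a hypothetical name.

```python
# Hypothetical [hour, average] pairs in the same shape as avg_by_hour.
avg_by_hour = [["09", 5.0], ["15", 12.5], ["02", 8.25]]

# Sort on the average (index 1) without building a swapped copy.
top_hours = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours[:2]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Both approaches produce the same ordering whenever the averages are distinct; the swap-based version additionally breaks ties by hour.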

In [20]:
swap_avg_by_hour = []
for element in avg_by_hour:
    swap_avg_by_hour.append([element[1], element[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[:5]:
    # Parse the hour string, then render it as HH:00.
    hour_label = dt.datetime.strptime(hour, "%H").strftime("%H:00")
    print("{0}: {1:.2f} average comments per post".format(hour_label, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
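
For the time zone conversion mentioned in step 7, an hour can be shifted by a fixed number of hours and wrapped around midnight. The sketch below assumes a hypothetical offset of +6 hours between the data set's time zone and the reader's. The real offset depends on the data set documentation and on where you live, and daylight saving time is ignored. `shift_hour` and `OFFSET` are illustrative names.

```python
def shift_hour(hour, offset_hours):
    """Shift a two-digit hour string by offset_hours, wrapping past midnight."""
    return "{:02d}:00".format((int(hour) + offset_hours) % 24)


OFFSET = 6  # hypothetical difference in hours; replace with your own

for hour in ["15", "02", "20"]:
    print(hour + ":00 ->", shift_hour(hour, OFFSET))
```

The modulo keeps the result in the range 0 to 23, so an hour such as 20:00 shifted by six hours wraps around to 02:00 on the following day.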
