# Exploring Hacker News Posts
In this project, we'll compare two different types of posts from Hacker News, a popular site where technology-related stories (or 'posts') are voted on and commented on. The two types of posts we'll explore have titles that begin with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Note that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

## Introduction
First, we'll read in the data and remove the headers.

In [2]:
from csv import reader # Import the "reader" function from the "csv" module.
opened_file = open("hacker_news.csv") # Open the file using the "open()" function.
read_file = reader(opened_file) # Read the file using "reader()" from the "csv" module.
hn = list(read_file) # Convert the rows to a list of lists and assign it to "hn".
opened_file.close() # Close the file now that every row has been read.
hn[:5] # Display the first five rows of the "hn" dataset.

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
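
The read-in step above can also be written so the file is closed automatically by a `with` block. The following is a minimal sketch, not the notebook's own code: `load_rows` is a hypothetical helper name, the in-memory sample stands in for `hacker_news.csv` so the snippet runs on its own, and the `utf-8` encoding in the commented call is an assumption about the file.

```python
import csv
import io


def load_rows(file_obj):
    """Return every CSV row from an open text file as a list of lists."""
    return list(csv.reader(file_obj))


# Stand-in for the real file: the header plus one row copied from the output above.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
rows = load_rows(sample)
print(rows[0])

# With the real data set, the equivalent call would be (encoding is an assumption):
# with open("hacker_news.csv", newline="", encoding="utf-8") as f:
#     hn = load_rows(f)
```

Every field comes back as a string, which is why later cells convert `num_comments` with `int()`.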

## Removing Headers from a List of Lists

In [3]:
headers = hn[:1] # Slice off the first row of "hn" (the header row) and keep it in "headers". This is the row we want to remove.
hn = hn[1:] # Update the dataset so that it no longer includes the header row.
print(headers) # Print the row we removed.
print("\n")
print(hn[:5]) # Show the first five rows of the dataset with the header removed.

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


We can see above that the data set contains the title of the posts, the number of comments for each post, and the date the post was created. Let's start by exploring the number of comments for each type of post.
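
The cells that follow reach into each row with hard-coded positions such as `data[1]`, `data[4]`, and `data[6]`. As a small sketch of where those numbers come from, the positions can be looked up from the header row instead. The constant names below are hypothetical; note that in this notebook `headers` is a list containing the header row, so the lookup there would be `headers[0].index(...)`.

```python
# Header row as it appears in the data set.
header_row = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical constant names; the positions are looked up rather than hard-coded.
TITLE = header_row.index('title')
NUM_COMMENTS = header_row.index('num_comments')
CREATED_AT = header_row.index('created_at')

# First data row from the output above.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(TITLE, NUM_COMMENTS, CREATED_AT)
print(row[TITLE], int(row[NUM_COMMENTS]), row[CREATED_AT])
```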



## Extracting Ask HN and Show HN Posts

First, we'll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. 

In [4]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []

for data in hn: # Loop over every row in the dataset.
    title = data[1] # The title of the current post.
    if title.lower().startswith("ask hn"): # Check whether the lowercased title starts with "ask hn".
        ask_posts.append(data) # If it does, append the whole row to "ask_posts".
    elif title.lower().startswith("show hn"): # Otherwise, check whether the lowercased title starts with "show hn".
        show_posts.append(data) # If it does, append the whole row to "show_posts".
    else: # If neither condition is true, append the row to "other_posts".
        other_posts.append(data)

print(ask_posts[:2]) # Printing the first two rows of "ask_posts"
print("\n")
print(show_posts[:2]) # Printing the first two rows of "show_posts"
print("\n")
print(other_posts[:2]) # Printing the first two rows of "other_posts"
print("\n")
print(len(ask_posts)) # Printing the length of "ask_posts"
print("\n")
print(len(show_posts)) # Printing the length of "show_posts"
print("\n")
print(len(other_posts)) # Printing the length of "other_posts"

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


1744


1162


17194


As you can tell, separating the data into different lists will make it easier to analyze in the following steps.
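
The same prefix test can be packaged as a small function, which makes the rule easy to check on individual titles. This is only a sketch: `classify` and `groups` are hypothetical names, and the three titles are taken from the output above.

```python
def classify(title):
    """Label a title 'ask', 'show', or 'other' by its prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

# Group the titles by label, mirroring the three lists built in the cell above.
groups = {"ask": [], "show": [], "other": []}
for title in titles:
    groups[classify(title)].append(title)

print({kind: len(items) for kind, items in groups.items()})
```

Because the check uses `startswith`, a title that merely mentions "Ask HN" somewhere in the middle is still counted as "other".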

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

### Finding the Average Number of Comments in our List "ask_posts"

In [5]:
# Calculate the average number of comments `Ask HN` posts receive.

total_ask_comments = 0
for data in ask_posts:
    num_comments_in_ask_posts = int(data[4])
    total_ask_comments += num_comments_in_ask_posts
avg_ask_comments = total_ask_comments / len(ask_posts) 
print("The amount of average comments each post receives for 'Ask HN' is ",avg_ask_comments,".")

The amount of average comments each post receives for 'Ask HN' is  14.038417431192661 .


### Finding the Average Number of Comments in our List "show_posts"

In [6]:
# Calculate the average number of comments `Show HN` posts receive.

total_show_comments = 0
for data in show_posts:
    num_comments_in_shows_posts = int(data[4])
    total_show_comments = total_show_comments + num_comments_in_shows_posts
avg_show_comments = total_show_comments / len(show_posts)
print("The amount of average comments each post receives for 'Show HN' is ",avg_show_comments,".")

The amount of average comments each post receives for 'Show HN' is  10.31669535283993 .


On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
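
The two cells above repeat the same loop for each list. One way to avoid the duplication is a helper that takes any list of rows; the sketch below assumes the comment count sits at index 4, as in this data set, and `average_comments` is a hypothetical name. The sample rows are the first two of each list printed earlier, so the results are not the full-sample averages.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# First two rows of "ask_posts" and "show_posts" from the earlier output.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
sample_show = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/',
     '747', '102', 'dhotson', '11/29/2015 22:46'],
]

print(average_comments(sample_ask))
print(average_comments(sample_show))
```

An empty list would raise `ZeroDivisionError`, so a caller may want to check for that case first.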

## Finding the Number of Ask Posts and Comments by Hour Created

In [56]:
import datetime as dt

result_list = [] # Create an empty list.

for data in ask_posts:
    created_at = data[6] # The date and time the post was created.
    num_comments = int(data[4]) # The number of comments on the post.
    result_list.append([created_at, num_comments]) # Append both values as a pair to build a list of lists.

counts_by_hour = {} # Number of ask posts created during each hour.
comments_by_hour = {} # Total number of comments on ask posts created during each hour.
date_format = "%m/%d/%Y %H:%M" # The format string passed to "datetime.strptime()".

for data in result_list:
    created_at = data[0] # The creation date string, for example "8/16/2016 9:55".
    comments = data[1] # The number of comments received by that exact post.
    # Parse the string into a datetime object with "strptime()", then use "strftime()" to extract the hour as a two-digit string.
    time = dt.datetime.strptime(created_at, date_format).strftime("%H")
    if time not in counts_by_hour: # If this hour (for instance "09") isn't in the dictionary yet:
        counts_by_hour[time] = 1 # Start the post count for this hour at one.
        comments_by_hour[time] = comments # Start the comment total for this hour at this post's comment count.
    else: # If the hour is already in the "counts_by_hour" dictionary:
        comments_by_hour[time] += comments # Add this post's comments to the hour's total.
        counts_by_hour[time] += 1 # Increment the hour's post count by one.

        
print(comments_by_hour)


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
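
The `if`/`else` bookkeeping above can be shortened with `collections.defaultdict`, which starts missing keys at zero. The following is a sketch of that alternative rather than the notebook's method. The first two pairs come from the ask posts printed earlier; the third pair is made up purely to show two posts landing in the same hour.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

# [created_at, num_comments] pairs. The third pair is invented for illustration.
result_list = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["1/1/2016 9:05", 4],
]

counts_by_hour = defaultdict(int)    # posts per hour, missing keys start at 0
comments_by_hour = defaultdict(int)  # total comments per hour

for created_at, comments in result_list:
    # Parse the date string, then format only the hour as a two-digit string.
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```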


## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [77]:
# Calculate the average number of comments received by `Ask HN` posts created at each hour of the day.
avg_by_hour = []

for data in comments_by_hour:
    #print(data) ## Prints the hour
    #print(comments_by_hour[data]) ## Prints the amount of comments per that hour
    avg_by_hour.append([data,comments_by_hour[data]/counts_by_hour[data]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
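
Each average is simply the hour's comment total divided by the hour's post count. A compact sketch of that division for three of the hours is shown here. The comment totals are copied from the output above; the post counts (45, 116, 58) are not printed in the notebook and were back-computed from the printed totals and averages, so treat them as derived values.

```python
# Comment totals copied from the "comments_by_hour" output above.
comments_by_hour = {"09": 251, "15": 4477, "02": 1381}
# Post counts back-computed from the printed totals and averages (assumption).
counts_by_hour = {"09": 45, "15": 116, "02": 58}

# Average comments per post for each hour: total comments / number of posts.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in avg_by_hour.items():
    print(hour, round(avg, 2))
```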


## Sorting and Printing Values from a List of Lists

In [98]:
swap_avg_by_hour = []

for data in avg_by_hour:
    swap_avg_by_hour.append([data[1],data[0]])

sorted_swap = sorted(swap_avg_by_hour,reverse=True)

print('Top 5 Hours for Ask Posts Comments:')
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
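
Swapping the columns is one way to sort by the average. Another is to leave the pairs as they are and pass a `key` function to `sorted()`. The sketch below uses six of the `[hour, average]` pairs from the earlier output; `top` is a hypothetical name.

```python
# A subset of the [hour, average] pairs printed by the previous cell.
avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
    ["16", 16.796296296296298],
    ["21", 16.009174311926607],
]

# Sort by the second element (the average), largest first, and keep five.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference worth knowing: with `key`, hours that tie on the average keep their original relative order, whereas the swapped lists fall back to comparing the hour strings.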


The hour with the highest average is 15:00, at 38.59 comments per post. That is about 62% more than the hour with the second-highest average, 02:00, at 23.81 comments per post.
According to the data set documentation, the timezone used is Eastern Time in the US, so 15:00 can also be written as 3:00 pm EST.
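
The size of the gap between the top two hours can be checked with a short calculation. This sketch uses the two averages printed earlier; the variable names are illustrative.

```python
highest = 38.5948275862069    # average for 15:00, from the output above
second = 23.810344827586206   # average for 02:00, from the output above

# Relative increase of the highest average over the second highest, in percent.
relative_increase = (highest - second) / second * 100

print(f"{relative_increase:.0f}% higher")
```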


## Conclusion
In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be written as an ask post and created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST).
However, the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST) received the most comments on average.