# Hacker News Posts: Data Analysis Project

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Below is a description of the columns:
 - id: The unique identifier from Hacker News for the post
 - title: The title of the post
 - url: The URL that the post links to, if the post has a URL
 - num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
 - num_comments: The number of comments that were made on the post
 - author: The username of the person who submitted the post
 - created_at: The date and time at which the post was submitted
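
The code later in this notebook accesses columns by position (for example `row[4]` for `num_comments` and `row[6]` for `created_at`). The following is a minimal sketch of how those positions follow from the column order listed above. The name `col_index` is a hypothetical helper introduced for illustration; the sample row is the first data row of the dataset.

```python
# Column names in the order they appear in the file.
columns = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Hypothetical helper: map each column name to its position in a row.
col_index = {name: i for i, name in enumerate(columns)}

# First data row of the dataset, as shown in the output further down.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

print(col_index["num_comments"], sample_row[col_index["num_comments"]])  # prints: 4 52
print(col_index["created_at"], sample_row[col_index["created_at"]])      # prints: 6 8/4/2016 11:52
```

Looking columns up by name is optional here, but it can make the intent of an index such as `4` or `6` easier to verify.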

## Import library and read in data

In [1]:
from csv import reader
with open("hacker_news.csv", encoding="utf-8") as file:
    data = reader(file)
    hn = list(data)
hn[:5]
    

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The first row of the list contains the column headers. I will remove this row so that only the data rows remain for analysis.

In [2]:
# Assign the first row to a header variable
headers = hn[:1]

# Remove the first row from the hn dataset
hn = hn[1:]

In [3]:
# Display header
headers

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
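
Note that the header is displayed with double brackets: slicing with `[:1]` returns a list that contains the header row, while indexing with `[0]` would return the row itself. A small sketch of the difference, using two rows copied from the output above (the names `rows`, `header_as_slice`, and `header_as_row` are introduced here for illustration only):

```python
# The header and the first data row, copied from the earlier output.
rows = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
]

header_as_slice = rows[:1]  # a list containing one row
header_as_row = rows[0]     # the row itself

print(len(header_as_slice))                  # prints: 1
print(len(header_as_row))                    # prints: 7
print(header_as_slice[0] == header_as_row)   # prints: True
```

Either form works for this analysis, since the header is only displayed and not used afterwards.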

In [4]:
# Display first five rows without headers
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Now that I have removed the header, I can filter for post titles that begin with `Ask HN` or `Show HN` by creating lists of lists containing just the data for those titles.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [6]:
# Check the length of the lists to make sure they have been populated
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
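
The prefix check above lowercases each title before comparing, so titles such as `ASK HN: ...` or `ask hn: ...` land in the same group. The sketch below isolates that rule in a function so it can be checked on its own. `classify_title` is a hypothetical helper, not part of the original notebook, and the first two titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# The first two titles are hypothetical; the third comes from the dataset sample.
print(classify_title("Ask HN: How do you learn SQL?"))  # prints: ask
print(classify_title("SHOW HN: My weekend project"))    # prints: show
print(classify_title("Interactive Dynamic Video"))      # prints: other
```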


### Calculating the Average Number of Comments for Ask HN and Show HN Posts
Let's determine whether ask posts or show posts receive more comments on average.

In [7]:
total_ask_comments = 0
for row in ask_posts:
    data = row[4]
    data = int(data)
    total_ask_comments += data

avg_ask_comment = total_ask_comments / len(ask_posts)
avg_ask_comment

14.038417431192661

In [8]:
total_show_comments = 0
for row in show_posts:
    data = row[4]
    data = int(data)
    total_show_comments += data

avg_show_comments = total_show_comments / len(show_posts)
avg_show_comments

10.31669535283993

On average, ask posts receive more comments (about 14 per post) than show posts (about 10 per post). One plausible explanation is that ask posts put a question directly to the community, which invites replies.
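
The two averaging cells above repeat the same loop. As a sketch, the calculation could be wrapped in one function and reused for both lists. `average_comments` is a hypothetical helper, and the rows below are made-up examples that only follow the dataset's column layout.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: example A', '', '5', '12', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: example B', '', '3', '8', 'user_b', '1/26/2016 19:30'],
    ['3', 'Ask HN: example C', '', '1', '4', 'user_c', '6/23/2016 22:20'],
]

print(average_comments(sample_posts))  # prints: 8.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages shown above.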

### Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts are more likely to receive comments, I'll focus the remaining analysis on these posts.

In [9]:
# Import the datetime module as dt
import datetime as dt

In [10]:
# Build a list of [created_at, num_comments] pairs from ask_posts
result_list = []
for row in ask_posts:
    first_element = row[6]
    second_element = int(row[4])
    result_list.append([first_element, second_element])

# counts_by_hour: number of ask posts created in each hour
# comments_by_hour: total comments on ask posts created in each hour
comments_by_hour = {}
counts_by_hour = {}

# Loop through each row of result_list
for row in result_list:
    date = row[0]
    comment = row[1]
    # Parse the timestamp and keep only the hour as a zero-padded string, e.g. "09"
    time = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")

    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment


In [11]:
# Print the total number of comments per hour
# (counts_by_hour can be printed the same way)
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
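
The `if`/`else` branches in the loop above exist only to initialise missing keys. As an alternative sketch, not the approach used in this notebook, `collections.defaultdict(int)` starts every missing key at 0, so a single `+=` covers both cases. The pairs below are hypothetical and only mimic the shape of `result_list`; the names `sample_counts` and `sample_comments` are chosen so they do not overwrite the notebook's dictionaries.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical [created_at, num_comments] pairs in the same shape as result_list.
sample = [
    ["8/4/2016 11:52", 6],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 19:05", 4],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # prints: {'11': 1, '19': 2}
print(dict(sample_comments))  # prints: {'11': 6, '19': 14}
```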


### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [12]:
# Calculate the average number of comments that `Ask HN` posts created in each hour of the day receive.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

### Sorting and Printing Values from a List of Lists

In [13]:
# Create an empty list that will contain the swapped columns
swap_avg_by_hour = []
# Loop over avg_by_hour and swap the columns so the average comes first
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [14]:
# Sort the swapped list in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [15]:
# Top five Hours for Ask Comments
sorted_swap[:5]

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]
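
Swapping the columns works because `sorted` compares lists element by element, so the average in the first position decides the order. A possible alternative, shown here only as a sketch, is to pass a `key` function to `sorted` and leave the `[hour, average]` layout unchanged. The pairs below are copied from `avg_by_hour` above; `sample_avg_by_hour` and `top` are names introduced for illustration.

```python
# A few [hour, average] pairs copied from avg_by_hour.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first.
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00 : {avg:.2f} average comments per post")
```

On this sample the hours come out in the order 15, 02, 20, which matches the top three of the swapped-and-sorted list.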

In [18]:
# Printing the Top 5 Hours for Ask posts Comments
for i in sorted_swap[:5]:
    time = dt.datetime.strptime(i[1], "%H")
    time_2 = dt.datetime.strftime(time, "%H:%M")
    print(f"{time_2} : {i[0]:.2f} average comments per post")
    

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post


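Based on this dataset, I recommend creating an Ask HN post around 3 p.m. (15:00), the hour with by far the highest average at 38.59 comments per post. The next best hours are 2 a.m. (02:00) and 8 p.m. (20:00), with 23.81 and 21.52 average comments per post respectively. All hours are as recorded in the dataset's `created_at` column.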