## Analyze Hacker News "Ask HN" and "Show HN" posts

### Introduction

1. The purpose of this project is to analyze posts from [Hacker News](https://news.ycombinator.com/), a site maintained by Y Combinator
2. Using a filtered-down, readily available dataset, analyze posts whose titles begin with `Ask HN` or `Show HN`. The data source is [Kaggle.com](https://www.kaggle.com/hacker-news/hacker-news-posts)
3. Analyze this dataset to see whether `Ask HN` or `Show HN` posts receive more comments on average
4. Also, perform analysis on the dataset to see whether posts created at certain times receive more comments on average

In [1]:
#1. Import hacker_news.csv dataset
from csv import reader
# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
#2. Remove headers from the hn list of lists
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Identify posts that begin with Ask HN or Show HN and put them into separate lists
1. Create 3 separate lists for ask hn, show hn and other posts
2. Loop through the hn dataset to check if the title begins with ask hn or show hn and assign each row to the respective list
3. Display the total number of posts in each list

In [7]:
# Create 3 lists called ask_posts, show_posts and other_posts
ask_posts=[]
show_posts=[]
other_posts=[]

# Loop through each row in hn
for eachrow in hn:
    title = eachrow[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(eachrow)
    elif title.lower().startswith('show hn'):
        show_posts.append(eachrow)
    else:
        other_posts.append(eachrow)

# Check number of posts in ask_posts, show_posts and other_posts
print('Number of posts in ask posts: ' + str(len(ask_posts)))
print('Number of posts in show posts: ' + str(len(show_posts)))
print('Number of posts in other posts: ' + str(len(other_posts)))

Number of posts in ask posts: 1744
Number of posts in show posts: 1162
Number of posts in other posts: 17194
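
The splitting step above can also be wrapped in a reusable function. The following is a minimal sketch, not the notebook's own code: the function name `split_posts` and the three sample rows are hypothetical and only mirror the column layout of `hacker_news.csv`.

```python
def split_posts(rows, title_index=1):
    """Group rows into 'ask', 'show' and 'other' lists by title prefix."""
    groups = {'ask': [], 'show': [], 'other': []}
    for row in rows:
        # Lower-case the title so the prefix check is case-insensitive
        title = row[title_index].lower()
        if title.startswith('ask hn'):
            groups['ask'].append(row)
        elif title.startswith('show hn'):
            groups['show'].append(row)
        else:
            groups['other'].append(row)
    return groups

# Hypothetical sample rows (id, title, url, num_points, num_comments, author, created_at)
sample_rows = [
    ['1', 'Ask HN: How do you learn?', '', '5', '12', 'user_a', '8/4/2016 11:52'],
    ['2', 'Show HN: My side project', '', '9', '3', 'user_b', '8/4/2016 12:10'],
    ['3', 'Interactive Dynamic Video', '', '386', '52', 'user_c', '8/4/2016 11:52'],
]

groups = split_posts(sample_rows)
for name, posts in groups.items():
    print(f'Number of {name} posts: {len(posts)}')
```

Returning a dictionary keyed by group name keeps each label attached to its list, which makes it harder to print a count under the wrong label.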


### Analysis of `Ask HN` and `Show HN` posts
Analyze the `Ask HN` and `Show HN` posts to see which of the two types receives more comments on average

In [9]:
# Declare variable total_ask_comments and set to 0
# Get total number of comments in ask_posts dataset and calculate average
total_ask_comments = 0
for eachrow in ask_posts:
    number_of_comments_ask=int(eachrow[4])
    total_ask_comments += number_of_comments_ask
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

# Declare variable total_show_comments and set to 0
# Get total number of comments in show_posts dataset and calculate average
total_show_comments = 0
for eachrow in show_posts:
    number_of_comments_show = int(eachrow[4])
    total_show_comments += number_of_comments_show
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993
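
Both averages follow the same pattern, so the calculation could be factored into one helper. This is a sketch under assumptions: `average_comments` is a hypothetical name, the sample rows are made up, and returning `0.0` for an empty list is one possible choice rather than something the notebook specifies.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        # Avoid a ZeroDivisionError when a group has no posts
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical sample rows with 12 and 3 comments respectively
sample_posts = [
    ['1', 'Ask HN: How do you learn?', '', '5', '12', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Which editor?', '', '9', '3', 'user_b', '8/4/2016 12:10'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```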


### Findings
1. On average, `Ask HN` posts received about 14 comments per post and `Show HN` posts received about 10 comments per post
2. `Ask HN` posts therefore receive roughly 4 more comments per post on average than `Show HN` posts (14.04 versus 10.32)
3. Because ask posts receive more comments, the rest of this analysis focuses on these posts

### Part 1
1. The analysis of the `Ask HN` posts is broken into 2 parts. In this first part we calculate the number of posts created in each hour, along with the number of comments they received
2. In the second part, we calculate the average number of comments that ask posts received per hour

In [18]:
# Step 1: Import datetime module
import datetime as dt

# Step 2: Build result_list, a list of lists holding created_at and the number of comments for each ask post
# Each inner list stores these 2 values for one row and is appended to the master list
result_list = [] # Master list
for eachrow in ask_posts:
    inner_list = [eachrow[6], int(eachrow[4])]
    result_list.append(inner_list)

# Step 3: Create 2 empty dictionaries called counts_by_hour and comments_by_hour
# Extract the hour from the created date, count the number of posts for each hour and store it in counts_by_hour
# Sum the comments for posts created in each hour and store the total in comments_by_hour
counts_by_hour={}
comments_by_hour={}
for eachrow in result_list:
    formatted_date = dt.datetime.strptime(eachrow[0], '%m/%d/%Y %H:%M')
    hour_from_date = dt.datetime.strftime(formatted_date, '%H')
    if hour_from_date not in counts_by_hour:
        counts_by_hour[hour_from_date] = 1
        comments_by_hour[hour_from_date] = eachrow[1]
    else:
        counts_by_hour[hour_from_date] += 1
        comments_by_hour[hour_from_date] += eachrow[1]

print(counts_by_hour)
print(comments_by_hour)

{'03': 54, '06': 44, '15': 116, '00': 55, '21': 109, '19': 110, '13': 85, '08': 48, '18': 109, '02': 58, '11': 58, '12': 73, '04': 47, '23': 68, '20': 80, '01': 60, '05': 46, '07': 34, '10': 59, '16': 108, '09': 45, '22': 71, '14': 107, '17': 100}
{'03': 421, '06': 397, '15': 4477, '00': 447, '21': 1745, '19': 1188, '13': 1253, '08': 492, '18': 1439, '02': 1381, '11': 641, '12': 687, '04': 337, '23': 543, '20': 1722, '01': 683, '05': 464, '07': 267, '10': 793, '16': 1814, '09': 251, '22': 479, '14': 1416, '17': 1146}
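
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`. This is an alternative sketch rather than the notebook's approach; `tally_by_hour` and the sample pairs are hypothetical, and the date format matches the one used in the cell above.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, date_format='%m/%d/%Y %H:%M'):
    """Count posts and sum comments per hour from (created_at, num_comments) pairs."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Parse the timestamp and keep only the zero-padded hour, e.g. '11'
        hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Hypothetical sample data: two posts at 11:xx and one at 19:xx
sample_pairs = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 1),
]

sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```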


### Part 2
1. In the second part of this analysis we will use the `counts_by_hour` and `comments_by_hour` dictionaries created in part 1 to find the average number of comments for posts created during each hour of the day

In [19]:
# Using the dictionaries, create a list of lists called avg_by_hour holding the average comments per post
# Take the hour from the counts_by_hour dictionary
# Calculate average comments by dividing number of comments from comments_by_hour dictionary by the number of posts from counts_by_hour dictionary
avg_by_hour=[]
for eachhour in counts_by_hour:
    avg_by_hour.append([eachhour, comments_by_hour[eachhour]/counts_by_hour[eachhour]])
    
print(avg_by_hour)


[['03', 7.796296296296297], ['06', 9.022727272727273], ['15', 38.5948275862069], ['00', 8.127272727272727], ['21', 16.009174311926607], ['19', 10.8], ['13', 14.741176470588234], ['08', 10.25], ['18', 13.20183486238532], ['02', 23.810344827586206], ['11', 11.051724137931034], ['12', 9.41095890410959], ['04', 7.170212765957447], ['23', 7.985294117647059], ['20', 21.525], ['01', 11.383333333333333], ['05', 10.08695652173913], ['07', 7.852941176470588], ['10', 13.440677966101696], ['16', 16.796296296296298], ['09', 5.5777777777777775], ['22', 6.746478873239437], ['14', 13.233644859813085], ['17', 11.46]]
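
As a quick check of the division, the sketch below recomputes the averages for two hours using the counts and comment totals printed in Part 1 (hours `15` and `09`). The variable names ending in `_sample` are hypothetical, and a dictionary comprehension is used here instead of the list of lists.

```python
# Values copied from the Part 1 output for two hours
counts_sample = {'15': 116, '09': 45}
comments_sample = {'15': 4477, '09': 251}

# Average comments per post = total comments in the hour / number of posts in the hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(f'{hour}: {avg:.2f}')
```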


### Final cleanup
1. Sort the list of lists by average number of comments, in descending order
2. Print the five highest values in a format that is easier to understand

In [26]:
# Swap each [hour, average] pair to [average, hour] so that sorting orders by average
swap_avg_by_hour = []
for eachrow in avg_by_hour:
    swap_avg_by_hour.append([eachrow[1],eachrow[0]])
print(swap_avg_by_hour)

# Sort by average comments, highest first
sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for eachrow in sorted_swap[:5]:
    template_string = '{hour}: {comments:.2f} average comments per post'
    result_string=template_string.format(hour=(dt.datetime.strftime(dt.datetime.strptime(eachrow[1],'%H'),'%H:%M')),comments=eachrow[0])
    print(result_string)

[[7.796296296296297, '03'], [9.022727272727273, '06'], [38.5948275862069, '15'], [8.127272727272727, '00'], [16.009174311926607, '21'], [10.8, '19'], [14.741176470588234, '13'], [10.25, '08'], [13.20183486238532, '18'], [23.810344827586206, '02'], [11.051724137931034, '11'], [9.41095890410959, '12'], [7.170212765957447, '04'], [7.985294117647059, '23'], [21.525, '20'], [11.383333333333333, '01'], [10.08695652173913, '05'], [7.852941176470588, '07'], [13.440677966101696, '10'], [16.796296296296298, '16'], [5.5777777777777775, '09'], [6.746478873239437, '22'], [13.233644859813085, '14'], [11.46, '17']]
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
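
The swap step can be avoided by passing a `key` function to `sorted`. The sketch below is one possible alternative, using six of the `[hour, average]` pairs printed in Part 2; the names `avg_sample` and `top_five` are hypothetical.

```python
# A subset of the [hour, average] pairs from Part 2, deliberately out of order
avg_sample = [
    ['09', 5.5777777777777775],
    ['20', 21.525],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the second element (the average), highest first, and keep the top five
top_five = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:5]

print('Top 5 Hours for Ask Posts Comments')
for hour, avg in top_five:
    print(f'{hour}:00: {avg:.2f} average comments per post')
```

Sorting with `key` leaves the original `[hour, average]` order of each pair intact, so no second list is needed.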


### Conclusions
1. The 5 hours in which a post should be created to receive the highest average number of comments are 3 PM EST, 2 AM EST, 8 PM EST, 4 PM EST and 9 PM EST
2. Apart from 2 AM EST, these times suggest that most of the comments may come from within the US, in the afternoon and evening around the end of the working day
3. The 2 AM slot likely reflects activity from other time zones around the world
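
To see what the top hours look like elsewhere in the world, the sketch below converts an EST hour into a few other zones. It rests on several assumptions: the timestamps are treated as fixed-offset EST (UTC-5, ignoring daylight saving), the target zones are arbitrary examples chosen for illustration, and `convert_hour` is a hypothetical helper.

```python
import datetime as dt

# Assumption: dataset hours are treated as fixed-offset EST (UTC-5)
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

# Example target zones as fixed offsets (chosen only for illustration)
UTC = dt.timezone.utc
CET = dt.timezone(dt.timedelta(hours=1), 'CET')
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), 'IST')

def convert_hour(hour, target_zone, source_zone=EST):
    """Convert a whole hour in source_zone to an HH:MM string in target_zone."""
    # The date is arbitrary; only the time of day matters with fixed offsets
    moment = dt.datetime(2016, 1, 15, hour, 0, tzinfo=source_zone)
    return moment.astimezone(target_zone).strftime('%H:%M')

for zone in (UTC, CET, IST):
    print(f'02:00 EST is {convert_hour(2, zone)} {zone.tzname(None)}')
```

Under these assumptions, 2 AM EST falls in the morning in Central Europe and around midday in India, which is consistent with the idea that this slot reflects readers outside the US.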