## Introduction 

### Data source 
Hacker News posts, loaded from `hacker_news.csv`.

### Data definition
|Column title|Meaning|
|------------|-------|
|id |The unique identifier from Hacker News for the post|
|title| The title of the post|
|url | The URL that the post links to, if the post has a URL |
|num_points |The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
|num_comments |The number of comments that were made on the post |
|author| The username of the person who submitted the post |
|created_at| The date and time at which the post was submitted |
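
Because the file is read as plain lists, each column in the table above is reached by its position. The sketch below maps column names to indexes; the names `COLUMNS` and `COL` are hypothetical helpers for illustration and are not part of the original notebook.

```python
# Hypothetical helpers (not in the original notebook): map each column
# name from the table above to its list index.
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
COL = {name: index for index, name in enumerate(COLUMNS)}

# One data row from the dataset.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Every value is read as a string, so numeric columns need int().
title = sample_row[COL['title']]
num_comments = int(sample_row[COL['num_comments']])
print(title, num_comments)
```

This is why the analysis uses index 1 for the title, index 4 for the comment count, and index 6 for the creation time.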

### What the analysis is about
1. Compare two types of posts and see which type receives more comments on average:  
(1) posts that users submit to show projects and products [Show HN], and (2) posts that users submit as questions [Ask HN].

2. At what hours of the day do newly created posts attract the most comments on average?

## Step 1: Connecting to the data

In [2]:
from csv import reader
#Read every row into a list of lists; the with block closes the file afterwards.
with open("hacker_news.csv") as f:
    hn = list(reader(f))

In [3]:
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [4]:
#Slice off the header row and assign it to the variable headers (hn[:1] keeps it wrapped in a one-element list).
headers = hn[:1]

#Display headers.
headers

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]

In [5]:
#Remove the first row from hn.
hn= hn[1:]

In [6]:
#Separate the Ask HN and Show HN posts into two lists of lists named ask_posts and show_posts.
#All remaining posts go into other_posts.
ask_posts = []
show_posts = []
other_posts = []

for row in hn: 
    #Because the title column is the second column, you'll need to get the element at index 1 in each row.
    title=row[1]
    #If the lowercase version of title starts with ask hn, append the row to ask_posts.
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    #Else if the lowercase version of title starts with show hn, append the row to show_posts.    
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

#Check the number of posts in ask_posts, show_posts, and other_posts.
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194


In [7]:
ask_posts[1]

['10610020',
 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
 '',
 '28',
 '29',
 'tkfx',
 '11/22/2015 13:43']
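
The prefix check used in the loop can also be written as a small function, which makes it easy to try on single titles. This is a minimal sketch; `classify_title` is a hypothetical name, and the Show HN title below is invented for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first and third titles come from the dataset; the second is made up.
print(classify_title('Ask HN: Am I the only one outraged by Twitter shutting down share counts?'))
print(classify_title('Show HN: A made-up project title'))
print(classify_title('Interactive Dynamic Video'))
```

Lowercasing before the comparison means that variants such as `ASK HN` or `ask hn` land in the same group.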

### Question 1: Which posts receive the most comments on average? 

In [8]:
#Average number of comments on Ask HN posts.

total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments=total_ask_comments/len(ask_posts)
print("Average comments per Ask HN post:",avg_ask_comments)

#Average number of comments on Show HN posts.
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments=total_show_comments/len(show_posts)
print("Average comments per Show HN post:",avg_show_comments)

Average comments per Ask HN post: 14.038417431192661
Average comments per Show HN post: 10.31669535283993
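
The two loops above do the same work on different lists, so the calculation could be factored into one function. The sketch below is one possible way to do that; `average_comments` is a hypothetical helper, and returning `0.0` for an empty list is an assumption made here to avoid a division by zero.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two rows from the dataset, with 52 and 1 comments respectively.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52',
     'ne0phyte', '8/4/2016 11:52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

print(average_comments(sample_posts))  # (52 + 1) / 2 = 26.5
```

With the full data, the same function could be called once with `ask_posts` and once with `show_posts`.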
Average comments on showcasing products/projects: 10.31669535283993


#### Verdict: More engagement (via comments) on posts where people ask questions than on posts where people showcase products

### Follow-up question: At what time of day do Ask HN posts get the most comments? 

In [26]:
#Import the datetime module as dt.
import datetime as dt

#Create an empty list
result_list = []

#Iterate over ask_posts
for post in ask_posts:
    created_at=post[6]
    num_comments=int(post[4])
    
    #Store each post as a [created_at, num_comments] pair so that
    #result_list becomes a list of lists, for example:
    #result_list.append([column1, column2])
    
    result_list.append([created_at, num_comments])
    
#Create two empty dictionaries
counts_by_hour = {}
comments_by_hour = {}
date_format=("%m/%d/%Y %H:%M")

#Loop through each row
for item in result_list:
    date_hour=item[0]
    comments=item[1]
    #Parse the timestamp and keep only the hour as a zero-padded string, e.g. '15'.
    t1 = dt.datetime.strptime(date_hour, date_format).strftime("%H")
    
    if t1 in counts_by_hour:
        counts_by_hour[t1]+= 1
        comments_by_hour[t1]+= comments
    else:
        counts_by_hour[t1]=1
        comments_by_hour[t1]=comments
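
To see what the hour extraction produces for a single value, the parsing step can be run on its own. The sketch below uses two `created_at` values from the dataset; the variable names `samples`, `hour_key`, and `hour_int` are chosen here for illustration.

```python
import datetime as dt

# Same format string as in the cell above: month/day/year hour:minute.
date_format = "%m/%d/%Y %H:%M"

# Two created_at values from the dataset.
samples = ["8/4/2016 11:52", "6/17/2016 0:01"]

for created_at in samples:
    parsed = dt.datetime.strptime(created_at, date_format)
    hour_key = parsed.strftime("%H")  # zero-padded string, used as the dictionary key
    hour_int = parsed.hour            # the same hour as a plain integer
    print(created_at, "->", hour_key, hour_int)
```

Using the zero-padded string keeps keys such as `'02'` and `'15'` in clock order when they are sorted as text.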


In [24]:
#contains the number of ask posts created during each hour of the day.
counts_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

In [25]:
#contains the total number of comments received by ask posts created during each hour.
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

### The average number of comments per post for posts created during each hour of the day.

In [29]:
avg_by_hour = []

for hour in counts_by_hour: 
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['18', 13.20183486238532],
 ['13', 14.741176470588234],
 ['11', 11.051724137931034],
 ['06', 9.022727272727273],
 ['16', 16.796296296296298],
 ['21', 16.009174311926607],
 ['00', 8.127272727272727],
 ['01', 11.383333333333333],
 ['02', 23.810344827586206],
 ['07', 7.852941176470588],
 ['15', 38.5948275862069],
 ['19', 10.8],
 ['20', 21.525],
 ['17', 11.46],
 ['10', 13.440677966101696],
 ['03', 7.796296296296297],
 ['08', 10.25],
 ['22', 6.746478873239437],
 ['05', 10.08695652173913],
 ['23', 7.985294117647059],
 ['09', 5.5777777777777775],
 ['14', 13.233644859813085],
 ['04', 7.170212765957447],
 ['12', 9.41095890410959]]

In [34]:
swap_avg_by_hour = []

for swap in avg_by_hour:
    swap_avg_by_hour.append([swap[1], swap[0]])
    
swap_avg_by_hour

[[13.20183486238532, '18'],
 [14.741176470588234, '13'],
 [11.051724137931034, '11'],
 [9.022727272727273, '06'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [8.127272727272727, '00'],
 [11.383333333333333, '01'],
 [23.810344827586206, '02'],
 [7.852941176470588, '07'],
 [38.5948275862069, '15'],
 [10.8, '19'],
 [21.525, '20'],
 [11.46, '17'],
 [13.440677966101696, '10'],
 [7.796296296296297, '03'],
 [10.25, '08'],
 [6.746478873239437, '22'],
 [10.08695652173913, '05'],
 [7.985294117647059, '23'],
 [5.5777777777777775, '09'],
 [13.233644859813085, '14'],
 [7.170212765957447, '04'],
 [9.41095890410959, '12']]

In [35]:
sorted_swap=sorted(swap_avg_by_hour, reverse=True)

In [36]:
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
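
Swapping the columns works because `sorted` compares lists element by element. An alternative, shown here only as a sketch, is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The names `sample_avg_by_hour` and `sorted_by_avg` are hypothetical, and the list holds just a few of the pairs from `avg_by_hour`.

```python
# A few [hour, average] pairs copied from avg_by_hour.
sample_avg_by_hour = [
    ['18', 13.20183486238532],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
]

# Sort on the second element (the average), highest first.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00 : {:.2f}".format(hour, avg))
```

Both approaches give the same ranking; the `key` version simply skips the intermediate swapped list.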

### The Top 5 Hours for Ask Posts Comments

In [41]:
for avg,hr in sorted_swap[:5]:
    print("{}:00 : {:.2f} average comments per post".format(hr, avg))

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post


### Verdict

Ask HN posts created during these hours received the most comments on average, so posting within them gives a higher chance of receiving comments. As the time zone is Eastern Time in the US, this translates to the following in CPH time (6 hours ahead):

- 15:00 is 9 pm CPH time
- 02:00 is 8 am CPH time
- 20:00 is 2 am CPH time
- 16:00 is 10 pm CPH time
- 21:00 is 3 am CPH time
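
The conversion can be checked with a few lines of arithmetic. The sketch below assumes that CPH refers to Copenhagen and that it is 6 hours ahead of US Eastern Time, which holds for most of the year but not during the short periods when only one of the two regions has changed its clocks for daylight saving time. The function names are hypothetical.

```python
# Assumption: CPH (Copenhagen) is 6 hours ahead of US Eastern Time.
OFFSET_HOURS = 6

def eastern_to_copenhagen(hour):
    """Shift an Eastern Time hour (0-23) to Copenhagen time (0-23)."""
    return (hour + OFFSET_HOURS) % 24

def as_12_hour(hour):
    """Format a 24-hour value such as 21 as '9 pm'."""
    suffix = "am" if hour < 12 else "pm"
    return "{} {}".format(hour % 12 or 12, suffix)

# The top five hours from the ranking above.
for eastern_hour in [15, 2, 20, 16, 21]:
    local_hour = eastern_to_copenhagen(eastern_hour)
    print("{:02d}:00 ET -> {} CPH".format(eastern_hour, as_12_hour(local_hour)))
```

The modulo step handles hours that roll past midnight, so 20:00 and 21:00 Eastern Time land in the early morning of the following day.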