# Exploring Hacker News Posts

*Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories are voted and commented upon. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.*

*This analysis looks at two kinds of posts:*

* Ask HN: users submit these posts to ask the Hacker News community a specific question.
* Show HN: users submit these posts to show the Hacker News community a project, product, or just generally something interesting.

## Data Dictionary

* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted

## Goal


* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?


### Reading the Input file

In [1]:
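from csv import reader

# Read every row of the CSV into a list of lists; the with block closes the file afterwards
with open('hacker_news.csv', newline='') as opened_file:
    read_file=reader(opened_file)
    hn=list(read_file)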

In [2]:
headers=hn[0]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
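
The later cells refer to columns by position (for example `row[4]` for the number of comments and `row[6]` for the creation time). As a minimal sketch, and not part of the original notebook, the header row can be turned into a name-to-position lookup so those positions do not have to be memorized. The `col` dictionary is a hypothetical helper introduced here for illustration only.

```python
# Header row exactly as printed in the notebook output.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col = {name: position for position, name in enumerate(headers)}

# One sample row from the dataset.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

num_comments = int(row[col['num_comments']])
created_at = row[col['created_at']]

print(col['num_comments'], col['created_at'])  # 4 6
print(num_comments, created_at)                # 52 8/4/2016 11:52
```

This confirms that `num_comments` sits at index 4 and `created_at` at index 6, the two positions the analysis relies on.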

In [3]:
hn=hn[1:]

In [4]:
#Checking first five rows of the data
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

In [5]:
ask_posts=[]
show_posts=[]
other_posts=[]

In [6]:
for row in hn:
    title_hn=row[1]
    if title_hn.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title_hn.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [7]:
# Count the posts whose titles start with Ask HN
print('Number of posts that start with ask hn:',len(ask_posts))

Number of posts that start with ask hn: 1744


In [8]:
# Count the posts whose titles start with Show HN
print('Number of posts that start with show hn:',len(show_posts))

Number of posts that start with show hn: 1162


In [9]:
# Count the posts whose titles start with neither Ask HN nor Show HN
print('Number of posts that start with title other than ask hn or show hn:',len(other_posts))

Number of posts that start with title other than ask hn or show hn: 17194
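
Because every row goes into exactly one of the three lists, the three counts should add up to `len(hn)`. The sketch below restates the prefix test as a small function and checks that property on a handful of titles. The `classify` function and the last sample title are made up for illustration; they are not part of the original notebook.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'ASK HN: does case matter?',  # made-up title to show case-insensitivity
]

labels = [classify(title) for title in sample_titles]
counts = {label: labels.count(label) for label in ('ask', 'show', 'other')}
print(counts)  # {'ask': 2, 'show': 1, 'other': 1}

# Each title gets exactly one label, so the counts partition the sample.
assert sum(counts.values()) == len(sample_titles)

# The same sum for the counts reported above.
print(1744 + 1162 + 17194)  # 20100
```

The three reported counts sum to 20100, which is the value `len(hn)` should return after the header row is removed.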


In [10]:
#First five rows of ask posts
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [11]:
#First five rows of show posts
show_posts[:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

In [12]:
# Check the type of the num_comments value (index 4) in a row of hn
type(hn[1][4])

str

In [13]:
# Determine if ask or show posts receive more comments on average
total_ask_comments=0
for row in ask_posts:
    num_comments=row[4]
    num_comments=int(num_comments)
    total_ask_comments+=num_comments

##Check the total number of comments in ask posts
total_ask_comments

24483

In [14]:
# Average number of comments on ask posts
avg_ask_comments=round(total_ask_comments/len(ask_posts),2)

In [15]:
print(avg_ask_comments)

14.04


In [16]:
# Total number of comments on show posts
total_show_comments=0
for row in show_posts:
    num_comments=row[4]
    num_comments=int(num_comments)
    total_show_comments+=num_comments

##Check the total number of comments in show posts
total_show_comments

11988

In [61]:
# Average number of comments on show posts
avg_show_comments=round(total_show_comments/len(show_posts),2)

In [62]:
print(avg_show_comments)

10.32


*Comparing the averages, ask posts receive about 4 more comments per post than show posts (14.04 versus 10.32). However, a few ask posts with unusually high comment counts could be outliers that pull the ask average up.*
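
One way to see how much a few large values can move an average is to compare the mean with the median, which is far less sensitive to extremes. The sketch below uses made-up comment counts, not values from the dataset, and the standard-library `statistics` module.

```python
from statistics import mean, median

# Made-up comment counts: mostly small values plus one very large one.
sample_comments = [1, 2, 3, 4, 5, 6, 900]

print(round(mean(sample_comments), 2))  # 131.57
print(median(sample_comments))          # 4
```

Applied to this notebook, the same comparison would be `median([int(row[4]) for row in ask_posts])` against `avg_ask_comments`, and likewise for `show_posts`. That step is a suggestion and was not run as part of the original analysis.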

In [30]:
# Look at the range (maximum and minimum) of a numeric column
def find_max(ds,index):
    """Return (max, min) of the integer values in column `index` of `ds`."""
    comment=[]
    for row in ds:
        num_comments=row[index]
        num_comments=int(num_comments)
        comment.append(num_comments)

    return max(comment), min(comment)

In [31]:
find_max(ask_posts,4)

(947, 1)

In [32]:
find_max(show_posts,4)

(306, 1)

*The number of comments on ask posts ranges from 1 to 947, while on show posts it ranges from 1 to 306. A few ask posts with very large comment counts contribute to the higher average for ask posts. Even so, ask posts receive more comments than show posts on average, so we'll focus the remaining analysis on them. We'll determine whether ask posts created at a certain time are more likely to attract comments, using the following steps:*

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.


In [36]:
# Step 1 - Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
import datetime as dt
result_list=[]
for row in ask_posts:
    created_at = row[6] #Created at
    num_comments=row[4] #Number of comments
    num_comments=int(num_comments)
    result_list.append([created_at,num_comments])

In [37]:
result_list[:6]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17],
 ['9/26/2015 23:23', 1]]

In [38]:
type(result_list[1][0])

str
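
Since `created_at` is stored as a string, it has to be parsed before the hour can be extracted. As a minimal sketch, here is how a single sample value is parsed with `datetime.strptime` using the format `%m/%d/%Y %H:%M`, and how the hour can then be read back either as an integer or as a zero-padded string.

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # sample value from result_list

# %m/%d/%Y matches month/day/year; %H:%M matches 24-hour time.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed)                 # 2016-08-16 09:55:00
print(parsed.hour)            # 9
print(parsed.strftime('%H'))  # 09
```

Using `strftime('%H')` gives string keys such as `'09'`, whereas `parsed.hour` gives integers such as `9`. Either works as a dictionary key as long as it is used consistently.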

In [40]:
counts_by_hour={}
comments_by_hour={}
for row in result_list:
    dttime=row[0]
    num_comments=row[1]
    created_dt = dt.datetime.strptime(dttime, "%m/%d/%Y %H:%M") #Convert string into datetime object
    hour=created_dt.strftime("%H") #Zero-padded hour in 24-hour format, e.g. '09'
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=num_comments
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=num_comments

In [41]:
print(counts_by_hour) #Number of ask posts created during each hour of the day

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [42]:
print(comments_by_hour) #Number of comments on ask posts during each hour of the day

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


In [70]:
# We will calculate the average number of comments on ask posts created during each hour of the day
avg_by_hour=[]
for hour in comments_by_hour:
    if hour in counts_by_hour:
        avg_by_hour.append([hour,round(comments_by_hour[hour]/counts_by_hour[hour],2)])


In [71]:
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


*It's still difficult to identify which hour has the highest average number of comments per post because the list is not sorted.*

In [72]:
swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

In [73]:
print(swap_avg_by_hour)

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]


In [74]:
# Sort in descending order; the average comes first in each pair, so the list is ordered by average comments per post
sorted_swap= sorted(swap_avg_by_hour,reverse=True)

In [75]:
print(sorted_swap)

[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16'], [16.01, '21'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [13.2, '18'], [11.46, '17'], [11.38, '01'], [11.05, '11'], [10.8, '19'], [10.25, '08'], [10.09, '05'], [9.41, '12'], [9.02, '06'], [8.13, '00'], [7.99, '23'], [7.85, '07'], [7.8, '03'], [7.17, '04'], [6.75, '22'], [5.58, '09']]
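
The swap-then-sort approach above works because Python compares lists element by element. An alternative, shown here only as a sketch on a few pairs taken from `avg_by_hour`, is to leave the pairs as `[hour, average]` and pass a `key` function to `sorted`.

```python
# A few [hour, average] pairs taken from avg_by_hour.
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort on the second element (the average), largest first, without swapping columns.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

print(ranked)
# [['15', 38.59], ['02', 23.81], ['20', 21.52], ['09', 5.58]]
```

The two approaches differ only when averages tie: the swapped version falls back to comparing the hour strings, while the `key` version keeps tied pairs in their original order.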


In [77]:
print('*' * 10, "Top 5 Hours for Ask Posts Comments:", '*' * 10,'\n')
for row in sorted_swap[:5]:
    avg_comments=row[0]
    dttime=row[1]
    date_str = dt.datetime.strptime(dttime, "%H") #Convert string into datetime object
    hour=date_str.strftime("%H:%M") #24 hour time format
    my_string = "{}: {} average comments per post"
    
    print(my_string.format(hour,avg_comments))

********** Top 5 Hours for Ask Posts Comments: ********** 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


## Conclusion

*Looking at the top 5 hours for ask post comments, posts created at 3 pm receive the most comments, with an average of 38.59 per post. Surprisingly, the next highest hour is 2 am, with an average of 23.81 comments per post. Four of the top 5 hours (3 pm, 4 pm, 8 pm and 9 pm) fall between 3 pm and 9 pm, so an ask post created in that window has a good chance of receiving more comments than one created at another time.*