# Exploring Hacker News Posts

### Data set: [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts)
This data set has been reduced to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
<br>The second column of the data set is `title`. We are particularly interested in posts whose titles begin with either Ask HN or Show HN. We'll compare these two types of posts to answer the following questions:
 - Do Ask HN or Show HN posts receive more comments on average?
 - Do posts created at a certain time receive more comments on average?

In [6]:
from csv import reader
import datetime as dt

## Data Cleaning

In [7]:
# Open the file in a context manager so it is closed after reading
with open("hacker_news.csv", newline="") as f:
    hn = list(reader(f))

Display the first five rows of the data set, including the header row.

In [8]:
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

`headers` is the first row of the data set, which contains the column names.

In [9]:
headers = hn[0]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Remove the header row, then display the first five rows of the remaining data.

In [10]:
hn = hn[1:]
hn[0:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Separate ask posts, show posts, and other posts into three different lists. The prefix check is case-insensitive.

In [11]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # Lowercase the title once so the prefix check ignores case
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("No of ask posts are ",len(ask_posts))
print("No of show posts are ",len(show_posts))
print("No of other posts are ",len(other_posts))

No of ask posts are  1744
No of show posts are  1162
No of other posts are  17194
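
The prefix check above can be tried on a few titles without loading the CSV file. The following is a minimal sketch; the helper name `classify_title` and the first two sample titles are hypothetical and introduced here only for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical sample titles (the third one appears in the data set preview)
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Because the title is lowercased before the comparison, prefixes such as `SHOW HN` and `show hn` fall into the same group.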


Given a list of posts, this function computes the average number of comments per post.

In [12]:
def compute_avg_comments(posts):
    num_comments = 0
    for post in posts:
        comments = int(post[4])
        num_comments += comments
        
    avg_comments = num_comments/len(posts)
    return avg_comments
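
As a quick check of the averaging step, the sketch below applies the same idea to the first two rows shown in the preview (52 and 10 comments). The function name `average_comments`, its `comments_index` parameter, and the empty-list guard are assumptions added for illustration; they are not part of the function above.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post; returns 0.0 for an empty list."""
    if not posts:
        # Guard added in this sketch to avoid dividing by zero
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# First two data rows from the preview above
sample_posts = [
    ["12224879", "Interactive Dynamic Video",
     "http://www.interactivedynamicvideo.com/",
     "386", "52", "ne0phyte", "8/4/2016 11:52"],
    ["10975351", "How to Use Open Source and Shut the Fuck Up at the Same Time",
     "http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/",
     "39", "10", "josep2", "1/26/2016 19:30"],
]

sample_average = average_comments(sample_posts)
print(sample_average)  # (52 + 10) / 2
```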

In [13]:
avg_ask_comments = compute_avg_comments(ask_posts)
print("Average number of comments in ask posts are {0:.2f}.".format(avg_ask_comments))

Average number of comments in ask posts are 14.04.


In [14]:
avg_show_comments = compute_avg_comments(show_posts)
print("Average number of comments in show posts are {0:.2f}.".format(avg_show_comments))

Average number of comments in show posts are 10.32.


#### Ask posts have a greater average number of comments than show posts

Save the creation timestamp and the number of comments of each ask post as a two-element list in a list of lists. We will use this later to count the ask posts created in each hour and the comments those posts received.

In [15]:
result_list = []
for post in ask_posts:
    created = post[6]
    comments = int(post[4])
    result_list.append([created,comments])

In [16]:
result_list[:3]

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

`counts_by_hour` is a dictionary that maps each hour (key) to the number of ask posts created in that hour (value).
`comments_by_hour` is a dictionary that maps each hour (key) to the total number of comments received by ask posts created in that hour (value).

In [17]:
counts_by_hour = {}
comments_by_hour = {}

for ls in result_list:
    date_ = dt.datetime.strptime(ls[0], "%m/%d/%Y %H:%M")
    hour = date_.strftime('%H')
    comments = ls[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
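
The same per-hour tally can also be written with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an alternative sketch, not the notebook's method. The first three entries of `sample_results` come from the `result_list` preview above; the fourth entry is hypothetical and is included only to show two posts accumulating in the same hour.

```python
import datetime as dt
from collections import defaultdict

sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["1/1/2016 9:05", 4],  # hypothetical entry, for illustration only
]

# Missing keys start at 0, so no membership check is needed
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created, comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. "09"
    hour = dt.datetime.strptime(created, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Note that `strftime("%H")` zero-pads the hour, so a post created at 9:55 is stored under the key `"09"`.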


Using the two dictionaries above, we compute the average number of comments per ask post for each hour of creation.

In [18]:
avg_by_hour = []
for hour, count in counts_by_hour.items():
    avg_comments = comments_by_hour[hour]/count
    avg_by_hour.append([hour, avg_comments])

In [19]:
avg_by_hour[:3]

[['19', 10.8], ['05', 10.08695652173913], ['14', 13.233644859813085]]

Swap each pair so that the average comes before the hour, which lets the list be sorted by average.

In [20]:
swap_avg_by_hour = [[avg, hour] for hour, avg in avg_by_hour]

In [21]:
swap_avg_by_hour[:3]

[[10.8, '19'], [10.08695652173913, '05'], [13.233644859813085, '14']]

Sort the swapped list by average in descending order.

In [22]:
sorted_swap = sorted(swap_avg_by_hour,reverse = True)

In [23]:
sorted_swap[:5]

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]
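
The swap is one way to sort by average. A possible alternative, sketched below, is to pass a `key` function to `sorted` so that the `[hour, average]` pairs can be ranked without reordering them. The variable names are introduced here for illustration, and the input is the three-pair `avg_by_hour` preview shown earlier.

```python
# The first three [hour, average] pairs from the avg_by_hour preview
sample_avg_by_hour = [
    ["19", 10.8],
    ["05", 10.08695652173913],
    ["14", 13.233644859813085],
]

# Rank by the second element (the average), highest first
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00 {:.2f}".format(hour, avg))
```

One difference in behavior: sorting the swapped pairs breaks ties between equal averages by comparing the hour strings, whereas the `key` version keeps tied pairs in their original order.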

Print the five hours with the highest average number of comments per post.

In [24]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


### Findings

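Ask posts receive more comments per post on average than show posts (14.04 versus 10.32). Among ask posts, those created between 15:00 and 15:59 receive the most comments on average, 38.59 per post. That figure is about 62% higher than the second-highest hourly average (23.81 per post, for posts created between 02:00 and 02:59).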
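
The relative differences behind these findings can be recomputed from the values printed above. The sketch below is a small arithmetic check; the helper `percent_more` is a hypothetical name introduced here, and the ask/show comparison uses the rounded averages from the printed output, so its result is approximate.

```python
def percent_more(larger, smaller):
    """How much larger `larger` is than `smaller`, as a percentage of `smaller`."""
    return (larger - smaller) / smaller * 100


# Top two hourly averages for ask posts, taken from the sorted output
top_hour_avg = 38.5948275862069
second_hour_avg = 23.810344827586206
hour_gap = percent_more(top_hour_avg, second_hour_avg)

# Rounded overall averages for ask and show posts, taken from the printed output
ask_vs_show_gap = percent_more(14.04, 10.32)

print("Top hour vs second hour: {:.1f}% more".format(hour_gap))
print("Ask vs show posts: {:.1f}% more".format(ask_vs_show_gap))
```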