# Data Analysis on 'Hacker News Website'
Hacker news is a site started by the starup incubator **[Y Combinator](https://www.ycombinator.com/)**, where user-submitted stories are voted and commented upon, similar to reddit. Hacker news is extremly popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreads of thousands of visitors as a result.The dataset consists of following features:

- <font color='red'>id</font>: The unique identifier from Hacker News for that post. 
- <font color='red'>title</font>: Tha name or title of the post.
- <font color='red'>url</font>: The URL that post link to, if it has.
- <font color='red'>num_points</font>: The number of points the post acquired, calculate as the total number of upvotes minus the total number of downvotes.
- <font color='red'>num_comments</font>: The number of comments that were made on the post.
- <font color='red'>author</font>: The username of the author who sumbmitted the post.
- <font color='red'>created_at</font>: The date and time at which post was submitted.

For raw dataset, you have to click [here](https://www.kaggle.com/hacker-news/hacker-news-posts).


In [3]:
# import and read the file
from csv import reader
from IPython.display import HTML, display

opened_file = open('hacker_news.csv') #file name
read_file = reader(opened_file)
hn = list(read_file)

display(HTML(
   '<table><tr>{}</tr></table>'.format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in hn[:5])
       )
))

0,1,2,3,4,5,6
id,title,url,num_points,num_comments,author,created_at
12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
10975351,How to Use Open Source and Shut the Fuck Up at the Same Time,http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/,39,10,josep2,1/26/2016 19:30
11964716,Florida DJs May Face Felony for April Fools' Water Joke,http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/,2,1,vezycash,6/23/2016 22:20
11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01


# We're interested in Specific Posts
One can see above what each header is meant for. For this analsis work, we're specifically interested in posts whose 'title' bigins with either 'Ask HN' or 'Show HN'. Here users submit <font color='red'>Ask HN</font> post to ask the Hacker News community a specific question. Below examples give glimpse of that posts:
- **Ask HN: How to improve my personal website?**
- **Ask HN: Am I the only one outraged by Twitter shutting down share counts?**

In a very similar way, users submit <font color='red'>Show HN</font> posts to show the Hacker News community a project, product, or just generally something interesting. Below are a couple of exmples:
- **Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform'**
- ** Show HN: Something pointless I made**


In [7]:
# displa data 
print(*hn[0:5], sep='\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


# Remove Headers
One can notice that the dataset has included with headers in the first column. In order to analze data, we need to first remove the row containing column headers.

In [19]:
# remove the column headers from the data
headers = hn[0]
print(headers)
print('\n')
hn_new = hn[1:]
print(*hn_new[:2], sep='\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


# Select Specific Posts: 'Ask HN' & 'Show HN'
Separate the data based on wheather the post is asked or showed, that is <font color = 'red'>Ask HN/Show HN</font> starting text at each of the title index row. Python's 'startswith' string method helps us in findig the needed text from the given text.

At the end display the number of <font color = 'red'>Ask HN/Show HN/ other HN</font> posts and their outputs.

In [31]:
# empty lists to store the selected posts
ask_posts = []
show_posts = []
other_posts = []

# looping through each row in hn
for row in hn_new:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# display ask_posts and number of ask posts
print('The number of Ask HN Posts:', len(ask_posts))
display(HTML(
   '<table><tr>{}</tr></table>'.format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in ask_posts[:2])
       )
))
  
print('\n')
print('The number of Show HN Posts:', len(show_posts))
display(HTML(
   '<table><tr>{}</tr></table>'.format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in show_posts[:2])
       )
))

print('\n')
print('The number of other HN Posts:', len(other_posts))
display(HTML(
   '<table><tr>{}</tr></table>'.format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in other_posts[:2])
       )
))

The number of Ask HN Posts: 1744


0,1,2,3,4,5,6
12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55
10610020,Ask HN: Am I the only one outraged by Twitter shutting down share counts?,,28,29,tkfx,11/22/2015 13:43




The number of Show HN Posts: 1162


0,1,2,3,4,5,6
10627194,Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform,https://iot.seeed.cc,26,22,kfihihc,11/25/2015 14:03
10646440,Show HN: Something pointless I made,http://dn.ht/picklecat/,747,102,dhotson,11/29/2015 22:46




The number of other HN Posts: 17194


0,1,2,3,4,5,6
12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
10975351,How to Use Open Source and Shut the Fuck Up at the Same Time,http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/,39,10,josep2,1/26/2016 19:30


# Ask Posts' vs Show Posts' comments
In the following exercise, we're interested in to determine which type of post receives more comments out of two. First caculate the total number of comments in each type and divide it by the total number of posts which we've calculated in the above exercise. So this output will give us the average number of comments for each type of posts.  

In [32]:
# to determine the number of comments in the selected posts
def cal_comments(posts):
    # define variables to store summation
    total_comments = 0
    no_posts = 0
   
    # loop through eithe 'ask_posts' or 'show_posts'
    for row in posts:
        no_posts += 1
        comment = int(row[4])
        total_comments += comment
        
    avg_comments = (total_comments / no_posts)
    return avg_comments
    
avg_ask_comments = cal_comments(ask_posts)
avg_show_comments = cal_comments(show_posts)


print('Average number of comments on ask_posts is:', avg_ask_comments)
print('Average number of comments on show_posts is:', avg_show_comments)
    
    
    

Average number of comments on ask_posts is: 14.038417431192661
Average number of comments on show_posts is: 10.31669535283993


# Ask Posts v/s Show Posts
On average, Ask HN-posts recieves more comments compared to Show HN-posts. One can thik about this result as, people are more interested in 'Ask posts' compared to 'show posts'. Since 'Ask posts' recieved more comments than 'Show Posts', we'll focus only on analysis of Ask posts. As we have a data regarding the time at which these posts got created, we'll calculate the number of posts created in each hour of the day along with the number of comments recieved. 



In [63]:
# import datetime module as dt
import datetime as dt
import operator
result_list = []

# loop through ask_posts 
for row in ask_posts:
    elements = [row[6], row[4]]
    result_list.append(elements)

#print(*result_list[0:5], sep='\n')

# Create a dictionary to hold the post and comments
posts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    dt_object = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')  # create a datetime object
    time = dt_object.strftime('%H')
   
    if time not in posts_by_hour:  #print(time)
        posts_by_hour[time] = 1
        comments_by_hour[time] = 1
    else:
        posts_by_hour[time] += 1
        comments_by_hour[time] += int(row[1])
        
        
# modifying dictionary display
class SortedDisplayDict(dict):
   def __str__(self):
       return "{" + ", ".join("%r: %r" % (key, self[key]) for key in sorted(self)) + "}"

# display the results
print('The number of posts received in each hour:', '\n', SortedDisplayDict(posts_by_hour), '\n')
print('The number of comments received in each hour:', '\n', SortedDisplayDict(comments_by_hour))


The number of posts received in each hour: 
 {'00': 55, '01': 60, '02': 58, '03': 54, '04': 47, '05': 46, '06': 44, '07': 34, '08': 48, '09': 45, '10': 59, '11': 58, '12': 73, '13': 85, '14': 107, '15': 116, '16': 108, '17': 100, '18': 109, '19': 110, '20': 80, '21': 109, '22': 71, '23': 68} 

The number of comments received in each hour: 
 {'00': 438, '01': 651, '02': 1379, '03': 421, '04': 335, '05': 436, '06': 397, '07': 266, '08': 488, '09': 246, '10': 793, '11': 640, '12': 684, '13': 1225, '14': 1414, '15': 4477, '16': 1798, '17': 1146, '18': 1438, '19': 1186, '20': 1721, '21': 1742, '22': 478, '23': 543}


# Average on number of Comments received
In the last exercise, we have calculated the following things.
1. counts_by_hour: contains the number of ask posts created during each hour of the day.
2. Comments_by_hour: contains the corresponding number of comments for each ask post created at that hour.

Following are the steps in calculating the average number of comments during each hour of a day.
- Initialize the empty list 
- Calculate the average number of comments received at each hour of a day by dividing the total number of comments with the total number of posts for that time.
- Append the empty list with the result.

In [72]:
#Initialize a list to hold the average

avg_by_hour = {}

for hour in comments_by_hour:
    avg_by_hour[hour] = format(comments_by_hour[hour] / posts_by_hour[hour], '.2f')
print("The average number of comments recieved at each hour of the day:",'\n', SortedDisplayDict(avg_by_hour))


The average number of comments recieved at each hour of the day: 
 {'00': '7.96', '01': '10.85', '02': '23.78', '03': '7.80', '04': '7.13', '05': '9.48', '06': '9.02', '07': '7.82', '08': '10.17', '09': '5.47', '10': '13.44', '11': '11.03', '12': '9.37', '13': '14.41', '14': '13.21', '15': '38.59', '16': '16.65', '17': '11.46', '18': '13.19', '19': '10.78', '20': '21.51', '21': '15.98', '22': '6.73', '23': '7.99'}


# Final Words
That's it for the day! Here's  a quick summary of what we've accomplished in this Data Analysis on 'Hacker News Website' project.
- Collected, sorted and analyzed the data from the HN website.
- Formatted and cleaned the data to fit for our requirements.
- Picked up the important feature from the data.
- Finaly,Calculated the average number of comments received on each hour of the day. 

    