##  <center> Exploring Hacker News Posts </center>

In this project, I have worked on Hacker News dataset where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit to analyze the data to identify "Do posts created at a certain time receive more comments on average?"

In [1]:
# Reading of csv's

from csv import reader
opened_file= open("hacker_news.csv")
read_file= reader(opened_file)
hn_list=list(read_file)
hn_header= hn_list[0]
hn= hn_list[1:]

In [2]:
# Function to explore dataset.

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') 
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(hn, 0, 3, True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


Number of rows: 20100
Number of columns: 7


In [4]:
hn_header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

#### Extracting Ask HN and Show HN Posts

Extracted post titles beginning with Ask HN or Show HN by creating new lists of lists containing just the data for those titles.

In [5]:
ask_posts= list()
show_posts= list()
other_posts= list()
for row in hn:
    title=row[1]
    title=title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))    
print(len(show_posts))  
print(len(other_posts))  

1744
1162
17194


In [6]:
explore_data(ask_posts, 0, 1, True)

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


Number of rows: 1744
Number of columns: 7


In [7]:
explore_data(show_posts, 0, 1, True)

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


Number of rows: 1162
Number of columns: 7


#### Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [8]:
# Total number of comments in ask posts

total_ask_comments=0

for row in ask_posts:
    total_ask=int(row[4])
    total_ask_comments += total_ask
    avg_ask_comments = float(total_ask_comments/len(ask_posts))
print('Avgerage Ask comments: ',avg_ask_comments)  
                        

Avgerage Ask comments:  14.038417431192661


In [9]:
# Total number of comments in show posts

total_show_comments=0

for row in  show_posts:
    total_show=int(row[4])
    total_show_comments += total_show
    avg_show_comments= float(total_show_comments/len(show_posts))
print('Average Show Comments: ',avg_show_comments)
    

Average Show Comments:  10.31669535283993


It's clear that Ask posts received more comments than Show posts.

####  Finding the Amount of Ask Posts and Comments by Hour Created

Since ask posts are more likely to receive comments, I focused remaining analysis just on these posts.

I've determined if ask posts created at a certain time are more likely to attract comments. The following steps are used to perform this analysis:

1. Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [16]:
import datetime as dt

result_list= list() #list of list

for row in ask_posts:
    result= list()
    created_at= row[6]
    result.append(created_at)
    no_comment= int(row[4])
    result.append(no_comment)
    result_list.append(result)
    
counts_by_hour= dict()
comments_by_hour= dict()

for row in result_list:
    date= row[0]
    dateObj= dt.datetime.strptime(date,"%m/%d/%Y %H:%M") 
    hour= dateObj.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] =1
        comments_by_hour[hour] =no_comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += no_comment
         
print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 90, '13': 170, '10': 118, '14': 214, '16': 216, '23': 136, '12': 146, '17': 200, '15': 232, '21': 218, '20': 160, '02': 116, '18': 218, '03': 108, '05': 92, '19': 220, '01': 120, '22': 142, '08': 96, '04': 94, '00': 110, '06': 88, '07': 68, '11': 116}


#### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [17]:
 len(comments_by_hour)

24

In [12]:
avg_by_hour= list()

for hour in comments_by_hour:      
    avg_by_hour.append([hour, float(comments_by_hour[hour]/
                        len(comments_by_hour))])
avg_by_hour[:3]

[['09', 3.75], ['13', 7.083333333333333], ['10', 4.916666666666667]]

#### Sorting and Printing Values from a List of Lists

In [18]:
swap_avg_by_hour= list()

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour[:5])

[[3.75, '09'], [7.083333333333333, '13'], [4.916666666666667, '10'], [8.916666666666666, '14'], [9.0, '16']]


In [19]:
sorted_swap= sorted(swap_avg_by_hour,reverse= True)
print('Top 5 Hours for Ask Posts Comments: ',sorted_swap[:5])

Top 5 Hours for Ask Posts Comments:  [[9.666666666666666, '15'], [9.166666666666666, '19'], [9.083333333333334, '21'], [9.083333333333334, '18'], [9.0, '16']]


In [22]:
for row in sorted_swap[:5]:
    hour=row[1]
    hour= dt.datetime.strptime(hour,"%H")
    hour=hour.strftime("%I:%M")
    avg=row[0]
    print("{h}: hour has {a:.2f} average comments per post".format(h=hour ,a= avg))    
    

03:00: hour has 9.67 average comments per post
07:00: hour has 9.17 average comments per post
09:00: hour has 9.08 average comments per post
06:00: hour has 9.08 average comments per post
04:00: hour has 9.00 average comments per post


#### As per my findings we should create post during 03:00 pm to receive the highest number of comments.