# Exploring Hacker News Posts

This project analyzes how many comments Hacker News posts receive. We focus on two kinds of community posts, "Ask HN" and "Show HN", and check whether posts created at a certain time of day receive noticeably more comments.

I retrieved this dataset from the Hacker News Kaggle account. The dataset was last updated around 2017.

Note: Compared with the DataQuest exercise data, I may have received an updated or non-sampled version of the dataset.


# First we will import and read the data

And just to preview some entries we will display only the first five lines.

In [1]:
#importing and reading data as a list of lists
#only displaying first five lines
import csv

#using a context manager so the file is closed after reading
with open("/home/main/Downloads/Exercises/HN_posts.csv") as hp:
    hacker_posts = list(csv.reader(hp))
hacker_posts[:5]



[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

###### Note: the header row shows which columns the dataset contains, including `num_comments`, which we will focus on later.
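Since the header row names every column, the fields can also be read by name instead of by position. The following is a minimal sketch, not part of the original notebook: it uses `csv.DictReader` on two rows copied from the preview above, held in an in-memory string (`sample_csv` is a stand-in for the real file).

```python
import csv
import io

# Two sample rows copied from the preview; a real run would open the CSV file instead.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment if you want stem cells to be classified as your own,"
    "http://www.regulations.gov/document?D=FDA-2015-D-3719-0018,1,0,altstar,9/26/2016 3:26\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

# DictReader uses the first line as keys, so each row becomes a dict keyed by column name.
rows = list(csv.DictReader(sample_csv))
for row in rows:
    print(row["title"], "->", int(row["num_comments"]), "comments")
```

The rest of this notebook keeps the list-of-lists approach, where `num_comments` is index 4 and `created_at` is index 6.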


In [2]:
#Removing headers

headers = hacker_posts[0]
hacker_posts = hacker_posts[1:]


print(headers)
print(hacker_posts[:5])

#Sorting rows in descending order by id (string comparison); the notebook pretty-prints the last expression for readability
sort_h = sorted(hacker_posts, reverse = True)
sort_h[:5]


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

## Creating lists and sorting posts into three categories

In the next section, we create three lists. Because we want to compare comment frequency between Ask HN and Show HN posts, each gets its own list. Any post that fits neither category goes into a third list, `other_posts`.


In [3]:
#Creating three empty lists
ask_posts = []
show_posts = []
other_posts = []

#Looping through each row to sort and append
for posts in hacker_posts:
    title = posts[1]
    
    #appending based on the parameters of the starting text 
    if title.lower().startswith("ask hn"): 
        ask_posts.append(posts)
        
    elif title.lower().startswith("show hn"):
        show_posts.append(posts)
        
    else:
        other_posts.append(posts)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))  

9139
10158
273822


## Determining averages of the Ask and Show Hacker News posts

We now have three distinct lists we can analyze individually. Let's calculate the average number of comments for Ask HN and Show HN posts.

In [4]:
#Average comments for 'Ask HN'
# num_comments is the fifth column in the dataset therefore index is 4

total_ask_comments = 0

for posts in ask_posts:
    total_ask_comments += int(posts[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)



print(round(avg_ask_comments))

10


In [5]:
#Average comments for 'Show HN'

total_show_comments = 0

for posts in show_posts:
    total_show_comments += int(posts[4])
    
avg_show_comments = total_show_comments / len(show_posts)


print(round(avg_show_comments))

5


The data shows that Ask HN posts receive more comments on average than Show HN posts (about 10 versus 5 per post). The rest of the analysis focuses on Ask HN posts.
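The two averaging cells above repeat the same loop. A small helper could cover both cases; this is only a sketch, and the name `average_comments` and the sample rows are hypothetical, not part of the original analysis.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(post[comments_index]) for post in posts) / len(posts)


# Hypothetical rows in the dataset's column order
sample_posts = [
    ["1", "Ask HN: example one", "", "3", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: example two", "", "1", "10", "user_b", "9/26/2016 15:02"],
]
sample_average = average_comments(sample_posts)
print(sample_average)  # 7.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.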

# Determining the number of Ask Posts and Comments created by Hour

Here we create two dictionaries. `counts_by_hour` holds the number of Ask HN posts created during each hour of the day. `comments_by_hour` holds the total number of comments received by the Ask HN posts created in that hour.

In [6]:
#Determining amount of Ask HN posts created during each hour of the day

import datetime as dt

result_list = []

#created_at is on column 7 so index is 6
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

#creating two empty dictionaries
counts_by_hour = {}
comments_by_hour = {}


date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
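The hour extraction used in the cell above can be checked on a single timestamp. This sketch parses one `created_at` value from the preview with the same `date_format` and shows both the zero-padded string used as a dictionary key and the integer alternative available through `.hour`.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
parsed = dt.datetime.strptime("9/26/2016 3:26", date_format)

hour_label = parsed.strftime("%H")  # zero-padded string, as used for the dictionary keys
hour_number = parsed.hour           # plain integer alternative
print(hour_label, hour_number)      # 03 3
```

String keys such as `'03'` sort correctly as text because they are zero-padded, which is why the later sorting steps work without converting them to integers.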

## Determining the average of Ask HN posts by hour

With the above information we can now calculate the average number of comments received by Ask HN posts created at each hour of the day.


In [7]:
#Average number of comments per Ask HN post, by hour of creation

avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour


[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

## Sorting and Printing the highest average comments

In [8]:
#Create new list
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

#let's sort by average, highest first, for readability
sort = sorted(swap_avg_by_hour, reverse = True)
sort


[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]

In [9]:
#Sorting and printing the values of top 5 hours w/ highest avg comments.

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sort[:5]:
    print("{}: {:.2f} average comments per post".
          format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg))

Top 5 Hours for 'Ask HN' Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
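Swapping the columns is one way to sort by average. An alternative sketch, assuming the same `[hour, average]` layout as `avg_by_hour`, passes a `key` function to `sorted` so the original column order is kept. The three pairs below are taken from the results above, rounded to two decimals.

```python
# Three [hour, average] pairs from avg_by_hour, rounded to two decimals
avg_by_hour_sample = [["02", 11.14], ["15", 28.68], ["13", 16.32]]

# Sort on the second element (the average), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)
for hour, avg in top_hours:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```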


# Conclusion

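Ask HN posts created in the 15:00 hour receive the most comments on average (28.68 per post), followed by 13:00 (16.32) and 12:00 (12.38). Three of the top five hours fall between 12:00 and 15:00, although 14:00 (9.69) does not stand out. This range overlaps with typical lunch hours, so one possible explanation is that HN users are more engaged around lunchtime. The analysis does not separate days of the week or account for time zones, so this remains a hypothesis rather than a conclusion.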

We could include average comment counts by hour for Show HN posts to test this theory and see whether they align closely with the Ask HN results.
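One way to make that comparison is to wrap the per-hour averaging in a function that accepts any list of posts. This is a sketch only: `avg_comments_by_hour` and the sample rows are hypothetical, and the function assumes the same column order and date format as the dataset used here.

```python
import datetime as dt
from collections import defaultdict


def avg_comments_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Map each hour label ('00'..'23') to the average comments of posts created in that hour."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for post in posts:
        hour = dt.datetime.strptime(post[6], date_format).strftime("%H")
        totals[hour] += int(post[4])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Hypothetical Show HN rows in the dataset's column order
sample_show_posts = [
    ["1", "Show HN: demo A", "", "2", "6", "user_a", "9/26/2016 15:10"],
    ["2", "Show HN: demo B", "", "1", "2", "user_b", "9/25/2016 15:45"],
    ["3", "Show HN: demo C", "", "5", "3", "user_c", "9/26/2016 9:05"],
]
show_avg = avg_comments_by_hour(sample_show_posts)
print(show_avg)  # {'15': 4.0, '09': 3.0}
```

Calling the same function on `ask_posts` and `show_posts` would give two dictionaries that can be compared hour by hour.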

Another route is to look at posts that received no comments at all, which would also mean examining the `other_posts` list created earlier. From that we could determine which hours see the least engagement. Considering all three lists to find the "Top 5 hours of least overall comments/engagement" could inform a marketing campaign or simple engagement tactics aimed at increasing comments and traffic during those hours.
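As a starting point for that follow-up, the share of posts with zero comments could be computed per hour. The sketch below is an assumption about how such a measure might be defined, not something the analysis above has run; `zero_comment_share_by_hour` and the sample rows are hypothetical.

```python
import datetime as dt
from collections import Counter


def zero_comment_share_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Return, for each hour label, the fraction of posts that received no comments."""
    all_posts = Counter()
    silent_posts = Counter()
    for post in posts:
        hour = dt.datetime.strptime(post[6], date_format).strftime("%H")
        all_posts[hour] += 1
        if int(post[4]) == 0:
            silent_posts[hour] += 1
    return {hour: silent_posts[hour] / all_posts[hour] for hour in all_posts}


# Hypothetical rows in the dataset's column order
sample_rows = [
    ["1", "Example post A", "", "1", "0", "user_a", "9/26/2016 3:26"],
    ["2", "Example post B", "", "4", "2", "user_b", "9/26/2016 3:24"],
    ["3", "Example post C", "", "1", "0", "user_c", "9/26/2016 15:01"],
]
silent_share = zero_comment_share_by_hour(sample_rows)
print(silent_share)  # {'03': 0.5, '15': 1.0}
```

Sorting this dictionary by value, highest first, would list the hours with the least engagement.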