# <div align="center">**Exploring Hacker News Posts Project**</div>

### **Analysis:** The dataset is pulled from https://www.kaggle.com/hacker-news/hacker-news-posts. The file loaded below contains roughly 293,000 posts, including posts that did not receive any comments.

| Column name | Description |
| :--------: | :--------: |
| id | Unique ID per post |
| title | Post title |
| url | URL of post link |
| num_points | Total of upvotes minus downvotes |
| num_comments | Number of comments on post |
| author | Username of author |
| created_at | Date and time of post submission |

### **Goal:** Analyze Hacker News posts to determine whether "Ask HN" or "Show HN" posts receive more comments on average, then dig deeper to see whether those posts receive more comments on average when submitted at specific hours of the day.

---------------

## **Step 1** - Extract and filter data. Determine if "Ask HN" or "Show HN" posts receive more comments on average.

In [11]:
## Import the CSV reader and load the file into a list of rows
from csv import reader

## Use a context manager so the file is closed once it has been read
with open("197_419_bundle_archive/HN_posts_year_to_Sep_26_2016.csv") as open_file:
    hn = list(reader(open_file))

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

In [12]:
## Separate header and remove from rest of data
headers = hn[0]

hn = hn[1:]

## Verify headers are split correctly from main data
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]
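
As a side note, the header can also be separated while reading instead of by slicing afterwards. The sketch below is only an illustration: `sample_csv`, `sample_headers`, and `sample_data` are hypothetical names, and a one-row in-memory sample stands in for the real file.

```python
import io
from csv import reader

# Hypothetical in-memory sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
)

rows = reader(sample_csv)
sample_headers = next(rows)   # the first row read is the header
sample_data = list(rows)      # every remaining row is data

print(sample_headers)
print(len(sample_data))
```

Taking the header with `next()` avoids holding a copy of the list that still includes it, which is a small convenience rather than a requirement for a file of this size.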

In [21]:
## Split posts into "Ask HN", "Show HN", and other posts, then verify the lists are populated
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print(ask_posts[:2])

9139
10158
273822
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']]
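
The prefix check used in the cell above can be isolated into a small function, which makes the case-insensitive matching easier to test on its own. This is a minimal sketch; `classify_title` and `sample_titles` are hypothetical names, and the second title is made up to show that an upper-case prefix still matches.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lower-case once so the match ignores case
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How do you pass on your work when you die?",
    "SHOW HN: a made-up project",  # hypothetical title with an upper-case prefix
    "algorithmic music",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```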


In [32]:
## Determine if "Ask HN" or "Show HN" receive more comments on average
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments for \"Ask HN\" posts is", avg_ask_comments)

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print("The average number of comments for \"Show HN\" posts is", avg_show_comments)

The average number of comments for "Ask HN" posts is 10.393478498741656
The average number of comments for "Show HN" posts is 4.886099625910612


#### On average, "Ask HN" posts receive about 10.4 comments per post, more than twice the roughly 4.9 comments that "Show HN" posts receive. Because of this higher average, the remaining analysis focuses exclusively on "Ask HN" posts.
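
The two averaging loops follow the same pattern, so they could be folded into one helper. The sketch below assumes rows shaped like those in `hn`, with the comment count as a string at index 4; `average_comments` and `sample_rows` are hypothetical names, and the sample reuses the two "Ask HN" rows printed earlier.

```python
def average_comments(rows, comments_index=4):
    """Mean number of comments across rows; 0.0 for an empty list."""
    if not rows:
        return 0.0  # avoid dividing by zero when no rows matched
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


sample_rows = [
    ["12578908", "Ask HN: What TLD do you use for local development?", "", "4", "7", "Sevrene", "9/26/2016 2:53"],
    ["12578522", "Ask HN: How do you pass on your work when you die?", "", "6", "3", "PascLeRasc", "9/26/2016 1:17"],
]

print(average_comments(sample_rows))
```

Guarding against an empty list is a design choice for reuse; the lists in this analysis are known to be non-empty.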

-------------------

## **Step 2** - Additional analysis of "Ask HN" posts to determine if posts receive more comments on average at specific timeframes

In [33]:
import datetime as dt

In [45]:
## Build [created_at, num_comments] pairs, then tally posts and comments per hour of the day
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comments = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comments
    
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comments
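
To make the hour extraction concrete, here is a small sketch of the same tally on three rows. The first two timestamps come from the rows printed earlier; the third row is hypothetical. Using `defaultdict` is a substitution for the `if`/`else` initialization above, not the notebook's own approach.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# [created_at, num_comments] pairs; the third pair is a hypothetical example.
sample_pairs = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 1:17", 3],
    ["9/25/2016 2:10", 5],
]

sample_counts = defaultdict(int)    # posts created in each hour
sample_comments = defaultdict(int)  # comments received by those posts

for created_at, num_comments in sample_pairs:
    # Parse the timestamp, then keep only the zero-padded hour as the key.
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Because `%H` is zero-padded, the keys are strings such as `"02"`, which is why the hours in the later output appear with leading zeros.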

In [47]:
## Determine average number of comments per post per hour of the day
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [56]:
## Swap the columns so the list sorts by average, then identify the hours with the highest averages
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

## Print top 5 hours with the highest average number of comments
print("Top 5 Hours for Ask HN Posts Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask HN Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
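
The swap step exists only so that `sorted` compares the averages first. An alternative, sketched below on four of the values printed above, is to pass a `key` function and sort the `[hour, average]` pairs directly. `avg_by_hour_sample` and `top_hours` are hypothetical names.

```python
# Four [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour_sample = [
    ["02", 11.137546468401487],
    ["15", 28.676470588235293],
    ["13", 16.31756756756757],
    ["12", 12.380116959064328],
]

# Sort by the average (second element), highest first, without swapping columns.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```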


### **15:00 (3 PM) is the hour with the highest average number of comments per post, at 28.68. That is roughly *76%* higher than the second-highest hourly average (16.32 at 13:00).**
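
The gap between the top two hours can be checked directly from the averages printed above. The sketch computes two different quantities, because they are easy to confuse: the relative increase of the top hour over the second, and the second hour expressed as a share of the top. The variable names are hypothetical.

```python
top_avg = 28.676470588235293     # 15:00 average from the output above
second_avg = 16.31756756756757   # 13:00 average from the output above

# Relative increase of the top hour over the second-highest hour, in percent.
pct_increase = (top_avg - second_avg) / second_avg * 100

# The second-highest hour as a percentage of the top hour.
second_as_share_of_top = second_avg / top_avg * 100

print(round(pct_increase, 1))
print(round(second_as_share_of_top, 1))
```

The first value comes out at about 76 and the second at about 57, so the two numbers answer different questions about the same pair of averages.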

-------------

## **Conclusion:** The post type and submission time with the highest average number of comments is an "Ask HN" post submitted around 15:00 (3 PM) Eastern time. The second-best time to submit is around 13:00 (1 PM) Eastern time. The averages differ sharply by hour: posts created at 15:00 average about 28.7 comments, while every hour other than 13:00 and 15:00 averages between roughly 6.7 and 12.4.