# Exploring Hacker News Posts
---
In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).
<br><br>
The original dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

---
We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to show the community a project, a product, or just something generally interesting.
<br><br>
The questions we wish to answer in this project are:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

## Import and read in the dataset

In [21]:
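import csv

# Read the whole CSV into a list of rows; the context manager closes the file
with open('hacker_news.csv', newline='') as file:
    hn = list(csv.reader(file))

# View the header row and the first 4 posts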
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [22]:
# Display all output from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [23]:
headers = hn[0]
hn = hn[1:]
headers
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Data cleaning

In [24]:
# Group posts by type
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('There are {} posts in \'Ask HN\''.format(len(ask_posts))) 
print('There are {} posts in \'Show HN\''.format(len(show_posts))) 
print('There are {} posts in neither'.format(len(other_posts))) 

There are 1744 posts in 'Ask HN'
There are 1162 posts in 'Show HN'
There are 17194 posts in neither
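
The grouping above depends on a case-insensitive prefix check on each title. As a minimal sketch, the same rule can be written as a small helper. The name `classify_title` and the first two sample titles are hypothetical, introduced only for illustration and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles: the first two are made up to exercise the helper
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing before the check means that titles such as `ASK HN` or `show hn` land in the expected group regardless of capitalization.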


## Data analysis

In [25]:
# Create a function that prints and returns the average number of comments per post
def comments_counter(data):
    length = len(data)
    total_comments = 0
    
    for row in data:
        num_comments = int(row[4])
        total_comments+=num_comments
        
    avg_comments = total_comments/length
    print(avg_comments)
    return avg_comments

In [26]:
avg_ask_comments = comments_counter(ask_posts)
avg_show_comments = comments_counter(show_posts)

14.038417431192661
10.31669535283993


In [27]:
print('On average, there are {} comments in an \'Ask HN\' post.'.format(avg_ask_comments))
print('On average, there are {} comments in a \'Show HN\' post.'.format(avg_show_comments))

On average, there are 14.038417431192661 comments in an 'Ask HN' post.
On average, there are 10.31669535283993 comments in a 'Show HN' post.


From the results above, `Ask HN` posts receive about **4** more comments on average than `Show HN` posts (roughly 14 versus 10). This suggests that people are more likely to answer a question than to comment on a project.
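
To make the comparison concrete, here is a small worked calculation using the two averages printed above. The variables `difference` and `ratio` are introduced here for illustration only.

```python
# Averages copied from the notebook output
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993

# Absolute gap and relative size of the two averages
difference = avg_ask_comments - avg_show_comments
ratio = avg_ask_comments / avg_show_comments

print('Difference: {:.2f} comments'.format(difference))
print('Ratio: {:.2f}'.format(ratio))
```

The gap is about 3.72 comments, which rounds to the figure of 4 quoted above, and the ratio of about 1.36 means an average ask post draws roughly a third more comments than an average show post.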

---
Since ask posts receive more comments on average, we'll focus our remaining analysis on just these posts.
<br><br>
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
* Calculate the number of ask posts created in each hour of the day, along with the number of comments they received.
* Calculate the average number of comments ask posts receive by hour created.
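
The two steps can be sketched on a handful of rows before running them on the full dataset. This is only an illustration under stated assumptions: `sample_rows` is made-up data in the dataset's date format, and `defaultdict` is one possible way to accumulate the counts, not necessarily the approach used in the cells that follow.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the dataset's date format
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['8/17/2016 9:10', 4],
    ['11/22/2015 13:43', 29],
]

posts_per_hour = defaultdict(int)
comments_per_hour = defaultdict(int)

# Step 1: count posts and sum comments for each hour of the day
for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Step 2: divide total comments by number of posts for each hour
average_per_hour = {
    hour: comments_per_hour[hour] / posts_per_hour[hour]
    for hour in posts_per_hour
}
print(average_per_hour)
```

Here the two posts created in the 9 o'clock hour average 5 comments, while the single post at 13:43 keeps its own count of 29.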

In [28]:
# Initialize an empty list to hold each ask post's creation time and number of comments
result_list = []

for row in ask_posts:
    creation_time = row[6]
    num_comments = int(row[4])
    result_list.append([creation_time, num_comments])

In [30]:
result_list[:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [39]:
import datetime as dt

# Count the number of posts and the total number of comments by hour
counts_by_hour = dict()
comments_by_hour = dict()
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    num_comments = row[1]
    time = row[0]
    # Convert time to datetime using strptime with date_format and extract the hour using strftime
    hour = dt.datetime.strptime(time, date_format).strftime('%H') 
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

In [40]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [41]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [42]:
# Calculate average comments by hour 
avg_by_hour = {hour:comments/counts_by_hour[hour] for hour, comments in comments_by_hour.items()}

In [43]:
avg_by_hour

{'09': 5.5777777777777775,
 '13': 14.741176470588234,
 '10': 13.440677966101696,
 '14': 13.233644859813085,
 '16': 16.796296296296298,
 '23': 7.985294117647059,
 '12': 9.41095890410959,
 '17': 11.46,
 '15': 38.5948275862069,
 '21': 16.009174311926607,
 '20': 21.525,
 '02': 23.810344827586206,
 '18': 13.20183486238532,
 '03': 7.796296296296297,
 '05': 10.08695652173913,
 '19': 10.8,
 '01': 11.383333333333333,
 '22': 6.746478873239437,
 '08': 10.25,
 '04': 7.170212765957447,
 '00': 8.127272727272727,
 '06': 9.022727272727273,
 '07': 7.852941176470588,
 '11': 11.051724137931034}

In [44]:
# Sort avg_by_hour by number of average comments
sorted_avg_by_hour = {hour:comments for hour, comments in sorted(avg_by_hour.items(),
                                                                key = lambda item: item[1],
                                                                reverse = True) }

In [51]:
avg_by_hour_list = list(sorted_avg_by_hour.items())
avg_by_hour_list

[('15', 38.5948275862069),
 ('02', 23.810344827586206),
 ('20', 21.525),
 ('16', 16.796296296296298),
 ('21', 16.009174311926607),
 ('13', 14.741176470588234),
 ('10', 13.440677966101696),
 ('14', 13.233644859813085),
 ('18', 13.20183486238532),
 ('17', 11.46),
 ('01', 11.383333333333333),
 ('11', 11.051724137931034),
 ('19', 10.8),
 ('08', 10.25),
 ('05', 10.08695652173913),
 ('12', 9.41095890410959),
 ('06', 9.022727272727273),
 ('00', 8.127272727272727),
 ('23', 7.985294117647059),
 ('07', 7.852941176470588),
 ('03', 7.796296296296297),
 ('04', 7.170212765957447),
 ('22', 6.746478873239437),
 ('09', 5.5777777777777775)]

In [56]:
print('Top 5 Hours for Ask Posts Comments')
for i in range(5):
    item = avg_by_hour_list[i]
    hour = dt.datetime.strptime(item[0], '%H').strftime('%H:%M')
    comments = item[1]
    print('{} has {:.2f} comments on average'.format(hour, comments))

Top 5 Hours for Ask Posts Comments
15:00 has 38.59 comments on average
02:00 has 23.81 comments on average
20:00 has 21.52 comments on average
16:00 has 16.80 comments on average
21:00 has 16.01 comments on average
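
When only the top few hours are needed, sorting the whole dictionary is optional. As an alternative sketch (not the method used above), the standard library's `heapq.nlargest` can pick them directly. `avg_by_hour_sample` is a hypothetical name holding a subset of the averages computed earlier.

```python
import heapq

# A subset of the per-hour averages from the notebook, keyed by hour
avg_by_hour_sample = {
    '09': 5.5777777777777775,
    '15': 38.5948275862069,
    '20': 21.525,
    '02': 23.810344827586206,
    '12': 9.41095890410959,
}

# Take the three (hour, average) pairs with the largest averages
top_three = heapq.nlargest(
    3, avg_by_hour_sample.items(), key=lambda item: item[1]
)

for hour, average in top_three:
    print('{}:00 averages {:.2f} comments'.format(hour, average))
```

For 24 hours the difference from a full sort is negligible, so this is mainly a readability choice.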


## Conclusion
Ask posts created at 15:00 receive the most comments on average (38.59), well ahead of 02:00 (23.81) and 20:00 (21.52). The `created_at` column records the date and time each post was made, in US Eastern Time. Conveniently, that is the same time zone I'm in, so if I want to maximize the number of comments on a question post, I should create it around 15:00.
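
Readers in another time zone would need to convert 15:00 Eastern to their local time. The sketch below uses fixed UTC offsets to stay self-contained. Both offsets are assumptions: Eastern is taken as UTC-4 (daylight saving time), and the reader's zone is a hypothetical UTC+2. For real use, the standard library's `zoneinfo` module handles daylight saving changes.

```python
import datetime as dt

# Assumption: the date falls in daylight saving time, so Eastern is UTC-4
eastern = dt.timezone(dt.timedelta(hours=-4), 'EDT')
# Hypothetical reader time zone at UTC+2, for illustration only
reader_tz = dt.timezone(dt.timedelta(hours=2), 'CEST')

# 15:00 Eastern on an arbitrary summer date from the dataset's range
best_time_eastern = dt.datetime(2016, 8, 4, 15, 0, tzinfo=eastern)
best_time_reader = best_time_eastern.astimezone(reader_tz)

print(best_time_reader.strftime('%H:%M %Z'))
```

Under these assumptions, 15:00 Eastern corresponds to 21:00 for the hypothetical reader.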