# Exploring Hacker News Posts

Time is an important factor in many analyses. By analysing when events occur, we can identify underlying trends. For instance, in social media, analysing posting times can help us find the best time to post in order to receive comments from users.

In this project, we will work with a dataset of submissions to Hacker News, a popular technology site. The project aims to answer two questions:

1. Do `Ask HN` or `Show HN` posts receive more comments than other posts?
2. Do posts created at a certain time or time span receive more comments on average?

## Dataset information

The dataset can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). Note that we use a reduced version: the original 300,000 rows were cut to approximately 20,000 by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

Headers include:
- `id`: Unique identifier from Hacker News for the post
- `title`: Title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: Number of points that the post received
- `num_comments`: Number of comments that the post received
- `author`: Author of the post
- `created_at`: Time the post was created

In [11]:
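from csv import reader

# Read the whole CSV into a list of lists; the file is closed automatically
with open('hacker_news.csv') as opened_file:
    read_file=reader(opened_file)
    hn=list(read_file)

# The first row holds the column names
print(hn[:5])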

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Let's separate the header row from the data so that the later loops only see posts.

In [3]:
headers=hn[0]
hn=hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
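The later cells refer to columns by position (`row[1]`, `row[4]`, `row[6]`). As a small sketch of where those numbers come from, the positions can be looked up from the header row. The constant names `TITLE`, `NUM_COMMENTS` and `CREATED_AT` are hypothetical and not part of the original notebook.

```python
# Header row as printed in the output above
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical constants: column positions looked up by name
TITLE = headers.index('title')
NUM_COMMENTS = headers.index('num_comments')
CREATED_AT = headers.index('created_at')

# One row copied from the output above
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(TITLE, NUM_COMMENTS, CREATED_AT)  # 1 4 6
print(row[TITLE])
print(int(row[NUM_COMMENTS]))
print(row[CREATED_AT])
```

Using named positions is only a readability choice; the results are the same as with the literal indices.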


## Isolating the Ask HN and Show HN Posts

We are mainly concerned with posts whose titles begin with `Ask HN` or `Show HN`, so we'll split the data into three lists of lists: ask posts, show posts, and all other posts.

In [4]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1]
    if title.startswith('Ask HN'):
        ask_posts.append(row)
    elif title.startswith('Show HN'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1742
1161
17197


So we see that 1742 titles start with `Ask HN`, 1161 titles start with `Show HN`, and the remaining 17197 titles start with something else.
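Note that `str.startswith` is case-sensitive, so a title such as `ask hn: ...` would be counted among the other posts. A minimal sketch of a case-insensitive variant follows; the function name `classify` and the sample titles are made up for illustration, and the counts it produces on the real data may differ from those above.

```python
def classify(title):
    """Return 'ask', 'show' or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Illustrative titles, not taken from the dataset
for sample_title in ['Ask HN: example', 'show hn: example', 'Interactive Dynamic Video']:
    print(sample_title, '->', classify(sample_title))
```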

Next, let's determine if ask posts or show posts receive more comments on average.

In [5]:
total_ask_comments=0

for row in ask_posts:
    num=int(row[4])
    total_ask_comments+=num
    
avg_ask_comments=total_ask_comments/len(ask_posts)
print(avg_ask_comments)

total_show_comments=0

for row in show_posts:
    num=int(row[4])
    total_show_comments+=num
    
avg_show_comments=total_show_comments/len(show_posts)
print(avg_show_comments)

total_other_comments=0


    

14.044776119402986
10.324720068906116


On average, ask posts receive about 14 comments, compared with about 10 for show posts.
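The two loops above repeat the same steps, so the calculation could be wrapped in a helper and reused for any of the three lists, including `other_posts`. This is a sketch only: `average_comments` is a hypothetical name, and the sample rows are made up rather than taken from the dataset.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '3', '20', 'user_b', '1/1/2016 10:00'],
]

print(average_comments(sample_posts))  # 15.0
print(average_comments([]))            # 0.0
```

In the notebook, the same helper could be called as `average_comments(ask_posts)`, `average_comments(show_posts)` or `average_comments(other_posts)`.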

Since ask posts receive more comments on average, we'll focus the remaining analysis on them. Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. To do so, we'll count the ask posts created in each hour of the day and total the comments they received.
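The hour is extracted by parsing the `created_at` string with `datetime.strptime` and formatting the result with `strftime("%H")`. A short worked example, using two timestamps copied from the rows printed earlier:

```python
import datetime as dt

# Timestamps copied from rows shown earlier in the notebook
samples = ['8/4/2016 11:52', '6/17/2016 0:01']

hours = []
for text in samples:
    # Month/day/year followed by a 24-hour time
    parsed = dt.datetime.strptime(text, "%m/%d/%Y %H:%M")
    # "%H" gives a zero-padded two-digit hour as a string
    hours.append(parsed.strftime("%H"))

print(hours)  # ['11', '00']
```

Because the hour is kept as a zero-padded string, keys such as `'09'` and `'15'` sort in chronological order.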

In [6]:
import datetime as dt

# Pair each ask post's creation time with its number of comments
result_list=[]
for row in ask_posts:
    created_at=row[6]
    num=int(row[4])
    result_list.append([created_at,num])

# counts_by_hour: number of ask posts created in each hour
# comments_by_hour: total comments on ask posts created in each hour
counts_by_hour={}
comments_by_hour={}
for row in result_list:
    created=dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hour=created.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=row[1]
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=row[1]

In [7]:
avg_by_hour=[]
for hour in comments_by_hour:
    count=counts_by_hour[hour]
    avg_by_hour.append([hour, comments_by_hour[hour]/count])
    
print(avg_by_hour)
    

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.24074074074074], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.12962962962963], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [8]:
swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.24074074074074, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.12962962962963, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [9]:
sorted_swap=sorted(swap_avg_by_hour, reverse=True)

In [10]:
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    time=dt.datetime.strptime(row[1],"%H")
    hour=time.strftime("%H:%M")
    print('{}: {:.2f} average comments per post'.format(hour,row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
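Swapping the columns before sorting is one way to order by the average. As an alternative sketch, `sorted` accepts a `key` function, which lets the `[hour, average]` pairs be sorted directly. The values below are a subset copied from the `avg_by_hour` output above.

```python
# Subset of the [hour, average] pairs printed earlier
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['13', 14.741176470588234],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (second element), highest first, and keep the top three
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_hours:
    print('{}:00: {:.2f} average comments per post'.format(hour, average))
```

Unlike the swapped-list approach, ties in the average keep their original relative order here rather than being broken by the hour string.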


## Conclusion

We can conclude that top 5 hours for Ask posts to get most comments are: 15: