![image](https://uploads-ssl.webflow.com/5e5e26b57a149fc28773c703/5eaf3dc2f728bb4e333a1546_hacker-news-logo.jpeg)

## Analyzing Data from Hacker News 

In this project I analyze Hacker News posts whose titles start with "Ask HN" or "Show HN." An Ask HN post asks the Hacker News community about a topic, while a Show HN post shows something off. The question I am attempting to answer through observable patterns in the data is: "Which post type, and which time of day, is best for receiving comments?"

The dataset I will be using can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts).

In [1]:
from csv import reader #import reader function from csv module
import datetime as dt #import datetime with alias dt

opened_file = open(r'C:\Users\david\Desktop\HN.csv', encoding="utf-8")
read_file = reader(opened_file)
list_file = list(read_file)
hnp_header = list_file[0]
hnp_data = list_file[1:]

print(hnp_header) #print the header of the dataset
print("Number of rows in hnp_data:", len(hnp_data))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
Number of rows in hnp_data: 293119


---

Below I will loop through the dataset. Every title that starts with "Ask HN" will be added to `ask_posts`, and every title that starts with "Show HN" will be added to `show_posts`. Every title that meets neither criterion will be added to `other_posts`.

In [2]:
ask_posts = [] #code block sorts posts that start with ask hn and show hn into respective lists
show_posts = []
other_posts = []

for row in hnp_data:
    title = row[1].lower() #utilizing .lower function to read strings as lowercase values
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"): #using the .startswith method to check if the title starts with the specified string
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of ask posts:', len(ask_posts))
print('Number of show posts:', len(show_posts))
print('Number of every other kind of post:', len(other_posts))

Number of ask posts: 9139
Number of show posts: 10158
Number of every other kind of post: 273822
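
To see the prefix check in isolation, here is a minimal sketch that applies the same rule to a few made-up titles. The function name `classify_title` and the sample titles are hypothetical and are not part of the dataset or the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' using the same prefix rule as the loop above."""
    lowered = title.lower()  # compare in lowercase so the check is case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles, for illustration only
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Why I ask HN before buying hardware",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

The third title contains "ask HN" but does not start with it, so it falls into the "other" group.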


---

Now I will count the total number of comments for both post types: `total_ask_comments` holds the total number of comments received by ask posts, and `total_show_comments` holds the total number of comments received by show posts. Afterwards, I will calculate the average number of comments per post for each type.

In [3]:
total_ask_comments = 0 #find total number of comments for ask posts
for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts) #Calculate average number of comments for ask posts
print("Total number of comments for ask posts:", total_ask_comments)
print("Average number of comments for ask posts:", avg_ask_comments)

Total number of comments for ask posts: 94986
Average number of comments for ask posts: 10.393478498741656


In [4]:
total_show_comments = 0 #find total number of comments for "Show HN" posts
for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts) #Calculate average number of comments for "Show HN" posts
print("Total number of comments for show posts:", total_show_comments)
print("Average number of comments for show posts:", avg_show_comments)

Total number of comments for show posts: 49633
Average number of comments for show posts: 4.886099625910612


As shown above, ask posts receive more comments on average (about 10.39 per post, versus about 4.89 for show posts). Therefore I will focus my analysis on ask posts from this point forward in order to best answer the question stated in the introduction.
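
The two averaging cells above follow the same pattern, so they could be folded into one helper. This is only a sketch: the name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions made for illustration, not part of the notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored at comments_index in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "Ask HN: A?", "", "5", "7", "alice", "9/26/2016 2:53"],
    ["2", "Ask HN: B?", "", "2", "3", "bob", "9/26/2016 1:17"],
    ["3", "Ask HN: C?", "", "1", "0", "carol", "9/25/2016 22:57"],
]

sample_average = average_comments(sample_rows)
print(round(sample_average, 2))  # 3.33
```

With a helper like this, `average_comments(ask_posts)` and `average_comments(show_posts)` would give the two averages without repeating the loop.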

---

Below I calculate the number of ask posts created during each hour of the day and the average number of comments ask posts receive per hour.

In [5]:
result_list = [] #collect a [created_at, num_comments] pair for every ask post

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
print('Example of the data recorded for every row:')
print(result_list[:3])

Example of the data recorded for every row:
[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0]]
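
Before grouping by hour, it may help to see how one of these timestamps is parsed. The sketch below takes the first timestamp from the sample output and extracts the hour in two ways; the variable names are mine and are only illustrative.

```python
import datetime as dt

date_str = "9/26/2016 2:53"  # first timestamp in the sample output above

# When parsing, %m, %d and %H accept values without a leading zero
parsed = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")

hour_label = parsed.strftime("%H")  # zero-padded string
hour_number = parsed.hour           # plain integer

print(hour_label, hour_number)  # 02 2
```

Using the zero-padded string as a dictionary key keeps the hours in a consistent two-character form, which is what the later cells rely on.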


In [6]:
comments_per_hour = {} #contains the total number of comments received by ask posts created during each hour
posts_per_hour = {} #contains the number of ask posts created during each hour of the day

for row in result_list:
    date_str = row[0]
    num_comments = row[1]
    dt_object = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M") #parse string into a datetime object
    hour = dt_object.strftime("%H")
    
    if hour in posts_per_hour:
        comments_per_hour[hour] += num_comments
        posts_per_hour[hour] += 1
    else: 
        comments_per_hour[hour] = num_comments
        posts_per_hour[hour] = 1

print("Total number of comments for ask posts per hour:")
comments_per_hour #total number of comments for ask posts per hour

Total number of comments for ask posts per hour:


{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
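
The if/else bookkeeping in the cell above can also be written with `collections.defaultdict`, which creates a missing key with a default value of 0. This is an alternative sketch, not the method used in the notebook. The data are the three sample pairs printed earlier plus one made-up pair, and the names `totals` and `counts` are hypothetical.

```python
from collections import defaultdict
import datetime as dt

# [created_at, num_comments] pairs: three from the sample output, one made up
sample_pairs = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 1:17", 3],
    ["9/25/2016 22:57", 0],
    ["9/24/2016 2:10", 5],  # made-up pair so that one hour appears twice
]

totals = defaultdict(int)  # hour -> total number of comments
counts = defaultdict(int)  # hour -> number of posts

for date_str, num_comments in sample_pairs:
    hour = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M").strftime("%H")
    totals[hour] += num_comments
    counts[hour] += 1

print(dict(totals))  # {'02': 12, '01': 3, '22': 0}
print(dict(counts))  # {'02': 2, '01': 1, '22': 1}
```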

In [7]:
#calculate the average number of comments for ask posts per hour
avg_per_hour = []

for row in comments_per_hour:
    avg_per_hour.append([row, comments_per_hour[row] / posts_per_hour[row]])
    
print('Average number of comments for ask posts per hour:')
avg_per_hour

Average number of comments for ask posts per hour:


[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

In [9]:
#swap the columns in avg_per_hour and sort in descending order
swap_avg_per_hour = []

for row in avg_per_hour:
    swap_avg_per_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_per_hour, reverse=True)
sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
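
Swapping the columns is one way to sort by the average. Another option, sketched below, is to pass a `key` function to `sorted` so that the original `[hour, average]` order is kept. The list `avg_sample` holds a few entries from `avg_per_hour`, rounded for brevity; the variable names are mine.

```python
# A few [hour, average] entries from avg_per_hour above, rounded for brevity
avg_sample = [["02", 11.14], ["01", 7.41], ["15", 28.68]]

# Sort by the second element (the average) in descending order
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

print(ranked)  # [['15', 28.68], ['02', 11.14], ['01', 7.41]]
```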

In [10]:
print("Top 5 Hours for 'Ask HN' Comments:")
for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg))

Top 5 Hours for 'Ask HN' Comments:
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


The hour with the highest average number of comments per ask post is 15:00, or 3:00 pm EST (according to the timezone used in the [documentation](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts)), with an average of about 28.68 comments per post. That is roughly 1.76 times the second-highest average, 16.32 comments per post at 13:00 (1:00 pm).
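
As a quick check on how the top two hours compare, the ratio can be computed directly from the averages printed in the sorted output above. The variable names are only illustrative.

```python
# Averages copied from the sorted output above
top_avg = 28.676470588235293    # 15:00
second_avg = 16.31756756756757  # 13:00

ratio = top_avg / second_avg
print(f"{ratio:.2f}")  # 1.76
```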

In conclusion, I analyzed ask and show posts from Hacker News and found that, in this dataset, a post receives the most comments on average when it is an Ask HN post created around 3:00 pm EST.