# Hacker News posts

In this project we sort through Hacker News (HN) posts and pull out those whose titles begin with `Ask HN` or `Show HN`. Using these posts, we look at the hours of the day at which posts receive the most comments and points (upvotes), to determine which category and what time are most likely to produce a post with high engagement. The project demonstrates parsing data and working with dates and times in Python.

In [6]:
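from csv import reader

# Read the whole CSV into a list of rows; the `with` block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # first row: column names
hn = hn[1:]      # remaining rows: one post per row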

In the code above we opened and read the CSV file, converted it into a list of rows, and assigned the header row and the data rows to separate variables.
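To make the header/body split concrete without needing the real file, here is a minimal sketch that reads a small in-memory CSV. The two sample rows and the names `sample_csv`, `sample_headers` and `sample_body` are made up for illustration; only the column layout is taken from `hacker_news.csv`.

```python
import io
from csv import reader

# Hypothetical two-row sample using the same column layout as hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,2,alice,8/4/2016 11:52\n"
    "2,Show HN: Example project,http://example.com,10,3,bob,1/26/2016 19:30\n"
)

rows = list(reader(sample_csv))
sample_headers = rows[0]  # first row holds the column names
sample_body = rows[1:]    # remaining rows are the posts

print(sample_headers)
print(len(sample_body), "data rows")
```

Every value comes back as a string, which is why later cells call `int()` on the points and comments columns.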

In [7]:
print(headers, "\n", hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 
 [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [8]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if (title.lower()).startswith('ask hn'):
        ask_posts.append(row)
    elif (title.lower()).startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [10]:
print("No. of Ask HN posts: ", len(ask_posts))
print("No. of Show HN posts: ", len(show_posts))
print("No. of Other HN posts: ", len(other_posts))


No. of Ask HN posts:  1744
No. of Show HN posts:  1162
No. of Other HN posts:  17194


In the segment above, we created three empty lists and looped through each row, checking (case-insensitively) what the post title starts with and appending the row to the matching list. As we can see, there are more `Ask HN` posts (1744) than `Show HN` posts (1162).
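The prefix check can also be isolated in a small helper, which makes the rule easy to test on its own. This is only a sketch: `categorize` and the sample titles are hypothetical names and data, not part of the notebook.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the (case-insensitive) title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles covering each branch, including mixed case
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [categorize(t) for t in sample_titles]
print(labels)
```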

## Finding the average and total number of comments

In [11]:
def avg_num_comments(post_list, index=4):
    total_comments = 0
    for row in post_list:
        comments = int(row[index])
        total_comments += comments
        
    avg_comments = total_comments/ len(post_list)
    return avg_comments, total_comments

In [18]:
# displaying the average no of comments and the total number of comments for 'Ask HN' posts
avg_num_comments(ask_posts)

(14.038417431192661, 24483)

In [17]:
# displaying the average no of comments and the total number of comments for 'Show HN' posts
avg_num_comments(show_posts)

(10.31669535283993, 11988)

Using the `avg_num_comments` function we wrote, we can see that "Ask HN" posts receive more comments on average than "Show HN" posts (about 14.0 versus 10.3 per post), and just over double the total number of comments (24483 versus 11988).
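One caveat with this kind of average is that it divides by `len(post_list)`, which raises `ZeroDivisionError` for an empty list. A minimal variant that guards against that case might look like the following; `avg_and_total` and the sample rows are hypothetical, and returning `0.0` for an empty list is an assumption about the desired behaviour.

```python
def avg_and_total(post_list, index=4):
    """Return (average, total) of an integer column; (0.0, 0) for an empty list."""
    values = [int(row[index]) for row in post_list]
    total = sum(values)
    average = total / len(values) if values else 0.0
    return average, total

# Hypothetical rows: only column 4 (num_comments) matters here
sample_rows = [
    ['1', 'Ask HN: A', '', '3', '4'],
    ['2', 'Ask HN: B', '', '5', '8'],
]
print(avg_and_total(sample_rows))
print(avg_and_total([]))
```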

## Ask Posts and Comments by Hour Created

In [20]:
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

In [23]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

The output above shows that `Ask HN` posts created during the `15:00` hour received by far the most comments in total (4477). In general, posts created between early afternoon (`13:00`) and late evening (`21:00`) received more comments than posts created at other hours.
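The grouping relies on turning each `created_at` string into an hour key. As a small sketch of that single step, using the same format string as the cell above and two `created_at` values from the printed sample rows:

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Two created_at values taken from the sample rows printed earlier
for created_at in ["8/4/2016 11:52", "6/17/2016 0:01"]:
    parsed = dt.datetime.strptime(created_at, date_format)
    hour_key = parsed.strftime("%H")  # zero-padded hour as a string, e.g. '00'
    print(created_at, "->", hour_key, "(integer hour:", parsed.hour, ")")
```

`strftime("%H")` gives a zero-padded string, which is why the dictionary keys look like `'09'` rather than `9`; `parsed.hour` would give the integer instead.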


In [24]:
# calculating the average number of comments per post for each hour of the day
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

The averages above broadly confirm the earlier totals. For example, Ask posts created at `15:00` average about 38.6 comments per post, which corresponds to the `4477 comments` received by the 116 posts created in that hour. Many posts are also made between noon and midnight, which partly accounts for the large comment totals during those hours; the per-post average corrects for this.
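As a quick check of the arithmetic, the per-hour average is just total comments divided by number of posts. The sketch below uses two hours whose totals and post counts appear in this notebook's outputs; the `sample_` names are hypothetical, and the dictionary comprehension is an alternative to the loop above rather than the notebook's own code.

```python
# Totals and post counts for two hours, copied from the notebook outputs
sample_comments_by_hour = {'15': 4477, '09': 251}
sample_counts_by_hour = {'15': 116, '09': 45}

# Average comments per post for each hour
sample_avg = {
    hour: sample_comments_by_hour[hour] / sample_counts_by_hour[hour]
    for hour in sample_comments_by_hour
}

for hour, avg in sample_avg.items():
    print('{}:00 -> {:.2f} comments per post'.format(hour, avg))
```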

In [25]:
swap_avg_by_hour = []
for row in avg_by_hour:
    a = row[1]
    b = row[0]
    swap_avg_by_hour.append([a,b])

print('============== Unsorted avg values ===========')    
print(swap_avg_by_hour[:5])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('============= Sorted avg values ==============')
print('Top 5 Hours for Ask Posts Comments')
# sorted_swap[:6]

for avg, hr in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(
        dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg))

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16']]
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The results above suggest that, to maximize the number of comments on an Ask post, the post is best created around `15:00`. Other suitable hours are `02:00`, `20:00`, `16:00` and `21:00`.
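The swap step exists only so that `sorted` compares the averages first. A possible alternative, sketched here on a rounded subset of the averages above, is to pass a `key` function to `sorted` and leave the `[hour, average]` pairs as they are. The name `sample_avg_by_hour` is hypothetical.

```python
# Rounded subset of the [hour, average] pairs computed above
sample_avg_by_hour = [
    ['09', 5.58], ['16', 16.80], ['15', 38.59],
    ['21', 16.01], ['20', 21.52], ['02', 23.81],
]

# Sort by the average (second element), highest first, without swapping columns
top_five = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)[:5]

for hour, avg in top_five:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```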

## Calculate the amount of points from "Ask" or "Show" posts

In [26]:
# Finding the average number of points (upvotes) for Ask posts

total_ask_count = 0
for row in ask_posts:
    count = row[3]
    if count != '':
        count = int(row[3])
        total_ask_count += count
    
    
avg_ask_counts = total_ask_count/len(ask_posts)
print('The average number of counts for Ask Posts is {:.2f}'.format(avg_ask_counts))

The average number of counts for Ask Posts is 15.06


In [27]:
# Finding the average number of points (upvotes) for Show posts

total_show_count = 0
for row in show_posts:
    show_count = row[3]
    if show_count != '':
        show_count = int(row[3])
        total_show_count += show_count
    
    
avg_show_counts = total_show_count/len(show_posts)
print('The average number of counts for Show Posts is {:.2f}'.format(avg_show_counts))

The average number of counts for Show Posts is 27.56


The results above show that `Show` posts receive more points on average than `Ask` posts (27.56 versus 15.06). One possible reading is that the HN community rewards posts that share work with the community more than posts that ask questions.
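The two cells above differ only in which list they loop over, so the calculation could be factored into one function. The following is a sketch under that assumption; `average_column` and the sample rows are hypothetical. Like the cells above, it skips blank values when summing but still divides by the total number of rows.

```python
def average_column(rows, index):
    """Average an integer column, skipping blank strings when summing."""
    total = 0
    for row in rows:
        value = row[index]
        if value != '':
            total += int(value)
    # Same denominator as the cells above: all rows, including blank ones
    return total / len(rows)

# Hypothetical rows: column 3 is num_points, one of them blank
sample_rows = [
    ['1', 'Show HN: A', '', '10'],
    ['2', 'Show HN: B', '', ''],
    ['3', 'Show HN: C', '', '20'],
]
print('{:.2f}'.format(average_column(sample_rows, 3)))
```

If blank rows should not count toward the average, the denominator would need to be the number of non-blank values instead.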

## Post Upvotes per Time of Day

Here we look at whether the time of day a post is created affects the number of upvotes (points) it gets.

In [28]:
ask_list_counts_vs_time = []
# total=0

# Checking for upvoting vs. time for Ask Posts
for posts in ask_posts:
    created = posts[6]
    counts = int(posts[3])
    ask_list_counts_vs_time.append([created, counts])
    
ask_counts_by_hour = {}   # number of Ask posts created in each hour
ask_points_by_hour = {}   # total points (upvotes) for Ask posts created in each hour
date_format = '%m/%d/%Y %H:%M'

for row in ask_list_counts_vs_time:
    date = row[0]
    count = row[1]
    created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if created_obj not in ask_counts_by_hour:
        ask_counts_by_hour[created_obj] = 1
        ask_points_by_hour[created_obj] = count
    else:
        ask_counts_by_hour[created_obj] += 1
        ask_points_by_hour[created_obj] += count
        
print('the number of counts on ask posts by the hour are:')
ask_counts_by_hour  

the number of counts on ask posts by the hour are:


{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [29]:
show_list_counts_vs_time = []

# Checking for upvoting vs. time for Show Posts
for post in show_posts:
    created = post[6]
    counts = int(post[3])
    show_list_counts_vs_time.append([created, counts])
    
show_counts_by_hour = {}   # number of Show posts created in each hour
show_points_by_hour = {}   # total points (upvotes) for Show posts created in each hour
date_format = '%m/%d/%Y %H:%M'

for row in show_list_counts_vs_time:
    date = row[0]
    count = row[1]
    show_created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if show_created_obj not in show_counts_by_hour:
        show_counts_by_hour[show_created_obj] = 1
        show_points_by_hour[show_created_obj] = count
    else:
        show_counts_by_hour[show_created_obj] += 1
        show_points_by_hour[show_created_obj] += count
        
print('the number of counts on show posts by the hour are:')
show_counts_by_hour

the number of counts on show posts by the hour are:


{'14': 86,
 '22': 46,
 '18': 61,
 '07': 26,
 '20': 60,
 '05': 19,
 '16': 93,
 '19': 55,
 '15': 78,
 '03': 27,
 '17': 93,
 '06': 16,
 '02': 30,
 '13': 99,
 '08': 34,
 '21': 47,
 '04': 26,
 '11': 44,
 '12': 61,
 '23': 36,
 '09': 30,
 '01': 28,
 '10': 36,
 '00': 31}

## Conclusion

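From the results above we can tentatively conclude the following. Note that the two per-hour tables above show how many posts were created in each hour, not the points they received, so the link to upvotes is suggestive rather than measured.

- Show posts are most often created between 13:00 and 17:00, which makes these the hours when Show posts are most likely to collect upvotes
- Ask posts are most often created between 13:00 and 21:00, and are therefore most likely to be upvoted in that window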
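To test these conclusions more directly, one could compute the average points per post for each hour, in the same way the comments were averaged earlier. The sketch below shows one way this might be done; `avg_points_by_hour` and the three sample rows are hypothetical, and the column positions (3 for points, 6 for the creation time) are taken from the header row printed at the top.

```python
import datetime as dt
from collections import defaultdict

def avg_points_by_hour(posts, date_format='%m/%d/%Y %H:%M'):
    """Map each creation hour ('00'..'23') to the average points per post."""
    points = defaultdict(int)  # total points per hour
    counts = defaultdict(int)  # number of posts per hour
    for row in posts:
        hour = dt.datetime.strptime(row[6], date_format).strftime('%H')
        points[hour] += int(row[3])
        counts[hour] += 1
    return {hour: points[hour] / counts[hour] for hour in counts}

# Hypothetical rows in the hacker_news.csv column layout
sample_posts = [
    ['1', 'Show HN: A', '', '10', '0', 'a', '8/4/2016 13:05'],
    ['2', 'Show HN: B', '', '30', '0', 'b', '8/5/2016 13:40'],
    ['3', 'Show HN: C', '', '4', '0', 'c', '8/5/2016 2:15'],
]
result = avg_points_by_hour(sample_posts)
print(result)
```

Running this on `ask_posts` and `show_posts` and sorting by the average would show whether the busiest hours are also the hours with the most points per post.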