## Insights about user posts - Hacker News forum posts analysis
---
This is a guided project from a Dataquest.io course. The dataset is a CSV file of posts from the Hacker News forum in 2016. My objectives are to:
- categorize the posts
- compute summary statistics, such as the number of posts per hour
- find how many comments a post receives on average

In [1]:
# Detect the file encoding with chardet from the first 800 bytes,
# then reopen the file in text mode using that encoding.
import csv as c

import chardet

with open("hacker_news.csv", mode='rb') as file:
    en_code = chardet.detect(file.read(800))['encoding']

# Read every row, then split the header row from the rest of the data.
with open("hacker_news.csv", encoding=en_code) as file:
    hn = list(c.reader(file))
headers = hn[:1]
hn = hn[1:]
# printing the row heading for future reference
print (headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [2]:
# Classify posts by title prefix as Ask, Show, or other.
# Note: the broad prefixes "ask" and "show" also match titles such as "Asking ..." or "Showcase ...".
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask"):
        ask_posts.append(row)
    elif title.startswith("show"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Ask post statistics: total and average of the num_comments field
total_ask_comments = 0
for row in ask_posts:
    num_comments_ask = int(row[4])
    total_ask_comments = total_ask_comments + num_comments_ask

avg_ask_comments = round (total_ask_comments / len(ask_posts))
print("The average comments per ask post is", avg_ask_comments)

#show post statistics
total_show_comments = 0
for row in show_posts:
    num_comments_show = int(row[4])
    total_show_comments = total_show_comments + num_comments_show

avg_show_comments = round (total_show_comments / len(show_posts))
print("The average comments per show post is", avg_show_comments)

The average comments per ask post is 10
The average comments per show post is 5


## Insight
- On average, a post in the Ask HN category receives about 10 comments.
- Show HN posts receive about half as many comments on average (5).
- Users appear to be more active when answering a question.
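
The classification and averaging steps above can be sketched more compactly with two small helpers. This is a minimal sketch, not the notebook's method: `classify`, `mean_of_column`, and `sample_rows` are hypothetical names and data introduced for illustration, and the stricter `"ask hn"` / `"show hn"` prefixes are an assumption that would change the counts reported in this notebook.

```python
# Hypothetical sample rows in the notebook's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "Ask HN: How do you learn SQL?", "", "5", "12", "alice", "8/16/2016 9:55"],
    ["2", "Show HN: My weekend project", "http://example.com", "20", "4", "bob", "8/16/2016 15:10"],
    ["3", "Asking the right questions", "http://example.com/a", "3", "1", "carol", "8/17/2016 2:30"],
    ["4", "Ask HN: Best Python books?", "", "7", "8", "dave", "8/17/2016 15:45"],
]


def classify(title):
    """Return 'ask', 'show', or 'other' using the stricter 'ask hn' / 'show hn' prefixes."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


def mean_of_column(rows, index):
    """Average an integer column; return 0.0 for an empty list to avoid dividing by zero."""
    if not rows:
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)


# Group the rows by category, then average the num_comments column (index 4).
groups = {"ask": [], "show": [], "other": []}
for row in sample_rows:
    groups[classify(row[1])].append(row)

avg_ask_comments_sample = mean_of_column(groups["ask"], 4)
print("Average comments per ask post (sample):", avg_ask_comments_sample)
```

With the stricter prefix, a title such as "Asking the right questions" falls into the other group instead of being counted as an ask post.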

In [3]:
import datetime as dt

# Pair each ask post's creation time with its comment count.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total comments on ask posts created in each hour

for row in result_list:
    created = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

# Average comments per ask post for each hour.
avg_by_hour = []
for each in comments_by_hour:
    avg_by_hour.append([each, comments_by_hour[each] / counts_by_hour[each]])

# Put the average first so that sorting orders by average, highest first.
swap_avg_by_hour = [[each[1], each[0]] for each in avg_by_hour]
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print ("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_dt = dt.datetime.strptime(row[1], "%H")
    hour_str = hour_dt.strftime("%H:00")
    print (("{} : {:.2f} average comments per post").format(hour_str, row[0]))

Top 5 Hours for Ask Posts Comments
15:00 : 28.46 average comments per post
13:00 : 16.17 average comments per post
12:00 : 12.19 average comments per post
02:00 : 11.06 average comments per post
10:00 : 10.52 average comments per post


## Insight
- The top hours are spread across the day (02:00 is among the top five), which suggests that users from all over the world are posting comments.
- Judging by the top three hourly averages (15:00, 13:00, and 12:00), activity peaks around midday and in the afternoon, Hacker News server time.
- Possible explanations, which this data cannot confirm, include Hacker News staff replying to many posts or students browsing after school hours.
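
The hourly ranking used for these observations can also be produced without swapping the columns, by sorting on the average directly. A minimal sketch, assuming the averages are held in a dictionary keyed by hour; `top_hours` and `sample_avg_by_hour` are hypothetical names, and the values are made up for illustration.

```python
# Hypothetical hourly averages keyed by a two-digit hour string.
sample_avg_by_hour = {"09": 5.5, "15": 28.0, "02": 11.0, "13": 16.0}


def top_hours(avg_by_hour, n=5):
    """Return the n (hour, average) pairs with the highest averages."""
    # key=... sorts on the average (the dictionary value) instead of the hour.
    return sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)[:n]


for hour, average in top_hours(sample_avg_by_hour, n=3):
    print("{}:00 : {:.2f} average comments per post".format(hour, average))
```

Sorting with a `key` function keeps each hour paired with its average, so no intermediate swapped list is needed.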

In [4]:
#Determine if show or ask posts receive more points on average.
total_ask_points = 0
for row in ask_posts:
    num_points_ask = int(row[3])
    total_ask_points = total_ask_points + num_points_ask

avg_ask_points = round (total_ask_points / len(ask_posts))
print ("The total number of ask posts are", len(ask_posts))
print("The average point per ask post is", avg_ask_points)

total_show_points = 0
for row in show_posts:
    num_points_show = int(row[3])
    total_show_points = total_show_points + num_points_show

avg_show_points = round (total_show_points / len(show_posts))
print("")
print ("The total number of show posts are", len(show_posts))
print("The average point per show post is", avg_show_points)

The total number of ask posts are 9269
The average point per ask post is 11

The total number of show posts are 10218
The average point per show post is 15


## Insight

- The total numbers of ask posts (9269) and show posts (10218) are similar.
- Even so, show posts average 4 more points per post than ask posts (15 versus 11).
- This suggests that the community as a whole is interested in learning about and commenting on new information.
- The earlier comparison showed that ask posts receive about twice as many comments on average as show posts.
- Here, however, show posts earn more points. One possible reading is that some of the users active at a given time choose not to comment, to avoid adding duplicate posts.
- This could be confirmed by cross-checking the active users at a given time against the posts made and comments received.
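
The cross-check suggested in the last point might start from a per-hour summary like the one below. This is a rough sketch under stated assumptions: the dataset records only the author of each post, not who commented, so the count of distinct posting authors is at best a proxy for active users. `activity_by_hour` and `sample_posts` are hypothetical names and data.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical rows in the notebook's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: A", "", "5", "12", "alice", "8/16/2016 15:05"],
    ["2", "Ask HN: B", "", "3", "4", "bob", "8/16/2016 15:40"],
    ["3", "Ask HN: C", "", "9", "2", "alice", "8/17/2016 15:20"],
    ["4", "Ask HN: D", "", "1", "7", "carol", "8/17/2016 2:10"],
]


def activity_by_hour(rows):
    """Summarize distinct authors, posts, and comments for each hour of creation."""
    authors = defaultdict(set)   # hour -> set of author names
    posts = defaultdict(int)     # hour -> number of posts
    comments = defaultdict(int)  # hour -> total comments received
    for row in rows:
        hour = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M").strftime("%H")
        authors[hour].add(row[5])
        posts[hour] += 1
        comments[hour] += int(row[4])
    return {
        hour: {
            "authors": len(authors[hour]),
            "posts": posts[hour],
            "comments": comments[hour],
        }
        for hour in posts
    }


activity = activity_by_hour(sample_posts)
for hour in sorted(activity):
    print(hour, activity[hour])
```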

In [19]:
#Determine if posts created at a certain time are more likely to receive more points.




#Compare your results to the average number of comments and points other posts receive.
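
The two steps in this unfinished cell could be approached along the following lines. This is only a sketch on made-up rows, not a result from the dataset: `average_by_hour`, `overall_average`, `sample_ask`, and `sample_other` are hypothetical names, and the column positions (points at index 3, creation time at index 6) follow the header printed earlier.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical rows in the notebook's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ["1", "Ask HN: A", "", "10", "3", "alice", "8/16/2016 15:05"],
    ["2", "Ask HN: B", "", "20", "6", "bob", "8/17/2016 15:40"],
    ["3", "Ask HN: C", "", "4", "1", "carol", "8/17/2016 9:10"],
]
sample_other = [
    ["4", "A new database", "http://example.com/db", "30", "9", "dave", "8/16/2016 11:00"],
    ["5", "Compiler notes", "http://example.com/c", "50", "21", "erin", "8/16/2016 12:30"],
]


def average_by_hour(rows, value_index, time_index=6):
    """Return {hour: average of the integer column at value_index} by hour of creation."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[time_index], "%m/%d/%Y %H:%M").strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


def overall_average(rows, value_index):
    """Average an integer column over all rows; 0.0 when there are no rows."""
    if not rows:
        return 0.0
    return sum(int(row[value_index]) for row in rows) / len(rows)


# Average points (index 3) per hour for ask posts, highest first.
ask_points_by_hour = average_by_hour(sample_ask, 3)
for hour, average in sorted(ask_points_by_hour.items(), key=lambda item: item[1], reverse=True):
    print("{}:00 : {:.2f} average points per post".format(hour, average))

# Baseline for comparison: average points and comments (index 4) on other posts.
other_avg_points = overall_average(sample_other, 3)
other_avg_comments = overall_average(sample_other, 4)
print("Other posts: {:.2f} points, {:.2f} comments on average".format(
    other_avg_points, other_avg_comments))
```

Passing `ask_posts`, `show_posts`, or `other_posts` from the earlier cells in place of the sample lists would apply the same calculation to the real data.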