<a href="https://colab.research.google.com/github/amitagl27/jupyternotebooks/blob/master/ExploringHackerNewsPosts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Guided Project for Exploration of Hacker News Posts

## In this notebook, I analyze posts on the Hacker News website and explore which parameters (time of posting, title type) are associated with more activity on a post

---
Reading hacker_news.csv
> This CSV file contains data downloaded from the Hacker News website. The data has been reduced from almost 300k rows to approximately 20k rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

> `id`: The unique identifier from Hacker News for the post
<br>
> `title`: The title of the post
<br>
> `url`: The URL that the post links to, if the post has a URL
<br>
> `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
<br>
> `num_comments`: The number of comments that were made on the post
<br>
> `author`: The username of the person who submitted the post
<br>
> `created_at`: The date and time at which the post was submitted
<br>


In [None]:
from google.colab import drive
drive.mount('/content/drive')



In [None]:
from csv import reader

filepath = '/content/drive/My Drive/Colab Notebooks/mydatasets/hacker_news.csv'
# Read every row into a list; the with-block closes the file afterwards
with open(filepath, encoding='utf8') as opened_file:
    hndatalist = list(reader(opened_file))
# Keep the header row separately
hnheaders = hndatalist[0]
# Remove the header row from the dataset
hndatalist = hndatalist[1:]
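
As a quick sanity check, the header row and a few data rows can be inspected right after loading. The sketch below uses a small in-memory sample instead of the real file; the sample rows and the `col` lookup dictionary are hypothetical and only mirror the column layout described above.

```python
import io
from csv import reader

# Hypothetical two-row sample that mirrors the column layout of hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/4/2016 11:52\n"
    "2,Show HN: Example project,http://example.com,10,7,bob,9/26/2016 3:26\n"
)

rows = list(reader(sample_csv))
headers, data = rows[0], rows[1:]

# Map each column name to its index so later code can avoid magic numbers.
col = {name: i for i, name in enumerate(headers)}

print(headers)
print(len(data), "rows")
print(data[0][col["title"]], "|", data[0][col["num_comments"]])
```

With this layout, `title` sits at index 1, `num_comments` at index 4, and `created_at` at index 6, which are the positions the rest of the notebook relies on.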


### The dataset contains many kinds of titles. In this project, we're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`, so we'll filter the dataset down to those titles

> Users submit `Ask HN` posts to ask the Hacker News community a specific question
<br>
> Users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting

We'll create three separate lists to store posts:
* `ask_posts`: posts whose titles start with **Ask HN**
* `show_posts`: posts whose titles start with **Show HN**
* `other_posts`: all other posts


In [None]:
ask_posts = []
show_posts = []
other_posts = []
for post in hndatalist:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)



### Below is the number of posts in each category:
* Total number of posts in the file: `20100`
* Number of posts that start with **Ask HN**: `1744`
* Number of posts that start with **Show HN**: `1162`
* Number of other posts: `17194`
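
Counts like these come from taking `len()` of each list. As a minimal sketch of the same prefix test on a few made-up titles, the helper `categorize` and the `sample_titles` list below are hypothetical and used only for illustration.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles, used only to illustrate the counting step.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "Show HN: A tiny static site generator",
    "ask hn: is anyone still using RSS?",
    "Why databases are hard",
]

counts = {"ask": 0, "show": 0, "other": 0}
for title in sample_titles:
    counts[categorize(title)] += 1

print(counts)
```

Lowercasing the title before the comparison means that variants such as `ask hn` and `Ask HN` land in the same category.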

Next, we'll compute the total number of comments in each of these categories, as well as the average number of comments per post.


In [None]:
total_ask_comments = 0
total_show_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
for post in show_posts:
    total_show_comments += int(post[4])
avg_ask_comments = round(total_ask_comments/len(ask_posts),2)
avg_show_comments = round(total_show_comments/len(show_posts),2)

### Comparing the averages, we can see that **Ask HN** posts receive more comments per post than **Show HN** posts.

* Total number of comments on **Ask HN** posts: `24483`
* Total number of comments on **Show HN** posts: `11988`
* Average comments on **Ask HN** posts: `14.04`
* Average comments on **Show HN** posts: `10.32`
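
The averages are simply the totals divided by the number of posts in each category (for example, 24483 / 1744 ≈ 14.04). A small reusable version of that calculation might look like the sketch below; the function name `average_comments` and the sample rows are assumptions made for illustration, not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column, rounded to two decimals."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty category
    total = sum(int(post[comments_index]) for post in posts)
    return round(total / len(posts), 2)

# Hypothetical rows in the same column order as the dataset.
sample_posts = [
    ["1", "Ask HN: A?", "", "5", "6", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: B?", "", "2", "29", "bob", "8/16/2016 9:55"],
    ["3", "Ask HN: C?", "", "1", "1", "carol", "11/22/2015 13:43"],
]

print(average_comments(sample_posts))  # (6 + 29 + 1) / 3
```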

Since ask posts receive more comments on average, we'll focus the remaining analysis on these posts. Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. To do so, we need the average number of comments per post for each hour of the day.

We'll create two separate dictionaries, one for the number of posts per hour and another for the total number of comments per hour.
* `counts_by_hour`: contains the number of ask posts created during each hour of the day.
* `comments_by_hour`: contains the total number of comments received by ask posts created during each hour.


In [None]:
import datetime as dt
result_list = []  # separate list holding just created_at and num_comments
for post in ask_posts:
    result_list.append([post[6],post[4]])

counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    curdate = dt.datetime.strptime(result[0],"%m/%d/%Y %H:%M")
    cur_hour = curdate.strftime("%H")
    if cur_hour in counts_by_hour:
        counts_by_hour[cur_hour] += 1
    else:
        counts_by_hour[cur_hour] = 1  # the first post seen for this hour counts as 1
    if cur_hour in comments_by_hour:
        comments_by_hour[cur_hour] += int(result[1])
    else:
        comments_by_hour[cur_hour] = int(result[1])
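
The same tallies can be built more compactly with `collections.defaultdict`, which starts every missing key at 0 so that each post, including the first one in an hour, adds exactly 1 to the count. This is an alternative sketch rather than the notebook's own approach, and `sample_results` is a hypothetical stand-in for `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the dataset's date format.
sample_results = [
    ["8/16/2016 9:55", "6"],
    ["11/22/2015 13:43", "29"],
    ["5/2/2016 10:14", "1"],
    ["8/2/2016 13:01", "3"],
]

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09".
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += int(num_comments)

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```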





Now we'll calculate the average number of comments per post for each hour using the two dictionaries generated above and store the results in a list of lists.

* `avg_by_hour`: list of lists, each holding the average number of comments and the corresponding hour

In [None]:
avg_by_hour = []
for hour in counts_by_hour:
    time = dt.datetime.strptime(hour,"%H")
    formattedTime = time.strftime("%H:%M")  # hour and minutes, e.g. "15:00"
    avg_by_hour.append([round(comments_by_hour[hour]/counts_by_hour[hour],2),formattedTime])
avg_by_hour = sorted(avg_by_hour,reverse=True)

#print("| Average | ", " Hour |")
#print("|------|------|")
#for item in avg_by_hour:
#    print("| " ,item[0]," | ", item[1]," |")
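
The commented-out lines above print the results as a Markdown table. A slightly more explicit sketch of that step is shown below, sorting on the average with a `key` function and formatting each row with an f-string; `sample_avg_by_hour` is hypothetical data in the same shape as `avg_by_hour`.

```python
# Hypothetical [average, "HH:MM"] pairs in the same shape as avg_by_hour.
sample_avg_by_hour = [[9.5, "12:00"], [38.0, "15:00"], [21.25, "20:00"]]

# Sort on the average explicitly instead of relying on list ordering.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[0], reverse=True)

lines = ["| Average | Hour |", "| --- | --- |"]
for average, hour in ranked:
    lines.append(f"| {average:.2f} | {hour} |")

table = "\n".join(lines)
print(table)
```

Formatting with `:.2f` keeps two decimal places on every row, so values such as 21.8 would appear as 21.80.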


### Based on the calculation above, we get the following average number of comments per post for each hour.

| Average |   Hour |
| --- | --- |
|  38.93  |  15:00  |
|  24.23  |  02:00  |
|  21.8  |  20:00  |
|  16.95  |  16:00  |
|  16.16  |  21:00  |
|  14.92  |  13:00  |
|  13.67  |  10:00  |
|  13.36  |  14:00  |
|  13.32  |  18:00  |
|  11.58  |  17:00  |
|  11.58  |  01:00  |
|  11.25  |  11:00  |
|  10.9  |  19:00  |
|  10.47  |  08:00  |
|  10.31  |  05:00  |
|  9.54  |  12:00  |
|  9.23  |  06:00  |
|  8.28  |  00:00  |
|  8.1  |  23:00  |
|  8.09  |  07:00  |
|  7.94  |  03:00  |
|  7.33  |  04:00  |
|  6.84  |  22:00  |
|  5.7  |  09:00  |

Note: these figures come from a run in which each hour's post count started at 0 rather than 1, so every average is slightly overstated.

We can see that ask posts created at 15:00 (3:00 PM) receive the most comments on average.

## Conclusion:
According to the dataset documentation, the timestamps are in Eastern Time in the USA (GMT-4).

My time zone is GMT+8 (MYT), which is 12 hours ahead.
So I should create posts at around 03:00 AM (15:00 + 12 hours) to get the most activity on them.
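
The 12-hour shift can be double-checked with Python's `datetime` module. The sketch below assumes fixed offsets of GMT-4 and GMT+8 taken from the text and ignores daylight-saving changes; the calendar date is arbitrary and the variable names are illustrative.

```python
import datetime as dt

# Fixed offsets taken from the text above; daylight-saving changes are ignored.
eastern = dt.timezone(dt.timedelta(hours=-4))
malaysia = dt.timezone(dt.timedelta(hours=8))

# The calendar date is arbitrary; only the hour matters here.
best_eastern = dt.datetime(2016, 9, 1, 15, 0, tzinfo=eastern)
best_local = best_eastern.astimezone(malaysia)

print(best_local.strftime("%H:%M"))
print((best_local.date() - best_eastern.date()).days, "day(s) later")
```

Under these assumptions, 15:00 at GMT-4 corresponds to 03:00 on the following calendar day at GMT+8.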

