<h1><u>Data Exploration and Analysis on Hacker News Posts</u></h1>

<h2>Introduction</h2>
<p>Today, we explore a dataset of posts from the Hacker News website covering the 12 months up to September 26, 2016. The scraped data was originally published <a href = "https://www.kaggle.com/hacker-news/hacker-news-posts">here</a>. Hacker News is a social media site focused on computer science and entrepreneurship, with content that tends to attract like-minded users. We are particularly interested in posts whose titles begin with 'Ask HN' or 'Show HN', which indicate that the user either wants to ask the Hacker News (HN) community a question or show the community a project or finding.</p>

<h2>Objective</h2>
<p>Our goal today is to find out which of these posts ('Ask HN' or 'Show HN') receive more comments on average. We also want to know whether a post created at a certain hour receives more comments than one created at a different hour of the day.</p>

<h2>Importing the data</h2>
<p>First things first, let's import the dataset and inspect the first five rows of the data to get an idea of what we are dealing with:</p>

In [34]:
# import the hacker_news.csv file into a Python list of lists
from csv import reader

# the context manager closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
hn_header = hn[0]
hn = hn[1:]

# header (columns of the dataset)
print("----------Columns of the Hacker News dataset----------")
print(hn_header)

# print first 5 rows
print("\n----------First 5 rows of the Hacker News dataset----------")
for row in hn[0:5]:
    print(row)

----------Columns of the Hacker News dataset----------
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

----------First 5 rows of the Hacker News dataset----------
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://ww

<b>Comment:</b> Here, we notice that the dataset has 7 columns, which means each row contains 7 variables. The dataset author's documentation describes what each column represents.

<b>Description of the columns:</b>
<ul>
<li><b>id</b>: The unique identifier from Hacker News for the post</li>
<li><b>title</b>: The title of the post</li>
<li><b>url</b>: The URL that the post links to, if the post has a URL</li>
<li><b>num_points</b>: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes</li>
<li><b>num_comments</b>: The number of comments that were made on the post</li>
<li><b>author</b>: The username of the person who submitted the post</li>
<li><b>created_at</b>: The date and time at which the post was submitted</li>
</ul>

Now that we have the dataset loaded, we can separate out the posts whose titles start with 'Ask HN' or 'Show HN'.

In [35]:
# create three lists to store the posts
# ask_posts: stores the 'Ask HN' posts
# show_posts: stores the 'Show HN' posts
# other_posts: stores the other posts
ask_posts, show_posts, other_posts = [], [], []

# iterate through the dataset and use the
# title (index = 1) to sort each row into one of the 3 lists
for row in hn:
    title = row[1]
    if (title.lower().startswith("ask hn")):
        ask_posts.append(row)
    elif (title.lower().startswith("show hn")):
        show_posts.append(row)
    else:
        other_posts.append(row)

# print out the number of posts in each list
print("Number of 'Ask HN' posts:", len(ask_posts))
print("Number of 'Show HN' posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of 'Ask HN' posts: 1744
Number of 'Show HN' posts: 1162
Number of other posts: 17194


<b>Comment:</b> So, there are 1744 posts in the dataset in which the poster asks the Hacker News community a question, 1162 posts in which the poster showcases a project or finding, and 17194 other posts that don't fit either category.
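The prefix check used in the filtering cell can also be sketched as a small reusable function. This is only an illustration: the name classify_title and the sample titles are hypothetical and are not part of the original notebook or the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# hypothetical sample titles, not taken from the dataset
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "show hn: My weekend project",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first means that 'Ask HN', 'ask hn', and 'ASK HN' are all treated the same way, which matches the behaviour of the loop above.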

Now that we have found our Ask and Show posts, it is time to count the number of comments for each list of posts, since one of our goals was to find out which category of posts receives more comments on average.

In [36]:
# get total number of comments in ask posts
total_ask_comments = 0

# find the number of comments in ask posts
for post in ask_posts:
    n_comments = int(post[4])
    total_ask_comments += n_comments
avg_ask_comments = total_ask_comments/len(ask_posts)

# display the average number of comments in ask posts
print("The average number of comments in ask posts:", avg_ask_comments)

# get total number of comments in show posts
total_show_comments = 0

# find the number of comments in show posts
for post in show_posts:
    n_comments = int(post[4])
    total_show_comments += n_comments
avg_show_comments = total_show_comments/len(show_posts)

# display the average number of comments in show posts
print("The average number of comments in show posts:", avg_show_comments)

The average number of comments in ask posts: 14.038417431192661
The average number of comments in show posts: 10.31669535283993


<b>Comment:</b> Here, we can see that 'Ask HN' posts receive more comments on average (about 14.04 per post) than 'Show HN' posts (about 10.32 per post). This makes sense, as asking a question tends to invite responses from people inclined to provide an answer. The data in these posts is consistent with that expectation.
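The averaging cell repeats the same loop for each list of posts. A minimal sketch of how that step could be factored into one function is shown here, assuming the num_comments column sits at index 4 as in the header printed earlier. The function name average_comments and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column over a list of rows."""
    if not posts:
        # avoid dividing by zero when a category has no posts
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# hypothetical rows laid out like the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: A", "", "5", "10", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: B", "", "3", "20", "user_b", "1/1/2016 11:00"],
]
print(average_comments(sample_posts))  # 15.0
```

With the notebook's own lists, the same function could be called as average_comments(ask_posts) and average_comments(show_posts).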

Having found that posts whose titles start with 'Ask HN' receive more comments on average, we can now turn to the second part of our objective: figuring out whether the post creation time influences the average number of comments a post receives. Because 'Ask HN' posts receive more comments, we will focus on them for this part.
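Before grouping posts by hour, it helps to see how a single created_at string is turned into an hour. The sketch below assumes the month/day/year hour:minute layout seen in the sample rows printed earlier. The helper name hour_of is hypothetical.

```python
import datetime as dt


def hour_of(created_at):
    """Parse a created_at string such as '8/4/2016 11:52' and return its hour."""
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    return parsed.hour


# created_at values copied from the first and fourth rows shown earlier
print(hour_of("8/4/2016 11:52"))   # 11
print(hour_of("6/17/2016 0:01"))   # 0
```

The hour comes back as an integer from 0 to 23, which makes it convenient to use as a dictionary key when totalling comments per hour.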

In [51]:
# import modules
import datetime as dt

# for each post, fetch the post creation time and the number of comments
result_list = []
for post in ask_posts:
    post_creation_time = post[6]
    post_num_comments = int(post[4])
    result = [post_creation_time, post_num_comments]
    result_list.append(result)

# total up the posts and comments for each hour of the day:
# parse the hour the post was created and add its number of comments
# this tells us how many comments are received by posts created in
# that hour, so we can get the average number of comments for the hour
# by dividing by the number of posts made in that hour
counts_by_hour, comments_by_hour = {}, {}
for result in result_list:
    creation_date = result[0]
    creation_date = dt.datetime.strptime(creation_date, "%m/%d/%Y %H:%M")
    hour = creation_date.hour
    if (hour not in counts_by_hour):
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]

# get the average number of comments for the hour in which the post is created
# (hours are visited in ascending order so the listing is easy to scan)
average_by_hour = []
for hour in sorted(counts_by_hour):
    average_comments = comments_by_hour[hour]/counts_by_hour[hour]
    average_by_hour.append([hour, average_comments])
    
# display the results of average number of comments by hour
for item in average_by_hour:
    hour = item[0]
    average = item[1]
    print("Hour:", hour, " -> Average number of comments:", round(average, 2))

Hour: 0  -> Average number of comments: 8.13
Hour: 1  -> Average number of comments: 11.38
Hour: 2  -> Average number of comments: 23.81
Hour: 3  -> Average number of comments: 7.8
Hour: 4  -> Average number of comments: 7.17
Hour: 5  -> Average number of comments: 10.09
Hour: 6  -> Average number of comments: 9.02
Hour: 7  -> Average number of comments: 7.85
Hour: 8  -> Average number of comments: 10.25
Hour: 9  -> Average number of comments: 5.58
Hour: 10  -> Average number of comments: 13.44
Hour: 11  -> Average number of comments: 11.05
Hour: 12  -> Average number of comments: 9.41
Hour: 13  -> Average number of comments: 14.74
Hour: 14  -> Average number of comments: 13.23
Hour: 15  -> Average number of comments: 38.59
Hour: 16  -> Average number of comments: 16.8
Hour: 17  -> Average number of comments: 11.46
Hour: 18  -> Average number of comments: 13.2
Hour: 19  -> Average number of comments: 10.8
Hour: 20  -> Average number of comments: 21.52
Hour: 21  -> Average number of com

<b>Comment:</b> The results are hard to read in this form, with every hour listed alongside its average number of comments. So, before we infer anything, let's display these results a little more neatly.

In [69]:
# sort the results based on the average number of comments in descending order
sorted_results = sorted(average_by_hour, key = lambda x: x[1], reverse = True)

# display the top 3 results
print("Top 3 hours for 'Ask HN' posts to receive the most comments:")
for hour, avg_comments in sorted_results[0:3]:
    # format the integer hour as HH:MM
    hour_label = dt.time(hour).strftime("%H:%M")
    result = "{hour}: {avg_comments:0.2f} average comments per post"
    print(result.format(hour = hour_label, avg_comments = avg_comments))

Top 3 hours for 'Ask HN' posts to receive the most comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post


<b>Comment:</b> And the results are in! We see that 3 PM (Eastern Time in the US) is the best time to write an 'Ask HN' post, as posts created in that hour receive 38.59 comments on average, followed by 2 AM and then 8 PM. So, if we wanted to maximize the chance of getting a question answered, these would be the best hours to post.
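An average computed over only a few posts can be dominated by a single heavily commented post, so it can help to look at the number of posts per hour next to each average. The following is a minimal sketch of the per-hour aggregation using collections.defaultdict. The function name average_comments_by_hour and the sample rows are hypothetical, and the date format is assumed to match the dataset's created_at column.

```python
from collections import defaultdict
import datetime as dt


def average_comments_by_hour(rows):
    """Map each hour to (post count, average comments) for (created_at, num_comments) rows."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        totals[hour][0] += 1
        totals[hour][1] += num_comments
    return {hour: (count, total / count) for hour, (count, total) in totals.items()}


# hypothetical rows, not taken from the dataset
sample_rows = [
    ("8/4/2016 15:10", 30),
    ("8/5/2016 15:45", 10),
    ("8/6/2016 2:05", 8),
]
for hour, (count, avg) in sorted(average_comments_by_hour(sample_rows).items()):
    print(f"{hour:02d}:00 -> {count} posts, {avg:.2f} average comments")
```

With the notebook's data, result_list already holds [creation time, number of comments] pairs, so it could be passed to this function directly.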

<h2>Conclusion Of Analysis</h2>
<p>Thus, we completed our objectives today, which were to find out whether 'Ask HN' or 'Show HN' posts get more comments on average and to find the best hour to create a post so that it receives the most comments on average. From this analysis, we concluded that 'Ask HN' posts receive more comments on average, and that the best time to create such a post is 3 PM Eastern Time in the US.</p>