## Guided Project: Exploring Hacker News Posts

The goal of this project is to analyze posts submitted to the Hacker News website and determine whether posts that ask the community a question (Ask HN) receive more comments on average than posts that show the community something (Show HN). We also determine whether the time a post is created affects the number of comments it receives.

#### Import necessary libraries

In [1]:
# import csv library to read csv file.
import csv

# import datetime to work with the data in the created_at column
import datetime as dt

#### Data collection
1. Read the **"hacker_news.csv"** file in as a list of lists and assign the result to the variable *hn*
2. Display the first five rows of *hn*

In [2]:
with open("./Assets/hacker_news.csv") as file:
    # read the "hacker_news.csv" with csv reader
    reader = csv.reader(file)
    # assign the reading result to the variable hn as a list of lists.
    hn =  list(reader)

# Display the first five rows to get a feel for the data
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


#### Data cleaning

1. First, we remove the row containing the column headers from our data. It is the first row in *hn*.

In [3]:
# Extract the header row and assign it to the variable headers
headers = hn[0]
# Remove the header row from hn by keeping everything after it
hn = hn[1:]

# Check that everything went as expected
print(headers)
print()
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
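A common alternative for separating the header from the data is star-unpacking, which does both in one step. The sketch below is only an illustration: it reads two rows copied from the preview above out of an in-memory string (so it runs without the CSV file), and the names `sample_csv`, `header`, and `rows` are assumptions, not part of the project.

```python
import csv
import io

# Header plus two rows copied from the dataset preview, held in memory
# so the example does not depend on the CSV file being present.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# The first row goes to `header`; every remaining row goes to `rows`.
header, *rows = csv.reader(sample_csv)

print(header)
print(len(rows))
```

With the real file, the same unpacking could be applied to `csv.reader(file)` inside the `with` block.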


2. We are interested in posts whose titles begin with **Ask HN** or **Show HN**. In this step, we extract the rows that meet these conditions.

In [4]:
# We create 3 lists:
# ask_posts => posts whose title begins with Ask HN
# show_posts => posts whose title begins with Show HN
# other_posts => all remaining posts
ask_posts = []
show_posts = []
other_posts = []

# We loop over hn
# For each row, we check whether the title (index 1) starts with Ask HN or Show HN using the string method startswith
# Because startswith is case-sensitive, we lowercase the title and compare it against lowercase prefixes
for row in hn:
    title = row[1].lower()
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(f"{len(ask_posts)} Ask HN posts")
print(f"{len(show_posts)} Show HN posts")
print(f"{len(other_posts)} other posts")

1744 Ask HN posts
1162 Show HN posts
17194 other posts
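The prefix check above can also be isolated in a small function, which makes it easy to test on its own. This is a minimal sketch; `classify_post` is a hypothetical helper name, and the first two example titles are made up for illustration.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles, except the last one, which comes from the dataset preview.
for example in ["Ask HN: an example question", "SHOW HN: an example demo", "Interactive Dynamic Video"]:
    print(example, "->", classify_post(example))
```

The loop over `hn` could then append each row to the list selected by the returned label.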


#### Data analysis
Now that we have extracted the posts that matter to us:

1. Let's determine whether Ask posts or Show posts receive more comments on average.

In [5]:
# We create a variable to hold the total number of comments across all Ask posts
total_ask_comments = 0

# We loop over the list of Ask posts,
# get the number of comments (index 4)
# and add it to total_ask_comments
for row in ask_posts:
    total_ask_comments += int(row[4])

# We calculate the average by dividing total_ask_comments by the number of Ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(f"Average ask comments: {avg_ask_comments}")

# Same process for Show posts
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])

# Divide by the number of Show posts (not Ask posts)
avg_show_comments = total_show_comments / len(show_posts)
print(f"Average show comments: {avg_show_comments}")

Average ask comments: 14.038417431192661
Average show comments: 10.31669535283993


Ask posts receive around 14 comments on average, while Show posts receive around 10.  
Overall, Ask posts receive roughly 36% more comments than Show posts.
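Because the same averaging logic is applied to both lists, it can be wrapped in one function, which makes it harder to divide by the wrong list length. The sketch below is an assumption about how that might look; `average_comments` and its `comments_index` parameter are hypothetical names, and the two sample rows are copied from the dataset preview.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows from the dataset preview (52 and 10 comments).
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))
```

In the notebook this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.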

Since Ask posts receive more comments on average than Show posts, let's use the Ask posts data to determine whether Ask posts created at a certain time are more likely to attract comments.

In [6]:
# First, let's calculate the number of Ask posts created in each hour of the day,
# along with the number of comments they received.

# We extract the data we need by creating a list of lists that contains
# each post's creation date and the number of comments it received.
result_list = []

for row in ask_posts:
    post_creation_date = row[6]
    comment_number =  int(row[4])
    result_list.append([post_creation_date, comment_number])

# number of posts per hour
counts_by_hour = {}
# number of comments per hour
comments_by_hour = {}

for row in result_list:
    comment_number =  row[1]
    # the date is in a format like this: "month/day/year hour:minute"
    # we only need the hour
    # to extract it, we use strptime() to convert the string to a datetime object
    # then read the hour from the .hour attribute
    hour = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M").hour
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_number
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_number

# Let's calculate the average number of comments per post for posts created during each hour of the day
avg_by_hour = []

for hour in comments_by_hour:
    # The average for an hour is the total number of comments for that hour divided by the number of posts
    avg_number_comments_by_hour = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_number_comments_by_hour])

# We swap avg_comments and hour so that sorted() orders the rows by avg_comments (descending with reverse=True)
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour, end="\n\n")

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# We display the first 5 rows.
# dt.datetime.strptime(str(row[1]), "%H").strftime("%H:%M") formats the hour as "hour:00"
# (single quotes inside the f-string keep it valid on Python versions before 3.12)
for row in sorted_swap[:5]:
    print(f"{dt.datetime.strptime(str(row[1]), '%H').strftime('%H:%M')}: {row[0]:.2f} average comments per post")

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
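The swap step exists only so that `sorted()` compares the averages first. As a sketch of an alternative, `sorted()` accepts a `key` function, so the `[hour, average]` pairs can be ordered directly. The five pairs below are copied from the output above; `top_hours` is an illustrative name.

```python
# A few [hour, average] pairs taken from the output above.
avg_by_hour = [
    [9, 5.5777777777777775],
    [13, 14.741176470588234],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort on the second element (the average) without swapping the columns.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    # :02d zero-pads the hour, so 2 is shown as 02
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
```

This keeps each row in `[hour, average]` order throughout, so no second list is needed.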


We already determined that Ask posts receive around 14 comments on average overall. Ask posts created at 15:00, 02:00, 20:00, 16:00, and 21:00 receive more comments than that average, so posts created at these hours have a higher chance of receiving comments. 15:00 is the best hour, with about 38.59 comments per post. 
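The comparison against the overall average can be made explicit with a filter. The sketch below uses the per-hour averages from the output above, rounded to two decimals, and the overall Ask average computed earlier; `overall_avg` and `above_average` are illustrative names. It also shows that 13:00 clears the overall average even though it falls outside the top five.

```python
# Overall average comments per Ask post, from the earlier cell.
overall_avg = 14.038417431192661

# hour -> average comments per post, rounded from the output above.
avg_by_hour = {
    0: 8.13, 1: 11.38, 2: 23.81, 3: 7.80, 4: 7.17, 5: 10.09,
    6: 9.02, 7: 7.85, 8: 10.25, 9: 5.58, 10: 13.44, 11: 11.05,
    12: 9.41, 13: 14.74, 14: 13.23, 15: 38.59, 16: 16.80, 17: 11.46,
    18: 13.20, 19: 10.80, 20: 21.52, 21: 16.01, 22: 6.75, 23: 7.99,
}

# Keep only the hours whose average exceeds the overall average.
above_average = sorted(hour for hour, avg in avg_by_hour.items() if avg > overall_avg)

print(above_average)
```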
All of these hours are in US Eastern Time. Since we are in Moscow, let's determine when somebody in Moscow should create a post.

In [7]:
# Time difference between US Eastern Time and Moscow: +8 hours
# (this holds while the US is on standard time; during US daylight saving time the difference is 7 hours)
time_difference = dt.timedelta(hours=8) 

# let's display our result in Moscow time
for row in sorted_swap[:5]:
    print(f"{(dt.datetime.strptime(str(row[1]), '%H') + time_difference).strftime('%H:%M')}: {row[0]:.2f} average comments per post")

23:00: 38.59 average comments per post
10:00: 23.81 average comments per post
04:00: 21.52 average comments per post
00:00: 16.80 average comments per post
05:00: 16.01 average comments per post


So in Moscow time the best hours are 23:00 (the optimal hour), 10:00, 04:00, 00:00, and 05:00.
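The fixed 8-hour shift assumes US Eastern Standard Time (UTC-5) against Moscow (UTC+3). While the US observes daylight saving time (UTC-4), the gap is 7 hours. The sketch below makes that assumption visible by converting with fixed-offset `timezone` objects; `eastern_hour_to_moscow`, `EST`, `EDT`, and `MOSCOW` are hypothetical names, and the date used inside the function is an arbitrary placeholder because only the hour matters.

```python
import datetime as dt

# Fixed UTC offsets (assumptions stated above).
EST = dt.timezone(dt.timedelta(hours=-5))    # US Eastern, standard time
EDT = dt.timezone(dt.timedelta(hours=-4))    # US Eastern, daylight saving time
MOSCOW = dt.timezone(dt.timedelta(hours=3))  # Moscow, UTC+3


def eastern_hour_to_moscow(hour, eastern_tz=EST):
    """Convert an hour of the day in US Eastern Time to "HH:MM" in Moscow time."""
    # The calendar date is a placeholder; only the time of day is used.
    eastern_time = dt.datetime(2016, 1, 1, hour, tzinfo=eastern_tz)
    return eastern_time.astimezone(MOSCOW).strftime("%H:%M")


for hour in [15, 2, 20, 16, 21]:
    print(f"{hour:02d}:00 Eastern -> {eastern_hour_to_moscow(hour)} Moscow (EST), "
          f"{eastern_hour_to_moscow(hour, EDT)} Moscow (EDT)")
```

A fuller treatment would use the post's actual creation date with a time zone database (for example the standard library's `zoneinfo`), since the correct offset depends on the date.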