# Hacker News guided project

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that we have reduced it from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- **id:** the unique identifier from Hacker News for the post
- **title:** the title of the post
- **url:** the URL that the post links to, if the post has a URL
- **num_points:** the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments:** the number of comments on the post
- **author:** the username of the person who submitted the post
- **created_at:** the date and time of the post's submission
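
Because the rows are read into plain lists, each column is addressed by its position. The following is a minimal sketch, assuming the CSV header follows the column order listed above, of a name-to-index lookup that avoids hard-coded positions. The names `COLUMNS`, `COL_INDEX`, and `sample_row` are illustrative, and the row values are made up.

```python
# Column names in the order described above (assumed to match the CSV header).
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
COL_INDEX = {name: position for position, name in enumerate(COLUMNS)}

# Illustrative row in the same layout as the dataset (values are made up).
sample_row = ['1', 'Ask HN: Example question?', '', '3', '7', 'someone', '9/26/2016 3:26']

# Look up fields by name instead of by a bare number.
title = sample_row[COL_INDEX['title']]
num_comments = int(sample_row[COL_INDEX['num_comments']])
print(title, num_comments)
```

With such a lookup, `row[COL_INDEX['num_comments']]` reads the same value as `row[4]`, but states which column is meant.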
    
**We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else of interest.**

We'll compare these two types of posts to determine the following:

**Do Ask HN or Show HN posts receive more comments on average?**

**Do posts created at a certain time receive more comments on average?**

Let's start by importing the libraries we need and reading the dataset into a list of lists.

In [1]:
# create a new class which makes it possible to format text in colour or bold etc.

class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

### Import data

In [2]:
from csv import reader

# Read the CSV file into a list of lists; the with-block closes the file afterwards
with open('hacker_news.csv', encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

# Show the header row and the first four data rows
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


In [3]:
headers = hn[0]
hn = hn[1:]

print(headers)
print('')
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


### Check and Clean data

Now that the data has been imported and the header has been split from the actual data, it's time for the next step in this project. First the data will be checked for missing values. After the checks have been performed, the data will be filtered.

In [4]:
# Define a function that counts missing (empty-string) values per column
def check_null_data(dataset_header, dataset):

    total_missing = []

    # Loop over each column index and count the rows with an empty value in it
    for column in range(len(dataset_header)):
        null_count = 0

        for row in dataset:
            if row[column] == '':
                null_count += 1

        total_missing.append([dataset_header[column], null_count])

    # Print the number of missing values found in each column
    template = "Column " + color.BOLD + "{} " + color.END + "has " + color.YELLOW + color.BOLD + "{} " + color.END + "missing values"

    for col, missing in total_missing:
        print(template.format(col, missing))

# Run the defined function
check_null_data(headers, hn)

Column id has 0 missing values
Column title has 0 missing values
Column url has 13863 missing values
Column num_points has 0 missing values
Column num_comments has 0 missing values
Column author has 0 missing values
Column created_at has 0 missing values


As can be seen above, there are a lot of missing values in the column **'url'**. All other columns show no missing values. Since the URL is not needed for this analysis, we can continue with filtering the data. As mentioned before, we are especially interested in 'ask posts' and 'show posts'. The next step is to split the dataset into three lists.

In [5]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

#show how many posts there are of each type 

template = "There are" + color.YELLOW + color.BOLD + " {:,} " + color.END +  "of {} in the dataset "
        
print(template.format(len(ask_posts),color.BOLD + "'ask posts'" + color.END),'\n')
print(template.format(len(show_posts),color.BOLD + "'show posts'" + color.END),'\n')
print(template.format(len(other_posts),color.BOLD + "'other posts'" + color.END))

There are 9,139 of 'ask posts' in the dataset

There are 10,158 of 'show posts' in the dataset

There are 273,822 of 'other posts' in the dataset


In [6]:
total_ask_comments = 0

for row in ask_posts:
    
    ask_comment = int(row[4])
    total_ask_comments += ask_comment
    
avg_ask_comments = total_ask_comments/len(ask_posts)

total_show_comments = 0

for row in show_posts:
    
    show_comment = int(row[4])
    total_show_comments += show_comment
    
avg_show_comments = total_show_comments/len(show_posts)

template = "Average comments on {} are: "+ color.BOLD + color.YELLOW + "{}" + color.END

# Print the average number of comments on ask posts

print(template.format(color.BOLD + "ask posts" + color.END,avg_ask_comments),'\n')

# Print the average number of comments on show posts
print(template.format(color.BOLD + "show post" + color.END, avg_show_comments))

Average comments on ask posts are: 10.393478498741656

Average comments on show post are: 4.886099625910612


Ask posts receive about 10.39 comments per post on average, while show posts receive about 4.89, so ask posts receive more comments on average.
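
The two averaging loops differ only in the list they read, so the calculation can be factored into one helper. The sketch below is illustrative and not part of the original notebook: `average_comments` and `example_posts` are hypothetical names, the rows are made up, and the comment count is assumed to sit at index 4 as in the dataset.

```python
from statistics import mean

def average_comments(posts, comments_index=4):
    """Return the mean number of comments across posts (rows are lists of strings)."""
    return mean(int(row[comments_index]) for row in posts)

# Made-up rows in the dataset's column layout, for illustration only.
example_posts = [
    ['1', 'Ask HN: A?', '', '2', '4', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: B?', '', '5', '10', 'user_b', '9/26/2016 4:10'],
]

print(average_comments(example_posts))
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace both loops. Note that `statistics.mean` raises `StatisticsError` when given an empty list.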

The second question that needs to be answered is: Do posts created at a certain time receive more comments on average? Since ask posts receive more comments, the rest of the analysis looks at ask posts only.

In [7]:
#importing datetime library
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6],int(row[4])])#appending created at and number of comments

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    
    comments_count = row[1]
    date_string = row[0]
    
    date_created = dt.datetime.strptime(date_string,"%m/%d/%Y %H:%M")
    
    hour_created = date_created.hour
    
    if hour_created not in counts_by_hour:
        
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count 
    else:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count
        


In [8]:
print(color.BOLD + 'Posts created' + color.END, 'by hour: ', counts_by_hour,'\n')

print(color.BOLD + 'Total comments added' + color.END, 'by hour:', comments_by_hour)

Posts created by hour:  {2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209} 

Total comments added by hour: {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}


In [9]:
avg_by_hour = []

# For each hour, divide the total number of comments by the number of posts
for key in counts_by_hour:
    avg_comments = comments_by_hour[key]/counts_by_hour[key]
    avg_by_hour.append([key,avg_comments])
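
The counting and averaging steps above can also be sketched with `collections.defaultdict`, which removes the need for the explicit "key not yet present" branch. This is an alternative formulation under the same date format, not the notebook's own code. `example_results` is a made-up stand-in for `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the dataset's date format.
example_results = [
    ['9/26/2016 3:26', 4],
    ['9/27/2016 3:05', 6],
    ['9/26/2016 15:40', 10],
]

# Missing keys start at 0, so no membership check is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for date_string, comments_count in example_results:
    hour_created = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M").hour
    counts_by_hour[hour_created] += 1
    comments_by_hour[hour_created] += comments_count

# Average comments per post for each hour, as [hour, average] pairs.
avg_by_hour = [[hour, comments_by_hour[hour] / counts_by_hour[hour]]
               for hour in counts_by_hour]
print(avg_by_hour)
```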

In [10]:
#let's swap the column order from avg_by_hour
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
swap_avg_by_hour

[[11.137546468401487, 2],
 [7.407801418439717, 1],
 [8.804177545691905, 22],
 [8.687258687258687, 21],
 [7.163043478260869, 19],
 [9.449744463373083, 17],
 [28.676470588235293, 15],
 [9.692007797270955, 14],
 [16.31756756756757, 13],
 [8.96474358974359, 11],
 [10.684397163120567, 10],
 [6.653153153153153, 9],
 [7.013274336283186, 7],
 [7.948339483394834, 3],
 [6.696793002915452, 23],
 [8.749019607843136, 20],
 [7.713298791018998, 16],
 [9.190661478599221, 8],
 [7.5647840531561465, 0],
 [7.94299674267101, 18],
 [12.380116959064328, 12],
 [9.7119341563786, 4],
 [6.782051282051282, 6],
 [8.794258373205741, 5]]
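
Swapping the columns makes a plain `sorted` call order the rows by average. As a sketch of an alternative, `sorted` also accepts a `key` function, so the `[hour, average]` order can be kept. The values below are a few rounded pairs taken from the output above, and `top_hours` is an illustrative name.

```python
# A few [hour, average] pairs from the results above, rounded to two decimals.
avg_by_hour = [[2, 11.14], [15, 28.68], [13, 16.32], [12, 12.38], [10, 10.68], [9, 6.65]]

# Sort by the average (second element) in descending order, keeping [hour, average] order.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

# Hours here are still in the dataset's own time zone (no shift applied).
for hour, average in top_hours[:3]:
    print('{:02d}:00 -> {:.2f}'.format(hour, average))
```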

In [11]:
# Sort swap_avg_by_hour in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print(color.BOLD + "Top 5 Hours for Ask Posts Comments" + color.END,'\n')

for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]),'%H')
    hour_formatted += dt.timedelta(hours=2) # converting EST (UTC-5) to UTC-3
    hour_formatted = hour_formatted.strftime('%H:%M')

    print('At {}: Posts receive on average {:.2f} comments per post'.format(hour_formatted,row[0]))
    

Top 5 Hours for Ask Posts Comments

At 17:00: Posts receive on average 28.68 comments per post
At 15:00: Posts receive on average 16.32 comments per post
At 14:00: Posts receive on average 12.38 comments per post
At 04:00: Posts receive on average 11.14 comments per post
At 12:00: Posts receive on average 10.68 comments per post


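The results show that ask posts created at certain hours tend to receive more comments. With the timestamps shifted from EST to UTC-3, posts created at 17:00 receive the most comments by a wide margin (28.68 per post on average), followed by 15:00 (16.32) and 14:00 (12.38). The afternoon is therefore the time at which an ask post is most likely to receive a lot of comments.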
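
The two-hour shift used for the top-five listing can be made explicit with timezone-aware datetimes. The sketch below assumes, as the notebook's code comment does, that the dataset's timestamps are in EST (UTC-5) and that the target zone is UTC-3. It uses fixed offsets and ignores daylight saving time, and the names `EST`, `UTC_MINUS_3`, and `est_hour_to_utc_minus_3` are illustrative.

```python
import datetime as dt

# Fixed offsets: EST is UTC-5, the target zone is UTC-3 (daylight saving time is ignored).
EST = dt.timezone(dt.timedelta(hours=-5))
UTC_MINUS_3 = dt.timezone(dt.timedelta(hours=-3))

def est_hour_to_utc_minus_3(hour):
    """Convert an hour of day in EST to the matching hour of day in UTC-3."""
    # The date is arbitrary; only the hour matters here.
    stamp = dt.datetime(2016, 9, 26, hour, 0, tzinfo=EST)
    return stamp.astimezone(UTC_MINUS_3).hour

for hour in (15, 13, 23):
    print('{:02d}:00 EST -> {:02d}:00 UTC-3'.format(hour, est_hour_to_utc_minus_3(hour)))
```

Hours late in the evening wrap past midnight, which `astimezone` handles without extra arithmetic.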