In [116]:
import csv
import datetime as dt

This project deliberately uses only the Python standard library: no pandas, NumPy, or other third-party packages.

In this project I'll be working with data from the site ['Hacker News'](https://news.ycombinator.com/). The data comes from [this Kaggle dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) and has about 300,000 rows.

|column|description|
|-----|-----|
|id|id from Hacker News|
|title|title of the post|
|url|URL that the post links to|
|num_points|number of points the post received|
|num_comments|number of comments on the post|
|author|username of the author|
|created_at|date and time the post was created|

Some post titles begin with 'Ask HN', which means that somebody is asking the Hacker News community a question. Others begin with 'Show HN', which is used when users want to show a project, a product, or something else interesting. Below are some examples of the data:

In [137]:
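# Read the whole file into a list of rows; the with-block closes the file afterwards
with open('hackers_news.csv', encoding='UTF-8', newline='') as f:
    hn = list(csv.reader(f))
hn[:4]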

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19']]

### Splitting the list into the header and the data

In [118]:
header = hn[0]
data = hn[1:]

In [119]:
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [126]:
len(data)

293119

In [120]:
data[0]

['12579008',
 'You have two days to comment if you want stem cells to be classified as your own',
 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
 '1',
 '0',
 'altstar',
 '9/26/2016 3:26']
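
Every field comes back from `csv.reader` as a string, so the numeric and date columns need converting before any arithmetic. A minimal sketch of that conversion for the row above, where the helper name `parse_row` and the dictionary layout are illustrative assumptions rather than part of this notebook:

```python
import datetime as dt

# Same column order as the header shown above.
HEADER = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

def parse_row(row):
    """Hypothetical helper: turn one raw CSV row into a dict with typed values."""
    record = dict(zip(HEADER, row))
    record['num_points'] = int(record['num_points'])
    record['num_comments'] = int(record['num_comments'])
    # Dates look like '9/26/2016 3:26': month/day/year, 24-hour clock.
    record['created_at'] = dt.datetime.strptime(record['created_at'], "%m/%d/%Y %H:%M")
    return record

first = parse_row([
    '12579008',
    'You have two days to comment if you want stem cells to be classified as your own',
    'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
    '1', '0', 'altstar', '9/26/2016 3:26',
])
print(first['num_comments'], first['created_at'].hour)
```

The rest of this notebook converts values inline with `int(row[4])` and `strptime` instead, which works just as well for a handful of columns.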

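This is where I split the posts into three categories: 'Ask HN', 'Show HN', and everything else. A post counts as 'Ask HN' or 'Show HN' if that phrase appears anywhere in its lowercased title. Each category is then stored twice: once with all of its posts and once with only the posts that have at least one comment.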

In [130]:
ask_hn_comm = []
show_hn_comm = []
other_comm = []
ask_hn = []
show_hn = []
other = []


for row in data:
    title = row[1].lower()
    if int(row[4]) > 0:
        if 'ask hn' in title:
            ask_hn_comm.append(row)
        elif 'show hn' in title:
            show_hn_comm.append(row)
        else:
            other_comm.append(row)  
            
    if 'ask hn' in title:
        ask_hn.append(row)
    elif 'show hn' in title:
        show_hn.append(row)
    else:
        other.append(row)
        
            
print('There are ' + str(len(ask_hn_comm)) + " 'Ask HN' posts with comments")
print('There are ' + str(len(show_hn_comm)) + " 'Show HN' posts with comments")
print('There are ' + str(len(other_comm)) + ' other posts with comments')
print('\n')
print('There are ' + str(len(ask_hn)) + " 'Ask HN' posts")
print('There are ' + str(len(show_hn)) + " 'Show HN' posts")
print('There are ' + str(len(other)) + ' other posts')

There are 6918 'Ask HN' posts with comments
There are 5068 'Show HN' posts with comments
There are 68415 other posts with comments


There are 9147 'Ask HN' posts
There are 10170 'Show HN' posts
There are 273802 other posts
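
The loop above matches the phrase anywhere in the title, so a title that merely mentions 'ask hn' is counted too. A stricter variant, assuming only titles that start with the phrase should count, could use `str.startswith`. The function name `categorize` and the sample titles below are made up for illustration:

```python
def categorize(title):
    """Hypothetical helper: label a title by its leading phrase."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Invented sample titles, not rows from the dataset.
samples = [
    'Ask HN: How do you learn a new language?',
    'Show HN: My weekend project',
    'Why I stopped reading Ask HN threads',
]
for title in samples:
    print(categorize(title), '-', title)
```

This variant would give different counts from the ones printed above, so the two approaches should not be mixed within one analysis.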


### Counting the Average Number Of Comments for each category

In [131]:
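def count_average(data):
    """Return the mean number of comments (column 4) over the given rows."""
    total = 0
    for row in data:
        total += int(row[4])
    # Assumes data is non-empty; an empty list would raise ZeroDivisionError
    average = total / len(data)
    return average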

In [132]:
average_ask = count_average(ask_hn)
average_show = count_average(show_hn)
average_other = count_average(other)
average_ask_hn_comm = count_average(ask_hn_comm)
average_show_hn_comm = count_average(show_hn_comm)
average_other_comm = count_average(other_comm)

print('The average count of comments for all ask submissions: ' + str(round(average_ask,2)))
print('The average count of comments for all show submissions: ' + str(round(average_show,2)))
print('The average count of comments for all other submissions: ' + str(round(average_other,2)))
print('The average count of comments for ask submissions with comments: ' + str(round(average_ask_hn_comm,2)))
print('The average count of comments for show submissions with comments: ' + str(round(average_show_hn_comm,2)))
print('The average count of comments for other submissions with comments: ' + str(round(average_other_comm,2)))

The average count of comments for all ask submissions: 10.39
The average count of comments for all show submissions: 4.89
The average count of comments for all other submissions: 6.46
The average count of comments for ask submissions with comments: 13.73
The average count of comments for show submissions with comments: 9.8
The average count of comments for other submissions with comments: 25.84
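
The two sets of averages are linked: posts without comments add nothing to the comment total, so the average over all posts equals the average over commented posts multiplied by the share of posts that have comments. A quick check of that relationship, using the rounded figures printed above (so small rounding differences are expected):

```python
# (posts with comments, all posts, average over commented posts, reported overall average)
figures = {
    'ask': (6918, 9147, 13.73, 10.39),
    'show': (5068, 10170, 9.8, 4.89),
    'other': (68415, 273802, 25.84, 6.46),
}

def overall_average(with_comments, total, avg_with_comments):
    """Average over all posts, rebuilt from the commented-only average."""
    return avg_with_comments * with_comments / total

for name, (with_comments, total, avg_comm, reported) in figures.items():
    rebuilt = overall_average(with_comments, total, avg_comm)
    print(f'{name}: rebuilt {rebuilt:.2f}, reported {reported}')
```

The rebuilt values agree with the reported ones to within about 0.01, which is what rounding the inputs to two decimals would explain.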


### Average number of comments per post for each hour, by category

In [133]:
def avg_comments_per_hour(data):
    """Return (hour, average comments per post) pairs, one per posting hour."""
    comments_by_hour = {}
    counts_by_hour = {}
    date_format = "%m/%d/%Y %H:%M"

    for row in data:
        comments = int(row[4])
        # Two-digit hour ('00'-'23') at which the post was created
        hour = dt.datetime.strptime(row[6], date_format).strftime("%H")
        if hour in counts_by_hour:
            comments_by_hour[hour] += comments
            counts_by_hour[hour] += 1
        else:
            # The first post seen for an hour seeds both totals
            comments_by_hour[hour] = comments
            counts_by_hour[hour] = 1

    avg_by_hour = []
    for hour in comments_by_hour:
        avg_by_hour.append((hour, comments_by_hour[hour] / counts_by_hour[hour]))
    return avg_by_hour

In [134]:
avg_ask_by_hour = avg_comments_per_hour(ask_hn)
avg_show_by_hour = avg_comments_per_hour(show_hn)
avg_other_by_hour = avg_comments_per_hour(other)
avg_ask_by_hour_comm = avg_comments_per_hour(ask_hn_comm)
avg_show_by_hour_comm = avg_comments_per_hour(show_hn_comm)
avg_other_by_hour_comm = avg_comments_per_hour(other_comm)
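
The same grouping can be written more compactly with `collections.defaultdict`. This is only a sketch of an alternative, not the function used in this notebook; the name `avg_by_hour_sketch` and the three sample rows are invented for illustration:

```python
import datetime as dt
from collections import defaultdict

def avg_by_hour_sketch(rows, date_format="%m/%d/%Y %H:%M"):
    """Hypothetical variant: map each posting hour to its average comment count."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [comment total, post count]
    for row in rows:
        hour = dt.datetime.strptime(row[6], date_format).hour
        totals[hour][0] += int(row[4])
        totals[hour][1] += 1
    return {hour: comments / posts for hour, (comments, posts) in totals.items()}

# Invented rows in the dataset's column order; only columns 4 and 6 are used.
rows = [
    ['1', 'Ask HN: a', '', '1', '4', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: b', '', '1', '6', 'user_b', '9/27/2016 3:05'],
    ['3', 'Ask HN: c', '', '1', '2', 'user_c', '9/26/2016 15:40'],
]
print(avg_by_hour_sketch(rows))
```

Here the hour is kept as an integer rather than a two-digit string, which is a design choice of the sketch; either form works as a dictionary key.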

### Top 5 hours by average number of comments per post

In [138]:
def show_top_hour(avg_by_hour, submission_type):
    """Print the five hours with the highest average comments per post."""
    # Put the average first so that sorting orders the pairs by average
    swap_avg_by_hour = []
    for hour, avg in avg_by_hour:
        swap_avg_by_hour.append([avg, hour])

    sorted_swap = sorted(swap_avg_by_hour, reverse=True)
    print("\nTop 5 Hours for '" + submission_type + "'")
    for avg, hr in sorted_swap[:5]:
        print(
            "{}: {:.2f} average comments per post".format(
                dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
            )
        )

In [139]:
show_top_hour(avg_ask_by_hour, 'ASK HN')
show_top_hour(avg_show_by_hour, 'SHOW HN')
show_top_hour(avg_other_by_hour, 'OTHER')
show_top_hour(avg_ask_by_hour_comm, 'ASK HN with comments')
show_top_hour(avg_show_by_hour_comm, 'Show HN with comments')
show_top_hour(avg_other_by_hour_comm, 'Other with comments')


Top 5 Hours for 'ASK HN'
15:00: 28.64 average comments per post
13:00: 16.32 average comments per post
12:00: 12.37 average comments per post
02:00: 11.12 average comments per post
10:00: 10.69 average comments per post

Top 5 Hours for 'SHOW HN'
12:00: 7.00 average comments per post
07:00: 6.69 average comments per post
11:00: 5.99 average comments per post
08:00: 5.61 average comments per post
14:00: 5.52 average comments per post

Top 5 Hours for 'OTHER'
12:00: 7.59 average comments per post
11:00: 7.38 average comments per post
02:00: 7.18 average comments per post
13:00: 7.15 average comments per post
05:00: 6.79 average comments per post

Top 5 Hours for 'ASK HN with comments'
15:00: 39.56 average comments per post
13:00: 22.22 average comments per post
12:00: 15.43 average comments per post
10:00: 13.74 average comments per post
17:00: 13.73 average comments per post

Top 5 Hours for 'Show HN with comments'
07:00: 12.39 average comments per post
12:00: 12.03 average comments per post

The data show that the most active time for 'Ask HN' posts is between 12:00 and 15:00. The first listing, where the 15:00 average is almost 29, covers every post containing the phrase 'Ask HN'. The second listing covers only posts that have at least one comment, and there the 15:00 average is about 38% higher (39.56 versus 28.64). Either way, 15:00 is the hour at which questions attract the most answers.

The best time to show something interesting on HN is just before 7 in the morning, because posts created at 7 and 8 a.m. get a lot of responses, probably because that is the beginning of the workday. The second good window is around lunchtime, between 12:00 and 14:00.

The most important thing I see in this data is the gap between the two averages for the 'other' category. Across all posts the highest hourly average is about 7.6, while for posts with at least one comment it is almost 30. This shows that a lot of 'other' posts get no comments at all. For the ask and show categories the gap is much smaller. This suggests that people are more willing to respond to 'ask' and 'show' posts, or that such posts are better prepared or have more interesting content.
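
The share of posts that received at least one comment can be computed directly from the counts printed earlier, which makes the difference between the categories concrete. A small sketch using those counts (the function name `share_with_comments` is just for illustration):

```python
# (posts with comments, all posts), taken from the counts printed earlier
counts = {
    'ask': (6918, 9147),
    'show': (5068, 10170),
    'other': (68415, 273802),
}

def share_with_comments(with_comments, total):
    """Percentage of posts that have at least one comment."""
    return 100 * with_comments / total

for name, (with_comments, total) in counts.items():
    print(f'{name}: {share_with_comments(with_comments, total):.1f}% of posts have comments')
```

On these counts roughly three quarters of ask posts, half of show posts, and a quarter of other posts have at least one comment.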
