## HackerNews Posts Analysis
Data from https://news.ycombinator.com/

This analysis uses a few modules from the Python standard library (`csv`, `datetime`) rather than easier-to-use libraries such as pandas or numpy. It is a step up from the previous project, which relied only on Python's built-in functions and methods.

**Goals**:
1. Determine whether show or ask posts receive more comments
2. Determine which hours are likely to attract more comments
3. Determine whether show or ask posts receive more points
4. Determine which hours are likely to attract more points


### 1. Import the data

In [1]:
import csv

#the with block closes the file once all rows have been read
with open("hacker_news.csv") as f:
    readfile = csv.reader(f)
    hn = list(readfile)

#First five rows of the data (a list of lists), header row included
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

### 2. Remove headers

In [2]:
headers = hn[0]
headers

#overwriting the hn list, hence I have to use [0:5] to display the first five rows of the new hn
hn = hn[1:] 
hn[0:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
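
The two steps above (read every row, then split off the header) can also be done in one pass by pulling the first row out of the reader with `next()`. The following is only a sketch: `read_posts` is a hypothetical helper name, and the in-memory sample stands in for `hacker_news.csv`, with its two data rows copied from the output above.

```python
import csv
import io

# Stand-in for hacker_news.csv so the sketch runs on its own;
# the two data rows are copied from the output above.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

def read_posts(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    reader = csv.reader(file_obj)
    headers = next(reader)   # the first row holds the column names
    rows = list(reader)      # the remaining rows are the posts
    return headers, rows

headers, hn = read_posts(sample)
print(headers)
print(len(hn))

# With the real file, the same helper could be called as:
# with open("hacker_news.csv", newline="") as f:
#     headers, hn = read_posts(f)
```

Every value is still a string at this point, so numeric columns such as `num_comments` need `int()` before any arithmetic.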

### 3. Extracting Ask HN and Show HN posts

In [3]:
#startswith for string check
ask_posts = []
show_posts = []
other_posts = []

for each in hn:
    title = each[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(each)
    elif title.startswith('show hn'):
        show_posts.append(each)
    else:
        other_posts.append(each)

print("number of posts with ask hn:",len(ask_posts))        

print("number of posts with show hn:",len(show_posts))

print("number of other posts:",len(other_posts))


number of posts with ask hn: 1744
number of posts with show hn: 1162
number of other posts: 17194


In [4]:
show_posts[0:4]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11']]
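
The prefix check used above can be isolated in a small function, which makes it easy to try on individual titles. This is a sketch only: `classify` is a hypothetical name, and the third title is made up for illustration (the first two come from the outputs above).

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()          # make the comparison case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

titles = [
    "Show HN: Something pointless I made",   # from the show_posts output
    "Interactive Dynamic Video",             # from the dataset preview
    "Ask HN: Example question?",             # made-up title for illustration
]

for t in titles:
    print(classify(t), "-", t)
```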

### 4. Calculating average number of comments

In [5]:
total_ask_comments = 0
total_show_comments = 0
#avg_ask_comments = 0

#type conversion is a mandatory check before any calculation: int() for integers, str() for strings
for each in ask_posts:
    acomments = int(each[4])
    total_ask_comments =  total_ask_comments + acomments
    
avg_ask_comments = (total_ask_comments/len(ask_posts))
print(avg_ask_comments)
                                                   
for each in show_posts:
    scomments = int(each[4])
    total_show_comments =  total_show_comments + scomments
    
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

if avg_show_comments > avg_ask_comments:
    print("Number of average comments for a show post is greater than that of an ask post")
else:
    print("Number of average comments for a show post is less than that of an ask post")



14.038417431192661
10.31669535283993
Number of average comments for a show post is less than that of an ask post


**Finding:** On average, an ask post receives more comments than a show post (about 14.04 versus 10.32).
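
The two loops in this section differ only in the list they walk over, so the calculation could be wrapped in one helper. A minimal sketch, assuming the same row layout (index 4 is `num_comments`); `average_column` is a hypothetical name and the two rows are copied from the `show_posts` output earlier.

```python
def average_column(rows, index):
    """Average the integer values found at position `index` of each row."""
    if not rows:
        return 0.0   # avoid dividing by zero on an empty list
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)

# Two show posts copied from the earlier output (index 4 is num_comments).
sample_show = [
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/',
     '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm',
     'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
]

print(average_column(sample_show, 4))   # (102 + 1) / 2
```

The same call with index 3 would average the points column instead.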

### 5. Ask posts comments analysis - by hour

#### Two element list from ask_posts

In [6]:
import datetime as dt
result_list =[]

for each in ask_posts:
    sublist = []
    createdtime = each[6]
    comments = int(each[4])
    sublist.append(createdtime)
    sublist.append(comments)
    result_list.append(sublist)
       
result_list[0:4]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3]]

The above step helps us focus on the elements required to get the frequency table.

#### Frequency table creation using the sublist (date, comments)

In [7]:
counts_by_hour = {}
comments_by_hour = {}

for each in result_list:
    dtobj = each[0]
    dtparsed = dt.datetime.strptime(dtobj, "%m/%d/%Y %H:%M")
    hr = dt.datetime.strftime(dtparsed,"%H")
    #print(hr)
    
    if hr not in counts_by_hour:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = int(each[1])
    else:
        counts_by_hour[hr] = counts_by_hour[hr] + 1
        comments_by_hour[hr] = comments_by_hour[hr] + int(each[1])

print(comments_by_hour)
print(counts_by_hour)

{'11': 641, '21': 1745, '03': 421, '22': 479, '18': 1439, '09': 251, '12': 687, '08': 492, '05': 464, '16': 1814, '02': 1381, '17': 1146, '15': 4477, '23': 543, '20': 1722, '06': 397, '07': 267, '04': 337, '14': 1416, '19': 1188, '13': 1253, '01': 683, '00': 447, '10': 793}
{'11': 58, '21': 109, '03': 54, '22': 71, '18': 109, '09': 45, '12': 73, '08': 48, '05': 46, '16': 108, '02': 58, '17': 100, '15': 116, '23': 68, '20': 80, '06': 44, '07': 34, '04': 47, '14': 107, '19': 110, '13': 85, '01': 60, '00': 55, '10': 59}
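
The `if hr not in ...` / `else` branches can be folded into a single line per dictionary with `dict.get`, which returns a default when the key is missing. A sketch of that variant, using the four `[created_at, comments]` pairs shown earlier; the names `counts` and `totals` are illustrative only.

```python
import datetime as dt

# The four [created_at, comments] pairs shown earlier in this section.
result_sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
]

counts = {}
totals = {}
for created_at, n_comments in result_sample:
    # Parse the timestamp, then keep only the zero-padded hour as a string.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # dict.get supplies 0 the first time an hour is seen.
    counts[hour] = counts.get(hour, 0) + 1
    totals[hour] = totals.get(hour, 0) + n_comments

print(counts)
print(totals)
```

With only four rows each hour appears once; on the full list the same loop accumulates the per-hour totals.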


### 6. Average number of comments in an hour

In [8]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour    
#1. lists can also be created from dictionaries
#2. Calculations can be done on the values of dictionaries

[['11', 11.051724137931034],
 ['21', 16.009174311926607],
 ['03', 7.796296296296297],
 ['22', 6.746478873239437],
 ['18', 13.20183486238532],
 ['09', 5.5777777777777775],
 ['12', 9.41095890410959],
 ['08', 10.25],
 ['05', 10.08695652173913],
 ['16', 16.796296296296298],
 ['02', 23.810344827586206],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['23', 7.985294117647059],
 ['20', 21.525],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['04', 7.170212765957447],
 ['14', 13.233644859813085],
 ['19', 10.8],
 ['13', 14.741176470588234],
 ['01', 11.383333333333333],
 ['00', 8.127272727272727],
 ['10', 13.440677966101696]]
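
The per-hour average is simply the total for that hour divided by the number of posts in that hour. A short sketch of the same arithmetic as a dictionary comprehension, using three hours copied from the frequency tables above; the `*_sample` names are illustrative.

```python
# Totals and post counts for three hours, copied from the frequency tables above.
comments_sample = {'15': 4477, '02': 1381, '20': 1722}
counts_sample = {'15': 116, '02': 58, '20': 80}

# average = total comments in the hour / number of ask posts in the hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in comments_sample
}

for hour in sorted(avg_sample):   # iterate in hour order: '02', '15', '20'
    print(hour, avg_sample[hour])
```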

### 7. Sorting the final list

#### Swapping the sublist elements

In [9]:
swap_avg_by_hour = []
for each in avg_by_hour:
    #print(key)
    #creating a sublist and appending it
    sublist = []
    sublist.append(each[1])
    sublist.append(each[0])
    swap_avg_by_hour.append(sublist)
print(swap_avg_by_hour)

[[11.051724137931034, '11'], [16.009174311926607, '21'], [7.796296296296297, '03'], [6.746478873239437, '22'], [13.20183486238532, '18'], [5.5777777777777775, '09'], [9.41095890410959, '12'], [10.25, '08'], [10.08695652173913, '05'], [16.796296296296298, '16'], [23.810344827586206, '02'], [11.46, '17'], [38.5948275862069, '15'], [7.985294117647059, '23'], [21.525, '20'], [9.022727272727273, '06'], [7.852941176470588, '07'], [7.170212765957447, '04'], [13.233644859813085, '14'], [10.8, '19'], [14.741176470588234, '13'], [11.383333333333333, '01'], [8.127272727272727, '00'], [13.440677966101696, '10']]
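
Swapping is needed here because `sorted` compares sublists element by element, starting with the first. An alternative that leaves the `[hour, average]` pairs untouched is to pass a `key` function to `sorted`. A minimal sketch, using four pairs copied from `avg_by_hour` above:

```python
# Four [hour, average] pairs copied from avg_by_hour above.
pairs = [
    ['11', 11.051724137931034],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key tells sorted() to compare by the second element (the average).
top = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```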


#### Top 5 hours for ask posts comments

In [10]:
#sorted() compares the sublists by their first element (the average); reverse=True puts the largest first
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 hours for Ask Posts Comments")

for i in range(0,5):
    hrobj = dt.datetime.strptime(sorted_swap[i][1],"%H") #parse the hour string into a datetime object
    hrobj_string = dt.datetime.strftime(hrobj, "%H:%M") #format it back as HH:MM for display
    avg = sorted_swap[i][0]
    template = '''
    {}: {:.2f} average comments per post    
    '''
    print(template.format(hrobj_string,avg))

Top 5 hours for Ask Posts Comments

    15:00: 38.59 average comments per post    
    

    02:00: 23.81 average comments per post    
    

    20:00: 21.52 average comments per post    
    

    16:00: 16.80 average comments per post    
    

    21:00: 16.01 average comments per post    
    


**Finding**:

The above times are in EST (Eastern Standard Time). IST is 10 hours 30 minutes ahead of EST, so to get more comments I (in IST) should create ask posts during **01:30 - 02:30, 12:30 - 13:30, 06:30 - 07:30, 02:30 - 03:30, 07:30 - 08:30**, in that order of preference. These are mostly early-morning hours for a resident of India, with one exception in the afternoon (12:30 - 13:30).
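
The conversion behind these slots can be checked with `datetime`. The sketch below assumes fixed offsets (EST = UTC-5, IST = UTC+5:30) and ignores daylight saving time; `est_hour_to_ist` is a hypothetical helper name and the date used is arbitrary.

```python
import datetime as dt

# Fixed offsets: EST is UTC-5 and IST is UTC+5:30 (no daylight saving applied).
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), "IST")

def est_hour_to_ist(hour):
    """Convert an hour of the day in EST to the matching HH:MM in IST."""
    # The date is arbitrary; only the time of day matters here.
    moment = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return moment.astimezone(IST).strftime("%H:%M")

for hour in [15, 2, 20, 16, 21]:   # top five hours from the output above
    print("{:02d}:00 EST -> {} IST".format(hour, est_hour_to_ist(hour)))
```

If the timestamps follow US Eastern time with daylight saving, the gap shrinks to 9 hours 30 minutes for part of the year, so the slots would shift by an hour for those posts.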

### Average number of points received by Ask and Show posts

In [30]:
#ask_posts[0:5]
apoints = 0
spoints = 0

for each in ask_posts:
    apoints = apoints + int(each[3])

print("Total Number of points received by all ask post:", apoints, "\n"
      "Avg. number of points per ask post:", (apoints/len(ask_posts)),"\n")

for each in show_posts:
    spoints = spoints + int(each[3])

print("Total Number of points received by all show post:", spoints,"\n"
      "Avg. number of points per show post:", (spoints/len(show_posts)),"\n")

if (apoints/len(ask_posts)) > (spoints/len(show_posts)):
    print("Number of points received on average by an ask post is greater than that of a show post")
else:
    print("Number of points received on average by a show post is greater than that of an ask post")

Total Number of points received by all ask post: 26268 
Avg. number of points per ask post: 15.061926605504587 

Total Number of points received by all show post: 32019 
Avg. number of points per show post: 27.555077452667813 

Number of points received on average by a show post is greater than that of an ask post


**Finding:** On average, a show post receives more points than an ask post (about 27.56 versus 15.06).

### Finding out if show posts at certain times receive more points

Since the average number of points per post is greater for show posts, the rest of the analysis looks at which hours are likely to bring show posts more points.


In [32]:
show_posts[0:3]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05']]

In [37]:
#import datetime as dt
result_list_points_date = []

for each in show_posts:
    sublist = []
    createdtime = each[6]
    points = int(each[3])
    sublist.append(createdtime)
    sublist.append(points)
    result_list_points_date.append(sublist)
       
result_list_points_date[0:4]

[['11/25/2015 14:03', 26],
 ['11/29/2015 22:46', 747],
 ['4/28/2016 18:05', 1],
 ['7/28/2016 7:11', 3]]

The above step helps us focus on the elements required to get the frequency table of **points by hour**.

#### Frequency table creation using the sublist (date, points for each post)

In [39]:
show_counts_by_hour = {}
points_by_hour = {}

for each in result_list_points_date:
    dtobj = each[0]
    dtparsed = dt.datetime.strptime(dtobj, "%m/%d/%Y %H:%M")
    hr = dt.datetime.strftime(dtparsed,"%H")
    #print(hr)
    
    if hr not in show_counts_by_hour:
        show_counts_by_hour[hr] = 1
        points_by_hour[hr] = int(each[1])
    else:
        show_counts_by_hour[hr] = show_counts_by_hour[hr] + 1
        points_by_hour[hr] = points_by_hour[hr] + int(each[1])

print(points_by_hour)
print(show_counts_by_hour)

{'21': 866, '03': 679, '22': 1856, '18': 2215, '23': 1526, '12': 2543, '08': 519, '05': 104, '16': 2634, '02': 340, '11': 1480, '15': 2228, '09': 553, '20': 1819, '06': 375, '07': 494, '00': 1173, '04': 386, '14': 2187, '19': 1702, '13': 2438, '01': 700, '17': 2521, '10': 681}
{'21': 47, '03': 27, '22': 46, '18': 61, '23': 36, '12': 61, '08': 34, '05': 19, '16': 93, '02': 30, '11': 44, '15': 78, '09': 30, '20': 60, '06': 16, '07': 26, '00': 31, '04': 26, '14': 86, '19': 55, '13': 99, '01': 28, '17': 93, '10': 36}
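
The comment analysis for ask posts and the points analysis for show posts follow the same steps, so both could share one helper that takes the column index as a parameter. A sketch under that assumption: `average_by_hour` is a hypothetical name, `date_index=6` assumes `created_at` is the last column as in this dataset, and the three rows are copied from the `show_posts` output.

```python
import datetime as dt

def average_by_hour(posts, value_index, date_index=6):
    """Return {hour: average of the value column} for the given posts."""
    totals = {}
    counts = {}
    for post in posts:
        parsed = dt.datetime.strptime(post[date_index], "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        totals[hour] = totals.get(hour, 0) + int(post[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Three show posts copied from the earlier output (index 3 = points, 4 = comments).
sample_show = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/',
     '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm',
     'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
]

print(average_by_hour(sample_show, 3))   # average points per hour
print(average_by_hour(sample_show, 4))   # average comments per hour
```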


#### Average number of points for show post in any hour

In [40]:
show_avg_by_hour = []

for hour in points_by_hour:
    show_avg_by_hour.append([hour,points_by_hour[hour]/show_counts_by_hour[hour]])
    
show_avg_by_hour    
#1. lists can also be created from dictionaries
#2. Calculations can be done on the values of dictionaries

[['21', 18.425531914893618],
 ['03', 25.14814814814815],
 ['22', 40.34782608695652],
 ['18', 36.31147540983606],
 ['23', 42.388888888888886],
 ['12', 41.68852459016394],
 ['08', 15.264705882352942],
 ['05', 5.473684210526316],
 ['16', 28.322580645161292],
 ['02', 11.333333333333334],
 ['11', 33.63636363636363],
 ['15', 28.564102564102566],
 ['09', 18.433333333333334],
 ['20', 30.316666666666666],
 ['06', 23.4375],
 ['07', 19.0],
 ['00', 37.83870967741935],
 ['04', 14.846153846153847],
 ['14', 25.430232558139537],
 ['19', 30.945454545454545],
 ['13', 24.626262626262626],
 ['01', 25.0],
 ['17', 27.107526881720432],
 ['10', 18.916666666666668]]

#### Sorting the hours by average points for show posts

##### Swapping the sublist elements of show post 'by hour' categorization

In [41]:
swap_showavg_by_hour = []
for each in show_avg_by_hour:
    #print(key)
    #creating a sublist and appending it
    sublist = []
    sublist.append(each[1])
    sublist.append(each[0])
    swap_showavg_by_hour.append(sublist)
print(swap_showavg_by_hour)

[[18.425531914893618, '21'], [25.14814814814815, '03'], [40.34782608695652, '22'], [36.31147540983606, '18'], [42.388888888888886, '23'], [41.68852459016394, '12'], [15.264705882352942, '08'], [5.473684210526316, '05'], [28.322580645161292, '16'], [11.333333333333334, '02'], [33.63636363636363, '11'], [28.564102564102566, '15'], [18.433333333333334, '09'], [30.316666666666666, '20'], [23.4375, '06'], [19.0, '07'], [37.83870967741935, '00'], [14.846153846153847, '04'], [25.430232558139537, '14'], [30.945454545454545, '19'], [24.626262626262626, '13'], [25.0, '01'], [27.107526881720432, '17'], [18.916666666666668, '10']]


##### Top 5 hours for show posts points

In [43]:
#sorted() compares the sublists by their first element (the average); reverse=True puts the largest first
show_sorted_swap = sorted(swap_showavg_by_hour, reverse = True)
print("Top 5 hours for Show Posts Points")

for i in range(0,5):
    hrobj = dt.datetime.strptime(show_sorted_swap[i][1],"%H") #parse the hour string into a datetime object
    hrobj_string = dt.datetime.strftime(hrobj, "%H:%M") #format it back as HH:MM for display
    show_avg = show_sorted_swap[i][0]
    template = '''
    {}: {:.2f} average points per post    
    '''
    print(template.format(hrobj_string,show_avg))

Top 5 hours for Show Posts Points

    23:00: 42.39 average points per post    
    

    12:00: 41.69 average points per post    
    

    22:00: 40.35 average points per post    
    

    00:00: 37.84 average points per post    
    

    18:00: 36.31 average points per post    
    


**Finding**: 
The above times are in EST (Eastern Standard Time). To get more points, I (in IST) can try to create show posts during **09:30 - 10:30, 22:30 - 23:30, 08:30 - 09:30, 10:30 - 11:30, 04:30 - 05:30**, in that order of preference. Three of these five slots are morning work hours for a resident of India. Unlike the comments per hour in the earlier analysis of ask posts, the average points for the top hours are very close in magnitude (42.39 down to 36.31), which is a reason to dig further into the data. The location of the people who respond, whether with comments or with points (votes), could give more insight into when to actually schedule a post.