<H1>Breaking Down Hacker News Posts</H1>

<H2>Goal:</H2>

Using a subset of about 20,000 of roughly 300,000 Hacker News submissions, we'll explore and compare two types of posts, Ask HN and Show HN, to answer the following questions:

<ol>
  <li>Do Ask HN or Show HN posts receive more comments on average?</li>
  <li>Do posts created at a certain time receive more comments on average?</li>
</ol> 

 
<H2>Background:</H2>

<a href="https://news.ycombinator.com/">Hacker News:</a> Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

<b>Source: Dataquest: Guided Project: Exploring Hacker News Posts</b>
 
 <H2>Definitions:</H2>
 
Ask HN: A post that users submit to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?"
Show HN: A post that users submit to show the Hacker News community a project, a product, or just something generally interesting.

<H2>Data Selection:</H2>

The dataset was reduced from roughly 300,000 rows to about 20,000 by excluding all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

The dataset can be found <a href="https://www.kaggle.com/hacker-news/hacker-news-posts">here</a>.
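
The reduction described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used to build the published sample: the function name `sample_commented_posts`, the seed, and the toy rows are all hypothetical.

```python
import random

def sample_commented_posts(rows, sample_size, comments_index=4, seed=0):
    """Keep rows with at least one comment, then draw a random sample.

    `rows` is a list of lists without the header row; `comments_index`
    points at the num_comments column (index 4 in this dataset).
    """
    commented = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Toy rows in the dataset's column order (hypothetical values).
toy_rows = [
    ['1', 'Ask HN: A?', '', '5', '3', 'alice', '1/1/2016 10:00'],
    ['2', 'Some link', 'http://example.com', '1', '0', 'bob', '1/1/2016 11:00'],
    ['3', 'Show HN: B', 'http://example.org', '9', '7', 'carol', '1/2/2016 12:00'],
]
sample = sample_commented_posts(toy_rows, sample_size=2)
print(len(sample))
```

Filtering before sampling means every row in the sample has at least one comment, which matches the description of the dataset.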



<H1>Data Cleaning Process Overview</H1>

<ol>
<li>Importing Data</li>
<li>Remove the Headers</li>
<li>Extracting Ask HN and Show HN Posts</li>
<li>Calculating the Average Number of Comments for Ask HN and Show HN Posts</li>
<li>Finding the Amount of Ask Posts and Comments by Hour Created</li>
<li>Calculating the Average Number of Comments for Ask HN Posts by Hour</li>
<li>Sorting and Printing Values from a List of Lists</li>
<li> Formatting</li>

</ol>

<H2> Importing Data</H2>

In this step, we'll import the sample of about 20,000 posts, each of which has at least one comment, and then print the first four rows (the header row plus three posts).

In [2]:
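from csv import reader

# Use a name other than `open` for the file handle so the built-in is not shadowed
opened_file = open('hacker_news.csv')
file = reader(opened_file)
hn = list(file)

# Print the first four rows (the header plus three posts)
print(hn[:4])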


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]


<H2> Remove the Headers</H2>

This step removes the header row so that only data rows are processed. Before removing it, I store the header in a headers variable to use as a reference. The columns are listed below for clarity.

<ul>
<li> index 0, id: The unique identifier from Hacker News for the post</li>
<li> index 1, title: The title of the post</li>
<li> index 2, url: The URL that the post links to, if the post has a URL</li>
<li> index 3, num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes</li>
<li>index 4, num_comments: The number of comments that were made on the post</li>
<li>index 5, author: The username of the person who submitted the post</li>
<li>index 6, created_at: The date and time at which the post was submitted</li>
</ul>

In [3]:
headers = hn[0]
hn=hn[1:]
print(headers)
print(hn[:1])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']]
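
As an optional aid, the stored header row can be turned into a name-to-index lookup so that columns can be read by name instead of by a hard-coded number. This is a small sketch; the dictionary name `col` is an assumption and is not used elsewhere in this notebook.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
col = {name: position for position, name in enumerate(headers)}

row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(row[col['num_comments']])  # value stored at index 4
print(row[col['created_at']])    # value stored at index 6
```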


<H2>Extracting Ask HN and Show HN Posts</H2>

Splitting the dataset into three categories (Ask HN, Show HN and other posts) makes it easier to get the total and average number of comments by hour. We check whether each lowercased title starts with the string 'ask hn' or 'show hn' and add the post to the matching list. To confirm that the code works, I will print the length of each list and check that the three lengths add up to the length of the full list.

In [4]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts),len(show_posts),len(other_posts), len(hn))

if (len(ask_posts)+len(show_posts)+len(other_posts))== len(hn):
    print("Data lists match total list")
else:
    print("Data lists do not match total list")
        
        

1744 1162 17194 20100
Data lists match total list
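
The prefix test used above can be isolated in a small helper to see how individual titles are classified. This is a sketch only; `classify_post` and the sample titles are hypothetical and not part of the original analysis.

```python
def classify_post(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: Best online course?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',  # contains 'ask hn' but does not start with it
]
labels = [classify_post(t) for t in titles]
print(labels)
```

Because the check uses startswith on the lowercased title, capitalization does not matter, and a title that merely mentions "ask hn" later in the text falls into the other category.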


<H2>Calculating the Average Number of Comments for Ask HN and Show HN Posts</H2>

Splitting the main list into three separate categories makes it easier to compute the average number of comments per post. I would also like to know the average points per post, so instead of rewriting the code for each case I will write an avg_per_post function that takes a list and a column index, which improves readability and lets us reuse the code.


In [5]:
# This function takes a list of posts and a column index and returns the average
# of that column; the column must hold numeric values, otherwise int() raises an error
def avg_per_post(a_posts, index):
    total = 0
    for post in a_posts:
        total += int(post[index])
    avg = total / len(a_posts)
    return avg

In [6]:
# Get the average number of comments per post for Ask HN posts
avg_ask_comments = avg_per_post(ask_posts, 4)
print(avg_ask_comments)
    
    

14.038417431192661


In [7]:
# Get the average number of comments per post for Show HN posts
avg_show_comments = avg_per_post(show_posts, 4)
print(avg_show_comments)

10.31669535283993


<H2>Finding the Amount of Ask Posts and Comments by Hour Created</H2>

Now that we can see that Ask HN posts receive more comments on average than Show HN posts, we can focus on the Ask HN posts and ask how many comments they receive by hour of creation, and at what time of day posts attract the most comments.

Passing index 6, the created_at column, to the datetime module lets us find the hour in which each post was created.


In [8]:
from datetime import datetime as dt

result_list = []
# first getting the date submitted and number of comments per post
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

comments_by_hour = {}
counts_by_hour = {}

for row in result_list:
    date = row[0]
    hour = dt.strptime(date, "%m/%d/%Y %H:%M")
    hour=hour.strftime("%H")
    if hour in counts_by_hour:
        comments_by_hour[hour] += row[1]
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = row[1]
        counts_by_hour[hour] = 1

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
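
The same tally can be written without the if/else branch by using collections.defaultdict, which starts every missing key at zero. This is an alternative sketch on toy data, not the cell that produced the output above; the `toy_` names and values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Each entry is [created_at, num_comments] (hypothetical values).
toy_posts = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 4],
]

toy_comments_by_hour = defaultdict(int)
toy_counts_by_hour = defaultdict(int)

for created_at, num_comments in toy_posts:
    # Parse the timestamp, then keep only the zero-padded hour as the key.
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    toy_comments_by_hour[hour] += num_comments
    toy_counts_by_hour[hour] += 1

print(dict(toy_comments_by_hour))
print(dict(toy_counts_by_hour))
```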

<H2>Calculating the Average Number of Comments for Ask HN Posts by Hour</H2>

Using the hour as the key for both the comments_by_hour and counts_by_hour dictionaries, we can compute the average number of comments per post for each hour: comments_by_hour supplies the numerator and counts_by_hour the denominator.


In [9]:
avg_by_hour = []

# Round each average to two decimals for readability
for hr in comments_by_hour:
    avg_by_hour.append([hr, round(comments_by_hour[hr] / counts_by_hour[hr], 2)])

avg_by_hour


[['04', 7.17],
 ['21', 16.01],
 ['08', 10.25],
 ['03', 7.8],
 ['17', 11.46],
 ['13', 14.74],
 ['19', 10.8],
 ['02', 23.81],
 ['09', 5.58],
 ['18', 13.2],
 ['05', 10.09],
 ['23', 7.99],
 ['10', 13.44],
 ['16', 16.8],
 ['15', 38.59],
 ['12', 9.41],
 ['20', 21.52],
 ['06', 9.02],
 ['00', 8.13],
 ['07', 7.85],
 ['01', 11.38],
 ['11', 11.05],
 ['22', 6.75],
 ['14', 13.23]]
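
The per-hour division can also be expressed as a single list comprehension. The sketch below uses small made-up dictionaries (the `toy_` names and values are assumptions), so the numbers differ from the output above.

```python
# Hypothetical totals: comments received and posts created per hour.
toy_comments_by_hour = {'11': 56, '19': 10}
toy_counts_by_hour = {'11': 2, '19': 1}

# For each hour: total comments divided by number of posts, rounded to 2 decimals.
toy_avg_by_hour = [
    [hour, round(toy_comments_by_hour[hour] / toy_counts_by_hour[hour], 2)]
    for hour in toy_comments_by_hour
]
print(toy_avg_by_hour)
```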

<h2>Sorting and Printing Values from a List of Lists</h2>

The printed result is out of order, and sorting the list would give us better insight into which hours attract the most comments. Unfortunately, we cannot call sorted() on the list directly: it compares the first element of each inner list, the hour, so the list ends up ordered by hour rather than by average, as the example below shows.


In [10]:
sort_test=sorted(avg_by_hour, reverse=True)
sort_test

[['23', 7.99],
 ['22', 6.75],
 ['21', 16.01],
 ['20', 21.52],
 ['19', 10.8],
 ['18', 13.2],
 ['17', 11.46],
 ['16', 16.8],
 ['15', 38.59],
 ['14', 13.23],
 ['13', 14.74],
 ['12', 9.41],
 ['11', 11.05],
 ['10', 13.44],
 ['09', 5.58],
 ['08', 10.25],
 ['07', 7.85],
 ['06', 9.02],
 ['05', 10.09],
 ['04', 7.17],
 ['03', 7.8],
 ['02', 23.81],
 ['01', 11.38],
 ['00', 8.13]]

To solve this, we can swap the elements so that the average is at index 0 and the hour is at index 1, which lets sorted() order the list by average and gives the desired result.

In [11]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[38.59, '15'],
 [23.81, '02'],
 [21.52, '20'],
 [16.8, '16'],
 [16.01, '21'],
 [14.74, '13'],
 [13.44, '10'],
 [13.23, '14'],
 [13.2, '18'],
 [11.46, '17'],
 [11.38, '01'],
 [11.05, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.09, '05'],
 [9.41, '12'],
 [9.02, '06'],
 [8.13, '00'],
 [7.99, '23'],
 [7.85, '07'],
 [7.8, '03'],
 [7.17, '04'],
 [6.75, '22'],
 [5.58, '09']]
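
An alternative to swapping the columns is to pass a key function to sorted(), which tells it which element of each inner list to compare. The sketch below uses a handful of the values from the output above; the variable name `sorted_by_avg` is an assumption.

```python
# A few [hour, average] pairs taken from the avg_by_hour output above.
avg_by_hour_sample = [
    ['04', 7.17],
    ['21', 16.01],
    ['15', 38.59],
    ['02', 23.81],
    ['20', 21.52],
]

# Sort on the second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg[:3])
```

This keeps each pair in its original [hour, average] order, so later code would unpack it as hour first and average second.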

<h2> Formatting</h2>

The last step in our data cleaning process is to format the results in a readable way.



In [12]:
print("Top 5 Hours for 'Ask HN' Comments")
temp="{} : {:.2f} average comments per post"
for avg, hr in sorted_swap[:5]:
    print(temp.format(dt.strptime(hr, "%H").strftime("%H:%M"),avg) )

Top 5 Hours for 'Ask HN' Comments
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post


<h2>Conclusion:</h2>

We see that Ask HN posts receive more comments on average, about 14 per post compared with about 10 per post for Show HN. Given the size of each list (1,744 Ask HN posts and 1,162 Show HN posts), the Ask HN posts generated roughly 24,000 comments in total, while the Show HN posts generated roughly 12,000. Given these figures, if we want to maximize interaction among users, we should frame our content as a question. As for when to post, the data suggests posting between 15:00 and 16:00 (3:00 p.m. and 4:00 p.m.), as this combined time span covers about 48% of the time when the highest average comment counts are generated.
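
The total comment counts can be recovered from figures printed earlier in the notebook by multiplying each list's length by its average comments per post. The sketch below only reuses those printed values; the variable names are assumptions.

```python
# List lengths and averages as printed earlier in the notebook.
ask_count, ask_avg = 1744, 14.038417431192661
show_count, show_avg = 1162, 10.31669535283993

# total comments = number of posts * average comments per post
ask_total = round(ask_count * ask_avg)
show_total = round(show_count * show_avg)

print(ask_total, show_total)
```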


In [13]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.feature_extraction import text
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline



In [17]:
# select only titles for ask_hn
title=[]
for row in ask_posts:
    title.append(row[1])
len(title)

1744

<H1>Extra:</H1>
I was curious to see whether there were any buzzwords among the ask_posts titles, so I used scikit-learn's k-means clustering, which groups titles by the words they share and can serve as a first step toward classification.


In [29]:
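X = title
pip = Pipeline([('tf', TfidfVectorizer(stop_words='english'))])

# toarray() returns a plain ndarray; todense() returns np.matrix,
# which newer scikit-learn releases do not accept
X_df = pip.fit_transform(X).toarray()
pca = PCA(n_components=3).fit(X_df)
data2d = pca.transform(X_df)

true_k = 10
model = KMeans(n_clusters=true_k)
model.fit(X_df)
centroids = model.cluster_centers_

cen_2d = pca.transform(centroids)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = pip.named_steps['tf'].get_feature_names_out()
# Order each centroid's term weights from highest to lowest
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# Print the ten highest-weighted terms for each cluster
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])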



Cluster 0:
 learning
 machine
 resources
 hn
 ask
 online
 peer
 swift
 chinese
 ios
Cluster 1:
 use
 app
 hn
 ask
 web
 framework
 payment
 android
 tracking
 data
Cluster 2:
 programming
 business
 start
 ask
 hn
 language
 learn
 languages
 idea
 today
Cluster 3:
 does
 hn
 ask
 company
 use
 passwords
 hate
 matter
 job
 manage
Cluster 4:
 job
 hn
 ask
 leave
 new
 passion
 salary
 work
 quit
 money
Cluster 5:
 best
 learn
 hn
 ask
 way
 place
 modern
 book
 2016
 code
Cluster 6:
 project
 working
 worth
 hn
 ask
 open
 ideas
 source
 idea
 management
Cluster 7:
 work
 hn
 ask
 ideas
 does
 hours
 day
 project
 app
 did
Cluster 8:
 startup
 good
 ask
 hn
 idea
 advice
 joining
 need
 man
 family
Cluster 9:
 hn
 ask
 software
 using
 did
 web
 new
 source
 open
 2016


I was not able to identify buzzwords, but I was able to get a sense of the general themes of the questions being asked, such as programming languages and life-management questions.
