# Hacker News


![](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

In this project, we'll work with a dataset of submissions to the popular technology site **Hacker News**.

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

### Data dictionary

- `id`: the unique identifier from Hacker News for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if the post has a URL
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time of the post's submission

***

We're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. 

Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the community a project or product they have made.

<br>

We'll compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### 1. Load `hacker_news.csv` dataset

In [1]:
from csv import reader

In [2]:
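# Path to the author's local copy of the dataset; adjust it for your environment
with open('/home/ion/Formacion/git_repo_klone/albertjimrod/Python/Exploring_Hacker/dataset/hacker_news.csv', encoding = 'utf-8') as dataset:
    hn = list(reader(dataset)) # read every row (header included) into a list of lists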

### 2. Removing headers 

In [3]:
header = hn[0] # Copy the first row, which holds the column names, into `header`

In [4]:
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [5]:
del hn[0] # Delete the header row so that only data rows remain

In [6]:
hn[0:5] # Preview the first five data rows

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
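
As an alternative to copying and then deleting the first row, the header can be split off while the file is being read. The sketch below is illustrative only: it reads a small in-memory sample (the `sample` string and the names `sample_header` and `sample_hn` are made up here, reusing two of the rows shown above) instead of the real file.

```python
import io
from csv import reader

# Hypothetical in-memory sample standing in for hacker_news.csv
sample = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

with io.StringIO(sample) as f:
    rows = reader(f)
    sample_header = next(rows)   # first row: the column names
    sample_hn = list(rows)       # remaining rows: data only

print(sample_header)
print(len(sample_hn))
```

With this approach the data list never contains the header row, so no `del` step is needed.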

### 3. Finding posts starting with `Ask HN` and `Show HN`
 
Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

In [7]:
ask_posts = []
show_posts = []
other_posts = []

for fila in hn:
    title = fila[1]         # title column
    title = title.lower()   # lower-case it so the prefix match is case-insensitive
    
    if title.startswith('ask hn'):
        ask_posts.append(fila)
        
    elif title.startswith('show hn'):
        show_posts.append(fila)
        
    else:
        other_posts.append(fila)

print(f" ask posts number {len(ask_posts)} \n show post number {len(show_posts)} \n other number     {len(other_posts)} ")

 ask posts number 1744 
 show post number 1162 
 other number     17194 
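
The prefix test used in the loop above can also be isolated in a small helper, which makes it easy to check on individual titles. This is a minimal sketch; `classify_title` is a hypothetical name that does not appear in the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()              # case-insensitive comparison
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('SHOW HN: Something pointless I made'))          # show
print(classify_title('Interactive Dynamic Video'))                    # other
```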


### 4. Calculating the average number of comments for `Ask HN` and `Show HN` Posts

Let's determine whether ask posts or show posts receive more comments on average.

In [8]:
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [9]:
type(hn[0][4]) # check the type of the num_comments value in the first data row

str
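
Because the CSV reader returns every field as a string, numeric and date columns have to be converted before they can be used in calculations. The following sketch shows one way to do that for a single row; `parse_row` is a hypothetical helper, and the sample row is the first one shown earlier.

```python
import datetime as dt

def parse_row(row):
    """Convert the numeric and date fields of one raw row to Python types."""
    return {
        'id': row[0],
        'title': row[1],
        'url': row[2],
        'num_points': int(row[3]),
        'num_comments': int(row[4]),
        'author': row[5],
        'created_at': dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M"),
    }

raw = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']
post = parse_row(raw)
print(post['num_comments'] + 1)   # arithmetic now works: 53
print(post['created_at'].hour)    # 11
```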

In [10]:
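def avg_comments(data_list):
    """Print the average number of comments per post in `data_list`."""
    total_comments = 0
    total_length = len(data_list)
    
    for row in data_list:
        total_comments += int(row[4]) # num_comments column
    
    # Average comments per post = total comments / number of posts
    average = total_comments / total_length
    print(f"Average: {average:.2f} comments per post")

In [11]:
avg_comments(ask_posts)

Average: 14.04 comments per post


In [12]:
avg_comments(show_posts)

Average: 10.32 comments per post

On average, `Ask HN` posts receive more comments (about 14 per post) than `Show HN` posts (about 10 per post).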
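
The average number of comments per post is the total number of comments divided by the number of posts. A minimal sketch of that calculation, which returns the value instead of printing it, is shown below; `average_comments` and the toy rows are hypothetical and only the `num_comments` column (index 4) matters.

```python
def average_comments(rows):
    """Return the mean of the num_comments column (index 4), or 0.0 if empty."""
    if not rows:
        return 0.0
    total = sum(int(row[4]) for row in rows)
    return total / len(rows)

# Hypothetical rows in the same column order as the dataset
toy_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: b', '', '5', '2', 'user_b', '1/1/2016 10:00'],
    ['3', 'Ask HN: c', '', '1', '0', 'user_c', '1/1/2016 11:00'],
]
print(average_comments(toy_posts))  # (10 + 2 + 0) / 3 = 4.0
```

Returning the number makes it possible to compare the two post types directly in code rather than by reading printed output.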


In [13]:
header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

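### 5. Titles of the most commented `Ask` and `Show` posts

Below we list the titles of heavily commented posts together with their number of comments. Note that `max_comments` records a post only when it has more comments than every post before it in the list, so the output is not a complete ranking of the most commented posts.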

In [14]:
def max_comments(posts):
    """
    Print the posts that set a new running maximum of comments, highest first.

    Args:
        posts: List of rows where each row represents a post and contains at least
              5 elements; the fifth element (index 4) is the number of comments.

    Returns:
        None. The function prints its results to stdout.

    Notes:
        A post is recorded only if it has more comments than every post before it
        in `posts`, so the output is not a full top-N ranking. The formatted lines
        are also collected in the local list `max_comments_posts`; to get them
        without printing, change the function to return that list.
    """
    max_comments_posts = []
    most_important_questions = {}
    max_comments = 0
    
    for comments in posts:
        num_comments = int(comments[4]) # num_comments column
        if max_comments < num_comments:
            # This post beats the running maximum: update it and remember the title,
            # using the comment count as the dictionary key
            max_comments = num_comments
            most_important_questions[num_comments] = comments[1]

    most_voted = sorted(most_important_questions.keys(), reverse=True) # comment counts, highest first
    # Print the results
    
    print("-- Score --            -- Tittle -- ")

    for score in most_voted:
        texto = "    {points}     {relevance}".format(points = score,
                                                      relevance = most_important_questions[score])
        print(texto)
        max_comments_posts.append(texto)

In [15]:
max_comments(ask_posts)

-- Score --            -- Tittle -- 
    947     Ask HN: Who is hiring? (August 2016)
    910     Ask HN: Who is hiring? (September 2016)
    266     Ask HN: What are the must-read books about economics/finance?
    250     Ask HN: Who wants to be hired? (June 2016)
    234     Ask HN: What are you currently building?
    182     Ask HN: What is your go-to example for a good REST API?
    37     Ask HN: Things you created in 2015?
    33     Ask HN: teaching basic coding and web design offline, solely via iOS devices?
    29     Ask HN: Am I the only one outraged by Twitter shutting down share counts?
    6     Ask HN: How to improve my personal website?


In [16]:
max_comments(show_posts)

-- Score --            -- Tittle -- 
    306     Show HN: BitKeeper  Enterprise-ready version control, now open-source
    168     Show HN: Nodal. Next-Generation Node.js Server and Framework
    134     Show HN: Download any song without knowing its name
    102     Show HN: Something pointless I made
    22     Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
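
For comparison, a complete ranking by number of comments can be obtained by sorting the rows on the `num_comments` column. The sketch below is one possible approach, not the notebook's method; `top_commented` and the toy rows are hypothetical.

```python
def top_commented(posts, n=5):
    """Return the n posts with the most comments as (num_comments, title) tuples."""
    ranked = sorted(posts, key=lambda row: int(row[4]), reverse=True)
    return [(int(row[4]), row[1]) for row in ranked[:n]]

# Hypothetical rows in the same column order as the dataset
toy_posts = [
    ['1', 'Show HN: first', '', '4', '12', 'user_a', '1/1/2016 9:00'],
    ['2', 'Show HN: second', '', '9', '40', 'user_b', '1/2/2016 9:00'],
    ['3', 'Show HN: third', '', '2', '25', 'user_c', '1/3/2016 9:00'],
]
for count, title in top_commented(toy_posts, n=2):
    print(count, title)
```

Here the third post (25 comments) is included even though it appears after a post with more comments, which a running-maximum scan would skip.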


### 6. Finding the number of `Ask Posts` and comments by hour created.

In [17]:
import datetime as dt
import matplotlib.pyplot as plt

In [18]:
def comments_byhour(posts, output = 0):
    """
    Compute the average number of comments per post for each hour of the day.

    The function reads the creation time (`created_at`) and the number of comments
    of every post, then uses two dictionaries to accumulate, for each hour, the
    number of posts and the total number of comments. From these it derives the
    average number of comments per post for each hour.

    Args:
        posts: List of `Ask HN` or `Show HN` posts.
        output: Selector. With 0 (the default) the function returns its results;
            with any other value it prints them in a human-readable form.

    Returns:
        If `output` is 0, a tuple `(sorted_avg_by_hour, avg)`, where
        `sorted_avg_by_hour` is a list of `[hour, average]` pairs sorted by hour in
        descending order and `avg` is the average for the last hour processed
        (not the mean over all hours). Otherwise None.
    
    """
    counts_by_hours = {}    # hour -> number of posts created in that hour
    comments_by_hours = {}  # hour -> total number of comments on those posts
    avg_by_hour = []

    for row in posts:
        # created_at looks like "8/4/2016 11:52"
        time_obj = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
        hour = time_obj.hour          # extract the hour
        num_comments = int(row[4])    # num_comments column

        # Accumulate the post count and the comment total for this hour
        if hour not in counts_by_hours:
            counts_by_hours[hour] = 1
            comments_by_hours[hour] = num_comments
        else:
            counts_by_hours[hour] += 1
            comments_by_hours[hour] += num_comments

    for horas in counts_by_hours:
        avg = comments_by_hours[horas] / counts_by_hours[horas]
        avg_by_hour.append([horas, avg]) # [hour, average comments per post]

    # Sort by hour, latest first
    sorted_avg_by_hour = sorted(avg_by_hour, reverse=True)

    if output == 0:
        # NOTE: `avg` still holds the value computed for the last hour in the
        # loop above; it is not the mean over all hours.
        return sorted_avg_by_hour, avg
    else:
        accu = 0 # running sum of the hourly averages
        for row in sorted_avg_by_hour:
            horas = row[0]
            avg_comment = row[1]
            accu += avg_comment

            text_template ="At {h}:00 the average comments per post is {c:.2f}".format(h = horas, c = avg_comment)
            print(text_template)
        
        leng = len(sorted_avg_by_hour)
        print("\n" + "average comments by day {avg}".format(avg = (accu/leng)))
        
        return
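
The core of the function above is a group-by on the hour of `created_at`. As a compact illustration of that step only, the sketch below uses `collections.defaultdict` and returns a dictionary; `hourly_averages` and the toy rows are hypothetical and are not part of the notebook.

```python
import datetime as dt
from collections import defaultdict

def hourly_averages(posts):
    """Return {hour: average comments per post} for rows shaped like `hn`."""
    counts = defaultdict(int)   # hour -> number of posts
    totals = defaultdict(int)   # hour -> total comments
    for row in posts:
        hour = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        totals[hour] += int(row[4])
    return {hour: totals[hour] / counts[hour] for hour in counts}

# Hypothetical rows in the same column order as the dataset
toy_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: b', '', '5', '20', 'user_b', '8/5/2016 11:05'],
    ['3', 'Ask HN: c', '', '1', '6', 'user_c', '8/5/2016 15:30'],
]
print(hourly_averages(toy_posts))  # {11: 15.0, 15: 6.0}
```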

### 7. Average number of comments per hour on `Ask post`

In [19]:
comments_byhour(ask_posts,1)

At 23:00 the average comments per post is 7.99
At 22:00 the average comments per post is 6.75
At 21:00 the average comments per post is 16.01
At 20:00 the average comments per post is 21.52
At 19:00 the average comments per post is 10.80
At 18:00 the average comments per post is 13.20
At 17:00 the average comments per post is 11.46
At 16:00 the average comments per post is 16.80
At 15:00 the average comments per post is 38.59
At 14:00 the average comments per post is 13.23
At 13:00 the average comments per post is 14.74
At 12:00 the average comments per post is 9.41
At 11:00 the average comments per post is 11.05
At 10:00 the average comments per post is 13.44
At 9:00 the average comments per post is 5.58
At 8:00 the average comments per post is 10.25
At 7:00 the average comments per post is 7.85
At 6:00 the average comments per post is 9.02
At 5:00 the average comments per post is 10.09
At 4:00 the average comments per post is 7.17
At 3:00 the average comments per post is 7.80
At 2:00

### 8. Average number of comments per hour on `Show post`

In [20]:
comments_byhour(show_posts,1)

At 23:00 the average comments per post is 12.42
At 22:00 the average comments per post is 12.39
At 21:00 the average comments per post is 5.79
At 20:00 the average comments per post is 10.20
At 19:00 the average comments per post is 9.80
At 18:00 the average comments per post is 15.77
At 17:00 the average comments per post is 9.80
At 16:00 the average comments per post is 11.66
At 15:00 the average comments per post is 8.10
At 14:00 the average comments per post is 13.44
At 13:00 the average comments per post is 9.56
At 12:00 the average comments per post is 11.80
At 11:00 the average comments per post is 11.16
At 10:00 the average comments per post is 8.25
At 9:00 the average comments per post is 9.70
At 8:00 the average comments per post is 4.85
At 7:00 the average comments per post is 11.50
At 6:00 the average comments per post is 8.88
At 5:00 the average comments per post is 3.05
At 4:00 the average comments per post is 9.50
At 3:00 the average comments per post is 10.63
At 2:00 th

Some hours of the day show more activity than others.

To make this concrete, we take the average number of comments across the 24 hours of the day and determine which hours fall above that value and which fall below it.

For `Ask HN` posts, the mean of the hourly averages is 12.75 comments per post, so we could filter the hours that are above or below that value.

The function `threshold(data, average, selector = 1)` takes 3 parameters:

- `data`: the list of `[hour, average]` pairs, already sorted by hour
- `average`: the reference value that each hourly average is compared against
- `selector`: `1` shows the hours above `average`, `0` shows the hours below it

Note that `comments_byhour(posts, 0)` returns as its second value the average for the last hour it processed, not the mean over all hours, so the reference values printed below (11.05 for `Ask HN`, 15.71 for `Show HN`) differ from 12.75.

In [21]:
def threshold(data,average,selector = 1):
    texto = ">>> Average {av} <<<".format(av = average)
    print(texto)
    for index in data:
        if selector == 1: # selector == 1: show the hours whose average is above the reference value
            if index[1] > average:
                print(index)
        elif selector == 0: # selector == 0: show the hours whose average is below the reference value
            if index[1] < average:
                print(index)
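
To compare each hour against the mean over all hours (the value that `comments_byhour` prints as "average comments by day"), that mean can be computed directly from the list of `[hour, average]` pairs. The sketch below is an assumption about how this could be done; `split_by_mean` and the toy data are hypothetical.

```python
def split_by_mean(avg_by_hour):
    """Split [hour, average] pairs into those above and below the overall mean."""
    mean = sum(avg for _, avg in avg_by_hour) / len(avg_by_hour)
    above = [pair for pair in avg_by_hour if pair[1] > mean]
    below = [pair for pair in avg_by_hour if pair[1] < mean]
    return mean, above, below

# Hypothetical hourly averages in the same [hour, average] shape
toy_hours = [[15, 30.0], [10, 12.0], [9, 6.0]]
mean, above, below = split_by_mean(toy_hours)
print(mean)    # (30.0 + 12.0 + 6.0) / 3 = 16.0
print(above)   # [[15, 30.0]]
print(below)   # [[10, 12.0], [9, 6.0]]
```

This is an unweighted mean of the hourly averages, so hours with few posts count as much as busy hours.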

### 9. Hours with above-average activity in `ask_post`

In [22]:
data,average = comments_byhour(ask_posts,0)
threshold(data,average,1)

>>> Average 11.051724137931034 <<<
[21, 16.009174311926607]
[20, 21.525]
[18, 13.20183486238532]
[17, 11.46]
[16, 16.796296296296298]
[15, 38.5948275862069]
[14, 13.233644859813085]
[13, 14.741176470588234]
[10, 13.440677966101696]
[2, 23.810344827586206]
[1, 11.383333333333333]


### 10. Hours with below-average activity in `ask_post`

In [23]:
threshold(data,average,0)

>>> Average 11.051724137931034 <<<
[23, 7.985294117647059]
[22, 6.746478873239437]
[19, 10.8]
[12, 9.41095890410959]
[9, 5.5777777777777775]
[8, 10.25]
[7, 7.852941176470588]
[6, 9.022727272727273]
[5, 10.08695652173913]
[4, 7.170212765957447]
[3, 7.796296296296297]
[0, 8.127272727272727]


## `show post`

### 11. Hours with above-average activity in `show_post`

In [24]:
data1,average1 = comments_byhour(show_posts,0)
threshold(data1,average1,1)

>>> Average 15.709677419354838 <<<
[18, 15.770491803278688]


### 12. Hours with below-average activity in `show_post`

In [25]:
threshold(data1,average1,0)

>>> Average 15.709677419354838 <<<
[23, 12.416666666666666]
[22, 12.391304347826088]
[21, 5.787234042553192]
[20, 10.2]
[19, 9.8]
[17, 9.795698924731182]
[16, 11.655913978494624]
[15, 8.102564102564102]
[14, 13.44186046511628]
[13, 9.555555555555555]
[12, 11.80327868852459]
[11, 11.159090909090908]
[10, 8.25]
[9, 9.7]
[8, 4.852941176470588]
[7, 11.5]
[6, 8.875]
[5, 3.0526315789473686]
[4, 9.5]
[3, 10.62962962962963]
[2, 4.233333333333333]
[1, 8.785714285714286]


With the hours split this way, it is much easier to see when each type of post gets the most attention. As the output shows, posts created at certain hours do receive more comments on average.

---

### 13. Who are the most relevant authors in `Ask_posts` and `Show_posts`?

In [26]:
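def most_relevant_authors(kind_of_post,limit):
    """Return the `limit` authors with the most posts as (count, author) tuples."""
    author = {}      # author -> number of posts
    sort_dict = []
    
    for row in kind_of_post:
        who = row[5] # author column
        
        if who in author:
            author[who] +=1
        else:
            author[who] = 1
            
    for (key,value) in author.items():
        sort_dict.append((value, key))

    # Sort by post count (then by name), highest first
    out = sorted(sort_dict, reverse = True)
    if limit !=0:
        return out[0:limit]
    return out # a limit of 0 returns every author instead of None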

### 14. Most relevant authors in `Ask_posts` by number of times posted on the forum.

In [27]:
most_relevant_authors(ask_posts,10)

[(16, 'hoodoof'),
 (14, 'tmaly'),
 (9, 'whoishiring'),
 (7, 'prmph'),
 (7, 'hanniabu'),
 (6, 'tixocloud'),
 (5, 'vijayr'),
 (5, 'soulbadguy'),
 (5, 'sharemywin'),
 (5, 'rayalez')]

### 15. Most relevant authors in `Show_posts`

In [28]:
most_relevant_authors(show_posts, 10)

[(4, 'vipul4vb'),
 (4, 'soheil'),
 (4, 'max0563'),
 (4, 'iisbum'),
 (4, 'emeth'),
 (4, 'chinchang'),
 (4, 'alexellisuk'),
 (3, 'stockkid'),
 (3, 'mojoe'),
 (3, 'gk_brown')]
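
Counting posts per author is a frequency count, which the standard library's `collections.Counter` handles directly. The sketch below shows that alternative; `top_authors` and the toy rows are hypothetical.

```python
from collections import Counter

def top_authors(posts, limit=10):
    """Return the `limit` most frequent authors as (author, count) tuples."""
    counts = Counter(row[5] for row in posts)   # author column
    return counts.most_common(limit)

# Hypothetical rows in the same column order as the dataset
toy_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'alice', '1/1/2016 9:00'],
    ['2', 'Ask HN: b', '', '5', '2', 'bob', '1/1/2016 10:00'],
    ['3', 'Ask HN: c', '', '1', '0', 'alice', '1/1/2016 11:00'],
]
print(top_authors(toy_posts, limit=2))  # [('alice', 2), ('bob', 1)]
```

Unlike the function above, `most_common` returns `(author, count)` pairs and keeps authors with equal counts in the order they were first seen, so ties may be listed in a different order.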

### 16. The most significant activity by user in `Ask_posts`

Now that we know which users are the most active, let's look at their posts and the scores they received.

In [29]:
def all_activity_post(name, posts):
    """Print every post by `name` with its score and date, then a short summary."""
    datum_format = "%m/%d/%Y %H:%M" # format of the `created_at` column
    time_format = '%Y/%m/%d'        # date format used in the printed output
    time = 0                        # number of posts by this author
    scoring = 0                     # sum of the points of those posts

    print(" Score:       Date:   Post: ")
    for row in posts:
        if row[5] == name:           # ['author']
            time +=1
            scoring += int(row[3])   # ['num_points']
            
            ask_date = dt.datetime.strptime(row[6], datum_format) # 'created_at'
            date_object = ask_date.strftime(time_format)
            
            post_published = "  {points} -- "  "  {date} " " {tittle},".format(tittle = row[1],
                                                                          date = date_object, points = row[3] )
            
            print(post_published)
    
    if time == 0:                    # avoid dividing by zero when the author has no posts
        print("No posts found for {author}".format(author = name))
        return

    total_number_post = "Total number of post = {n_post}".format(n_post = time)
    avg_score = "Score average = {score:.2f}".format(score=scoring/time)
    print("\n")
    print(total_number_post)
    print(avg_score)

In [30]:
all_activity_post('hoodoof', ask_posts)

 Score:       Date:   Post: 
  3 --   2016/07/10  Ask HN: Imagine it's 1993  what would you put in an MVP web browser?,
  3 --   2016/01/11  Ask HN: Can anyone suggest a good RSS newsreader with a set of tech news feeds?,
  3 --   2016/02/26  Ask HN: Someone is stealing things from my car. What security camera would help?,
  2 --   2016/09/04  Ask HN: Why should open source support be free? I don't think it should.,
  9 --   2016/04/12  Ask HN: What's it like working at a cannabis tech startup company?,
  2 --   2016/04/08  Ask HN: What is the most money a bootstrapped, one-person company has sold for?,
  1 --   2016/05/20  Ask HN: Are you building something?  How long for? How much longer to go?,
  4 --   2016/04/07  Ask HN: Why is it still not possible to search an S3 bucket?,
  1 --   2016/05/04  Ask HN: Why does Etsy have so many items titled DO Not PURCHASE?,
  11 --   2016/06/03  Ask HN: Is there an up-to-date global index of conferences?,
  2 --   2016/04/23  Ask HN: What would 

In [31]:
all_activity_post('tmaly', ask_posts)

 Score:       Date:   Post: 
  12 --   2016/03/21  Ask HN: How do you find unused CSS?,
  3 --   2015/09/23  Ask HN: Laptop bag for 15 Macbook Pro for inclement weather,
  1 --   2016/06/22  Ask HN: Operational complexity of micro services?,
  8 --   2016/01/12  Ask HN: Online learning,
  2 --   2015/10/20  Ask HN: Wireless HDMI for Macbook to TV,
  3 --   2015/12/28  Ask HN: Arduino or Raspberry PI for teenager?,
  2 --   2016/01/15  Ask HN: Website Obesity Crisis,
  1 --   2015/10/26  Ask HN: Algorithm for Scheduling appointments by location,
  3 --   2016/08/27  Ask HN: Accuracy of ip to geo location?,
  2 --   2015/10/30  Ask HN: Feedback on design?,
  2 --   2016/05/17  Ask HN: What is your favorite podcast episode?,
  2 --   2016/03/29  Ask HN: Best use case writeups of SOA?,
  1 --   2016/06/06  Ask HN: Collaborative RSS?,
  2 --   2016/02/14  Ask HN: Weather site negative temps,


Total number of post = 14
Score average = 3.14


In [32]:
all_activity_post('whoishiring', ask_posts)

 Score:       Date:   Post: 
  137 --   2016/06/01  Ask HN: Who wants to be hired? (June 2016),
  41 --   2015/12/01  Ask HN: Freelancer? Seeking freelancer? (December 2015),
  521 --   2016/09/01  Ask HN: Who is hiring? (September 2016),
  22 --   2016/08/01  Ask HN: Who wants to be hired? (August 2016),
  72 --   2016/09/01  Ask HN: Freelancer? Seeking freelancer? (September 2016),
  534 --   2016/08/01  Ask HN: Who is hiring? (August 2016),
  154 --   2016/04/01  Ask HN: Who wants to be hired? (April 2016),
  117 --   2015/11/02  Ask HN: Freelancer? Seeking freelancer? (November 2015),
  84 --   2016/03/01  Ask HN: Who wants to be hired? (March 2016),


Total number of post = 9
Score average = 186.89


### 17. The most significant activity by user in `Show_posts`

In [33]:
all_activity_post('vipul4vb', show_posts)

 Score:       Date:   Post: 
  2 --   2016/04/04  Show HN: Quantitative user research reveals useful UX observation on LinkedIn,
  1 --   2016/06/06  Show HN: Import Balsamiq Mockups into CanvasFlip using this simple interface,
  1 --   2016/04/25  Show HN: Donald Trump V/s Hillary Clinton Better On-Site UX?,
  2 --   2016/04/06  Show HN: UX Insights on largest e-commerce app  Amazon,


Total number of post = 4
Score average = 1.50


In [34]:
all_activity_post('soheil', show_posts)

 Score:       Date:   Post: 
  1 --   2016/03/28  Show HN: Get hired as a team: Work with people you know,
  3 --   2016/02/04  Show HN: Demo: most accurate speech recognition,
  7 --   2016/03/06  Show HN: Lightweight Twitter for Mac client,
  2 --   2016/08/22  Show HN: 1) Build Team 2) Interview 3) Offer,


Total number of post = 4
Score average = 3.25


In [35]:
all_activity_post('max0563', show_posts)

 Score:       Date:   Post: 
  2 --   2015/09/28  Show HN: Resume Reviewers - Get Your Resume Reviewed by Real People for Free,
  4 --   2016/03/31  Show HN: A Highcharts Library for Polymer 1.0,
  51 --   2016/01/22  Show HN: HNLive  Hacker News in Real Time,
  6 --   2016/02/10  Show HN: Highcharts  JavaScript charts in one line,


Total number of post = 4
Score average = 15.75
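
The two summary figures printed for each author (number of posts and average score) can also be computed as return values, which makes them easier to compare across authors. This is a minimal sketch under that assumption; `author_summary` and the toy rows are hypothetical.

```python
def author_summary(name, posts):
    """Return (number of posts, average points) for one author, or (0, None)."""
    points = [int(row[3]) for row in posts if row[5] == name]  # num_points column
    if not points:
        return 0, None
    return len(points), sum(points) / len(points)

# Hypothetical rows in the same column order as the dataset
toy_posts = [
    ['1', 'Show HN: a', '', '3', '10', 'alice', '1/1/2016 9:00'],
    ['2', 'Show HN: b', '', '5', '2', 'bob', '1/1/2016 10:00'],
    ['3', 'Show HN: c', '', '9', '0', 'alice', '1/1/2016 11:00'],
]
print(author_summary('alice', toy_posts))  # (2, 6.0)
print(author_summary('carol', toy_posts))  # (0, None)
```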


### 18. Summary

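The analysis identified the hours at which posts receive an above-average number of comments. For `Ask HN` posts, 15:00 stands out with 38.59 comments per post on average, followed by 2:00 (23.81) and 20:00 (21.52). For `Show HN` posts the hourly averages are closer together, with 18:00 (15.77) among the highest. These figures suggest when it is most convenient to publish content to maximize interaction, based on real engagement data.

It is also interesting to see which authors post most often and how their posts score. `whoishiring`, for example, published only 9 `Ask HN` posts, but they average 186.89 points.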