![hacker_news_logo.png](hacker_news_logo.png)

### What is Hacker News:

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.


### Data dictionary

- **`id`**: the unique identifier from Hacker News for the post
- **`title`**: the title of the post
- **`url`**: the URL that the post links to, if the post has a URL
- **`num_points`**: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **`num_comments`**: the number of comments on the post
- **`author`**: the username of the person who submitted the post
- **`created_at`**: the date and time of the post's submission



### Loading the dataset

We're specifically interested in posts whose titles begin with either **`Ask HN`** or **`Show HN`**.

Rather than importing everything from the `csv` module, let's first look at what it contains and then import only what we need.

In [1]:
import csv
print(dir(csv))

['Dialect', 'DictReader', 'DictWriter', 'Error', 'QUOTE_ALL', 'QUOTE_MINIMAL', 'QUOTE_NONE', 'QUOTE_NONNUMERIC', 'Sniffer', 'StringIO', '_Dialect', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', 'excel', 'excel_tab', 'field_size_limit', 'get_dialect', 'list_dialects', 're', 'reader', 'register_dialect', 'unix_dialect', 'unregister_dialect', 'writer']


Having seen what the module offers, let's import only the `reader` function.

In [2]:
from csv import reader

with open('hacker_news.csv') as file:
    read_file = reader(file)
    hn = list(read_file)

len(hn)

20101

In [3]:
header_hn = hn[0] # Columns name content
hn = hn[1:]       # Dataset
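
An alternative way to separate the header from the data is to take the first row off the reader with `next()` before building the list. The following is a minimal sketch, assuming a small in-memory sample (built with `io.StringIO`) that stands in for `hacker_news.csv`; the names `sample`, `rows`, `header`, and `data` are illustrative and not part of the original notebook. When reading a real file, the `csv` documentation recommends opening it with `newline=''`.

```python
import csv
import io

# Hypothetical one-row sample standing in for hacker_news.csv.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = csv.reader(sample)
header = next(rows)   # consume the first row as the column names
data = list(rows)     # everything that remains is data

print(header)
print(len(data))
```

This avoids holding the header inside the dataset list at any point, at the cost of one extra call.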

The header row, followed by a sample of the first 10 rows of our dataset.

In [4]:
print(header_hn)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [5]:
print(hn[0:10])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http

### 3. Filtering `Ask HN` and `Show HN` Posts



In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)  
    else:
        other_posts.append(row)

In [7]:
text = {'ask_post': ask_posts,'show_post': show_posts,'other_post':other_posts }

for index in text:
    print("The number of {} is {}".format(index, len(text[index])))

The number of ask_post is 1744
The number of show_post is 1162
The number of other_post is 17194
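
The prefix test used for the filtering can be sketched as a small standalone function. This is only an illustration: `classify_title` and the sample titles are hypothetical, not part of the original notebook. Lowercasing the title first makes the match case-insensitive, which is why `'SHOW HN'` is still recognised.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only for illustration.
titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in titles]
print(labels)  # ['ask', 'show', 'other']
```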


### 4.  Calculating the Average Number of Comments for `Ask HN` and `Show HN` Posts

Let's determine if ask posts or show posts receive more comments on average.

Before calculating anything, let's check the data type of the values we are going to work with.

In [8]:
print(header_hn)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [9]:
for post in ask_posts[:5]:
    num_comments = post[4]
    print(num_comments, type(num_comments))

6 <class 'str'>
29 <class 'str'>
1 <class 'str'>
3 <class 'str'>
17 <class 'str'>


The `num_comments` column is stored as text (`str`), so we have to convert each value to `int` before doing any arithmetic with it.
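
To see why the conversion matters, here is a short sketch using the five values printed above. The variable names are illustrative only.

```python
raw_values = ['6', '29', '1', '3', '17']  # the five strings printed above

# Adding strings concatenates them instead of summing them.
concatenated = raw_values[0] + raw_values[1]
print(concatenated)  # 629

# Converting with int() first gives real arithmetic.
as_ints = [int(value) for value in raw_values]
print(sum(as_ints))  # 56
```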

### Do `Show HN` or `Ask HN` posts get more comments on average?

- Average number of comments for `Ask HN` and `Show HN` posts

In [10]:
def avg_comments(lista, column, name):
    # Add up the values in the given column, converting each from str to int
    total_comments = 0
    for post in lista:
        total_comments += int(post[column])

    # Return the average, not the running total
    average = total_comments / len(lista)
    print("The average number of comments in {} is: {}".format(name, average))
    return average

In [11]:
avg_ask_comments = avg_comments(ask_posts, 4, 'Ask HN')

The average number of comments in Ask HN is: 14.038417431192661


In [12]:
avg_show_comments = avg_comments(show_posts, 4, 'Show HN')

The average number of comments in Show HN is: 10.31669535283993


As we can see, `Ask HN` posts receive about 14 comments on average, compared with about 10 for `Show HN` posts.

In [13]:
print(avg_ask_comments)

14.038417431192661


In [14]:
print(avg_show_comments)

10.31669535283993


| Show post vs Ask post  |   |
|:--|:--|
|number of ask_post   |1744|
|number of show_post  |1162|
|total_ask_comments   |24483|
|total_show_comments  |11988|
|avg_ask_comments     |14.04|
|avg_show_comments    |10.32|
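
The averages can be cross-checked from the comment totals and post counts reported in this notebook. The sketch below simply redoes the division; the variable names are illustrative.

```python
# Figures reported in this notebook.
num_ask_posts, total_ask_comments = 1744, 24483
num_show_posts, total_show_comments = 1162, 11988

avg_ask = total_ask_comments / num_ask_posts
avg_show = total_show_comments / num_show_posts
ratio = avg_ask / avg_show

print(round(avg_ask, 2), round(avg_show, 2), round(ratio, 2))
```

A ratio of about 1.36 means that ask posts receive roughly a third more comments per post than show posts.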

### 5. Comments by Hour Created

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

We'll determine whether ask posts created at a certain hour are more likely to attract comments.

In [15]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    tupla = (created_at, num_comments)

    result_list.append(tupla)

In [16]:
result_list[:10]

[('8/16/2016 9:55', '6'),
 ('11/22/2015 13:43', '29'),
 ('5/2/2016 10:14', '1'),
 ('8/2/2016 14:20', '3'),
 ('10/15/2015 16:38', '17'),
 ('9/26/2015 23:23', '1'),
 ('4/22/2016 12:24', '4'),
 ('11/16/2015 9:22', '1'),
 ('2/24/2016 17:57', '1'),
 ('6/4/2016 17:17', '2')]

In [17]:
counts_by_hour = {}   # number of posts per hour
comments_by_hour = {} # total number of comments per hour


for row in result_list:
    created_at = row[0]
    time_obj = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    num_comments = int(row[1])

    hour = time_obj.hour

    # Count posts per hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
    else:
        counts_by_hour[hour] = 1
    # Add up comments per hour
    if hour in comments_by_hour:
        comments_by_hour[hour] += num_comments
    else:
        comments_by_hour[hour] = num_comments

print("counts_by_hour")
print(counts_by_hour)
print("")
print("comments_by_hour")
print(comments_by_hour)

counts_by_hour
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}

comments_by_hour
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
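
The same tally can be written more compactly with `collections.defaultdict`, which removes the need for the `if`/`else` branches. This is a sketch of an alternative, not the approach used in the cell above; it runs on four of the `(created_at, num_comments)` pairs shown earlier, and the names `sample`, `counts`, and `comments` are illustrative.

```python
import datetime as dt
from collections import defaultdict

# Four of the (created_at, num_comments) pairs shown earlier.
sample = [
    ('8/16/2016 9:55', '6'),
    ('11/22/2015 13:43', '29'),
    ('5/2/2016 10:14', '1'),
    ('11/16/2015 9:22', '1'),
]

counts = defaultdict(int)    # posts per hour, missing keys start at 0
comments = defaultdict(int)  # comments per hour, missing keys start at 0

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += int(num_comments)

print(dict(counts))    # {9: 2, 13: 1, 10: 1}
print(dict(comments))  # {9: 7, 13: 29, 10: 1}
```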


### 6. Average Number of Comments by Hour


In [18]:
swap_avg_by_hour = []

print("#  Comments by hour  #")
for hour in counts_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    print("",hour,":" , average)
    data = (average, hour)
    swap_avg_by_hour.append(data)
print("  - - - - - - - - - - ")

#  Comments by hour  #
 9 : 5.5777777777777775
 13 : 14.741176470588234
 10 : 13.440677966101696
 14 : 13.233644859813085
 16 : 16.796296296296298
 23 : 7.985294117647059
 12 : 9.41095890410959
 17 : 11.46
 15 : 38.5948275862069
 21 : 16.009174311926607
 20 : 21.525
 2 : 23.810344827586206
 18 : 13.20183486238532
 3 : 7.796296296296297
 5 : 10.08695652173913
 19 : 10.8
 1 : 11.383333333333333
 22 : 6.746478873239437
 8 : 10.25
 4 : 7.170212765957447
 0 : 8.127272727272727
 6 : 9.022727272727273
 7 : 7.852941176470588
 11 : 11.051724137931034
  - - - - - - - - - - 
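
The per-hour average is simply the comment total divided by the post count for that hour. As a sketch, the division can also be expressed with a dictionary comprehension; the example below uses three of the hours from the output above, and `sample_counts`, `sample_comments`, and `avg` are illustrative names.

```python
# Three of the hours from the counts_by_hour and comments_by_hour output.
sample_counts = {15: 116, 2: 58, 16: 108}
sample_comments = {15: 4477, 2: 1381, 16: 1814}

# Average comments per post for each hour.
avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts}

for hour, value in avg.items():
    print(hour, round(value, 2))
```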


### 7. Sorting and Displaying the Best Hours



In [19]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [20]:
for avg_hour in sorted_swap:
    print("Comments {} by hour {}".format(avg_hour[0], avg_hour[1]))

Comments 38.5948275862069 by hour 15
Comments 23.810344827586206 by hour 2
Comments 21.525 by hour 20
Comments 16.796296296296298 by hour 16
Comments 16.009174311926607 by hour 21
Comments 14.741176470588234 by hour 13
Comments 13.440677966101696 by hour 10
Comments 13.233644859813085 by hour 14
Comments 13.20183486238532 by hour 18
Comments 11.46 by hour 17
Comments 11.383333333333333 by hour 1
Comments 11.051724137931034 by hour 11
Comments 10.8 by hour 19
Comments 10.25 by hour 8
Comments 10.08695652173913 by hour 5
Comments 9.41095890410959 by hour 12
Comments 9.022727272727273 by hour 6
Comments 8.127272727272727 by hour 0
Comments 7.985294117647059 by hour 23
Comments 7.852941176470588 by hour 7
Comments 7.796296296296297 by hour 3
Comments 7.170212765957447 by hour 4
Comments 6.746478873239437 by hour 22
Comments 5.5777777777777775 by hour 9
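
Swapping each pair so that the average comes first is one way to sort by average. Another option, sketched below, is to keep the `hour: average` mapping and pass a `key` function to `sorted()`. The dictionary holds six of the averages listed above; `avg_by_hour_sample`, `top`, and `lines` are illustrative names.

```python
# Six of the hourly averages listed above.
avg_by_hour_sample = {
    15: 38.5948275862069,
    2: 23.810344827586206,
    20: 21.525,
    16: 16.796296296296298,
    21: 16.009174311926607,
    9: 5.5777777777777775,
}

# Sort (hour, average) pairs by the average, highest first, and keep three.
top = sorted(avg_by_hour_sample.items(), key=lambda item: item[1], reverse=True)[:3]

# Format each hour as HH:00 with the average rounded to two decimals.
lines = ["{:02d}:00: {:.2f} average comments per post".format(hour, avg) for hour, avg in top]
for line in lines:
    print(line)
```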


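### Summary

`Ask HN` posts receive more comments on average than `Show HN` posts (about 14 versus about 10 per post). Among ask posts, those created during hour 15 attract by far the most comments on average (about 38.6 per post), followed by hour 2 (about 23.8) and hour 20 (about 21.5). Posts created during hour 9 attract the fewest (about 5.6 per post).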

