# Exploring Hacker News Posts

This project should showcase my ability to work with more advanced features of the Python language, such as:
* object-oriented programming
* manipulation of strings
* working with dates and times

For this project, I'll be examining a dataset of Hacker News posts. The dataset has been limited to a random selection of roughly 20,000 posts that received comments. Specifically, I'll be answering two questions using this dataset:
* do 'Ask HN' or 'Show HN' posts receive more comments, on average?
* how does the time of posting affect the number of comments received?

The dataset I'll be working with can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

## Preparing the Data

In [15]:
from csv import reader
#reads the csv in as a list of lists; the with block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
hn[0:5]
#displays the header row and the first 4 data rows

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [16]:
headers = hn[0:1]
hn = hn[1:]
print(headers)
hn[0:5]
#extracts the header row to a variable (kept as a one-element list of lists) and removes it from the dataset, then displays the first 5 rows

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [18]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts), len(show_posts), len(other_posts))
#parses the dataset into three lists of lists: ask_posts, show_posts, and other_posts, based on what their title starts with, then prints the length of each

1744 1162 17194


## Finding the Avg. Number of Comments

In [23]:
def find_num_comments(database, index):
    total_comments = 0
    for row in database:
        num_comments = int(row[index])
        total_comments += num_comments
    avg_comments = total_comments/(len(database))
    print(avg_comments)
print("Avg. number of Ask Post comments:")
find_num_comments(ask_posts,4)
print('\n')
print("Avg. number of Show Post comments:")
find_num_comments(show_posts,4)
#creates a function that prints the avg. number of comments, then runs that function on ask_posts and show_posts 

Avg. number of Ask Post comments:
14.038417431192661


Avg. number of Show Post comments:
10.31669535283993


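Our function found that Ask HN posts receive, on average, about 1.36 times as many comments as Show HN posts do (14.04 vs. 10.32 comments per post). Because the dataset only includes posts that received comments, these averages describe commented posts rather than all submissions.

This answers my first question, but how does the time of posting affect the number of comments?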
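As a quick check on the comparison above, the ratio can be computed directly from the two printed averages. This is only a sketch: the variable names `avg_ask`, `avg_show`, and `ratio` are introduced here for illustration and are not part of the notebook.

```python
# Averages copied from the output printed by find_num_comments above
avg_ask = 14.038417431192661
avg_show = 10.31669535283993

# How many times more comments an Ask HN post receives, on average
ratio = avg_ask / avg_show
print(f"Ask HN posts receive {ratio:.2f}x as many comments as Show HN posts")
# prints: Ask HN posts receive 1.36x as many comments as Show HN posts
```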

## Number of Comments by Post Time of Day

In [56]:
from datetime import datetime as dt
#imports the datetime class from the datetime module under the alias dt

In [66]:
def convert_datetime(database, timestamp_index, num_comments_index):
    result_list=[]
    counts_by_hour={}
    comments_by_hour={}
    for row in database:
        created_at = row[timestamp_index]
        num_comments = int(row[num_comments_index])
        combined = [created_at, num_comments]
        result_list.append(combined)
    for sublist in result_list:
        date_time = sublist[0]
        converted_dt = dt.strptime(date_time, '%m/%d/%Y %H:%M')
        hour = converted_dt.strftime('%H')
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = sublist[1]
        else:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += sublist[1]
    return (counts_by_hour, comments_by_hour)
#creates a function that returns two dictionaries: counts_by_hour, which contains the number of posts by hour, and comments_by_hour, which contains the number of comments by hour
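
The bucketing idea in the function above can be illustrated on a small hand-made sample. This is a minimal sketch, not the notebook's method: `sample_rows` and `average_by_hour` are hypothetical names, the sample values are made up, and it uses `collections.defaultdict` and integer hours in place of the explicit `if hour not in ...` check and zero-padded strings.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (created_at, num_comments) pairs in the dataset's timestamp format
sample_rows = [
    ("8/4/2016 11:52", 52),
    ("8/4/2016 11:05", 2),
    ("1/26/2016 19:30", 10),
]

def average_by_hour(rows):
    """Return a dict mapping hour of day (int) to average comments per post."""
    totals = defaultdict(int)  # hour -> total comments
    counts = defaultdict(int)  # hour -> number of posts
    for created_at, num_comments in rows:
        hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        totals[hour] += num_comments
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

averages = average_by_hour(sample_rows)
print(averages)  # prints: {11: 27.0, 19: 10.0}
```

Missing keys in a `defaultdict(int)` start at 0, so no first-seen branch is needed when accumulating.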

In [68]:
ask_counts_by_hour, ask_comments_by_hour = convert_datetime(ask_posts,6,4)
show_counts_by_hour, show_comments_by_hour = convert_datetime(show_posts,6,4)
#unpacks the two dictionaries returned by convert_datetime, once for the Ask posts and once for the Show posts

In [93]:
avg_ask_comments_by_hour = []
for hour in ask_comments_by_hour:
    avg = (ask_comments_by_hour[hour])/(ask_counts_by_hour[hour])
    avg_ask_comments_by_hour.append([avg, hour])
#calculates the average number of comments per Ask post for each hour

In [90]:
sorted_avg_ask=sorted(avg_ask_comments_by_hour)
print('Top 5 Hours for Ask Post Comments:')
template = "{hour}:00 EST - {comments:.2f} avg. comments per post"
for sublist in sorted_avg_ask[-5:]:
    string = template.format(hour = sublist[1], comments = sublist[0])
    print(string)
#prints the top 5 posting hours by average comments for Ask posts, in ascending order

Top 5 Hours for Ask Post Comments:
21:00 EST - 16.01 avg. comments per post
16:00 EST - 16.80 avg. comments per post
20:00 EST - 21.52 avg. comments per post
02:00 EST - 23.81 avg. comments per post
15:00 EST - 38.59 avg. comments per post


In [92]:
avg_show_comments_by_hour = []
for hour in show_comments_by_hour:
    avg = (show_comments_by_hour[hour])/(show_counts_by_hour[hour])
    avg_show_comments_by_hour.append([avg, hour])
#calculates the average number of comments per Show post for each hour

In [91]:
sorted_avg_show = sorted(avg_show_comments_by_hour)
print('Top 5 Hours for Show Post Comments:')
template = "{hour}:00 EST - {comments:.2f} avg. comments per post"
for sublist in sorted_avg_show[-5:]:
    string = template.format(hour = sublist[1], comments = sublist[0])
    print(string)
#prints the top 5 posting hours by average comments for Show posts, in ascending order

Top 5 Hours for Show Post Comments:
22:00 EST - 12.39 avg. comments per post
23:00 EST - 12.42 avg. comments per post
14:00 EST - 13.44 avg. comments per post
00:00 EST - 15.71 avg. comments per post
18:00 EST - 15.77 avg. comments per post


From our two lists, we see that Ask posts made at 15:00 (3:00 PM) EST get the most comments, with an average of 38.59 comments per Ask post made at this time. Show posts peak at 18:00 (6:00 PM) EST, with an average of 15.77 comments per post.
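
The lists above are printed in ascending order, so the best hour appears last. As a sketch of an alternative presentation, the same five Ask HN values can be ranked from highest to lowest with `sorted(..., reverse=True)`. The names `ask_top5`, `ranked`, and `best_hour` are introduced here for illustration; the numbers are the rounded values copied from the output above.

```python
# Top-five Ask HN averages copied from the printed output: hour -> avg. comments
ask_top5 = {"21": 16.01, "16": 16.80, "20": 21.52, "02": 23.81, "15": 38.59}

# Sort by average comments, highest first
ranked = sorted(ask_top5.items(), key=lambda item: item[1], reverse=True)

for hour, avg in ranked:
    print(f"{hour}:00 EST - {avg:.2f} avg. comments per post")

best_hour = ranked[0][0]  # hour with the highest average
```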