This project looks at Hacker News, a popular technology website started by Y Combinator. 
In particular, I focus on `Ask HN` and `Show HN` posts. 
I want to answer two questions:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
    

## Data Extraction
I am using a [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) of around 300,000 posts on Hacker News. 

In [5]:
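from csv import reader
with open('HN_posts_year_to_Sep_26_2016.csv') as opened_file:   # close the file once it has been read
    read_file = reader(opened_file)
    hn = list(read_file)
headers = hn[0]   # first row holds the column names
hn = hn[1:]       # remaining rows are the posts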
print(headers)
print(hn[:5])


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
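
As a small illustration of the header/rows split used above, the sketch below reads a two-row in-memory sample instead of the real file. The use of `io.StringIO` and the names `sample_csv`, `sample_headers`, and `sample_rows` are assumptions for demonstration only; the two rows are copied from the output above.

```python
from csv import reader
from io import StringIO

# Hypothetical in-memory sample: the header plus two rows copied from the output above
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment if you want stem cells to be classified as your own,"
    "http://www.regulations.gov/document?D=FDA-2015-D-3719-0018,1,0,altstar,9/26/2016 3:26\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

sample = list(reader(sample_csv))
sample_headers = sample[0]   # first row holds the column names
sample_rows = sample[1:]     # remaining rows are the posts

print(sample_headers)
print(len(sample_rows))
```

Every value comes back as a string, which is why the comment counts are converted with `int()` later in the analysis.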


## Cleaning

- Since I am focusing on `Ask HN` and `Show HN` posts, I first separate them from the rest of the dataset.

In [13]:
# Initialize 3 empty lists to categorize posts
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()         # return lowercase version of titles to standardize case
    
    if title.startswith('ask hn'):
        ask_posts.append(row)     # append the entire row to these lists, not just the title.
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)   # any title not starting with 'ask hn' or 'show hn' goes here
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
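
The prefix check can also be sketched as a small helper, which makes the rule easy to try on individual titles. The function name `post_type` and the sample titles are hypothetical and not part of the original notebook.

```python
def post_type(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()          # standardize case before comparing
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, for illustration only
print(post_type('Ask HN: How do you learn a new language?'))
print(post_type('SHOW HN: My weekend project'))
print(post_type('algorithmic music'))
print(post_type('Ask HNers: anyone tried this?'))
```

Note that a plain prefix match also accepts titles such as the last one, where `ask hn` is followed by more letters, so the counts may include a few posts that are not strictly in the `Ask HN` format.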


## Data Analysis

### Post Type & Comments
Now that the posts have been categorized, I can determine if `Ask HN` or `Show HN` posts receive more comments on average.

In [47]:
total_ask_comments = 0
total_show_comments = 0

for i in ask_posts:
    a_comments = i[4]                #assign number of comments to variable
    a_comments = int(a_comments) 
    total_ask_comments += a_comments # accumulate each post's number of comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("For 'Ask HN', ", total_ask_comments, " total comments for ", len(ask_posts), " posts.", sep='')
print("The average number of comments received per post is ~", format(avg_ask_comments, '.2f'))

for j in show_posts:
    s_comments = j[4]
    s_comments = int(s_comments)
    total_show_comments += s_comments
avg_show_comments = total_show_comments / len(show_posts)
print("For 'Show HN', ", total_show_comments, " total comments for ", len(show_posts), " posts.", sep='')
print("The average number of comments received per post is ~", format(avg_show_comments, '.2f'))


For 'Ask HN', 94986 total comments for 9139 posts.
The average number of comments received per post is ~ 10.39
For 'Show HN', 49633 total comments for 10158 posts.
The average number of comments received per post is ~ 4.89


- `Ask HN` posts (~10.39) receive *roughly twice as many comments on average* as `Show HN` posts (~4.89). 

- The contrast is all the more notable because `Show HN` has more posts (10158) than `Ask HN` (9139), yet far fewer comments in total (49633 versus 94986).
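
The two averages, and the ratio between them, can be rechecked directly from the printed totals. This is only a quick arithmetic sketch; the variable names are my own.

```python
# Totals and post counts copied from the output above
ask_total, ask_count = 94986, 9139
show_total, show_count = 49633, 10158

ask_avg = ask_total / ask_count
show_avg = show_total / show_count

print(round(ask_avg, 2))             # average comments per Ask HN post
print(round(show_avg, 2))            # average comments per Show HN post
print(round(ask_avg / show_avg, 2))  # how many times larger the Ask HN average is
```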

### Post Time & Comments


Since `Ask HN` posts receive more comments on average, I will focus on this type of post for my analysis. 

- I want to determine if posts created at a certain time of day are more likely to attract comments.

In [165]:
import datetime as dt
result_list = []      # list of lists
for i in ask_posts:
    created = i[6]
    created_dt = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")   # use strptime to parse date and time
    a_comments = int(i[4])  # number of comments
    result_list.append([created_dt, a_comments])    # append list to result_list

    
counts_by_hour = {}
comments_by_hour = {}

for d in result_list:
    created_dt = d[0]
    comments = d[1]
    hour = created_dt.strftime("%H")              # extract only the hour from the datetime
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1                  # key: hour of post, value: number of posts created
        comments_by_hour[hour] = comments         # key: hour of post, value: number of comments for those posts
                                                  # NOT hour of comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
print(counts_by_hour)
print(comments_by_hour)
    

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}




Using the `datetime` module, I parsed each post's creation date and time and extracted the hour. 
- This *facilitates further analysis* of which posting hour receives the most comments.
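
To make the parse-then-format step concrete, here is a minimal sketch on a single `created_at` value taken from the first rows printed earlier. The variable names `created` and `hour_key` are only illustrative.

```python
import datetime as dt

created = "9/26/2016 3:26"   # a created_at value from the first rows printed earlier

created_dt = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")  # string -> datetime
hour_key = created_dt.strftime("%H")                           # datetime -> zero-padded hour string

print(created_dt)
print(hour_key)
```

The hour comes back as a two-character string, which is why the dictionary keys in the output look like `'02'` rather than `2`.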

Then, I created a frequency table for the *number of posts created in each hour*, and a second table for the *total number of comments received by the posts created in each hour*. In other words:
- `counts_by_hour` is **{hour of post : number of posts}**
- `comments_by_hour` is **{hour of post : number of comments received by posts created in that hour}**

This is useful because I want to calculate the **average number of comments per post in each hour.**
- For example, at 2am, there were 269 posts created, and 2996 comments to those posts.
- As you will see below, that leaves me with ~11.14 average comments per post, for posts created at 2am.
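
The same two tables can be built a little more compactly with `collections.defaultdict`, which removes the need for the "key not seen yet" branch. The sketch below uses a made-up list of (hour, comments) pairs rather than the real posts, and then rechecks the 2am figure from the real totals.

```python
from collections import defaultdict

# Hypothetical (hour, comments) pairs, standing in for the parsed Ask HN posts
sample = [("02", 10), ("02", 4), ("15", 30), ("15", 2), ("15", 1)]

counts = defaultdict(int)     # hour -> number of posts
comments = defaultdict(int)   # hour -> total comments on those posts

for hour, n_comments in sample:
    counts[hour] += 1
    comments[hour] += n_comments

# Average comments per post for each hour in the sample
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)

# The 2am figure quoted above, from the real totals
print(round(2996 / 269, 2))
```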


In [122]:
avg_by_hour = []

for c in comments_by_hour:                          # for each hour key in the dict
    avg_c = comments_by_hour[c] / counts_by_hour[c] # get average comments per post, by hour
    avg_c = round(avg_c, 2)                         # keep 2 decimal places
    avg_by_hour.append([c, avg_c])                  # append [hour, average] to avg_by_hour
print(avg_by_hour)
    
    

[['02', 11.14], ['01', 7.41], ['22', 8.8], ['21', 8.69], ['19', 7.16], ['17', 9.45], ['15', 28.68], ['14', 9.69], ['13', 16.32], ['11', 8.96], ['10', 10.68], ['09', 6.65], ['07', 7.01], ['03', 7.95], ['23', 6.7], ['20', 8.75], ['16', 7.71], ['08', 9.19], ['00', 7.56], ['18', 7.94], ['12', 12.38], ['04', 9.71], ['06', 6.78], ['05', 8.79]]


In [166]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  # swap hour & avg comments, so we can sort by avg comments

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

[[28.68, '15'], [16.32, '13'], [12.38, '12'], [11.14, '02'], [10.68, '10'], [9.71, '04'], [9.69, '14'], [9.45, '17'], [9.19, '08'], [8.96, '11'], [8.8, '22'], [8.79, '05'], [8.75, '20'], [8.69, '21'], [7.95, '03'], [7.94, '18'], [7.71, '16'], [7.56, '00'], [7.41, '01'], [7.16, '19'], [7.01, '07'], [6.78, '06'], [6.7, '23'], [6.65, '09']]


- Having calculated the average comments per post for each hour, I swapped the order of each pair, from [hour, avg comments] to [avg comments, hour].
    - This lets me ***sort by the average comments, from most comments to least.***
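
An alternative to swapping is to pass a `key` function to `sorted`, which sorts on the average while leaving each pair as [hour, avg comments]. This is a sketch of that option, not what the notebook does; the pairs are a subset copied from the output above.

```python
# A few [hour, average] pairs copied from the output above
avg_pairs = [['02', 11.14], ['01', 7.41], ['15', 28.68], ['13', 16.32], ['12', 12.38]]

# Sort on the second element (the average) without building a swapped list
sorted_pairs = sorted(avg_pairs, key=lambda pair: pair[1], reverse=True)
print(sorted_pairs)
```

One difference to be aware of: with the swapped list, ties on the average are broken by the hour string, whereas the `key` version keeps tied pairs in their original order.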


In [176]:
print("The hours of 'Ask HN' posts that receive the most comments:")

for row in sorted_swap[:5]:                    # get top 5 rows
    hour = dt.datetime.strptime(row[1], "%H")  # parse hour, from str to dt
    hour = hour.strftime("%H:%M")              # format 2 digit hour to Hour:Min
    
                            # using str.format, {0} refers to avg comments, {1} refers to the hour
    print("For posts at {1}, there are {0} comments on average.".format(row[0], hour)) 

The hours of 'Ask HN' posts that receive the most comments:
For posts at 15:00, there are 28.68 comments on average.
For posts at 13:00, there are 16.32 comments on average.
For posts at 12:00, there are 12.38 comments on average.
For posts at 02:00, there are 11.14 comments on average.
For posts at 10:00, there are 10.68 comments on average.


### Generalizability of Analysis

**To get a sense of its generalizability, I will run the same analysis on `Show HN` posts, `Other` posts, and on `Overall` posts.**
- To simplify the process, I will create a **function** to replicate my analysis on these other types of posts.
- The analysis on `Overall` posts includes `Ask HN`, `Show HN`, and `Other` posts.

In [215]:
def analysis(p):
    import datetime as dt
    s_result_list = []      # list of [created datetime, number of comments] pairs
    for i in p:
        created = i[6]
        created_dt = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")   # use strptime to parse date and time
        s_comments = int(i[4])  # number of comments
        s_result_list.append([created_dt, s_comments])    # append pair to s_result_list
    s_counts_by_hour = {}
    s_comments_by_hour = {}
    for d in s_result_list:
        created_dt = d[0]
        comments = d[1]
        hour = created_dt.strftime("%H")                # extract only the hour from the datetime
        if hour not in s_counts_by_hour:
            s_counts_by_hour[hour] = 1                  # key: hour of post, value: number of posts created
            s_comments_by_hour[hour] = comments         # key: hour of post, value: number of comments for those posts
                                                        # NOT hour of comment
        else:
            s_counts_by_hour[hour] += 1
            s_comments_by_hour[hour] += comments
    s_avg_by_hour = []
    for c in s_comments_by_hour:                            # for each hour key in the dict
        avg_c = s_comments_by_hour[c] / s_counts_by_hour[c] # get average comments per post, by hour
        avg_c = round(avg_c, 2)                             # keep 2 decimal places
        s_avg_by_hour.append([c, avg_c])                    # append [hour, average] to s_avg_by_hour
    s_swap_avg_by_hour = []
    for row in s_avg_by_hour:
        s_swap_avg_by_hour.append([row[1], row[0]])  # swap hour & avg comments, so we can sort by avg comments
    s_sorted_swap = sorted(s_swap_avg_by_hour, reverse=True)
    print("Here are the hours of posts that receive the most comments. \n", sep='')
    for row in s_sorted_swap[:5]:                    # get top 5 rows
        hour = dt.datetime.strptime(row[1], "%H")  # parse hour, from str to dt
        hour = hour.strftime("%H:%M")              # format 2 digit hour to Hour:Min

                            # using str.format, {0} refers to avg comments, {1} refers to the hour
        print("For posts at {1}, there are {0} comments on average.".format(row[0], hour))
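
A variant that returns the ranked hours instead of printing them can be easier to reuse and to check on small inputs. The sketch below is one possible shape for it; the function name `top_hours`, its `n` parameter, and the sample rows are all hypothetical.

```python
import datetime as dt
from collections import defaultdict

def top_hours(posts, n=5):
    """Return the n [average comments, hour] pairs with the highest averages."""
    counts = defaultdict(int)     # hour -> number of posts
    comments = defaultdict(int)   # hour -> total comments on those posts
    for row in posts:
        created_dt = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
        hour = created_dt.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[4])
    averages = [[round(comments[h] / counts[h], 2), h] for h in counts]
    return sorted(averages, reverse=True)[:n]

# Hypothetical rows in the dataset's column order; only columns 4 and 6 are used
rows = [
    ['1', 'Ask HN: a', '', '1', '10', 'x', '9/26/2016 3:26'],
    ['2', 'Ask HN: b', '', '1', '4', 'y', '9/25/2016 3:05'],
    ['3', 'Ask HN: c', '', '1', '9', 'z', '9/26/2016 15:10'],
]
print(top_hours(rows))
```

Returning the list leaves the formatting of the result to the caller, so the same function could feed both a printed summary and a later comparison between post types.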

In [216]:
print("For 'Show HN' posts:")
analysis(show_posts)

For 'Show HN' posts:
Here are the hours of posts that receive the most comments. 

For posts at 12:00, there are 6.99 comments on average.
For posts at 07:00, there are 6.68 comments on average.
For posts at 11:00, there are 6.0 comments on average.
For posts at 08:00, there are 5.6 comments on average.
For posts at 14:00, there are 5.52 comments on average.


In [217]:
print("For 'Other' posts:")
analysis(other_posts)

For 'Other' posts:
Here are the hours of posts that receive the most comments. 

For posts at 12:00, there are 7.59 comments on average.
For posts at 11:00, there are 7.37 comments on average.
For posts at 02:00, there are 7.18 comments on average.
For posts at 13:00, there are 7.15 comments on average.
For posts at 05:00, there are 6.79 comments on average.


In [218]:
print("For Overall posts:")
analysis(hn)

For Overall posts:
Here are the hours of posts that receive the most comments. 

For posts at 12:00, there are 7.69 comments on average.
For posts at 11:00, there are 7.37 comments on average.
For posts at 13:00, there are 7.34 comments on average.
For posts at 02:00, there are 7.27 comments on average.
For posts at 15:00, there are 7.05 comments on average.


## Conclusion

- If you want to create an `Ask HN` post that receives a lot of comments, the data suggests posting at 3pm (28.68 comments on average). 
    - For `Show HN` posts, the best hour is 12pm. 
        - However, since `Ask HN` posts receive far more comments on average, the difference between hours is much more pronounced for them. 
        - For example, the highest hourly average for `Show HN` posts (6.99) would exceed only three of the 24 hourly averages for `Ask HN` posts. 
    - For `Other` posts, 12pm also comes first.
        - However, like `Show HN`, its peak (7.59) is far below the `Ask HN` peak.
        - Additionally, there is very little difference between the top 5 hours.
    - For `Overall` posts, 12pm also comes first.
        - However, like `Show HN` and `Other`, its peak (7.69) is far below the `Ask HN` peak.
        - Additionally, there is very little difference between the top 5 hours.
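
The comparisons in this list can be rechecked with a few lines of arithmetic on the printed averages. This is only a sketch: the numbers are copied from the outputs above, and the variable names are my own.

```python
# Hourly averages for Ask HN posts, copied from the sorted output above
ask_avgs = [28.68, 16.32, 12.38, 11.14, 10.68, 9.71, 9.69, 9.45, 9.19, 8.96, 8.8, 8.79,
            8.75, 8.69, 7.95, 7.94, 7.71, 7.56, 7.41, 7.16, 7.01, 6.78, 6.7, 6.65]

show_peak = 6.99   # highest hourly average for Show HN posts

# How many Ask HN hours fall below the Show HN peak?
below = [a for a in ask_avgs if a < show_peak]
print(len(below), "of", len(ask_avgs))

# Gap between the 1st and 5th best hour in each category
top5 = {
    'Ask HN': (28.68, 10.68),
    'Show HN': (6.99, 5.52),
    'Other': (7.59, 6.79),
    'Overall': (7.69, 7.05),
}
spreads = {name: round(first - fifth, 2) for name, (first, fifth) in top5.items()}
print(spreads)
```

The gap between the first and fifth hour is many times larger for `Ask HN` than for the other categories, which is what the bullets about "very little difference between the top 5 hours" are pointing at.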
        
        
The generalizability test indicates that the best time to create a post is not consistent across all types of posts.
- While all the analyses point to posting around midday or in the afternoon, the differences between hours for `Show HN` posts, `Other` posts, and `Overall` posts are too small to support a firm recommendation. 

#### To conclude, if you want to create a post that receives a lot of comments:
- First, create an `Ask HN` post.
- Second, create it at around 3pm. 


Further analyses can be done on which days posts receive the most comments.