## About the data

### Post and User Files for Offline Analysis
The best, most current data files to use for offline Post and User analysis are **PostsForAnalysis.txt** and **UsersForAnalysis.txt**. These files are hefty and will require an offline statistical computing environment such as Python or R to analyze. You may change the extension from .txt to .csv if you wish, because both files are comma-separated.

### Post and User Files for Real-time Exploration on data.world
To keep data exploration, queries, and visualizations on data.world smooth and speedy, we randomly sampled the full **PostsForAnalysis** and **UsersForAnalysis** files down to smaller subsets for real-time analysis. These are available below as **PostsForExploration.csv** and **UsersForExploration.csv**.

### AllTopics.csv
*id* - Topic ID number - for use in Product Hunt API requests    
*name* - Topic name / 'slug'    
*description* - Topic description    
*num_followers* - Total number of followers (as of 11-29-2016)    
*num_posts* - Total number of posts (as of 11-29-2016); note that many products are posted to more than one topic    

### PostsForAnalysis.txt and PostsForExploration.csv
Columns 1 through 12 are:    
*id* - Post ID number    
*date* - Date in Month/Day/Year    
*day* - 7 days of week: Sunday through Saturday    
*created\_at* - Date/Time in Year-Month-DayT00:00:00.000-8:00    
*time\_of\_day* - 4 times of day: Morning, Afternoon, Evening, Night - described in more detail below    
*name* - Post name    
*tagline* - Post/product tagline    
*thumbnail\_type* - 4 thumbnail formats: Image, Video, Audio, Book Preview    
*product\_state* - 3 states: Default, Pre\_launch, or No\_Longer\_Online    
*comments\_count* - Number of comments made on the post    
*num_makers* - Number of makers of the product (0 denotes the maker is either not on Product Hunt or hasn't been tagged to the product)    
*num_topics* - Number of topics in which the post was tagged    

Columns 13 through 313 cover every topic in which a post could be tagged within the timeframe of the data.    
TRUE denotes the post was tagged in that topic    
"No data" indicates FALSE; the post was not tagged in that topic    
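
As a rough illustration of the TRUE / "No data" encoding described above, the sketch below turns the topic columns of each row into a list of tagged topics. It uses only the standard library and a tiny made-up sample; the topic names `tech` and `games` and the helper names `is_tagged` and `tagged_topics` are hypothetical, and it assumes that any cell other than the literal text TRUE means the post was not tagged.

```python
import csv
import io

# Assumption: a topic cell holds the literal text "TRUE" when the post is
# tagged; anything else ("No data", an empty cell, ...) counts as not tagged.
def is_tagged(cell):
    return cell.strip().upper() == "TRUE"

def tagged_topics(row, topic_columns):
    """Return the names of the topic columns in which this post is tagged."""
    return [topic for topic in topic_columns if is_tagged(row.get(topic) or "")]

# Made-up sample with two hypothetical topic columns, for illustration only.
sample = io.StringIO(
    "id,name,tech,games,user_id,votes_count\n"
    "1,Example A,TRUE,No data,10,5\n"
    "2,Example B,No data,No data,11,7\n"
)
reader = csv.DictReader(sample)
topic_columns = ["tech", "games"]  # in the real files: columns 13 through 313
tags_by_post = {row["id"]: tagged_topics(row, topic_columns) for row in reader}
print(tags_by_post)
```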

Columns 314 and 315 are:    
*user\_id* - the ID number of the user who posted the post    
*votes\_count* - total number of votes for the post    

##### Time\_of\_day details
A new '*time\_of\_day*' column was added, which is the time of day during which the post was created, using the following heuristic breakdown:    
* Morning: 5am to 11:59:59.999am
* Afternoon: 12pm to 4:59:59.999pm
* Evening: 5pm to 8:59:59.999pm
* Night: 9pm to 4:59:59.999am
* NOTE: All times in America/Los_Angeles timezone
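
The breakdown above can be sketched as a small function. This is only an illustration of the stated boundaries, not the code that produced the column; the function name `time_of_day` is a placeholder, and the sketch assumes each timestamp is already expressed in America/Los_Angeles local time.

```python
from datetime import datetime

def time_of_day(ts):
    """Map a local (America/Los_Angeles) timestamp to one of four buckets."""
    hour = ts.hour
    if 5 <= hour < 12:
        return "Morning"    # 5am to 11:59:59.999am
    if 12 <= hour < 17:
        return "Afternoon"  # 12pm to 4:59:59.999pm
    if 17 <= hour < 21:
        return "Evening"    # 5pm to 8:59:59.999pm
    return "Night"          # 9pm to 4:59:59.999am

# Timestamps written in the style of the created_at column.
for text in ("2016-11-16 00:20:00", "2016-11-16 05:40:53", "2016-11-16 17:00:00"):
    print(text, time_of_day(datetime.fromisoformat(text)))
```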

### UsersForAnalysis.txt and UsersForExploration.csv
*id* - User ID number    
*created\_at* - Date/Time in Year-Month-DayT00:00:00.000-8:00    
*name* - The user's name    
*username* - The user's Product Hunt handle    
*headline* - User headline - typically Title and Company but can be anything or nothing at all    
*twitter\_username* - User's Twitter handle (in the case they signed up via Twitter)    
*website\_url* - User's website    
*collections\_count* - Total number of collections for this user    
*followed\_topics\_count* - Total number of topics this user follows    
*followers\_count* - Number of followers this user has    
*followings\_count* - Number of Hunters this user follows    
*maker\_of\_count* - Number of products this user has made    
*posts\_count* - Number of products this user has hunted    
*votes\_count* - Number of votes this user has submitted    

### Timeframe of data
**AllTopics.csv** contains all topics, including each topic's total number of followers and total number of posts as of 11-29-2016.    

**UsersForAnalysis.txt** contains all users as of 11-30-2016, except hidden users. **UsersForExploration.csv** contains 50,000 randomly-sampled rows from **UsersForAnalysis.txt**.

Although the data dump from the Product Hunt API included posts dated 11-24-2014 through 11-23-2016, the most recent week of posts (11-17-2016 through 11-23-2016) was removed to create **PostsForAnalysis.txt**. Those posts had not yet had ample time to receive votes, so they would on average have fewer votes per post. Because the remaining data still covers nearly two full years of posts, dropping that week costs little. **PostsForExploration.csv** contains 5,000 randomly sampled rows from **PostsForAnalysis.txt**, so it reflects the same removal.
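
The removal and sampling steps described above could be sketched roughly as follows. This is not the procedure actually used to build the files: the helper names `drop_recent` and `sample_rows`, the fixed random seed, and the made-up rows are all assumptions for illustration, and the source does not say how the random sample was drawn.

```python
import random
from datetime import date

# First day of the removed week (11-17-2016), per the description above.
CUTOFF = date(2016, 11, 17)

def drop_recent(posts, cutoff=CUTOFF):
    """Keep only posts dated strictly before the cutoff."""
    return [post for post in posts if post["date"] < cutoff]

def sample_rows(rows, n, seed=0):
    """Pick up to n rows uniformly at random, without replacement.

    The seed is a hypothetical choice made only for reproducibility.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

# Made-up rows, for illustration only.
example_posts = [
    {"id": 1, "date": date(2016, 11, 16)},
    {"id": 2, "date": date(2016, 11, 17)},
    {"id": 3, "date": date(2016, 11, 23)},
    {"id": 4, "date": date(2014, 11, 24)},
]
kept = drop_recent(example_posts)
exploration = sample_rows(kept, 5000)
print([post["id"] for post in kept])
```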

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
posts = pd.read_csv('PostsForAnalysis.csv')
posts.head(5)


Unnamed: 0,id,date,day,created_at,time_of_day,name,tagline,thumbnail_type,product_state,comments_count,...,wi.fi,windows,wine,wordpress,writing.tools,xbox.one,yoga.books,youtube,user_id,votes_count
0,82423,2016-11-16,Wednesday,2016-11-16 00:20:00,Night,A.I. Experiments by Google,"Explore machine learning by playing w/ pics, m...",image,default,24,...,,,,,,,,,61044,1500
1,82480,2016-11-16,Wednesday,2016-11-16 05:40:53,Morning,Init.ai,Build powerful and intelligent conversational ...,image,default,43,...,,,,,,,,,1,802
2,82502,2016-11-16,Wednesday,2016-11-16 10:16:06,Morning,Google Earth VR,Walk the earth in VR,image,default,27,...,,,,,,,,,344208,544
3,82370,2016-11-16,Wednesday,2016-11-16 00:01:00,Night,Drop,Beautiful color picker with Touch Bar support,image,default,38,...,,,,,,,,,28756,446
4,82460,2016-11-16,Wednesday,2016-11-16 01:13:32,Night,Lookback Live,Real-time user research on mobile and desktop ...,image,default,25,...,,,,,,,,,591,416


In [6]:
# The column is named 'comments_count' (plural); 'comment_count' raises a KeyError.
sns.countplot(x='day', data=posts, hue='comments_count', palette='viridis')