# Yammer Dataset

## Datasets
1. [Users](#Table-1:-Users): 19066 entries
2. [Events](#Table-2:-Events): 340832 entries
3. [Email Events](#Table-3:-Email-Events): 90389 entries
4. [Rollup Periods](#Table-4:-Rollup-Periods): 56002 entries

In [3]:
import pandas as pd

### Table 1: `Users`
This table includes one row per user, with descriptive information about that user's account.   

|**FEATURES**| |
|:---------|---------:
|**user_id** |A unique ID per user    |   
|**created_at** |The time the user was created (first signed up) |
|**company_id**|The ID of the user's company  |
|**language**|The chosen language of the user |  
|**activated_at**|The time the user was activated, if they are active|  
|**state**|The state of the user (active or pending)  |


In [13]:
user=pd.read_csv('./dataset/yammer_users.csv') 
user.head()

Unnamed: 0,user_id,created_at,company_id,language,activated_at,state
0,0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active
1,1.0,2013-01-01 13:07:46,28.0,english,,pending
2,2.0,2013-01-01 10:59:05,51.0,english,,pending
3,3.0,2013-01-01 18:40:36,2800.0,german,2013-01-01 18:42:02,active
4,4.0,2013-01-01 14:37:51,5110.0,indian,2013-01-01 14:39:05,active


In [14]:
user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19066 entries, 0 to 19065
Data columns (total 6 columns):
user_id         19066 non-null float64
created_at      19066 non-null object
company_id      19066 non-null float64
language        19066 non-null object
activated_at    9381 non-null object
state           19066 non-null object
dtypes: float64(2), object(4)
memory usage: 893.8+ KB
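
The `info()` output above shows the ID columns stored as `float64` and both timestamps stored as `object` (strings). A minimal cleanup sketch follows. It runs on a small hand-built frame that mirrors the preview rows rather than the real CSV, and the `time_to_activate` column is a hypothetical addition for illustration, not part of the dataset.

```python
import pandas as pd

# Small stand-in for yammer_users.csv, with values copied from the preview above.
users = pd.DataFrame({
    "user_id": [0.0, 1.0, 3.0],
    "created_at": ["2013-01-01 20:59:39", "2013-01-01 13:07:46", "2013-01-01 18:40:36"],
    "company_id": [5737.0, 28.0, 2800.0],
    "language": ["english", "english", "german"],
    "activated_at": ["2013-01-01 21:01:07", None, "2013-01-01 18:42:02"],
    "state": ["active", "pending", "active"],
})

# The IDs are whole numbers stored as floats; cast them to integers.
users = users.astype({"user_id": "int64", "company_id": "int64"})

# Parse the timestamp strings; missing activation times become NaT.
for col in ["created_at", "activated_at"]:
    users[col] = pd.to_datetime(users[col])

# Hypothetical derived column: delay between signup and activation.
# It stays NaT for pending users, who have no activation time.
users["time_to_activate"] = users["activated_at"] - users["created_at"]

print(users.dtypes)
```

With the real file, the same steps would apply to the frame returned by `pd.read_csv`. The integer cast assumes the ID columns contain no missing values, which matches the non-null counts reported above.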


***

### Table 2: `Events`
This table includes one row per event, where an event is an action that a user has taken on Yammer. These events include login events, messaging events, search events, events logged as users progress through a signup funnel, and events around received emails.

|**FEATURES**| |
|:---------|---------:
|**user_id** |A unique ID per user; here, the ID of the user logging the event    |   
|**occurred_at** |The time the event occurred |
|**event_type**|The general event type. |
|**event_name**|The specific action the user took. |  
|**location**|The country from which the event was logged|  
|**device**|The type of device used to log the event  |

`event_type`   
>There are two values in this dataset: 
<p><strong>"signup_flow"</strong>: which refers to anything occurring during the process of a user's authentication;    
        <p><strong>"engagement"</strong>: which refers to general product usage after the user has signed up for the first time.    

`event_name` 
><p><strong>create_user:</strong> User is added to Yammer's database during signup process
        <p><strong>enter_email:</strong> User begins the signup process by entering her email address
        <p><strong>enter_info:</strong> User enters her name and personal information during signup process
        <p><strong>complete_signup:</strong> User completes the entire signup/authentication process
        <p><strong>home_page:</strong> User loads the home page
        <p><strong>like_message:</strong> User likes another user's message
        <p><strong>login:</strong> User logs into Yammer
        <p><strong>search_autocomplete:</strong> User selects a search result from the autocomplete list
        <p><strong>search_run:</strong> User runs a search query and is taken to the search results page
        <p><strong>search_click_result_X:</strong> User clicks search result X on the results page, where X is a number from 1 through 10.
        <p><strong>send_message:</strong> User posts a message
        <p><strong>view_inbox:</strong> User views messages in her inbox
            

In [11]:
events=pd.read_csv('./dataset/yammer_experiments.csv') 
events.head()

Unnamed: 0,user_id,occurred_at,event_type,event_name,location,device,user_type
0,10522.0,2014-05-02 11:02:39,engagement,login,Japan,dell inspiron notebook,3.0
1,10522.0,2014-05-02 11:02:53,engagement,home_page,Japan,dell inspiron notebook,3.0
2,10522.0,2014-05-02 11:03:28,engagement,like_message,Japan,dell inspiron notebook,3.0
3,10522.0,2014-05-02 11:04:09,engagement,view_inbox,Japan,dell inspiron notebook,3.0
4,10522.0,2014-05-02 11:03:16,engagement,search_run,Japan,dell inspiron notebook,3.0


In [12]:
events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340832 entries, 0 to 340831
Data columns (total 7 columns):
user_id        340832 non-null float64
occurred_at    340832 non-null object
event_type     340832 non-null object
event_name     340832 non-null object
location       340832 non-null object
device         340832 non-null object
user_type      325255 non-null float64
dtypes: float64(2), object(5)
memory usage: 18.2+ MB
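
Because `event_type` separates signup activity from product usage, one common use of this table is counting weekly active users. The sketch below is an illustration under assumptions: the sample rows are invented to match the column layout above, and "active" is taken to mean at least one `engagement` event in a calendar week ending on Sunday.

```python
import pandas as pd

# Invented sample rows that follow the column layout of the events table.
events = pd.DataFrame({
    "user_id": [10522, 10522, 10522, 11, 11, 12],
    "occurred_at": [
        "2014-05-02 11:02:39", "2014-05-02 11:02:53", "2014-05-09 09:00:00",
        "2014-05-01 08:00:00", "2014-05-01 08:05:00", "2014-05-08 10:00:00",
    ],
    "event_type": ["engagement", "engagement", "engagement",
                   "signup_flow", "signup_flow", "engagement"],
    "event_name": ["login", "home_page", "login",
                   "create_user", "enter_email", "login"],
})
events["occurred_at"] = pd.to_datetime(events["occurred_at"])

# Keep product-usage events only; signup_flow events are excluded.
engagement = events[events["event_type"] == "engagement"]

# Count distinct users per week; each bin is labeled by the Sunday that ends it.
weekly_active = (
    engagement
    .groupby(pd.Grouper(key="occurred_at", freq="W-SUN"))["user_id"]
    .nunique()
)
print(weekly_active)
```

Counting distinct `user_id` values, rather than rows, keeps a user who logs many events in one week from being counted more than once.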


*** 


### Table 3: `Email Events`
This table contains events specific to the sending of emails. It is similar in structure to the events table above.

|**FEATURES**| |
|:---------|---------:
|**user_id** |The ID of the user to whom the event relates. Can be joined to user_id in either of the other tables.    |   
|**occurred_at** |The time the event occurred.|
|**action**|The name of the event that occurred. "sent_weekly_digest" means that the user was delivered a digest email showing relevant conversations from the previous day. "email_open" means that the user opened the email. "email_clickthrough" means that the user clicked a link in the email.  |


In [7]:
email_events=pd.read_csv('./dataset/yammer_emails.csv') 
email_events.head()

Unnamed: 0,user_id,occurred_at,action,user_type
0,0.0,2014-05-06 09:30:00,sent_weekly_digest,1.0
1,0.0,2014-05-13 09:30:00,sent_weekly_digest,1.0
2,0.0,2014-05-20 09:30:00,sent_weekly_digest,1.0
3,0.0,2014-05-27 09:30:00,sent_weekly_digest,1.0
4,0.0,2014-06-03 09:30:00,sent_weekly_digest,1.0


In [10]:
email_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90389 entries, 0 to 90388
Data columns (total 4 columns):
user_id        90389 non-null float64
occurred_at    90389 non-null object
action         90389 non-null object
user_type      90389 non-null float64
dtypes: float64(2), object(2)
memory usage: 2.8+ MB
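
Since `user_id` can be joined to the other tables, email activity can be broken down by any user attribute. The following sketch joins a few invented email rows to an invented slice of the users table and tabulates actions by `language`. The overall open and clickthrough ratios at the end are a rough illustration only, because the table does not say which opens belong to which digest.

```python
import pandas as pd

# Invented rows that follow the column layout of the email events table.
email_events = pd.DataFrame({
    "user_id": [0, 0, 0, 0, 1, 1],
    "occurred_at": pd.to_datetime([
        "2014-05-06 09:30:00", "2014-05-06 10:00:00", "2014-05-06 10:01:00",
        "2014-05-13 09:30:00", "2014-05-06 09:30:00", "2014-05-06 12:00:00",
    ]),
    "action": ["sent_weekly_digest", "email_open", "email_clickthrough",
               "sent_weekly_digest", "sent_weekly_digest", "email_open"],
})

# Invented slice of the users table, reduced to the columns needed here.
users = pd.DataFrame({"user_id": [0, 1], "language": ["english", "german"]})

# A left join keeps every email event, even if a user were missing from `users`.
joined = email_events.merge(users, on="user_id", how="left")

# Rows: language; columns: action; cells: number of events.
per_language = pd.crosstab(joined["language"], joined["action"])

# Crude overall ratios across all users.
counts = email_events["action"].value_counts()
open_rate = counts.get("email_open", 0) / counts.get("sent_weekly_digest", 0)
click_rate = counts.get("email_clickthrough", 0) / counts.get("email_open", 0)

print(per_language)
print(open_rate, click_rate)
```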


*** 

### Table 4: `Rollup Periods`
The final table is a lookup table that is used to create rolling time periods. Though you could use the INTERVAL() function, creating rolling time periods is often easiest with a table like this. You won't necessarily need to use this table in queries that you write, but the column descriptions are provided here so that you can understand the query that creates the chart shown above.     

|**FEATURES**| |
|:---------|---------:
|**period_id** |This identifies the type of rollup period. The above dashboard uses period 1007, which is rolling 7-day periods.   |   
|**time_id** |This is the identifier for any given data point &mdash; it's what you would put on a chart axis. If time_id is 2014-08-01, that means it represents the rolling 7-day period leading up to 2014-08-01. |
|**pst_start**|The start time of the period in PST. For 2014-08-01, you'll notice that this is 2014-07-25 &mdash; one week prior. Use this to join events to the table.  |
|**pst_end**|The end time of the period in PST. For 2014-08-01, the end time is 2014-08-01.  |  
|**utc_start**|The same as pst_start, but in UTC time.|  
|**utc_end**|The same as pst_end, but in UTC time. |



In [8]:
rollup_periods=pd.read_csv('./dataset/dimension_rollup_periods.csv') 
rollup_periods.head()

Unnamed: 0,period_id,time_id,pst_start,pst_end,utc_start,utc_end
0,1.0,2013-01-01 00:00:00,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-01 08:00:00,2013-01-02 08:00:00
1,1.0,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-02 08:00:00,2013-01-03 08:00:00
2,1.0,2013-01-03 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-03 08:00:00,2013-01-04 08:00:00
3,1.0,2013-01-04 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-04 08:00:00,2013-01-05 08:00:00
4,1.0,2013-01-05 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00,2013-01-05 08:00:00,2013-01-06 08:00:00


In [9]:
rollup_periods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56002 entries, 0 to 56001
Data columns (total 6 columns):
period_id    56002 non-null float64
time_id      56002 non-null object
pst_start    56002 non-null object
pst_end      56002 non-null object
utc_start    56002 non-null object
utc_end      56002 non-null object
dtypes: float64(1), object(5)
memory usage: 2.6+ MB
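
The table description says to join events to the rollup table using the period start. A rough pandas sketch of that join for rolling 7-day active users is shown below. The rows are invented to match the column descriptions, the window is assumed to be half-open (`pst_start <= occurred_at < pst_end`), and event times are assumed to be in the same time zone as the `pst_*` columns.

```python
import pandas as pd

# Invented rollup rows for period 1007 (rolling 7-day periods).
periods = pd.DataFrame({
    "period_id": [1007, 1007],
    "time_id": pd.to_datetime(["2014-08-01", "2014-08-02"]),
    "pst_start": pd.to_datetime(["2014-07-25", "2014-07-26"]),
    "pst_end": pd.to_datetime(["2014-08-01", "2014-08-02"]),
})

# Invented events, reduced to the columns the join needs.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "occurred_at": pd.to_datetime([
        "2014-07-25 09:00", "2014-07-30 12:00",
        "2014-07-31 08:00", "2014-08-01 10:00",
    ]),
})

# Pair every event with every 7-day period, then keep the pairs
# where the event falls inside the period's window.
weekly = periods[periods["period_id"] == 1007]
pairs = events.merge(weekly, how="cross")
in_window = pairs[
    (pairs["occurred_at"] >= pairs["pst_start"])
    & (pairs["occurred_at"] < pairs["pst_end"])
]

# Distinct users per rolling window, indexed by the window's time_id.
rolling_active = in_window.groupby("time_id")["user_id"].nunique()
print(rolling_active)
```

A cross join grows as events times periods, so on the full tables it would likely need to be restricted to a date range first, or replaced with a range join in SQL.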
