# W207, Final Project
Spring, 2018

Team:  Cameron Kennedy, Gaurav Khanna, Aaron Olson

## Data Preparation / Feature Extraction Notebook
Python Notebook 1 of 2

This notebook loads and pre-processes the data.  The other notebook (2 of 2) runs our ML models.

# Introduction
This analysis seeks to predict user churn in a music sharing service.

We will write a more complete description and analysis for submission of our final project.

We prepared the 2 major data tables/frames (User logs & Transactions) independently, then brought them together before analysis.

In [1]:
#Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Load the data and index the members and labels tables by the primary key (msno: a string-like object column that represents the user)

In [2]:
#Load the data
members = pd.read_csv('members_filtered.csv')
transactions = pd.read_csv('transactions_filtered.csv')
user_logs = pd.read_csv('user_logs_filtered.csv')
labels = pd.read_csv('labels_filtered.csv')

#Set indices
members.set_index('msno', inplace = True)
labels.set_index('msno', inplace = True)

#user_logs.head()

Getting some info on the useful data

In [3]:

print('Transactions: \n')
transactions.info()
print('User Logs: \n')
user_logs.info()

Transactions: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 282160 entries, 0 to 282159
Data columns (total 9 columns):
msno                      282160 non-null object
payment_method_id         282160 non-null int64
payment_plan_days         282160 non-null int64
plan_list_price           282160 non-null int64
actual_amount_paid        282160 non-null int64
is_auto_renew             282160 non-null int64
transaction_date          282160 non-null int64
membership_expire_date    282160 non-null int64
is_cancel                 282160 non-null int64
dtypes: int64(8), object(1)
memory usage: 19.4+ MB
User Logs: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4883573 entries, 0 to 4883572
Data columns (total 9 columns):
msno          object
date          int64
num_25        int64
num_50        int64
num_75        int64
num_985       int64
num_100       int64
num_unq       int64
total_secs    float64
dtypes: float64(1), int64(7), object(1)
memory usage: 335.3+ MB


Helper routine to convert the integer dates (YYYYMMDD) to datetime values. This format is convenient for visualization, though less so for analysis.

In [4]:
def pd_to_date(df_col):
    df_col = pd.to_datetime(df_col, format = '%Y%m%d')
    return df_col

#Convert to date
user_logs['date'] = pd_to_date(user_logs['date'])
#user_logs.head()
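
As a quick illustration of what the conversion does, here is a minimal sketch on a toy Series (the values are made up, not taken from user_logs). It assumes the raw dates are integers in YYYYMMDD form, as the info() output above suggests.

```python
import pandas as pd

# Toy integer dates in YYYYMMDD form, standing in for user_logs['date']
raw_dates = pd.Series([20170301, 20170315, 20161231])

# format='%Y%m%d' tells pandas how to read the digits of each integer
parsed = pd.to_datetime(raw_dates, format='%Y%m%d')

# Datetime values support subtraction, which yields a Timedelta
gap_days = (parsed.max() - parsed.min()).days
print(parsed.dt.strftime('%Y-%m-%d').tolist(), gap_days)
```

Once the column is a datetime, min/max and date differences work directly, which the feature code below relies on.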

# User Logs Data: Preparation and Feature Extraction

In [None]:
#Create our groupby user object 
user_logs_gb = user_logs.groupby(['msno'], sort=False)

The list of features 

* User most recent date (max date)
* User first date (min date)
* How long they've been listening:  Min vs. max date by user
* Matrix of all the following (cartesian product)
    * Total X=(seconds, 100, 985, 75, 50, 25, unique), avg per day of X, maybe median per day of X
    * Last 7, 14, 31, 90, 180, 365 days, and total (note windows are relative to each user's last day)
 

In [None]:
#This cell is slow

#Append max date to every row in main table
user_logs['max_date'] = user_logs_gb['date'].transform('max')
user_logs['days_before_max_date'] = (user_logs['max_date'] - user_logs['date']).dt.days
    #The .dt.days accessor converts the timedelta to an integer, for easier comparisons later.

#Generate user's first date, last date, and tenure
#Also, the user_logs_features table will be the primary table to return from the user logs section
user_logs_features = (user_logs_gb
    .agg({'date':['max', 'min', lambda x: (max(x) - min(x)).days]})  #.days converts to int
    .rename(columns={'max': 'max_date', 'min': 'min_date','<lambda>':'listening_tenure'})
                      )
#Add a 3rd level, used for joining data later
user_logs_features = pd.concat([user_logs_features], axis=1, keys=['date_features'])
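
The logic of the cell above can be sketched on a toy log. The two users and their dates are hypothetical, and the per-user tenure is computed here with a column subtraction rather than the lambda used above; this is just one possible equivalent.

```python
import pandas as pd

# Toy log: two hypothetical users with a few listening dates each
logs = pd.DataFrame({
    'msno': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-05', '2017-01-10',
                            '2017-02-01', '2017-02-03']),
})

# Broadcast each user's latest date back onto every one of their rows
logs['max_date'] = logs.groupby('msno')['date'].transform('max')

# .dt.days turns the Timedelta column into integers in one vectorized step
logs['days_before_max_date'] = (logs['max_date'] - logs['date']).dt.days

# Per-user first date, last date, and tenure in days
per_user = logs.groupby('msno')['date'].agg(['min', 'max'])
per_user['listening_tenure'] = (per_user['max'] - per_user['min']).dt.days

print(logs['days_before_max_date'].tolist())
print(per_user['listening_tenure'].to_dict())
```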

In [None]:
user_logs_features.head()

In [None]:
#Create Features:
    # Total X=(seconds, 100, 985, 75, 50, 25, unique) and avg per day of X
    # Windows: last 7, 14, 31, 90, 180, 365 days, and total (relative to each user's last day)
    
#999 days acts as the "total" window (this assumes no user's history is longer than 999 days)
for num_days in [7, 14, 31, 90, 180, 365, 999]:
    #Create groupby object for items with x days
    ul_gb_xdays = (user_logs.loc[(user_logs['days_before_max_date'] < num_days)]
                   .groupby(['msno'], sort=False))

    #Generate sum and mean (and count, once) for all the user logs stats
    past_xdays_by_user = (ul_gb_xdays
        .agg({'num_unq':['sum', 'mean', 'count'],
              'total_secs':['sum', 'mean'],
              'num_25':['sum', 'mean'],
              'num_50':['sum', 'mean'],
              'num_75':['sum', 'mean'],
              'num_985':['sum', 'mean'],
              'num_100':['sum', 'mean'],
             })
                      )
    #Append level header
    past_xdays_by_user = pd.concat([past_xdays_by_user], axis=1, keys=['within_days_' + str(num_days)])

    #Join (append) to user_logs_features table
    user_logs_features = user_logs_features.join(past_xdays_by_user, how='inner')
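
To make the windowing loop easier to follow, here is a minimal sketch on toy data with only one statistic column and two windows. The values are hypothetical, and the frames are combined with pd.concat instead of repeated joins, which is an alternative and not what the cell above does.

```python
import pandas as pd

# Toy log with a precomputed days_before_max_date column (hypothetical values)
logs = pd.DataFrame({
    'msno': ['a', 'a', 'a', 'b', 'b'],
    'days_before_max_date': [0, 3, 10, 0, 8],
    'total_secs': [100.0, 200.0, 400.0, 50.0, 150.0],
})

window_frames = []
for num_days in [7, 14]:
    # Keep only rows that fall inside the window, then aggregate per user
    in_window = logs[logs['days_before_max_date'] < num_days]
    stats = in_window.groupby('msno').agg({'total_secs': ['sum', 'mean', 'count']})
    # Add the window name as an extra (top) column header level
    stats = pd.concat([stats], axis=1, keys=['within_days_' + str(num_days)])
    window_frames.append(stats)

# All frames share the msno index, so they can be placed side by side
features = pd.concat(window_frames, axis=1)
print(features)
```

Each column is then addressed by a 3-part key such as ('within_days_7', 'total_secs', 'sum').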

In [None]:
#Next, let's look at changes in last 7 days vs. last 30 days, and last 30 days vs. last 180 days.

#Also, need to think about users with < x days tenure.
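
One possible way to express the "recent vs. longer window" change is a ratio of per-day means. The sketch below is only an idea under assumptions: the column names mean_secs_7, mean_secs_30, and secs_ratio_7_vs_30 are hypothetical, and the numbers are made up.

```python
import pandas as pd

# Hypothetical per-user mean daily listening seconds for two windows
means = pd.DataFrame({
    'mean_secs_7': [120.0, 30.0, 0.0],
    'mean_secs_30': [100.0, 60.0, 0.0],
}, index=pd.Index(['a', 'b', 'c'], name='msno'))

# Ratio > 1 suggests rising recent activity, < 1 suggests falling activity.
# Replace a zero denominator with NaN first so the division yields NaN, not inf.
denominator = means['mean_secs_30'].replace(0, float('nan'))
means['secs_ratio_7_vs_30'] = means['mean_secs_7'] / denominator
print(means)
```

For users whose tenure is shorter than the smaller window, both windows contain the same rows and the ratio is 1, so such users may need separate handling.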

In [None]:
#Join members and labels files
features_all = None
features_all = members.join(labels, how='inner')
features_all = features_all.join(user_logs_features, how='inner')

#Note: the warning raised by this join (single-level vs. multi-level column headers) is okay,
#and actually helps us by flattening our column headers.

# Test
features_all.head()
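
The join above relies on pandas flattening the multi-level headers for us. How pandas treats joins between frames with different numbers of column levels may differ between versions, so an explicit flatten is a possible alternative. A minimal sketch on toy frames follows; the names multi, flat, and joined are hypothetical.

```python
import pandas as pd

# Toy frame with 3-level column headers, like user_logs_features
multi = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=pd.Index(['a', 'b'], name='msno'),
    columns=pd.MultiIndex.from_tuples([
        ('within_days_7', 'total_secs', 'sum'),
        ('within_days_7', 'total_secs', 'mean'),
    ]),
)

# Toy single-level frame, like members joined with labels
flat = pd.DataFrame({'is_churn': [0, 1]},
                    index=pd.Index(['a', 'b'], name='msno'))

# Collapse each column tuple into one underscore-separated string
multi_flat = multi.copy()
multi_flat.columns = ['_'.join(col) for col in multi.columns]

# Both frames now have single-level columns, so the join is unambiguous
joined = flat.join(multi_flat, how='inner')
print(joined.columns.tolist())
```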

# Transaction Data: Preparation and Feature Extraction

Grouping by the primary key (MSNO)

In [None]:
# Grouping by the member (msno)
transactions_gb = transactions.sort_values(["transaction_date"]).groupby(['msno'])

# How many groups i.e. members i.e. msno's. We're good if this is the same number as the members table
print('%d Groups/msnos' %(len(transactions_gb.groups)))

The list of features 

    * Latest transaction
        * Plan number of days for the latest transaction
        * Actual amount paid per day for the latest transaction
        * Total amount paid for the latest transaction
        * is_auto_renew for the latest transaction
        * is_cancel for the latest transaction
    * Aggregate values
        * Total number of plan days
        * Total of all the amounts paid across plans
    * Comparing transactions
        * Plan day difference between the latest and previous transaction
        * Amount paid per day difference between the latest and previous transaction
    ....


Aggregate values

In [None]:
# Features: Total_plan_days, Total_amount_paid
transactions_features = (transactions_gb
    .agg({'payment_plan_days':'sum', 'actual_amount_paid':'sum' })
    .rename(columns={'payment_plan_days': 'Total_plan_days', 'actual_amount_paid': 'Total_amount_paid',})
          )
# Test
# transactions_features.head()
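
For reference, the same totals can be written with pandas named aggregation, which sets the output column names in one step instead of agg followed by rename. This is an alternative sketch on toy transactions (hypothetical values), not the code used above.

```python
import pandas as pd

# Toy transactions for two hypothetical users
tx = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'payment_plan_days': [30, 30, 90],
    'actual_amount_paid': [149, 149, 400],
})

# Named aggregation: output_column=(input_column, function)
totals = tx.groupby('msno').agg(
    Total_plan_days=('payment_plan_days', 'sum'),
    Total_amount_paid=('actual_amount_paid', 'sum'),
)
print(totals)
```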

Latest transaction. Because the rows were sorted by date before grouping, we pick the last row of each group.

In [None]:
# Features: latest transaction, renaming the columns
# tail(1) takes an integer and returns the last row of each group
latest_transaction_gb = transactions_gb.tail(1).rename(columns={
    'payment_plan_days': 'latest_plan_days',
    'actual_amount_paid': 'latest_amount_paid',
    'is_auto_renew': 'latest_auto_renew',
    'transaction_date': 'latest_transaction_date',
    'membership_expire_date': 'latest_expire_date',
    'is_cancel': 'latest_is_cancel'})

# Index by msno
latest_transaction_gb.set_index('msno', inplace = True)

# Test
# latest_transaction_gb.head()
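
The "sort, group, take the last row" pattern can be checked on a small example. The transactions below are made up and deliberately out of date order.

```python
import pandas as pd

# Toy transactions, intentionally not in date order
tx = pd.DataFrame({
    'msno': ['a', 'b', 'a', 'b'],
    'transaction_date': [20170201, 20170105, 20170101, 20170301],
    'payment_plan_days': [30, 30, 7, 90],
})

# Sort by date first; groupby keeps the row order within each group,
# so tail(1) returns each user's most recent transaction
latest = (tx.sort_values('transaction_date')
            .groupby('msno')
            .tail(1)
            .set_index('msno'))
print(latest)
```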

In [None]:
# Actual amount paid per day for the latest transaction
# Adding the column amount_paid_per_day

latest_transaction_gb['amount_paid_per_day'] = (latest_transaction_gb['latest_amount_paid']/latest_transaction_gb['latest_plan_days'])

# Test
latest_transaction_gb.head()
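
If any latest transaction has 0 plan days (we have not verified whether this occurs in the filtered data), the division above produces inf or NaN. The sketch below shows that behavior on toy values and one possible way to handle it; treating inf as missing is an assumption, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy latest-transaction rows, including zero plan days (hypothetical values)
latest = pd.DataFrame({
    'latest_amount_paid': [149, 0, 100],
    'latest_plan_days': [30, 0, 0],
}, index=pd.Index(['a', 'b', 'c'], name='msno'))

# Plain division: 0/0 gives NaN and 100/0 gives inf
raw = latest['latest_amount_paid'] / latest['latest_plan_days']

# One option (an assumption): treat infinite values as missing
latest['amount_paid_per_day'] = raw.replace([np.inf, -np.inf], np.nan)
print(latest)
```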

In [None]:
# TODO Differences among latest and previous transaction
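
One way the TODO above could be approached is with a grouped diff. This is only a sketch on toy data; the column name plan_days_diff is hypothetical.

```python
import pandas as pd

# Toy transactions: user 'a' has three, user 'b' has only one
tx = pd.DataFrame({
    'msno': ['a', 'a', 'a', 'b'],
    'transaction_date': [20170101, 20170201, 20170301, 20170115],
    'payment_plan_days': [7, 30, 90, 30],
})

tx = tx.sort_values('transaction_date')

# diff() within each user gives the current value minus the previous one
tx['plan_days_diff'] = tx.groupby('msno')['payment_plan_days'].diff()

# Keep the latest row per user; users with a single transaction get NaN
latest_diff = tx.groupby('msno').tail(1).set_index('msno')['plan_days_diff']
print(latest_diff)
```

Users with a single transaction have no previous row to compare against, so their difference is NaN and would need a fill rule before modeling.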

Getting all the transaction features in a single DF

In [None]:
# Get all transaction features in a single DF
transactions_features = transactions_features.join(latest_transaction_gb, how = 'inner')

# Test
transactions_features.head()

# Bringing All the Features into a Single DataFrame and File

Members and Labels were joined into the User logs DF

The code below joins the Transaction features into the primary features dataframe

In [None]:
# Joining feature DF's
features_all = features_all.join(transactions_features, how='inner')

In [None]:
# Test
features_all.head()

In [None]:
#Write all features to pkl
features_all.to_pickle('features_all.pkl')

#Writing the features to a .pkl file allows us to use the 2nd ipynb file
#without having to run all the code above
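
For completeness, a minimal round-trip sketch of the pickle step. It writes a toy frame to a temporary path (hypothetical; the notebook itself writes 'features_all.pkl') and reads it back the way the second notebook presumably would.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for features_all
features = pd.DataFrame({'is_churn': [0, 1]},
                        index=pd.Index(['a', 'b'], name='msno'))

# Write to a temporary location so the demo leaves the working directory alone
path = os.path.join(tempfile.mkdtemp(), 'features_demo.pkl')
features.to_pickle(path)

# read_pickle restores the frame, including its index and dtypes
restored = pd.read_pickle(path)
print(restored.equals(features))
```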