# Build matrices notebook
This notebook builds the matrices that will be used for analysis. Based on learnings from the exploratory notebook "Starbucks_Capstone_notebook.ipynb", I have defined a set of parameters to be derived from the combination of the portfolio, profile and transcript data. 

## Interaction matrices
In principle, I have decided to build two main matrices: one focusing on the users and one focusing on the offers given. 

Documentation of the columns for each matrix is given in a separate description. XXINSERT_REF

### Offer based interactions
The offers dataframe will have one line per offer given to a user. Each line will consist of data related to the offer, taken from the portfolio data, data related to user identity, taken from the profile data, and data related to user interactions with the given offer, derived from the transcript data. This matrix will be the basis for investigating the user - offer interactions. 

### User based interactions
The profile_expanded dataframe will be built with one row per user (in principle an expansion of the profile.json data). The expansion will provide features about the user: the user details as provided in the original profile data, and aggregated features about the user's offer and spending history. This matrix will be used for segmentation analysis of the users and their interactions. 

## Info
The functions created in this notebook will be moved to different python modules as seen fit. There are a lot of helper methods needed to build both matrices, and these can be investigated in detail below, or in the 




In [1]:
import pandas as pd
import numpy as np
import math
import json
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from collections import OrderedDict
%matplotlib inline

from utils.cleaning import clean_data

In [2]:
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

#for simplicity I will not keep the original dataframes
portfolio, profile, transcript = clean_data(portfolio, profile, transcript)

## Offer based interactions

I need to separate each offer the user receives. Each offer will be treated as a unique offering, identified by the offer_id and the user_id. Based on the instructions accompanying the data, I make a few assumptions and definitions to limit the solution space. 

First, I define a "valid window" as the window from when the offer is viewed until its completion. Completion is defined here as whichever comes first: the offer being completed or the offer expiring. Parameters describing what happens inside this valid window are denoted "...\_in\_window". 

The rest of the total window will then be defined as not valid, or "...\_out\_window". 

Anything happening before the "offer received" time or after the offer duration has passed will not be considered in this matrix. The user based interaction matrix will hold parameters that transcend the specific offering, such as spending outside valid windows and offerings. 
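
As a minimal sketch of the valid window definition above: the helper `valid_window` is hypothetical (it is not one of the functions in this notebook), and it assumes all times are given in hours since the start of the test. The example values are taken from rows of the offer matrix printed later in this notebook.

```python
def valid_window(start, duration, viewed_time=None, completed_time=None):
    """Return (window_start, window_end) of the valid window, or None if there is none.

    Hypothetical helper for illustration; all times are assumed to be in hours.
    """
    end = start + duration
    # the offer closes at completion or expiry, whichever comes first
    close = end if completed_time is None else min(completed_time, end)
    # no valid window if the offer was never viewed, or viewed outside the open period
    if viewed_time is None or not (start <= viewed_time <= close):
        return None
    return viewed_time, close

# received at 408 h, valid for 240 h, viewed at 408 h, completed at 552 h
print(valid_window(408, 240, viewed_time=408, completed_time=552))  # (408, 552)
# received at 168 h, valid for 168 h, viewed at 216 h, never completed
print(valid_window(168, 168, viewed_time=216))  # (216, 336)
# never viewed: no valid window
print(valid_window(0, 168))  # None
```

The length of the returned window corresponds to what the offer matrix stores as time_in_window, for example 552 - 408 = 144 hours in the first case.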

One user can receive a specific offer_id several times. If these offers overlap, it becomes difficult to tell them apart. For now, I assume that this does not happen to any user. It would seem counterproductive to give two bogo offers of the same kind to the same user at the same time, and the user should not need the same information twice. 
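
The assumption above could be checked rather than hoped for. The following is only a sketch: `find_overlapping_repeats` and the toy dataframe are hypothetical, and the column names are assumed to match those of the offer matrix built below. It compares full offer windows (received to expiry), not valid windows.

```python
import pandas as pd

# hypothetical miniature of the offer matrix: one row per offer received
toy_offers = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u1', 'u2'],
    'offer_id':   ['A',  'A',  'B',  'A'],
    'start_time': [408,  576,  504,  0],
    'end_time':   [648,  816,  672,  168],
})

def find_overlapping_repeats(offers):
    """Return rows that start before the previous window of the same user and offer has ended."""
    ordered = offers.sort_values(['user_id', 'offer_id', 'start_time'])
    # end time of the previous occurrence of the same offer for the same user (NaN for the first one)
    previous_end = ordered.groupby(['user_id', 'offer_id'])['end_time'].shift()
    # comparisons against NaN are False, so first occurrences are never flagged
    return ordered[ordered['start_time'] < previous_end]

overlaps = find_overlapping_repeats(toy_offers)
print(overlaps)
```

Here only the second occurrence of offer A for user u1 is flagged, since it starts at 576 while the first occurrence runs until 648. A stricter variant could end each window at its completion time instead of at expiry.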

I also do not treat a user differently when two or more different offers are active at the same time. Transaction data is aggregated per offer the user receives. Thus, a transaction that occurs inside the window of more than one offer will be double booked. In principle this is no problem when focusing on each offer by itself, as we cannot (at least not immediately) tell which offer influenced the user the most. Hence, I assume that each offer is independent of the others, regardless of when it occurs. 
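
To make the double booking concrete, here is a small sketch with made-up numbers. The transactions, the two windows and the helper `amount_in_window` are all hypothetical; the filter mirrors the inclusive time comparison used when aggregating per offer.

```python
import pandas as pd

# hypothetical transactions for one user (time in hours, amount in dollars)
transactions = pd.DataFrame({'time': [100, 150, 300], 'amount': [5.0, 10.0, 20.0]})
# two hypothetical valid windows that overlap between t=120 and t=200
windows = [(50, 200), (120, 320)]

def amount_in_window(transactions, window_start, window_end):
    """Sum the amounts of all transactions inside [window_start, window_end]."""
    in_window = (transactions['time'] >= window_start) & (transactions['time'] <= window_end)
    return float(transactions.loc[in_window, 'amount'].sum())

per_offer = [amount_in_window(transactions, s, e) for s, e in windows]
total_spent = float(transactions['amount'].sum())

print(per_offer)       # [15.0, 30.0]
print(sum(per_offer))  # 45.0
print(total_spent)     # 35.0
```

The transaction at t=150 falls inside both windows, so the per-offer amounts add up to more than the user actually spent. This is acceptable per offer, but it means the per-offer amounts cannot simply be summed per user.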



In [65]:
def get_user_offer_ids(user_transcript):
    """
    Extracts offer ids presented to the user    
    """
    offer_ids = [(i, offer_id) for i, offer_id in
                 enumerate(user_transcript.loc[user_transcript['event'] == 'offer received', 'offer_id'])]
    return offer_ids

def get_user_offer_starts(user_transcript):
    """
    Extracts start times of offers presented to the user    
    """
    offers_start = np.array(user_transcript.loc[user_transcript['event'] == 'offer received', 'time'])
    return offers_start

def get_user_offer_types(portfolio, offer_ids):
    """
    Extracts offer types of offers presented to the user    
    """
    offers_type = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'offer_type'].values.astype(str)[0] for offer_id in offer_ids])
    return offers_type

def get_user_offer_difficulties(portfolio, offer_ids):
    """
    Extracts difficulty of offers presented to the user    
    """
    offers_difficulty = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'difficulty'].values.astype(str)[0] for offer_id in offer_ids])
    return offers_difficulty

def get_user_offer_rewards(portfolio, offer_ids):
    """
    Extracts rewards of offers presented to the user    
    """
    offers_reward = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'reward'].values.astype(str)[0] for offer_id in offer_ids])
    return offers_reward
    
def get_user_offer_durations(portfolio, offer_ids):
    """
    Extracts durations of offers presented to the user, converted from days to hours    
    """
    offers_duration = np.array(
        [portfolio.loc[portfolio['id'] == offer_id, 'duration'].values.astype(int)[0] * 24 for offer_id in offer_ids])
    return offers_duration
    
def get_user_offer_views(user_transcript):
    """
    Extracts time and offer id of the "offer viewed" events of the user    
    """
    offers_viewed = np.array(user_transcript.loc[user_transcript['event'] == 'offer viewed', ['time', 'offer_id']])
    return offers_viewed

def get_user_offer_completions(user_transcript):
    """
    Extracts time and offer id of the "offer completed" events of the user    
    """
    offers_completed = np.array(
        user_transcript.loc[user_transcript['event'] == 'offer completed', ['time', 'offer_id']])
    return offers_completed

    
def build_offer_df(portfolio, profile, transcript):
    #iterate over users
    users = profile['id'].unique()
    
    offers = {}
    count = 0
    count_users_no_offer = 0
    for user in users:
        # transcripts for specific user
        user_transcript = transcript.loc[transcript['id'] == user, :]
        user_transactions = user_transcript.loc[user_transcript['event'] == 'transaction', ['time', 'amount']]        
        offer_ids_tuples = get_user_offer_ids(user_transcript)
        if len(offer_ids_tuples)<1: #if there are no offers given to user, skip the rest. 
            count_users_no_offer += 1
            continue
        offer_ids = list(list(zip(*offer_ids_tuples))[1])
        
        offers_start = get_user_offer_starts(user_transcript)
        offers_duration = get_user_offer_durations(portfolio, offer_ids)
        offers_difficulty = get_user_offer_difficulties(portfolio, offer_ids)
        offers_reward = get_user_offer_rewards(portfolio, offer_ids) 
        offers_type = get_user_offer_types(portfolio, offer_ids)
        offers_viewed = get_user_offer_views(user_transcript) 
        offers_completed = get_user_offer_completions(user_transcript)
        
        offers_end = offers_start + offers_duration
        
        #Test if results are as expected
        assert len(offer_ids) == len(offers_start) , "The number of offerings ({}) are not the same as the number of starting points ({})".format(len(offer_ids), len(offers_start))
        assert len(offer_ids) == len(offers_type) , "The number of offerings ({}) are not the same as the number of offer types ({})".format(len(offer_ids), len(offers_type))
        assert len(offer_ids) == len(offers_difficulty) , "The number of offerings ({}) are not the same as the number of offer difficulties ({})".format(len(offer_ids), len(offers_difficulty))
        assert len(offer_ids) == len(offers_reward) , "The number of offerings ({}) are not the same as the number of offer rewards ({})".format(len(offer_ids), len(offers_reward))
        assert len(offer_ids) == len(offers_duration) , "The number of offerings ({}) are not the same as the number of offer durations ({})".format(len(offer_ids), len(offers_duration))
        
        #iterate over offers and build dict to be used to fill a dataframe
        for i, offer_id in offer_ids_tuples:
            start = offers_start[i]
            duration = offers_duration[i]
            end = offers_end[i]
            kind = offers_type[i]
            reward = offers_reward[i]
            difficulty = offers_difficulty[i]
            
            # identify completion event within the offer
            completed_time = None
            completed = 0 #0 if no completion event, 1 if completion event
            for time, completion_offer_id in offers_completed:
                if completion_offer_id == offer_id and time >= start and time <= end:
                    completed_time = time
                    completed = 1
                    break
                    
            # identify view event within the offer, views after completion will be regarded as not viewed
            viewed_time = None
            viewed = 0 #0 if no view event, 1 if view event
            for time, viewed_offer_id in offers_viewed:
                # compare against None explicitly, since a completion at time 0 is falsy
                if completed_time is not None:
                    if time > completed_time: #do not accept if time of viewing is after time of completion
                        break
                if viewed_offer_id == offer_id and time >= start and time <= end:
                    viewed_time = time
                    viewed = 1
                    break
            
            # calculate valid window related parameters
            time_in_window = 0
            amount_in_window = 0
            if viewed:
                # time from viewed to completion or end of offer window. 
                if completed_time is not None:
                    time_in_window = completed_time - viewed_time
                else:
                    time_in_window = end - viewed_time
                # cumulative amount spent in valid window, if no valid window, no amount spent due to offer
                transactions_in_window = user_transactions.loc[(user_transactions['time'] >= viewed_time) &
                                                               (user_transactions['time'] <= viewed_time + time_in_window), :]
                
                amount_in_window = transactions_in_window['amount'].sum()
            
            
            
            
            offers.update({count: {'offer_id': offer_id,
                                   'user_id': user,
                                   'offer_type': kind,
                                   'difficulty': difficulty,
                                   'reward': reward,
                                   'start_time': start,
                                   'duration': duration,
                                   'end_time': end,
                                   'viewed': viewed,
                                   'view_time': viewed_time,
                                   'completed': completed,
                                   'complet_time': completed_time,
                                   'time_in_window': time_in_window,
                                   'amount_in_window': amount_in_window}})
            count+=1
    dtype = [str, str, str, float, float, float, float, float, int, float, int, float, float, float]
    offer_df = pd.DataFrame.from_dict(offers, orient='index')
    print("{} received no offer".format(count_users_no_offer))
    return offer_df

offer_df = build_offer_df(portfolio, profile, transcript)
pd.to_pickle(offer_df, 'offer_df.pkl')

6 received no offer


In [70]:
offer_df.head()

Unnamed: 0,offer_id,user_id,offer_type,difficulty,reward,start_time,duration,end_time,viewed,view_time,completed,complet_time,time_in_window,amount_in_window
0,2906b810c7d4411798c6938adc9daaa5,68be06ca386d4c31939f3a4f0e3dd783,discount,10,2,168,168,336,1,216.0,0,,120,0.0
1,0b1e1539f2cc45b7b9fa7c272da2e1d7,68be06ca386d4c31939f3a4f0e3dd783,discount,20,5,336,240,576,1,348.0,0,,228,10.52
2,fafdcd668e3743c1bb461111dcafc2a4,68be06ca386d4c31939f3a4f0e3dd783,discount,10,2,408,240,648,1,408.0,1,552.0,144,10.17
3,2298d6c36e964ae4a3e7e9706d1fb8c2,68be06ca386d4c31939f3a4f0e3dd783,discount,7,3,504,168,672,1,504.0,1,552.0,48,7.54
4,fafdcd668e3743c1bb461111dcafc2a4,68be06ca386d4c31939f3a4f0e3dd783,discount,10,2,576,240,816,1,582.0,0,,234,9.88


In [3]:
#load the saved dataframe. Don't want to remake it every time. 
offer_df = pd.read_pickle('offer_df.pkl')


All right! We have a matrix with slightly more information. To prepare the interesting columns for analysis, we introduce dummy variables for offer_type. Because the categorical column is still useful for data wrangling, I will not delete it. Instead, subsets of this matrix will be defined later for the different inference methods. 

In [14]:
offer_type_dummies = pd.get_dummies(offer_df.loc[:, 'offer_type'], prefix='type')
offer_df = offer_df.merge(offer_type_dummies, left_index=True, right_index=True)
offer_df.head()

Unnamed: 0,offer_id,user_id,offer_type,difficulty,reward,start_time,duration,end_time,viewed,view_time,completed,complet_time,time_in_window,amount_in_window,type_bogo,type_discount,type_informational
0,2906b810c7d4411798c6938adc9daaa5,68be06ca386d4c31939f3a4f0e3dd783,discount,10,2,168,168,336,1,216.0,0,,120,0.0,0,1,0
1,0b1e1539f2cc45b7b9fa7c272da2e1d7,68be06ca386d4c31939f3a4f0e3dd783,discount,20,5,336,240,576,1,348.0,0,,228,10.52,0,1,0
2,fafdcd668e3743c1bb461111dcafc2a4,68be06ca386d4c31939f3a4f0e3dd783,discount,10,2,408,240,648,1,408.0,1,552.0,144,10.17,0,1,0
3,2298d6c36e964ae4a3e7e9706d1fb8c2,68be06ca386d4c31939f3a4f0e3dd783,discount,7,3,504,168,672,1,504.0,1,552.0,48,7.54,0,1,0
4,fafdcd668e3743c1bb461111dcafc2a4,68be06ca386d4c31939f3a4f0e3dd783,discount,10,2,576,240,816,1,582.0,0,,234,9.88,0,1,0


## User based interactions

The user based interaction matrix is based on the profile matrix. It summarises the user's behaviour inside and outside offer periods. It is built with several candidate features that can help us separate groups and offers. 

We will use the profile as the basis and expand it with all these features.
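
One detail matters for the time features: a user's valid windows may overlap, so the time spent inside any window should count each hour only once. The sketch below illustrates the idea with a hypothetical helper `union_length` and made-up windows. It is a simplified stand-in for the interval merging done further down, and it assumes every window has valid (non-NaN) start and end times.

```python
def union_length(windows):
    """Total time covered by a list of (start, end) windows, counting overlaps once."""
    total = 0
    current_start, current_end = None, None
    for start, end in sorted(windows):
        if current_end is None or start > current_end:
            # disjoint from the running interval: close it and start a new one
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # overlapping: extend the running interval if this window ends later
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

windows = [(50, 200), (120, 320), (400, 450)]
print(union_length(windows))           # 320
print(sum(e - s for s, e in windows))  # 400
```

The first two windows overlap and merge into (50, 320), so the union covers 270 + 50 = 320 hours, while naively summing the window lengths would give 400.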

In [71]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,2017-02-12,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,2017-07-15,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,2018-07-12,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,2017-05-09,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,2017-08-04,,a03223e636434f42ac4c3df47e8bac43,


In [271]:
def merged_intervals(windows):
    """
    Merges overlapping intervals. 
    Expects a list of pairs of the form [[starttime, endtime], [starttime, endtime],...]
    Pairs containing NaN (offers that were never viewed) are dropped. 
    Returns a list of lists sorted by start time, or [[0, 0]] if there is no valid interval
    """
    # drop NaN windows before sorting, since NaN does not compare reliably when sorting
    valid = sorted([[s, e] for s, e in windows if not (np.isnan(s) or np.isnan(e))],
                   key=lambda x: x[0])
    if len(valid) == 0:
        return [[0, 0]]
    intervals = [valid[0]]
    for start, end in valid[1:]:
        if start < intervals[-1][1]: 
            # window overlaps the current interval: extend the interval if the window ends later
            if end > intervals[-1][1]:
                intervals[-1][1] = end
        else:
            intervals.append([start, end])
    return intervals

def build_user_df(portfolio, profile, transcript, offers):
    users = np.array(profile['id'])
    
    user_dict = {}
    max_time = transcript.loc[:,'time'].max()

    for user in users: 
        user_transcript = transcript.loc[transcript['id'] == user, :]
        
        user_transactions = user_transcript.loc[user_transcript['event'] == 'transaction', ['time', 'amount']]    
        user_offers = offers[offers['user_id']==user]
        
        
        total_spent = user_transactions['amount'].sum()
        #It would be tempting to do: total_spent_in_window = user_offers['amount_in_window'].sum()
        # that is not possible since we have overlapping offers, that counts the spending twice. 
        # Instead we have to mask any transaction in the union of time windows
        spent_in_window = 0
        spent_in_discount_window = 0
        spent_in_bogo_window = 0
        spent_in_info_window = 0
        spent_no_window = 0
        for i, row in user_transactions.iterrows():
            #the below test is based on the fact that comparing a value with nan returns false, thus if not viewed, automatically it will be nan
            #the test checks if the transaction time is inside any of the "valid windows" of all offers given to the user. 
            transaction_in_window = np.any((user_offers['view_time'] <= row['time']) & 
                                           (user_offers['view_time'] + user_offers['time_in_window'] >= row['time']))
            if transaction_in_window: 
                spent_in_window += row['amount']
            else:
                spent_no_window += row['amount']
        
        assert np.isclose(spent_in_window + spent_no_window, total_spent, rtol=1e-5, atol=1e-3), 'summation of spendings not correct'

        # Get amount spent in specific windows. Here, double booking is allowed to happen
        spent_in_discount_window = user_offers.loc[user_offers['offer_type']=='discount','amount_in_window'].sum()
        spent_in_bogo_window = user_offers.loc[user_offers['offer_type']=='bogo','amount_in_window'].sum()
        spent_in_info_window = user_offers.loc[user_offers['offer_type']=='informational','amount_in_window'].sum()
        
        # Get time spent in any window
        windows = list(zip(user_offers['view_time'],  user_offers['view_time'] + user_offers['time_in_window']))        
        intervals = merged_intervals(windows)
#         print(windows)
#         print(intervals)
#         print(np.diff(np.array(intervals).transpose(), axis=0).sum())
        time_in_windows = np.diff(np.array(intervals).transpose(), axis=0).sum()
        time_no_windows = max_time - time_in_windows
        
        windows_discount = list(zip(user_offers.loc[user_offers['offer_type']=='discount', 'view_time'],  
                                    user_offers.loc[user_offers['offer_type']=='discount', 'view_time'] + 
                                    user_offers.loc[user_offers['offer_type']=='discount', 'time_in_window']))
        intervals_discount = merged_intervals(windows_discount)
        windows_bogo = list(zip(user_offers.loc[user_offers['offer_type']=='bogo', 'view_time'],  
                                    user_offers.loc[user_offers['offer_type']=='bogo', 'view_time'] + 
                                    user_offers.loc[user_offers['offer_type']=='bogo', 'time_in_window']))
        intervals_bogo = merged_intervals(windows_bogo)
        windows_info = list(zip(user_offers.loc[user_offers['offer_type']=='informational', 'view_time'],  
                                    user_offers.loc[user_offers['offer_type']=='informational', 'view_time'] + 
                                    user_offers.loc[user_offers['offer_type']=='informational', 'time_in_window']))
        intervals_info = merged_intervals(windows_info)
        time_in_discount = np.diff(np.array(intervals_discount).transpose(), axis=0).sum()
        if np.isnan(time_in_discount):
            time_in_discount = 0
        time_in_bogo = np.diff(np.array(intervals_bogo).transpose(), axis=0).sum()
        if np.isnan(time_in_bogo):
            time_in_bogo = 0
        time_in_info = np.diff(np.array(intervals_info).transpose(), axis=0).sum()
        if np.isnan(time_in_info):
            time_in_info = 0
        
        if user_offers.shape[0] == 0:
            print("user {} has no offers to extract data from".format(user))
            view_ratio = 0
            completion_ratio = 0
            view_and_complete_ratio = 0
        else:
            view_ratio = user_offers['viewed'].sum()/user_offers.shape[0]
            completion_ratio = user_offers['completed'].sum()/user_offers.shape[0]
            view_and_complete_ratio = user_offers.loc[(user_offers['completed']==1) & (user_offers['viewed']==1),'start_time'].count()/user_offers.shape[0]
        
        
        
        user_dict.update({user: {'spent_total': total_spent, 
                                 'spent_in_window': spent_in_window,
                                 'spent_no_window': spent_no_window,
                                 'spent_in_discount': spent_in_discount_window,
                                 'spent_in_bogo': spent_in_bogo_window,
                                 'spent_in_informational': spent_in_info_window,
                                 'time_in_window': float(time_in_windows), 
                                 'time_no_window': time_no_windows,
                                 'time_in_discount': time_in_discount,
                                 'time_in_bogo': time_in_bogo,
                                 'time_in_informational': time_in_info,
                                 'view_ratio': view_ratio,
                                 'completion_ratio': completion_ratio,
                                 'view_and_complete_ratio': view_and_complete_ratio,
                                 'num_offers_received': user_offers.shape[0]}})
        
    
    expanded = pd.DataFrame.from_dict(user_dict, orient='index').reset_index().rename(columns={'index':'user_id'})
    
    profile_expanded = pd.merge(profile.sort_values('id'), expanded.sort_values('user_id'), left_on='id', right_on='user_id').drop(columns='id')
    return profile_expanded
    
profile_expanded = build_user_df(portfolio, profile, transcript, offer_df)

pd.to_pickle(profile_expanded, 'profile_exanded_2.pkl')
    
    

user c6e579c6821c41d1a7a6a9cf936e91bb has no offers to extract data from
user da7a7c0dcfcb41a8acc7864a53cf60fb has no offers to extract data from
user eb540099db834cf59001f83a4561aef3 has no offers to extract data from
user 3a4874d8f0ef42b9a1b72294902afea9 has no offers to extract data from
user ae8111e7e8cd4b60a8d35c42c1110555 has no offers to extract data from
user 12ede229379747bd8d74ccdc20097ca3 has no offers to extract data from


In [274]:
profile_expanded

Unnamed: 0,age,became_member_on,gender,income,user_id,spent_total,spent_in_window,spent_no_window,spent_in_discount,spent_in_bogo,spent_in_informational,time_in_window,time_no_window,time_in_discount,time_in_bogo,time_in_informational,view_ratio,completion_ratio,view_and_complete_ratio,num_offers_received
0,33,2017-04-21,M,72000.0,0009655768c64bdeb2e877511632db8f,127.60,30.73,96.87,0.00,0.00,30.73,108.0,606.0,0.0,0.0,108.0,0.400000,0.600000,0.000000,5
1,118,2018-04-25,,,00116118485d4dfda04fdbaba9a87b5c,4.09,0.00,4.09,0.00,0.00,0.00,138.0,576.0,0.0,138.0,0.0,1.000000,0.000000,0.000000,2
2,40,2018-01-09,O,57000.0,0011e0d4e6b944f998e987f904e8c1e5,79.46,33.98,45.48,33.98,22.05,0.00,354.0,360.0,210.0,60.0,144.0,1.000000,0.600000,0.600000,5
3,59,2016-03-04,F,90000.0,0020c2b971eb4e9188eac86d93036a77,196.86,34.87,161.99,17.63,17.24,0.00,126.0,588.0,42.0,84.0,0.0,0.400000,0.600000,0.400000,5
4,24,2016-11-11,F,60000.0,0020ccbbb6d84e358d3414a3ff76cffd,154.05,95.37,58.68,11.65,24.85,58.87,174.0,540.0,54.0,48.0,72.0,1.000000,0.750000,0.750000,4
5,26,2017-06-21,F,73000.0,003d66b6608740288d6cc97a6903f4f0,48.34,30.92,17.42,22.47,0.00,12.59,240.0,474.0,168.0,0.0,96.0,0.800000,0.600000,0.400000,5
6,19,2016-08-09,F,65000.0,00426fe3ffde4c6b9cb9ad6d077a13ea,68.51,49.26,19.25,23.34,0.00,25.92,108.0,606.0,72.0,0.0,36.0,0.400000,0.200000,0.200000,5
7,55,2018-05-08,F,74000.0,004b041fbfe44859945daa2c7f79ee64,138.36,47.85,90.51,19.93,27.92,0.00,162.0,552.0,138.0,24.0,0.0,0.666667,0.666667,0.666667,3
8,54,2016-03-31,M,99000.0,004c5799adbf42868b9cff0396190900,347.38,101.94,245.44,43.21,58.73,0.00,114.0,600.0,48.0,66.0,0.0,0.600000,1.000000,0.600000,5
9,56,2017-12-09,M,47000.0,005500a7188546ff8a767329a2f7c76a,20.36,20.36,0.00,0.00,20.36,0.00,426.0,288.0,0.0,426.0,0.0,0.600000,0.200000,0.000000,5


Just as with the offer matrix, there are columns in profile_expanded that should be converted to dummies before using it for machine learning. However, for manual inference it is useful to have the categorical values, so I will keep them as well. 

In [23]:
profile_expanded = pd.read_pickle('profile_exanded_2.pkl')
gender_dummies = pd.get_dummies(profile_expanded.loc[:,'gender'], prefix='gender', dummy_na=True)
profile_expanded = profile_expanded.merge(gender_dummies, left_index=True, right_index=True)
pd.to_pickle(profile_expanded, 'profile_expanded.pkl')
profile_expanded


Unnamed: 0,age,became_member_on,gender,income,user_id,spent_total,spent_in_window,spent_no_window,spent_in_discount,spent_in_bogo,...,time_in_bogo,time_in_informational,view_ratio,completion_ratio,view_and_complete_ratio,num_offers_received,gender_F,gender_M,gender_O,gender_nan
0,33,2017-04-21,M,72000.0,0009655768c64bdeb2e877511632db8f,127.60,30.73,96.87,0.00,0.00,...,0.0,108.0,0.400000,0.600000,0.000000,5,0,1,0,0
1,118,2018-04-25,,,00116118485d4dfda04fdbaba9a87b5c,4.09,0.00,4.09,0.00,0.00,...,138.0,0.0,1.000000,0.000000,0.000000,2,0,0,0,1
2,40,2018-01-09,O,57000.0,0011e0d4e6b944f998e987f904e8c1e5,79.46,33.98,45.48,33.98,22.05,...,60.0,144.0,1.000000,0.600000,0.600000,5,0,0,1,0
3,59,2016-03-04,F,90000.0,0020c2b971eb4e9188eac86d93036a77,196.86,34.87,161.99,17.63,17.24,...,84.0,0.0,0.400000,0.600000,0.400000,5,1,0,0,0
4,24,2016-11-11,F,60000.0,0020ccbbb6d84e358d3414a3ff76cffd,154.05,95.37,58.68,11.65,24.85,...,48.0,72.0,1.000000,0.750000,0.750000,4,1,0,0,0
5,26,2017-06-21,F,73000.0,003d66b6608740288d6cc97a6903f4f0,48.34,30.92,17.42,22.47,0.00,...,0.0,96.0,0.800000,0.600000,0.400000,5,1,0,0,0
6,19,2016-08-09,F,65000.0,00426fe3ffde4c6b9cb9ad6d077a13ea,68.51,49.26,19.25,23.34,0.00,...,0.0,36.0,0.400000,0.200000,0.200000,5,1,0,0,0
7,55,2018-05-08,F,74000.0,004b041fbfe44859945daa2c7f79ee64,138.36,47.85,90.51,19.93,27.92,...,24.0,0.0,0.666667,0.666667,0.666667,3,1,0,0,0
8,54,2016-03-31,M,99000.0,004c5799adbf42868b9cff0396190900,347.38,101.94,245.44,43.21,58.73,...,66.0,0.0,0.600000,1.000000,0.600000,5,0,1,0,0
9,56,2017-12-09,M,47000.0,005500a7188546ff8a767329a2f7c76a,20.36,20.36,0.00,0.00,20.36,...,426.0,0.0,0.600000,0.200000,0.000000,5,0,1,0,0


For now I am happy with these matrices. I will modify them and use subsets from them as I see fit in the analysis part. 