# Introduction

This notebook analyses...

## Core aspects

- cohorts: defined by month of creation of first cash advance (`created_at`)

goal:

- track monthly evolution of key metrics by cohort

key metrics:

- frequency of usage of cash advancements over time
- incident rate
- revenue generated
- new relevant metric (TBD)

## Exploratory Data Analysis (EDA)
1. Conduct an exploratory data analysis to gain a comprehensive understanding of the dataset.

2. Explore key statistics, distributions, and visualizations to identify patterns and outliers.

## Data Quality Analysis
1. Assess the quality of the dataset by identifying missing values, data inconsistencies, and potential errors.

2. Implement data cleaning and preprocessing steps to ensure the reliability of your analysis. 


## Calculate and analyze the following metrics for each cohort:

1. Frequency of Service Usage: Understand how often users from each cohort utilize IronHack Payments' cash advance services over time.

2. Incident Rate: Determine the incident rate, specifically focusing on payment incidents, for each cohort. Identify if there are variations in incident rates among different cohorts.

3. Revenue Generated by the Cohort: Calculate the total revenue generated by each cohort over months to assess the financial impact of user behavior.

4. New Relevant Metric: Propose and calculate a new relevant metric that provides additional insights into user behavior or the performance of IronHack Payments' services.
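As a rough illustration of the first metric, the sketch below counts requests per cohort and calendar month and divides by the number of users in the cohort. The toy frame and the names `requests`, `request_month`, `requests_per_month` and `usage_per_user` are assumptions made for this example, not part of the notebook; the real `created_at` column is timezone-aware and would need its timezone dropped before `to_period`.

```python
import pandas as pd

# Toy stand-in for cashRequest (made-up values, naive timestamps for brevity).
requests = pd.DataFrame({
    "user_id": [10, 10, 20, 20, 30],
    "created_at": pd.to_datetime(
        ["2020-01-15", "2020-02-02", "2020-02-20", "2020-02-25", "2020-02-27"]
    ),
})

# Month of each request and month of the user's first request (the cohort).
requests["request_month"] = requests["created_at"].dt.to_period("M")
first_request = requests.groupby("user_id")["created_at"].transform("min")
requests["cohort"] = first_request.dt.to_period("M")

# Rows: cohorts, columns: calendar months, values: number of requests.
requests_per_month = (
    requests.groupby(["cohort", "request_month"]).size().unstack(fill_value=0)
)

# Normalise by cohort size to get requests per user.
cohort_size = requests.groupby("cohort")["user_id"].nunique()
usage_per_user = requests_per_month.div(cohort_size, axis=0)
print(usage_per_user)
```

Dividing by cohort size makes cohorts of different sizes comparable; whether rejected or cancelled requests should count as usage is a decision this sketch leaves open.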

relevant columns:

cashRequest:
  - `created_at`
  - `updated_at`

## Deliverables
1. Python Code: Provide well-documented Python code that conducts the cohort analysis, including data loading, preprocessing, cohort creation, metric calculation, and visualization.

## Setup requirements

- extract/define cohorts in dataset
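A minimal sketch of the cohort extraction, assuming a cohort is the month of a user's earliest `created_at`. The helper name `add_cohort` and the toy data are hypothetical.

```python
import pandas as pd

# Toy stand-in for cashRequest; only the columns needed for cohorts (made-up values).
requests = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "user_id": [10, 10, 20, 20],
    "created_at": pd.to_datetime(
        ["2020-01-15", "2020-03-02", "2020-02-20", "2020-02-25"], utc=True
    ),
})

def add_cohort(df, user_col="user_id", date_col="created_at"):
    """Return a copy with a 'cohort' column: month of each user's first request."""
    out = df.copy()
    first_request = out.groupby(user_col)[date_col].transform("min")
    # Drop the timezone before converting to a monthly period.
    out["cohort"] = first_request.dt.tz_localize(None).dt.to_period("M")
    return out

cohorts = add_cohort(requests)
print(cohorts[["user_id", "created_at", "cohort"]])
```

Rows without a `user_id` get no cohort this way, so the missing user ids discussed in the EDA need a decision before this step.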

## Table of contents

1. [Introduction](#introduction)
2. [EDA](#eda)  
    a. [Data overview](#data-overview)  
    b. [Data cleaning/quality analysis](#data-cleaning/quality-analysis)  
    c. [Further EDA](#further-eda)
3. [Target data analysis](#target-data-analysis)

# EDA

## Preamble

Loading libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Import data

In [2]:
# First we removed the spaces from the CSV files so we can easily import them here.

# We modified our import process to cast the date columns to proper datatypes directly.
# Float/integer casts are still handled in data cleaning,
# since some of the affected columns contain NaN values (presumably the reason for the wrong automatic casting).


# lists of columns containing dates

datetime_columns_cash_request = [
    "created_at",
    "updated_at",
    "moderated_at",
    "cash_request_received_date",
    "reimbursement_date",
    "money_back_date",
    "send_at",
    "reco_last_update",
    "reco_creation"
]
   
datetime_columns_fees = [
    "created_at",
    "updated_at",
    "paid_at",
    "from_date",
    "to_date"
]


fees = pd.read_csv("../project_dataset/extract-fees-dataanalyst.csv",
                            parse_dates = datetime_columns_fees)
cashRequest = pd.read_csv("../project_dataset/extract-cashrequest-dataanalyst.csv", 
                            parse_dates = datetime_columns_cash_request)



In [3]:
# Get a first look at the data
display(fees.head())

Unnamed: 0,id,cash_request_id,type,status,category,total_amount,reason,created_at,updated_at,paid_at,from_date,to_date,charge_moment
0,6537,14941.0,instant_payment,rejected,,5.0,Instant Payment Cash Request 14941,2020-09-07 10:47:27.423150+00:00,2020-10-13 14:25:09.396112+00:00,2020-12-17 14:50:07.47011+00,,,after
1,6961,11714.0,incident,accepted,rejected_direct_debit,5.0,rejected direct debit,2020-09-09 20:51:17.998653+00:00,2020-10-13 14:25:15.537063+00:00,2020-12-08 17:13:10.45908+00,,,after
2,16296,23371.0,instant_payment,accepted,,5.0,Instant Payment Cash Request 23371,2020-10-23 10:10:58.352972+00:00,2020-10-23 10:10:58.352994+00:00,2020-11-04 19:34:37.43291+00,,,after
3,20775,26772.0,instant_payment,accepted,,5.0,Instant Payment Cash Request 26772,2020-10-31 15:46:53.643958+00:00,2020-10-31 15:46:53.643982+00:00,2020-11-19 05:09:22.500223+00,,,after
4,11242,19350.0,instant_payment,accepted,,5.0,Instant Payment Cash Request 19350,2020-10-06 08:20:17.170432+00:00,2020-10-13 14:25:03.267983+00:00,2020-11-02 14:45:20.355598+00,,,after


In [4]:
# Overview of data in fees
fees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21061 entries, 0 to 21060
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               21061 non-null  int64              
 1   cash_request_id  21057 non-null  float64            
 2   type             21061 non-null  object             
 3   status           21061 non-null  object             
 4   category         2196 non-null   object             
 5   total_amount     21061 non-null  float64            
 6   reason           21061 non-null  object             
 7   created_at       21061 non-null  datetime64[ns, UTC]
 8   updated_at       21061 non-null  datetime64[ns, UTC]
 9   paid_at          15531 non-null  object             
 10  from_date        7766 non-null   object             
 11  to_date          7766 non-null   object             
 12  charge_moment    21061 non-null  object             
dtypes: datetime64[ns

**Observations**:

- `cash_request_id` is automatically cast as `float64`; `int` is more plausible, so change it during cleaning
- date-related columns (`created_at`, `updated_at`, `paid_at`, `from_date`, `to_date`) need special treatment
- even after date casting at import, `paid_at`, `from_date` and `to_date` are still `object` rather than `datetime`

In [5]:
display(cashRequest.head())

Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
0,5,100.0,rejected,2019-12-10 19:05:21.596873+00:00,2019-12-11 16:47:42.407830+00:00,804.0,2019-12-11 16:47:42.405646+00,,2020-01-09 19:05:21.596363+00,NaT,,regular,,,NaT,NaT
1,70,100.0,rejected,2019-12-10 19:50:12.347780+00:00,2019-12-11 14:24:22.900054+00:00,231.0,2019-12-11 14:24:22.897988+00,,2020-01-09 19:50:12.34778+00,NaT,,regular,,,NaT,NaT
2,7,100.0,rejected,2019-12-10 19:13:35.825460+00:00,2019-12-11 09:46:59.779773+00:00,191.0,2019-12-11 09:46:59.777728+00,,2020-01-09 19:13:35.825041+00,NaT,,regular,,,NaT,NaT
3,10,99.0,rejected,2019-12-10 19:16:10.880172+00:00,2019-12-18 14:26:18.136163+00:00,761.0,2019-12-18 14:26:18.128407+00,,2020-01-09 19:16:10.879606+00,NaT,,regular,,,NaT,NaT
4,1594,100.0,rejected,2020-05-06 09:59:38.877376+00:00,2020-05-07 09:21:55.340080+00:00,7686.0,2020-05-07 09:21:55.320193+00,,2020-06-05 22:00:00+00,NaT,,regular,,,NaT,NaT


In [6]:
# Overview of data in cashRequest
cashRequest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23970 entries, 0 to 23969
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   id                          23970 non-null  int64              
 1   amount                      23970 non-null  float64            
 2   status                      23970 non-null  object             
 3   created_at                  23970 non-null  datetime64[ns, UTC]
 4   updated_at                  23970 non-null  datetime64[ns, UTC]
 5   user_id                     21867 non-null  float64            
 6   moderated_at                16035 non-null  object             
 7   deleted_account_id          2104 non-null   float64            
 8   reimbursement_date          23970 non-null  object             
 9   cash_request_received_date  16289 non-null  datetime64[ns]     
 10  money_back_date             16543 non-null  object        

**Observations**:

- `deleted_account_id` and `user_id` needn't be floats (cast to int later)
- date-related columns (`created_at`, `updated_at`, `moderated_at`, ...) need special treatment
- fewer `user_id` values (21867 non-null) than cashRequest `id`s (23970): multiple transactions for some users, or actually missing values?
- after date casting at import, the following fields are still `object` rather than `datetime`:
  `moderated_at`, `reimbursement_date`, `money_back_date`, `send_at`
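The check for columns that stayed `object` can be automated instead of read off `info()`. The sketch below is one possible way; the helper name `columns_not_parsed` and the demo frame are assumptions for illustration.

```python
import pandas as pd

def columns_not_parsed(df, expected_datetime_cols):
    """Return the expected date columns whose dtype is not a datetime type."""
    return [
        col for col in expected_datetime_cols
        if not pd.api.types.is_datetime64_any_dtype(df[col])
    ]

demo = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-01-01", "2020-01-02"], utc=True),
    # Left as plain strings, like the columns that failed to parse at import.
    "send_at": ["2020-01-01 10:00:00+00", None],
})
print(columns_not_parsed(demo, ["created_at", "send_at"]))
```

In the notebook, this could be called with the `datetime_columns_cash_request` and `datetime_columns_fees` lists defined at import.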

In [7]:
# functions
def evaluateDataFrame(df):
    # Total number of records
    print("Total amount of records")
    print(len(df))
    print()
    # Number of missing values in each column
    print("Missing values per column")
    print(df.isnull().sum())
    print()
    # Number of unique values in each column
    print("Unique values per column")
    print(df.nunique())
    print()
    # print("DataFrame info")   # already called earlier; might make sense for a plain-Python version (then info() could go first and len() be dropped, since info() shows it too)
    # df.info()
    # print()
    

def inspect_data_types(df, name="DataFrame"):
    print(f"=== {name} ===")
    numerical = df.select_dtypes(include='number').columns.tolist()
    categorical = df.select_dtypes(include='object').columns.tolist()
    datetime = df.select_dtypes(include=['datetime','datetime64','datetime64[ns, UTC]']).columns.tolist()
      
    print(f"Numerical columns ({len(numerical)}): {numerical}")
    print(f"Categorical columns ({len(categorical)}): {categorical}")
    print(f"Date columns ({len(datetime)}): {datetime}")
    print()
    
    return numerical, categorical, datetime         # modified to also return the lists for further use




In [8]:
# calling functions
# commented for now, piecewise presentation might be more readable in notebook 

# evaluateDataFrame(cashRequest)
# evaluateDataFrame(fees)

# inspect_data_types(cashRequest)
# inspect_data_types(fees)
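For a piecewise presentation, the separate missing-value and unique-value printouts could also be combined into one table per dataframe. This is only a sketch; `summarize` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd

def summarize(df):
    """One row per column: dtype, missing count, missing share in percent, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),
    })

# Small made-up frame to show the shape of the result.
demo = pd.DataFrame({
    "user_id": [1.0, None, 3.0, 3.0],
    "status": ["a", "b", "b", "b"],
})
summary = summarize(demo)
print(summary)
```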

In [9]:
cashRequest.isna().sum()


id                                0
amount                            0
status                            0
created_at                        0
updated_at                        0
user_id                        2103
moderated_at                   7935
deleted_account_id            21866
reimbursement_date                0
cash_request_received_date     7681
money_back_date                7427
transfer_type                     0
send_at                        7329
recovery_status               20640
reco_creation                 20640
reco_last_update              20640
dtype: int64

In [10]:
cashRequest.nunique()


id                            23970
amount                           41
status                            7
created_at                    23970
updated_at                    23970
user_id                       10798
moderated_at                  16035
deleted_account_id             1141
reimbursement_date             4089
cash_request_received_date      312
money_back_date               12221
transfer_type                     2
send_at                       16641
recovery_status                   4
reco_creation                  3330
reco_last_update               3330
dtype: int64

In [29]:
fees.isna().sum()

id                     0
cash_request_id        4
type                   0
status                 0
category           18865
total_amount           0
reason                 0
created_at             0
updated_at             0
paid_at             5623
from_date          14312
to_date            14549
charge_moment          0
dtype: int64

In [30]:
fees.nunique()

id                 21061
cash_request_id    12933
type                   3
status                 4
category               2
total_amount           2
reason             15149
created_at         21026
updated_at         21061
paid_at            15438
from_date            454
to_date              486
charge_moment          2
dtype: int64

In [32]:
fees['total_amount'].unique()

array([ 5., 10.])

In [38]:
fees[['category','total_amount']]

Unnamed: 0,category,total_amount
0,,5.0
1,rejected_direct_debit,5.0
2,,5.0
3,,5.0
4,,5.0
...,...,...
21056,,5.0
21057,,5.0
21058,,5.0
21059,,5.0


Only two fee amounts are levied: 5 and 10 (presumably euros). Maybe convert `total_amount` to int as well?
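To see which kind of fee carries which amount, a cross-tabulation of `type` against `total_amount` may help. The sketch uses made-up rows (the real distribution is not shown in this notebook), and `fees_demo` and `amounts_by_type` are names chosen for the example.

```python
import pandas as pd

# Made-up rows; only the column names and type labels are taken from fees.
fees_demo = pd.DataFrame({
    "type": ["instant_payment", "incident", "instant_payment", "incident"],
    "total_amount": [5.0, 5.0, 5.0, 10.0],
})

# Rows: fee type, columns: fee amount, values: number of fees.
amounts_by_type = pd.crosstab(fees_demo["type"], fees_demo["total_amount"])
print(amounts_by_type)
```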

In [13]:

cashr_numcols, cashr_strcols, cashr_dtcols = inspect_data_types(cashRequest, name="cashRequest")


=== cashRequest ===
Numerical columns (4): ['id', 'amount', 'user_id', 'deleted_account_id']
Categorical columns (7): ['status', 'moderated_at', 'reimbursement_date', 'money_back_date', 'transfer_type', 'send_at', 'recovery_status']
Date columns (5): ['created_at', 'updated_at', 'cash_request_received_date', 'reco_creation', 'reco_last_update']



In [14]:
fees_numcols, fees_strcols, fees_dtcols = inspect_data_types(fees, name="fees")


=== fees ===
Numerical columns (3): ['id', 'cash_request_id', 'total_amount']
Categorical columns (8): ['type', 'status', 'category', 'reason', 'paid_at', 'from_date', 'to_date', 'charge_moment']
Date columns (2): ['created_at', 'updated_at']



Several of the date fields are still typed as `object`; fix them in data cleaning and rerun the function.

**Observations**

- 2103 missing values in `cashRequest.user_id`, matching the gap to `id` noted above
  - also: very close to the number of non-null `deleted_account_id` values (2104), so the two are possibly related
- fees are associated to cashRequests via `cash_request_id`

We used these insights to adapt our data import in order to directly cast the correct datatypes for columns that were not correctly identified automatically.  
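The suspected relation between missing `user_id` values and `deleted_account_id` can be checked directly. The sketch below counts rows where both or neither id is set and builds a combined key; the prefixes `user_` and `deleted_` and the name `user_key` are assumptions, chosen because the two id ranges might overlap.

```python
import pandas as pd

# Made-up rows mimicking the pattern seen in cashRequest.
demo = pd.DataFrame({
    "user_id": [804.0, None, 231.0, None],
    "deleted_account_id": [None, 2499.0, None, 721.0],
})

has_user = demo["user_id"].notna()
has_deleted = demo["deleted_account_id"].notna()

# Rows where both ids are set, and rows where neither is set.
both = int((has_user & has_deleted).sum())
neither = int((~has_user & ~has_deleted).sum())
print("both:", both, "neither:", neither)

# Combined key with a prefix so ids from the two columns cannot collide.
user_part = "user_" + demo["user_id"].astype("Int64").astype(str)
deleted_part = "deleted_" + demo["deleted_account_id"].astype("Int64").astype(str)
user_key = user_part.where(has_user, deleted_part)
print(user_key.tolist())
```

With 2103 missing `user_id` values and 2104 non-null `deleted_account_id` values, at least one real row should have both ids set, so `both` is worth inspecting on the actual data.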

## Data cleaning/quality analysis

### Instructions after EDA
1. Parse all values to the right data types
2. remove loose items (like fees without cashRequest)
3. 
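One way to find loose items is a left merge with `indicator=True`, which marks fees whose `cash_request_id` has no match. The frames below are toy data and the names `fees_demo`, `requests_demo` and `orphans` are hypothetical.

```python
import pandas as pd

# Toy data: fee 3 points at a cash request that does not exist.
fees_demo = pd.DataFrame({"id": [1, 2, 3], "cash_request_id": [5, 7, 999]})
requests_demo = pd.DataFrame({"id": [5, 7, 10]})

merged = fees_demo.merge(
    requests_demo.rename(columns={"id": "cash_request_id"}),
    on="cash_request_id",
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

# Fees without a matching cash request.
orphans = merged[merged["_merge"] == "left_only"]
print(orphans)
```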

In [15]:
# strip leading and trailing whitespace from all column names so we get no surprises when retrieving data later
fees.columns = fees.columns.str.strip()
cashRequest.columns = cashRequest.columns.str.strip()

The next block fixes data types for both dataframes: it parses the date columns that were left as `object` and prepares the integer casts (currently commented out).

In [49]:

# This is another option for parsing datatypes
# errors="coerce" -> invalid or unconvertible values are replaced with NaT instead of raising an error
datetime_columns_cash_request = [
    "created_at",
    "updated_at",
    "moderated_at",
    "cash_request_received_date",
    "reimbursement_date",
    "money_back_date",
    "send_at",
    "reco_last_update",
    "reco_creation"
]

for col in datetime_columns_cash_request:
    cashRequest[col] = pd.to_datetime(cashRequest[col], errors="coerce")
    
datetime_columns_fees = [
    "created_at",
    "updated_at",
    "paid_at",
    "from_date",
    "to_date"
]

for col in datetime_columns_fees:
    fees[col] = pd.to_datetime(fees[col], errors="coerce")
    


float_to_int_fees = [
 
    "cash_request_id"
]

# This currently doesn't work with astype(int) because of the NaN values, whereas the nullable astype("Int64") handles them (left commented out for now)

#for col in float_to_int_fees:
#     fees[col] = pd.to_numeric(fees[col], errors="coerce").astype("Int64")
     
float_to_int_cash_request = [
 
    "user_id",
    "deleted_account_id"
]

#for col in float_to_int_cash_request:
#     cashRequest[col] = pd.to_numeric(cashRequest[col], errors="coerce").astype("Int64")
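The difference between the two casts can be seen on a small series: `astype(int)` raises when NaN is present, while the nullable `"Int64"` dtype keeps missing values as `<NA>`. The values below are a toy example.

```python
import pandas as pd

ids = pd.Series([804.0, None, 231.0], name="user_id")

# astype(int) raises because NaN has no integer representation.
try:
    ids.astype(int)
except (ValueError, TypeError) as exc:
    print("astype(int) failed:", type(exc).__name__)

# The nullable extension dtype keeps the missing value as <NA>.
ids_int = ids.astype("Int64")
print(ids_int)
```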

In [18]:
cashRequest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23970 entries, 0 to 23969
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   id                          23970 non-null  int64              
 1   amount                      23970 non-null  float64            
 2   status                      23970 non-null  object             
 3   created_at                  23970 non-null  datetime64[ns, UTC]
 4   updated_at                  23970 non-null  datetime64[ns, UTC]
 5   user_id                     21867 non-null  float64            
 6   moderated_at                15912 non-null  datetime64[ns, UTC]
 7   deleted_account_id          2104 non-null   float64            
 8   reimbursement_date          3050 non-null   datetime64[ns, UTC]
 9   cash_request_received_date  16289 non-null  datetime64[ns]     
 10  money_back_date             12040 non-null  datetime64[ns,

In [50]:
cashr_numcols, cashr_strcols, cashr_dtcols = inspect_data_types(cashRequest, name="cashRequest")


=== cashRequest ===
Numerical columns (4): ['id', 'amount', 'user_id', 'deleted_account_id']
Categorical columns (3): ['status', 'transfer_type', 'recovery_status']
Date columns (9): ['created_at', 'updated_at', 'moderated_at', 'reimbursement_date', 'cash_request_received_date', 'money_back_date', 'send_at', 'reco_creation', 'reco_last_update']



In [17]:
fees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21061 entries, 0 to 21060
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               21061 non-null  int64              
 1   cash_request_id  21057 non-null  float64            
 2   type             21061 non-null  object             
 3   status           21061 non-null  object             
 4   category         2196 non-null   object             
 5   total_amount     21061 non-null  float64            
 6   reason           21061 non-null  object             
 7   created_at       21061 non-null  datetime64[ns, UTC]
 8   updated_at       21061 non-null  datetime64[ns, UTC]
 9   paid_at          15438 non-null  datetime64[ns, UTC]
 10  from_date        6749 non-null   datetime64[ns, UTC]
 11  to_date          6512 non-null   datetime64[ns, UTC]
 12  charge_moment    21061 non-null  object             
dtypes: datetime64[ns

In [51]:
fees_numcols, fees_strcols, fees_dtcols = inspect_data_types(fees, name="fees")


=== fees ===
Numerical columns (3): ['id', 'cash_request_id', 'total_amount']
Categorical columns (5): ['type', 'status', 'category', 'reason', 'charge_moment']
Date columns (5): ['created_at', 'updated_at', 'paid_at', 'from_date', 'to_date']


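Comparing the `info()` outputs before and after the conversion shows that several non-null counts dropped, for example `reimbursement_date` from 23970 to 3050, `money_back_date` from 16543 to 12040 and `from_date` from 7766 to 6749. This suggests that `errors="coerce"` turned some existing values into `NaT`, possibly because a column mixes timestamp formats. The sketch below counts such losses; `coercion_losses` is a hypothetical helper and the sample strings are made up.

```python
import pandas as pd

def coercion_losses(raw, parsed):
    """Count values that were present before parsing but are NaT afterwards."""
    return int((raw.notna() & parsed.isna()).sum())

# One valid timestamp, one unparseable string, one genuinely missing value.
raw = pd.Series(["2020-01-09 19:05:21+00:00", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# Only the unparseable string counts as a loss; the missing value was never there.
print(coercion_losses(raw, parsed))
```

To use this in the notebook, a copy of each raw column would have to be kept before the conversion loop. In pandas 2.0 and later, `pd.to_datetime` also accepts `format="ISO8601"` or `format="mixed"`, which do not assume a single format per column; whether that recovers the values lost here has not been verified.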

### Cleaning floats that should be ints

In [52]:
fees[fees['cash_request_id'].isna()]

Unnamed: 0,id,cash_request_id,type,status,category,total_amount,reason,created_at,updated_at,paid_at,from_date,to_date,charge_moment
1911,2990,,instant_payment,cancelled,,5.0,Instant Payment Cash Request 11164,2020-08-06 22:42:34.525373+00:00,2020-11-04 16:01:17.296048+00:00,NaT,NaT,NaT,after
1960,3124,,instant_payment,cancelled,,5.0,Instant Payment Cash Request 11444,2020-08-08 06:33:06.244651+00:00,2020-11-04 16:01:08.332978+00:00,NaT,NaT,NaT,after
4605,5185,,instant_payment,cancelled,,5.0,Instant Payment Cash Request 11788,2020-08-26 09:39:37.362933+00:00,2020-11-04 16:01:36.492576+00:00,NaT,NaT,NaT,after
11870,3590,,instant_payment,cancelled,,5.0,Instant Payment Cash Request 12212,2020-08-12 14:20:06.657075+00:00,2020-11-04 16:01:53.106416+00:00,NaT,NaT,NaT,after


The NA values in `fees.cash_request_id` all belong to cancelled fees, so we drop them (to be confirmed).
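As an alternative to dropping, the four rows shown above still carry a request number in `reason` (for example "Instant Payment Cash Request 11164"). The sketch below recovers the id from that text, under the assumption that the number in `reason` is the cash request id; `from_reason` and `recovered` are names made up for the example.

```python
import pandas as pd

# Two rows modelled on the fees output: one complete, one with a missing id.
fees_demo = pd.DataFrame({
    "cash_request_id": [14941.0, None],
    "status": ["rejected", "cancelled"],
    "reason": [
        "Instant Payment Cash Request 14941",
        "Instant Payment Cash Request 11164",
    ],
})

# Pull the trailing number out of `reason`; rows without a match become NaN.
from_reason = pd.to_numeric(
    fees_demo["reason"].str.extract(r"Cash Request (\d+)$", expand=False),
    errors="coerce",
)

# Fill only the missing ids, leaving existing ones untouched.
recovered = fees_demo["cash_request_id"].fillna(from_reason)
print(recovered.tolist())
```

Whether the recovered ids exist in `cashRequest.id` would still need to be checked before keeping these rows.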

In [53]:
# creating copies before dropping values (optional)
fees_cp = fees.copy()
cashRequest_cp = cashRequest.copy()

In [54]:
fees_cp.dropna(subset=['cash_request_id'],inplace=True)

In [55]:
fees_cp['cash_request_id'] = fees_cp['cash_request_id'].astype(int)
fees_cp.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21057 entries, 0 to 21060
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               21057 non-null  int64              
 1   cash_request_id  21057 non-null  int64              
 2   type             21057 non-null  object             
 3   status           21057 non-null  object             
 4   category         2196 non-null   object             
 5   total_amount     21057 non-null  float64            
 6   reason           21057 non-null  object             
 7   created_at       21057 non-null  datetime64[ns, UTC]
 8   updated_at       21057 non-null  datetime64[ns, UTC]
 9   paid_at          15438 non-null  datetime64[ns, UTC]
 10  from_date        6749 non-null   datetime64[ns, UTC]
 11  to_date          6512 non-null   datetime64[ns, UTC]
 12  charge_moment    21057 non-null  object             
dtypes: datetime64[ns, UTC

### Checking NaT dates
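Missing dates may simply reflect the request status, since a rejected request is never sent or paid back. A compact way to test this is the share of missing values per date column and status. The sketch uses made-up rows; `missing_share` is a name chosen for the example.

```python
import pandas as pd

# Made-up rows using status labels and date columns from cashRequest.
demo = pd.DataFrame({
    "status": ["rejected", "rejected", "money_back", "money_back"],
    "send_at": pd.to_datetime([None, None, "2020-10-23", None], utc=True),
    "money_back_date": pd.to_datetime(
        [None, None, "2020-11-06", "2020-07-06"], utc=True
    ),
})
date_cols = ["send_at", "money_back_date"]

# Share of missing values per status (rows) and date column (columns).
missing_share = demo[date_cols].isna().groupby(demo["status"]).mean()
print(missing_share)
```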

In [None]:
# check date fields with missing data to assess significance
for col in cashr_dtcols:
    if cashRequest_cp[col].isna().sum() > 0:
        print(col, ': ', cashRequest_cp[col].isna().sum())       
        display(cashRequest_cp[cashRequest_cp[col].isna()].head(10))

moderated_at :  8058


Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
8,5672,100.0,canceled,2020-06-28 12:06:33.712840+00:00,2020-06-28 12:06:33.712853+00:00,,NaT,2499.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
48,23534,25.0,money_back,2020-10-23 15:20:26.163927+00:00,2020-12-18 13:08:29.099365+00:00,21465.0,NaT,,NaT,NaT,2020-11-06 07:16:21.845479+00:00,instant,2020-10-23 15:21:26.878525+00:00,,NaT,NaT
157,257,100.0,rejected,2019-12-19 18:05:11.964030+00:00,2020-04-13 09:40:44.909068+00:00,,NaT,721.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
161,20639,50.0,money_back,2020-10-12 16:05:27.478409+00:00,2020-12-18 13:12:06.313275+00:00,,NaT,30317.0,NaT,2020-10-16,2020-10-31 18:58:22.244510+00:00,instant,2020-10-15 06:29:31.161555+00:00,,NaT,NaT
396,20108,100.0,money_back,2020-10-09 11:02:54.071547+00:00,2020-12-18 13:08:29.646333+00:00,63894.0,NaT,,NaT,2020-10-10,2020-11-06 19:27:38.913381+00:00,instant,2020-10-09 11:03:25.488796+00:00,,NaT,NaT
423,2364,100.0,money_back,2020-05-30 15:10:48.767168+00:00,2020-07-09 14:54:31.446247+00:00,11499.0,NaT,,NaT,2020-06-02,NaT,regular,NaT,completed,2020-06-17 22:24:38.582685+00:00,2020-07-14 14:59:25.211216+00:00
487,20112,100.0,money_back,2020-10-09 11:12:35.190378+00:00,2020-12-18 13:08:29.865140+00:00,10116.0,NaT,,NaT,2020-10-10,2020-11-06 20:35:03.777445+00:00,instant,2020-10-09 11:12:41.849906+00:00,,NaT,NaT
543,2153,100.0,money_back,2020-05-25 10:27:15.844613+00:00,2020-07-09 15:08:38.475526+00:00,3628.0,NaT,,NaT,2020-05-27,NaT,regular,NaT,completed,2020-06-20 22:16:10.319218+00:00,2020-06-26 14:21:38.253322+00:00
582,4231,100.0,rejected,2020-06-19 15:38:41.644967+00:00,2020-06-25 11:03:21.481161+00:00,18703.0,NaT,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
616,25766,50.0,money_back,2020-10-28 18:16:18.975577+00:00,2020-12-18 13:08:30.176523+00:00,97141.0,NaT,,NaT,2020-10-29,2020-11-07 19:54:45.149107+00:00,instant,2020-10-28 18:16:40.074320+00:00,,NaT,NaT


reimbursement_date :  20920


Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
4,1594,100.0,rejected,2020-05-06 09:59:38.877376+00:00,2020-05-07 09:21:55.340080+00:00,7686.0,2020-05-07 09:21:55.320193+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
5,2145,100.0,money_back,2020-05-23 20:58:55.129432+00:00,2020-07-06 03:36:03.023911+00:00,9489.0,2020-05-24 12:40:33.054910+00:00,,NaT,2020-05-26,2020-07-06 03:36:03.023521+00:00,regular,NaT,completed,2020-06-12 22:27:04.837525+00:00,2020-07-06 03:36:03.030904+00:00
6,3512,100.0,rejected,2020-06-16 17:07:38.452652+00:00,2020-06-17 10:21:21.364746+00:00,14631.0,2020-06-17 10:21:21.360742+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
8,5672,100.0,canceled,2020-06-28 12:06:33.712840+00:00,2020-06-28 12:06:33.712853+00:00,,NaT,2499.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
13,2122,100.0,money_back,2020-05-22 12:47:42.741369+00:00,2020-06-13 06:36:34.188372+00:00,8218.0,2020-05-23 14:29:15.376466+00:00,,NaT,2020-05-27,2020-06-13 06:36:34.188042+00:00,regular,NaT,,NaT,NaT
17,756,70.0,rejected,2020-02-28 10:13:18.421583+00:00,2020-02-28 14:12:15.076660+00:00,821.0,2020-02-28 14:12:15.073471+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
18,1073,100.0,rejected,2020-04-08 02:21:48.854449+00:00,2020-04-08 09:41:16.026093+00:00,2867.0,2020-04-08 09:41:16.022533+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
30,1540,100.0,rejected,2020-05-04 04:23:57.148763+00:00,2020-05-04 09:23:14.319562+00:00,3377.0,2020-05-04 09:23:14.315961+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
31,1135,100.0,rejected,2020-04-10 22:10:31.412685+00:00,2020-04-13 08:45:55.525828+00:00,2274.0,2020-04-13 08:45:55.521142+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
33,1395,70.0,rejected,2020-04-27 14:33:19.288174+00:00,2020-04-27 16:16:02.981982+00:00,7379.0,2020-04-27 16:16:02.978418+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT


cash_request_received_date :  7681


Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
0,5,100.0,rejected,2019-12-10 19:05:21.596873+00:00,2019-12-11 16:47:42.407830+00:00,804.0,2019-12-11 16:47:42.405646+00:00,,2020-01-09 19:05:21.596363+00:00,NaT,NaT,regular,NaT,,NaT,NaT
1,70,100.0,rejected,2019-12-10 19:50:12.347780+00:00,2019-12-11 14:24:22.900054+00:00,231.0,2019-12-11 14:24:22.897988+00:00,,2020-01-09 19:50:12.347780+00:00,NaT,NaT,regular,NaT,,NaT,NaT
2,7,100.0,rejected,2019-12-10 19:13:35.825460+00:00,2019-12-11 09:46:59.779773+00:00,191.0,2019-12-11 09:46:59.777728+00:00,,2020-01-09 19:13:35.825041+00:00,NaT,NaT,regular,NaT,,NaT,NaT
3,10,99.0,rejected,2019-12-10 19:16:10.880172+00:00,2019-12-18 14:26:18.136163+00:00,761.0,2019-12-18 14:26:18.128407+00:00,,2020-01-09 19:16:10.879606+00:00,NaT,NaT,regular,NaT,,NaT,NaT
4,1594,100.0,rejected,2020-05-06 09:59:38.877376+00:00,2020-05-07 09:21:55.340080+00:00,7686.0,2020-05-07 09:21:55.320193+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
6,3512,100.0,rejected,2020-06-16 17:07:38.452652+00:00,2020-06-17 10:21:21.364746+00:00,14631.0,2020-06-17 10:21:21.360742+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
7,654,100.0,rejected,2020-02-10 01:11:53.808270+00:00,2020-02-10 11:53:32.104131+00:00,,2020-02-10 09:11:21.350695+00:00,309.0,2020-03-11 01:11:53.807930+00:00,NaT,NaT,regular,NaT,,NaT,NaT
8,5672,100.0,canceled,2020-06-28 12:06:33.712840+00:00,2020-06-28 12:06:33.712853+00:00,,NaT,2499.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
9,71,90.0,rejected,2019-12-10 19:51:23.911206+00:00,2019-12-12 15:06:11.192888+00:00,897.0,2019-12-12 15:06:11.190299+00:00,,2019-12-17 19:51:23.910748+00:00,NaT,NaT,regular,NaT,,NaT,NaT
10,648,1.0,rejected,2020-02-08 19:20:44.627662+00:00,2020-02-09 13:59:26.784459+00:00,2908.0,2020-02-09 13:59:26.779954+00:00,,2020-03-09 19:20:44.627255+00:00,NaT,NaT,regular,NaT,,NaT,NaT


money_back_date :  11930


Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
0,5,100.0,rejected,2019-12-10 19:05:21.596873+00:00,2019-12-11 16:47:42.407830+00:00,804.0,2019-12-11 16:47:42.405646+00:00,,2020-01-09 19:05:21.596363+00:00,NaT,NaT,regular,NaT,,NaT,NaT
1,70,100.0,rejected,2019-12-10 19:50:12.347780+00:00,2019-12-11 14:24:22.900054+00:00,231.0,2019-12-11 14:24:22.897988+00:00,,2020-01-09 19:50:12.347780+00:00,NaT,NaT,regular,NaT,,NaT,NaT
2,7,100.0,rejected,2019-12-10 19:13:35.825460+00:00,2019-12-11 09:46:59.779773+00:00,191.0,2019-12-11 09:46:59.777728+00:00,,2020-01-09 19:13:35.825041+00:00,NaT,NaT,regular,NaT,,NaT,NaT
3,10,99.0,rejected,2019-12-10 19:16:10.880172+00:00,2019-12-18 14:26:18.136163+00:00,761.0,2019-12-18 14:26:18.128407+00:00,,2020-01-09 19:16:10.879606+00:00,NaT,NaT,regular,NaT,,NaT,NaT
4,1594,100.0,rejected,2020-05-06 09:59:38.877376+00:00,2020-05-07 09:21:55.340080+00:00,7686.0,2020-05-07 09:21:55.320193+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
6,3512,100.0,rejected,2020-06-16 17:07:38.452652+00:00,2020-06-17 10:21:21.364746+00:00,14631.0,2020-06-17 10:21:21.360742+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
7,654,100.0,rejected,2020-02-10 01:11:53.808270+00:00,2020-02-10 11:53:32.104131+00:00,,2020-02-10 09:11:21.350695+00:00,309.0,2020-03-11 01:11:53.807930+00:00,NaT,NaT,regular,NaT,,NaT,NaT
8,5672,100.0,canceled,2020-06-28 12:06:33.712840+00:00,2020-06-28 12:06:33.712853+00:00,,NaT,2499.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
9,71,90.0,rejected,2019-12-10 19:51:23.911206+00:00,2019-12-12 15:06:11.192888+00:00,897.0,2019-12-12 15:06:11.190299+00:00,,2019-12-17 19:51:23.910748+00:00,NaT,NaT,regular,NaT,,NaT,NaT
10,648,1.0,rejected,2020-02-08 19:20:44.627662+00:00,2020-02-09 13:59:26.784459+00:00,2908.0,2020-02-09 13:59:26.779954+00:00,,2020-03-09 19:20:44.627255+00:00,NaT,NaT,regular,NaT,,NaT,NaT


send_at :  7504


Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
0,5,100.0,rejected,2019-12-10 19:05:21.596873+00:00,2019-12-11 16:47:42.407830+00:00,804.0,2019-12-11 16:47:42.405646+00:00,,2020-01-09 19:05:21.596363+00:00,NaT,NaT,regular,NaT,,NaT,NaT
1,70,100.0,rejected,2019-12-10 19:50:12.347780+00:00,2019-12-11 14:24:22.900054+00:00,231.0,2019-12-11 14:24:22.897988+00:00,,2020-01-09 19:50:12.347780+00:00,NaT,NaT,regular,NaT,,NaT,NaT
2,7,100.0,rejected,2019-12-10 19:13:35.825460+00:00,2019-12-11 09:46:59.779773+00:00,191.0,2019-12-11 09:46:59.777728+00:00,,2020-01-09 19:13:35.825041+00:00,NaT,NaT,regular,NaT,,NaT,NaT
3,10,99.0,rejected,2019-12-10 19:16:10.880172+00:00,2019-12-18 14:26:18.136163+00:00,761.0,2019-12-18 14:26:18.128407+00:00,,2020-01-09 19:16:10.879606+00:00,NaT,NaT,regular,NaT,,NaT,NaT
4,1594,100.0,rejected,2020-05-06 09:59:38.877376+00:00,2020-05-07 09:21:55.340080+00:00,7686.0,2020-05-07 09:21:55.320193+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
5,2145,100.0,money_back,2020-05-23 20:58:55.129432+00:00,2020-07-06 03:36:03.023911+00:00,9489.0,2020-05-24 12:40:33.054910+00:00,,NaT,2020-05-26,2020-07-06 03:36:03.023521+00:00,regular,NaT,completed,2020-06-12 22:27:04.837525+00:00,2020-07-06 03:36:03.030904+00:00
6,3512,100.0,rejected,2020-06-16 17:07:38.452652+00:00,2020-06-17 10:21:21.364746+00:00,14631.0,2020-06-17 10:21:21.360742+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
7,654,100.0,rejected,2020-02-10 01:11:53.808270+00:00,2020-02-10 11:53:32.104131+00:00,,2020-02-10 09:11:21.350695+00:00,309.0,2020-03-11 01:11:53.807930+00:00,NaT,NaT,regular,NaT,,NaT,NaT
8,5672,100.0,canceled,2020-06-28 12:06:33.712840+00:00,2020-06-28 12:06:33.712853+00:00,,NaT,2499.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
9,71,90.0,rejected,2019-12-10 19:51:23.911206+00:00,2019-12-12 15:06:11.192888+00:00,897.0,2019-12-12 15:06:11.190299+00:00,,2019-12-17 19:51:23.910748+00:00,NaT,NaT,regular,NaT,,NaT,NaT


reco_creation :  20640


Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
0,5,100.0,rejected,2019-12-10 19:05:21.596873+00:00,2019-12-11 16:47:42.407830+00:00,804.0,2019-12-11 16:47:42.405646+00:00,,2020-01-09 19:05:21.596363+00:00,NaT,NaT,regular,NaT,,NaT,NaT
1,70,100.0,rejected,2019-12-10 19:50:12.347780+00:00,2019-12-11 14:24:22.900054+00:00,231.0,2019-12-11 14:24:22.897988+00:00,,2020-01-09 19:50:12.347780+00:00,NaT,NaT,regular,NaT,,NaT,NaT
2,7,100.0,rejected,2019-12-10 19:13:35.825460+00:00,2019-12-11 09:46:59.779773+00:00,191.0,2019-12-11 09:46:59.777728+00:00,,2020-01-09 19:13:35.825041+00:00,NaT,NaT,regular,NaT,,NaT,NaT
3,10,99.0,rejected,2019-12-10 19:16:10.880172+00:00,2019-12-18 14:26:18.136163+00:00,761.0,2019-12-18 14:26:18.128407+00:00,,2020-01-09 19:16:10.879606+00:00,NaT,NaT,regular,NaT,,NaT,NaT
4,1594,100.0,rejected,2020-05-06 09:59:38.877376+00:00,2020-05-07 09:21:55.340080+00:00,7686.0,2020-05-07 09:21:55.320193+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
6,3512,100.0,rejected,2020-06-16 17:07:38.452652+00:00,2020-06-17 10:21:21.364746+00:00,14631.0,2020-06-17 10:21:21.360742+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
7,654,100.0,rejected,2020-02-10 01:11:53.808270+00:00,2020-02-10 11:53:32.104131+00:00,,2020-02-10 09:11:21.350695+00:00,309.0,2020-03-11 01:11:53.807930+00:00,NaT,NaT,regular,NaT,,NaT,NaT
8,5672,100.0,canceled,2020-06-28 12:06:33.712840+00:00,2020-06-28 12:06:33.712853+00:00,,NaT,2499.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
9,71,90.0,rejected,2019-12-10 19:51:23.911206+00:00,2019-12-12 15:06:11.192888+00:00,897.0,2019-12-12 15:06:11.190299+00:00,,2019-12-17 19:51:23.910748+00:00,NaT,NaT,regular,NaT,,NaT,NaT
10,648,1.0,rejected,2020-02-08 19:20:44.627662+00:00,2020-02-09 13:59:26.784459+00:00,2908.0,2020-02-09 13:59:26.779954+00:00,,2020-03-09 19:20:44.627255+00:00,NaT,NaT,regular,NaT,,NaT,NaT


reco_last_update :  20640


Unnamed: 0,id,amount,status,created_at,updated_at,user_id,moderated_at,deleted_account_id,reimbursement_date,cash_request_received_date,money_back_date,transfer_type,send_at,recovery_status,reco_creation,reco_last_update
0,5,100.0,rejected,2019-12-10 19:05:21.596873+00:00,2019-12-11 16:47:42.407830+00:00,804.0,2019-12-11 16:47:42.405646+00:00,,2020-01-09 19:05:21.596363+00:00,NaT,NaT,regular,NaT,,NaT,NaT
1,70,100.0,rejected,2019-12-10 19:50:12.347780+00:00,2019-12-11 14:24:22.900054+00:00,231.0,2019-12-11 14:24:22.897988+00:00,,2020-01-09 19:50:12.347780+00:00,NaT,NaT,regular,NaT,,NaT,NaT
2,7,100.0,rejected,2019-12-10 19:13:35.825460+00:00,2019-12-11 09:46:59.779773+00:00,191.0,2019-12-11 09:46:59.777728+00:00,,2020-01-09 19:13:35.825041+00:00,NaT,NaT,regular,NaT,,NaT,NaT
3,10,99.0,rejected,2019-12-10 19:16:10.880172+00:00,2019-12-18 14:26:18.136163+00:00,761.0,2019-12-18 14:26:18.128407+00:00,,2020-01-09 19:16:10.879606+00:00,NaT,NaT,regular,NaT,,NaT,NaT
4,1594,100.0,rejected,2020-05-06 09:59:38.877376+00:00,2020-05-07 09:21:55.340080+00:00,7686.0,2020-05-07 09:21:55.320193+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
6,3512,100.0,rejected,2020-06-16 17:07:38.452652+00:00,2020-06-17 10:21:21.364746+00:00,14631.0,2020-06-17 10:21:21.360742+00:00,,NaT,NaT,NaT,regular,NaT,,NaT,NaT
7,654,100.0,rejected,2020-02-10 01:11:53.808270+00:00,2020-02-10 11:53:32.104131+00:00,,2020-02-10 09:11:21.350695+00:00,309.0,2020-03-11 01:11:53.807930+00:00,NaT,NaT,regular,NaT,,NaT,NaT
8,5672,100.0,canceled,2020-06-28 12:06:33.712840+00:00,2020-06-28 12:06:33.712853+00:00,,NaT,2499.0,NaT,NaT,NaT,regular,NaT,,NaT,NaT
9,71,90.0,rejected,2019-12-10 19:51:23.911206+00:00,2019-12-12 15:06:11.192888+00:00,897.0,2019-12-12 15:06:11.190299+00:00,,2019-12-17 19:51:23.910748+00:00,NaT,NaT,regular,NaT,,NaT,NaT
10,648,1.0,rejected,2020-02-08 19:20:44.627662+00:00,2020-02-09 13:59:26.784459+00:00,2908.0,2020-02-09 13:59:26.779954+00:00,,2020-03-09 19:20:44.627255+00:00,NaT,NaT,regular,NaT,,NaT,NaT


## Further EDA

# Target data analysis

## Merging datasets

*Not sure whether we should merge the datasets much earlier.*

Left-join the fees dataframe to the cashRequest dataframe, matching cashRequest `id` to fees `cash_request_id`, to create the full dataset:
(A left join retains all cash requests, even those with no associated fees.)

In [39]:
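# NOTE: with on='id', cashRequest `id` is matched against the fees table's own `id` column,
# not against `cash_request_id` (which set_index has moved into the index)
df = cashRequest_cp.merge(fees_cp.set_index('cash_request_id'), how='left', on='id')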
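
As a minimal sketch of the left join described above, the keys can be named explicitly with `left_on`/`right_on`. The two small frames, their values, and the suffix names are made up for illustration and only stand in for `cashRequest_cp` and `fees_cp`. The `indicator=True` flag adds a `_merge` column that shows which cash requests found no fee.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (hypothetical values).
cash_requests = pd.DataFrame({
    "id": [1, 2, 3],
    "amount": [100.0, 50.0, 25.0],
    "status": ["money_back", "rejected", "money_back"],
})
fees = pd.DataFrame({
    "id": [10, 11, 12],            # the fee's own id
    "cash_request_id": [1, 1, 3],  # refers to cash_requests.id
    "status": ["accepted", "rejected", "accepted"],
    "total_amount": [5.0, 5.0, 5.0],
})

# Left join: keep every cash request, attach matching fees where they exist.
merged = cash_requests.merge(
    fees,
    how="left",
    left_on="id",
    right_on="cash_request_id",
    suffixes=("_request", "_fee"),  # disambiguates id/status from both tables
    indicator=True,                 # adds a `_merge` column
)

# Cash requests that have no associated fee.
n_without_fees = (merged["_merge"] == "left_only").sum()
print(merged)
print("requests without fees:", n_without_fees)
```

A cash request with several fees appears once per fee in the result, so the merged frame can have more rows than there are cash requests. Metrics that count requests would then need `nunique()` on the request id rather than a plain row count.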

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23970 entries, 0 to 23969
Data columns (total 27 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   id                          23970 non-null  int64              
 1   amount                      23970 non-null  float64            
 2   status_x                    23970 non-null  object             
 3   created_at_x                23970 non-null  datetime64[ns, UTC]
 4   updated_at_x                23970 non-null  datetime64[ns, UTC]
 5   user_id                     21867 non-null  float64            
 6   moderated_at                15912 non-null  datetime64[ns, UTC]
 7   deleted_account_id          2104 non-null   float64            
 8   reimbursement_date          3050 non-null   datetime64[ns, UTC]
 9   cash_request_received_date  16289 non-null  datetime64[ns]     
 10  money_back_date             12040 non-null  datetime64[ns,

In [41]:
print(df.head())

     id  amount  status_x                     created_at_x  \
0     5   100.0  rejected 2019-12-10 19:05:21.596873+00:00   
1    70   100.0  rejected 2019-12-10 19:50:12.347780+00:00   
2     7   100.0  rejected 2019-12-10 19:13:35.825460+00:00   
3    10    99.0  rejected 2019-12-10 19:16:10.880172+00:00   
4  1594   100.0  rejected 2020-05-06 09:59:38.877376+00:00   

                      updated_at_x  user_id                     moderated_at  \
0 2019-12-11 16:47:42.407830+00:00    804.0 2019-12-11 16:47:42.405646+00:00   
1 2019-12-11 14:24:22.900054+00:00    231.0 2019-12-11 14:24:22.897988+00:00   
2 2019-12-11 09:46:59.779773+00:00    191.0 2019-12-11 09:46:59.777728+00:00   
3 2019-12-18 14:26:18.136163+00:00    761.0 2019-12-18 14:26:18.128407+00:00   
4 2020-05-07 09:21:55.340080+00:00   7686.0 2020-05-07 09:21:55.320193+00:00   

   deleted_account_id               reimbursement_date  \
0                 NaN 2020-01-09 19:05:21.596363+00:00   
1                 NaN 2020-01-

## Overview of cohorts

In [None]:
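# cohorts are defined by year/month combinations

# number of rows with a known user_id per creation month
# (this counts requests, not unique users, and uses each request's own month
# rather than the month of the user's first request)
df.groupby([df.created_at_x.dt.year, df.created_at_x.dt.month])['user_id'].count()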

created_at_x  created_at_x
2019          11                 1
              12               230
2020          1                176
              2                157
              3                207
              4                418
              5                727
              6               2251
              7               3159
              8               3090
              9               3802
              10              7512
              11               137
Name: user_id, dtype: int64
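
The introduction defines a cohort by the month of a user's first cash advance. The sketch below shows one way to derive that label and count unique users per cohort. The toy data, the column name `created_at`, and the choice to drop rows without a `user_id` are assumptions made for illustration, not the notebook's confirmed method.

```python
import pandas as pd

# Toy request log (hypothetical values); one row per cash request.
requests = pd.DataFrame({
    "user_id": [1.0, 1.0, 2.0, 2.0, 3.0, None],
    "created_at": pd.to_datetime(
        ["2020-01-05", "2020-02-10", "2020-01-20",
         "2020-01-25", "2020-02-03", "2020-02-04"],
        utc=True,
    ),
})

# Assumption: rows without a user_id cannot be assigned to a cohort.
known = requests.dropna(subset=["user_id"]).copy()

# Month in which each request was created, as a "YYYY-MM" label.
known["request_month"] = known["created_at"].dt.strftime("%Y-%m")

# Cohort = month of the user's first request, broadcast to all of their rows.
first_request = known.groupby("user_id")["created_at"].transform("min")
known["cohort"] = first_request.dt.strftime("%Y-%m")

# Unique users per cohort vs. number of requests per calendar month.
cohort_sizes = known.groupby("cohort")["user_id"].nunique()
requests_per_month = known.groupby("request_month")["user_id"].count()

print(cohort_sizes)
print(requests_per_month)
```

With the `cohort` column attached to every row, the monthly metrics can be grouped by `["cohort", "request_month"]` to track each cohort over time.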