# Introduction

This notebook analyses...

## Core aspects

- cohorts: defined by month of creation of first cash advance (`created_at`)
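
As a minimal sketch of this cohort definition, the snippet below labels every request with the month of its user's earliest `created_at`. The toy values are made up for illustration; only the column names `user_id` and `created_at` come from the dataset.

```python
import pandas as pd

# Toy data: hypothetical values, real column names
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "created_at": pd.to_datetime(
        ["2020-01-15", "2020-03-02", "2020-02-10", "2020-02-20", "2020-03-05"]
    ),
})

# Earliest request per user, broadcast back to every row of that user
first_request = toy.groupby("user_id")["created_at"].transform("min")

# Cohort label = calendar month of the user's first request
toy["cohort"] = first_request.dt.to_period("M")

print(toy)
```

With this labelling, a user who first requested an advance in January stays in the January cohort even for requests made in later months.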

goal:

- track monthly evolution of key metrics by cohort

key metrics:

- frequency of usage of cash advances over time
- incident rate
- revenue generated
- new relevant metric (TBD)

## Exploratory Data Analysis (EDA)
1. Conduct an exploratory data analysis to gain a comprehensive understanding of the dataset.

2. Explore key statistics, distributions, and visualizations to identify patterns and outliers.

## Data Quality Analysis
1. Assess the quality of the dataset by identifying missing values, data inconsistencies, and potential errors.

2. Implement data cleaning and preprocessing steps to ensure the reliability of your analysis. 


## Calculate and analyze the following metrics for each cohort:

1. Frequency of Service Usage: Understand how often users from each cohort utilize IronHack Payments' cash advance services over time.

2. Incident Rate: Determine the incident rate, specifically focusing on payment incidents, for each cohort. Identify if there are variations in incident rates among different cohorts.

3. Revenue Generated by the Cohort: Calculate the total revenue generated by each cohort over months to assess the financial impact of user behavior.

4. New Relevant Metric: Propose and calculate a new relevant metric that provides additional insights into user behavior or the performance of IronHack Payments' services.
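
One possible reading of "frequency of service usage" is the number of requests per cohort member in each month since the cohort's first request. The sketch below computes that on toy data. The values and the helper column names `request_month`, `cohort` and `months_since_first` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy data: hypothetical values, real column names
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "created_at": pd.to_datetime([
        "2020-01-15", "2020-01-20", "2020-02-02",
        "2020-01-10", "2020-03-01", "2020-02-05",
    ]),
})

# Month of each request and month of the user's first request (the cohort)
toy["request_month"] = toy["created_at"].dt.to_period("M")
toy["cohort"] = toy.groupby("user_id")["created_at"].transform("min").dt.to_period("M")

# Whole months elapsed since the cohort month (0 = month of first request)
toy["months_since_first"] = (
    (toy["request_month"].dt.year - toy["cohort"].dt.year) * 12
    + (toy["request_month"].dt.month - toy["cohort"].dt.month)
)

# Requests per cohort and elapsed month, then normalised by cohort size
requests = toy.groupby(["cohort", "months_since_first"]).size().unstack(fill_value=0)
cohort_size = toy.groupby("cohort")["user_id"].nunique()
requests_per_user = requests.div(cohort_size, axis=0)

print(requests_per_user)
```

Dividing by cohort size keeps cohorts of different sizes comparable. The same cohort-by-month layout could hold incident counts or summed fees instead of request counts.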

relevant columns:

cashRequest:
  - `created_at`
  - `updated_at`

## Deliverables
1. Python Code: Provide well-documented Python code that conducts the cohort analysis, including data loading, preprocessing, cohort creation, metric calculation, and visualization.

## Setup requirements

- extract/define cohorts in dataset

## Table of contents

1. [Introduction](#introduction)
2. [EDA](#eda)  
    a. [Data overview](#data-overview)  
    b. [Data cleaning/quality analysis](#data-cleaning/quality-analysis)  
    c. [Further EDA](#further-eda)
3. [Target data analysis](#target-data-analysis)

# EDA

## Preamble

Loading libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Import data

In [None]:
# First we removed the spaces from the csv files so we can easily import them here

# We modified our import process to directly cast proper datatypes for dates.
# Float/integer will still be handled in data cleaning, 
# since some of the offending columns have NaN values causing issues (hence presumably the wrong automatic casting)


# lists of columns containing dates

datetime_columns_cash_request = [
    "created_at",
    "updated_at",
    "moderated_at",
    "cash_request_received_date",
    "reimbursement_date",
    "money_back_date",
    "send_at",
    "reco_last_update",
    "reco_creation"
]
   
datetime_columns_fees = [
    "created_at",
    "updated_at",
    "paid_at",
    "from_date",
    "to_date"
]


fees = pd.read_csv("../project_dataset/extract-fees-dataanalyst.csv",
                            parse_dates = datetime_columns_fees)
cashRequest = pd.read_csv("../project_dataset/extract-cashrequest-dataanalyst.csv", 
                            parse_dates = datetime_columns_cash_request)



In [None]:
# This is how we get a small insight in the data
display(fees.head())

In [None]:
# Overview of data in fees
fees.info()

**Observations**:

- `cash_request_id` is automatically cast as float64. `int` might be more plausible, change in cleaning
- date-related columns (`created_at`, `updated_at`, `paid_at`, `from_date`, `to_date`) will need special treatment

- after date casting at import still trouble for `paid_at`,`from_date`,`to_date`

In [None]:
display(cashRequest.head())

In [None]:
# Overview of data in cashRequest
cashRequest.info()

**Observations**:

- `deleted_account_id` and `user_id` needn't be floats (cast to int later)
- date-related columns (`created_at`, `updated_at`, `moderated_at`...) will need special treatment
- fewer unique `user_id` values than cashRequest `id`s: indicating multiple transactions for some users or actual missing values?


- after date casting at import the following fields are still `object` rather than `datetime`:
  `moderated_at`,`reimbursement_date`,`money_back_date`,`send_at`
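
A column usually stays `object` when at least one value cannot be parsed as a date. As a rough diagnostic, one can parse with `errors="coerce"` and list the raw values that were present but became `NaT`. The toy series below is made up; only the column name is taken from the dataset.

```python
import pandas as pd

# Toy column: one valid timestamp, one unparsable string, one missing value
raw = pd.Series(["2020-05-01 10:00:00", "not a date", None], name="moderated_at")

# Unparsable values become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# Present in the raw data but NaT after parsing -> these values failed to parse
failed = raw[parsed.isna() & raw.notna()]

print(failed.tolist())
```

If `failed` is empty for a real column, all of its `NaT` values come from genuinely missing data rather than from parsing problems.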

In [None]:
# functions
def evaluateDataFrame(df):
    # Let's check how many records we actually have
    print("Total number of records")
    print(len(df))
    print()
    # Number of missing values in each column
    print("Missing values per column")
    print(df.isnull().sum())
    print()
    # Number of unique values in each column
    print("Unique values per column")
    print(df.nunique())
    print()
    #print("DataFrame info")            # we're already calling this earlier, might make sense for plain-py version (although then we could put info() at start and remove len, since that's also displayed by info())
    #df.info()
    #print()
    

def inspect_data_types(df, name="DataFrame"):
    print(f"=== {name} ===")
    numerical = df.select_dtypes(include='number').columns.tolist()
    categorical = df.select_dtypes(include='object').columns.tolist()
    datetime = df.select_dtypes(include=['datetime', 'datetimetz']).columns.tolist()    # tz-naive and tz-aware datetime columns
      
    print(f"Numerical columns ({len(numerical)}): {numerical}")
    print(f"Categorical columns ({len(categorical)}): {categorical}")
    print(f"Date columns ({len(datetime)}): {datetime}")
    print()
    
    return numerical, categorical, datetime         # modified to also return the lists for further use




In [None]:
# calling functions
# commented for now, piecewise presentation might be more readable in notebook 

# evaluateDataFrame(cashRequest)
# evaluateDataFrame(fees)

# inspect_data_types(cashRequest)
# inspect_data_types(fees)

In [None]:
cashRequest.isna().sum()


In [None]:
cashRequest.nunique()


In [None]:
fees.isna().sum()

In [None]:
fees.nunique()

In [None]:
fees['total_amount'].unique()

In [None]:
fees[['category','total_amount']]

Only two fee amounts occur: 5 and 10 (presumably euros). Maybe convert them to int as well?
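
Before casting an amount column to an integer type, it may be worth confirming that no value has a fractional part. A small sketch on made-up values (only the column name `total_amount` is from the dataset):

```python
import pandas as pd

# Toy values standing in for fees['total_amount']
amounts = pd.Series([5.0, 10.0, 5.0, 10.0], name="total_amount")

# Cast only if every non-missing value is a whole number
non_missing = amounts.dropna()
if (non_missing % 1 == 0).all():
    amounts = amounts.astype("Int64")  # nullable integer, tolerates missing values

print(amounts.dtype)
```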

In [None]:

cashr_numcols, cashr_strcols, cashr_dtcols = inspect_data_types(cashRequest, name="cashRequest")


In [None]:
fees_numcols, fees_strcols, fees_dtcols = inspect_data_types(fees, name="fees")


Several of the date fields aren't typed correctly; fix this in data cleaning and rerun the function.

**Observations**

- 2103 missing values in `cashRequest.user_id`, corresponding to the difference to `id` noted above
  - also: very close to the count for `deleted_account_id` (2104), so there is possibly a relation
- fees are associated to cashRequests via `cash_request_id`
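
The suspected link between missing `user_id` values and `deleted_account_id` can be checked with a cross-tabulation of the two conditions. The sketch below uses a made-up four-row frame; on the real data the same call would be applied to `cashRequest`.

```python
import pandas as pd

# Toy data: hypothetical values, real column names
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "user_id": [101, None, 103, None],
    "deleted_account_id": [None, 9001, None, 9002],
})

# Rows: is user_id missing?  Columns: is deleted_account_id filled?
overlap = pd.crosstab(
    toy["user_id"].isna().rename("user_id missing"),
    toy["deleted_account_id"].notna().rename("deleted_account_id present"),
)

print(overlap)
```

If nearly all counts fall on the diagonal, the missing `user_id` values most likely belong to deleted accounts rather than being data errors.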

We used these insights to adapt our data import in order to directly cast the correct datatypes for columns that were not correctly identified automatically.  

## Data cleaning/quality analysis

### Instructions after EDA
1. Parse all values to the right data types
2. Remove loose items (like fees without cashRequest)
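
A sketch of how loose fees could be identified, assuming "loose" means a fee whose `cash_request_id` is missing or does not appear among the cash request `id`s. The frames `cash_toy` and `fees_toy` are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Toy frames: hypothetical values, real column names
cash_toy = pd.DataFrame({"id": [1, 2, 3]})
fees_toy = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "cash_request_id": pd.array([1, 1, 4, None], dtype="Int64"),
})

# True where the fee points at an existing cash request
has_parent = fees_toy["cash_request_id"].isin(cash_toy["id"])

orphans = fees_toy[~has_parent]     # missing or unknown cash_request_id
fees_clean = fees_toy[has_parent]   # fees that can be joined safely

print(orphans)
```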

In [None]:
# strip leading and trailing whitespace from all column names so we get no surprises in the data retrieval later
fees.columns = fees.columns.str.strip()
cashRequest.columns = cashRequest.columns.str.strip()

The next block is going to fix datatypes for both dataframes, i.e. fixing the missing dates and casting some columns as integers.
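
The integer cast needs pandas' nullable `Int64` dtype because a plain `int` column cannot hold missing values. A short illustration on a made-up series:

```python
import pandas as pd

# Toy ID column with one missing value (stored as float because of the NaN)
ids = pd.Series([12.0, None, 14.0], name="cash_request_id")

# Plain int cannot represent NaN, so this cast raises a ValueError
try:
    ids.astype(int)
    plain_int_ok = True
except ValueError:
    plain_int_ok = False

# The nullable integer dtype keeps the gap as <NA>
nullable = ids.astype("Int64")

print(plain_int_ok)
print(nullable)
```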

In [None]:

# This is another option for parsing datatypes
# errors="coerce" -> values that cannot be converted are replaced with NaT instead of raising an error
datetime_columns_cash_request = [
    "created_at",
    "updated_at",
    "moderated_at",
    "cash_request_received_date",
    "reimbursement_date",
    "money_back_date",
    "send_at",
    "reco_last_update",
    "reco_creation"
]

for col in datetime_columns_cash_request:
    cashRequest[col] = pd.to_datetime(cashRequest[col], errors="coerce")
    
datetime_columns_fees = [
    "created_at",
    "updated_at",
    "paid_at",
    "from_date",
    "to_date"
]

for col in datetime_columns_fees:
    fees[col] = pd.to_datetime(fees[col], errors="coerce")
    


float_to_int_fees = [
    "cash_request_id",
    "id"
]

# astype(int) fails here because the columns contain missing values; the nullable astype("Int64") works
for col in float_to_int_fees:
    fees[col] = pd.to_numeric(fees[col], errors="coerce").astype("Int64")
     
float_to_int_cash_request = [
    "user_id",
    "deleted_account_id",
    "id"
]

for col in float_to_int_cash_request:
    cashRequest[col] = pd.to_numeric(cashRequest[col], errors="coerce").astype("Int64")
    
fees.info()

In [None]:
cashRequest.info()

In [None]:
cashr_numcols, cashr_strcols, cashr_dtcols = inspect_data_types(cashRequest, name="cashRequest")


In [None]:
fees.info()

In [None]:
fees_numcols, fees_strcols, fees_dtcols = inspect_data_types(fees, name="fees")


### Cleaning floats that should be ints

In [None]:
fees[fees['cash_request_id'].isna()]

NA-values in `fees.cash_request_id` are for cancelled transactions - let's drop them!(?)
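
Before dropping, the claim can be verified by summarising the unlinked rows by their state. The sketch below assumes a `status` column on `fees`, which is a hypothetical name here; substitute whichever column records the fee state.

```python
import pandas as pd

# Toy data; 'status' is an assumed column name, values are made up
fees_toy = pd.DataFrame({
    "cash_request_id": pd.array([1, None, 2, None], dtype="Int64"),
    "status": ["accepted", "cancelled", "accepted", "cancelled"],
})

# Fees that are not linked to any cash request
unlinked = fees_toy[fees_toy["cash_request_id"].isna()]

# If only 'cancelled' shows up here, dropping these rows loses no active fees
print(unlinked["status"].value_counts())
```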

In [None]:
# creating copies before dropping values (optional)
fees_cp = fees.copy()
cashRequest_cp = cashRequest.copy()

In [None]:
# Drop all rows that are no longer connected to a cash request
fees_cp.dropna(subset=['cash_request_id'],inplace=True)

In [None]:
# fees_cp['cash_request_id'] = fees_cp['cash_request_id'].astype(int)
fees_cp.info()

### Checking NaT dates

In [None]:
# check date fields with missing data to assess significance
for col in cashr_dtcols:
    if cashRequest_cp[col].isna().sum() > 0:
        print(col, ': ', cashRequest_cp[col].isna().sum())       
        # display(cashRequest_cp[cashRequest_cp[col].isna()].head(10))

## Merging datasets

*Not sure whether we might want to merge the datasets much earlier?*

Left-join the fees dataframe to the cashRequest dataframe on `id`/`cash_request_id` to create full dataset:
(We want to retain all cash requests even in case they have no associated fees.)
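
Because both tables have an `id` column and a cash request can carry several fees, a left join can return more rows than there are cash requests. The sketch below shows `merge` options that make this visible. The toy frames and the `amount` column are made up, and the suffix names are an arbitrary choice.

```python
import pandas as pd

# Toy frames: hypothetical values; 'amount' is an assumed column
cash_toy = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 50.0, 100.0]})
fees_toy = pd.DataFrame({
    "id": [10, 11, 12],
    "cash_request_id": [1, 1, 3],
    "total_amount": [5.0, 5.0, 10.0],
})

merged = cash_toy.merge(
    fees_toy,
    how="left",
    left_on="id",
    right_on="cash_request_id",
    suffixes=("_cash", "_fee"),   # disambiguate the two 'id' columns
    validate="one_to_many",       # raises if cash request ids are not unique
    indicator=True,               # adds '_merge': 'both' or 'left_only'
)

print(len(cash_toy), len(merged))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are cash requests without any fee. Requests with several fees appear once per fee, which matters when counting requests on the merged frame.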

In [None]:
df = cashRequest_cp.merge(fees_cp, how='left', left_on='id', right_on='cash_request_id')
print(len(cashRequest_cp))
print(len(fees_cp))
print(len(df))



## Further EDA

In [None]:
# Group fees by cash_request_id and count the rows -> this gives the number of fees per cash request

fee_counts = fees_cp.groupby('cash_request_id').size()

# All cash requests that have more than one fee
multiple_fees = fee_counts[fee_counts > 1]

# Show them
# print(multiple_fees)
print(len(df))
df.info()

In [None]:
print(df.head())

# Target data analysis

## Overview of cohorts

In [None]:
# create a new column holding the month of created_at as a monthly period (e.g. 2024-03, 2023-11) -> these can be used as labels for the plot
# note: this is the month of each individual request, not yet the month of the user's first request
cashRequest_cp['cohort_month'] = cashRequest_cp['created_at'].dt.to_period('M')

# Group the data by the new column and count the cash requests in each period
# (count() only counts rows with a non-missing user_id)
time_plot_data = cashRequest_cp.groupby('cohort_month')['user_id'].count()
# print(time_plot_data)
time_plot_data.plot(kind='line')
