### 1. EDA - Dataset

#### 1.1 Univariate Analysis

* Evolution of registrations
* Evolution of logins
* Evolution of consumption (in and out)
* Demographic features (global numbers: age, sex, province, etc)

#### 1.2 Multivariate Analysis

* Correlation Matrix (Pearson) (relationships between variables before Association Rules)

### 2. EDA - Association Rules

* Total products
* Most frequent items/ itemsets
* Most frequent rules
* Business rules


Importing packages and libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import boto3
from datetime import datetime, timedelta
import awswrangler as wr
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_rows', 500) 
sns.set(style="whitegrid")
paleta_uala = ['#3E6BFD','#698cff','#9eb5ff','#d5dfff','#3E6BFD']
import pingouin as pg 
import scipy

ModuleNotFoundError: No module named 'mlxtend'

The parquet where the data is stored is loaded:

In [None]:
df = wr.s3.read_parquet(f's3://test-uala-arg-datalake-aiml-recommendations/data/stage/all_data_v2/')
df.rename(columns={"nu_sin_categoria_aprob":"nu_compras_otras_categorias_aprob"}, inplace=True)

In [None]:
df.shape

## Data Cleaning

Fetuares with NULL values are checked. 

In [None]:
df.isnull().sum()

In [None]:
df[['cobros','cuotificaciones','promociones','transferencia_c2c','transferencia_cvu']] = df[['cobros','cuotificaciones','promociones','transferencia_c2c','transferencia_cvu']].fillna(0)

In [None]:
df = df.dropna()

The data type for the variable "dt" is modified and the name is changed.

In [None]:
df['dt'] = pd.to_datetime(df['dt'])

In [None]:
df = df.rename({'dt':'periodo'}, axis=1)

The variable Age is created:

In [None]:
# Last timeframe available
max_periodo = df.periodo.max()
max_periodo = pd.to_datetime(max_periodo)

In [None]:
df.fecha_alta = pd.to_datetime(df.fecha_alta)
df.fecha_alta = np.where(df.fecha_alta > max_periodo , max_periodo, df.fecha_alta.dt.date)

In [None]:
df['antiguedad']= max_periodo - df.fecha_alta

In [None]:
df['antiguedad'] = df['antiguedad'].dt.days

# 1. EDA - Dataset

## 1.1 Univariate Analysis

### User Base Growth

In [None]:
temp = (df
 .groupby("periodo")
 .account_id
 .agg('count')
)

In [None]:
ax = temp.plot(figsize=(12,8), color='#698cff',marker='D', markeredgecolor='darkblue')
ax.set_xlabel("Periodo")
ax.set_ylabel("Cantidad de Usuarios en Millones")
_ = ax.set_title("Evolución de Clientes que usaron la App", fontsize=16)

### Evolution of Registrations

In [None]:
temp = df.copy()
temp['periodo_alta'] = pd.to_datetime(df['fecha_alta'], format='%Y/%m').dt.strftime('%Y-%m')

temp = (temp
 .groupby("periodo_alta")
 .account_id
 .agg('count')
)


In [None]:
ax = temp.plot(figsize=(12,8), color='#698cff',marker='D', markeredgecolor='darkblue')
ax.set_xlabel("Periodo")
ax.set_ylabel("Cantidad de Usuarios")
_ = ax.set_title("Evolución de Clientes Nuevos", fontsize=16)

In the plots analyzed above, it is observed that although the number of user registrations decreased during the pandemic, the trend in the use of the app continues to grow.

## Evolution of Average Transaction Amounts

### Cashout

In [None]:
temp = (df
 .groupby("periodo", as_index = False)['vl_compras_aprob',
                        'vl_withdraw_atm_aprob',
                        'vl_investments_deposit_aprob',
                        'vl_telerecargas_carga_aprob',
                        'vl_user_to_user_aprob', 
                        'vl_cash_out_cvu_aprob'
                       ]
 .agg('mean')
)

In [None]:
temp_pivot = temp.melt(var_name = "cashout",
                     value_name = "Media_usd",
                     id_vars = ["periodo"],
                     value_vars = ['vl_compras_aprob',
                        'vl_withdraw_atm_aprob',
                        'vl_investments_deposit_aprob',
                        'vl_telerecargas_carga_aprob',
                        'vl_user_to_user_aprob', 
                        'vl_cash_out_cvu_aprob']) # out a un cbu o cvu

In [None]:
g = sns.relplot(
    data = temp_pivot,
    x="periodo", y="Media_usd", col="cashout", hue="cashout",
    kind="line", palette= ['#3E6BFD','#698cff','#9eb5ff','#d5dfff','#3E6BFD','#698cff'], linewidth=4, zorder=5,
    col_wrap=2, height=3, aspect=3.2, legend=False, facet_kws=dict(sharey=False)
)


It is observed in the plot of cashout amounts (average) in USD that there is a positive trend with a peak in December, although it presents different levels depending on the product used.

### Cashin

In [None]:
temp = (df
 .groupby("periodo", as_index = False)['vl_cashin_efectivo_sum',
                                       'vl_cashin_transferencia_sum']
 .agg('mean')
)

In [None]:
temp_pivot = temp.melt(var_name = "cashin",
                     value_name = "Media_usd",
                     id_vars = ["periodo"],
                     value_vars = ['vl_cashin_efectivo_sum',
                                  'vl_cashin_transferencia_sum'])

In [None]:
g = sns.relplot(
    data = temp_pivot,
    x="periodo", y="Media_usd", col="cashin", hue="cashin",
    kind="line", palette=['#3E6BFD','#698cff'], linewidth=4, zorder=5,
    col_wrap=2, height=4, aspect=2.5, legend=False, facet_kws=dict(sharey=False)
)

It is observed in the plot of cashin amounts (averages) in USD that there is a positive trend with a peak in December. Being in the case of transactions in cash, the most abrupt rises and falls with respect to transfers.

## Evolution of Logins and Recorded Incidents

In [None]:
temp = (df
 .groupby("periodo", as_index = False)['nu_incidente',
                                       'nu_inicio',
                                       'nu_bloqueo']
 .agg('mean')
 .fillna(0)
)

temp_pivot = temp.melt(var_name = "evento",
                       value_name = "cantidad_media",
                       id_vars = ["periodo"],
                       value_vars = ['nu_incidente',
                                     'nu_inicio',
                                    'nu_bloqueo'])

In [None]:
g = sns.relplot(
    data = temp_pivot,
    x="periodo", y="cantidad_media", col="evento", hue="evento",
    kind="line", palette= ['#3E6BFD','#698cff','#9eb5ff'], linewidth=4, zorder=5,
    col_wrap=2, height=3, aspect=3.0, legend=False, facet_kws=dict(sharey=False)
)

## Demographic Characteristics of Clients

**Seniority as a customer**

In [None]:
plt.figure(figsize=(20,5))

sns.distplot(df.antiguedad, kde=True, hist=True, bins=15, color='#3E6BFD')
plt.xlabel('Cantidad de días de antiguedad')
plt.ylabel('Frecuencia')
plt.title('Distribución de Antigüedad de los Clientes', fontsize=16)
plt.show()

De la distribución de la antigüedad podemos confirmar que la cantidad de nuevos usuarios se fue reduciendo tocando minimo en el mes de Mayo del 2020 coincidiendo con el inicio de la pandemia en Argentina.

**Age of customers**

In [None]:
df['fecha_nacimiento'] = pd.to_datetime(df['fecha_nacimiento'], format='%Y/%m/%d')
df['edad']=[math.trunc((datetime.now()-i).days/365)  for i in df['fecha_nacimiento']]
plt.figure(figsize=(20,5))

sns.distplot(df.edad, kde=True, hist=True, bins=15, color='#698cff')
plt.xlabel('edad')
plt.ylabel('Frecuencia')
plt.title('Distribución de la Edad de los Clientes', fontsize=16)

plt.show()

The age distribution graph presents a skewed to the right, concentrating the user base in the age range from 18 to 30 years.

**Areas where users reside**

In [None]:
df['zona']=df.provincia.str.lower()
df.loc[-df.provincia.isin(['santa fe','cordoba','buenos aires','capital federal',
                                                            'gran buenos aires','mendoza','tucuman','salta',
                                                           'misiones']),\
                         'zona']='Others'
plt.figure(figsize=(20,5))
sns.countplot(x=df.zona, palette=paleta_uala)
plt.title('Distribución de las Zonas donde residen los Clientes', fontsize=16)

plt.show()

The plot presents the most populated areas in Argentina versus the rest. Within it, it can be observed that the distribution of users of Ualá is concentrated in the province of Buenos Aires and the category 'Others' presents a high density since it summarizes the number of users of the rest of the country's provinces.

## Phone Company and User Operating System

Phone Company

In [None]:
temp = df.copy()

condlist=[
    temp.fl_carrier_claro == 1,
    temp.fl_carrier_movistar == 1,
    temp.fl_carrier_personal == 1,
    temp.fl_carrier_tuenti ==1
    
]

choicelist=['Claro','Movistar','Personal','Tuenti']
temp['cd_empresa_tel']=np.select(condlist,choicelist)
temp['cd_empresa_tel']=(np.where(temp['cd_empresa_tel']=='0','otros',temp['cd_empresa_tel']))

In [None]:
b = temp.groupby(['periodo','cd_empresa_tel']).account_id.count().reset_index(name='cantidad_clientes')
b = b[b.cd_empresa_tel != 'otros']
b['periodo']= b['periodo'].dt.date

In [None]:
b_pivot = pd.pivot_table(
    b, 
    values="cantidad_clientes",
    index="periodo",
    columns="cd_empresa_tel", 
    aggfunc='sum'
)
b_pivot.head(1)

In [None]:
paleta_uala = ['#3E6BFD','#698cff','#9eb5ff','#d5dfff','#3E6BFD']

#import matplotlib.dates as mdates
ax = b_pivot.plot(kind="bar", color= paleta_uala, width=0.9)
fig = ax.get_figure()
fig.set_size_inches(12, 8)

plt.title('Distribución de Empresas Telefónicas que utilizan los Clientes', fontsize=16)
ax.set_xlabel("periodo")
ax.set_ylabel("Cantidad de Clientes")

Of the users who present which telephone company they use, there is a concentration distributed in three companies: Personal, Claro and Movistar.

Operating System

In [None]:
temp["Android_OS"] = np.where(temp.fl_os_android==0,'Android','IOS')
temp_pie = temp.groupby(["Android_OS"])["account_id"].count().reset_index(name="Cantidad_Clientes")

ax = temp_pie.plot.pie(y="Cantidad_Clientes", labels=temp_pie.Android_OS, autopct="%1.2f%%", figsize=(10, 8), colors=['#698cff','#9eb5ff'])
_ = ax.set_title("Proporción de Clientes con cada Sistema Operativo", fontsize=16)

The plot shows that most of the users have Android operating system.

## 1.2 Multivariate Analysis

### Correlation Matrix

Se realizó una matriz de correlación de Pearson para evaluar con mas detalle el grado de correlación lineal de las variables y visualizar mas interesantes para nuestro análisis.

In [None]:
# We select the variables that we do not want to have within the correlation matrix
a = df.columns.isin(['account_id','year_month','os_name','external_id','periodo','fecha_alta','provincia','sexo','fecha_nacimiento','zona'])
a_comple = df.columns[~a]

# We create a new df without the variables we choose
df_ = df.loc[:,a_comple]

For this analysis, the strongest linear correlations were ruled out, that is, greater than 90% and less than 0.5%, to simplify the bivariate analysis considering the large number of variables that the dataset presents.

In [None]:
# Correlation matrix (absolute values)
temp = df_.copy()
corr_matrix = temp.corr().abs()

# We select upper triangle
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))

# We remove variables with correlations greater than the 90% percentile and less than the 0.5%.
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
to_drop2 = [column for column in upper.columns if any(upper[column] < 0.005)]

# We delete those lists
temp.drop(to_drop+to_drop2, axis=1, inplace=True)

temp.shape

In [None]:
# Correlation Matrix
matriz = temp.corr()
#matriz.style.background_gradient(cmap='coolwarm')

Heatmaps

In [None]:
plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(matriz, dtype=np.bool))
heatmap = sns.heatmap(matriz, mask=mask, vmin=-1, vmax=1, annot=True, cmap=paleta_uala)
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':18}, pad=16);

From the previous heatmap, the following correlations stand out:
* inversiones -> 	transferencia_cvu
* transferencia_c2c -> nu_compras_otras_categorias_aprob	
* nu_cashin_inversiones_qty -> nu_consumption_pos_aprob
* nu_tcc_r_aprob -> nu_incidente
* nu_inicio -> transferencia_cvu, nu_cashin_inversiones_qty, nu_tcc_r_aprob, nu_consumption_pos_aprob, nu_compras_otras_categorias_aprob, nu_incidente

## Bivariate Relationships

In [None]:
g=sns.pairplot(
    temp,
    x_vars=["inversiones", "transferencia_cvu", "transferencia_c2c", "nu_compras_otras_categorias_aprob", 
"nu_cashin_inversiones_qty", "nu_consumption_pos_aprob", "nu_tcc_r_aprob", "nu_incidente", "nu_inicio"],
    y_vars=["inversiones", "transferencia_cvu", "transferencia_c2c", "nu_compras_otras_categorias_aprob", 
"nu_cashin_inversiones_qty", "nu_consumption_pos_aprob", "nu_tcc_r_aprob", "nu_incidente", "nu_inicio"], 
    #height = 6,
    #aspect=0.5
)
g.fig.set_figheight(20)
g.fig.set_figwidth(25)
plt.show()

The linear relationship of the following variables is graphically confirmed:

* transferencia_cvu con nu_compras_otras_categ
* nu_inicio con cahin_inv
* nu_inicio con tcc_r_aprob
* transferencia_cvu con transferencia_c2c

# Rules Apriori

In order to understand cross-selling within Ualá's products, an algorithm of Association Rules (Apriori) is run to measure the frequency of use of the product basket offered by the company.

Features that you do not want to have within the array are eliminated. Futhermore, to give priority to the most recent transactionality of the users, the last month available in the dataset is analyzed:

In [None]:
df_=df[df['periodo']=='2021-03'].copy()
a = df_.columns.isin(['year_month','os_name','external_id','periodo','fecha_alta','provincia','sexo','fecha_nacimiento','zona','edad'])
a_comple = df_.columns[~a]

df_ = df_.loc[:,a_comple]

In [None]:
data_pivot = pd.melt(df_, 
                     id_vars='account_id', 
                     var_name='producto', 
                     value_name='value')

In [None]:
data_pivot = data_pivot[data_pivot.value > 0]

In [None]:
data_pivot.head()

In [None]:
data_matriz = (data_pivot
               .groupby(['account_id', 'producto'])['value']
               .sum().unstack().reset_index().fillna(0)
               .set_index('account_id'))

In [None]:
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

data_matriz = data_matriz.applymap(encode_units)

The transactions related to the products of interest are selected below to recommend to users:

In [None]:
data_matriz_test = data_matriz[[
 'nu_cashin_prestamos_qty',
 'nu_cashin_adquirencia_qty',
 'nu_investments_deposit_aprob',
 'nu_telerecargas_carga_aprob',
 'nu_compras_otras_categorias_aprob',
 'nu_compras_aprob',
 'nu_entretenimiento_aprob',
 'nu_servicios_débitos_automaticos_aprob',
 'nu_supermercados_alimentos_aprob']]

In [None]:
data_matriz_test= data_matriz

In [None]:
itemsets = apriori(data_matriz_test, min_support=0.01, use_colnames=True)

In [None]:
rules = association_rules(itemsets, metric="lift", min_threshold=1)

In [None]:
itemsets

In [None]:
rules

It is observed that the investment product is the most present in the cross-selling of the product basket, both in the role of antecedent and consequent.
This can act as a basis for future evaluation of the models to be applied and the population of users to be analyzed.