We will try to identify which transactions in the database are fraudulent. Fraudulent transactions are one of the main problems that payment platforms and fintechs pursue, because they must meet compliance requirements and remain transparent, among other reasons. Fraudulent transactions typically account for less than 0.3% of all transactions.
In this case we have decided that the following characteristics are the most relevant for detecting transaction fraud:
    1. Only transactions with status "transaction_declined" or "rejected".
    2. Transactions with abnormal amounts: customers who made cash requests for small amounts and then suddenly request a large one, or users with a single transaction whose amount is an outlier because it is so high.
    3. Inactive users who suddenly make a cash request for a large amount.
    4. A very small difference between created_at and updated_at, or between created_at and paid_at.
    5. Transactions in the early hours of the morning with no apparent reason (1 AM to 4 AM).
    6. Significant differences between created_at, moderated_at, and cash_request_recieved_date.
    7. Many rejected transactions before an approved one (suspicious sequence).
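
Criterion 7 in the list above describes a sequence rather than a single row, so it needs the transactions ordered per user. The following is a minimal sketch under stated assumptions: the toy data is invented, `"approved"` is a hypothetical name for the approved status, and `MIN_REJECTIONS` is an arbitrary threshold, none of which come from the dataset.

```python
import pandas as pd

# Toy data: column names mirror the notebook, the values are invented.
toy_df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 3],
    "created_at": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
        "2020-01-01", "2020-01-02", "2020-01-01",
    ]),
    "status": ["rejected", "rejected", "rejected", "approved",
               "rejected", "approved", "approved"],
})

REJECTED = {"transaction_declined", "rejected"}
APPROVED = {"approved"}   # assumption: hypothetical label for an approved status
MIN_REJECTIONS = 3        # assumption: arbitrary threshold


def max_rejections_before_approval(statuses):
    """Longest run of rejected statuses that is immediately followed by an approval."""
    best = run = 0
    for status in statuses:
        if status in REJECTED:
            run += 1
        else:
            if status in APPROVED:
                best = max(best, run)
            run = 0
    return best


# Order each user's transactions in time, then measure the runs per user.
ordered = toy_df.sort_values(["user_id", "created_at"])
runs = ordered.groupby("user_id")["status"].apply(max_rejections_before_approval)
suspicious_sequence_users = runs[runs >= MIN_REJECTIONS].index.tolist()
```

With this toy data only user 1 reaches the threshold; the threshold itself would have to be tuned against the real data.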

In [None]:
%run codes/analisis_calidad_datos.ipynb

UsageError: Line magic function `%run:` not found.


In [2]:
# Select the user_id values whose status is "transaction_declined" or "rejected"
fraudulent_users = merged_df[merged_df['status'].isin(['transaction_declined', 'rejected'])]['user_id'].unique()

fraudulent_users


array([44498., 10562., 14048., 47557., 88127., 62880., 14832., 27319.,
         526.,  6333., 45517., 13565., 56860., 68367.,  3248., 34991.,
       69253., 76803., 83479., 96899.])

In [3]:
# Select the user_id values whose amount equals the maximum, 200
high_amount_users = merged_df[merged_df['amount'] == 200]['user_id'].unique()

high_amount_users


array([4881., 9282., 9222.])
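
The cell above hard-codes the maximum amount of 200. Criterion 2 talks about atypical amounts more generally, so one possible alternative is an interquartile-range rule (Tukey's fence). This is a sketch on invented toy data, not the method used in the notebook; the 1.5 multiplier is the conventional default and `outlier_amount_users` is a hypothetical name.

```python
import pandas as pd

# Toy data: invented amounts, one clearly larger than the rest.
toy_df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "amount": [50, 50, 60, 50, 55, 50, 60, 200],
})

# Tukey's upper fence: Q3 + 1.5 * IQR.
q1 = toy_df["amount"].quantile(0.25)
q3 = toy_df["amount"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Users with at least one amount above the fence.
outlier_amount_users = toy_df.loc[toy_df["amount"] > upper_fence, "user_id"].unique()
```

The same idea could be applied per user (comparing each amount with that user's own history) to capture the "small requests, then suddenly a large one" pattern.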

In [4]:
import pandas as pd

# Convert the date columns to datetime (unparseable values become NaT)
merged_df['created_at'] = pd.to_datetime(merged_df['created_at'], errors='coerce')
merged_df['updated_at'] = pd.to_datetime(merged_df['updated_at'], errors='coerce')
merged_df['send_at'] = pd.to_datetime(merged_df['send_at'], errors='coerce')

# Difference in whole days between created_at and updated_at
merged_df['days_created_to_updated'] = (merged_df['updated_at'] - merged_df['created_at']).dt.days

# Difference in whole days between created_at and send_at
merged_df['days_created_to_send'] = (merged_df['send_at'] - merged_df['created_at']).dt.days

# Show the new columns
merged_df[['created_at', 'updated_at', 'send_at', 'days_created_to_updated', 'days_created_to_send']]


Unnamed: 0,created_at,updated_at,send_at,days_created_to_updated,days_created_to_send
0,2020-10-23 15:20:26.163927+00:00,2020-12-18 13:08:29.099365+00:00,2020-10-23 15:21:26.878525+00:00,55,0.0
1,2020-05-27 02:26:27.615190+00:00,2020-06-09 11:25:51.726360+00:00,NaT,13,
2,2020-07-01 09:30:03.145410+00:00,2020-08-11 22:27:58.240406+00:00,NaT,41,
3,2020-07-01 09:30:03.145410+00:00,2020-08-11 22:27:58.240406+00:00,NaT,41,
4,2020-07-01 09:30:03.145410+00:00,2020-08-11 22:27:58.240406+00:00,NaT,41,
...,...,...,...,...,...
21052,2020-10-20 07:58:04.006937+00:00,2021-02-05 12:19:30.656816+00:00,2020-10-20 07:58:14.171553+00:00,108,0.0
21053,2020-10-10 05:40:55.700422+00:00,2021-02-05 13:14:19.707627+00:00,2020-10-10 05:41:23.368363+00:00,118,0.0
21054,2020-10-10 05:40:55.700422+00:00,2021-02-05 13:14:19.707627+00:00,2020-10-10 05:41:23.368363+00:00,118,0.0
21055,2020-10-08 14:16:52.155661+00:00,2021-01-05 15:45:52.645536+00:00,2020-10-08 14:17:04.526139+00:00,89,0.0


In [13]:
# Group by 'user_id' and take the minimum 'days_created_to_updated' for each user
grouped_users = merged_df.groupby('user_id')['days_created_to_updated'].min()

# Sort the results and keep the 19 users with the smallest value
# (the variable name says 20, but nsmallest(19) returns 19 rows)
top_20_users_grouped = grouped_users.nsmallest(19)

top_20_users_grouped



user_id
526.0      0
3248.0     0
6333.0     0
10562.0    0
13565.0    0
14048.0    0
14832.0    0
34991.0    0
44498.0    0
45517.0    0
47557.0    0
56860.0    0
62880.0    0
68367.0    0
69253.0    0
76803.0    0
83479.0    0
88127.0    0
96899.0    0
Name: days_created_to_updated, dtype: int64
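
A difference of 0 days only says that both timestamps fall within 24 hours of each other. Since criterion 4 is about very small differences, measuring the gap in seconds may separate near-instant updates from same-day ones. This is a sketch on invented toy data; `MAX_SECONDS` is a hypothetical threshold.

```python
import pandas as pd

# Toy data: users 1 and 2 both have a 0-day difference, but very different gaps.
toy_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "created_at": pd.to_datetime(
        ["2020-10-23 15:20:26", "2020-05-27 02:26:27", "2020-07-01 09:30:03"], utc=True
    ),
    "updated_at": pd.to_datetime(
        ["2020-10-23 15:20:56", "2020-05-27 20:00:00", "2020-08-11 22:27:58"], utc=True
    ),
})

delta = toy_df["updated_at"] - toy_df["created_at"]
toy_df["days_created_to_updated"] = delta.dt.days
toy_df["seconds_created_to_updated"] = delta.dt.total_seconds()

MAX_SECONDS = 60  # assumption: arbitrary cut-off for "very small"
fast_update_users = toy_df.loc[
    toy_df["seconds_created_to_updated"] <= MAX_SECONDS, "user_id"
].unique()
```

Here the day-level column is 0 for both user 1 and user 2, while the seconds-level filter keeps only user 1.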

In [28]:
# Convert the 'created_at' column to datetime if it is not already
merged_df['created_at'] = pd.to_datetime(merged_df['created_at'], errors='coerce')

# Extract the hour from the 'created_at' column
merged_df['hour_created'] = merged_df['created_at'].dt.hour

# Select the user_id values with transactions created between 1 AM and 4 AM (hours 1, 2 and 3)
filtered_users = merged_df[(merged_df['hour_created'] >= 1) & (merged_df['hour_created'] < 4)]['user_id'].unique()

filtered_users


array([  2109.,   6536.,  26912.,  11905.,  33465.,     nan,  20230.,
        62302.,  24547.,  28404.,  92043.,  87807.,  35354.,  24144.,
        13735.,  79903.,   9917.,  26764.,   5264.,  30956.,  65693.,
        27319.,  38898.,  19667.,  77635.,  21354.,   6164.,  56475.,
        53742.,  98130.,   4982.,  12070.,  48917.,   8093.,  56093.,
        22510.,   1987.,   1776.,  18708.,  17697.,  31241.,  34635.,
        32822.,  86082.,  52482.,  65328.,   3219.,  47267.,   7302.,
         8746.,  93714.,   4656.,   1658.,  17707.,   3287.,  14750.,
        10028.,  30112.,  10522.,  77311.,  87976.,  80521.,  82814.,
        18369.,  12015.,   1136.,  32407.,  12353.,   6938.,  25158.,
        12233.,  10432.,  17725.,   8196.,    430.,  87479.,  34496.,
         5291.,   9852.,  47851.,    590.,  14617.,   5189.,  23971.,
         9973.,    192.,   3631.,  34928.,  34872.,  17925.,  20836.,
        19579.,  30377.,   3121.,  33406.,  19753.,  20926.,  29652.,
         8496.,  121
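
The `created_at` values shown earlier carry a `+00:00` offset, so `dt.hour` returns the hour in UTC. If the 1 AM to 4 AM window is meant in the users' local time, the timestamps would need converting first. The sketch below assumes a single local timezone, `Europe/Madrid`, purely for illustration; the notebook does not state where the users are located.

```python
import pandas as pd

# Toy data: two invented timestamps in UTC.
toy_df = pd.DataFrame({
    "user_id": [1, 2],
    "created_at": pd.to_datetime(
        ["2020-07-01 00:30:00", "2020-07-01 02:30:00"], utc=True
    ),
})

LOCAL_TZ = "Europe/Madrid"  # assumption: hypothetical local timezone

toy_df["hour_utc"] = toy_df["created_at"].dt.hour
toy_df["hour_local"] = toy_df["created_at"].dt.tz_convert(LOCAL_TZ).dt.hour

# between(1, 3) is inclusive on both ends, i.e. 01:00 to 03:59.
night_users_utc = toy_df.loc[toy_df["hour_utc"].between(1, 3), "user_id"].unique()
night_users_local = toy_df.loc[toy_df["hour_local"].between(1, 3), "user_id"].unique()
```

In this toy example the two filters flag different users, which shows how much the choice of timezone can change the result.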

In [25]:
# Count the distinct user_id values with status "transaction_declined" or "direct_debit_rejected"
filtered_status_users = merged_df[
    merged_df['status'].isin(['transaction_declined', 'direct_debit_rejected'])
]['user_id'].nunique()

filtered_status_users

724

In [26]:
# Get the user_id values with status "transaction_declined" or "direct_debit_rejected"
filtered_user_ids = merged_df[
    merged_df['status'].isin(['transaction_declined', 'direct_debit_rejected'])
]['user_id'].unique()

filtered_user_ids


array([ 15415.,  16861.,  28368.,  14484.,  58270.,   8379.,  95574.,
         9748.,   6976.,  32634.,  41200.,  74867.,   2575.,   5296.,
        17851.,  88936.,  13261.,   5206.,  12513.,   6561.,   2715.,
        44498.,   8866.,  33064.,   5299.,  33866.,  13778.,  10562.,
        16730.,  17580.,  15057.,  25246.,  88757.,  17343.,  13863.,
         4552.,  16115.,  96836.,  12485.,  15570.,   7780.,  22274.,
        69977.,  13637.,  27978.,  32758.,  11967.,   4825.,  23754.,
        60572.,  42918.,  34591.,  31556.,  57440.,  98604.,  42862.,
        28868.,  14048.,   4305.,  32873.,  72204.,  17751.,   1909.,
        19060.,   5331.,  34460.,     nan,  25080.,  99347.,  20859.,
        60704.,  10123.,   5496.,  33541.,   5186.,  52193.,  79660.,
        12699.,  20820.,   4644.,   6569.,  27321.,  24547.,  30009.,
        30131.,  30364.,  26802.,  36981.,  43191.,  14638.,  97197.,
        36366.,  57193.,  28404.,  62776.,  59840.,  40226.,  12664.,
        32220.,  179

In [29]:
# Convert each collection of user_id values into a set
fraudulent_users_set = set(fraudulent_users)
high_amount_users_set = set(high_amount_users)
top_20_users_grouped_set = set(top_20_users_grouped.index)  # Series indexed by user_id
filtered_users_set = set(filtered_users)
filtered_status_users_set = set([filtered_status_users])  # A count, not user_ids; left out of all_sets below
filtered_user_ids_set = set(filtered_user_ids)

# Collect the sets to compare
all_sets = [
    fraudulent_users_set,
    high_amount_users_set,
    top_20_users_grouped_set,
    filtered_users_set,
    filtered_user_ids_set,
]

# Find the users that appear in more than one set
from collections import Counter

# Count how many sets contain each user_id
user_id_counter = Counter(user_id for user_set in all_sets for user_id in user_set)

# Keep the user_id values that appear in more than one set
common_users = {user_id: count for user_id, count in user_id_counter.items() if count > 1}

common_users


{76803.0: 3,
 96899.0: 3,
 69253.0: 3,
 526.0: 3,
 68367.0: 3,
 83479.0: 3,
 56860.0: 3,
 62880.0: 3,
 34991.0: 3,
 3248.0: 3,
 27319.0: 3,
 6333.0: 3,
 88127.0: 3,
 10562.0: 3,
 47557.0: 3,
 45517.0: 3,
 44498.0: 3,
 14048.0: 3,
 14832.0: 3,
 13565.0: 3,
 31241.0: 2,
 10766.0: 2,
 93714.0: 2,
 79903.0: 2,
 11832.0: 2,
 86082.0: 2,
 24144.0: 2,
 47267.0: 2,
 7379.0: 2,
 18664.0: 2,
 28404.0: 2,
 87807.0: 2,
 52482.0: 2,
 18708.0: 2,
 48917.0: 2,
 12070.0: 2,
 77635.0: 2,
 34635.0: 2,
 24547.0: 2,
 53742.0: 2}
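
The counts above say how many sets a user appears in, but not which ones. Two of the sets are both built from `status` and both include "transaction_declined", so a single declined transaction can contribute two of the counts. A flag table, one boolean column per criterion, keeps that visible. This is a sketch with invented user ids and hypothetical criterion names, not the notebook's actual sets.

```python
import pandas as pd

# Toy sets: invented user ids, one set per criterion.
criteria = {
    "declined_or_rejected": {1, 2, 3},
    "max_amount": {3, 4},
    "fast_update": {1, 3},
    "night_hours": {3, 5},
}

# One row per user, one boolean column per criterion.
all_users = sorted(set().union(*criteria.values()))
flags = pd.DataFrame(
    {name: [user in users for user in all_users] for name, users in criteria.items()},
    index=all_users,
)

# Number of criteria met by each user, highest first.
flags["n_criteria"] = flags.sum(axis=1)
ranked = flags.sort_values("n_criteria", ascending=False)
```

Reading `ranked` row by row shows both the total and the specific criteria behind it, which makes overlapping criteria easier to spot.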