# Database Analysis
## Universidad de los Andes - Smurfit Westrock
### Proyecto Intermedio de Consultoría Empresarial (PICE) 202520
Daniel Benavides

This notebook performs an exploratory and preparatory analysis of Smurfit Westrock’s payment data. It first imports the raw datasets from Excel or CSV files and cleans them, addressing missing values, duplicates, and inconsistencies. The data is then transformed: numerical variables are normalized, and categorical variables such as suppliers, cost centers, and expense types are encoded. Exploratory Data Analysis (EDA) visualizes payment distributions, identifies outliers and temporal trends, and examines correlations among key variables. Feature engineering then creates indicators that capture behavioral patterns and transaction frequency, so that the dataset is ready for anomaly detection models. The analysis yields preliminary insights and recommendations to guide the development of Machine Learning models and to improve overall data quality.
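
As a rough illustration of the transformations described above, the sketch below applies z-score normalization, categorical encoding, and a simple frequency feature to a made-up frame. The column names `Usuario`, `Centro de Coste`, and `En moneda de la sociedad` come from the dataset; the values and the derived column names (`importe_z`, `centro_code`, `tx_por_usuario`) are assumptions for illustration only, not the method used later in the project.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Usuario": ["A", "A", "B", "C"],
    "Centro de Coste": ["MC4006", "MC4006", "PC0001", "MC4006"],
    "En moneda de la sociedad": [100.0, 300.0, 200.0, 400.0],
})

amount = toy["En moneda de la sociedad"]

# Normalization: z-score of the amount (hypothetical column name).
toy["importe_z"] = (amount - amount.mean()) / amount.std()

# Encoding: integer codes for a categorical variable.
toy["centro_code"] = toy["Centro de Coste"].astype("category").cat.codes

# Feature engineering: number of transactions recorded per user.
toy["tx_por_usuario"] = (
    toy.groupby("Usuario")["En moneda de la sociedad"].transform("count")
)
```

Other scalers or encoders (for example min-max scaling or one-hot encoding) may suit the anomaly detection models better; the choice here is only meant to make the three steps concrete.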

In [40]:
# Data manipulation libraries
import numpy as np
import pandas as pd

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "browser"
import altair as alt

from matplotlib import font_manager
plt.rcParams['font.family'] = 'Arial'

Data downloaded as Excel files

In [None]:
bd_xl_1 = pd.read_excel("PICE BD 2025-Parte 1.xlsx")
bd_xl_2 = pd.read_excel("PICE BD 2025-Parte 2.xlsx")
bd_xl_3 = pd.read_excel("PICE BD 2025-Parte 3.xlsx")

Data downloaded as CSV files (ideal)

In [41]:
bd_cs_1 = pd.read_csv("PICE BD 2025-Parte 1.csv", low_memory=False)
bd_cs_2 = pd.read_csv("PICE BD 2025-Parte 2.csv", low_memory=False)
bd_cs_3 = pd.read_csv("PICE BD 2025-Parte 3.csv", low_memory=False)

In [42]:
# Combine the three CSV parts into a single DataFrame
df = pd.concat([bd_cs_1, bd_cs_2, bd_cs_3], ignore_index=True)

# Save the combined data as one CSV file
df.to_csv("PICE BD 2025 - Joint.csv", index=False)

In [43]:
df = pd.read_csv("PICE BD 2025 - Joint.csv")

df.info()
df


Columns (2,3,4,5,6,9,10,13,14,15,17,20,22,23,26,27,28,29,30,31,32,33,35,36) have mixed types. Specify dtype option on import or set low_memory=False.



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340112 entries, 0 to 1340111
Data columns (total 37 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Número Documento Referencia  446678 non-null  float64
 1   Material                     268788 non-null  float64
 2   Número de Cuenta             446678 non-null  object 
 3   Denominación                 446678 non-null  object 
 4   Centro de Coste              188181 non-null  object 
 5   En moneda de la sociedad     446678 non-null  object 
 6   Cantidad                     281959 non-null  object 
 7   Acreedor                     68287 non-null   float64
 8   Número Documento             446678 non-null  float64
 9   Usuario                      446678 non-null  object 
 10  Descripción                  205598 non-null  object 
 11  Período                      446678 non-null  float64
 12  Documento Compras            89306 non-null   float64
 1

Unnamed: 0,Número Documento Referencia,Material,Número de Cuenta,Denominación,Centro de Coste,En moneda de la sociedad,Cantidad,Acreedor,Número Documento,Usuario,...,División,Elemento PEP,Fecha Entrada,Fecha Valor,Hora,Ledger,Orden,Pedido Cliente,Se ha anulado el Documento,Sector
0,4.000295e+09,5000133.0,71050596,Mecanica blanqueada,,17360785728,10046751.00,,36958801.0,ULLOAFE,...,3.0,,03.06.2025,31.05.2025,10:24:31,8A,,,,GE
1,4.000295e+09,5000133.0,71050596,Mecanica blanqueada,,17304230016,10014022.00,,36959182.0,ULLOAFE,...,3.0,,03.06.2025,31.05.2025,10:52:52,8A,,,,GE
2,4.000295e+09,5000132.0,71050593,Kraft pino ( ksw ),,14253494931,7887933.00,,36958801.0,ULLOAFE,...,3.0,,03.06.2025,31.05.2025,10:24:31,8A,,,,GE
3,4.000295e+09,5000132.0,71050593,Kraft pino ( ksw ),,14194668046,7855378.00,,36959182.0,ULLOAFE,...,3.0,,03.06.2025,31.05.2025,10:52:52,8A,,,,GE
4,4.000295e+09,5000132.0,71050513,Kraft pino ( ksw ),MC4006,14180625849,7847607.00,,36959235.0,ULLOAFE,...,6.0,,03.06.2025,31.05.2025,11:02:51,8A,,,,GE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1340107,,,,,,,,,,,...,,,,,,,,,,
1340108,,,,,,,,,,,...,,,,,,,,,,
1340109,,,,,,,,,,,...,,,,,,,,,,
1340110,,,,,,,,,,,...,,,,,,,,,,
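
The mixed-types warning in the output above appears because, by default, pandas infers column types chunk by chunk while reading a large file. One possible way to avoid it, sketched here on a made-up in-memory CSV rather than the project file, is to read every column as text with `dtype=str` and then convert the columns that are genuinely numeric. The sample rows and the choice of which column to convert are assumptions for illustration.

```python
import io

import pandas as pd

# Toy stand-in for one CSV part (illustrative rows, not project data).
sample_csv = io.StringIO(
    "Número de Cuenta,Centro de Coste,Cantidad\n"
    "71050596,,10046751.00\n"
    "71050513,MC4006,7847607.00\n"
    ",,\n"
)

# Reading all columns as text skips per-chunk type guessing,
# which is what produces the mixed-types warning.
sample = pd.read_csv(sample_csv, dtype=str)

# Convert the columns that are truly numeric afterwards;
# values that cannot be parsed become NaN instead of raising.
sample["Cantidad"] = pd.to_numeric(sample["Cantidad"], errors="coerce")
```

Account numbers and cost centers stay as text this way, which also preserves any leading zeros or alphanumeric codes.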


### Data Cleaning and Transformation

In [48]:
# Drop the columns that are excluded from the analysis
df.drop(columns=["Número Documento Referencia",
                 "Acreedor",
                 "Número Documento",
                 "Documento Compras",
                 "Clase de Documento",
                 "Clase de Actividad",
                 "Deudor",
                 "Elemento PEP",
                 "Orden"], inplace=True)

# Save the reduced dataset
df.to_csv("PICE BD 2025 - Joint_CnT.csv", index=False)
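
Because the drop above modifies `df` in place, running the cell a second time raises a `KeyError`: the columns no longer exist. A minimal sketch of a re-runnable variant, on a made-up frame, passes `errors="ignore"` so that missing labels are skipped. The variable names `cols_to_drop`, `cleaned`, and `cleaned_again` are hypothetical.

```python
import pandas as pd

# Toy frame with two of the dropped columns and one that is kept.
toy = pd.DataFrame({
    "Acreedor": [1.0],
    "Orden": [None],
    "Usuario": ["ULLOAFE"],
})

# "Deudor" is deliberately absent from the toy frame.
cols_to_drop = ["Acreedor", "Orden", "Deudor"]

# errors="ignore" skips labels that are not present, so the
# same call can be repeated without raising KeyError.
cleaned = toy.drop(columns=cols_to_drop, errors="ignore")
cleaned_again = cleaned.drop(columns=cols_to_drop, errors="ignore")
```

Returning a new frame instead of using `inplace=True` also keeps the original data available for comparison.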


In [50]:
df = pd.read_csv("PICE BD 2025 - Joint_CnT.csv")
df


Columns (1,2,3,4,5,6,7,9,10,11,13,16,18,19,20,21,22,23,24,26,27) have mixed types. Specify dtype option on import or set low_memory=False.



Unnamed: 0,Material,Número de Cuenta,Denominación,Centro de Coste,En moneda de la sociedad,Cantidad,Usuario,Descripción,Período,Clase de Movimiento V,...,Centro de Beneficio,Clase de Factura,División,Fecha Entrada,Fecha Valor,Hora,Ledger,Pedido Cliente,Se ha anulado el Documento,Sector
0,5000133.0,71050596,Mecanica blanqueada,,17360785728,10046751.00,ULLOAFE,TRASLADO PULPAPEL-MOLINOS,5.0,502,...,PC01,,3.0,03.06.2025,31.05.2025,10:24:31,8A,,,GE
1,5000133.0,71050596,Mecanica blanqueada,,17304230016,10014022.00,ULLOAFE,TRASLADO PULPAPEL-MOLINOS,5.0,502,...,PC01,,3.0,03.06.2025,31.05.2025,10:52:52,8A,,,GE
2,5000132.0,71050593,Kraft pino ( ksw ),,14253494931,7887933.00,ULLOAFE,TRASLADO PULPAPEL-MOLINOS,5.0,502,...,PC01,,3.0,03.06.2025,31.05.2025,10:24:31,8A,,,GE
3,5000132.0,71050593,Kraft pino ( ksw ),,14194668046,7855378.00,ULLOAFE,TRASLADO PULPAPEL-MOLINOS,5.0,502,...,PC01,,3.0,03.06.2025,31.05.2025,10:52:52,8A,,,GE
4,5000132.0,71050513,Kraft pino ( ksw ),MC4006,14180625849,7847607.00,ULLOAFE,PULPA KRAFT PINO,5.0,201,...,MC04,,6.0,03.06.2025,31.05.2025,11:02:51,8A,,,GE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1340107,,,,,,,,,,,...,,,,,,,,,,
1340108,,,,,,,,,,,...,,,,,,,,,,
1340109,,,,,,,,,,,...,,,,,,,,,,
1340110,,,,,,,,,,,...,,,,,,,,,,
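
The preview above ends with rows that contain no values at all, and the date columns follow a day.month.year layout (for example `31.05.2025`). A minimal sketch of two follow-up cleaning steps, assuming the empty rows are padding from the export, is to drop rows where every column is missing and then parse dates and quantities into proper types. The frame below is made up; only the column names and value formats mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above (illustrative values).
toy = pd.DataFrame({
    "Cantidad": ["10046751.00", "7847607.00", np.nan],
    "Fecha Entrada": ["03.06.2025", "03.06.2025", np.nan],
    "Hora": ["10:24:31", "11:02:51", np.nan],
})

# Remove rows in which every column is missing.
toy = toy.dropna(how="all").reset_index(drop=True)

# Dates are written as day.month.year.
toy["Fecha Entrada"] = pd.to_datetime(toy["Fecha Entrada"], format="%d.%m.%Y")

# Quantities arrive as text; unparseable values become NaN.
toy["Cantidad"] = pd.to_numeric(toy["Cantidad"], errors="coerce")
```

On the real data it would be worth comparing the row count before and after `dropna(how="all")` against the non-null counts reported by `df.info()`.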
