
# Data Quality and Cleanliness

MINE-4101: Applied Data Science  
Universidad de los Andes  
  
**Dataset:** Homicides Colombia ([datos.gov.co](datos.gov.co))

**References:**
- String similarity metrics: https://www.baeldung.com/cs/string-similarity-edit-distance
- Levenshtein lib's docs: https://rapidfuzz.github.io/Levenshtein/installation.html
  
Last update: September, 2024

In [4]:
!pip install levenshtein



In [5]:
import re
from datetime import datetime

import numpy as np
import pandas as pd

from Levenshtein import distance, ratio

In [6]:
pd.set_option("display.max_columns", None)  # Show all columns

### Load the data

In [56]:
dataset_url="https://github.com/amrios80/data-quality-tutorial/raw/master/data/homicides.csv"

homicides_df = pd.read_csv(dataset_url)

In [57]:
homicides_df.shape

(12400, 22)

In [58]:
homicides_df.dtypes

Unnamed: 0,0
FECHA,object
DEPARTAMENTO,object
MUNICIPIO,object
DIA,object
HORA,object
BARRIO,object
ZONA,object
CLASE DE SITIO,object
ARMA O MEDIO,object
MOVIL VICTIMA,object


In [59]:
homicides_df.sample(5)

Unnamed: 0,FECHA,DEPARTAMENTO,MUNICIPIO,DIA,HORA,BARRIO,ZONA,CLASE DE SITIO,ARMA O MEDIO,MOVIL VICTIMA,MOVIL AGRESOR,EDAD,GENERO,ESTADO CIVIL,CLASE EMPLEADO,PROFESION,ESCOLARIDAD,PAIS NACE,CODIGO DANE,AÑO DE NACIMIENTO,CÉDULA,CORREO
7957,08/25/2021 12:00:00 AM,ATLÁNTICO,BARRANQUILLA (CT),Martes,22:20,LAS ESMERALDA,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,A PIE,A PIE,37.0,MASCULINO,UNION LIBRE,RELIGIOSO,NO REPORTADO,SECUNDARIA,COLOMBIA,8001000.0,1978.0,52-981,correo977@unidatos.edu.co
8091,08/29/2021 12:00:00 AM,CAUCA,ARGELIA,Sábado,23:30,CENTRO,URBANA,VIAS PUBLICAS,ARMA BLANCA,A PIE,A PIE,37.0,MASCULINO,SOLTERO,AGRICULTOR,NO REPORTADO,ANALFABETA,COLOMBIA,19050000.0,1978.0,93-359,nelfhe6314@unidatos.edu.co
9169,10/01/2021 12:00:00 AM,VALLE,CALI (CT),Jueves,17:15,CIUDAD CAPRI E17,URBANA,CENTRO COMERCIAL,ARMA DE FUEGO,A PIE,PASAJERO MOTOCICLETA,40.0,MASCULINO,UNION LIBRE,EMPLEADO PARTICULAR,NO REPORTADO,SECUNDARIA,COLOMBIA,76001000.0,1975.0,44-048,sdacko6976@gmail.com
1134,02/01/2021 12:00:00 AM,CUNDINAMARCA,BOGOTÁ D.C. (CT),Domingo,22:02,LOS LAURELES ENO REPORTADO7,URBANA,VIAS PUBLICAS,ARMA BLANCA,A PIE,A PIE,20.0,MASCULINO,SOLTERO,DESEMPLEADO,NO REPORTADO,SECUNDARIA,COLOMBIA,11001000.0,1995.0,79-665,correo8789@colombia.gov.co
10447,11/10/2021 12:00:00 AM,VALLE,CALI (CT),Martes,10:40,EL INGENIO E17,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,CONDUCTOR MOTOCICLETA,VEHICULO,28.0,FEMENINO,SOLTERO,EMPLEADO PARTICULAR,NO REPORTADO,SUPERIOR,COLOMBIA,76001000.0,1987.0,42-658,correo9564@colombia.gov.co


### Work with datetimes

In [60]:
# Creating a lambda expression for datetime parsing
dateparse = lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S %p")

In [61]:
# Applying the validation to all values in the column
homicides_df["FECHA"].apply(dateparse)

# AN ERROR IS EXPECTED BECAUSE SOME VALUES DON'T FIT THE FORMAT

ValueError: time data '13/12/2021 12:00:00 AM' does not match format '%m/%d/%Y %H:%M:%S %p'

In [62]:
# Creating a function for validating which value is causing the previous error
def error_in_format(x):
    try:
        datetime.strptime(x, "%m/%d/%Y %H:%M:%S %p")
        return False
    except (ValueError, TypeError):  # malformed string, or a non-string value such as NaN
        return True

In [63]:
# Using the function for validation
homicides_df.loc[homicides_df["FECHA"].apply(error_in_format)]

Unnamed: 0,FECHA,DEPARTAMENTO,MUNICIPIO,DIA,HORA,BARRIO,ZONA,CLASE DE SITIO,ARMA O MEDIO,MOVIL VICTIMA,MOVIL AGRESOR,EDAD,GENERO,ESTADO CIVIL,CLASE EMPLEADO,PROFESION,ESCOLARIDAD,PAIS NACE,CODIGO DANE,AÑO DE NACIMIENTO,CÉDULA,CORREO
486,13/12/2021 12:00:00 AM,VALLE,CALI (CT),kunes,23:00,MOJICA E15,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,A PIE,A PIE,26.0,MASCULINO,SOLTERO,INDEPENDIENTE,NO REPORTADO,SECUNDARIA,COLOMBIA,76001000.0,89.0,80-330,lujhdf9132@gmail.com
695,30/01/2021 12:00:00 AM,BOLÍVAR,CARTAGENA (CT),Lunes,5:30,REP. DEL LIBANO,URBANA,VIAS PUBLICAS,ARMA BLANCA,NO REPORTADO,A PIE,25.0,MASCULINO,SOLTERO,EMPLEADO PARTICULAR,NO REPORTADO,SECUNDARIA,COLOMBIA,13001000.0,1990.0,12-915,ghumtg4094@unidatos.edu.co
1250,18/05/2021 12:00:00 AM,HUILA,TESALIA,Jueves,19:30,VEREDA PACARNI,RURAL,CASAS DE HABITACION,ARMA DE FUEGO,A PIE,A PIE,34.0,FEMENINO,SOLTERO,AGRICULTOR,NO REPORTADO,PRIMARIA,COLOMBIA,41797000.0,1981.0,99-095,sdaggf6639@gmail.com
12168,12/25/2021 12:00:00 MM,VALLE,PALMIRA,Viernes,15:30,LA EMILIA,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,A PIE,CONDUCTOR MOTOCICLETA,17.0,MASCULINO,SOLTERO,EMPLEADO PARTICULAR,NO REPORTADO,SECUNDARIA,COLOMBIA,76520000.0,1998.0,16-362,uschca1775@gmail.com
12399,TOTAL,,,,,,,,,,,,,,,,,,,,,
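The same check can be sketched in a vectorized way with `pd.to_datetime` and `errors="coerce"`, which turns unparseable values into `NaT` instead of raising. This is only an illustration on toy values that mirror the rows above; the `sample` series is hypothetical, and the `%I` (12-hour clock) directive is an assumption about how the `AM`/`PM` suffix should be read.

```python
import pandas as pd

# Toy values mirroring the kinds of entries shown above (not the real column)
sample = pd.Series([
    "08/25/2021 12:00:00 AM",  # valid month/day/year
    "13/12/2021 12:00:00 AM",  # day and month swapped
    "12/25/2021 12:00:00 MM",  # invalid AM/PM marker
    "TOTAL",                   # summary row, not a date
])

# errors="coerce" replaces anything that does not fit the format with NaT
parsed = pd.to_datetime(sample, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Rows that failed to parse are exactly the ones that became NaT
invalid_mask = parsed.isna()
print(sample[invalid_mask])
```

On the real data the mask could be used the same way as `error_in_format`, for example inside `homicides_df.loc[...]`.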


In [64]:
# Deleting the rows with malformed dates by their index labels
homicides_df.drop([486, 695, 1250, 12168, 12399], inplace=True)

In [65]:
# Trying to parse the datetime string again
homicides_df["FECHA"] = homicides_df["FECHA"].apply(dateparse)

In [66]:
homicides_df.dtypes

Unnamed: 0,0
FECHA,datetime64[ns]
DEPARTAMENTO,object
MUNICIPIO,object
DIA,object
HORA,object
BARRIO,object
ZONA,object
CLASE DE SITIO,object
ARMA O MEDIO,object
MOVIL VICTIMA,object


In [67]:
# Counting homicides by hour
homicides_df["FECHA"].dt.hour.value_counts()

# All datetime hour parts are the same

Unnamed: 0_level_0,count
FECHA,Unnamed: 1_level_1
12,12395
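The hour shows up as 12 rather than 0 because of how `strptime` treats `%p`: the AM/PM marker only changes the parsed hour when the hour itself is read with `%I` (12-hour clock). With `%H`, as in the format string used above, the marker is matched but otherwise ignored. A minimal check on one raw value:

```python
from datetime import datetime

raw = "08/25/2021 12:00:00 AM"

# %H reads "12" as a 24-hour value, so the AM marker has no effect
with_24h = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S %p")

# %I reads "12" on a 12-hour clock, so 12 AM becomes midnight
with_12h = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")

print(with_24h.hour)  # 12
print(with_12h.hour)  # 0
```

In this notebook the difference is harmless, since the time part of "FECHA" is discarded and replaced with the "HORA" column in the next cells.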


<span style="color:red">The homicide hour is available in a different column!</span>

In [68]:
# Merging both columns
homicides_df["FECHA"] = homicides_df["FECHA"].astype(str).apply(lambda x: x[:11]) + homicides_df["HORA"]

In [69]:
# Deleting redundant column
homicides_df.drop(columns=["HORA"], inplace=True)

In [70]:
# Making a new expression for datetime parsing
dateparse = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M")  # lambda lets you define functions inline

In [71]:
# Applying the expression
homicides_df["FECHA"] = homicides_df["FECHA"].apply(dateparse)
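As an alternative to slicing strings, the date and the hour can be combined arithmetically. The sketch below works on a toy frame (the `toy` and `FECHA_HORA` names are hypothetical) and assumes "HORA" always has the `H:MM` or `HH:MM` shape seen in the samples.

```python
import pandas as pd

toy = pd.DataFrame({
    "FECHA": pd.to_datetime(["2021-08-25", "2021-01-19"]),
    "HORA": ["22:20", "5:30"],
})

# Appending ":00" gives an h:mm:ss string that pd.to_timedelta can parse
time_offset = pd.to_timedelta(toy["HORA"] + ":00")

# normalize() resets any existing time part to midnight before adding the offset
toy["FECHA_HORA"] = toy["FECHA"].dt.normalize() + time_offset
print(toy)
```

This keeps the column as `datetime64[ns]` throughout, so no second parsing step is needed.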

In [72]:
homicides_df.dtypes

Unnamed: 0,0
FECHA,datetime64[ns]
DEPARTAMENTO,object
MUNICIPIO,object
DIA,object
BARRIO,object
ZONA,object
CLASE DE SITIO,object
ARMA O MEDIO,object
MOVIL VICTIMA,object
MOVIL AGRESOR,object


In [73]:
homicides_df.sample(5)

Unnamed: 0,FECHA,DEPARTAMENTO,MUNICIPIO,DIA,BARRIO,ZONA,CLASE DE SITIO,ARMA O MEDIO,MOVIL VICTIMA,MOVIL AGRESOR,EDAD,GENERO,ESTADO CIVIL,CLASE EMPLEADO,PROFESION,ESCOLARIDAD,PAIS NACE,CODIGO DANE,AÑO DE NACIMIENTO,CÉDULA,CORREO
6401,2021-07-08 23:55:00,ANTIOQUIA,PUERTO NARE,Miércoles,LA UNION,RURAL,CARRETERA,ARMA DE FUEGO,A PIE,A PIE,27.0,MASCULINO,UNION LIBRE,EMPLEADO PARTICULAR,NO REPORTADO,SECUNDARIA,COLOMBIA,5585000.0,1988.0,45-243,rmkmtf3913@gmail.com
10303,2021-11-07 11:20:00,ANTIOQUIA,ZARAGOZA,Sávado,SAN GREGORIO,URBANA,CALLEJÓN,ARMA DE FUEGO,A PIE,A PIE,55.0,MASCULINO,SOLTERO,INDEPENDIENTE,NO REPORTADO,PRIMARIA,COLOMBIA,5895000.0,1960.0,11-030,poqtgh5827@unidatos.edu.co
9237,2021-10-03 14:10:00,VALLE,CALI (CT),Sábado,CIUDAD CORDOBA E15,URBANA,FRENTE A RESIDENCIAS - VIA PUBLICA,ARMA DE FUEGO,A PIE,A PIE,27.0,MASCULINO,SOLTERO,INDEPENDIENTE,NO REPORTADO,SECUNDARIA,COLOMBIA,76001000.0,1988.0,61-471,msrpjl5143@unidatos.edu.co
7955,2021-08-25 20:30:00,ATLÁNTICO,BARRANQUILLA (CT),Martes,LA CHINITA,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,A PIE,A PIE,23.0,MASCULINO,SOLTERO,EMPLEADO PARTICULAR,NO REPORTADO,SECUNDARIA,COLOMBIA,8001000.0,1992.0,70-055,gjcrqq6020@unidatos.edu.co
11040,2021-11-28 18:45:00,BOLÍVAR,CARTAGENA (CT),Sábado,Cto PASACABALLOS,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,A PIE,PASAJERO MOTOCICLETA,27.0,MASCULINO,SOLTERO,DESEMPLEADO,NO REPORTADO,SECUNDARIA,COLOMBIA,13001000.0,1988.0,13-918,skggmr5619@unidatos.edu.co


In [74]:
homicides_df["FECHA"].min(), homicides_df["FECHA"].max()

(Timestamp('2021-01-01 00:00:00'), Timestamp('2021-12-31 23:58:00'))

### Fix categorical column

In [75]:
# Creating a reference list (data dictionary) of the valid departments for Colombia
departments_list = ['ANTIOQUIA', 'ATLÁNTICO', 'BOLÍVAR', 'BOYACÁ', 'CALDAS', 'CAQUETÁ',
       'CASANARE', 'CAUCA', 'CESAR', 'CHOCÓ', 'CÓRDOBA', 'META',
       'CUNDINAMARCA', 'HUILA', 'MAGDALENA', 'NARIÑO', 'PUTUMAYO',
       'RISARALDA', 'SANTANDER', 'SUCRE', 'TOLIMA', 'VALLE',
       'NORTE DE SANTANDER', 'GUAJIRA', 'QUINDÍO', 'SAN ANDRÉS Y PROVIDENCIA', 'ARAUCA',
       'GUAINÍA', 'VICHADA', 'VAUPÉS', 'GUAVIARE', 'AMAZONAS']

In [78]:
# Finding values that do not match the dictionary
homicides_df.loc[~homicides_df["DEPARTAMENTO"].isin(departments_list), "DEPARTAMENTO"].unique()  # Unique values from the rows whose department is not in departments_list

array([], dtype=object)

<span style="color:red">TODO: Replace the values identified as errors with valid values from the dictionary.</span>

<span style="color:red">Hint: You can use the replace() pandas function.</span>

In [77]:
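# Expanding the short department name to the name used in the dictionary
homicides_df["DEPARTAMENTO"] = homicides_df["DEPARTAMENTO"].apply(lambda x: x.replace('SAN ANDRÉS', 'SAN ANDRÉS Y PROVIDENCIA'))
# The substring replacement above also hits values that were already complete, so the doubled suffix is collapsed here
homicides_df["DEPARTAMENTO"] = homicides_df["DEPARTAMENTO"].apply(lambda x: x.replace('SAN ANDRÉS Y PROVIDENCIA Y PROVIDENCIA', 'SAN ANDRÉS Y PROVIDENCIA'))
# Expanding the abbreviated department name
homicides_df["DEPARTAMENTO"] = homicides_df["DEPARTAMENTO"].apply(lambda x: x.replace('N. DE SANTANDER', 'NORTE DE SANTANDER'))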
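A possible variant of this fix uses `Series.replace` with a mapping, which matches whole cell values rather than substrings and so avoids the doubled suffix. The sketch assumes the erroneous values are exactly `'SAN ANDRÉS'` and `'N. DE SANTANDER'`; the `departments` series and `corrections` mapping are hypothetical names.

```python
import pandas as pd

# Toy column with the two assumed error values and two valid ones
departments = pd.Series(
    ["SAN ANDRÉS", "N. DE SANTANDER", "VALLE", "SAN ANDRÉS Y PROVIDENCIA"]
)

# Whole-value mapping from the erroneous spelling to the dictionary value
corrections = {
    "SAN ANDRÉS": "SAN ANDRÉS Y PROVIDENCIA",
    "N. DE SANTANDER": "NORTE DE SANTANDER",
}

fixed = departments.replace(corrections)
print(fixed.tolist())
```

Because only exact matches are replaced, values that are already correct stay untouched.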


### Analyze potential duplicates

In [79]:
# Detecting duplicates by "CÉDULA" column
duplicates_by_cedula = homicides_df.loc[homicides_df["CÉDULA"].duplicated(keep=False)]

In [80]:
duplicates_by_cedula.shape

(1627, 21)

In [81]:
duplicates_by_cedula.sort_values("CÉDULA", ascending=True).head(6)

Unnamed: 0,FECHA,DEPARTAMENTO,MUNICIPIO,DIA,BARRIO,ZONA,CLASE DE SITIO,ARMA O MEDIO,MOVIL VICTIMA,MOVIL AGRESOR,EDAD,GENERO,ESTADO CIVIL,CLASE EMPLEADO,PROFESION,ESCOLARIDAD,PAIS NACE,CODIGO DANE,AÑO DE NACIMIENTO,CÉDULA,CORREO
4041,2021-05-03 23:30:00,CAUCA,EL TAMBO,Domingo,LA VICTORIA,RURAL,"BARES, CANTINAS Y SIMILARES",ARMA DE FUEGO,A PIE,A PIE,45.0,MASCULINO,UNION LIBRE,AGRICULTOR,NO REPORTADO,PRIMARIA,COLOMBIA,19256000.0,1970.0,10-048,aitufn1227@gmail.com
3810,2021-04-26 03:00:00,ANTIOQUIA,SAN JERÓNIMO,Domingo,LA PLAYA,URBANA,"HOTELES, RESIDENCIAS, Y SIMILARES.",CUERDA/SOGA/CADENA,A PIE,A PIE,25.0,MASCULINO,SOLTERO,INDEPENDIENTE,NO REPORTADO,SECUNDARIA,COLOMBIA,5656000.0,1990.0,10-048,ibbcpu2509@unidatos.edu.co
693,2021-01-19 05:30:00,ANTIOQUIA,SALGAR,Lunes,LA HABANA,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,A PIE,A PIE,58.0,MASCULINO,SOLTERO,AGRICULTOR,NO REPORTADO,PRIMARIA,COLOMBIA,5642000.0,57.0,10-079,unhoqj1172@unidatos.edu.co
10796,2021-11-20 03:00:00,VALLE,CALI (CT),Viernes,POTRERO GRANDE E21,URBANA,VIAS PUBLICAS,ARMA DE FUEGO,A PIE,A PIE,21.0,MASCULINO,SOLTERO,EMPLEADO PARTICULAR,NO REPORTADO,SECUNDARIA,COLOMBIA,76001000.0,1994.0,10-079,iknnoj8430@unidatos.edu.co
2386,2021-03-14 02:00:00,META,FUENTE DE ORO,Sábado,VEREDA PUERTO NUEVO,RURAL,"BARES, CANTINAS Y SIMILARES",ARMA BLANCA,A PIE,A PIE,42.0,MASCULINO,SOLTERO,INDEPENDIENTE,NO REPORTADO,PRIMARIA,COLOMBIA,50287000.0,1973.0,10-255,afompq7113@unidatos.edu.co
138,2021-01-02 06:20:00,CAUCA,SOTARA,Viernes,CENTRO,RURAL,VIAS PUBLICAS,ARMA BLANCA,A PIE,A PIE,22.0,MASCULINO,SOLTERO,EMPLEADO EJERCITO,NO REPORTADO,SECUNDARIA,COLOMBIA,19760000.0,1993.0,10-255,cobgqs8819@unidatos.edu.co


<span style="color:red">TODO: Delete the records with a duplicated "CÉDULA".</span>

<span style="color:red">Hint: You can use drop_duplicates() pandas function.</span>
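One way to approach this, shown on a toy frame with values borrowed from the output above: `drop_duplicates` with `subset` restricts the comparison to "CÉDULA", and `keep` decides which copy survives. Whether to keep the first record or discard every conflicting record is a judgment call that this sketch does not settle.

```python
import pandas as pd

toy = pd.DataFrame({
    "CÉDULA": ["10-048", "10-048", "10-079", "13-918"],
    "MUNICIPIO": ["EL TAMBO", "SAN JERÓNIMO", "SALGAR", "CARTAGENA (CT)"],
})

# Option A: keep the first record seen for each CÉDULA
keep_first = toy.drop_duplicates(subset=["CÉDULA"], keep="first")

# Option B: drop every record whose CÉDULA appears more than once
drop_all = toy.drop_duplicates(subset=["CÉDULA"], keep=False)

print(len(keep_first), len(drop_all))
```

Option B is the more conservative choice when there is no way to tell which of the conflicting records is the correct one.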

### Fix formats

In [None]:
# Using a regular expression to find "CÉDULA" values that do not match the pattern XX-XXX
cedula_malformed = homicides_df.loc[homicides_df["CÉDULA"].apply(lambda x: (re.fullmatch(r"\d{2}-\d{3}", x) is None))]

In [None]:
cedula_malformed.shape

In [None]:
cedula_malformed.head(6)

<span style="color:red">TODO: Fix the malformed "CÉDULA" values.</span>

<span style="color:red">Assume that the third digit in the first part corresponds to the first one in the second part.</span>
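A minimal sketch of that rule, assuming the malformed values have the shape XXX-XX (three digits, a hyphen, two digits). The function name `fix_cedula` is hypothetical, and values with any other shape are returned unchanged so they can be reviewed separately.

```python
import re

def fix_cedula(value):
    """Move the third digit of the first part to the front of the second part."""
    # Already valid: XX-XXX
    if re.fullmatch(r"\d{2}-\d{3}", value):
        return value
    # Assumed malformed shape: XXX-XX
    match = re.fullmatch(r"(\d{2})(\d)-(\d{2})", value)
    if match:
        return f"{match.group(1)}-{match.group(2)}{match.group(3)}"
    # Unknown shape: leave it untouched
    return value

print(fix_cedula("123-45"))
print(fix_cedula("52-981"))
```

It could then be applied to the column with `homicides_df["CÉDULA"].apply(fix_cedula)`.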

<span style="color:red">TODO: Check the "CORREO" column (pattern: 4 digits before the @; only .edu.co and .com domains are allowed).</span>
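The "CORREO" rule could be checked with a regular expression along these lines. The exact reading of the pattern is an assumption: letters followed by exactly four digits before the `@`, then a domain ending in `.edu.co` or `.com`. The names `EMAIL_PATTERN` and `is_valid_email` are hypothetical.

```python
import re

# Letters, then exactly four digits, then "@", then a domain ending in .edu.co or .com
EMAIL_PATTERN = re.compile(r"[A-Za-z]+\d{4}@[A-Za-z0-9.-]+\.(?:edu\.co|com)")

def is_valid_email(value):
    """Return True when the whole string fits the assumed pattern."""
    return EMAIL_PATTERN.fullmatch(value) is not None

# Values taken from the sample rows shown earlier
for email in [
    "nelfhe6314@unidatos.edu.co",  # four digits, allowed domain
    "sdacko6976@gmail.com",        # four digits, allowed domain
    "correo977@unidatos.edu.co",   # only three digits
    "correo8789@colombia.gov.co",  # domain not allowed
]:
    print(email, is_valid_email(email))
```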

### Automate imputation of categorical values

In [None]:
homicides_df["DIA"].unique()

In [None]:
# Calculating the distance between two words using the Levenshtein distance
print(distance("sábado", "sabaod"))
print(distance("sábado", "viernes"))

In [None]:
print(ratio("sábado", "sabaod"))
print(ratio("sábado", "viernes"))

<span style="color:red">TODO: Create a function to automatically fix the typing errors in the "DIA" column.</span>
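One possible shape for such a function: compare each value against the seven valid day names and pick the closest one by edit distance. To stay self-contained, the sketch uses a small pure-Python Levenshtein implementation (`edit_distance`) in place of the library's `distance`; `VALID_DAYS`, `closest_day` and the `max_distance=2` threshold are assumptions made for illustration.

```python
VALID_DAYS = ["Lunes", "Martes", "Miércoles", "Jueves", "Viernes", "Sábado", "Domingo"]

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def closest_day(value, max_distance=2):
    """Return the nearest valid day name, or the input if nothing is close enough."""
    best = min(VALID_DAYS, key=lambda day: edit_distance(value.lower(), day.lower()))
    if edit_distance(value.lower(), best.lower()) <= max_distance:
        return best
    return value

print(closest_day("kunes"))
print(closest_day("Sávado"))
```

With the Levenshtein package installed, `distance` can be swapped in for `edit_distance`, and the function applied with `homicides_df["DIA"].apply(closest_day)`. The threshold guards against mapping unrelated strings to a day name.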

In [None]:
# It's expected to have a list with only 7 values
homicides_df["DIA"].unique()

### Recalculate based on a different column

In [None]:
homicides_df[["AÑO DE NACIMIENTO", "EDAD"]].sample(10)

<span style="color:red">TODO: Fix "EDAD" column using "AÑO DE NACIMIENTO".</span>
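A hedged sketch of one approach. In the sample rows shown earlier, "EDAD" plus "AÑO DE NACIMIENTO" adds up to 2015 (for example 37 + 1978), so the reference year is left as a parameter instead of being fixed to the 2021 event year. Reading two-digit birth years such as 89 as 1989 is an assumption, and the names `recompute_age` and `EDAD_RECALCULADA` are hypothetical.

```python
import pandas as pd

def recompute_age(birth_year, reference_year):
    """Derive the age from the birth year, expanding two-digit years to 19xx."""
    # Assumption: any value below 100 is a two-digit year from the 1900s
    full_year = birth_year.where(birth_year >= 100, birth_year + 1900)
    return reference_year - full_year

# Toy rows taken from values visible in the outputs above
toy = pd.DataFrame({
    "AÑO DE NACIMIENTO": [1978.0, 89.0, 57.0],
    "EDAD": [37.0, 26.0, 58.0],
})

toy["EDAD_RECALCULADA"] = recompute_age(toy["AÑO DE NACIMIENTO"], reference_year=2015)
print(toy)
```

Comparing the recalculated column against "EDAD" before overwriting it shows which rows actually disagree.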

### Conclusion

<span style="color:red">Summarize the data quality issues found in the dataset, the data quality dimension each one relates to, and the strategy implemented to solve or mitigate it.</span>