# 📊 **Data Wrangling**

> Data wrangling, or **data munging**, is the process of cleansing, transforming, and structuring raw data to ensure that it is correctly formatted for analysis and informed decision-making.

## ♻ **Data Wrangling Process**

### **Step 1️⃣: Discover**

Initially, you focus on exploring and understanding the data you have gathered. This step can include identifying the data sources (databases, CSV files, CRM and ERP systems, etc.), assessing data quality, and gaining insight into the structure and format of the data.

### **Step 2️⃣: Structure**

In the structuring step, you convert the data into a format that facilitates efficient analysis. The specific processes depend on the type of analysis or model you plan to build, but in general data structuring involves **reshaping**, **handling missing values**, and **dealing with data types**.
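As a rough illustration of those three structuring tasks, here is a minimal sketch on a made-up table. The `wide` frame, its column names, and its values are hypothetical and are not part of the dataset used later in the practice section.

```python
import pandas as pd

# Hypothetical wide table: one row per province, one column per month
wide = pd.DataFrame({
    "province": ["A", "B"],
    "2024-01": [10, 7],
    "2024-02": [12, None],
})

# Reshaping: wide -> long, so each row is one (province, month) observation
long = wide.melt(id_vars="province", var_name="month", value_name="cases")

# Data types: parse the month labels into real datetimes
long["month"] = pd.to_datetime(long["month"], format="%Y-%m")

# Missing values: the nullable integer dtype keeps the gap as <NA>
long["cases"] = long["cases"].astype("Int64")

print(long)
```

The long format is usually easier to group and plot, but whether it is the right shape depends on the analysis you have in mind.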

### **Step 3️⃣: Clean**

Data cleansing is a crucial step to address inconsistencies, errors, and outliers within the dataset. This involves removing or correcting inaccurate data, handling duplicates, and addressing any anomalies that could impact the reliability of analyses.
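One common way to flag outliers is the interquartile-range (IQR) rule. The sketch below applies it to a hypothetical list of ages; the 1.5 multiplier is a convention, not a requirement, and flagged values should be reviewed rather than deleted blindly.

```python
import pandas as pd

# Hypothetical ages; 250 is an obvious data-entry error
ages = pd.Series([12, 14, 13, 16, 15, 250, 14], name="age")

# IQR rule: flag values further than 1.5 * IQR from the quartiles
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# Values flagged for review
print(ages[is_outlier])

# Keep only the values inside the accepted range
clean_ages = ages[~is_outlier]
```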

### **Step 4️⃣: Enrich**

Enriching your data means adding supplementary information to give it greater context and detail. This might involve combining datasets, deriving useful features, or integrating data from external sources. The purpose is to expand the original dataset so it becomes richer and more useful for analysis. Whenever new data is included, it should be properly organized and cleaned.
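A small sketch of both ideas, combining datasets and deriving a feature. The `cases` and `regions` frames and the `is_minor` column are hypothetical, and the region labels are for illustration only.

```python
import pandas as pd

# Hypothetical case records
cases = pd.DataFrame({
    "province": ["GUAYAS", "AZUAY", "GUAYAS"],
    "age": [14, 3, 25],
})

# Hypothetical external lookup table (illustrative region labels)
regions = pd.DataFrame({
    "province": ["GUAYAS", "AZUAY"],
    "region": ["COSTA", "SIERRA"],
})

# Left join keeps every case even if its province is missing from the lookup;
# validate="many_to_one" raises if the lookup contains duplicate provinces
enriched = cases.merge(regions, on="province", how="left", validate="many_to_one")

# Derived feature: a simple flag computed from an existing column
enriched["is_minor"] = enriched["age"] < 18

print(enriched)
```

A left join is used here so that enrichment never silently drops original records; unmatched provinces would simply get a missing `region`.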

### **Step 5️⃣: Validate**

Validation is the process of confirming the accuracy and dependability of your processed data. It involves identifying inconsistencies, ensuring data integrity, and verifying compliance with established standards. By validating, you can be more confident in the dataset’s correctness and its suitability for meaningful analysis.
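Validation checks can be collected in a small function that reports every problem it finds. This is only a sketch: the `validate` helper, the `age` column, and the 0 to 120 range are assumptions chosen for the example.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values present")
    if df.duplicated().any():
        problems.append("duplicated rows present")
    if not df["age"].between(0, 120).all():
        problems.append("age outside 0-120")
    return problems

# Hypothetical data with one duplicated row and one implausible age
sample = pd.DataFrame({"age": [14, 14, 200], "sex": ["F", "F", "M"]})
print(validate(sample))
```

Returning a list instead of raising on the first failure lets you see all issues in a single pass.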

### **Step 6️⃣: Publish**

Once your dataset has been curated and validated, it is ready for analysis or distribution to business users. At this stage, you document the data lineage and the steps followed throughout the wrangling process, provide metadata, and prepare the dataset for storage or integration with data science and analytics platforms. Publishing the data promotes collaboration and enables others to leverage it for their analyses or decision-making.
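One lightweight way to ship metadata with a dataset is a JSON file saved next to the CSV. The sketch below assumes that layout; the file names, the metadata keys, and the listed steps are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical curated dataset
df_out = pd.DataFrame({"province": ["GUAYAS"], "age": [14]})

# Temporary folder, used here only so the example has somewhere to write
out_dir = Path(tempfile.mkdtemp())
data_path = out_dir / "dataset.csv"
meta_path = out_dir / "dataset.meta.json"

df_out.to_csv(data_path, index=False)

# Minimal metadata: size, column types, and the wrangling steps applied
metadata = {
    "rows": len(df_out),
    "columns": {col: str(dtype) for col, dtype in df_out.dtypes.items()},
    "steps": ["replaced placeholder values", "dropped duplicates"],
}
meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
```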

### **Practice**

#### **Import Libraries**

In [14]:
import numpy as np
import pandas as pd

#### **Load Dataset**

In [3]:
df = pd.read_csv("./data/mdi_personasdesaparecidas_pm_2024_enero.csv",
                 encoding="latin1",
                 sep=";")


#### **Exploring | Discovering Data**

* Explore the general information of the dataset
* Data types
* Number of observations and features
* Data quality

**Number of rows and columns**

In [4]:
# Number of rows and columns
print(f"No. Rows: {df.shape[0]}")
print(f"No. Columns: {df.shape[1]}")

No. Rows: 551
No. Columns: 10


**Data types of each column**

In [6]:
dtypes_df = pd.DataFrame({
    "Columns": df.columns,
    "DataType": df.dtypes.values
})

In [7]:
dtypes_df

| | Columns | DataType |
|---|---|---|
| 0 | Provincia | object |
| 1 | Latitud | object |
| 2 | Longitud | object |
| 3 | Edad Aprox. | float64 |
| 4 | Sexo | object |
| 5 | Motivo Desaparción | object |
| 6 | Motivo Desaparción Obs. | object |
| 7 | Fecha Desaparición | object |
| 8 | Situación Actual | object |
| 9 | Fecha Localización | object |


**Review the first and last n observations**

In [8]:
# First n rows
df.head(10)

| | Provincia | Latitud | Longitud | Edad Aprox. | Sexo | Motivo Desaparción | Motivo Desaparción Obs. | Fecha Desaparición | Situación Actual | Fecha Localización |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AZUAY | -29024543 | -790028905 | 3.0 | MUJER | SIN_DATO | | 10/1/2024 | DESAPARECIDO | SIN_DATO |
| 1 | EL ORO | -32610165 | -799708185 | 12.0 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 1/1/2024 | ENCONTRADO | 3/1/2024 |
| 2 | GUAYAS | -233695892 | -7940604541 | 14.0 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 14/1/2024 | ENCONTRADO | 16/1/2024 |
| 3 | GUAYAS | -21327363 | -795954587 | 13.0 | MUJER | VIOLENCIA | VIOLENCIA FÍSICA | 29/1/2024 | ENCONTRADO | 30/1/2024 |
| 4 | GUAYAS | -2242258498 | -7963950637 | 14.0 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 26/1/2024 | ENCONTRADO | 3/2/2024 |
| 5 | IMBABURA | 2227898 | -782623429 | 16.0 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 1/1/2024 | ENCONTRADO | 6/1/2024 |
| 6 | MANABI | -1064535401 | -8046702147 | 13.0 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 3/1/2024 | ENCONTRADO | 8/1/2024 |
| 7 | SUCUMBIOS | 72216083 | -7688166879 | 13.0 | HOMBRE | SIN_DATO | SIN_DATO | 29/1/2024 | DESAPARECIDO | SIN_DATO |
| 8 | SUCUMBIOS | 72216083 | -7688166879 | 13.0 | MUJER | SIN_DATO | SIN_DATO | 29/1/2024 | DESAPARECIDO | SIN_DATO |
| 9 | PICHINCHA | -2188216 | -785135489 | 16.0 | HOMBRE | CAUSAS PERSONALES | CIRCUNSTANCIAL | 9/1/2024 | ENCONTRADO | 22/1/2024 |


In [9]:
# Last n rows
df.tail(10)

| | Provincia | Latitud | Longitud | Edad Aprox. | Sexo | Motivo Desaparción | Motivo Desaparción Obs. | Fecha Desaparición | Situación Actual | Fecha Localización |
|---|---|---|---|---|---|---|---|---|---|---|
| 541 | GUAYAS | -215559646 | -7992212013 | 11.0 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 29/1/2024 | ENCONTRADO | 3/2/2024 |
| 542 | GUAYAS | -20755495 | -799256079 | 13.0 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 22/1/2024 | ENCONTRADO | 3/2/2024 |
| 543 | GUAYAS | -21258717 | -800157633 | 15.0 | HOMBRE | SIN_DATO | SIN_DATO | 24/1/2024 | DESAPARECIDO | SIN_DATO |
| 544 | GUAYAS | -22402791 | -799220686 | 25.0 | MUJER | CAUSAS SOCIALES | DROGADICCIÓN | 5/1/2024 | ENCONTRADO | 6/2/2024 |
| 545 | GUAYAS | -2221876 | -798898816 | 16.0 | HOMBRE | CAUSAS PERSONALES | CIRCUNSTANCIAL | 6/1/2024 | ENCONTRADO | 5/2/2024 |
| 546 | GUAYAS | -2168974 | -798397207 | 29.0 | HOMBRE | SIN_DATO | SIN_DATO | 24/1/2024 | DESAPARECIDO | SIN_DATO |
| 547 | GUAYAS | -2111457399 | -7993233353 | 18.0 | HOMBRE | SIN_DATO | SIN_DATO | 23/1/2024 | DESAPARECIDO | SIN_DATO |
| 548 | GUAYAS | -2111457399 | -7993233353 | 17.0 | MUJER | SIN_DATO | SIN_DATO | 23/1/2024 | DESAPARECIDO | SIN_DATO |
| 549 | GUAYAS | -2111457399 | -7993233353 | 38.0 | MUJER | SIN_DATO | SIN_DATO | 23/1/2024 | DESAPARECIDO | SIN_DATO |
| 550 | GUAYAS | -2206083126 | -799312367 | 23.0 | HOMBRE | FALLECIDO | ASESINATO | 29/1/2024 | FALLECIDO | 31/1/2024 |


##### **Discovery process conclusions**

* The dataset has 551 rows and 10 columns (features).
* The **Edad Aprox.** (age) column has the wrong data type: it is currently float but should be integer.
* Some columns contain the value ***SIN_DATO*** ("no data"), which can indicate a missing value, especially in the **Fecha Localización** column. There are several ways to handle this:
    * Replace SIN_DATO with a null value
    * Delete the rows that contain it
    * Delete only the rows where Fecha Localización has that value
* The columns **Fecha Desaparición** and **Fecha Localización** have the wrong data type: they should be dates, not object.
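To see how widespread a placeholder such as SIN_DATO is, you can count it per column. The sketch below does this on a hypothetical three-row miniature of the dataset, not on the real file.

```python
import pandas as pd

# Hypothetical miniature of the dataset, using the same placeholder value
sample = pd.DataFrame({
    "Motivo": ["SIN_DATO", "VIOLENCIA", "SIN_DATO"],
    "Fecha Localización": ["SIN_DATO", "3/1/2024", "SIN_DATO"],
    "Sexo": ["MUJER", "HOMBRE", "MUJER"],
})

# Compare every cell with the placeholder, then count the matches per column
placeholder_counts = (sample == "SIN_DATO").sum()
print(placeholder_counts)
```

This matters because `isna()` does not report placeholders: to pandas, "SIN_DATO" is an ordinary string.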

#### **Structuring Dataset**

* Handling missing values
* Correct data types

**Handling missing values**

Recall that we found the value SIN_DATO in several columns, especially in Fecha Localización. I think we can replace it with a null value in that column and leave it as is in the others.

***Why replace SIN_DATO with null in the Fecha Localización column?***

Because dates cannot be invented. This is a dataset about missing persons, so imputing missing dates with the mean or the mode would make no sense and could distort the results of the analysis.

In [15]:
# Replace SIN_DATO in Fecha Localización with a null value
df["Fecha Localización"] = df["Fecha Localización"].replace("SIN_DATO", np.nan)

In [16]:
# Verify missing values
nan_df = pd.DataFrame({
    "Column": df.columns,
    "MissingValues": df.isna().sum().values,
    "Percentage": (((df.isna().sum() * 100) / df.shape[0]).round(2)).values,
    "DataType": df.dtypes.values
})

In [17]:
nan_df

| | Column | MissingValues | Percentage | DataType |
|---|---|---|---|---|
| 0 | Provincia | 0 | 0.0 | object |
| 1 | Latitud | 0 | 0.0 | object |
| 2 | Longitud | 0 | 0.0 | object |
| 3 | Edad Aprox. | 1 | 0.18 | float64 |
| 4 | Sexo | 1 | 0.18 | object |
| 5 | Motivo Desaparción | 2 | 0.36 | object |
| 6 | Motivo Desaparción Obs. | 2 | 0.36 | object |
| 7 | Fecha Desaparición | 1 | 0.18 | object |
| 8 | Situación Actual | 1 | 0.18 | object |
| 9 | Fecha Localización | 119 | 21.6 | object |


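Apart from Fecha Localización, very few values are missing. Since missing location dates cannot be imputed, we simply drop every row that has a missing value. Keep in mind that this removes at least the 119 rows (21.6%) without a location date, so the resulting dataset only covers cases that have one.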

In [18]:
# Drop rows with missing values
df.dropna(inplace=True)

**Change Data Types**

* Edad Aprox. (age): From float to int
* Fecha Desaparición: From object to date
* Fecha Localización: From object to date

In [30]:
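# Age column: float -> int (possible because rows with missing values were dropped)
df["Edad Aprox."] = df["Edad Aprox."].astype("int")
# Date columns: object -> datetime, parsed as day/month/year
df["Fecha Desaparición"] = pd.to_datetime(df["Fecha Desaparición"], format="%d/%m/%Y")
df["Fecha Localización"] = pd.to_datetime(df["Fecha Localización"], format="%d/%m/%Y")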
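The conversion above relies on the rows with missing values having been dropped first. If you wanted to keep those rows instead, one possible alternative is sketched below on hypothetical data: the nullable `Int64` dtype for ages, and `errors="coerce"` so that unparseable dates become `NaT`. This is not what the notebook does; it only shows another option.

```python
import pandas as pd

# Hypothetical rows: one missing age and one placeholder date
sample = pd.DataFrame({
    "Edad Aprox.": [12.0, None, 16.0],
    "Fecha Localización": ["3/1/2024", "SIN_DATO", "22/1/2024"],
})

# Nullable integer dtype: whole numbers, with <NA> for the missing entry
sample["Edad Aprox."] = sample["Edad Aprox."].astype("Int64")

# errors="coerce" turns anything that does not match the format into NaT
sample["Fecha Localización"] = pd.to_datetime(
    sample["Fecha Localización"], format="%d/%m/%Y", errors="coerce"
)

print(sample.dtypes)
```

The trade-off is that later steps must then handle `<NA>` and `NaT` explicitly.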

**Change column names**

In [34]:
df = df.rename(columns={
    "Edad Aprox.": "Edad_Aprox",
    "Motivo Desaparción": "Motivo_Desaparicion",
    "Motivo Desaparción Obs.": "Motivo_Desaparicion_Obs",
    "Fecha Desaparición": "Fecha_Desaparicion",
    "Fecha Localización": "Fecha_Localizacion",
    "Situación Actual": "Situacion_Actual"
})
df.columns

Index(['Provincia', 'Latitud', 'Longitud', 'Edad_Aprox', 'Sexo',
       'Motivo_Desaparicion', 'Motivo_Desaparicion_Obs', 'Fecha_Desaparicion',
       'Situacion_Actual', 'Fecha_Localizacion'],
      dtype='object')

### **Cleansing Data**

* Drop duplicated rows

In [35]:
# Show duplicated rows (duplicated() must be called to get the boolean mask)
df[df.duplicated()]

| | Provincia | Latitud | Longitud | Edad_Aprox | Sexo | Motivo_Desaparicion | Motivo_Desaparicion_Obs | Fecha_Desaparicion | Situacion_Actual | Fecha_Localizacion |
|---|---|---|---|---|---|---|---|---|---|---|
| 22 | AZUAY | -29184557 | -790389464 | 15 | MUJER | CAUSAS PERSONALES | CIRCUNSTANCIAL | 2024-01-02 | ENCONTRADO | 2024-01-04 |
| 209 | IMBABURA | 3552973 | -781250848 | 12 | HOMBRE | EXTRAVIADO - AUSENCIA TEMPORAL | ACCIDENTAL POR DESCUIDO | 2024-01-18 | ENCONTRADO | 2024-01-19 |


In [36]:
# Drop duplicated rows
df.drop_duplicates(inplace=True)
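Before dropping duplicates, it can help to look at every copy side by side, since `duplicated()` by default hides the first occurrence. The sketch below uses hypothetical records. Note that the dataset has no person identifier among its columns, so two identical rows are not guaranteed to describe the same case.

```python
import pandas as pd

# Hypothetical records where the first and third rows are identical
sample = pd.DataFrame({
    "Provincia": ["AZUAY", "GUAYAS", "AZUAY"],
    "Edad_Aprox": [15, 14, 15],
})

# keep=False marks every copy, so the original and its repeat are both shown
print(sample[sample.duplicated(keep=False)])

# drop_duplicates keeps the first occurrence by default
deduped = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduped))
```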

### **Enrich Data**

Enriching data means including more data from other data sources or adding new columns. In this case, we can create new columns with the year and the day of the week of each disappearance, to see on which days more people go missing.

In [39]:
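# Year of the disappearance
df["Anio_Desaparicion"] = df["Fecha_Desaparicion"].dt.year
# Day of the week in Spanish, upper case (relies on a Spanish locale being available)
df["Dia_Desaparicion"] = df["Fecha_Desaparicion"].dt.day_name("es").str.upper()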
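To answer the question of which days have more missing people, the day names can then be counted. This is a minimal sketch on hypothetical dates. It calls `day_name()` without a locale so that it does not depend on a Spanish locale being installed, which means the names come out in English.

```python
import pandas as pd

# Hypothetical disappearance dates in day/month/year format
dates = pd.to_datetime(
    pd.Series(["1/1/2024", "8/1/2024", "3/1/2024"]), format="%d/%m/%Y"
)

# Without a locale argument, day_name() returns English names
day_names = dates.dt.day_name().str.upper()

# Number of cases per weekday, most frequent first
day_counts = day_names.value_counts()
print(day_counts)
```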

### **Validate Data**

In [40]:
# Confirm there are no missing values
df.isna().sum()

Provincia                  0
Latitud                    0
Longitud                   0
Edad_Aprox                 0
Sexo                       0
Motivo_Desaparicion        0
Motivo_Desaparicion_Obs    0
Fecha_Desaparicion         0
Situacion_Actual           0
Fecha_Localizacion         0
Anio_Desaparicion          0
Dia_Desaparicion           0
dtype: int64

In [41]:
# Confirm there are no duplicated rows
df[df.duplicated()]

| | Provincia | Latitud | Longitud | Edad_Aprox | Sexo | Motivo_Desaparicion | Motivo_Desaparicion_Obs | Fecha_Desaparicion | Situacion_Actual | Fecha_Localizacion | Anio_Desaparicion | Dia_Desaparicion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [42]:
# Confirm correct data types
df.dtypes

Provincia                          object
Latitud                            object
Longitud                           object
Edad_Aprox                          int64
Sexo                               object
Motivo_Desaparicion                object
Motivo_Desaparicion_Obs            object
Fecha_Desaparicion         datetime64[ns]
Situacion_Actual                   object
Fecha_Localizacion         datetime64[ns]
Anio_Desaparicion                   int32
Dia_Desaparicion                   object
dtype: object

### **Publish Data**

In [43]:
df.to_csv("./data/personas_desaparecidas.csv", index=False)
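One thing to keep in mind when publishing as CSV is that the format stores dates as plain text, so whoever loads the file has to parse them again. The sketch below shows the round trip with a hypothetical one-row frame and an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical frame with a datetime column
sample = pd.DataFrame({
    "Edad_Aprox": [14],
    "Fecha_Desaparicion": pd.to_datetime(["14/1/2024"], format="%d/%m/%Y"),
})

# Write to an in-memory buffer, standing in for a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates asks pandas to convert the column back to datetime on load
restored = pd.read_csv(buffer, parse_dates=["Fecha_Desaparicion"])
print(restored.dtypes)
```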

---