# RISK ANALYSIS: Create dataframe

This notebook builds a single dataframe that combines the existing data files for all the five-year periods.
The result is saved as a parent file (stored as .parquet at the end of the notebook) from which the subsequent analyses start.

### DATASET PREPARATION

In [1]:
# Read data files
import codecs
import pandas as pd
import pyarrow.parquet as pq
# Extract information
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
# Helpers
import numpy as np
import logging
from typing import Union
import pyarrow as pa

##### SOURCE
*https://av-info.faa.gov/dd_sublevel.asp?Folder=%5CAID*

All the data is retrieved directly from the source web page, so the data under study is up to date at the moment it is downloaded.

In [2]:
# URL of the data to analyse: legend
url_legend = 'https://av-info.faa.gov/data/AID/Afilelayout.txt'

# URL of the data to analyse: five-year periods with study data, from 1975 to the present
# Build the names of the files to download
lust = list(range(1975, 2025, 5))
AID_files = [f'a{year}_{str(year+4)[-2:]}.txt' for year in lust]
# The most recent data file currently covers 2020 to 2025
AID_files[-1] = 'a2020_25.txt'

with urlopen("https://av-info.faa.gov/dd_sublevel.asp?Folder=%5CAID") as r:
    bs = BeautifulSoup(r.read(), "html.parser")

url_AID = []
for link in bs.find_all("a"):
    # Look up the link text among the file names built above
    # Only the tab-delimited files are wanted, so the href must contain /tab/
    if link.next_element in AID_files and '/tab/' in link.get("href"):
        url_AID.append(link.get("href"))

##### Data legend

Extract the column names from the legend file so that the datasets can be validated against them later.<br>
The content is read directly from the internet, so the reference is up to date at the time the data is retrieved.

Format the downloaded .txt so that it can be read afterwards.

In [3]:
# Read the content of the legend
with urlopen(url_legend) as content:
    soup = BeautifulSoup(content, "html.parser")
    soup_lines = str(soup).split('\r\n')

# Turn the response into a dictionary with the column name and its description
legend_df = {"Column_name": [], "Description": []}
# Iterate over the lines that hold legend data. The first two and the last three
# lines are skipped because they carry no relevant legend information
for line in soup_lines[2:-3]:
    # The first 5 characters hold the column name
    legend_df["Column_name"].append(line[0:5].strip())
    # The description starts at position 53
    legend_df["Description"].append(line[53:].strip())

# Convert the dictionary into a pandas.DataFrame
legend_df = pd.DataFrame.from_dict(data=legend_df)
legend_df

Unnamed: 0,Column_name,Description
0,c5,Unique control number used to relate to AID_MA...
1,c1,Type of Event
2,c2,FAR part number
3,c3,Form on which the latest data was received.
4,c4,Agency conducting investigation.
...,...,...
174,c163,2nd Additional cause factor text
175,c183,Supporting Factor Text ...
176,c191,Supporting cause factor B text. O...
177,c229,Date of Birth of PIC


##### Build the dataframes

Read and validate every data file needed for the PHA.<br>

Because the files are very large, they are downloaded first and the dataset is then built from the local copies.

In [4]:
data_path = 'Datasets/PHA/'

In [5]:
# Download the files previously located with BeautifulSoup
for url, file_name in zip(url_AID, AID_files):
    # Send the HTTP GET request to fetch the data
    data = requests.get(url)

    # Save the file locally
    with open(data_path + file_name, 'wb') as file:
        file.write(data.content)
        print(f'[+] FILE_CREATED\t{file_name}')

[+] FILE_CREATED	a1975_79.txt
[+] FILE_CREATED	a1980_84.txt
[+] FILE_CREATED	a1985_89.txt
[+] FILE_CREATED	a1990_94.txt
[+] FILE_CREATED	a1995_99.txt
[+] FILE_CREATED	a2000_04.txt
[+] FILE_CREATED	a2005_09.txt
[+] FILE_CREATED	a2010_14.txt
[+] FILE_CREATED	a2015_19.txt
[+] FILE_CREATED	a2020_25.txt
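The download loop pairs `url_AID` with `AID_files` by position, which relies on the links appearing on the page in the same order as the generated file names. A minimal sketch of pairing them by file name instead is shown below. The helper `pair_urls_with_files` and the example URLs are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import posixpath
from urllib.parse import urlparse


def pair_urls_with_files(urls, file_names):
    """Hypothetical helper: map each wanted file name to the URL that ends with it."""
    wanted = set(file_names)
    pairs = {}
    for url in urls:
        # Take the last path component of the URL, e.g. 'a1975_79.txt'
        name = posixpath.basename(urlparse(url).path)
        if name in wanted:
            pairs[name] = url
    return pairs


# Made-up URLs, deliberately listed in a different order than the file names
example_urls = [
    "https://example.invalid/data/AID/tab/a1980_84.txt",
    "https://example.invalid/data/AID/tab/a1975_79.txt",
]
example_files = ["a1975_79.txt", "a1980_84.txt"]
pairs = pair_urls_with_files(example_urls, example_files)
print(pairs)
```

Iterating over `pairs.items()` would then give each file name together with its own URL, regardless of the order in which the links were found.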


With the files downloaded, a single dataframe holding all the data is created. Before a file is added, it is validated by checking that its columns exist in the legend extracted earlier.

In [6]:
def validate_df(df) -> Union[bool, list]:
    """
    Validate the columns of the given dataframe.

    Returns True when the dataframe is valid, or [False, message] otherwise.
    """
    if len(df.columns) != 180:
        err = f'[-] COLUMNS_NUMBER\t{len(df.columns)}'
        return [False, err]

    for column in df.columns:
        # Skip the "end_of_record" column
        if column == 'end_of_record':
            continue
        # Look the column up in the legend and check that there is a match
        if legend_df.loc[legend_df['Column_name'] == column].empty:
            err = f'[-] COLUMN_UNMATCH\t{column}'
            return [False, err]

    return True
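`validate_df` stops at the first column that is missing from the legend. As a rough sketch of an alternative, the check can collect every unknown column in one pass. The helper `find_unknown_columns` and the tiny sample frames below are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def find_unknown_columns(df, legend, ignored=("end_of_record",)):
    """Hypothetical helper: list the columns of df that the legend does not describe."""
    known = set(legend["Column_name"])
    return [col for col in df.columns if col not in known and col not in ignored]


# Made-up legend and data header, small enough to check by eye
sample_legend = pd.DataFrame({"Column_name": ["c1", "c2", "c32"]})
sample_data = pd.DataFrame(columns=["c1", "32", "end_of_record"])
unknown = find_unknown_columns(sample_data, sample_legend)
print(unknown)
```

An empty list would mean that every column is described by the legend.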

In [7]:
# Paths to the downloaded files
datasets = [data_path + file for file in AID_files]

try:
    # Create the dataframe that will accumulate the dataframes read afterwards
    dataset = datasets[0]
    # 'r' already applies universal newlines; the 'U' flag was removed in Python 3.11
    doc = codecs.open(dataset, 'r')
    df = pd.read_csv(doc, sep='\t', on_bad_lines='skip', low_memory=False)

    # Validate the dataset
    res = validate_df(df)

    if not isinstance(res, bool):
        raise AssertionError(res[1])

    print('[+] SUCCESS\t', dataset)

    for dataset in datasets[1:]:
        doc = codecs.open(dataset, 'r')
        aux_df = pd.read_csv(doc, sep='\t', on_bad_lines='skip', low_memory=False)

        # Validate the dataset
        res = validate_df(aux_df)

        if not isinstance(res, bool):
            raise AssertionError(res[1])

        # Append the new records to the dataframe, aligned by column
        df = pd.concat([df, aux_df], axis=0)
        del aux_df
        print('[+] SUCCESS\t', dataset)

except AssertionError as e:
    print(f'[-] BAD_FILE\t{dataset}\tERROR\t{e}')
    del df

    # Reload the offending file so that it can be inspected
    doc = codecs.open(dataset, 'r')
    error_df = pd.read_csv(doc, sep='\t', on_bad_lines='skip', low_memory=False)
    display(error_df)

[+] SUCCESS	 Datasets/PHA/a1975_79.txt
[+] SUCCESS	 Datasets/PHA/a1980_84.txt
[+] SUCCESS	 Datasets/PHA/a1985_89.txt
[+] SUCCESS	 Datasets/PHA/a1990_94.txt
[+] SUCCESS	 Datasets/PHA/a1995_99.txt
[+] SUCCESS	 Datasets/PHA/a2000_04.txt
[+] SUCCESS	 Datasets/PHA/a2005_09.txt
[+] SUCCESS	 Datasets/PHA/a2010_14.txt
[+] SUCCESS	 Datasets/PHA/a2015_19.txt
[+] SUCCESS	 Datasets/PHA/a2020_25.txt


###### Note:
During validation, the file "a1990_94.txt" fails on column "32":
[-] BAD_FILE	Datasets/PHA/a1990_94.txt	COLUMN_ERROR	32

After the relevant checks, the cause turns out to be a typo in the source data: this column should be named "c32". The file is therefore edited and the checks are run again.
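Editing the downloaded file has to be repeated every time the data is downloaded again. One possible alternative, sketched here under the assumption that the only problem is the header name, is to rename the column after reading the file. The helper `fix_known_header_typos` and its default mapping are hypothetical.

```python
import pandas as pd


def fix_known_header_typos(df, corrections=None):
    """Hypothetical helper: rename known mistyped headers, leaving other columns untouched."""
    if corrections is None:
        # The only typo reported in the note above: "32" should be "c32"
        corrections = {"32": "c32"}
    return df.rename(columns=corrections)


# Made-up frame with the mistyped header
sample = pd.DataFrame({"c31": [660.0], "32": ["X"]})
fixed = fix_known_header_typos(sample)
print(list(fixed.columns))
```

Applied right after `pd.read_csv` and before `validate_df`, this would keep the downloaded file untouched.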

##### Data screening

Before the analyses, the data is screened to reduce the size of the resulting dataset, since several columns are known to add no value to the studies.

In [8]:
# Show the data obtained
df_clean = df.copy()

# Rebuild the index, since concatenating the dataframes leaves repeated index values
df_clean.reset_index(drop=True, inplace=True)
df_clean

Unnamed: 0,c5,c1,c2,c3,c4,c6,c7,c8,c9,c10,...,c154,c156,c158,c161,c163,c183,c191,c229,c230,end_of_record
0,19750101000049A,A,,.4,,1975,1,1,19750101,,...,,,,,,,,,,
1,19750101000129A,A,,.4,,1975,1,1,19750101,,...,,,Wheeled-Tricycle,,,,,,,
2,19750101000139A,A,,.4,,1975,1,1,19750101,,...,,,Wheeled-Tricycle,,,,,,,
3,19750101000219A,A,,.4,,1975,1,1,19750101,,...,,,Wheeled-Tricycle,,,,,,,
4,19750101000229A,A,,.4,,1975,1,1,19750101,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214882,20220206001469I,I,091,,,2022,02,06,20220206,0930,...,,,Wheeled-Tricycle,,,,,19470720.0,,
214883,20220206001479I,I,091,,,2022,02,06,20220206,1710,...,,,Wheeled-Tricycle,,,,,19700731.0,,
214884,20220131001459I,I,O,,,2022,01,31,20220131,0832,...,,,Wheeled-Tricycle,,,,,19810601.0,19980409.0,
214885,20220107001449I,I,135,,,2022,01,07,20220107,0407,...,,,Wheeled-Tricycle,,,,,19520606.0,,


- Remove empty columns

The data contains NaN values as well as empty strings and whitespace-only strings. All of these unwanted values are therefore converted to NaN first.

In [9]:
logging.basicConfig(filename='nan-convert.log', level=logging.DEBUG)

def clean_data(value):
    # Replace blank values (empty or whitespace-only strings) with NaN
    try:
        if isinstance(value, str) and (value.isspace() or value == ''):
            return np.nan

        return value

    except Exception as e:
        # Only the value is in scope here; the column and row are not known to this function
        logging.info(f'[-] ERROR\t{e}\tvalue:{value}')
        return value

In [10]:
# Iterate over every column and check its values
columns = len(df_clean.columns)

for column in df_clean.columns:
    # Apply the value filter with a list comprehension
    values = [clean_data(value) for value in df_clean[column]]
    df_clean[column] = values
    columns -= 1
    print(f'[+] REMAINING_COLUMNS\t{columns}\t', end='\r')
    
df_clean

[+] REMAINING_COLUMNS	0			

Unnamed: 0,c5,c1,c2,c3,c4,c6,c7,c8,c9,c10,...,c154,c156,c158,c161,c163,c183,c191,c229,c230,end_of_record
0,19750101000049A,A,,.4,,1975,1,1,19750101,,...,,,,,,,,,,
1,19750101000129A,A,,.4,,1975,1,1,19750101,,...,,,Wheeled-Tricycle,,,,,,,
2,19750101000139A,A,,.4,,1975,1,1,19750101,,...,,,Wheeled-Tricycle,,,,,,,
3,19750101000219A,A,,.4,,1975,1,1,19750101,,...,,,Wheeled-Tricycle,,,,,,,
4,19750101000229A,A,,.4,,1975,1,1,19750101,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214882,20220206001469I,I,091,,,2022,02,06,20220206,0930,...,,,Wheeled-Tricycle,,,,,19470720.0,,
214883,20220206001479I,I,091,,,2022,02,06,20220206,1710,...,,,Wheeled-Tricycle,,,,,19700731.0,,
214884,20220131001459I,I,O,,,2022,01,31,20220131,0832,...,,,Wheeled-Tricycle,,,,,19810601.0,19980409.0,
214885,20220107001449I,I,135,,,2022,01,07,20220107,0407,...,,,Wheeled-Tricycle,,,,,19520606.0,,
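The loop above applies `clean_data` value by value in Python. A vectorised sketch of the same idea, using a regular expression with `DataFrame.replace`, is shown below; the helper name `blank_to_nan` and the sample frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def blank_to_nan(df):
    """Hypothetical helper: turn empty or whitespace-only strings into NaN."""
    # With regex=True only string values are matched, so numeric columns are left as they are
    return df.replace(r"^\s*$", np.nan, regex=True)


# Made-up sample with one text column and one numeric column
sample = pd.DataFrame({"a": ["x", "  ", ""], "b": [1, 2, 3]})
cleaned = blank_to_nan(sample)
print(cleaned)
```

On a dataframe of the size shown above this would likely be faster than the per-value loop, although that has not been measured here.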


In [11]:
# Drop the columns in which every value is null
empty_cols = [col for col in df_clean.columns if df_clean[col].isna().all()]

df_clean.drop(empty_cols, axis=1, inplace=True)
display(df_clean)
print(f'[+] DROPPED_COLUMNS\t{empty_cols}')

Unnamed: 0,c5,c1,c2,c3,c4,c6,c7,c8,c9,c10,...,c150,c154,c156,c158,c161,c163,c183,c191,c229,c230
0,19750101000049A,A,,.4,,1975,1,1,19750101,,...,POWERED,,,,,,,,,
1,19750101000129A,A,,.4,,1975,1,1,19750101,,...,POWERED,,,Wheeled-Tricycle,,,,,,
2,19750101000139A,A,,.4,,1975,1,1,19750101,,...,POWERED,,,Wheeled-Tricycle,,,,,,
3,19750101000219A,A,,.4,,1975,1,1,19750101,,...,POWERED,,,Wheeled-Tricycle,,,,,,
4,19750101000229A,A,,.4,,1975,1,1,19750101,,...,POWERED,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214882,20220206001469I,I,091,,,2022,02,06,20220206,0930,...,,,,Wheeled-Tricycle,,,,,19470720.0,
214883,20220206001479I,I,091,,,2022,02,06,20220206,1710,...,,,,Wheeled-Tricycle,,,,,19700731.0,
214884,20220131001459I,I,O,,,2022,01,31,20220131,0832,...,,,,Wheeled-Tricycle,,,,,19810601.0,19980409.0
214885,20220107001449I,I,135,,,2022,01,07,20220107,0407,...,,,,Wheeled-Tricycle,,,,,19520606.0,


[+] DROPPED_COLUMNS	['end_of_record']
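pandas also offers `DataFrame.dropna` for this step. The sketch below does the same as the cell above on a made-up two-column frame; the helper name `drop_empty_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_empty_columns(df):
    """Hypothetical helper: return the frame without all-null columns, plus their names."""
    empty = [col for col in df.columns if df[col].isna().all()]
    # axis=1 works on columns; how="all" drops a column only when every value is missing
    return df.dropna(axis=1, how="all"), empty


# Made-up sample mirroring the situation above
sample = pd.DataFrame({"c1": ["A", "I"], "end_of_record": [np.nan, np.nan]})
kept, dropped = drop_empty_columns(sample)
print(list(kept.columns), dropped)
```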


- Logical removal of columns

Review the existing columns and drop those that make no sense to keep in the final dataset.

In [12]:
unused_columns = [5, 3, 2, 4, 9, 75, 140, 139, 203, 204, 214, 790, 26, 37, 15, 16, 17,18, 19, 143, 205, 206, 207, 208, 210, 43, 129, 124, 125, 77]
unused_columns = [f'c{idx}' for idx in unused_columns]

for col in unused_columns:
    description = legend_df[legend_df["Column_name"] == col]["Description"].values[0]
    print(f'[+] UNUSED_COL\t{col}\t{description}')

[+] UNUSED_COL	c5	Unique control number used to relate to AID_MAIN table.
[+] UNUSED_COL	c3	Form on which the latest data was received.
[+] UNUSED_COL	c2	FAR part number
[+] UNUSED_COL	c4	Agency conducting investigation.
[+] UNUSED_COL	c9	Date the accident/incident happened.
[+] UNUSED_COL	c75	First, second, or third airplane involved or Not a midair.
[+] UNUSED_COL	c140	Extent of investigation
[+] UNUSED_COL	c139	Code for related reports pertaining to the accident/incident.
[+] UNUSED_COL	c203	This field is generated by the system so no edits are performed.
[+] UNUSED_COL	c204	Sequential number assigned to cases dealt with in the current year.
[+] UNUSED_COL	c214	Group of the air operator involved in the investigation.
[+] UNUSED_COL	c790	Consolidated Statement Rebuttal
[+] UNUSED_COL	c26	Technical certificate data sheet information of the aircraft
[+] UNUSED_COL	c37	Technical certificate data sheet information of the engine
[+] UNUSED_COL	c15	Name of the airport
[+] UNUSED_COL	c16	Ai

In [13]:
# Drop the selected columns
df_clean.drop(unused_columns, axis=1, inplace=True)
df_clean

Unnamed: 0,c1,c6,c7,c8,c10,c132,c134,c136,c138,c141,...,c150,c154,c156,c158,c161,c163,c183,c191,c229,c230
0,A,1975,1,1,,,,,,,...,POWERED,,,,,,,,,
1,A,1975,1,1,,,,,,,...,POWERED,,,Wheeled-Tricycle,,,,,,
2,A,1975,1,1,,,,,,,...,POWERED,,,Wheeled-Tricycle,,,,,,
3,A,1975,1,1,,,,,,,...,POWERED,,,Wheeled-Tricycle,,,,,,
4,A,1975,1,1,,,,,,,...,POWERED,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214882,I,2022,02,06,0930,,,,,,...,,,,Wheeled-Tricycle,,,,,19470720.0,
214883,I,2022,02,06,1710,,,,,,...,,,,Wheeled-Tricycle,,,,,19700731.0,
214884,I,2022,01,31,0832,,,,,,...,,,,Wheeled-Tricycle,,,,,19810601.0,19980409.0
214885,I,2022,01,07,0407,,,,,,...,,,,Wheeled-Tricycle,,,,,19520606.0,


##### Saving the data

Once the dataframe has been validated, it is saved in .parquet format because of the large volume of data.

In [14]:
pha_file = 'PHA_Data.parquet'

- Problems encountered<br>
When converting to .parquet, the following error is raised: *ArrowInvalid: ("Could not convert '2010' with type str: tried to convert to int64", 'Conversion failed for column c6 with type object')*<br>
The data types present in the column are inspected and the values are mapped to the expected int format.

In [15]:
column = 'c6'
# Show the distinct values present in the year column
df_clean[column].unique()

array([1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
       1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
       1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
       2008, 2009, '2010', '2011', nan, '2012', '2013', '2014', '2015',
       '2016', '2017', '2018', '2019', '2020', '2021', '2022'],
      dtype=object)

In [16]:
# Replace NaN with 0 so that the conversion to int is possible
df_clean[column] = df_clean[column].fillna(0)
df_clean[column] = df_clean[column].astype(int)
df_clean[column].unique()

array([1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
       1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
       1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
       2008, 2009, 2010, 2011,    0, 2012, 2013, 2014, 2015, 2016, 2017,
       2018, 2019, 2020, 2021, 2022])

- Problems encountered<br>
*ArrowInvalid: ("Could not convert '01' with type str: tried to convert to int64", 'Conversion failed for column c7 with type object')*

In [17]:
column = 'c7'
# Show the distinct values present in the column
df_clean[column].unique()

array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, '01', '02', '03', '04',
       '05', '06', '07', '08', '09', '10', '12', '11', nan], dtype=object)

In [18]:
# Cast through float to the nullable Int64 type, which keeps missing values as <NA>
df_clean[column] = df_clean[column].astype('float').astype("Int64")
df_clean[column].unique()

<IntegerArray>
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, <NA>]
Length: 13, dtype: Int64

- Problems encountered<br>
*ArrowInvalid: ("Could not convert '01' with type str: tried to convert to int64", 'Conversion failed for column c8 with type object')*

In [19]:
column = 'c8'
# Show the distinct values present in the column
df_clean[column].unique()

array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, '01', '02', '03',
       '05', '06', '04', '08', '11', '12', '09', '10', '07', '18', '14',
       '15', '17', '19', '21', '23', '16', '22', '13', '20', '26', '28',
       '29', '30', '27', '24', '25', '31', nan], dtype=object)

In [20]:
# Cast through float to the nullable Int64 type, which keeps missing values as <NA>
df_clean[column] = df_clean[column].astype('float').astype("Int64")
df_clean[column].unique()

<IntegerArray>
[   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,   11,   12,   13,
   14,   15,   16,   17,   18,   19,   20,   21,   22,   23,   24,   25,   26,
   27,   28,   29,   30,   31, <NA>]
Length: 32, dtype: Int64

- Problems encountered<br>
*ArrowInvalid: ("Could not convert '1 ' with type str: tried to convert to int64", 'Conversion failed for column c151 with type object')*

In [21]:
column = 'c151'
# Show the distinct values present in the column
df_clean[column].unique()

array([1.0, 2.0, 4.0, nan, 0.0, 3.0, '1 ', '2 ', '0 ', '4 ', '3 '],
      dtype=object)

In [22]:
# Cast through float to the nullable Int64 type, which keeps missing values as <NA>
df_clean[column] = df_clean[column].astype('float').astype("Int64")
df_clean[column].unique()

<IntegerArray>
[1, 2, 4, <NA>, 0, 3]
Length: 6, dtype: Int64

- Problems encountered<br>
*ArrowInvalid: ("Could not convert '9028 ' with type str: tried to convert to double", 'Conversion failed for column c31 with type object')*

In [23]:
column = 'c31'
# Show the distinct values present in the column
df_clean[column].unique()

array([nan, 660.0, 28.0, ..., '9887 ', '10009', '10905'], dtype=object)

In [24]:
# Cast every value to str so that the column has a single type (NaN becomes the string 'nan')
df_clean[column] = df_clean[column].astype('str')
df_clean[column].unique()

array(['nan', '660.0', '28.0', ..., '9887 ', '10009', '10905'],
      dtype=object)

- Problems encountered<br>
*ArrowInvalid: ("Expected bytes, got a 'float' object", 'Conversion failed for column c106 with type object')*

In [25]:
column = 'c106'
# Show the distinct values present in the column
df_clean[column].unique()

array(['B', 'A', 'C', nan, 19620714.0, 19760912.0, 19540228.0, 19700202.0,
       19680726.0, 19691204.0, 19641127.0, 19440916.0, 19560306.0,
       19550927.0, 19980218.0, 19961002.0, 19670626.0, 19930126.0,
       19571026.0, 19691006.0], dtype=object)

In [26]:
# Cast every value to str so that the column has a single type (NaN becomes the string 'nan')
df_clean[column] = df_clean[column].astype('str')
df_clean[column].unique()

array(['B', 'A', 'C', 'nan', '19620714.0', '19760912.0', '19540228.0',
       '19700202.0', '19680726.0', '19691204.0', '19641127.0',
       '19440916.0', '19560306.0', '19550927.0', '19980218.0',
       '19961002.0', '19670626.0', '19930126.0', '19571026.0',
       '19691006.0'], dtype=object)

- Problems encountered<br>
*ArrowInvalid: ("Expected bytes, got a 'float' object", 'Conversion failed for column c108 with type object')*

In [27]:
column = 'c108'
# Show the distinct values present in the column
df_clean[column].unique()

array(['S', 'D', 'U', 'M', 'G', 'N', nan, 'L', 'T', 'H', 'F', 'I', 'O',
       'W', 'J', 'V', 'A', 'P', 'B', 'C', 'R', 'K', 'E', 'Q', 19890602.0],
      dtype=object)

In [28]:
# Cast every value to str so that the column has a single type (NaN becomes the string 'nan')
df_clean[column] = df_clean[column].astype('str')
df_clean[column].unique()

array(['S', 'D', 'U', 'M', 'G', 'N', 'nan', 'L', 'T', 'H', 'F', 'I', 'O',
       'W', 'J', 'V', 'A', 'P', 'B', 'C', 'R', 'K', 'E', 'Q',
       '19890602.0'], dtype=object)

- Problems encountered<br>
*ArrowInvalid: ("Expected bytes, got a 'float' object", 'Conversion failed for column c113 with type object')*

In [29]:
column = 'c113'
# Show the distinct values present in the column
df_clean[column].unique()

array([nan, '600 ', '200 ', '400 ', 'CLER', '800 ', '7000', '3500',
       '4500', '1000', '2500', '2000', '500 ', '5000', '3000', '2300',
       'GRIK', '1800', '6000', '2900', '3100', '900 ', '4000', '2200',
       '9000', '8000', '$)))', '1200', '1300', '3300', '1500', 'GRIL',
       '9999', '1721', '5500', '100 ', '700 ', '6400', '1900', '4900',
       '2400', '3400', '1700', '7500', '3200', '2100', '2700', 'GEIK',
       '4600', 'G10 ', '1600', '6500', 'SCAT', '2800', '2600', '1400',
       '300 ', '4400', '3800', '1100', '6   ', '9500', '2020', '8500',
       '3700', '4300', 'GR1K', '7   ', '0700', 'BRKN', 'ZERO', '0   ',
       'OCS ', '850 ', '150 ', '3600', '2   ', '1340', '0600', 'OVCT',
       '3900', '?   ', '2280', '38  ', '1   ', '5   ', '7200', '.   ',
       '10,0', '4100', '25  ', '4700', '250 ', '5600', '85  ', '50  ',
       '125 ', '8   ', '75  ', '4002', '1032', 'F   ', 'GRI ', 'OCST',
       '40  ', '4200', '650 ', '0500', '10  ', '4800', 'GIK ', '1350',
       'G

In [30]:
# Cast every value to str so that the column has a single type (NaN becomes the string 'nan')
df_clean[column] = df_clean[column].astype('str')
df_clean[column].unique()

array(['nan', '600 ', '200 ', '400 ', 'CLER', '800 ', '7000', '3500',
       '4500', '1000', '2500', '2000', '500 ', '5000', '3000', '2300',
       'GRIK', '1800', '6000', '2900', '3100', '900 ', '4000', '2200',
       '9000', '8000', '$)))', '1200', '1300', '3300', '1500', 'GRIL',
       '9999', '1721', '5500', '100 ', '700 ', '6400', '1900', '4900',
       '2400', '3400', '1700', '7500', '3200', '2100', '2700', 'GEIK',
       '4600', 'G10 ', '1600', '6500', 'SCAT', '2800', '2600', '1400',
       '300 ', '4400', '3800', '1100', '6   ', '9500', '2020', '8500',
       '3700', '4300', 'GR1K', '7   ', '0700', 'BRKN', 'ZERO', '0   ',
       'OCS ', '850 ', '150 ', '3600', '2   ', '1340', '0600', 'OVCT',
       '3900', '?   ', '2280', '38  ', '1   ', '5   ', '7200', '.   ',
       '10,0', '4100', '25  ', '4700', '250 ', '5600', '85  ', '50  ',
       '125 ', '8   ', '75  ', '4002', '1032', 'F   ', 'GRI ', 'OCST',
       '40  ', '4200', '650 ', '0500', '10  ', '4800', 'GIK ', '1350',
       

- Problems encountered<br>
*ArrowInvalid: ("Expected bytes, got a 'float' object", 'Conversion failed for column c240 with type object')*

In [31]:
column = 'c240'
# Show the distinct values present in the column
df_clean[column].unique()

array([nan, '21', '05', '11', '0 ', '13', '30', '15', '09', '04', '36',
       '17', '26', '08', '27', '31', '12', '23', '33', '01', '32', '18',
       '25', '16', '07', '02', '35', '29', '34', '24', '03', '14', '22',
       '06', '28', '20', '19', '10', '00', '60', '39', '50', 'G3', 'G1',
       '55', '40', '6 ', '3 ', 'G4', '8 ', '47', '4 ', '7 ', '5 ', '2 ',
       'C ', '1 ', '45', '38', '90', '9 ', 'G2', 'G ', 'OO', '37', 'N ',
       '70', '.9', 'O ', 'G6', '80', 'G5', '89', '66', '@@', 'G0', '75',
       '87', 24.0, 28.0, 5.0, 30.0, 33.0, 36.0, 11.0, 2.0, 22.0, 18.0,
       20.0, 23.0, 15.0, 14.0, 29.0, 10.0, 35.0, 21.0, 13.0, 19.0, 12.0,
       1.0, 27.0, 16.0, 9.0, 4.0, 34.0, 31.0, 7.0, 25.0, 17.0, 8.0, 32.0,
       3.0, 6.0, 26.0, 0.0, '1B', '3N'], dtype=object)

In [32]:
# Cast every value to str so that the column has a single type (NaN becomes the string 'nan')
df_clean[column] = df_clean[column].astype('str')
df_clean[column].unique()

array(['nan', '21', '05', '11', '0 ', '13', '30', '15', '09', '04', '36',
       '17', '26', '08', '27', '31', '12', '23', '33', '01', '32', '18',
       '25', '16', '07', '02', '35', '29', '34', '24', '03', '14', '22',
       '06', '28', '20', '19', '10', '00', '60', '39', '50', 'G3', 'G1',
       '55', '40', '6 ', '3 ', 'G4', '8 ', '47', '4 ', '7 ', '5 ', '2 ',
       'C ', '1 ', '45', '38', '90', '9 ', 'G2', 'G ', 'OO', '37', 'N ',
       '70', '.9', 'O ', 'G6', '80', 'G5', '89', '66', '@@', 'G0', '75',
       '87', '24.0', '28.0', '5.0', '30.0', '33.0', '36.0', '11.0', '2.0',
       '22.0', '18.0', '20.0', '23.0', '15.0', '14.0', '29.0', '10.0',
       '35.0', '21.0', '13.0', '19.0', '12.0', '1.0', '27.0', '16.0',
       '9.0', '4.0', '34.0', '31.0', '7.0', '25.0', '17.0', '8.0', '32.0',
       '3.0', '6.0', '26.0', '0.0', '1B', '3N'], dtype=object)

- Problems encountered<br>
*ArrowInvalid: ("Expected bytes, got a 'float' object", 'Conversion failed for column c241 with type object')*

In [33]:
column = 'c241'
# Show the distinct values present in the column
df_clean[column].unique()

array([nan, '15', '05', '07', '03', '10', '08', '06', '04', '18', '12',
       '09', '11', '24', '14', '22', '13', '25', '20', '17', '16', '19',
       '29', '01', '02', '23', '8 ', '26', '21', '52', '40', '28', '44',
       'G1', 'G0', '2 ', 'G3', '90', 'G2', '30', '0 ', '00', ' G', '5 ',
       '6 ', '27', '34', '9 ', '31', '7 ', '36', '41', 'T ', ' 2', 'G4',
       '3 ', 'S ', 'B ', '35', '4 ', '55', 'G6', '60', 'X ', '33', '50',
       '1 ', 'W ', '43', '32', '80', '39', '99', '38', '51', ') ', '53',
       '0G', '45', ' 3', '46', '70', '58', '54', '56', 'G5', '49', '87',
       '66', '57', '37', '47', '48', '69', 9.0, 8.0, 5.0, 15.0, 6.0, 10.0,
       0.0, 12.0, 7.0, 14.0, 3.0, 40.0, 2.0, 11.0, 25.0, 13.0, 16.0, 26.0,
       21.0, 24.0, 4.0, 65.0, 23.0, 18.0, 20.0, 17.0, 35.0, 22.0, 28.0,
       1.0, 30.0, 33.0, 19.0, 37.0, 32.0, 27.0, 57.0, 45.0, 31.0, 38.0,
       87.0, 84.0, 46.0, 29.0, 58.0, 50.0, 36.0, 51.0, 75.0, 80.0, 92.0,
       74.0, 70.0, 60.0, 34.0, 55.0, 42.0], dtype=

In [34]:
# Cast every value to str so that the column has a single type (NaN becomes the string 'nan')
df_clean[column] = df_clean[column].astype('str')
df_clean[column].unique()

array(['nan', '15', '05', '07', '03', '10', '08', '06', '04', '18', '12',
       '09', '11', '24', '14', '22', '13', '25', '20', '17', '16', '19',
       '29', '01', '02', '23', '8 ', '26', '21', '52', '40', '28', '44',
       'G1', 'G0', '2 ', 'G3', '90', 'G2', '30', '0 ', '00', ' G', '5 ',
       '6 ', '27', '34', '9 ', '31', '7 ', '36', '41', 'T ', ' 2', 'G4',
       '3 ', 'S ', 'B ', '35', '4 ', '55', 'G6', '60', 'X ', '33', '50',
       '1 ', 'W ', '43', '32', '80', '39', '99', '38', '51', ') ', '53',
       '0G', '45', ' 3', '46', '70', '58', '54', '56', 'G5', '49', '87',
       '66', '57', '37', '47', '48', '69', '9.0', '8.0', '5.0', '15.0',
       '6.0', '10.0', '0.0', '12.0', '7.0', '14.0', '3.0', '40.0', '2.0',
       '11.0', '25.0', '13.0', '16.0', '26.0', '21.0', '24.0', '4.0',
       '65.0', '23.0', '18.0', '20.0', '17.0', '35.0', '22.0', '28.0',
       '1.0', '30.0', '33.0', '19.0', '37.0', '32.0', '27.0', '57.0',
       '45.0', '31.0', '38.0', '87.0', '84.0', '46.0', '29

- Problems encountered<br>
*ArrowInvalid: ("Expected bytes, got a 'float' object", 'Conversion failed for column c243 with type object')*

In [35]:
column = 'c243'
# Show the distinct values present in the column
df_clean[column].unique()

array([nan, '32', '16', '15', '12', '29', '40', '22', '27', '35', '24',
       '20', '25', '18', '10', '30', '31', '14', '17', '19', '28', '38',
       '36', '26', '2 ', '34', '23', '33', '52', '43', '21', '45', '53',
       '58', '05', '08', '06', '0 ', '1 ', '4 ', 'C ', '42', '48', '37',
       'B ', '39', '60', '55', '07', '7 ', '13', 'S ', '50', '44', 'G ',
       '03', '8 ', 'L ', '02', '04', '85', '65', '09', '11', '80', '5 ',
       '70', '46', '00', '90', '47', '75', '51', '41', '01', '77', 18.0,
       25.0, 30.0, 15.0, 12.0, 20.0, 40.0, 22.0, 23.0, 35.0, 24.0, 10.0,
       32.0, 28.0, 38.0, 19.0, 21.0, 41.0, 33.0, 27.0, 29.0, 55.0, 68.0,
       45.0, 42.0, 36.0, 58.0, 37.0, 13.0, 7.0, 2.0, 17.0, 60.0, 44.0,
       39.0, 8.0, 34.0, 16.0, 6.0, 26.0, 50.0, 11.0, 14.0, 46.0, 5.0, 4.0,
       31.0, 9.0, 52.0, 48.0, 47.0, 70.0, 43.0, 3.0, 80.0, 54.0, 0.0,
       65.0, 1.0, 62.0, 57.0, 49.0, 51.0, 99.0, 56.0], dtype=object)

In [36]:
# Cast every value to str so that the column has a single type (NaN becomes the string 'nan')
df_clean[column] = df_clean[column].astype('str')
df_clean[column].unique()

array(['nan', '32', '16', '15', '12', '29', '40', '22', '27', '35', '24',
       '20', '25', '18', '10', '30', '31', '14', '17', '19', '28', '38',
       '36', '26', '2 ', '34', '23', '33', '52', '43', '21', '45', '53',
       '58', '05', '08', '06', '0 ', '1 ', '4 ', 'C ', '42', '48', '37',
       'B ', '39', '60', '55', '07', '7 ', '13', 'S ', '50', '44', 'G ',
       '03', '8 ', 'L ', '02', '04', '85', '65', '09', '11', '80', '5 ',
       '70', '46', '00', '90', '47', '75', '51', '41', '01', '77', '18.0',
       '25.0', '30.0', '15.0', '12.0', '20.0', '40.0', '22.0', '23.0',
       '35.0', '24.0', '10.0', '32.0', '28.0', '38.0', '19.0', '21.0',
       '41.0', '33.0', '27.0', '29.0', '55.0', '68.0', '45.0', '42.0',
       '36.0', '58.0', '37.0', '13.0', '7.0', '2.0', '17.0', '60.0',
       '44.0', '39.0', '8.0', '34.0', '16.0', '6.0', '26.0', '50.0',
       '11.0', '14.0', '46.0', '5.0', '4.0', '31.0', '9.0', '52.0',
       '48.0', '47.0', '70.0', '43.0', '3.0', '80.0', '54.0', '0.0'
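As the outputs above show, `astype('str')` turns missing values into the literal string `'nan'` and keeps trailing spaces such as `'2 '`. A minimal sketch of a conversion that keeps missing values as NaN and strips the padding is given below. The helper `to_clean_text` is hypothetical, and whether the mixed codes in these columns should be normalised further is left open.

```python
import numpy as np
import pandas as pd


def to_clean_text(series):
    """Hypothetical helper: cast values to stripped strings but keep missing values as NaN."""
    as_text = series.astype(str).str.strip()
    # Keep the text where the original value was present, restore NaN elsewhere
    return as_text.where(series.notna(), np.nan)


# Made-up sample mixing letters, a float-encoded date, padding and a missing value
sample = pd.Series(["B", np.nan, 19620714.0, "600 "], dtype=object)
cleaned = to_clean_text(sample)
print(cleaned.tolist())
```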

- Problems encountered<br>
*ArrowInvalid: ("Could not convert '0 ' with type str: tried to convert to int64", 'Conversion failed for column c63 with type object')*

In [37]:
column = 'c63'
# Show the distinct values present in the column
df_clean[column].unique()

array([0, 4, 1, 2, 3, 10, 5, 6, 8, 11, 7, 25, 16, 15, 9, '0 ', '1 ', '3 ',
       '5 ', '2 ', '6 ', '4 ', '8 ', '7 ', '30', '10', nan, '70', '* ',
       '11', '13', '23', '9 ', '64', '12', '27', '16', '18', '57', '35',
       '65', '22', '20', '14', '25', '40', '37', '26', '99', '83', '17',
       '38', '56', '58', '81', '19', 18.0, 47.0, 45.0, 13.0], dtype=object)

In [38]:
# Convert the null marker '* ' into NaN
df_clean[column] = df_clean[column].replace('* ', np.nan)

# Cast through float to the nullable Int64 type, which keeps missing values as <NA>
df_clean[column] = df_clean[column].astype('float').astype("Int64")
df_clean[column].unique()

<IntegerArray>
[   0,    4,    1,    2,    3,   10,    5,    6,    8,   11,    7,   25,   16,
   15,    9,   30, <NA>,   70,   13,   23,   64,   12,   27,   18,   57,   35,
   65,   22,   20,   14,   40,   37,   26,   99,   83,   17,   38,   56,   58,
   81,   19,   47,   45]
Length: 43, dtype: Int64

- Problems encountered<br>
*ArrowInvalid: ("Could not convert '3 ' with type str: tried to convert to double", 'Conversion failed for column c64 with type object')*

In [39]:
column = 'c64'
# Show the distinct values present in the column
df_clean[column].unique()

array([1.0, 0.0, 3.0, 2.0, 11.0, 4.0, 6.0, 5.0, 8.0, 7.0, 9.0, 20.0, 18.0,
       55.0, 16.0, 12.0, 35.0, nan, 21.0, 32.0, 17.0, 15.0, 13.0, 40.0,
       33.0, 10.0, 14.0, '3 ', '1 ', '0 ', '4 ', '2 ', '10', '42', '7 ',
       '6 ', '20', '30', '5 ', '26', '31', '15', '11', '8 ', '12', '22',
       '13', '9 ', '19', '16', '38', '28', '24', '14', '71', '23', '* ',
       '36', '21', '84', '39', '25', '33', '18', '37', '34', '29', '47',
       '35', '77', '17', '50', '68', '57', '45', 49.0, 30.0, 39.0, 28.0,
       42.0, 53.0, 85.0, 22.0, 24.0], dtype=object)

In [40]:
# Convert the null marker '* ' into NaN
df_clean[column] = df_clean[column].replace('* ', np.nan)

# Cast through float to the nullable Int64 type, which keeps missing values as <NA>
df_clean[column] = df_clean[column].astype('float').astype("Int64")
df_clean[column].unique()

<IntegerArray>
[   1,    0,    3,    2,   11,    4,    6,    5,    8,    7,    9,   20,   18,
   55,   16,   12,   35, <NA>,   21,   32,   17,   15,   13,   40,   33,   10,
   14,   42,   30,   26,   31,   22,   19,   38,   28,   24,   71,   23,   36,
   84,   39,   25,   37,   34,   29,   47,   77,   50,   68,   57,   45,   49,
   53,   85]
Length: 54, dtype: Int64
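The integer columns fixed above (c6, c7, c8, c151, c63 and c64) all follow the same pattern: numbers mixed with padded strings and missing values. The pattern could be wrapped in one helper, sketched below with `pd.to_numeric`. The name `to_nullable_int` is hypothetical, and `errors="coerce"` is an assumption: it silently turns every non-numeric entry, not only `'* '`, into a missing value.

```python
import numpy as np
import pandas as pd


def to_nullable_int(series):
    """Hypothetical helper: strip strings, coerce to numbers, cast to nullable Int64."""
    stripped = series.map(lambda v: v.strip() if isinstance(v, str) else v)
    # Entries that cannot be parsed as numbers become NaN
    numeric = pd.to_numeric(stripped, errors="coerce")
    # Int64 (capital I) is the pandas nullable integer type
    return numeric.astype("Int64")


# Made-up sample with the kinds of values seen in the outputs above
sample = pd.Series([1975, "2010", np.nan, "01", "1 ", "* "], dtype=object)
converted = to_nullable_int(sample)
print(converted.tolist())
```

Unlike the `fillna(0)` used for c6, this keeps missing years as `<NA>` instead of 0. The cast to `Int64` would fail on a column that holds non-integral numbers, so the helper only suits columns that are meant to be whole numbers.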

- Conversion of the dataframe to .parquet

In [41]:
try:
    table = pa.Table.from_pandas(df_clean, preserve_index=False)
    pq.write_table(table, data_path + pha_file)
except Exception as e:
    print(e.__str__())
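The problem columns above were found one at a time, each from a new ArrowInvalid error. As a rough sketch, the object columns that mix several Python types could be listed in a single pass before attempting the conversion. The helper `mixed_type_columns` and the sample frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mixed_type_columns(df):
    """Hypothetical helper: map each object column holding several Python types to those types."""
    report = {}
    for name in df.columns:
        if not pd.api.types.is_object_dtype(df[name]):
            continue
        # Missing values are ignored so that NaN alone does not count as a second type
        kinds = sorted({type(value).__name__ for value in df[name].dropna()})
        if len(kinds) > 1:
            report[name] = kinds
    return report


# Made-up sample: c6 mixes int and str, c1 holds only strings
sample = pd.DataFrame({
    "c6": pd.Series([1975, "2010", np.nan], dtype=object),
    "c1": pd.Series(["A", "I", "A"], dtype=object),
})
report = mixed_type_columns(sample)
print(report)
```

Columns that hold only strings with stray padding, such as `'1 '`, would not be reported by this check, so it complements rather than replaces inspecting `unique()`.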