# Data Processing

The code below cleans a dataset of airport information using Python's pandas library. It first imports `pandas` for data manipulation and `numpy` for random number generation. The dataset, located at `'../Data/osm-world-airports@babel.csv'`, is read into a pandas DataFrame with three parsing parameters: `delimiter=';'` because fields are separated by semicolons, `on_bad_lines='skip'` so that malformed rows are skipped instead of raising an error, and `engine='python'` to use the slower but more tolerant Python parser.
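To see what these parsing options do, here is a minimal sketch that runs on an in-memory sample rather than the real file. The sample rows and the reduced set of columns are made up for illustration; only the `read_csv` arguments come from the notebook.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample; the second data row has extra fields.
sample = (
    "Name;IATA code;Country\n"
    "Alpha Field;AAA;Freedonia\n"
    "Broken Row;BBB;Sylvania;unexpected;extra\n"
    "Gamma Strip;CCC;Freedonia\n"
)

# Same parsing arguments as the notebook: rows with too many fields are skipped.
demo = pd.read_csv(
    io.StringIO(sample), delimiter=";", on_bad_lines="skip", engine="python"
)
print(demo)
```

Because skipped rows are not reported, comparing `len(demo)` with the number of data lines in the raw input is a cheap way to see how much was lost.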

Next, a synthetic 'Size' column is added: each airport receives a random integer from 100 to 499, drawn with `np.random.randint(100, 500, ...)`. These values are placeholders, not real measurements, and they change on every run because no seed is set. Rows with a missing value in any of the critical columns ('Country', 'Country code', 'IATA code', 'ICAO code') are then dropped, and columns that are not needed for further analysis ('other_tags', 'Description', 'Wikipedia id', 'Wikidata id', 'Edit in OSM', and 'Source') are removed.
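The same three steps can be sketched on a tiny hand-made DataFrame. The rows are hypothetical, and the seeded generator (`np.random.default_rng(42)`) is an assumption added here for reproducibility; the notebook itself uses the unseeded `np.random.randint`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the notebook.
demo = pd.DataFrame({
    "Name": ["Alpha Field", "Beta Strip", "Gamma Intl"],
    "Country": ["Freedonia", None, "Sylvania"],
    "Country code": ["FD", "SV", "SV"],
    "IATA code": ["AAA", "BBB", None],
    "ICAO code": ["FAAA", "SBBB", "SCCC"],
    "Source": ["osm", "osm", "osm"],
})

# Seeded generator (an assumption, not in the notebook); upper bound is exclusive.
rng = np.random.default_rng(42)
demo["Size"] = rng.integers(100, 500, size=len(demo))

# Keep only rows where every critical column has a value.
critical = ["Country", "Country code", "IATA code", "ICAO code"]
demo = demo.dropna(subset=critical)

# Drop a column that is not needed downstream.
demo = demo.drop(columns=["Source"])
print(demo)
```

Only the first row survives: the second lacks a country and the third lacks an IATA code.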

Finally, commas in the free-text columns (`'Geo Point'`, `'Name'`, `'Phone'`, `'Name (en)'`, `'Name (fr)'`, `'Operator'`, `'Website'`) are replaced with semicolons. The cleaned data is saved as a comma-separated file, so removing embedded commas keeps every field free of the output delimiter and makes the file easier to parse, even with simple readers that split on commas.
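A short sketch of the replacement on made-up values, using a loop over the column names instead of one statement per column. Passing `regex=False` makes the literal match explicit; missing values are left untouched.

```python
import pandas as pd

# Hypothetical values containing embedded commas and one missing phone number.
demo = pd.DataFrame({
    "Name": ["Alpha Field, North", "Gamma Strip"],
    "Geo Point": ["12.5, 45.1", "-3.2, 101.7"],
    "Phone": [None, "+00 123, +00 456"],
})

# Replace every literal comma with a semicolon in each text column.
text_columns = ["Name", "Geo Point", "Phone"]
for column in text_columns:
    demo[column] = demo[column].str.replace(",", ";", regex=False)

print(demo)
```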

The cleaned DataFrame (`df`) represents structured airport data, where:
- **IATA code** refers to the unique three-letter code assigned by the International Air Transport Association to identify airports.
- **ICAO code** denotes the four-letter location indicator assigned under the International Civil Aviation Organization's scheme to identify individual airports worldwide.
- **Country** signifies the country in which each airport is located.
- **Country code** represents the standard two-letter code assigned to countries according to ISO 3166-1 alpha-2.
- **Name** denotes the official name of the airport.
- **Size** is a randomly generated integer that stands in for the size of each airport; it does not reflect real data.
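The code formats described above suggest a simple sanity check. The sketch below is not part of the notebook: the strict patterns (exactly three or four uppercase letters) are an assumption, the sample values are hypothetical, and real-world data may contain identifiers that legitimately deviate.

```python
import pandas as pd

# Hypothetical codes, two of which do not have the expected shape.
demo = pd.DataFrame({
    "IATA code": ["AAA", "bb", "CCC"],
    "ICAO code": ["FAAA", "SBBB", "SC1"],
})

# Assumed patterns: three uppercase letters (IATA), four uppercase letters (ICAO).
iata_ok = demo["IATA code"].str.fullmatch(r"[A-Z]{3}")
icao_ok = demo["ICAO code"].str.fullmatch(r"[A-Z]{4}")

# Rows where at least one code fails its pattern.
suspect = demo[~(iata_ok & icao_ok)]
print(suspect)
```

A check like this assumes rows with missing codes have already been dropped, since `str.fullmatch` returns a missing value for a missing input.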

The cleaned DataFrame (`df`) is saved to `../Data/Airports.csv` and can be inspected in the notebook or used in later analysis.


In [1]:
import pandas as pd
import numpy as np

# Define the file path of the dataset
file_path = '../Data/osm-world-airports@babel.csv'

# Read the CSV file into a pandas DataFrame
# Skipping lines that cause parsing errors, using Python engine
df = pd.read_csv(file_path, delimiter=';', on_bad_lines='skip', engine='python')

# Generate a random 'Size' column for each airport in the DataFrame
df['Size'] = np.random.randint(100, 500, size=len(df))

# Drop rows with missing values in critical columns ('Country', 'Country code', 'IATA code', 'ICAO code')
df = df.dropna(subset=['Country', 'Country code', 'IATA code', 'ICAO code'])

# Remove unnecessary columns from the DataFrame
df = df.drop(columns=["other_tags", "Description", "Wikipedia id", "Wikidata id", "Edit in OSM", "Source"])

# Replace commas (',') in free-text columns with semicolons (';') to avoid CSV parsing issues
text_columns = ['Geo Point', 'Name', 'Phone', 'Name (en)', 'Name (fr)', 'Operator', 'Website']
for column in text_columns:
    # regex=False treats ',' as a literal character
    df[column] = df[column].str.replace(',', ';', regex=False)

# Save the cleaned DataFrame to disk
df.to_csv("../Data/Airports.csv")

# Display the cleaned DataFrame (only the last expression in a cell is rendered)
df