<a id='Q0'></a>
<center> <h1> Aviation Herald Project: Preprocessing</h1> </center>
<p style="margin-bottom:1cm;"></p>
<center><h4>Laurent Bobay, 2024</h4></center>
<p style="margin-bottom:1cm;"></p>

<div style="background:#EEEDF5;border-top:0.1cm solid #EF475B;border-bottom:0.1cm solid #EF475B;">
    <div style="margin-left: 0.5cm;margin-top: 0.5cm;margin-bottom: 0.5cm;color:#303030">
        <p><strong>Goal:</strong> Create a dataset of all publicly available articles and comments from www.avherald.com</p>
        <strong> Outline:</strong>
        <a id='P0' name="P0"></a>
        <ol>
            <li> <a style="color:#303030" href='#SU'>Set up</a></li>
            <li> <a style="color:#303030" href='#P1'>Data Exploration and Cleaning</a></li>
            <li> <a style="color:#303030" href='#P2'>Modeling</a></li>
            <li> <a style="color:#303030" href='#P3'>Model Evaluation</a></li>
            <li> <a style="color:#303030" href='#CL'>Conclusion</a></li>
        </ol>
        <strong>Topics Trained:</strong> Notebook Layout, Data Cleaning, Modelling and Model Evaluation
    </div>
</div>


<a id='I' name="I"></a>
## [Introduction](#P0)

www.avherald.com is the standard reference for aviation incidents and occurrences, listing around 30,000 of them. The goal of this notebook is to preprocess the articles scraped from the site and store them as a dataset.

<a id='SU' name="SU"></a>
## [Set up](#P0)

### Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from datetime import datetime
from IPython.display import display, Markdown
from tqdm import tqdm

import torch
from sentence_transformers import SentenceTransformer
from transformers import pipeline


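# Imported explicitly because both are used directly in the cells below
import nltk
import geonamescache

from preprocessing_helpers import *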




### File Paths

In [4]:

dataset_path = "../data/interim/full_dataset.csv"  # latest version of the full dataset


<a id='P1'></a>
## [Data Preparation](#P0)

In [5]:
df = pd.read_csv(dataset_path)
df

Unnamed: 0,title,href,text,time_author,headline,comment_authors,comments,occurrence,url
0,"RAM AT72 at Seville on Jun 19th 2024, runway i...",/h?article=51aa8170&opt=0,A RAM Royal Air Maroc Avions de Transport Regi...,"By Simon Hradecky, created Wednesday, Jul 3rd...",Incident: RAM AT72 at Seville on Jun 19th 2024...,"['By (anonymous) on Thursday, Jul 4th 2024 18...",['With the increase of traffic for the summer ...,incident,https://avherald.com/h?article=51aa8170&opt=0
1,Corendon Europe B738 at Brussels on Jul 2nd 20...,/h?article=51aa31d7&opt=0,"A Corendon Airlines Europe Boeing 737-800, reg...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: Corendon Europe B738 at Brussels on ...,"['By Bert hethak on Wednesday, Jul 3rd 2024 1...","[""The second to last sentence of the first par...",incident,https://avherald.com/h?article=51aa31d7&opt=0
2,"San Marino A333 at Surakarta on Jul 2nd 2024, ...",/h?article=51aa2c28&opt=0,A San Marino Aviation Airbus A330-300 on behal...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: San Marino A333 at Surakarta on Jul ...,"['By Hans R. on Wednesday, Jul 3rd 2024 20:34...",['They own the A 300-600F since October 2019 (...,incident,https://avherald.com/h?article=51aa2c28&opt=0
3,"ANZ B789 over Pacific on Jul 1st 2024, engine ...",/h?article=51aa1b40&opt=0,"An ANZ Air New Zealand Boeing 787-9, registrat...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ B789 over Pacific on Jul 1st 202...,"['By Yoke bloke on Thursday, Jul 4th 2024 14:...","[""There's a simple solution, and the crew didn...",incident,https://avherald.com/h?article=51aa1b40&opt=0
4,"ANZ DH8C at Invercargill on Jun 29th 2024, uns...",/h?article=51aa17ea&opt=0,An ANZ Air New Zealand de Havilland Dash 8-300...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ DH8C at Invercargill on Jun 29th...,"['By (anonymous) on Wednesday, Jul 3rd 2024 0...",['Lucky. '],incident,https://avherald.com/h?article=51aa17ea&opt=0
...,...,...,...,...,...,...,...,...,...
28957,"Fedex MD-10 gear collapse at Memphis, TN, Dec ...",/h?article=3d769eac&opt=0,NTSB has released the final report of the acci...,"By Simon Hradecky, created Monday, Sep 5th 20...",Final Report: Fedex MD-10 gear collapse at Mem...,[],[],final report,https://avherald.com/h?article=3d769eac&opt=0
28958,"Mandala Airlines B737-200 at Medan, Indonesia ...",/h?article=3d767dad&opt=0,"A B737-200 of Indonesian airline ""Mandala Airl...","By Simon Hradecky, created Monday, Sep 5th 20...","Crash: Mandala Airlines B737-200 at Medan, Ind...",[],[],crash,https://avherald.com/h?article=3d767dad&opt=0
28959,"Tans Peru B737-200 at Pucallpa, Amazonas on A...",/h?article=3d6daad4&opt=0,"An airplane of Peruvian Airline ""Tans"", airpla...","By Simon Hradecky, created Tuesday, Aug 23rd 2...","Crash: Tans Peru B737-200 at Pucallpa, Amazon...",[],[],crash,https://avherald.com/h?article=3d6daad4&opt=0
28960,Flash Air B737 at Sharm el-Sheik on Jan 03 200...,/h?article=3d68ad47&opt=0,"Hi all, just found this link to the official a...","By Urs Wildermuth, created Tuesday, Aug 16th 2...",Final Report: Flash Air B737 at Sharm el-Sheik...,[],[],final report,https://avherald.com/h?article=3d68ad47&opt=0


### Feature Engineering

In [6]:
# Split each value of the 'time_author' column into author, creation time and last-update time
df["author"], df["created"], df["updated"] = zip(*df["time_author"].apply(get_author_and_time))
df.head()

Unnamed: 0,title,href,text,time_author,headline,comment_authors,comments,occurrence,url,author,created,updated
0,"RAM AT72 at Seville on Jun 19th 2024, runway i...",/h?article=51aa8170&opt=0,A RAM Royal Air Maroc Avions de Transport Regi...,"By Simon Hradecky, created Wednesday, Jul 3rd...",Incident: RAM AT72 at Seville on Jun 19th 2024...,"['By (anonymous) on Thursday, Jul 4th 2024 18...",['With the increase of traffic for the summer ...,incident,https://avherald.com/h?article=51aa8170&opt=0,Simon Hradecky,2024-07-03 07:01:00,2024-07-03 07:01:00
1,Corendon Europe B738 at Brussels on Jul 2nd 20...,/h?article=51aa31d7&opt=0,"A Corendon Airlines Europe Boeing 737-800, reg...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: Corendon Europe B738 at Brussels on ...,"['By Bert hethak on Wednesday, Jul 3rd 2024 1...","[""The second to last sentence of the first par...",incident,https://avherald.com/h?article=51aa31d7&opt=0,Simon Hradecky,2024-07-02 19:50:00,2024-07-02 19:50:00
2,"San Marino A333 at Surakarta on Jul 2nd 2024, ...",/h?article=51aa2c28&opt=0,A San Marino Aviation Airbus A330-300 on behal...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: San Marino A333 at Surakarta on Jul ...,"['By Hans R. on Wednesday, Jul 3rd 2024 20:34...",['They own the A 300-600F since October 2019 (...,incident,https://avherald.com/h?article=51aa2c28&opt=0,Simon Hradecky,2024-07-02 19:00:00,2024-07-02 19:00:00
3,"ANZ B789 over Pacific on Jul 1st 2024, engine ...",/h?article=51aa1b40&opt=0,"An ANZ Air New Zealand Boeing 787-9, registrat...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ B789 over Pacific on Jul 1st 202...,"['By Yoke bloke on Thursday, Jul 4th 2024 14:...","[""There's a simple solution, and the crew didn...",incident,https://avherald.com/h?article=51aa1b40&opt=0,Simon Hradecky,2024-07-02 16:34:00,2024-07-02 16:34:00
4,"ANZ DH8C at Invercargill on Jun 29th 2024, uns...",/h?article=51aa17ea&opt=0,An ANZ Air New Zealand de Havilland Dash 8-300...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ DH8C at Invercargill on Jun 29th...,"['By (anonymous) on Wednesday, Jul 3rd 2024 0...",['Lucky. '],incident,https://avherald.com/h?article=51aa17ea&opt=0,Simon Hradecky,2024-07-02 16:08:00,2024-07-02 16:08:00
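The helper `get_author_and_time` lives in `preprocessing_helpers` and is not shown in this notebook. As a rough sketch of what such a parser could look like, the block below assumes a byline of the form `By <author>, created <weekday>, <Mon> <day><suffix> <year> <HH:MM>Z, last updated <weekday>, ...`. The sample values above are truncated, so this layout is an assumption. The names `BYLINE_RE`, `parse_timestamp` and `split_byline` are hypothetical, and month parsing with `%b` assumes an English locale.

```python
import re
from datetime import datetime

# Assumed byline layout (not confirmed by the notebook output):
# "By <author>, created <weekday>, <Mon> <day><suffix> <year> <HH:MM>Z, last updated <weekday>, ..."
BYLINE_RE = re.compile(
    r"By (?P<author>.+?), created (?P<created>.+?Z)"
    r"(?:, last updated (?P<updated>.+?Z))?\s*$"
)


def parse_timestamp(raw):
    """Turn e.g. 'Wednesday, Jul 3rd 2024 07:01Z' into a datetime."""
    without_weekday = raw.split(", ", 1)[1]
    # Drop ordinal suffixes (1st, 2nd, 3rd, 4th) so strptime can read the day
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", without_weekday)
    return datetime.strptime(cleaned, "%b %d %Y %H:%MZ")


def split_byline(byline):
    """Return (author, created, updated); updated falls back to created."""
    match = BYLINE_RE.match(byline.strip())
    if match is None:
        return None, None, None
    created = parse_timestamp(match["created"])
    updated = parse_timestamp(match["updated"]) if match["updated"] else created
    return match["author"], created, updated


example = ("By Simon Hradecky, created Wednesday, Jul 3rd 2024 07:01Z, "
           "last updated Wednesday, Jul 3rd 2024 07:01Z")
print(split_byline(example))
```

Returning a three-element tuple is what lets the `zip(*...)` pattern in the cell above unpack the results into three columns at once.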


#### Convert all text to string

In [7]:

# Convert all items in 'text' column to string
df['text'] = df['text'].apply(lambda x: str(x) if pd.notna(x) else " ")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28962 entries, 0 to 28961
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   title            28962 non-null  object        
 1   href             28962 non-null  object        
 2   text             28962 non-null  object        
 3   time_author      28962 non-null  object        
 4   headline         28962 non-null  object        
 5   comment_authors  28962 non-null  object        
 6   comments         28962 non-null  object        
 7   occurrence       28681 non-null  object        
 8   url              28962 non-null  object        
 9   author           28962 non-null  object        
 10  created          28962 non-null  datetime64[ns]
 11  updated          28962 non-null  datetime64[ns]
dtypes: datetime64[ns](2), object(10)
memory usage: 2.7+ MB


#### Preprocess Texts

In [8]:
# Ensure the necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')  # required by nltk.corpus.stopwords below
stop_words = nltk.corpus.stopwords.words('english')

# Initialize geonamescache
gc = geonamescache.GeonamesCache()

# Get a dictionary of cities and countries
cities = gc.get_cities()
countries = gc.get_countries()

# Extract city and country names
city_names = [remove_accented_chars(city['name']).lower() for city in cities.values()]
country_names = [remove_accented_chars(country['name']).lower() for country in countries.values()]

[nltk_data] Downloading package punkt to /Users/laurent/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
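`remove_accented_chars` is imported from `preprocessing_helpers` and its implementation is not shown here. One common way to get this behaviour, offered only as a sketch and not as the helper's actual code, is Unicode decomposition with the standard library. The name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accents, e.g. 'ü' -> 'u', leaving other characters as they are."""
    # NFKD splits an accented character into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Zürich").lower())     # zurich
print(strip_accents("São Paulo").lower())  # sao paulo
```

Letters such as "ø" have no Unicode decomposition, so with this approach a name like "Tromsø" stays unchanged and would need a separate mapping.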


In [9]:
tqdm.pandas()  # This line enables tqdm support for Pandas apply function

# Apply the preprocess_text function to each row in df["text"] with tqdm progress bar
df["normalized_text"], df["cities"], df["countries"] = zip(*df["text"].progress_apply(lambda x: preprocess_text(x, stop_words, city_names, country_names)))

100%|██████████| 28962/28962 [10:38<00:00, 45.39it/s] 


In [10]:
df

Unnamed: 0,title,href,text,time_author,headline,comment_authors,comments,occurrence,url,author,created,updated,normalized_text,cities,countries
0,"RAM AT72 at Seville on Jun 19th 2024, runway i...",/h?article=51aa8170&opt=0,A RAM Royal Air Maroc Avions de Transport Regi...,"By Simon Hradecky, created Wednesday, Jul 3rd...",Incident: RAM AT72 at Seville on Jun 19th 2024...,"['By (anonymous) on Thursday, Jul 4th 2024 18...",['With the increase of traffic for the summer ...,incident,https://avherald.com/h?article=51aa8170&opt=0,Simon Hradecky,2024-07-03 07:01:00,2024-07-03 07:01:00,ram royal air maroc avions transport regional ...,"{begun, casablanca, cork}","{spain, morocco, ireland}"
1,Corendon Europe B738 at Brussels on Jul 2nd 20...,/h?article=51aa31d7&opt=0,"A Corendon Airlines Europe Boeing 737-800, reg...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: Corendon Europe B738 at Brussels on ...,"['By Bert hethak on Wednesday, Jul 3rd 2024 1...","[""The second to last sentence of the first par...",incident,https://avherald.com/h?article=51aa31d7&opt=0,Simon Hradecky,2024-07-02 19:50:00,2024-07-02 19:50:00,corendon airlines europe boeing registration t...,"{gazipasa, brussels}","{belgium, turkey}"
2,"San Marino A333 at Surakarta on Jul 2nd 2024, ...",/h?article=51aa2c28&opt=0,A San Marino Aviation Airbus A330-300 on behal...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: San Marino A333 at Surakarta on Jul ...,"['By Hans R. on Wednesday, Jul 3rd 2024 20:34...",['They own the A 300-600F since October 2019 (...,incident,https://avherald.com/h?article=51aa2c28&opt=0,Simon Hradecky,2024-07-02 19:00:00,2024-07-02 19:00:00,aviation airbus behalf garuda registration mmm...,"{san, marino, surakarta}",{indonesia}
3,"ANZ B789 over Pacific on Jul 1st 2024, engine ...",/h?article=51aa1b40&opt=0,"An ANZ Air New Zealand Boeing 787-9, registrat...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ B789 over Pacific on Jul 1st 202...,"['By Yoke bloke on Thursday, Jul 4th 2024 14:...","[""There's a simple solution, and the crew didn...",incident,https://avherald.com/h?article=51aa1b40&opt=0,Simon Hradecky,2024-07-02 16:34:00,2024-07-02 16:34:00,anz air new zealand boeing registration nzd pe...,"{shanghai, honiara, auckland}",{china}
4,"ANZ DH8C at Invercargill on Jun 29th 2024, uns...",/h?article=51aa17ea&opt=0,An ANZ Air New Zealand de Havilland Dash 8-300...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ DH8C at Invercargill on Jun 29th...,"['By (anonymous) on Wednesday, Jul 3rd 2024 0...",['Lucky. '],incident,https://avherald.com/h?article=51aa17ea&opt=0,Simon Hradecky,2024-07-02 16:08:00,2024-07-02 16:08:00,anz air new zealand havilland dash registratio...,"{christchurch, invercargill}",{}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28957,"Fedex MD-10 gear collapse at Memphis, TN, Dec ...",/h?article=3d769eac&opt=0,NTSB has released the final report of the acci...,"By Simon Hradecky, created Monday, Sep 5th 20...",Final Report: Fedex MD-10 gear collapse at Mem...,[],[],final report,https://avherald.com/h?article=3d769eac&opt=0,Simon Hradecky,2005-09-05 10:02:00,2005-09-05 10:02:00,ntsb released final report accident fedex tenn...,"{officer, memphis}",{}
28958,"Mandala Airlines B737-200 at Medan, Indonesia ...",/h?article=3d767dad&opt=0,"A B737-200 of Indonesian airline ""Mandala Airl...","By Simon Hradecky, created Monday, Sep 5th 20...","Crash: Mandala Airlines B737-200 at Medan, Ind...",[],[],crash,https://avherald.com/h?article=3d767dad&opt=0,Simon Hradecky,2005-09-05 06:11:00,2005-09-05 06:11:00,indonesian airline mandala airlines flight peo...,"{jakarta, medan}",{indonesia}
28959,"Tans Peru B737-200 at Pucallpa, Amazonas on A...",/h?article=3d6daad4&opt=0,"An airplane of Peruvian Airline ""Tans"", airpla...","By Simon Hradecky, created Tuesday, Aug 23rd 2...","Crash: Tans Peru B737-200 at Pucallpa, Amazon...",[],[],crash,https://avherald.com/h?article=3d6daad4&opt=0,Simon Hradecky,2005-08-23 22:15:00,2005-08-23 22:15:00,airplane peruvian airline tans airplane type u...,"{lima, pucallpa}",{}
28960,Flash Air B737 at Sharm el-Sheik on Jan 03 200...,/h?article=3d68ad47&opt=0,"Hi all, just found this link to the official a...","By Urs Wildermuth, created Tuesday, Aug 16th 2...",Final Report: Flash Air B737 at Sharm el-Sheik...,[],[],final report,https://avherald.com/h?article=3d68ad47&opt=0,Urs Wildermuth,2005-08-16 21:46:00,2005-08-16 21:46:00,found link official accident report flight fla...,{best},{}
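The output columns suggest that `preprocess_text` lowercases the text, drops stop words and tokens containing digits, and moves tokens that match a city or country name into separate sets. The actual helper is not shown, so the block below is only a minimal sketch of that idea under those assumptions. The function name `normalize_and_tag` and the tiny word lists are made up for illustration.

```python
import re

# Tiny stand-ins for the NLTK stop-word list and the geonamescache name lists
STOP_WORDS = {"a", "of", "on", "from", "to", "the", "in"}
CITY_NAMES = {"jakarta", "medan"}
COUNTRY_NAMES = {"indonesia"}


def normalize_and_tag(text, stop_words, city_names, country_names):
    """Return (normalized_text, cities, countries) using single-token matching."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Keep purely alphabetic tokens, which drops things like 'b737' and '200'
    words = [t for t in tokens if t.isalpha()]
    cities = {t for t in words if t in city_names}
    countries = {t for t in words if t in country_names}
    kept = [
        t for t in words
        if t not in stop_words and t not in cities and t not in countries
    ]
    return " ".join(kept), cities, countries


sample = "A B737-200 of Mandala Airlines crashed on departure from Medan to Jakarta, Indonesia."
print(normalize_and_tag(sample, STOP_WORDS, CITY_NAMES, COUNTRY_NAMES))
```

Single-token matching of this kind cannot find multi-word names, and ordinary words that happen to be city names are tagged as cities, which may explain entries such as "begun", "officer" and "best" in the table above. If the name lists are passed as Python lists, converting them to sets first makes each membership test constant time instead of a linear scan.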


In [17]:
# Collapse all line breaks, tabs and repeated whitespace in the texts into single spaces
def remove_linebreaks(text):
    # \s already covers \n, \r and \t; Unicode matching is the default for str patterns
    return re.sub(r'\s+', ' ', text)

df["text"] = df["text"].apply(remove_linebreaks)
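As a quick check of the whitespace collapsing in this cell, here is a self-contained example on a made-up string (the sample text is an assumption, not taken from the dataset). It uses the pattern `\s+`, which matches the same characters as the character class above.

```python
import re


def remove_linebreaks(text):
    # Any run of whitespace (spaces, tabs, line breaks) becomes one space
    return re.sub(r"\s+", " ", text)


sample = "Engine failure\r\n\tafter   takeoff\n"
print(repr(remove_linebreaks(sample)))  # 'Engine failure after takeoff '
```

Leading and trailing whitespace is reduced to a single space but not removed, so a final `.strip()` may be worth adding if exact string comparisons matter later.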

In [18]:
df

Unnamed: 0,title,href,text,time_author,headline,comment_authors,comments,occurrence,url,author,created,updated,normalized_text,cities,countries
0,"RAM AT72 at Seville on Jun 19th 2024, runway i...",/h?article=51aa8170&opt=0,A RAM Royal Air Maroc Avions de Transport Regi...,"By Simon Hradecky, created Wednesday, Jul 3rd...",Incident: RAM AT72 at Seville on Jun 19th 2024...,"['By (anonymous) on Thursday, Jul 4th 2024 18...",['With the increase of traffic for the summer ...,incident,https://avherald.com/h?article=51aa8170&opt=0,Simon Hradecky,2024-07-03 07:01:00,2024-07-03 07:01:00,ram royal air maroc avions transport regional ...,"{begun, casablanca, cork}","{spain, morocco, ireland}"
1,Corendon Europe B738 at Brussels on Jul 2nd 20...,/h?article=51aa31d7&opt=0,"A Corendon Airlines Europe Boeing 737-800, reg...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: Corendon Europe B738 at Brussels on ...,"['By Bert hethak on Wednesday, Jul 3rd 2024 1...","[""The second to last sentence of the first par...",incident,https://avherald.com/h?article=51aa31d7&opt=0,Simon Hradecky,2024-07-02 19:50:00,2024-07-02 19:50:00,corendon airlines europe boeing registration t...,"{gazipasa, brussels}","{belgium, turkey}"
2,"San Marino A333 at Surakarta on Jul 2nd 2024, ...",/h?article=51aa2c28&opt=0,A San Marino Aviation Airbus A330-300 on behal...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: San Marino A333 at Surakarta on Jul ...,"['By Hans R. on Wednesday, Jul 3rd 2024 20:34...",['They own the A 300-600F since October 2019 (...,incident,https://avherald.com/h?article=51aa2c28&opt=0,Simon Hradecky,2024-07-02 19:00:00,2024-07-02 19:00:00,aviation airbus behalf garuda registration mmm...,"{san, marino, surakarta}",{indonesia}
3,"ANZ B789 over Pacific on Jul 1st 2024, engine ...",/h?article=51aa1b40&opt=0,"An ANZ Air New Zealand Boeing 787-9, registrat...","By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ B789 over Pacific on Jul 1st 202...,"['By Yoke bloke on Thursday, Jul 4th 2024 14:...","[""There's a simple solution, and the crew didn...",incident,https://avherald.com/h?article=51aa1b40&opt=0,Simon Hradecky,2024-07-02 16:34:00,2024-07-02 16:34:00,anz air new zealand boeing registration nzd pe...,"{shanghai, honiara, auckland}",{china}
4,"ANZ DH8C at Invercargill on Jun 29th 2024, uns...",/h?article=51aa17ea&opt=0,An ANZ Air New Zealand de Havilland Dash 8-300...,"By Simon Hradecky, created Tuesday, Jul 2nd 2...",Incident: ANZ DH8C at Invercargill on Jun 29th...,"['By (anonymous) on Wednesday, Jul 3rd 2024 0...",['Lucky. '],incident,https://avherald.com/h?article=51aa17ea&opt=0,Simon Hradecky,2024-07-02 16:08:00,2024-07-02 16:08:00,anz air new zealand havilland dash registratio...,"{christchurch, invercargill}",{}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28957,"Fedex MD-10 gear collapse at Memphis, TN, Dec ...",/h?article=3d769eac&opt=0,NTSB has released the final report of the acci...,"By Simon Hradecky, created Monday, Sep 5th 20...",Final Report: Fedex MD-10 gear collapse at Mem...,[],[],final report,https://avherald.com/h?article=3d769eac&opt=0,Simon Hradecky,2005-09-05 10:02:00,2005-09-05 10:02:00,ntsb released final report accident fedex tenn...,"{officer, memphis}",{}
28958,"Mandala Airlines B737-200 at Medan, Indonesia ...",/h?article=3d767dad&opt=0,"A B737-200 of Indonesian airline ""Mandala Airl...","By Simon Hradecky, created Monday, Sep 5th 20...","Crash: Mandala Airlines B737-200 at Medan, Ind...",[],[],crash,https://avherald.com/h?article=3d767dad&opt=0,Simon Hradecky,2005-09-05 06:11:00,2005-09-05 06:11:00,indonesian airline mandala airlines flight peo...,"{jakarta, medan}",{indonesia}
28959,"Tans Peru B737-200 at Pucallpa, Amazonas on A...",/h?article=3d6daad4&opt=0,"An airplane of Peruvian Airline ""Tans"", airpla...","By Simon Hradecky, created Tuesday, Aug 23rd 2...","Crash: Tans Peru B737-200 at Pucallpa, Amazon...",[],[],crash,https://avherald.com/h?article=3d6daad4&opt=0,Simon Hradecky,2005-08-23 22:15:00,2005-08-23 22:15:00,airplane peruvian airline tans airplane type u...,"{lima, pucallpa}",{}
28960,Flash Air B737 at Sharm el-Sheik on Jan 03 200...,/h?article=3d68ad47&opt=0,"Hi all, just found this link to the official a...","By Urs Wildermuth, created Tuesday, Aug 16th 2...",Final Report: Flash Air B737 at Sharm el-Sheik...,[],[],final report,https://avherald.com/h?article=3d68ad47&opt=0,Urs Wildermuth,2005-08-16 21:46:00,2005-08-16 21:46:00,found link official accident report flight fla...,{best},{}


In [19]:
# Write the full preprocessed dataset to a CSV file in the working directory
# (missing values are written as 'NULL')
df.to_csv('preprocessed_dataset.csv', sep=',', index=False, header=True, na_rep='NULL', encoding='utf-8')




In [20]:
# Write the same DataFrame to the interim data folder
df.to_csv('../data/interim/preprocessed_dataset.csv', sep=',', index=False, header=True, na_rep='NULL', encoding='utf-8')
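When the file is read back, the column types do not survive automatically. The sketch below uses a made-up two-row frame and an in-memory buffer instead of the real file to show what to expect: dates need `parse_dates`, `'NULL'` is among the strings pandas treats as missing by default, and set-valued columns come back as plain strings.

```python
import io

import pandas as pd

# Made-up two-row frame standing in for the preprocessed dataset
frame = pd.DataFrame({
    "occurrence": ["incident", None],
    "created": pd.to_datetime(["2024-07-03 07:01:00", "2005-08-16 21:46:00"]),
    "cities": [{"medan"}, set()],
})

# Write to an in-memory buffer with the same missing-value marker as above
buffer = io.StringIO()
frame.to_csv(buffer, index=False, na_rep="NULL")
buffer.seek(0)

# 'NULL' is one of the default NA strings of read_csv, so it becomes NaN again
restored = pd.read_csv(buffer, parse_dates=["created"])
print(restored.dtypes)
print(restored["occurrence"].isna().tolist())
print(type(restored.loc[0, "cities"]))
```

If the sets are needed again downstream, something like `ast.literal_eval` could turn the strings back into Python objects, though that step is not part of this notebook.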


<div style="border-top:0.1cm solid #EF475B"></div>
    <strong><a href='#Q0'><div style="text-align: right"> <h3>End of this Notebook.</h3></div></a></strong>