<a id='Q0'></a>
<center> <h1> Aviation Herald Project: FastText Model</h1> </center>
<p style="margin-bottom:1cm;"></p>
<center><h4>Laurent Bobay, 2024</h4></center>
<p style="margin-bottom:1cm;"></p>

<div style="background:#EEEDF5;border-top:0.1cm solid #EF475B;border-bottom:0.1cm solid #EF475B;">
    <div style="margin-left: 0.5cm;margin-top: 0.5cm;margin-bottom: 0.5cm;color:#303030">
        <p><strong>Goal:</strong> Train a FastText model on the publicly available articles from www.avherald.com and use it to find articles similar to a new text</p>
        <strong> Outline:</strong>
        <a id='P0' name="P0"></a>
        <ol>
            <li> <a style="color:#303030" href='#SU'>Set up</a></li>
            <li> <a style="color:#303030" href='#P1'>Data Exploration and Cleaning</a></li>
            <li> <a style="color:#303030" href='#P2'>Modeling</a></li>
            <li> <a style="color:#303030" href='#P3'>Model Evaluation</a></li>
            <li> <a style="color:#303030" href='#CL'>Conclusion</a></li>
        </ol>
        <strong>Topics Trained:</strong> Notebook Layout, Data Cleaning, Modeling and Model Evaluation
    </div>
</div>

<nav style="text-align:right"><strong>
        <a style="color:#00BAE5" href="https://monolith.propulsion-home.ch/backend/api/momentum/materials/ds-materials/07_MLEngineering/index.html" title="momentum"> Module 7, Machine Learning Engineering </a>|
        <a style="color:#00BAE5" href="https://monolith.propulsion-home.ch/backend/api/momentum/materials/ds-materials/07_MLEngineering/day1/index.html" title="momentum">Day 1, Data Science Project Development </a>|
        <a style="color:#00BAE5" href="https://drive.google.com/file/d/1SOCQu9Gv3jNNXxvJSszBC3fYNsM0df2F/view?usp=sharing" title="momentum"> Live Coding 1, Simple Prediction Notebook</a>
</strong></nav>

In [1]:
import fasttext
import pandas as pd
import numpy as np
import csv
from IPython.display import display, Markdown
import re

from sklearn.metrics.pairwise import cosine_similarity



### Create the corpus

##### Convert the csv text column to a txt file:

In [10]:
# Path to your CSV file
csv_file = "../data/interim/preprocessed_dataset.csv" # title,href,text,time_author

# Path to the output text file
txt_file = "../data/interim/text_corpus.txt"

# Read the CSV and write one whitespace-normalized text per line
def csv_to_txt(csv_file, txt_file):
    with open(csv_file, 'r', encoding='utf-8', newline='') as csv_in, open(txt_file, 'w', encoding='utf-8') as txt_out:
        csv_reader = csv.reader(csv_in)
        next(csv_reader)  # Skip the header row
        for row in csv_reader:
            if len(row) > 2:  # Ensure the row actually has a "text" column
                text = row[2]  # "text" is the third column (index 2)
                text = re.sub(r'\s+', ' ', text)  # Collapse line breaks, tabs and repeated spaces into one space
                txt_out.write(text + '\n')

# Call the function
csv_to_txt(csv_file, txt_file)

print(f"Texts extracted from '{csv_file}' and saved to '{txt_file}'.")


Texts extracted from '../data/interim/preprocessed_dataset.csv' and saved to '../data/interim/text_corpus.txt'.
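
The later cells read the corpus line by line and map each line index back to a CSV row, so every article must occupy exactly one line. A minimal sketch of the normalization step on an in-memory CSV follows; the sample rows and the `normalize_whitespace` helper are invented for illustration and are not part of the project data.

```python
import csv
import io
import re

# Invented sample data: the second row has a quoted field with
# an embedded CRLF, a tab and a blank line.
sample_csv = (
    "title,href,text,time_author\n"
    'A,/h?a=1,"Single line text",x\n'
    'B,/h?a=2,"First line\r\nsecond\tline\n\nthird",y\n'
)

def normalize_whitespace(text):
    """Collapse any run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text)

reader = csv.reader(io.StringIO(sample_csv, newline=""))
next(reader)  # skip the header row
lines = [normalize_whitespace(row[2]) for row in reader if len(row) > 2]

for line in lines:
    print(line)
```

Because `csv.reader` reassembles quoted multi-line fields before the regex runs, each article ends up on a single output line regardless of how many line breaks it contained.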


### Train the FastText model

In [11]:
# Train a FastText model on the corpus
model = fasttext.train_unsupervised(txt_file, model='skipgram', verbose=2)

# Save the model
model_path = "../models/fasttext_skipgram.bin"
model.save_model(model_path)



Read 6M words
Number of words:  39628
Number of labels: 0
Progress: 100.0% words/sec/thread:  118665 lr:  0.000000 avg.loss:  1.678877 ETA:   0h 0m 0s


### Generate Embeddings for each Text in the Corpus

In [12]:
# Load the trained model
model = fasttext.load_model(model_path)

# Read the corpus and generate embeddings
corpus_file = txt_file
embeddings = []
texts = []

with open(corpus_file, 'r', encoding='utf-8') as f:
    for line in f:
        text = line.strip()
        texts.append(text)
        embeddings.append(model.get_sentence_vector(text))

# Convert to numpy array for easier manipulation
embeddings = np.array(embeddings)

print(len(embeddings))


29016
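
For intuition: as far as I understand the fastText implementation, `get_sentence_vector` on an unsupervised model averages the L2-normalized word vectors of the whitespace-separated tokens. The sketch below reproduces that idea with NumPy on a toy vocabulary. The `toy_vectors` dictionary and the `sentence_vector` helper are hypothetical stand-ins for the trained model, which additionally builds vectors for unseen words from character n-grams.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model.
toy_vectors = {
    "runway": np.array([3.0, 0.0, 0.0]),
    "excursion": np.array([0.0, 4.0, 0.0]),
    "takeoff": np.array([0.0, 0.0, 0.0]),  # zero vector, skipped below
}

def sentence_vector(text, vectors, dim=3):
    """Average the unit-length vectors of the whitespace-separated tokens."""
    total = np.zeros(dim)
    count = 0
    for token in text.split():
        vec = vectors.get(token)
        if vec is None:
            continue  # toy lookup only; fastText would fall back to subword n-grams
        norm = np.linalg.norm(vec)
        if norm > 0:
            total += vec / norm  # normalize so long vectors do not dominate
            count += 1
    return total / count if count > 0 else total

vec = sentence_vector("runway excursion takeoff", toy_vectors)
print(vec)
```

Normalizing before averaging means every word contributes equally to the direction of the sentence vector, which is what cosine similarity compares later on.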


### Find similar Texts for a given new Text

In [18]:
# Function to find the most similar texts
def find_similar_texts(new_text, model, texts, embeddings, top_n=30):
    new_embedding = model.get_sentence_vector(new_text)
    similarities = cosine_similarity([new_embedding], embeddings)[0]
    similar_indices = similarities.argsort()[-top_n:][::-1]
    
    return [(texts[i], similarities[i]) for i in similar_indices], similar_indices
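
To see what the ranking does without scikit-learn, here is a minimal NumPy sketch of the same idea: divide dot products by the vector norms, then sort in descending order. The `top_n_cosine` helper and the toy vectors are illustrative assumptions, not part of the project code.

```python
import numpy as np

def top_n_cosine(query, matrix, top_n=2):
    """Return (indices, scores) of the rows of `matrix` most similar to `query`."""
    row_norms = np.linalg.norm(matrix, axis=1)
    row_norms[row_norms == 0] = 1.0          # avoid division by zero for all-zero rows
    query_norm = np.linalg.norm(query) or 1.0
    scores = (matrix @ query) / (row_norms * query_norm)
    order = np.argsort(scores)[::-1][:top_n]  # highest similarity first
    return order, scores[order]

toy_embeddings = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 1.0],   # 45 degrees away from the query
    [-1.0, 0.0],  # opposite direction
])
indices, scores = top_n_cosine(np.array([2.0, 0.0]), toy_embeddings)
print(indices, scores)
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly even though the magnitudes differ.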



In [20]:
# Provide the new text
new_text = (
    "At about 1210 hrs, the flight crew received take-off clearance on runway 27. At the time, crosswind from the right side of 12 kt prevailed. The FDR recording and the statement of the co-pilot showed that during the initial about 23 s long phase of the take-off run up to a speed of 126 kt CAS, the co-pilot counteracted the crosswind with left rudder inputs. The rudder was continually returned to neutral position. At 1210:51 hrs, a rudder input to the right occurred. The airplane changed the heading by about 10° to the right to 281° and a short time later was about 230 m beyond the intersection of runway 27 with runway 30, 3.7 m away from the right edge of runway 27. Shortly before the closest distance to the runway edge, the PIC (40, ATPL, 9,870 hours total, 3,052 hours on type) checked the engine parameters, then the co-pilot (28, CPL, 808 hours total, 718 hours on type) said “Sorry, das Flugzeug zieht nach rechts (the airplane is pulling to the right)”, according to the statement of both pilots. Subsequently, the PIC aborted the take-off at a speed of 137 kt CAS, 2 kt below the decision speed V1. According to her own statement, she pulled the thrust lever to idle and then to full reverse. The PIC also executed a left rudder input towards the runway centre line and pulled the side stick back. According to her statement, she can no longer remember to operate the side stick. Immediately afterwards, the nose landing gear lifted off. After the PIC realised the lift-off of the nose, she pushed the side stick forward and the nose landing gear touched down again. During ground contact, the airplane was in a yaw movement towards the left in the direction of the runway centre. After the rejected take-off, the ground spoilers were deployed automatically and the auto brake system activated in level MAX. During the braking action, the airplane moved back towards the runway centre line and came to a stop on the runway ahead of taxiway C. "
    "After the aircraft was stopped, the co-pilot informed the tower of the rejected take-off and requested the fire brigade. Following a short stop at this position and the check of the brake temperatures, among other things, the flight crew decided to leave the runway and taxied back to the assigned parking position 12. The PIC informed the mechanics of the maintenance organisation subcontracted by the operator, that the take-off had been aborted because the airplane had pulled to the right, according to their statements. The operator was initially informed of the occurrence by the PIC via the Operations Control Center (OCC)."
)  # Adjacent literals are joined into one line; the text must not contain a newline

# Find similar texts in the corpus
similar_texts, similar_indices = find_similar_texts(new_text, model, texts, embeddings)


In [22]:
df = pd.read_csv(csv_file)

# Print title and URL of each similar article, most similar first
for idx in similar_indices:
    row = df.iloc[idx]
    print(row['title'], row['url'])


Etihad A320 at Kozhikode on Jun 20th 2019, temporary runway excursion https://avherald.com/h?article=4dc63eae&opt=0
Merpati B735 at Makassar on Dec 26th 2011, temporary runway excursion on landing https://avherald.com/h?article=4a302f92&opt=0
Arabia A320 at Sharjah on Sep 18th 2018, intersection line up departed in wrong direction https://avherald.com/h?article=4bded52d&opt=0
Singapore B744 at Singapore on Dec 2nd 2011, runway excursion https://avherald.com/h?article=44706cf6/0000&opt=0
Jordan B734 at Tombouctou on May 5th 2017, overran runway on landing https://avherald.com/h?article=4b8b4503&opt=0
Singapore B772 at Singapore on Feb 28th 2009, temporarily veered off runway on landing https://avherald.com/h?article=42e2b42f/0000&opt=0
Garuda AT72 at Ende on Oct 19th 2015, unstabilized approach leads to bounces, runway excursion and balked landing https://avherald.com/h?article=4e9eb5db&opt=0
British Midland A319 at Amsterdam on Apr 18th 2007, rapid yaw during takeoff, liftoff below V1 

<div style="border-top:0.1cm solid #EF475B"></div>
    <strong><a href='#Q0'><div style="text-align: right"> <h3>End of this Notebook.</h3></div></a></strong>