
# Predicting Aircraft Registration Types using a Random Forest Classifier
This data science portfolio project covers web scraping, data preprocessing, feature engineering, visualization, unsupervised learning, and supervised learning.

To begin, I built a web scraper using the PyPDF2 and pandas libraries to scrape a PDF of registered aircraft from the FAA. I then used the scraped data to get a list of N-numbers, which I used to query the FAA registry for all relevant information.

Next, I performed various preprocessing steps on the scraped data, including handling missing values and encoding categorical variables. I also did feature engineering by adding new features such as the age of the aircraft and time since the certificate was issued.

I then used Plotly Dash to build a dashboard for interactive exploration of the data. The dashboard includes bar charts, line charts, scatter plots, pie charts, stacked bar charts, and a choropleth map, and lets users filter the data by manufacturer name and by cluster.

I also used unsupervised learning techniques, specifically clustering, to group the aircraft into different clusters based on their expiration date, aircraft information and registration type.

The scrape_pdf function extracts data from PDF documents. It uses the PyPDF2 library to open and read the contents of a PDF file, then applies regular expressions and string splitting to pull out the desired fields. The extracted data is stored in a list, which is converted into a pandas DataFrame for further processing. The function also accounts for the different ways in which data may be laid out in the document.
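
The parsing step can be sketched without PyPDF2 as follows. This is a minimal illustration, not the code in import_pdf.py: the line layout (a registrant name, a gap of at least two spaces, then the N-number), the pattern, and the helper name `parse_lines` are all assumptions made for the example.

```python
import re

# Hypothetical layout: "<registrant name>  <N-number>" on each data line.
# The real PDF layout is not shown in this write-up, so the pattern is an assumption.
LINE_PATTERN = re.compile(
    r"^(?P<registrant>.+?)\s{2,}(?P<n_nbr>[0-9][0-9A-Z]{0,4})\s*$"
)


def parse_lines(page_text):
    """Return a list of {'Registrant', 'N-Nbr'} dicts, skipping non-matching lines."""
    rows = []
    for line in page_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            rows.append({
                "Registrant": match.group("registrant").strip(),
                "N-Nbr": match.group("n_nbr"),
            })
    return rows


# Sample text standing in for one extracted page (header and footer included)
sample_text = """REGISTRANT            N-NBR
182 FLIGHT CORP       115EL
182 FLIGHT CORP       22044
Page 1 of 200
"""
rows = parse_lines(sample_text)
```

Header and footer lines fail the pattern and are skipped. The resulting list of dicts can be passed directly to `pandas.DataFrame`.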

In this project, scrape_pdf is used to scrape data on registered aircraft from the FAA. This data includes the registrant's name, aircraft N-number, mailing address, manufacturer, and more. It describes the fleet of aircraft in the country and can support analyses such as predicting which aircraft may need maintenance soon or identifying the most popular manufacturers in the industry.

For server-load and run-time reasons, only registrants with more than 5 aircraft are considered.

In [1]:
from import_pdf import scrape_pdf
import pandas as pd
# Import the pdf file for scraping
# pdf_df = pd.read_csv('scraped_plane_data.csv')
pdf_df = scrape_pdf('arasmp81.pdf') # Find the full code in the import_pdf.py file

# Use the value_counts() method to count the number of occurrences of each value in the "Registrant" column
registrant_counts = pdf_df["Registrant"].value_counts()

# Use a boolean mask to keep the registrants whose count is greater than 5
mask = registrant_counts > 5

# Use the mask to select the values from the "Registrant" column
selected_values = registrant_counts[mask]
pdf_df = pdf_df[pdf_df["Registrant"].isin(selected_values.index)]
# Chain the steps on a fresh frame to avoid in-place edits on a slice
pdf_df = pdf_df[['Registrant', 'N-Nbr']].dropna().reset_index(drop=True)
pdf_df.head()

Unnamed: 0,Registrant,N-Nbr
0,182 FLIGHT CORP,115EL
1,182 FLIGHT CORP,169KD
2,182 FLIGHT CORP,22044
3,182 FLIGHT CORP,242TC
4,182 FLIGHT CORP,381CC


The scrape_nnumber function uses the Selenium webdriver to navigate to the FAA registry website and enter each N-number from the previously scraped PDF. It then extracts the data from the corresponding aircraft registration page and stores it in a DataFrame. Because the function loops over the unique N-numbers in the original DataFrame, the same process scales to a large number of inputs. The final DataFrame is then used for further analysis and visualization.
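
The looping pattern described above can be sketched independently of Selenium. This is an illustration under assumptions, not the code in scrape_nnumber_faa.py: `scrape_all` and `fake_fetch` are hypothetical names, and `fetch_record` stands in for the browser-driven lookup.

```python
def scrape_all(n_numbers, fetch_record):
    """Call fetch_record once per unique N-number and collect the results.

    fetch_record is a hypothetical callable standing in for the Selenium
    lookup; it takes an N-number and returns a dict of registry fields.
    """
    records, failed = [], []
    # dict.fromkeys removes duplicates while keeping first-seen order
    for n_nbr in dict.fromkeys(n_numbers):
        try:
            record = fetch_record(n_nbr)
        except Exception:
            # One bad page should not abort a long scraping run
            failed.append(n_nbr)
            continue
        record["N-Nbr"] = n_nbr
        records.append(record)
    return records, failed


def fake_fetch(n_nbr):
    """Stand-in for the real lookup, used only to exercise the loop."""
    if n_nbr == "BAD":
        raise ValueError("no such registration")
    return {"Status": "Valid"}


records, failed = scrape_all(["115EL", "169KD", "115EL", "BAD"], fake_fetch)
```

Keeping a list of failed N-numbers makes it possible to retry only those lookups later instead of repeating the whole run.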

In [2]:
from scrape_nnumber_faa import scrape_nnumber
df = scrape_nnumber(pdf_df) # See full code in the scrape_nnumber_faa.py file
df.head()

Unnamed: 0.1,Unnamed: 0,Serial Number,Manufacturer Name,Model,Type Aircraft,Pending Number Change,Date Change Authorized,MFR Year,Type Registration,Name,...,Country,Status,Certificate Issue Date,Expiration Date,Type Engine,Dealer,Mode S Code (base 8 / Oct),Mode S Code (Base 16 / Hex),Fractional Owner,N-Nbr
0,0,U20606814,CESSNA,TU206G,Fixed Wing Single-Engine,,,1984,Corporation,182 FLIGHT CORP,...,UNITED STATES,Valid,01/28/2015,01/31/2024,Reciprocating,No,50037465,A03F35,NO,115EL
1,1,18267456,CESSNA,182Q,Fixed Wing Single-Engine,,,1979,Corporation,182 FLIGHT CORP,...,UNITED STATES,Valid,02/15/2008,07/31/2025,Reciprocating,No,50212002,A11402,NO,169KD
2,2,18281671,CESSNA,182T,Fixed Wing Single-Engine,,,2005,Corporation,182 FLIGHT CORP,...,UNITED STATES,Valid,10/04/2005,12/31/2024,Reciprocating,No,50362002,A1E402,NO,22044
3,3,T20608530,CESSNA,T206H,Fixed Wing Single-Engine,,,2005,Corporation,182 FLIGHT CORP,...,UNITED STATES,Valid,09/05/2008,04/30/2024,Reciprocating,No,50434431,A23919,NO,242TC
4,4,9936CC,PIPER/CUB CRAFTERS,PA-18-150,Fixed Wing Single-Engine,,,2001,Corporation,182 FLIGHT CORP,...,UNITED STATES,Valid,06/01/2012,06/30/2024,Reciprocating,No,51057466,A45F36,NO,381CC


The dashboard is built with the open-source library Plotly Dash, which supports interactive, responsive web-based visualizations.

The dashboard is designed to provide an in-depth analysis of aircraft data and information, including information on aircraft manufacturers, models, engine types, and registration types. The user can filter the data by selecting a specific manufacturer from a dropdown menu, which updates the visualizations on the dashboard in real-time.

The dashboard includes several key features, including:
   - A pie chart that shows the distribution of engine types for the selected manufacturer
   - A bar chart that shows the number of aircraft per model for the selected manufacturer
   - A line chart that shows the number of aircraft by year of manufacture
   - A pie chart that shows the distribution of registration types for the selected manufacturer
   - A scatter plot that shows the closest expiration date for each registrant of the selected manufacturer
   - A stacked bar chart that shows the number of aircraft by registration type and aircraft type
   - A choropleth map that shows the distribution of aircraft by country
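
Each chart in the list above follows the same pattern: filter the rows by the selected manufacturer, count them by one column, and hand labels and values to the figure. A minimal sketch of that logic on plain dicts is shown here; the function name `engine_type_counts` and the sample rows are invented for the illustration.

```python
from collections import Counter


def engine_type_counts(records, manufacturer):
    """Count engine types among the records of one manufacturer."""
    selected = [r for r in records if r["Manufacturer Name"] == manufacturer]
    counts = Counter(r["Type Engine"] for r in selected)
    # Sort once and derive labels and values from the same pairs,
    # so each label always lines up with its own count.
    pairs = sorted(counts.items())
    labels = [label for label, _ in pairs]
    values = [value for _, value in pairs]
    return labels, values


# Invented sample rows, used only to exercise the function
records = [
    {"Manufacturer Name": "CESSNA", "Type Engine": "Reciprocating"},
    {"Manufacturer Name": "CESSNA", "Type Engine": "Turbo-prop"},
    {"Manufacturer Name": "CESSNA", "Type Engine": "Reciprocating"},
    {"Manufacturer Name": "PIPER/CUB CRAFTERS", "Type Engine": "Reciprocating"},
]
labels, values = engine_type_counts(records, "CESSNA")
```

Deriving labels and values from one grouped result matters in the pandas version too: taking labels from `unique()` and counts from `groupby().size()` can pair a label with the wrong count, because the two are ordered differently.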

The dashboard also includes three k-means clusterings (kmeans_dates, kmeans_aircraft_info, and kmeans_type) that group similar aircraft by certificate dates, by aircraft information, and by aircraft and registration type, in order to identify patterns and trends in the data.
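
The idea behind k-means can be sketched in a few lines of plain Python. This is a one-dimensional illustration of Lloyd's algorithm, not the implementation in machine_learning.py, which is not shown here; `kmeans_1d`, the initialisation rule, and the sample years are assumptions. The choice of k=3 mirrors the three cluster indices the dashboard callbacks loop over.

```python
def kmeans_1d(values, k=3, iterations=20):
    """Cluster numbers with Lloyd's algorithm; returns (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Invented expiration years, used only to illustrate the grouping
years = [2024, 2024, 2025, 2027, 2028, 2030, 2030, 2031]
labels, centroids = kmeans_1d(years, k=3)
```

The returned labels play the same role as the `dates_cluster`, `aircraft_cluster`, and `type_cluster` columns that the dashboard filters on.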

The dashboard is designed to be easy to navigate. Overall, this project covers data scraping, data cleaning, data visualization, feature engineering, and machine learning.


In [3]:
from dash import Dash
from dash.dependencies import Input, Output
import dash_bootstrap_components as dbc
from dash import html
from dash import dcc
import plotly.graph_objs as go
import pandas as pd
import numpy as np

from machine_learning import kmeans_dates, kmeans_aircraft_info, kmeans_type

# Work on a copy of the scraped dataframe so editing doesn't affect later analysis
faa_df = df.copy()

faa_df['Expiration Date'] = pd.to_datetime(faa_df['Expiration Date'])
faa_df['Type Aircraft'] = faa_df['Type Aircraft'].astype(str)
# Create cluster column with KMeans
faa_df = kmeans_dates(faa_df)
faa_df = kmeans_aircraft_info(faa_df)
faa_df = kmeans_type(faa_df)


app = Dash(__name__, external_stylesheets=[dbc.themes.DARKLY])


app.layout = html.Div([
    dbc.Row([
        dbc.Col(html.H1("My Dashboard"), md=12)
    ]),
    dbc.Row([
        dbc.Col(dbc.Select(
            id='manufacturer-dropdown',
            options=[{'label': i, 'value': i} for i in faa_df['Manufacturer Name'].unique()],
            value=faa_df['Manufacturer Name'].unique()[0]
        ), md=12)
    ]),
    dbc.Row([
        dbc.Col(dcc.Graph(id='engine-pie'), md=4),
        dbc.Col(dcc.Graph(id='bar-chart'), md=4),
        dbc.Col(dcc.Graph(id='line-chart'), md=4)
    ]),
    dbc.Row([
        dbc.Col(dcc.Graph(id='pie-chart'), md=4),
        dbc.Col(dcc.Graph(id='scatter-plot'), md=4),
        dbc.Col(dcc.Graph(id='stacked-bar-chart'), md=4)
    ]),
    dbc.Row([
        dbc.Col(dbc.Select(
                id='cluster-date-dropdown',
                options=sorted([{'label': i, 'value': i} for i in faa_df['dates_cluster'].unique()], key=lambda x: x['value']),
                value=0,
                ), md=4),

        dbc.Col(dbc.Select(
                id='cluster-aircraft-dropdown',
                options=sorted([{'label': i, 'value': i} for i in faa_df['aircraft_cluster'].unique()], key=lambda x: x['value']),
                value=0,
                ), md=4),
        dbc.Col(dbc.Select(
                id='cluster-type-dropdown',
                options=sorted([{'label': i, 'value': i} for i in faa_df['type_cluster'].unique()], key=lambda x: x['value']),
                value=0,
                ), md=4)
    ]),
    dbc.Row([
        dbc.Col(dcc.Graph(id='dates-scatter-chart'), md=4),
        dbc.Col(dcc.Graph(id='aircraft-scatter-chart'), md=4),
        dbc.Col(dcc.Graph(id='type-scatter-chart'), md=4)
    ]),
    dbc.Row([
        dbc.Col(dcc.Graph(id='choropleth-country-map'), md=12)  # Bootstrap rows have 12 columns
    ])

], className="mt-4")


### BAR CHART ###
@app.callback(
    Output('bar-chart', 'figure'),
    [Input('manufacturer-dropdown', 'value')]
)
def update_bar_chart(manufacturer):
    filtered_faa_df = faa_df[faa_df['Manufacturer Name'] == manufacturer]
    # Count once so that model labels and counts share the same order
    model_counts = filtered_faa_df.groupby('Model').size()
    trace = go.Bar(x=model_counts.index, y=model_counts.values)
    return {'data': [trace],
            'layout': go.Layout(title='Number of Aircraft by Model',
                                xaxis={'title': 'Model'},
                                yaxis={'title': 'Number of Aircraft'})
            }

### PIE CHART ###
@app.callback(
    Output('engine-pie', 'figure'),
    [Input('manufacturer-dropdown', 'value')]
)
def update_engine_pie(manufacturer):
    filtered_faa_df = faa_df[faa_df['Manufacturer Name'] == manufacturer]
    values = filtered_faa_df.groupby('Type Engine').size()
    labels = values.index
    trace = go.Pie(labels=labels, values=values)
    return {'data': [trace],
            'layout': go.Layout(title='Engine Type Distribution')}


### LINE CHART ###
@app.callback(
    Output('line-chart', 'figure'),
    [Input('manufacturer-dropdown', 'value')]
)
def update_line_chart(manufacturer):
    filtered_faa_df = faa_df[faa_df['Manufacturer Name'] == manufacturer]
    filtered_faa_df = filtered_faa_df.replace([np.inf, -np.inf, 'None', '0000', 0000], np.nan)
    filtered_faa_df = filtered_faa_df.dropna(subset=['MFR Year'])
    filtered_faa_df['MFR Year'] = filtered_faa_df['MFR Year'].astype(int)
    filtered_faa_df = filtered_faa_df[filtered_faa_df['MFR Year'].apply(lambda x: len(str(x)) == 4)]

    # Group once and reuse the result for both axes
    year_counts = filtered_faa_df.groupby('MFR Year').size()
    x = year_counts.index
    y = year_counts.values
    # min/max fail on empty input, so leave the range unset if no rows remain
    y_range = [y.min(), y.max()] if len(y) else None
    trace = go.Scatter(x=x, y=y, mode='lines+markers',
                       marker=dict(color='rgb(255,0,0)', size=10))
    return {'data': [trace],
            'layout': go.Layout(title='Number of Aircraft by Year Manufactured',
                                xaxis=dict(title='Year'),
                                yaxis=dict(title='Number of Aircraft',
                                           range=y_range))}

### PIE CHART ###
@app.callback(
    Output('pie-chart', 'figure'),
    [Input('manufacturer-dropdown', 'value')]
)
def update_pie_chart(manufacturer):
    filtered_faa_df = faa_df[faa_df['Manufacturer Name'] == manufacturer]

    values = filtered_faa_df.groupby('Type Registration').size()
    # Take labels from the grouped index so they line up with the counts
    labels = values.index
    trace = go.Pie(labels=labels, values=values, textinfo='value',
                   textfont={'color': 'white'}, hoverinfo='label+value+percent',
                   sort=False, name='Registration Type Distribution')
    return {'data': [trace],
            'layout': go.Layout(title='Registration Type Distribution', legend=dict(title='Type R'))}

# Add a new column 'Closest Expiration' with the closest expiration date
faa_df['Closest Expiration'] = faa_df.groupby(['Manufacturer Name', 'Name'])['Expiration Date'].transform('min')

# Create a scatter plot with the closest expiration date for each manufacturer
@app.callback(
    Output('scatter-plot', 'figure'),
    [Input('manufacturer-dropdown', 'value')]
)
def update_scatter_plot(manufacturer):
    filtered_faa_df = faa_df[faa_df['Manufacturer Name'] == manufacturer]
    trace = go.Scatter(x=filtered_faa_df['Manufacturer Name'],
                       y=filtered_faa_df['Closest Expiration'],
                       mode='markers',
                       text=filtered_faa_df['Name'])
    return {'data': [trace],
            'layout': go.Layout(title='Closest Expiration Dates by Manufacturer',
                                xaxis={'title': 'Manufacturer'},
                                yaxis={'title': 'Closest Expiration Date'})}


@app.callback(
    Output('stacked-bar-chart', 'figure'),
    [Input('manufacturer-dropdown', 'value')]
)
def update_stacked_bar(manufacturer):
    # One colour per registration type ('Non Citizen Co-Owned' previously reused the red of 'Individual')
    colors = {'Individual': 'rgb(255,0,0)', 'Partnership': 'rgb(255,255,0)', 'Corporation': 'rgb(0,0,255)', 'Co-Owned': 'rgb(0,255,0)', 'Government': 'rgb(255,0,255)', 'LLC': 'rgb(255,255,255)', 'Non Citizen Corporation': 'rgb(0,255,255)', 'Non Citizen Co-Owned': 'rgb(255,128,0)'}
    # Filter the dataframe by the selected manufacturer
    faa_df_filtered = faa_df[faa_df['Manufacturer Name'] == manufacturer]

    # Group the dataframe by Type Registration and Type Aircraft columns
    faa_df_grouped = faa_df_filtered.groupby(['Type Registration', 'Type Aircraft']).size().reset_index(name='counts')

    # Create a trace for each Type Registration value
    traces = []
    for type_r, faa_df_type_r in faa_df_grouped.groupby('Type Registration'):
        traces.append(go.Bar(
            x=faa_df_type_r['Type Aircraft'],
            y=faa_df_type_r['counts'],
            name=type_r,
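            # Fall back to grey so an unlisted registration type does not raise KeyError
            marker=dict(color=colors.get(type_r, 'rgb(128,128,128)')),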
            ))

    return {
        'data': traces,
        'layout': go.Layout(
            barmode='stack',
            xaxis=dict(title='Aircraft Type'),
            yaxis=dict(title='Count'),
            title=f'Type R vs Aircraft Type for {manufacturer}',
            margin={'t': 50, 'b': 120}
        )
    }


@app.callback(
    Output('dates-scatter-chart', 'figure'),
    [Input('cluster-date-dropdown', 'value')]
)
def update_dates_scatter(cluster):
    filtered_faa_df = faa_df[faa_df['dates_cluster'] == int(cluster)]
    traces = []
    for i in range(3):
        cluster_faa_df = filtered_faa_df[filtered_faa_df['dates_cluster'] == i]
        trace = go.Scatter(
            x=cluster_faa_df['Certificate Issue Date'],
            y=cluster_faa_df['Expiration Date'],
            mode='markers',
            marker={
                'size': 12,
                'color': 'rgb(51,204,153)',
                'symbol': 'circle',
                'line': {'width': 2}
            },
            name='dates_cluster {}'.format(i)
        )
        traces.append(trace)

    return {
        'data': traces,
        'layout': go.Layout(
            title='Aircraft Certification and Expiration Dates by Cluster',
            xaxis={'title': 'Certification Date'},
            yaxis={'title': 'Expiration Date'}
        )
    }



@app.callback(
    Output('aircraft-scatter-chart', 'figure'),
    [Input('cluster-aircraft-dropdown', 'value')]
)
def update_aircraft_scatter(cluster):
    filtered_faa_df = faa_df[faa_df['aircraft_cluster'] == int(cluster)]

    x = [i[:12] for i in filtered_faa_df['Model']]
    y = [i[:15] for i in filtered_faa_df['Manufacturer Name']]
    traces = []
    trace = go.Scatter(
        x=x,
        y=y,
        mode='markers',
        marker={
            'size': 12,
            'color': 'rgb(51,204,153)',
            'symbol': 'circle',
            'line': {'width': 2}
        },
        name='Cluster {}'.format(cluster)
    )
    traces.append(trace)

    return {
        'data': traces,
        'layout': go.Layout(
            title='Manufacturer, Model, and Type of Engine Cluster',
            margin={'l': 150, 'r': 10, 't': 50, 'b': 120},
            xaxis={'title': 'Model'},
            yaxis={'title': 'Manufacturer Name'}
        )
    }

@app.callback(
    Output('type-scatter-chart', 'figure'),
    [Input('cluster-type-dropdown', 'value')]
)
def update_type_scatter(cluster):
    filtered_faa_df = faa_df[faa_df['type_cluster'] == int(cluster)]
    traces = []
    for i in range(3):
        cluster_faa_df = filtered_faa_df[filtered_faa_df['type_cluster'] == i]
        # Build the truncated labels from this cluster's rows only
        x = [str(v)[:12] for v in cluster_faa_df['Type Registration']]
        y = [str(v)[:15] for v in cluster_faa_df['Type Aircraft']]
        trace = go.Scatter(
            x=x,
            y=y,
            mode='markers',
            marker={
                'size': 12,
                'color': 'rgb(51,204,153)',
                'symbol': 'circle',
                'line': {'width': 2}
            },
            name='type_cluster {}'.format(i)
        )
        traces.append(trace)

    return {
        'data': traces,
        'layout': go.Layout(
            title='Aircraft and Registration Type Cluster',
            xaxis={'title': 'Type R'},
            yaxis={'title': 'Aircraft Type'},
            margin={'l': 150, 'r': 10, 't': 50, 'b': 120}
        )
    }


@app.callback(
    Output('choropleth-country-map', 'figure'),
    [Input('manufacturer-dropdown', 'value')]
)
def update_choropleth(manufacturer):
    filtered_faa_df = faa_df[faa_df['Manufacturer Name'] == manufacturer].groupby('Country').size()
    trace = go.Choropleth(locations=filtered_faa_df.index,
                         locationmode='country names',
                         z=filtered_faa_df.values)
    return {'data': [trace],
            'layout': go.Layout(title='Aircraft per Country',
                                geo={'scope': 'world'})}

if __name__ == '__main__':
    app.run_server(debug=True, use_reloader=False)


Dash is running on http://127.0.0.1:8050/

 * Serving Flask app '__main__'
 * Debug mode: on


The code below builds a Random Forest classifier that predicts an aircraft's registration type from features such as the manufacturer name, model, type of aircraft, year of manufacture, type of engine, status, aircraft age, and time since the certificate was issued.

First, the code imports the required libraries: pandas and NumPy for data manipulation, LabelEncoder from sklearn for encoding categorical variables, RandomForestClassifier from sklearn for building the classifier, and several sklearn metrics for evaluating the model's performance.

The code then reads a CSV file of aircraft data and cleans it: placeholder values are replaced with NaN, rows with missing values are dropped, new columns are calculated for the age of the aircraft and the time since certificate issue, and categorical variables are encoded.
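
The two derived features and the label encoding can be sketched on plain Python values. This is an illustration under assumptions: the helper names are invented, and `reference_year` is fixed at 2024 so the result is reproducible, whereas the notebook uses the current year. The sample row uses the MFR Year and Certificate Issue Date of the first aircraft in the scraped output.

```python
from datetime import datetime


def add_age_features(row, reference_year):
    """Return a copy of row with the two derived year-difference features."""
    issue_year = datetime.strptime(row["Certificate Issue Date"], "%m/%d/%Y").year
    features = dict(row)
    features["Age_of_Aircraft"] = reference_year - int(row["MFR Year"])
    features["Time_Since_Certificate_Issue"] = reference_year - issue_year
    return features


def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


row = {"MFR Year": "1984", "Certificate Issue Date": "01/28/2015"}
features = add_age_features(row, reference_year=2024)
codes, mapping = label_encode(["Corporation", "Individual", "Corporation"])
```

Like sklearn's LabelEncoder, this sketch assigns codes in sorted order of the distinct values. Keeping the mapping around is what allows predicted codes to be translated back into registration type names.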

Next, the code splits the data into training and test sets using the train_test_split function from sklearn. RandomizedSearchCV then performs a randomized search over the hyperparameters of the Random Forest classifier; in this run only n_estimators is searched, and the other candidate hyperparameters are commented out. The search fits the model to the training data, and the best model found is used to make predictions on the test set.
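
The principle of a randomized search is to score a fixed number of randomly drawn candidates instead of every possible value. The sketch below shows that principle only: `randomized_search` and `toy_score` are hypothetical, and the made-up score stands in for the cross-validated accuracy that RandomizedSearchCV would compute.

```python
import random


def randomized_search(candidates, score, n_iter=10, seed=0):
    """Score n_iter randomly drawn candidates and return the best one."""
    rng = random.Random(seed)
    sampled = rng.sample(list(candidates), n_iter)  # draw without replacement
    scored = [(score(value), value) for value in sampled]
    best_score, best_value = max(scored)
    return best_value, best_score, sampled


def toy_score(n_estimators):
    """Made-up score that peaks at 120 trees; not a real model evaluation."""
    return 1.0 - abs(n_estimators - 120) / 200


# Same candidate range as the notebook's n_estimators search space
best_value, best_score, sampled = randomized_search(range(10, 200), toy_score)
```

With n_iter=10 only ten of the 190 candidate values are evaluated, which is why the search is much cheaper than an exhaustive grid but is not guaranteed to find the best setting.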

Finally, the code calculates accuracy, precision, recall, and F1-score to assess the model. These metrics indicate how well the model predicts the registration type of an aircraft from the input features.
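
What `average='weighted'` means can be shown by computing the scores by hand: each class's precision, recall, and F1 is weighted by the share of true samples belonging to that class. The sketch below is a plain-Python illustration with invented labels, and `weighted_scores` is a hypothetical helper, not part of sklearn.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        # A class that is never predicted gets precision 0
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n_true
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_true / total  # each class counts in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(correct.values()) / total
    return accuracy, precision, recall, f1


# Invented labels, used only to exercise the function
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
```

Support-weighted recall always equals accuracy, because the class weights cancel the per-class denominators. That is why the notebook reports the same value for accuracy and recall.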

In [4]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
import pandas as pd
from datetime import datetime
# Prepare the data
df = pd.read_csv('faainquiry.csv')
df = df[['Manufacturer Name', 'Model', 'Type Aircraft', 'MFR Year', 'Type Engine', 'Status', 'Type Registration', 'Certificate Issue Date']]
df.replace([np.inf, -np.inf, 'None', '0000', 0000], np.nan, inplace=True)
df = df.dropna()
df['Age_of_Aircraft'] = df['MFR Year'].apply(lambda x: datetime.now().year - int(x))
df['Certificate_Issue_Year'] = pd.to_datetime(df['Certificate Issue Date']).dt.year
df['Time_Since_Certificate_Issue'] = datetime.now().year - df['Certificate_Issue_Year']
# Encode categorical variables
le = LabelEncoder()
df['Manufacturer Name'] = le.fit_transform(df['Manufacturer Name'])
df['Model'] = le.fit_transform(df['Model'])
df['Type Aircraft'] = le.fit_transform(df['Type Aircraft'])
df['Type Engine'] = le.fit_transform(df['Type Engine'])
df['Status'] = le.fit_transform(df['Status'])
df['Type Registration'] = le.fit_transform(df['Type Registration'])

# Split the data
X = df.drop(columns=['Type Registration', 'Certificate_Issue_Year', 'Certificate Issue Date'])
y = df['Type Registration']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Define the parameter distribution for n_estimators
param_dist = {'n_estimators': np.arange(10, 200),
              # 'max_depth': [None, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
              # 'min_samples_split': np.linspace(0.1, 1.0, 10, endpoint=True),
              # 'min_samples_leaf': np.linspace(0.1, 0.5, 5, endpoint=True),
              # 'max_features': ['auto', 'sqrt', 'log2', None],
              # 'class_weight': [None, 'balanced', 'balanced_subsample']
              }

# Create a random forest classifier
clf = RandomForestClassifier()

# Create the randomized search object
random_search = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=10, cv=5)

# Fit the randomized search object to the data
random_search.fit(X_train, y_train)

# Make predictions on the test set
y_pred = random_search.predict(X_test)

# Evaluate the model
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average='weighted')
rec = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {acc:.2f}")
print(f"Precision: {prec:.2f}")
print(f"Recall: {rec:.2f}")
print(f"F1-Score: {f1:.2f}")



Accuracy: 0.70
Precision: 0.69
Recall: 0.70
F1-Score: 0.69


In conclusion, this project covers a range of commonly used data science techniques. It begins by scraping data from a PDF document using the PyPDF2 library and regular expressions. The scraped N-numbers are then used to collect additional information from the FAA registry with the Selenium webdriver. This data is then cleaned, preprocessed, and transformed to prepare it for analysis and modeling.

The project then moves on to data visualization using the Plotly Dash library. The dashboard allows the user to explore various aspects of the data such as aircraft manufacturer, engine type, and certification expiration dates. Additionally, the dashboard includes interactive scatter plots and maps which allow for further exploration of the data.

Finally, the project applies machine learning to a classification task. A Random Forest classifier is trained on the data and evaluated with accuracy, precision, recall, and F1-score. The model reaches an accuracy of 0.70, precision of 0.69, recall of 0.70, and an F1-score of 0.69. These scores leave clear room for improvement, but the model is a usable starting point for further work, such as tuning the hyperparameters that are currently commented out of the search space.