# Flight Delay Modeling: The Puffins

## Abstract
Flight delays disrupt scheduling for airlines and airports, inconvenience passengers, and cause large economic losses. Predicting delays ahead of time can alleviate some of these problems. Inspired by puffins, birds that are known to be [very punctual](https://www.rspb.org.uk/about-the-rspb/about-us/media-centre/press-releases/rspb-ni-rathlin-punctual-puffins/) in their migratory patterns, our team plans to predict flight departure delays using airport and weather data. Our primary customers for this project are passengers, whom we will be able to proactively notify about flight delays 2 hours before the scheduled departure of the flight. We will predict which of the following four categories a flight falls into. We chose these bins because they allow customers to plan their trip to the airport accordingly.
 - Will depart on-time (less than 15 minutes after the scheduled departure) 
 - Will depart 15 to 44 minutes late
 - Will depart 45 to 89 minutes late
 - Will depart 90 or more minutes late

We plan on building models using Logistic Regression and XGBoost to start, though we will be evaluating and assessing other models as we progress through EDA and preliminary pipeline construction. We will report the success of our models using F1-Score, Precision and Recall.

\
<img src="https://drive.google.com/uc?id=1VNYZJf7ypNNjC24USv9lIZgItLNC3h_z" alt="Google Drive Image" width=50%/>

## Team Members

Name: Rathin Bector \
Email: rathin.bector@berkeley.edu \
<img src="https://drive.google.com/uc?id=13yvUjEUS6aybbRax_wOIffeIIaD4jMNt" alt="Google Drive Image" width=20%/>

Name: Victor Ramirez \
Email: victor.ramirez@berkeley.edu \
<img src="https://drive.google.com/uc?id=1NbNsQiYvLQNZdXi0HPEkd3biCRV_eXZY" alt="Google Drive Image" width=20%/>

Name: Francisco Meyo \
Email: francisco@berkeley.edu \
<img src="https://drive.google.com/uc?id=1F7qT1sVXMjM_lh1Qr6MnvFF1ZuwTAtaT" alt="Google Drive Image" width=20%/>

Name: Landon Morin \
Email: morinlandon@berkeley.edu \
<img src="https://drive.google.com/uc?id=1KLrkBsnwrWNhakBRSRc_Jr2UpEmHLfFN" alt="Google Drive Image" width=20%/>

## Phase Leader Plan

| Phase | Description                                                                                                      | Project Manager |
|-------|------------------------------------------------------------------------------------------------------------------|-----------------|
| I     | Project Plan, describe datasets, joins, tasks, and metrics                                                       | Rathin Bector   |
| II    | EDA, baseline pipeline on all data, Scalability, Efficiency, Distributed/parallel Training, and Scoring Pipeline | Victor Ramirez  |
| III   | Feature engineering and hyperparameter tuning, In-class Presentation                                             | Francisco Meyo  |
| IV    | Select the optimal algorithm, fine-tune and submit a final report (research style)                               | Landon Morin    |

## Credit Assignment Plan
| Task Name                                                                                         | Phase   | Assignee                                                    | Due Date   | Status |
|---------------------------------------------------------------------------------------------------|---------|-------------------------------------------------------------|------------|--------|
| Create Phase 1 Notebook                                                                           | Phase 1 | Rathin Bector                                               | 10/24/2022 | DONE   |
| Make Phase Leader Plan                                                                            | Phase 1 | Rathin Bector                                               | 10/26/2022 | DONE   |
| Add Pictures and Emails to Notebook                                                               | Phase 1 | Rathin Bector                                               | 10/26/2022 | DONE   |
| Project Plan Abstract                                                                             | Phase 1 | Rathin Bector                                               | 10/27/2022 | DONE   |
| Credit Assignment Plan                                                                            | Phase 1 | Landon Morin, Rathin Bector, Francisco Meyo, Victor Ramirez | 10/28/2022 | DONE   |
| Description of Data                                                                               | Phase 1 | Victor Ramirez, Francisco Meyo, Landon Morin                | 10/29/2022 | DONE   |
| Basic EDA writeup                                                                                 | Phase 1 | Landon Morin, Francisco Meyo, Victor Ramirez                | 10/29/2022 | DONE   |
| Data Joins Plan                                                                                   | Phase 1 | Rathin Bector, Landon Morin                                 | 10/29/2022 | DONE   |
| ML Algorithms and Metrics                                                                         | Phase 1 | Landon Morin, Rathin Bector                                 | 10/30/2022 | DONE   |
| Machine Learning Pipelines                                                                        | Phase 1 | Victor Ramirez                                              | 10/30/2022 | DONE   |
| Conclusions and Next Steps                                                                        | Phase 1 | Francisco Meyo                                              | 10/30/2022 | DONE   |
| Submit Notebook and PDF                                                                           | Phase 1 | Rathin Bector                                               | 10/30/2022 | DONE   |
| Create Post of Discussion Page                                                                    | Phase 1 | Rathin Bector                                               | 10/30/2022 | DONE   |
| EDA on weather table                                                                              | Phase 2 | Francisco Meyo                                              | 11/01/2022 | DOING  |
| EDA on airline table                                                                              | Phase 2 | Landon Morin                                                | 11/01/2022 | DOING  |
| EDA on stations table                                                                             | Phase 2 | Victor Ramirez                                              | 11/01/2022 | DOING  |
| Conduct Data Joins                                                                                | Phase 2 | Rathin Bector                                               | 11/01/2022 | DOING  |
| Train, Validation, Test Split                                                                     | Phase 2 | Landon Morin                                                | 11/02/2022 | TO DO  |
| Feature Cleanup and Transformations                                                               | Phase 2 | Rathin Bector, Landon Morin, Francisco Meyo, Victor Ramirez | 11/05/2022 | TO DO  |
| PCA for Dimensionality Reduction                                                                  | Phase 2 | Rathin Bector                                               | 11/07/2022 | TO DO  |
| Logistic Regression Baseline Model                                                                | Phase 2 | Francisco Meyo                                              | 11/08/2022 | TO DO  |
| Cross-Validation Scoring Pipeline                                                                 | Phase 2 | Landon Morin                                                | 11/10/2022 | TO DO  |
| 2021 Scoring Pipeline                                                                             | Phase 2 | Victor Ramirez                                              | 11/10/2022 | TO DO  |
| Run Additional Experiments                                                                        | Phase 2 | Rathin Bector                                               | 11/13/2022 | TO DO  |
| Update Phase Project Notebook                                                                     | Phase 2 | Victor Ramirez                                              | 11/13/2022 | TO DO  |
| Create 2min Video Update                                                                          | Phase 2 | Victor Ramirez                                              | 11/13/2022 | TO DO  |
| Submit Notebook and PDF                                                                           | Phase 2 | Victor Ramirez                                              | 11/13/2022 | TO DO  |
| Create Post of Discussion Page                                                                    | Phase 2 | Victor Ramirez                                              | 11/13/2022 | TO DO  |
| Research SMOTE                                                                                    | Phase 4 | Landon Morin                                                | 11/15/2022 | DONE   |
| Feature Engineering                                                                               | Phase 3 | Francisco Meyo, Landon Morin, Victor Ramirez                | 11/16/2022 | TO DO  |
| Create a baseline model                                                                           | Phase 3 | Rathin Bector                                               | 11/17/2022 | TO DO  |
| Conduct test on using new features and report metrics                                             | Phase 3 | Rathin Bector                                               | 11/17/2022 | TO DO  |
| Update leaderboard and write a gap analysis of your best pipeline against the Project Leaderboard | Phase 3 | Landon Morin                                                | 11/18/2022 | TO DO  |
| Fine-tune baseline pipeline using a grid search                                                   | Phase 3 | Victor Ramirez                                              | 11/18/2022 | TO DO  |
| Video update                                                                                      | Phase 3 | Francisco Meyo, Landon Morin, Rathin Bector, Victor Ramirez | 11/20/2022 | TO DO  |
| Slides for presentation                                                                           | Phase 3 | Francisco Meyo, Landon Morin, Rathin Bector, Victor Ramirez | 11/20/2022 | TO DO  |
| Consider other models and build pipelines for these models                                        | Phase 4 | Victor Ramirez, Francisco Meyo, Rathin Bector, Landon Morin | 11/22/2022 | TO DO  |
| Hyperparameter tuning for all models using cross-validation                                       | Phase 4 | Rathin Bector                                               | 11/26/2022 | TO DO  |
| Final feature engineering and refinement                                                          | Phase 4 | Landon Morin                                                | 11/26/2022 | TO DO  |
| Consider and formalize written analysis of exciting, novel directions that we pursued             | Phase 4 | Victor Ramirez                                              | 11/26/2022 | TO DO  |
| Clean up code                                                                                     | Phase 4 | Francisco Meyo                                              | 12/04/2022 | TO DO  |
| Gap analysis of best pipeline against project leaderboard                                         | Phase 4 | Landon Morin                                                | 12/04/2022 | TO DO  |
| Final Writeup                                                                                     | Phase 4 | Landon Morin, Victor Ramirez, Francisco Meyo, Rathin Bector | 12/04/2022 | TO DO  |
| Submission and discussion board post                                                              | Phase 4 | Landon Morin                                                | 12/04/2022 | TO DO  |

## Project Plan

We are using ClickUp for project planning. The ClickUp folder for this project can be found [here](https://sharing.clickup.com/42080451/g/h/184663-160/d7cc34e69aa3512). Below, you can find the Gantt charts we have for each phase of this project.

### Phase 1
<img src="https://drive.google.com/uc?id=1igtHkkr-dC7tVNMZ3uk_Q2kvqATs4KDB" alt="Google Drive Image" width=90%/>

### Phase 2
<img src="https://drive.google.com/uc?id=1iMnTRgQCdYP30K9udHHF7KhbZ6DyDMrg" alt="Google Drive Image" width=90%/>

### Phase 3
<img src="https://drive.google.com/uc?id=1lcFOfhD9nK0X_jTPk6HvpnJH-PHYkNhP" alt="Google Drive Image" width=90%/>

### Phase 4
<img src="https://drive.google.com/uc?id=1_0ZIc9Jt-9N01KCk27O3q1cq0srAAwtL" alt="Google Drive Image" width=90%/>

## Dataset Summary


**Airlines:**    

The Airlines dataset contains on-time performance data from the TranStats data collection available from the U.S. Department of Transportation (DOT). It provides many features about each flight, including date, carrier, delay, cancellation, and diversion information. The data consist of numerical (integers and doubles) and categorical (strings) value types. The dataset has a high share of null / missing values: many of its columns describe diversions and delays, and because only a small fraction of flights are delayed or diverted, those columns are mostly empty.  

We will use the following features to build our model:  
1.	FL_DATE: Flight date (yyyymmdd)  
2.	DEP_DELAY_GROUP: Difference between scheduled and actual departure time, binned into 15-minute intervals. Early departures fall into negative groups.  
3.	OP_CARRIER: Flight Carrier  
4.	ORIGIN: Flight departure origin  
5.	TAIL_NUM: Flight aircraft tail number  
6.	DISTANCE: Total flight distance  

**Weather:**   

The weather dataset contains information from 2015 to 2021. The dataset contains summaries from major airport weather stations that include a daily account of temperature extremes, degree days, precipitation amounts and wind.  



**Station:** 

The stations dataset contains information from the US naval stations in the US and territories. The data consist of numerical (doubles) and categorical (string) value types. The dataset is complete with no null / missing values.   
We will use the following features to build our model:  
1.	station_id: A character string that is a unique identifier for the weather station.  
2.	neighbor_call: The ICAO identifier for the station.  
3.	neighbor_lat: Latitude (degrees) rounded to three decimal places of the neighbor location.  
4.	neighbor_lon: Longitude (degrees) rounded to three decimal places of the neighbor location.  
5.	distance_to_neighbor: Total distance to the next station. 

**Summary description of dataset:**  

* Airlines Dataset
  + Size: 2,806,942 Rows (3 month subset)  / 74,177,433 Rows (full dataset)  
  + Source: https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ    
* Weather Dataset  
  + Size: 30,528,602 Rows (3 month subset) /  898,983,399 Rows (full dataset) 
  + Source: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00679    
* Stations Dataset
  + Size: 5,004,169 Rows (full dataset)      
  + Source: W261 Course provided  



## TODO:

Correlation analysis

## Exploratory Data Analysis

## Stations EDA

In reviewing the stations data, we found numerical latitude and longitude columns for every weather station. We will use an additional dataset to join the stations dataset with the airlines and weather datasets. 

Using the latitude and longitude columns, we generated a table of all station coordinates. With the Databricks Map (Marker) visualization tool, we plotted the station coordinates on a satellite layer.

**Weather Stations Map Plot**

<img src="https://drive.google.com/uc?id=14Nl8mtCTJJOpNMGXZfMdpm7iCnEMdMlU" alt="Google Drive Image" width=90%/>

**Stations State Distribution**

<img src="https://drive.google.com/uc?id=1TBOvm0CmpLpwPpVZOH3uQEp4KAjVel8k" alt="Google Drive Image" width=90%/>


## Airlines EDA
The airlines dataset is characterized by high null values in about half of the columns. For the columns that remain after dropping columns with high nulls, we are left with columns with 2% null values or fewer. In the following table, we observe the percent of null values in the remaining columns. Note that the remainder of our kept features contain no null values.

Remaining Columns With Null Values | Remaining Columns With No Nulls
-|-
<img src="https://drive.google.com/uc?id=1KoNxWoZB5Ptl6WtwLog-3F8jB3sQFkXg" alt="Google Drive Image" width=90%/> | <img src="https://drive.google.com/uc?id=1nD0OoIV4bJU9ijWHao7xo386EY1oHqwZ" alt="Google Drive Image" width=90%/>
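
The column filtering described above can be sketched as follows. This is a minimal illustration in plain Python rather than our Spark code; the 50% threshold, the helper names, and the sample rows (including the `DIVERT_DETAIL` column) are all assumptions made up for the example.

```python
def null_fractions(rows):
    """Return the fraction of missing (None) values for each column."""
    n_rows = len(rows)
    return {col: sum(row[col] is None for row in rows) / n_rows for col in rows[0]}


def drop_sparse_columns(rows, max_null_fraction=0.5):
    """Keep only columns whose null fraction is at or below the threshold.

    The 0.5 default is an assumed cutoff, not a value taken from our analysis.
    """
    fractions = null_fractions(rows)
    keep = [col for col, frac in fractions.items() if frac <= max_null_fraction]
    return [{col: row[col] for col in keep} for row in rows]


# Made-up rows: DIVERT_DETAIL is a hypothetical, mostly empty diversion column.
rows = [
    {"ORIGIN": "SFO", "TAIL_NUM": "N101", "DISTANCE": 337.0, "DIVERT_DETAIL": None},
    {"ORIGIN": "LAX", "TAIL_NUM": None, "DISTANCE": 2475.0, "DIVERT_DETAIL": None},
    {"ORIGIN": "JFK", "TAIL_NUM": "N303", "DISTANCE": 760.0, "DIVERT_DETAIL": None},
    {"ORIGIN": "ORD", "TAIL_NUM": "N404", "DISTANCE": 1846.0, "DIVERT_DETAIL": None},
]

fractions = null_fractions(rows)
cleaned = drop_sparse_columns(rows)
print(fractions)
print(sorted(cleaned[0]))
```

Columns that survive the filter can still contain a small share of nulls (here `TAIL_NUM`), which is the situation the table above describes.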

Below, we show the descriptive statistics of each numerical feature in these columns. From this brief analysis we can begin to see skew in features such as flight delay, year (geopolitical and health factors in 2020), flight arrival, flight distance, flight airtime, and diverted flights. Further analysis in Phase 2 will give us the information needed to address imbalances in both numerical and categorical data. Furthermore, because the features are on different numerical scales, we will need to normalize the training and test sets so that gradient descent converges reliably.

<img src="https://drive.google.com/uc?id=1YWXDgy3Rntkgn-IxXexL9r7KnROASFOx" alt="Google Drive Image" width=30%/>
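
As a sketch of the normalization step, the example below applies min-max scaling in plain Python. It assumes we fit the scaling bounds on the training rows only and reuse them for the test rows, which avoids leaking test information into training. The function names and the numbers are made up for illustration.

```python
def fit_min_max(train_rows):
    """Compute per-column (min, max) bounds from the training rows only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def apply_min_max(rows, bounds):
    """Scale each column with previously fitted bounds.

    Values outside the training range map outside [0, 1]; constant columns map to 0.
    """
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (low, high) in zip(row, bounds):
            span = high - low
            scaled_row.append(0.0 if span == 0 else (value - low) / span)
        scaled.append(scaled_row)
    return scaled


# Made-up feature rows: [distance in miles, some delay in minutes]
train = [[100.0, 5.0], [2500.0, 45.0], [1300.0, 25.0]]
test = [[700.0, 15.0], [3100.0, 60.0]]

bounds = fit_min_max(train)            # fitted on training data only
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)
print(train_scaled)
print(test_scaled)
```

Note that the second test row scales to values above 1 because it lies outside the training range; this is expected when the bounds come from the training set alone.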


As previously mentioned, our labels will be taken from DEP_DELAY_GROUP, which has a heavy right skew: most flights fall in the lowest delay groups. This indicates that we will likely need a resampling technique such as SMOTE to adjust for class imbalances.

<img src="https://drive.google.com/uc?id=17xi9M-oImTFDFMdSuIud9wzlhNFzqMUs" alt="Google Drive Image" width=90%/>
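
To make the idea behind SMOTE concrete, here is a simplified sketch of its core step: creating synthetic minority-class points by interpolating between a minority point and one of its nearest minority neighbours. This is not the imbalanced-learn implementation; the function name, parameters, and sample points are assumptions for illustration, and it expects at least two minority points.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Made-up minority-class feature vectors (e.g. heavily delayed flights).
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points))
```

Any resampling of this kind should be applied to the training data only, so that validation and test sets keep the real class distribution.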

From preliminary analyses, we find potential patterns between our selected features and labels that may provide predictive power despite class imbalances. 

Relationship of Sampled Airlines With Grouped Departure Delays | Relationship Of Total Distance With Total Flight Delay
-|-
<img src="https://drive.google.com/uc?id=1cGyGUFGyo7CbeDNQJYUmXVd6pPNYMfGd" alt="Google Drive Image" width=90%/> | <img src="https://drive.google.com/uc?id=1EHG5MpWRbQOcSA5ihu4sugSaaS0hGXVj" alt="Google Drive Image" width=90%/>

## Weather EDA

Selected features of the weather dataset contain several string fields that will require further analysis for categorization. The numeric fields, on the other hand, are on several different scales, which will make normalization necessary:  


Summary Statistics P.1 | Summary Statistics P.2
-|-
<img src="https://drive.google.com/uc?id=1BbKOPtB9d_zLrPx2k7mx3u2S0o_AUHsl" alt="Google Drive Image" width=90%/> | <img src="https://drive.google.com/uc?id=1yhMlL92fRLnYr_Ue9TVD4K5cZke_a5Hv" alt="Google Drive Image" width=90%/>

## Feature Transformations 

1. Do you need any dimensionality reduction? (e.g., LASSO regularization, forward/backward selection, PCA, etc.)
2. Specify the feature transformations for the pipeline and justify these features given the target (i.e., hashing trick, tf-idf, stopword removal, lemmatization, tokenization, etc.)
3. Other feature engineering efforts (i.e., interaction terms, Breiman's method, etc.)

## Data Joins

In order to use features derived from the weather dataset to predict flight delays, we must join the weather data to the airlines data. This join can be performed in the following steps:
1. The [Master Coordinate table](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FLL), available on the Bureau of Transportation Statistics website, contains the latitude and longitude for each airport in the world. We can find the closest weather station in the stations table to each airport using the Haversine distance formula. A guide on how to do this can be found [here](https://medium.com/analytics-vidhya/finding-nearest-pair-of-latitude-and-longitude-match-using-python-ce50d62af546) 
2. Left join the airlines table with the table derived from step 1, on `ORIGIN_AIRPORT_SEQ_ID` and `AIRPORT_SEQ_ID` respectively. This will give us the closest weather station to the origin airport for each flight. 
3. Left join the table from step 2 with the weather table, on `station_id` and `STATION`, as well as on `FL_DATE` and `DATE`, respectively. This gives us the weather at the origin airport on the day of each flight.
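
Step 1 of the join can be sketched as below: a minimal plain-Python version of the Haversine distance and a brute-force nearest-station lookup. The dictionary layout, the `lat` / `lon` keys, and all coordinates are made up for illustration; the real tables will need their own column names, and at full scale this pairwise search would have to be distributed rather than run in a loop.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(airport, stations):
    """Return the station record closest to the airport (brute-force search)."""
    return min(
        stations,
        key=lambda s: haversine_km(airport["lat"], airport["lon"], s["lat"], s["lon"]),
    )


# Made-up airport and station records for illustration only.
airports = [{"AIRPORT_SEQ_ID": 1, "lat": 37.62, "lon": -122.38}]
stations = [
    {"station_id": "A", "lat": 37.60, "lon": -122.40},
    {"station_id": "B", "lat": 37.72, "lon": -122.22},
    {"station_id": "C", "lat": 34.00, "lon": -118.40},
]

# Lookup table: airport id -> id of its closest weather station.
airport_to_station = {
    a["AIRPORT_SEQ_ID"]: nearest_station(a, stations)["station_id"] for a in airports
}
print(airport_to_station)
```

The resulting lookup table is what steps 2 and 3 join against: first to attach a station to each flight's origin airport, then to attach that station's weather for the flight date.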

## TODO:

1. Join tables (full join of all the data) and generate the dataset that will be used for training and evaluation
2. Joins take 2-3 hours with 10 nodes;

    a. Join stations data with flights data
    
    b. Join weather data with flights + Stations data
    
    c. Store on cold blob storage on overwrite mode [Use your Azure free credit  (of $100) for storage only]   
    
3. EDA on joined dataset that will be used for training and evaluation

## Data Split
Train / Test


## Machine Learning Algorithms and Metrics

**TODO:**

1. Review the following material regarding developing machine learning pipelines in Spark:
   
   a. https://pages.databricks.com/rs/094-YMS-629/images/02-Delta%20Lake%20Workshop%20-%20Including%20ML.html
   
   b. https://spark.apache.org/docs/latest/ml-tuning.html
   
2. Create machine learning baseline pipelines and do experiments on ALL the data (the entire dataset, not just  three/six/12 months)

   a. Use 2021 data as a blind test set that is never consulted during training.
   
   b. Report  evaluation metrics in terms of cross-fold validation over the training set (2015-2020)
   
   c. Report  evaluation metrics in terms of the 2021 dataset
   
   d. Create a baseline model using logistic/linear regression, ensemble models
   
   e. Discuss experimental results on cross-fold validation, and on the held-out test dataset (2021).
   
   f. Hint: cross-validation in Time Series is very different from regular cross-validation. Please review the following to get more background on CV for time-based data 
   
   + Cross-validation in Time Series data (very different from regular cross-validation)
      * The method that can be used for cross-validating the time-series model is cross-validation on a rolling basis. Start with a small subset of data for training purposes, forecast for the later data points, and then check the accuracy of the forecasted data points. The same forecasted data points are included as part of the next training dataset and subsequent data points are forecasted.
      
      * For more information see:
        + https://hub.packtpub.com/cross-validation-strategies-for-time-series-forecasting-tutorial/
        
        + https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4#:~:text=Cross%20Validation%20
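
The rolling cross-validation scheme described in the hint can be sketched as follows. This is a minimal plain-Python illustration that assumes the rows are already sorted by time; the function name and parameters are made up for the example, and each validation block is folded into the training window of the next fold.

```python
def rolling_origin_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, validation_indices) with an expanding training window.

    Rows are assumed to be ordered by time, so every validation block
    lies strictly after the rows used to train on it.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        val_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


splits = list(rolling_origin_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Unlike regular k-fold cross-validation, no fold ever trains on rows that come after its validation block, which mirrors how the model would be used on future flights.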


We will operationalize the problem of predicting flight delays for consumers as multiclass classification. To accomplish this, we will derive our label from DEP_DELAY_GROUP, which bins departure delays into 15-minute intervals. Since DEP_DELAY_GROUP is skewed toward shorter departure delays, we will treat any delay of less than 15 minutes as on time, then group longer delays into 30-minute classification intervals until 90 minutes, with all delays of 90 minutes or more merged into a single 90+ bucket. 
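
A minimal sketch of this label construction is shown below. It assumes that a non-negative DEP_DELAY_GROUP code `g` covers delays starting at `15 * g` minutes and that negative codes are early departures; this coding should be checked against the TranStats field description before use. The function name is hypothetical, and the default bucket edges follow the Model One bucket table in this section.

```python
import bisect


def label_from_delay_group(group, edges=(15, 45, 75, 90)):
    """Map a DEP_DELAY_GROUP code to a 1-based class label.

    Assumption: group g >= 0 starts at 15 * g minutes of delay, and
    negative groups are early departures. Edges must be multiples of 15
    so that every group falls entirely inside one bucket.
    """
    if group is None:  # e.g. cancelled flights with no departure delay recorded
        return None
    lower_bound_minutes = 15 * group
    return bisect.bisect_right(edges, lower_bound_minutes) + 1


groups = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 12]
labels = [label_from_delay_group(g) for g in groups]
print(labels)

# The four categories from the abstract correspond to edges (15, 45, 90).
abstract_labels = [label_from_delay_group(g, edges=(15, 45, 90)) for g in groups]
print(abstract_labels)
```

Keeping the bucket edges as a parameter makes it cheap to compare alternative binnings during EDA without touching the rest of the pipeline.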

Our metric for success will be the F1 score, since consumers will care both that our delay predictions are correct (precision) and that we miss as few actual delays as possible (recall). The F1 score is the harmonic mean of precision and recall, and therefore considers both metrics in its calculation. 

$$\text{Equation one: } F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

To start, we will use a baseline algorithm that always predicts that a flight will not be delayed. Since most flights are not delayed, we expect this baseline to be reasonably accurate, but its recall, and therefore its F1 score, will be zero for every delayed class. Against this baseline, we will compare two machine learning algorithms to start. 
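
The gap between accuracy and F1 for this baseline can be illustrated with a small worked example. The sketch below computes per-class F1 in plain Python on made-up labels; averaging the per-class scores (macro F1) and scoring undefined precision or recall as zero are assumptions, since we have not yet fixed an averaging scheme.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class; undefined precision or recall is scored as 0."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up labels: class 1 is on time, classes 2 and 3 are delay buckets.
y_true = [1] * 8 + [2, 3]
y_pred = [1] * len(y_true)  # baseline: always predict "on time"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("F1 per class:", [round(per_class_f1(y_true, y_pred, c), 3) for c in (1, 2, 3)])
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

In this toy case the baseline reaches 80% accuracy while its macro F1 stays below 0.3, because both delayed classes score zero.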

Model one will be logistic regression with a momentum optimizer, used to predict the probability that a flight falls into delay bucket k, where the buckets are built from the 15-minute intervals of DEP_DELAY_GROUP up to 90 minutes, plus one open-ended bucket for delays beyond 90 minutes. We can represent this problem as the following:

$$\text{Model One - Logistic Regression: } \hat{y} = \arg\max_k f_k(x)$$ Where f_k(x) is the predicted probability of bucket k, and k represents the following:

k | Bucket
- | -
1 | delay < 15 min
2 | 15 min <= delay < 45 min
3 | 45 min <= delay < 75 min
4 | 75 min <= delay < 90 min
5 | delay >= 90 min



We will use a multiclass (categorical) cross-entropy loss function, which for a single observation o can be represented by the following function:

$$\text{Model One - Cross-Entropy Loss: } -\sum_{c=1}^{M} y_{o,c}\log(p_{o,c})$$ Where M represents the number of classes, y_{o,c} is a binary indicator that equals 1 if class c is the correct label for observation o (and 0 otherwise), and p_{o,c} is the predicted probability that observation o belongs to class c. 
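
To make the prediction rule and the loss concrete, the sketch below works through one observation in plain Python. It assumes the class probabilities come from a softmax over per-class scores, which is the usual choice for multiclass logistic regression; the scores themselves are made up.

```python
import math


def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(probs):
    """Return the 1-based class k with the highest predicted probability."""
    return max(range(len(probs)), key=probs.__getitem__) + 1


def cross_entropy(true_class, probs):
    """Loss for one observation: -sum_c y_c * log(p_c) with a one-hot y."""
    one_hot = [1 if c == true_class else 0 for c in range(1, len(probs) + 1)]
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)


scores = [2.0, 1.0, 0.1, -1.0, -2.0]  # made-up scores for the five buckets
probs = softmax(scores)
print("predicted class:", predict_class(probs))
print("loss if truly class 1:", round(cross_entropy(1, probs), 3))
print("loss if truly class 5:", round(cross_entropy(5, probs), 3))
```

Because y is one-hot, the sum reduces to the negative log-probability assigned to the true class, so confident wrong predictions are penalized heavily.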

Model two will be XGBoost, trained with the same multiclass cross-entropy loss function. XGBoost is known to scale well, and it often provides more accurate results than a simple random forest. It builds trees sequentially: each boosting iteration fits a new tree that corrects the errors left by the trees before it. Boosted trees often perform better than random forests on imbalanced data, which will be important for a large and imbalanced dataset that consists mostly of on-time flights. The XGBoost loss function can be represented by the following function:
$$\text{Model Two - Cross-Entropy Loss: } -\sum_{c=1}^{M} y_{o,c}\log(p_{o,c})$$


## Machine Learning Pipelines Overview

We will follow the industry-standard, end-to-end machine learning pipeline. A machine learning pipeline is a well-defined set of steps to develop, train, test, and optimize a machine learning algorithm. 

Each stage of the pipeline performs one specific step of the overall workflow. The workflow is broken up into modular stages that are independent and can be optimized separately. 

The pipeline begins with the ingestion of raw data. Once the data is cleaned and sanitized, it flows on to the next stage. 

The following stage handles feature engineering: selecting, manipulating, and transforming the data into features that can be used in a machine learning model. 

The next stage covers model development, training, testing, and tuning. This stage uses the data split into train and test sets. Once the model is trained and tested, it is validated against standard testing metrics. 

In the final stage the machine learning model is used on new data.

<img src="https://drive.google.com/uc?id=1tho8U5UBa20AMBZidTTtJ6hiMkmoRENC" alt="Google Drive Image" width=90%/>


## Machine Learning Models

<img src="https://drive.google.com/uc?id=1xcM2gvYZa17nslSxD0i2Qh038DlfoGLj" alt="Google Drive Image" width=90%/>


   
  
  

In [0]:
# ML Pipeline

# importing packages
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression  # Model One is logistic regression (a classifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


In [0]:
# import data

In [0]:
# train /test data split

In [0]:
# create pipelines for Logistic regression, XGBoost
# pipeline will include
# 1. Data import / ingest
# 2. Data preprocessing 
# 3. Data joining
# 4. Train model - loop both pipelines
# 5. Test model
# 6. Deploy model 
# 


## Model One: Logistic Regression Pipeline Details 

**TODO:**

1. Data Ingestion  
   i.  Gather data from various sources     
   ii. Combine and join all data sources     
2. Data Preparation  
   i.  Feature engineering     
   ii. Split train / test data  
3. Model Development  
   i.   Train model with training data     
   ii.  Model tuning / optimization     
   iii. Parameter / feature selection     
   iv.  Model selection validation     
4. Model Deployment  
   i.   New data predictions     
5. Monitor Model  
   i.   Model feedback  

## Model Two: XGBoost Pipeline Details 

**TODO:**

1. Data Ingestion  
   i.  Gather data from various sources     
   ii. Combine and join all data sources     
2. Data Preparation  
   i.  Feature engineering     
   ii. Split train / test data  
3. Model Development  
   i.   Train model with training data     
   ii.  Model tuning / optimization     
   iii. Parameter / feature selection     
   iv.  Model selection validation     
4. Model Deployment  
   i.   New data predictions     
5. Monitor Model  
   i.   Model feedback  
   

## Conclusion and Next Steps

The work conducted during Phase I allowed us to: 

1. Obtain clean datasets (e.g., by performing simple transformations or disregarding features with a high share of missing values)  
2. Select relevant features of each dataset after performing basic EDA  
3. Join datasets to enhance the features available for our model  
4. Define preliminary models and performance metrics  


During Phase II we will be conducting the following next steps:  
1. Perform a detailed EDA that will allow us to define any normalization needed as well as parameter fine-tuning   
2. Determine and calculate our baseline model (most likely Logistic regression)  
3. Perform experiments   
4. Adjust our plan as needed