## Background / Motivation

The US aviation industry transports millions of Americans every year, providing near-essential travel all across the country. Over the past few years, we have seen how precarious this industry can be with the COVID-19 pandemic, which saw almost all flights suspended for an extended period. Then, this past December and January, Southwest Airlines was plagued by issues in their decades-old computer systems which caused a massive meltdown of the entire airline’s scheduling procedures, causing thousands of flight cancellations [1].

These catastrophic problems have led many to question the structural longevity of the aviation industry. To this end, we wanted to look at probably the most widespread side-effect of industry problems: flight delays. Flights can be delayed due to a variety of factors: weather, airline problems (like Southwest this winter), airport-specific problems like runway/tarmac congestion, and more. In this project, we aim to use a dataset of over 4 million domestic flights from 2015 to infer which factors of a flight are associated with flight delays.

## Problem statement 

Our project aims to analyze flight delay data and identify the predictors that have the greatest impact on flight delays. We will explore the relationship between various predictors, such as weather conditions, flight distance, airline carrier, and time of day, and how they affect flight delays. The goal is to provide insights to airlines and airports on how to reduce flight delays and improve the travel experience for passengers. We will approach this problem as an inference task, focusing on understanding the relationships between the predictors and the response variable. Specifically, we will use regression analysis to model the relationship between flight delay time and the various predictors. Ultimately, the insights gained from this project will have the potential to reduce the negative impact of flight delays on individuals, airlines, and the economy as a whole.

## Data sources

**Flight Data: [2015 Flight Delays and Cancellations](https://www.kaggle.com/datasets/usdot/flight-delays)**

Our flight data comes from the US Department of Transportation’s Bureau of Transportation Statistics, published on Kaggle. It contains information for US domestic flights in 2015. This was the most comprehensive dataset available with regard to the amount of information available for each flight, so we elected to use it even though it is somewhat older data, at least in the sense that it is from prior to recent societal changes (COVID) which have changed the aviation industry. The downside of using data from 2015 is that it prevents us from being able to conduct a true prediction study on this data; flight data from only 2015 cannot be used to predict flight delays in 2023. However, it stands to reason that factors associated with delays in 2015 are mostly still associated with delays in 2023, so we can use the data to perform an inference study of these variables.

This dataset contains three separate files:
- `airlines.csv`: Contains the IATA code for each airline in the dataset (e.g., Southwest Airlines = WN).
- `airports.csv`: Contains the IATA code and location for each airport, including latitude and longitude (e.g., Washington Dulles Airport is code IAD in Chantilly, VA).
- `flights.csv`: Contains information for each flight in the dataset, including the date, airline, flight number, origin and destination airports, scheduled and actual departure and arrival times, and the response variable, departure delay.


**Temperature Data: [Daily Temperature of Major Cities](https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities)**

As discussed in the Data Cleaning section below, we decided that weather was another important factor in flight delays, so we wanted to incorporate it. The best dataset we could find which had weather data for most airport cities on most days in 2015 was this temperature dataset which comes from the University of Dayton and was published by SRK on Kaggle.

This dataset contains daily average temperatures for major world cities from 1995 to 2020. Included in the list are around 200 US cities, which encompass the locations of most airports in the flights dataset.


## Stakeholders

**Passengers:** US airline passengers are the primary stakeholders because they are the ones most impacted by flight delays and cancellations. Late arrivals and missed connections disrupt passengers’ lives, and our model will help passengers avoid booking flights that are likely to be majorly delayed.

**Airlines:** The airlines also have interest in our project because they schedule flights and stand to lose money if they are delayed/canceled. By using our model, they can identify certain flight dates/times/locations/types to avoid scheduling in order to keep their flights on time. They can also potentially use our model to evaluate their performance relative to other airlines.

**Government agencies (DOT, FAA):** Government agencies like the FAA would be stakeholders because they monitor all US flights and can play a role in helping to mitigate delays. They can regulate certain flight attributes that tend to lead to delays, such as plane type or departure time, and they can reprimand airlines or airports whose flights are most frequently late.

## Data quality check / cleaning / preparation 

**In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels.**

**If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.**

<font color='red'>Aymen: table of distribution of values</font>

### Data Cleaning and Wrangling

To prepare our data for modeling, the following steps were followed:
1. Read in `airports` and `flights` datasets. The `flights` dataset will be the one from which we get our predictors and response variable.


2. Merge `airports` with `flights` datasets. The purpose of this is to obtain the location information for both the origin and destination of the flight. Therefore, `flights` was merged twice with `airports`, once on the origin airport, and once on the destination, to get latitude and longitude for each.


3. Remove unnecessary columns from the flights dataset. The specific reason for each removal was explained in the code, but mainly falls into two categories:

    - Column contains information irrelevant to predicting flight delays. These are columns like flight number and airplane tail number, which are arbitrary and have no bearing on the length of departure delay for the flight.
    
    - Column contains information that is obviously correlated with another column (multicollinearity) or in some cases is a direct mathematical derivation from other columns.
    
    For example, the `arrival_delay` time of a flight is obviously correlated with the `departure_delay`, since a flight that leaves late will usually arrive late. Additionally, arrival delay cannot be used as a predictor of departure delay because it occurs after departure. Another example is the `departure_time`, which is merely the `scheduled_departure` + `departure_delay`. Since the latter two variables are included in the data, `departure_time` should be excluded because it is useless information.
   
   
4. Convert time columns to minutes since midnight. Some time columns, such as `scheduled_departure`, were listed in HHMM format, e.g., a 7:59 am departure would be listed as `759` and an 8:00 am departure as `800`. Because these values jump from `759` to `800` across an hour boundary, they are not a continuous measure of time, so they were converted to the number of minutes since midnight. For example, 7:59 am would be `479` and 8:00 am would be `480`.


5. Add a `day_of_year` column that is a continuous measure of date, rather than the existing `month`/`year` columns, which are discrete. For example, February 1st would be day `32`.


6. Upon conducting some early EDA and base modeling, our team realized that an important likely predictor of flight delays was missing entirely from our dataset: weather. To remedy this problem, we brought in the previously mentioned temperature dataset, which contains daily average temperatures for most cities in our flights dataset. An initial search of each dataset found that around 70% of all flights were matched with cities in the weather dataset. However, there were some airports whose listed location in `flights` did not match the city listed in `temperature`. For example, Washington Dulles airport is listed as Chantilly, VA in `flights`, but this small locale is not in `temperature`, so `Washington DC` was renamed to `Chantilly` in `temperature` to allow a match between datasets. After replacing names, over 85% of all flights were matched to a temperature, and the two datasets were merged. The merge process was similar to the previous one in that `flights` was merged twice with `temperatures`, both for origin and destination airports.


7. There were a few null values present in some columns of data. Because of the large size of the dataset, rows containing null values were dropped.
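
Steps 2, 4, and 5 above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the project's actual cleaning code: the column names (`iata_code`, `origin_airport`, and so on) and the helper `hhmm_to_minutes` are assumptions made for the example.

```python
import pandas as pd

def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert HHMM integers (e.g. 759) to minutes since midnight (479)."""
    hhmm = hhmm.astype(int)
    return (hhmm // 100) * 60 + (hhmm % 100)

# Toy stand-ins for flights.csv and airports.csv (column names are assumed).
flights = pd.DataFrame({
    "year": [2015, 2015],
    "month": [2, 1],
    "day": [1, 1],
    "origin_airport": ["IAD", "ORD"],
    "destination_airport": ["ORD", "IAD"],
    "scheduled_departure": [759, 800],
})
airports = pd.DataFrame({
    "iata_code": ["IAD", "ORD"],
    "latitude": [38.94, 41.98],
    "longitude": [-77.46, -87.90],
})

# Step 2: merge airports twice, once per endpoint, prefixing the new columns.
for side in ("origin", "destination"):
    coords = airports.rename(columns={
        "iata_code": f"{side}_airport",
        "latitude": f"{side}_latitude",
        "longitude": f"{side}_longitude",
    })
    flights = flights.merge(coords, on=f"{side}_airport", how="left")

# Step 4: HHMM clock values become minutes since midnight.
flights["scheduled_departure"] = hhmm_to_minutes(flights["scheduled_departure"])

# Step 5: a single continuous date measure built from year/month/day.
dates = pd.to_datetime(flights[["year", "month", "day"]])
flights["day_of_year"] = dates.dt.dayofyear

print(flights[["scheduled_departure", "day_of_year", "origin_latitude"]])
```

A left merge keeps flights whose airport is missing from the lookup table, which makes unmatched rows visible as nulls before they are dropped in step 7.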

### Data Preparation

After cleaning, the resulting dataset was written out to `data/flights_clean.csv` for use in modeling and EDA.

An additional dummy variable dataset was written to `data/flights_clean_numerical.csv`. However, this file was around 3.5 GB and over 150 columns, so it was too large to use for the variable selection and shrinkage methods below.

After attempting to use the dummy variable dataset, we went back and wrote out an additional cleaned dataset, `data/flights_clean_numerical_significant.csv`. This dataset contains all numerical columns as well as dummy variables for categorical levels which were significant in the base model.
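
As a rough sketch of how such a reduced dataset could be built, the example below one-hot encodes a categorical column and keeps only a chosen subset of levels alongside the numerical columns. The toy data and the `significant_levels` list are hypothetical; in the project the retained levels came from the base model's significance results.

```python
import pandas as pd

# Toy data standing in for the cleaned flights table.
flights = pd.DataFrame({
    "airline": ["WN", "UA", "AS", "UA"],
    "distance": [300, 1200, 950, 700],
})

# Hypothetical list of dummy columns judged significant in a base model.
significant_levels = ["airline_UA", "airline_AS"]

# One dummy column per airline level.
dummies = pd.get_dummies(flights["airline"], prefix="airline", dtype=int)

# Keep every numerical column plus only the significant dummy columns.
reduced = pd.concat(
    [flights.select_dtypes("number"), dummies[significant_levels]], axis=1
)
print(reduced)
```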

## Exploratory data analysis

**Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s).**

**List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model.**

**Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.**

<font color='red'>Aymen</font>

## Approach

**What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?**

We used a linear regression model because our response, departure delay in minutes, is continuous. We tracked AIC, BIC, and especially RMSE. We prioritized RMSE because it squares each error before averaging, so large misses are penalized more heavily than small ones; this matters for managing travel schedules, where airlines and passengers are hurt most by badly underestimated delays. RMSE is also expressed in the same units as the response (minutes), so it can be read as the typical deviation between predicted and actual delay. By contrast, MAE weights all errors equally and is therefore less sensitive to occasional very large misses, and R-squared measures only the proportion of variation explained by the model and carries no units. For these reasons, RMSE was our primary metric for comparing models.
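
To make the difference between RMSE and MAE concrete, here is a small illustrative calculation on made-up error vectors (the numbers are not from our data). Both vectors have the same mean absolute error, but the one containing a single large miss has a much higher RMSE.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a vector of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a vector of prediction errors."""
    return float(np.mean(np.abs(errors)))

# Four predictions that are each off by 10 minutes.
even = np.array([10.0, 10.0, 10.0, 10.0])
# Three perfect predictions and one that is off by 40 minutes.
spiky = np.array([0.0, 0.0, 0.0, 40.0])

print(mae(even), rmse(even))    # 10.0 10.0
print(mae(spiky), rmse(spiky))  # 10.0 20.0
```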

**Is there anything unorthodox / new in your approach?**

<font color='red'>Aymen add</font>

Since we had a very large dataset (~5 million rows), we were not able to run certain variable selection methods without reducing the sample size or the number of predictors due to computational limitations. One of our workarounds was to include all numerical predictors along with the significant categorical predictors from the base model in our variable selection. Although this may not have resulted in the optimal set of predictors, it allowed us to reduce the number of predictors while still retaining important information.

**What problems did you anticipate? What problems did you encounter? Did the very first model you tried work?**

- Just from looking at the columns of the raw dataset, we knew that a lot of cleaning and data wrangling would need to be done first. Many of the columns were also clearly collinear (e.g., `arrival_time` is very strongly correlated with `scheduled_arrival`). After cleaning, we anticipated possibly not having enough relevant predictors to make strong inferences about flight delays, so we added daily temperature data, reasoning that extreme temperatures could cause more flight delays and that including them would hopefully increase the accuracy of our analysis.


- The biggest problem we encountered was the size of our dataset. We did not want to sacrifice too many of our raw observations, since doing so could lead to inaccurate inferences, so we worked around this by using only numerical data for our first round of variable selection and shrinkage methods. However, we wanted to be able to utilize at least some of the categorical predictors, so through some EDA, we decided to use significant categorical levels in an updated dataset with all the numerical predictors, as mentioned earlier.


- The baseline model that we ran first did not produce great results, so we knew there was a lot of room for improvement. We were only able to run it a few times, since running it on the whole dataset would take ~15 minutes. This baseline model yielded an R-squared of 0.050 and an RMSE of 37.4 minutes. For comparison, the standard deviation of the response variable was 37.8 minutes, and the mean was 10.1 minutes.

**Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?**

No, our problem did not already have solutions posted. We built our model from scratch.

## Developing the model

### Variable Selection
#### Best Subset Selection:
One of our main methods of improving the base model was variable selection. We began with best subset selection, which we quickly found was not computationally feasible on the full dataset, so we first ran it on only a small sample of the data:
![](images/flights_clean.png)

We also only ran our initial variable selection on numerical variables and took out all categorical variables:
![](images/train_drop.png)

Once the best subset function finished (took almost 30 minutes to run only on a sample of 10000), the results for which model to use differed based on the metric:
![](images/best_subset_metrics.png)

As shown above, based on adjusted R-squared and AIC, the optimal number of predictors is 12. We decided to optimize AIC over BIC, since AIC places a smaller penalty on models with more variables, which made sense in this case because we had a large predictor set and our baseline model of all predictors only yielded an **R-squared of 0.02**. With this best subset model of 12 predictors, the **R-squared was 0.024**.
![](images/best_subset_model.png)
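
The difference in penalty between the two criteria can be seen from their standard definitions, where $k$ is the number of estimated parameters, $n$ is the number of observations, and $\hat{L}$ is the maximized likelihood:

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
$$

With the sample of $n = 10000$ used here, $\ln n \approx 9.21$, so each additional predictor costs about 9.21 under BIC but only 2 under AIC. BIC therefore tends to favor smaller models than AIC on the same fits.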

Since we were only running this on such a small sample (representing only **0.2%** of the entire dataset), we couldn’t really use these models for reliable inference, so we then moved on to forward selection, which was well suited to this problem due to its relative computational efficiency.
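
The cost of best subset selection comes from fitting every possible combination of predictors: with $p$ predictors there are $2^p$ candidate models, so 16 numerical predictors would already mean 65,536 fits. The sketch below illustrates the idea on synthetic data with NumPy; it is not the function we ran, and the helper names `ols_aic` and `best_subset` are made up for this example.

```python
from itertools import combinations

import numpy as np

def ols_aic(X, y):
    """AIC (up to an additive constant) of an OLS fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

def best_subset(X, y):
    """Exhaustively search all predictor subsets and return the lowest-AIC one."""
    n, p = X.shape
    # Start from the intercept-only model.
    best_cols, best_aic = (), ols_aic(np.empty((n, 0)), y)
    for size in range(1, p + 1):
        for cols in combinations(range(p), size):
            aic = ols_aic(X[:, list(cols)], y)
            if aic < best_aic:
                best_cols, best_aic = cols, aic
    return best_cols, best_aic

# Synthetic data: only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)

best_cols, best_aic = best_subset(X, y)
print(best_cols)
```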

#### Forward Selection: 
Like with best subset selection, the first time we ran forward selection, we computed it on only the numerical predictors, as using the categorical predictors was not computationally possible. We were not even able to properly read the full “dummies” version of the dataset, let alone run any functions on it. Running forward selection took only **4.5 minutes**, displaying its computational advantage over other methods of variable selection. We found that using **all 16** of the numerical predictors would yield the best model, which was consistent across AIC, BIC, and adjusted R-squared, as shown below:
![](images/forward_metrics.png)

Here is the model equation:
![](images/forward_equation.png)

Even though our model R-squared was the same as the baseline model’s, the new model with 16 predictors had a small decrease in **RMSE of 0.1, to 37.3**.
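
For illustration, forward selection can be sketched as a greedy loop that adds one predictor at a time and then picks the model size with the lowest AIC. With $p$ predictors this needs only $p(p+1)/2$ fits (136 for 16 predictors) rather than $2^p$. The code below is a simplified sketch on synthetic data, with hypothetical helper names, and is not the exact function used in the project.

```python
import numpy as np

def ols_aic(X, y):
    """AIC (up to an additive constant) of an OLS fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def forward_selection(X, y):
    """Return the greedy path as a list of (selected columns, AIC) pairs."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    path = [([], ols_aic(np.empty((n, 0)), y))]  # intercept-only model
    while remaining:
        # Try adding each remaining predictor and keep the best one.
        scores = [(ols_aic(X[:, selected + [j]], y), j) for j in remaining]
        aic, j = min(scores)
        selected = selected + [j]
        remaining.remove(j)
        path.append((list(selected), aic))
    return path

# Synthetic data: only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)

path = forward_selection(X, y)
best_cols, best_aic = min(path, key=lambda step: step[1])
print(best_cols)
```

Backward selection follows the same pattern in reverse: it starts from the full model and removes the least useful predictor at each step.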

#### Backward Selection:
Next, we ran backward selection on the numerical dataset. Running the function took almost **7 minutes**, and yielded the following results: 
![](images/backward_metrics.png)
As shown above, backward selection yielded very similar results to forward selection, essentially telling us to use all the numerical predictors. This makes sense: backward selection starts with the full model and forward selection starts with an empty model, and since both searched over the same predictors and the best model used all of them, both procedures arrive at the same place.
![](images/best_fwd_reg_model.png)

Since the model was essentially the same as the first forward selection model, the **RMSE was also 37.3**.

#### Forward Selection: Updated Dataset
Due to the aforementioned issues regarding the numerical-only dataset, for the second round of forward selection we used an updated dataset that adds the significant categorical levels from the baseline model.
![](images/updated_forward_metrics.png)

As shown above, using 23 predictors would yield the best model by AIC. In this new model with 23 predictors, the **RMSE** was essentially unchanged at **37.45**.

**Conclusions:**

With this set of variable selection methods, we were not able to improve the baseline model by a substantial amount. Our biggest obstacle was working with such a large dataset while still using it properly to make reasonable inferences. Ideally, we would have liked to run best subset selection on the full dummies dataset; however, that would be impossible for any normal computer to run (the dummies dataset had over 150 columns).
Even with our workaround of including significant categorical levels from the baseline model, we weren’t able to improve the model by much, and we could only run forward selection fully on the updated dataset, as it was too big to run through backward selection, let alone best subset selection.


### Shrinkage Methods

After completing variable selection, we used shrinkage methods to attempt to improve the performance of our model by shrinking coefficients.

**Note:** See appendix for all associated graphs.

#### Ridge Regression

We first attempted ridge regression on only the numerical data in `flights`. We did this by dropping all categorical variables from the dataset. The following steps were taken to perform ridge regression:

1. Split the data into train and test using `train_test_split` to evaluate performance later.
2. Standardize the predictors `X` into `Xstd` using `StandardScaler()`.
3. Set the `alphas` space to consider when searching for the optimal lambda tuning parameter for the shrinkage penalty.
4. Fit a regression for each possible lambda value and produce a coefficient vs. lambda graph:
5. Use cross validation to find the optimal performing lambda value.\
    **lambda = 0.051**
6. Optimal lambda graphed with the coefficients:
7. Standardize the test data (`Xtest_std`) for testing performance.
8. Use the optimal model to predict the `departure_delay` for test data.
9. Calculate the RMSE for the test data predictions to evaluate (and standard deviation of test data to compare):\
    **RMSE = 37.422**\
    **STD = 37.787**

10. Calculate R-squared scores for train and test data:\
    **R^2 train = -48.058**\
    **R^2 test = -48.122**

Notably, this RMSE is virtually the same as the RMSE obtained without ridge regression. Therefore, we sought further improvements.
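
The ridge procedure listed above can be sketched with scikit-learn as follows. This is a minimal example on synthetic data rather than our notebook code: the data, the `alphas` grid, and the choice to reuse the training-set scaler on the test set are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical predictors and departure delay.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 10 + 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=5, size=500)

# Step 1: hold out test data for later evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Steps 2 and 7: standardize, fitting the scaler on the training data only.
scaler = StandardScaler().fit(X_train)
Xstd = scaler.transform(X_train)
Xtest_std = scaler.transform(X_test)

# Steps 3 to 5: search a grid of penalties with built-in cross validation.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas).fit(Xstd, y_train)

# Steps 8 to 10: predict on the test set and score the model.
pred = ridge.predict(Xtest_std)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2_train = r2_score(y_train, ridge.predict(Xstd))
r2_test = r2_score(y_test, pred)

print(ridge.alpha_, rmse, float(np.std(y_test)), r2_train, r2_test)
```

As a general point, R-squared needs to be computed on features scaled the same way as those used for fitting. Scoring a model fit on standardized predictors against unstandardized inputs can produce large negative values, and a ridge model with an intercept cannot have a negative R-squared on its own training data.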

After completing the first round of variable selection and shrinkage methods, we wanted to incorporate categorical variables into our modeling, and due to the aforementioned issues with our dummy variable dataset, we used the `flights` dataset with all numerical columns and significant categorical levels in order to do this. The same procedure was followed, with the following results:

**Optimal lambda = 0.057**

**RMSE = 37.111**\
**STD = 37.546**

**R^2 train = -41.227**\
**R^2 test = -40.564**

This round of ridge regression produced a slight improvement in model performance (RMSE) but not a large one. We decided to try lasso regression to compare its performance.

#### Lasso Regression

Unlike ridge regression, lasso will completely remove insignificant predictors (i.e., their coefficients will go to 0). Similar to ridge, we first performed lasso on only numerical data to compare performance. This process was much the same as the process for ridge:

1. Take a sample of the data (n=10000) on which to perform lasso regression. This was necessary because the lasso algorithm would not successfully run on all the data.
2. Split the data into train and test using `train_test_split` to evaluate performance later.
3. Standardize the predictors `X` into `Xstd` using `StandardScaler()`.
4. Set the `alphas` space to consider when searching for the optimal lambda tuning parameter for the shrinkage penalty.
5. Fit a regression for each possible lambda value and produce a coefficient vs. lambda graph:
6. Use cross validation to find the optimal performing lambda value.\
    **lambda = 0.581**
7. Optimal lambda graphed with the coefficients:
8. Standardize the test data (`Xtest_std`) for testing performance.
9. Use the optimal model to predict the `departure_delay` for test data.
10. Calculate the RMSE for the test data predictions to evaluate (and standard deviation of test data to compare):\
    **RMSE = 41.092**\
    **STD = 41.482**

R-squared was not calculated because it was unused in evaluating the previous ridge models.
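
A comparable sketch for the lasso steps is shown below, again on synthetic data with hypothetical column names (`x0` to `x5`). It highlights the two things that differ from ridge: fitting on a random sample, and reading off which coefficients were shrunk exactly to zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical flights data.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(5000, 6)), columns=[f"x{i}" for i in range(6)]
)
data["departure_delay"] = (
    10 + 4 * data["x0"] - 3 * data["x1"] + rng.normal(scale=5, size=5000)
)

# Step 1: work on a random sample so the fit stays tractable.
sample = data.sample(n=1000, random_state=0)
X = sample.drop(columns="departure_delay")
y = sample["departure_delay"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
scaler = StandardScaler().fit(X_train)
Xstd = scaler.transform(X_train)
Xtest_std = scaler.transform(X_test)

# Cross-validated search over the penalty grid.
alphas = np.logspace(-3, 1, 40)
lasso = LassoCV(alphas=alphas, cv=5).fit(Xstd, y_train)

# Predictors with a nonzero coefficient are kept; the rest are dropped.
kept = [c for c, b in zip(X.columns, lasso.coef_) if b != 0]
dropped = [c for c, b in zip(X.columns, lasso.coef_) if b == 0]
rmse = float(np.sqrt(np.mean((y_test - lasso.predict(Xtest_std)) ** 2)))

print(lasso.alpha_, kept, dropped, rmse)
```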

This first lasso attempt was poor, with a much worse RMSE than either ridge regression, although the sample standard deviation was also higher. We again tried lasso with the same modified dataset of numerical + select categorical data, under the same process. However, this time, we also added the following variable transformations according to our EDA to try to further improve the model:

Transformations:

- `log(distance)`
- `log(scheduled_time)`
- `log(taxi_in)`
- `log(taxi_out)`

Binned variables (added as dummy variables for each bin):

- `binned origin_latitude`
- `binned origin_longitude`
- `binned destination_latitude`
- `binned destination_longitude`
- `binned day_of_year`

Following are the results of this model:

**Optimal lambda = 0.068**

**RMSE = 34.526**\
**STD = 35.094**

By RMSE, this was our best-performing model.
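
The log transformations and binned dummy variables described above could be produced along the following lines. This is a sketch on toy data; the use of `pd.cut` with equal-width bins and the number of bins are assumptions, although interval labels such as `day_of_year_(0.636, 46.5]` in the model equation are consistent with this kind of binning.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a few columns of the cleaned flights table.
flights = pd.DataFrame({
    "distance": [100, 500, 2500, 1000],
    "taxi_out": [10, 25, 15, 40],
    "day_of_year": [1, 120, 250, 365],
})

# Log transformations (inputs must be strictly positive).
for col in ("distance", "taxi_out"):
    flights[f"log_{col}"] = np.log(flights[col])

# Equal-width bins over the observed range, one dummy column per bin.
bins = pd.cut(flights["day_of_year"], bins=8)
bin_dummies = pd.get_dummies(bins, prefix="day_of_year", dtype=int)
flights = pd.concat([flights, bin_dummies], axis=1)

print(flights.columns.tolist())
```

Binning lets a linear model pick up effects that are not monotonic in the original variable, such as delays that peak in summer and drop again in autumn.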

**Put the final model equation.**

`departure_delay =
-0.280062 * day -0.410927 * day_of_week -1.408435 * day_of_year -0.476174 * destination_temperature
+0.838358 * origin_latitude +0.592646 * origin_longitude -0.479043 * origin_temperature
+1.295893 * scheduled_arrival +3.007562 * scheduled_departure +1.016395 * taxi_in +3.509715 * taxi_out
-0.832837 * airline_AS +0.872424 * airline_NK +1.509344 * airline_UA +0.426376 * destination_airport_BTV
-0.081166 * destination_airport_DTW -0.240464 * destination_airport_FNT +0.660775 * origin_airport_CMH
-0.588643 * origin_airport_IAD -0.384632 * origin_airport_LNK -0.294619 * origin_airport_RIC
-0.443162 * state_destination_MI +0.000075 * state_destination_VT +0.353779 * state_origin_NE
+0.361776 * log_distance -0.622060 * log_taxi_in -2.651841 * log_taxi_out
+0.316994 * destination_latitude_(32.192, 43.066] -0.806106 * destination_latitude_(43.066, 53.94]
-0.121632 * destination_longitude_(-158.01, -136.019] -0.267983 * destination_longitude_(-114.116, -92.212]
+0.239962 * destination_longitude_(-92.212, -70.309] +0.605814 * origin_latitude_(21.275, 32.192]
-0.192634 * origin_latitude_(43.066, 53.94] -0.395085 * origin_latitude_(53.94, 64.814]
+0.055742 * origin_longitude_(-158.01, -136.019] +0.338428 * origin_longitude_(-136.019, -114.116]
-0.807627 * day_of_year_(0.636, 46.5] -0.805485 * day_of_year_(46.5, 92.0]
+1.538830 * day_of_year_(137.5, 183.0] +1.396959 * day_of_year_(183.0, 228.5]
-0.013491 * day_of_year_(228.5, 274.0] -0.678390 * day_of_year_(274.0, 319.5]`

**Did you succeed in achieving your goal, or did you fail? Why?**

Ultimately, we cannot say that we achieved our goal. While we improved our model through the methods discussed above, the RMSE remained in the neighborhood of 35 to 40 minutes. Since the mean departure delay is 10 minutes, this RMSE means the model cannot reliably detect the lateness of a flight, and it inhibits our ability to infer insights from the data. We have still done so in the conclusions section to follow, but bear in mind that these insights are at the mercy of our poorly-performing model.

## Limitations of the model with regard to inference / prediction

As previously discussed, our initial goal as a team was to develop a predictive model for flight delay times in 2023, but this became impossible when we settled on our 2015 dataset. Instead, our project is inference-based; we used our model equation and data analysis to identify factors that contribute to flight delays. In doing this, we reasoned that these factors remain fairly consistent over the years. However, this assumption is a limitation of our model; it may not be fair to assume that the factors leading to delays are consistent.

During COVID-19, flights were delayed or canceled for unique reasons, such as safety precautions, lack of workers, and lack of demand. This past winter, flights were again canceled for novel reasons. Southwest flights were canceled due to an out-of-the-blue computer system problem. Neither of these events would have been hinted at in the inferences we drew from our model. Therefore, our ability to use 2015 data to actually infer delay patterns in 2023 is severely limited; 2015 should be used more as a starting point upon which to launch further research into delays in 2023.

## Conclusions and Recommendations to stakeholder(s)

**What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.**

**How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.**

**If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable?**

**Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?**

<font color='red'>Aymen</font>

## GitHub and individual contribution {-}

https://github.com/bencaterine/ding-ding-ding

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Ben Caterine</td>
    <td>Data cleaning, ridge regression, lasso regression</td>
    <td>Cleaned and wrangled data to prepare for modeling. Shrinkage methods (ridge and lasso) to improve model performance</td>
    <td>28</td>
  </tr>
  <tr>
    <td>Haneef Usmani</td>
    <td>Best subset selection, Forward selection, backward selection</td>
    <td>Variable selection to improve model performance</td>
    <td>15</td>
  </tr>
    <tr>
    <td>Aymen Lamsahel</td>
    <td>[Aymen contributions here]</td>
    <td>[Aymen details here]</td>
    <td>8</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

GitHub is not very useful for Jupyter notebook files. The differences between your version and the current version are frequently not shown due to the changes being too large, and git registers changes to files that you didn’t actually change but merely opened/ran a cell. Also, GitHub is not very easy to learn on-the-go for members who hadn’t used it before. We were lucky to have some members with GitHub experience; otherwise this would’ve been very difficult.

GitHub felt very overwhelming at first but has become more manageable. However, I feel that using other collaboration tools would have been less time-consuming, especially for the scale of this project. If we had more than five people in a team, then I feel the advantages of using GitHub would be more clear; but because a lot of our work was independent, using GitHub felt like an extra task in the process. Because of this, I would often wait until I had made all changes to a file before pushing the changes, especially in the beginning, since I was unfamiliar with pushing/pulling.

## References {-}

[1] Karen Brooks Harper. Southwest Airlines’ holiday meltdown brings on federal investigation, Dec. 27, 2022. The Texas Tribune.

## Appendix {-}

### Ridge Regression
![](images/ridge_coeffs.png)
![](images/ridge_error.png)
![](images/ridge_both.png)

### Updated Ridge Regression
![](images/ridge_updated_coeffs.png)
![](images/ridge_updated_error.png)
![](images/ridge_updated_both.png)

### Lasso Regression
![](images/lasso_coeffs.png)
![](images/lasso_error.png)
![](images/lasso_both.png)

### Updated Lasso Regression
![](images/lasso_updated_coeffs.png)
![](images/lasso_updated_error.png)
![](images/lasso_updated_both.png)