# Analyzing Swiss Large Cap Companies
In this notebook we investigate data on the 100 largest publicly traded companies in Switzerland. The data comes from https://www.tradingview.com/markets/stocks-switzerland/market-movers-large-cap/. It was extracted on September 20, 2024 and preprocessed in a separate notebook, `large_caps_CH_PREP.ipynb`. The cleaned version of the dataset is available as comma-separated values in the file `large_caps_CH_20_val-09-20.csv`. We will discuss techniques to prepare data for analysis, and look at the preprocessing notebook, in the last block; for now we just use the cleaned version of the data set.

## Preparations
Before we can start, we need to import a number of Python libraries that we will be using in this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# Avoid unnecessary warnings:
pd.options.mode.chained_assignment = None 

## Data Loading
Using the function `read_csv` from the `pandas` library, we can load the content of a file with comma-separated values into Python. It will be stored in the format of a **DataFrame**, which offers some nice functionality. For now, we limit ourselves to shares with positive earnings per share (EPS) as well as a positive dividend yield:

In [None]:
largeCaps = pd.read_csv( ... )
largeCaps = largeCaps.loc[(largeCaps['DivYield_Prct']>0) & (largeCaps['EPS_CHF']>0)]

In a Jupyter notebook, data frames can be displayed nicely by just typing in their names.

In [None]:
# ...

Note that if you are working directly on a console, you have to use a `print` statement, such as `print(largeCaps)`, but the output is not as nice:

In [None]:
print(largeCaps)

## An Overview over the Data
To get an overview of the data, all `pandas` DataFrames have two generic methods: 

* `info()` displays information about the size of the data frame (number of rows and columns), as well as the names, types and available non-empty values for each attribute (or column)
* `describe()` gives an overview (in the form of a number of summary statistics) of the numeric columns. Columns that are not numeric are dropped for this.

In [None]:
# ...

In [None]:
# ...

Additionally, the method `value_counts()` can be called for each column of a DataFrame. It counts the number of times each value occurs, and lists the values in decreasing order of occurrence. This makes sense in particular for non-numeric attributes, e.g., the `Sector` attribute of the large cap companies:

In [None]:
# ...
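To illustrate the three overview methods without touching the real data, here is a minimal sketch on a hypothetical toy data frame. The name `toy_caps` and its values are made up for illustration; only the column names mimic our data set.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy_caps = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sector": ["Finance", "Finance", "Health technology", "Finance"],
    "SharePrice_CHF": [10.0, 250.0, 80.0, 40.0],
})

# Prints the size, the column names, the dtypes and the non-null counts
toy_caps.info()

# Summary statistics; the non-numeric columns Name and Sector are left out
summary = toy_caps.describe()
print(summary)

# Number of rows per sector, most frequent sector first
sector_counts = toy_caps["Sector"].value_counts()
print(sector_counts)
```

Note that `describe()` returns a DataFrame itself, so individual statistics can be accessed with `summary.loc["mean", "SharePrice_CHF"]`.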

## Visualisations
The above analysis gives a summary over the numeric and categorical values. Next, we will show some visualisations, aiming at a better understanding of the data and potential dependencies between various attributes of the shares.

### Visualising a Single Dimension
A standard way to plot one simple dimension is the so-called *boxplot*, which can easily be plotted using the `boxplot` function from `matplotlib.pyplot`:

In [None]:
plt.boxplot(largeCaps['SharePrice_CHF'])
plt.show()

We already see that one share has a much higher price than all the others. As the plot is automatically scaled such that all shares fit, we see very little of the other shares. Therefore, we limit ourselves to shares with a price below 20'000 CHF, and plot them again.

What we do here is so-called *logical indexing*, i.e., we select all the rows for which the indicated condition holds (in our example: `largeCaps['SharePrice_CHF']` is below `20_000`), and we store them in a new variable, called `largeCaps_b20k`. Note that in Python, we can use the underscore `_` to make numbers more readable; it has no effect on the value.

In [None]:
largeCaps_b20k = largeCaps.loc[largeCaps['SharePrice_CHF']<20_000, ]
plt.boxplot(largeCaps_b20k['SharePrice_CHF'])
plt.show()
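To see what logical indexing does step by step, the following minimal sketch applies it to a hypothetical three-row data frame (`toy_prices` and its values are invented for illustration). The condition first yields a column of `True`/`False` values, and `.loc` then keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy_prices = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC"],
    "SharePrice_CHF": [120.0, 95_000.0, 4_300.0],
})

# Step 1: the comparison produces one boolean per row
mask = toy_prices["SharePrice_CHF"] < 20_000
print(mask.tolist())

# Step 2: .loc keeps exactly the rows where the mask is True
toy_b20k = toy_prices.loc[mask]
print(toy_b20k)

# The underscore is purely visual: both literals denote the same number
print(20_000 == 20000)
```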

Again, we have one share which has a much higher price than all the others. We limit ourselves further and look only at companies with a share price below 5000 CHF.

In [None]:
# largeCaps_b5k = ...

Now the boxplot gives a better overview:

In [None]:
plt.boxplot(largeCaps_b5k['SharePrice_CHF'])
plt.show()

A further common plot type is the so-called histogram. For this, the values (usually on the x-axis) are placed into bins, and then the number of items (or rows; companies, in our case) per bin is counted:

In [None]:
plt.hist(data=largeCaps_b5k, x="SharePrice_CHF")
plt.grid()
plt.show()

Looking under the hood, we see that the function returns a specific data structure: a tuple whose first element contains the number of items per bin and whose second element contains the bin limits (a third element holds the drawn bar objects):

In [None]:
plt.hist(data=largeCaps_b5k, x="SharePrice_CHF")
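The counts and bin limits can also be obtained without drawing anything. The sketch below uses `numpy.histogram`, which `plt.hist` relies on for the binning; the array `prices` is a hypothetical example, not taken from our data.

```python
import numpy as np

# Hypothetical prices; invented for illustration only.
prices = np.array([5.0, 12.0, 18.0, 33.0, 47.0, 51.0, 95.0, 100.0])

# Four equally wide bins between the smallest and the largest value
counts, bin_edges = np.histogram(prices, bins=4)

print(counts)      # number of items per bin
print(bin_edges)   # bin limits: always one more entry than there are bins
```

With four bins we get five bin limits, and the counts add up to the number of values.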

We can also increase the number of bins (or explicitly specify the bin limits) - see the documentation for details:

In [None]:
plt.hist(data=largeCaps_b5k, x="SharePrice_CHF", bins=30)
plt.grid()
plt.show()

A further possibility (mainly helpful if there are many points, or if you want to compare different groups) are so-called *density plots*. In `seaborn`, we can get them using the function `kdeplot` (where `kde` stands for *kernel density estimation*, which you can roughly think of as a smoothed version of a histogram). A limitation of this type of plot is that it might get too smooth.

`seaborn` builds on top of `matplotlib` and allows us to provide a data frame and then indicate which columns should be used for which axis (and for other plot attributes):

In [None]:
sns.kdeplot(data=largeCaps_b5k, x="SharePrice_CHF")
plt.grid()
plt.show()

This image already shows a potential issue with the smoothed density plots: A negative share price does not make sense, but we see that the smoothing leads to a positive density for negative share prices.
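Why this happens becomes clear when we compute a kernel density estimate by hand. The sketch below places a Gaussian bump on every data point and averages them. It is a simplified illustration, not the exact computation `seaborn` performs: the helper `gaussian_kde_at`, the sample values and the fixed bandwidth of 50 are assumptions chosen for this example, whereas `seaborn` derives its bandwidth from the data by default.

```python
import numpy as np

def gaussian_kde_at(x, samples, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the points x."""
    x = np.atleast_1d(x)[:, None]                 # shape (n_points, 1)
    z = (x - samples[None, :]) / bandwidth        # scaled distance to each sample
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)                   # average over all samples

# Hypothetical, strictly positive share prices (invented for illustration)
samples = np.array([5.0, 10.0, 20.0, 40.0, 400.0])

# Every Gaussian bump extends to both sides of its data point,
# so the estimate is positive even at a negative price.
density_negative = gaussian_kde_at(-10.0, samples, bandwidth=50.0)[0]
print(density_negative)
```

Since the bumps around the small prices reach below zero, some probability mass inevitably ends up on negative values, although none of the data points is negative.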

### Visualising Two Dimensions
To add some more details, we can visualize more than one attribute in the same plot. For example, we can plot a box plot of the share price *per industry sector*. To do so, we will use `boxplot` from the library `seaborn`:

In [None]:
sns.boxplot(data=largeCaps_b5k, x="SharePrice_CHF", y="Sector")
plt.grid()
plt.show()

Another very common plot type to visualize two dimensions is the so-called *scatter plot*, which we will use often in this class. A coordinate system is built up from the two attributes, and each data point (in our case: each share) is plotted at the respective position:

In [None]:
plt.scatter(x=largeCaps_b5k['EPS_CHF'], y=largeCaps_b5k['SharePrice_CHF']);
plt.xlabel('Earnings per Share [CHF]')
plt.ylabel('Share Price [CHF]')
plt.grid()
plt.show()

### Visualising more than Two Dimensions
Visualising more than two dimensions is tricky. An extension of the scatterplot into 3D is of course possible, but often hard to actually read.

An alternative is the use of color (or the plot symbol) to encode additional information. For example, we can enrich the above scatterplot of *earnings per share* vs *share price* with the sector encoded as color. Again, we use the `seaborn` library, which offers a simple interface to do so:

In [None]:
sns.scatterplot(data=largeCaps_b5k, x='EPS_CHF', y='SharePrice_CHF', hue='Sector')
plt.xlabel('Earnings per Share [CHF]')
plt.ylabel('Share Price [CHF]')
plt.legend()
plt.grid()
plt.show()

This looks colorful, but is hard to read due to the many sectors. To illustrate this plot with a more helpful example, we limit ourselves to the three sectors with the most companies (see above). We again store these companies in a new dataset:

In [None]:
largeCaps_b5k_largeSectors = largeCaps_b5k.loc[largeCaps_b5k['Sector'].isin([ 'Finance', 'Producer manufacturing', 'Health technology' ])]
largeCaps_b5k_largeSectors

Now we do the same plot again:

In [None]:
sns.scatterplot(data=largeCaps_b5k_largeSectors, x='EPS_CHF', y='SharePrice_CHF', hue='Sector')
plt.xlabel('Earnings per Share [CHF]')
plt.ylabel('Share Price [CHF]')
plt.legend()
plt.grid()
plt.show()

Another way to show several dimensions is to create pairwise scatter plots for every combination of two dimensions, with the distribution of each single dimension on the diagonal. This is done by the `pairplot` function from `seaborn`:

In [None]:
sns.pairplot(largeCaps_b5k_largeSectors[['MarketCap_BCHF', 'SharePrice_CHF', 'Volume_Shares', 'EPS_CHF', 'DivYield_Prct', 'Sector']], hue='Sector')

With too many dimensions, however, this plot also becomes very hard to read - and takes time to render. You can try it out below if you don't mind waiting a few moments:

In [None]:
sns.pairplot(largeCaps_b5k)

## Visualising Correlation
Correlation is a prime statistical measure of the linear dependency between two variables. For any two variables, it lies in the range -1 to 1. The heatmap below shows the pairwise correlations of all numeric attributes:
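To make the measure concrete, the following sketch computes the (Pearson) correlation coefficient by hand and compares it with `numpy`'s built-in `corrcoef`. The helper `pearson_r` and the two arrays are hypothetical and only serve as an illustration.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()   # centre both variables
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c**2).sum() * (b_c**2).sum())

# Hypothetical values; invented for illustration only.
eps = np.array([1.0, 2.0, 4.0, 8.0, 10.0])
price = np.array([20.0, 35.0, 90.0, 150.0, 210.0])

r_manual = pearson_r(eps, price)
r_numpy = np.corrcoef(eps, price)[0, 1]
print(r_manual, r_numpy)

# A perfectly linear relationship gives a correlation of 1 (or -1 if decreasing)
r_perfect = pearson_r(eps, 3.0 * eps + 1.0)
print(r_perfect)
```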

In [None]:
plt.figure(figsize=(10, 8))
heatmap = sns.heatmap(largeCaps_b5k.select_dtypes(include=np.number).corr(), vmin=-1, vmax=1, annot=True, cmap="coolwarm_r")
heatmap.set_title('Correlation Heatmap');

If we are interested in the share price `SharePrice_CHF`, the correlation matrix shows us that the earnings per share `EPS_CHF`, the Dividend Yield `DivYield_Prct`, and the Volume of the Shares traded on a given day `Volume_Shares` show the largest absolute correlation. Therefore, let's investigate this in more detail. To do so, we will make some scatter plots:

In [None]:
plt.scatter(x=largeCaps_b5k['EPS_CHF'], y=largeCaps_b5k['SharePrice_CHF']);
plt.xlabel('Earnings per Share [CHF]')
plt.ylabel('Share Price [CHF]')
plt.grid()
plt.show()

**EXERCISE**: Continue similarly to find out about other dependencies. If you need inspiration, look at the correlation matrix and identify some attributes that might have an influence on the share price.

## Formalizing and Evaluating a Linear Dependency Hypothesis
The scatterplot above (and the correlation heatmap) might give rise to the hypothesis that the higher the earnings per share, the higher the share price is. We now formalize the assumption as a linear model, using the `scikit-learn` class `LinearRegression`.

First, we define which columns of our dataframe we want to use as `X` and `y` values:

In [None]:
X = largeCaps_b5k[['EPS_CHF']]
y = largeCaps_b5k[['SharePrice_CHF']]

Next, we get a linear model (from `sklearn.linear_model`) and adapt it to the selected data. This process is called **fitting** or **training**:

In [None]:
linreg_sp_vs_eps = LinearRegression()
linreg_sp_vs_eps.fit(X, y)

After calling `fit(...)`, the model is now adapted to our data. We can access the parameters (coefficient and intercept) that have been learned:

In [None]:
print('Coefficient: ')
print(linreg_sp_vs_eps.coef_)
print('Intercept:')
print(linreg_sp_vs_eps.intercept_)
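For a single predictor, the two learned parameters can also be computed by hand with the least-squares formulas. The sketch below does this on hypothetical toy values (not our data) and checks that `LinearRegression` arrives at the same numbers. Note that the toy target is a plain 1D array, so `coef_` is 1D here; with a one-column DataFrame as target, as in the cell above, `coef_` is a 2D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical values; invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.0, 19.0, 33.0, 41.0, 50.0])

# Least-squares solution for one predictor:
# slope = co-variation of x and y divided by the variation of x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)

# scikit-learn expects the predictors as a 2D array (rows x columns)
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.coef_, model.intercept_)
```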

The `seaborn` library offers the function `lmplot` to plot the data together with the linear model, including the confidence region:

In [None]:
sns.lmplot(data=largeCaps_b5k,
           x='EPS_CHF', y='SharePrice_CHF', height=4.8, aspect=4/3)
plt.xlabel('Earnings per Share [CHF]')
plt.ylabel('Share Price [CHF]')
plt.title('Earnings per Share vs. Share Price for Swiss Large Caps\nwith Linear Trend')
plt.grid()

`scikit-learn` also implements a series of quality metrics as ready-made functions:

In [None]:
y_pred = linreg_sp_vs_eps.predict(X)
print('r2-Score: ' + str(r2_score(y, y_pred)))
print('MSE: ' + str(mean_squared_error(y, y_pred)))
print('RMSE: ' + str(root_mean_squared_error(y, y_pred)))
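To see what these three metrics measure, we can compute them by hand. The following sketch uses hypothetical true and predicted values (invented for illustration) and the standard definitions of the metrics.

```python
import numpy as np

# Hypothetical true and predicted values; invented for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 260.0])

residuals = y_true - y_hat

# MSE: the average squared prediction error
mse = np.mean(residuals**2)

# RMSE: the square root of the MSE, in the same unit as the target
rmse = np.sqrt(mse)

# R2: 1 minus the share of the target's variation that is left unexplained
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y_true - y_true.mean())**2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

The RMSE is often easier to interpret than the MSE, because it is expressed in the unit of the target (CHF in our case), while the R2-score does not depend on the unit at all.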

As we will be applying and evaluating several models, we pack this into a function. Besides the printing, the function will also create a new data frame containing the same quality metrics as columns:

In [None]:
def apply_eval_model(model, data, y_true, model_name=''):
    """Print R2, MSE and RMSE of `model` on `data`; return them as a one-row DataFrame if `model_name` is given."""
    y_pred = model.predict(data)
    # compute each metric once and reuse it for printing and for the result frame
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = root_mean_squared_error(y_true, y_pred)
    print('r2-Score: ' + str(r2))
    print('MSE: ' + str(mse))
    print('RMSE: ' + str(rmse))
    if len(model_name)>0:
        # the RMSE column is named 'RMS'; the plotting cells below refer to it by that name
        df = pd.DataFrame({'model_name': model_name,
                           'r2_score': r2,
                           'MSE': mse,
                           'RMS': rmse},
                          index=[model_name])
        return df

Let us try this function:

In [None]:
model_perf_sp_vs_eps = apply_eval_model(linreg_sp_vs_eps, X, y, 'EPS')

In [None]:
model_perf_sp_vs_eps

## Evaluation on New Data
In order to find out how the shares developed, we have collected the same data one week later, i.e., on September 27. It is stored in the same format. As above, we limit ourselves to shares with positive earnings per share (EPS) as well as a positive dividend yield. We will mark all variables related to the later data set by `_val` to indicate that they are based on the later data, which we use to validate our models.

In [None]:
# largeCaps_val = ...
# largeCaps_val_b5k = ...

Now, using the `apply_eval_model` function we have defined above, we can easily evaluate our model on the new data:

In [None]:
model_perf_val_sp_vs_eps = apply_eval_model(linreg_sp_vs_eps, largeCaps_val_b5k[['EPS_CHF']],
                                          largeCaps_val_b5k[['SharePrice_CHF']], 'EPS')

**EXERCISE**: Define and evaluate a few other models. Use the above example as guidance.

We recommend you store the data frames containing the model performance in a somewhat descriptive variable name, as we did for `model_perf_val_sp_vs_eps`. 

### Comparing the Performance of Several Models
We combine the individual performance evaluations into a joint dataframe which we will use afterwards to visualize the performance of the different models on both training and test data. 

**EXERCISE**: You have to adapt and/or add the names of the individual results, depending on which models you have trained and under which variables you have stored the results.

In [None]:
# model_perf_all = pd.concat([model_perf_sp_vs_eps, ..., ...])
# model_perf_all['date'] = '2024-09-20'

# model_perf_val_all = pd.concat([model_perf_val_sp_vs_eps, ..., ...])
# model_perf_val_all['date'] = '2024-09-27'
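The mechanics of this step can be sketched on two hypothetical one-row result frames. The numbers below are placeholders invented for illustration, not actual model results.

```python
import pandas as pd

# Hypothetical one-row result frames with placeholder numbers
perf_a = pd.DataFrame({"model_name": "EPS", "r2_score": 0.60}, index=["EPS"])
perf_b = pd.DataFrame({"model_name": "DY", "r2_score": 0.10}, index=["DY"])

# pd.concat stacks the frames on top of each other (one row per model)
perf_train = pd.concat([perf_a, perf_b])

# Assigning a single value fills the new column for all rows
perf_train["date"] = "2024-09-20"
print(perf_train)
```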

Now we can get the results of all models considered so far:

In [None]:
model_perf_both_sets = pd.concat([model_perf_all, model_perf_val_all])
model_perf_both_sets

We will do a plot to illustrate the R2-Score and the root mean squared error of the considered models:

In [None]:
sns.barplot(data=model_perf_both_sets, x='model_name', y='r2_score', hue = 'date')
plt.title('Comparison of Simple Linear Regression Models for Share Price\nPerformance on 2 different dates')
plt.xlabel('Predictor / Independent Variable')
plt.ylabel('$R^2$-Score')
plt.show()

In [None]:
sns.barplot(data=model_perf_both_sets, x='model_name', y='RMS', hue = 'date')
plt.title('Comparison of Simple Linear Regression Models for Share Price\nPerformance on 2 different dates')
plt.xlabel('Predictor / Independent Variable')
plt.ylabel('Root Mean Squared Error (RMS)')
plt.show()

## Using Several Predictors: Multiple Linear Regression
We might well assume that the share price of a company depends not only on one attribute, but on several. We can easily expand the linear regression model to a multiple linear model which uses several attributes to predict the share price. All we need to do is to select several columns - the `LinearRegression` model will then determine the right number of parameters for us:

In [None]:
linreg_sp_vs_3 = LinearRegression()
linreg_sp_vs_3.fit(largeCaps_b5k[['EPS_CHF', 'Volume_Shares', 'DivYield_Prct']], largeCaps_b5k['SharePrice_CHF'])

model_perf_sp_vs_3 = apply_eval_model(linreg_sp_vs_3, 
                                      largeCaps_b5k[['EPS_CHF', 'Volume_Shares', 'DivYield_Prct']],
                                      largeCaps_b5k['SharePrice_CHF'], 
                                      'EPS, Vol, DY')

model_perf_val_sp_vs_3 = apply_eval_model(linreg_sp_vs_3, 
                                          largeCaps_val_b5k[['EPS_CHF', 'Volume_Shares', 'DivYield_Prct']],
                                          largeCaps_val_b5k['SharePrice_CHF'], 
                                          'EPS, Vol, DY')
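That the model adapts the number of parameters to the number of selected columns can be checked on a small constructed example. In the sketch below, the hypothetical target is built exactly as `2 * x1 - 3 * x2 + 5`, so we know which parameters the model should recover; the names `toy`, `x1` and `x2` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical predictors; invented for illustration only.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 5.0, 3.0, 8.0],
})
# Constructed target without noise, so the true parameters are known
toy_target = 2.0 * toy["x1"] - 3.0 * toy["x2"] + 5.0

multi_model = LinearRegression().fit(toy[["x1", "x2"]], toy_target)

# One coefficient per selected column, plus a single intercept
print(multi_model.coef_)
print(multi_model.intercept_)
```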

Below we add the new model to the model performance dataframe and render the same plots again:

In [None]:
model_perf_sp_vs_3['date'] = '2024-09-20'
model_perf_val_sp_vs_3['date'] = '2024-09-27'

model_perf_both_sets = pd.concat([model_perf_both_sets, model_perf_sp_vs_3, model_perf_val_sp_vs_3])

In [None]:
sns.barplot(data=model_perf_both_sets, x='model_name', y='r2_score', hue = 'date')
plt.title('Comparison of Simple Linear Regression Models for Share Price\nPerformance on 2 different dates')
plt.xlabel('Predictor / Independent Variable')
plt.ylabel('$R^2$-Score')
plt.show()

In [None]:
sns.barplot(data=model_perf_both_sets, x='model_name', y='RMS', hue = 'date')
plt.title('Comparison of Simple Linear Regression Models for Share Price\nPerformance on 2 different dates')
plt.xlabel('Predictor / Independent Variable')
plt.ylabel('Root Mean Squared Error (RMS)')
plt.show()

## Statistical Model Selection
To conclude, we want to evaluate how well the different models are suited to explain the data. We therefore train the models again, but this time using the library `statsmodels`, which we have already seen in the notebook on polynomial regression. We handle the considered models in the same order as above.

In [None]:
sm_lin_sp_vs_eps = smf.ols('SharePrice_CHF ~ EPS_CHF', data=largeCaps_b5k)
sm_lin_sp_vs_eps = sm_lin_sp_vs_eps.fit()
sm_lin_sp_vs_eps.summary()

**Comment**: The Omnibus statistic shows that the residuals are very unlikely to follow a normal distribution. It is therefore not a surprise that the model works worse on a new data set. Also, the log-likelihood is very low (compared to other models below).

In [None]:
sm_lin_sp_vs_dy = smf.ols('SharePrice_CHF ~ DivYield_Prct', data=largeCaps_b5k)
sm_lin_sp_vs_dy = sm_lin_sp_vs_dy.fit()
sm_lin_sp_vs_dy.summary()

In [None]:
sm_lin_sp_vs_vs = smf.ols('SharePrice_CHF ~ Volume_Shares', data=largeCaps_b5k)
sm_lin_sp_vs_vs = sm_lin_sp_vs_vs.fit()
sm_lin_sp_vs_vs.summary()

As a comparison, we also investigate the multiple linear model to predict the share price (in linear scale) based on the earnings per share, the volume, and the dividend yield:

In [None]:
sm_lin_sp_vs_t3 = smf.ols('SharePrice_CHF ~ EPS_CHF + Volume_Shares + DivYield_Prct', data=largeCaps_b5k)
sm_lin_sp_vs_t3_fit = sm_lin_sp_vs_t3.fit()
sm_lin_sp_vs_t3_fit.summary()

**Comment:** This model seems to be a very poor fit to the data (see the Omnibus statistic); however, it reaches a surprisingly high R-squared value. Also, on the linear scale, the volume of the traded shares does not seem to contribute significantly to explaining the share price. Leaving out `Volume_Shares`, we indeed get very similar results for all metrics, parameter estimates and residual statistics:

In [None]:
sm_lin_sp_vs_t2 = smf.ols('SharePrice_CHF ~ EPS_CHF + DivYield_Prct', data=largeCaps_b5k)
sm_lin_sp_vs_t2_fit = sm_lin_sp_vs_t2.fit()
sm_lin_sp_vs_t2_fit.summary()

## Log-Transformation and Share-Price Prediction
In this section we will apply the log-transformation to the features (EPS, Volume, and Dividend Yield) and the target variable, and then define a linear regression to predict the log-transformed share price. As a first step, we calculate the logarithms:

In [None]:
# Apply Log-transformation to training data
largeCaps_b5k['log_SharePrice_CHF'] = np.log(largeCaps_b5k['SharePrice_CHF'])
largeCaps_b5k['log_EPS_CHF'] = np.log(largeCaps_b5k['EPS_CHF'])
largeCaps_b5k['log_Volume_Shares'] = np.log(largeCaps_b5k['Volume_Shares'])
largeCaps_b5k['log_DivYield_Prct'] = np.log(largeCaps_b5k['DivYield_Prct'])
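The sketch below illustrates two properties of the log-transformation on hypothetical toy values (invented for illustration): equal ratios on the original scale become equal distances on the log scale, and `np.exp` undoes the transformation. Note that `np.log` requires strictly positive inputs, which the filter on positive EPS and positive dividend yield at loading time ensures for these two columns.

```python
import numpy as np
import pandas as pd

# Hypothetical values; each price is ten times the previous one.
toy = pd.DataFrame({
    "SharePrice_CHF": [10.0, 100.0, 1000.0],
    "EPS_CHF": [0.5, 5.0, 50.0],
})

toy["log_SharePrice_CHF"] = np.log(toy["SharePrice_CHF"])
toy["log_EPS_CHF"] = np.log(toy["EPS_CHF"])

# A factor of 10 always corresponds to the same step on the log scale
log_steps = np.diff(toy["log_SharePrice_CHF"])
print(log_steps)

# np.exp is the inverse: it maps log prices back to prices in CHF
recovered = np.exp(toy["log_SharePrice_CHF"])
print(recovered)
```

The back-transformation matters when a model predicts the log share price: its predictions have to be passed through `np.exp` before they can be read as prices in CHF.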

**EXERCISE**: Do the log-transformation on the three predictors and the target variable (share price) also for the validation data. Then define and train regression models on the log-transformed features. Evaluate their performance on the later data.

In [None]:
# Apply Log-transformation to validation data
# largeCaps_val_b5k['log_SharePrice_CHF'] ...
# ...

In [None]:
## A linear Model to Predict the log Share Price from log EPS
# Model definition:
# linreg_lsp_vs_leps = ...

# Model Fitting:
# linreg_lsp_vs_leps.fit( ...)

# Evaluation on Training Data:
# model_perf_lsp_vs_leps = ...

# Evaluation on Test Data:
# model_perf_val_lsp_vs_leps = ... 

In [None]:
# define and evaluate further models 

Next, we merge the evaluation results for the different models. We do that separately for the training and the test data.

**NOTE**: If you choose a different naming convention, you might have to adapt the code here:

In [None]:
model_perf_all_log = pd.concat([model_perf_lsp_vs_leps, model_perf_lsp_vs_ldy, model_perf_lsp_vs_lvs, model_perf_lsp_vs_l3])
model_perf_all_log['date'] = '2024-09-20'

model_perf_val_all_log = pd.concat([model_perf_val_lsp_vs_leps, model_perf_val_lsp_vs_ldy, model_perf_val_lsp_vs_lvs, model_perf_val_lsp_vs_l3])
model_perf_val_all_log['date'] = '2024-09-27'

model_perf_both_sets_linlog = pd.concat([model_perf_both_sets, model_perf_all_log, model_perf_val_all_log])

Now we can plot the performance bar charts including the models with the logarithm transform:

In [None]:
sns.barplot(data=model_perf_both_sets_linlog, x='model_name', y='r2_score', hue = 'date')
plt.title('Comparison of Simple Linear Regression Models for Share Price\nPerformance on 2 different dates')
plt.xlabel('Predictor / Independent Variable')
plt.xticks(rotation=45, ha='right')
plt.ylabel('$R^2$-Score')
plt.legend(loc=(1.05, 0.5))
plt.grid()
plt.show()

**EXERCISE**: How to interpret these results?

## Shrinkage Model for Share Price Prediction
In this exercise, you will apply the lasso shrinkage method to identify the attributes most relevant to predict the share price. If you run this notebook in the given order, the logarithms of the 3 most correlated features are included in addition to the linear values.

Before we can start with the actual shrinkage, we have to scale the features:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, lasso_path
from sklearn.model_selection import GridSearchCV

**Note**: Depending on the hyperparameters and the data, the lasso optimization might not converge. In this case, you would get a convergence warning (something like `ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.297e+10, tolerance: 5.518e+07`, with numbers potentially different). This means that the optimization algorithm did not find a good solution. As mentioned in the warning message, a higher regularization can help here. As we are searching for the best hyperparameter, we can assume that the candidate values for which these messages occur do not yield good results and will therefore be discarded. To avoid cluttering the output, we switch off the display of convergence warnings with the cell below:

In [None]:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

As the scaling is only possible for numerical attributes, we first drop the three non-numerical attributes and store the results as `largeCaps_b5k_num`. Then we can define, fit and apply the scaler. We do this for both the first and the second data set.

In [None]:
largeCaps_b5k_num = largeCaps_b5k.drop(['Symbol', 'Name', 'Sector'], axis=1)
largeCaps_b5k_num = largeCaps_b5k_num.dropna()
predictors = largeCaps_b5k_num.drop(['log_SharePrice_CHF', 'SharePrice_CHF'], axis=1)
target = largeCaps_b5k_num['SharePrice_CHF']

Next, we scale the predictors:

In [None]:
# initialize and adapt scaler
share_scaler = StandardScaler()
share_scaler = share_scaler.fit(predictors)

# apply scaling to predictors
predictors_std = share_scaler.transform(predictors)
predictors_std = pd.DataFrame(predictors_std, columns = predictors.columns)
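The following sketch shows what the scaler does and how a fitted scaler is reused on later data. The frames `train` and `new` and the name `demo_scaler` are hypothetical and only serve as an illustration. The key point is that the means and standard deviations are learned from the first data set only and are then applied unchanged to the second one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; invented for illustration only.
train = pd.DataFrame({"EPS_CHF": [1.0, 2.0, 3.0, 6.0],
                      "Volume_Shares": [100.0, 300.0, 500.0, 700.0]})
new = pd.DataFrame({"EPS_CHF": [3.0, 9.0],
                    "Volume_Shares": [400.0, 1000.0]})

# fit() learns the mean and standard deviation of each column of `train`
demo_scaler = StandardScaler().fit(train)

# transform() subtracts the learned mean and divides by the learned spread
train_std = pd.DataFrame(demo_scaler.transform(train),
                         columns=train.columns, index=train.index)
# The later data is scaled with the parameters learned on `train`
new_std = pd.DataFrame(demo_scaler.transform(new),
                       columns=new.columns, index=new.index)

print(train_std.mean())   # approximately 0 for every column
print(new_std)            # not centred: it uses the means of `train`
```

Passing `index=...` keeps the original row labels, which can be handy if the scaled predictors are later combined with other columns of the same data frame.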

**EXERCISE**: Use shrinkage to identify the best predictors for the share price.
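As a starting point, the following sketch shows one possible way to search for a good lasso penalty with `GridSearchCV`. It is only an illustration under assumptions: the data is synthetic (only the first two of four features influence the target), and the names `X_demo`, `y_demo`, `grid_search_demo` as well as the candidate grid for `alpha` are hypothetical choices, not the reference solution.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on features 0 and 1 only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

# The lasso penalty treats all coefficients alike, so we scale the features first
X_demo_std = StandardScaler().fit_transform(X_demo)

# Try 20 candidate penalties between 0.001 and 10 with 5-fold cross-validation
grid_search_demo = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": np.logspace(-3, 1, 20)},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search_demo.fit(X_demo_std, y_demo)

best_alpha = grid_search_demo.best_params_["alpha"]
best_lasso = grid_search_demo.best_estimator_   # refitted on all data with best_alpha
print(best_alpha)
print(best_lasso.coef_)   # coefficients of irrelevant features are (close to) zero

# With a very strong penalty, all coefficients are shrunk to exactly zero
strong_lasso = Lasso(alpha=10.0).fit(X_demo_std, y_demo)
print(strong_lasso.coef_)
```

The coefficients that remain clearly different from zero at the selected penalty point to the most relevant predictors.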

Let us now evaluate the model on the training data:

In [None]:
# define lasso model based on best value for alpha
lasso_model_mse = Lasso(alpha=grid_search_mse.best_params_['alpha'])

# train model
lasso_model_mse.fit(predictors_std, target)

# Evaluation on Training Data:
model_perf_lasso = apply_eval_model(lasso_model_mse, predictors_std, target, 'lasso')

Again, we want to evaluate this new model on the later data set to see how well this model works on a new data set.

**EXERCISE:** Apply the new model to the later data set `largeCaps_val_b5k`. In order to get a valid result, you have to apply the exact same pre-processing steps as we did to the training data.

In [None]:
# largeCaps_val_b5k_num = ...

# predictors_val = ...
# target_val = ...

# ... 

In [None]:
# Evaluation on Validation Data:
# model_perf_val_lasso = ...

Again we merge the results for a graphical representation:

In [None]:
# Merging the result data frames:
model_perf_lasso['date'] = '2024-09-20'
model_perf_val_lasso['date'] = '2024-09-27'

model_perf_all = pd.concat([model_perf_both_sets_linlog, model_perf_lasso, model_perf_val_lasso])

In [None]:
sns.barplot(data=model_perf_all, x='model_name', y='r2_score', hue = 'date')
plt.title('Comparison of Simple Linear Regression Models for Share Price\nPerformance on 2 different dates')
plt.xlabel('Predictor / Independent Variable')
plt.xticks(rotation=45, ha='right')
plt.ylabel('$R^2$-Score')
plt.legend(loc=(1.05, 0.5))
plt.grid()
plt.show()

**EXERCISE:** Among the models listed above, which one do you choose to automatically predict the share price? What steps would be necessary to give a dependable estimate of the model performance on new data (you don't have to program it)? Explain why this step is necessary.