# Step 1: Set up your environment.

This is the tutorial defined [here](https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn#step-1). To get started we set up our environment:

1. Create `venv` (or do so in vs code):

```bash
python3 -m venv .venv
```

2. Activate `venv` (path is relative to this file):

```bash
source ../../.venv/bin/activate 
```

3. Check python & pip are there and using venv ones:

```bash
which python 
which pip 
```

4. Install packages:

```bash
pip install scikit-learn
pip install numpy 
pip install pandas
```

5. Freeze packages and write `requirements.txt`:

```bash
pip freeze > requirements.txt
```

In [12]:
#@title Step 2: Import libraries and modules.

# import numpy, which provides support for more efficient numerical computation:
import numpy as np

# Pandas, a convenient library that supports dataframes
import pandas as pd

# model_selection - contains many utilities that will help us choose between models
from sklearn.model_selection import train_test_split

# preprocessing module. This contains utilities for scaling, transforming, and wrangling data.
from sklearn import preprocessing

# import the families of models we’ll need - random forest family
# For the scope of this tutorial, we’ll only focus on training a random forest and tuning its parameters. 
# We’ll have another detailed tutorial for how to choose between model families.
from sklearn.ensemble import RandomForestRegressor

# importing the tools to help us perform cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

#some metrics we can use to evaluate our model performance later.
from sklearn.metrics import mean_squared_error, r2_score

# way to persist our model for future use - Joblib is an alternative to Python’s pickle package, 
# and we’ll use it because it’s more efficient for storing large numpy arrays.
import joblib


In [13]:
#@title Step 3: Load red wine data.

# convenient tool we’ll use today is the read_csv() function. Using this function, we can load any CSV file, even from a remote URL
#dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
dataset_url='wine-quality.csv' # using this as actual URL gave self signed SSL error
data = pd.read_csv(dataset_url, sep=';') # data is using ; to separate data (not comma default)

# Now let’s take a look at the first 5 rows of data:
print( data.head() )

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

In [14]:
#@title Step 4: Split data into training and test sets.

# First, let’s separate our target (y) features from our input (X) features:
y = data.quality
X = data.drop('quality', axis=1)

# This allows us to take advantage of Scikit-Learn’s useful train_test_split function:
# As you can see, we’ll set aside 20% of the data as a test set for evaluating our model. We also set an arbitrary “random state” (a.k.a. seed) so that we can reproduce our results.
# Finally, it’s good practice to stratify your sample by the target variable. This will ensure your training set looks similar to your test set, making your evaluation metrics more reliable.
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y) 

## Step 5: Declare data preprocessing steps.

Remember, in Step 3, we made the mental note to standardize our features because they were on different scales.

### WTF is standardization?

Standardization is the process of subtracting the means from each feature and then dividing by the feature standard deviations.

Standardization is a common requirement for machine learning tasks. Many algorithms assume that all features are centered around zero and have approximately the same variance.

First, here’s some code that we won’t use…

Scikit-Learn makes data preprocessing a breeze. For example, it’s pretty easy to simply scale a dataset:

```python
X_train_scaled = preprocessing.scale(X_train)
print( X_train_scaled )
# array([[ 0.51358886,  2.19680282, -0.164433  , ...,  1.08415147,
#         -0.69866131, -0.58608178],
#        [-1.73698885, -0.31792985, -0.82867679, ...,  1.46964764,
#          1.2491516 ,  2.97009781],
#        [-0.35201795,  0.46443143, -0.47100705, ..., -0.13658641,
# ...
```

You can confirm that the scaled dataset is indeed centered at zero, with unit variance:

```Python
print( X_train_scaled.mean(axis=0) )
# [ 0. -0. -0. -0.  0. -0. -0. -0. -0. -0. -0.]

print( X_train_scaled.std(axis=0) )
# [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
```

Great, but why did we say that we won’t use this code?

The reason is that we won’t be able to perform the exact same transformation on the test set.

Sure, we can still scale the test set separately, but we won’t be using the same means and standard deviations as we used to transform the training set.

In other words, that means it wouldn’t be a fair representation of how the model pipeline, include the preprocessing steps, would perform on brand new data.

So instead of directly invoking the scale function, we’ll be using a feature in Scikit-Learn called the Transformer API. The Transformer API allows you to “fit” a preprocessing step using the training data the same way you’d fit a model…

…and then use the same transformation on future data sets!

Here’s what that process looks like:

1, Fit the transformer on the training set (saving the means and standard deviations)
2. Apply the transformer to the training set (scaling the training data)
3. Apply the transformer to the test set (using the same means and standard deviations)

This makes your final estimate of model performance more realistic, and it allows to insert your preprocessing steps into a cross-validation pipeline (more on this in Step 7).

In [18]:
# save means and standard deviations for each feature in the training set in scaler object
scaler = preprocessing.StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

[ 1.16664562e-16 -3.05550043e-17 -8.47206937e-17 -2.22218213e-17
  2.77772766e-18 -6.38877362e-17 -4.16659149e-18 -1.20753377e-13
 -8.70817622e-16 -4.08325966e-16 -1.16664562e-15]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [20]:
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))

[ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
 -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]
[1.02160495 1.00135689 0.97456598 0.91099054 0.86716698 0.94193125
 1.03673213 1.03145119 0.95734849 0.83829505 1.0286218 ]


In [21]:
# In practice, when we set up the cross-validation pipeline, we won’t even need to manually fit the 
# Transformer API. Instead, we’ll simply declare the class object, like so:

pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100,
                                               random_state=123))

# This is exactly what it looks like: a modeling pipeline that first transforms the data using StandardScaler() 
# and then fits a model using a random forest regressor. Again, the random_state= parameter can be any number you choose. 
# It’s simply setting the seed so that you get consistent results each time you run the code.

## Step 6: Declare hyperparameters to tune.

Now it’s time to consider the hyperparameters that we’ll want to tune for our model.

### WTF are hyperparameters?

There are two types of parameters we need to worry about: model parameters and hyperparameters. 

Models parameters can be learned directly from the data (i.e. regression coefficients), while hyperparameters cannot.

Hyperparameters express “higher-level” structural information about the model, and they are typically set before training the model.

**Example: random forest hyperparameters.**

As an example, let’s take our random forest for regression:

Within each decision tree, the computer can empirically decide where to create branches based on either mean-squared-error (MSE) 
or mean-absolute-error (MAE). Therefore, the actual branch locations are **model parameters**.

However, the algorithm does not know which of the two criteria, MSE or MAE, that it should use. The algorithm also cannot decide
 how many trees to include in the forest. These are examples of **hyperparameters** that the user must set.

We can list the tunable hyperparameters like so:

In [22]:
print( pipeline.get_params() )

{'memory': None, 'steps': [('standardscaler', StandardScaler()), ('randomforestregressor', RandomForestRegressor(random_state=123))], 'verbose': False, 'standardscaler': StandardScaler(), 'randomforestregressor': RandomForestRegressor(random_state=123), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'randomforestregressor__bootstrap': True, 'randomforestregressor__ccp_alpha': 0.0, 'randomforestregressor__criterion': 'squared_error', 'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 1.0, 'randomforestregressor__max_leaf_nodes': None, 'randomforestregressor__max_samples': None, 'randomforestregressor__min_impurity_decrease': 0.0, 'randomforestregressor__min_samples_leaf': 1, 'randomforestregressor__min_samples_split': 2, 'randomforestregressor__min_weight_fraction_leaf': 0.0, 'randomforestregressor__n_estimators': 100, 'randomforestregressor__n_jobs': None, 'randomforestregressor__oob_score': False, 'rando

In [23]:
# You can also find a list of all the parameters on the RandomForestRegressor documentation page. 
# Just note that when it’s tuned through a pipeline, you’ll need to prepend  
# randomforestregressor__ before the parameter name, like in the code above.

# Now, let’s declare the hyperparameters we want to tune through cross-validation.

hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}

# As you can see, the format should be a Python dictionary (data structure for key-value pairs) where keys 
# are the hyperparameter names and values are lists of settings to try. The options for parameter values 
# can be found on the documentation page.

## Step 7: Tune model using a cross-validation pipeline.

Now we’re almost ready to dive into fitting our models. But first, we need to spend some time talking about cross-validation.

This is one of the most important skills in all of machine learning because it helps you maximize model performance while reducing the chance of overfitting.

### What is cross-validation (CV)?

Cross-validation is a process for reliably estimating the performance of a method for building a model by training and evaluating your model multiple times using the same method.

Practically, that “method” is simply a set of hyperparameters in this context.

These are the steps for CV:

1. Split your data into k equal parts, or “folds” (typically k=10).
2. Train your model on k-1 folds (e.g. the first 9 folds).
3. Evaluate it on the remaining “hold-out” fold (e.g. the 10th fold).
4. Perform steps (2) and (3) k times, each time holding out a different fold.
5. Aggregate the performance across all k folds. This is your performance metric.

![K-Fold Cross Validation Diagram](https://elitedatascience.com/wp-content/uploads/2016/12/K-fold_cross_validation_EN.jpg)

K-Fold Cross-Validation diagram (Wikipedia)

### Why is cross-validation important in machine learning?

Let’s say you want to train a random forest regressor. One of the hyperparameters you must tune is the maximum depth allowed for each decision tree in your forest.

How can you decide?

That’s where cross-validation comes in. Using only your training set, you can use CV to evaluate different hyperparameters and estimate their effectiveness.

This allows you to keep your test set “untainted” and save it for a true hold-out evaluation when you’re finally ready to select a model.

For example, you can use CV to tune a random forest model, a linear regression model, and a k-nearest neighbors model, using only the training set. 
Then, you still have the untainted test set to make your final selection between the model families!

### So what is a cross-validation “pipeline?”

The best practice when performing CV is to include your data preprocessing steps inside the cross-validation loop. This prevents accidentally tainting your training folds with influential data from your test fold.

Here’s how the CV pipeline looks after including preprocessing steps:

1. Split your data into k equal parts, or “folds” (typically k=10).
2. Preprocess k-1 training folds.
3. Train your model on the same k-1 folds.
4. Preprocess the hold-out fold using the same transformations from step (2).
5. Evaluate your model on the same hold-out fold.
6. Perform steps (2) – (5) k times, each time holding out a different fold.
7. Aggregate the performance across all k folds. This is your performance metric.

Fortunately, Scikit-Learn makes it stupidly simple to set this up:

In [24]:
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
 
# Fit and tune model
clf.fit(X_train, y_train)

40 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/edoatley/source/share-predict/.venv/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/edoatley/source/share-predict/.venv/lib/python3.11/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edoatley/source/share-predict/.venv/lib/python3.11/site-packages/sklearn/pipeline.py", line 427, in fit
    self._final_estimator.fit(Xt, y, **fit_para

In [25]:
#@title Step 9: Evaluate model pipeline on test data.
y_pred = clf.predict(X_test)

print( r2_score(y_test, y_pred) )

print( mean_squared_error(y_test, y_pred) )


0.4712595193413647
0.34118218749999996


Great, so now the question is… is this performance good enough?

Well, the rule of thumb is that your very first model probably won’t be the best possible model. However, we recommend a combination of three strategies to decide if you’re satisfied with your model performance.

Start with the goal of the model. If the model is tied to a business problem, have you successfully solved the problem?

Look in academic literature to get a sense of the current performance benchmarks for specific types of data.

Try to find low-hanging fruit in terms of ways to improve your model.

There are various ways to improve a model. We’ll have more guides that go into detail about how to improve model performance, but here are a few quick things to try:

Try other regression model families (e.g. regularized regression, boosted trees, etc.).

Collect more data if it’s cheap to do so.

Engineer smarter features after spending more time on exploratory analysis.

Speak to a domain expert to get more context (this is a good excuse to go wine tasting!).

As a final note, when you try other families of models, we recommend using the same training and test set as you used to fit the random forest model. That’s the best way to get a true apples-to-apples comparison between your models.

In [26]:
#@title Step 10: Save model for future use.
joblib.dump(clf, 'rf_regressor.pkl')

# When you want to load the model again, simply use this function:
# clf2 = joblib.load('rf_regressor.pkl')
# clf2.predict(X_test)

['rf_regressor.pkl']