# Step 1: Set up your environment.

This walkthrough follows the [EliteDataScience scikit-learn tutorial](https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn#step-1). To get started, set up a Python environment:

1. Create a virtual environment with `venv` (or create one from VS Code):

```bash
python3 -m venv .venv
```

2. Activate the virtual environment (the path is relative to this file):

```bash
source ../../.venv/bin/activate 
```

3. Check that `python` and `pip` resolve to the copies inside the virtual environment:

```bash
which python 
which pip 
```

4. Install packages:

```bash
pip install scikit-learn
pip install numpy 
pip install pandas
```

5. Freeze packages and write `requirements.txt`:

```bash
pip freeze > requirements.txt
```
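
The checks in steps 3 to 5 can also be run from inside Python. The following is a minimal sketch, not part of the original tutorial: `in_virtualenv` and `installed_versions` are hypothetical helper names, and it assumes Python 3.8 or newer (for `importlib.metadata`) and an environment created with `python3 -m venv` as in step 1.

```python
import sys
from importlib import metadata


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    found = installed_versions(["scikit-learn", "numpy", "pandas"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

If the first line prints `False`, the interpreter running the script is probably not the one from `.venv`, and step 2 is worth repeating.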

In [5]:
#@title Step 2: Import libraries and modules.

# NumPy provides support for efficient numerical computation.
import numpy as np

# pandas is a convenient library that supports DataFrames.
import pandas as pd

# model_selection contains utilities that help us choose between models.
from sklearn.model_selection import train_test_split

# preprocessing contains utilities for scaling, transforming, and wrangling data.
from sklearn import preprocessing

# Import the model family we need: random forests.
# This tutorial only covers training a random forest and tuning its parameters;
# choosing between model families is out of scope here.
from sklearn.ensemble import RandomForestRegressor

# Tools for building a modeling pipeline and running a cross-validated grid search.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Metrics we can use to evaluate model performance later.
from sklearn.metrics import mean_squared_error, r2_score

# joblib lets us persist the model for future use. It is an alternative to Python's
# pickle package that is more efficient for storing large NumPy arrays.
import joblib


In [10]:
#@title Step 3: Load red wine data.

# pandas' read_csv() can load any CSV file, including one at a remote URL.
# dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
dataset_url = 'wine-quality.csv'  # local copy; loading the remote URL above failed with a self-signed SSL certificate error
data = pd.read_csv(dataset_url, sep=';')  # fields are separated by ';', not the default ','

# Look at the first 5 rows of the data:
print(data.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 
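
The effect of `sep=';'` can be seen on a small inline sample. This is only a sketch: `sample_csv` is a hypothetical stand-in built from the first three rows printed above, not the real data file, and the header line is assumed to be unquoted.

```python
import io

import pandas as pd

# Three rows copied from the printed output, written in a
# semicolon-separated layout like the wine data file.
sample_csv = io.StringIO(
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.70;0.00;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0.00;2.6;0.098;25.0;67.0;0.9968;3.20;0.68;9.8;5\n"
    "7.8;0.76;0.04;2.3;0.092;15.0;54.0;0.9970;3.26;0.65;9.8;5\n"
)

# With the right separator, every field becomes its own column.
sample = pd.read_csv(sample_csv, sep=";")
print(sample.shape)
print(sample.dtypes)

# With the default comma separator, each line is read as a single column.
sample_csv.seek(0)
wrong = pd.read_csv(sample_csv)
print(wrong.shape)
```

Comparing the two shapes is a quick way to confirm that a file was parsed with the intended delimiter before moving on.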

In [None]:
#@title Step 4: Split data into training and test sets.

# First, separate the target (y) from the input features (X):
y = data.quality
X = data.drop('quality', axis=1)

# Hold out 20% of the rows as a test set with train_test_split. stratify=y keeps the
# distribution of quality scores similar in both sets; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y)
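
To see what `stratify` does, the class shares can be compared before and after splitting. The sketch below uses synthetic data rather than the wine file: `X_demo` and `y_demo` are hypothetical names, and the 60/30/10 label mix is an assumption chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 100 rows with an imbalanced label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([5] * 60 + [6] * 30 + [7] * 10, name="quality")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Class shares in each split should closely match those of the full data.
print(y_demo.value_counts(normalize=True).sort_index())
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Without `stratify`, a rare quality score could end up under-represented in the test set purely by chance, which would make the later evaluation less representative.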