
# Step 2: Import Libraries and Modules

In [21]:
import numpy as np
np.set_printoptions(suppress=True)
import pandas as pd

In [22]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# Step 3: Load Red Wine Data

In [23]:
# Load the data set from a remote URL. The file is semicolon-delimited,
# so pass sep=';' to split each row into its columns.
dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
print(data.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 
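
To see why the separator matters, here is a small sketch using an in-memory string rather than the real file. The three-column sample text and the names `raw`, `wrong`, and `right` are made up for illustration.

```python
import io
import pandas as pd

# Tiny semicolon-delimited sample (hypothetical data, not the wine file)
raw = "fixed acidity;alcohol;quality\n7.4;9.4;5\n7.8;9.8;5\n"

# Default separator is a comma, so each row is read as one wide column
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep=';' splits each row into its three columns
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape, right.shape)
```

Checking `data.shape` right after loading, as in the next cell, is a quick way to catch a wrong separator.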

In [24]:
print(data.shape)

(1599, 12)


In [25]:
print(data.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

# Step 4: Split data into training and test sets

In [26]:
# Separate the target (y) from the input features (X)
y = data.quality
X = data.drop('quality', axis=1)
# Set aside 20% of the data as a test set. Stratifying on the target keeps
# the distribution of quality scores similar in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)
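
As a rough illustration of what `stratify` does, the sketch below splits a synthetic, imbalanced integer target and compares class shares in the two parts. The data and the helper `class_shares` are assumptions made for this example, not part of the wine data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target standing in for quality scores (assumption)
rng = np.random.default_rng(0)
y_demo = rng.choice([5, 6, 7], size=1000, p=[0.5, 0.4, 0.1])
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

def class_shares(labels):
    """Return the fraction of samples in each class (hypothetical helper)."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print("train:", train_shares)
print("test: ", test_shares)
```

With stratification the shares in the two splits should agree closely, which matters most for rare quality scores.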

# Step 5: Declare data preprocessing steps
**First! Transform the training set**

In [28]:
# The Transformer API lets you "fit" a preprocessing step on the training data the same way you'd fit a model:
# 1. Fit the transformer on the training set (saving the means and standard deviations)
# 2. Apply the transformer to the training set (scaling the training data)
# 3. Apply the transformer to the test set (using the same means and standard deviations)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)

[[ 0.51358886  2.19680282 -0.164433   ...  1.08415147 -0.69866131
  -0.58608178]
 [-1.73698885 -0.31792985 -0.82867679 ...  1.46964764  1.2491516
   2.97009781]
 [-0.35201795  0.46443143 -0.47100705 ... -0.13658641 -0.35492962
  -0.20843439]
 ...
 [-0.98679628  1.10708533 -0.93086814 ...  0.24890976 -0.98510439
   0.35803669]
 [-0.69826067  0.46443143 -1.28853787 ...  1.08415147 -0.35492962
  -0.68049363]
 [ 3.1104093  -0.62528606  2.08377675 ... -1.61432173  0.79084268
  -0.39725809]]


In [29]:
print(X_train_scaled.mean(axis = 0))
print(X_train_scaled.std(axis = 0))

[ 0. -0. -0. -0.  0. -0. -0. -0. -0. -0. -0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


**Second! Transform the Test Set!**

In [31]:
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))

[ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
 -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]
[1.02160495 1.00135689 0.97456598 0.91099054 0.86716698 0.94193125
 1.03673213 1.03145119 0.95734849 0.83829505 1.0286218 ]
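
The test-set means and standard deviations are not exactly 0 and 1 because the scaler reuses the training statistics. A minimal sketch of the same arithmetic on synthetic arrays (the names `train`, `test`, `mu`, and `sigma` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test features (assumption)
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
test = rng.normal(loc=10.0, scale=3.0, size=(50, 4))

# Statistics come from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the training statistics to the test set by hand
manual_test = (test - mu) / sigma

# StandardScaler fitted on the training set gives the same result
scaler_demo = StandardScaler().fit(train)
matches = np.allclose(manual_test, scaler_demo.transform(test))
print(matches, manual_test.mean(axis=0))
```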


In [32]:
# Inside a cross-validation pipeline the scaler is fit on each set of training folds automatically, so we don't need to fit it manually.
pipeline = make_pipeline(preprocessing.StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=123))

# Step 6: Declare hyperparameters to tune

In [34]:
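# 1.0 means "use all features", which is what 'auto' meant for regressors;
# the 'auto' option was removed in scikit-learn 1.3.
hyperparameters = {
    'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1],
}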
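
The `randomforestregressor__` prefix comes from the step name that `make_pipeline` generates (the lowercased class name), joined to the parameter name with a double underscore. A short sketch for listing the tunable names, using a separately built `demo_pipeline` so it runs on its own:

```python
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Same construction as the pipeline above, under a different name
demo_pipeline = make_pipeline(
    preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100, random_state=123),
)

# Step names are generated from the class names
step_names = [name for name, _ in demo_pipeline.steps]

# Every key from get_params() with the step prefix can appear in the grid
tunable = sorted(
    name for name in demo_pipeline.get_params()
    if name.startswith("randomforestregressor__")
)
print(step_names)
print(tunable)
```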


# Step 7: Tune model using a cross-validation pipeline

**The steps for Cross validation**

1. Split your data into k equal parts, or “folds” (typically k=10).
2. Train your model on k-1 folds (e.g. the first 9 folds).
3. Evaluate it on the remaining “hold-out” fold (e.g. the 10th fold).
4. Perform steps (2) and (3) k times, each time holding out a different fold.
5. Aggregate the performance across all k folds. This is your performance metric.
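
The steps above can be sketched as an explicit loop. This is only an illustration on synthetic data from `make_regression`, with a smaller forest than the one in this notebook to keep it fast; the variable names are made up.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data standing in for the wine features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: split into k = 10 folds
kfold = KFold(n_splits=10, shuffle=True, random_state=123)

fold_scores = []
for train_idx, holdout_idx in kfold.split(X_demo):
    # Step 2: train on the k-1 training folds
    model = RandomForestRegressor(n_estimators=20, random_state=123)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Step 3: evaluate on the hold-out fold
    preds = model.predict(X_demo[holdout_idx])
    fold_scores.append(r2_score(y_demo[holdout_idx], preds))

# Step 5: aggregate across folds
mean_score = np.mean(fold_scores)
print(len(fold_scores), mean_score)
```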

**CV pipeline after including preprocessing steps:**

1. Split your data into k equal parts, or “folds” (typically k=10).
2. Preprocess k-1 training folds.
3. Train your model on the same k-1 folds.
4. Preprocess the hold-out fold using the same transformations from step (2).
5. Evaluate your model on the same hold-out fold.
6. Perform steps (2) – (5) k times, each time holding out a different fold.
7. Aggregate the performance across all k folds. This is your performance metric.
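
A sketch of this procedure written out by hand, compared with `cross_val_score` on a pipeline, which is expected to perform the same per-fold preprocessing. It assumes synthetic data, 5 folds instead of 10, and a small forest for speed; all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=123)

manual_scores = []
for train_idx, holdout_idx in folds.split(X_demo):
    # Fit the scaler on the training folds only
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    model = RandomForestRegressor(n_estimators=20, random_state=123)
    model.fit(fold_scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # Reuse the training-fold statistics on the hold-out fold
    preds = model.predict(fold_scaler.transform(X_demo[holdout_idx]))
    manual_scores.append(r2_score(y_demo[holdout_idx], preds))

# The pipeline version: scaler and model are refit together in each fold
demo_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=20, random_state=123)
)
pipeline_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=folds)

print(np.allclose(manual_scores, pipeline_scores))
```

Wrapping the scaler in the pipeline keeps hold-out statistics from leaking into preprocessing without writing the loop yourself.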


In [36]:
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
# Fit and tune the model
clf.fit(X_train, y_train)
print( clf.best_params_ )

{'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'sqrt'}


# Step 8: Refit on the entire training set

In [37]:
# GridSearchCV refits the best pipeline on the entire training set by default (refit=True)
print(clf.refit)

True
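
With `refit=True`, the refit pipeline is stored as `best_estimator_`, and calling `predict` on the search object delegates to it. A small sketch on synthetic data, with a reduced grid and hypothetical names (`search`, `demo_grid`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
demo_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=10, random_state=123)
)
demo_grid = {"randomforestregressor__max_depth": [None, 3]}

# refit=True is the default, so it is not passed explicitly
search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
search.fit(X_demo, y_demo)

# The search object and its refit best pipeline give the same predictions
same = np.allclose(search.predict(X_demo), search.best_estimator_.predict(X_demo))
print(search.refit, type(search.best_estimator_).__name__, same)
```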


# Step 9: Evaluate model pipeline on test data

In [38]:
y_pred = clf.predict(X_test)
print( r2_score(y_test, y_pred) )
print( mean_squared_error(y_test, y_pred) )


0.4712595193413647
0.34118218749999996
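
For reference, both metrics can be computed by hand. The sketch below uses a few made-up values rather than the wine predictions: MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_hat = np.array([5.2, 5.7, 5.4, 6.5, 6.1])

# Mean squared error: average squared difference
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: share of the target's variance explained by the predictions
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```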


# Step 10: Save model for future use

In [39]:
joblib.dump(clf, 'rf_regressor.pkl')

['rf_regressor.pkl']
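
Loading the model back is done with `joblib.load`. The sketch below round-trips a small stand-in model through a temporary directory so it does not depend on `rf_regressor.pkl` existing; the file name `demo_model.pkl` and the synthetic data are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data (assumption)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=123).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
roundtrip_ok = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(roundtrip_ok)
```

For the model saved in this notebook, the equivalent call would be `joblib.load('rf_regressor.pkl')`, followed by `predict` on new data with the same feature columns. Pickled models should generally be loaded with the same scikit-learn version that saved them.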