# A Practical Exploration of the Wine Quality Dataset

#### By Jordan Cairns, Chris Gao, Yingzi Jin and Chun Li
#### In fulfillment of DSCI 522 Milestone 1

## Executive Summary

Our analysis aimed to develop a predictive model to distinguish between red and white wines based on various physicochemical properties. This study employed logistic regression, a model renowned for its balance between predictive power and interpretability.

The regression result suggested that residual sugar and total sulfur dioxide had high positive coefficients, indicating a strong association with white wine, whereas density showed the most substantial negative impact, followed by alcohol and volatile acidity, suggesting these are key indicators of red wine.


The logistic regression model not only achieved high accuracy but also provided valuable insights into the features most indicative of wine type. This model can assist vintners in quality control and classification tasks. Moreover, the interpretability of the model offers a foundation for further research into wine composition and its impact on sensory attributes. Future studies might explore more complex models or delve deeper into feature engineering to enhance predictive accuracy and understanding.




# Introduction

In the intricate world of oenology, the distinction between red and white wines extends beyond color, embedding itself in the nuanced spectrum of their physicochemical properties. This project embarks on a data-driven journey to unravel these complexities by leveraging statistical models to classify wines as red or white based on their inherent characteristics. Utilizing a rich dataset that encapsulates key attributes like acidity, sugar content, sulfur dioxide levels, alcohol concentration, and more, we aim to build a predictive model that not only accurately classifies the wines but also sheds light on the influential factors that underpin this classification. Through this analysis, we intend to blend the art of winemaking with the precision of data science, offering insights that could prove valuable to vintners, sommeliers, and wine enthusiasts alike in understanding the subtle distinctions between these two celebrated categories of wine.

# Data

The dataset used in our project comes from the UCI Machine Learning Repository and covers red and white variants of Portuguese "Vinho Verde" wine. It was assembled to model wine quality from physicochemical tests, so its input features describe the chemical composition of each sample (acidity, sugar content, alcohol level, and so on), while its output variable is a sensory quality rating. The dataset excludes grape types, wine brands, and prices due to privacy and logistical constraints, which confines our analysis to measurable physicochemical attributes and the quality score, free from commercial biases. Its structure lends itself to both classification and regression tasks.

## Data Overview

In [1]:
import numpy as np
import pandas as pd
import requests
import zipfile
import altair as alt
import os
from sklearn import set_config
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

In [2]:
# The following blocks of code were inspired by DSCI 522 Sample Milestone 1
# Download data as zip and extract
url = "https://archive.ics.uci.edu/static/public/186/wine+quality.zip"
raw_dir = "../data/raw"
zip_path = os.path.join(raw_dir, "wine+quality.zip")

# Make sure the target folder exists before writing to it
os.makedirs(raw_dir, exist_ok=True)

request = requests.get(url)
request.raise_for_status()  # stop early if the download failed
with open(zip_path, 'wb') as f:
    f.write(request.content)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(raw_dir)

In [3]:
# Reading the dataframe with delimiter set to semicolon 
df_white = pd.read_csv("../data/raw/winequality-white.csv", sep = ";")
df_red = pd.read_csv("../data/raw/winequality-red.csv", sep = ";")

In [4]:
# Creating the combined df with the prediction target column added
df_white['color'] = 'white'
df_red['color'] = 'red'
df_wine = pd.concat([df_white, df_red], ignore_index=True)

df_wine.head(10)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 | white |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 | white |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 | white |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | white |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | white |
| 5 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 | white |
| 6 | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30.0 | 136.0 | 0.9949 | 3.18 | 0.47 | 9.6 | 6 | white |
| 7 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 | white |
| 8 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 | white |
| 9 | 8.1 | 0.22 | 0.43 | 1.5 | 0.044 | 28.0 | 129.0 | 0.9938 | 3.22 | 0.45 | 11.0 | 6 | white |


In [5]:
# Check dataframe column types and missing values
df_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  color                 6497 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB


In [6]:
# Display white wine summary statistics
summary_white = df_white.describe()
print(summary_white)


       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    4898.000000       4898.000000  4898.000000     4898.000000   
mean        6.854788          0.278241     0.334192        6.391415   
std         0.843868          0.100795     0.121020        5.072058   
min         3.800000          0.080000     0.000000        0.600000   
25%         6.300000          0.210000     0.270000        1.700000   
50%         6.800000          0.260000     0.320000        5.200000   
75%         7.300000          0.320000     0.390000        9.900000   
max        14.200000          1.100000     1.660000       65.800000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  4898.000000          4898.000000           4898.000000  4898.000000   
mean      0.045772            35.308085            138.360657     0.994027   
std       0.021848            17.007137             42.498065     0.002991   
min       0.009000             2.000000         

In [7]:
# Display red wine summary statistics
summary_red = df_red.describe()
print(summary_red)

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

In [8]:
# Check the class balance of the target column
df_wine.groupby('color')['color'].count()

color
red      1599
white    4898
Name: color, dtype: int64
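
The class counts above imply a majority-class share that any useful model has to beat. As a minimal sketch, the arithmetic can be reproduced from the two counts alone (the `counts` series is typed in by hand from the output above rather than read from `df_wine`):

```python
import pandas as pd

# Class counts copied by hand from the groupby output above
counts = pd.Series({"red": 1599, "white": 4898}, name="color")

# Share of each class in the combined dataset
proportions = counts / counts.sum()

# Accuracy obtained by always predicting the most frequent class
majority_baseline = proportions.max()

print(proportions.round(4))
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Roughly three quarters of the samples are white, so accuracy figures should be read against a baseline of about 0.75 rather than 0.50.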

In [9]:
# Splitting the dataframe into train and test portions with a random seed set to ensure reproducibility
np.random.seed(522)
set_config(transform_output="pandas")

# Creating the split
wine_train, wine_test = train_test_split(df_wine, train_size=0.70, stratify=df_wine["color"])
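
Stratifying on `color` is meant to keep the red/white ratio the same in both portions. The sketch below checks that property on a synthetic frame with the same class counts; the `toy` frame, its `feature` column, and the explicit `random_state` are assumptions made for illustration and are not part of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same class counts as the combined wine data
toy = pd.DataFrame({
    "feature": np.arange(6497),
    "color": ["white"] * 4898 + ["red"] * 1599,
})

# Same call shape as above, with an explicit seed for this sketch only
toy_train, toy_test = train_test_split(
    toy, train_size=0.70, stratify=toy["color"], random_state=522
)

# Class shares in each portion; with stratification they should nearly match
train_share = toy_train["color"].value_counts(normalize=True)
test_share = toy_test["color"].value_counts(normalize=True)

print(train_share.round(4))
print(test_share.round(4))
```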

In [10]:
# Saving the processed data into folder 'processed'

# Directory path
directory = "../data/processed"

# Create the directory if it doesn't already exist
os.makedirs(directory, exist_ok=True)

# Saving the data into the processed folder
wine_train.to_csv(os.path.join(directory, "wine_train.csv"))
wine_test.to_csv(os.path.join(directory, "wine_test.csv"))

## Exploratory Data Analysis

The first step of EDA is to plot the distribution of every numerical variable separately for red and white wines. Comparing these distributions side by side shows which features differ most between the two categories, which in turn informs feature selection for predictive modeling and guides the later modeling steps.




In [12]:
def layered_distri_plot(wine_train):
    """
    Generate a layered distribution plot using Altair for visualizing wine characteristics by color.

    Parameters:
    - wine_train (pandas.DataFrame): DataFrame containing wine data with columns representing characteristics 
                                     and 'color' column specifying wine type (e.g., 'red' or 'white').

    Returns:
    - Altair Chart: A layered distribution plot visualizing wine characteristics by color.
    """

    # Extract the numeric feature columns, excluding the last two columns ('quality' and 'color')
    numeric_cols = wine_train.columns.tolist()[:-2]

    # Create a tick chart representing the distribution of wine characteristics by color
    tick_chart = alt.Chart(wine_train).mark_tick().encode(
        alt.X(alt.repeat(), type="quantitative").scale(zero=False),
        alt.Y("color", title=""),
        alt.Color(
            "color",
            legend=alt.Legend(title="Type of Wine"),
            scale=alt.Scale(domain=['red', 'white'], 
                            range=['red', 'peachpuff'])
        )
    )
    
    # Create a boxplot chart for each wine characteristic by color
    box_chart = alt.Chart(wine_train).mark_boxplot(
        extent="min-max", color="grey", opacity=0.6
    ).encode(
        alt.X(alt.repeat(), type="quantitative"),
        alt.Y("color", title="")
    )
    
    # Create a point chart representing mean values of each characteristic by color
    point_chart = alt.Chart(wine_train).mark_point(
        filled=True, color="black", size=10
    ).encode(
        alt.X(alt.repeat(), aggregate='mean', type="quantitative"),
        alt.Y("color", title="")
    )
    
    # Layer the charts (tick, boxplot, point) for each wine characteristic
    layer_chart = (tick_chart + box_chart + point_chart).properties(
        height=50, width=200
    ).repeat(
        repeat=numeric_cols, 
        columns=2, title='Comparative Distribution of Wine Characteristics by Color'
    )
    
    return layer_chart

# Altair refuses to embed more than 5000 rows by default; enabling the VegaFusion
# data transformer lifts that limit (requires the optional `vegafusion` package)
alt.data_transformers.enable("vegafusion")

distribution_detail = layered_distri_plot(wine_train)
distribution_detail

alt.RepeatChart(...)

In [29]:
# Generating histograms to compare the distribution of each numeric variable across wine types

# Numeric feature columns (everything except 'quality' and 'color')
numeric_cols = wine_train.columns.tolist()[:-2]

numeric_hist_plots = alt.Chart(wine_train).mark_bar(opacity=0.6).encode(
    alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=30)),
    y='count()',
    color=alt.Color(
        'color',
        legend=alt.Legend(title="Type of Wine"),
        scale=alt.Scale(domain=['red', 'white'], range=['red', 'peachpuff'])
    )
).properties(
    height=150, width=250
).repeat(
    repeat=numeric_cols,
    columns=2,
    title='Comparative Distribution of Wine Characteristics by Color'
)

numeric_hist_plots

Visually, several features differ markedly between red and white wines and are likely to be useful for telling the two apart. The following stand out in the histograms:

1. Fixed & Volatile Acidity: The distributions differ noticeably, with red wines generally exhibiting higher fixed and volatile acidity.

2. Residual Sugar: White wines display a much higher residual sugar content, which could be a strong differentiator.

3. Total Sulfur Dioxide: The levels are markedly higher in white wines, suggesting this feature could be key in classification.

4. Free Sulfur Dioxide: Similar to total sulfur dioxide, this feature is also markedly higher in white wines.

5. pH: Most red wines have a higher pH than white wines.


In [33]:
# Spearman correlation matrix of the numeric features, shaded by sign and strength
wine_train[wine_train.columns[:-2]].corr('spearman').style.background_gradient(cmap="RdBu")

The correlation matrix shows that most of the explanatory variables are not strongly correlated with one another. However, two pairs have relatively high correlations (absolute value exceeding 0.7): `free sulfur dioxide` with `total sulfur dioxide`, and `density` with `alcohol`. Such collinearity can make it harder for the model to estimate the separate contribution of each feature to the response.
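
One way to pull such pairs out of a correlation matrix programmatically is sketched below. The helper name `high_corr_pairs`, the 0.7 default threshold, and the synthetic `toy` frame are assumptions made for illustration; applying the helper to the numeric columns of `wine_train` would be the analogous step for the wine data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.7, method="spearman"):
    """Return feature pairs whose absolute correlation exceeds `threshold`."""
    corr = df.corr(method=method)
    # Keep only the strict upper triangle so each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Filter by absolute value (this also discards any masked NaN entries)
    return pairs[pairs.abs() > threshold].sort_values(key=np.abs, ascending=False)

# Hypothetical toy data: `b` tracks `a` closely, `c` is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})

flagged = high_corr_pairs(toy)
print(flagged)
```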

## Models and Results

In [14]:
# Build a transformer to further process the data
numeric_features = wine_train.columns.tolist()[:-2]
categorical_features = wine_train.columns.tolist()[-2:-1]
columns_to_passthrough = ['color']

wine_preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OrdinalEncoder(), categorical_features),
    ('passthrough', columns_to_passthrough)
)

wine_preprocessor.fit(wine_train)
scaled_wine_train = wine_preprocessor.transform(wine_train)
scaled_wine_test = wine_preprocessor.transform(wine_test)

scaled_wine_train.to_csv("../data/processed/scaled_wine_train.csv")
scaled_wine_test.to_csv("../data/processed/scaled_wine_test.csv")

In [15]:
scaled_wine_train.head()

| | standardscaler__fixed acidity | standardscaler__volatile acidity | standardscaler__citric acid | standardscaler__residual sugar | standardscaler__chlorides | standardscaler__free sulfur dioxide | standardscaler__total sulfur dioxide | standardscaler__density | standardscaler__pH | standardscaler__sulphates | standardscaler__alcohol | ordinalencoder__quality | passthrough__color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3835 | -0.782248 | -0.180874 | -0.611183 | -0.459337 | -0.116545 | 0.191273 | -0.047292 | -0.139874 | -0.354777 | -0.343825 | -1.744772 | 2.0 | white |
| 5605 | 0.145189 | 0.919581 | -0.887211 | -0.523001 | 0.616447 | -0.810536 | -1.395805 | 0.641474 | 0.954901 | -0.141958 | 0.016849 | 2.0 | red |
| 2422 | -0.782248 | 0.613899 | -0.956217 | 0.474405 | 1.173521 | -0.142663 | 1.673837 | 1.016927 | 2.139847 | 1.271111 | -1.073678 | 2.0 | white |
| 3636 | -0.550389 | -0.486556 | 0.492927 | -0.862543 | -1.054776 | -1.03316 | -0.881241 | -1.28991 | 0.206513 | 1.472978 | 0.687943 | 4.0 | white |
| 226 | -0.627675 | -1.036784 | 0.009879 | -0.650329 | -0.233824 | 0.580865 | 1.496401 | -0.305614 | 1.765653 | -0.209247 | -0.654244 | 3.0 | white |


In [16]:
# Rename 'passthrough__color' back to 'color'
scaled_wine_train.rename(columns={'passthrough__color': 'color'}, inplace=True)
scaled_wine_test.rename(columns={'passthrough__color': 'color'}, inplace=True)

# Preparing data for machine learning model
X_train = scaled_wine_train.drop(columns=['color'])
y_train = scaled_wine_train['color']

X_test = scaled_wine_test.drop(columns=['color'])
y_test = scaled_wine_test['color']

# Show first few lines of the data to make sure it is okay
X_train.head()

| | standardscaler__fixed acidity | standardscaler__volatile acidity | standardscaler__citric acid | standardscaler__residual sugar | standardscaler__chlorides | standardscaler__free sulfur dioxide | standardscaler__total sulfur dioxide | standardscaler__density | standardscaler__pH | standardscaler__sulphates | standardscaler__alcohol | ordinalencoder__quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3835 | -0.782248 | -0.180874 | -0.611183 | -0.459337 | -0.116545 | 0.191273 | -0.047292 | -0.139874 | -0.354777 | -0.343825 | -1.744772 | 2.0 |
| 5605 | 0.145189 | 0.919581 | -0.887211 | -0.523001 | 0.616447 | -0.810536 | -1.395805 | 0.641474 | 0.954901 | -0.141958 | 0.016849 | 2.0 |
| 2422 | -0.782248 | 0.613899 | -0.956217 | 0.474405 | 1.173521 | -0.142663 | 1.673837 | 1.016927 | 2.139847 | 1.271111 | -1.073678 | 2.0 |
| 3636 | -0.550389 | -0.486556 | 0.492927 | -0.862543 | -1.054776 | -1.03316 | -0.881241 | -1.28991 | 0.206513 | 1.472978 | 0.687943 | 4.0 |
| 226 | -0.627675 | -1.036784 | 0.009879 | -0.650329 | -0.233824 | 0.580865 | 1.496401 | -0.305614 | 1.765653 | -0.209247 | -0.654244 | 3.0 |


In [17]:
# Creating the DummyClassifier to get the baseline score
dummy_scores = pd.DataFrame(cross_validate(
    DummyClassifier(strategy="most_frequent"),
    X_train,
    y_train,
    return_train_score=True,
    scoring=["accuracy"]
))

dummy_scores

| | fit_time | score_time | test_accuracy | train_accuracy |
|---|---|---|---|---|
| 0 | 0.003 | 0.002 | 0.753846 | 0.753909 |
| 1 | 0.001999 | 0.001001 | 0.753846 | 0.753909 |
| 2 | 0.001999 | 0.001001 | 0.754572 | 0.753728 |
| 3 | 0.002 | 0.000999 | 0.753609 | 0.753968 |
| 4 | 0.001 | 0.002 | 0.753609 | 0.753968 |


In [18]:
from helper_func_model_selection import model_selection

In [19]:
models = model_selection("dummy", "dtree", "knn", "svm", "nb", "lr")
models

{'Dummy Classifier': DummyClassifier(random_state=123),
 'Decision Tree': DecisionTreeClassifier(random_state=123),
 'KNN': KNeighborsClassifier(),
 'RBF SVM': SVC(random_state=123),
 'Naive Bayes': BernoulliNB(),
 'Logistic Regression': LogisticRegression(max_iter=1000)}

In [20]:
# The following block of code was inspired by DSCI 571 Lab 4

results_list = []

for name, model in models.items():
    
    # Wrap the current model in a pipeline (the features are already preprocessed)
    pipeline = make_pipeline(model)
    
    # Perform 5-fold cross-validation
    cv_results = cross_validate(
        pipeline, X_train, y_train, cv=5,
        return_train_score=True,
        scoring='accuracy',
        n_jobs=-1
    )
    
    # Append results for the current model to the results_list
    results_list.append({
        "model": name,
        "fit_time": np.mean(cv_results['fit_time']),
        "score_time": np.mean(cv_results['score_time']),
        "test_score": np.mean(cv_results['test_score']),
        "train_score": np.mean(cv_results['train_score']),
    })

# Create a DataFrame from the results_list
results_df = pd.DataFrame(results_list)

# Set the model name as the index
results_df.set_index('model', inplace=True)

# Show the resulting DataFrame
results_df

| model | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| Dummy Classifier | 0.0036 | 0.003 | 0.753896 | 0.753896 |
| Decision Tree | 0.039512 | 0.003202 | 0.983261 | 0.999856 |
| KNN | 0.012396 | 0.120998 | 0.994228 | 0.995334 |
| RBF SVM | 0.073188 | 0.031201 | 0.996921 | 0.997691 |
| Naive Bayes | 0.066279 | 0.007 | 0.974216 | 0.974601 |
| Logistic Regression | 0.029599 | 0.003404 | 0.995766 | 0.9962 |


The cross-validation results show that every candidate model comfortably beats the Dummy Classifier baseline (about 0.754 accuracy). The Decision Tree, KNN, and RBF SVM models are all highly accurate, with the RBF SVM achieving the highest mean validation accuracy (0.9969). Logistic Regression is only marginally behind (0.9958), and it has one important advantage: interpretability. Its coefficients show the direction and relative strength of each feature's association with wine color, which the SVM does not readily provide. We therefore choose Logistic Regression, trading a negligible amount of accuracy for a model whose behavior we can explain.




In [21]:
# Fit the Logistic Regression model and score it on the test portion
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
model.score(X_test, y_test)

0.9884615384615385

In [22]:
# Producing the table to present the marginal contribution of each feature.
reg_data = {
    'Feature Name': model.feature_names_in_,
    'Coefficient': model.coef_[0]
}

result_df = pd.DataFrame(reg_data)

result_df['Feature Name'] = result_df['Feature Name'].str.replace('standardscaler__', '')
result_df['Feature Name'] = result_df['Feature Name'].str.replace('ordinalencoder__', '')

result_df

| | Feature Name | Coefficient |
|---|---|---|
| 0 | fixed acidity | 0.152154 |
| 1 | volatile acidity | -1.378425 |
| 2 | citric acid | 0.251229 |
| 3 | residual sugar | 3.214798 |
| 4 | chlorides | -0.71337 |
| 5 | free sulfur dioxide | -0.380639 |
| 6 | total sulfur dioxide | 2.647135 |
| 7 | density | -5.004208 |
| 8 | pH | 0.012768 |
| 9 | sulphates | -0.427502 |


In [23]:
# To visualize the above result

result_df['Wine Prediction'] = result_df['Coefficient'].apply(lambda x: 'Predicting White Wine' if x > 0 else 'Predicting Red Wine')

chart = alt.Chart(result_df).mark_bar().encode(
    x='Coefficient',
    y='Feature Name',
    color=alt.Color('Wine Prediction', 
                    legend=alt.Legend(title="Wine Prediction"), 
                    scale=alt.Scale(domain=['Predicting Red Wine', 'Predicting White Wine'], 
                                    range=['red', 'peachpuff'])
                   )
)

chart

The coefficients of the logistic regression model quantify how each feature shifts the predicted log-odds of a wine being white rather than red. The sign is read relative to the second entry of `model.classes_`, which is `white`: features with positive coefficients, such as residual sugar and total sulfur dioxide, increase the probability of a wine being classified as white. Conversely, features with negative coefficients, such as alcohol, volatile acidity, chlorides, and above all density, push the prediction toward red. Because the numeric features were standardized, the magnitudes can be compared directly: density and alcohol have the largest influence in the negative direction, while residual sugar has the largest influence in favor of white wine. The feature 'quality' also plays a role, albeit a smaller one, in swaying the classification towards red wine.
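
To make the sign convention concrete, here is a minimal sketch on synthetic data rather than the wine data. The single feature `toy_sugar` and its distribution are hypothetical; the point is only that scikit-learn sorts the class labels, so in a binary problem `coef_` describes the log-odds of `classes_[1]`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(522)
n = 400

# Hypothetical labels and one standardized-looking feature that is higher for "white"
y = np.array(["red"] * (n // 2) + ["white"] * (n // 2))
x = np.where(y == "white", 1.0, -1.0) + rng.normal(scale=1.0, size=n)
X = pd.DataFrame({"toy_sugar": x})

clf = LogisticRegression(max_iter=1000).fit(X, y)

# classes_ is sorted alphabetically, so the second entry is the "positive" class
positive_class = clf.classes_[1]
coef = clf.coef_[0][0]

# exp(coef) is the multiplicative change in the odds of the positive class
# for a one-unit increase in the feature
odds_ratio = np.exp(coef)

print(f"Positive class: {positive_class}")
print(f"Coefficient: {coef:.3f}, odds ratio: {odds_ratio:.3f}")
```

A positive coefficient therefore means "more of this feature makes `white` more likely", and an odds ratio above 1 says the same thing on the odds scale.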

## Conclusion

The logistic regression analysis reveals expected relationships between wine characteristics and their classification as red or white. Residual sugar's positive coefficient aligns with the higher levels typically found in white wines, indicating a greater likelihood of a wine being classified as white as the sugar content increases. Similarly, the positive coefficient for total sulfur dioxide corresponds with the higher concentrations in white wines. The negative coefficients for alcohol and density suggest a higher probability of a wine being classified as red as these values increase, which is consistent with red wines generally having higher alcohol content and with the higher mean density of the red wines in this dataset (0.9967 versus 0.9940 for white). These results underline the importance of considering the context and interactions of features within the dataset when interpreting model outcomes.



Nevertheless, it's important to remember that the signs and magnitudes of coefficients in logistic regression are influenced by the scale of the features and the correlations between them. These factors can affect the interpretability of the coefficients in complex ways, especially if there is multicollinearity in the data. Therefore, while the results are plausible and show some expected trends, any surprising findings would warrant a deeper investigation into the data and the model's behavior.
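
The dependence of coefficient magnitude on feature scale can be illustrated with a small synthetic experiment, sketched below under stated assumptions: the data are simulated (not the wine data), the helper `fitted_coef` is hypothetical, and a large `C` is used so that regularization barely affects the fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical binary target (0 = red, 1 = white) and one overlapping feature
y = rng.integers(0, 2, size=n)
x = y + rng.normal(scale=1.0, size=n)

def fitted_coef(feature):
    """Fit a weakly regularized logistic regression and return its single coefficient."""
    clf = LogisticRegression(C=1e4, max_iter=5000)
    clf.fit(feature.reshape(-1, 1), y)
    return clf.coef_[0][0]

coef_original = fitted_coef(x)       # feature in its original units
coef_rescaled = fitted_coef(x * 10)  # same feature multiplied by 10

ratio = coef_original / coef_rescaled
print(f"original: {coef_original:.3f}, rescaled: {coef_rescaled:.3f}, ratio: {ratio:.2f}")
```

Multiplying a feature by 10 shrinks its coefficient by roughly the same factor while leaving the predictions essentially unchanged, which is why coefficients are only comparable across features once the features share a common scale.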



# References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality Dataset. UCI Machine Learning Repository. Retrieved from https://archive.ics.uci.edu/dataset/186/wine+quality

Timbers, T. (2023). Breast Cancer Predictor Python Repository. GitHub. Retrieved from https://github.com/ttimbers/breast_cancer_predictor_py/tree/v0.0.2

Mor, N. S. (2022). Wine Quality and Type Prediction from Physicochemical Properties Using Neural Networks for Machine Learning: A Free Software for Winemakers and Customer. Retrieved from https://osf.io/ph4cu/download

UBC Master of Data Science. (2023). DSCI 571: Supervised Learning I. UBC GitHub. Retrieved from https://github.ubc.ca/MDS-2023-24/DSCI_571_sup-learn-1_students