# Find Fine Wine
- How can you dine without some fine wine?

## Goal
* Discover drivers of wine quality
* Use drivers to develop clusters or groupings
* Use drivers and clusters to develop a machine learning model to predict wine quality

## Imports

In [1]:
# local imports
import wrangle as w
import explore as e

### Acquire

- Data acquired from data.world
- Data initially acquired on 23 May 2023
- It contained 6497 rows and 13 columns before cleaning
- Each row represents a red or white variant of the Portuguese "Vinho Verde" wine
- Each feature column represents a physicochemical property of the wine
- Quality (target) is the median of at least 3 evaluations made by wine experts, each on a scale from 0 (very bad) to 10 (very excellent)

### Prepare

- Identified and handled nulls
    - No nulls found
- Renamed columns for readability
- Verified data types
- Handled outliers by keeping only rows where:
    - residual sugar < 33
    - free sulfur dioxide < 280
    - chlorides < 0.21
- Created custom features
- Encoded categorical variables
- Split data into Train, Validate, Test (approx. 60/20/20)
- Scaled using Standard Scaler for clustering and modeling
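The outlier, split, and scaling steps above could look roughly like the sketch below. It is not the contents of `wrangle.py`: the column names (`residual_sugar`, `free_sulfur_dioxide`, `chlorides`), the helper names `drop_outliers` and `split_60_20_20`, the random seed, and the synthetic stand-in data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical column names; the real ones are set in wrangle.py
OUTLIER_LIMITS = {
    "residual_sugar": 33,
    "free_sulfur_dioxide": 280,
    "chlorides": 0.21,
}


def drop_outliers(df, limits):
    """Keep only rows whose values fall below every upper limit."""
    mask = pd.Series(True, index=df.index)
    for col, upper in limits.items():
        mask &= df[col] < upper
    return df[mask]


def split_60_20_20(df, seed=42):
    """Split into train (60%), validate (20%), and test (20%)."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # 0.25 of the remaining 80% is 20% of the original data
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed
    )
    return train, validate, test


# Synthetic stand-in data so the sketch runs on its own
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "residual_sugar": rng.uniform(0, 40, 1000),
    "free_sulfur_dioxide": rng.uniform(0, 300, 1000),
    "chlorides": rng.uniform(0, 0.3, 1000),
    "quality": rng.integers(3, 10, 1000),
})

clean = drop_outliers(demo, OUTLIER_LIMITS)
train_demo, validate_demo, test_demo = split_60_20_20(clean)

# Fit the scaler on train only, then apply it to every split
feature_cols = list(OUTLIER_LIMITS)
scaler = StandardScaler().fit(train_demo[feature_cols])
train_scaled = scaler.transform(train_demo[feature_cols])
validate_scaled = scaler.transform(validate_demo[feature_cols])
test_scaled = scaler.transform(test_demo[feature_cols])
```

Fitting the scaler on the training split alone keeps information from Validate and Test out of the model inputs.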

In [None]:
# acquire, clean, and prepare the data
df = w.wrangle_wine()

# split into train, validate, and test
train, validate, test = w.split_data(df)

#### A brief look at the data

In [None]:
# taking a peek
train.head()

#### Quality (Target) Distribution

In [None]:
# target distribution
e.dist(train)

## Explore
- Is there a correlation between ??? and quality?

### Is there a correlation between ??? and quality?
* $H_0$: There is **NO** correlation between ??? and quality
* $H_a$: There is a correlation between ??? and quality
* Continuous (???) vs Continuous (quality) = Spearman's $r$
    - $r$ = 
    - $p$ = 
* `???` is correlated with `quality`

In [None]:
# explore ??? and quality


**With an alpha of 0.05 (95% confidence level), the p-value is ??? than alpha. Therefore, the evidence suggests that ??? and quality are correlated. I believe that using `???` in the modeling could have a positive impact.**
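A minimal sketch of the Spearman test described above, using `scipy.stats.spearmanr`. The helper name `spearman_test` and the synthetic feature and quality values are assumptions for illustration; they are not results from the wine data.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05


def spearman_test(x, y, alpha=ALPHA):
    """Return Spearman's r, the p-value, and whether H0 is rejected."""
    r, p = stats.spearmanr(x, y)
    return r, p, p < alpha


# Synthetic stand-in data: a feature loosely related to a 3-9 quality score
rng = np.random.default_rng(0)
feature = rng.normal(size=500)
quality = np.clip(np.round(6 + feature + rng.normal(scale=1.0, size=500)), 3, 9)

r, p, reject_null = spearman_test(feature, quality)
print(f"r = {r:.3f}, p = {p:.3g}, reject H0: {reject_null}")
```

In the notebook, the two inputs would presumably be a feature column of `train` and the quality column. Spearman's rank correlation suits this comparison because quality takes only a handful of ordered values.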

## Exploration Summary 
- ???

### Features for Modeling
- Features that will be used:
    - quality (target)
- Some features that may be useful:
    - ???

## Modeling
- RMSE will be the evaluation metric
- Target is quality
- Using the mean of quality as the baseline
    - ??? will be the baseline
- Features scaled using Standard Scaler
- Models will be developed using a few different types, various features, and various hyperparameter configurations
    - Linear Regression
    - Polynomial Features fed into Linear Regression
    - Lasso Lars
    - Tweedie Regressor (GLM)
- Models will be evaluated on Train and Validate
- Only the best-performing model will be evaluated on Test
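The mean-quality baseline and the RMSE metric listed above can be sketched as follows. The helper name `baseline_rmse` and the quality scores are made up for illustration and say nothing about the actual baseline value for this dataset.

```python
import numpy as np


def baseline_rmse(y_train, y_eval):
    """RMSE of always predicting the mean of the training target."""
    baseline = np.mean(y_train)
    errors = np.asarray(y_eval, dtype=float) - baseline
    return float(np.sqrt(np.mean(errors ** 2)))


# Made-up quality scores, only to make the sketch runnable
y_train_demo = np.array([5, 6, 6, 7, 5, 6])
y_validate_demo = np.array([6, 5, 7])

print(baseline_rmse(y_train_demo, y_train_demo))
print(baseline_rmse(y_train_demo, y_validate_demo))
```

The baseline is computed from the training target only and then reused for Validate, so every model is compared against the same constant prediction.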

In [None]:
# split into X and y


# standard scaler


In [None]:
# get baseline


### Best 4 Model Configurations
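One way the four model types could be fitted and compared on Train and Validate is sketched below. The hyperparameters (`degree=2`, `alpha=0.01`, `power=1`), the helper names `rmse` and `compare_models`, and the synthetic data are assumptions for illustration, not the configurations selected in this project.

```python
import numpy as np
from sklearn.linear_model import LassoLars, LinearRegression, TweedieRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Hyperparameters below are placeholders, not tuned values
candidates = {
    "Linear Regression": LinearRegression(),
    "Polynomial Features": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "Lasso Lars": LassoLars(alpha=0.01),
    "Tweedie Regressor": TweedieRegressor(power=1, alpha=0.01),
}


def compare_models(models, X_train, y_train, X_validate, y_validate):
    """Fit each model on train and report RMSE on train and validate."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = {
            "train_rmse": rmse(y_train, model.predict(X_train)),
            "validate_rmse": rmse(y_validate, model.predict(X_validate)),
        }
    return results


# Synthetic, already-scaled stand-in data so the sketch runs on its own
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 6 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=600)

results = compare_models(candidates, X[:400], y[:400], X[400:], y[400:])
for name, scores in results.items():
    print(
        f"{name}: train {scores['train_rmse']:.3f}, "
        f"validate {scores['validate_rmse']:.3f}"
    )
```

A large gap between the train and validate RMSE for a given model would hint at overfitting, which is why both are reported before anything is run on Test.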

#### Linear Regression

In [None]:
# linear regression results


#### Polynomial Features

In [None]:
# polynomial features results


#### Lasso Lars

In [None]:
# lasso lars results


#### Tweedie Regressor

In [None]:
# tweedie results


### Best to Test
- ???

In [None]:
# best model eval with test data


#### How does it compare?

In [None]:
# plot predictions vs actual on test


### Modeling Wrap Up
- ???

## Conclusion

### Takeaways and Key Findings
- ???

### Recommendations and Next Steps
- ???