# Technical Breakdown of Predictive Models

### Cluster Analysis of Flavors

In [None]:
# insert Cate's Cluster visual

### Linear Model Using the Continuous Variables from the Data

In [16]:
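from sklearn import linear_model
from sklearn.metrics import r2_score

# Fit a ridge (L2-regularized) linear regression on the continuous features
reg = linear_model.Ridge()
model = reg.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For regressors, .score() returns R-squared, so both lines print the same value
print(f'R-Squared for test data: {model.score(X_test, y_test)}')
print(f'R-Squared for predictions: {r2_score(y_test, y_pred)}')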

R-Squared for test data: 0.13880089747170443
R-Squared for predictions: 0.13880089747170443


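* An R-squared of about 0.14 means the model explains only around 14% of the variance in the test scores, so the predicted scores track the actual scores only weakly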
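
To make the R-squared figure concrete, here is a minimal sketch of how the statistic is computed: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The function name `r_squared` and the toy scores are illustrative assumptions, not values from our data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical wine scores (not taken from the dataset)
actual = [85.0, 88.0, 90.0, 92.0, 95.0]
perfect = list(actual)
always_mean = [sum(actual) / len(actual)] * len(actual)

r2_perfect = r_squared(actual, perfect)      # predictions match exactly
r2_mean = r_squared(actual, always_mean)     # predicting the mean every time
print(r2_perfect, r2_mean)
```

A model that always predicts the mean scores 0 on this measure, so a value near 0.14 is only a modest improvement over that trivial predictor.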

### Classifier Models

### First Attempt - A Ridge Classifier Model

In [15]:
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import r2_score

clf = RidgeClassifier().fit(X_train, y_train)
# For classifiers, .score() returns mean accuracy, not R-squared
print(f'Accuracy for test data: {clf.score(X_test, y_test)}')
y_pred = clf.predict(X_test)
# R-squared here treats the class labels as numbers; kept for comparison with the linear model
print(f'R-Squared for predictions: {r2_score(y_test, y_pred)}')

Accuracy for test data: 0.620746319753509
R-Squared for predictions: 0.12766027469626617
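
The two printed numbers differ because they measure different things: a classifier's `.score()` is the share of labels predicted exactly right, whereas R-squared treats the labels as numbers and penalizes each miss by its squared distance. The sketch below uses made-up labels to show the same predictions producing two quite different figures; the helper names and values are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, treating labels as numeric values."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical encoded score classes: two misses, one off by 1 and one off by 2
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_demo = [0, 1, 1, 1, 2, 0, 3, 3]

acc = accuracy(y_true_demo, y_pred_demo)
r2 = r_squared(y_true_demo, y_pred_demo)
print(f'accuracy={acc}, r_squared={r2}')
```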


### Second Attempt - A Decision Tree Model

In [16]:
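from sklearn import tree

dt = tree.DecisionTreeClassifier()
# Fit the decision tree itself rather than the earlier ridge classifier `clf`
dt = dt.fit(X_train, y_train)
# .score() on a classifier is mean accuracy
print(f'Accuracy for test data: {dt.score(X_test, y_test)}')
y_pred = dt.predict(X_test)
print(f'R-Squared for predictions: {r2_score(y_test, y_pred)}')

Accuracy for test data: 0.620746319753509
R-Squared for predictions: 0.12766027469626617

* This recorded output is identical to the ridge classifier's because the cell was run with `clf.fit` instead of `dt.fit`; the decision tree's own scores are not captured here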


### Third Attempt - A Random Forest Model

In [17]:
from sklearn.ensemble import RandomForestClassifier

# 200 trees; no random_state is set, so results vary slightly between runs
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(X_train, y_train)
# .score() on a classifier is mean accuracy
print(f'Accuracy for test data: {rf.score(X_test, y_test)}')
y_pred = rf.predict(X_test)
print(f'R-Squared for predictions: {r2_score(y_test, y_pred)}')

Accuracy for test data: 0.6457377610407394
R-Squared for predictions: 0.13993908775622077


In [18]:
# Independent-variable (IV) features ranked by importance, highest first
sorted(zip(rf.feature_importances_, X_test.columns), reverse=True)

[(0.5044792019960175, 'price'),
 (0.37958675850207435, 'word_count'),
 (0.11593403950190773, 'age')]
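
Random forest importances in scikit-learn are normalized to sum to 1, so each value can be read as a share of the total. As a small sketch, the three figures printed above can be turned into percentages; the variable names here are illustrative.

```python
# Importances copied from the output above
importances = [
    (0.5044792019960175, 'price'),
    (0.37958675850207435, 'word_count'),
    (0.11593403950190773, 'age'),
]

total = sum(value for value, _ in importances)

# Express each importance as a percentage of the total, rounded to one decimal
shares = {name: round(100 * value / total, 1) for value, name in importances}

for name, pct in shares.items():
    print(f'{name}: {pct}%')
```

These importances describe how much each feature contributed to the splits the forest learned on the training data; they do not say how well the model generalizes.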

### Final Attempt - Random Forest With Strongest Predictor (Price)

In [19]:
from sklearn.model_selection import train_test_split

# Keep only price, reshaped to the single-column 2-D array scikit-learn expects
X2 = df['price'].values.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X2, encoded_y, random_state=42)
rf = RandomForestClassifier(n_estimators=200)
rf = rf.fit(X_train, y_train)
# .score() on a classifier is mean accuracy
print(f'Accuracy for test data: {rf.score(X_test, y_test)}')
y_pred = rf.predict(X_test)
print(f'R-Squared for predictions: {r2_score(y_test, y_pred)}')

Accuracy for test data: 0.6337897980143786
R-Squared for predictions: 0.12276429227362606
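
A classifier score near 0.63 is easier to judge against a trivial baseline, such as always predicting the most common class in the training set. We did not record that baseline, so the following is only a sketch of how it could be computed; the function name and the demo labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return sum(label == most_common_label for label in y_test) / len(y_test)


# Hypothetical encoded labels, not the wine data
y_train_demo = [1, 1, 1, 2, 2, 3]
y_test_demo = [1, 2, 1, 3]

baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(f'Majority-class baseline accuracy: {baseline}')
```

If the baseline on the real labels were close to the model's score, the model would be adding little beyond knowing which class is most common.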


### Limitations
* Too many distinct values in our categorical variables
* Subjectivity of wine scoring practices
* Reviews written by amateur reviewers
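
One common way to tame a categorical variable with too many distinct values is to keep the frequent categories and fold the rare ones into a single bucket before encoding. The sketch below is an assumption about how that could be done, not a step we ran; `collapse_rare`, the `min_count` threshold, and the sample varieties are all hypothetical.

```python
from collections import Counter


def collapse_rare(values, min_count=2, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical grape varieties
varieties = ['Pinot Noir', 'Chardonnay', 'Pinot Noir',
             'Zweigelt', 'Chardonnay', 'Furmint']

collapsed = collapse_rare(varieties)
print(collapsed)
```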

### Ultimately, machine learning will not cost any sommeliers their jobs
* Wine quality assessments appear to be very subjective
* There is a lot of variability in the relationships between score and other descriptive variables
* After trying several model types and data manipulation approaches, none of our models predicted wine scores well: the best classifier reached a test accuracy of about 0.65, and no R-squared reached 0.14

In [20]:
%%HTML
<iframe src="https://giphy.com/embed/E3L5goMMSoAAo" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/amy-schumer-E3L5goMMSoAAo">via GIPHY</a></p>