# Goals
My aim here is to create a wine recommendation system for a friend's web application. Presumably, the web application's database would already be populated with various wines and their characteristics.

The idea would be for a user to sign up with the app, and slowly build out their rankings of various wines they've tried.

Once a user has ranked enough wines, the model below would be used to predict how they would rank new wines, based on the characteristics of the wines they have already ranked.

The following test uses an existing database of various wines, their characteristics, and simulated user rankings in order to determine the best type of learning model for the given circumstances.

In [1]:
import pandas as pd
import numpy as np

# Importing Data
I found a dataset containing wine characteristics that I'll use to simulate a database of various wines and their features.

In [2]:
column_names = [
    'Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 
    'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 
    'Proanthocyanins', 'Color intensity', 'Hue', 
    'OD280/OD315 of diluted wines', 'Proline'
]

path = '/Users/caseyfranco/Desktop/Data Science Resources/Wine Data/wine.data'
df = pd.read_csv(path, header=None, names=column_names)

df.head()

,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
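
The path above points at a local file, so here is a minimal sketch of loading what appears to be the same data without it. It assumes the file is the UCI Wine dataset (the column names and first rows suggest so), which scikit-learn bundles as `load_wine()`. The name `df_alt` is hypothetical, and scikit-learn's column names are lowercase with underscores rather than the names used above.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Assumption: the local wine.data file is the UCI Wine dataset,
# which scikit-learn ships as load_wine().
wine = load_wine(as_frame=True)

# wine.frame holds the 13 feature columns plus a 'target' column (0, 1, 2).
df_alt = wine.frame.copy()

# Shift the target to 1-3 and move it to the front so it lines up with 'Class'.
df_alt.insert(0, 'Class', df_alt.pop('target') + 1)

print(df_alt.shape)
print(df_alt.head())
```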


# User Ranking Simulation
Because I don't have access to these wines, I can't rate them myself. Filling the user rating column with randomly generated numbers wouldn't work either, since random values can't reflect a person having a particular taste for a certain type of wine.

To get around both problems, I'll pick two columns arbitrarily (Alcohol and Total phenols), normalize each to a 0-1 range, create a "taste score" by summing the two, and then rescale that score to 1-10 to simulate a user rating.

This should create a relationship between the user ratings and the wine features that the predictive models can try to identify.

Lastly, I'll remove the normalized columns and the taste score from the training data.

In reality, I wouldn't know which features best predict a user's ratings, but I could use feature importance to estimate which ones matter most to that person.

In [13]:
# Create new, normalized Alcohol and Phenols columns 
df['Alcohol_norm'] = (df['Alcohol'] - df['Alcohol'].min()) / (df['Alcohol'].max() - df['Alcohol'].min())
df['Total_phenols_norm'] = (df['Total phenols'] - df['Total phenols'].min()) / (df['Total phenols'].max() - df['Total phenols'].min())

# Compute a simple taste score based on these features
df['Taste_Score'] = df[['Alcohol_norm', 'Total_phenols_norm']].sum(axis=1)

# Scale the Taste_Score to a 1-10 scale for the User_Rating
df['User_Rating'] = 1 + (df['Taste_Score'] - df['Taste_Score'].min()) * 9 / (df['Taste_Score'].max() - df['Taste_Score'].min())

# Drop the helper columns (normalized features and taste score) so they aren't used as features
df.drop(['Alcohol_norm', 'Total_phenols_norm', 'Taste_Score'], axis=1, inplace=True)

df.head()

,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline,User_Rating
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,7.599563
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050,5.657262
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185,5.905169
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480,10.0
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735,6.031853
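
The feature-importance idea mentioned above could look roughly like the sketch below. It is not part of the original notebook: it assumes `load_wine()` from scikit-learn as a stand-in for the local file, rebuilds the simulated rating with a hypothetical `min_max` helper, and reads the impurity-based `feature_importances_` of a fitted Random Forest.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor

# Assumption: load_wine() stands in for the local wine.data file.
features = load_wine(as_frame=True).data


def min_max(series):
    """Rescale a column to the 0-1 range."""
    return (series - series.min()) / (series.max() - series.min())


# Rebuild the simulated rating from alcohol and total phenols, as above.
taste = min_max(features['alcohol']) + min_max(features['total_phenols'])
rating = 1 + 9 * min_max(taste)

forest = RandomForestRegressor(random_state=42)
forest.fit(features, rating)

# Impurity-based importances sum to 1; a larger value means the feature
# accounted for more of the error reduction across the trees' splits.
importances = pd.Series(forest.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head(5))
```

With real user ratings, the same few lines would show which characteristics a given user's scores seem to track, though impurity-based importances can be split between correlated features.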


# Model Evaluation
I'll try simple Linear, Ridge, and Lasso regressions as well as a Random Forest, to test how well each picks up the relationship between a user's ratings and the characteristics of the wines, and so how well it predicts the rating a user might give a new wine.

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop('User_Rating', axis=1)  # Features
y = df['User_Rating']  # Target variable

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest Regression': RandomForestRegressor(random_state=42)
}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predictions
    
    # Evaluation
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f'{name} - Mean Squared Error: {mse:.2f}, R^2 Score: {r2:.2f}')


Linear Regression - Mean Squared Error: 0.00, R^2 Score: 1.00
Ridge Regression - Mean Squared Error: 0.00, R^2 Score: 1.00
Lasso Regression - Mean Squared Error: 1.24, R^2 Score: 0.68
Random Forest Regression - Mean Squared Error: 0.07, R^2 Score: 0.98


In [16]:
# It seems likely that I would use Random Forest, but I'll check the cross-validation scores to be sure
from sklearn.model_selection import cross_val_score

# Adjusted loop to include cross-validation
for name, model in models.items():
    # Perform 5-fold cross-validation
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    
    # Convert scores to positive (because cross_val_score returns negative values for MSE to maximize the score)
    cv_scores_positive = -cv_scores
    
    # Calculate the mean and standard deviation of the cross-validated scores
    mean_cv_score = np.mean(cv_scores_positive)
    std_cv_score = np.std(cv_scores_positive)
    
    print(f'{name} - CV Mean Squared Error: {mean_cv_score:.2f}, Std: {std_cv_score:.2f}')


Linear Regression - CV Mean Squared Error: 0.00, Std: 0.00
Ridge Regression - CV Mean Squared Error: 0.00, Std: 0.00
Lasso Regression - CV Mean Squared Error: 2.27, Std: 0.68
Random Forest Regression - CV Mean Squared Error: 0.36, Std: 0.14
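
One possible reason Random Forest looks worse in cross-validation than on the test split: the data appears to be sorted by Class (the first rows shown are all class 1), and `cv=5` on a regressor uses unshuffled `KFold`, so each fold may be dominated by a single class. The sketch below shows how shuffled folds could be passed instead. It assumes `load_wine()` as a stand-in for the local file, uses a hypothetical `min_max` helper, and omits the Class column from the features.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Assumption: load_wine() stands in for the local wine.data file.
features = load_wine(as_frame=True).data


def min_max(series):
    """Rescale a column to the 0-1 range."""
    return (series - series.min()) / (series.max() - series.min())


# Rebuild the simulated rating from alcohol and total phenols.
taste = min_max(features['alcohol']) + min_max(features['total_phenols'])
rating = 1 + 9 * min_max(taste)

# Shuffle before splitting so each fold mixes all three wine classes.
shuffled_folds = KFold(n_splits=5, shuffle=True, random_state=42)

forest = RandomForestRegressor(random_state=42)
scores = -cross_val_score(
    forest, features, rating,
    cv=shuffled_folds, scoring='neg_mean_squared_error',
)
print(f'Shuffled CV MSE: {scores.mean():.2f} (std {scores.std():.2f})')
```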


# Conclusion
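Based on the model results, the Linear and Ridge regressions score perfectly (an MSE of 0.00 on both the test split and in cross-validation). This is unlikely to be overfitting in the usual sense, since the scores hold up on held-out data. It is more likely a side effect of the simulation: the user rating is an exact linear function of Alcohol and Total phenols, both of which are in the feature set, so a linear model can recover it exactly. Lasso regression has an underwhelming R-squared score of 0.68.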

Random Forest seems to be a good fit. However, I can't confirm that until I create my own database and populate it with actual user scores.
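
Until real scores exist, one rough way to stress the comparison is to add noise to the simulated ratings so they are no longer perfectly linear. The sketch below does that under loud assumptions: `load_wine()` stands in for the local file, the `min_max` helper is hypothetical, and the noise level (a standard deviation of one rating point) is an arbitrary choice, not something measured from users.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: load_wine() stands in for the local wine.data file.
features = load_wine(as_frame=True).data


def min_max(series):
    """Rescale a column to the 0-1 range."""
    return (series - series.min()) / (series.max() - series.min())


taste = min_max(features['alcohol']) + min_max(features['total_phenols'])
clean_rating = 1 + 9 * min_max(taste)

# Hypothetical noise level: one rating point of standard deviation,
# clipped so ratings stay on the 1-10 scale.
rng = np.random.default_rng(42)
noise = rng.normal(0, 1.0, size=len(clean_rating))
noisy_rating = np.clip(clean_rating + noise, 1, 10)

folds = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regression': RandomForestRegressor(random_state=42),
}

results = {}
for name, model in candidates.items():
    mse = -cross_val_score(
        model, features, noisy_rating,
        cv=folds, scoring='neg_mean_squared_error',
    )
    results[name] = mse.mean()
    print(f'{name} - CV Mean Squared Error: {results[name]:.2f}')
```

Neither model can score perfectly here, which makes the comparison a little closer to what real, inconsistent human ratings might produce.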

There's no computing for good taste!

From here the model would be saved using Joblib and brought into Flask where it would be integrated with the web app.
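
A minimal sketch of the save-and-reload step, using `joblib.dump` and `joblib.load`. The data here is a tiny synthetic stand-in, and the file name and temporary directory are hypothetical; a real app would pick its own storage location and load the model when the Flask app starts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic stand-in for the wine features and ratings (not real data).
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = 1 + 9 * X_demo[:, 0]

model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), 'wine_model.joblib')
joblib.dump(model, model_path)

# Later (for example at web app start-up), load the model back and predict.
restored = joblib.load(model_path)
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(f'Restored model matches original: {same_predictions}')
```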

This program would also need to be fleshed out. Its job would be to cycle regularly through unranked wines, score them against the user's existing preferences, recommend those with the highest predicted rankings, and retrain each time a user gives a new ranking or revises an old one.
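
The loop described above could be sketched as follows. Everything here is illustrative: the `recommend` function, the feature names, and the synthetic data are hypothetical stand-ins for the web app's database, and retraining from scratch on every call is just the simplest option.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def recommend(wines, ratings, top_n=3):
    """Retrain on the user's ratings and return the top unrated wines.

    wines:   DataFrame of wine features, indexed by wine id
    ratings: Series of the user's ratings, indexed by wine id
    """
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(wines.loc[ratings.index], ratings)

    # Score only the wines this user has not rated yet.
    unrated = wines.drop(index=ratings.index)
    predicted = pd.Series(model.predict(unrated), index=unrated.index)
    return predicted.sort_values(ascending=False).head(top_n)


# Synthetic stand-in data (hypothetical feature names, not the real database).
rng = np.random.default_rng(0)
wines = pd.DataFrame(rng.random((40, 3)), columns=['alcohol', 'phenols', 'hue'])

# Pretend the user has rated wines 0-19, with ratings driven by alcohol.
ratings = 1 + 9 * wines.loc[:19, 'alcohol']
print(recommend(wines, ratings))

# When the user rates another wine, record it and call recommend() again,
# which retrains the model on the updated ratings.
ratings.loc[25] = 8.0
print(recommend(wines, ratings))
```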