# Wine points prediction

In [38]:
%load_ext autoreload
%autoreload 2
import sys; sys.path.append('../')



Here we will try to predict the points a wine will get based on known characteristics (i.e. features, in ML terminology). The main point at this stage is to establish a simple, ideally very cost-effective, baseline to which models with increased complexity and resource demands will be compared.

In the real world there is a tradeoff between complexity and performance, and part of the DS job is to present a tradeoff table showing what performance is achievable at what complexity level. Complexity should then be translated into cost. For example:
 * Compute cost 
 * Maintenance cost
 * Serving costs (i.e. is new platform needed?) 
 

## Loading the data

In [8]:
import pandas as pd
import cufflinks as cf; cf.go_offline()

In [9]:
wine_reviews = pd.read_csv("data/winemag-data-130k-v2.csv")
wine_reviews.shape

(129971, 14)

In [10]:
wine_reviews.sample(5)

| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3649 | 3649 | US | Sagemoor's old Cab vines comprise 79% of the b... | Sagemoor Vineyard | 93 | 40.0 | Washington | Columbia Valley (WA) | Columbia Valley | Paul Gregutt | @paulgwine | Walla Walla Vintners 2005 Sagemoor Vineyard Ca... | Cabernet Sauvignon | Walla Walla Vintners |
| 1047 | 1047 | US | This is a pretty good buy for a Merlot that sh... | | 85 | 16.0 | California | Napa Valley | Napa | | | Irony 2004 Merlot (Napa Valley) | Merlot | Irony |
| 11083 | 11083 | US | Full blueberry and blackberry aromas can't ove... | Grandmére | 81 | 25.0 | California | Amador County | Sierra Foothills | | | Renwood 1997 Grandmére Zinfandel (Amador County) | Zinfandel | Renwood |
| 122702 | 122702 | Australia | Rosemount's Diamond Label Chardonnay used to b... | Diamond Label | 88 | 10.0 | Australia Other | South Eastern Australia | | Joe Czerwinski | @JoeCz | Rosemount 2009 Diamond Label Chardonnay (South... | Chardonnay | Rosemount |
| 8526 | 8526 | Italy | Made entirely with Sangiovese, the shy nose ev... | Dinostro | 87 | 20.0 | Tuscany | Toscana | | Kerin O’Keefe | @kerinokeefe | Podere Il Castellaccio 2015 Dinostro Sangioves... | Sangiovese | Podere Il Castellaccio |


## Points prediction

Points is a discrete but ordered numeric target (integer scores). Therefore we treat this as a prediction (regression) problem, as opposed to a classification problem. Regression solutions can be measured with several metrics:

* MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
* R2 - [R squared (coefficient of determination)](https://en.wikipedia.org/wiki/Coefficient_of_determination)
* MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)

Read more [here](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b)
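To make the three metrics concrete, here is a minimal, dependency-free sketch that computes them by hand. The scores are made up for illustration and do not come from the wine dataset; the notebook itself relies on the scikit-learn implementations.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical true and predicted points for four wines.
y_true = [85, 88, 90, 93]
y_pred = [86, 88, 89, 91]

print("MSE:", mse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("R2: ", r2(y_true, y_pred))
```

MSE penalizes large errors more heavily than MAE because residuals are squared, while R2 expresses the error relative to simply predicting the mean of the true values.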

### Train and test set split

To properly report results, let's split the data into train and test sets:

In [11]:
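# Note: no random_state is set, so the split (and the numbers reported below) will vary between runs.
train_data = wine_reviews.sample(frac=0.8)
# .copy() lets us add prediction columns to test_data later without a SettingWithCopyWarning.
test_data = wine_reviews[~wine_reviews.index.isin(train_data.index)].copy()
assert len(train_data) + len(test_data) == len(wine_reviews)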

In [12]:
len(test_data), len(train_data)

(25994, 103977)
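If reproducible results are needed, one option is to fix the random seed of the split. The sketch below uses a small toy frame as a stand-in for `wine_reviews`; the helper name `split_train_test` and the seed value 42 are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for wine_reviews (hypothetical values).
toy = pd.DataFrame({"points": range(80, 100)})


def split_train_test(df, frac=0.8, seed=42):
    """Sample `frac` of the rows for training; the remaining rows form the test set."""
    train = df.sample(frac=frac, random_state=seed)
    test = df.drop(train.index)  # assumes the index has no duplicates
    return train, test


train_a, test_a = split_train_test(toy)
train_b, test_b = split_train_test(toy)

print(len(train_a), len(test_a))
# The same seed yields the same rows on every call.
print(train_a.index.equals(train_b.index))
```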

### Baselines

In [40]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [41]:
def calc_prediction_quality(df, pred_score_col, true_score_col):
    return pd.Series({'MSE': mean_squared_error(df[true_score_col], df[pred_score_col]),
                      'MAE': mean_absolute_error(df[true_score_col], df[pred_score_col]),
                      'R2': r2_score(df[true_score_col], df[pred_score_col])})

#### Baseline 1

The most basic baseline is simply the average points. The implementation is as simple as:

In [9]:
test_data['basiline_1_predicted_points'] = train_data.points.mean()
b1_stats = calc_prediction_quality(test_data, 'basiline_1_predicted_points', 'points')
b1_stats

MSE    9.358940e+00
MAE    2.502118e+00
R2    -4.132097e-09
dtype: float64
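An R2 of essentially zero is expected here: R2 compares a model against always predicting the mean of the evaluated targets, which is almost exactly what this baseline does. The tiny negative value presumably arises because the train mean differs very slightly from the test mean. A small sketch with made-up numbers (not taken from the dataset) illustrates both cases:

```python
def r2(y_true, y_pred):
    """R squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical test-set points; their mean is 88.
test_points = [84, 87, 88, 91, 90]
test_mean = sum(test_points) / len(test_points)

# Predicting the test set's own mean gives R2 = 0 by construction.
r2_same_mean = r2(test_points, [test_mean] * len(test_points))

# Predicting a mean estimated elsewhere (hypothetical train mean) gives R2 < 0.
train_mean = 88.5
r2_other_mean = r2(test_points, [train_mean] * len(test_points))

print(r2_same_mean, r2_other_mean)
```

Any useful model should therefore score clearly above zero on R2.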

#### Baseline 2

We can probably improve by predicting the average score based on the origin country:

In [10]:
avg_points_by_country = train_data.groupby('country').points.mean()
avg_points_by_country.head()

country
Argentina                 86.695950
Armenia                   88.000000
Australia                 88.565969
Austria                   90.056719
Bosnia and Herzegovina    86.500000
Name: points, dtype: float64

In [11]:
test_data['basiline_2_predicted_points'] = test_data.country.map(avg_points_by_country).fillna(train_data.points.mean())
b2_stats = calc_prediction_quality(test_data, 'basiline_2_predicted_points', 'points')
b2_stats

MSE    8.900472
MAE    2.437661
R2     0.048987
dtype: float64

#### Baseline 3

Adding more breakdowns increases granularity but can result in overfitting, since small groups yield noisy averages. Still, let's try grouping by both country and province:

In [12]:
avg_points_by_country_and_region = train_data.groupby(['country','province']).points.mean().rename('basiline_3_predicted_points')
avg_points_by_country_and_region.head()

country    province        
Argentina  Mendoza Province    86.814476
           Other               85.983982
Armenia    Armenia             88.000000
Australia  Australia Other     85.531707
           New South Wales     87.545455
Name: basiline_3_predicted_points, dtype: float64

In [13]:
test_data_with_baseline_3 = test_data.merge(avg_points_by_country_and_region, on = ['country','province'], how='left')
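# Fall back to the country-level prediction, then to the global mean.
# Use the merged frame's own columns: after the merge its index no longer matches test_data's index.
test_data_with_baseline_3['basiline_3_predicted_points'] = (
    test_data_with_baseline_3['basiline_3_predicted_points']
    .fillna(test_data_with_baseline_3['basiline_2_predicted_points'])
    .fillna(test_data_with_baseline_3['basiline_1_predicted_points'])
)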
test_data_with_baseline_3.shape, test_data.shape

((25994, 17), (25994, 16))

In [14]:
b3_stats = calc_prediction_quality(test_data_with_baseline_3, 'basiline_3_predicted_points', 'points')
b3_stats

MSE    8.380775
MAE    2.349499
R2     0.104517
dtype: float64
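The three baselines form a fallback chain: use the (country, province) average when that pair was seen in training, otherwise the country average, otherwise the global average. The sketch below replays that chain on a tiny made-up dataset; the frames and the `province_mean` and `prediction` column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Tiny hypothetical train and test sets.
train = pd.DataFrame({
    "country":  ["US", "US", "US", "Italy", "Italy"],
    "province": ["California", "California", "Oregon", "Tuscany", "Tuscany"],
    "points":   [90, 92, 88, 86, 88],
})
test = pd.DataFrame({
    "country":  ["US", "US", "Italy", "France"],
    "province": ["California", "Washington", "Tuscany", "Bordeaux"],
})

global_mean = train["points"].mean()
by_country = train.groupby("country")["points"].mean()
by_province = (
    train.groupby(["country", "province"])["points"]
    .mean()
    .rename("province_mean")
    .reset_index()
)

# A left merge keeps every test row; unseen (country, province) pairs get NaN.
pred = test.merge(by_province, on=["country", "province"], how="left")
pred["prediction"] = (
    pred["province_mean"]
    .fillna(pred["country"].map(by_country))  # fall back to the country average
    .fillna(global_mean)                      # then to the global average
)
print(pred[["country", "province", "prediction"]])
```

All fallbacks are computed from columns of the merged frame itself, so the `fillna` calls align on the same index.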

### Baselines summary

In [15]:
baseline_summary = pd.DataFrame([b1_stats, b2_stats, b3_stats], index=['baseline_1', 'baseline_2','baseline_3'])
baseline_summary

| | MSE | MAE | R2 |
|---|---|---|---|
| baseline_1 | 9.35894 | 2.502118 | -4.132097e-09 |
| baseline_2 | 8.900472 | 2.437661 | 0.04898723 |
| baseline_3 | 8.380775 | 2.349499 | 0.1045167 |


In [16]:
# Keep the index: it holds the baseline names, which index=False would drop from the file.
baseline_summary.to_csv('data/baselines_summary.csv', index_label='model')

## Training a boosted-trees regressor

In [17]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#### Preparing data - Label encoding categorical features

In [18]:
categorical_features = ['country','province','region_1','region_2','taster_name','variety','winery']
numerical_features = ['price']
features = categorical_features + numerical_features

In [27]:
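# Note: the same `le` instance is re-fitted on every column, so afterwards it only holds the classes of the last column.
encoded_features = wine_reviews[categorical_features].apply(lambda col: le.fit_transform(col.fillna('NA')))
encoded_features['price'] = wine_reviews.price.fillna(-1)  # missing prices are marked with -1
encoded_features['points'] = wine_reviews.points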
encoded_features.head()

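| | country | province | region_1 | region_2 | taster_name | variety | winery | price | points |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 332 | 424 | 6 | 9 | 691 | 11608 | -1.0 | 87 |
| 1 | 32 | 108 | 738 | 6 | 16 | 451 | 12956 | 15.0 | 87 |
| 2 | 41 | 269 | 1218 | 17 | 15 | 437 | 13018 | 14.0 | 87 |
| 3 | 41 | 218 | 549 | 6 | 0 | 480 | 14390 | 13.0 | 87 |
| 4 | 41 | 269 | 1218 | 17 | 15 | 441 | 14621 | 65.0 | 87 |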
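To see what label encoding does to a single column, here is a minimal pure-Python sketch of the idea: missing values are replaced by a placeholder, the distinct values are sorted, and each value is replaced by its position in that sorted list. The helper `label_encode` and the sample values are hypothetical; scikit-learn's `LabelEncoder` likewise assigns codes in sorted class order.

```python
def label_encode(values, missing="NA"):
    """Return (codes, classes): integer codes assigned in sorted order of the distinct values."""
    filled = [missing if v is None else v for v in values]
    classes = sorted(set(filled))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in filled], classes


# Hypothetical country column with one missing value.
countries = ["US", "Italy", None, "US", "France"]
codes, classes = label_encode(countries)

print(classes)
print(codes)
```

The resulting integers carry no meaningful order (here it is purely alphabetical). Tree-based models can usually cope with that by splitting repeatedly, but for linear models such an encoding would be misleading.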


#### Re-splitting to train and test

In [29]:
train_encoded_features = encoded_features[encoded_features.index.isin(train_data.index)]
test_encoded_features = encoded_features[encoded_features.index.isin(test_data.index)]
assert(len(train_encoded_features) + len(test_encoded_features) == len(wine_reviews))

#### Fitting a tree-regressor

In [37]:
from src.models import i_feel_lucky_xgboost_training

In [31]:
train_encoded_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103977 entries, 0 to 129970
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   country      103977 non-null  int64  
 1   province     103977 non-null  int64  
 2   region_1     103977 non-null  int64  
 3   region_2     103977 non-null  int64  
 4   taster_name  103977 non-null  int64  
 5   variety      103977 non-null  int64  
 6   winery       103977 non-null  int64  
 7   price        103977 non-null  float64
 8   points       103977 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 7.9 MB


In [32]:
xgb_clf, clf_name = i_feel_lucky_xgboost_training(train_encoded_features, test_encoded_features, features, 'points', name='xgb_clf_points_prediction')

Let's look at the function output - specifically the **xgb_clf_points_prediction** column:

In [33]:
test_encoded_features.head()

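| | country | province | region_1 | region_2 | taster_name | variety | winery | price | points | xgb_clf_points_prediction |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 41 | 218 | 549 | 6 | 0 | 480 | 14390 | 13.0 | 87 | 86 |
| 9 | 15 | 11 | 21 | 6 | 16 | 437 | 8989 | 27.0 | 87 | 88 |
| 17 | 0 | 216 | 633 | 6 | 12 | 280 | 7830 | 13.0 | 87 | 84 |
| 24 | 22 | 332 | 992 | 6 | 9 | 389 | 2065 | 35.0 | 87 | 90 |
| 32 | 22 | 332 | 992 | 6 | 14 | 691 | 6694 | -1.0 | 86 | 87 |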


In [34]:
xgb_stats = calc_prediction_quality(test_encoded_features, 'xgb_clf_points_prediction','points')
xgb_stats

MSE    6.138878
MAE    1.883204
R2     0.344063
dtype: float64

In [35]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb'])
all_compared

| | MSE | MAE | R2 |
|---|---|---|---|
| baseline_1 | 9.35894 | 2.502118 | -4.132097e-09 |
| baseline_2 | 8.900472 | 2.437661 | 0.04898723 |
| baseline_3 | 8.380775 | 2.349499 | 0.1045167 |
| regression_by_xgb | 6.138878 | 1.883204 | 0.3440627 |


In [36]:
# Keep the index: it holds the model names, which index=False would drop from the file.
all_compared.to_csv('data/all_models_compared.csv', index_label='model')

## Classical NLP approaches

### Using only the text from the "description" column

To be implemented by you.

### Using both the text and other features

To be implemented by you.

## Deep Learning approaches

### Fully connected network on the text only

#### Tokenization

In [13]:
import tensorflow as tf

tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False,
)

<keras.layers.preprocessing.text_vectorization.TextVectorization at 0x28fe80c40>

In [14]:
from tensorflow.keras.layers import TextVectorization, Embedding, Dense, GlobalAveragePooling1D, Dropout

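What are good values for the vocabulary size and the sequence length? For the latter, let's look at the upper quantiles of the description length in words: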

In [15]:
wine_reviews.description.apply(lambda x: len(x.split(' '))).quantile([0.95, 0.99])

0.95    60.0
0.99    71.0
Name: description, dtype: float64

In [16]:
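vocab_size = 32000
sequence_length = 60  # the 95th percentile of the description length in words (see above)

# Use the text vectorization layer to normalize, split, and map strings to integers.
# output_sequence_length pads or truncates every sample to the same length, since descriptions vary in length.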
vectorize_layer = TextVectorization(
    #standardize=lambda text: tf.strings.lower(text), # You can use your own normalization function here
    standardize='lower_and_strip_punctuation', # Or you can use a pre-made normalization function
    max_tokens=vocab_size,    
    split='whitespace',
    output_mode='int',
    name = 'Text_processing',
    output_sequence_length=sequence_length)

In [26]:
vectorize_layer.adapt(train_data['description'])

In [33]:
sample_description = train_data['description'].sample().iloc[0]
print(sample_description)
vectorize_layer(sample_description)

Smoky toasty scents dominate this 50-50 blend of Tempranillo and Grenache. Firm purple fruits offer a taste of Spanish varietal character, with accents of bull's blood. This needs a bit more bottle age, which should continue to flesh out the finish.


<tf.Tensor: shape=(60,), dtype=int64, numpy=
array([  170,   235,   213,   539,     7,  1930,    48,     5,   810,
           2,   354,    75,   517,    46,   720,     4,   382,     5,
        1426,   386,    81,     6,   410,     5, 12589,  2414,     7,
         294,     4,   119,    67,   272,   131,   153,   206,  1093,
          13,   944,    88,     3,    20,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0])>

In [35]:
for token in vectorize_layer(sample_description).numpy()[:20]:
    print(f"{token} ---> ",vectorize_layer.get_vocabulary()[token])

170 --->  smoky
235 --->  toasty
213 --->  scents
539 --->  dominate
7 --->  this
1930 --->  5050
48 --->  blend
5 --->  of
810 --->  tempranillo
2 --->  and
354 --->  grenache
75 --->  firm
517 --->  purple
46 --->  fruits
720 --->  offer
4 --->  a
382 --->  taste
5 --->  of
1426 --->  spanish
386 --->  varietal


#### Modeling

In [23]:
embedding_dim=16

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dropout(0.2),
    Dense(164, activation='tanh', name='hidden_layer'),
    Dropout(0.2),
    Dense(1, name = 'output_layer')
])

In [24]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Text_processing (TextVector  (None, 60)               0         
 ization)                                                        
                                                                 
 embedding (Embedding)       (None, 60, 16)            512000    
                                                                 
 global_average_pooling1d_2   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dropout_4 (Dropout)         (None, 16)                0         
                                                                 
 hidden_layer (Dense)        (None, 164)               2788      
                                                                 
 dropout_5 (Dropout)         (None, 164)              
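The parameter counts reported in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below applies the standard formulas: an embedding layer has one vector per token id, and a dense layer has one weight per input-output pair plus one bias per output. The count for `output_layer` is derived from the formula rather than read from the summary.

```python
vocab_size = 32000
embedding_dim = 16
hidden_units = 164

# Embedding: one embedding_dim-sized vector per token id.
embedding_params = vocab_size * embedding_dim

# Dense hidden layer: weights (inputs x outputs) plus one bias per output.
hidden_params = embedding_dim * hidden_units + hidden_units

# Dense output layer with a single unit: weights plus one bias.
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params

print("embedding:   ", embedding_params)
print("hidden_layer:", hidden_params)
print("output_layer:", output_params)
print("total:       ", total_params)
```

Almost all parameters sit in the embedding table, so the vocabulary size and embedding dimension are the main levers on model size here.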

In [25]:
tf.keras.utils.plot_model(model, show_dtype=True, show_shapes=True, show_layer_names=True)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model/model_to_dot to work.
