# Wine classification project

This is an NLP and deep learning project using `spacy`, `pytorch` and `torchtext`. It aims to retrieve information about a wine from short text reviews written by a taster.

## Objective

We try to predict the **country** of production, the **province** of production, and the **grape variety**.  
As a side objective, we show that on this dataset we can retrieve the **name of the taster** with high accuracy, using only a review they have written.

## Dataset

The dataset contains data scraped from [WineEnthusiast](https://www.winemag.com/?s=&drink_type=wine), and is hosted on [this kaggle page](https://www.kaggle.com/zynicide/wine-reviews#winemag-data-130k-v2.csv). Each example contains a written review of a wine, and various data about this wine such as the country and province of production, the grape variety, the winery, the name and twitter handle of the taster, a general grade and a price index.

## What we'll do

We'll perform the following steps: 
- Load and clean the data
- Set up the training, validation and testing datasets
- Set up pre-trained word embeddings
- Create a CNN to classify our data
- Write a training routine
- Test our model

In the appendix, you'll also find:
- Helpers to analyze our model's performance and diagnose misclassifications
- Our initial RNN implementation, which did not perform as well as our CNN


### Library loading

In [428]:
import torch
import numpy as np
from torchtext import data

SEED = 2753 # We always use the same seed for reproducibility
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize= 'spacy') # for CNN
# TEXT = data.Field(tokenize= 'spacy', include_lengths=True) # If you want to try the RNN, you need to include the lengths in the TEXT field
LABEL = data.LabelField()


We load `torch` and `torchtext`, and set up our fields for `torchtext`. Note that we tell `torchtext` to use `spacy` as our tokenizer. You need to have spacy installed for this to work, and to have downloaded an English language model. `torchtext` expects this model to be called `en`, so you might have to rename it.

### Load and clean the dataset

In [617]:
import csv

CURRENT_LABEL = 'country' # Column we're currently trying to guess. Change this to any key of COLUMNS_STOI below.

# String to int relation between column name and column index, to access them easily
COLUMNS_STOI = {
    'country': 1, 
    'province': 6, 
    'taster_name': 9,
    'variety': 12,
}

MIN_SAMPLE_NUMBER = 150

column_number = COLUMNS_STOI[CURRENT_LABEL]

with open('datasets/winemag-data-130k-v2.csv') as f:
    reader = csv.reader(f)
    lines_uncontrolled = []
    counts = {}

    for row in reader:
        if not row[column_number]:
            # Skip the row if it doesn't have the current label
            continue
        if CURRENT_LABEL == 'province':
            # Merge the "Burgundy" label into "Bordeaux" (we treat the two as a single class)
            if row[6] == "Burgundy":
                row[6] = "Bordeaux"
        # Keep a count of each label occurrence
        counts[row[column_number]] = counts.get(row[column_number], 0) + 1
        lines_uncontrolled.append(row)
        
lines = []

# Remove the rows where the label is too rare
for row in lines_uncontrolled:
    if counts[row[column_number]] >= MIN_SAMPLE_NUMBER:
        lines.append(row)
    
            
            
print("Removed " + str(len(lines_uncontrolled) - len(lines)) + " rows")

print("Number of classes before cutting:", len(counts.keys()))
print("Original number of rows:", len(lines_uncontrolled))    
print("Rows after cutting:", len(lines))
print("Labels kept:", [k for k in counts.keys() if counts[k] >= MIN_SAMPLE_NUMBER])

Removed 1277 rows
Number of classes before cutting: 44
Original number of rows: 129909
Rows after cutting: 128632
Labels kept: ['Chile', 'Austria', 'South Africa', 'Canada', 'US', 'Spain', 'Argentina', 'Italy', 'Germany', 'France', 'Australia', 'Israel', 'Greece', 'Portugal', 'New Zealand']


The dataset sometimes lacks data, so we only keep the rows where the label we're predicting is present. We also keep only the labels for which we have enough examples: for instance, if a variety is too rare in the dataset, the model won't see enough reviews to learn what characterizes it. The threshold can be tuned with `MIN_SAMPLE_NUMBER`. We set it to `150`, which is roughly `1/1000` of the total dataset size.
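
The filtering described above can be illustrated on a toy table. This is only a minimal sketch: the helper name `filter_rare_labels` and the sample rows are assumptions made for the example, and it uses `collections.Counter` rather than the hand-written dictionary of the cell above.

```python
from collections import Counter

def filter_rare_labels(rows, column, min_count):
    """Keep rows whose label (row[column]) is non-empty and frequent enough."""
    labelled = [row for row in rows if row[column]]        # drop rows with a missing label
    counts = Counter(row[column] for row in labelled)      # occurrences of each label
    return [row for row in labelled if counts[row[column]] >= min_count]

# Toy data: (id, country). 'Peru' appears once and one row has no label.
rows = [("0", "France"), ("1", "France"), ("2", "Peru"), ("3", ""), ("4", "France")]
kept = filter_rare_labels(rows, column=1, min_count=2)
print(kept)  # [('0', 'France'), ('1', 'France'), ('4', 'France')]
```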

### Train, validation and test splits

In [618]:
TEST_SET_SIZE = .3
VALIDATION_SET_SIZE = .2

indices = list(range(1, len(lines)))  # starts at 1, so lines[0] ends up in none of the three sets
np.random.seed(SEED)
np.random.shuffle(indices)

first_split_index = int(TEST_SET_SIZE * len(lines))
second_split_index = int((TEST_SET_SIZE+VALIDATION_SET_SIZE) * len(lines))

test_indices = indices[:first_split_index]
validation_indices = indices[first_split_index:second_split_index]
train_indices = indices[second_split_index:]

train_set = [lines[k] for k in train_indices]
test_set = [lines[k] for k in test_indices]
validation_set = [lines[k] for k in validation_indices]

print("Train set size:", len(train_set))
print("Validation set size:", len(validation_set))
print("Test set size:", len(test_set))
print("Train set sample:", train_set[0])

Train set size: 64315
Validation set size: 25727
Test set size: 38589
Train set sample: ['93313', 'US', 'A pleasant sipper for drinking now, with citrus fruit, Asian-pear and peach flavors, accented by acidity. This 100% Sauvignon was unoaked.', 'Honker Blanc', '86', '15.0', 'California', 'Napa Valley', 'Napa', '', '', 'Tudal 2012 Honker Blanc White (Napa Valley)', 'White Blend', 'Tudal']


We split our dataset into training, validation and test sets. The validation set is 20% of the rows and the test set 30%, leaving 50% for training.
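
As a sanity check on this kind of split, here is a small sketch using only the standard library. The helper `split_indices` is hypothetical, and unlike the cell above it shuffles every index starting from 0 and uses `random.Random` instead of `numpy`; it only illustrates that a seeded shuffle gives disjoint, reproducible subsets.

```python
import random

def split_indices(n_rows, test_frac, valid_frac, seed):
    """Shuffle range(n_rows) and cut it into train / validation / test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # seeded, so the split is reproducible
    first = int(test_frac * n_rows)
    second = int((test_frac + valid_frac) * n_rows)
    return indices[second:], indices[first:second], indices[:first]

train_idx, valid_idx, test_idx = split_indices(1000, test_frac=0.3, valid_frac=0.2, seed=2753)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Every row lands in exactly one subset, and calling the function again with the same seed returns the same split.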

We then write these sets to csv files so we can load them afterwards. Note that we're using the `csv` library to write, because our wine reviews contain commas, so we need to be careful.
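
To see why plain string joining is not enough here, the following sketch writes a shortened, made-up review containing commas and reads it back. It uses an in-memory buffer instead of a file, which is an assumption made to keep the example self-contained.

```python
import csv
import io

row = ["93313", "US", "A pleasant sipper, with citrus fruit, Asian-pear and peach flavors."]

# Naive joining breaks: the commas inside the review create extra columns.
naive_line = ",".join(row)
print(len(naive_line.split(",")))  # 5 columns instead of 3

# csv.writer quotes the review, so csv.reader recovers exactly three columns.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
buffer.seek(0)
recovered = next(csv.reader(buffer))
print(recovered == row)  # True
```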

In [620]:
import os

# exist_ok=True makes this a no-op when the directory already exists
os.makedirs('preprocessed_datasets', exist_ok=True)

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
for name, rows in [('train', train_set), ('test', test_set), ('validation', validation_set)]:
    with open('preprocessed_datasets/' + name + '.csv', 'w', newline='') as out_file:
        csv.writer(out_file).writerows(rows)

### Setup the datasets

Then we'll set up the datasets so they can be used by `torchtext`. Here, we tell the library what each line contains and which columns we want to use. We select the label we want to work on by setting it to `LABEL`; every other column is left as `None`.

The `description` field, which contains the reviews, will always be set to `TEXT`: this is the field on which we're going to do some NLP.

In [598]:
# Put the label you want to predict as `LABEL`, all the other ones to `None`.
tv_datafields = [("id", None),
                 ("country", LABEL),
                 ("description", TEXT),
                 ("designation", None),
                 ("points", None),
                 ("price", None),
                 ("province", None),
                 ("region_1", None),
                 ("region_2", None),
                 ("taster_name", None),
                 ("taster_twitter_handle", None),
                 ("title", None),
                 ("variety", None),
                 ("winery", None)]

trn, vld, tst = data.TabularDataset.splits(path='preprocessed_datasets',
                                     format="csv",
                                     train= 'train.csv',
                                     validation='validation.csv',
                                     test='test.csv',
                                     fields=tv_datafields)

### Setup word embedding

Now we'll use pretrained word embeddings to improve the accuracy and speed up the training of our models.  
We'll build a vocabulary of the words encountered in the reviews (and in the labels), but as the reviews contain many distinct words, we'll only keep the most common ones by capping the vocabulary size. This is not necessary for the labels, because their vocabulary is much smaller.

**Beware:** `glove.6B.100d` is a set of pretrained vectors. It weighs around **800 MB**, and if you don't have it already, running the following cell will download it. Make sure you have a good connection.

In [621]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(trn,
                 max_size=MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",  # CAREFUL: this will download ~800M of data
                 unk_init = torch.Tensor.normal_)
LABEL.build_vocab(trn)

print("Reviews vocab length:", len(TEXT.vocab))
print("Labels vocab length:", len(LABEL.vocab))

Reviews vocab length: 25002
Labels vocab length: 15


Note how we get `25 002` and not `25 000` as our TEXT vocab length. This is because `torchtext` adds two reserved tokens: `<unk>` (unknown), which replaces every word that is not in our vocabulary, and `<pad>`, which pads the samples so they all have the same size.
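
The arithmetic behind this count can be reproduced with a few lines of plain Python. The `build_vocab` helper below is a hypothetical stand-in written for illustration; it mimics the behaviour described above (special tokens plus the most frequent words) and is not `torchtext`'s actual implementation.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): special tokens first, then the max_size most frequent tokens."""
    freqs = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

reviews = [["ripe", "cherry", "and", "plum"], ["cherry", "and", "oak"], ["and", "tannins"]]
itos, stoi = build_vocab(reviews, max_size=3)
print(len(itos))  # 5, i.e. max_size + 2 reserved tokens

# Any word outside the vocabulary is mapped to the <unk> index.
print(stoi.get("tannins", stoi["<unk>"]))  # 0
```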

We can check the most common words in our reviews vocab:

In [623]:
print(TEXT.vocab.freqs.most_common(30))

[(',', 217968), ('.', 174982), ('and', 171656), ('of', 84962), ('the', 83328), ('a', 78092), ('with', 57268), ('is', 48333), ('wine', 39836), ('-', 37121), ('this', 36172), ('in', 30052), ('flavors', 29509), ('to', 27829), ('The', 26060), ("'s", 25579), ('fruit', 24629), ('It', 21558), ('on', 21247), ('it', 21153), ('This', 20348), ('that', 19620), ('palate', 19021), ('aromas', 17499), ('acidity', 17096), ('finish', 17009), ('tannins', 15116), ('from', 14883), ('but', 14649), ('cherry', 14086)]


We notice that the most common word is a comma, which explains why we had to be careful with our csv reading and writing.

Now we'll set up iterators, which will allow us to iterate through batches of our training, validation and testing datasets:

In [601]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # If you have CUDA support, this makes sure you're using it for training

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (trn, vld, tst), 
    batch_size = BATCH_SIZE,
#     sort_within_batch=True, # For RNN, you need to uncomment this line as batches need to be sorted
    sort_key=lambda x: len(x.description), 
    device = device)

Note that we sort our data according to the length of the review. This is because we need to pad the reviews so that all the samples in a batch have the same size. Grouping samples of similar length together means we add less padding, which speeds up the process a bit.
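
The effect of grouping by length can be quantified on a toy example. This sketch simply counts padding tokens for consecutive batches; the review lengths are made up, and the real `BucketIterator` also shuffles, so treat it as an illustration of the idea only.

```python
def total_padding(lengths, batch_size):
    """Count the <pad> tokens needed when each consecutive batch is padded to its longest item."""
    padding = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padding += sum(max(batch) - length for length in batch)
    return padding

review_lengths = [5, 40, 7, 38, 6, 41]  # hypothetical token counts
print(total_padding(review_lengths, batch_size=3))          # 106 (batches taken in file order)
print(total_padding(sorted(review_lengths), batch_size=3))  # 7 (batches of similar lengths)
```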

### Creating the CNN

In [602]:
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, 
                 dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, 
                                              out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) 
                                    for fs in filter_sizes
                                    ])
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        
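    def forward(self, text):
        # text: [sentence length, batch size] -> [batch size, sentence length]
        text = text.permute(1, 0)
        # embedded: [batch size, sentence length, embedding dim]
        embedded = self.embedding(text)
        # Add a channel dimension for Conv2d: [batch size, 1, sentence length, embedding dim]
        embedded = embedded.unsqueeze(1)
        # Each entry: [batch size, n_filters, sentence length - fs + 1]
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
        # Max over all positions: [batch size, n_filters]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        # cat: [batch size, n_filters * len(filter_sizes)]
        cat = self.dropout(torch.cat(pooled, dim = 1))
        return self.fc(cat)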

In the `__init__` function we define the architecture of our model. 
- First we have an **embedding layer**: our input words are vocabulary indices (equivalent to sparse one-hot vectors), and this layer turns them into smaller, dense vectors.
- Several **convolution layers**: convolution on text is a bit specific, and we wrote more about it in our pdf (in French). Basically, it works like convolution on images, but instead of sliding over patches of pixels, each filter slides over n-grams (groups of consecutive words). Each convolution is followed by a **ReLU** activation and then by **max pooling**.
- Finally a **linear layer**, whose output size is our number of classes, so we can perform classification.
- Note we're using **dropout**: this is a technique to limit overfitting, by randomly setting some activations to 0 at each forward pass during training.
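
To make the convolution, ReLU and max pooling steps concrete, here is a pure-Python sketch of a single filter applied to one tiny sentence. The embeddings and filter weights are made-up numbers, the bias term is omitted, and the function name `conv_relu_maxpool` is hypothetical; the model itself relies on `nn.Conv2d`.

```python
def conv_relu_maxpool(embedded, kernel):
    """Slide one filter over consecutive word windows, apply ReLU, then max-pool over positions."""
    n = len(kernel)                                # filter height = n-gram size
    feature_map = []
    for start in range(len(embedded) - n + 1):     # one output per n-gram position
        window = embedded[start:start + n]
        value = sum(w * x
                    for kernel_row, word_vec in zip(kernel, window)
                    for w, x in zip(kernel_row, word_vec))
        feature_map.append(max(0.0, value))        # ReLU
    return feature_map, max(feature_map)           # max pooling keeps the strongest match

embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 words, embedding dim 2
bigram_filter = [[1.0, -1.0], [-1.0, 1.0]]                   # one filter covering 2 words
feature_map, pooled = conv_relu_maxpool(embedded, bigram_filter)
print(feature_map, pooled)  # [2.0, 0.0, 0.0] 2.0
```

A sentence of 4 words and a filter of height 2 give 4 - 2 + 1 = 3 positions, and max pooling reduces them to a single number per filter, whatever the sentence length.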

Next we'll have to choose the parameters of this architecture:

In [603]:
INPUT_DIMENSION = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [2,3,4]
OUTPUT_DIM = len(LABEL.vocab)
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = CNN(INPUT_DIMENSION, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)

- `INPUT_DIMENSION` and `OUTPUT_DIM` are based on our data.
- The embedding dimension `EMBEDDING_DIM` is fixed by the pretrained vectors we've loaded, so we have to keep this one at 100.
- We can choose `N_FILTERS` and `FILTER_SIZES` freely, as well as the dropout rate `DROPOUT`.
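
As a worked example of what these choices imply, the sketch below counts the trainable parameters of the architecture by hand. It assumes the vocabulary sizes printed earlier for the country label (25002 words, 15 classes) and the default bias terms of `nn.Conv2d` and `nn.Linear`.

```python
vocab_size = 25002      # len(TEXT.vocab) printed earlier
embedding_dim = 100
n_filters = 100
filter_sizes = [2, 3, 4]
output_dim = 15         # len(LABEL.vocab) for the country label

embedding_params = vocab_size * embedding_dim
# Each Conv2d has n_filters kernels of shape (1, fs, embedding_dim) plus one bias per kernel.
conv_params = sum(n_filters * fs * embedding_dim + n_filters for fs in filter_sizes)
# The linear layer receives one pooled value per filter and per filter size.
fc_inputs = len(filter_sizes) * n_filters
fc_params = fc_inputs * output_dim + output_dim

print(fc_inputs)                                     # 300
print(embedding_params + conv_params + fc_params)    # 2595015
```

Most of the parameters sit in the embedding layer, which is why initializing it from pretrained vectors helps.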

We can now use our pre-trained embeddings to initialize the weights of our embedding layer:

In [605]:
pretrained_embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.3460,  0.7065,  0.1639,  ..., -1.4077,  1.7792, -0.9527],
        [-0.9241, -1.4135, -0.8655,  ...,  0.0169, -0.8565, -0.1619],
        [-0.1077,  0.1105,  0.5981,  ..., -0.8316,  0.4529,  0.0826],
        ...,
        [-0.4288, -0.0500, -0.3499,  ..., -1.2627,  0.1444, -0.8879],
        [-1.0038,  0.6452, -0.3984,  ..., -0.6172, -0.0960,  0.2449],
        [ 0.3714, -1.2620, -0.1996,  ..., -0.2593,  1.2749,  1.0969]])

The pretrained vectors do not contain the `<unk>` and `<pad>` tokens, so we set their embeddings to all-zero vectors:

In [606]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

### Training

Now that we have everything defined, we can train our model.

We choose Adam as our optimizer: it adapts the step size of each parameter, so its default learning rate (`1e-3` in `pytorch`) usually works without the manual tuning that plain stochastic gradient descent tends to need.  
We also need to choose a loss function. Here we use `CrossEntropyLoss` from `pytorch`, which is meant for samples that belong to exactly one class (this is our case, as each wine has only one country, one province, one taster...).
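
For intuition, the cross-entropy of a single sample can be computed by hand. This is a sketch in plain Python with made-up scores; `CrossEntropyLoss` performs the same computation on tensors and, by default, averages it over the batch.

```python
import math

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# With identical scores over 15 classes, the loss is log(15), about 2.71.
uniform_loss = cross_entropy([0.0] * 15, target=3)
# Raising the score of the correct class lowers the loss.
confident_loss = cross_entropy([0.0] * 3 + [5.0] + [0.0] * 11, target=3)
print(uniform_loss, confident_loss)
```

The `log(15)` value is a useful reference point: a training loss close to it means the model is doing no better than guessing uniformly among the 15 countries.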

In [607]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.CrossEntropyLoss()

model = model.to(device)
criterion = criterion.to(device)

Now we need to define an accuracy function. As we are doing multi-class classification, we use the proportion of correctly classified samples in a batch. In other words, for each sample we pick the label with the highest score, and then we compute the fraction of the batch for which this label is the right one:

In [624]:
def categorical_accuracy(preds, y):
    # preds: [batch size, n classes] raw scores, y: [batch size] class indices
    max_preds = preds.argmax(dim = 1)
    # Mean of the 0/1 "correct" flags, computed on whichever device preds lives on
    return max_preds.eq(y).float().mean()
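
A worked example may help check what this function returns. The version below uses plain Python lists instead of tensors, with made-up scores; `batch_accuracy` is a hypothetical name used only for this illustration.

```python
def batch_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]  # argmax per sample
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)

scores = [[2.0, 0.1, -1.0],   # argmax 0
          [0.3, 0.2, 1.5],    # argmax 2
          [0.0, 4.0, 1.0],    # argmax 1
          [1.0, 0.9, 0.8]]    # argmax 0
labels = [0, 2, 2, 0]
print(batch_accuracy(scores, labels))  # 0.75 (3 of 4 correct)
```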

We can now define our training and evaluation functions, which will respectively train the model and evaluate its accuracy batch after batch.

We are always using the `description` field (review text) as an input, but we can take varying outputs depending on what label we're experimenting on, so we need to get this one back with `getattr`.

In [612]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()
        
        predictions = model(batch.description)
        
        loss = criterion(predictions, getattr(batch, CURRENT_LABEL))
        
        acc = categorical_accuracy(predictions, getattr(batch, CURRENT_LABEL))
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.description)
            
            loss = criterion(predictions, getattr(batch, CURRENT_LABEL))
            
            acc = categorical_accuracy(predictions, getattr(batch, CURRENT_LABEL))

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


We create a helper function to keep track of time during training, so we can compare how fast our different models are:

In [613]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Now it's time to run the training!

We choose the number of epochs we want to run the model on, and when we get better results, we save the model in a separate file to make sure we don't lose it as this step can be time consuming.
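
The saving rule can be sketched independently of the model. The helper `best_checkpoint_epochs` is hypothetical, and the losses are the validation losses of the first six epochs, rounded from the training log shown below.

```python
def best_checkpoint_epochs(valid_losses):
    """Return the 1-based epochs at which a new best validation loss would trigger a save."""
    best = float("inf")
    saved_at = []
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:  # strictly better than anything seen so far
            best = loss
            saved_at.append(epoch)
    return saved_at

valid_losses = [0.648, 0.528, 0.471, 0.459, 0.439, 0.450]
print(best_checkpoint_epochs(valid_losses))  # [1, 2, 3, 4, 5]
```

Epoch 6 has a slightly higher validation accuracy but a higher validation loss than epoch 5, so it would not overwrite the saved file: the criterion here is the loss, not the accuracy.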

In [614]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'wine-prediction-model.pt')
    
    print('Epoch: ' + str(epoch+1) + ' | Epoch Time: ' + str(epoch_mins) + 'm '+ str(epoch_secs) + 's')
    print('\tTrain Loss: ' + str(train_loss) + ' | Train Acc: ' + str(train_acc*100) + '%')
    print('\tVal. Loss: ' + str(valid_loss) + ' |  Val. Acc: ' + str(valid_acc*100) + '%')

Epoch: 1.02 | Epoch Time: 1m 23s
	Train Loss: 1.0312679250738515 | Train Acc: 68.61107597896708%
	Val. Loss: 0.6479807856367595 |  Val. Acc: 78.97497878145816%
Epoch: 2.02 | Epoch Time: 1m 43s
	Train Loss: 0.6399591669810946 | Train Acc: 79.38793848403057%
	Val. Loss: 0.5278927133376918 |  Val. Acc: 82.23280917056164%
Epoch: 3.02 | Epoch Time: 1m 50s
	Train Loss: 0.5153121326100174 | Train Acc: 82.79821760025784%
	Val. Loss: 0.4710242266011475 |  Val. Acc: 84.09052209474555%
Epoch: 4.02 | Epoch Time: 1m 47s
	Train Loss: 0.4371247465782498 | Train Acc: 84.90787587355618%
	Val. Loss: 0.45896594141104924 |  Val. Acc: 84.39764224771244%
Epoch: 5.02 | Epoch Time: 1m 47s
	Train Loss: 0.38324987962471313 | Train Acc: 86.60134180861327%
	Val. Loss: 0.43887938610949917 |  Val. Acc: 85.26816562337068%
Epoch: 6.02 | Epoch Time: 1m 49s
	Train Loss: 0.3407266934490322 | Train Acc: 87.95163483168949%
	Val. Loss: 0.45016990779940763 |  Val. Acc: 85.4316586730492%
Epoch: 7.02 | Epoch Time: 1m 45s
	Tra

### Testing the results

Now that we have trained the model, we can use the test samples we set aside to measure its performance on unseen samples:

In [615]:
model.load_state_dict(torch.load('wine-prediction-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print('Test Loss: ' + str(test_loss) + ' | Test Acc: '+ str(test_acc*100) + '%')

Test Loss: 0.4363667101384593 | Test Acc: 85.35604932612645%


Depending on the experiment you're running, you can get various results at this step. We kept track of the results we obtained in our pdf.

### Live testing

To play a bit more with the model, we can use `spacy` to classify new reviews on the fly:

In [625]:
import spacy
nlp = spacy.load('en_core_web_sm')

def predict_class(model, sentence, min_len = 4):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    preds = model(tensor)
    max_preds = preds.argmax(dim = 1)
    return max_preds.item()
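
The preprocessing part of `predict_class` can be looked at in isolation. In the sketch below, the toy vocabulary and the `encode` helper are assumptions made for illustration (the notebook uses `TEXT.vocab.stoi`). The default `min_len` of 4 matches the largest value in `FILTER_SIZES`: a filter spanning 4 words cannot be applied to a shorter input.

```python
def encode(tokens, stoi, unk_idx, pad_token="<pad>", min_len=4):
    """Pad a token list up to min_len, then map each token to its index (unknown words -> unk_idx)."""
    padded = tokens + [pad_token] * max(0, min_len - len(tokens))
    return [stoi.get(tok, unk_idx) for tok in padded]

# Hypothetical toy vocabulary
stoi = {"<unk>": 0, "<pad>": 1, "ripe": 2, "cherry": 3}
print(encode(["ripe", "Zweigelt"], stoi, unk_idx=stoi["<unk>"]))  # [2, 0, 1, 1]
```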

In the following cell, you can put any review in the description, and check how the model classifies it.

In [492]:
description = "This cooperative, based in Aÿ, has benefited from the fine Pinot Noir in the village to produce a ripe red fruited wine. With balanced acidity and a soft aftertaste, it is ready to drink."
pred_class = predict_class(model, description)
print('Predicted class is: ' + str(pred_class) + ' = ' + str(LABEL.vocab.itos[pred_class]))

Predicted class is: 0 = Pinot Noir


## Appendix

We also provide a few things that are not directly linked to our results above but we used during our work.  

### Results exploration

To check how our model was performing, and especially in which cases it didn't perform well, we used the following script. It prints a number of wrongly classified samples. This is for instance how we found out that **Burgundy** was also called **Bordeaux** in the dataset, which led to lots of classification errors.

In [626]:
import csv

LIMIT = 1000  # How many results to display
SHOW_ONLY_WRONG = True # If set to true, will only show the wrongly classified samples

with open('preprocessed_datasets/test.csv') as f:
    reader = csv.reader(f)

    for i, row in enumerate(reader):
        if i >= LIMIT:
            break
        sentence = row[2]
        # Column 6 is the province; use row[COLUMNS_STOI[CURRENT_LABEL]] to compare
        # against the label the model was actually trained on.
        real_value = row[6]
        pred_label = LABEL.vocab.itos[predict_class(model, sentence)]
        if not SHOW_ONLY_WRONG or real_value != pred_label:
            print(sentence)
            print("Actual: " + str(real_value) + ", predicted: " + str(pred_label) + "\n")
    

A pure impression ripe Golden Delicious apples shows on the nose of this wine, while the palate majors in citrus. Dry and fresh, it offers tingling zestiness with a pleasantly bitter edge. The dry finish lasts, leaving you to savor apple, zest and something less tangible—perhaps earth or stone. Drink now until 2030.
Actual: Alsace, predicted: France

Starts out odd and exotic, with blueberry and Middle Eastern spice aromas. Feels condensed and jammy, with full flavors of herbs, boysenberry and plum. Shows freshness along with simplicity, with finishing herbal notes of sage and tarragon. Contains 10% each Merlot and Cabernet.
Actual: Northern Spain, predicted: Spain

Vine Cliff consistently produces one of the best Chardonnays in Napa Valley, and here's another one. The cool vintage gives it refreshing acidity, while the flavors are ripe and frankly delicious, suggesting pineapple and Key lime pie, vanilla custard, buttered toast, vanilla and dusty pie spices. A girding of minerality he

A Barbera offering dark plum and handfuls of meaty brawn, it has rounded tannins and plenty of intensity. Pair this with spicy sausage and risotto to help it mellow.
Actual: California, predicted: Italy

Intense and rich wine, with touches of smoky wood, the tannins ripe, the acidity a refreshing balancing factor. With its cherry character, there is fruit already, but wait a year and this will be impressive.
Actual: Beaujolais, predicted: France

From the cool-climate region of Elgon, the Cluver Pinot offers sweet berries and violet on the nose and an unfolding palate of red berry, coffee and chocolate. Restrained and ageable, the wine's acidity gives it food-oriented character; pair with game or wild fish.
Actual: Elgin, predicted: US

Simple and dry, this Cabernet has modest blackberry, tea and oak flavors. With its firm tannins, it gets the basic Cabernet job done.
Actual: California, predicted: US

Made of 60% Cabernet Sauvignon, 20% Merlot and 20% Cabernet Franc, this has black-sk

The high proportion of Sémillon in the blend gives the wine an extra richness. That rounds out the more herbaceous character and brings in ripe peach and kiwi flavors. It's already a lively, drinkable wine, but should only improve over another few months.
Actual: Bordeaux, predicted: France

Although it's not indicated on the label, this is a Blanc de Blancs from Chardonnay. It is warm and ripe, with a soft texture allied to fresh acidity. Orange and lemon zest give a great lift to this rounded wine.
Actual: Champagne, predicted: France

Scents of plum, orange rind, banana and a touch of mentholated herb unfold on the nose, while the palate offers fruit-skin flavors of plum and cherry, lending a slightly tannic structure.
Actual: Languedoc-Roussillon, predicted: South Africa

Delicately floral in aroma, this soft, earthy Pinot offers blue and red berry fruit that are layered and generous on the finish. Food-friendly, it has the heft to stand up to bigger, bolder flavors on the plate or

This blends 51% Merlot with 29% Zinfandel, 15% Petit Verdot and 5% Petite Sirah, combining nicely for a full-bodied expression of chocolate and dried dark cherry. On the palate, it becomes dusty and somewhat rustic, with an intensely juicy finish.
Actual: California, predicted: US

Full and round, with snappy acidity that brings life and lift to the luscious fruit, this delicious wine is aimed at those who like a festival of green and yellow fruits in their glass. It has fine texture and length, with a defining mineral streak that elevates this well beyond just fruity. It has medium-term aging potential as well.
Actual: Oregon, predicted: France

Dusty minerals, pollen and spice notes lend complexity to this honey-toned auslese. The palate is decadently sweet, glazed in caramelized sugar and heaving with ripe cantaloupe and peach flavors. A bite of lemon-lime acidity offsets all the opulence, lending edge to a long, lingering finish.
Actual: Mosel, predicted: Germany

Foresty earth not

Wild strawberry and a soft layering of fresh earth form the core of this blend, based mostly on Cabernet Sauvignon, with a healthy dose of Merlot and then smaller supporting handfuls of Cabernet Franc, Malbec and Petit Verdot. Blackberries come to fore as the wine develops, along with a swathe of bittersweet chocolate.
Actual: California, predicted: US

Plum, berry and milk chocolate aromas are not too aggressive. The palate is round and friendly feeling, with flavors of wild berry, sweet oak and carob. Finishes with solid oaky flavors. Good Syrah that's uncomplicated and generally well made.
Actual: Colchagua Valley, predicted: Chile

Lucien Lardy takes the risk of not using sulfur during fermentation, giving a wine that retains its great fruitiness. It's bright, full of red cherry fruits and lively acidity. Tannins are a hint in the background, enough to give structure but not to suggest long aging. Drink until 2018.
Actual: Beaujolais, predicted: France

Not a bad wine, but green an

This is a very fine, high level Cabernet that shows the impeccable nature of Mondavi Reserve. Yet it's also very tannic and far from being drinkable. Shows a vast repository of black currant, blackberry, black cherry and new oak flavors wrapped into a smooth, velvety mouthfeel whose finish goes on forever. Masculine and authoritative, it should continue to improve for many years. Drink now, with decanting, and through 2017, at least.
Actual: California, predicted: US

Simple, with blackberry, cherry, carob and herb flavors wrapped into a jagged texture.
Actual: California, predicted: US

The tannins are smooth on this ripe, jammy wine. Its raspberry and cherry fruit flavors are finished with a white sugar opulence. The blend is Zin, Petite Sirah and Bordeaux varieties.
Actual: California, predicted: US

Not only is this appellation blend the cheaper Nebbiolo offered by Steve Clifton's Italian varietal-focused winery, it's the best right out of the bottle. Bright strawberry, sweet rose,

An intriguing blend of grapes both red (Touriga Nacional and Baga) and white (Bical), this is a fruity wine, deep in color although with a lightness from the lively acidity and fresh red-berry fruits. The wine is juicy, with the tannins already softening and rounding out. Drink from 2016.
Actual: Bairrada, predicted: Portugal

A 100% varietal wine that opens invitingly in wafts of baked bread and follows on the palate with juicy black currant and dried herb, this shows complexity and elegance in equal measure. Firm tannin girds substantial characteristics of black licorice and tobacco, with a succulent finish. Drink now through 2023.
Actual: California, predicted: US

Tropical and a bit briney on the nose. The palate is healthy in feel, with sweet white fruit flavors, a pinch of lemon and creamy oak. Finishes quick, with a nice feel and adequate acidity; will do best if well chilled.
Actual: Mendoza Province, predicted: Spain

This is a gorgeous and superbly balanced Syrah that's a rea

The 18 acres of Château de Grenouilles are a monopoly of the La Chablisienne cooperative. A huge, ripe wine, very fruity, with yellow peaches. The richness is balanced with a steely core of minerality. Lively acidity finishes this impressive wine.
Actual: Burgundy, predicted: France

Light and crisp, this is an open, bright and lightly herbaceous wine. Green fruits and lime juice are dominant along with a fresh, bright texture. It is ready to drink.
Actual: Loire Valley, predicted: France

Earthy, moderately complex and mature aromas of lemon peel, dried red-berry fruits and sandalwood set up a tannic palate with vitality given that this is now nine-years-old. Earthy cherry and plum flavors are uncomplicated and finish tannic, with mildly leafy berry undertones. Drink through 2018.
Actual: Northern Spain, predicted: Spain

Tart Morello cherries and black plums are accented with spice, cocoa and balsamic notes throughout this earthy, nuanced Pinot Noir. It's already showing signs of mat

This is a soft fruity wine. It has attractive red-fruit flavors, very ripe fruit and light acidity at the end. The wine is ready to drink.
Actual: Tejo, predicted: France

The Columbia Valley Cuvée is JM Cellars' Bordeaux blend of Cab, Merlot and Cab Franc.  Fragrant and seductive, spicy and deep, it's loaded with wild berry fruit flavors, dried herbs and hints of rock. The new oak is not intrusive, the acids are balanced, and the finish is light and smooth.
Actual: Washington, predicted: US

Medium to full bodied, with nicely supple tannins, Wirra Wirra's Church Block is an attractive wine for near-term consumption. Flavors in this blend of 48% Cabernet, 37% Shiraz and 15% Merlot range from broiled tomato to tart berry, then pick up some dried fruit character with air, finishing with an unusually tart edge.
Actual: South Australia, predicted: Australia

Four appellations across the county contributed to this easygoing white. It's dark gold in color with a taste of peach skin, pineappl

### RNN experiments

We also implemented a multi-class RNN, as RNNs usually work well on text analysis because of the sequential nature of text. However, it did not perform as well as the CNN described above, and took much longer to train. If you want to run it yourself (beware: the training can take several hours), you'll have to change a few things in the code above: look for the comments about RNN.

The train and evaluate functions are similar to those of the CNN, the main difference being the text length being taken into account.

In [None]:
def trainRNN(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for (index, batch) in enumerate(iterator):
        print(index/len(iterator))
        optimizer.zero_grad()
        
        text, text_lengths = batch.description
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.province)
        
        acc = categorical_accuracy(predictions, batch.province)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluateRNN(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.description
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.province)
            
            acc = categorical_accuracy(predictions, batch.province)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

N_EPOCHS = 10
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = trainRNN(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluateRNN(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'wine-prediction-model.pt')
    
    print('Epoch: ' + str(epoch+1) + ' | Epoch Time: ' + str(epoch_mins) + 'm '+ str(epoch_secs) + 's')
    print('\tTrain Loss: ' + str(train_loss) + ' | Train Acc: ' + str(train_acc*100) + '%')
    print('\tVal. Loss: ' + str(valid_loss) + ' |  Val. Acc: ' + str(valid_acc*100) + '%')

And below you can find the definition of our RNN architecture:

In [381]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        embedded = self.dropout(self.embedding(text))
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)   
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        return self.fc(hidden.squeeze(0))

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = len(LABEL.vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)