# Training Gradient Boosting

This notebook trains gradient boosting models on unigram tf-idf (and bag-of-words) vectors of the tasting descriptions in order to predict the variety, province and taster name. The aim is less to maximize accuracy (or another metric) than to evaluate feature importance.

As training takes time, the fitted models are all saved in pickle format to be used later or in the writeup notebook. We also remove the wines without taster names and cap the number of features for the tf-idf and BoW vectorizers at 300. This slightly reduces the time needed to train the gradient boosting models.


In [1]:
# IMPORTING THE NECESSARY PACKAGES AND FUNCTIONS:

# generic:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import time 

# more specific:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import pickle # to save the fitted models to disk

# NLP:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


In [2]:
# load processed and clean data:
filepath = '../data/winedata_processed_and_tokenized.csv'
winedata = pd.read_csv(filepath).drop('index', axis=1)
print(winedata.shape)
winedata.head().T

(106873, 14)


| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| country | Italy | Portugal | US | US | France |
| description | Aromas include tropical fruit, broom, brimston... | This is ripe and fruity, a wine that is smooth... | Tart and snappy, the flavors of lime flesh and... | Much like the regular bottling from 2012, this... | This dry and restrained wine offers spice in p... |
| designation | Vulkà Bianco | Avidagos | | Vintner's Reserve Wild Child Block | |
| points | 87 | 87 | 87 | 87 | 87 |
| price | | 15 | 14 | 65 | 24 |
| province | Sicily & Sardinia | Douro | Oregon | Oregon | Alsace |
| region_1 | Etna | | Willamette Valley | Willamette Valley | Alsace |
| region_2 | | | Willamette Valley | Willamette Valley | |
| taster_name | Kerin O’Keefe | Roger Voss | Paul Gregutt | Paul Gregutt | Roger Voss |
| title | Nicosia 2013 Vulkà Bianco (Etna) | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Rainstorm 2013 Pinot Gris (Willamette Valley) | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Trimbach 2012 Gewurztraminer (Alsace) |


In [3]:
# Gradient boosting is quite time consuming, so we reduce the size of the dataset.

# To do so, we drop wines without a taster name (it is also one of the targets we want to predict)
winedata.dropna(subset=['taster_name'], axis=0, how='any', inplace=True)
winedata.reset_index(inplace=True)
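
Resetting the index after `dropna` matters later in the notebook, when the label columns are copied into the feature dataframes: pandas aligns such assignments on index labels, not on row position. A minimal sketch of the effect, using a made-up toy frame rather than the wine data (`drop=True` is used here only to keep the toy frame small; the cell above keeps the old index as a column):

```python
import pandas as pd

# Toy frame standing in for winedata (values are made up for illustration)
toy = pd.DataFrame({
    "taster_name": ["A", None, "B", "C"],
    "variety": ["Gamay", "Barbera", "Grenache", "Chardonnay"],
})

kept = toy.dropna(subset=["taster_name"])        # index labels are now 0, 2, 3
features = pd.DataFrame({"f": [0.1, 0.2, 0.3]})  # fresh index labels 0, 1, 2

# Without a reset, the assignment aligns on labels: label 1 has no match
misaligned = features.copy()
misaligned["variety"] = kept["variety"]

# After a reset, labels and positions coincide again
aligned = features.copy()
aligned["variety"] = kept.reset_index(drop=True)["variety"]

print(misaligned)
print(aligned)
```

In the misaligned frame one row receives a missing label and another receives the label of a different row, which is the kind of silent error the reset avoids.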

In [4]:
print(winedata.shape)
winedata.isnull().sum()

(86220, 15)


index                         0
country                       0
description                   0
designation               24537
points                        0
price                      6094
province                      0
region_1                   7447
region_2                  49020
taster_name                   0
title                         0
variety                       0
winery                        0
tokenized_descriptions        0
token_descr_as_string         0
dtype: int64

In [5]:
# greatly reduce the size, just for testing:
winedata = winedata.head(1000)

In [6]:
# vectorization using tfidf :
time0 = time.time()

# keep only the 300 most frequent terms:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1), # unigrams only
                               max_df=0.95, # ignore terms with a document frequency above max_df (corpus-specific stop words)
                               min_df=10, # ignore terms that appear in fewer than 10 documents
                               max_features=300, # keep at most 300 terms, ranked by frequency across the corpus
                               use_idf=True, # weight term counts by inverse document frequency
                               norm='l2', # scale each row to unit Euclidean length so that long and short descriptions are comparable
                               smooth_idf=True # add 1 to all document frequencies, as if an extra document contained every term once; prevents division by zero
                              )
# Applying the vectorizer on the "clean" descriptions:
wine_tfidf = tfidf_vectorizer.fit_transform(winedata.token_descr_as_string)

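# list of features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
terms = tfidf_vectorizer.get_feature_names_out()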

# store the features in a dataframe:
tf_idf_features = pd.DataFrame(wine_tfidf.toarray(), columns=terms)

# add the different labels that interest us (indexes are the same):
tf_idf_features.loc[:, 'variety'] = winedata['variety']
tf_idf_features.loc[:, 'province'] = winedata['province']
tf_idf_features.loc[:, 'taster_name'] = winedata['taster_name']

print('Done! Vectorization took', time.time()-time0, 'seconds.')

Done! Vectorization took 0.04287385940551758 seconds.
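
To make the `smooth_idf` and `norm` settings concrete, here is a small pure-Python sketch of the weighting as scikit-learn documents it for `smooth_idf=True`: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three-document corpus and the helper name `tfidf_row` are made up for illustration and are not part of the notebook.

```python
import math

# Tiny tokenized corpus (made-up examples, not the wine descriptions)
docs = [
    ["ripe", "fruit", "tannin"],
    ["fruit", "acidity"],
    ["fruit", "tannin", "tannin"],
]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed idf: behaves as if one extra document contained every term once
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw count times idf, then scaled to unit Euclidean length."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in docs]
for term in vocab:
    print(term, df[term], round(idf[term], 3))
```

A term that appears in every document ("fruit" here) gets the minimum idf of 1, while rarer terms are weighted up.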


In [7]:
# vectorization using BoW :
time0 = time.time()

# keep only the 300 most frequent terms:
bow_vectorizer = CountVectorizer(ngram_range=(1,1), # unigrams only
                               max_df=0.95, # ignore terms with a document frequency above max_df (corpus-specific stop words)
                               min_df=10, # ignore terms that appear in fewer than 10 documents
                               max_features=300, # keep at most 300 terms, ranked by frequency across the corpus
                              )
# Applying the vectorizer on the "clean" descriptions:
wine_bow = bow_vectorizer.fit_transform(winedata.token_descr_as_string)

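# list of features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
terms = bow_vectorizer.get_feature_names_out()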

# store the features in a dataframe:
bow_features = pd.DataFrame(wine_bow.toarray(), columns=terms)

# add the different labels that interest us (indexes are the same):
bow_features.loc[:, 'variety'] = winedata['variety']
bow_features.loc[:, 'province'] = winedata['province']
bow_features.loc[:, 'taster_name'] = winedata['taster_name']

print('Done! Vectorization took', time.time()-time0, 'seconds.')

Done! Vectorization took 0.03232884407043457 seconds.
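
Because `TfidfVectorizer` builds its vocabulary the same way `CountVectorizer` does, the two vectorizers above should select the same terms when given the same `min_df`, `max_df` and `max_features`; only the cell values differ. A minimal sketch on a made-up four-sentence corpus (the corpus and the parameter values are illustrative assumptions, not the notebook's):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration
corpus = [
    "ripe fruit tannin",
    "fruit acidity",
    "fruit tannin tannin",
    "acidity citrus",
]
params = dict(ngram_range=(1, 1), min_df=2, max_features=3)

bow = CountVectorizer(**params)
tfidf = TfidfVectorizer(**params)
X_bow_toy = bow.fit_transform(corpus)
X_tfidf_toy = tfidf.fit_transform(corpus)

# Same selected terms and same non-zero pattern; only the weights differ
same_vocab = bow.vocabulary_ == tfidf.vocabulary_
same_sparsity = np.array_equal(X_bow_toy.toarray() > 0, X_tfidf_toy.toarray() > 0)
row_norms = np.linalg.norm(X_tfidf_toy.toarray(), axis=1)

print(sorted(bow.vocabulary_), same_vocab, same_sparsity)
```

The bag-of-words matrix holds raw counts, whereas each tf-idf row is rescaled to unit length.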


In [8]:
# same X, only Y differs
X_tfidf = tf_idf_features.drop(['variety', 'province', 'taster_name'], axis=1)
X_bow = bow_features.drop(['variety', 'province', 'taster_name'], axis=1)



## Using tf-idf


In [13]:
time0 = time.time()

for dep_var in ['variety', 'province', 'taster_name']:
    Y = tf_idf_features[dep_var]
    X_train, X_test, y_train, y_test = train_test_split(X_tfidf, Y, test_size=0.3, random_state=51)
    
    gb = GradientBoostingClassifier() # default parameters (!)
    gb.fit(X_train, y_train) # train

    # Evaluate the model on the training set (scores are optimistic; X_test and y_test are not used here):
    y_pred = gb.predict(X_train)
    print(dep_var, ': ')
    print(classification_report(y_train, y_pred))
    print("")
    
    # save the gb model, so that it can be reloaded later, e.g. for plotting
    pkl_filename = '../data/gradient_boost_predictions/gb_tfidf_'+dep_var+'.pkl'
    with open(pkl_filename, 'wb') as file:
        pickle.dump(gb, file)
    
print('Done! Training took', time.time()-time0, 'seconds.')
    

variety : 
                            precision    recall  f1-score   support

                  Albariño       1.00      1.00      1.00         4
                   Barbera       1.00      1.00      1.00         1
  Bordeaux-style Red Blend       1.00      1.00      1.00        63
Bordeaux-style White Blend       1.00      1.00      1.00        10
            Cabernet Franc       1.00      1.00      1.00        12
        Cabernet Sauvignon       1.00      1.00      1.00        56
                 Carmenère       1.00      1.00      1.00         4
           Champagne Blend       1.00      1.00      1.00        11
                Chardonnay       1.00      1.00      1.00        78
              Chenin Blanc       1.00      1.00      1.00         3
                     Gamay       1.00      1.00      1.00        11
            Gewürztraminer       1.00      1.00      1.00        11
                  Grenache       1.00      1.00      1.00         5
          Grüner Veltliner       1.0
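
The report above is computed on the training set, where a boosted ensemble can fit the data very closely. If generalization is also of interest, the held-out split could be scored as well. A minimal sketch on synthetic data (the arrays below are randomly generated stand-ins, not the wine features):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label depends on the first column plus noise
rng = np.random.RandomState(51)
X_toy = rng.rand(300, 10)
y_toy = (X_toy[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=51)

gb_toy = GradientBoostingClassifier(random_state=51)  # default parameters otherwise
gb_toy.fit(X_tr, y_tr)

# Score both splits to see the gap between fitting and generalizing
train_acc = accuracy_score(y_tr, gb_toy.predict(X_tr))
test_acc = accuracy_score(y_te, gb_toy.predict(X_te))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

In the notebook's loop, the equivalent would be calling `classification_report(y_test, gb.predict(X_test))` next to the training report.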

## Using Bag of Words 

In [12]:
time0 = time.time()

for dep_var in ['variety', 'province', 'taster_name']:
    Y = bow_features[dep_var]
    X_train, X_test, y_train, y_test = train_test_split(X_bow, Y, test_size=0.3, random_state=51)
    
    gb = GradientBoostingClassifier() # default parameters (!)
    gb.fit(X_train, y_train) # train

    # Evaluate the model on the training set (scores are optimistic; X_test and y_test are not used here):
    y_pred = gb.predict(X_train)
    print(dep_var, ': ')
    print(classification_report(y_train, y_pred))
    print("")
    
    # save the gb model, so that it can be reloaded later, e.g. for plotting
    pkl_filename = '../data/gradient_boost_predictions/gb_bow_'+dep_var+'.pkl'
    with open(pkl_filename, 'wb') as file:
        pickle.dump(gb, file)
    
print('Done! Training took', time.time()-time0, 'seconds.')
    

variety : 
                            precision    recall  f1-score   support

                  Albariño       1.00      1.00      1.00         4
                   Barbera       1.00      1.00      1.00         1
  Bordeaux-style Red Blend       1.00      1.00      1.00        63
Bordeaux-style White Blend       1.00      1.00      1.00        10
            Cabernet Franc       1.00      1.00      1.00        12
        Cabernet Sauvignon       1.00      1.00      1.00        56
                 Carmenère       1.00      1.00      1.00         4
           Champagne Blend       1.00      1.00      1.00        11
                Chardonnay       1.00      1.00      1.00        78
              Chenin Blanc       1.00      1.00      1.00         3
                     Gamay       1.00      1.00      1.00        11
            Gewürztraminer       1.00      1.00      1.00        11
                  Grenache       1.00      1.00      1.00         5
          Grüner Veltliner       1.0
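
Since the stated aim is feature importance, the pickled models can later be reloaded and their `feature_importances_` ranked by term. A minimal sketch of that round trip, assuming a toy model trained on made-up data; the file name `gb_toy.pkl` and the four terms are hypothetical:

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Made-up feature matrix: four "terms", label driven by one of them
rng = np.random.RandomState(0)
toy_terms = ["fruit", "tannin", "acidity", "oak"]
X_toy = pd.DataFrame(rng.rand(200, len(toy_terms)), columns=toy_terms)
y_toy = (X_toy["tannin"] > 0.5).astype(int)

gb_toy = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Save and reload, mirroring the pickle.dump call used in the loops above
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "gb_toy.pkl")  # hypothetical file name
    with open(pkl_path, "wb") as fh:
        pickle.dump(gb_toy, fh)
    with open(pkl_path, "rb") as fh:
        reloaded = pickle.load(fh)

# Rank terms by their impurity-based importance
importances = pd.Series(reloaded.feature_importances_, index=toy_terms)
importances = importances.sort_values(ascending=False)
print(importances)
```

Pickled scikit-learn models are generally only safe to reload with the same library version that produced them, so it may be worth recording the version alongside the files.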