# Imports

In [1]:
# data manipulation libraries
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 60)

# data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.pylabtools import figsize

%matplotlib inline
# to display visuals in the notebook

%config InlineBackend.figure_format='retina'
# to enable high-resolution plots

# feature extraction and preprocessing
import re
import datetime

# feature transformation and preprocessing
from category_encoders.ordinal import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Auxiliary Functions

The function below will enable us to observe the missing values as a percentage per feature.

In [2]:
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent],
                              axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = (mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1))

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns
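
As a quick sanity check on what the helper reports, the same percentages can be reproduced in a couple of lines. This is a minimal sketch on a made-up frame: the name `toy` and its values are assumptions for illustration, not data from the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "points": [88, 90, 86, 91],
    "price": [35.0, np.nan, 15.0, np.nan],
    "region_2": [None, None, None, "Sonoma"],
})

# isnull().mean() gives the fraction of missing cells per column.
missing_percent = (toy.isnull().mean() * 100).round(1)

# Keep only columns with gaps, most incomplete first, as the helper does.
missing_percent = missing_percent[missing_percent != 0].sort_values(ascending=False)
print(missing_percent)
```

Columns without missing values (here `points`) drop out of the result, which mirrors the filtering step inside `missing_values_table`.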

# Understand & Clean & Format Data

In [3]:
train = pd.read_csv("../data/train/train.csv")
test = pd.read_csv("../data/test/test.csv")
train.sample(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1712,US,"This is an herbal, earthy, delicately layered ...",Morelli Lane Vineyard,88,45.0,California,Russian River Valley,Sonoma,Virginie Boone,@vboone,Lost Canyon 2014 Morelli Lane Vineyard Pinot N...,Pinot Noir,Lost Canyon
4815,US,"Soft and heavy, this has lots of sweet oak rid...",Reserve,86,48.0,California,Napa Valley,Napa,,,Rutherford Ranch 2008 Reserve Cabernet Sauvign...,Cabernet Sauvignon,Rutherford Ranch
1299,US,"Vanilla, baking spice, dark-fruit and herb aro...",,90,36.0,Washington,Columbia Valley (WA),Columbia Valley,Sean P. Sullivan,@wawinereport,Bunnell 2011 Grenache (Columbia Valley (WA)),Grenache,Bunnell
7298,US,Fire up the Barbie and drink this Zin with bur...,,82,15.0,California,Sonoma County,Sonoma,,,Alterra 2011 Zinfandel (Sonoma County),Zinfandel,Alterra
3336,Portugal,"Fermented in open lagars, this wine has a dens...",Villa Oliveira,93,60.0,Dão,,,Roger Voss,@vossroger,Casa da Passarella 2010 Villa Oliveira Touriga...,Touriga Nacional,Casa da Passarella


In [4]:
print("There are {} rows and {} columns in the train dataset."
      .format(train.shape[0], train.shape[1]))

There are 9000 rows and 13 columns in the train dataset.


In [5]:
print("There are {} rows and {} columns in the test dataset."
      .format(test.shape[0], test.shape[1]))

There are 1000 rows and 13 columns in the test dataset.


# Descriptive statistics & information about datasets

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9000 entries, 0 to 8999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                8994 non-null   object 
 1   description            9000 non-null   object 
 2   designation            6455 non-null   object 
 3   points                 9000 non-null   int64  
 4   price                  8403 non-null   float64
 5   province               8994 non-null   object 
 6   region_1               7505 non-null   object 
 7   region_2               3469 non-null   object 
 8   taster_name            7223 non-null   object 
 9   taster_twitter_handle  6888 non-null   object 
 10  title                  9000 non-null   object 
 11  variety                9000 non-null   object 
 12  winery                 9000 non-null   object 
dtypes: float64(1), int64(1), object(11)
memory usage: 914.2+ KB


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                1000 non-null   object 
 1   description            1000 non-null   object 
 2   designation            716 non-null    object 
 3   points                 1000 non-null   int64  
 4   price                  920 non-null    float64
 5   province               1000 non-null   object 
 6   region_1               831 non-null    object 
 7   region_2               384 non-null    object 
 8   taster_name            792 non-null    object 
 9   taster_twitter_handle  756 non-null    object 
 10  title                  1000 non-null   object 
 11  variety                1000 non-null   object 
 12  winery                 1000 non-null   object 
dtypes: float64(1), int64(1), object(11)
memory usage: 101.7+ KB


In [8]:
train.describe()

Unnamed: 0,points,price
count,9000.0,8403.0
mean,88.455222,35.532191
std,3.025945,40.750683
min,80.0,5.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,1300.0


Most of the features are categorical or free text (11 of the 13 columns have `object` dtype), and both datasets contain missing values. Most machine learning models require numerical, non-missing inputs, so in the Feature Engineering step we will develop strategies to impute the missing data and to encode the categorical values as numbers.
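
To make the two ideas concrete, here is a minimal sketch of imputation and encoding, not the strategy this notebook ends up using. The frames `toy_train` and `toy_test`, the `"missing"` placeholder, and the choice of the median are assumptions for illustration. The integer codes are a plain pandas stand-in for an ordinal encoder.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the train/test frames (values are made up for illustration).
toy_train = pd.DataFrame({
    "country": ["US", "France", None, "US"],
    "price": [45.0, np.nan, 15.0, 60.0],
})
toy_test = pd.DataFrame({
    "country": ["France", "Chile"],
    "price": [np.nan, 20.0],
})

# Numeric column: learn the median on the training data only, then reuse it.
price_imputer = SimpleImputer(strategy="median")
toy_train[["price"]] = price_imputer.fit_transform(toy_train[["price"]])
toy_test[["price"]] = price_imputer.transform(toy_test[["price"]])

# Categorical column: fill gaps with a placeholder, then map each category
# seen in training to an integer code; unseen test categories become -1.
toy_train["country"] = toy_train["country"].fillna("missing")
toy_test["country"] = toy_test["country"].fillna("missing")
categories = pd.Index(toy_train["country"].unique())
toy_train["country_code"] = categories.get_indexer(toy_train["country"])
toy_test["country_code"] = categories.get_indexer(toy_test["country"])
```

Fitting the imputer and the category mapping on the training frame alone keeps information from the test set out of the preprocessing step.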

In [9]:
test.describe()

Unnamed: 0,points,price
count,1000.0,920.0
mean,88.503,34.675
std,3.067475,42.240874
min,80.0,7.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,41.0
max,97.0,1000.0


## Description of features and target

In [10]:
train.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,US,"Expressive aromas of smoke, embers and blue fr...",,88,35.0,Washington,Columbia Valley (WA),Columbia Valley,Sean P. Sullivan,@wawinereport,Damsel 2013 Syrah (Columbia Valley (WA)),Syrah,Damsel
1,South Africa,"Soft mint, spice, cocoa and smoke on the nose ...",Redhill,89,30.0,Stellenbosch,,,Susan Kostrzewa,@suskostrzewa,Simonsig 2005 Redhill Pinotage (Stellenbosch),Pinotage,Simonsig
2,Portugal,"An elegant, finely rounded wine, with firm tan...",,90,,Douro,,,Roger Voss,@vossroger,Quinta de la Rosa 2008 Red (Douro),Portuguese Red,Quinta de la Rosa
3,South Africa,Winemaker: Louis Nel. This Cab-Shiraz blend is...,Cape Winemakers Guild Rapscallion,91,,Stellenbosch,,,Lauren Buzzeo,@laurbuzz,Louis Nel 2015 Cape Winemakers Guild Rapscalli...,Cabernet Sauvignon-Shiraz,Louis Nel
4,Portugal,"Lightly wood aged and spicy, this is a fine re...",Casa Américo Branco Reserva,90,,Dão,,,Roger Voss,@vossroger,Seacampo 2014 Casa Américo Branco Reserva Encr...,Encruzado,Seacampo


In [11]:
train.sample(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
8701,US,"There are massive flavors in this wine, and th...",Bacigalupi Vineyard,87,42.0,California,Russian River Valley,Sonoma,,,Gracianna 2010 Bacigalupi Vineyard Zinfandel (...,Zinfandel,Gracianna
5287,US,This clean if somewhat simple bottling starts ...,Barn Block Reserve,88,60.0,California,Santa Cruz Mountains,Central Coast,Matt Kettmann,@mattkettmann,Black Ridge 2014 Barn Block Reserve Pinot Noir...,Pinot Noir,Black Ridge
8888,France,"Solid and textured, this is a firm wine with a...",Château Belrose,89,15.0,Bordeaux,Bordeaux Supérieur,,Roger Voss,@vossroger,Maison Bouey 2011 Château Belrose (Bordeaux S...,Bordeaux-style Red Blend,Maison Bouey
4910,US,There's an unparalleled woody intensity on the...,Estate Bottled,95,70.0,California,Santa Cruz Mountains,Central Coast,Matt Kettmann,@mattkettmann,Mount Eden Vineyards 2013 Estate Bottled Caber...,Cabernet Sauvignon,Mount Eden Vineyards
2180,France,"For a wine meant to be consumed now, this is a...",Nouveau,85,13.0,Beaujolais,Beaujolais-Villages,,Roger Voss,@vossroger,Albert Bichot 2012 Nouveau (Beaujolais-Villages),Gamay,Albert Bichot


Drawing on intuition, domain knowledge, and some help from Google, here are the explanations of the features and the target:

- <b>country:</b> Origin of the wine producer
- <b>description:</b> The taster's written description of the wine
- <b>designation:</b> Name given to the wine by the producer, sometimes used interchangeably with the vineyard name. It usually also appears in the title.
- <b>points:</b> Our target value: the score a particular wine received from a taster. Note that the same wine may receive different points from the same taster.
- <b>price:</b> Price of the wine
- <b>region_1:</b> Official definition of the place where the grapes for a wine are grown
- <b>region_2:</b> A second official designation of the place where the grapes for a wine are grown; it is empty for most rows (only 3,469 of the 9,000 training rows have a value)
- <b>taster_name:</b> The taster who assigns the points to the wine
- <b>title:</b> Name of the wine, as available on the label
- <b>variety:</b> Grape variety of the wine
- <b>winery:</b> Name of the wine producer
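
Two of the remarks in the list can be checked directly: that the designation usually appears in the title, and that the same taster may score the same wine differently. The following is a hedged sketch on a hypothetical frame `toy`. The first two rows borrow values shown in the samples above, while the third row (a repeat review with 87 points) is invented to illustrate the second check.

```python
import pandas as pd

# Hypothetical frame: rows 1-2 reuse values from the samples above,
# row 3 is an invented repeat review used only for illustration.
toy = pd.DataFrame({
    "title": [
        "Simonsig 2005 Redhill Pinotage (Stellenbosch)",
        "Bunnell 2011 Grenache (Columbia Valley (WA))",
        "Simonsig 2005 Redhill Pinotage (Stellenbosch)",
    ],
    "designation": ["Redhill", None, "Redhill"],
    "taster_name": ["Susan Kostrzewa", "Sean P. Sullivan", "Susan Kostrzewa"],
    "points": [89, 90, 87],
})

# Check 1: among rows that have a designation, how often is it part of the title?
with_designation = toy[toy["designation"].notna()]
in_title = [
    designation in title
    for designation, title in zip(with_designation["designation"],
                                  with_designation["title"])
]
share_in_title = sum(in_title) / len(in_title)

# Check 2: (title, taster) pairs that received more than one distinct score.
points_per_pair = toy.groupby(["title", "taster_name"])["points"].nunique()
repeat_scored = points_per_pair[points_per_pair > 1]

print(share_in_title)
print(repeat_scored)
```

Running the same two checks on `train` would show how often each remark holds in the real data; the titles there are complete, even though the notebook display truncates them with an ellipsis.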