# Imports

In [1]:
# data manipulation libraries
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 60)

# data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.pylabtools import figsize

# display visuals in the notebook
%matplotlib inline

# enable high-resolution plots
%config InlineBackend.figure_format='retina'

# feature extraction and preprocessing
import re
import datetime

# feature transformation and preprocessing
from category_encoders.ordinal import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Understand & Clean & Format Data

In [2]:
train = pd.read_csv("../data/train/train.csv") 
test = pd.read_csv("../data/test/test.csv")
train.sample(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
4637,Spain,The winery has dialed back the time in oak for...,Blanco Fermentado en Barrica,85,8.0,Northern Spain,Rioja,,Michael Schachner,@wineschach,Marqués de Cáceres 2006 Blanco Fermentado en B...,White Blend,Marqués de Cáceres
8837,US,"Deep blackberry syrup, chipped slate and sharp...",,91,50.0,California,Santa Ynez Valley,Central Coast,Matt Kettmann,@mattkettmann,Pegasus Estate 2013 Cabernet Sauvignon (Santa ...,Cabernet Sauvignon,Pegasus Estate
1716,Austria,Steely. Cool fruit joins an immense series of ...,Engelreich,92,25.0,Traisental,,,Roger Voss,@vossroger,Markus Huber 2009 Engelreich Riesling (Traisen...,Riesling,Markus Huber
531,France,"Tough at the moment, this is a wine with impre...",,92,,Bordeaux,Pauillac,,Roger Voss,@vossroger,Château Grand-Puy Ducasse 2010 Pauillac,Bordeaux-style Red Blend,Château Grand-Puy Ducasse
7141,France,"Ripe and fruity, this wine has a full, rich te...",,88,50.0,Burgundy,Meursault,,Roger Voss,@vossroger,Jaffelin 2010 Meursault,Chardonnay,Jaffelin


In [3]:
print("There are {} rows and {} columns in the train dataset."
      .format(train.shape[0], train.shape[1]))

There are 9000 rows and 13 columns in the train dataset.


In [4]:
print("There are {} rows and {} columns in the test dataset."
      .format(test.shape[0], test.shape[1]))

There are 1000 rows and 13 columns in the test dataset.


# Descriptive statistics & information about datasets

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9000 entries, 0 to 8999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                8994 non-null   object 
 1   description            9000 non-null   object 
 2   designation            6455 non-null   object 
 3   points                 9000 non-null   int64  
 4   price                  8403 non-null   float64
 5   province               8994 non-null   object 
 6   region_1               7505 non-null   object 
 7   region_2               3469 non-null   object 
 8   taster_name            7223 non-null   object 
 9   taster_twitter_handle  6888 non-null   object 
 10  title                  9000 non-null   object 
 11  variety                9000 non-null   object 
 12  winery                 9000 non-null   object 
dtypes: float64(1), int64(1), object(11)
memory usage: 914.2+ KB


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                1000 non-null   object 
 1   description            1000 non-null   object 
 2   designation            716 non-null    object 
 3   points                 1000 non-null   int64  
 4   price                  920 non-null    float64
 5   province               1000 non-null   object 
 6   region_1               831 non-null    object 
 7   region_2               384 non-null    object 
 8   taster_name            792 non-null    object 
 9   taster_twitter_handle  756 non-null    object 
 10  title                  1000 non-null   object 
 11  variety                1000 non-null   object 
 12  winery                 1000 non-null   object 
dtypes: float64(1), int64(1), object(11)
memory usage: 101.7+ KB


In [7]:
train.describe()

Unnamed: 0,points,price
count,9000.0,8403.0
mean,88.455222,35.532191
std,3.025945,40.750683
min,80.0,5.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,1300.0


The majority of the features are categorical (11 of the 13 columns have dtype `object`), and both datasets contain missing values. Most machine learning models require numerical, non-empty inputs, so in the Feature Engineering stage we will develop strategies to impute the missing data and to transform the categorical values into numeric ones.
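To make the two problems concrete, here is a minimal sketch on a small toy frame. The frame `toy` and its values are illustrative assumptions, not rows from the dataset, and plain pandas is used here as a stand-in for the `SimpleImputer` and `OrdinalEncoder` imported above. It measures the share of missing values per column, fills a numeric gap with the median, and maps a categorical column to integer codes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are illustrative only.
toy = pd.DataFrame({
    "country": ["US", "France", None, "Italy"],
    "price": [35.0, np.nan, 18.0, 45.0],
    "points": [88, 92, 86, 91],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Numeric feature: fill gaps with the column median.
toy["price"] = toy["price"].fillna(toy["price"].median())

# Categorical feature: fill gaps with a placeholder label,
# then map each distinct label to an integer code.
toy["country"] = toy["country"].fillna("Unknown")
toy["country_code"] = toy["country"].astype("category").cat.codes

print(toy)
```

The median is one possible choice for `price` because the column is strongly right-skewed (mean 35.5, maximum 1300 in the train set); other strategies may suit other columns better.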

In [8]:
test.describe()

Unnamed: 0,points,price
count,1000.0,920.0
mean,88.503,34.675
std,3.067475,42.240874
min,80.0,7.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,41.0
max,97.0,1000.0


## Description of features and target

In [9]:
train.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,US,"Expressive aromas of smoke, embers and blue fr...",,88,35.0,Washington,Columbia Valley (WA),Columbia Valley,Sean P. Sullivan,@wawinereport,Damsel 2013 Syrah (Columbia Valley (WA)),Syrah,Damsel
1,South Africa,"Soft mint, spice, cocoa and smoke on the nose ...",Redhill,89,30.0,Stellenbosch,,,Susan Kostrzewa,@suskostrzewa,Simonsig 2005 Redhill Pinotage (Stellenbosch),Pinotage,Simonsig
2,Portugal,"An elegant, finely rounded wine, with firm tan...",,90,,Douro,,,Roger Voss,@vossroger,Quinta de la Rosa 2008 Red (Douro),Portuguese Red,Quinta de la Rosa
3,South Africa,Winemaker: Louis Nel. This Cab-Shiraz blend is...,Cape Winemakers Guild Rapscallion,91,,Stellenbosch,,,Lauren Buzzeo,@laurbuzz,Louis Nel 2015 Cape Winemakers Guild Rapscalli...,Cabernet Sauvignon-Shiraz,Louis Nel
4,Portugal,"Lightly wood aged and spicy, this is a fine re...",Casa Américo Branco Reserva,90,,Dão,,,Roger Voss,@vossroger,Seacampo 2014 Casa Américo Branco Reserva Encr...,Encruzado,Seacampo


In [10]:
train.sample(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
243,Chile,"Fiery and wild at first, with aromas of cinnam...",Reserva,86,18.0,Curicó Valley,,,Michael Schachner,@wineschach,Aresti 2007 Reserva Merlot (Curicó Valley),Merlot,Aresti
2750,Italy,"Woodland berry, fragrant blue flower, camphor ...",Gallina,91,45.0,Piedmont,Barbaresco,,Kerin O’Keefe,@kerinokeefe,Ugo Lequio 2014 Gallina (Barbaresco),Nebbiolo,Ugo Lequio
5379,Italy,"Smoke, flint, mature stone fruit, dried sage a...",Vintage,90,35.0,Southern Italy,Greco di Tufo,,Kerin O’Keefe,@kerinokeefe,Mastroberardino 2007 Vintage (Greco di Tufo),Greco,Mastroberardino
7511,Germany,Initially demure notes of pressed apple and pe...,Auslese Sweet,90,17.0,Rheinhessen,,,Anna Lee C. Iijima,,Desire 2015 Auslese Sweet Gewürztraminer (Rhei...,Gewürztraminer,Desire
2409,US,"An amazing wine, so light and delicate in the ...",Tondre H Block,94,65.0,California,Santa Lucia Highlands,Central Coast,,,Tantara 2011 Tondre H Block Pinot Noir (Santa ...,Pinot Noir,Tantara


With some intuition, domain knowledge, and the help of Google, here are the explanations of the features and the target:

- <b>country:</b> Origin of the wine producer
- <b>description:</b> The taster's written description of the wine
- <b>designation:</b> Name given to the wine by the producer, sometimes used interchangeably with the vineyard name. It usually appears in the title.
- <b>points:</b> Our target value: the score a particular wine received from a taster. Note that a wine may receive different points from the same taster.
- <b>price:</b> Price of the wine
- <b>region_1:</b> Official definition of the place where the grapes for a wine are grown
- <b>region_2:</b> Official definition of the place where the grapes for a wine are grown
- <b>taster_name:</b> The taster who assigns the points to the wine
- <b>title:</b> Name of the wine, as available on the label
- <b>variety:</b> Grape variety of the wine
- <b>winery:</b> Name of the wine producer
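Two remarks in the list can be checked directly: that the designation usually appears in the title, and that a wine may receive different points from the same taster. The sketch below is one way to do so. The first frame reuses rows from the samples shown earlier (only rows whose title is not truncated); the second frame, `reviews`, is entirely hypothetical, and the helper name `designation_in_title` is an assumption made for illustration.

```python
import pandas as pd

# Rows copied from the samples above (titles that are shown in full).
rows = pd.DataFrame({
    "title": [
        "Damsel 2013 Syrah (Columbia Valley (WA))",
        "Simonsig 2005 Redhill Pinotage (Stellenbosch)",
        "Ugo Lequio 2014 Gallina (Barbaresco)",
        "Mastroberardino 2007 Vintage (Greco di Tufo)",
    ],
    "designation": [None, "Redhill", "Gallina", "Vintage"],
})


def designation_in_title(df):
    """Return a boolean Series: True where the designation occurs in the title."""
    hits = [
        isinstance(d, str) and d in t  # missing designations count as False
        for t, d in zip(df["title"], df["designation"])
    ]
    return pd.Series(hits, index=df.index)


mask = designation_in_title(rows)
# Share of rows with a known designation whose title contains it.
share = mask[rows["designation"].notna()].mean()
print(share)

# Hypothetical reviews (made-up values) showing how repeated,
# differing scores from one taster could be detected.
reviews = pd.DataFrame({
    "title": ["Wine A", "Wine A", "Wine B"],
    "taster_name": ["Taster X", "Taster X", "Taster Y"],
    "points": [88, 90, 91],
})
distinct = reviews.groupby(["title", "taster_name"])["points"].nunique()
conflicting = distinct[distinct > 1]
print(conflicting)
```

On the real data, `groupby` drops rows with a missing `taster_name` by default, so those reviews would need separate handling.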