# Data Cleaning and Processing

The aim of this notebook is to:  

* clean the dataset (based on the observations made in Data_Exploration)
* parse the descriptions
* clean the descriptions

In [59]:
# general:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# NLP:
import spacy

In [2]:
raw_winedata = pd.read_csv("../data/winemag-data-190314.csv").drop("Unnamed: 0", axis=1)  # the index was stored as a column
initial_number_of_rows = raw_winedata.shape[0]
print(raw_winedata.shape)
raw_winedata.head()

(141617, 14)


,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,vintage
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013.0
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0


## Handle Missing Values

All rows have descriptions.   

For now, we remove only the single row with a missing variety:

In [3]:
winedata = raw_winedata[pd.notnull(raw_winedata['variety'])]
new_number_of_rows = winedata.shape[0]
print("Number of rows removed: " , initial_number_of_rows - new_number_of_rows)
print("New size of dataset: " , winedata.shape)

Number of rows removed:  1
New size of dataset:  (141616, 14)
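
Statements such as "all rows have descriptions" can be checked by counting missing values per column. The following is a minimal sketch on a toy frame; the `toy` data is made up for illustration and is not taken from the wine dataset.

```python
import pandas as pd

# Toy stand-in for the wine data; the values are invented for illustration.
toy = pd.DataFrame({
    "description": ["crisp and dry", "ripe and fruity", "tart and snappy"],
    "variety": ["Riesling", None, "Pinot Gris"],
    "price": [13.0, None, None],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop only the rows where 'variety' is missing.
toy_clean = toy.dropna(subset=["variety"])
print("Number of rows removed:", len(toy) - len(toy_clean))
```

`dropna(subset=["variety"])` has the same effect here as the `pd.notnull` mask used in the cell above.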


## Handle Duplicate Descriptions

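Drop every row whose description text appears more than once. Because `keep=False` is used, all copies of a duplicated description are removed (none is kept), even though most of them are attached to different wines.  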

In [4]:
winedata = winedata.drop_duplicates(subset='description', keep=False)
print("Number of rows removed: " , new_number_of_rows - winedata.shape[0])
print("New size of dataset: " , winedata.shape)
new_number_of_rows = winedata.shape[0]

Number of rows removed:  76
New size of dataset:  (141540, 14)
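
To make the effect of `keep=False` concrete, here is a small sketch comparing it with `keep="first"`. The toy titles and descriptions are invented for illustration.

```python
import pandas as pd

# Two different (made-up) wines share the same description text.
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
    "description": ["ripe and fruity", "ripe and fruity",
                    "tart and snappy", "crisp and dry"],
})

# keep=False drops every row whose description is duplicated.
dropped_all = toy.drop_duplicates(subset="description", keep=False)

# keep="first" would keep one copy of each duplicated description.
kept_first = toy.drop_duplicates(subset="description", keep="first")

print(list(dropped_all["title"]))  # ['Wine C', 'Wine D']
print(list(kept_first["title"]))   # ['Wine A', 'Wine C', 'Wine D']
```

With `keep=False`, an ambiguous description is never attached to an arbitrarily chosen wine, at the cost of losing a few more rows.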


## Parse the Descriptions

In [43]:
# Load the medium English spaCy model
parser = spacy.load('en_core_web_md')

In [44]:
# Parse every description (warning: this takes several minutes to run)
winedata.loc[:,'description_parsed'] = winedata['description'].apply(parser)
winedata.head()

,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,vintage,description_parsed
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013.0,"(Aromas, include, tropical, fruit, ,, broom, ,..."
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0,"(This, is, ripe, and, fruity, ,, a, wine, that..."
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0,"(Tart, and, snappy, ,, the, flavors, of, lime,..."
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0,"(Pineapple, rind, ,, lemon, pith, and, orange,..."
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0,"(Much, like, the, regular, bottling, from, 201..."


## Clean the Text Description

* lower case
* remove stopwords
* remove punctuation

Possible later step, if needed: also remove domain-specific stopwords (e.g. "wine").

In [53]:
# remove stopwords and punctuation from the parsed data:
def clean_parsed_text(text):
    # Filter out punctuation and stop words.
    words = [token.lower_ for token in text if not token.is_punct and not token.is_stop]
    
    # TODO: optionally filter out domain-specific stopwords here
    
    # TODO: remove whitespace-only tokens (e.g. from double spaces)
    
    return words
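
The two open TODOs could be handled as a post-processing step on the returned token list. The sketch below works on plain lower-cased strings rather than spaCy tokens; `DOMAIN_STOPWORDS` and `filter_tokens` are hypothetical names, and the stopword list is only an example, not a vetted choice.

```python
# Hypothetical domain-specific stopwords; the real list would need to be chosen.
DOMAIN_STOPWORDS = {"wine", "drink"}


def filter_tokens(tokens, extra_stopwords=DOMAIN_STOPWORDS):
    """Drop whitespace-only tokens and domain-specific stopwords.

    Expects a list of lower-cased strings, such as the list returned by
    clean_parsed_text.
    """
    return [t for t in tokens if t.strip() and t not in extra_stopwords]


example = ["ripe", " ", "fruity", "wine", "  ", "smooth"]
print(filter_tokens(example))  # ['ripe', 'fruity', 'smooth']
```

Keeping this as a separate step means the domain stopword list can be changed without re-running the slow parsing cell.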


In [56]:
# Replace each parsed description with its list of cleaned tokens
winedata['description_parsed'] = winedata['description_parsed'].apply(clean_parsed_text)

In [57]:
# a look at the first cleaned description:
winedata['description_parsed'].iloc[0]

['aromas',
 'include',
 'tropical',
 'fruit',
 'broom',
 'brimstone',
 'dried',
 'herb',
 'palate',
 'overly',
 'expressive',
 'offering',
 'unripened',
 'apple',
 'citrus',
 'dried',
 'sage',
 'alongside',
 'brisk',
 'acidity']

In [60]:
# save the parsed data, as it takes time to generate it!
winedata.to_csv('../data/winedata_parsed.csv', index=False)
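
One caveat when reloading this file: CSV stores each token list as its string representation, so `description_parsed` comes back as text rather than as Python lists. A minimal sketch of the round trip, using an in-memory buffer and toy data in place of the real file:

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, standing in for the cleaned data.
toy = pd.DataFrame({"description_parsed": [["ripe", "fruity"], ["tart", "snappy"]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# At this point each cell is a string such as "['ripe', 'fruity']".
was_string = isinstance(reloaded["description_parsed"][0], str)

# Convert the string representations back into lists.
reloaded["description_parsed"] = reloaded["description_parsed"].apply(ast.literal_eval)
print(reloaded["description_parsed"][0])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this conversion.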