# Wine Reviews

Comp 562 final project by Chris Burgess, Justin Do, Rhea Gupta, and Kyuyeon Kim.

This project is based on a dataset of wine reviews. Our goal is to create models that predict the grape variety, score, and growing region of a wine. We first clean the data and present some basic statistics, then describe how we built the models, and finally report how they performed.

End Goal: create a model that can identify the variety, winery, and location of a wine based on a description.

## Importing and Cleaning

In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

plt.rcParams['figure.dpi']= 150

In [43]:
# The working directory may differ between environments; adjust this path as needed.
path_to_data = r"./data/winemag-data-130k-v2.csv"
df = pd.read_csv(path_to_data)
# Drop the unnamed index column that was saved with the CSV.
df = df.drop(columns=['Unnamed: 0'])


In [44]:
df.head(3)
print(df.columns)

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')


In [45]:
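# Compute a year column: take the first four-digit number starting with 1 or 2
# in each title; titles without a match get the placeholder year 0.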

year_reg = r'[12][0-9]{3}'
years = []
for title in df['title']:
    year_result = re.search(year_reg, title)
    if year_result is None:
        years.append(0)
    else:
        years.append(int(year_result.group(0)))

df['year'] = years

df.head(5)

,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012
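The pattern `[12][0-9]{3}` accepts any four-digit run that starts with 1 or 2, so a number in a producer name could be taken as the vintage. As a minimal sketch, assuming vintages fall between 1900 and 2099 and stand alone as a word (both are assumptions, not properties checked against the dataset), a stricter helper could look like the following. `extract_year` is a hypothetical name, and it returns `None` rather than 0 when no year is found.

```python
import re

# Assumption: a vintage is a standalone four-digit number from 1900 to 2099.
YEAR_PATTERN = re.compile(r'\b(?:19|20)[0-9]{2}\b')

def extract_year(title):
    """Return the first plausible vintage in a title, or None if there is none."""
    if not isinstance(title, str):  # e.g. a missing (NaN) title
        return None
    match = YEAR_PATTERN.search(title)
    return int(match.group(0)) if match else None

print(extract_year("Nicosia 2013 Vulkà Bianco (Etna)"))
print(extract_year("Quinta dos Avidagos 2011 Avidagos Red (Douro)"))
print(extract_year("A title with no vintage"))
```

With pandas this could be applied as `df['title'].map(extract_year)`; the resulting column then holds missing values instead of the placeholder 0.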


## Getting the Word Roots

In this section, we find the corpus of word roots (lemmas) present in the reviews and associate each review with the word roots it contains.

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")

The following code finds all of the word roots (lemmas) of the nouns, verbs, adjectives, and proper nouns in each review and across the entire dataset. It takes approximately 20 minutes to run.

In [46]:
# Part-of-speech tags whose lemmas we keep
keep_pos = {'NOUN', 'VERB', 'PROPN', 'ADJ'}

global_set = set()        # lemmas seen anywhere in the dataset
description_lemmas = []   # one set of lemmas per review, in row order

for _, row in df.iterrows():
    local_set = set()
    description_doc = nlp(row["description"])
    for token in description_doc:
        if token.pos_ in keep_pos:
            local_set.add(token.lemma_)
            global_set.add(token.lemma_)
    description_lemmas.append(local_set)
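
Calling `nlp` once per row is the simplest approach. spaCy also offers `nlp.pipe`, which processes texts in batches and is generally faster on large collections. The sketch below restates the same filtering on top of `nlp.pipe`. `collect_lemmas` and its `batch_size` default are illustrative choices, not part of the original notebook, and the speedup has not been measured on this dataset.

```python
# Part-of-speech tags whose lemmas are kept (same four as in the loop above).
KEEP_POS = {"NOUN", "VERB", "PROPN", "ADJ"}

def collect_lemmas(nlp, texts, batch_size=256):
    """Return (per-text lemma sets, set of all lemmas) for an iterable of texts.

    `nlp` is a loaded spaCy pipeline; it is passed in so that this
    function does not load a model itself.
    """
    all_lemmas = set()
    per_text = []
    # nlp.pipe yields one processed Doc per input text, in input order.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        lemmas = {token.lemma_ for token in doc if token.pos_ in KEEP_POS}
        per_text.append(lemmas)
        all_lemmas |= lemmas
    return per_text, all_lemmas
```

Assuming `nlp` and `df` are defined as above, it would be called as `description_lemmas, global_set = collect_lemmas(nlp, df["description"])`.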


In [47]:
df['lemmas'] = description_lemmas
df.head()

,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year,lemmas
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,2013,"{aroma, tropical, fruit, offer, expressive, un..."
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011,"{fruit, firm, juicy, berry, ripe, wine, smooth..."
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013,"{dominate, underscore, snappy, steel, flavor, ..."
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013,"{drizzle, give, bit, blossom, astringent, semi..."
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012,"{tannic, stew, herbal, companion, rough, wine,..."


In [39]:
export_location = './data/winemag-data-130k-v2_lemma.csv'
df.to_csv(export_location)
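
Note that `to_csv` writes each set in the `lemmas` column in its string form, for example `"{'aroma', 'fruit'}"`, so the column comes back as plain text when the file is read again. Below is a minimal sketch of converting such a cell back into a set with the standard library; `parse_lemma_set` is a hypothetical helper name.

```python
import ast

def parse_lemma_set(cell):
    """Convert the text form of a set, as written to CSV, back into a set."""
    if cell == 'set()':  # text form of an empty set
        return set()
    return set(ast.literal_eval(cell))

original = {'aroma', 'tropical', 'fruit'}
restored = parse_lemma_set(str(original))
print(restored == original)
```

With pandas, the helper could be passed as a converter when reloading: `pd.read_csv(export_location, converters={'lemmas': parse_lemma_set})`.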

The following prints the number of distinct word roots that appear across all of the reviews.

In [49]:
print(len(global_set))

28603
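
The count above says how many distinct roots exist, but not how common each one is. As a sketch, assuming `description_lemmas` is the list of per-review sets built earlier, the number of reviews containing each root can be tallied with `collections.Counter`. The three small sets below are made-up stand-ins for illustration, not rows from the dataset, and `review_counts` is a hypothetical helper name.

```python
from collections import Counter

def review_counts(lemma_sets):
    """Count, for each lemma, how many reviews contain it."""
    counts = Counter()
    for lemmas in lemma_sets:
        counts.update(lemmas)  # each set adds at most 1 per lemma
    return counts

# Made-up stand-in data; in the notebook this would be description_lemmas.
example = [{'wine', 'fruit'}, {'wine', 'tannic'}, {'wine', 'fruit', 'ripe'}]
counts = review_counts(example)
print(counts.most_common(2))
```

Because each review is stored as a set, a root is counted once per review no matter how often it occurs in the text.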
