# Project Background - Wine Reviews

This project uses the "Wine Reviews" dataset published on [Kaggle](https://www.kaggle.com/zynicide/wine-reviews/home).

The dataset contains 14 columns and roughly 130,000 rows of wine reviews. It was scraped from [WineEnthusiast](https://www.winemag.com/?s=&drink_type=wine) by Kaggler [zackthoutt](https://www.kaggle.com/zynicide), who was interested in "creating a model that can identify variety, winery, and location of a wine based on a description."

## Data Discussion

The dataset contains 14 columns:
* #: The row number (pandas loads this as the "Unnamed: 0" column)
* Country: The country that the wine is from
* Description: A few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
* Designation: The vineyard within the winery where the grapes that made the wine are from
* Points: The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews of wines that score greater than 80)
* Price: The cost for a bottle of the wine
* Province: The province or state that the wine is from
* Region_1: The wine growing area in a province or state
* Region_2: A more specific region within a wine growing area, when one is given (can contain NULL values)
* Taster_name: Name of the person who tasted and reviewed the wine
* Taster_twitter_handle: Twitter handle for the person who tasted and reviewed the wine
* Title: The title of the wine review, which often contains the vintage if you're interested in extracting that feature
* Variety: The type of grapes used to make the wine (e.g. Pinot Noir)
* Winery: The winery that made the wine
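
The Title entry notes that the vintage can often be pulled out of the review title. Below is a minimal sketch of one way to do that. It assumes a vintage is a standalone four-digit year between 1900 and 2099; the helper name `extract_vintage` and the last sample title are hypothetical and not part of the original notebook.

```python
import re
from typing import Optional

# Assumption: a vintage is a standalone four-digit year from 1900 to 2099.
_YEAR = re.compile(r"\b(19\d{2}|20\d{2})\b")


def extract_vintage(title: str) -> Optional[int]:
    """Return the first year-like number in a title, or None if there is none."""
    match = _YEAR.search(title)
    return int(match.group(1)) if match else None


titles = [
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Some Non-Vintage Sparkling (Champagne)",  # hypothetical title without a year
]
vintages = [extract_vintage(t) for t in titles]
print(vintages)
```

Because the pattern takes the first match, a winery name that itself contains a year would be picked up instead of the vintage, so the extracted values would need a sanity check before being used as a feature.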



### Import Libraries

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
%matplotlib inline

## Useful Functions
A helper function I created to clean up some of the code later on.

In [38]:
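def missingData(data):
    """Return the count and percentage of missing values per column, highest first."""
    # Number of nulls in each column
    total = data.isnull().sum().sort_values(ascending=False)
    # Nulls in each column as a percentage of all rows
    percent = (data.isnull().sum() / data.isnull().count() * 100).sort_values(ascending=False)
    # Put both series side by side under readable column names
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data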
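
To make the shape of the helper's output concrete, here is a small usage sketch on a made-up frame. The `toy` data is an assumption for illustration only; the function body repeats the logic of `missingData` above so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def missingData(data):
    """Count and percentage of missing values per column, highest first."""
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum() / data.isnull().count() * 100).sort_values(ascending=False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])


# Hypothetical stand-in for the wine data: 4 rows, some values missing.
toy = pd.DataFrame({
    "price": [15.0, np.nan, 14.0, np.nan],
    "country": ["Italy", "Portugal", None, "US"],
    "points": [87, 87, 87, 87],
})

summary = missingData(toy)
print(summary)
```

The result has one row per column of the input, indexed by column name, with the column that has the most missing values at the top.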

#### Read in the data

In [39]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'C:/Users\h9067jib\Desktop\project\SemesterDataScienceProject\winemag-data-130k-v2.csv')
df.head(5)

|  | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 |  | Sicily & Sardinia | Etna |  | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro |  |  | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... |  | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore |  | Alexander Peartree |  | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


In [40]:
df.shape

(129971, 14)

That's 129,971 rows, so we're pretty close to 130k rows of data.

## Data Cleansing

#### Check for Duplicate Data
The first thing I like to do is check for duplicate data.

The description field looks like the best choice for this: it is free text written by a reviewer, so two different wines should rarely share the same value.

In [41]:
df[df.duplicated('description', keep = False)].sort_values('description')

|  | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67614 | 67614 | US | 100% Malbec, it's redolent with dark plums, wi... |  | 87 | 20.0 | Washington | Rattlesnake Hills | Columbia Valley | Sean P. Sullivan | @wawinereport | Roza Ridge 2010 Malbec (Rattlesnake Hills) | Malbec | Roza Ridge |
| 46540 | 46540 | US | 100% Malbec, it's redolent with dark plums, wi... |  | 87 | 20.0 | Washington | Rattlesnake Hills | Columbia Valley | Sean P. Sullivan | @wawinereport | Roza Ridge 2010 Malbec (Rattlesnake Hills) | Malbec | Roza Ridge |
| 119702 | 119702 | US | 100% Sangiovese, this pale pink wine has notes... | Meadow | 88 | 18.0 | Washington | Columbia Valley (WA) | Columbia Valley | Sean P. Sullivan | @wawinereport | Ross Andrew 2013 Meadow Rosé (Columbia Valley ... | Rosé | Ross Andrew |
| 72181 | 72181 | US | 100% Sangiovese, this pale pink wine has notes... | Meadow | 88 | 18.0 | Washington | Columbia Valley (WA) | Columbia Valley | Sean P. Sullivan | @wawinereport | Ross Andrew 2013 Meadow Rosé (Columbia Valley ... | Rosé | Ross Andrew |
| 73731 | 73731 | France | 87-89 Barrel sample. A pleasurable, perfumed w... | Barrel sample | 88 |  | Bordeaux | Saint-Julien |  | Roger Voss | @vossroger | Château Lalande-Borie 2008 Barrel sample (Sai... | Bordeaux-style Red Blend | Château Lalande-Borie |
| 100745 | 100745 | France | 87-89 Barrel sample. A pleasurable, perfumed w... | Barrel sample | 88 |  | Bordeaux | Saint-Julien |  | Roger Voss | @vossroger | Château Lalande-Borie 2008 Barrel sample (Sai... | Bordeaux-style Red Blend | Château Lalande-Borie |
| 73730 | 73730 | France | 87-89 Barrel sample. Minty aromas give lifted ... | Barrel sample | 88 |  | Bordeaux | Saint-Émilion |  | Roger Voss | @vossroger | Château Haut-Sarpe 2008 Barrel sample (Saint-... | Bordeaux-style Red Blend | Château Haut-Sarpe |
| 100744 | 100744 | France | 87-89 Barrel sample. Minty aromas give lifted ... | Barrel sample | 88 |  | Bordeaux | Saint-Émilion |  | Roger Voss | @vossroger | Château Haut-Sarpe 2008 Barrel sample (Saint-... | Bordeaux-style Red Blend | Château Haut-Sarpe |
| 73729 | 73729 | France | 87-89 Barrel sample. With its lovely fresh fru... | Barrel sample | 88 |  | Bordeaux | Lalande de Pomerol |  | Roger Voss | @vossroger | Château Bertineau Saint-Vincent 2008 Barrel sa... | Bordeaux-style Red Blend | Château Bertineau Saint-Vincent |
| 100743 | 100743 | France | 87-89 Barrel sample. With its lovely fresh fru... | Barrel sample | 88 |  | Bordeaux | Lalande de Pomerol |  | Roger Voss | @vossroger | Château Bertineau Saint-Vincent 2008 Barrel sa... | Bordeaux-style Red Blend | Château Bertineau Saint-Vincent |


Over 20k rows share their description with at least one other row. I'm going to drop the repeated rows from the data set, keeping the first occurrence of each description.

In [42]:
df = df.drop_duplicates('description')
df.shape

(119955, 14)

Dropping the duplicates removes 10,016 rows and leaves us with about 120k rows of data.
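
The duplicate listing and the drop count different things: `duplicated(..., keep=False)` flags every row in a group of repeated descriptions, while `drop_duplicates` keeps the first row of each group and removes only the rest. That is why far fewer rows disappear (129,971 down to 119,955) than the duplicate listing contains. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical stand-in: one description appears twice.
toy = pd.DataFrame({
    "description": ["ripe and fruity", "tart and snappy", "ripe and fruity", "minty aromas"],
    "points": [87, 88, 87, 90],
})

# keep=False marks every member of a duplicate group, the first occurrence included.
flagged = toy[toy.duplicated("description", keep=False)]

# drop_duplicates keeps the first row of each group by default.
deduped = toy.drop_duplicates("description")

rows_flagged = len(flagged)
rows_removed = len(toy) - len(deduped)
print(rows_flagged, rows_removed)
```

Here two rows are flagged as duplicates but only one is removed, because the first "ripe and fruity" row is kept.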

#### Check for NULL Values


In [43]:
print(df.shape)
missingData(df)

(119955, 14)


| Column | Total | Percent |
|---|---|---|
| region_2 | 73195 | 61.018715 |
| designation | 34532 | 28.787462 |
| taster_twitter_handle | 29441 | 24.54337 |
| taster_name | 24912 | 20.767788 |
| region_1 | 19558 | 16.304448 |
| price | 8388 | 6.992622 |
| province | 59 | 0.049185 |
| country | 59 | 0.049185 |
| variety | 1 | 0.000834 |
| winery | 0 | 0.0 |


I'm not too concerned about region_2, which is more of an optional descriptor. I'm also not going to worry about the missing Twitter handle or taster name fields for now. Once I get into Feature Engineering, I'll determine whether those are useful or not.

I do need to take care of the missing "price" values and the "Unnamed" column. The next cell handles the price by dropping the rows that don't have one; the "Unnamed" column is still in the frame afterwards, as the 14 columns in the shape show.

In [44]:
# Drop the rows that have no price, then renumber the index from 0
df = df.dropna(subset=['price'])
df = df.reset_index(drop=True)
print(df.shape)
missingData(df)

(111567, 14)


| Column | Total | Percent |
|---|---|---|
| region_2 | 65008 | 58.268126 |
| designation | 32050 | 28.727133 |
| taster_twitter_handle | 27751 | 24.873843 |
| taster_name | 23268 | 20.855629 |
| region_1 | 18011 | 16.143663 |
| province | 55 | 0.049298 |
| country | 55 | 0.049298 |
| variety | 1 | 0.000896 |
| winery | 0 | 0.0 |
| title | 0 | 0.0 |
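
The "Unnamed" column mentioned earlier is not removed by the cell above. Below is a minimal sketch of two ways it could be handled, shown on small made-up data rather than the real file: dropping the column after loading, or telling `read_csv` to treat the first CSV column as the index. The `toy` frame and `csv_text` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the loaded frame, including the leftover index column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "country": ["Italy", "Portugal", "US"],
    "price": [None, 15.0, 14.0],
})

# Option 1: drop the column after loading.
# errors="ignore" keeps the call safe if the cell is run a second time.
cleaned = toy.drop(columns=["Unnamed: 0"], errors="ignore")
print(cleaned.columns.tolist())

# Option 2: read the first CSV column as the index so it never becomes a data column.
csv_text = ",country,price\n0,Italy,\n1,Portugal,15.0\n"
with_unnamed = pd.read_csv(io.StringIO(csv_text))
from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_unnamed.columns.tolist(), from_csv.columns.tolist())
```

Either option would bring the frame down to 13 columns, so shapes printed afterwards would differ from the 14-column outputs shown here.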


## Exploratory Data Analysis (EDA)

## Hypothesis Test, Feature Engineering & Model Selection

### Hypothesis Test

### Feature Engineering


### Model Selection

### Machine Learning Model Implementation

### Conclusion and Remarks on Further Development