# Wines Review - Data exploration and transformation

See a comprehensive overview and details of this dataset [on its Kaggle page](https://www.kaggle.com/zynicide/wine-reviews?select=winemag-data-130k-v2.csv).


In the data exploration stage, we want to get familiar with the data we are working on, in terms of:

* Shape of the dataset - how many rows and columns it has, how it is constructed, what each row represents, etc.
* Columns data types, missing values, distributions, unique values etc.
* Correlations, dependencies, inconsistencies (e.g. "Yes" vs "yes") etc.
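
The checks listed above can be sketched with a few pandas calls. The small `toy` frame below is invented for illustration (its values are not taken from the wine dataset, and the column names only mimic it), so treat it as a minimal sketch rather than the exploration performed in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "country": ["US", "France", "us", None],
    "points": [87, 89, 93, 88],
    "price": [42.0, 43.0, None, 14.0],
})

n_rows, n_cols = toy.shape      # how many rows and columns
dtypes = toy.dtypes             # data type of each column
missing = toy.isna().sum()      # number of missing values per column
n_unique = toy.nunique()        # distinct non-null values per column

# Inconsistent spellings (e.g. "US" vs "us") show up when the raw and
# the case-folded unique counts differ.
raw_countries = toy["country"].nunique()
folded_countries = toy["country"].str.lower().nunique()

print(n_rows, n_cols)
print(missing)
print(raw_countries, folded_countries)
```

Comparing the unique count before and after lower-casing is only one cheap heuristic; it will not catch differences such as extra whitespace or alternative spellings.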

Based on the above (and more) we might choose to:

* Reshape the data 
* Remove duplications/invalid entries/outliers
* Transform part of the data
* Normalize values to a standard form (e.g. convert all string fields to lower case)
* Etc.
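
As a rough illustration of the cleaning steps listed above, the sketch below normalizes string columns, drops duplicates, and filters out invalid entries. The `toy` frame and the "price must be positive" rule are assumptions made for this example, not decisions taken for the wine dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "country": ["US", "us", "France", "France", "Italy"],
    "variety": ["Merlot", "merlot", "Champagne Blend", "Champagne Blend", "Sangiovese"],
    "price": [20.0, 20.0, 43.0, 43.0, -5.0],
})

cleaned = toy.copy()

# Normalize string columns to lower case so "US" and "us" compare equal.
for col in ["country", "variety"]:
    cleaned[col] = cleaned[col].str.lower()

# Remove rows that are exact duplicates after normalization.
cleaned = cleaned.drop_duplicates()

# Remove invalid entries (assumed rule: a price must be positive).
cleaned = cleaned[cleaned["price"] > 0].reset_index(drop=True)

print(cleaned)
```

Normalizing before de-duplicating matters here: the first two rows only become duplicates once their strings share the same case.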

The goal of this stage is to understand the data in front of us and to prepare it for the modeling stage.

In [1]:
import pandas as pd
import cufflinks as cf; cf.go_offline()

## Data loading 

In [2]:
wine_reviews = pd.read_csv("data/winemag-data-130k-v2.csv")
wine_reviews.shape

(129971, 14)

In [3]:
wine_reviews.sample(5)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
84154,84154,France,"The wine has a lively mousse, crispened with a...",Oriane Brut,89,43.0,Champagne,Champagne,,Roger Voss,@vossroger,Faniel et Fils NV Oriane Brut (Champagne),Champagne Blend,Faniel et Fils
44237,44237,US,"Flashy and delicious, this just blows your min...",Moose Valley Vineyard,93,135.0,California,Napa Valley,Napa,,,D.R. Stephens 2009 Moose Valley Vineyard Caber...,Cabernet Sauvignon,D.R. Stephens
103148,103148,US,This barrel-forward wine offers aromas of cara...,,88,36.0,Washington,Columbia Valley (WA),Columbia Valley,Sean P. Sullivan,@wawinereport,Terra Vina Wines 2012 Tempranillo (Columbia Va...,Tempranillo,Terra Vina Wines
73934,73934,New Zealand,This wine starts on the shy side but with seve...,Mohua,88,14.0,Marlborough,,,,,Peregrine 2015 Mohua Sauvignon Blanc (Marlboro...,Sauvignon Blanc,Peregrine
83010,83010,US,"Lots of ripe, luscious raspberry, cherry and m...",7200 Grassini Vineyard,87,42.0,California,Happy Canyon of Santa Barbara,Central Coast,,,Foxen 2010 7200 Grassini Vineyard Cabernet Sau...,Cabernet Sauvignon,Foxen


## Vocabulary creation

In [1]:
import nltk

In [None]:
# Tokenize each review description (requires NLTK's tokenizer data to be downloaded)
wine_reviews['desc_tokens'] = wine_reviews.description.apply(nltk.word_tokenize)
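
To make the tokenization step concrete without downloading NLTK resources, here is a minimal sketch that uses a regular expression instead. `simple_tokenize` is a hypothetical stand-in and not part of NLTK; `nltk.word_tokenize` applies more elaborate rules (for contractions, for instance), so its output can differ.

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize.

    Splits text into runs of word characters and single punctuation marks.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Flashy and delicious, this just blows your mind.")
print(tokens)
```

Each description becomes a list of tokens, which is the shape a token column such as `desc_tokens` would hold per row.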

In [27]:
# `df.pureTextTweet` and `stopwords` (lower-case stop words) are assumed to be defined elsewhere.
all_words = [word for tweet in df.pureTextTweet for word in nltk.word_tokenize(tweet)
             if word.lower() not in stopwords]  # keep only non-stop words
words_df = pd.DataFrame(data=all_words, columns=['word']).word.value_counts().reset_index()
words_df.columns = ['word','wordCount']
words_df['wordImportance'] = len(words_df) / words_df.wordCount / words_df.wordCount.max()
words_df.head()

Unnamed: 0,word,wordCount,wordImportance
0,rt,1338,0.007189
1,win,507,0.018972
2,giveaway,452,0.021281
3,enter,299,0.03217
4,im,279,0.034476
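
The `wordImportance` column divides the vocabulary size by a word's count and then by the largest count, so frequent words receive lower scores. The sketch below reproduces that arithmetic on two invented sentences using only the standard library. The texts and the tiny stop-word set are assumptions for illustration, and unlike the cell above it lower-cases every word and splits on whitespace.

```python
from collections import Counter

# Assumed, deliberately tiny inputs for illustration only.
stop_words = {"the", "a", "and"}
texts = ["the wine is ripe and ripe", "a ripe red wine"]

# Count every non-stop word across all texts.
counts = Counter(
    word.lower()
    for text in texts
    for word in text.split()
    if word.lower() not in stop_words
)

vocab_size = len(counts)          # number of distinct words
max_count = max(counts.values())  # count of the most frequent word

# Same formula as wordImportance: vocab_size / count / max_count.
importance = {word: vocab_size / count / max_count for word, count in counts.items()}

for word, count in counts.most_common():
    print(word, count, round(importance[word], 3))
```

With this formula the most frequent word always gets the lowest score, which matches the ordering in the output above, where `rt` has the highest count and the smallest `wordImportance`.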
