This data interested me when I searched in the Kaggle datasets. I am not a professional taster. But I remember the time I tried to create cocktails by myself during a summer holiday. It was...not good (- -|||). Recently, I started to learn about wine basics, especially about how to read a wine label, since it is always difficult to select a suitable one from so many varieties of wines when purchasing. I have stepped on several land mines so far. So hope to learn more from this dataset as well.

Roadmap:

1. Import data : drop duplicates
2. Preliminary analysis
 * Numerical variables: descriptive stats, distribution, correlation analysis
 * Categorical variables: frequency table, correlation analysis
3. Data preparation
 * Missing data imputation
 * Text preprocessing
 * Feature extraction
4. Modeling

## Import Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook', font_scale=1.5)
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0)
print("Before removing duplicates:", len(wine_reviews))
wine_reviews.tail()

Before removing duplicates: 150930


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


### Drop Duplicates

One way to remove duplicated is based on all columns. I have also seen people removing duplicates based on the description column. The two ways yield different results as seen from below.

In [2]:
# if we count for description, we can see there are multiple duplicate rows for each description
# wine_reviews.description.value_counts(dropna=False)

# dedup based on all columns
wine_reviews = wine_reviews.drop_duplicates()
print("Removing duplicates based on all columns:", len(wine_reviews))

Removing duplicates based on all columns: 97851


In [4]:
# dedup based on description
wine_reviews_ddp = wine_reviews.drop_duplicates('description')
print("Removing duplicates based on description:", len(wine_reviews_ddp))

Removing duplicates based on description: 97821


Which way to use? I will find these duplicate rows and see.

In [5]:
# full join two dedupped data and find the rows only in the first with '_merge' flag
wine_reviews_all = wine_reviews.merge(wine_reviews_ddp, how='outer', indicator=True)
dup_wine_desc = wine_reviews_all[wine_reviews_all['_merge']=='left_only'].description

wine_reviews_all[wine_reviews_all['description'].isin(dup_wine_desc)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,_merge
4605,Italy,"Ripe plum, game, truffle, leather and menthol ...",,88,72.0,Tuscany,Brunello di Montalcino,,Sangiovese,La Mannella,both
7311,Italy,"Ripe plum, game, truffle, leather and menthol ...",,87,40.0,Tuscany,Brunello di Montalcino,,Sangiovese,Poggiarellino,left_only
17785,Italy,"Gibilmoro, a pure expression of Nero d'Avola, ...",Gibilmoro,86,20.0,Sicily & Sardinia,Sicilia,,Merlot,Di Prima,both
17858,Italy,"Gibilmoro, a pure expression of Nero d'Avola, ...",Gibilmoro,86,20.0,Sicily & Sardinia,Sicilia,,Nero d'Avola,Di Prima,left_only
21992,Australia,In 2009 this single vineyard offering includes...,Noble Baron,90,50.0,South Australia,Barossa,,Cabernet Sauvignon,Château Tanunda,both
22106,US,In 2009 this single vineyard offering includes...,Horse Heaven Vineyard,90,15.0,Washington,Horse Heaven Hills,Columbia Valley,Sauvignon Blanc,Chateau Ste. Michelle,left_only
22114,US,"From Minick Family estate vineyards, this inte...",,90,12.0,Washington,Yakima Valley,Columbia Valley,Riesling,Willow Crest,both
22115,Austria,"From Minick Family estate vineyards, this inte...",Kremser Wachtberg,90,25.0,Niederösterreich,,,Grüner Veltliner,Winzer Krems,left_only
22828,Chile,"Dark, earthy and rubbery aromas go along with ...",Estate,86,12.0,Aconcagua Valley,,,Cabernet Sauvignon,Errazuriz,both
22880,US,"Dark, earthy and rubbery aromas go along with ...",Steinbeck Vineyard,86,21.0,California,Paso Robles,Central Coast,Syrah,Eberle,left_only


Personally, I will go with the first approach: dedupping based on all columns. Take the first two rows for example, though the wine descriptions are the same, the wineries are different, which may somehow explain the different points and prices. If I remove one of them, we may lose some information.

**Before we proceed with next steps, remember to reset the index.** Otherwise, you would get an error with text cleaning just like me.

In [6]:
# just to illustrate the difference
wine_reviews.reset_index().tail()

Unnamed: 0,index,country,description,designation,points,price,province,region_1,region_2,variety,winery
97846,149635,US,A Syrah-Grenache blend that's dry and rustical...,Bungalow Red,84,15.0,California,Santa Barbara County,Central Coast,Syrah-Grenache,Casa Barranca
97847,149636,Portugal,Oreo eaters will enjoy the aromas of this wine...,30-year old tawny,84,,Port,,,Port,Casa Santa Eufemia
97848,149637,US,"Outside of the vineyard, wines like this are w...",,84,6.0,California,California,California Other,Merlot,Delicato
97849,149638,Argentina,"Heavy and basic, with melon and pineapple arom...",,84,9.0,Mendoza Province,Uco Valley,,Sauvignon Blanc,Finca El Portillo
97850,149639,Australia,"Smooth in the mouth, this Chard starts off wit...",,84,8.0,Australia Other,South Eastern Australia,,Chardonnay,Jacob's Creek


In [7]:
wine_reviews = wine_reviews.reset_index(drop = True)

## Preliminary Analysis

As always, we need to check and understand our raw data first.

### Numeric Variables

#### 1. Check the distribution

In [None]:
wine_reviews.describe()

The points are within 80 and 100, with majority of wines have got points less than 90. There are no missing values in points. The range of price is much larger, from 4 to 2300, with a great quantity of wines are less expensive than 40. This makes sense because good wines are always precious.

In [None]:
print('Skewness=%.3f' %wine_reviews['points'].skew())
print('Kurtosis=%.3f' %wine_reviews['points'].kurtosis())
sns.distplot(wine_reviews['points'], bins=20, kde=True);

In [None]:
print('Skewness=%.3f' %wine_reviews['price'].skew())
print('Kurtosis=%.3f' %wine_reviews['price'].kurtosis())
sns.distplot(wine_reviews['price'].dropna());

In [None]:
print('Skewness=%.3f' %np.log(wine_reviews['price']).skew())
print('Kurtosis=%.3f' %np.log(wine_reviews['price']).kurtosis())
sns.distplot(np.log(wine_reviews['price']).dropna());

As we can see from the plots, values of points are quite concentrated but price is highly skewed with extreme values. After log transfomation, both the skewness and kurtosis are somewhat corrected.

#### 2. Correlation between price and point

In [None]:
sns.set(style = 'whitegrid', rc = {'figure.figsize':(8,6), 'axes.labelsize':12})
sns.scatterplot(x = 'price', y = 'points', data = wine_reviews);

It is hard to see any patterns according to the scatter plot. Let us try boxplot.

In [None]:
sns.boxplot(x = 'points', y = 'price', palette = 'Set2', data = wine_reviews, linewidth = 1.5);

The overall trend that prices go up as points increase is obvious. That is, there exists a positive correlation between these two. Of course, among the wines that have the same points, their prices can vary a lot.

In [None]:
wine_reviews['points'].corr(wine_reviews['price'])

Not so close to 1, so the positive association is not strong.

### Categorical Variables

#### 1. Check the distribution

In [None]:
wine_cat = wine_reviews.select_dtypes(include=['object']).columns
print('n rows: %s' %len(wine_reviews))
for i in range(len(wine_cat)):
    c = wine_cat[i]
    print(c, ': %s' %len(wine_reviews[c].unique()))

Because there are too many unique values in columns like designation, it is not very informative to make a frequency table including them all. There could be only a few counts within each category and we cannot tell any patterns. I will just pick country and region_2.

In [None]:
wine_reviews['country'].value_counts()

Nearly half of the wines are coming from the US. Italy and France are also two main wine production countries.

In [None]:
print(wine_reviews['region_2'].isna().sum())
wine_reviews['region_2'].value_counts()

Excluding the missing values in region_2, we can see that Central Coast, Sonoma, Columbia Valley, and Napa are among top wine pruduction areas.

*An issue not related to data...*

*The variable description under region_1 says "the wine growing area in a province or state (ie Napa)" while that under region_2 says "sometimes there are more specific regions specified within a wine growing area". But if we take a look at both region_1 and region_2 columns, it seems that region_1 is more granular than region_2. This can also be referred from the above unique value counts, - region_1 1237 and region_2 19. *

#### 2. Correlation with point

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (12, 7))
col_order = wine_reviews.groupby(['country'])['points'].aggregate(np.median).reset_index().sort_values('points')
p = sns.boxplot(x = 'country', y = 'points', palette = 'Set3', data = wine_reviews, order = col_order['country'], linewidth = 1.5)
plt.setp(p.get_xticklabels(), rotation = 90)
ax.set_xlabel('');

Wines produced in England far exceed any other countries though the quantity is relatively small. As a well-known wine country, France also produces many high quality wines judging from the average point. Furthermore, its variety is diverse as the points range from 80 to 100. German wines have the similar characteristics. 

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (10, 6))
col_order = wine_reviews.groupby(['region_2'])['points'].aggregate(np.median).reset_index().sort_values('points')
p = sns.boxplot(x = 'region_2', y = 'points', palette = 'Set3', data = wine_reviews, order = col_order['region_2'], linewidth = 1.5)
plt.setp(p.get_xticklabels(), rotation = 60)
ax.set_xlabel('');

The median points of wines from Willamette Valley, Columbia Valley, and Napa seem to be the same. These three regions are all located in the US. Actually the regions listed in the plot are all located in the US. It seems that US wine regions are populated well while regions of other countries are not in region_2 column.

#### 3. How about description?

In [None]:
wine_reviews['word_count'] = wine_reviews['description'].apply(lambda x: len(str(x).split(" ")))
sns.boxplot(x = 'points', y = 'word_count', palette = 'Set3', data = wine_reviews, linewidth = 1.5);

Looks like wines with higher points usually have longer descriptions. That is interesting. Maybe next time when I make comparisons, I can select whose labels have the most words. Easy to implement, huh?

### Response Variable

We have already known that wine points range from 80 to 100. Intuitively, it should be a regression problem. But because of the range limit, I decide to convert the points into a categorical variable.

In [None]:
def transform_points_simplified(points):
    if points < 84:
        return 1
    elif points >= 84 and points < 88:
        return 2 
    elif points >= 88 and points < 92:
        return 3 
    elif points >= 92 and points < 96:
        return 4 
    else:
        return 5

## Data Preparation

### Missing Data Imputation

In [None]:
print(wine_reviews.isnull().sum())

In [8]:
# calculate percentage of missing values
wine_missing = pd.DataFrame(wine_reviews.isnull().sum()/len(wine_reviews.index) * 100)
wine_missing.columns = ['percent']
wine_missing

Unnamed: 0,percent
country,0.003066
description,0.0
designation,30.552575
points,0.0
price,8.911508
province,0.003066
region_1,16.281898
region_2,59.6417
variety,0.0
winery,0.0


We only have 9 independent variables in total so the principle is keeping as most as possible. In my house price kernel, I first used 15% of missing values as the criterion to decide whether to keep a variable or not. But if we do the same thing here, we would end up with only 6 variables.

In [9]:
# first, we know that region_2 has nearly 60% of missing values so drop it
wine_reviews.drop(['region_2'], inplace = True, axis = 1, errors = 'ignore')

# second, it is not sensible to replace na with most frequent category for designation and region_1, so I create a new Unkown category
wine_reviews['designation'].fillna('Unknown', inplace = True)
wine_reviews['region_1'].fillna('Unknown', inplace = True)

# last, replace na with median for numeric variable price
wine_reviews['price'].fillna((wine_reviews['price'].median()), inplace = True)
wine_reviews.tail()

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
97846,US,A Syrah-Grenache blend that's dry and rustical...,Bungalow Red,84,15.0,California,Santa Barbara County,Syrah-Grenache,Casa Barranca
97847,Portugal,Oreo eaters will enjoy the aromas of this wine...,30-year old tawny,84,25.0,Port,Unknown,Port,Casa Santa Eufemia
97848,US,"Outside of the vineyard, wines like this are w...",Unknown,84,6.0,California,California,Merlot,Delicato
97849,Argentina,"Heavy and basic, with melon and pineapple arom...",Unknown,84,9.0,Mendoza Province,Uco Valley,Sauvignon Blanc,Finca El Portillo
97850,Australia,"Smooth in the mouth, this Chard starts off wit...",Unknown,84,8.0,Australia Other,South Eastern Australia,Chardonnay,Jacob's Creek


Since country and region_1 both have 3 missing values, I am wondering if they are among the same rows. 

In [10]:
wine_reviews[wine_reviews['country'].isna()]

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
1091,,Delicate white flowers and a spin of lemon pee...,Askitikos,90,17.0,,Unknown,Assyrtiko,Tsililis
1383,,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,,Unknown,Red Blend,Büyülübağ
60636,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,Unknown,Pinot Noir,Chilcas


Further take a look at the rows whose winery are one of the above three. It seems like many unkown region_1 are located in Chile. But I am questionable about how to define province and region_1. The provinces listed are Chile’s prominent wine regions, e.g. Maule Valley. Maule Valley belongs to Maule Region if you check it using Google Map. In principle, XXX Valley should be a name of a region, like Napa Valley in California. 

I also checked several Chilean wines on winemag.com. They usually have a appellation which format is like "Central Valley, Chile", rather than others like "Paso Robles, Central Coast, California, US". So that should be the reason why these XXX Valleys appear in the province column.

In [None]:
wine_reviews[wine_reviews.winery.isin(['Tsililis', 'Büyülübağ', 'Chilcas'])]

I searched online and found the information about the wines with missing country and province.
* [Tsililis 2015 Askitikos Assyrtiko](https://www.winemag.com/buying-guide/tsililis-2015-askitikos-assyrtiko)
* [Büyülübağ 2012 Shah Red](https://www.winemag.com/buying-guide/buyuluba-2012-shah-red)
* [Chilcas 2006 Piedra Feliz Pinot Noir](https://www.winemag.com/buying-guide/chilcas-2006-piedra-feliz-pinot-noir-san-rafael)

In [11]:
wine_reviews.loc[wine_reviews.designation == 'Askitikos', 'country'] = 'Greece'
wine_reviews.loc[wine_reviews.designation == 'Askitikos', 'province'] = 'Thessaly'

wine_reviews.loc[wine_reviews.designation == 'Shah', 'country'] = 'Turkey'
wine_reviews.loc[wine_reviews.designation == 'Shah', 'province'] = 'Marmara'

# As I have said, San Rafael is located in Maule Region; for simplicity, I assign 'Maule Valley' in line with other rows
wine_reviews.loc[wine_reviews.designation == 'Piedra Feliz', 'country'] = 'Chile'
wine_reviews.loc[wine_reviews.designation == 'Piedra Feliz', 'province'] = 'Maule Valley'
wine_reviews.loc[wine_reviews.designation == 'Piedra Feliz', 'region_1'] = 'San Rafael'

In [12]:
wine_reviews[wine_reviews.designation.isin(['Askitikos', 'Shah', 'Piedra Feliz'])]

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
1091,Greece,Delicate white flowers and a spin of lemon pee...,Askitikos,90,17.0,Thessaly,Unknown,Assyrtiko,Tsililis
1383,Turkey,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,Marmara,Unknown,Red Blend,Büyülübağ
60636,Chile,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,Maule Valley,San Rafael,Pinot Noir,Chilcas


### One-Hot Encoding

Before we process the description column, we need to encode the rest of categorical variables first.

In [13]:
enc_cols = wine_reviews.columns.drop(['description', 'points', 'price'])
for col in enc_cols:
    dummies = pd.get_dummies(wine_reviews[col], prefix = col, drop_first = False)
    X_encoded = pd.concat([wine_reviews['price'], dummies], axis = 1)

X_encoded.shape

(97851, 14811)

### Text Preprocessing

In [14]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

wine_desc = pd.DataFrame({'description': wine_reviews['description']})

#### 1. Lower case

In [15]:
wine_desc['clean_desc'] = wine_desc['description'].apply(lambda x: x.lower())
wine_desc['clean_desc'].head()

0    this tremendous 100% varietal wine hails from ...
1    ripe aromas of fig, blackberry and cassis are ...
2    mac watson honors the memory of a wine once ma...
3    this spent 20 months in 30% new french oak, an...
4    this is the top wine from la bégude, named aft...
Name: clean_desc, dtype: object

#### 2. Remove punctuation

In [16]:
wine_desc['clean_desc'] = wine_desc['clean_desc'].str.replace('[^\w\s]', '')
wine_desc['clean_desc'].head()

0    this tremendous 100 varietal wine hails from o...
1    ripe aromas of fig blackberry and cassis are s...
2    mac watson honors the memory of a wine once ma...
3    this spent 20 months in 30 new french oak and ...
4    this is the top wine from la bégude named afte...
Name: clean_desc, dtype: object

#### 3. Remove numbers

In [17]:
wine_desc['clean_desc'] = wine_desc['clean_desc'].str.replace('[0-9]+', '')
wine_desc['clean_desc'].head()

0    this tremendous  varietal wine hails from oakv...
1    ripe aromas of fig blackberry and cassis are s...
2    mac watson honors the memory of a wine once ma...
3    this spent  months in  new french oak and inco...
4    this is the top wine from la bégude named afte...
Name: clean_desc, dtype: object

#### 4. Remove stop words

In [18]:
stop_words = stopwords.words('english')
wine_desc['clean_desc'] = wine_desc['clean_desc'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop_words))
wine_desc['clean_desc'].head()

0    tremendous varietal wine hails oakville aged t...
1    ripe aromas fig blackberry cassis softened swe...
2    mac watson honors memory wine made mother trem...
3    spent months new french oak incorporates fruit...
4    top wine la bégude named highest point vineyar...
Name: clean_desc, dtype: object

#### 5. Lemmatization

I have also tried stemming but sometimes stemming can transform words a lot. On the other hand, lemmatization is to obtain the root word, which is what we really want to keep in our texts.

In [None]:
# stem words
porter = PorterStemmer()
wine_desc['clean_desc'][:10].apply(lambda x: ' '.join([porter.stem(w) for w in x.split()]))

I have updated the approach here since I found that lemmatization would treat a word as a noun by default. So we need to find out the POS tag and pass it on to the lemmatizer.

In [None]:
# lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

def get_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    lemmatized_sentence = []
    for word, tag in tagged:
        wntag = get_wordnet_tag(tag)
        if wntag is None:
            lemmatized_sentence.append(lemmatizer.lemmatize(word))
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, pos = wntag))
    return ' '.join(lemmatized_sentence)

wine_desc['clean_desc'] = wine_desc['clean_desc'].apply(lambda x: lemmatize(x))
wine_desc['clean_desc'][:10]

Let us see the most common words in our descriptions.

In [None]:
pd.Series(' '.join(wine_desc['clean_desc']).split()).value_counts()[:10]

Look at how we have cleaned the texts so far.

In [None]:
wine_desc.head()

### Feature Extraction

Now we have words of text representing discrete, categorical features of wines. In order to prepare text data for predictive modeling, we need to map these textual data to real valued vectors for use as input to a machine learning algorithm, called feature extraction.

#### 1. Word Counts with CountVectorizer

For each row, CountVectorizer returns a vector whose length equals the number of distinct words from the document, with integer counts for the number of times each word appeared in the document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
X_desc = wine_desc['clean_desc']

vectorizer = CountVectorizer()
vectorizer.fit(X_desc)
X_count = vectorizer.transform(X_desc)
print(X_count.shape)

In [None]:
vectorizer = CountVectorizer(max_features = 20000)
vectorizer.fit(X_desc)
X_count = vectorizer.transform(X_desc)

#### 2. Word Frequencies with TfidfVectorizer

- TF(Term Frequency) = (Number of Occurences of a word)/(Total words in the document) 
- IDF(Inverse Document Frequency) = Log((Total number of documents)/(Number of documents containing the word))  
- TF-IDF = multiplication of the TF and IDF

The higher the value of IDF, the more unique is the word. So we can think of IDF as a penalty on TF. For example, "fruit" is one of the commonly occurring words across wine descriptions as we have seen. However, it is not useful to distinguish between documents and so its TF-IDF score can not be high.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(X_desc)
X_tfidf = vectorizer.transform(X_desc)
print(X_tfidf.shape)

X = X_tfidf
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X, y, test_size=0.1, random_state=12)

#### 3. Word Embeddings

* Word2Vec

In [None]:
import gensim 
from gensim.models import Word2Vec

tokenized_desc = [sentence.split() for sentence in wine_desc['clean_desc']]
model_word2vec = gensim.models.Word2Vec(tokenized_desc, min_count = 3, size = 100, window = 5) 

In [None]:
# access vector for one word
print(model_word2vec['redcherry'])

* Glove

In [None]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
descriptions = wine_desc['clean_desc'].values
tokenizer.fit_on_texts(descriptions)
word_index = tokenizer.word_index

X_train = tokenizer.texts_to_sequences(descriptions)

vocab_size = len(tokenizer.word_index) + 1
print(sentences_train[2])
print(X_train[2])

In [None]:
# load the pre-trained word-embedding vectors 
embeddings_index = {}
with open('../input/glove6b100dtxt/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# create a tokenizer 
token = text.Tokenizer()
token.fit_on_texts(trainDF['text'])
word_index = token.word_index

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])
# convert text to sequence of tokens and pad them to ensure equal length vectors 
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70) 

# create token-embedding mapping
embedding_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [None]:
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
nonzero_elements / vocab_size

In [None]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

## Start Training

### 1. CountVectorizer

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

y = wine_reviews['points'].values
X_train_count, X_test_count, y_train_count, y_test_count = train_test_split(X_count, y, test_size = 0.25, random_state = 12)

# train the model
rf = RandomForestRegressor(n_estimators=400)
rf.fit(X_train_count, y_train_count)

# test the model
y_pred_rf = rf.predict(X_test_count)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_count, y_pred_rf))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test_count, y_pred_rf))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_count, y_pred_rf)))

In [None]:


X_count_df = pd.DataFrame(X_count.toarray())
X = pd.concat([X_encoded, X_count_df], axis = 1)
X_train_count, X_test_count, y_train_count, y_test_count = train_test_split(X, y, test_size = 0.25, random_state = 12)