In [1]:
!pip install nltk
import numpy as np 
import pandas as pd 
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, r2_score, mean_squared_error
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import nltk
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Wine Points Prediction

**Note:** A random sample of 20,000 reviews is used because the Jupyter kernel runs out of memory and dies when the analysis is run on the full dataset. As a result, the predictions depend on the sample that is drawn, and the scores reported below vary by about $\pm1\%$ between runs.
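If reproducible numbers are wanted, one option is to fix the seed when sampling. The sketch below uses a toy frame and an arbitrary seed of 42 (both are assumptions for illustration, not part of this notebook) to show that `DataFrame.sample` returns the same rows when `random_state` is set.

```python
import pandas as pd

# toy stand-in for the wine data (hypothetical values)
toy_df = pd.DataFrame({"points": range(80, 100)})

# the same seed yields the same sample on every call
first = toy_df.sample(5, random_state=42)
second = toy_df.sample(5, random_state=42)

print(first.equals(second))  # True
```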

In [4]:
wine_df = pd.read_csv("wine.csv")
str_cols = ['description', 'price',  'title', 'variety', 'country', 'designation', 'province', 'winery']

# draw a random sample of 20,000 reviews and discard the old index in one step
reviews = wine_df.sample(20000)[['points'] + str_cols].reset_index(drop=True)
reviews.head()

Unnamed: 0,points,description,price,title,variety,country,designation,province,winery
0,83,"Barely ripe, with green citrus and feline spra...",18.0,Starmont 2009 Sauvignon Blanc (Napa Valley),Sauvignon Blanc,US,,California,Starmont
1,90,"This 100% Syrah shows intense blackberry, crèm...",45.0,Donelan 2010 Cuvee Christine Syrah (Sonoma Cou...,Syrah,US,Cuvee Christine,California,Donelan
2,93,This year's dry conditions produced a fine cro...,47.0,Burmester 1989 Colheita Tawny (Port),Port,Portugal,Colheita Tawny,Port,Burmester
3,86,This is a plush and upfront wine that offers s...,14.0,Pelassa 2005 Barbera d'Alba,Barbera,Italy,,Piedmont,Pelassa
4,87,Pineapple and mango aromas mix with notes of b...,12.0,Columbia Crest 2014 Grand Estates Chardonnay (...,Chardonnay,US,Grand Estates,Washington,Columbia Crest


We first have to convert the features that we are going to use from categorical to numerical variables. The models used below only accept numeric input, so this step is what lets us use these features to predict the points given to a bottle of wine.

In [5]:
# assign integer codes to the string columns (price stays numeric; missing values are coded as -1)

factorized_wine = reviews[str_cols].drop(['description'], axis=1).copy()
for col in str_cols[2:]:
    factorized_wine[col] = pd.factorize(reviews[col])[0]

factorized_wine.head()

Unnamed: 0,price,title,variety,country,designation,province,winery
0,18.0,0,0,0,-1,0,0
1,45.0,1,1,0,0,0,1
2,47.0,2,2,1,1,1,2
3,14.0,3,3,2,-1,2,3
4,12.0,4,4,0,2,3,4
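To make the encoding concrete, here is a minimal sketch of `pd.factorize` on a made-up column (the values are illustrative, not taken from the dataset). Each distinct value gets an integer code in order of first appearance, and missing values get `-1`, which is why `designation` shows `-1` in the table above.

```python
import pandas as pd

# toy column with a repeated value and a missing value
toy = pd.Series(["US", "Italy", None, "US", "Portugal"])

codes, uniques = pd.factorize(toy)

print(list(codes))    # [0, 1, -1, 0, 2]
print(list(uniques))  # ['US', 'Italy', 'Portugal']
```

Note that the codes carry no ordering: `Portugal = 2` is not "greater than" `Italy = 1` in any meaningful sense, which is one reason linear models struggle with this encoding.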


Now we use the variables we just factorized, along with the price of the wine, as our X values. Our y value is what we are trying to predict, which in this case is the points for a bottle of wine.

In [6]:
X = factorized_wine.to_numpy('int64')
y = reviews['points'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Below we try several different prediction methods to see which one produces the best result.

To determine how good each model is, we use `score()`, which for a regressor returns the coefficient of determination of the prediction ($r^2$). In other words, it is the proportion of the observed variation in y that is explained by the model. We also compute the root mean squared error (RMSE) of the predictions.
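As a small worked example of these two metrics, the sketch below computes $r^2$ and RMSE by hand on four made-up point values and checks them against scikit-learn. The numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# made-up true and predicted point values
y_true = np.array([85.0, 88.0, 90.0, 93.0])
y_hat = np.array([86.0, 87.0, 91.0, 91.0])

# r^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)          # 7.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 34.0
r2_manual = 1 - ss_res / ss_tot

# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```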

### Linear regression

In [7]:
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)

print('r2 score:', model.score(X_test,y_test))
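# np.sqrt of the MSE gives the RMSE on any scikit-learn version (the `squared` argument is deprecated in recent releases)
print('rmse score:', np.sqrt(mean_squared_error(y_test, pred)))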

r2 score: 0.025621093791445948
rmse score: 3.004001678134892


As you can see, this is not a good prediction model, so let's try some other methods and see what we get.

### Linear discriminant analysis

In [8]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

pred = lda_model.predict(X_test)

# for a classifier, score() returns the mean accuracy, not r2
print('accuracy:', lda_model.score(X_test,y_test))
print('rmse score:', np.sqrt(mean_squared_error(y_test, pred)))

accuracy: 0.132
rmse score: 3.1166648841349627


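The results from this method are not good either. Note that LDA is a classifier, so the 0.132 returned by `score()` is the fraction of test wines whose exact point value was predicted, not an $r^2$ value. On to the next one.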
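Because the meaning of `score()` differs between regressors and classifiers, a quick check can help. The sketch below uses a synthetic dataset from `make_classification` (an assumption for illustration, unrelated to the wine data) to show that a classifier's `score()` matches `accuracy_score`, while `r2_score` on the same predictions is a different quantity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, r2_score
from sklearn.tree import DecisionTreeClassifier

# synthetic classification data, split by hand into train and test parts
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te = X_toy[:150], X_toy[150:]
y_tr, y_te = y_toy[:150], y_toy[150:]

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
toy_pred = clf.predict(X_te)

acc_from_score = clf.score(X_te, y_te)       # mean accuracy
acc_explicit = accuracy_score(y_te, toy_pred)
r2_of_labels = r2_score(y_te, toy_pred)      # not what score() reports

print(acc_from_score, acc_explicit, r2_of_labels)
```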

### Classification tree

In [9]:
from sklearn import tree

dt_model = tree.DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

pred = dt_model.predict(X_test)

# for a classifier, score() returns the mean accuracy, not r2
print('accuracy:', dt_model.score(X_test,y_test))
print('rmse score:', np.sqrt(mean_squared_error(y_test, pred)))

accuracy: 0.1468
rmse score: 3.2819506394825626


The methods tried so far, including this one, are getting us nowhere and show very little sign of improvement, so let's pivot in a different direction and try to predict the points from the description of the wine.

## Incorporating description

In [11]:
reviews.head()

Unnamed: 0,points,description,price,title,variety,country,designation,province,winery
0,83,"Barely ripe, with green citrus and feline spra...",18.0,Starmont 2009 Sauvignon Blanc (Napa Valley),Sauvignon Blanc,US,,California,Starmont
1,90,"This 100% Syrah shows intense blackberry, crèm...",45.0,Donelan 2010 Cuvee Christine Syrah (Sonoma Cou...,Syrah,US,Cuvee Christine,California,Donelan
2,93,This year's dry conditions produced a fine cro...,47.0,Burmester 1989 Colheita Tawny (Port),Port,Portugal,Colheita Tawny,Port,Burmester
3,86,This is a plush and upfront wine that offers s...,14.0,Pelassa 2005 Barbera d'Alba,Barbera,Italy,,Piedmont,Pelassa
4,87,Pineapple and mango aromas mix with notes of b...,12.0,Columbia Crest 2014 Grand Estates Chardonnay (...,Chardonnay,US,Grand Estates,Washington,Columbia Crest


Because we are focusing on the description (review) of the wine, here is an example of one:

In [12]:
reviews['description'][5]

"A crisp, minerally wave of apples and pear start this wine from the cool-climate region of Elgin, and on the palate, it's equally delicate. Dry but with a touch of pretty sweetness, the wine is embraceable and a great solo sip."

We replace punctuation and other special characters with spaces and convert everything to lower case, since capitalization is not significant here.

In [13]:
descriptions = []

for descrip in reviews['description']:
    line = re.sub(r'\W', ' ', str(descrip))
    line = line.lower()
    descriptions.append(line)
    
len(descriptions)

20000
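To see what this cleaning step does to a single string, here is a minimal sketch on a shortened, made-up variant of the review shown earlier. The final whitespace-collapsing step is an optional extra that the notebook does not perform; it is included only to make the result easier to read.

```python
import re

sample = "A crisp, minerally wave; it's equally delicate."

# same transformation as in the loop above: non-word characters become spaces
cleaned = re.sub(r'\W', ' ', sample).lower()

# optional extra (not in the notebook): collapse runs of whitespace
collapsed = re.sub(r'\s+', ' ', cleaned).strip()

print(collapsed)  # a crisp minerally wave it s equally delicate
```

Note that the apostrophe in "it's" is also replaced, leaving a stray "s" token. `TfidfVectorizer`'s default tokenizer ignores single-character tokens, so this has little effect downstream.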

Here we use `TfidfVectorizer`. To understand what it does, we first need to explain term frequency-inverse document frequency (TF-IDF). TF-IDF is a measure of how relevant a word is to a document within a collection of documents. It is defined as follows:

$ \text{Term Frequency (TF)} = \frac{\text{Frequency of a word}}{\text{Total number of words in document}} $

$ \text{Inverse Document Frequency (IDF)} = \log{\frac{\text{Total number of documents}}{\text{Number of documents that contain the word}}} $

$ \text{TF-IDF} = \text{TF} \cdot \text{IDF} $
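The formulas above can be sketched directly in plain Python on a toy collection of three tokenized "documents" (made up for illustration). Note that `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will differ from this textbook version.

```python
import math

# three tiny tokenized documents (hypothetical)
docs = [
    ["crisp", "apple", "pear"],
    ["ripe", "apple", "apple", "oak"],
    ["oak", "tannin"],
]

def tf(term, doc):
    # frequency of the term divided by the number of words in the document
    return doc.count(term) / len(doc)

def idf(term, all_docs):
    # assumes the term occurs in at least one document
    containing = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc, all_docs):
    return tf(term, doc) * idf(term, all_docs)

# "apple" is 2 of 4 words in the second document and appears in 2 of 3 documents
print(tf_idf("apple", docs[1], docs))   # 0.5 * ln(3/2), about 0.2027
# "tannin" is rarer across documents, so it gets a higher IDF
print(tf_idf("tannin", docs[2], docs))  # 0.5 * ln(3), about 0.5493
```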

In turn, `TfidfVectorizer` gives us a matrix with one row per description and one column per term, holding the TF-IDF weights that we can use as features for prediction.

The parameters we pass to `TfidfVectorizer` are `max_features`, `min_df`, `max_df`, and `stop_words`.

- `max_features` keeps only the top n terms, ranked by term frequency across the whole collection.
- `min_df` ignores terms that appear in fewer documents than the given threshold. Because we pass an integer, terms that appear in fewer than 7 descriptions are dropped.
- `max_df` ignores terms that have a document frequency strictly higher than the given threshold. Because we pass a float, words that appear in more than 80% of the descriptions are dropped.
- `stop_words` lets us pass in a list of stop words. Stop words are words that add little to no meaning to a sentence, such as "i", "our", "him", and "her".

Following this we fit and transform the data, then split it into training and testing data.
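The effect of these parameters is easiest to see on a tiny corpus. In the sketch below, the four sentences, the two-word stop list, and `min_df=2` are all made up for illustration; the notebook itself uses NLTK's English stop words and `min_df=7`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "wine with crisp apple notes",
    "wine with ripe cherry notes",
    "wine showing oak and cherry",
    "wine with bright apple acidity",
]

toy_vec = TfidfVectorizer(
    max_df=0.8,                 # drops "wine", which is in 100% of documents
    min_df=2,                   # drops terms found in fewer than 2 documents
    stop_words=["with", "and"]  # hypothetical stop list for this toy corpus
)
toy_X = toy_vec.fit_transform(corpus)

print(sorted(toy_vec.vocabulary_))  # ['apple', 'cherry', 'notes']
print(toy_X.shape)                  # (4, 3)
```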

In [14]:
y = reviews['points'].values
vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
X = vec.fit_transform(descriptions).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)

Now that we've split the data, we use `RandomForestRegressor()` to make our prediction. Because it is a random forest algorithm, its prediction is the average of the predictions of the individual decision trees it builds.

In [15]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

RandomForestRegressor()
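The averaging behaviour can be checked on a small synthetic problem. The sketch below uses `make_regression` data and 20 trees (both are assumptions chosen to keep the example fast, not the notebook's settings) and compares the forest's prediction with the mean of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# small synthetic regression problem
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# predictions of each fitted tree for the first five rows
per_tree = np.stack([t.predict(X_toy[:5]) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_toy[:5])))  # True
```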

In [16]:
pred = rfr.predict(X_test)

Now we check to see how good our model is at predicting the points for a bottle of wine

In [17]:
print('r2 score:', rfr.score(X_test, y_test))
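# np.sqrt of the MSE gives the RMSE on any scikit-learn version (the `squared` argument is deprecated in recent releases)
print('rmse score:', np.sqrt(mean_squared_error(y_test, pred)))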

r2 score: 0.4950601387266653
rmse score: 2.1461066422710684


In [18]:
cvs = cross_val_score(rfr, X_test, y_test, cv=10)
cvs.mean()

0.43710399909256914
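For readers unfamiliar with `cross_val_score`, the sketch below reproduces what it does for a regressor with an integer `cv`: split the data into consecutive folds, refit a fresh copy of the model on all but one fold, and score it on the held-out fold. The synthetic data, the linear model, and `cv=5` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_toy, y_toy = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
toy_model = LinearRegression()

# manual k-fold loop: fit on k-1 folds, score (r^2) on the remaining fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    fitted = clone(toy_model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fitted.score(X_toy[test_idx], y_toy[test_idx]))

auto_scores = cross_val_score(toy_model, X_toy, y_toy, cv=5)

print(np.allclose(manual_scores, auto_scores))  # True
print(auto_scores.mean())
```

Because every fold is refit from scratch, the score describes the modelling procedure rather than the single `rfr` object fitted earlier.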

This is based solely on the description of the wine. As you can see, both the $r^2$ score and the RMSE are a large improvement over every method tried above. However, the model is still not great, for several reasons. The first is the $r^2$ score: roughly half of the variation in points is still not explained by the model.

The other issue is the size of the errors when the model misses. An RMSE of about 2.1 means that a typical prediction is off by roughly 2 points, which at first glance could be read as failing rather spectacularly. However, given that the task is to predict arbitrary integer point values for bottles of wine, failing spectacularly is not necessarily what is occurring. Still, it is less than ideal.
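One way to put an RMSE into context is to compare it with a baseline that always predicts the training mean. The sketch below does this with `DummyRegressor` on a handful of made-up point values; on the real data, the baseline RMSE would be roughly the standard deviation of `points`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# made-up point values; the features are irrelevant to a mean baseline
toy_y_train = np.array([83.0, 90.0, 93.0, 86.0, 87.0])
toy_y_test = np.array([85.0, 88.0, 91.0])
toy_X_train = np.zeros((5, 1))
toy_X_test = np.zeros((3, 1))

baseline = DummyRegressor(strategy="mean").fit(toy_X_train, toy_y_train)
baseline_pred = baseline.predict(toy_X_test)  # always the training mean, 87.8

baseline_rmse = np.sqrt(mean_squared_error(toy_y_test, baseline_pred))
baseline_r2 = r2_score(toy_y_test, baseline_pred)  # at or below 0 for a constant prediction

print(baseline_rmse, baseline_r2)
```

A model is only useful to the extent that its RMSE is clearly below this baseline.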

Below we see if we can improve upon these shortcomings. 

## Combining features

Next we combine the features obtained from `TfidfVectorizer` with the features that we factorized earlier, matching them up row by row.

In [19]:
wine_X = factorized_wine.to_numpy('int64')
X = np.concatenate((wine_X,X),axis=1)
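Given the memory pressure mentioned in the note at the top, one alternative (an assumption, not what this notebook does) is to keep the TF-IDF matrix sparse by skipping `.toarray()` and to stack the columns with `scipy.sparse.hstack`. `RandomForestRegressor` accepts sparse input. The toy matrices below are made up for illustration.

```python
import numpy as np
from scipy import sparse

# toy stand-ins: 3 rows of tabular codes and 3 rows of sparse TF-IDF weights
tabular = np.array([[18, 0], [45, 1], [47, 2]])
tfidf = sparse.csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0], [0.0, 0.0]]))

# column-wise stack, analogous to np.concatenate(..., axis=1) but sparse
combined = sparse.hstack([sparse.csr_matrix(tabular), tfidf]).tocsr()

print(combined.shape)  # (3, 4)
```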

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)

In [21]:
rfr_fac = RandomForestRegressor()
rfr_fac.fit(X_train, y_train)

RandomForestRegressor()

In [22]:
fac_pred = rfr_fac.predict(X_test)

Next we perform the same steps as above to assess the quality of the prediction. That is, we use `score()`, perform a 10-fold cross-validation, and take the mean of the scores.

In [23]:
print('r2 score:', rfr_fac.score(X_test, y_test))
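# np.sqrt of the MSE gives the RMSE on any scikit-learn version (the `squared` argument is deprecated in recent releases)
print('rmse score:', np.sqrt(mean_squared_error(y_test, fac_pred)))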

r2 score: 0.5738683914990945
rmse score: 1.9715297968836283


In [24]:
fac_cvs = cross_val_score(rfr_fac, X_test, y_test, cv=10)
fac_cvs.mean()

0.5221010937266104

As the scores computed above show, this is an improvement over using only the wine description (review) as a feature: the $r^2$ score rose by about 0.08 and the RMSE fell by about 0.17. However, the model is still not all that reliable, since it explains only slightly more than half of the variation in points across the sample.

# Conclusion

After comparing the price to the points for a bottle of wine, we learned that most of the point values are clustered towards the middle of the range, with few outliers in either the positive or negative direction. Furthermore, most wines follow the trend of being awarded more points as the price increases.

From these trends we attempted to determine whether we can actually predict how many points a bottle of wine will receive. The best prediction we could obtain took into account several features in addition to the price of the wine and reached a cross-validated $r^2$ of only about 0.52, so we are led to believe that the point system in this dataset is more subjective than objective.