In [4]:
!pip install nltk
import numpy as np 
import pandas as pd 
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import nltk
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

As seen in the previous section, regression analysis, k-nearest neighbors, linear discriminant analysis, and similar algorithms are not very good at predicting the points given to a bottle of wine. Below we test whether a more specialized feature improves the accuracy of the predicted score.

Below is a function that assigns a classification to a row in the dataframe based on the score that is passed into it.

In [5]:
def classify_wine_points(score):
    if score <= 83:
        return 'ok'
    elif score <= 85:
        return 'below average'
    elif score <= 86:
        return 'average'
    elif score <= 88:
        return 'good'
    elif score <= 94:
        return 'great'
    else:
        return 'perfect'
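
The same binning can also be expressed without a Python-level function call per row. The following is a minimal sketch using `pd.cut`, assuming the thresholds above are the intended right-inclusive bin edges; the `bins`, `labels`, and `scores` names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def classify_wine_points(score):
    # Copy of the notebook function, used here only for comparison.
    if score <= 83:
        return 'ok'
    elif score <= 85:
        return 'below average'
    elif score <= 86:
        return 'average'
    elif score <= 88:
        return 'good'
    elif score <= 94:
        return 'great'
    else:
        return 'perfect'

# Right-inclusive edges: (-inf, 83], (83, 85], (85, 86], (86, 88], (88, 94], (94, inf)
bins = [-np.inf, 83, 85, 86, 88, 94, np.inf]
labels = ['ok', 'below average', 'average', 'good', 'great', 'perfect']

scores = pd.Series([80, 83, 84, 85, 86, 87, 88, 89, 94, 95, 100])
binned = pd.cut(scores, bins=bins, labels=labels, right=True)
by_function = scores.apply(classify_wine_points)
```

Both approaches should agree on every score, including the boundary values 83, 85, 86, 88, and 94.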

We separate the dataframe by classification in order to draw an equal number of samples from each range. We then take a sample of 2000 rows from each classification.

**Note:** A sample is used because the Jupyter notebook kernel runs out of memory when the analysis is performed on the full dataset. As a result, the predictions depend on the randomly drawn sample, and they vary by about $\pm1\%$ between runs.

In [6]:
wine_df = pd.read_csv("wine.csv")
wine_df['classification'] = wine_df['points'].apply(classify_wine_points)
reviews_ok = wine_df.loc[wine_df['classification'] == 'ok']
reviews_ba = wine_df.loc[wine_df['classification'] == 'below average']
reviews_a = wine_df.loc[wine_df['classification'] == 'average']
reviews_good = wine_df.loc[wine_df['classification'] == 'good']
reviews_great = wine_df.loc[wine_df['classification'] == 'great']
reviews_p = wine_df.loc[wine_df['classification'] == 'perfect']

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
reviews = pd.concat([
    reviews_ok.sample(2000),
    reviews_ba.sample(2000),
    reviews_a.sample(2000),
    reviews_good.sample(2000),
    reviews_great.sample(2000),
    reviews_p.sample(2000),
]).reset_index()
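
Because the sample is drawn at random, the results shift slightly between runs. One possible way to make the per-class sample reproducible is sketched below, assuming a fixed seed is acceptable and pandas 1.1 or newer is available (for `groupby(...).sample`). The toy dataframe, `n_per_class`, and `random_state=0` are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for wine_df with an unbalanced 'classification' column.
toy = pd.DataFrame({
    "classification": ["ok"] * 5 + ["good"] * 4 + ["great"] * 6,
    "points": [80, 81, 82, 83, 83, 87, 87, 88, 88, 89, 90, 91, 92, 93, 94],
})

n_per_class = 3  # the notebook uses 2000

# Draw the same number of rows from every class with a fixed seed.
balanced = (
    toy.groupby("classification")
       .sample(n=n_per_class, random_state=0)
       .reset_index(drop=True)  # discard the original row labels
)

counts = balanced["classification"].value_counts()
```

With a fixed `random_state`, rerunning the cell yields the same rows, so changes in the score can be attributed to the model rather than to the sample.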

Because we are focusing on the description (review) of the wine, here is an example of one:

In [7]:
reviews['description'][5]

"Basic everyday Pinot Noir. It's dry, with simple cola and cherry flavors, and a grating acidity."

We replace punctuation and other special characters with spaces and convert everything to lower case, since capitalization is not significant here.

In [8]:
descriptions = []

for descrip in reviews['description']:
    line = re.sub(r'\W', ' ', str(descrip))
    line = line.lower()
    descriptions.append(line)
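
To make the effect of this cleaning step concrete, here is a small sketch that applies the same two operations to a short string. The `clean` helper and the `sample` sentence are illustrative names, not part of the notebook.

```python
import re

def clean(text):
    # Same two steps as the loop above: replace every non-word
    # character with a space, then lower-case the result.
    return re.sub(r'\W', ' ', str(text)).lower()

sample = "It's dry, with cola."
cleaned = clean(sample)
print(repr(cleaned))  # 'it s dry  with cola '

tokens = cleaned.split()
print(tokens)  # ['it', 's', 'dry', 'with', 'cola']
```

Note that the apostrophe in "It's" leaves a stray one-letter token `s`. With its default token pattern, `TfidfVectorizer` only keeps tokens of two or more characters, so such fragments should not end up as features.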

Here we use `TfidfVectorizer`. To understand what it does, term frequency-inverse document frequency (TF-IDF) must be explained first. TF-IDF is a measure of how relevant a word is to a document within a collection of other documents. It is defined as follows:

$ \text{Term Frequency (TF)} = \frac{\text{Frequency of a word}}{\text{Total number of words in document}} $

$ \text{Inverse Document Frequency (IDF)} = \log{\frac{\text{Total number of documents}}{\text{Number of documents that contain the word}}} $

$ \text{TF-IDF} = \text{TF} \cdot \text{IDF} $
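
The three formulas above can be worked through by hand on a tiny collection. This is a sketch of the textbook definition only; the toy documents and the helper names `tf`, `idf`, and `tf_idf` are illustrative. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

docs = [
    "dry cherry cola",
    "dry oak vanilla",
    "dry cherry oak oak",
]
tokenized = [d.split() for d in docs]

def tf(word, doc):
    # Frequency of the word divided by the number of words in the document.
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # log(total documents / documents containing the word).
    # Assumes the word occurs in at least one document.
    n_containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

# "dry" occurs in every document, so its IDF (and TF-IDF) is 0.
dry_score = tf_idf("dry", tokenized[0], tokenized)

# "cola" occurs in one of three documents: TF = 1/3, IDF = log(3).
cola_score = tf_idf("cola", tokenized[0], tokenized)

# "oak" occurs twice in the last document and in two documents overall:
# TF = 2/4, IDF = log(3/2).
oak_score = tf_idf("oak", tokenized[2], tokenized)
```

A word that appears everywhere carries no weight, while a word concentrated in a few documents is weighted up.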

`TfidfVectorizer` in turn gives us a matrix with one row per description and one column per term, which we can use as features for prediction.

The parameters we pass to `TfidfVectorizer` are `max_features`, `min_df`, `max_df`, and `stop_words`.
`max_features` keeps only the top n terms, ranked by term frequency across the whole collection.
`min_df` causes the vectorizer to ignore terms that have a document frequency strictly lower than the given threshold. Because our value is an integer, we ignore words that appear in fewer than 7 documents.
`max_df` causes the vectorizer to ignore terms that have a document frequency strictly higher than the given threshold. Because our value is a float, we ignore words that appear in more than 80% of documents.
`stop_words` allows us to pass in a list of stop words. Stop words are words that add little to no meaning to a sentence, such as i, our, him, and her.
Following this, we fit and transform the data, then split it into training and testing data.

In [9]:
y = reviews['points'].values
vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
X = vec.fit_transform(descriptions).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)
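
The effect of these parameters is easier to see on a toy collection. The sketch below is illustrative only: the three documents are made up, `max_features=3` replaces the notebook's 2500, and scikit-learn's built-in `stop_words="english"` list is used instead of the NLTK list so that no download is needed.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "dry wine with cherry and cola flavors",
    "dry wine with oak and vanilla",
    "wine with cherry and oak notes",
]

vec = TfidfVectorizer(max_features=3, max_df=0.8, stop_words="english")
X_sparse = vec.fit_transform(docs)  # SciPy sparse matrix, one row per document

# 'with' and 'and' are stop words; 'wine' is in every document,
# so max_df=0.8 drops it; max_features keeps the 3 most frequent
# of the remaining terms.
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X_sparse.shape)
```

`fit_transform` returns a sparse matrix, and calling `.toarray()` on it, as the notebook does, creates a dense copy. Since scikit-learn's random forests accept sparse input, skipping `.toarray()` may ease the memory pressure mentioned earlier, although that has not been tested on this dataset.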

Now that our data is split into training and test sets, we can use a machine learning algorithm to predict the outcome, which in this case is the points (score) given to a bottle of wine. We do this with `RandomForestRegressor()`. Because it is a random forest algorithm, its prediction is the average of the predictions made by the individual decision trees it builds.

In [10]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

RandomForestRegressor()

In [11]:
pred = rfr.predict(X_test)
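
The averaging behaviour of the forest can be checked directly. The following sketch uses synthetic data from `make_regression` rather than the wine reviews, and a small `n_estimators=20` to keep it quick; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem standing in for the TF-IDF features.
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every fitted tree, then average across trees.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

forest_pred = forest.predict(X_demo)
```

`manual_mean` and `forest_pred` should match up to floating-point rounding, which is what "taking the average of the decision trees" means in practice.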

We now need to determine how well the random forest regression predicts the scores. We do this with `score()`, which for a regressor returns the coefficient of determination $R^2$, and by performing 10-fold cross-validation on the test split and taking the mean of the scores returned.

In [12]:
rfr.score(X_test, y_test)

0.6590980437958478

In [13]:
cvs = cross_val_score(rfr, X_test, y_test, cv=10)
cvs.mean()

0.613404338579916
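
For reference, the value returned by `score()` can be reproduced from its definition, $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. The sketch below uses synthetic data and a small forest, both assumptions made for illustration. It also runs `cross_val_score` on the training split, which is one common alternative to the test split used above; since `cross_val_score` refits a fresh copy of the model on each fold, the data passed to it is used for both fitting and scoring.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.25, random_state=12)

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(Xtr, ytr)
pred_demo = model.predict(Xte)

# R^2 from its definition: 1 - residual sum of squares / total sum of squares.
ss_res = np.sum((yte - pred_demo) ** 2)
ss_tot = np.sum((yte - yte.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_from_metric = r2_score(yte, pred_demo)
r2_from_method = model.score(Xte, yte)

# Each fold clones and refits the estimator; the default scorer is again R^2.
cv_scores = cross_val_score(model, Xtr, ytr, cv=5)
```

All three $R^2$ values should agree, and `cv_scores.mean()` gives the cross-validated estimate.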

This is based solely on the description of the wine. As you can see, this is a large improvement over the predictions made with linear regression, k-nearest neighbors, or linear discriminant analysis. However, it is still not ideal, as a large portion of the data is still not predicted accurately. Below we see whether it can be improved upon.

Now we factorize the other features of the dataset, specifically those listed in `str_cols`. By factorizing the categorical variables, we are able to include them as features for our prediction.

In [14]:
# assign numerical values to string columns
str_cols = ['country', 'variety', 'province', 'region_1', 'winery']
factorized_wine = reviews[str_cols].copy()
for col in str_cols:
    factorized_wine[col] = pd.factorize(reviews[col])[0]

factorized_wine

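| | country | variety | province | region_1 | winery |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 2 | 2 | 2 | 2 |
| 3 | 0 | 3 | 3 | 3 | 3 |
| 4 | 0 | 4 | 2 | 4 | 4 |
| ... | ... | ... | ... | ... | ... |
| 11995 | 0 | 3 | 2 | 13 | 5412 |
| 11996 | 0 | 2 | 2 | 2 | 3622 |
| 11997 | 21 | 11 | 177 | -1 | 3450 |
| 11998 | 0 | 13 | 2 | 125 | 3077 |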
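
The `-1` in the `region_1` column comes from a missing value. A small sketch of how `pd.factorize` assigns codes, using made-up region names:

```python
import pandas as pd

regions = pd.Series(["Napa Valley", "Rioja", "Napa Valley", None, "Rioja"])

# Codes are assigned in order of first appearance; missing values get -1.
codes, uniques = pd.factorize(regions)
print(codes)          # [ 0  1  0 -1  1]
print(list(uniques))  # ['Napa Valley', 'Rioja']
```

The codes are arbitrary labels, so their numeric order carries no meaning. Tree-based models such as random forests can still split on them, but models that rely on distances or linear relationships would likely need a different encoding.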


Next we combine the features obtained from `TfidfVectorizer` with the features we just factorized, in their respective rows.

In [15]:
wine_X = factorized_wine.to_numpy('int64')
X = np.concatenate((wine_X,X),axis=1)
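
Column-wise concatenation keeps each row aligned and simply widens the feature matrix. The sketch below shows the dense version used above next to a sparse alternative with `scipy.sparse.hstack`, which may use less memory when most TF-IDF entries are zero. The small matrices and their names are placeholders rather than the notebook's data.

```python
import numpy as np
from scipy import sparse

# Stand-ins: 4 rows, 2 factorized columns and 6 TF-IDF columns.
cats = np.array([[0, 0], [1, 1], [0, 2], [2, 1]], dtype="int64")
tfidf = sparse.random(4, 6, density=0.3, format="csr", random_state=0)

# Dense route, as in the notebook: categorical columns first, then TF-IDF.
dense = np.concatenate((cats, tfidf.toarray()), axis=1)

# Sparse route: same layout without materializing the zeros.
combined_sparse = sparse.hstack([sparse.csr_matrix(cats), tfidf], format="csr")

print(dense.shape, combined_sparse.shape)  # (4, 8) (4, 8)
```

Both results hold the same values, with the categorical codes in the first two columns.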

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)

In [17]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

y_pred_rf_count = rfr.predict(X_test)

Next we perform the same steps as above to evaluate the prediction. That is, we use `score()`, perform 10-fold cross-validation, and take the mean of the scores.

In [18]:
rfr.score(X_test, y_test)

0.7998418678785136

In [19]:
cvs = cross_val_score(rfr, X_test, y_test, cv=10)
cvs.mean()

0.7627166607378533

As the scores computed above show, the prediction improves drastically when the TF-IDF features are combined with the factorized categorical variables from the dataset. The value from `score()` rises from about 0.66 to 0.80 and the cross-validated mean from about 0.61 to 0.76, an improvement of roughly 0.14 to 0.15 (nearly 15 percentage points) in both cases.