We keep this analysis in its own notebook because of memory issues when everything runs in a single notebook. The main cause is the size of the dataset, which is large even when only a sample is used. If this part is not separated from the main file, there is a high chance that the kernel will crash.

In [1]:
import numpy as np 
import pandas as pd 
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [2]:
wine_df = pd.read_csv("wine.csv")
# drop=True discards the old row index instead of keeping it as an extra column
reviews = wine_df.sample(20000).reset_index(drop=True)

Example Description

In [3]:
reviews['description'][5]

"A fruit-driven style, with cherry and raspberry notes accented by hints of chocolate and coffee. It's ripe and creamy on the midpalate, just fading a little too quickly on the finish. Drink now–2013."

To perform sentiment analysis, we first have to bin the numeric scores into labeled ranges: ok, below average, average, good, great, and perfect.

In [4]:
def classify_wine_points(score):
    if score <= 80:
        return 'ok'
    elif score <= 85:
        return 'below average'
    elif score <= 90:
        return 'average'
    elif score <= 95:
        return 'good'
    elif score <= 98:
        return 'great'
    else:
        return 'perfect'

In [5]:
reviews['classification'] = reviews['points'].apply(classify_wine_points)
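
The if/elif chain in `classify_wine_points` is equivalent to a lookup against sorted, inclusive upper bounds. Here is a minimal sketch using only the standard library. The names `UPPER_BOUNDS`, `LABELS`, and `bucket` are hypothetical, introduced for illustration, and the sample scores are made up to hit every boundary.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bound of each range, mirroring classify_wine_points.
UPPER_BOUNDS = [80, 85, 90, 95, 98]
LABELS = ['ok', 'below average', 'average', 'good', 'great', 'perfect']

def bucket(score):
    """Return the label whose range contains score (upper bounds inclusive)."""
    return LABELS[bisect_left(UPPER_BOUNDS, score)]

# Made-up scores, not taken from the dataset.
sample_scores = [80, 81, 85, 88, 90, 93, 96, 98, 99, 100]
sample_labels = [bucket(s) for s in sample_scores]

# Count how many scores land in each range.
label_counts = Counter(sample_labels)
print(label_counts)
```

Running the same kind of count on the real column, for example with `reviews['classification'].value_counts()`, shows how unevenly the reviews are spread across the ranges before any model is trained.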

We replace punctuation and other non-word characters with spaces and convert everything to lower case.

In [6]:
descriptions = []

for descrip in reviews['description']:
    line = re.sub(r'\W', ' ', str(descrip))
    line = line.lower()
    descriptions.append(line)
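
To make the effect of this cleaning step concrete, here is a small sketch that applies the same substitution to one made-up sentence. The helper name `clean` and the whitespace-collapsing step are assumptions added for illustration; the original cell does not collapse repeated spaces.

```python
import re

def clean(text):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r'\W', ' ', str(text)).lower()

# Made-up sample in the style of the wine descriptions.
sample = "Fruit-driven style; it's ripe, creamy."
cleaned = clean(sample)

# Optional extra step (not in the original cell): collapse runs of spaces.
collapsed = ' '.join(cleaned.split())

print(repr(cleaned))
print(repr(collapsed))
```

Note that the apostrophe in "it's" is also replaced, which leaves a stray single-letter token. With its default token pattern, `TfidfVectorizer` only keeps tokens of two or more word characters, so such fragments should not end up in the vocabulary.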

Here we use TF-IDF vectorization in combination with a random forest classifier to predict the score range.

In [7]:
labels = reviews['classification'].values
vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
descriptions = vec.fit_transform(descriptions).toarray()

X_train, X_test, y_train, y_test = train_test_split(descriptions, labels, test_size=0.2, random_state=0)
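
The memory pressure mentioned at the top of this notebook can be estimated from the shape of the dense matrix that `.toarray()` produces. This is a rough back-of-the-envelope sketch, assuming the vocabulary reaches the full `max_features` limit and that values are stored as 64-bit floats (the `TfidfVectorizer` default dtype). The variable names are hypothetical.

```python
# Rough size of the dense TF-IDF matrix.
n_rows = 20000         # sample size used in this notebook
n_cols = 2500          # max_features: an upper bound on the vocabulary size
bytes_per_value = 8    # assumption: float64 values

dense_bytes = n_rows * n_cols * bytes_per_value
dense_mb = dense_bytes / 1e6
print(f"{dense_mb:.0f} MB")
```

`fit_transform` returns a SciPy sparse matrix, and both `train_test_split` and `RandomForestClassifier` accept sparse input, so dropping `.toarray()` may be one way to reduce memory use. That variant has not been tried in this notebook.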

We don't set a value for max_depth, so each tree can grow without a depth limit. The aim is to maximize precision, even though the run time is poor.

In [8]:
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
pred = text_classifier.predict(X_test)
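
Two constructor options may be worth trying for the run time and for the rare classes that show up in the report further down: `n_jobs=-1` builds trees on all available cores, and `class_weight='balanced'` weights classes inversely to their frequency. Here is a minimal sketch on synthetic data. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, and these settings were not used for the results in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the TF-IDF features: 190 common vs. 10 rare samples.
X_toy = rng.normal(size=(200, 5))
y_toy = np.array(['average'] * 190 + ['great'] * 10)
X_toy[y_toy == 'great'] += 3.0  # shift the rare class so it is separable

toy_classifier = RandomForestClassifier(
    n_estimators=50,
    max_depth=None,           # default: no depth limit on the trees
    n_jobs=-1,                # build trees in parallel on all CPU cores
    class_weight='balanced',  # weight classes inversely to their frequency
    random_state=0,
)
toy_classifier.fit(X_toy, y_toy)

# Predictions on the training data itself, only to show the call pattern.
toy_pred = toy_classifier.predict(X_toy)
print(sorted(set(toy_pred)))
```

Whether either option helps on the wine reviews would have to be checked against a held-out test set.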

Accuracy is around 66-69% depending on the sample. That is a fairly wide range, but a sample of the data has to be used or else the kernel will crash. Recall is **very** bad for every class except average, and the great, ok, and perfect classes are never predicted at all, which is another bad sign.

This is much more accurate than the previous methods, which relied solely on numerical values, but it is not without its problems.

In [9]:
print(classification_report(y_test,pred))
print(accuracy_score(y_test, pred))

               precision    recall  f1-score   support

      average       0.65      0.92      0.76      2256
below average       0.75      0.24      0.36       682
         good       0.76      0.43      0.55      1025
        great       0.00      0.00      0.00        21
           ok       0.00      0.00      0.00        13
      perfect       0.00      0.00      0.00         3

     accuracy                           0.67      4000
    macro avg       0.36      0.27      0.28      4000
 weighted avg       0.69      0.67      0.63      4000

0.6715
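
To put the accuracy figure in context, the recall and support columns of the report can be used to compute two reference numbers: the accuracy of always predicting the most common class, and the unweighted (macro) mean recall. This is a small sketch with the values copied from the report printed in this notebook. The variable names are hypothetical.

```python
# Per-class (recall, support), copied from the classification report.
report_rows = {
    'average':       (0.92, 2256),
    'below average': (0.24, 682),
    'good':          (0.43, 1025),
    'great':         (0.00, 21),
    'ok':            (0.00, 13),
    'perfect':       (0.00, 3),
}

total = sum(support for _, support in report_rows.values())

# Accuracy of a model that always predicts the most common class.
majority_baseline = max(support for _, support in report_rows.values()) / total

# Unweighted mean of the per-class recalls (the "macro avg" recall).
macro_recall = sum(recall for recall, _ in report_rows.values()) / len(report_rows)

print(f"majority baseline: {majority_baseline:.3f}")
print(f"macro recall:      {macro_recall:.3f}")
```

Always predicting "average" would already score about 56% on this test split, so the 67% accuracy is a moderate improvement over that baseline, while the macro recall of roughly 0.27 reflects the three classes that are never predicted.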


