In [None]:
import pandas as pd

Download the data containing Amazon reviews. `low_memory=False` makes pandas process the file in one pass instead of in chunks, which avoids mixed data types within a column.

In [None]:
df = pd.read_csv("https://github.com/christianw/applied-ux/raw/main/week03/amazon_reviews.csv.xz", low_memory=False)

Count how often each product appears and show only the most frequent ones.

In [None]:
df["name"].value_counts().head()

Create a mask for the product reviews we are interested in. The mask contains `True` for reviews of `Fire Tablet, 7 Display, Wi-Fi, 8 GB - Includes Special Offers, Magenta` and `False` for all other reviews.

In [None]:
pos_mask = df["name"] == "Fire Tablet, 7 Display, Wi-Fi, 8 GB - Includes Special Offers, Magenta"

In [None]:
pos_mask.value_counts()

Let's select the reviews of the `Fire Tablet, 7 Display, Wi-Fi, 8 GB - Includes Special Offers, Magenta` and put them in a `DataFrame` called `pos`. Set the column `ft7` to the value `1` in this dataframe. The `.copy()` ensures that the new column is added to an independent dataframe and not to a view of `df`.

In [None]:
pos = df[pos_mask].copy()
pos["ft7"] = 1

Find the length of the dataset containing the reviews about the `Fire Tablet, 7 Display, Wi-Fi, 8 GB - Includes Special Offers, Magenta`

In [None]:
len(pos)

Invert the mask with `~` to find the negative examples. Set the value of our target to `0` as these reviews do not talk about `Fire Tablet, 7 Display, Wi-Fi, 8 GB - Includes Special Offers, Magenta`

In [None]:
neg = df[~pos_mask].copy()
neg["ft7"] = 0

Find the length of this dataset:

In [None]:
len(neg)
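
The masking steps above can be tried on a tiny dataframe first. This is only a sketch: the product names and the `toy_` variables are made up for illustration and are not part of the review data.

```python
import pandas as pd

# Toy data; the product names are invented for this example.
toy = pd.DataFrame({
    "name": ["Tablet A", "Speaker B", "Tablet A", "Reader C"],
    "reviews.text": ["great", "loud", "slow", "light"],
})

# The comparison is evaluated row by row and yields a boolean Series.
toy_mask = toy["name"] == "Tablet A"

# Rows where the mask is True, and the complement selected with ~.
toy_pos = toy[toy_mask].copy()
toy_neg = toy[~toy_mask].copy()
toy_pos["label"] = 1
toy_neg["label"] = 0

print(toy_mask.tolist())            # [True, False, True, False]
print(len(toy_pos), len(toy_neg))   # 2 2
```

Every row ends up in exactly one of the two dataframes, so their lengths add up to the length of the original dataframe.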

Build a new, balanced dataset which contains the same number of positive and negative samples. Because the `neg` dataset is larger than the `pos` dataset, draw a random sample of `len(pos)` rows from `neg` and concatenate it with `pos`. The `random_state` is used to get reproducible results.

In [None]:
labeled = pd.concat([pos, neg.sample(len(pos), random_state=42)])
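
To check that this kind of downsampling really produces a balanced dataset, the same pattern can be applied to toy data. The dataframes below are invented for illustration; `sample` draws without replacement by default.

```python
import pandas as pd

# Toy data: 3 positive and 10 negative rows (made up for this example).
toy_pos = pd.DataFrame({"text": ["a", "b", "c"], "label": 1})
toy_neg = pd.DataFrame({"text": list("defghijklm"), "label": 0})

# Draw as many negative rows as there are positive rows, then stack both.
toy_labeled = pd.concat([toy_pos, toy_neg.sample(len(toy_pos), random_state=42)])

# Both classes should now appear equally often.
toy_counts = toy_labeled["label"].value_counts()
print(toy_counts)
```

Note that `pd.concat` keeps the original row labels of both parts. If unique row labels are needed later on, passing `ignore_index=True` is one option.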

Create the vectorizer for the text data. `ngram_range=(1,2)` selects single tokens (words) and combinations of two consecutive tokens (so-called bigrams). `max_df=0.7` removes terms which appear too often (in more than 70% of the documents), `min_df=3` only keeps terms which appear in at least three documents. The stop word list is taken from spaCy.

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS as stop_words
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words=list(stop_words), 
                        max_df=0.7, min_df=3)
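
The effect of `ngram_range`, `max_df` and `min_df` is easier to see on a tiny corpus. The following is a sketch with made-up sentences; it uses `min_df=2` instead of `3` because the corpus only has four documents, and it omits the stop word list to stay independent of spaCy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four invented mini reviews.
toy_docs = [
    "battery life is great",
    "battery life is short",
    "screen is great",
    "screen is dim",
]

# Unigrams and bigrams; drop terms found in more than 70% of the
# documents and terms found in fewer than two documents.
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), max_df=0.7, min_df=2)
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# "is" occurs in all four documents and is removed by max_df.
# "short", "dim", "is short" and "is dim" occur once and are removed by min_df.
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)  # (4, 8): four documents, eight remaining terms
```

Both thresholds count documents, not total occurrences: a term that appears five times in a single review still has a document frequency of one.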

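Perform the vectorization by calculating the document-term matrix. `.map(str)` converts every entry to a string, so missing review texts do not cause an error in the vectorizer.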

In [None]:
tfidf_vectors = tfidf.fit_transform(labeled["reviews.text"].map(str))

Get the dimensions of the document-term matrix

In [None]:
tfidf_vectors.shape

Assign shorter names to the variables. The document-term matrix is called `X` and is the independent variable, whereas the target (whether the review is about `ft7` or not) is the dependent variable and is called `Y`.

In [None]:
X = tfidf_vectors
Y = labeled["ft7"]

Split the labeled data into training and test data with the ratio `0.75/0.25`

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42)
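
The resulting sizes can be verified on a small array. This is a sketch with invented numbers; the `toy` names are chosen so that they do not overwrite `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy samples with 2 features each and alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

# 25% of 20 samples go to the test set, the remaining 75% to training.
print(X_toy_train.shape, X_toy_test.shape)  # (15, 2) (5, 2)
```

The rows of the features and the labels are shuffled together, so every sample keeps its own label after the split.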

Create the model. With `loss='hinge'`, the `SGDClassifier` trains a linear SVM using stochastic gradient descent.

In [None]:
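from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC  # only needed for the commented-out alternatives

clf = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)
# Alternatives to try: an SVC with a linear kernel or with the default RBF kernel
#clf = SVC(kernel='linear', random_state=42)
#clf = SVC(random_state=42)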
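
For intuition, the hinge loss itself can be computed by hand. For a label `y` in `{-1, +1}` and a classifier score `s`, it is `max(0, 1 - y * s)`. The helper `hinge_loss` and the numbers below are made up for illustration and are not part of scikit-learn's API.

```python
import numpy as np

def hinge_loss(y_signed, scores):
    """Hinge loss per sample for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y_signed * scores)

toy_y = np.array([1, 1, -1, -1])
toy_scores = np.array([2.0, 0.5, -0.3, 1.0])

toy_losses = hinge_loss(toy_y, toy_scores)
# Expected losses: 0.0, 0.5, 0.7, 2.0
print(toy_losses)
```

The first sample is on the correct side with a margin of at least one and costs nothing. The second and third are classified correctly but lie within the margin, so they still contribute a small loss. The last one is on the wrong side and is penalized most.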

Train the model

In [None]:
clf.fit(X_train, Y_train)

Predict the values of the test dataset

In [None]:
Y_predicted = clf.predict(X_test)
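
For a linear model such as this one, the predicted class is determined by the sign of the decision function. The sketch below illustrates this on a few invented 2D points; the `toy` names are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Six toy points in two clearly separated groups.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)
toy_clf.fit(X_toy, y_toy)

# Signed score per sample; its absolute value grows with the
# distance from the separating hyperplane.
toy_scores = toy_clf.decision_function(X_toy)
toy_pred = toy_clf.predict(X_toy)

# A positive score means class 1, otherwise class 0.
print(np.array_equal(toy_pred, (toy_scores > 0).astype(int)))
```

The scores can be useful to see how confident the classifier is about individual reviews, which the hard `0`/`1` predictions do not show.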

Calculate the performance metrics of the classifier

In [None]:
from sklearn import metrics
print(metrics.classification_report(Y_test, Y_predicted))
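
The numbers in the classification report can be traced back to the confusion matrix. The following sketch uses invented labels (not results from the model above) to show how precision, recall and accuracy are derived.

```python
from sklearn import metrics

# Invented true and predicted labels for eight samples.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are the true classes, columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
toy_cm = metrics.confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = toy_cm.ravel()

toy_precision = tp / (tp + fp)   # share of predicted positives that are correct
toy_recall = tp / (tp + fn)      # share of actual positives that were found
toy_accuracy = (tp + tn) / len(toy_true)

print(toy_cm)
print(toy_precision, toy_recall, toy_accuracy)  # 0.75 0.75 0.75
```

The values computed by hand agree with `metrics.precision_score`, `metrics.recall_score` and `metrics.accuracy_score` for the positive class.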