<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week8/Text_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining and Machine Learning - Week 6
# Text Analytics

[Text Analytics](https://people.ischool.berkeley.edu/~hearst/text-mining.html) (or text mining) is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles.

### Table of Contents
#### 1. Summary
* 1.1 Applications
* 1.2 Tokenization and Stopwords
* 1.3 Lemmatization and Stemming
* 1.4 Text Representation

#### 2. Text Preparation
* 2.1 Install spaCy
* 2.2 Tokenization
* 2.3 Dependency Parsing
* 2.4 Remove Stopwords
* 2.5 Lemmatization
* 2.6 Entity Detection
* 2.7 Exercise
* 2.8 Solution

#### 3. Text Representation
* 3.1 Bag of Words (BOW)
* 3.2 TF-IDF Representation
* 3.3 Exercise
* 3.4 Solution

#### 4. Text Classification: Alexa Reviews
* 4.1 Load and prepare data
* 4.2 Classification of the reviews using logistic regression
* 4.3 How can we improve the accuracy?

## 1. Summary

### 1.1 Applications
There are many applications of text analytics, for example:
* Search for relevant websites or articles using a search engine
* Sentiment Analysis (e.g. classify tweets or film reviews as positive, neutral or negative)
* Chatbots (e.g. Siri, Alexa)
* Project idea: The Impact of Donald Trump’s Tweets on Financial Markets
* Etc.

### 1.2 Tokenization and Stopwords
Tokens are the elementary building blocks (words, numbers, characters) of a document. Tokenization is the process of splitting an input sequence into tokens. Example: "I love data science" --> "I", "love", "data", "science". Stopwords are common words that appear very frequently (e.g. "is", "and", "you"). They are usually removed because they add little to the content of a document and can even hurt text analysis by adding noise.
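
As a minimal, library-free sketch of these two steps, the snippet below tokenizes a sentence with a regular expression and filters it against a stopword list. The tiny `STOPWORDS` set and the function names are made up for illustration; real stopword lists are much longer, and real tokenizers handle many more cases.

```python
import re

# Hypothetical, deliberately tiny stopword list (real lists are much longer)
STOPWORDS = {"i", "is", "and", "you", "the", "a"}

def tokenize(text):
    """Split text into lowercase tokens made of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("I love data science")
print(tokens)                    # ['i', 'love', 'data', 'science']
print(remove_stopwords(tokens))  # ['love', 'data', 'science']
```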

### 1.3 Lemmatization and Stemming
* Goal: have the same token for different forms of a word (e.g. fishing, fished, fisher, fishers, etc.)
* Lemmatization: find the lemma (dictionary form) of a word (e.g. feet -> foot)
* Stemming: a simpler, rule-based alternative that strips the ending of a word (e.g. fishing -> fish)
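
The difference between the two can be sketched in a few lines of plain Python. The suffix rules and the small lemma dictionary below are hypothetical and only meant to illustrate the idea; they are not the rules used by any real stemmer or lemmatizer.

```python
# Hypothetical suffix rules and lemma dictionary, for illustration only
SUFFIXES = ("ing", "ers", "ed", "er", "s")
LEMMAS = {"feet": "foot", "went": "go"}

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def naive_lemma(word):
    """Look the word up in the dictionary; fall back to the stem."""
    return LEMMAS.get(word, naive_stem(word))

for w in ["fishing", "fished", "fisher", "fishers", "feet"]:
    print(w, naive_stem(w), naive_lemma(w))
```

Rule-based stripping maps all the "fish" forms to the same token, but it cannot handle an irregular form such as "feet"; that requires a dictionary lookup.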


### 1.4 Text Representation
* Goal: transform text into numerical features so that it can be used by ML algorithms.
* Bag of Words (BOW): works in many cases, but word order is not preserved (solution: n-grams)
* TF-IDF: emphasizes important words
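
To see why n-grams help with word order, consider the following small sketch (the `ngrams` helper is a made-up name for illustration). Two sentences with the same words in a different order have identical unigrams but different bigrams.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "joe throws the ball".split()
b = "the ball throws joe".split()

# Same unigrams (order is lost), different bigrams (local order is kept)
print(sorted(ngrams(a, 1)) == sorted(ngrams(b, 1)))  # True
print(ngrams(a, 2))  # ['joe throws', 'throws the', 'the ball']
print(ngrams(b, 2))  # ['the ball', 'ball throws', 'throws joe']
```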

## 2. Text Preparation
In this section, we explain how to prepare a text for analysis. This includes tokenizing the text, removing stopwords, etc.

### 2.1 Install spaCy
[spaCy](https://spacy.io/) is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently.

We install the library and its English-language model.

In [None]:
# Install and update spaCy
!pip install -U spacy

# Download the English language model (the same one that is loaded below)
!python -m spacy download en_core_web_sm

In [None]:
# Import required packages
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

### 2.2 Tokenization

Tokenization is the process of breaking a text into pieces called tokens. A token simply refers to an individual part of a sentence having some semantic value. spaCy's tokenizer takes Unicode text as input and outputs a sequence of token objects. spaCy automatically breaks a text into tokens when a document is created using the language model.

Let’s take a look at a simple example. Imagine we have the following text, and we would like to tokenize it:

> When learning data science, you shouldn't get discouraged!

> Challenges and setbacks aren't failures, they're just part of the journey. You've got this!

There are a couple of different ways we can approach this. The first is called __word tokenization__, which means breaking up the text into individual words. This is a critical step for many language processing applications, as they often require inputs in the form of individual words rather than longer strings of text.

In [None]:
# Load English language model
sp = spacy.load('en_core_web_sm')

# Declare text
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# spaCy object is used to create a document
my_doc = sp(text)

my_doc

In [None]:
# This is a spaCy document
type(my_doc)

In [None]:
# Create list of tokens
token_list = []

for token in my_doc:
    token_list.append(token.text)

token_list

As we can see, spaCy produces a list that contains each token as a separate item. Notice that it has recognized that contractions such as _shouldn’t_ actually represent two distinct words, and has thus broken them down into two distinct tokens.

In the example above, we first load the language model. Here we load the English one using `spacy.load('en_core_web_sm')`, which creates an object, `sp`, that is used to create documents with linguistic annotations and various language properties. After creating the document, we create a list of tokens.

We can also see the parts-of-speech (POS) of each of these tokens using the `.pos_` attribute, as shown below. POS tagging can be really useful, particularly if you have words or tokens that can have multiple POS tags. For instance, the word "fish" can be used as both a noun and verb, depending upon the context.

In [None]:
# POS
for word in my_doc:
    print(word.text, word.pos_)

In [None]:
# Another example
doc1 = sp("I like to fish") # verb
doc2 = sp("I eat a fish") # noun

for word in doc1:
  print(word.text, word.pos_)

print("-----------------")

for word in doc2:
  print(word.text, word.pos_)


If we want, we can also break the text into sentences rather than words. This is called __sentence tokenization__. When performing sentence tokenization, the tokenizer looks for specific characters that normally fall between sentences, like periods, exclamation points, and newline characters.
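
The character-based idea can be sketched with a regular expression. This is a deliberately naive illustration (the `split_sentences` helper is a made-up name) and not how spaCy segments sentences; spaCy's own segmentation is more sophisticated.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace, and on newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

for sentence in split_sentences(text):
    print(sentence)
```

Such a rule breaks on abbreviations: a text containing "U.S. stocks" would be split after "U.S.", which is one reason to rely on a trained model instead.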

In [None]:
# create list of sentence tokens
sents_list = []

for sent in my_doc.sents:
    sents_list.append(sent.text)

sents_list

### 2.3 Dependency Parsing
__Dependency parsing__ is a language processing technique that helps determine the meaning of a sentence by analyzing how it is constructed and how the individual words relate to each other.

Consider, for example, the sentence “Joe throws the ball.” We have two nouns (Joe and ball) and one verb (throws). But we can’t just look at these words individually, or we may end up thinking that the ball throws Joe! To understand the sentence correctly, we need to look at the word order and sentence structure, not just the words.

Below, we have a short sentence. We’ll use a spaCy attribute called `noun_chunks`, which breaks the input down into nouns and the words describing them. We iterate through each chunk in our source text and print the chunk's text, its root word, the root's dependency label, and the root's head word.

In [None]:
doc = sp(" Joe threw a ball, and President Donald, in pursuit of the ball, hit a wall.") # notice the space at the beginning

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

In [None]:
# Let's visualize this
displacy.render(doc, style="dep", jupyter= True, options={'distance': 120})

### 2.4 Remove Stopwords
Most text data that we work with is going to contain a lot of words that are not actually useful to us. These words, called stopwords, are useful in human speech, but they contribute little to the meaning of a sentence. Removing stopwords eliminates noise from our text data and speeds up the analysis, since there are fewer words to process.

Let’s take a look at the stopwords spaCy includes by default.

In [None]:
# Import stopwords from English language
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

# Print total number of stopwords
print('Number of stopwords: %d' % len(spacy_stopwords))

# Print 20 stopwords
print('20 stopwords: %s' % list(spacy_stopwords)[:20])

Now that we’ve got our list of stopwords, let’s use it to remove the stopwords from the text string we were working on in the previous section.

In [None]:
# Which words will be removed?
my_doc

In [None]:
# Declare list for filtered sentence
filtered_sent = []

# Filter stopwords
for word in my_doc:
    if not word.is_stop:
        filtered_sent.append(word.text)

filtered_sent

In [None]:
# We can also remove the punctuation
filtered_sent2 = []
removed_tokens = []

# Filter stopwords, punctuation and spaces
for word in my_doc:
    if word.is_stop or word.is_punct or word.is_space:
        removed_tokens.append(word.text)
    else:
        filtered_sent2.append(word.text)

removed_tokens

In [None]:
filtered_sent2

### 2.5 Lemmatization
Lemmatization is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. aren’t exactly the same, they all have the same essential meaning: connect. The differences in spelling have grammatical functions in spoken language, but for machine processing, those differences can be confusing, so we need a way to change all the words that are forms of the word connect into the word connect itself.

One method for doing this is called __stemming__. Stemming simply lops off easily identified prefixes and suffixes to produce what is often the simplest version of a word, the root. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect. This kind of simple stemming is often all that’s needed, but lemmatization, which looks up words and their base forms (called lemmas) as described in the dictionary, is more precise (e.g. feet -> foot).

Let's look at this simple example.

In [None]:
# Lemmatization
lem = sp("run runs ran running runner runners")

# Find lemma for each word
for word in lem:
    print(word.text, word.lemma_)

### 2.6 Entity Detection

__Entity detection__, also called entity recognition, is a more advanced form of language processing that identifies important elements like places, people, organizations, and languages within a text. This is really helpful for quickly extracting information from the text, since you can quickly pick out important topics or identify key sections of it.

Let’s try out some entity detection using a few paragraphs from this [article](https://www.bloomberg.com/features/trump-tweets-market/).

In [None]:
article = sp("""
President Donald Trump gets a lot of attention for using Twitter to attack American trading partners, political foes, and media companies. But he often takes to the platform to celebrate the strength of the world’s largest economy and its publicly-traded companies.

Before U.S. stocks peaked in late January, he drew a direct connection between the increase in market value of American companies and his administration’s pro-growth policies on more than 10 occasions in that month alone.
""")

entities = [(i, i.label_, i.label) for i in article.ents]
entities

The above example shows how spaCy is able to identify a variety of different entity types, including specific locations (GPE), date-related words (DATE), important numbers (CARDINAL), specific individuals (PERSON), etc.

Using `displaCy` we can also visualize the text, with each identified entity highlighted by a color and labeled. We’ll use `style="ent"` to tell displaCy that we want to visualize entities here.

In [None]:
displacy.render(article, style="ent", jupyter=True)

### 2.7 Exercise
For each word in the sentence below, print its lemma.

In [None]:
sentence = """The happiness of your life depends upon the quality of your thoughts: therefore, guard accordingly, and take care that you entertain no notions unsuitable to virtue and reasonable nature."""

# Print lemma
# [YOUR CODE HERE]

Create two lists, the first one containing the punctuation and the second one the words (tokens).

In [None]:
# [YOUR CODE HERE]

### 2.8 Solution

In [None]:
# # Lemma
# doc = sp(sentence)
# for word in doc:
#   print(word.text, word.lemma_)

# # punct and tokens
# punct = []
# tokens = []
# for word in doc:
#   if word.is_punct == True:
#     punct.append(word)
#   elif word.is_space == False:
#     tokens.append(word)

# print(punct)
# print(tokens)

## 3. Text Representation
We now show how to transform a text into a usable input for text classification. We use the first sentence of the article from the last section and two other sentences.

In [None]:
# Article as a string, not a spacy object
article = """
President Donald Trump gets a lot of attention for using Twitter to attack American trading partners, political foes, and media companies."""

# Sentences
s1 = """Donald Trump is a great friend, and he has four or five Picassos on his plane. And that's where I would look at them.""" # from Shaquille O'Neal
s2 = """Donald Trump is a phony, a fraud. His promises are as worthless as a degree from Trump University.""" # from Mitt Romney

# List of sentences
texts = [article, s1, s2]

### 3.1 Bag of Words (BOW)
We use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class of scikit-learn.
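
Conceptually, a bag-of-words matrix has one row per document and one column per vocabulary term, and each cell holds the number of times the term occurs in the document. The sketch below builds such a matrix by hand for two made-up sentences, using whitespace tokenization only; it illustrates the idea and is not a re-implementation of `CountVectorizer`.

```python
from collections import Counter

# Two made-up documents, tokenized by whitespace for simplicity
docs = ["the ball hit the wall", "joe threw the ball"]

# Vocabulary: sorted set of all tokens in the collection
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)   # ['ball', 'hit', 'joe', 'the', 'threw', 'wall']
print(matrix)  # [[1, 1, 0, 2, 0, 1], [1, 0, 1, 1, 1, 0]]
```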

In [None]:
# Using default tokenizer 
count = CountVectorizer(ngram_range=(1,2), stop_words="english")
bow = count.fit_transform(texts)

# Show feature matrix
bow.toarray()

In [None]:
# Get feature names (get_feature_names() was removed in recent scikit-learn versions)
feature_names = count.get_feature_names_out()

# View feature names
feature_names

In [None]:
# Show as a dataframe
pd.DataFrame(
    bow.todense(), 
    columns=feature_names
    )

### 3.2 TF-IDF Representation


Recall that:

- term frequency: tf = count(word, document) / len(document)
- inverse document frequency: idf = log( len(collection) / count(document_containing_term, collection) )
- tf-idf = tf * idf

Note that the IDF value of a word is the same for every document, since it depends only on the collection as a whole (the total number of documents and the number of documents containing the word). The TF value of a word, on the other hand, differs from document to document.
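
The formulas above can be checked by hand on a toy collection. The sketch below is a direct, library-free transcription of them; the three short documents and the function names are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a different variant by default (raw counts for tf, a smoothed idf, and L2 normalisation of each row), so its numbers will not match these.

```python
import math

# Hypothetical toy collection: each document is a list of tokens
docs = [
    "trump tweets move markets".split(),
    "trump is a friend".split(),
    "markets fall".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are equal to the term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "trump" appears in 2 of 3 documents, "tweets" in only 1, so "tweets" scores higher
print(round(tf_idf("trump", docs[0], docs), 4))   # 0.1014
print(round(tf_idf("tweets", docs[0], docs), 4))  # 0.2747
```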


In [None]:
# Using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()
)

In [None]:
texts

### 3.3 Exercise
Create a TF-IDF Representation of the three above sentences using bigrams and the following stopwords: ["and", "a", "is"].

In [None]:
# YOUR CODE HERE

### 3.4 Solution

In [None]:
# tfidf = TfidfVectorizer(ngram_range=(2, 2), stop_words=["and", "a", "is"])
# features = tfidf.fit_transform(texts)
# pd.DataFrame(
#     features.todense(),
#     columns=tfidf.get_feature_names_out()
# )

## 4. Text Classification: Alexa reviews

We are going to use a real-world data set: [Amazon Alexa product reviews](https://www.kaggle.com/bittlingmayer/amazonreviews).

This data set comes as a tab-separated file (.tsv). It has five columns: `rating`, `date`, `variation`, `verified_reviews`, `feedback`.

`rating` denotes the rating each user gave Alexa (out of 5). `date` indicates the date of the review, and `variation` describes which model the user reviewed. `verified_reviews` contains the text of each review, and `feedback` contains a sentiment label, with 1 denoting positive sentiment (the user liked it) and 0 denoting negative sentiment (the user didn’t).

We are going to develop a classification model that looks at the review text and predicts whether a review is positive or negative. Since this data set already includes whether a review is positive or negative in the `feedback` column, we can use those answers to train and test our model. Our goal here is to produce an accurate model that we could then use to process new user reviews and quickly determine whether they were positive or negative.

### 4.1 Load and prepare data

In [None]:
# Import additional packages
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [None]:
# Load data
url = "https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week8/data/amazon_alexa.tsv"
df = pd.read_csv(url, delimiter="\t")
df.sample(10)

In [None]:
df.info()

In [None]:
# Change date to datetime
df["date"] = pd.to_datetime(df["date"])

In [None]:
df.info()

In [None]:
# Base rate: the data-set is unbalanced!
df.feedback.value_counts()

In [None]:
round(df.feedback.value_counts()[1] / len(df), 4)

#### Tokenizing the Data With spaCy

We create a `spacy_tokenizer()` function that accepts a sentence as input and processes the sentence into tokens, performing lemmatization, lowercasing, and removing stopwords.

__A note from the spaCy documentation__ (this describes spaCy v2; from v3 onwards the `-PRON-` token is no longer produced): spaCy adds a special case for pronouns: all pronouns are lemmatized to the special token `-PRON-`. Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it”, or maybe “he”? spaCy’s solution is to introduce a novel symbol, `-PRON-`, which is used as the lemma for all personal pronouns.

In [None]:
# Create a list of punctuation marks
punctuations = string.punctuation

punctuations

In [None]:
# Create a list of stopwords
stop_words = spacy.lang.en.stop_words.STOP_WORDS

list(stop_words)[:10]

In [None]:
# Load English language model
sp = spacy.load('en_core_web_sm')

# Create tokenizer function
def spacy_tokenizer(sentence):
    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = sp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]
    ## alternative way
    # mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Return preprocessed list of tokens
    return mytokens

# Example
review = df["verified_reviews"].sample()
review.values[0]

In [None]:
spacy_tokenizer(review.values[0])

In [None]:
df.iloc[2982, 3]

In [None]:
spacy_tokenizer(df.iloc[2982, 3])

#### Vectorization Feature Engineering (TF-IDF)

We use the TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize the documents. This is a way of representing how important a particular term is in the context of a given document, based on how many times the term appears and how many other documents that same term appears in. The higher the TF-IDF, the more important that term is to that document.

In [None]:
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer

### 4.2 Classification of the reviews using logistic regression

In [None]:
# Select features
X = df['verified_reviews'] # the features we want to analyze
ylabels = df['feedback'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=1234, stratify=ylabels)

X_train

In [None]:
y_train
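
The `stratify=ylabels` argument used in the split above keeps the class proportions (roughly) equal in the training and test sets, which matters when one class is rare. The sketch below illustrates the idea in plain Python on made-up labels; `stratified_split` is a hypothetical helper, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=1234):
    """Return (train_idx, test_idx) with the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # split each class separately
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up unbalanced labels: 90 positive, 10 negative
labels = [1] * 90 + [0] * 10
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))     # 80 20
print(sum(labels[i] for i in test_idx))  # 18 positives among the 20 test items
```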

In [None]:
# Define classifier
classifier = LogisticRegression()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

In [None]:
# Evaluate the model
def evaluate(true, pred):
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

In [None]:
# Predictions
y_pred = pipe.predict(X_test)

# Evaluation - test set
evaluate(y_test, y_pred)

For the test set, the model predicts the correct sentiment 91.90% of the time. This is almost the same as the base rate (91.84%), so the model barely improves on always predicting "positive". This may be because the sample is unbalanced, with too many positive reviews: the model may not learn to classify negative reviews well since there are too few examples of them.

A recall of 1 means that every review whose sentiment is actually positive is classified as positive.
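
To make the link between base rate, accuracy, and recall concrete, the sketch below computes the metrics by hand for a classifier that always answers "positive". The labels are made up for illustration (9 positive, 1 negative) and `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 9 positive reviews, 1 negative (base rate = 0.9)
y_true = [1] * 9 + [0]
y_pred = [1] * 10  # a "classifier" that always answers positive

m = binary_metrics(y_true, y_pred)
print(m)  # accuracy equals the base rate (0.9) and recall is 1.0
```

Such a classifier reaches the base-rate accuracy and a perfect recall without ever recognising a negative review, which is why accuracy and recall alone can be misleading on unbalanced data.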

We observe approximately the same on the training set, as shown below.

In [None]:
# Evaluate on training set
evaluate(y_train, pipe.predict(X_train))

In [None]:
# Which reviews are classified as negative in test set?
X_test.loc[pipe.predict(X_test) == 0]

In [None]:
# Which reviews are classified as negative in training set?
X_train.loc[pipe.predict(X_train) == 0].values

In [None]:
# Prediction for new reviews
example_review_1 = "I really love the product. It is very helpful. I use it everyday" # positive
example_review_2 = "It stopped working, I want to return it" # negative
example_review_3 = "I don't like it, it is bad" # negative

examples = pd.Series([example_review_1, example_review_2, example_review_3])
examples

In [None]:
pipe.predict(examples)

### 4.3 How can we improve the accuracy?
In order to improve the prediction, we can try to:
* Resample our data (e.g. rebalance the classes so that negative examples carry more weight in training)
* Tune the hyperparameters of the model
* Improve text preparation
* Use another classifier

We illustrate how to improve text preparation, resampling, and the use of another classifier below.

#### 4.3.1 Improve text preparation
The purpose here is to optimize the parameters of the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class.

In [None]:
# Create list of configs
def make_configs():

    models = list()

    # Define config lists
    ngram_range = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
    min_df = [1]
    max_df = [1.0]
    analyzer = ['word', 'char']

    # Create config instances
    for n in ngram_range:
        for i in min_df:
            for j in max_df:
                for a in analyzer:
                    cfg = [n, i, j, a]
                    models.append(cfg)
    return models

# The function has its own name so that re-running this cell does not fail
configs = make_configs()
configs[:10]
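
The nested loops above enumerate every combination of the parameter lists. As an alternative sketch, the same grid can be built with `itertools.product` from the standard library; the variable name `grid` is an assumption made for this example.

```python
from itertools import product

ngram_range = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
min_df = [1]
max_df = [1.0]
analyzer = ["word", "char"]

# product() varies the last list fastest, like the innermost loop
grid = [list(cfg) for cfg in product(ngram_range, min_df, max_df, analyzer)]

print(len(grid))  # 12
print(grid[0])    # [(1, 1), 1, 1.0, 'word']
print(grid[1])    # [(1, 1), 1, 1.0, 'char']
```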

In [None]:
# Define list for result
result = []

for config in configs:

    # Redefine vectorizer
    tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer, 
                                   ngram_range=config[0],
                                   min_df=config[1], max_df=config[2], analyzer=config[3])

    # Define classifier
    classifier = LogisticRegression()

    # Create pipeline
    pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

    # Fit model on training set
    pipe.fit(X_train, y_train)

    # Predictions
    y_pred = pipe.predict(X_test)

    # Print accuracy on test set
    print("CONFIG: ", config)
    evaluate(y_test, y_pred)
    print("-----------------------")

    # Append to result
    result.append([config, accuracy_score(y_test, y_pred)])

None of these configurations improves the prediction noticeably, so we have to search further or try a different approach.

#### 4.3.2 Resampling

In [None]:
# Create balanced dataframe - base rate = 0.5
df_new = pd.concat([df[df["feedback"] == 1].sample(len(df[df["feedback"] == 0])), df[df["feedback"] == 0]], axis=0).reset_index()
df_new
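
The cell above balances the classes by random undersampling: it keeps all negative reviews and draws an equally large random sample of positive ones. The same idea is sketched below in plain Python on made-up rows; `undersample` is a hypothetical helper written for this illustration.

```python
import random

def undersample(rows, label_of, seed=0):
    """Randomly keep as many rows of each class as the smallest class has."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced

# Made-up (review, feedback) rows: 8 positive, 2 negative
rows = [(f"good {i}", 1) for i in range(8)] + [(f"bad {i}", 0) for i in range(2)]
balanced = undersample(rows, label_of=lambda row: row[1])

print(len(balanced))                         # 4
print(sum(label for _, label in balanced))   # 2 positives, hence 2 negatives
```

The price of this approach is that most of the majority class is thrown away. Also, because the test set is drawn from the balanced data, its base rate is 0.5, so the resulting scores are not directly comparable with those obtained on the original data.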

In [None]:
# Select features
X = df_new['verified_reviews'] # the features we want to analyze
ylabels = df_new['feedback'] # the labels, or answers, we want to test against

# Train test split
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, ylabels, test_size=0.2, random_state=1234, stratify=ylabels)

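# Define vectorizer (default settings; the loop in 4.3.1 left its last config in `tfidf_vector`)
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)

# Define classifier
classifier = LogisticRegression()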

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train_b, y_train_b)

# Predictions
y_pred_b = pipe.predict(X_test_b)

# Evaluation - test set
evaluate(y_test_b, y_pred_b)

#### 4.3.3 Use another classifier

In [None]:
# Use random forest
from sklearn.ensemble import RandomForestClassifier

# Define vectorizer
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer

# Define classifier
classifier = RandomForestClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation - test set
evaluate(y_test, y_pred)

# Evaluation - training set
evaluate(y_train, pipe.predict(X_train))

Combining the three techniques above may improve the result further.

#### BONUS

In [None]:
# BONUS: predict the rating
df.sample(5)

In [None]:
df.rating.value_counts()

In [None]:
# Base rate
round(df.rating.value_counts()[5] / len(df), 4)

In [None]:
# Select features
X = df['verified_reviews'] # the features we want to analyze
y = df['rating'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# Define classifier
classifier = LogisticRegression()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Generate Model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation - test set
print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred):.4f}")

In [None]:
# BONUS 2: use random forest

# Define classifier
classifier = RandomForestClassifier(n_estimators=50)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Generate Model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Evaluation - test set
print(f"CONFUSION MATRIX:\n{confusion_matrix(y_test, y_pred)}")
print(f"ACCURACY SCORE:\n{accuracy_score(y_test, y_pred):.4f}")