<a href="https://colab.research.google.com/github/dornercr/INFO371/blob/main/INFO371_week6_7_Text_Representation_allMarkdown.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 371: Data Mining Applications

## Week 6-7: Text Representation
### Prof. Charles Dorner, EdD (Candidate)
### College of Computing and Informatics, Drexel University

# Import Libraries
- spaCy: spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.
- pandas: Used for data manipulation and analysis
- sklearn's CountVectorizer: Convert a collection of text documents to a matrix of token counts
- sklearn's TfidfVectorizer: Convert a collection of raw documents to a matrix of TF-IDF features.

```
import pandas as pd
import numpy as np
from google.colab import files
import matplotlib.pyplot as plt
import spacy
```

# Upload and read the text data

```
sms = pd.read_csv("spam.csv", encoding="latin-1")
sms.head()
```

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


```
sms.shape
```

(5572, 5)

```
sms = sms[["v2", "v1"]]
sms.columns = ["message", "label"]
```

```
sms.shape
```

(5572, 2)

```
sms.head()
```

| | message | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |


```
sms.loc[0].message
```

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

```
sms.loc[2].message
```

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

# Understanding the Data
- The raw file has five columns: v1, v2, and three unnamed columns.
- The v1 column holds the label, which says whether the text is spam or ham (not spam).
- The v2 column contains the message text.
- We kept only v2 and v1 and renamed them to `message` and `label`.

# The label class distribution

```
sms.label.value_counts()
```

| label | count |
|---|---|
| ham | 4825 |
| spam | 747 |


```
sms.label.value_counts() / len(sms)
```

| label | count |
|---|---|
| ham | 0.865937 |
| spam | 0.134063 |
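
The class proportions above matter when reading accuracy figures later in the notebook: a model that always predicts `ham` would already be right about 86.6% of the time. The snippet below is a minimal sketch of that majority-class baseline, using the counts reported above; the variable names are illustrative only.

```python
# Class counts as reported by value_counts() above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())  # 5572 messages in all
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))  # ham 0.8659
```

Any classifier should be judged against this baseline, and precision and recall on the spam class say more than accuracy alone.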


# spaCy Tokenizer
- We will use the spaCy library for word tokenization
- We will load spaCy's English language model
- We will remove stop words and punctuation
- We will extract lemmas

```
nlp = spacy.load("en_core_web_sm")
```

```
doc = nlp(sms.loc[0].message)
```

```
sms.loc[0].message
```

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

```
# Print each token with its lemma, coarse POS, fine-grained tag, and dependency label
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)
```

Go go VERB VB ROOT
until until ADP IN prep
jurong jurong PROPN NNP compound
point point PROPN NNP pobj
, , PUNCT , punct
crazy crazy ADJ JJ advcl
.. .. PUNCT . punct
Available available ADJ JJ ROOT
only only ADV RB advmod
in in ADP IN prep
bugis bugis PROPN NNP nmod
n n X FW cc
great great ADJ JJ amod
world world NOUN NN nmod
la la ADP IN compound
e e PROPN NNP compound
buffet buffet PROPN NNP pobj
... ... PUNCT : punct
Cine Cine PROPN NNP nsubj
there there PRON EX advmod
got get VERB VBD ROOT
amore amore ADV RB amod
wat wat NOUN NN dobj
... ... PUNCT : punct


```
# Collect per-token attributes into a DataFrame for easier inspection
tokens_info = []
for token in doc:
    tokens_info.append([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                        token.shape_, token.is_alpha, token.is_stop])
tokens_df = pd.DataFrame(tokens_info, columns=['Token', 'Lemma', 'POS', 'TAG', 'DEP', 'Shape', 'Is_Alpha', 'Is_Stop'])
tokens_df
```

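| | Token | Lemma | POS | TAG | DEP | Shape | Is_Alpha | Is_Stop |
|---|---|---|---|---|---|---|---|---|
| 0 | Go | go | VERB | VB | ROOT | Xx | True | True |
| 1 | until | until | ADP | IN | prep | xxxx | True | True |
| 2 | jurong | jurong | PROPN | NNP | compound | xxxx | True | False |
| 3 | point | point | PROPN | NNP | pobj | xxxx | True | False |
| 4 | , | , | PUNCT | , | punct | , | False | False |
| 5 | crazy | crazy | ADJ | JJ | advcl | xxxx | True | False |
| 6 | .. | .. | PUNCT | . | punct | .. | False | False |
| 7 | Available | available | ADJ | JJ | ROOT | Xxxxx | True | False |
| 8 | only | only | ADV | RB | advmod | xxxx | True | True |
| 9 | in | in | ADP | IN | prep | xx | True | True |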


# Create a tokenizer using spaCy

```
nlp = spacy.load("en_core_web_sm")

# Create our tokenizer function
def spacy_tokenizer(sentence):
    """Accept a sentence and return a list of tokens after removing stop words
    and punctuation, lemmatizing, and lowercasing."""

    # Parse the sentence into a spaCy Doc with linguistic annotations
    doc = nlp(sentence)

    # Remove stop words and punctuation
    mytokens = [word for word in doc if not word.is_stop and word.pos_ != 'PUNCT']

    # Lemmatize and lowercase each token; keep the surface form for pronouns
    mytokens = [word.lemma_.lower().strip() if word.pos_ != "PRON" else word.text.lower() for word in mytokens]

    # Return the preprocessed list of tokens
    return mytokens
```

```
spacy_tokenizer(sms.loc[345].message)
```

['gudnite', '....', 'tc', 'practice', 'go']

## Retrieval practice on text pre-processing

# Feature Engineering
The objective is to predict whether a text is spam or not. For a classification model to work with the text, we must first convert it into a numeric format.

## Vectorization
- We will convert the labels to 1 or 0 such that spam=1 and ham=0.
- We are going to use Bag of Words (BoW) to convert the text into a numeric format.
- BoW converts text into a matrix that records which words occur in each document: the binary variant records only whether a word occurs, and the count variant records how often. The result is called the BoW matrix or document-term matrix.
- We are going to use sklearn's CountVectorizer to generate the BoW matrix.
- In CountVectorizer we will use our custom tokenizer `spacy_tokenizer` and set the n-gram range, which defines how many adjacent words are combined into one feature. A unigram is a single word and a bigram is a sequence of 2 consecutive words.
- Likewise, an n-gram is a sequence of n consecutive words.
- In this example we are going to use unigrams only, so the lower and upper bounds of the n-gram range will be (1, 1).
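
To make the difference between binary and count vectors concrete, here is a minimal pure-Python sketch of a document-term matrix built from already tokenized documents. It is an illustration under our own assumptions, not how CountVectorizer is implemented; the helper names `ngrams` and `bow_matrix` and the toy documents are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (joined by spaces) found in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow_matrix(docs, n=1, binary=False):
    """Build a document-term matrix from pre-tokenized documents."""
    counted = [Counter(ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*counted))  # one column per term, alphabetical
    rows = []
    for term_counts in counted:
        row = [term_counts[term] for term in vocab]
        if binary:
            # Record only presence or absence of each term
            row = [1 if value > 0 else 0 for value in row]
        rows.append(row)
    return vocab, rows

# Two toy documents, already tokenized
docs = [["free", "entry", "free", "prize"], ["ok", "free", "later"]]

vocab, counts = bow_matrix(docs)                 # unigram counts
_, presence = bow_matrix(docs, binary=True)      # unigram presence/absence
bigram_vocab, _ = bow_matrix(docs, n=2)          # bigram vocabulary

print(vocab)     # ['entry', 'free', 'later', 'ok', 'prize']
print(counts)    # [[1, 2, 0, 0, 1], [0, 1, 1, 1, 0]]
print(presence)  # [[1, 1, 0, 0, 1], [0, 1, 1, 1, 0]]
```

The word "free" appears twice in the first document, so its count is 2 but its binary entry is 1. With n=2 the columns become word pairs such as "free entry".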

```
from sklearn.feature_extraction.text import CountVectorizer
```

## First, test binary vectorization

```
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=True)
```

```
sms.loc[0].message
```

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

```
bow_vector.fit_transform(sms.loc[0:5].message).todense()
```

matrix([[0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
         0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
         0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 0],
        [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
         1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
         0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,


```
# Convert all text into vectors
X = bow_vector.fit_transform(sms.message)
```

```
X.shape
```

(5572, 8213)

```
# Convert class label to numeric 1 or 0
y = sms.label.map({'spam':1, 'ham':0})
y
```

| | label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 5567 | 1 |
| 5568 | 0 |
| 5569 | 0 |
| 5570 | 0 |


# Split data into training and test sets
- We will use sklearn's train_test_split to create training and test sets, stratified on the label
- We will use 80% of the data as the training set and the remaining 20% as the test set
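
Stratifying means the split keeps roughly the same spam/ham mix in both parts, which matters because spam is the minority class. The following pure-Python sketch shows the idea on a toy label list; it is not scikit-learn's implementation, and the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with a similar label mix in each part."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 15 ham and 5 spam
labels = ["ham"] * 15 + ["spam"] * 5
train_idx, test_idx = stratified_split(labels)

test_labels = [labels[i] for i in test_idx]
print(test_labels.count("ham"), test_labels.count("spam"))  # 3 1
```

Without stratification, a small test set could by chance contain very few spam messages, which would make spam precision and recall unreliable.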

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

# Let us build a KNN classifier

```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
cls = KNeighborsClassifier()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

array([0.90134529, 0.90134529, 0.90347924, 0.89225589, 0.90123457])

```
np.mean(scores)
```

np.float64(0.8999320559858678)

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
```

```
preds = cls.predict(X_test)
```

```
preds.shape
```

(1115,)

```
accuracy_score(y_test, preds)
```

0.9103139013452914

```
# scikit-learn metrics expect the true labels first, then the predictions
precision_score(y_test, preds)
```

1.0

```
# scikit-learn metrics expect the true labels first, then the predictions
recall_score(y_test, preds)
```

0.3288590604026846

```
# scikit-learn metrics expect the true labels first, then the predictions
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

Precision: 1.0
Recall: 0.3288590604026846
F1-Measure: 0.494949494949495
Accuracy: 0.9103139013452914
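
The four numbers above can be traced back to confusion counts. scikit-learn's metric functions expect the true labels first and the predictions second; passing them in the opposite order exchanges precision and recall while leaving F1 and accuracy unchanged. The sketch below uses confusion counts that are an assumption on our part, chosen because they reproduce the values 1.0 and 0.3289 printed above for a test set of 1,115 messages; the helper name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive (spam) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Assumed counts (not printed in the notebook): 49 spam caught, 0 ham flagged
# as spam, 100 spam missed, 966 ham correctly passed through.
tp, fp, fn, tn = 49, 0, 100, 966

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Swapping y_true and y_pred swaps the roles of false positives and false negatives
swapped_precision, swapped_recall, _ = precision_recall_f1(tp, fn, fp)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 1.0 0.3289 0.4949 0.9103
```

Read this way, the KNN model never flags a ham message as spam but catches only about a third of the spam, which is why accuracy stays close to the majority-class rate.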



## Second, test count vectorization

```
bow_vector_tf = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=False)
```

```
# Convert all text into vectors
X = bow_vector_tf.fit_transform(sms.message)
```

```
X.shape
```

(5572, 8213)

```
X[0].todense()
```

matrix([[0, 0, 0, ..., 0, 0, 0]])

# Split data into training and test sets
- We will use sklearn's train_test_split to create training and test sets, stratified on the label
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

array([0.90134529, 0.9058296 , 0.91245791, 0.89225589, 0.90572391])

```
np.mean(scores)
```

np.float64(0.9035225196660175)

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
preds = cls.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

Precision: 1.0
Recall: 0.35570469798657717
F1-Measure: 0.5247524752475248
Accuracy: 0.9139013452914798

## Retrieval practice on binary and count vectors


## Test TF-IDF vectorization

```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
```
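
TF-IDF reweights the raw counts so that terms appearing in many documents contribute less than terms concentrated in a few. The following pure-Python sketch assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which to our understanding matches TfidfVectorizer's default settings. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows, for pre-tokenized docs."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

docs = [["free", "prize", "free"], ["free", "later"], ["ok", "later"]]
vocab, weights = tfidf(docs)

print(vocab)  # ['free', 'later', 'ok', 'prize']
for row in weights:
    print([round(v, 3) for v in row])
```

In the third document, "ok" occurs in only one document while "later" occurs in two, so "ok" receives the larger weight even though both appear once.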

```
# Convert all text into vectors
X = tfidf_vector.fit_transform(sms.message)
```

```
X.shape
```

(5572, 8213)

```
(X[3678].toarray() != 0).sum()
```

np.int64(2)

# Split data into training and test sets
- We will use sklearn's train_test_split to create training and test sets, stratified on the label
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

array([0.89798206, 0.89573991, 0.89674523, 0.88552189, 0.89337823])

```
np.mean(scores)
```

np.float64(0.8938734630812359)

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
preds = cls.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

Precision: 1.0
Recall: 0.3087248322147651
F1-Measure: 0.4717948717948718
Accuracy: 0.9076233183856502


## Test Word Embeddings
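- Use pretrained word vectors to embed each word in a message. The code below uses the 300-dimensional vectors that ship with spaCy's `en_core_web_md` model in place of word2vec.
- Use the mean of all word vectors in a message as the message embedding.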
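
The averaging step can be sketched without any library. The example below uses invented 3-dimensional vectors and a hypothetical helper `mean_vector`; how tokens without a vector are handled is a design choice, and here they are simply skipped.

```python
def mean_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have one; all zeros if none do."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the values of each dimension together
    return [sum(component) / len(found) for component in zip(*found)]

# Toy 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "free": [1.0, 0.0, 2.0],
    "prize": [3.0, 2.0, 0.0],
}

print(mean_vector(["free", "prize", "wkly"], toy_vectors, dim=3))  # [2.0, 1.0, 1.0]
```

Every message ends up with a vector of the same length regardless of how many words it contains, which is what lets a classifier such as KNN compare messages directly.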

```
!pip install spacy
!python -m spacy download en_core_web_md

import spacy
import numpy as np
from tqdm import tqdm

# Load the medium model (has 300d word vectors)
nlp = spacy.load("en_core_web_md")  # 300-dim like word2vec-google-news-300

# Function to get the mean word embedding of a message
def get_embedding(text):
    doc = nlp(text)
    return doc.vector  # Doc.vector is the average of the token vectors

# Embed each message
message_embeddings = []
for message in tqdm(sms['message']):
    message_embeddings.append(get_embedding(message))

X = np.array(message_embeddings)
print(X.shape)
```


(5572, 300)





# Split data into training and test sets
- We will use sklearn's train_test_split to create training and test sets, stratified on the label
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

array([0.94506726, 0.93721973, 0.94388328, 0.92929293, 0.91806958])

```
np.mean(scores)
```

np.float64(0.9347065573522972)

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
preds = cls.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

Precision: 0.6931216931216931
Recall: 0.8791946308724832
F1-Measure: 0.7751479289940828
Accuracy: 0.9318385650224216


## Retrieval practice on TF-IDF and embeddings
