<a href="https://colab.research.google.com/github/dornercr/INFO371/blob/main/INFO371_week6_7_Text_Representation_allMarkdown.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 371: Data Mining Applications

## Week 6-7: Text Representation
### Prof. Charles Dorner, EdD (Candidate)
### College of Computing and Informatics, Drexel University

# Import Libraries
- spaCy: a free, open-source Python library for natural language processing (NLP). It provides named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors, and more.
- pandas: used for data manipulation and analysis.
- sklearn's CountVectorizer: converts a collection of text documents to a matrix of token counts (imported later, where it is first used).
- sklearn's TfidfVectorizer: converts a collection of raw documents to a matrix of TF-IDF features (imported later, where it is first used).

```
import pandas as pd
import numpy as np
from google.colab import files
import matplotlib.pyplot as plt
import spacy
```

# Upload and read the text data

```
sms = pd.read_csv("spam.csv", encoding="latin-1")
sms.head()
```

```
sms.shape
```

```
sms = sms[["v2", "v1"]]
sms.columns = ["message", "label"]
```

```
sms.shape
```

```
sms.head()
```

```
sms.loc[0].message
```

```
sms.loc[2].message
```

# Understanding the Data
- The raw file has five columns: v1, v2, and three unnamed columns. We kept only v1 and v2 and renamed them to label and message.
- The v1 column (now label) holds the class of each text: spam or ham (not spam).
- The v2 column (now message) contains the text.

# The label class distribution

```
sms.label.value_counts()
```

```
sms.label.value_counts() / len(sms)
```

# spaCy Tokenizer
- We will use the spaCy library for word tokenization
- We will load spaCy's English language model
- We will remove stop words and punctuation
- We will extract lemmas

```
nlp = spacy.load("en_core_web_sm")
```

```
doc = nlp(sms.loc[0].message)
```

```
sms.loc[0].message
```

```
# Print the linguistic annotations spaCy assigns to each token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)
```

```
tokens_info = []
for token in doc:
    tokens_info.append([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                        token.shape_, token.is_alpha, token.is_stop])
tokens_df = pd.DataFrame(tokens_info,
                         columns=['Token', 'Lemma', 'POS', 'TAG', 'DEP', 'Shape', 'Is_Alpha', 'Is_Stop'])
tokens_df
```

# Create a tokenizer using spaCy

```
nlp = spacy.load("en_core_web_sm")

# Create our tokenizer function
def spacy_tokenizer(sentence):
    """Accept a sentence and return a list of tokens after removing stop words
    and punctuation, lemmatizing, and lowercasing."""

    # Run the spaCy pipeline to get a Doc with linguistic annotations
    doc = nlp(sentence)

    # Remove stop words and punctuation
    mytokens = [word for word in doc if not word.is_stop and word.pos_ != 'PUNCT']

    # Lemmatize and lowercase each token; keep pronouns in their surface form
    mytokens = [word.lemma_.lower().strip() if word.pos_ != "PRON" else word.text.lower() for word in mytokens]

    # Return the preprocessed list of tokens
    return mytokens
```

```
spacy_tokenizer(sms.loc[345].message)
```
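
To see the filtering steps without loading a spaCy model, here is a minimal, dependency-free sketch. It is a simplification and not an equivalent of `spacy_tokenizer`: it splits text with a regular expression, uses a tiny hypothetical stop-word list (`TOY_STOP_WORDS`), and does no lemmatization. The function name `toy_tokenizer` and the sample sentence are made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list (spaCy's real list is much longer)
TOY_STOP_WORDS = {"a", "an", "the", "is", "to", "in", "of", "and", "you", "for"}

def toy_tokenizer(sentence):
    """Lowercase, keep runs of letters/digits/apostrophes, drop stop words."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in TOY_STOP_WORDS]

print(toy_tokenizer("Free entry in a weekly competition, text WIN to 80082!"))
# ['free', 'entry', 'weekly', 'competition', 'text', 'win', '80082']
```

Punctuation disappears because the regular expression never matches it, and "in", "a", and "to" are dropped by the stop-word filter.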

## Retrieval practice on text pre-processing

# Feature Engineering
The objective is to predict whether a text is spam or not. For a classification model to work with the text, we must convert it into a numeric format.

## Vectorization
- We will convert the labels to 1 or 0, such that spam=1 and ham=0.
- We are going to use Bag of Words (BoW) to convert text into a numeric format.
- BoW converts text into a matrix of word occurrences within each document. In its binary form it records only whether a given word occurs in a given document; in its count form it records how many times. The result is called the BoW matrix or document-term matrix.
- We are going to use sklearn's CountVectorizer to generate the BoW matrix.
- In CountVectorizer we will use our custom tokenizer `spacy_tokenizer` and the `ngram_range` parameter, which defines which runs of adjacent words become features. A unigram is a single word and a bigram is a sequence of 2 consecutive words.
- Likewise, an n-gram is a sequence of n consecutive words.
- In this example we are going to use unigrams only, so the lower and upper bounds of the n-gram range will be (1,1).
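
The ideas in the list above can be sketched in plain Python on two made-up, already tokenized messages. This is only an illustration of what a binary document-term matrix and n-grams look like, not how CountVectorizer is implemented; the helper `ngrams` and the toy `docs` are assumptions for this example.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Two toy messages, already tokenized
docs = [["free", "prize", "call", "now"],
        ["call", "me", "now"]]

# Unigram vocabulary, sorted alphabetically for readability
vocab = sorted({tok for doc in docs for tok in doc})

# Binary document-term matrix: 1 if the term occurs in the document, else 0
bow = [[1 if term in doc else 0 for term in vocab] for doc in docs]

print(vocab)                # ['call', 'free', 'me', 'now', 'prize']
print(bow)                  # [[1, 1, 0, 1, 1], [1, 0, 1, 1, 0]]
print(ngrams(docs[0], 2))   # ['free prize', 'prize call', 'call now']
```

Each row of `bow` is one message and each column is one vocabulary term. With an n-gram range of (1,2), the bigrams would be added to the vocabulary as extra columns.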

```
from sklearn.feature_extraction.text import CountVectorizer
```

## First, test binary vectorization

```
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=True)
```

```
sms.loc[0].message
```

```
bow_vector.fit_transform(sms.loc[0:5].message).todense()
```

```
# Convert all text into vectors
X = bow_vector.fit_transform(sms.message)
```

```
X.shape
```

```
# Convert class label to numeric 1 or 0
y = sms.label.map({'spam':1, 'ham':0})
y
```

# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```
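
The `stratify=y` argument keeps the spam/ham ratio the same in both sets. The sketch below illustrates that idea in plain Python on made-up labels. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper that samples the test fraction separately within each class.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy imbalanced labels: 80 ham, 20 spam
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))  # 64 ham, 16 spam
print(Counter(labels[i] for i in test_idx))   # 16 ham, 4 spam
```

Both sets keep the original 4:1 ratio. Without stratification, a small test set drawn from imbalanced data could end up with very few spam messages by chance.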

# Let us build a KNN classifier

```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
cls = KNeighborsClassifier()
```
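
Called with no arguments, `KNeighborsClassifier()` uses 5 neighbours and Euclidean distance. To make the decision rule concrete, here is a minimal plain-Python sketch of k-nearest-neighbour voting on made-up binary vectors. The function `knn_predict` and the toy data are assumptions for illustration; scikit-learn's implementation is far more efficient and handles sparse matrices.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    distances = [(math.dist(row, x), label) for row, label in zip(X_train, y_train)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy binary BoW rows: the first three are spam (1), the last three are ham (0)
X_toy = [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 0, 1, 0, 0],
         [0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 0, 1]]
y_toy = [1, 1, 1, 0, 0, 0]

print(knn_predict(X_toy, y_toy, [0, 1, 1, 0, 0], k=3))  # 1 (spam)
print(knn_predict(X_toy, y_toy, [0, 0, 0, 1, 0], k=3))  # 0 (ham)
```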

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

```
np.mean(scores)
```

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
```

```
preds = cls.predict(X_test)
```

```
preds.shape
```

```
accuracy_score(y_test, preds)
```

```
precision_score(y_test, preds)
```

```
recall_score(y_test, preds)
```

```
# scikit-learn metrics take (y_true, y_pred) in that order
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```
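
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy and F1 give the same value either way for a binary problem, but precision and recall trade places when the arguments are swapped. The plain-Python sketch below computes the three scores from confusion counts on made-up labels to show this; `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 4 spam (1) and 6 ham (0); the model finds 2 spam and raises 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

print(precision_recall_f1(y_true, y_pred))  # precision 2/3, recall 1/2
print(precision_recall_f1(y_pred, y_true))  # swapped arguments: precision 1/2, recall 2/3
```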

## Second, test count vectorization

```
bow_vector_tf = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=False)
```

```
# Convert all text into vectors
X = bow_vector_tf.fit_transform(sms.message)
```

```
X.shape
```

```
X[0].todense()
```
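
The only difference from the binary setting is that repeated words now contribute their full count. A small plain-Python illustration on a made-up token list (the vocabulary and tokens are assumptions for this example):

```python
from collections import Counter

# Toy vocabulary and one tokenized message with repeated words
vocab = ["call", "free", "now", "prize"]
tokens = ["call", "now", "call", "free", "free", "free"]

counts = Counter(tokens)

# Count vector: how many times each vocabulary term occurs
count_vector = [counts[term] for term in vocab]

# Binary vector: only whether each term occurs at all
binary_vector = [1 if counts[term] > 0 else 0 for term in vocab]

print(count_vector)   # [2, 3, 1, 0]
print(binary_vector)  # [1, 1, 1, 0]
```

For short SMS messages most words occur once, so the two representations often differ in only a few cells.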

# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

```
np.mean(scores)
```

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
preds = cls.predict(X_test)
# scikit-learn metrics take (y_true, y_pred) in that order
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

## Retrieval practice on binary and count vectorization

## Third, test TF-IDF vectorization

```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
```

```
# Convert all text into vectors
X = tfidf_vector.fit_transform(sms.message)
```

```
X.shape
```

```
(X[3678].toarray() != 0).sum()
```
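
To see where the TF-IDF weights come from, the sketch below computes them by hand on three made-up tokenized messages. It assumes scikit-learn's documented defaults for TfidfVectorizer: raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy `docs` and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Three toy messages, already tokenized
docs = [["free", "prize", "free"],
        ["call", "me"],
        ["free", "call"]]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many messages each term appears
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized message."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

print(vocab)
for doc in docs:
    print([round(v, 3) for v in tfidf_row(doc)])
```

Terms that appear in fewer messages ("me", "prize") get a larger IDF than terms that appear in more ("call", "free"), so rare words weigh more within a row.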

# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

```
np.mean(scores)
```

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
preds = cls.predict(X_test)
# scikit-learn metrics take (y_true, y_pred) in that order
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

## Test Word Embeddings
- Use word2vec to embed each word in a message as a vector.
- Use the mean of all word vectors in a message as the message embedding.
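
The averaging step can be illustrated with made-up 3-dimensional vectors before loading a real model. In this sketch, `toy_vectors`, `DIM`, and `mean_embedding` are hypothetical names, and out-of-vocabulary words are replaced by a zero vector, the same fallback this notebook uses.

```python
# Hypothetical 3-dimensional "word vectors" (real word2vec vectors have 300 dimensions)
toy_vectors = {
    "free":  [1.0, 0.0, 2.0],
    "prize": [3.0, 2.0, 0.0],
}
DIM = 3

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of all tokens; unknown tokens count as zero vectors."""
    rows = [vectors.get(tok, [0.0] * dim) for tok in tokens]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]

print(mean_embedding(["free", "prize"], toy_vectors, DIM))           # [2.0, 1.0, 1.0]
print(mean_embedding(["free", "prize", "xyzzy"], toy_vectors, DIM))  # pulled toward zero
```

Because unknown words are averaged in as zeros, a message with many out-of-vocabulary tokens ends up with a smaller-magnitude embedding. Skipping unknown tokens instead is another common choice.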

```
import gensim.downloader as gensim
```

```
# Load a pre-trained Word2Vec model ('word2vec-google-news-300').
# The first call downloads the vectors, which is a large download (well over 1 GB).
word2vec = gensim.load('word2vec-google-news-300')
```

```
def get_embedding(text):
    """Generates an embedding vector for a given text using Word2Vec."""
    tokens = spacy_tokenizer(text)  # Assuming spacy_tokenizer is defined in the previous code
    vectors = []
    for token in tokens:
        try:
            vectors.append(word2vec[token])
        except KeyError:
            # Handle out-of-vocabulary words (e.g., use a zero vector)
            vectors.append(np.zeros(300))  # Assuming the embedding dimension is 300

    if vectors:  # Check if there are any valid word embeddings
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(300)  # Return a zero vector if no valid word embeddings are found
```

```
from tqdm import tqdm
```

```
# Embed each message as a vector
message_embeddings = []
for message in tqdm(sms['message']):
    message_embeddings.append(get_embedding(message))
```

```
X = np.array(message_embeddings)
X.shape
```

# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

```
scores
```

```
np.mean(scores)
```

# Test the classifier

```
cls.fit(X_train, y_train)
```

```
preds = cls.predict(X_test)
# scikit-learn metrics take (y_true, y_pred) in that order
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

## Retrieval practice on tfidf and embeddings
