<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/nlp/data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation

In [None]:
# Original text with emojis
original_text = """
Yesterday, I couldn't believe my eyes:
I saw a man carrying a bunch of colorful balloons 🎈🎈 flying
over the city! It was like something out of a fairy tale.
Later, I read in the news that he's a scientist
who's experimenting with new forms of transportation.
His 'flying' experiment was part of a larger study.
I hope it goes well; it's such a unique idea! 😊
""".replace('\n', ' ')

## Step 0: Parse Text

In [None]:
!pip install emoji


In [None]:
import emoji

# Convert emojis to their text description
parsed_text = emoji.demojize(original_text)

print(parsed_text)


 Yesterday, I couldn't believe my eyes:  I saw a man carrying a bunch of colorful balloons :balloon::balloon: flying  over the city! It was like something out of a fairy tale.  Later, I read in the news that he's a scientist  who's experimenting with new forms of transportation.  His 'flying' experiment was part of a larger study.  I hope it goes well; it's such a unique idea! :smiling_face_with_smiling_eyes: 


## Step 1: Remove Special Characters

In [None]:
# Nothing to install for this step: the `re` module used below is part of the Python standard library.


In [None]:
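# Import Python's built-in regular-expression module
import re

# demojize() encodes each emoji as :name:, so a regex pattern can remove them.
# This pattern only covers names made of letters and underscores; names that
# contain digits or hyphens would need a broader character class.
emoji_pattern = re.compile(r":[a-zA-Z_]+:")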
step1_text = emoji_pattern.sub("", parsed_text)
print(step1_text)


 Yesterday, I couldn't believe my eyes:  I saw a man carrying a bunch of colorful balloons  flying  over the city! It was like something out of a fairy tale.  Later, I read in the news that he's a scientist  who's experimenting with new forms of transportation.  His 'flying' experiment was part of a larger study.  I hope it goes well; it's such a unique idea!  


## Step 2: Handle Contractions

In [None]:
# Expand Contractions
# For this example, we'll manually define a dictionary of contractions and their expansions

contractions_dict = {
    "couldn't": "could not",
    "he's": "he is",
    "who's": "who is",
    "it's": "it is",
}

step2_text = step1_text
for contraction, expanded in contractions_dict.items():
    step2_text = step2_text.replace(contraction, expanded)
print(step2_text)


 Yesterday, I could not believe my eyes:  I saw a man carrying a bunch of colorful balloons  flying  over the city! It was like something out of a fairy tale.  Later, I read in the news that he is a scientist  who is experimenting with new forms of transportation.  His 'flying' experiment was part of a larger study.  I hope it goes well; it is such a unique idea!  
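Plain `str.replace` is case-sensitive and also matches inside longer words (for example `he's` inside `she's`). The following is a minimal sketch of a regex-based alternative, not the notebook's method: the `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical names introduced here, and the mapping is deliberately tiny. Note that forms such as `he's` are ambiguous ("he is" or "he has"), which a lookup table cannot resolve.

```python
import re

# Hypothetical, deliberately small mapping; extend it for real data.
CONTRACTIONS = {
    "couldn't": "could not",
    "he's": "he is",
    "who's": "who is",
    "it's": "it is",
}

# One alternation of all keys (longest first). The lookarounds stop a key
# from matching inside a longer word, e.g. "he's" inside "she's".
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand known contractions, keeping an initial capital letter."""
    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _pattern.sub(_replace, text)

print(expand_contractions("It's odd that she's sure he's right."))
# It is odd that she's sure he is right.
```

Here `she's` is left untouched because it is not in the mapping, while the capitalized `It's` is still expanded.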


## Step 3: Punctuation

In [None]:
# Remove Punctuation
# This regex removes every character that is not a word character (letter, digit or underscore),
# whitespace, or an apostrophe. Apostrophes are kept here and handled in Step 5.

step3_text = re.sub(r'[^\w\s\']', '', step2_text)
print(step3_text)


 Yesterday I could not believe my eyes  I saw a man carrying a bunch of colorful balloons  flying  over the city It was like something out of a fairy tale  Later I read in the news that he is a scientist  who is experimenting with new forms of transportation  His 'flying' experiment was part of a larger study  I hope it goes well it is such a unique idea  


## Step 4: Lowercasing

In [None]:
# Convert to Lowercase
# Python's string method .lower() will convert all characters to lowercase

step4_text = step3_text.lower()
print(step4_text)


 yesterday i could not believe my eyes  i saw a man carrying a bunch of colorful balloons  flying  over the city it was like something out of a fairy tale  later i read in the news that he is a scientist  who is experimenting with new forms of transportation  his 'flying' experiment was part of a larger study  i hope it goes well it is such a unique idea  


## Step 5: Handle Quotation Marks

In [None]:
# Remove Quotation Marks
# Step 3 kept apostrophes (single quotes), so they are removed here.
# The pattern also covers double quotes in case any are left.

step5_text = re.sub(r'[\'\"]', '', step4_text)
print(step5_text)


 yesterday i could not believe my eyes  i saw a man carrying a bunch of colorful balloons  flying  over the city it was like something out of a fairy tale  later i read in the news that he is a scientist  who is experimenting with new forms of transportation  his flying experiment was part of a larger study  i hope it goes well it is such a unique idea  


## Step 6: Whitespace Normalization

In [None]:
# Normalize Whitespace
# We can use regex to replace any series of whitespace characters with a single space

step6_text = re.sub(r'\s+', ' ', step5_text).strip()
print(step6_text)


yesterday i could not believe my eyes i saw a man carrying a bunch of colorful balloons flying over the city it was like something out of a fairy tale later i read in the news that he is a scientist who is experimenting with new forms of transportation his flying experiment was part of a larger study i hope it goes well it is such a unique idea
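Steps 1 to 6 can be chained into one reusable function. This is a rough sketch under the assumption that the input has already been demojized as in Step 0; the function name `clean_text` and the `sample` string are illustrative and not part of the original notebook.

```python
import re

def clean_text(text, contractions=None):
    """Apply Steps 1-6 in order to an already demojized string."""
    contractions = contractions or {}
    text = re.sub(r":[a-zA-Z_]+:", "", text)   # Step 1: drop :emoji_name: codes
    for short, full in contractions.items():    # Step 2: expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s']", "", text)       # Step 3: drop punctuation, keep apostrophes
    text = text.lower()                         # Step 4: lowercase
    text = re.sub(r"['\"]", "", text)          # Step 5: drop quotation marks
    return re.sub(r"\s+", " ", text).strip()    # Step 6: normalize whitespace

# Illustrative input, not taken from the notebook
sample = "I couldn't wait :balloon:  for the 'flying' test!"
print(clean_text(sample, {"couldn't": "could not"}))
# i could not wait for the flying test
```

The order matters: contractions are expanded before apostrophes are stripped, otherwise `couldn't` would become `couldnt` and no longer match the dictionary.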


## Step 7: Tokens

In [None]:
!pip install nltk


In [None]:
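import nltk
from nltk.tokenize import word_tokenize

# word_tokenize() needs the Punkt tokenizer data; download it once.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')

# Split the cleaned text into word tokens
tokens = word_tokenize(step6_text)
print(tokens)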

['yesterday', 'i', 'could', 'not', 'believe', 'my', 'eyes', 'i', 'saw', 'a', 'man', 'carrying', 'a', 'bunch', 'of', 'colorful', 'balloons', 'flying', 'over', 'the', 'city', 'it', 'was', 'like', 'something', 'out', 'of', 'a', 'fairy', 'tale', 'later', 'i', 'read', 'in', 'the', 'news', 'that', 'he', 'is', 'a', 'scientist', 'who', 'is', 'experimenting', 'with', 'new', 'forms', 'of', 'transportation', 'his', 'flying', 'experiment', 'was', 'part', 'of', 'a', 'larger', 'study', 'i', 'hope', 'it', 'goes', 'well', 'it', 'is', 'such', 'a', 'unique', 'idea']
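On text that has already been cleaned, tokenization amounts to splitting on spaces. To see what a tokenizer does on raw text, here is a small regex-based sketch. It is not NLTK's algorithm (NLTK splits contractions differently, for instance), and the helper name `simple_tokenize` is an assumption made for this illustration.

```python
import re

def simple_tokenize(text):
    """Split text into words (keeping inner apostrophes) and punctuation marks."""
    # \w+(?:'\w+)?  -> a word, optionally followed by an apostrophe part ("it's")
    # [^\w\s]       -> any single punctuation character as its own token
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(simple_tokenize("I hope it goes well; it's unique!"))
# ['I', 'hope', 'it', 'goes', 'well', ';', "it's", 'unique', '!']
```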


## Step 8a: Stemming

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)

['yesterday', 'i', 'could', 'not', 'believ', 'my', 'eye', 'i', 'saw', 'a', 'man', 'carri', 'a', 'bunch', 'of', 'color', 'balloon', 'fli', 'over', 'the', 'citi', 'it', 'wa', 'like', 'someth', 'out', 'of', 'a', 'fairi', 'tale', 'later', 'i', 'read', 'in', 'the', 'news', 'that', 'he', 'is', 'a', 'scientist', 'who', 'is', 'experi', 'with', 'new', 'form', 'of', 'transport', 'hi', 'fli', 'experi', 'wa', 'part', 'of', 'a', 'larger', 'studi', 'i', 'hope', 'it', 'goe', 'well', 'it', 'is', 'such', 'a', 'uniqu', 'idea']


## Step 8b: Lemmatization

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm


In [None]:
import spacy

# Load the Spacy model
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(step6_text)

# Tokenization and Lemmatization
lemmatized_tokens = [token.lemma_ for token in doc]

# Printing lemmatized tokens
print(lemmatized_tokens)


['yesterday', 'I', 'could', 'not', 'believe', 'my', 'eye', 'I', 'see', 'a', 'man', 'carry', 'a', 'bunch', 'of', 'colorful', 'balloon', 'fly', 'over', 'the', 'city', 'it', 'be', 'like', 'something', 'out', 'of', 'a', 'fairy', 'tale', 'later', 'I', 'read', 'in', 'the', 'news', 'that', 'he', 'be', 'a', 'scientist', 'who', 'be', 'experiment', 'with', 'new', 'form', 'of', 'transportation', 'his', 'fly', 'experiment', 'be', 'part', 'of', 'a', 'large', 'study', 'I', 'hope', 'it', 'go', 'well', 'it', 'be', 'such', 'a', 'unique', 'idea']


## Step 9: Removal of Stopwords

In [None]:
# Generate a list of lemmatized tokens that are not stopwords and not punctuation
filtered_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Print the non-stopwords tokens
print(filtered_tokens)

['yesterday', 'believe', 'eye', 'see', 'man', 'carry', 'bunch', 'colorful', 'balloon', 'fly', 'city', 'like', 'fairy', 'tale', 'later', 'read', 'news', 'scientist', 'experiment', 'new', 'form', 'transportation', 'fly', 'experiment', 'large', 'study', 'hope', 'go', 'unique', 'idea']


# Analysing and Modeling

## BoW

In [None]:
# The PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-learn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer(
    stop_words='english',
    lowercase=True,
    token_pattern=r"(?u)\b\w\w+\b"
)

corpus = [original_text]

# Fit the vectorizer on your text data (i.e. corpus) to build the vocabulary
vectorizer.fit(corpus)

# Transform your text data into a BoW representation
bow_representation = vectorizer.transform(corpus)

# The BoW representation is now a sparse matrix containing the word counts
# You can convert it to an array if you want to see it in a more readable format
bow_array = bow_representation.toarray()

# To see the feature names (words) corresponding to each column in the BoW array
feature_names = vectorizer.get_feature_names_out()

# The bow_array is your BoW representation and feature_names are the words in the vocabulary
print(bow_array)
print(feature_names)

[[1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
['balloons' 'believe' 'bunch' 'carrying' 'city' 'colorful' 'couldn'
 'experiment' 'experimenting' 'eyes' 'fairy' 'flying' 'forms' 'goes'
 'hope' 'idea' 'larger' 'later' 'like' 'man' 'new' 'news' 'read' 'saw'
 'scientist' 'study' 'tale' 'transportation' 'unique' 'yesterday']
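To make the counting explicit, the same idea can be sketched with the standard library only. This is an illustration, not what `CountVectorizer` does internally: the `STOP_WORDS` set is a hypothetical mini list (scikit-learn's built-in English list is much longer), and the sample sentence and `bag_of_words` helper are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {"the", "is", "on", "and", "in", "are"}

def bag_of_words(text):
    """Count tokens of two or more word characters that are not stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

counts = bag_of_words("The big cat is on the table and the small cat in the window")
vocabulary = sorted(counts)                # column order: alphabetical
vector = [counts[w] for w in vocabulary]   # one count per vocabulary word

print(vocabulary)
# ['big', 'cat', 'small', 'table', 'window']
print(vector)
# [1, 2, 1, 1, 1]
```

Each position in `vector` is the number of times the corresponding vocabulary word occurs, which is exactly what one row of a BoW matrix holds.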


## Cosine Similarity

In [None]:
!pip install numpy

In [None]:
t1 = "The big cat is on the table and the small cat in the window "
t2 = "The table and the window are small"
t3 = "The moon and the small tree are big"

corpus = [t1, t2, t3]


vectorizer.fit(corpus)
bow_representation = vectorizer.transform(corpus)
bow_array = bow_representation.toarray()
feature_names = vectorizer.get_feature_names_out()
print(bow_array)
print(feature_names)

[[1 2 0 1 1 0 1]
 [0 0 0 1 1 0 1]
 [1 0 1 1 0 1 0]]
['big' 'cat' 'moon' 'small' 'table' 'tree' 'window']


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# bow_array is already a NumPy array (returned by toarray()); np.array() simply copies it
vector_array = np.array(bow_array)

# Calculate the cosine similarity between all pairs of vectors in the list
# This will produce a similarity matrix where each element (i, j) is the cosine similarity
# between the ith and jth vectors in the list
cosine_similarities = cosine_similarity(vector_array)

cosine_similarities


array([[1.        , 0.61237244, 0.35355339],
       [0.61237244, 1.        , 0.28867513],
       [0.35355339, 0.28867513, 1.        ]])
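The value in row 0, column 1 can be checked by hand: cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch with the standard library, using the first two BoW rows printed earlier (the helper name `cosine` is introduced here for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # raises ZeroDivisionError for a zero vector

t1_vec = [1, 2, 0, 1, 1, 0, 1]  # BoW row for t1
t2_vec = [0, 0, 0, 1, 1, 0, 1]  # BoW row for t2

# dot = 3, |t1| = sqrt(8), |t2| = sqrt(3), so the result is 3 / sqrt(24)
print(round(cosine(t1_vec, t2_vec), 8))
# 0.61237244
```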

## Word2Vec

In [None]:
!pip install gensim

In [None]:
from gensim.models import Word2Vec

# Normally, you would need a large corpus of sentences to train a Word2Vec model.

# Let's create a simple corpus using the given words.
# This is not ideal for training a Word2Vec model but is only for demonstration purposes.
sentences = [filtered_tokens]  # In reality, you would need a much larger and varied corpus

# Initialize and train the Word2Vec model
model = Word2Vec(sentences=sentences, vector_size=5, window=5, min_count=1, workers=2)

# Extract the Word2Vec embeddings for each word
word_vectors = {token: model.wv[token] for token in filtered_tokens}

# The word_vectors dictionary now contains the Word2Vec embeddings for the words
print(word_vectors)

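## GloVe

In [None]:
import spacy

# Requires the medium model: !python -m spacy download en_core_web_md
# Load a spaCy model that ships static word vectors. In older spaCy releases these
# were derived from GloVe; newer model versions may use different vectors,
# so check the model's metadata if the vector source matters.
nlp = spacy.load('en_core_web_md')  # or 'en_core_web_lg' for a larger vector table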

# Access vectors for each word using the loaded model
glove_vectors = {token: nlp.vocab[token].vector for token in filtered_tokens}

print(glove_vectors)

## FastText

In [None]:
from gensim.models.fasttext import load_facebook_model

# Make sure to update this path to where your FastText model is stored
model_path = '/path/to/fasttext/model.bin'

# Load the FastText model. load_fasttext_format() is deprecated in favor of
# load_facebook_model(), which reads the same Facebook .bin format.
model = load_facebook_model(model_path)

# Get the FastText vectors for each word
fasttext_vectors = {token: model.wv[token] for token in filtered_tokens}
print(fasttext_vectors)

## Sentence Transformers

In [None]:
!pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentence = ["Yesterday, I couldn't believe my eyes: I saw a man carrying a bunch of colorful balloons flying over the city! It was like something out of a fairy tale. Later, I read in the news that he's a scientist who's experimenting with new forms of transportation. His 'flying' experiment was part of a larger study. I hope it goes well; it's such a unique idea! "]

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
print(embedding)