<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/nlp/data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation

In [None]:
# Original text with emojis
original_text = """
Yesterday, I couldn't believe my eyes:
I saw a man carrying a bunch of colorful balloons 🎈🎈 flying
over the city! It was like something out of a fairy tale.
Later, I read in the news that he's a scientist
who's experimenting with new forms of transportation.
His 'flying' experiment was part of a larger study.
I hope it goes well; it's such a unique idea! 😊
""".replace('\n', ' ')

## Step 0: Parse Text

In [None]:
!pip install emoji



In [None]:
import emoji

# Convert emojis to their text description
parsed_text = emoji.demojize(original_text)

print(parsed_text)


 Yesterday, I couldn't believe my eyes:  I saw a man carrying a bunch of colorful balloons :balloon::balloon: flying  over the city! It was like something out of a fairy tale.  Later, I read in the news that he's a scientist  who's experimenting with new forms of transportation.  His 'flying' experiment was part of a larger study.  I hope it goes well; it's such a unique idea! :smiling_face_with_smiling_eyes: 


## Step 1: Remove Special Characters

In [None]:
# No installation is needed here: the `re` module used below is part of Python's standard library



In [None]:
# Import the regex module
import re

# Emojis are encoded as :text: so we can remove them using a regex pattern
emoji_pattern = re.compile(":[a-zA-Z_]+:")
step1_text = emoji_pattern.sub("", parsed_text)
print(step1_text)


 Yesterday, I couldn't believe my eyes:  I saw a man carrying a bunch of colorful balloons  flying  over the city! It was like something out of a fairy tale.  Later, I read in the news that he's a scientist  who's experimenting with new forms of transportation.  His 'flying' experiment was part of a larger study.  I hope it goes well; it's such a unique idea!  
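
The pattern above only matches codes made of letters and underscores. The sketch below compares it with a slightly broader character class that also accepts digits and hyphens. The sample string and the broader pattern are illustrative assumptions, not part of the original notebook, and the broader pattern could also strip colon-delimited text such as the middle of a timestamp.

```python
import re

# Pattern used in the step above: letters and underscores only
narrow_pattern = re.compile(r":[a-zA-Z_]+:")
# Assumed broader variant: also allows digits and hyphens inside the code
broad_pattern = re.compile(r":[\w\-]+:")

# Illustrative string (not taken from the notebook's text)
sample = "balloons :balloon::balloon: and a medal :1st_place_medal:"

narrow_result = narrow_pattern.sub("", sample)
broad_result = broad_pattern.sub("", sample)

print(repr(narrow_result))  # the code starting with a digit is left in place
print(repr(broad_result))   # every code is removed
```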


## Step 2: Handle Contractions

In [None]:
# Expand Contractions
# For this example, we'll manually define a dictionary of contractions and their expansions

contractions_dict = {
    "couldn't": "could not",
    "he's": "he is",
    "who's": "who is",
    "it's": "it is",
}

step2_text = step1_text
for contraction, expanded in contractions_dict.items():
    step2_text = step2_text.replace(contraction, expanded)
print(step2_text)


 Yesterday, I could not believe my eyes:  I saw a man carrying a bunch of colorful balloons  flying  over the city! It was like something out of a fairy tale.  Later, I read in the news that he is a scientist  who is experimenting with new forms of transportation.  His 'flying' experiment was part of a larger study.  I hope it goes well; it is such a unique idea!  
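
Plain `str.replace` is case-sensitive and ignores word boundaries, so a capitalized "It's" would be left untouched. A minimal sketch of one alternative, assuming the same four-entry dictionary, uses a single case-insensitive regex with `\b` boundaries. The helper name `expand_contractions` and the sample sentence are hypothetical.

```python
import re

contractions = {
    "couldn't": "could not",
    "he's": "he is",
    "who's": "who is",
    "it's": "it is",
}

# One alternation pattern, bounded by \b and matched case-insensitively
pattern = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in contractions) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand known contractions, keeping a leading capital letter."""
    def replace(match):
        found = match.group(0)
        expanded = contractions[found.lower()]
        # "It's" -> "It is", "it's" -> "it is"
        return expanded.capitalize() if found[0].isupper() else expanded
    return pattern.sub(replace, text)

expanded_sample = expand_contractions("It's odd, but she's sure he's right.")
print(expanded_sample)
```

Because of the word boundaries, "she's" is not matched by the "he's" entry and stays as written.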


## Step 3: Punctuation

In [None]:
# Remove Punctuation
# We can use regex to remove every character that is not a word character, whitespace, or an apostrophe
# (apostrophes and quotation marks are handled in Step 5)

step3_text = re.sub(r'[^\w\s\']', '', step2_text)
print(step3_text)


 Yesterday I could not believe my eyes  I saw a man carrying a bunch of colorful balloons  flying  over the city It was like something out of a fairy tale  Later I read in the news that he is a scientist  who is experimenting with new forms of transportation  His 'flying' experiment was part of a larger study  I hope it goes well it is such a unique idea  
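
Two details of this pattern are easy to overlook: in Python 3, `\w` is Unicode-aware, so accented letters survive, and it also matches the underscore. The short sketch below illustrates both on made-up strings. The helper name `strip_punctuation` is hypothetical.

```python
import re

def strip_punctuation(text):
    # Same pattern as in the step above
    return re.sub(r"[^\w\s\']", "", text)

# Illustrative inputs (not from the notebook's text)
accented = strip_punctuation("caf\u00e9, na\u00efve & snake_case!")
quoted = strip_punctuation("it's 'quoted'.")

print(repr(accented))  # accented letters and the underscore are kept
print(repr(quoted))    # apostrophes are kept for Step 5
```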


## Step 4: Lowercasing

In [None]:
# Convert to Lowercase
# Python's string method .lower() will convert all characters to lowercase

step4_text = step3_text.lower()
print(step4_text)


 yesterday i could not believe my eyes  i saw a man carrying a bunch of colorful balloons  flying  over the city it was like something out of a fairy tale  later i read in the news that he is a scientist  who is experimenting with new forms of transportation  his 'flying' experiment was part of a larger study  i hope it goes well it is such a unique idea  


## Step 5: Handle Quotation Marks

In [None]:
# Remove Quotation Marks
# Step 3 kept apostrophes, so single quotes (e.g., around 'flying') and any double quotes are removed here

step5_text = re.sub(r'[\'\"]', '', step4_text)
print(step5_text)


 yesterday i could not believe my eyes  i saw a man carrying a bunch of colorful balloons  flying  over the city it was like something out of a fairy tale  later i read in the news that he is a scientist  who is experimenting with new forms of transportation  his flying experiment was part of a larger study  i hope it goes well it is such a unique idea  


## Step 6: Whitespace Normalization

In [None]:
# Normalize Whitespace
# We can use regex to replace any series of whitespace characters with a single space

step6_text = re.sub(r'\s+', ' ', step5_text).strip()
print(step6_text)


yesterday i could not believe my eyes i saw a man carrying a bunch of colorful balloons flying over the city it was like something out of a fairy tale later i read in the news that he is a scientist who is experimenting with new forms of transportation his flying experiment was part of a larger study i hope it goes well it is such a unique idea
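
Steps 1 to 6 can be collected into a single function so that the same cleaning is applied to any input. This is only a sketch that chains the exact operations shown above. The function name `normalize` and the constant name `CONTRACTIONS` are assumptions made for illustration.

```python
import re

CONTRACTIONS = {
    "couldn't": "could not",
    "he's": "he is",
    "who's": "who is",
    "it's": "it is",
}

def normalize(text):
    """Apply Steps 1-6 in order to text whose emojis were already demojized."""
    text = re.sub(r":[a-zA-Z_]+:", "", text)         # Step 1: drop :emoji_codes:
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)   # Step 2: expand contractions
    text = re.sub(r"[^\w\s\']", "", text)            # Step 3: remove punctuation
    text = text.lower()                              # Step 4: lowercase
    text = re.sub(r"[\'\"]", "", text)               # Step 5: remove quotation marks
    return re.sub(r"\s+", " ", text).strip()         # Step 6: normalize whitespace

cleaned = normalize(
    "I hope it goes well; it's such a unique idea! :smiling_face_with_smiling_eyes:"
)
print(cleaned)
```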


## Step 7: Tokens

In [None]:
!pip install nltk



In [None]:
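import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on the Punkt tokenizer data, which must be downloaded once.
# Older NLTK releases look for 'punkt'; recent ones look for 'punkt_tab'.
nltk.download('punkt')
nltk.download('punkt_tab')

tokens = word_tokenize(step6_text)
print(tokens)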

['yesterday', 'i', 'could', 'not', 'believe', 'my', 'eyes', 'i', 'saw', 'a', 'man', 'carrying', 'a', 'bunch', 'of', 'colorful', 'balloons', 'flying', 'over', 'the', 'city', 'it', 'was', 'like', 'something', 'out', 'of', 'a', 'fairy', 'tale', 'later', 'i', 'read', 'in', 'the', 'news', 'that', 'he', 'is', 'a', 'scientist', 'who', 'is', 'experimenting', 'with', 'new', 'forms', 'of', 'transportation', 'his', 'flying', 'experiment', 'was', 'part', 'of', 'a', 'larger', 'study', 'i', 'hope', 'it', 'goes', 'well', 'it', 'is', 'such', 'a', 'unique', 'idea']
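
After Step 6 the text contains only lowercase words separated by single spaces, so a plain whitespace split is a dependency-free fallback for this kind of input. The sketch below uses the first words of the normalized text. On raw text with punctuation or contractions, a dedicated tokenizer such as `word_tokenize` would behave differently.

```python
# First words of the normalized text from Step 6
normalized_sample = "yesterday i could not believe my eyes"

# Whitespace tokenization with the standard library only
tokens_by_split = normalized_sample.split()
print(tokens_by_split)
```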


## Step 8a: Stemming

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)

['yesterday', 'i', 'could', 'not', 'believ', 'my', 'eye', 'i', 'saw', 'a', 'man', 'carri', 'a', 'bunch', 'of', 'color', 'balloon', 'fli', 'over', 'the', 'citi', 'it', 'wa', 'like', 'someth', 'out', 'of', 'a', 'fairi', 'tale', 'later', 'i', 'read', 'in', 'the', 'news', 'that', 'he', 'is', 'a', 'scientist', 'who', 'is', 'experi', 'with', 'new', 'form', 'of', 'transport', 'hi', 'fli', 'experi', 'wa', 'part', 'of', 'a', 'larger', 'studi', 'i', 'hope', 'it', 'goe', 'well', 'it', 'is', 'such', 'a', 'uniqu', 'idea']


## Step 8b: Lemmatization

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(step6_text)

# Tokenization and Lemmatization
lemmatized_tokens = [token.lemma_ for token in doc]

# Printing lemmatized tokens
print(lemmatized_tokens)


['yesterday', 'I', 'could', 'not', 'believe', 'my', 'eye', 'I', 'see', 'a', 'man', 'carry', 'a', 'bunch', 'of', 'colorful', 'balloon', 'fly', 'over', 'the', 'city', 'it', 'be', 'like', 'something', 'out', 'of', 'a', 'fairy', 'tale', 'later', 'I', 'read', 'in', 'the', 'news', 'that', 'he', 'be', 'a', 'scientist', 'who', 'be', 'experiment', 'with', 'new', 'form', 'of', 'transportation', 'his', 'fly', 'experiment', 'be', 'part', 'of', 'a', 'large', 'study', 'I', 'hope', 'it', 'go', 'well', 'it', 'be', 'such', 'a', 'unique', 'idea']
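
To make the difference between Steps 8a and 8b concrete, the sketch below lines up a few tokens with the Porter stem and the spaCy lemma copied from the two outputs above. It only compares the printed results and does not rerun either library.

```python
# (token, Porter stem, spaCy lemma) triples copied from the outputs above
rows = [
    ("believe", "believ", "believe"),
    ("eyes", "eye", "eye"),
    ("carrying", "carri", "carry"),
    ("flying", "fli", "fly"),
    ("city", "citi", "city"),
    ("was", "wa", "be"),
    ("larger", "larger", "large"),
]

# Keep only the tokens where the two methods disagree
differing = [(token, stem, lemma) for token, stem, lemma in rows if stem != lemma]

for token, stem, lemma in differing:
    print(f"{token:<10} stem={stem:<8} lemma={lemma}")
```

The stems are truncated forms that are not always dictionary words ("carri", "citi"), while the lemmas are valid base forms ("carry", "city", "be").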


## Step 9: Removal of Stopwords

In [None]:
# Generate a list of lemmatized tokens that are not stopwords and not punctuation
filtered_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Print the non-stopwords tokens
print(filtered_tokens)

['yesterday', 'believe', 'eye', 'see', 'man', 'carry', 'bunch', 'colorful', 'balloon', 'fly', 'city', 'like', 'fairy', 'tale', 'later', 'read', 'news', 'scientist', 'experiment', 'new', 'form', 'transportation', 'fly', 'experiment', 'large', 'study', 'hope', 'go', 'unique', 'idea']


# Analyzing and Modeling

## Bag-of-Words
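
A bag-of-words representation keeps only how often each token occurs and discards word order. As a minimal sketch, assuming the filtered tokens printed in Step 9 and using only the standard library, the counts can be built with `collections.Counter`. The variable names are chosen here for illustration.

```python
from collections import Counter

# Filtered tokens copied from the Step 9 output
filtered_tokens = [
    'yesterday', 'believe', 'eye', 'see', 'man', 'carry', 'bunch', 'colorful',
    'balloon', 'fly', 'city', 'like', 'fairy', 'tale', 'later', 'read', 'news',
    'scientist', 'experiment', 'new', 'form', 'transportation', 'fly',
    'experiment', 'large', 'study', 'hope', 'go', 'unique', 'idea',
]

# Bag-of-words: token -> number of occurrences
bag_of_words = Counter(filtered_tokens)

# Fixed vocabulary order gives a count vector for the document
vocabulary = sorted(bag_of_words)
count_vector = [bag_of_words[word] for word in vocabulary]

print(bag_of_words.most_common(2))
print(count_vector)
```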

## Word2Vec

In [None]:
!pip install gensim



In [None]:
from gensim.models import Word2Vec

# Word2Vec expects a corpus given as a list of tokenized sentences, such as:
# [['sentence', 'one'], ['sentence', 'two'], ...]
# Training normally requires a large and varied corpus.

# For demonstration purposes only, we build a one-sentence corpus from the filtered tokens.
# This is far too small to learn meaningful embeddings.
sentences = [filtered_tokens]

# Initialize and train the Word2Vec model
model = Word2Vec(sentences=sentences, vector_size=5, window=5, min_count=1, workers=2)

# Extract the Word2Vec embeddings for each word
word_vectors = {token: model.wv[token] for token in filtered_tokens}

# The word_vectors dictionary now contains the Word2Vec embeddings for the words
print(word_vectors)

{'yesterday': array([0.05573254, 0.13782153, 0.12202191, 0.19076994, 0.18546835],
      dtype=float32), 'believe': array([0.14623424, 0.10146036, 0.13513738, 0.01522945, 0.127011  ],
      dtype=float32), 'eye': array([-0.06818472, -0.0188273 ,  0.11536448, -0.15047473, -0.0786405 ],
      dtype=float32), 'see': array([-0.15021135, -0.01854374,  0.19075541, -0.14642999, -0.04669122],
      dtype=float32), 'man': array([-0.03875482,  0.16154873, -0.11861791,  0.00090325, -0.09507468],
      dtype=float32), 'carry': array([-0.192071  ,  0.10014586, -0.17519172, -0.0878365 , -0.000702  ],
      dtype=float32), 'bunch': array([-0.00592363, -0.1532248 ,  0.19229487,  0.09964116,  0.18466286],
      dtype=float32), 'colorful': array([-0.16319434,  0.08999249, -0.08272365,  0.01644203,  0.17000751],
      dtype=float32), 'balloon': array([-0.08924353,  0.09035001, -0.1357392 , -0.07096978,  0.18797016],
      dtype=float32), 'fly': array([-0.1423496 ,  0.12924455,  0.1794374 , -0.10034543, -0
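
Embeddings are usually compared with cosine similarity. The sketch below computes it in plain Python for the two complete vectors printed above ('yesterday' and 'believe'). Since the model was trained on a single sentence, the resulting number only illustrates the calculation and carries no semantic meaning. With a trained gensim model, `model.wv.similarity(w1, w2)` offers the same comparison.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors copied from the Word2Vec output above
yesterday = [0.05573254, 0.13782153, 0.12202191, 0.19076994, 0.18546835]
believe = [0.14623424, 0.10146036, 0.13513738, 0.01522945, 0.127011]

similarity = cosine_similarity(yesterday, believe)
print(round(similarity, 3))
```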

## GloVe

In [None]:
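import spacy

# Load a spaCy model that ships with pretrained static word vectors
# (GloVe vectors in spaCy 2.x; the vector source may differ in other releases, see the model card).
# The model must be downloaded first: !python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')  # or 'en_core_web_lg' for a larger vector table

# Access vectors for each word using the loaded model
glove_vectors = {token: nlp.vocab[token].vector for token in filtered_tokens}

print(glove_vectors)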

{'yesterday': array([-5.3686e-01, -6.7472e-01,  7.1475e-01, -4.2046e-01, -1.6463e-01,
       -3.1780e+00,  2.3652e-01,  2.2072e+00,  7.7180e-01,  2.4232e+00,
       -9.7736e-01, -9.2642e-02, -1.3278e+00, -9.8747e-01, -7.6141e-01,
        6.5162e-01,  1.3443e+00, -1.8304e+00, -9.6480e-01,  7.0550e-01,
        1.8747e+00, -2.0091e+00,  1.1968e+00, -1.5860e+00, -1.5587e+00,
        1.0639e+00, -3.3912e-02, -1.2193e+00,  7.0644e-01, -6.6466e-01,
        1.6788e+00, -3.7597e-01, -1.4935e+00, -8.0441e-01,  1.1311e+00,
       -3.5550e+00, -1.9938e+00, -7.4687e-01, -3.3976e-01, -3.3396e-01,
       -2.2661e+00,  1.7473e+00,  4.6772e-01, -3.0244e+00, -3.7955e-01,
       -2.1847e+00, -2.1376e-01, -2.6220e+00, -1.2467e+00,  2.1031e+00,
        1.0414e+00, -8.5794e-01, -4.9357e-01, -8.8624e-01,  9.5925e-01,
        1.7096e+00,  1.1844e+00,  6.6893e-01, -4.5962e-01,  1.3284e+00,
        8.9461e-01,  9.3940e-01,  4.8283e-01,  1.6327e+00, -1.0626e+00,
        4.1424e-01, -3.2859e+00,  2.5070e+00, -2.9

## FastText

In [None]:
from gensim.models.fasttext import load_facebook_model

# Make sure to update this path to the location of your FastText model
model_path = '/path/to/fasttext/model.bin'

# Load the FastText model (Facebook binary format).
# load_facebook_model replaces the deprecated FastText.load_fasttext_format.
model = load_facebook_model(model_path)

# Get the FastText vector for each word
fasttext_vectors = {token: model.wv[token] for token in filtered_tokens}
print(fasttext_vectors)

## Sentence Transformers

In [None]:
!pip install -U sentence-transformers



In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sentences we want to encode. Example:
sentence = ["Yesterday, I couldn't believe my eyes: I saw a man carrying a bunch of colorful balloons flying over the city! It was like something out of a fairy tale. Later, I read in the news that he's a scientist who's experimenting with new forms of transportation. His 'flying' experiment was part of a larger study. I hope it goes well; it's such a unique idea! "]

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
print(embedding)

[[-1.21276259e-01  2.79691387e-02 -1.28672466e-01  3.96319432e-03
   2.32903883e-01 -7.06531525e-01  4.79212701e-01 -6.05591722e-02
  -3.55121613e-01 -3.07407856e-01  1.72225192e-01  7.88966790e-02
  -1.55326262e-01  9.43226814e-02 -1.58958107e-01  5.48852198e-02
  -3.15553784e-01 -2.83665687e-01 -4.31371570e-01  2.20421165e-01
   9.78308916e-03 -2.07050387e-02  6.69788197e-02  8.67410675e-02
  -2.86514256e-02  4.13699210e-01  1.44136190e-01 -1.91965550e-01
  -2.95976847e-01 -1.54506102e-01  2.42535263e-01  4.34511788e-02
  -2.47367308e-01  6.29284680e-02 -1.32736592e-02 -1.41074201e-02
  -5.85875958e-02 -8.18783492e-02  2.57328868e-01 -1.86339095e-01
   2.62883306e-01 -1.34742156e-01 -1.30251087e-02  1.76957890e-01
  -9.23755318e-02  1.95815951e-01  4.29639630e-02  4.78094295e-02
   2.80338079e-02  1.50066003e-01 -1.71438262e-01 -4.49072987e-01
  -1.61810592e-01 -5.01348019e-01  8.12960416e-02  1.21933058e-01
   3.91381145e-01 -4.61825170e-02  8.87053087e-02 -3.65146935e-01
  -3.83876