<a href="https://colab.research.google.com/github/diascarolina/i2a2/blob/main/exercise01/problem01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Converting Texts to Matrices

- **Name**: Carolina Araujo Dias

Our task is to convert the text below into a matrix, using any method.

The text has four paragraphs, one per line, which this notebook refers to as sentences. We'll load each one into a row of a pandas DataFrame and count how many times each word appears in it.
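As a rough illustration of the idea, here is a minimal sketch using only the standard library. The two toy rows are hypothetical and stand in for the real text; the notebook itself builds the matrix with pandas.

```python
from collections import Counter

# Hypothetical toy rows, already lowercased and free of punctuation
toy_rows = [
    "nunc eget nunc",
    "eget felis",
]

# One Counter per row holds the whole-word frequencies of that row
row_counts = [Counter(row.split()) for row in toy_rows]

# A sorted vocabulary gives a stable column order
vocabulary = sorted(set().union(*row_counts))

# One matrix row per text row, one column per vocabulary word
matrix = [[counts[word] for word in vocabulary] for counts in row_counts]

print(vocabulary)
print(matrix)
```

Each inner list is one row of the matrix, and a missing word simply counts as 0 because `Counter` returns 0 for unknown keys.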

In [1]:
input_text = """
Fusce risus ex, posuere at ante at, condimentum vestibulum nunc. Proin dapibus egestas neque, a tempor odio pharetra eget. Nullam tempus felis eu consectetur tincidunt. Integer fermentum eu quam vitae tempus. Vivamus volutpat ut dui vitae sollicitudin. Donec ornare dolor a vestibulum pharetra. Mauris eget dui sapien. Mauris lobortis feugiat neque, nec congue leo viverra semper. In vel mauris nunc. Aenean scelerisque arcu a varius tempor. Proin quis nisi et mi dictum tempor. Pellentesque mattis risus metus, id luctus orci ultrices ac. Praesent bibendum lectus a nisl mollis, eu suscipit eros pretium.
Phasellus vitae elit efficitur, varius felis id, vulputate neque. Praesent dictum nunc in velit lobortis tempus. Duis pulvinar ut urna eget volutpat. In hac habitasse platea dictumst. Aenean eros libero, ultrices eu dolor id, ultrices bibendum mauris. Nunc quis nunc feugiat, dapibus sem eget, pretium orci. Mauris mauris felis, pulvinar nec dapibus in, laoreet ac nibh. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. In sed scelerisque massa. Morbi sit amet ex erat. Duis sit amet laoreet ipsum. Nullam rutrum dapibus metus id sollicitudin. Sed lorem metus, maximus id maximus ac, vehicula vitae neque.
Curabitur vestibulum lorem diam, nec dictum lectus sodales id. Morbi at dui at ex fermentum porttitor sit amet et ipsum. Nullam et nibh vitae arcu porta pretium. Aenean ultricies bibendum nibh vitae mattis. Donec nec dignissim lorem. Vivamus ut lectus nec sem suscipit gravida. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Proin nec metus at sapien tempor ultrices. Donec sed orci ante. Quisque finibus maximus justo, sit amet dignissim justo ornare ut. Morbi imperdiet sagittis tristique. Nunc vestibulum velit eget neque ultrices, sit amet egestas elit eleifend.
Curabitur facilisis elementum ligula vitae pellentesque. Nulla facilisi. Nulla vel tristique arcu. Sed dictum nisl ac laoreet venenatis. Proin eros est, suscipit id sodales eu, semper maximus eros. Sed maximus id nulla vel auctor. Nullam libero odio, auctor eget felis et, viverra gravida nisl. Vivamus sed luctus eros, nec mattis leo. Phasellus eget accumsan justo. Proin nisl sem, luctus ut gravida consequat, congue sit amet arcu.
"""

# Text Preprocessing

In [2]:
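import re
import string

def clean_text(text: str) -> str:
    """
    Clean the given text by lowercasing it and removing
    bracketed spans, tokens containing digits, and punctuation.
    """
    cleaned_text = text.lower()
    # Remove anything enclosed in square brackets
    cleaned_text = re.sub(r'\[.*?\]', '', cleaned_text)
    # Remove tokens that contain a digit
    cleaned_text = re.sub(r'\w*\d\w*', '', cleaned_text)
    # Remove all ASCII punctuation characters
    cleaned_text = re.sub('[%s]' % re.escape(string.punctuation), '', cleaned_text)
    return cleaned_text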
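To see what each substitution in `clean_text` contributes, the sketch below applies the same steps one at a time to a short, hypothetical sample string. The variable names `step1` to `step4` are illustrative and not part of the notebook.

```python
import re
import string

# Hypothetical input with a bracketed span, a token with a digit, and punctuation
sample = "Fusce risus [note], ex2 posuere!"

step1 = sample.lower()                        # lowercase everything
step2 = re.sub(r'\[.*?\]', '', step1)         # drop [bracketed] spans
step3 = re.sub(r'\w*\d\w*', '', step2)        # drop tokens that contain a digit
step4 = re.sub('[%s]' % re.escape(string.punctuation), '', step3)  # drop punctuation

# Removals can leave runs of spaces behind; split() with no argument collapses them
tokens = step4.split()
print(tokens)
```

The leftover spaces are harmless here because the later cells tokenize with `split()`, which ignores repeated whitespace.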

# Solution 01

In [3]:
import pandas as pd
import plotly.express as px
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import string
import re

In [4]:
cleaned_text = clean_text(input_text)
cleaned_text

'\nfusce risus ex posuere at ante at condimentum vestibulum nunc proin dapibus egestas neque a tempor odio pharetra eget nullam tempus felis eu consectetur tincidunt integer fermentum eu quam vitae tempus vivamus volutpat ut dui vitae sollicitudin donec ornare dolor a vestibulum pharetra mauris eget dui sapien mauris lobortis feugiat neque nec congue leo viverra semper in vel mauris nunc aenean scelerisque arcu a varius tempor proin quis nisi et mi dictum tempor pellentesque mattis risus metus id luctus orci ultrices ac praesent bibendum lectus a nisl mollis eu suscipit eros pretium\nphasellus vitae elit efficitur varius felis id vulputate neque praesent dictum nunc in velit lobortis tempus duis pulvinar ut urna eget volutpat in hac habitasse platea dictumst aenean eros libero ultrices eu dolor id ultrices bibendum mauris nunc quis nunc feugiat dapibus sem eget pretium orci mauris mauris felis pulvinar nec dapibus in laoreet ac nibh pellentesque habitant morbi tristique senectus et net

The cleaned text now has no punctuation and no capitalization; only the line breaks remain. Let's turn it into a pandas DataFrame, where each row is a sentence and, eventually, each column is a word.

In [5]:
cleaned_text = cleaned_text.split('\n')
cleaned_text

['',
 'fusce risus ex posuere at ante at condimentum vestibulum nunc proin dapibus egestas neque a tempor odio pharetra eget nullam tempus felis eu consectetur tincidunt integer fermentum eu quam vitae tempus vivamus volutpat ut dui vitae sollicitudin donec ornare dolor a vestibulum pharetra mauris eget dui sapien mauris lobortis feugiat neque nec congue leo viverra semper in vel mauris nunc aenean scelerisque arcu a varius tempor proin quis nisi et mi dictum tempor pellentesque mattis risus metus id luctus orci ultrices ac praesent bibendum lectus a nisl mollis eu suscipit eros pretium',
 'phasellus vitae elit efficitur varius felis id vulputate neque praesent dictum nunc in velit lobortis tempus duis pulvinar ut urna eget volutpat in hac habitasse platea dictumst aenean eros libero ultrices eu dolor id ultrices bibendum mauris nunc quis nunc feugiat dapibus sem eget pretium orci mauris mauris felis pulvinar nec dapibus in laoreet ac nibh pellentesque habitant morbi tristique senectus

In [6]:
word_dict = {}
i = 0

for sentence in cleaned_text:
    if sentence != '':
        word_dict[i] = sentence
        i = i + 1
word_dict

{0: 'fusce risus ex posuere at ante at condimentum vestibulum nunc proin dapibus egestas neque a tempor odio pharetra eget nullam tempus felis eu consectetur tincidunt integer fermentum eu quam vitae tempus vivamus volutpat ut dui vitae sollicitudin donec ornare dolor a vestibulum pharetra mauris eget dui sapien mauris lobortis feugiat neque nec congue leo viverra semper in vel mauris nunc aenean scelerisque arcu a varius tempor proin quis nisi et mi dictum tempor pellentesque mattis risus metus id luctus orci ultrices ac praesent bibendum lectus a nisl mollis eu suscipit eros pretium',
 1: 'phasellus vitae elit efficitur varius felis id vulputate neque praesent dictum nunc in velit lobortis tempus duis pulvinar ut urna eget volutpat in hac habitasse platea dictumst aenean eros libero ultrices eu dolor id ultrices bibendum mauris nunc quis nunc feugiat dapibus sem eget pretium orci mauris mauris felis pulvinar nec dapibus in laoreet ac nibh pellentesque habitant morbi tristique senectu
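The split-and-number steps in the two cells above can also be written in one expression. This is only a compact sketch on a hypothetical toy string, with `toy_text` and `toy_dict` as made-up names:

```python
# Hypothetical cleaned text with leading and trailing line breaks
toy_text = "\nfirst row of words\nsecond row of words\n"

# Keep the non-empty lines and number them from 0
toy_dict = dict(enumerate(line for line in toy_text.split('\n') if line))
print(toy_dict)
```

Filtering before `enumerate` keeps the keys consecutive, which matches what the manual counter `i` does.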

In [7]:
df_words = pd.DataFrame.from_dict(word_dict,
                                  orient = 'index',
                                  columns = ['sentence'])
df_words

| | sentence |
|---|---|
| 0 | fusce risus ex posuere at ante at condimentum ... |
| 1 | phasellus vitae elit efficitur varius felis id... |
| 2 | curabitur vestibulum lorem diam nec dictum lec... |
| 3 | curabitur facilisis elementum ligula vitae pel... |


Now that we have this DataFrame, we'll collect the unique words and use them as the columns of a new DataFrame, so that each cell shows how many times that word appears in each of the four sentences.

For this we can use the set data structure and convert it into a list.

In [10]:
unique_words_list = []
# Gather the words of every sentence into one list
for sentence in df_words['sentence']:
    unique_words_list.extend(sentence.split())
# A set drops the duplicates; convert back to a list so it can be sorted later
unique_words_list = list(set(unique_words_list))
print(f'We have {len(unique_words_list)} unique words.')

We have 142 unique words.


Now let's go ahead and construct the DataFrame.

In [12]:
unique_words_list.sort()
rows = []
for sentence in df_words['sentence']:
    # str.count counts substring matches, so 'a' also matches inside 'ante'
    rows.append([sentence.count(word) for word in unique_words_list])
# Build the frame in one step; DataFrame.append was removed in pandas 2.0
df_final = pd.DataFrame(rows, columns=unique_words_list)
df_final

Unnamed: 0,a,ac,accumsan,ad,aenean,amet,ante,aptent,arcu,at,auctor,bibendum,class,condimentum,congue,consectetur,consequat,conubia,curabitur,dapibus,diam,dictum,dictumst,dignissim,dolor,donec,dui,duis,efficitur,egestas,eget,eleifend,elementum,elit,erat,eros,est,et,eu,ex,...,pulvinar,quam,quis,quisque,risus,rutrum,sagittis,sapien,scelerisque,sed,sem,semper,senectus,sit,sociosqu,sodales,sollicitudin,suscipit,taciti,tempor,tempus,tincidunt,torquent,tristique,turpis,ultrices,ultricies,urna,ut,varius,vehicula,vel,velit,venenatis,vestibulum,vitae,vivamus,viverra,volutpat,vulputate
0,33,1,0,0,1,0,1,0,1,5,0,1,0,1,1,1,0,0,0,1,0,1,0,0,1,1,2,0,0,1,2,0,0,0,0,1,3,8,4,1,...,0,1,1,0,2,0,0,1,1,0,1,1,0,0,0,0,1,1,0,3,2,1,0,0,0,1,0,0,2,1,0,1,0,0,2,2,1,1,1,0
1,45,4,0,1,1,2,0,0,0,5,0,1,0,0,0,0,0,0,0,3,0,2,1,0,1,0,2,2,1,1,2,0,0,2,1,1,1,12,2,1,...,2,0,1,0,0,1,0,0,1,2,1,0,1,2,0,0,1,0,0,0,1,0,0,1,1,2,0,1,4,1,1,1,1,0,0,2,0,0,1,1
2,34,1,0,1,1,3,1,1,1,4,0,1,1,0,0,0,0,1,1,0,1,1,0,2,0,2,1,0,0,1,1,1,0,2,0,0,3,9,0,1,...,0,0,1,1,0,0,1,1,0,1,1,0,0,3,1,1,0,1,1,1,0,0,1,1,0,2,1,0,2,0,0,1,1,0,2,2,1,0,0,0
3,31,4,1,0,0,1,0,0,2,3,2,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,2,0,1,0,0,3,1,5,1,0,...,0,0,0,0,0,0,0,0,0,3,2,1,0,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,2,0,1,0,1,1,1,0,0
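The large values in the `a` column come from how the counting cell works: `str.count` on the whole sentence string counts substring matches, not whole words. The sketch below contrasts the two on a hypothetical three-word row. Counting on the token list is shown as one possible alternative, not as the notebook's method.

```python
# Hypothetical row: "a" occurs once as a word but three times as a character
row = "ante at a"

substring_hits = row.count("a")       # matches inside "ante" and "at" as well
token_hits = row.split().count("a")   # matches only the standalone word "a"

print(substring_hits, token_hits)
```

Swapping in the token-based count would change the numbers in the table above, so the outputs shown in this notebook correspond to the substring version.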


# Check Text

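We can sort the transposed table to see which words are most frequent. The sort orders words by their count in sentence 0 first and uses sentences 1, 2, and 3 as tie-breakers.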

In [13]:
df_final.T.sort_values(by = [0, 1, 2, 3], ascending = False)

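| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| a | 33 | 45 | 34 | 31 |
| et | 8 | 12 | 9 | 5 |
| in | 6 | 7 | 3 | 2 |
| at | 5 | 5 | 4 | 3 |
| eu | 4 | 2 | 0 | 1 |
| ... | ... | ... | ... | ... |
| consequat | 0 | 0 | 0 | 1 |
| elementum | 0 | 0 | 0 | 1 |
| facilisis | 0 | 0 | 0 | 1 |
| ligula | 0 | 0 | 0 | 1 |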
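Sorting by several columns in descending order behaves like sorting tuples of counts: the first column decides, and later columns only break ties. A minimal standard-library sketch of that ordering, using hypothetical words and counts:

```python
# Hypothetical mapping: word -> (count in row 0, count in row 1)
counts = {
    "a": (4, 0),
    "et": (1, 3),
    "eu": (4, 2),
}

# Tuples compare element by element, so row 0 is the primary key
# and row 1 only matters when the row 0 counts are equal
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for word, per_row in ranked:
    print(word, per_row)
```

Here "eu" is listed ahead of "a" only because of the tie-breaker, and "et" comes last despite having the highest count in the second row.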


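From this table, "a" and "et" appear to be the most frequent words in the sentences. Note, however, that `str.count` matches substrings, so these totals also include occurrences inside longer words (for example, the "a" in "ante") and overstate whole-word frequencies.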
