# Example usage

This package, `text_processing_util_mds24`, includes four functions for processing and representing text data for machine learning tasks, specifically natural language processing. Three of them build a text representation from a list of raw-text documents: `frequency_vectorizer`, `tfidf_vectorizer` and `tokenizer_padding`. If users wish to represent text in another way, `text_clean` helps by converting all characters to lower case, removing all punctuation and numbers, and splitting each document into a list of words. Examples of how to use these functions are documented on this page.

## Imports

In [None]:
from text_processing_util_mds24 import text_clean, frequency_vectorizer, tfidf_vectorizer, tokenizer_padding

## Creating Text Documents

We will first create a sample list of documents using the first paragraph of _On the Origin of Species_ by Charles Darwin. (Note that this book is in the public domain.) The paragraph is stored in the file `origin_of_species.txt`. Here, each sentence in the paragraph is an individual document.

In [None]:
with open("origin_of_species.txt", encoding="utf-8") as text_data_file:
    origin_of_species = [line.rstrip() for line in text_data_file]

origin_of_species

## Cleaning the Text

`text_clean()` cleans raw text for further text processing. This function converts all characters to lower case, removes punctuation and numbers, and splits each document into words on spaces. All other functions in this package call `text_clean()` before transforming the text into other representations, and therefore accept raw text as input. The user can also use this function to clean text before feeding it to another algorithm of their choice.
The usage of this function is demonstrated below.

In [None]:
cleaned_txt = text_clean(origin_of_species)
for c in cleaned_txt:
    print(str(c) + '\n')
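To make the cleaning steps described above concrete, here is a minimal standard-library sketch of the same idea. It is not the package's implementation: the name `clean_sketch` is hypothetical, and it assumes that "punctuation" means ASCII punctuation and that words are split on whitespace.

```python
import string


def clean_sketch(documents):
    """Lowercase each document, drop ASCII punctuation and digits, split into words.

    Hypothetical helper for illustration only; text_clean() may differ in detail.
    """
    # Translation table that deletes every ASCII punctuation mark and digit.
    drop_table = str.maketrans("", "", string.punctuation + string.digits)
    return [doc.lower().translate(drop_table).split() for doc in documents]


sample = ["When on board H.M.S. 'Beagle', in 1831!"]
print(clean_sketch(sample))
# [['when', 'on', 'board', 'hms', 'beagle', 'in']]
```

Note that removing punctuation this way merges abbreviations such as "H.M.S." into a single token.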

In addition to cleaning the text, the package provides three different text representations to be used for machine learning models: frequency vectorizer, TF-IDF vectorizer and tokenizer plus padding.

## Text Representation 1: Frequency Vectorizer

The `frequency_vectorizer` function calculates the frequency of each word in each document of a list of text documents. It is useful for transforming text data into a feature matrix (a word frequency matrix) to be used for machine learning.

The usage of this function is demonstrated below.


In [None]:
tf_matrix, tf_feature_names = frequency_vectorizer(origin_of_species)

print("Frequency Matrix:")
print(tf_matrix)
print("\nFeature Names:")
print(tf_feature_names)
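As a rough illustration of what a word frequency matrix looks like, the sketch below builds one by hand from already-tokenized documents. The function name `frequency_sketch` is hypothetical, and the choices made here (raw counts, alphabetically sorted columns, nested lists) are assumptions; the package's output may be organized differently.

```python
from collections import Counter


def frequency_sketch(tokenized_docs):
    """Build a document-by-word count matrix (hypothetical illustration)."""
    # A sorted vocabulary gives every word a fixed column position.
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocab])
    return matrix, vocab


docs = [["the", "cat", "sat"], ["the", "cat", "saw", "the", "dog"]]
matrix, vocab = frequency_sketch(docs)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row corresponds to one document and each column to one word in the vocabulary.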

## Text Representation 2: TF-IDF Vectorizer

The `tfidf_vectorizer` function computes the Term Frequency-Inverse Document Frequency (TF-IDF) scores for a given list of documents, providing a numerical representation that highlights the importance of terms within the context of the entire document set. This function is useful for transforming text data into a feature matrix, capturing the significance of terms while considering their frequency and uniqueness across the document collection.

The usage of this function is demonstrated below.

In [None]:
text_tfidf_vectorized, feature_names = tfidf_vectorizer(origin_of_species)

print("Vectorized documents:")
for vectorized_doc in text_tfidf_vectorized:
    print(vectorized_doc)

print("\nFeature names:")
print(feature_names)
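For readers who want to see how a TF-IDF score is put together, here is a small sketch using one common textbook definition: term frequency is the word's count divided by the document length, and inverse document frequency is `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. This is an assumption made for illustration; `tfidf_vectorizer` may use a different variant (for example, smoothing or normalization), so its numbers need not match. The name `tfidf_sketch` is hypothetical.

```python
import math
from collections import Counter


def tfidf_sketch(tokenized_docs):
    """Compute textbook TF-IDF scores (hypothetical illustration)."""
    n_docs = len(tokenized_docs)
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero for empty documents
        row = [
            (counts[word] / length) * math.log(n_docs / doc_freq[word])
            for word in vocab
        ]
        matrix.append(row)
    return matrix, vocab


docs = [["a", "b"], ["a", "c"]]
scores, vocab = tfidf_sketch(docs)
for row in scores:
    print([round(value, 3) for value in row])
```

Under this definition, a word that appears in every document (here `"a"`) scores zero, while words unique to one document receive the highest weights.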

## Text Representation 3: Tokenizer and Padding

If you would like to feed the data to recurrent neural networks (RNNs), you can transform your text with `tokenizer_padding`. This function converts each word into an individual token represented by a number while keeping the word order of the original sentence, which is important for RNNs. It also pads shorter sequences with zeros at the end, because deep learning libraries generally do not accept sequences of uneven lengths.

The usage of this function is demonstrated below.

In [None]:
text_tokenized_padded = tokenizer_padding(origin_of_species)
text_tokenized_padded
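The tokenize-then-pad idea described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's code: the name `tokenize_pad_sketch` is hypothetical, token ids are assigned in order of first appearance, and `0` is reserved for padding. The package may number words differently.

```python
def tokenize_pad_sketch(tokenized_docs):
    """Map words to integer ids and right-pad sequences with zeros (hypothetical)."""
    word_to_id = {}
    sequences = []
    for doc in tokenized_docs:
        sequence = []
        for word in doc:
            if word not in word_to_id:
                # Ids start at 1 because 0 is reserved for padding.
                word_to_id[word] = len(word_to_id) + 1
            sequence.append(word_to_id[word])
        sequences.append(sequence)
    # Pad every sequence at the end up to the length of the longest one.
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, word_to_id


docs = [["the", "cat", "sat"], ["the", "dog"]]
padded, word_to_id = tokenize_pad_sketch(docs)
print(padded)      # [[1, 2, 3], [1, 4, 0]]
print(word_to_id)  # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
```

The same word always maps to the same id, and the order of ids within each row follows the order of words in the document.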