# Example usage

This package, `text_processing_util_mds24`, includes four functions for processing and representing text data for machine learning tasks, specifically natural language processing. Three of them build text representations from a list of raw-text documents: `frequency_vectorizer`, `tfidf_vectorizer` and `tokenizer_padding`. For users who want to represent text in another way, `text_clean` handles the preprocessing by converting all characters to lower case, removing punctuation and numbers, and splitting each document into a list of words. This page shows how to use each function.

## Imports

In [1]:
from text_processing_util_mds24 import text_clean, frequency_vectorizer, tfidf_vectorizer, tokenizer_padding

## Creating Text Documents

We will first create a sample list of documents using the first paragraph of _On the Origin of Species_ by Charles Darwin. Here, each sentence in the paragraph is an individual document.

In [2]:
origin_of_species = ["When on board H.M.S. Beagle, as naturalist, I was much struck with certain facts in the distribution " \
                     + "of the organic beings inhabiting South America, and in the geological relations of the present to the " \
                     + "past inhabitants of that continent.",
                     "These facts, as will be seen in the latter chapters of this volume, " \
                     + "seemed to throw some light on the origin of species—that mystery of mysteries, as it has been called by " \
                     + "one of our greatest philosophers.",
                     "On my return home, it occurred to me, in 1837, that something might " \
                     + "perhaps be made out on this question by patiently accumulating and reflecting on all sorts of facts which " \
                     + "could possibly have any bearing on it.",
                     "After five years’ work I allowed myself to speculate on the subject, " \
                     + "and drew up some short notes; these I enlarged in 1844 into a sketch of the conclusions, which then seemed " \
                     + "to me probable: from that period to the present day I have steadily pursued the same object.",
                     "I hope that I may be excused for entering on these personal details, as I give them to show that " \
                     + "I have not been hasty in coming to a decision."]

## Cleaning the Text

`text_clean()` cleans raw text for further processing. It converts all characters to lower case, removes punctuation and numbers, and splits each document into words on spaces. Every other function in this package calls `text_clean()` before transforming the text, so they all accept raw text as input. You can also call this function directly to clean text before passing it to another algorithm of your choice.
The usage of this function is demonstrated below.

In [3]:
cleaned_txt = text_clean(origin_of_species)
for c in cleaned_txt:
    print(str(c) + '\n')

['when', 'on', 'board', 'hms', 'beagle', 'as', 'naturalist', 'i', 'was', 'much', 'struck', 'with', 'certain', 'facts', 'in', 'the', 'distribution', 'of', 'the', 'organic', 'beings', 'inhabiting', 'south', 'america', 'and', 'in', 'the', 'geological', 'relations', 'of', 'the', 'present', 'to', 'the', 'past', 'inhabitants', 'of', 'that', 'continent']

['these', 'facts', 'as', 'will', 'be', 'seen', 'in', 'the', 'latter', 'chapters', 'of', 'this', 'volume', 'seemed', 'to', 'throw', 'some', 'light', 'on', 'the', 'origin', 'of', 'species—that', 'mystery', 'of', 'mysteries', 'as', 'it', 'has', 'been', 'called', 'by', 'one', 'of', 'our', 'greatest', 'philosophers']

['on', 'my', 'return', 'home', 'it', 'occurred', 'to', 'me', 'in', 'that', 'something', 'might', 'perhaps', 'be', 'made', 'out', 'on', 'this', 'question', 'by', 'patiently', 'accumulating', 'and', 'reflecting', 'on', 'all', 'sorts', 'of', 'facts', 'which', 'could', 'possibly', 'have', 'any', 'bearing', 'on', 'it']

['after', 'five',
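The cleaning steps described in this section can be approximated with the standard library alone. The following is a minimal sketch, not the package's implementation: `sketch_text_clean` is a hypothetical name, and it assumes that only ASCII punctuation (`string.punctuation`) and the digits 0-9 are stripped. That assumption would be consistent with the long dash surviving in the second document's output above.

```python
import string

# Hypothetical helper, not part of text_processing_util_mds24.
# Assumption: only ASCII punctuation and the digits 0-9 are removed.
_STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def sketch_text_clean(documents):
    """Lower-case each document, drop punctuation and digits, split on whitespace."""
    return [doc.lower().translate(_STRIP_TABLE).split() for doc in documents]


sample = ["On my return home, it occurred to me, in 1837, that something might be made out."]
print(sketch_text_clean(sample)[0])
```

Because the digits are deleted before splitting, a year such as `1837` leaves no empty token behind.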

In addition to cleaning the text, the package provides three different text representations to be used for machine learning models: frequency vectorizer, TF-IDF vectorizer and tokenizer plus padding.

## Text Representation 1: Frequency Vectorizer

`frequency_vectorizer` calculates the frequency of each word in each document of a list of text documents. It is useful for transforming text data into a word-frequency feature matrix with one row per document and one column per word, and it also returns the matching feature names.

The usage of this function is demonstrated below.


In [4]:
tf_matrix, tf_feature_names = frequency_vectorizer(origin_of_species)

print("Frequency Matrix:")
print(tf_matrix)
print("\nFeature Names:")
print(tf_feature_names)


Frequency Matrix:
[[0.         0.         0.         0.         0.         0.02564103
  0.02564103 0.         0.02564103 0.         0.02564103 0.
  0.         0.02564103 0.02564103 0.         0.         0.02564103
  0.         0.         0.         0.02564103 0.         0.
  0.         0.         0.02564103 0.         0.         0.
  0.         0.02564103 0.         0.         0.         0.02564103
  0.         0.         0.         0.         0.         0.02564103
  0.         0.         0.02564103 0.05128205 0.02564103 0.02564103
  0.         0.         0.         0.         0.         0.
  0.         0.         0.02564103 0.         0.         0.
  0.         0.02564103 0.         0.         0.         0.
  0.07692308 0.02564103 0.         0.02564103 0.         0.
  0.         0.02564103 0.         0.         0.         0.
  0.         0.         0.02564103 0.         0.         0.
  0.         0.02564103 0.         0.         0.         0.
  0.         0.         0.         0.     
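The nonzero entries above are consistent with each word's count divided by the number of words in its document: the first cleaned document has 39 words, and 1/39 ≈ 0.02564103, 2/39 ≈ 0.05128205 and 3/39 ≈ 0.07692308. A minimal sketch of that calculation is shown here. It is an illustration inferred from the output, not the package's source; `sketch_frequency` is a hypothetical name, and the alphabetically sorted columns are an assumption.

```python
from collections import Counter


def sketch_frequency(tokenized_docs):
    """Return (matrix, vocabulary); matrix[i][j] is the share of
    document i's words that are vocabulary[j]."""
    # Assumption: columns follow the alphabetically sorted vocabulary.
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        matrix.append([counts[word] / total for word in vocabulary])
    return matrix, vocabulary


docs = [["the", "origin", "of", "the", "species"], ["the", "facts"]]
matrix, vocab = sketch_frequency(docs)
print(vocab)
print(matrix[0])
```

The sketch takes already tokenized documents, so it would be applied to the output of a cleaning step rather than to raw text.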

## Text Representation 2: TF-IDF Vectorizer

The `tfidf_vectorizer` function computes the Term Frequency-Inverse Document Frequency (TF-IDF) scores for a given list of documents, providing a numerical representation that highlights the importance of terms within the context of the entire document set. This function is useful for transforming text data into a feature matrix, capturing the significance of terms while considering their frequency and uniqueness across the document collection.

The usage of this function is demonstrated below.

In [5]:
text_tfidf_vectorized, feature_names = tfidf_vectorizer(origin_of_species)

print("Vectorized documents:")
for vectorized_doc in text_tfidf_vectorized:
    print(vectorized_doc)

print("\nFeature names:")
print(feature_names)

Vectorized documents:
[ 0.          0.          0.          0.          0.          0.02349463
  0.00572163  0.          0.00572163  0.          0.02349463  0.
  0.          0.02349463  0.02349463  0.          0.          0.02349463
  0.          0.          0.          0.02349463  0.          0.
  0.          0.          0.02349463  0.          0.          0.
  0.          0.00572163  0.          0.          0.          0.02349463
  0.          0.          0.          0.          0.          0.02349463
  0.          0.          0.00572163 -0.00934982  0.02349463  0.02349463
  0.          0.          0.          0.          0.          0.
  0.          0.          0.02349463  0.          0.          0.
  0.          0.02349463  0.          0.          0.          0.
  0.         -0.00467491  0.          0.02349463  0.          0.
  0.          0.02349463  0.          0.          0.          0.
  0.          0.          0.01309809  0.          0.          0.
  0.          0.02349463  0.
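The printed scores, including the negative ones, are consistent with multiplying each relative frequency by ln(N / (1 + df)), where N is the number of documents (5 here) and df is the number of documents that contain the word. A word that occurs once in the 39-word first document and nowhere else scores (1/39) · ln(5/2) ≈ 0.0235, while a word that appears in all five documents gets ln(5/6) < 0 and therefore a negative score. The sketch below implements that formula as inferred from the output. It is an assumption rather than the package's source, and `sketch_tfidf` is a hypothetical name.

```python
import math
from collections import Counter


def sketch_tfidf(tokenized_docs):
    """Return (matrix, vocabulary) with matrix[i][j] = tf * ln(N / (1 + df))."""
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each word appears at least once.
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    # Assumed IDF variant; it turns negative when a word is in every document.
    idf = {word: math.log(n_docs / (1 + df[word])) for word in vocabulary}
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        matrix.append([counts[word] / total * idf[word] for word in vocabulary])
    return matrix, vocabulary


docs = [["a", "b"], ["a", "c"], ["a"], ["d"], ["a", "d"]]
matrix, vocab = sketch_tfidf(docs)
print(vocab)
print(matrix[0])
```

Under this variant a word found in all but one document scores exactly zero, which matches the zero entries in the output above for words that are frequent in the first document.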

## Text Representation 3: Tokenizer and Padding

If you would like to feed the data to recurrent neural networks (RNNs), you can transform your text with `tokenizer_padding`. This function converts each word into a token represented by a number while keeping the word order of the original sentence, which is important for RNNs. It also pads shorter sequences with zeros at the end, because deep learning libraries generally do not accept sequences of uneven length.

The usage of this function is demonstrated below.

In [6]:
text_tokenized_padded = tokenizer_padding(origin_of_species)
text_tokenized_padded

array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
         12.,  13.,  14.,  15.,  16.,  17.,  18.,  16.,  19.,  20.,  21.,
         22.,  23.,  24.,  15.,  16.,  25.,  26.,  18.,  16.,  27.,  28.,
         16.,  29.,  30.,  18.,  31.,  32.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.],
       [ 33.,  14.,   6.,  34.,  35.,  36.,  15.,  16.,  37.,  38.,  18.,
         39.,  40.,  41.,  28.,  42.,  43.,  44.,   2.,  16.,  45.,  18.,
         46.,  47.,  18.,  48.,   6.,  49.,  50.,  51.,  52.,  53.,  54.,
         18.,  55.,  56.,  57.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.],
       [  2.,  58.,  59.,  60.,  49.,  61.,  28.,  62.,  15.,  31.,  63.,
         64.,  65.,  35.,  66.,  67.,   2.,  39.,  68.,  53.,  69.,  70.,
         24.,  71.,   2.,  72.,  73.,  18.,  14.,  74.,  75.,  76.,  77.,
         78.,  79.,   2.,  49.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.],
       [ 80.,
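The output above suggests that token IDs are assigned in order of first appearance, starting at 1, and that every row is padded with zeros up to the length of the longest document. A minimal sketch of that scheme in plain Python follows. It is an illustration inferred from the output, not the package's implementation: `sketch_tokenize_pad` is a hypothetical name, and it returns nested lists of integers rather than the floating-point NumPy array shown above.

```python
def sketch_tokenize_pad(tokenized_docs):
    """Map words to integer IDs by first appearance and right-pad with zeros."""
    token_ids = {}
    sequences = []
    for doc in tokenized_docs:
        sequence = []
        for word in doc:
            if word not in token_ids:
                # IDs start at 1 so that 0 stays reserved for padding.
                token_ids[word] = len(token_ids) + 1
            sequence.append(token_ids[word])
        sequences.append(sequence)
    max_len = max((len(seq) for seq in sequences), default=0)
    # Pad at the end so the original word order is preserved.
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]


docs = [["on", "the", "origin"], ["the", "facts", "on", "the", "beagle"]]
for row in sketch_tokenize_pad(docs):
    print(row)
```

Repeated words reuse their ID, which is why the same numbers recur across rows in the package output.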