# A brief introduction to text corpus pre-processing for machine learning applications.
### By André Walsøe, Data scientist/Head Engineer Oslo University Library 2018
This is a brief introduction to text corpus pre-processing for machine learning applications produced for the Research Bazaar, "Hands-on Workshop: Exploring Research Data with Artificial Intelligence and Design Thinking" at UiO, January 11th 2019. 


Goals:
1. Understand basic techniques for removing noise from text data and how to apply these techniques to a corpus.
2. Understand techniques for improving data representation for machine learning applications, using lemmatization, stemming, stopword removal etc.
3. Understand the two main forms of feature extraction for text data: count vectorization and Tf-idf.
4. Be able to apply the techniques above to simple problems using Python.
![alt text](https://www.kdnuggets.com/wp-content/uploads/text-data-task-framework.png "Logo Title Text 1")
source: https://www.kdnuggets.com/2017/12/general-approach-preprocessing-text-data.html


# 1 Preprocess text

In [122]:
text = """Heraklous to2 Pompeius greeting and all good wishes for his health. As soon as the letter reaches you,\n
come immediately. Since Ap92eis died, Se@rapous never stops pestering me at home. Since she demands: \n
'Produce th#e+ memoranda', she stands the-re saying: 'Two memoranda'. I do not give them. Argyrios has utterly\n 
confir=med us, saying: 'You h23a21ve not paid the tax on sales'. If the letter reaches you, c54ome immediately. If not,\n
I shall leave the house. If you don't come, I shall arrange about the house (?) and come to y9ou in order that...\n"""

## 1.1 Lower text
All capital letters in the input text are converted to lowercase. This is done so that the computer treats
the words "Artificial" and "artificial" as the same word. This decreases the size of the vocabulary.
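The effect on vocabulary size can be seen in a minimal sketch. The sentence and the variable names here are made up for illustration and are not part of the workshop data:

```python
# Illustrative sentence (an assumption, not taken from the workshop corpus)
sentence = "Artificial intelligence and artificial neural networks"

# Naive whitespace split, used only to count distinct word forms
words = sentence.split()
vocab_before = set(words)
vocab_after = set(w.lower() for w in words)

# "Artificial" and "artificial" collapse into one entry after lowercasing
print(len(vocab_before), len(vocab_after))
```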

In [123]:
text = text.lower()
text

"heraklous to2 pompeius greeting and all good wishes for his health. as soon as the letter reaches you,\n\ncome immediately. since ap92eis died, se@rapous never stops pestering me at home. since she demands: \n\n'produce th#e+ memoranda', she stands the-re saying: 'two memoranda'. i do not give them. argyrios has utterly\n \nconfir=med us, saying: 'you h23a21ve not paid the tax on sales'. if the letter reaches you, c54ome immediately. if not,\n\ni shall leave the house. if you don't come, i shall arrange about the house (?) and come to y9ou in order that...\n"

## 1.2 Remove linebreak
Some texts contain the linebreak symbol "\n". This symbol only marks the end of one line and the start of the next. It carries no semantic information and is therefore removed (here it is replaced with a space, so that the words on either side are not joined).

In [124]:
text = text.replace("\n", " ")
text

"heraklous to2 pompeius greeting and all good wishes for his health. as soon as the letter reaches you,  come immediately. since ap92eis died, se@rapous never stops pestering me at home. since she demands:   'produce th#e+ memoranda', she stands the-re saying: 'two memoranda'. i do not give them. argyrios has utterly   confir=med us, saying: 'you h23a21ve not paid the tax on sales'. if the letter reaches you, c54ome immediately. if not,  i shall leave the house. if you don't come, i shall arrange about the house (?) and come to y9ou in order that... "

## 1.3 Remove all numbers
In this step we remove all the numbers. Whether to do this must be decided for each application: in some applications numbers carry significant information, in others they do not. In this example the numbers are removed because we consider them unimportant.
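For applications where numbers do matter, one possible alternative is to replace each standalone number with a placeholder token instead of deleting it. The sketch below is only an illustration of that idea: `replace_numbers` and the `NUM` placeholder are hypothetical names, and the rest of this notebook simply deletes digits.

```python
import re

def replace_numbers(text, placeholder="NUM"):
    # \b\d+\b matches only standalone runs of digits, so digits glued
    # to letters (for example "10th") are left untouched
    return re.sub(r"\b\d+\b", placeholder, text)

print(replace_numbers("total 12 dr. in the 10th year"))
```

The model then still sees that a number was present, without the vocabulary growing by one entry per distinct number.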

In [125]:
def remove_numbers(text):
    # Keep every character that is not a digit
    kept_chars = [x for x in text if not x.isdigit()]
    return ''.join(kept_chars)
text = remove_numbers(text)
text

"heraklous to pompeius greeting and all good wishes for his health. as soon as the letter reaches you,  come immediately. since apeis died, se@rapous never stops pestering me at home. since she demands:   'produce th#e+ memoranda', she stands the-re saying: 'two memoranda'. i do not give them. argyrios has utterly   confir=med us, saying: 'you have not paid the tax on sales'. if the letter reaches you, come immediately. if not,  i shall leave the house. if you don't come, i shall arrange about the house (?) and come to you in order that... "

## 1.4 Remove all non-alphabetical symbols
In this step all non-alphabetical symbols are removed using regular expressions (often shortened to regex). A regular expression is a pattern that describes a set of strings to search for. This step is also application-dependent, but in most applications these symbols (or at least a selection of them) are removed because they add noise to the data.

In [126]:
import re
def remove_non_alpha(text):
    # Raw string, so that \s reaches the regex engine unchanged
    regex = re.compile(r'[^a-zA-Z\s]')
    text = regex.sub('', text)
    return text
text = remove_non_alpha(text)
text

'heraklous to pompeius greeting and all good wishes for his health as soon as the letter reaches you  come immediately since apeis died serapous never stops pestering me at home since she demands   produce the memoranda she stands there saying two memoranda i do not give them argyrios has utterly   confirmed us saying you have not paid the tax on sales if the letter reaches you come immediately if not  i shall leave the house if you dont come i shall arrange about the house  and come to you in order that '
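The output above still contains runs of two or three spaces where linebreaks and symbols used to be. The tokenizer used later ignores these, but if a clean string is wanted they can be collapsed. A minimal sketch, where `collapse_whitespace` is a hypothetical helper and not part of the original notebook:

```python
import re

def collapse_whitespace(text):
    # Replace every run of whitespace with a single space and trim both ends
    return re.sub(r"\s+", " ", text).strip()

print(collapse_whitespace("since she demands   produce the memoranda "))
```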

## Some functions for statistics

In [127]:
import numpy as np
## Calculate average word length
def avg_word_len(tokenized_array):
    len_arr = [len(x) for x in tokenized_array]
    avg = np.average(len_arr)
    return avg
def vocab_size(tokenized_array):
    return len(set(tokenized_array))


## 1.5 Tokenize text
Tokenization is the process of splitting a string into words, sentences or other units, depending on the application.
Example using word tokenization:

1) Input string: "This is an example"

2) Output array: ["This", "is", "an", "example"]
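The example above can be reproduced without any library. This is a minimal sketch that treats every run of ASCII letters as a word; `simple_word_tokenize` is a hypothetical helper, and it is much cruder than a real tokenizer (it splits "don't" in two, for instance).

```python
import re

def simple_word_tokenize(text):
    # Every run of ASCII letters becomes one token; everything else is a separator
    return re.findall(r"[A-Za-z]+", text)

print(simple_word_tokenize("This is an example"))
```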

For this task we will use a platform called NLTK, Natural Language Toolkit.

Description from their website:

> NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use 
>  interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing 
>  libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for 
>  industrial-strength NLP libraries, and an active discussion forum.

For further information on how to use NLTK and also how to do Natural Language Processing in general, check this out:
http://www.nltk.org/book/

In [128]:
import nltk
def tokenize_text(text):
    tokenized_text = nltk.word_tokenize(text)
    return tokenized_text
tokenized_text = tokenize_text(text)
print("Average word_length: {}".format(avg_word_len(tokenized_text)))
print("Length of text: {}".format(len(tokenized_text)))
print("Vocabulary size: {}".format(vocab_size(tokenized_text)))
print(tokenized_text)

Average word_length: 4.456521739130435
Length of text: 92
Vocabulary size: 62
['heraklous', 'to', 'pompeius', 'greeting', 'and', 'all', 'good', 'wishes', 'for', 'his', 'health', 'as', 'soon', 'as', 'the', 'letter', 'reaches', 'you', 'come', 'immediately', 'since', 'apeis', 'died', 'serapous', 'never', 'stops', 'pestering', 'me', 'at', 'home', 'since', 'she', 'demands', 'produce', 'the', 'memoranda', 'she', 'stands', 'there', 'saying', 'two', 'memoranda', 'i', 'do', 'not', 'give', 'them', 'argyrios', 'has', 'utterly', 'confirmed', 'us', 'saying', 'you', 'have', 'not', 'paid', 'the', 'tax', 'on', 'sales', 'if', 'the', 'letter', 'reaches', 'you', 'come', 'immediately', 'if', 'not', 'i', 'shall', 'leave', 'the', 'house', 'if', 'you', 'dont', 'come', 'i', 'shall', 'arrange', 'about', 'the', 'house', 'and', 'come', 'to', 'you', 'in', 'order', 'that']


## 1.6 Remove stopwords
In this step we remove stopwords from the text. Stopwords are a selection of common words that add little to no information about the content of the text. The biggest advantage of removing them is that high-frequency terms carrying little semantic importance disappear. Another advantage is that it reduces the dimensionality of the data, which reduces computation. The disadvantage is that it can remove words, such as some verbs, that matter in certain settings. For example, in Shakespeare, the entire phrase "to be or not to be" would be removed.

In the section below we can see a typical selection of stopwords for the English language.

In [129]:
from nltk.corpus import stopwords
stopWords = stopwords.words('english')
stopWords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [130]:
def remove_stopwords(tokenized_text):
    tokenized_text = [x for x in tokenized_text if x not in stopWords]
    return tokenized_text
print("Average word_length before stopword removal: {}".format(avg_word_len(tokenized_text)))
print("Length of text before stopword removal: {}".format(len(tokenized_text)))
print("Vocabulary size before stopword removal: {}".format(vocab_size(tokenized_text)))
print("------------------------------------")
tokenized_text = remove_stopwords(tokenized_text)

print("Average word_length after stopword removal: {}".format(avg_word_len(tokenized_text)))
print("Length of text after stopword removal: {}".format(len(tokenized_text)))
print("Vocabulary size after stopword removal: {}".format(vocab_size(tokenized_text)))

print(tokenized_text)

Average word_length before stopword removal: 4.456521739130435
Length of text before stopword removal: 92
Vocabulary size before stopword removal: 62
------------------------------------
Average word_length after stopword removal: 5.9
Length of text after stopword removal: 50
Vocabulary size after stopword removal: 39
['heraklous', 'pompeius', 'greeting', 'good', 'wishes', 'health', 'soon', 'letter', 'reaches', 'come', 'immediately', 'since', 'apeis', 'died', 'serapous', 'never', 'stops', 'pestering', 'home', 'since', 'demands', 'produce', 'memoranda', 'stands', 'saying', 'two', 'memoranda', 'give', 'argyrios', 'utterly', 'confirmed', 'us', 'saying', 'paid', 'tax', 'sales', 'letter', 'reaches', 'come', 'immediately', 'shall', 'leave', 'house', 'dont', 'come', 'shall', 'arrange', 'house', 'come', 'order']
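The Shakespeare example mentioned at the start of this section can be checked with a few lines. The sketch uses a hand-picked subset of the English stopword list (an assumption made to keep it self-contained) instead of loading the full list from NLTK:

```python
# Hand-picked subset of English stopwords, enough for this one phrase
small_stopwords = {"to", "be", "or", "not"}

phrase = "to be or not to be".split()
filtered = [w for w in phrase if w not in small_stopwords]

# Every word in the phrase is a stopword, so nothing is left
print(filtered)
```

Storing the stopwords in a `set` rather than a list also makes each `in` lookup faster, which matters on larger corpora.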


## 1.7 Lemmatization
Simply put, lemmatization is the process of converting a word to its dictionary form.
Example from Wikipedia (https://en.wikipedia.org/wiki/Lemmatisation):
>For instance:
>1. The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
>2. The word "walk" is the base form for word "walking", and hence this is matched in both stemming and lemmatisation.
>3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation attempts to select the correct lemma depending on the context.


In [131]:
import nltk
from nltk.stem import WordNetLemmatizer

def lemmatize_text(tokenized_text):
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized_text = [wordnet_lemmatizer.lemmatize(x) for x in tokenized_text]
    return lemmatized_text

print("Average word_length before lemmatization: {}".format(avg_word_len(tokenized_text)))
print("Length of text before lemmatization: {}".format(len(tokenized_text)))
print("Vocabulary size before lemmatization: {}".format(vocab_size(tokenized_text)))

tokenized_text = lemmatize_text(tokenized_text)
print(tokenized_text)

print("Average word_length after lemmatization: {}".format(avg_word_len(tokenized_text)))
print("Length of text after lemmatization: {}".format(len(tokenized_text)))
print("Vocabulary size after lemmatization: {}".format(vocab_size(tokenized_text)))

Average word_length before lemmatization: 5.9
Length of text before lemmatization: 50
Vocabulary size before lemmatization: 39
['heraklous', 'pompeius', 'greeting', 'good', 'wish', 'health', 'soon', 'letter', 'reach', 'come', 'immediately', 'since', 'apeis', 'died', 'serapous', 'never', 'stop', 'pestering', 'home', 'since', 'demand', 'produce', 'memoranda', 'stand', 'saying', 'two', 'memoranda', 'give', 'argyrios', 'utterly', 'confirmed', 'u', 'saying', 'paid', 'tax', 'sale', 'letter', 'reach', 'come', 'immediately', 'shall', 'leave', 'house', 'dont', 'come', 'shall', 'arrange', 'house', 'come', 'order']
Average word_length after lemmatization: 5.68
Length of text after lemmatization: 50
Vocabulary size after lemmatization: 39
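The role of the dictionary look-up and of context can be illustrated with a toy example. The table below is hand-made for this sketch and is not how WordNet stores its data; `LEMMAS` and `toy_lemmatize` are hypothetical names.

```python
# Hand-made lookup table keyed by (word, part of speech); purely illustrative
LEMMAS = {
    ("better", "adj"): "good",
    ("walking", "verb"): "walk",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the table has no entry
    return LEMMAS.get((word, pos), word)

# The same surface form gets a different lemma depending on its part of speech
print(toy_lemmatize("meeting", "noun"), toy_lemmatize("meeting", "verb"))
```

This is also a likely explanation for why "died" and "saying" are unchanged in the output above: NLTK's `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to noun, so verb forms are only reduced when `pos='v'` is passed.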


## 1.8 Stemming
Stemming is the process of collapsing the morphological variants of a word together. For example, without stemming, the terms "process", "processing" and "processed" are treated as distinct items with separate term frequencies (Jurafsky). The biggest advantage of stemming is that it allows all morphological variants of a term to be matched. The disadvantage is that it throws away useful distinctions. For example, the words "stocks" and "stockings" are both reduced to the term "stock", although they are two entirely different words. Another example is "university" and "universe", which are both reduced to "univers".

In [132]:
import nltk
from nltk.stem.porter import PorterStemmer

def stem_text(tokenized_text):
    porter_stemmer = PorterStemmer()
    stemmed_text = [porter_stemmer.stem(x) for x in tokenized_text]
    return stemmed_text

print("Average word_length before stemming: {}".format(avg_word_len(tokenized_text)))
print("Length of text before stemming: {}".format(len(tokenized_text)))
print("Vocabulary size before stemming: {}".format(vocab_size(tokenized_text)))

tokenized_text = stem_text(tokenized_text)


print("Average word_length after stemming: {}".format(avg_word_len(tokenized_text)))
print("Length of text after stemming: {}".format(len(tokenized_text)))
print("Vocabulary size after stemming: {}".format(vocab_size(tokenized_text)))
tokenized_text

Average word_length before stemming: 5.68
Length of text before stemming: 50
Vocabulary size before stemming: 39
Average word_length after stemming: 4.86
Length of text after stemming: 50
Vocabulary size after stemming: 39


['herakl',
 'pompeiu',
 'greet',
 'good',
 'wish',
 'health',
 'soon',
 'letter',
 'reach',
 'come',
 'immedi',
 'sinc',
 'apei',
 'die',
 'serap',
 'never',
 'stop',
 'pester',
 'home',
 'sinc',
 'demand',
 'produc',
 'memoranda',
 'stand',
 'say',
 'two',
 'memoranda',
 'give',
 'argyrio',
 'utterli',
 'confirm',
 'u',
 'say',
 'paid',
 'tax',
 'sale',
 'letter',
 'reach',
 'come',
 'immedi',
 'shall',
 'leav',
 'hous',
 'dont',
 'come',
 'shall',
 'arrang',
 'hous',
 'come',
 'order']
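To make the idea of collapsing variants concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm used above, only a sketch of the principle; `crude_stem` and its suffix list are assumptions made for illustration.

```python
def crude_stem(word):
    # Try longer suffixes first; leave words ending in "ss" alone and
    # never cut a word down to fewer than three letters
    for suffix in ("ings", "ing", "ed", "es", "s"):
        if (word.endswith(suffix)
                and not word.endswith("ss")
                and len(word) - len(suffix) >= 3):
            return word[: -len(suffix)]
    return word

# Variants of one word collapse to the same stem ...
print([crude_stem(w) for w in ["process", "processing", "processed"]])
# ... but so do unrelated words, which is the downside described above
print([crude_stem(w) for w in ["stocks", "stockings"]])
```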

# 2. Preprocess Text Corpus
A corpus is a structured set of texts which can be used for statistical analysis or similar. Below we have a small corpus of just 3 texts to experiment with. In this section I will show how we can apply the pre-processing methods shown above to a corpus. We will re-use the functions written above.

In [133]:
example_corpus = ["There is no friend \nas2 loy1al as21 a book", "3There are \nworse cri!m2es than burning books. One of them is not reading them.", "I discovered me in the -231 !22 library. \nI went to find me in the library."]
example_corpus

['There is no friend \nas2 loy1al as21 a book',
 '3There are \nworse cri!m2es than burning books. One of them is not reading them.',
 'I discovered me in the -231 !22 library. \nI went to find me in the library.']

## 2.1 Lower text
How to lower all text in a corpus.

In [134]:
example_corpus = [x.lower() for x in example_corpus]
example_corpus

['there is no friend \nas2 loy1al as21 a book',
 '3there are \nworse cri!m2es than burning books. one of them is not reading them.',
 'i discovered me in the -231 !22 library. \ni went to find me in the library.']

## 2.2 Remove Linebreak
Remove linebreak symbols from text. 

In [135]:
example_corpus = [x.replace("\n", " ") for x in example_corpus]
example_corpus

['there is no friend  as2 loy1al as21 a book',
 '3there are  worse cri!m2es than burning books. one of them is not reading them.',
 'i discovered me in the -231 !22 library.  i went to find me in the library.']

## 2.3 Remove All Numbers
Removing all numbers from text

In [136]:
example_corpus = [remove_numbers(x) for x in example_corpus]
example_corpus

['there is no friend  as loyal as a book',
 'there are  worse cri!mes than burning books. one of them is not reading them.',
 'i discovered me in the - ! library.  i went to find me in the library.']

## 2.4 Remove all non-alphabetical symbols

In [137]:
example_corpus = [remove_non_alpha(x) for x in example_corpus]
example_corpus

['there is no friend  as loyal as a book',
 'there are  worse crimes than burning books one of them is not reading them',
 'i discovered me in the   library  i went to find me in the library']

## 2.5 Tokenize text

In [138]:
tokenized_example_corpus = [tokenize_text(x) for x in example_corpus]
tokenized_example_corpus

[['there', 'is', 'no', 'friend', 'as', 'loyal', 'as', 'a', 'book'],
 ['there',
  'are',
  'worse',
  'crimes',
  'than',
  'burning',
  'books',
  'one',
  'of',
  'them',
  'is',
  'not',
  'reading',
  'them'],
 ['i',
  'discovered',
  'me',
  'in',
  'the',
  'library',
  'i',
  'went',
  'to',
  'find',
  'me',
  'in',
  'the',
  'library']]

In [139]:
## Some statistics functions
import numpy as np
## Calculate average word length
def avg_word_len_corpus(tokenized_corpus):
    merged_corpus = []
    for l in tokenized_corpus:
        merged_corpus += l
    
    len_arr = [len(x) for x in merged_corpus]
    
    avg = np.average(len_arr)
    return avg
def get_text_lengths(tokenized_corpus):
    len_arr = [len(x) for x in tokenized_corpus]
    return len_arr
def vocab_size_corpus(tokenized_corpus):
    merged_corpus = []
    for l in tokenized_corpus:
        merged_corpus += l
    
    return len(set(merged_corpus))


## 2.6 Remove stopwords

In [140]:
print("Average word_length before stopword removal: {}".format(avg_word_len_corpus(tokenized_example_corpus)))
print("Length of texts before stopword removal: {}".format(get_text_lengths(tokenized_example_corpus)))
print("Vocabulary size before stopword removal: {}".format(vocab_size_corpus(tokenized_example_corpus)))

tokenized_example_corpus = [remove_stopwords(x) for x in tokenized_example_corpus]
tokenized_example_corpus




print("Average word_length after stopword removal: {}".format(avg_word_len_corpus(tokenized_example_corpus)))
print("Length of texts after stopword removal: {}".format(get_text_lengths(tokenized_example_corpus)))
print("Vocabulary size after stopword removal: {}".format(vocab_size_corpus(tokenized_example_corpus)))



Average word_length before stopword removal: 3.7567567567567566
Length of texts before stopword removal: [9, 14, 14]
Vocabulary size before stopword removal: 28
Average word_length after stopword removal: 5.714285714285714
Length of texts after stopword removal: [3, 6, 5]
Vocabulary size after stopword removal: 13


## 2.7 Lemmatization

In [141]:
print("Average word_length before lemmatization: {}".format(avg_word_len_corpus(tokenized_example_corpus)))
print("Length of texts before lemmatization: {}".format(get_text_lengths(tokenized_example_corpus)))
print("Vocabulary size before lemmatization: {}".format(vocab_size_corpus(tokenized_example_corpus)))

tokenized_example_corpus = [lemmatize_text(x) for x in tokenized_example_corpus]
tokenized_example_corpus

print("Average word_length after lemmatization: {}".format(avg_word_len_corpus(tokenized_example_corpus)))
print("Length of texts after lemmatization: {}".format(get_text_lengths(tokenized_example_corpus)))
print("Vocabulary size after lemmatization: {}".format(vocab_size_corpus(tokenized_example_corpus)))

Average word_length before lemmatization: 5.714285714285714
Length of texts before lemmatization: [3, 6, 5]
Vocabulary size before lemmatization: 13
Average word_length after lemmatization: 5.571428571428571
Length of texts after lemmatization: [3, 6, 5]
Vocabulary size after lemmatization: 12


## 2.8 Stemming

In [142]:
print("Average word_length before stemming: {}".format(avg_word_len_corpus(tokenized_example_corpus)))
print("Length of texts before stemming: {}".format(get_text_lengths(tokenized_example_corpus)))
print("Vocabulary size before stemming: {}".format(vocab_size_corpus(tokenized_example_corpus)))
tokenized_example_corpus = [stem_text(x) for x in tokenized_example_corpus]
tokenized_example_corpus

print("Average word_length after stemming: {}".format(avg_word_len_corpus(tokenized_example_corpus)))
print("Length of texts after stemming: {}".format(get_text_lengths(tokenized_example_corpus)))
print("Vocabulary size after stemming: {}".format(vocab_size_corpus(tokenized_example_corpus)))

Average word_length before stemming: 5.571428571428571
Length of texts before stemming: [3, 6, 5]
Vocabulary size before stemming: 12
Average word_length after stemming: 4.785714285714286
Length of texts after stemming: [3, 6, 5]
Vocabulary size after stemming: 12
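The cleaning steps of this section, up to and including tokenization, can also be chained into a single function that is applied to every text in the corpus. This is a minimal sketch under two assumptions: `preprocess` is a hypothetical helper, and `str.split` stands in for `nltk.word_tokenize` so that the example runs without NLTK. Stopword removal, lemmatization and stemming are left out for the same reason.

```python
import re

def preprocess(text):
    text = text.lower()                    # 2.1 lower text
    text = text.replace("\n", " ")         # 2.2 remove linebreaks
    text = re.sub(r"\d", "", text)         # 2.3 remove all digits
    text = re.sub(r"[^a-z\s]", "", text)   # 2.4 remove non-alphabetical symbols
    return text.split()                    # 2.5 whitespace tokenization

docs = ["There is no friend \nas2 loy1al as21 a book",
        "3There are \nworse cri!m2es than burning books."]
tokenized_docs = [preprocess(d) for d in docs]
print(tokenized_docs)
```

Keeping the steps in one function makes it harder to apply them in a different order to different texts by accident.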


# 3 Feature Extraction
Feature extraction is the process of transforming input data into a set of features in a format that can be used for machine learning. There are several ways of doing this. Text classification is the focus of this workshop, so we have chosen to focus on bag-of-words feature extraction and two techniques for performing it.

* Count Vectorization
* Term Frequency Inverse Document Frequency representation.

For feature extraction we will use the stemmed version of the corpus, as shown below:

In [143]:
stemmed_corpus = [' '.join(x) for x in tokenized_example_corpus]
stemmed_corpus

['friend loyal book',
 'wors crime burn book one read',
 'discov librari went find librari']

## 3.1 Count Vectorization
Count vectorization is a technique where a set of texts is turned into an array of token counts.
Information about the sklearn implementation can be found here: 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [144]:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
count_vectorized_corpus = vectorizer.fit_transform(stemmed_corpus)
print("Vector headers")
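# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())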

print("count_array")
print(count_vectorized_corpus.toarray())

Vector headers
['book', 'burn', 'crime', 'discov', 'find', 'friend', 'librari', 'loyal', 'one', 'read', 'went', 'wors']
count_array
[[1 0 0 0 0 1 0 1 0 0 0 0]
 [1 1 1 0 0 0 0 0 1 1 0 1]
 [0 0 0 1 1 0 2 0 0 0 1 0]]
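The count array above can be rebuilt by hand to see what it contains. This is a minimal sketch using only the standard library, not scikit-learn's implementation. It splits on whitespace, whereas `CountVectorizer` by default also drops single-character tokens; for this corpus the result is the same.

```python
from collections import Counter

docs = ["friend loyal book",
        "wors crime burn book one read",
        "discov librari went find librari"]

# Vocabulary: all distinct tokens, in alphabetical order
vocabulary = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary term
counts = []
for doc in docs:
    tally = Counter(doc.split())
    counts.append([tally[term] for term in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```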


## 3.2 Term Frequency Inverse Document Frequency representation

A tf-idf representation favors words that are frequent in the current document but rare in the corpus overall.

![alt text](https://mungingdata.files.wordpress.com/2017/11/equation.png "Logo Title Text 1")

![alt text](https://mungingdata.files.wordpress.com/2017/11/tfidf.png "Logo Title Text w")

For more information about scikit-learn's implementation:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [145]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_vectorizer = TfidfTransformer()
tfidf_vectorized_corpus = tfidf_vectorizer.fit_transform(count_vectorized_corpus)
print("Vector headers")
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())

print("tfidf_array")
print(tfidf_vectorized_corpus.toarray())

Vector headers
['book', 'burn', 'crime', 'discov', 'find', 'friend', 'librari', 'loyal', 'one', 'read', 'went', 'wors']
tfidf_array
[[0.4736296  0.         0.         0.         0.         0.62276601
  0.         0.62276601 0.         0.         0.         0.        ]
 [0.32200242 0.42339448 0.42339448 0.         0.         0.
  0.         0.         0.42339448 0.42339448 0.         0.42339448]
 [0.         0.         0.         0.37796447 0.37796447 0.
  0.75592895 0.         0.         0.         0.37796447 0.        ]]
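The numbers above can be reproduced by hand from the count array. The sketch below follows what `TfidfTransformer` does with its default settings as far as I understand them: a smoothed idf of ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit L2 length. This can differ slightly from the textbook formula shown in the images.

```python
import math

# Count matrix from section 3.1 (rows = documents, columns = terms)
counts = [
    [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0, 2, 0, 0, 0, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 4) for v in row])
```

Note how "book", which occurs in two of the three texts, gets a lower weight than "friend" and "loyal" in the first row even though all three occur once there.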


# 4.  Practice tasks

## 4.1 Task 1 - Text pre-processing
The text below needs to be pre-processed to be ready for use. Identify and (if you have time) apply the techniques
needed.

In [146]:
text = """To Akousilaos@\n , (?) finance-agent, greeting. Paeu3s2 son of Souneus12, assistant, has pre1paid\n in accordance with a receipt (?) for the poll-tax of the 10th year of Tiberius Caesar Augustus at Tebtunis twelve drachmas of debased silver, total 12 dr. debased silver. The 10th year of Tiberius Caesar Augustus, Choiak 13. And on Pharmouthi 5 through Akousilaos at Tebtunis twelve drachmas for poll-tax, total 12 dr."""
text

'To Akousilaos@\n , (?) finance-agent, greeting. Paeu3s2 son of Souneus12, assistant, has pre1paid\n in accordance with a receipt (?) for the poll-tax of the 10th year of Tiberius Caesar Augustus at Tebtunis twelve drachmas of debased silver, total 12 dr. debased silver. The 10th year of Tiberius Caesar Augustus, Choiak 13. And on Pharmouthi 5 through Akousilaos at Tebtunis twelve drachmas for poll-tax, total 12 dr.'

In [147]:
## Implement your solution here


## 4.2 Task 2 - Label normalization
You are given a set of labels. Look at the labels and determine what needs to be done to standardize and clean the data.

In [148]:
# A with-block closes the file once the lines have been read
with open("data/dates.txt", "r") as f:
    labels = f.readlines()
labels[:20]

['Between 300 and 130 B.C. \n',
 'Late 2nd or 3rd century A.D. \n',
 'Late 2nd century B.C. \n',
 '181/182 or 213/214 A.D. \n',
 'Between 105 and 90 B.C. \n',
 'Between 300 and 130 B.C. \n',
 '2nd century A.D. \n',
 'Between 300 and 130 B.C. \n',
 '29 Oct. or 8 Nov. 114 B.C. \n',
 '1st century A.D. \n',
 '3rd century A.D. \n',
 '2nd or 3rd century A.D. \n',
 'Early 2nd century B.C. \n',
 'Between 300 and 130 B.C. \n',
 '2nd century B.C. \n',
 '2nd century A.D. \n',
 'ca. 192/191 B.C.? \n',
 'Between 300 and 130 B.C. \n',
 'Early 2nd century B.C. \n',
 'Between 300 and 130 B.C. \n']

In [149]:
## Implement your solution here. 


## 4.3 Task 3 - Find most significant words for each text in the given corpus
Hint: Use tf-idf vectorization

In [150]:
corpus = ["Never wear your best trousers when you go out to fight for freedom and truth.",
         "The pillars of truth and the pillars of freedom - they are the pillars of society.",
          "The strongest man in the world is he who stands most alone."
         ]

In [151]:
## Implement your solution here.
