# Text Preprocessing Practice/Review

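Import the required tools. Most come from NLTK, but `get_part_of_speech` is imported from a separate `part_of_speech` module, so that module must be importable from this notebook.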

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from part_of_speech import get_part_of_speech
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

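Initialize the tools. The NLTK data these cells rely on (the stop word corpus, WordNet, and the Punkt tokenizer models) must already be downloaded, for example with `nltk.download`.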

In [None]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
stop_words.add("'s")
stemmer = PorterStemmer()

Load the text to preprocess: an excerpt of raw HTML from a Wikipedia article about Reznor.

In [None]:
reznor_wiki = """
</p><p>On January 7, 2011, Reznor announced that he would again be working with Fincher, this time to provide the <a href="/wiki/The_Girl_with_the_Dragon_Tattoo_(soundtrack)" title="The Girl with the Dragon Tattoo (soundtrack)">score</a> for the American adaptation of <i><a href="/wiki/The_Girl_with_the_Dragon_Tattoo" title="The Girl with the Dragon Tattoo">The Girl with the Dragon Tattoo</a></i>.<sup id="cite_ref-91" class="reference"><a href="#cite_note-91">&#91;90&#93;</a></sup> A cover of "<a href="/wiki/Immigrant_Song" title="Immigrant Song">Immigrant Song</a>" by <a href="/wiki/Led_Zeppelin" title="Led Zeppelin">Led Zeppelin</a>, produced by Reznor and Ross, with <a href="/wiki/Karen_O" title="Karen O">Karen O</a> (of the <a href="/wiki/Yeah_Yeah_Yeahs" title="Yeah Yeah Yeahs">Yeah Yeah Yeahs</a>) as the featured singer, accompanied a trailer for the film.<sup id="cite_ref-ImmigrantSong_92-0" class="reference"><a href="#cite_note-ImmigrantSong-92">&#91;91&#93;</a></sup> Reznor and Ross' second collaboration with Fincher was scored as the film was shot, based on the concept, "What if we give you music the minute you start to edit stuff together?" Reznor explained in 2014 that the composition process was "a lot more work," and that he "would be hesitant to go as far in that direction in the future."<sup id="cite_ref-Joe_93-0" class="reference"><a href="#cite_note-Joe-93">&#91;92&#93;</a></sup>
>
"""

Clean up the text by removing HTML tags, numeric character references (such as `&#91;`), and punctuation.

In [None]:
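# Raw string so that escapes such as \d reach the regex engine unchanged.
# Removes HTML tags, citation markers such as &#91;90&#93;, and punctuation.
wiki_cleaned_up = re.sub(r"""<.+?>|[&#;]+\d+|[;,"()'.?<>]""", '', reznor_wiki)
print(wiki_cleaned_up)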
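
To see what each part of the cleanup pattern does, here is a minimal sketch that applies the same alternatives to a short, made-up string. The `sample` text and the name `CLEANUP_PATTERN` are illustrative assumptions, not part of the original notebook.

```python
import re

# The three alternatives, tried left to right at each position:
#   <.+?>         an HTML tag (non-greedy, so each tag is matched separately)
#   [&#;]+\d+     the start of a numeric character reference such as "&#91",
#                 or a ";" followed by digits (this also drops citation numbers)
#   [;,"()'.?<>]  any leftover punctuation character
CLEANUP_PATTERN = re.compile(r"""<.+?>|[&#;]+\d+|[;,"()'.?<>]""")

# Hypothetical sample, shaped like the Wikipedia markup above.
sample = '<i><a href="/wiki/X">The Film</a></i>.<sup>&#91;90&#93;</sup> A cover, "Song".'

cleaned = CLEANUP_PATTERN.sub("", sample)
print(cleaned)  # The Film A cover Song
```

Because the apostrophe is in the punctuation class, possessives such as `Ross'` lose their apostrophe before tokenization.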

Make text lower case.

In [None]:
wiki_lower = wiki_cleaned_up.lower()
print(wiki_lower)

Tokenize text.

In [None]:
wiki_tokenized = word_tokenize(wiki_lower)
print(wiki_tokenized)

Remove stop words from the text.

In [None]:
wiki_no_stop_words = [word for word in wiki_tokenized if word not in stop_words]
print(wiki_no_stop_words)
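
The order of the steps matters here: NLTK's English stop word list is all lowercase, so the filter only catches capitalized words if the text was lowercased first. A minimal sketch, using a hypothetical three-word stop list rather than the real one:

```python
# Hypothetical miniature stop word list, standing in for NLTK's English list.
sample_stop_words = {"the", "of", "a"}

tokens = ["The", "score", "of", "the", "film"]

# Filtering with the original casing leaves "The" in place...
kept_cased = [t for t in tokens if t not in sample_stop_words]

# ...while filtering lowercased tokens removes it.
lowered = [t.lower() for t in tokens]
kept_lowered = [t for t in lowered if t not in sample_stop_words]

print(kept_cased)    # ['The', 'score', 'film']
print(kept_lowered)  # ['score', 'film']
```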

Lemmatize the text.

In [None]:
wiki_lemmatized = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in wiki_no_stop_words]
print(wiki_lemmatized)
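
The source of `get_part_of_speech` is not shown in this notebook. One plausible way to write such a helper is to look up a word's WordNet synsets and return the part of speech that occurs most often among them. The sketch below is an assumption along those lines, not the actual module; the names `most_common_pos` and `get_part_of_speech_sketch` are hypothetical, and the second function needs NLTK with the WordNet corpus installed.

```python
from collections import Counter


def most_common_pos(pos_tags, default="n"):
    """Return the most frequent WordNet POS tag, or `default` if none are given."""
    counts = Counter(tag for tag in pos_tags if tag in {"n", "v", "a", "r"})
    if not counts:
        # Unknown words fall back to noun, the lemmatizer's own default.
        return default
    return counts.most_common(1)[0][0]


def get_part_of_speech_sketch(word):
    """Guess a word's part of speech from its WordNet synsets (assumed approach)."""
    # Imported here so the counting helper above works without NLTK installed.
    from nltk.corpus import wordnet

    # Satellite adjectives ("s") are folded into plain adjectives ("a").
    tags = ["a" if s.pos() == "s" else s.pos() for s in wordnet.synsets(word)]
    return most_common_pos(tags)
```

The returned tag (`"n"`, `"v"`, `"a"`, or `"r"`) can be passed as the second argument to `lemmatizer.lemmatize`, as in the cell above.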

Stem the text. Note that the stemmer runs on `wiki_no_stop_words`, not on the lemmatized output.

In [None]:
wiki_stemmatized = [stemmer.stem(word) for word in wiki_no_stop_words]
print(wiki_stemmatized)

Optional: create a set of the unique words in the text.

In [None]:
words_in_set = set(wiki_stemmatized)
print(words_in_set)
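
A set is unordered, so the printed order of the unique words can change from run to run. If a stable listing is wanted, sorting the set is one simple option. A small sketch, with a hypothetical list standing in for `wiki_stemmatized`:

```python
# Hypothetical stems, standing in for wiki_stemmatized.
sample_stems = ["reznor", "film", "score", "film", "reznor", "music"]

# set() removes duplicates; sorted() returns them as an alphabetical list.
unique_sorted = sorted(set(sample_stems))
print(unique_sorted)  # ['film', 'music', 'reznor', 'score']
```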