# Text Preprocessing

A common first step in many NLP applications is cleaning and standardizing the text. Which steps you apply depends on your application. Common preprocessing steps include:
* removing newlines
* lowercasing the text
* removing stop words
* removing punctuation

In [6]:
text = """I am Sam. \nSam I am. \nI do not like green eggs and ham.\n"""
print(text)

I am Sam. 
Sam I am. 
I do not like green eggs and ham.



### remove newlines

Sometimes you may *not* want to remove newlines, for example when they mark paragraph boundaries that you want to preserve. The example below removes newlines with the string method **.replace()**: the first argument is the substring to be replaced and the second argument is its replacement, in this case the empty string ''.

In [7]:
text = text.replace('\n','')
print(text)

I am Sam. Sam I am. I do not like green eggs and ham.
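
Replacing newlines with nothing works above because each line of the sample already ends with a space. As a minimal sketch of what happens otherwise, the hypothetical string `raw` below has no space before its newlines; the variable names are illustrative only and not part of the original notebook.

```python
# Hypothetical sample: no space before each newline, unlike the text above
raw = "I am Sam.\nSam I am.\nI do not like green eggs and ham.\n"

# Replacing with nothing glues the sentences together
fused = raw.replace('\n', '')

# Replacing with a space keeps words apart; strip() drops the trailing space
spaced = raw.replace('\n', ' ').strip()

# split() with no arguments splits on any run of whitespace,
# so joining with single spaces also collapses repeated spaces
normalized = ' '.join(raw.split())

print(fused)
print(spaced)
print(normalized)
```

Which variant is appropriate depends on how the source text was formatted; the split-and-join form is the most forgiving when spacing is inconsistent.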


### lowercase

The string method **.lower()** converts the entire text to lowercase.

In [8]:
text = text.lower()
print(text)

i am sam. sam i am. i do not like green eggs and ham.
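
For English text like the sample, lowercasing is usually enough. As a hedged aside, Python strings also offer `.casefold()`, which is intended for case-insensitive matching and can differ from lowercasing for some non-English characters. The sample word below is an assumption chosen only to show the difference.

```python
# Hypothetical sample containing the German sharp s
word = "Straße"

lowered = word.lower()      # lower() leaves 'ß' unchanged
folded = word.casefold()    # casefold() expands 'ß' to 'ss'

print(lowered)
print(folded)

# Case-folded strings compare equal regardless of original casing
print(folded == "STRASSE".casefold())
```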


### tokenize and remove stopwords and punctuation

When we count words, we often want to omit punctuation and stop words. Stop words are very common words, such as "i", "am", and "and", that carry little meaning on their own.

First we import word_tokenize and stopwords from nltk. Both depend on NLTK data packages that are downloaded separately from the library; if they are missing, `nltk.download()` can fetch them. We also import the string module so that we can get a string of punctuation symbols.

After tokenizing the text, we use a list comprehension to remove stop words and punctuation. A list comprehension is shorthand for a loop that builds a list. The list comprehension below creates a new list by looping through each word in the tokens list and keeping it only if it is neither a stop word nor a punctuation symbol.
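
To make the "shorthand for a loop" idea concrete, here is a small sketch that writes the same filter both ways. The token list, `stop_words`, and `punctuation` are tiny hypothetical stand-ins so the example runs without NLTK; they are not NLTK's actual stop word list.

```python
# Hypothetical stand-ins so the sketch runs without NLTK data
tokens = ['i', 'am', 'sam', '.', 'sam', 'i', 'am', '.']
stop_words = {'i', 'am'}
punctuation = {'.'}

# Explicit loop: keep a word only if it is in neither collection
kept_loop = []
for word in tokens:
    if word not in stop_words and word not in punctuation:
        kept_loop.append(word)

# Equivalent list comprehension
kept_comp = [word for word in tokens
             if word not in stop_words and word not in punctuation]

print(kept_loop)
print(kept_comp)
```

Both versions produce the same list; the comprehension simply folds the loop, the condition, and the append into one expression.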

In [9]:
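from nltk import word_tokenize
from nltk.corpus import stopwords
import string

# Build the stop word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)
print("tokens after tokenizing: ", tokens)
# Keep only tokens that are neither stop words nor punctuation symbols
tokens = [word for word in tokens
          if word not in stop_words and word not in string.punctuation]
print("tokens after removing stop words and punctuation: ", tokens)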

tokens after tokenizing:  ['i', 'am', 'sam', '.', 'sam', 'i', 'am', '.', 'i', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', '.']
tokens after removing stop words and punctuation:  ['sam', 'sam', 'like', 'green', 'eggs', 'ham']
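
Since the motivation for this filtering is counting words, here is a minimal sketch of that next step using `collections.Counter` from the standard library. The token list is copied from the output above so the block runs on its own.

```python
from collections import Counter

# Filtered tokens copied from the output above
tokens = ['sam', 'sam', 'like', 'green', 'eggs', 'ham']

# Counter maps each distinct token to the number of times it occurs
counts = Counter(tokens)

print(counts['sam'])
print(counts.most_common(1))
```

Without the stop word and punctuation filter, the most frequent tokens in this text would be "i", "am", and ".", which say little about its content.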
