# Special Character & Stopword Removal

## Removing Special Characters

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.

Python has a built-in package called <code>re</code>, which can be used to work with Regular Expressions.

More on regular expressions: https://www.w3schools.com/python/python_regex.asp

### RegEx Example

Search the string to see if it starts with "The" and ends with "Spain".

In [1]:
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")

YES! We have a match!
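To make the role of the `^` and `$` anchors in the pattern above more concrete, here is a small sketch using only the standard `re` module. The variable names are illustrative and not part of the original notebook.

```python
import re

txt = "The rain in Spain"

# ^ anchors the pattern at the start of the string, $ at the end.
anchored = re.search(r"^The.*Spain$", txt)
# Appending "!" means the string no longer ends with "Spain", so the anchored pattern fails.
anchored_miss = re.search(r"^The.*Spain$", txt + "!")
# Without anchors, re.search finds the pattern anywhere in the string.
unanchored = re.search(r"Spain", txt + "!")

print(bool(anchored), bool(anchored_miss), bool(unanchored))  # True False True

# re.findall lists every match; re.sub replaces every match.
print(re.findall(r"ai", txt))   # ['ai', 'ai']
print(re.sub(r"ai", "_", txt))  # The r_n in Sp_n
```

`re.sub` is the call the next function relies on to delete unwanted characters.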


Run the code cell below to create the <code>remove_special_characters</code> function, which removes special characters from a text string.

A caret at the start of a bracket expression means ‘not’. 
If the <code>remove_digits</code> parameter is False (the default), the pattern <code>[^a-zA-Z0-9\s]</code> matches any character other than 
a letter ([a-zA-Z]), a digit ([0-9]), or whitespace (\s), so digits are kept.
If <code>remove_digits</code> is True, 0-9 is left out of the bracket, so the function removes digits as well. 

In [2]:
def remove_special_characters(text, remove_digits=False):
    r'''
    A caret at the start of a bracket expression means 'not'.
    If remove_digits is False (the default), the pattern [^a-zA-Z0-9\s] matches any
    character other than a letter, a digit, or whitespace, so digits are kept.
    If remove_digits is True, digits are removed as well.
    '''
    # note the upper-case Z: the range A-z would also cover [ \ ] ^ _ and the backtick
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

**Execute the function "remove_special_characters"**

The example below runs the <code>remove_special_characters</code> function twice on the following text, once with each value of <code>remove_digits</code>.
> "Well this was fun! What do you think? 123#@"

In [3]:
remove_special_characters("Well this was fun! What do you think? 123#@", 
                          remove_digits=True)

'Well this was fun What do you think '

In [4]:
remove_special_characters("Well this was fun! What do you think? 123#@", 
                          remove_digits=False)

'Well this was fun What do you think 123'
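Negated character classes are easy to get subtly wrong. The sketch below compares the ranges `A-Z` and `A-z`: the latter spans ASCII codes 65 to 122 and so also covers `[`, `\`, `]`, `^`, `_`, and the backtick. The sample string is made up for illustration.

```python
import re

sample = "snake_case [x] ^ `y` 42!"

# Intended class: remove anything that is not a letter or whitespace.
strict = re.sub(r"[^a-zA-Z\s]", "", sample)

# A-z also covers the six ASCII characters between 'Z' and 'a',
# so underscores, brackets, carets, and backticks survive.
loose = re.sub(r"[^a-zA-z\s]", "", sample)

print(repr(strict))
print(repr(loose))
```

Only the strict version removes the underscore and brackets; both versions remove the digits and the exclamation mark.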

## Removing Stopwords

We will be importing the <code>nltk</code> package, which stands for Natural Language Toolkit. This is a suite of libraries and programs for symbolic and statistical natural language processing for English.

<code>nltk</code> contains tools for pre-processing data, and we will be using these tools to remove stopwords, stem words, and lemmatize words. 

More on <code>nltk</code> here: https://www.nltk.org/book/ch03.html

In [5]:
import nltk # import nltk library

from nltk.corpus import stopwords # stopwords tool
from nltk.tokenize.toktok import ToktokTokenizer # word tokenization tool

nltk.download('stopwords') # get the stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andreaekey/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The code cell below will display a list of stopwords we will be removing from our text. 

In [6]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

The code cell below creates the <code>remove_stopwords</code> function, which removes stopwords from a text.

In [7]:
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no') # we will not remove 'no' from texts
stopword_list.remove('not') # we will not remove 'not' from texts

def remove_stopwords(text, is_lower_case=False):
    # First, tokenize the text
    tokens = tokenizer.tokenize(text)
    # strip leading and trailing whitespace from each token
    tokens = [token.strip() for token in tokens]
    # If "is_lower_case" is True, the text is assumed to be lowercased already, so tokens
    # are compared as-is; a stopword containing an uppercase letter (e.g. "The") is kept
    if is_lower_case: 
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    # If "is_lower_case" is False, each token is lowercased before the comparison,
    # so stopwords are removed regardless of their case
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

**Execute the function "remove_stopwords"**

The example below runs the <code>remove_stopwords</code> function twice on the following text, once with each value of <code>is_lower_case</code>.
> "The, and, if are StopWords, computer is not"

In [8]:
remove_stopwords("The, and, if are StopWords, computer is not")

', , StopWords , computer not'

In [9]:
remove_stopwords("The, and, if are StopWords, computer is not", is_lower_case=True)

'The , , StopWords , computer not'

## Exercise (Stopword Removal)

Consider the below product reviews:  
  
1. The product is really very good. – POSITIVE  
2. The product seems to be good. – POSITIVE  
3. Good product. I really liked it. – POSITIVE  
4. The product is not good. – NEGATIVE  
5. I didn’t like the product. – NEGATIVE  

Each numbered review has the review text on the left and the label on the right. A product review will be labeled either 'POSITIVE' or 'NEGATIVE'.

Remove stopwords in each review using the code provided earlier in the module. Observe what happens to the review comments.

In [10]:
# remove stopwords from review text
rm1 = remove_stopwords("The product is really very good.")
rm2 = remove_stopwords("The product seems to be good.")
rm3 = remove_stopwords("Good product. I really liked it.")
rm4 = remove_stopwords("The product is not good.")
rm5 = remove_stopwords("I didn’t like the product.")

print(rm1)
print(rm2)
print(rm3)
print(rm4)
print(rm5)

product really good .
product seems good .
Good product. really liked .
product not good .
’ like product .


The meaning of Review 5 changes after stopword removal: the negation disappears, and only the apostrophe of "didn’t" survives. This is because
NLTK's stopword list contains negative contractions such as "didn't," "doesn't," and "hadn't," 
along with fragments of them such as "didn" and "t."
Removing stopwords, especially negative contractions like "didn't," 
can change the meaning of the original text, so stopwords should be removed with care.
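One way to guard against this is to treat negations as protected tokens. The sketch below does not use NLTK: `TOY_STOPWORDS`, `NEGATIONS`, and `remove_stopwords_keep_negation` are hypothetical names, the stopword set is a tiny stand-in for the real list, and the regex tokenizer is an assumption chosen so that contractions stay in one piece.

```python
import re

# Hypothetical miniature stopword set that, like NLTK's list, includes
# negative contractions and their fragments ("didn", "t").
TOY_STOPWORDS = {"i", "the", "is", "it", "didn", "t", "didn't"}

# Assumption: these tokens carry negation and should survive filtering.
NEGATIONS = {"no", "not", "didn't", "doesn't", "hadn't"}

def remove_stopwords_keep_negation(text):
    # Normalize the typographic apostrophe so "didn’t" and "didn't" are the same token.
    text = text.replace("\u2019", "'")
    # Keep contractions whole; each punctuation mark becomes its own token.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    kept = [t for t in tokens
            if t.lower() in NEGATIONS or t.lower() not in TOY_STOPWORDS]
    return " ".join(kept)

print(remove_stopwords_keep_negation("I didn’t like the product."))
# didn't like product .
```

With the negation preserved, Review 5 still reads as negative after filtering.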

# Stemming & Lemmatization

## Stemming

Run the code cell below to create the <code>simple_stemmer</code> function, which stems the words in a text.

In [11]:
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    # text.split() splits the text on whitespace and returns a list of words;
    # ps.stem() stems each word, and ' '.join() joins the stemmed words with single spaces.
    text = ' '.join([ps.stem(word) for word in text.split()]) 
    return text

**Execute the function "simple_stemmer"**

The example below demonstrates a sample execution of the <code>simple_stemmer</code> function on the following text.
> "My system keeps crashing his crashed yesterday, ours crashes daily"

In [12]:
simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")

'My system keep crash hi crash yesterday, our crash daili'
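Stems such as "hi" (from "his") appear because a stemmer strips suffixes by rule without checking whether the result is a real word. The toy sketch below illustrates that idea with four hypothetical suffix rules. It is not the Porter algorithm, which uses many more rules and conditions.

```python
# Hypothetical suffix rules, tried in order; NOT the Porter algorithm.
TOY_SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=2):
    for suffix in TOY_SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Strip the first matching suffix, as long as enough of the word remains.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

def toy_stemmer(text):
    # Split on whitespace only, so "yesterday," keeps its trailing comma.
    return " ".join(toy_stem(w) for w in text.split())

print(toy_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily"))
# My system keep crash hi crash yesterday, our crash daily
```

Even these crude rules reproduce the "his" to "hi" over-stemming seen in the output above.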

## Lemmatization

Run the code cell below to create the <code>lemmatize_text</code> function, which lemmatizes the words in a text.

In [13]:
from nltk.stem import WordNetLemmatizer # word lemmatization tool
nltk.download('punkt')
nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    s = " " # separator used later to join the lemmatized words
    t_l = [] # create an empty list to collect the lemmatized words
    t_w = nltk.word_tokenize(text) # tokenize the text into a list of words, t_w
    for w in t_w:
        # “pos” is a part of speech parameter and “v” means verbs. 
        # We will lemmatize verbs only. 
        l_w = wordnet_lemmatizer.lemmatize(w, pos="v")
        # append l_w into the list t_l
        t_l.append(l_w)
    # join the tokens to make a complete sentence
    text = s.join(t_l)
    return text

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/andreaekey/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/andreaekey/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Execute the function "lemmatize_text"**

The example below demonstrates a sample execution of the <code>lemmatize_text</code> function on the following text.
> "My system keeps crashing! his was crashed yesterday, ours crashes daily"

In [14]:
lemmatize_text("My system keeps crashing! his was crashed yesterday, ours crashes daily")

'My system keep crash ! his be crash yesterday , ours crash daily'
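Unlike the stemmer, the lemmatizer leaves "his" and "ours" intact and maps "was" to "be", because it consults a vocabulary instead of only stripping suffixes. As a much simplified picture of that behaviour, the sketch below uses a hypothetical lookup table, `TOY_VERB_LEMMAS`. WordNet's real vocabulary and rules are far richer.

```python
# Hypothetical verb lemma table; WordNet's real vocabulary is far larger.
TOY_VERB_LEMMAS = {
    "keeps": "keep",
    "crashing": "crash",
    "crashed": "crash",
    "crashes": "crash",
    "was": "be",
}

def toy_lemmatize(text):
    # Words missing from the table are returned unchanged, so "his" stays "his".
    return " ".join(TOY_VERB_LEMMAS.get(w, w) for w in text.split())

print(toy_lemmatize("My system keeps crashing his was crashed yesterday ours crashes daily"))
# My system keep crash his be crash yesterday ours crash daily
```

An irregular form like "was" can only be handled by lookup, since no suffix rule turns it into "be".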

## Exercise  (Text Preprocessing)

1) Remove special characters (including numbers) and stopwords.

2) Lemmatize the following paragraph. 

  
> “We measured the serum lipid profile, together with plasma fibrinogen and serum lipoprotein(a) (Lp[a]), glucose, bilirubin, and albumin levels in 491 patients (310 men) who were referred for the management of primary dyslipidemia. All these variables have been shown to predict vascular events. The patients were not taking lipid-lowering drugs; hypertension was present in 156 (31.7%) of them. Of the hypertensive patients, 52 (33%) were not receiving any treatment to control their blood pressure. Lipid-hostile antihypertensive drugs were associated with a significantly higher fibrinogen concentration when compared with untreated hypertensives or those taking lipid-neutral/lipid-friendly drugs (median values: 383, 353, and 336 mg/dL, respectively; P < .01). Lipid-neutral/lipid-friendly antihypertensive drugs were associated with lower Lp(a) levels when compared with untreated hypertensives (median values: 22 and 45 mg/dL, respectively; P < .05). The serum bilirubin level was significantly lower in the untreated hypertensives when compared with normotensives or the treated hypertensives. There were no significant differences in lipids, glucose, or albumin among the groups of hypertensives or normotensives. The influence of antihypertensive drugs on additional cardiovascular risk factors should be considered when selecting medication to reduce blood pressure.”

In [15]:
text = "We measured the serum lipid profile, together with plasma fibrinogen and serum lipoprotein(a) (Lp[a]), glucose, bilirubin, and albumin levels in 491 patients (310 men) who were referred for the management of primary dyslipidemia. All these variables have been shown to predict vascular events. The patients were not taking lipid-lowering drugs; hypertension was present in 156 (31.7%) of them. Of the hypertensive patients, 52 (33%) were not receiving any treatment to control their blood pressure. Lipid-hostile antihypertensive drugs were associated with a significantly higher fibrinogen concentration when compared with untreated hypertensives or those taking lipid-neutral/lipid-friendly drugs (median values: 383, 353, and 336 mg/dL, respectively; P < .01). Lipid-neutral/lipid-friendly antihypertensive drugs were associated with lower Lp(a) levels when compared with untreated hypertensives (median values: 22 and 45 mg/dL, respectively; P < .05). The serum bilirubin level was significantly lower in the untreated hypertensives when compared with normotensives or the treated hypertensives. There were no significant differences in lipids, glucose, or albumin among the groups of hypertensives or normotensives. The influence of antihypertensive drugs on additional cardiovascular risk factors should be considered when selecting medication to reduce blood pressure."
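# remove special characters (digits are kept because remove_digits defaults to False;
# pass remove_digits=True to remove numbers as well, as the exercise asks)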
rm_sc = remove_special_characters(text)
# remove stopwords
rm_sw = remove_stopwords(rm_sc)
# lemmatize words
final_text = lemmatize_text(rm_sw)

print(final_text)

measure serum lipid profile together plasma fibrinogen serum lipoproteina Lpa glucose bilirubin albumin level 491 patients 310 men refer management primary dyslipidemia variables show predict vascular events patients not take lipidlowering drug hypertension present 156 317 hypertensive patients 52 33 not receive treatment control blood pressure Lipidhostile antihypertensive drug associate significantly higher fibrinogen concentration compare untreated hypertensives take lipidneutrallipidfriendly drug median value 383 353 336 mgdL respectively P 01 Lipidneutrallipidfriendly antihypertensive drug associate lower Lpa level compare untreated hypertensives median value 22 45 mgdL respectively P 05 serum bilirubin level significantly lower untreated hypertensives compare normotensives treat hypertensives no significant differences lipids glucose albumin among group hypertensives normotensives influence antihypertensive drug additional cardiovascular risk factor consider select medication red
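In the output above, deleting special characters outright fuses hyphenated terms ("lipidlowering") and decimals ("317"). One possible alternative, sketched below, substitutes a space for each special character and then collapses repeated whitespace. `replace_special_characters` is a hypothetical variant, not the function used in this module.

```python
import re

def replace_special_characters(text, remove_digits=False):
    # Hypothetical variant of remove_special_characters: substitute a space
    # instead of deleting, then collapse runs of whitespace.
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "lipid-neutral/lipid-friendly drugs (31.7%)"
print(replace_special_characters(sample))
# lipid neutral lipid friendly drugs 31 7
print(replace_special_characters(sample, remove_digits=True))
# lipid neutral lipid friendly drugs
```

The trade-off is that "31.7" now becomes "31 7" instead of "317". Neither form preserves the number, so numeric values that matter are better extracted before cleaning.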