### Instructions:
1. Tokenize the sample text and display the results.
2. Remove punctuation from the sample text and show the results.
3. Display all the stop words in the sample text.
4. Remove the stop words from the sample text and show the results.
5. Apply the Porter stemmer to the sample text and show the results.
6. Apply lemmatization to the sample text and show the results.

In [44]:
import re
import string

import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [45]:
pd.set_option('display.max_colwidth',50)
data = pd.read_csv("sample_text.csv")
data.head(6)

Unnamed: 0,Number,Phrase
0,1,Such an analysis can reveal features
1,2,that are not easily visible from the
2,3,variations in the individual genes and
3,4,can lead to a picture of expression that
4,5,is more biologically transparent and
5,6,accessible to interpretation.


In [46]:
def removePunctuation(text):
    # First attempt: the comprehension yields a list of single characters, not a string.
    text_nopunct = [a for a in text if a not in string.punctuation]
    return text_nopunct

In [47]:
data['Clean'] = data['Phrase'].apply(lambda x: removePunctuation(x))
data.head(6)

Unnamed: 0,Number,Phrase,Clean
0,1,Such an analysis can reveal features,"[S, u, c, h, , a, n, , a, n, a, l, y, s, i, ..."
1,2,that are not easily visible from the,"[t, h, a, t, , a, r, e, , n, o, t, , e, a, ..."
2,3,variations in the individual genes and,"[v, a, r, i, a, t, i, o, n, s, , i, n, , t, ..."
3,4,can lead to a picture of expression that,"[c, a, n, , l, e, a, d, , t, o, , a, , p, ..."
4,5,is more biologically transparent and,"[i, s, , m, o, r, e, , b, i, o, l, o, g, i, ..."
5,6,accessible to interpretation.,"[a, c, c, e, s, s, i, b, l, e, , t, o, , i, ..."


In [48]:
def removePunctuation(text):
    # Join the kept characters back into one string.
    text_nopunct = "".join([a for a in text if a not in string.punctuation])
    return text_nopunct

data['Clean'] = data['Phrase'].apply(lambda x: removePunctuation(x))
data.head(6)

Unnamed: 0,Number,Phrase,Clean
0,1,Such an analysis can reveal features,Such an analysis can reveal features
1,2,that are not easily visible from the,that are not easily visible from the
2,3,variations in the individual genes and,variations in the individual genes and
3,4,can lead to a picture of expression that,can lead to a picture of expression that
4,5,is more biologically transparent and,is more biologically transparent and
5,6,accessible to interpretation.,accessible to interpretation
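The character-by-character filter above can also be written with `str.translate`, a standard-library idiom for deleting a fixed set of characters. The following is a minimal sketch; `remove_punctuation_fast` is a hypothetical name that does not appear elsewhere in this notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletes it).
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(text):
    """Return text with every character in string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_fast("accessible to interpretation."))
```

With the DataFrame above, this could be applied as `data['Phrase'].apply(remove_punctuation_fast)`.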


In [50]:
def token(text):
    # Raw string so that \W reaches the regex engine instead of being read as a string escape.
    tokens = re.split(r'\W+', text)
    return tokens

data['Tokenize'] = data['Clean'].apply(lambda x: token(x.lower()))
data.head(6)

Unnamed: 0,Number,Phrase,Clean,Tokenize
0,1,Such an analysis can reveal features,Such an analysis can reveal features,"[such, an, analysis, can, reveal, features]"
1,2,that are not easily visible from the,that are not easily visible from the,"[that, are, not, easily, visible, from, the]"
2,3,variations in the individual genes and,variations in the individual genes and,"[variations, in, the, individual, genes, and]"
3,4,can lead to a picture of expression that,can lead to a picture of expression that,"[can, lead, to, a, picture, of, expression, that]"
4,5,is more biologically transparent and,is more biologically transparent and,"[is, more, biologically, transparent, and]"
5,6,accessible to interpretation.,accessible to interpretation,"[accessible, to, interpretation]"
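Splitting on `\W+` works here because punctuation was removed first. On text that starts or ends with a non-word character, `re.split` leaves an empty string in the token list. The sketch below shows that case and one possible alternative based on `re.findall`; `tokenize_words` is a hypothetical helper, not part of the notebook.

```python
import re

def tokenize_words(text):
    """Return the lowercase runs of word characters found in text."""
    return re.findall(r'\w+', text.lower())

raw = "accessible to interpretation."

# Splitting on non-word characters leaves a trailing empty string after the final period.
split_tokens = re.split(r'\W+', raw)
print(split_tokens)

# Matching word characters directly never produces empty tokens.
print(tokenize_words(raw))
```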


['i', 'me', 'my', 'myself', 'we', 'our', 'ours']

In [51]:
# Requires the NLTK stop word corpus; run nltk.download('stopwords') once if it is missing.
stopwords = nltk.corpus.stopwords.words('english')

def removeStopwords(text_tokenized):
    text_clean = [word for word in text_tokenized if word not in stopwords]
    return text_clean

data['Stopwords'] = data['Tokenize'].apply(lambda x: removeStopwords(x))
data.head(6)

Unnamed: 0,Number,Phrase,Clean,Tokenize,Stopwords
0,1,Such an analysis can reveal features,Such an analysis can reveal features,"[such, an, analysis, can, reveal, features]","[analysis, reveal, features]"
1,2,that are not easily visible from the,that are not easily visible from the,"[that, are, not, easily, visible, from, the]","[easily, visible]"
2,3,variations in the individual genes and,variations in the individual genes and,"[variations, in, the, individual, genes, and]","[variations, individual, genes]"
3,4,can lead to a picture of expression that,can lead to a picture of expression that,"[can, lead, to, a, picture, of, expression, that]","[lead, picture, expression]"
4,5,is more biologically transparent and,is more biologically transparent and,"[is, more, biologically, transparent, and]","[biologically, transparent]"
5,6,accessible to interpretation.,accessible to interpretation,"[accessible, to, interpretation]","[accessible, interpretation]"
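Step 3 asks for the stop words that occur in the sample text. One way to recover them is to compare the Tokenize and Stopwords columns: any token missing from the filtered list was removed as a stop word. The sketch below copies the token lists from the output above so that it runs without NLTK; the variable names are illustrative only.

```python
# Token lists copied from the Tokenize column above.
tokenized = [
    ["such", "an", "analysis", "can", "reveal", "features"],
    ["that", "are", "not", "easily", "visible", "from", "the"],
    ["variations", "in", "the", "individual", "genes", "and"],
    ["can", "lead", "to", "a", "picture", "of", "expression", "that"],
    ["is", "more", "biologically", "transparent", "and"],
    ["accessible", "to", "interpretation"],
]

# Token lists copied from the Stopwords column above (stop words already removed).
filtered = [
    ["analysis", "reveal", "features"],
    ["easily", "visible"],
    ["variations", "individual", "genes"],
    ["lead", "picture", "expression"],
    ["biologically", "transparent"],
    ["accessible", "interpretation"],
]

# Tokens dropped from a row are the stop words found in that row.
removed_per_row = [
    [word for word in tokens if word not in kept]
    for tokens, kept in zip(tokenized, filtered)
]

# Distinct stop words across the whole sample, sorted alphabetically.
stop_words_in_sample = sorted({word for row in removed_per_row for word in row})
print(stop_words_in_sample)
```

Inside the notebook, the per-row result could be computed directly with `data['Tokenize'].apply(lambda toks: [w for w in toks if w in stopwords])`.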


In [52]:
ps = PorterStemmer()

def stemming(tokenized):
    # Reduce each token to its Porter stem; a stem is not necessarily a dictionary word.
    text = [ps.stem(word) for word in tokenized]
    return text

In [53]:
data['Stemmer'] = data['Stopwords'].apply(lambda x: stemming(x))
data.tail(6)

Unnamed: 0,Number,Phrase,Clean,Tokenize,Stopwords,Stemmer
0,1,Such an analysis can reveal features,Such an analysis can reveal features,"[such, an, analysis, can, reveal, features]","[analysis, reveal, features]","[analysi, reveal, featur]"
1,2,that are not easily visible from the,that are not easily visible from the,"[that, are, not, easily, visible, from, the]","[easily, visible]","[easili, visibl]"
2,3,variations in the individual genes and,variations in the individual genes and,"[variations, in, the, individual, genes, and]","[variations, individual, genes]","[variat, individu, gene]"
3,4,can lead to a picture of expression that,can lead to a picture of expression that,"[can, lead, to, a, picture, of, expression, that]","[lead, picture, expression]","[lead, pictur, express]"
4,5,is more biologically transparent and,is more biologically transparent and,"[is, more, biologically, transparent, and]","[biologically, transparent]","[biolog, transpar]"
5,6,accessible to interpretation.,accessible to interpretation,"[accessible, to, interpretation]","[accessible, interpretation]","[access, interpret]"


In [56]:
words = data['Stopwords'][0:7]
words

0       [analysis, reveal, features]
1                  [easily, visible]
2    [variations, individual, genes]
3        [lead, picture, expression]
4        [biologically, transparent]
5       [accessible, interpretation]
Name: Stopwords, dtype: object

In [58]:
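# Requires the WordNet corpus; run nltk.download('wordnet') once if it is missing.
lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun by default (pos='n').
for tokens in words:
    for word in tokens:
        print("\n\n"+word, "\nSTEMMING :", ps.stem(word), "\nLEMMATIZATION :", lemmatizer.lemmatize(word))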



analysis 
STEMMING : analysi 
LEMMATIZATION : analysis


reveal 
STEMMING : reveal 
LEMMATIZATION : reveal


features 
STEMMING : featur 
LEMMATIZATION : feature


easily 
STEMMING : easili 
LEMMATIZATION : easily


visible 
STEMMING : visibl 
LEMMATIZATION : visible


variations 
STEMMING : variat 
LEMMATIZATION : variation


individual 
STEMMING : individu 
LEMMATIZATION : individual


genes 
STEMMING : gene 
LEMMATIZATION : gene


lead 
STEMMING : lead 
LEMMATIZATION : lead


picture 
STEMMING : pictur 
LEMMATIZATION : picture


expression 
STEMMING : express 
LEMMATIZATION : expression


biologically 
STEMMING : biolog 
LEMMATIZATION : biologically


transparent 
STEMMING : transpar 
LEMMATIZATION : transparent


accessible 
STEMMING : access 
LEMMATIZATION : accessible


interpretation 
STEMMING : interpret 
LEMMATIZATION : interpretation
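
The printed comparison can be summarized by counting how many words each method changed. The sketch below copies the (word, stem, lemma) triples from the output above, so it runs without NLTK; the variable names are illustrative only. Because `lemmatize` treats each word as a noun unless a `pos` argument is given, only the plural nouns change under lemmatization here.

```python
# (word, stem, lemma) triples copied from the printed output above.
results = [
    ("analysis", "analysi", "analysis"),
    ("reveal", "reveal", "reveal"),
    ("features", "featur", "feature"),
    ("easily", "easili", "easily"),
    ("visible", "visibl", "visible"),
    ("variations", "variat", "variation"),
    ("individual", "individu", "individual"),
    ("genes", "gene", "gene"),
    ("lead", "lead", "lead"),
    ("picture", "pictur", "picture"),
    ("expression", "express", "expression"),
    ("biologically", "biolog", "biologically"),
    ("transparent", "transpar", "transparent"),
    ("accessible", "access", "accessible"),
    ("interpretation", "interpret", "interpretation"),
]

# Words whose stem or lemma differs from the original token.
changed_by_stem = [word for word, stem, _ in results if stem != word]
changed_by_lemma = [word for word, _, lemma in results if lemma != word]

print(len(changed_by_stem), "of", len(results), "words changed by stemming")
print(len(changed_by_lemma), "of", len(results), "words changed by lemmatization:", changed_by_lemma)
```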
