## Stems and lemmas from GoT

We have a couple of sentences from George R.R. Martin's A Game of Thrones.

Stemming reduces a word to its root form, which need not be a real word, whereas lemmatization returns an actual dictionary word. Speed can differ significantly between the two methods, with stemming usually being much faster.

In [1]:
GoT='Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armour yourself in it, and it will never be used to hurt you.'

In [2]:
# Import the required packages from nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize

porter = PorterStemmer()
WNlemmatizer = WordNetLemmatizer()

# Tokenize the GoT string
tokens = word_tokenize(GoT) 

## Stemming

In [3]:
import time

# Log the start time
start_time = time.time()

# Build a stemmed list
stemmed_tokens = [porter.stem(token) for token in tokens]

# Log the end time
end_time = time.time()

print('Time taken for stemming in seconds: ', end_time - start_time)
print('Stemmed tokens: ', stemmed_tokens) 

Time taken for stemming in seconds:  0.0009968280792236328
Stemmed tokens:  ['never', 'forget', 'what', 'you', 'are', ',', 'for', 'sure', 'the', 'world', 'will', 'not', '.', 'make', 'it', 'your', 'strength', '.', 'then', 'it', 'can', 'never', 'be', 'your', 'weak', '.', 'armour', 'yourself', 'in', 'it', ',', 'and', 'it', 'will', 'never', 'be', 'use', 'to', 'hurt', 'you', '.']
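
A single `time.time()` difference is a noisy measurement and also captures any one-off setup cost. As a rough sketch (the helper name `time_per_run` and the stand-in workload are assumptions for illustration, not part of the original notebook), the timing could be repeated with `time.perf_counter` after a warm-up call:

```python
import time

def time_per_run(func, items, repeats=5):
    """Return the best wall-clock time in seconds over several passes of func over items."""
    func(items[0])  # warm-up call so one-off setup cost is not counted
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            func(item)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload (assumption); in the notebook this would be
# porter.stem or WNlemmatizer.lemmatize applied to `tokens`
sample_tokens = ['never', 'forget', 'what', 'you', 'are']
elapsed = time_per_run(str.upper, sample_tokens)
print('Best of 5 runs in seconds: ', elapsed)
```

Taking the minimum over several runs is one common convention; the standard library's `timeit` module offers a similar approach.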


## Lemmatization

In [4]:
import time

# Log the start time
start_time = time.time()

# Build a lemmatized list
lem_tokens = [WNlemmatizer.lemmatize(token) for token in tokens]

# Log the end time
end_time = time.time()

print('Time taken for lemmatizing in seconds: ', end_time - start_time)
print('Lemmatized tokens: ', lem_tokens) 

Time taken for lemmatizing in seconds:  1.6366140842437744
Lemmatized tokens:  ['Never', 'forget', 'what', 'you', 'are', ',', 'for', 'surely', 'the', 'world', 'will', 'not', '.', 'Make', 'it', 'your', 'strength', '.', 'Then', 'it', 'can', 'never', 'be', 'your', 'weakness', '.', 'Armour', 'yourself', 'in', 'it', ',', 'and', 'it', 'will', 'never', 'be', 'used', 'to', 'hurt', 'you', '.']


## Stemming Spanish reviews of Amazon products

In [6]:
import pandas as pd
reviews = pd.read_csv('amazon_reviews_sample.csv')

In [8]:
from langdetect import detect_langs
languages = [] 

# Loop over the review texts and detect the candidate languages of each
for review in reviews.review:
    languages.append(detect_langs(review))

# Keep only the top language code of each result, e.g. '[es:0.99]' -> 'es'
languages = [str(lang).split(':')[0][1:] for lang in languages]
# Assign the list to a new feature 
reviews['language'] = languages
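
The string clean-up step can be traced on a single value. The sketch below uses hand-written sample strings in the `[code:probability]` form that `str()` of a `detect_langs` result takes; the probabilities and the helper name `top_language` are assumptions for illustration:

```python
def top_language(raw):
    """Extract the first language code from a string like '[es:0.99]'."""
    before_colon = raw.split(':')[0]  # '[es'
    return before_colon[1:]           # drop the leading '[' -> 'es'

# Assumed sample strings, not actual detector output
print(top_language('[es:0.9999971]'))
print(top_language('[en:0.57, es:0.43]'))
```

Because the candidates are listed from most to least probable, only the first code survives. The result objects returned by langdetect also expose a `lang` attribute, which avoids parsing the string form altogether.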

In [15]:
# Keep only the reviews detected as Spanish
non_english_reviews = reviews[reviews['language'] == 'es']

In [16]:
non_english_reviews

Unnamed: 0.1,Unnamed: 0,score,review,language
1259,1259,1,La reencarnación vista por un científico: El ...,es
1260,1260,1,Excelente Libro / Amazing book!!: Este libro ...,es
1261,1261,1,Magnifico libro: Brian Weiss ha dejado una ma...,es
1639,1639,1,El libro mas completo que existe para nosotra...,es
1745,1745,1,Excelente!: Una excelente guía para todos aqu...,es
2486,2486,1,Palabras de aliento para tu caminar con Dios:...,es
2903,2903,1,fabuloso: mil gracias por el producto fabulos...,es
3318,3318,1,Excelentes botas.. excelentes boots: Excelent...,es
3694,3694,0,Why not Spanish ???: Alguien me puede decir p...,es
4820,4820,1,"La mejor película de Moore: A mi juicio, esta...",es


In [18]:
# Import the required packages
from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

# Create a Spanish SnowballStemmer
SpanishStemmer = SnowballStemmer("spanish")

# Tokenize each review
tokens = [word_tokenize(review) for review in non_english_reviews.review]
# Stem the tokens of each review
stemmed_tokens = [[SpanishStemmer.stem(word) for word in review_tokens] for review_tokens in tokens]

# Print the first item of the stemmed tokens
print(stemmed_tokens[0])

['la', 'reencarn', 'vist', 'por', 'un', 'cientif', ':', 'el', 'prim', 'libr', 'del', 'dr.', 'weiss', 'sig', 'siend', 'un', 'gran', 'libr', 'par', 'tod', 'aquell', 'a', 'quien', 'les', 'inquiet', 'el', 'tem', 'de', 'la', 'reencarn', ',', 'asi', 'no', 'cre', 'en', 'ella', '.']
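
The output shows stems such as 'libr' that are not Spanish words. The point of stemming is that inflected forms of the same word collapse to one stem. A small sketch of this, using two word forms chosen here for illustration:

```python
from nltk.stem.snowball import SnowballStemmer

spanish_stemmer = SnowballStemmer('spanish')

# Singular and plural forms (chosen for illustration) should share a stem
forms = ['libro', 'libros']
stems = [spanish_stemmer.stem(word) for word in forms]
print(stems)
```

The supported languages are listed in `SnowballStemmer.languages`.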


## Stemming tweets

In [19]:
tweets_raw = pd.read_csv('tweets.csv')

In [22]:
tweets = tweets_raw.text

In [23]:
# Import the stemmer and the tokenizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize

# Instantiate the stemmer
porter = PorterStemmer()

# Transform the array of tweets to tokens
tokens = [word_tokenize(tweet) for tweet in tweets]
# Stem the list of tokens
stemmed_tokens = [[porter.stem(word) for word in tweet] for tweet in tokens] 
# Print the first element of the list
print(stemmed_tokens[0])

['@', 'virginamerica', 'what', '@', 'dhepburn', 'said', '.']
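
In the output above, `word_tokenize` separates each '@' from the user name that follows it. If handles should stay intact, a tweet-aware tokenizer may be a better fit; NLTK ships `TweetTokenizer` in `nltk.tokenize` for this purpose. The idea can be sketched with a plain regular expression. The pattern, the function name and the sample text (reconstructed from the stemmed output, so its casing is an assumption) are all illustrative:

```python
import re

# Hypothetical pattern: a handle or hashtag, a run of word characters,
# or a single punctuation mark
TOKEN_PATTERN = re.compile(r"[@#]\w+|\w+|[^\w\s]")

def tokenize_tweet(text):
    """Split a tweet into tokens while keeping @handles and #hashtags whole."""
    return TOKEN_PATTERN.findall(text)

sample = '@VirginAmerica What @dhepburn said.'
print(tokenize_tweet(sample))
# ['@VirginAmerica', 'What', '@dhepburn', 'said', '.']
```

This sketch ignores URLs, emoticons and contractions, which a dedicated tweet tokenizer handles more carefully.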
