### [Link to Intro to NLP Slides](https://docs.google.com/presentation/d/1N1cj7IeSfkGjHcYQHEmbj13WAlga7K7O2jTRMr9BmKM/)

# Run the cells below to get set up

In [1]:
import sys, os
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  !rm -rf AI4All2020-Michigan-NLP
  !git clone https://github.com/alahnala/AI4All2020-Michigan-NLP.git
  !cp -r AI4All2020-Michigan-NLP/utils/ .
  !cp -r AI4All2020-Michigan-NLP/Data/ .
  !cp -r AI4All2020-Michigan-NLP/slides/ .
  !cp -r AI4All2020-Michigan-NLP/Experiment-Report-Templates/ .
  !echo "=== Files Copied ==="

In [2]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.stem.snowball import PorterStemmer
from utils.nlp_basics import *
from utils.syllable import *
print('Done')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/allielahnala/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Done


# Outline

1. Tokenization
2. Lemmatization
3. Stemming
4. Part-of-speech tagging
5. Stopwords

![Tokenization](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/tokenization.png)

# Let's play with the string sequence `cake_wikipedia`

## 1. Simplest tokenizer: split on spaces

Run the cell below. Here we split the sequence by spaces. How would you describe these tokens?

In [None]:
# The first few sentences from the wikipedia page on Cake https://en.wikipedia.org/wiki/Cake
cake_wikipedia = 'Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked. In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, and pies.'

# Calling .split() with no arguments splits the string on whitespace (spaces, tabs, newlines)
tokens = cake_wikipedia.split()
show_tokens(tokens)
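
One way to describe these tokens is to check which of them still carry punctuation. The sketch below uses only built-in Python on a shortened version of the sentence; `has_punctuation` is a hypothetical helper name for illustration and is not part of `utils`.

```python
import string

sentence = "Cake is a form of sweet food made from flour, sugar, and other ingredients."

# Split on whitespace only, as in the cell above
tokens = sentence.split()

def has_punctuation(token):
    """Return True if any character in the token is a punctuation mark."""
    return any(char in string.punctuation for char in token)

# Tokens where punctuation stayed glued to the word
attached = [token for token in tokens if has_punctuation(token)]
print(attached)
```

Tokens such as `flour,` and `flour` would be counted as different words, which is one reason to separate punctuation.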


## 2. Split on spaces and separate punctuation from words.

Run the cell below. How would you describe these tokens?

In [None]:
# nltk is an open-source library that anyone can use.
# The name stands for "Natural Language Toolkit", and it has many useful functions
from nltk.tokenize import word_tokenize

# We use nltk's function "word_tokenize"
tokens = word_tokenize(cake_wikipedia)

show_tokens(tokens)
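
To get a feel for how punctuation can be separated from words, here is a minimal sketch using a regular expression. It is not how `word_tokenize` works internally (NLTK uses more detailed rules); `simple_word_tokenize` is a hypothetical name chosen for this example.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

example_tokens = simple_word_tokenize("made from flour, sugar, and other ingredients.")
print(example_tokens)
```

This simple pattern would also break contractions such as `don't` into three pieces, which is the kind of case a full tokenizer handles with extra rules.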

## 3. Split on syllables.

Run the cell below. How would you describe these tokens?

In [None]:
syllable_tokenize = SyllableTokenizer()
tokens = syllable_tokenize.tokenize(cake_wikipedia)

# Show table
show_tokens(tokens)

## 4. Challenge: What are some tokenization considerations to make if you're working with tweets?

Try making a tokenizer that keeps hashtags with the # and user handles with the @.

In [None]:
tweet = '@RiikkaTheCat is a #CoolCat :D:)'

def tokenizer(string):
    
    ## Your code (use as many lines as you like)
    tokens = []  # placeholder so the cell runs; replace with your tokenization logic
    
    
    return tokens


tokens = tokenizer(tweet)
print(tokens)
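
If you get stuck, the sketch below shows one possible idea on a different, made-up tweet: try the hashtag/handle pattern before the plain-word pattern. The name `tweet_tokenize_sketch` and the example tweet are assumptions for illustration, and emoticons such as `:D` are deliberately not handled.

```python
import re

def tweet_tokenize_sketch(text):
    """Tokenize text while keeping '#' and '@' attached to the word that follows."""
    # Alternatives are tried left to right:
    #   [@#]\w+  a hashtag or user handle
    #   \w+      an ordinary word
    #   [^\w\s]  a single leftover punctuation character
    return re.findall(r"[@#]\w+|\w+|[^\w\s]", text)

example_tweet = "Thanks @nlp_fan for the #GreatTalk!"
tweet_tokens = tweet_tokenize_sketch(example_tweet)
print(tweet_tokens)
```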

## Would tokenization in English look the same as in other languages?

In [None]:
french = "C'est en effet tout à fait dans la ligne des positions que notre Parlement a toujours adoptées."


In [None]:
tokens = french.split()
show_tokens(tokens)

In [None]:
tokens = word_tokenize(french, language='french')
show_tokens(tokens)
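
French often joins two words with an apostrophe (for example `C'est` is "ce" + "est"), so splitting on spaces alone keeps them together. The sketch below is one assumed way to separate them with a regular expression; it does not reproduce NLTK's behavior, and `split_elisions` is a hypothetical name.

```python
import re

phrase = "C'est en effet tout à fait dans la ligne"

# Splitting on whitespace keeps the elided pair as a single token
whitespace_tokens = phrase.split()

def split_elisions(text):
    """Tokenize text, keeping an elided word together with its apostrophe."""
    # \w+' matches a word followed by an apostrophe (e.g. "C'"),
    # then plain words, then single punctuation characters
    return re.findall(r"\w+'|\w+|[^\w\s]", text)

elision_tokens = split_elisions(phrase)
print(whitespace_tokens[0])
print(elision_tokens[:2])
```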

![Lemmatization](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/lemmas.png)

# Lemmatization

In [None]:
# Word-level tokens from nltk; these are reused in the Stemming section below
tokens = word_tokenize(cake_wikipedia)

import spacy

# Load spaCy's small English pipeline to obtain linguistic features.
# The parser and named-entity recognizer are disabled because they are not needed here.
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

doc = nlp(cake_wikipedia)

# Keep only the tokens whose lemma differs from the lowercased original text.
# Both lists come from spaCy's own tokens, so each original stays aligned with its lemma.
changed = [(token.text, token.lemma_) for token in doc if token.text.lower() != token.lemma_]
og = [text for text, lemma in changed]
lemmas_diff = [lemma for text, lemma in changed]

# Show table
show_lemmas(og, lemmas_diff)
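
The idea behind lemmatization can be illustrated with a tiny lookup table that maps each word form to its dictionary form. This is only a toy sketch: the `TOY_LEMMAS` table and `toy_lemmatize` function are made up for this example, and spaCy's lemmatizer uses much larger resources plus part-of-speech information.

```python
# Hand-written table of word forms and their dictionary forms (lemmas)
TOY_LEMMAS = {
    "is": "be",
    "were": "be",
    "made": "make",
    "cakes": "cake",
    "forms": "form",
    "oldest": "old",
}

def toy_lemmatize(token):
    """Look up the lemma of a token; fall back to the lowercased token."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)

words = ["Cakes", "were", "modifications", "of", "bread"]
toy_lemmas = [toy_lemmatize(word) for word in words]
print(toy_lemmas)
```

Words missing from the table, such as `modifications`, pass through unchanged, which shows why a real lemmatizer needs broad vocabulary coverage.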


![Stemming](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/stemming.png)

# Stemming

In [None]:
# Create a stemmer object that will stem the text for us
stemmer = PorterStemmer()

# Use the stemmer on our text
stemmed = [stemmer.stem(token) for token in tokens]


# Here we are making a list of original tokens and a list of stemmed tokens for only the tokens that changed after stemming
og = [token for token, stem in zip(tokens, stemmed) if token.lower() != stem]
stemmed_diff = [stem for token, stem in zip(tokens, stemmed) if token.lower() != stem]

# Put stemmed data and text in a dataframe so we can output a table
data = {'Stems': stemmed_diff, 'Text':og}
df = pd.DataFrame(data, columns = ['Text', 'Stems'])

# Show table
df.T
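
Stemming works by chopping off word endings with rules, without checking whether the result is a real word. The sketch below is a toy suffix stripper with a handful of assumed rules; it is not the Porter algorithm, and `SUFFIX_RULES` and `toy_stem` are hypothetical names.

```python
# (suffix, replacement) pairs, checked in order; only the first match is applied
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_length=3):
    """Strip one common suffix, unless that would leave a very short stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Leave short words such as "is" untouched
            return stem if len(stem) >= min_stem_length else word
    return word

for example in ["cakes", "pastries", "baked", "is", "sugar"]:
    print(example, "->", toy_stem(example))
```

Notice that `pastries` becomes `pastri`, which is not a dictionary word. That is the main difference from lemmatization, which aims for real dictionary forms.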

![Part-of-Speech Tagging](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/pos.png)

# Part-of-speech tagging

In [None]:
# https://en.wikipedia.org/wiki/Cake
cake_wikipedia = 'Cake is a form of sweet food made from flour, sugar, and other ingredients, that is usually baked. In their oldest forms, cakes were modifications of bread, but cakes now cover a wide range of preparations that can be simple or elaborate, and that share features with other desserts such as pastries, meringues, custards, and pies.'

# Uses nlp pipeline from spacy to obtain linguistic features
doc = nlp(cake_wikipedia)

# Collect one column of features per token
data = {
    'Text': [token.text for token in doc],
    'Lemma': [token.lemma_ for token in doc],
    'Part-of-speech': [token.pos_ for token in doc],
    'Dependency': [token.dep_ for token in doc],  # empty here because the parser is disabled above
    'Shape': [token.shape_ for token in doc],
    'Is Alpha': [token.is_alpha for token in doc],
    'Stopword': [token.is_stop for token in doc],
}
df = pd.DataFrame(data, columns=['Text', 'Part-of-speech'])


df.T # show data (T means transpose, excluding the T is fine too)
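
Once every token has a tag, the tags can be counted or used as a filter. The sketch below uses a few word/tag pairs assigned by hand with Universal POS tags for illustration (they are not spaCy output), and keeps only the content-word tags.

```python
from collections import Counter

# Hand-assigned (word, tag) pairs for the start of the sentence, for illustration only
tagged = [
    ("Cake", "NOUN"), ("is", "AUX"), ("a", "DET"), ("form", "NOUN"),
    ("of", "ADP"), ("sweet", "ADJ"), ("food", "NOUN"),
]

# How many times does each tag appear?
tag_counts = Counter(tag for _, tag in tagged)

# Keep only nouns, adjectives, verbs, and adverbs
content_tags = {"NOUN", "ADJ", "VERB", "ADV"}
content_words = [word for word, tag in tagged if tag in content_tags]

print(tag_counts)
print(content_words)
```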

![Removing Stopwords](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/stopwords.png)

# Stopwords

Run the cell below. A word is a stopword if it has **True** in the Stopword row.

In [None]:
df = pd.DataFrame(data, columns=['Text', 'Stopword'])
df.T

Run the cell below to observe just the stopwords in our text

In [None]:
stopwords = df.loc[df['Stopword']]
stopwords.T
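
Removing stopwords is then a matter of filtering them out. The sketch below uses a small hand-picked stopword set rather than spaCy's list, so `TOY_STOPWORDS` is an assumption made for this example.

```python
# A tiny hand-picked stopword set; real lists contain hundreds of words
TOY_STOPWORDS = {"is", "a", "of", "from", "and", "other", "that"}

example_tokens = ["Cake", "is", "a", "form", "of", "sweet", "food"]

# Compare in lowercase so "Is" and "is" are treated the same way
kept = [token for token in example_tokens if token.lower() not in TOY_STOPWORDS]
print(kept)
```

The words that remain carry most of the meaning, which is why stopwords are often dropped before counting or comparing words.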

# References

1. https://www.nltk.org/api/nltk.tokenize.html
2. https://www.nltk.org/_modules/nltk/tokenize/sonority_sequencing.html#SyllableTokenizer
3. https://spacy.io/api/lemmatizer
4. https://spacy.io/usage/linguistic-features
5. https://universaldependencies.org/docs/u/pos/