# NLP Basics

In [None]:
import nltk



The goal of this section is to get started with textual data. We will begin preprocessing data using the **Natural Language Toolkit (NLTK)**.

By the end of this session you will be able to work with textual data, and you will know several of the steps involved in pre-processing it, such as:

- Cleaning
- Tokenizing
- Segmentation
- Normalizing
- Stemming


For this part of the session we will use NLTK and the data it provides. To get the relevant data, use NLTK's download function and download the collection related to the NLTK book:
1. d (download function)
2. book (download the content related to the NLTK book)
3. q (quit)

This will download all the data we need for this section.

In [None]:
nltk.download_shell()

## Cleaning Data 

Cleaning the data and preparing it for the next steps is the first task in many applications. We will perform the following steps:
1. Load the raw text.
2. Split into tokens.
3. Convert to lowercase.
4. Remove punctuation from each token.
5. Filter out remaining tokens that are not alphabetic.
6. Filter out tokens that are stop words.
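
Steps 2 to 6 can be sketched on a short string before applying them to real HTML. This is a minimal illustration using only the standard library: the sample sentence and the tiny `TOY_STOP_WORDS` set are made up for this example, and whitespace splitting stands in for NLTK's tokenizer.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# the notebook itself uses NLTK's English stop word corpus.
TOY_STOP_WORDS = {"the", "is", "a", "of", "in", "and", "to", "at"}

def clean_tokens(raw_text):
    """Apply steps 2-6 to a raw string and return the surviving tokens."""
    tokens = raw_text.split()                        # 2. split on whitespace
    tokens = [t.lower() for t in tokens]             # 3. convert to lowercase
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]    # 4. strip punctuation
    tokens = [t for t in tokens if t.isalpha()]      # 5. keep alphabetic tokens only
    return [t for t in tokens if t not in TOY_STOP_WORDS]  # 6. drop stop words

sample = "The tour of classic Madrid starts at 10, in the old town!"
print(clean_tokens(sample))
```

Note that step 5 also removes tokens such as numbers, which may or may not be what a given application wants.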


In [None]:
#get some HTML data
url = "http://www.madrid-guide-spain.com/classic-madrid.html"

In [None]:
from urllib import request
html = request.urlopen(url).read()
html[:60]

To clean the HTML we will use a Python library called Beautiful Soup, which is generally very handy when working with HTML. More information is available at:
http://www.crummy.com/software/BeautifulSoup


In [None]:
#!conda install beautifulsoup4 -y

In [None]:
from bs4 import BeautifulSoup

In [None]:
raw = BeautifulSoup(html,"lxml").get_text()

In [None]:
raw

## Tokenizing

Tokenizing first splits our text into sentences and then into word tokens.

In [None]:
from nltk.tokenize import word_tokenize
from nltk import sent_tokenize
sentences = sent_tokenize(raw)
print(len(sentences))
print(sentences[5])


In [None]:
tokenized_docs = [word_tokenize(sentence.lower()) for sentence in sentences] 
print(len(tokenized_docs))
print(tokenized_docs[27])

## Stop Word removal

Stop words are highly frequent words like "the" or "to". They usually carry little lexical content, and in many cases we want to remove them from our text. NLTK comes with a stop words corpus containing commonly agreed upon stop words for a variety of languages.

In [None]:
# keep the list under its own name so it does not shadow the stopwords corpus module
stop_words = nltk.corpus.stopwords.words('english')

In [None]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
stop_words=stopwords.words('english')
# optionally treat punctuation symbols as stop words too (stop_words is a list, so use extend)
# stop_words.extend(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])

In [None]:
def removeStopWords(words):
    # keep only the tokens that are not in the stop word list
    words = [w for w in words if w not in stop_words]
    return words

In [None]:
clean_text = [removeStopWords(sentence) for sentence in tokenized_docs]

In [None]:
print(clean_text[1])

An alternative to dictionary-based stop word removal is to remove stop words based on a POS tagger; we will see this later.

Another normalization task involves identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.
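
The mapping described above can be sketched with two regular expressions. This is only an illustration: the patterns `DECIMAL` and `ACRONYM`, the helper `normalize_token`, and the sample tokens are assumptions made for this example, and a real system would need more careful rules (dates, abbreviations with periods, and so on).

```python
import re

# Assumed, simplified patterns for this sketch.
DECIMAL = re.compile(r"^\d+\.\d+$")   # decimal numbers such as 2.5
ACRONYM = re.compile(r"^[A-Z]{2,}$")  # all-caps tokens such as UN

def normalize_token(token):
    """Map decimal numbers to '0.0' and acronyms to 'AAA'; leave the rest unchanged."""
    if DECIMAL.match(token):
        return "0.0"
    if ACRONYM.match(token):
        return "AAA"
    return token

tokens = ["The", "UN", "report", "cites", "a", "2.5", "percent", "rise"]
print([normalize_token(t) for t in tokens])
```

Normalization has to run before lowercasing, otherwise the acronym pattern can no longer match.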

In [None]:
# alternative stop word removal: check the POS tagset
#https://stackoverflow.com/questions/19130512/stopword-removal-with-nltk

## Stemming
NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer does not.
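
To see why a hand-written stemmer is a poor substitute, consider a naive suffix stripper. The suffix list and the helper name `naive_stem` are illustrative assumptions for this sketch, not part of NLTK.

```python
# Assumed suffix list; the first matching suffix is stripped.
SUFFIXES = ["ing", "ly", "ed", "ious", "ies", "ive", "es", "s", "ment"]

def naive_stem(word):
    """Strip the first matching suffix, with no handling of irregular cases."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

for word in ["ponds", "lying", "is", "basis"]:
    print(word, "->", naive_stem(word))
```

The sketch handles "ponds" as expected but reduces "lying" to "ly" and "is" to "i", which is exactly the kind of irregular case the off-the-shelf stemmers are built to handle.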

In [None]:
import nltk
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

In [None]:
print(raw)

In [None]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.SnowballStemmer('english')
print([porter.stem(t) for t in tokens])

In [None]:
print([lancaster.stem(t) for t in tokens])

In [None]:
print([snowball.stem(t) for t in tokens])

In [None]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t, pos="n") for t in tokens])

## Task:
Try out different sentences and different values for the "pos" parameter. How does the output change?

Possible values are:
'a', 's', 'r', 'n', 'v' -> ADJ, ADJ_SAT, ADV, NOUN, VERB

How does this change the lemmatization?


## Tagging

Part-of-speech (POS) tagging is one of the most important text analysis tasks: it classifies words by their part of speech and labels them accordingly.

In [None]:
review = """WOW! The best word that describes this movie is "wow"! 
Not only to say that this is the best Action movie of all time, 
this is probably one of the greatest movies ever made. 
The people in my country watched this film when there where limited VHS cassettes at all. 
And again, my favorite Director did an timeless epic-masterpiece. 
Yes, an epic. Every scene in this movie is beyond the perfection. The timeless plot. 
Groundbreaking effects. Unforgettable "Hasta la vista, baby." .

Perfect direction for a sci-fix action film.
When the action starts, you're in for the ride of your life. 
There never be the same movie like T2. 
What else I can say about this film? A Must see for everyone."""

In [None]:
text = word_tokenize(sent_tokenize(review)[2])
nltk.pos_tag(text)

All the labels are described in NLTK's help. To get the description of all POS classes, call:

In [None]:
nltk.help.upenn_tagset('.*')

## Task: Extract all the verbs from the review text

In [None]:
for sent in sent_tokenize(review):
    for word, tag in nltk.pos_tag(word_tokenize(sent)):
        # all Penn Treebank verb tags start with 'VB'
        if tag.startswith('VB'):
            print((word, tag))

## Extra Task: Combine the POS tags with the lemmatizer to lemmatize everything based on its POS tag

In [None]:
#'a', 's', 'r', 'n', 'v' -> ADJ, ADJ_SAT, ADV, NOUN, VERB
def get_pos(tag):
    if tag.startswith('J'):
        return "a"
    elif tag.startswith('V'):
        return "v"
    elif tag.startswith('N'):
        return "n"
    elif tag.startswith('R'):
        return "r"
    else:
        return "n"
tagged_text = nltk.pos_tag(text)
for tagged_elem in tagged_text:
    print(wnl.lemmatize(tagged_elem[0], pos=get_pos(tagged_elem[1])))

## NER - Named Entity Recognition

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities.

NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function nltk.ne_chunk(). If we set the parameter binary=True, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

In [None]:
#http://www.bbc.com/future/story/20180126-meet-the-motorbike-racing-robot
sent="""Yamaha’s initial concept was a “humanoid robot that can ride a motorcycle autonomously” and the company teamed up with SRI International to achieve its vision. SRI, the Stanford Research Institute, as it was originally known, was founded in 1946 to be the cutting edge of innovation in Silicon Valley"""

In [None]:
sent_tokenized=word_tokenize(sent)
print(sent_tokenized)

In [None]:
sent=nltk.pos_tag(sent_tokenized)
print(sent)

In [None]:
print(nltk.ne_chunk(sent, binary=False))

## Task:

Try different sentences. Can you find sentences for all kinds of named entities NLTK can detect?
* Geographical Entity
* Organization
* Person
* Geopolitical Entity
* Time indicator
* Artifact
* Event
* Natural Phenomenon

# This is the end of Part 1

# NLP & Supervised learning

We will start with an easy example of supervised learning: we will build a classifier that predicts the gender of a name.

Let's start by loading the Names corpus and having a look at some of its attributes.

In [None]:
import nltk

In [None]:
names = nltk.corpus.names

In [None]:
names.fileids()

In [None]:
male_names = names.words('male.txt')
female_names = names.words('female.txt')

We can also have a look at names that are ambiguous for gender:


In [None]:
print([w for w in male_names if w in female_names])

It is well known that names ending in the letter a are almost always female. We can see this and some other patterns in a graph produced by the following code.

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
        for name in names.words(fileid))


In [None]:
cfd.plot()
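
What the conditional frequency distribution counts can be sketched with the standard library alone. The mini-sample `toy_names` below is a made-up stand-in for the two corpus files, and `last_letter_counts` is a name chosen for this example.

```python
from collections import Counter, defaultdict

# Hypothetical mini-sample standing in for the male.txt / female.txt word lists.
toy_names = {
    "male.txt": ["Patrick", "David", "Joshua", "Peter"],
    "female.txt": ["Anna", "Maria", "Sophie", "Laura"],
}

# One Counter of final letters per condition (here: per file id).
last_letter_counts = defaultdict(Counter)
for fileid, words in toy_names.items():
    for name in words:
        last_letter_counts[fileid][name[-1]] += 1

print(last_letter_counts["female.txt"].most_common(1))
print(last_letter_counts["male.txt"]["a"])
```

Each condition gets its own frequency table, which is what the plot above draws as one line per file.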

The first step in building our classifier is to decide which features to use and how to encode them. We will start by looking at the final letter.

In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Patrick')

In [None]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [None]:
import random
random.shuffle(labeled_names)

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier.

In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

In [None]:
train_set, test_set = featuresets[500:], featuresets[:500]
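
Because the split depends on a random shuffle, the accuracy changes from run to run. A minimal sketch of a repeatable split is shown below; the helper `shuffled_split` and its `seed` parameter are assumptions for this example and not part of NLTK.

```python
import random

def shuffled_split(items, test_size, seed=0):
    """Shuffle a copy of items with a fixed seed and return (train, test)."""
    rng = random.Random(seed)   # private generator, leaves the global state alone
    shuffled = list(items)      # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

train, test = shuffled_split(range(10), test_size=3, seed=42)
print(len(train), len(test))
```

Calling the helper again with the same seed returns the same split, which makes results comparable across feature sets.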

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
classifier.classify(gender_features('Patrick'))

In [None]:
classifier.classify(gender_features('Sophie'))

We can systematically evaluate the classifier on a much larger quantity of unseen data:

In [None]:
print(nltk.classify.accuracy(classifier, test_set))
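
To make the training and evaluation steps less of a black box, here is a simplified naive Bayes sketch in plain Python. It is not NLTK's implementation: the add-one smoothing, the toy name list, and all function names (`last_letter_features`, `train_toy_naive_bayes`, `toy_classify`, `toy_accuracy`) are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

def last_letter_features(name):
    return {"last_letter": name[-1].lower()}

def train_toy_naive_bayes(labeled_featuresets):
    """Count labels and, per (label, feature name), how often each value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, value in features.items():
            value_counts[(label, fname)][value] += 1
            values_seen[fname].add(value)
    return label_counts, value_counts, values_seen

def toy_classify(model, features):
    """Return the label with the highest log prior plus log likelihoods."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for fname, value in features.items():
            # add-one smoothing (an assumption of this sketch) avoids log(0)
            numerator = value_counts[(label, fname)][value] + 1
            denominator = count + len(values_seen[fname])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

def toy_accuracy(model, test_set):
    """Fraction of test items whose predicted label matches the true label."""
    hits = sum(toy_classify(model, feats) == label for feats, label in test_set)
    return hits / len(test_set)

# Made-up training and test names for illustration only.
toy_train = [(last_letter_features(n), g) for n, g in [
    ("Anna", "female"), ("Maria", "female"), ("Laura", "female"), ("Sophie", "female"),
    ("Patrick", "male"), ("David", "male"), ("Peter", "male"), ("Mark", "male")]]
toy_test = [(last_letter_features("Julia"), "female"),
            (last_letter_features("Frank"), "male")]

toy_model = train_toy_naive_bayes(toy_train)
print(toy_classify(toy_model, last_letter_features("Julia")))
print(toy_accuracy(toy_model, toy_test))
```

Accuracy here is simply the fraction of test items labeled correctly; with two hand-picked test names it says nothing about real-world performance.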

We can also have a look at the most informative features:

In [None]:
classifier.show_most_informative_features(20)

## Task: Modify the gender_features() 
Modify the function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.


In [None]:
def gender_features(word):
    return {'last_letter': word[-1],
            'first_two': word[:2],     # first two letters (word[:-2] would drop the last two instead)
            'last_two': word[-2:],
            'last_three': word[-3:],
            'length': len(word),
            'first_letter': word[0],   # index 0 is the first letter
           }

In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

We can also try another classifier, for instance a linear SVM from sklearn.

In [None]:
import nltk.classify
from sklearn.svm import LinearSVC

In [None]:
classifier = nltk.classify.SklearnClassifier(LinearSVC(C=0.1,class_weight="balanced"))
classifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# Movie Review Data

Let's have a look at a more complex problem: sentiment analysis.

Sentiment analysis can be seen as a form of text classification. The goal is to detect whether a text is positive or negative (or neutral).
NLTK already comes with a large set of textual documents, including some data for sentiment analysis, which we will use in the remainder of this session.

In [None]:
from nltk.corpus import movie_reviews
from collections import defaultdict

In [None]:
#create a list with movie reviews from the NLTK corpus
# every element in the list is a list with the words of the review and the sentiment (pos/neg)
documents = [(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)]
len(documents)

Overall we have a set of 2000 movie reviews.

In [None]:
text = movie_reviews.raw()

In [None]:
text[0:1000]

In [None]:
documents = defaultdict(list)
for i in movie_reviews.fileids():
    documents[i.split('/')[0]].append(i)
print(documents['pos'][:10])  # first ten pos reviews

print(documents['neg'][:10])  # first ten neg reviews

In [None]:
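import random
import string
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # set lookups are faster than list lookups

documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]
# fileids() lists all 'neg' reviews before the 'pos' ones, so shuffle before any train/test split
random.shuffle(documents)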

In [None]:
#from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
from nltk.probability import FreqDist

all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(word for (word,count) in all_words.most_common(2000))

In [None]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
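
The same idea can be tried on a toy corpus with the standard library, where `Counter` plays the role of `FreqDist`. The two toy reviews, the vocabulary size of 3, and the helper name `contains_features` are assumptions for this sketch.

```python
from collections import Counter

# Made-up mini corpus: (tokens, label) pairs.
toy_reviews = [
    (["great", "plot", "great", "cast"], "pos"),
    (["boring", "plot", "weak", "cast"], "neg"),
]

# The most frequent words across all reviews form the vocabulary.
word_counts = Counter(w for words, _ in toy_reviews for w in words)
vocabulary = [w for w, _ in word_counts.most_common(3)]

def contains_features(document, vocabulary):
    """One boolean feature per vocabulary word: is it present in the document?"""
    document_words = set(document)  # set lookup is faster than scanning a list
    return {"contains({})".format(w): (w in document_words) for w in vocabulary}

print(contains_features(["great", "film"], vocabulary))
```

Words outside the vocabulary, such as "film" here, are ignored entirely, which is why the choice of the vocabulary size matters.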

In [None]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt'))) 

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
classifier.show_most_informative_features(50)

The current version gets an accuracy of about 0.88 (the exact value depends on how the data is split). Let's see how much better we can get by applying some of the methods we have learned, e.g. stemming or POS tagging.
We could also change the features to use not only words but also bigrams, or try something completely different.
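
As a starting point for the bigram idea, a feature extractor could pair each token with its successor. The function name `bigram_features` and the feature key format are hypothetical choices for this sketch; the resulting dictionary could be merged with the word features before training.

```python
def bigram_features(document):
    """Return one boolean feature per adjacent token pair in the document."""
    tokens = list(document)  # works for lists and other iterables of tokens
    return {"bigram({} {})".format(a, b): True
            for a, b in zip(tokens, tokens[1:])}

print(bigram_features(["not", "a", "good", "movie"]))
```

Bigrams such as "not a" or "a good" keep some word order, which single-word features throw away.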

## Lemmatizing

In [None]:
wnl = nltk.WordNetLemmatizer()

In [None]:
all_words = FreqDist(wnl.lemmatize(w.lower()) for w in movie_reviews.words())
word_features = list(word for (word,count) in all_words.most_common(2000))

In [None]:
def document_features(document):
    words=[wnl.lemmatize(word) for word in document]
    document_words = set(words)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [None]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

## Using only adjectives

In [None]:
adjectives=[]
for sent in movie_reviews.sents():
    for word, pos in nltk.pos_tag(sent):
        if pos in ["JJ", "JJR", "JJS"]: # feel free to add any other tags
            adjectives.append(word)

In [None]:
adjectives

In [None]:
all_words = FreqDist(wnl.lemmatize(w.lower(),pos="a") for w in adjectives)
word_features = list(word for (word,count) in all_words.most_common(2000))

In [None]:
word_features

In [None]:
def document_features(document):
    words=[wnl.lemmatize(word.lower(),pos="a") for word in document]
    document_words = set(words)
    features = {}
    for word in word_features:
        features['contains({})'.format(word.lower())] = (word.lower() in document_words)
    return features

In [None]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [None]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
classifier.show_most_informative_features(50)

# Just try other things 