# SCS 3546 Week 6 - Natural Language Processing

## Introduction

- Develop some familiarity with several of the key concepts in NLP
- Understand regular expressions and their applications
- Look at some of the features of the Natural Language Toolkit (NLTK)

## What is Natural Language Processing?

- Natural Language Processing (NLP) is the study of the computational treatment of natural (human) language


- NLP applications include:
    - Text Classification and Categorization
    - Machine Translation (e.g. Google Translate)
    - Chatbots (e.g. IBM Watson-powered chatbot)
    - Assistants (Siri, Google Home, Alexa)
    - Sentiment analysis
    - Search engines (Google, Yahoo!)
    - Content Recommendation
    - Many more!

## Ambiguity in Language

Ambiguity is one of the main obstacles to interpreting natural language. Each of the following sentences can be read in more than one way.

- "Teacher Strikes Idle Kids."
- "Stolen Painting Found by Tree."
- "Local High School Dropouts Cut in Half."
- “Eats shoots and leaves.”

Related challenges and observations:

- Word sense can sometimes be used to disambiguate
- Pronoun resolution: “I threw a brick through the window.  It broke.”
- Some languages don’t have word boundaries
- Common misspellings
- Poetry
- Grammar is a description, not a prescription

## Document Similarity

Look at the following table. How can we find similarities between Text 1 and Text 2?

| Text 1 | Text 2 |
|:---:|:---:|
| As far as my interest in ancient artifacts is concerned, there were many who considered my focus to be other-worldly. I’d spend hours in my room looking at the pieces in my collection and thinking about who might have held that object in their hand and for what purpose? In some moments I could actually believe I was one of our prehistoric brethren. | The other world was one of artifacts, objects and other abstractions. If you consider the purpose of these constructs: collections, lists, and so on, you may come to the conclusion that technology simply mirrors things we see in our everyday existence. We have room to contemplate these things because we have time on our hands. |
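
One simple way to approach the question is a bag-of-words comparison: count the words in each text and compute the cosine similarity of the two count vectors. The block below is a minimal sketch using only the standard library. The helper names `bag_of_words` and `cosine_similarity` are illustrative, and the two strings are shortened excerpts of the texts in the table.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count each word (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 means no shared words)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Shortened excerpts of Text 1 and Text 2 (illustrative only).
text_1 = "I'd spend hours in my room looking at the pieces in my collection"
text_2 = "We have room to contemplate these things because we have time on our hands"

bow_1, bow_2 = bag_of_words(text_1), bag_of_words(text_2)
print(sorted(set(bow_1) & set(bow_2)))            # words the two excerpts share
print(round(cosine_similarity(bow_1, bow_2), 3))  # similarity score between 0 and 1
```

The only word the two excerpts share is "room", and it is used in two different senses (a physical room versus having room to do something). Raw word overlap cannot tell these apart, which is one reason the preprocessing steps and ambiguity issues in these notes matter.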

## NLP Pipeline

The figure below shows the main building blocks of a natural language data pipeline.
![image.png](attachment:image.png)

##  Regular Expressions

- Regular expressions help you find and match patterns in text
    - Example: finding all the email addresses in a webpage
    - An email address is a string that has exactly one @ sign and at least one . in the part after the @
    - This description is too general for full validation but is mostly enough for a basic check. The corresponding regular expression is as follows
    - Email pattern: **[^@]+@[^@]+\.[^@]+**
    - The pattern means: at least one character that is not @ ([^@]+), followed by an @ sign, followed by at least one character that is not @ ([^@]+), followed by a single . , followed again by at least one character that is not @ ([^@]+)
    - A stricter pattern that also disallows whitespace inside email addresses is as follows:
    - Email pattern: **[^@\s]+@[^@\s]+\.[^@\s]+**
    
- Some regular expression syntax examples
    - **.**	Matches any character
    - **^abc**	Matches some pattern abc at start of a string
    - **abc$**	Matches some pattern abc at the end of a string
    - **[abc]**	Matches one of a set of characters
    - **[A-Z0-9]**	Matches one of a range of characters
    - **ed|ing|s**	Matches one of the specified strings
    - **\***	Matches zero or more of the previous item
    - **\+**	Matches one or more of the previous item
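
The syntax elements listed above can be tried out with Python's `re` module. The following is a small sketch; the word list and the helper `matching` are made up for illustration.

```python
import re

words = ["abcdef", "xyzabc", "walked", "walking", "walks", "B52", "aaa"]


def matching(pattern):
    """Return the words in which `pattern` is found anywhere (uses re.search)."""
    return [w for w in words if re.search(pattern, w)]


print(matching(r"^abc"))          # ['abcdef']  "abc" at the start
print(matching(r"abc$"))          # ['xyzabc']  "abc" at the end
print(matching(r"(ed|ing|s)$"))   # ['walked', 'walking', 'walks']  one of the suffixes
print(matching(r"^[A-Z0-9]+$"))   # ['B52']  only upper-case letters and digits
print(matching(r"wa.k"))          # ['walked', 'walking', 'walks']  "." is any character

# "*" accepts zero repetitions of the previous item, "+" needs at least one.
print(bool(re.fullmatch(r"ab*", "a")), bool(re.fullmatch(r"ab+", "a")))  # True False
```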

#  Regular Expressions in Python

- Take a crash course to learn the most commonly used syntax for writing the patterns
    - [Regex tutorial — A quick cheatsheet by examples](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)
    - [Python Regex Cheatsheet](https://www.debuggex.com/cheatsheet/regex/python)
    - In Python, the **re** library can be used. It has many functions; the following are among the most common
        - re.match()
        - re.search()
    - match() only checks whether the pattern matches at the beginning of the string, while search() scans forward through the string for the first match. Keep this distinction in mind: match() reports a match only if it starts at index 0; if the match would start anywhere else, match() returns None.

In [None]:
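# Let's see how we can use the re library to locate a pattern within a string
import re
print(re.match('super', 'superstition').span())   # (0, 5): match at the start of the string
print(re.match('super', 'insuperable'))           # None: 'super' does not start at index 0
print(re.search('super', 'superstition').span())  # (0, 5)
print(re.search('super', 'insuperable').span())   # (2, 7): found further into the string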

### Finding email addresses in a text document using Regular Expressions

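The following code shows how we can use the **re** library to extract all the email addresses that match the pattern from a source document. The obfuscated address `me[at]gmail[dot]com` contains no @ sign, so it is not matched.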

In [None]:
# Finding email addresses in a string
import re
string = "Hello friend, You can send me an email either to example@me.com or to me@example.com or me[at]gmail[dot]com"
# findall returns all non-overlapping matches of pattern in string, as a list of strings.
# The raw string (r"...") passes the backslashes to the regex engine unchanged;
# [^@\s] excludes both @ and whitespace in every part of the address.
res = re.findall(r"[^@\s]+@[^@\s]+\.[^@\s]+", string)
res

# NLP Preprocessing Steps


## Word Tokenization
- Definition: Splitting a string or document into tokens (smaller parts such as sentences and words)
    - Split on white space (in most languages)
    - Regular expressions and finite state automata are often used
    - Special cases: Names (particularly multi-word), initials, hyphenated words, abbreviations, special forms (dates, phone numbers, URLs, etc.)
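
To make the special cases concrete, here is a minimal sketch of a regular-expression tokenizer. The pattern and the `tokenize` helper are illustrative assumptions, not the rules used by NLTK's `word_tokenize`. They cover only a few of the special forms listed above.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+                 # URLs
  | \d{4}-\d{2}-\d{2}            # ISO-style dates such as 2018-05-17
  | (?:[A-Za-z]\.){2,}           # abbreviations and initials such as U.S.A.
  | \w+(?:[-']\w+)*              # words, including hyphenated words and contractions
  | [^\w\s]                      # any other single non-space character (punctuation)
""", re.VERBOSE)


def tokenize(text):
    """Return the tokens found by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


sample = "The U.S.A. released a state-of-the-art toolkit on 2018-05-17, see https://www.nltk.org/ for details."
print(sample.split())     # whitespace split: punctuation stays glued to the words
print(tokenize(sample))   # regex tokens: punctuation separated, special forms kept whole
```

Splitting on white space leaves tokens such as `2018-05-17,` and `details.` with punctuation attached, while the pattern keeps the date, the URL, the abbreviation and the hyphenated word intact as single tokens.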
    
## Stemming
- Definition: Reducing a word to its root form (stem), e.g. returning lie given lying
- NLTK has several built-in stemmers that handle many irregular cases, though not perfectly
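
As a brief sketch of what this looks like in practice, the block below runs NLTK's `PorterStemmer` on a few words. It assumes NLTK is installed, and the word list is made up for illustration. This stemmer needs no additional corpus downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "cats", "lying", "happily", "generously"]

# Map each word to the stem produced by the Porter algorithm.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Note that a stem is not guaranteed to be a dictionary word. Stemmers apply suffix-stripping rules, so some outputs are truncated forms rather than valid words.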

## Normalization
- Hashtag expansion (splitting a hashtag into its component words)
- Convert all text to upper case or lower case
- Possibly remove “stop” words e.g. the, it, about, all… (depending on the application)
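
The three normalization steps can be sketched in a few lines of plain Python. The stop-word set below is a tiny hand-written list for illustration only, and `expand_hashtag` and `normalize` are hypothetical helper names. The hashtag rule assumes CamelCase hashtags.

```python
import re

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "it", "about", "all", "a", "is", "of"}


def expand_hashtag(token):
    """Split a CamelCase hashtag such as '#NaturalLanguageProcessing' into words."""
    if not token.startswith("#"):
        return [token]
    # Capitalised or lower-case words, runs of capitals (acronyms), or digits.
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token[1:])


def normalize(tokens, remove_stop_words=True):
    """Expand hashtags, lowercase everything, and optionally drop stop words."""
    expanded = [part for token in tokens for part in expand_hashtag(token)]
    lowered = [token.lower() for token in expanded]
    if remove_stop_words:
        lowered = [token for token in lowered if token not in STOP_WORDS]
    return lowered


tokens = ["It", "is", "ALL", "about", "the", "#NaturalLanguageProcessing", "course"]
print(normalize(tokens))
# ['natural', 'language', 'processing', 'course']
print(normalize(tokens, remove_stop_words=False))
# ['it', 'is', 'all', 'about', 'the', 'natural', 'language', 'processing', 'course']
```

Whether to remove stop words depends on the application: they add little to topic classification, but they can be useful signals for tasks such as language identification.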

## Collocations and Bigrams
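- Work with combinations of words instead of individual words: a bigram is a pair of adjacent words, and a collocation is a combination of words that occurs together unusually often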
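
A minimal sketch of extracting bigrams and counting them, using only the standard library. The `bigrams` helper here is written by hand for illustration (NLTK provides `nltk.bigrams` for the same purpose), and the sample text is a lower-cased, punctuation-free fragment chosen for the example.

```python
from collections import Counter


def bigrams(tokens):
    """Return every pair of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))


text = "hold tight my man my guy hold tight my girl"
tokens = text.split()
pairs = bigrams(tokens)
print(pairs[:3])
# [('hold', 'tight'), ('tight', 'my'), ('my', 'man')]

# Pairs that occur more than once are candidate collocations.
counts = Counter(pairs)
print(counts.most_common(2))
# [(('hold', 'tight'), 2), (('tight', 'my'), 2)]
```

Raw frequency is only a first approximation, since pairs of very common words also score highly. On larger corpora, collocations are usually ranked with a statistical association measure instead.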


# The Natural Language Toolkit: NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

- Originally developed in 2001 at the University of Pennsylvania
- Version 3.3 released in 2018
- Natural Language Processing with Python: http://www.nltk.org/book_1ed  
- Try “Some simple things you can do with NLTK”: http://www.nltk.org/index.html  
- The figure below shows how NLTK helps us with noun phrase chunking


## Hands-On NLTK

NLTK provides easy preprocessing operations. Let's take a look at an example of text analysis for extracting interesting data from a given text. The directory **hands on nltk tutorial** contains the following examples:

1. Downloading Libs and Testing That They Are Working: getting ready to start!
    * Text Analysis Using nltk.text: extracting interesting data from a given text
2. Deriving N-Grams from Text: creating n-grams (for language classification)
    * Detecting Text Language by Counting Stop Words: a simple way to find out what language a text is written in
    * Language Identifier Using Word Bigrams: state-of-the-art language classifier
3. Bigrams, Stemming and Lemmatizing: NLTK makes bigrams, stemming and lemmatization super-easy
    * Finding Unusual Words in Given Language: which words do not belong with the rest of the text?
    * Creating a POS Tagger: creating a Parts Of Speech tagger
4. Parts of Speech and Meaning: exploring awesome features offered by WordNet
    * Name Gender Identifier: building a classifier that guesses the gender of a name
    * Classifying News Documents into Categories: building a classifier that guesses the category of a news item
5. Sentiment Analysis: is a movie review positive or negative?
    * Sentiment Analysis with nltk.sentiment.SentimentAnalyzer and VADER tools: more sentiment analysis!
6. Twitter Stream (and Cleaning Tweets): live-stream tweets from Twitter
    * Twitter Search: search through past tweets
7. Word2Vec (gensim): Google's Word2vec


In [None]:
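# word_tokenize relies on NLTK's Punkt tokenizer models; if they are missing, fetch them
# once with nltk.download() (the resource is 'punkt' or 'punkt_tab' depending on the NLTK version)
from nltk.tokenize import word_tokenize
from nltk.text import Text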

In [None]:
my_string = "Two plus two is four, minus one that's three — quick maths. Every day man's on the block. Smoke trees. See your girl in the park, that girl is an uckers. When the thing went quack quack quack, your men were ducking! Hold tight Asznee, my brother. He's got a pumpy. Hold tight my man, my guy. He's got a frisbee. I trap, trap, trap on the phone. Moving that cornflakes, rice crispies. Hold tight my girl Whitney."
tokens = word_tokenize(my_string)
tokens = [word.lower() for word in tokens]
tokens[:5]

In [None]:
t = Text(tokens)
t

This method of converting raw strings to NLTK `Text` instances can be used when reading text from a file. For instance:
```python
# In Python 3, text mode already uses universal newlines: "\r\n" and "\r" are translated to "\n" on read.
# The old 'U' and 'rU' mode flags were removed in Python 3.11, so plain 'r' is used here.
with open('my-file.txt', 'r') as f:
    raw = f.read()
```

In [None]:
t.concordance('uckers') # concordance() is a method of the Text class of NLTK. It finds words and displays a context window. Word matching is not case-sensitive.
# concordance() is defined as follows: concordance(self, word, width=79, lines=25). Note default values for optional params.

## Resources
- https://www.nltk.org/
- http://www.nltk.org/book_1ed
- https://en.wikipedia.org/wiki/Regular_expression
- https://github.com/hb20007/hands-on-nltk-tutorial