# Getting Started

## Understanding NLP Tasks

### Tasks in Natural Language Processing
- Tokenization
  - Split text into words and sentences
  - Example: Mary | had | a | little | lamb. | Its | fleece | was | white | as | snow

- Stopword Removal
  - Filter out common words that carry little information
  - Example: Mary (had a) little lamb.
  
- N-Grams
  - Example: (New York) is a great city. Have you ever been to (New York)?
  - Here, New York should be treated as a single entity. Because this entity consists of two words, it is called a bigram.
  
- Word Sense Disambiguation
  - Example: The movie had really (cool) effects. / I'd like a tall glass of (cool) water.

- Part-of-Speech (POS) Tagging
  - Example: Mary had a little lamb.
  - Tags: Mary (Noun), had (Verb), little (Adj.), lamb (Noun)
  
- Stemming - 词干提取
  - Close/Closed/Closely/Closer => Clos

### Tokenizing Text

In [1]:
import nltk

text = "Mary had a little lamb. Her flece was white as snow"
from nltk.tokenize import word_tokenize, sent_tokenize
sents = sent_tokenize(text)
print(sents)

['Mary had a little lamb.', 'Her flece was white as snow']


In [2]:
words = [word_tokenize(sent) for sent in sents]
print(words)

[['Mary', 'had', 'a', 'little', 'lamb', '.'], ['Her', 'flece', 'was', 'white', 'as', 'snow']]
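
To get a feel for what the tokenizers do, here is a rough regex-based approximation that uses only the standard library. It is a sketch, not how NLTK's `sent_tokenize` and `word_tokenize` work internally. The helper names `naive_sent_tokenize` and `naive_word_tokenize` are made up for illustration, and the patterns will mishandle abbreviations, contractions, and other cases the real tokenizers cover.

```python
import re

def naive_sent_tokenize(text):
    # Hypothetical helper: split after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def naive_word_tokenize(sentence):
    # A token is a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Mary had a little lamb. Her flece was white as snow"
naive_sents = naive_sent_tokenize(text)
naive_words = [naive_word_tokenize(s) for s in naive_sents]
print(naive_sents)
print(naive_words)
```

For this simple text the result matches the NLTK output shown above, which is enough to see the two-step structure: sentences first, then words within each sentence.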


### Removing Stopwords

In [3]:
from nltk.corpus import stopwords # import a set of stopwords
from string import punctuation
customStopWords = set(stopwords.words('english') + list(punctuation))

In [4]:
wordsWOStopwords = [word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)

['Mary', 'little', 'lamb', 'Her', 'flece', 'white', 'snow']
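
Note that `Her` survives in the output above: the membership test is case-sensitive, and NLTK's English stopword list is lowercase. One possible fix is to lowercase each token before the lookup. The sketch below assumes a tiny hand-written set, `small_stopwords`, as a stand-in for `stopwords.words('english')` so that it runs without the NLTK corpus.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
small_stopwords = {"had", "a", "her", "was", "as"}
stop = small_stopwords | set(punctuation)

tokens = ['Mary', 'had', 'a', 'little', 'lamb', '.',
          'Her', 'flece', 'was', 'white', 'as', 'snow']

# Compare the lowercased token so that capitalized stopwords are caught too.
filtered = [tok for tok in tokens if tok.lower() not in stop]
print(filtered)
```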


### Identifying Bigrams

In [5]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures() # association measures for scoring bigrams
finder = BigramCollocationFinder.from_words(wordsWOStopwords) # constructs bigrams from a list of words

# show distinct bigrams and their frequencies
sorted(finder.ngram_fd.items())

[(('Her', 'flece'), 1),
 (('Mary', 'little'), 1),
 (('flece', 'white'), 1),
 (('lamb', 'Her'), 1),
 (('little', 'lamb'), 1),
 (('white', 'snow'), 1)]
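
The counts above can be reproduced with the standard library alone, which makes it clearer what a bigram frequency table is. This is a minimal sketch, not the way `BigramCollocationFinder` is implemented; the word list is copied from the stopword-removal output.

```python
from collections import Counter

words = ['Mary', 'little', 'lamb', 'Her', 'flece', 'white', 'snow']

# Pair each word with its successor, then count how often each pair occurs.
bigram_counts = Counter(zip(words, words[1:]))
print(sorted(bigram_counts.items()))
```

Because stopwords were removed first, pairs such as `('Mary', 'little')` were not adjacent in the original sentence.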

### Stemming and POS Tagging

In [6]:
# different morphological forms of the same word: closed, closing, close
text2 = "Mary closed on closing night when she was in the mood to close."

# Stemming
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
stemmedWords = [st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['mary', 'clos', 'on', 'clos', 'night', 'when', 'she', 'was', 'in', 'the', 'mood', 'to', 'clos', '.']
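
Stemmers of this kind work by stripping suffixes according to a rule table. The toy version below illustrates the idea only. It is not the Lancaster algorithm, and the suffix list and the `toy_stem` helper are assumptions chosen to cover the "close" family.

```python
# Hypothetical suffix list, checked in order; not the Lancaster rule set.
SUFFIXES = ("ing", "ely", "ed", "er", "e")

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Strip only when enough of the word is left to be a plausible stem.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

stems = [toy_stem(w) for w in ["close", "closed", "closing", "closely", "closer"]]
print(stems)
```

The minimum stem length keeps short words such as "the" and "she" intact, a guard that real stemmers implement with far more detailed rules.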


#### POS Tagging
- NNP: Proper noun, singular
- VBD: Verb, past tense
- PRP: Personal pronoun

In [7]:
nltk.pos_tag(word_tokenize(text2))

[('Mary', 'NNP'),
 ('closed', 'VBD'),
 ('on', 'IN'),
 ('closing', 'NN'),
 ('night', 'NN'),
 ('when', 'WRB'),
 ('she', 'PRP'),
 ('was', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mood', 'NN'),
 ('to', 'TO'),
 ('close', 'VB'),
 ('.', '.')]
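
Once text is tagged, a common next step is to select words by tag. The sketch below copies the tagged output above into a literal so it runs without NLTK, and relies on the Penn Treebank convention that noun tags start with `NN` and verb tags with `VB`.

```python
tagged = [('Mary', 'NNP'), ('closed', 'VBD'), ('on', 'IN'), ('closing', 'NN'),
          ('night', 'NN'), ('when', 'WRB'), ('she', 'PRP'), ('was', 'VBD'),
          ('in', 'IN'), ('the', 'DT'), ('mood', 'NN'), ('to', 'TO'),
          ('close', 'VB'), ('.', '.')]

# Select words whose tag falls in the noun or verb family.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]
print(nouns)
print(verbs)
```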

### Word Sense Disambiguation

WordNet is a lexical database (a little like a thesaurus).
- synset: the basic unit in WordNet, representing one single sense (definition) of a word.

In [8]:
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bass'):
    print(ss, ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


Note that the fourth entry above is a kind of fish and the seventh is a musical instrument.

In [9]:
from nltk.wsd import lesk # Lesk is an algorithm for word sense disambiguation
sense1 = lesk(word_tokenize('Sing in a lower tone, along with the bass'), 'bass')
print(sense1, sense1.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments


In [10]:
sense2 = lesk(word_tokenize('This sea bass was really hard to catch'), 'bass')
print(sense2, sense2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
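
The idea behind Lesk can be sketched in a few lines: pick the sense whose definition shares the most words with the context. The code below is a simplified, standard-library-only illustration and not NLTK's implementation. `toy_lesk` is a hypothetical helper, the two glosses are copied from the synset listing above, and the second sentence is made up for the example.

```python
# Glosses copied from the synset listing above; only two senses are kept.
glosses = {
    "bass.n.07": "the member with the lowest range of a family of musical instruments",
    "sea_bass.n.01": "the lean flesh of a saltwater fish of the family Serranidae",
}

def toy_lesk(context_sentence, glosses):
    context = set(context_sentence.lower().replace(",", " ").split())

    def overlap(sense):
        # Number of distinct context words that also appear in the gloss.
        return len(context & set(glosses[sense].lower().split()))

    return max(glosses, key=overlap)

music_sense = toy_lesk("Sing in a lower tone, along with the bass", glosses)
fish_sense = toy_lesk("I caught a saltwater fish, a bass", glosses)  # made-up sentence
print(music_sense)
print(fish_sense)
```

With such short glosses the overlap often comes down to function words like "a" and "the", which is one reason overlap-based disambiguation can be fragile.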


### Spam Detection

#### Rule Based Approach
- email => Static Rules => Spam/Ham
- Static Rules: e.g., the email contains specific keywords
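
A static keyword rule could look like the sketch below. The keyword set and the `classify_by_rules` helper are hypothetical, chosen only to illustrate the email => rules => Spam/Ham flow.

```python
# Hypothetical keyword list; a real filter would need many more rules.
SPAM_KEYWORDS = {"lottery", "winner", "prize"}

def classify_by_rules(email_text):
    words = set(email_text.lower().split())
    # Static rule: any listed keyword marks the email as spam.
    return "Spam" if words & SPAM_KEYWORDS else "Ham"

print(classify_by_rules("You are the lottery winner"))
print(classify_by_rules("Lunch at noon tomorrow"))
```

The rules never change unless someone edits the keyword list by hand, which is the limitation the machine learning approach addresses.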

**Use Machine Learning when:**
- Rules are difficult for humans to express
- A large amount of historical data is available
- Patterns/relationships are dynamic

#### Machine Learning Approach
- email => Updated Rules => Spam/Ham.
  - Updated Rules <=> Historical Data


### Understanding Types of Machine Learning Approaches

#### Typical ML Workflow
- **Pick your problem**: Identify which type of problem we need to solve
- Represent Data: Represent data using numeric attributes
- Apply an Algorithm: Use a standard algorithm to find a model

#### Pick your Problem
- ML Problems generally fall under a broad set of categories
  - **Classification**
  - **Clustering**
  - Recommendation
  - Regression

#### Classification
- Spam Detection
  - Is this email **Spam** or **Ham**?
- Sentiment Analysis
  - Is this tweet **positive** or **negative**?
- Algorithms which perform classification are known as **Classifiers**

#### Clustering
- E.g., take a large group of articles and divide them into **groups** based on some **common attributes**. Key: the groups to be divided into are **unknown beforehand**
- In the example above, we might later realize that these groups represent meaningful divisions
  - Themes, Topics

**Differences between Classification and Clustering**
- Classification is used to perform a specific task, such as spam detection or sentiment analysis
- Clustering is used when you just want to explore the data and detect patterns that you did not know existed

### Understanding the Mechanics of Machine Learning

#### Typical ML Workflow
- Pick your problem: Identify which type of problem we need to solve
- **Represent Data**: Represent data using numeric attributes
- **Apply an Algorithm**: Use a standard algorithm to find a model

#### Represent Data

Use meaningful numeric attributes to represent text
- Term Frequency
- TF-IDF (Term Frequency - Inverse Document Frequency)
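
Both representations can be computed by hand. The sketch below assumes one common textbook variant: term frequency as count divided by document length, and inverse document frequency as `log(N / df)`. Libraries often use smoothed or normalized variants, and the three tiny documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["free", "prize", "free"],
    ["meeting", "at", "noon"],
    ["free", "lunch", "at", "noon"],
]

def term_frequency(doc):
    # Share of the document's tokens taken up by each term.
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}

def inverse_document_frequency(docs):
    # Terms that appear in fewer documents get a larger weight.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
tfidf = [{t: tf * idf[t] for t, tf in term_frequency(d).items()} for d in docs]
print(tfidf[0])
```

In the first document "free" occurs twice, but it also appears in another document, so its IDF is lower than that of "prize", which is unique to that document.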

#### Apply an Algorithm
Use an algorithm to find patterns from the historical data
- Updated Rules <=> Historical Data
- Rules are meant to quantify relations between variables. The rules together form something called a **Model**
- A Model can be:
  - a mathematical equation
  - a set of rules (if-then-else statements)
- The choice of algorithm depends mainly on the type of problem
  - For classification problems, algorithm choices: Naive Bayes / Support Vector Machines
  - For clustering problems, algorithm choices: K-Means / Hierarchical Clustering
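
To make "find a model from historical data" concrete, here is a minimal Naive Bayes sketch in plain Python. It assumes a multinomial model with add-one smoothing, which is one common formulation rather than the only one. The training emails, the `train` and `predict` helpers, and the labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled history: (tokens, label).
training = [
    (["win", "free", "prize"], "Spam"),
    (["free", "lottery", "winner"], "Spam"),
    (["meeting", "at", "noon"], "Ham"),
    (["lunch", "at", "noon", "tomorrow"], "Ham"),
]

def train(examples):
    # The "model" is just counts: labels, words per label, and the vocabulary.
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in examples for w in tokens}
    return label_counts, word_counts, vocab

def predict(tokens, model):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training)
print(predict(["free", "prize", "today"], model))
print(predict(["meeting", "tomorrow"], model))
```

Retraining on newer emails updates the counts, which is the sense in which the rules follow the historical data instead of being fixed by hand.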

