# Basics of NLP (Natural Language Processing)

* Tasks in NLP:
    - translation
    - automatic summarization
    - Named Entity Recognition (NER)
    - speech recognition
    - relationship extraction
    - topic segmentation

**Steps in NLP Pipeline:**   
1. Sentence Segmentation  
2. Word Tokenization  
3. Stemming  
4. Lemmatization  
5. Identify Stop Words  
6. Dependency Parsing  
7. POS (Part of Speech) Tagging  
8. Named Entity Recognition (NER)  
9. Chunking  

**Difficulties in NLP:**
 * Ambiguity
    * Lexical ambiguity: a single word can act as a noun, adjective, or verb
    * Syntactic ambiguity
    * Referential ambiguity
 * Lack of context
 * Named Entity Recognition (NER)
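Lexical ambiguity can be illustrated with a small sketch. The mini-lexicon below is hypothetical and hand-written (it is not an NLTK resource), and `possible_taggings` is a made-up helper name. It only shows how the number of candidate readings multiplies when individual words allow several parts of speech.

```python
from itertools import product

# Hypothetical mini-lexicon: each word maps to the parts of speech it can take.
LEXICON = {
    "they": ["PRON"],
    "can": ["AUX", "VERB", "NOUN"],
    "fish": ["NOUN", "VERB"],
}

def possible_taggings(sentence):
    """Enumerate every combination of part-of-speech tags for the sentence."""
    words = sentence.lower().split()
    options = [LEXICON[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*options)]

readings = possible_taggings("They can fish")
print(len(readings))  # 1 * 3 * 2 = 6 candidate taggings
for reading in readings:
    print(reading)
```

Most of the six taggings are not grammatical readings; picking the right one requires context, which is why ambiguity is listed as a core difficulty.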


In [4]:
import nltk
import warnings

## Tokenization

Tokenization splits text into smaller units, either sentences or words.

```python
from nltk.tokenize import sent_tokenize, word_tokenize
```

In [7]:
from nltk.tokenize import sent_tokenize, word_tokenize

example_string = """
Natural language processing (NLP) is a field of computer science that deals with the interaction between computers and human (natural) languages. It's a subfield of artificial intelligence that deals with the ability of computers to understand and process human language, including speech and text.
NLP has many applications, including machine translation, speech recognition, text analysis, and question answering. It's used in a variety of industries, including healthcare, finance, and customer service.
One of the most important tasks in NLP is to understand the meaning of text. This can be challenging because words can have multiple meanings, and the meaning of a sentence can depend on the context in which it's used. NLP systems use a variety of techniques to understand meaning, including詞法分析, 語法分析, and 語義分析.
Another important task in NLP is to generate text. This can be challenging because it requires the system to understand the meaning of the text it's generating and to be able to express that meaning in a way that is both grammatically correct and fluent. NLP systems use a variety of techniques to generate text, including machine translation, text summarization, and question answering.
NLP is a rapidly growing field, and it's having a major impact on the way we interact with computers. As NLP systems become more sophisticated, they'll be able to understand and process human language in ways that are currently unimaginable. This will lead to new and innovative applications in a variety of industries.
"""

In [8]:
# Sentence tokenizer
sent_tokenize(example_string)

['\nNatural language processing (NLP) is a field of computer science that deals with the interaction between computers and human (natural) languages.',
 "It's a subfield of artificial intelligence that deals with the ability of computers to understand and process human language, including speech and text.",
 'NLP has many applications, including machine translation, speech recognition, text analysis, and question answering.',
 "It's used in a variety of industries, including healthcare, finance, and customer service.",
 'One of the most important tasks in NLP is to understand the meaning of text.',
 "This can be challenging because words can have multiple meanings, and the meaning of a sentence can depend on the context in which it's used.",
 'NLP systems use a variety of techniques to understand meaning, including詞法分析, 語法分析, and 語義分析.',
 'Another important task in NLP is to generate text.',
 "This can be challenging because it requires the system to understand the meaning of the text 

In [11]:
# Word Tokenizer
word_tokenize(example_string)[:50]  # Printing the first 50 tokens only

['Natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'field',
 'of',
 'computer',
 'science',
 'that',
 'deals',
 'with',
 'the',
 'interaction',
 'between',
 'computers',
 'and',
 'human',
 '(',
 'natural',
 ')',
 'languages',
 '.',
 'It',
 "'s",
 'a',
 'subfield',
 'of',
 'artificial',
 'intelligence',
 'that',
 'deals',
 'with',
 'the',
 'ability',
 'of',
 'computers',
 'to',
 'understand',
 'and',
 'process',
 'human',
 'language',
 ',',
 'including',
 'speech',
 'and']
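To make the idea concrete, here is a rough standard-library sketch of what a tokenizer does. It is not how `sent_tokenize` or `word_tokenize` work internally, and the function names `naive_sent_tokenize` and `naive_word_tokenize` are made up for illustration.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when it is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def naive_word_tokenize(text):
    """Return runs of word characters, and each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "NLP has many applications. It is used in healthcare, finance, and more!"
sentences = naive_sent_tokenize(sample)
print(sentences)
print(naive_word_tokenize(sentences[0]))
```

The naive versions break down quickly: the sentence splitter would cut after an abbreviation such as "Dr.", and the word splitter turns "It's" into three tokens (`It`, `'`, `s`) instead of the `It`, `'s` pair shown in the NLTK output above.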

## Stopwords

Stop words are words you want to ignore, for example common words such as 'is', 'an', and 'the', because they carry little meaning in some tasks.

In [14]:
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abhim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
chankya_quote = "It is better to die than to preserve this life by incurring disgrace. The loss of life causes but a moment's grief, but disgrace brings grief every day of one's life."
words_in_quote = word_tokenize(chankya_quote)
print(words_in_quote)

['It', 'is', 'better', 'to', 'die', 'than', 'to', 'preserve', 'this', 'life', 'by', 'incurring', 'disgrace', '.', 'The', 'loss', 'of', 'life', 'causes', 'but', 'a', 'moment', "'s", 'grief', ',', 'but', 'disgrace', 'brings', 'grief', 'every', 'day', 'of', 'one', "'s", 'life', '.']


In [21]:
# Create a set of English stopwords
stop_words = set(stopwords.words("english"))

In [23]:
# Method 1: filter with a for loop
filtered_list = []  # Holds non-stopwords
for word in words_in_quote:
    if word.casefold() not in stop_words:  # casefold() makes the comparison case-insensitive
        filtered_list.append(word)

print(filtered_list)

['better', 'die', 'preserve', 'life', 'incurring', 'disgrace', '.', 'loss', 'life', 'causes', 'moment', "'s", 'grief', ',', 'disgrace', 'brings', 'grief', 'every', 'day', 'one', "'s", 'life', '.']


In [24]:
# Method 2: filter out stopwords with a list comprehension
filtered_list = [
    word for word in words_in_quote if word.casefold() not in stop_words
]
print(filtered_list)

['better', 'die', 'preserve', 'life', 'incurring', 'disgrace', '.', 'loss', 'life', 'causes', 'moment', "'s", 'grief', ',', 'disgrace', 'brings', 'grief', 'every', 'day', 'one', "'s", 'life', '.']
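The filtered lists above still contain punctuation tokens such as `'.'` and `"'s"`. One possible follow-up, sketched below, is to also require `str.isalpha()`. To keep the snippet runnable without NLTK data, `demo_stop_words` is a small hand-picked stand-in (an assumption for illustration); in the notebook you would use the `stop_words` set built from `stopwords.words("english")`.

```python
# Tokens for the first sentence of the quote, as printed by word_tokenize above.
tokens = ['It', 'is', 'better', 'to', 'die', 'than', 'to', 'preserve',
          'this', 'life', 'by', 'incurring', 'disgrace', '.']

# Hand-picked stand-in for the NLTK English stopword set (illustrative only).
demo_stop_words = {"it", "is", "to", "than", "this", "by"}

# Keep a token only if it is purely alphabetic and not a stopword.
content_words = [
    t for t in tokens
    if t.isalpha() and t.casefold() not in demo_stop_words
]
print(content_words)
# ['better', 'die', 'preserve', 'life', 'incurring', 'disgrace']
```

Note that `isalpha()` also drops tokens containing digits or hyphens, which may or may not be what a given task needs.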


## Stemming and Lemmatization

### Stemming

Stemming reduces a word to its root form (its stem) by stripping affixes. The stem is not always a valid word.

Some stemmers available in NLTK are:
* Porter stemmer
* Snowball stemmer
* ARLSTem stemmer

In [26]:
# Porter Stemmer in NLTK
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [28]:
stemmed_words = [stemmer.stem(word) for word in words_in_quote]
print(stemmed_words)

['it', 'is', 'better', 'to', 'die', 'than', 'to', 'preserv', 'thi', 'life', 'by', 'incur', 'disgrac', '.', 'the', 'loss', 'of', 'life', 'caus', 'but', 'a', 'moment', "'s", 'grief', ',', 'but', 'disgrac', 'bring', 'grief', 'everi', 'day', 'of', 'one', "'s", 'life', '.']


#### Problems with stemming

Understemming and overstemming are two ways stemming can go wrong:

1. **Understemming** happens when two related words should be reduced to the same stem but aren’t. This is a false negative.

2. **Overstemming** happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.
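Both failure modes can be reproduced with a deliberately crude suffix stripper. `toy_stem` below is a hypothetical helper written for this illustration, not an NLTK function, and its suffix list is an arbitrary assumption.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ity", "al", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: words with different meanings collapse to one stem.
overstemmed = {w: toy_stem(w) for w in ("universal", "university", "universe")}
print(overstemmed)

# Understemming: related forms end up with different stems.
understemmed = {w: toy_stem(w) for w in ("alumnus", "alumni")}
print(understemmed)
```

All three words in the first group become `univers` (a false positive), while `alumnus` and `alumni` become `alumnu` and `alumni` (a false negative).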

#### Snowball Stemmer (Porter2)

A few of its suffix rules:  
* ILY → ILI  
* LY → (removed)  
* SS → SS (unchanged)  
* S → (removed)  
* ED → E or (removed)  
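As a rough sketch, the rules listed above can be read as an ordered table of suffix rewrites in which the first match wins. This is a literal, simplified reading and not the actual Snowball algorithm, which checks extra conditions on the stem and runs several steps. `RULES` and `apply_first_rule` are hypothetical names.

```python
# (suffix, replacement) pairs, ordered so that longer suffixes are tried first:
# "ily" must come before "ly", and "ss" before "s".
RULES = [
    ("ily", "ili"),
    ("ly", ""),
    ("ss", "ss"),
    ("ed", ""),  # the list above also allows ED -> E; only removal is modelled here
    ("s", ""),
]

def apply_first_rule(word):
    """Rewrite the suffix using the first rule that matches, if any."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ("happily", "quickly", "grass", "cats", "jumped"):
    print(w, "->", apply_first_rule(w))
```

The ordering matters: if `s` were tried before `ss`, `grass` would lose its final letter.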

In [29]:
from nltk.stem.snowball import SnowballStemmer
snow_stemmer = SnowballStemmer(language='english')

stemmed_words = [snow_stemmer.stem(word) for word in words_in_quote]
print(stemmed_words)

['it', 'is', 'better', 'to', 'die', 'than', 'to', 'preserv', 'this', 'life', 'by', 'incur', 'disgrac', '.', 'the', 'loss', 'of', 'life', 'caus', 'but', 'a', 'moment', "'s", 'grief', ',', 'but', 'disgrac', 'bring', 'grief', 'everi', 'day', 'of', 'one', "'s", 'life', '.']


### Lemmatizing

A **lemma** is a word that represents a whole group of words, and that group of words is called a **lexeme**.

Like stemming, lemmatizing reduces a word to its core form, but the result is a complete English word that makes sense on its own rather than a fragment such as 'discoveri'.
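The lemma and lexeme relationship can be sketched with a plain dictionary. The table `FORM_TO_LEMMA` is hand-written for illustration (an assumption, not NLTK data); a real lemmatizer such as `WordNetLemmatizer` looks words up in a full lexical database instead.

```python
from collections import defaultdict

# Hand-written mapping from inflected form to lemma (illustrative only).
FORM_TO_LEMMA = {
    "scarf": "scarf", "scarves": "scarf",
    "discover": "discover", "discovers": "discover",
    "discovered": "discover", "discovering": "discover",
}

# Group the forms by lemma: each group plays the role of a lexeme.
lexemes = defaultdict(list)
for form, lemma in FORM_TO_LEMMA.items():
    lexemes[lemma].append(form)

print(dict(lexemes))
```

Each key is a lemma, and the list stored under it is the group of word forms that the lemma represents.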

In [30]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [31]:
lemmatizer.lemmatize("scarves")

'scarf'

In [34]:
string_for_lemmatizing = "The friends of DeSoto love scarves."
words = word_tokenize(string_for_lemmatizing)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']


## Part of Speech (POS)