# NLP (Natural Language Processing)
NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.

## NLTK
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.

# The Basics of NLP for Text

1. Sentence Tokenization
2. Word Tokenization
3. Text Lemmatization and Stemming
4. Stop Words
5. Regex
6. Bag-of-Words
7. TF-IDF

In [2]:
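import nltk
# Assumed one-time setup: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet"); resource names can differ between NLTK versions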

## Sentence Tokenization:
* Sentence tokenization (also called sentence segmentation) is the problem of dividing a string of written language into its component sentences. 
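* A naive approach splits the text at every sentence-ending punctuation mark, but abbreviations and decimal numbers also contain periods, so nltk.sent_tokenize relies on a pre-trained model instead of a fixed rule.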
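
To illustrate why splitting on punctuation alone is fragile, here is a minimal sketch of a rule-based splitter. The helper `naive_sent_split` and the sample text are hypothetical and are not part of NLTK; the rule simply cuts after `.`, `!` or `?` when whitespace follows.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Dr. Smith plays backgammon. He rolls two dice."
print(naive_sent_split(sample))
# The abbreviation "Dr." is wrongly treated as a sentence of its own:
# ['Dr.', 'Smith plays backgammon.', 'He rolls two dice.']
```

The rule works on the backgammon text used in this notebook, which contains no abbreviations, but it breaks as soon as a period does not end a sentence.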

In [3]:
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

Backgammon is one of the oldest known board games.

Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.

It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.



## Word Tokenization:
* Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words.
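
As a rough sketch of what a word tokenizer has to do beyond splitting on spaces, the following compares `str.split` with a small regular expression. The helper `simple_word_tokenize` is hypothetical and only approximates the behaviour of a real tokenizer: it keeps hyphenated words and comma-grouped numbers together and emits every other punctuation mark as its own token.

```python
import re

def simple_word_tokenize(sentence):
    # \w+(?:[-,]\w+)* keeps "twenty-four" and "5,000" in one piece;
    # [^\w\s] emits each remaining punctuation mark as a separate token.
    return re.findall(r"\w+(?:[-,]\w+)*|[^\w\s]", sentence)

sample = "It has twenty-four points, and 5,000 fans."
print(sample.split())                # punctuation stays glued to the words
print(simple_word_tokenize(sample))  # punctuation becomes separate tokens
```

A pattern like this also merges unrelated words joined by a comma without a space, which is one reason a library tokenizer is usually the safer choice.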
    

In [4]:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']

['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']



## Text Lemmatization and Stemming:

* The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.


* Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
        

* Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
        

* A stemmer operates without knowledge of the context, so it is easier to implement and usually runs faster than a lemmatizer.
        

* The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
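
The difference between the two approaches can be sketched with toy versions of each. This is not the Porter algorithm or WordNet: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hand-written dictionary (`TOY_LEMMAS`), both invented here purely for illustration.

```python
def toy_stem(word):
    """Strip the first matching suffix; no dictionary and no context."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real vocabulary.
TOY_LEMMAS = {"better": "good", "drove": "drive", "caring": "care"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ("caring", "drove", "better"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The suffix rules never connect "better" to "good" or "drove" to "drive", because those links exist only in the dictionary.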

In [5]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def comp_stem_and_lem(stemmer, lemmatizer, word, pos):
    print("Stemmer", stemmer.stem(word))
    print("Lemmatizer", lemmatizer.lemmatize(word, pos))
    print()
    
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

comp_stem_and_lem(stemmer, lemmatizer, word="caring", pos=wordnet.VERB)
comp_stem_and_lem(stemmer, lemmatizer, word="drove", pos=wordnet.VERB)

Stemmer care
Lemmatizer care

Stemmer drove
Lemmatizer drive



## Stop Words
* Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise. That’s why we want to remove these irrelevant words.

In [6]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

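## Optimized way
* Converting the stop word list to a set makes each membership test take constant time on average, instead of scanning the whole list for every word.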

In [7]:
stop_words = set(stopwords.words("english"))
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    without_stop_words = [word for word in words if word not in stop_words]
    print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
['Its', 'history', 'traced', 'back', 'nearly', '5,000', 'years', 'archeological', 'discoveries', 'Middle', 'East', '.']
['It', 'two', 'player', 'game', 'player', 'fifteen', 'checkers', 'move', 'twenty-four', 'points', 'according', 'roll', 'two', 'dice', '.']
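
In the output above, 'Its' and 'It' are kept because the stop word list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows; `toy_stop_words` and `tokens` are small hypothetical stand-ins for the NLTK list and the tokenized sentences.

```python
# Hypothetical mini stop word set, standing in for stopwords.words("english").
toy_stop_words = {"it", "its", "is", "a", "the", "of", "to"}
tokens = ["Its", "history", "is", "the", "history", "of", "a", "game", "."]

# Case-sensitive test: the capitalized "Its" does not match "its".
case_sensitive = [t for t in tokens if t not in toy_stop_words]

# Lowercasing only for the comparison keeps the original casing in the result.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering depends on the task, since casing can carry information such as proper nouns.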


## Regex
* A regular expression, regex, or regexp is a sequence of characters that defines a search pattern.

* We can use regex to apply additional filtering to our text. For example, we can remove all the non-word characters. In many cases, we don’t need the punctuation marks and it’s easy to remove them with regex.

In [8]:
import re
pattern = r"[^\w]"  # any character that is not a letter, digit, or underscore
for sentence in sentences:
    print(re.sub(pattern, " ", sentence))

Backgammon is one of the oldest known board games 
Its history can be traced back nearly 5 000 years to archeological discoveries in the Middle East 
It is a two player game where each player has fifteen checkers which move between twenty four points according to the roll of two dice 


## Bag-of-Words
* The Bag-of-Words (BoW) model is a simple yet powerful technique for representing text data numerically: each document becomes a vector of word counts, and word order is ignored.

* Steps:
    * Text Preprocessing
    * Vocabulary Creation
    * Feature Extraction
    * Vectorization
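
The steps above can be sketched by hand on a tiny corpus. The two documents in `docs` are made up for illustration, and the tokenization rule (lowercase, runs of word characters) is an assumption that is close to, but not identical to, what a library vectorizer does.

```python
import re
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # hypothetical toy corpus

# 1. Text preprocessing: lowercase and split into word tokens.
tokenized = [re.findall(r"\w+", d.lower()) for d in docs]

# 2. Vocabulary creation: every distinct token, in sorted order.
vocab = sorted({tok for toks in tokenized for tok in toks})

# 3 and 4. Feature extraction and vectorization: one count per vocabulary word.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

One difference worth knowing: the default token pattern of scikit-learn's CountVectorizer only keeps tokens of two or more characters, so one-letter words such as "a" do not appear in its vocabulary, whereas this sketch would keep them.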

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

bow_matrix = vectorizer.fit_transform(sentences)

bow_matrix_dense = bow_matrix.toarray()

print("Vocabulary:", vectorizer.get_feature_names_out())

print("BoW Matrix:")
print(bow_matrix_dense)

Vocabulary: ['000' 'according' 'archeological' 'back' 'backgammon' 'be' 'between'
 'board' 'can' 'checkers' 'dice' 'discoveries' 'each' 'east' 'fifteen'
 'four' 'game' 'games' 'has' 'history' 'in' 'is' 'it' 'its' 'known'
 'middle' 'move' 'nearly' 'of' 'oldest' 'one' 'player' 'points' 'roll'
 'the' 'to' 'traced' 'twenty' 'two' 'where' 'which' 'years']
BoW Matrix:
[[0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0
  0 0 0 0 0 0]
 [1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 1
  1 0 0 0 0 1]
 [0 1 0 0 0 0 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 2 1 1 1 1
  0 1 2 1 1 0]]


## TF-IDF
* TF-IDF, short for term frequency-inverse document frequency, is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

* One problem with scoring word frequency is that the most frequent words in the document start to have the highest scores. These frequent words may not contain as much “informational gain” to the model compared with some rarer and domain-specific words. One approach to fix that problem is to penalize words that are frequent across all the documents. This approach is called TF-IDF.
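
To make the penalty concrete, here is a minimal hand computation on a made-up corpus. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is how scikit-learn documents the defaults of TfidfVectorizer; the corpus `docs` and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus of three short documents.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: in how many documents each term appears.
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw count times idf for each vocabulary term, scaled to unit length."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for t in vocab:
    print(f"{t:5s} df={df[t]} idf={idf[t]:.4f}")
```

A term that occurs in every document, such as "the", gets the minimum idf of 1, while a term found in a single document gets ln(2) + 1 ≈ 1.693. The same ratio shows up in the notebook output: in row 0, 0.365011 for "backgammon" divided by 0.215582 for "the" is about 1.693.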

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(sentences)

feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns = feature_names)

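| | 000 | according | archeological | back | backgammon | be | between | board | can | checkers | ... | points | roll | the | to | traced | twenty | two | where | which | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.365011 | 0.0 | 0.0 | 0.365011 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.215582 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.258828 | 0.0 | 0.258828 | 0.258828 | 0.0 | 0.258828 | 0.0 | 0.0 | 0.258828 | 0.0 | ... | 0.0 | 0.0 | 0.152868 | 0.196845 | 0.258828 | 0.0 | 0.0 | 0.0 | 0.0 | 0.258828 |
| 2 | 0.0 | 0.1958 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1958 | 0.0 | 0.0 | 0.1958 | ... | 0.1958 | 0.1958 | 0.115643 | 0.148911 | 0.0 | 0.1958 | 0.3916 | 0.1958 | 0.1958 | 0.0 |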
