## Introduction
##### What is NLP (Natural Language Processing)?

NLP is a subfield of computer science, information engineering and AI concerned with the interaction between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

##### NLP uses
Speech Recognition 
Document summarization
Sentiment Analysis
ChatBots
Predicting the genre of the book 
Question Answering 



Sentence tokenization
>The process of separating sentences within paragraphs 

In [18]:
text = 'Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.'

In [19]:
from nltk import sent_tokenize
sent_tok = sent_tokenize(text)
for sentence in sent_tok:
    print(sentence)
    print()

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.



Word tokenization
>The process of separating words within the sentences. <br>
Sometimes a single word is split by a space. Note that the tokenizer also returns punctuation such as '(' and ',' as separate tokens.

In [20]:
from nltk import word_tokenize
word_tok = word_tokenize(text)
[word for word in word_tok]

['Natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'subfield',
 'of',
 'computer',
 'science',
 ',',
 'information',
 'engineering',
 ',',
 'and',
 'artificial',
 'intelligence',
 'concerned',
 'with',
 'the',
 'interactions',
 'between',
 'computers',
 'and',
 'human',
 '(',
 'natural',
 ')',
 'languages',
 ',',
 'in',
 'particular',
 'how',
 'to',
 'program',
 'computers',
 'to',
 'process',
 'and',
 'analyze',
 'large',
 'amounts',
 'of',
 'natural',
 'language',
 'data',
 '.',
 'Challenges',
 'in',
 'natural',
 'language',
 'processing',
 'frequently',
 'involve',
 'speech',
 'recognition',
 ',',
 'natural',
 'language',
 'understanding',
 ',',
 'and',
 'natural',
 'language',
 'generation',
 '.']


Text Lemmatization and Stemming
> Documents may contain different forms of a word (running, run, runs) as well as related words with similar meanings (nation, national, nationality). <br>
Stemming and lemmatization both reduce these variant forms of a word to a common base form.

>Stemming usually chops off word endings with heuristic rules, which often includes removing derivational affixes (applied to the word 'saw', it might return just 's'). A stemmer operates without knowledge of the context. <br>
>Lemmatization aims to remove only the inflectional endings and return the base or dictionary form of a word, known as the **lemma** (applied to the word 'saw', it would attempt to return either 'see' or 'saw', depending on whether the token is used as a verb or a noun).

Def: [Inflection](https://en.wikipedia.org/wiki/Inflection): the process of word formation in which a word is modified to express grammatical categories such as tense, case, voice, aspect, person, number, gender and mood. The inflection of verbs is also called conjugation.


In [21]:
from nltk.stem import PorterStemmer,WordNetLemmatizer
# import nltk
# nltk.download('wordnet')
from nltk.corpus import wordnet


def comp_stem_lemma(stemmer, lemmatizer, word, pos):
    print("Stemmer",stemmer.stem(word))
    print("Lemmatizer",lemmatizer.lemmatize(word,pos))
    print()
    
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
comp_stem_lemma(stemmer,lemmatizer,word="seeing",pos=wordnet.VERB)
comp_stem_lemma(stemmer,lemmatizer,word="spun",pos=wordnet.VERB)

Stemmer see
Lemmatizer see

Stemmer spun
Lemmatizer spin



Stop Words
Stop words are very common words in a language, such as 'and', 'the' and 'a'. They mostly add noise when ML is applied to text. <br>
NLTK has a predefined list of stop words.

In [22]:
import nltk
# nltk.download("stopwords")

In [23]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [24]:
stop_words = set(stopwords.words('english'))
words = nltk.word_tokenize(text)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield', 'computer', 'science', ',', 'information', 'engineering', ',', 'artificial', 'intelligence', 'concerned', 'interactions', 'computers', 'human', '(', 'natural', ')', 'languages', ',', 'particular', 'program', 'computers', 'process', 'analyze', 'large', 'amounts', 'natural', 'language', 'data', '.', 'Challenges', 'natural', 'language', 'processing', 'frequently', 'involve', 'speech', 'recognition', ',', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', '.']


##### [Regular Expression](https://digitalfortress.tech/tricks/top-15-commonly-used-regex/)<br>
[Python document for RegEx](https://docs.python.org/3/library/re.html?highlight=regex)

In [25]:
import re
pattern = r"[^\w]"
print(re.sub(pattern," ",text))

Natural language processing  NLP  is a subfield of computer science  information engineering  and artificial intelligence concerned with the interactions between computers and human  natural  languages  in particular how to program computers to process and analyze large amounts of natural language data  Challenges in natural language processing frequently involve speech recognition  natural language understanding  and natural language generation 
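
The substitution above leaves doubled spaces wherever punctuation sat next to a space. A minimal sketch of one way to tidy that up, assuming a shortened sample sentence (`sample` is illustrative, not the notebook's `text`):

```python
import re

# Shortened, illustrative sample (an assumption; not the notebook's `text`).
sample = "Natural language processing (NLP) is a subfield of computer science, information engineering, and AI."

# Replace every non-word character with a space, as in the cell above.
no_punct = re.sub(r"[^\w]", " ", sample)

# Collapse runs of whitespace into a single space and trim both ends.
normalized = re.sub(r"\s+", " ", no_punct).strip()

print(normalized)
```

Running the two substitutions in this order keeps words that were separated only by punctuation from being glued together.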


Playing with regex <br> 
[RegExr](https://regexr.com/)<br>
[regex101](https://regex101.com/)<br>

#### Bag-of-words 
Bag of words is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
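
To make the "multiset" idea concrete, here is a small sketch using only the standard library. The helper name `bag_of_words` and the two sample sentences are assumptions for illustration, and tokenization is a plain whitespace split rather than NLTK's tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

# Two sentences with the same words in a different order.
bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a)           # multiplicity is kept: 'the' is counted twice
print(bag_a == bag_b)  # word order is disregarded, so the bags compare equal
```

The two sentences mean different things but produce the same bag, which is exactly the information this representation throws away.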

##### Step 1:  open the document 


In [26]:
import os 
path = os.path.join(os.getcwd(), "nlp.txt")  # portable: no hard-coded Windows separator
with open(path,'r') as f:
    document = f.read().splitlines()
print(document)

['Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. ', '', 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.']


##### Step 2:  Designing the vocabulary
Using scikit-learn's `CountVectorizer`.

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()

##### Step 3:  Creating document vectors

In [28]:
bag_of_words = count_vectorizer.fit_transform(document)


##### Step 4: Converting the bag-of-words model to a pandas DataFrame

In [29]:
import pandas as pd
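# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_names = count_vectorizer.get_feature_names_out()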
pd.DataFrame(bag_of_words.toarray(),columns=feature_names)

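|  | amounts | analyze | and | artificial | between | challenges | computer | computers | concerned | data | ... | processing | program | recognition | science | speech | subfield | the | to | understanding | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 3 | 1 | 1 | 0 | 1 | 2 | 1 | 1 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 2 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |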


The complexity of the bag-of-words model lies in deciding how to design the vocabulary of known words (tokens) and how to score the presence of known words.

It is better to reduce the number of known words using text cleaning techniques:

    1. Ignoring the case of the words
    2. Ignoring punctuation
    3. Removing the stop words from our documents
    4. Reducing words to their base form (text lemmatization and stemming)
    5. Fixing misspelled words
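
The first three cleaning steps can be sketched with the standard library alone. The `STOP_WORDS` set below is a deliberately tiny, hypothetical stand-in for NLTK's list, and `clean_tokens` is an illustrative helper name; steps 4 and 5 need a stemmer, lemmatizer or spell checker and are left out.

```python
import string

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is far longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "with"}

def clean_tokens(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()                                                   # step 1
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))  # step 2
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]        # step 3

tokens = clean_tokens("Challenges in NLP: speech recognition, and the generation of text.")
print(tokens)
```

Lowercasing before the stop-word check matters: a filter that compares against a lowercase list would otherwise keep a capitalized "The" at the start of a sentence.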

##### n-grams 
An n-gram is a contiguous sequence of n items from a given sample of text or speech. <br>
The items can be phonemes, syllables, letters, words or base pairs.
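
As a rough illustration of word-level n-grams, the sketch below slides a window of size n over a token list. The helper name `word_ngrams` and the sample sentence are assumptions made for this example.

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple (empty list if n > len(tokens))."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()

bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

In practice, `CountVectorizer` accepts an `ngram_range` parameter and NLTK provides `nltk.ngrams`, so a hand-written helper like this is mostly useful for seeing what those tools produce.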

##### Scoring words
Once we have our vocabulary of known words, we need to score the occurrence of those words in our data.

    Counts: the number of times each word appears in a document
    Frequencies: the count of each word divided by the total number of words in the document
    
The downside of scoring by word frequency is that the most frequent words in a document get the highest scores, even though they may carry less information for the model than rarer, domain-specific words.
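
A small sketch of both scoring schemes on a made-up sentence (an assumption for illustration) shows the downside directly: the uninformative word "the" ends up with the highest score.

```python
from collections import Counter

# Made-up example sentence, tokenized by whitespace.
tokens = "the cat sat on the mat near the door".split()

# Counts: how many times each word occurs.
counts = Counter(tokens)

# Frequencies: each count divided by the total number of tokens.
total = sum(counts.values())
frequencies = {word: count / total for word, count in counts.items()}

print(counts.most_common(1))   # the top-scoring word and its count
print(frequencies["the"])      # its share of all tokens
```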
<br>


##### TF-IDF  (Term frequency - Inverse document frequency)
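TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection (corpus). 

To fix the frequency problem described above, TF-IDF penalizes words that are frequent across all the documents: a word's score grows with its frequency in a document and shrinks with the number of documents that contain it.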

\begin{align}
W_{x,y} &= tf_{x,y} \times \log\left(\frac{N}{df_x}\right) \\
\\
tf_{x,y} &= \text{frequency of } x \text{ in } y \\
df_x &= \text{number of documents containing } x \\
N &= \text{total number of documents}
\end{align}
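
The formula can be checked by hand on a toy corpus. The sketch below assumes three made-up documents, takes $tf_{x,y}$ to be the raw count of the term in the document, and uses the natural logarithm; all three are assumptions made for illustration.

```python
import math
from collections import Counter

# Three made-up, pre-tokenized documents (an assumption for illustration).
docs = [
    "natural language processing and speech".split(),
    "speech recognition and generation".split(),
    "language data and analysis".split(),
]

N = len(docs)  # total number of documents

# df[x]: number of documents containing term x (each document counted once).
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """W = tf * log(N / df), with tf taken as the raw count of `term` in `doc`."""
    if df[term] == 0:
        return 0.0  # term never seen in the corpus
    return doc.count(term) * math.log(N / df[term])

print(tf_idf("and", docs[0]))      # appears in every document, so log(3/3) = 0
print(tf_idf("speech", docs[0]))   # appears in 2 of 3 documents
print(tf_idf("natural", docs[0]))  # appears in only 1 document, highest weight
```

These values will not match scikit-learn's output: by default `TfidfVectorizer` uses a smoothed IDF, $\ln\frac{1+N}{1+df_x} + 1$, and L2-normalizes each document vector, so a word that occurs in every document still receives a nonzero weight there.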

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(document)

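# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_names = tfidf_vectorizer.get_feature_names_out()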
pd.DataFrame(values.toarray(),columns=feature_names)

|  | amounts | analyze | and | artificial | between | challenges | computer | computers | concerned | data | ... | processing | program | recognition | science | speech | subfield | the | to | understanding | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.141589 | 0.141589 | 0.323047 | 0.141589 | 0.141589 | 0.0 | 0.141589 | 0.283178 | 0.141589 | 0.141589 | ... | 0.107682 | 0.141589 | 0.0 | 0.141589 | 0.0 | 0.141589 | 0.141589 | 0.283178 | 0.0 | 0.141589 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.173808 | 0.0 | 0.0 | 0.228537 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.173808 | 0.0 | 0.228537 | 0.0 | 0.228537 | 0.0 | 0.0 | 0.0 | 0.228537 | 0.0 |


Crash blossoms: a sentence (typically a news headline) that can be read with two different meanings.