<a href="https://colab.research.google.com/github/adityan-nainar/Machine-Learning/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing**

---



## Stemming, Lemmatization and Stopwords

In [1]:
import nltk
nltk.download("wordnet")
nltk.download("punkt")   # for tokenization
nltk.download("stopwords")   # for stopwords

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
lemmer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
paragraph = """
The Japan Transport Safety Board (JTSB), the South Korean Aviation and Railway Accident Investigation Board (ARAIB), and the United States National Transportation Safety Board (NTSB) all investigated the accident, with assistance from experts in South Korea and the United States. On 30 May 2016, investigators revealed that the low-pressure turbine blades on the left (number one) Pratt & Whitney PW4090 engine had "shattered", with fragments piercing the engine cover, with fragments subsequently found on the runway. The engine's high-pressure turbine blades and high-pressure compressor were intact and free of abnormalities, and investigators found no evidence of a bird strike.[10][11]

The aircraft was repaired and returned to service with Korean Air on 3 June 2016.[12]

The final JTSB investigative report, released on 26 July 2018, discussed a significant number of problems related to the failure and the response of the crew and passengers to it. These included poor maintenance standards that overlooked a crack growing in the LP turbine disc in the engine created by metal fatigue that eventually failed, the failure of the crew to locate the list of emergency procedures for use in such an emergency, beginning evacuation of the aircraft whilst the engines were still turning meaning there was a risk of passengers being blown away by the engines, and passengers ignoring instructions to leave luggage behind when using the evacuation slides risking piercing of the slides.[13] As a result of the fire, the FAA issued an Airworthiness Directive mandating inspection of engines of the type involved in the fire to evaluate the condition of the components which failed on Flight 2708.[5]: 56
"""

In [3]:
sentence = nltk.sent_tokenize(paragraph)
print(sentence)

stemmer.stem('going')
lemmer.lemmatize('historical')

['\nThe Japan Transport Safety Board (JTSB), the South Korean Aviation and Railway Accident Investigation Board (ARAIB), and the United States National Transportation Safety Board (NTSB) all investigated the accident, with assistance from experts in South Korea and the United States.', 'On 30 May 2016, investigators revealed that the low-pressure turbine blades on the left (number one) Pratt & Whitney PW4090 engine had "shattered", with fragments piercing the engine cover, with fragments subsequently found on the runway.', "The engine's high-pressure turbine blades and high-pressure compressor were intact and free of abnormalities, and investigators found no evidence of a bird strike.", '[10][11]\n\nThe aircraft was repaired and returned to service with Korean Air on 3 June 2016.', '[12]\n\nThe final JTSB investigative report, released on 26 July 2018, discussed a significant number of problems related to the failure and the response of the crew and passengers to it.', 'These included 

'historical'

In [4]:
len(sentence)

8

In [5]:
import re # regular expression

corpus = []

for i in range(len(sentence)):
  # Replace every non-letter with a space. The range must be A-Z: A-z would also keep [ \ ] ^ _ and `
  review = re.sub('[^a-zA-Z]', " ", sentence[i])
  review = review.lower()
  corpus.append(review)

print(corpus)

[' the japan transport safety board  jtsb   the south korean aviation and railway accident investigation board  araib   and the united states national transportation safety board  ntsb  all investigated the accident  with assistance from experts in south korea and the united states ', 'on    may       investigators revealed that the low pressure turbine blades on the left  number one  pratt   whitney pw     engine had  shattered   with fragments piercing the engine cover  with fragments subsequently found on the runway ', 'the engine s high pressure turbine blades and high pressure compressor were intact and free of abnormalities  and investigators found no evidence of a bird strike ', '          the aircraft was repaired and returned to service with korean air on   june      ', '      the final jtsb investigative report  released on    july       discussed a significant number of problems related to the failure and the response of the crew and passengers to it ', 'these included poor 
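
The character class in this cleaning step deserves a closer look. The following is a minimal standard-library sketch, with a made-up sample string, comparing the range `A-z` with `A-Z`. In ASCII, `A-z` also covers the six characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), so citation brackets and underscores survive the looser pattern.

```python
import re

# Hypothetical sample text, chosen to contain brackets and an underscore.
sample = "strike.[10][11] The aircraft_was repaired"

# A-z spans ASCII 65..122, which includes [ \ ] ^ _ and the backtick.
loose = re.sub(r"[^a-zA-z]", " ", sample).lower()

# A-Z plus a-z keeps letters only.
strict = re.sub(r"[^a-zA-Z]", " ", sample).lower()

print(repr(loose))
print(repr(strict))
```

Both patterns replace characters one for one, so the cleaned strings keep the length of the input; only the set of surviving characters differs.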

In [None]:
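stop_words = set(stopwords.words('english'))   # build the set once, not once per word

for i in corpus:
  words = nltk.word_tokenize(i)
  for word in words:
    if word not in stop_words:
      print(lemmer.lemmatize(word))   # print the lemma rather than the original token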

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
# cv = CountVectorizer()   # default: unigram counts
cv = CountVectorizer(binary=True, ngram_range=(2,3))   # binary (presence/absence) bag of bigrams and trigrams

In [9]:
X = cv.fit_transform(corpus)

In [None]:
cv.vocabulary_ # maps each n-gram to its column index, not to its frequency

In [None]:
X[1].toarray()
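
To make the binary n-gram representation concrete, here is a rough standard-library sketch of what a binary bag of bigrams and trigrams looks like. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the two-document corpus and the `ngrams` helper are hypothetical, and the tokenization only approximates `CountVectorizer`'s default (lowercased tokens of two or more word characters).

```python
import re

def ngrams(tokens, n_min=2, n_max=3):
    """Return all word n-grams for n in [n_min, n_max], joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Hypothetical toy corpus.
docs = ["the engine had shattered", "the engine cover"]

# Lowercase and keep tokens of two or more word characters.
tokenised = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

# Vocabulary: every distinct n-gram gets a column index, in sorted order.
all_grams = sorted({g for tokens in tokenised for g in ngrams(tokens)})
vocabulary = {gram: index for index, gram in enumerate(all_grams)}

# Binary matrix: 1 if the n-gram occurs in the document, however often.
matrix = [[0] * len(vocabulary) for _ in docs]
for row, tokens in zip(matrix, tokenised):
    for gram in ngrams(tokens):
        row[vocabulary[gram]] = 1

print(vocabulary)
print(matrix)
```

As with `cv.vocabulary_`, the dictionary values are column positions; the presence flags themselves live in the matrix rows.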

## TF-IDF

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(ngram_range=(2,3), max_features=3)
X = cv.fit_transform(corpus)

In [22]:
X.toarray()

array([[1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        ],
       [0.757092  , 0.        , 0.65330828],
       [0.        , 0.6113708 , 0.79134426],
       [0.        , 0.36033646, 0.9328224 ],
       [0.        , 0.        , 0.        ]])
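
The shape of this output follows from how the weights are computed. Below is a minimal standard-library sketch of the weighting that scikit-learn documents for its defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The count matrix is hypothetical and `tfidf_row` is an illustrative helper, not a library function.

```python
import math

# Hypothetical term counts per document for a 3-term vocabulary.
counts = [
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 2],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by IDF, then L2-normalize (all-zero rows stay zero)."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 4) for v in row])
```

Under this scheme a document containing exactly one of the retained features normalizes to a single `1.0`, and a document containing none of them stays all zeros, which matches the pattern seen with `max_features=3`.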

## Word2Vec

### Using BOW

In [5]:
# Importing the Dataset

import pandas as pd

messages = pd.read_csv('/content/SMSSpamCollection.txt', sep='\t',names=["label", "message"])

In [None]:
print(messages)

In [1]:
#Data cleaning and preprocessing
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [8]:
stop_words = set(stopwords.words('english'))   # load once; calling stopwords.words() per word is slow
corpus = []
for i in range(len(messages)):
    review = re.sub('[^a-zA-Z0-9]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()

    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [11]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500, binary=True)
X = cv.fit_transform(corpus).toarray()

In [12]:
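# One-hot encode the labels; the columns are ordered alphabetically (ham, spam)
y=pd.get_dummies(messages['label'])
y=y.iloc[:,1].values   # keep the 'spam' column: 1 = spam, 0 = ham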

In [13]:
# Train Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [15]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)
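
For intuition about what the classifier estimates, here is a small standard-library sketch of multinomial Naive Bayes with Laplace smoothing. It is a simplified illustration, not scikit-learn's code: the function names, the four-word vocabulary and the toy rows are all hypothetical.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature over all documents of class c.
        totals = [sum(r[j] for r in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood

def predict(x, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {
        c: log_prior[c] + sum(xj * w for xj, w in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical binary bag-of-words rows over ["free", "prize", "home", "later"].
X_toy = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
y_toy = [1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

params = fit_multinomial_nb(X_toy, y_toy)
print(predict([1, 1, 0, 0], *params))
print(predict([0, 0, 1, 1], *params))
```

The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.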

In [16]:
#prediction
y_pred=spam_detect_model.predict(X_test)
from sklearn.metrics import accuracy_score,classification_report

In [17]:
score=accuracy_score(y_test,y_pred)
print(score)

0.9865470852017937


In [18]:
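from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred). Swapping the arguments, as in
# the run recorded below, interchanges precision and recall for each class.
print(classification_report(y_test,y_pred))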

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       960
           1       0.94      0.97      0.95       155

    accuracy                           0.99      1115
   macro avg       0.97      0.98      0.97      1115
weighted avg       0.99      0.99      0.99      1115
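
Argument order matters for these reports: scikit-learn documents the signature as `classification_report(y_true, y_pred)`, and swapping the two interchanges precision and recall for each class. The sketch below shows the effect with a hand-rolled helper on made-up labels; `precision_recall` is a hypothetical function written for this illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 1 = spam, 0 = ham.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(precision_recall(true_labels, pred_labels))   # intended order
print(precision_recall(pred_labels, true_labels))   # swapped order
```

Accuracy and F1 are symmetric in the two arguments, so only the precision, recall and support columns change when the order is reversed.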



### Using TF-IDF

In [21]:
# Creating the TFIDF model
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=2500, ngram_range=(1,2))
X = tv.fit_transform(corpus).toarray()
# Train Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)

In [22]:
#prediction
y_pred=spam_detect_model.predict(X_test)
score=accuracy_score(y_test,y_pred)
print(score)
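from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred). Swapping the arguments, as in
# the run recorded below, interchanges precision and recall for each class.
print(classification_report(y_test,y_pred))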

0.9811659192825112
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       976
           1       0.87      1.00      0.93       139

    accuracy                           0.98      1115
   macro avg       0.93      0.99      0.96      1115
weighted avg       0.98      0.98      0.98      1115



### Using Word2Vec

In [23]:
!pip install gensim



In [25]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
stop_words = set(stopwords.words('english'))   # load once; calling stopwords.words() per word is slow
corpus = []
for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()

    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [28]:
nltk.download('punkt')

from nltk import sent_tokenize
from gensim.utils import simple_preprocess

words=[]
for doc in corpus:
    # Use distinct loop names so the inner variable does not shadow the outer one
    for sent in sent_tokenize(doc):
        words.append(simple_preprocess(sent))   # lowercased list of tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [30]:
import gensim
# Let's train Word2Vec from scratch
model=gensim.models.Word2Vec(words,window=5,min_count=2)
model.wv.index_to_key

['call',
 'get',
 'ur',
 'gt',
 'lt',
 'go',
 'ok',
 'day',
 'free',
 'know',
 'come',
 'like',
 'time',
 'good',
 'got',
 'love',
 'text',
 'want',
 'send',
 'need',
 'one',
 'txt',
 'today',
 'going',
 'stop',
 'home',
 'lor',
 'sorry',
 'see',
 'still',
 'mobile',
 'take',
 'back',
 'da',
 'reply',
 'dont',
 'think',
 'tell',
 'week',
 'hi',
 'phone',
 'new',
 'later',
 'please',
 'pls',
 'co',
 'msg',
 'min',
 'make',
 'night',
 'dear',
 'message',
 'well',
 'say',
 'thing',
 'much',
 'oh',
 'hope',
 'claim',
 'great',
 'hey',
 'give',
 'number',
 'happy',
 'wat',
 'friend',
 'work',
 'way',
 'yes',
 'www',
 'prize',
 'let',
 'right',
 'tomorrow',
 'already',
 'tone',
 'ask',
 'win',
 'said',
 'life',
 'cash',
 'amp',
 'yeah',
 'im',
 'really',
 'meet',
 'babe',
 'find',
 'miss',
 'morning',
 'thanks',
 'last',
 'uk',
 'service',
 'year',
 'anything',
 'care',
 'would',
 'com',
 'also',
 'lol',
 'nokia',
 'feel',
 'every',
 'keep',
 'sure',
 'pick',
 'urgent',
 'sent',
 'contact',


In [31]:
model.corpus_count

5564

In [32]:
model.epochs

5

In [36]:
model.wv.similar_by_word('king')

[('ready', 0.9941062927246094),
 ('night', 0.9940164089202881),
 ('thats', 0.9938697218894958),
 ('little', 0.9938569068908691),
 ('dear', 0.9938194155693054),
 ('well', 0.993813693523407),
 ('keep', 0.993794858455658),
 ('sure', 0.9937848448753357),
 ('take', 0.9937664270401001),
 ('use', 0.993759036064148)]
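
The scores in this list are cosine similarities between word vectors. A minimal standard-library sketch of that ranking follows; the three-dimensional embeddings and the `most_similar` helper are hypothetical stand-ins for the trained model. When every neighbour scores close to 0.99, as above, it may indicate that the vectors have not separated much, which is plausible for a small corpus trained for only a few epochs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings.
vectors = {
    "king": [0.9, 0.1, 0.3],
    "queen": [0.8, 0.2, 0.4],
    "free": [-0.2, 0.9, 0.1],
}

def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to the given word."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("king"))
```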

In [41]:
import numpy as np

In [42]:
def avg_word2vec(doc):
    # Keep only in-vocabulary words; key_to_index is a dict, so lookups are fast
    vectors = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    if not vectors:
        # No known words: return a zero vector instead of the NaN np.mean gives for an empty list
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

In [38]:
!pip install tqdm

from tqdm import tqdm
words[73]
type(model.wv.index_to_key)



list

In [None]:
# Apply to every tokenized sentence
X=[]
for i in tqdm(range(len(words))):
    X.append(avg_word2vec(words[i]))