
In [1]:
import nltk

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
nltk.download('punkt')
  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [5]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

In [6]:
paragraph = "Data preprocessing is an essential step in building a Machine Learning model and depending on how well the data has been preprocessed; the results are seen."

In [7]:
words = word_tokenize(paragraph) # punctuation marks such as ';' and '.' are kept as separate tokens

In [8]:
print(words)

['Data', 'preprocessing', 'is', 'an', 'essential', 'step', 'in', 'building', 'a', 'Machine', 'Learning', 'model', 'and', 'depending', 'on', 'how', 'well', 'the', 'data', 'has', 'been', 'preprocessed', ';', 'the', 'results', 'are', 'seen', '.']


In [9]:
sentences = sent_tokenize(paragraph)

In [10]:
print(sentences)

['Data preprocessing is an essential step in building a Machine Learning model and depending on how well the data has been preprocessed; the results are seen.']


In [11]:
#Stop-word removal

In [12]:
from nltk.corpus import stopwords 

In [13]:
demo = "Stop words removal: Stop words are very commonly used words (a, an, the, etc.) in the documents. These words do not really signify any importance as they do not help in distinguishing two documents."

In [14]:
stop_words = set(stopwords.words('english')) # build a set of English stop words for fast membership tests

In [15]:
word_tokens = word_tokenize(demo) # split the text into word tokens
len(word_tokens)

43

In [16]:
stop_words_removed = [w for w in word_tokens if w not in stop_words]

In [17]:
len(stop_words_removed)

28

In [18]:
stop_words_removed # stop words such as 'are', 'a', 'an', 'the', 'in', 'do', 'not', 'any', 'as', 'they' are gone; 'These' survives because the comparison is case-sensitive

['Stop',
 'words',
 'removal',
 ':',
 'Stop',
 'words',
 'commonly',
 'used',
 'words',
 '(',
 ',',
 ',',
 ',',
 'etc',
 '.',
 ')',
 'documents',
 '.',
 'These',
 'words',
 'really',
 'signify',
 'importance',
 'help',
 'distinguishing',
 'two',
 'documents',
 '.']
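The filter above compares each token with the stop-word set exactly as written, which is why the capitalized 'These' is still in the output. A minimal sketch of a case-insensitive variant follows. The names `toy_stop_words`, `tokens` and `filtered` are illustrative, and the small hand-written set stands in for NLTK's list so the snippet runs without any downloads.

```python
# Hypothetical, hand-picked stop words for illustration only (not NLTK's full list).
toy_stop_words = {"these", "do", "not", "any", "as", "they"}

tokens = ["These", "words", "do", "not", "really", "signify", "any", "importance"]

# Compare the lowercased token, but keep the original spelling in the result.
filtered = [t for t in tokens if t.lower() not in toy_stop_words]
print(filtered)
```

Lowercasing only for the comparison keeps the original casing available for later steps that may need it.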

In [19]:
#Stemming: the process of reducing a word to its stem (root form) by stripping suffixes; the stem need not be a dictionary word.

In [20]:
import nltk
from nltk.stem import PorterStemmer
ps = PorterStemmer()


In [21]:
stemmed = [ps.stem(w) for w in stop_words_removed]

In [22]:
stemmed # stems such as 'remov', 'commonli' and 'signifi' are not real words; this is a known drawback of stemming

['stop',
 'word',
 'remov',
 ':',
 'stop',
 'word',
 'commonli',
 'use',
 'word',
 '(',
 ',',
 ',',
 ',',
 'etc',
 '.',
 ')',
 'document',
 '.',
 'these',
 'word',
 'realli',
 'signifi',
 'import',
 'help',
 'distinguish',
 'two',
 'document',
 '.']

In [23]:
stemmed_sentence = ','.join(stemmed)
stemmed_sentence

'stop,word,remov,:,stop,word,commonli,use,word,(,,,,,,,etc,.,),document,.,these,word,realli,signifi,import,help,distinguish,two,document,.'
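Stemmers apply suffix rules without consulting a dictionary, which is why they can produce non-words or merge unrelated words. The sketch below is a deliberately naive suffix stripper, not the Porter algorithm. `SUFFIXES` and `toy_stem` are hypothetical names, and the rule list is an assumption chosen only to make the failure mode visible.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration of rule-based stemming.
SUFFIXES = ("ing", "ly", "ed", "s")  # hypothetical rule list, checked in this order

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least three characters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["documents", "distinguishing", "caring"]:
    print(w, "->", toy_stem(w))
```

Here "caring" collapses to "car", a real word with an unrelated meaning. Lemmatization is meant to avoid this kind of error.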

In [24]:
#Lemmatization: unlike stemming, lemmatization reduces each word to a form that exists in the language (its lemma).
#A stemmer is easier to build than a lemmatizer, because a lemmatizer needs linguistic knowledge in the form of dictionaries to look up the lemma of each word.
#To resolve a word to its lemma, the lemmatizer needs the word's part of speech; this is what lets it pick the proper root form.
#Supplying that information takes extra machinery, such as a part-of-speech tagger.


[Difference between stemming and lemmatization](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/)

In [25]:
import nltk
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer() # create a lemmatizer instance

print(lemmatizer.lemmatize("Machine", pos='n'))
# pos is the part-of-speech tag: 'n' for noun above, 'v' for verb below
print(lemmatizer.lemmatize("caring", pos='v'))

Machine
care
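To show why the part of speech matters, here is a toy dictionary-lookup lemmatizer. It only sketches the idea: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the three-entry table is an assumption standing in for the WordNet data a real lemmatizer consults.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech); a real lemmatizer uses WordNet.
TOY_LEMMAS = {
    ("caring", "v"): "care",
    ("better", "a"): "good",
    ("machines", "n"): "machine",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("caring", pos="v"))  # found as a verb
print(toy_lemmatize("caring", pos="n"))  # not found as a noun, returned unchanged
```

The same surface form gives different results depending on the tag, so a wrong or missing tag silently leaves the word unreduced.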


BAG OF WORDS:

In [26]:
speech = """In the year 1960, APJ Abdul Kalam’s graduation took place from Madras Institute of Technology. The association of Kalam took place with the Defence 
            Research & Development Service (DRDS). Furthermore, he joined as a scientist at the Aeronautical Development Establishment of the Defence Research 
            and Development Organisation. These were the beginning achievements of his prestigious career as a scientist. Big achievement for Kalam came when 
            he was the project director at ISRO of India‘s first-ever Satellite Launch Vehicle (SLV- III). This satellite was responsible for the deployment of 
            the Rohini satellite in 1980. Moreover, Kalam was highly influential in the development of Polar Satellite Launch Vehicle (PSLV) and SLV projects.
            Both projects were successful. Bringing enhancement in the reputation of Kalam. Furthermore, the development of ballistic missiles was possible 
            because of the efforts of this man. Most noteworthy, Kalam earned the esteemed title of “The missile Man of India”. The Government of India became 
            aware of the brilliance of this man and made him the Chief Executive of the Integrated Guided Missiles Development Program (IGMDP). Furthermore, 
            this program was responsible for the research and development of Missiles. The achievements of this distinguished man didn’t stop there. More success 
            was to come in the form of Agni and Prithvi missiles. Once again, Kalam was influential in the developments of these missiles. It was during his 
            tenure in IGMDP that Kalam played an instrumental role in the developments of missiles like Agni and Prithvi. Moreover, Kamal was a key figure in 
            the Pokhran II nuclear test."""

In [27]:
speech_sentences = sent_tokenize(speech)

In [28]:
import re #regular expressions

In [29]:
corpus = []
for sentence in speech_sentences:
  review = re.sub('[^a-zA-Z]', ' ', sentence) # replace every non-letter character with a space
  review = review.lower()
  review = review.split() # split into a list of words
  review = [ps.stem(w) for w in review if w not in stop_words] # drop stop words, stem the rest
  review = ' '.join(review) # rejoin the stemmed words into one string
  corpus.append(review)


In [30]:
corpus

['year apj abdul kalam graduat took place madra institut technolog',
 'associ kalam took place defenc research develop servic drd',
 'furthermor join scientist aeronaut develop establish defenc research develop organis',
 'begin achiev prestigi career scientist',
 'big achiev kalam came project director isro india first ever satellit launch vehicl slv iii',
 'satellit respons deploy rohini satellit',
 'moreov kalam highli influenti develop polar satellit launch vehicl pslv slv project',
 'project success',
 'bring enhanc reput kalam',
 'furthermor develop ballist missil possibl effort man',
 'noteworthi kalam earn esteem titl missil man india',
 'govern india becam awar brillianc man made chief execut integr guid missil develop program igmdp',
 'furthermor program respons research develop missil',
 'achiev distinguish man stop',
 'success come form agni prithvi missil',
 'kalam influenti develop missil',
 'tenur igmdp kalam play instrument role develop missil like agni prithvi',
 'more

In [31]:
len(corpus)

18

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
cv = CountVectorizer()

In [34]:
x = cv.fit_transform(corpus).toarray()

In [35]:
x.shape

(18, 89)
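Each row of the matrix corresponds to one sentence and each column to one vocabulary term, with the cell holding how often the term occurs in that sentence. The sketch below builds the same kind of matrix by hand for two rows taken from `corpus`. `docs`, `bow_vocab` and `bow_rows` are illustrative names, and the code is a simplified stand-in for `CountVectorizer`, not its implementation.

```python
from collections import Counter

# Two preprocessed sentences copied from the corpus output above.
docs = ["project success", "satellit respons deploy rohini satellit"]

# Vocabulary: the sorted set of all tokens across documents.
bow_vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row of counts per document, one column per vocabulary term.
bow_rows = []
for doc in docs:
    counts = Counter(doc.split())
    bow_rows.append([counts[term] for term in bow_vocab])

print(bow_vocab)
print(bow_rows)
```

The 2 in the second row is "satellit", which occurs twice in that sentence. Word order is discarded entirely.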

**TF-IDF**


In [36]:
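#Term frequency (TF): number of times a term occurs in a sentence / total number of terms in that sentence
#Document frequency (DF): number of sentences (documents) containing the term / total number of sentences
#Inverse document frequency (IDF) is the log of the inverse of DF, so TF-IDF = TF * IDF down-weights terms that occur in many sentences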
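The definitions above can be worked through on a tiny example. This is a textbook-style sketch in plain Python. The helper names `tf` and `idf` and the three toy sentences are assumptions for illustration, and the code assumes the term occurs in at least one document.

```python
import math

# Toy tokenized sentences (illustrative, loosely based on the corpus above).
docs = [["project", "success"], ["success", "come", "form"], ["kalam", "project"]]

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the sentence.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of sentences / number of sentences containing the term).
    # Assumes the term occurs in at least one sentence.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

score = tf("project", docs[0]) * idf("project", docs)
print(round(score, 4))
```

A term that appears in every sentence gets an IDF of log(1) = 0 and therefore a TF-IDF of 0, however often it occurs.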

In [37]:
corpus1 = []
for sentence in speech_sentences:
  review1 = re.sub('[^a-zA-Z]', ' ', sentence) # replace every non-letter character with a space
  review1 = review1.lower()
  review1 = review1.split() # split into a list of words
  review1 = [lemmatizer.lemmatize(w) for w in review1 if w not in stop_words] # drop stop words, lemmatize the rest
  review1 = ' '.join(review1) # rejoin the lemmatized words into one string
  corpus1.append(review1)

In [38]:
corpus

['year apj abdul kalam graduat took place madra institut technolog',
 'associ kalam took place defenc research develop servic drd',
 'furthermor join scientist aeronaut develop establish defenc research develop organis',
 'begin achiev prestigi career scientist',
 'big achiev kalam came project director isro india first ever satellit launch vehicl slv iii',
 'satellit respons deploy rohini satellit',
 'moreov kalam highli influenti develop polar satellit launch vehicl pslv slv project',
 'project success',
 'bring enhanc reput kalam',
 'furthermor develop ballist missil possibl effort man',
 'noteworthi kalam earn esteem titl missil man india',
 'govern india becam awar brillianc man made chief execut integr guid missil develop program igmdp',
 'furthermor program respons research develop missil',
 'achiev distinguish man stop',
 'success come form agni prithvi missil',
 'kalam influenti develop missil',
 'tenur igmdp kalam play instrument role develop missil like agni prithvi',
 'more

In [39]:
corpus1

['year apj abdul kalam graduation took place madras institute technology',
 'association kalam took place defence research development service drds',
 'furthermore joined scientist aeronautical development establishment defence research development organisation',
 'beginning achievement prestigious career scientist',
 'big achievement kalam came project director isro india first ever satellite launch vehicle slv iii',
 'satellite responsible deployment rohini satellite',
 'moreover kalam highly influential development polar satellite launch vehicle pslv slv project',
 'project successful',
 'bringing enhancement reputation kalam',
 'furthermore development ballistic missile possible effort man',
 'noteworthy kalam earned esteemed title missile man india',
 'government india became aware brilliance man made chief executive integrated guided missile development program igmdp',
 'furthermore program responsible research development missile',
 'achievement distinguished man stop',
 'succes

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X1 = tfidf.fit_transform(corpus1).toarray()

In [41]:
X1

array([[0.33669732, 0.        , 0.        , ..., 0.29470816, 0.        ,
        0.33669732],
       [0.        , 0.        , 0.        , ..., 0.34344823, 0.        ,
        0.        ],
       [0.        , 0.        , 0.35520986, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [42]:
X1.shape

(18, 90)
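The values in `X1` are not plain TF times IDF ratios. With its default settings, `TfidfVectorizer` is documented to use raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The sketch below reproduces that recipe in plain Python under those assumptions. `sklearn_style_tfidf` is a hypothetical helper name, not a scikit-learn function.

```python
import math

def sklearn_style_tfidf(docs):
    """Sketch of TF-IDF with smoothed IDF and L2-normalized rows (assumed defaults)."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # Smoothed IDF: behaves as if one extra document contained every term.
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1 for t in vocab}
    rows = []
    for d in docs:
        raw = [d.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab_demo, rows_demo = sklearn_style_tfidf([["project", "success"], ["success", "come", "form"]])
print(vocab_demo)
print(rows_demo[0])
```

Because every row is scaled to unit length, the weights are comparable across sentences of different lengths.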

[**Word Embeddings :** A Word Embedding format generally tries to map a word using a dictionary to a vector.](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)

[**Medium:** Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)

[**Jay Alammar:** The Illustrated Word2vec](http://jalammar.github.io/illustrated-word2vec/)

In [48]:
pip install beautifulsoup4




In [49]:
pip install lxml



In [62]:
import bs4 as bs 
import urllib.request
import re
scrapped_data = urllib.request.urlopen("https://en.wikipedia.org/wiki/Machine_learning")

In [63]:
article = scrapped_data.read()

In [64]:
parsed_article = bs.BeautifulSoup(article,'lxml') # 'lxml' selects the lxml parser, which handles HTML and XML

In [65]:
paragraphs = parsed_article.find_all('p') #p tag is for paragraph in html

In [67]:
article_text = ""


In [68]:
for p in paragraphs:
  article_text += p.text

In [69]:
article_text #this is our corpus

'Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1] It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.\nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.[4][5] In its application 

In [118]:
article_text = article_text.lower()

In [119]:
article_text = re.sub('[^a-zA-Z]',' ',article_text)
article_text = re.sub(r'\s+',' ',article_text)

In [120]:
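# Caution: the substitution above already removed every '.', '!' and '?', so
# sent_tokenize finds no sentence boundaries and returns the whole article as one item.
article_sentences = sent_tokenize(article_text)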



In [122]:
article_sentences

['machine learning ml is the study of computer algorithms that improve automatically through experience it is seen as a subset of artificial intelligence machine learning algorithms build a model based on sample data known as training data in order to make predictions or decisions without being explicitly programmed to do so machine learning algorithms are used in a wide variety of applications such as email filtering and computer vision where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks a subset of machine learning is closely related to computational statistics which focuses on making predictions using computers but not all machine learning is statistical learning the study of mathematical optimization delivers methods theory and application domains to the field of machine learning data mining is a related field of study focusing on exploratory data analysis through unsupervised learning in its application across business problems machin
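The output is a single long string because the punctuation was stripped before sentence splitting. One alternative order is to split first and clean each sentence afterwards. The sketch below uses a simple `re.split` on sentence-ending punctuation as a stand-in for `sent_tokenize`, so that it runs without NLTK data. `raw`, `split_sentences` and `cleaned` are illustrative names.

```python
import re

raw = ("Machine learning is a field of study. It is seen as a subset of AI. "
       "Data mining is related.")

# Step 1: split on whitespace that follows '.', '!' or '?' (stand-in for sent_tokenize).
split_sentences = re.split(r'(?<=[.!?])\s+', raw)

# Step 2: lowercase each sentence and replace runs of non-letters with one space.
cleaned = [re.sub(r'[^a-z]+', ' ', s.lower()).strip() for s in split_sentences]

print(cleaned)
```

With this order, Word2Vec would receive one token list per sentence instead of a single list for the whole article.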

In [123]:
article_words = [word_tokenize(sentence) for sentence in article_sentences]

In [125]:
for i in range(len(article_words)):
  article_words[i] = [w for w in article_words[i] if w not in stop_words]


In [127]:
from gensim.models import Word2Vec

In [133]:
w2v = Word2Vec(article_words,min_count=2)

In [134]:
vocab = w2v.wv.vocab # works in gensim 3.x; gensim 4.0+ replaced this attribute with w2v.wv.key_to_index

In [137]:
len(vocab)

588

In [140]:
print(w2v.wv['machine'])

[-7.6857419e-04 -5.4542592e-04 -6.1941100e-04  3.4336261e-03
 -6.4113406e-03  3.8384793e-03  1.3174195e-04 -6.5967054e-03
  7.9688313e-04  1.8513913e-03  7.5065729e-04 -2.9785451e-03
 -5.9786304e-03 -1.3197592e-03 -5.1291485e-04 -1.6552845e-03
  2.8507796e-03 -8.4973918e-04  2.5239408e-03 -1.2264057e-03
  5.0643901e-03  2.1713667e-03  3.6508611e-03  3.2676540e-05
 -5.0912709e-03 -6.1647217e-03  6.8627005e-03  6.3586957e-03
  2.8053203e-03 -8.5201924e-04 -4.1230535e-03 -2.7150903e-03
 -4.0710806e-03 -3.4160388e-03  4.0625054e-03  1.4321262e-03
 -3.4802589e-03 -5.1956805e-03  6.8957368e-03 -4.6542455e-03
 -9.5743529e-04 -4.2422051e-03  4.5677242e-03  4.1719968e-03
 -3.0714285e-03 -2.7328099e-03 -1.5076450e-03 -4.0402273e-03
  3.7794407e-03  3.0251741e-03 -2.3053202e-03 -2.9817796e-03
  1.1959607e-03  5.4193456e-03  3.1605938e-03  1.8604588e-03
  3.5052474e-03  2.8446136e-04  4.7381599e-03  4.9587847e-03
  1.5029125e-03 -2.2321425e-03 -1.6145490e-03 -5.6282375e-03
  3.7516281e-04 -2.73446

In [142]:
similar_words = w2v.wv.most_similar('machine')



In [144]:
for x in similar_words:
  print(x)

('learning', 0.4765019714832306)
('example', 0.47021251916885376)
('often', 0.4256896376609802)
('also', 0.41702383756637573)
('may', 0.400728702545166)
('performance', 0.3884531855583191)
('algorithms', 0.3853161931037903)
('non', 0.3792549669742584)
('said', 0.37637993693351746)
('regression', 0.37627890706062317)
