<a href="https://colab.research.google.com/github/ajazturki10/COG_INT/blob/main/nlp_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_sm')

In [5]:
# The text contains curly quotes, so read it explicitly as UTF-8.
with open("text.txt", "r", encoding="utf-8") as f:
  lines = f.read()

In [6]:
len(lines.split())

1616

In [7]:
lines

'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”\n\nSo she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.\n\nThere was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it,

In [8]:
doc = nlp(lines)

In [9]:
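from spacy.lang.en.stop_words import STOP_WORDS

class NLP_PIPELINE:
  """Thin wrapper around a parsed spaCy Doc: tokenize, filter, lemmatize, tag."""

  def __init__(self, doc):
    self.doc = doc
    self.stopwords = STOP_WORDS  # spaCy's built-in English stop-word set

  def tokenizer(self, text):
    # The Doc is already tokenized; collect its Token objects into a list.
    return [word for word in text]

  def remove_stopwords(self, text):
    # Case-sensitive match: capitalized stop words such as "So" are kept.
    return [word for word in text if word.text not in self.stopwords]

  def remove_punct(self, text):
    # Whitespace tokens such as '\n\n' are not punctuation, so they remain.
    return [word for word in text if not word.is_punct]

  def lemmatization(self, text):
    # Returns lemma strings, not Token objects.
    return [word.lemma_ for word in text]

  def tagging(self, text):
    # Fine-grained part-of-speech tags as (text, tag) pairs.
    return [(word.text, word.tag_) for word in text]

  def preprocessor(self):
    # Full chain: tokens -> drop stop words -> drop punctuation -> lemmas.
    tokens = self.tokenizer(self.doc)
    tokens_without_sw = self.remove_stopwords(tokens)
    tokens_without_punct = self.remove_punct(tokens_without_sw)
    lemmatized_tokens = self.lemmatization(tokens_without_punct)

    return lemmatized_tokens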

In [10]:
nlp_pipe = NLP_PIPELINE(doc)

In [45]:
nlp_pipe.preprocessor()[:50]

['Alice',
 'begin',
 'tired',
 'sit',
 'sister',
 'bank',
 'have',
 'twice',
 'peep',
 'book',
 'sister',
 'read',
 'picture',
 'conversation',
 'use',
 'book',
 'think',
 'Alice',
 'picture',
 'conversation',
 '\n\n',
 'so',
 'consider',
 'mind',
 'hot',
 'day',
 'feel',
 'sleepy',
 'stupid',
 'pleasure',
 'make',
 'daisy',
 'chain',
 'worth',
 'trouble',
 'get',
 'pick',
 'daisy',
 'suddenly',
 'White',
 'Rabbit',
 'pink',
 'eye',
 'run',
 'close',
 '\n\n',
 'there',
 'remarkable',
 'Alice',
 'think']
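
The list above still contains whitespace tokens (`'\n\n'`) and the lemmas `'so'` and `'there'`. The latter come from the capitalized "So" and "There", which are not literal members of the lowercase stop-word set. One possible alternative is to rely on spaCy's per-token flags `is_stop`, `is_punct`, and `is_space`; in recent spaCy versions `is_stop` compares the lowercased form. This is a minimal sketch, not part of the notebook's class, and the helper name `clean_lemmas` is hypothetical:

```python
def clean_lemmas(tokens):
    """Return lemmas for tokens that are not stop words, punctuation, or whitespace.

    `tokens` is any iterable of spaCy Token objects, e.g. the notebook's `doc`.
    """
    return [
        tok.lemma_
        for tok in tokens
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]
```

Calling `clean_lemmas(doc)[:50]` would be the counterpart of the cell above; its output is not shown here because it was not run against this text.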

In [11]:
tagged_words = nlp_pipe.tagging(doc)

In [12]:
tagged_words[:5]

[('Alice', 'NNP'),
 ('was', 'VBD'),
 ('beginning', 'VBG'),
 ('to', 'TO'),
 ('get', 'VB')]
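
Because `tagging` returns a flat list of `(text, tag)` pairs, the tag distribution can be tallied with the standard library alone. A small sketch: `tag_counts` is a hypothetical helper, and `sample` simply repeats the five pairs shown above.

```python
from collections import Counter

def tag_counts(tagged):
    """Count how often each fine-grained POS tag occurs in (text, tag) pairs."""
    return Counter(tag for _, tag in tagged)

# The first five pairs from the output above, copied by hand.
sample = [
    ("Alice", "NNP"),
    ("was", "VBD"),
    ("beginning", "VBG"),
    ("to", "TO"),
    ("get", "VB"),
]

counts = tag_counts(sample)
print(counts.most_common())
```

On the full list, `tag_counts(tagged_words).most_common(10)` would give the ten most frequent tags, and `spacy.explain(tag)` returns a short description of each.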

In [16]:
from spacy import displacy

In [27]:
sentence = lines[:74]
tokens = nlp(sentence)

In [28]:
displacy.render(tokens, style='dep', jupyter=True, options={'distance':100})
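
`displacy.render` draws the parse as an SVG, which does not survive a plain-text export of the notebook. A textual view of the same arcs can be built from each token's `dep_` and `head` attributes. A sketch, with `dependency_rows` as a hypothetical helper name:

```python
def dependency_rows(tokens):
    """Return (token text, dependency label, head text) for each token.

    `tokens` is any iterable of spaCy Token objects, e.g. a parsed Doc.
    """
    return [(tok.text, tok.dep_, tok.head.text) for tok in tokens]
```

With the `tokens` object from the previous cell, `for row in dependency_rows(tokens): print(row)` lists one arc per line.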

In [41]:
sentence = lines[:5000]

In [42]:
for entity in nlp(sentence).ents:
    print(f"{entity.text} - {entity.label_} - {spacy.explain(entity.label_)}")

Alice - PERSON - People, including fictional
Alice - PERSON - People, including fictional
White Rabbit - ORG - Companies, agencies, institutions, etc.
Alice - PERSON - People, including fictional
Rabbit - PERSON - People, including fictional
Alice - PERSON - People, including fictional
Alice - PERSON - People, including fictional
Alice - PERSON - People, including fictional
First - ORDINAL - "first", "second", etc.
ORANGE MARMALADE - WORK_OF_ART - Titles of books, songs, etc.
Alice - PERSON - People, including fictional
Down - PERSON - People, including fictional
aloud - PERSON - People, including fictional
four thousand miles - QUANTITY - Measurements, as of weight or distance
Alice - PERSON - People, including fictional
Latitude - PERSON - People, including fictional
Longitude - PERSON - People, including fictional
Alice - PERSON - People, including fictional
Latitude - PERSON - People, including fictional
Longitude - PERSON - People, including fictional
Antipathies - WORK_OF_ART - T