# Getting Started with NLP

Some common applications of NLP:

* Text Summarization
* Text Tagging (topic tagging)
* Named Entity Recognition
* Chatbot
* Speech Recognition

Some terms:

* Phonetics/phonology:
The study of linguistic sounds and their relations to written words
* Morphology:
The study of internal structures of words/composition of words
* Syntax:
The study of the structural relationships among words in a sentence
* Semantics:
The study of the meaning of words and how these combine to form the meaning of sentences
* Pragmatics:
Situational use of language sentences
* Discourse:
A linguistic unit that is larger than a single sentence (context)
* Tokenization:
A step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words.
* Normalization:
Before further processing, text needs to be normalized. Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.
* Stemming:
is the process of eliminating affixes (suffixed, prefixes, infixes, postfix, circumfixes) from a word in order to obtain a word stem. ex: running → run
* Lemmatization:
lemmatization is able to capture canonical forms based on a word's lemma.
For example, stemming the word "better" would fail to return its citation form (another word for lemma); however, lemmatization would result in the following:
better → good
* Corpus:
(literally Latin for body) refers to a collection of texts. Such collections may be formed of a single language of texts, or can span multiple languages.
* Stop words:
words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content.
* Parts-of-speech (POS) Tagging:
POS tagging consists of assigning a category tag to the tokenized parts of a sentence. The most popular POS tagging would be identifying words as nouns, verbs, adjectives, etc.
* Bag of Words:
representation model used to simplify the contents of a selection of text. The bag of words model omits grammar and word order, but is interested in the number of occurrences of words within the text. The ultimate representation of the text selection is that of a bag of words.

  Ex:
  "Well, well, well," said John.
  "There, there," said James. "There, there."
  
  The resulting bag of words representation as a dictionary:
     {
      'well': 3,
      'said': 2,
      'john': 1,
      'there': 4,
      'james': 1
     }
     
     
* N-grams:
representation model for simplifying text selection contents. As opposed to the orderless representation of bag of words, n-grams modeling is interested in preserving contiguous sequences of N items from the text selection.

  An example of trigram (3-gram) model of the second sentence of the above example ("There, there," said James.     "There, there.") appears as a list representation below:

   [
      "there there said",
      "there said james",
      "said james there",
      "james there there",
   ]
   
* Statistical Language Modeling:
the process of building a statistical language model which is meant to provide an estimate of a natural language. For a sequence of input words, the model would assign a probability to the entire sequence, which contributes to the estimated likelihood of various possible sequences. This can be especially useful for NLP applications which generate text.

* Syntactic Analysis:
Also referred to as parsing, syntactic analysis is the task of analyzing strings as symbols, and ensuring their conformance to a established set of grammatical rules.

* Semantic Analysis:
Also known as meaning generation, semantic analysis is interested in determining the meaning of text selections (either character or word sequences). After an input selection of text is read and parsed (analyzed syntactically), the text selection can then be interpreted for meaning. Simply put, syntactic analysis is concerned with what words a text selection was made up of, while semantic analysis wants to know what the collection of words actually means. 

* Sentiment analysis:
the process of evaluating and determining the sentiment captured in a selection of text, with sentiment defined as feeling or emotion.
   
* Entity recognition:
is the process used to classify multiple entities found in a text in predefined categories, such as a person,
objects, location, organizations, dates, events, etc.

* Word vector:
refers to the mapping of the words or phrases from vocabulary to a vector of real numbers.

#  Virtual Environment
The main purpose of Python virtual environments is to create an isolated environment for Python projects. This means that each project can have its own dependencies, regardless of what dependencies every other project has.

In [1]:
# In python 2
!pip install virtualenv

Collecting virtualenv
  Obtaining dependency information for virtualenv from https://files.pythonhosted.org/packages/4c/ed/3cfeb48175f0671ec430ede81f628f9fb2b1084c9064ca67ebe8c0ed6a05/virtualenv-20.30.0-py3-none-any.whl.metadata
  Downloading virtualenv-20.30.0-py3-none-any.whl.metadata (4.5 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Obtaining dependency information for distlib<1,>=0.3.7 from https://files.pythonhosted.org/packages/91/a1/cf2472db20f7ce4a6be1253a81cfdf85ad9c7885ffbed7047fb72c24cf87/distlib-0.3.9-py2.py3-none-any.whl.metadata
  Downloading distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB)
Collecting filelock<4,>=3.12.2 (from virtualenv)
  Obtaining dependency information for filelock<4,>=3.12.2 from https://files.pythonhosted.org/packages/4d/36/2a115987e2d8c300a974597416d9de88f2444426de9571f4b59b2cca3acc/filelock-3.18.0-py3-none-any.whl.metadata
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Downloading virtualenv-20.30.0-py3-none-any.whl (

In [2]:
# If you are using Python 3, then you should already have the venv module from the standard library installed.
# To create a new virtual environment inside the directory
# python3 -m venv (env_name)
!python3 -m venv handsonnlp

In [3]:
# On Debian/Ubuntu systems, you need to install the python3-venv package using the following command.
# sudo apt-get install python3-venv
# since it's not installed by default

## Activate / Deactivate your environment

In [4]:
!source handsonnlp/bin/activate

In [None]:
!deactivate

## List Installed Packages With Pip

In [5]:
!pip list

Package                       Version
----------------------------- ---------------
aiobotocore                   2.5.0
aiofiles                      22.1.0
aiohttp                       3.8.5
aioitertools                  0.7.1
aiosignal                     1.2.0
aiosqlite                     0.18.0
alabaster                     0.7.12
anaconda-anon-usage           0.4.2
anaconda-catalogs             0.2.0
anaconda-client               1.12.1
anaconda-cloud-auth           0.1.3
anaconda-navigator            2.5.0
anaconda-project              0.11.1
anyio                         3.5.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.3
astroid                       2.14.2
astropy                       5.1
asttokens                     2.0.5
async-timeout                 4.0.2
atomicwrites                  1.4.0
attrs                         22.1.0
Automat                       20.2.0
autopep8

## Create requirements.txt

In [6]:
!pip freeze > requirements.txt

# Natural Language Processing Libraries
## 1. NLTK  (www.nltk.org/)
Most common package you will encounter working with corpora, categorizing text, analyzing linguistic structure, and more.

It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.

In [7]:
# install NLTK with pip
# activate your env first
#!source handsonnlp/bin/activate
!pip install nltk



### Tokenization in NLTK

In [None]:
# AK: tokenization
import nltk

# Tokenization
sent_ = "I am almost dead this time"
tokens_ = nltk.word_tokenize(sent_)

In [12]:
tokens_

['I', 'am', 'almost', 'dead', 'this', 'time']

In [9]:
# So we need to install punkt
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/ahmedkashkoush/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
# let's run word_tokenize() now
sents = "I am almost dead this time"
tokens = nltk.word_tokenize(sents)

In [11]:
tokens

['I', 'am', 'almost', 'dead', 'this', 'time']

### Getting a synonym of a word in NLTK

In [15]:
# AK: Synonom
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/ahmedkashkoush/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
from nltk.corpus import wordnet
words = wordnet.synsets("spectacular")
print(words)

[Synset('spectacular.n.01'), Synset('dramatic.s.02'), Synset('spectacular.s.02'), Synset('outstanding.s.02')]


In [16]:
words[0].definition()

'a lavishly produced performance'

In [17]:
# AK: Access Synonom definition
for word in words:
    print(word, ":", word.definition())

Synset('spectacular.n.01') : a lavishly produced performance
Synset('dramatic.s.02') : sensational in appearance or thrilling in effect
Synset('spectacular.s.02') : characteristic of spectacles or drama
Synset('outstanding.s.02') : having a quality that thrusts itself into attention


### Stemming in NLTK

In [None]:
# AK: import stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer() # Create the stemmer object

In [20]:
# AK: stem method
stemmer.stem("decreases")

'decreas'

In [21]:
stemmer.stem("running")

'run'

In [29]:
stemmer.stem("cats")

'cat'

In [30]:
stemmer.stem("singing")

'sing'

In [27]:
stemmer.stem("shelves")

'shelv'

In [28]:
stemmer.stem("better")

'better'

### Lemmatization in NLTK

In [None]:
# AK: import Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() # Create the Lemmatizer object

In [None]:
# AK: Lemmatize method
# Default second param is noun
lemmatizer.lemmatize("decreases")

'decrease'

In [38]:
lemmatizer.lemmatize("rocks")

'rock'

In [40]:
lemmatizer.lemmatize("corpora")

'corpus'

In [41]:
lemmatizer.lemmatize("better")

'better'

In [None]:
# AK: lemmatize to adjective 
lemmatizer.lemmatize("better", pos="a")

'good'

In [22]:
import nltk
sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)

In [46]:
tokens

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 '...',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good',
 '.']

## POS tagging in NLTK

In [25]:
tagged = nltk.pos_tag(tokens)

In [49]:
# so we need to download the "averaged_perceptron_tagger" from NLTK
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/muhammad/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
tagged = nltk.pos_tag(tokens)
# AK: Pos_tag: part of speech tagging
""" 
IN: Preposition or subordinating conjunction (e.g., "At", "on").
CD: Cardinal number (e.g., "eight").
JJ: Adjective (e.g., "o'clock", "good").
NNP: Proper noun, singular (e.g., "Thursday", "Arthur").
NN: Noun, singular (e.g., "morning").
:: Punctuation (e.g., "...").
VBD: Verb, past tense (e.g., "did").
RB: Adverb (e.g., "n't", "very").
VB: Verb, base form (e.g., "feel").
.: Punctuation (e.g., ".").
 """

In [35]:
# tokens
tagged

[('At', 'IN'),
 ('eight', 'CD'),
 ("o'clock", 'NN'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN'),
 ('...', ':'),
 ('Arthur', 'NNP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('feel', 'VB'),
 ('very', 'RB'),
 ('good', 'JJ'),
 ('.', '.')]

In [52]:
# list all tags
nltk.help.upenn_tagset()

LookupError: 
**********************************************************************
  Resource [93mtagsets[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('tagsets')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mhelp/tagsets/PY3/upenn_tagset.pickle[0m

  Searched in:
    - '/home/muhammad/nltk_data'
    - '/home/muhammad/anaconda3/nltk_data'
    - '/home/muhammad/anaconda3/share/nltk_data'
    - '/home/muhammad/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


In [36]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to
[nltk_data]     /home/ahmedkashkoush/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

In [38]:
# AK: part of speech tag help
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [39]:
type(tagged)

list

In [40]:
tagged[0:6]

[('At', 'IN'),
 ('eight', 'CD'),
 ("o'clock", 'NN'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN')]

### Identify named entities in NLTK

In [43]:
entities = nltk.chunk.ne_chunk(tagged)

LookupError: 
**********************************************************************
  Resource [93mwords[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('words')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mcorpora/words[0m

  Searched in:
    - '/home/ahmedkashkoush/nltk_data'
    - '/home/ahmedkashkoush/anaconda3/nltk_data'
    - '/home/ahmedkashkoush/anaconda3/share/nltk_data'
    - '/home/ahmedkashkoush/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [41]:
# AK: tagged entities
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/ahmedkashkoush/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [45]:
nltk.download('words')
entities = nltk.chunk.ne_chunk(tagged)

[nltk_data] Downloading package words to
[nltk_data]     /home/ahmedkashkoush/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [61]:
# download words package
nltk.download('words')

[nltk_data] Downloading package words to /home/muhammad/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [49]:
entities = nltk.chunk.ne_chunk(tagged)

In [48]:
entities

ModuleNotFoundError: No module named 'svgling'

Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('...', ':'), Tree('PERSON', [('Arthur', 'NNP')]), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')])

In [50]:
print(entities)

(S
  At/IN
  eight/CD
  o'clock/NN
  on/IN
  Thursday/NNP
  morning/NN
  .../:
  (PERSON Arthur/NNP)
  did/VBD
  n't/RB
  feel/VB
  very/RB
  good/JJ
  ./.)


### Display a parse tree in NLTK

In [4]:
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()

KeyboardInterrupt: 

In [1]:
# AK: Draw treebank tree
nltk.download('treebank')

NameError: name 'nltk' is not defined

In [2]:
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()

KeyboardInterrupt: 

## 2. TextBlob (http://textblob.readthedocs.io/en/dev/index.html)
It provides a simple API for diving deep into common NLP tasks, such as part-of-speech tagging, noun
phrase extraction, sentiment analysis, classification, and much more.

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

In [4]:
# AK: TextBlob: install
# install TextBlob
!pip install -U textblob
# or Install it with python3
!python3 -m pip install -U textblob



### POS Tagging in TextBlob

In [19]:
# AK: pos tagging textblob
from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags

[('The', 'DT'),
 ('titular', 'JJ'),
 ('threat', 'NN'),
 ('of', 'IN'),
 ('The', 'DT'),
 ('Blob', 'NNP'),
 ('has', 'VBZ'),
 ('always', 'RB'),
 ('struck', 'VBN'),
 ('me', 'PRP'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('ultimate', 'JJ'),
 ('movie', 'NN'),
 ('monster', 'NN'),
 ('an', 'DT'),
 ('insatiably', 'RB'),
 ('hungry', 'JJ'),
 ('amoeba-like', 'JJ'),
 ('mass', 'NN'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('penetrate', 'VB'),
 ('virtually', 'RB'),
 ('any', 'DT'),
 ('safeguard', 'NN'),
 ('capable', 'JJ'),
 ('of', 'IN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('doomed', 'JJ'),
 ('doctor', 'NN'),
 ('chillingly', 'RB'),
 ('describes', 'VBZ'),
 ('it', 'PRP'),
 ('assimilating', 'VBG'),
 ('flesh', 'NN'),
 ('on', 'IN'),
 ('contact', 'NN'),
 ('Snide', 'JJ'),
 ('comparisons', 'NNS'),
 ('to', 'TO'),
 ('gelatin', 'VB'),
 ('be', 'VB'),
 ('damned', 'VBN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('concept', 'NN'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('most', 'RBS'),
 ('devastating', 'JJ'),
 ('of', 'IN'),
 ('potenti

### Noun phrase extraction in TextBlob

In [20]:
# AK: TextBlob: Noun phrase
blob.noun_phrases

WordList(['titular threat', 'blob', 'ultimate movie monster', 'amoeba-like mass', 'snide', 'potential consequences', 'grey goo scenario', 'technological theorists fearful', 'artificial intelligence run rampant'])

In [21]:
# need to install brown package
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     /home/ahmedkashkoush/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [22]:
blob.noun_phrases

WordList(['titular threat', 'blob', 'ultimate movie monster', 'amoeba-like mass', 'snide', 'potential consequences', 'grey goo scenario', 'technological theorists fearful', 'artificial intelligence run rampant'])

In [23]:
blob.noun_phrases[0:3]

WordList(['titular threat', 'blob', 'ultimate movie monster'])

In [24]:
blob.sentences

[Sentence("
 The titular threat of The Blob has always struck me as the ultimate movie
 monster: an insatiably hungry, amoeba-like mass able to penetrate
 virtually any safeguard, capable of--as a doomed doctor chillingly
 describes it--"assimilating flesh on contact."),
 Sentence("Snide comparisons to gelatin be damned, it's a concept with the most
 devastating of potential consequences, not unlike the grey goo scenario
 proposed by technological theorists fearful of
 artificial intelligence run rampant.")]

In [26]:

# AK: TextBlob: get first sentence
blob.sentences[0].string

'\nThe titular threat of The Blob has always struck me as the ultimate movie\nmonster: an insatiably hungry, amoeba-like mass able to penetrate\nvirtually any safeguard, capable of--as a doomed doctor chillingly\ndescribes it--"assimilating flesh on contact.'

### Sentiment analysis with TextBlob

Polarity defines negativity or positivity in the sentence, whereas
subjectivity implies whether the sentence discusses something vaguely or
with complete surety.

In [28]:
# AK: sentiment: text blob
for sentence in blob.sentences:
    print("\n",sentence.sentiment)
    print("\n polarity:",sentence.sentiment.polarity)
    print("\n subjectivity:",sentence.sentiment.subjectivity)


 Sentiment(polarity=0.06000000000000001, subjectivity=0.605)

 polarity: 0.06000000000000001

 subjectivity: 0.605

 Sentiment(polarity=-0.34166666666666673, subjectivity=0.7666666666666666)

 polarity: -0.34166666666666673

 subjectivity: 0.7666666666666666


### Correct spelling errors with TextBlob

In [None]:
# AK: TextBlob: correct spelling
sent = TextBlob("I thinkk tha model needs to be trainned more!")

In [32]:
sent.correct()

TextBlob("I think the model needs to be trained more!")

### AK: Language translation with TextBlob

In [35]:
# Install googletrans for translation
# %pip install googletrans==4.0.0-rc1

# Import the Translator from googletrans
from googletrans import Translator

# Create a Translator object
translator = Translator()

# Translate the French text to English
sent_fr = TextBlob(u"Voulez-vous apprendre le français?")
translated_text = translator.translate(sent_fr.string, src='fr', dest='en')
translated_text.text

Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Obtaining dependency information for httpx==0.13.3 from https://files.pythonhosted.org/packages/54/b4/698b284c6aed4d7c2b4fe3ba5df1fcf6093612423797e76fbb24890dd22f/httpx-0.13.3-py3-none-any.whl.metadata
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Obtaining dependency information for hstspreload from https://files.pythonhosted.org/packages/76/c3/87beb45b57c9c418d32aa773a9d0adcf3b91422b7ad1fcc46f9d2b691eed/hstspreload-2025.1.1-py3-none-any.whl.metadata
  Downloading hstspreload-2025.1.1-py3-none-any.whl.metadata (2.1 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Obtaining dependency information for chardet==3.* from https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4

'Do you want to learn French?'

In [36]:
sent_ar = TextBlob(u"هل تريد تعلم العربية?")
sent_ar.translate(from_lang='ar', to='en')

AttributeError: 'TextBlob' object has no attribute 'translate'

### Text Classification using TextBlob (https://textblob.readthedocs.io/en/dev/api_reference.html#module-textblob.classifiers)
Text can be classified into various classes, such as positive and negative.

In [40]:
# AK: TextBlob: classifier Naive
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier
data = [
('I love my country.', 'pos'),
('This is an amazing place!', 'pos'),
('I do not like the smell of this place.', 'neg'),
('I do not like this restaurant', 'neg'),
('I am tired of hearing your nonsense.', 'neg'),
("I always aspire to be like him", 'pos'),
("It's a horrible performance.", "neg"),
("My boss is horrible.", "neg")
]
model = NaiveBayesClassifier(data)
model.classify("It's an awesome place!")

'pos'

In [41]:
blob = TextBlob("The game is good. But the hangover is horrible.", classifier=model)
for s in blob.sentences:
    print(s)
    print(s.classify())

The game is good.
neg
But the hangover is horrible.
neg


## 3. SpaCy (https://spacy.io/)
provides very fast and accurate syntactic
analysis (the fastest of any library released) and also offers named entity
recognition and ready access to word vectors. It is written in Cython
language and contains a wide variety of trained models on language
vocabularies, syntaxes, word-to-vector transformations, and entities
recognition.

In [4]:
# AK: Spacy: Install
    # Accurate and fast syntactic, 
    # offers (named entitiy, access to word vectors) 
    # contains trained models on language vocabularies
    # Written in cython 
!pip install -U spacy
# or Install it with python3
!python3 -m pip install -U spacy

Collecting spacy
  Obtaining dependency information for spacy from https://files.pythonhosted.org/packages/f9/06/941689ebac0a6fc7eb8cf4f057c6604b71ecf3479ec12ed8a7db1fec7ffe/spacy-3.8.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached spacy-3.8.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Obtaining dependency information for spacy-legacy<3.1.0,>=3.0.11 from https://files.pythonhosted.org/packages/c3/55/12e842c70ff8828e34e543a2c7176dac4da006ca6901c9e8b43efab8bc6b/spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Obtaining dependency information for spacy-loggers<2.0.0,>=1.0.0 from https://files.pythonhosted.org/packages/33/78/d1a1a026ef3af911159398c939b1509d5c36fe524c7b644f34a5146c4e16/spacy_loggers-1.0.5-py3-none-any.whl.metadata
  Using cac

In [8]:
import spacy
nlp = spacy.load("en")

OSError: [E941] Can't find model 'en'. It looks like you're trying to load a model from a shortcut, which is obsolete as of spaCy v3.0. To load the model, use its full name instead:

nlp = spacy.load("en_core_web_sm")

For more details on the available models, see the models directory: https://spacy.io/models and if you want to create a blank model, use spacy.blank: nlp = spacy.blank("en")

In [6]:
# To fix that error run the command
!python -m spacy download en
nlp = spacy.load("en_core_web_sm")

[38;5;3m⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.[0m
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m[36m0:00:01[0mm
[?25hInstalling collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [9]:
william_wikidef_str = """William was the son of King William II and Anna Pavlovna of Russia.
On the abdication of his grandfather William I in 1840, he became the Prince of Orange.
On the death of his father in 1849, he succeeded as king of the
Netherlands. William married his cousin Sophie of Württemberg
in 1839 and they had three sons, William, Maurice, and
Alexander, all of whom predeceased him. """

william = nlp(william_wikidef_str)

In [10]:
william.ents

(William,
 William II,
 Anna Pavlovna,
 Russia,
 William,
 1840,
 the Prince of Orange,
 1849,
 Netherlands,
 William,
 Sophie of Württemberg,
 1839,
 three,
 William,
 Maurice,
 Alexander)

In [13]:
[ (i, i.label_, i.label) for i in william.ents]

[(William, 'PERSON', 380),
 (William II, 'PERSON', 380),
 (Anna Pavlovna, 'PERSON', 380),
 (Russia, 'GPE', 384),
 (William, 'PERSON', 380),
 (1840, 'DATE', 391),
 (the Prince of Orange, 'ORG', 383),
 (1849, 'DATE', 391),
 (Netherlands, 'GPE', 384),
 (William, 'PERSON', 380),
 (Sophie of Württemberg, 'ORG', 383),
 (1839, 'DATE', 391),
 (three, 'CARDINAL', 397),
 (William, 'PERSON', 380),
 (Maurice, 'PERSON', 380),
 (Alexander, 'PERSON', 380)]

In [14]:
# let's visualize the recognized entities
import spacy
from spacy import displacy
# if you want to run a web server to serve the visualization
# displacy.serve(william, style="ent") 
# or if you are running jupyter notebook use render() instead
displacy.render(william, style="ent")

ImportError: cannot import name 'display' from 'IPython.core.display' (/home/ahmedkashkoush/Desktop/NLP/handsonnlp/lib/python3.11/site-packages/IPython/core/display.py)

###  Noun Phrase extraction with Spacy

In [18]:
# AK: Spacy: Noun phase extraction
sent = nlp('The book deals with NLP')
for noun_ in sent.noun_chunks:
    print("\n",type(noun_))
    print(noun_.text)
    print('---')
    print(noun_.root.dep_)
    print('---')
    print("head:",noun_.root.head.text)


 <class 'spacy.tokens.span.Span'>
The book
---
nsubj
---
head: deals

 <class 'spacy.tokens.span.Span'>
NLP
---
pobj
---
head: with


## 4. Gensim (https://pypi.python.org/pypi/gensim or https://radimrehurek.com/gensim/) 
It is used primarily for topic modeling and document similarity.

Gensim offers LDA (latent dirichlet allocation—a generative statistical
model that allows sets of observations to be explained by unobserved
groups that explain why some parts of the data are similar).

For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.

#### Read more! [about LDA topic model](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

In [29]:
# AK: Genism: Install & Explanation
    # For topic modling and document similarity
    # LDA: a generative statistical model, allows sets of observations explained by unobserverd groups
!pip install --upgrade gensim
# or
!python3 -m pip install -U gensim


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [30]:
from gensim.models import Word2Vec
min_count = 0
size = 50
window = 2
sentence= "bitcoin is an innovative payment network and a new kind of money."
tokens=sentence.split()
print(tokens)

AttributeError: module 'numpy' has no attribute '_no_nep50_warning'

### Getting a word vector using Gensim.

In [None]:
# AK: gensim: word vector
model = Word2Vec(tokens, min_count=min_count, vector_size=size,window=window)

NameError: name 'Word2Vec' is not defined

In [25]:
model.

<gensim.models.word2vec.Word2Vec at 0x7f9eedc25520>

In [27]:
# vector of 'a'
model.wv['a']

array([ 1.56321377e-02, -1.90189127e-02, -4.17724892e-04,  6.93716388e-03,
       -1.87265233e-03,  1.67726874e-02,  1.80236828e-02,  1.30769610e-02,
       -1.42979168e-03,  1.54274460e-02, -1.70741510e-02,  6.41523115e-03,
       -9.27207991e-03, -1.01839453e-02,  7.18633179e-03,  1.07350927e-02,
        1.55413030e-02, -1.15356091e-02,  1.48623688e-02,  1.32507896e-02,
       -7.42820464e-03, -1.74971707e-02,  1.08766230e-02,  1.30173545e-02,
       -1.57562713e-03, -1.34256398e-02, -1.41616967e-02, -4.99728601e-03,
        1.02840560e-02, -7.32584484e-03, -1.87345482e-02,  7.64697697e-03,
        9.75919515e-03, -1.28610805e-02,  2.41353363e-03, -4.14449396e-03,
        4.60930496e-05, -1.97694749e-02,  5.38536347e-03, -9.49801225e-03,
        2.17481446e-03, -3.14943912e-03,  4.39422950e-03, -1.57589372e-02,
       -5.43286046e-03,  5.33067901e-03,  1.06922844e-02, -4.78387717e-03,
       -1.90264061e-02,  9.01062042e-03], dtype=float32)

## 5. Pattern (https://pypi.python.org/pypi/Pattern)
Is useful NLP tasks e.g. 
1.  part-of-speech taggers
2.  n-gram searches 
3. sentiment analysis
4.  WordNet 
5.  machine learning
    - vector space modeling, k-means clustering, Naive Bayes, K-NN, and SVM classifiers.

## 6. Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)
provides the base forms of words; their parts of speech; whether they are names of
companies, people, etc.; normalizes dates, times, and numeric quantities;
marks up the structure of sentences in terms of phrases and syntactic
dependencies; indicates which noun phrases refer to the same entities;
indicates sentiment; extracts particular or open-class relations between
entity mentions; gets the quotes people said; etc.

There is a live online demo of CoreNLP available at (https://corenlp.run/)