### Natural language processing

In [1]:
text = """Germany is struggling. Its economy has shown no growth in the best part of two years. Its infrastructure is badly in need of modernisation. There are strikes on the railways. Protesting farmers have brought Berlin to a standstill. Deutsche Bank is cutting thousands of jobs. School standards are slipping. There is growing support for parties of the hard left and hard right. For the second time in a quarter of a century it is being labelled the sick man of Europe.

Germany has a history of economic problems breeding political extremism but talk of a return to the Weimar Republic is wildly overblown. The economy is flatlining, not collapsing. There is nothing to match the hyperinflation of 1923 or the mass unemployment of the early 1930s.

That said, the ruling coalition led by the chancellor, Olaf Scholz, is in serious trouble, staggering from crisis to crisis. Late last year, the country’s constitutional court ruled against a plan that allowed money intended for pandemic emergency measures to be spent on the transition to a carbon net zero economy. That blew a €60bn (£52bn) hole in the budget that had to be filled by unpopular austerity measures. As in many other European countries, immigration is a toxic political issue."""
text

'Germany is struggling. Its economy has shown no growth in the best part of two years. Its infrastructure is badly in need of modernisation. There are strikes on the railways. Protesting farmers have brought Berlin to a standstill. Deutsche Bank is cutting thousands of jobs. School standards are slipping. There is growing support for parties of the hard left and hard right. For the second time in a quarter of a century it is being labelled the sick man of Europe.\n\nGermany has a history of economic problems breeding political extremism but talk of a return to the Weimar Republic is wildly overblown. The economy is flatlining, not collapsing. There is nothing to match the hyperinflation of 1923 or the mass unemployment of the early 1930s.\n\nThat said, the ruling coalition led by the chancellor, Olaf Scholz, is in serious trouble, staggering from crisis to crisis. Late last year, the country’s constitutional court ruled against a plan that allowed money intended for pandemic emergency 

#### Segmentation

In [2]:
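# Import NLTK and download the Punkt sentence tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead)
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize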

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# Split text into sentences
sentences = sent_tokenize(text)
sentences

['Germany is struggling.',
 'Its economy has shown no growth in the best part of two years.',
 'Its infrastructure is badly in need of modernisation.',
 'There are strikes on the railways.',
 'Protesting farmers have brought Berlin to a standstill.',
 'Deutsche Bank is cutting thousands of jobs.',
 'School standards are slipping.',
 'There is growing support for parties of the hard left and hard right.',
 'For the second time in a quarter of a century it is being labelled the sick man of Europe.',
 'Germany has a history of economic problems breeding political extremism but talk of a return to the Weimar Republic is wildly overblown.',
 'The economy is flatlining, not collapsing.',
 'There is nothing to match the hyperinflation of 1923 or the mass unemployment of the early 1930s.',
 'That said, the ruling coalition led by the chancellor, Olaf Scholz, is in serious trouble, staggering from crisis to crisis.',
 'Late last year, the country’s constitutional court ruled against a plan that

In [4]:
sentences[2]

'Its infrastructure is badly in need of modernisation.'
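
To see why a trained model such as Punkt is used rather than a plain pattern, the following minimal sketch splits on sentence-final punctuation with a regular expression. The helper name `naive_sent_split` and the second sample sentence are illustrative assumptions, not part of the notebook. The rule copes with the article's short sentences but breaks on an abbreviation.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

simple = "Germany is struggling. There are strikes on the railways."
tricky = "Dr. Smith arrived. He sat down."  # made-up sentence with an abbreviation

print(naive_sent_split(simple))
print(naive_sent_split(tricky))
```

The second call also splits after `Dr.`, which is the kind of case an abbreviation-aware tokenizer is designed to avoid.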

In [5]:
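# Punctuation removal
import re

# Replace every character that is not an ASCII letter or digit with a space.
# Note: this overwrites `text` (the full article) with the cleaned sentence.
text = re.sub(r"[^a-zA-Z0-9]", " ", sentences[2])
text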

'Its infrastructure is badly in need of modernisation '
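
The pattern above keeps only ASCII letters and digits, so it also strips accented letters and leaves a trailing space where the full stop was. A possible variant, sketched here under the assumption that Unicode letters should be kept, uses `\w` and then collapses whitespace. The sample sentence is partly made up for illustration.

```python
import re

sample = "That blew a €60bn hole; café owners noticed."  # illustrative sentence

# Pattern from the cell above: anything that is not an ASCII letter or digit becomes a space.
ascii_only = re.sub(r"[^a-zA-Z0-9]", " ", sample)

# Variant: keep Unicode word characters (letters, digits, underscore) and whitespace,
# then collapse runs of spaces and trim the ends.
unicode_aware = " ".join(re.sub(r"[^\w\s]", " ", sample).split())

print(repr(ascii_only))
print(repr(unicode_aware))
```

Which behaviour is preferable depends on the text: for English-only input the ASCII pattern may be enough, while multilingual text usually needs the Unicode-aware form.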

#### Tokenization

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
words = word_tokenize(text)
print(words)

['Its', 'infrastructure', 'is', 'badly', 'in', 'need', 'of', 'modernisation']


#### Removal of stop words

In [8]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
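# Remove stop words (build the set once instead of reloading the list for every word).
# The comparison is case-sensitive and NLTK's list is lowercase, so 'Its' is kept.
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)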

['Its', 'infrastructure', 'badly', 'need', 'modernisation']
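
'Its' survives the filter because the membership test is case-sensitive while the stop-word list is lowercase. A minimal sketch of a case-insensitive filter follows. The four-word `STOP_WORDS` set is a stand-in for `stopwords.words("english")`, chosen so the example runs without downloading the corpus.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"its", "is", "in", "of"}

words = ['Its', 'infrastructure', 'is', 'badly', 'in', 'need', 'of', 'modernisation']

# Case-sensitive test, as in the cell above: 'Its' is not equal to 'its', so it stays.
case_sensitive = [w for w in words if w not in STOP_WORDS]

# Lowercase each word only for the comparison, keeping the original casing in the result.
case_insensitive = [w for w in words if w.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only inside the test keeps the original casing available for later steps such as named entity recognition, which tends to rely on capitalisation.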


In [10]:
# Have a look at the Turkish stop words in NLTK's corpus
print(stopwords.words("turkish"))

['acaba', 'ama', 'aslında', 'az', 'bazı', 'belki', 'biri', 'birkaç', 'birşey', 'biz', 'bu', 'çok', 'çünkü', 'da', 'daha', 'de', 'defa', 'diye', 'eğer', 'en', 'gibi', 'hem', 'hep', 'hepsi', 'her', 'hiç', 'için', 'ile', 'ise', 'kez', 'ki', 'kim', 'mı', 'mu', 'mü', 'nasıl', 'ne', 'neden', 'nerde', 'nerede', 'nereye', 'niçin', 'niye', 'o', 'sanki', 'şey', 'siz', 'şu', 'tüm', 've', 'veya', 'ya', 'yani']


#### Stemming and lemmatization

In [11]:
nltk.download('wordnet')  # WordNet data used by the lemmatizer
nltk.download('omw-1.4')  # Open Multilingual Wordnet data

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [12]:
# Stemming
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (create the stemmer once and reuse it)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['it', 'infrastructur', 'badli', 'need', 'modernis']


In [13]:
# Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their dictionary form; lemmatize() treats each word as a noun unless `pos` is given
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in words]
print(lemmatized)

['Its', 'infrastructure', 'badly', 'need', 'modernisation']


In [14]:
# Another stemming and lemmatization example
words2 = ['wait', 'waiting', 'studies', 'studying', 'computers']

# Stemming: reduce words to their stems
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words2]
print(f"Stemming output: {stemmed}")

# Lemmatization: reduce words to their dictionary form (default part of speech is noun)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in words2]
print(f"Lemmatization output: {lemmatized}")

Stemming output: ['wait', 'wait', 'studi', 'studi', 'comput']
Lemmatization output: ['wait', 'waiting', 'study', 'studying', 'computer']
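
The lemmatizer leaves 'waiting' and 'studying' unchanged because `lemmatize` assumes a noun when no part of speech is passed. One common workaround is to map Penn Treebank tags (as returned by `pos_tag`) to the single-letter codes WordNet uses. The helper `penn_to_wordnet` below is a hypothetical name, and the tagged pairs are hand-written for illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs; in the notebook these would come from pos_tag.
tagged = [("waiting", "VBG"), ("studies", "NNS"), ("badly", "RB")]
with_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(with_pos)
```

Each pair could then be passed on as `WordNetLemmatizer().lemmatize(word, pos=pos)`. With `pos='v'`, 'waiting' should reduce to 'wait'.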


#### Part of speech tagging

In [15]:
nltk.download('averaged_perceptron_tagger')  # model used by pos_tag
nltk.download('maxent_ne_chunker')  # model used by ne_chunk in the named entity recognition section

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [16]:
from nltk import pos_tag

In [17]:
# Tag each word with its part of speech (note: `words` has already had stop words removed)
pos_tag(words)

[('Its', 'PRP$'),
 ('infrastructure', 'NN'),
 ('badly', 'RB'),
 ('need', 'DT'),
 ('modernisation', 'NN')]
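
Tagged output is a list of `(word, tag)` tuples, so it can be filtered with ordinary list operations. As a small sketch, the snippet below copies the tagger output shown above into a literal, keeps only the nouns (tags starting with `NN`), and counts the tags. The variable names are illustrative.

```python
from collections import Counter

# Tagger output copied from the cell above.
tagged = [("Its", "PRP$"), ("infrastructure", "NN"), ("badly", "RB"),
          ("need", "DT"), ("modernisation", "NN")]

# Keep words whose tag marks a noun (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts.most_common(1))
```

`need` is missing from the nouns because the tagger labelled it `DT` in this stop-word-filtered list. In the full sentence, tagged later in the notebook, it comes out as `NN`.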

In [18]:
"""
POS tags (Penn Treebank tagset)

CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential "there"
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
JJR, JJS: Comparative adjective, superlative adjective
LS: List item marker
MD: Modal
NN: Singular or mass noun
NNS, NNP, NNPS: Plural noun, singular proper noun, plural proper noun
PDT: Predeterminer
WRB: Wh-adverb
WP$: Possessive wh-pronoun
WP: Wh-pronoun
WDT: Wh-determiner
VBZ: Verb, third-person singular present
VBP, VBN, VBG, VBD, VB: Verb forms (non-third-person singular present, past participle, gerund or present participle, past tense, base form)
UH: Interjection
TO: The word "to"
RP: Particle
RBS, RB, RBR: Superlative adverb, adverb, comparative adverb
PRP, PRP$: Personal pronoun, possessive pronoun

"""

#### Named entity recognition

In [19]:
from nltk import ne_chunk
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Yaramis\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [20]:
ner_tree = ne_chunk(pos_tag(word_tokenize(sentences[2])))
print(ner_tree)

(S
  Its/PRP$
  infrastructure/NN
  is/VBZ
  badly/RB
  in/IN
  need/NN
  of/IN
  modernisation/NN
  ./.)


In [21]:
# Same sentence after punctuation removal (`text` was overwritten in cell 5)
ner_tree = ne_chunk(pos_tag(word_tokenize(text)))
print(ner_tree)

(S
  Its/PRP$
  infrastructure/NN
  is/VBZ
  badly/RB
  in/IN
  need/NN
  of/IN
  modernisation/NN)


In [22]:
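# A sentence that contains named entities.
# In the output below the chunker labels 'Twitter' as PERSON and 'CEO Elon Musk' as ORGANIZATION.
text = "Twitter CEO Elon Musk arrived at the Staples Center in Los Angeles, California."
ner_tree = ne_chunk(pos_tag(word_tokenize(text)))
print(ner_tree)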

(S
  (PERSON Twitter/NNP)
  (ORGANIZATION CEO/NNP Elon/NNP Musk/NNP)
  arrived/VBD
  at/IN
  the/DT
  (FACILITY Staples/NNP Center/NNP)
  in/IN
  (GPE Los/NNP Angeles/NNP)
  ,/,
  (GPE California/NNP)
  ./.)
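
To use the recognised entities downstream, they need to be pulled out of the tree. The sketch below works on the printed output above with a regular expression, so it runs without NLTK. `extract_entities` is a hypothetical helper. With a live `ner_tree`, iterating over its subtrees and reading `label()` and `leaves()` would be the more direct route.

```python
import re

# Printed chunk tree copied from the output above.
printed_tree = """(S
  (PERSON Twitter/NNP)
  (ORGANIZATION CEO/NNP Elon/NNP Musk/NNP)
  arrived/VBD
  at/IN
  the/DT
  (FACILITY Staples/NNP Center/NNP)
  in/IN
  (GPE Los/NNP Angeles/NNP)
  ,/,
  (GPE California/NNP)
  ./.)"""

def extract_entities(tree_text):
    """Return (label, entity text) pairs from a printed chunk tree (hypothetical helper)."""
    entities = []
    # Each chunk looks like "(LABEL word/TAG word/TAG)" and contains no nested parentheses.
    for label, tokens in re.findall(r"\((\w+) ([^()]+)\)", tree_text):
        words = [tok.rsplit("/", 1)[0] for tok in tokens.split()]
        entities.append((label, " ".join(words)))
    return entities

print(extract_entities(printed_tree))
```

The labels are carried through as the chunker produced them, so the `PERSON` label on 'Twitter' and the `ORGANIZATION` label on 'CEO Elon Musk' appear unchanged in the result.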
