# Text Cleansing

In [10]:
import urllib.request
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
# Download the raw HTML of the page
html = urllib.request.urlopen(url).read()

# Strip the HTML tags and keep only the text
text = BeautifulSoup(html, 'html.parser').get_text()
print(text)




BBC NEWS | Health | Blondes 'to die out in 200 years'






































NEWS
  SPORT
  WEATHER
  WORLD SERVICE

  A-Z INDEX 

  SEARCH 












































     You are in: Health  
    
    











News Front Page





Africa


Americas


Asia-Pacific


Europe


Middle East


South Asia


UK


Business


Entertainment


Science/Nature


Technology


Health


Medical notes


-------------


Talking Point


-------------


Country Profiles


In Depth


-------------


Programmes


-------------


























SERVICES







Daily E-mail







News Ticker







Mobile/PDAs






-------------















Text Only








Feedback







Help
























EDITIONS







Change to UK















Friday, 27 September, 2002, 11:51 GMT 12:51 UK

Blondes 'to die out in 200 years'


Scientists believe the last blondes will be in Finland



	The last natural blondes will die out within 200 years, s
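The extracted text above still contains long runs of blank lines and indented navigation labels. A minimal sketch of one way to tidy it, using only the standard library; the helper name `collapse_whitespace` and the `sample` string are made up for illustration and are not part of the original notebook:

```python
def collapse_whitespace(raw_text: str) -> str:
    """Drop empty lines and trim leading/trailing whitespace on the rest."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical fragment shaped like the get_text() output above.
sample = "\n\n\nNEWS\n  SPORT\n  WEATHER\n\n\n     You are in: Health  \n    \n"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Beautiful Soup's `get_text()` also accepts `separator` and `strip` arguments, which may give a similar result directly, for example `get_text(separator="\n", strip=True)`.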

# Sentence tokenization (splitting sentences)

In [56]:
from nltk.tokenize import sent_tokenize
    
text = "Hello, I am Amir Esmaeili. I work as a software engineer, who is trying to become an AI eng."

# Default sentence splitter for standard texts
sentences = sent_tokenize(text)
print(sentences)

['Hello, I am Amir Esmaeili.', 'I work as a software engineer, who is trying to become an AI eng.']
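To see why a trained sentence tokenizer is useful, it can help to compare it with a naive rule. The sketch below is not what `sent_tokenize` does internally; it is a hypothetical helper (`naive_sent_split`) that splits after `.`, `!` or `?` followed by whitespace, and it fails on abbreviations:

```python
import re


def naive_sent_split(text: str) -> list:
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Hello, I am Amir Esmaeili. I work as a software engineer, who is trying to become an AI eng."
print(naive_sent_split(text))

# An abbreviation is enough to break the naive rule.
tricky = "Dr. Smith arrived. He was late."
print(naive_sent_split(tricky))
```

The first call gives the same two sentences as above, while the second wrongly treats `Dr.` as a sentence of its own.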


# Custom tokenization

In [68]:
from nltk.corpus import webtext
from nltk.tokenize import PunktSentenceTokenizer

# The text below is in conversation format
text = webtext.raw('overheard.txt')
# Train a Punkt tokenizer on this text; useful when the text is not in a standard format
sent_tokenizer = PunktSentenceTokenizer(text)

In [69]:
sentences = sent_tokenizer.tokenize(text)

## Checking the difference

In [71]:
# Default sentence tokenizer 
sentence2 = sent_tokenize(text)

In [75]:
print(sentence2[678] == sentences[678])
print(sentence2[678])
print("-----------")
print(sentences[678])

False
Girl: But you already have a Big Mac...
Hobo: Oh, this is all theatrical.
-----------
Girl: But you already have a Big Mac...


# Word tokenization

In this section, different ways of tokenizing words are demonstrated.
* standard
* custom regex

In [4]:
from nltk.tokenize import word_tokenize
text = "Hello, I'm, Amir Esmaeili. Computer science student @ IUST and full-stack developer at Satplat co."

# Standard word tokenization
words = word_tokenize(text)
print(words)

['Hello', ',', 'I', "'m", ',', 'Amir', 'Esmaeili', '.', 'Computer', 'science', 'student', '@', 'IUST', 'and', 'full-stack', 'developer', 'at', 'Satplat', 'co', '.']


In [84]:
# A different tokenizer: punctuation is always split into separate tokens
from nltk.tokenize import WordPunctTokenizer

custom_tokenizer = WordPunctTokenizer()
words = custom_tokenizer.tokenize(text)
print(words)

['Hello', ',', 'I', "'", 'm', ',', 'Amir', 'Esmaeili', '.', 'Computer', 'science', 'student', '@', 'IUST', 'and', 'full', '-', 'stack', 'developer', 'at', 'Satplat', 'co', '.']


In [86]:
from nltk.tokenize import regexp_tokenize

# Tokenizing with a custom regex: keep runs of word characters, apostrophes and hyphens
words = regexp_tokenize(text, pattern=r"[\w'-]+")
print(words)

['Hello', "I'm", 'Amir', 'Esmaeili', 'Computer', 'science', 'student', 'IUST', 'and', 'full-stack', 'developer', 'at', 'Satplat', 'co']
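The regex-based result above can be reproduced with the standard library alone, which makes it easier to see what the pattern does. This is a sketch of the same idea with `re.findall`, not a replacement for `regexp_tokenize`; the variable name `token_pattern` is arbitrary:

```python
import re

text = "Hello, I'm, Amir Esmaeili. Computer science student @ IUST and full-stack developer at Satplat co."

# Keep runs of word characters, apostrophes and hyphens; everything else
# (spaces, commas, '@', full stops) acts as a separator and is dropped.
token_pattern = re.compile(r"[\w'-]+")
words = token_pattern.findall(text)
print(words)
```

Because the pattern describes the tokens to keep rather than the separators, punctuation such as `,` and `@` disappears from the result, while `I'm` and `full-stack` stay whole.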


# Stemming

**Stemmers available in the NLTK package**
* PorterStemmer: good enough for most English text
* LancasterStemmer: more aggressive than Porter
* SnowballStemmer: available for more than a dozen languages, including Dutch, English and Spanish (see `SnowballStemmer.languages`)

In [1]:
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer


pst = PorterStemmer()
lst = LancasterStemmer()
spanish_sst = SnowballStemmer('spanish')
print(pst.stem('Impressive'))
print(lst.stem('eating'))
print(spanish_sst.stem('hola'))

impress
eat
hol


In [3]:
words = ['caresses', 'flies', 'dies', 'mules', 'denied',
        'died', 'agreed', 'owned', 'humbled', 'sized',
        'meeting', 'stating', 'siezing', 'itemization',
        'sensational', 'traditional', 'reference', 'colonizer']
print(list(map(pst.stem, words)))

['caress', 'fli', 'die', 'mule', 'deni', 'die', 'agre', 'own', 'humbl', 'size', 'meet', 'state', 'siez', 'item', 'sensat', 'tradit', 'refer', 'colon']


# Lemmatization

As the previous output shows, some stems are not real words (for example `fli` and `agre`). That is because a stemmer
only strips suffixes such as _ing_, _s_ and _ed_ according to fixed rules.
Lemmatization instead uses a lexical database called **WordNet**, which maps each word to its dictionary form (lemma).

In [37]:
from nltk.stem import WordNetLemmatizer

nouns = ['feet', 'teeth', 'mice']
from_verbs = ['are', 'inspiring', 'flying', 'was'] 

word_lemmatizer = WordNetLemmatizer()
print('nouns', [word_lemmatizer.lemmatize(x, pos='n') for x in nouns])
print('verbs', [word_lemmatizer.lemmatize(x, pos='v') for x in from_verbs])

nouns ['foot', 'teeth', 'mouse']
verbs ['be', 'inspire', 'fly', 'be']


As shown above, the role (part of speech) of each word is passed to the lemmatizer explicitly through `pos`.

# Stemming vs. Lemmatization

### Stemming

> Stemming removes suffixes such as *es*, *s* and *ing* using fixed rules.
It does not always produce a meaningful word, because it chops off the suffix without considering
the meaning.

### Lemmatization

> Lemmatization uses a dictionary called **WordNet** to find the root (lemma) of a word.
It also requires the role of the word (for example *noun* or *verb*) to be known beforehand.
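
The contrast can be sketched in plain Python. The suffix rules and the tiny lookup table below are hypothetical stand-ins invented for illustration (they are far simpler than the Porter rules or WordNet), but they show the difference between chopping suffixes and looking a word up together with its role:

```python
# Hypothetical suffix rules and lookup table, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_TABLE = {
    ("feet", "n"): "foot",
    ("mice", "n"): "mouse",
    ("was", "v"): "be",
    ("flying", "v"): "fly",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the (word, pos) pair; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


for word, pos in [("flying", "v"), ("feet", "n"), ("was", "v"), ("agreed", "v")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
```

The toy stemmer turns `agreed` into the non-word `agre` and leaves irregular forms such as `feet` untouched, while the lookup handles `feet` and `was` but only for entries it knows about.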

# Stop Words

Every language has some very common words (such as *the* or *is* in English) that carry little information for most algorithms, so they are usually removed.

In [None]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
print(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/amiresm/nltk_data...
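A minimal sketch of how such a list is typically applied to a token list. The `STOP_WORDS` set here is a small hand-picked subset used only for illustration; in the notebook the real list comes from `stopwords.words('english')`. The comparison is done on the lowercased token, on the assumption that the stop-word list itself is lowercase:

```python
# Small hand-picked subset, for illustration only.
STOP_WORDS = {"i", "was", "the", "a", "in", "he", "to"}


def remove_stop_words(tokens):
    """Keep tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


tokens = ["In", "1884", "he", "emigrated", "to", "the", "United", "States"]
print(remove_stop_words(tokens))
```

Without the `lower()` call, capitalized tokens such as `In` at the start of a sentence would not match the lowercase entries and would be kept.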


# Part-of-Speech Tagging

In [2]:
import nltk
from nltk import word_tokenize

# Data for pos tagging
nltk.download('averaged_perceptron_tagger')

text = "I was eating dinner"
words = word_tokenize(text)
tagged_words = nltk.pos_tag(words)
print(tagged_words)

[('I', 'PRP'), ('was', 'VBD'), ('eating', 'VBG'), ('dinner', 'NN')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/amiresm/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Trained POS Tagging

In [16]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import brown
from nltk.tag import pos_tag, UnigramTagger, BigramTagger, TrigramTagger, DefaultTagger


nltk.download('brown')
nltk.download('tagsets')

[nltk_data] Downloading package brown to /home/amiresm/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package tagsets to /home/amiresm/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

In [4]:
# Tagged sentences from the 'news' category of the Brown corpus
brown_tagged_sents = brown.tagged_sents(categories='news')

# Use 90% of the data for training and the remaining 10% for testing
train_data = brown_tagged_sents[:int(len(brown_tagged_sents)*.9)]
test_data = brown_tagged_sents[int(len(brown_tagged_sents)*.9):]

In [7]:
# Default tagger: tags every word as a noun ('NN')
def_tagger = DefaultTagger('NN')

# A backoff tagger is the tagger that the current one falls back to when it cannot tag a word
uni_tagger = UnigramTagger(train=train_data, backoff=def_tagger)
print(uni_tagger.evaluate(test_data))

bi_tagger = BigramTagger(train=train_data, backoff=uni_tagger)
print(bi_tagger.evaluate(test_data))

tri_tagger = TrigramTagger(train=train_data, backoff=bi_tagger)
print(tri_tagger.evaluate(test_data))

0.8361407355726104
0.8452108043456593
0.843317053722715


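**UnigramTagger**, **BigramTagger** and **TrigramTagger** are n-gram taggers that are trained on tagged sentences.
They are named after the size of the context they use: a unigram tagger looks only at the current word, a bigram tagger also uses the tag of the previous word, and a trigram tagger uses the tags of the two previous words.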
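The unigram-plus-backoff idea can be sketched in a few lines of plain Python. This is a simplified illustration, not NLTK's implementation; the function names and the two-sentence training set are made up, and the tags follow the Brown style seen in the output above:

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with_backoff(words, model, default_tag="NN"):
    """Use the learned tag for known words, else back off to the default tag."""
    return [(w, model.get(w, default_tag)) for w in words]


# Tiny hand-made training set (hypothetical), using Brown-style tags.
train = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
model = train_unigram(train)
print(tag_with_backoff(["the", "dog", "barks"], model))
```

The word `barks` never appears in the training data, so it receives the default `NN` tag, which is the same role `DefaultTagger('NN')` plays as the last link of the backoff chain.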

In [13]:
text = "UnigramTagger, BigramTagger, TrigramTagger are kinds of machine learning techniques used to train data. They are named based on the number of words they process before the main word for tagging."

In [14]:
sentences = nltk.sent_tokenize(text)
words = []
for sent in sentences:
    words.extend(nltk.word_tokenize(sent))

print(words)
print('\n')
print(bi_tagger.tag(words))

['UnigramTagger', ',', 'BigramTagger', ',', 'TrigramTagger', 'are', 'kinds', 'of', 'machine', 'learning', 'techniques', 'used', 'to', 'train', 'data', '.', 'They', 'are', 'named', 'based', 'on', 'the', 'number', 'of', 'words', 'they', 'process', 'before', 'the', 'main', 'word', 'for', 'tagging', '.']


[('UnigramTagger', 'NN'), (',', ','), ('BigramTagger', 'NN'), (',', ','), ('TrigramTagger', 'NN'), ('are', 'BER'), ('kinds', 'NNS'), ('of', 'IN'), ('machine', 'NN'), ('learning', 'NN'), ('techniques', 'NNS'), ('used', 'VBN'), ('to', 'TO'), ('train', 'NN'), ('data', 'NN'), ('.', '.'), ('They', 'PPSS'), ('are', 'BER'), ('named', 'VBN'), ('based', 'VBN'), ('on', 'IN'), ('the', 'AT'), ('number', 'NN'), ('of', 'IN'), ('words', 'NNS'), ('they', 'PPSS'), ('process', 'NN'), ('before', 'IN'), ('the', 'AT'), ('main', 'JJS'), ('word', 'NN'), ('for', 'IN'), ('tagging', 'NN'), ('.', '.')]


In [19]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

# Entity Tagging

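Entity tagging (named entity recognition) identifies the entities mentioned in a sentence and assigns each one a category.
#### Entities
* geo: Geographical
* org: Organization
* per: Person
* gpe: Geopolitical
* tim: Time Indicator
* art: Artifact
* eve: Event
* nat: Natural Phenomenon

NLTK's `ne_chunk`, used below, reports its own uppercase labels such as `PERSON`, `ORGANIZATION` and `GPE`.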

In [26]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import ne_chunk

nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/amiresm/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/amiresm/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [55]:
text = "Amir Esmaeili is a junior software developer at Satplat, who has been studying at IUST."

In [56]:
words = word_tokenize(text)
print(ne_chunk(nltk.pos_tag(words), binary=False))

(S
  (PERSON Amir/NNP)
  (ORGANIZATION Esmaeili/NNP)
  is/VBZ
  a/DT
  junior/JJ
  software/NN
  developer/NN
  at/IN
  (ORGANIZATION Satplat/NNP)
  ,/,
  who/WP
  has/VBZ
  been/VBN
  studying/VBG
  at/IN
  (ORGANIZATION IUST/NNP)
  ./.)
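To work with the recognized entities rather than just print them, they have to be pulled out of the tree. The sketch below parses the printed bracket format shown above with a regular expression, so it runs without NLTK; the helper name `extract_entities` is hypothetical, and with NLTK itself one would normally walk the returned tree object instead of its printed form:

```python
import re

# The printed tree from the cell above, copied as a string.
tree_text = """(S
  (PERSON Amir/NNP)
  (ORGANIZATION Esmaeili/NNP)
  is/VBZ
  a/DT
  junior/JJ
  software/NN
  developer/NN
  at/IN
  (ORGANIZATION Satplat/NNP)
  ,/,
  who/WP
  has/VBZ
  been/VBN
  studying/VBG
  at/IN
  (ORGANIZATION IUST/NNP)
  ./.)"""


def extract_entities(text):
    """Return (label, phrase) pairs for each innermost bracketed chunk."""
    entities = []
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", text):
        # Each token looks like word/TAG; keep only the word part.
        words = [tok.rsplit("/", 1)[0] for tok in body.split()]
        entities.append((label, " ".join(words)))
    return entities


print(extract_entities(tree_text))
```

The extracted pairs make the tagging errors easy to spot: the surname is labelled as an organization instead of being grouped with the first name.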


# Data Pre-Processing Overview

## Sections

* Tokenize sentences
* Tokenize words
* Lemmatize the words (without POS)
* Perform POS tagging
* Remove stop words and punctuation
* Lemmatize the words (with POS)
* Perform NER
## Data
"Digikala Group is a leading e-commerce organization with a firm grip in multiple online industries including consumer goods, fashion & apparel, e-books, content publishing, digital advertising, big data, fintech, FMCG and logistics. The company operates through its subsidiaries including Digikala, DIGISTYLE, Fidibo and Digistyle representing nearly 92% of Iran’s online retail market share.
With exponential growth of its customer base around its subsidiaries, the future for Digikala Group is bright and abundant. The company is dedicated to build upon its strong foothold in e-commerce and help more consumers and businesses around the MENA region to create more possibilities online, helping to boost the region’s promising economy with world-class standards in technology and service.
At Digikala Group, the focus is on building for the future and increasing our presence in the markets we currently hold stake. Creative ideas, wild imaginations, teamwork, and thoughtful execution that are consumer-centric driven is encouraged at Digikala Group, maintaining a concentrated effort on innovative technology and service to the marketplace year-after-year."

In [92]:
text = """Born and raised in the Austrian Empire, Tesla studied engineering and physics in the 1870s without receiving a degree, and gained practical experience in the early 1880s working in telephony and at Continental Edison in the new electric power industry. In 1884 he emigrated to the United States, where he became a naturalized citizen. He worked for a short time at the Edison Machine Works in New York City before he struck out on his own. With the help of partners to finance and market his ideas, Tesla set up laboratories and companies in New York to develop a range of electrical and mechanical devices. His alternating current (AC) induction motor and related polyphase AC patents, licensed by Westinghouse Electric in 1888, earned him a considerable amount of money and became the cornerstone of the polyphase system which that company eventually marketed."""

In [93]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk import ne_chunk


# Preparing the needed datasets
datasets = ['punkt', 'wordnet', 'averaged_perceptron_tagger', 'stopwords', 'maxent_ne_chunker', 'words']
for pkg in datasets:
    nltk.download(pkg)

[nltk_data] Downloading package punkt to /home/amiresm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/amiresm/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/amiresm/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/amiresm/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/amiresm/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/amiresm/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [94]:
# Tokenizing sentences
sentences = sent_tokenize(text)
print("sentences=",sentences)

sentences= ['Born and raised in the Austrian Empire, Tesla studied engineering and physics in the 1870s without receiving a degree, and gained practical experience in the early 1880s working in telephony and at Continental Edison in the new electric power industry.', 'In 1884 he emigrated to the United States, where he became a naturalized citizen.', 'He worked for a short time at the Edison Machine Works in New York City before he struck out on his own.', 'With the help of partners to finance and market his ideas, Tesla set up laboratories and companies in New York to develop a range of electrical and mechanical devices.', 'His alternating current (AC) induction motor and related polyphase AC patents, licensed by Westinghouse Electric in 1888, earned him a considerable amount of money and became the cornerstone of the polyphase system which that company eventually marketed.']


In [95]:
# Tokenizing words
words = list() 
for sentence in sentences:
    words.extend(word_tokenize(sentence))
print("words=", words)

words= ['Born', 'and', 'raised', 'in', 'the', 'Austrian', 'Empire', ',', 'Tesla', 'studied', 'engineering', 'and', 'physics', 'in', 'the', '1870s', 'without', 'receiving', 'a', 'degree', ',', 'and', 'gained', 'practical', 'experience', 'in', 'the', 'early', '1880s', 'working', 'in', 'telephony', 'and', 'at', 'Continental', 'Edison', 'in', 'the', 'new', 'electric', 'power', 'industry', '.', 'In', '1884', 'he', 'emigrated', 'to', 'the', 'United', 'States', ',', 'where', 'he', 'became', 'a', 'naturalized', 'citizen', '.', 'He', 'worked', 'for', 'a', 'short', 'time', 'at', 'the', 'Edison', 'Machine', 'Works', 'in', 'New', 'York', 'City', 'before', 'he', 'struck', 'out', 'on', 'his', 'own', '.', 'With', 'the', 'help', 'of', 'partners', 'to', 'finance', 'and', 'market', 'his', 'ideas', ',', 'Tesla', 'set', 'up', 'laboratories', 'and', 'companies', 'in', 'New', 'York', 'to', 'develop', 'a', 'range', 'of', 'electrical', 'and', 'mechanical', 'devices', '.', 'His', 'alternating', 'current', '(',

In [96]:
# Lemmatization without pos tagging
wordnet_lemmatizer = WordNetLemmatizer()
lemma = list()
for word in words:
    l = wordnet_lemmatizer.lemmatize(word)
    lemma.append(l)
print("lemma(without pos)=", lemma)

lemma(without pos)= ['Born', 'and', 'raised', 'in', 'the', 'Austrian', 'Empire', ',', 'Tesla', 'studied', 'engineering', 'and', 'physic', 'in', 'the', '1870s', 'without', 'receiving', 'a', 'degree', ',', 'and', 'gained', 'practical', 'experience', 'in', 'the', 'early', '1880s', 'working', 'in', 'telephony', 'and', 'at', 'Continental', 'Edison', 'in', 'the', 'new', 'electric', 'power', 'industry', '.', 'In', '1884', 'he', 'emigrated', 'to', 'the', 'United', 'States', ',', 'where', 'he', 'became', 'a', 'naturalized', 'citizen', '.', 'He', 'worked', 'for', 'a', 'short', 'time', 'at', 'the', 'Edison', 'Machine', 'Works', 'in', 'New', 'York', 'City', 'before', 'he', 'struck', 'out', 'on', 'his', 'own', '.', 'With', 'the', 'help', 'of', 'partner', 'to', 'finance', 'and', 'market', 'his', 'idea', ',', 'Tesla', 'set', 'up', 'laboratory', 'and', 'company', 'in', 'New', 'York', 'to', 'develop', 'a', 'range', 'of', 'electrical', 'and', 'mechanical', 'device', '.', 'His', 'alternating', 'current',

In [97]:
# Performing POS tagging
tagged_words = pos_tag(words)
print("tagged_words=", tagged_words)
print("n=", len(tagged_words))

tagged_words= [('Born', 'NNP'), ('and', 'CC'), ('raised', 'VBN'), ('in', 'IN'), ('the', 'DT'), ('Austrian', 'JJ'), ('Empire', 'NNP'), (',', ','), ('Tesla', 'NNP'), ('studied', 'VBD'), ('engineering', 'NN'), ('and', 'CC'), ('physics', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1870s', 'CD'), ('without', 'IN'), ('receiving', 'VBG'), ('a', 'DT'), ('degree', 'NN'), (',', ','), ('and', 'CC'), ('gained', 'VBD'), ('practical', 'JJ'), ('experience', 'NN'), ('in', 'IN'), ('the', 'DT'), ('early', 'JJ'), ('1880s', 'CD'), ('working', 'VBG'), ('in', 'IN'), ('telephony', 'NN'), ('and', 'CC'), ('at', 'IN'), ('Continental', 'NNP'), ('Edison', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('electric', 'JJ'), ('power', 'NN'), ('industry', 'NN'), ('.', '.'), ('In', 'IN'), ('1884', 'CD'), ('he', 'PRP'), ('emigrated', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), (',', ','), ('where', 'WRB'), ('he', 'PRP'), ('became', 'VBD'), ('a', 'DT'), ('naturalized', 'JJ'), ('citizen', '

In [98]:
import string


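# Removing stop words and punctuation
# Caution: calling remove() on a list while iterating over it skips the
# element that follows each removal, so some stop words (e.g. 'the', 'at')
# survive in the output below. Building a new list with a comprehension
# avoids this.
stop_words = set(stopwords.words('english'))
punk = string.punctuation
for word, tag in tagged_words:
    if word in stop_words or word in punk:
        tagged_words.remove((word, tag))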
print(tagged_words)
print("n=", len(tagged_words))

[('Born', 'NNP'), ('raised', 'VBN'), ('Austrian', 'JJ'), ('Empire', 'NNP'), ('Tesla', 'NNP'), ('studied', 'VBD'), ('engineering', 'NN'), ('physics', 'NNS'), ('1870s', 'CD'), ('without', 'IN'), ('receiving', 'VBG'), ('degree', 'NN'), ('gained', 'VBD'), ('practical', 'JJ'), ('experience', 'NN'), ('the', 'DT'), ('early', 'JJ'), ('1880s', 'CD'), ('working', 'VBG'), ('telephony', 'NN'), ('Continental', 'NNP'), ('Edison', 'NNP'), ('the', 'DT'), ('new', 'JJ'), ('electric', 'JJ'), ('power', 'NN'), ('industry', 'NN'), ('In', 'IN'), ('1884', 'CD'), ('emigrated', 'VBD'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('where', 'WRB'), ('became', 'VBD'), ('naturalized', 'JJ'), ('citizen', 'NN'), ('He', 'PRP'), ('worked', 'VBD'), ('short', 'JJ'), ('time', 'NN'), ('at', 'IN'), ('the', 'DT'), ('Edison', 'NNP'), ('Machine', 'NNP'), ('Works', 'NNP'), ('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP'), ('he', 'PRP'), ('struck', 'VBD'), ('on', 'IN'), ('own', 'JJ'), ('With', 'IN'), ('the', 'DT'), ('h
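The output above still contains stop words such as `the` and `at`. The likely cause is that items are removed from `tagged_words` while the loop is iterating over it. A small self-contained sketch of the effect and of a safer alternative, using a four-token sample and a two-word stop list chosen only for illustration:

```python
stop_words = {"in", "the"}  # small illustrative subset
tagged = [("in", "IN"), ("the", "DT"), ("early", "JJ"), ("1880s", "CD")]

# Removing while iterating: once 'in' is removed, 'the' shifts into the
# slot that was just visited, so the loop never inspects it.
mutated = list(tagged)
for word, tag in mutated:
    if word in stop_words:
        mutated.remove((word, tag))
print(mutated)

# Building a new list visits every element exactly once.
filtered = [(word, tag) for word, tag in tagged if word not in stop_words]
print(filtered)
```

The first result keeps `('the', 'DT')`, matching the pattern seen in the notebook output, while the list comprehension drops both stop words.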

In [99]:
from nltk.corpus import wordnet

def get_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag to the matching WordNet POS constant ('' if none)."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

In [100]:
# Lemmatizing with pos tagging

lemma_pos = list()

for word, tag in tagged_words:
    pos = get_pos(tag)
    if pos != '':
        l = wordnet_lemmatizer.lemmatize(word, pos=pos)
        lemma_pos.append(l)
print("lemma_pos=", lemma_pos)

# The result is not fully correct: for example, 'physics' is reduced to 'physic'

lemma_pos= ['Born', 'raise', 'Austrian', 'Empire', 'Tesla', 'study', 'engineering', 'physic', 'receive', 'degree', 'gain', 'practical', 'experience', 'early', 'work', 'telephony', 'Continental', 'Edison', 'new', 'electric', 'power', 'industry', 'emigrate', 'United', 'States', 'become', 'naturalized', 'citizen', 'work', 'short', 'time', 'Edison', 'Machine', 'Works', 'New', 'York', 'City', 'strike', 'own', 'help', 'partner', 'finance', 'market', 'idea', 'Tesla', 'set', 'laboratory', 'company', 'New', 'York', 'develop', 'range', 'electrical', 'mechanical', 'device', 'alternate', 'current', 'AC', 'induction', 'motor', 'related', 'polyphase', 'AC', 'patent', 'license', 'Westinghouse', 'Electric', 'earn', 'considerable', 'amount', 'money', 'become', 'cornerstone', 'polyphase', 'system', 'company', 'eventually', 'market']


In [101]:
# Named Entity Recognition (NER)

print(ne_chunk(tagged_words, binary=False))

(S
  (GPE Born/NNP)
  raised/VBN
  (GPE Austrian/JJ)
  (ORGANIZATION Empire/NNP Tesla/NNP)
  studied/VBD
  engineering/NN
  physics/NNS
  1870s/CD
  without/IN
  receiving/VBG
  degree/NN
  gained/VBD
  practical/JJ
  experience/NN
  the/DT
  early/JJ
  1880s/CD
  working/VBG
  telephony/NN
  (ORGANIZATION Continental/NNP)
  Edison/NNP
  the/DT
  new/JJ
  electric/JJ
  power/NN
  industry/NN
  In/IN
  1884/CD
  emigrated/VBD
  the/DT
  (GPE United/NNP States/NNPS)
  where/WRB
  became/VBD
  naturalized/JJ
  citizen/NN
  He/PRP
  worked/VBD
  short/JJ
  time/NN
  at/IN
  the/DT
  (ORGANIZATION Edison/NNP Machine/NNP Works/NNP New/NNP York/NNP)
  City/NNP
  he/PRP
  struck/VBD
  on/IN
  own/JJ
  With/IN
  the/DT
  help/NN
  partners/NNS
  finance/VB
  market/NN
  ideas/NNS
  (PERSON Tesla/NNP)
  set/VBD
  laboratories/NNS
  companies/NNS
  (GPE New/NNP York/NNP)
  develop/VB
  a/DT
  range/NN
  electrical/JJ
  mechanical/JJ
  devices/NNS
  His/PRP$
  alternating/VBG
  current/JJ
  AC/NNP