Q1. Copy the data about NASA from Wikipedia and perform the following:
- Apply POS tagging on this data.
- Remove punctuation symbols.
- Delete the stopwords.
- Perform morphological analysis on all the nouns.
- Find the top 3 words in the data.

In [25]:
import nltk
import re
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize

In [26]:
nltk.download('punkt') # tokenization
nltk.download('stopwords') # stopwords removal
nltk.download('averaged_perceptron_tagger') # POS tagging
nltk.download('wordnet') # wordnet database and lemmatization
nltk.download('omw-1.4') # Open Multilingual WordNet data used with WordNet
nltk.download('indian') # Indian language POS tagging
nltk.download('maxent_ne_chunker') # chunking

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package indian to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package indian is already up-to-date!
[nltk_data] Dow

True

In [27]:
data = "The National Aeronautics and Space Administration (NASA; /ˈnæsə/) is an independent agency of the U.S. federal government responsible for the civil space program, aeronautics research, and space research. Established in 1958, it succeeded the National Advisory Committee for Aeronautics (NACA) to give the U.S. space development effort a distinct civilian orientation, emphasizing peaceful applications in space science. It has since led most of America's space exploration programs, including Project Mercury, Project Gemini, the 1968–1972 Apollo Moon landing missions, the Skylab space station, and the Space Shuttle. Currently, NASA supports the International Space Station (ISS) along with the Commercial Crew Program, and oversees the development of the Orion spacecraft and the Space Launch System for the lunar Artemis program."
data

"The National Aeronautics and Space Administration (NASA; /ˈnæsə/) is an independent agency of the U.S. federal government responsible for the civil space program, aeronautics research, and space research. Established in 1958, it succeeded the National Advisory Committee for Aeronautics (NACA) to give the U.S. space development effort a distinct civilian orientation, emphasizing peaceful applications in space science. It has since led most of America's space exploration programs, including Project Mercury, Project Gemini, the 1968–1972 Apollo Moon landing missions, the Skylab space station, and the Space Shuttle. Currently, NASA supports the International Space Station (ISS) along with the Commercial Crew Program, and oversees the development of the Orion spacecraft and the Space Launch System for the lunar Artemis program."

POS tagging

In [28]:
tags = pos_tag(word_tokenize(data))
tags

[('The', 'DT'),
 ('National', 'NNP'),
 ('Aeronautics', 'NNP'),
 ('and', 'CC'),
 ('Space', 'NNP'),
 ('Administration', 'NNP'),
 ('(', '('),
 ('NASA', 'NNP'),
 (';', ':'),
 ('/ˈnæsə/', 'NNP'),
 (')', ')'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('independent', 'JJ'),
 ('agency', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('U.S.', 'NNP'),
 ('federal', 'JJ'),
 ('government', 'NN'),
 ('responsible', 'JJ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('civil', 'JJ'),
 ('space', 'NN'),
 ('program', 'NN'),
 (',', ','),
 ('aeronautics', 'NNS'),
 ('research', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('space', 'NN'),
 ('research', 'NN'),
 ('.', '.'),
 ('Established', 'VBN'),
 ('in', 'IN'),
 ('1958', 'CD'),
 (',', ','),
 ('it', 'PRP'),
 ('succeeded', 'VBD'),
 ('the', 'DT'),
 ('National', 'NNP'),
 ('Advisory', 'NNP'),
 ('Committee', 'NNP'),
 ('for', 'IN'),
 ('Aeronautics', 'NNP'),
 ('(', '('),
 ('NACA', 'NNP'),
 (')', ')'),
 ('to', 'TO'),
 ('give', 'VB'),
 ('the', 'DT'),
 ('U.S.', 'NNP'),
 ('space', 'NN'),
 ('development', 

Remove Punctuation

In [29]:
import string

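# Drop tokens that are punctuation marks. Note that `in string.punctuation`
# is a substring test on a str, so multi-character tokens such as "'s" are kept.
clean = [word for word in word_tokenize(data) if word not in string.punctuation]
print(' '.join(clean))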

The National Aeronautics and Space Administration NASA /ˈnæsə/ is an independent agency of the U.S. federal government responsible for the civil space program aeronautics research and space research Established in 1958 it succeeded the National Advisory Committee for Aeronautics NACA to give the U.S. space development effort a distinct civilian orientation emphasizing peaceful applications in space science It has since led most of America 's space exploration programs including Project Mercury Project Gemini the 1968–1972 Apollo Moon landing missions the Skylab space station and the Space Shuttle Currently NASA supports the International Space Station ISS along with the Commercial Crew Program and oversees the development of the Orion spacecraft and the Space Launch System for the lunar Artemis program
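
Because `string.punctuation` is a plain string, `word not in string.punctuation` is a substring test rather than a per-character check. Multi-character quote tokens such as the double backticks that `word_tokenize` emits for an opening quotation mark would therefore slip through. The sketch below shows one possible stricter check; `is_punctuation` is a hypothetical helper that is not part of the notebook, and the sample tokens are hand-picked for illustration.

```python
import string

# Set of ASCII punctuation characters for per-character membership tests.
PUNCT = set(string.punctuation)

def is_punctuation(token: str) -> bool:
    """Return True when the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Hand-picked tokens: some appear in the tagger output above, the quote
# tokens are edge cases added for illustration.
sample = ["NASA", ";", "/ˈnæsə/", ")", "U.S.", "``", "''", "'s"]
kept = [tok for tok in sample if not is_punctuation(tok)]
print(kept)
```

Tokens that mix letters and punctuation, such as `U.S.` or `'s`, are still kept by this check; whether that is desirable depends on the task.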


Delete the Stopwords

In [30]:
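from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the set once instead of once per token
# isalnum() also drops tokens such as 'U.S.', "'s" and '1968–1972'
no_sw = ' '.join(word for word in word_tokenize(data)
                 if word.lower() not in stop_words and word.isalnum())
no_sw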

'National Aeronautics Space Administration NASA independent agency federal government responsible civil space program aeronautics research space research Established 1958 succeeded National Advisory Committee Aeronautics NACA give space development effort distinct civilian orientation emphasizing peaceful applications space science since led America space exploration programs including Project Mercury Project Gemini Apollo Moon landing missions Skylab space station Space Shuttle Currently NASA supports International Space Station ISS along Commercial Crew Program oversees development Orion spacecraft Space Launch System lunar Artemis program'

Morphological Analysis on Nouns

In [31]:
# Keep tokens whose Penn Treebank tag starts with 'N' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tags if tag.startswith('N')]
nouns


['National',
 'Aeronautics',
 'Space',
 'Administration',
 'NASA',
 '/ˈnæsə/',
 'agency',
 'U.S.',
 'government',
 'space',
 'program',
 'aeronautics',
 'research',
 'space',
 'research',
 'National',
 'Advisory',
 'Committee',
 'Aeronautics',
 'NACA',
 'U.S.',
 'space',
 'development',
 'effort',
 'orientation',
 'applications',
 'space',
 'science',
 'America',
 'space',
 'exploration',
 'programs',
 'Project',
 'Mercury',
 'Project',
 'Gemini',
 'Apollo',
 'Moon',
 'missions',
 'Skylab',
 'space',
 'station',
 'Space',
 'Shuttle',
 'Currently',
 'NASA',
 'International',
 'Space',
 'Station',
 'ISS',
 'Crew',
 'Program',
 'development',
 'Orion',
 'spacecraft',
 'Space',
 'Launch',
 'System',
 'Artemis',
 'program']
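
The list above contains repeats (`space` appears several times) and tagger noise such as `/ˈnæsə/`. As a rough sketch, the noun list could be tidied before stemming by keeping only alphabetic tokens and dropping case-insensitive duplicates. `clean_nouns` and `NOUN_TAGS` are hypothetical names, and the sample pairs are copied by hand from the tagger output rather than recomputed.

```python
# A few (token, tag) pairs copied from the tagger output shown earlier.
sample_tags = [
    ("The", "DT"), ("National", "NNP"), ("NASA", "NNP"), ("/ˈnæsə/", "NNP"),
    ("agency", "NN"), ("U.S.", "NNP"), ("aeronautics", "NNS"), ("is", "VBZ"),
    ("NASA", "NNP"),
]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def clean_nouns(tagged):
    """Keep noun-tagged tokens made only of letters, first occurrence wins."""
    seen = {}
    for token, tag in tagged:
        if tag in NOUN_TAGS and token.isalpha():
            # Key on the lowercase form so 'Space' and 'space' count once.
            seen.setdefault(token.lower(), token)
    return list(seen.values())

print(clean_nouns(sample_tags))
```

The `isalpha()` filter is a blunt instrument: it removes the pronunciation string but also abbreviations such as `U.S.`, so it would need adjusting if those should be retained.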

In [32]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

print("|Word\t|\tPorter\t|\tLancaster\t|\tSnowball|")
for i in nouns:
    print(" ")
    print('|', i, '\t|\t', ps.stem(i), '\t|\t', ls.stem(i), '\t|\t', ss.stem(i), '|')

|Word	|	Porter	|	Lancaster	|	Snowball|
 
| National 	|	 nation 	|	 nat 	|	 nation |
 
| Aeronautics 	|	 aeronaut 	|	 aeronaut 	|	 aeronaut |
 
| Space 	|	 space 	|	 spac 	|	 space |
 
| Administration 	|	 administr 	|	 admin 	|	 administr |
 
| NASA 	|	 nasa 	|	 nas 	|	 nasa |
 
| /ˈnæsə/ 	|	 /ˈnæsə/ 	|	 /ˈnæsə/ 	|	 /ˈnæsə/ |
 
| agency 	|	 agenc 	|	 ag 	|	 agenc |
 
| U.S. 	|	 u.s. 	|	 u.s. 	|	 u.s. |
 
| government 	|	 govern 	|	 govern 	|	 govern |
 
| space 	|	 space 	|	 spac 	|	 space |
 
| program 	|	 program 	|	 program 	|	 program |
 
| aeronautics 	|	 aeronaut 	|	 aeronaut 	|	 aeronaut |
 
| research 	|	 research 	|	 research 	|	 research |
 
| space 	|	 space 	|	 spac 	|	 space |
 
| research 	|	 research 	|	 research 	|	 research |
 
| National 	|	 nation 	|	 nat 	|	 nation |
 
| Advisory 	|	 advisori 	|	 adv 	|	 advisori |
 
| Committee 	|	 committe 	|	 commit 	|	 committe |
 
| Aeronautics 	|	 aeronaut 	|	 aeronaut 	|	 aeronaut |
 
| NACA 	|	 naca 	|	 nac 	|	 naca |
 
| U.

Find the top 3 words in the data

In [33]:
tokens = word_tokenize(data)

# Frequency distribution over the raw tokens: case-sensitive, and
# punctuation and stopwords are still included in the counts
word_freq = nltk.FreqDist(tokens)
top_words = word_freq.most_common(3)  # three most frequent tokens
print(top_words)

[('the', 13), (',', 11), ('space', 6)]
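
Two of the three entries above are a stopword and a comma, because the counts are taken over the raw tokens. If the intent is the most frequent content words, one option is to lowercase the text and skip punctuation and stopwords before counting. The sketch below uses only the standard library; the tiny `STOPWORDS` set, the regex tokenizer, and the sample sentence are assumptions for illustration and are not the NLTK stopword list or `word_tokenize`.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword set for illustration only;
# the NLTK English stopword list used in the notebook is much larger.
STOPWORDS = {"the", "and", "of", "for", "in", "a", "an", "is", "it", "to"}

def top_words(text: str, n: int = 3) -> list:
    """Count lowercase alphabetic words, skipping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Made-up sample sentence, not the notebook's `data` string.
sample = ("NASA runs the space program, space research, and the Space Shuttle. "
          "NASA funds research in space science and aeronautics research.")
print(top_words(sample))
```

In the notebook itself, the same idea could be applied by passing the cleaned tokens from the stopword step to `nltk.FreqDist` instead of the raw `tokens` list.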


### Q2. Open the Wikipedia page of Pune.
- Find the adjectives used in the text.

In [34]:
data = 'Pune (/ˈpuːnə/ POO-nə, Marathi: [ˈpuɳe] ⓘ), previously spelled in English as Poona (the official name until 1978),[15][16] is a city in Maharashtra state in the Deccan plateau in Western India. It is the administrative headquarters of the Pune district, and of Pune division. According to the 2011 Census of India, Pune has 7.2 million residents in the metropolitan region, making it the eighth-most populous metropolitan area in India.[17] The city of Pune is part of Pune Metropolitan Region.[18] Pune is one of the largest IT hubs in India.[19][20] It is also one of the most important automobile and manufacturing hubs of India. Pune is often referred to as the "Oxford of the East" because of its highly regarded educational institutions.[21][22][23] It has been ranked "the most liveable city in India" several times.[24][25]'
data

'Pune (/ˈpuːnə/ POO-nə, Marathi: [ˈpuɳe] ⓘ), previously spelled in English as Poona (the official name until 1978),[15][16] is a city in Maharashtra state in the Deccan plateau in Western India. It is the administrative headquarters of the Pune district, and of Pune division. According to the 2011 Census of India, Pune has 7.2 million residents in the metropolitan region, making it the eighth-most populous metropolitan area in India.[17] The city of Pune is part of Pune Metropolitan Region.[18] Pune is one of the largest IT hubs in India.[19][20] It is also one of the most important automobile and manufacturing hubs of India. Pune is often referred to as the "Oxford of the East" because of its highly regarded educational institutions.[21][22][23] It has been ranked "the most liveable city in India" several times.[24][25]'

In [35]:
tags = pos_tag(word_tokenize(data))
tags

[('Pune', 'NNP'),
 ('(', '('),
 ('/ˈpuːnə/', 'JJ'),
 ('POO-nə', 'NNP'),
 (',', ','),
 ('Marathi', 'NNP'),
 (':', ':'),
 ('[', 'NN'),
 ('ˈpuɳe', 'NNP'),
 (']', 'NNP'),
 ('ⓘ', 'NNP'),
 (')', ')'),
 (',', ','),
 ('previously', 'RB'),
 ('spelled', 'VBN'),
 ('in', 'IN'),
 ('English', 'NNP'),
 ('as', 'IN'),
 ('Poona', 'NNP'),
 ('(', '('),
 ('the', 'DT'),
 ('official', 'NN'),
 ('name', 'NN'),
 ('until', 'IN'),
 ('1978', 'CD'),
 (')', ')'),
 (',', ','),
 ('[', '$'),
 ('15', 'CD'),
 (']', 'NNP'),
 ('[', 'VBD'),
 ('16', 'CD'),
 (']', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('city', 'NN'),
 ('in', 'IN'),
 ('Maharashtra', 'NNP'),
 ('state', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('Deccan', 'NNP'),
 ('plateau', 'NN'),
 ('in', 'IN'),
 ('Western', 'JJ'),
 ('India', 'NNP'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('administrative', 'JJ'),
 ('headquarters', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Pune', 'NNP'),
 ('district', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('of', 'IN'),
 ('Pun

In [36]:
adjs = [i[0] for i in tags if i[1].startswith('J')]
adjs

['/ˈpuːnə/',
 'Western',
 'administrative',
 'metropolitan',
 'eighth-most',
 'populous',
 'metropolitan',
 'largest',
 ']',
 'important',
 'regarded',
 'educational',
 'liveable',
 'several']
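
The stray `']'` in this list, like the odd tags assigned to `[` and `]` in the tagger output, comes from the Wikipedia citation markers (`[15]`, `[16]`, ...) that were copied along with the text. A minimal sketch of stripping them before tokenizing is shown below; `strip_citations` is a hypothetical helper, and the cleaned string would then be passed to `word_tokenize` and `pos_tag` as before. The resulting tags are not shown because the sketch was not run through the tagger.

```python
import re

def strip_citations(text: str) -> str:
    """Remove numeric Wikipedia-style markers such as [15] or [21][22]."""
    return re.sub(r"\[\d+\]", "", text)

# Sentence fragment adapted from the Pune text above.
sample = ("Pune is one of the largest IT hubs in India.[19][20] "
          "It is also a manufacturing hub.")
print(strip_citations(sample))
```

Only brackets that contain digits are removed, so the Marathi pronunciation `[ˈpuɳe]` is left untouched.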