
NLP - Pipeline


# Importing Libraries

In [None]:
!pip install textacy
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl (777.4 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [None]:
import spacy
import textacy
from spacy import displacy

# Loading the Text File

The text is taken from a Wikipedia article about ASTERIA.

In [None]:
!wget https://raw.githubusercontent.com/Vnykm/GN22CDBDS001_2113702/main/text.txt

In [None]:
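nlp = spacy.load('en_core_web_lg')
# Read the article about ASTERIA; the context manager closes the file afterwards
with open('text.txt', 'rt', encoding='utf-8') as f:
  data = f.read()
doc = nlp(data)
len(doc)  # number of tokens in the document, not characters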

551

# Step 1: Sentence Segmentation

In [None]:
for sen in list(doc.sents):
  print(sen)

ASTERIA (Arcsecond Space Telescope Enabling Research in Astrophysics) was a miniaturized space telescope technology demonstration and opportunistic science mission to conduct astrophysical measurements using a CubeSat.
It was designed in collaboration between the Massachusetts Institute of Technology (MIT) and NASA's Jet Propulsion Laboratory.
ASTERIA was the first JPL-built CubeSat to have been successfully operated in space.
Originally envisioned as a project for training early career scientists and engineers, ASTERIA's technical goal was to achieve arcsecond-level line-of-sight pointing error and highly stable focal plane temperature control.
These technologies are important for precision photometry, i.e., the measurement of stellar brightness over time.
Precision photometry, in turn, provides a way to study stellar activity, transiting exoplanets, and other astrophysical phenomena.

ASTERIA was launched on 14 August 2017 and deployed into low Earth orbit from the International Spac

# Step 2: Word Tokenization

In [None]:
for token in doc:
  print((token.text, token.idx))

('ASTERIA', 0)
('(', 8)
('Arcsecond', 9)
('Space', 19)
('Telescope', 25)
('Enabling', 35)
('Research', 44)
('in', 53)
('Astrophysics', 56)
(')', 68)
('was', 70)
('a', 74)
('miniaturized', 76)
('space', 89)
('telescope', 95)
('technology', 105)
('demonstration', 116)
('and', 130)
('opportunistic', 134)
('science', 148)
('mission', 156)
('to', 164)
('conduct', 167)
('astrophysical', 175)
('measurements', 189)
('using', 202)
('a', 208)
('CubeSat', 210)
('.', 217)
('It', 219)
('was', 222)
('designed', 226)
('in', 235)
('collaboration', 238)
('between', 252)
('the', 260)
('Massachusetts', 264)
('Institute', 278)
('of', 288)
('Technology', 291)
('(', 302)
('MIT', 303)
(')', 306)
('and', 308)
('NASA', 312)
("'s", 316)
('Jet', 319)
('Propulsion', 323)
('Laboratory', 334)
('.', 344)
('ASTERIA', 346)
('was', 354)
('the', 358)
('first', 362)
('JPL', 368)
('-', 371)
('built', 372)
('CubeSat', 378)
('to', 386)
('have', 389)
('been', 394)
('successfully', 399)
('operated', 412)
('in', 421)
('space',
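
The second value in each pair, `token.idx`, is the character offset of the token in the original text. The sketch below illustrates that idea with a crude regular-expression tokenizer from the standard library. It is an assumption made for illustration, not spaCy's tokenizer, and the helper name `tokenize_with_offsets` is hypothetical. It splits contractions and hyphenated words differently from spaCy, but on the opening words of the article it produces the same offsets as the output above.

```python
import re

# Assumption: a simple regex tokenizer standing in for spaCy's tokenizer.
text = "ASTERIA (Arcsecond Space Telescope Enabling Research in Astrophysics) was a miniaturized space telescope"

def tokenize_with_offsets(text):
    """Return (token, start_offset) pairs for words and single punctuation marks."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]

pairs = tokenize_with_offsets(text)
for tok, idx in pairs[:4]:
    print((tok, idx))

# Slicing the original string at each offset recovers the token text.
recovered = [text[idx:idx + len(tok)] for tok, idx in pairs]
```

Because every offset points back into the original string, token-level annotations can be mapped to character spans, for example to highlight entities in the source text.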

# Step 3: Predicting Parts of Speech for Each Token

In [None]:
for token in doc: 
  print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

ASTERIA NNP PROPN noun, proper singular
( -LRB- PUNCT left round bracket
Arcsecond NNP PROPN noun, proper singular
Space NNP PROPN noun, proper singular
Telescope NNP PROPN noun, proper singular
Enabling NNP PROPN noun, proper singular
Research NNP PROPN noun, proper singular
in IN ADP conjunction, subordinating or preposition
Astrophysics NNP PROPN noun, proper singular
) -RRB- PUNCT right round bracket
was VBD AUX verb, past tense
a DT DET determiner
miniaturized VBN VERB verb, past participle
space NN NOUN noun, singular or mass
telescope NN NOUN noun, singular or mass
technology NN NOUN noun, singular or mass
demonstration NN NOUN noun, singular or mass
and CC CCONJ conjunction, coordinating
opportunistic JJ ADJ adjective (English), other noun-modifier (Chinese)
science NN NOUN noun, singular or mass
mission NN NOUN noun, singular or mass
to TO PART infinitival "to"
conduct VB VERB verb, base form
astrophysical JJ ADJ adjective (English), other noun-modifier (Chinese)
measurements 

# Step 4: Text Lemmatization

In [None]:
for token in doc: print(token, token.lemma_)

ASTERIA ASTERIA
( (
Arcsecond Arcsecond
Space Space
Telescope Telescope
Enabling Enabling
Research Research
in in
Astrophysics Astrophysics
) )
was be
a a
miniaturized miniaturize
space space
telescope telescope
technology technology
demonstration demonstration
and and
opportunistic opportunistic
science science
mission mission
to to
conduct conduct
astrophysical astrophysical
measurements measurement
using use
a a
CubeSat CubeSat
. .
It it
was be
designed design
in in
collaboration collaboration
between between
the the
Massachusetts Massachusetts
Institute Institute
of of
Technology Technology
( (
MIT MIT
) )
and and
NASA NASA
's 's
Jet Jet
Propulsion Propulsion
Laboratory Laboratory
. .
ASTERIA ASTERIA
was be
the the
first first
JPL JPL
- -
built build
CubeSat CubeSat
to to
have have
been be
successfully successfully
operated operate
in in
space space
. .
Originally originally
envisioned envision
as as
a a
project project
for for
training train
early early
career career
scientists sc
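
One practical use of lemmas is grouping different surface forms of the same word. The sketch below uses a handful of (token, lemma) pairs copied from the output above and collects the forms seen for each lemma. The variable names are illustrative, and the pairs are hard-coded rather than read from `doc`.

```python
from collections import defaultdict

# (token, lemma) pairs copied from the lemmatization output above.
token_lemma_pairs = [
    ("was", "be"), ("been", "be"), ("designed", "design"),
    ("built", "build"), ("measurements", "measurement"), ("using", "use"),
]

# Map each lemma to the set of surface forms that reduce to it.
forms_by_lemma = defaultdict(set)
for token, lemma in token_lemma_pairs:
    forms_by_lemma[lemma].add(token)

be_forms = sorted(forms_by_lemma["be"])
print(be_forms)
```

Counting or searching by lemma rather than by raw token treats "was" and "been" as the same word.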

# Step 5: Identifying Stop Words

In [None]:
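from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords

# STOP_WORDS is an unordered set (len(spacy_stopwords) gives its size),
# so the ten words printed here can differ between runs
for stop_word in list(spacy_stopwords)[:10]:
  print(stop_word)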

above
several
five
seem
whenever
eight
ourselves
’d
me
name
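
The cell above only lists stop words. It does not remove them from the text. A minimal sketch of the filtering step follows, assuming a small hand-picked stop list rather than spaCy's full `STOP_WORDS` set. The tokens are taken from the third sentence of the article, and `remove_stop_words` is a hypothetical helper. On a spaCy `Doc`, the `token.is_stop` attribute could serve the same purpose.

```python
# Assumption: a small hand-picked stop list, not spaCy's full STOP_WORDS set.
STOP_WORDS_SUBSET = {"was", "the", "to", "have", "been", "in", "a", "and"}

tokens = ["ASTERIA", "was", "the", "first", "JPL", "-", "built", "CubeSat",
          "to", "have", "been", "successfully", "operated", "in", "space", "."]

def remove_stop_words(tokens, stop_words):
    """Drop stop words (case-insensitive) and tokens without any letter or digit."""
    return [
        t for t in tokens
        if t.lower() not in stop_words and any(c.isalnum() for c in t)
    ]

content_tokens = remove_stop_words(tokens, STOP_WORDS_SUBSET)
print(content_tokens)
```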


# Step 6: Dependency Parsing

In [None]:
displacy.render(doc, style='dep', jupyter=True)

# Step 6b: Finding Noun Phrases

In [None]:
# Collect lowercase noun chunks that contain more than one word
allnounchunks = []
for chunk in textacy.extract.noun_chunks(doc):
  chunk = str(chunk).lower()
  if len(chunk.split(' ')) > 1:
    print(chunk)
    allnounchunks.append(chunk)
uniquenounchunks = set(allnounchunks)
uniquenounchunks

(arcsecond space telescope enabling research
miniaturized space telescope technology demonstration
opportunistic science mission
astrophysical measurements
massachusetts institute
nasa's jet propulsion laboratory
first jpl-built cubesat
early career scientists
asteria's technical goal
highly stable focal plane temperature control
precision photometry
stellar brightness
precision photometry
stellar activity
other astrophysical phenomena
14 august
low earth orbit
international space station
20 november
primary mission
745 days
three extended missions
last successful communications
principal investigator
canadian-american astronomer
planetary scientist
sara seager
massachusetts institute
arcsecond space telescope enabling research
international space station
new technologies
transit method
phaeton program
early career employees
its target mission
90 days
asteria's capabilities
precision photometry
opportunistic basis
stellar activity
other astrophysical phenomena
technological objectives


{'(arcsecond space telescope enabling research',
 '14 august',
 '20 november',
 '20-minute observations',
 '745 days',
 '90 days',
 'additional information',
 'arcsecond space telescope enabling research',
 'arcsecond-level line',
 "asteria's capabilities",
 "asteria's technical goal",
 'astrophysical measurements',
 'brightest sun-like stars',
 'canadian-american astronomer',
 'continuous study',
 'conventional space observatories',
 'early career employees',
 'early career scientists',
 'eight or more days',
 'extended duration',
 'first jpl-built cubesat',
 'five observations',
 'focal plane',
 'focal plane position',
 'future space telescopes',
 'highly stable focal plane temperature control',
 'international space station',
 'its target mission',
 'last successful communications',
 'long-term mission goals',
 'long-transiting exoplanets',
 'low earth orbit',
 'low-cost space telescopes',
 'massachusetts institute',
 'miniaturized space telescope technology demonstration',
 'multip
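
Converting the list to a set removes duplicates but also discards how often each phrase occurs. If the frequencies matter, a `collections.Counter` keeps them. The sketch below uses a small sample of chunks copied from the printed output above, not the full list, so the counts apply only to this sample.

```python
from collections import Counter

# A sample of chunks copied from the printed output above, duplicates included.
sample_noun_chunks = [
    "precision photometry", "stellar brightness", "precision photometry",
    "stellar activity", "international space station", "massachusetts institute",
    "international space station", "precision photometry", "stellar activity",
]

chunk_counts = Counter(sample_noun_chunks)
# Print the most frequent phrases in the sample, most frequent first.
for chunk, count in chunk_counts.most_common(3):
    print(f"{count}  {chunk}")
```

In the notebook, the same counting could be applied to `allnounchunks` directly.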

# Step 7: Named Entity Recognition (NER)

In [None]:
displacy.render(doc, style='ent', jupyter=True)

# Step 8: Extracting Facts (Semi-structured Statements)

In [None]:
statements = textacy.extract.semistructured_statements(doc, entity="ASTERIA", cue='be')

print("Here are the things I know about ASTERIA:")

for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")

Here are the things I know about ASTERIA:
 - [a, miniaturized, space, telescope, technology, demonstration, and, opportunistic, science, mission, to, conduct, astrophysical, measurements, using, a, CubeSat]
 - [the, first, JPL, -, built, CubeSat, to, have, been, successfully, operated, in, space]
