**NLTK POC**

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum." - NLTK Documentation 

In [212]:
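# NB: NLTK is installed without its data packages (corpora, models, lexicons);
# uncomment the matching nltk.download(...) line the first time a resource is needed.

import nltk
from nltk.tokenize import word_tokenize
# nltk.download('punkt')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')  # used by nltk.pos_tag
# nltk.download('maxent_ne_chunker')  # used by nltk.ne_chunk
# nltk.download('words')  # used by nltk.ne_chunk
from nltk.sentiment import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')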

In [213]:
# article: https://apnews.com/article/technology-misinformation-eastern-europe-902f436e3a6507e8b2a223e09a22e969

# Read the article inside a context manager so the file handle is closed afterwards.
with open("article.txt", "r", encoding="utf-8") as article:
    text = article.read()
print(text)

Soon after the Russian invasion, the hoaxes began. Ukrainian refugees were taking jobs, committing crimes and abusing handouts. The misinformation spread rapidly online throughout Eastern Europe, sometimes pushed by Moscow in an effort to destabilize its neighbors.

It’s the kind of swift spread of falsehoods that has been blamed in many countries for increased polarization and an erosion of trust in democratic institutions, journalism and science.

But countering or stopping misinformation has proven elusive.

New findings from university researchers and Google, however, reveal that one of the most promising responses to misinformation may also be one of the simplest.

In a paper published Wednesday in the journal Science Advances, the researchers detail how short online videos that teach basic critical thinking skills can make people better able to resist misinformation.

The researchers created a series of videos similar to a public service announcement that focused on specific misi

BASICS

In [214]:
sentences = nltk.sent_tokenize(text)
print(len(sentences))

37


In [215]:
words = nltk.word_tokenize(text)
print(len(words))

964


TOKENIZATION 

In [216]:
tokenized = word_tokenize(text)

tokenized = [token.lower() for token in tokenized]

print(tokenized)

['soon', 'after', 'the', 'russian', 'invasion', ',', 'the', 'hoaxes', 'began', '.', 'ukrainian', 'refugees', 'were', 'taking', 'jobs', ',', 'committing', 'crimes', 'and', 'abusing', 'handouts', '.', 'the', 'misinformation', 'spread', 'rapidly', 'online', 'throughout', 'eastern', 'europe', ',', 'sometimes', 'pushed', 'by', 'moscow', 'in', 'an', 'effort', 'to', 'destabilize', 'its', 'neighbors', '.', 'it', '’', 's', 'the', 'kind', 'of', 'swift', 'spread', 'of', 'falsehoods', 'that', 'has', 'been', 'blamed', 'in', 'many', 'countries', 'for', 'increased', 'polarization', 'and', 'an', 'erosion', 'of', 'trust', 'in', 'democratic', 'institutions', ',', 'journalism', 'and', 'science', '.', 'but', 'countering', 'or', 'stopping', 'misinformation', 'has', 'proven', 'elusive', '.', 'new', 'findings', 'from', 'university', 'researchers', 'and', 'google', ',', 'however', ',', 'reveal', 'that', 'one', 'of', 'the', 'most', 'promising', 'responses', 'to', 'misinformation', 'may', 'also', 'be', 'one', '
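
The output above splits "It’s" into 'it', '’', 's' because the article uses typographic (curly) apostrophes. One possible preprocessing step is to map curly quotes to their ASCII forms before tokenizing. The sketch below uses a hypothetical helper, `normalize_quotes`, which is not part of NLTK and was not run in this notebook.

```python
# Hypothetical helper (not part of NLTK): map typographic quotes to ASCII
# so that contractions such as "It’s" use a plain apostrophe.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(raw_text):
    """Return raw_text with curly quotes replaced by straight ones."""
    return raw_text.translate(QUOTE_MAP)


sample = "It\u2019s the kind of swift spread of falsehoods"
print(normalize_quotes(sample))
```

In the notebook this would be applied as `word_tokenize(normalize_quotes(text))`, after which the tokenizer should be able to treat the contraction as a word plus a clitic rather than three separate tokens.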

STOP WORD REMOVAL

In [217]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [218]:
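# Build the stop-word set once instead of reloading the list for every token.
stop_words = set(stopwords.words('english'))
filtered = [word for word in tokenized if word not in stop_words]

print(filtered)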

['soon', 'russian', 'invasion', ',', 'hoaxes', 'began', '.', 'ukrainian', 'refugees', 'taking', 'jobs', ',', 'committing', 'crimes', 'abusing', 'handouts', '.', 'misinformation', 'spread', 'rapidly', 'online', 'throughout', 'eastern', 'europe', ',', 'sometimes', 'pushed', 'moscow', 'effort', 'destabilize', 'neighbors', '.', '’', 'kind', 'swift', 'spread', 'falsehoods', 'blamed', 'many', 'countries', 'increased', 'polarization', 'erosion', 'trust', 'democratic', 'institutions', ',', 'journalism', 'science', '.', 'countering', 'stopping', 'misinformation', 'proven', 'elusive', '.', 'new', 'findings', 'university', 'researchers', 'google', ',', 'however', ',', 'reveal', 'one', 'promising', 'responses', 'misinformation', 'may', 'also', 'one', 'simplest', '.', 'paper', 'published', 'wednesday', 'journal', 'science', 'advances', ',', 'researchers', 'detail', 'short', 'online', 'videos', 'teach', 'basic', 'critical', 'thinking', 'skills', 'make', 'people', 'better', 'able', 'resist', 'misinfo

PART OF SPEECH TAGGING

In [219]:
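# NB: the tagger sees lowercased, stop-word-stripped tokens here, which removes
# context it relies on (e.g. 'online' is tagged VBD in the output below).
tagged = nltk.pos_tag(filtered)
print(tagged)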

[('soon', 'RB'), ('russian', 'JJ'), ('invasion', 'NN'), (',', ','), ('hoaxes', 'NNS'), ('began', 'VBD'), ('.', '.'), ('ukrainian', 'JJ'), ('refugees', 'NNS'), ('taking', 'VBG'), ('jobs', 'NNS'), (',', ','), ('committing', 'VBG'), ('crimes', 'NNS'), ('abusing', 'VBG'), ('handouts', 'NNS'), ('.', '.'), ('misinformation', 'NN'), ('spread', 'NN'), ('rapidly', 'RB'), ('online', 'VBD'), ('throughout', 'IN'), ('eastern', 'JJ'), ('europe', 'NN'), (',', ','), ('sometimes', 'RB'), ('pushed', 'JJ'), ('moscow', 'NN'), ('effort', 'NN'), ('destabilize', 'NN'), ('neighbors', 'NNS'), ('.', '.'), ('’', 'VB'), ('kind', 'NN'), ('swift', 'JJ'), ('spread', 'NN'), ('falsehoods', 'NNS'), ('blamed', 'VBD'), ('many', 'JJ'), ('countries', 'NNS'), ('increased', 'VBD'), ('polarization', 'NN'), ('erosion', 'NN'), ('trust', 'NN'), ('democratic', 'JJ'), ('institutions', 'NNS'), (',', ','), ('journalism', 'NN'), ('science', 'NN'), ('.', '.'), ('countering', 'VBG'), ('stopping', 'VBG'), ('misinformation', 'NN'), ('pro

LEMMATIZATION 

In [220]:
lemmatizer = WordNetLemmatizer()

In [221]:
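# lemmatize() defaults to pos='n', so only noun forms change ('hoaxes' -> 'hoax');
# verb forms such as 'began' and 'taking' pass through unchanged.
lemmatized = [lemmatizer.lemmatize(word) for word in filtered]
print(lemmatized)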

['soon', 'russian', 'invasion', ',', 'hoax', 'began', '.', 'ukrainian', 'refugee', 'taking', 'job', ',', 'committing', 'crime', 'abusing', 'handout', '.', 'misinformation', 'spread', 'rapidly', 'online', 'throughout', 'eastern', 'europe', ',', 'sometimes', 'pushed', 'moscow', 'effort', 'destabilize', 'neighbor', '.', '’', 'kind', 'swift', 'spread', 'falsehood', 'blamed', 'many', 'country', 'increased', 'polarization', 'erosion', 'trust', 'democratic', 'institution', ',', 'journalism', 'science', '.', 'countering', 'stopping', 'misinformation', 'proven', 'elusive', '.', 'new', 'finding', 'university', 'researcher', 'google', ',', 'however', ',', 'reveal', 'one', 'promising', 'response', 'misinformation', 'may', 'also', 'one', 'simplest', '.', 'paper', 'published', 'wednesday', 'journal', 'science', 'advance', ',', 'researcher', 'detail', 'short', 'online', 'video', 'teach', 'basic', 'critical', 'thinking', 'skill', 'make', 'people', 'better', 'able', 'resist', 'misinformation', '.', 're
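
Because every token above is lemmatized as a noun, verb forms such as 'began' and 'taking' are left as they are. A common refinement is to pass each token's part of speech to the lemmatizer. The sketch below assumes Penn Treebank tags as returned by `nltk.pos_tag`. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the single-letter codes are the values `WordNetLemmatizer.lemmatize` accepts for its `pos` argument.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback (matches lemmatize()'s default)


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing each word's POS to the lemmatizer."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


# The mapping itself needs no NLTK data:
print(penn_to_wordnet("VBD"), penn_to_wordnet("NNS"), penn_to_wordnet("JJ"))
```

In the notebook this would be called as `lemmatize_tagged(tagged, lemmatizer)`. With a verb tag, 'began' would be expected to come back as 'begin'. The quality of the result depends on the tags, so tagging errors carry over into the lemmas.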

NAMED ENTITY RECOGNITION

In [222]:
neTree = nltk.ne_chunk(tagged)
print(neTree)

(S
  soon/RB
  russian/JJ
  invasion/NN
  ,/,
  hoaxes/NNS
  began/VBD
  ./.
  ukrainian/JJ
  refugees/NNS
  taking/VBG
  jobs/NNS
  ,/,
  committing/VBG
  crimes/NNS
  abusing/VBG
  handouts/NNS
  ./.
  misinformation/NN
  spread/NN
  rapidly/RB
  online/VBD
  throughout/IN
  eastern/JJ
  europe/NN
  ,/,
  sometimes/RB
  pushed/JJ
  moscow/NN
  effort/NN
  destabilize/NN
  neighbors/NNS
  ./.
  ’/VB
  kind/NN
  swift/JJ
  spread/NN
  falsehoods/NNS
  blamed/VBD
  many/JJ
  countries/NNS
  increased/VBD
  polarization/NN
  erosion/NN
  trust/NN
  democratic/JJ
  institutions/NNS
  ,/,
  journalism/NN
  science/NN
  ./.
  countering/VBG
  stopping/VBG
  misinformation/NN
  proven/RB
  elusive/JJ
  ./.
  new/JJ
  findings/NNS
  university/NN
  researchers/NNS
  google/VBP
  ,/,
  however/RB
  ,/,
  reveal/VBP
  one/CD
  promising/NN
  responses/NNS
  misinformation/NN
  may/MD
  also/RB
  one/CD
  simplest/NN
  ./.
  paper/NN
  published/VBN
  wednesday/JJ
  journal/JJ
  science/NN
  adv
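
The portion of the tree shown above contains only flat token/tag pairs and no entity subtrees. A likely cause is that the input was lowercased and stripped of stop words before tagging, since capitalization is a strong cue for named entities. Running `nltk.ne_chunk` on `nltk.pos_tag(word_tokenize(text))`, with the original casing, may recover entities such as place and organization names. Note that `ne_chunk` needs the `maxent_ne_chunker` and `words` data packages. Once a tree does contain entities, they can be collected as sketched below. `extract_entities` is a hypothetical helper, and the example tree is hand-built for illustration rather than real chunker output.

```python
from nltk import Tree


def extract_entities(tree):
    """Collect (label, text) pairs from the entity subtrees of an ne_chunk result."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; named entities are nested Trees.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built tree in the shape ne_chunk returns (illustrative, not real output).
example = Tree("S", [("pushed", "VBN"), ("by", "IN"),
                     Tree("GPE", [("Moscow", "NNP")])])
print(extract_entities(example))
```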

FREQUENCY DISTRIBUTION

In [223]:
frequencyDist = nltk.FreqDist(filtered)
print(frequencyDist)

<FreqDist with 332 samples and 618 outcomes>


In [224]:
print(frequencyDist.most_common(25))

[(',', 44), ('.', 37), ('misinformation', 20), ('’', 15), ('videos', 11), ('people', 8), ('claims', 8), ('“', 8), ('”', 8), ('researchers', 7), ('false', 7), ('online', 6), ('pre-bunking', 6), ('university', 5), ('google', 5), ('one', 5), ('also', 5), ('said', 5), ('refugees', 4), ('teach', 4), ('make', 4), ('research', 4), ('effective', 4), ('ukrainian', 3), ('spread', 3)]
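
Punctuation tokens (',', '.', '’', '“', '”') take five of the top nine slots above, which hides the content words. One option is to drop tokens that contain no letters or digits before counting. The sketch below uses a hypothetical helper, `drop_punctuation`, and a small illustrative token list rather than the article's tokens.

```python
def drop_punctuation(tokens):
    """Keep tokens containing at least one letter or digit.

    Hyphenated words such as 'pre-bunking' survive; bare punctuation
    (including curly quotes) is removed.
    """
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


# Illustrative tokens only; in the notebook this would be the `filtered` list.
sample_tokens = ["misinformation", ",", "videos", "\u2019", "pre-bunking",
                 ".", "misinformation", "\u201c", "videos"]
print(drop_punctuation(sample_tokens))
```

In the notebook, `nltk.FreqDist(drop_punctuation(filtered))` would then rank only word-like tokens.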


SENTIMENT ANALYSIS

This section uses VADER, the lexicon- and rule-based sentiment analyzer bundled with NLTK.

In [225]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("This is an excellent product!")

{'neg': 0.0, 'neu': 0.501, 'pos': 0.499, 'compound': 0.6114}

In [226]:
# sentiment analysis of AP article
sia.polarity_scores(text)

{'neg': 0.112, 'neu': 0.773, 'pos': 0.116, 'compound': 0.8399}
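
The compound score of 0.8399 looks strongly positive even though the 'neg' (0.112) and 'pos' (0.116) proportions are nearly equal. VADER's compound value is a normalized sum of word valences, so it tends to drift toward ±1 as a text gets longer. For a full article it can be more informative to score each sentence and aggregate the results. The sketch below uses a hypothetical helper, `mean_compound`, and a stand-in scorer so that it runs without the VADER lexicon. Averaging is one possible aggregation, not the only one.

```python
def mean_compound(sentences, score_fn):
    """Average the 'compound' value that score_fn returns for each sentence."""
    scores = [score_fn(sentence)["compound"] for sentence in sentences]
    return sum(scores) / len(scores) if scores else 0.0


# Stand-in scorer for illustration only; it is NOT VADER.
def fake_scorer(sentence):
    return {"compound": 0.5 if "good" in sentence else -0.5}


print(mean_compound(["a good day", "a bad day"], fake_scorer))
```

In the notebook this would be called as `mean_compound(sentences, sia.polarity_scores)`, reusing the `sentences` list from the BASICS section.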