# Introduction to Text Mining with NLTK

A short introduction to processing textual data, with some basic applications in sentiment analysis and text analysis with large language models.

# Basic Setup


Install the nltk library for text processing and download the additional resources it requires. We also install the openai library, which is used later for text analysis with large language models.

In [1]:
!pip install nltk
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

!pip install openai



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...




In [2]:
# we import a series of specific functions from the nltk package for processing the texts.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk import FreqDist

# we import pandas for reading in files
import pandas as pd

from openai import OpenAI
import json
from time import sleep

## Read in the data

In [3]:
corpus = pd.read_csv("https://github.com/casbdai/notebooks/raw/main/Module2/Textmining/fake_news.csv")
corpus.head()

Unnamed: 0,title,text,label
0,"With French Socialists in Crisis, Manuel Valls...","BOISSEUIL, France — A furious Ducourtioux...",Fake
1,All Donated Blood in U.S. Should Be Tested for...,The Food and Drug Administration on Friday too...,Fake
2,Eight Immigration Victories Won for Americans ...,Last year was a success for Americans who are ...,Fake
3,How To WATCH The Highly Anticipated ‘CLINTON C...,SHARE this link with everyone you know. EVERY ...,Real
4,CNBC: China’s Secret Plan to Crush the U.S. Sp...,"In a feature on Tuesday, CNBC explained how th...",Fake


We extract the second document (index 1) and save it in the variable `text`.

In [4]:
text = corpus["text"][1]
print(text)

The Food and Drug Administration on Friday took steps to safeguard the nation’s blood supply from the Zika virus, calling for all blood banks to screen donations for the infection even in states where the virus is not circulating. The recommendations are an acknowledgment that sexual transmission may facilitate the spread of Zika even in areas where mosquitoes carrying the virus are not present. Officials also want to prepare for the possibility that clusters of local infection will continue to pop up in parts of the United States for years to come. “There could be multiple outbreaks of Zika happening outside the known current ones in South Florida, but because we are not actively looking they could be happening silently,” said Dr. Peter J. Hotez, the dean of the National School of Tropical Medicine at Baylor College of Medicine, who applauded the F. D. A. ’s move. Without federal funds, it is generally not possible for local health departments to conduct active surveillance for Zika v

## Pre-Processing Textual Data

### Convert text to lower case:

In [5]:
lower_text = text.lower()
print(lower_text)

the food and drug administration on friday took steps to safeguard the nation’s blood supply from the zika virus, calling for all blood banks to screen donations for the infection even in states where the virus is not circulating. the recommendations are an acknowledgment that sexual transmission may facilitate the spread of zika even in areas where mosquitoes carrying the virus are not present. officials also want to prepare for the possibility that clusters of local infection will continue to pop up in parts of the united states for years to come. “there could be multiple outbreaks of zika happening outside the known current ones in south florida, but because we are not actively looking they could be happening silently,” said dr. peter j. hotez, the dean of the national school of tropical medicine at baylor college of medicine, who applauded the f. d. a. ’s move. without federal funds, it is generally not possible for local health departments to conduct active surveillance for zika v

### Tokenize text

Break the text down into tokens, i.e., split the sentences into single words for analysis.

In [6]:
word_tokens = nltk.word_tokenize(lower_text)
print(word_tokens)

['the', 'food', 'and', 'drug', 'administration', 'on', 'friday', 'took', 'steps', 'to', 'safeguard', 'the', 'nation', '’', 's', 'blood', 'supply', 'from', 'the', 'zika', 'virus', ',', 'calling', 'for', 'all', 'blood', 'banks', 'to', 'screen', 'donations', 'for', 'the', 'infection', 'even', 'in', 'states', 'where', 'the', 'virus', 'is', 'not', 'circulating', '.', 'the', 'recommendations', 'are', 'an', 'acknowledgment', 'that', 'sexual', 'transmission', 'may', 'facilitate', 'the', 'spread', 'of', 'zika', 'even', 'in', 'areas', 'where', 'mosquitoes', 'carrying', 'the', 'virus', 'are', 'not', 'present', '.', 'officials', 'also', 'want', 'to', 'prepare', 'for', 'the', 'possibility', 'that', 'clusters', 'of', 'local', 'infection', 'will', 'continue', 'to', 'pop', 'up', 'in', 'parts', 'of', 'the', 'united', 'states', 'for', 'years', 'to', 'come', '.', '“', 'there', 'could', 'be', 'multiple', 'outbreaks', 'of', 'zika', 'happening', 'outside', 'the', 'known', 'current', 'ones', 'in', 'south', '

We need a better tokenizer, because punctuation and numbers are also retained as tokens. In addition, very short words are turned into tokens.


In [7]:
better_tokenizer = RegexpTokenizer(r'[a-zA-Z]{3,}')

# [a-zA-Z] means that only letters are retained as tokens
# {3,} means that only tokens with at least three characters are retained

In [8]:
word_tokens = better_tokenizer.tokenize(lower_text)
print(word_tokens)

['the', 'food', 'and', 'drug', 'administration', 'friday', 'took', 'steps', 'safeguard', 'the', 'nation', 'blood', 'supply', 'from', 'the', 'zika', 'virus', 'calling', 'for', 'all', 'blood', 'banks', 'screen', 'donations', 'for', 'the', 'infection', 'even', 'states', 'where', 'the', 'virus', 'not', 'circulating', 'the', 'recommendations', 'are', 'acknowledgment', 'that', 'sexual', 'transmission', 'may', 'facilitate', 'the', 'spread', 'zika', 'even', 'areas', 'where', 'mosquitoes', 'carrying', 'the', 'virus', 'are', 'not', 'present', 'officials', 'also', 'want', 'prepare', 'for', 'the', 'possibility', 'that', 'clusters', 'local', 'infection', 'will', 'continue', 'pop', 'parts', 'the', 'united', 'states', 'for', 'years', 'come', 'there', 'could', 'multiple', 'outbreaks', 'zika', 'happening', 'outside', 'the', 'known', 'current', 'ones', 'south', 'florida', 'but', 'because', 'are', 'not', 'actively', 'looking', 'they', 'could', 'happening', 'silently', 'said', 'peter', 'hotez', 'the', 'de
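
To see what the pattern does in isolation, the following minimal sketch applies the same regular expression with Python's built-in `re` module to a made-up sentence. The sentence and the variable names are illustrative and not part of the corpus. For a simple pattern like this one, `re.findall` should return the same tokens as `RegexpTokenizer`.

```python
import re

# Same pattern as in the tokenizer above: runs of at least three letters.
pattern = r"[a-zA-Z]{3,}"

# Illustrative sentence (an assumption, not taken from the corpus).
sample = "the fda screened 12 donations on friday."

# findall returns every non-overlapping match of the pattern, in order.
tokens = re.findall(pattern, sample)
print(tokens)
```

The number `12`, the two-letter word `on`, and the final period do not appear in the result, because none of them is a run of three or more letters.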

### Remove stop words

Remove uninformative words from the text using the NLTK stop word list. Words such as "is", "the", and "a" carry little information on their own.

In [9]:
stopword = stopwords.words('english')
stopword

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

To get rid of stop words, we must compare each token against the words in the stop word list. This can easily be done in a list comprehension. List comprehensions are a compact alternative to for loops when building a list.

A for loop that prints out every token:

In [10]:
for word in word_tokens:
    print(word)

the
food
and
drug
administration
friday
took
steps
safeguard
the
nation
blood
supply
from
the
zika
virus
calling
for
all
blood
banks
screen
donations
for
the
infection
even
states
where
the
virus
not
circulating
the
recommendations
are
acknowledgment
that
sexual
transmission
may
facilitate
the
spread
zika
even
areas
where
mosquitoes
carrying
the
virus
are
not
present
officials
also
want
prepare
for
the
possibility
that
clusters
local
infection
will
continue
pop
parts
the
united
states
for
years
come
there
could
multiple
outbreaks
zika
happening
outside
the
known
current
ones
south
florida
but
because
are
not
actively
looking
they
could
happening
silently
said
peter
hotez
the
dean
the
national
school
tropical
medicine
baylor
college
medicine
who
applauded
the
move
without
federal
funds
generally
not
possible
for
local
health
departments
conduct
active
surveillance
for
zika
virus
the
blood
urine
patients
with
fever
rash
added
some
ways
the
inaction
from
congress
has
forced
the
adopt
this

The same loop can be reformulated as a list comprehension, which collects the tokens in a new list instead of printing them. List comprehensions are considered very readable and are therefore used frequently by Pythonistas.

In [11]:
[word for word in word_tokens]

['the',
 'food',
 'and',
 'drug',
 'administration',
 'friday',
 'took',
 'steps',
 'safeguard',
 'the',
 'nation',
 'blood',
 'supply',
 'from',
 'the',
 'zika',
 'virus',
 'calling',
 'for',
 'all',
 'blood',
 'banks',
 'screen',
 'donations',
 'for',
 'the',
 'infection',
 'even',
 'states',
 'where',
 'the',
 'virus',
 'not',
 'circulating',
 'the',
 'recommendations',
 'are',
 'acknowledgment',
 'that',
 'sexual',
 'transmission',
 'may',
 'facilitate',
 'the',
 'spread',
 'zika',
 'even',
 'areas',
 'where',
 'mosquitoes',
 'carrying',
 'the',
 'virus',
 'are',
 'not',
 'present',
 'officials',
 'also',
 'want',
 'prepare',
 'for',
 'the',
 'possibility',
 'that',
 'clusters',
 'local',
 'infection',
 'will',
 'continue',
 'pop',
 'parts',
 'the',
 'united',
 'states',
 'for',
 'years',
 'come',
 'there',
 'could',
 'multiple',
 'outbreaks',
 'zika',
 'happening',
 'outside',
 'the',
 'known',
 'current',
 'ones',
 'south',
 'florida',
 'but',
 'because',
 'are',
 'not',
 'active

We now extend the list comprehension so that only tokens that are NOT in the stop word list are retained.

In [12]:
clean_tokens = [word for word in word_tokens if word not in stopword]
print(clean_tokens)

['food', 'drug', 'administration', 'friday', 'took', 'steps', 'safeguard', 'nation', 'blood', 'supply', 'zika', 'virus', 'calling', 'blood', 'banks', 'screen', 'donations', 'infection', 'even', 'states', 'virus', 'circulating', 'recommendations', 'acknowledgment', 'sexual', 'transmission', 'may', 'facilitate', 'spread', 'zika', 'even', 'areas', 'mosquitoes', 'carrying', 'virus', 'present', 'officials', 'also', 'want', 'prepare', 'possibility', 'clusters', 'local', 'infection', 'continue', 'pop', 'parts', 'united', 'states', 'years', 'come', 'could', 'multiple', 'outbreaks', 'zika', 'happening', 'outside', 'known', 'current', 'ones', 'south', 'florida', 'actively', 'looking', 'could', 'happening', 'silently', 'said', 'peter', 'hotez', 'dean', 'national', 'school', 'tropical', 'medicine', 'baylor', 'college', 'medicine', 'applauded', 'move', 'without', 'federal', 'funds', 'generally', 'possible', 'local', 'health', 'departments', 'conduct', 'active', 'surveillance', 'zika', 'virus', 'blo
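
The test `word not in stopword` searches a Python list from the beginning for every token. For a single article this is fine. For a whole corpus, converting the list to a `set` first is a common speed-up, because membership tests on a set do not scan all elements. A minimal sketch with a made-up stop list and token list (all names below are illustrative):

```python
# Hypothetical mini stop word list and token list, for illustration only.
stop_list = ["the", "and", "for", "not", "are"]
sample_tokens = ["the", "virus", "and", "the", "blood", "supply", "are", "safe"]

# A set supports fast membership tests; the filtering logic stays the same.
stop_set = set(stop_list)
filtered = [word for word in sample_tokens if word not in stop_set]
print(filtered)
```

The result is the same as with a list; only the lookup is faster.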

### Stemming


Often we want to map the different forms of the same word to the same root word, e.g. "walks", "walking", "walked" should all be the same as "walk".


In [13]:
snowball_stemmer = SnowballStemmer('english')

stemmed_tokens = [snowball_stemmer.stem(word) for word in clean_tokens]
print(stemmed_tokens)

['food', 'drug', 'administr', 'friday', 'took', 'step', 'safeguard', 'nation', 'blood', 'suppli', 'zika', 'virus', 'call', 'blood', 'bank', 'screen', 'donat', 'infect', 'even', 'state', 'virus', 'circul', 'recommend', 'acknowledg', 'sexual', 'transmiss', 'may', 'facilit', 'spread', 'zika', 'even', 'area', 'mosquito', 'carri', 'virus', 'present', 'offici', 'also', 'want', 'prepar', 'possibl', 'cluster', 'local', 'infect', 'continu', 'pop', 'part', 'unit', 'state', 'year', 'come', 'could', 'multipl', 'outbreak', 'zika', 'happen', 'outsid', 'known', 'current', 'one', 'south', 'florida', 'activ', 'look', 'could', 'happen', 'silent', 'said', 'peter', 'hotez', 'dean', 'nation', 'school', 'tropic', 'medicin', 'baylor', 'colleg', 'medicin', 'applaud', 'move', 'without', 'feder', 'fund', 'general', 'possibl', 'local', 'health', 'depart', 'conduct', 'activ', 'surveil', 'zika', 'virus', 'blood', 'urin', 'patient', 'fever', 'rash', 'ad', 'way', 'inact', 'congress', 'forc', 'adopt', 'posit', 'hotez
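
As a quick check, the example words from the beginning of this section can be passed through the same stemmer. This is a small sketch that assumes nltk is installed; the variable names are illustrative.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# The example words from the beginning of this section.
forms = ["walks", "walking", "walked"]

# Each form is reduced to its stem.
stems = [stemmer.stem(word) for word in forms]
print(stems)
```

All three forms are mapped to `walk`. Note that a stem need not be a real word: the stemmed article above contains stems such as `suppli` and `administr`.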

### Get word frequency

Count the most frequently used words in a text document.

In [14]:
freq = FreqDist(stemmed_tokens)
print(freq.most_common(5))

[('blood', 26), ('test', 16), ('said', 14), ('screen', 12), ('state', 12)]
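
`FreqDist` behaves much like the `Counter` class from Python's standard library. The following sketch counts a short, made-up list of stems with `Counter` to show what `most_common` returns. The token list is illustrative only.

```python
from collections import Counter

# Made-up stems for illustration, not the article's tokens.
sample_stems = ["blood", "test", "blood", "screen", "blood", "test"]

# Counter maps each distinct token to the number of times it occurs.
counts = Counter(sample_stems)

# most_common(n) returns the n most frequent (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```

Each entry is a pair of a token and its count, sorted from most to least frequent.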


# Very Basic Sentiment Analysis

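Using a dictionary of positive and negative words, we can now perform a very basic sentiment analysis. Because the tokens have been stemmed, the dictionary entries must be stems as well (e.g. "increas" rather than "increase").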

In [15]:
# collect a 1 for every token that appears in the list of positive words
pos_sent = [1 for word in stemmed_tokens if word in ["correct", "good", "increas"]]

sum(pos_sent)

2

In [16]:
# collect a 1 for every token that appears in the list of negative words
neg_sent = [1 for word in stemmed_tokens if word in ["virus", "infect", "gun"]]

sum(neg_sent)

12

In [17]:
sum(pos_sent) - sum(neg_sent)

-10
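
The three steps above can be wrapped into one reusable function. This is a minimal sketch, not a full sentiment model: the function name `sentiment_score` and the sample tokens are hypothetical, and the word lists are the same small ones used above.

```python
def sentiment_score(tokens, positive, negative):
    """Return (number of positive matches) - (number of negative matches)."""
    pos_hits = sum(1 for word in tokens if word in positive)
    neg_hits = sum(1 for word in tokens if word in negative)
    return pos_hits - neg_hits

# Word lists as used above; the entries are stems because the tokens are stemmed.
positive_words = {"correct", "good", "increas"}
negative_words = {"virus", "infect", "gun"}

# Made-up stemmed tokens for illustration.
sample_stems = ["virus", "spread", "good", "infect", "test", "virus"]
score = sentiment_score(sample_stems, positive_words, negative_words)
print(score)
```

A negative score means that more negative than positive dictionary words were found. Calling `sentiment_score(stemmed_tokens, positive_words, negative_words)` applies the same logic to the article.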

# Text Analysis with Large Language Models

In [18]:
client = OpenAI(api_key="ENTER YOUR API KEY HERE")

In [19]:
def score_text(text):
    # Relies on the global variable `prompt`, which must be set before calling.
    try:
        messages = [
            {
                "role": "system",
                "content": prompt + " Score the text with a likelihood between 0 (very low) and 1 (very high).\nReturn the score as JSON with the key 'score'. Make sure that only the JSON is returned."
            },
            {
                "role": "user",
                # Only the first 400 characters of the text are sent.
                "content": text[:400]
            }
        ]
        # Pause for one second before each request.
        sleep(1)
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=messages,
            response_format={"type": "json_object"},
            temperature=0
        )
        return json.loads(response.choices[0].message.content)["score"]
    except Exception as e:
        # On any error, report it and return None instead of a score.
        print("An unexpected error occurred:", e)
        return None
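
The function above trusts that the reply is valid JSON with a numeric `score`. As a sketch of how the reply could be checked before it is used, the hypothetical helper `parse_score` below (not part of the notebook or of the openai library) returns the score only if it is a number between 0 and 1, and `None` otherwise. The reply strings are made up for illustration.

```python
import json

def parse_score(raw):
    """Return the 'score' value as a float in [0, 1], or None if invalid."""
    try:
        value = float(json.loads(raw)["score"])
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, missing key, or a value that is not a number.
        return None
    # Reject numbers outside the range requested in the prompt.
    return value if 0.0 <= value <= 1.0 else None

# Made-up replies for illustration.
print(parse_score('{"score": 0.85}'))
print(parse_score('{"score": 1.7}'))
print(parse_score("no json here"))
```

Only the first reply yields a score; the other two yield `None`.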

In [20]:
prompt = "Score the likelihood of a positive sentiment."

corpus.head()["text"].apply(score_text)

Unnamed: 0,text
0,0.2
1,0.9
2,0.75
3,0.85
4,0.2


## Comparing Fake and Real News

You can pass any prompt you want. If you want a scoring method other than a likelihood, adapt the system message inside the `score_text` function.

In [21]:
prompt = "Score the likelihood of the text being fake news."

prediction = corpus.head()["text"].apply(score_text)
prediction

Unnamed: 0,text
0,0.1
1,0.1
2,0.8
3,0.95
4,0.2


Compare to the actual label:

In [22]:
corpus.head()["label"]

Unnamed: 0,label
0,Fake
1,Fake
2,Fake
3,Real
4,Fake
