**Natural language processing** (NLP) is the field concerned with getting computers to understand language the way humans do. It has many applications, including:
- voice-to-text services for people who are hard of hearing.
- text-to-voice services for people who have difficulty reading.
- automated chatbots for organizations.
- translation services.

Generally when we get text data, strings aren't broken out into individual words or even sentences. We might have a full tweet, full chapter of a book, or full .pdf file all in one long string.

Today, we're diving into the practical side of NLP: taking text data and breaking it out into words that we can then leverage in machine learning.

In [57]:
# Imports
import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re

In [64]:
# Download the VADER lexicon. This is a Python call (it needs `import nltk`), not a shell command.
import nltk
nltk.download('vader_lexicon')


In [58]:
# Define Merck text.
text = 'Merck (NYSE: MRK), known as MSD outside of the United States and Canada, today announced the completion of the acquisition of Harpoon Therapeutics, Inc. (Nasdaq: HARP). Harpoon is now a wholly-owned subsidiary of Merck, and Harpoon’s common stock will no longer be publicly traded or listed on the Nasdaq Stock Market. Harpoon’s lead candidate, MK-6070 (formerly known as HPN328), is a T-cell engager targeting delta-like ligand 3 (DLL3), an inhibitory canonical Notch ligand that is expressed at high levels in small cell lung cancer (SCLC) and neuroendocrine tumors. The safety, tolerability and pharmacokinetics of MK-6070 is currently being evaluated as monotherapy in a Phase 1/2 clinical trial (NCT04471727) in certain patients with advanced cancers associated with expression of DLL3. The study is also evaluating MK-6070 in combination with atezolizumab in certain patients with SCLC. In March 2022, the U.S. Food and Drug Administration (FDA) granted Orphan Drug Designation to MK-6070 for the treatment of SCLC.'

In [24]:
print(text)

Merck (NYSE: MRK), known as MSD outside of the United States and Canada, today announced the completion of the acquisition of Harpoon Therapeutics, Inc. (Nasdaq: HARP). Harpoon is now a wholly-owned subsidiary of Merck, and Harpoon’s common stock will no longer be publicly traded or listed on the Nasdaq Stock Market. Harpoon’s lead candidate, MK-6070 (formerly known as HPN328), is a T-cell engager targeting delta-like ligand 3 (DLL3), an inhibitory canonical Notch ligand that is expressed at high levels in small cell lung cancer (SCLC) and neuroendocrine tumors. The safety, tolerability and pharmacokinetics of MK-6070 is currently being evaluated as monotherapy in a Phase 1/2 clinical trial (NCT04471727) in certain patients with advanced cancers associated with expression of DLL3. The study is also evaluating MK-6070 in combination with atezolizumab in certain patients with SCLC. In March 2022, the U.S. Food and Drug Administration (FDA) granted Orphan Drug Designation to MK-6070 for t

# Pre-Processing 

When dealing with text data, there are common pre-processing steps. We won't necessarily use all of them every time we deal with text data.

- Removing special characters
- Tokenizing
- Lemmatizing/Stemming
- Stop word removal

## Removing special characters & Tokenizing

We need to remove unnecessary characters when cleaning text data (punctuation, symbols, etc.). This can be done with RegEx.

When we "**tokenize**" data, we take it and split it up into distinct chunks based on some pattern.

If we use a RegEx tokenizer, we often can do these steps together.
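As a rough sketch of why one regular expression can handle both steps, the snippet below uses Python's built-in `re` module on a fragment of the Merck text. The `fragment` and `fragment_tokens` names are assumptions made for this illustration; the notebook itself uses NLTK's `RegexpTokenizer` for the same job.

```python
import re

# A short fragment of the Merck text (illustrative only).
fragment = 'Harpoon’s lead candidate, MK-6070 (formerly known as HPN328)'

# \w+ keeps runs of letters, digits, and underscores.
# Everything else (spaces, commas, hyphens, parentheses, apostrophes)
# is skipped, so cleaning and tokenizing happen in one pass.
fragment_tokens = re.findall(r'\w+', fragment.lower())

print(fragment_tokens)
```

Note that the hyphen and the apostrophe act as separators here, so `MK-6070` and `Harpoon’s` each become two tokens.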

In [25]:
# Sentence tokenizer. Note: lowercasing first may affect later named-entity recognition (NER).
sent_tokenize(text.lower())

['merck (nyse: mrk), known as msd outside of the united states and canada, today announced the completion of the acquisition of harpoon therapeutics, inc. (nasdaq: harp).',
 'harpoon is now a wholly-owned subsidiary of merck, and harpoon’s common stock will no longer be publicly traded or listed on the nasdaq stock market.',
 'harpoon’s lead candidate, mk-6070 (formerly known as hpn328), is a t-cell engager targeting delta-like ligand 3 (dll3), an inhibitory canonical notch ligand that is expressed at high levels in small cell lung cancer (sclc) and neuroendocrine tumors.',
 'the safety, tolerability and pharmacokinetics of mk-6070 is currently being evaluated as monotherapy in a phase 1/2 clinical trial (nct04471727) in certain patients with advanced cancers associated with expression of dll3.',
 'the study is also evaluating mk-6070 in combination with atezolizumab in certain patients with sclc.',
 'in march 2022, the u.s. food and drug administration (fda) granted orphan drug design

In [26]:
# word tokenizer
word_tokenize(text.lower())

['merck',
 '(',
 'nyse',
 ':',
 'mrk',
 ')',
 ',',
 'known',
 'as',
 'msd',
 'outside',
 'of',
 'the',
 'united',
 'states',
 'and',
 'canada',
 ',',
 'today',
 'announced',
 'the',
 'completion',
 'of',
 'the',
 'acquisition',
 'of',
 'harpoon',
 'therapeutics',
 ',',
 'inc.',
 '(',
 'nasdaq',
 ':',
 'harp',
 ')',
 '.',
 'harpoon',
 'is',
 'now',
 'a',
 'wholly-owned',
 'subsidiary',
 'of',
 'merck',
 ',',
 'and',
 'harpoon',
 '’',
 's',
 'common',
 'stock',
 'will',
 'no',
 'longer',
 'be',
 'publicly',
 'traded',
 'or',
 'listed',
 'on',
 'the',
 'nasdaq',
 'stock',
 'market',
 '.',
 'harpoon',
 '’',
 's',
 'lead',
 'candidate',
 ',',
 'mk-6070',
 '(',
 'formerly',
 'known',
 'as',
 'hpn328',
 ')',
 ',',
 'is',
 'a',
 't-cell',
 'engager',
 'targeting',
 'delta-like',
 'ligand',
 '3',
 '(',
 'dll3',
 ')',
 ',',
 'an',
 'inhibitory',
 'canonical',
 'notch',
 'ligand',
 'that',
 'is',
 'expressed',
 'at',
 'high',
 'levels',
 'in',
 'small',
 'cell',
 'lung',
 'cancer',
 '(',
 'sclc',

In [27]:
# Instantiate RegExp Tokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [28]:
# "Run" Tokenizer
text_tokens = tokenizer.tokenize(text.lower())

In [29]:
# Show Results
text_tokens

['merck',
 'nyse',
 'mrk',
 'known',
 'as',
 'msd',
 'outside',
 'of',
 'the',
 'united',
 'states',
 'and',
 'canada',
 'today',
 'announced',
 'the',
 'completion',
 'of',
 'the',
 'acquisition',
 'of',
 'harpoon',
 'therapeutics',
 'inc',
 'nasdaq',
 'harp',
 'harpoon',
 'is',
 'now',
 'a',
 'wholly',
 'owned',
 'subsidiary',
 'of',
 'merck',
 'and',
 'harpoon',
 's',
 'common',
 'stock',
 'will',
 'no',
 'longer',
 'be',
 'publicly',
 'traded',
 'or',
 'listed',
 'on',
 'the',
 'nasdaq',
 'stock',
 'market',
 'harpoon',
 's',
 'lead',
 'candidate',
 'mk',
 '6070',
 'formerly',
 'known',
 'as',
 'hpn328',
 'is',
 'a',
 't',
 'cell',
 'engager',
 'targeting',
 'delta',
 'like',
 'ligand',
 '3',
 'dll3',
 'an',
 'inhibitory',
 'canonical',
 'notch',
 'ligand',
 'that',
 'is',
 'expressed',
 'at',
 'high',
 'levels',
 'in',
 'small',
 'cell',
 'lung',
 'cancer',
 'sclc',
 'and',
 'neuroendocrine',
 'tumors',
 'the',
 'safety',
 'tolerability',
 'and',
 'pharmacokinetics',
 'of',
 'mk',

<details><summary>In comparing the original text to our tokenized version of the text, we converted one long string into a list of strings. What other changes occurred?</summary>

- All strings were converted to lower case.
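- All punctuation was removed. (This was done using **regular expressions**.)
- Because punctuation acts as a separator, hyphenated terms and possessives were split (`wholly-owned` became `wholly`, `owned`; `harpoon’s` became `harpoon`, `s`).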
</details>

### Briefly: Regular Expressions

Regular Expressions, or RegEx, is a helpful tool for detecting patterns in text. 
- This is a tool of which you should be aware!

In [30]:
# [(re.findall(r'\d+', i), i) for i in text_tokens]

In Python's `re` module, `\d+` matches one or more consecutive digits. The (commented-out) code above would search each token in `text_tokens` and pair it with any runs of digits found in it.
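To see `\d+` in action without printing every token, here is a minimal sketch on a hand-picked subset. The `sample_tokens` list is an assumption for the demo (copied from the tokens above), not the full `text_tokens` list.

```python
import re

# A few tokens copied from the tokenized text above (illustrative subset).
sample_tokens = ['merck', 'hpn328', 'dll3', '6070', 'nct04471727']

# Pair each token with every run of digits found inside it.
digit_matches = [(re.findall(r'\d+', token), token) for token in sample_tokens]

for digits, token in digit_matches:
    print(token, '->', digits)
```

Tokens without digits, such as `merck`, are paired with an empty list.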

A `RegexpTokenizer` splits a string into substrings using regular expressions.

The following examples are adapted from the [NLTK `RegexpTokenizer` source](http://www.nltk.org/_modules/nltk/tokenize/regexp.html).

In [31]:
# Define and print string.
s = text

print(s)

Merck (NYSE: MRK), known as MSD outside of the United States and Canada, today announced the completion of the acquisition of Harpoon Therapeutics, Inc. (Nasdaq: HARP). Harpoon is now a wholly-owned subsidiary of Merck, and Harpoon’s common stock will no longer be publicly traded or listed on the Nasdaq Stock Market. Harpoon’s lead candidate, MK-6070 (formerly known as HPN328), is a T-cell engager targeting delta-like ligand 3 (DLL3), an inhibitory canonical Notch ligand that is expressed at high levels in small cell lung cancer (SCLC) and neuroendocrine tumors. The safety, tolerability and pharmacokinetics of MK-6070 is currently being evaluated as monotherapy in a Phase 1/2 clinical trial (NCT04471727) in certain patients with advanced cancers associated with expression of DLL3. The study is also evaluating MK-6070 in combination with atezolizumab in certain patients with SCLC. In March 2022, the U.S. Food and Drug Administration (FDA) granted Orphan Drug Designation to MK-6070 for t

In [32]:
# Instantiate tokenizer.
tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In [33]:
# Run tokenizer.
tokenizer_1.tokenize(s)

['Merck',
 '(NYSE:',
 'MRK',
 '),',
 'known',
 'as',
 'MSD',
 'outside',
 'of',
 'the',
 'United',
 'States',
 'and',
 'Canada',
 ',',
 'today',
 'announced',
 'the',
 'completion',
 'of',
 'the',
 'acquisition',
 'of',
 'Harpoon',
 'Therapeutics',
 ',',
 'Inc',
 '.',
 '(Nasdaq:',
 'HARP',
 ').',
 'Harpoon',
 'is',
 'now',
 'a',
 'wholly',
 '-owned',
 'subsidiary',
 'of',
 'Merck',
 ',',
 'and',
 'Harpoon',
 '’s',
 'common',
 'stock',
 'will',
 'no',
 'longer',
 'be',
 'publicly',
 'traded',
 'or',
 'listed',
 'on',
 'the',
 'Nasdaq',
 'Stock',
 'Market',
 '.',
 'Harpoon',
 '’s',
 'lead',
 'candidate',
 ',',
 'MK',
 '-6070',
 '(formerly',
 'known',
 'as',
 'HPN328',
 '),',
 'is',
 'a',
 'T',
 '-cell',
 'engager',
 'targeting',
 'delta',
 '-like',
 'ligand',
 '3',
 '(DLL3),',
 'an',
 'inhibitory',
 'canonical',
 'Notch',
 'ligand',
 'that',
 'is',
 'expressed',
 'at',
 'high',
 'levels',
 'in',
 'small',
 'cell',
 'lung',
 'cancer',
 '(SCLC)',
 'and',
 'neuroendocrine',
 'tumors',
 '.',

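`tokenizer_1` tries three alternatives at each position: a run of word characters (`\w+`), a dollar amount such as `$3.50` (`\$[\d\.]+`), or any other run of non-whitespace characters (`\S+`). Whitespace is never matched, so it acts as a separator, and punctuation ends up in tokens of its own (for example `'(NYSE:'` and `'),'`).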
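Our Merck text contains no dollar amounts, so the middle alternative never fires above. The sketch below tries the same pattern on a made-up sentence that does include one. It uses `re.findall` as a stand-in for the tokenizer, on the assumption that the two behave alike when `gaps` is left at its default; the `sample` sentence is hypothetical.

```python
import re

# Same pattern as tokenizer_1. Alternatives are tried left to right:
#   \w+        a run of letters, digits, or underscores
#   \$[\d\.]+  a dollar sign followed by digits and periods
#   \S+        any other run of non-whitespace characters
pattern = r'\w+|\$[\d\.]+|\S+'

# Hypothetical sample sentence chosen to include a dollar amount.
sample = 'Good muffins cost $3.88 in New York.'

pieces = re.findall(pattern, sample)
print(pieces)
```

The price survives as a single token because `\w+` cannot start at `$`, so the second alternative gets its turn.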

In [34]:
# Instantiate tokenizer.
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)

# Run tokenizer.
tokenizer_2.tokenize(s)

['Merck',
 '(NYSE:',
 'MRK),',
 'known',
 'as',
 'MSD',
 'outside',
 'of',
 'the',
 'United',
 'States',
 'and',
 'Canada,',
 'today',
 'announced',
 'the',
 'completion',
 'of',
 'the',
 'acquisition',
 'of',
 'Harpoon',
 'Therapeutics,',
 'Inc.',
 '(Nasdaq:',
 'HARP).',
 'Harpoon',
 'is',
 'now',
 'a',
 'wholly-owned',
 'subsidiary',
 'of',
 'Merck,',
 'and',
 'Harpoon’s',
 'common',
 'stock',
 'will',
 'no',
 'longer',
 'be',
 'publicly',
 'traded',
 'or',
 'listed',
 'on',
 'the',
 'Nasdaq',
 'Stock',
 'Market.',
 'Harpoon’s',
 'lead',
 'candidate,',
 'MK-6070',
 '(formerly',
 'known',
 'as',
 'HPN328),',
 'is',
 'a',
 'T-cell',
 'engager',
 'targeting',
 'delta-like',
 'ligand',
 '3',
 '(DLL3),',
 'an',
 'inhibitory',
 'canonical',
 'Notch',
 'ligand',
 'that',
 'is',
 'expressed',
 'at',
 'high',
 'levels',
 'in',
 'small',
 'cell',
 'lung',
 'cancer',
 '(SCLC)',
 'and',
 'neuroendocrine',
 'tumors.',
 'The',
 'safety,',
 'tolerability',
 'and',
 'pharmacokinetics',
 'of',
 'MK-6

The pattern in `tokenizer_2` matches runs of whitespace. Setting `gaps=True` tells the tokenizer that the pattern describes the separators between tokens rather than the tokens themselves, so we split on whitespace and keep everything else.
- If you changed to `gaps=False`, you would get back only the whitespace!
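The difference between the two `gaps` settings can be sketched with the standard `re` module: `re.split` plays the role of `gaps=True` and `re.findall` the role of `gaps=False`. This is an analogy under that assumption rather than NLTK's own code, and the `sample` string is just a fragment of our text.

```python
import re

sample = 'Harpoon is now a wholly-owned subsidiary'

# Analogue of gaps=True: the pattern marks the separators,
# and we keep what lies between them.
between_gaps = re.split(r'\s+', sample)

# Analogue of gaps=False: the pattern marks the tokens,
# so we keep the whitespace itself.
matched = re.findall(r'\s+', sample)

print(between_gaps)
print(matched)
```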

In [35]:
# Instantiate tokenizer.
tokenizer_3 = RegexpTokenizer(r'[A-Z]\w+')

# Run tokenizer.
tokenizer_3.tokenize(s)

['Merck',
 'NYSE',
 'MRK',
 'MSD',
 'United',
 'States',
 'Canada',
 'Harpoon',
 'Therapeutics',
 'Inc',
 'Nasdaq',
 'HARP',
 'Harpoon',
 'Merck',
 'Harpoon',
 'Nasdaq',
 'Stock',
 'Market',
 'Harpoon',
 'MK',
 'HPN328',
 'DLL3',
 'Notch',
 'SCLC',
 'The',
 'MK',
 'Phase',
 'NCT04471727',
 'DLL3',
 'The',
 'MK',
 'SCLC',
 'In',
 'March',
 'Food',
 'Drug',
 'Administration',
 'FDA',
 'Orphan',
 'Drug',
 'Designation',
 'MK',
 'SCLC']

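`tokenizer_3` returns only runs of word characters that begin with a capital letter and are at least two characters long. That is why single capitals, such as the "T" in "T-cell" and the letters of "U.S.", are missing from the output.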

As you can imagine, using RegEx _can_ be incredibly helpful if you want to find text matching a specific pattern.
- People used to use two spaces after a period to split sentences up; you could use RegEx to detect that pattern and tokenize on entire sentences.
- Chapters in a book could be titled "Chapter" followed by a number; you could use RegEx to detect that pattern and tokenize a book by its chapters.
- When Python libraries are upgraded, syntax changes! Perhaps you want to detect a certain pattern of syntax so you can update your code efficiently.
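As one possible sketch of the chapter idea above, the snippet below splits a made-up miniature "book" on its chapter headings. The `book` string and the exact heading format are assumptions for the demo; a real book would likely need a more forgiving pattern.

```python
import re

# Hypothetical miniature "book"; the chapter text is made up for the demo.
book = 'Chapter 1 It begins. Chapter 2 It continues. Chapter 10 It ends.'

# Split on each "Chapter <number>" heading, then drop blank pieces
# (re.split leaves an empty string before the first heading).
chapters = [part.strip() for part in re.split(r'Chapter \d+', book) if part.strip()]

print(chapters)
```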

![](./images/regex.png)

[_from xkcd_](https://xkcd.com/1171/)

## Lemmatizing & Stemming

- "He is *running* really fast!"
- "He *ran* the race."
- "He *runs* a five-minute mile."

If we wanted a computer to interpret these sentences, we might count up how many times we see each word. The computer will treat words like "running," "ran," and "runs" differently... but they mean very similar things (in this context)!

**Lemmatizing** and **stemming** are two forms of shortening words so we can combine similar forms of the same word.

When we "**lemmatize**" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

In [36]:
# Instantiate lemmatizer. 
lemmatizer = WordNetLemmatizer()

In [37]:
# Lemmatize tokens.
tokens_lem = [lemmatizer.lemmatize(i) for i in text_tokens]

In [38]:
# Compare tokens to lemmatized version.
list(zip(text_tokens, tokens_lem))

[('merck', 'merck'),
 ('nyse', 'nyse'),
 ('mrk', 'mrk'),
 ('known', 'known'),
 ('as', 'a'),
 ('msd', 'msd'),
 ('outside', 'outside'),
 ('of', 'of'),
 ('the', 'the'),
 ('united', 'united'),
 ('states', 'state'),
 ('and', 'and'),
 ('canada', 'canada'),
 ('today', 'today'),
 ('announced', 'announced'),
 ('the', 'the'),
 ('completion', 'completion'),
 ('of', 'of'),
 ('the', 'the'),
 ('acquisition', 'acquisition'),
 ('of', 'of'),
 ('harpoon', 'harpoon'),
 ('therapeutics', 'therapeutic'),
 ('inc', 'inc'),
 ('nasdaq', 'nasdaq'),
 ('harp', 'harp'),
 ('harpoon', 'harpoon'),
 ('is', 'is'),
 ('now', 'now'),
 ('a', 'a'),
 ('wholly', 'wholly'),
 ('owned', 'owned'),
 ('subsidiary', 'subsidiary'),
 ('of', 'of'),
 ('merck', 'merck'),
 ('and', 'and'),
 ('harpoon', 'harpoon'),
 ('s', 's'),
 ('common', 'common'),
 ('stock', 'stock'),
 ('will', 'will'),
 ('no', 'no'),
 ('longer', 'longer'),
 ('be', 'be'),
 ('publicly', 'publicly'),
 ('traded', 'traded'),
 ('or', 'or'),
 ('listed', 'listed'),
 ('on', 'on')

In [39]:
# Print only those lemmatized tokens that are different.
[(text_tokens[i], tokens_lem[i]) for i in range(len(text_tokens)) if text_tokens[i] != tokens_lem[i]]

[('as', 'a'),
 ('states', 'state'),
 ('therapeutics', 'therapeutic'),
 ('as', 'a'),
 ('levels', 'level'),
 ('tumors', 'tumor'),
 ('as', 'a'),
 ('patients', 'patient'),
 ('cancers', 'cancer'),
 ('patients', 'patient')]

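Lemmatizing is usually the more grammatically precise of the two approaches, but it may change very little. By default, `WordNetLemmatizer` treats every token as a noun, which is why plurals such as "states" and "patients" were reduced while verb forms such as "announced" were left alone. (The `('as', 'a')` pairs are an artifact of the plural-stripping rule, not a meaningful lemma.)

We can also lemmatize individual words, and pass a part-of-speech tag such as `pos='v'` when the word is a verb.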

When we "**stem**" data, we take words and attempt to return a base form of the word. Stemming tends to be cruder than lemmatization: it strips suffixes by rule, so the result is not always a real word. [Porter's 1980 paper](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf) describes the algorithm used below.

In [40]:
# Instantiate PorterStemmer.
p_stemmer = PorterStemmer()

In [41]:
# Stem tokens.
stem_text = [p_stemmer.stem(i) for i in text_tokens] 

In [42]:
# Compare tokens to stemmed version.
list(zip(text_tokens, stem_text))

[('merck', 'merck'),
 ('nyse', 'nyse'),
 ('mrk', 'mrk'),
 ('known', 'known'),
 ('as', 'as'),
 ('msd', 'msd'),
 ('outside', 'outsid'),
 ('of', 'of'),
 ('the', 'the'),
 ('united', 'unit'),
 ('states', 'state'),
 ('and', 'and'),
 ('canada', 'canada'),
 ('today', 'today'),
 ('announced', 'announc'),
 ('the', 'the'),
 ('completion', 'complet'),
 ('of', 'of'),
 ('the', 'the'),
 ('acquisition', 'acquisit'),
 ('of', 'of'),
 ('harpoon', 'harpoon'),
 ('therapeutics', 'therapeut'),
 ('inc', 'inc'),
 ('nasdaq', 'nasdaq'),
 ('harp', 'harp'),
 ('harpoon', 'harpoon'),
 ('is', 'is'),
 ('now', 'now'),
 ('a', 'a'),
 ('wholly', 'wholli'),
 ('owned', 'own'),
 ('subsidiary', 'subsidiari'),
 ('of', 'of'),
 ('merck', 'merck'),
 ('and', 'and'),
 ('harpoon', 'harpoon'),
 ('s', 's'),
 ('common', 'common'),
 ('stock', 'stock'),
 ('will', 'will'),
 ('no', 'no'),
 ('longer', 'longer'),
 ('be', 'be'),
 ('publicly', 'publicli'),
 ('traded', 'trade'),
 ('or', 'or'),
 ('listed', 'list'),
 ('on', 'on'),
 ('the', 'the')

In [43]:
# Print only those stemmed tokens that are different.

[(text_tokens[i], stem_text[i]) for i in range(len(text_tokens)) if text_tokens[i] != stem_text[i]] 

[('outside', 'outsid'),
 ('united', 'unit'),
 ('states', 'state'),
 ('announced', 'announc'),
 ('completion', 'complet'),
 ('acquisition', 'acquisit'),
 ('therapeutics', 'therapeut'),
 ('wholly', 'wholli'),
 ('owned', 'own'),
 ('subsidiary', 'subsidiari'),
 ('publicly', 'publicli'),
 ('traded', 'trade'),
 ('listed', 'list'),
 ('candidate', 'candid'),
 ('formerly', 'formerli'),
 ('engager', 'engag'),
 ('targeting', 'target'),
 ('inhibitory', 'inhibitori'),
 ('canonical', 'canon'),
 ('expressed', 'express'),
 ('levels', 'level'),
 ('neuroendocrine', 'neuroendocrin'),
 ('tumors', 'tumor'),
 ('safety', 'safeti'),
 ('tolerability', 'toler'),
 ('pharmacokinetics', 'pharmacokinet'),
 ('currently', 'current'),
 ('being', 'be'),
 ('evaluated', 'evalu'),
 ('monotherapy', 'monotherapi'),
 ('clinical', 'clinic'),
 ('patients', 'patient'),
 ('advanced', 'advanc'),
 ('cancers', 'cancer'),
 ('associated', 'associ'),
 ('expression', 'express'),
 ('study', 'studi'),
 ('evaluating', 'evalu'),
 ('combina

We can also stem individual words.
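For instance, here is a small sketch that stems the three verb forms from the example sentences at the start of this section. It assumes NLTK is installed; `PorterStemmer` does not need any extra downloads. The `words` and `stems` names are introduced just for this illustration.

```python
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

# The three verb forms from the example sentences at the top of this section.
words = ['running', 'ran', 'runs']
stems = [p_stemmer.stem(word) for word in words]

print(list(zip(words, stems)))
```

"running" and "runs" collapse to the same stem, but "ran" is left alone because Porter's rules only strip suffixes. Irregular forms like this are where a lemmatizer with the right part-of-speech tag tends to do better.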

## Stop Word Removal

The following sentence has had stop words (and punctuation) removed:

"Answer great question life universe everything said deep thought said deep thought paused forty two said deep thought infinite majesty calm."

<details><summary>Based on this, how would you define stop words?</summary>

Stop words are words that have little to no significance or meaning. They are common words that only add to the grammatical structure and flow of the sentence, so it is still relatively easy to identify the contents of sentences without stop words.
</details>
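Before using NLTK's full list, the idea can be sketched with a tiny hand-written set. Everything here is an assumption for the demo: `demo_stop_words` is NOT NLTK's English stop word list, and the sentence is a shortened fragment of our text.

```python
import re

# Tiny hand-written stop word set for the demo (NOT NLTK's full English list).
demo_stop_words = {'the', 'is', 'a', 'of', 'and', 'now'}

sentence = 'Harpoon is now a wholly-owned subsidiary of Merck'
tokens = re.findall(r'\w+', sentence.lower())

# Build the set once and reuse it; set membership checks are fast.
kept = [token for token in tokens if token not in demo_stop_words]

print(kept)
```

The remaining words still convey the gist of the sentence, which is the point of removing stop words.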

In [44]:
# Print English stopwords.
# print(stopwords.words('english'))

In [45]:
# Remove stopwords from "text_tokens."
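# Build the stop word set once instead of re-reading the list for every token.
english_stop_words = set(stopwords.words('english'))
no_stop_words = [token for token in text_tokens if token not in english_stop_words]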

In [46]:
# Check it
print(no_stop_words)

['merck', 'nyse', 'mrk', 'known', 'msd', 'outside', 'united', 'states', 'canada', 'today', 'announced', 'completion', 'acquisition', 'harpoon', 'therapeutics', 'inc', 'nasdaq', 'harp', 'harpoon', 'wholly', 'owned', 'subsidiary', 'merck', 'harpoon', 'common', 'stock', 'longer', 'publicly', 'traded', 'listed', 'nasdaq', 'stock', 'market', 'harpoon', 'lead', 'candidate', 'mk', '6070', 'formerly', 'known', 'hpn328', 'cell', 'engager', 'targeting', 'delta', 'like', 'ligand', '3', 'dll3', 'inhibitory', 'canonical', 'notch', 'ligand', 'expressed', 'high', 'levels', 'small', 'cell', 'lung', 'cancer', 'sclc', 'neuroendocrine', 'tumors', 'safety', 'tolerability', 'pharmacokinetics', 'mk', '6070', 'currently', 'evaluated', 'monotherapy', 'phase', '1', '2', 'clinical', 'trial', 'nct04471727', 'certain', 'patients', 'advanced', 'cancers', 'associated', 'expression', 'dll3', 'study', 'also', 'evaluating', 'mk', '6070', 'combination', 'atezolizumab', 'certain', 'patients', 'sclc', 'march', '2022', 'u

In [47]:
print(text_tokens)

['merck', 'nyse', 'mrk', 'known', 'as', 'msd', 'outside', 'of', 'the', 'united', 'states', 'and', 'canada', 'today', 'announced', 'the', 'completion', 'of', 'the', 'acquisition', 'of', 'harpoon', 'therapeutics', 'inc', 'nasdaq', 'harp', 'harpoon', 'is', 'now', 'a', 'wholly', 'owned', 'subsidiary', 'of', 'merck', 'and', 'harpoon', 's', 'common', 'stock', 'will', 'no', 'longer', 'be', 'publicly', 'traded', 'or', 'listed', 'on', 'the', 'nasdaq', 'stock', 'market', 'harpoon', 's', 'lead', 'candidate', 'mk', '6070', 'formerly', 'known', 'as', 'hpn328', 'is', 'a', 't', 'cell', 'engager', 'targeting', 'delta', 'like', 'ligand', '3', 'dll3', 'an', 'inhibitory', 'canonical', 'notch', 'ligand', 'that', 'is', 'expressed', 'at', 'high', 'levels', 'in', 'small', 'cell', 'lung', 'cancer', 'sclc', 'and', 'neuroendocrine', 'tumors', 'the', 'safety', 'tolerability', 'and', 'pharmacokinetics', 'of', 'mk', '6070', 'is', 'currently', 'being', 'evaluated', 'as', 'monotherapy', 'in', 'a', 'phase', '1', '2',

In [48]:
# len(no_stop_words), len(text_tokens)

---

# Sentiment Analysis

![](./images/sent.jpeg)

[Sentiment analysis](https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html) is an area of natural language processing in which we seek to classify text as having positive or negative emotion.

Let's build a simple function that can classify text as either having positive or negative sentiment.

What words tell us whether certain text is positive?

In [80]:
# # Let's come up with a list of positive and negative words we might observe.

# positive_words = ['improvement', 'recommends', 'delight', 'good', 'great', 'awesome', 'tremendous', 'fabulous',
#                  'amazing', 'stellar', 'fantastic', 'super']
# negative_words = ['adverse', 'unlikely', 'ugly', 'bad', 'disgusting', 'terrible', 'gross', 'awful', 'worst']

In [81]:
# def simple_sentiment(text):
#     # Instantiate tokenizer.
#     tokenizer = RegexpTokenizer(r'\w+')
    
#     # Tokenize text.
#     tokens = tokenizer.tokenize(text.lower())
    
#     # Instantiate stemmer.
#     p_stemmer = PorterStemmer()
    
#     # Stem words.
#     stemmed_words = [p_stemmer.stem(i) for i in tokens]
    
#     # Stem our positive/negative words.
#     positive_stems = [p_stemmer.stem(i) for i in positive_words]
#     negative_stems = [p_stemmer.stem(i) for i in negative_words]

#     # Count "positive" words.
#     positive_count = sum([1 for i in stemmed_words if i in positive_stems])
    
#     # Count "negative" words
#     negative_count = sum([1 for i in stemmed_words if i in negative_stems])
    
#     # Calculate Sentiment Percentage 
#     # (Positive Count - Negative Count) / (Total Count)

#     return round((positive_count - negative_count) / len(tokens), 2)

In [82]:
# Preview the text we will run our sentiment analyzer on.
text

'Merck (NYSE: MRK), known as MSD outside of the United States and Canada, today announced the completion of the acquisition of Harpoon Therapeutics, Inc. (Nasdaq: HARP). Harpoon is now a wholly-owned subsidiary of Merck, and Harpoon’s common stock will no longer be publicly traded or listed on the Nasdaq Stock Market. Harpoon’s lead candidate, MK-6070 (formerly known as HPN328), is a T-cell engager targeting delta-like ligand 3 (DLL3), an inhibitory canonical Notch ligand that is expressed at high levels in small cell lung cancer (SCLC) and neuroendocrine tumors. The safety, tolerability and pharmacokinetics of MK-6070 is currently being evaluated as monotherapy in a Phase 1/2 clinical trial (NCT04471727) in certain patients with advanced cancers associated with expression of DLL3. The study is also evaluating MK-6070 in combination with atezolizumab in certain patients with SCLC. In March 2022, the U.S. Food and Drug Administration (FDA) granted Orphan Drug Designation to MK-6070 for 

In [86]:
# simple_sentiment(text)

<details><summary> What are some shortcomings of this method? </summary>

- Primarily, we're limited to the positive/negative words we came up with.
- If someone wrote "not good" or "not bad," our sentiment function would probably treat "not good" as positive or neutral... but it's probably supposed to mean negative!
- The ordering of the words doesn't matter here, which is not how language generally works.
- We haven't corrected for misspellings.
</details>
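The "not good" shortcoming is easy to reproduce. Below is a stripped-down sketch of the word-counting idea, using only the standard library and no stemming; the shortened word lists, the function name `count_based_sentiment`, and the test sentences are all assumptions made for this illustration.

```python
import re

# Small word lists borrowed from the positive/negative lists above.
positive_words = {'good', 'great', 'awesome'}
negative_words = {'bad', 'terrible', 'awful'}

def count_based_sentiment(text):
    """Return (positive count - negative count) / total tokens, or 0.0 for empty text."""
    tokens = re.findall(r'\w+', text.lower())
    if not tokens:
        return 0.0
    positive_count = sum(token in positive_words for token in tokens)
    negative_count = sum(token in negative_words for token in tokens)
    return round((positive_count - negative_count) / len(tokens), 2)

print(count_based_sentiment('This drug is bad'))
print(count_based_sentiment('This drug is not good'))
```

The second sentence gets a positive score even though it expresses a negative opinion, because the counter sees "good" and has no notion of negation or word order.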

There are a couple of ways to proceed with sentiment analysis:

1. If you have already-labeled data, you can build a supervised learning model.
2. If you don't have labeled data, you can use a Lexicon that has already been built/trained for sentiment analysis.
    - There are a bunch of these and which to use depends on your purpose/data. Here are just a few that are available:
        - AFINN lexicon
        - MPQA subjectivity lexicon
        - SentiWordNet
        - VADER lexicon

We will use the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon to analyze the sentiment of our text!

In [56]:
# nltk.download('vader_lexicon')  # Already run near the top of the notebook; requires `import nltk`.

In [55]:
# Instantiate Sentiment Intensity Analyzer (requires the 'vader_lexicon' resource downloaded above).
sent = SentimentIntensityAnalyzer()

# To score a DataFrame column instead (`df_news_articles` is not defined in this notebook):
# df_news_articles['sentiment_column'] = df_news_articles['text'].apply(sent.polarity_scores)


In [52]:
# Calculate sentiment of text.
# polarity_scores returns a dict with 'neg', 'neu', 'pos', and 'compound' scores.
sent.polarity_scores(text)

A couple things to note:
1. NLP broadly describes: 
    - how we can get unstructured text data into a more structured form that can be interpreted by computers, and 
    - algorithms for interpreting text data.
2. The tools we used today do not exclude other methods. You can and should include other variables in your model!
    - For example, maybe the length of a review tells us something about how much people liked/disliked the movie, or maybe additional information about the reviewer (e.g., geography, age, how many reviews they had submitted) has predictive value.