![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-science-and-artificial-intelligence&branch=main&subPath=08-natural-language-processing.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Natural Language Processing

[![what is NLP](https://img.youtube.com/vi/oi0JXuL19TA/0.jpg)](https://www.youtube.com/watch?v=oi0JXuL19TA)


One type of [discriminative](https://en.wikipedia.org/wiki/Discriminative_model) [artificial intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence) is [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing).

Natural language processing (NLP) helps computers understand human languages. It allows computers to read and analyze text, understand speech, and even generate human-like responses. With NLP, computers can translate languages, summarize articles, and even recognize emotions in text or speech. NLP has become an important technology in many areas, including social media, healthcare, and education. It helps people communicate more effectively with computers, making it easier for them to get information and complete tasks.

In this notebook we will use NLP to analyze the text of a book. We'll choose a book that is in the [public domain](https://en.wikipedia.org/wiki/Public_domain), meaning there are no restrictions on how it can be used.

[Project Gutenberg](https://www.gutenberg.org/) is a great source for public domain books. We will want to use the `Plain Text UTF-8` version of a book. We'll start with [The Call of the Wild by Jack London](https://www.gutenberg.org/ebooks/215).

<img src="https://www.gutenberg.org/files/215/215-h/images/cover.jpg" 
     align="center" 
     width="200" />

In [2]:
book_url = 'https://www.gutenberg.org/cache/epub/215/pg215.txt'

import requests
r = requests.get(book_url)
r.encoding = 'utf-8'
data = r.text
data = data.replace("’","'").replace("“",'"').replace("”",'"') # replace smart quotes with regular quotation marks
print(data)

﻿The Project Gutenberg eBook of The Call of the Wild, by Jack London

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Call of the Wild

Author: Jack London

Release Date: February 1995 [eBook #215]
[Most recently updated: March 27, 2023]

Language: English

Produced by: Ryan, Kirstin, Linda and Rick Trapp and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK THE CALL OF THE WILD ***




cover 




The Call of the Wild

by Jack London




Contents

 Chapter I. Into the Primitive
 Chapter II. The Law of Club and Fang
 Chapter III. The Dominant Primordial Beast
 Chapter IV. Who Has Wo

Now we have the full text of the book stored in the variable `data`, but there is some extra content at the start and end that we want to remove.

In [None]:
data = data.split('Contents', 1)[1] # remove everything before the table of contents
data = data.split('*** END OF THE PROJECT GUTENBERG EBOOK')[0] # remove the license text after the end marker, which is written in the same style as the START marker above
data

This book uses the word `Chapter` to show the start of each chapter, so we can split it into a list of chapters.

The text also contains line breaks (`\r\n`), which we can replace with spaces. We can also strip any leading or trailing spaces from each chapter.

In [None]:
chapters = data.split('Chapter ')

chapters = [chapter.replace('\r\n',' ') for chapter in chapters]
chapters = [chapter.strip() for chapter in chapters]

chapters
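
To see why the table of contents ends up in this list, here is a minimal sketch on a tiny made-up text. The `sample` string and the `pieces` variable are invented for illustration and are not part of the book.

```python
# A tiny made-up text: a two-line table of contents followed by two short "chapters"
sample = ' Chapter I. One\r\n Chapter II. Two\r\n\r\nChapter I. One\r\nFirst line.\r\nSecond line.\r\n\r\nChapter II. Two\r\nThird line.'

# Splitting on 'Chapter ' gives one piece per occurrence,
# plus whatever came before the first occurrence
pieces = sample.split('Chapter ')

# Replace the line breaks with spaces, then trim the ends of each piece
pieces = [piece.replace('\r\n', ' ').strip() for piece in pieces]

for number, piece in enumerate(pieces):
    print(number, repr(piece))
```

The first piece is empty, the next two are table of contents entries, and only the last two are actual chapters. The real list has the same shape, just with seven entries of each kind.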

***
## Listing Chapters ##
Now that we have cleaned up the text and split it into a list of chapters, we will create a dataframe of the chapters to make it easier to work with.

In [None]:
import pandas as pd
data = pd.DataFrame(chapters, columns=['Text'])
data

We notice that the first few rows aren't chapters but are just the table of contents. We want just the rows from `8` to the end.

In [None]:
data = data.iloc[8:].copy()
data

Let's also reindex the dataframe so that it starts from `0` instead of `8`.

In [None]:
data = data.reset_index(drop=True)
data

The `Text` column also includes the chapter titles, which we want to remove. We can use items `1` to `7` of our `chapters` list to get the chapter titles.

In [None]:
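data['Title'] = chapters[1:8]
for i in range(len(data)):
    # use .loc instead of chained indexing so the change is written to the dataframe itself,
    # and remove only the first occurrence of the title
    data.loc[i, 'Text'] = data.loc[i, 'Text'].replace(data.loc[i, 'Title'], '', 1).strip()
data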

## Natural Language Processing

Now we can finally do some NLP. We'll use the [spaCy](https://spacy.io) natural language processing library to process the text and create new columns in our dataframe containing the identified sentences and words. This will allow us to summarize the text and perform some other analysis.

The first few lines of the following code cell set up spaCy and define a `processLanguage` function. It may take a little while to run.

In [None]:
try:
    import en_core_web_sm
    nlp = en_core_web_sm.load()
except ImportError: # the language model is not installed yet
    !pip install spacy --user
    !python -m spacy download en_core_web_sm
    import en_core_web_sm
    nlp = en_core_web_sm.load()

def processLanguage(chapter):
    processed = nlp(chapter)
    sentences = list(processed.sents)
    words = [] # create an empty list
    for token in processed:
        if token.is_alpha: # if the token is a word
            words.append(token)
    return sentences, words

data['Sentences'], data['NLP'] = zip(*data['Text'].apply(processLanguage))
data
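
The `zip(*...)` in the last line of that cell can look mysterious, so here is a rough sketch of what it does. The `split_text` function is a hypothetical stand-in for `processLanguage` that uses plain string methods instead of spaCy, and the example texts are made up.

```python
def split_text(text):
    """Hypothetical stand-in for processLanguage: returns (sentences, words)."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    words = [w for w in text.replace('.', ' ').split() if w.isalpha()]
    return sentences, words

texts = ['Buck ran. Buck slept.', 'Snow fell.']

# Each call returns a (sentences, words) pair, so we get a list of pairs
pairs = [split_text(text) for text in texts]

# zip(*pairs) regroups the pairs: one tuple holding every list of sentences,
# and one tuple holding every list of words
all_sentences, all_words = zip(*pairs)

print(all_sentences)
print(all_words)
```

In the notebook the two resulting tuples become the `Sentences` and `NLP` columns, with one entry per chapter.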

### Word Counts

Since the new `NLP` column is a list of the words in a chapter, we can visualize the number of words per chapter.

Remember that dataframes use zero-based indexing, so chapters 1 to 7 are labelled as 0 to 6.

In [None]:
data['WordCount'] = data['NLP'].apply(len) # number of words in each chapter

import plotly.express as px
px.bar(data, y='WordCount', title='Word Count by Chapter').update_xaxes(title='Chapter')

### Summarizing the Text

#### Stop Words

To summarize each chapter, we'll start by removing any [stop words](https://en.wikipedia.org/wiki/Stop_word). Stop words are words such as `a`, `it`, `the`, or `and` that don't really add any meaning.

Then, if we find the frequencies of the significant words in each chapter, we will be able to weight each sentence based on how many of those words it contains.

In [None]:
def removeStopWords(chapter):
    # keep only the tokens that spaCy does not flag as stop words
    return [word for word in chapter if not word.is_stop]
data['SignificantWords'] = data['NLP'].apply(removeStopWords)

from collections import Counter
def wordFrequencyCounter(chapter):
    words = [word.text for word in chapter]
    word_frequencies = Counter(words).most_common()
    max_frequency = word_frequencies[0][1]
    for w in range(len(word_frequencies)):
        word_frequencies[w] = (word_frequencies[w][0], word_frequencies[w][1]/max_frequency) # normalize the word counts to values between 0 and 1
    return word_frequencies

data['SignificantWordFrequencies'] = data['SignificantWords'].apply(wordFrequencyCounter)
print('Our data columns are now:', data.columns)
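
As a small worked example of the normalization step, here is the same calculation on a made-up list of words (the `words` list is invented for illustration).

```python
from collections import Counter

# Made-up list of significant words from one short "chapter"
words = ['dog', 'sled', 'dog', 'snow', 'dog', 'sled']

# most_common() sorts the (word, count) pairs from most to least frequent
counts = Counter(words).most_common()
max_count = counts[0][1]

# Divide every count by the largest count so the most frequent word scores exactly 1
normalized = [(word, count / max_count) for word, count in counts]
print(normalized)
```

Here `dog` appears 3 times and scores 1, `sled` appears twice and scores 2/3, and `snow` appears once and scores 1/3.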

#### Creating Chapter Summaries

First we will weight the sentences by the frequencies of the significant words, then we'll record the 5% of the sentences that have the highest scores. Again, this might take a minute.

In [None]:
from heapq import nlargest
summaries = []
for i in range(len(data)):
    sentences = data['Sentences'][i]
    number_to_output = int(len(sentences)*0.05) # 5% of the sentences
    word_frequencies = dict(data['SignificantWordFrequencies'][i]) # a dictionary lets us look up each word's frequency directly
    sentence_scores = {}
    for sentence in sentences:
        sentence_score = 0
        for word in sentence:
            if word.pos_ != 'PROPN': # if it is not a proper noun
                sentence_score += word_frequencies.get(word.text, 0) # words that are not significant add nothing
        sentence_scores[sentence] = sentence_score
    summary = nlargest(number_to_output, sentence_scores, key=sentence_scores.get)
    summaries.append(summary)
data['Summary'] = summaries
print('Summarization complete')
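
To make the scoring idea concrete, here is a minimal sketch with made-up frequencies and sentences. It leaves out the proper noun check and uses plain lists of words rather than spaCy sentences, so it is a simplification of the cell above rather than a copy of it.

```python
from heapq import nlargest

# Made-up normalized word frequencies and sentences (already split into words)
frequencies = {'dog': 1.0, 'sled': 0.5, 'snow': 0.25}
sentences = [
    ['the', 'dog', 'pulled', 'the', 'sled'],
    ['it', 'was', 'cold'],
    ['snow', 'covered', 'the', 'dog', 'and', 'the', 'sled', 'dog'],
]

# A sentence's score is the sum of the frequencies of the significant words in it
scores = {}
for index, sentence in enumerate(sentences):
    scores[index] = sum(frequencies.get(word, 0) for word in sentence)

# Pick the index of the single highest-scoring sentence
top = nlargest(1, scores, key=scores.get)
print(scores)
print(top)
```

Because a score is a sum, longer sentences tend to score higher. Also note that `int(len(sentences)*0.05)` rounds down, so a chapter with fewer than 20 sentences would get an empty summary.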

Now we can see a basic chapter summary using the syntax `data['Summary'][x]` where `x` is the chapter index number (from 0 to 6).

In [None]:
data['Summary'][2]

## Sentiment Analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) categorizes text based on its tone (negative, neutral, or positive). For this we will use [VADER sentiment analysis from the Natural Language Toolkit](https://www.nltk.org/_modules/nltk/sentiment/vader.html).

Running the cell below will take a minute or two.

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

data['Sentiment'] = data['Text'].apply(lambda x: analyzer.polarity_scores(x))
data['PositiveSentiment'] = data['Sentiment'].str['pos']
data['NegativeSentiment'] = -data['Sentiment'].str['neg']

print('Our data columns are now:')
for column in data.columns:
    print(column)
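
Each entry in the `Sentiment` column is a dictionary with the keys `neg`, `neu`, `pos`, and `compound`. The sketch below uses invented numbers (not real results from the book) to show how the positive and negative values are pulled out, and why the negative values are flipped.

```python
# Invented scores, shaped like the dictionaries that polarity_scores returns
sentiments = [
    {'neg': 0.10, 'neu': 0.80, 'pos': 0.10, 'compound': 0.05},
    {'neg': 0.25, 'neu': 0.70, 'pos': 0.05, 'compound': -0.60},
]

# Pull one key out of every dictionary, like .str['pos'] does for a dataframe column
positive = [scores['pos'] for scores in sentiments]

# Negating the 'neg' values makes those bars point downward in a chart
negative = [-scores['neg'] for scores in sentiments]

print(positive)
print(negative)
```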

We can then create a visualization of the sentiment by chapter.

In [None]:
px.bar(data, y=['PositiveSentiment', 'NegativeSentiment'], title='Sentiment by Chapter').update_xaxes(title_text='Chapter')

### Parts of Speech

Next we'll identify the [parts of speech](https://spacy.io/api/annotation#pos-tagging) in the text to see how many verbs, adjectives, nouns, and proper nouns there are. We'll also calculate each one as a percentage of the chapter's word count so we can make some graphs.

In [None]:
parts_of_speech = {'Verbs':'VERB', 'Adjectives':'ADJ', 'Nouns':'NOUN', 'ProperNouns':'PROPN'}
for part in parts_of_speech:
    data[part] = data['NLP'].apply(lambda x: [token for token in x if token.pos_ == parts_of_speech[part]])
    data[part+'Count'] = data[part].apply(len)
    data[part+'%'] = data[part+'Count']/data['WordCount']*100

px.line(data, x=data.index, y=['Verbs%', 'Adjectives%', 'Nouns%', 'ProperNouns%'], title='Parts of Speech by Chapter').update_xaxes(title='Chapter').update_yaxes(title='Percentage')
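
The percentage calculation can be checked by hand on a small example. The `tagged` list below is made up and already labelled, whereas in the notebook spaCy assigns the tags.

```python
# Made-up (word, tag) pairs using the same tag names as the cell above
tagged = [('Buck', 'PROPN'), ('pulled', 'VERB'), ('the', 'DET'), ('heavy', 'ADJ'),
          ('sled', 'NOUN'), ('across', 'ADP'), ('snow', 'NOUN'), ('quickly', 'ADV')]

parts = {'Verbs': 'VERB', 'Adjectives': 'ADJ', 'Nouns': 'NOUN', 'ProperNouns': 'PROPN'}
word_count = len(tagged)

percentages = {}
for label, tag in parts.items():
    matches = [word for word, word_tag in tagged if word_tag == tag]
    percentages[label] = len(matches) / word_count * 100

print(percentages)
```

The four percentages do not add up to 100 because words with other tags, such as determiners and adverbs, are not counted in any of the four groups.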

#### Most Common Proper Nouns

The spaCy library does a fairly good job of identifying names, but you may see some false positives (words that aren't actually character names). We can look at the character names and how often they occur.

In [None]:
def nameCounter(row):
    return Counter([token.lemma_.strip() for token in row])
data['Names'] = data['ProperNouns'].apply(nameCounter)

name_counter = Counter()
for row in data['Names']:
    name_counter.update(row)
names_df = pd.DataFrame.from_dict(name_counter, orient='index', columns=['Count'])
names_df = names_df.sort_values('Count')
names_df = names_df[names_df['Count'] > 3] # only keep names that appear more than 3 times

px.bar(names_df, x='Count', y=names_df.index, orientation='h', title='Proper Noun Frequencies', height=1000).update_xaxes(title='Count')
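
Here is a short sketch of how `Counter.update` combines the per-chapter counts into book-wide totals. The names and numbers are invented for illustration.

```python
from collections import Counter

# Made-up name counts for three chapters
chapter_counts = [
    Counter({'Buck': 3, 'Spitz': 1}),
    Counter({'Buck': 2, 'Thornton': 4}),
    Counter({'Spitz': 2}),
]

totals = Counter()
for counts in chapter_counts:
    totals.update(counts)  # adds to the existing counts instead of overwriting them

# Keep only the names that appear more than 3 times in total
frequent = {name: count for name, count in totals.items() if count > 3}

print(totals)
print(frequent)
```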

#### Name Frequencies over Time

Since we have the text divided into chapters, let's visualize the mentions of the top `5` character names throughout the book.

In [None]:
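def nameCounter(name):
    # count how many times the given name appears in each chapter
    name_counts = []
    for row in data['ProperNouns']:
        n = 0
        for token in row:
            if token.lemma_.strip() == name: # compare lemmas, since that is how the names were counted above
                n += 1
        name_counts.append(n)
    return name_counts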

main_characters_list = names_df.tail(5).index.to_list()
main_characters = pd.DataFrame()
for character in main_characters_list:
    main_characters[character] = nameCounter(character)
    main_characters[character+'Total'] = main_characters[character].cumsum()
main_characters['Chapter'] = main_characters.index+1
px.line(main_characters, x='Chapter', y=main_characters_list, title='Main Character Frequencies').update_yaxes(title='Frequency').update_xaxes(title='Chapter').show()
px.line(main_characters, x='Chapter', y=[character+'Total' for character in main_characters_list], title='Cumulative Main Character Frequencies').update_yaxes(title='Cumulative Frequency').update_xaxes(title='Chapter').show()
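
The cumulative totals in the second chart are running sums of the per-chapter counts. As a minimal sketch with made-up numbers, the same idea can be written with `itertools.accumulate` from the standard library, which stands in here for the `cumsum` method used above.

```python
from itertools import accumulate

# Made-up number of mentions of one character in four chapters
mentions = [4, 0, 7, 2]

# Each entry is the total number of mentions up to and including that chapter
running_total = list(accumulate(mentions))

# Chapter numbers start at 1, even though list positions start at 0
chapter_numbers = list(range(1, len(mentions) + 1))

for chapter, count, total in zip(chapter_numbers, mentions, running_total):
    print(chapter, count, total)
```

A flat stretch in a cumulative line means the character was not mentioned in those chapters.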

---

<span style="color:#663399">Your **assignment** is to add at least two of these visualizations to a document and write about what you notice in them.</span>

---

That is a basic introduction to discriminative artificial intelligence and natural language processing. The [next notebook](09-generative-ai.ipynb) will introduce generative artificial intelligence.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)