<a href="https://colab.research.google.com/github/burrittresearch/natural-language-processing/blob/main/nlp-notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing (NLP) Notes

This notebook contains code and notes on Natural Language Processing (NLP).

Data Source: https://www.gutenberg.org

# Project Workflow
* Define the Problem
* Process Data
* NLP Notes

# Define the Problem
Create reference notes on NLP with spaCy.

# Process Data

In [1]:
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import spacy


In [2]:
# Set display options
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)
pd.set_option('display.colheader_justify', 'left')
pd.set_option('display.precision', 3)

# Line break utility
str_lb = '\n \n'

In [3]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# spaCy Notes

In [4]:
# Load the small English pipeline; spacy.load() returns a Language object
nlp = spacy.load('en_core_web_sm')
nlp

<spacy.lang.en.English at 0x7a83787e6fb0>

In [5]:
# Data type of the variable
type(nlp)

spacy.lang.en.English

In [6]:
# Make string object
str_about = ('This is sentence 1'
  ' and this is sentence 1 continuing. But'
  ' now this is sentence 2.'
)

# Explore string object
print(type(str_about))
print(str_about)

# Run the pipeline on the string; nlp() returns a Doc, a sequence of tokens
token_about = nlp(str_about)

# Explore the Doc
print(token_about)
print(type(token_about))
print(len(token_about))

# Print each token with its character offset (token.idx) in the original string
for token in token_about:
  print(token, token.idx)

<class 'str'>
This is sentence 1 and this is sentence 1 continuing. But now this is sentence 2.
This is sentence 1 and this is sentence 1 continuing. But now this is sentence 2.
<class 'spacy.tokens.doc.Doc'>
18
This 0
is 5
sentence 8
1 17
and 19
this 23
is 28
sentence 31
1 40
continuing 42
. 52
But 54
now 58
this 62
is 67
sentence 70
2 79
. 80
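
The number printed next to each token is a character offset into the original string, not the token's position in the sequence. The sketch below illustrates the idea without spaCy: a simple regular expression stands in for the tokenizer (an assumption for illustration; it is not how spaCy tokenizes), and each match's start position plays the role of `token.idx`.

```python
import re

str_about = ('This is sentence 1'
  ' and this is sentence 1 continuing. But'
  ' now this is sentence 2.'
)

# Stand-in tokenizer (not spaCy's): runs of word characters, or single punctuation marks
lst_offsets = [(match.group(), match.start())
               for match in re.finditer(r'\w+|[^\w\s]', str_about)]

# Slicing the original string at an offset recovers the token text
for str_text, int_idx in lst_offsets:
  assert str_about[int_idx:int_idx + len(str_text)] == str_text

print(lst_offsets[:4])
```

With a spaCy `Doc`, the equivalent check would be `str_about[token.idx:token.idx + len(token)] == token.text`.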


In [7]:
# Make list of token objects
lst_token_about = [token for token in token_about]
lst_token_about

[This,
 is,
 sentence,
 1,
 and,
 this,
 is,
 sentence,
 1,
 continuing,
 .,
 But,
 now,
 this,
 is,
 sentence,
 2,
 .]

In [8]:
# Get sentence spans via the .sents property, which returns a generator of Span objects
token_about_sents = token_about.sents

# Explore the generator
print(type(token_about_sents))
print(token_about_sents)

# Materialize the generator into a list of sentence spans
lst_token_about_sents = list(token_about_sents)
print(lst_token_about_sents)

# Explore the list of sentence spans
print(type(lst_token_about_sents))
print(len(lst_token_about_sents))

<class 'generator'>
<generator object at 0x7a823d790680>
[This is sentence 1 and this is sentence 1 continuing., But now this is sentence 2.]
<class 'list'>
2
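
Because `.sents` returns a generator, the variable holding it can be walked only once; after it has been turned into a list, `token_about_sents` is exhausted. A minimal sketch of that behavior with a plain Python generator (the `iter_sentences` helper is hypothetical and splits naively on periods, unlike spaCy):

```python
def iter_sentences(str_text):
  """Yield rough sentences by splitting on '. ' (a naive stand-in for doc.sents)."""
  for str_part in str_text.split('. '):
    yield str_part

gen_sents = iter_sentences('This is sentence 1. But now this is sentence 2.')

lst_first_pass = list(gen_sents)   # consumes the generator
lst_second_pass = list(gen_sents)  # nothing left to yield

print(len(lst_first_pass), len(lst_second_pass))
```

Keeping the list, as the cell above does, leaves the sentences available for repeated use.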


In [9]:
# Print each sentence in sentence list
for sentence in lst_token_about_sents:
  print(sentence)

This is sentence 1 and this is sentence 1 continuing.
But now this is sentence 2.


In [10]:
# Print the first three tokens of each sentence
for sentence in lst_token_about_sents:
  print(sentence[:3])

This is sentence
But now this


# Use spaCy with Romeo and Juliet from Project Gutenberg

In [11]:
# Download book from Project Gutenberg and save locally
# (-O sets the full output path, so the -P directory prefix is not needed)
!wget -O '/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt' \
https://www.gutenberg.org/cache/epub/1513/pg1513.txt


--2023-09-03 21:14:52--  https://www.gutenberg.org/cache/epub/1513/pg1513.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 169486 (166K) [text/plain]
Saving to: ‘/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt’


2023-09-03 21:14:55 (1.57 MB/s) - ‘/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt’ saved [169486/169486]
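
Re-running the cell above downloads the book again each time. A minimal sketch of a pure-Python alternative that skips the download when the file already exists; `download_if_missing` is a hypothetical helper, not part of this notebook, built on the standard library's `urllib.request.urlretrieve`:

```python
from pathlib import Path
from urllib.request import urlretrieve

def download_if_missing(str_url, str_path):
  """Download str_url to str_path unless the file exists; return True if downloaded."""
  path_out = Path(str_path)
  if path_out.exists():
    return False
  # Create the target folder if needed, then fetch the file
  path_out.parent.mkdir(parents=True, exist_ok=True)
  urlretrieve(str_url, str(path_out))
  return True

# Example call (needs network access and a mounted Drive, so it is left commented out):
# download_if_missing(
#   'https://www.gutenberg.org/cache/epub/1513/pg1513.txt',
#   '/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt')
```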



In [12]:
# Read the file into a string; the context manager closes the file handle
str_path_romeo = '/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt'
with open(str_path_romeo, 'r', encoding='utf-8') as file_romeo:
  str_romeo = file_romeo.read()

# Explore string object
print(type(str_romeo))
print(str_romeo[:100])

<class 'str'>
﻿The Project Gutenberg eBook of Romeo and Juliet
    
This ebook is for the use of anyone anywhere i
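
The first printed line appears to begin with an invisible byte order mark (BOM, `\ufeff`), which then stays attached to the first token when the text is tokenized. One option, sketched here on a small temporary file rather than the notebook's data, is to open the file with the `utf-8-sig` codec, which drops a leading BOM if present:

```python
import os
import tempfile

# Write a small sample file that starts with a UTF-8 BOM (sample text, not the real book)
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as file_tmp:
  file_tmp.write('\ufeffThe Project Gutenberg eBook'.encode('utf-8'))
  str_path_tmp = file_tmp.name

# Plain utf-8 keeps the BOM as the first character
with open(str_path_tmp, 'r', encoding='utf-8') as file_in:
  str_with_bom = file_in.read()

# utf-8-sig strips a leading BOM if there is one
with open(str_path_tmp, 'r', encoding='utf-8-sig') as file_in:
  str_without_bom = file_in.read()

os.remove(str_path_tmp)

print(repr(str_with_bom[:4]))
print(repr(str_without_bom[:4]))
```

Removing the BOM shortens the string by one character, so every `token.idx` value would be one lower than in the outputs shown in this notebook.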


In [13]:
# Run the pipeline on the full text; nlp() returns a Doc
token_romeo = nlp(str_romeo)

# Explore the Doc
print(token_romeo.text[:100])
print(type(token_romeo))
print(len(token_romeo))

# Print tokens that start within the first 100 characters, with their offsets
for token in token_romeo:
  if token.idx > 100:
    break
  print(token, token.idx)

﻿The Project Gutenberg eBook of Romeo and Juliet
    
This ebook is for the use of anyone anywhere i
<class 'spacy.tokens.doc.Doc'>
41485
﻿The 0
Project 5
Gutenberg 13
eBook 23
of 29
Romeo 32
and 38
Juliet 42

    
 48
This 54
ebook 59
is 65
for 68
the 72
use 76
of 80
anyone 83
anywhere 90
in 99


In [14]:
# Make list of token objects
lst_token_romeo = [token for token in token_romeo]
print(lst_token_romeo[:10])

[﻿The, Project, Gutenberg, eBook, of, Romeo, and, Juliet, 
    
, This]


In [15]:
# Get sentence spans via the .sents property, which returns a generator of Span objects
token_romeo_sents = token_romeo.sents

# Explore the generator
print(type(token_romeo_sents))
print(token_romeo_sents)

# Materialize the generator into a list of sentence spans
lst_token_romeo_sents = list(token_romeo_sents)

# Explore the list of sentence spans
print(lst_token_romeo_sents[:10])
print(type(lst_token_romeo_sents))
print(len(lst_token_romeo_sents))


<class 'generator'>
<generator object at 0x7a823d790a40>
[﻿The Project Gutenberg eBook of Romeo and Juliet
    
, This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever., You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org., If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

, Title: Romeo and Juliet


Author: William Shakespeare

Release date: November 1, 1998 [eBook #1513]
                Most recently updated: June 27, 2023

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***



THE TRAGEDY OF ROMEO AND JULIET

by William Shakespeare




Contents

THE PROLOGUE.

, ACT I
Scene I. A public place.
, Scene II., A Street.
, Scene III., Room in Capulet’s House.
]
<class 'list'>
2841

In [16]:
# Print each sentence in sentence list
for sentence in lst_token_romeo_sents[:10]:
  print(sentence)

# Print the first three tokens of each sentence
for sentence in lst_token_romeo_sents[:10]:
  print(sentence[:3])

﻿The Project Gutenberg eBook of Romeo and Juliet
    

This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever.
You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org.
If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.


Title: Romeo and Juliet


Author: William Shakespeare

Release date: November 1, 1998 [eBook #1513]
                Most recently updated: June 27, 2023

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***



THE TRAGEDY OF ROMEO AND JULIET

by William Shakespeare




Contents

THE PROLOGUE.


ACT I
Scene I. A public place.

Scene II.
A Street.

Scene III.
Room in Capulet’s House.

﻿The Project Gutenberg
This ebook is
You may copy
If you are
Title: Romeo
ACT I

Sce
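
The first sentences above are Project Gutenberg front matter rather than the play. A minimal sketch of trimming it with plain string operations before calling `nlp()`: the start marker is taken from the output above, while the matching `*** END OF THE PROJECT GUTENBERG EBOOK` marker and the `strip_gutenberg_boilerplate` helper are assumptions for illustration. The sample text is made up so the block runs on its own.

```python
def strip_gutenberg_boilerplate(str_text,
                                str_start='*** START OF THE PROJECT GUTENBERG EBOOK',
                                str_end='*** END OF THE PROJECT GUTENBERG EBOOK'):
  """Return the text between the start and end marker lines, if they are found."""
  int_start = str_text.find(str_start)
  if int_start != -1:
    # Skip to the end of the marker line
    int_line_end = str_text.find('\n', int_start)
    str_text = str_text[int_line_end + 1:] if int_line_end != -1 else ''
  int_end = str_text.find(str_end)
  if int_end != -1:
    str_text = str_text[:int_end]
  return str_text.strip()

# Made-up sample that mimics the file layout (assumed end marker included)
str_sample = (
  'Title: Romeo and Juliet\n'
  '*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***\n'
  'THE TRAGEDY OF ROMEO AND JULIET\n'
  '*** END OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***\n'
  'License text\n'
)
print(strip_gutenberg_boilerplate(str_sample))
```

Text without either marker is returned unchanged apart from stripped outer whitespace, so the helper is safe to call on files that lack the markers.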

In [17]:
# Run the pipeline on the full text; nlp() returns a Doc
token_str_romeo = nlp(str_romeo)

# Print the first 10 tokens
print(f'{token_str_romeo[:10]}')

# Explore tokens
print(type(token_str_romeo))
print(len(token_str_romeo))

﻿The Project Gutenberg eBook of Romeo and Juliet
    
This
<class 'spacy.tokens.doc.Doc'>
41485


In [18]:
# Make a list of sentence Span objects using the .sents property
lst_span_token_str_romeo = list(token_str_romeo.sents)

# Print the first 10 sentences in the list
print(lst_span_token_str_romeo[:10])

# Explore list
print(type(lst_span_token_str_romeo))
print(len(lst_span_token_str_romeo))


[﻿The Project Gutenberg eBook of Romeo and Juliet
    
, This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever., You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org., If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

, Title: Romeo and Juliet


Author: William Shakespeare

Release date: November 1, 1998 [eBook #1513]
                Most recently updated: June 27, 2023

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***



THE TRAGEDY OF ROMEO AND JULIET

by William Shakespeare




Contents

THE PROLOGUE.

, ACT I
Scene I. A public place.
, Scene II., A Street.
, Scene III., Room in Capulet’s House.
]
<class 'list'>
2841
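
Several of the sentence spans above carry embedded newlines and runs of spaces from the source file's layout. A minimal sketch of tidying sentence text for display with plain string methods; it works on strings (for a spaCy `Span`, `span.text` would supply them), and the `normalize_whitespace` helper is hypothetical:

```python
def normalize_whitespace(str_text):
  """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
  return ' '.join(str_text.split())

# Sample sentence strings modeled on the output above
lst_raw = [
  'This ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world',
  'Title: Romeo and Juliet\n\n\nAuthor: William Shakespeare',
  '\n    \n',
]

# Normalize, then drop strings that were only whitespace
lst_clean = [normalize_whitespace(str_raw) for str_raw in lst_raw]
lst_clean = [str_clean for str_clean in lst_clean if str_clean]

for str_clean in lst_clean:
  print(str_clean)
```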
