<a href="https://colab.research.google.com/github/burrittresearch/natural-language-processing/blob/main/nlp-notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing (NLP) Notes

This notebook contains code and notes on Natural Language Processing (NLP) with spaCy.

Data Source: https://www.gutenberg.org

# Project Workflow
* Define the Problem
* Process Data
* NLP Notes

# Define the Problem
Create notes for NLP

# Process Data

In [1]:
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import spacy


In [2]:
# Set display options
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)
pd.set_option('display.colheader_justify', 'left')
pd.set_option('display.precision', 3)

# Line break utility
str_lb = '\n \n'

In [3]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# spaCy Notes

In [4]:
# Load the small English pipeline; spacy.load() returns a Language object
nlp = spacy.load('en_core_web_sm')
nlp

<spacy.lang.en.English at 0x7fadf63c7400>

In [5]:
# Data type of the variable
type(nlp)

spacy.lang.en.English

In [6]:
# Make str object about spacy
str_about = 'this is about spacy nlp'
print(str_about)
print(type(str_about))

# Process the string with the pipeline; nlp() returns a Doc (a sequence of tokens)
token_str = nlp(str_about)
print(token_str)
print(type(token_str))

# Number of tokens in the Doc
print(len(token_str))


this is about spacy nlp
<class 'str'>
this is about spacy nlp
<class 'spacy.tokens.doc.Doc'>
5


In [7]:
# Make list of Token objects by iterating over the Doc
lst_token = []
for token in token_str:
  lst_token.append(token)

lst_token

[this, is, about, spacy, nlp]

In [8]:
# Use a list comprehension to do the same thing
# More Pythonic than an explicit loop
lst_token = [token for token in token_str]
lst_token

[this, is, about, spacy, nlp]

In [9]:
# Make str object of sentences
str_sent = ('This is sentence 1'
  ' and this is sentence 1 continuing. But'
  ' now this is sentence 2.'
)
print(str_sent)
print(type(str_sent))

# Process the sentences string; nlp() returns a Doc
token_str_sent = nlp(str_sent)
print(token_str_sent)
print(type(token_str_sent))

# Number of tokens in the Doc (punctuation marks count as separate tokens)
print(len(token_str_sent))


This is sentence 1 and this is sentence 1 continuing. But now this is sentence 2.
<class 'str'>
This is sentence 1 and this is sentence 1 continuing. But now this is sentence 2.
<class 'spacy.tokens.doc.Doc'>
18
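
The count of 18 is higher than a plain whitespace split would give, because the two periods are separate tokens. The sketch below approximates this with a regular expression from the standard library. The pattern and the name `lst_approx_tokens` are illustrative assumptions, not spaCy's actual tokenizer rules, which handle many more cases.

```python
import re

str_sent = ('This is sentence 1'
  ' and this is sentence 1 continuing. But'
  ' now this is sentence 2.'
)

# Whitespace split keeps each period attached to the preceding word
lst_whitespace = str_sent.split()

# Rough approximation: runs of word characters, or single punctuation marks
lst_approx_tokens = re.findall(r'\w+|[^\w\s]', str_sent)

print(len(lst_whitespace))      # 16
print(len(lst_approx_tokens))   # 18
```

For this simple string the approximation agrees with the spaCy token count shown above; on real text the two can diverge.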


In [10]:
# Make list of sentence Span objects using the .sents property
lst_span_str_sent = list(token_str_sent.sents)
print(lst_span_str_sent)
print(type(lst_span_str_sent))

# Verify the number of sentences in the list
print(len(lst_span_str_sent))

[This is sentence 1 and this is sentence 1 continuing., But now this is sentence 2.]
<class 'list'>
2


In [11]:
# Print each sentence
for sentence in lst_span_str_sent:
  print(sentence)

This is sentence 1 and this is sentence 1 continuing.
But now this is sentence 2.


In [12]:
# Print the first three tokens of each sentence
for sentence in lst_span_str_sent:
  print(f'{sentence[:3]}')

This is sentence
But now this


In [13]:
# Print each token with its character offset (token.idx) in the original string
for token in token_str:
  print(token, token.idx)


this 0
is 5
about 8
spacy 14
nlp 20
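
The numbers printed above are character offsets into the original string, not token positions. A minimal sketch in plain Python reproduces them for this string by locating each whitespace-separated word. It assumes simple space-separated text; the name `lst_offsets` is illustrative and spaCy computes `token.idx` internally rather than this way.

```python
str_about = 'this is about spacy nlp'

lst_offsets = []
pos = 0
for word in str_about.split():
  # Search from the end of the previous word so repeated substrings are not matched early
  start = str_about.index(word, pos)
  lst_offsets.append((word, start))
  pos = start + len(word)

for word, start in lst_offsets:
  print(word, start)
```

Slicing the original string with an offset, for example `str_about[14:19]`, returns the matching word, which is what makes these offsets useful for mapping tokens back to the source text.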


# Use spaCy with Romeo and Juliet from Project Gutenberg

In [14]:
# Download book from Project Gutenberg and save locally
# -O sets the full output path, so no -P directory prefix is needed
!wget \
-O '/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt' \
https://www.gutenberg.org/cache/epub/1513/pg1513.txt


--2023-09-03 18:43:30--  https://www.gutenberg.org/cache/epub/1513/pg1513.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 169486 (166K) [text/plain]
Saving to: ‘/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt’


2023-09-03 18:43:34 (405 KB/s) - ‘/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt’ saved [169486/169486]



In [15]:
# Open the file with a context manager (closes it automatically) and read into string
str_path_romeo = '/content/drive/MyDrive/Colab Notebooks/input/romeo-juliet.txt'
with open(str_path_romeo, 'r', encoding='utf-8') as file_romeo:
  str_romeo = file_romeo.read()

# Explore string
print(type(str_romeo))

<class 'str'>
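
The downloaded file appears to start with a UTF-8 byte order mark (BOM), which ends up as an invisible character at the start of the string. A minimal sketch of one way to drop it is to read with the `utf-8-sig` codec. The temporary file and the names `sample_path`, `str_with_bom`, and `str_without_bom` are hypothetical stand-ins for the real book file.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
  # Hypothetical sample file that begins with the UTF-8 BOM bytes
  sample_path = Path(tmp_dir) / 'sample.txt'
  sample_path.write_bytes(b'\xef\xbb\xbfThe Project Gutenberg eBook')

  # Plain utf-8 keeps the BOM as the character U+FEFF
  with open(sample_path, 'r', encoding='utf-8') as f:
    str_with_bom = f.read()

  # utf-8-sig removes a leading BOM if one is present
  with open(sample_path, 'r', encoding='utf-8-sig') as f:
    str_without_bom = f.read()

print(str_with_bom.startswith('\ufeff'))
print(str_without_bom.startswith('The'))
```

Note that removing the BOM changes the first token, so token counts from a BOM-free read may differ slightly from the outputs recorded in this notebook.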


In [16]:
# Process the full text; nlp() returns a Doc
token_str_romeo = nlp(str_romeo)

# Print the first 10 tokens
print(token_str_romeo[:10])

# Explore tokens
print(type(token_str_romeo))
print(len(token_str_romeo))

﻿The Project Gutenberg eBook of Romeo and Juliet
    
This
<class 'spacy.tokens.doc.Doc'>
41485
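
The first 10 tokens above include line breaks, because spaCy keeps runs of whitespace other than single spaces as tokens of their own. The sketch below filters them out with the `is_space` and `is_punct` token attributes. It assumes a blank English pipeline (`spacy.blank('en')`, tokenizer only) is an acceptable stand-in for `en_core_web_sm` here; the sample text and the names `nlp_blank`, `doc_sample`, `lst_all`, and `lst_words` are illustrative.

```python
import spacy

# Blank English pipeline: tokenizer only, no model download required (assumption)
nlp_blank = spacy.blank('en')

# Short sample with a blank line, similar to the start of the book
doc_sample = nlp_blank('Romeo and Juliet\n\nThis ebook is free.')

# All tokens, including whitespace and punctuation
lst_all = [token.text for token in doc_sample]

# Keep only tokens that are neither whitespace nor punctuation
lst_words = [
  token.text for token in doc_sample
  if not token.is_space and not token.is_punct
]

print(len(lst_all), len(lst_words))
print(lst_words)
```

Applied to `token_str_romeo`, the same filter would give a word count lower than the 41485 tokens reported above.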


In [17]:
# Make list of sentence Span objects using the .sents property
lst_span_token_str_romeo = list(token_str_romeo.sents)

# Print the first 10 sentences in the list
print(lst_span_token_str_romeo[:10])

# Explore list
print(type(lst_span_token_str_romeo))
print(len(lst_span_token_str_romeo))


[﻿The Project Gutenberg eBook of Romeo and Juliet
    
, This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever., You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org., If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

, Title: Romeo and Juliet


Author: William Shakespeare

Release date: November 1, 1998 [eBook #1513]
                Most recently updated: June 27, 2023

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***



THE TRAGEDY OF ROMEO AND JULIET

by William Shakespeare




Contents

THE PROLOGUE.

, ACT I
Scene I. A public place.
, Scene II., A Street.
, Scene III., Room in Capulet’s House.
]
<class 'list'>
2841
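
The first sentences above are Project Gutenberg licence and title text rather than the play, so they are included in the 2841 count. A minimal sketch of one way to trim this before calling `nlp()` is shown below. The helper `strip_gutenberg_boilerplate` is hypothetical, the start marker is taken from the output above, and the end marker is an assumption about how the file closes.

```python
def strip_gutenberg_boilerplate(text):
  """Return the text between the Gutenberg START and END marker lines (hypothetical helper)."""
  start_marker = '*** START OF THE PROJECT GUTENBERG EBOOK'
  end_marker = '*** END OF THE PROJECT GUTENBERG EBOOK'  # assumed closing marker

  start = text.find(start_marker)
  if start != -1:
    # Drop everything up to and including the rest of the marker line
    line_end = text.find('\n', start)
    text = text[line_end + 1:] if line_end != -1 else ''

  end = text.find(end_marker)
  if end != -1:
    # Drop the end marker and everything after it
    text = text[:end]

  return text.strip()

# Small made-up sample standing in for the full book text
str_sample = (
  'The Project Gutenberg eBook of Romeo and Juliet\n'
  '*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***\n'
  'THE TRAGEDY OF ROMEO AND JULIET\n'
  '*** END OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***\n'
  'Licence text'
)
print(strip_gutenberg_boilerplate(str_sample))
```

If neither marker is found, the helper returns the input text unchanged apart from stripped leading and trailing whitespace.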
