## 0. Import Dependencies

In [1]:
import os
import requests

## 1. Download text from Gutenberg

Find the plain-text edition of Pride and Prejudice on https://www.gutenberg.org/. Use [requests.get()](https://www.w3schools.com/python/ref_requests_get.asp) to download its contents.

In [4]:
r = requests.get("https://www.gutenberg.org/cache/epub/1342/pg1342.txt")
r.raise_for_status()  # stop here if the download failed (4xx/5xx response)

Store the content retrieved into a PrideAndPrejudice.txt file.

In [5]:
book_name = "PrideAndPrejudice"
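# write as UTF-8 explicitly: the text contains non-ASCII characters such as curly quotes
with open('./' + book_name + '.txt', 'w', encoding='utf-8') as f:
    f.write(r.text)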
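
The download and save steps can be combined so that re-running the notebook does not fetch the book again. The following is a minimal sketch, not part of the original notebook: `fetch_book` is a hypothetical helper name, and the 30-second timeout is an arbitrary assumption.

```python
import os

def fetch_book(url, path, encoding="utf-8"):
    """Return the text stored at `path`, downloading it from `url` first if needed."""
    if not os.path.exists(path):
        # Imported lazily so that the cached path also works without network access.
        import requests
        r = requests.get(url, timeout=30)  # timeout value is an assumption
        r.raise_for_status()               # fail loudly instead of saving an error page
        with open(path, "w", encoding=encoding) as f:
            f.write(r.text)
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

Calling `fetch_book("https://www.gutenberg.org/cache/epub/1342/pg1342.txt", "PrideAndPrejudice.txt")` would download on the first run and read from disk afterwards.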

## 2. Divide Text into Paragraphs

Open the file you just created and store its contents in a variable `text`.

In [7]:
import pprint as pp

In [8]:
# open and read the text file (UTF-8, matching how it was written)
with open('PrideAndPrejudice.txt', 'r', encoding='utf-8') as text_file:
    text = text_file.read()
pp.pprint(text)

('\ufeffThe Project Gutenberg eBook of Pride and Prejudice\n'
 '    \n'
 'This ebook is for the use of anyone anywhere in the United States and\n'
 'most other parts of the world at no cost and with almost no restrictions\n'
 'whatsoever. You may copy it, give it away or re-use it under the terms\n'
 'of the Project Gutenberg License included with this ebook or online\n'
 'at www.gutenberg.org. If you are not located in the United States,\n'
 'you will have to check the laws of the country where you are located\n'
 'before using this eBook.\n'
 '\n'
 'Title: Pride and Prejudice\n'
 '\n'
 '\n'
 'Author: Jane Austen\n'
 '\n'
 'Release date: June 1, 1998 [eBook #1342]\n'
 '                Most recently updated: April 14, 2023\n'
 '\n'
 'Language: English\n'
 '\n'
 'Credits: Chuck Greif and the Online Distributed Proofreading Team at '
 'http://www.pgdp.net (This file was produced from images available at The '
 'Internet Archive)\n'
 '\n'
 '\n'
 '*** START OF THE PROJECT GUTENBERG EBOOK P
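
The `'\ufeff'` at the very start of the output is a byte order mark (BOM) that was stored in the file. One way to drop it is to read with Python's `utf-8-sig` codec, which removes a leading BOM while decoding. The sketch below uses a small temporary file as a stand-in for the book; the file name `bom_demo.txt` is made up for the demonstration.

```python
import os
import tempfile

# Write a tiny UTF-8 file that starts with a BOM, like the downloaded book.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\ufeffThe Project Gutenberg eBook")

with open(path, "r", encoding="utf-8") as f:
    raw = f.read()      # the BOM is kept as the first character

with open(path, "r", encoding="utf-8-sig") as f:
    clean = f.read()    # the BOM is removed while decoding

print(repr(raw[:4]), repr(clean[:4]))
```

If the text is already in memory, `text.lstrip('\ufeff')` has the same effect.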

Take a look at the string and figure out how to separate the text into paragraphs.  
Print the 1st paragraph.  
Print the first 2 paragraphs.

In [9]:
s = 'I_like_this'
s.split('_')

['I', 'like', 'this']

In [10]:
paragraphs = text.split('\n\n')

In [12]:
paragraphs[100:103]

['“At the door”                                                        194',
 '“In conversation with the ladies”                                    198',
 '“Lady Catherine,” said she, “you have given me a treasure”           200']
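
Splitting on `'\n\n'` leaves empty strings and stray leading newlines wherever the text has more than one blank line in a row. A slightly more forgiving variant is sketched below, assuming that a "paragraph" is any block of text separated by one or more blank lines; `split_paragraphs` is a hypothetical helper, and the sample string is made up from header lines shown earlier.

```python
import re

def split_paragraphs(text):
    """Split on one or more blank lines and drop empty pieces."""
    text = text.replace("\r\n", "\n")      # normalise Windows line endings
    pieces = re.split(r"\n\s*\n", text)    # a "blank" line may still contain spaces
    return [p.strip() for p in pieces if p.strip()]

sample = "Title: Pride and Prejudice\n\n\nAuthor: Jane Austen\n    \nLanguage: English\n"
paras = split_paragraphs(sample)

print(paras[0])    # the 1st paragraph
print(paras[:2])   # the first 2 paragraphs
```

Because empty pieces are dropped, indices produced this way will not match indices into a list built with `text.split('\n\n')`.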

## 3. Find and clean 1st Paragraph

Now find the first paragraph of the actual novel (as opposed to the text related to Project Gutenberg). Write a function called `find_first_para` that takes the list of paragraphs you just created as input and outputs the index of the first paragraph of the novel.  
Take a look at the list of paragraphs to figure out what criteria to use...

In [34]:
s = 'I_like_this'
l = s.split('_')
'_'.join(l)

'I_like_this'

In [29]:
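def find_first_para(paragraphs):
    # return the index of the paragraph that follows the "Chapter I." heading
    for para_index, para in enumerate(paragraphs):
        if "Chapter I." in para:
            # index the list `paragraphs`, not the string `para`:
            # `para[para_index+1]` raised "IndexError: string index out of range"
            print(paragraphs[para_index+1])
            return para_index+1

In [30]:
first_para_id = find_first_para(paragraphs)
print(first_para_id)
first_para = paragraphs[first_para_id]
first_para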
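
A search of this kind can be made a little more defensive by skipping blank paragraphs after the heading and returning `None` when the marker never appears. The following is only a sketch: `find_para_after` is a hypothetical helper, and the `toy` list is invented data standing in for the real `paragraphs` list.

```python
def find_para_after(paragraphs, marker):
    """Index of the first non-blank paragraph after the first one containing `marker`.

    Returns None if the marker is missing or nothing non-blank follows it.
    """
    for i, para in enumerate(paragraphs):
        if marker in para:
            for j in range(i + 1, len(paragraphs)):
                if paragraphs[j].strip():
                    return j
            return None
    return None

# Invented stand-in for the real list of paragraphs.
toy = ["License header", "Chapter I.]", "", "\nIt is a truth universally acknowledged"]
idx = find_para_after(toy, "Chapter I.")
print(idx, repr(toy[idx]))
```

Whether `"Chapter I."` is the right marker for the real file still has to be checked by inspecting the paragraphs, since the same string may also occur in a table of contents or a list of illustrations.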

Now write a function called `clean_first_para` that takes the paragraph as input and returns a cleaned version.
Eliminate any special characters such as `\n`.


In [47]:
def clean_first_para(first_para):
    new_first_para = ' '.join(first_para.split())
    return new_first_para

In [48]:
new_first_para = clean_first_para(first_para)
new_first_para

'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.'
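
The cleaning step relies on `str.split()` being called without arguments: it splits on any run of whitespace (spaces, tabs, newlines) and discards leading and trailing whitespace, so joining the pieces with a single space normalises the paragraph. A small illustration follows; the `raw` string is a made-up variant of the first paragraph with extra whitespace added.

```python
# Made-up input: the first paragraph with a newline, a tab and repeated spaces.
raw = ("\nIt is a truth universally acknowledged, that a single man in\n"
       "possession of a good fortune,\tmust be in   want of a wife.\n")

# split() with no argument splits on any whitespace run and drops empty pieces.
cleaned = " ".join(raw.split())
print(cleaned)
```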

## 4. Installing spaCy

In [35]:
!pip install spacy

Collecting spacy
  Obtaining dependency information for spacy from https://files.pythonhosted.org/packages/7b/f2/a919c2a5c0f0250070cc8f7336b814e5d356d90247c2507462b3baefcdf9/spacy-3.7.0-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading spacy-3.7.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (25 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Obtaining dependency information for spacy-loggers<2.0.0,>=1.0.0 from https://files.pythonhosted.org/packages/33/78/d1a1a026ef3af911159398c939b1509d5c36fe524c7b644f34a5146c4e16/spacy_loggers-1.0.5-py3-none-any.whl.metadata
  Using cached spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Obtaining dependency information for murmurhash<1.1.0,>=0.28.0 from https://files.pythonhosted.org/packages/7a/05/4a3b5c3043c6d84c00bf0f574d326660702b1c10174fe6b44cef3c3dff08/murmu

In [36]:
# download a model
!python -m spacy download 'en_core_web_sm'

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## 5. Using spaCy to Analyze Text
https://spacy.io/

In [37]:
import spacy

In [39]:
# instantiate the pipeline and run it on the 1st paragraph
nlp = spacy.load('en_core_web_sm')
# you can replace this with a different, longer text to experiment with spaCy's functionality
doc = nlp('It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.')

### 5.1 Tokenization
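Tokenization allows you to identify the basic units in your text.
These basic units are called tokens. For each token, `token.text` is the token's string and `token.idx` is the character offset at which it starts in the original text.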

In [54]:
for token in doc:
    print(token.text, token.idx)

It 0
is 3
a 6
truth 8
universally 14
acknowledged 26
, 38
that 40
a 45
single 47
man 54
in 58
possession 61
of 72
a 75
good 77
fortune 82
, 89
must 91
be 96
in 99
want 102
of 107
a 110
wife 112
. 116


In [40]:
# Same thing using a list comprehension
tokens = [(token.text, token.idx) for token in doc]
tokens

[('It', 0),
 ('is', 3),
 ('a', 6),
 ('truth', 8),
 ('universally', 14),
 ('acknowledged', 26),
 (',', 38),
 ('that', 40),
 ('a', 45),
 ('single', 47),
 ('man', 54),
 ('in', 58),
 ('possession', 61),
 ('of', 72),
 ('a', 75),
 ('good', 77),
 ('fortune', 82),
 (',', 89),
 ('must', 91),
 ('be', 96),
 ('in', 99),
 ('want', 102),
 ('of', 107),
 ('a', 110),
 ('wife', 112),
 ('.', 116)]
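
The second number in each pair is a character offset into the original string, not a token counter. This can be checked without spaCy by slicing the sentence at a few of the offsets printed above; `slice_at` is a hypothetical helper written for this check.

```python
sentence = ("It is a truth universally acknowledged, that a single man in "
            "possession of a good fortune, must be in want of a wife.")

# A few (text, idx) pairs copied from the output above.
pairs = [("It", 0), ("truth", 8), ("acknowledged", 26),
         (",", 38), ("wife", 112), (".", 116)]

def slice_at(text, start, token_text):
    """Return the slice of `text` that starts at `start` and is as long as `token_text`."""
    return text[start:start + len(token_text)]

recovered = [slice_at(sentence, idx, tok) for tok, idx in pairs]
print(recovered)
```

With a real `Doc`, the equivalent check would be `doc.text[token.idx:token.idx + len(token.text)] == token.text`.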

### 5.2 Lemmatization
Lemmatization reduces the inflected forms of a word to a single base form that is itself a valid word in the language. This base form is called a lemma.
For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.


In [41]:
for token in doc:
    print(token.text, token.lemma_)

It it
is be
a a
truth truth
universally universally
acknowledged acknowledge
, ,
that that
a a
single single
man man
in in
possession possession
of of
a a
good good
fortune fortune
, ,
must must
be be
in in
want want
of of
a a
wife wife
. .
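
One practical use of lemmas is grouping different surface forms of the same word. The sketch below works on a handful of (text, lemma) pairs copied from the output above, so it runs without spaCy; `group_by_lemma` is a hypothetical helper.

```python
from collections import defaultdict

# (text, lemma) pairs copied from the output above.
pairs = [("It", "it"), ("is", "be"), ("acknowledged", "acknowledge"),
         ("be", "be"), ("wife", "wife")]

def group_by_lemma(pairs):
    """Map each lemma to the set of lower-cased surface forms that share it."""
    groups = defaultdict(set)
    for text, lemma in pairs:
        groups[lemma].add(text.lower())
    return dict(groups)

groups = group_by_lemma(pairs)
print(sorted(groups["be"]))
```

Here "is" and "be" end up in the same group because both have the lemma `be`. With a real `Doc`, the input could be `((t.text, t.lemma_) for t in doc)`.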


### 5.3 POS (Part of Speech) Tagging
Here, two attributes of the Token class are accessed:

* `tag_` lists the fine-grained part of speech. See the tag definitions [here](https://web.archive.org/web/20190206204307/https://www.clips.uantwerpen.be/pages/mbsp-tags); `spacy.explain(...)` returns a short description of a given tag.
* `pos_` lists the coarse-grained part of speech.


In [42]:
for token in doc:
    print((token.text, token.tag_, token.pos_, spacy.explain(token.tag_)))

('It', 'PRP', 'PRON', 'pronoun, personal')
('is', 'VBZ', 'AUX', 'verb, 3rd person singular present')
('a', 'DT', 'DET', 'determiner')
('truth', 'NN', 'NOUN', 'noun, singular or mass')
('universally', 'RB', 'ADV', 'adverb')
('acknowledged', 'VBN', 'VERB', 'verb, past participle')
(',', ',', 'PUNCT', 'punctuation mark, comma')
('that', 'IN', 'SCONJ', 'conjunction, subordinating or preposition')
('a', 'DT', 'DET', 'determiner')
('single', 'JJ', 'ADJ', 'adjective (English), other noun-modifier (Chinese)')
('man', 'NN', 'NOUN', 'noun, singular or mass')
('in', 'IN', 'ADP', 'conjunction, subordinating or preposition')
('possession', 'NN', 'NOUN', 'noun, singular or mass')
('of', 'IN', 'ADP', 'conjunction, subordinating or preposition')
('a', 'DT', 'DET', 'determiner')
('good', 'JJ', 'ADJ', 'adjective (English), other noun-modifier (Chinese)')
('fortune', 'NN', 'NOUN', 'noun, singular or mass')
(',', ',', 'PUNCT', 'punctuation mark, comma')
('must', 'MD', 'AUX', 'verb, modal auxiliary')
('be'
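
The relationship between the two tag sets can be explored by tallying them. The sketch below uses (text, tag, pos) triples copied from the first rows of the output above, so it does not need spaCy; the variable names are made up for illustration.

```python
from collections import Counter, defaultdict

# (text, tag_, pos_) triples copied from the first rows of the output above.
tagged = [
    ("It", "PRP", "PRON"), ("is", "VBZ", "AUX"), ("a", "DT", "DET"),
    ("truth", "NN", "NOUN"), ("universally", "RB", "ADV"),
    ("acknowledged", "VBN", "VERB"), (",", ",", "PUNCT"),
    ("that", "IN", "SCONJ"), ("a", "DT", "DET"), ("single", "JJ", "ADJ"),
    ("man", "NN", "NOUN"), ("in", "IN", "ADP"),
]

# How often each coarse-grained tag occurs.
pos_counts = Counter(pos for _, _, pos in tagged)

# Which coarse-grained tags each fine-grained tag was paired with.
coarse_for_tag = defaultdict(set)
for _, tag, pos in tagged:
    coarse_for_tag[tag].add(pos)

print(pos_counts.most_common(3))
print(sorted(coarse_for_tag["IN"]))
```

In this sample the fine-grained tag `IN` appears with two different coarse-grained tags (`SCONJ` for "that", `ADP` for "in"), so the coarse tag is not simply a grouping of fine tags.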

### 5.4 Visualization: Using displaCy
spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

In [43]:
from spacy import displacy

In [44]:
displacy.render(doc, style='dep')