<p style="text-align: center; font-size: 28px;"><b>Text preprocessing</b></p>

<p font-size: 28px;"><b>Import and Preprocess Text Data</b></p>

<p style="font-size: 15px;"><b>Task 1</b></p>
Given to you in the code editor are text files for The Picture of Dorian Gray (dorian_gray.txt) and The Iliad (the_iliad.txt), sourced from Project Gutenberg. Import the text of your choosing, convert it to lowercase, and name it <code>text</code> using the following line of code:  
<br>
<code>text = open("_______.txt", encoding='utf-8').read().lower()</code>  
<br>
Replace the blank with the name of the text file for the novel you choose to analyze.

In [4]:
from nltk import pos_tag, RegexpParser
from utils.tokenize_words import word_sentence_tokenize
from utils.chunk_counters import np_chunk_counter, vp_chunk_counter

<p style="font-size: 16px;"><b>Task 2</b></p>
With the text imported, split it into individual sentences and then into individual words for sentence-by-sentence parsing. Use the provided function <code>word_sentence_tokenize()</code> by passing <code>text</code> as an argument, and save the result to a variable named <code>word_tokenized_text</code>.
```


In [7]:
text = open("utils/dorian_gray.txt",encoding='utf-8').read().lower()

`sent_tokenize()` vs `PunktSentenceTokenizer`

Choosing the right tokenizer depends on your text. Both methods break text into sentences, but they work differently and serve different needs.

---

‚úîÔ∏è `sent_tokenize(text)` ‚Äî Quick & Reliable

* One line of code
* Uses a **pretrained English model**
* Good for most standard texts
* Example:

  ```python
  from nltk.tokenize import sent_tokenize
  sent_tokenize(text)
  ```

---

üß† `PunktSentenceTokenizer(text)` ‚Äî Trained on Your Text

* Builds a **custom model** from your own dataset
* Useful when dealing with:

  * Many abbreviations (`Dr.`, `Prof.`, etc.)
  * Legal / medical / scientific formatting
  * Domain-specific writing styles
* Allows **finer control** over sentence detection

---

‚öñÔ∏è When to Use Which?

| Use Case                            | Best Option                |
| ----------------------------------- | -------------------------- |
| Standard English prose              | `sent_tokenize()`          |
| Unusual punctuation / abbreviations | `PunktSentenceTokenizer()` |
| You need custom behavior            | `PunktSentenceTokenizer()` |
| You want speed & simplicity         | `sent_tokenize()`          |

---

üìå Why Training Can Help

```python
text = "Dr. Watson met Mr. Holmes. They went home."

# Default sent_tokenize output:
sent_tokenize(text)
# ['Dr.', 'Watson met Mr.', 'Holmes.', 'They went home.']   ‚ùå Incorrect

# Custom trained tokenizer:
PunktSentenceTokenizer(text).tokenize(text)
# ['Dr. Watson met Mr. Holmes.', 'They went home.']         ‚úîÔ∏è Correct
```

---
üß† Final Takeaway

* Use **`sent_tokenize()`** unless your text has **special formatting**.
* Use **`PunktSentenceTokenizer()`** if your dataset needs **custom training**.

Let me know if you want a polished version of your function using `sent_tokenize()` ‚Äî I can write it for you directly.


<p style="font-size: 16px;"><b>Task 3</b></p>
Save any word-tokenized sentence in <code>word_tokenized_text</code> to a variable named <code>single_word_tokenized_sentence</code>. Then print <code>single_word_tokenized_sentence</code> to visualize what you have done so far.

In [8]:
word_tokenized_text = word_sentence_tokenize(text)

<p font-size: 28px;"><b>Part-of-speech Tag Text</b></p>

<p style="font-size: 16px;"><b>Task 4</b></p>
Begin syntax parsing by part-of-speech tagging each sentence. Create a list named <code>pos_tagged_text</code> to store each part-of-speech tagged sentence from the novel.

In [11]:
 = []

<p style="font-size: 16px;"><b>Task 5</b></p>
Loop through each word-tokenized sentence in <code>word_tokenized_text</code> and part-of-speech tag each sentence using NLTK‚Äôs <code>pos_tag()</code> function. Append each result to <code>pos_tagged_text</code>.

In [12]:
for sentence in word_tokenized_text:
    result = pos_tag(sentence)
    pos_tagged_text.append(result)

In [13]:
print(pos_tagged_text)



<p style="font-size: 16px;"><b>Task 6</b></p>
Save any part-of-speech tagged sentence from <code>pos_tagged_text</code> to a variable named <code>single_pos_sentence</code>. Then print <code>single_pos_sentence</code> to visualize what you have done so far.

In [16]:
pos_tagged_text[0][0][1]

'DT'

In [17]:
single_pos_sentence = pos_tagged_text[100]
print(single_pos_sentence)

[('it', 'PRP'), ('seems', 'VBZ'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('one', 'CD'), ('thing', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('make', 'VB'), ('modern', 'JJ'), ('life', 'NN'), ('mysterious', 'JJ'), ('or', 'CC'), ('marvellous', 'JJ'), ('to', 'TO'), ('us', 'PRP'), ('.', '.')]


<p font-size: 28px;"><b>Chunk Sentences</b></p>

<p style="font-size: 16px;"><b>Task 8</b></p> Create an nltk RegexpParser object named np_chunk_parser using the noun phrase chunk grammar you defined as an argument.

In [22]:
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

np_chunk_parser = RegexpParser(chunk_grammar)

<p style="font-size: 16px;"><b>Task 9</b></p> Define a piece of chunk grammar named vp_chunk_grammar that will chunk a verb phrase consisting of a noun phrase, followed by a verb (VB), followed by an optional adverb (RB).

In [23]:
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

<p style="font-size: 16px;"><b>Task 10</b></p>
Create an nltk <code>RegexpParser</code> object named <code>vp_chunk_parser</code> using the verb phrase chunk grammar you defined as an argument.

In [24]:
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

<p style="font-size: 16px;"><b>Task 11</b></p>
Create two empty lists named <code>np_chunked_text</code> and <code>vp_chunked_text</code> to store the noun phrase and verb phrase chunked sentences from your text.

In [25]:
np_chunked_text = [] 
vp_chunked_text = []

<p style="font-size: 16px;"><b>Task 12</b></p>
Loop through each part-of-speech tagged sentence in <code>pos_tagged_text</code> and chunk noun phrases using your <code>RegexpParser</code> object's <code>.parse()</code> method. Append each result to <code>np_chunked_text</code>.

In [28]:
for sentence in pos_tagged_text:
    result = np_chunk_parser.parse(sentence)
    np_chunked_text.append(result)

<p style="font-size: 20px;"><b>Task 13</b></p>
Within the same loop from the previous task, chunk verb phrases for each part-of-speech tagged sentence using your <code>RegexpParser</code> object's <code>.parse()</code> method. Append the result to <code>vp_chunked_text</code>.

In [29]:
for sentence in pos_tagged_text:
    result = np_chunk_parser.parse(sentence)
    np_chunked_text.append(result)
    result = vp_chunk_parser.parse(sentence)
    vp_chunked_text.append(result)

<p font-size: 28px;"><b>Analyzing Chunks</b></p>

<p style="font-size: 16px;"><b>Task 14</b></p>
Analyze your chunked novel by examining chunk frequencies. Use the provided function <code>np_chunk_counter()</code>, which returns the 30 most common NP-chunks. Call it with <code>np_chunked_text</code> as an argument and save the result to a variable named <code>most_common_np_chunks</code>. Then print <code>most_common_np_chunks</code> and reflect on any interesting or surprising patterns you observe.

In [32]:
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)

[((('i', 'NN'),), 1926), ((('henry', 'NN'),), 400), ((('lord', 'NN'),), 394), ((('life', 'NN'),), 340), ((('harry', 'NN'),), 272), ((('dorian', 'JJ'), ('gray', 'NN')), 254), ((('something', 'NN'),), 252), ((('nothing', 'NN'),), 186), ((('basil', 'NN'),), 170), ((('the', 'DT'), ('world', 'NN')), 140), ((('everything', 'NN'),), 138), ((('anything', 'NN'),), 136), ((('hallward', 'NN'),), 136), ((('the', 'DT'), ('man', 'NN')), 122), ((('the', 'DT'), ('room', 'NN')), 120), ((('face', 'NN'),), 114), ((('the', 'DT'), ('door', 'NN')), 112), ((('love', 'NN'),), 110), ((('art', 'NN'),), 104), ((('course', 'NN'),), 102), ((('the', 'DT'), ('picture', 'NN')), 92), ((('the', 'DT'), ('lad', 'NN')), 90), ((('head', 'NN'),), 88), ((('round', 'NN'),), 88), ((('hand', 'NN'),), 88), ((('sibyl', 'NN'),), 82), ((('the', 'DT'), ('table', 'NN')), 80), ((('the', 'DT'), ('painter', 'NN')), 76), ((('sir', 'NN'),), 76), ((('a', 'DT'), ('moment', 'NN')), 76)]


<p style="font-size: 20px;"><b>Task 15</b></p>
Use the function <code>vp_chunk_counter()</code>, which returns the 30 most common verb phrase chunks. Call it with <code>vp_chunked_text</code> as an argument and save the result to a variable named <code>most_common_vp_chunks</code>. Then print <code>most_common_vp_chunks</code> and reflect on any interesting or surprising patterns you observe.

In [33]:
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

[((('i', 'NN'), ('am', 'VBP')), 101), ((('i', 'NN'), ('was', 'VBD')), 40), ((('i', 'NN'), ('want', 'VBP')), 37), ((('i', 'NN'), ('know', 'VBP')), 33), ((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 32), ((('i', 'NN'), ('have', 'VBP')), 32), ((('i', 'NN'), ('had', 'VBD')), 31), ((('i', 'NN'), ('suppose', 'VBP')), 17), ((('i', 'NN'), ('think', 'VBP')), 16), ((('i', 'NN'), ('am', 'VBP'), ('not', 'RB')), 14), ((('i', 'NN'), ('thought', 'VBD')), 13), ((('i', 'NN'), ('believe', 'VBP')), 12), ((('dorian', 'JJ'), ('gray', 'NN'), ('was', 'VBD')), 11), ((('i', 'NN'), ('am', 'VBP'), ('so', 'RB')), 11), ((('henry', 'NN'), ('had', 'VBD')), 11), ((('i', 'NN'), ('did', 'VBD'), ("n't", 'RB')), 9), ((('i', 'NN'), ('met', 'VBD')), 9), ((('i', 'NN'), ('said', 'VBD')), 9), ((('i', 'NN'), ('am', 'VBP'), ('quite', 'RB')), 8), ((('i', 'NN'), ('see', 'VBP')), 8), ((('i', 'NN'), ('did', 'VBD'), ('not', 'RB')), 7), ((('i', 'NN'), ('have', 'VBP'), ('ever', 'RB')), 7), ((('life', 'NN'), ('has', 'VBZ')), 7), ((('i'