# Metadata

```yaml
Course:    DS 5001
Module:    13 LLMs
Topic:     Demonstrate ChatGPT Plus with GPT-4
Author:    R.C. Alvarado
Date:      27 April 2023
```

# Prompt 1

Write a Python class that will scrape an arbitrary text from Project Gutenberg and convert the text into a Pandas dataframe of tokens with a multi-index that maps onto the text's structure, i.e., its chapter, paragraph, sentence, and token numbers. Use NLTK to parse sentences and tokens, and get part-of-speech for each token. Also, use regular expressions to detect chapter breaks. Demonstrate with Jane Austen's Persuasion.

> Here's a Python class that scrapes an arbitrary text from Project Gutenberg, converts the text into a Pandas DataFrame of tokens with a multi-index that maps onto the text's structure (chapter, paragraph, sentence, and token numbers), and also provides part-of-speech for each token. We will use the nltk library to parse sentences and tokens and get the part-of-speech tags, and use regular expressions to detect chapter breaks.

> First, make sure you have the required libraries installed:

In [1]:
pip install pandas nltk requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


> Then, you can use the following class:

In [2]:
import requests
import re
import pandas as pd
import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")


class GutenbergScraper:
    def __init__(self, url):
        self.url = url
        self.text = self.get_text()
        self.chapters = self.split_chapters()
        self.data = self.parse_text()

    def get_text(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.content, "html.parser")
        text = soup.get_text()
        return text

    def split_chapters(self):
        return re.split(r"\bChapter\b", self.text, flags=re.IGNORECASE)

    def parse_text(self):
        parsed_data = []
        for chap_no, chapter in enumerate(self.chapters):
            paragraphs = re.split(r"\n\n+", chapter)
            for para_no, paragraph in enumerate(paragraphs):
                sentences = sent_tokenize(paragraph)
                for sent_no, sentence in enumerate(sentences):
                    tokens = word_tokenize(sentence)
                    pos_tags = nltk.pos_tag(tokens)
                    for token_no, (token, pos) in enumerate(pos_tags):
                        parsed_data.append(
                            (
                                (chap_no, para_no, sent_no, token_no),
                                {"token": token, "pos": pos},
                            )
                        )
        return pd.DataFrame.from_dict(
            dict(parsed_data), orient="index", columns=["token", "pos"]
        )

    def get_dataframe(self):
        return self.data


if __name__ == "__main__":
    url = "https://www.gutenberg.org/files/105/105-h/105-h.htm"
    scraper = GutenbergScraper(url)
    df = scraper.get_dataframe()
    print(df.head())

[nltk_data] Downloading package punkt to /Users/rca2t1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rca2t1/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


             token  pos
0 1 0 0        The   DT
      1    Project  NNP
      2  Gutenberg  NNP
      3      eBook   NN
      4         of   IN


> Replace the url variable with the URL of the specific Project Gutenberg page you want to scrape. In this case, we use Jane Austen's Persuasion HTML file. Running this script will print the first few rows of the parsed Pandas DataFrame.
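
The first tokens in the output above ("The Project Gutenberg eBook of") come from the page header rather than from the novel, because `get_text` returns the text of the whole page. The following is a minimal sketch of one way to trim that wrapper before parsing. It assumes the page text contains the conventional `*** START OF ...` and `*** END OF ...` marker lines; `strip_boilerplate` and the sample string are hypothetical and are not part of the generated class.

```python
import re

# Marker lines that Project Gutenberg files conventionally place around the body.
# The wording varies between files, so both patterns are deliberately loose.
START_RE = re.compile(
    r"^[ \t]*\*\*\*[ \t]*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^[ \t]*\*\*\*[ \t]*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_boilerplate(text):
    """Return the text between the START and END marker lines.

    If either marker is missing, or they appear in the wrong order,
    the text is returned unchanged.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    if not start or not end or end.start() <= start.end():
        return text
    return text[start.end():end.start()].strip()


# Illustrative input only; not the actual page content.
sample = (
    "The Project Gutenberg eBook of Persuasion\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PERSUASION ***\n"
    "Chapter 1\n\n"
    "Sir Walter Elliot, of Kellynch Hall ...\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PERSUASION ***\n"
    "Subscribe to hear about new eBooks.\n"
)
body = strip_boilerplate(sample)
```

In the class above, such a helper could be applied to the result of `get_text` before `split_chapters` runs, so that the header and footer do not end up as tokens.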

In [3]:
df

                token  pos
0  1  0 0         The   DT
        1     Project  NNP
        2   Gutenberg  NNP
        3       eBook   NN
        4          of   IN
...               ...  ...
48 58 0 38       hear   VB
        39      about   IN
        40        new   JJ
        41     eBooks  NNS
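
The index levels in the table above carry no names, so the chapter, paragraph, sentence, and token positions can only be addressed by position. Below is a small sketch of naming the levels and then selecting by label. The level names and the four-row stand-in frame are assumptions for illustration, not output from the scraper.

```python
import pandas as pd

# Hypothetical level names; the generated class leaves the index levels unnamed.
LEVEL_NAMES = ["chap_num", "para_num", "sent_num", "token_num"]

# A four-row stand-in for the scraper's output (illustrative values only).
keys = [(0, 1, 0, 0), (0, 1, 0, 1), (1, 0, 0, 0), (1, 0, 0, 1)]
records = [
    {"token": "The", "pos": "DT"},
    {"token": "Project", "pos": "NNP"},
    {"token": "Sir", "pos": "NNP"},
    {"token": "Walter", "pos": "NNP"},
]
df = pd.DataFrame(records, index=pd.MultiIndex.from_tuples(keys))

# Name the levels so later code can refer to them by label.
df.index.names = LEVEL_NAMES

# All tokens in chunk 1, with the remaining three levels kept as the index.
chunk_1 = df.xs(1, level="chap_num")

# Number of tokens in each chunk.
tokens_per_chunk = df.groupby(level="chap_num").size()
```

With named levels, the same `xs` and `groupby(level=...)` calls work at the paragraph or sentence level by swapping in the corresponding name.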


# Prompt 2

Rewrite this so that it only takes the Project Gutenberg ID in the initializer, and then figures out the regular expressions for chapter breaks in each case. This is necessary because sometimes chapters are indicated without the word "Chapter", and also we want to include things like letters as chapters.

> To make the class more flexible, we can modify the __init__ method to take a Project Gutenberg ID instead of the URL, and then create a method that automatically generates the URL based on the ID. We can also update the split_chapters method to handle different chapter break patterns, including cases where the word "Chapter" is not used, and include things like letters as chapters.

> Here's the updated class:

In [4]:
import requests
import re
import pandas as pd
import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")


class GutenbergScraper:
    def __init__(self, book_id):
        self.book_id = book_id
        self.url = self.get_url()
        self.text = self.get_text()
        self.chapters = self.split_chapters()
        self.data = self.parse_text()

    def get_url(self):
        return f"https://www.gutenberg.org/files/{self.book_id}/{self.book_id}-h/{self.book_id}-h.htm"

    def get_text(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.content, "html.parser")
        text = soup.get_text()
        return text

    def split_chapters(self):
        patterns = [
            r"\bChapter\b",
            r"\bLetter\b",
            r"\bSection\b",
            r"\bPart\b",
            r"\bBook\b",
        ]
        combined_pattern = r"|".join(patterns)
        return re.split(combined_pattern, self.text, flags=re.IGNORECASE)

    def parse_text(self):
        parsed_data = []
        for chap_no, chapter in enumerate(self.chapters):
            paragraphs = re.split(r"\n\n+", chapter)
            for para_no, paragraph in enumerate(paragraphs):
                sentences = sent_tokenize(paragraph)
                for sent_no, sentence in enumerate(sentences):
                    tokens = word_tokenize(sentence)
                    pos_tags = nltk.pos_tag(tokens)
                    for token_no, (token, pos) in enumerate(pos_tags):
                        parsed_data.append(
                            (
                                (chap_no, para_no, sent_no, token_no),
                                {"token": token, "pos": pos},
                            )
                        )
        return pd.DataFrame.from_dict(
            dict(parsed_data), orient="index", columns=["token", "pos"]
        )

    def get_dataframe(self):
        return self.data


if __name__ == "__main__":
    book_id = 105  # Jane Austen's Persuasion
    scraper = GutenbergScraper(book_id)
    df = scraper.get_dataframe()
    print(df.head())

[nltk_data] Downloading package punkt to /Users/rca2t1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rca2t1/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


             token  pos
0 1 0 0        The   DT
      1    Project  NNP
      2  Gutenberg  NNP
      3      eBook   NN
      4         of   IN


> You can now provide the Project Gutenberg ID for the book you want to scrape, and the class will automatically generate the URL and handle different chapter break patterns. In the example above, we use the ID 105 for Jane Austen's Persuasion. Running the script will print the first few rows of the parsed Pandas DataFrame.
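
Both versions of `split_chapters` break the text at every occurrence of a keyword, not only at headings. The full table under Prompt 1 ends at chunk number 48, although Persuasion has 24 chapters, which suggests the pattern also fires outside chapter headings (for instance in a table of contents). The sketch below contrasts the generated pattern with an anchored alternative on a made-up sample. The anchored pattern is an assumption about what a heading line looks like, not something the response above produced, and its numeral check is intentionally loose.

```python
import re

# Made-up sample text with two headings and keywords inside ordinary sentences.
sample = (
    "CHAPTER I.\n\n"
    "In the first part of the book, a letter arrives.\n\n"
    "CHAPTER II.\n\n"
    "The next chapter of her life began.\n"
)

# Pattern as generated above: any occurrence of these words counts as a break.
loose = re.compile(
    r"\bChapter\b|\bLetter\b|\bSection\b|\bPart\b|\bBook\b",
    re.IGNORECASE,
)

# Assumed alternative: a break is a whole line that starts with a heading word
# followed by an Arabic numeral or a run of Roman-numeral letters.
heading = re.compile(
    r"^[ \t]*(?:chapter|letter|section|part|book)[ \t]+(?:\d+|[ivxlc]+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

loose_chunks = loose.split(sample)
heading_chunks = [c.strip() for c in heading.split(sample) if c.strip()]
```

On this sample the loose pattern yields seven pieces, because it also matches "part", "book", "letter", and "chapter" inside sentences, while the anchored pattern yields the two chapter bodies. Headings without a keyword or numeral would still need their own pattern.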