In [1]:
import requests

url = 'https://raw.githubusercontent.com/bellevue-university/dsc360/refs/heads/main/12%20Week/week_4/big.txt'
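response = requests.get(url)
# Stop here if the request failed, so an error page is never saved as big.txt
response.raise_for_status()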

# Save the file
with open('big.txt', 'wb') as f:
    f.write(response.content)

print("File downloaded successfully.")

File downloaded successfully.


In [2]:
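# Importing libraries for text processing: pandas for handling data, re and string for cleaning, and nltk for stop words.
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords

# Fetching the stop-word corpus if it is missing; stopwords.words() raises a LookupError without it.
nltk.download('stopwords', quiet=True)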

# Defining the TextNormalizer class. This class will have methods to clean text and remove stop words.
class TextNormalizer:
    
    def __init__(self):
        # Stop words are common words in the language that are usually removed during text preprocessing.
        self.stop_words = set(stopwords.words('english'))
    
    # The clean_text function takes in text and processes it to make it easier for analysis.
    def clean_text(self, text):
        # Converting the text to lowercase so that it is case insensitive.
        text = text.lower()
        # Removing digits, since numbers are not needed for this analysis.
        text = re.sub(r'\d+', '', text)
        # Removing punctuation marks from the text.
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Stripping leading/trailing whitespace and replacing multiple spaces with a single space.
        text = text.strip()
        text = re.sub(r'\s+', ' ', text)
        return text

    # The remove_stopwords function removes common words like 'the', 'and', etc.
    def remove_stopwords(self, text):
        # Splitting the text into individual words (tokens).
        tokens = text.split()
        # Filtering out any word that is in the stop_words set.
        filtered_tokens = [token for token in tokens if token not in self.stop_words]
        # Joining the remaining tokens back into a single string of text.
        return ' '.join(filtered_tokens)

    # The normalize_corpus function will clean and filter a Pandas Series, returning a single stream of cleaned text.
    def normalize_corpus(self, series):
        # Using apply to clean, and then filter, the text in each row of the Pandas Series.
        series = series.apply(self.clean_text)
        series = series.apply(self.remove_stopwords)
        # Joining the entire cleaned series into one long string of text.
        return ' '.join(series)
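
To make the effect of each step concrete, the sketch below traces the same cleaning steps on one sample sentence. It is a minimal, standalone illustration: the sample sentence and the tiny `SAMPLE_STOP_WORDS` set are assumptions made for the example (the class above uses NLTK's full English list), so it runs without any NLTK data.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# TextNormalizer uses NLTK's full English list instead.
SAMPLE_STOP_WORDS = {"the", "of", "was", "in", "a", "and"}


def clean_text(text):
    """Apply the same steps, in the same order, as TextNormalizer.clean_text."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.strip()
    return re.sub(r"\s+", " ", text)


def remove_stopwords(text, stop_words=SAMPLE_STOP_WORDS):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(token for token in text.split() if token not in stop_words)


# Assumed sample input, not taken from big.txt.
sample = "  The Adventures of Sherlock Holmes, published in 1892!  "
cleaned = clean_text(sample)
filtered = remove_stopwords(cleaned)

print(cleaned)   # the adventures of sherlock holmes published in
print(filtered)  # adventures sherlock holmes published
```

Note that the year disappears entirely and the comma and exclamation mark leave no trace, which is the intended behaviour of `clean_text` but also means sentence boundaries are lost before any later parsing.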

In [3]:
# Load the big.txt file
with open('big.txt', 'r') as file:
    text_data = file.readlines()

# Convert the text into a Pandas Series
text_series = pd.Series(text_data)

# Initialize the TextNormalizer class
normalizer = TextNormalizer()

# Normalize the text using the normalize_corpus method
normalized_text = normalizer.normalize_corpus(text_series)

# Print the first 1000 characters of the normalized text to verify
print(normalized_text[:1000])


project gutenberg ebook adventures sherlock holmes sir arthur conan doyle series sir arthur conan doyle  copyright laws changing world sure check copyright laws country downloading redistributing project gutenberg ebook  header first thing seen viewing project gutenberg file please remove change edit header without written permission  please read legal small print information ebook project gutenberg bottom file included important information specific rights restrictions file may used also find make donation project gutenberg get involved   welcome world free plain vanilla electronic texts  ebooks readable humans computers since  ebooks prepared thousands volunteers   title adventures sherlock holmes  author sir arthur conan doyle  release date march ebook recently updated november  edition  language english  character set encoding ascii  start project gutenberg ebook adventures sherlock holmes     additional editing jose menendez    adventures sherlock holmes    sir arthur conan doyle 
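
The doubled and tripled spaces in the output above come from blank lines in the file: each one cleans to an empty string, and joining with `' '` still places a separator on either side of it. A minimal sketch of one way to avoid this, shown on a plain list of hypothetical sample lines rather than on the Series itself, is to drop empty strings before joining.

```python
# Hypothetical cleaned lines; blank source lines have become empty strings.
cleaned_lines = [
    "project gutenberg ebook",
    "",
    "adventures sherlock holmes",
    "",
    "",
    "sir arthur conan doyle",
]

# Joining everything keeps one extra separator per empty string.
joined_as_is = " ".join(cleaned_lines)

# Filtering out empty strings first leaves single spaces only.
joined_filtered = " ".join(line for line in cleaned_lines if line)

print(repr(joined_as_is))
print(repr(joined_filtered))
```

With a Pandas Series, a comparable filter would be `series[series != '']` applied just before the final join in `normalize_corpus`. This is a suggestion, not what the notebook ran, so the output shown above still contains the extra spaces.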

In [7]:
# Import spaCy
import spacy

# Load the small English model for spaCy
nlp = spacy.load('en_core_web_sm')

# Process the first 1000 characters of the normalized text using spaCy's NLP pipeline
doc = nlp(normalized_text[:1000])

# Loop through each token in the text and print out the details
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}, Dependency: {token.dep_}")


Token: project, Lemma: project, POS: PROPN, Dependency: compound
Token: gutenberg, Lemma: gutenberg, POS: PROPN, Dependency: compound
Token: ebook, Lemma: ebook, POS: NOUN, Dependency: nsubj
Token: adventures, Lemma: adventure, POS: VERB, Dependency: nsubj
Token: sherlock, Lemma: sherlock, POS: PROPN, Dependency: compound
Token: holmes, Lemma: holmes, POS: PROPN, Dependency: compound
Token: sir, Lemma: sir, POS: PROPN, Dependency: compound
Token: arthur, Lemma: arthur, POS: PROPN, Dependency: compound
Token: conan, Lemma: conan, POS: PROPN, Dependency: compound
Token: doyle, Lemma: doyle, POS: PROPN, Dependency: compound
Token: series, Lemma: series, POS: PROPN, Dependency: compound
Token: sir, Lemma: sir, POS: PROPN, Dependency: compound
Token: arthur, Lemma: arthur, POS: PROPN, Dependency: compound
Token: conan, Lemma: conan, POS: PROPN, Dependency: compound
Token: doyle, Lemma: doyle, POS: PROPN, Dependency: compound
Token:  , Lemma:  , POS: SPACE, Dependency: dep
Token: copyright, 

### Summary of the Process:
In this assignment, I used **spaCy** for tokenization, lemmatization, part-of-speech (POS) tagging, and syntactic dependency parsing. I originally intended to compare **spaCy**'s output with **NLTK**'s, but I ran into persistent problems downloading and configuring NLTK resources, specifically the `averaged_perceptron_tagger`. I therefore proceeded with **spaCy** alone, which produced all four annotations in a single pipeline call.

### Results with spaCy:
Using spaCy, I processed the first 1,000 characters of the normalized text. The output included tokens, their lemmatized forms, part-of-speech tags, and syntactic dependencies. For example:

- **Token**: project, **Lemma**: project, **POS**: PROPN, **Dependency**: compound
- **Token**: gutenberg, **Lemma**: gutenberg, **POS**: PROPN, **Dependency**: compound
- **Token**: ebook, **Lemma**: ebook, **POS**: NOUN, **Dependency**: nsubj

...

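Although I could not compare these results with NLTK, **spaCy** produced all four annotations for this sample without further setup. Some tags are debatable (for example, `adventures` is tagged VERB and `series` PROPN), most likely because the text was lowercased and stripped of punctuation and stop words before parsing.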
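
As a small follow-up to the results listed above, the printed lines can be parsed back into fields and tallied by part of speech. The sketch below is an illustration only: `parse_token_line` is a hypothetical helper, and the input is just the three lines quoted in this section, not the full output.

```python
from collections import Counter

# The three output lines quoted in the results above.
output_lines = [
    "Token: project, Lemma: project, POS: PROPN, Dependency: compound",
    "Token: gutenberg, Lemma: gutenberg, POS: PROPN, Dependency: compound",
    "Token: ebook, Lemma: ebook, POS: NOUN, Dependency: nsubj",
]


def parse_token_line(line):
    """Split one 'Key: value, Key: value' line into a dict (hypothetical helper)."""
    fields = {}
    for part in line.split(", "):
        key, _, value = part.partition(": ")
        fields[key] = value
    return fields


records = [parse_token_line(line) for line in output_lines]

# Count how often each POS tag appears in the parsed records.
pos_counts = Counter(record["POS"] for record in records)
print(pos_counts.most_common())
```

Parsing printed text is fragile (a token that itself contains `", "` would break the split). When the `doc` object is still available, `Counter(token.pos_ for token in doc)` gives the same tally directly.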
