# Text classification

You can train a machine learning model to classify passages of the novel into genres or categories, such as romance, drama, or comedy. Techniques such as word embeddings, sentiment scores, or topic modeling can supply the features for your model.

In this example, we load the text data from a file, create a pandas DataFrame with the text and genre/category labels, and define a pipeline that converts the text into a bag-of-words representation, applies term frequency-inverse document frequency (TF-IDF) weighting, and trains a Multinomial Naive Bayes classifier. We then fit the classifier on the labeled lines and predict the label of a new text sample.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [2]:
# Load the text data
with open('/content/EM_Forester_Room_with_a_View.txt', 'r', encoding='utf-8') as f:
    data = f.readlines()

In [3]:
# Create a list of tuples with the text and the corresponding genre/category label
# Note: zip() stops at the shorter input, so only the first 16 lines of the file get a label
labels = ['romance', 'romance', 'comedy', 'drama', 'drama', 'romance', 'comedy', 'comedy', 'drama', 'romance', 'comedy', 'drama', 'romance', 'drama', 'comedy', 'romance']
data_with_labels = list(zip(data, labels))

In [4]:
# Create a pandas DataFrame with the text and the corresponding genre/category label
df = pd.DataFrame(data_with_labels, columns=['text', 'label'])

In [5]:
# Define the pipeline for text classification
text_clf = Pipeline([
    ('vect', CountVectorizer()),  # Convert the text into a bag of words representation
    ('tfidf', TfidfTransformer()),  # Apply term frequency-inverse document frequency weighting
    ('clf', MultinomialNB()),  # Train a Multinomial Naive Bayes classifier
])

In [6]:
# Train the text classifier
text_clf.fit(df['text'], df['label'])

In [7]:
# Test the text classifier on a new sample
new_text = ["Lucy Honeychurch is a young Englishwoman who is torn between fulfilling her duties as a proper young lady and following her heart and marrying for love."]
predicted_label = text_clf.predict(new_text)
print(predicted_label)

['romance']
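
A single predicted label hides how confident the classifier is. The sketch below shows one way to inspect class probabilities with `predict_proba`. The toy sentences and labels are invented for illustration and are not taken from the novel, and `TfidfVectorizer` is used here as a one-step stand-in for the `CountVectorizer` plus `TfidfTransformer` pair.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: sentences and labels are made up for this sketch.
texts = [
    "She loved him and dreamed of their kiss",
    "Their love and kiss under the stars",
    "He told a silly joke and everyone laughed",
    "The joke made the whole room laugh",
]
labels = ["romance", "romance", "comedy", "comedy"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # bag of words + TF-IDF weighting in one step
    ("nb", MultinomialNB()),       # same classifier as in the pipeline above
])
clf.fit(texts, labels)

sample = ["a kiss and a promise of love"]
predicted = clf.predict(sample)[0]

# Map each class name to its estimated probability for the sample.
probabilities = dict(zip(clf.classes_, clf.predict_proba(sample)[0]))
print(predicted, probabilities)
```

With only a handful of training rows, these probabilities are rough at best; a held-out test set would be needed to judge the model properly.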


# Named entity recognition

You can use a pre-trained model or train your own model to identify and extract named entities from the text, such as people, locations, and organizations.

In this example, we load the pre-trained English model from spaCy, process the text of the novel, and extract the named entities using the ents attribute of the processed document. For each named entity, we print its text and its entity label (e.g., PERSON, GPE, ORG). As the output shows, the small model makes mistakes on this text, such as tagging "Charlotte" as GPE. You can also customize the pre-trained model or train your own model using spaCy's training pipeline.

In [8]:
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_sm")

# Process the text of the novel
with open('/content/EM_Forester_Room_with_a_View.txt', 'r', encoding='utf-8') as f:
    text = f.read()

doc = nlp(text)

# Extract the named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Bertolini PERSON
The Signora WORK_OF_ART
Bartlett PERSON
Lucy PERSON
And a Cockney WORK_OF_ART
Lucy PERSON
Signora LOC
London GPE
two CARDINAL
English LANGUAGE
English NORP
Queen PERSON
English NORP
English NORP
Cuthbert Eager PERSON
M. A. Oxon PERSON
Charlotte GPE
London GPE
Bartlett PERSON
the Arno PRODUCT
Signora ORG
Signora ORG
Bartlett PERSON
Lucy PERSON
Charlotte GPE
first ORDINAL
Bartlett PERSON
Lucy PERSON
Lucy PERSON
one CARDINAL
Bartlett PERSON
day DATE
two CARDINAL
Bartlett PERSON
George PERSON
Bartlett PERSON
Lucy PERSON
Bartlett PERSON
Lucy PERSON
George PERSON
Lucy PERSON
Bartlett PERSON
half an hour TIME
Bartlett PERSON
two CARDINAL
Lucy PERSON
Lucy PERSON
Hardly ORG
Lucy PERSON
Beebe PERSON
Charlotte GPE
Bartlett PERSON
Beebe PERSON
Bartlett PERSON
Miss Honeychurch PERSON
Tunbridge Wells ORG
Easter PERSON
one CARDINAL
Lucy PERSON
Miss Honeychurch WORK_OF_ART
Summer Street FAC
Miss PERSON
Bartlett PERSON
last week DATE
Tunbridge Wells ORG
Beebe PERSON
next June DATE
Wind
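
A long list of entities is easier to read once it is tallied. The sketch below counts how often each entity text appears for a given label. The function name `tally_entities` and its parameters are hypothetical, and the sample pairs are a small subset copied from the output above.

```python
from collections import Counter

def tally_entities(pairs, label="PERSON", top_n=3):
    """Count how often each entity text occurs for one label."""
    counts = Counter(text for text, ent_label in pairs if ent_label == label)
    return counts.most_common(top_n)

# A few (text, label) pairs copied from the printed output.
sample_pairs = [
    ("Bertolini", "PERSON"), ("Bartlett", "PERSON"), ("Lucy", "PERSON"),
    ("Lucy", "PERSON"), ("London", "GPE"), ("Charlotte", "GPE"),
    ("Bartlett", "PERSON"), ("Lucy", "PERSON"),
]

top_people = tally_entities(sample_pairs)
print(top_people)
```

With a spaCy document, the same function could be fed `((ent.text, ent.label_) for ent in doc.ents)`.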

# Sentiment analysis

You can use NLP techniques to determine the sentiment of the text, such as positive, negative, or neutral. This can help you understand the emotional tone of the novel.

In this example, we load the text of the novel and analyze its sentiment using the TextBlob library. We read the polarity score from the sentiment.polarity attribute of the TextBlob object, which ranges from -1 (most negative) to 1 (most positive). We then assign a label based on the score: positive for polarity > 0, negative for polarity < 0, and neutral for polarity = 0. Finally, we print both the polarity score and the label. Keep in mind that a single score summarizes the whole novel in one number, so it says little about how the tone shifts within the book. You can also use other libraries or techniques for sentiment analysis, such as VADER (the vaderSentiment package) or machine learning models trained on sentiment analysis datasets.

In [9]:
from textblob import TextBlob

# Load the text of the novel
with open('/content/EM_Forester_Room_with_a_View.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Analyze the sentiment of the text
blob = TextBlob(text)
polarity = blob.sentiment.polarity

# Print the sentiment polarity score
print("Sentiment polarity score:", polarity)

# Determine the sentiment label
if polarity > 0:
    sentiment_label = "positive"
elif polarity < 0:
    sentiment_label = "negative"
else:
    sentiment_label = "neutral"

# Print the sentiment label
print("Sentiment label:", sentiment_label)

Sentiment polarity score: 0.0844269902961895
Sentiment label: positive
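
With the rule used above, any score other than exactly 0 is labeled positive or negative, so a weak score such as 0.084 still counts as positive. One possible refinement is a neutral band around zero. The sketch below assumes a hypothetical `threshold` of 0.05; that value is an illustration, not something TextBlob prescribes.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band.

    `threshold` is a hypothetical cut-off chosen for this sketch.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

score = 0.0844269902961895  # polarity reported for the novel
print(label_polarity(score))                 # default band of 0.05
print(label_polarity(score, threshold=0.1))  # wider neutral band
```

The width of the band is a judgment call and is best tuned against passages whose tone you have already checked by hand.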


# Topic modeling

You can use techniques such as Latent Dirichlet Allocation (LDA) to identify the underlying topics or themes in the novel. This can help you understand the main ideas and motifs in the text.

In this example, we load the text of the novel and define a preprocessing function that removes stop words and tokens shorter than 4 characters. We then preprocess the text and create a dictionary of terms, as well as a bag-of-words corpus (in this code, each token becomes its own one-word document). Next, we define the LDA model parameters, including the number of topics to identify, and train the LDA model on the corpus. Finally, we print the top 10 words for each topic identified by the model. You can further refine the preprocessing and model parameters to improve the quality of the topic modeling.

In [10]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora

# Load the text of the novel
with open('/content/EM_Forester_Room_with_a_View.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Define preprocessing functions
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(token)
    return result

# Preprocess the text
processed_text = preprocess(text)

# Create a dictionary of terms and their frequency counts
dictionary = corpora.Dictionary([processed_text])

# Create a corpus of term frequency vectors
# Note: this builds one single-token document per token, so LDA sees no word co-occurrence
corpus = [dictionary.doc2bow([token]) for token in processed_text]

# Define the LDA model parameters
num_topics = 5
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)

# Print the topics and their top words
for i, topic in lda_model.show_topics(formatted=True, num_topics=num_topics, num_words=10):
    print(f"Topic {i}: {topic}")


Topic 0: 0.070*"said" + 0.035*"beebe" + 0.026*"honeychurch" + 0.020*"little" + 0.019*"like" + 0.013*"time" + 0.013*"greece" + 0.012*"dear" + 0.011*"vyse" + 0.009*"heard"
Topic 1: 0.027*"bartlett" + 0.023*"freddy" + 0.020*"think" + 0.017*"come" + 0.013*"going" + 0.013*"room" + 0.012*"told" + 0.011*"tell" + 0.010*"world" + 0.010*"church"
Topic 2: 0.036*"george" + 0.020*"emerson" + 0.016*"right" + 0.014*"shall" + 0.014*"things" + 0.012*"mean" + 0.011*"life" + 0.011*"want" + 0.010*"look" + 0.008*"word"
Topic 3: 0.076*"lucy" + 0.046*"cecil" + 0.024*"know" + 0.017*"people" + 0.015*"good" + 0.011*"thought" + 0.010*"course" + 0.009*"house" + 0.009*"girl" + 0.009*"asked"
Topic 4: 0.056*"miss" + 0.025*"mother" + 0.019*"love" + 0.018*"charlotte" + 0.017*"came" + 0.012*"away" + 0.012*"thing" + 0.010*"young" + 0.009*"gone" + 0.009*"long"
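
The corpus above is built from one-word documents, which gives LDA no co-occurrence information to work with. One possible alternative is to group the tokens into fixed-size pseudo-documents first. The helper `chunk_tokens` and the chunk size are assumptions made for this sketch, and the short token list is only a stand-in for `processed_text`.

```python
def chunk_tokens(tokens, chunk_size=200):
    """Split a flat token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Stand-in for processed_text; a real run would use the preprocessed novel.
tokens = ["lucy", "cecil", "george", "view", "room", "florence", "italy"]
chunks = chunk_tokens(tokens, chunk_size=3)
print(chunks)
```

Each chunk can then be passed to `dictionary.doc2bow(chunk)` to build the corpus. Splitting on chapters or paragraphs instead of fixed sizes may give more coherent topics, at the cost of some extra parsing.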


# Text summarization

You can use techniques such as extractive or abstractive summarization to generate a summary of the novel. This can help you quickly get an overview of the main plot points and themes.

In this example, we load the text of the novel and generate an extractive summary using the TextRank algorithm from the Gensim library. The summarization module was removed in Gensim 4.0, which is why the first cell installs version 3.8.3. The summarize function takes the text as input and an optional ratio parameter (default 0.2) that determines the length of the summary relative to the length of the original text; the call below relies on that default. The summary is generated by selecting the most important sentences based on their similarity to the other sentences in the text. The output is a string containing the summary of the text. You can also use other extractive or abstractive summarization techniques and libraries, such as spaCy, NLTK, or transformers-based models.

In [11]:
!pip install gensim==3.8.3


In [12]:
import gensim.summarization

# Load the text of the novel
with open('/content/EM_Forester_Room_with_a_View.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Generate the summary using the TextRank algorithm
summary = gensim.summarization.summarize(text)

# Print the summary
print(summary)


“The Signora had no business to do it,” said Miss Bartlett, “no
She promised us south rooms with a view close
together, instead of which here are north rooms, looking into a
“And a Cockney, besides!” said Lucy, who had been further saddened by
“This meat has surely been used for soup,” said Miss Bartlett, laying
“Any nook does for me,” Miss Bartlett continued; “but it does seem hard
Lucy felt that she had been selfish.
vacant room in the front—” “You must have it,” said Miss Bartlett, part
Your mother would never forgive me, Lucy.”
Miss Bartlett was startled.
Generally at a pension people looked them
build, with a fair, shaven face and large eyes.
What exactly it was Miss Bartlett did not stop to consider, for her
said: “A view?
“This is my son,” said the old man; “his name’s George.
“Ah,” said Miss Bartlett, repressing Lucy, who was about to speak.
“What I mean,” he continued, “is that you can have our rooms, and we’ll
Miss Bartlett, in reply, opened her mouth as little as
possible, a
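
To make the idea of extractive summarization concrete without depending on an old Gensim release, here is a minimal sketch that scores sentences by word frequency and keeps the top ones in their original order. This is a simpler technique than TextRank, not a reimplementation of it. The function name `frequency_summary`, the length-based word filter, and the sample text are all invented for illustration.

```python
import re
from collections import Counter

def frequency_summary(text, max_sentences=2):
    """Pick the highest-scoring sentences, keeping their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of 3 characters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if len(t) > 3)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)

# Made-up sample text, not a quotation from the novel.
sample = ("Lucy wanted a room with a view. The soup was cold. "
          "Miss Bartlett said the view from the room mattered to Lucy. It rained.")
summary = frequency_summary(sample)
print(summary)
```

Frequency scoring favors long sentences that repeat common words, so it is only a rough baseline compared with graph-based methods such as TextRank.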