## Introduction
The objective is to build a text summarization model that identifies the most important sentences in each Harry Potter chapter and uses them to create extractive summaries of every chapter in every book. Each summary then serves as the starting point for a language generator that writes a short story expanding on the ideas in the summary. The language generator would use a bidirectional encoder to adjust the writing style so that it more closely reflects that of the summary.

#### Step 1: Import Data 
Loop over the data files and import each book's text into a single DataFrame. This is the same method used in the EDA notebook.

In [1]:
import pandas as pd

Bookfile = [] # Empty "Book" list - Prepare for loop

# Loops through importing 7 HP text files - Book 1 creates table, Books 2-7 append to Book 1 table
for i in range(1, 8): 
    Bookfile.append('HPBook'+str(i)+'.txt')
    FileLoc = "data/{}".format(Bookfile[i-1])
    if i == 1:
        df = pd.read_csv(FileLoc, sep="@")
    else:
        df2 = pd.read_csv(FileLoc, sep="@")
        df = pd.concat([df, df2])

#### Step 2: Clean and Process Data 
Using the NLTK library, build a table that breaks the text down by sentence rather than by chapter, then process each sentence to remove unwanted punctuation and stopwords.

In [2]:
import string 
import nltk 
from nltk import word_tokenize 
from nltk.corpus import stopwords 
from nltk import sent_tokenize 
from nltk.stem import PorterStemmer
# first run only: nltk.download('stopwords'); nltk.download('punkt')

In [3]:
# remove stop words (punctuation is stripped later, sentence by sentence)
stop_words = set(stopwords.words('english'))

# helper that drops stop-word tokens from a text
def remove_stopwords(text):
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if word not in stop_words]
    return " ".join(tokens_without_sw)

# lowercase and filter the text, then add a tokenized column and a token counter
df['Text'] = df['Text'].str.lower().apply(remove_stopwords)
df['WordCountText'] = df['Text'].apply(word_tokenize)
df['WordCount'] = df['WordCountText'].str.len()  # token count per chapter (punctuation tokens included)
df = df.reset_index(drop=True)  # assign the result so the continuous index persists
df

Unnamed: 0,Text,Chapter,Book,WordCountText,WordCount
0,"boy lived mr. mrs. dursley , number four , pri...",1,1,"[boy, lived, mr., mrs., dursley, ,, number, fo...",3581
1,vanishing glass nearly ten years passed since ...,2,1,"[vanishing, glass, nearly, ten, years, passed,...",2664
2,letters one escape brazilian boa constrictor e...,3,1,"[letters, one, escape, brazilian, boa, constri...",3062
3,keeper keys boom . knocked . dudley jerked awa...,4,1,"[keeper, keys, boom, ., knocked, ., dudley, je...",3288
4,diagon alley harry woke early next morning . a...,5,1,"[diagon, alley, harry, woke, early, next, morn...",5941
...,...,...,...,...,...
195,"harry remained kneeling snape 's side , simply...",33,7,"[harry, remained, kneeling, snape, 's, side, ,...",6796
196,"finally , truth . lying face pressed dusty car...",34,7,"[finally, ,, truth, ., lying, face, pressed, d...",2715
197,"lay facedown , listening silence . perfectly a...",35,7,"[lay, facedown, ,, listening, silence, ., perf...",3821
198,flying facedown ground . smell forest filled n...,36,7,"[flying, facedown, ground, ., smell, forest, f...",5483


In [4]:
#resetting index 
chapter_text= df[['Book','Chapter','Text']].reset_index().drop(['index'], axis=1) 

#dividing by sentences & tokenizing
chapter_text= chapter_text.join(chapter_text.Text.apply(sent_tokenize).rename('Sentences')) 

#putting every sentence into its own row
sentence_text = chapter_text.Sentences.apply(pd.Series) \
    .merge(chapter_text, left_index = True, right_index = True) \
    .drop(["Text"], axis = 1) \
    .drop(["Sentences"], axis = 1) \
    .melt(id_vars = ['Book', 'Chapter'], value_name = "Sentence") \
    .drop("variable", axis = 1) \
    .dropna()

#sort by Chapter and Book then update index for the order 
sentence_text=sentence_text.sort_values(by=['Book', 'Chapter']) \
    .reset_index() \
    .drop(['index'], axis = 1)

#remove capitalizations, punctuation and white space 
import re
sentence_text['Sentence'] = sentence_text['Sentence'].apply(lambda x: re.sub(r'[^\w\s]', '', x.lower().strip()))

In [5]:
sentence_text

Unnamed: 0,Book,Chapter,Sentence
0,1,1,boy lived mr mrs dursley number four privet ...
1,1,1,last people d expect involved anything strange...
2,1,1,mr dursley director firm called grunnings mad...
3,1,1,big beefy man hardly neck although large mus...
4,1,1,mrs dursley thin blonde nearly twice usual amo...
...,...,...,...
72578,7,37,train began harry walked alongside watching ...
72579,7,37,harry kept smiling waving even though like li...
72580,7,37,train rounded corner
72581,7,37,harry s hand still raised farewell ll alright...


#### Step 3: LexRank & Summarization
Run the tokenized sentences through LexRank so that the most important sentences can be extracted for the summary. LexRank then generates a summary for each chapter, and the summaries are collected into a DataFrame for later use.
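For readers unfamiliar with the method, LexRank scores each sentence by how similar it is to the other sentences in the text and keeps the most central ones. The following is a minimal sketch in plain Python, assuming whitespace tokenization and a continuous (unthresholded) similarity graph. The names `lexrank_scores` and `summarize` are hypothetical, and this is a simplified illustration rather than sumy's implementation.

```python
import math
from collections import Counter


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences with a simplified, continuous LexRank."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    if n == 0:
        return []

    # inverse document frequency, treating each sentence as one document
    doc_freq = Counter(word for words in tokenized for word in set(words))
    idf = {word: math.log(n / freq) + 1.0 for word, freq in doc_freq.items()}

    # tf-idf vector for every sentence
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        vectors.append({word: count * idf[word] for word, count in tf.items()})

    def cosine(a, b):
        dot = sum(weight * b[word] for word, weight in a.items() if word in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # similarity matrix, row-normalised so each row is a probability distribution
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    for row in sim:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n

    # power iteration with damping, as in PageRank
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * sim[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def summarize(sentences, count):
    """Return the `count` highest-scoring sentences in their original order."""
    scores = lexrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in their original order keeps an extractive summary readable, which is one reason to sort the chosen indices before joining.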

In [6]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

In [7]:
# initialize an empty list to store the summaries
summary_data = []

# create the LexRank summarizer once and reuse it for every chapter
summarizer = LexRankSummarizer()

# iterate through each chapter of each book
for book in sentence_text['Book'].unique():
    for chapter in sentence_text[sentence_text['Book'] == book]['Chapter'].unique():
        # get the sentences for this chapter of this book
        sentences = sentence_text[(sentence_text['Book'] == book) & (sentence_text['Chapter'] == chapter)]['Sentence']

        # join the sentences into a single text
        # NOTE: Step 2 stripped all punctuation, so the tokenizer may find no
        # sentence boundaries to split on in the joined text
        chapter_string = ' '.join(sentences)

        # parse and tokenize the text
        parser = PlaintextParser.from_string(chapter_string, Tokenizer("english"))

        # generate a summary of up to 45 sentences
        summary = summarizer(parser.document, sentences_count=45)

        # convert the sentences to one string and cap it at 500 characters
        summary = ' '.join(str(sentence) for sentence in summary)
        summary = summary[:500]

        # defensive cleanup in case a "<Sentence:" prefix leaks into the text
        summary = summary.replace("<Sentence:", "").strip()

        # collect the summary information in a dictionary
        summary_entry = {'Book': book, 'Chapter': chapter, 'Summary': summary}

        # append the dictionary to the list
        summary_data.append(summary_entry)

# create a dataframe from the list of dictionaries
summary_df = pd.DataFrame(summary_data)

In [8]:
summary_df

Unnamed: 0,Book,Chapter,Summary
0,1,1,boy lived mr mrs dursley number four privet ...
1,1,2,vanishing glass nearly ten years passed since ...
2,1,3,letters one escape brazilian boa constrictor e...
3,1,4,keeper keys boom knocked dudley jerked awake...
4,1,5,diagon alley harry woke early next morning al...
...,...,...,...
195,7,33,harry remained kneeling snape s side simply s...
196,7,34,finally truth lying face pressed dusty carpe...
197,7,35,lay facedown listening silence perfectly alo...
198,7,36,flying facedown ground smell forest filled no...
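
Each summary above begins with the opening words of its chapter. One possible explanation is that Step 2 strips all punctuation before the sentences are re-joined, so the text handed to the parser may no longer carry sentence boundaries. The following minimal sketch assumes a naive regex splitter as a stand-in for a real sentence tokenizer. The helper names `split_sentences` and `clean_keep_boundary` are hypothetical.

```python
import re


def split_sentences(text):
    # naive splitter: break on whitespace that follows ., ! or ?
    # (a stand-in for a real tokenizer, which also relies on punctuation)
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def clean_keep_boundary(sentence):
    # drop punctuation inside the sentence, then restore one terminal period
    body = re.sub(r'[^\w\s]', '', sentence.lower()).strip()
    return body + '.' if body else ''


raw = "Harry woke early. He kept his eyes shut! Was it a dream?"

# punctuation fully stripped, as in Step 2, then re-joined with spaces
stripped = ' '.join(re.sub(r'[^\w\s]', '', s.lower()) for s in split_sentences(raw))

# punctuation stripped, but one period kept at the end of each sentence
kept = ' '.join(clean_keep_boundary(s) for s in split_sentences(raw))

print(len(split_sentences(stripped)))  # 1
print(len(split_sentences(kept)))      # 3
```

Keeping a single terminal period per sentence is one way to let a punctuation-based tokenizer recover the boundaries while still removing the remaining punctuation.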


In [9]:
# sample check: print the summary in the last row (index 199), Book 7, Chapter 37
last_summary = summary_df.loc[199, 'Summary']
print(last_summary)

autumn seemed arrive suddenly year  morning first september crisp apple  little family bobbed across rumbling road toward great sooty station  fumes car exhausts breath pedestrians sparkled like cobwebs cold air  two large cages tattled top laden trolleys parents pushing  owls inside hooted indignantly  redheaded girl trailed fearfully behind brothers  clutching father s armit wo nt long  ll going    harry told her  two years    sniffed lily    want go    commuters stared curiously owls family w
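
The printed summary stops in the middle of a word because it is cut at a fixed character count. The following small sketch, assuming the same 500-character budget, uses the standard library's `textwrap.shorten` to cut at a word boundary instead. The helper name `truncate_on_word` is hypothetical.

```python
import textwrap


def truncate_on_word(text, max_chars=500):
    # textwrap.shorten collapses runs of whitespace and drops whole words
    # from the end until the text fits within max_chars
    return textwrap.shorten(text, width=max_chars, placeholder="")


print(truncate_on_word("two large cages rattled", max_chars=12))  # two large
```

As a side effect, the double spaces left behind by punctuation removal are collapsed to single spaces.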


#### Step 4: Language Generation 
Using BERT, the function lets users specify a chapter from a book and then draws on the summaries of all the chapters leading up to that chapter. The model generates a story of a user-specified length from the details of the specified chapter, following the plot and writing style of the Harry Potter books. *Stretch goal: incorporate a sentiment model so that the generated story follows a tone similar to the sentiment score of the text (negative, positive, or neutral).*
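
The idea of drawing on every summary that precedes the specified chapter can be sketched as follows. This assumes the summaries are available as a list of dictionaries with `Book`, `Chapter`, and `Summary` keys, which is the shape `summary_df.to_dict("records")` would produce. The helper name `summaries_before` and the sample rows are hypothetical.

```python
def summaries_before(summaries, book, chapter):
    """Return the summaries of all chapters strictly before (book, chapter)."""
    ordered = sorted(summaries, key=lambda row: (row["Book"], row["Chapter"]))
    return [
        row["Summary"]
        for row in ordered
        if (row["Book"], row["Chapter"]) < (book, chapter)
    ]


# placeholder rows for illustration only
rows = [
    {"Book": 1, "Chapter": 1, "Summary": "summary 1-1"},
    {"Book": 1, "Chapter": 2, "Summary": "summary 1-2"},
    {"Book": 1, "Chapter": 3, "Summary": "summary 1-3"},
    {"Book": 2, "Chapter": 1, "Summary": "summary 2-1"},
]

context = " ".join(summaries_before(rows, book=1, chapter=3))
print(context)  # summary 1-1 summary 1-2
```

Comparing `(Book, Chapter)` tuples handles chapters from earlier books without a separate branch for each book.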

In [23]:
import torch
from transformers import BertTokenizer, BertForMaskedLM

# specify the book and chapter
specified_book = 1  # Update the book number here
specified_chapter = 3 # Update the chapter number here

# loading the BERT model and tokenizer
model_bert = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [24]:
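# create input data and tokenize
# use the specified book and chapter; the output recorded below came from a run that
# used the leftover Step 3 loop variables `chapter` and `book` (37 and 7)
input_data = f"Chapter {specified_chapter} from {specified_book}"
input_tokens = tokenizer_bert.tokenize(input_data)
input_ids = tokenizer_bert.convert_tokens_to_ids(input_tokens)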

In [25]:
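# attempt to generate a new story from the chapter prompt using BERT
# NOTE: BertForMaskedLM is an encoder trained to fill in masked tokens, not a
# left-to-right generator, so generate() adds little beyond the prompt (see the output below)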
input_ids_tensor = torch.tensor([input_ids])
with torch.no_grad():
    generated_ids = model_bert.generate(input_ids_tensor)

generated_story = tokenizer_bert.decode(generated_ids[0], skip_special_tokens=True)

In [26]:
# print the generated story
print(generated_story)

chapter 37 from 7 - - - - - - - -........................


##### Step 4.5: Attempt to generate a story whose mood follows the sentiment score of the summary
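
One lightweight way to turn a summary into a mood is a word list with sentiment scores. The sketch below uses a tiny hypothetical lexicon (`TOY_LEXICON`) purely for illustration. A published lexicon such as VADER, which NLTK provides through `nltk.sentiment.vader.SentimentIntensityAnalyzer`, could take its place. The mood labels follow the `vibe_dict` used later in this step, and the 0.05 threshold is an assumption.

```python
import re

# hypothetical toy lexicon, for illustration only
TOY_LEXICON = {
    "happy": 1.0, "laughed": 0.8, "smiling": 0.7,
    "angry": -1.0, "screamed": -0.8, "afraid": -0.6,
}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Average the scores of the words found in the lexicon (0.0 if none match)."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[word] for word in words if word in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def mood_for(score, threshold=0.05):
    """Map a score to one of the three mood labels."""
    if score >= threshold:
        return "happy"
    if score <= -threshold:
        return "angry"
    return "boring"


def build_prompt(summary):
    mood = mood_for(sentiment_score(summary))
    return f"Once upon a time, in a world filled with {mood}..."


print(build_prompt("harry kept smiling and laughed"))
# Once upon a time, in a world filled with happy...
```

Because `mood_for` always returns one of the three labels, the prompt can never contain `None`.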

In [27]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# load the GPT-2 tokenizer and model
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained('gpt2')
model_gpt2 = GPT2LMHeadModel.from_pretrained('gpt2') 

In [28]:
# find the summary row that matches the specified book and chapter
rows = summary_df.loc[(summary_df['Book'] == specified_book) & (summary_df['Chapter'] == specified_chapter)]

# raise an error if the book and chapter cannot be found; check that the book and chapter
# are passed as integers, not as strings
if rows.empty:
    raise ValueError("Specified book and chapter not found.")

# get the summary for the specified book and chapter
specified_chapter_summary = rows['Summary'].iloc[0]

In [31]:
# use the specified chapter's summary as input
input_data2 = specified_chapter_summary

In [32]:
# tokenize the input data with BERT tokenizer
input_tokens_bert = tokenizer_bert.tokenize(input_data2)
input_ids_bert = tokenizer_bert.convert_tokens_to_ids(input_tokens_bert)
input_tensor_bert = torch.tensor([input_ids_bert])  # Convert to tensor
attention_mask_bert = torch.ones_like(input_tensor_bert)  # Attention mask for BERT model


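# attempt to predict a sentiment class with the BERT model
# NOTE: BertForMaskedLM returns per-token vocabulary logits of shape
# (1, sequence_length, vocab_size), not sentiment classes, so the argmax over the
# flattened tensor will rarely be 0, 1, or 2; a model with a classification head
# fine-tuned for sentiment would be needed here
with torch.no_grad():
    sentiment_logits = model_bert(input_tensor_bert, attention_mask=attention_mask_bert)[0]
    predicted_sentiment = torch.argmax(sentiment_logits).item()

# map sentiment classes to moods with a dictionary
vibe_dict = {
    0: 'happy',
    1: 'boring',
    2: 'angry'
}

# generate the story with the specified emotion and the GPT-2 model
# NOTE: would love a follow-up email if there is a word bank or library associated with
# sentiment scores that could have been used here to fill the story with adjectives and
# verbs describing the emotion, instead of having to hardcode things in
# .get() returns None when the predicted index has no entry (as in the output below)
vibe = vibe_dict.get(predicted_sentiment)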

prompt = f"Once upon a time, in a world filled with {vibe}..."
input_ids_gpt2 = tokenizer_gpt2.encode(prompt, add_special_tokens=True, return_tensors="pt")
attention_mask_gpt2 = torch.ones_like(input_ids_gpt2)  # Attention mask for GPT-2 model

with torch.no_grad():
    output = model_gpt2.generate(
        input_ids_gpt2,
        attention_mask=attention_mask_gpt2,
        pad_token_id=tokenizer_gpt2.eos_token_id,
        max_length=200,
        num_return_sequences=1
    )

generated_story = tokenizer_gpt2.decode(output[0], skip_special_tokens=True)

# Print the generated story
print(generated_story)

Once upon a time, in a world filled with None...

The world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead, the world of the dead,
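
The looping phrase in the output above is typical of greedy decoding, which always picks the single most likely next token. The following is a minimal sketch, assuming the `model_gpt2.generate` call from the cell above. The hypothetical helper `repeated_ngram_fraction` measures how much of a text consists of repeated n-grams, and `sampling_kwargs` collects standard `generate` options that are commonly used to reduce loops. The specific values are illustrative assumptions, not tuned settings.

```python
def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 means no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# illustrative values only (assumptions, not tuned for this project)
sampling_kwargs = {
    "do_sample": True,           # sample instead of always taking the top token
    "top_k": 50,                 # keep only the 50 most likely tokens
    "top_p": 0.95,               # nucleus sampling cutoff
    "temperature": 0.9,          # slightly sharpen the distribution
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram
    "max_length": 200,
    "num_return_sequences": 1,
}

# intended use with the objects from the cell above (not executed here):
# output = model_gpt2.generate(
#     input_ids_gpt2,
#     attention_mask=attention_mask_gpt2,
#     pad_token_id=tokenizer_gpt2.eos_token_id,
#     **sampling_kwargs,
# )

looping = "the world of the dead, " * 5
print(round(repeated_ngram_fraction(looping), 2))  # 0.77
```

A score near zero on a generated story would suggest the loop is gone, although it says nothing about whether the text is coherent.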


## Conclusion and Final Thoughts 

While I wouldn't go as far as to say I am disappointed in my model, it did not yield anything close to the outcome I hoped for. The main challenge was the limited amount of available data. Initially, I believed that the Harry Potter series would provide more than enough source material for basic generation and, eventually, for fine-tuning the BERT model. However, I vastly underestimated how much data BERT needs: in practice, the entire book series is only a fraction of a proper training set for BERT. Even when using all the summaries up to the specified book and chapter as a reference, there was not enough for the model to learn the series' patterns, and I had trouble generating coherent summaries and text.

Nevertheless, this project has given me a proof of concept suggesting that summarizing academic essays for students is achievable. To adapt it for practical use with younger age groups, for series such as Harry Potter, I would ideally build a reference database that uses a rule-based or template-style approach to generate meaningful summaries. In the future, I am excited to explore additional machine-learning techniques to fine-tune the GPT-2 model. Although I considered running my text generator on the full Harry Potter texts, that project is well researched and would deviate from the focus of my final.

Overall, I have gained valuable insight from working with BERT, LexRank, and GPT-2. Despite not achieving the desired success, I am pleased that I incorporated the stretch goal of sentiment analysis and learned how sentiment models can be integrated into the broader theme of my project, as I believe this will be a valuable addition in the future.