<a id="top"></a>
# Table of contents

#### 1. [Package instalation (optional)](#1)
#### 2. [Data loading](#2)
#### 3. [Feature Engineering](#3)
- ##### 3.1. [Outline](#3_1)
- ##### 3.2. [Tokenization](#3_2)
- ##### 3.3. [Basic word- and sentence-level metrics](#3_3)
- ##### 3.4. [Subjectivity and Polarity metrics](#3_4)
- ##### 3.5. [Cosine subjectivity between prompt_text and summary](#3_5)
- ##### 3.6. [Readability score](#3_6)
- ##### 3.7. [Misspelling frequency](#3_7)
- ##### 3.8. [Topic overlap](#3_8)
- ##### 3.9. [N-gram overlap](#3_9)

<a id="1"></a>
## 1. Package installation

In [64]:
#! pip install transformers sentence-transformers
! pip install word2number

Collecting word2number
  Downloading word2number-1.1.zip (9.7 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hBuilding wheels for collected packages: word2number
  Building wheel for word2number (setup.py) ... [?25ldone
[?25h  Created wheel for word2number: filename=word2number-1.1-py3-none-any.whl size=5567 sha256=bbeedc7e38d74e78777c338aac8b0f400ff4146c4f6a1b37993206af5547b5c8
  Stored in directory: /Users/duje/Library/Caches/pip/wheels/cd/ef/ae/073b491b14d25e2efafcffca9e16b2ee6d114ec5c643ba4f06
Successfully built word2number
Installing collected packages: word2number
Successfully installed word2number-1.1


<a id="2"></a>
## 2. Data loading

In [1]:
import pandas as pd
from IPython.display import display

In [2]:
summaries_train_df = pd.read_csv('../data/summaries_train.csv')
prompts_train_df = pd.read_csv('../data/prompts_train.csv')

In [3]:
#join the two data frames based on a unique key and drop unnecessary columns
joined_df = pd.merge(prompts_train_df, summaries_train_df, on = 'prompt_id')
joined_df.drop(['prompt_id', 'student_id'], axis = 1, inplace = True)

#rename 'text' column to 'summary'
joined_df.rename(columns = {'text' : 'summary'}, inplace=True)

joined_df.head(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording
0,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,1 element of an ideal tragedy is that it shoul...,-0.210614,-0.471415
1,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,The three elements of an ideal tragedy are: H...,-0.970237,-0.417058
2,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,Aristotle states that an ideal tragedy should ...,-0.387791,-0.584181


<a id='3'></a>
## 3. Feature Engineering

<a id='3_1'></a>
### 3.1. Outline

The final goal is to train the model to predict content and wording scores.

• Wording Model  

a) Voice

Voice in writing refers to the author's distinctive style and tone. In the context of grading student summaries, "using objective language" means that the summary should avoid personal opinions, emotional language, or subjective statements. It should be neutral and objective, presenting the facts from the source text without adding the author's own perspective.  

b) Paraphrase  

Paraphrasing involves restating the information from the source text in a new way, without changing its meaning. A high score in paraphrasing means that the summary effectively conveys the key points of the source text in a concise and clear manner. It should avoid direct copying of sentences from the source.

c) Language  

This component assesses the quality of the language used in the summary. It considers factors such as vocabulary choice, sentence structure, and grammar. A good summary should use appropriate and varied vocabulary, follow correct grammar rules, and have coherent sentence structure.  

• Content Scores: 

a) Main idea  

This aspect evaluates how well the summary captures the primary message or main idea of the source text. A high score means that the summary effectively identifies and conveys the central theme or argument of the source.  

b) Details  

Details refer to specific information, examples, or evidence from the source text. A good summary should accurately represent these details without omitting crucial information or including irrelevant details. The summary should focus on the most relevant supporting details.  

c) Cohesion  

Cohesion assesses how well the summary transitions from one idea to the next. It considers the flow of the summary and how well sentences and paragraphs are connected. A high score indicates that the summary has a logical and smooth progression of ideas.


Some features that could be useful:  

• Extract average sentence length, average word length, word count, unique and stopwords percentage from prompt_text and summary and divide to create new feature.  

• For each summary calculate subjectivity and emotional tone (polarity).  

• Calculate cosine similarity between prompt_text and summary.  

• Calculate readability score. 

• Calculate frequency of misspelled words in student summaries.   

• Extract topics from prompt_text, prompt_question and student summaries. Use overlap as a feature.  

• Calculate most used 2-grams and 3-grams in prompt_text and summaries and calculate overlap in the 2 categories. Use this as a new feature. 

• Perform Named Entity Recognition (NER) on prompt_text and summaries, and calculate overlap to access if relevant features are captured.  

• Calculate the frequency of transition words in summaries to evaluate cohesion.  

###### [Go to top](#top)

<a id='3_2'></a>
### 3.2. Tokenization

In [4]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt' resource
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/duje/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
joined_df['prompt_question_tokenized'] = joined_df['prompt_question'].apply(word_tokenize)
joined_df['prompt_title_tokenized'] = joined_df['prompt_title'].apply(word_tokenize)
joined_df['prompt_text_tokenized'] = joined_df['prompt_text'].apply(word_tokenize)
joined_df['summary_tokenized'] = joined_df['summary'].apply(word_tokenize)

joined_df.head(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording,prompt_question_tokenized,prompt_title_tokenized,prompt_text_tokenized,summary_tokenized
0,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,1 element of an ideal tragedy is that it shoul...,-0.210614,-0.471415,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[1, element, of, an, ideal, tragedy, is, that,..."
1,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,The three elements of an ideal tragedy are: H...,-0.970237,-0.417058,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[The, three, elements, of, an, ideal, tragedy,..."
2,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,Aristotle states that an ideal tragedy should ...,-0.387791,-0.584181,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[Aristotle, states, that, an, ideal, tragedy, ..."


###### [Go to top](#top)

<a id='3_3'></a>
### 3.3. Basic word- and sentence-level metrics

In [6]:
from nltk.tokenize import sent_tokenize

from nltk.corpus import stopwords
stop=set(stopwords.words('english'))

In [7]:
def count_sentences(text):
    
    sentences = sent_tokenize(text)
    sentence_count = len(sentences)
    
    return sentence_count

def count_total_words(text):
    
    words = word_tokenize(text)
    
    special_characters = [".", ",", "!", "?", ":", ";", "'", '"', "(", ")", "[", "]", "{", "}"]
    words = [word for word in words if word not in special_characters]
    word_count = len(words)
    
    return word_count

def get_unique_words_percentage(text):
    
    words = word_tokenize(text)
    
    special_characters = [".", ",", "!", "?", ":", ";", "'", '"', "(", ")", "[", "]", "{", "}"]
    words = [word for word in words if word not in special_characters]
    unique_words = set(words)
    unique_word_count = len(unique_words)
    
    unique_word_percentage = unique_word_count / len(words)
    
    return unique_word_percentage    
    
def get_stopwords_percentage(text):
    
    words = word_tokenize(text)
    stopwords = [word for word in words if word not in stop and word.isalnum()]

    special_characters = [".", ",", "!", "?", ":", ";", "'", '"', "(", ")", "[", "]", "{", "}"]
    words = [word for word in words if word not in special_characters]
    
    stopwords_percentage = len(stopwords) / len(words)
    
    return stopwords_percentage
    
    

In [8]:
#sentence count
joined_df['prompt_text_sentence_count'] = joined_df['prompt_text'].apply(count_sentences)
joined_df['summary_sentence_count'] = joined_df['summary'].apply(count_sentences)
joined_df['sentence_count_ratio'] = joined_df['summary_sentence_count'] / joined_df['prompt_text_sentence_count']

#word count
joined_df['prompt_text_word_count'] = joined_df['prompt_text'].apply(count_total_words)
joined_df['summary_word_count'] = joined_df['summary'].apply(count_total_words)
joined_df['word_count_ratio'] = joined_df['summary_word_count'] / joined_df['prompt_text_word_count']

#average sentence length
joined_df['prompt_text_avg_sentence_length'] = joined_df['prompt_text_word_count'] / joined_df['prompt_text_sentence_count']
joined_df['summary_avg_sentence_length'] = joined_df['summary_word_count'] / joined_df['summary_sentence_count']
joined_df['avg_sentence_length_ratio'] = joined_df['summary_avg_sentence_length'] / joined_df['prompt_text_avg_sentence_length']

#percentage of unique words
joined_df['prompt_text_unique_words_percentage'] = joined_df['prompt_text'].apply(get_unique_words_percentage)
joined_df['summary_unique_words_percentage'] = joined_df['summary'].apply(get_unique_words_percentage)
joined_df['unique_words_percentage_ratio'] = joined_df['summary_unique_words_percentage'] / joined_df['prompt_text_unique_words_percentage']

#percentage of stopwords
joined_df['prompt_text_stopwords_percentage'] = joined_df['prompt_text'].apply(get_stopwords_percentage)
joined_df['summary_stopwords_percentage'] = joined_df['summary'].apply(get_stopwords_percentage)
joined_df['stopwords_percentage_ratio'] = joined_df['summary_stopwords_percentage'] / joined_df['prompt_text_stopwords_percentage']

In [9]:
joined_df.sample(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording,prompt_question_tokenized,prompt_title_tokenized,prompt_text_tokenized,summary_tokenized,...,word_count_ratio,prompt_text_avg_sentence_length,summary_avg_sentence_length,avg_sentence_length_ratio,prompt_text_unique_words_percentage,summary_unique_words_percentage,unique_words_percentage_ratio,prompt_text_stopwords_percentage,summary_stopwords_percentage,stopwords_percentage_ratio
4885,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...,It developed because everyone wanted to fit in...,-0.667585,-0.163822,"[Summarize, how, the, Third, Wave, developed, ...","[The, Third, Wave]","[Background, The, Third, Wave, experiment, too...","[It, developed, because, everyone, wanted, to,...",...,0.047619,24.36,14.5,0.595238,0.466338,0.965517,2.070423,0.525452,0.448276,0.853125
3826,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...,The structure of the ancient egyptcians was li...,0.142037,-0.289107,"[In, complete, sentences, ,, summarize, the, s...","[Egyptian, Social, Structure]","[Egyptian, society, was, structured, like, a, ...","[The, structure, of, the, ancient, egyptcians,...",...,0.144144,12.613636,26.666667,2.114114,0.54955,0.775,1.410246,0.601802,0.5375,0.893151
5042,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...,The Third Wave quickly picked up speed and gre...,0.205683,0.380538,"[Summarize, how, the, Third, Wave, developed, ...","[The, Third, Wave]","[Background, The, Third, Wave, experiment, too...","[The, Third, Wave, quickly, picked, up, speed,...",...,0.110016,24.36,13.4,0.550082,0.466338,0.835821,1.792306,0.525452,0.58209,1.107789


###### [Go to top](#top)

<a id='3_4'></a>
### 3.4. Subjectivity and Polarity metrics

In [10]:
from textblob import TextBlob

def polarity(text):
    return TextBlob(text).sentiment.polarity

def subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

In [11]:
joined_df['prompt_text_polarity_score'] = joined_df['prompt_text'].apply(lambda x: polarity(x))
joined_df['summary_polarity_score'] = joined_df['summary'].apply(lambda x: polarity(x))
joined_df['polarity_score_ratio'] = joined_df['summary_polarity_score'] / joined_df['prompt_text_polarity_score']

joined_df['prompt_text_subjectivity_score'] = joined_df['prompt_text'].apply(lambda x: subjectivity(x))
joined_df['summary_subjectivity_score'] = joined_df['summary'].apply(lambda x: subjectivity(x))
joined_df['subjectivity_score_ratio'] = joined_df['summary_subjectivity_score'] / joined_df['prompt_text_subjectivity_score']

joined_df.sample(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording,prompt_question_tokenized,prompt_title_tokenized,prompt_text_tokenized,summary_tokenized,...,unique_words_percentage_ratio,prompt_text_stopwords_percentage,summary_stopwords_percentage,stopwords_percentage_ratio,prompt_text_polarity_score,summary_polarity_score,polarity_score_ratio,prompt_text_subjectivity_score,summary_subjectivity_score,subjectivity_score_ratio
2544,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...,The government of ancient Egypt worked as a hi...,0.531368,0.583991,"[In, complete, sentences, ,, summarize, the, s...","[Egyptian, Social, Structure]","[Egyptian, society, was, structured, like, a, ...","[The, government, of, ancient, Egypt, worked, ...",...,1.2063,0.601802,0.617978,1.026879,0.206463,0.1170068,0.5667216,0.494048,0.671769,1.359725
6590,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an...",Factories used various to cover up spoiled mea...,-0.002466,-0.045439,"[Summarize, the, various, ways, the, factory, ...","[Excerpt, from, The, Jungle]","[With, one, member, trimming, beef, in, a, can...","[Factories, used, various, to, cover, up, spoi...",...,1.850632,0.457782,0.569444,1.24392,-0.002368,-0.1,42.22277,0.404563,0.216667,0.535557
6188,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an...",the meat packaging industry used many ways to ...,-0.301962,0.077857,"[Summarize, the, various, ways, the, factory, ...","[Excerpt, from, The, Jungle]","[With, one, member, trimming, beef, in, a, can...","[the, meat, packaging, industry, used, many, w...",...,1.637952,0.457782,0.469697,1.026027,-0.002368,5.5511150000000004e-17,-2.343834e-14,0.404563,0.488889,1.208437


###### [Go to top](#top)

<a id='3_5'></a>
### 3.5. Cosine similarity between prompt_text and summary

We will use pre-trained SBERT model to perform prompt_text and summary embeddings. Then, we calculate cosine similarity between the two.

In [12]:
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer
from sentence_transformers import util

In [13]:
model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
sbert_model = SentenceTransformer(model_name)

In [14]:
def calculate_cosine_similarity(promp_text, summary):
    
    prompt_text_embedding = sbert_model.encode(promp_text, convert_to_tensor=True)
    summary_embedding = sbert_model.encode(summary, convert_to_tensor=True)
    similarity = util.pytorch_cos_sim(summary_embedding, prompt_text_embedding)
    
    return similarity.item()

In [None]:
joined_df['cosine_similarity'] = joined_df.apply(lambda x: calculate_cosine_similarity(x['prompt_text'], x['summary']), axis=1)

In [None]:
joined_df.sample(3)

In [7]:
import matplotlib.pyplot as plt

joined_df['cosine_similarity'].hist(bins=50, grid=False)
plt.title('Cosine Similarity Distribution')
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')

KeyError: 'cosine_similarity'

###### [Go to top](#top)

<a id='3_6'></a>
### 3.6. Readability score

In [15]:
from textstat import flesch_reading_ease

In [16]:
joined_df['prompt_question_readability_score'] = joined_df['prompt_question'].apply(lambda x: flesch_reading_ease(x))
joined_df['summary_readability_score'] = joined_df['summary'].apply(lambda x: flesch_reading_ease(x))
joined_df['readability_score_ratio'] = joined_df['summary_readability_score'] / joined_df['prompt_question_readability_score']

joined_df.sample(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording,prompt_question_tokenized,prompt_title_tokenized,prompt_text_tokenized,summary_tokenized,...,stopwords_percentage_ratio,prompt_text_polarity_score,summary_polarity_score,polarity_score_ratio,prompt_text_subjectivity_score,summary_subjectivity_score,subjectivity_score_ratio,prompt_question_readability_score,summary_readability_score,readability_score_ratio
4267,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...,It happened so fastly because the orig nail ki...,-1.547163,-1.461245,"[Summarize, how, the, Third, Wave, developed, ...","[The, Third, Wave]","[Background, The, Third, Wave, experiment, too...","[It, happened, so, fastly, because, the, orig,...",...,0.860937,-0.004218,0.06875,-16.300652,0.359906,0.41875,1.163498,60.65,38.32,0.631822
6128,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an...",In the first paragraph it states that whenever...,0.297031,-0.168734,"[Summarize, the, various, ways, the, factory, ...","[Excerpt, from, The, Jungle]","[With, one, member, trimming, beef, in, a, can...","[In, the, first, paragraph, it, states, that, ...",...,1.034737,-0.002368,0.05,-21.111383,0.404563,0.216667,0.535557,62.34,67.25,1.078762
1107,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,"A tragedy should not satisfy your morals, so t...",-0.831253,0.550583,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[A, tragedy, should, not, satisfy, your, moral...",...,1.034668,0.011411,-0.366667,-32.131842,0.491573,0.422222,0.85892,58.28,82.95,1.423301


###### [Go to top](#top)

<a id='3_7'></a>
### 3.7. Misspelling frequency

In [17]:
from spellchecker import SpellChecker
sc = SpellChecker()

In [18]:
def calculate_misspeling_percentage(word_list):

    misspelled = sc.unknown(word_list)
    percentage_of_misspelled = len(misspelled)/len(word_list)
    
    return percentage_of_misspelled

In [19]:
joined_df['summary_misspelled_words_percentage'] = joined_df['summary_tokenized'].apply(lambda x: calculate_misspeling_percentage(x))

In [20]:
joined_df.sample(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording,prompt_question_tokenized,prompt_title_tokenized,prompt_text_tokenized,summary_tokenized,...,prompt_text_polarity_score,summary_polarity_score,polarity_score_ratio,prompt_text_subjectivity_score,summary_subjectivity_score,subjectivity_score_ratio,prompt_question_readability_score,summary_readability_score,readability_score_ratio,summary_misspelled_words_percentage
71,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,Tragedy should contain and replicate actions t...,0.050689,0.260165,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[Tragedy, should, contain, and, replicate, act...",...,0.011411,-0.075,-6.572422,0.491573,0.516667,1.051047,58.28,69.07,1.185141,0.0
5625,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an...",the meat that was taken out of pickle would o...,-1.547163,-1.461245,"[Summarize, the, various, ways, the, factory, ...","[Excerpt, from, The, Jungle]","[With, one, member, trimming, beef, in, a, can...","[the, meat, that, was, taken, out, of, pickle,...",...,-0.002368,-0.15,63.334149,0.404563,0.1,0.24718,62.34,68.78,1.103304,0.026316
5685,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an...","The factory would take the meat ""out of pickle...",-1.547163,-1.461245,"[Summarize, the, various, ways, the, factory, ...","[Excerpt, from, The, Jungle]","[With, one, member, trimming, beef, in, a, can...","[The, factory, would, take, the, meat, ``, out...",...,-0.002368,-0.15,63.334149,0.404563,0.1,0.24718,62.34,76.9,1.233558,0.055556


###### [Go to top](#top)

<a id='3_8'></a>
### 3.8. Topic overlap

In [21]:
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
lem=WordNetLemmatizer()

from gensim import corpora
from gensim.models import LdaModel
import re

[nltk_data] Downloading package punkt to /Users/duje/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/duje/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [22]:
def tokenize(text):
    
    tokens = pd.Series(text).apply(lambda x: word_tokenize(x))
    return tokens

def filter_stopwords(tokens):
    return [word for word in tokens if word not in stop and word.isalnum()]

def remove_stopwords(tokens):
    
    filtered_text = tokens.apply(filter_stopwords)
    return filtered_text

def remove_special_characters(tokens):
    
    special_characters = [".", ",", "!", "?", ":", ";", "'", '"', "(", ")", "[", "]", "{", "}"]
    filtered_text = tokens.apply(lambda x: [word for word in x if word not in special_characters])
    return filtered_text

def make_lowercase(tokens):
    
    lowercase_text = tokens.apply(lambda x: [word.lower() for word in x])
    return lowercase_text

def create_corpus(tokens):
    
    corpus = [word for sublist in tokens for word in sublist]
    return corpus

def lemmatize_and_wrap_in_list(corpus):
    
    infinitives = [[lem.lemmatize(w) for w in corpus]]
    return infinitives

def create_bow_corpus(dictionary, infinitives):
    
    bow_corpus = [dictionary.doc2bow(doc) for doc in infinitives]
    return bow_corpus

def create_lda_model(bow_corpus, num_topics, dictionary, num_passes = 15):
    
    lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=num_passes)
    return lda_model

def extract_topics(lda_model, num_words = 5):
    
    topics = lda_model.print_topics(num_words=num_words)
    return topics

def get_topic_words(topics):
    
    topic_words = " ".join([" ".join(re.findall(r'"([^"]*)"', topic[1])) for topic in topics])
    return topic_words

In [23]:
def calculate_topic_overlap(prompt_question, prompt_text, summary):

    num_topics = 3
    
    #tokenize
    prompt_question_tokens = tokenize(prompt_question)
    prompt_text_tokens = tokenize(prompt_text)
    prompt_summary_tokens = tokenize(summary)
    
    #remove stopwords
    prompt_question_non_stopwords = remove_stopwords(prompt_question_tokens)
    prompt_text_non_stopwords = remove_stopwords(prompt_text_tokens)
    prompt_summary_non_stopwords = remove_stopwords(prompt_summary_tokens)
    
    #remove special characters
    prompt_question_non_stopwords = remove_special_characters(prompt_question_non_stopwords)
    prompt_text_non_stopwords = remove_special_characters(prompt_text_non_stopwords)
    prompt_summary_non_stopwords = remove_special_characters(prompt_summary_non_stopwords)  
    
    #lowercase
    prompt_question_non_stopwords = make_lowercase(prompt_question_non_stopwords)
    prompt_text_non_stopwords = make_lowercase(prompt_text_non_stopwords)
    prompt_summary_non_stopwords = make_lowercase(prompt_summary_non_stopwords)
    
    #flatten lists to create corpus
    prompt_combined_non_stopwords = prompt_question_non_stopwords + prompt_text_non_stopwords
    corpus_prompt = create_corpus(prompt_combined_non_stopwords)
    corpus_summary = create_corpus(prompt_summary_non_stopwords)
    
    #lemmatize and wrap in list for gensim (i.e. create list of lists as gensim expects)
    infinitives_prompt = lemmatize_and_wrap_in_list(corpus_prompt)
    infinitives_summary = lemmatize_and_wrap_in_list(corpus_summary)
    
    #create dictionary
    dictionary_prompt = corpora.Dictionary(infinitives_prompt)
    dictionary_summary = corpora.Dictionary(infinitives_summary)
    
    #create bag of words corpus
    bow_corpus_prompt = create_bow_corpus(dictionary_prompt, infinitives_prompt)
    bow_corpus_summary = create_bow_corpus(dictionary_summary, infinitives_summary)
    
    #create LDA model
    lda_model_prompt = create_lda_model(bow_corpus_prompt, num_topics, dictionary_prompt)
    lda_model_summary = create_lda_model(bow_corpus_summary, num_topics, dictionary_summary)
    
    #get topics
    topics_prompt = extract_topics(lda_model_prompt)
    topic_summary = extract_topics(lda_model_summary)
    
    #get words describing topics and calculate cosine similarity between them
    topics_prompt_words = get_topic_words(topics_prompt)
    topics_summary_words = get_topic_words(topic_summary)
    
    topic_overlap = calculate_cosine_similarity(topics_prompt_words, topics_summary_words)
    
    #return topics_prompt_words, topics_summary_words, topic_overlap
    return pd.Series([topics_prompt_words, topics_summary_words, topic_overlap])

In [None]:
joined_df[['topics_prompt_words', 'topics_summary_words', 'topic_overlap']] = joined_df.apply(lambda x: calculate_topic_overlap(x['prompt_question'], x['prompt_text'], x['summary']), axis=1)

In [None]:
joined_df['topic_overlap'].hist(bins=50, grid=False)

joined_df['cosine_similarity'].hist(bins=50, grid=False, alpha=0.3)
plt.xlabel('Cosine Similarity score')
plt.ylabel('Frequency')
plt.legend(['Topic Overlap', 'Text-Summary Cosine Similarity'])

###### [Go to top](#top)

<a id='3_9'></a>
### 3.9. N-gram overlap

In [65]:
from nltk.util import ngrams
from collections import Counter

from word2number import w2n


In [103]:
def convert_words_to_numbers(text):
    words = text.split()
    converted_words = []

    for word in words:
        try:
            converted_word = str(w2n.word_to_num(word))
            converted_words.append(converted_word)
        except ValueError:  # word_to_num raises ValueError if it can't convert
            converted_words.append(word)

    converted_text = ' '.join(converted_words)
    return converted_text

In [120]:
def calculate_ngram_overlap(prompt_question, prompt_text, summary, top_n=20):
    # Tokenize
    prompt_question_tokens = tokenize(prompt_question)
    prompt_text_tokens = tokenize(prompt_text)
    prompt_summary_tokens = tokenize(summary)

    # Remove stopwords
    prompt_question_non_stopwords = remove_stopwords(prompt_question_tokens)
    prompt_text_non_stopwords = remove_stopwords(prompt_text_tokens)
    prompt_summary_non_stopwords = remove_stopwords(prompt_summary_tokens)

    # Remove special characters
    prompt_question_non_stopwords = remove_special_characters(prompt_question_non_stopwords)
    prompt_text_non_stopwords = remove_special_characters(prompt_text_non_stopwords)
    prompt_summary_non_stopwords = remove_special_characters(prompt_summary_non_stopwords)

    # Lowercase
    prompt_question_non_stopwords = make_lowercase(prompt_question_non_stopwords)
    prompt_text_non_stopwords = make_lowercase(prompt_text_non_stopwords)
    prompt_summary_non_stopwords = make_lowercase(prompt_summary_non_stopwords)

    # Flatten lists to create corpus
    prompt_combined_non_stopwords = prompt_question_non_stopwords + prompt_text_non_stopwords
    corpus_prompt = create_corpus(prompt_combined_non_stopwords)
    corpus_summary = create_corpus(prompt_summary_non_stopwords)

    # Convert numbers to text in the corpus (e.g., 1 -> one)
    corpus_prompt = [convert_words_to_numbers(text) for text in corpus_prompt]
    corpus_summary = [convert_words_to_numbers(text) for text in corpus_summary]

    # Create bigrams and trigrams, and calculate their frequency
    bigrams_prompt = ngrams(corpus_prompt, 2)
    trigrams_prompt = ngrams(corpus_prompt, 3)
    bigrams_summary = ngrams(corpus_summary, 2)
    trigrams_summary = ngrams(corpus_summary, 3)

    bigrams_prompt_list = [' '.join(bigram) for bigram in bigrams_prompt]
    trigrams_prompt_list = [' '.join(trigram) for trigram in trigrams_prompt]
    bigrams_summary_list = [' '.join(bigram) for bigram in bigrams_summary]
    trigrams_summary_list = [' '.join(trigram) for trigram in trigrams_summary]

    # Calculate the frequency of each bigram and trigram in both lists
    prompt_bigram_freq = Counter(bigrams_prompt_list)
    prompt_trigram_freq = Counter(trigrams_prompt_list)
    summary_bigram_freq = Counter(bigrams_summary_list)
    summary_trigram_freq = Counter(trigrams_summary_list)

    # Calculate the overlap between the two sets of bigrams and trigrams
    matching_bigrams = prompt_bigram_freq.keys() & summary_bigram_freq.keys()
    matching_trigrams = prompt_trigram_freq.keys() & summary_trigram_freq.keys()

    # Calculate the overlap ratios
    bigram_overlap_ratio = len(matching_bigrams) / len(prompt_bigram_freq)
    trigram_overlap_ratio = len(matching_trigrams) / len(prompt_trigram_freq)

    return pd.Series([bigram_overlap_ratio, trigram_overlap_ratio])


In [121]:
joined_df[['bigram_ratio', 'trigram_ratio']] = joined_df.apply(lambda x: calculate_ngram_overlap(x['prompt_question'], x['prompt_text'], x['summary']), axis=1)