<a id="top"></a>
# Table of contents

#### 1. [Package instalation (optional)](#1)
#### 2. [Data loading](#2)
#### 3. [Feature Engineering](#3)
- ##### 3.1. [Outline](#3_1)
- ##### 3.2. [Tokenization](#3_2)
- ##### 3.3. [Basic word- and sentence-level metrics](#3_3)

<a id="1"></a>
## 1. Package installation

<a id="2"></a>
## 2. Data loading

In [1]:
import pandas as pd
from IPython.display import display

In [2]:
summaries_train_df = pd.read_csv('../data/summaries_train.csv')
prompts_train_df = pd.read_csv('../data/prompts_train.csv')

In [3]:
#join the two data frames based on a unique key and drop unnecessary columns
joined_df = pd.merge(prompts_train_df, summaries_train_df, on = 'prompt_id')
joined_df.drop(['prompt_id', 'student_id'], axis = 1, inplace = True)

#rename 'text' column to 'summary'
joined_df.rename(columns = {'text' : 'summary'}, inplace=True)

joined_df.head(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording
0,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,1 element of an ideal tragedy is that it shoul...,-0.210614,-0.471415
1,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,The three elements of an ideal tragedy are: H...,-0.970237,-0.417058
2,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,Aristotle states that an ideal tragedy should ...,-0.387791,-0.584181


<a id='3'></a>
## 3. Feature Engineering

<a id='3_1'></a>
### 3.1. Outline

The final goal is to train the model to predict content and wording scores.

• Wording Model  

a) Voice

Voice in writing refers to the author's distinctive style and tone. In the context of grading student summaries, "using objective language" means that the summary should avoid personal opinions, emotional language, or subjective statements. It should be neutral and objective, presenting the facts from the source text without adding the author's own perspective.  

b) Paraphrase  

Paraphrasing involves restating the information from the source text in a new way, without changing its meaning. A high score in paraphrasing means that the summary effectively conveys the key points of the source text in a concise and clear manner. It should avoid direct copying of sentences from the source.

c) Language  

This component assesses the quality of the language used in the summary. It considers factors such as vocabulary choice, sentence structure, and grammar. A good summary should use appropriate and varied vocabulary, follow correct grammar rules, and have coherent sentence structure.  

• Content Scores: 

a) Main idea  

This aspect evaluates how well the summary captures the primary message or main idea of the source text. A high score means that the summary effectively identifies and conveys the central theme or argument of the source.  

b) Details  

Details refer to specific information, examples, or evidence from the source text. A good summary should accurately represent these details without omitting crucial information or including irrelevant details. The summary should focus on the most relevant supporting details.  

c) Cohesion  

Cohesion assesses how well the summary transitions from one idea to the next. It considers the flow of the summary and how well sentences and paragraphs are connected. A high score indicates that the summary has a logical and smooth progression of ideas.


Some features that could be useful:  

• Extract average sentence length, average word length, word count, unique and stopwords percentage from prompt_text and summary and divide to create new feature.  

• For each summary calculate objectivity and emotional tone.  

• Calculate cosine similarity between prompt_text and summary.  

• Calculate readability score. 

• Calculate frequency of misspelled words in student summaries.   

• Extract topics from prompt_text and student summaries. Use overlap as a feature.  

• Calculate most used 2-grams and 3-grams in prompt_text and summaries and calculate overlap in the 2 categories. Use this as a new feature. 

• Perform Named Entity Recognition (NER) on prompt_text and summaries, and calculate overlap to access if relevant features are captured.  

• Calculate the frequency of transition words in summaries to evaluate cohesion.  

###### [Go to top](#top)

<a id='3_2'></a>
### 3.2. Tokenization

In [4]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt' resource
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/duje/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
joined_df['prompt_question_tokenized'] = joined_df['prompt_question'].apply(word_tokenize)
joined_df['prompt_title_tokenized'] = joined_df['prompt_title'].apply(word_tokenize)
joined_df['prompt_text_tokenized'] = joined_df['prompt_text'].apply(word_tokenize)
joined_df['summary_tokenized'] = joined_df['summary'].apply(word_tokenize)

joined_df.head(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording,prompt_question_tokenized,prompt_title_tokenized,prompt_text_tokenized,summary_tokenized
0,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,1 element of an ideal tragedy is that it shoul...,-0.210614,-0.471415,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[1, element, of, an, ideal, tragedy, is, that,..."
1,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,The three elements of an ideal tragedy are: H...,-0.970237,-0.417058,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[The, three, elements, of, an, ideal, tragedy,..."
2,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...,Aristotle states that an ideal tragedy should ...,-0.387791,-0.584181,"[Summarize, at, least, 3, elements, of, an, id...","[On, Tragedy]","[Chapter, 13, As, the, sequel, to, what, has, ...","[Aristotle, states, that, an, ideal, tragedy, ..."


###### [Go to top](#top)

<a id='3_3'></a>
### 3.3. Basic word- and sentence-level metrics

In [30]:
from nltk.tokenize import sent_tokenize

from nltk.corpus import stopwords
stop=set(stopwords.words('english'))

In [31]:
def count_sentences(text):
    
    sentences = sent_tokenize(text)
    sentence_count = len(sentences)
    
    return sentence_count

def count_total_words(text):
    
    words = word_tokenize(text)
    
    special_characters = [".", ",", "!", "?", ":", ";", "'", '"', "(", ")", "[", "]", "{", "}"]
    words = [word for word in words if word not in special_characters]
    word_count = len(words)
    
    return word_count

def get_unique_words_percentage(text):
    
    words = word_tokenize(text)
    
    special_characters = [".", ",", "!", "?", ":", ";", "'", '"', "(", ")", "[", "]", "{", "}"]
    words = [word for word in words if word not in special_characters]
    unique_words = set(words)
    unique_word_count = len(unique_words)
    
    unique_word_percentage = unique_word_count / len(words)
    
    return unique_word_percentage    
    
def get_stopwords_percentage(text):
    
    words = word_tokenize(text)
    stopwords = [word for word in words if word not in stop and word.isalnum()]

    special_characters = [".", ",", "!", "?", ":", ";", "'", '"', "(", ")", "[", "]", "{", "}"]
    words = [word for word in words if word not in special_characters]
    
    stopwords_percentage = len(stopwords) / len(words)
    
    return stopwords_percentage
    
    

In [32]:
#sentence count
joined_df['prompt_text_sentence_count'] = joined_df['prompt_text'].apply(count_sentences)
joined_df['summary_sentence_count'] = joined_df['summary'].apply(count_sentences)
joined_df['sentence_count_ratio'] = joined_df['summary_sentence_count'] / joined_df['prompt_text_sentence_count']

#word count
joined_df['prompt_text_word_count'] = joined_df['prompt_text'].apply(count_total_words)
joined_df['summary_word_count'] = joined_df['summary'].apply(count_total_words)
joined_df['word_count_ratio'] = joined_df['summary_word_count'] / joined_df['prompt_text_word_count']

#average sentence length
joined_df['prompt_text_avg_sentence_length'] = joined_df['prompt_text_word_count'] / joined_df['prompt_text_sentence_count']
joined_df['summary_avg_sentence_length'] = joined_df['summary_word_count'] / joined_df['summary_sentence_count']
joined_df['avg_sentence_length_ratio'] = joined_df['summary_avg_sentence_length'] / joined_df['prompt_text_avg_sentence_length']

#percentage of unique words
joined_df['prompt_text_unique_words_percentage'] = joined_df['prompt_text'].apply(get_unique_words_percentage)
joined_df['summary_unique_words_percentage'] = joined_df['summary'].apply(get_unique_words_percentage)
joined_df['unique_words_percentage_ratio'] = joined_df['summary_unique_words_percentage'] / joined_df['prompt_text_unique_words_percentage']

#percentage of stopwords
joined_df['prompt_text_stopwords_percentage'] = joined_df['prompt_text'].apply(get_stopwords_percentage)
joined_df['summary_stopwords_percentage'] = joined_df['summary'].apply(get_stopwords_percentage)
joined_df['stopwords_percentage_ratio'] = joined_df['summary_stopwords_percentage'] / joined_df['prompt_text_stopwords_percentage']

In [34]:
joined_df.sample(3)

Unnamed: 0,prompt_question,prompt_title,prompt_text,summary,content,wording,prompt_question_tokenized,prompt_title_tokenized,prompt_text_tokenized,summary_tokenized,...,word_count_ratio,prompt_text_avg_sentence_length,summary_avg_sentence_length,avg_sentence_length_ratio,prompt_text_unique_words_percentage,summary_unique_words_percentage,unique_words_percentage_ratio,prompt_text_stopwords_percentage,summary_stopwords_percentage,stopwords_percentage_ratio
3384,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...,Egyption society was structured like a pyr...,1.813796,0.027223,"[In, complete, sentences, ,, summarize, the, s...","[Egyptian, Social, Structure]","[Egyptian, society, was, structured, like, a, ...","[Egyption, society, was, structured, like, a, ...",...,0.268468,12.613636,16.555556,1.312513,0.54955,0.577181,1.050281,0.601802,0.516779,0.858719
2742,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...,The social structure of the ancient Egyptian s...,-0.093814,0.503833,"[In, complete, sentences, ,, summarize, the, s...","[Egyptian, Social, Structure]","[Egyptian, society, was, structured, like, a, ...","[The, social, structure, of, the, ancient, Egy...",...,0.118919,12.613636,13.2,1.046486,0.54955,0.727273,1.323398,0.601802,0.606061,1.007077
4662,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...,The Third wabe developed into a big thing ove...,-1.355562,-0.955801,"[Summarize, how, the, Third, Wave, developed, ...","[The, Third, Wave]","[Background, The, Third, Wave, experiment, too...","[The, Third, wabe, developed, into, a, big, th...",...,0.062397,24.36,38.0,1.559934,0.466338,0.842105,1.805782,0.525452,0.473684,0.90148
