<a id="top"></a>
# Table of contents

#### 1. [Package installation (optional)](#1)
#### 2. [Data loading](#2)
#### 3. [EDA](#3)
- ##### 3.1. [Understanding data types](#3_1)
- ##### 3.2. [Understanding features](#3_2)
- - ##### 3.2.1. [Content score](#3_2_1)
- - ##### 3.2.2. [Wording score](#3_2_2)
- - ##### 3.2.3. [Content vs Wording](#3_2_3)
- ##### 3.3. [Tokenization](#3_3)

<a id="1"></a>
## 1. Package installation

If you already have these packages installed, you can leave all the lines commented out. Otherwise, uncomment them and run the cell.

In [20]:
#!pip install pandas
#!pip install plotly
#!pip install nbformat --upgrade
#!pip install plotly cufflinks
#!pip install pyLDAvis
#!pip install wordcloud
#!pip install textblob
#!pip install textstat
#!pip install nltk

<a id="2"></a>
## 2. Data loading

We will concentrate only on the training data: "summaries_train.csv" and "prompts_train.csv".

In [21]:
import pandas as pd
from IPython.display import display

In [22]:
summaries_train_df = pd.read_csv('../data/summaries_train.csv')
prompts_train_df = pd.read_csv('../data/prompts_train.csv')

Show the first row of each dataframe:

In [23]:
print("summaries_train dataframe:")
display(summaries_train_df.head(1))
print("prompts_train dataframe:")
display(prompts_train_df.head(1))

summaries_train dataframe:


| | student_id | prompt_id | text | content | wording |
|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 |


prompts_train dataframe:


| | prompt_id | prompt_question | prompt_title | prompt_text |
|---|---|---|---|---|
| 0 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... |


###### [Go to top](#top)

<a id="3"></a>
## 3. EDA

<a id="3_1"></a>
#### 3.1 Understanding data types

To understand the data better, it is worth inspecting the data types stored in each dataframe column.

In [24]:
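def deep_type(obj):
    # Describe a container together with the type of its first element
    if isinstance(obj, (list, tuple, set, pd.Series)):
        # len() avoids the ambiguous truth value of a pandas Series
        if len(obj) > 0:
            # next(iter(...)) also works for sets, which do not support indexing
            return type(obj).__name__ + " of " + deep_type(next(iter(obj)))
        else:
            return type(obj).__name__
    else:
        return type(obj).__name__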

def print_column_types(df):
    # Determine the maximum column name length for alignment
    max_col_len = max(len(col) for col in df.columns)

    for column in df.columns:
        first_non_na = df[column].dropna().iloc[0] if not df[column].isna().all() else None
        if first_non_na is not None:
            print(f"column: {column:<{max_col_len}} \t type: {deep_type(first_non_na)}")
        else:
            print(f"column: {column:<{max_col_len}} \t type: {type(first_non_na).__name__}")

In [25]:
print("summaries_train dataframe:\n")
print_column_types(summaries_train_df)
display(summaries_train_df.head(1))

print("prompts_train dataframe:\n")
print_column_types(prompts_train_df)
display(prompts_train_df.head(1))

summaries_train dataframe:

column: student_id 	 type: str
column: prompt_id  	 type: str
column: text       	 type: str
column: content    	 type: float64
column: wording    	 type: float64


| | student_id | prompt_id | text | content | wording |
|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 |


prompts_train dataframe:

column: prompt_id       	 type: str
column: prompt_question 	 type: str
column: prompt_title    	 type: str
column: prompt_text     	 type: str


| | prompt_id | prompt_question | prompt_title | prompt_text |
|---|---|---|---|---|
| 0 | 39c16e | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... |
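
For comparison, pandas' own column-level `dtypes` can be less specific than the element-level types printed above (string columns are commonly reported as `object`, depending on the pandas version). A minimal sketch on a small made-up frame; `toy_df` and its values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this sketch.
toy_df = pd.DataFrame({
    "student_id": ["a1", "b2"],
    "content": [0.2, -0.5],
})

# Column-level dtypes as reported by pandas.
print(toy_df.dtypes)

# Element-level types, taken from the first value of each column.
element_types = {col: type(toy_df[col].iloc[0]).__name__ for col in toy_df.columns}
print(element_types)
```

Looking at the first element is what `deep_type` relies on, which is why it reports `str` for the text columns.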


###### [Go to top](#top)

<a id="3_2"></a>
#### 3.2 Understanding features

In order to fully understand data, one must understand each feature of the dataset.  
This is discussed next.

In [26]:
print(f'Number of entries in the summaries_train_df: {len(summaries_train_df)}')
print(f"Number of unique 'student_id' values in summaries_train_df: {len(summaries_train_df['student_id'].unique())}")
print(f"Number of unique 'prompt_id' values in summaries_train_df: {len(summaries_train_df['prompt_id'].unique())}")
print(f"Number of summaries in summaries_train_df: {len(summaries_train_df['text'])}\n")

print(f'Number of entries in the prompts_train_df: {len(prompts_train_df)}')
print(f"Number of unique 'prompt_id' values in prompts_train_df: {len(prompts_train_df['prompt_id'].unique())}")
print(f"Number of 'prompt_question' values in prompts_train_df: {len(prompts_train_df['prompt_question'].unique())}")
print(f"Number of 'prompt_title' values in prompts_train_df: {len(prompts_train_df['prompt_title'].unique())}\n")

print(f"Unique 'prompt_id' values in summaries_train_df: {summaries_train_df['prompt_id'].unique()}")
print(f"Unique 'prompt_id' values in prompts_train_df: {prompts_train_df['prompt_id'].unique()}")

Number of entries in the summaries_train_df: 7165
Number of unique 'student_id' values in summaries_train_df: 7165
Number of unique 'prompt_id' values in summaries_train_df: 4
Number of summaries in summaries_train_df: 7165

Number of entries in the prompts_train_df: 4
Number of unique 'prompt_id' values in prompts_train_df: 4
Number of 'prompt_question' values in prompts_train_df: 4
Number of 'prompt_title' values in prompts_train_df: 4

Unique 'prompt_id' values in summaries_train_df: ['814d6b' 'ebad26' '3b9047' '39c16e']
Unique 'prompt_id' values in prompts_train_df: ['39c16e' '3b9047' '814d6b' 'ebad26']


There are 7165 entries in total in "summaries_train_df", each written by a different student.  
Every summary ("text" column in "summaries_train_df") answers 1 of the 4 questions/tasks defined in the "prompt_question" column of "prompts_train_df". Every "prompt_question" corresponds to exactly one "prompt_title" (also in "prompts_train_df"), which describes it briefly.  
Finally, the "prompt_text" column (also in "prompts_train_df") stores the full texts that the students have to summarize. 
  
For every summary, the student is awarded a content score ("content" column in "summaries_train_df") and a wording score ("wording" column in "summaries_train_df").
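
Since the two dataframes share the "prompt_id" key, each summary can be paired with its prompt. The sketch below shows one way to do this with `DataFrame.merge` on made-up stand-ins (`summaries_toy`, `prompts_toy` are hypothetical and only mimic the column layout):

```python
import pandas as pd

# Toy stand-ins for summaries_train_df and prompts_train_df (made-up rows).
summaries_toy = pd.DataFrame({
    "student_id": ["s1", "s2", "s3"],
    "prompt_id": ["p1", "p2", "p1"],
    "text": ["first summary", "second summary", "third summary"],
})
prompts_toy = pd.DataFrame({
    "prompt_id": ["p1", "p2"],
    "prompt_title": ["Title one", "Title two"],
})

# Many summaries map to one prompt; validate raises an error if prompt_id
# is duplicated on the prompts side.
merged_toy = summaries_toy.merge(
    prompts_toy, on="prompt_id", how="left", validate="many_to_one"
)
print(merged_toy)
```

A left merge keeps one row per summary, so the merged frame has the same number of rows as the summaries frame.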

###### [Go to top](#top)

<a id="3_2_1"></a>
##### 3.2.1 Content score

According to the competition host's notes, the content score accounts for 3 factors:

- Main idea (i.e. How well did the summary capture the main idea of the source?)  

- Details (i.e. How accurately did the summary capture the details from the source?)  

- Cohesion (i.e. How well did the summary transition from one idea to the next?)

In [27]:
import cufflinks as cf
cf.go_offline()

In [28]:
summaries_train_df['content'].iplot(
    kind='hist',
    bins=60,
    layout=dict(
        title='Content score distribution',
        title_x=0.5,
        xaxis=dict(title='content score'),
        yaxis=dict(title='count')
    ),
    color='blue')

The content score ranges from roughly -2 to +4, which suggests that the scores may have been transformed. This can be checked by looking at the column summary.

In [29]:
summaries_train_df['content'].describe()

count    7165.000000
mean       -0.014853
std         1.043569
min        -1.729859
25%        -0.799545
50%        -0.093814
75%         0.499660
max         3.900326
Name: content, dtype: float64

On closer inspection, the content score ranges from -1.73 to 3.90. The mean and standard deviation are very close to 0 and 1, respectively, which suggests that the scores were standardized.  
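
As a reminder of what standardization does, the sketch below applies a z-score transformation to made-up raw scores (the real pre-transformation scores are not available, so `raw_scores` is purely illustrative). Whether the competition host used exactly this transformation is an assumption:

```python
import pandas as pd

# Made-up raw scores; the real pre-transformation scores are not available.
raw_scores = pd.Series([1.0, 2.0, 2.0, 3.0, 5.0, 5.0])

# z-score: subtract the mean and divide by the standard deviation.
# pandas' std() uses the sample estimate (ddof=1), as does describe().
z_scores = (raw_scores - raw_scores.mean()) / raw_scores.std()

print(z_scores.mean(), z_scores.std())
```

After the transformation the mean is 0 and the standard deviation is 1 up to floating-point error. The training values are only close to 0 and 1, so if standardization was applied, it was presumably computed on a different set of rows; this is a guess.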

At this point, it is not clear how these scores map to actual grades.

###### [Go to top](#top)

<a id="3_2_2"></a>
##### 3.2.2 Wording score

According to the competition host's notes, the wording score accounts for 3 factors:

- Voice (i.e. Was the summary written using objective language?)  

- Paraphrase (i.e. Is the summary properly paraphrased?)  

- Language (i.e. How well did the summary use lexis and syntax?) 

In [30]:
summaries_train_df['wording'].iplot(
    kind='hist',
    bins=60,
    layout=dict(
        title='Wording score distribution',
        title_x=0.5,
        xaxis=dict(title='wording score'),
        yaxis=dict(title='count')
    ),
    color='blue')

The wording score ranges roughly from -2 to 4.3. It is reasonable to expect that the same transformation was applied to the wording score as to the content score.

In [31]:
summaries_train_df['wording'].describe()

count    7165.000000
mean       -0.063072
std         1.036048
min        -1.962614
25%        -0.872720
50%        -0.081769
75%         0.503833
max         4.310693
Name: wording, dtype: float64

The minimum value for wording is -1.96, while the maximum value is 4.31. As expected, the mean and standard deviation are very close to 0 and 1, respectively, which suggests that standardization was applied to the wording score as well.

As with the content score, it is not yet clear what the wording score values mean.

###### [Go to top](#top)

<a id="3_2_3"></a>
##### 3.2.3 Content vs Wording

Looking at the content score distribution, we can see that it is multimodal, although this is hard to see because of the coarse binning. The same can be said of the wording score distribution.  

The clusters are easier to see if the number of bins is increased:

In [32]:
summaries_train_df['content'].iplot(
    kind='hist',
    bins=360,
    layout=dict(
        title='Content score distribution (finer binning)',
        title_x=0.5,
        xaxis=dict(title='content score'),
        yaxis=dict(title='count')
    ),
    color='blue')

summaries_train_df['wording'].iplot(
    kind='hist',
    bins=360,
    layout=dict(
        title='Wording score distribution (finer binning)',
        title_x=0.5,
        xaxis=dict(title='wording score'),
        yaxis=dict(title='count')
    ),
    color='blue')

As one can see, there appear to be several overlapping Gaussian-like distributions, which points to the possible existence of clusters. One thing to try is a (content, wording) scatter plot to look for them. 

In [33]:
import plotly.graph_objs as go

scatter_plot = go.Scatter(
    x=summaries_train_df['content'],
    y=summaries_train_df['wording'],
    mode='markers',
    marker=dict(size=5, color='blue'),  # Adjust the size as needed
)

layout = go.Layout(
    xaxis=dict(title='content score'),
    yaxis=dict(title='Wording Score'),
    title='Content score vs. Wording score',
    title_x=0.5,
)

fig = go.Figure(data=[scatter_plot], layout=layout)
fig.show()

From the scatter plot, one can see that there are 37 clusters. This becomes clearer if one applies a 30° rotation to the (content, wording) points. 

In [34]:
import numpy as np

angle_deg = 30
angle_rad = np.radians(angle_deg)
rotation_matrix = np.array([[np.cos(angle_rad), -np.sin(angle_rad)],
                            [np.sin(angle_rad), np.cos(angle_rad)]])

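# Right-multiplying row vectors by this matrix rotates the points clockwise
# by angle_deg (equivalently, it rotates the axes counterclockwise)
rotated_values = np.dot(summaries_train_df[['content', 'wording']].values, rotation_matrix)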
summaries_train_df['content_rotated'] = rotated_values[:, 0]
summaries_train_df['wording_rotated'] = rotated_values[:, 1]

In [35]:
# rotated content score
scatter_plot = go.Scatter(
    x=summaries_train_df['content_rotated'],
    mode='markers',
    marker=dict(size=5, color='blue'),  # Adjust the size as needed
)

layout = go.Layout(
    xaxis=dict(title='rotated content score'),
    #yaxis=dict(title='Wording Score'),
    title='Rotated content score',
    title_x=0.5,
)

fig = go.Figure(data=[scatter_plot], layout=layout)
fig.show()

# rotated wording score
scatter_plot = go.Scatter(
    x=summaries_train_df['wording_rotated'],
    mode='markers',
    marker=dict(size=5, color='blue'),  # Adjust the size as needed
)

layout = go.Layout(
    xaxis=dict(title='rotated wording score'),
    #yaxis=dict(title='Wording Score'),
    title='Rotated wording score',
    title_x=0.5,
)

fig = go.Figure(data=[scatter_plot], layout=layout)
fig.show()

# rotated content score vs. rotated wording score
scatter_plot = go.Scatter(
    x=summaries_train_df['content_rotated'],
    y=summaries_train_df['wording_rotated'],
    mode='markers',
    marker=dict(size=5, color='blue'),  # Adjust the size as needed
)

layout = go.Layout(
    xaxis=dict(title='rotated content score'),
    yaxis=dict(title='rotated wording score'),
    title='Rotated content score vs. rotated wording score',
    title_x=0.5,
)

fig = go.Figure(data=[scatter_plot], layout=layout)
fig.show()

In [36]:
summaries_train_df['content_rotated'].iplot(
    kind='hist',
    bins=360,
    layout=dict(
        title='Rotated content score distribution',
        title_x=0.5,
        xaxis=dict(title='rotated content score'),
        yaxis=dict(title='count')
    ),
    color='blue')

summaries_train_df['wording'].iplot(
    kind='hist',
    bins=360,
    layout=dict(
        title='Wording score distribution',
        title_x=0.5,
        xaxis=dict(title='wording score'),
        yaxis=dict(title='count')
    ),
    color='blue')

summaries_train_df['wording_rotated'].iplot(
    kind='hist',
    bins=360,
    layout=dict(
        title='Rotated wording score distribution',
        title_x=0.5,
        xaxis=dict(title='rotated wording score'),
        yaxis=dict(title='count')
    ),
    color='blue')

summaries_train_df['content'].iplot(
    kind='hist',
    bins=360,
    layout=dict(
        title='Content score distribution',
        title_x=0.5,
        xaxis=dict(title='content score'),
        yaxis=dict(title='count')
    ),
    color='blue')

When the data is rotated by 30°, it becomes clear that the content score is divided into 37 distinct clusters. What these clusters mean remains an open question. One possible explanation is the following.  
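
The cluster count can also be checked programmatically rather than read off the histogram. A minimal sketch, assuming the clusters in a one-dimensional score are separated by clear gaps; `count_clusters_by_gaps` and its `min_gap` threshold are hypothetical helpers, demonstrated here on synthetic data only:

```python
import numpy as np

def count_clusters_by_gaps(values, min_gap):
    """Count groups of values separated by gaps wider than min_gap."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    if sorted_values.size == 0:
        return 0
    gaps = np.diff(sorted_values)
    # Every gap wider than min_gap starts a new cluster.
    return int(np.sum(gaps > min_gap)) + 1

# Synthetic 1-D data: three tight groups around 0, 1 and 2 (made up).
rng = np.random.default_rng(0)
synthetic = np.concatenate([
    center + rng.uniform(-0.05, 0.05, size=50) for center in (0.0, 1.0, 2.0)
])

n_clusters = count_clusters_by_gaps(synthetic, min_gap=0.5)
print(n_clusters)
```

On the real data one could pass `summaries_train_df['content_rotated']` with a threshold chosen from the fine-binned histogram; a suitable threshold is not known in advance.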

The basic grading system in the US is A, B, C, D, F. A finer-grained system has A+, A, A−, B+, B, B−, C+, C, C−, D+, D, D− and F. Looking at the rotated content score distribution, one can see one particularly bad score with 426 entries. Setting that cluster aside leaves 36 clusters. Excluding grade F (the only one without finer grading), there are 12 possible grades. This could suggest that each group of 3 clusters (starting from the second on the left) corresponds to one finer-grained grade (A+, A, A−, etc.). The cluster that was set aside could then correspond to grade F. If this is correct, one could simply map the content score to the percentage (written as a decimal number in the range 0-1) corresponding to each grade.  
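
The speculated mapping can be written down as a small lookup. This is only a sketch of the hypothesis: it assumes the clusters are indexed 0 to 36 in order of increasing rotated content score, that cluster 0 is the one mapped to F, and that lower scores mean lower grades. `cluster_to_grade` and `GRADES_LOW_TO_HIGH` are names introduced here for illustration:

```python
# Hypothetical mapping from cluster index to letter grade: cluster 0 is
# treated as F, then each run of 3 clusters shares one grade. The ordering
# (lowest score = lowest grade) is an assumption.
GRADES_LOW_TO_HIGH = ["D-", "D", "D+", "C-", "C", "C+",
                      "B-", "B", "B+", "A-", "A", "A+"]

def cluster_to_grade(cluster_index):
    """Map a cluster index in 0..36 (sorted by score) to a letter grade."""
    if not 0 <= cluster_index <= 36:
        raise ValueError("cluster_index must be between 0 and 36")
    if cluster_index == 0:
        return "F"
    # Clusters 1-3 share the first grade, 4-6 the second, and so on.
    return GRADES_LOW_TO_HIGH[(cluster_index - 1) // 3]

print([cluster_to_grade(i) for i in (0, 1, 3, 4, 36)])
```

Nothing in the data confirms this mapping; it only makes the arithmetic of 1 + 12 × 3 = 37 clusters explicit.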

An additional observation is that rotating by 90° (instead of 30°) turns the rotated content score distribution into the (non-rotated) wording score distribution. Similarly, rotating by -90° turns the rotated wording score distribution into the (non-rotated) content score distribution. While this is clear from a geometric point of view (we are just swapping the x and y axes), it is not clear whether there is a deeper meaning to it. Why is the content score exactly the same as the wording score rotated by 90°? 
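
Numerically, at least, the 90° case reduces to a column swap with one sign flip, because cos 90° = 0 and sin 90° = 1. The check below uses made-up (content, wording) pairs; `rotate` is a helper introduced here that mirrors the matrix and the right-multiplication used in the rotation cell:

```python
import numpy as np

def rotate(points, angle_deg):
    """Right-multiply row vectors by the 2-D rotation matrix."""
    angle_rad = np.radians(angle_deg)
    rotation_matrix = np.array([[np.cos(angle_rad), -np.sin(angle_rad)],
                                [np.sin(angle_rad), np.cos(angle_rad)]])
    return points @ rotation_matrix

# Made-up (content, wording) pairs.
points = np.array([[0.2, 0.4], [-1.0, 0.5], [3.0, -2.0]])

rotated_90 = rotate(points, 90)
# The first rotated column equals the original wording column, and the
# second rotated column equals the negated content column.
print(np.allclose(rotated_90[:, 0], points[:, 1]))
print(np.allclose(rotated_90[:, 1], -points[:, 0]))
```

This holds for any pair of columns, so it does not by itself say anything about how the two scores are related.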

###### [Go to top](#top)

<a id="3_3"></a>
#### 3.3 Tokenization

Tokenization is the process of splitting text into a list of tokens (words and punctuation marks). It is a crucial first step for counting the words in a text and the characters per word, performing word frequency analysis, counting and exploring stop words, looking into N-grams, performing topic modelling, doing sentiment analysis, calculating a text's readability score, etc.
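
To make the idea concrete, the sketch below derives a word count and a mean word length from a token list. It uses a simplified regular-expression tokenizer (`simple_tokenize`, a hypothetical stand-in and not NLTK's `word_tokenize`) so that it runs without downloading any resources; the sample sentence is made up:

```python
import re

def simple_tokenize(text):
    """Simplified stand-in tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The third wave was an experiment."
tokens = simple_tokenize(sample)

# Keep only alphabetic tokens when counting words.
words = [tok for tok in tokens if tok.isalpha()]
word_count = len(words)
mean_word_length = sum(len(w) for w in words) / word_count

print(tokens)
print(word_count, mean_word_length)
```

The same statistics can be computed from the token lists produced by `word_tokenize`, although that tokenizer may split some strings differently from this stand-in.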

In [37]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt' resource
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/duje/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [38]:
prompts_train_df['prompt_question_tokenized'] = prompts_train_df['prompt_question'].apply(word_tokenize)
prompts_train_df['prompt_title_tokenized'] = prompts_train_df['prompt_title'].apply(word_tokenize)
prompts_train_df['prompt_text_tokenized'] = prompts_train_df['prompt_text'].apply(word_tokenize)

summaries_train_df['text_tokenized'] = summaries_train_df['text'].apply(word_tokenize)

print("summaries_train dataframe:\n")
display(summaries_train_df.sample(1))

print("prompts_train dataframe:\n")
display(prompts_train_df.sample(1))

summaries_train dataframe:



| | student_id | prompt_id | text | content | wording | content_rotated | wording_rotated | text_tokenized |
|---|---|---|---|---|---|---|---|---|
| 6392 | e43ebd5f1ca5 | 3b9047 | The different classes played into the governme... | -0.39331 | 0.627128 | -0.027052 | 0.739764 | [The, different, classes, played, into, the, g... |


prompts_train dataframe:



| | prompt_id | prompt_question | prompt_title | prompt_text | prompt_question_tokenized | prompt_title_tokenized | prompt_text_tokenized |
|---|---|---|---|---|---|---|---|
| 3 | ebad26 | Summarize the various ways the factory would u... | Excerpt from The Jungle | With one member trimming beef in a cannery, an... | [Summarize, the, various, ways, the, factory, ... | [Excerpt, from, The, Jungle] | [With, one, member, trimming, beef, in, a, can... |


###### [Go to top](#top)