## Importing packages

Importing all packages needed for the full notebook. This only needs to be run once.

In [79]:
import sys
!{sys.executable} -m pip install ipynb
!{sys.executable} -m pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


In [80]:
import gzip as gz
import json
import sklearn
import spacy as sp
import pandas as pd
import numpy as np
import math
import re
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
#from ipynb.fs.full.my_functions import rouge_blue_scoring
from rouge import Rouge
from rouge_score import rouge_scorer


## Import and inspect data

Loading data from disk. This is a time-consuming process that only needs to be done once per session. After completion, all needed data is held in the *data* variable.

In [8]:
# Setting local file path
path = "release/train.jsonl.gz"

# Creating list entity to hold full set of loaded data
data = []

# Using gz to open the gzipped file and iteratively load each JSON line
with gz.open(path) as f:
    for ln in f:
        obj = json.loads(ln)
        data.append(obj)
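
Since loading the full archive is slow, it can help to check the loading logic on a handful of records first. The sketch below is one way to do that, assuming the same one-JSON-object-per-line layout. `load_jsonl_gz` and its `limit` parameter are hypothetical names and not part of the notebook.

```python
import gzip
import itertools
import json


def load_jsonl_gz(path, limit=None):
    """Read up to `limit` JSON objects from a gzipped JSON Lines file.

    Hypothetical helper: limit=None reads the whole file.
    """
    # "rt" decodes the bytes to text while decompressing
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops after `limit` lines without reading the rest
        return [json.loads(line) for line in itertools.islice(f, limit)]
```

For example, `sample = load_jsonl_gz("release/train.jsonl.gz", limit=1000)` would load only the first 1000 entities.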

Inspecting one data entity for better understanding of its properties. 

In [261]:
# Extracting and printing the first data entity
ent = data[0]
ent

{'url': 'http://www.nytimes.com/2006/06/04/sports/soccer/04racism.html',
 'archive': 'http://web.archive.org/web/20060618204254id_/http://www.nytimes.com:80/2006/06/04/sports/soccer/04racism.html?ei=5090&en=2cc7885a61bb8cf5&ex=1307073600&partner=rssuserland&emc=rss&pagewanted=all',
 'title': 'Surge in Racist Mood Raises Concerns on Eve of World Cup',
 'date': '20060618204254',
 'text': 'HAMBURG, Germany, June 3 \x97 As he left the soccer field after a club match in the eastern German city of Halle on March 25, the Nigerian forward Adebowale Ogungbure was spit upon, jeered with racial remarks and mocked with monkey noises. In rebuke, he placed two fingers under his nose to simulate a Hitler mustache and thrust his arm in a Nazi salute.\n\nMarc Zoro, right, an Ivory Coast native, was a target of racial slurs from the home fans in Messina, Italy. Adriano, a star with Inter Milan, tried to persuade him to stay on the field.\n\nFrom now until its conclusion on July 9, Jeff Z. Klein and othe

## Data Thinning

Initial data thinning to extract the data entities suitable for an extractive summarization task. These are identified by the entity property "density_bin", which states whether the summary is mainly "extractive" or "abstractive". Articles of 250 words or fewer are dropped as well.

In [14]:
# Extracting appropriate data and transforming to pandas dataframe
df = pd.DataFrame(data)
df_extractive = df[df.density_bin == 'extractive']

# Filtering on a rough estimate of text length (whitespace-separated words)
article_lengths = [len(text.split()) for text in df_extractive.text]
_ = plt.hist(article_lengths, bins = 100, range = (0, 2000))
# n_words avoids shadowing the built-in len
length_check = [n_words > 250 for n_words in article_lengths]
df_extractive = df_extractive[length_check]

# Inspecting the head of the dataframe
df_extractive.head()
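
The same two conditions can also be applied to plain dictionaries, for instance while streaming the file, so that unsuitable entities never reach the DataFrame. The following is a minimal sketch under that assumption. `is_suitable` and `min_words` are hypothetical names, and the records are made up for illustration.

```python
def is_suitable(record, min_words=250):
    """Return True for extractive records whose text has more than `min_words` words."""
    # Same rough length estimate as above: whitespace-separated tokens
    n_words = len(record["text"].split())
    return record.get("density_bin") == "extractive" and n_words > min_words


# Toy records, made up for illustration
records = [
    {"density_bin": "extractive", "text": "word " * 300},
    {"density_bin": "extractive", "text": "too short"},
    {"density_bin": "abstractive", "text": "word " * 300},
]
kept = [r for r in records if is_suitable(r)]
print(len(kept))  # 1
```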


In [16]:
# Printing the length of the dataset, before and after thinning
print("Full dataset: ", len(df), "\nExtractive dataset: ",len(df_extractive), "\nPercentage: ", round(len(df_extractive)/len(df), 2))

Full dataset:  995041 
Extractive dataset:  257586 
Percentage:  0.26


## Intermediate Input Representation

In [15]:
# Loading the "en_core_web_sm" model from the SpaCy library
# Disabling unused features for a lighter and more efficient model
# The model is used for text transformation in preprocessing 
nlp = sp.load("en_core_web_sm", disable=["tagger", "parser", "ner", "textcat"])

### Sparse TF-IDF matrix representation

In [17]:
# PREPROCESSING FUNCTION
def preprocess(text):
    # Tokenize the input text with spaCy
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Skip stop words and non-alphabetic tokens, append the lemma form of the rest
        if not token.is_stop and token.is_alpha:
            tokens.append(token.lemma_)
    
    # Return all lemmatized tokens in the input text
    return tokens

In [18]:
# TF-IDF MATRIX CREATION FUNCTION
def create_tfidf_matrix(corpus, preprocessor = preprocess):
    # Initializing scikit-learn TF-IDF vectorizer and creating TF-IDF sparse matrix
    vectorizer = TfidfVectorizer(tokenizer = preprocessor)
    tfidf_matrix = vectorizer.fit_transform(corpus['text'])
    
    # Saving list of all corpus tokens
    # (scikit-learn 1.0 renamed this method to get_feature_names_out(); the old name was removed in 1.2)
    feature_names = vectorizer.get_feature_names()

    # Inspecting dimension of sparse matrix
    # Rows should equal the number of documents in the corpus passed in
    # Number of columns equals number of unique corpus tokens
    print("TF-IDF matrix dimension: ", tfidf_matrix.get_shape(), "\nAligning with no. df_extractive enteties? ", tfidf_matrix.get_shape()[0] == len(corpus))
    
    # Returning TF-IDF matrix and the list of token names
    return tfidf_matrix, feature_names
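
For reference, with its default settings `TfidfVectorizer` weights a term by its raw count times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each row. The sketch below reproduces that formula in plain Python for already tokenized toy documents. `tfidf_rows` is a hypothetical helper meant to illustrate the weighting, not to replace the vectorizer.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {token: weight} dict per document, mirroring scikit-learn's defaults."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for doc in tokenized_docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        # L2-normalize so that long documents do not dominate
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy documents, made up for illustration
docs = [["cup", "racism", "fan"], ["cup", "final"], ["cup", "goal", "goal"]]
rows = tfidf_rows(docs)
# "cup" appears in every document, so it gets the lowest weight in the first row
print(min(rows[0], key=rows[0].get))  # cup
```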

### Token and TF-IDF score pairing

In [19]:
# TOKEN TFIDF PAIRING FUNCTION
def token_tfidf_ranking(feature_names, tfidf_matrix, row_index = 0):
    # Extracting column indices of the document's tokens from the TF-IDF matrix
    token_indices = tfidf_matrix[row_index,:].nonzero()[1]
   
    # Extract token names and pair with corresponding TF-IDF value from the TF-IDF matrix
    # Building from a dict keeps the scores numeric (np.column_stack would cast them to strings)
    token_tfidf = pd.DataFrame({
        'token': [feature_names[index] for index in token_indices],
        'tfidf_score': [tfidf_matrix[row_index, index] for index in token_indices]
    })
    
    # Sort by TF-IDF score, highest first
    token_tfidf = token_tfidf.sort_values(by='tfidf_score', ascending=False)
    
    # Return the sorted (token, TF-IDF value) data frame
    return token_tfidf

### Sentence splitting

In [20]:
# Loading a second spaCy model and adding the sentencizer pipe (spaCy v2 API)
nlp_sentencizer = sp.load("en_core_web_sm", disable=["tagger", "parser", "ner", "textcat"])
nlp_sentencizer.add_pipe(nlp_sentencizer.create_pipe('sentencizer'))

In [21]:
# SENTENCE SPLITTING FUNCTION
def sentence_splitting(doc):
    # Run the sentencizer on the document text and collect the stripped sentences
    doc = nlp_sentencizer(doc['text'])
    sentences = [sent.text.strip() for sent in doc.sents]
    
    return sentences
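
The sentencizer is rule-based and splits on sentence-final punctuation. As a rough stand-in for experimenting without spaCy, the sketch below splits on whitespace that follows `.`, `!` or `?`. `split_sentences` is a hypothetical helper and is not equivalent to spaCy's component. In particular it will also split after initials and abbreviations.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' and drop empty pieces."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in pieces if p.strip()]


print(split_sentences("He left the field. Fans jeered! Why?"))
# ['He left the field.', 'Fans jeered!', 'Why?']
```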

### Sentence level tokenization and scoring

In [129]:
# SENTENCE SCORING FUNCTION
def sentence_scoring(sentences, token_tfidf_pairs):
    sentence_scores = []
    
    # Token -> TF-IDF lookup, built once instead of searching the data frame for every token
    tfidf_lookup = dict(zip(token_tfidf_pairs['token'], token_tfidf_pairs['tfidf_score'].astype(float)))
    
    for i, sentence in enumerate(sentences):
        # Sentence length in characters, used for normalization
        sentence_length = len(sentence)

        # Using preprocessing function to extract sentence tokens
        sentence_tokens = preprocess(sentence)

        # Summation of sentence tokens' TF-IDF values (tokens without a TF-IDF value count as 0)
        score = sum(tfidf_lookup.get(token.lower(), 0.0) for token in sentence_tokens)

        # Normalizing sentence score by sentence length; empty sentences get a score of 0
        score = score / sentence_length if sentence_length else 0.0
        
        # Append to list of sentence scores
        sentence_scores.append((i,score))
    
    # Save scores in pd dataframe, highest score first
    sentence_scores = pd.DataFrame(sentence_scores,columns=["sentence_index", "sentence_score"]).sort_values(by='sentence_score', ascending=False)
    
    # Return final sentence scores
    return sentence_scores 
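
The scoring rule is the sum of the TF-IDF values of a sentence's tokens divided by the sentence length in characters. A minimal sketch of that rule for a single sentence follows, assuming whitespace tokenization in place of the spaCy-based `preprocess` and made-up weights. `score_sentence` is a hypothetical name.

```python
def score_sentence(sentence, weights):
    """Sum the weights of known tokens and divide by the character length."""
    if not sentence:
        return 0.0
    # Whitespace tokenization stands in for the spaCy-based preprocess()
    tokens = sentence.lower().split()
    total = sum(weights.get(token, 0.0) for token in tokens)
    return total / len(sentence)


weights = {"racism": 0.5, "fan": 0.25}  # made-up TF-IDF weights
print(score_sentence("racism fan", weights))  # 0.075
```

Dividing by the character count rather than the token count means that sentences with long words are penalized slightly more than sentences with short words.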

### Sentence Extraction

In [23]:
# SENTENCE EXTRACTION FUNCTION
def sentence_extraction(sentences, sentence_scores, n):
    summary = []
    
    # Extracting indices of the n number of top scoring sentences
    # sort them in ascending order
    top_sentence_indices = np.sort(sentence_scores[0:n]['sentence_index'].values)
    
    # Extracting the original sentences and appending to summary list
    for index in top_sentence_indices:
        summary.append(sentences[index])
        
    # Joining summary sentences
    summary = ' '.join(summary)
    
    return summary
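
The extraction step selects by score but outputs in document order. The sketch below shows that two-stage sort with plain lists. `pick_top_n` is a hypothetical helper, and the sentences and scores are made up for illustration.

```python
def pick_top_n(sentences, scores, n):
    """Join the n best-scoring sentences in their original order."""
    # Rank sentence positions by score, best first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Re-sort the winners by position so the summary reads in document order
    chosen = sorted(ranked[:n])
    return " ".join(sentences[i] for i in chosen)


sentences = ["A.", "B.", "C.", "D."]
scores = [0.1, 0.9, 0.3, 0.7]
print(pick_top_n(sentences, scores, 2))  # B. D.
```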

## Score summaries with ROUGE

In [198]:
rouge = Rouge()
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL','rougeLsum'], use_stemmer=True)
# ROUGE SCORING FUNCTION (scores with both the rouge and the rouge-score package)
def rouge_blue_scoring(summary, reference):
    # rouge: get_scores(hypothesis, reference)
    rouge_score = rouge.get_scores(summary, reference)
    # rouge-score: score(target, prediction), so the reference goes first
    rouge_score_2 = scorer.score(reference, summary)
    return rouge_score_2, rouge_score
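
To make the reported numbers easier to read: ROUGE-1 counts the unigrams shared by summary and reference. Precision divides that overlap by the number of summary tokens, recall divides it by the number of reference tokens. The sketch below is a simplified version that assumes lowercased whitespace tokens and no stemming, so its values will differ somewhat from both libraries. `rouge1` is a hypothetical name.

```python
from collections import Counter


def rouge1(summary, reference):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((summary_counts & reference_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(summary_counts.values())
    recall = overlap / sum(reference_counts.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


# All 3 summary tokens occur in the 6-token reference:
# precision 1.0, recall 0.5
print(rouge1("the cat sat", "the cat sat on the mat"))
```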
    

# Running the Experiment
* Preprocess and Vectorize
* Extract each document's tokens and TF-IDF key-value pairs
* Sentencize each document and score each separate sentence
* Extract top *n* sentences

In [24]:
# Intermediate Input Representation
corpus = df_extractive
tfidf_matrix, feature_names = create_tfidf_matrix(corpus = corpus)

TF-IDF matrix dimension:  (257586, 483035) 
Aligning with no. df_extractive enteties?  True


In [178]:
test_docs = corpus[0:10]
test_docs.head()

| | url | archive | title | date | text | summary | compression | coverage | density | compression_bin | coverage_bin | density_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | http://www.nytimes.com/2006/04/23/business/you... | http://web.archive.org/web/20060909062911id_/h... | Investors vs. Pfizer: Guess Who Has the Guns? | 20060909062911 | IF outsized executive pay has indeed become a ... | The battle between Pfizer Inc.'s investors and... | 33.88 | 1.0 | 11.72 | medium | high | extractive |
| 3 | http://www.nydailynews.com/archives/gossip/199... | http://web.archive.org/web/20080313232743id_/h... | REX FLEXED PECS FOR SKIN PICS | 20080313232743 | BY A.J. BENZA & MICHAEL LEWITTES\n\nIf Simon R... | If Simon Rex looks a little familiar, it may n... | 11.894118 | 0.988235 | 38.988235 | low | high | extractive |
| 4 | http://www.nydailynews.com/archives/entertainm... | http://web.archive.org/web/20080314003027id_/h... | POPEYE-WORTHY PIE. PHYLLO DOUGH WRAPS SPINACH ... | 20080314003027 | Spinach has terrorized generations of veggie-p... | POPEYE-WORTHY PIE. PHYLLO DOUGH WRAPS SPINACH ... | 3.932584 | 0.921348 | 36.629213 | low | medium | extractive |
| 5 | http://www.nydailynews.com/archives/news/2001/... | http://web.archive.org/web/20080520122148id_/h... | JOY FOR ADDICTS ON MEND AS CHILDREN ARE RETURNED | 20080520122148 | All day, every day, Cheryl Bernstein thanks he... | All day, every day, Cheryl Bernstein thanks he... | 4.597561 | 0.987805 | 23.47561 | low | high | extractive |
| 6 | http://www.nydailynews.com/archives/news/2001/... | http://web.archive.org/web/20080711053245id_/h... | QUICK WORK BY THE COPS NYPD response time plunges | 20080711053245 | With Police Commissioner Bernard Kerik crackin... | By JOHN MARZULLI DAILY NEWS POLICE BUREAU CHIE... | 20.085366 | 0.939024 | 16.890244 | medium | medium | extractive |


In [201]:
# Sentence Scoring and Sentence Extraction
summaries = []
rouge_scores1 = pd.DataFrame(columns = ['r1-p', 'r1-r', 'r1-f', 'r2-p', 'r2-r', 'r2-f', 'rl-p', 'rl-r', 'rl-f'])
rouge_scores2 = pd.DataFrame(columns = ['r1-p', 'r1-r', 'r1-f', 'r2-p', 'r2-r', 'r2-f', 'rl-p', 'rl-r', 'rl-f'])

# enumerate gives each document's row position in the TF-IDF matrix
# (the dataframe index labels are not guaranteed to be contiguous after thinning)
for row_pos, (index, doc) in enumerate(test_docs.iterrows()):
    token_tfidf_pairs = token_tfidf_ranking(feature_names, tfidf_matrix, row_pos)
    sentences = sentence_splitting(doc)
    sentence_scores = sentence_scoring(sentences, token_tfidf_pairs)
    summary = sentence_extraction(sentences, sentence_scores, 3)
    summaries.append(summary)
    rouge_score2, rouge_score = rouge_blue_scoring(summary, doc.summary)
    # Reading precision, recall and F-score by key so the values match the column order
    rouge_scores1.loc[len(rouge_scores1)] = [rouge_score[0][metric][key] for metric in ('rouge-1', 'rouge-2', 'rouge-l') for key in ('p', 'r', 'f')]
    rouge_scores2.loc[len(rouge_scores2)] = list(rouge_score2['rouge1'][0:3]) + list(rouge_score2['rouge2'][0:3]) + list(rouge_score2['rougeL'][0:3])
    
    