# Extractive Summarization

<p>Extractive summarization builds a text summary by selecting passages directly from the document rather than writing new text, which makes it the simplest approach to summarization. A good summary is shorter than the original document but captures as much of its meaning and information as possible.</p>
<p>Single-document summarization consists of three steps:</p>
<ol>
<li>Vectorize: turn each sentence or passage into a vector</li>
<li>Score: assign each sentence or passage a score</li>
<li>Select: keep the best-scoring sentences or passages</li>
</ol>
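The three steps can be sketched end to end in a few lines of plain Python. This is a minimal illustration, not the method developed in the rest of this notebook: the helper names (`split_sentences`, `tokenize`, `summarize`), the naive sentence splitter, and the scoring rule (average document-wide word frequency) are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, n_sentences=2):
    """Return the n_sentences highest-scoring sentences in document order."""
    sentences = split_sentences(text)

    # 1. Vectorize: one word-count vector (a Counter) per sentence
    vectors = [Counter(tokenize(s)) for s in sentences]

    # 2. Score: average document-wide frequency of the words in each sentence
    #    (an assumed scoring rule, used here only for illustration)
    doc_counts = sum(vectors, Counter())
    scores = []
    for vec in vectors:
        n_words = sum(vec.values())
        total = sum(doc_counts[w] * c for w, c in vec.items())
        scores.append(total / float(n_words) if n_words else 0.0)

    # 3. Select: take the top-scoring sentences, then restore original order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


example = (
    "The cat sat on the mat. "
    "The cat chased the dog. "
    "Rain fell all day."
)
summary = summarize(example, n_sentences=2)
print(summary)
```

With this scoring rule, very common words such as "the" dominate the scores, which is one motivation for the stop-word removal and tf-idf weighting options discussed in this notebook.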

# Import requirements

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
%matplotlib inline

# Import data

In [2]:
chapter_df = pd.read_csv('./KJB_chapters.csv')
chapter_df.head()

Unnamed: 0,book,chapter,char_count,text,word_count
0,Genesis,1,4117,In the beginning God created the heaven and th...,797
1,Genesis,2,3119,"Thus the heavens and the earth were finished, ...",632
2,Genesis,3,3435,Now the serpent was more subtil than any beast...,695
3,Genesis,4,3235,"And Adam knew Eve his wife; and she conceived,...",632
4,Genesis,5,2781,This is the book of the generations of Adam. I...,505


# Preprocessing Text

In [3]:
def chapter_text_to_words(chapter_text):
    """
    Return the words of chapter_text as a list of lowercase strings.
    Every non-letter character (digits, punctuation) is replaced by a space
    before the text is split on whitespace.
    """
    text_letters_only = re.sub("[^a-zA-Z]", " ", chapter_text)
    return text_letters_only.lower().split()


# Other preprocessing options include removing stop words and stemming or lemmatizing.
# For now, the original vocabulary is left unchanged apart from lowercasing.
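To make the effect of this preprocessing concrete, the sketch below runs the function on the preview text of Genesis 4 shown in the table above, and then applies one of the optional steps, stop-word removal. The `remove_stop_words` helper and the tiny `STOP_WORDS` set are hypothetical and exist only for this illustration; a real stop-word list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "and", "in", "of", "a", "to"}


def chapter_text_to_words(chapter_text):
    """Return the lowercase alphabetic words of chapter_text as a list."""
    text_letters_only = re.sub("[^a-zA-Z]", " ", chapter_text)
    return text_letters_only.lower().split()


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word found in stop_words, keeping the original order."""
    return [w for w in words if w not in stop_words]


# Truncated preview text taken from the data table, punctuation included
verse = "And Adam knew Eve his wife; and she conceived,"
words = chapter_text_to_words(verse)
content_words = remove_stop_words(words)

print(words)
print(content_words)
```

The semicolon and comma disappear in the first step, and both occurrences of "and" are dropped in the second.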

# Vectorization (Bag of Words)

`CountVectorizer` learns its vocabulary from the data unless you supply one in advance (the `vocabulary` parameter) or use an analyzer that performs some kind of feature selection. When neither applies, the number of features equals the vocabulary size found by analyzing the data.
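The relationship between vocabulary size and feature count can be seen with a hand-rolled bag of words on two short texts (the preview text of Genesis 1 and 2 from the table above). This is a stand-in written in plain Python, not `CountVectorizer` itself; it only mimics the default tokenization (lowercase, tokens of two or more word characters), and the variable names are assumptions for this sketch.

```python
import re

docs = [
    "In the beginning God created the heaven",
    "Thus the heavens and the earth were finished",
]

# Tokenize roughly the way CountVectorizer does by default:
# lowercase, then keep runs of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

# The vocabulary is learned from the data: every distinct token, sorted
vocab = sorted(set(tok for doc in tokenized for tok in doc))
index = {tok: i for i, tok in enumerate(vocab)}

# One row per document, one column per vocabulary entry
bow = [[0] * len(vocab) for _ in docs]
for row, doc in zip(bow, tokenized):
    for tok in doc:
        row[index[tok]] += 1

print(len(vocab))  # number of features equals number of distinct tokens
print(vocab)
print(bow)
```

Each row has exactly `len(vocab)` columns, so adding documents with new words widens the matrix, while a vocabulary fixed in advance would keep the width constant.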

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(lowercase=True)
bow_matrix = vec.fit_transform(chapter_df['text']) # returns matrix in sparse format

print(type(bow_matrix))
print(bow_matrix.shape)  # (number of chapters, vocabulary size)
bow_matrix_dense = bow_matrix.todense()  # dense copy, for inspection only
print(bow_matrix_dense)

<class 'scipy.sparse.csr.csr_matrix'>
(1189, 12591)
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]


In [27]:
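# view the feature names (the learned vocabulary, in column order)
# note: scikit-learn 1.0 and later provide get_feature_names_out() instead
vocab = vec.get_feature_names()
print(len(vocab))
vocab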

12591


[u'aaron',
 u'aaronites',
 u'abaddon',
 u'abagtha',
 u'abana',
 u'abarim',
 u'abase',
 u'abased',
 u'abasing',
 u'abated',
 u'abba',
 u'abda',
 u'abdeel',
 u'abdi',
 u'abdiel',
 u'abdon',
 u'abednego',
 u'abel',
 u'abelbethmaachah',
 u'abelmaim',
 u'abelmeholah',
 u'abelmizraim',
 u'abelshittim',
 u'abez',
 u'abhor',
 u'abhorred',
 u'abhorrest',
 u'abhorreth',
 u'abhorring',
 u'abi',
 u'abia',
 u'abiah',
 u'abialbon',
 u'abiasaph',
 u'abiathar',
 u'abib',
 u'abida',
 u'abidah',
 u'abidan',
 u'abide',
 u'abideth',
 u'abiding',
 u'abiel',
 u'abiezer',
 u'abiezrite',
 u'abiezrites',
 u'abigail',
 u'abihail',
 u'abihu',
 u'abihud',
 u'abijah',
 u'abijam',
 u'abilene',
 u'ability',
 u'abimael',
 u'abimelech',
 u'abinadab',
 u'abinoam',
 u'abiram',
 u'abishag',
 u'abishai',
 u'abishalom',
 u'abishua',
 u'abishur',
 u'abital',
 u'abitub',
 u'abiud',
 u'abjects',
 u'able',
 u'abner',
 u'aboard',
 u'abode',
 u'abodest',
 u'abolish',
 u'abolished',
 u'abominable',
 u'abominably',
 u'abomination'

# Vectorization (tf-idf)