## Text Summarization: Wikipedia - Congress

11.04.19

Extractive summarization of the overview section of Wikipedia's "United States Congress" article using **TextRank**: the most representative sentence and the top keywords.

In [1]:
import pandas as pd

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
import networkx as nx

import warnings
warnings.filterwarnings('ignore')

In [2]:
text = 'The United States Congress is the bicameral legislature of the federal government of the United States, and consists of two chambers: the House of Representatives and the Senate. The Congress meets in the United States Capitol in Washington, D.C. Both senators and representatives are chosen through direct election, though vacancies in the Senate may be filled by a gubernatorial appointment. Congress has 535 voting members: 435 representatives and 100 senators. The House of Representatives has six non-voting members representing Puerto Rico, American Samoa, Guam, the Northern Mariana Islands, the U.S. Virgin Islands, and the District of Columbia in addition to its 435 voting members. Although they cannot vote in the full house, these members can address the house, sit and vote in congressional committees, and introduce legislation. The members of the House of Representatives serve two-year terms representing the people of a single constituency, known as a "district". Congressional districts are apportioned to states by population using the United States Census results, provided that each state has at least one congressional representative. Each state, regardless of population or size, has two senators. Currently, there are 100 senators representing the 50 states. Each senator is elected at-large in their state for a six-year term, with terms staggered, so every two years approximately one-third of the Senate is up for election. To be eligible for election, a candidate must be aged at least 25 (House) or 30 (Senate), have been a citizen of the United States for seven (House) or nine (Senate) years, and be an inhabitant of the state which they represent. The Congress was created by the Constitution of the United States and first met in 1789, replacing in its legislative function the Congress of the Confederation. Although not legally mandated, in practice since the 19th century, Congress members are typically affiliated with the Republican Party or with the Democratic Party and only rarely with a third party or independents.'

In [3]:
text = text.replace(u'\xa0', u' ')

In [4]:
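# Parse the text with spaCy's small English model.
# ('en' is a legacy shortcut; spaCy 3 needs the full package name, e.g. 'en_core_web_sm'.)
parser = spacy.load('en_core_web_sm')

# Note: from here on, `text` is a spaCy Doc, not a string.
text = parser(text)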

### Sentence Extraction:
1. parse and tokenize text (spaCy),
2. calculate the tf-idf matrix,
3. calculate similarity scores,
4. calculate TextRank ("networkx" package).
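
Steps 2 and 3 can be illustrated without any libraries. The following is a minimal sketch, not the notebook's code: it assumes whitespace tokenization and uses the smoothed idf formula that scikit-learn applies by default. The function names `tfidf_rows` and `similarity_matrix` and the toy sentences are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(sentences):
    """Return one L2-normalized tf-idf dict per sentence (whitespace tokens)."""
    docs = [Counter(s.split()) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for doc in docs for term in doc)
    rows = []
    for doc in docs:
        # Raw term count times smoothed idf: ln((1 + n) / (1 + df)) + 1.
        row = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, tf in doc.items()}
        norm = math.sqrt(sum(v * v for v in row.values()))
        rows.append({t: v / norm for t, v in row.items()})
    return rows

def similarity_matrix(rows):
    """Dot products of the normalized rows, i.e. cosine similarities."""
    return [[sum(v * b.get(t, 0.0) for t, v in a.items()) for b in rows]
            for a in rows]

# Hypothetical toy sentences, already lowercased and free of punctuation.
toy = ["congress has two chambers",
       "the senate is one of the two chambers",
       "districts are apportioned by population"]
sim = similarity_matrix(tfidf_rows(toy))
```

Because every row has unit length, each sentence has similarity 1 with itself, and sentences that share no terms (the first and third here) have similarity 0.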

In [5]:
# Divide the text into sentences and store them as a list of strings.
# (span.text replaces Token.string, which was removed in spaCy 3.)
sentences = [span.text.strip() for span in text.sents]
sentences

['The United States Congress is the bicameral legislature of the federal government of the United States, and consists of two chambers: the House of Representatives and the Senate.',
 'The Congress meets in the United States Capitol in Washington,',
 'D.C. Both senators and representatives are chosen through direct election, though vacancies in the Senate may be filled by a gubernatorial appointment.',
 'Congress has 535 voting members: 435 representatives and 100 senators.',
 'The House of Representatives has six non-voting members representing Puerto Rico, American Samoa, Guam, the Northern Mariana Islands, the U.S. Virgin Islands, and the District of Columbia in addition to its 435 voting members.',
 'Although they cannot vote in the full house, these members can address the house, sit and vote in congressional committees, and introduce legislation.',
 'The members of the House of Representatives serve two-year terms representing the people of a single constituency, known as a "dist

In [6]:
# Configure the tf-idf vectorizer. max_df=.5 ignores terms that appear in more than half of the sentences.
counter = TfidfVectorizer(lowercase=False, 
                          stop_words=None,
                          ngram_range=(1, 1), 
                          analyzer=u'word', 
                          max_df=.5, 
                          min_df=1,
                          max_features=None, 
                          vocabulary=None, 
                          binary=False)

# Fit the vectorizer and build the sentence-by-term tf-idf matrix.
data_counts = counter.fit_transform(sentences)

In [7]:
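# Calculate pairwise sentence similarity. TfidfVectorizer L2-normalizes each row
# by default, so this matrix product gives cosine similarities.
similarity = data_counts @ data_counts.T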
similarity

<14x14 sparse matrix of type '<class 'numpy.float64'>'
	with 170 stored elements in Compressed Sparse Row format>

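### TextRank

Sentence ranks are computed with the "pagerank" function from the "networkx" package.

TextRank is based on PageRank, the algorithm originally used to score the importance of web pages from the links between them.

**Hyperparameters:**
- alpha: damping parameter (0.85 is the networkx default)
- tol: error tolerance used to decide when the iteration has converged

TextRank treats the sentences as nodes of a graph whose edge weights are the pairwise similarity scores. The highest-ranked sentence is the one that is most similar to the rest of the text.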
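
To make the roles of alpha and tol concrete, here is a minimal power-iteration sketch of PageRank on a small weight matrix. It is an illustration under simplifying assumptions, not the networkx implementation (which differs in details such as the exact convergence check). The function name `pagerank_sketch` and the toy similarity values are hypothetical.

```python
def pagerank_sketch(weights, alpha=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a square, non-negative weight matrix."""
    n = len(weights)
    out = [sum(row) for row in weights]  # total outgoing weight per node
    ranks = [1.0 / n] * n                # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by nodes without outgoing weight is spread evenly.
        dangling = sum(ranks[i] for i in range(n) if out[i] == 0)
        new = []
        for j in range(n):
            incoming = sum(ranks[i] * weights[i][j] / out[i]
                           for i in range(n) if out[i] > 0)
            # With probability 1 - alpha, jump to a random node instead.
            new.append((1 - alpha) / n + alpha * (incoming + dangling / n))
        # Stop once the total change between iterations falls below tol.
        if sum(abs(a - b) for a, b in zip(new, ranks)) < tol:
            return new
        ranks = new
    return ranks

# Hypothetical similarity scores for three sentences.
toy_similarity = [[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.0]]
ranks = pagerank_sketch(toy_similarity)
best = max(range(len(ranks)), key=lambda i: ranks[i])
```

In this toy matrix the middle sentence is similar to both of the others, so it receives the highest rank. The ranks always sum to 1.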

In [8]:
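# Build a graph from the similarity matrix and rank the sentences.
# (from_scipy_sparse_matrix was removed in networkx 3.0; from_scipy_sparse_array replaces it.)
nx_graph = nx.from_scipy_sparse_array(similarity)
ranks = nx.pagerank(nx_graph, alpha=.85, tol=1e-8)
ranks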

{0: 0.0878251700622082,
 1: 0.07267215748663565,
 2: 0.06517936005110639,
 3: 0.07342983473788513,
 4: 0.0760510364018347,
 5: 0.05750304306988665,
 6: 0.07540310978760259,
 7: 0.06730017549083059,
 8: 0.07522704634350005,
 9: 0.06095004959733662,
 10: 0.07225105910134201,
 11: 0.07483801243408525,
 12: 0.07958141228047579,
 13: 0.06178853315527037}

In [10]:
# Get the most representative sentence.
ranked = sorted(((ranks[i], s) for i, s in enumerate(sentences)),
                reverse=True)
print(ranked[0])

(0.0878251700622082, 'The United States Congress is the bicameral legislature of the federal government of the United States, and consists of two chambers: the House of Representatives and the Senate.')


### Keyword Extraction:
1. parse and tokenize text (done above),
2. filter out stopwords, choose only nouns and adjectives,
3. calculate the neighbors of words,
4. run TextRank on the neighbor matrix.
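
Step 3 can be sketched on plain strings as follows. This is a simplified illustration: it keys the counts by word string rather than by spaCy token, and the `keep` set stands in for the stopword and part-of-speech filter. The function name `cooccurrence` and the toy sentence are hypothetical.

```python
from collections import defaultdict

def cooccurrence(tokens, keep, window=4):
    """For each kept word, count the kept words among its next `window` tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        if word not in keep:
            continue
        # Slicing stops at the end of the list, so no bounds check is needed.
        for neighbor in tokens[i + 1:i + 1 + window]:
            if neighbor in keep:
                counts[word][neighbor] += 1
    return counts

tokens = "each state has two senators and each senator serves a six year term".split()
# Stand-in for the noun/adjective filter applied to the real text.
keep = {"state", "senators", "senator", "year", "term"}
counts = cooccurrence(tokens, keep)
```

Here "state" is linked to "senators" because "senators" falls within the four tokens that follow it, while "state" and "term" are too far apart to be counted as neighbors.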

In [14]:
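# Remove stop words and keep only nouns and adjectives.
text_filt = [word for word in text
             if not word.is_stop and word.pos_ in ('NOUN', 'ADJ')]
# spaCy tokens compare by position, so a word that occurs twice stays as two entries.
# A list keeps the order fixed for the adjacency matrix and the ranking.
words = list(set(text_filt))
#words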

In [15]:
# Create a grid indicating whether words are within four tokens after the target word.
adjacency = pd.DataFrame(columns=words, index=words, data=0)

# Iterate through each word in the text and
# record which of the filtered words are its neighbors.
for i, word in enumerate(text):
    # Only filtered words (nouns and adjectives) get a row in the grid.
    if word in text_filt:
        end = min(len(text), i + 5)  # stop at the end of the text
        nextwords = text[i + 1:end]  # the next four tokens: potential neighbors
        # Filter the neighbors to select only those in the word list.
        neighbors = [w for w in nextwords if w in text_filt]
        # Add 1 to the adjacency matrix for neighbors of the target word.
        if neighbors:
            adjacency.loc[word, neighbors] += 1

In [16]:
adjacency

Unnamed: 0,senator,-,century,vacancies,population,senators,district,years,congressional,term,...,senators.1,committees,addition,year,election,representative,federal,citizen,practice,gubernatorial
senator,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
-,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
century,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
vacancies,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
population,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
senators,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
district,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
years,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
congressional,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
term,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
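# Run TextRank on the word graph.
# (DataFrame.as_matrix and nx.from_numpy_matrix have been removed; use to_numpy and from_numpy_array.)
nx_words = nx.from_numpy_array(adjacency.to_numpy())
ranks = nx.pagerank(nx_words, alpha=.85, tol=1e-8)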

# Identify the five most highly ranked keywords.
ranked = sorted(((ranks[i],s) for i,s in enumerate(words)),
                reverse=True)
print(ranked[:5])

[(0.0235644461707775, people), (0.021652302118677517, representatives), (0.021541787618127764, states), (0.021541787618127764, members), (0.021541787618127764, voting)]


In the overview of Wikipedia's article on the U.S. Congress, the most representative sentence is:

- "The United States Congress is the bicameral legislature of the federal government of the United States, and consists of two chambers: the House of Representatives and the Senate."

and the top five keywords are:
- people, representatives, states, members, and voting.