In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import nltk
import string
import pandas as pd
import numpy as np

#download assets from nltk
#nltk.download('stopwords')
#nltk.download('punkt')

def tfidf(corpus):
    '''
    Computes the TF-IDF (term frequency - inverse document frequency) matrix

    Args
    - corpus: a list of documents

    Returns
    - tfidfVec: an m x n sparse matrix of TF-IDF weights. m = number of documents, n = number of distinct terms
    - vocab: all the unique words used in the corpus, excluding stop words
    '''

    vectorizer = TfidfVectorizer(stop_words = stopwords.words('english'))
    #vectorizer = CountVectorizer(stop_words='english')
    tfidfVec = vectorizer.fit_transform(corpus)
    #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    vocab = vectorizer.get_feature_names_out()
        
    return tfidfVec, vocab

def svd(tfidfVec):
    '''
    Computes a truncated singular value decomposition of an m x n matrix.
    A ~ U * sigma * V^t

    Args
    - tfidfVec: an m x n matrix. m = number of documents or sentences, n = number of terms

    Returns
    - u: an m x r document-topic table. r = number of topics.
         TruncatedSVD.fit_transform returns the projected documents U * sigma,
         not the unit-length left singular vectors U
    - sigma: a length-r array of singular values in decreasing order
    - vt: an n x r term-topic table whose columns are the right singular vectors
          (components_ transposed, i.e. V rather than V^t)
    '''

    lsa = TruncatedSVD(n_components = 10, n_iter=20)
    u = lsa.fit_transform(tfidfVec)
    sigma = lsa.singular_values_
    vt = lsa.components_.T

    return u, sigma, vt

def getImportantSentences(u, sigma):
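    '''
    Uses the LSA enhancement described by Josef Steinberger et al.
    Keep all topics whose singular value is at least half of the largest singular value.

    Compute s_k = sqrt(sum(u_ki^2 * sigma_i^2) for i = 1 to n)
    s_k is the length of the vector of the kth sentence
    n is the number of topics kept
    u_ki is used here because sentences are the rows of the input matrix

    Returns
    - sentence indices sorted from the highest to the lowest s_k
    '''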
    #look for the sigma value range that we need to consider using binary search
    #sigma array is sorted in descending order and will never be empty
    l, r, target = 0, len(sigma), sigma[0]/2
    while l < r:
        mid = l + (r-l)//2

        if sigma[mid] < target:
            r = mid
        else:
            l = mid + 1
    sigmaBound = l

    uSlice = u[:, :sigmaBound]
    sigmaSlice = sigma[:sigmaBound]
    #broadcasting scales column i of uSq by sigma_i^2 without building a diagonal matrix
    uSq = np.square(uSlice)
    sigSq = np.square(sigmaSlice)
    result = np.sqrt(np.sum(uSq * sigSq, axis = 1))

    return (-result).argsort()

def createWordToSentenceMap(corpus):
    wordToSentence = {}
    stopWords = set(stopwords.words('english'))

    for i, doc in enumerate(corpus):
        #remove punctuation; note this also strips apostrophes, so "doesn't" becomes "doesnt"
        sanitizeText = doc.translate(str.maketrans('', '', string.punctuation))
        tokenized = word_tokenize(sanitizeText)
        #remove duplicate words
        tokenized = list(set([word.lower() for word in tokenized]))

        for word in tokenized:
            if word not in stopWords:
                if word not in wordToSentence:
                    wordToSentence[word] = [i]
                else:
                    wordToSentence[word].append(i)
    
    return wordToSentence

def extractSummary(u, sigma, k, corpus):
    '''
    Returns the k highest-scoring sentences according to getImportantSentences(),
    ordered from most to least important.
    '''
    return [corpus[i] for i in getImportantSentences(u, sigma)[:k]]
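
The score computed in getImportantSentences() can also be written without the explicit binary search. The following is a minimal sketch, assuming sigma is sorted in descending order as TruncatedSVD returns it. The name sentenceScores and the toy values are illustrative and not part of the notebook.

```python
import numpy as np

def sentenceScores(u, sigma):
    # Hypothetical helper: keep topics whose singular value is at least
    # half of the largest one, mirroring the threshold used above.
    u = np.asarray(u, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    keep = sigma >= sigma[0] / 2
    # s_k = sqrt(sum_i u_ki^2 * sigma_i^2) over the kept topics
    return np.sqrt(np.sum(np.square(u[:, keep]) * np.square(sigma[keep]), axis=1))

# Toy data (made up for illustration): 2 sentences, 3 topics
u = np.array([[1.0, 0.0, 9.0],
              [0.0, 2.0, 9.0]])
sigma = np.array([2.0, 1.5, 0.5])

scores = sentenceScores(u, sigma)
ranking = np.argsort(-scores)  # highest score first
print(scores)
print(ranking)
```

In this toy case the third topic falls below the threshold (0.5 < 1.0), so the large values in the last column of u do not influence the ranking.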


In [49]:
def preProcess(blockText):
    return sent_tokenize(blockText)
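
createWordToSentenceMap() strips all punctuation, which also removes the apostrophe inside contractions. A possible alternative, sketched below with only the standard library, tokenizes with a regular expression that keeps internal apostrophes. The function name, the regex, and the tiny stop-word set are assumptions for illustration; the notebook itself uses NLTK's word_tokenize and stop-word list.

```python
import re
from collections import defaultdict

def buildWordIndex(sentences, stopWords):
    # Hypothetical variant of createWordToSentenceMap().
    # A token is a run of letters/digits that may contain internal
    # apostrophes, e.g. "doesn't"; a trailing apostrophe is dropped.
    tokenPattern = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")
    index = defaultdict(list)
    for i, sentence in enumerate(sentences):
        # a set removes duplicate words within one sentence
        for word in sorted(set(tokenPattern.findall(sentence.lower()))):
            if word not in stopWords:
                index[word].append(i)
    return dict(index)

# Illustrative input; the stop-word set here is deliberately tiny
sentences = ["That doesn't mean Francis is the first.",
             "Francis' pattern is very clear."]
stopWords = {"that", "is", "the", "very"}

wordIndex = buildWordIndex(sentences, stopWords)
print(wordIndex)
```

Because sentences are visited in order, each word's list of sentence indices comes out sorted, which matches the behavior of the original function.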

In [54]:
text = '''
For the second time during his papacy, Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world.

Pope Francis said Sunday that he would hold a meeting of cardinals on February 14 "during which I will name 15 new Cardinals who, coming from 13 countries from every continent, manifest the indissoluble links between the Church of Rome and the particular Churches present in the world," according to Vatican Radio.

New cardinals are always important because they set the tone in the church and also elect the next pope, CNN Senior Vatican Analyst John L. Allen said. They are sometimes referred to as the princes of the Catholic Church.

The new cardinals come from countries such as Ethiopia, New Zealand and Myanmar.

"This is a pope who very much wants to reach out to people on the margins, and you clearly see that in this set," Allen said. "You're talking about cardinals from typically overlooked places, like Cape Verde, the Pacific island of Tonga, Panama, Thailand, Uruguay."

But for the second time since Francis' election, no Americans made the list.

"Francis' pattern is very clear: He wants to go to the geographical peripheries rather than places that are already top-heavy with cardinals," Allen said.

Christopher Bellitto, a professor of church history at Kean University in New Jersey, noted that Francis announced his new slate of cardinals on the Catholic Feast of the Epiphany, which commemorates the visit of the Magi to Jesus' birthplace in Bethlehem.

"On feast of three wise men from far away, the Pope's choices for cardinal say that every local church deserves a place at the big table."

In other words, Francis wants a more decentralized church and wants to hear reform ideas from small communities that sit far from Catholicism's power centers, Bellitto said.

That doesn't mean Francis is the first pontiff to appoint cardinals from the developing world, though. Beginning in the 1920s, an increasing number of Latin American churchmen were named cardinals, and in the 1960s, St. John XXIII, whom Francis canonized last year, appointed the first cardinals from Japan, the Philippines and Africa.

In addition to the 15 new cardinals Francis named on Sunday, five retired archbishops and bishops will also be honored as cardinals.

Last year, Pope Francis appointed 19 new cardinals, including bishops from Haiti and Burkina Faso.

CNN's Daniel Burke and Christabelle Fombu contributed to this report hi@gmail.com.
'''

corpus = preProcess(text)
print(corpus)

tfidfVec, vocab = tfidf(corpus)
wordToSentence = createWordToSentenceMap(corpus)
print(vocab)
print(wordToSentence)
print(tfidfVec)
print('----------------------------------------------------------')

u, sigma, vt = svd(tfidfVec)
#exclusive upper bound for the 1-based topic labels; the actual topic count is u.shape[1]
numTopics = u.shape[1] + 1

dfSVD = pd.DataFrame(u, columns=[f'topic{str(i)}' for i in range(1, numTopics)])
docCol = pd.DataFrame({'Documents': corpus})
dfSVD = pd.concat([docCol, dfSVD], axis = 1)

display(dfSVD)
print('----------------------------------------------------------')
print(sigma)

print('----------------------------------------------------------')

dfVt = pd.DataFrame(vt, columns=[f'topic{str(i)}' for i in range(1, numTopics)])
vocabCol = pd.DataFrame({'Terms': vocab})
dfVt = pd.concat([vocabCol, dfVt], axis = 1)

display(dfVt)

for i in range(1, numTopics):
    dfVtSort = dfVt.sort_values(by=f'topic{i}', ascending=False)
    display(dfVtSort[['Terms', f'topic{i}']])
print('----------------------------------------------------------')


#df = pd.DataFrame()
#print(df)

['\nFor the second time during his papacy, Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world.', 'Pope Francis said Sunday that he would hold a meeting of cardinals on February 14 "during which I will name 15 new Cardinals who, coming from 13 countries from every continent, manifest the indissoluble links between the Church of Rome and the particular Churches present in the world," according to Vatican Radio.', 'New cardinals are always important because they set the tone in the church and also elect the next pope, CNN Senior Vatican Analyst John L. Allen said.', 'They are sometimes referred to as the princes of the Catholic Church.', 'The new cardinals come from countries such as Ethiopia, New Zealand and Myanmar.', '"This is a pope who very much wants to reach out to people on the margins, and you clearly see that in this set," Allen said.', '"You\'re talking about cardinals from typically overlooked plac

Unnamed: 0,Documents,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
0,"\nFor the second time during his papacy, Pope ...",0.59315,-0.200822,0.015477,0.461551,0.004958,-0.03375,0.003493,-0.034704,0.042379,0.026577
1,Pope Francis said Sunday that he would hold a ...,0.459667,0.023715,0.144921,-0.102591,0.201304,0.133781,-0.138787,0.466273,-0.080927,0.133955
2,New cardinals are always important because the...,0.449935,0.310885,-0.002873,-0.078796,0.369663,-0.18447,0.074091,-0.028117,-0.013942,0.343917
3,They are sometimes referred to as the princes ...,0.097619,0.161203,0.534549,-0.180251,-0.292602,-0.021218,0.273854,-0.198314,-0.233401,0.47286
4,The new cardinals come from countries such as ...,0.428025,-0.219335,0.228356,0.057098,0.33236,0.253816,-0.119868,-0.127033,-0.062954,-0.306114
5,"""This is a pope who very much wants to reach o...",0.268698,0.588904,-0.236137,0.15118,0.133828,-0.130595,-0.161936,-0.124921,-0.029821,0.165342
6,"""You're talking about cardinals from typically...",0.100064,0.036001,-0.176279,-0.132891,-0.05779,0.672187,0.430196,0.010369,0.444586,0.128649
7,But for the second time since Francis' electio...,0.192608,-0.134996,-0.071122,0.701259,-0.34793,-0.147129,0.209871,-0.010323,0.157328,0.137923
8,"""Francis' pattern is very clear: He wants to g...",0.269457,0.465943,-0.338273,0.003167,-0.100263,0.279208,0.080132,-0.085668,-0.097935,-0.118879
9,"Christopher Bellitto, a professor of church hi...",0.34889,0.003073,0.522757,-0.056605,-0.182312,0.113772,0.115299,-0.26138,-0.105647,-0.27084


----------------------------------------------------------
[1.41828601 1.14807645 1.07917382 1.05317454 1.03406883 1.01548284
 1.00338675 0.99160554 0.9817554  0.96273915]
----------------------------------------------------------


Unnamed: 0,Terms,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
0,13,0.045788,0.003605,0.024933,-0.018533,0.037721,0.025994,-0.027621,0.095015,-0.016824,0.028958
1,14,0.045788,0.003605,0.024933,-0.018533,0.037721,0.025994,-0.027621,0.095015,-0.016824,0.028958
2,15,0.113751,-0.048853,0.010032,-0.035332,0.070938,0.044701,-0.040271,0.092082,-0.020804,0.044159
3,19,0.077024,-0.059580,-0.050474,-0.074408,-0.035718,-0.074188,-0.037241,-0.104776,0.086755,-0.025192
4,1920s,0.036180,-0.038510,-0.067171,-0.095765,-0.062962,-0.057083,0.007941,-0.032121,0.009661,0.010855
...,...,...,...,...,...,...,...,...,...,...,...
170,world,0.151058,-0.072410,-0.030666,0.071018,-0.064964,0.020491,0.022059,0.231090,-0.144643,0.018435
171,would,0.045788,0.003605,0.024933,-0.018533,0.037721,0.025994,-0.027621,0.095015,-0.016824,0.028958
172,xxiii,0.036180,-0.038510,-0.067171,-0.095765,-0.062962,-0.057083,0.007941,-0.032121,0.009661,0.010855
173,year,0.098848,-0.085651,-0.102726,-0.148592,-0.086166,-0.114624,-0.025584,-0.119535,0.084188,-0.012519


Unnamed: 0,Terms,topic1
33,cardinals,0.350546
110,new,0.348287
70,francis,0.247575
126,pope,0.208585
27,bishops,0.200440
...,...,...
28,burke,0.004797
136,report,0.004797
46,com,0.004797
38,christabelle,0.004797


Unnamed: 0,Terms,topic2
167,wants,0.356141
139,said,0.278218
9,allen,0.250844
40,church,0.165966
142,see,0.151992
...,...,...
19,archbishops,-0.094598
47,come,-0.101909
33,cardinals,-0.103782
27,bishops,-0.131514


Unnamed: 0,Terms,topic3
34,catholic,0.284549
40,church,0.264397
149,sometimes,0.225792
129,princes,0.225792
134,referred,0.225792
...,...,...
18,appointed,-0.102726
124,places,-0.110454
167,wants,-0.112282
67,first,-0.118329


Unnamed: 0,Terms,topic4
157,time,0.328939
141,second,0.328939
99,made,0.243477
14,americans,0.243477
97,list,0.243477
...,...,...
67,first,-0.104600
33,cardinals,-0.123176
93,last,-0.148592
173,year,-0.148592


Unnamed: 0,Terms,topic5
110,new,0.181678
45,cnn,0.178344
53,countries,0.143724
61,ethiopia,0.126877
107,myanmar,0.126877
...,...,...
129,princes,-0.134611
149,sometimes,-0.134611
34,catholic,-0.150735
67,first,-0.161639


Unnamed: 0,Terms,topic6
124,places,0.226866
153,talking,0.181098
95,like,0.181098
114,overlooked,0.181098
115,pacific,0.181098
...,...,...
69,fombu,-0.124360
54,daniel,-0.124360
52,contributed,-0.124360
46,com,-0.124360


Unnamed: 0,Terms,topic7
78,hi,0.235292
136,report,0.235292
54,daniel,0.235292
46,com,0.235292
38,christabelle,0.235292
...,...,...
167,wants,-0.055999
62,every,-0.059822
110,new,-0.061138
53,countries,-0.066555


Unnamed: 0,Terms,topic8
170,world,0.231090
155,though,0.211343
125,pontiff,0.211343
17,appoint,0.211343
57,developing,0.211343
...,...,...
18,appointed,-0.119535
93,last,-0.119535
173,year,-0.119535
110,new,-0.137398


Unnamed: 0,Terms,topic9
98,local,0.163380
105,men,0.163380
123,place,0.163380
25,big,0.163380
32,cardinal,0.163380
...,...,...
57,developing,-0.181936
125,pontiff,-0.181936
17,appoint,-0.181936
103,mean,-0.181936


Unnamed: 0,Terms,topic10
149,sometimes,0.250968
129,princes,0.250968
134,referred,0.250968
34,catholic,0.162249
144,set,0.135441
...,...,...
107,myanmar,-0.134815
174,zealand,-0.134815
110,new,-0.148576
23,bellitto,-0.163686


----------------------------------------------------------


In [51]:
summary = '\n'.join(extractSummary(u, sigma, 5, corpus))
print(summary)



For the second time during his papacy, Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world.
CNN's Daniel Burke and Christabelle Fombu contributed to this report.
"You're talking about cardinals from typically overlooked places, like Cape Verde, the Pacific island of Tonga, Panama, Thailand, Uruguay."
"On feast of three wise men from far away, the Pope's choices for cardinal say that every local church deserves a place at the big table."
They are sometimes referred to as the princes of the Catholic Church.
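
One detail worth keeping in mind when reading these scores: scikit-learn's TruncatedSVD.fit_transform returns the documents projected onto the topics, which equals U * sigma rather than U itself. The sketch below illustrates that relationship with NumPy's SVD on a made-up 3 x 2 matrix. It does not call scikit-learn, and the variable names are illustrative.

```python
import numpy as np

# Made-up document-term matrix (3 documents, 2 terms)
A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [2.0, 1.0]])

# Thin SVD: A = U @ diag(sigma) @ Vt
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Projecting the documents onto the right singular vectors gives U * sigma
projected = A @ Vt.T
assert np.allclose(projected, U * sigma)

# Dividing each column by its singular value recovers unit-length columns
recoveredU = projected / sigma
print(np.linalg.norm(recoveredU, axis=0))
```

If unit-length left singular vectors are wanted from svd(), dividing u by sigma would be one way to obtain them, though that would change the sentence ranking produced by getImportantSentences().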
