In [25]:
# data stuff and utilities:
from dateutil import parser
import pandas as pd
import numpy as np
import requests
import os.path
import random
import html
import re

# NLP stuff:
from sklearn.feature_extraction.text import TfidfVectorizer
from rouge_score import rouge_scorer
from bs4 import BeautifulSoup
from nltk import tokenize
import rouge_score
import nltk
# nltk.download('punkt')

from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# plotting stuff:
import matplotlib.pyplot as plt

In [2]:
# pip install sumy

---
<div class="alert alert-block alert-info">


### About this notebook:
In this notebook, we look at extractive summarization methods, which build a summary by selecting the most important sentences from the original text. See page 246 of "Blueprints for Text Analytics Using Python".

</div>

---
### Import Data:
For this notebook, we'll use "market update" emails from my brother-in-law and apply summarization techniques.

In [3]:
# load all emails and print one of them (row 2):
df = pd.read_csv('data/market_update_emails.csv')
print(df['text'][2])

Good morning.  Last night I noticed Karissa was browsing the Target app on her iPad, ostensibly doing some online shopping at one of her favorite retailers.  I told her that was nice of her considering that their stock dropped 25% yesterday, maybe she’ll get a Thank You note from their CEO.
 
The latest driver of “down” volatility (we’ve had “up” volatility this week too), is retail store earnings which show slowing sales.  That caused yet another fit among traders as the S&P 500 dropped about 4% yesterday.  Futures for the S&P 500 fell another 1.6% overnight, suggesting the index will open close to bear market territory — market shorthand for a 20% fall from a recent high. The broad market gauge briefly fell into a bear market during the pandemic panic in March 2020 before launching on a two-year rally that peaked on Jan. 3 this year.  Of the handful of recurring crosscurrents hanging over the global economy (and dragging down financial markets) is fear of a recession, but others incl

---
---
<div class="alert alert-block alert-info">


### TF/IDF Summarization:
Identify important words (and, by extension, sentences) using TF-IDF values.

</div>
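Before defining the full function, here is a minimal sketch of the scoring idea on a few made-up sentences (the `toy_sentences` list is hypothetical and not part of the email data). Each sentence is scored by summing the TF-IDF weights in its row. Because `TfidfVectorizer` L2-normalizes each row by default, that sum lies between 1 and the square root of the number of distinct terms, so sentences with more distinct terms tend to score higher.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences, used only for illustration.
toy_sentences = [
    "Stocks fell sharply.",
    "Stocks fell as retail earnings showed slowing sales and inflation fears grew.",
    "Rates rose.",
]

# Each row is one sentence; each column is one vocabulary term.
tfidf = TfidfVectorizer().fit_transform(toy_sentences)

# Sentence score = sum of the TF-IDF weights in its row.
scores = np.asarray(tfidf.sum(axis=1)).ravel()

# Number of distinct terms per sentence, for comparison with the scores.
n_terms = (tfidf.toarray() > 0).sum(axis=1)

for sentence, score, k in zip(toy_sentences, scores, n_terms):
    print(f"{score:.3f}  ({k} terms)  {sentence}")
```

Keep this length bias in mind when reading the rankings: a high score can reflect a long sentence as much as an informative one.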

#### Define function to preprocess and rank sentences according to their importance score:

In [4]:
# function definition:
def get_sentence_analysis(text):
    # define df to hold results:
    sentence_df = pd.DataFrame()
    
    # tokenize the text and split into sentences:
    sentences = tokenize.sent_tokenize(text)
    tfidfVectorizer = TfidfVectorizer()
    words_tfidf = tfidfVectorizer.fit_transform(sentences)
    
    # get sentence importance scores:
    sent_sum = words_tfidf.sum(axis=1)
    
    # append data to df:
    sentence_df['sentence'] = sentences
    sentence_df['importance score'] = np.asarray(sent_sum).ravel()
    
    return sentence_df

In [5]:
get_sentence_analysis(df['text'][2]).sort_values(by='importance score', ascending=False)

| | sentence | importance score |
|---|---|---|
| 10 | We know the following already: The pandemic c... | 5.851135 |
| 35 | He previously told CNBC that buying stocks and... | 5.565634 |
| 11 | As inflation started to rise – too many dollar... | 5.469399 |
| 7 | Of the handful of recurring crosscurrents hang... | 5.399933 |
| 29 | It’s important to remember that this sort of d... | 5.161159 |
| 15 | Whether this causes a full-blown recession whe... | 5.072455 |
| 12 | Throw in higher energy prices due to the war a... | 4.932101 |
| 31 | As investors, not traders, this is fundamental... | 4.836705 |
| 9 | Let me just help cut through the noise as the ... | 4.789492 |
| 19 | If we think of the “course” as fair value or w... | 4.752204 |


#### Define function to summarize text:

In [6]:
def get_text_summary(text, num_sentences):
    # get the analysis for all sentences in the text:
    sentence_df = get_sentence_analysis(text)
    
    # get the top num_sentences by score:
    candidate_sentences = sentence_df.sort_values(by='importance score', 
                                            ascending=False)[:num_sentences]
    
    # sort the top sentences according to their order in the text (index)
    top_sentences = candidate_sentences['sentence'].sort_index()
    
    # concatenate sentence to create summarized text:
    summary = '\n'.join(top_sentences)
    
    return summary

In [7]:
print(get_text_summary(text=df['text'][2], num_sentences=5))

Of the handful of recurring crosscurrents hanging over the global economy (and dragging down financial markets) is fear of a recession, but others include inflation, rising rates, war in eastern Europe, and rising COVID cases/lockdowns in China.
We know the following already:  The pandemic caused the government to lower interest rates to unprecedented levels (basically zero) to stimulate economic activity, while at the same time infusing trillions of dollars into the system via direct payments to consumers (aka “helicopter money”).
As inflation started to rise – too many dollars chasing too few goods/services which were still constrained by supply chain bottlenecks – the Fed was late to the party and wrote off the early signs of inflation as temporary.
It’s important to remember that this sort of data, which are generally called “valuation metrics,” CANNOT be used market timing tools to know when to trade in and out of markets.
He previously told CNBC that buying stocks and holding the

---
---
<div class="alert alert-block alert-info">


### Latent Semantic Analysis (LSA) Summarization:
See Page 250 of "Blueprints for Text Analytics Using Python"

</div>
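To give a feel for what LSA does, here is a rough sketch under some assumptions: it uses a TF-IDF term-by-sentence matrix, NumPy's SVD, and one common scoring scheme (the length of each sentence vector in the top-k topic space, weighted by the singular values). The function name `lsa_sentence_scores` and the parameter `n_topics` are hypothetical, and sumy's `LsaSummarizer` may differ in its details.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_sentence_scores(sentences, n_topics=2):
    """Score sentences by their weight in the top latent topics (illustrative sketch)."""
    # Rows = terms, columns = sentences.
    term_sentence = TfidfVectorizer().fit_transform(sentences).T.toarray()

    # term_sentence = U @ diag(sigma) @ vt; each row of vt is a topic,
    # each column of vt describes one sentence in topic space.
    _, sigma, vt = np.linalg.svd(term_sentence, full_matrices=False)

    # Keep only the strongest topics.
    k = min(n_topics, len(sigma))

    # Score = norm of the sentence vector in the reduced space,
    # with each topic weighted by its singular value.
    return np.sqrt(((sigma[:k, None] * vt[:k]) ** 2).sum(axis=0))

# Hypothetical toy sentences, used only for illustration.
toy_sentences = [
    "Stocks fell sharply today.",
    "Stocks fell as rates rose.",
    "Rates rose again.",
    "The cat slept.",
]
lsa_scores = lsa_sentence_scores(toy_sentences, n_topics=2)
print(lsa_scores.round(3))
```

Sentences that load heavily on the dominant topics get the highest scores, and a summary would keep the top few in their original order.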

#### Define function to summarize text with LSA:

In [10]:
def get_lsa_summary(text, num_sentences):
    # define LSA params:
    LANGUAGE = "english"
    stemmer = Stemmer(LANGUAGE)
    
    # parse and factorize:
    parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
    summarizer = LsaSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    
    # join important sentences into a summary:
    sentences = [str(x) for x in summarizer(parser.document, num_sentences)]
    summary = '\n'.join(sentences)
    
    return summary

In [14]:
print(get_lsa_summary(text=df['text'][2], num_sentences=4))

We know the following already:  The pandemic caused the government to lower interest rates to unprecedented levels (basically zero) to stimulate economic activity, while at the same time infusing trillions of dollars into the system via direct payments to consumers (aka “helicopter money”).
Throw in higher energy prices due to the war and further COVID-driven supply constraints, the Fed is now having to be more aggressive in raising interest rates.
Whether this causes a full-blown recession where economic activity slows dramatically and unemployment rises OR if the Fed managed to orchestrate a so-called “soft landing,” remains unknown.
I like to invoke the wisdom of the late Jack Bogle, the founder of Vanguard and the father of the index fund, who always recommended a buy-and-hold strategy for investors.


---
---
<div class="alert alert-block alert-info">


### TextRank Summarization:
See Page 254 of "Blueprints for Text Analytics Using Python".
    
Similar to PageRank (from the paper by Brin and Page), TextRank treats each sentence as a node in a graph (analogous to web pages in PageRank) and determines the weights of the edges connecting them using similarity functions (such as the number of shared lexical tokens or cosine similarity).
</div>
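The graph idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not sumy's implementation: it uses TF-IDF cosine similarity as the edge weight and a plain power iteration with a damping factor of 0.85. The function name `textrank_scores` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_scores(sentences, damping=0.85, n_iter=50):
    """Rank sentences with a PageRank-style iteration (illustrative sketch)."""
    n = len(sentences)

    # Rows are L2-normalized by default, so the dot product is cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = (tfidf @ tfidf.T).toarray()
    np.fill_diagonal(sim, 0.0)  # no self-loops

    # Turn similarities into transition probabilities (each row sums to 1).
    # Sentences with no similar neighbours spread their weight uniformly.
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)

    # Power iteration: a sentence is important if important sentences point to it.
    scores = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        scores = (1 - damping) / n + damping * (transition.T @ scores)
    return scores

# Hypothetical toy sentences, used only for illustration.
toy_sentences = [
    "Stocks fell sharply today.",
    "Stocks fell as rates rose.",
    "Rates rose again.",
    "The cat slept.",
]
tr_scores = textrank_scores(toy_sentences)
print(tr_scores.round(3))
```

Sentences that share vocabulary with many other sentences accumulate the most weight, which is why TextRank tends to pick sentences that are central to the text as a whole.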

<div class="alert alert-block alert-warning">

According to the authors of "Blueprints...", TextRank is the preferred method for analyzing large pieces of text content.

</div>

#### Define TextRank summarization function:

In [32]:
def get_textrank_summary(text, num_sentences):
    # define textrank params and objects:
    parser = PlaintextParser.from_string(text, Tokenizer('english'))
    stemmer = Stemmer('english')
    summarizer = TextRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words('english')
    
    # get top sentences and create summary:
    top_sentences = [str(x) for x in summarizer(parser.document, num_sentences)]
    summary = '\n'.join(top_sentences)
    
    return summary

In [33]:
print(get_textrank_summary(text=df['text'][2], num_sentences=4))

Of the handful of recurring crosscurrents hanging over the global economy (and dragging down financial markets) is fear of a recession, but others include inflation, rising rates, war in eastern Europe, and rising COVID cases/lockdowns in China.
Remember, the economy and financial markets (seen daily via stock/bond/commodity markets) are not the same thing.
This means the biggest question right now for investors (not traders), is whether the financial markets and associated asset prices are overshooting their course.
It’s important to remember that this sort of data, which are generally called “valuation metrics,” CANNOT be used market timing tools to know when to trade in and out of markets.
