# CS230 Project Milestone
Aaron Reed (aaron73@stanford.edu)

Ivan Villa-Renteria (ivillar@stanford.edu)


## Objective

The ultimate goal of our project is to produce summaries of psychotherapy sessions to aid therapists and their clients.  As a preliminary milestone, we intended to apply an *unsupervised* summarization pipeline proposed by [Padmakumar and Saran](https://www.cs.utexas.edu/~asaran/reports/summarization.pdf) and implemented by [Chauhan](https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1) to psychotherapy data. Our plan was to use this unsupervised method as a baseline and compare it with deep learning methods using ROUGE-2 scores. However, we have been unable to access the [psychotherapy transcripts dataset](https://alexanderstreet.com/products/counseling-and-psychotherapy-transcripts-series) we originally intended to use. Here we use a subset of the [Enron Email Dataset](https://www.cs.cmu.edu/~enron/) for the purpose of pipeline development.

The following is a representative sample email from the Enron dataset.


In [1]:
!cat sample_email.txt

Message-ID: <11578703.1075855681711.JavaMail.evans@thyme>
Date: Tue, 12 Sep 2000 06:42:00 -0700 (PDT)
From: phillip.allen@enron.com
To: bs_stone@yahoo.com
Subject: Re: Sept 1 Payment
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Brenda Stone <bs_stone@yahoo.com> @ ENRON
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\Sent
X-Origin: Allen-P
X-FileName: pallen.nsf

Brenda,

 I checked my records and I mailed check #1178 for the normal amount on 
August 28th.  I mailed it to 4303 Pate Rd. #29, College Station, TX 77845.  I 
will go ahead and mail you another check.  If the first one shows up you can 
treat the 2nd as payment for October.

 I know your concerns about the site plan.  I will not proceed without 
getting the details and getting your approval.

 I will find that amortization schedule and send it soon.

Phillip

## Pipeline walkthrough
### Overview

The summarization pipeline consists of the following steps:


1.   Data cleaning*
2.   Sentence tokenization
3.   Skip-thought encoding
4.   Clustering
5.   Summarization

\* Chauhan includes language detection between cleaning and tokenization. However, since our psychotherapy data will be in English, we omit this step.

### Install and import
The `setup.sh` script downloads about 5GB of model parameters the first time it is run. Please be patient!

In [2]:
!./setup.sh # This downloads model parameters. Wait about 5 minutes.

import numpy as np
# from talon.signature.bruteforce import extract_signature
import nltk
nltk.download('punkt') # for tokenization
from nltk.tokenize import sent_tokenize
import skipthoughts.skipthoughts as skipthoughts
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

[nltk_data] Downloading package punkt to /Users/aaronreed/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 1. Data cleaning
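In this step, headers are stripped from emails, since they do not contribute semantic data relevant to summarization. An open-source [utility](https://github.com/mailgun/talon) can also strip signatures, but its `extract_signature` call is commented out in the code below, which instead drops a fixed number of header lines and joins the remaining non-empty lines. The resulting text is the body portion of the email.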



In [3]:
kNumMetadataLines = 16 # throw out this many lines from header

def preprocess(emails):
    """
    Performs preprocessing operations such as:
        1. Removing metadata lines.
        2. Removing new line characters.
    """
    n_emails = len(emails)
    for i in range(n_emails):
        email = emails[i]
        # email, _ = extract_signature(email)
        lines = email.split('\n')
        lines = lines[kNumMetadataLines:] # remove metadata lines
        for j in reversed(range(len(lines))):
            lines[j] = lines[j].strip()
            if lines[j] == '':
                lines.pop(j)
        emails[i] = ' '.join(lines)
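
The fixed line count above assumes every message has a header of the same length as the sample. As a hedged alternative, the sketch below separates headers from the body with Python's standard-library `email` parser. It is not part of the original pipeline; `extract_body` and the shortened message in `raw` are hypothetical, and the sketch assumes plain-text, single-part messages.

```python
from email import message_from_string

def extract_body(raw_email):
    """Return the body of a plain-text, single-part email as one line of text."""
    message = message_from_string(raw_email)
    body = message.get_payload()  # a str for non-multipart messages
    # Drop blank lines and join the rest with spaces, as preprocess() does.
    lines = [line.strip() for line in body.split('\n')]
    return ' '.join(line for line in lines if line)

# Hypothetical, shortened message in the style of the sample email.
raw = (
    "From: phillip.allen@enron.com\n"
    "Subject: Re: Sept 1 Payment\n"
    "X-FileName: pallen.nsf\n"
    "\n"
    "Brenda,\n"
    "\n"
    " I will find that amortization schedule and send it soon.\n"
    "\n"
    "Phillip\n"
)
print(extract_body(raw))
# Brenda, I will find that amortization schedule and send it soon. Phillip
```

Parsing the headers rather than counting lines keeps the cleaning step independent of how many `X-` fields a given message carries.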

### 2. Sentence tokenization
The body of the email is split into individual sentences which will be encoded into skip-thought vectors in the next step.

In [4]:
def split_sentences(emails):
    """
    Splits the emails into individual sentences
    """
    n_emails = len(emails)
    for i in range(n_emails):
        email = emails[i]
        sentences = sent_tokenize(email)
        for j in reversed(range(len(sentences))):
            sent = sentences[j].strip()
            sentences[j] = sent
            if sent == '':
                sentences.pop(j)
        emails[i] = sentences

### 3. Skip-thought encoding
Skip-thought vectors, due to [Kiros et al.](https://arxiv.org/abs/1506.06726), come from a pre-trained encoder-decoder model that maps a sentence to a vector and is trained to predict the sentences surrounding it. Because the model is trained on a large [corpus](https://arxiv.org/abs/1506.06724), sentences with similar meanings tend to receive similar vectors even when they share few words. This pipeline uses only the encoder and selects sentences that already appear in the email, so the summaries it produces are *extractive* rather than abstractive.

This implementation relies on a version of `skip-thoughts` ported to Python 3 by Chiao An Yang: https://github.com/tartarskunk/skip-thoughts 

In [5]:
def skipthought_encode(emails):
    """
    Obtains sentence embeddings for each sentence in the emails
    """
    enc_emails = [None]*len(emails)
    cum_sum_sentences = [0]
    sent_count = 0
    for email in emails:
        sent_count += len(email)
        cum_sum_sentences.append(sent_count)

    all_sentences = [sent for email in emails for sent in email]
    print('Loading pre-trained models...')
    model = skipthoughts.load_model()
    encoder = skipthoughts.Encoder(model)
    print('Encoding sentences...')
    enc_sentences = encoder.encode(all_sentences, verbose=False)

    for i in range(len(emails)):
        begin = cum_sum_sentences[i]
        end = cum_sum_sentences[i+1]
        enc_emails[i] = enc_sentences[begin:end]
    return enc_emails
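
The bookkeeping in `skipthought_encode` (flatten all sentences, encode them in one batch, then slice the result back into per-email groups) can be illustrated without loading the model. The following is a minimal sketch under stated assumptions: `encode_batch` is a hypothetical stand-in for the skip-thought encoder, and `encode_grouped` is a hypothetical name.

```python
from itertools import accumulate

def encode_batch(sentences):
    """Hypothetical stand-in encoder: maps each sentence to a 1-D 'vector'."""
    return [[float(len(s))] for s in sentences]

def encode_grouped(emails):
    """Encode all sentences in one batch, then regroup the vectors per email."""
    # Offsets into the flat sentence list: [0, n_0, n_0 + n_1, ...]
    offsets = [0] + list(accumulate(len(email) for email in emails))
    flat = [sent for email in emails for sent in email]
    vectors = encode_batch(flat)
    return [vectors[offsets[i]:offsets[i + 1]] for i in range(len(emails))]

emails = [["Hi.", "Thanks!"], [], ["See you soon."]]
grouped = encode_grouped(emails)
print([len(g) for g in grouped])  # [2, 0, 1]
```

Encoding in a single batch means the model is loaded and called once, while the offsets make sure each email gets back exactly its own vectors, including emails with no sentences.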

### 4. Clustering
The *k*-means method is used to cluster the sentences of each email, encoded as skip-thought vectors. The distance metric is Euclidean and there are $\lceil \sqrt{n} \rceil$ clusters, where $n$ is the number of sentences in the email.
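
The cluster count can be checked with a few lines of Python. This is a minimal sketch; `num_clusters` is a hypothetical helper that mirrors the expression `int(np.ceil(len(enc_email)**0.5))` used in `summarize`, which rounds the square root up.

```python
import math

def num_clusters(n_sentences):
    """Cluster count used in this pipeline: the ceiling of sqrt(n_sentences)."""
    return math.ceil(math.sqrt(n_sentences))

for n in (1, 4, 7, 9, 10):
    print(n, num_clusters(n))
# 1 1
# 4 2
# 7 3
# 9 3
# 10 4
```

Because the summary holds one sentence per cluster, its length grows roughly with the square root of the email length.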

### 5. Summarization

For each cluster, the sentence closest to the cluster mean is selected as its representative. The summary of an email contains one representative sentence per cluster, with clusters ordered by the average position of their sentences in the email.

In [6]:
def summarize(emails):
    """
    Performs summarization of emails
    """
    n_emails = len(emails)
    summary = [None]*n_emails
    print('Preprocesing...')
    preprocess(emails)
    print('Splitting into sentences...')
    split_sentences(emails)
    print('Starting to encode...')
    enc_emails = skipthought_encode(emails)
    print('Encoding Finished')
    for i in range(n_emails):
        enc_email = enc_emails[i]
        n_clusters = int(np.ceil(len(enc_email)**0.5))
        kmeans = KMeans(n_clusters=n_clusters, random_state=0)
        kmeans = kmeans.fit(enc_email)
        avg = []
        for j in range(n_clusters):
            idx = np.where(kmeans.labels_ == j)[0]
            avg.append(np.mean(idx))  # mean position of the cluster's sentences
        # index of the sentence nearest to each cluster center
        closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_,
                                                   enc_email)
        ordering = sorted(range(n_clusters), key=lambda k: avg[k])
        summary[i] = ' '.join([emails[i][closest[idx]] for idx in ordering])
    print('Clustering Finished')
    return summary
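
The selection and ordering logic at the end of `summarize` can be traced on toy data without scikit-learn. The sketch below is an illustration under stated assumptions, not the pipeline itself: the 2-D vectors, cluster labels, and centroids are hand-made, `pick_representatives` is a hypothetical name, and every cluster is assumed to have at least one member.

```python
import math

def pick_representatives(vectors, labels, centroids):
    """Return sentence indices: one per cluster, ordered by mean cluster position."""
    chosen = []
    for k, center in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == k]
        mean_position = sum(members) / len(members)
        # As in summarize(), the nearest sentence is searched over all sentences.
        nearest = min(range(len(vectors)),
                      key=lambda i: math.dist(vectors[i], center))
        chosen.append((mean_position, nearest))
    return [nearest for _, nearest in sorted(chosen)]

# Toy 2-D "embeddings" for six sentences, with hand-assigned clusters.
vectors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 4.8), (0.1, 0.3), (5.1, 4.9)]
labels = [0, 0, 1, 1, 0, 1]
centroids = [(0.1, 0.13), (5.1, 4.9)]
print(pick_representatives(vectors, labels, centroids))  # [1, 5]
```

Ordering clusters by the mean position of their members keeps the selected sentences roughly in the order the topics appear in the email.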

## Verification

In this part, we run the summarization pipeline on email data and subjectively evaluate the results. If we had access to a dataset with human-generated summaries, we could treat them as ground truth and use ROUGE-2 scoring to obtain a quantitative measure of machine-generated summary quality. We could then use the ROUGE-2 scores of the unsupervised method as a benchmark to compare with results from deep supervised learning.

Another (less-than-ideal) option is to use these unsupervised summaries as labels to train a deep learning model. However, the performance of the DL model would then be limited by the label quality.
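
The ROUGE-2 comparison mentioned above scores a candidate summary by its bigram overlap with a reference summary. The following is a minimal sketch, assuming lowercased whitespace tokenization with no stemming; `bigrams` and `rouge_2` are hypothetical helpers, the two texts are made up for illustration, and an established ROUGE package would likely be preferable for reported results.

```python
from collections import Counter

def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate, reference):
    """Return (precision, recall, F1) based on clipped bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped matching bigram count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up reference (human) and candidate (machine) summaries.
reference = "I mailed the check and will send the schedule soon"
candidate = "I mailed the check on August 28th"
precision, recall, f1 = rouge_2(candidate, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.5 0.333 0.4
```

Here three of the candidate's six bigrams appear among the reference's nine, giving a precision of 1/2 and a recall of 1/3.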

In [8]:
sample = 'sample_email.txt'
with open(sample, 'r') as file:
    email = file.read()
summarize([email])

Preprocesing...
Splitting into sentences...
Starting to encode...
Loading pre-trained models...
Loading model parameters...
Compiling encoders...
Loading tables...
Packing up...
Encoding sentences...
Encoding Finished
Clustering Finished


['Brenda, I checked my records and I mailed check #1178 for the normal amount on August 28th. I will find that amortization schedule and send it soon. Phillip']

In [None]:
samples = []