# Text Summarization 

This notebook lets you generate and experiment with summaries of the open-ended survey comments. The comments can be grouped by theme, subtheme, and agreement level. The notebook can summarize hundreds of comments in minutes, reducing the time it takes to gain meaningful insights for the WES.

### Instructions for use

Select the subtheme and agreement level you want to summarize, then choose one of the two algorithms provided for generating a summary.


**Option 1: PageRank - cosine similarity**
This is an implementation adapted from [Prateek Joshi's blog](https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/); the code is in the text summary script. This method is slower because it uses pre-trained word embeddings, which are large files that take a while to load, and because the code has not been optimized for speed. To reduce the run time when generating multiple summaries, have the function return the loaded embeddings on the first run (`embedding_return=True`) and pass them back in on later runs (`embedding=...`). This greatly reduces the run time of every call after the first.
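
The idea behind this option can be sketched in a few lines. The block below is a minimal illustration, not the code in the text summary script: it assumes each sentence has already been reduced to a single vector (for example, by averaging its word embeddings), and the names `cosine_similarity` and `rank_sentences` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(sentence_vectors, damping=0.85, max_iter=100, tol=1e-6):
    """Score sentences with PageRank over a cosine-similarity graph."""
    n = len(sentence_vectors)
    if n == 0:
        return []

    # Edge weight = cosine similarity; no self-loops, negative values clipped to 0.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine_similarity(sentence_vectors[i], sentence_vectors[j])
                weights[i][j] = max(0.0, sim)
    out_weight = [sum(row) for row in weights]

    # Power iteration, starting from a uniform score distribution.
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by sentences with no outgoing edges is spread evenly.
        dangling = sum(scores[i] for i in range(n) if out_weight[i] == 0)
        updated = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            updated.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(u - s) for u, s in zip(updated, scores)) < tol
        scores = updated
        if converged:
            break
    return scores


# Toy vectors: sentence 1 is similar to both of the others, so it ranks highest.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_scores = rank_sentences(toy_vectors)
top_sentence = max(range(len(toy_scores)), key=toy_scores.__getitem__)
```

A summary would then be built from the highest-scoring sentences. The pairwise similarity loop grows quadratically with the number of sentences, which is one plausible reason an unoptimized version is slow on large groups of comments.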

Any embedding should work in the function, but the recommended embedding is the fastText crawl (`crawl-300d-2M.vec`) because it has the greatest text coverage of our corpus.
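
To compare candidate embeddings yourself, one plausible measure of coverage is the share of token occurrences in the corpus that have a pre-trained vector. The sketch below assumes that definition, which may not be the one used when the recommendation was made; `vocabulary_coverage` and the toy data are hypothetical.

```python
def vocabulary_coverage(tokens, embedding_vocab):
    """Return the share of token occurrences found in the embedding vocabulary."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in embedding_vocab)
    return covered / len(tokens)


# Toy example: 6 of the 8 tokens have a vector in the (made-up) vocabulary.
corpus_tokens = "the team communicates well but workload is unmanageable".split()
toy_vocab = {"the", "team", "well", "but", "workload", "is"}
coverage = vocabulary_coverage(corpus_tokens, toy_vocab)
print(f"coverage: {coverage:.2f}")
```

In practice `embedding_vocab` would be the set of words in the loaded embedding file, and `tokens` would come from the preprocessed survey comments.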

**Option 2: Variation TextRank - BM25 similarity** 
This method comes from the [Gensim package](https://radimrehurek.com/gensim/summarization/summariser.html) and is a variation on the TextRank algorithm. It is much faster than our implementation.
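
For intuition, BM25 can stand in for cosine similarity as the edge weight between two sentences. The sketch below does not call Gensim (whose `summarization` module was removed in Gensim 4.0) and is not Gensim's implementation. It assumes tokenized sentences, a non-empty corpus, common default values for `k1` and `b`, and a smoothed IDF that stays non-negative, which may differ from Gensim's handling. The name `bm25_similarity` is hypothetical.

```python
import math
from collections import Counter


def bm25_similarity(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """BM25 score of doc_tokens against query_tokens, with statistics from corpus.

    corpus is a non-empty list of tokenized sentences.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    term_freq = Counter(doc_tokens)

    score = 0.0
    for term in query_tokens:
        freq = term_freq.get(term, 0)
        if freq == 0:
            continue
        # Number of sentences in the corpus that contain the term.
        doc_freq = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF that is always non-negative (an assumption of this sketch).
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        # Term-frequency saturation with sentence-length normalisation.
        denom = freq + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * freq * (k1 + 1) / denom
    return score


# Toy corpus of tokenized sentences (made up for illustration).
sentences = [
    ["workload", "is", "too", "high"],
    ["workload", "is", "high", "this", "year"],
    ["managers", "communicate", "well"],
]
related = bm25_similarity(sentences[0], sentences[1], sentences)
unrelated = bm25_similarity(sentences[0], sentences[2], sentences)
```

Unlike cosine similarity, BM25 is not symmetric: scoring sentence A against B can give a different value than scoring B against A. It also needs only token counts, so no embedding file has to be loaded.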

For detailed examples of how to use each option, read the documentation for `generate_text_summary`.

#### Usage

The `generate_text_summary` function is imported from the `src.analysis.text_summary` module. You can use it to generate summaries and save them as CSV files.


### Info about working directories

This notebook has been set up to run from the project root directory. To switch the working directory, follow the instructions in the cell below.


In [None]:
# This code chunk changes the working directory to the project root (only run once)

import os
# check what folder is the current working directory
print("Initial Working Directory \n", os.getcwd())
# change the working directory to one level up
os.chdir('../')
# confirm working directory is now project root
print("Current Working Directory \n", os.getcwd())

In [None]:
import time
import src
from src.analysis.text_summary import generate_text_summary

## Summaries

### Option 1: PageRank

Credit to: Prateek Joshi

https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

In [None]:
start = time.time()
summary1, loaded_embedding = generate_text_summary("./data/interim/linking_joined_qual_quant.csv",
                                        5,
                                        "subtheme",
                                        13,
                                        "weak",
                                        "pagerank",
                                        "./references/pretrained_embeddings.nosync/fasttext/crawl-300d-2M.vec",
                                        embedding_return=True)
end = time.time()
print((end - start) / 60, "mins")

In [None]:
start = time.time()
generate_text_summary("./data/interim/linking_joined_qual_quant.csv",
                                        5,
                                        "subtheme",
                                        13,
                                        "weak",
                                        "pagerank",
                                        embedding=loaded_embedding)
end = time.time()
print((end - start) / 60, "mins")

### Option 2: Gensim Package: TextRank

In [None]:
start = time.time()
generate_text_summary("./data/interim/linking_joined_qual_quant.csv",
                                        200,
                                        "subtheme",
                                        13,
                                        "weak",
                                        "textrank")

end = time.time()
print((end - start) / 60, "mins")