
# MedTop

> Extracting topics from reflective medical writings.

## Requirements
`pip install medtop`  

`python -m nltk.downloader all`

## How to use

A template pipeline is provided below using a test dataset. You can read more about the `test_data` dataset [here](https://github.com/cctrbic/medtop/blob/master/test_data/README.md).

Each step of the pipeline has configuration options for experimenting with various methods. These are detailed in the documentation for each method. Notably, the `import_docs`, `get_cluster_topics`, `visualize_clustering`, and `evaluate` methods all include the option to save results to a file.

## Example Pipeline
### Import data
Import and pre-process documents from a text file containing a list of all documents.

In [None]:
from medtop.core import *
data, doc_df = import_docs('test_data/corpus_file_list.txt', save_results = False)

### Transform data
Create word vectors from the most expressive phrase in each sentence of the imported documents.

In [None]:
tfidf, dictionary = create_tfidf('test_data/seed_topics_file_list.txt', doc_df)
data = get_phrases(data, dictionary.token2id, tfidf, include_input_in_tfidf = True)
data = get_vectors("tfidf", data, dictionary = dictionary, tfidf = tfidf)

Removed 43 sentences without phrases.


**Open questions about unrepresentative names:**   
  1) What exactly does `include_input_in_tfidf` control?  
  2) Why is `token_averages` the max of each row?
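For readers unfamiliar with TF-IDF weighting, the idea behind the transform step can be sketched in plain Python. This is a minimal illustration of the general technique, not medtop's internal implementation: the function name `tfidf_vectors` and the toy sentences are assumptions made for this example, and the library's own phrase selection and vectorization may differ.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return (vocab, vectors): one TF-IDF vector per tokenized sentence.

    Hypothetical helper for illustration only; not part of medtop.
    """
    vocab = sorted({tok for sent in sentences for tok in sent})
    n = len(sentences)
    # Document frequency: number of sentences that contain each token.
    df = Counter(tok for sent in sentences for tok in set(sent))
    # Inverse document frequency: tokens found everywhere get weight 0.
    idf = {tok: math.log(n / df[tok]) for tok in vocab}
    vectors = []
    for sent in sentences:
        counts = Counter(sent)
        length = len(sent) or 1  # guard against empty sentences
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors

# Toy, made-up sentences (already tokenized).
sentences = [
    ["felt", "guilt", "lied"],
    ["felt", "joy", "friend"],
    ["felt", "sadness", "dog", "died"],
]
vocab, vectors = tfidf_vectors(sentences)
for sent, vec in zip(sentences, vectors):
    print(sent, [round(w, 3) for w in vec])
```

In this toy corpus "felt" occurs in every sentence, so its weight is zero everywhere, while words unique to one sentence receive the largest weights.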

### Cluster data
Cluster the sentences into groups expressing similar ideas or topics.

In [None]:
data = assign_clusters(data, method = "kmeans", k=4)
cluster_df = get_cluster_topics(data, doc_df, save_results = False)
visualize_clustering(data, method = "umap", show_chart = False)
cluster_df

Unnamed: 0,cluster,topics,sent_count
0,0,"[felt, guilt, lost, joy, lied, going, work, sa...",14
1,1,"[felt, guilt, joy, shame, christmas, friend, n...",75
2,2,"[felt, joy, got, found, shame, guilt, son, for...",25
3,3,"[felt, sadness, died, friend, dog, heard, joy,...",40
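The `method = "kmeans"` option above refers to k-means clustering. As a rough sketch of what that algorithm does (Lloyd's iteration), the following self-contained example clusters a few toy 2-D points. It does not reproduce medtop's `assign_clusters`; the function `kmeans`, its parameters, and the sample points are all assumptions chosen for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Assign each point to one of k clusters (illustrative sketch only)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy data: two well-separated groups of points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialisation, so only the grouping itself is meaningful, not the label numbers.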


### Evaluate results

In [None]:
gold_file = "test_data/gold.txt"
evaluate(data, gold_file = gold_file, save_results = False)

Unnamed: 0,label,gold_examples,closest_cluster,closest_cluster_members,tp,fp,fn,precision,recall,f1
1,guilt,"{doc.2.sent.8, doc.2.sent.6, doc.6.sent.18, do...",1,"{doc.2.sent.8, doc.2.sent.6, doc.4.sent.3, doc...",32,43,36,0.427,0.471,0.448
2,joy,"{doc.4.sent.3, doc.5.sent.3, doc.6.sent.12, do...",1,"{doc.2.sent.8, doc.2.sent.6, doc.4.sent.3, doc...",25,50,38,0.333,0.397,0.362
0,sadness,"{doc.3.sent.7, doc.0.sent.17, doc.8.sent.4, do...",3,"{doc.3.sent.7, doc.0.sent.17, doc.8.sent.4, do...",27,13,11,0.675,0.711,0.693
3,shame,"{doc.3.sent.4, doc.6.sent.10, doc.3.sent.1, do...",1,"{doc.2.sent.8, doc.2.sent.6, doc.4.sent.3, doc...",18,57,13,0.24,0.581,0.34
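The `precision`, `recall`, and `f1` columns follow from the `tp`, `fp`, and `fn` counts in the usual way. The sketch below assumes that `tp` counts sentences shared by a gold label and its closest cluster, `fp` counts cluster members outside the gold set, and `fn` counts gold examples missing from the cluster; that reading is consistent with the counts in the table (for example, `tp + fp` for "guilt" equals the 75 sentences in cluster 1), but it is an assumption rather than a description of `evaluate` itself. The function names and the sample sentence IDs are hypothetical.

```python
def confusion_counts(gold, cluster):
    """Return (tp, fp, fn) for a gold-label set and a cluster set."""
    tp = len(gold & cluster)   # in both the gold set and the cluster
    fp = len(cluster - gold)   # in the cluster but not in the gold set
    fn = len(gold - cluster)   # in the gold set but not in the cluster
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up sentence IDs in the same style as the evaluation output.
gold = {"doc.0.sent.1", "doc.0.sent.2", "doc.1.sent.4"}
cluster = {"doc.0.sent.1", "doc.0.sent.2", "doc.2.sent.0", "doc.2.sent.5"}
counts = confusion_counts(gold, cluster)
print(counts, precision_recall_f1(*counts))

# Counts taken from the "guilt" row of the table above.
guilt = tuple(round(x, 3) for x in precision_recall_f1(32, 43, 36))
print(guilt)  # (0.427, 0.471, 0.448)
```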


