In [None]:
#hide
#all_flag
from nbdev.showdoc import *

# MedTop

> Extracting topics from reflective medical writings.

## Requirements
`pip install medtop` (note: MedTop is not yet available via pip)

`python -m nltk.downloader all`

## How to use

A template pipeline is provided below using a test dataset. You can read more about the `test_data` dataset [here](https://github.com/cctrbic/medtop/blob/master/test_data/README.md).

Each step of the pipeline has configuration options for experimenting with various methods. These are detailed in the documentation for each method. Notably, the `import_docs`, `get_cluster_topics`, `visualize_clustering`, and `evaluate` methods all include the option to save results to a file.

## Example Pipeline
### Import data
Import and pre-process documents from a text file containing a list of all documents.

In [None]:
from medtop.core import *
data, doc_df = import_docs('test_data/corpus_file_list.txt', save_results = True)

Results saved to output/DocumentSentenceList.txt


### Transform data
Create word vectors from the most expressive phrase in each sentence of the imported documents.

NOTE: If `doc_df` is NOT passed to `create_tfidf`, you must set `include_input_in_tfidf=False` in `get_phrases`.

In [None]:
tfidf, dictionary = create_tfidf(doc_df, 'test_data/seed_topics_file_list.txt')
data = get_phrases(data, dictionary.token2id, tfidf, include_input_in_tfidf = True)
data = get_vectors("tfidf", data, dictionary = dictionary, tfidf = tfidf)

Removed 0 sentences without phrases.
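
The function names above suggest TF-IDF weighting. As a rough illustration of that general idea, the sketch below computes TF-IDF weights with the standard library only. It is not MedTop's implementation: the helper `tfidf_weights` and the toy sentences are hypothetical, and the exact weighting used by `create_tfidf` may differ.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in term_freq.items()
        })
    return weights

# Toy tokenized sentences, made up for this sketch.
docs = [
    ["felt", "guilt", "friend"],
    ["felt", "joy", "son"],
    ["felt", "sadness", "dog", "died"],
]
weights = tfidf_weights(docs)
print(weights[0])
```

Because "felt" appears in every toy sentence, its weight is zero, while words that occur in only one sentence receive positive weights.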


### Cluster data
Cluster the sentences into groups expressing similar ideas or topics. If you aren't sure how many true clusters exist in the data, try running `assign_clusters` with the optional parameter `show_chart = True` to visualize cluster quality across different numbers of clusters. When using `method='hac'`, you can also set `show_dendrogram = True` to see the cluster dendrogram.

In [None]:
data = assign_clusters(data, method = "kmeans", k=4)
cluster_df = get_cluster_topics(data, doc_df, save_results = True)
visualize_clustering(data, method = "umap", show_chart = False)
cluster_df

Results saved to output/TopicClusterResults.txt



Embedding a total of 2 separate connected components using meta-embedding (experimental)



|  | cluster | topics | sent_count |
|---|---|---|---|
| 0 | 0 | [felt, joy, got, guilt, found, son, shame, for... | 37 |
| 1 | 1 | [felt, sadness, heard, died, dog, home, friend... | 14 |
| 2 | 2 | [felt, guilt, joy, shame, sadness, friend, old... | 87 |
| 3 | 3 | [felt, guilt, lost, joy, sadness, managed, lie... | 16 |
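
When the number of clusters is unknown, one common heuristic is to run k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply (the "elbow"). The sketch below illustrates that heuristic on made-up 2-D points using only the standard library. It is an assumption-laden illustration, not what `assign_clusters` or `show_chart` necessarily does; `kmeans_inertia` and its deterministic initialization are hypothetical choices for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans_inertia(points, k, iters=10):
    """Run plain k-means and return the within-cluster sum of squares."""
    # Deterministic start: pick k evenly spaced points as initial centers.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to its group's mean; keep it if the group is empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy 2-D data with three obvious groups (made up for this sketch).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

inertia = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertia.items():
    print(k, round(value, 2))
```

On this toy data the sum of squares falls steeply up to k=3 and only slightly at k=4, which points to three clusters.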


### Evaluate results
Compare each gold-standard label with its closest cluster and report precision, recall, and F1.

In [None]:
gold_file = "test_data/gold.txt"
evaluate(data, gold_file = gold_file, save_results = False)

|  | label | gold_examples | closest_cluster | closest_cluster_members | tp | fp | fn | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | guilt | {doc.4.sent.17, doc.9.sent.6, doc.0.sent.0, do... | 2 | {doc.4.sent.15, doc.4.sent.8, doc.3.sent.12, d... | 31 | 56 | 37 | 0.356 | 0.456 | 0.4 |
| 1 | joy | {doc.0.sent.19, doc.7.sent.2, doc.1.sent.19, d... | 0 | {doc.5.sent.4, doc.5.sent.5, doc.1.sent.10, do... | 23 | 14 | 40 | 0.622 | 0.365 | 0.46 |
| 0 | sadness | {doc.1.sent.13, doc.4.sent.15, doc.1.sent.18, ... | 2 | {doc.4.sent.15, doc.4.sent.8, doc.3.sent.12, d... | 15 | 72 | 23 | 0.172 | 0.395 | 0.24 |
| 3 | shame | {doc.8.sent.14, doc.4.sent.8, doc.4.sent.19, d... | 2 | {doc.4.sent.15, doc.4.sent.8, doc.3.sent.12, d... | 20 | 67 | 11 | 0.23 | 0.645 | 0.339 |
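
The `precision`, `recall`, and `f1` columns above are consistent with the standard definitions based on the `tp`, `fp`, and `fn` counts. The sketch below reproduces the "guilt" row from those counts. The helper `prf` is hypothetical and only illustrates the formulas; `evaluate` may compute its scores differently internally.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts copied from the "guilt" row of the table above.
precision, recall, f1 = prf(tp=31, fp=56, fn=37)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.356 0.456 0.4
```

Note that `tp + fp` equals the size of the closest cluster (87 sentences for cluster 2), so precision is the share of that cluster's sentences that carry the gold label.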


In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted core.ipynb.
Converted index.ipynb.
Converted internal.ipynb.
Converted preprocessing.ipynb.
Converted sandbox.ipynb.
