# Generating Topic Models with the CORD-19 Dataset

**Purpose:** This notebook supports the interactive development of topic models on the COVID-19 research text made available by the White House as of 2020-03-16.  After generating the processed outputs of the raw text with the **CORD-19 Data Processing notebook**, you only need to enter your directory (using the text boxes embedded in the notebook) to read in the pre-processed data before making your topic-modeling selections.  This workflow is designed for anyone looking to use Python to explore and analyze the COVID-19 text.

**About the Dataset:**
In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.

This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. The corpus will be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.
    
</br></br>
- **Resources**
    - **[Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)**
    
</br></br>
- **Datasets**
    - **\[Required]** Any dataset which contains row-by-row records of text contained within one column (or variable).  Use the **CORD-19 Data Processing** Notebook to generate a dataset from the CORD-19 corpus or bring-your-own-data if it meets the condition described.
    - **[COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)** - Kaggle is hosting the COVID-19 Open Research Dataset Challenge, a series of important questions designed to inspire the community to use CORD-19 to find new insights about the COVID-19 pandemic including the natural history, transmission, and diagnostics for the virus, management measures at the human-animal interface, lessons from previous epidemiological studies, and more.  You can download the full dataset on their website.
    - **[COVID-19 Open Research Dataset (CORD-19)](https://pages.semanticscholar.org/coronavirus-research)** - The CORD-19 resource is available on the Allen Institute’s SemanticScholar.org website and will continue to be updated as new research is published in archival services and peer-reviewed publications.

    
## Table of Contents

**1.0** **- Ingest Data**
    * 1.1 - Set Your Working Directory
    * 1.2 - Load Helper Functions
    * 1.3 - Import Data
    
**2.0** **- Build Topics**
    * 2.1 - Select the Data for Modeling
    * 2.2 - Build the Topic Model
    * 2.3 - Review the Topics 
    * 2.4 - Visualize the Clusters of Topics     
   
**3.0** **- Search Text**
    * 3.1 - Search by Keyword(s)
    * 3.2 - Search by Phrase
    * 3.3 - Merge Topics and Search Results
    * 3.4 - Merge Visualization Coordinates

**4.0** **- Export Data**
    * 4.1 - Load Export Functions
    * 4.2 - Select Files and Export Locally

## Dependencies

This script was executed using the following version of Python:
* **Python 3.6.2 :: Anaconda, Inc.**

Use this link to install Python on your machine:
* https://www.anaconda.com/distribution/#download-section

**About Python Versions:**
If you are running a newer version of Python and this notebook fails to execute properly, you can switch versions from the terminal by running the following commands:
* conda search python [lists the Python versions available to install]
* conda install python=3.6.2 [switches the active environment to 3.6.2, if it appears in the list above]

**About Python Packages:**
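Third-party packages used in this notebook can be installed on your machine using the "pip install [package_name]" command in your terminal.  The os, datetime, io, and warnings modules ship with Python and need no installation.  Be sure you've installed each of the third-party packages below before attempting to execute the notebook.  The import cell also loads tensorflow, ktrain, and scikit-learn (imported as sklearn), so these must be installed as well.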

Current package requirements include:
* os - https://docs.python.org/3/library/os.html
* Pandas - https://pandas.pydata.org/
* Numpy - http://www.numpy.org/
* Datetime - https://docs.python.org/3/library/datetime.html
* ipywidgets - https://ipywidgets.readthedocs.io/en/stable/user_install.html
* ipython - https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/extension_api.html
* requests - https://2.python-requests.org/en/master/user/install/
* io - https://docs.python.org/3/library/io.html
* warnings - https://docs.python.org/3/library/warnings.html
* pyarrow - https://arrow.apache.org/docs/python/parquet.html

The current template uses the following versions:
* os== module 'os' from '/anaconda3/lib/python3.6/os.py'
* pandas==0.24.1
* numpy==1.16.1
* datetime== module 'datetime' from '/anaconda3/lib/python3.6/datetime.py'
* ipywidgets==7.4.2
* ipython==6.2.1
* requests==2.18.4
* io== module 'io' from '/anaconda3/lib/python3.6/io.py'
* warnings== module 'warnings' from '/anaconda3/lib/python3.6/warnings.py'
* pyarrow==0.16.0

## Before you begin, ensure you've installed the required Python packages

* See the list above and make note of the specific versions that were used in this notebook
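
Before running the import cell, it can help to confirm that every module is available. The snippet below is a minimal sketch of such a check, not part of the original notebook: the `REQUIRED_MODULES` list and the `find_missing_modules` helper are hypothetical names, and the list simply mirrors the imports used in the next cell.

```python
import importlib.util

# Import names used by this notebook (scikit-learn is imported as "sklearn").
# This list is an assumption that mirrors the import cell; adjust it as needed.
REQUIRED_MODULES = [
    "os", "copy", "re", "datetime", "warnings",
    "numpy", "pandas", "pyarrow", "ipywidgets", "IPython",
    "requests", "tensorflow", "ktrain", "sklearn",
]

def find_missing_modules(module_names):
    """Return the module names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing

missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules: {}".format(", ".join(missing)))
else:
    print("All required modules were found")
```

Any name reported as missing can then be installed with pip before continuing. Note that the pip package name may differ from the import name (for example, sklearn is installed as scikit-learn).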

In [1]:
############################################
###### Import required Python packages #####
############################################

import os
from copy import deepcopy
import pyarrow
import numpy as np
import pandas as pd
import datetime as dt
from ipywidgets import interact, interactive, IntSlider, Layout
import ipywidgets as widgets
from IPython.display import display
import requests
import warnings
import tensorflow as tf
import ktrain
import re
import sklearn

warnings.filterwarnings('ignore')



The cell below confirms which version of TensorFlow you have installed. Your version should be >=2.0.0.

In [2]:
print(tf.__version__)

2.1.0
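
If you would rather test the version requirement in code than read it off the printed output, a comparison along the following lines could be used. This is only a sketch: `parse_version` and `version_at_least` are hypothetical helpers, and the parsing assumes the version string starts with dot-separated integers.

```python
import re

def parse_version(version):
    """Extract the leading dotted-integer part of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("Unrecognized version string: {!r}".format(version))
    return tuple(int(part) for part in match.group(0).split("."))

def version_at_least(version, minimum):
    """Return True when version >= minimum, comparing numeric components."""
    found, required = parse_version(version), parse_version(minimum)
    # Pad the shorter tuple with zeros so "2" and "2.0.0" compare as equal
    width = max(len(found), len(required))
    found += (0,) * (width - len(found))
    required += (0,) * (width - len(required))
    return found >= required

print(version_at_least("2.1.0", "2.0.0"))
```

In the notebook, `tf.__version__` would be passed as the first argument in place of the literal string.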


## AN IMPORTANT NOTE ABOUT INTERACTIVE WIDGETS

This notebook uses interactive widgets to help you make selections and inputs more conveniently.  As you work through this notebook, be sure to follow the steps below to ensure your selections are incorporated in the cells that follow:

#### 1. Run the cell containing the interactive widget(s) to bring them into view
#### 2. Apply your selections and/or inputs to the widgets that appear
#### 3. DO NOT rerun the cell as it will erase your selections and inputs
#### 4. To proceed, simply click on the next cell in the notebook, and Run it

<br/>

## 1.0 - Data Ingestion

The series of code blocks below will walk you through the process of mapping to your working directory and uploading your dataset.

## 1.1 - Set Your Working Directory

Your "working directory" is a folder location on your computer that will store files either read-in or written-out by this script.  This code by default will return your current, active directory.  You can change this directory by typing in a specific path into the text box provided.

In [3]:
set_working_directory = widgets.Text(
    value=os.getcwd(),
    placeholder='/Users/bblanchard006/Desktop/covid19/CORD-19-research-challenge',
    description='Directory:',
    disabled=False,
    layout=Layout(width='100%')
)

display(set_working_directory)

Text(value='/Users/bblanchard006/Desktop/covid19/workbench', description='Directory:', layout=Layout(width='10…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

After executing the cell above, you can leave the default directory or overwrite the text string that appears with your desired folder directory. **DO NOT execute the cell again after making your update.** The input above will be fed into the following code cell, where it will either successfully map to the new directory or notify you of an error.

In [4]:
try:
    os.chdir(set_working_directory.value)
    print('Changed directory to {}'.format(set_working_directory.value))
except Exception as e:
    print('Failed to change directory')
    print(e)

Changed directory to /Users/bblanchard006/Desktop/covid19/workbench
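
A typed path can carry stray spaces or a `~` shortcut, so it may be worth normalizing and validating it before calling `os.chdir`. The following is a minimal sketch under that assumption. `resolve_directory` is a hypothetical helper, and the temporary directory is used only so that the example runs on any machine.

```python
import os
import tempfile

def resolve_directory(path):
    """Expand and validate a directory path, returning its absolute form."""
    expanded = os.path.abspath(os.path.expanduser(path.strip()))
    if not os.path.isdir(expanded):
        raise NotADirectoryError("Not an existing directory: {}".format(expanded))
    return expanded

# Demonstration with a temporary directory and padded input
with tempfile.TemporaryDirectory() as demo_dir:
    padded_input = "  {}  ".format(demo_dir)
    print(resolve_directory(padded_input) == os.path.abspath(demo_dir))
```

In the notebook, the value of `set_working_directory.value` could be passed through such a helper before the `os.chdir` call.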


## 1.2 - Load Helper Functions

The list of functions below will help us extract the important attributes embedded in the .json files.  Much of the source material detailed below was originally published to the Kaggle community by various supporters and has been lightly modified for the purposes of this Notebook.  You can access the original walkthrough provided on Kaggle below:

[CORD-19: EDA, parse JSON and generate clean CSV](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv)

In [5]:
########################################
##### Data Ingestion Functions
########################################

def compile_raw_data(filename, tab_names, subfolder, skip_rows = 0, file_ext = 'xlsx'):
    
    # Inputs: 
    ## filename = 'sample.xlsx' | 'sample.parquet.gzip' - the filename in the directory (including the extension) 
    ## tab_names = None | ['Sheet1','Sheet2'] - None for parquet; list of tab names for xlsx
    ## subfolder = 'source_data' | None - name of a folder in the working directory
    ## skip_rows = default 0 - not used for parquet; number of leading rows to trim from an xlsx tab
    ## file_ext = 'xlsx' | 'parquet'
    
    # Description: reads each requested table into a dataframe
    # Outputs: returns a dictionary of dataframes keyed by tab (or file) name;
    #          a table that fails to load is stored as the string 'Failed'

    master_data = {}
    if subfolder:
        file_path = os.path.join(subfolder, filename)
    else:
        file_path = filename

    if file_ext == 'parquet':
        # Use the filename without its extension as the table name
        tab_names = [re.sub(r'\.parquet(\.gzip)?$', '', filename)]

    for tab in tab_names:
        try:
            if file_ext == 'xlsx':
                # Pass keyword arguments so skip_rows is not mistaken for the header row
                dframe = pd.read_excel(file_path, sheet_name=tab, skiprows=skip_rows)
            else:
                dframe = pd.read_parquet(file_path)
            
            master_data.update({tab:dframe})
        except Exception as e:
            # Report the reason, then flag the table so later cells can detect the failure
            print('Failed to load {}: {}'.format(tab, e))
            master_data.update({tab:'Failed'})
    
    return master_data

## 1.3 - Upload Your Data (Excel and Parquet files)

The function in the code cell below will find, ingest, and format both xlsx and parquet files.

The code blocks below enable conditional filtering to support multiple file types. Further instructions are provided below:

**Uploading parquet files**

To upload a parquet file, complete these steps:
1. Type in your filename along with the extension (ex. sample.parquet.gzip)
2. Check the 'parquet' radio-button
3. Is your file in the main directory or a sub-folder in the directory:
    * Select the "no" radio-button if your file is in your main directory
    * Select the "yes" radio-button to expose a text-box where you can type-in the name of your sub-folder
    
**Uploading xlsx files**

To upload an xlsx file, complete these steps:
1. Type in your filename along with the extension (ex. sample.xlsx)
2. Check the 'xlsx' radio-button
3. Type in the tab-names you'd like to ingest (comma-separated; Sheet1,Sheet2,Sheet3)
4. If the data in your file has leading rows, select how many rows to skip before ingesting the data (ex. if your data starts on Row 2 in the Excel-file, set the Skip Rows value to 1)
5. Is your file in the main directory or a sub-folder in the directory:
    * Select the "no" radio-button if your file is in your main directory
    * Select the "yes" radio-button to expose a text-box where you can type-in the name of your sub-folder
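
The inputs described in the steps above are free text, so they usually need light clean-up before they are handed to a reader function. The snippet below is a hedged sketch of that normalization. The three helper names are hypothetical and are not functions defined elsewhere in this notebook.

```python
import os
import re

def parse_tab_names(raw_tabs):
    """Split a comma-separated tab string into a clean list of names."""
    return [name.strip() for name in raw_tabs.split(",") if name.strip()]

def table_name_from_parquet(filename):
    """Drop a trailing '.parquet' or '.parquet.gzip' suffix from a filename."""
    return re.sub(r"\.parquet(\.gzip)?$", "", filename)

def build_file_path(filename, subfolder_name=None):
    """Join the optional sub-folder and filename, relative to the working directory."""
    if subfolder_name:
        return os.path.join(subfolder_name, filename)
    return filename

print(parse_tab_names("Sheet1, Sheet2 ,Sheet3,"))
print(table_name_from_parquet("sample.parquet.gzip"))
print(build_file_path("sample.xlsx", "source_data"))
```

Dropping empty entries in `parse_tab_names` is a design choice that tolerates a trailing comma in the Tab(s) box.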

In [10]:
upload_type = widgets.RadioButtons(
    options=['local'],
    description='File Location:',
    disabled=False
)

upload_filename = widgets.Text(
    value='consolidated_full_text.xlsx',
    placeholder='Sample File.xlsx',
    description='File Name:',
    disabled=False,
    layout=Layout(width='50%')
)

file_type = widgets.RadioButtons(
    options=['xlsx','parquet'],
    description='File Type:',
    disabled=False
)

tab_names = widgets.Text(
    value='Sheet1, Sheet2, Sheet3, etc',
    placeholder='Sheet1',
    description='Tab(s):',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder_name = widgets.Text(
    value='source_data',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

skip_rows = widgets.IntSlider(
    value=0,
    min=0,
    max=10,
    step=1,
    description='Skip Rows:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)


def text_field(x):
    # Show the tab-name and skip-row inputs only for xlsx files;
    # their values are read later through the widgets' .value attributes
    if(x=='xlsx'):
        display(tab_names)
        display(skip_rows)
    else:
        print('Tab Names: Not needed for parquet files')

def sub_folder(y):
    # Show the sub-folder text box only when the file lives in a sub-folder
    if(y=='yes'):
        display(subfolder_name)
    else:
        print('Using {} folder'.format(os.getcwd()))

def file_location(z):
    if(z=='local'):
        display(upload_filename)
        i = widgets.interactive(text_field, x=file_type)
        display(i)
        p = widgets.interactive(sub_folder, y=subfolder)
        display(p)
    else:
        pass
    
q = widgets.interactive(file_location, z=upload_type)

display(q)

interactive(children=(RadioButtons(description='File Location:', options=('local',), value='local'), Output())…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The following code cell will attempt to ingest the data you've selected in the widgets above:

**Note About xlsx Files** - Depending on the number of tabs and the size of the data on each tab, ingesting an xlsx file can take several minutes to execute.

In [11]:
if file_type.value == 'parquet':
    tabs = None
    skiprows = 0
else:
    tabs = [x.strip() for x in tab_names.value.split(',')]
    skiprows = skip_rows.value

# Store the result under a separate name so the `subfolder` widget is not overwritten
if subfolder.value == 'yes':
    subfolder_path = subfolder_name.value
else:
    subfolder_path = None


In [12]:
master_data = compile_raw_data(upload_filename.value, tabs, subfolder_path, skip_rows = skiprows, file_ext = file_type.value)


**Note:** If you change your selections in the widgets above, simply rerun the last two code cells to refresh the input parameters.

The following code cell will print out the attributes associated with the files you've uploaded and alert you of any errors:

In [13]:
for key, value in master_data.items():
    try:
        print('{} table was ingested with {} rows and {} columns'.format(key,value.shape[0],value.shape[1]))
    except:
        print('{} table failed to load'.format(key))

salesforce_events table was ingested with 214 rows and 9 columns


## 2.0 - Begin Topic Modeling

The list of functions below will help us generate topics using the full-text found in the CORD-19 dataset.  

## 2.1 - Select a Data Frame

The following menus will allow you to select the dataset you would like to use in your modeling and the variables you would like included in the subsequent processes.  You can preview a sample of the data.

Select an available frame from the list below:

In [14]:
dict_keys = widgets.Select(
    options=master_data.keys(),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

Select(description='Tables:', layout=Layout(width='50%'), options=('salesforce_events',), value='salesforce_ev…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

After selecting a frame above, select the variables you would like included in your workflow from the list below:

**NOTE:** To select multiple values from the picklist, hold down the Command key (Ctrl on Windows and Linux) while clicking individual variables, or hold the Shift key and click to select a range of variables.  You can scroll down if your mouse is within the widget window.

In [15]:
master_data[dict_keys.value].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Account: Account Name  214 non-null    object        
 1   Opportunity ID         214 non-null    object        
 2   Opportunity Name       214 non-null    object        
 3   Activity ID            214 non-null    object        
 4   Task Subtype           196 non-null    object        
 5   Created Date           214 non-null    datetime64[ns]
 6   Created By: Full Name  214 non-null    object        
 7   Subject                214 non-null    object        
 8   Comments               205 non-null    object        
dtypes: datetime64[ns](1), object(8)
memory usage: 15.2+ KB


In [16]:
review_variables = widgets.SelectMultiple(
    options=master_data[dict_keys.value].columns.tolist(),
    description='Variables:',
    disabled=False
)

display(review_variables)

SelectMultiple(description='Variables:', options=('Account: Account Name', 'Opportunity ID', 'Opportunity Name…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed


In [17]:
review_var_list = list(review_variables.value)
    
# Copy the selected columns so that later column edits do not act on a view of the source frame
master_data['topic_model_data'] = master_data[dict_keys.value][review_var_list].copy()


In [18]:
master_data['topic_model_data'].head(1)

Unnamed: 0,Comments
0,"From: Gash, Deborah <dgash@saint-lukes.org> Da..."


## 2.2 - Build the Topic Modeling Object

The cells below will walk you through the topic-modeling process.  You'll first select the text you'll be modeling and then make a series of inputs to prep your model for training.  Much of the source material detailed below was originally published to the Kaggle community by various supporters and has been lightly modified for the purposes of this Notebook.  You can access the original walkthrough provided on Kaggle below:

[CORD-19-LDA-Topic-modeling-recommendation-system](https://www.kaggle.com/d4v1d3/cord-19-lda-topic-modeling-reccomendation-system)

The cell below allows you to combine multiple columns (containing strings only) to form a unique string value in a new column which will be added to your dataset.

In [19]:
combined_text_fields = widgets.SelectMultiple(
    options=master_data[dict_keys.value].select_dtypes(include='object').columns.tolist(),
    description='Combine:',
    disabled=False
)

new_field_name = widgets.Text(
    value='combined_text_field',
    placeholder='combined_text_field',
    description='New Column:',
    disabled=False,
    layout=Layout(width='50%')
)

notice = 'After combining the fields above, select which fields to remove (if necessary)'

removed_text_fields = widgets.SelectMultiple(
    options=master_data[dict_keys.value].select_dtypes(include='object').columns.tolist(),
    description='Remove:',
    disabled=False
)

display(combined_text_fields, new_field_name, notice, removed_text_fields)

SelectMultiple(description='Combine:', options=('Account: Account Name', 'Opportunity ID', 'Opportunity Name',…

Text(value='combined_text_field', description='New Column:', layout=Layout(width='50%'), placeholder='combined…

'After combining the fields above, select which fields to remove (if necessary)'

SelectMultiple(description='Remove:', options=('Account: Account Name', 'Opportunity ID', 'Opportunity Name', …

In [20]:
cols_selected = [x for x in combined_text_fields.value]

if new_field_name.value == '':
    new_col_name = 'default_combined_text_field'
else:
    new_col_name = new_field_name.value

master_data['topic_model_data'][new_col_name] = master_data['topic_model_data'][cols_selected].apply(lambda x: x.str.cat(sep=' | '), axis=1)

cols_removed = [x for x in removed_text_fields.value]

for c in cols_removed:
    del master_data['topic_model_data'][c]

master_data['topic_model_data'].head(1)

Unnamed: 0,combined_text_field
0,"From: Gash, Deborah <dgash@saint-lukes.org> Da..."
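
To make the combine step concrete, here is a plain-Python sketch of the row-wise join that the cell above performs with `x.str.cat(sep=' | ')`. It assumes missing values are represented as `None` and are skipped. The `combine_fields` helper, the sample records, and the column names are hypothetical.

```python
def combine_fields(record, columns, sep=" | "):
    """Join the selected fields of one record, skipping missing (None) values."""
    return sep.join(record[col] for col in columns if record.get(col) is not None)

# Hypothetical records standing in for rows of the dataset
records = [
    {"title": "Example title", "abstract": "Example abstract", "body": "Example body"},
    {"title": "Title only", "abstract": None, "body": "Body text"},
]

combined = [combine_fields(r, ["title", "abstract", "body"]) for r in records]
for text in combined:
    print(text)
```

Note that a literal string such as 'None' is not treated as missing in this sketch and would be kept in the joined text.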


In [21]:
text_var = widgets.Select(
    options=master_data['topic_model_data'].columns.tolist(),
    description='Text Column:',
    disabled=False
)

number_topics = widgets.IntSlider(
    value=25,
    min=1,
    max=81,
    step=1,
    description='# of Topics:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

number_features = widgets.IntSlider(
    value=10000,
    min=0,
    max=15000,
    step=25,
    description='# of Features:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)


display(text_var)
display(number_topics)
display(number_features)

Select(description='Text Column:', options=('combined_text_field',), value='combined_text_field')

IntSlider(value=25, description='# of Topics:', max=81, min=1)

IntSlider(value=10000, description='# of Features:', max=15000, step=25)

**Important Note:** When selecting the number of topics and features using the sliders above, please note that if you "type in" a new value by clicking on the value shown, that input value will not be transferred to the model.  The only way to ensure your inputs are transferred to the model is to drag the sliders.

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The cell below will remove any records with **missing text.** SKIP the next cell if you would like to assign a topic to these records.  Note that missing text values may cause duplicate merge statements to occur in later steps.

In [22]:
master_data['topic_model_data'] = master_data['topic_model_data'][~master_data['topic_model_data'][text_var.value].isnull()]
master_data['topic_model_data'] = master_data['topic_model_data'][master_data['topic_model_data'][text_var.value] != '']
master_data['topic_model_data'] = master_data['topic_model_data'][master_data['topic_model_data'][text_var.value] != 'None']


In [23]:
master_data['topic_model_data'].shape

(205, 1)

The cells below build the topic-modeling object using the inputs you've selected in the cells above.

In [24]:
ktrain.text.preprocessor.detect_lang = ktrain.text.textutils.detect_lang

if number_features.value == 0:
    number_features_v = 1
    print('Number of features was set to 0 by the user, using a value of 1 for modeling purposes')
else:
    number_features_v = number_features.value

texts = master_data['topic_model_data'][text_var.value]

In [25]:
texts

0      From: Gash, Deborah <dgash@saint-lukes.org> Da...
1      ---------- Forwarded message --------- From: E...
2      Additional To: linda.williams@hpe.com CC: fara...
3      Additional To: jeff.ricci@hpe.com; christopher...
4      Additional To: tarek.robbiati@hpe.com CC: meli...
                             ...                        
209    Additional To: ksilva@sbmcorp.com; cobrien@sbm...
210    Additional To: ksilva@sbmcorp.com; cobrien@sbm...
211    Additional To: ksilva@sbmcorp.com; cobrien@sbm...
212    Additional To: zrocha@pcoastp.com CC: mark.d.c...
213    Additional To: mdw3@ntrs.com CC: elizabeth.a.g...
Name: combined_text_field, Length: 205, dtype: object

In [26]:
tm = ktrain.text.get_topic_model(texts, n_topics=number_topics.value, n_features=number_features_v)

lang: en
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.


#### Guidance for the inputs below: 
The **Topic Words** input sets how many words from each topic are used to create that topic's label.  The **Threshold** input sets a cut-off, entered as a percentage and divided by 100 before it is passed to the model: a text is excluded from a topic if its score is less than the threshold.

In [27]:
topic_word_labels = widgets.IntSlider(
    value=5,
    min=1,
    max=10,
    step=1,
    description='Topic Words:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

threshold_val = widgets.IntSlider(
    value=25,
    min=1,
    max=100,
    step=1,
    description='Threshold:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

display(topic_word_labels)
display(threshold_val)

IntSlider(value=5, description='Topic Words:', max=10, min=1)

IntSlider(value=25, description='Threshold:', min=1)

**Important Note:** When selecting the number of topic words and the threshold using the sliders above, please note that if you "type in" a new value by clicking on the value shown, that input value will not be transferred to the model.  The only way to ensure your inputs are transferred to the model is to drag the sliders.

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The cell below builds the document-topic distribution, which gives the topic probability distribution for each document in `texts` with respect to the learned topic space.

In [28]:
tm.build(texts, threshold=threshold_val.value/100)

done.
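
The effect of the threshold can be illustrated without the library. The sketch below assumes that each document is assigned to its highest-probability topic and is dropped when that probability falls below the threshold. It illustrates the idea only and is not ktrain's actual implementation. `assign_documents` and the sample probabilities are hypothetical.

```python
def assign_documents(doc_topics, threshold=None):
    """Group document indices by their highest-probability topic.

    doc_topics: list of per-document topic probability lists.
    threshold: documents whose best score is below this value are dropped.
    """
    topic_dict = {}
    for doc_id, probabilities in enumerate(doc_topics):
        # Index of the largest probability (first one wins on ties)
        best_topic = max(range(len(probabilities)), key=probabilities.__getitem__)
        best_score = probabilities[best_topic]
        if threshold is not None and best_score < threshold:
            continue
        topic_dict.setdefault(best_topic, []).append((doc_id, best_score))
    return topic_dict

# Hypothetical document-topic probabilities for three documents and three topics
doc_topics = [
    [0.70, 0.20, 0.10],
    [0.30, 0.30, 0.40],
    [0.10, 0.85, 0.05],
]
print(assign_documents(doc_topics, threshold=0.5))
```

With a threshold of 0.5, the second document is left out because no topic scores high enough. This is one reason a topic can end up with few or no texts.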


## 2.3 - Review the Topics Generated

The cells below will allow you to review the topics that were generated by your selections in the steps above.  A list of topics is provided below, along with the number of texts included in each topic (based on the threshold you set) and the list of words most associated with each topic.  If you'd like to rebuild your model, rerun the cells in Section 2.2.

In [29]:
topics = tm.get_topics(n_words=topic_word_labels.value, as_string=True)

topic_labels = []
topic_texts = []
topic_idx = []

topic_counts = sorted([ (k, topics[k], len(v)) for k,v in tm.topic_dict.items()],key=lambda kv:kv[0], reverse=False)
for (idx, topic, count) in topic_counts:
    print("topic:%s | count:%s | %s" %(idx, count, topic))
    topic_labels.append(topic)
    topic_texts.append(count)
    topic_idx.append(idx)
    

topic:2 | count:16 | canada par government joyce businesses
topic:3 | count:5 | http com/uk support registered addressee
topic:5 | count:47 | mayo advisory stacey sow empson
topic:6 | count:1 | navigator people organization need tool
topic:11 | count:52 | navigator organization people survey tool
topic:12 | count:1 | meeting boeing final responding keith
topic:18 | count:50 | html clients reach companies outbreak
topic:19 | count:4 | bryan mclaughlin need hpe global
topic:20 | count:1 | navigator minutes healthy employees organization
topic:21 | count:15 | organisations international responding phone lowell
topic:23 | count:1 | urldefense com/v2/url proofpoint dwmfaq&c google
topic:24 | count:8 | government financial companies support end


The following cell converts the list of topics and their attributes to a dataframe for later merging

In [30]:
topic_temp_frame =  list(zip(topic_idx, topic_labels,  topic_texts))
topics_df = pd.DataFrame(topic_temp_frame, columns=['topic_id','topic_label','text_count'])
topics_df.head(1)

Unnamed: 0,topic_id,topic_label,text_count
0,2,canada par government joyce businesses,16


## 2.4 - Visualize the Topics Generated

The cells below will allow you to visualize the clustering of texts by topic using your model developed in prior steps.

In [1041]:
texts = tm.filter(texts)
master_data['topic_model_data_filtered'] = tm.filter(deepcopy(master_data['topic_model_data']))

In [1042]:
tm.visualize_documents(doc_topics=tm.get_doctopics())

reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 25712 samples in 0.044s...
[t-SNE] Computed neighbors for 25712 samples in 63.125s...
[t-SNE] Computed conditional probabilities for sample 1000 / 25712
[t-SNE] Computed conditional probabilities for sample 2000 / 25712
[t-SNE] Computed conditional probabilities for sample 3000 / 25712
[t-SNE] Computed conditional probabilities for sample 4000 / 25712
[t-SNE] Computed conditional probabilities for sample 5000 / 25712
[t-SNE] Computed conditional probabilities for sample 6000 / 25712
[t-SNE] Computed conditional probabilities for sample 7000 / 25712
[t-SNE] Computed conditional probabilities for sample 8000 / 25712
[t-SNE] Computed conditional probabilities for sample 9000 / 25712
[t-SNE] Computed conditional probabilities for sample 10000 / 25712
[t-SNE] Computed conditional probabilities for sample 11000 / 25712
[t-SNE] Computed conditional probabilities for sample 12000 / 25712
[t-SNE] Computed condi

## 3.0 - Search for Text by Keywords or Phrase

The code cells below allow you to search for texts in the CORD-19 dataset that closely match your search keyword(s) and phrases.

## 3.1 - Search for Text by Keyword(s)

Using the text box below, enter a list of **comma-separated** keywords along with the threshold score for filtering on the text.  Text found to have a score higher than your threshold will be returned.

In [1043]:
search_term = widgets.Text(
    value='remdesivir, chloroquine, ritonavir',
    description='Search Terms',
    disabled=False,
    layout=Layout(width='50%')
)

threshold_topic_val = widgets.IntSlider(
    value=80,
    min=1,
    max=100,
    step=1,
    description='Threshold:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

display(search_term, threshold_topic_val)

Text(value='remdesivir, chloroquine, ritonavir', description='Search Terms', layout=Layout(width='50%'))

IntSlider(value=80, description='Threshold:', min=1)

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

In [1044]:
search_terms = [x.strip() for x in search_term.value.split(',')]
threshold = threshold_topic_val.value/100

t_topics = set()
for s in search_terms:
    temp_results = tm.search(s, case_sensitive=False)
    temp_topic_ids = {doc[3] for doc in temp_results if doc[2]>threshold}
    t_topics.update(temp_topic_ids)
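
The filtering logic in the cell above can be tried on toy data. The sketch below assumes search results are `(text, doc_id, score, topic_id)` tuples, as the indexing above suggests, and uses a simple case-insensitive substring match in place of the library's search. `topics_matching_terms` and the sample documents are hypothetical.

```python
def topics_matching_terms(docs, terms, threshold):
    """Return topic ids of docs that mention any term with a score above threshold.

    docs: iterable of (text, doc_id, score, topic_id) tuples.
    """
    matched = set()
    for term in terms:
        needle = term.strip().lower()
        if not needle:
            continue
        for text, doc_id, score, topic_id in docs:
            # Case-insensitive substring match, kept only above the cut-off
            if needle in text.lower() and score > threshold:
                matched.add(topic_id)
    return matched

# Hypothetical documents: (text, doc_id, score, topic_id)
sample_docs = [
    ("Remdesivir trial results", 0, 0.91, 4),
    ("Chloroquine in vitro study", 1, 0.62, 7),
    ("Membrane protein sorting", 2, 0.95, 9),
]
search_text = "remdesivir, chloroquine"
print(topics_matching_terms(sample_docs, search_text.split(","), threshold=0.8))
```

Lowering the threshold widens the set of topics returned, which is a quick way to see how sensitive the search is to the cut-off.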
    

In [1045]:
docs = tm.get_docs(topic_ids=t_topics, rank=True)
print("TOTAL_NUM_OF_DOCS: %s" % len(docs))

print("##################################")

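for t in t_topics:
    docs = tm.get_docs(topic_ids=[t], rank=True)
    print("NUM_OF_DOCS: %s" % len(docs))
    if(len(docs)==0): continue
    # Preview the second-ranked document, falling back to the first when a topic has only one
    doc = docs[1] if len(docs) > 1 else docs[0]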
    print('TOPIC_ID: %s' % (doc[3]))
    print('TOPIC: %s' % (tm.topics[t]))
    print('DOC_ID: %s'  % (doc[1]))
    print('TOPIC SCORE: %s '% (doc[2]))
    print('TEXT: %s' % (doc[0][0:400]))
    
    
    print("##################################")

TOTAL_NUM_OF_DOCS: 1876
##################################
NUM_OF_DOCS: 299
TOPIC_ID: 16
TOPIC: los que las del por una para como pacientes son
DOC_ID: 23906
TOPIC SCORE: 0.995119997163663 
TEXT: RECOMENDACIONES PARA EL MANEJO DE LA FARINGOAMIGDALITIS AGUDA DEL ADULTO ଝ BAJO LA LICENCIA CC BY-NC-ND (HTTP:// CREATIVECOMMONS.ORG/LICENSES/BY-NC-ND/4.0/) | Abstract

o Sociedad Española de Otorrinolaringología y Patología Cérvico-Facial (SEORL-PCF), España Recibido el 30 de diciembre de 2014; aceptado el 5 de febrero de 2015 Disponible en Internet el 27 de mayo de 2015 PALABRAS CLAVE Faringoami
##################################
NUM_OF_DOCS: 601
TOPIC_ID: 9
TOPIC: membrane proteins particles surface membranes entry lipid cellular golgi formation
DOC_ID: 1325
TOPIC SCORE: 0.8328878853109434 
TEXT: INTERMEDIATE COMPARTMENT: A SORTING STATION BETWEEN THE ENDOPLASMIC RETICULUM AND THE GOLGI APPARATUS | None | Introduction

Newly synthesized proteins and lipids leave the endoplasmic reticulum (E

In [1046]:
keyword_text = []
keyword_score = []
keyword_rank = []

for t in t_topics:
    docs = tm.get_docs(topic_ids=[t], rank=True)
    for idx, (text, doc_id, score, topic_id) in enumerate(docs):
        if idx >= 5:  # keep only the five highest-ranked texts per topic
            break
        keyword_text.append(text)
        keyword_score.append(score)
        keyword_rank.append(idx + 1)

col_name = 'Search Keyword Text'
col_val = search_term.value+': Top 5 Ranked Texts per Topic'

keyword_frame =  list(zip(keyword_rank,keyword_score,keyword_text))
keyword_results = pd.DataFrame(keyword_frame, columns = ['keyword_rank_in_topic','keyword_score',text_var.value])
keyword_results[col_name] = col_val

keyword_results

Unnamed: 0,keyword_rank_in_topic,keyword_score,combined_text_field,Search Keyword Text
0,1,0.995992,RECOMENDACIONES PARA EL MANEJO DE LA FARINGOAM...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
1,2,0.99512,RECOMENDACIONES PARA EL MANEJO DE LA FARINGOAM...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
2,3,0.994095,None | None | \n\nActitud diagnóstica y terapé...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
3,4,0.993985,INFECCIONES NOSOCOMIALES EN PEDIATRÍA | Abstra...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
4,5,0.993335,REANIMACIÓN Y PREVENCIÓN DE LAS INFECCIONES NO...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
5,1,0.87841,PATHWAYS OF PROTEIN SORTING AND MEMBRANE TRAFF...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
6,2,0.832888,INTERMEDIATE COMPARTMENT: A SORTING STATION BE...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
7,3,0.830317,ENDOMEMBRANE SYSTEM OF PLANTS AND FUNGI I. INT...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
8,4,0.814098,MORE THAN ONE DOOR -BUDDING OF ENVELOPED VIRUS...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."
9,5,0.811854,INFLUENZA VIRUS MORPHOGENESIS AND BUDDING | Ab...,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."


## 3.2 - Search for Text by Phrase

Using the text box below, enter a **search phrase** (or question) to return a ranked list of relevant texts.  Use the **Results** slider to set the number of texts to return.

In [1047]:
tm.train_recommender()

In [1048]:
search_phrase = widgets.Text(
    value='Which treatments have the highest likelihood of working to combat the effects of COVID-19, have the most safety data from previous use, and are likely to be available in supplies sufficient to treat substantial numbers of patients?',
    description='Question:',
    disabled=False,
    layout=Layout(width='100%')
)

number_results = widgets.IntSlider(
    value=5,
    min=1,
    max=50,
    step=1,
    description='Results:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

display(search_phrase, number_results)

Text(value='Which treatments have the highest likelihood of working to combat the effects of COVID-19, have th…

IntSlider(value=5, description='Results:', max=50, min=1)

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

In [1049]:
search_phrase_results = tm.recommend(text=search_phrase.value, n=number_results.value)
# Keep the (text, doc_id, score, topic_id) fields and sort by score, highest first
search_phrase_results = sorted([(v[0], v[1], v[2], v[3]) for v in search_phrase_results], key=lambda v: v[2], reverse=True)


The cell below prints a ranked summary of the texts most relevant to your search phrase.

In [1050]:
for i, doc in enumerate(search_phrase_results):
    print('RESULT #%s'% (i+1))
    print('TEXT:\n\t%s' % (" ".join(doc[0].split()[:200])))
    print('DOC_ID: %s' % (doc[1]))
    print('SCORE: %s' % (doc[2]))
    print('TOPIC_ID: %s' % (doc[3]))
    print()

RESULT #1
TEXT:
	None | None | Repositioning of drugs for use as antiviral treatments is a critical need [1] . It is commonly very badly perceived by virologists, as we experienced when reporting the effectiveness of azithromycin for Zika virus [2] . A response has come from China to the respiratory disease caused by the new coronavirus (SARS-CoV-2) that emerged in December 2019 in this country. Indeed, following the very recent publication of results showing the in vitro activity of chloroquine against SARS-CoV-2 [3] , data have been reported on the efficacy of this drug in patients with SARS-CoV-2-related pneumonia at different levels of severity [4, 5] . Indeed, following the in vitro results, 20 clinical studies were launched in several Chinese hospitals.The first results obtained from more than 100 patients showed the superiority of chloroquine compared with treatment of the control group in terms of reduction of exacerbation of pneumonia, duration of symptoms and delay of viral c

In [1051]:
rank_list = []
text_list = []
score_list = []

for rank,(txt, doc_id, score, topic_id) in enumerate(search_phrase_results):
    rank_list.append(rank+1)
    text_list.append(txt)
    score_list.append(score)

col_name = 'Search Phrase Text'
col_value = search_phrase.value+' Top '+str(number_results.value)+' Ranked Text'

phrase_frame =  list(zip(rank_list,text_list,score_list))
phrase_results = pd.DataFrame(phrase_frame, columns = ['phrase_rank',text_var.value, 'phrase_score'])
phrase_results[col_name] = col_value

phrase_results


Unnamed: 0,phrase_rank,combined_text_field,phrase_score,Search Phrase Text
0,1,None | None | \n\nRepositioning of drugs for u...,0.527483,Which treatments have the highest likelihood o...
1,2,DIAGNOSIS AND TREATMENT OF VIRAL INFECTIONS AN...,0.489013,Which treatments have the highest likelihood o...
2,3,None | None | Introduction\n\nThis article pro...,0.488937,Which treatments have the highest likelihood o...
3,4,ANTIVIRAL THERAPY CHAPTER OUTLINE | None | \n\...,0.488895,Which treatments have the highest likelihood o...
4,5,MEETING REPORT: 27TH INTERNATIONAL CONFERENCE ...,0.48114,Which treatments have the highest likelihood o...
5,6,PERSONAL VIEW ANTIVIRAL EFFECTS OF CHLOROQUINE...,0.444867,Which treatments have the highest likelihood o...
6,7,IN VITRO SUSCEPTIBILITY OF 10 CLINICAL ISOLATE...,0.441079,Which treatments have the highest likelihood o...
7,8,None | None | Introduction\n\nInfection is an ...,0.435595,Which treatments have the highest likelihood o...
8,9,None | Abstract\n\npublicly funded repositorie...,0.432136,Which treatments have the highest likelihood o...
9,10,"DRUG REPURPOSING FOR NEW, EFFICIENT, BROAD SPE...",0.421043,Which treatments have the highest likelihood o...


## 3.3 - Merge Topics, Keywords and Search Phrase Results with Texts

The code blocks below merge the topic results back onto the original texts, along with the keyword-search and phrase-search results produced in the cells above.
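
To make the behaviour of these left merges concrete, here is a minimal sketch using pandas. The two small dataframes and their column names are hypothetical stand-ins for the notebook's data. It illustrates why integer columns such as `doc_id` and `topic_id` appear as floats (for example `34.0`) after merging: texts without a match receive `NaN`, which forces the column to a float type.

```python
import pandas as pd

# Hypothetical stand-ins for the original texts and the per-text topic results
texts = pd.DataFrame({"text": ["alpha", "beta", "gamma"]})
topics = pd.DataFrame({"text": ["alpha", "gamma"], "topic_id": [3, 7]})

# A left merge keeps every row of `texts`, matched or not
merged = texts.merge(topics, how="left", on="text")

# "beta" has no match, so its topic_id is NaN and the integer column is upcast to float
print(merged)
print(merged["topic_id"].dtype)
```

Note that a left merge also repeats a row of the left dataframe once per matching row on the right, so duplicated texts on either side can increase the row count.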

In [1052]:
docs = tm.get_docs()

In [1053]:
doc_df = pd.DataFrame(docs, columns =[text_var.value, 'doc_id', 'topic_score', 'topic_id'])

In [1054]:
doc_df.head(1)

Unnamed: 0,combined_text_field,doc_id,topic_score,topic_id
0,MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C...,0,0.655379,34


The cell below merges the **original text data** with the **topic id** assigned to each text.

In [1055]:
master_data['topic_model_data'] = master_data['topic_model_data'].merge(doc_df, how='left', on=text_var.value)


In [1056]:
master_data['topic_model_data'].head(1)

Unnamed: 0,paper_id,title,authors,affiliations,bibliography,bibliography_titles,number_of_references,bibliography_authors,cord_19_source,metadata_match,combined_text_field,doc_id,topic_score,topic_id
0,f905f78b32f63c6d14a79984dfb33f1b358b8ab4,MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C...,"Meytal Galilee, Akram Alian",Meytal Galilee (Technion -Israel Institute of ...,HIV drug resistance against strand transfer in...,HIV DRUG RESISTANCE AGAINST STRAND TRANSFER IN...,38,"K ANSTETT, B BRENNER, T MESPLEDE, M A WAINBERG...",biorxiv_medrxiv_full_text,yes,MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C...,0.0,0.655379,34.0


The cell below merges the **original text data** with the **topic label** assigned to each text.

In [1057]:
master_data['topic_model_data'] = master_data['topic_model_data'].merge(topics_df[['topic_id','topic_label']], how='left', on='topic_id')
                                                                                  

In [1058]:
master_data['topic_model_data'].head(1)

Unnamed: 0,paper_id,title,authors,affiliations,bibliography,bibliography_titles,number_of_references,bibliography_authors,cord_19_source,metadata_match,combined_text_field,doc_id,topic_score,topic_id,topic_label
0,f905f78b32f63c6d14a79984dfb33f1b358b8ab4,MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C...,"Meytal Galilee, Akram Alian",Meytal Galilee (Technion -Israel Institute of ...,HIV drug resistance against strand transfer in...,HIV DRUG RESISTANCE AGAINST STRAND TRANSFER IN...,38,"K ANSTETT, B BRENNER, T MESPLEDE, M A WAINBERG...",biorxiv_medrxiv_full_text,yes,MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C...,0.0,0.655379,34.0,binding structure residues domain


The cell below merges the **original text data** with the **keyword search results** for each text.

In [1059]:
master_data['topic_model_data'] = master_data['topic_model_data'].merge(keyword_results, how='left', on=text_var.value)
   

In [1060]:
master_data['topic_model_data'][master_data['topic_model_data']['keyword_rank_in_topic'] == 1].head(1)

Unnamed: 0,paper_id,title,authors,affiliations,bibliography,bibliography_titles,number_of_references,bibliography_authors,cord_19_source,metadata_match,combined_text_field,doc_id,topic_score,topic_id,topic_label,keyword_rank_in_topic,keyword_score,Search Keyword Text
5165,bcefb795b369f47edd3b2fd1838adef100b131ac,RECOMENDACIONES PARA EL MANEJO DE LA FARINGOAM...,"Josep M Cots, Juan-Ignacio Alós, Mario Bárcena...",Josep M Cots (Centro de Atención Primaria La M...,Estudio Nacional de la Infección Respiratoria ...,ESTUDIO NACIONAL DE LA INFECCIÓN RESPIRATORIA ...,54,"; A L BISNO; M H EBELL, M A SMITH, H C BARRY, ...",custom_license_full_text,yes,RECOMENDACIONES PARA EL MANEJO DE LA FARINGOAM...,4607.0,0.995992,16.0,los que las del,1.0,0.995992,"remdesivir, chloroquine, ritonavir: Top 5 Rank..."


The cell below merges the **original text data** with the **search phrase results** for each text.

In [1061]:
master_data['topic_model_data'] = master_data['topic_model_data'].merge(phrase_results, how='left', on=text_var.value)

In [1062]:
master_data['topic_model_data'][master_data['topic_model_data']['phrase_rank'] == 1].head(1)

Unnamed: 0,paper_id,title,authors,affiliations,bibliography,bibliography_titles,number_of_references,bibliography_authors,cord_19_source,metadata_match,...,doc_id,topic_score,topic_id,topic_label,keyword_rank_in_topic,keyword_score,Search Keyword Text,phrase_rank,phrase_score,Search Phrase Text
11022,868df013f3ff396406edc6f386e6ade9e7c899f4,,"Philippe Colson, Jean-Marc Rolain, Jean-Christ...","Philippe Colson (MEPHI, 27 boulevard Jean Moul...",Recycling of chloroquine and its hydroxyl anal...,RECYCLING OF CHLOROQUINE AND ITS HYDROXYL ANAL...,21,"J M ROLAIN, P COLSON, D RAOULT; E BOSSEBOEUF, ...",custom_license_full_text,yes,...,9793.0,0.527483,2.0,antiviral drug drugs hcv,,,,1.0,0.527483,Which treatments have the highest likelihood o...


## 3.4 - Get Visualization Coordinates for Texts

The code block below will generate x,y-coordinates for each text in the dataset for visualization purposes.

In [1063]:
doc_topics=tm.get_doctopics()

t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.

**Read more at:** [sklearn.manifold.TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)
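
The objective described above can be written out explicitly. In the standard formulation of t-SNE, with $p_{ij}$ denoting the similarity of points $i$ and $j$ in the high-dimensional space and $q_{ij}$ the corresponding similarity in the low-dimensional embedding, the cost being minimized is:

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

Because this cost is not convex, the optimizer can settle in different local minima depending on its starting point. Passing a fixed `random_state`, as the cell below does, makes a run repeatable under the same settings.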

In [1064]:
tsne_model = sklearn.manifold.TSNE(n_components=2, verbose=0, random_state=0, angle=.99, init='pca')

In [1065]:
tsne_lda = tsne_model.fit_transform(doc_topics)

Since a text can score on multiple topics, the following cell takes the topic with the maximum score for each text, so that each text is assigned to exactly one topic.

In [1066]:
# Index of the highest-scoring topic for each text
max_topics = [doc_topics[i].argmax() for i in range(doc_topics.shape[0])]
X_topics = np.array(max_topics)
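
The dominant-topic assignment can be illustrated without the fitted model. This is a minimal plain-Python sketch; the `doc_topics` values and the `dominant_topic` helper are hypothetical and stand in for the document-topic matrix used above.

```python
# Hypothetical document-topic scores: one row per text, one column per topic
doc_topics = [
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
    [0.25, 0.25, 0.50],
]

def dominant_topic(row):
    """Return the index of the largest score; ties resolve to the first maximum."""
    return max(range(len(row)), key=row.__getitem__)

dominant = [dominant_topic(row) for row in doc_topics]
print(dominant)
```

Keeping only the top topic discards the rest of each row, so a text that scores 0.51 on one topic and 0.49 on another is labelled the same way as a text that scores 0.99 on the first.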

In [1067]:
texts_w_topics = tm.get_texts()

In [1068]:
texts_w_topic_temp =  list(zip(texts_w_topics, tsne_lda[:,0], tsne_lda[:,1]))

In [1069]:
texts_w_topic_df = pd.DataFrame(texts_w_topic_temp, columns = [text_var.value, 'x', 'y']) 

In [1070]:
texts_w_topic_df.head(1)

Unnamed: 0,combined_text_field,x,y
0,MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C...,7.373386,-46.736141


In [1071]:
texts_w_topic_df = texts_w_topic_df.drop_duplicates(subset=text_var.value, keep='first')

The cell below merges the **original text data** with the **x,y-coordinates** assigned to each text.

In [1072]:
master_data['topic_model_data'] = master_data['topic_model_data'].merge(texts_w_topic_df, how='left', on=text_var.value)
master_data['topic_model_data'] = master_data['topic_model_data'].drop_duplicates(subset=text_var.value,keep='first')


In [1073]:
master_data['topic_model_data'].head(1)

Unnamed: 0,paper_id,title,authors,affiliations,bibliography,bibliography_titles,number_of_references,bibliography_authors,cord_19_source,metadata_match,...,topic_id,topic_label,keyword_rank_in_topic,keyword_score,Search Keyword Text,phrase_rank,phrase_score,Search Phrase Text,x,y
0,f905f78b32f63c6d14a79984dfb33f1b358b8ab4,MULTIMERIZATION OF HIV-1 INTEGRASE HINGES ON C...,"Meytal Galilee, Akram Alian",Meytal Galilee (Technion -Israel Institute of ...,HIV drug resistance against strand transfer in...,HIV DRUG RESISTANCE AGAINST STRAND TRANSFER IN...,38,"K ANSTETT, B BRENNER, T MESPLEDE, M A WAINBERG...",biorxiv_medrxiv_full_text,yes,...,34.0,binding structure residues domain,,,,,,,7.373386,-46.736141


## 4.0 - Export Dataframes for Offline Analysis or Secondary Processes

The following code block lets you select dataframes and export them to a local directory.  Use the inputs below to write the files to your current directory and, optionally, to add a timestamp to the filenames so that files previously saved to that folder are not overwritten.

## 4.1 - Load Export Functions

The functions below provide two export formats: one writes the selected dataframe as an .xlsx file, and the other writes it as a gzip-compressed parquet file.
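
The timestamp suffix used in the export filenames can be looked at on its own. This is a minimal sketch; `timestamped_name` is a hypothetical helper, not part of the notebook, and it accepts a fixed `now` value only so that the result is reproducible. Colons are removed from the ISO timestamp because they are not allowed in Windows filenames.

```python
import datetime as dt
import re

def timestamped_name(stem, extension, now=None):
    """Build '<stem>_<ISO timestamp without colons><extension>'."""
    now = now or dt.datetime.now()
    stamp = re.sub(":", "", now.isoformat())
    return "{}_{}{}".format(stem, stamp, extension)

name = timestamped_name("topic_model_data", ".xlsx", dt.datetime(2020, 3, 16, 9, 30, 15))
print(name)
```

Since the timestamp changes from one export to the next, each timestamped export gets its own filename and earlier files are left in place.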

In [1074]:
import datetime as dt
import os
import re

def dict_to_excel(dict_name, dframe, subfolder, timestamp = False):
    
    # Inputs: a dictionary of dataframes, the key (dframe) of the dataframe to write, and an optional subfolder;
    #         timestamp = True adds an ISO-formatted suffix to the filename
    # Description: Writes one dataframe from the dictionary to xlsx (in your directory)

    # Colons are removed from the timestamp because they are not allowed in Windows filenames
    suffix = '_' + re.sub(r":+", '', dt.datetime.now().isoformat()) + '.xlsx' if timestamp else '.xlsx'
    file_path = os.path.join(subfolder, dframe + suffix) if subfolder else dframe + suffix
        
    try:
        dict_name[dframe].to_excel(file_path, index = False)
        print('Successfully wrote {} with {} rows and {} columns to the directory'.format(dframe+suffix, dict_name[dframe].shape[0], dict_name[dframe].shape[1]))
    except Exception as e:
        # Report the underlying error so that failures (for example, a missing subfolder) can be diagnosed
        print('Writing the data to the directory failed: {}'.format(e))
        
def dict_to_parquet(dict_name, dframe, subfolder, timestamp = False):
    
    # Inputs: a dictionary of dataframes, the key (dframe) of the dataframe to write, and an optional subfolder;
    #         timestamp = True adds an ISO-formatted suffix to the filename
    # Description: Writes one dataframe from the dictionary to gzip-compressed parquet (in your directory)

    # Colons are removed from the timestamp because they are not allowed in Windows filenames
    suffix = '_' + re.sub(r":+", '', dt.datetime.now().isoformat()) + '.parquet.gzip' if timestamp else '.parquet.gzip'
    file_path = os.path.join(subfolder, dframe + suffix) if subfolder else dframe + suffix
        
    try:
        dict_name[dframe].to_parquet(file_path, compression='gzip')
        print('Successfully wrote {} with {} rows and {} columns to the directory'.format(dframe+suffix, dict_name[dframe].shape[0], dict_name[dframe].shape[1]))
    except Exception as e:
        # Report the underlying error so that failures (for example, a missing subfolder) can be diagnosed
        print('Writing the data to the directory failed: {}'.format(e))


## 4.2 - Select and Export Files

The code block below lets you select which tables to export and in which format.  If you choose to write the files to a subfolder, make sure that the subfolder already exists in your working directory.

Select one or more of the available dataframes, then choose whether to save the files to the current working directory or to a subfolder of it.  Lastly, to add a timestamp to the exported filenames, select Timestamp = 'yes'; this prevents overwriting files previously saved to the folder.

### Important Note:
The **topic_model_data** option in the Tables menu below contains the "stacked" full text from all of the original data sources combined, along with the additional topic attributes.

In [1075]:
dict_keys = widgets.SelectMultiple(
    options=master_data.keys(),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

subfolder_option = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

output_type = widgets.RadioButtons(
    options=['xlsx','parquet'],
    value='xlsx',
    description='Output Type:',
    disabled=False
)

timestamp_option = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Timestamp:',
    disabled=False
)

subfolder_text = widgets.Text(
    value='output',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

def sub_folder_edit(y):
    if(y=='yes'):
        # The text box value is read when the export cell runs, so no submit handler is registered here
        display(subfolder_text)
        print('Your file(s) will be written to the subfolder in {}[Your Entry Above]'.format(os.getcwd()+os.sep))
    else:
        print('Using {} folder'.format(os.getcwd()))
        
y = widgets.interactive(sub_folder_edit, y=subfolder_option)

display(y, timestamp_option, output_type)

SelectMultiple(description='Tables:', layout=Layout(width='50%'), options=('Sheet1', 'topic_model_data', 'topi…

interactive(children=(RadioButtons(description='Subfolder:', options=('no', 'yes'), value='no'), Output()), _d…

RadioButtons(description='Timestamp:', options=('no', 'yes'), value='no')

RadioButtons(description='Output Type:', options=('xlsx', 'parquet'), value='xlsx')

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
Execute the code cell below to export the selected files (.xlsx or parquet) to your chosen directory.

**NOTE:** If you have chosen to write your files to a subfolder, ensure that the folder already exists in your working directory.  The export functions will **not** create it for you.
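
If creating the folder by hand is inconvenient, one option is to create it just before exporting. The following is a minimal sketch using only the standard library; `ensure_subfolder` is a hypothetical helper that is not part of this notebook, and the temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

def ensure_subfolder(base_dir, subfolder):
    """Create base_dir/subfolder if it is missing and return its path."""
    path = os.path.join(base_dir, subfolder)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path

with tempfile.TemporaryDirectory() as base:
    created = ensure_subfolder(base, "output")
    existed = os.path.isdir(created)
    again = ensure_subfolder(base, "output")  # a second call is harmless

print(existed, created == again)
```

In this notebook, the equivalent call would use `os.getcwd()` as the base directory and the value of the Subfolder text box as the folder name.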

In [1096]:
if subfolder_option.value == 'yes':
    subfolder = subfolder_text.value
else:
    subfolder = None
    
dframe_list = list(dict_keys.value)

timestamp_boolean = timestamp_option.value == 'yes'
 
for df in dframe_list:
    if output_type.value == 'parquet':
        dict_to_parquet(master_data, df, subfolder, timestamp = timestamp_boolean)
    else:
        dict_to_excel(master_data, df, subfolder, timestamp = timestamp_boolean)

Successfully wrote topic_model_data.xlsx with 29133 rows and 23 columns to the directory
