<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports-and-Setup" data-toc-modified-id="Imports-and-Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports and Setup</a></span></li><li><span><a href="#Keyword-Analysis-Functions" data-toc-modified-id="Keyword-Analysis-Functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Keyword Analysis Functions</a></span><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Prepare-Text-Data" data-toc-modified-id="Prepare-Text-Data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Prepare Text Data</a></span></li><li><span><a href="#Show-Model-Topics" data-toc-modified-id="Show-Model-Topics-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Show Model Topics</a></span></li><li><span><a href="#Generate-Keywords" data-toc-modified-id="Generate-Keywords-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Generate Keywords</a></span></li><li><span><a href="#Translate-Output" data-toc-modified-id="Translate-Output-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Translate Output</a></span></li><li><span><a href="#Order-by-Part-of-Speech" data-toc-modified-id="Order-by-Part-of-Speech-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Order by Part of Speech</a></span></li><li><span><a href="#Get-TFIDF-Keywords" data-toc-modified-id="Get-TFIDF-Keywords-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Get TFIDF Keywords</a></span></li></ul></li><li><span><a href="#Visualization-Functions" data-toc-modified-id="Visualization-Functions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Visualization Functions</a></span><ul class="toc-item"><li><span><a href="#Graph-of-Topic-Number-Evaluations" data-toc-modified-id="Graph-of-Topic-Number-Evaluations-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Graph of Topic Number Evaluations</a></span></li><li><span><a 
href="#pyLDAvis-Topic-Visualization" data-toc-modified-id="pyLDAvis-Topic-Visualization-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>pyLDAvis Topic Visualization</a></span></li><li><span><a href="#Word-Cloud" data-toc-modified-id="Word-Cloud-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Word Cloud</a></span></li></ul></li><li><span><a href="#gen_analysis_files-Test" data-toc-modified-id="gen_analysis_files-Test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>gen_analysis_files Test</a></span></li></ul></div>

This notebook demonstrates kwgen by deriving the top keywords for papers in the [COVID-19 Open Research Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/).

Follow the provided link and download the 

# Imports and Setup

In [1]:
import os
import sys

import pandas as pd

import kwgen

import matplotlib.pyplot as plt
import seaborn as sns
# Plot settings
sns.set(style="darkgrid")
sns.set(rc={'figure.figsize':(15,5)})

pd.set_option("display.max_rows", 16) # maximum df rows
pd.set_option('display.max_columns', None) # maximum df columns
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:99% !important; }</style>")) # widens interface
# %matplotlib notebook

# Keyword Analysis Functions

## Load Data

In [None]:
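# `path`, `input_language` and `output_language` are used throughout but never
# defined in this notebook; set them before running the cells below
df_corpus = kwgen.utils.load_data(data=path)
df_corpus.loc[:5]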

## Prepare Text Data

In [None]:
text_corpus = kwgen.utils.prepare_text_data(
    data=df_corpus, 
    input_language=input_language, 
    incl_mc_questions=False,
    min_freq=2,
    min_word_len=4
)[0]

In [None]:
# `index` is not defined above; set it to the position of the text to inspect
text_corpus[index]
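
The `min_freq` and `min_word_len` arguments suggest that rare and short words are filtered out during preparation. The following is a minimal sketch of that idea in plain Python, not kwgen's implementation: the helper `prepare_texts`, the sample sentences, and the choice to treat both thresholds as inclusive are assumptions made for illustration.

```python
import re
from collections import Counter


def prepare_texts(texts, min_freq=2, min_word_len=4):
    """Tokenize texts and drop short or rare words (illustrative only)."""
    # Lowercase each text and split it into alphabetic tokens.
    tokenized = [re.findall(r"[a-z]+", text.lower()) for text in texts]

    # Count how often each word occurs across the whole corpus.
    counts = Counter(word for tokens in tokenized for word in tokens)

    # Keep words that are long enough and frequent enough.
    return [
        [w for w in tokens if len(w) >= min_word_len and counts[w] >= min_freq]
        for tokens in tokenized
    ]


# Made-up sentences standing in for paper texts.
example_texts = [
    "The virus spreads in the air",
    "A virus vaccine trial",
    "Vaccine trial data on the virus",
]
prepared = prepare_texts(example_texts)
print(prepared)
```

Words such as "the" fall below the length threshold, while "spreads" and "data" occur only once and fall below the frequency threshold.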

## Show Model Topics

In [11]:
topics = kwgen.model.gen_keywords(
    method='lda',
    text_corpus=df_corpus,
    clean_texts=None,
    input_language=input_language,
    output_language=output_language,
    num_keywords=15,
    num_topics=10,
    corpuses_to_compare=None,
    return_topics=True, # <--
    incl_mc_questions=False,
    ignore_words=None,
    min_freq=2,
    min_word_len=4,
    sample_size=1
)

In [None]:
topics[0]

## Generate Keywords

In [23]:
ignore_words = None

In [24]:
freq_kws = kwgen.model.gen_keywords(
    method='frequency',
    text_corpus=df_corpus,
    clean_texts=None,
    input_language=input_language,
    output_language=output_language,
    num_keywords=15,
    num_topics=5,
    corpuses_to_compare=None,
    return_topics=False,
    incl_mc_questions=False,
    ignore_words=ignore_words,
    min_freq=2,
    min_word_len=4,
    sample_size=1
)

In [None]:
freq_kws
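
As a rough illustration of what a frequency-based method computes, the sketch below counts words across tokenized texts and returns the most common ones, skipping any ignored words. The function `top_frequency_keywords` and the sample tokens are hypothetical and only mirror the `num_keywords` and `ignore_words` arguments used above; kwgen's own logic may differ.

```python
from collections import Counter


def top_frequency_keywords(token_lists, num_keywords=15, ignore_words=None):
    """Return the most frequent words across all token lists."""
    ignore = set(ignore_words or [])

    # Count every token that is not explicitly ignored.
    counts = Counter(
        word for tokens in token_lists for word in tokens if word not in ignore
    )

    # most_common sorts by descending count.
    return [word for word, _ in counts.most_common(num_keywords)]


# Made-up tokenized texts for illustration.
example_tokens = [
    ["virus", "vaccine", "trial"],
    ["virus", "vaccine", "trial"],
    ["virus", "vaccine", "data"],
    ["virus"],
]
top_two = top_frequency_keywords(example_tokens, num_keywords=2)
top_two_ignoring = top_frequency_keywords(
    example_tokens, num_keywords=2, ignore_words=["virus"]
)
print(top_two, top_two_ignoring)
```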

In [16]:
LDA_kws = kwgen.model.gen_keywords(
    method='lda',
    text_corpus=df_corpus,
    clean_texts=None,
    input_language=input_language,
    output_language=output_language,
    num_keywords=15,
    num_topics=12,
    corpuses_to_compare=None,
    return_topics=False,
    incl_mc_questions=False,
    ignore_words=ignore_words,
    min_freq=2,
    min_word_len=4,
    sample_size=1
)

In [None]:
LDA_kws

## Translate Output

In [None]:
# Translation occasionally fails; rerun the cell if it does
kwgen.utils.translate_output(
    outputs=LDA_kws,
    input_language=input_language,
    output_language='en'
) # change output_language to translate into another language
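
Since the translation step can fail intermittently, rerunning by hand can be replaced with a small retry helper. This is a generic sketch, not part of kwgen: `call_with_retries` and the deliberately flaky stand-in function are hypothetical, and the broad `except` is used only because the type of failure is not specified here.

```python
import time


def call_with_retries(func, max_attempts=3, delay_seconds=1.0):
    """Call func(), retrying after an exception up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: failure type is unknown
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # All attempts failed, so surface the last error to the caller.
    raise last_error


# Stand-in for a call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_translation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return ["virus", "vaccine"]


result = call_with_retries(flaky_translation, max_attempts=5, delay_seconds=0)
print(result, attempts["count"])
```

The translation call could be passed in the same way by wrapping it in a zero-argument `lambda`.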

## Order by Part of Speech

In [None]:
kwgen.utils._order_by_pos(outputs=LDA_kws, output_language=output_language)

## Get TFIDF Keywords

In [23]:
# This generally produces the most frequent words in the text corpus or general LDA words
# The *_path_lang variables are not defined in this notebook; each is indexed
# as [0] for the text corpus and [1] for its language
tfidf_kws = kwgen.model.gen_keywords(
    method='tfidf',
    text_corpus=mit_vergnuegen_path_lang[0],
    clean_texts=None,
    input_language=mit_vergnuegen_path_lang[1],
    output_language=mit_vergnuegen_path_lang[1],
    num_keywords=10,
    num_topics=5,
    corpuses_to_compare=[was_mit_medien_path_lang[0], # <-
                         payment_banking_path_lang[0],
                         blackrock_path_lang[0]],
    return_topics=False,
    incl_mc_questions=False,
    ignore_words=ignore_words,
    min_freq=2,
    min_word_len=4,
    sample_size=1
)

In [None]:
tfidf_kws
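
To show why TF-IDF needs `corpuses_to_compare`, the sketch below scores the words of one target corpus against a set of comparison corpora, treating each corpus as a single document. It is a textbook TF-IDF variant written for illustration, not kwgen's implementation; `tfidf_keywords` and the sample tokens are made up.

```python
import math
from collections import Counter


def tfidf_keywords(target_tokens, comparison_token_lists, num_keywords=10):
    """Rank words in target_tokens by TF-IDF against comparison corpora."""
    corpora = [target_tokens] + list(comparison_token_lists)
    num_corpora = len(corpora)

    # Document frequency: in how many corpora does each word appear?
    doc_freq = Counter()
    for tokens in corpora:
        doc_freq.update(set(tokens))

    term_counts = Counter(target_tokens)
    total_terms = len(target_tokens)

    scores = {}
    for word, count in term_counts.items():
        tf = count / total_terms  # relative frequency in the target corpus
        idf = math.log(num_corpora / doc_freq[word])  # 0 if word is in every corpus
        scores[word] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:num_keywords]]


# Made-up tokens: "podcast" and "money" occur in every corpus, "berlin" in one.
target = ["podcast", "berlin", "berlin", "money"]
comparisons = [
    ["podcast", "media", "money"],
    ["podcast", "payment", "banking", "money"],
    ["podcast", "fund", "money"],
]
keywords = tfidf_keywords(target, comparisons, num_keywords=3)
print(keywords)
```

Words that appear in every corpus receive a score of zero, so only terms distinctive to the target corpus rank highly.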

# Visualization Functions

## Graph of Topic Number Evaluations

In [None]:
figure = kwgen.visuals.graph_topic_num_evals(
    method=['lda', 'lda_bert'], # lda, bert, lda_bert
    text_corpus=df_corpus, 
    clean_texts=None,
    input_language=input_language,
    num_keywords=15,
    topic_nums_to_compare=None,
    incl_mc_questions=False,
    min_freq=2,
    min_word_len=4,
    sample_size=1,
    metrics=True, # stability or coherence
    fig_size=(20,10),
    save_file=False,
    return_ideal_metrics=False
) # <- used for gen_analysis_files

plt.show()

## pyLDAvis Topic Visualization

In [28]:
# Commented out, as it changes the output dimensions due to its width
# kwgen.visuals.pyLDAvis_topics(
#     method='lda',
#     text_corpus=df_corpus, 
#     input_language=input_language,
#     num_topics=10,
#     incl_mc_questions=False,
#     min_freq=2,
#     min_word_len=4,
#     save_file=False,
#     display_ipython=True # <- show in Jupyter notebook
# )

## Word Cloud

In [None]:
kwgen.visuals.gen_word_cloud(
    text_corpus=df_corpus,
    input_language=input_language,
    incl_mc_questions=False,
    ignore_words=ignore_words,
    min_freq=2,
    min_word_len=4,
    sample_size=1,
    height=500,
    save_file=False
)

# gen_analysis_files Test

In [None]:
kwgen.visuals.gen_files(
    method=['lda', 'bert', 'lda_bert'],
    text_corpus=df_corpus, 
    clean_texts=None,
    input_language=input_language,
    output_language=output_language,
    num_keywords=15,
    topic_nums_to_compare=[10,11,12,13,14,15],
    corpuses_to_compare=None,
    incl_mc_questions=False,
    ignore_words=None,
    min_freq=2,
    min_word_len=4,
    sample_size=1,
    fig_size=(20,10),
    zip_results=True
)
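
The `zip_results=True` argument indicates that the generated files are bundled into an archive. How kwgen does this is not shown here; as a generic sketch, a results folder can be zipped with the standard library as follows. The folder name `analysis_results` and the file `keywords.txt` are placeholders invented for the example.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output folder with one placeholder results file.
    results_dir = os.path.join(tmp_dir, "analysis_results")
    os.makedirs(results_dir)
    with open(os.path.join(results_dir, "keywords.txt"), "w") as f:
        f.write("virus\nvaccine\n")

    # make_archive appends the ".zip" extension and returns the archive path.
    archive_path = shutil.make_archive(
        base_name=os.path.join(tmp_dir, "analysis_results"),
        format="zip",
        root_dir=results_dir,
    )

    # Read back the archive contents before the temporary folder is removed.
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(archived_names)
```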