# Focused Concept Miner Command Line Interface
# User Guide
By Eric Zhou, Carnegie Mellon University Tepper School of Business

Please contact me at ebzhou@tepper.cmu.edu for any feedback or inquiries.

Version: September 2020

# 1 Introduction
We present the Focused Concept Miner (FCM) Command Line Interface (CLI), an exploratory tool that operationalizes an interpretable, deep learning-based text mining algorithm so that managers and researchers can analyze high-resolution unstructured data. The tool is designed to be accessible to users with or without a technical background. For more information about FCM CLI, please visit [FCMiner](http://www.fcminer.com/).

In this guide, we walk step by step through the setup, features, implementation, and troubleshooting of FCM CLI. We also discuss a general approach to exploring data and filtering results to obtain meaningful insights using FCM's __Grid Search__ and __Visualize__ features.

Becoming comfortable with the features and processes we discuss in the following sections will enable you to rapidly extract insights from unstructured data in a fraction of the time that involved, multi-step approaches require. We encourage you to experiment with creative applications to glean insights and develop novel hypotheses from rich unstructured data.

This document covers the following:

1. Introduction & Caveats

2. Installation

3. Commands & Features

4. Demonstration

5. Closing Comments

## 1.1 Caveats
Before we begin, we'd like to present several caveats.

First, FCM is a purely data-driven exploratory tool that is correlational in nature. Thus, we do not make any causal claims about the results.

Additionally, all machine learning models are noisy representations of reality and could be faulty. Patterns identified by FCM should be, at most, interpreted as insights into the model, not reality. Users must investigate further to validate potential patterns as informative of real-world phenomena and to make causal claims.

Consequently, users should practice trial and error to achieve the best results with FCM. This involves leveraging two of its core features, **Grid Search** and **Visualize**, to validate patterns found by FCM using human judgement and domain expertise.

That being said, trial and error when tuning hyperparameters and filtering results can be time-consuming and even frustrating, as with most deep learning frameworks. Users must perform due diligence to arrive at good results. Refer to [this paper](https://arxiv.org/abs/1206.5533) for hyperparameter tuning tips and tricks.

# 2 Installation


## 2.1 Getting Started: Installing Python
FCM CLI requires a working installation of Python 3.6 or greater. Available Python distributions can be found [here](https://www.python.org/downloads/windows/). Alternatively, you can install Python via the [Anaconda](https://www.anaconda.com/products/individual)/[Miniconda](https://docs.conda.io/en/latest/miniconda.html) distribution, which includes Python along with other scientific computing libraries.

Download the appropriate Python distribution and complete the installation process.

If you would prefer a more user-friendly interface for running FCM than the system command line, consider installing an integrated development environment (IDE). A couple of good options include:

• [Visual Studio Code](https://code.visualstudio.com/)

• [PyCharm](https://www.jetbrains.com/pycharm/)

We offer two installation options, either via Command Line/terminal or in Google Colab. If you have a machine with adequate hardware (GPU, high RAM) and prefer to use an IDE to run FCM or will simply use the Command Line, follow the steps outlined in Section 2.2.

Otherwise, proceed to Section 2.3 to set up FCM in Google Colab.

## 2.2 FCM CLI Setup: Command Line
To set up the FCM CLI, we must first copy the repository from GitHub to your machine. The requisite FCM CLI files can be obtained [here](https://github.com/cygit/fcm). Click the green "Code" button and then "Download ZIP". Save and extract the ZIP to an accessible location on your computer.

![github](img/2.2_github.png)

If you have access to GitHub Desktop, you can clone the repository to your computer directly from GitHub via “Open with GitHub Desktop” or with the command:
    
    git clone git@github.com:ecfm/fcm_cli.git

Now that we have obtained the source files, we must install and activate the environment. Follow these steps:

1. Open up your command line by opening the Start Menu and typing in "cmd". In the command line, you will see your current working directory at the start of each line (e.g. `/mnt/c/Users/name`). We must navigate to the fcm_cli directory that we extracted earlier.

    ![cwd](img/2.2_cwd.png)


2. Find the fcm_cli folder path in File Explorer, and copy the full path. In the example below, we will copy `C:\Users\name\Documents\GitHub\fcm_cli`
    
    ![directory](img/navigate.png)
3. In the command line, execute the following command with the full path you copied. Given the example path, the input should appear as:

    ```bash
    cd C:\Users\name\Documents\GitHub\fcm_cli
    ```
    
    Your new working directory should appear as `C:\Users\name\Documents\GitHub\fcm_cli` in the command line.

    ![cd](img/2.2_cd.png)
4. Once you have navigated to the fcm_cli directory, execute the following commands in the command line to create a virtual environment and install FCM along with its required packages:

        cd FCM_CLI_PATH  # replace FCM_CLI_PATH with the path to your fcm_cli directory
        python -m venv env
        source env/bin/activate
        cd src
        pip install --editable .

   **NOTE**: You must include the period at the end of `pip install --editable .`, otherwise the installation will fail!

The first time you install FCM CLI, you must create a Python virtual environment and install the required packages. This may take several minutes.

From here on, each time you start a new command line instance, you must navigate to the fcm_cli directory as above and execute the commands:
```bash
        source env/bin/activate
        cd src
```

At this point, you should have run the commands above, and the installation of FCM CLI should be in progress, as seen below:

![install](img/2.2_install.png)

Alternatively, if you'd like to use a conda environment, execute the following to create a virtual environment:

    cd FCM_CLI_PATH  # replace FCM_CLI_PATH with the path to your fcm_cli directory
    conda create --name fcm python=3.6
    conda activate fcm
    cd src
    pip install --editable .

__NOTE__: There have been instances where the installation fails because certain packages cannot be retrieved. Try rerunning `pip install --editable .`. If this doesn't work, try installing the packages below individually using `pip install PACKAGE_NAME`.

The required packages are:
- click
- django
- nltk
- numpy
- pandas
- pyLDAvis
- scipy
- scikit-learn (imported as `sklearn`)
- torch
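
As a quick sanity check after installation, a short script along these lines can report which of the packages listed above are not importable in the active environment. This is only a sketch: `missing_packages` is a helper name introduced here for illustration, and the script assumes the import names match the list above (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Import names for the packages listed above (assumed to match the list).
REQUIRED = ["click", "django", "nltk", "numpy", "pandas",
            "pyLDAvis", "scipy", "sklearn", "torch"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # FCM CLI requires Python 3.6 or greater.
    if sys.version_info < (3, 6):
        print("Python 3.6 or greater is required; found", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Any name it reports can then be installed individually with `pip install PACKAGE_NAME` as described above.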

5. Once the installation is complete, ensure that FCM CLI is working properly by using the __Help__ command. Execute the following:
            
        fcm --help
            
    You should see a prompt as follows:

    ![Help Prompt](img/2.2_help.png)

    Congratulations - your FCM CLI is now ready! In Section 3, we will discuss the available commands and features and how to use them.

## 2.3 RECOMMENDED: Setup in Google Colab

If you do not have access to a device with a GPU or adequate processing power, consider using Google Colab to run FCM CLI. Colab provides access to powerful GPUs and high-RAM environments to run Python code at no cost. The setup is even simpler as well!

To set up FCM CLI in Google Colab, follow the steps in the [Colab Guide](./fcm_cli_colab.ipynb).

# 3 Commands & Features

Currently, FCM CLI has three core commands:

1. __Train__: given a dataset and output directory, train FCM using a set of default hyperparameters.

2. __Grid search__: given a configuration file, train FCM with user-defined hyperparameters.

3. __Visualize__: display grid search results with filtering sliders to evaluate outputs.

## 3.1 Train

The __Train__ command will train FCM on the specified dataset with default hyperparameters and save the outputs in the user-specified folder.

To execute __Train__, enter the following command in the command line:

`fcm train DATASET_NAME OUTPUT_DIRECTORY`

The user specifies the `DATASET_NAME` exactly as found in the directory `fcm_cli/src/dataset`. If `DATASET_NAME` is 'csv', the user must also specify three options: `--csv-path` (path to the csv file), `--csv-text` (column name of the text field in the csv file), and `--csv-label` (column name of the label field in the csv file). Results will be saved in the user-specified `OUTPUT_DIRECTORY`.
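
For the 'csv' case, the sketch below writes a tiny csv file and assembles the matching `fcm train` command. It does not run FCM. The column names, rows, and output directory are made up for illustration, `build_train_command` is a hypothetical helper, and placing the options after the two positional arguments is an assumption about how the CLI expects them.

```python
import csv
import os
import shlex
import tempfile


def build_train_command(output_dir, csv_path, text_column, label_column):
    """Assemble the argument list for `fcm train` on a custom csv file."""
    return [
        "fcm", "train", "csv", output_dir,
        "--csv-path", csv_path,
        "--csv-text", text_column,
        "--csv-label", label_column,
    ]


# Write a tiny two-column csv with a text field and a binary label.
# The column names and rows are made up for this example.
rows = [
    {"description": "Need a loan to pay for school.", "repaid": 1},
    {"description": "Consolidating credit card debt.", "repaid": 0},
]
csv_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["description", "repaid"])
    writer.writeheader()
    writer.writerows(rows)

command = build_train_command("../output/example", csv_path, "description", "repaid")
# Print a copy-pasteable command line with shell-safe quoting.
print(" ".join(shlex.quote(part) for part in command))
```

The printed line can be pasted into the command line from the `src` directory once the environment is active.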

The command line should appear as:

![train](img/3.1_train.png)

While this command is running, logs will be saved in `OUTPUT_DIRECTORY/fcm.log`.

Metrics from each epoch will be written in `OUTPUT_DIRECTORY/train_metrics.txt`.

Concept words will be saved in `OUTPUT_DIRECTORY/concept/epoch*.txt` for each epoch.

The state of the model will be saved every 10 epochs and at the last epoch in `OUTPUT_DIRECTORY/model/epoch*.pytorch`.

Concept distributions will be saved every 10 epochs as well in `OUTPUT_DIRECTORY/model/epoch*_train_doc_concept_probs.npy`.
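
To inspect the most recent concept words after a run, one could locate the highest-numbered file in `OUTPUT_DIRECTORY/concept`. The sketch below assumes the files are named `epoch<N>.txt` with an integer `N`, which is an assumption about the `epoch*.txt` pattern; `latest_concept_file` is a hypothetical helper, and the demonstration uses a throwaway directory rather than real FCM output.

```python
import os
import re
import tempfile

# Assumed file naming: "epoch" followed by an integer and ".txt".
EPOCH_PATTERN = re.compile(r"epoch(\d+)\.txt$")


def latest_concept_file(output_dir):
    """Return the path of the highest-numbered concept file, or None."""
    concept_dir = os.path.join(output_dir, "concept")
    best_epoch, best_path = -1, None
    for name in os.listdir(concept_dir):
        match = EPOCH_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(concept_dir, name)
    return best_path


# Demonstrate on a throwaway directory that mimics the layout above.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "concept"))
for epoch in (0, 2, 10):
    open(os.path.join(demo_dir, "concept", "epoch%d.txt" % epoch), "w").close()

print(latest_concept_file(demo_dir))
```

Comparing the epoch numbers as integers avoids the pitfall of alphabetical sorting, where `epoch2.txt` would sort after `epoch10.txt`.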

## 3.2 Grid Search

__Grid Search__ trains FCM on all combinations of hyperparameters in the user-specified search space in the configuration file associated with the dataset. To run __Grid Search__, execute the following in the command line:

`fcm grid-search ../configs/CONFIG_FILE`

Where `CONFIG_FILE` is the exact name of the configuration file that you'd like to run. Once you run the command, you should see the below prompt in your command line. FCM CLI will load in the data and inform you of the number of combinations in the search space. 

![gridsearch](img/3.2_gridsearch.png)

Training then proceeds through each combination until all are complete or the process is interrupted. If interrupted, the process can be resumed with the same command above, and FCM will continue from the last configuration.

Grid search results will be saved to the directory specified by `out_dir` in the configuration. The configuration file also contains all of the hyperparameters that can be changed by the user.

The configuration files are stored in `fcm_cli/configs` and can be edited directly in Notepad or most IDEs. See below for an example configuration file open in Notepad. Generally, you should only adjust 'dataset_params' and 'fcm_params'. A brief description of the hyperparameters can be found at the end of this section.

![config](img/3.2_config.png)

Within each grid search run, there will be directories named with a `RUN_ID`, the hash value of each set of hyperparameters being searched. For example, for a grid search on dataset `prosper_loan` with `OUT_DIR="run0"` and `RUN_ID="0f735f978246aa65aa1806299869978c"`, the results are located in `../grid_search/prosper_loan/run0/0f735f978246aa65aa1806299869978c`.

Within each directory `../grid_search/DATASET_NAME/OUT_DIR/RUN_ID/` , you will find similar files as described above in __Train__.

These include a log file `fcm.log`, a metrics file `train_metrics.txt`, concept words `concept/epoch*.txt`, saved models `model/epoch*.pytorch`, concept distributions `model/epoch*_train_doc_concept_probs.npy`, and the best metrics of each hyperparameter configuration `results.csv`.
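
The directory layout described above can be navigated programmatically. The following sketch builds the path of a single configuration and lists the `RUN_ID` directories of one grid search. `run_directory` and `list_run_ids` are hypothetical helpers, the second run ID is made up, and the demonstration creates empty directories in a temporary location instead of reading real results.

```python
import os
import tempfile


def run_directory(dataset_name, out_dir, run_id, root=os.path.join("..", "grid_search")):
    """Build the path of one configuration's results."""
    return os.path.join(root, dataset_name, out_dir, run_id)


def list_run_ids(search_dir):
    """Return the names of the RUN_ID subdirectories, sorted alphabetically."""
    return sorted(
        entry for entry in os.listdir(search_dir)
        if os.path.isdir(os.path.join(search_dir, entry))
    )


# Demonstrate on a temporary directory with two run IDs (the second is made up).
demo_root = tempfile.mkdtemp()
for run_id in ("0f735f978246aa65aa1806299869978c", "ffffffffffffffffffffffffffffffff"):
    os.makedirs(run_directory("prosper_loan", "run0", run_id, root=demo_root))

search_dir = os.path.join(demo_root, "prosper_loan", "run0")
for run_id in list_run_ids(search_dir):
    # Each run directory is expected to hold its own fcm.log.
    print(run_id, "->", os.path.join(search_dir, run_id, "fcm.log"))
```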

## TIPS

While grid search is running, do not open the results.csv - this may force the process to terminate. However, you may still examine the files found in specific configuration directories once the configuration is done training.

We also advise that users run smaller search sets (fewer hyperparameter combinations), especially on larger datasets, as it may take a long time to train all configurations. Wait for the message “Grid search complete!” before proceeding to view the results.csv.

### Hyperparameter Descriptions

1. __dataset__: name of the dataset to be used; the dataset must be defined in `fcm_cli/src/dataset/` and be a subclass of base_dataset.py
2. __csv-path__: path to the csv file. Only needed if __dataset__ is 'csv'
3. __csv-text__: column name of the text field in the csv file. Only needed if __dataset__ is 'csv'
4. __csv-label__: column name of the label field in the csv file. Only needed if __dataset__ is 'csv'
5. __gpus__: list of GPU devices if CUDA is available; ignored if no CUDA
6. __max_threads__: max number of parallel threads to run grid search
7. __out_dir__: the directory to save grid search output files

#### Dataset parameters: hyperparameters for loading the dataset
1. __window_size__: context window size for training word embeddings
2. __min_df__: minimum document frequency of vocabulary
3. __max_df__: maximum document frequency of vocabulary
4. __vocab_size__: maximum vocabulary size

#### FCM parameters: hyperparameters for creating new FCM instances
1. __embed_dim__: size of each word/concept embedding vector
2. __nnegs__: number of negative context words to be sampled during the training of word embeddings
3. __nconcepts__: number of concepts to be extracted
4. __lam__: Dirichlet loss (L_dir) weight; the higher, the more sparse the concept distribution of each document.
5. __rho__: prediction loss (L_clf) weight; the higher, the more the model focuses on prediction accuracy.
6. __eta__: diversity loss (L_div) weight; the higher, the more the concept vectors differ from one another.
7. __inductive__: whether to use a neural network to inductively predict the concept weights of each document or to use a concept weights embedding
8. __inductive_dropout__: dropout rate of the inductive neural network
9. __hidden_size__: size of the hidden layers in the inductive neural network
10. __num_layers__: number of layers in the inductive neural network

#### Fit parameters: hyperparameters for training FCM
1. __lr__: learning rate
2. __nepochs__: number of training epochs
3. __pred_only_epochs__: number of epochs optimized with prediction loss only
4. __batch_size__: number of training examples per iteration
5. __grad_clip__: maximum gradient magnitude; gradients will be clipped within the range \[-grad_clip, grad_clip\]
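
To make the effect of __grad_clip__ concrete, the following sketch clips a list of gradient values to the range \[-grad_clip, grad_clip\]. It is a plain-Python illustration of the description above, not FCM's actual implementation, which may rely on a PyTorch utility instead; `clip_gradients` is a name introduced here.

```python
def clip_gradients(gradients, grad_clip):
    """Clip each gradient value to the range [-grad_clip, grad_clip]."""
    return [max(-grad_clip, min(grad_clip, g)) for g in gradients]


# Values outside the range are pulled back to the nearest bound.
print(clip_gradients([-5000.0, 12.5, 2500.0], grad_clip=1000))
```

Values already inside the range pass through unchanged, so a large `grad_clip` such as 1000 only intervenes when gradients become unusually large.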

## 3.3 Visualize
The __Visualize__ command generates an interactive interface where users can view all configuration results in a single grid search run and filter results on term frequency, document frequency, and FREX exclusivity.

Execute __Visualize__ by entering the command while specifying the grid search result path of a specific grid search configuration:

`fcm visualize ../grid_search/DATASET_NAME/OUT_DIR`

In the image below, `DATASET_NAME="prosper_loan"` and `OUT_DIR="test"`. Executing the command
`fcm visualize ../grid_search/prosper_loan/test`
will produce the following prompt. Copy the development server address, paste it into your web browser, and append '/viewer/' to the address to navigate to the visualization interface. The address should thus appear as: http://127.0.0.1:8000/viewer/
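
If you prefer not to edit the address by hand, a small helper along these lines can build the viewer address from whatever server address the prompt reports. `viewer_url` is a hypothetical name, and the sketch only handles the presence or absence of a trailing slash.

```python
def viewer_url(server_address):
    """Append the '/viewer/' path to a development server address."""
    return server_address.rstrip("/") + "/viewer/"


print(viewer_url("http://127.0.0.1:8000/"))
```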

![address](img/3.3_visualize_address.png)

## tf-idf Filters & FREX

The three filtering features are based on the _term frequency-inverse document frequency (tf-idf)_ mechanism which measures the importance of words in a collection of texts. The mechanics of tf-idf and FREX are explained in detail below.

### Basic Intuition & Rules of Thumb
For the sake of brevity, the general intuition is as follows:

1. We want to filter out words that are either so frequent or so rare that they are uninformative. Thus, we want to restrict the minimum and maximum of the tf and df ranges.

2. If you want to extract words that are more exclusive to a topic, increase the FREX exclusivity. Alternatively, if you want to extract more frequent words, decrease the FREX exclusivity.

### How tf-idf and FREX Work
_Term frequency (tf)_ is the number of times a term appears in a document.

_Document frequency (df)_ is the number of documents in a collection in which a term appears. Given that structural terms (e.g. articles, prepositions) tend to appear frequently in documents, such high-frequency but often less informative terms should have less weight in computing term importance. __NOTE__: The df filter in __Visualize__ only applies to obtained concept words, not the entire vocabulary from the texts. This differs from __Grid Search__ where the min-max df parameters apply to the entire vocabulary.

The _inverse document frequency (idf)_ addresses this by scaling down the importance of frequent terms and scaling up the importance of infrequent terms. Intuitively, idf will be low for words that appear in most documents and high for less frequent, context-specific words.
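
The three quantities above can be sketched on a toy corpus. The example uses `log(N / df)` for idf, which is one of several common variants and not necessarily the one FCM applies; the documents and function names are made up for illustration.

```python
import math
from collections import Counter


def term_frequency(document):
    """Count how many times each term appears in one tokenized document."""
    return Counter(document)


def document_frequency(documents):
    """Count how many documents contain each term at least once."""
    counts = Counter()
    for document in documents:
        counts.update(set(document))
    return counts


def inverse_document_frequency(documents):
    """Compute idf as log(N / df), one of several common variants."""
    n_docs = len(documents)
    return {term: math.log(n_docs / df)
            for term, df in document_frequency(documents).items()}


# Toy corpus of three tokenized documents (made up for illustration).
docs = [
    ["loan", "pay", "family", "loan"],
    ["loan", "school"],
    ["loan", "pay"],
]
tf_first = term_frequency(docs[0])
df = document_frequency(docs)
idf = inverse_document_frequency(docs)
for term in sorted(df):
    print(term, "df =", df[term], "idf = %.3f" % idf[term])
```

In this corpus "loan" appears in every document, so its idf is zero, while "family" and "school" appear in only one document each and receive the highest idf.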

In accord with tf-idf, _FREX exclusivity_ is used to find words that are frequently found in and are exclusive to a topic. FREX strikes a balance between words that occur frequently and are exclusive such that extracted words are neither too frequent to simply discuss a topic nor are they so rare that they are uninformative. FREX is higher if words are more exclusive to each topic and is lower when more frequent words are desired.
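
For intuition, the sketch below computes FREX as a weighted harmonic mean of two within-topic ranks: how frequent a word is in a topic, and how exclusive it is to that topic. This follows a formulation commonly used in the topic modeling literature. Whether FCM computes FREX exactly this way, and whether the exclusivity slider maps directly onto `weight`, are assumptions; the topic-word probabilities are made up.

```python
def ecdf(values):
    """Return, for each value, the fraction of values less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex_scores(topic_word, weight=0.5):
    """Compute FREX for every (topic, word) pair.

    topic_word[k][v] is the probability of word v under topic k.
    `weight` is the share given to exclusivity (0 = frequency only,
    1 = exclusivity only).
    """
    n_words = len(topic_word[0])
    # Total probability of each word across all topics.
    word_totals = [sum(topic[v] for topic in topic_word) for v in range(n_words)]
    scores = []
    for topic in topic_word:
        exclusivity = [topic[v] / word_totals[v] for v in range(n_words)]
        freq_rank = ecdf(topic)
        excl_rank = ecdf(exclusivity)
        # Weighted harmonic mean of the exclusivity and frequency ranks.
        scores.append([
            1.0 / (weight / excl_rank[v] + (1.0 - weight) / freq_rank[v])
            for v in range(n_words)
        ])
    return scores


# Two made-up topics over a three-word vocabulary.
topic_word = [
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
]
scores = frex_scores(topic_word, weight=0.5)
print(scores[0])
```

Raising `weight` favors words that are exclusive to a topic, while lowering it favors words that are simply frequent within the topic, which mirrors the rule of thumb given earlier.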

Results will change in real time as the user adjusts the filter sliders or updates the input boxes directly. We encourage users to experiment with the filtering features to better understand the coherence and recall of results.

The filter sliders will appear as...

![filter](img/visualize_filters.png)

... and the visualized results will appear as in the image below. λ, ρ, and η correspond to __lam__, __rho__, and __eta__ in the grid search configuration and use the same notation as the FCM paper. Coherence scores are calculated with the [Gensim library](https://radimrehurek.com/gensim/models/coherencemodel.html). Specifically, the [UMass measure](http://qpleple.com/topic-coherence-to-evaluate-topic-models/) is used. Dir __L__, div __L__, and clf __L__ correspond to Dirichlet loss (L_dir), diversity loss (L_div), and prediction loss (L_clf) in the FCM paper. Total __L__ is the sum of these losses. `Train AUC` and `Test AUC` are the AUC of the model on the training set and the test set.

![viz](img/3.3_visualization.png)


## 3.4 Implementing Your Own Dataset

You are also able to implement your own dataset by creating a Python script under fcm_cli/src/dataset which pre-processes the data for FCM. Popular data formats like comma/tab separated value (csv and tsv) are easily implemented. 

The dataset script must define a subclass of the `BaseDataset` class in base_dataset.py and implement the `load_data` method. Refer to the other datasets such as prosper_loan.py for an example. Below, we provide a basic framework based on prosper_loan.py that the dataset script should follow. For ease of implementation, feel free to copy prosper_loan.py and edit as needed.

__NOTE__: FCM currently only supports binary classification. Ensure that your labels are binary before passing the data to FCM!

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# DATA_DIR and TEST_RATIO are assumed to be defined elsewhere in the dataset module

# Read in the csv as a Pandas dataframe
x_df = pd.read_csv(os.path.join(DATA_DIR, "prosper_loan.csv"))

# Fill any empty fields in the text column and create a list of the texts
x_df['Description'] = x_df['Description'].fillna('')
documents = x_df['Description'].tolist()

# Create numpy array of labels as binary variables
labels = x_df['LoanStatus'].isin(['Paid', 'Defaulted (PaidInFull)', 'Defaulted (SettledInFull)']).astype(int).to_numpy()

# Drop the X and Y variables along with any extra columns and create a numpy array for remaining explanatory variables
x_df = x_df.drop(columns=['Key', 'LoanStatus', 'Description'])
expvars = x_df.to_numpy()

# Create the training and test sets. Test ratio is 0.15 by default
doc_train, doc_test, y_train, y_test, expvars_train, expvars_test = train_test_split(
    documents, labels, expvars, test_size=TEST_RATIO)
```

```python
from sklearn.feature_extraction.text import CountVectorizer

# BaseDataset, tokenize, MIN_DF, and MAX_DF are provided by the fcm_cli source;
# import them the same way prosper_loan.py does.

# Use this exact code to subclass BaseDataset and implement load_data; only change the dataset name as indicated by the notes
class ProsperDataset(BaseDataset): # NOTE: change dataset name
    def __init__(self):
        super().__init__(doc_train, doc_test, y_train, y_test, expvars_train, expvars_test)

    def get_data_filename(self, params):
        window_size = params["window_size"]  # context window size
        vocab_size = params["vocab_size"]  # max vocabulary size
        min_df = params.get("min_df", MIN_DF)  # min document frequency of vocabulary, defaults to MIN_DF
        max_df = params.get("max_df", MAX_DF)  # max document frequency of vocabulary, defaults to MAX_DF
        return os.path.join(DATA_DIR, "prosper_w%d_v%d_min%.0E_max%.0E.pkl" % (window_size, vocab_size, min_df, max_df)) # NOTE: change dataset name

    def load_data(self, params):
        print("Loading data...")
        window_size = params["window_size"]  # context window size
        vocab_size = params["vocab_size"]  # max vocabulary size
        min_df = params.get("min_df", MIN_DF)  # min document frequency of vocabulary, defaults to MIN_DF
        max_df = params.get("max_df", MAX_DF)  # max document frequency of vocabulary, defaults to MAX_DF

        vectorizer = CountVectorizer(tokenizer=tokenize, stop_words='english', min_df=min_df, max_df=max_df,
                                     max_features=vocab_size)
        return self.get_data_dict(self.get_data_filename(params), vectorizer, window_size)
```
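
Because FCM currently supports only binary classification, it may be worth validating the labels before handing them to the dataset class. The check below is a minimal sketch; `check_binary_labels` is a hypothetical helper and is not part of FCM.

```python
def check_binary_labels(labels):
    """Raise ValueError unless every label is 0 or 1."""
    observed = set(labels)
    if not observed <= {0, 1}:
        raise ValueError("Labels must be 0 or 1, found: %s" % sorted(observed))
    return True


# A valid label vector passes silently.
check_binary_labels([0, 1, 1, 0])
```

In a dataset script, it could be called on the `labels` array right before the train/test split.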

# 4 Demonstration

In this section, we walk through a demonstration of FCM CLI on a peer-to-peer lending dataset titled prosper_loan within the repository.

Before proceeding, ensure that this iPython Notebook is located in the fcm_cli base directory. This is required for the following commands to run.

## 4.1 Dataset

The data contains all loan requests that were funded on the crowdfunding platform Prosper between April 2007 and October 2008, consisting of 122,479 total listings. The data includes (1) a loan application description (X-variable), (2) relevant personal information about the applicant and loan (credit grade, lender rate, etc.) (explanatory variables), and (3) a binary label indicating paid loans (Y-variable).

See prosper_loan.csv in fcm_cli/data/prosper_loan for the raw data.

## 4.2 Initial Setup

If you haven't set up FCM CLI yet, run the setup commands to create a virtual environment, activate it, navigate to the appropriate directory, and install the requisite packages. Execute the commands below.

    # Create virtual environment in base directory
    !python3 -m venv fcm

    # Activate virtual environment
    !source fcm/bin/activate

    # Navigate to src
    %cd src

    # Install FCM CLI in the virtual environment. This will take several minutes the first time you run it
    !pip install --editable .

Alternatively, if you run into any issues with the setup in this notebook, you may run the commands in Command Line/terminal.

Also, you can set up your Google Colab notebook as explained in Section 2.3 and follow along. We also provide fcm_cli_colab.ipynb which you can upload to Colab and begin working immediately.

## 4.3 Data Exploration: Setting Up Grid Search

As our first exploratory exercise, we would like to run __Grid Search__ over a broad search space to incrementally and exhaustively test many combinations of parameters and evaluate the resulting concepts.

We are looking for two specific characteristics within results as judged by the user: coherence and recall (i.e. human-understandable concepts that are distinct from one another).

Open up prosper_config.json in either Notepad or your IDE and edit the parameters as desired or simply run the default configurations (seen below). We specify the output directory as 'test'.

    {
        "dataset": "prosper_loan",
        "gpus": [0],
        "max_threads": 1,
        "out_dir": "test",
        "dataset_params" : {
            "window_size": [2, 4],
            "vocab_size": [5000],
            "min_df": [0.01, 0.001],
            "max_df": [0.4, 0.6, 0.8]
        },
        "fcm_params": {
            "inductive": [true],
            "inductive_dropout": [0.01, 0.1, 0.2],
            "embed_dim": [50, 200],
            "hidden_size": [100],
            "num_layers": [1],
            "nnegs": [10, 20],
            "nconcepts": [7],
            "lam": [1, 10, 20, 100],
            "rho": [10, 100, 1000],
            "eta": [1, 10, 100, 500]
        },
        "fit_params": {
            "lr": [0.01],
            "nepochs": [30],
            "pred_only_epochs": [0, 5, 10],
            "batch_size": [5000],
            "grad_clip": [1000]
        }
    }

Then, run the following command in command line. Grid search will begin loading in the data and training. Please wait until the process has finished before proceeding.

    !fcm grid-search ../configs/prosper_config.json
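
Before launching a search of this size, it may help to estimate how many configurations it contains. The sketch below multiplies the number of candidate values of every hyperparameter in the default configuration. It assumes the search space is the full Cartesian product across `dataset_params`, `fcm_params`, and `fit_params`; the count FCM CLI reports at startup is the authoritative one. `count_combinations` is a name introduced here.

```python
from functools import reduce
from operator import mul

# Search space copied from the default prosper_config.json.
config = {
    "dataset_params": {
        "window_size": [2, 4],
        "vocab_size": [5000],
        "min_df": [0.01, 0.001],
        "max_df": [0.4, 0.6, 0.8],
    },
    "fcm_params": {
        "inductive": [True],
        "inductive_dropout": [0.01, 0.1, 0.2],
        "embed_dim": [50, 200],
        "hidden_size": [100],
        "num_layers": [1],
        "nnegs": [10, 20],
        "nconcepts": [7],
        "lam": [1, 10, 20, 100],
        "rho": [10, 100, 1000],
        "eta": [1, 10, 100, 500],
    },
    "fit_params": {
        "lr": [0.01],
        "nepochs": [30],
        "pred_only_epochs": [0, 5, 10],
        "batch_size": [5000],
        "grad_clip": [1000],
    },
}


def count_combinations(config, groups=("dataset_params", "fcm_params", "fit_params")):
    """Multiply the number of candidate values of every hyperparameter."""
    sizes = [len(values) for group in groups for values in config[group].values()]
    return reduce(mul, sizes, 1)


print(count_combinations(config))
```

Under that assumption the default search space contains 12 × 576 × 3 = 20,736 combinations, which illustrates why smaller search sets are advisable on larger datasets.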

## 4.4 Evaluating Results

Once the grid search is completed, we can now proceed to view the results. If interested, review the results.csv in the grid search directory to review performance metrics of each hyperparameter configuration in the search space.

To review results, we will use the __Visualize__ feature. Run the command in command line:

    !fcm visualize ../grid_search/prosper_loan/test

Copy the address to the development server and paste it into your web browser search bar. Remember to append '/viewer/' to the end of the address. The input should appear as (or similar to) 'http://127.0.0.1:8000/viewer/'.

The visualization interface will appear, showing results from each of the configurations in the search space. In the example below, we expand the concept word set to gain a broader understanding of the concepts found in the data. By default, FCM extracts 10 words per concept; expanding the word set is currently not available as a feature but will be implemented in a future update. Coherence scores use the [UMass measure](http://qpleple.com/topic-coherence-to-evaluate-topic-models/).

![viz](img/3.3_visualization.png)

Upon closer observation, many of the topic results seen above are generally uninformative as we cannot intuitively describe the concepts based on the words extracted.

Note that in these configurations, topics appear to overlap and concept-describing words are often too scattered to describe a single coherent idea. These words cannot clearly or easily inform us about potential constructs to investigate.

## 4.5 Hyperparameter Tuning
To address the above issues, we must perform hyperparameter tuning. Recall that FCM utilizes the __tf-idf__ mechanism. In order to extract coherent, distinct concepts, we must adjust __tf__ and __df__.

Filtering can be performed at two stages: __Grid Search__ and __Visualize__. While __Grid Search__ can help obtain better baseline results, we recommend using the __Visualize__ interface to filter and obtain good results from the baseline search space produced by __Grid Search__.

__Tf__ and __df__ of concept words can be filtered in the visualization interface via the filtering slider. __Df__ of the entire text vocabulary can be adjusted in __Grid Search__.



To identify suitable df parameters across the entire text vocabulary, we run __Grid Search__ to test different minimum and maximum df parameters in the configuration file, holding all else equal. Then, we use the __Visualize__ feature to manually evaluate the concept results from the broad exploratory set, adjusting the filters and observing the effect on concept coherence and recall. Once we identify several informative configurations, we will use the df parameters associated with those configurations in subsequent __Grid Search__ runs.

For example, we set the minimum and maximum df parameters in the configuration file to the following:

    "min_df": [0.01, 0.1, 0.2, 0.3],
    "max_df": [0.4, 0.6, 0.8, 0.9]

__Remember__: the df parameter in __Grid Search__ applies to the entire text vocabulary while the df filter in __Visualize__ only applies to the extracted concept words.

With this specific dataset, we find that constraining the df range slightly (0.1 to 0.8, for example) in the configuration file greatly improves the recall and coherence within and across concepts as evaluated by human judgement, i.e. concepts tend to be more informative and distinct from one another.
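
To see what a df range does to a vocabulary, the sketch below keeps only the terms whose share of documents falls within `[min_df, max_df]`. It mirrors the proportion-based meaning of fractional `min_df` and `max_df` values, but it is a standalone illustration on a made-up corpus rather than FCM's own preprocessing; `filter_vocabulary` is a name introduced here.

```python
from collections import Counter


def filter_vocabulary(documents, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(documents)
    doc_counts = Counter()
    for document in documents:
        # Count each term once per document, regardless of repetitions.
        doc_counts.update(set(document))
    return sorted(
        term for term, count in doc_counts.items()
        if min_df <= count / n_docs <= max_df
    )


# Made-up corpus of ten tiny tokenized documents.
docs = [["loan", "family"]] + [["loan", "pay"]] * 4 + [["loan"]] * 5
print(filter_vocabulary(docs, min_df=0.2, max_df=0.8))
```

Here "loan" appears in every document and "family" in only one of ten, so both fall outside the 0.2 to 0.8 range and only "pay" survives.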

After evaluating concept coherence from the initial exploratory configurations, we can systematically test other parameter configurations via __Grid Search__ holding the user-selected df parameters constant to find suitable hyperparameters that produce informative concepts.

This process involves re-running __Grid Search__ numerous times while testing a range of values one parameter at a time holding all else equal. Review the results and identify the specific configurations that produce better results as judged by the user. Use those parameters and iterate over the next parameter.
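
One way to organize this one-parameter-at-a-time procedure is to derive each search from a fixed baseline configuration. The sketch below copies a baseline, replaces a single hyperparameter with a list of candidate values, and assigns a new output directory. `vary_one_parameter` and the output directory name are hypothetical, and the baseline shows only a few keys; a real configuration file would need every key described in Section 3.2.

```python
import copy
import json


def vary_one_parameter(base_config, group, name, values, out_dir):
    """Return a copy of base_config that searches only over one parameter."""
    config = copy.deepcopy(base_config)
    config[group][name] = list(values)
    config["out_dir"] = out_dir
    return config


# Baseline with a single value per hyperparameter (abbreviated for the sketch).
base_config = {
    "dataset": "prosper_loan",
    "out_dir": "test",
    "dataset_params": {"min_df": [0.1], "max_df": [0.8]},
    "fcm_params": {"lam": [10], "rho": [100], "eta": [10]},
}

lam_search = vary_one_parameter(
    base_config, "fcm_params", "lam", [1, 10, 20, 100], out_dir="run_lam")
print(json.dumps(lam_search, indent=4))
```

Saving the printed JSON as a configuration file and repeating the step for the next parameter keeps every other value fixed, while the separate output directory keeps each search's results apart.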

Following these steps will help build intuition as to what parameters work well. Once you have a grasp on effective hyperparameters, you can run focused search spaces with narrow parameter sets to extract more informative results.

As an example, here is a sample result from a poorly tuned run with arbitrary hyperparameters.

![gs_bad](img/4.5_gs_bad.png)

Notice how the concept words tend to overlap and generally express similar ideas.

Next, with well-tuned df hyperparameters, holding all else equal, we obtain more informative baseline results as follows:

![gs_good](img/4.5_gs_good.png)

## 4.6 Filtering Results

Once you have made the appropriate adjustments to the configuration file, rerun __Grid Search__, now specifying a different output directory. In this example, we use 'run0' instead of 'test' as our directory. Execute the __Grid Search__ command again:

    !fcm grid-search ../configs/prosper_config.json

Result directories will be generated in accordance with the new output directory that you specify.

Now, we can visualize our new results with the command:

    !fcm visualize ../grid_search/prosper_loan/run0

Follow the same procedure to navigate to the __Visualize__ interface.

Again, experiment with the filtering options in the visualization interface until you identify interesting patterns. The process is iterative and experimental in nature - keep adjusting parameters and filters to find coherent, distinct concepts!

After several iterations of this process, we arrive at a __Grid Search__ space that highlights potential constructs that we can reasonably hypothesize are correlated with loan default. Notice how the concept-describing words are generally understandable and distinct and, upon closer observation, can highlight intuitive constructs like mentions of family and personal expenses.

In subsequent __Grid Search__ runs, we now have a baseline understanding of what constructs may be relevant and can substantiate our hypothesized concepts. Keep these in mind as you iterate through search spaces.

![final](img/4.6_prosper_vis.png)

In another search space, we identify one configuration using the filtering features in __Visualize__ that highlights particularly coherent and distinct concepts like family, school and employment, personal expenses, etc.

Without filtering, we observe the following concepts still with some degree of overlap:

![bad](img/4.6_vis_bad.png)

Once we apply filters, we obtain the following results which are much cleaner and more informative. We manually reorder the words to highlight the top words that suggest interesting constructs to investigate further.

Note that FCM will not automatically reorder words in this manner.

![good](img/4.6_prosper_final.png)

These results pose an interesting hypothesis and can serve as the basis for further causal investigation. With these results, a manager or researcher can exercise domain expertise and apply involved causal inference methods to validate the hypothesis and ultimately extract business-relevant insights.

# 5 Closing Comments

With a methodical approach to data exploration and results filtering, we successfully identify several interesting constructs from our test dataset for further causal investigation.

Given that managers and researchers are often faced with sifting through volumes of unstructured data to extract patterns, FCM CLI can be an essential tool to obtain data-driven insights that users can inspect further to identify meaningful vs. spurious outcomes. A properly tuned FCM CLI can help you extract coherent concepts and build intuition towards insights about real-world phenomena.

Thank you for using FCM CLI! Please reach out to the team with any inquiries, feedback, or suggestions.