In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Notebook Settings

To display PyCaret's interactive elements in the Google Colab environment, run the following code.

In [2]:
from pycaret.utils import enable_colab

enable_colab()

Colab mode enabled.


## Loading the Dataset

This notebook uses data from **Kiva Microfunds**.  
**Kiva Microfunds** is a non-profit organization that enables individuals to finance low-income entrepreneurs and students around the world through crowdfunding.  
The data for each Kiva loan includes information such as gender and location, as well as the borrower's story.  
In this project, we will use natural language processing to analyze the text of the borrowers' stories.  
The dataset contains 6,818 samples.  
The features in the dataset are described below.

* **country**: borrower's country
* **en**: borrower's personal story
* **gender**: sex (M = male, F = female)
* **loan_amount**: amount of the approved loan
* **nonpayment**: type of lender (lender = individual user registered with Kiva, partner = microfinance institution working with Kiva)
* **sector**: sector of the borrower
* **status**: status of the loan (1 = defaulted, 0 = repaid)
 
Only the **en** column is used in this notebook.

https://www.kiva.org/ 

In [3]:
from pycaret.datasets import get_data

data = get_data('kiva')
print(f"Datasets Shape : {data.shape}")

Unnamed: 0,country,en,gender,loan_amount,nonpayment,sector,status
0,Dominican Republic,"""Banco Esperanza"" is a group of 10 women looki...",F,1225,partner,Retail,0
1,Dominican Republic,"""Caminemos Hacia Adelante"" or ""Walking Forward...",F,1975,lender,Clothing,0
2,Dominican Republic,"""Creciendo Por La Union"" is a group of 10 peop...",F,2175,partner,Clothing,0
3,Dominican Republic,"""Cristo Vive"" (""Christ lives"" is a group of 10...",F,1425,partner,Clothing,0
4,Dominican Republic,"""Cristo Vive"" is a large group of 35 people, 2...",F,4025,partner,Food,0


Datasets Shape : (6818, 7)


## Environment Configuration

Set up the **PyCaret** environment.  
Here, the `custom_stopwords` parameter is used to set the stopwords.  
**Stopwords** are words that are too common to help characterize a topic.  
In the example below, the stopwords are the high-frequency words obtained from "03-NLP_Corpus_KivaMicrofunds.ipynb".  
These words occur so often in the text that they add more noise than information.  
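
The referenced notebook is not reproduced here, so the following is only a minimal sketch of how high-frequency stopword candidates could be found. It assumes plain whitespace tokenization and uses toy sentences in place of the borrower stories; the helper name `top_frequent_words` is hypothetical and is not part of PyCaret.

```python
from collections import Counter


def top_frequent_words(documents, n=5):
    """Return the n most frequent lowercase tokens across all documents."""
    counts = Counter()
    for text in documents:
        # Naive tokenization: lowercase and split on whitespace.
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]


# Toy documents standing in for the borrower stories (illustrative only).
toy_docs = [
    "loan to buy rice seed",
    "loan to buy fabric",
    "loan for school fees",
]

candidates = top_frequent_words(toy_docs, n=2)
print(candidates)
```

Words that rank at the top of such a count are candidates only; whether a word is noise or signal still needs to be judged by reading the topics.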

In [4]:
from pycaret.nlp import setup

stopwords = ['loan', 'income', 'usd', 'many', 'also', 'make', 'business', 'buy', 'sell',
             'purchase', 'year', 'people', 'able', 'enable', 'old', 'woman', 'child', 'school']

clf = setup(data=data, target='en', session_id=123, custom_stopwords=stopwords)

Description,Value
session_id,123
Documents,6818
Vocab Size,10830
Custom Stopwords,True


## Model Building

Create and train a **Latent Dirichlet Allocation (LDA)** model, which is one type of **topic model**.  

In [5]:
from pycaret.nlp import create_model

lda = create_model('lda', num_topics=5, multi_core=True)

Pass `plot='topic_distribution'` to `plot_model()` to check the number of samples assigned to each topic.

In [6]:
from pycaret.nlp import plot_model

plot_model(lda, plot='topic_distribution')
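
The count shown in this kind of plot can be sketched in plain Python. The sketch assumes that each document is assigned to the topic with the largest weight; the weights below are toy values, and `dominant_topic_counts` is a hypothetical helper, not a PyCaret function.

```python
from collections import Counter


def dominant_topic_counts(doc_topic_weights):
    """Count documents per topic, assigning each document to its highest-weight topic."""
    counts = Counter()
    for weights in doc_topic_weights:
        # Index of the largest weight is treated as the document's topic.
        dominant = max(range(len(weights)), key=lambda i: weights[i])
        counts[dominant] += 1
    return dict(counts)


# Toy per-document topic weights for a 3-topic model (illustrative only).
toy_weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
]

toy_counts = dominant_topic_counts(toy_weights)
print(toy_counts)
```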

## Optimizing Hyperparameters

The `tune_model()` function optimizes the number of topics.  
In the following code, the number of topics is chosen to maximize **Coherence**, a commonly used evaluation metric for topic models.  
**Coherence** indicates how consistent the words in a topic are and how easy the topic is for humans to understand.

For more information on how **Coherence** is calculated, refer to the following reference, among other sources.

- https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
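
To make the idea of coherence concrete, here is a minimal sketch of one simple variant, the UMass measure, which rewards topic words that appear together in the same documents. This is a swapped-in measure chosen because it fits in a few lines; it is not necessarily the variant `tune_model()` computes, so its values are not comparable to the scores reported in this notebook. The function name and toy data are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic over tokenized documents.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            # Smoothed log ratio of co-occurrence to single-word frequency.
            score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score


# Toy tokenized documents (illustrative only).
toy_docs = [
    ["rice", "farm", "seed"],
    ["rice", "farm", "harvest"],
    ["shop", "clothes", "stock"],
    ["shop", "clothes", "fabric"],
]

coherent_score = umass_coherence(["rice", "farm"], toy_docs)
mixed_score = umass_coherence(["rice", "shop"], toy_docs)
print(coherent_score > mixed_score)
```

Words that co-occur ("rice", "farm") score higher than words that never share a document ("rice", "shop"), which matches the intuition that a coherent topic groups related words.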

The `custom_grid` parameter sets the candidate numbers of topics to try.  
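
Conceptually, tuning over a grid means scoring each candidate and keeping the best one. The sketch below shows only that selection logic. The scoring function is a toy placeholder rather than a real coherence computation, and both helper names are hypothetical.

```python
def pick_best_num_topics(candidates, score_fn):
    """Score every candidate topic count and return the best one with all scores."""
    scores = {k: score_fn(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def toy_score(num_topics):
    # Placeholder only: peaks at 5 topics. A real run would train a model
    # with num_topics topics and return its coherence instead.
    return -abs(num_topics - 5)


best_k, all_scores = pick_best_num_topics([2, 3, 4, 5, 6, 7, 8], toy_score)
print(best_k)
```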

In [7]:
from pycaret.nlp import tune_model

tuned_lda = tune_model(model='lda', multi_core=True, custom_grid=[2, 3, 4, 5, 6, 7, 8])

Best Model: Latent Dirichlet Allocation | # Topics: 7 | Coherence: 0.5229


Passing `plot='topic_distribution'` shows a graph of the number of samples assigned to each topic.  
Hover over the bar for each topic to check that its keywords are appropriate.  

In [8]:
plot_model(tuned_lda, plot='topic_distribution')

Use **t-SNE (t-distributed Stochastic Neighbor Embedding)** to visualize the data in three dimensions.  

In [9]:
plot_model(tuned_lda, plot='tsne')

## Saving and Loading Models

The `save_model` function allows you to save the model.

In [10]:
from pycaret.nlp import save_model

save_model(tuned_lda, 'lda_model')

Model Succesfully Saved


(<gensim.models.ldamulticore.LdaMulticore at 0x7ffb6ab6ab20>, 'lda_model.pkl')

The `load_model` function allows you to load a saved model.

In [11]:
from pycaret.nlp import load_model

loaded_lda = load_model('lda_model')

Model Sucessfully Loaded


## (Supplementary) Classification using labels

The dataset is labeled by the **status** column.  
A status of 1 means the loan defaulted, while 0 means it was repaid.  
The `tune_model()` function can also train a supervised model with these labels as the ground truth and use its performance to determine the optimal number of topics.  
Since this is a classification problem, the metric used for optimization is Accuracy (the percentage of correct predictions).  

In the following code, `supervised_target='status'` specifies the column that contains the labels.  
Also, `estimator='lr'` selects logistic regression as the classifier.  

For more details on how to use the `tune_model()` function, refer to the official documentation below.

- https://pycaret.org/tune-model/
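
Accuracy, the optimization metric in this section, is the fraction of predictions that match the true labels. A minimal sketch with toy labels follows; the `accuracy` helper is hypothetical and only illustrates the definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels in the same encoding as the status column (1 = defaulted, 0 = repaid).
toy_true = [0, 0, 1, 0]
toy_pred = [0, 0, 0, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```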

In [12]:
tuned_classification = tune_model(model='lda',
                                  multi_core=True,
                                  supervised_target='status',
                                  estimator='lr',
                                  custom_grid=[2, 3, 4, 5, 6, 7, 8])

Best Model: Latent Dirichlet Allocation | # Topics: 3 | Accuracy : 0.8612


Pass `plot='topic_distribution'` to check the number of samples assigned to each topic in the graph.

In [13]:
plot_model(tuned_classification, plot='topic_distribution')

Visualize the data in three dimensions with **t-SNE**.

In [14]:
plot_model(tuned_classification, plot='tsne')