In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Notebook Setting

To display PyCaret's interactive elements in Google Colab, run the following code.

In [2]:
from pycaret.utils import enable_colab

enable_colab()

Colab mode enabled.


## Dataset Loading

In this notebook, we use data from **Kiva Microfunds**.  
**Kiva Microfunds** is a non-profit organization that enables individuals to finance low-income entrepreneurs and students around the world through crowdfunding.  
The data for each Kiva loan includes information such as gender and location, as well as the borrower's story.  
In this project, we use natural language processing to analyze the text of the borrowers' stories.  
The dataset contains 6,818 samples.  
The features in the dataset are described below.

* **country**: Borrower's country
* **en**: Borrower's personal story
* **gender**: Sex (M=male, F=female)
* **loan_amount**: Amount of the approved loan
* **nonpayment**: Type of lender (Lender = individual user registered with Kiva, Partner = microfinance institution working with Kiva)
* **sector**: Sector of the borrower
* **status**: Status of the loan (1 = defaulted, 0 = repaid)
 
In this notebook, we use only the **en** column.

https://www.kiva.org/ 

In [3]:
from pycaret.datasets import get_data

data = get_data("kiva")
print(f"Datasets Shape : {data.shape}")

Unnamed: 0,country,en,gender,loan_amount,nonpayment,sector,status
0,Dominican Republic,"""Banco Esperanza"" is a group of 10 women looki...",F,1225,partner,Retail,0
1,Dominican Republic,"""Caminemos Hacia Adelante"" or ""Walking Forward...",F,1975,lender,Clothing,0
2,Dominican Republic,"""Creciendo Por La Union"" is a group of 10 peop...",F,2175,partner,Clothing,0
3,Dominican Republic,"""Cristo Vive"" (""Christ lives"" is a group of 10...",F,1425,partner,Clothing,0
4,Dominican Republic,"""Cristo Vive"" is a large group of 35 people, 2...",F,4025,partner,Food,0


Datasets Shape : (6818, 7)


## Environment Configuration

Set up the **PyCaret** environment.  
The `setup` function in `pycaret.nlp` initializes the **PyCaret** environment and must be called before any other **PyCaret** function.  

When calling it, specify the text column to analyze with the `target` parameter.

In [4]:
from pycaret.nlp import setup

clf = setup(data=data, target="en", session_id=123)

Description,Value
session_id,123
Documents,6818
Vocab Size,10754
Custom Stopwords,False


## Model Building

The `models` function allows you to see all available machine learning models.

In [5]:
from pycaret.nlp import models

models()

ID,Name,Reference
lda,Latent Dirichlet Allocation,gensim/models/ldamodel
lsi,Latent Semantic Indexing,gensim/models/lsimodel
hdp,Hierarchical Dirichlet Process,gensim/models/hdpmodel
rp,Random Projections,gensim/models/rpmodel
nmf,Non-Negative Matrix Factorization,sklearn.decomposition.NMF


A **topic model** is a kind of statistical model used in machine learning and natural language processing to discover abstract **topics** that appear in a collection of documents.

The `create_model` function trains a topic model.  
In the following code, we create and train a **Latent Dirichlet Allocation (LDA)** model, one of the available topic models.  
The number of topics is set with `num_topics`; if it is omitted, the default is **4**. Here we set it to 5.

In [6]:
from pycaret.nlp import create_model

lda = create_model('lda', num_topics=5, multi_core=False)

Print the trained model to display an overview of it.  

In [7]:
print(lda)

LdaModel(num_terms=10754, num_topics=5, decay=0.5, chunksize=100)


> The meaning of each hyperparameter is explained in the official `pycaret.nlp` documentation.
> https://pycaret.readthedocs.io/en/latest/api/nlp.html#pycaret.nlp.create_model

## Assigning Labels

The `assign_model` function returns the dataset with the weight of each topic in each document added as new columns, together with the dominant topic and its weight.

In [8]:
from pycaret.nlp import assign_model

assigned_lda = assign_model(lda)
assigned_lda.head()

Unnamed: 0,country,en,gender,loan_amount,nonpayment,sector,status,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Dominant_Topic,Perc_Dominant_Topic
0,Dominican Republic,group woman look receive small loan take small...,F,1225,partner,Retail,0,0.012978,0.041189,0.000998,0.890905,0.05393,Topic 3,0.89
1,Dominican Republic,walk forward group entrepreneur seek second lo...,F,1975,lender,Clothing,0,0.059633,0.033273,0.067947,0.739733,0.099415,Topic 3,0.74
2,Dominican Republic,group people hope start business group look re...,F,2175,partner,Clothing,0,0.01184,0.031284,0.001226,0.908427,0.047223,Topic 3,0.91
3,Dominican Republic,vive live group woman look receive loan young ...,F,1425,partner,Clothing,0,0.010091,0.041024,0.00103,0.898563,0.049292,Topic 3,0.9
4,Dominican Republic,vive large group people hope take loan many se...,F,4025,partner,Food,0,0.022655,0.041435,0.000964,0.90174,0.033206,Topic 3,0.9
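
The relationship between the `Topic_N` columns and the two summary columns can be checked by hand. The sketch below assumes that `Dominant_Topic` is the topic with the largest weight and that `Perc_Dominant_Topic` is that weight rounded to two decimals. This is inferred from the rows shown above, not taken from the PyCaret source. The weights are copied from row 0, and `dominant_topic` is a hypothetical helper name.

```python
# Topic weights copied from row 0 of the assign_model output above.
row_weights = {
    "Topic_0": 0.012978,
    "Topic_1": 0.041189,
    "Topic_2": 0.000998,
    "Topic_3": 0.890905,
    "Topic_4": 0.05393,
}


def dominant_topic(weights):
    """Return (label, rounded weight) for the largest topic weight.

    Assumption: the dominant topic is the arg-max over the topic weights,
    and its share is that weight rounded to two decimals.
    """
    column, weight = max(weights.items(), key=lambda item: item[1])
    label = column.replace("_", " ")  # "Topic_3" -> "Topic 3"
    return label, round(weight, 2)


label, share = dominant_topic(row_weights)
print(label, share)

# The weights of a single document add up to roughly 1.
print(round(sum(row_weights.values()), 4))
```

Under these assumptions, the result agrees with the `Topic 3` and `0.89` values shown for row 0.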


## Model Evaluation

The `plot_model` function allows you to analyze the entire corpus or a specific topic from various angles.

- https://pycaret.readthedocs.io/en/latest/api/nlp.html#pycaret.nlp.plot_model

If `plot_model` is called without arguments, it plots the number of occurrences of each word across the entire corpus.

In [9]:
from pycaret.nlp import plot_model

plot_model()
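
As a rough illustration of what a word-frequency plot counts, the sketch below tallies tokens with the standard library. It is not PyCaret's implementation. The two sample strings are the truncated, preprocessed fragments visible in the `assign_model` output earlier in this notebook, so they stand in for full documents only for illustration.

```python
from collections import Counter

# Truncated, preprocessed text fragments from the assign_model output
# (rows 0 and 3). They are illustrative samples, not complete documents.
docs = [
    "group woman look receive small loan take small",
    "vive live group woman look receive loan young",
]

# Count how often each token occurs across all documents.
word_counts = Counter(token for doc in docs for token in doc.split())

for word, count in word_counts.most_common(5):
    print(f"{word}: {count}")
```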

If you set `plot='bigram'`, the graph shows the number of occurrences of each pair of consecutive words (bigram).

In [10]:
plot_model(plot='bigram')
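
A bigram is a pair of adjacent tokens. The following minimal sketch counts bigrams with the standard library, assuming whitespace tokenization and that pairs are not formed across document boundaries. How PyCaret builds its bigram plot internally may differ. The sample strings are the truncated fragments visible in the `assign_model` output earlier in this notebook.

```python
from collections import Counter

# Truncated, preprocessed text fragments (illustrative samples only).
docs = [
    "group woman look receive small loan take small",
    "vive live group woman look receive loan young",
]

bigram_counts = Counter()
for doc in docs:
    tokens = doc.split()
    # Pair each token with its successor inside the same document.
    bigram_counts.update(zip(tokens, tokens[1:]))

for (first, second), count in bigram_counts.most_common(3):
    print(f"{first} {second}: {count}")
```

Counting per document keeps the last word of one story from being paired with the first word of the next.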

By passing the trained model and a topic name in `topic_num`, you can analyze a specific topic.  
The following code plots word frequencies for the topic named **Topic 1**.

In [11]:
plot_model(lda, topic_num='Topic 1')

If you set `plot="topic_distribution"`, the graph shows the number of samples assigned to each topic.  
Hovering the cursor over a topic's bar displays the keywords for that topic.

In [12]:
plot_model(lda, plot="topic_distribution")
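
The quantity behind a topic distribution plot is a tally of documents per dominant topic. The sketch below shows that tally with a text bar chart. The labels in `dominant_topics` are hypothetical values made up for illustration and are not results from the Kiva data.

```python
from collections import Counter

# Hypothetical Dominant_Topic labels for eight documents
# (illustrative only, not taken from the Kiva results).
dominant_topics = [
    "Topic 3", "Topic 3", "Topic 1", "Topic 0",
    "Topic 3", "Topic 1", "Topic 4", "Topic 3",
]

# Number of documents assigned to each topic.
topic_counts = Counter(dominant_topics)

for topic in sorted(topic_counts):
    count = topic_counts[topic]
    print(f"{topic}: {'#' * count} ({count})")
```

On the real data, a similar tally should be obtainable from the `Dominant_Topic` column of `assigned_lda`, for example with the pandas `value_counts` method.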

**t-SNE (t-distributed Stochastic Neighbor Embedding)** is a dimensionality reduction algorithm that maps high-dimensional data to a low-dimensional space for visualization.  
As shown in the code below, setting `plot="tsne"` visualizes the data using t-SNE.  
By default, the data is reduced to 3 dimensions.  

In [13]:
plot_model(lda, plot="tsne")