frucci/WordsClusterBySynonyms

WordsClusterBySynonyms

Words clustering using synonyms

This class clusters words using the synonyms available through NLTK. Let's see an example.

import pandas as pd
import WordsClusterBySynonyms as wcbs

In this case we decided to use a list of Italian verbs.

verbs = [
    'cogliere', 'intagliare', 'ragguagliare', 'dilazionare', 'tuffare',
    'dissipare', 'indisporre', 'complottare', 'contraddire', 'sconoscere',
    'sgocciolare', 'ridimensionare', 'ammansire', 'stuzzicare', 'rintuzzare',
    ...
    'autenticare', 'programmare', 'assassinare', 'immalinconire', 'esalare',
    'istigare', 'abiurare', 'curare', 'tranciare', 'tracciare', 'vagolare',
    'raddolcire', 'sfinire', 'confrontare', 'indispettire','fare','avere','vivere'
]
verbs = pd.DataFrame(verbs)
verbs.columns = ['verbs']

WordsClusterBySynonyms requires a dataframe, the name of the target column, and the language.

The first function in WordsClusterBySynonyms is get_synonyms_pandas. It generates the synonyms for each word in the dataframe and stores them in a new column.

wc = wcbs.WordsClusterBySynonyms(verbs, 'verbs', lang='ita')
df = wc.get_synonyms_pandas()
wc.plot_hist(df)

hist_all.jpg
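To give an idea of what a synonym lookup through NLTK can look like, here is a minimal sketch. It is not the code used inside get_synonyms_pandas: the helper names wordnet_synonyms and add_synonyms are hypothetical, and the sketch assumes that the NLTK 'wordnet' and 'omw-1.4' corpora have been downloaded so that lang='ita' is available.

```python
def wordnet_synonyms(word, lang="ita"):
    """Collect the lemma names of every WordNet synset of `word` (hypothetical helper)."""
    # Imported lazily: needs NLTK plus the 'wordnet' and 'omw-1.4' corpora.
    from nltk.corpus import wordnet as wn

    names = set()
    for synset in wn.synsets(word, lang=lang):
        names.update(synset.lemma_names(lang))
    names.discard(word)  # a word is not counted as its own synonym here (assumption)
    return sorted(names)


def add_synonyms(words, lookup=wordnet_synonyms):
    """Map each word to its list of synonyms, using `lookup` to fetch them."""
    return {word: lookup(word) for word in words}
```

The lookup is passed in as an argument so that a different synonym source can be swapped in without touching the rest of the pipeline.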

Using set_threshold you can repeat get_synonyms_pandas with a threshold on the number of synonyms for each word.

df = wc.set_threshold(20, df)

Using plot_hist you can check whether your list contains words with a huge number of associated synonyms. Such words are a problem because, with our definition of distance, they tend to create a few huge clusters.

wc.plot_hist(df)

hist_no_higher.jpg
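The effect of a threshold can be sketched in plain Python. The README does not say whether set_threshold drops words above the threshold or truncates their synonym lists; this sketch assumes the former, treats the threshold as inclusive, and uses a hypothetical function name and made-up placeholder synonyms.

```python
def split_by_synonym_count(synonyms, threshold):
    """Separate words with at most `threshold` synonyms from the rest (hypothetical helper)."""
    kept = {w: s for w, s in synonyms.items() if len(s) <= threshold}
    dropped = sorted(w for w in synonyms if w not in kept)
    return kept, dropped


# Placeholder data, not real NLTK output.
synonyms = {
    "fare": ["syn%d" % i for i in range(25)],  # a very generic verb
    "abiurare": ["syn_a", "syn_b"],
}
kept, dropped = split_by_synonym_count(synonyms, threshold=20)
```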

DISTANCE

Given two different words A and B with associated synonym lists sa and sb, A is considered equal to B if sa is equal to sb, and totally different from B if the intersection of sa and sb is empty.

The formula we used is:

formula
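The formula image is not reproduced here. As an illustration only, one distance that satisfies the two properties stated above (0 for identical synonym lists, 1 for an empty intersection) and takes min or max as a criteria is sketched below. This is an assumption, not necessarily the formula used by the class, and the function name synonym_distance is hypothetical.

```python
def synonym_distance(sa, sb, criteria=min):
    """Assumed distance: 1 - |sa & sb| / criteria(|sa|, |sb|)."""
    sa, sb = set(sa), set(sb)
    if not sa or not sb:
        # No synonyms on one side: treat the words as totally different.
        return 1.0
    shared = len(sa & sb)
    return 1.0 - shared / criteria(len(sa), len(sb))
```

With criteria=min a word whose synonyms are fully contained in another word's list gets distance 0, while criteria=max penalises the difference in list sizes.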

You can choose min or max as the criteria, or pass your own definition of distance:

def mydistance_name():
    ...
    return ...

wc.create_distance_matrix(mydistance=mydistance_name, criteria=None, verbose=True)

matrix = wc.create_distance_matrix(criteria=min, verbose=True)
wc.plot_eps_ncluster(matrix, ntot=10, min_samples=6)

plot_eps_clusters.jpg plot_eps_not_clustered.jpg

The function run_cluster uses the DBSCAN implementation in scikit-learn. You can find the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

result = wc.run_cluster(0.3, 6, matrix)
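For readers unfamiliar with DBSCAN on a distance matrix, here is a small self-contained sketch that calls scikit-learn directly. It assumes that the two numbers passed to run_cluster correspond to DBSCAN's eps and min_samples and that the matrix is used as a precomputed metric; neither is confirmed by this README. The toy matrix and word labels are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups and one isolated word (placeholder names).
words = ["a1", "a2", "a3", "b1", "b2", "lonely"]
groups = [0, 0, 0, 1, 1, 2]

n = len(words)
matrix = np.ones((n, n))  # distance 1.0 between unrelated words
for i in range(n):
    for j in range(n):
        if i == j:
            matrix[i, j] = 0.0  # a word is at distance 0 from itself
        elif groups[i] == groups[j]:
            matrix[i, j] = 0.1  # words in the same group are close

# metric="precomputed" tells DBSCAN that `matrix` already holds distances.
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(matrix)
clusters = dict(zip(words, labels.tolist()))
```

DBSCAN assigns the label -1 to points that do not belong to any cluster, which is how the isolated word shows up in the result.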

The plot below shows a cluster in a wordcloud-like format, where a smaller size corresponds to a lower distance.

wc.plot_cluster_k(matrix, 'contraddire')

contraddire.jpg

This class seems to work better for verbs and adjectives, but in general the quality of the results depends crucially on the quality of the synonym structure.

I wrote this class together with https://github.com/aborgher
