Python implementations of concept formation algorithms

This is a Python library of concept formation algorithms written by
Christopher MacLellan (http://www.christopia.net).

This library implements the COBWEB
(http://axon.cs.byu.edu/~martinez/classes/678/Papers/Fisher_Cobweb.pdf) and
COBWEB/3
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.4676&rep=rep1&type=pdf)
algorithms. The system accepts a stream of instances, represented as hash
tables of attribute-value pairs (values may be numeric or nominal), and learns
a concept hierarchy.
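To make the instance format concrete, here is a minimal sketch. The instances
themselves follow the representation described above; the tree class and
method names in the commented lines are hypothetical placeholders, not
necessarily this library's actual API.

    # Each instance is a plain dict (hash table) mapping attribute names to
    # nominal or numeric values.
    instances = [
        {'color': 'red',  'shape': 'circle', 'size': 4.2},   # nominal + numeric
        {'color': 'blue', 'shape': 'square', 'size': 1.0},
        {'color': 'red',  'shape': 'circle', 'size': 3.8},
    ]

    # Hypothetical usage (placeholder names, not this library's API):
    # tree = CobwebTree()
    # for instance in instances:
    #     tree.fit_instance(instance)   # incrementally sort into the hierarchy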

The library also includes the LABYRINTH
(http://www.isle.org/~langley/papers/labyrinth.cfb.pdf) algorithm for learning
structured concepts with relations and components. However, the algorithm
seemed underdeveloped in a couple of regards, so it has been extended in the
TRESTLE system.

First, we perform a single matching (structure-mapping) step at the root,
instead of at every node in the search as LABYRINTH did. This matching step
uses the Hungarian Algorithm to match objects based on their conceptual
relations (optimal when objects are considered independently of one another)
and a greedy approach to matching relations.
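The object-matching step can be sketched with an off-the-shelf linear sum
assignment (Hungarian) solver. The cost function here is an illustrative
stand-in for however the system actually scores the fit between two objects'
conceptual relations; this is a sketch, not the library's code.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_objects(instance_objects, concept_objects, cost):
        """Pair instance objects with concept objects, minimizing total cost.

        cost(a, b) is assumed to score how poorly the conceptual relations
        of a and b line up (lower is better).
        """
        cost_matrix = np.array([[cost(a, b) for b in concept_objects]
                                for a in instance_objects])
        # The Hungarian algorithm gives the globally optimal assignment
        # when objects are scored independently of one another.
        rows, cols = linear_sum_assignment(cost_matrix)
        return [(instance_objects[r], concept_objects[c])
                for r, c in zip(rows, cols)]

    # Toy usage with a stand-in dissimilarity between feature dicts.
    objs_a = [{'x': 1}, {'x': 5}]
    objs_b = [{'x': 4}, {'x': 0}]
    print(match_objects(objs_a, objs_b, lambda a, b: abs(a['x'] - b['x'])))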

Second, we have been careful about how we handle component features (features
whose values refer to other concepts in the tree). At the moment we ensure
that all such values refer to leaves of the concept tree, never to
intermediate nodes, which preserves maximal information. We are exploring
different approaches to abstracting component features during categorization,
but have not yet committed to a best approach.
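To make the idea of a component feature concrete, here is one plausible
encoding; the attribute names and concept ids are illustrative, and the exact
representation used by the library may differ.

    # Hypothetical structured instance: 'wheel1' and 'wheel2' are component
    # features whose values refer to other concepts in the tree. The string
    # ids stand in for references to actual leaf nodes.
    car = {
        'color': 'red',
        'wheel1': 'Concept-12',                 # component feature -> leaf
        'wheel2': 'Concept-17',                 # component feature -> leaf
        ('left-of', 'wheel1', 'wheel2'): True,  # relation among components
    }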

Third, within the COBWEB/3 algorithm, we have made a few improvements to
correct for situations where the number of data points is low (we now use an
unbiased estimate of the standard deviation:
https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation). We
are also considering a modification that normalizes the continuous values
during each category utility calculation; this would help the system handle
small differences between data points by adjusting to the scale of the domain.
Finally, we have added a category utility cutoff, which triggers the counts of
the current node to be updated without creating an additional leaf node.
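A minimal sketch of the low-sample correction, following the c4 correction
described in the linked article (the standard estimator, not necessarily the
library's exact code):

    import math

    def c4(n):
        # c4(n) = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2)
        return (math.sqrt(2.0 / (n - 1)) *
                math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0))

    def unbiased_std(values):
        n = len(values)
        if n < 2:
            return 0.0
        mean = sum(values) / n
        # Sample standard deviation with Bessel's correction (n - 1).
        s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
        # Dividing by c4(n) removes the remaining bias of s as an
        # estimator of the population standard deviation.
        return s / c4(n)

    print(unbiased_std([4.2, 1.0, 3.8]))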

Lastly, we are considering a mechanism that filters out concepts that do not
share any features with the instance being categorized. This would prevent
weird "mixed" clusters.
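A minimal sketch of this filtering idea, assuming instances are
attribute-value dicts and each concept exposes the set of attributes it has
seen (names are illustrative):

    def shares_features(instance, concept_attributes):
        # True if the instance and concept have at least one attribute in
        # common; concepts failing this test would be skipped during
        # categorization to avoid "mixed" clusters.
        return bool(set(instance) & set(concept_attributes))

    instance = {'color': 'red', 'size': 3.8}
    print(shares_features(instance, {'color', 'shape'}))    # True
    print(shares_features(instance, {'weight', 'height'}))  # False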


New features we are exploring:

1) Normalization of continuous values at each node during category utility
calculation.

2) A mechanism for abstracting component features.

3) Different approaches to selectively filtering out concepts during
categorization.

Thoughts:

1) Maybe the matching process should be a parse-and-match process. This could
then be repeated at each node so that parsing is driven by the current
top-down process as well as the bottom-up parsing process.

2) How is structure provided to the system, if at all? Or should it be given a
flat representation and learn the structure itself?
