WEBSHIVOM/NLPSamples

NLPSamples

Andy Whitton, a partner in Deloitte’s data practice, says:

“Full data classification can be a very expensive activity that very few organisations do well. Certified database technologies can tag every data item but, in our experience, only governments do this because of the cost implications.”

Supervised Document Classification: In supervised classification, an external mechanism (such as human feedback) provides correct information on the classification of documents.
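As a rough illustration of the supervised case, the sketch below trains a small Naive Bayes classifier on documents that a person has already labeled. The choice of Naive Bayes, the function names, and the sample documents and labels are all assumptions made for this example; the repository does not prescribe a particular classifier.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a document and split it on whitespace."""
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """Count words per class from (text, label) pairs supplied by a human."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled examples: the labels are the "external information".
training = [
    ("invoice payment due amount", "finance"),
    ("quarterly budget revenue payment", "finance"),
    ("patient diagnosis treatment clinic", "medical"),
    ("clinic nurse patient records", "medical"),
]
model = train_naive_bayes(training)
predicted = classify("payment for the overdue invoice", model)
print(predicted)
```

The labels in `training` play the role of the human feedback: without them, the model would have nothing to learn the classes from.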

Unsupervised Document Classification: Unsupervised document classification, also called document clustering, must be done entirely without reference to external information. Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents of a cluster. Document clustering is generally considered to be a centralized process. One example is clustering web documents for search users.
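Descriptor extraction can be sketched as picking the words that are frequent inside a cluster but rare in the other clusters. The scoring rule below (in-cluster count times the log of inverse cluster frequency), the function name `extract_descriptors`, and the sample clusters are assumptions for illustration, not a method taken from this repository.

```python
import math
from collections import Counter

def extract_descriptors(clusters, top_n=2):
    """Return the top_n highest-scoring words for each cluster.

    clusters maps a cluster id to a list of document strings.
    A word's score is its count inside the cluster multiplied by
    log(number of clusters / number of clusters containing the word),
    so words shared by every cluster score zero.
    """
    counts = {
        cid: Counter(word for doc in docs for word in doc.lower().split())
        for cid, docs in clusters.items()
    }
    cluster_freq = Counter()
    for word_counts in counts.values():
        cluster_freq.update(word_counts.keys())

    n_clusters = len(clusters)
    descriptors = {}
    for cid, word_counts in counts.items():
        scored = [
            (count * math.log(n_clusters / cluster_freq[word]), word)
            for word, count in word_counts.items()
        ]
        # Highest score first; ties are broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        descriptors[cid] = [word for _, word in scored[:top_n]]
    return descriptors

# Hypothetical clusters, assumed to come from an earlier clustering step.
clusters = {
    "A": ["python code library", "python function code"],
    "B": ["football match goal", "football team match code"],
}
descriptors = extract_descriptors(clusters)
print(descriptors)
```

Because "code" occurs in both sample clusters, it scores zero and is never chosen ahead of a word that is specific to one cluster.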

In general, there are two common families of clustering algorithms.

(i) The first is hierarchical clustering, which includes single linkage, complete linkage, group average, and Ward’s method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems.

(ii) The second is built on the K-means algorithm and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while K-means variants are more efficient and provide sufficient information for most purposes. These algorithms can further be classified as hard clustering, where each document belongs to exactly one cluster, or soft clustering, where a document can belong to several clusters with different degrees of membership.
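The two families can be compared on a toy example. The sketch below assumes documents have already been turned into term-count vectors over a hypothetical four-word vocabulary; the function names, the choice of Euclidean distance, the "first k points" initialisation, and the inverse-distance soft weights are all simplifications chosen for this illustration, not part of any particular library.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k, linkage=min):
    """Merge the two closest clusters until k remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Every pass compares all pairs of clusters, which is where the
    efficiency problem of hierarchical methods comes from.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(
                    distance(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

def k_means(points, k, iterations=10):
    """Hard-assign each point to its nearest centroid, then update centroids.

    Centroids start at the first k points (a deterministic simplification).
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

def soft_memberships(point, centroids):
    """Soft view: weights proportional to inverse distance, summing to 1."""
    inverse = [1.0 / (distance(point, c) + 1e-9) for c in centroids]
    total = sum(inverse)
    return [w / total for w in inverse]

# Hypothetical term counts over the vocabulary (python, code, football, match).
points = [(2, 1, 0, 0), (0, 0, 2, 1), (1, 2, 0, 0), (0, 0, 1, 2)]

hierarchy_clusters = agglomerative(points, k=2)
assignment, centroids = k_means(points, k=2)
weights = soft_memberships(points[0], centroids)
print(hierarchy_clusters, assignment, weights)
```

On this data both approaches group documents 0 and 2 together and documents 1 and 3 together. The hard assignment gives each document a single cluster index, while `soft_memberships` returns one weight per cluster for the same document.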
