geekgirljoy/Part-Of-Speech-Tagger

Part-Of-Speech-Tagger

A PHP & MySQL based Parts of Speech Tagger that uses the Brown Corpus.

Project Files And Overview

This project is still under active development, but it is already at a point where it is useful.

  • The original Brown Corpus is available for use and review in the Brown subfolder.

Please note my Disclaimer, where I explicitly state that I do not own the Brown Corpus and am not selling it to you.

Starting From Scratch

It is highly recommended that you do not train from scratch. See the Data files section below.

  • There is an SQL setup file (Create.PartsOfSpeech.DB.sql) that you can use to create the MySQL database necessary to use this Parts of Speech Tagger.

  • Once you have your database set up, you can use Train.php to process the Brown Corpus into your PartsOfSpeech MySQL database. The extracted trigrams can be used for many different things, but the goal in this case is to associate each word trigram with the parts of speech it represents. This models English compactly enough that the trigrams can serve as a pattern lookup table. Training can take several hours even on a fast machine; my tests on a Raspberry Pi took over 10 hours to complete the training from scratch.

  • The pretrained database is available as both .SQL dump and .CSV for use and review in the Data subfolder. Loading the .SQL files into the database takes only minutes, and the .CSV files are easy to open in Excel or read from your own programs.

  • After you have the data in your MySQL database, you will need to run AddHashes.php, which computes AB and BC bigram hashes as well as AC skip-gram hashes for each already known trigram. This enables a "backoff" or "failover" approach: if the tagger fails to find the exact pattern it is looking for (a trigram being the ideal), then instead of failing to tag the text it can try a slightly different, less specific pattern to see if that matches. This approach should significantly improve tagging accuracy and speed up the overall pattern lookup process.
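To make the trigram table and the backoff idea concrete, here is a minimal sketch in Python. It is not the project's PHP code and does not reflect its actual database schema. The function names (`parse_tagged`, `key`, `train`, `tag_window`), the use of MD5 as the hash, the in-memory dictionaries standing in for MySQL tables, and the order in which the fallback patterns are tried are all assumptions made for illustration. The only thing taken from the corpus itself is the Brown-style `word/tag` token format.

```python
import hashlib
from collections import Counter, defaultdict


def parse_tagged(line):
    """Split Brown-style 'word/tag' tokens into (lowercased word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so './.' becomes ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs


def key(*words):
    """Hash a word pattern into a fixed-width lookup key (MD5 is an assumption)."""
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()


def train(lines):
    """Build four lookup tables from tagged text.

    'abc' holds full trigrams, 'ab' and 'bc' hold the two bigrams inside each
    trigram, and 'ac' holds the skip-gram (first and last word). Each table
    maps a hashed word pattern to a Counter of the tag patterns seen with it.
    """
    tables = {name: defaultdict(Counter) for name in ("abc", "ab", "bc", "ac")}
    for line in lines:
        pairs = parse_tagged(line)
        for (a, ta), (b, tb), (c, tc) in zip(pairs, pairs[1:], pairs[2:]):
            tables["abc"][key(a, b, c)][(ta, tb, tc)] += 1
            tables["ab"][key(a, b)][(ta, tb)] += 1
            tables["bc"][key(b, c)][(tb, tc)] += 1
            tables["ac"][key(a, c)][(ta, tc)] += 1
    return tables


def tag_window(tables, a, b, c):
    """Tag three words, backing off to less specific patterns when needed.

    Returns (tags, source): tags is a 3-item list (None where nothing matched)
    and source names the table or tables that supplied the tags.
    """
    hit = tables["abc"].get(key(a, b, c))
    if hit:
        return list(hit.most_common(1)[0][0]), "abc"

    tags = [None, None, None]
    used = []
    fallbacks = (
        ("ab", (a, b), (0, 1)),
        ("bc", (b, c), (1, 2)),
        ("ac", (a, c), (0, 2)),
    )
    for name, words, slots in fallbacks:
        hit = tables[name].get(key(*words))
        if hit:
            best = hit.most_common(1)[0][0]
            for slot, tag in zip(slots, best):
                if tags[slot] is None:  # keep the first tag found per word
                    tags[slot] = tag
            used.append(name)
    return tags, "+".join(used) or "none"


# Two toy sentences in Brown-style format (made up for this example).
corpus = ["The/at dog/nn barked/vbd ./.", "The/at cat/nn slept/vbd ./."]
tables = train(corpus)

print(tag_window(tables, "the", "dog", "barked"))  # exact trigram match
print(tag_window(tables, "the", "dog", "slept"))   # unseen trigram, backs off
```

In the second call the trigram "the dog slept" was never seen, so the sketch falls back to the AB bigram "the dog" and the AC skip-gram "the ... slept" and still produces a tag for all three words. A real implementation would run the same lookups as indexed queries against the hash columns rather than Python dictionaries.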

At this point you can test your Part of Speech Tagger:

Testing & Tagging

Data files

There is no need to train your parts of speech tagger from scratch. The data is available as CSV & SQL for your convenience.

CSV

SQL

More Reading

If you would like a more thorough guide to this project, I have a series of blog posts covering its development.

Can a Bot Understand a Sentence?

Tokenizing & Lexing Natural Language

The Brown Corpus

The Brown Corpus Database

Building A Faster Bot

Adding Bigrams & Skipgrams

Parts of Speech Tagging

Unigrams

Finished Prototype
