Part-Of-Speech-Tagger

A PHP & MySQL based Parts of Speech Tagger that uses the Brown Corpus.

Project Files And Overview

This project is still under active development; however, it is already at a point where it is useful.

The original Brown Corpus is available for use and review in the Brown subfolder.

Please note my Disclaimer where I explicitly state that I do not own the Brown corpus and I am not selling it to you.

Starting From Scratch

It is highly recommended that you do not train from scratch - see the Data files section below.

There is an SQL setup file (Create.PartsOfSpeech.DB.sql) that you can use to create the MySQL database necessary to use this Parts of Speech Tagger.
Once your database is set up, you can use Train.php to process the Brown Corpus into your PartsOfSpeech MySQL database. The extracted trigrams can be used for many different things; however, the goal in this case is to associate each word trigram with the parts of speech it represents. This gives us an efficient model of English that lets us use the trigrams as a pattern lookup table. The training process can take several hours to run on a fast machine. My tests on a Raspberry Pi took over 10 hours to complete the training from scratch.
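To make the trigram idea concrete, here is a minimal sketch of the extraction step. It is not the code in Train.php. It assumes the Brown-style `word/tag` token format, keeps everything in a PHP array instead of MySQL, and lowercases words (my assumption). The function names `parseTaggedLine` and `countTrigrams` are hypothetical, and the sample sentence is made up in the Brown style.

```php
<?php
// Hypothetical helper: split a Brown-style tagged line into [word, tag] pairs.
function parseTaggedLine(string $line): array
{
    $pairs = [];
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $slash = strrpos($token, '/'); // use the last slash, the word itself may contain one
        if ($slash === false || $slash === 0) {
            continue; // skip tokens that have no word/tag split
        }
        // Lowercasing the word is an assumption made for this sketch.
        $pairs[] = [strtolower(substr($token, 0, $slash)), substr($token, $slash + 1)];
    }
    return $pairs;
}

// Hypothetical helper: slide a window of three over the pairs and count
// how often each word trigram appears with each tag trigram.
function countTrigrams(array $pairs, array &$counts): void
{
    for ($i = 0; $i + 2 < count($pairs); $i++) {
        $words = $pairs[$i][0] . ' ' . $pairs[$i + 1][0] . ' ' . $pairs[$i + 2][0];
        $tags  = $pairs[$i][1] . ' ' . $pairs[$i + 1][1] . ' ' . $pairs[$i + 2][1];
        $counts[$words][$tags] = ($counts[$words][$tags] ?? 0) + 1;
    }
}

// Illustrative sample line, not a quote from the corpus.
$counts = [];
countTrigrams(parseTaggedLine('The/at jury/nn said/vbd it/pps'), $counts);
print_r($counts);
```

A four word line yields two overlapping trigrams, which is why the full corpus produces so many rows and why training takes a while.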
The pretrained database is available as both .SQL dumps and .CSV files for use and review in the Data subfolder. Loading the .SQL files into the database takes only minutes, and opening the .CSV files in Excel or in your own programs is easy.
After you have the data in your MySQL database, you will need to run AddHashes.php, which computes AB & BC bigram hashes as well as AC skip-gram hashes for each already known trigram. This enables a "backoff" or "failover" approach: if the tagger fails to find the exact pattern it is looking for (a trigram being considered the ideal), then instead of failing to tag the text it can try a slightly different (less specific) pattern to see if that matches. This approach should significantly improve tagging accuracy and speed up the overall pattern lookup process.
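The backoff idea can be sketched in a few lines. This is only an illustration under my own assumptions, not what AddHashes.php actually does: it uses `md5` as the hash, an in-memory array instead of MySQL columns, and tries the patterns in the order ABC, AB, BC, AC. The function names `patternKeys`, `indexTrigram` and `lookupTags` are hypothetical.

```php
<?php
// Hypothetical helper: build one hash per pattern for the word trigram (A, B, C).
// md5 is an assumption; the project may hash its patterns differently.
function patternKeys(string $a, string $b, string $c): array
{
    return [
        'abc' => md5("$a $b $c"), // full trigram (the ideal match)
        'ab'  => md5("$a $b"),    // leading bigram
        'bc'  => md5("$b $c"),    // trailing bigram
        'ac'  => md5("$a $c"),    // skip-gram, middle word ignored
    ];
}

// Hypothetical helper: store the tags of a known trigram under all four hashes.
function indexTrigram(array &$index, array $words, array $tags): void
{
    foreach (patternKeys(...$words) as $pattern => $hash) {
        // Keep the first tag sequence seen; a real tagger would likely keep counts.
        $index[$pattern][$hash] ??= $tags;
    }
}

// Hypothetical helper: try the most specific pattern first, then back off.
// Only the tag positions covered by the matched pattern should be trusted.
function lookupTags(array $index, array $words): ?array
{
    foreach (patternKeys(...$words) as $pattern => $hash) {
        if (isset($index[$pattern][$hash])) {
            return ['pattern' => $pattern, 'tags' => $index[$pattern][$hash]];
        }
    }
    return null; // nothing matched, not even a less specific pattern
}

$index = [];
indexTrigram($index, ['the', 'jury', 'said'], ['at', 'nn', 'vbd']);

var_dump(lookupTags($index, ['the', 'jury', 'said']));  // exact trigram match
var_dump(lookupTags($index, ['the', 'jury', 'heard'])); // backs off to the AB bigram
var_dump(lookupTags($index, ['the', 'court', 'said'])); // backs off to the AC skip-gram
```

Because every pattern is reduced to a fixed-length hash, each backoff step is a single key lookup rather than a comparison of word strings, which is where the speed benefit would come from.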

At this point you can Test your Part of Speech Tagger:

Testing & Tagging

Data files

There is no need to train your parts of speech tagger from scratch. The data is available as CSV & SQL for your convenience.

CSV

Words.csv Contains all 56,057 words.
Tags.csv Contains all 472 tags.
Trigrams_1.csv Contains trigrams 0 - 212315.
Trigrams_2.csv Contains trigrams 212316 - 424631.
Trigrams_3.csv Contains trigrams 424632 - 636947.
Trigrams_4.csv Contains trigrams 636948 - 849262.
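If you want to read the split trigram files from your own program, something like the following sketch may help. It only streams and counts rows, because it makes no assumption about the column layout or whether the files have a header row. The `data/` path and the function name `readCsvParts` are assumptions for illustration.

```php
<?php
// Hypothetical helper: stream rows from several CSV parts, one row at a time,
// so the whole trigram set never has to sit in memory at once.
function readCsvParts(array $paths): Generator
{
    foreach ($paths as $path) {
        $handle = fopen($path, 'r');
        if ($handle === false) {
            throw new RuntimeException("Cannot open $path");
        }
        try {
            // Separator, enclosure and escape are passed explicitly.
            while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
                if ($row === [null]) {
                    continue; // fgetcsv reports a blank line as [null]
                }
                yield $row;
            }
        } finally {
            fclose($handle);
        }
    }
}

// Assumed location of the four parts; files that are missing are skipped.
$paths = [];
for ($i = 1; $i <= 4; $i++) {
    $paths[] = "data/Trigrams_$i.csv";
}

$rows = 0;
foreach (readCsvParts(array_filter($paths, 'is_file')) as $row) {
    $rows++; // replace with your own per-row processing
}
echo "Read $rows rows\n";
```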

SQL

Words_Data.sql Contains all 56,057 words.
Words_Structure.sql Contains the structure of the Words table.
Tags_Data.sql Contains all 472 tags.
Tags_Structure.sql Contains the structure of the Tags table.
Trigrams_Data_1.sql Contains trigrams 0 - 212315.
Trigrams_Data_2.sql Contains trigrams 212316 - 424631.
Trigrams_Data_3.sql Contains trigrams 424632 - 636947.
Trigrams_Data_4.sql Contains trigrams 636948 - 849262.
Trigrams_Structure.sql Contains the structure of the Trigrams table.
