- This is a work-in-progress project. (WIP)
- The data sets and trained models are not included in this repository. (Mostly because their size were about 800MB.)
- The procedure described below may change.
(The following is not implemented yet!)
LM: 3-gram with backoff and turing-good normalization.
CM: Noisy channel with naive bayes.
- Gather data of normal farsi writtings. (Normal data set)
- Gather data with possible mistakes. (Noisy data set)
- Create a dictionary from normal data set.
- Create a language model from normal data set. (3-gram)
- Mark words of noisy data set which are not in created dictionary. (finding out of dictionary spelling errors)
- Mark words of noisy data which are not probble base on LM. (finding mistype words which are in dictionary)
- Find the correct spelling of marked words by hand. (This might be tidious)
- Split mistake datas to train, dev and test sets.
- Create a confusion matrix (edit distant) for modeling noisy channel.
Training is done after these steps and model can be tested.
I have gathered data from HamshahriOnline and Virgool. I assume HamshahriOnline as the normal data set and Virgool posts as noisy data.
I will manage spelling errors in a csv file with two columns. One of the columns will be for correct word and the other represents the misspelled word.
After creating the csv file its data should be splitted into three test set of training, development and testing.
- Some data from HamshahriOnline has been gathered. (it is confiremed that this data has noise.)
- Some data from Virgool has been gathered.
- 3 language models (1, 2, 3 -grams with laplace smoothing) have been trained.
- Find spelling errors by checking 1-gram and suggest words with edit distant of 1.
- Use 3-gram model with backoff for suggesting spelling errors.
- Start creating a restApi server for connecting with spell checker through HTTP.
- Loading language mode is time consuming now, convert it to binary file for reducing the space and time.
- جورن|1:('جوان', -3.8090884102661104);('جورج', -4.763630536469236)
- ظورف|1:('ظرف', -3.962282228494659);('دورف', -6.0673664255091415)
- مشهد●|1:('مشهد', -4.073379409291665);('مشهدی', -5.40225468843409)
- میشینیم|4:('می بینیم', -4.4239137490229545);('می نشینیم', -5.89127516645346)
- Loading language models into ram takes time.
GPL V3 Read LICENSE.txt