Text Language Identifier

text-language-identifier identifies English, French and Italian written text with up to 99% accuracy.

Since this project was used to better understand n-gram analysis, no natural language processing modules were imported -- everything was implemented from first principles.

Usage

Download this repo as a .zip and extract it, then run:

$ cd src
$ python3 lang-bigram-id.py

Accuracy information for each model will be displayed in the terminal after the analysis is complete.

Diff the output files to see which lines were predicted differently by certain pairs of models. Here are some commands to try:

$ diff ../output/letter-bigram-laplace-smoothing-predictions.txt ../output/letter-bigram-no-smoothing-predictions.txt
$ diff ../output/letter-bigram-laplace-smoothing-predictions.txt ../output/word-bigram-no-smoothing-predictions.txt
$ diff ../output/letter-bigram-laplace-smoothing-predictions.txt ../output/word-bigram-laplace-smoothing-predictions.txt

How does it work?

This program builds a probabilistic model of each language from bigram analyses of English, French and Italian sample corpora. To predict the language of a test sentence, it builds the same kind of model for the sentence and chooses the language whose model is most similar to it, as measured by the root-mean-square error (RMSE) between the two models.
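
The approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in lang-bigram-id.py: the function names (`letter_bigram_model`, `rmse`, `predict_language`), the tiny inline corpora, and the choice to compare relative frequencies over the union of observed bigrams are all made up for the example.

```python
from collections import Counter
from math import sqrt


def letter_bigram_model(text):
    """Map each letter bigram in `text` to its relative frequency."""
    # Keep letters and spaces only, so word boundaries still form bigrams.
    letters = [c for c in text.lower() if c.isalpha() or c == " "]
    bigrams = Counter(zip(letters, letters[1:]))
    total = sum(bigrams.values())
    return {bg: count / total for bg, count in bigrams.items()} if total else {}


def rmse(model_a, model_b):
    """Root-mean-square error over the union of bigrams in both models."""
    keys = set(model_a) | set(model_b)
    if not keys:
        return 0.0
    squared = sum((model_a.get(k, 0.0) - model_b.get(k, 0.0)) ** 2 for k in keys)
    return sqrt(squared / len(keys))


def predict_language(sentence, language_models):
    """Return the language whose model has the lowest RMSE against the sentence."""
    sentence_model = letter_bigram_model(sentence)
    return min(language_models, key=lambda lang: rmse(sentence_model, language_models[lang]))


# Toy corpora, invented for this example only.
corpora = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat",
    "french": "le renard brun saute par dessus le chien paresseux et le chat",
    "italian": "la volpe marrone salta sopra il cane pigro e poi il gatto",
}
models = {lang: letter_bigram_model(text) for lang, text in corpora.items()}
print(predict_language("the dog and the fox", models))
```

A word-bigram variant would follow the same shape, with `text.lower().split()` in place of the character list.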

Why?

This was a computational linguistics experiment to see which language model, word bigrams or letter bigrams, performs best. I also tested the impact of Laplace smoothing on the predictive accuracy of each model.
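
For reference, Laplace (add-one) smoothing adds one to every bigram count, so that bigrams never seen in the training corpus get a small nonzero probability instead of zero. The sketch below is only an illustration: it assumes a fixed 27-symbol alphabet (a to z plus space) and uses a hypothetical function name, `smoothed_bigram_probs`; the project's own implementation may differ.

```python
from collections import Counter
from itertools import product
import string

ALPHABET = string.ascii_lowercase + " "  # assumed symbol set for this sketch


def smoothed_bigram_probs(text, alpha=1):
    """Add-alpha estimate of bigram probabilities (alpha=1 is Laplace smoothing)."""
    symbols = [c for c in text.lower() if c in ALPHABET]
    counts = Counter(zip(symbols, symbols[1:]))
    vocab = len(ALPHABET) ** 2  # number of possible bigrams
    total = sum(counts.values()) + alpha * vocab
    # Counter returns 0 for unseen bigrams, so every bigram gets at least alpha.
    return {bg: (counts[bg] + alpha) / total for bg in product(ALPHABET, repeat=2)}


probs = smoothed_bigram_probs("hello world")
print(probs[("l", "l")] > probs[("z", "q")])  # compares a seen bigram with an unseen one
print(probs[("z", "q")] > 0)                  # checks that an unseen bigram is nonzero
```

Without smoothing, the unseen bigram would have probability exactly zero, which is the difference the smoothed and unsmoothed models in this experiment are meant to expose.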

alichtman/text-language-identifier
