Language-Identifier

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages.

Each language in this dataset contains 1000 rows/paragraphs.

After data selection and preprocessing, I used 22 selected languages from the original dataset, which include the following:

⦁ English
⦁ Arabic
⦁ French
⦁ Hindi
⦁ Urdu
⦁ Portuguese
⦁ Persian
⦁ Pushto
⦁ Spanish
⦁ Korean
⦁ Tamil
⦁ Turkish
⦁ Estonian
⦁ Russian
⦁ Romanian
⦁ Chinese
⦁ Swedish
⦁ Latin
⦁ German
⦁ Dutch
⦁ Japanese
⦁ Thai

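
The selection step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the data is available as (text, language) pairs, and the `select_languages` helper and the sample rows are hypothetical.

```python
from collections import Counter

# The 22 language names listed above.
SELECTED_LANGUAGES = {
    "English", "Arabic", "French", "Hindi", "Urdu", "Portuguese", "Persian",
    "Pushto", "Spanish", "Korean", "Tamil", "Turkish", "Estonian", "Russian",
    "Romanian", "Chinese", "Swedish", "Latin", "German", "Dutch", "Japanese",
    "Thai",
}


def select_languages(rows, keep=SELECTED_LANGUAGES):
    """Keep only the (text, language) pairs whose label is in `keep`."""
    return [(text, lang) for text, lang in rows if lang in keep]


# Tiny in-memory stand-in for the real data (hypothetical rows).
sample_rows = [
    ("This is a paragraph.", "English"),
    ("Ceci est un paragraphe.", "French"),
    ("Dette er et afsnit.", "Danish"),  # not in the list, so it is dropped
]

subset = select_languages(sample_rows)
counts = Counter(lang for _, lang in subset)
print(len(SELECTED_LANGUAGES), dict(counts))
```

With the full data, `counts` would be expected to show 1000 paragraphs per selected language, since each language in WiLI-2018 has 1000 rows.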
Accuracy of the model is 95%.
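
To make the accuracy figure concrete, here is a rough sketch of how a language identifier can be trained and scored. The README does not say which model was used, so this swaps in a character-trigram Naive Bayes classifier written with the standard library only. The class name, helper names, and toy sentences are all assumptions for illustration, and the sketch does not reproduce the 95% result.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split lowercased, space-padded text into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)

        def log_score(label):
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / n_docs)
            for gram in char_ngrams(text):
                seen = self.counts[label][gram]
                score += math.log((seen + 1) / (self.totals[label] + vocab_size))
            return score

        return max(self.docs, key=log_score)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data; the real dataset has 1000 paragraphs per language.
train = [
    ("the quick brown fox jumps over the lazy dog", "English"),
    ("this is a short english paragraph", "English"),
    ("el rápido zorro marrón salta sobre el perro perezoso", "Spanish"),
    ("este es un párrafo corto en español", "Spanish"),
]
test = [("the lazy dog", "English"), ("el perro perezoso", "Spanish")]

model = NgramNaiveBayes().fit([t for t, _ in train], [l for _, l in train])
predictions = [model.predict(text) for text, _ in test]
print(accuracy([label for _, label in test], predictions))
```

Accuracy here is simply the share of held-out paragraphs whose predicted language matches the label, so 95% would mean 95 correct predictions out of every 100 test paragraphs.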

Repository files: Language Identifier.ipynb, README.md, dataset.csv
