bhavuksagar/Language-Identifier

Language-Identifier

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages.

Each language in this dataset contains 1000 rows/paragraphs.

After data selection and preprocessing, I used 22 languages selected from the original dataset:

⦁ English
⦁ Arabic
⦁ French
⦁ Hindi
⦁ Urdu
⦁ Portuguese
⦁ Persian
⦁ Pushto
⦁ Spanish
⦁ Korean
⦁ Tamil
⦁ Turkish
⦁ Estonian
⦁ Russian
⦁ Romanian
⦁ Chinese
⦁ Swedish
⦁ Latin
⦁ German
⦁ Dutch
⦁ Japanese
⦁ Thai

Accuracy of the model is 95%.
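
The language selection and the accuracy figure above can be illustrated with a minimal sketch. It assumes the dataset has already been loaded as `(text, language)` pairs. The names `SELECTED_LANGUAGES`, `select_languages`, and `accuracy` are hypothetical helpers written for this example, not code from this repository, and the label spellings may differ from those in the dataset files. The classifier itself is not shown, since this README does not name the model.

```python
from collections import Counter

# The 22 languages listed above; the exact label spellings in the
# dataset files are an assumption here.
SELECTED_LANGUAGES = [
    "English", "Arabic", "French", "Hindi", "Urdu", "Portuguese",
    "Persian", "Pushto", "Spanish", "Korean", "Tamil", "Turkish",
    "Estonian", "Russian", "Romanian", "Chinese", "Swedish", "Latin",
    "German", "Dutch", "Japanese", "Thai",
]


def select_languages(rows, languages=SELECTED_LANGUAGES):
    """Keep only the (text, language) rows whose label is in `languages`."""
    wanted = set(languages)
    return [(text, lang) for text, lang in rows if lang in wanted]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


if __name__ == "__main__":
    # Toy rows standing in for the real dataset (hypothetical data).
    rows = [
        ("the quick brown fox", "English"),
        ("le renard brun rapide", "French"),
        ("den snabba bruna räven", "Swedish"),
        ("rask brun rev", "Norwegian"),  # not among the 22, so it is dropped
    ]
    subset = select_languages(rows)
    print(Counter(lang for _, lang in subset))
    print(accuracy(["English", "French"], ["English", "Spanish"]))
```

With 1000 paragraphs per language, a subset built this way from the full dataset would hold 22 × 1000 = 22,000 paragraphs before any further preprocessing.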
