A Persian (Farsi) Preprocessing Python Module
About the project
This Python module was developed in 2010 as part of my Persian text mining project, because I felt the lack of a Persian text preprocessing tool or library at the time.
The main operations of this module are:
- Normalise the letters: maps all glyphs of each letter to one representative glyph
- Remove noise: removes non-Arabic characters, digits, and stop words
- Stem the words: an affix stemmer built on a finite state machine model
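To make the three operations concrete, here is a minimal sketch of the same ideas. It is not preper's actual code or API: the names `GLYPH_MAP`, `NON_LETTERS`, `SUFFIXES`, `stem`, and `preprocess` are hypothetical, the glyph map and suffix list are deliberately tiny, and the stemmer is a plain suffix stripper rather than the finite state machine the module uses. Stop word removal is left out here.

```python
import re

# Assumed, incomplete glyph map: Arabic variants -> one Persian representative.
GLYPH_MAP = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06cc",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
})

# Anything that is not a basic Arabic-script letter, one of the six listed
# Persian letters (peh, tcheh, jeh, keheh, gaf, Farsi yeh) or whitespace
# counts as noise. Digits and Latin text fall outside these ranges.
NON_LETTERS = re.compile(r"[^\u0621-\u064a\u067e\u0686\u0698\u06a9\u06af\u06cc\s]+")

# Illustrative plural suffixes only, longest first.
SUFFIXES = ("\u0647\u0627\u06cc", "\u0647\u0627")


def stem(word):
    """Strip the first matching suffix, keeping at least two letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Normalise glyphs, drop noise characters, then stem each token."""
    text = text.translate(GLYPH_MAP)
    text = NON_LETTERS.sub(" ", text)
    return [stem(token) for token in text.split()]
```

Normalising before removing noise matters in this sketch: the glyph map runs first so that variant letters are unified before tokens are split and stemmed.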
Python 3.6 and its built-in modules were used to develop preper.
Python 3.* fully supports Unicode characters, so there is no need to convert the characters to
their Unicode code points; i.e.
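As a small illustration of this point (an added example, not taken from the module), a Persian word can be written directly as a string literal in Python 3 source. The escaped form is shown only for comparison, and the variable names are hypothetical.

```python
import unicodedata

# A Persian word ("book") typed directly as a literal.
word = "کتاب"

# The same word spelled with code point escapes, which Python 3 makes unnecessary.
escaped = "\u06a9\u062a\u0627\u0628"

# Compare the two spellings and look up the Unicode name of each character.
same = word == escaped
names = [unicodedata.name(ch) for ch in word]
```

Python 3 reads source files as UTF-8 by default, so the literal needs no encoding declaration.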
There are no third-party libraries or dependencies to install separately.
There is nothing to install. Just copy the
preper.py module file into your project folder.
use_module.py is a sample file that shows how to use
preper. There is only one thing to note: the stop word list is provided in the
stopwords.txt file in the module folder. Feel free to modify or update it.
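If you edit the stop word list, a sketch like the following can help check that it still loads as expected. It assumes stopwords.txt is UTF-8 text with one stop word per line, which the source does not confirm, and `load_stopwords` and `remove_stopwords` are hypothetical helpers rather than part of preper.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stop word per line (assumed format), skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stopwords]
```

Calling `load_stopwords("stopwords.txt")` from the module folder would return a set, which keeps each membership check in `remove_stopwords` fast even for a long list.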
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Project Link: https://github.com/aliie62/preper