<a href="https://colab.research.google.com/github/baharkarami/Text-Mining-Class/blob/main/Hazm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install hazm





---



1. Importing Hazm: The Hazm library provides various tools for processing Persian text.

2. Creating a Normalizer object: An instance of the `Normalizer` class is created. This class standardizes characters and spacing in Persian text (e.g., joining affixes to their word with a zero-width non-joiner, `\u200c`, and converting digits to Persian digits).

3. Normalizing the text.








In [None]:
from hazm import *

normalizer = Normalizer()
normalizer.normalize('قیمت های گزارش شده در سایت های مختلف امروز 11 مرداد به قرار زیر می باشد!')


'قیمت\u200cهای گزارش\u200cشده در سایت\u200cهای مختلف امروز ۱۱ مرداد به قرار زیر می\u200cباشد!'
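Two of the changes visible in this output are the inserted `\u200c` characters and the Persian digits. The following is a minimal standard-library sketch of what those two transformations look like on their own. It is not Hazm's implementation, and the names `ZWNJ` and `TO_PERSIAN_DIGITS` are introduced here only for illustration.

```python
import unicodedata

# U+200C is the "half-space" that joins an affix to a word without a visible gap.
ZWNJ = "\u200c"
print(unicodedata.name(ZWNJ))

# Map the ASCII digits 0-9 to the Persian digits U+06F0-U+06F9.
TO_PERSIAN_DIGITS = str.maketrans(
    "0123456789", "".join(chr(0x06F0 + d) for d in range(10))
)

converted = "11 مرداد".translate(TO_PERSIAN_DIGITS)
joined = "می" + ZWNJ + "باشد"

print(converted)
print(repr(joined))  # repr makes the invisible \u200c visible
```

Printing with `repr` is a convenient way to check whether a string contains half-spaces, since they are invisible in normal output.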



---



1. Importing Hazm.

2. Creating a WordTokenizer object.

3. Defining the sample text.

4. Tokenizing the text:

  The `tokenize` method splits the input text into a list of words and symbols. Each word or symbol appears as a separate element of the list.


In [None]:
import hazm

# Only the module name is imported here, so the class must be qualified.
wrd_tokenizer = hazm.WordTokenizer()
txt = "این یک تست است! این ایمیل است b.example@email.com! این هم یک عدد است44!"
wrd_tokenizer.tokenize(txt)

['این',
 'یک',
 'تست',
 'است',
 '!',
 'این',
 'ایمیل',
 'است',
 'b',
 '.',
 'example@email',
 '.',
 'com',
 '!',
 'این',
 'هم',
 'یک',
 'عدد',
 'است',
 '44',
 '!']



---



1. Importing Hazm.

2. Creating a WordTokenizer with replacement settings:

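  `replace_emails=True`: Replaces email addresses in the text with an `EMAIL` token.

  `replace_numbers=True`: Replaces numbers in the text with a `NUM` token. In the output below, `44` becomes `NUM` followed by `2`, which appears to be the number of digits.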

3. Defining the sample text.

4. Tokenizing the text:

  The `tokenize` method splits the text into tokens (words and symbols) and replaces emails and numbers as specified.

In [None]:
import hazm
wrd_tokenizer = hazm.WordTokenizer(replace_emails=True, replace_numbers=True)
txt = "این یک تست است! این ایمیل است b.example@email.com! 44 این هم یک عدد است !"
wrd_tokenizer.tokenize(txt)

['این',
 'یک',
 'تست',
 'است',
 '!',
 'این',
 'ایمیل',
 'است',
 'EMAIL',
 '!',
 'NUM',
 '2',
 'این',
 'هم',
 'یک',
 'عدد',
 'است',
 '!']
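To make the idea of replacing emails and numbers before tokenizing concrete, here is a rough regex-based sketch using only the standard library. It is an illustration under simplifying assumptions, not Hazm's actual implementation: the patterns `EMAIL_RE` and `NUMBER_RE` and the function `replace_then_tokenize` are hypothetical names, and the email pattern is deliberately loose.

```python
import re

# Simplified patterns for illustration; real email and number matching is more involved.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+")

def replace_then_tokenize(text):
    """Replace emails and numbers with placeholder tokens, then split into tokens."""
    text = EMAIL_RE.sub(" EMAIL ", text)   # emails first, so their parts are not split
    text = NUMBER_RE.sub(" NUM ", text)
    # A token is either a run of word characters or a single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = replace_then_tokenize("این ایمیل است b.example@email.com! 44 عدد است!")
print(tokens)
```

Replacing the email before splitting on punctuation avoids the fragmentation seen in the earlier output, where the address was broken apart at each `.`.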



---



1. Importing the Hazm library.

2. Creating an InformalNormalizer object:

  An instance of the `InformalNormalizer` class is created. This class is used to normalize colloquial text and convert informal words into more formal ones.

3. Defining the input text:

  The text contains informal words:

    * "یه" instead of "یک"

    * "محاوره‌ اس" instead of "محاوره‌ای است"

    * "واسه"  instead of "برای"

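4. Normalizing the text: `normalize` returns a nested list with one list per sentence, which in turn holds one list per token. Each token list contains the candidate forms for that token (a formal equivalent, where one is found, alongside the original). In the output below, "اس" is left unchanged.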



  

In [None]:
import hazm  # bind the module name, since the class is accessed as hazm.InformalNormalizer

informal_normalizer = hazm.InformalNormalizer()
txt = "این یه متن محاوره‌ اس واسه آموزشه"
res = informal_normalizer.normalize(txt)
print(res)

[[['این'], ['یک', 'یه'], ['متن'], ['محاوره'], ['اس'], ['برای', 'واسه'], ['آموزشه است', 'آموزش است', 'آموزشه']]]
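Because the result is a nested list of candidates rather than a string, a further step is needed to get normalized sentences back. The sketch below takes the output shown above as a literal and picks the first candidate for each token. Treating the first candidate as the preferred one is an assumption made here for illustration, not documented behavior.

```python
# The nested result printed above: sentences -> tokens -> candidate forms.
candidates = [[['این'], ['یک', 'یه'], ['متن'], ['محاوره'], ['اس'],
               ['برای', 'واسه'], ['آموزشه است', 'آموزش است', 'آموزشه']]]

# Assumption: the first candidate of each token is the one to keep.
sentences = [
    " ".join(options[0] for options in sentence)
    for sentence in candidates
]

for sentence in sentences:
    print(sentence)
```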




---



1. Importing libraries:

  `stopwords_list`: A Hazm function providing a list of Persian stopwords (e.g., "و", "به", "در", etc.).

  `string.punctuation`: A string of ASCII punctuation characters (e.g., `!`, `?`, `.`, etc.). It does not include Persian marks such as `،` or `؟`.

2. Initializing objects for preprocessing:

  `normalizer`: Normalizes words (e.g., unifying forms like "می‌خواهیم").

  `stopwords`: A list of stopwords to filter out from the text.

  `punc`: A collection of punctuation marks to be removed.

3. Defining the sample text

4. Tokenizing the text

5. Normalizing and filtering stopwords and punctuation:

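  For each word in `tok`:

    If the word is not in the stopwords list (`x not in stopwords`) and not a punctuation mark (`x not in punc`), it is normalized using `normalizer.normalize(x)`.

    Because `punc` is a string, `x not in punc` is a substring test, so a multi-character token such as `!!` is not filtered out.

  The remaining words are joined with spaces (`" ".join(...)`).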





In [None]:
from hazm import Normalizer, stopwords_list, word_tokenize
import string

normalizer = Normalizer()
stopwords = stopwords_list()
punc = string.punctuation

sample = "این یک متن فارسی است که می‌خواهیم آن را پیش پردازش کرده و کلمات توقف آن را حذف کنیم! و همینطور علائم نقطه گذاری!!"
tok = word_tokenize(sample)
s = " ".join(normalizer.normalize(x) for x in tok if x not in stopwords and x not in punc)
print(s)

متن فارسی می‌خواهیم پردازش کلمات توقف حذف همینطور علائم نقطه !!
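The trailing `!!` in this output comes from the substring test against `string.punctuation`. The standard-library sketch below contrasts that test with a character-wise check. The set `PUNCT` and the few Persian marks added to it are choices made for this illustration, not something Hazm provides.

```python
import string

punc = string.punctuation
tokens = ["متن", "!", "!!", "،"]

# Substring test, as in the cell above: "!!" is not a substring of punc, so it survives.
kept_substring = [t for t in tokens if t not in punc]

# Character-wise test: drop a token when every character is a punctuation mark.
# The Persian marks added here are an illustrative, incomplete selection.
PUNCT = set(string.punctuation) | set("،؛؟«»")
kept_charwise = [t for t in tokens if not all(ch in PUNCT for ch in t)]

print(kept_substring)
print(kept_charwise)
```

Converting the stopword list to a `set` in the same way would also make the membership test faster on long texts.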




---



1. Importing the Hazm library

2. Creating a Lemmatizer object:

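 The `Lemmatizer` strips prefixes and suffixes from a word and returns its base form (lemma). For a verb, the result holds the past and present stems separated by `#`, as in the output below.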


3. Lemmatizing samples

In [None]:
import hazm

lemmatizer = hazm.Lemmatizer()

sample1 = lemmatizer.lemmatize("خواندم")
print(sample1)

sample2 = lemmatizer.lemmatize("کتابهایی")
print(sample2)

خواند#خوان
کتاب
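Since a verb lemma arrives as two stems joined by `#`, downstream code usually needs to split it. A small sketch, assuming the `past#present` layout seen in the output above; `split_verb_lemma` is a hypothetical helper, not part of Hazm.

```python
def split_verb_lemma(lemma):
    """Split a 'past#present' verb lemma; return other lemmas unchanged."""
    if "#" in lemma:
        past, present = lemma.split("#", 1)
        return {"past": past, "present": present}
    return {"lemma": lemma}

verb = split_verb_lemma("خواند#خوان")
noun = split_verb_lemma("کتاب")

print(verb)
print(noun)
```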




---



1. Importing the Hazm library

2. Creating a Normalizer object:

  `Normalizer` is used to normalize text (e.g., replacing the regular space in "می خواند" with a half-space).

3. Creating a Stemmer object:

  The `Stemmer` reduces words to their simpler or root form.

4. Stemming sample words:

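  Each word is stemmed, and the result is printed. Note that "پاسبان" is over-stemmed to "پاسب".
  
  In the last line, "می خواند" is normalized before stemming, which joins the prefix to the verb with a half-space. In the output below, the stemmer leaves the verb unchanged in both cases.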

In [None]:
import hazm

normalizer = hazm.Normalizer()

stemmer = hazm.Stemmer()
print(stemmer.stem("کتابها"))
print(stemmer.stem("خدایان"))
print(stemmer.stem("پاسبان"))
print(stemmer.stem("می خواند"))
print(stemmer.stem(normalizer.normalize("می خواند")))

کتاب
خدا
پاسب
می خواند
می‌خواند
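The over-stemming of "پاسبان" is typical of suffix stripping without a dictionary. The toy stemmer below reproduces the pattern with three hard-coded suffixes. It is only an illustration of the general technique: the suffix list and the function `toy_stem` are invented for this sketch and do not reflect how Hazm's `Stemmer` is implemented.

```python
# Illustrative suffixes only, checked longest first.
SUFFIXES = ("یان", "ها", "ان")

def toy_stem(word):
    """Strip the first matching suffix; no dictionary check, so errors are expected."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

words = ["کتابها", "خدایان", "پاسبان", "می خواند"]
stems = [toy_stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

With no lexicon to consult, the rule that correctly strips a plural ending also removes the same letters from a word where they are part of the stem.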
