# Set up environment (donwload and import libs)

In [None]:
# SPARK-NLP FOR TOKENIZATION, LEMATIZATION, PoS
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-05-19 11:34:15--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-05-19 11:34:15--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-05-19 11:34:16--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:44

In [None]:
!pip install alphabet-detector

Collecting alphabet-detector
  Downloading alphabet-detector-0.0.7.tar.gz (1.6 kB)
Building wheels for collected packages: alphabet-detector
  Building wheel for alphabet-detector (setup.py) ... [?25l[?25hdone
  Created wheel for alphabet-detector: filename=alphabet_detector-0.0.7-py3-none-any.whl size=2446 sha256=cd2052d9652fe88426004cca4917c7f27ed36a3890eb0a8ccb38af14f6aa00fe
  Stored in directory: /root/.cache/pip/wheels/22/8c/ab/4afb1765f2b8450f894a1f06c9aa2b3f8e73f2fb8b55849e17
Successfully built alphabet-detector
Installing collected packages: alphabet-detector
Successfully installed alphabet-detector-0.0.7


In [None]:
import pandas as pd
import numpy as np
import json
import re

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from alphabet_detector import AlphabetDetector

In [None]:
spark = sparknlp.start()
print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 3.4.4
Apache Spark version: 3.0.3


In [None]:
spark = sparknlp.start()
print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 3.4.4
Apache Spark version: 3.0.3


In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = LemmatizerModel.pretrained("lemma", "bn") \
        .setInputCols(["token"]) \
        .setOutputCol("lemma")

stop_words = StopWordsCleaner.pretrained('stopwords_bn', 'bn')\
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

pos = PerceptronModel.pretrained("pos_msri", "bn") \
  .setInputCols(["document", "token"]) \
  .setOutputCol("pos")

nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer, stop_words, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

lemma download started this may take some time.
Approximate size to download 90.6 KB
[OK!]
stopwords_bn download started this may take some time.
Approximate size to download 1.9 KB
[OK!]
pos_msri download started this may take some time.
Approximate size to download 806.5 KB
[OK!]


# Text processing (explanation)

The main aim of preprocessing is clearing text from:
* stop-words
* auxiliary part of speech
* proper nouns (I am changing them to word "জিনিস" - "object")
* numbers (I am changing them to word "সংখ্যা" - "number")
* punctuation and other symbols

I am also changing all pronouns to the word "জিনিস" - "object". Below listed all part of speeches that model can distinguish and what I'm doing with them in text preprocessing:

- NN - noun \\
- SYM - symbol (delete) \\
- NNP - propper noun (change to "object") \\
- VM - modal verb \\
- INTF - intesifier (delete) \\
- JJ - Adjective \\
- QF - Quantifiers (delete) \\
- CC - coordinating conjunction (delete) \\
- NST - noun \\
- PSP - adposition (delete) \\
- DEM - pronoun (change to "object") \\
- PRP - posessive pronoun (change to "object") \\
- NEG - negative (delete) \\
- WQ - wh-qual (delete) \\
- RB - adverb \\
- VAUX - Verb Auxiliary (delete) \\
- UT (delete) \\
- XC (delete) \\
- RP - particle (delete) \\
- Q0 - ordinal number (change to "number") \\
- QC - cardinal number (change to "number") \\
- BM - (delete) \\
- NNC - compound noun \\
- PPR - postposition (delete) \\
- INJ - delete \\
- CL - delete \\
- UNK - delete \\

Stop words i'm clearing form text are listed here \\
(https://github.com/stopwords-iso/stopwords-bn/blob/master/stopwords-bn.txt)

In the words of other parts of speech, I highlight the initial form. 
As i understood from wikipedia, some part of speech in bengali has different forms of word (like nominative, objective, genetive, locative noun inflections). As a result, I get a text containing the initial forms of words and cleaned during preprocessing. Here is an example of how it should work for english language:

**In:** She jumped into the river and breathed heavily. \\
**Out:** Object jump river breath heavily 

# Actual code for text processer and several examples:

In [None]:
!wget https://github.com/stopwords-iso/stopwords-bn/blob/master/stopwords-bn.txt 

--2022-05-19 11:36:42--  https://github.com/stopwords-iso/stopwords-bn/blob/master/stopwords-bn.txt
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘stopwords-bn.txt’

stopwords-bn.txt        [ <=>                ] 218.26K  --.-KB/s    in 0.1s    

2022-05-19 11:36:42 (2.10 MB/s) - ‘stopwords-bn.txt’ saved [223502]



In [None]:
with open('stopwords-bn.txt', 'r', encoding='utf-8') as stopfile:
  text = stopfile.read()
  stopwords = text.split('\n')
print('Web stop_words:', len(stopwords))
stop_set = set(stopwords)

Web stop_words: 3014


In [None]:
banned_types = {'SYM', 'INTF', 'QF', 'CC', 'PSP', 'RDP', 'NEG', 'WQ', 'VAUX', 'UT', 'XC', 'RP', 'BM', 'PPR', 'INJ', 'CL', 'UNK'}
pron_types = {'NNP', 'DEM', 'PRP'}
number_types = {'QO', 'QC'}
forbidden_chars = {'"','0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0','!', '@', '#', '$', '^', '&', '*', '—','’','‘',']','[','_','·','.%',')',
'(','”','“','\u3000',"、","。","〈","〉","《"}

In [None]:
def clean_bn_text(text_in, debug=False):
  ad = AlphabetDetector()
  results = light_pipeline.fullAnnotate([text_in])
  text_out = ""

  for i in range(len(results[0]['lemma'])):
    token = results[0]['token'][i].result
    lemm = results[0]['lemma'][i].result
    pt = results[0]['pos'][i].result

    if not "LATIN" in ad.detect_alphabet(token) and token not in stop_set and lemm not in stop_set and pt not in banned_types and len(set(token).intersection(forbidden_chars)) == 0:
      if pt in pron_types:
        text_out += 'জিনিস '
        if debug:
          print(token, "-->", "জিনিস", pt)
      elif pt in number_types:
        text_out += 'সংখ্যা '
        if debug:
          print(token, "-->", "সংখ্যা", pt)
      else:
        lemm = re.sub(r'[^\w\s]', '', lemm)
        text_out += lemm
        text_out += ' '
        if debug:
          print(token, "-->", lemm, pt)
    else:
      if debug:
        print(token, "-->", "[deleted]", pt)
  return text_out

Run cells below for testing

In [None]:
# List of sentences to test preprocesser. You can put sentence to test in this list
texts = [
  'ধারা ১: সমস্ত মানুষ স্বাধীনভাবে সমান মর্যাদা এবং অধিকার নিয়ে জন্মগ্রহণ করে। তাঁদের বিবেক এবং বুদ্ধি আছে; সুতরাং সকলেরই একে অপরের প্রতি ভ্রাতৃত্বসুলভ মনোভাব নিয়ে আচরণ করা উচিৎ।',
  'একদিন প্রাতে বৈদ্যনাথের মার্বলমণ্ডিত দালানে একটি স্থূলোদর সন্ন্যাসী দুইসের মোহনভোগ এবং দেড়সের দুগ্ধ সেবায় নিযুক্ত আছে বৈদ্যনাথ গায়ে একখানি চাদর দিয়া জোড়করে একান্ত বিনীতভাবে ভূতলে বসিয়া ভক্তিভরে পবিত্র ভোজনব্যাপার নিরীক্ষণ করিতেছিলেন এমন সময় কোনোমতে দ্বারীদের দৃষ্টি এড়াইয়া জীর্ণদেহ বালক সহিত একটি অতি শীর্ণকায়া রমণী গৃহে প্রবেশ করিয়া ক্ষীণস্বরে কহিল বাবু দুটি খেতে দাও',
  'তিনি জানালা খোলা এবং দেখেছি একটি গাছ পাখি.'
]

In [None]:
i = 1
for text in texts:
  print("Test", i)
  i += 1

  print('\033[1m' + "In:" +  '\033[0m', text)
  print('\033[1m' + "Out:" +  '\033[0m', clean_bn_text(text, True)) # Can set False instead of True to get rid off "A --> B" debug output 
  print("==========================================================")

Test 1
[1mIn:[0m ধারা ১: সমস্ত মানুষ স্বাধীনভাবে সমান মর্যাদা এবং অধিকার নিয়ে জন্মগ্রহণ করে। তাঁদের বিবেক এবং বুদ্ধি আছে; সুতরাং সকলেরই একে অপরের প্রতি ভ্রাতৃত্বসুলভ মনোভাব নিয়ে আচরণ করা উচিৎ।
ধারা --> ধর NN
১ --> সংখ্যা QC
: --> [deleted] SYM
সমস্ত --> সমসত JJ
মানুষ --> মনষ NN
স্বাধীনভাবে --> সবধনভব RB
সমান --> সমন JJ
মর্যাদা --> মরযদ NN
এবং --> [deleted] CC
অধিকার --> অধকর NN
নিয়ে --> নয JJ
জন্মগ্রহণ --> জনমগরহণ NN
করে। --> কর NN
তাঁদের --> জিনিস PRP
বিবেক --> ববক JJ
এবং --> [deleted] CC
বুদ্ধি --> বদধ NN
আছে --> আছ VM
; --> [deleted] SYM
সুতরাং --> সতর JJ
সকলেরই --> সকলরই NN
একে --> সংখ্যা QC
অপরের --> অপরর NN
প্রতি --> [deleted] PSP
ভ্রাতৃত্বসুলভ --> ভরততবসলভ NN
মনোভাব --> মনভব NN
নিয়ে --> নয JJ
আচরণ --> আচরণ NN
করা --> কর VM
উচিৎ। --> [deleted] VAUX
[1mOut:[0m ধর সংখ্যা সমসত মনষ সবধনভব সমন মরযদ অধকর নয জনমগরহণ কর জিনিস ববক বদধ আছ সতর সকলরই সংখ্যা অপরর ভরততবসলভ মনভব নয আচরণ কর 
Test 2
[1mIn:[0m একদিন প্রাতে বৈদ্যনাথের মার্বলমণ্ডিত দালানে একটি স্থূলোদর সন্ন্যাসী দুইসের মোহ