<a href="https://colab.research.google.com/github/bucuram/foundations-of-NLP-labs/blob/main/Lab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Preprocessing for non-English languages

## Challenges
* Diacritics restoration
* Text segmentation for Chinese, Japanese, Arabic
* Language-specific knowledge is often needed to choose the right preprocessing steps

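Chinese and Japanese are written without spaces between words, so a tokenizer cannot simply split on whitespace. Below is a minimal sketch of one classic dictionary-based approach, forward maximum matching. The `VOCAB` set is a tiny hypothetical dictionary made up for this illustration; real segmenters rely on much larger lexicons and usually on statistical models.

```python
# Hypothetical toy dictionary; a real segmenter would use a large lexicon.
VOCAB = {"我", "喜欢", "自然", "语言", "自然语言", "处理", "自然语言处理"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def forward_max_match(text, vocab=VOCAB, max_len=MAX_WORD_LEN):
    """Greedily take the longest dictionary word starting at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character if no dictionary word matches.
            if candidate in vocab or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens

print(forward_max_match("我喜欢自然语言处理"))
```

Greedy matching is easy to follow but picks the wrong split whenever the longest word is not the intended one, which is one reason dedicated tools are preferred in practice.
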




## Case studies

### Romanian


In [None]:
from pprint import pprint

romanian_text = '''Examenul Chunin începe 25. A zecea întrebare: Totul sau nimic!
    Excepţie a făcut-o Hotinul, unde s-a aşezat o garnizoană turcească, care a început să jefuiască cu cruzime ţara.
    Excitarea catatonică este o stare de agitaţie constantă şi de excitare motrică şi nervoasă.
    Excreţiile parazitului irită pielea prpducând mâncărime şi răni produse de scărpinat, iau naştere papule, vezicule, pustule, infiltraţii, acestea prin infecţii secundare, se pot transforma în furunculi.
    Executarea tabloului este precedată de o întreagă serie de schiţe.
    Execuţia acestei lucrări a început în anul 1957 şi s-a finalizat în anul urmator.
    Exemple de mesaje au fost eliminarea unui al doilea membru al tribului după ce a fost eliminat unul sau evitarea participării la Consiliul Tribal, dar cu preţul mutării intr-o locaţie mai puţin confortabilă.
    Exemple de specii din grupul calmarilor pot servi Loligo vulgaris, Ommastrephes sagittatus, Ommastrephes slosnei pacificus, Chiroteuthis veranyi etc. Unele specii sunt exploatate pentru carnea lor comestibilă
    '''
pprint(romanian_text)

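Romanian is a case where knowing the language helps: the standard letters are ș and ț (comma below, U+0219 and U+021B), but many older texts encode them with the cedilla variants ş and ţ (U+015F and U+0163). Below is a minimal sketch of unifying the two, assuming you want to keep diacritics rather than strip them. The `normalize_romanian` helper is a hypothetical name introduced for this illustration.

```python
# Map legacy cedilla letters to the standard comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})

def normalize_romanian(text):
    """Replace cedilla s/t with their comma-below counterparts."""
    return text.translate(CEDILLA_TO_COMMA)

sample = "Excep\u0163ie a f\u0103cut-o Hotinul, unde s-a a\u015fezat o garnizoan\u0103."
print(normalize_romanian(sample))
```

Without a step like this, the same word can appear under two different code point sequences, which splits its counts in any frequency-based model.
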
Lowercasing text

In [None]:
romanian_text = romanian_text.lower()
pprint(romanian_text)

Removing digits

In [None]:
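import re

# Remove each run of digits together with the space before it.
# The raw string keeps \d from being read as a string escape.
romanian_text = re.sub(r' \d+', '', romanian_text)
pprint(romanian_text)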

Removing diacritics using [Unidecode](https://pypi.org/project/Unidecode/), which transliterates Unicode text into ASCII.

In [None]:
!pip install unidecode

In [None]:
import unidecode

romanian_text = unidecode.unidecode(romanian_text)
pprint(romanian_text)

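If installing an extra package is not an option, a similar effect can be sketched with the standard library: decompose each character with NFKD and drop the combining marks. This is an alternative to Unidecode, not what the cell above does, and unlike Unidecode it leaves characters without a decomposition (for example `ø` or `ß`) unchanged. The `strip_diacritics` helper is a hypothetical name.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("ţară şi mâncărime"))
```
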
Sentence tokenization

In [None]:
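import nltk
from nltk import sent_tokenize

nltk.download('punkt')
# Newer NLTK releases load the sentence tokenizer from 'punkt_tab' instead
nltk.download('punkt_tab')

# sent_tokenize defaults to the English Punkt model
sent_romanian = sent_tokenize(romanian_text)
sent_romanian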

Word tokenization

In [None]:
from nltk import word_tokenize
import string


# Strip ASCII punctuation from every sentence
sent_romanian = [sent.translate(str.maketrans('', '', string.punctuation)) for sent in sent_romanian]

# One list of word tokens per sentence
words_romanian = [word_tokenize(sent) for sent in sent_romanian]
pprint(words_romanian)

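Stripping every character in `string.punctuation` also deletes the hyphens that join Romanian clitics, so `s-a` becomes `sa` and `facut-o` becomes `facuto`. Below is a minimal sketch of an alternative that keeps hyphenated forms together, using a regular expression instead of `word_tokenize`. The pattern and the `tokenize_keep_hyphens` name are assumptions for illustration, not part of NLTK.

```python
import re

# One or more word characters, optionally followed by hyphen-joined parts.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")

def tokenize_keep_hyphens(sentence):
    """Return word tokens, keeping forms such as 's-a' intact."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenize_keep_hyphens("exceptie a facut-o hotinul, unde s-a asezat o garnizoana."))
```

Whether `s-a` should stay one token or be split into its parts depends on the downstream task, so treat this as one option rather than a rule.
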
spaCy on Romanian text

### Japanese

[Japanese punctuation](https://en.wikipedia.org/wiki/Japanese_punctuation)

In [15]:
japanese_text = '''$100ベットしていた場合には$100の配当を得られる。
    なお、100形のうち101 - 110の10両は当時丸ノ内線方南町支線用であり 、これの代替には1500N形投入により2000形を10両 (2031 - 2040) 捻出して対応している。
    コンセプトは「100時間遊べるおまけ」。
    なお100系1･2次車の床面高さは従前通りの1150mmである。
    ゴチ10・12・15ではMCを担当。
    ! 101番目の魔物 （ 大海恵 ） 2005年 * 劇場版 金色のガッシュベル!
    1088年に誕生した人物及び著名な動物 。'''

Removing digits

In [None]:
# Keep every character that is not a digit
japanese_text = ''.join(c for c in japanese_text if not c.isdigit())
japanese_text

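Japanese text often mixes full-width and half-width variants of the same character, for example the middle dot as `・` and as the half-width `･`. Below is a minimal sketch of unifying these with Unicode NFKC normalization before removing digits. This step is an addition for illustration and is not part of the cells in this notebook; `normalize_width` and `remove_digits` are hypothetical helper names.

```python
import unicodedata

def normalize_width(text):
    """Fold full-width/half-width compatibility variants with NFKC."""
    return unicodedata.normalize("NFKC", text)

def remove_digits(text):
    """Drop every character that str.isdigit() reports as a digit."""
    return "".join(c for c in text if not c.isdigit())

raw = "１００系1･2次車（テスト）"
print(remove_digits(normalize_width(raw)))
```

Note that NFKC also turns full-width punctuation such as `（` into its ASCII counterpart, which may or may not be what a later step expects.
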
Installing some dependencies for Japanese

In [None]:
!pip install mecab-python3

In [None]:
!pip install fugashi[unidic-lite]

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download ja_core_news_sm

Removing stopwords

In [None]:
import spacy

nlp = spacy.load('ja_core_news_sm')
stop_words_spacy = nlp.Defaults.stop_words
print(len(stop_words_spacy))
print(stop_words_spacy)

In [None]:
tokenized_text_spacy = nlp(japanese_text)
# Compare the token text, not the Token object, against the stop-word set
tokenized_text_without_stopwords = [token for token in tokenized_text_spacy if token.text not in stop_words_spacy]
print(tokenized_text_without_stopwords)

Removing punctuation

In [None]:
# Tokens that spaCy flags as punctuation
for word in tokenized_text_without_stopwords:
    if word.is_punct:
        print(word, word.lemma_, word.pos_)

In [None]:
# Remaining tokens once punctuation is filtered out
for word in tokenized_text_without_stopwords:
    if not word.is_punct:
        print(word, word.lemma_, word.pos_)

# Assignment

To be uploaded here: https://forms.gle/wLsjCWnxK7w8GPvt9

Choose samples from 2 languages and preprocess the texts (normalization, tokenization, lemmatization, etc.).

Try to use spaCy or find other resources and tools for the chosen languages.

Also mention whether the languages you have chosen need specific preprocessing.

**Please add the resources you used to this doc**: https://docs.google.com/document/d/1c5sqPfgSioGzLZkRv4yWw7DWiiQC96QHw34ZRqcDiSY/edit?usp=sharing for further reference.


Questions: If you have chosen a language you are fluent in, how well did the tools work on this language? What problems did you observe (e.g. problems with tokenization)?

## Data

Data can be downloaded from the resources below.


In [None]:
###write the code for your assignment here

Resources:

* [An Crúbadán - Corpus Building for Minority Languages - 18721 languages](http://crubadan.org/)
* [Leipzig Corpora Collection / Deutscher Wortschatz 291 languages](https://wortschatz.uni-leipzig.de/en/download)
* [Europarl](https://www.statmt.org/europarl/)

Tools:
* [Chinese text segmentation](https://github.com/fxsjy/jieba)

Further reading:

* [A Survey of Approaches to Diacritic Restoration](https://www.researchgate.net/profile/Franklin-Asahiah/publication/328419851_A_Survey_of_Approaches_to_Diacritic_Restoration/links/5bcd8b67458515f7d9d02f3d/A-Survey-of-Approaches-to-Diacritic-Restoration.pdf)
* [Preprocessing Arabic text on social media](https://www.sciencedirect.com/science/article/pii/S2405844021002966)
* [Semantic-Based Segmentation of Arabic Texts](https://scialert.net/fulltext/?doi=itj.2008.1009.1015)
* [The case of Croatian](https://medium.com/krakensystems-blog/text-processing-problems-with-non-english-languages-82822d0945dd)






