# Word Complexity Prediction - Using spaCy

The complexity of a word is a subjective criterion that depends on many factors: how often the word occurs in speech, how long or hard to read it is, whether it is a specialized term, its morphological form, and its function in the syntax of the sentence. Based on these ideas, we can define functions that extract features.
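
As a starting point, the surface-level factors mentioned above (length, vowels, readability) can be sketched as plain Python functions. This is a minimal sketch under stated assumptions: the vowel inventory `VOWELS`, the function names, and the "one syllable per run of vowel letters" heuristic are illustrative choices, not part of the assignment, and the heuristic is only approximate for any of the languages in the data.

```python
import re

# Assumed vowel inventory: Latin vowels plus common accented forms.
# "y" is counted as a (semi)vowel; adjust this per language.
VOWELS = "aeiouyàáâãäåèéêëìíîïòóôõöùúûüă"
VOWEL_RUN = re.compile("[" + VOWELS + "]+")


def count_vowels(word: str) -> int:
    """Count every vowel letter in the word."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def approx_syllables(word: str) -> int:
    """Approximate syllables as the number of maximal vowel-letter runs."""
    return max(1, len(VOWEL_RUN.findall(word.lower())))


def surface_features(word: str) -> dict:
    """Collect a few surface features of a single word."""
    return {
        "length": len(word),
        "n_vowels": count_vowels(word),
        "n_syllables": approx_syllables(word),
        "is_title": word.istitle(),
    }
```

For example, `surface_features("river")` gives a length of 5, two vowels, and two approximate syllables.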


You may use the following external resources, provided you get approval from Sergiu (approvals are granted individually):
- additional word lists: the [MRC Psycholinguistic Database](https://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm) contains word-level information; another option is the [Dale-Chall](https://readabilityformulas.com/word-lists/the-dale-chall-word-list-for-readability-formulas/) list
- additional text datasets from which to extract frequencies, for example [AOCHILDES](https://github.com/UIUCLearningLanguageLab/AOCHILDES), which contains "child-directed speech transcripts, ordered by the age of the target child" for English; the [BabyLM](https://babylm.github.io/) task aims to train LLMs on texts that are plausible in children's development
- libraries or APIs that communicate with [WordNet](http://wordnetweb.princeton.edu/perl/webwn?s=dog) or [ConceptNet](https://conceptnet.io/)
- machine translation algorithms (ideally running locally)
- pre-trained networks such as BERT, RoBERTa, XLM-RoBERTa (they must run locally), from which you extract activation vectors, internal values, or word embeddings
- LLMs, which must run locally, from which you can extract activation vectors or internal values of the network


**You may not use:**
- external APIs
- complexity scores extracted by parsing the output an LLM generates for a prompt; from LLMs you may only use the internal vector values


## Feature suggestions

- the frequency of the word in a very large corpus: the wordfreq library can be used here, but ideally you implement a function that extracts frequent words from an arbitrary corpus
- word length
- number of syllables
- number of vowels (semivowels included, without diphthongs)
- whether it is a title or an entity
- number of synsets in WordNet
- number of hypernyms (colour is a hypernym of red)
- number of hyponyms (spoon is a hyponym of cutlery)
- word embedding from spaCy
- number of relations in the sentence's syntax tree
- words and parts of speech from the context
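
The first suggestion, extracting frequencies from an arbitrary corpus, could be sketched as follows. This is a minimal sketch under stated assumptions: the regex tokenizer, the add-one smoothing, and the names `build_frequencies` and `log_relative_frequency` are illustrative choices, not a prescribed method. The corpus is assumed to be any iterable of strings.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode word characters, lowercased.
TOKEN_RE = re.compile(r"\w+")


def build_frequencies(texts) -> Counter:
    """Count lowercase word occurrences over an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def log_relative_frequency(word: str, counts: Counter, smoothing: float = 1.0) -> float:
    """Smoothed log relative frequency; unseen words get a finite, low score."""
    total = sum(counts.values())
    vocab_size = len(counts)
    numerator = counts[word.lower()] + smoothing
    denominator = total + smoothing * (vocab_size + 1)
    return math.log(numerator / denominator)


corpus = ["The river runs", "the river"]
counts = build_frequencies(corpus)
```

For a large corpus, the total count would be computed once and reused rather than recomputed on every call. The resulting score can be used alongside, or instead of, a precomputed frequency list.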

## Get data

- comment out the data download if running on Kaggle

In [1]:
! mkdir -p ../input/predictia-complexitatii-cuvintelor
! cd ../input/predictia-complexitatii-cuvintelor && \
 wget https://github.com/curs-ia-2024/proiect/releases/download/data/train.csv && \
 wget https://github.com/curs-ia-2024/proiect/releases/download/data/test.csv

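
Instead of commenting the download in and out by hand, the download can be skipped whenever the file is already present. The sketch below is one possible way to do that with the standard library; the helper name `download_if_missing` is hypothetical, and the base URL is the one used in the cell above.

```python
import os
import urllib.request

# Release URL taken from the wget commands above.
DATA_URL = "https://github.com/curs-ia-2024/proiect/releases/download/data"


def download_if_missing(filename, dest_dir, base_url=DATA_URL):
    """Download base_url/filename into dest_dir unless it already exists.

    Returns (path, downloaded), where downloaded is False when the
    existing file was reused.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename)
    if os.path.exists(dest):
        return dest, False
    urllib.request.urlretrieve(f"{base_url}/{filename}", dest)
    return dest, True


# Usage (commented out so this sketch makes no network calls on import):
# for name in ("train.csv", "test.csv"):
#     download_if_missing(name, "../input/predictia-complexitatii-cuvintelor")
```

On Kaggle the files already exist under `../input`, so the function returns without touching the network.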

## Load Data, Make a Random Submission

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

../input/predictia-complexitatii-cuvintelor/test.csv
../input/predictia-complexitatii-cuvintelor/train.csv


In [3]:
! pwd

/content


In [4]:
import os
BASE_DIR = '../input/predictia-complexitatii-cuvintelor'
TRAIN_PATH = os.path.join(BASE_DIR, 'train.csv')
TEST_PATH = os.path.join(BASE_DIR, 'test.csv')

In [5]:
import pandas as pd

train = pd.read_csv(TRAIN_PATH)
train

| | cur_id | language | sentence | word | complexity |
|---|---|---|---|---|---|
| 0 | 0 | english | Behold, there came up out of the river seven c... | river | 0.000000 |
| 1 | 1 | english | I am a fellow bondservant with you and with yo... | brothers | 0.000000 |
| 2 | 2 | english | The man, the lord of the land, said to us, 'By... | brothers | 0.050000 |
| 3 | 3 | english | Shimei had sixteen sons and six daughters; but... | brothers | 0.150000 |
| 4 | 4 | english | Moreover Yahweh will deliver Israel also with ... | sons | 0.160714 |
| ... | ... | ... | ... | ... | ... |
| 8628 | 8628 | spanish | Pueden estar colmados de desbarajustes y el bi... | colmados | 0.675000 |
| 8629 | 8629 | spanish | Pueden estar colmados de desbarajustes y el bi... | desbarajustes | 0.800000 |
| 8630 | 8630 | spanish | Y le va a presentar algunos retos personales p... | esperaría | 0.200000 |
| 8631 | 8631 | spanish | Y le va a presentar algunos retos personales p... | retos | 0.450000 |


In [6]:
test = pd.read_csv(TEST_PATH)
test

| | cur_id | language | sentence | word |
|---|---|---|---|---|
| 0 | 8633 | catalan | En el que han coincidit tots els presents és q... | coincidit |
| 1 | 8634 | catalan | Serà molt més fàcil poder-nos comunicar, ha ce... | auditiva |
| 2 | 8635 | catalan | El Síndic de Greuges, Rafael Ribó, ha reclamat... | segregació |
| 3 | 8636 | catalan | També demana elaborar materials didàctics per ... | controvertits |
| 4 | 8637 | catalan | Una quinzena de joves han clavat enganxines on... | enganxines |
| ... | ... | ... | ... | ... |
| 5618 | 14251 | spanish | La función de concentración de recursos, tiene... | concentración |
| 5619 | 14252 | spanish | Después surgió la moneda y posteriormente surg... | intercambios |
| 5620 | 14253 | spanish | A éstos se les coloca una fecha posterior al m... | suficientes |
| 5621 | 14254 | spanish | Colisión: Choque de dos cuerpos. Oposición y p... | ahorro |


In [7]:
import numpy as np
import pandas as pd

random_values = np.random.uniform(0, 1, len(test))
random_submission = pd.DataFrame({'cur_id': test.cur_id.values, 'complexity': random_values})
random_submission.to_csv('submission.csv', index=False)

## spaCy library

- check out the models and functionality here: https://spacy.io/models

In [None]:
# better have the latest version
#! pip install -U spacy

In [8]:
# first we must download the models we wish to use
! python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 2.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
import spacy
spacy_lm_names = {
    'english': 'en_core_web_lg'
}

In [10]:
model = spacy.load(spacy_lm_names['english'])
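
Since the data also contains Spanish and Catalan sentences, the single-language mapping above could be extended and the models loaded lazily, one per language. This is a sketch under stated assumptions: the Spanish and Catalan package names are assumptions to verify against https://spacy.io/models (each must be downloaded first, like `en_core_web_lg`), and `get_model` with its `load_fn` parameter is a hypothetical helper, not part of spaCy.

```python
# Assumed mapping from dataset language labels to spaCy packages.
# Only the English entry comes from the notebook; verify the others.
SPACY_MODEL_NAMES = {
    "english": "en_core_web_lg",
    "spanish": "es_core_news_lg",
    "catalan": "ca_core_news_lg",
}

_MODEL_CACHE = {}


def get_model(language, load_fn=None):
    """Return the pipeline for a language, loading it at most once.

    load_fn defaults to spacy.load; it is a parameter so the caching
    logic can be exercised without installing any model.
    """
    if language not in SPACY_MODEL_NAMES:
        raise KeyError(f"No spaCy model configured for language {language!r}")
    if language not in _MODEL_CACHE:
        if load_fn is None:
            import spacy  # imported lazily; only needed for real loading

            load_fn = spacy.load
        _MODEL_CACHE[language] = load_fn(SPACY_MODEL_NAMES[language])
    return _MODEL_CACHE[language]
```

With this in place, rows can be processed per language with `get_model(row.language)` instead of a hard-coded English model.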


In [11]:
position = 2567
sentence = train.sentence.values[position]
word = train.word.values[position]
train.iloc[position]


cur_id                                                     2567
language                                                english
sentence      Diminished response to dDAVP, diminished abund...
word                                                   response
complexity                                             0.194444
Name: 2567, dtype: object

In [12]:
spacy_sentence = model(sentence)
print(spacy_sentence)

Diminished response to dDAVP, diminished abundance of mature glycosylated protein in mutant animals, and the transport of a fraction of mutant protein beyond the ER in MDCK cells are all consistent with the notion that AQP2-F204V misfolding is limited and that it may retain some residual water transporting activity.


In [13]:
interesting_properties = []
for token in spacy_sentence:
    interesting_properties.append({'token': token,
                                   'lwr_text': token.text.lower(),
                                   'pos': token.pos_,
                                   'lemma': token.lemma_})

pd.DataFrame(interesting_properties)

| | token | lwr_text | pos | lemma |
|---|---|---|---|---|
| 0 | Diminished | diminished | VERB | diminish |
| 1 | response | response | NOUN | response |
| 2 | to | to | ADP | to |
| 3 | dDAVP | ddavp | NOUN | ddavp |
| 4 | , | , | PUNCT | , |
| 5 | diminished | diminished | VERB | diminish |
| 6 | abundance | abundance | NOUN | abundance |
| 7 | of | of | ADP | of |
| 8 | mature | mature | ADJ | mature |
| 9 | glycosylated | glycosylated | ADJ | glycosylated |
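
Token attributes like the ones in the table above can be turned into features for the target word. The sketch below is one possible shape for such a function, not a definitive implementation: `find_target_index` and `token_features` are hypothetical helpers that rely only on the token attributes `text`, `pos_`, `is_title`, `ent_type_`, and `children`. The `demo` tokens are hand-made stand-ins with made-up child counts, used so the sketch runs without a loaded model.

```python
from types import SimpleNamespace


def find_target_index(tokens, word):
    """Index of the first token whose lowercase text equals word, else -1."""
    target = word.lower()
    for i, token in enumerate(tokens):
        if token.text.lower() == target:
            return i
    return -1


def token_features(tokens, index, window=1):
    """Features of tokens[index] plus POS tags from a context window."""
    token = tokens[index]
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return {
        "pos": token.pos_,
        "is_title": token.is_title,
        "is_entity": bool(token.ent_type_),
        # number of direct dependents in the dependency tree
        "n_children": len(list(token.children)),
        "context_pos": [t.pos_ for t in left] + [t.pos_ for t in right],
    }


def _stub(text, pos, n_children=0):
    """Stand-in for a spaCy token (illustration only, values are made up)."""
    return SimpleNamespace(text=text, pos_=pos, is_title=text.istitle(),
                           ent_type_="", children=[None] * n_children)


demo = [_stub("Diminished", "VERB"), _stub("response", "NOUN", 2),
        _stub("to", "ADP"), _stub("dDAVP", "NOUN")]
idx = find_target_index(demo, "response")
feats = token_features(demo, idx)
```

With a real parse, the same call would be `token_features(list(spacy_sentence), find_target_index(spacy_sentence, word))`. A word that occurs more than once in the sentence, or that spaCy splits into several tokens, needs extra handling that this sketch does not cover.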


In [15]:
token = spacy_sentence[1]
# this object has so many properties, curious to try them all
dir(token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

In [16]:
# this is a word embedding
token.vector

array([ 1.5235e+00, -2.6049e+00, -2.1225e+00,  2.5392e+00,  3.9386e+00,
       -6.2870e-01, -7.7328e-01,  2.1056e+00, -2.5325e+00, -1.0489e+00,
        4.6560e+00,  1.1335e+00, -3.8840e+00,  3.3985e-01, -2.3117e+00,
        3.8706e+00,  1.6959e+00,  1.3614e+00, -4.2780e+00,  3.7743e-01,
       -3.7904e+00, -1.3665e+00, -3.8358e+00, -3.5943e-01, -4.2350e-01,
       -3.7950e+00,  1.2701e+00, -1.9095e-01, -2.6544e+00, -1.8395e+00,
        3.9027e+00,  3.6811e+00,  1.5446e+00,  5.3012e+00, -1.5729e+00,
       -4.0013e+00, -7.6532e-01, -3.2671e-01,  2.6857e+00,  2.0945e+00,
       -1.9085e+00, -1.5437e+00, -1.2115e+00, -1.5782e-01, -4.3022e+00,
        1.1444e+00,  3.4901e+00, -3.1766e+00,  1.5489e-01,  1.0986e-02,
       -3.6548e+00,  3.2322e+00,  3.5850e+00, -2.3372e+00, -2.9324e+00,
       -5.5599e-01, -2.0590e+00,  1.5786e+00,  1.0854e+00, -3.0508e-01,
        7.6219e-01,  1.2029e+00, -3.8683e+00, -9.6589e-01,  2.4159e+00,
        5.1211e+00, -2.6994e+00, -2.0673e+00,  1.8490e+00,  4.85

In [17]:
token.vector.shape

(300,)
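
A common way to compare two such embeddings is cosine similarity. The sketch below is written in pure Python for clarity and accepts any two equal-length numeric sequences, including `token.vector` arrays; the function name is an illustrative choice. spaCy tokens also offer a built-in `similarity` method, which may be more convenient in practice.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros (e.g. a token without
    an embedding), to avoid dividing by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For instance, `cosine_similarity(spacy_sentence[1].vector, spacy_sentence[6].vector)` would compare the embeddings of "response" and "abundance".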

### TODO:
- add more languages, check which ones are supported by spaCy
- check other interesting properties of the token: https://spacy.io/api/token
- look for .vector property and experiment with word similarity
- introduce features extracted from spacy
- test different models for different languages