# Installing the required packages

## Dale-Chall Word List

https://www.readabilityformulas.com/articles/dale-chall-readability-word-list.php
1. The Dale-Chall Word List contains approximately three thousand familiar words that at least 80 percent of children in Grade 5 recognize in reading.
2. Scores based on the list correlate significantly with reading difficulty.
3. It is not intended as a list of the most important words for children or adults. 
4. It includes words that are relatively unimportant and excludes some important ones.

In [6]:
# remove old .py files, then download dale_chall.py from the shell
! rm -f *.py
! wget https://raw.githubusercontent.com/artificial-intelligence-ml-cti/ml_cti/main/proiect/dale_chall.py

--2021-11-02 15:26:25--  https://raw.githubusercontent.com/artificial-intelligence-ml-cti/ml_cti/main/proiect/dale_chall.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27456 (27K) [text/plain]
Saving to: ‘dale_chall.py’


2021-11-02 15:26:25 (128 MB/s) - ‘dale_chall.py’ saved [27456/27456]

/content


## Installing packages with pip

In [1]:
!pip install pyphen nltk pandas scikit-learn

Collecting pyphen
  Downloading pyphen-0.11.0-py3-none-any.whl (2.0 MB)

## Installing the nltk resources

In [None]:
# nltk needs the resources for wordnet and the tokenizer to be downloaded first
import nltk
nltk.download('punkt')
nltk.download('wordnet')

# Downloading the data

In [10]:
! rm -rf data*
! wget https://github.com/artificial-intelligence-ml-cti/ml_cti/raw/main/proiect/data.zip
! unzip "data.zip"

! echo "***\n The files are: "
! ls data/
! echo "****\n The path to the data directory is: "
! readlink -f data/

--2021-11-02 15:36:47--  https://github.com/artificial-intelligence-ml-cti/ml_cti/raw/main/proiect/data.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/artificial-intelligence-ml-cti/ml_cti/main/proiect/data.zip [following]
--2021-11-02 15:36:47--  https://raw.githubusercontent.com/artificial-intelligence-ml-cti/ml_cti/main/proiect/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 741506 (724K) [application/zip]
Saving to: ‘data.zip’


2021-11-02 15:36:48 (39.3 MB/s) - ‘data.zip’ saved [741506/741506]

Archive:  data.zip
   creating: data/
  inflating: data/test.xlsx          
  infla

# Base code for the project

In [8]:
import pyphen
import numpy as np
import pandas as pd

from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

from dale_chall import DALE_CHALL


## Reading the data into dataframes

In [13]:
dtypes = {"sentence": "string", "token": "string", "complexity": "float64"}
train = pd.read_excel('/content/data/train.xlsx', dtype=dtypes, keep_default_na=False)
test = pd.read_excel('/content/data/test.xlsx', dtype=dtypes, keep_default_na=False)

print('train data: ', train.shape)
print('test data: ', test.shape)

train data:  (7662, 4)
test data:  (1338, 3)


## Generating word-structure features for the target word

In [None]:
def get_word_structure_features(word):
    features = []
    features.append(nr_syllables(word))
    features.append(is_dale_chall(word))
    features.append(length(word))
    features.append(nr_vowels(word))
    features.append(is_title(word))
    return np.array(features)
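
The helpers called in `get_word_structure_features` are not defined in this section. The block below is a minimal sketch of what they might look like, not the project's actual implementation. `FAMILIAR_WORDS` is a hypothetical stand-in for the `DALE_CHALL` list imported from `dale_chall.py`, and `nr_syllables` uses a rough vowel-run count to stay dependency-free; a hyphenation-based count with `pyphen` would be a natural replacement in the notebook.

```python
# Hypothetical stand-in for the DALE_CHALL list imported from dale_chall.py;
# in the notebook, pass DALE_CHALL itself as `familiar_words`.
FAMILIAR_WORDS = {"cat", "dog", "house", "water"}

VOWELS = set("aeiou")


def length(word):
    """Number of characters in the word."""
    return len(word)


def nr_vowels(word):
    """Number of vowel letters (a, e, i, o, u), case-insensitive."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def is_title(word):
    """1 if the word is title-cased (e.g. a proper noun), else 0."""
    return int(word.istitle())


def is_dale_chall(word, familiar_words=FAMILIAR_WORDS):
    """1 if the lowercased word is in the familiar-word list, else 0."""
    return int(word.lower() in familiar_words)


def nr_syllables(word):
    """Rough syllable estimate: count runs of consecutive vowels (y included)."""
    count = 0
    prev_vowel = False
    for ch in word.lower():
        is_vowel = ch in VOWELS or ch == "y"
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    # Every word is treated as having at least one syllable.
    return max(count, 1)
```

All five helpers return plain numbers, so their results can be appended to the feature list as they are.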

## Generating WordNet features for the target word

In [None]:
def get_wordnet_features(word):
    features = []
    features.append(synsets(word))
    return np.array(features)
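
`synsets` is not defined in this section either. One plausible reading is that it returns the number of WordNet synsets for the word, which can be sketched as below. The `lookup` parameter is a hypothetical hook added only so the function can be tried without the WordNet corpus; by default it falls back to `wordnet.synsets` from nltk.

```python
def synsets(word, lookup=None):
    """Number of WordNet synsets found for `word`.

    `lookup` is a hypothetical hook: any callable mapping a word to a list
    of senses. When omitted, nltk's wordnet.synsets is used.
    """
    if lookup is None:
        # Imported lazily; requires nltk.download('wordnet') to have run.
        from nltk.corpus import wordnet
        lookup = wordnet.synsets
    return len(lookup(word))
```

In the notebook, `synsets(word)` with the default lookup is enough. Words with many senses tend to be frequent, so this count may act as a rough familiarity signal.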

## Calling the feature-generation functions

In [None]:
def featurize(row):
    word = row['token']
    all_features = []
    all_features.extend(corpus_feature(row['corpus']))
    all_features.extend(get_word_structure_features(word))
    all_features.extend(get_wordnet_features(word))
    return np.array(all_features)
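
`featurize` also relies on `corpus_feature`, which is not shown here. Because its result is passed to `extend`, it has to return a sequence; a one-hot encoding of the corpus name is one option. The sketch below assumes three corpus labels (`bible`, `biomed`, `europarl`). These names are an assumption and should be checked against `train['corpus'].unique()`.

```python
# Assumed corpus labels; verify against train['corpus'].unique().
CORPORA = ("bible", "biomed", "europarl")


def corpus_feature(corpus, corpora=CORPORA):
    """One-hot encode the corpus name; unknown names map to all zeros."""
    name = corpus.strip().lower()
    return [1.0 if name == c else 0.0 for c in corpora]
```

A one-hot vector is used rather than a single integer code because a distance-based model such as k-nearest neighbours would otherwise treat the codes as ordered.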

In [None]:
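def featurize_df(df):
    nr_of_features = len(featurize(df.iloc[0]))
    nr_of_examples = len(df)
    features = np.zeros((nr_of_examples, nr_of_features))
    # enumerate gives positional row numbers, so this also works when
    # df.index is not 0..n-1 (e.g. after filtering or shuffling)
    for position, (_, row) in enumerate(df.iterrows()):
        features[position, :] = featurize(row)
    return features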

## Generating features for the training set

In [None]:
X_train = featurize_df(train)
y_train = train['complex'].values

## Generating features for the test set

In [None]:
X_test = featurize_df(test)

In [None]:
for nb in range(1, 8, 2):
    model = KNeighborsClassifier(n_neighbors=nb)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
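
The imports include `balanced_accuracy_score`, which averages the recall obtained on each class. Judging by the shapes printed earlier, the test set appears to have no label column, so scoring a given `n_neighbors` would need a held-out part of the training data. As a dependency-free illustration of the metric itself, here is a sketch; the function name `balanced_accuracy` is hypothetical.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of examples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)
```

Unlike plain accuracy, this score does not reward a model for always predicting the majority class, which matters if complex words are the minority.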