## Testing the ngram model

**We support:**
- n-grams of different lengths
- word vectors weighted with tf-idf
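
To make the two points above concrete, here is a minimal standard-library sketch of n-gram extraction with tf-idf weighting. It is not the code in `src.models.ngram`: the function names `char_ngrams` and `tfidf` are hypothetical, the use of character-level n-grams is an assumption, and the smoothed idf formula is one common choice among several.

```python
import math
from collections import Counter


def char_ngrams(text, ngram_range=(2, 3)):
    """Return every character n-gram of `text` for n in the inclusive range."""
    low, high = ngram_range
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(low, high + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf(documents, ngram_range=(2, 3)):
    """Map each document to {ngram: term frequency * smoothed idf}."""
    counts = [Counter(char_ngrams(doc, ngram_range)) for doc in documents]
    n_docs = len(documents)
    # Iterating over a Counter yields each key once, so this counts documents.
    doc_freq = Counter(gram for c in counts for gram in c)
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


# Illustrative values only, not taken from the dataset used below.
weights = tfidf(["martin", "marie", "2101-03-14"])
```

An n-gram shared by several values (here `ma`) receives a lower weight than one that is specific to a single value (here `ti`), which is the effect tf-idf weighting is meant to produce.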

**We need:**
- Agreement on the column classes we want (currently: `FIRSTNAME, NAME, ADDRESS, CITY, DATE, CODE, ID, STRING`)
- A way to reduce the number of features (perhaps based on the total number of occurrences per feature)
- A column-level view that exploits redundancy to better separate STRING/FIRSTNAME/NAME/CODE, for example using the number of distinct items in a column
- More features: number of words, ending pattern (whether a value ends with a vowel), vowel/consonant ratio
- Within this column-level view, use of the column name to confirm or refine the classification
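
The feature ideas in the list above could be prototyped roughly as follows. This is a sketch under stated assumptions, not existing project code: every function name is hypothetical, `y` is treated as a vowel, the vowel/consonant ratio is replaced by a bounded vowel share to avoid dividing by zero, and pruning by total occurrence count is only one possible reading of the feature-reduction item.

```python
from collections import Counter

VOWELS = set("aeiouy")  # assumption: 'y' is counted as a vowel


def value_features(value):
    """Hand-crafted features for a single cell value."""
    letters = [ch for ch in value.lower() if ch.isalpha()]
    n_vowels = sum(ch in VOWELS for ch in letters)
    return {
        "n_words": len(value.split()),
        # Looks at the last alphabetic character, ignoring digits/punctuation.
        "ends_with_vowel": bool(letters) and letters[-1] in VOWELS,
        # Bounded stand-in for the vowel/consonant ratio.
        "vowel_share": n_vowels / len(letters) if letters else 0.0,
    }


def column_features(values):
    """Aggregate per-value features over a whole column."""
    if not values:
        raise ValueError("column must contain at least one value")
    per_value = [value_features(v) for v in values]
    n = len(values)
    return {
        # Low for code-like columns that repeat a few items.
        "distinct_ratio": len(set(values)) / n,
        "mean_n_words": sum(f["n_words"] for f in per_value) / n,
        "ends_with_vowel_share": sum(f["ends_with_vowel"] for f in per_value) / n,
        "mean_vowel_share": sum(f["vowel_share"] for f in per_value) / n,
    }


def prune_rare(feature_counts, min_total):
    """Keep only features whose total count across columns reaches min_total."""
    total = Counter()
    for counts in feature_counts:
        total.update(counts)
    return {feature for feature, count in total.items() if count >= min_total}
```

For instance, `column_features(["M", "F", "M", "F"])` yields a `distinct_ratio` of 0.5, whereas a column of mostly unique names would be close to 1.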


In [1]:
import sys
sys.path.append('..')  # Add the parent folder so that modules can be imported from it

In [2]:
import src.loader as loader
import src.models.ngram as ngram
from ipywidgets import FloatProgress

  """)
  from numpy.core.umath_tests import inner1d


In [3]:
source = [
    ('firstname', 'firstnames.firstname', 100),
    
    ('name', 'names.name', 100),
    
    ('code', 'patients.gender', 10),
    ('code', 'admissions.marital_status', 10),
    ('code', 'admissions.religion', 10),
    ('code', 'admissions.insurance', 10),
    ('code', 'admissions.admission_location', 10),
    ('code', 'prescriptions.drug_type', 30),
    ('code', 'prescriptions.dose_unit_rx', 20),
    
    ('date', 'prescriptions.startdate', 90),
    ('date', 'admissions.admittime', 10),
    
    ('id', 'admissions.hadm_id', 10),
    ('id', 'admissions.subject_id', 10),
    ('id', 'prescriptions.subject_id', 80),
    
    ('address', 'addresses.road', 100),
    
    ('city', 'addresses.city', 100)
]

In [4]:
dataset = []
labels = []
for column in source:
    if len(column) >= 3:
        label, column_name, nb_datasets = column
    else:
        # Entries without an explicit count default to a single dataset
        label, column_name = column
        nb_datasets = 1
    dataset.append((column_name, nb_datasets))
    labels += [label.upper()] * nb_datasets

In [5]:
%%time

max_value = len(labels)
bar = FloatProgress(min=0, max=max_value)
display(bar)

columns = loader.fetch_columns(dataset, dataset_size=100, load_bar=bar)

FloatProgress(value=0.0, max=700.0)

CPU times: user 856 ms, sys: 63.3 ms, total: 920 ms
Wall time: 2.03 s


In [6]:
clf = ngram.NGramClassifier()
X_train, y_train, X_test, y_test = clf.preprocess(columns, labels)

In [7]:
%%time

clf.fit(X_train, y_train, ngram_range=(2, 3))
y_pred = clf.predict(X_test)
clf.score(y_pred, y_test)

(560, 13561)
FIRSTNAME           	27/27	   100.0% 	(FP:0)
NAME                	23/23	   100.0% 	(FP:0)
CODE                	23/23	   100.0% 	(FP:0)
DATE                	15/15	   100.0% 	(FP:0)
ID                  	16/16	   100.0% 	(FP:0)
ADDRESS             	15/15	   100.0% 	(FP:0)
CITY                	21/21	   100.0% 	(FP:0)
SCORE 140/140 :   100.0%
CPU times: user 7.24 s, sys: 1.41 s, total: 8.65 s
Wall time: 8.68 s
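
For readers without access to `NGramClassifier.score`, a report in the same shape as the output above (correct/total per class, plus false positives) could be computed along these lines. The function `per_class_report` is hypothetical, and the labels in the example are made up for illustration; they are not results from this notebook.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (correct, total, false_positives)} for each true label."""
    pairs = list(zip(y_true, y_pred))
    total = Counter(y_true)
    correct = Counter(t for t, p in pairs if t == p)
    # A false positive for a class is a wrong prediction of that class.
    false_pos = Counter(p for t, p in pairs if t != p)
    return {label: (correct[label], total[label], false_pos[label]) for label in total}


# Made-up labels, for illustration only.
y_true = ["NAME", "NAME", "CITY", "CODE"]
y_pred = ["NAME", "CITY", "CITY", "CODE"]
report = per_class_report(y_true, y_pred)
for label, (ok, n, fp) in report.items():
    print(f"{label:<20}\t{ok}/{n}\t{100 * ok / n:6.1f}%\t(FP:{fp})")
```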
