## Testing the ngram model

**We support:**
- n-grams of different lengths
- word vectors weighted with tf-idf
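
As a rough illustration of the two points above, here is a minimal pure-Python sketch of n-gram extraction with tf-idf weighting. It assumes character-level n-grams and an unsmoothed tf-idf; the helper names `char_ngrams` and `tfidf` are hypothetical, and the actual `ngram` module may work differently.

```python
import math
from collections import Counter


def char_ngrams(text, ngram_range=(2, 3)):
    """Return all character n-grams of `text` for n in the inclusive range."""
    low, high = ngram_range
    return [text[i:i + n]
            for n in range(low, high + 1)
            for i in range(len(text) - n + 1)]


def tfidf(documents, ngram_range=(2, 3)):
    """Map each document to a {ngram: tf-idf weight} dictionary."""
    counts = [Counter(char_ngrams(doc, ngram_range)) for doc in documents]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(documents)
    weights = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero on empty input
        weights.append({gram: (freq / total) * math.log(n_docs / df[gram])
                        for gram, freq in c.items()})
    return weights
```

With this weighting, an n-gram that appears in every document gets a weight of zero, so only n-grams that help tell documents apart contribute.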

**We need:**
- To settle on the column classes we want (currently `FIRSTNAME, NAME, ADDRESS, CITY, DATE, CODE, ID, STRING`)
- A way to reduce the number of features (maybe by total number of occurrences per feature)
- To move to a column-level view to better distinguish STRING/FIRSTNAME/NAME/CODE by exploiting redundancy, e.g. the number of distinct items in a column
- To add more features: number of words, ending pattern (ends with a vowel), vowel/consonant ratio
- In this column-level view, to use the column name to confirm or refine the classification.
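
The column-level features listed above could look something like the following sketch. Everything here is an assumption for illustration: the function name `column_features`, the feature names, and the choice to treat `y` as a vowel. Accented letters are counted as consonants in this simplified version.

```python
VOWELS = set("aeiouy")  # assumption: "y" is treated as a vowel


def column_features(values):
    """Compute a few column-level features from a list of string cells."""
    values = [v for v in values if v]  # ignore empty cells
    if not values:
        return {"distinct_ratio": 0.0, "mean_words": 0.0,
                "vowel_ending_ratio": 0.0, "vowel_consonant_ratio": 0.0}
    letters = [ch for v in values for ch in v.lower() if ch.isalpha()]
    n_vowels = sum(ch in VOWELS for ch in letters)
    n_consonants = len(letters) - n_vowels
    return {
        # Low values indicate redundancy, as expected for CODE columns.
        "distinct_ratio": len(set(values)) / len(values),
        "mean_words": sum(len(v.split()) for v in values) / len(values),
        "vowel_ending_ratio":
            sum(v[-1].lower() in VOWELS for v in values) / len(values),
        "vowel_consonant_ratio":
            n_vowels / n_consonants if n_consonants else 0.0,
    }
```

For example, a gender column such as `["M", "F", "M", "F"]` gives a `distinct_ratio` of 0.5, while a column of unique names gives 1.0.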


In [1]:
import loader
import ngram
import numpy as np
from ipywidgets import FloatProgress
from IPython.display import display

In [2]:
source = [
    ('firstname', 'firstnames.firstname', 100),
    
    ('name', 'names.name', 100),
    
    ('code', 'patients.gender', 10),
    ('code', 'admissions.marital_status', 10),
    ('code', 'admissions.religion', 10),
    ('code', 'admissions.insurance', 10),
    ('code', 'admissions.admission_location', 10),
    ('code', 'prescriptions.drug_type', 30),
    ('code', 'prescriptions.dose_unit_rx', 20),
    
    ('date', 'prescriptions.startdate', 90),
    ('date', 'admissions.admittime', 10),
    
    ('id', 'admissions.hadm_id', 10),
    ('id', 'admissions.subject_id', 10),
    ('id', 'prescriptions.subject_id', 80),
    
    ('address', 'addresses.road', 100),
    
    ('city', 'addresses.city', 100)
]
# This reassignment overrides the full list above with a smaller sample.
source = [
    ('firstname', 'firstnames.firstname', 20),
    ('name', 'names.name', 20),
    ('address', 'addresses.road', 20),
    ('code', 'patients.gender', 10),
    ('code', 'admissions.marital_status', 10),
    ('city', 'addresses.city', 20),
    ('id', 'admissions.hadm_id', 10),
    ('id', 'admissions.subject_id', 10),
    ('date', 'prescriptions.startdate', 10),
    ('date', 'admissions.admittime', 10)
]

In [3]:
dataset = []
labels = []
for column in source:
    if len(column) >= 3:
        label, column_name, nb_datasets = column
    else:
        # Entries without an explicit count default to a single dataset
        label, column_name = column
        nb_datasets = 1
    dataset.append((column_name, nb_datasets))
    labels += [label.upper()] * nb_datasets

In [4]:
%%time

max_value = len(labels)
bar = FloatProgress(min=0, max=max_value)
display(bar)

columns = loader.fetch_columns(dataset, limit=100, load_bar=bar)

CPU times: user 437 ms, sys: 243 ms, total: 680 ms
Wall time: 19.7 s


In [10]:
clf = ngram.NGramClassifier()
X_train, y_train, X_test, y_test = clf.preprocess(columns, labels)

In [11]:
clf.fit(X_train, y_train, ngram_range=(2, 3))
y_pred = clf.predict(X_test)
clf.score(y_pred, y_test)

FIRSTNAME           	3/3	   100.0% 	(FP:0)
NAME                	3/3	   100.0% 	(FP:0)
ADDRESS             	5/5	   100.0% 	(FP:0)
CODE                	1/2	   50.0% 	(FP:0)
CITY                	9/9	   100.0% 	(FP:0)
ID                  	1/1	   100.0% 	(FP:1)
DATE                	5/5	   100.0% 	(FP:0)
SCORE 27/28 :   96.43%
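
The report above lists, for each class, the number of correct predictions out of the total, plus a false-positive count. A minimal sketch of how such a report might be computed follows. The function `per_class_report` is hypothetical and not the actual `NGramClassifier.score` implementation; the argument order simply mirrors the `clf.score(y_pred, y_test)` call.

```python
from collections import Counter


def per_class_report(y_pred, y_true):
    """Return ({label: (correct, total, false_positives)}, accuracy)."""
    total = Counter(y_true)
    correct = Counter(t for p, t in zip(y_pred, y_true) if p == t)
    # A false positive for a label is a wrong prediction of that label.
    false_pos = Counter(p for p, t in zip(y_pred, y_true) if p != t)
    report = {label: (correct[label], total[label], false_pos[label])
              for label in total}
    accuracy = sum(correct.values()) / len(y_true) if y_true else 0.0
    return report, accuracy
```

Read this way, the single miss in the run above (CODE at 1/2, with one false positive on ID) corresponds to one CODE column being predicted as ID.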
