## Testing the ngram model

**We support:**
- n-grams of different lengths
- word vectors weighted with tf-idf
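
As a rough illustration of these two points, here is a minimal, standard-library sketch of n-gram extraction followed by tf-idf weighting. It is not the code of the `ngram` module: the helper names `char_ngrams` and `tfidf_vectors` are hypothetical, the use of *character* n-grams is an assumption, and the idf term is the plain `log(N / df)` variant.

```python
import math
from collections import Counter


def char_ngrams(text, ngram_range=(2, 3)):
    """Return all character n-grams of `text` for n in the inclusive range."""
    low, high = ngram_range
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(low, high + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(values, ngram_range=(2, 3)):
    """Map each value to a sparse vector {ngram: tf-idf weight}."""
    counts = [Counter(char_ngrams(v, ngram_range)) for v in values]
    # Document frequency: number of values that contain each n-gram.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(values)
    vectors = []
    for c in counts:
        total = sum(c.values())  # 0 only when c is empty, so no division happens
        vectors.append({
            gram: (freq / total) * math.log(n_docs / df[gram])
            for gram, freq in c.items()
        })
    return vectors


vectors = tfidf_vectors(["Rue des Saules", "Rue Louis Aragon", "YVETTE"])
```

Note that in this sketch a one-character value such as `'F'` yields no n-gram at all when `ngram_range=(2, 3)`, so its vector is empty.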

**We need:**
- To settle on the column classes we want (currently: `FIRSTNAME, NAME, ADDRESS, CITY, DATE, CODE, ID, STRING`)
- A way to reduce the number of features (maybe based on the total number of occurrences per feature)
- To move to a column-level view to better separate STRING/FIRSTNAME/NAME/CODE by exploiting redundancy, e.g. the number of distinct items in a column
- More features: number of words, ending pattern (e.g. ends with a vowel), vowel/consonant ratio
- In this column-level view, to use the column name to confirm or refine the classification.
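
The feature ideas in this list could be prototyped along the following lines. This is only a sketch under stated assumptions: the function names (`value_features`, `column_features`, `prune_features`) are hypothetical and not part of `ngram` or `loader`, `y` is counted as a vowel, and the ratio divides by at least 1 to avoid a division by zero when a value has no consonant.

```python
from collections import Counter

# Assumption: French-oriented vowel set, with 'y' treated as a vowel.
VOWELS = set("aeiouyàâäéèêëîïôöùûü")


def value_features(value):
    """Hand-crafted features for a single cell value."""
    letters = [ch for ch in value.lower() if ch.isalpha()]
    n_vowels = sum(ch in VOWELS for ch in letters)
    n_consonants = len(letters) - n_vowels
    return {
        "n_words": len(value.split()),
        # Looks at the last alphabetic character, ignoring digits and punctuation.
        "ends_with_vowel": bool(letters) and letters[-1] in VOWELS,
        "vowel_consonant_ratio": n_vowels / max(n_consonants, 1),
    }


def column_features(values):
    """Column-level feature: share of distinct values in the column."""
    return {"distinct_ratio": len(set(values)) / len(values) if values else 0.0}


def prune_features(feature_counts, min_count=2):
    """Keep only the features whose total occurrence count reaches min_count."""
    return {feat for feat, n in feature_counts.items() if n >= min_count}


print(value_features("YVETTE"))
print(column_features(["F", "F", "M", "F"]))
print(prune_features(Counter({"ru": 3, "zz": 1})))
```

A gender-like column has a very low distinct ratio while a first-name column has a high one, which is the kind of redundancy the column-level view is meant to exploit.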


In [1]:
import loader
import ngram

In [2]:
source = [
    ('firstname', 'firstnames.firstname', 3),
    ('code', 'patients.gender', 3),
    ('address', 'addresses.road', 3)
]

In [3]:
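dataset = []  # column specs for the loader: everything after the label
labels = {}   # maps 'table.column' to its upper-case class label
for column in source:
    dataset.append(column[1:])
    labels[column[1]] = column[0].upper()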

In [4]:
columns = loader.fetch_columns(dataset, limit=50)

In [5]:
clf = ngram.NGramClassifier()
X_train, y_train, X_test, y_test = clf.preprocess(columns, labels)

In [6]:
X_train, y_train, X_test, y_test

(['F',
  'F',
  'MARIE-THÉRÈSE',
  'Chemin du Ligno',
  'Rue aux Bois',
  'F',
  'TIFFANY',
  'F',
  'Rue André Maurois',
  'F',
  'F',
  'YVETTE',
  'PASCAL',
  'F',
  'M',
  'Impasse des Mouettes',
  'Lotissement le Bréat',
  'F',
  'Chemin des Planasteaux Vc 29',
  'Avenue de Stalingrad',
  'Rue du Mont Gerbault',
  'ESTHER',
  'INÈS',
  'Route des Confins',
  'F',
  'LINDA',
  'BERNADETTE',
  'F',
  'HÉLÈNE',
  'Route des Châtelaines',
  'F',
  'CHRISTIANNE',
  'YANIS',
  'EMMANUEL',
  'Rue des Sablons',
  'M',
  'Route Principale',
  'M',
  'Avenue Mercuria',
  'JADE',
  'M',
  'M',
  'F',
  'ERIKA',
  'Place de la Liberté',
  'M',
  'GILBERTE',
  'M',
  'Avenue des Acacias',
  'CLAUDINE',
  'M',
  'AMANDINE',
  'Avenue des Acacias',
  'ANGEL',
  'Route de Vichy',
  'Chemin des Terres Blanches',
  'YVONNE',
  'Route du Clergeon',
  'F',
  'Rue Louis Aragon',
  'F',
  'Rue Louis Aragon',
  'YASMINA',
  'Chemin des Catalins',
  'Rue des Saules',
  'M',
  'Avenue des États-Unis',
  '

In [7]:
# Train on n-grams of length 2 to 3, then evaluate on the held-out values.
clf.fit(X_train, y_train, ngram_range=(2, 3))
y_pred = clf.predict(X_test)
clf.score(y_pred, y_test)

(800, 1445)
(200, 1445)
FIRSTNAME           	57/78	   73.08% 	(FP:0)
ADDRESS             	69/73	   94.52% 	(FP:0)
CODE                	49/49	   100.0% 	(FP:25)
SCORE 175/200 :   87.5%
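
One plausible reading of this report is: for each true class, the number of correctly predicted values over the number of values of that class, plus the number of values from other classes wrongly predicted as that class (`FP`). The sketch below computes those quantities; `per_class_report` is a hypothetical helper, not the implementation of `clf.score`.

```python
from collections import Counter


def per_class_report(y_pred, y_true):
    """Return ({label: (correct, total, false_positives)}, overall accuracy).

    Only labels that occur in y_true appear in the report.
    """
    correct, total, false_pos = Counter(), Counter(), Counter()
    for pred, true in zip(y_pred, y_true):
        total[true] += 1
        if pred == true:
            correct[true] += 1
        else:
            # A miss on `true` is a false positive for the predicted class.
            false_pos[pred] += 1
    report = {
        label: (correct[label], total[label], false_pos[label])
        for label in total
    }
    accuracy = sum(correct.values()) / len(y_true) if y_true else 0.0
    return report, accuracy


report, accuracy = per_class_report(
    ["A", "B", "B", "B"],  # predictions
    ["A", "A", "B", "B"],  # ground truth
)
```

Under that reading, the 25 false positives on CODE are exactly the 21 FIRSTNAME and 4 ADDRESS misses, so CODE would have a precision of 49 / (49 + 25), about 66%, despite every true CODE value being recovered.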
