# Definitions
**Metamodels are models that take other models as input and construct new models by comparing the performance of the individual models**

* They are often used by data scientists and machine learning experts for optimising models
* They are closely related to the concept of adaptive models
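The definition above can be sketched in a few lines. This is only an illustration under simple assumptions: a "model" is any callable that maps an input to a label, and each candidate is built by a function that receives the training data. The names `accuracy`, `select_best` and the two toy candidates are hypothetical and do not come from any library.

```python
def accuracy(model, test_set):
    """Fraction of (name, label) pairs that the model labels correctly."""
    correct = sum(1 for name, label in test_set if model(name) == label)
    return correct / len(test_set)


def select_best(builders, train_set, test_set):
    """Train every candidate, score it on held-out data and keep the best.

    builders maps a candidate name to a function train_set -> model.
    """
    best_name, best_model, best_score = None, None, -1.0
    for candidate, build in builders.items():
        model = build(train_set)
        score = accuracy(model, test_set)
        if score > best_score:
            best_name, best_model, best_score = candidate, model, score
    return best_name, best_model, best_score


# Toy candidates (illustrative only): both ignore the training data.
builders = {
    "always_female": lambda train: (lambda name: "female"),
    "ends_in_a": lambda train: (
        lambda name: "female" if name.endswith("a") else "male"
    ),
}
data = [("Anna", "female"), ("John", "male"),
        ("Maria", "female"), ("Peter", "male")]
best_name, best_model, best_score = select_best(builders, data[:2], data[2:])
print(best_name, best_score)
```

The metamodel here is `select_best`: it does not classify anything itself, it only builds the candidates, compares them and returns the winner.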

Exercise
Modify the name / gender classifier to make use of a meta model.
For every n, construct a model that classifies on the last n letters of the name, where:
  * n ranges from 3
  * to the total length of the name
Your meta model will then compare the performance of each model and pick the best one.
To make it meta, the model has to build the smaller versions itself:
    * Set the training and test data sets
    * Pick n
    * Run it and test accuracy
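The steps of the exercise can be sketched without NLTK as follows. This is a rough outline, not the notebook's solution: it assumes a very simple majority-vote classifier per suffix instead of naive Bayes, a small hand-written list of names, and it reads "total length of the name" as the length of the longest name. `train_suffix_model`, `meta_model` and `toy_names` are hypothetical names.

```python
import random
from collections import Counter, defaultdict


def train_suffix_model(train_set, n):
    """Learn the most common label for each suffix of the last n letters."""
    counts = defaultdict(Counter)
    for name, label in train_set:
        counts[name[-n:].lower()][label] += 1
    # Fall back to the overall majority label for unseen suffixes.
    default = Counter(label for _, label in train_set).most_common(1)[0][0]
    table = {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}
    return lambda name: table.get(name[-n:].lower(), default)


def meta_model(labelled_names, min_n=3, test_size=2, seed=0):
    """Build one model per n, test each and return (n, model, accuracy)."""
    pairs = list(labelled_names)
    random.Random(seed).shuffle(pairs)          # set the data sets
    test_set, train_set = pairs[:test_size], pairs[test_size:]
    max_n = max(len(name) for name, _ in pairs)
    best = None
    for n in range(min_n, max_n + 1):           # pick n
        model = train_suffix_model(train_set, n)
        hits = sum(1 for name, label in test_set if model(name) == label)
        score = hits / len(test_set)            # run it and test accuracy
        if best is None or score > best[2]:
            best = (n, model, score)
    return best


# Hand-written toy data (assumption, not the NLTK corpus).
toy_names = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
             ("Emma", "female"), ("John", "male"), ("Peter", "male"),
             ("Mark", "male"), ("Oscar", "male")]
best_n, best_model, best_score = meta_model(toy_names)
print(best_n, best_score)
```

Slicing with `name[-n:]` returns the whole name when n exceeds its length, so short names need no special handling in this sketch.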

In [21]:
import nltk
import random
from nltk.corpus import names 

In [22]:
def build_dataset():
    '''Builds the dataset from the NLTK names corpus of male and female names.
    Returns a list of (name, sex) tuples in random order.
    '''
    model_names = [(name, 'male') for name in names.words('male.txt')]
    model_names += [(name, 'female') for name in names.words('female.txt')]
    random.seed(0)  # fixed seed so the results can be recreated
    random.shuffle(model_names)
    return model_names
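The seed matters because the train/test split depends on the shuffled order. A small sketch of the same idea on a hand-written list: `shuffled` and `toy` are hypothetical names, and a local `random.Random` instance is used here (an alternative to `random.seed`) so the global generator is left untouched.

```python
import random


def shuffled(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items


toy = [("Anna", "female"), ("John", "male"),
       ("Maria", "female"), ("Peter", "male")]
first = shuffled(toy, seed=0)
second = shuffled(toy, seed=0)
print(first == second)
```

Both calls produce the same order, so any accuracy measured on a slice of the list can be reproduced on a later run.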

In [23]:
def feature_dict(word, last_letters, word_length=0):
    '''Builds the feature dict.
    Takes in a word, the number of last letters we are interested in and a word length flag.
    Returns a dictionary of features.
    If the word is shorter than last_letters, as many last letters as possible are included.
    If word_length is 0 (i.e. false) or omitted, the word length feature is not included.'''
    features = {}  # avoid shadowing the built-in dict
    if len(word) < last_letters:
        last_letters = len(word)
    features['last_letter(s)'] = word[-last_letters:]
    if word_length:
        features['word_length'] = len(word)
    return features
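For illustration, the calls below show what the function returns for a few inputs. The function body is repeated in compact form only so that the snippet runs on its own, and the sample names are arbitrary.

```python
def feature_dict(word, last_letters, word_length=0):
    """Compact copy of the feature builder, for illustration only."""
    features = {}
    if len(word) < last_letters:
        last_letters = len(word)
    features['last_letter(s)'] = word[-last_letters:]
    if word_length:
        features['word_length'] = len(word)
    return features


# Last two letters only.
print(feature_dict('Anna', 2))     # {'last_letter(s)': 'na'}
# Name shorter than the requested suffix: the whole name is used.
print(feature_dict('Al', 3))       # {'last_letter(s)': 'Al'}
# A truthy third argument adds the name length as a second feature.
print(feature_dict('Anna', 2, 1))  # {'last_letter(s)': 'na', 'word_length': 4}
```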

In [24]:
def build_feature_set(model_names, last_letters, word_length):
    '''Builds the feature set.
    Takes in a list of (name, sex) tuples, the number of last letters we are interested in and a word length flag.
    Returns a list of tuples in the format (feature dict, sex), based on those names.'''
    return [(feature_dict(n, last_letters, word_length), g) for (n, g) in model_names]

In [25]:
def model_1(max_last_letters):
    '''model takes parameter of max last letters to consider.
    1/ builds dataset
    2/ runs naive bayes. 
    Considering last number of letters up to max, on 2 bases:
    -with name length
    -without name length.
    returns best model and its accuracy.
    if dataset includes names shorter than the number of letters being considered, that cycle will be omitted 
    '''
    model_names = build_dataset()
    include_length = [0, 1]
    score = 0
    best = {}
    for letter in range(1, max_last_letters + 1):
        # skip this number of last letters if the data set includes names shorter than that
        if any(len(name) < letter for name, _ in model_names):
            print('exclude {} letters - data set includes shorter names'.format(letter))
            continue
        print('last {} letters'.format(letter))
        for setting in include_length:
            if setting:
                print('include word length')
            else:
                print('don\'t include word length')
            feature_set = build_feature_set(model_names, letter, setting)
            # hold out the first 500 names for testing, train on the rest
            train_set, test_set = feature_set[500:], feature_set[:500]
            classifier = nltk.NaiveBayesClassifier.train(train_set)
            result = nltk.classify.accuracy(classifier, test_set)
            print('result: {}'.format(result))
            if result > score:
                score = result
                best['letters'] = letter
                best['length on/off'] = bool(setting)
    return best, score

In [26]:
model_1(3)

last 1 letters
don't include word length
result: 0.768
include word length
result: 0.762
last 2 letters
don't include word length
result: 0.818
include word length
result: 0.822
exclude 3 letters - data set includes shorter names


({'length on/off': True, 'letters': 2}, 0.822)

In [27]:
def model_2(max_last_letters):
    '''model takes parameter of max last letters to consider.
    1/ builds dataset
    2/ runs naive bayes. 
    Considering last number of letters up to max, on 2 bases:
    -with name length
    -without name length.
    returns best model and its accuracy.
    this version relies on default behaviour of feature dict function - 
    if max last letters > number of letters in name, it will take the max possible letters for the feature dict
    '''
    model_names = build_dataset()
    include_length = [0, 1]
    score = 0
    best = {}
    for setting in include_length:
        if setting:
            print('include word length')
        else:
            print('don\'t include word length')
        for letter in range(1, max_last_letters + 1):
            print('last {} letters'.format(letter))
            feature_set = build_feature_set(model_names, letter, setting)
            # hold out the first 500 names for testing, train on the rest
            train_set, test_set = feature_set[500:], feature_set[:500]
            classifier = nltk.NaiveBayesClassifier.train(train_set)
            result = nltk.classify.accuracy(classifier, test_set)
            print('result: {}'.format(result))
            if result > score:
                score = result
                best['letters'] = letter
                best['length on/off'] = bool(setting)
    return best, score

In [28]:
model_2(3)

don't include word length
last 1 letters
result: 0.768
last 2 letters
result: 0.818
last 3 letters
result: 0.768
include word length
last 1 letters
result: 0.762
last 2 letters
result: 0.822
last 3 letters
result: 0.76


({'length on/off': True, 'letters': 2}, 0.822)