# Improving your model

Most scikit-learn objects are either transformers or models.

Transformers pre-process data before modeling. Imputers (such as SimpleImputer, which fills in missing values) and the feature selection classes in sklearn.feature_selection are examples of transformers.

Models make predictions; examples include linear regression, decision trees and random forests. You will usually pre-process your data with transformers before passing it to a model.

Which of the methods fit(), transform(), fit_transform() and predict() are available, and what they do, depends on the type of object.

For Transformers:

* fit() - Computes the parameters needed for the transformation from the training data (for example, the mean of each column) and stores them as the object's internal state.
* transform() - Applies the stored parameters to the data passed in and returns the modified data. It works on the training data and on new data alike.
* fit_transform() - Combines the two steps above: it is equivalent to calling fit() and then transform() on the same data.
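
As a small illustration of the three transformer methods, here is a sketch using SimpleImputer. The arrays and variable names are made up for this example and are not part of the data used later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with missing values (illustrative only)
X_fit = np.array([[1.0], [3.0], [np.nan]])
X_new = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")

# fit() learns the column mean from the data and stores it on the object
imputer.fit(X_fit)
learned_mean = imputer.statistics_[0]  # mean of 1.0 and 3.0, ignoring the NaN

# transform() applies the stored mean to any data, including unseen data
X_new_filled = imputer.transform(X_new)

# fit_transform() fits and transforms the same data in a single call
X_fit_filled = SimpleImputer(strategy="mean").fit_transform(X_fit)

print(learned_mean)
print(X_new_filled)
print(X_fit_filled)
```

Note that the missing value in `X_new` is filled with the mean learned from `X_fit`, not with a mean recomputed from `X_new`.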

For Models:

* fit() - Computes the model's parameters/weights from the training data (e.g. the coef_ attribute of LinearRegression) and stores them as the object's internal state.
* predict() - Applies the learned weights to new data, such as the test set, to make predictions.
* transform() - Not available on a typical model such as LinearRegression.
* fit_transform() - Not available on a typical model such as LinearRegression.
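
The model methods can be sketched the same way. The toy data below follows y = 2x + 1 exactly and is an assumption made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 (illustrative only)
X_fit = np.array([[0.0], [1.0], [2.0], [3.0]])
y_fit = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()

# fit() estimates the weights and stores them as coef_ and intercept_
model.fit(X_fit, y_fit)
slope = model.coef_[0]
intercept = model.intercept_

# predict() applies the learned weights to unseen data
prediction = model.predict(np.array([[4.0]]))

# LinearRegression does not define transform()
has_transform = hasattr(model, "transform")

print(slope, intercept, prediction, has_transform)
```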

## Pipelines, feature & text preprocessing

By default, SimpleImputer fills missing values with the mean of the column in question.

### Instantiate pipeline

In [16]:
import numpy as np
import pandas as pd

rng = np.random.RandomState(123)

SIZE = 1000

sample_data = {
 'numeric': rng.normal(0, 10, size=SIZE),
 'text': rng.choice(['', 'foo', 'bar', 'foo bar', 'bar foo'], size=SIZE),
 'with_missing': rng.normal(loc=3, size=SIZE)
}

sample_df = pd.DataFrame(sample_data)

sample_df.loc[rng.choice(sample_df.index, size=np.floor_divide(sample_df.shape[0], 5)), 'with_missing'] = np.nan

foo_values = sample_df.text.str.contains('foo') * 10
bar_values = sample_df.text.str.contains('bar') * -25
no_text = ((foo_values + bar_values) == 0) * 1

val = 2 * sample_df.numeric + -2 * (foo_values + bar_values + no_text) + 4 * sample_df.with_missing.fillna(3)
val += rng.normal(0, 8, size=SIZE)

sample_df['label'] = np.where(val > np.median(val), 'a', 'b')

print(sample_df.head())

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']), 
                                                    random_state=22)

pl = Pipeline([('clf', OneVsRestClassifier(LogisticRegression()))])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)

print("\nAccuracy on sample data - numeric, no nans: ", accuracy)

     numeric     text  with_missing label
0 -10.856306               4.433240     b
1   9.973454      foo      4.310229     b
2   2.829785  foo bar      2.469828     a
3 -15.062947               2.852981     b
4  -5.786003  foo bar      1.826475     a

Accuracy on sample data - numeric, no nans:  0.62


Now it's time to incorporate numeric data with missing values by adding a preprocessing step!



### Preprocessing numeric features

Without imputing the missing values first, fitting the pipeline fails, because LogisticRegression does not accept NaN inputs.
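
A minimal sketch of that failure, using a made-up array rather than the sample data: fitting LogisticRegression directly on input that contains NaN raises a ValueError.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature column with one missing value (illustrative only)
X_nan = np.array([[0.0], [1.0], [np.nan], [3.0]])
y_nan = np.array([0, 0, 1, 1])

try:
    LogisticRegression().fit(X_nan, y_nan)
    fit_failed = False
except ValueError as err:
    # Input validation rejects NaN before any fitting happens
    fit_failed = True
    print("Fit failed:", err)
```

Placing an imputer ahead of the classifier in the pipeline removes the NaN values before they reach the model.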

In [17]:
from sklearn.impute import SimpleImputer
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', "with_missing"]],
                                                    pd.get_dummies(sample_df['label']), 
                                                    random_state=456)
pl = Pipeline([("imp", SimpleImputer()),('clf', OneVsRestClassifier(LogisticRegression()))])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)

print("\nAccuracy on sample data - numeric, inc nans: ", accuracy)


Accuracy on sample data - numeric, inc nans:  0.636


## Text features and feature unions

To use the text column we need CountVectorizer, but the numeric and text preprocessing steps cannot simply follow each other in one pipeline: each step receives the output of the previous one, so a numeric step and a text step would each be handed data they cannot process. The text and numeric columns have to be processed separately, and FunctionTransformer and FeatureUnion make that possible.

FunctionTransformer turns a Python function into an object that a scikit-learn pipeline can use. With one function that selects the text column and another that selects the numeric columns, the preprocessing can be split into two separate pipelines.

FeatureUnion joins the outputs of those pipelines into a single feature array, which becomes the input for the classifier.
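
To make the text step concrete, here is a small sketch of what CountVectorizer produces for strings like the ones in the `text` column. The `docs` list is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents resembling the values of the text column (illustrative only)
docs = ["foo", "bar", "foo bar", ""]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index
vocab = sorted(vec.vocabulary_)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each column holds the count of one token, so an empty string becomes a row of zeros.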

### Preprocessing text features

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

X_train, X_test, y_train, y_test = train_test_split(sample_df["text"],
                                                    pd.get_dummies(sample_df['label']), 
                                                    random_state=456)
pl = Pipeline([("vec", CountVectorizer()),('clf', OneVsRestClassifier(LogisticRegression()))])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - just text data: ", accuracy)


Accuracy on sample data - just text data:  0.808


### Multiple types of processing: FunctionTransformer

In [19]:
from sklearn.preprocessing import FunctionTransformer

get_text_data = FunctionTransformer(lambda x: x["text"], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[["numeric", "with_missing"]], validate=False)

just_text_data = get_text_data.fit_transform(sample_df)
just_numeric_data = get_numeric_data.fit_transform(sample_df)

print("Text Data")
print(just_text_data.head())

print("\nNumeric Data")
print(just_numeric_data.head())

Text Data
0           
1        foo
2    foo bar
3           
4    foo bar
Name: text, dtype: object

Numeric Data
     numeric  with_missing
0 -10.856306      4.433240
1   9.973454      4.310229
2   2.829785      2.469828
3 -15.062947      2.852981
4  -5.786003      1.826475


### Multiple types of processing: FeatureUnion

Join the results of the numeric and text pipelines together using FeatureUnion().
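
As a toy sketch of what "joining" means here: FeatureUnion runs each transformer on the same input and stacks the outputs side by side as columns. The two transformers below are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])

# Hypothetical transformers: one keeps the first column, one doubles the second
first_col = FunctionTransformer(lambda a: a[:, [0]])
doubled_second = FunctionTransformer(lambda a: 2 * a[:, [1]])

toy_union = FeatureUnion([("first", first_col), ("doubled", doubled_second)])

# Both transformers see X_toy; their outputs are concatenated column-wise
combined_features = toy_union.fit_transform(X_toy)

print(combined_features)
```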

In [20]:
from sklearn.pipeline import FeatureUnion
X_train, X_test, y_train, y_test = train_test_split(sample_df[["numeric", "with_missing", "text"]],
                                                   pd.get_dummies(sample_df["label"]), random_state=22)

process_and_join_features = FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ("selector", get_text_data),
                    ("vectorizer", CountVectorizer())
                ]))
             ]
        )

# Instantiate nested pipeline: pl
pl = Pipeline([
        ('union', process_and_join_features),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)



Accuracy on sample data - all data:  0.928


## Choosing a classification model

To process the text in the budget dataset, all of the text columns need to be combined into a single column.
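
The idea can be sketched on a tiny frame first. The column names and values below are hypothetical and are not taken from the budget dataset.

```python
import pandas as pd

# Hypothetical frame with two text columns and one numeric column
toy = pd.DataFrame({
    "Job_Title": ["Teacher", None],
    "Program": ["Math", "Athletics"],
    "Total": [100.0, 250.0],
})

# Drop the numeric column and replace missing text with empty strings
text_only = toy.drop(columns=["Total"]).fillna("")

# Join the remaining columns of each row into one space-separated string
combined_text = text_only.apply(lambda row: " ".join(row), axis=1)

print(combined_text.tolist())
```

The resulting Series has one string per row, which is the input shape CountVectorizer expects.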

### Using FunctionTransformer on the main dataset

In [66]:
from warnings import warn  # used for the size warning below


def multilabel_sample(y, size=1000, min_count=2, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if `size` <= 1.
        The sample is guaranteed to have at least `min_count` of
        each label.
    """
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).any():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee at least min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices,
                                   size=sample_count,
                                   replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])


def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):
    """ Takes a dataframe `df` and returns a sample of size `size` where all
        classes in the binary matrix `labels` are represented at
        least `min_count` times.
    """
    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)
    return df.loc[idxs]


def multilabel_train_test_split(X, Y, size, min_count=2, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)

    test_set_mask = np.isin(index, test_set_idxs)  # works for pandas and NumPy indexes alike
    train_set_mask = ~test_set_mask

    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])

import pickle

# Load the saved row index; the context manager closes the file automatically
with open("index.list", "rb") as f:
    index = pickle.load(f)

df = pd.read_csv("TrainingData.csv", index_col=0)
df = df.loc[index]


NUMERIC_COLUMNS = ['FTE', 'Total']
LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type', 'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status']
NON_LABELS = [c for c in df.columns if c not in LABELS]
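def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """Join all text columns of each row into one space-separated string."""
    # Drop only those non-text columns that are actually present
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    # Replace missing values so that join() only sees strings
    text_data.fillna("", inplace=True)
    return text_data.apply(lambda x: " ".join(x), axis=1)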


dummy_labels = pd.get_dummies(df[LABELS])


X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2, 
                                                               seed=123)

get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)


### Add a model to the pipeline

In [91]:
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ("selector", get_numeric_data),
                    ("imputer", SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ("selector", get_text_data),
                    ("vectorizer", CountVectorizer())
                ]))
             ]
        )),
        ('clf', OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1))
    ])

pl.fit(X_train, y_train)

accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)


Accuracy on budget dataset:  0.0
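
One thing to keep in mind when reading this number: for multilabel indicator targets, score() reports subset accuracy, so a row only counts as correct when every label column matches. Whether this alone explains the 0.0 above has not been checked against the budget data. The sketch below uses made-up label matrices to show how strict the metric is.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up multilabel indicator matrices (illustrative only)
y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0]])
y_pred_toy = np.array([[1, 0, 1],
                       [0, 1, 1],
                       [1, 0, 0]])

# Subset accuracy: only the first row matches in every column
subset_acc = accuracy_score(y_true_toy, y_pred_toy)

# Share of individual label cells that match, for comparison
per_cell_acc = (y_true_toy == y_pred_toy).mean()

print(subset_acc, per_cell_acc)
```

With many dummy columns, a single wrong cell anywhere in a row makes the whole row count as a miss.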


### Try a different class of model

In [81]:
from sklearn.ensemble import RandomForestClassifier

pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ("clf", RandomForestClassifier())
    ])

pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)


Accuracy on budget dataset:  0.32051282051282054


### Can you adjust the model or parameters to improve accuracy?


In [90]:
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer())
                ]))
             ]
        )),
        ('clf', RandomForestClassifier(n_estimators=15))
    ])

pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on budget dataset: ", accuracy)


Accuracy on budget dataset:  0.3301282051282051
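
Instead of editing n_estimators by hand, the search can be automated. The following is only a sketch under stated assumptions: it uses synthetic data from make_classification rather than the budget dataset, a one-step pipeline, and an arbitrary grid of values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; not the budget dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

toy_pl = Pipeline([("clf", RandomForestClassifier(random_state=0))])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"clf__n_estimators": [5, 15, 30]}

# Fit the pipeline for each candidate value with 3-fold cross-validation
search = GridSearchCV(toy_pl, param_grid, cv=3)
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

The same double-underscore naming reaches nested steps as well, so a parameter of the classifier in the pipelines above would also be addressed as `clf__n_estimators`.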
