In [1]:
%load_ext autoreload
%autoreload 2

# fastText introduction

The fastText model is a simple and fast baseline model for text classification. It learns embeddings for features (n-grams), which are averaged to form the hidden vector representation of a document. Its accuracy is often on par with deep learning classifiers, while training and evaluation are orders of magnitude faster. In AutoGluon, the fastText model provides another baseline for text classification alongside the bag-of-words model.
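
The averaging step can be sketched in plain Python. This is only an illustration of the idea, not the fastText implementation: it assumes word n-grams as features, uses `zlib.crc32` as a stand-in hash, and runs with random, untrained parameters. The names `word_ngrams`, `bucket`, and `forward` are hypothetical helpers introduced here.

```python
import math
import random
import zlib


def word_ngrams(tokens, n=2):
    """Return the unigrams plus all word n-grams up to length n."""
    feats = list(tokens)
    for k in range(2, n + 1):
        feats += [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return feats


def bucket(feature, num_buckets):
    """Map a feature string to an embedding row (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(feature.encode("utf-8")) % num_buckets


def forward(tokens, embeddings, weights):
    """Average the feature embeddings, then apply a linear layer and a softmax."""
    rows = [embeddings[bucket(f, len(embeddings))] for f in word_ngrams(tokens)]
    dim = len(embeddings[0])
    # Hidden document vector: element-wise mean of the feature embeddings.
    hidden = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    logits = [sum(w * h for w, h in zip(w_row, hidden)) for w_row in weights]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Random, untrained parameters: the output probabilities are meaningless,
# the point is only the shape of the computation.
rng = random.Random(0)
NUM_BUCKETS, DIM, NUM_CLASSES = 1000, 8, 2
embeddings = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]
weights = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]

probs = forward("a very good movie".split(), embeddings, weights)
print(probs)
```

Because the document vector is a plain average, the cost of a prediction grows only linearly with the number of features, which is one plausible reason the model trains and evaluates so quickly.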

To start, import AutoGluon and the TabularPrediction module as your task:

In [2]:
import logging
logging.basicConfig(format='%(asctime)s: [%(funcName)s] %(message)s',
                   level=logging.INFO)
logger = logging.getLogger(__name__)


In [3]:
import pandas as pd
import autogluon as ag

from sklearn.metrics import accuracy_score

from autogluon.tabular import TabularPrediction as task

from autogluon.tabular.features.generators import AutoMLPipelineFeatureGenerator

from autogluon.tabular.models.fasttext.fasttext_model import FastTextModel
from autogluon.tabular.utils import infer_problem_type
from autogluon.tabular.task.tabular_prediction.hyperparameter_configs import get_hyperparameter_config

# Load data

Load training data from a CSV file into an AutoGluon Dataset object. This object is essentially equivalent to a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and the same methods can be applied to both.

In [4]:
train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train_data['class'] = train_data['class'].str.strip()
test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
print(train_data.head())

2020-12-09 22:46:46,415: [load] Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
2020-12-09 22:46:46,527: [load] Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


   age   workclass  fnlwgt   education  education-num       marital-status  \
0   25     Private  178478   Bachelors             13        Never-married   
1   23   State-gov   61743     5th-6th              3        Never-married   
2   46     Private  376789     HS-grad              9        Never-married   
3   55           ?  200235     HS-grad              9   Married-civ-spouse   
4   36     Private  224541     7th-8th              4   Married-civ-spouse   

           occupation    relationship    race      sex  capital-gain  \
0        Tech-support       Own-child   White   Female             0   
1    Transport-moving   Not-in-family   White     Male             0   
2       Other-service   Not-in-family   White     Male             0   
3                   ?         Husband   White     Male             0   
4   Handlers-cleaners         Husband   White     Male             0   

   capital-loss  hours-per-week  native-country  class  
0             0              40   United-

Note that we loaded data from a CSV file stored in the cloud (an AWS S3 bucket), but you can specify a local file path instead if you have already downloaded the CSV file to your own machine (e.g., using `wget`).
Each row in the table `train_data` corresponds to a single training example. In this particular dataset, each row corresponds to an individual person, and the columns contain various characteristics reported during a census.

Let's first use these features to predict whether the person's income exceeds $50,000 or not, which is recorded in the `class` column of this table.

In [5]:
label_column = 'class'
print("Summary of class variable: \n", train_data[label_column].describe())

2020-12-09 22:46:46,565: [_init_num_threads] Note: detected 96 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2020-12-09 22:46:46,566: [_init_num_threads] Note: NumExpr detected 96 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2020-12-09 22:46:46,566: [_init_num_threads] NumExpr defaulting to 8 threads.


Summary of class variable: 
 count     39073
unique        2
top       <=50K
freq      29704
Name: class, dtype: object
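
Since one class is much more frequent than the other, a useful reference point is the accuracy of always predicting the majority class. The sketch below simply reuses the two counts printed in the summary; it describes the training split only, and the test split may be balanced slightly differently.

```python
# Counts copied from the describe() output of the training labels.
total_rows = 39073
majority_count = 29704  # rows labelled '<=50K'

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = majority_count / total_rows
print(f"majority-class accuracy on the training split: {majority_baseline:.3f}")
```

Any model trained on this data should be judged against this figure rather than against 50%.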


# Use fastText model in TabularPrediction task 

Let's add a mock text field to the original data by joining several categorical columns into one string:

In [6]:
train_data['text'] = (
    train_data[['education', 'marital-status', 'occupation', 'relationship', 
                'workclass', 'native-country',  'sex', 'race']]
    .apply(lambda r: ', '.join(r.values) + '.', axis=1)
)


test_data['text'] = (
    test_data[['education', 'marital-status', 'occupation', 'relationship',
               'workclass', 'native-country',  'sex', 'race']]
    .apply(lambda r: ', '.join(r.values) + '.', axis=1)
)
print('sample text column values')
print(train_data['text'].sample(5).to_list())



sample text column values
[' Some-college,  Married-civ-spouse,  Exec-managerial,  Wife,  Private,  United-States,  Female,  White.', ' 10th,  Married-civ-spouse,  Other-service,  Wife,  Private,  United-States,  Female,  White.', ' Some-college,  Married-civ-spouse,  Craft-repair,  Husband,  Private,  United-States,  Male,  White.', ' HS-grad,  Never-married,  Sales,  Not-in-family,  Private,  El-Salvador,  Male,  White.', ' Bachelors,  Married-civ-spouse,  Exec-managerial,  Husband,  Private,  United-States,  Male,  White.']
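
The sample values show that each raw field carries a leading space, so the joined text starts with a space and contains double spaces. Whether this affects the model is not established here; if cleaner text is preferred, the fields can be trimmed before joining. `row_to_text` below is a hypothetical helper, shown on one of the sampled rows.

```python
def row_to_text(values):
    """Join field values into one sentence-like string, trimming stray whitespace."""
    return ", ".join(str(v).strip() for v in values) + "."


# Field values as they appear in the raw data (note the leading spaces).
raw_values = [" HS-grad", " Never-married", " Sales", " Not-in-family",
              " Private", " El-Salvador", " Male", " White"]

text = row_to_text(raw_values)
print(text)
```

The helper could replace the lambda in the cell above, for example `.apply(lambda r: row_to_text(r.values), axis=1)`, which would also remove the duplicated join logic for the training and test frames.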


Now we can specify FastTextModel as a custom model, so that it can take part in AutoGluon's ensembling/stacking:

In [7]:
custom_hyperparameters = {'RF': {},
                          FastTextModel:  {'epoch': 50},
                         }

feature_generator = AutoMLPipelineFeatureGenerator(enable_raw_text_features=True)
predictor = task.fit(train_data=train_data, 
                     label=label_column, 
                     hyperparameters=custom_hyperparameters,
                     feature_generator=feature_generator,
                     output_directory='AutogluonModels/ag-test/'
                    )

predictor = task.load('AutogluonModels/ag-test/')

y_pred = predictor.predict(test_data)
df_res = pd.DataFrame({
    'pred': y_pred,
    'label': test_data[label_column]
})
print('accuracy:', (df_res.pred.str.strip() == df_res.label.str.strip()).mean())
print(df_res.sample(5))

2020-12-09 22:46:46,944: [_fit] Beginning AutoGluon training ...
2020-12-09 22:46:46,944: [_fit] AutoGluon will save models to AutogluonModels/ag-test/
2020-12-09 22:46:46,944: [_fit] AutoGluon Version:  0.0.15b20201112
2020-12-09 22:46:46,945: [_fit] Train Data Rows:    39073
2020-12-09 22:46:46,945: [_fit] Train Data Columns: 15
2020-12-09 22:46:46,946: [_fit] Preprocessing data ...
2020-12-09 22:46:46,962: [infer_problem_type] AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2020-12-09 22:46:46,962: [infer_problem_type] 	2 unique label values:  ['<=50K', '>50K']
2020-12-09 22:46:46,962: [infer_problem_type] 	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
2020-12-09 22:46:46,980: [__init__] Selected class <--> label mapping:  class 1 = >50K, class 0 = <=50K
2020-12-09 22:46:46,983: [general_data

accuracy: 0.839594636093766
       pred   label
428    >50K   <=50K
9664  <=50K   <=50K
366    >50K    >50K
4642   >50K   <=50K
2100  <=50K   <=50K


# Sentiment Analysis: SST

The Stanford Sentiment Treebank ([SST](https://nlp.stanford.edu/sentiment/)) dataset is used to predict whether a movie review expresses positive or negative sentiment. Here we show how the model performs on this dataset. First, let's load the data:

In [8]:
df_train = pd.read_parquet('https://autogluon-text.s3-us-west-2.amazonaws.com/glue/sst/train.parquet')
df_test = pd.read_parquet('https://autogluon-text.s3-us-west-2.amazonaws.com/glue/sst/dev.parquet')
df_test.sample(5).to_dict(orient='records')



[{'sentence': 'the format gets used best ... to capture the dizzying heights achieved by motocross and bmx riders , whose balletic hotdogging occasionally ends in bone-crushing screwups . ',
  'label': 1},
 {'sentence': 'scooby dooby doo / and shaggy too / you both look and sound great . ',
  'label': 1},
 {'sentence': "though it 's become almost redundant to say so , major kudos go to leigh for actually casting people who look working-class . ",
  'label': 1},
 {'sentence': "inside the film 's conflict-powered plot there is a decent moral trying to get out , but it 's not that , it 's the tension that keeps you in your seat . ",
  'label': 1},
 {'sentence': "because of an unnecessary and clumsy last scene , ` swimfan ' left me with a very bad feeling . ",
  'label': 0}]

Next, we train an AutoGluon model with the FastTextModel as one of the custom model types:

In [9]:
custom_hyperparameters = {'RF': {},
                          FastTextModel:  {'epoch': 100},
                         }

feature_generator = AutoMLPipelineFeatureGenerator(enable_raw_text_features=True)

label_column = 'label'
predictor = task.fit(train_data=df_train,
                     label=label_column, 
                     feature_generator=feature_generator,
                     hyperparameters=custom_hyperparameters
                    )


2020-12-09 22:49:25,596: [setup_outputdir] No output_directory specified. Models will be saved in: AutogluonModels/ag-20201210_064925/
2020-12-09 22:49:25,597: [_fit] Beginning AutoGluon training ...
2020-12-09 22:49:25,597: [_fit] AutoGluon will save models to AutogluonModels/ag-20201210_064925/
2020-12-09 22:49:25,598: [_fit] AutoGluon Version:  0.0.15b20201112
2020-12-09 22:49:25,598: [_fit] Train Data Rows:    67349
2020-12-09 22:49:25,598: [_fit] Train Data Columns: 1
2020-12-09 22:49:25,599: [_fit] Preprocessing data ...
2020-12-09 22:49:25,614: [infer_problem_type] AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2020-12-09 22:49:25,614: [infer_problem_type] 	2 unique label values:  [0, 1]
2020-12-09 22:49:25,615: [infer_problem_type] 	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
2020-12-

Now we use the trained model to make predictions on the test data and check its accuracy:

In [11]:
y_pred = predictor.predict(df_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
print('accuracy:', accuracy_score(df_test[label_column], y_pred))

accuracy: 0.8004587155963303
