#Tabular training
To illustrate the tabular application, we will use the example of the [Adult dataset](https://archive.ics.uci.edu/dataset/2/adult) where we have to predict if a person is earning more or less than $50k per year using some general data.

In [None]:
!pip install -Uqq fastai


In [None]:
from fastai.tabular.all import *

We can download a sample of this dataset with the usual untar_data command:

In [None]:
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()

(#3) [Path('/root/.fastai/data/adult_sample/export.pkl'),Path('/root/.fastai/data/adult_sample/models'),Path('/root/.fastai/data/adult_sample/adult.csv')]

Then we can have a look at how the data is structured:

In [None]:
df = pd.read_csv(path/'adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


* Some of the columns are continuous (like age) and we will treat them as float numbers we can feed our model directly.
* Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers.
* We can specify our categorical and continuous column names, as well as the name of the dependent variable in **TabularDataLoaders** factory methods.
* We won't list the columns that are not needed for training.
* Categorify takes every categorical variable, builds a map from integer index to unique category, and replaces each value with the corresponding index.
* FillMissing fills the missing values in the continuous variables with the median of the existing values (you can choose a specific value if you prefer).
* Normalize normalizes the continuous variables (subtract the mean and divide by the standard deviation).
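
To make the three preprocessing steps concrete, here is a minimal pure-Python sketch of the ideas behind them. It is not fastai's implementation: the helper names `categorify`, `fill_missing` and `normalize`, the toy rows, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev

# Toy rows standing in for a few Adult-style records (illustrative values only).
rows = [
    {"workclass": "Private", "education-num": 12.0},
    {"workclass": "Self-emp-inc", "education-num": None},
    {"workclass": "Private", "education-num": 14.0},
    {"workclass": "State-gov", "education-num": 10.0},
]

def categorify(values):
    """Map each category to an integer index; index 0 is kept for missing values."""
    vocab = ["#na#"] + sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(vocab)}
    return [index.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace None with the median of the observed values and flag filled entries."""
    fill = median(v for v in values if v is not None)
    was_missing = [v is None for v in values]
    return [fill if v is None else v for v in values], was_missing

def normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

codes, vocab = categorify([r["workclass"] for r in rows])
filled, was_missing = fill_missing([r["education-num"] for r in rows])
scaled = normalize(filled)
```

The `was_missing` flags play the same role as the `education-num_na` column that appears in the preprocessed data later on: they let the model know which values were filled in rather than observed.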

In [None]:
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)


Split the data into training and validation sets:

In [None]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
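
As a rough illustration of what a random split with valid_pct=0.2 amounts to, the sketch below shuffles the row indices and holds out 20% of them for validation. The function `random_split` and its `seed` argument are hypothetical names used here for illustration; this is not fastai's RandomSplitter itself.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle indices 0..n_items-1 and hold out valid_pct of them for validation."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    # First `cut` shuffled indices form the validation set, the rest the training set.
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
```

Every index lands in exactly one of the two lists, so no row is used for both training and validation.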

Preprocess the data using fastai's TabularPandas:

In [None]:
to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)


* Once we build our TabularPandas object, our data is completely preprocessed as seen below.
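* Here xs is the TabularPandas attribute holding the preprocessed independent variables (the categorical and continuous columns); it is not the pandas DataFrame.xs method, which selects a cross-section from a MultiIndex DataFrame.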

In [None]:
to.xs.iloc[:2]

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
3696,5,13,7,11,2,5,1,1.491348,0.996304,1.527915
15809,5,12,3,9,6,3,1,0.101506,-0.156221,-0.426585


Now we can build our DataLoaders again:

In [None]:
dls = to.dataloaders(bs=64)    #bs means batch size

The show_batch method works as it does for every other application:

In [None]:
dls.show_batch()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Self-emp-not-inc,Masters,Married-civ-spouse,Sales,Husband,White,False,35.0,189877.999996,14.0,>=50k
1,Private,9th,Never-married,Handlers-cleaners,Own-child,White,False,18.0,675420.980558,5.0,<50k
2,Private,Some-college,Married-civ-spouse,Exec-managerial,Husband,White,False,50.0,171337.999519,10.0,>=50k
3,State-gov,Masters,Divorced,Prof-specialty,Not-in-family,White,False,56.000001,67661.995418,14.0,<50k
4,Private,HS-grad,Never-married,Craft-repair,Not-in-family,White,False,33.0,55716.994336,9.0,<50k
5,?,HS-grad,Never-married,?,Not-in-family,White,False,35.0,476573.004385,9.0,<50k
6,Private,HS-grad,Divorced,Adm-clerical,Own-child,White,False,46.0,164427.000077,9.0,<50k
7,Private,HS-grad,Married-civ-spouse,Tech-support,Husband,White,False,42.0,126003.001688,9.0,>=50k
8,Self-emp-not-inc,HS-grad,Never-married,Prof-specialty,Not-in-family,White,False,33.0,361496.999894,9.0,<50k
9,Private,11th,Never-married,Sales,Own-child,White,False,17.0,118792.001247,7.0,<50k


* We can define a model using the tabular_learner function. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier.

* Sometimes with tabular data, your y’s may be encoded (such as 0 and 1). In such a case you should explicitly pass y_block = CategoryBlock in your constructor so fastai won’t presume you are doing regression.


In [None]:
learn = tabular_learner(dls, metrics=accuracy)

And we can train that model with the fit_one_cycle method (the fine_tune method won’t be useful here since we don’t have a pretrained model).

In [None]:
learn.fit_one_cycle(3)

epoch,train_loss,valid_loss,accuracy,time
0,0.362535,0.353224,0.836763,00:06
1,0.355471,0.346132,0.837531,00:07
2,0.352515,0.343969,0.84398,00:05


We can then have a look at some predictions:

In [None]:
learn.show_results()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary,salary_pred
0,1.0,12.0,4.0,1.0,2.0,5.0,1.0,0.761064,2.046234,-0.41675,0.0,0.0
1,5.0,10.0,5.0,2.0,2.0,5.0,1.0,0.102914,0.684985,1.149537,0.0,0.0
2,5.0,13.0,7.0,13.0,2.0,5.0,1.0,1.419215,-0.745962,1.541108,0.0,0.0
3,5.0,2.0,3.0,9.0,1.0,5.0,1.0,1.419215,0.236183,-1.199894,0.0,0.0
4,5.0,16.0,3.0,13.0,1.0,5.0,1.0,1.638599,4.953921,-0.025179,1.0,0.0
5,5.0,10.0,5.0,9.0,2.0,5.0,1.0,-0.262726,-0.028098,1.149537,0.0,0.0
6,5.0,9.0,5.0,10.0,2.0,5.0,1.0,0.395425,-1.158028,0.366393,0.0,0.0
7,5.0,4.0,5.0,8.0,2.0,5.0,1.0,0.395425,1.672129,-3.157753,0.0,0.0
8,5.0,12.0,3.0,14.0,1.0,5.0,1.0,1.126704,-0.417752,-0.41675,1.0,1.0
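
To connect this table to the accuracy metric, the following sketch recomputes accuracy by hand for the nine rows shown above, using the values of their salary and salary_pred columns. The helper name `accuracy_score` is chosen here for illustration and is not a fastai function.

```python
# salary (target) and salary_pred (prediction) copied from the nine rows above.
targets = [0, 0, 0, 0, 1, 0, 0, 0, 1]
preds = [0, 0, 0, 0, 0, 0, 0, 0, 1]

def accuracy_score(preds, targets):
    """Fraction of predictions that equal their target."""
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)

acc = accuracy_score(preds, targets)
print(f"{acc:.3f}")
```

Eight of the nine rows are predicted correctly; the one miss is row 4, where the target is 1 and the prediction is 0.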


Or use the predict method on a row:

In [None]:
row, clas, probs = learn.predict(df.iloc[0])


In [None]:
row.show()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,#na#,#na#,#na#,#na#,#na#,#na#,False,49.0,101320.00011,12.0,<50k


In [None]:
clas, probs

(tensor(0), tensor([0.7461, 0.2539]))
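
The output above can be read as "class index 0, with probabilities 0.7461 and 0.2539 for the two classes". A small sketch of that decoding in plain Python follows. The `vocab` list is an assumption: it supposes the class labels are stored in sorted order, which is consistent with the `<50k` prediction shown by row.show() but is not confirmed here.

```python
# Values copied from the output above, written as plain Python numbers.
clas_idx = 0
probs = [0.7461, 0.2539]

# Assumption: labels are stored in sorted order, so 0 -> '<50k' and 1 -> '>=50k'.
vocab = ["<50k", ">=50k"]

# The predicted class is the index of the largest probability.
best = max(range(len(probs)), key=probs.__getitem__)
label, confidence = vocab[best], probs[best]
```

Here `best` matches the class index returned by predict, and `confidence` is the probability the model assigns to that class.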

To get predictions on a new DataFrame, you can use the **test_dl** method of the DataLoaders. That DataFrame does not need to have the dependent variable among its columns.

In [None]:
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)


In [None]:
test_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States


Then Learner.get_preds will give you the predictions:

In [None]:
learn.get_preds(dl=dl)

(tensor([[0.7461, 0.2539],
         [0.5915, 0.4085],
         [0.8121, 0.1879],
         ...,
         [0.7342, 0.2658],
         [0.8484, 0.1516],
         [0.9104, 0.0896]]),
 None)

#fastai with Other Libraries

* As mentioned earlier, TabularPandas is a powerful and easy preprocessing tool for tabular data. Integration with models and libraries such as Random Forests and XGBoost requires only one extra step, which the .dataloaders call did for us.

* Let’s look at our to object again. Its values are stored in a DataFrame-like object, from which we can extract the cats, conts, xs and ys if we want to:

In [34]:
to.xs[:3]

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num
3696,5,13,7,11,2,5,1,1.491348,0.996304,1.527915
15809,5,12,3,9,6,3,1,0.101506,-0.156221,-0.426585
15991,5,12,3,2,6,5,1,-0.776288,-0.035936,-0.426585


In [35]:
to.ys[:3]

Unnamed: 0,salary
3696,0
15809,0
15991,0


Now that everything is encoded, you can then send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:

In [36]:
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()

And now we can directly send this in Random Forests and XGBoost. Boo