# Training Models on Tabular Data

When we train a model on tabular data, we want it to predict the value in one column given the values in the other columns. In my [collaborative filtering blog](https://geon-youn.github.io/DunGeon/2022/03/16/Collab.html), I gave the model a user's reviews of other movies as inputs and asked it to predict that user's review of another movie.





## Preprocessing Data

Tabular data has two main kinds of variables: *continuous* variables (numerical data) and *categorical* variables (discrete data). When training a model, we want all our inputs to be numbers, so continuous variables can be used as they are, while categorical variables first have to be converted into numbers. In the collaborative filtering model, the users and movies were (high-cardinality) categorical variables.

So, you pass your categorical variables through embeddings. An embedding is equivalent to putting a linear layer after a one-hot-encoded input layer. What I mean is that multiplying a one-hot-encoded vector by a weight matrix just picks out one row of that matrix, so an embedding layer skips the multiplication and looks up that row directly by index. In the end, each category is turned into a vector of numerical values, which you can pass through the other layers in your neural network.
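To make that equivalence concrete, here's a minimal sketch in plain Python (no deep learning library). The matrix `weights`, its values, and the helper names are all made up for illustration; in a real model the weights would be learned.

```python
# Hypothetical embedding matrix: 4 categories, each mapped to a 3-number vector.
# In a real model these values are learned parameters, not hand-written.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear_layer(vector, matrix):
    """Multiply a row vector by a matrix (a linear layer without a bias)."""
    n_out = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_out)
    ]


def embedding_lookup(index, matrix):
    """An embedding layer: select the row for the category directly."""
    return list(matrix[index])


category = 2  # index of the category we want to embed
via_matmul = linear_layer(one_hot(category, len(weights)), weights)
via_lookup = embedding_lookup(category, weights)
```

Both `via_matmul` and `via_lookup` end up as the third row of `weights`. The lookup just avoids multiplying by a vector that's almost entirely zeros.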

When we train the model, the embeddings are learned along with the rest of the weights, and we can interpret the distances between them afterwards. Since the embedding distances were learned from patterns in the data, they also tend to match up with what we'd intuitively expect.
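One way to compare embeddings is cosine similarity (my choice for this sketch; Euclidean distance would work too). The vectors and names below are invented purely for illustration and don't come from any trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings for three hypothetical movies.
embeddings = {
    "movie_a": [0.9, 0.1, 0.0],
    "movie_b": [0.8, 0.2, 0.1],
    "movie_c": [-0.7, 0.6, 0.3],
}


def most_similar(name, table):
    """Return the other item whose embedding is closest to `name`'s."""
    query = table[name]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in table.items()
        if other != name
    ]
    return max(scores, key=lambda pair: pair[1])[0]


closest = most_similar("movie_a", embeddings)
```

With these made-up numbers, `movie_b` points in nearly the same direction as `movie_a`, so it comes out as the closest item.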

Since we can form continuous embeddings for our categorical variables, we can treat them like continuous variables when we train our models. So, we could perform probabilistic matrix factorization, or concatenate them with the actual continuous variables and pass them through a neural network. 
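Here's a rough sketch of the concatenation option, again in plain Python. The column names, the embedding tables, and `build_input` are hypothetical; a real model would learn the embedding values and feed the resulting vector into its first layer.

```python
# Hypothetical learned embedding tables, one per categorical column.
embedding_tables = {
    "weekday": {"mon": [0.2, -0.1], "sat": [0.9, 0.4]},
    "store": {"A": [0.5, 0.5, 0.1], "B": [-0.3, 0.8, 0.0]},
}


def build_input(row, categorical_cols, continuous_cols, tables):
    """Concatenate each category's embedding with the continuous values."""
    features = []
    for col in categorical_cols:
        # Look up the embedding vector for this row's category.
        features.extend(tables[col][row[col]])
    for col in continuous_cols:
        # Continuous variables are already numbers, so use them directly.
        features.append(float(row[col]))
    return features


row = {"weekday": "sat", "store": "B", "price": 3.5, "quantity": 12}
x = build_input(row, ["weekday", "store"], ["price", "quantity"], embedding_tables)
```

Here `x` has 2 + 3 + 2 = 7 numbers: the two embeddings followed by the two continuous values. That single vector is what the rest of the network would see.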

The diagram below shows how Google trains a model for recommendations on Google Play:
![Google Recommendations](https://github.com/fastai/fastbook/blob/master/images/att_00019.png?raw=1)

## And Here We Branch

In modern machine learning, there are two main techniques that are widely applicable, each good for specific kinds of data:

1. Ensembles of decision trees (like random forests and gradient boosting machines) for structured data.
2. Multilayered neural networks optimized with SGD (that is, shallow and/or deep learning) for unstructured data (like images, audio, and natural language).

Deep learning is almost always superior for unstructured data and tends to give similar results to decision trees for structured data. But, decision trees train much faster, are simpler to train, and are easier to interpret (like which columns were most important). So, ensembles of decision trees are good for forming baselines on most structured data.

However, deep learning is a better choice than decision trees when
- there are some high-cardinality categorical variables that are very important (like zip codes); or
- there are some columns that would be best understood through a neural network, like plain text.

Still, you should try both to see which one works best. Usually, you'll start with decision trees as a baseline and try to get a higher accuracy with a deep learning model if either of those two conditions above applies. 

So, in the next two blog posts, I'll be talking about decision trees and deep learning, respectively, for tabular data.