# TensorFlow Linear Model Tutorial

In this tutorial, we will use the TF.Learn API in TensorFlow to solve a binary classification problem: 

**Given census data about a person such as age, gender, education and occupation (the features), we will work to predict whether or not the person earns more than $50,000 a year (the target label).**

We will train a **logistic regression** model, and given an individual's information our model will output a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.

You'll learn to:

- use Pandas to read, view, and organize data
- convert data into tensors for learning
- train and evaluate a linear model
- engineer data to improve performance
- adjust hyperparameters to improve performance

In [None]:
print("The answer should be three: " + str(1+2))

Let's execute the cell below to display information about the GPUs running on the server.

In [None]:
!nvidia-smi

# Begin lab

## Reading The Census Data

The dataset we'll be using is the [Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). <code>urllib.urlretrieve</code> is a convenient way to fetch uncompressed files. Download the [training data](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data) and [test data](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test) by executing the code block below:

In [None]:
import tempfile
import urllib
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

Once the CSV files are downloaded, let's read them into [Pandas](http://pandas.pydata.org/) dataframes.

In [None]:
import pandas as pd
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)

Evaluating the name of a dataframe on the last line of a cell displays its contents. Execute the cell below to see our training data. Change the contents of the cell to see our test data.

In [None]:
df_train

Take a minute to make sense of the data. Essentially, we're working to use the first 14 columns to predict the 15th. Since there are only two potential outcomes, we're solving a *binary classification* problem. 

We'll construct a label column named "label" whose value is 1 if the income is over 50K, and 0 otherwise.

In [None]:
LABEL_COLUMN = "label"
df_train[LABEL_COLUMN] = (df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
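The code above uses a substring test (`">50K" in x`) rather than an equality test. This matters because the label strings in the raw test file end with a period (for example `>50K.`), while those in the training file do not. A small sketch of the same check on a few sample label strings, without Pandas:

```python
# Sample label strings in both spellings (training file and test file).
sample_labels = [">50K", "<=50K", ">50K.", "<=50K."]

# Same test as the lambda above: 1 if the label contains ">50K", else 0.
binary_labels = [int(">50K" in label) for label in sample_labels]
print(binary_labels)
```

An equality test against `">50K"` would label every row of the test set 0, because of the trailing period.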

Next, take a look at the dataframe again and see which columns we should use to predict the target label. 

### Task 1 - Categorical vs Continuous Columns

The columns can be grouped into two types—categorical and continuous columns:

- A column is called **categorical** if its value can only be one of the categories in a finite set. For example, the native country of a person (U.S., India, Japan, etc.) or the education level (high school, college, etc.) are categorical columns.
- A column is called **continuous** if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.

Split each of the columns into categorical or continuous by adding their titles where it says "FIX ME" in the code block below. Each column name should be in quotes and be separated by commas (as shown in the few that have been filled out). Use the printout of the data and your best judgement.

The columns we'll use are:

"age", "workclass", "education", "education_num", "marital_status", "occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week", and "native_country"

In [None]:
CATEGORICAL_COLUMNS = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'gender', 'native_country']
CONTINUOUS_COLUMNS = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

[You can check your work here.](#columns)


<a id = 'columnreturn'></a>
## Converting Data into Tensors

When building a TF.Learn model, the input data is specified by means of an Input Builder function. This builder function will not be called until it is later passed to TF.Learn methods such as fit and evaluate. The purpose of this function is to construct the input data, which is represented in the form of [tf.Tensors](https://www.tensorflow.org/api_docs/python/tf/Tensor) or [tf.SparseTensors.](https://www.tensorflow.org/api_docs/python/tf/SparseTensor) In more detail, the Input Builder function returns the following as a pair:

1. *feature_cols:* A dict from feature column names to *Tensors* or *SparseTensors.*
2.  *label:* A *Tensor* containing the label column.

The keys of the *feature_cols* will be used to construct columns in the next section. Because we want to call the *fit* and *evaluate* methods with different data, we define two different input builder functions, *train_input_fn* and *eval_input_fn*, which are identical except that they pass different data to *input_fn.* Note that *input_fn* will be called while constructing the TensorFlow graph, not while running the graph. What it returns is a representation of the input data as the fundamental unit of TensorFlow computations, a *Tensor* (or *SparseTensor*).

Our model represents the input data as **constant** tensors, meaning that the tensor represents a constant value, in this case the values of a particular column of *df_train* or *df_test.* This is the simplest way to pass data into TensorFlow. A more advanced approach is to use TensorFlow's [Inputs and Readers](https://www.tensorflow.org/api_guides/python/io_ops#inputs_and_readers) to build an input pipeline that represents a file or other data source and iterates through it as TensorFlow runs the graph. Each continuous column in the train or test dataframe will be converted into a *Tensor,* which in general is a good format to represent dense data. For categorical data, we must represent the data as a *SparseTensor.* This data format is good for representing sparse data.

In [None]:
import tensorflow as tf

def input_fn(df):
  # Creates a dictionary mapping from each continuous feature column name (k) to
  # the values of that column stored in a constant Tensor.
  continuous_cols = {k: tf.constant(df[k].values)
                     for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  categorical_cols = {k: tf.SparseTensor(
      indices=[[i, 0] for i in range(df[k].size)],
      values=df[k].values,
      dense_shape=[df[k].size, 1])
            for k in CATEGORICAL_COLUMNS}
  # Merges the two dictionaries into one.
  feature_cols = dict(continuous_cols)
  feature_cols.update(categorical_cols)
  # Converts the label column into a constant Tensor.
  label = tf.constant(df[LABEL_COLUMN].values)
  # Returns the feature columns and the label.
  return feature_cols, label

def train_input_fn():
  return input_fn(df_train)

def eval_input_fn():
  return input_fn(df_test)
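To make the *SparseTensor* construction in *input_fn* more concrete, here is a plain-Python sketch of the three pieces it is built from. The sample values are hypothetical, and TensorFlow is not needed to run it.

```python
# Hypothetical categorical column with four rows.
column_values = ["Private", "State-gov", "Private", "Self-emp-inc"]
num_rows = len(column_values)

# One entry per row, always in column 0, mirroring input_fn above.
indices = [[i, 0] for i in range(num_rows)]
values = list(column_values)
dense_shape = [num_rows, 1]

print(indices)      # row/column position of each stored value
print(values)       # the stored strings, in the same order as indices
print(dense_shape)  # overall shape: one value per row, a single column
```

Because every row holds exactly one value, nothing here is actually sparse. The sparse format is used because that is the representation the categorical feature columns in the next section expect.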

## Selecting and Engineering Base Features for the Model
---

Selecting and crafting the right set of feature columns is key to learning an effective model. A **feature column** can be either one of the raw columns in the original dataframe (let's call them **base feature columns**), or any new columns created based on some transformations defined over one or multiple base columns (let's call them **derived feature columns**). Basically, "feature column" is an abstract concept of any raw or derived variable that can be used to predict the target label.

### Base Categorical Feature Columns

To define a feature column for a categorical feature, we can create a *SparseColumn* using the TF.Learn API. If you know the set of all possible feature values of a column and there are only a few of them, you can use *sparse_column_with_keys.* Each key in the list will get assigned an auto-incremental ID starting from 0. For example, for the gender column we can assign the feature string "Female" to an integer ID of 0 and "Male" to 1 by doing:

In [None]:
gender = tf.contrib.layers.sparse_column_with_keys(
  column_name="gender", keys=["Female", "Male"], combiner="sqrtn")
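Conceptually, the key list behaves like a lookup table from string to integer ID. The sketch below mimics that with a dictionary. Returning -1 for a value that is not in the list is an assumption made for illustration, and the helper name `lookup` is hypothetical.

```python
keys = ["Female", "Male"]

# Assign auto-incrementing IDs in list order, starting from 0.
key_to_id = {key: idx for idx, key in enumerate(keys)}

def lookup(value, default_id=-1):
    """Return the integer ID for a value, or default_id if it is not a key."""
    return key_to_id.get(value, default_id)

print(lookup("Female"))
print(lookup("Male"))
```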

What if we don't know the set of possible values in advance? Not a problem. We can use *sparse_column_with_hash_bucket* instead:

In [None]:
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000, combiner="sqrtn")

What will happen is that each possible value in the feature column *education* will be hashed to an integer ID as we encounter them in training. See an example illustration below:

![](intid.PNG)

No matter which way we choose to define a *SparseColumn,* each feature string will be mapped into an integer ID by looking up a fixed mapping or by hashing. Note that hashing collisions are possible, but may not significantly impact the model quality. Under the hood, the *LinearModel* class is responsible for managing the mapping and creating *tf.Variable* to store the model parameters (also known as model weights) for each feature ID. The model parameters will be learned through the model training process we'll go through later.
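The hashing idea can be sketched in a few lines of plain Python. MD5 is used here only as a reproducible stand-in. TensorFlow uses its own hash function, so the IDs it produces will differ, and the helper name `hash_bucket_id` is hypothetical.

```python
import hashlib

def hash_bucket_id(value, hash_bucket_size):
    """Map a string to an integer ID in [0, hash_bucket_size).

    MD5 is a stand-in chosen for reproducibility, not TensorFlow's hash.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

education_values = ["Bachelors", "HS-grad", "Masters", "Doctorate", "11th"]
ids = [hash_bucket_id(v, 1000) for v in education_values]
print(ids)

# With very few buckets, distinct values are forced to share IDs (collisions).
tiny_ids = [hash_bucket_id(v, 2) for v in education_values]
print(tiny_ids)
```

Colliding values share a single model weight, which is why the bucket size is usually chosen to be comfortably larger than the number of distinct values.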

### Task 2 - Keys vs Hash Buckets for Categorical Values

In the first three categories below, choose whether to use *sparse_column_with_keys* or *sparse_column_with_hash_bucket*, depending on whether or not you know the set of possible values in advance.

If you use a *sparse_column_with_keys*, the second argument should be <code>keys=[key1, key2,...]</code>.
If you use a *sparse_column_with_hash_bucket*, the second argument should be <code>hash_bucket_size=100</code>, where 100 is a round number chosen to exceed the approximate number of categories with some room for error.

In [None]:
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=1000, combiner="sqrtn")
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=1000, combiner="sqrtn")
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000, combiner="sqrtn")
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000, combiner="sqrtn")
marital_status = tf.contrib.layers.sparse_column_with_hash_bucket("marital_status", hash_bucket_size=1000, combiner="sqrtn")
race = tf.contrib.layers.sparse_column_with_hash_bucket("race", hash_bucket_size=1000, combiner="sqrtn")
income = tf.contrib.layers.sparse_column_with_hash_bucket("income", hash_bucket_size=1000, combiner="sqrtn")

### Base Continuous Feature Columns

Similarly, we can define a *RealValuedColumn* for each continuous feature column that we want to use in the model:

In [None]:
age = tf.contrib.layers.real_valued_column("age")
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

Until this point, we've simply been organizing data. We've:

1. Pulled the data from the web
2. Loaded the data into Pandas dataframes
3. Separated the features into continuous vs. categorical columns
4. Converted the data to Tensors
5. Engineered the data using tf.contrib

There's a lot more we can do with the data, but we're at the point that we can train a model. Let's do that and assess our baseline performance before going deeper.

## Defining the Logistic Regression Model
---

After processing the input data and defining all the feature columns, we're now ready to put them all together and build a Logistic Regression model. In the previous section we've seen two types of base feature columns:

- SparseColumn
- RealValuedColumn

Both of these are subclasses of the abstract *FeatureColumn* class, and can be added to the *feature_columns* field of a model:

In [None]:
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.LinearClassifier(feature_columns=[
  gender, native_country, education, occupation, workclass, marital_status, race],
  model_dir=model_dir)

The model also automatically learns a bias term, which controls the prediction one would make without observing any features (Check out TensorFlow's ["How Logistic Regression Works"](https://www.tensorflow.org/tutorials/wide#how_logistic_regression_works) for more explanation). We've chosen to start with a linear model, again, to assess our baseline performance. 

The learned model files will be stored in *model_dir*.
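As a minimal sketch of how a linear classifier over categorical features arrives at a prediction: each feature value that is present contributes its own weight, the bias is added, and the sum is passed through a sigmoid to give a number between 0 and 1. All numbers below are invented for illustration and are not weights learned by the model above.

```python
import math

# Hypothetical parameters: one weight per (column, value) pair, plus a bias.
weights = {
    ("education", "Bachelors"): 0.9,
    ("education", "HS-grad"): -0.4,
    ("occupation", "Exec-managerial"): 1.1,
}
bias = -1.5

def predict(person):
    """Sum the weights of the person's feature values, add bias, apply sigmoid."""
    logit = bias + sum(weights.get(item, 0.0) for item in person.items())
    return 1.0 / (1.0 + math.exp(-logit))

no_features = predict({})  # depends on the bias alone
executive = predict({"education": "Bachelors", "occupation": "Exec-managerial"})
print("bias only: %.3f, with features: %.3f" % (no_features, executive))
```

With no features observed, the prediction is determined entirely by the bias term.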

## Training and Evaluating our Model
---

After adding all the features to the model, now let's look at how to actually train the model. Training a model is just a one-liner using the TF.Learn API:

In [None]:
m.fit(input_fn=train_input_fn, steps=200)

After the model is trained, we can evaluate how good our model is at predicting the labels of the test data:

In [None]:
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

The first line of the output should be something like accuracy: 0.829433, which means the accuracy is 82.9%. 
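For reference, accuracy is simply the fraction of examples whose predicted label matches the true label. A small sketch with made-up labels and predictions (not output from the model above):

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return float(correct) / len(labels)

# Hypothetical labels and predictions for six people.
true_labels = [0, 1, 0, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 0]
print("accuracy: %.3f" % accuracy(true_labels, predicted))
```

When the two classes are unbalanced, it helps to compare this number against the accuracy of always predicting the more common class.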

For some models, that might be enough. We're going to follow a bad habit by answering the question of "what is our accuracy target?" with the answer, "better!"

Let's start by working with our data, as a better understanding of our data will inform any choices we make when choosing (or building) the right model.

## Adding Derived Columns

### Making Continuous Features Categorical through Bucketization

Sometimes the relationship between a continuous feature and the label is not linear. As a hypothetical example, a person's income may grow with age in the early stage of one's career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the raw age as a real-valued feature column might not be a good choice because the model can only learn one of the three cases:

1. Income always increases at some rate as age grows (positive correlation),
2. Income always decreases at some rate as age grows (negative correlation), or
3. Income stays the same no matter at what age (no correlation)

If we want to learn the fine-grained correlation between income and each age group separately, we can leverage **bucketization.** Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. So, we can define a *bucketized_column* over *age* as:

In [None]:
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

Here, *boundaries* is a list of bucket boundaries. In this case, there are 10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24, 25-29, ..., to 65 and over).
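The mapping from an age to a bucket ID can be sketched with the standard library's `bisect` module. This assumes each boundary is the inclusive lower edge of a bucket, which matches the age groups described above. The helper name `bucket_id` is hypothetical.

```python
import bisect

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucket_id(age):
    """Index of the bucket an age falls into; each boundary starts a new bucket."""
    return bisect.bisect_right(boundaries, age)

for age in [17, 18, 24, 25, 64, 65, 90]:
    print("age %d -> bucket %d" % (age, bucket_id(age)))
```

Each bucket ID then gets its own weight, so the model can learn a different effect for each age group instead of a single slope for age.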

### Intersecting Multiple Columns with CrossedColumn

Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning > 50,000 dollars) may be different for different occupations. Therefore, if we only learn a single model weight for *education="Bachelors"* and *education="Masters"*, we won't be able to capture every single education-occupation combination (e.g. distinguishing between *education="Bachelors"* AND *occupation="Exec-managerial"* and *education="Bachelors"* AND *occupation="Craft-repair"*). To learn the differences between different feature combinations, we can add **crossed feature columns** to the model.

In [None]:
education_x_occupation = tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4))

We can also create a *CrossedColumn* over more than two columns. Each constituent column can be either a base feature column that is categorical (*SparseColumn*), a bucketized real-valued feature column (*BucketizedColumn*), or even another *CrossedColumn*. Here's an example:

In [None]:
age_buckets_x_education_x_occupation = tf.contrib.layers.crossed_column(
  [age_buckets, education, occupation], hash_bucket_size=int(1e6))
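One way to picture a crossed column is as a single ID derived from the combination of the constituent values, hashed into a fixed number of buckets. The sketch below joins the values into one string and hashes it with MD5. Both choices are stand-ins for illustration, the helper name `crossed_id` is hypothetical, and the IDs TensorFlow computes will differ.

```python
import hashlib

def crossed_id(values, hash_bucket_size):
    """Hash a combination of feature values into [0, hash_bucket_size).

    Joining with "_X_" and hashing with MD5 are illustrative stand-ins.
    """
    joined = "_X_".join(values)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

exec_id = crossed_id(["Bachelors", "Exec-managerial"], int(1e4))
craft_id = crossed_id(["Bachelors", "Craft-repair"], int(1e4))
print("Bachelors x Exec-managerial -> %d" % exec_id)
print("Bachelors x Craft-repair -> %d" % craft_id)
```

Each combination gets its own ID, and therefore its own weight, except where two combinations happen to collide in the same bucket.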

Let's now add our derived columns to our model.

In [None]:
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.LinearClassifier(feature_columns=[
  gender, native_country, education, occupation, workclass, marital_status, race,
  age_buckets, education_x_occupation, age_buckets_x_education_x_occupation],
  model_dir=model_dir)

With a dataset of our base and derived features, we'll retrain our model. 

In [None]:
m.fit(input_fn=train_input_fn, steps=200)

In [None]:
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

Our accuracy printout is now something like 0.83557522, which means the accuracy is 83.6%. So, better handling of the data did yield a small increase in performance.

If you have time at the end, try more features and transformations and see if you can do even better!

## Changing Hyperparameters of Training Model

### Adding Regularization to Prevent Overfitting
---
Regularization is a technique used to avoid **overfitting.** Overfitting happens when your model does well on the data it is trained on, but worse on test data that the model has not seen before, such as live traffic. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of training examples. Regularization lets you control your model's complexity and makes the model more generalizable to unseen data.

### Task 3 - Experiment with hyperparameters

In the Linear Model library, you can add L1 and L2 regularizations to the model. One important difference between L1 and L2 regularization is that L1 regularization tends to make model weights stay at zero, creating sparser models, whereas L2 regularization also tries to make the model weights closer to zero but not necessarily zero. Therefore, if you increase the strength of L1 regularization, you will have a smaller model size because many of the model weights will be zero. This is often desirable when the feature space is very large but sparse, and when there are resource constraints that prevent you from serving a model that is too large.
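One way to see the difference is through the shrinkage step each penalty induces on a single weight. The sketch below uses the standard soft-thresholding operator for L1 and proportional shrinkage for L2. It illustrates the general idea only and is not the FTRL update that TensorFlow performs. The weights and strength are made up.

```python
def l1_shrink(weight, strength):
    """Soft-thresholding: move the weight toward zero, clamping at exactly zero."""
    if weight > strength:
        return weight - strength
    if weight < -strength:
        return weight + strength
    return 0.0

def l2_shrink(weight, strength):
    """Proportional shrinkage: smaller, but nonzero unless it started at zero."""
    return weight / (1.0 + strength)

weights = [0.05, -0.3, 1.2, -0.08]
strength = 0.1
l1_weights = [l1_shrink(w, strength) for w in weights]
l2_weights = [l2_shrink(w, strength) for w in weights]
print("L1: %s" % l1_weights)
print("L2: %s" % l2_weights)
```

In this example, the two weights whose magnitude is below the strength become exactly zero under L1, while L2 leaves every weight nonzero.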

Try various combinations of L1 and L2 regularization strengths to find the parameters that best control overfitting and give you a desirable model size, starting with the baseline we've written for you below:

In [None]:
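# Use a fresh directory so training does not resume from the previous model's checkpoint.
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.LinearClassifier(feature_columns=[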
  gender, native_country, education, occupation, workclass, marital_status, race,
  age_buckets, education_x_occupation, age_buckets_x_education_x_occupation],
  optimizer=tf.train.FtrlOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=1.0,
    l2_regularization_strength=1.0),
  model_dir=model_dir)

In [None]:
m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

How'd you do?!

At this point you've loaded data and set it up for training in TensorFlow. You've trained and evaluated a linear model. You've manipulated your data and the hyperparameters of your training model to improve performance. There are a few places you can go from here:

1. Make your model deeper. You can check out TensorFlow's [Wide & Deep Learning Tutorial](https://www.tensorflow.org/tutorials/wide_and_deep) where they'll show you how to combine the strengths of linear models and deep neural networks by jointly training them. You can add code blocks to this notebook using the '+' under 'File' to work through that tutorial.
2. See what you can build with your predictions. <code>list(m.predict(input_fn=eval_input_fn))</code> returns a list of predictions. What can you do with them?!
3. Build a linear model using other data. There's another dataset loaded to this instance (from [Kaggle.com](https://www.kaggle.com/datasets)) which contains metrics extracted from annual SEC 10-K filings (2012-2016) for the S&P 500. Pick a target column to predict and create a linear model using the workflow we just learned. Start with <code>df_train2 = pd.read_csv('fundamentals.csv', skipinitialspace=True)</code> to read the dataset into a Pandas dataframe.

Use the code block below and feel free to add more with the '+' button under "File."

## Lab Assets

<a id = 'columns'></a>

### Categorical vs. Continuous Column Test

Run the code block below to see if you categorized the columns accurately. If not, the code will fix it for you and you can examine the TEST variables to see what it should have looked like. 

In [None]:
CATEGORICAL_COLUMNS_TEST = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS_TEST = ["age", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
if sorted(CATEGORICAL_COLUMNS) == sorted(CATEGORICAL_COLUMNS_TEST) and sorted(CONTINUOUS_COLUMNS) == sorted(CONTINUOUS_COLUMNS_TEST):
    print("You got it! Click the link below to return to the lab.")
else:
    # Overwrite the lists with the correct answers so later cells work.
    CATEGORICAL_COLUMNS = CATEGORICAL_COLUMNS_TEST
    CONTINUOUS_COLUMNS = CONTINUOUS_COLUMNS_TEST
    print("Not quite, but we fixed it for you. Examine CATEGORICAL_COLUMNS_TEST and CONTINUOUS_COLUMNS_TEST to see what it should have looked like. When done, click the link below to return to the lab.")

[Return to lab](#columnreturn)

Here's a list of columns available in the Census Income dataset:

![](continuous-categorical.PNG)

[Return to lab](#columnreturn)