
# Columns and Variables

Recall that the columns of a tabular data set represent variables. They are the measurements that we make on each observation. 

As an example, let's consider the variables in the OKCupid data set. This data set does not have a natural index, so we use the default index (0, 1, 2, ...).

In [0]:
import pandas as pd

data_dir = "https://dlsun.github.io/pods/data/"
df_okcupid = pd.read_csv(data_dir + "okcupid.csv")
df_okcupid.head()

## Types of Variables

There is a fundamental difference between variables like `age` and `height`, which can be measured on a numeric scale, and variables like `religion` and `orientation`, which cannot be.

Variables that can be measured on a numeric scale are called **quantitative variables**. Just because a variable happens to contain numbers does not necessarily make it "quantitative". For example, in the Framingham data set, the `SEX` column was coded as 1 for men and 2 for women. However, these numbers are not on any meaningful numerical scale; a woman is not "twice" a man.

Variables that are not quantitative but take on a limited set of values are called **categorical variables**. For example, the variable `orientation` takes on one of only three values in this data set (gay, straight, or bisexual), so it is a categorical variable. So is the variable `religion`, which takes on a larger, but still limited, set of values. We call each possible value of a categorical variable a "level". Levels are usually non-numeric.
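The point about number-coded variables can be sketched with a small, made-up table. The `df_toy` `DataFrame` and its values below are hypothetical (they are not the real Framingham data); the sketch only illustrates why an average of `SEX` codes is not meaningful, while counting its levels is.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Framingham data set.
df_toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1, 2],        # 1 = male, 2 = female (codes, not quantities)
    "AGE": [39, 46, 48, 61, 43],   # measured in years
})

# The mean of AGE is interpretable: it is the average age in years.
mean_age = df_toy["AGE"].mean()

# pandas will compute a mean of SEX without complaint,
# but the resulting number does not describe anything meaningful.
mean_sex = df_toy["SEX"].mean()

# For a categorical variable, listing and counting the levels is more sensible.
sex_levels = sorted(df_toy["SEX"].unique())
sex_counts = df_toy["SEX"].value_counts()

print(mean_age, mean_sex)
print(sex_levels)
print(sex_counts)
```

Note that `pandas` does not stop us from averaging the codes; deciding whether a column is quantitative is a judgment that we have to make.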

Some variables do not fit neatly into either classification. For example, the variable `essay1` contains users' answers to the prompt "What I’m doing with my life". This variable is obviously not quantitative, but it is not categorical either because every user has a unique answer. In other words, this variable does not take on a limited set of values. We will group such variables into an "other" category.

Every variable can be classified into one of these three **types**: 
- quantitative,
- categorical, or
- other. 

The type of the variable often dictates how we analyze that variable, as we will see in the next two chapters.

## Selecting Variables

Suppose we want to select the `age` column from the `DataFrame` above. There are three ways to do this.

1\.  Use `.loc`, specifying both the rows and columns. (The colon `:` is Python slice notation; used on its own, it selects everything, which here means all rows.)

In [0]:
df_okcupid.loc[:, "age"]

2\. Access the column as you would a key in a `dict`.

In [0]:
df_okcupid["age"]

3\. Access the column as an attribute of the `DataFrame`.

In [0]:
df_okcupid.age

Method 3 (attribute access) is the most concise. However, it does not work if the variable name contains spaces or special characters, begins with a number, or matches an existing attribute of `DataFrame`. For example, if `df_okcupid` had a column called `head`, `df_okcupid.head` would not return the column because `df_okcupid.head` already refers to the `.head()` method that we used above to display the first few rows.
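The following sketch illustrates these pitfalls on a made-up `DataFrame`. The name `df_toy` and its columns are assumptions chosen only to trigger the problem cases; they are not part of the OKCupid data.

```python
import pandas as pd

# Hypothetical toy DataFrame; the column names are chosen to trigger the pitfalls.
df_toy = pd.DataFrame({
    "age": [22, 35],
    "head": ["a", "b"],             # clashes with the DataFrame.head method
    "body type": ["fit", "thin"],   # contains a space
})

# Attribute access works for a simple column name.
print(df_toy.age)

# Here, df_toy.head is the method, not the column, so it is callable.
print(callable(df_toy.head))

# Bracket access and .loc work for any column name.
head_column = df_toy["head"]
body_type_column = df_toy.loc[:, "body type"]
print(head_column)
print(body_type_column)
```

Because Methods 1 and 2 work for every column name, they are the safer choice in code that has to handle arbitrary data sets.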

Notice that a `Series` is used here to store a single variable (across multiple observations). In the previous section, we saw that a `Series` can also be used to store a single observation (across multiple columns). To summarize, the `Series` data structure is used to store either a single row or a single column in a tabular data set. In other words, while a `DataFrame` is two-dimensional (containing both rows and columns), a `Series` is one-dimensional.
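As a quick check of this summary, the sketch below selects one column and one row from a small, hypothetical `DataFrame` (`df_toy` is made up for illustration) and inspects their dimensions with the `.ndim` and `.shape` attributes.

```python
import pandas as pd

# Hypothetical toy DataFrame with 3 observations and 2 variables.
df_toy = pd.DataFrame({
    "age": [22, 35, 38],
    "religion": ["agnosticism", "atheism", "catholicism"],
})

column = df_toy["age"]   # one variable across all observations
row = df_toy.loc[0]      # one observation across all variables

# Both selections are Series, and both are one-dimensional.
print(type(column).__name__, column.ndim, column.shape)
print(type(row).__name__, row.ndim, row.shape)

# The DataFrame itself is two-dimensional.
print(df_toy.ndim, df_toy.shape)
```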

To select multiple columns, pass a _list_ of variable names instead of a single variable name. For example, to select both `age` and `religion`, either of the two methods below works, and both produce the same result:

In [0]:
# METHOD 1
df_okcupid.loc[:, ["age", "religion"]].head()

# METHOD 2
df_okcupid[["age", "religion"]].head()
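One way to confirm that the two methods agree is to compare their results with `DataFrame.equals`. The sketch below does this on a hypothetical `df_toy` (made-up values, same column names as above) so that it runs without downloading any data.

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for df_okcupid.
df_toy = pd.DataFrame({
    "age": [22, 35, 38],
    "religion": ["agnosticism", "atheism", "catholicism"],
    "height": [75.0, 70.0, 68.0],
})

via_loc = df_toy.loc[:, ["age", "religion"]]   # METHOD 1
via_brackets = df_toy[["age", "religion"]]     # METHOD 2

# Both selections contain the same columns and values.
print(via_loc.equals(via_brackets))

# Passing a list of names returns a DataFrame with just those columns.
print(type(via_brackets).__name__, list(via_brackets.columns))
```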

## Type Inference and Casting


`pandas` tries to infer the type of each variable automatically. If every value in a column (except for missing values) is a number, then `pandas` stores the column with a numeric `dtype`, and the variable is treated as quantitative. Otherwise, the column is typically stored with the generic `object` `dtype`, and the variable is treated as categorical.

To determine the type that `pandas` inferred, select the variable using any of the methods above and look at the `dtype` reported at the bottom of the output. A `dtype` of `float64` or `int64` indicates that the variable is quantitative. For example, the `age` variable has a `dtype` of `int64`, so it is quantitative.

In [0]:
df_okcupid.age

On the other hand, the `religion` variable has a `dtype` of `object`, so `pandas` will treat it as categorical.

In [0]:
df_okcupid.religion
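Instead of reading the `dtype` off the printed output, it can also be inspected directly. The sketch below uses the `.dtype` attribute of a `Series`, the `.dtypes` attribute of a `DataFrame`, and the helper `pandas.api.types.is_numeric_dtype`. The `df_toy` table is hypothetical, and the exact `dtype` reported for text columns may differ between `pandas` versions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy DataFrame; the values are made up for illustration.
df_toy = pd.DataFrame({
    "age": [22, 35, 38],
    "height": [75.0, 70.0, None],   # a missing value in a numeric column
    "religion": ["agnosticism", "atheism", "catholicism"],
})

# dtype of a single column
print(df_toy["age"].dtype)

# dtypes of every column at once
print(df_toy.dtypes)

# True for columns that pandas stores as numbers, False otherwise
numeric_flags = {name: is_numeric_dtype(df_toy[name]) for name in df_toy.columns}
print(numeric_flags)
```

The `height` column is still numeric even though it contains a missing value, which matches the rule that missing values are ignored during type inference.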

Sometimes it is necessary to convert quantitative variables to categorical variables and vice versa. This can be achieved using the `.astype()` method of a `Series`. For example, to convert `age` to a categorical variable, we simply cast its values to strings.

In [0]:
df_okcupid.age.astype(str)

To save this as a column in the `DataFrame`, we assign it to a column called `age_cat`. (Note that this column does not exist yet! It will be created at the time of assignment.)

In [0]:
df_okcupid["age_cat"] = df_okcupid.age.astype(str)

# Check that age_cat is a column in this DataFrame
df_okcupid.head()
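The text above shows only the quantitative-to-categorical direction. A minimal sketch of both directions follows, again on a hypothetical `df_toy`; the column name `age_num` is made up for illustration. Casting back with `.astype(int)` assumes that every string in the column is a valid integer; otherwise, it raises an error.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy DataFrame standing in for df_okcupid.
df_toy = pd.DataFrame({"age": [22, 35, 38]})

# Quantitative -> categorical: cast the numbers to strings.
df_toy["age_cat"] = df_toy["age"].astype(str)

# Categorical -> quantitative: cast strings of digits back to integers.
# (Assumes every value is a valid integer string.)
df_toy["age_num"] = df_toy["age_cat"].astype(int)

print(df_toy.dtypes)
print(is_numeric_dtype(df_toy["age_cat"]), is_numeric_dtype(df_toy["age_num"]))
```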

## Exercises

Exercises 1-2 use the Titanic data set (https://dlsun.github.io/pods/data/titanic.csv).

1\. Read in the Titanic data set. Identify each variable in the Titanic data set as either quantitative, categorical, or other. Cast all variables to the right type and assign them back to the `DataFrame`. *Note: deciding the appropriate type can be tricky! Think carefully about the `'name'`, `'ticketno'`, and `'survived'` columns, in particular.*

In [0]:
# YOUR CODE HERE

2\. Create a `DataFrame` (not a `Series`) consisting of just the `class` column.

In [0]:
# YOUR CODE HERE