# Feature Types

Tabular data (`pd.DataFrame`), as discussed previously, is made up of observations (rows) and features (columns). The data types (`df.dtypes`) of features fall into two primary categories: **numeric** and **categorical**.

There is also a third, special category called **missing**. Strictly speaking, missing data is not a data type at all: it is a placeholder for a value that is not known or not applicable. In pandas, missing data is represented by `NaN` (not a number). More on missing data later.
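As a rough sketch of how these categories show up in pandas, the snippet below builds a small toy DataFrame and splits its columns by dtype using `select_dtypes`. The frame `toy` and its values are made up for illustration and are not taken from any dataset used in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only.
toy = pd.DataFrame({
    "score": [82.0, np.nan, 94.0],            # numeric, one missing value
    "lat": [37.79, 37.78, 37.74],             # numeric, no missing values
    "risk": ["High Risk", None, "Low Risk"],  # text, one missing value
})

# How pandas stores each column.
print(toy.dtypes)

# Columns stored as numbers vs. everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)

# Missing entries are flagged the same way in numeric and text columns.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Keep in mind that a dtype only reports how a column is stored, which does not always match what the feature means.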


```{figure} ../assets/datatypes.png
---
width: 100%
name: directive-fig
---
Classification of feature types
```

To study these feature types, we will use the dataset of food safety scores for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health. 

In [95]:
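import pandas as pd

# Draw a 50-row sample from the full inspection data and keep a subset of columns.
# Note: the sample is unseeded, so re-running this cell writes a different file.
data = pd.read_csv('../data/restaurants.tsv', sep='\t')
sample = data.sample(50)
sample = sample[['business_id', 'business_postal_code', 'business_phone_number',
                 'business_latitude', 'business_longitude', 'inspection_type',
                 'inspection_score', 'risk_category', 'violation_description']]

# Shorten the column names and reset the index to 0..49 before saving.
sample.columns = ['id', 'zip', 'phone', 'lat', 'lng', 'type', 'score', 'risk', 'violation']
sample.index = range(len(sample))
sample.to_csv('../data/restaurants_truncated.csv')

# Reload the truncated file; the first CSV column holds the saved index.
data = pd.read_csv('../data/restaurants_truncated.csv', index_col=0)
data.head()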

Unnamed: 0,id,zip,phone,lat,lng,type,score,risk,violation
0,64454,94105,,37.787925,-122.400953,Routine - Unscheduled,82.0,High Risk,High risk food holding temperature
1,33014,94109,,37.786108,-122.425764,Complaint,,,
2,1526,94115,,37.791607,-122.434563,Routine - Unscheduled,82.0,Low Risk,Inadequate warewashing facilities or equipment
3,73,94115,,37.788932,-122.433895,Routine - Unscheduled,78.0,Low Risk,Improper food storage
4,66402,94110,,37.739161,-122.416967,Routine - Unscheduled,94.0,Low Risk,Unapproved or unmaintained equipment or utensils


Let's first look at the data type that is simplest to work with: numerical data.

## Numerical Features

Numeric data is data that can be represented as numbers. These variables generally describe some numeric _quantity_ or _amount_ and are also sometimes referred to as "quantitative" variables. 

Since numerical features are already represented as numbers, they can be used in machine learning models directly, with no need to encode them.

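In the example above, the numerical features are `zip`, `phone`, `lat`, `lng`, and `score`. Note that `zip` and `phone` are stored as numbers but act as identifiers: arithmetic on them (for example, averaging two zip codes) is not meaningful, so in practice they are often treated as categorical.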

In [99]:
data[['zip', 'phone', 'lat', 'lng', 'score']].head()

Unnamed: 0,zip,phone,lat,lng,score
0,94105,,37.787925,-122.400953,82.0
1,94109,,37.786108,-122.425764,
2,94115,,37.791607,-122.434563,82.0
3,94115,,37.788932,-122.433895,78.0
4,94110,,37.739161,-122.416967,94.0


### Discrete Features

Discrete data is data that is counted. For example, the number of students in a class is discrete: you can only count whole students, never a fraction of one.

In the restaurant inspection dataset, `zip`, `phone`, and `score` are discrete features.

In [100]:
data[['zip', 'phone', 'score']].head()

Unnamed: 0,zip,phone,score
0,94105,,82.0
1,94109,,
2,94115,,82.0
3,94115,,78.0
4,94110,,94.0


### Continuous Features

Continuous data is data that is measured. For example, the height of a student is continuous: a measurement can fall anywhere between whole units, such as 5 feet 6.5 inches.

In the restaurant inspection dataset, `lat` and `lng` are continuous features.

In [101]:
data[['lat', 'lng']].head()

Unnamed: 0,lat,lng
0,37.787925,-122.400953
1,37.786108,-122.425764
2,37.791607,-122.434563
3,37.788932,-122.433895
4,37.739161,-122.416967
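One rough way to tell the two kinds of numerical features apart is to check whether every recorded value is a whole number. The sketch below uses a hypothetical helper, `looks_discrete`, which is not part of pandas, on the `score` and `lat` values shown in the outputs above.

```python
import numpy as np
import pandas as pd

def looks_discrete(column: pd.Series) -> bool:
    """Heuristic: True if every non-missing value is a whole number."""
    values = column.dropna()
    return bool((values % 1 == 0).all())

# Values copied from the first five rows of the outputs above.
score = pd.Series([82.0, np.nan, 82.0, 78.0, 94.0])
lat = pd.Series([37.787925, 37.786108, 37.791607, 37.788932, 37.739161])

print(looks_discrete(score))
print(looks_discrete(lat))
```

This is only a heuristic: a measured quantity that happens to be recorded to the nearest whole unit would also pass the check, so the final call depends on what the feature represents.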


## Categorical Features

Categorical data is data that is not numeric. It is often represented as text or a set of text values. These variables generally describe some _characteristic_ or _quality_ of a data unit, and are also sometimes referred to as "qualitative" variables. In the restaurant inspection dataset, `type`, `risk`, and `violation` are categorical features.

In [103]:
data[['type', 'risk', 'violation']].head()

Unnamed: 0,type,risk,violation
0,Routine - Unscheduled,High Risk,High risk food holding temperature
1,Complaint,,
2,Routine - Unscheduled,Low Risk,Inadequate warewashing facilities or equipment
3,Routine - Unscheduled,Low Risk,Improper food storage
4,Routine - Unscheduled,Low Risk,Unapproved or unmaintained equipment or utensils


### Nominal Features

Nominal data is categorical data with no inherent order. For example, the color of a car is nominal: no color is greater or less than another; they are just different. In the restaurant inspection dataset, `type` and `violation` are nominal features.

In [104]:
data[['type', 'violation']].head()

Unnamed: 0,type,violation
0,Routine - Unscheduled,High risk food holding temperature
1,Complaint,
2,Routine - Unscheduled,Inadequate warewashing facilities or equipment
3,Routine - Unscheduled,Improper food storage
4,Routine - Unscheduled,Unapproved or unmaintained equipment or utensils


#### Encoding Nominal Features Using One-Hot Encoding

Feature engineering opens up a whole new set of possibilities for designing better performing models. As you will see in lab and homework, feature engineering is one of the most important parts of the entire modeling process.

```{figure} ../assets/ohe.png
---
width: 65%
name: directive-fig
---
One-hot encoding
```

A particularly powerful use of feature engineering is to allow us to perform regression on non-numeric features. One-hot encoding is a feature engineering technique that generates numeric features from categorical data: each distinct category becomes its own column, holding 1 where the row belongs to that category and 0 otherwise. This allows us to use our usual methods to fit a regression model on the data.

To illustrate how this works, we’ll refer back to the tips dataset from previous lectures. Consider the "day" column of the dataset:

In [113]:
import seaborn as sns
import numpy as np
np.random.seed(1337)
tips = sns.load_dataset("tips")
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
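A minimal sketch of one-hot encoding a column like "day" is shown below, using `pd.get_dummies`. To keep the example self-contained, it uses a small hypothetical stand-in frame, `days`, rather than the full tips dataset; the row values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the "day" column of tips (made-up rows).
days = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri", "Sat"]})

# One column per distinct day; each row has a single 1, in the column of its day.
one_hot = pd.get_dummies(days["day"], prefix="day", dtype=int)
print(one_hot)

# The encoded columns can be attached back to the original frame.
encoded = pd.concat([days, one_hot], axis=1)
print(encoded)
```

Each row of `one_hot` sums to 1, since every observation belongs to exactly one day. Other tools, such as scikit-learn's `OneHotEncoder`, achieve the same result; `get_dummies` is used here only because it is the shortest route in pandas.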


### Ordinal Features

Ordinal data is categorical data whose values have a meaningful order. For example, t-shirt sizes are ordinal: they can be ranked from smallest to largest, even though the gaps between sizes are not numerically defined. In the restaurant inspection dataset, `risk` is an ordinal feature, ordered from low to high risk.


In [105]:
data[['risk']].head()

Unnamed: 0,risk
0,High Risk
1,
2,Low Risk
3,Low Risk
4,Low Risk


#### Encoding Ordinal Features

In [109]:
# Map each risk level to an integer that preserves its order; missing values stay NaN.
data['risk_enc'] = data['risk'].map({'Low Risk': 0, 'Moderate Risk': 1, 'High Risk': 2})
data.head()

Unnamed: 0,id,zip,phone,lat,lng,type,score,risk,violation,risk_enc
0,64454,94105,,37.787925,-122.400953,Routine - Unscheduled,82.0,High Risk,High risk food holding temperature,2.0
1,33014,94109,,37.786108,-122.425764,Complaint,,,,
2,1526,94115,,37.791607,-122.434563,Routine - Unscheduled,82.0,Low Risk,Inadequate warewashing facilities or equipment,0.0
3,73,94115,,37.788932,-122.433895,Routine - Unscheduled,78.0,Low Risk,Improper food storage,0.0
4,66402,94110,,37.739161,-122.416967,Routine - Unscheduled,94.0,Low Risk,Unapproved or unmaintained equipment or utensils,0.0
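As an alternative to the dictionary-based encoding above, the order of the levels can be declared once with an ordered `pd.Categorical`, and the integer codes read off from it. This is a sketch of one possible approach, not the method used in this chapter; the `risk` values are copied from the first five rows shown above.

```python
import numpy as np
import pandas as pd

# Values copied from the first five rows of the `risk` column above.
risk = pd.Series(["High Risk", np.nan, "Low Risk", "Low Risk", "Low Risk"])

# Declare the levels from lowest to highest.
levels = ["Low Risk", "Moderate Risk", "High Risk"]
risk_cat = pd.Categorical(risk, categories=levels, ordered=True)

# .codes gives each value's position in `levels`; missing values get -1.
codes = pd.Series(risk_cat.codes, index=risk.index).astype(float)

# Turn the -1 sentinel back into NaN so missing values stay missing.
risk_enc = codes.where(codes >= 0)
print(risk_enc.tolist())
```

Either way, the integers only encode rank. Spacing the levels evenly (0, 1, 2) is itself a modeling assumption, because nothing in the data says that the gap between low and moderate risk equals the gap between moderate and high risk.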



## Missing Data

Missing data is a value that was not recorded or does not apply. In raw files it often appears as a blank or a placeholder such as a question mark; pandas represents it as `NaN`. Missingness is commonly divided into three kinds: missing completely at random, missing at random, and missing not at random.

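pandas provides three main methods for working with missing data: `.isnull()` flags missing entries, `.dropna()` removes rows or columns that contain them, and `.fillna()` replaces them with a specified value.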

In [114]:
# Count the missing values in each column.
data.isnull().sum()

id            0
zip           0
phone        37
lat          23
lng          23
type          0
score        11
risk         11
violation    11
risk_enc     11
dtype: int64
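The three methods can be sketched on a small toy frame, as below. The frame `toy` is hypothetical and only loosely modeled on the columns above, and filling `score` with the column mean is just one possible choice, used here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, loosely modeled on the columns above.
toy = pd.DataFrame({
    "score": [82.0, np.nan, 78.0, 94.0],
    "risk": ["High Risk", np.nan, "Low Risk", "Low Risk"],
    "lat": [37.787925, np.nan, np.nan, 37.739161],
})

# .isnull(): flag missing entries, then count them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# .dropna(): keep only rows with no missing values at all...
complete_rows = toy.dropna()
# ...or only require that specific columns are present.
scored_rows = toy.dropna(subset=["score"])

# .fillna(): replace missing scores with the column mean (an assumed strategy).
filled = toy.fillna({"score": toy["score"].mean()})
print(filled)
```

Dropping rows discards information, while filling values invents it, so the right choice depends on why the values are missing in the first place.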

In [116]:
data['type'].value_counts()

Routine - Unscheduled    39
Complaint                 5
New Ownership             3
Reinspection/Followup     3
Name: type, dtype: int64