# 1. Numeric features 

## 1.1 Feature preprocessing

### 1.1.1 Scaling

The first thing to know about handling numeric features is that some models depend on feature scale and others do not.
![sc_1.png](pics/sc_1.png)

Tree-based models don't depend on feature scaling. 

Linear models, in contrast, have difficulty with differently scaled features, for two reasons:

1. We want regularization to be applied to the coefficients of all features in equal amounts, but in fact the impact of regularization turns out to be proportional to feature scale.
2. Gradient descent methods can converge poorly without proper scaling.

For the same reasons, neural networks are similar to linear models in their requirements for feature preprocessing.

It is important to understand that different feature scalings result in different model quality. In this sense, scaling is just another hyperparameter you need to optimize.


The easiest way to deal with this is to rescale all features to the same scale, for example to make the minimum of a feature equal to zero and the maximum equal to one. You can achieve this in two steps:
1. First, subtract the minimum value.
2. Second, divide the result by the difference between the maximum and the minimum.
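
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `min_max_scale` and the sample `ages` array are made up for the example, and the handling of a constant feature is an assumption.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D numeric array so that its minimum is 0 and its maximum is 1."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Assumption: a constant feature is mapped to all zeros to avoid dividing by zero.
        return np.zeros_like(x)
    # Step 1: subtract the minimum. Step 2: divide by (maximum - minimum).
    return (x - x_min) / (x_max - x_min)

ages = np.array([22.0, 38.0, 26.0, 80.0])  # hypothetical feature values
scaled = min_max_scale(ages)
```

In practice the minimum and maximum would usually be computed on the training data and reused for the test data, which is what scikit-learn's `MinMaxScaler` does after `fit`.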

![sc_2.png](pics/sc_2.png)

![sc_3.png](pics/sc_3.png)

![sc_4.png](pics/sc_4.png)

### 1.1.2 Outliers

When we work with linear models, there is another important factor that influences model training results: outliers.

![sc_2.png](pics/ot_1.png)
![sc_2.png](pics/ot_2.png)


### 1.1.3 Ranks

Another effective preprocessing for numeric features is the rank transformation. Basically, it sets the spaces between properly sorted values to be equal. For example, this transformation can be a better option than MinMaxScaler if we have outliers, because the rank transformation moves the outliers closer to other objects.

![r_1.png](pics/r_1.png)

**Linear models, KNN, and neural networks can benefit from this kind of transformation if we have no time to handle outliers manually.**
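
As a rough sketch of the rank transformation, `scipy.stats.rankdata` can be applied to a feature that contains an outlier. The sample values are invented for illustration.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical feature: the last value is an outlier.
values = np.array([1.0, 2.0, 3.0, 1000.0])

# rankdata assigns ranks starting at 1 (ties get the average rank by default),
# so the gaps between sorted values become equal.
ranks = rankdata(values)
```

After the transformation the outlier sits just one step away from its neighbour. To transform test data consistently, one option is to rank the concatenated train and test values; this is an assumption about the workflow, not something the text above prescribes.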

### 1.1.4 Log transformation and raising to a power < 1

There is one more kind of numeric feature preprocessing that often helps non-tree-based models, and especially neural networks: the log transformation and raising to a power less than one.

![ls_1.png](pics/ls_1.png)
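
A minimal sketch of both transformations with NumPy follows. The sample array is invented, and using `log1p` (that is, log(1 + x)) so that zero stays finite and the square root as the "power < 1" are choices made for this example.

```python
import numpy as np

# Hypothetical non-negative feature with a heavy right tail.
x = np.array([0.0, 9.0, 99.0, 9999.0])

# Log transformation: log(1 + x) keeps x = 0 finite.
x_log = np.log1p(x)

# Raising to a power less than one: here the square root (power 0.5).
x_sqrt = np.sqrt(x)
```

Both transformations pull large values closer to the bulk of the data and spread out the values near zero.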


## 1.2. Feature Generation

![fg_1](pics/fg_1.png)

### 1.2.1 Examples

![fg_1](pics/fg_2.png)
![fg_1](pics/fg_3.png)

Among other things, it is useful to know that additions, multiplications, divisions, and other feature interactions can help not only linear models. For example, although a gradient boosted decision tree is a very powerful model, it still has difficulty approximating multiplications and divisions. Adding such features explicitly can lead to a more robust model with fewer trees.
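
Here is a small sketch of adding a product and a ratio explicitly with pandas. The column names (`price`, `width_m`, `length_m`) and the values are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

# Hypothetical real-estate data.
df = pd.DataFrame({
    "price": [300_000.0, 450_000.0, 120_000.0],
    "width_m": [6.0, 7.5, 5.0],
    "length_m": [10.0, 10.0, 8.0],
})

# Multiplication: area from the two side lengths.
df["area_m2"] = df["width_m"] * df["length_m"]

# Division: price per square meter.
df["price_per_m2"] = df["price"] / df["area_m2"]
```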

![fg_1](pics/fg_4.png)

This feature can help the model utilize the differences in people's perception of these prices. Also, we can find similar patterns in tasks which require distinguishing between a human and a robot.

For example, in financial data such as auctions, we could observe that people tend to set round numbers as prices, whereas a robot may set something like 0.935 followed by a very long string of digits. Or, if we are trying to find spambots on social networks, we can be sure that no human ever reads messages with an exact interval of one second.
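
One way to expose such patterns to a model, assuming the feature in question is the fractional part of a price, is sketched below. The prices are invented for illustration.

```python
import numpy as np

# Hypothetical prices: some "human-looking", one with an unusual fractional part.
prices = np.array([2.49, 0.99, 5.00, 7.935])

# np.modf splits each number into its fractional and integer parts.
fractional_part, integer_part = np.modf(prices)
```

Because of floating-point representation, the fractional parts are only approximately 0.49, 0.99, 0.0 and 0.935, so comparisons should use a tolerance.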

## 1.3. Conclusion

![nf_1.png](pics/nf_1.png)
![nf_1.png](pics/nf_2.png)

# 2. Categorical and Ordinal features

![co_1.png](pics/co_1.png)




## 2.1 Feature preprocessing

### 2.1.1 Label encoding

The simplest way to encode a categorical feature is to map its unique values to different numbers. People usually refer to this procedure as label encoding. This method works fine with trees, because tree methods can split the feature and extract most of the useful values in categories on their own.
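
Label encoding can be sketched with pandas in two common variants: codes in sorted (alphabetical) order and codes in order of appearance. The `embarked` values are a made-up sample.

```python
import pandas as pd

# Hypothetical categorical feature.
embarked = pd.Series(["S", "C", "Q", "S", "C"])

# Variant 1: codes follow the sorted order of the categories (C=0, Q=1, S=2).
codes_sorted = embarked.astype("category").cat.codes

# Variant 2: codes follow the order of first appearance (S=0, C=1, Q=2).
codes_appearance, uniques = pd.factorize(embarked)
```

Either mapping is arbitrary from the model's point of view, which is why it suits trees better than models that read the numbers as magnitudes.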

**Non-tree-based models, on the other hand, usually can't use this feature effectively.**
And if you want to train a linear model, kNN, or a neural network, you need to treat a categorical feature differently. To illustrate this, let's recall the example we had at the beginning of this topic.

![co_1.png](pics/co_2.png)

What if Pclass of one usually leads to the target of one, Pclass of two leads to zero, an

![co_3.png](pics/co_3.png)

### 2.1.2 Frequency encoding

![co_3.png](pics/co_4.png)

![co_3.png](pics/co_5.png)

### 2.1.3 One-hot encoding

![co_3.png](pics/co_6.png)


Note that if you have a few important numeric features and hundreds of binary features produced by one-hot encoding, it can become difficult for tree methods to use the former efficiently.

To store these new arrays efficiently, we must know about sparse matrices. In a nutshell, instead of allocating space in RAM for every element of an array, we can store only the non-zero elements and thus save a lot of memory. Going with sparse matrices makes sense if the number of non-zero values is far less than half of all the values. Sparse matrices are often useful when we work with categorical features or text data.
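
To make the memory argument concrete, here is a sketch that builds a one-hot encoding directly as a SciPy CSR sparse matrix. The `categories` array is invented, and building the matrix by hand from row and column indices is just one possible approach.

```python
import numpy as np
from scipy import sparse

# Hypothetical categorical feature.
categories = np.array(["red", "green", "blue", "green", "red"])

# Map each value to a column index; uniques is sorted: blue, green, red.
uniques, codes = np.unique(categories, return_inverse=True)

n_rows = len(categories)
# One non-zero entry per row: a 1 at (row, column of that row's category).
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, len(uniques)),
)
```

Only `one_hot.nnz` values (one per row) are stored, rather than all `n_rows * len(uniques)` cells of the dense array.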


## 2.2. Feature Generation

One of the most useful examples of feature generation is feature interaction between several categorical features. This is usually useful for non-tree-based models, namely linear models and kNN.
 
![co_3.png](pics/co_7.png)
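
A simple way to build such an interaction, sketched here with pandas, is to concatenate the string values of two categorical columns and one-hot encode the result. The column names `pclass` and `sex` and the rows are assumptions made for this example.

```python
import pandas as pd

# Hypothetical passenger data with two categorical features.
df = pd.DataFrame({
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
})

# Interaction feature: one category per (pclass, sex) combination.
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]

# One-hot encode the combined feature so a linear model gets
# a separate coefficient for each combination.
one_hot = pd.get_dummies(df["pclass_sex"])
```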

## 2.3. Conclusion

![co_3.png](pics/co_8.png)



# 3. Datetime and coordinates 

## 3.1 Feature Generation

Datetime is quite a distinct feature: besides its linear nature, it also has several different tiers, such as year, day, or week.

Most new features generated from datetime can be divided into two categories. 

1. The first is time moments in a period.
2. The second is time passed since a particular event.

The first one is very simple. We can add features like second, minute, hour, day of the week, day of the month, day of the year, and so on.

This is useful for capturing repetitive patterns in the data. If we know about some non-common periods that influence the data, we can add them as well.

For example, if we are to predict the efficiency of a medication, and patients receive pills once every three days, we can treat this as a special time period.
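
The periodic features described above can be sketched with the pandas `.dt` accessor. The timestamps, the column names, and the assumed start date of the three-day cycle (`treatment_start`) are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-01-01 08:30:00",
        "2015-01-04 17:45:00",
        "2015-02-14 23:05:00",
    ])
})

# Standard periods.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["day_of_month"] = df["timestamp"].dt.day
df["day_of_year"] = df["timestamp"].dt.dayofyear

# A non-common period: position within an assumed three-day cycle.
treatment_start = pd.Timestamp("2015-01-01")
days_elapsed = (df["timestamp"].dt.normalize() - treatment_start).dt.days
df["day_in_3day_cycle"] = days_elapsed % 3
```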

![dt_1.png](pics/dt_1.png)

If we are to predict sales in a shop, as in the Rossmann Store Sales competition, we can add the number of days passed since the **last holiday, weekend, or sales campaign, or maybe the number of days left until these events**.
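
A possible sketch of the "days since" and "days until" features uses `pandas.merge_asof`, which matches each row to the nearest earlier (or later) key in another sorted table. The dates, the holiday list, and the column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales dates and holiday calendar; both must be sorted for merge_asof.
sales = pd.DataFrame({"date": pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-10"])})
holidays = pd.DataFrame({"holiday": pd.to_datetime(["2015-01-01", "2015-01-06", "2015-01-20"])})

# Most recent holiday on or before each date.
prev = pd.merge_asof(sales, holidays, left_on="date", right_on="holiday", direction="backward")
# Nearest holiday on or after each date.
nxt = pd.merge_asof(sales, holidays, left_on="date", right_on="holiday", direction="forward")

sales["days_since_holiday"] = (prev["date"] - prev["holiday"]).dt.days
sales["days_until_holiday"] = (nxt["holiday"] - nxt["date"]).dt.days
```

Dates that fall before the first or after the last known holiday would get a missing value here, so those cases need separate handling.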

![dt_1.png](pics/dt_2.png)

So, after adding these features, our dataframe can look like this.

![dt_1.png](pics/st_3.png)

![dt_1.png](pics/dt_4.png)

![cd_1.png](pics/cd_1.png)

![cd_1.png](pics/cd_2.png)

One more trick you need to know about coordinates: if you train decision trees on them, you can add slightly rotated coordinates as new features. This will help the model make more precise selections on the map.

It can be hard to know what exact rotation we should make, so we may want to add several rotations, for example by 45 or 22.5 degrees.
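
Rotated coordinates can be produced with the standard 2-D rotation formulas, as in the sketch below. The helper name `rotate` and the sample points are made up, and the coordinates are treated as planar x and y values, which is an assumption.

```python
import numpy as np

def rotate(x, y, degrees):
    """Rotate points (x, y) counter-clockwise around the origin by the given angle."""
    theta = np.deg2rad(degrees)
    x_rot = x * np.cos(theta) - y * np.sin(theta)
    y_rot = x * np.sin(theta) + y * np.cos(theta)
    return x_rot, y_rot

# Hypothetical planar coordinates.
x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 2.0])

# New features: the same points in a coordinate system rotated by 45 degrees.
x45, y45 = rotate(x, y, 45)
```

A tree can then split on `x45` or `y45`, which corresponds to a diagonal boundary in the original coordinates.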

![cd_1.png](pics/cd_3.png)



# 4. Missing data 

* Missing values can be hidden from us, replaced by some value other than NaN (not a number).

![mv_1.png](pics/mv_1.png)

![mv_1.png](pics/mv_2.png)

* The first method is useful in that it gives trees the possibility to take missing values into a separate category. The downside is that the performance of linear models and neural networks can suffer.

* The second method is usually beneficial for simple linear models and neural networks. But for trees, it can be harder to select the objects which had missing values in the first place.
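
The sketch below assumes, based on the values discussed later in this section, that the first method fills missing values with an out-of-range constant such as -999 and the second fills them with the mean or median. The indicator column at the end is an extra assumption: it is one common way to preserve the information about which rows were missing.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with one missing value.
s = pd.Series([1.0, np.nan, 3.0, 5.0])

# Assumed first method: a constant outside the normal feature range.
filled_const = s.fillna(-999)

# Assumed second method: the median of the observed values (3.0 here).
filled_median = s.fillna(s.median())

# Optional indicator of which rows were missing originally.
is_missing = s.isnull().astype(int)
```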

![mv_1.png](pics/mv_3.png)

One example of such a possibility is missing values in a time series. For example, we could have the daily temperature for a month, with several values in the middle of the month missing. Of course, we can approximate them using nearby observations. But such an opportunity rarely arises.

In the most typical scenario, the rows of our dataset are independent, and we usually will not find any proper logic to reconstruct the missing values.
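
For the time-series case, approximating from nearby observations can be sketched with linear interpolation in pandas. The temperature values and dates are invented, and linear interpolation is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with two missing days in the middle.
temps = pd.Series(
    [10.0, 12.0, np.nan, np.nan, 18.0],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Fill the gap by drawing a straight line between the neighbouring observations.
reconstructed = temps.interpolate(method="linear")
```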

**There is one general concern about generating new features from a feature with missing values: we should be very careful about replacing missing values before feature generation.**

![mv_1.png](pics/mv_4.png)

As we can see, near the missing values this difference will usually be abnormally large, and this can mislead our model.

![mv_1.png](pics/mv_5.png)

As we can see, filling in missing values pulls the category means closer to -999. The more rows of a particular category have missing values, the closer its mean will be to -999. The same is true if we fill missing values with the mean or median of the feature. This kind of missing value imputation can definitely distort the feature we are constructing. The way to handle this particular case is to simply ignore missing values while calculating means for each category.
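
The effect can be reproduced on a toy example. The data frame below is invented. Note that pandas skips NaN by default when computing group means, which matches the "ignore missing values" advice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: category A has one missing value.
df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B"],
    "value": [10.0, np.nan, 20.0, 5.0, 7.0],
})

# Means computed while ignoring missing values (pandas skips NaN by default).
mean_ignoring_nan = df.groupby("category")["value"].mean()

# Means computed after filling with -999: category A is pulled towards -999.
filled = df.assign(value=df["value"].fillna(-999))
mean_after_fill = filled.groupby("category")["value"].mean()

# The per-category mean as a new feature, based on the NaN-ignoring version.
df["category_mean"] = df.groupby("category")["value"].transform("mean")
```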

![mv_1.png](pics/mv_6.png)

If you have categorical features, sometimes it can be beneficial to change the missing values, or categories which are present in the test data but not in the train data. The motivation is that a model which never saw a category in the train data will end up treating it randomly. Here, frequency encoding of categorical features can help. As we already discussed in the course, we can replace categories with their frequencies and thus treat categories not seen before according to their frequency.
Let's walk through the example on the slide. There you see some values of the categorical feature that do not appear in the train data. Let's generate a new feature indicating the number of occurrences of each value in the data.
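
A sketch of this count feature, computed over the concatenated train and test data, is shown below. The two small data frames and the column name `cat` are hypothetical.

```python
import pandas as pd

# Hypothetical data: category "D" appears only in the test set.
train = pd.DataFrame({"cat": ["A", "A", "B", "C"]})
test = pd.DataFrame({"cat": ["A", "D", "D", "C"]})

# Count occurrences over train and test together.
all_values = pd.concat([train["cat"], test["cat"]])
counts = all_values.value_counts()

# Replace each category with its overall count.
train["cat_count"] = train["cat"].map(counts)
test["cat_count"] = test["cat"].map(counts)
```

The unseen category "D" now gets the same encoded value as "C", so the model can treat it like other categories of similar frequency instead of an unknown label.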

![mv_1.png](pics/mv_7.png)
![mv_1.png](pics/mv_8.png)

![mv_1.png](pics/mv_9.png)

![fe_q_1.png](pics/fe_q_1.png)
![fe_q_1.png](pics/fe_q_2.png)
![fe_q_1.png](pics/fe_q_3.png)
![fe_q_1.png](pics/fe_q_4.png)
