### multivariate vs. univariate analysis (multivariate is usually bivariate)


We call the variables features if we think they are well-suited to work within our model to explain the target variable. 

Put the target at the center: there should be one target variable unless the model is very complicated.

#### variable transformation reasons:
    
1. Most machine learning models can only work with numeric variables, so categorical variables that have text values need to be converted to numeric values.  
    a. One-hot encoding: a categorical variable with x categories is encoded as x-1 new columns. The listed columns are called dummy or indicator variables; the category left without a column is the reference category.
2. Some machine learning models assume the target variable to be normally distributed. In order to use these models, we may need to transform our target to be normally distributed.
3. Some machine learning models are very sensitive to the relative magnitude of values. So, we may need to limit the values of the variables to some fixed range. Usually, we do this by normalizing our variables.
4. To help our intuition and our understanding of the data, we may want to transform variables to a different unit of measurement. 
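
To make the x-1 encoding from reason 1 concrete, here is a minimal sketch in plain Python. The function name `one_hot_encode`, the choice of the first sorted category as the reference, and the sample grades are all assumptions for illustration. In practice pandas offers `get_dummies(..., drop_first=True)` for the same job.

```python
def one_hot_encode(values, prefix):
    """Encode a categorical column as x-1 dummy columns.

    The first category in sorted order gets no column and acts as the reference.
    """
    categories = sorted(set(values))
    reference, encoded = categories[0], categories[1:]
    dummies = {
        "{}_{}".format(prefix, category): [1 if v == category else 0 for v in values]
        for category in encoded
    }
    return reference, dummies


grades = ["A", "B", "C", "A", "B"]  # made-up sample data
reference, dummies = one_hot_encode(grades, "Grade")

print("reference category:", reference)
for name, column in dummies.items():
    print(name, column)
```

A row whose dummy columns are all 0 belongs to the reference category, which is why x-1 columns are enough to represent x categories.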

**Normalization** is the rescaling of a variable into the [0,1] range (including 0 and 1). In SKLearn's preprocessing module this min-max rescaling is done by `minmax_scale()` (or the `MinMaxScaler` class); note that `normalize()` from the same module instead rescales each sample to unit norm, so it does not map a variable onto [0,1].  
**Standardization** is the rescaling of a variable so its mean becomes 0 and its standard deviation becomes 1. Notice that standardization does not impose a maximum (or minimum) value on the variable. To apply standardization, we'll use SKLearn's `scale()` function from the preprocessing module.
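
As a rough illustration of rescaling into [0,1], the sketch below implements min-max normalization by hand in plain Python, independent of any library. The function name `min_max_normalize`, the handling of constant columns, and the sample values are assumptions made for this example.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; mapping it to 0.0 is an arbitrary convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


subscribers = [10, 20, 30, 50]  # made-up sample data
normalized = min_max_normalize(subscribers)
print(normalized)
```

Because the minimum and maximum define the range, a single extreme value compresses all the other values toward 0, which is one reason to deal with outliers before normalizing.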

### feature selection

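1. Filter methods:  
    Rank features by a score and eliminate those below a cut-off. Variance thresholds and the correlation of each feature with the outcome are examples. Generally computationally 'cheap'.  
2. Wrapper methods:  
    Select sets of features by repeatedly fitting the model on candidate subsets. Can be computationally demanding.  
    Forward passes: stepwise adding of features.  
    Backward passes: stepwise removing of features.  
3. Embedded methods:  
    Select sets of features as part of model fitting itself (intrinsic to the model). This may involve regularization, where a "complexity penalty" is added to the fitness measures typically used to assess the predictive power of a model. Less computationally intensive than wrapper methods.  
4. Dimensionality reduction methods:  
    Especially PCA, which replaces the original features with a smaller set of derived components rather than selecting a subset.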
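
A filter method can be as simple as a variance threshold. The sketch below is one possible plain-Python version; the function name `variance_filter`, the threshold value, and the feature data are hypothetical. scikit-learn ships a `VarianceThreshold` class in `sklearn.feature_selection` for the same idea.

```python
from statistics import pvariance


def variance_filter(features, threshold):
    """Keep only the features whose population variance exceeds the threshold."""
    return {
        name: column
        for name, column in features.items()
        if pvariance(column) > threshold
    }


# Made-up feature columns, keyed by feature name
features = {
    "constant": [1, 1, 1, 1],
    "almost_constant": [0, 0, 0, 1],
    "varied": [1, 5, 9, 13],
}

kept = variance_filter(features, threshold=0.5)
print(sorted(kept))
```

Note that the filter never looks at the target or fits a model, which is what makes it cheap, and also why it can discard a low-variance feature that would have been useful.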

Variables can be:

1. **Continuous variables**:  
    a. interval variables (lack an absolute zero point)  
    b. ratio variables (have an absolute zero point)  
2. **Categorical variables**:  
    a. ordinal variables (ordered, but with no meaningful distance between levels)  
    b. nominal variables (no inherent order)

Use data types to find bad values:

```python
# Print all values that cannot be converted to float
for column_name in ["Video Uploads", "Subscribers"]:
    print("These are the problematic values for the variable: {}".format(column_name))
    for value in youtube_df[column_name]:
        try:
            float(value)
        except (TypeError, ValueError):
            print(value)
```
            
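Then fix them by stripping whitespace, replacing the `--` placeholder with NaN, and converting the column to a numeric type:

```python
import numpy as np
import pandas as pd

youtube_df["Video Uploads"] = youtube_df["Video Uploads"].str.strip().replace("--", np.nan)
youtube_df["Video Uploads"] = pd.to_numeric(youtube_df["Video Uploads"], downcast="float")
```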

For categorical variables, look at the value counts:
`youtube_df.Grade.value_counts()`
This will surface unusual values (usually bad data) and typos.

**outliers**: we can  
* drop them  
* cap them (e.g. winsorize)  
* transform them (e.g. with a monotonic transform like log)  
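
The capping and transforming options can be sketched in plain Python as follows. The `winsorize` helper, its fraction parameters, and the sample data are assumptions for illustration, not a reference implementation; SciPy provides `scipy.stats.mstats.winsorize` for real use.

```python
import math


def winsorize(values, lower_frac=0.05, upper_frac=0.05):
    """Cap the lowest and highest fractions of the data at the nearest retained value."""
    ordered = sorted(values)
    n = len(ordered)
    low_cap = ordered[int(n * lower_frac)]            # smallest value that is kept as-is
    high_cap = ordered[n - 1 - int(n * upper_frac)]   # largest value that is kept as-is
    return [min(max(v, low_cap), high_cap) for v in values]


views = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # made-up data with one extreme value

# Cap the bottom 10% and top 10% of observations
capped = winsorize(views, lower_frac=0.1, upper_frac=0.1)
print(capped)

# A log transform is monotonic: it shrinks large values but preserves their order
logged = [math.log(v) for v in views]
print(logged)
```

Winsorizing keeps every row but changes the extreme values, while the log transform keeps the ordering of all values and only compresses the scale (it requires strictly positive inputs).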

### Standardizing

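$z = (x - u) / s$

where $u$ is the mean of the training samples (or zero if `with_mean=False`) and $s$ is the standard deviation of the training samples (or one if `with_std=False`); `with_mean` and `with_std` are parameters of SKLearn's `StandardScaler`. In other words, standardized = $(x - \bar{x})$ / std_dev, with $\bar{x}$ the mean and std_dev the standard deviation of the variable.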
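
The formula can be checked by hand with a short plain-Python sketch. The helper names `fit_standardizer` and `standardize` and the sample numbers are hypothetical; the population standard deviation (`pstdev`) is used here on the assumption that it matches what scikit-learn's scalers compute.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Compute u (mean) and s (population standard deviation) from the training samples."""
    return mean(train_values), pstdev(train_values)


def standardize(values, u, s):
    """Apply z = (x - u) / s using statistics learned from the training samples."""
    return [(x - u) / s for x in values]


train = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up training samples
u, s = fit_standardizer(train)

z_train = standardize(train, u, s)
z_new = standardize([5, 11], u, s)  # new data reuses the training u and s

print(u, s)
print(z_train)
print(z_new)
```

Reusing the training `u` and `s` on new data is the point of the "training samples" wording: the new values are placed on the same scale as the data the model was fitted on, and they are not guaranteed to have mean 0 or standard deviation 1 themselves.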