# How to Choose ML Features

## What is a Feature
In machine learning, we generally work with tabular data, that is, data arranged in rows and columns. Each row is an observation; each column is an attribute. We then train a model to predict, as well as possible according to whichever metric we have chosen as most important, the value of one column in cases where that value is unknown. The column whose values we are predicting is called the target; the columns we use to predict it are called features.
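
As a small illustration of this vocabulary, the sketch below splits R's built-in `iris` data into features and a target. Treating `Species` as the target is only an assumption made for the example; any column could play that role.

```{r}
# Built-in iris data: each row is one flower (an observation),
# each column is an attribute.
target_name <- "Species"                        # assumed target for this example
feature_names <- setdiff(names(iris), target_name)

X <- iris[, feature_names]   # features: the columns used for prediction
y <- iris[[target_name]]     # target: the column being predicted

dim(X)      # number of observations and number of features
levels(y)   # the classes a model would try to predict
```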

#### Example:

I am creating a machine learning model that will predict whether a person has arrhythmia based on simultaneous EKG (electrocardiogram) readings from several points on the body. One of the most important questions to be answered is this:

 - What features can we extract from an EKG reading?

Some possible features that come to mind immediately are the following:

 - Average heart rate over the course of the reading
 - Summary statistics (max, min, mean, variance, median) of the voltage at each given point in time throughout the reading
 - Summary statistics of the time intervals between consecutive heartbeats
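
To make the heart rate and beat-interval features concrete, here is a minimal sketch. It assumes the beats have already been detected and are available as a vector of times in seconds; the `beat_times` values are made up for illustration, and beat detection itself is not shown.

```{r}
# Hypothetical beat times in seconds (made-up values, not real EKG data).
beat_times <- c(0.00, 0.80, 1.62, 2.41, 3.25, 4.02, 4.85)

# Time between consecutive heartbeats.
beat_intervals <- diff(beat_times)

# Average heart rate in beats per minute.
mean_hr_bpm <- 60 / mean(beat_intervals)

# One reading becomes one row (observation) with several feature columns.
ekg_features <- data.frame(
  mean_hr_bpm     = mean_hr_bpm,
  interval_min    = min(beat_intervals),
  interval_max    = max(beat_intervals),
  interval_mean   = mean(beat_intervals),
  interval_median = median(beat_intervals),
  interval_var    = var(beat_intervals)
)
ekg_features
```

Each EKG reading would contribute one such row, so a dataset of many readings becomes a table of features that a model can use.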

## Factors that Contribute to One Feature Being Better or Worse than Another

There are various factors that can contribute to the decision to choose one feature over another. Here are several of the things to consider as you make your choice of ML features:

 1. Correlation with the target
   - While there are other factors, correlation with the target is generally the most important one as far as the predictiveness of a model is concerned.
   - For example, if you are predicting the gender of a student, height would be a very good feature to have, because height correlates strongly with gender. Obviously, no one can tell a person's gender from their height every time, but males are stochastically taller; that is, their heights come from a different distribution than female heights do, one shifted toward larger values.


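```{r warning = FALSE, echo = FALSE, message = FALSE}
library(ggplot2)

# Simulated toy data (not real measurements): male heights are drawn from a
# distribution with a higher mean than female heights.
set.seed(42)
data <- data.frame(
  gender = rep(c("M", "F"), each = 30),
  height_cm = c(rnorm(30, mean = 176, sd = 7), rnorm(30, mean = 163, sd = 7)),
  major = rep(c("DS", "CS"), times = 30)
)

# Height by gender: the two distributions overlap but are shifted.
boxplot(height_cm ~ gender, data = data)

# Number of students in each gender/major combination.
table(data$gender, data$major) |> pander::pander()

# Height by major, split by gender.
ggplot(data, aes(x = major, y = height_cm, fill = gender)) +
  geom_boxplot() +
  theme_bw()
```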



   - In my case, I do not yet know which features will correlate most strongly with each heart condition.
   - It is much easier to tell what correlates with the target when we visualize it on a scatterplot or another graphic appropriate to the data type. I will address this further in a separate section on visualization.
 2. Suspicion (or Confirmation) that an Interaction Correlates with the Target
   - Let's suspend disbelief for a moment and imagine that one major has only tall females and short males, and that the opposite is true for a second major. If we compare the majors by gender alone, we will see no correlation. The same is true for height alone. However, if we look at the interaction, we can gain a lot of insight: we can determine which major a given student is in (or at least predict it with higher accuracy) by using both features together, but not by using either one separately.
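 3. Number of Missing Values
   - A feature with a high proportion of missing values gives the model little information for predicting the target.
   - There is no universal threshold for how many missing values are too many; it depends on the dataset and the problem.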
 4. Removing Irrelevant Features Helps
 5. Collecting Outside Variables
 6. Visualize
   - Visualizing the data can help you to understand the relationships between the different features.
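
The interaction scenario from item 2 can be sketched with simulated data. Everything here is an assumption made for illustration: the major labels `A` and `B`, the height distributions, and the 170 cm cutoff (the midpoint of the two simulated means).

```{r}
set.seed(1)
n <- 200
gender <- rep(c("F", "M"), times = n / 2)
major  <- rep(c("A", "B"), each = n / 2)

# Major A: tall females and short males. Major B: the opposite.
tall_group <- (major == "A" & gender == "F") | (major == "B" & gender == "M")
height_cm  <- ifelse(tall_group,
                     rnorm(n, mean = 180, sd = 4),
                     rnorm(n, mean = 160, sd = 4))
students <- data.frame(gender, major, height_cm)

# Each feature alone looks unrelated to major.
table(students$gender, students$major)             # same gender split in both majors
tapply(students$height_cm, students$major, mean)   # roughly equal mean heights

# The interaction shows the pattern: mean height by gender AND major.
tapply(students$height_cm, list(students$gender, students$major), mean)

# Accuracy of simple rules that guess the major.
is_female <- students$gender == "F"
is_tall   <- students$height_cm > 170
acc_gender_only <- mean(ifelse(is_female, "A", "B") == students$major)
acc_height_only <- mean(ifelse(is_tall, "A", "B") == students$major)
acc_both        <- mean(ifelse(is_female == is_tall, "A", "B") == students$major)
c(gender_only = acc_gender_only, height_only = acc_height_only, both = acc_both)
```

The single-feature rules do about as well as a coin flip, while the rule that combines gender and height recovers the major for nearly every student.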


4. Feature Importance Analysis:
   - Use feature importance techniques to estimate how much each feature contributes to predicting the target variable. Common methods include:
     - Correlation analysis: Calculate the correlation between each feature and the target variable.
     - Feature selection algorithms: Implement algorithms such as Recursive Feature Elimination (RFE), SelectKBest, or LASSO regression.
     - Tree-based models: Random Forest or Gradient Boosting models provide feature importance scores.
     - Mutual information or chi-squared tests for classification tasks.

5. Domain Knowledge:
   - Incorporate domain knowledge. Experts in the field may have insights into which features are most relevant. They can help you identify meaningful features that might not be apparent from the data alone.

6. Remove Redundant Features:
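   - Check for multicollinearity among features (high correlations between features). Redundant features add little new information, can make coefficient estimates unstable in linear models, and make importance scores harder to interpret. Consider keeping only one feature from each highly correlated group.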

7. Feature Engineering:
   - Create new features or transform existing ones to capture additional information. For example, you might convert timestamps into day of the week, extract text features, or create interaction terms.

8. Test Iteratively:
   - Start with a subset of features and build a baseline model. Gradually add or remove features and assess the impact on model performance using techniques like cross-validation.

9. Dimensionality Reduction:
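   - If you have a large number of features, consider a dimensionality reduction technique such as Principal Component Analysis (PCA). t-Distributed Stochastic Neighbor Embedding (t-SNE) also reduces dimensions, but it is used mainly for visualization, because it does not learn a mapping that can be applied to new observations.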

10. Model Feedback:
    - Some machine learning models provide insights into feature importance. After training your model, analyze feature importance scores to verify the relevance of features.

11. Regularization:
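    - Regularized models shrink the coefficients of less useful features. Lasso (L1) regression can shrink coefficients to exactly zero, which effectively performs feature selection. Ridge (L2) regression shrinks coefficients toward zero but does not remove any features. Lasso coefficients can therefore help identify the most relevant features.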

12. Experiment and Evaluate:
    - Experiment with different feature sets and evaluate model performance using metrics like accuracy, precision, recall, F1-score, or mean squared error (depending on your problem).

13. Cross-Validation:
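    - Use cross-validation to check that your feature selection choices generalize to unseen data. Perform the selection inside each training fold; selecting features on the full dataset first leaks information from the held-out data and makes the performance estimate too optimistic.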
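
The iterative testing and cross-validation steps above can be sketched in base R. This is only an illustration under stated assumptions: it uses the built-in `mtcars` data with `mpg` as the target, a plain linear model, and a hypothetical helper named `cv_rmse` that is not part of any package.

```{r}
set.seed(123)
k <- 5

# Randomly assign each row of mtcars to one of k folds.
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Hypothetical helper: mean RMSE over the folds for a given model formula.
cv_rmse <- function(formula, data, folds) {
  fold_rmse <- sapply(sort(unique(folds)), function(i) {
    train  <- data[folds != i, ]
    test   <- data[folds == i, ]
    fit    <- lm(formula, data = train)           # fit on the training folds only
    pred   <- predict(fit, newdata = test)        # predict the held-out fold
    actual <- test[[all.vars(formula)[1]]]        # the target column
    sqrt(mean((actual - pred)^2))
  })
  mean(fold_rmse)
}

# Baseline feature set versus a larger one, scored on the same folds.
baseline <- cv_rmse(mpg ~ wt, mtcars, folds)
larger   <- cv_rmse(mpg ~ wt + hp, mtcars, folds)
c(baseline = baseline, larger = larger)
```

Reusing the same fold assignment for both feature sets keeps the comparison fair; a lower cross-validated RMSE suggests that the added feature helps on data the model has not seen.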

Remember that feature selection is an iterative process, and the best feature set may vary depending on the specific problem and dataset. It's important to be mindful of overfitting and ensure that your feature selection process is guided by the goals of your machine learning project.