# TODO


- Goodfellow, Deep learning, Chapter 2, Linear Algebra
  - $x^{T}Ax$ as it relates to definiteness (positive definite, positive semidefinite, etc.)
  - Singular Value Decomposition (SVD)
  - Moore-Penrose Pseudoinverse
  - Vectors and projection
  - Trace operator
  - [principal component analysis and dimension reduction](https://www.youtube.com/watch?v=HMOI_lkzW08&t=27s&pp=ygUdcHJpbmNpcGFsIGNvbXBvbmVudHMgYW5hbHlzaXM%3D)
- Goodfellow, Deep learning, Chapter 3, Probability and Information Theory
- Goodfellow, Deep learning, Chapter 4, Numerical Methods
- Goodfellow, Deep learning, Chapter 5, Machine Learning Basics
- Statistics
  - covariance matrix
  - graphical models/structured probabilistic models and probability distribution factorizations
  - condition number
- Information Theory
  - information theory Cover and Thomas (2006) or MacKay (2003)
  - entropy and distributions
- Vector calculus
  - Gradient
  - Jacobian
- lightgbm
- catboost
- https://www.neuraxio.com/blogs/news/whats-wrong-with-scikit-learn-pipelines
- Endogeneity in model evaluation
- Causal Inference
- Interpretability and Explainability
- SHAP values

# Resources

- [Montgomery, Applied Statistics and Probability for Engineers 7ed](https://www.wiley.com/en-us/Applied+Statistics+and+Probability+for+Engineers%2C+7th+Edition-p-9781119400363)
- [Goodfellow, Deep Learning](https://www.deeplearningbook.org/)
- [Jon Krohn's YouTube Channel](https://www.youtube.com/@JonKrohnLearns/videos)
  - [Probability for Machine Learning](https://www.youtube.com/playlist?list=PLRDl2inPrWQWwJ1mh4tCUxlLfZ76C1zge)
  - [Calculus for Machine Learning](https://www.youtube.com/playlist?list=PLRDl2inPrWQVu2OvnTvtkRpJ-wz-URMJx)
  - [Linear Algebra for Machine Learning](https://www.youtube.com/playlist?list=PLRDl2inPrWQW1QSWhBU0ki-jq_uElkh2a)
  - [GitHub](https://github.com/jonkrohn/ML-foundations)
- [Principal Component Analysis](https://www.youtube.com/watch?v=HMOI_lkzW08&t=27s&pp=ygUdcHJpbmNpcGFsIGNvbXBvbmVudHMgYW5hbHlzaXM%3D)
- [ritvikmath Data Science Concepts](https://www.youtube.com/playlist?list=PLvcbYUQ5t0UH2MS_B6maLNJhK0jNyPJUY)
- [onlinestatbook](https://onlinestatbook.com)
- [PennState Stat415](https://online.stat.psu.edu/stat415/)
- [Kaggle Courses](https://www.kaggle.com/search?q=+in%3Acourses)
- [Introduction to Statistical Learning Videos](https://www.youtube.com/watch?v=5N9V07EIfIg&list=PLOg0ngHtcqbPTlZzRHA2ocQZqB1D_qZ5V)

# Miscellaneous

- Preprocessing
  - Scaling
  - Handling Outliers
- Feature Engineering
  - Mutual Information
  - Dimensionality Reduction
  - PCA
  - Curse of Dimensionality
- Information Theory
  - Entropy
- Overfitting vs Underfitting
- Different Types of ML Models
  - Decision Trees
  - Random Forest
  - XGBoost
  - LightGBM
- Ensemble Methods
  - Gradient Boosting
- Missing Values
  - Drop columns
  - Impute with Mean Value
  - Add a new column/feature indicating the missing value
- Categorical Variables
  - Drop
  - Ordinal Encoding
  - One-Hot Encoding
    - Find ways of handling features with high cardinality
- Hyperparameter tuning
  - Parameter search, grid search
- Kernel Approximation
- Kernel Density/Kernel Density Plot
- Misc
  - Kaggle Deep Learning Free GPU Access
  - Google Colab
- sklearn
  - sklearn.metrics.mean_absolute_error
  - sklearn.impute.SimpleImputer
  - sklearn.model_selection.cross_val_score
  - GridSearchCV()
  - sklearn.pipeline.make_pipeline
  - sklearn.feature_selection.mutual_info_regression
- pandas
  - pd.get_dummies
  - pd.factorize

# Notes

## EDA

### Understand Spread and Dispersion
- Does the feature vary? A feature with little variation or that is almost
  constant contributes little to explaining the variability in the target.
- Is the feature normal?
    - Standard scaling works better on normally distributed features.
    - Models that assume normality: Gaussian Naive Bayes, Linear Discriminant
      Analysis, Quadratic Discriminant Analysis
    - PCA benefits from normality if you intend to use the principal components
      as uncorrelated variables
    - Affects imputation strategy
        - Mean imputation for roughly symmetric distributions
        - Median for highly skewed distributions
        - KNN imputation or predicting missing values based on other features
            - Check if imputation has distorted the original distribution
    - Tests
      - Shapiro-Wilk: Computes correlation between samples and normal scores.
        Good for <5000 samples. A W closer to 1 indicates normality.
      - Kolmogorov-Smirnov: Compares the cumulative distribution function (CDF)
        of the sample data with the CDF of the normal distribution. It measures
        the maximum vertical distance between these two curves. Requires the
        population mean and variance to be known in advance rather than
        estimated from the sample.
      - Lilliefors Test: an improvement on Kolmogorov-Smirnov. Uses corrected
        critical values to account for unknown population mean and variance.
      - Anderson-Darling: Similar to Kolmogorov-Smirnov but gives more weight to
        the tails of the distribution.
      - Jarque-Bera: Based on skewness and kurtosis. Recommended for larger
        sample sizes >2000.
      - These tests have low power at low sample sizes and become too sensitive
        at high sample sizes. Visual plots are still useful.
- Check for homoscedasticity: is the variance constant?
- Are there gaps, clusters or spikes?
- Do the min/max make sense?
- Look at the distribution of the target conditioned on different classes or
  ranges of the feature. Are there interactions? Are the means and variances of
  these distributions the same? Are the distributions similar or different?
    - Mann-Whitney U test
    - q-q plots
- If the distribution is multi-modal (many peaks), each peak could represent a
  sub-group in the data. Include the underlying categorical variable if you can
  find it.
    - If there is no existing categorical variable, a new one can be created if
      the subgroups can be defined. See K-Means Clustering and Gaussian Mixture
      Models.
    - Another possibility is to decompose the multi-modal feature into multiple features.
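
Several of the normality tests above are available in SciPy. The following is a minimal sketch, assuming synthetic data; the names `normal_sample`, `skewed_sample`, and `normality_report` are made up for illustration, and the Kolmogorov-Smirnov call assumes the population mean and standard deviation are known (0 and 1 here).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic, drawn from N(0, 1)
skewed_sample = rng.exponential(scale=1.0, size=500)      # synthetic, right-skewed


def normality_report(x, mean=0.0, std=1.0):
    """Run three normality tests on a 1-D sample and collect the results.

    `mean` and `std` are the assumed population parameters that the
    Kolmogorov-Smirnov test has to be given up front.
    """
    shapiro_w, shapiro_p = stats.shapiro(x)
    ks_stat, ks_p = stats.kstest(x, "norm", args=(mean, std))
    jb_stat, jb_p = stats.jarque_bera(x)
    return {
        "shapiro_w": shapiro_w, "shapiro_p": shapiro_p,
        "ks_stat": ks_stat, "ks_p": ks_p,
        "jarque_bera": jb_stat, "jarque_bera_p": jb_p,
    }


for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    report = normality_report(sample)
    print(name, {key: round(float(value), 4) for key, value in report.items()})
```

A small p-value is evidence against normality. Anderson-Darling is available as `scipy.stats.anderson`, and Lilliefors as `statsmodels.stats.diagnostic.lilliefors`; a q-q plot is still worth checking alongside any of these.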

### Outliers
- Do the outliers make sense?
    - If they are due to errors, drop them.
    - If not, try Winsorization/capping.
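
Winsorization/capping can be sketched with NumPy alone. This is a minimal example, assuming percentile-based limits; the helper `winsorize_percentile`, its default cutoffs, and the toy `data` array are hypothetical choices for illustration, not a recommended setting.

```python
import numpy as np


def winsorize_percentile(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)


# Toy data with one extreme value at the top end.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])
capped = winsorize_percentile(data, lower_pct=10.0, upper_pct=90.0)

print("max before:", data.max(), "max after:", capped.max())
print("median before:", np.median(data), "median after:", np.median(capped))
```

Capping keeps every row and leaves the middle of the distribution untouched, which is why it is often preferred over dropping when the outliers are genuine. SciPy also provides `scipy.stats.mstats.winsorize` for the same purpose.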

### Skewness
- If the feature is positively skewed, try the following transformations:
    - Log, square root, or Box-Cox (Box-Cox requires strictly positive values)
    - Yeo-Johnson (also handles zero and negative values)
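
The effect of these transformations can be compared by measuring skewness before and after. This is a minimal sketch, assuming a synthetic, strictly positive, right-skewed feature `x` (a made-up example) and using scikit-learn's `PowerTransformer` for Box-Cox and Yeo-Johnson.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic, positive, right-skewed

log_x = np.log(x)    # valid only for x > 0
sqrt_x = np.sqrt(x)  # valid only for x >= 0

# PowerTransformer expects a 2-D array of shape (n_samples, n_features).
x_2d = x.reshape(-1, 1)
box_cox_x = PowerTransformer(method="box-cox").fit_transform(x_2d).ravel()
yeo_johnson_x = PowerTransformer(method="yeo-johnson").fit_transform(x_2d).ravel()

skewness = {
    "raw": float(skew(x)),
    "log": float(skew(log_x)),
    "sqrt": float(skew(sqrt_x)),
    "box-cox": float(skew(box_cox_x)),
    "yeo-johnson": float(skew(yeo_johnson_x)),
}
for name, value in skewness.items():
    print(f"{name:12s} skewness = {value:.3f}")
```

A skewness near zero suggests a roughly symmetric result. In a real workflow the transformer would be fitted on the training split only and then applied to the validation and test splits.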

### Predictive Power
- Signal-to-noise ratio (SNR): how much of a feature contributes to predicting the
  target and how much is noise?
- A core strategy is to quantify how much of the variation in the target
  variable is explained by variation in a feature.
    - Correlation captures linear relationships.
    - R<sup>2</sup>?
    - Mutual Information
    - ANOVA bins the continuous feature and checks if the mean of the target
      variable differs across these bins.
    - Tree-based feature importance
    - Permutation importance: train a model, then randomly shuffle the values of
      one feature across a sample and score the same model again without
      retraining. Did performance get worse?
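
Two of the measures above, mutual information and permutation importance, can be sketched with scikit-learn. This example assumes synthetic data in which the target depends on one feature (`signal`) and not the other (`noise`); the variable names and the choice of a random forest are illustrative assumptions. `permutation_importance` shuffles one feature at a time and re-scores the already fitted model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)  # synthetic feature that drives the target
noise = rng.normal(size=n)   # synthetic feature unrelated to the target
X = np.column_stack([signal, noise])
y = 3.0 * signal + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature in the held-out set and measure the drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = result.importances_mean

# Mutual information between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)

for i, name in enumerate(["signal", "noise"]):
    print(f"{name:6s} permutation importance = {perm_importance[i]:.3f}, MI = {mi[i]:.3f}")
```

Scoring on held-out data measures how much the model relies on a feature for generalization rather than for fitting the training set.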

- Check for multicollinearity among features.