# Week 05 - Exploratory Data Analysis and Data Visualization

## Conditional Probability

In [None]:
# marbles dataframe
import pandas as pd

marbles = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp22/main/materials/sp22/lab/lab10/marbles.csv')
marbles.head(13)

### Marbles

Imagine you and Samantha are playing a game in which you are given a marble and tasked to determine the marble's texture and size. You don't know anything about the marble you're given, but you know that Samantha drew it **uniformly at random** from a bag that contained the following marbles:

* 1 large dull marble
* 4 large shiny marbles
* 2 small dull marbles
* 6 small shiny marbles

In [None]:
# groupby surface and size
marbles.groupby(['size', 'surface'])['size'].count()

Knowing only what we've told you so far, what's the probability that the marble you're given was a large shiny marble?

In [None]:
# 4 large shiny marbles out of 13 marbles
4 / 13

Suppose you overhear Samantha say that you were given a large marble. Does this somehow change the chance that your marble is shiny?  Let's find out.

Let's assume that the marble you were given was equally likely to be any of the marbles, simply because we didn't know any better. That's why we looked at all the marbles to compute the probability that your marble was shiny.

But assuming that you've been given a large marble, we can eliminate some of these possibilities. In particular, you can't have been given a small shiny marble or a small dull marble.

You're still equally likely to have been given any of the remaining marbles, because you don't know any other information. 

What's the probability Samantha gives you a shiny marble, knowing that she gave you a large marble?

In [None]:
# groupby surface and size
marbles.groupby(['size', 'surface'])['size'].count()

In [None]:
# given that you already have a large marble,
# there are 5 large marbles, 4 of which are shiny
4 / 5

### Conditional Probability Formula

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

Dependent Intersection: $P(A \cap B) = P(A) * P(B|A)$
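As a quick check of the formula, here is a minimal sketch that applies it to the bag of marbles described earlier, using exact fractions. The dictionary `counts` and the variable names are illustrative, not part of the original notebook.

```python
from fractions import Fraction

# Counts taken from the bag described above
counts = {
    ('large', 'dull'): 1,
    ('large', 'shiny'): 4,
    ('small', 'dull'): 2,
    ('small', 'shiny'): 6,
}
total = sum(counts.values())  # 13 marbles

# A = shiny, B = large
p_b = Fraction(sum(n for (size, _), n in counts.items() if size == 'large'), total)
p_a = Fraction(sum(n for (_, surface), n in counts.items() if surface == 'shiny'), total)
p_a_and_b = Fraction(counts[('large', 'shiny')], total)

# Conditional probability formula: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 4/5

# Dependent intersection: P(A and B) = P(A) * P(B|A)
p_b_given_a = Fraction(counts[('large', 'shiny')], 10)  # 4 of the 10 shiny marbles are large
print(p_a * p_b_given_a == p_a_and_b)  # True
```

Both routes agree: restricting attention to the 5 large marbles gives the same 4/5 as dividing the joint probability by P(B).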

In [None]:
# A = shiny, B = large
# P(A) = 10/13, P(B|A) = 4/10, P(B) = 5/13
((10/13) * (4/10)) / (5/13)

In [None]:
# Using tables as a short cut to solve problems
marbles.groupby(['size', 'surface'])['size'].count()
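One way to read conditional probabilities straight off a table is to normalize each row of a cross-tabulation. The sketch below rebuilds the bag as a small DataFrame (a hypothetical stand-in for `marbles.csv`, assuming the file matches the counts listed earlier) and uses `pd.crosstab` with `normalize='index'`.

```python
import pandas as pd

# Hypothetical stand-in for marbles.csv, built from the bag's counts
bag = pd.DataFrame(
    [('large', 'dull')] * 1 + [('large', 'shiny')] * 4
    + [('small', 'dull')] * 2 + [('small', 'shiny')] * 6,
    columns=['size', 'surface'],
)

# Each row sums to 1, so a cell holds P(surface | size)
cond = pd.crosstab(bag['size'], bag['surface'], normalize='index')
print(cond)
print(cond.loc['large', 'shiny'])  # P(shiny | large)
```

The `large` row plays the role of "given that the marble is large": only the 5 large marbles are counted in its denominator.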

### Power (Sensitivity) vs. Confidence (Specificity)

* Sensitivity: correctly identifying someone with a disease
* Specificity: correctly identifying someone without a disease

https://www.ncbi.nlm.nih.gov/books/NBK557491/

<img src='https://www.statisticshowto.com/wp-content/uploads/2015/04/statistical-power.png' />

<pre>
                      predicted
                   |   0    |   1
           --------------------------    -------------------------------------------------
           0       |  {tn}  |  {fp}      tnr (specificity)    |  fpr (type I error)
  actual   --------------------------    -------------------------------------------------
           1       |  {fn}  |  {tp}      fnr (type II error)  |  tpr (sensitivity, recall)

                      npv   |  fdr
           --------------------------
                      for   |  precision (ppv)
</pre>
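The rates named in the table can be computed directly from the four cell counts. The sketch below uses made-up counts for a screening test (they do not come from any study) purely to show which cells feed each rate.

```python
# Hypothetical counts for a screening test (illustrative only)
tn, fp, fn, tp = 850, 50, 20, 80

sensitivity = tp / (tp + fn)   # tpr / recall: share of actual positives detected
specificity = tn / (tn + fp)   # tnr: share of actual negatives cleared
fpr = fp / (fp + tn)           # false positive rate (type I error)
fnr = fn / (fn + tp)           # false negative rate (type II error)
precision = tp / (tp + fp)     # ppv: share of predicted positives that are real
npv = tn / (tn + fn)           # share of predicted negatives that are real

print(f'sensitivity={sensitivity:.3f}  fnr={fnr:.3f}')
print(f'specificity={specificity:.3f}  fpr={fpr:.3f}')
print(f'precision={precision:.3f}  npv={npv:.3f}')
```

Each row of the table sums to 1 (`sensitivity + fnr` and `specificity + fpr`), which mirrors the relationship between power and the type II error rate discussed in this section.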

If I were to describe it using the language of classification, statistical power concerns the probability we find a true positive (the effect truly is not the null effect, and we have correctly detected it). Indeed, the type 2 error rate is the false negative rate and power is 1-false negative rate, hence true positive rate.

Demetri Pananos (https://stats.stackexchange.com/users/111259/demetri-pananos), Why Power = P(True Positive) in Hypothesis testing?, URL (version: 2021-07-07): https://stats.stackexchange.com/q/533589

https://en.wikipedia.org/wiki/False_positives_and_false_negatives

* Type I Error: $\alpha$
* Type II Error: $\beta$
* Beta is directly related to the power of a test. Power relates to how likely a test is to distinguish an actual effect from one you could expect to happen by chance alone. Beta plus the power of a test is always equal to 1. Usually, researchers will refer to the power of a test (e.g. a power of .8), leaving the beta level (.2 in this case) as implied. https://www.statisticshowto.com/beta-level/
* The statistical power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis $H_{0}$ when a specific alternative hypothesis $H_{a}$ is true. It is commonly denoted by $1-\beta$ , and represents the chances of a "true positive" detection conditional on the actual existence of an effect to detect. Statistical power ranges from 0 to 1, and as the power of a test increases, the probability $\beta$  of making a type II error by wrongly failing to reject the null hypothesis decreases. https://en.wikipedia.org/wiki/Power_of_a_test

### Hypothesis testing

The definitions of alpha and beta that most learners of statistics meet first concern hypothesis testing.

https://www.theanalysisfactor.com/confusing-statistical-terms-1-alpha-and-beta/

α (Alpha) is the probability of a Type I error in any hypothesis test: incorrectly rejecting the null hypothesis.

β (Beta) is the probability of a Type II error in any hypothesis test: incorrectly failing to reject the null hypothesis. (1 - β is power.)

## Coding Practice

### Numpy (Arrays)

* Scalars
* Vectors
* Matrices

In [None]:
# list to array
import numpy as np

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(type(my_list))
my_array = np.array(my_list)
print(type(my_array)) # n-dimensional array

In [None]:
# view list
my_list

In [None]:
# view array
my_array

In [None]:
# 3 x 3 array
my_kart = [['Baby Daisy', 'Baby Luigi', 'Baby Mario'], ['Birdo', 'Bowser', 'Donkey Kong'], ['Princess Peach', 'Isabelle', 'Koopa Troopa']]
my_kart_array = np.array(my_kart)
my_kart_array

In [None]:
# np.zeros
np.zeros((3, 3))

In [None]:
# arrays from arange and reshape
import numpy as np

my_ndarray = np.arange(1, 5).reshape(2, 2)
my_ndarray

In [None]:
# scalar multiplication
my_ndarray * 2

In [None]:
# scalar division
my_ndarray / 2

In [None]:
# element wise multiplication
my_ndarray * my_ndarray

In [None]:
# dot product (matrix multiplication) of the 2 x 2 array with itself
print(my_ndarray.tolist())
print(my_ndarray.tolist())
np.dot(my_ndarray, my_ndarray)
# [[1 * 1 + 2 * 3, 1 * 2 + 2 * 4], [3 * 1 + 4 * 3, 3 * 2 + 4 * 4]]

In [None]:
# https://nbviewer.org/github/jmportilla/Udemy-notes/blob/master/Lec%209%20-Indexing%20Arrays.ipynb
import numpy as np

my_ndarray = np.arange(1, 17).reshape(4, 4)
my_ndarray

In [None]:
my_ndarray[0]

In [None]:
my_ndarray[0][0]

In [None]:
my_ndarray[0:1, 0:1]

In [None]:
print(my_ndarray)
my_ndarray[:1,:1]

In [None]:
print(my_ndarray)
my_ndarray[:1,2:4]

In [None]:
print(my_ndarray)
my_ndarray[:2,2:4]

In [None]:
print(my_ndarray)
my_ndarray[2:4,1:3]

In [None]:
# 3 x 3 array of random integers from 0 to 9
import numpy as np

np.random.randint(10, size=9).reshape(3, 3)

In [None]:
# universal functions
import webbrowser

url = 'https://numpy.org/doc/stable/reference/ufuncs.html'
webbrowser.open(url)

In [None]:
# some descriptive statistics
import numpy as np

stats_array = np.random.randint(10, size=9).reshape(3, 3)
print(stats_array)
print('sum:\t', stats_array.sum())
print('mean:\t', stats_array.mean().round(2))
print('var:\t', stats_array.var().round(2))
print('std:\t', stats_array.std().round(2))
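One detail worth keeping in mind with `var()` and `std()`: NumPy divides by N by default (`ddof=0`, the population formula), whereas pandas divides by N - 1 by default. A small sketch with a made-up array shows the difference.

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative data, mean 5

pop_var = values.var()            # divides by N (ddof=0), NumPy's default
sample_var = values.var(ddof=1)   # divides by N - 1, the sample variance

print('population variance:', pop_var)
print('sample variance:', round(sample_var, 3))
print('population std:', values.std())
```

For large arrays the two are nearly identical, but for small samples such as a 3 x 3 array the gap can be noticeable.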

## Data Analysis

### Univariate Analysis

Univariate analysis is the simplest form of analyzing data. Uni means one, so in other words the data has only one variable... Univariate data does not answer questions about relationships between variables, but rather it is used to describe characteristics or attributes of a feature... (para 4).

https://en.wikipedia.org/wiki/Univariate_(statistics).

In [None]:
# univariate data histogram: dataset from https://archive.ics.uci.edu/ml/datasets/Auto+MPG
import pandas as pd
import seaborn as sns

cars = pd.read_csv('https://raw.githubusercontent.com/gitmystuff/INFO5502/main/Week_3-Data/auto-mpg.data', sep = r'\s+', header = None)
cars.columns=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin',  'car name']
print(cars.head())
sns.histplot(cars['acceleration'].dropna(), kde=True, bins=10);

**Note on sns.histplot(data, kde=True, bins=10) in previous cell**

* Kernel density estimation or KDE is a non-parametric way to estimate the probability density function of a random variable
* The aim of KDE is to find probability density function (PDF) for a given dataset

https://www.homeworkhelponline.net/blog/math/tutorial-kde
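To make the idea concrete, here is a minimal sketch of a Gaussian KDE written by hand: one bell curve is centered on every sample and the curves are averaged. The function name `gaussian_kde_1d`, the simulated data, and the bandwidth of 1.0 are all assumptions for illustration; seaborn chooses its own bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=15, scale=3, size=200)  # hypothetical data
grid = np.linspace(0, 30, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=1.0)

# A density estimate should integrate to roughly 1 over the grid
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely; a larger one smooths it out.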

In [None]:
# univariate data with pie chart
cars['cylinders'].value_counts().plot.pie();

### Bivariate Analysis

Bivariate analysis describes the relationship between two variables.

In [None]:
# bivariate scatter plot
import matplotlib.pyplot as plt

plt.scatter(cars['weight'], cars['mpg'])
plt.xlabel('weight')
plt.ylabel('mpg')
plt.show()

### Multivariate Analysis

Multivariate analysis involves more than two variables; in a plot, each extra variable is shown through another visual dimension such as marker size or color.

In [None]:
# multivariate example
import seaborn as sns

sns.scatterplot(data=cars, x='weight', y='mpg', size='acceleration');

## Exploratory Data Analysis

* Getting to know your data: center, spread, shape
* Data types: numerical, categorical, nominal, ordinal, boolean
* Cardinality, duplications, and missing data
* Outliers
* Engineer features with too many labels
* Understanding relationships between variables: feature with feature, feature with target
* Feature engineering
* Feature selection

In [None]:
# get data
import pandas as pd

grades = pd.read_csv('class-grades4.csv', index_col=0)
grades.drop(['index'], axis=1, inplace=True)
print(grades.shape)
print(grades.head())

### Constant Features

In [None]:
# replace missing values and then check how many unique values are in each variable
few_values = [
    val for val in grades.columns if len(grades[val].fillna(0).unique()) == 1
]

few_values

### Quasi Constant Features

In [None]:
# quasi constant values (sometimes these may be boolean features)
for val in grades.columns.sort_values():
    if (len(grades[val].unique()) < 3):
        print(grades[val].value_counts())
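The constant and quasi-constant checks above can also be expressed in terms of how dominant the most common value is. The sketch below uses a small made-up DataFrame; the column names and the 75% cutoff are assumptions chosen for illustration, not values from the grades data.

```python
import pandas as pd

toy = pd.DataFrame({
    'score': [71, 85, 90, 64, 78],
    'campus': ['main'] * 5,        # constant: one value only
    'flag': [0, 0, 0, 0, 1],       # quasi constant: one value dominates
})

# Constant features have exactly one distinct value (NaN counted as a value)
constant = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Share of rows taken by the most common value in each column
top_share = toy.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
quasi_constant = top_share[(top_share >= 0.75) & (top_share < 1)].index.tolist()

print('constant:', constant)
print('quasi constant:', quasi_constant)
```

Whether a quasi-constant feature should be dropped depends on the problem; a rare flag can still carry signal.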

### Duplicates

In [None]:
# duplicate rows
grades[grades.duplicated(keep=False)]

In [None]:
# drop duplicate rows
grades.drop_duplicates(inplace=True)

In [None]:
# check for duplicate columns
duplicate_variables = []
for i in range(0, len(grades.columns)):
    orig = grades.columns[i]

    for dupe in grades.columns[i + 1:]:
        if grades[orig].equals(grades[dupe]):
            duplicate_variables.append(dupe)
            print(f'{orig} looks the same as {dupe}')
            
duplicate_variables

In [None]:
# drop the variables that are duplicated or low in variance
grades.drop(['Quiz', 'Student', 'Misc', 'Nothing'], axis=1, inplace=True)
grades.info()

### Missing Values

In [None]:
# check for nulls
grades.isnull().sum()

In [None]:
# look at the shape of variables that are numerical
import matplotlib.pyplot as plt

grades.hist()
plt.tight_layout();

In [None]:
# impute missing values with mean and median
# assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
grades['Midterm'] = grades['Midterm'].fillna(round(grades['Midterm'].mean(), 2))
grades['Final'] = grades['Final'].fillna(round(grades['Final'].mean(), 2))
grades['FinalGrade'] = grades['FinalGrade'].fillna(round(grades['FinalGrade'].mean(), 2))
grades['Assignment1'] = grades['Assignment1'].fillna(grades['Assignment1'].median())
grades['Tutorial'] = grades['Tutorial'].fillna(grades['Tutorial'].median())
grades.isnull().sum()
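The choice between mean and median matters most when a column is skewed. A small sketch with made-up scores (one high value pulls the mean upward) shows how the two imputed values differ.

```python
import numpy as np
import pandas as pd

scores = pd.Series([55, 60, 62, 65, np.nan, 100])  # hypothetical, right-skewed

mean_filled = scores.fillna(scores.mean())      # mean of the 5 known values
median_filled = scores.fillna(scores.median())  # median of the 5 known values

print('mean imputation:', mean_filled[4])
print('median imputation:', median_filled[4])
```

This is one reason to look at the histograms first: roughly symmetric columns can take the mean, while skewed ones are often better served by the median.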

In [None]:
# impute missing values with mode
print(grades['TakeHome'].value_counts(dropna=False))
print(grades['work_status'].value_counts(dropna=False))

In [None]:
# replace work_status with mode (not employed)
grades['work_status'] = grades['work_status'].fillna(grades['work_status'].mode()[0])
grades.isnull().sum()

In [None]:
# replace TakeHome with function
def convert_grade(row):
    if len(str(row['TakeHome'])) > 1:
        if row['Final'] >= 90:
            return 'A'
        if row['Final'] >= 80:
            return 'B'
        if row['Final'] >= 70:
            return 'C'
        if row['Final'] >= 60:
            return 'D'
        if row['Final'] < 60:
            return 'F'
    else:
        return row['TakeHome']
        
grades['TakeHome'] = grades.apply(convert_grade, axis=1)
grades['TakeHome'] = grades['TakeHome'].fillna('Missing')
print(grades['TakeHome'].value_counts())
print(grades.isnull().sum())

### Engineer Features With Too Many Labels

In [None]:
# how many different home destination labels are there (NaN counted as a label)?
grades['home.dest'].nunique(dropna=False)

In [None]:
# null home.dest values?
print(len(grades[grades['home.dest'].isnull()]))
grades[grades['home.dest'].isnull()]

In [None]:
import re

def cat_home(r):
    text = str(r['home.dest']).strip()
    if bool(re.search('[A-Z]{2}$', text[-2:])):
        return 'North America'
    elif text == 'nan':
        return 'Missing'
    else:
        return 'Not North America'

grades['cat_home'] = grades.apply(cat_home, axis=1)

print(grades['cat_home'].value_counts())

In [None]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(grades.drop(['FinalGrade', 'home.dest', 'name'], axis=1), grades['FinalGrade'], test_size=.2, random_state=42)

print(X_train.shape)
print(X_test.shape)

In [None]:
# descriptive statistics example
X_train.describe()

### Quartiles

https://en.wikipedia.org/wiki/Interquartile_range

Interquartile range<br />
Whiskers<br />
Fence: https://www.statisticshowto.com/upper-and-lower-fences/

Outliers<br />
Boxplots<br />
Violin plots
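The fences mentioned above sit 1.5 interquartile ranges beyond the first and third quartiles; points outside them are the ones a boxplot draws individually. Here is a minimal sketch on made-up scores.

```python
import pandas as pd

values = pd.Series([52, 55, 58, 60, 61, 63, 65, 67, 70, 98])  # hypothetical scores

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f'q1={q1}, q3={q3}, iqr={iqr}')
print(f'fences: [{lower_fence}, {upper_fence}]')
print('outliers:', outliers.tolist())
```

The 1.5 multiplier is a convention rather than a law; flagged points deserve a closer look before anything is removed.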

In [None]:
# Assignment histogram
X_train['Assignment1'].hist();

In [None]:
# Assignment boxplot
X_train.boxplot(column=['Assignment1'])

In [None]:
# Assignment violinplot
import seaborn as sns

sns.violinplot(x=X_train['Assignment1'])

In [None]:
# count values outside the 1.5 * IQR fences for each numeric feature
for feat in X_train.select_dtypes(include='number').columns[1:]:
    q1 = grades[feat].quantile(0.25)
    q3 = grades[feat].quantile(0.75)
    iqr = q3 - q1
    lower_fence = (q1 - 1.5 * iqr).round()
    upper_fence = (q3 + 1.5 * iqr).round()
    lower_count = grades[feat][grades[feat] < lower_fence].count()
    upper_count = grades[feat][grades[feat] > upper_fence].count()
    print(f'{feat} outliers = {lower_count + upper_count}: lower_fence: {lower_fence}, upper_fence: {upper_fence}, lower_count: {lower_count}, upper_count: {upper_count}')

## Correlation

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it actually refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the price of a good and the quantity the consumers are willing to purchase... Correlations are useful because they can indicate a predictive relationship that can be exploited in practice (paras. 1 - 2).

https://en.wikipedia.org/wiki/Correlation.

### Correlation Between Features

* An absolute correlation above .9 between two features usually calls for action
* A value between .5 and .7 may need a closer look

Correlation does not imply causation. Ice cream sales and shark bites both rise on warm beach days, yet neither causes the other.

### Pearson’s r (correlation coefficient)

$\rho_{x,y} = \frac{cov(x,y)}{\sigma_x\sigma_y} = \frac{\frac{1}{N}\sum(x-\bar{x})(y-\bar{y})}{\sqrt\frac{\sum(x-\bar{x})^2}{N}\sqrt\frac{\sum(y-\bar{y})^2}{N}}  = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\sqrt{\sum(y-\bar{y})^2}}$

* Shows linear relationship between two continuous variables
* How one variable changes as another variable changes
* Measures both strength and direction

https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php<br />
https://www.mygreatlearning.com/blog/covariance-vs-correlation/
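The last form of the formula can be checked numerically. The sketch below computes r from the deviations by hand on a small made-up dataset and compares it with `np.corrcoef`.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical, roughly y = 2x

# Deviations from the means
dx, dy = x - x.mean(), y - y.mean()

# r = sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
r = np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))  # NumPy's built-in, for comparison
```

The 1/N factors cancel between numerator and denominator, which is why the simplified form needs no N at all.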

### Covariance

$cov(x, y) = \frac{1}{N} \sum_{i=1}^{N}(x_i - \bar{x}) (y_i - \bar{y})$

* Shows how variables change together
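* An unstandardized counterpart of correlation: its magnitude depends on the units of $x$ and $y$
* Its sign gives the direction of the relationship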
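A short sketch can show why covariance conveys direction but not strength: changing the units rescales the covariance while leaving the correlation untouched. The weights and mpg values are made up, and the helper `cov` is an illustrative implementation of the formula above.

```python
import numpy as np

weight_kg = np.array([900.0, 1100.0, 1300.0, 1500.0, 1700.0])  # hypothetical
mpg = np.array([38.0, 33.0, 29.0, 24.0, 20.0])                  # hypothetical

def cov(x, y):
    # Population covariance, dividing by N as in the formula above
    return np.mean((x - x.mean()) * (y - y.mean()))

cov_kg = cov(weight_kg, mpg)
cov_g = cov(weight_kg * 1000, mpg)   # same cars, weight expressed in grams

r_kg = np.corrcoef(weight_kg, mpg)[0, 1]
r_g = np.corrcoef(weight_kg * 1000, mpg)[0, 1]

print('covariance (kg):', cov_kg)
print('covariance (g): ', cov_g)
print('correlation:', round(r_kg, 4), round(r_g, 4))
```

Note that `np.cov` divides by N - 1 unless `ddof=0` is passed, so it matches this formula only with that argument.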

### Multicollinearity

* Makes it difficult to determine which independent variables are influencing the dependent variable

### Correlation vs Multicollinearity

* Correlation measures how two variables move together (desirable between an independent variable and the dependent variable)
* (Multi)collinearity is a linear relationship, usually a strong one, among two or more features

### Variance Inflation Factor (VIF)

https://towardsdatascience.com/statistics-in-python-collinearity-and-multicollinearity-4cc4dcd82b3f 

* Minimum possible value is one
* Values over 10 are commonly taken as a sign of problematic multicollinearity
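The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below implements this with plain NumPy least squares on simulated data; the function `vif` and the three simulated columns are illustrative assumptions, not part of the grades dataset.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Intercept plus every column except j
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = rng.normal(size=300)                 # unrelated to a
c = a + 0.1 * rng.normal(size=300)       # nearly a copy of a

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))  # columns a and c should be large, b close to 1
```

When R² is 0 the VIF equals 1, which is why one is the minimum; as a feature becomes predictable from the others, R² approaches 1 and the VIF grows without bound.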

In [None]:
# regression plot annotated with Pearson's r, followed by a correlation heat map
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

ptext = f"Pearson r: {round(stats.pearsonr(X_train['Midterm'], X_train['Final'])[0], 2)}"
sns.regplot(x='Midterm', y='Final', data=X_train, ci=None,
            line_kws={'color': 'red', 'label': ptext});
                      
plt.legend()
plt.show();

# correlation matrix
sns.set(style="white")

# compute the correlation matrix (numeric columns only; X_train still holds string columns)
corr = X_train.corr(numeric_only=True)

# generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
f, ax = plt.subplots()

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True);

There's not much correlation between the independent variables.

In [None]:
# sns pairplot
import seaborn as sns

eda_data = X_train.copy()
eda_data['FinalGrade'] = y_train

sns.pairplot(data=eda_data, corner=True);

In [None]:
# scatter plots showing correlation
import pandas as pd
import seaborn as sns

sns.pairplot(data=eda_data, x_vars=['Assignment1', 'Tutorial', 'Midterm', 'Final'], y_vars='FinalGrade', 
             kind='reg',
             height=5,
             aspect=0.8, 
             plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.5}});

In [None]:
# correlation with target
X_train.corrwith(y_train, numeric_only=True).plot.bar(
        title = 'Correlation with Final Grade (Target)', rot = 45, grid = True);

The Final shows a correlation above 0.8 with the Final Grade.

## Data Visualization

* https://matplotlib.org/stable/gallery/index.html 
* http://seaborn.pydata.org/examples/
* https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html 
* https://towardsdatascience.com/a-complete-guide-to-plotting-categorical-variables-with-seaborn-bfe54db66bec

### Some Types of Plots

https://towardsdatascience.com/intro-to-dynamic-visualization-with-python-animations-and-interactive-plots-f72a7fb69245

* Static
* Dynamic
* Interactive

In [None]:
# X_train review
X_train.head()

In [None]:
X_train.describe()

In [None]:
# histogram
X_train['Final'].hist()

In [None]:
# pie plot
grades['Prefix'].value_counts().plot(kind='pie');

In [None]:
# bar plot
grades[['Assignment1', 'Final']].mean().plot(kind='bar');

In [None]:
# bivariate
grades.plot.scatter(x='Midterm', y='Final', c='black');

In [None]:
# multivariate
grades.plot.scatter(x='Midterm', y='Final', c='age', colormap='viridis');

## Groupby 

https://towardsdatascience.com/meet-the-hardest-functions-of-pandas-part-ii-f8029a2b0c9b

In [None]:
# groupby
X_train.groupby('sex')['Final'].agg(['min', 'max', 'mean'])

In [None]:
# bar chart; https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
X_train.groupby('sex')['Final'].agg(['min', 'max', 'mean']).plot(kind='bar');

In [None]:
# horizontal bar chart
X_train.groupby('sex')['Final'].agg(['min', 'max', 'mean']).plot(kind='barh');

In [None]:
# stacked bar
X_train.groupby('sex')['Final'].agg(['min', 'max', 'mean']).plot.bar(stacked=True);

In [None]:
# plot 4 grades grouped by sex
import matplotlib.pyplot as plt

female_scores = grades.query('`sex` == "female"')
male_scores = grades.query('`sex` == "male"')
ax = female_scores[['Assignment1', 'Tutorial', 'Midterm', 'Final']].mean().plot(kind='line', label='female')
male_scores[['Assignment1', 'Tutorial', 'Midterm', 'Final']].mean().plot(ax=ax, label='male')
plt.legend()
plt.show()

In [None]:
# groupby plot one liner
X_train.groupby('sex')[['Assignment1', 'Tutorial', 'Midterm', 'Final']].agg('mean').transpose().plot(kind='line');

In [None]:
# groupby breakdown
grouped = X_train.groupby('sex')[['Assignment1', 'Tutorial', 'Midterm', 'Final']].agg('mean')
print(grouped)
print('---' * 20)
print(grouped.transpose())
print('---' * 20)
grouped.transpose().plot(kind='line');

In [None]:
# groupby two liner
grouped = X_train.groupby('sex')[['Assignment1', 'Tutorial', 'Midterm', 'Final']].agg('mean')
grouped.transpose().plot(kind='line');

## Feature Engineering

In [None]:
# recheck our data
X_train.info()

### Bi-Label Mapping

In [None]:
# bi-label mapping
X_train['sex'] = X_train['sex'].map({'female':1,'male':0})
X_test['sex'] = X_test['sex'].map({'female':1,'male':0})

X_train['scholarship'] = X_train['scholarship'].map({'yes':1,'no':0})
X_test['scholarship'] = X_test['scholarship'].map({'yes':1,'no':0})

In [None]:
X_train.info()

### One Hot Encoding

In [None]:
# sklearn OneHotEncoder
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# https://stackoverflow.com/questions/50473381/scikit-learns-labelbinarizer-vs-onehotencoder
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer

pets = ['dog', 'cat', 'cat', 'dog', 'turtle', 'cat', 'cat', 'turtle', 'dog', 'cat']
le = LabelEncoder()
int_values = le.fit_transform(pets)
print('Data:', pets)
print('Label Encoder:', int_values)
int_values = int_values.reshape(len(int_values), 1)

ohe = OneHotEncoder(sparse_output=False)  # sparse_output replaced sparse in scikit-learn 1.2
ohe_values = ohe.fit_transform(int_values)
print('One Hot Encoder:\n', ohe_values)

lb = LabelBinarizer()
print('Label Binarizer:\n', lb.fit_transform(pets))

In [None]:
# using pandas get_dummies
import pandas as pd

X_dummy = pd.get_dummies(X_train, drop_first=True)
X_test_dummy = pd.get_dummies(X_test, drop_first=True)
print(X_dummy.shape)
print(X_test_dummy.shape)
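If the two shapes printed above differ in their column counts, it is usually because `get_dummies` only creates columns for labels it actually sees. The sketch below reproduces the situation on two tiny made-up frames and shows one possible fix with `reindex`; the frames and names are illustrative.

```python
import pandas as pd

train = pd.DataFrame({'pet': ['dog', 'cat', 'turtle', 'cat']})  # hypothetical
test = pd.DataFrame({'pet': ['cat', 'dog', 'dog']})             # no 'turtle' here

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)
print(list(train_d.columns))  # ['pet_dog', 'pet_turtle']
print(list(test_d.columns))   # ['pet_dog']

# Give the test frame exactly the training columns, filling absent ones with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

An encoder that is fit on the training data and then applied to the test data, such as scikit-learn's `OneHotEncoder`, avoids this mismatch by construction.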

In [None]:
# use sklearn one hot encoder
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories='auto', drop='first', sparse_output=False, handle_unknown='ignore')

cat_features = ['Prefix', 'TakeHome', 'work_status', 'cat_home']
ohe_train = ohe.fit_transform(X_train[cat_features])
ohe_train = pd.DataFrame(ohe_train, columns=ohe.get_feature_names_out(cat_features))
ohe_train.index = X_train.index
X_train = X_train.join(ohe_train)
X_train.drop(cat_features, axis=1, inplace=True)

ohe_test = ohe.transform(X_test[cat_features])
ohe_test = pd.DataFrame(ohe_test, columns=ohe.get_feature_names_out(cat_features))
ohe_test.index = X_test.index
X_test = X_test.join(ohe_test)
X_test.drop(cat_features, axis=1, inplace=True)

print(X_train.shape)
print(X_test.shape)
print(X_train.info())

## Feature Selection

* https://github.com/codingnest/FeatureSelection/blob/master/Data%20Science%20Lifecycle%20-%20Feature%20Selection%20(Filter%2C%20Wrapper%2C%20Embedded%20and%20Hybrid%20Methods).ipynb
* https://www.analyticsvidhya.com/blog/2020/10/a-comprehensive-guide-to-feature-selection-using-wrapper-methods-in-python/

<img src='https://editor.analyticsvidhya.com/uploads/84353IMAGE1.png' alt='feature selection' />

### Feature-engine

https://feature-engine.readthedocs.io/en/latest/

### Filter Methods

* Constant, Quasi Constant, and Duplicated Features
* Correlation
* Variance
* Mutual Information
* Chi-Square Test
* ANOVA
* SelectKBest

### Wrapper Methods

* Backward, Forward, Stepwise Selection (Regression)

### Embedded Methods

* Coefficients (Regression)
* Lasso Regularization
* Tree Importance

### Hybrid Methods

* Recursive Feature Elimination

### Mutual Information

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html

Estimate mutual information for a continuous target variable.

Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
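To see what "dependency" means beyond linear correlation, the sketch below estimates MI from a 2-D histogram on simulated data. This is a crude binned estimate written for illustration (the function `binned_mi` is hypothetical); scikit-learn's `mutual_info_regression` uses a nearest-neighbor estimator instead, so its numbers will differ.

```python
import numpy as np

def binned_mi(x, y, bins=10):
    """Mutual information in nats from a 2-D histogram (a crude estimate)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)
y_dependent = x ** 2                          # fully determined by x, but not linearly
y_independent = rng.uniform(-1, 1, size=5000)

r = np.corrcoef(x, y_dependent)[0, 1]
mi_dep = binned_mi(x, y_dependent)
mi_ind = binned_mi(x, y_independent)

print('Pearson r for y = x^2:', round(r, 3))
print('MI for y = x^2:', round(mi_dep, 3))
print('MI for independent y:', round(mi_ind, 3))
```

Pearson's r is close to zero for the parabola because the relationship is not linear, while the MI estimate is clearly positive; for the independent pair the estimate stays near zero.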

In [None]:
# Plot the mutual information between X_train and y_train
from sklearn.feature_selection import mutual_info_regression

mi = mutual_info_regression(X_train, y_train)
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False).plot.bar()
plt.ylabel('Mutual Information')