# **Feature Selection**
This notebook covers feature selection through feature correlation: we measure how strongly each feature correlates with the others and remove features that add no information beyond what the rest already provide. I suspect that feature selection will not be needed in this case, because there are only 3 features (protein_sequence, pH, data_source) to work with. However, further analysis of the protein_sequence feature might yield more informative features.
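
As a minimal sketch of the idea, the snippet below flags columns whose absolute correlation with an earlier column exceeds a threshold. The helper `drop_correlated`, the 0.9 threshold, and the synthetic `toy` data are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns that correlate strongly with an earlier column."""
    corr_abs = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    upper = corr_abs.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),  # nearly redundant with "a"
    "b": rng.normal(size=200),                              # independent of "a"
})

redundant = drop_correlated(toy, threshold=0.9)
print(redundant)
```

Only the later column of each highly correlated pair is flagged, so one representative of every redundant group is kept.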

# **Import Some Packages**

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_selection import SelectKBest, chi2

# Import helpers from another notebook (data_preprocessing.ipynb) via the ipynb package
from ipynb.fs.full.data_preprocessing import simple_convert, mod_tr_path

In [6]:
df = simple_convert(mod_tr_path)
corr = df.corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap='coolwarm').format(precision=3)



| | seq_id | protein_sequence | pH | data_source | tm |
|---|---|---|---|---|---|
| seq_id | 1.0 | -0.0 | -0.041 | 0.153 | 0.01 |
| protein_sequence | -0.0 | 1.0 | 0.02 | -0.097 | -0.06 |
| pH | -0.041 | 0.02 | 1.0 | -0.11 | -0.045 |
| data_source | 0.153 | -0.097 | -0.11 | 1.0 | 0.136 |
| tm | 0.01 | -0.06 | -0.045 | 0.136 | 1.0 |


In [7]:
df = simple_convert(mod_tr_path)
corr = df.drop(columns=["seq_id", "tm"]).corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap='coolwarm').format(precision=3)



| | protein_sequence | pH | data_source |
|---|---|---|---|
| protein_sequence | 1.0 | 0.02 | -0.097 |
| pH | 0.02 | 1.0 | -0.11 |
| data_source | -0.097 | -0.11 | 1.0 |


In [8]:
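def get_features(threshold):
    """Return the features whose absolute correlation with tm exceeds threshold."""
    df = simple_convert(mod_tr_path)
    # Correlation of every column with the target, ignoring sign
    corr_abs = df.corr()["tm"].abs()
    # Exclude the target itself (its self-correlation is always 1)
    corr_abs = corr_abs.drop("tm")
    # Keep only the features that are correlated strongly enough with tm
    relevant_features = corr_abs[corr_abs > threshold]
    return relevant_features.index.tolist()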
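
To see how a correlation threshold behaves on the values reported above, the sketch below applies one to the tm column of the first printed matrix. It assumes the goal is to keep features whose absolute correlation with tm exceeds the threshold. `features_above` is a hypothetical helper written for this illustration, and the two thresholds are arbitrary choices.

```python
import pandas as pd

# Correlations with "tm", copied from the first correlation matrix in this notebook.
corr_with_tm = pd.Series({
    "seq_id": 0.01,
    "protein_sequence": -0.06,
    "pH": -0.045,
    "data_source": 0.136,
    "tm": 1.0,
})


def features_above(corr_target, threshold, target="tm"):
    """Hypothetical helper: names whose |correlation| with the target exceeds threshold."""
    corr_abs = corr_target.drop(labels=[target]).abs()
    return corr_abs[corr_abs > threshold].index.tolist()


loose = features_above(corr_with_tm, threshold=0.05)
strict = features_above(corr_with_tm, threshold=0.1)
print(loose)
print(strict)
```

With every correlation this close to zero, the result is very sensitive to the threshold, which is consistent with the expectation that correlation-based selection has little to offer on this dataset.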