## Feature Selection

This '.ipynb' file applies aggressive feature selection to reduce a little over 32,000 features to a manageable number that is useful for training. This is done through the following methods:

- Initial feature selection of clinical data
- Dropping duplicates
- Dropping NaN values
- Fraction thresholding
- Variance thresholding
- Correlation thresholding
- More...

By applying all of these feature reduction methods, we hope to bring the total feature count down to somewhere between a few hundred and a thousand features at most.

#### 1. NaN and Duplicates

The preprocessed dataset is loaded and the two types of columns (clinical and gene) are defined. Extra clinical features are dropped. Then rows with a duplicate 'case_id' are dropped, so only one row per case remains. Lastly, any rows with NaN values are dropped to ensure error-free feature selection and model training.
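As a small illustration of the two row-level steps, the sketch below applies them to a made-up frame. The values and the `GENE_A_unstranded` column are hypothetical and only serve to show which rows survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: case "a" appears twice, case "c" has a missing value
toyFrame = pd.DataFrame({
    "case_id": ["a", "a", "b", "c"],
    "GENE_A_unstranded": [5.0, 7.0, 3.0, np.nan],
})

# Keeping the first row per case_id
deduplicated = toyFrame.drop_duplicates(subset=["case_id"])

# Dropping every row that still contains a NaN value
cleaned = deduplicated.dropna()

print(cleaned)
```

Note that `drop_duplicates` keeps the first occurrence by default, so which duplicate row survives depends on the row order of the input file.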

In [None]:
import pandas as pd

# Loading the preprocessed dataset
dataFrame: pd.DataFrame = pd.read_csv('./DatasetParser/Dataset/ProcessedFiles/merged_data.csv', low_memory=False)

# Function to get all columns related to genes
def getGeneColumns():
    return [col for col in dataFrame.columns if 'unstranded' in col]

# Selecting all the unstranded and tpm_unstranded columns
geneColumns: list[str] = getGeneColumns()

# Defining the clinical columns to be retained
clinicalColumns: list[str] = ['case_id', 'histological_type', 'icd_o_3_histology']

# Combining the clinical and gene columns that should be kept
selectedColumns: list[str] = clinicalColumns + geneColumns

# Keeping only the selected columns (copying so the in-place operations below act on an independent frame)
dataFrame = dataFrame[selectedColumns].copy()

# Keeping only one row per case_id
dataFrame.drop_duplicates(subset=['case_id'], inplace=True)

# Dropping rows with any NaN values
dataFrame.dropna(inplace=True)

#### 2. Mixed data types

Any columns with mixed data types are dropped to ensure correct feature reduction and model training.
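The check can be tried out on a tiny frame first. This is a minimal sketch with hypothetical column names (`clean`, `mixed`); a column counts as mixed when its cells map to more than one Python type.

```python
import pandas as pd

# Hypothetical toy data: the "mixed" column holds both int and str objects
toyFrame = pd.DataFrame({
    "clean": [1, 2, 3],
    "mixed": [1, "two", 3],
})

# Mapping every cell to its Python type and counting the distinct types per column
mixedColumns = [
    col for col in toyFrame.columns
    if toyFrame[col].map(type).nunique() > 1
]

print(mixedColumns)
```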

In [None]:

# Identifying columns with mixed data types
mixedTypeColumns: list[str] = [
    col for col in dataFrame.columns
    if dataFrame[col].map(type).nunique() > 1
]

# Dropping columns with mixed data types
dataFrame.drop(columns=mixedTypeColumns, inplace=True)

# Keeping the gene columns up-to-date
geneColumns = getGeneColumns()

#### 3. Fraction threshold

Any gene for which more than 80% of the samples have a value below 10 is dropped; these genes likely do not play a large role in detecting histology subtypes.
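The rule can be illustrated on a toy frame. The gene names and values below are hypothetical; the two thresholds are the ones used in this notebook.

```python
import pandas as pd

countThreshold = 10
sampleFractionThreshold = 0.8

# Hypothetical values for five samples and two genes
toyFrame = pd.DataFrame({
    "SILENT_unstranded": [0, 2, 1, 3, 4],       # all five samples below 10
    "ACTIVE_unstranded": [120, 0, 35, 80, 15],  # only one sample below 10
})

# Fraction of samples per gene with a value below the count threshold
lowFraction = (toyFrame < countThreshold).mean(axis=0)

# Keeping the genes that are not low in more than 80% of the samples
keptColumns = lowFraction[lowFraction <= sampleFractionThreshold].index.tolist()

print(keptColumns)
```

Comparing fractions with `.mean()` is equivalent to comparing the count of low samples against `sampleFractionThreshold` times the number of rows.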

In [None]:
countThreshold: int = 10
sampleFractionThreshold: float = 0.8

# Creating a mask that marks the inactive genes: value below countThreshold in more than sampleFractionThreshold of the samples
mask: pd.Series = (dataFrame[geneColumns] < countThreshold).sum(axis=0) > (sampleFractionThreshold * dataFrame[geneColumns].shape[0])

# Creating a filtered data frame that keeps only the active genes
filterdDataFrame: pd.DataFrame = dataFrame[geneColumns].loc[:, ~mask]

# Updating the gene columns to ensure correct indexing
geneColumns = filterdDataFrame.columns

# Updating the original dataframe
dataFrame = dataFrame[clinicalColumns].join(filterdDataFrame)

print(dataFrame)

                                   case_id  \
0     b40d0849-f4ef-4a14-9732-9beb708cb46b   
1     0d66bf6c-eed0-4726-bd5b-3bf6d610b4e0   
2     193201a3-1447-47b1-bdf1-11ae0eb3b2f3   
3     21fb46f9-4bbb-441c-af19-a687e9138344   
5     4d51ee44-e6f4-4bcb-be28-e9df54b39a8d   
...                                    ...   
1157  ad98977b-e159-410a-b8c2-f4e8a07f9784   
1158  3f1b4356-0b53-48cd-8938-96fc786c9b63   
1159  a9d7adec-3849-40e2-a2a2-f43443ec43bb   
1160  8be10ce9-220d-4816-a575-8e8e7f041114   
1161  d49b0369-905c-4608-96a9-cc854980fc4c   

                                      histological_type icd_o_3_histology  \
0     Lung Adenocarcinoma- Not Otherwise Specified (...            8140/3   
1     Lung Adenocarcinoma- Not Otherwise Specified (...            8140/3   
2     Lung Adenocarcinoma- Not Otherwise Specified (...            8140/3   
3                          Mucinous (Colloid) Carcinoma            8480/3   
5     Lung Adenocarcinoma- Not Otherwise Specified (...       

#### 4. Variance threshold

Genes whose values vary little across samples add little to no value to model training and can be dropped.
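The effect of a variance cut-off can be sketched with plain pandas. The gene names and values are hypothetical. `ddof=0` is used because scikit-learn's `VarianceThreshold` works with the population variance and keeps only features whose variance is strictly above the threshold.

```python
import pandas as pd

varianceThreshold = 10

# Hypothetical values for four samples and two genes
toyFrame = pd.DataFrame({
    "FLAT_unstranded": [100, 101, 99, 100],   # population variance 0.5
    "VARIED_unstranded": [10, 200, 50, 400],  # population variance 23425.0
})

# Population variance per gene (ddof=0)
variances = toyFrame.var(axis=0, ddof=0)

# Keeping the genes whose variance is strictly above the threshold
keptColumns = variances[variances > varianceThreshold].index.tolist()

print(keptColumns)
```

Because the variance of raw values grows with their magnitude, a fixed threshold of 10 may remove very few genes once the low-valued genes are already gone; a threshold on scaled or log-transformed values is one possible alternative.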

In [None]:
# TODO: Seems to not drop any extra columns, but just in case

from sklearn.feature_selection import VarianceThreshold

# Define a threshold for variance
varianceThreshold: int = 10  

# Step 1: Select only the feature columns (ensure no NaNs)
threshHoldDataFrame: pd.DataFrame = dataFrame[geneColumns].copy()

# Step 2: Apply VarianceThreshold
selector = VarianceThreshold(threshold=varianceThreshold) # TODO find logical value
X_selected = selector.fit_transform(threshHoldDataFrame)

# Step 3: Get selected column names (a pandas Index, converted to a list further down)
selectedColumns: pd.Index = threshHoldDataFrame.columns[selector.get_support()]

# Step 4: Create a new DataFrame with selected features
filterdDataFrame = pd.DataFrame(X_selected, columns=selectedColumns, index=dataFrame.index)

# Step 5: Drop original gene columns and add the reduced set
dataFrame = dataFrame.drop(columns=geneColumns).join(filterdDataFrame)

# Updating the geneColumns to reflect the reduced set
geneColumns = selectedColumns.tolist()

print(dataFrame)

                                   case_id  \
0     b40d0849-f4ef-4a14-9732-9beb708cb46b   
1     0d66bf6c-eed0-4726-bd5b-3bf6d610b4e0   
2     193201a3-1447-47b1-bdf1-11ae0eb3b2f3   
3     21fb46f9-4bbb-441c-af19-a687e9138344   
5     4d51ee44-e6f4-4bcb-be28-e9df54b39a8d   
...                                    ...   
1157  ad98977b-e159-410a-b8c2-f4e8a07f9784   
1158  3f1b4356-0b53-48cd-8938-96fc786c9b63   
1159  a9d7adec-3849-40e2-a2a2-f43443ec43bb   
1160  8be10ce9-220d-4816-a575-8e8e7f041114   
1161  d49b0369-905c-4608-96a9-cc854980fc4c   

                                      histological_type icd_o_3_histology  \
0     Lung Adenocarcinoma- Not Otherwise Specified (...            8140/3   
1     Lung Adenocarcinoma- Not Otherwise Specified (...            8140/3   
2     Lung Adenocarcinoma- Not Otherwise Specified (...            8140/3   
3                          Mucinous (Colloid) Carcinoma            8480/3   
5     Lung Adenocarcinoma- Not Otherwise Specified (...       

#### 5. Correlation threshold

Features that are strongly correlated with another feature add little to no extra information to model training and can be dropped.
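For a large number of genes, a nested Python loop over the correlation matrix can be slow. The sketch below shows one possible vectorized variant of the same rule on a hypothetical toy frame (columns `A`, `B`, `C`): a column is dropped when it correlates above the threshold with any column that comes before it.

```python
import numpy as np
import pandas as pd

correlationThreshold = 0.5

# Hypothetical toy data
toyFrame = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with A
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly correlated with A and B
})

# Absolute correlation matrix
absCorrelation = toyFrame.corr().abs()

# Keeping the strict upper triangle so every pair is tested once (row index < column index)
upperMask = np.triu(np.ones(absCorrelation.shape, dtype=bool), k=1)
upperTriangle = absCorrelation.where(upperMask)

# A column is dropped when it correlates above the threshold with an earlier column
columnsToDrop = [
    col for col in upperTriangle.columns
    if (upperTriangle[col] > correlationThreshold).any()
]

reducedFrame = toyFrame.drop(columns=columnsToDrop)

print(columnsToDrop)
```

As with the loop version, a column is also dropped when the earlier column it correlates with is itself dropped, so the result depends on the column order.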

In [None]:
# Creating a correlation matrix
correlationMatrix = dataFrame[geneColumns].corr()

# Defining a threshold for correlation
correlationThreshold = 0.5 

# Collecting every column that correlates above the threshold with an earlier column
selectedColumns = set()
for i in range(len(correlationMatrix.columns)):
    for j in range(i):
        if abs(correlationMatrix.iloc[i, j]) > correlationThreshold:
            colname = correlationMatrix.columns[i]
            selectedColumns.add(colname)

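# Converting the set to a list
selectedColumns: list[str] = list(selectedColumns)

print(f'Columns with correlation above {correlationThreshold}: {selectedColumns}')

# Dropping the selected columns from the original DataFrame
dataFrame.drop(columns=selectedColumns, inplace=True)

# Keeping the gene columns up-to-date
geneColumns = getGeneColumns()

print(dataFrame)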

Columns with correlation above 0.5: ['AC009506.1_unstranded', 'LINC01176_unstranded', 'MCCC1-AS1_unstranded', 'AC026992.2_unstranded', 'AC004687.1_unstranded', 'AC004982.2_unstranded', 'AL135784.1_unstranded', 'AP001189.1_unstranded', 'OTUD6B-AS1_unstranded', 'SETBP1-DT_unstranded', 'AL021707.4_unstranded', 'AC021087.3_unstranded', 'AC009948.2_unstranded', 'AC002059.1_unstranded', 'CD81-AS1_unstranded', 'SEPTIN9-DT_unstranded', 'AL031717.1_unstranded', 'AC245884.8_unstranded', 'AC005052.2_unstranded', 'AL442125.1_unstranded', 'LINC01943_unstranded', 'AC079804.3_unstranded', 'AC010201.2_unstranded', 'AL137060.3_unstranded', 'AL022069.3_unstranded', 'AL008721.2_unstranded', 'LINC02604_tpm_unstranded', 'AC123912.1_unstranded', 'AC008731.1_unstranded', 'LINC01140_unstranded', 'AC012254.3_unstranded', 'AC007384.1_unstranded', 'AL441992.1_unstranded', 'CAMTA1-DT_unstranded', 'AC005076.1_unstranded', 'CD27-AS1_unstranded', 'AC067852.5_unstranded', 'AC105052.5_unstranded', 'LINC01431_unstrande