# For this problem set, we will apply unsupervised and supervised learning techniques to diagnose hepatocellular carcinoma (HCC) from transcriptomics data. 

## This [dataset](https://figshare.com/articles/dataset/Liver_vs_non-liver_microarray_data_formatted_from_GSE14520_/24616128) was processed from Gene Expression Omnibus [GSE14520](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE14520)
I have formatted the data so that you can read them directly with pandas

### Most of the code has been provided. Fill in the missing code at *FILL HERE*
However, please make sure not to change the provided variable names

You may also need to add more code to answer some of the questions

## Q1: Import packages that you need here

In [None]:
*FILL HERE*

## Q2: Load the transcriptomics data with pandas

In [None]:
data = pd.read_csv(*FILL HERE*, header = *FILL HERE*, index_col = *FILL HERE*)
data.head()

### Separate gene expression data and cancer labels

In [None]:
exp_data = data.iloc[*FILL HERE*]
cancer_labels = data[*FILL HERE*]

## Q3: Count the numbers of HCC and normal samples

In [None]:
*FILL HERE*

## Q4: Replace microarray probe IDs with gene symbols
1. Download probe ID mapping for platform [GPL3921](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL3921) from GEO (should be in .txt format)
2. Load the mapping file using pandas
  * You will need to adjust **header** and **sep** parameters to get the right data

In [None]:
annot = pd.read_csv(*FILL HERE*, header = *FILL HERE*, index_col = *FILL HERE*, sep = *FILL HERE*)
annot.head(2)

3. Create a dictionary that maps from probe ID to gene symbol

In [None]:
probe_to_gene = {}

for probe_id in annot.index:
    if not pd.isna(annot.loc[*FILL HERE*]): ## ignore rows with missing gene symbol
        gene_symbol = annot.loc[*FILL HERE*].split()[0] ## add .split()[0] because some rows contain multiple gene symbols
        probe_to_gene[probe_id] = gene_symbol

4. Apply the mapping to the column names

In [None]:
probes_with_symbol = *FILL HERE* ## create a list of probes that have associated gene symbols in probe_to_gene

selected_exp_data = exp_data.loc[:, probes_with_symbol].copy()
selected_exp_data.columns = [probe_to_gene[x] for x in selected_exp_data.columns]
selected_exp_data.head(2)
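
The `.split()[0]` idiom in step 3 can be checked in isolation on a few made-up annotation strings. This is only a sketch: the probe IDs and gene names below are hypothetical, and the ` /// ` separator is an assumption about how the platform file lists multiple symbols, so inspect your own file before relying on it.

```python
# Hypothetical annotation values; " /// " as a separator is an assumption
# about how the platform file lists multiple gene symbols.
raw_symbols = {
    "probe_1": "STAT1",
    "probe_2": "GENE_A /// GENE_B",
    "probe_3": None,  # missing gene symbol
}

probe_to_first_symbol = {}
for probe_id, symbol in raw_symbols.items():
    if symbol is None:
        continue  # skip probes without a gene symbol
    # str.split() with no argument splits on whitespace,
    # so the first token is the first listed symbol.
    probe_to_first_symbol[probe_id] = symbol.split()[0]

print(probe_to_first_symbol)
```

Keeping only the first symbol is a simplification; probes that match several genes are assigned entirely to the first one listed.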

## Q5: Sum the expression data of probes that map to the same gene
Notice above that there are multiple columns for **ACTB**, **GAPDH**, and **STAT1**, for example

What's the number of distinct genes after merging?

Ans:

In [None]:
selected_exp_data = selected_exp_data.groupby(by = *FILL HERE*, axis = *FILL HERE*).sum()
selected_exp_data.head(2)
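
Recent pandas releases deprecate the `axis` argument of `groupby`, so the cell above may raise a warning or an error depending on your version. One possible workaround, shown here on a toy table with hypothetical gene names rather than the real data, is to transpose, group on the row index, and transpose back.

```python
import pandas as pd

# Toy expression table: rows are samples, columns are gene symbols,
# with two probes mapped to the same (hypothetical) gene "G1".
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]],
    index=["sample_1", "sample_2"],
    columns=["G1", "G2", "G1"],
)

# Transpose so genes become the row index, sum rows sharing a label,
# then transpose back to samples x genes.
merged = toy.T.groupby(level=0).sum().T

print(merged)
```

The result has one column per distinct gene symbol, so its number of columns is the number of distinct genes after merging.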

## Q6: Visualize data distribution with PCA and UMAP
Don't forget to standardize your data first

In [None]:
std_data = *FILL HERE*

pca = *FILL HERE*
pca_embed = *FILL HERE*

umap_embed = *FILL HERE* ## set n_neighbors = 25
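
As a reminder of what standardization means here, the sketch below z-scores each column of a small made-up matrix with NumPy only. It is an illustration, not the required solution: scikit-learn's `StandardScaler` performs the same per-column computation by default and is one common choice for the real data.

```python
import numpy as np

# Toy matrix: 4 samples (rows) x 2 genes (columns); values are made up.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Z-score each column: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # approximately 0 for every gene
print(Z.std(axis=0))   # approximately 1 for every gene
```

After this step both toy genes are on the same scale even though their raw values differ tenfold, which is why standardization matters before PCA and UMAP.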

### Color by HCC versus normal

In [None]:
plt.figure(figsize = (10, 5))

plt.subplot(1, 2, 1)
plt.scatter(pca_embed[*FILL HERE*, 0], pca_embed[*FILL HERE*, 1], label = 'Normal')
plt.scatter(pca_embed[*FILL HERE*, 0], pca_embed[*FILL HERE*, 1], label = 'HCC')
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.title('PCA')

plt.subplot(1, 2, 2)
*FILL HERE* ## generate the same scatter plot with umap_embed

plt.tight_layout()
plt.show()

### What do these scatter plots tell you about the transcriptomics profiles for HCC vs normal?
Ans:

## Q7: Identify genes that are differentially expressed
This microarray dataset has been normalized and log-transformed. Hence, t-tests can be used

We are collecting the statistical test results in a new DataFrame with p-value and log FC

In [None]:
ttest_results = pd.DataFrame(0.0, index = selected_exp_data.columns, columns = ['P-value', 'HCC/Normal Log FC']) ## float fill value so the columns keep a float dtype

for gene in selected_exp_data.columns:
    normal_exp = selected_exp_data[gene].loc[*FILL HERE*]
    cancer_exp = selected_exp_data[gene].loc[*FILL HERE*]
    
    ttest_results.loc[gene, 'P-value'] = ttest_ind(*FILL HERE*)[1]
    ttest_results.loc[gene, 'HCC/Normal Log FC'] = *FILL HERE*
    
ttest_results.head()
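
Because the data are already log-transformed, a log fold change is a difference of group means rather than a ratio. The sketch below illustrates this on made-up values for a single gene; the log base of 2 is an assumption, so check how the dataset was processed before interpreting the numbers.

```python
import numpy as np

# Made-up log2 expression values for one gene (assumption: log base 2).
normal_log2 = np.array([5.0, 5.5, 4.5, 5.0])
cancer_log2 = np.array([7.0, 7.5, 6.5, 7.0])

# On a log scale, the log fold change is a difference of group means.
log_fc = cancer_log2.mean() - normal_log2.mean()

# Convert back to a linear-scale ratio for interpretation.
fold_change = 2.0 ** log_fc

print(log_fc, fold_change)
```

Under the base-2 assumption, a log FC of 1 corresponds to a 2-fold change and a log FC of -1 to a 2-fold decrease.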

### Use the statsmodels package to perform Benjamini-Yekutieli correction with an FDR cutoff of 0.01
How many statistically significant DEGs are there?

Ans:

In [None]:
by_result = multipletests(*FILL HERE*, method = *FILL HERE*, alpha = *FILL HERE*)
by_filter = by_result[0]
print('number of significant DEGs:', *FILL HERE*)
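
To see what the Benjamini-Yekutieli correction does to the p-values, here is a minimal NumPy sketch of the adjustment applied to four made-up p-values. The helper name `by_adjust` is hypothetical and the function is meant only to illustrate the arithmetic; for the problem set itself, use `multipletests` as instructed.

```python
import numpy as np

def by_adjust(p_values):
    """Benjamini-Yekutieli adjusted p-values (illustrative sketch)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)          # indices that sort p ascending
    ranks = np.arange(1, m + 1)
    c_m = np.sum(1.0 / ranks)      # harmonic-sum penalty used by BY
    scaled = p[order] * m * c_m / ranks
    # Enforce monotonicity, working from the largest p-value downward.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)  # cap at 1, restore input order
    return adjusted

toy_p = [0.001, 0.04, 0.03, 0.5]   # made-up p-values
adj = by_adjust(toy_p)
significant = adj <= 0.01          # FDR cutoff of 0.01

print(adj)
print(significant)
```

The harmonic-sum factor makes this correction more conservative than Benjamini-Hochberg, so expect fewer significant genes than an uncorrected cutoff would give.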

### Let's apply fold change cutoff at 2-fold
How many statistically significant DEGs also have 2 or higher fold change across the two groups? 

Ans:

In [None]:
fc_filter = *FILL HERE*
combined_filter = by_filter & fc_filter

print('number of significant DEGs with 2 or higher fold change:', *FILL HERE*)

### Visualize volcano plot

In [None]:
plt.figure(figsize = (7, 4))

plt.scatter(*FILL HERE*, s = 3, label = 'Other genes') ## plot all other genes
plt.scatter(*FILL HERE*, s = 3, label = 'DEGs') ## plot only significant DEGs

plt.xlabel('HCC/Normal Log FC'); plt.ylabel('Minus Log P-value'); plt.legend()
plt.show()

## Q8: One gene has a log FC of ~10 but does not have as low a p-value as other DEGs
Identify which gene it is and visualize its expression in the HCC and normal groups using a violin plot

Ans:

In [None]:
ttest_results.sort_values('HCC/Normal Log FC', ascending = False).head(3)

In [None]:
*FILL HERE* ## show violin plot for the top gene

### Compare the above pattern with the gene that has the lowest p-value
Which gene has the lowest p-value? Visualize its expression in the HCC and normal groups using a violin plot

Ans:

In [None]:
ttest_results.sort_values('P-value').head(3)

In [None]:
*FILL HERE* ## show violin plot for the top gene

### Do the two violin plots agree with observed fold changes and p-values for these two genes?
Ans:

## Q9: Visualize the expression of these two genes on PCA and UMAP scatter plots
Use subplots to place all 4 scatter plots in the same figure

In [None]:
*FILL HERE*

## Q10: Let's build a logistic regression model to distinguish HCC from normal samples
The required packages are imported for you

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV

five_fold_splitter = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 25) ## split data into 5 stratified folds of roughly equal size

### Define the base model and use GridSearchCV to try every hyperparameter combination
Fit **GridSearchCV** on standardized data

In [None]:
base_model = LogisticRegression(max_iter = 1000, solver = 'liblinear', random_state = 25) ## these are our base hyperparameters
hyperparameters = {'penalty': ['l1', 'l2'],
                   'C': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]}

tuned_model = GridSearchCV(base_model, param_grid = hyperparameters, scoring = 'accuracy', 
                           refit = True, cv = five_fold_splitter)
tuned_model.fit(*FILL HERE*)
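
To get a feel for how much work the grid search does, the sketch below enumerates the same hyperparameter grid with the standard library and counts the cross-validation fits. It does not train any model; the variable names other than `hyperparameters` are introduced here for illustration.

```python
from itertools import product

# Same grid values as in the cell above.
hyperparameters = {
    "penalty": ["l1", "l2"],
    "C": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100],
}
n_folds = 5

# Every combination of one value per hyperparameter.
names = list(hyperparameters)
combinations = [dict(zip(names, values))
                for values in product(*hyperparameters.values())]

n_cv_fits = len(combinations) * n_folds  # one fit per combination per fold

print(len(combinations), n_cv_fits)
```

With `refit = True`, one additional fit on the full data follows the cross-validation fits, using the best combination found.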

### View the top 5 hyperparameter sets
What is the best hyperparameter setting? 

Ans:

What is the achieved accuracy?

Ans:

In [None]:
tuned_result = pd.DataFrame.from_dict(tuned_model.cv_results_)
tuned_result = tuned_result.sort_values('rank_test_score')
tuned_result[['params', 'mean_test_score', 'std_test_score']].head(5)

# Congratulations on reaching the end of this year's problem set!!