# [Myelodysplastic syndromes](https://pubmed.ncbi.nlm.nih.gov/27543316/)
For this problem set, you are tasked with analyzing exome sequencing data from 100 MDS patients.

The data is available at [MDSExome.xlsx](https://github.com/cmb-chula/comp-biol-3000788/blob/main/problem-sets/MDSExome.xlsx).

## Instructions
1. Carefully follow the description of the analysis to be performed. If you believe there is more than one way to solve the problem, please pick the one that you think is closest to the instructions.
2. This problem set mimics a real-world research situation, so try to **provide answers that meet the standard of actual research**.
3. For conceptual questions, please add your answer in the same markdown cell as the question itself. Look for the *Ans*.
4. For coding questions, add your code(s) in the code cell(s) provided.
  * You may print out more than what the questions ask for, if you feel that it will improve the understanding of the readers.
5. If the coding question asks you to print something or plot something, always annotate your answer. For example,
  * Use **print('number of rows in the data:', data.shape[0])** rather than just **print(data.shape[0])**
  * Add axis labels, title, and legend to the graphs to make them readable

## Q1: Import packages to load the data
Show the top 3 rows of the loaded dataset

In [None]:
data = 
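
One possible starting point for Q1 is sketched below. It assumes the workbook has been downloaded next to the notebook as `MDSExome.xlsx`, that the data sit on the first sheet, and that an Excel engine such as `openpyxl` is installed. The helper name `load_mds_exome` is hypothetical and not part of the problem set.

```python
import pandas as pd


def load_mds_exome(path="MDSExome.xlsx"):
    """Read the first sheet of the workbook into a DataFrame (assumed layout)."""
    return pd.read_excel(path)


# Usage, once the file is available locally:
# data = load_mds_exome()
# print('top 3 rows of the dataset:')
# print(data.head(3))
```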

## Q2: If we want to create a unique index for the data in each row of this dataset, what would be some possible ideas?
Ans: 

## Q3: Examine the data size
1. Print the size of this data (number of rows and columns)
2. Print the number of missing values in each column
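
The two steps above can be sketched on a toy table. The DataFrame `toy` and its values are made up for illustration; on the real dataset the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented.
toy = pd.DataFrame({
    'Nucleotide Position': [101.0, np.nan, 305.0, 412.0],
    'Amino Acid Change': ['R34H', None, 'G102V', None],
    'VAF': [0.42, 0.18, 0.51, 0.07],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print('number of rows in the data:', n_rows)
print('number of columns in the data:', n_cols)

# isna() flags missing cells; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()
print('number of missing values in each column:')
print(missing_per_column)
```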

There are several missing values in the *Nucleotide Position*, *Amino Acid Position*, and *Amino Acid Change* columns.
The code below extracts the affected rows into a DataFrame, **rows_with_missing**.

In [None]:
rows_with_missing = data.loc[pd.isna(data[['Nucleotide Position', 'Amino Acid Position', 'Amino Acid Change']]).any(axis = 1), :]
rows_with_missing.head(10)

## Q4: What do you think are the causes for these missing values?
Ans:

### Population AF (allele frequency in the human population, estimated from the 1000 Genomes Project)
Some Population AF values are missing because several mutations identified in our local cohort are not well documented in public human mutation databases.

One possibility is that these mutations are rare. Let's test this idea.

## Q5: Compare the distribution of VAF between rows with Population AF and rows without
1. Use seaborn's **violinplot**
2. Use matplotlib's **hist** and overlay the two histograms onto the same plot
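
A minimal sketch of the overlaid-histogram part, on made-up values. It assumes the columns are named `VAF` and `Population AF` and that VAF is expressed as a fraction between 0 and 1; adjust the bins if the real column uses percentages. The boolean mask `has_af` could also serve as the grouping variable for the violin plot.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration only.
toy = pd.DataFrame({
    'VAF': [0.45, 0.50, 0.48, 0.10, 0.12, 0.08, 0.52, 0.15],
    'Population AF': [0.01, 0.20, 0.05, np.nan, np.nan, np.nan, 0.30, np.nan],
})

# Split VAF by whether Population AF is present.
has_af = toy['Population AF'].notna()
vaf_with_af = toy.loc[has_af, 'VAF']
vaf_without_af = toy.loc[~has_af, 'VAF']

# Shared bin edges keep the two histograms directly comparable.
bins = np.linspace(0, 1, 21)
fig, ax = plt.subplots()
ax.hist(vaf_with_af, bins=bins, alpha=0.5, label='with Population AF')
ax.hist(vaf_without_af, bins=bins, alpha=0.5, label='without Population AF')
ax.set_xlabel('VAF')
ax.set_ylabel('Number of mutations')
ax.set_title('VAF by availability of Population AF (toy data)')
ax.legend()
```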

### What is your observation?

Ans:

## Q6: Use an appropriate statistical test to determine whether the VAF of mutations with Population AF is significantly higher than the VAF of mutations without Population AF
Show the test result.
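
As one illustration of a one-sided two-sample test, the sketch below runs SciPy's Mann-Whitney U test on simulated values. The choice of test is an assumption made for the example, not the prescribed answer; which test is appropriate depends on the properties of the real data.

```python
import numpy as np
from scipy import stats

# Simulated VAF values for the two groups (not real data).
rng = np.random.default_rng(0)
vaf_with_af = rng.uniform(0.3, 0.6, size=30)
vaf_without_af = rng.uniform(0.0, 0.3, size=30)

# alternative='greater' asks whether the first sample tends to be larger.
result = stats.mannwhitneyu(vaf_with_af, vaf_without_af, alternative='greater')
print('Mann-Whitney U statistic:', result.statistic)
print('one-sided p-value:', result.pvalue)
```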

### Does the test result agree with your observation from Q5?

Ans: 

### If you used a t-test above, or plan to use one, we should first check whether the VAF data are normally distributed.

## Q7: Test whether the VAF data are normally distributed
1. Show the test result
2. Plot a histogram of all VAF values
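
A normality check could look like the sketch below, which applies the Shapiro-Wilk test from SciPy to simulated, deliberately non-normal values. Shapiro-Wilk is only one of several available tests and is used here as an assumption for the example.

```python
import numpy as np
from scipy import stats

# Simulated U-shaped values standing in for the VAF column (not real data).
rng = np.random.default_rng(0)
toy_vaf = rng.beta(0.5, 0.5, size=200)

# Null hypothesis: the sample comes from a normal distribution.
statistic, p_value = stats.shapiro(toy_vaf)
print('Shapiro-Wilk statistic:', statistic)
print('p-value:', p_value)
```

A small p-value means the null hypothesis of normality is rejected for the sample at hand.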

### Does the histogram agree with the test result?

Ans: 

### Next, let's identify frequently mutated genes in these patients because they might be related to MDS

## Q8: Show the mutation frequency for each gene in this dataset
What are the top 3 mutated genes?

Ans:

Are they known to be involved in MDS? Provide evidence.

Ans:

## Q9: Use pie charts to show the frequency of values in the following columns
1. Predicted Impact
2. Variant Type

**Hint**: You can use the output from **value_counts** as input for **pie** plot
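
The hint can be sketched on a toy column as follows. The category labels are invented for illustration and may not match the values in the real *Variant Type* column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy column; the category names are made up.
toy = pd.DataFrame({
    'Variant Type': ['missense', 'missense', 'frameshift',
                     'synonymous', 'missense'],
})

# value_counts returns one count per category, sorted in descending order.
counts = toy['Variant Type'].value_counts()

fig, ax = plt.subplots()
ax.pie(counts, labels=counts.index, autopct='%1.1f%%')
ax.set_title('Variant Type (toy data)')
```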

### Let's explore whether VAF correlates with Population AF

## Q10: Visualize the relationship between VAF and Population AF
Use any plot of your choice that you believe best shows the relationship

## Q11: Calculate Pearson's and Spearman's correlation between VAF and Population AF
**Note**: Be careful that some functions cannot handle missing values
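
One way to respect the note above is to drop incomplete pairs before calling the SciPy functions, as in this sketch on made-up values. The column names `VAF` and `Population AF` are assumed to match the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data with missing Population AF values (invented numbers).
toy = pd.DataFrame({
    'VAF': [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    'Population AF': [0.01, np.nan, 0.04, 0.09, np.nan, 0.25],
})

# Keep only rows where both values are present.
paired = toy[['VAF', 'Population AF']].dropna()

pearson_r, pearson_p = stats.pearsonr(paired['VAF'], paired['Population AF'])
spearman_rho, spearman_p = stats.spearmanr(paired['VAF'], paired['Population AF'])
print('number of complete pairs:', len(paired))
print("Pearson's r:", pearson_r, 'p-value:', pearson_p)
print("Spearman's rho:", spearman_rho, 'p-value:', spearman_p)
```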

### Examine how Variant Type contributes to the HIGH, MODERATE, LOW, and MODIFIER predicted impact

## Q12: Generate a table summarizing the relationship between Variant Type and Predicted Impact
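
A contingency table is one way to summarize two categorical columns. The sketch below uses `pd.crosstab` on invented rows; the Variant Type labels and their pairing with impact levels are made up for the example.

```python
import pandas as pd

# Toy rows; the Variant Type labels and pairings are invented.
toy = pd.DataFrame({
    'Variant Type': ['missense', 'missense', 'frameshift',
                     'synonymous', 'frameshift'],
    'Predicted Impact': ['MODERATE', 'MODERATE', 'HIGH', 'LOW', 'HIGH'],
})

# Rows are variant types, columns are impact levels, cells are counts.
table = pd.crosstab(toy['Variant Type'], toy['Predicted Impact'])
print('Variant Type vs. Predicted Impact (toy data):')
print(table)
```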

### What is your observation?

Ans: 

## Q13: Generate a table summarizing the number of mutations for each patient
The result should be a **DataFrame** with patients on the rows and one column containing the *Number of Mutations*
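
The requested shape can be sketched with a group-and-count on a toy table. The column names `Patient ID` and `Gene` are hypothetical; substitute whatever the patient identifier column is called in the real dataset.

```python
import pandas as pd

# Toy table, one row per mutation; identifiers are invented.
toy = pd.DataFrame({
    'Patient ID': ['P01', 'P02', 'P01', 'P03', 'P01', 'P02'],
    'Gene': ['GENE_A', 'GENE_B', 'GENE_C', 'GENE_A', 'GENE_B', 'GENE_A'],
})

# size() counts rows per patient; to_frame() turns the Series into a
# one-column DataFrame with patients on the index.
mutations_per_patient = (
    toy.groupby('Patient ID').size().to_frame('Number of Mutations')
)
print('number of mutations for each patient (toy data):')
print(mutations_per_patient)
```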

## Bonus question: 
Come up with a visualization, a table, or a statistical test that says something interesting about this dataset and/or MDS.