# Week 01 Assignment genes expressed in the brain

For the assignment of week 2 we will continue to work on the expression data from the human brain.

Learning outcomes:
- know how to work with ordered data datatypes: list, tuple, set
- can use list comprehensions 

The assignment for this week is again divided up in several steps (some of which are overlapping with the previous assignment). The general layout of what the program should do for this week is:
- Read the data
- Prepare the data
- Process the data
- report on results of the analysis

By now you will be familiar with the reading and preparing part of the data and we will add research questions for you to answer using Python and the new learning outcomes of this week.

<div class="alert alert-info">
    **Note:** For this week you should not copy and paste your code for every step and add the appropriate code for the next step. Instead, you are going to write all code in just one code cell.
</div>

## Research question/outcome
1. The number of probes is larger than the number of genes. This is because there is a redundancy in the number of probes per gene. Select per gene the probe that has the highest average expression across the samples.
2. When comparing the "lateral hypothalamic area, mammillary region" and "posterior hypothalamic area" show what the unique probes per region are and also the probes that are shared between the two brain regions. The structure acronyms are '"LHM" and '"PHA'" and can be found in the `SampleAnnot.csv` file.

There are some coding requirements for this assignment and you should make sure that you use following techniques:
 - use a dictionary structure to store the probes per gene
 - create and use a listcomprehension (where appropriate)
 - use the methods of the set datatype to get the difference and shared probe ids between the two brain regions


Tips:
- the number of probes per gene can be found when looking at the first and third column of the `Probes.csv` file. Make sure to skip the header
- The file `SampleAnnot.csv` contains the samples in the same order as the samples in the `MicroarrayExpression.csv` file. The 

In [14]:
gene_probe_dict = {}


# open the probe annotation file line by line
with open('../Data/Probes.csv') as probe_reader:
    line_counter = 0
    for line in probe_reader:
        line = line.strip()
        # skip the header
        if line_counter == 0:
            pass
            line_counter += 1
        else:
            line_splitted = line.split(',')
            probe_id = line_splitted[0]
            gene_id = line_splitted[2]
            
            # store the gene id as the key and add the probe id to the list
            if gene_id in gene_probe_dict:
                gene_probe_dict[gene_id].append(probe_id)
            else:
                # this gene id is new, lets add it to the dict as a key, value is list with the probe id of this line
                gene_probe_dict[gene_id] = [probe_id]


# gene_probe_dict: {'729': ['1058685', '1058686'], '731': ['1058684', '1058683'], '736': ['1058682', '1058681', '1058680'], '737': ['1058679', '1058678']}


expression_dict = {}

# open the file expression file line by line
with open('../Data/MicroarrayExpression.csv') as expression_reader:
    for line in expression_reader:
        line = line.strip() # remove the newline char
        # split the line on the delimiter
        probe_and_values_list = line.split(',')
        
        probe_id = probe_and_values_list[0]
        expression_values = probe_and_values_list[1:]

        # cast all expression values to float
        expression_values = [float(value) for value in expression_values]
        
        # save the probe id as the key and the list of expression values as the value
        expression_dict[probe_id] = expression_values 


# expression_dict: {'1058685': [3.6157916807089787, 2.13807367935382, 2.4805415023588107, 2.96497183239022, 2.67980316976351, 1.8562381007421844}

representative_probe_gene_dict = {}

# loop over every probe per gene
for gene, list_of_probe_ids in gene_probe_dict.items():
    max_expressed_probe = ''
    max_avg = 0

    for probe_id in list_of_probe_ids: # {'729': ['1058685', '1058686'}
        avg = sum(expression_dict[probe_id]) / len(expression_dict[probe_id])
        if avg > max_avg:
            max_expressed_probe = probe_id
            max_avg = avg

    representative_probe_gene_dict[probe_id] = gene

# representative_probe_gene_dict: {'1058686': '729', '1058683': '731'}

LHM = []
PHA = []

# SampleAnnot.csv contains the samples
# 13005,16,2635,CX,"LHM","lateral hypothalamic area, mammillary region, left",240495,90,100,94,3.0,-18.3,-5.9

with open(file='../Data/SampleAnnot.csv') as annotation_reader:
    for index, line in enumerate(annotation_reader):
        split_line = line.split(',')

        structure_acronym = split_line[4]

        if structure_acronym == '"LHM"': # "lateral hypothalamic area, mammillary region, left"
            LHM.append(index)
        elif structure_acronym == '"PHA"': #posterior hypothalamic area, left"
            PHA.append(index)


cutoff = 15
LHM_expressions = [probe for probe in representative_probe_gene_dict.keys() for index in LHM if expression_dict[probe][index] >= cutoff]
PHA_expressions = [probe for probe in representative_probe_gene_dict.keys() for index in PHA if expression_dict[probe][index] >= cutoff]

print(set(LHM_expressions).difference(set(PHA_expressions)))
print(set(PHA_expressions).difference(set(LHM_expressions)))
print(set(PHA_expressions).intersection(set(LHM_expressions)))




{'1050553'}
{'1029046', '1068561', '1021844', '1054159', '1069880', '1057755', '1025870', '1070709', '1070082', '1059084', '1034361', '1011847', '1063255', '1058023', '1050548', '1059572', '1062126', '1051985', '1070545', '1070085', '1051976', '1022359', '1023535', '1058959', '1032860', '1011496', '1052494', '1033084'}
{'1070326', '1033935', '1070402', '1012031', '1023782', '1036892', '1031691', '1010881', '1051987', '1037294', '1011186', '1011290', '1063924', '1014256', '1013992', '1068549', '1026328', '1015889', '1019145', '1065269', '1070418', '1014720', '1025388', '1070126', '1049781', '1070807', '1011086', '1016649', '1061573', '1060237', '1071058', '1064858', '1037328', '1012449', '1059652', '1012544', '1029329', '1059768', '1070087', '1010538', '1059432', '1051271', '1038771', '1032619', '1070319', '1050487', '1032011', '1060007', '1064562', '1054004', '1017544'}
