# Microarray data analysis

A DNA microarray is a high-throughput technology used in genetics to measure the expression levels of thousands of genes simultaneously. A microarray consists of a small glass slide or silicon chip on which thousands of microscopic spots, known as probes, are deposited in an orderly grid. These probes are short DNA sequences or fragments complementary to specific genes or regions of the genome, so each spot represents a particular gene or DNA sequence. The results are typically stored in a .CEL file. To process a .CEL file, we go through the following steps:

- Get the .CEL files from NCBI (https://www.ncbi.nlm.nih.gov/).
- Every .CEL file corresponds to an experiment that is part of a SERIES, and each SERIES is based on a platform.
- We use the GPL570 platform to restrict ourselves to the **Affymetrix Human Genome U133 Plus 2.0 Array**.

### Collecting .CEL files

- Download the list of all sample accessions for GPL570 from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570&targ=self&view=brief&form=text and save it as `GPL570.txt`.
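
The download step can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names `platform_text_url` and `download_platform_file` are hypothetical, and the output file name `GPL570.txt` is the one read by the next cell.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

GEO_QUERY = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def platform_text_url(accession):
    """Build the brief, plain-text GEO record URL for a platform accession."""
    params = {"acc": accession, "targ": "self", "view": "brief", "form": "text"}
    return GEO_QUERY + "?" + urlencode(params)


def download_platform_file(accession, out_path):
    """Fetch the platform record and write it to out_path (needs network access)."""
    with urlopen(platform_text_url(accession)) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)


# Usage (hypothetical helper): download_platform_file("GPL570", "GPL570.txt")
```

Building the URL from a parameter dictionary makes it easy to reuse the same helper for another platform accession.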

In [None]:
# Read all sample IDs into a list:
import regex as rx

sample_id = []
with open('GPL570.txt', 'r') as f:
    for line in f:
        # Lines of the form '!Platform_sample_id = GSM...' hold one accession each
        if line.startswith('!Platform_sample_id = '):
            sample_id.append('GSM' + rx.findall(r'\d+', line)[0])

In [None]:
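import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import regex as rx

# Download the GEO page of every sample (a step towards fetching the .CEL.gz files)

os.makedirs('./html_files', exist_ok=True)
for idx in sample_id:
    url = 'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=' + idx
    page = requests.get(url)
    page.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open('./html_files/' + idx + '.html', 'wb') as file:
        file.write(page.content)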
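
The cell above saves each sample's GEO page as HTML rather than the .CEL.gz file itself. One possible next step is to pull the download links out of the saved pages. The sketch below assumes that each page contains an `<a href=...>` link whose (possibly percent-encoded) target ends in `.CEL.gz`; that page layout is an assumption and should be checked against a real page. The names `CelLinkParser` and `find_cel_links` are hypothetical. The standard-library parser keeps the example self-contained, though BeautifulSoup (imported above) would work similarly.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class CelLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .CEL.gz files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Decode percent-encoding before checking the file extension
            if name == "href" and value and unquote(value).lower().endswith(".cel.gz"):
                self.links.append(value)


def find_cel_links(html_text):
    """Return all hrefs in html_text whose target ends in .CEL.gz."""
    parser = CelLinkParser()
    parser.feed(html_text)
    return parser.links


# Made-up HTML fragment, for illustration only
example = '<a href="/files/GSM0000001.CEL.gz">(http)</a> <a href="/other.txt">x</a>'
print(find_cel_links(example))
```

The returned links may be relative, in which case they need to be joined with the site's base URL before downloading.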

### fRMA Normalization of .CEL files

Frozen RMA (fRMA) is a microarray preprocessing algorithm that allows one to analyze
microarrays individually or in small batches and then combine the data for analysis. This
is accomplished by utilizing information from the large publicly available microarray
databases. Specifically, estimates of probe-specific effects and variances are precomputed
and frozen. Then, with new data sets, these frozen parameters are used in concert with
information from the new array(s) to preprocess the data.

For details, see the documentation: https://www.bioconductor.org/packages/release/bioc/html/frma.html
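
To make the idea of frozen parameters concrete, here is a deliberately simplified sketch; it is not the frma implementation. It assumes log2 probe intensities for one probe set on a single new array, together with probe effects and variances that were estimated once from a reference database and then fixed. All names and numbers are made up for illustration, and the real algorithm involves further steps such as background correction, normalization against a frozen reference distribution, and robust weighting.

```python
def summarize_probe_set(log2_intensities, frozen_effects, frozen_variances):
    """Inverse-variance weighted mean of probe-effect-corrected intensities."""
    weights = [1.0 / v for v in frozen_variances]
    corrected = [y - phi for y, phi in zip(log2_intensities, frozen_effects)]
    return sum(w * c for w, c in zip(weights, corrected)) / sum(weights)


# Made-up numbers: three probes of one probe set on one new array
new_array = [8.0, 9.5, 7.0]
frozen_effects = [0.0, 1.5, -1.0]   # precomputed once, then "frozen"
frozen_variances = [1.0, 1.0, 2.0]  # noisier probes get less weight

expression = summarize_probe_set(new_array, frozen_effects, frozen_variances)
print(expression)
```

Because the probe effects and variances are fixed in advance, a single array can be summarized without refitting a model across many arrays.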

The following **R script** applies the fRMA transformation to each .CEL file and saves the resulting z-scores in .csv format.

```R
library(frma)
library(affy)

# Assumption: `sample_ids` is a character vector of sample accessions (GSM...)
for (i in sample_ids) {
  input_path <- paste0('.../cel_files/', i, '.CEL.gz')
  Data <- ReadAffy(filenames = input_path)

  # for custom CDF file use:
  # Data <- ReadAffy(filenames = input_path, cdfname = '...path_to/GPL17996_HGU133Plus2_Hs_ENTREZG.cdf')

  # fRMA preprocessing, then conversion to barcode z-scores
  Object <- frma(Data)
  bc <- barcode(Object, output = 'z-score')
  output_path <- paste0('..../expr_files/', i, '.csv')
  write.csv(bc, file = output_path)
}
```

To process all the .CEL files in a parallelized loop, see **get_z_score_parallel.r**.