## TODO:
- make column 1 of the GSM DGE matrix files for the single-cell
data match column 1 of the cell-to-cluster assignment file
- transpose the GSM DGE matrix files so that the gene names are now column 1
- join replicates (samp1+samp2, 1dpa1+1dpa2, 2dpa1+2dpa2, 4dpa1+4dpa2) on the gene name column
- join all files together

In [81]:
import pandas as pd

### Initialize samples and files
Below are the sample names and the file names, stored as lists for use later in the notebook.

In [82]:
samples = ['samp1',
           'samp2',
           '1dpa1',
           '1dpa2',
           '2dpa1',
           '2dpa2',
           '4dpa1',
           '4dpa2']
# Raw strings keep the leading backslash literal and avoid invalid-escape warnings.
files = [r'\GSM4095393_samp1_DGEmatrix.csv',
         r'\GSM4095394_samp2_DGEmatrix.csv',
         r'\GSM4095395_1dpa1_DGEmatrix.csv',
         r'\GSM4095396_1dpa2_DGEmatrix.csv',
         r'\GSM4095397_2dpa1_DGEmatrix.csv',
         r'\GSM4095398_2dpa2_DGEmatrix.csv',
         r'\GSM4095399_4dpa1_DGEmatrix.csv',
         r'\GSM4095400_4dpa2_DGEmatrix.csv']

### `clean_matrix` function
This function iterates over the rows of the dataframe and truncates each value in the `Barcodes`
column to its first 16 characters, which drops the "-1" at the end and leaves the 16 bp barcode.
It then prepends the sample identifier to the barcode.
After the loop, the dataframe is transposed. The transposition makes the original index [0, 1, 2, ...]
the column names and the gene names the index. The first row of the transposed dataframe, which holds
the barcodes, is then promoted to the header so that the barcodes become the column names. Finally,
before the cleaned matrix is returned, the index is reset so that the gene names move into their own
column called "genes" and the index is back to [0, 1, 2, ...].

In [83]:
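def clean_matrix(exp, sample):
    """Prefix each barcode with the sample name and transpose so genes are rows."""
    for index, row in exp.iterrows():
        col1 = row['Barcodes']
        col1 = col1[:16]  # keep the 16 bp barcode, dropping the trailing "-1"
        col1 = sample + col1  # prepend the sample identifier
        exp.at[index, 'Barcodes'] = col1
    exp = exp.transpose()  # gene names become the index, cells become columns
    new_header = exp.iloc[0]  # grab the first row (the barcodes) for the header
    exp = exp[1:]  # take the data less the header row
    exp.columns = new_header
    exp.reset_index(inplace=True)  # gene names move into a column named "index"
    exp = exp.rename(columns={'index': 'genes'})
    return exp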
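
As a possible alternative to the row-by-row loop, the same cleaning steps can be sketched with vectorized pandas operations. This is a minimal sketch, not the function used in this notebook: `clean_matrix_vectorized` is a hypothetical name, and the toy barcodes and gene names are made up. It assumes the input has one row per cell and a `Barcodes` column, as `clean_matrix` does. Because the barcodes are moved into the index before transposing, the count columns keep their numeric dtype rather than being converted to `object`.

```python
import pandas as pd


def clean_matrix_vectorized(exp, sample):
    """Hypothetical vectorized variant of clean_matrix (an illustration only)."""
    exp = exp.copy()  # leave the caller's dataframe untouched
    # Keep the first 16 characters of each barcode and prefix the sample name.
    exp['Barcodes'] = sample + exp['Barcodes'].str[:16]
    # Move the barcodes into the index before transposing so the count
    # columns stay numeric instead of becoming `object`.
    exp = exp.set_index('Barcodes').transpose()
    # Turn the gene names (now the index) into a regular column named "genes".
    exp = exp.rename_axis('genes').reset_index()
    return exp


# Toy input in the assumed layout: one row per cell, one column per gene.
toy = pd.DataFrame({'Barcodes': ['AAACCTGAGAAACCAT-1', 'AAACCTGAGAAACCGC-1'],
                    'geneA': [0, 3],
                    'geneB': [5, 1]})
cleaned = clean_matrix_vectorized(toy, 'samp1')
print(cleaned)
```

Whether this variant is enough to avoid memory problems on the full matrices has not been checked here.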

### Preinjury samples
The two preinjury files are processed and joined, then exported to a `csv.gz` file.
The two matrices are outer-joined on the gene name, so if a gene is in one sample but
not the other, the columns of the sample that lacks it are filled with `NaN` for that gene.

In [84]:
file = files[0]
sample = samples[0]
filepath = r'..\data\Hou_Data_raw' + file
samp1 = pd.read_csv(filepath)
samp1 = clean_matrix(samp1, sample)

file = files[1]
sample = samples[1]
filepath = r'..\data\Hou_Data_raw' + file
samp2 = pd.read_csv(filepath)
samp2 = clean_matrix(samp2, sample)

# Outer join keeps genes that appear in only one replicate.
preinjury = pd.merge(samp1, samp2, on='genes', how='outer')
preinjury.to_csv(r"..\data\Hou_expression_matrices\preinjury_merged.csv.gz",
                 index=False,
                 compression="gzip")
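
The `NaN` fill described for the join can be illustrated on toy data. This is only a small sketch: the two dataframes stand in for cleaned replicates, and the gene names and barcode column names are made up.

```python
import pandas as pd

# Toy stand-ins for two cleaned replicates; all names are made up.
rep1 = pd.DataFrame({'genes': ['geneA', 'geneB'], 'samp1AAAC': [1, 2]})
rep2 = pd.DataFrame({'genes': ['geneB', 'geneC'], 'samp2TTTG': [7, 8]})

# how='outer' keeps every gene found in either replicate; cells with no
# measurement for a gene are filled with NaN.
merged = pd.merge(rep1, rep2, on='genes', how='outer')
print(merged)
```

Here `geneA` has no value in the `samp2TTTG` column and `geneC` has none in `samp1AAAC`. Note that integer columns are converted to floating point once they contain `NaN`.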

### 1dpa samples
The two 1dpa files are processed and joined, then exported to a `csv.gz` file.
The two matrices are outer-joined on the gene name, so if a gene is in one sample but
not the other, the columns of the sample that lacks it are filled with `NaN` for that gene.

In [88]:
file = files[2]
sample = samples[2]
filepath = r'..\data\Hou_Data_raw' + file
samp1 = pd.read_csv(filepath)
samp1 = clean_matrix(samp1, sample)

file = files[3]
sample = samples[3]
filepath = r'..\data\Hou_Data_raw' + file
samp2 = pd.read_csv(filepath)
samp2 = clean_matrix(samp2, sample)

# Outer join keeps genes that appear in only one replicate.
onedpa = pd.merge(samp1, samp2, on='genes', how='outer')
onedpa.to_csv(r"..\data\Hou_expression_matrices\_1dpa_merged.csv.gz",
              index=False,
              compression="gzip")

### 2dpa samples
The two 2dpa files are processed and joined, then exported to a `csv.gz` file.
The two matrices are outer-joined on the gene name, so if a gene is in one sample but
not the other, the columns of the sample that lacks it are filled with `NaN` for that gene.

In [90]:
file = files[4]
sample = samples[4]
filepath = r'..\data\Hou_Data_raw' + file
samp1 = pd.read_csv(filepath)
samp1 = clean_matrix(samp1, sample)

file = files[5]
sample = samples[5]
filepath = r'..\data\Hou_Data_raw' + file
samp2 = pd.read_csv(filepath)
samp2 = clean_matrix(samp2, sample)

# Outer join keeps genes that appear in only one replicate.
twodpa = pd.merge(samp1, samp2, on='genes', how='outer')
twodpa.to_csv(r"..\data\Hou_expression_matrices\_2dpa_merged.csv.gz",
              index=False,
              compression="gzip")

MemoryError: Unable to allocate 543. MiB for an array with shape (21212, 3353) and data type object

### 4dpa samples
The two 4dpa files are processed and joined, then exported to a `csv.gz` file.
The two matrices are outer-joined on the gene name, so if a gene is in one sample but
not the other, the columns of the sample that lacks it are filled with `NaN` for that gene.

In [None]:
file = files[6]
sample = samples[6]
filepath = r'..\data\Hou_Data_raw' + file
samp1 = pd.read_csv(filepath)
samp1 = clean_matrix(samp1, sample)

file = files[7]
sample = samples[7]
filepath = r'..\data\Hou_Data_raw' + file
samp2 = pd.read_csv(filepath)
samp2 = clean_matrix(samp2, sample)

# Outer join keeps genes that appear in only one replicate.
fourdpa = pd.merge(samp1, samp2, on='genes', how='outer')
fourdpa.to_csv(r"..\data\Hou_expression_matrices\_4dpa_merged.csv.gz",
               index=False,
               compression="gzip")
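
The last TODO item, joining all files together, is not implemented in this notebook. One way it might be approached is to fold the same outer join over a list of dataframes. This is a minimal sketch on toy data: `join_all` is a hypothetical helper, the toy column names are made up, and whether the full join fits in memory on the real matrices is untested.

```python
from functools import reduce

import pandas as pd


def join_all(frames):
    """Outer-join a list of dataframes on their shared 'genes' column."""
    return reduce(
        lambda left, right: pd.merge(left, right, on='genes', how='outer'),
        frames,
    )


# Toy stand-ins for the four merged condition tables; all names are made up.
toy_frames = [
    pd.DataFrame({'genes': ['geneA', 'geneB'], 'samp1_cell': [1, 2]}),
    pd.DataFrame({'genes': ['geneA'], '1dpa1_cell': [3]}),
    pd.DataFrame({'genes': ['geneB'], '2dpa1_cell': [4]}),
    pd.DataFrame({'genes': ['geneA', 'geneC'], '4dpa1_cell': [5, 6]}),
]
combined = join_all(toy_frames)
print(combined)
```

Each gene ends up in a single row, with `NaN` wherever a condition has no measurement for it.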
