2. Izvrsiti prosirenje dostupne matrice podataka tako da svaki red u matrici bude zasebna celija (CellID) a da svaka kolona u matrici bude zasebni gen (GeneID). U svakom polju u matrici je potrebno da bude zapisan (MIDCount).
*MIDCount predstavlja ekspresiju datog gena u datoj celiji i sluzi za klasterovanje celija po slicnosti.
3. Na dobijenoj matrici je potrebno izvrsiti transformaciju tako da eksprimirani geni (MIDCount>=1) prodju kroz logaritamsku transofrmaciju kako bi se smanjila apsolutna udaljenost medju njima, dok oni geni koji nisu eksprimirani uopste (MIDCount=0) budu postavljeni na -1
*Obe tabele su bitne da ostanu sacuvane kako bi se algoritmi testirali na obe
4. Potrebno je izvrsiti redukciju dimenzionalnosti podataka tako da se konacni klasteri mogu predstaviti u 2D prostoru (nije dozvoljeno vise od 2 dimenzije)
5. Potrebno je isprobati vise razlicitih klasterizacionih algoritama (slucajevi u kojima celije ne pripadaju ni jednom klasteru su dozvoljene ali broj celija koje nisu dodeljene ni jednom klasteru ne sme da biti preveliki)

Postoje jos neke transformacije podataka koje mogu da unaprede konacne performanse klasterovanja ali one zahtevaju znanje iz genetike tako da mozemo da ih zanemarimo u ovom slucaju. Mozemo o njima da razgovaramo kasnije kada se prodju ovi pocetni koraci.

Sto se tice literature, radi se o oblasti genetske transkriptomike sa uracunatim pozicionim parametrima. Ova oblast je relativno nova i nema mnogo literature dostupne za nju. Kada smo mi radili zadatak nismo imali ni jedan naucni rad dostupan, tako da mozete da pogledate ako nesto mozete da nadjete, ali ja nemam ni jedan konkretan rad da vam posaljem.

Na kraju klasterovanja bilo bi dobro izvrsiti vizuelizaciju na osnovu dobijenih vrednosti nakon redukcije dimenzionalnosti u 2D prostoru. Takodje ukoliko zelite mozete da probate i da napravite jednu rekonstrukciju celije na osnovu podataka x i y koji su dati za svaku celiju u pocetnom skupu podataka.

In [3]:
import pandas as pd
import matplotlib as plt
import numpy as np
import seaborn as sb
import multiprocessing as mp
import os
import cupy as cp

In [2]:
data = pd.read_csv('smaller_dataset.csv', index_col=0)

In [None]:
data.reset_index(inplace=True, drop=True)
data.head()

In [4]:
data['CellID'] = data['CellID'].astype(np.uint16)
data['ExonCount'] = data['ExonCount'].astype(np.uint8)
data['MIDCount'] = data['MIDCount'].astype(np.uint8)
data['x'] = data['x'].astype(np.uint16)
data['y'] = data['y'].astype(np.uint16)

In [5]:
unique_cells = data['CellID'].unique()
unique_genes = data['geneID'].unique()

In [None]:
gpu_matrix = cp.zeros((len(unique_cells), len(unique_genes)), dtype=cp.uint8)
cell_indices = pd.factorize(data['CellID'])[0] 
gene_indices = pd.factorize(data['geneID'])[0]

gpu_cell_indices = cp.array(cell_indices, dtype=cp.uint16)
gpu_gene_indices = cp.array(gene_indices, dtype=cp.uint16)
gpu_midcount = cp.array(data['MIDCount'].values, dtype=cp.uint8)

gpu_matrix[gpu_cell_indices, gpu_gene_indices] = gpu_midcount

tmp_matrix = cp.asnumpy(gpu_matrix)

unloged_matrix = pd.DataFrame(tmp_matrix, index=unique_cells, columns=unique_genes, dtype=np.uint8)
unloged_matrix.to_csv('unloged_matrix.csv', index=False)

In [None]:
gpu_matrix = gpu_matrix.astype(cp.float32)
gpu_matrix[gpu_matrix == 0] = -1
gpu_matrix = cp.where(data == -1, -1, cp.log(gpu_matrix))

tmp_matrix = cp.asnumpy(gpu_matrix)
loged_matrix = pd.DataFrame(tmp_matrix, index=unique_cells, columns=unique_genes, dtype=np.float32)
loged_matrix.to_csv('loged_matrix.csv', index=False)

In [14]:
loged_matrix = pd.read_csv('unloged_matrix.csv', index_col=0)
loged_matrix = pd.DataFrame(np.where(loged_matrix == 0, -1, np.log(loged_matrix)), columns=loged_matrix.columns)

  result = func(self.values, **kwargs)


In [15]:
loged_matrix

Unnamed: 0,Zfy2,Asmt,CAAA01118383.1,Gm21887,Gm15247,Gm15261,Gm15232,Gemin8,Asb9,Asb11,...,Gm5524,Tmem127,Gstz1,Atp13a4,Bub1,Mrps5,Adra2b,Gatm,Gpx3,Gm42418
0,-1.0,-1.0,-1.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.000000
1,-1.0,-1.0,-1.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.000000
2,-1.0,-1.0,-1.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.693147
3,-1.0,-1.0,-1.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.000000
4,-1.0,-1.0,-1.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,-1.0,-1.0,0.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.000000
96,-1.0,-1.0,0.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,-1.000000
97,-1.0,-1.0,0.693147,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.000000
98,-1.0,-1.0,0.000000,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.000000,0.000000
