<a href="https://colab.research.google.com/github/alopezgar/Reprogramming-aging/blob/main/Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install anndata scanpy

Collecting anndata
  Downloading anndata-0.8.0-py3-none-any.whl (96 kB)
Collecting scanpy
  Downloading scanpy-1.8.2-py3-none-any.whl (2.0 MB)
Collecting umap-learn>=0.3.10
  Downloading umap-learn-0.5.2.tar.gz (86 kB)
Collecting sinfo
  Downloading sinfo-0.3.4.tar.gz (24 kB)
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.6.tar.gz (1.1 MB)
Collecting stdlib_list
  Downloading stdlib_list-0.8.0-py3-none-any.whl (63 kB)
Building wheels for collected packages: umap-learn, pynndescent, sinfo
  Building wheel for umap-learn (setup.py) ... done
  Created wheel for umap-learn: filename=umap_learn-0.5.2-py3-none-any.whl size=82708 sha256=94ffee6cc1e10d34305000bf0b52

In [None]:
import os
import scanpy as sc
import anndata
from urllib.request import urlretrieve
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# **Main ideas**

---



Cell states and genes:
* Dual transgene-positive cells (Tg+).
* Single-transgene or transgene-negative cells (Tg-), used as controls.
* Transient reprogramming (Aged Tg+).
* Control adipogenic cells are divided into Hoxc10+ and Hoxc10- subsets.
* MSX1 is the protein encoded by the Msx1 gene.
* Adipogenic states are marked by Lpl.
* Secretory states are marked by Rspo1.
* Reprogramming-specific states are marked by Snca and Nanog.
* Mesenchymal identity genes: Thy1 and Col1a1.

Concepts:

* RNA velocity infers a rate of change for each gene from the relative ratio of spliced and unspliced transcripts, under the assumption that unspliced transcripts represent nascent (newly transcribed) messages.
* Normalized data should be log(x+1)-transformed for use with downstream analysis methods that assume normally distributed data.
* scVI is used to reduce noise.
* UMAP is used to summarize the data in two dimensions for visualization.
* Analysis methods are divided into cell-level and gene-level approaches. Cell-level approaches are further subdivided into cluster analysis and trajectory analysis, each of which also includes gene-level methods.
* Batch effects can occur when cells are handled in distinct groups: between groups of cells in one experiment, between experiments performed in the same laboratory, or between datasets from different laboratories.
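
The RNA velocity idea in the list above can be illustrated with a toy steady-state model. This is only a minimal sketch, not the method used to produce these datasets: it assumes a single gene, invented spliced/unspliced counts, and the simplified relation velocity = u - gamma * s, with gamma fit by least squares through the origin. The names `spliced`, `unspliced`, `gamma` and `velocity` are hypothetical.

```python
import numpy as np

# Toy spliced (s) and unspliced (u) counts for one gene across six cells.
# These numbers are invented for illustration only.
spliced = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
unspliced = np.array([2.0, 4.0, 9.0, 8.0, 10.0, 9.0])

# Steady-state assumption: u is roughly gamma * s. Fit gamma by least
# squares through the origin: gamma = (s . u) / (s . s).
gamma = float(spliced @ unspliced / (spliced @ spliced))

# Velocity per cell: positive means more unspliced RNA than expected at
# steady state (gene being up-regulated), negative means down-regulation.
velocity = unspliced - gamma * spliced

print(f"gamma = {gamma:.3f}")
print("velocity:", np.round(velocity, 2))
```

Cells with a positive value sit above the fitted steady-state line, cells with a negative value sit below it.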

https://www.embopress.org/doi/full/10.15252/msb.20188746

https://web-frontend-git-analysis-guide-contents-10x-genomics.vercel.app/resources/analysis-guides/continuing-your-journey-after-running-cell-ranger

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices

https://reprog.research.calicolabs.com/data

# **1. Murine myogenic cell Msx1 reprogramming**

---



In [None]:
# load the file using anndata
adata_myo = anndata.read_h5ad("/content/drive/MyDrive/Colab Notebooks/myo_msx1.h5ad")

In [None]:
adata_myo

AnnData object with n_obs × n_vars = 1186 × 22767
    obs: 'SMP', '10x_lane', 'sample', 'n_counts', 'cell_containing_barcode', 'batch', 'n_genes', 'perc_mito', 'perc_rrna', 'age', 'sample_treatment', 'age_treatment', 'leiden', 'experiment', '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean', '_scvi_local_l_var', 'latent_library_size', 'root', 'dpt_pseudotime', 'tx_exp', 'myogenesis'
    uns: '_scvi', 'diffmap_evals', 'experiment_colors', 'iroot', 'leiden', 'leiden_colors', 'neighbors', 'root_colors', 'sample_treatment_colors', 'tx_exp_colors', 'umap'
    obsm: 'X_diffmap', 'X_pca', 'X_pca_harmony', 'X_scvi', 'X_umap', 'X_umap_scvi'
    layers: 'counts', 'ind_scale', 'ind_scale_bc', 'log1p_cpm', 'scvi_normalized', 'scvi_normalized_bc'
    obsp: 'connectivities', 'distances'

## **1.1 obs**

---
Observations (cells): key-indexed one-dimensional annotations, one row per cell.

[Sparse matrix](https://media.geeksforgeeks.org/wp-content/uploads/Sparse-Matrix-Array-Representation1.png)

In [None]:
# inspect the covariates
adata_myo.obs.head()

Unnamed: 0,SMP,10x_lane,sample,n_counts,cell_containing_barcode,batch,n_genes,perc_mito,perc_rrna,age,...,experiment,_scvi_batch,_scvi_labels,_scvi_local_l_mean,_scvi_local_l_var,latent_library_size,root,dpt_pseudotime,tx_exp,myogenesis
TCACGCTTCAAGTCGT-6,2FES,L7,LungMyo_Aged_Tg+,61001.0,True,0,7476,0.017213,0.000836,Aged,...,0,0,0,9.749166,0.298148,0.912073,False,0.439934,Tg+ (0),0.880872
TCTAACTTCATTACTC-6,2FES,L7,LungMyo_Aged_Tg+,52953.0,True,0,7338,0.022303,0.00085,Aged,...,0,0,0,9.749166,0.298148,1.900635,False,0.443646,Tg+ (0),0.877299
TACAACGGTCATTCCC-6,2FES,L7,LungMyo_Aged_Tg+,50127.0,True,0,7324,0.024418,0.001436,Aged,...,0,0,0,9.749166,0.298148,1.732892,False,0.503578,Tg+ (0),1.021866
GATCGTAAGTGAGCCA-6,2FES,L7,LungMyo_Aged_Tg+,49118.0,True,0,6821,0.054115,0.001222,Aged,...,0,0,0,9.749166,0.298148,1.586143,False,0.43191,Tg+ (0),0.777569
AGCCACGAGTTACGAA-6,2FES,L7,LungMyo_Aged_Tg+,45686.0,True,0,7430,0.027755,0.001051,Aged,...,0,0,0,9.749166,0.298148,1.431968,False,0.493813,Tg+ (0),1.244467


In [None]:
adata_myo.n_obs

1186

In [None]:
# Collect the unique values of each obs column to see what every covariate contains
columns_obs = adata_myo.obs.columns
data_obs = {variable: adata_myo.obs[variable].unique() for variable in columns_obs}

data_obs

{'10x_lane': ['L7', 'L8']
 Categories (2, object): ['L7', 'L8'],
 'SMP': ['2FES', '2FEU', 'L7_AdipoA_Msx1p', 'L8_AdipoA_Msx1n']
 Categories (4, object): ['2FES', '2FEU', 'L7_AdipoA_Msx1p', 'L8_AdipoA_Msx1n'],
 '_scvi_batch': [0, 1]
 Categories (2, int64): [0, 1],
 '_scvi_labels': array([0], dtype=int8),
 '_scvi_local_l_mean': array([9.74916649, 9.99405193]),
 '_scvi_local_l_var': array([0.29814765, 0.58767927]),
 'age': ['Aged']
 Categories (1, object): ['Aged'],
 'age_treatment': ['nan', 'Aged Tg+', 'Aged Tg-']
 Categories (3, object): ['Aged Tg+', 'Aged Tg-', 'nan'],
 'batch': ['0', '2', '3']
 Categories (3, object): ['0', '2', '3'],
 'cell_containing_barcode': array([ True]),
 'dpt_pseudotime': array([0.4399345 , 0.44364572, 0.50357825, ..., 0.58427095, 0.75753784,
        0.56796664], dtype=float32),
 'experiment': ['0', '1']
 Categories (2, object): ['0', '1'],
 'latent_library_size': array([0.91207343, 1.9006348 , 1.732892  , ..., 3.5138807 , 2.3285398 ,
        3.3326032 ], dtyp

In [None]:
# Number of cells in each age / treatment group
adata_myo.obs.groupby(["age", "age_treatment"]).size()

age   age_treatment
Aged  Aged Tg+         298
      Aged Tg-         515
      nan              373
dtype: int64

In [None]:
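# Cell barcodes (index of obs). The output shows Cell_i labels, presumably because
# the renaming cell in section 1.7 was run before this one.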
adata_myo.obs_names

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9',
       ...
       'Cell_1176', 'Cell_1177', 'Cell_1178', 'Cell_1179', 'Cell_1180',
       'Cell_1181', 'Cell_1182', 'Cell_1183', 'Cell_1184', 'Cell_1185'],
      dtype='object', length=1186)

Questions about obs:
* **10x_lane**: is this the per-sample doublet probability of each cell? What do L7 and L8 mean?
* **SMP**: what is SMP? What do Msx1n, Msx1p, 2FEU and 2FES mean?
* **n_counts**: is this the number of expressed genes per cell?
* **perc_mito**: is this the percentage of mitochondrial mRNA?
* **perc_rrna**: is this the percentage of ribosomal RNA?
* **root**: what does it mean, and what do True and False indicate?
* **experiment**: what are experiments 0 and 1?
* **myogenesis**: is this a measure of myogenesis?
* **tx_exp**: what is it?

Variable descriptions:
* **Barcode**: the term "barcode" is used here instead of "cell" because all reads assigned to the same barcode may not come from the same cell. A barcode may mistakenly tag multiple cells (doublet) or no cell at all (empty droplet/well).
* **cell_containing_barcode**: True if the barcode is considered to contain a cell.
* **batch**: batch identifier (what do batches 0, 2 and 3 correspond to?).
* **n_genes**: number of genes detected in the cell.
* **_scvi_batch**: scVI batch index, mapped from `experiment` (see `uns['_scvi']`).
* **_scvi_labels**: scVI label index (a single label, 0).
* **_scvi_local_l_mean**
* **_scvi_local_l_var**
* **age**: Aged (adult).
* **age_treatment**: Aged Tg+ (adult reprogrammed cell), Aged Tg- (adult control cell), nan (not known).
* **dpt_pseudotime**: diffusion pseudotime, an estimate of the temporal order of differentiating cells in single-cell RNA-seq (scRNA-seq) data.
* **latent_library_size**: the library size is the total number of reads sequenced for a cell (the sum of counts across all genes). The latent library size is the scaling factor by which the raw counts of a cell are divided to obtain normalized expression values.
* **leiden**: cluster assignment from the Leiden clustering algorithm.
* **sample_treatment**: combination of cell type and treatment.
* **sample**: cell type.
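
Several of the questions above can be checked directly against a count matrix. The sketch below uses a small made-up matrix rather than `adata_myo`, and it assumes the usual conventions (total counts per cell for `n_counts`, number of detected genes for `n_genes`, fraction of counts from mitochondrial genes for `perc_mito`). Whether the columns in this file were computed exactly this way is not confirmed by the file itself.

```python
import numpy as np

# Hypothetical counts: 3 cells (rows) x 5 genes (columns).
gene_names = np.array(["Actb", "Myod1", "mt-Co1", "Msx1", "mt-Nd1"])
counts = np.array([
    [5, 0, 2, 1, 2],
    [0, 3, 0, 0, 1],
    [7, 1, 4, 0, 0],
])

# Total counts per cell (sum over genes).
n_counts = counts.sum(axis=1)

# Number of genes with at least one count in each cell.
n_genes = (counts > 0).sum(axis=1)

# Fraction of each cell's counts that come from mitochondrial genes,
# identified here by the mouse "mt-" name prefix.
is_mito = np.char.startswith(gene_names, "mt-")
perc_mito = counts[:, is_mito].sum(axis=1) / n_counts

print("n_counts :", n_counts)
print("n_genes  :", n_genes)
print("perc_mito:", np.round(perc_mito, 3))
```

Applying the same sums to the `counts` layer and comparing the result with the `n_counts` and `n_genes` columns would be one way to settle what those columns hold.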





## **1.2 layers**

---

Key-indexed multi-dimensional arrays aligned to dimensions of X

In [None]:
adata_myo.layers.keys()

KeysView(Layers with keys: counts, ind_scale, ind_scale_bc, log1p_cpm, scvi_normalized, scvi_normalized_bc)

In [None]:
for key, value in adata_myo.layers.items():
  print(f"{key} = {value}")

counts =   (0, 4)	1.0
  (0, 6)	3.0
  (0, 7)	4.0
  (0, 8)	5.0
  (0, 9)	5.0
  (0, 10)	4.0
  (0, 13)	3.0
  (0, 14)	5.0
  (0, 16)	5.0
  (0, 19)	2.0
  (0, 23)	1.0
  (0, 29)	1.0
  (0, 30)	2.0
  (0, 32)	8.0
  (0, 33)	36.0
  (0, 35)	17.0
  (0, 37)	1.0
  (0, 40)	1.0
  (0, 41)	1.0
  (0, 42)	1.0
  (0, 46)	2.0
  (0, 49)	2.0
  (0, 52)	12.0
  (0, 53)	5.0
  (0, 55)	27.0
  :	:
  (1185, 291)	1.0
  (1185, 572)	1.0
  (1185, 231)	1.0
  (1185, 570)	4.0
  (1185, 229)	1.0
  (1185, 984)	1.0
  (1185, 238)	3.0
  (1185, 218)	2.0
  (1185, 33)	1.0
  (1185, 207)	2.0
  (1185, 708)	4.0
  (1185, 671)	1.0
  (1185, 114)	1.0
  (1185, 943)	1.0
  (1185, 787)	1.0
  (1185, 471)	1.0
  (1185, 627)	1.0
  (1185, 827)	2.0
  (1185, 124)	1.0
  (1185, 898)	4.0
  (1185, 163)	1.0
  (1185, 864)	3.0
  (1185, 1052)	1.0
  (1185, 693)	4.0
  (1185, 1268)	2.0
ind_scale = [[0.14901832 1.         0.         ... 0.38620183 0.21009685 0.60787666]
 [0.25139663 0.886217   0.1132042  ... 0.50360966 0.21116309 0.64177805]
 [0.11737712 0.8052095  0.1


Questions about layers:
* What does the structure of the `counts` output mean? For example, `(0, 4) 1.0` reads as `(row, column) value`.
* What is ind_scale, and what is the difference between ind_scale and ind_scale_bc?
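
The `(row, column) value` reading of the `counts` printout can be reproduced on a tiny example. This is a sketch with an invented matrix. It assumes the layer is a SciPy sparse matrix, which is what the printed format suggests.

```python
import numpy as np
from scipy import sparse

# A small dense matrix that is mostly zeros (rows = cells, columns = genes).
dense = np.array([
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

# Compressed sparse row storage keeps only the non-zero entries.
matrix = sparse.csr_matrix(dense)

# Convert to coordinate format to list each stored entry explicitly.
coo = matrix.tocoo()
entries = [(int(r), int(c), float(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for row, col, value in entries:
    print(f"(cell {row}, gene {col}) -> {value}")

# Converting back to dense recovers the original array, zeros included.
assert np.array_equal(matrix.toarray(), dense)
```

Only non-zero entries are stored, which is why the printout skips most gene indices for each cell.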

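Variable descriptions:
* **counts**: raw counts, stored as a sparse matrix (this is the layer registered as scVI's input `X` in `uns['_scvi']`).
* **ind_scale**
* **ind_scale_bc**
* **log1p_cpm**: normalized data, log(x+1)-transformed counts per million.
* **scvi_normalized**: expression values denoised with scVI.
* **scvi_normalized_bc**: expression values denoised with scVI.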

## **1.3 uns**

---

Unstructured annotation. Key-indexed.

In [None]:
adata_myo.uns

OverloadedDict, wrapping:
	{'_scvi': {'categorical_mappings': {'_scvi_batch': {'mapping': array(['0', '1'], dtype=object), 'original_key': 'experiment'}, '_scvi_labels': {'mapping': array([0]), 'original_key': '_scvi_labels'}}, 'data_registry': {'X': {'attr_key': 'counts', 'attr_name': 'layers'}, 'batch_indices': {'attr_key': '_scvi_batch', 'attr_name': 'obs'}, 'labels': {'attr_key': '_scvi_labels', 'attr_name': 'obs'}, 'local_l_mean': {'attr_key': '_scvi_local_l_mean', 'attr_name': 'obs'}, 'local_l_var': {'attr_key': '_scvi_local_l_var', 'attr_name': 'obs'}}, 'scvi_version': '0.8.1', 'summary_stats': {'n_batch': 2, 'n_cells': 1186, 'n_labels': 1, 'n_proteins': 0, 'n_vars': 22767}}, 'diffmap_evals': array([1.        , 0.9509746 , 0.9205005 , 0.90304685, 0.87337786,
       0.8671872 , 0.8444139 , 0.8431211 , 0.82637614, 0.8173035 ,
       0.81278545, 0.7903669 , 0.7812581 , 0.77489287, 0.77078325],
      dtype=float32), 'experiment_colors': array(['#95d0fc', '#ad8150'], dtype=object), '

scvi
* scvi_version = 0.8.1
* _scvi_batch = 0, 1 -> original key = experiment
* _scvi_labels = 0 -> original key = _scvi_labels

data registry
* X: attribute key = counts, attribute name = layers
* batch_indices: attribute key = _scvi_batch, attribute name = obs
* labels: attribute key = _scvi_labels, attribute name = obs
* local_l_mean: attribute key = _scvi_local_l_mean, attribute name = obs
* local_l_var: attribute key = _scvi_local_l_var, attribute name = obs

summary stats
* n_batch = 2
* n_cells = 1186
* n_vars = 22767
* n_labels = 1
* n_proteins = 0


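Variable descriptions:
* **_scvi**: scVI setup metadata (see above).
* **diffmap_evals**: eigenvalues of the diffusion map.
* **experiment_colors**: plotting colors for `experiment`.
* **iroot**: index of the root cell used for diffusion pseudotime.
* **leiden**: parameters of the Leiden clustering.
* **leiden_colors**: plotting colors for the Leiden clusters.
* **neighbors**: parameters of the nearest-neighbor graph, which UMAP and Leiden clustering build on.
* **root_colors**: plotting colors for `root`.
* **sample_treatment_colors**: plotting colors for `sample_treatment`.
* **tx_exp_colors**: plotting colors for `tx_exp`.
* **umap**: parameters of the UMAP embedding, which is used for visualization by reducing the data to two dimensions.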

## **1.4 obsm**

---

Key-indexed multi-dimensional annotations of observations.

In [None]:
adata_myo.obsm
#https://www.sharpsightlabs.com/blog/numpy-axes-explained/

AxisArrays with keys: X_diffmap, X_pca, X_pca_harmony, X_scvi, X_umap, X_umap_scvi

Variable descriptions:
* **X_diffmap**: diffusion map coordinates.
* **X_pca**: PCA coordinates.
* **X_pca_harmony**: PCA coordinates after Harmony integration.
* **X_scvi**: scVI latent representation.
* **X_umap**: UMAP coordinates.
* **X_umap_scvi**: UMAP coordinates computed from the scVI representation.

## **1.5 obsp**

---

Pairwise annotations of observations.

In [None]:
adata_myo.obsp

PairwiseArrays with keys: connectivities, distances

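Variable descriptions (neighbor graph used by UMAP):
* **connectivities**: weighted adjacency matrix of the cell-cell neighbor graph.
* **distances**: distances between each cell and its nearest neighbors.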
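
To make the two `obsp` matrices concrete, here is a minimal k-nearest-neighbor sketch in plain NumPy. It is an illustration under stated assumptions, not how the matrices in this file were built: the points and `k` are invented, and the connectivities here are simply 0/1, whereas the stored matrix may hold graded weights.

```python
import numpy as np

# Hypothetical 2-D embedding of five cells (e.g. two principal components).
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.2],
    [5.0, 5.0],
    [5.1, 5.0],
])
k = 2  # number of neighbors kept per cell (illustrative choice)

# All pairwise Euclidean distances, shape (n_cells, n_cells).
diff = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diff ** 2).sum(axis=-1))

# For each cell, find its k nearest neighbors, excluding itself.
masked = pairwise.copy()
np.fill_diagonal(masked, np.inf)
nearest = np.argsort(masked, axis=1)[:, :k]

# "distances": distance to each kept neighbor, zero elsewhere.
# "connectivities": 1 where two cells are linked (symmetrized), else 0.
n_cells = len(points)
distances = np.zeros((n_cells, n_cells))
rows = np.repeat(np.arange(n_cells), k)
cols = nearest.ravel()
distances[rows, cols] = pairwise[rows, cols]
connectivities = ((distances > 0) | (distances.T > 0)).astype(float)

print(connectivities)
```

Both matrices are n_cells × n_cells, which matches the "pairwise" description of `obsp`.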

## **1.6 var**

---

Variables (genes): key-indexed one-dimensional annotations of the variables/features, stored as a DataFrame with one row per gene.

In [None]:
adata_myo.var.head()

Ifi202b
Serpinb8
Fasl
Aox1
Cnnm3


In [None]:
# Collect the unique values of each var column (empty here: var has no columns)
columns_var = adata_myo.var.columns
data_var = {variable: adata_myo.var[variable].unique() for variable in columns_var}

data_var

{}

In [None]:
adata_myo.n_vars

22767

In [None]:
adata_myo.var_names

Index(['Ifi202b', 'Serpinb8', 'Fasl', 'Aox1', 'Cnnm3', 'Lman2l', 'Ankrd39',
       'Usf1', 'Rrp15', 'Ddx18',
       ...
       'Gm28930', 'Gm28919', 'Gm20814', 'Gm29276', 'Gm28510', 'Gm28588',
       'Gm29866', 'Gm37440', 'Gm31571', 'Gm33815'],
      dtype='object', length=22767)

## **1.7 Combined view**

---



In [None]:
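# Replace barcodes and gene symbols with generic labels.
# Note: this overwrites the original names in place.
adata_myo.obs_names = [f"Cell_{i}" for i in range(adata_myo.n_obs)]
adata_myo.var_names = [f"Gene_{i}" for i in range(adata_myo.n_vars)]
print(adata_myo.obs_names[:10])
print(adata_myo.var_names[:10])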

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')
Index(['Gene_0', 'Gene_1', 'Gene_2', 'Gene_3', 'Gene_4', 'Gene_5', 'Gene_6',
       'Gene_7', 'Gene_8', 'Gene_9'],
      dtype='object')


In [None]:
adata_myo.to_df(layer = "counts")

Unnamed: 0,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_22757,Gene_22758,Gene_22759,Gene_22760,Gene_22761,Gene_22762,Gene_22763,Gene_22764,Gene_22765,Gene_22766
Cell_0,0.0,0.0,0.0,0.0,1.0,0.0,3.0,4.0,5.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_1,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,6.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,2.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_4,0.0,0.0,0.0,0.0,2.0,1.0,0.0,5.0,4.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_1181,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_1182,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_1183,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_1184,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# **2. Summary questions about Murine myogenic cell Msx1 reprogramming**

---



Questions about obs:
* **10x_lane**: is this the per-sample doublet probability of each cell? What do L7 and L8 mean?
* **SMP**: what is SMP? What do Msx1n, Msx1p, 2FEU and 2FES mean?
* **n_counts**: is this the number of expressed genes per cell?
* **perc_mito**: is this the percentage of mitochondrial mRNA?
* **perc_rrna**: is this the percentage of ribosomal RNA?
* **root**: what does it mean, and what do True and False indicate?
* **experiment**: what are experiments 0 and 1?
* **myogenesis**: is this a measure of myogenesis?
* **tx_exp**: what is it?

Questions about layers:
* What does the structure of the `counts` output mean? Does `(0, 4) 1.0` read as `(row, column) value`?
* What is ind_scale, and what is the difference between ind_scale and ind_scale_bc?

# **3. Murine adipogenic cell Yamanaka Factor screen**

---



In [None]:
adata_adipo = anndata.read_h5ad("/content/drive/MyDrive/Colab Notebooks/adipo_screen.h5ad")

In [None]:
adata_adipo

AnnData object with n_obs × n_vars = 9880 × 28701
    obs: 'SMP', '10x_lane', 'sample', 'n_counts', 'cell_containing_barcode', 'batch', 'n_genes', 'perc_mito', 'perc_rrna', 'perc_tg', 'age', 'experiment', 'log10_n_counts', 'log10_n_genes', 'leiden', 'combination', 'combination_short', '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean', '_scvi_local_l_var', 'latent_library_size', 'combination_order', 'umap_density_combination_order', 'age_combination_short'
    var: 'gene_id_orig', 'gene_id', 'gene_name', 'contig', 'gene_biotype', 'n_splice_sites', 'length', 'pseudogene_status', 'transcript_support_level', 'mitochondrial', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: '_scvi', 'age_colors', 'combination_colors', 'combination_short_colors', 'experiment_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'sample_colors', 'umap', 'umap_density_combination_order_params'
    obsm: 'X_pca', 'X_scvi', 'X_umap'
    varm: 'PCs'
    layers

## **3.1 Concepts**

---



* Different pools of Yamanaka factors
* Adipocytes
* Young vs. aged

## **3.2 obs**

---
Observations (cells): key-indexed one-dimensional annotations, one row per cell.

[Sparse matrix](https://media.geeksforgeeks.org/wp-content/uploads/Sparse-Matrix-Array-Representation1.png)

In [None]:
# inspect the covariates
adata_adipo.obs.head()

Unnamed: 0,SMP,10x_lane,sample,n_counts,cell_containing_barcode,batch,n_genes,perc_mito,perc_rrna,perc_tg,...,combination,combination_short,_scvi_batch,_scvi_labels,_scvi_local_l_mean,_scvi_local_l_var,latent_library_size,combination_order,umap_density_combination_order,age_combination_short
CTAACTTAGACCCGCT-0,L1_AdipoY_pool1,L1,L1_AdipoY_pool1,250565.0,True,0,10404,0.024098,0.004234,1.6e-05,...,Oct4_Klf4,OK,0,0,10.256632,1.021739,1.177346,2 Factors,0.44999,Young_OK
TACCTCGGTACCTGTA-0,L1_AdipoY_pool1,L1,L1_AdipoY_pool1,192970.0,True,0,10470,0.021496,0.003306,0.000269,...,Sox2_Klf4_Myc,SKM,0,0,10.256632,1.021739,1.60133,3 Factors,0.647961,Young_SKM
GCCCGAAAGTCTTCCC-0,L1_AdipoY_pool1,L1,L1_AdipoY_pool1,177708.0,True,0,9541,0.017613,0.005019,0.001356,...,Sox2_Oct4_Klf4_Myc,SOKM,0,0,10.256632,1.021739,0.840399,4 Factors,0.858543,Young_SOKM
CATGCCTGTCATGACT-0,L1_AdipoY_pool1,L1,L1_AdipoY_pool1,176202.0,True,0,9595,0.025584,0.007032,7.9e-05,...,Sox2_Oct4_Klf4,SOK,0,0,10.256632,1.021739,0.79729,3 Factors,0.367654,Young_SOK
ACTGCAATCACTGTCC-0,L1_AdipoY_pool1,L1,L1_AdipoY_pool1,169552.0,True,0,9194,0.016632,0.002566,6.5e-05,...,Oct4_Myc,OM,0,0,10.256632,1.021739,2.020094,2 Factors,0.504523,Young_OM


In [None]:
adata_adipo.n_obs

9880

In [None]:
# Collect the unique values of each obs column to see what every covariate contains
columns_obs = adata_adipo.obs.columns
data_obs = {variable: adata_adipo.obs[variable].unique() for variable in columns_obs}

data_obs

{'10x_lane': ['L1', 'L2', 'L3', 'L4']
 Categories (4, object): ['L1', 'L2', 'L3', 'L4'],
 'SMP': ['L1_AdipoY_pool1', 'L2_AdipoA_pool1', 'L3_AdipoY_pool2', 'L4_AdipoA_pool2']
 Categories (4, object): ['L1_AdipoY_pool1', 'L2_AdipoA_pool1', 'L3_AdipoY_pool2', 'L4_AdipoA_pool2'],
 '_scvi_batch': array([0, 1], dtype=int8),
 '_scvi_labels': array([0], dtype=int8),
 '_scvi_local_l_mean': array([10.25663185, 10.76336384]),
 '_scvi_local_l_var': array([1.02173865, 0.28201589]),
 'age': ['Young', 'Aged']
 Categories (2, object): ['Young', 'Aged'],
 'age_combination_short': ['Young_OK', 'Young_SKM', 'Young_SOKM', 'Young_SOK', 'Young_OM', ..., 'Aged_SM', 'Aged_K', 'Aged_KM', 'Aged_M', 'Aged_SK']
 Length: 32
 Categories (32, object): ['Young_NT', 'Young_S', 'Young_O', 'Young_K', ..., 'Aged_SOM', 'Aged_SKM',
                           'Aged_OKM', 'Aged_SOKM'],
 'batch': ['0', '1', '2', '3']
 Categories (4, object): ['0', '1', '2', '3'],
 'cell_containing_barcode': array([ True]),
 'combination': ['O

In [None]:
# Cell barcodes (index of obs)
adata_adipo.obs_names

Index(['CTAACTTAGACCCGCT-0', 'TACCTCGGTACCTGTA-0', 'GCCCGAAAGTCTTCCC-0',
       'CATGCCTGTCATGACT-0', 'ACTGCAATCACTGTCC-0', 'GGAGGATAGTGATTCC-0',
       'CGTCAAATCCTTGGAA-0', 'TTCCTAACACTAACGT-0', 'TAAGCACCACGTAGAG-0',
       'GCGTTTCCAGGTCCCA-0',
       ...
       'CTTTCGGGTATTGCCA-3', 'GTCACTCAGAGGTTTA-3', 'TCGTGGGCAGGCTACC-3',
       'CAACAGTTCGATGCAT-3', 'AGGATAACAACACAGG-3', 'CTAGGTAAGCTCACTA-3',
       'ACATTTCCAGCTCATA-3', 'TAGCACATCCATAGAC-3', 'GGATGTTCAGCAGGAT-3',
       'GACTCAAGTCCAATCA-3'],
      dtype='object', length=9880)

Questions about obs:
* What are Adipo**A** and Adipo**Y**?
* What do L1, L2, L3 and L4 mean?
* Which factors are in pool1 and pool2 (sample)?
* What are categories 0, 1, 2, 3?
* What are batches 0, 1, 2, 3?
* What are experiments 1 and 2?
* What is perc_tg?




## **3.3 layers**

---

Key-indexed multi-dimensional arrays aligned to dimensions of X

In [None]:
adata_adipo.layers.keys()

KeysView(Layers with keys: counts, log1p_cpm, log1p_scvi)

In [None]:
for key, value in adata_adipo.layers.items():
  print(f"{key} = {value}")

counts =   (0, 6)	2.0
  (0, 7)	24.0
  (0, 8)	4.0
  (0, 10)	57.0
  (0, 12)	1.0
  (0, 13)	9.0
  (0, 16)	30.0
  (0, 17)	1.0
  (0, 18)	21.0
  (0, 20)	10.0
  (0, 24)	39.0
  (0, 28)	4.0
  (0, 29)	6.0
  (0, 30)	2.0
  (0, 32)	1.0
  (0, 33)	37.0
  (0, 37)	26.0
  (0, 38)	12.0
  (0, 39)	18.0
  (0, 40)	1.0
  (0, 41)	3.0
  (0, 42)	1.0
  (0, 45)	8.0
  (0, 49)	6.0
  (0, 51)	24.0
  :	:
  (9879, 28038)	13.0
  (9879, 28045)	4.0
  (9879, 28187)	1.0
  (9879, 28198)	3.0
  (9879, 28204)	11.0
  (9879, 28305)	203.0
  (9879, 28323)	2.0
  (9879, 28332)	2.0
  (9879, 28368)	2.0
  (9879, 28391)	5.0
  (9879, 28392)	1.0
  (9879, 28414)	1.0
  (9879, 28437)	3.0
  (9879, 28463)	1.0
  (9879, 28479)	1.0
  (9879, 28496)	9.0
  (9879, 28509)	1.0
  (9879, 28511)	4.0
  (9879, 28578)	4.0
  (9879, 28590)	3.0
  (9879, 28595)	1.0
  (9879, 28603)	1.0
  (9879, 28606)	1.0
  (9879, 28690)	8.0
  (9879, 28700)	6.0
log1p_cpm = [[2.8415179e-01 3.5867538e-02 7.5495237e-01 ... 2.2954051e+00
  3.3342636e+00 4.4833970e+00]
 [1.0783092e+00 8.


Concepts:
* log1p: X = log(X + 1)
* CPM (counts per million) normalization divides each count by the cell's total count (library size) and multiplies the result by one million.
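
The two definitions above can be combined in a few lines. This is a sketch on an invented count matrix, not the code that produced the `log1p_cpm` layer. It assumes the natural logarithm, which is what `np.log1p` computes.

```python
import numpy as np

# Hypothetical raw counts: 2 cells (rows) x 4 genes (columns).
counts = np.array([
    [10.0, 0.0, 30.0, 60.0],
    [1.0, 1.0, 0.0, 2.0],
])

# Library size = total counts per cell; keepdims allows row-wise division.
library_size = counts.sum(axis=1, keepdims=True)

# Counts per million: scale each cell so its counts sum to 1e6.
cpm = counts / library_size * 1e6

# log1p computes log(x + 1), which keeps zeros at zero.
log1p_cpm = np.log1p(cpm)

print(cpm)
print(np.round(log1p_cpm, 3))
```

After CPM scaling, cells with very different sequencing depth (100 vs. 4 total counts here) become directly comparable.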

## **3.4 uns**

---

Unstructured annotation. Key-indexed.

In [None]:
adata_adipo.uns

OverloadedDict, wrapping:
	{'_scvi': {'categorical_mappings': {'_scvi_batch': {'mapping': array(['exp 1', 'exp 2'], dtype=object), 'original_key': 'experiment'}, '_scvi_labels': {'mapping': array([0]), 'original_key': '_scvi_labels'}}, 'data_registry': {'X': {'attr_key': 'counts', 'attr_name': 'layers'}, 'batch_indices': {'attr_key': '_scvi_batch', 'attr_name': 'obs'}, 'labels': {'attr_key': '_scvi_labels', 'attr_name': 'obs'}, 'local_l_mean': {'attr_key': '_scvi_local_l_mean', 'attr_name': 'obs'}, 'local_l_var': {'attr_key': '_scvi_local_l_var', 'attr_name': 'obs'}}, 'scvi_version': '0.8.1', 'summary_stats': {'n_batch': 2, 'n_cells': 11373, 'n_labels': 1, 'n_proteins': 0, 'n_vars': 28701}}, 'age_colors': array(['#1f77b4', '#ff7f0e'], dtype=object), 'combination_colors': array(['#1f77b4', '#ff7f0e', '#279e68', '#d62728', '#aa40fc', '#8c564b',
       '#e377c2', '#b5bd61', '#17becf', '#aec7e8', '#ffbb78', '#98df8a',
       '#ff9896', '#c5b0d5', '#c49c94', '#f7b6d2'], dtype=object), 'comb

scvi
* scvi_version = 0.8.1
* _scvi_batch -> original key = experiment (exp 1, exp 2)
* _scvi_labels -> original key = _scvi_labels

data registry
* X: attribute key = counts, attribute name = layers
* batch_indices: attribute key = _scvi_batch, attribute name = obs
* labels: attribute key = _scvi_labels, attribute name = obs
* local_l_mean: attribute key = _scvi_local_l_mean, attribute name = obs
* local_l_var: attribute key = _scvi_local_l_var, attribute name = obs

summary stats
* n_batch = 2
* n_cells = 11373 (larger than the 9880 cells in the current object)
* n_vars = 28701
* n_labels = 1
* n_proteins = 0


Questions about uns:
* What do the dtype codes `<f8` and `<f4` mean?
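
`<f8` and `<f4` are NumPy dtype strings: `<` means little-endian byte order, `f` means floating point, and the number is the size in bytes (8 bytes = 64-bit, 4 bytes = 32-bit). A short check, with an arbitrary example value:

```python
import numpy as np

for code in ("<f8", "<f4"):
    dtype = np.dtype(code)
    # kind 'f' = floating point; itemsize = bytes per value.
    print(code, "->", dtype.name, "| kind:", dtype.kind, "| bytes:", dtype.itemsize)

# The same value stored at both precisions: float32 keeps about 7
# significant decimal digits, float64 about 16.
value = 0.1234567890123456
as_f8 = np.array([value], dtype="<f8")
as_f4 = np.array([value], dtype="<f4")
print(as_f8[0], as_f4[0])
```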

## **3.5 obsm**

---

Key-indexed multi-dimensional annotations of observations.

In [None]:
adata_adipo.obsm
#https://www.sharpsightlabs.com/blog/numpy-axes-explained/

AxisArrays with keys: X_pca, X_scvi, X_umap

## **3.6 obsp**

---

Pairwise annotations of observations.

In [None]:
adata_adipo.obsp

PairwiseArrays with keys: connectivities, distances

## **3.7 var**

---

Variables (genes): key-indexed one-dimensional annotations of the variables/features, stored as a DataFrame with one row per gene.

In [None]:
adata_adipo.var.head()

Unnamed: 0,gene_id_orig,gene_id,gene_name,contig,gene_biotype,n_splice_sites,length,pseudogene_status,transcript_support_level,mitochondrial,highly_variable,means,dispersions,dispersions_norm
Xkr4,ENSMUSG00000051951.5,ENSMUSG00000051951,Xkr4,1,protein_coding,3,465597,False,1,False,False,0.5915157,4.378706,-0.022897
Gm1992,ENSMUSG00000089699.1,ENSMUSG00000089699,Gm1992,1,antisense,1,46966,False,3,False,False,0.005532729,3.071597,-0.894038
Gm37381,ENSMUSG00000102343.1,ENSMUSG00000102343,Gm37381,1,lincRNA,3,80476,False,1,False,False,0.9042411,4.106532,-0.617417
Rp1,ENSMUSG00000025900.11,ENSMUSG00000025900,Rp1,1,protein_coding,31,409684,False,1,False,False,0.01228122,3.000259,-0.989352
Rp1-1,ENSMUSG00000109048.1,ENSMUSG00000109048,Rp1,1,protein_coding,3,116206,False,1,False,False,1e-12,,
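
In the table above, `gene_id_orig` and `gene_id` differ only by a `.N` version suffix. The sketch below splits an Ensembl stable ID into its parts with a regular expression. The pattern and the helper name `parse_ensembl_id` are illustrative assumptions, not part of any library, and transgene IDs such as the `Calico-Tg-...` entries are deliberately left unmatched.

```python
import re

# Versioned Ensembl stable ID: "ENS", an optional species prefix ("MUS" for
# mouse), a feature-type prefix ("G" for gene), eleven digits, and an
# optional ".version" suffix.
ENSEMBL_ID = re.compile(
    r"^ENS(?P<species>[A-Z]*?)(?P<feature>E|FM|GT|G|P|R|T)"
    r"(?P<number>\d{11})(?:\.(?P<version>\d+))?$"
)


def parse_ensembl_id(stable_id):
    """Return the parts of an Ensembl stable ID, or None if it does not match."""
    match = ENSEMBL_ID.match(stable_id)
    if match is None:
        return None
    parts = match.groupdict()
    # gene_id in the table above corresponds to the ID without its version.
    parts["unversioned"] = stable_id.split(".")[0]
    return parts


for example in ("ENSMUSG00000051951.5", "ENSMUSG00000051951", "Calico-Tg-pJCK0022-CDS2"):
    print(example, "->", parse_ensembl_id(example))
```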


In [None]:
# Collect the unique values of each var column to see what every gene annotation contains
columns_var = adata_adipo.var.columns
data_var = {variable: adata_adipo.var[variable].unique() for variable in columns_var}

data_var

{'contig': ['1', '2', 'X', '3', '4', ..., 'tg19', 'tg20', 'tg21', 'tg22', 'tg28']
 Length: 46
 Categories (46, object): ['1', '2', '3', '4', ..., 'tg20', 'tg21', 'tg22', 'tg28'],
 'dispersions': array([4.37870623, 3.07159743, 4.10653174, ..., 5.54281909, 5.58368691,
        7.44526907]),
 'dispersions_norm': array([-0.02289738, -0.89403826, -0.6174166 , ...,  2.589962  ,
         2.5586796 ,  3.870227  ], dtype=float32),
 'gene_biotype': ['protein_coding', 'antisense', 'lincRNA', 'TR_V_gene', 'TR_V_pseudogene', ..., 'IG_V_pseudogene', 'IG_J_gene', 'IG_C_gene', 'IG_C_pseudogene', 'IG_D_gene']
 Length: 16
 Categories (16, object): ['IG_C_gene', 'IG_C_pseudogene', 'IG_D_gene', 'IG_J_gene', ...,
                           'TR_V_pseudogene', 'antisense', 'lincRNA', 'protein_coding'],
 'gene_id': array(['ENSMUSG00000051951', 'ENSMUSG00000089699', 'ENSMUSG00000102343',
        ..., 'Calico-Tg-pJCK0022-CDS2', 'Calico-Tg-pJCK0028-CDS1',
        'Calico-Tg-pJCK0028-CDS2'], dtype=object),
 'gene_

In [None]:
adata_adipo.n_vars

28701

In [None]:
adata_adipo.var_names

Index(['Xkr4', 'Gm1992', 'Gm37381', 'Rp1', 'Rp1-1', 'Sox17', 'Gm37323',
       'Mrpl15', 'Lypla1', 'Gm37988',
       ...
       'tg-Sox2', 'tg-eGFP-BC19', 'tg-Pou5f1', 'tg-eGFP-BC20', 'tg-Klf4',
       'tg-eGFP-BC21', 'tg-Myc', 'tg-eGFP-BC22', 'tg-Tet3G', 'tg-mCherry'],
      dtype='object', length=28701)

Questions about var:
* Classification of transcript support levels:
  * 1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA),
  * 2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs),
  * 3 (the only support is from a single EST),
  * 4 (the best supporting EST is flagged as suspect),
  * 5 (no single transcript supports the model structure),
  * 6 (the transcript was not analyzed)
* Which variable is used to compute the mean?
* Which method is used to normalize the dispersion?


Concepts:
* **gene_id_orig**: versioned Ensembl stable ID, of the form ENS[species prefix][feature type prefix][unique eleven-digit number].[version]; a mouse gene looks like ENSMUSG###########.
* **gene_id**: the stable ID without the version suffix.
* **gene_name**: https://m.ensembl.org/info/genome/genebuild/gene_names.html
* **contig**: a series of overlapping DNA sequences used to make a physical map that reconstructs the original DNA sequence of a chromosome or chromosome region. A contig can also refer to one of the DNA sequences used in making such a map. In this dataset the values are chromosome names (1, 2, X, ...) plus transgene contigs (tg19, tg20, ...).
* **gene_biotype**: gene classification. http://www.ensembl.org/info/genome/genebuild/biotypes.html
* **n_splice_sites**: number of splice sites.
* **length**: gene length.
* **pseudogene_status**: a pseudogene is a DNA sequence that resembles a gene but has been mutated into an inactive form over the course of evolution. True = pseudogene, False = not a pseudogene.
* **transcript_support_level**: a way to distinguish well-supported from poorly supported transcript models. https://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
* **mitochondrial**: True for mitochondrial genes, False otherwise.
* **highly_variable**: flags genes that contribute strongly to cell-to-cell variation. True = highly variable, False otherwise.
* **means**
* **dispersions**: a parameter describing how much the variance deviates from the mean.
* **dispersions_norm**: normalized dispersion.
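
The relationship between `means`, `dispersions` and `dispersions_norm` can be sketched on toy data. This is a simplified illustration under stated assumptions, not the procedure used for this file: dispersion is taken as variance / mean, and the normalization is a plain z-score across all genes. Highly-variable-gene routines typically normalize within bins of genes with similar mean expression, a step omitted here.

```python
import numpy as np

# Hypothetical normalized expression: 4 cells (rows) x 3 genes (columns).
expr = np.array([
    [1.0, 0.0, 5.0],
    [1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0],
    [1.0, 8.0, 6.0],
])

# Per-gene mean and (sample) variance across cells.
means = expr.mean(axis=0)
variances = expr.var(axis=0, ddof=1)

# Dispersion = variance / mean: 0 for a constant gene, large for a gene
# whose expression varies strongly between cells.
dispersions = variances / means

# Simplified normalization: z-score the dispersions across all genes.
dispersions_norm = (dispersions - dispersions.mean()) / dispersions.std(ddof=1)

# The gene with the largest normalized dispersion is the most variable one.
most_variable = int(np.argmax(dispersions_norm))

print("means           :", means)
print("dispersions     :", np.round(dispersions, 3))
print("dispersions_norm:", np.round(dispersions_norm, 3))
print("most variable gene index:", most_variable)
```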

## **3.8 Combined view**

---



In [None]:
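# Replace barcodes and gene symbols with generic labels.
# Note: this overwrites the original names in place.
adata_adipo.obs_names = [f"Cell_{i}" for i in range(adata_adipo.n_obs)]
adata_adipo.var_names = [f"Gene_{i}" for i in range(adata_adipo.n_vars)]
print(adata_adipo.obs_names[:10])
print(adata_adipo.var_names[:10])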

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')
Index(['Gene_0', 'Gene_1', 'Gene_2', 'Gene_3', 'Gene_4', 'Gene_5', 'Gene_6',
       'Gene_7', 'Gene_8', 'Gene_9'],
      dtype='object')


In [None]:
adata_adipo.to_df(layer = "counts")

Unnamed: 0,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_28691,Gene_28692,Gene_28693,Gene_28694,Gene_28695,Gene_28696,Gene_28697,Gene_28698,Gene_28699,Gene_28700
Cell_0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,24.0,4.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0
Cell_1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,51.0
Cell_2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,19.0,3.0,0.0,...,0.0,3.0,0.0,14.0,0.0,7.0,0.0,4.0,26.0,187.0
Cell_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,0.0,0.0,...,0.0,4.0,0.0,4.0,0.0,1.0,0.0,0.0,0.0,5.0
Cell_4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,14.0,3.0,0.0,...,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_9875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_9876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
Cell_9877,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cell_9878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# **4. Summary questions about Murine adipogenic cell Yamanaka Factor screen**

---



Questions about obs:
* What are Adipo**A** and Adipo**Y**?
* What do L1, L2, L3 and L4 mean?
* Which factors are in pool1 and pool2 (sample)?
* What are categories 0, 1, 2, 3?
* What are batches 0, 1, 2, 3?
* What are experiments 1 and 2?
* What is perc_tg?

Questions about layers:
* What is CPM?

Questions about uns:
* What do the dtype codes `<f8` and `<f4` mean?

Questions about var:
* Classification of transcript support levels:
  * 1 (all splice junctions of the transcript are supported by at least one non-suspect mRNA),
  * 2 (the best supporting mRNA is flagged as suspect or the support is from multiple ESTs),
  * 3 (the only support is from a single EST),
  * 4 (the best supporting EST is flagged as suspect),
  * 5 (no single transcript supports the model structure),
  * 6 (the transcript was not analyzed)
* Which variable is used to compute the mean?
* Which method is used to normalize the dispersion?


# **5. Murine mesenchymal stem cell Yamanaka Factor screen**
---



In [None]:
adata_msc_screen = anndata.read_h5ad("/content/drive/MyDrive/msc_screen.h5ad")

In [None]:
adata_msc_screen

AnnData object with n_obs × n_vars = 10021 × 19321
    obs: 'batch', 'n_genes', 'perc_mito', 'age', 'log10_n_counts', 'combination_short', 'age_combination_short', '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean', '_scvi_local_l_var'
    var: 'gene_id_orig-0', 'gene_id-0', 'gene_name-0', 'contig-0', 'gene_biotype-0', 'n_splice_sites-0', 'length-0', 'pseudogene_status-0', 'transcript_support_level-0', 'mitochondrial-0', 'highly_variable-0', 'means-0', 'dispersions-0', 'dispersions_norm-0', 'n_cells-0', 'gene_id_orig-1', 'gene_id-1', 'gene_name-1', 'contig-1', 'gene_biotype-1', 'n_splice_sites-1', 'length-1', 'pseudogene_status-1', 'transcript_support_level-1', 'mitochondrial-1', 'highly_variable-1', 'means-1', 'dispersions-1', 'dispersions_norm-1', 'n_cells-1', 'transgene-1', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: '_scvi', 'age_colors', 'batch_colors', 'combination_short_colors', 'hvg', 'leiden_colors', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_

##**5.1Concepts**

---



* Different pools of Yamanaka factors
* Mesenchymal stem cells
* Young vs Aged

## **5.2 obs**

---
Observations. Key-indexed one-dimensional annotation. CELLS

[Sparse matrix](https://media.geeksforgeeks.org/wp-content/uploads/Sparse-Matrix-Array-Representation1.png)

In [None]:
# inspect the covariates
adata_msc_screen.obs.head()

barcode,batch,n_genes,perc_mito,age,log10_n_counts,combination_short,age_combination_short,_scvi_batch,_scvi_labels,_scvi_local_l_mean,_scvi_local_l_var
AAACCCATCATTTCGT-0-0,0,7232,0.011245,Young,4.668386,SO,Young_SO,0,0,10.615695,0.565337
AAACCCATCCAAGCTA-0-0,0,5667,0.013463,Young,4.552084,NT,Young_NT,0,0,10.615695,0.565337
AAACCCATCGACATCA-0-0,0,6133,0.008502,Young,4.620719,NT,Young_NT,0,0,10.615695,0.565337
AAACGAAAGGGTTTCT-0-0,0,4332,0.006365,Young,4.446599,NT,Young_NT,0,0,10.615695,0.565337
AAACGAACAACCGCTG-0-0,0,6741,0.029732,Young,4.674144,M,Young_M,0,0,10.615695,0.565337


In [None]:
adata_msc_screen.n_obs

10021

In [None]:
# Collect the unique values of each obs column to see what it encodes
columns_obs = adata_msc_screen.obs.columns
data_obs = {}
for variable in columns_obs:
    data_obs[variable] = adata_msc_screen.obs[variable].unique()

data_obs

{'_scvi_batch': array([0], dtype=int8),
 '_scvi_labels': array([0], dtype=int8),
 '_scvi_local_l_mean': array([10.615695]),
 '_scvi_local_l_var': array([0.56533653]),
 'age': ['Young', 'Aged']
 Categories (2, object): ['Young', 'Aged'],
 'age_combination_short': ['Young_SO', 'Young_NT', 'Young_M', 'Young_S', 'Young_SOM', ..., 'Aged_SKM', 'Aged_SO', 'Aged_SM', 'Aged_SOM', 'Aged_S']
 Length: 32
 Categories (32, object): ['Young_NT', 'Young_S', 'Young_O', 'Young_K', ..., 'Aged_SOM', 'Aged_SKM',
                           'Aged_OKM', 'Aged_SOKM'],
 'batch': ['0', '1']
 Categories (2, object): ['0', '1'],
 'combination_short': ['SO', 'NT', 'M', 'S', 'SOM', ..., 'SK', 'SOKM', 'SKM', 'SM', 'SOK']
 Length: 16
 Categories (16, object): ['NT', 'S', 'O', 'K', ..., 'SOM', 'SKM', 'OKM', 'SOKM'],
 'log10_n_counts': array([4.668386 , 4.552084 , 4.620719 , ..., 5.197581 , 4.3740516,
        5.1549773], dtype=float32),
 'n_genes': array([7232, 5667, 6133, ..., 1751, 1747, 9330]),
 'perc_mito': array([0

In [None]:
#Barcode
adata_msc_screen.obs_names

Index(['AAACCCATCATTTCGT-0-0', 'AAACCCATCCAAGCTA-0-0', 'AAACCCATCGACATCA-0-0',
       'AAACGAAAGGGTTTCT-0-0', 'AAACGAACAACCGCTG-0-0', 'AAACGAAGTCACAGTT-0-0',
       'AAACGAAGTTGAAGTA-0-0', 'AAACGAATCCAAGGGA-0-0', 'AAACGAATCTGTCGCT-0-0',
       'AAACGCTAGACTTCCA-0-0',
       ...
       'TGTTGAGAGCACACCC-3-1', 'TTACCGCCAGGGACTA-3-1', 'TTCCTCTCACCAGCCA-3-1',
       'TTGAGTGAGCACCGAA-3-1', 'TTGAGTGCAACCTATG-3-1', 'TTTACGTGTGTGAATA-3-1',
       'TTTCACACAGAAGCTG-3-1', 'TTTCACATCACAGTGT-3-1', 'TTTGACTCATGTGGTT-3-1',
       'TTTGGAGCACGTCGTG-3-1'],
      dtype='object', name='barcode', length=10021)

##**5.3 layers**

---

Key-indexed multi-dimensional arrays aligned to dimensions of X

In [None]:
adata_msc_screen.layers.keys()

KeysView(Layers with keys: counts, log1p_cpm, scvi_normalized)

In [None]:
for key, value in adata_msc_screen.layers.items():
  print(f"{key} = {value}")

counts =   (0, 11481)	4.0
  (0, 8208)	1.0
  (0, 16915)	8.0
  (0, 2196)	3.0
  (0, 14499)	13.0
  (0, 12875)	6.0
  (0, 15070)	2.0
  (0, 1371)	1.0
  (0, 18341)	3.0
  (0, 11070)	1.0
  (0, 16165)	2.0
  (0, 3912)	4.0
  (0, 4113)	7.0
  (0, 1868)	3.0
  (0, 13724)	1.0
  (0, 975)	1.0
  (0, 16658)	2.0
  (0, 16018)	3.0
  (0, 11894)	3.0
  (0, 17654)	5.0
  (0, 10390)	2.0
  (0, 14969)	75.0
  (0, 14590)	1.0
  (0, 16555)	1.0
  (0, 18089)	2.0
  :	:
  (10020, 5682)	1.0
  (10020, 15536)	1.0
  (10020, 13712)	7.0
  (10020, 8962)	3.0
  (10020, 19182)	10.0
  (10020, 4088)	1.0
  (10020, 19314)	148.0
  (10020, 19315)	108.0
  (10020, 19310)	382.0
  (10020, 19311)	370.0
  (10020, 19309)	1.0
  (10020, 19308)	280.0
  (10020, 19312)	382.0
  (10020, 19316)	38.0
  (10020, 19318)	18.0
  (10020, 19317)	234.0
  (10020, 19319)	35.0
  (10020, 19320)	40.0
  (10020, 19313)	630.0
  (10020, 18318)	2.0
  (10020, 17457)	2.0
  (10020, 1015)	1.0
  (10020, 12661)	23.0
  (10020, 4381)	3.0
  (10020, 14783)	313.0
log1p_cpm =   (0, 0)	4


Concepts: 
* log1p: X = log(X + 1), using the natural logarithm
* CPM (counts per million): each cell's counts are divided by that cell's total count (library size) and multiplied by one million
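
The two concepts above can be combined into a small sketch. The count matrix is made up for illustration; only the formula comes from the notes. As a cross-check against the msc_sokm output further down: the first cell there has n_counts = 6952, and a raw count of 1 gives log1p(1e6 / 6952) ≈ 4.9757, which matches the first log1p_cpm value printed for that cell.

```python
import numpy as np

# Toy raw counts: 3 cells (rows) x 4 genes (columns); values are made up.
counts = np.array([
    [4.0, 1.0, 0.0, 5.0],
    [0.0, 2.0, 2.0, 0.0],
    [10.0, 0.0, 30.0, 60.0],
])

# Library size = total counts per cell.
library_size = counts.sum(axis=1, keepdims=True)

# CPM: rescale every cell so that its counts sum to one million.
cpm = counts / library_size * 1e6

# log1p: natural log of (x + 1); zeros stay at zero.
log1p_cpm = np.log1p(cpm)

# Cross-check with the notebook output: count of 1 in a cell with 6952 counts.
check_value = np.log1p(1.0 / 6952.0 * 1e6)
print(log1p_cpm.round(3))
print(round(float(check_value), 4))
```

Scaling per cell before the log makes cells with different sequencing depth comparable; the log then compresses the large dynamic range of highly expressed genes.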

##**5.4 uns**

---

Unstructured annotation. Key-indexed.

In [None]:
adata_msc_screen.uns

OverloadedDict, wrapping:
	{'_scvi': {'categorical_mappings': {'_scvi_batch': {'mapping': array([0]), 'original_key': '_scvi_batch'}, '_scvi_labels': {'mapping': array([0]), 'original_key': '_scvi_labels'}}, 'data_registry': {'X': {'attr_key': 'counts', 'attr_name': 'layers'}, 'batch_indices': {'attr_key': '_scvi_batch', 'attr_name': 'obs'}, 'labels': {'attr_key': '_scvi_labels', 'attr_name': 'obs'}, 'local_l_mean': {'attr_key': '_scvi_local_l_mean', 'attr_name': 'obs'}, 'local_l_var': {'attr_key': '_scvi_local_l_var', 'attr_name': 'obs'}}, 'scvi_version': '0.7.1', 'summary_stats': {'n_batch': 1, 'n_cells': 10021, 'n_labels': 1, 'n_proteins': 0, 'n_vars': 19321}}, 'age_colors': array(['#1f77b4', '#ff7f0e'], dtype=object), 'batch_colors': array(['#1f77b4', '#ff7f0e'], dtype=object), 'combination_short_colors': array(['#929591', '#f77189', '#f37a32', '#ca9232', '#ae9d31', '#8ea631',
       '#50b131', '#33b07a', '#35ae99', '#36acae', '#38a9c5', '#3ba3ec',
       '#9591f4', '#cc7af4', '#f5

scvi
* scvi_version 0.7.1
* scvi_batch -> original key = _SCVI_BATCH
* scvi_labels -> original key = _SCVI_LABELS

data registry
* X: attribute key = COUNTS, attribute name = LAYERS
* batch_indices: attribute key = _SCVI_BATCH, attribute name = OBS
* labels: attribute key = _SCVI_LABELS, attribute name = OBS
* local_l_mean: attribute key = _SCVI_LOCAL_L_MEAN, attribute name = OBS
* local_l_var: attribute key = _SCVI_LOCAL_L_VAR, attribute name = OBS

summary stats
* n_batch = 1
* n_cells = 10021
* n_vars = 19321
* n_labels = 1
* n_proteins = 0



##**5.5 obsm**

---

Key-indexed multi-dimensional annotation of observations

In [None]:
adata_msc_screen.obsm
#https://www.sharpsightlabs.com/blog/numpy-axes-explained/

AxisArrays with keys: X_pca, X_scvi, X_umap

##**5.6 obsp**

---

Pairwise annotation of observations.

In [None]:
adata_msc_screen.obsp

PairwiseArrays with keys: connectivities, distances
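
To make "pairwise annotation" concrete, here is a minimal sketch of two sparse n_obs × n_obs matrices built from a k-nearest-neighbor search. Everything in it is an assumption for illustration: the toy embedding, k = 2, and the binary connectivities (the real `connectivities` matrix holds weighted values, not just 0/1).

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # 6 toy cells in a 2-D embedding (made-up data)
k = 2                        # neighbors kept per cell (arbitrary choice)

# Dense Euclidean distance between every pair of cells: shape (6, 6).
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Indices of the k nearest neighbors of each cell; column 0 is the cell itself.
neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]
rows = np.repeat(np.arange(X.shape[0]), k)
cols = neighbors.ravel()

# Sparse n_obs x n_obs matrices with one row and one column per cell.
distances = sparse.csr_matrix((dist[rows, cols], (rows, cols)), shape=dist.shape)
connectivities = sparse.csr_matrix((np.ones(rows.size), (rows, cols)), shape=dist.shape)

print(distances.shape, distances.nnz)
```

Only k entries per row are stored, which is why these matrices stay sparse even for the 10021 cells in this dataset.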

##**5.7 var**

---

Variables. Key-indexed one-dimensional annotation. ANNOTATION OF VARIABLES/FEATURES (DATAFRAME) = GENES

In [None]:
adata_msc_screen.var.head()

Unnamed: 0,gene_id_orig-0,gene_id-0,gene_name-0,contig-0,gene_biotype-0,n_splice_sites-0,length-0,pseudogene_status-0,transcript_support_level-0,mitochondrial-0,...,highly_variable-1,means-1,dispersions-1,dispersions_norm-1,n_cells-1,transgene-1,highly_variable,means,dispersions,dispersions_norm
0610007P14Rik,ENSMUSG00000021252.11,ENSMUSG00000021252,0610007P14Rik,12,protein_coding,5.0,9102.0,False,1.0,False,...,False,4.453439,4.250101,0.235851,4707.0,False,False,4.505795,4.366016,0.285764
0610009B22Rik,ENSMUSG00000007777.9,ENSMUSG00000007777,0610009B22Rik,11,protein_coding,2.0,3488.0,False,1.0,False,...,False,4.177652,3.672722,-0.599898,4412.0,False,False,4.057542,3.781253,-0.646734
0610009L18Rik,ENSMUSG00000043644.4,ENSMUSG00000043644,0610009L18Rik,11,protein_coding,1.0,2512.0,False,2.0,False,...,False,2.043237,3.482285,-0.18865,1286.0,False,False,2.071436,3.763982,-0.039764
0610009O20Rik,ENSMUSG00000024442.5,ENSMUSG00000024442,0610009O20Rik,18,protein_coding,12.0,12380.0,False,1.0,False,...,False,2.107634,3.252145,-0.651041,1493.0,False,False,2.201376,3.677687,-0.218231
0610010F05Rik,ENSMUSG00000042208.15,ENSMUSG00000042208,0610010F05Rik,11,protein_coding,24.0,68670.0,False,1.0,False,...,False,2.65629,3.450919,-0.329439,2172.0,False,False,2.478637,3.618971,-0.339661


In [None]:
# Collect the unique values of each var column to see what it encodes
columns_var = adata_msc_screen.var.columns
data_var = {}
for variable in columns_var:
    data_var[variable] = adata_msc_screen.var[variable].unique()

data_var

{'contig-0': ['12', '11', '18', '17', '16', ..., 'GL456216.1', 'Y', 'JH584304.1', 'GL456233.1', 'MT']
 Length: 31
 Categories (30, object): ['1', '2', '3', '4', ..., 'JH584304.1', 'MT', 'X', 'Y'],
 'contig-1': ['12', '11', '18', '17', '16', ..., 'GL456216.1', 'Y', 'JH584304.1', 'GL456233.1', 'MT']
 Length: 29
 Categories (28, object): ['1', '2', '3', '4', ..., 'JH584304.1', 'MT', 'X', 'Y'],
 'dispersions': array([4.36601589, 3.78125268, 3.76398156, ..., 5.22074444, 5.12831909,
        5.11894304]),
 'dispersions-0': array([4.42897203, 3.83900471, 3.91076296, ..., 5.16200267, 5.08835398,
        5.0222146 ]),
 'dispersions-1': array([4.25010089, 3.67272195, 3.48228541, ..., 5.29012507, 5.16512131,
        5.16223163]),
 'dispersions_norm': array([ 0.28576386, -0.6467335 , -0.03976388, ...,  0.6972309 ,
         0.76984257,  0.6282668 ], dtype=float32),
 'dispersions_norm-0': array([ 0.11189876, -0.42727396,  0.08539563, ...,  0.6992171 ,
         0.60122025,  0.51321524], dtype=float32)

In [None]:
adata_msc_screen.n_vars

19321

In [None]:
adata_msc_screen.var_names

Index(['0610007P14Rik', '0610009B22Rik', '0610009L18Rik', '0610009O20Rik',
       '0610010F05Rik', '0610010K14Rik', '0610011F06Rik', '0610012D04Rik',
       '0610012G03Rik', '0610025J13Rik',
       ...
       'mt-Co2', 'mt-Co3', 'mt-Cytb', 'mt-Nd1', 'mt-Nd2', 'mt-Nd3', 'mt-Nd4',
       'mt-Nd4l', 'mt-Nd5', 'mt-Nd6'],
      dtype='object', length=19321)

Questions :
* Why are there gene_id-0 and gene_id-1 columns (and the same for all variables)?
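
One way to explore this question is to group the var columns by their numeric suffix. The suffixes are consistent with per-batch copies kept when two objects were concatenated (obs has batch categories '0' and '1'), but that reading is an assumption, not something the file states. The sketch below uses a toy table with the first two dispersion values from the outputs above; `split_suffix` is a hypothetical helper.

```python
import pandas as pd

# Toy var table mimicking the suffixed columns (values copied from the outputs above).
var = pd.DataFrame(
    {
        "dispersions-0": [4.42897203, 3.83900471],
        "dispersions-1": [4.25010089, 3.67272195],
        "dispersions": [4.36601589, 3.78125268],
    },
    index=["0610007P14Rik", "0610009B22Rik"],
)


def split_suffix(name):
    """Split 'field-0' into ('field', '0'); unsuffixed names get 'combined'."""
    base, sep, suffix = name.rpartition("-")
    if sep and suffix.isdigit():
        return base, suffix
    return name, "combined"


# Two-level column index: (field, batch suffix).
var.columns = pd.MultiIndex.from_tuples(
    [split_suffix(c) for c in var.columns], names=["field", "batch"]
)

# Side-by-side view of one field across the suffixes.
per_batch = var["dispersions"]
print(per_batch)
```

Seen side by side, the `-0` and `-1` values differ from each other and from the unsuffixed column, which fits the idea that each was computed on a different subset of cells.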


Concepts:
* **gene_id_orig**: stable ID -> ENS[species prefix][feature type prefix][a unique eleven-digit number], so a mouse gene looks like ENSMUSG###########. In this table the ID also carries a version suffix (e.g. ENSMUSG00000021252.11).
* **gene_id**: the same stable ID without the version suffix (e.g. ENSMUSG00000021252).
* **gene_name**: https://m.ensembl.org/info/genome/genebuild/gene_names.html
* **contig**: a series of overlapping DNA sequences used to make a physical map that reconstructs the original DNA sequence of a chromosome or a region of a chromosome. A contig can also refer to one of the DNA sequences used in making such a map. In this table it holds the chromosome or scaffold name (e.g. 12, X, MT, GL456216.1).
* **gene_biotype**: a gene classification. http://www.ensembl.org/info/genome/genebuild/biotypes.html
* **n_splice_sites**: number of splice sites
* **length**: length of the gene
* **pseudogene_status**: a pseudogene is a DNA sequence that resembles a gene but has been mutated into an inactive form over the course of evolution. True = pseudogene, False = not a pseudogene.
* **transcript_support_level**: a method to highlight the well-supported and poorly-supported transcript models for users. https://www.ensembl.org/info/genome/genebuild/transcript_quality_tags.html
* **mitochondrial**: True = mitochondrial gene, False = not mitochondrial
* **highly_variable**: flags genes that contribute strongly to cell-to-cell variation. True = highly variable, False = not highly variable.
* **means**: mean expression of the gene across cells
* **dispersions**: a measure of how much the variance deviates from the mean.
* **dispersions_norm**: normalized dispersion
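
The last three entries can be illustrated with a simplified sketch: dispersion as the variance-to-mean ratio per gene, then a z-score of the dispersion among genes with similar mean expression. This is only the general idea behind a Seurat-style normalization (the msc_sokm file records `'hvg': {'flavor': 'seurat'}`), not scanpy's exact code. The toy data, the 5 bins, and the absence of any log transform are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy non-log expression: 200 cells x 50 genes with gene-specific scales (made up).
gene_scale = rng.uniform(0.5, 5.0, size=50)
X = rng.gamma(shape=2.0, scale=gene_scale, size=(200, 50))

mean = X.mean(axis=0)
variance = X.var(axis=0, ddof=1)
dispersion = variance / mean  # variance-to-mean ratio per gene

stats = pd.DataFrame({"means": mean, "dispersions": dispersion})

# Bin genes by mean expression, then z-score the dispersion within each bin.
stats["mean_bin"] = pd.cut(stats["means"], bins=5)
grouped = stats.groupby("mean_bin", observed=True)["dispersions"]
stats["dispersions_norm"] = (
    stats["dispersions"] - grouped.transform("mean")
) / grouped.transform("std")

print(stats.head())
```

A bin that contains a single gene has no standard deviation, so its normalized dispersion comes out as NaN in this sketch.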

##**5.8 Whole dataset**

---



In [None]:
adata_msc_screen.obs_names = [f"Cell_{i}" for i in range(adata_msc_screen.n_obs)]
adata_msc_screen.var_names = [f"Gene_{i}" for i in range(adata_msc_screen.n_vars)]
print(adata_msc_screen.obs_names[:10])
print(adata_msc_screen.var_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')
Index(['Gene_0', 'Gene_1', 'Gene_2', 'Gene_3', 'Gene_4', 'Gene_5', 'Gene_6',
       'Gene_7', 'Gene_8', 'Gene_9'],
      dtype='object')


In [None]:
adata_msc_screen.to_df(layer = "counts")

Unnamed: 0,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_19311,Gene_19312,Gene_19313,Gene_19314,Gene_19315,Gene_19316,Gene_19317,Gene_19318,Gene_19319,Gene_19320
Cell_0,6.0,1.0,0.0,0.0,1.0,3.0,1.0,0.0,7.0,0.0,...,61.0,85.0,119.0,18.0,16.0,3.0,50.0,7.0,6.0,6.0
Cell_1,3.0,1.0,0.0,0.0,0.0,2.0,1.0,0.0,3.0,0.0,...,68.0,90.0,87.0,17.0,17.0,3.0,41.0,3.0,6.0,5.0
Cell_2,3.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,7.0,0.0,...,22.0,21.0,32.0,13.0,9.0,2.0,37.0,2.0,5.0,6.0
Cell_3,9.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,0.0,...,21.0,45.0,43.0,9.0,10.0,2.0,7.0,0.0,1.0,1.0
Cell_4,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,162.0,248.0,172.0,73.0,44.0,11.0,110.0,17.0,16.0,22.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_10016,17.0,6.0,2.0,1.0,3.0,4.0,2.0,0.0,11.0,0.0,...,257.0,329.0,290.0,91.0,31.0,17.0,121.0,13.0,13.0,31.0
Cell_10017,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,19.0,24.0,15.0,3.0,2.0,0.0,16.0,1.0,4.0,7.0
Cell_10018,6.0,9.0,2.0,0.0,2.0,4.0,4.0,0.0,22.0,0.0,...,79.0,114.0,216.0,45.0,13.0,11.0,81.0,5.0,1.0,2.0
Cell_10019,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,64.0,100.0,98.0,23.0,13.0,5.0,35.0,4.0,2.0,6.0


#**6.Summary questions Murine mesenchymal stem cell Yamanaka Factor screen**

Questions vars:
* Why are there gene_id-0 and gene_id-1 columns (and the same for all variables)?


#**7.Murine mesenchymal stem cell polycistronic reprogramming**

---



In [None]:
adata_sokm = anndata.read_h5ad("/content/drive/MyDrive/Colab Notebooks/msc_sokm.h5ad")

In [None]:
adata_sokm

AnnData object with n_obs × n_vars = 20661 × 28694
    obs: 'sample', 'batch', 'n_counts', 'n_genes', 'perc_mito', 'perc_rrna', 'age', 'sample_treatment', 'age_treatment', 'leiden', 'hash', 'animal', 'treatment', 'state', 'multiseq_derived_treatment', 'velocity_pseudotime', 'velocity_pseudotime_r', 'velocity_self_transition', 'root_cells', 'end_points', '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean', '_scvi_local_l_var'
    var: 'gene_id_orig', 'gene_id', 'gene_name', 'contig', 'gene_biotype', 'n_splice_sites', 'length', 'pseudogene_status', 'transcript_support_level', 'mitochondrial', 'transgene', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: '_scvi', 'age_colors', 'age_treatment_colors', 'animal_colors', 'draw_graph', 'hvg', 'leiden', 'leiden_colors', 'leiden_sizes', 'multiseq_derived_treatment_colors', 'neighbors', 'paga', 'pca', 'rank_genes_groups', 'sample_colors', 'state_colors', 'treatment_colors', 'umap', 'velocity_graph', 'velocity_graph_neg', 've

##**7.1Concepts**

---



* Yamanaka factors
* Mesenchymal stem cells
* Young vs Aged

## **7.2 obs**

---
Observations. Key-indexed one-dimensional annotation. CELLS

[Sparse matrix](https://media.geeksforgeeks.org/wp-content/uploads/Sparse-Matrix-Array-Representation1.png)

In [None]:
# inspect the covariates
adata_sokm.obs.head()

barcode,sample,batch,n_counts,n_genes,perc_mito,perc_rrna,age,sample_treatment,age_treatment,leiden,...,multiseq_derived_treatment,velocity_pseudotime,velocity_pseudotime_r,velocity_self_transition,root_cells,end_points,_scvi_batch,_scvi_labels,_scvi_local_l_mean,_scvi_local_l_var
AAACCCAAGATTCGAA-0,L1,0,6952.0,2887,0.016974,0.002301,Aged,Tg_Dox,Aged_Tg_Dox,9,...,Tg+/Dox+,0.743154,0.256846,0.161464,0.017942,5.208137e-05,0,0,9.759639,0.334722
AAACCCAAGCGCAATG-0,L1,0,12720.0,3349,0.011321,0.00228,Aged,Tg_Dox,Aged_Tg_Dox,9,...,Tg+/Dox+,0.741661,0.258339,0.189393,0.117389,8.334437e-06,0,0,9.759639,0.334722
AAACCCAAGGACGGAG-0,L1,0,17532.0,3819,0.006046,0.000742,Aged,Tg_Dox,Aged_Tg_Dox,2,...,Tg+/Dox+,0.664532,0.335468,0.106752,0.034192,0.0005413001,0,0,9.759639,0.334722
AAACCCAAGGCTCAAG-0,L1,0,17448.0,4051,0.010087,0.00149,Aged,Tg_Dox,Aged_Tg_Dox,5,...,Tg+/Dox+,0.273838,0.726162,0.092914,0.002692,9.040137e-08,0,0,9.759639,0.334722
AAACCCACAACAGCTT-0,L1,0,17821.0,3938,0.004938,0.000786,Aged,Tg_Dox,Aged_Tg_Dox,1,...,Tg+/Dox+,0.168965,0.831035,0.203846,0.452895,5.260476e-07,0,0,9.759639,0.334722


In [None]:
adata_sokm.n_obs

20661

In [None]:
# Collect the unique values of each obs column to see what it encodes
columns_obs = adata_sokm.obs.columns
data_obs = {}
for variable in columns_obs:
    data_obs[variable] = adata_sokm.obs[variable].unique()

data_obs

{'_scvi_batch': array([0], dtype=int8),
 '_scvi_labels': array([0], dtype=int8),
 '_scvi_local_l_mean': array([9.75963879]),
 '_scvi_local_l_var': array([0.33472234]),
 'age': ['Aged', 'Young']
 Categories (2, object): ['Young', 'Aged'],
 'age_treatment': ['Aged_Tg_Dox', 'Aged_NegCtrl', 'Young_Tg_Dox', 'Young_NegCtrl']
 Categories (4, object): ['Young_NegCtrl', 'Young_Tg_Dox', 'Aged_NegCtrl', 'Aged_Tg_Dox'],
 'animal': ['1', '2', 'unknown', '3']
 Categories (4, object): ['1', '2', '3', 'unknown'],
 'batch': ['0', '1', '2', '3']
 Categories (4, object): ['0', '1', '2', '3'],
 'end_points': array([5.20813709e-05, 8.33443656e-06, 5.41300074e-04, ...,
        3.22625144e-01, 4.75299369e-01, 4.42185427e-01]),
 'hash': ['H2', 'H1', 'H3', 'negative', 'H6', 'H5', 'H4']
 Categories (7, object): ['H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'negative'],
 'leiden': ['9', '2', '5', '1', '0', '4', '8', '7', '3']
 Categories (9, object): ['0', '1', '2', '3', ..., '5', '7', '8', '9'],
 'multiseq_derived_treat

In [None]:
#Barcode
adata_sokm.obs_names

Index(['AAACCCAAGATTCGAA-0', 'AAACCCAAGCGCAATG-0', 'AAACCCAAGGACGGAG-0',
       'AAACCCAAGGCTCAAG-0', 'AAACCCACAACAGCTT-0', 'AAACCCATCGCTCTAC-0',
       'AAACCCATCGTCTAAG-0', 'AAACCCATCTCCGAGG-0', 'AAACCCATCTTGGTGA-0',
       'AAACGAAAGCAGCCCT-0',
       ...
       'TTTCGATGTAAGCGGT-3', 'TTTCGATTCCTTATAC-3', 'TTTGACTAGAGTCTGG-3',
       'TTTGACTAGCAGTAAT-3', 'TTTGACTTCGCCTTGT-3', 'TTTGGAGCAATTGTGC-3',
       'TTTGGAGGTAGACGGT-3', 'TTTGGTTTCGACCATA-3', 'TTTGTTGGTATTGCCA-3',
       'TTTGTTGGTTTGATCG-3'],
      dtype='object', name='barcode', length=20661)

Questions: 
* what are animals 1,2,3,unknown?
* categories 0,1,2,3? 
* batch 0,1,2,3?
* sample L1,L2,L3,L4?
* what is the difference between multiseq_derived_treatment and treatment?
* end point: last state of reprogramming ? 
* hash: animal origin. What is the meaning of H1,H2,H3,H4,H5,H6,negative?
* velocity self transition: self-renewal?
* state: 
  * cycling: doing cell-division cycle (series of events that take place in a cell that cause it to divide into two daughter cells)
  * Dox+ baseline: baseline conditions with doxycycline (dox+)
  * Col11a1 high: Mesenchyme-derived tumors 
  * reprogramming cycling: doing reprogramming cycle?
  * reprogramming: reprogrammed cell?
  * reprogramming intermediate: ?
  * Tg+ baseline: transgene positive

Concepts: 
* Dox+: 3-day Dox pulse
* Dox-: 3-day Dox chase
* velocity pseudotime: pseudotime analysis infers a continuous process from the relationships among cells profiled at a single timepoint. Velocity pseudotime is based on random-walk distance measures on the velocity graph.
* velocity pseudotime r: in the rows shown above this column equals 1 − velocity_pseudotime, so it appears to be the reversed pseudotime rather than something computed in R
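
A quick check of how `velocity_pseudotime` and `velocity_pseudotime_r` relate, using the five rows printed by `adata_sokm.obs.head()` above. It only shows that the two columns sum to 1 in those rows; whether that holds for every cell, and that "r" means "reversed", are assumptions.

```python
import numpy as np

# Values copied from the first five rows of adata_sokm.obs.head().
velocity_pseudotime = np.array([0.743154, 0.741661, 0.664532, 0.273838, 0.168965])
velocity_pseudotime_r = np.array([0.256846, 0.258339, 0.335468, 0.726162, 0.831035])

# If the second column is the reversed pseudotime, each pair sums to 1.
total = velocity_pseudotime + velocity_pseudotime_r
is_reversed = bool(np.allclose(total, 1.0, atol=1e-6))

print(total)
print(is_reversed)
```

On the full object the same comparison could be run on the two obs columns directly.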




##**7.3 layers**

---

Key-indexed multi-dimensional arrays aligned to dimensions of X

In [None]:
adata_sokm.layers.keys()

KeysView(Layers with keys: counts, log1p_cpm, scvi_normalized, spliced, unspliced)

In [None]:
for key, value in adata_sokm.layers.items():
  print(f"{key} = {value}")

counts =   (0, 6)	1.0
  (0, 7)	1.0
  (0, 10)	1.0
  (0, 13)	1.0
  (0, 20)	1.0
  (0, 24)	1.0
  (0, 37)	1.0
  (0, 51)	1.0
  (0, 62)	9.0
  (0, 71)	2.0
  (0, 117)	1.0
  (0, 130)	2.0
  (0, 131)	1.0
  (0, 147)	2.0
  (0, 162)	2.0
  (0, 173)	1.0
  (0, 177)	1.0
  (0, 188)	4.0
  (0, 207)	2.0
  (0, 208)	4.0
  (0, 212)	3.0
  (0, 213)	2.0
  (0, 227)	5.0
  (0, 230)	1.0
  (0, 239)	1.0
  :	:
  (20660, 28578)	31.0
  (20660, 28580)	1.0
  (20660, 28583)	5.0
  (20660, 28590)	6.0
  (20660, 28591)	1.0
  (20660, 28593)	2.0
  (20660, 28595)	1.0
  (20660, 28600)	36.0
  (20660, 28601)	17.0
  (20660, 28602)	144.0
  (20660, 28603)	83.0
  (20660, 28604)	1.0
  (20660, 28605)	94.0
  (20660, 28606)	58.0
  (20660, 28607)	6.0
  (20660, 28608)	9.0
  (20660, 28609)	55.0
  (20660, 28610)	8.0
  (20660, 28611)	10.0
  (20660, 28612)	99.0
  (20660, 28628)	1.0
  (20660, 28655)	1.0
  (20660, 28686)	2.0
  (20660, 28690)	53.0
  (20660, 28691)	1.0
log1p_cpm =   (0, 6)	4.9756536
  (0, 7)	4.9756536
  (0, 10)	4.9756536
  (0, 13)	4.975

* Spliced and unspliced: RNA sequence?
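
The spliced and unspliced layers are the inputs to RNA velocity (see Main ideas). Below is a minimal sketch of the steady-state idea for a single gene: fit a ratio `gamma` between unspliced and spliced counts, then read the velocity as the residual `u - gamma * s`. The simulated counts, the value of `gamma_true`, and the plain least-squares fit through the origin are assumptions for illustration; published tools fit the ratio more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 100
gamma_true = 0.5  # made-up steady-state ratio of unspliced to spliced

# Simulated counts for one gene across cells.
s = rng.uniform(0.0, 10.0, size=n_cells)                   # spliced
u = gamma_true * s + rng.normal(0.0, 0.2, size=n_cells)    # unspliced near steady state
u[:10] += 2.0  # first 10 cells have excess unspliced RNA (gene being induced)

# Least-squares slope through the origin estimates the steady-state ratio.
gamma_hat = float(u @ s) / float(s @ s)

# Velocity: positive = more nascent RNA than expected, negative = less.
velocity = u - gamma_hat * s

print(round(gamma_hat, 3))
print(velocity[:10].mean(), velocity[10:].mean())
```

Cells with excess unspliced transcripts get a positive velocity, which is read as the gene being upregulated in those cells.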

##**7.4 uns**

---

Unstructured annotation. Key-indexed.

In [None]:
adata_sokm.uns

OverloadedDict, wrapping:
	{'_scvi': {'categorical_mappings': {'_scvi_batch': {'mapping': array([0]), 'original_key': '_scvi_batch'}, '_scvi_labels': {'mapping': array([0]), 'original_key': '_scvi_labels'}}, 'data_registry': {'X': {'attr_key': 'counts', 'attr_name': 'layers'}, 'batch_indices': {'attr_key': '_scvi_batch', 'attr_name': 'obs'}, 'labels': {'attr_key': '_scvi_labels', 'attr_name': 'obs'}, 'local_l_mean': {'attr_key': '_scvi_local_l_mean', 'attr_name': 'obs'}, 'local_l_var': {'attr_key': '_scvi_local_l_var', 'attr_name': 'obs'}}, 'scvi_version': '0.7.1', 'summary_stats': {'n_batch': 1, 'n_cells': 20661, 'n_labels': 1, 'n_proteins': 0, 'n_vars': 28694}}, 'age_colors': array(['#1f77b4', '#ff7f0e'], dtype=object), 'age_treatment_colors': array(['#1f77b4', '#aec7e8', '#ff7f0e', '#ffbb78'], dtype=object), 'animal_colors': array(['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'], dtype=object), 'draw_graph': {'params': {'layout': 'fa', 'random_state': 0}}, 'hvg': {'flavor': 'seurat'}, '

scvi
* scvi_version 0.7.1
* scvi_batch -> original key = _SCVI_BATCH
* scvi_labels -> original key = _SCVI_LABELS

data registry
* X: attribute key = COUNTS, attribute name = LAYERS
* batch_indices: attribute key = _SCVI_BATCH, attribute name = OBS
* labels: attribute key = _SCVI_LABELS, attribute name = OBS
* local_l_mean: attribute key = _SCVI_LOCAL_L_MEAN, attribute name = OBS
* local_l_var: attribute key = _SCVI_LOCAL_L_VAR, attribute name = OBS

summary stats
* n_batch = 1
* n_cells = 20661
* n_vars = 28694
* n_labels = 1
* n_proteins = 0



##**7.5 obsm**

---

Key-indexed multi-dimensional annotation of observations

In [None]:
adata_sokm.obsm
#https://www.sharpsightlabs.com/blog/numpy-axes-explained/

AxisArrays with keys: X_draw_graph_fa, X_pca, X_scvi, X_umap, velocity_draw_graph_fa, velocity_pca, velocity_umap

##**7.6 obsp**

---

Pairwise annotation of observations.

In [None]:
adata_sokm.obsp

PairwiseArrays with keys: connectivities, distances

##**7.7 var**

---

Variables. Key-indexed one-dimensional annotation. ANNOTATION OF VARIABLES/FEATURES (DATAFRAME) = GENES

In [None]:
adata_sokm.var.head()

Unnamed: 0,gene_id_orig,gene_id,gene_name,contig,gene_biotype,n_splice_sites,length,pseudogene_status,transcript_support_level,mitochondrial,transgene,highly_variable,means,dispersions,dispersions_norm
Xkr4,ENSMUSG00000051951.5,ENSMUSG00000051951,Xkr4,1,protein_coding,3,465597,False,1,False,False,False,0.297472,4.158728,-0.273348
Gm1992,ENSMUSG00000089699.1,ENSMUSG00000089699,Gm1992,1,antisense,1,46966,False,3,False,False,False,1e-12,,
Gm37381,ENSMUSG00000102343.1,ENSMUSG00000102343,Gm37381,1,lincRNA,3,80476,False,1,False,False,False,0.6201813,4.112489,-1.443341
Rp1,ENSMUSG00000025900.11,ENSMUSG00000025900,Rp1,1,protein_coding,31,409684,False,1,False,False,False,0.04510114,4.113581,-0.342157
Rp1-1,ENSMUSG00000109048.1,ENSMUSG00000109048,Rp1,1,protein_coding,3,116206,False,1,False,False,False,1e-12,,


In [None]:
# Collect the unique values of each var column to see what it encodes
columns_var = adata_sokm.var.columns
data_var = {}
for variable in columns_var:
    data_var[variable] = adata_sokm.var[variable].unique()

data_var

{'contig': ['1', '2', 'X', '3', '4', ..., 'GL456216.1', 'JH584292.1', 'JH584295.1', 'tg15', 'tg16']
 Length: 43
 Categories (43, object): ['1', '2', '3', '4', ..., 'X', 'Y', 'tg15', 'tg16'],
 'dispersions': array([4.15872816,        nan, 4.11248914, ..., 8.41482734, 5.02912026,
        7.06613639]),
 'dispersions_norm': array([-0.27334756,         nan, -1.4433409 , ...,  5.7965803 ,
         0.6841689 ,  2.8619218 ], dtype=float32),
 'gene_biotype': ['protein_coding', 'antisense', 'lincRNA', 'TR_V_gene', 'TR_V_pseudogene', ..., 'IG_V_pseudogene', 'IG_J_gene', 'IG_C_gene', 'IG_C_pseudogene', 'IG_D_gene']
 Length: 16
 Categories (16, object): ['IG_C_gene', 'IG_C_pseudogene', 'IG_D_gene', 'IG_J_gene', ...,
                           'TR_V_pseudogene', 'antisense', 'lincRNA', 'protein_coding'],
 'gene_id': array(['ENSMUSG00000051951', 'ENSMUSG00000089699', 'ENSMUSG00000102343',
        ..., 'Calico-Tg-pJCK0015-CDS1', 'Calico-Tg-pJCK0016-CDS1',
        'Calico-Tg-pJCK0016-CDS2'], dtype=obje

In [None]:
adata_sokm.n_vars

28694

In [None]:
adata_sokm.var_names

Index(['Xkr4', 'Gm1992', 'Gm37381', 'Rp1', 'Rp1-1', 'Sox17', 'Gm37323',
       'Mrpl15', 'Lypla1', 'Gm37988',
       ...
       'AC168977.2', 'AC168977.1', 'PISD', 'DHRSX', 'Vmn2r122',
       'CAAA01147332.1', 'Rn45s', 'Tet3G-T2A-mCerulean', 'Y4TF', 'mCherry'],
      dtype='object', length=28694)

##**7.8 Whole dataset**

---



In [None]:
adata_sokm.obs_names = [f"Cell_{i}" for i in range(adata_sokm.n_obs)]
adata_sokm.var_names = [f"Gene_{i}" for i in range(adata_sokm.n_vars)]
print(adata_sokm.obs_names[:10])
print(adata_sokm.var_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')
Index(['Gene_0', 'Gene_1', 'Gene_2', 'Gene_3', 'Gene_4', 'Gene_5', 'Gene_6',
       'Gene_7', 'Gene_8', 'Gene_9'],
      dtype='object')


In [None]:
adata_sokm.to_df(layer = "counts")

Unnamed: 0,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_28684,Gene_28685,Gene_28686,Gene_28687,Gene_28688,Gene_28689,Gene_28690,Gene_28691,Gene_28692,Gene_28693
Cell_0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,16.0,0.0,0.0,0.0
Cell_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,29.0,0.0,0.0,0.0
Cell_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,13.0,0.0,0.0,0.0
Cell_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,26.0,1.0,0.0,8.0
Cell_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_20656,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,44.0,0.0,0.0,0.0
Cell_20657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,...,0.0,0.0,2.0,2.0,0.0,0.0,64.0,0.0,0.0,0.0
Cell_20658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,96.0,0.0,0.0,0.0
Cell_20659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,88.0,0.0,0.0,0.0


#**8.Summary questions Murine mesenchymal stem cell polycistronic reprogramming**

Questions obs: 
* what are animals 1,2,3,unknown?
* categories 0,1,2,3? 
* batch 0,1,2,3?
* sample L1,L2,L3,L4?
* what is the difference between multiseq_derived_treatment and treatment?
* end point: last state of reprogramming ? 
* hash: animal origin. What is the meaning of H1,H2,H3,H4,H5,H6,negative?
* velocity self transition: self-renewal?
* state: 
  * cycling: doing cell-division cycle (series of events that take place in a cell that cause it to divide into two daughter cells)
  * Dox+ baseline: baseline conditions with doxycycline (dox+)
  * Col11a1 high: Mesenchyme-derived tumors 
  * reprogramming cycling: doing reprogramming cycle?
  * reprogramming: reprogrammed cell?
  * reprogramming intermediate: ?
  * Tg+ baseline: transgene positive

Questions layers:
* Spliced and unspliced: RNA sequence?

#**9.Murine adipogenic cell polycistronic reprogramming**

---



In [None]:
adata_adipo_sokm = anndata.read_h5ad("/content/drive/MyDrive/adipo_sokm.h5ad")

In [None]:
adata_adipo_sokm

## **9.1Concepts**

---



* Yamanaka factors
* Adipogenic cells
* Young vs Aged

## **9.2 obs**

---
Observations. Key-indexed one-dimensional annotation. CELLS

[Sparse matrix](https://media.geeksforgeeks.org/wp-content/uploads/Sparse-Matrix-Array-Representation1.png)

In [None]:
# inspect the covariates
adata_adipo_sokm.obs.head()

In [None]:
adata_adipo_sokm.n_obs

In [None]:
# Collect the unique values of each obs column to see what it encodes
columns_obs = adata_adipo_sokm.obs.columns
data_obs = {}
for variable in columns_obs:
    data_obs[variable] = adata_adipo_sokm.obs[variable].unique()

data_obs

In [None]:
#Barcode
adata_adipo_sokm.obs_names

Questions: 
* what is the difference between state, substate and full state?
* what is the meaning of 2FED, 2FEE, 2FEH, 2FEK (SMP)?

Concepts: 
* '10x_lane'
* 'SMP': 2FED, 2FEE, 2FEH, 2FEK
* 'age': Young or Aged
* 'age_treatment': Young Tg-,Young Tg+,Aged Tg-, Aged Tg+
* 'batch'
* 'full_state': adipogenic Hoxc10+,adipogenic Hoxc10-,reprogramming,secretory.
* 'leiden'
* 'sample'
* 'sample_cell_types': adipose
* 'sample_treatment': Tg+ or Tg-
* 'state': adipogenic,reprogramming,secretory
* 'substate': 0×0 char,Hoxc10+,Hoxc10-
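
The values listed above suggest one reading of the state / substate / full_state question: `full_state` looks like `state` followed by `substate` when a substate exists. The sketch below rebuilds it that way on a toy table. Treating the `0×0 char` entry as an empty string, and the join rule itself, are assumptions.

```python
import pandas as pd

# Toy obs table with the state/substate combinations listed above.
obs = pd.DataFrame(
    {
        "state": ["adipogenic", "adipogenic", "reprogramming", "secretory"],
        "substate": ["Hoxc10+", "Hoxc10-", "", ""],  # "" stands in for the empty entry
    }
)

# Join state and substate; strip the trailing space left by an empty substate.
obs["full_state_rebuilt"] = (obs["state"] + " " + obs["substate"]).str.strip()

print(obs)
```

On the real object, comparing such a rebuilt column with `full_state` would confirm or refute this reading.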


##**9.3 layers**

---

Key-indexed multi-dimensional arrays aligned to dimensions of X

In [None]:
adata_adipo_sokm.layers.keys()

In [None]:
for key, value in adata_adipo_sokm.layers.items():
  print(f"{key} = {value}")

Spliced and unspliced: RNA sequence?

##**9.4 uns**

---

Unstructured annotation. Key-indexed.

In [None]:
adata_adipo_sokm.uns

scvi
* scvi_version 0.7.1
* scvi_batch -> original key = _SCVI_BATCH
* scvi_labels -> original key = _SCVI_LABELS

data registry
* X: attribute key = COUNTS, attribute name = LAYERS
* batch_indices: attribute key = _SCVI_BATCH, attribute name = OBS
* labels: attribute key = _SCVI_LABELS, attribute name = OBS
* local_l_mean: attribute key = _SCVI_LOCAL_L_MEAN, attribute name = OBS
* local_l_var: attribute key = _SCVI_LOCAL_L_VAR, attribute name = OBS




##**9.5 obsm**

---

Key-indexed multi-dimensional annotation of observations

In [None]:
adata_adipo_sokm.obsm
#https://www.sharpsightlabs.com/blog/numpy-axes-explained/

##**9.6 obsp**

---

Pairwise annotation of observations.

In [None]:
adata_adipo_sokm.obsp

##**9.7 var**

---

Variables. Key-indexed one-dimensional annotation. ANNOTATION OF VARIABLES/FEATURES (DATAFRAME) = GENES

In [None]:
adata_adipo_sokm.var.head()

In [None]:
# Collect the unique values of each var column to see what it encodes
columns_var = adata_adipo_sokm.var.columns
data_var = {}
for variable in columns_var:
    data_var[variable] = adata_adipo_sokm.var[variable].unique()

data_var

In [None]:
adata_adipo_sokm.n_vars

In [None]:
adata_adipo_sokm.var_names

Questions var: 
* what is the difference between geneid-0,1,2,3...? 
* what is n_cells? and p_cells? 

##**9.8 Whole dataset**

---



In [None]:
adata_adipo_sokm.obs_names = [f"Cell_{i}" for i in range(adata_adipo_sokm.n_obs)]
adata_adipo_sokm.var_names = [f"Gene_{i}" for i in range(adata_adipo_sokm.n_vars)]
print(adata_adipo_sokm.obs_names[:10])
print(adata_adipo_sokm.var_names[:10])

In [None]:
adata_adipo_sokm.to_df(layer = "counts")

#**10.Summary questions Murine adipogenic cell polycistronic reprogramming**

Questions obs: 
* what is the difference between state, substate and full state?
* what is the meaning of 2FED, 2FEE, 2FEH, 2FEK (SMP)?

Questions var: 
* what is the difference between geneid-0,1,2,3...? 
* what is n_cells? and p_cells? 