## Integration
First I read all the filtered samples and combine them into a single object.  
To preserve the clustering that was done per sample, I append the sample ID as a suffix to each cluster label.  
A for loop loads each file and appends the sample ID to its cluster labels.
```python
adata_dict = {}
for sample in sample_names:
    adata_dict[sample] = sc.read(output_prefix+species+'/'+genome+'/'+sample+'_after_filtering.h5ad')
    adata_dict[sample].obs['leiden_post_QC'] = adata_dict[sample].obs['leiden_post_QC'].astype(str) + '_'+sample
adata = ad.concat(adata_dict)
```
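The reason for the suffix can be shown with a minimal sketch on toy data. The two small tables, the sample IDs `A` and `B`, and the helper `tag_clusters` are hypothetical and only stand in for the per-sample `obs` tables; only the column name `leiden_post_QC` comes from the code above.

```python
import pandas as pd

# Two toy per-sample cluster assignments; both reuse the labels "0" and "1".
obs_a = pd.DataFrame({"leiden_post_QC": ["0", "1", "1"]}, index=["a1", "a2", "a3"])
obs_b = pd.DataFrame({"leiden_post_QC": ["0", "0", "1"]}, index=["b1", "b2", "b3"])


def tag_clusters(obs, sample):
    """Return a copy of obs with the sample ID appended to each cluster label."""
    out = obs.copy()
    out["leiden_post_QC"] = out["leiden_post_QC"].astype(str) + "_" + sample
    return out


# Without the suffix, cluster "0" of sample A and cluster "0" of sample B collide.
naive = pd.concat([obs_a, obs_b])
# With the suffix, every per-sample cluster keeps its own label.
tagged = pd.concat([tag_clusters(obs_a, "A"), tag_clusters(obs_b, "B")])

n_naive = naive["leiden_post_QC"].nunique()
n_tagged = tagged["leiden_post_QC"].nunique()
```

In this toy case the untagged labels collapse into two clusters, while the tagged labels keep all four per-sample clusters apart.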
The final line of the block above concatenates the dictionary of AnnData objects into a single object.  
Because the objects are concatenated along the obs axis, the annotation columns in var are dropped.  
To remedy this I copy var from one of the samples, keeping only the columns that pertain solely to the genes; anything sample specific is discarded.  
```python
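# Restore gene-level annotations: the first 11 var columns are not sample specific
adata.var = adata_dict['ME12'].var.iloc[:, :11]
# Reset X to the original counts and save the combined object
adata.X = adata.layers['original_counts'].copy()
adata.write(output_prefix+species+'/'+genome+'/'+'adata_concat.h5ad')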
```
Finally I reset adata.X to the original counts layer and write the combined object to a file.  
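Selecting the gene-level columns by position depends on the column order of the per-sample files. As an alternative, the sketch below keeps every var column whose values are identical across all samples. The helper `shared_var_columns` and the toy tables are hypothetical and not part of the pipeline above. `anndata.concat` also accepts a `merge` argument (for example `merge='same'`) that keeps such columns during concatenation; whether that fits this dataset is untested here.

```python
import pandas as pd

genes = ["Sox10", "Epcam", "Dcc"]
# Toy var tables: "gene_ids" is a property of the gene, "n_cells" differs per sample.
var_a = pd.DataFrame({"gene_ids": ["g1", "g2", "g3"], "n_cells": [10, 5, 7]}, index=genes)
var_b = pd.DataFrame({"gene_ids": ["g1", "g2", "g3"], "n_cells": [3, 8, 2]}, index=genes)


def shared_var_columns(var_tables):
    """Return the columns whose values are identical in every per-sample table."""
    first = var_tables[0]
    keep = [
        col
        for col in first.columns
        if all(
            col in v.columns and first[col].equals(v.loc[first.index, col])
            for v in var_tables[1:]
        )
    ]
    return first[keep].copy()


# Only "gene_ids" survives; the sample-specific "n_cells" column is discarded.
gene_level = shared_var_columns([var_a, var_b])
```

The result could be assigned to `adata.var` in place of the positional slice, provided the gene order matches that of the concatenated object.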

### Trying soft integration with batch-aware Highly Variable Genes (HVGs)
In the past it has proven possible to soft-integrate different samples using batch-aware HVGs.  
Generally, selecting a small number of HVGs (e.g. 500) was enough to bring similar samples together.  
The signal from these genes was strong enough to outweigh the technical signal from batch effects, so the batch effects largely disappeared.  
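The idea behind batch-aware selection can be illustrated with a simplified sketch: HVGs are picked within each batch separately, and genes are then ranked by the number of batches that selected them. The function `batch_aware_hvgs`, the variance-to-mean ratio used as the variability measure, and the toy matrix are assumptions for illustration. This is not scanpy's implementation, which as far as I understand follows a similar select-per-batch-then-merge scheme when `batch_key` is set.

```python
import numpy as np


def batch_aware_hvgs(X, batches, n_top):
    """Rank genes by how many batches select them as highly variable.

    X is a cells x genes array and batches holds one batch label per cell.
    Variability is the variance-to-mean ratio within each batch (a simplification).
    """
    batches = np.asarray(batches)
    n_selected = np.zeros(X.shape[1], dtype=int)
    dispersion_sum = np.zeros(X.shape[1])
    for b in np.unique(batches):
        Xb = X[batches == b]
        mean = Xb.mean(axis=0)
        disp = Xb.var(axis=0) / np.maximum(mean, 1e-12)
        top = np.argsort(disp)[::-1][:n_top]  # HVGs within this batch
        n_selected[top] += 1
        dispersion_sum += disp
    # Primary key: number of batches; tie-breaker: summed dispersion.
    order = np.lexsort((-dispersion_sum, -n_selected))
    return order[:n_top], n_selected


# Gene 0 varies in both batches, gene 1 only in batch A, gene 2 only in batch B.
X = np.array([
    [0, 0, 1, 1], [4, 3, 1, 1], [0, 0, 1, 1], [4, 3, 1, 1],  # batch A
    [0, 1, 0, 1], [4, 1, 3, 1], [0, 1, 0, 1], [4, 1, 3, 1],  # batch B
], dtype=float)
batches = ["A"] * 4 + ["B"] * 4

hvgs, n_selected = batch_aware_hvgs(X, batches, n_top=2)
```

Gene 0, which is variable in both toy batches, ranks first; genes that vary in only one batch are pushed down the list, which is why a short HVG list tends to favor shared biological signal over sample-specific signal.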

![](../markdown_images/mouse/mm10/markdown_plots/soft_integration_umap_sample.png)   
Applying this to our data, however, proved less successful. The samples, ME8 in particular, proved too different and showed no overlap with the other samples.  
The first two PCs in the PCA explain far more variance than the rest. Further investigation shows that they reflect the samples (PC1) and the difference between mesenchyme and brain (PC2).  
Clearly the batch effect is still very present after soft integration.  

![](../markdown_images/mouse/mm10/markdown_plots/soft_integration_pca_variance.png)   
![](../markdown_images/mouse/mm10/markdown_plots/soft_integration_pca_loading.png)   
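One way to put a number on how strongly a PC reflects the samples is the fraction of its variance explained by the sample labels (eta squared from a one-way ANOVA). The sketch below is an illustration on toy scores, not the analysis that produced the figures above; the function name `variance_explained_by_group` and the toy values are hypothetical.

```python
import numpy as np


def variance_explained_by_group(scores, groups):
    """Fraction of variance in one PC's scores explained by group labels (eta squared)."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    grand_mean = scores.mean()
    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_between = sum(
        (groups == g).sum() * (scores[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )
    return ss_between / ss_total


# Toy data: "PC1" separates the two samples, "PC2" varies only within samples.
samples = np.array(["ME8", "ME8", "ME9", "ME9"])
pc1 = np.array([-5.0, -5.0, 5.0, 5.0])
pc2 = np.array([-1.0, 1.0, -1.0, 1.0])

eta_pc1 = variance_explained_by_group(pc1, samples)  # close to 1: sample driven
eta_pc2 = variance_explained_by_group(pc2, samples)  # close to 0: not sample driven
```

On the real object the scores would come from `adata.obsm['X_pca'][:, 0]` and the labels from `adata.obs['sample']`; a value near 1 for PC1 would support the reading that it mainly reflects the samples.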


In [None]:
import logging
logging.getLogger('matplotlib.font_manager').setLevel(logging.ERROR)
import scanpy as sc
import anndata as ad
import scvelo as scv
import scvi
import seaborn as sns
import plotly.express as px
import numpy as np
from dash import Dash, dcc, html, Input, Output

import pandas as pd

import os
import sys
import time
import gc
os.environ['R_HOME'] = sys.exec_prefix+"/lib/R/"

# Plotting
import matplotlib
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.colors import LinearSegmentedColormap, ListedColormap
from matplotlib.lines import Line2D 

from copy import copy
reds = copy(mpl.cm.Reds)
reds.set_under("lightgray")

project_directory = '/Cranio_Lab/Louk_Seton/4_species_project'
os.chdir(os.path.expanduser("~")+project_directory)

In [None]:
##mouse mm10
start_time=time.strftime("%Y_%m_%d-%I_%M_%S_%p")
print('start time:',start_time)

sample_names = ['ME8','ME9','ME10','ME11','ME12'] #specify the sample names
species = 'mouse' #specify the species
genome = 'mm10' #specify the genome
output_prefix = 'h5ad_files/' #specify the location of the h5ad files

adata_dict = {}
for sample in sample_names:
    adata_dict[sample] = sc.read(output_prefix+species+'/'+genome+'/'+sample+'_after_filtering.h5ad')
    adata_dict[sample].obs['leiden_post_QC'] = adata_dict[sample].obs['leiden_post_QC'].astype(str) + '_'+sample
    
adata = ad.concat(adata_dict)
adata.var = adata_dict['ME12'].var.iloc[:, :11] # keep only the gene-level columns
adata.obs['leiden_post_QC'] = adata.obs['leiden_post_QC'].astype('category')
del adata.obs['leiden_filt']
del adata.obs['leiden_filt_highres']
del adata.obs['leiden_post_QC_highres']

adata.X = adata.layers['original_counts'].copy()

adata.write(output_prefix+species+'/'+genome+'/'+'adata_concat.h5ad')

In [None]:
##highly variable genes
#sc.pp.highly_variable_genes(adata,flavor = 'seurat_v3',batch_key='sample', n_top_genes=500,)

sc.pp.normalize_total(adata) # Normalizing to median total counts
sc.pp.log1p(adata) # Logarithmize the data
sc.pp.highly_variable_genes(adata,batch_key='sample', n_top_genes=500,)
adata.layers["normalized_counts"] = adata.X.copy()

##dimensionality reduction and clustering
sc.tl.pca(adata)
sc.pp.neighbors(adata)
#sc.tl.umap(adata)
sc.tl.umap(adata, min_dist = 0.2, negative_sample_rate=0.2)
sc.tl.leiden(adata,)

##cell cycle scoring
print(time.strftime("%Y_%m_%d-%I_%M_%S_%p"),'Prepare cell cycle scoring')
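with open('required_files/regev_lab_cell_cycle_genes.txt') as f: # close the file after reading
    cell_cycle_genes = [x.strip() for x in f]
s_genes = cell_cycle_genes[:43] # first 43 entries are the S phase genes
g2m_genes = cell_cycle_genes[43:] # remaining entries are the G2/M phase genes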
print(time.strftime("%Y_%m_%d-%I_%M_%S_%p"),'Cell cycle scoring')
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)

In [None]:
## code for any markdown figures ##

species = 'mouse' #specify the species
genome = 'mm10' #specify the genome

prefix = 'soft_integration'

output_dir = 'markdown_images/'+species+'/'+genome+'/markdown_plots/'
!mkdir -p {output_dir}

plt.rcParams['figure.figsize'] = [4,3]
ax = sc.pl.umap(adata, color = ['sample'], show=False)
plt.savefig(output_dir+prefix+'_umap_sample.png', dpi = 80,bbox_inches='tight')
plt.close()

plt.rcParams['figure.figsize'] = [5,3]
ax = sc.pl.pca_variance_ratio(adata,n_pcs = 40, show = False)
plt.savefig(output_dir+prefix+'_pca_variance.png', dpi = 80,bbox_inches='tight')
plt.close()

plt.rcParams['figure.figsize'] = [4,3]
ax = sc.pl.pca(adata,color = ['sample','Dcc','Cped1'], ncols = 2, show = False)
plt.savefig(output_dir+prefix+'_pca_loading.png', dpi = 80,bbox_inches='tight')
plt.close()

In [None]:
sc.pl.umap(adata, color = ['total_counts','n_genes_by_counts','pct_counts_mt','pct_counts_hb'], vmin = 0.05, ncols = 2)

In [None]:
sc.pl.umap(adata, color = ['sample','leiden','phase','Sox10','Epcam','Wnt6','Alx4','Dlx2','Tfap2b','Plp1','Mitf','Sox2','Cped1','Dcc'],cmap = reds, vmin = 0.05, ncols = 2)