## Summary  
Here we integrate our new ectoderm data (E8.5-E12.5) with our previous dataset (E10.5-E14.5).  
Previously we had removed clusters from our ectodermal cells that were less relevant to us.  
However, the new data we are integrating with should still contain the populations we removed previously.  
So I went back and selected all the Epcam-positive clusters from our previous dataset. I also transferred the labels from the ectoderm we had annotated.  
To make sure everything transferred properly when merging the objects, I kept only the genes shared by both datasets, reset `X` to the raw counts and reused the `var` table of the new dataset.  
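
The label transfer mentioned above can be sketched with plain pandas. This is a minimal illustration, not the exact code used for the merge: the helper `transfer_labels`, the barcodes and the label values are all made up, and it assumes the annotated object has unique cell barcodes in its index.

```python
import pandas as pd


def transfer_labels(target_obs, source_obs, column):
    """Copy `column` from source_obs onto target_obs by matching index values.

    Cells that are absent from source_obs end up with a missing value.
    Hypothetical helper for illustration; assumes source_obs has a unique index.
    """
    out = target_obs.copy()
    # reindex aligns the annotated labels to the target barcodes.
    out[column] = source_obs[column].reindex(out.index)
    return out


# Toy stand-ins for adata.obs and adata_ecto.obs (barcodes and labels are invented).
all_obs = pd.DataFrame({"sample": ["10", "11", "12"]}, index=["c1", "c2", "c3"])
ecto_obs = pd.DataFrame({"annotation_ecto": ["label_a", "label_b"]}, index=["c2", "c9"])

labelled = transfer_labels(all_obs, ecto_obs, "annotation_ecto")
print(labelled)
```

Only `c2` receives a label here; `c1` and `c3` stay unannotated, and `c9` is ignored because it is not in the target object.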

With this I applied our previous strategy of calculating highly variable genes (HVGs) twice: once on the TFs only and once on the remaining genes. The combined list,  
enriched in TFs, is then used as our HVG set. I still want to explore a better way of identifying HVGs; they are quite crucial to the analysis  
but usually overlooked. Regardless, with this set I integrated the two datasets using MIRA.  
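
As a rough sketch of how the two HVG passes are combined, the snippet below merges a TF list and a non-TF list and turns the result into a boolean flag per gene. The helper names `combine_hvg_lists` and `flag_selected` are hypothetical, and which genes count as highly variable here is a toy choice, not a result.

```python
def combine_hvg_lists(tf_hvgs, other_hvgs):
    """Concatenate two gene lists, dropping duplicates while keeping order."""
    return list(dict.fromkeys([*tf_hvgs, *other_hvgs]))


def flag_selected(var_names, selected):
    """Return one boolean per gene in var_names: True if the gene was selected."""
    selected = set(selected)
    return [gene in selected for gene in var_names]


# Toy inputs: the HVG calls below are invented for illustration only.
var_names = ["Sox2", "Pax6", "Epcam", "Wnt6", "Shh", "Fgf8"]
tf_hvgs = ["Sox2", "Pax6"]      # pass 1: HVGs among the TFs
other_hvgs = ["Wnt6", "Shh"]    # pass 2: HVGs among the remaining genes

combined = combine_hvg_lists(tf_hvgs, other_hvgs)
mask = flag_selected(var_names, combined)
print(combined)
print(mask)
```

The resulting mask plays the role of a column such as `adata.var['highly_variable_list']`, which can then be handed to the model as its highly variable key.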

First I ran a loop to try different integration parameters. In the past I found that the number of topics and the number of epochs are the most influential.  
I ran every topic count below with every epoch count (30 runs in total).  
topics = [8,11,12,14,16,19]  
epochs = [50,75,100,125,150]  

I saved the results in /groups/mpistaff/Cranio_Lab/Louk_Seton/4_species_project/figures_ignore/mouse/mm39/mira_integration_tuning/andrea_ecto  
The files are named by topic number; each page corresponds to one epoch count, in the order listed above.  
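
To find a given run in those PDFs, the sweep can be enumerated as below. The `<topic>_trials.pdf` pattern follows the tuning loop further down in this notebook; the helpers `pdf_name` and `page_number` are hypothetical names added for illustration.

```python
from itertools import product

topics = [8, 11, 12, 14, 16, 19]
epochs = [50, 75, 100, 125, 150]


def pdf_name(topic):
    """File name used for one topic count (one PDF per topic)."""
    return f"{topic}_trials.pdf"


def page_number(epoch):
    """1-based page of a given epoch count inside a topic's PDF."""
    return epochs.index(epoch) + 1


# Every (topic, epoch) combination that was run.
grid = list(product(topics, epochs))
print(len(grid))                     # 30
print(pdf_name(8), page_number(50))  # 8_trials.pdf 1
```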

Going over the results, I found 8 topics with 50 epochs to look most promising. It seemed to show some of the relations between cells that I expected.  
Most of the results looked similar; I am mainly looking for an embedding that will show the velocity well. This view might change later.  

For now I will work with the 8 topics, 50 epochs model.

In [None]:
#ensure cuda is working
import torch
assert torch.cuda.is_available()
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.device(0))
print(torch.cuda.get_device_name(0))

In [None]:
import mira

import anndata
import scanpy as sc
import numpy as np
import pandas as pd
import scvelo as scv

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
import matplotlib
import matplotlib as mpl
from copy import copy
reds = copy(mpl.cm.Reds)
reds.set_under("lightgray")

import os
import sys
from pathlib import Path
os.environ['R_HOME'] = sys.exec_prefix+"/lib/R/"

project_directory = '/Cranio_Lab/Louk_Seton/4_species_project'
os.chdir(os.path.expanduser("~")+project_directory)

In [None]:
seed = 666
import random
random.seed(seed)
np.random.seed(seed)

In [None]:
# #new merge
# #load in all data and show ecto
# adata = sc.read('../mesenchyme_project_2023/anndata_objects/dataset_cleaned.h5ad')
# adata = adata[adata.obs['sample'].isin(['10','11','12','13','14'])].copy()

# adata.layers['original_counts'] = adata.X.copy()
# sc.pp.normalize_total(adata) # Normalizing to median total counts
# sc.pp.log1p(adata) # Logarithmize the data
# adata.layers["normalized_counts"] = adata.X.copy()

# ##highly variable genes
# sc.pp.highly_variable_genes(adata, n_top_genes=1000,)

# ##dimensionality reduction and clustering
# sc.tl.pca(adata)
# sc.pp.neighbors(adata)
# sc.tl.umap(adata)
# sc.tl.leiden(adata,resolution = .3, key_added = 'leiden')

# sc.pl.umap(adata,color = ['Epcam','sample','leiden',
#                          ], ncols = 3, 
#            groups = ['3','11'],
#            cmap = reds, vmin = 0.05)

# #subset ecto
# adata = adata[adata.obs['leiden'].isin(['3','11'])].copy()

# cell_cycle_genes = [x.strip() for x in open('required_files/regev_lab_cell_cycle_genes.txt')]
# s_genes = cell_cycle_genes[:43]
# g2m_genes = cell_cycle_genes[43:]
# sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)

# #copy over annotation from before
# adata_ecto = sc.read('../mesenchyme_project_2023/anndata_objects/ectoderm_interactive.h5ad')
# adata.obs['annotation_ecto'] = np.nan
# adata.obs.loc[adata_ecto[adata[adata_ecto[adata_ecto.obs.index.isin(adata.obs.index)].obs.index].obs.index].obs.index,'annotation_ecto'] = list(adata_ecto[adata[adata_ecto[adata_ecto.obs.index.isin(adata.obs.index)].obs.index].obs.index].obs['annotation_ecto']) #Socs2 Dlk1 Dlx5 Hey1 Eya4 Will become neuronal and non neuronal progenitors

# sc.pl.umap(adata,color = ['Epcam','sample','leiden','annotation_ecto'
#                          ], ncols = 3, 
#            #groups = ['3','11'],
#            cmap = reds, vmin = 0.05)

# adata_ecto_new = sc.read('h5ad_files/mouse/mm39/adata_mm39_epcam_concat.h5ad')
# adata_ecto_new = adata_ecto_new[:,adata[:,adata.var.index.isin(adata_ecto_new.var.index)].var.index].copy()
# adata = adata[:,adata_ecto_new[:,adata_ecto_new.var.index.isin(adata_ecto_new.var.index)].var.index].copy()

# adata.X = adata.layers['original_counts'].copy()

# import anndata as ad
# ecto_combined = ad.concat([adata,adata_ecto_new],join = 'outer')
# ecto_combined.var = adata_ecto_new.var

# ouput_dir = 'h5ad_files/mouse/ecto_andrea/'
# !mkdir -p {ouput_dir}
# ecto_combined.write(ouput_dir+'ecto_combined_kaucka_all.h5ad')

In [None]:
##old merge
# adata_ecto_new = sc.read('h5ad_files/mouse/mm39/adata_mm39_epcam_concat.h5ad')
# adata_ecto = sc.read('../mesenchyme_project_2023/anndata_objects/ectoderm_interactive.h5ad')

# adata_ecto = adata_ecto[adata_ecto.obs['sample'].isin(['10','11','12','13','14'])].copy()
# adata_ecto.obs = adata_ecto.obs.loc[:,['sample','barcode','batch','doublet_score','n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rb', 'pct_counts_rb',
#                    'phase','S_score','G2M_score','annotation_ecto']]

# adata_all = sc.read('../mesenchyme_project_2023/anndata_objects/all_cells_unfiltered.h5ad')
# adata_all = adata_all[adata_ecto[adata_ecto.obs.index.isin(adata_all.obs.index)].obs.index].copy() #subset by ecto

# ##keep matching genes
# adata_ecto_new = adata_ecto_new[:,adata_all[:,adata_all.var.index.isin(adata_ecto_new.var.index)].var.index].copy()
# adata_all = adata_all[:,adata_ecto_new[:,adata_ecto_new.var.index.isin(adata_ecto_new.var.index)].var.index].copy()

# adata_all.obs = adata_ecto.obs #copy over obs
# adata_ecto = adata_all.copy() #replace with raw cells
# adata_ecto.layers['original_counts'] = adata_ecto.X.copy()
# import anndata as ad
# ecto_combined = ad.concat([adata_ecto,adata_ecto_new],join = 'outer')
# ecto_combined.var = adata_ecto_new.var

# ouput_dir = 'h5ad_files/mouse/ecto_andrea/'
# !mkdir -p {ouput_dir}
# ecto_combined.write(ouput_dir+'ecto_combined.h5ad')

In [None]:
ouput_dir = 'h5ad_files/mouse/ecto_andrea/'

adata = sc.read(ouput_dir+'ecto_combined_kaucka_all.h5ad')

In [None]:
# Flag transcription factors using the mouse TF list.
with open('required_files/allTFs_mm.txt') as f:
    tf_list = [line.rstrip('\n') for line in f]
adata.var['TF'] = adata.var.index.isin(tf_list)

In [None]:
# First pass: keep only the TFs.
adata = adata[:, adata.var['TF'].to_numpy(dtype=bool)].copy()

In [None]:
#sc.pp.filter_genes(adata, min_cells=15)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
#sc.pp.highly_variable_genes(adata, min_disp = 0.5)
sc.pp.highly_variable_genes(adata, min_disp = 0.5,batch_key='sample',
                            #n_top_genes=1000
                           )

sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=6)
sc.tl.umap(adata, min_dist = 0.2, negative_sample_rate=0.2)
sc.pl.umap(adata, color = ['sample',], frameon=False)

In [None]:
# HVGs found among the TFs.
TF_highly_variable = list(adata.var_names[adata.var['highly_variable']])


In [None]:
ouput_dir = 'h5ad_files/mouse/ecto_andrea/'

adata = sc.read(ouput_dir+'ecto_combined_kaucka_all.h5ad')
with open('required_files/allTFs_mm.txt') as f:
    tf_list = [line.rstrip('\n') for line in f]
adata.var['TF'] = adata.var.index.isin(tf_list)
# Second pass: keep only the non-TF genes.
adata = adata[:, ~adata.var['TF'].to_numpy(dtype=bool)].copy()
#sc.pp.filter_genes(adata, min_cells=15)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
#sc.pp.highly_variable_genes(adata, min_disp = 0.5)
# Note: with n_top_genes set, scanpy ignores the min_disp cutoff and returns the top 500 genes.
sc.pp.highly_variable_genes(adata, min_disp = 0.5,batch_key='sample',
                            n_top_genes=500
                           )

sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=6)
sc.tl.umap(adata, min_dist = 0.2, negative_sample_rate=0.2)
sc.pl.umap(adata, color = ['sample','phase'], frameon=False)

# HVGs found among the non-TF genes.
rest_highly_variable = list(adata.var_names[adata.var['highly_variable']])


In [None]:
highly_variable_list = TF_highly_variable+rest_highly_variable

In [None]:
len(highly_variable_list)

In [None]:
ouput_dir = 'h5ad_files/mouse/ecto_andrea/'

adata = sc.read(ouput_dir+'ecto_combined_kaucka_all.h5ad')
# Mark the combined TF + non-TF HVG set; this is the key passed to MIRA below.
adata.var['highly_variable_list'] = adata.var.index.isin(highly_variable_list)
#sc.pp.filter_genes(adata, min_cells=15)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
#sc.pp.highly_variable_genes(adata, min_disp = 0.5)
# This HVG call only feeds the preview PCA/UMAP below; MIRA uses 'highly_variable_list'.
sc.pp.highly_variable_genes(adata, min_disp = 0.5,batch_key='sample',
                            #n_top_genes=500
                           )

sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=6)
sc.tl.umap(adata, min_dist = 0.2, negative_sample_rate=0.2)
sc.pl.umap(adata, color = ['sample','phase'], frameon=False)

In [None]:
adata.var['highly_variable_list'].value_counts()

In [None]:
## MIRA expression topic model setup
model = mira.topics.make_model(
    adata.n_obs, adata.n_vars, # helps MIRA choose reasonable values for some hyperparameters which are not tuned.
    feature_type = 'expression',
    #highly_variable_key='TF',
    highly_variable_key = 'highly_variable_list',
    counts_layer='original_counts',
    categorical_covariates='sample',
    continuous_covariates= ['S_score','G2M_score'],
    #max_learning_rate = 0.1
)

In [None]:
model.get_learning_rate_bounds(adata)

In [None]:
model.set_learning_rates(1e-3, 0.1) # for larger datasets, the default of 1e-3, 0.1 usually works well.
model.plot_learning_rate_bounds(figsize=(7,3))

In [None]:
# ## quick loop to try out some different parameters
# from matplotlib.backends.backend_pdf import PdfPages

# topics = [8,11,12,14,16,19]
# epochs = [50,75,100,125,150]
# output_dir = 'figures_ignore/mouse/mm39/mira_integration_tuning/andrea_ecto/'
# for topic in topics:
#     with PdfPages(output_dir+str(topic)+'_trials.pdf') as pdf:
#         for epoch in epochs:
#             model = model.set_params(num_topics = topic,num_epochs = epoch).fit(adata)
#             model.predict(adata,)
#             sc.pp.neighbors(adata, use_rep = 'X_umap_features', metric = 'manhattan',n_neighbors=15)
#             sc.tl.umap(adata, )

#             plt.rcParams['figure.figsize'] = [5,4]
#             ax = sc.pl.umap(adata, color = ['sample','phase','Hesx1','Sox10','Fezf1','Fezf2','Pax9','Shh','Sox2','Pax6','Wnt6'], cmap = reds,ncols = 2, vmin = 0.05, show = False)
#             for p in ax:
#                 p.set_rasterized(True)
#             pdf.savefig(dpi=150,bbox_inches='tight')
#             plt.close()

In [None]:
topic_contributions = mira.topics.gradient_tune(model, adata)

In [None]:
NUM_TOPICS = 20

mira.pl.plot_topic_contributions(topic_contributions, NUM_TOPICS)

In [None]:
# 8 topics with 50 epochs looked most promising in the tuning sweep (see Summary).
NUM_TOPICS = 8 #24
model = model.set_params(num_topics = NUM_TOPICS,num_epochs = 50).fit(adata)

In [None]:
model.predict(adata,)

In [None]:
sc.pp.neighbors(adata, use_rep = 'X_umap_features', metric = 'manhattan',n_neighbors=15)
#sc.tl.umap(adata, min_dist=0.1, negative_sample_rate=0.05,)
#sc.tl.umap(adata, min_dist=0.3, negative_sample_rate=0.05,n_components =3)
sc.tl.umap(adata, )



In [None]:
sc.tl.leiden(adata)
sc.tl.leiden(adata,resolution = 2, key_added = 'leiden_high')


In [None]:
sc.pl.umap(adata, color = ['sample','Emx2'],groups = ['ME8','ME9','ME10'])

In [None]:
sc.pl.umap(adata, color = ['sample','annotation_ecto','leiden','leiden_high','Sox10','phase','Wnt6',
                           'Pax6','Sox2','Hesx1','Fezf1','Fezf2','Fgf8','Lhx3','Emx2','Foxg1','Pitx3','Pitx1',
                           'Dlx5','Six6','Pitx2'], cmap = reds, vmin = 0.05, ncols = 4)

In [None]:
sc.pl.umap(adata, color = ['sample','annotation_ecto','Aldh1a3','Casr','Hmx1','Tlx2','Spink1','Pax9','Shh','T','Thrb'],cmap = reds, vmin = 0.05)

In [None]:
adata.write('h5ad_files/mouse/ecto_andrea/ecto_combined_all_integrated.h5ad')