# Single-cell RNA Sequencing of Fibrotic Skin Disease Dermis Tissue and Normal Dermis Tissue: Preprocessing

**Data Source Acknowledgment:**
The dataset is sourced from GSE163973. It comprises single-cell RNA sequencing data from keloid (n = 3 samples) and normal scar (n = 3 samples) dermis tissue.

Reference: Deng CC, Hu YF, Zhu DH, Cheng Q et al. Single-cell RNA-seq reveals fibroblast heterogeneity and increased mesenchymal fibroblasts in human fibrotic skin diseases. Nat Commun 2021 Jun 17;12(1):3709. PMID: 34140509

Within this repository, the dataset is used solely for Python practice. This notebook uses it to practice data cleaning and clustering.

In [2]:
import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc
import seaborn as sns
import scvi
import os
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from data_preprocessing import data_preprocessing
for file in os.listdir('./'):
    if 'matrix' in file:
        print(file)

KF3_matrix
KF1_matrix
NF2_matrix
KF2_matrix
NF3_matrix
NF1_matrix


# Preprocessing
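
The six per-sample cells below repeat the same call on each directory listed above. As a minimal sketch, the directory names can be mapped to sample labels and conditions in one place. The mapping of `KF` to keloid and `NF` to normal scar is an assumption inferred from the dataset description, and `describe_samples` is a hypothetical helper that is not part of `data_preprocessing`.

```python
import re

# Assumed meaning of the directory prefixes (not stated in the notebook).
CONDITION_BY_PREFIX = {"KF": "keloid", "NF": "normal scar"}


def describe_samples(entries):
    """Map names like 'KF1_matrix' to (sample, condition), sorted by sample."""
    samples = []
    for name in entries:
        match = re.fullmatch(r"(KF|NF)(\d+)_matrix", name)
        if match is None:
            continue  # skip anything that is not a sample directory
        prefix, index = match.groups()
        samples.append((f"{prefix}{index}", CONDITION_BY_PREFIX[prefix]))
    return sorted(samples)


# Stand-in for os.listdir('./'); the last entry is a non-sample file.
entries = ["KF3_matrix", "KF1_matrix", "NF2_matrix",
           "KF2_matrix", "NF3_matrix", "NF1_matrix", "notebook.ipynb"]
for sample, condition in describe_samples(entries):
    print(sample, condition)
```

A list like this could drive a single loop over `data_preprocessing` instead of one cell per sample, at the cost of losing the per-sample output cells shown here.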

### 1.1. KF1

In [3]:
adata_KF1 = data_preprocessing('./KF1_matrix')

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [25:22<00:00,  3.69s/it, v_num=1, train_loss_step=649, train_loss_epoch=667]

`Trainer.fit` stopped: `max_epochs=400` reached.


INFO     Creating doublets, preparing SOLO model.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 202/400:  50%|██████████████████████████████████████████████▍                                             | 202/400 [02:06<02:04,  1.60it/s, v_num=1, train_loss_step=0.293, train_loss_epoch=0.286]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.269. Signaling Trainer to stop.




In [4]:
adata_KF1.write_h5ad('KF1.h5ad')



In [5]:
adata_KF1

AnnData object with n_obs × n_vars = 8171 × 21000
    obs: 'sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

### 1.2. KF2

In [6]:
adata_KF2 = data_preprocessing('./KF2_matrix')

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [24:49<00:00,  3.52s/it, v_num=1, train_loss_step=679, train_loss_epoch=697]

`Trainer.fit` stopped: `max_epochs=400` reached.


INFO     Creating doublets, preparing SOLO model.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 277/400:  69%|███████████████████████████████████████████████████████████████▋                            | 277/400 [02:42<01:12,  1.71it/s, v_num=1, train_loss_step=0.483, train_loss_epoch=0.312]
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.295. Signaling Trainer to stop.




In [7]:
adata_KF2.write_h5ad('KF2.h5ad')



In [9]:
adata_KF2

AnnData object with n_obs × n_vars = 7401 × 20770
    obs: 'sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
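
Both SOLO runs so far ended before epoch 400 with the message that `validation_loss` did not improve in the last 30 records. The sketch below shows the general patience-based early-stopping rule in plain Python. It illustrates the idea only and is not Lightning's implementation; `early_stop_epoch` and the toy loss values are hypothetical.

```python
def early_stop_epoch(losses, patience=30, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Toy curve: improves for 5 epochs, then plateaus slightly above the best value.
losses = [0.5, 0.4, 0.35, 0.3, 0.27] + [0.28] * 10
print(early_stop_epoch(losses, patience=3))
```

Under this rule the stopping epoch is always `patience` epochs after the best one, which would be consistent with, for example, a run that stops at epoch 202 having reached its best score around epoch 172.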

### 1.3. KF3

In [3]:
adata_KF3 = data_preprocessing('./KF3_matrix')

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|█| 400/400 [24:47<00:00,  3.29s/it, v_num=1, train_loss_step

`Trainer.fit` stopped: `max_epochs=400` reached.


INFO     Creating doublets, preparing SOLO model.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 225/400:  56%|▌| 225/400 [02:02<01:35,  1.83it/s, v_num=1, train_loss_step
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.361. Signaling Trainer to stop.




In [4]:
adata_KF3.write_h5ad('KF3.h5ad')



### 1.4. NF1

In [6]:
adata_NF1 = data_preprocessing('./NF1_matrix')

  _verify_and_correct_data_format(adata, self.attr_name, self.attr_key)
  library_log_means, library_log_vars = _init_library_size(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|█| 400/400 [18:48<00:00,  2.76s/it, v_num=1, train_loss_step

`Trainer.fit` stopped: `max_epochs=400` reached.


INFO     Creating doublets, preparing SOLO model.


  latent_adata = AnnData(np.concatenate([latent_rep, np.log(lib_size)], axis=1))
  _verify_and_correct_data_format(adata, self.attr_name, self.attr_key)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 1/400:   0%| | 1/400 [00:00<03:22,  1.97it/s, v_num=1, train_loss_step=nan
Monitored metric validation_loss = nan is not finite. Previous best value was inf. Signaling Trainer to stop.




In [7]:
adata_NF1.write_h5ad('NF1.h5ad')



In [8]:
adata_NF1

AnnData object with n_obs × n_vars = 7620 × 20464
    obs: 'sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
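
Unlike the other samples, the SOLO step for NF1 stopped after one epoch with a non-finite `validation_loss`, so the `doublet` column for this sample should not be trusted without further checks. One possible cause (an assumption, not confirmed by this notebook) is cells with zero total counts, whose log library size is `-inf`. The NumPy sketch below shows how such cells could be detected and masked out before training; `nonempty_cell_mask` is a hypothetical helper.

```python
import numpy as np


def nonempty_cell_mask(counts):
    """Boolean mask of cells (rows) with at least one count."""
    counts = np.asarray(counts)
    return counts.sum(axis=1) > 0


toy = np.array([[3, 0, 1],
                [0, 0, 0],   # empty cell: log(0) = -inf
                [0, 2, 0]])
mask = nonempty_cell_mask(toy)

# Suppress the divide-by-zero warning that log(0) would otherwise raise.
with np.errstate(divide="ignore"):
    log_lib = np.log(toy.sum(axis=1))

print("non-finite log library sizes:", int((~np.isfinite(log_lib)).sum()))
print("cells kept:", int(mask.sum()), "of", len(mask))
```

With an AnnData object the same mask could be applied as `adata[mask].copy()`. For a sparse `X`, the row sums would first need flattening with `np.asarray(...).ravel()`.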

### 1.5. NF2

In [11]:
adata_NF2 = data_preprocessing('./NF2_matrix')

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|█| 400/400 [23:27<00:00,  3.43s/it, v_num=1, train_loss_step

`Trainer.fit` stopped: `max_epochs=400` reached.


INFO     Creating doublets, preparing SOLO model.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 174/400:  44%|▍| 174/400 [01:44<02:15,  1.67it/s, v_num=1, train_loss_step
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.413. Signaling Trainer to stop.




In [12]:
adata_NF2.write_h5ad('NF2.h5ad')



In [13]:
adata_NF2

AnnData object with n_obs × n_vars = 7681 × 21176
    obs: 'sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

### 1.6. NF3

In [15]:
adata_NF3 = data_preprocessing('./NF3_matrix')

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 400/400: 100%|█| 400/400 [16:55<00:00,  2.51s/it, v_num=1, train_loss_step

`Trainer.fit` stopped: `max_epochs=400` reached.


INFO     Creating doublets, preparing SOLO model.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Applications/anaconda3/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Epoch 181/400:  45%|▍| 181/400 [01:23<01:40,  2.17it/s, v_num=1, train_loss_step
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.404. Signaling Trainer to stop.




In [16]:
adata_NF3.write_h5ad('NF3.h5ad')



In [17]:
adata_NF3

AnnData object with n_obs × n_vars = 5402 × 20348
    obs: 'sample', 'doublet', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
    var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
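
The five objects displayed above have different numbers of genes (21000, 20770, 20464, 21176, and 20348), presumably because genes are filtered per sample. Before the samples can be clustered together, a shared gene set has to be chosen. The sketch below computes the intersection in plain Python on toy gene lists; `shared_genes` and the toy lists are hypothetical. With real objects, `anndata.concat` with `join="inner"` performs the equivalent step.

```python
def shared_genes(gene_lists):
    """Genes present in every sample, in the order of the first sample."""
    if not gene_lists:
        return []
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [g for g in gene_lists[0] if g in common]


toy = {
    "KF1": ["COL1A1", "POSTN", "ACTA2", "MT-CO1"],
    "NF1": ["POSTN", "COL1A1", "MT-CO1"],
    "NF2": ["COL1A1", "MT-CO1", "KRT14"],
}
print(shared_genes(list(toy.values())))
```

An inner join keeps only genes measured in every sample, while an outer join keeps all genes and fills the gaps with zeros. Which is preferable depends on the downstream analysis.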