# Single-cell RNA Sequencing of Lung Samples from COVID-19 Decedents and Control Individuals

**Data Source Acknowledgment:**

The dataset comes from [GSE171524](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE171524). It comprises single-nucleus RNA sequencing data from 116,314 cells collected from 20 frozen lungs obtained from 19 individuals who died from COVID-19 and 7 control patients. This repository uses the dataset solely for Python practice.

In [1]:
# Extract the archive and decompress one sample (uncomment to run once)
#!tar -xf GSE171524_RAW.tar
#!gunzip GSM5226574_C51ctr_raw_counts.csv.gz

In [2]:
import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Reading and creating AnnData object

In [3]:
# Read the raw counts with Scanpy.
# The CSV stores genes as rows and cells as columns, so transpose it:
# AnnData expects cells (observations) as rows and genes (variables) as columns.
adata = sc.read_csv('GSM5226574_C51ctr_raw_counts.csv').T
adata

AnnData object with n_obs × n_vars = 6099 × 34546

In [4]:
# First component: per-cell (observation) annotations, a pandas DataFrame indexed by cell barcode
adata.obs

TAGGTACCATGGCCAC-1_1
ATTCACTGTAACAGGC-1_1
TAACTTCCAACCACGC-1_1
TTGGGTACACGACAAG-1_1
AGGCCACAGAGTCACG-1_1
...
CGCCATTGTTTGCCGG-1_1
CACTGGGGTCTACGTA-1_1
CATACTTGTAGAGGAA-1_1
TTTGGTTTCCACGGAC-1_1
ATGCATGAGTCATGAA-1_1


In [5]:
# Second component: per-gene (variable) annotations, a pandas DataFrame indexed by gene name
adata.var

AL627309.1
AL627309.5
AL627309.4
AL669831.2
LINC01409
...
VN1R2
AL031676.1
SMIM34A
AL050402.1
AL445072.1


In [6]:
# Third component: the data matrix (cells x genes), here a dense NumPy array
adata.X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
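
The truncated printout shows only zeros, so it says little about the actual counts. A few summary statistics are more informative. The following is a minimal sketch on a hypothetical stand-in array `X`; with the real object, `adata.X` would presumably take its place, assuming it is a dense NumPy array as shown above.

```python
import numpy as np

# Hypothetical stand-in for adata.X: 3 cells x 4 genes of raw counts.
X = np.array(
    [
        [0, 0, 2, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 5],
    ],
    dtype=np.float32,
)

# Share of entries that are exactly zero.
zero_fraction = float(np.mean(X == 0))

# Total counts per cell (row sums) and number of detected genes per cell.
counts_per_cell = X.sum(axis=1)
genes_per_cell = (X > 0).sum(axis=1)

print(zero_fraction)
print(counts_per_cell)
print(genes_per_cell)
```

Single-cell count matrices are typically dominated by zeros, so a high zero fraction on its own does not indicate a loading problem.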

## 2. Doublet removal (optional)