# Lecture Assignment 1: Genetic Data Analysis using PLINK

## Objective
To familiarize students with PLINK, a widely used open-source whole genome association analysis toolset. By the end of this lab, students should understand the basic PLINK commands and be able to perform quality control on genetic data.

## Prerequisites
- Basic knowledge of genetics and R programming.
- A working PLINK installation (the outputs below were produced with PLINK v1.07).
- Familiarity with the terminal or command-line interface.

In [24]:
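# Run a shell command from R and print everything it writes to stdout.
shell_call <- function(command, ...) {
    # intern = TRUE captures the command's output as a character vector
    result <- system(command, intern = TRUE, ...)
    # Print the captured lines, one per line
    cat(paste0(result, collapse = "\n"))
}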
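The helper above prints whatever the command emits, but it does not stop when the command fails. A minimal sketch of a stricter variant follows, assuming a Unix-like shell; the name `shell_call_checked` is hypothetical and not part of the assignment.

```r
# Hypothetical stricter variant of shell_call(): stops if the command fails.
shell_call_checked <- function(command, ...) {
    # With intern = TRUE, a non-zero exit status is reported as a warning
    # and stored in the "status" attribute of the returned character vector.
    result <- suppressWarnings(system(command, intern = TRUE, ...))
    status <- attr(result, "status")
    if (!is.null(status) && status != 0) {
        stop(sprintf("Command failed with exit status %d: %s", status, command))
    }
    cat(result, sep = "\n")
    invisible(result)
}

# A harmless command that should succeed on any Unix-like system.
out <- shell_call_checked("echo hello")
```

Returning the captured lines invisibly keeps the notebook output unchanged while still letting later cells inspect the log text.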

## Part 1: Data Retrieval and Initial Exploration

1.  Download the dataset of [156 Qataris](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896773/).

In [25]:
shell_call("wget https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896773/bin/mmc2.zip")

2. Extract and organize the files in a designated directory for easy access.

In [26]:
shell_call("unzip mmc2.zip")

Archive:  mmc2.zip
  inflating: Qatari156_filtered_pruned.bed  
  inflating: Qatari156_filtered_pruned.bim  
  inflating: Qatari156_filtered_pruned.fam  
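A PLINK binary fileset consists of three files sharing one prefix (`.bed`, `.bim`, `.fam`), and `--bfile` needs all of them. As a small sanity check after unzipping, the sketch below lists which parts are missing for a given prefix; `missing_bfile_parts` is a hypothetical helper, and the demonstration uses a temporary directory rather than the real dataset.

```r
# Hypothetical helper: return the extensions of a PLINK binary fileset
# (.bed, .bim, .fam) that are missing for a given file prefix.
missing_bfile_parts <- function(prefix) {
    exts <- c("bed", "bim", "fam")
    paths <- paste0(prefix, ".", exts)
    exts[!file.exists(paths)]
}

# Demonstration with a deliberately incomplete fileset in a temp directory.
demo_prefix <- file.path(tempdir(), "demo_fileset")
invisible(file.create(paste0(demo_prefix, c(".bed", ".bim"))))
missing_parts <- missing_bfile_parts(demo_prefix)
print(missing_parts)
```

Calling `missing_bfile_parts("Qatari156_filtered_pruned")` after the unzip step should return an empty character vector if extraction worked.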

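3.  Use PLINK to view basic statistics of the dataset. The first command below converts the binary fileset to text (PED/MAP) format with --recode; the second loads the converted files so that PLINK prints its summary counts.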

In [27]:
shell_call("plink --noweb --bfile Qatari156_filtered_pruned --recode --tab --out dataset")


@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Skipping web check... [ --noweb ] 
Writing this text to log file [ dataset.log ]
Analysis started: Sun Oct 22 18:51:11 2023

Options in effect:
	--noweb
	--bfile Qatari156_filtered_pruned
	--recode
	--tab
	--out dataset

Reading map (extended format) from [ Qatari156_filtered_pruned.bim ] 
67735 markers to be included from [ Qatari156_filtered_pruned.bim ]
Reading pedigree information from [ Qatari156_filtered_pruned.fam ] 
156 individuals read from [ Qatari156_filtered_pruned.fam ] 
0 individuals with no

In [28]:
shell_call("plink --noweb --file dataset --out info")


Skipping web check... [ --noweb ] 
Writing this text to log file [ info.log ]
Analysis started: Sun Oct 22 18:51:13 2023

Options in effect:
	--noweb
	--file dataset
	--out info

67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 156 missing
49 males, 107 females, and 0 of unspecified sex
Before freque

4.  Report the number of individuals, the number of SNPs, and the sex distribution (males and females).

In [29]:
individuals <- 156
males <- 49
females <- 107
n_snps <- 67735
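The values above are copied by hand from the PLINK log. They can also be derived from the `.fam` file, whose six columns are family ID, individual ID, paternal ID, maternal ID, sex (1 = male, 2 = female) and phenotype. The sketch below assumes that layout; `summarise_fam` is a hypothetical helper, and the toy file contains made-up individuals.

```r
# Hypothetical helper: summarise a PLINK .fam file.
# Columns: family ID, individual ID, paternal ID, maternal ID,
# sex (1 = male, 2 = female, anything else = unknown), phenotype.
summarise_fam <- function(fam_path) {
    fam <- read.table(fam_path, header = FALSE, stringsAsFactors = FALSE,
                      col.names = c("FID", "IID", "PID", "MID", "SEX", "PHENO"))
    c(individuals = nrow(fam),
      males       = sum(fam$SEX == 1),
      females     = sum(fam$SEX == 2),
      unknown_sex = sum(!fam$SEX %in% c(1, 2)))
}

# Toy .fam file with three made-up individuals (not taken from the dataset).
toy_fam <- tempfile(fileext = ".fam")
writeLines(c("F1 I1 0 0 1 -9",
             "F2 I2 0 0 2 -9",
             "F3 I3 0 0 2 -9"), toy_fam)
fam_summary <- summarise_fam(toy_fam)
print(fam_summary)
```

Applied to `Qatari156_filtered_pruned.fam`, the result should agree with the counts PLINK printed (156 individuals, 49 males, 107 females).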

## Part 2: Quality Control Using PLINK

### Task 2.1: Allele Frequency (--freq)

1. Use the --freq option in PLINK to compute allele frequencies.

In [30]:
shell_call("plink --noweb --file dataset --freq --out task2.1")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.1.log ]
Analysis started: Sun Oct 22 18:51:16 2023

Options in effect:
	--noweb
	--file dataset
	--freq
	--out task2.1

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 156

2.  Report the number of SNPs after this step.

In [31]:
snps_task2_1 <- 67735
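`--freq` only writes a frequency report and removes nothing, which is why the SNP count is unchanged. The report can be used to preview what a later `--maf` filter would keep. The sketch below assumes the report is a whitespace-separated table with a header containing a `MAF` column (the toy file mimics the columns CHR, SNP, A1, A2, MAF, NCHROBS); `count_maf_pass` is a hypothetical helper and the rows are made up.

```r
# Hypothetical helper: count SNPs in a PLINK --freq report that would pass
# a given --maf threshold (SNPs with MAF >= threshold are kept).
count_maf_pass <- function(frq_path, threshold = 0.2) {
    frq <- read.table(frq_path, header = TRUE, stringsAsFactors = FALSE)
    sum(frq$MAF >= threshold, na.rm = TRUE)
}

# Toy frequency report with made-up SNPs (not taken from the dataset).
toy_frq <- tempfile(fileext = ".frq")
writeLines(c(" CHR   SNP  A1  A2     MAF  NCHROBS",
             "   1  rs_a   A   G  0.0500      312",
             "   1  rs_b   C   T  0.2000      312",
             "   2  rs_c   G   A  0.4500      310"), toy_frq)
n_pass <- count_maf_pass(toy_frq)
print(n_pass)
```

Run on the report produced by this task (expected at `task2.1.frq`), the count should match the number of SNPs remaining after Task 2.2.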

### Task 2.2: Minor Allele Frequency (--maf 0.2)

1. Remove SNPs with a minor allele frequency below 0.2 using --maf 0.2.

In [32]:
shell_call("plink --noweb --file dataset --maf 0.2 --make-bed --out task2.2")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.2.log ]
Analysis started: Sun Oct 22 18:51:20 2023

Options in effect:
	--noweb
	--file dataset
	--maf 0.2
	--make-bed
	--out task2.2

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 c

2. Report the number of SNPs remaining after this step.

In [33]:
snps_task2_2 <- 31694
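Because `--make-bed` writes a new binary fileset, the reported count can be cross-checked against the output `.bim` file, which holds one SNP per line. A minimal sketch, with `count_bim_snps` as a hypothetical helper and a two-line toy file standing in for the real output:

```r
# Hypothetical helper: number of SNPs in a PLINK .bim file (one SNP per line).
count_bim_snps <- function(bim_path) {
    length(readLines(bim_path))
}

# Toy .bim with made-up SNPs. Columns: chromosome, SNP ID, genetic distance,
# base-pair position, allele 1, allele 2.
toy_bim <- tempfile(fileext = ".bim")
writeLines(c("1\trs_a\t0\t1000\tA\tG",
             "1\trs_b\t0\t2000\tC\tT"), toy_bim)
n_toy <- count_bim_snps(toy_bim)
print(n_toy)
```

For this task, `count_bim_snps("task2.2.bim")` would be expected to equal `snps_task2_2`.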

### Task 2.3: Genotyping Rate (--geno 0.02) 

1. Exclude SNPs with a missing genotype rate above 2% using --geno 0.02.

In [34]:
shell_call("plink --noweb --file dataset --geno 0.02 --make-bed --out task2.3")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.3.log ]
Analysis started: Sun Oct 22 18:51:23 2023

Options in effect:
	--noweb
	--file dataset
	--geno 0.02
	--make-bed
	--out task2.3

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0

2. Report the number of SNPs remaining after this step.

In [35]:
snps_task2_3 <- 67735
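To make the 2% threshold concrete, the arithmetic below works out how many missing calls a SNP can have in this sample before `--geno 0.02` removes it. It assumes all 156 individuals count towards the denominator for every SNP, which may not hold for sex-chromosome markers.

```r
n_individuals  <- 156    # sample size reported in Part 1
geno_threshold <- 0.02   # value passed to --geno

# Largest number of missing calls a SNP can have and still be kept.
max_missing_allowed <- floor(geno_threshold * n_individuals)

# Missing rates just below and just above the cut-off.
rate_kept    <- max_missing_allowed / n_individuals
rate_dropped <- (max_missing_allowed + 1) / n_individuals

print(c(max_missing_allowed = max_missing_allowed,
        rate_kept = rate_kept, rate_dropped = rate_dropped))
```

Under that assumption, a SNP with up to 3 missing genotypes is kept and one with 4 is removed. Since no SNPs were removed here, every SNP in the file appears to sit at or below that limit.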

### Task 2.4: Hardy-Weinberg Equilibrium (--hwe 10E-06)

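1. Exclude SNPs that fail the Hardy-Weinberg equilibrium test at threshold 10E-06 using --hwe 10E-06. Note that 10E-06 is scientific notation for 10 × 10⁻⁶, so the threshold actually applied is 1 × 10⁻⁵.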

In [36]:
shell_call("plink --noweb --file dataset --hwe 10E-06 --make-bed --out task2.4")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.4.log ]
Analysis started: Sun Oct 22 18:51:27 2023

Options in effect:
	--noweb
	--file dataset
	--hwe 10E-06
	--make-bed
	--out task2.4

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 

2. Report the number of SNPs remaining after this step.

In [37]:
snps_task2_4 <- 67735
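To illustrate what a Hardy-Weinberg filter measures, the sketch below compares observed genotype counts with the counts expected from the allele frequency, using a chi-square goodness-of-fit test. This is a simplified stand-in chosen for readability; PLINK's documentation describes an exact test, so p-values from this sketch will not match PLINK's. The function name and the genotype counts are hypothetical.

```r
# "10E-06" is scientific notation for 10 x 10^-6, i.e. 1e-05.
hwe_threshold <- as.numeric("10E-06")

# Chi-square goodness-of-fit test for HWE from genotype counts.
# Simplified illustration only, not the exact test that PLINK reports.
hwe_chisq_p <- function(n_AA, n_Aa, n_aa) {
    n <- n_AA + n_Aa + n_aa
    p <- (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q <- 1 - p
    observed <- c(n_AA, n_Aa, n_aa)
    expected <- n * c(p^2, 2 * p * q, q^2)
    stat <- sum((observed - expected)^2 / expected)
    pchisq(stat, df = 1, lower.tail = FALSE)
}

# Made-up counts for 156 individuals.
p_balanced <- hwe_chisq_p(39, 78, 39)   # exactly at HWE proportions
p_no_hets  <- hwe_chisq_p(78, 0, 78)    # no heterozygotes at all

print(c(p_balanced = p_balanced, p_no_hets = p_no_hets))
```

A SNP like the first example would pass the filter, while one like the second would fall far below the threshold and be removed.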

### Task 2.5: Minor Allele Frequency (--maf 0.2) and Genotyping Rate (--geno 0.02)

1. Apply the minor allele frequency (--maf 0.2) and genotyping rate (--geno 0.02) filters together in a single PLINK run.

In [38]:
shell_call("plink --noweb --file dataset --maf 0.2 --geno 0.02 --make-bed --out task2.5")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.5.log ]
Analysis started: Sun Oct 22 18:51:31 2023

Options in effect:
	--noweb
	--file dataset
	--maf 0.2
	--geno 0.02
	--make-bed
	--out task2.5

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9

2. Report the number of SNPs remaining after this step.

In [39]:
snps_task2_5 <- 31694
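When each filter judges every SNP on its own, the SNPs surviving a combined run are those that pass every filter, that is, the intersection of the individual result sets. The sketch below illustrates this with made-up SNP IDs, under the assumption that no individuals are removed between filters.

```r
# Made-up SNP IDs passing each filter on its own (not taken from the dataset).
pass_maf  <- c("rs_b", "rs_c", "rs_e")
pass_geno <- c("rs_a", "rs_b", "rs_c", "rs_d", "rs_e")   # nothing removed

# A SNP survives the combined run only if it passes both filters.
pass_both <- intersect(pass_maf, pass_geno)
print(pass_both)
```

This mirrors the pattern in the reported counts: `--geno 0.02` alone removed nothing, so combining it with `--maf 0.2` leaves the same 31694 SNPs as `--maf 0.2` alone.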

### Task 2.6: Minor Allele Frequency (--maf 0.2) and Hardy-Weinberg Equilibrium (--hwe 10E-06)

1.  Apply the minor allele frequency (--maf 0.2) and Hardy-Weinberg equilibrium (--hwe 10E-06) filters together in a single PLINK run.

In [40]:
shell_call("plink --noweb --file dataset --maf 0.2 --hwe 10E-06 --make-bed --out task2.6")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.6.log ]
Analysis started: Sun Oct 22 18:51:34 2023

Options in effect:
	--noweb
	--file dataset
	--maf 0.2
	--hwe 10E-06
	--make-bed
	--out task2.6

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -

2.  Report the number of SNPs remaining after this step.

In [41]:
snps_task2_6 <- 31694

### Task 2.7: Genotyping Rate (--geno 0.02) and Hardy-Weinberg Equilibrium (--hwe 10E-06)

1.  Apply the genotyping rate (--geno 0.02) and Hardy-Weinberg equilibrium (--hwe 10E-06) filters together in a single PLINK run.

In [42]:
shell_call("plink --noweb --file dataset --geno 0.02 --hwe 10E-06 --make-bed --out task2.7")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.7.log ]
Analysis started: Sun Oct 22 18:51:38 2023

Options in effect:
	--noweb
	--file dataset
	--geno 0.02
	--hwe 10E-06
	--make-bed
	--out task2.7

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also

2.  Report the number of SNPs remaining after this step.

In [43]:
snps_task2_7 <- 67735

### Task 2.8: Cumulative QC Steps

1.  Apply all three quality control filters (--maf 0.2, --geno 0.02, --hwe 10E-06) to the dataset in a single PLINK run.

In [44]:
shell_call("plink --noweb --file dataset --maf 0.2 --geno 0.02 --hwe 10E-06 --make-bed --out task2.8")


Skipping web check... [ --noweb ] 
Writing this text to log file [ task2.8.log ]
Analysis started: Sun Oct 22 18:51:42 2023

Options in effect:
	--noweb
	--file dataset
	--maf 0.2
	--geno 0.02
	--hwe 10E-06
	--make-bed
	--out task2.8

** For gPLINK compatibility, do not use '.' in --out **
67735 (of 67735) markers to be included from [ dataset.map ]
156 individuals read from [ dataset.ped ] 
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype va

2. Report the number of SNPs remaining after all steps.

In [45]:
snps_task2_8 <- 31694

3. Report the final numbers of individuals, males, females, and SNPs.

In [46]:
final_individuals <- 156
final_males <- 49
final_females <- 107
final_snps <- 31694
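The per-task counts reported above can be gathered into one table to show which filter is responsible for the reduction. The sketch below hard-codes the numbers from Tasks 2.1 to 2.8; the column names are illustrative choices.

```r
# SNP counts as reported in Tasks 2.1-2.8 above.
qc_summary <- data.frame(
    task    = c("2.1", "2.2", "2.3", "2.4", "2.5", "2.6", "2.7", "2.8"),
    filters = c("--freq", "--maf 0.2", "--geno 0.02", "--hwe 10E-06",
                "--maf + --geno", "--maf + --hwe", "--geno + --hwe",
                "--maf + --geno + --hwe"),
    snps    = c(67735, 31694, 67735, 67735, 31694, 31694, 67735, 31694),
    stringsAsFactors = FALSE
)

total_snps <- 67735   # SNPs in the unfiltered dataset
qc_summary$removed  <- total_snps - qc_summary$snps
qc_summary$pct_kept <- round(100 * qc_summary$snps / total_snps, 1)
print(qc_summary)
```

Every run that includes `--maf 0.2` removes the same 36041 SNPs, and every run without it removes none, so in this dataset the minor allele frequency filter accounts for the entire reduction.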