# Pre-processing Step
In this step, we prepare the raw data for downstream analysis. This step includes:
- Preparing metadata
- Demultiplexing FASTQ Files
- Removing Non-Biological Sequences
- Sequence Denoising

## Preparing Metadata
Before starting the analysis, explore the sample metadata to familiarize yourself with the samples used in this study. QIIME 2 metadata is most commonly stored in a TSV (tab-separated values) file. It usually contains at least a sample ID column plus additional information about the samples, although a file containing only sample IDs is still a valid QIIME 2 metadata file. The instructors will guide you through creating a metadata file using a Google spreadsheet.

Once you have your metadata file, copy it to the server. Open a new terminal or secure shell app on your local machine and run this command, replacing the paths, username, and host with your own.

In [None]:
scp '/path/to/local/sample-metadata.tsv' username@remote_host:'/path/to/destination/'

Another metadata file is provided inside the materials folder.
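
Before using a metadata file, a quick consistency check can catch common spreadsheet export problems. The following is a minimal sketch, not a replacement for QIIME 2's own validation: `check_metadata` is a hypothetical helper, and it assumes the header is the first line of the file and that the sample IDs are in the first column.

```shell
# check_metadata FILE: rough sanity check of a tab-separated metadata file.
# Hypothetical helper, not a QIIME 2 command.
# Assumes the header is on line 1 and sample IDs are in column 1.
check_metadata() {
  awk -F'\t' '
    NR == 1 { ncol = NF; next }        # remember how many columns the header has
    /^#/ || NF == 0 { next }           # skip comment/directive lines and blank lines
    NF != ncol {
      printf "line %d: expected %d columns, found %d\n", NR, ncol, NF
      bad = 1
    }
    seen[$1]++ {                       # true from the second occurrence of an ID
      printf "line %d: duplicate sample ID %s\n", NR, $1
      bad = 1
    }
    END { exit (bad ? 1 : 0) }         # non-zero exit status if any problem was found
  ' "$1"
}

# Example usage (file name taken from the scp command above):
# check_metadata sample-metadata.tsv && echo "metadata looks consistent"
```

The function prints one line per problem and returns a non-zero exit status, so it can be combined with `&&` as in the usage comment.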

## Demultiplexing FASTQ Files
Raw microbiome data typically exists in one of two forms: multiplexed or demultiplexed. In multiplexed data, sequences from all samples are grouped together in one or more files. In demultiplexed data, sequences are separated into different files based on the sample they are derived from.

The command below imports demultiplexed paired-end FASTQ files from the raw-data/ directory into a QIIME 2 artifact.

In [None]:
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw-data/ \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux.qza
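
The `CasavaOneEightSingleLanePerSampleDirFmt` format expects file names such as `L2S357_15_L001_R1_001.fastq.gz`, with underscore-separated fields for the sample identifier, barcode, lane number, read direction, and set number. If the import fails, checking the file names is a reasonable first step. The sketch below lists names that do not fit that convention; `list_non_casava_names` is a hypothetical helper and the regular expression is our reading of the naming rule, so treat it as an approximation.

```shell
# list_non_casava_names DIR: print the files in DIR whose names do not match
# <sample>_<barcode>_L<lane>_R<1|2>_001.fastq.gz.
# Hypothetical helper, not part of QIIME 2; the pattern is an approximation.
list_non_casava_names() {
  for path in "$1"/*; do
    [ -e "$path" ] || continue         # nothing matched: the directory is empty
    name=$(basename "$path")
    if ! printf '%s\n' "$name" |
        grep -Eq '^.+_.+_L[0-9]{3}_R[12]_001\.fastq\.gz$'; then
      printf '%s\n' "$name"            # report the name that does not fit
    fi
  done
}

# Example usage (directory name taken from the import command above):
# list_non_casava_names raw-data
```

An empty output means every file name fits the expected pattern.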

### Summarize Demultiplexed FASTQ Files
When you have demultiplexed sequence data, the next step is typically to generate a visual summary of it. This allows you to determine how many sequences were obtained per sample, and also to get a summary of the distribution of sequence qualities at each position in your sequence data.

In [None]:
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
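
The visualization is the main way to inspect read counts, but a rough cross-check on a raw file can also be done in the shell. This minimal sketch assumes gzip-compressed FASTQ files with standard four-line records; `count_reads` is a hypothetical helper, not a QIIME 2 command.

```shell
# count_reads FILE: number of records in a gzip-compressed FASTQ file.
# Hypothetical helper; assumes exactly four lines per record.
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example usage, one line per forward-read file in raw-data/:
# for f in raw-data/*_R1_001.fastq.gz; do
#   printf '%s\t%s\n' "$(basename "$f")" "$(count_reads "$f")"
# done
```

The counts should agree with the per-sample counts shown in demux.qzv.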

Now, we have two types of output files:

*   .qza files store QIIME 2 Artifacts on disk. QIIME 2 Artifacts represent data that are generated by QIIME 2 and intended to be used by QIIME 2, such as intermediary files in an analysis workflow. qza stands for QIIME Zipped Artifact.
*   .qzv files store QIIME 2 Visualizations on disk. QIIME 2 Visualizations represent data that are generated by QIIME 2 and intended to be viewed by humans, such as an interactive visualization. qzv stands for QIIME Zipped Visualization.

.qza and .qzv files can be loaded with [QIIME 2 View](https://view.qiime2.org/). Because we are working on a server, we need to download the files to the local machine before uploading them to QIIME 2 View. Open a new terminal/secure shell app and run this command.

In [None]:
scp username@remote_host:'/path/to/file' '/path/to/local/destination/'

## Removing Non-Biological Sequences
Raw data from the sequencer may contain non-biological sequences (e.g. primers, sequencing adapters, PCR spacers), which should be removed before denoising.

The q2-cutadapt plugin has comprehensive methods for removing non-biological sequences from paired-end or single-end data.

In [None]:
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f AGAGTTTGATCCTGGCTCAG \
  --p-front-r ACGGCTACCTTGTTACGACTT \
  --o-trimmed-sequences primer-trimmed-demux.qza \
  --verbose
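
To get a feel for how many raw reads actually start with a primer, a rough count can be made on a FASTQ file before trimming. The sketch below is an approximation only: it counts exact matches at the start of each read, whereas cutadapt tolerates some mismatches, so it will tend to undercount. `count_primer_starts` is a hypothetical helper, not part of q2-cutadapt.

```shell
# count_primer_starts FILE PRIMER: print "<hits>/<total>", where hits is the
# number of reads that begin with PRIMER (exact match only).
# Hypothetical helper; assumes gzip-compressed, four-line FASTQ records.
count_primer_starts() {
  gzip -dc "$1" | awk -v primer="$2" '
    NR % 4 == 2 {                      # the second line of each record is the sequence
      total++
      if (index($0, primer) == 1) hits++
    }
    END { printf "%d/%d\n", hits, total }
  '
}

# Example usage with the forward primer from the command above
# (the file name is a placeholder for one of your forward-read files):
# count_primer_starts raw-data/SAMPLE_R1.fastq.gz AGAGTTTGATCCTGGCTCAG
```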

## Sequence Denoising
QIIME 2 plugins are available for several quality control methods, including DADA2, Deblur, and basic quality-score-based filtering. In this tutorial we present this step using DADA2. The result of this method will be a FeatureTable[Frequency] QIIME 2 artifact, which contains counts (frequencies) of each unique sequence in each sample in the dataset, and a FeatureData[Sequence] QIIME 2 artifact, which maps feature identifiers in the FeatureTable to the sequences they represent.

The denoise-paired action, which we’ll use here, requires four parameters that are used in quality filtering:

- `trim-left-f a`, which trims off the first a bases of each forward read
- `trunc-len-f b`, which truncates each forward read at position b
- `trim-left-r c`, which trims off the first c bases of each reverse read
- `trunc-len-r d`, which truncates each reverse read at position d

These parameters allow the user to remove low-quality regions of the sequences. To determine what values to pass for them, review the Interactive Quality Plot tab in the demux.qzv file that was generated above.

In [None]:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs primer-trimmed-demux.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 270 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 200 \
  --o-representative-sequences rep-seqs.qza \
  --o-table asv-table.qza \
  --o-denoising-stats stats.qza
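
When choosing truncation lengths for paired-end reads, keep in mind that the forward and reverse reads must still overlap after truncation so that DADA2 can merge them (to our knowledge the default minimum overlap is 12 bases, but check the version you use). The arithmetic can be sketched as follows. `expected_overlap` is a hypothetical helper, and the amplicon length of 440 is an illustrative assumption, not a property of this dataset; the sketch also assumes both trim-left values are 0.

```shell
# expected_overlap AMPLICON_LEN TRUNC_LEN_F TRUNC_LEN_R
# Bases shared by the truncated forward and reverse reads.
# Hypothetical helper; assumes trim-left-f and trim-left-r are both 0.
expected_overlap() {
  echo $(( $2 + $3 - $1 ))
}

# Illustrative amplicon length of 440 with the truncation lengths used above:
expected_overlap 440 270 200   # prints 30
```

A result that is small or negative suggests the truncation lengths are too short for the reads to merge.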

After DADA2 completes, you’ll want to explore the resulting data. The following commands create visual summaries of it. The `feature-table summarize` action reports how many sequences are associated with each sample and with each feature, along with histograms of those distributions and some related summary statistics. The `feature-table tabulate-seqs` action lists the representative sequences, and `metadata tabulate` displays the denoising statistics as a table.

In [None]:
qiime feature-table summarize \
  --i-table asv-table.qza \
  --o-visualization asv-table.qzv \
  --m-sample-metadata-file sample-metadata.tsv

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

qiime metadata tabulate \
  --m-input-file stats.qza \
  --o-visualization stats.qzv

If the sequences appear to have low quality scores, we can filter them further using the `q-score` action of the `q2-quality-filter` plugin. For example, to filter out sequences with a quality score below 28, run this command:

In [None]:
qiime quality-filter q-score \
  --i-demux primer-trimmed-demux.qza \
  --p-min-quality 28 \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats demux-filter-stats.qza
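
To put the threshold in context, a Phred quality score Q corresponds to an estimated base-calling error probability of 10^(-Q/10). The conversion can be sketched in the shell as follows; `phred_to_error_prob` is a hypothetical helper written for this illustration.

```shell
# phred_to_error_prob Q: error probability for Phred score Q,
# computed as 10^(-Q/10) and printed with four decimal places.
# Hypothetical helper for illustration only.
phred_to_error_prob() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error_prob 28   # prints 0.0016
```

A score of 28 therefore corresponds to roughly 1.6 expected errors per 1,000 base calls.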

Congratulations, you have successfully generated ready-to-use data. Now continue to the next section of the tutorial, [Microbiome Classification Analysis](4_Microbiome_Classification_Analysis.ipynb), or go back to the main tutorial page: [Main Page](1_Metagenomics_Workshop_Module.ipynb).