#### Margaret Antonio 16.09.01

### Data

#### PHENOTYPIC DATA
We have a dataset of traits measured from maize tassels that were grown in two separate fields (or years). Two to three tassels were measured for the same seed (plot). The entire dataset represents tassel traits from over 2000 inbred lines.

#### GENOTYPE DATA
The genotypes for the lines represented in the phenotypic data are available at panzea.org under "DATA" --> "GBS GENOTYPE SEARCH". The genotypes can be requested using the accession/item ID located in the phenotype file. More on acquiring genotypes follows below.

### Summary of procedure
1. Clean phenotypic data file for obvious errors and formatting
2. Outlier removal on phenotypic data
3. Get a list of the "taxa" (lines/accession IDs) available in [Panzea](http://www.panzea.org/)
4. Use the GRIN Zea mays database to verify line identifiers. The accession ID should be the ID corresponding to the Panzea genotypes, but sometimes it is the item ID instead.
5. Match line IDs in the phenotype file to the exact ID in Panzea. The phenotype file could have a shortened version of the whole ID.
6. Request exact list and order of Panzea genotypes
7. Get BLUPs for trait data
8. Filter and format genotypes using [Tassel5](http://www.maizegenetics.net/#!tassel/c17q9)
9. Run genotype file through NAM for quality filtering and imputation
10. Run population structure analysis in NAM
11. GWAS with BLUPs, Genotypes, and Population Structure


## Part 1: Data Preparation

### For phenotype file
1. If not already done, merge all files (e.g. both years) and add identifiers if needed (e.g. a column for year).
2. Rename columns to all lowercase with no spaces (e.g. Tassel Weight -> tassel_weight or tw)
3. Remove rows that do not have data for any of the traits. It is useful to check the notes section: if there is a "No tassel" note, remove that row
4. Remove rows that have neither an item ID nor an accession ID
5. ...Probably some more things that have to be done...
6. Sometimes, because of editing in a non-plain-text editor (e.g. Excel), data entries get "" around them or new lines get ^M marks, so open the file in Vim and change:

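    ```vim
    " Type ^M as Ctrl-V then Ctrl-M; turns stray carriage returns into line breaks
    :%s/^M/\r/g
    " Strip all double quotes
    :%s/"//g
    ```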
    
7. If there is an ITEM ID without an ACCESSION ID, then look up the item ID in GRIN and get the corresponding ACCESSION ID. 
8. Check for typos by changing all IDs to lowercase or uppercase and removing any spaces (can be done in Vim)
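
The mechanical steps above (2, 3, 4, 6 and 8) can also be done in R. The following is only a sketch: the function name `clean_phenotypes` and the column names in the example are made up for illustration and are not taken from the actual phenotype file.

```R
# Sketch of phenotype cleanup steps 2, 3, 4, 6 and 8.
# ASSUMPTION: `trait_cols` and `id_cols` are given as the *cleaned* column names.
clean_phenotypes <- function(pheno, trait_cols, id_cols) {
  # Step 2: lowercase column names; spaces (or dots added by read.csv) become "_"
  names(pheno) <- gsub("[ .]+", "_", tolower(trimws(names(pheno))))

  # Step 6: strip stray double quotes and carriage returns from text columns
  is_text <- vapply(pheno, is.character, logical(1))
  pheno[is_text] <- lapply(pheno[is_text], function(x) gsub("[\"\r]", "", x))

  # Step 8: uppercase IDs and remove all whitespace; empty IDs become NA
  for (col in id_cols) {
    id <- as.character(pheno[[col]])
    ok <- !is.na(id)
    id[ok] <- toupper(gsub("[[:space:]]+", "", id[ok]))
    id[ok & id == ""] <- NA
    pheno[[col]] <- id
  }

  # Step 3: keep rows with data for at least one trait
  has_trait <- rowSums(!is.na(pheno[, trait_cols, drop = FALSE])) > 0
  # Step 4: keep rows with at least one of the IDs
  has_id <- rowSums(!is.na(pheno[, id_cols, drop = FALSE])) > 0

  pheno[has_trait & has_id, , drop = FALSE]
}

# Toy input with hypothetical column names and IDs
pheno <- data.frame(
  "Accession ID"  = c("pi 550473", NA, "\"Ames 19288\"", "PI 1"),
  "Item ID"       = c(NA, NA, "x1", ""),
  "Tassel Weight" = c(3.2, 4.1, NA, NA),
  "Spike Length"  = c(20, 25, 31, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)
cleaned <- clean_phenotypes(pheno,
                            trait_cols = c("tassel_weight", "spike_length"),
                            id_cols = c("accession_id", "item_id"))
print(cleaned)
```

Steps 1, 5 and 7 (merging years, ad hoc fixes, GRIN lookups) still need to be done by hand.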

### For genotype file

1. Download or request the entire dataset via panzea.org (DATA -> GBS GENOTYPES SEARCH -> IMPUTED DATASET)
2. Open it in Tassel 5 (download the GUI) and export a text file of all the taxa names (lines)
3. The Panzea taxa names are longer than the accession IDs: each taxa name has a colon (:) followed by a number. Create a new file of Panzea taxa names without the colon and number (just the first part). Make all of these taxa names uppercase or lowercase to match step 8 for the phenotype file.
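
Step 3 can be sketched in R as follows. The helper name `normalize_taxa` and the example taxa names are made up for illustration; real names would come from the text file exported from Tassel (e.g. via `readLines()`).

```R
# Drop everything from the first colon onward, remove whitespace, and uppercase
# so the result is comparable with the cleaned phenotype IDs.
normalize_taxa <- function(taxa) {
  toupper(gsub("[[:space:]]+", "", sub(":.*$", "", taxa)))
}

# Hypothetical taxa names in the "<id>:<number>" shape described above
taxa <- c("b73:1001", "Mo17:1002", "PI550473:1003")
taxa_table <- data.frame(panzea_taxon = taxa,
                         short_id = normalize_taxa(taxa),
                         stringsAsFactors = FALSE)
print(taxa_table)
```

Keeping the full taxa name next to the shortened ID makes it easy to go back to the exact Panzea name when requesting genotypes later.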

### Match genotype and phenotype file line identifiers

1. Use [Venny](http://bioinfogp.cnb.csic.es/tools/venny/) to find commonalities between panzea IDs and item and/or accession IDs in the phenotype file. Venny also lets you copy lists of the entries that are unique or common, so use this to go back and check the IDs in the phenotype file.
2. Make two new columns in the phenotype file for genotype-by-accession and genotype-by-item
3. Put each matching Panzea ID in the corresponding column
4. If the accession and item IDs match DIFFERENT Panzea IDs, remove the entry. Do the same if neither the accession ID nor the item ID matches a Panzea ID
5. If the item ID matches a Panzea ID and the accession ID does not, make the item ID the final identifier
6. After these steps, every remaining entry should correspond to a Panzea ID
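
The matching rules in steps 2 to 6 can be sketched in R as an alternative to copying lists out of Venny. The function name `choose_genotype_id` and the IDs are hypothetical. Two cases are not spelled out in the steps and are assumptions here: if only the accession ID matches, it becomes the final identifier, and if both IDs match the same Panzea ID, that ID is kept.

```R
# All inputs are assumed to be already normalized (same case, no spaces).
choose_genotype_id <- function(accession_id, item_id, panzea_ids) {
  by_acc  <- ifelse(accession_id %in% panzea_ids, accession_id, NA_character_)
  by_item <- ifelse(item_id %in% panzea_ids, item_id, NA_character_)

  # Both IDs match, but to different Panzea IDs: entry is to be removed
  conflict <- !is.na(by_acc) & !is.na(by_item) & by_acc != by_item

  # Prefer the accession match, fall back to the item match
  final <- ifelse(is.na(by_acc), by_item, by_acc)
  final[conflict] <- NA_character_

  data.frame(genotype_by_accession = by_acc,
             genotype_by_item = by_item,
             final_id = final,
             stringsAsFactors = FALSE)
}

# Hypothetical IDs
panzea <- c("PI550473", "AMES19288", "X1")
acc    <- c("PI550473", "AMES19288", "NOPE", "ZZZ")
item   <- c("Q9", "X1", "X1", NA)

matched <- choose_genotype_id(acc, item, panzea)
print(matched)  # rows with NA in final_id are the ones to remove
```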

## Part 2: Outlier removal for trait data

### Goal of outlier removal 
Remove (set to NA) trait values that are implausible, like 2843 side branches. Nobody counted that many side branches, so this is obviously an error. If the dataset is small enough, it is worth going through all of the entries to see whether data was entered in the cell next to where it was supposed to be. This is an easy mistake to make during phenotyping.

### Multiple rounds of outlier removal
It is important to look at a boxplot of the trait values after each round of outlier removal to make sure you are not removing really interesting data, because that is what we are looking for.

### R method for outlier removal: getOutliers.R
getOutliers.R is a method for outlier removal based on a SAS method (original author not recorded). The basic idea is to fit a mixed model, calculate the studentized residuals, calculate a threshold, and then remove (set to NA) trait values whose absolute studentized residual exceeds the threshold. The getOutliers.R script can run through a whole matrix containing multiple traits. Read through the script and set the column names and the trait columns, because those are hardcoded. A boxplot is output for each trait, and one for all of the traits at the end. The returned value is the model itself.
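
To make the idea concrete, here is a minimal base-R sketch of one round of studentized-residual outlier flagging. It is not the getOutliers.R script. It swaps in a plain fixed-effects `lm()` for the mixed model so that it runs without extra packages, and the column names (`line`, `year`, `side_branches`) and the Bonferroni-style threshold are assumptions made for illustration.

```R
# Flag outliers for one trait using externally studentized residuals.
# ASSUMPTIONS: lm() stands in for the mixed model; `line` and `year` columns
# and the Bonferroni-style threshold are hypothetical choices.
flag_outliers <- function(dat, trait, alpha = 0.05) {
  ok  <- complete.cases(dat[, c(trait, "line", "year")])
  fit <- lm(reformulate(c("line", "year"), response = trait), data = dat[ok, ])
  res <- rstudent(fit)
  threshold <- qt(1 - alpha / (2 * sum(ok)), df = df.residual(fit) - 1)

  flagged <- rep(FALSE, nrow(dat))
  flagged[ok] <- is.finite(res) & abs(res) > threshold
  flagged
}

# Simulated data: 20 lines x 2 years x 3 tassels, with one data-entry error
set.seed(42)
dat <- expand.grid(line = factor(paste0("L", 1:20)),
                   year = factor(c("A", "B")),
                   rep = 1:3)
dat$side_branches <- round(rnorm(nrow(dat), mean = 10, sd = 2))
dat$side_branches[17] <- 2843

flagged <- flag_outliers(dat, "side_branches")
dat$side_branches[flagged] <- NA   # set to NA rather than dropping the row
```

For multiple rounds, rerun `flag_outliers()` on the updated data and check `boxplot(dat$side_branches)` after each round before accepting the removals.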

#### getOutliers.R is available at [github.com/antmarge/maize](https://github.com/antmarge/maize)

## Part 3: BLUPs for trait data

### BLUPs Tutorial
Best Linear Unbiased Predictors (BLUPs) were calculated following an adaptation of the tutorial provided by extension.org, titled: ["Estimating Heritability and BLUPs for Traits Using Tomato Phenotypic Data"](http://articles.extension.org/pages/61006/estimating-heritability-and-blups-for-traits-using-tomato-phenotypic-data)

### Script for getting trait data BLUPs
A script for getting BLUPs for trait data is available at [github.com/antmarge/maize](https://github.com/antmarge/maize)
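
For orientation, a minimal sketch of extracting line BLUPs with the lme4 package is shown below. It is not the script in the repository. The model (year as a fixed effect, line as a random effect), the column names, and the simulated data are all assumptions for illustration; the real model may include more terms.

```R
library(lme4)  # provides lmer() and ranef()

# Simulated data: 30 lines x 2 years x 2 tassels (hypothetical names and values)
set.seed(1)
line_names  <- paste0("L", 1:30)
true_effect <- setNames(rnorm(30, sd = 2), line_names)
dat <- expand.grid(line = line_names, year = c("Y1", "Y2"), rep = 1:2)
dat$tassel_weight <- unname(10 + true_effect[as.character(dat$line)] +
                            ifelse(dat$year == "Y2", 1, 0) +
                            rnorm(nrow(dat)))

# Line as a random effect; its predicted random effects are the BLUPs
fit <- lmer(tassel_weight ~ year + (1 | line), data = dat)
line_ranef <- ranef(fit)$line
blups <- data.frame(line = rownames(line_ranef),
                    blup = line_ranef[["(Intercept)"]],
                    stringsAsFactors = FALSE)
head(blups)
```

The BLUPs here are deviations from the overall mean. Run one model per trait and keep the `line` column so the BLUPs can be matched to the Panzea IDs.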

## Part 4: Genotype formatting and filtering
1. Load Panzea genotype file (any format supported) into Tassel 5 
2. Perform the following filters on the data (make sure the right dataset in the upper left panel is selected with each filtering step):
    - FILTER -> sites -> In the entry box: (a) MAF>.05, (b) Marker must be present in at least 80% of individuals, and (c) Remove minor SNP states
    - DATA -> HOMOZYGOUS (because we have inbred lines)
    - OPTIONAL: For separate chromosomes, go to DATA -> SEPARATE (don't keep chr. 0)
3. Output dataset as Hapmap (DO NOT SELECT the "keep annotations" or "diploid" options)
4. Use the hapmap2Matrix.R script (available at github.com/antmarge/maize) to turn the hapmap file into a matrix where rows are genotypes and columns are markers (SNPs), with missing data set to NA. If desired, the script also runs the data through NAM's snpQC and imputes missing values
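
The conversion in step 4 can be sketched in base R as follows. This is not hapmap2Matrix.R. It assumes a single-letter (non-diploid) hapmap export with 11 metadata columns before the taxa columns. The 2/0 coding (2 = homozygous for the first allele listed in the `alleles` column, 0 = homozygous for the second) is an assumption, so check it against the coding that downstream tools expect. Any other call (N, heterozygous codes) becomes NA.

```R
# Convert a hapmap data frame (sites in rows) to a taxa x SNP numeric matrix.
hapmap_to_matrix <- function(hmp) {
  calls   <- as.matrix(hmp[, -(1:11), drop = FALSE])
  alleles <- strsplit(as.character(hmp[[2]]), "/", fixed = TRUE)
  first   <- vapply(alleles, function(a) a[1], character(1))
  second  <- vapply(alleles, function(a) if (length(a) >= 2) a[2] else NA_character_,
                    character(1))

  geno <- matrix(NA_real_, nrow = nrow(calls), ncol = ncol(calls),
                 dimnames = list(hmp[[1]], colnames(calls)))
  # `first`/`second` have one entry per site, so they recycle down the columns
  geno[which(calls == first)]  <- 2
  geno[which(calls == second)] <- 0
  t(geno)  # rows = genotypes (taxa), columns = markers
}

# Tiny made-up hapmap; read everything as character so "T" is not parsed as TRUE
meta <- c("rs#", "alleles", "chrom", "pos", "strand", "assembly#", "center",
          "protLSID", "assayLSID", "panelLSID", "QCcode")
hmp_text <- paste(
  paste(c(meta, "B73", "MO17", "PI1"), collapse = "\t"),
  paste(c("S1_100", "A/G", "1", "100", "+", rep("NA", 6), "A", "G", "N"), collapse = "\t"),
  paste(c("S1_200", "C/T", "1", "200", "+", rep("NA", 6), "T", "T", "C"), collapse = "\t"),
  sep = "\n"
)
hmp  <- read.delim(text = hmp_text, colClasses = "character", check.names = FALSE)
geno <- hapmap_to_matrix(hmp)
print(geno)
```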

## Part 5: NAM: Nested Association Mapping Analysis
https://cran.r-project.org/web/packages/NAM/index.html

### Part 5A: SNP QC filtering in NAM

```R
library(NAM)  # provides snpQC(); genbin is the genotype matrix from Part 4
genbin <- snpQC(genbin, psy = 1, MAF = 0.05, remove = TRUE, impute = TRUE)
```

### Part 5B: Population structure analysis

See procedure sent by Travis, located in rochefordLab/2-brenda/popstr

### Part 5C: GWAS







# Running GWAS with GAPIT

Ran fine, but something is wrong with the genotype file: there are only 12,594 SNPs, and all of the output files (e.g. chromosome-wise Manhattan plots) only report chromosome 1. This is strange because chr 1 has 85,283 sites. Checking the hapmap genotype file showed that it does indeed only contain chr 1 (plus 4 SNPs on chr 0).

```R
> moo<-read.delim("amesGenotypesFiltWhole.hmp.txt")
> chr<-moo[,3]
> summary(chr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.9997  1.0000  1.0000 
> length(chr)
[1] 12594
```


Looks like the genotype file was truncated. Opened the original in torbert/margaret/ in Tassel (the original one, not the exact copy that was used) and it looks complete. The copy that was actually used in GAPIT (stored in torbert/gapit/) won't even open in Tassel; it says there are too few SNPs or something like that.

What probably happened: the files didn't decompress completely.
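
A quick way to catch this kind of problem before running GAPIT is to compare the copy against the original. The sketch below uses a hypothetical helper, `summarize_hapmap`, and small temporary files made up for the demonstration; in practice the two paths would be the original file and the copy that was actually used.

```R
# Summarize a hapmap file: number of sites, sites per chromosome, and checksum.
# Reading the whole file is slow for large datasets but keeps the sketch short.
summarize_hapmap <- function(path) {
  hmp <- read.delim(path, colClasses = "character", check.names = FALSE)
  list(n_sites = nrow(hmp),
       sites_per_chr = table(hmp[[3]]),          # third column is the chromosome
       md5 = unname(tools::md5sum(path)))
}

# Demonstration with a made-up file and a truncated copy of it
full <- tempfile(fileext = ".hmp.txt")
sites <- data.frame("rs#" = paste0("S", 1:6), alleles = "A/G",
                    chrom = c(0, 1, 1, 2, 2, 3), pos = 1:6,
                    check.names = FALSE)
write.table(sites, full, sep = "\t", quote = FALSE, row.names = FALSE)

truncated <- tempfile(fileext = ".hmp.txt")
writeLines(readLines(full)[1:4], truncated)      # header plus the first 3 sites

original <- summarize_hapmap(full)
copy     <- summarize_hapmap(truncated)

print(original$sites_per_chr)
print(copy$sites_per_chr)
identical(original$md5, copy$md5)                # FALSE means the files differ
```

If the checksums differ or chromosomes are missing from the copy, decompress or transfer the file again before rerunning the analysis.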



