# Extract relevant genes and generate files in intermediate_data_01/

## Workflow description

1. Run **I. Essentials** and **II. Custom Functions**
2. For each dataset in `lipid_selection/data/raw_data/source_data/`:

    1. Extract basic information:
        - genome_version 
        - database_source 
        - inclusion_criteria 
        - first_author
        - publication_year
      
    2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`
        - Use `append_basic_info()`
    
    3. Extract candidate and non-candidate genes.
        - Use `check_excel_data()`, `import_messy_excel()`
    
    4. Export candidate and non-candidate genes to `<first_author>_<year>.txt`.
        - Use `export_data()`

## I. Essentials

#### Check directories and load library packages

Working directory is `lipid_selection/data/raw_data/source_data`.

In [3]:
current_dir = getwd()
source_data_dir = "../../data/raw_data/source_data"
setwd(source_data_dir)

#Set target folder for candidate gene info from workflow step 2.4
target_folder = "../../intermediate_data_01/"

Load essential library packages

In [71]:
library("readxl")
library("dplyr")
library("tidyverse")


In [5]:
getwd()

## II. Custom functions

* `check_excel_data`
    - **Usage**: Check if excel dataset has more than one sheet; check if file type is .xls or .xlsx
    - **input**: file_path (str)
    - **output**: list of sheet names as strings
    
    
* `import_messy_excel(file_path, sheet_name)`
    - **Usage**: Remove non-data table rows and import a cleaner dataframe from an excel dataset
    - **input**: file_path (str), sheet_name (str)
    - **output**: dataframe
    
    
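* `export_data(df, target_folder)`
    - **Usage**: Export candidate and non-candidate genes to `<first_author>_<year>.txt` in `target_folder`
    - **input**: df (dataframe of gene data), target_folder (str)
    - **output**: `<first_author>_<year>.txt` in `target_folder`
    
    
* `append_basic_info(basic_info, output_filepath)`
    - **Usage**: Append basic information about a dataset to `basic_info.txt`
    - **input**: basic_info (dataframe), output_filepath (str)
    - **output**: rows appended to `output_filepath`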

#### check_excel_data

In [6]:
#Check if excel dataset has more than one sheet. 
#Check if file type is .xls or .xlsx
check_excel_data <- function (file_path){
    
    #If file type is not .xls or .xlsx, stop with an error
    #Otherwise return sheet name(s) as a character vector
    
    #load "readxl" for excel_sheets()
    require("readxl") 
    
    #check if file type is .xls or .xlsx. excel_sheets() only works with these file types
    #tools::file_ext() returns the text after the last ".", so paths that contain other dots still work
    if (tolower(tools::file_ext(file_path)) %in% c("xls", "xlsx")){
        list_of_sheets <- excel_sheets(file_path)
        return (list_of_sheets)
    }
    else {
        stop("File type is not .xls or .xlsx")} 
    
}

In [7]:
#Test code with excel dataset with more than one sheet
#check_excel_data("Bajhaiya_2016.xls")  

#Test code with excel dataset with one sheet
#check_excel_data("Boyle_2012.xls") 

#Test code with incorrect file type
#check_excel_data("Li_2016.xlsb")

ERROR: Error in check_excel_data("Li_2016.xlsb"): File type is not .xls or .xlsx
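
The extension test can be tried without any Excel file. The sketch below relies on `tools::file_ext()` from base R; the helper name `is_excel_path` is hypothetical and is not part of this notebook.

```r
# Hypothetical helper: TRUE when the path ends in .xls or .xlsx.
is_excel_path <- function(file_path) {
    tolower(tools::file_ext(file_path)) %in% c("xls", "xlsx")
}

is_excel_path("Bajhaiya_2016.xls")              # TRUE
is_excel_path("Li_2016.xlsb")                   # FALSE
is_excel_path("../source_data/Boyle_2012.xls")  # TRUE, dots earlier in the path are ignored
```

Only the text after the last dot is compared, so relative paths such as `../source_data/...` are handled the same way as bare file names.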


#### import_messy_excel

In [8]:
#Remove non-data table rows and import a cleaner dataframe from an excel dataset

import_messy_excel <- function(file_path, sheet_name){
    
    #Load required packages
    require("readxl") 
    require("dplyr")
    require("stringr")
    
    #Check that the sheet name exists
    if (!(sheet_name %in% excel_sheets(file_path))){
            stop("Sheet name does not exist")
    }
    
    #Remove rows in excel sheet if more than half of the columns have NAs
    df <- read_excel(file_path, sheet = sheet_name) %>% 
        filter(rowSums(is.na(.))/ncol(.) < 0.5)
    
    #Assumes Row 1 of the subset dataframe is the column names
    #Assign Row 1 as column names and remove Row 1
    colnames(df)<- df[1,] %>% str_replace_all(" ", "_")
    df<- df[-1,]
    message("Check if column names are correct.")
    
    return(df)
    
}
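
The cleaning rule used by `import_messy_excel()` can be illustrated on a small in-memory table. The sketch below uses base R only and made-up values. The data frame `raw` stands in for what `read_excel()` might return for a sheet that has a title row and a blank row above the real header.

```r
# Toy stand-in for a messy sheet: a title row, a blank row,
# then the real header row followed by data. All values are made up.
raw <- data.frame(
    a = c("Supplementary Table", NA, "Gene id", "Cre01.g000001", "Cre01.g000002"),
    b = c(NA, NA, "Gene name", "ABC1", "XYZ2"),
    c = c(NA, NA, "fold change", "2.5", "0.4"),
    stringsAsFactors = FALSE
)

# Keep rows where fewer than half of the cells are NA.
keep <- rowSums(is.na(raw)) / ncol(raw) < 0.5
clean <- raw[keep, ]

# Promote the first remaining row to column names, replacing spaces with "_".
colnames(clean) <- gsub(" ", "_", as.character(unlist(clean[1, ])))
clean <- clean[-1, ]
rownames(clean) <- NULL

print(clean)
```

The title row (two of three cells missing) and the blank row are dropped, so the header row becomes the first remaining row. This only works when the header row itself is mostly non-missing, which is why the function asks the user to check the column names.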

#### export_data

`gene_data` should have some of the following columns *(when available)* :
- `gene_id` : <`string`> matches `^Cre.+`
- `gene_name` : <`string`> abbreviation of gene name 
- `protein_id`: <`string`> 
- `candidate_gene` : <`boolean`> whether the gene is candidate or not
- `fold_difference`: <`numeric`> linear fold-difference in gene expression levels
- `p_value`: <`numeric`> p-value for null hypothesis `|log2(fold_difference)| = 1`
- `protein_fold_difference`: <`numeric`> linear fold-difference in protein expression levels
- `protein_p_value`: <`numeric`> p-value of differentially expressed proteins
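
A quick check of this layout before exporting might look like the sketch below. The function name `validate_gene_data` and its messages are hypothetical. It only tests columns that are present, because every column is optional.

```r
# Hypothetical validator for the gene_data layout described above.
# Returns a character vector of problems (empty when none are found).
validate_gene_data <- function(gene_data) {
    stopifnot(is.data.frame(gene_data))
    problems <- character(0)
    if ("gene_id" %in% names(gene_data) &&
        !all(grepl("^Cre.+", gene_data$gene_id))) {
        problems <- c(problems, "gene_id values must match ^Cre.+")
    }
    if ("candidate_gene" %in% names(gene_data) &&
        !is.logical(gene_data$candidate_gene)) {
        problems <- c(problems, "candidate_gene must be logical")
    }
    numeric_cols <- c("fold_difference", "p_value",
                      "protein_fold_difference", "protein_p_value")
    for (col in intersect(numeric_cols, names(gene_data))) {
        if (!is.numeric(gene_data[[col]])) {
            problems <- c(problems, paste(col, "must be numeric"))
        }
    }
    problems
}

# Made-up examples: one valid table and one with two problems.
good <- data.frame(gene_id = "Cre01.g000001", candidate_gene = TRUE,
                   fold_difference = 2.5, stringsAsFactors = FALSE)
bad <- data.frame(gene_id = "g000001", candidate_gene = TRUE,
                  fold_difference = "2.5", stringsAsFactors = FALSE)
validate_gene_data(good)
validate_gene_data(bad)
```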


In [9]:
#Export candidate and non-candidate genes to `<first_author>_<year>.txt` in target_folder
#Input: df, the gene_data dataframe; target_folder, the output folder (with trailing "/")
#By default the file name is built from the global variables first_author and publication_year

export_data <- function(df, target_folder, 
                        author = first_author, year = publication_year){
       
    write.table(df, 
            paste(target_folder, author, "_", year, ".txt", sep=""), 
            quote = FALSE, sep = "\t", col.names = TRUE, row.names = FALSE)
    
}
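
The file naming and tab-separated output can be tried in a temporary folder. The sketch below uses made-up values throughout and writes to `tempdir()` rather than `intermediate_data_01/`.

```r
# Made-up inputs; in the notebook these come from the dataset's basic information.
target_folder <- paste0(tempdir(), "/")
first_author <- "Example"
publication_year <- 2000
gene_data <- data.frame(gene_id = c("Cre01.g000001", "Cre01.g000002"),
                        candidate_gene = c(TRUE, FALSE),
                        fold_difference = c(2.5, 1.1),
                        stringsAsFactors = FALSE)

# Build <first_author>_<year>.txt inside target_folder and write a tab-separated table.
out_path <- paste0(target_folder, first_author, "_", publication_year, ".txt")
write.table(gene_data, out_path,
            quote = FALSE, sep = "\t", col.names = TRUE, row.names = FALSE)

basename(out_path)  # "Example_2000.txt"
exported <- read.delim(out_path, stringsAsFactors = FALSE)
print(exported)
```

Because the folder and file name are pasted together without a separator, `target_folder` needs its trailing `/`.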

#### append_basic_info

`basic_info` should have some of the following columns *(when available)* :
- `first_author` : <`string`> 
- `publication_year` : <`numeric`> 
- `genome_version` : <`string`> 
- `database_source` : <`string`> 
- `inclusion_criteria` : <`string`> see options below
    - `fold difference greater than 2` 
    - `p-value <0.05`
- `type_of_study` : <`string`> see options below
    - `gene_expression` : transcriptomics, qPCR
    - `protein_expression` : proteomics

In [10]:
#output_filepath is the path to basic_info.txt
#basic_info is a dataframe

append_basic_info <- function(basic_info, output_filepath){
    
    #Append basic info to output_filepath without repeating the column names
    write.table(basic_info, output_filepath, 
                append = TRUE, sep = "\t", quote = FALSE, 
                row.names = FALSE, col.names = FALSE)
}
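
The header-then-append pattern can be checked end to end on a temporary file. The sketch below uses base R with a made-up row and passes `col.names = FALSE` when appending so that the header is written only once.

```r
out_file <- tempfile(fileext = ".txt")

# Write the header once, from a zero-row data frame with the six column names.
header <- data.frame(matrix(ncol = 6, nrow = 0))
colnames(header) <- c("first_author", "publication_year", "genome_version",
                      "database_source", "inclusion_criteria", "type_of_study")
write.table(header, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Append one made-up row without repeating the header.
new_row <- data.frame(first_author = "Example",
                      publication_year = 2000,
                      genome_version = "1.0",
                      database_source = "none",
                      inclusion_criteria = "fold difference greater than 2",
                      type_of_study = "gene_expression",
                      stringsAsFactors = FALSE)
write.table(new_row, out_file, append = TRUE, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm one header line and one data row.
check <- read.delim(out_file, stringsAsFactors = FALSE)
print(check)
```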

### Create basic_info.txt in intermediate_data_01

In [9]:
basic_info <- data.frame(matrix(ncol = 6, nrow = 0))
colnames(basic_info) <- c("first_author",
                         "publication_year",
                         "genome_version",
                         "database_source", 
                         "inclusion_criteria", 
                         "type_of_study")
write.table(basic_info, "../../intermediate_data_01/basic_info.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)

## III. Add source data to intermediate_data_01

### Bajhaiya_2016.xls

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [10]:
#Basic information

file_name = "Bajhaiya_2016.xls"
genome_version = 5.3
database_source = "Phytozome 9.1"
inclusion_criteria = "fold difference greater than 2"
first_author = "Bajhaiya"
publication_year = 2016
type_of_study = "gene_expression"

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria,
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	1 obs. of  6 variables:
 $ first_author      : chr "Bajhaiya"
 $ publication_year  : num 2016
 $ genome_version    : num 5.3
 $ database_source   : chr "Phytozome 9.1"
 $ inclusion_criteria: chr "fold difference greater than 2"
 $ type_of_study     : chr "gene_expression"


In [11]:
#Check file type and number of excel sheets
sheets <- check_excel_data(file_name)  
print(length(sheets))

[1] 2


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

**Inclusion criteria**: Within each strain, if fold-difference between high P and low P is >2 count as candidate gene. 
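
As a small illustration of this rule, the sketch below applies the same `|log2(fold_difference)| > 1` test to a made-up vector of linear fold differences. Missing values are labelled `FALSE`, mirroring the `TRUE ~ FALSE` fallback of `case_when()`.

```r
# Made-up linear fold differences, including the boundary cases 2 and 0.5.
fold_difference <- c(85.99, 2, 1.5, 0.5, 0.25, NA)

# A gene is a candidate when expression changes more than two-fold in either direction.
candidate_gene <- !is.na(fold_difference) & abs(log2(fold_difference)) > 1

data.frame(fold_difference, candidate_gene)
# candidate_gene: TRUE FALSE FALSE FALSE TRUE FALSE
```

Because the comparison is strict, a fold difference of exactly 2 or exactly 0.5 is not labelled as a candidate.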

**Sheet 1: "Day 3"**

In [12]:
df <- import_messy_excel(file_name, sheets[1]) 
#Use fold difference from normalized expression
temp1 <- df %>% 
    select(starts_with("Gene"), ends_with("foldchange")) %>%
    mutate_at(vars(ends_with("foldchange")),list(as.numeric)) %>%

    #reshape data such that one of the columns is fold difference
    gather(., 'WT_LP_/_HP_D3_foldchange','psr1_LP_/_HP_D3_foldchange', 
           key = 'treatment', value = 'fold_difference') %>%

    #assign candidate gene label
    mutate(candidate_gene = case_when(
        abs(log(fold_difference,2))>1 ~ TRUE,
        TRUE ~ FALSE)) %>%
    
    #select columns that go in to working gene_data dataframe
    select(starts_with("Gene"), candidate_gene, fold_difference)
           
temp1[1:3,]

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 13 more problems
Check if column names are correct.


Gene_id,Gene_name,candidate_gene,fold_difference
<chr>,<chr>,<lgl>,<dbl>
Cre09.g404900,Cre09.g404900,True,85.99029
Cre04.g216700,PHOX,True,82.75193
Cre01.g044300,Cre01.g044300,True,71.1001


**Sheet 2: "Day 5"**

In [13]:
df <- import_messy_excel(file_name, sheets[2]) 
#Use fold difference from normalized expression
temp2 <- df %>% 
    select(starts_with("Gene"), ends_with("foldchange")) %>%
    mutate_at(vars(ends_with("foldchange")),list(as.numeric)) %>%

    #reshape data such that one of the columns is fold difference
    gather(., 'WT_LP_/_HP_D5_foldchange','psr1_LP_/_HP_D5_foldchange', 
           key = 'treatment', value = 'fold_difference') %>%

    #assign candidate gene label
    mutate(candidate_gene = case_when(
        abs(log(fold_difference,2))>1 ~ TRUE,
        TRUE ~ FALSE)) %>%
    
    #select columns that go into working gene_data dataframe
    select(starts_with("Gene"), candidate_gene, fold_difference)
           
temp2[1:3,]

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 13 more problems
Check if column names are correct.


Gene_id,Gene_name,candidate_gene,fold_difference
<chr>,<chr>,<lgl>,<dbl>
Cre09.g404900,Cre09.g404900,True,15.11795
Cre04.g216700,PHOX,True,13.01713
Cre01.g044300,Cre01.g044300,True,18.13303


##### Combine Sheet 1 and Sheet 2 data into one dataframe:

In [14]:
gene_data <- rbind(temp1, temp2) %>%
    rename(gene_id = Gene_id,
          gene_name = Gene_name)

gene_data[1:3,]
print(dim(gene_data))

gene_id,gene_name,candidate_gene,fold_difference
<chr>,<chr>,<lgl>,<dbl>
Cre09.g404900,Cre09.g404900,True,85.99029
Cre04.g216700,PHOX,True,82.75193
Cre01.g044300,Cre01.g044300,True,71.1001


[1] 70948     4


#### 4. Export `gene_data` to `Bajhaiya_2016.txt`.

In [15]:
export_data(df = gene_data, target_folder = target_folder)

### Blaby_2013_DS2.xlsx and Blaby_2013_DS8.xlsx

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [16]:
#Basic information

file_name = c("Blaby_2013_DS2.xlsx","Blaby_2013_DS8.xlsx")
genome_version = 5.0
database_source = "Augustus u10.2"
inclusion_criteria = "fold difference greater than 2"
first_author = "Blaby"
publication_year = 2013
type_of_study = "gene_expression"

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria, 
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	1 obs. of  6 variables:
 $ first_author      : chr "Blaby"
 $ publication_year  : num 2013
 $ genome_version    : num 5
 $ database_source   : chr "Augustus u10.2"
 $ inclusion_criteria: chr "fold difference greater than 2"
 $ type_of_study     : chr "gene_expression"


In [17]:
#Check file type and number of excel sheets
sheets <- check_excel_data(file_name[1])  
print(length(sheets))
sheets <- check_excel_data(file_name[2])  
print(length(sheets))

[1] 1
[1] 1


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

**Inclusion criteria**: Within each strain, if fold-difference between 0 hour and *n* hours after N starvation >2 count as candidate gene. 

* Gene expression differences due to the *sta-6* mutation are less important than gene expression differences due to nutrient starvation
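
The time-course comparison can be sketched in base R as follows. The expression values are copied from the first and third rows of the CC_4349 preview in this notebook, and only two time points are kept for brevity. The reshaping uses `lapply()` and `rbind()` as a stand-in for `gather()`, so it is an illustration rather than the notebook's own code.

```r
# Two genes from the CC_4349 preview, with time 0 and two later time points.
wide <- data.frame(gene_id = c("Cre01.g001100", "Cre01.g006950"),
                   `0` = c(41.864, 98.35175),
                   `0.5` = c(45.6182, 155.9517),
                   `48` = c(23.32514, 24.37027),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Reshape to long format: one row per gene and time point, next to the time-0 value.
time_cols <- c("0.5", "48")
long <- do.call(rbind, lapply(time_cols, function(t) {
    data.frame(gene_id = wide$gene_id, time = t,
               time_0h = wide[["0"]], expression = wide[[t]],
               stringsAsFactors = FALSE)
}))

# Candidate when expression at least doubles or at least halves relative to time 0.
long$candidate_gene <- long$expression >= 2 * long$time_0h |
                       long$time_0h >= 2 * long$expression
long$fold_difference <- long$time_0h / long$expression
print(long)
```

Only Cre01.g006950 at 48 hours is labelled as a candidate here, since its expression falls to less than half of the time-0 value.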

In [18]:
require("readxl")
require("tidyverse")
require("dplyr")

#Data manipulation to separate data by strains
CC_4349 <- data.frame(matrix(nrow = 0, ncol = 11))
sta_6 <-data.frame(matrix(nrow = 0, ncol = 11))
for (file in file_name){
    #Look up the sheet name for each file instead of reusing `sheets` from the previous cell
    df<- import_messy_excel(file, check_excel_data(file)[1])
    CC_4349 <- df[,1:11] %>% 
        mutate(strain = "CC_4349") %>% 
        rename ( '0' = '0b') %>% 
        rbind(CC_4349,.)
    #Append sta_6 rows to sta_6 (not to CC_4349)
    sta_6 <- df[, c(1:3, 12:19)] %>% mutate(strain = "sta_6") %>%
        rbind(sta_6,.)
}
 
 
CC_4349[1:3,]

#sta_6[1:3,]

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 13 more problems
Check if column names are correct.
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 13 more problems
Check if column names are correct.


Gene,Annotation,Augustus_u10.2_ID,0,0.5,2,4,8,12,24,48,strain
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
,,Cre01.g001100,41.864,45.6182,30.04442,28.77635,20.15472,20.21065,20.21139,23.32514,CC_4349
,,Cre01.g004750,0.1167516,0.232438,0.1756702,0.1662885,0.1605637,0.2873106,0.7915955,1.283815,CC_4349
FBA1,"Fructose-1,6-bisphosphate aldolase",Cre01.g006950,98.35175,155.9517,117.9667,57.06845,27.87264,23.58092,28.82621,24.37027,CC_4349


In [19]:
gene_data <- rbind(CC_4349, sta_6) %>% 

    #reshape dataframe to assign candidate_gene label based on fold difference
    gather(.,'0.5', '2', '4', '8', '12', '24', '48', 
      key = "time", value = 'expression') %>%
    rename(time_0h = '0') %>%
    mutate(time_0h = as.numeric(time_0h)) %>%

    #assign candidate gene label based on fold difference
    mutate( candidate_gene = case_when(
        expression >= 2*time_0h ~ TRUE,
        time_0h>= 2*expression ~ TRUE,
        TRUE ~ FALSE),
    #calculate fold difference relative to time 0H control
          fold_difference = time_0h/expression) %>%

    #select columns that go into working gene_data dataframe
    rename (gene_name = Gene, gene_id = Augustus_u10.2_ID) %>%
    select( gene_id, gene_name, candidate_gene, fold_difference)

#### 4. Export `gene_data` to `Blaby_2013.txt`

In [20]:
gene_data[1:3,]
export_data(df = gene_data, target_folder = target_folder)

gene_id,gene_name,candidate_gene,fold_difference
<chr>,<chr>,<lgl>,<dbl>
Cre01.g001100,,False,0.9177039
Cre01.g004750,,False,0.5022914
Cre01.g006950,FBA1,False,0.6306552


### Boyle_2012.xls

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [21]:
file_name = "Boyle_2012.xls"
genome_version = 4.0
database_source = "Augustus 10.2"
inclusion_criteria = "fold difference greater than 2"
first_author = "Boyle"
publication_year = 2012
type_of_study = "gene_expression"

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria,
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	1 obs. of  6 variables:
 $ first_author      : chr "Boyle"
 $ publication_year  : num 2012
 $ genome_version    : num 4
 $ database_source   : chr "Augustus 10.2"
 $ inclusion_criteria: chr "fold difference greater than 2"
 $ type_of_study     : chr "gene_expression"


In [22]:
#Check file type and number of excel sheets
sheets <- check_excel_data(file_name)  
print(length(sheets))

[1] 1


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

**Inclusion criteria**: If fold-difference of RPKM between 0 hour and *n* hours after N starvation >2 count as candidate gene. 

In [23]:
require("readxl")
require("tidyverse")
require("dplyr")

#Import excel sheet and rename column names
df <- import_messy_excel(file_name, sheets[1])
colnames(df)[1:5]<- c("gene_id", "Au.5", "gene_name","protein_name", "time_0h")
df[1:3,]

New names:
* `` -> ...6
* `` -> ...7
* `` -> ...8
* `` -> ...9
Check if column names are correct.


gene_id,Au.5,gene_name,protein_name,time_0h,2_h,12_h,24_h,48_h
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Cre12.g519100,513248,ACX1,?-Carboxyltransferase,140.5,68.2,81.29999999999998,71.6,70.2
Cre12.g484000,512497,BCX1,?-Carboxyltransferase,103.5,53.4,35.1,35.4,31.4
Cre17.g715250,517403,BCC1,Acetyl-CoA biotin carboxyl carrier,293.9,129.6,54.7,74.7,68.9


In [24]:
#Reshape dataframe such that it is easier to compare time 0h to 'n'h
gene_data<- df %>% gather(., '2_h', '12_h', "24_h", '48_h',
                   key = 'time', value = "expression") %>%
    mutate(time_0h = as.numeric(time_0h),
        expression= as.numeric(expression)) %>%
           
    #assign candidate_gene label based on fold difference
    mutate(candidate_gene = case_when(
            expression >= 2*time_0h~ TRUE,
            time_0h >= 2* expression~ TRUE,
            TRUE ~ FALSE),
          fold_difference = time_0h/expression) %>%

    #aggregate dataframe and remove unnecessary columns for export_data()
    select(gene_id, gene_name, candidate_gene, fold_difference)

#gene_data[1:3,]
#summary(gene_data)

#### 4. Export `gene_data` to `Boyle_2012.txt`.

In [25]:
export_data(df = gene_data, target_folder = target_folder)

### Gargouri_2015.xlsx

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [26]:
#Basic information

file_name = "Gargouri_2015.xlsx"
genome_version = NA
database_source = "Phytozome 10.0"
inclusion_criteria = c("fold difference greater than 2", "p-value <0.05")
first_author = "Gargouri"
publication_year = 2015
type_of_study = "gene_expression"

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria, 
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	2 obs. of  6 variables:
 $ first_author      : chr  "Gargouri" "Gargouri"
 $ publication_year  : num  2015 2015
 $ genome_version    : logi  NA NA
 $ database_source   : chr  "Phytozome 10.0" "Phytozome 10.0"
 $ inclusion_criteria: chr  "fold difference greater than 2" "p-value <0.05"
 $ type_of_study     : chr  "gene_expression" "gene_expression"


In [27]:
#Check file type and number of excel sheets
sheets <- check_excel_data(file_name)  
print(length(sheets))
print(sheets)

[1] 16
 [1] "TF Gene Expression"     "TR Gene Expression "    "TF & TR proteins"      
 [4] "R values"               "P values"               "Correlation lists"     
 [7] "Nitrogen genes"         "Photosynthesis genes"   "Chlorophyll genes"     
[10] "Calvin cycle genes"     "Photorespiration genes" "OPPP genes"            
[13] "TCA-glyoxylate genes"   "Amino acid genes"       "sucrose-starch genes"  
[16] "Lipid genes"           


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

**Inclusion criteria**: When fold-difference between 0 hour and *n* hours after N starvation is statistically significant (false discovery rate adjusted p-value <0.05), count as candidate gene. 
    
* Significance in expression is defined as >2 fold-difference in a time point relative to the control (time= 0 hours)
* Exclude proteomics information because the study measured only transcription factor and regulator proteins, which are present at low copy numbers in the cells.
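
The combination of the fold-difference and p-value tables can be sketched in base R as follows. All values are made up. The column names follow the notebook (`log2fold_diff`, `p_value`), and `gene_name` is left out of the join key to keep the example short.

```r
# Made-up long-format tables: one row per gene and time point.
log2fold_diff_df <- data.frame(gene_id = c("Cre13.g562400", "Cre13.g562400"),
                               time = c("0.5h", "1h"),
                               log2fold_diff = c(1.5, -0.2),
                               stringsAsFactors = FALSE)
p_value_df <- data.frame(gene_id = c("Cre13.g562400", "Cre13.g562400"),
                         time = c("0.5h", "1h"),
                         p_value = c(0.01, 0.38),
                         stringsAsFactors = FALSE)

# Join on gene and time so each fold difference is paired with its own p-value.
gene_data <- merge(log2fold_diff_df, p_value_df, by = c("gene_id", "time"))

# Convert log2 fold differences to linear scale and label candidates by p-value.
gene_data$fold_difference <- 2^gene_data$log2fold_diff
gene_data$candidate_gene <- !is.na(gene_data$p_value) & gene_data$p_value < 0.05
print(gene_data)
```

Joining on both `gene_id` and `time` matters because each gene appears once per time point. A join on `gene_id` alone would pair every fold difference with every p-value of that gene.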

In [28]:
require("readxl")
require("tidyverse")
require("dplyr")

#Data manipulation: combine TF and TR gene expression data
df <- rbind(import_messy_excel(file_name, sheet_name = "TF Gene Expression") , import_messy_excel(file_name, sheet_name = "TR Gene Expression "))
colnames(df)[1:3] <- c("gene_id","", "gene_name")

#Split data into p-value data and fold difference data
p_value_df<-df[,c(1,3,18:24)]
log2fold_diff_df<-df[,c(1,3,4:10)]
p_value_df[1:3,]
log2fold_diff_df[1:3,] 

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 18 more problems
Check if column names are correct.
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 18 more problems
Check if column names are correct.


gene_id,gene_name,0.5h,1h,2h,4h,6h,12h,24h
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Cre13.g562400,ABI3,0.2662413863716787,0.3784149427406196,0.1710728018036058,0.08969348134671708,0.6747007534654417,0.1911839931425287,0.09917656994690371
Cre16.g661650,AP2.1,0.389648201061047,0.6077728481571952,0.8148417282451008,0.2417443713837775,0.685940846358145,0.3188625459563791,0.1669253421880012
Cre06.g275500,AP2.2,0.3225917301235968,0.28657095682324,0.9763104391019098,0.9174718655771834,0.8606023792386808,0.3501941975397299,0.139895815498112


gene_id,gene_name,0.5h,1h,2h,4h,6h,12h,24h
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Cre13.g562400,ABI3,0.2449997586303851,0.152815133035785,0.6582990940995443,0.5963349885218836,0.8377579524378042,0.4926107536949789,0.829104318114255
Cre16.g661650,AP2.1,-0.2049811425946774,-0.0373423144159995,0.2725394189188488,-0.3849485106593602,-0.4247172929231348,-0.4154184546234874,-0.5204901907180629
Cre06.g275500,AP2.2,0.5449519681765332,-0.9911149374400134,-0.1388274625023567,-0.2196249599012253,0.041920206177396405,0.333444087201702,0.5217494995143


In [29]:
#Reshape dataframe such that it is easier to compare time 0h to 'n'h
p_value_df <- p_value_df %>% 
    gather(., ends_with('h'),
           key = 'time', value = "p_value") %>%
    mutate(p_value= as.numeric(p_value))
log2fold_diff_df <- log2fold_diff_df  %>% 
    gather(., ends_with('h'),
           key = 'time', value = "log2fold_diff") %>%
    mutate(log2fold_diff= as.numeric(log2fold_diff))
gene_data<- merge(log2fold_diff_df, p_value_df, 
                  by = c('gene_id', 'gene_name', 'time')) %>%
            mutate(candidate_gene = case_when(
                p_value< 0.05 ~TRUE,
                TRUE~FALSE),
                  fold_difference = 2^log2fold_diff) %>%
            select(gene_id, gene_name, candidate_gene,fold_difference, p_value)


gene_data[1:3,]
dim(gene_data)
#summary(gene_data)


gene_id,gene_name,candidate_gene,fold_difference,p_value
<chr>,<chr>,<lgl>,<dbl>,<dbl>
Cre01.g000050,RWP.14,False,1.5629104,0.0912037
Cre01.g000050,RWP.14,False,1.9702945,0.3410131
Cre01.g000050,RWP.14,False,0.9607626,0.4324472


#### 4. Export `gene_data` to `Gargouri_2015.txt`.

In [30]:
export_data(df = gene_data, target_folder = target_folder)


### Goodenough_2014.xlsx

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [31]:
#Basic information

file_name = "Goodenough_2014.xlsx"
genome_version = 4
database_source = "Augustus 10.2"
inclusion_criteria = c("fold difference greater than 2")
first_author = "Goodenough"
publication_year = 2014
type_of_study = "gene_expression"

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria, 
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	1 obs. of  6 variables:
 $ first_author      : chr "Goodenough"
 $ publication_year  : num 2014
 $ genome_version    : num 4
 $ database_source   : chr "Augustus 10.2"
 $ inclusion_criteria: chr "fold difference greater than 2"
 $ type_of_study     : chr "gene_expression"


In [32]:
#Check file type and number of excel sheets
sheets <- check_excel_data(file_name)  
print(length(sheets))
print(sheets)

[1] 24
 [1] "Table 1"    "Table 2"    "Table 3"    "Table 4"    "Table 5"   
 [6] "Table 6"    "Table 7"    "Table 8"    "Table 9"    "Table S1"  
[11] "Table S2"   "Table S3"   "Table S4"   "Table S5"   "Table S6"  
[16] "Table S7"   "Table S8"   "Table S9"   "Dataset S1" "Dataset S2"
[21] "Dataset S3" "Dataset S4" "Dataset S5" "Dataset S6"


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

**Inclusion criteria**: If fold-difference between 0 hour and *n* hours after N starvation >2 count as candidate gene. 
    
* Only use excel sheets "Table 2" to "Table 7"
* Exclude acetate boost data
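
In these sheets a gene row is followed by rows labelled with strain names, so each strain row needs the gene identifier of the gene row above it. The sketch below shows that fill-down step on a made-up vector. It uses a vectorized `cummax()` approach as an alternative to the row-by-row loop in the next cell, and the strain labels are placeholders.

```r
# Made-up first column: gene rows start with "Cre", the other rows are strain labels.
name <- c("Cre03.g188250 STA6", "strain_A", "strain_B",
          "Cre01.g000001 ABC1", "strain_A")

# Position of the most recent gene row at or before each row (0 if none yet).
is_gene <- grepl("^Cre.+", name)
last_gene <- cummax(ifelse(is_gene, seq_along(name), 0))

# Replace every row label with the label of its gene row.
filled <- rep(NA_character_, length(name))
filled[last_gene > 0] <- name[last_gene[last_gene > 0]]
print(filled)
```

Rows that appear before the first gene row stay `NA`, which matches the loop's behaviour of starting with `gene_id <- NA`.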

In [33]:
require("readxl")
require("tidyverse")
require("tidyr")
require("dplyr")
require("stringr")
gene_data<- data.frame(matrix(nrow = 0, ncol = 4))
#Data manipulation: combine expression data from sheets "Table 2" to "Table 7"
for (sheet in sheets[2:7]){
    
    #Import relevant columns, 
    #first 10 columns do not include acetate-boosted expression
    df <- read_excel(file_name, sheet = sheet)[,1:10] 
    colnames(df)<- c('name',df[2,2:10])
    
    #Assign gene id to each row, ignoring strain names in experiment
    gene_id <-NA
    for (row in 1:nrow(df)){
        if(TRUE %in% stringr::str_detect(df[row,1],"^Cre.+")){
            gene_id<- grep("^Cre.+", df[row,1], value = TRUE)
        }
        else{df[row,1] <- gene_id}
    }

    #Reshape dataframe to compare expression levels to time 0
    gene_data <- df %>% rename(control = "0 h") %>%
        gather(ends_with('h'), key = "time", value = "expression") %>%
        na.omit() %>% select(-log) %>%
        mutate(control = as.numeric(control),
              expression = as.numeric(expression)) %>%
        
        #Calculate fold difference relative to control time 0H
        mutate(fold_difference = control/expression,
        #assign candidate gene label based of fold difference in expression level
               candidate_gene = case_when(
            abs(log(fold_difference,2))>1  ~ TRUE,
            TRUE ~ FALSE)) %>%

    #add gene_data to working dataframe gene_data
        rbind(gene_data, .)
}


New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 12 more problems
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 12 more problems
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 12 more problems
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 12 more problems
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 12 more problems
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 12 more problems


In [34]:
#Split "name" column to "gene_id" and "gene_name" columns in gene_data
gene_data<- gene_data %>% 
    separate(., name, sep = "\\s+", into=c("gene_id","gene_name")) %>%
    select(gene_id, gene_name, candidate_gene, fold_difference)
print(gene_data[31,])
dim(gene_data)

"Expected 2 pieces. Missing pieces filled with `NA` in 14 rows [523, 524, 533, 534, 543, 544, 553, 554, 563, 564, 573, 574, 583, 584]."

# A tibble: 1 x 4
  gene_id       gene_name candidate_gene fold_difference
  <chr>         <chr>     <lgl>                    <dbl>
1 Cre03.g188250 STA6      FALSE                    0.677


In [35]:
#Export gene_data
export_data(df = gene_data, target_folder = target_folder)

### Hemme_2014.xlsx

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [1]:
file_name = "Hemme_2014.xlsx"
genome_version = NA
database_source = "Augustus 10.2"
inclusion_criteria = c("fold difference greater than 2", "p-value <0.05")
first_author = "Hemme"
publication_year = 2014
type_of_study = "gene_expression"

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria,
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	2 obs. of  6 variables:
 $ first_author      : chr  "Hemme" "Hemme"
 $ publication_year  : num  2014 2014
 $ genome_version    : logi  NA NA
 $ database_source   : chr  "Augustus 10.2" "Augustus 10.2"
 $ inclusion_criteria: chr  "fold difference greater than 2" "p-value <0.05"
 $ type_of_study     : chr  "gene_expression" "gene_expression"


In [2]:
#Check file type and number of excel sheets
sheets <- check_excel_data(file_name)  
print(length(sheets))
print(sheets)

ERROR: Error in check_excel_data(file_name): could not find function "check_excel_data"


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

**Inclusion criteria**: When fold-difference between 0 hour and *n* hours after heat stress (HS) is statistically significant (p-value <0.05), count as candidate gene. 
    
* Significance in expression is defined as >2 fold-difference in a time point relative to the control (time= 0 hours) within the HS treatment **(not recovery)**
* Column `TP24HS_fold_change` is the linear fold difference of HS at time = 24 hours compared to time = 0 hours. 
* Use the p-value from column `C1_Significance`
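
Gene identifiers in this dataset can carry extra text such as a transcript suffix, so the next cell reduces them to the `Cre*` locus form. The sketch below shows the same idea with base R regular expressions. The pattern `Cre[0-9]+\\.g[0-9]+` is an assumption that is slightly stricter than the one used in the notebook, and the second and third inputs are made up.

```r
ids <- c("Cre17.g720250.t1", "prefix|Cre16.g673650.t1.1", "no identifier")
pattern <- "Cre[0-9]+\\.g[0-9]+"

# regexpr() gives the match position per element (-1 when there is no match).
m <- regexpr(pattern, ids)
hit <- as.vector(m) > 0

# Replace matching entries with the extracted locus ID and keep the rest unchanged.
extracted <- ids
extracted[hit] <- regmatches(ids, m)
print(extracted)
# "Cre17.g720250" "Cre16.g673650" "no identifier"
```

Entries without a recognizable locus ID are left as they are, which corresponds to the `TRUE ~ gene_id` fallback in `case_when()`.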

In [37]:
require("readxl")
require("tidyverse")
require("dplyr")

#Data manipulation: import the "Protein data" sheet
df <- import_messy_excel(file_name, sheet_name = "Protein data")
#colnames(df)

gene_data <- df %>% 
    
    #rename columns
    rename( gene_name = DisplayId, gene_id = Gene_identifier, fold_difference = TP24HS_fold_change, 
           p_value = C1_Significance ) %>%
    
    #select relevant columns
    select(gene_id, gene_name, fold_difference, p_value) %>%
    
    #assign candidate gene label based on inclusion criteria
    #Reformat gene_id and gene_name to Cre* format and gene name abbreviations when possible
    mutate( fold_difference = as.numeric(fold_difference),
           p_value = as.numeric(p_value),
           gene_id = case_when(
              str_detect(gene_id,"Cre.[0-9]+\\.g[0-9]+") ~ str_extract(gene_id, "Cre.[0-9]+\\.g[0-9]+"),
               TRUE ~ gene_id),
           gene_name = case_when((nchar(gene_name)<9) & (str_detect(gene_name, "[[:alnum:]]+")) ~ str_extract(gene_name, "[[:alnum:]]+"),
                                 (nchar(gene_name)>=9) & (str_detect(gene_name, "Cre.[0-9]+\\.g[0-9]+")) ~ str_extract(gene_id, "Cre.[0-9]+\\.g[0-9]+"),
                                    TRUE ~ gene_name),
           candidate_gene = case_when(
                p_value < 0.05 & abs(log(fold_difference,2))>1 ~ TRUE,
                TRUE ~ FALSE))



gene_data[1:3,]

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 45 more problems
Check if column names are correct.


gene_id,gene_name,fold_difference,p_value,candidate_gene
<chr>,<chr>,<dbl>,<dbl>,<lgl>
Cre17.g720250,LHCB4,0.74,0.182,False
Cre16.g673650,LHCB5,1.32,5.1e-10,False
Cre02.g110750,LHCB7,0.43,0.243,False


In [34]:
length(gene_name)
gene_name <- 'Cre17.g720250.t1'
str_extract(gene_name, "[[:alnum:]]+")

#### 4. Export `gene_data` to `Hemme_2014.txt`.

In [38]:
export_data(df = gene_data, target_folder = target_folder)


### Juergens_2015.xls

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [39]:
#Basic information

file_name = "Juergens_2015.xls"
genome_version =  NA
database_source = c("Augustus 10.2","Phytozome 10.0")
inclusion_criteria = "fold difference greater than 2"
first_author = "Juergens"
publication_year = 2015
type_of_study = c("gene_expression", "protein_expression")

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria, 
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	2 obs. of  6 variables:
 $ first_author      : chr  "Juergens" "Juergens"
 $ publication_year  : num  2015 2015
 $ genome_version    : logi  NA NA
 $ database_source   : chr  "Augustus 10.2" "Phytozome 10.0"
 $ inclusion_criteria: chr  "fold difference greater than 2" "fold difference greater than 2"
 $ type_of_study     : chr  "gene_expression" "protein_expression"
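
The two rows above arise because `data.frame()` recycles length-one values and pairs equal-length vectors by position, so `"Augustus 10.2"` ends up next to `"gene_expression"` purely because both come first. A short sketch of that behaviour follows. The collapsed variant is only one possible alternative, offered as an assumption rather than as the notebook's method.

```r
database_source <- c("Augustus 10.2", "Phytozome 10.0")
type_of_study <- c("gene_expression", "protein_expression")

# Length-one values are recycled; the two length-two vectors are paired by position
paired <- data.frame(first_author = "Juergens",
                     database_source,
                     type_of_study,
                     stringsAsFactors = FALSE)
print(paired)

# Alternative (an assumption): one row per study, with multiple values
# collapsed into a single string
collapsed <- data.frame(first_author = "Juergens",
                        database_source = paste(database_source, collapse = "; "),
                        type_of_study = paste(type_of_study, collapse = "; "),
                        stringsAsFactors = FALSE)
print(collapsed)
```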


In [40]:
#Check file type and number of excel sheets
sheets <- check_excel_data(file_name)  
print(length(sheets))

[1] 2


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

**Inclusion criteria**: If the linear transcript fold change relative to 0 hour is `>2`, count the gene as a candidate gene.

**Sheet 1: "Photosynthetic"**

In [53]:
#Import sheet 1 from dataset
protein_df <- import_messy_excel(file_name, sheet_name = sheets[1]) %>%

    #extract protein expression data only
    select(Name, 'Cre#', contains("_"), -Unique_to_Calvin_Cycle) %>%
    gather(., contains("_"), key = "time", value = "protein_fold_difference") %>%
    
    #remove rows without protein fold difference values
    mutate(protein_fold_difference = as.numeric(protein_fold_difference)) %>%
    na.omit() %>%
        
    #Data manipulation into gene_data format
    separate(., Name, into= c("gene_name", "full_name"), sep = ";|,") %>%
    separate(., time, into= c("time", "time_unit"), sep = "_") %>%
    mutate(time = as.numeric(time)) %>%
    rename(gene_id = "Cre#") %>%

    select(gene_id, gene_name, time, protein_fold_difference) 
    
protein_df[1:3,]
#df[1:5,]


New names:
* `` -> ...1
* `` -> ...2
* `` -> ...3
* `` -> ...5
* `` -> ...6
* ... and 10 more problems
Check if column names are correct.
"Expected 2 pieces. Missing pieces filled with `NA` in 118 rows [31, 39, 67, 69, 70, 71, 72, 73, 74, 76, 77, 78, 79, 81, 82, 83, 84, 86, 87, 88, ...]."

gene_id,gene_name,time,protein_fold_difference
<chr>,<chr>,<dbl>,<dbl>
Cre02.g120100.t1.2,RBCS1,1,0.961
Cre02.g120150.t1.2,RBCS2,1,0.944
g5049.t1,CPN60A,1,0.99


In [54]:
#Import sheet 1 from dataset
gene_df <- import_messy_excel(file_name, sheet_name = sheets[1]) %>%

    #extract transcript expression data only
    select(Name, 'Cre#', matches("[0-9]+[a-z]")) %>%
    gather(., matches("^[0-9].+"), key = "time", value = "fold_difference") %>%
    
    #Data manipulation into gene_data format
    separate(., Name, into= c("gene_name", "full_name"), sep = ";|,") %>% 
    #note: splitting on "m|h" discards the unit, so "30m" becomes time = 30 while "1h" becomes time = 1
    separate(., time, into= c("time","unit"), sep = "m|h") %>%
    rename(gene_id = "Cre#") %>% #rename column

    #remove rows comparing time 0h to time 0h and other rows not containing expression data
    mutate(time = as.numeric(time)) %>%
    filter(time!= 0)%>% #remove control time = 0h    
    #convert log2 fold change to linear fold difference
    mutate(fold_difference = 2^as.numeric(fold_difference)) %>%
    na.omit() %>%

    #select relevant columns
    select(gene_id, gene_name, time, fold_difference) 

print(gene_df[1:3,])
   

New names:
* `` -> ...1
* `` -> ...2
* `` -> ...3
* `` -> ...5
* `` -> ...6
* ... and 10 more problems
Check if column names are correct.
"NAs introduced by coercion"

# A tibble: 3 x 4
  gene_id            gene_name  time fold_difference
  <chr>              <chr>     <dbl>           <dbl>
1 Cre12.g554800.t1.2 PRK1         30            1.59
2 Cre03.g185550.t1.2 SBP1         30            1.30
3 Cre02.g120100.t1.2 RBCS1        30            1.72


In [56]:
#Merge gene and protein expression data
#merge() defaults to an inner join: only gene/time pairs present in both tables are kept
df1<- merge(gene_df, protein_df, by = c("gene_id", "gene_name", "time")) 
    #filter(!is.na(time))
    
print(df1[1:3,])
print(dim(df1))

                    gene_id gene_name time fold_difference
1 1::BK000554.2|DAA00906.1|      CHLB   12        1.073319
2 1::BK000554.2|DAA00906.1|      CHLB    2        2.480202
3 1::BK000554.2|DAA00906.1|      CHLB    4        1.982700
  protein_fold_difference
1                   0.328
2                   0.768
3                   0.844
[1] 447   5
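
The merge joins transcript and protein rows on `time`, so both tables need time values on the same scale. The sketch below assumes column labels such as `30m`, `1h`, `2_h` or `1_hr`, converts them to hours, and then shows what `merge()` keeps by default. `time_to_hours` is a hypothetical helper and the toy values are made up.

```r
# Hypothetical helper: convert labels such as "30m", "1h", "2_h" or "1_hr" to hours
time_to_hours <- function(label) {
  value <- as.numeric(sub("^([0-9]+).*$", "\\1", label))
  is_minutes <- grepl("^[0-9]+_?m", label)
  ifelse(is_minutes, value / 60, value)
}

hours <- time_to_hours(c("30m", "1h", "2_h", "1_hr"))
print(hours)

# Toy tables (made-up values) to show which rows merge() keeps
gene_toy <- data.frame(gene_id = "geneA",
                       time = time_to_hours(c("30m", "1h", "2h")),
                       fold_difference = c(1.5, 2.5, 0.8))
protein_toy <- data.frame(gene_id = "geneA",
                          time = time_to_hours(c("1_hr", "2_h")),
                          protein_fold_difference = c(0.9, 1.1))

# Default inner join: the 0.5 h transcript row has no protein partner and is dropped
inner <- merge(gene_toy, protein_toy, by = c("gene_id", "time"))
# all = TRUE keeps unmatched rows and fills the missing side with NA
outer <- merge(gene_toy, protein_toy, by = c("gene_id", "time"), all = TRUE)
print(dim(inner))
print(dim(outer))
```

Whether unmatched time points should be kept depends on how the exported file is used downstream; `all = TRUE` is shown only to make the difference visible.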


**Sheet 2 (`sheets[2]`, starch genes)**

In [57]:
#Import sheet 2 from dataset
df <- import_messy_excel(file_name, sheet_name = sheets[2]) 
colnames(df)[1:4]<- c("gene_id", "gene_name", "description", "category")
print(colnames(df))

protein_df<- df %>%
    
    #extract protein expression data only
    select(contains("_")) %>%
    gather(., matches("^[0-9]+_"), key = "time", value = "protein_fold_difference") %>%
    
    #remove rows without protein fold difference values
    mutate(protein_fold_difference = as.numeric(protein_fold_difference)) %>%
    na.omit() %>%
        
    #Data manipulation into gene_data format
    separate(., time, into= c("time","unit"), sep = "_") %>%
    mutate(time = as.numeric(time)) %>%
    select(gene_id, gene_name, time, protein_fold_difference) 
    
protein_df[1:3,]
#df[1:5,]


New names:
* `` -> ...6
* `` -> ...7
* `` -> ...8
* `` -> ...9
* `` -> ...10
* ... and 8 more problems
Check if column names are correct.


 [1] "gene_id"     "gene_name"   "description" "category"    "0h"         
 [6] "30m"         "1h"          "2h"          "4h"          "6h"         
[11] "12h"         "24h"         "NA"          "1_hr"        "2_h"        
[16] "4_h"         "6_h"         "12_h"        "24_h"       


"NAs introduced by coercion"

gene_id,gene_name,time,protein_fold_difference
<chr>,<chr>,<dbl>,<dbl>
Cre04.g215150.t1.2,SSS1,1,0.931
Cre17.g721500.t1.2,STA2,1,0.786
Cre06.g270100.t1.3,SBE2,1,0.929


**Data anomaly:**
* Transcript abundance values for some starch genes are abnormally high for `log2` changes (>200).
* Transcript abundance values at `time = 0h` are positive numbers, and a few of them are large.

**Action:**
* Treat transcript abundance as raw expression data and calculate the linear fold difference relative to `time = 0h`.
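
A value above 200 cannot plausibly be a `log2` ratio, since `2^200` is roughly `1.6e60`. The sketch below contrasts the two readings with made-up numbers: dividing by the `0h` column when the values are raw abundances, and exponentiating when they are `log2` ratios.

```r
# Made-up abundance values for one gene; "0h" is the control column
abundance <- c("0h" = 120, "30m" = 150, "1h" = 300, "2h" = 60)

# Reading 1: raw expression, so linear fold difference = value / control
control <- abundance[["0h"]]
fold_difference <- abundance[names(abundance) != "0h"] / control
print(fold_difference)

# Reading 2: log2 ratios, so linear fold difference = 2^value
log2_ratio <- c("30m" = 0.5, "1h" = 1, "2h" = -1)
linear_from_log2 <- 2^log2_ratio
print(linear_from_log2)

# Why >200 is implausible as a log2 ratio
print(2^200)
```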

In [58]:
#Use sheet 2 data imported above
gene_df <- df %>%

    #extract transcript expression data only
    select(contains('gene'), matches("^[0-9]+[mh]")) %>%
    rename(control = "0h") %>%
    gather(., matches("^[0-9].+"), key = "time", value = "expression") %>%
    
    #calculate linear fold difference relative to control (time = 0h) and remove rows without expression data
    mutate(fold_difference = as.numeric(expression)/as.numeric(control)) %>%
    na.omit() %>%
    
    #Data manipulation into gene_data format
    #note: splitting on "m|h" discards the unit, so "30m" becomes time = 30 while "1h" becomes time = 1
    separate(., time, into= c("time","unit"), sep = "m|h") %>%
    mutate(time = as.numeric(time)) %>%
    select(gene_id, gene_name, time, fold_difference)

print(gene_df[1:3,])

# A tibble: 3 x 4
  gene_id            gene_name  time fold_difference
  <chr>              <chr>     <dbl>           <dbl>
1 Cre01.g012600.t1.3 GPM2         30           1.54 
2 g6352              GPM1         30           1.40 
3 Cre04.g215150.t1.2 SSS1         30           0.962


In [60]:
#Merge gene and protein expression data
#merge() defaults to an inner join: only gene/time pairs present in both tables are kept
df2<- merge(gene_df, protein_df, by = c("gene_id", "gene_name", "time"))
    #filter(!is.na(time))

print(df2[1:3,])
dim(df2)


##### Join sheet data and assign candidate gene label

In [61]:
#Join sheet 1 and sheet 2 data
gene_data <- bind_rows(df1, df2)
print(dim(gene_data))

#Assign candidate gene labels based on inclusion criteria
gene_data <- gene_data %>% 
    mutate(candidate_gene = case_when(fold_difference >2 ~ TRUE, 
                                     TRUE ~ FALSE),
    #Reformat gene_id and gene_name to Cre* format and gene name abbreviations when possible
        gene_id = case_when(
                    str_detect(gene_id,"Cre.[0-9]+\\.g[0-9]+") ~ str_extract(gene_id, "Cre.[0-9]+\\.g[0-9]+"),
                    TRUE ~ gene_id),
        gene_name = case_when(
                    (nchar(gene_name)<9) & (str_detect(gene_name, "[[:alnum:]]+")) ~ str_extract(gene_name, "[[:alnum:]]+"),
                    (nchar(gene_name)>=9) & (str_detect(gene_name, "Cre.[0-9]+\\.g[0-9]+")) ~ str_extract(gene_name, "Cre.[0-9]+\\.g[0-9]+"),
                    TRUE ~ gene_name)) %>%
    select(gene_id, gene_name, candidate_gene, fold_difference, protein_fold_difference)
print(summary(gene_data))

[1] 507   5
   gene_id           gene_name         candidate_gene  fold_difference  
 Length:507         Length:507         Mode :logical   Min.   :0.02381  
 Class :character   Class :character   FALSE:455       1st Qu.:0.46981  
 Mode  :character   Mode  :character   TRUE :52        Median :0.79612  
                                                       Mean   :1.02058  
                                                       3rd Qu.:1.49233  
                                                       Max.   :4.21655  
 protein_fold_difference
 Min.   :0.2450         
 1st Qu.:0.8825         
 Median :0.9840         
 Mean   :1.0111         
 3rd Qu.:1.0819         
 Max.   :2.6890         


#### 4. Export `gene_data` to `Juergens_2015.txt`.

In [62]:
export_data(df = gene_data, target_folder = target_folder)

### Kwak_2017.xlsx

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [63]:
#Basic information

file_name = "Kwak_2017.xlsx"
genome_version = 5.5
database_source = "Phytozome 10"
inclusion_criteria = "fold difference greater than 2"
first_author = "Kwak"
publication_year = 2017
type_of_study = "gene_expression"

basic_info <- data.frame(first_author, 
                         publication_year, 
                         genome_version, 
                         database_source, 
                         inclusion_criteria, 
                         type_of_study,
                         stringsAsFactors = FALSE)
str(basic_info)

append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")

'data.frame':	1 obs. of  6 variables:
 $ first_author      : chr "Kwak"
 $ publication_year  : num 2017
 $ genome_version    : num 5.5
 $ database_source   : chr "Phytozome 10"
 $ inclusion_criteria: chr "fold difference greater than 2"
 $ type_of_study     : chr "gene_expression"


#### 3. Extract candidate and non-candidate genes based on inclusion criteria.

- The dataset provides transcript expression data for 3 biological replicates of CC-124 sampled over a 10-day period.
- Calculate the average fold difference between the 10C and 26C treatments on the same day (10C relative to 26C).

**Inclusion criteria**: If the average fold difference between the 10C and 26C treatments on the same day is >2, count the gene as a candidate gene.
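
The averaging step can be sketched in base R as follows, assuming one expression value per gene, day, replicate and treatment. The data frame and its values are made up for illustration, and the column names `cold` and `control` stand for the 10C and 26C treatments.

```r
# Made-up expression values: one gene, two days, three replicates
expr <- data.frame(
  gene_id   = "geneA",
  day       = rep(c(1, 2), each = 3),
  replicate = rep(1:3, times = 2),
  cold      = c(30, 24, 36, 10, 12, 8),   # 10C treatment
  control   = c(10, 12, 12, 10, 10, 10),  # 26C treatment
  stringsAsFactors = FALSE
)

# Per-replicate ratio, then the mean ratio per gene and day
expr$ratio <- expr$cold / expr$control
avg <- aggregate(ratio ~ gene_id + day, data = expr, FUN = mean)
names(avg)[names(avg) == "ratio"] <- "fold_difference"

# Candidate label: average fold difference greater than 2
avg$candidate_gene <- avg$fold_difference > 2
print(avg)
```

This takes the mean of the per-replicate ratios rather than the ratio of the treatment means; the two can differ when replicates vary widely. A gene can also be a candidate on one day and not on another, so the result has one row per gene and day.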

In [64]:
#List sheet names
sheets<-check_excel_data(file_name)
print(sheets)

[1] "Table Legends"                   "Table S1-Overview of clusters"  
[3] "Table S2-Transcriptomics"        "Table S3-KEGG analysis"         
[5] "Table S4-central metabolism"     "Table S5 Temperature comparison"
[7] "Table S6 Digital PCR condition" 


**Sheet 3: 'Table S2-Transcriptomics'**

In [81]:
df <- import_messy_excel(file_path = file_name, sheet_name = sheets[3])
gene_data<- df[,1:17] %>%

    rename(gene_id = "Transcrtips") %>%
               
    #Reshape dataframe such that it contains two columns of expression data 
    #(one per temperature treatment)
    gather(., contains("cc124"), key = "treatment", value = "expression") %>%
    mutate(expression = as.numeric(expression)) %>%

    separate(., treatment, into = c("strain", "temperature", "day", "replicate"), 
            sep = "[^[:alnum:]]") %>%
    spread(., key = temperature, value = expression) %>%
    rename(cold = '10', control = '26') %>%
    select(-cluster) %>%
    na.omit() %>%

    #Calculate average fold difference
    group_by(gene_id, day) %>%
    summarize(fold_difference= mean(cold/control)) %>%

    #assign candidate_gene label based on inclusion criteria
    mutate(candidate_gene = case_when(fold_difference >2 ~ TRUE,
                                     TRUE~ FALSE)) %>%

    #select only columns needed for export_data
    select(gene_id, candidate_gene, fold_difference) %>%
    ungroup(gene_id) %>%
    #Reformat gene_id to Cre* format when possible
    mutate(gene_id = case_when(
                        str_detect(gene_id,"Cre.[0-9]+\\.g[0-9]+") ~ str_extract(gene_id, "Cre.[0-9]+\\.g[0-9]+"),
                        TRUE ~ gene_id))

    
#print(colnames(df))
print(gene_data[1:5,])
print(dim(gene_data))

New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ... and 42 more problems
Check if column names are correct.


# A tibble: 5 x 3
  gene_id       candidate_gene fold_difference
  <chr>         <lgl>                    <dbl>
1 Cre01.g000017 FALSE                    1.16 
2 Cre01.g000017 FALSE                    0.589
3 Cre01.g002203 FALSE                    1.67 
4 Cre01.g002203 TRUE                     2.39 
5 Cre01.g002300 FALSE                    0.494
[1] 1700    3


#### 4. Export `gene_data` to `Kwak_2017.txt`.

In [82]:
export_data(df = gene_data, target_folder = target_folder)

### Lee_2012.xlsx

#### 1. Extract basic information
#### 2. Append basic information to `lipid_selection/data/intermediate_data_01/basic_info.txt`

In [54]:
#Basic information

#file_name = "Lee_2012.xlsx"
#genome_version = NA
#database_source = "NCBI nonredundant protein seq 2008"
#inclusion_criteria = "p-value <0.05"
#first_author = "Lee"
#publication_year = 2012
#type_of_study = "gene_expression"

#basic_info <- data.frame(first_author, 
#                         publication_year, 
#                         genome_version, 
#                         database_source, 
#                         inclusion_criteria, 
#                         stringsAsFactors = FALSE)
#str(basic_info)

#append_basic_info(basic_info, "../../intermediate_data_01/basic_info.txt")
