Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The three first text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: Check your lab QC data! Is everything within limits?  

**Name:** Deniz Ilhan Topcu

**Email address associated with your DataCamp account:** ditopcu@gmail.com

**GitHub username:** ditopcu

**Project description**: 

How can clinical laboratories be sure about their results? Statistical **Quality Control** (QC) methods help laboratories ensure the required quality every day. But evaluating large numbers of quality control results can be overwhelming. Once again, data science tools can make this process easier!  
In this project, we will investigate laboratory QC data using data import, data wrangling, and visualization tools in R. We will calculate QC statistics, plot Levey-Jennings charts, and apply basic [_“Westgard Rules”_](https://www.westgard.com/mltirule.htm).

Completing the [“Introduction to the Tidyverse”](https://www.datacamp.com/courses/introduction-to-the-tidyverse) course, or having experience with the `dplyr` and `ggplot2` packages, is recommended. Familiarity with importing data, writing functions, and basic functional programming will also be helpful.  
  
  
We will examine two-level QC results for 10 different biochemistry analytes in this project. The dataset comprises simulated QC results based on real clinical laboratory data.  

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Understanding and Importing Data

Clinical laboratory tests such as serum glucose, HDL, and iron are used to evaluate a patient's condition, and they are important for clinical decisions.  

![Clinical Lab](img/visual.png)


Quality control (QC) is a process that periodically examines the measurement procedure of these tests and verifies that it is performing according to predefined specifications. There are two main types of measurement error: inaccuracy and imprecision.  


![Measure](img/measure03.png)


**Accuracy** describes the closeness of a measurement to the true value. **Precision** is the closeness of agreement between repeated measurements of a sample.  
Laboratories use internal quality control (IQC) procedures to assess their imprecision (random error). Traditionally, IQC uses sample materials with assigned values, and IQC results are evaluated continuously against these known values.   
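
As a rough illustration of the two error types, the sketch below uses made-up repeated measurements of a control material with a hypothetical assigned value. The numbers and the variable names (`assigned_value`, `measurements`) are assumptions for illustration only, not values from the project dataset. Bias estimates inaccuracy, while the standard deviation (SD) and coefficient of variation (CV) estimate imprecision.

```r
# Hypothetical assigned (target) value of a control material
assigned_value <- 100

# Made-up repeated measurements of the same control sample
measurements <- c(101, 103, 102, 104, 100)

# Inaccuracy: how far the average result lies from the assigned value
bias <- mean(measurements) - assigned_value
bias_percent <- 100 * bias / assigned_value

# Imprecision: how much repeated results scatter around their own mean
sd_result <- sd(measurements)
cv_percent <- 100 * sd_result / mean(measurements)

round(c(bias = bias, bias_percent = bias_percent,
        sd = sd_result, cv_percent = cv_percent), 2)
```

In this made-up case the results are consistently a little high (a bias of 2 units) but scatter only slightly around their own mean.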

To understand these procedures and evaluate imprecision, we are going to inspect QC results. 
We have QC results for 10 different analytes as separate CSV files, plus one CSV file with the predefined specifications for these 10 tests. We are going to import these files and then evaluate precision in terms of the mean and standard deviation. Finally, we will plot Levey-Jennings charts using these statistics.  
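
A Levey-Jennings chart plots each QC result over time against the mean, with lines at ±1, ±2, and ±3 SD. The sketch below shows only the underlying arithmetic on made-up numbers. The `qc` tibble, `target_mean`, and `target_sd` are hypothetical and are not taken from the project files.

```r
library(dplyr)
library(tibble)

# Made-up QC results for one control level (hypothetical values)
qc <- tibble(
  run    = 1:8,
  result = c(100, 102, 98, 101, 99, 103, 97, 110)
)

# Hypothetical target mean and SD, e.g. as supplied for the control lot
target_mean <- 100
target_sd   <- 2

# Express each result as a z-score and flag results beyond the 2 SD and 3 SD lines
qc_flagged <- qc %>%
  mutate(
    z         = (result - target_mean) / target_sd,
    beyond_2s = abs(z) > 2,
    beyond_3s = abs(z) > 3
  )

qc_flagged
```

A single result beyond ±2 SD is commonly treated as a warning and one beyond ±3 SD as a rejection signal. These are the simplest of the Westgard rules.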

Let's start by importing the data.


In [23]:
# Load the tidyverse and lubridate packages
library(tidyverse)
library(lubridate)

# Before importing data, let's check which files we have. Then we import the albumin results.

# Check the datasets directory for .csv files whose names end with _QC_results
list.files("datasets/", pattern = "_QC_results\\.csv$")

# Import Albumin_QC_results.csv into test_read using read_csv and inspect the result
test_read <- read_csv("datasets/Albumin_QC_results.csv")

glimpse(test_read)
head(test_read)

# Import Albumin_QC_results.csv into albumin_qc and inspect the result
# The result_date column was parsed as character in test_read
# Let's fix this using the ymd_hms function (see ?ymd_hms for details)
# Also convert the lot_number and level columns into factors


albumin_qc <- read_csv("datasets/Albumin_QC_results.csv") %>% 
    mutate(result_date = ymd_hms(result_date)) %>% 
    mutate(lot_number = as.factor(lot_number), level = as.factor(level))
    
glimpse(albumin_qc)
head(albumin_qc)


# Import all test specifications (test_specs.csv) into test_performance_data and inspect
test_performance_data <- read_csv("datasets/test_specs.csv")

glimpse(test_performance_data)


Parsed with column specification:
cols(
  new_device_name = col_character(),
  lot_number = col_integer(),
  level = col_character(),
  result_date = col_character(),
  test_code = col_character(),
  result = col_double(),
  unit = col_character(),
  name = col_character()
)


Observations: 376
Variables: 8
$ new_device_name <chr> "Analyser 001", "Analyser 001", "Analyser 001", "An...
$ lot_number      <int> 123456, 123456, 123456, 123456, 123456, 123456, 123...
$ level           <chr> "002", "001", "001", "002", "001", "002", "002", "0...
$ result_date     <chr> "2018.01.01 08:13:00", "2018.01.01 08:21:00", "2018...
$ test_code       <chr> "t001", "t001", "t001", "t001", "t001", "t001", "t0...
$ result          <dbl> 4.17, 3.28, 3.24, 4.13, 3.23, 4.16, 4.18, 3.22, 3.2...
$ unit            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ name            <chr> "Turner, Cameron", "Cordova, Rocky", "Cordova, Rock...


new_device_name,lot_number,level,result_date,test_code,result,unit,name
Analyser 001,123456,2,2018.01.01 08:13:00,t001,4.17,,"Turner, Cameron"
Analyser 001,123456,1,2018.01.01 08:21:00,t001,3.28,,"Cordova, Rocky"
Analyser 001,123456,1,2018.01.02 09:20:00,t001,3.24,,"Cordova, Rocky"
Analyser 001,123456,2,2018.01.02 09:28:00,t001,4.13,,"Ramundo, Dustyn"
Analyser 001,123456,1,2018.01.03 09:21:00,t001,3.23,,"Acres, Valerie"
Analyser 001,123456,2,2018.01.03 09:21:00,t001,4.16,,"Cordova, Rocky"


Parsed with column specification:
cols(
  new_device_name = col_character(),
  lot_number = col_integer(),
  level = col_character(),
  result_date = col_character(),
  test_code = col_character(),
  result = col_double(),
  unit = col_character(),
  name = col_character()
)


Observations: 376
Variables: 8
$ new_device_name <chr> "Analyser 001", "Analyser 001", "Analyser 001", "An...
$ lot_number      <fct> 123456, 123456, 123456, 123456, 123456, 123456, 123...
$ level           <fct> 002, 001, 001, 002, 001, 002, 002, 001, 001, 002, 0...
$ result_date     <dttm> 2018-01-01 08:13:00, 2018-01-01 08:21:00, 2018-01-...
$ test_code       <chr> "t001", "t001", "t001", "t001", "t001", "t001", "t0...
$ result          <dbl> 4.17, 3.28, 3.24, 4.13, 3.23, 4.16, 4.18, 3.22, 3.2...
$ unit            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ name            <chr> "Turner, Cameron", "Cordova, Rocky", "Cordova, Rock...


new_device_name,lot_number,level,result_date,test_code,result,unit,name
Analyser 001,123456,2,2018-01-01 08:13:00,t001,4.17,,"Turner, Cameron"
Analyser 001,123456,1,2018-01-01 08:21:00,t001,3.28,,"Cordova, Rocky"
Analyser 001,123456,1,2018-01-02 09:20:00,t001,3.24,,"Cordova, Rocky"
Analyser 001,123456,2,2018-01-02 09:28:00,t001,4.13,,"Ramundo, Dustyn"
Analyser 001,123456,1,2018-01-03 09:21:00,t001,3.23,,"Acres, Valerie"
Analyser 001,123456,2,2018-01-03 09:21:00,t001,4.16,,"Cordova, Rocky"


Parsed with column specification:
cols(
  new_test_name = col_character(),
  new_lot_number = col_character(),
  man_mean = col_number(),
  man_sd = col_character()
)


Observations: 24
Variables: 4
$ new_test_name  <chr> "Albumin", "Albumin", "ALT", "ALT", "Amylase", "Amyl...
$ new_lot_number <chr> "123456-001", "123456-002", "123456-001", "123456-00...
$ man_mean       <dbl> 324, 417, 27, 1189, 415, 1225, 579, 1019, 79, 942, 4...
$ man_sd         <chr> "0,072", "0,075", "1,168", "5,65", "1,44", "3,79", "...


## 2. Importing files using a function

We have just read the albumin QC results, but there are 9 files to go! Should we copy and paste the code above?

No. There must be a more elegant way. Let's define a function and use it to read all the files. To do this:

1. Define a reader function: we need a function that takes a file name as a parameter and returns a tibble. When calling `read_csv` inside the function, we should specify the column types (`col_types`). This will a) ensure we are reading the files correctly and b) suppress the column specification messages.

2. Test the function.

3. Use some kind of loop to read all the files: we will use the `purrr::map` family of functions.

Now we are going to complete the first two steps.

In [19]:
# We already know how to read a file. Converting that code into a function is simple.

qc_result_reader <- function(file_name) {
    read_csv(file_name,
             col_types = cols(
                 new_device_name = col_character(),
                 lot_number = col_integer(),
                 level = col_character(),
                 result_date = col_character(),
                 test_code = col_character(),
                 result = col_double(),
                 unit = col_character(),
                 name = col_character())) %>%
        mutate(result_date = ymd_hms(result_date)) %>%
        mutate(lot_number = as.factor(lot_number), level = as.factor(level))
}


# Test the function with Glucose_QC_results.csv and import the glucose data into glucose_qc

glucose_qc <- qc_result_reader("datasets/Glucose_QC_results.csv")



glimpse(glucose_qc)
head(glucose_qc)

Observations: 376
Variables: 8
$ new_device_name <chr> "Analyser 001", "Analyser 001", "Analyser 001", "An...
$ lot_number      <fct> 123456, 123456, 123456, 123456, 123456, 123456, 123...
$ level           <fct> 002, 001, 001, 002, 002, 001, 001, 002, 001, 002, 0...
$ result_date     <dttm> 2018-01-01 08:18:00, 2018-01-01 08:23:00, 2018-01-...
$ test_code       <chr> "t008", "t008", "t008", "t008", "t008", "t008", "t0...
$ result          <dbl> 119, 46, 46, 118, 119, 46, 46, 119, 46, 119, 46, 11...
$ unit            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ name            <chr> "Seidel, Sophia", "Seidel, Sophia", "Acres, Valerie...


new_device_name,lot_number,level,result_date,test_code,result,unit,name
Analyser 001,123456,2,2018-01-01 08:18:00,t008,119,,"Seidel, Sophia"
Analyser 001,123456,1,2018-01-01 08:23:00,t008,46,,"Seidel, Sophia"
Analyser 001,123456,1,2018-01-02 09:28:00,t008,46,,"Acres, Valerie"
Analyser 001,123456,2,2018-01-02 09:28:00,t008,118,,"Acres, Valerie"
Analyser 001,123456,2,2018-01-03 09:29:00,t008,119,,"Acres, Valerie"
Analyser 001,123456,1,2018-01-03 09:30:00,t008,46,,"Ramundo, Dustyn"


## 3. Importing multiple files

Yes! Our function works!
  
Now we have a function to read our QC files. 

Hint: We provide `col_types` to ensure we are reading the columns correctly. We can also use more specific `col_*()` parsers instead of additional `mutate()` calls. For example, `result_date = col_datetime(format = "%Y.%m.%d %H:%M:%S")` can replace `lubridate::ymd_hms`. This can be faster on big data sets.  
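
The hint above could look roughly like this. `qc_result_reader_v2` is a hypothetical name, and the small CSV written to a temporary file only mimics the structure shown in the earlier output. Calling `col_factor()` without explicit levels assumes a reasonably recent version of `readr`. Treat this as a sketch rather than the project's solution.

```r
library(readr)

# Hypothetical variant of the reader: parse types directly in read_csv
qc_result_reader_v2 <- function(file_name) {
  read_csv(
    file_name,
    col_types = cols(
      new_device_name = col_character(),
      lot_number      = col_factor(),
      level           = col_factor(),
      result_date     = col_datetime(format = "%Y.%m.%d %H:%M:%S"),
      test_code       = col_character(),
      result          = col_double(),
      unit            = col_character(),
      name            = col_character()
    )
  )
}

# Small made-up file with the same columns as the QC result files
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "new_device_name,lot_number,level,result_date,test_code,result,unit,name",
  'Analyser 001,123456,002,2018.01.01 08:13:00,t001,4.17,,"Turner, Cameron"',
  'Analyser 001,123456,001,2018.01.01 08:21:00,t001,3.28,,"Cordova, Rocky"'
), tmp)

demo <- qc_result_reader_v2(tmp)
str(demo$result_date)
```

Because the parsing happens while the file is read, no extra `mutate()` step is needed afterwards.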
  

We can use `map_df` from the `purrr` package to read all the files and combine them into a single tibble. The `purrr::map` family of functions provides simple interfaces for repetitive tasks. 


`map_df` also accepts an `.id` argument, which creates an additional variable containing the name of each input element, in this case derived from our file names (see `?map_df` for details).

To obtain test names from the file names, we can use a regular expression.
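
A minimal sketch of how `.id` behaves, using two small made-up files in a temporary directory (the file contents and the `demo_` names are hypothetical): the names of the input vector end up in the new column.

```r
library(readr)
library(purrr)
library(stringr)

# Create two tiny made-up CSV files in a temporary directory
demo_dir <- file.path(tempdir(), "qc_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("result", "4.17", "3.28"), file.path(demo_dir, "Albumin_QC_results.csv"))
writeLines(c("result", "119", "46"), file.path(demo_dir, "Glucose_QC_results.csv"))

# List the files, then derive a test name from each file name
demo_files <- list.files(demo_dir, pattern = "_QC_results\\.csv$", full.names = TRUE)
names(demo_files) <- str_remove(basename(demo_files), "_QC_results\\.csv$")

# Read every file; .id stores the name of each element in `test_name`
demo_all <- map_df(
  demo_files,
  ~ read_csv(.x, col_types = cols(result = col_double())),
  .id = "test_name"
)

demo_all
```

If the input vector had no names, `.id` would store the position of each element instead, which is why the names are set before mapping.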

In [24]:

# Get file names with their paths using the full.names argument

file_list <- list.files(path = "datasets/", pattern = "_QC_results\\.csv$", full.names = TRUE)

# Inspect file_list
file_list


# Drop the path with basename(), then use a regex to remove the "_QC_results.csv" suffix
# and obtain the test names. Set these as names so that map_df can use them.

test_names <- file_list %>%
    basename() %>% # keeps only the file name, however the path is written
    str_replace("_QC_results\\.csv$", "") # removes _QC_results.csv

names(file_list) <- test_names

# Read all files in file_list
all_test_qc <- map_df(file_list, qc_result_reader, .id = "test_name")

all_test_qc



glimpse(all_test_qc)
head(all_test_qc)

# Now we are ready to join data and do statistics.

test_name,new_device_name,lot_number,level,result_date,test_code,result,unit,name
Albumin,Analyser 001,123456,002,2018-01-01 08:13:00,t001,4.17,,"Turner, Cameron"
Albumin,Analyser 001,123456,001,2018-01-01 08:21:00,t001,3.28,,"Cordova, Rocky"
Albumin,Analyser 001,123456,001,2018-01-02 09:20:00,t001,3.24,,"Cordova, Rocky"
Albumin,Analyser 001,123456,002,2018-01-02 09:28:00,t001,4.13,,"Ramundo, Dustyn"
Albumin,Analyser 001,123456,001,2018-01-03 09:21:00,t001,3.23,,"Acres, Valerie"
Albumin,Analyser 001,123456,002,2018-01-03 09:21:00,t001,4.16,,"Cordova, Rocky"
Albumin,Analyser 001,123456,002,2018-01-04 09:45:00,t001,4.18,,"Ramundo, Dustyn"
Albumin,Analyser 001,123456,001,2018-01-04 09:49:00,t001,3.22,,"Ramundo, Dustyn"
Albumin,Analyser 001,123456,001,2018-01-05 09:23:00,t001,3.23,,"Seidel, Sophia"
Albumin,Analyser 001,123456,002,2018-01-05 09:31:00,t001,4.19,,"Turner, Cameron"


Observations: 4,511
Variables: 9
$ test_name       <chr> "Albumin", "Albumin", "Albumin", "Albumin", "Albumi...
$ new_device_name <chr> "Analyser 001", "Analyser 001", "Analyser 001", "An...
$ lot_number      <fct> 123456, 123456, 123456, 123456, 123456, 123456, 123...
$ level           <fct> 002, 001, 001, 002, 001, 002, 002, 001, 001, 002, 0...
$ result_date     <dttm> 2018-01-01 08:13:00, 2018-01-01 08:21:00, 2018-01-...
$ test_code       <chr> "t001", "t001", "t001", "t001", "t001", "t001", "t0...
$ result          <dbl> 4.17, 3.28, 3.24, 4.13, 3.23, 4.16, 4.18, 3.22, 3.2...
$ unit            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ name            <chr> "Turner, Cameron", "Cordova, Rocky", "Cordova, Rock...


test_name,new_device_name,lot_number,level,result_date,test_code,result,unit,name
Albumin,Analyser 001,123456,2,2018-01-01 08:13:00,t001,4.17,,"Turner, Cameron"
Albumin,Analyser 001,123456,1,2018-01-01 08:21:00,t001,3.28,,"Cordova, Rocky"
Albumin,Analyser 001,123456,1,2018-01-02 09:20:00,t001,3.24,,"Cordova, Rocky"
Albumin,Analyser 001,123456,2,2018-01-02 09:28:00,t001,4.13,,"Ramundo, Dustyn"
Albumin,Analyser 001,123456,1,2018-01-03 09:21:00,t001,3.23,,"Acres, Valerie"
Albumin,Analyser 001,123456,2,2018-01-03 09:21:00,t001,4.16,,"Cordova, Rocky"


*Stop here! Only the three first tasks. :)*