## Learning Objectives
- Install and load packages.
- Load external data from a .csv file into a data frame.
- Describe what a data frame is.
- Summarize the contents of a data frame.
- Use indexing to subset specific portions of data frames.
- Describe what a factor is.
- Convert between strings and factors.
- Reorder and rename factors.
- Change how character strings are handled in a data frame.

## Installing and loading packages 

Packages are self-contained collections of functions, generally built around a central goal or topic. qPCR analysis, data cleaning, and package management are all topics that have their own packages.

**'Installing'** a package with `install.packages()` downloads it from CRAN to your local machine and tells R where to find it. You only have to do this once.

**'Loading'** a package with `library()` makes its functions available in your current session. You have to do this every time you start a new session.
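
The two steps are often combined so that a script installs a package only when it is missing and then loads it. The sketch below is one way to do that; `load_or_install()` is a hypothetical helper written for this example, not a function from R or the tidyverse.

```r
# Hypothetical helper (not part of any package): install a package only if
# it is missing, then load it into the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call only loads it.
load_or_install("stats")
```

Calling `load_or_install("tidyverse")` would behave the same way, downloading the package on the first run only.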


In [None]:
install.packages("tidyverse")

In [None]:
library(tidyverse)

Most packages come with one or more vignettes: long-form guides that describe the package's functions, detail their inputs and outputs, and usually include helpful examples. I recommend reading at least the relevant parts when you first use a package.
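
As a rough illustration of how to find vignettes, the sketch below lists the ones shipped with `grid`, a package that is bundled with R (chosen here only because it needs no installation). The same calls should work with the name of any installed package.

```r
# List the vignettes that ship with a package; "grid" is bundled with R.
grid_vignettes <- vignette(package = "grid")

# The result stores one row per vignette; show their names and titles.
grid_vignettes$results[, c("Item", "Title")]

# In an interactive session, open a vignette index in the browser instead:
# browseVignettes("tidyverse")
```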

## Reading in a dataset to make a data frame

For reading, writing, manipulating, and visualizing data we will use the `tidyverse` package. It is an extremely diverse and useful set of functions for data analysis.

To read a CSV file we use `read_csv()` and specify the file name (make sure the file is in your working directory, or give the path to it). You can also explicitly specify the data type of each column with the `col_types` argument. `read_csv()` returns the data as a tibble, the tidyverse flavor of a data frame.
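
The compact `col_types` string uses one letter per column, in order: for example `c` for character, `d` for double, `f` for factor, and `l` for logical. Below is a minimal sketch on a small made-up file written to a temporary location; the file contents and the `Treated` column are invented for the example.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("Wells,Signal,Condition,Treated",
    "C01,23768,D,TRUE",
    "C10,4165,NO treat,FALSE",
    "D01,5996,D,TRUE"),
  toy_path
)

# One letter per column: character, double, factor, logical.
toy <- read_csv(toy_path, col_types = "cdfl")
str(toy)
```

Spelling out the types this way avoids relying on the guesses `read_csv()` would otherwise make from the first rows of the file.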



## What are data frames?

Data frames are the _de facto_ data structure for most tabular data, and what we
use for statistics and plotting.

A data frame can be created by hand, but most commonly they are generated by the
functions `read.csv()` or `read.table()`; in other words, when importing
spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the
columns are vectors that all have the same length. Because columns are
vectors, each column must contain a single type of data (e.g., characters, integers,
factors). For example, here is a figure depicting a data frame comprising a
numeric, a character, and a logical vector.

![](./img/data-frame.svg)
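
To make the "columns are equal-length vectors" idea concrete, here is a small sketch that builds a data frame by hand in base R. The column names and values are invented for the example.

```r
# Three vectors of the same length, each holding a single type of data.
signal    <- c(23768, 4165, 5996)        # numeric
condition <- c("D", "NO treat", "D")     # character
treated   <- c(TRUE, FALSE, TRUE)        # logical

# Each vector becomes one column of the data frame.
toy_df <- data.frame(signal, condition, treated, stringsAsFactors = FALSE)

# Check the type of every column.
sapply(toy_df, class)
```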

In [1]:
csv_luciferase <- read_csv(file = "luciferase_toy_data.csv", col_types = "cdfddcdcffl")
print(csv_luciferase)



# A tibble: 74 x 11
   Wells Signal Condition   Rep L_Concentration L_Units D_Concentration D_Units
   <chr>  <dbl> <fct>     <dbl>           <dbl> <chr>             <dbl> <chr>  
 1 C01    23768 D             1               0 uM                   10 uM     
 2 C10     4165 NO treat      1               0 uM                    0 uM     
 3 D01     5996 D             2               0 uM                   10 uM     
 4 D10    50004 NO treat      2               0 uM                    0 uM     
 5 E01     3559 D             3               0 uM                   10 uM     
 6 E10   327884 NO

Full datasets are often too large to view all at once, so we usually want to inspect just the first or last few rows. The `head()` and `tail()` functions do exactly that.
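
As a quick illustration of these inspection functions, the sketch below uses `mtcars`, a small data frame bundled with R, rather than the luciferase data, so the exercises that follow are left for you to try.

```r
# mtcars is a small data frame bundled with R (32 rows, 11 columns).
head(mtcars, n = 3)    # first three rows
tail(mtcars, n = 3)    # last three rows

dim(mtcars)            # number of rows, then number of columns
nrow(mtcars)           # number of rows
ncol(mtcars)           # number of columns

names(mtcars)          # column names
rownames(mtcars)[1:3]  # first three row names
```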



In [None]:
# Content: try using head() and tail() to inspect the data

# Size: use dim(), nrow(), and ncol() to return the dimensions, number of rows, and number of columns

# Names: use names() and rownames() to get the column and row names of the data frame

We need meaningful ways to quickly understand a large amount of data. We can see this when inspecting the <b>str</b>ucture of a data frame
with the function `str()`:

In [None]:
str(csv_luciferase)

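### Summarizing the values in each column
To inspect the columns and understand what values they hold, use `summary()`: it reports quartiles and the mean for numeric columns and the count of each level for factors. For character columns it only reports the length and type, so use the `table()` function on a single column to count how many times each value occurs.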
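
A minimal sketch of counting values with `table()`, using a made-up vector of condition labels (the values are invented for the example):

```r
# A small character vector of condition labels.
condition <- c("D", "NO treat", "D", "D", "NO treat")

# table() counts how many times each distinct value occurs.
condition_counts <- table(condition)
condition_counts

# summary() on the same values stored as a factor gives the same counts.
summary(factor(condition))
```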

In [None]:
summary(csv_luciferase)

For numerical columns, one can use `hist()` to make a quick-and-dirty histogram and get a sense of the distribution of the data.

In [None]:
# use hist() with the Signal column of our luciferase data as the first argument and set the argument breaks = 20

hist(csv_luciferase$Signal, breaks = 20)

## Indexing and subsetting data frames

Our luciferase data frame has rows and columns (it has 2 dimensions). If we want to
extract some specific data from it, we need to specify the "coordinates" we
want from it. Row numbers come first, followed by column numbers. Note that
`read_csv()` returns a tibble, and subsetting a tibble with `[` always returns
another tibble; to extract a column as a plain vector, use `[[` or `$`.



In [None]:
# first row of the first column (a 1 x 1 tibble)
csv_luciferase[1, 1]
# first row of the 6th column (a 1 x 1 tibble)
csv_luciferase[1, 6]
# first column of the data frame (a one-column tibble)
csv_luciferase[, 1]
# first column, selected by position only (also a one-column tibble)
csv_luciferase[1]
# first three rows of the 7th column (a 3 x 1 tibble)
csv_luciferase[1:3, 7]
# the 3rd row of the data frame (a 1 x 11 tibble)
csv_luciferase[3, ]
# first column as a plain vector
csv_luciferase[[1]]



### Subsetting by calling column names directly

Column names can be used directly to inspect one whole column at a time. The column is referenced with the `$` operator. For example, to see the entire `Condition` column we can use `csv_luciferase$Condition`:
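
To see how `$` compares with the bracket forms, here is a small sketch on a hand-made base `data.frame`. The toy values are illustrative only, and the comments describe base R behavior.

```r
# A tiny base data.frame standing in for the real data.
toy <- data.frame(
  Wells  = c("C01", "C10", "D01"),
  Signal = c(23768, 4165, 5996),
  stringsAsFactors = FALSE
)

toy$Signal        # a numeric vector
toy[["Signal"]]   # the same vector, selected by name with [[
toy["Signal"]     # a one-column data frame
toy[, "Signal"]   # a vector for a base data.frame (a tibble would stay a tibble)
```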


In [None]:
csv_luciferase$Condition

### Challenge

In [None]:
# 1. Create a `data.frame` (`luciferase_20`) containing only the data in row 20 of the `csv_luciferase` dataset.



In [None]:
# 2. Notice how `nrow()` gave you the number of rows in a `data.frame`?
#    Use that number to pull out just the last row of the data frame.
#    Compare that with what you see as the last row using `tail()` to make
#    sure it's meeting expectations.
#    Pull out that last row using `nrow()` instead of the row number.
#    Create a new data frame (`luciferase_last`) from that last row.


In [None]:
# 3. Use `nrow()` to extract the row that is in the middle of the data
#    frame. Store the content of this row in an object named `luciferase_middle`.


In [None]:
# 4. Combine `nrow()` with negative indexing (the `-` notation, which drops
#    the rows you list) to reproduce the behavior of `head(csv_luciferase)`,
#    keeping just the first through 6th rows of the dataset.


## Factors

Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.
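
The sketch below shows why that care is needed, using a made-up factor whose labels look like numbers. Converting it directly to integers returns the internal codes, not the values; going through `as.character()` first recovers them.

```r
# A factor whose labels happen to look like numbers (values are made up).
conc <- factor(c("10", "0", "10", "100"))

levels(conc)                     # "0" "10" "100" (sorted as text)
as.integer(conc)                 # 2 1 2 3: the internal codes, not the values
as.character(conc)               # "10" "0" "10" "100": the labels as strings
as.numeric(as.character(conc))   # 10 0 10 100: the intended numbers

# Going the other way: turn a character vector into a factor.
units <- factor(c("uM", "uM", "nM"))
```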

Once created, factors can only contain a pre-defined set of values, known as levels. By default, `factor()` sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

In [None]:
plate <- csv_luciferase$Plate_ID

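If the levels are in alphabetical order, R assigns 1 to the level "BIC_luc" and 2 to the level "ENZ_luc" (because B comes before E). Columns parsed as factors by `read_csv()` instead keep their levels in order of first appearance in the file, so it is worth checking. You can see the levels with the function `levels()` and find the number of levels with `nlevels()`: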

In [None]:
# use the functions levels() and nlevels() to examine the Plate_ID column



Sometimes the order of the levels does not matter. Other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder the levels in the `plate` vector would be:

In [None]:
plate



In [None]:
plate  <- factor(plate, levels = c('ENZ_luc', 'BIC_luc'))
plate # after reordering

## Viewing and renaming factors

When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. 
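
Level names can be changed by assigning a new character vector to `levels()`. Below is a minimal sketch on a made-up factor (the labels are invented for the example), with `table()` used as a text alternative to `plot()` for checking the counts.

```r
# A made-up factor with two levels.
group <- factor(c("ctrl", "trt", "trt", "ctrl", "trt"))
levels(group)   # "ctrl" "trt"

# Rename the levels; the new names replace the old ones position by position.
levels(group) <- c("control", "treated")

# Every observation now carries the new label.
table(group)
```

Only the labels change; the number of observations in each level stays the same.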

In [None]:
plot(plate)

In [None]:
# let's rename them to remove the "_luc" since it's redundant




In [None]:
# plot to make sure it's correct

