## Reading data into R



When reading data into R, you need to provide two pieces of information. First, you need to specify the data structure that will hold the data. Data structures come in various types, such as lists and trees; one that we have already discussed is the vector. Second, you need to tell R where the file is located.

In this lab, we will use the data_frame data structure, also known as a tibble. If you're interested, you can find more information about tibbles [here](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html). Tibbles are not part of base R, so to use this data structure we need to load an additional package.

The `tidyverse` is a collection of packages, namely ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.

Install all the packages in the tidyverse by running `install.packages("tidyverse")`.
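
If you rerun this notebook often, one way to avoid reinstalling every time is to check first whether the package is already available. The following is a minimal sketch of that idea; `missing_packages` is a hypothetical helper name introduced here for illustration, not a function from any package.

```r
# Hypothetical helper: return the names of packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse"))

# Only call install.packages() when something is actually missing.
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment this line to install
}
```

The install call is left commented out so that running the cell never downloads anything by surprise.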

In [6]:
install.packages('tidyverse')




In [7]:
library(tidyverse)



Once the package is loaded, we can proceed to read in the data. We will use the `read_csv()` function, which comes from readr, one of the packages that `library(tidyverse)` loads. Remember to load the package with `library()` before calling `read_csv()`; otherwise R will report that it could not find the function.

Let's read in our data file, which is in the .csv format. "CSV" stands for "comma-separated values": each line of the file is one row, and commas separate the values in that row. Unlike file types that are specific to a particular program, a .csv file is plain text and can be read by almost any program.
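
To make the format concrete, here is a small sketch that writes a tiny CSV file and reads it back. The file contents and the names `csv_path` and `toy_bars` are made up for illustration, and the sketch uses base R's `read.csv()` so that it runs even without the tidyverse; `read_csv()` is likewise given the path to the file.

```r
# Write a tiny CSV file to a temporary location (the contents are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("bar_name,cocoa_percent,rating",
             "Example Bar A,70,3.5",
             "Example Bar B,85,2.75"),
           csv_path)

# Each line of the file is one row; commas separate the columns.
readLines(csv_path)

# read.csv() is the base R reader. The first line becomes the column names.
toy_bars <- read.csv(csv_path)
toy_bars
```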

In this lab, we will work with a dataset containing ratings of different chocolate bars.

In [5]:
chocolateData <- read_csv("../data/flavors_of_cacao.csv")

# Some of our column names have spaces in them. This line changes the column names to
# versions without spaces, which lets us talk about the columns by their names.

names(chocolateData) <- make.names(names(chocolateData), unique=TRUE)

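
To see what `make.names()` does on its own, here is a small sketch using made-up column names (`raw_names` is an illustrative variable, not the actual header of the chocolate dataset).

```r
# Made-up column names, two of which contain spaces.
raw_names <- c("Cocoa Percent", "Bean Type", "Rating")

# make.names() swaps characters that are not allowed in R names for dots.
clean_names <- make.names(raw_names, unique = TRUE)
clean_names
# "Cocoa.Percent" "Bean.Type" "Rating"

# unique = TRUE also renames duplicates so every column name is distinct.
make.names(c("Rating", "Rating"), unique = TRUE)
# "Rating" "Rating.1"
```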


In [None]:
# Try!

# To give you practice reading in files, please use another dataset, "food_coded.csv".
# This dataset is in the following place: ../data/food_coded.csv


# read in your dataset and save it as a variable called "foodPreferences"


## Look at the data we've read in

Congrats, you've gotten some data into R! Now we want to make sure that it all read in correctly, and get an idea of what's in our data file.

In [None]:
# the head() function shows just the first few rows of a data_frame.
head(chocolateData)

# the tail() function shows just the last few rows of a data_frame.
# we can also give both functions a specific number of rows to show.
# This line will show the last three rows of "chocolateData".
tail(chocolateData, 3)
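
If the chocolate data is not loaded, the same calls can be tried on a small stand-in. The sketch below builds a made-up ten-row base R data frame (`toy_ratings` is a hypothetical name) and shows how the optional second argument controls the number of rows returned.

```r
# A made-up data frame with ten rows.
toy_ratings <- data.frame(
  bar = paste("Bar", 1:10),
  rating = seq(1, 5.5, by = 0.5)
)

# With no second argument, head() returns the first six rows.
head(toy_ratings)

# Ask for a specific number of rows from either end.
head(toy_ratings, 2)
tail(toy_ratings, 3)
```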

In [None]:
# Try!

# Get the first four lines of the foodPreferences dataframe you read in earlier



A data_frame has two dimensions, rows and columns, which distinguishes it from the one-dimensional vectors we worked with previously. Each column is itself a vector, and both rows and columns are numbered, so we can access specific cells within the data_frame using the indexes of the values we want.
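
As a quick check of this idea, the sketch below inspects the dimensions of a small made-up base R data frame (`toy_ratings` is a hypothetical name) and pulls out one column with the `$` operator to confirm that it is an ordinary vector.

```r
# A made-up data frame with three rows and three columns.
toy_ratings <- data.frame(
  bar = c("A", "B", "C"),
  cocoa_percent = c(70, 85, 64),
  rating = c(3.5, 2.75, 3)
)

# dim() gives the number of rows first, then the number of columns.
dim(toy_ratings)
nrow(toy_ratings)
ncol(toy_ratings)

# A single column pulled out with $ is an ordinary vector
# with one value per row.
ratings <- toy_ratings$rating
is.vector(ratings)
length(ratings) == nrow(toy_ratings)
```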

Let's have a quick refresher on how to access data by its index:

In [None]:
# make a little example vector
a <- c(5,10,15)

# if you ask for something at an index, but don't say which one, you'll get everything
a[]

# if you ask for a value at a specific index, you'll get only that value. In R,
# indexes start counting from 1 and go up. (So 1 is the first value.)
a[1]
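
A few more variations on the same vector may help; this sketch only uses base R indexing.

```r
# the same little example vector
a <- c(5, 10, 15)

# the third value
a[3]

# several values at once: pass a vector of indexes
a[c(1, 3)]

# a range of indexes
a[2:3]

# asking for an index past the end gives NA rather than an error
a[4]
```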

Data_frames work the same way, but you need to specify both the row and column, with a comma between them.

> In R, if you ask for something from a two dimensional data structure, you'll always ask for the row first and the column second. So "dataObject[2,4]" means "give me whatever is in the 2nd row and 4th column of the data frame called 'dataObject'".

In [None]:
# get the contents of the cell in the sixth row and the fourth column
chocolateData[6,4]

# get the contents of every cell in the 6th row (note that you still need the comma!)
chocolateData[6,]

# if you forget the comma, you'll get the 6th *column* instead of the 6th *row*
head(chocolateData[6])
# We've used "head" here because the column is very long and we don't want
# to fill up the screen by printing the whole thing out
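
The same row-and-column pattern can be tried on a small made-up data frame (`toy_ratings` is a hypothetical name). Note one assumption: this sketch uses a base R data frame, where asking for a single cell returns the bare value; a tibble keeps the result as a tibble instead.

```r
# A made-up data frame with three rows and three columns.
toy_ratings <- data.frame(
  bar = c("A", "B", "C"),
  cocoa_percent = c(70, 85, 64),
  rating = c(3.5, 2.75, 3)
)

# row first, column second: 2nd row, 3rd column
toy_ratings[2, 3]

# leave the column blank (but keep the comma) to get the whole 2nd row
toy_ratings[2, ]

# with no comma at all, the number is treated as a column index
toy_ratings[3]
```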

In [None]:
# Try
# dataframe[row,column]
# Get the first row of your "foodPreferences" data_frame

# Get the value from the cell in the 100th row and 4th column


## Remove unwanted data


In addition to using indexes to get certain values, we can also use them to *remove* data we're not interested in. You can do this by putting a minus sign (-) in front of the index you don't want.
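
Here is a minimal sketch of negative indexing on a vector and on the rows of a small made-up base R data frame (`toy` is a hypothetical name). Note that the original object is unchanged unless you assign the result back to it.

```r
a <- c(5, 10, 15)

# everything except the first value
a[-1]

# everything except the first and second values
a[-c(1, 2)]

# the same idea works for rows of a data frame (keep the comma!)
toy <- data.frame(bar = c("A", "B", "C"), rating = c(3.5, 2.75, 3))
toy[-1, ]

# a itself still has all three values, because we never reassigned it
a
```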

You may have noticed earlier that the first row of the "chocolateData" data_frame is the same as the column names. Let's remove it.

In [None]:
head(chocolateData)

In [4]:
# keep every row EXCEPT the first one, along with all of the columns of chocolateData.
# By putting it back in the same variable, we're overwriting what was in 
# that variable before, so be careful with this!
chocolateData <- chocolateData[-1,] 

# make sure we removed the row we didn't want
head(chocolateData)



In [None]:
# Your turn!

# The 5th column in the "foodPreferences" dataset has a lot of values that aren't 
# numbers (nan means "not a number"). Can you remove the 5th column from the dataset?


Alright, now that we've read data into R, checked that it looks alright and gotten rid of a row we didn't want, it's time to get down to doing some analysis.