<a href="https://colab.research.google.com/github/Xavier-ML/Programing-with-R--step-by-step/blob/master/getting_started_in_r_load_data_into_r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://www.r-project.org/Rlogo.png)

____________________________________________________________________________________
This tutorial is the second part of a series. If you've never programmed before, I recommend checking out the "[First Steps](https://www.kaggle.com/rtatman/getting-started-in-r-first-steps/)" part of the tutorial.

In this part of the tutorial, we'll:

* read data into R
* look at the data we've read in
* remove unwanted rows

____________________________________________________________________________________


### Learning goals:

By the end of this tutorial, you will be able to do the following things. (Don't worry if you don't know what all these things are yet; we'll get there together!)

* [Be familiar with basic concepts: functions, variables, data types and vectors](https://www.kaggle.com/rtatman/getting-started-in-r-first-steps/)
* [Load data into R](https://www.kaggle.com/rtatman/getting-started-in-r-load-data-into-r)
* [Summarize your data](https://www.kaggle.com/rtatman/getting-started-in-r-summarize-data)
* [Graph data](https://www.kaggle.com/rtatman/getting-started-in-r-graphing-data/)

______

### Your turn!

Throughout this tutorial, you'll have lots of opportunities to practice what you've just learned. Look for the phrase "your turn!" to find these exercises.

## Reading data into R


When you read data into R, you need to tell it two things. The first is what type of data structure the data is in. The second is where to find the data.

> Data structure: A specific way of organizing data to store it. There are lots of different types of data structures that you will learn about, including lists and trees. Vectors, which we talked about in the "First Steps" part of the tutorial, are a specific data structure.

For this tutorial, we're going to use the data_frame data structure (also called a tibble). If you're curious, you can find more information on these [here](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html). However, this data structure isn't one that comes with base R. To use it, we need to load a package.

In [3]:
# This line loads the "tidyverse" package. An R package is a collection
# of special functions (and sometimes data). Before you can use the functions in a
# package, though, you need to tell R that you want to use that package by calling
# the library() function.

library(tidyverse)

Alright, now that we've read in the package we need, we're ready to read in data. We can do this using the read_csv() function (which was in the package we just read in--if you try to use this function without loading the package using library() first, you'll get an error!).

Let's read in our file. This file is a .csv file. "csv" stands for "comma separated values". You can save any spreadsheet as a .csv file, and that will make it easier to read and analyze later: many of the file types you can save spreadsheets as can only be read by one specific program, while a .csv can be read by pretty much any program.
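To make the format concrete, here is a minimal sketch that writes a tiny made-up .csv file and reads it straight back in. The file contents and the names `csv_lines`, `tiny_path` and `tinyData` are invented for illustration; `read_csv()` comes from the readr package, which is loaded as part of the tidyverse.

```r
library(readr)  # read_csv() lives here; library(tidyverse) loads it too

# A .csv file is plain text: one row per line, values separated by commas.
# These lines are made up for this example.
csv_lines <- c(
  "bar_name,rating",
  "Example Bar,3.5",
  "Another Bar,2.75"
)

# Write the made-up lines to a temporary file...
tiny_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, tiny_path)

# ...and read them back in as a tibble.
# show_col_types = FALSE just quiets the column specification message.
tinyData <- read_csv(tiny_path, show_col_types = FALSE)
print(tinyData)
```

The first line of the file becomes the column names, and every later line becomes one row.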

For this tutorial, we'll be using a dataset of ratings of different chocolate bars. You can learn more about this dataset by clicking on the plus sign (+) next to "input files" at the top of the page.

In [22]:
# Read our data into R. The argument here is a file path. "/content" is the folder
# the file is stored in, and "flavors_of_cacao.csv" is the specific file we're
# reading from inside that folder. (On Kaggle the path is
# "../input/chocolate-bar-ratings/flavors_of_cacao.csv", where ".." means
# "look at the folder above this one".)

chocolateData <- read_csv("/content/flavors_of_cacao.csv")

# Some of our column names have spaces in them. This line changes the column names to
# versions without spaces, which lets us talk about the columns by their names.
names(chocolateData) <- make.names(names(chocolateData), unique=TRUE)

Rows: 1795 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Company 
(Maker-if known), Specific Bean Origin
or Bar Name, Cocoa
...
dbl (3): REF, Review
Date, Rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

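The renaming step in the cell above can be tried on its own. This is a small sketch using made-up column names (`messy_names` and `clean_names` are illustrative, not part of the dataset); `make.names()` is a base R function, so no package is needed.

```r
# Made-up column names: two contain spaces and one is repeated.
messy_names <- c("Review Date", "Cocoa Percent", "Rating", "Rating")

# make.names() replaces characters that aren't allowed in R names with dots;
# unique = TRUE also renames duplicates so every name is different.
clean_names <- make.names(messy_names, unique = TRUE)
print(clean_names)
# [1] "Review.Date"   "Cocoa.Percent" "Rating"        "Rating.1"
```

This is why the chocolate columns show up later with dots in their names, such as `Review.Date`.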

In [23]:
# Your turn!

# To give you practice reading in files, I've added a second dataset to this notebook
# as well: food_coded.csv (on Kaggle it is at ../input/food-choices/food_coded.csv).
# Read in the dataset and save it as a variable (here it's called "fooddata").
fooddata <- read_csv("/content/food_coded.csv")
# clean up the column names, just like we did for chocolateData
names(fooddata) <- make.names(names(fooddata), unique=TRUE)

New names:
• `comfort_food_reasons_coded` -> `comfort_food_reasons_coded...10`
• `comfort_food_reasons_coded` -> `comfort_food_reasons_coded...12`
Rows: 125 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): GPA, comfort_food, comfort_food_reasons, diet_current, eating_chan...
dbl (47): Gender, breakfast, calories_chicken, calories_day, calories_scone,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Look at the data we've read in

Congrats, you've gotten some data into R! Now we want to make sure that it all read in correctly, and get an idea of what's in our data file.

In [24]:
# the head() function shows just the first few rows of a data frame (six by default).
head(chocolateData)

# the tail() function shows just the last few rows.
# We can also give both functions a specific number of rows to show.
# This line will show the last three rows of "chocolateData".
tail(chocolateData, 3)

Company...Maker.if.known.,Specific.Bean.Origin.or.Bar.Name,REF,Review.Date,Cocoa.Percent,Company.Location,Rating,Bean.Type,Broad.Bean.Origin
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
A. Morin,Agua Grande,1876,2016,63%,France,3.75,,Sao Tome
A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
A. Morin,Atsane,1676,2015,70%,France,3.0,,Togo
A. Morin,Akata,1680,2015,70%,France,3.5,,Togo
A. Morin,Quilla,1704,2015,70%,France,3.5,,Peru
A. Morin,Carenero,1315,2014,70%,France,2.75,Criollo,Venezuela


Company...Maker.if.known.,Specific.Bean.Origin.or.Bar.Name,REF,Review.Date,Cocoa.Percent,Company.Location,Rating,Bean.Type,Broad.Bean.Origin
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
Zotter,Kerala State,749,2011,65%,Austria,3.5,Forastero,India
Zotter,Kerala State,781,2011,62%,Austria,3.25,,India
Zotter,"Brazil, Mitzi Blue",486,2010,65%,Austria,3.0,,Brazil
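`head()` and `tail()` also work on plain vectors, which makes their behaviour easy to check on a small scale. A quick sketch with a made-up vector (the variable names are illustrative):

```r
# A made-up vector of twenty numbers.
numbers <- 1:20

first_six  <- head(numbers)      # no count given, so we get the first six
last_three <- tail(numbers, 3)   # ask for exactly three from the end

print(first_six)   # [1] 1 2 3 4 5 6
print(last_three)  # [1] 18 19 20
```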


In [25]:
# Your turn!
# Get the first four rows of the "fooddata" data frame you read in earlier
head(fooddata, 4)
# and then the last three rows
tail(fooddata, 3)

GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded...10,⋯,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
2.4,2,1,430,,315,1,none,we dont have comfort,9,⋯,1,1,1,1165,345,car racing,5,1,1315,187
3.654,1,1,610,3.0,420,2,"chocolate, chips, ice cream","Stress, bored, anger",1,⋯,1,1,2,725,690,Basketball,4,2,900,155
3.3,1,1,720,4.0,420,2,"frozen yogurt, pizza, fast food","stress, sadness",1,⋯,1,2,5,1165,500,none,5,1,900,I'm not answering this.
3.2,1,1,430,3.0,420,2,"Pizza, Mac and cheese, ice cream",Boredom,2,⋯,1,2,5,725,690,,3,1,1315,"Not sure, 240"


GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded...10,⋯,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
3.882,1,1,720,,420,1,"rice, potato, seaweed soup",sadness,,⋯,1,2,5,580,690,none,4,2,1315,120
3.0,2,1,720,4.0,420,1,"Mac n Cheese, Lasagna, Pizza","happiness, they are some of my favorite foods",,⋯,2,2,1,940,500,,3,1,1315,135
3.9,1,1,430,,315,2,"Chocolates, pizza, and Ritz.","hormones, Premenstrual syndrome.",,⋯,1,2,2,725,345,,4,2,575,135


You'll notice that the data_frame data structure has two dimensions (rows and columns), unlike the vectors we worked with in the "First Steps" part of the tutorial. But the secret is that a data_frame is built out of vectors: each column is a vector, and rows and columns are both numbered starting from 1. This means that we can access specific cells in our data_frame using the indexes of the values we're interested in.

A quick refresher on how to access data by its index:

In [26]:
# make a little example vector
a <- c(5,10,15)

# if you ask for something at an index, but don't say which one, you'll get everything
a[]

# if you ask for a value at a specific index, you'll get only that value. In R,
# indexes start counting from 1 and go up. (So 1 is the first value and 3 is the third.)
a[1]
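You can also ask for more than one index at a time. The sketch below goes slightly beyond the refresher above and reuses the same example vector; the variable names are just for illustration.

```r
a <- c(5, 10, 15)

third     <- a[3]        # a single index: 15
first_two <- a[c(1, 2)]  # a vector of indexes: 5 10
last_two  <- a[2:3]      # 2:3 is shorthand for c(2, 3): 10 15

print(third)
print(first_two)
print(last_two)
```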

Data_frames work the same way, but you need to specify both the row and column, with a comma between them.

> In R, if you ask for something from a two dimensional data structure, you'll always ask for the row first and the column second. So "dataObject[2,4]" means "give me whatever is in the 2nd row and 4th column of the data frame called 'dataObject'".
>
> One way to remember this is by thinking of "RC Cola". In the brand's name, "RC" stands for "Royal Crown"... but we can pretend it stands for "Row Column".

![](https://upload.wikimedia.org/wikipedia/commons/e/e9/Drink_Royal_Crown_Cola.jpg)
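Here is the "row first, column second" rule on a tiny made-up table. This sketch uses a base R `data.frame` so it runs without any packages, and the `bars` data is invented. One difference to keep in mind: a base data frame returns the bare value for a single cell, while a tibble returns a 1 × 1 tibble.

```r
# A small made-up data frame: 3 rows, 3 columns.
bars <- data.frame(
  name   = c("Bar A", "Bar B", "Bar C"),
  cocoa  = c(70, 65, 80),
  rating = c(3.5, 2.75, 4),
  stringsAsFactors = FALSE
)

cell_value <- bars[2, 3]  # row 2, column 3: the rating of "Bar B"
second_row <- bars[2, ]   # every column of row 2
third_col  <- bars[, 3]   # every row of column 3

print(cell_value)  # [1] 2.75
print(second_row)
print(third_col)   # [1] 3.50 2.75 4.00
print(dim(bars))   # [1] 3 3  (rows first, then columns)
```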

In [27]:
# get the contents of the cell in the sixth row and the fourth column
chocolateData[6,4]

# get the contents of every cell in the 6th row (note that you still need the comma!)
chocolateData[6,]

# if you forget the comma, you'll get the 6th *column* instead of the 6th *row*
head(chocolateData[6])
# I've used "head" here because the column is very long and I don't want
# to fill up the screen by printing the whole thing out

Review.Date
<dbl>
2014


Company...Maker.if.known.,Specific.Bean.Origin.or.Bar.Name,REF,Review.Date,Cocoa.Percent,Company.Location,Rating,Bean.Type,Broad.Bean.Origin
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
A. Morin,Carenero,1315,2014,70%,France,2.75,Criollo,Venezuela


Company.Location
<chr>
France
France
France
France
France
France


In [28]:
# Your turn!
# dataframe[row,column]
# Get the first row of your "fooddata" data_frame
fooddata[1,]
# Get the value from the cell in the 100th row and 4th column
fooddata[100,4]

GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded...10,⋯,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
2.4,2,1,430,,315,1,none,we dont have comfort,9,⋯,1,1,1,1165,345,car racing,5,1,1315,187


calories_chicken
<dbl>
430


## Remove unwanted data


In addition to using indexes to get certain values, we can also use them to *remove* data we're not interested in. You can do this by putting a minus sign (-) in front of the index you don't want.
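A minimal sketch of negative indexes on made-up data (the `bars` data frame and the variable names are illustrative). It uses a base R `data.frame` so it runs without packages; the same bracket syntax works on a tibble.

```r
# On a vector, a negative index drops that position.
a <- c(5, 10, 15)
without_first <- a[-1]
print(without_first)  # [1] 10 15

# A small made-up data frame: 3 rows, 3 columns.
bars <- data.frame(
  name   = c("Bar A", "Bar B", "Bar C"),
  cocoa  = c(70, 65, 80),
  rating = c(3.5, 2.75, 4)
)

no_first_row <- bars[-1, ]  # before the comma: drop row 1, keep all columns
no_first_col <- bars[, -1]  # after the comma: drop column 1, keep all rows

print(no_first_row)
print(no_first_col)

# The original is untouched until you assign the result back to it.
print(nrow(bars))  # [1] 3
```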

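In the version of this file that the tutorial was written for, the first row of the "chocolateData" data_frame is the same as the column names, which makes it a row you'd want to remove. Always check with head() first: in the output below, the first row is real data, so here we remove it only to practice the technique.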

In [29]:
head(chocolateData)

Company...Maker.if.known.,Specific.Bean.Origin.or.Bar.Name,REF,Review.Date,Cocoa.Percent,Company.Location,Rating,Bean.Type,Broad.Bean.Origin
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
A. Morin,Agua Grande,1876,2016,63%,France,3.75,,Sao Tome
A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
A. Morin,Atsane,1676,2015,70%,France,3.0,,Togo
A. Morin,Akata,1680,2015,70%,France,3.5,,Togo
A. Morin,Quilla,1704,2015,70%,France,3.5,,Peru
A. Morin,Carenero,1315,2014,70%,France,2.75,Criollo,Venezuela


In [44]:
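# get all rows EXCEPT the first row and all columns of chocolateData
# By putting it back in the same variable, we're overwriting what was in
# that variable before, so be careful with this! Every time this line runs, it
# removes another row. (The output below is missing the first two rows of the
# earlier head() output, which suggests this cell was run twice.)
chocolateData <- chocolateData[-1,]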

# make sure we removed the row we didn't want
head(chocolateData)

Company...Maker.if.known.,Specific.Bean.Origin.or.Bar.Name,REF,Review.Date,Cocoa.Percent,Company.Location,Rating,Bean.Type,Broad.Bean.Origin
<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>
A. Morin,Atsane,1676,2015,70%,France,3.0,,Togo
A. Morin,Akata,1680,2015,70%,France,3.5,,Togo
A. Morin,Quilla,1704,2015,70%,France,3.5,,Peru
A. Morin,Carenero,1315,2014,70%,France,2.75,Criollo,Venezuela
A. Morin,Cuba,1315,2014,70%,France,3.5,,Cuba
A. Morin,Sur del Lago,1315,2014,70%,France,3.5,Criollo,Venezuela


In [45]:
# Your turn!

# The 5th column in the "fooddata" dataset has a lot of values that aren't
# numbers (NaN means "not a number"). Can you remove the 5th column from the dataset?

# first, take a look at the 5th column
column_5 <- fooddata[5]
print(column_5)
# na.omit() drops the rows that contain NaN, but it keeps the column itself
remove_5_Nan <- na.omit(column_5)
print(remove_5_Nan)
# to drop the whole column instead, use a negative column index: fooddata[, -5]

# A tibble: 125 × 1
   calories_day
          <dbl>
 1          NaN
 2            3
 3            4
 4            3
 5            2
 6            3
 7            3
 8            3
 9          NaN
10            3
# ℹ 115 more rows
# A tibble: 106 × 1
   calories_day
          <dbl>
 1            3
 2            4
 3            3
 4            2
 5            3
 6            3
 7            3
 8            3
 9            3
10            4
# ℹ 96 more rows


Alright, now that we've read our data into R, checked that it looks alright and gotten rid of a row we didn't want, it's time to get down to doing some analysis. In the next section, we'll learn how to summarize our data and get into some basic statistics!

## Next step: [Summarize our data](https://www.kaggle.com/rtatman/getting-started-in-r-summarize-data/)