# Dataframes - Introduction

## Table of Contents 
1. Creating a dataframe
2. Structure of a dataframe
3. Dimensions of a dataframe

In these three notebooks, you will learn the basics of working with dataframes in R:
1. This notebook - how to show the basic information of a dataframe, such as its structure and the number of rows
2. [Column Indexing](https://bentley.cloud.databricks.com/#notebook/92452/command/92453) - retrieve the columns you want to display (single or multiple) 
3. [Row Indexing](https://bentley.cloud.databricks.com/#notebook/92428/command/92429) - retrieve the rows you want, or build a logical index for specific values

The two links above are to notebooks in this folder.

#### Creating a dataframe 

A __dataframe__ is the R version of a table, with rows, columns and cells, as you would expect.

In R, tables can be created in many ways. One way is to use the `data.frame()` function. Inside the call, each column is given as a column name followed by `=` and the contents of the column.

In [5]:
%r
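# Build a small table: one argument per column, written as name = values
course_info <- data.frame(course_id = 100:103,
                          course_name = c("Biology", "Chemistry", "History", "Physics"),
                          students_enrolled = c(20, 25, 15, 18))
course_info  # typing the name prints the dataframe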

In this tutorial we will use a few of the many tables that are available in all R installations.
Later tutorials will demonstrate reading tables from files and from other types of data sources.

#### Structure of a dataframe

All rows from the `iris` table are listed below. 

Select the output part of the cell and then scroll through the rows.

Next, the str() function is used to understand the structure of the dataframe.

In [8]:
%r
str(iris)

Notice that `iris`:
- is a `data.frame`
- has 150 rows/observations
- has 5 columns/variables, each listed with its name, class and first few values

The first few values of each column are included in the output.
This is sometimes helpful, but the `head` command is a better way to show the first several rows.

The output of the `str()` function shows that sepal length, sepal width, petal length and petal width are numeric columns, whereas species is a factor with 3 levels.

In [11]:
%r
class(iris$Petal.Width)

In [12]:
%r
class(iris$Species)
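
Since `Species` is a factor, its levels can be inspected directly. The following is a minimal sketch using base R's `levels()`, `nlevels()` and `table()`; the variable names are chosen here for illustration only.

```r
# Inspect the factor behind the Species column of the built-in iris data.
species_levels <- levels(iris$Species)   # character vector of the distinct categories
n_levels <- nlevels(iris$Species)        # how many categories there are

print(species_levels)
print(n_levels)

# table() counts how many rows fall in each level.
species_counts <- table(iris$Species)
print(species_counts)
```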

The first few rows of the `iris` table are displayed with the `head` command and the last few with the `tail` command. By default, 6 rows are displayed; pass a number as the second argument to show a different count.

In [14]:
%r
head(iris)

In [15]:
%r
tail(iris,10)

Notice the row numbers are displayed to the left of the first column.

The names() function can also be used to retrieve only the names of all the columns in the dataframe.

In [18]:
%r
names(iris)

An individual column is referred to with the `$` sign; for example, the Species column is `iris$Species`. This is particularly useful when working with multiple dataframes that share column names. When no column names overlap, another option is the `attach()` function, after which columns can be referred to by name without the `$` sign.

In [20]:
%r
iris$Petal.Length
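
The `attach()` alternative mentioned above can be sketched as follows. This is only an illustration: the variable names are made up here, and the `with()` call at the end is an additional base R option that the text above does not cover, shown because it avoids leaving column names on the search path.

```r
# Reference a column with $ (no setup needed).
mean_with_dollar <- mean(iris$Petal.Length)

# attach() puts the columns of iris on the search path,
# so Petal.Length can be used as a bare name.
attach(iris)
mean_attached <- mean(Petal.Length)
detach(iris)   # undo the attach so the bare names do not linger

# with() evaluates one expression inside the dataframe, without attaching.
mean_with <- with(iris, mean(Petal.Length))

print(c(mean_with_dollar, mean_attached, mean_with))
```

All three values should be the same, since each line computes the mean of the same column.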

#### Dimensions of a dataframe

The number of rows and columns of a table are displayed with the `dim`, `nrow` and `ncol` commands.

In [22]:
%r
dim(iris)

In [23]:
%r
nrow(iris)

In [24]:
%r
ncol(iris)
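
The three commands are related: `dim` returns both numbers in one vector, rows first. A small sketch of pulling the pieces apart, with variable names chosen here for illustration:

```r
dims <- dim(iris)        # integer vector: rows first, then columns
n_rows <- dims[1]
n_cols <- dims[2]

# These match what nrow() and ncol() report.
print(n_rows == nrow(iris))
print(n_cols == ncol(iris))
```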

To retrieve a single cell of a dataframe, place the row number and column number in square brackets separated by a comma (","):

- The row number is to the left of the comma
- The column number is to the right of the comma

In [26]:
%r
head(iris,2)

To find the recorded sepal width of the second iris observed, use row 2 and column 2, since the `head()` output shows that `Sepal.Width` is the second column. Ranges of rows and column names work as well: the second command below returns rows 9 down to 1 of the two petal columns.

In [28]:
%r
iris[2,2]
iris[9:1,c('Petal.Width','Petal.Length')]

Functions such as `mean`, `max` and `min` can be applied to the columns of a dataframe.

In [30]:
%r
mean(iris$Petal.Width)

In [31]:
%r
min(iris$Sepal.Length)

__Exercise__: Find and display the 25 records from the `iris` dataframe where both of the following are true:
- `Sepal.Width` is greater than the average value of all `Sepal.Width` values
- `Sepal.Length` is greater than the average value of all `Sepal.Length` values

Work through this one step at a time:
1. find the average/mean of the `Sepal.Width` column
1. find the average/mean of the `Sepal.Length` column
1. create a logical index vector that is `TRUE` only when the `Sepal.Width` value is greater than the average `Sepal.Width` value
1. create a logical index vector that is `TRUE` only when the `Sepal.Length` value is greater than the average `Sepal.Length` value
1. create a logical index vector that is `TRUE` only when both of these logical index vectors are `TRUE`
1. use this last logical index vector to retrieve these 25 records

Check your work at each step.
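
As a hint, the same step-by-step pattern can be sketched on the petal columns instead of the sepal columns, which leaves the exercise itself open. The variable names below are made up for this illustration, and this is only one possible way to write it.

```r
# Steps 1 and 2: column means (petal columns here, not the sepal columns of the exercise).
mean_petal_width  <- mean(iris$Petal.Width)
mean_petal_length <- mean(iris$Petal.Length)

# Steps 3 and 4: one logical value per row, TRUE where the row is above the mean.
wide_petals <- iris$Petal.Width  > mean_petal_width
long_petals <- iris$Petal.Length > mean_petal_length

# Step 5: & combines the two vectors element by element.
both <- wide_petals & long_petals

# Step 6: a logical vector in the row position keeps only the TRUE rows.
big_petals <- iris[both, ]

# Checking the work: one logical value per row, and as many rows kept as TRUE values.
print(length(both) == nrow(iris))
print(sum(both) == nrow(big_petals))
head(big_petals)
```

Note that the element-wise operator is a single `&`; the doubled `&&` is meant for single values and is not what is wanted for an index vector.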

The End