# Data exploration using the `dplyr` package

In [1]:
# Load the dplyr package
require(dplyr)

Loading required package: dplyr

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [2]:
# Load the house prices data
house <- read.csv("../data/houseprices/train.csv")
print(house[1:2,])

  Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
  Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
  HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
1     2Story           7           5      2003         2003     Gable  CompShg
2     1Story           6           8      1976         1976     Gable  CompShg
  Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
2     MetalSd     MetalSd       None          0        TA        TA     CBlock
  BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 Bsmt

In [3]:
# Subsetting using the filter function
print(filter(house, SaleCondition == "Normal", Fireplaces == 1, SalePrice > 18000)[1:2,])

  Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
1  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
2  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
  Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
1    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
2    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
  HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
1     1Story           6           8      1976         1976     Gable  CompShg
2     2Story           7           5      2001         2002     Gable  CompShg
  Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
1     MetalSd     MetalSd       None          0        TA        TA     CBlock
2     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
  BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 Bsmt

In [4]:
# The same filter written as one condition with the logical AND operator (&)
print(filter(house, SaleCondition == "Normal" & Fireplaces == 1 & SalePrice > 18000)[1:2,])

  Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
1  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
2  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
  Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
1    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
2    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
  HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
1     1Story           6           8      1976         1976     Gable  CompShg
2     2Story           7           5      2001         2002     Gable  CompShg
  Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
1     MetalSd     MetalSd       None          0        TA        TA     CBlock
2     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
  BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 Bsmt
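
The two `filter` calls above return the same rows, because comma-separated conditions are combined with a logical AND. A minimal sketch of that equivalence follows. It assumes a small hypothetical data frame called `toy` (not the house prices file) and also shows the `|` operator, which expresses an OR that comma notation cannot.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not the house prices file)
toy <- data.frame(
  SaleCondition = c("Normal", "Normal", "Abnorml", "Normal"),
  Fireplaces    = c(1, 0, 1, 1),
  SalePrice     = c(200000, 150000, 120000, 15000),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with a logical AND ...
by_comma <- filter(toy, SaleCondition == "Normal", Fireplaces == 1, SalePrice > 18000)

# ... so this single condition built with & selects the same rows
by_and <- filter(toy, SaleCondition == "Normal" & Fireplaces == 1 & SalePrice > 18000)
print(all.equal(by_comma, by_and))

# The | operator keeps rows that satisfy at least one of the conditions
either <- filter(toy, Fireplaces == 0 | SalePrice < 18000)
print(either)
```

Comma notation tends to read more cleanly when every condition must hold; an explicit logical expression is needed as soon as an OR is involved.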

In [5]:
# Select rows by position using the slice function
print(slice(house[,1:5], 1:4))

  Id MSSubClass MSZoning LotFrontage LotArea
1  1         60       RL          65    8450
2  2         20       RL          80    9600
3  3         60       RL          68   11250
4  4         70       RL          60    9550


In [6]:
# Sort the rows of the table with arrange
# A minus sign or the desc function denotes descending order
print(arrange(house, -SalePrice, desc(LotArea))[1:2,])

    Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
1  692         60       RL         104   21535   Pave  <NA>      IR1
2 1183         60       RL         160   15623   Pave  <NA>      IR1
  LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
1         Lvl    AllPub    Corner       Gtl      NoRidge       Norm       Norm
2         Lvl    AllPub    Corner       Gtl      NoRidge       Norm       Norm
  BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
1     1Fam     2Story          10           6      1994         1995     Gable
2     1Fam     2Story          10           5      1996         1996       Hip
  RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
1  WdShngl     HdBoard     HdBoard    BrkFace       1170        Ex        TA
2  CompShg     Wd Sdng     ImStucc       None          0        Gd        TA
  Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
1      PConc       E
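
The cell above mixes two ways of asking for descending order: a unary minus and `desc()`. The sketch below, on a hypothetical data frame called `toy`, uses `desc()` for both sort keys. The unary minus only makes sense for numeric columns, whereas `desc()` also accepts character columns.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- data.frame(
  Neighborhood = c("NoRidge", "CollgCr", "Veenker", "CollgCr"),
  SalePrice    = c(300000, 200000, 200000, 250000),
  stringsAsFactors = FALSE
)

# Highest price first; ties are broken by neighbourhood in reverse
# alphabetical order. desc() works here even though Neighborhood is text.
sorted <- arrange(toy, desc(SalePrice), desc(Neighborhood))
print(sorted)
```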

In [7]:
# Select columns by name
print(select(house, SalePrice, Heating, YrSold)[1:2,])

  SalePrice Heating YrSold
1    208500    GasA   2008
2    181500    GasA   2007


In [8]:
# Select unique values using the distinct function
print(distinct(house, Condition1))

  Condition1
1       Norm
2      Feedr
3       PosN
4     Artery
5       RRAe
6       RRNn
7       RRAn
8       PosA
9       RRNe


In [9]:
# Select the unique combinations of two columns
print(distinct(house, Condition1, Condition2))

   Condition1 Condition2
1        Norm       Norm
2       Feedr       Norm
3        PosN       Norm
4      Artery       Norm
5      Artery     Artery
6        RRAe       Norm
7       Feedr       RRNn
8        RRNn       Norm
9        RRAn      Feedr
10       PosA       Norm
11      Feedr      Feedr
12       RRAn       Norm
13       RRNe       Norm
14       PosN       PosN
15       RRNn      Feedr
16     Artery       PosA
17      Feedr       RRAn
18      Feedr       RRAe


In [10]:
# Mutate adds new variables and keeps the existing ones
print(mutate(house, price_unit_area = SalePrice/LotArea, perc_garage = GarageArea/LotArea)[1:3, 
                                     c("Id", "Heating", "LotArea", "price_unit_area", "perc_garage")])

  Id Heating LotArea price_unit_area perc_garage
1  1    GasA    8450        24.67456  0.06485207
2  2    GasA    9600        18.90625  0.04791667
3  3    GasA   11250        19.86667  0.05404444


In [11]:
# Transmute keeps only the new variables
print(transmute(house, price_unit_area = SalePrice/LotArea, perc_garage = GarageArea/LotArea)[1:4,])

  price_unit_area perc_garage
1        24.67456  0.06485207
2        18.90625  0.04791667
3        19.86667  0.05404444
4        14.65969  0.06722513
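
To make the difference between `mutate` and `transmute` concrete, here is a minimal sketch on a hypothetical two-row data frame called `toy`. It also shows that a later expression in the same `mutate` call can refer to a column created earlier in that call; the `price_rank` column is an illustrative addition, not part of the notebook.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- data.frame(
  SalePrice  = c(208500, 181500),
  LotArea    = c(8450, 9600),
  GarageArea = c(548, 460)
)

# mutate keeps the original columns and appends the new ones;
# price_rank is computed from price_unit_area, defined just before it
with_new <- mutate(toy,
                   price_unit_area = SalePrice / LotArea,
                   price_rank = rank(-price_unit_area))
print(names(with_new))

# transmute returns only the newly defined columns
only_new <- transmute(toy, price_unit_area = SalePrice / LotArea)
print(names(only_new))
```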


In [12]:
# Summarise the data into a single row of statistics
print(summarise(house, mean_price = mean(SalePrice), median_price = median(SalePrice)))

  mean_price median_price
1   180921.2       163000


In [13]:
# Sample data by number of observations ...
print(sample_n(house[, 1:5], 3))

       Id MSSubClass MSZoning LotFrontage LotArea
138   138         90       RL          82   11070
1186 1186         50       RL          60    9738
21     21         60       RL         101   14215


In [14]:
# Sample data by fraction
print(sample_frac(house[,1:5], 0.005))

       Id MSSubClass MSZoning LotFrontage LotArea
369   369         20       RL          78    7800
1023 1023         50       RM          52    9439
594   594        120       RM          NA    4435
1247 1247         60       FV          65    8125
1231 1231         90       RL          NA   18890
349   349        160       RL          36    2448
413   413         20       FV          NA    4403
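
Both sampling functions draw random rows, so the output changes on every run. A minimal sketch of making the draw repeatable with `set.seed` follows; the data frame `toy` and the seed value are hypothetical choices for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- data.frame(Id = 1:200, LotArea = seq(5000, by = 50, length.out = 200))

# Resetting the seed before each draw reproduces the same sample
set.seed(42)
first_draw <- sample_n(toy, 3)
set.seed(42)
second_draw <- sample_n(toy, 3)
print(identical(first_draw$Id, second_draw$Id))

# sample_frac takes a proportion of the rows instead of a count
frac_draw <- sample_frac(toy, 0.05)
print(nrow(frac_draw))
```

In dplyr 1.0.0 and later, `sample_n` and `sample_frac` are superseded by `slice_sample`, although they remain available.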


In [15]:
# Create a grouped data object
grouping <- group_by(house[, c("MSZoning", "SaleCondition", "Heating")], MSZoning, SaleCondition, Heating)
print(grouping[1:6,])

Source: local data frame [6 x 3]
Groups: MSZoning, SaleCondition, Heating [2]

  MSZoning SaleCondition Heating
    <fctr>        <fctr>  <fctr>
1       RL        Normal    GasA
2       RL        Normal    GasA
3       RL        Normal    GasA
4       RL       Abnorml    GasA
5       RL        Normal    GasA
6       RL        Normal    GasA


In [16]:
# Use the groupings to generate frequency of each group
print(summarise(grouping, Count = n())[1:6,])

Source: local data frame [6 x 4]
Groups: MSZoning, SaleCondition [5]

  MSZoning SaleCondition Heating Count
    <fctr>        <fctr>  <fctr> <int>
1  C (all)       Abnorml    GasA     5
2  C (all)        Alloca    GasA     1
3  C (all)        Normal    GasA     3
4  C (all)        Normal    GasW     1
5       FV       Abnorml    GasA     4
6       FV        Normal    GasA    39
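
The group-then-count pattern above can also be written as a pipeline. The sketch below assumes a hypothetical data frame called `toy` and uses the `%>%` pipe that `dplyr` makes available, together with `count()`, a `dplyr` shorthand for grouping and counting in one step.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- data.frame(
  MSZoning      = c("RL", "RL", "FV", "RL", "FV"),
  SaleCondition = c("Normal", "Normal", "Normal", "Abnorml", "Normal"),
  stringsAsFactors = FALSE
)

# Group, count the rows in each group, then drop the remaining grouping
counts <- toy %>%
  group_by(MSZoning, SaleCondition) %>%
  summarise(Count = n()) %>%
  ungroup()
print(counts)

# count() performs the same grouping and counting; its result column is named n
shorthand <- count(toy, MSZoning, SaleCondition)
print(shorthand)
```

Calling `ungroup()` at the end avoids surprises in later steps, since `summarise` only removes the last grouping level.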


# Exercise 2.3

**Question 1**

Load `"loan.csv"` into a data.frame called `loan`. Use the `filter` function from the `dplyr` package to keep the rows where `Gender` is `female`, `age` is greater than or equal to `30`, and `Principal` is less than or equal to `1000`. Do this first with a single logical statement and then with comma notation.

**Question 2**

Use the `slice` function to select rows `300` to `350`.

**Question 3**

Use the `arrange` function to sort the `loan` data.frame by increasing `age`, increasing `Principal`, decreasing `terms`.

**Question 4**

Use the `select` function to select `loan_status`, `Principal`, `terms`, `age`, and `Gender` from the `loan` data.

**Question 5**

Use the `distinct` function to select the unique combination of `Gender`, `education`, and `loan_status` from the `loan` dataset.

**Question 6**

Use the `group_by`, `summarise`, and `n` functions to create the counts of the combination of `Gender`, `education`, and `loan_status` in the `loan` dataset.

**Question 7**

Use the `mutate` function to add a new column `p.t = Principal/terms` to the `loan` dataset.