# Lab 3 - Extra ways to summarize data with R


## Summarizing vectors

As we saw in modules 1 and 2, vectors are commonly summarized using measures of central tendency and variability.
In this lab we look at other descriptive statistics and helper functions for summarizing vectors.
We will work with the same King County house sales dataset.

In [2]:
housing_prices <- read.csv("../../../datasets/house_sales_in_king_county/kc_house_data.csv")

apply(), lapply(), sapply(), mapply(), tapply(), and ddply() (from the plyr package) are some of the functions you can use to apply a summary function across columns, list elements, or groups.
Let's look at the base R ones in turn.

In [3]:
# apply() applies a function over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix,
# collapsing each row or column to a single summary value.
# A data frame is coerced to a matrix first, so the non-numeric date column is excluded here.
apply(housing_prices[,!names(housing_prices) %in% c('date','colors')], 2, mean)

# colMeans(), rowMeans(), colSums(), and rowSums() compute means and sums over the columns or rows of a matrix.
# They are much faster than the equivalent apply() calls.
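
As a small sketch of the point made in the comments above, the following checks that `colMeans()` gives the same values as `apply(..., 2, mean)`. The data frame `toy` is made up for illustration and is not part of the lab's dataset.

```r
# Hypothetical toy data; the column names only mimic the housing dataset.
toy <- data.frame(price    = c(100, 200, 300),
                  bedrooms = c(2, 3, 4))

via_apply    <- apply(toy, 2, mean)  # coerces toy to a matrix, then averages each column
via_colmeans <- colMeans(toy)        # dedicated column-mean function

print(via_apply)
print(all.equal(via_apply, via_colmeans))
```

On a data frame this small the speed difference is invisible; it should only matter on large matrices.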

In [4]:
head(housing_prices)

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,⋯,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,20141013T000000,221900,3,1.0,1180,5650,1,0,0,⋯,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
6414100192,20141209T000000,538000,3,2.25,2570,7242,2,0,0,⋯,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
5631500400,20150225T000000,180000,2,1.0,770,10000,1,0,0,⋯,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
2487200875,20141209T000000,604000,4,3.0,1960,5000,1,0,0,⋯,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
1954400510,20150218T000000,510000,3,2.0,1680,8080,1,0,0,⋯,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
7237550310,20140512T000000,1225000,4,4.5,5420,101930,1,0,0,⋯,11,3890,1530,2001,0,98053,47.6561,-122.005,4760,101930


In [None]:
# Let's create a list from the bedrooms and bathrooms variables of the housing_prices dataset.
x <- list(housing_prices$bedrooms, housing_prices$bathrooms)

In [None]:
# lapply() applies a function to each element of a list.
# It returns a list with one result per element.
lapply(x, FUN = mean)

In [None]:
# sapply() also applies a function to each element of a list (or each column of a data frame),
# but it simplifies the result to a vector instead of returning a list.

# date is not numeric (a factor or a character vector, depending on the R version),
# so mean() cannot be applied to it. We exclude it from the data frame.
sapply(housing_prices[,!names(housing_prices) %in% c('date','colors')], mean)
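
To make the difference in return types concrete, here is a minimal sketch on a made-up list (`toy_list` is hypothetical, not taken from the dataset). The same function is applied both ways and only the container of the result changes.

```r
# Hypothetical toy list with named elements.
toy_list <- list(bedrooms  = c(2, 3, 4),
                 bathrooms = c(1, 1.5, 2.5))

as_list   <- lapply(toy_list, mean)  # a list with one mean per element
as_vector <- sapply(toy_list, mean)  # the same means, simplified to a named numeric vector

print(class(as_list))
print(class(as_vector))
print(as_vector)
```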

In [None]:
# mapply() is used when we have several data structures (e.g. vectors, lists) and want to apply a function
# to the 1st elements of each, then to the 2nd elements of each, and so on.
# The result is simplified to a vector/array, as in sapply().

# For example, our dataset has several variables measuring different areas: sqft_living, sqft_lot,
# sqft_above, sqft_basement, sqft_living15, and sqft_lot15. To add them up row by row, we can use mapply().

result <- mapply(sum, housing_prices$sqft_living, housing_prices$sqft_lot,
                 housing_prices$sqft_above, housing_prices$sqft_basement,
                 housing_prices$sqft_living15, housing_prices$sqft_lot15)
head(result)

If it is unclear how these values are generated, look at the cells below, where the first two rows are worked out by hand. For each row, the values of the six variables are added together.

In [None]:
head(housing_prices)

In [None]:
# Summing the six area values of the first and second rows by hand gives the first two elements of result.
1180+5650+1180+0+1340+5650
2570+7242+2170+400+1690+7639
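
The same row-wise totals can also be obtained with `rowSums()`. The sketch below rebuilds only the six area columns of the first two rows shown in the `head()` output (the data frame name `first_rows` is made up here) and compares both approaches.

```r
# The six area columns for the first two rows of the head() output above.
first_rows <- data.frame(
  sqft_living   = c(1180, 2570),
  sqft_lot      = c(5650, 7242),
  sqft_above    = c(1180, 2170),
  sqft_basement = c(0, 400),
  sqft_living15 = c(1340, 1690),
  sqft_lot15    = c(5650, 7639)
)

# Element-wise sum across the six vectors, one call to sum() per row.
via_mapply <- mapply(sum,
                     first_rows$sqft_living, first_rows$sqft_lot,
                     first_rows$sqft_above, first_rows$sqft_basement,
                     first_rows$sqft_living15, first_rows$sqft_lot15)

# Vectorized alternative; unname() drops the row names that rowSums() attaches.
via_rowsums <- unname(rowSums(first_rows))

print(via_mapply)   # 15000 21711
print(via_rowsums)  # 15000 21711
```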

In [None]:
# tapply() - You should be familiar with tapply() by now. Use it when you want to apply a function to subsets
# of a vector, where the subsets are defined by some other vector, usually a factor.

# For example, we want to know the average price of homes for each number of bedrooms.
t(tapply(housing_prices$price,housing_prices$bedrooms,mean))
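
For a version small enough to verify by hand, here is a sketch on a made-up data frame (`toy` is hypothetical). Each group mean can be checked mentally against the rows.

```r
# Hypothetical toy data: two houses with 2 bedrooms, two with 3.
toy <- data.frame(price    = c(100, 200, 300, 500),
                  bedrooms = c(2, 2, 3, 3))

# Mean price within each bedroom count; the result is named by group.
avg_price <- tapply(toy$price, toy$bedrooms, mean)
print(avg_price)  # group "2" -> 150, group "3" -> 400
```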

#### By
------
tapply() summarizes one variable grouped by another variable. But what if we want to summarize many variables at once? by() works like an extended version of tapply(): it splits a whole data frame into groups and applies a function to each piece.


In [None]:
byviews <- by(housing_prices[,c('price','sqft_living')], housing_prices$view, summary)
byviews
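
Any function that accepts a data frame can be passed to `by()`, not only `summary`. The following sketch uses `colMeans` on a made-up data frame (`toy` and its values are hypothetical) to get the mean of two columns per group.

```r
# Hypothetical toy data: two houses with view 0, one with view 4.
toy <- data.frame(price       = c(100, 300, 900),
                  sqft_living = c(1000, 2000, 4000),
                  view        = c(0, 0, 4))

# Split the two numeric columns by view and average each column within a group.
means_by_view <- by(toy[, c("price", "sqft_living")], toy$view, colMeans)
print(means_by_view)

# The first element holds the result for the first group (view == 0).
first_group <- means_by_view[[1]]
```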

### 2-way tables
------
2-way tables are very informative. The table below shows the distribution of bathrooms for every count of bedrooms. The row and column sums are also displayed; they give the total number of houses with each bedroom count and each bathroom count.

In [None]:
# The command below produces a 2-way table counting every combination of bedrooms and bathrooms.
# addmargins() appends the sums of these counts as an extra row and an extra column.
bed_vs_bath <- table(housing_prices$bedrooms, housing_prices$bathrooms)
addmargins(bed_vs_bath)
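
To see exactly what `addmargins()` adds, here is a minimal sketch on five made-up houses (`toy` is hypothetical). The `Sum` row and column hold the marginal totals, and the bottom-right cell is the total number of rows.

```r
# Hypothetical toy data: five houses.
toy <- data.frame(bedrooms  = c(2, 2, 3, 3, 3),
                  bathrooms = c(1, 1, 1, 2, 2))

# Count every bedrooms/bathrooms combination.
counts <- table(bedrooms = toy$bedrooms, bathrooms = toy$bathrooms)

# Append a "Sum" row and a "Sum" column with the marginal totals.
with_totals <- addmargins(counts)
print(with_totals)
```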

Below is an extended version of the table command that adds a third dimension to the 2-way table. We see the same information as above, but separately for every value of view (0, 1, 2, 3, 4).

In [None]:
bed_bath_view <- xtabs(~bedrooms+bathrooms+view, data=housing_prices)
bed_bath_view
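
A 3-way table prints as one 2-way slice per value of the third variable, which can get long. As a hedged sketch on made-up data (`toy` is hypothetical), `ftable()` from base R flattens the same counts into a single compact table.

```r
# Hypothetical toy data with three categorical variables.
toy <- data.frame(bedrooms  = c(2, 2, 3, 3, 3),
                  bathrooms = c(1, 1, 1, 2, 2),
                  view      = c(0, 0, 0, 0, 4))

# 3-way contingency table: one bedrooms x bathrooms slice per view.
tab3 <- xtabs(~ bedrooms + bathrooms + view, data = toy)

# Flattened display of the same counts.
flat <- ftable(tab3)
print(flat)
```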

In [None]:
# stat.desc() from the pastecs package returns an extensive set of descriptive statistics for the input object.
# Most of them are commonly used statistics.
library(pastecs)
options(scipen=999)  # avoid scientific notation in the printed output
stat.desc(housing_prices)

In [None]:
# aggregate() works much like GROUP BY in SQL. Here we group the data by bedrooms. We are interested in the columns price,
# bathrooms, and sqft_living, and we apply mean() to each of them within every group (i.e. every number of bedrooms).

aggregate(housing_prices[c("price","bathrooms","sqft_living")],by=list(bedrooms=housing_prices$bedrooms), mean)
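
`aggregate()` also has a formula interface that some find easier to read. The sketch below runs it on a made-up data frame (`toy` is hypothetical); `cbind()` on the left-hand side lists the columns to summarize, and the right-hand side names the grouping variable.

```r
# Hypothetical toy data: two houses with 2 bedrooms, two with 3.
toy <- data.frame(price     = c(100, 200, 300, 500),
                  bathrooms = c(1, 2, 1.5, 2.5),
                  bedrooms  = c(2, 2, 3, 3))

# Mean price and mean bathrooms for each bedroom count.
group_means <- aggregate(cbind(price, bathrooms) ~ bedrooms, data = toy, FUN = mean)
print(group_means)
```

One difference worth knowing: by default the formula interface drops rows with missing values in the variables used, so results can differ from the `by =` form when the data contain `NA`s.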