#Introduction to Working with Data in R: Cleaning and Munging


###Goals

- Become familiar with basic tools and methods for data munging and cleaning in R

###Tasks

- Start an RStudio session 
- Load data from a csv into an R dataframe 
- Load data from the database into an R dataframe
- Remove missing values
- Fill missing values with an interpolation
 

#Import CSV
One of the great things about R is that it easily reads CSVs without importing packages:

```
data <- read.csv('~/Desktop/create_dssg_training_data/building_violations.csv')
```
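To see what `read.csv` does with headers and text columns without needing the file above, here is a minimal sketch. It assumes a throwaway temporary file with made-up rows; the column names are chosen only to resemble the ones used later.

```r
# Write a tiny, made-up CSV to a temporary file (hypothetical data, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,VIOLATION DATE,LATITUDE",
             "1,01/15/2014,41.88",
             "2,02/20/2014,41.79"),
           csv_path)

# stringsAsFactors = FALSE keeps text columns as character rather than factor
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

names(toy)          # the space in "VIOLATION DATE" becomes a dot
sapply(toy, class)  # one class per column
```

By default `read.csv` rewrites header names into syntactically valid R names, which is why column names in a dataframe can contain dots where the file had spaces.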

#Import from Database
The first thing we need to do is import a library that can tell R how to communicate with a Postgres server:
```
library('RPostgreSQL')
```

I don't want to make my password public through this notebook, so I store it in a file on my computer and read it in as a variable:

```
# read.csv returns a data frame, so pull out the single cell as a character string
passwd <- read.csv('~/Desktop/password.txt', header=FALSE, stringsAsFactors=FALSE)[1, 1]
```
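Another option, sketched here with a temporary file standing in for the real password file, is `readLines`. It returns a plain character vector, which can be more convenient to pass around than a data frame. The file contents below are made up.

```r
# Hypothetical stand-in for the password file: one line, no header
pw_path <- tempfile(fileext = ".txt")
writeLines("not-a-real-password", pw_path)

# n = 1 reads only the first line of the file
pw <- readLines(pw_path, n = 1)

# Drop any stray leading or trailing whitespace
pw <- trimws(pw)

class(pw)   # "character"
length(pw)  # 1
```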

###Establish database connection
We need to load the driver and then create a connection object:
```
drv <- dbDriver('PostgreSQL')
con <- dbConnect(drv, 
                 host = "dssgsummer2014postgres.c5faqozfo86k.us-west-2.rds.amazonaws.com",
                 dbname = "training_2015",
                 user = "jwalsh",
                 password = passwd)
```

###Query the database
Every time we want to interact with the Postgres database, we need to refer to `con`, the connection object:
```
data <- dbGetQuery(con, "SELECT * FROM jwalsh.building_permits;")
```

###Close database connection
```
dbDisconnect(con)
```

##Viewing your dataframe

Just like we did on the command line, you can use `head` and `tail` to get a view of your data:

```
head(data)
tail(data)
```

For prettier results in RStudio, wrap the call in `View()`, e.g. `View(head(data))`.

We can get a sense for the size and shape of the data using `dim`:

```
dim(data)
```

Get a sense for the type of each column using `sapply`:

```
sapply(data, class)
```
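The same checks can be tried on a small made-up dataframe. This is only a sketch: the `toy` dataframe and its values are invented, and `str` is added as one more base R way to see size and types at a glance.

```r
# A small, invented dataframe for illustration
toy <- data.frame(ID = 1:3,
                  INSPECTION.STATUS = c("FAILED", "PASSED", "FAILED"),
                  LATITUDE = c(41.88, 41.79, NA),
                  stringsAsFactors = FALSE)

dim(toy)                           # number of rows, then number of columns
col_classes <- sapply(toy, class)  # named character vector: one class per column
col_classes
str(toy)                           # compact overview: dimensions, types, first values
```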

There doesn't seem to be much to `SSA`. Let's get rid of it:

```
data <- data[, !(names(data) %in% 'SSA')]
```
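Two variations on dropping a column, shown on an invented dataframe. Adding `drop = FALSE` guards against a known quirk: if only one column remains, `[, ...]` would otherwise return a plain vector instead of a dataframe. Assigning `NULL` is a shorter alternative when you know the column name.

```r
# Invented data; SSA is entirely missing here
toy <- data.frame(ID = 1:2, SSA = c(NA, NA), LATITUDE = c(41.88, 41.79))

# Same %in% approach as above, with drop = FALSE so the result stays a dataframe
kept <- toy[, !(names(toy) %in% "SSA"), drop = FALSE]

# Alternative: assigning NULL removes the column in place
toy$SSA <- NULL

names(kept)
names(toy)
```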

Great. Now let's learn how to reference columns:

```
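data$ID                          # returns a vector; this form gives tab completion
data['ID']                       # returns a one-column dataframe
data[c('ID','VIOLATION.DATE')]   # returns a dataframe with the selected columns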
```
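These forms do not all return the same kind of object, which matters when you pass the result to another function. A short sketch on an invented dataframe (`[[` is included as a third option that is not used elsewhere in this tutorial):

```r
# Invented data for illustration
toy <- data.frame(ID = 1:3, LATITUDE = c(41.88, 41.79, 41.95))

id_vector  <- toy$ID       # a plain vector
id_frame   <- toy["ID"]    # a one-column dataframe
id_vector2 <- toy[["ID"]]  # also a vector; handy when the name is stored in a variable

is.data.frame(id_frame)
identical(id_vector, id_vector2)
```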

##Converting to datetime

RPostgreSQL brought our date fields in as factors. Convert!

```
data$VIOLATION.DATE <- as.character(data$VIOLATION.DATE)
data$VIOLATION.DATE <- as.Date(data$VIOLATION.DATE, format = '%m/%d/%Y')
class(data$VIOLATION.DATE)
head(data$VIOLATION.DATE)
```
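It is worth checking for values that did not parse, since `as.Date` returns `NA` for strings that do not match the given format rather than raising an error. A minimal sketch with invented values:

```r
# Invented values: two valid dates and one string that does not match the format
raw_dates <- factor(c("01/15/2014", "12/31/2013", "not a date"))

# Going through character makes the conversion from factor explicit
parsed <- as.Date(as.character(raw_dates), format = "%m/%d/%Y")

parsed
sum(is.na(parsed))  # number of entries that failed to parse
```

Comparing this count with the number of `NA`s before conversion shows how many values were lost to a format mismatch.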

##Exploring Data

Let's get a better sense of what these fields look like. R's `summary` command does pretty well (and unlike pandas' `describe`, which by default summarizes only numeric columns, it also summarizes non-numeric variables):

```
summary(data)
```

Let's see if there are missing values:

```
sum(is.na(data$VIOLATION.INSPECTOR.COMMENTS))
```

That's not right. We know there are many more. Let's take a closer look:

```
comment_breakdown <- table(data$VIOLATION.INSPECTOR.COMMENTS)
comment_breakdown_sorted <- sort(comment_breakdown, decreasing = TRUE)
head(comment_breakdown_sorted)
names(comment_breakdown_sorted)[1]
```

There are lots of cells with an empty string but not a missing value. Fix it:

```
data$VIOLATION.INSPECTOR.COMMENTS[ data$VIOLATION.INSPECTOR.COMMENTS == '' ] <- NA
```
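Once missing values are marked as `NA`, the two tasks from the top of this tutorial, removing them and filling them by interpolation, can be sketched with base R. Everything here is an assumption for illustration: the `toy` dataframe, the numeric `VALUE` column, and the choice of linear interpolation over row position, which only makes sense when rows are in a meaningful order (for example, sorted by date).

```r
# Invented data: VALUE has gaps, COMMENT has empty strings
toy <- data.frame(ID = 1:5,
                  VALUE = c(41.80, NA, 41.90, NA, 42.00),
                  COMMENT = c("ok", "", "ok", "ok", ""),
                  stringsAsFactors = FALSE)

# Empty strings become NA, as above
toy$COMMENT[toy$COMMENT == ""] <- NA

# Count missing values in every column at once
na_counts <- colSums(is.na(toy))

# Remove: keep only the rows where VALUE is present
no_missing <- toy[!is.na(toy$VALUE), ]

# Fill: linear interpolation between the known points, using row position as x
known <- !is.na(toy$VALUE)
toy$VALUE_filled <- approx(x = which(known),
                           y = toy$VALUE[known],
                           xout = seq_len(nrow(toy)))$y

na_counts
toy[c("VALUE", "VALUE_filled")]
```

By default `approx` returns `NA` for points outside the range of known values, so leading or trailing gaps would stay missing with this approach.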

##Applying functions

Often we want to apply a function to an entire column to create a new column:

```
data$log_latitude <- log(data$LATITUDE)
head( data[c('LATITUDE', 'log_latitude')] )
```

Let's say we wanted to anonymize location by adding noise to the latitude:

```
data$new_latitude <- data$LATITUDE + rnorm( length(data$LATITUDE) )
head(data[c('LATITUDE', 'new_latitude')])
```
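Two details are worth controlling when adding noise. `rnorm` defaults to a standard deviation of 1, and one degree of latitude is roughly 111 km, so the default noise is very large for location data. Calling `set.seed` first makes the result reproducible. The sketch below uses invented latitudes, and the value `sd = 0.001` is an arbitrary illustrative choice, not a recommendation.

```r
# Fix the random seed so the same noise is generated on every run
set.seed(42)

# Invented latitudes for illustration
latitude <- c(41.88, 41.79, 41.95)

# sd is in degrees; 0.001 is an arbitrary, much smaller scale than the default of 1
noisy <- latitude + rnorm(length(latitude), mean = 0, sd = 0.001)

max(abs(noisy - latitude))  # largest shift that was applied
```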