It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time actually analyzing it. For this reason, it is critical to become familiar with the data cleaning process and all of the tools available to you along the way. This course provides a very basic introduction to cleaning data in R using the tidyr, dplyr, and stringr packages. After taking the course you'll be able to go from raw data to awesome insights as quickly and painlessly as possible!

1. Exploring raw data
2. Tidying data
3. Preparing data for analysis

# 1) Introduction and exploring raw data

This chapter will give you an overview of the process of data cleaning with R, then walk you through the basics of exploring raw data.

## a) Exploring raw data
The first step in the data cleaning process is exploring your raw data. We can think of this in three steps:

1. Understand the structure of your data.
2. Look at your data.
3. Visualize your data.

For the first step, understanding the structure of your data, R offers several tools:

    # Load a specific dataset, for example
    lunch <- read.csv("dataset/lunch_clean.csv")

    # View the class
    class(lunch)

    # View its dimensions
    dim(lunch)

    # Look at the column names
    names(lunch)

    # See the structure
    str(lunch)

    # Summarize each column
    summary(lunch)
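
If `lunch_clean.csv` is not at hand, the same inspection steps can be tried on a small made-up data frame. The sketch below is only illustrative: `toy` and its values are hypothetical and do not come from the course data.

```r
# Hypothetical toy data standing in for the lunch dataset
toy <- data.frame(
  year = c(2010, 2011, 2012),
  perc_free_red = c(55.9, 58.2, 60.1)
)

class(toy)    # "data.frame"
dim(toy)      # 3 rows, 2 columns
names(toy)    # "year" "perc_free_red"
str(toy)      # compact view of each column's type and first values
summary(toy)  # per-column minimum, quartiles, mean and maximum
```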

**The `glimpse()` function from the `dplyr` package is a slightly cleaner alternative to `str()`.**

The str(), head(), and summary() functions are designed to give you some information about a dataset without being overwhelming. However, this dataset is so large and has so many variables that even these outputs seemed pretty intimidating!

The glimpse() function from the dplyr package often formats information in a more approachable way.

    #load dplyr
    library(dplyr)
    glimpse(lunch)

## b) Exploring raw data (part 2)

We have seen several functions that show the structure of our data, but it is always better to look at the data itself. For this we have:

    # View the top or the bottom; by default only the first or last 6 rows are shown
    head(lunch, n = 10)
    tail(lunch, n = 10)

    # Another way to spot issues is to plot the data
    hist(lunch$perc_free_red)

    # We can also plot two variables against each other
    plot(lunch$year, lunch$perc_free_red)
    
    
    


# 2) Tidying data
Note: tidy (adj.) means put in order.

## a) tidyr - gather and spread

tidyr is a wonderful package whose purpose is to help us apply the principles of tidy data. We won't cover every function, only the most common ones.

The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.
    
    gather(wide_df, my_key, my_val, -col)
    # Another example
    gather(df, time, val, t1:t3)
    
Notice that gather() allows you to select multiple columns to be gathered by using the : operator.    

The opposite of gather() is spread(), which takes key-value pairs and spreads them across multiple columns. This is useful when values in a column should actually be column names (i.e. variables). It can also make data more compact and easier to read.

    spread(long_df, my_key, my_val)


A complete example:

    a <- c("X", "Y")
    b <- c(1, 4)
    c <- c(2, 5)
    d <- c(3, 6)

    wide_df <- data.frame(a, b, c, d, stringsAsFactors = FALSE)
    names(wide_df) <- c("Categoria", "A", "B", "C")
    wide_df

    # Gather (stack) the columns of wide_df
    # install.packages("tidyr")
    library(dplyr)
    library(tidyr)
    str(wide_df)

    # Name_Col and Name_Val are the names given to the new key and value columns
    wide_gather <- gather(wide_df, Name_Col, Name_Val, -Categoria)
    wide_gather

    # Spread the key-value pairs back into columns
    spread(wide_gather, Name_Col, Name_Val)


Output:

    # wide_df
      Categoria A B C
    1         X 1 2 3
    2         Y 4 5 6

    # str(wide_df)
    'data.frame':	2 obs. of  4 variables:
     $ Categoria: chr  "X" "Y"
     $ A        : num  1 4
     $ B        : num  2 5
     $ C        : num  3 6

    # wide_gather
      Categoria Name_Col Name_Val
    1         X        A        1
    2         Y        A        4
    3         X        B        2
    4         Y        B        5
    5         X        C        3
    6         Y        C        6

    # spread(wide_gather, Name_Col, Name_Val)
      Categoria A B C
    1         X 1 2 3
    2         Y 4 5 6
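
In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`, although the older functions still work. As a sketch, assuming a recent tidyr is installed, the same round trip could be written as follows; the data and column names repeat the ones used in this section.

```r
library(tidyr)

wide_df <- data.frame(
  Categoria = c("X", "Y"),
  A = c(1, 4), B = c(2, 5), C = c(3, 6),
  stringsAsFactors = FALSE
)

# Stack every column except Categoria into name/value pairs
long_df <- pivot_longer(wide_df, cols = -Categoria,
                        names_to = "Name_Col", values_to = "Name_Val")

# Spread the name/value pairs back into one column per name
back <- pivot_wider(long_df, names_from = Name_Col, values_from = Name_Val)
```

Note that `pivot_longer()` orders its result row by row (all of X's values, then all of Y's), whereas `gather()` stacks one column after another.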


## b) tidyr - separate

It is often useful to split the data in a single column into multiple columns. We can do this easily with the separate() function.

    separate(data, col, into)  # an optional fourth argument, sep, sets the separator
    
The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in the new column, but this behavior can be altered via the sep argument.  

    unite(data, col, ...)

A complete example:

    library(tidyr)
    pat <- c("X", "Y", "X", "Y", "X", "Y")
    treat <- c("A", "A", "B", "B", "C", "C")
    ye_mo <- c("2010-10", "2010-10", "2012-08", "2012-08", "2014-12", "2014-12")
    res <- c(1, 4, 2, 5, 3, 6)
    treatments <- data.frame(pat, treat, ye_mo, res)
    names(treatments) <- c("patient", "treatment", "year_mo", "response")
    treatments

    # Separate year_mo into year and month
    dt_sepr <- separate(treatments, year_mo, c("year", "month"), sep = "-")
    dt_sepr

    # Unite year and month again, this time with a slash
    unite(dt_sepr, ye_mo2, year, month, sep = "/")



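Output:

    # treatments
      patient treatment year_mo response
    1       X         A 2010-10        1
    2       Y         A 2010-10        4
    3       X         B 2012-08        2
    4       Y         B 2012-08        5
    5       X         C 2014-12        3
    6       Y         C 2014-12        6

    # dt_sepr (year and month are character columns, so "08" keeps its zero)
      patient treatment year month response
    1       X         A 2010    10        1
    2       Y         A 2010    10        4
    3       X         B 2012    08        2
    4       Y         B 2012    08        5
    5       X         C 2014    12        3
    6       Y         C 2014    12        6

    # unite(dt_sepr, ye_mo2, year, month, sep = "/")
      patient treatment  ye_mo2 response
    1       X         A 2010/10        1
    2       Y         A 2010/10        4
    3       X         B 2012/08        2
    4       Y         B 2012/08        5
    5       X         C 2014/12        3
    6       Y         C 2014/12        6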
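
By default `separate()` returns the new columns as character strings. If numbers are wanted instead, it accepts `convert = TRUE`. A minimal sketch, using a cut-down, hypothetical version of the `treatments` data (the name `small` is made up for illustration):

```r
library(tidyr)

# Hypothetical cut-down version of the treatments data
small <- data.frame(
  patient = c("X", "Y"),
  year_mo = c("2010-10", "2012-08"),
  stringsAsFactors = FALSE
)

# Default: the pieces stay as text
as_text <- separate(small, year_mo, c("year", "month"), sep = "-")

# convert = TRUE: the pieces are converted to a suitable type
as_numbers <- separate(small, year_mo, c("year", "month"), sep = "-",
                       convert = TRUE)

class(as_text$month)     # "character"
class(as_numbers$month)  # "integer"
```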


# 3) Preparing data for analysis
This chapter will teach you how to prepare your data for analysis. We will look at type conversion, string manipulation, missing and special values, and outliers and obvious errors.

## a) Type of variable in R
Here is a quick review of the main types:

    1 character: "treatment","123"
    2 numeric: 23.44, 120 , NaN, Inf
    3 integer: 25L, 1123L
    4 factor: factor("Hello"), factor(8)
    5 logical: TRUE, FALSE , NA
    
In addition to these types, R provides functions that tell us what type a variable has or that coerce a variable to another type:

    class(value)       # get the type of a variable
    as.numeric(value)  # convert to another type; see also as.character(), as.integer(), as.factor(), as.logical()
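
As a small illustration of coercion in base R, the sketch below converts a few made-up values and shows one common pitfall with factors. The variable names are hypothetical.

```r
x <- "123"
class(x)                 # "character"

n <- as.numeric(x)       # the string "123" becomes the number 123
class(n)                 # "numeric"

i <- as.integer("25")    # 25L
l <- as.logical("TRUE")  # TRUE

# Text that is not a number becomes NA (R also emits a warning)
bad <- suppressWarnings(as.numeric("treatment"))

# Factors store integer codes, so go through character first
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # 1 2 1 (the level codes)
values <- as.numeric(as.character(f))  # 10 20 10 (the labels as numbers)
```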
    
So far we have not covered dates, because they are a little harder to work with, but many date problems can be solved with additional packages such as lubridate:

    install.packages("lubridate")
    #Load package 
    library("lubridate")

    #Experiment 
    ymd("2018-08-25")

    hms("13:33:09")

    ymd_hms("2018/08/25 13.33.09")
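
Once a string has been parsed, its components can be extracted. A short sketch, assuming lubridate is installed; the dates are made up for illustration.

```r
library(lubridate)

# The function name gives the order of the components in the text
d1 <- ymd("2018-08-25")
d2 <- dmy("25/08/2018")
d3 <- mdy("08/25/2018")

# All three describe the same day
d1 == d2
d2 == d3

# Pull individual components out of a parsed date
year(d1)
month(d1)
day(d1)
```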

## b) String Manipulation
Manipulating text strings is a crucial skill when cleaning data in R. Just as the lubridate package makes it easier to work with dates, the stringr package makes it easier to work with strings.

Some examples:

    install.packages("stringr")
    library(stringr)
    #Trim leading and trailing white space
    str_trim(" Hola Mundo         ")

    #Pad string with zeros
    str_pad("10936", width = 10, side="left", pad="0")


    #search for a string in a vector
    names<-c("Luis","Antonio","Zacarias")
    str_detect(names,"Zaca")

    # Replace a string in a vector
    str_replace(names, "Zacarias", "Meza")

    # Convert to upper case and lower case
    toupper("hola mundo")
    tolower("HI WORLD")


## c) Missing and special values

Before dealing with missing values in the data, it's important to find them and figure out why they exist in the first place. If your dataset is too big to look at all at once, like it is here, remember you can use sum() and is.na() to quickly size up the situation by counting the number of NA values.

The summary() function may also come in handy for identifying which variables contain the missing values. Finally, the which() function is useful for locating the missing values within a particular column.

Carefully consider what the inputs and outputs are for the functions is.na() and which(). is.na() takes a vector and returns a logical vector that is TRUE at each position where the input is NA and FALSE otherwise. which() takes a logical vector and returns the indices (positions) at which it is TRUE.
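
To make those inputs and outputs concrete, here is a minimal base R sketch on a made-up vector:

```r
x <- c(4, NA, 7, NA, 1)

is.na(x)         # FALSE TRUE FALSE TRUE FALSE
which(is.na(x))  # 2 4, the positions of the missing values
sum(is.na(x))    # 2, because TRUE counts as 1 and FALSE as 0
```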

As you've seen, missing values in R should be represented by NA, but unfortunately you will not always be so lucky. Before you can deal with missing values, you have to find them in the data.

If missing values are properly coded as NA, the is.na() function will help you find them. Otherwise, if your dataset is too big to just look at the whole thing, you may need to try searching for some of the usual suspects like "", "#N/A", etc. You can also use the summary() and table() functions to turn up unexpected values in your data.
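
One way to handle such non-standard markers is to find them with table() and then recode them as NA. The sketch below assumes a hypothetical character column called `status`; the markers searched for are the ones mentioned above.

```r
# Hypothetical column in which missing values were entered three different ways
status <- c("ok", "", "#N/A", "ok", NA)

# Tabulate the values, including NA, to reveal the unexpected "" and "#N/A"
table(status, useNA = "ifany")

# Recode the non-standard markers as NA
status[status %in% c("", "#N/A")] <- NA

sum(is.na(status))  # 3
```

When the data comes from a file, the `na.strings` argument of `read.csv()` can do the same recoding at load time.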


    new_df<-data.frame(A=c(1,NA,8,NA),B=c(3,NA,88,23),C=c(2,45,3,1))
    new_df

    # If the dataset is not very large, we can inspect the output of is.na() directly
    is.na(new_df)

    # For larger datasets that output is not practical, so check whether any value is missing
    any(is.na(new_df))

    # How many?
    sum(is.na(new_df))

    # Use summary() to find NAs
    summary(new_df)

    # To work only with complete rows:
    # Find rows with no missing values
    complete.cases(new_df)

    # Subset the data, keeping only complete cases
    new_df[complete.cases(new_df), ]

    # Another way to remove rows with NAs
    na.omit(new_df)

## d) Outliers and obvious errors

When dealing with strange values in your data, you often must decide whether they are just extreme or actually erroneous. Extreme values show up all over the place, but you, the data analyst, must figure out when they are plausible and when they are not.

Another useful way of looking at strange values is with boxplots. Simply put, boxplots draw a box around the middle 50% of values for a given variable, with a bolded horizontal line drawn at the median. Values that fall far from the bulk of the data points (i.e. outliers) are denoted by open circles. (If you're curious about the exact formula for determining what is "far", check out ?boxplot.)
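
As a sketch in base R, using made-up numbers: with its default settings, boxplot() flags points that lie more than 1.5 times the length of the box beyond its edges (this is controlled by its `range` argument), and boxplot.stats() returns those flagged values.

```r
# Hypothetical measurements with one suspicious value
x <- c(1, 2, 3, 4, 5, 100)

# Draw the boxplot; 100 appears as an open circle above the box
boxplot(x)

# boxplot.stats() lists the values that boxplot() flags as outliers
boxplot.stats(x)$out  # 100
```

Whether 100 is an error or merely an extreme value still has to be decided from what the variable means.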


## Summary

Understanding the structure of your data

class() - class of the data object
dim() - dimensions of the data
names() - column names
str() - preview of the data with helpful details
glimpse() - better version of str() from dplyr
summary() - summary of the data

Looking at your data

head() - view the top of the dataset
tail() - view the bottom of the dataset
print() - view the entire dataset (not recommended!)

Visualizing your data

hist() - view a histogram of a single variable
plot() - view a plot of two variables

Output of the missing-values example in section c):

    # new_df
       A  B  C
    1  1  3  2
    2 NA NA 45
    3  8 88  3
    4 NA 23  1

    # is.na(new_df)
             A     B     C
    [1,] FALSE FALSE FALSE
    [2,]  TRUE  TRUE FALSE
    [3,] FALSE FALSE FALSE
    [4,]  TRUE FALSE FALSE

    # summary(new_df)
           A              B              C        
     Min.   :1.00   Min.   : 3.0   Min.   : 1.00  
     1st Qu.:2.75   1st Qu.:13.0   1st Qu.: 1.75  
     Median :4.50   Median :23.0   Median : 2.50  
     Mean   :4.50   Mean   :38.0   Mean   :12.75  
     3rd Qu.:6.25   3rd Qu.:55.5   3rd Qu.:13.50  
     Max.   :8.00   Max.   :88.0   Max.   :45.00  
     NA's   :2      NA's   :1                     

    # new_df[complete.cases(new_df), ] (na.omit(new_df) prints the same rows)
      A  B C
    1 1  3 2
    3 8 88 3
