# Importing Data

In this second part of Importing Data in R, you will take a deeper dive into the wide range of data formats out there. More specifically, you'll learn how to import data from relational databases and how to import and work with data coming from the web. Finally, you'll get hands-on experience with importing data from statistical software packages such as SAS, STATA and SPSS.

## 1) Importing data from databases 

Many companies store their information in relational databases. The R community has also developed R packages to get data from these architectures. You'll learn how to connect to a database and how to retrieve data from it.

### 1.1) Connect to a database
First of all, we need a package that lets R communicate with the database. Which package you need depends on your database management system (DBMS), for example:

1. MySQL - RMySQL
2. PostgreSQL - RPostgreSQL
3. Oracle - ROracle
4. Etc.

But how does R interact with the database? The R functions you use to access and manipulate the database are specified in another R package called **DBI**. In more technical terms, DBI is the interface and RMySQL is the implementation. So first we install the RMySQL package, which automatically installs the DBI package as well.

    # First we need to install the package
    install.packages("RMySQL")
    # library(RMySQL) is not required yet!
    library(DBI)
    
#### a) Establish a connection
The first step to import data from a SQL database is creating a connection to it. You need different packages depending on the database you want to connect to, but all of these packages do this in a uniform way, as specified in the DBI package.

**dbConnect()** creates a connection between your R session and a SQL database. The first argument has to be a **DBIDriver** object that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with **RMySQL::MySQL()**.

If the MySQL database is a remote database hosted on a server, you'll also have to specify the following arguments in **dbConnect()**: `dbname`, `host`, `port`, `user` and `password`. The example in the next subsection fills in all of these.
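
The interface/implementation split can be tried out without a MySQL server. The following is a minimal sketch, assuming the RSQLite package is installed: it swaps `RSQLite::SQLite()` in for `RMySQL::MySQL()`, and the in-memory database `":memory:"` is used for illustration only. The `dbConnect()` call keeps the same shape; only the driver object and its connection arguments change.

```r
library(DBI)

# Assumption: RSQLite stands in for RMySQL so the sketch runs without a server
drv <- RSQLite::SQLite()

# Every DBI backend supplies a driver object that extends DBIDriver
is_dbi_driver <- methods::is(drv, "DBIDriver")

# Same generic call as for MySQL; only the driver-specific arguments differ
con <- dbConnect(drv, dbname = ":memory:")
connection_open <- dbIsValid(con)

dbDisconnect(con)
connection_closed <- !dbIsValid(con)
```
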

#### b) Import table data

After you've successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the **dbListTables()** function. In the same way, you can use **dbReadTable()** to read a specific table.

Example:

    # First we need to install the package
    install.packages("RMySQL")
    # library(RMySQL) is not required yet!
    library(DBI)

    # dbConnect() call
    con <- dbConnect(RMySQL::MySQL(),
                     # dbname = "tweater",
                     dbname = "company",
                     host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                     port = 3306,
                     user = "student",
                     password = "datacamp")

    # Check the class of the connection object
    class(con)

    # List all tables
    dbListTables(con)

    # Read a specific table
    dbReadTable(con, "employees")

    # Disconnect when you are done
    #dbDisconnect(con)
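
If the remote server is not reachable, the same list-and-read workflow can be tried locally. This is only a sketch under stated assumptions: an in-memory SQLite database (via the RSQLite package) replaces the MySQL server, and the `employees` table is made up here so that there is something to read.

```r
library(DBI)

# Assumption: an in-memory SQLite database replaces the remote MySQL server
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Hypothetical employees table, created only so there is something to read
employees <- data.frame(id = c(1L, 4L, 7L),
                        name = c("Tom", "Frank", "Julie"),
                        stringsAsFactors = FALSE)
dbWriteTable(con, "employees", employees)

tables <- dbListTables(con)            # names of all tables in the database
emp <- dbReadTable(con, "employees")   # the whole table as a data frame

dbDisconnect(con)
```
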

#### c) SQL Queries from inside R
In your life as a data scientist, you'll often be working with huge databases that contain tables with millions of rows. If you want to do some analyses on this data, it's possible that you only need a fraction of this data. In this case, it's a good idea to send SQL queries to your database, and only import the data you actually need into R.

**dbGetQuery()** is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string, for example:

    dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")
    
The most important point is that you can write whatever SQL statement you need, so only the rows and columns you ask for are imported into R.
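
A minimal sketch of this idea, assuming RSQLite is installed and using a made-up `people` table in an in-memory SQLite database instead of the MySQL server. The second query is an extra illustration of letting the database do the aggregation.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Hypothetical people table, made up for this sketch
people <- data.frame(name = c("Ann", "Bob", "Carl"),
                     age = c(31L, 45L, 27L),
                     gender = c("female", "male", "male"),
                     stringsAsFactors = FALSE)
dbWriteTable(con, "people", people)

# Only the rows and columns named in the query are imported into R
male_ages <- dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")

# Any valid SQL works, for example aggregation done by the database itself
avg_age <- dbGetQuery(con, "SELECT gender, AVG(age) AS mean_age
                            FROM people GROUP BY gender ORDER BY gender")

dbDisconnect(con)
```
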

#### d) DBI internals
You've used **dbGetQuery()** multiple times now. This is a generic function from the DBI package whose actual implementation is provided by the RMySQL package. Behind the scenes, the following steps are performed:

1. Sending the specified query with `dbSendQuery()`.
2. Fetching the result of executing the query on the database with `dbFetch()`. Its `n` argument sets the maximum number of records to retrieve per fetch.
3. Clearing the result with `dbClearResult()`.

Combining these three functions gives the same result as **dbGetQuery()**. The reason to do it by hand is that **dbFetch()** lets you limit the number of records retrieved per fetch. The equivalent of the query above is:

    res<-dbSendQuery(con, "SELECT age FROM people WHERE gender = 'male'")
    dbFetch(res)
    dbClearResult(res)
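
To show what the `n` argument is for, here is a hedged sketch of fetching a result in chunks. It assumes RSQLite and a made-up `people` table in an in-memory database; `dbHasCompleted()` is the DBI function that reports whether all records have been fetched.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "people", data.frame(age = c(31L, 45L, 27L, 52L, 38L)))

res <- dbSendQuery(con, "SELECT age FROM people")

# Fetch at most two records at a time until the result set is exhausted
chunk_sizes <- integer(0)
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)
  chunk_sizes <- c(chunk_sizes, nrow(chunk))
}

dbClearResult(res)
dbDisconnect(con)
```

Processing each chunk inside the loop keeps memory use bounded, which is the main reason to prefer this pattern over `dbGetQuery()` for very large results.
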
    
#### e) Disconnect
Every time you connect to a database using dbConnect(), you're creating a new connection to the database you're referencing. RMySQL automatically specifies a maximum of open connections and closes some of the connections for you, but still: it's always polite to manually disconnect from the database afterwards. You do this with the **dbDisconnect()** function.    
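
One way to make sure the disconnect always happens is to pair `dbConnect()` with `on.exit()` inside a function. The helper name `query_once` is hypothetical, and RSQLite with an in-memory database is assumed so the sketch runs without a server.

```r
library(DBI)

# Hypothetical helper: opens a connection, runs one query, always disconnects
query_once <- function(sql) {
  con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  # on.exit() runs even if the query below fails, so no connection is leaked
  on.exit(dbDisconnect(con), add = TRUE)
  dbGetQuery(con, sql)
}

result <- query_once("SELECT 1 + 1 AS two")
```
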

## 2) Importing data from the web
More and more of the information that data scientists are using resides on the web. Importing this data into R requires an understanding of the protocols used on the web. In this chapter, you'll get a crash course in HTTP and learn to perform your own HTTP requests from inside R.

### 2.1 HTTP
HTTP: HyperText Transfer Protocol

It is basically a system of rules for how data should be exchanged between computers; it is the language of the web.

#### a) Import flat files from the web
The **utils functions** for importing flat file data, such as `read.csv()` and `read.delim()`, can import directly from URLs that point to flat files on the web.

You may be wondering whether Hadley Wickham's alternative package, readr, is equally capable. It is, as the following example shows:

    ##Working with Web flat file
    library(readr)
    # Import the csv file: pools
    url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"


    # Import the txt file: potatoes
    url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"

    #read information
    pools<-read_csv(url_csv)
    potatoes<-read_tsv(url_delim)
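
The base `read.csv()` function takes a URL in exactly the same position as a file path. The sketch below is an offline stand-in, not a real web request: it assumes a `file://` URL pointing at a temporary file with made-up data, so that it runs without a network connection.

```r
# Write a small CSV to a temporary file (made-up data for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,length", "pool_a,25", "pool_b,50"), tmp)

# Assumption: a file:// URL stands in for an http:// URL so this runs offline
url_csv <- paste0("file://", normalizePath(tmp, winslash = "/"))

# read.csv() takes the URL exactly where a file path would go
pools <- read.csv(url_csv, stringsAsFactors = FALSE)
```
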

**Note:** readxl does not know how to handle Excel files that are stored on the web.

#### b) Secure importing
In the previous example, you have been working with URLs that all start with http://. There is, however, a safer alternative to HTTP, namely HTTPS, which stands for HyperText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.

Luckily for us, you can use the standard importing functions with https:// connections since R version 3.2.2.


### 2.2 Downloading files
As noted above, readxl cannot read .xls files that are on the internet, at least not yet, whereas gdata can. The example below imports the same file twice: once with gdata straight from the URL, and once with the readxl package via a workaround that downloads the file first.

    # Load the readxl and gdata package
    library(readxl)
    library(gdata)

    # Specification of url: url_xls
    url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"

    # Import the .xls file with gdata: excel_gdata
    excel_gdata<-read.xls(url_xls)

    # Download file behind URL, name it local_latitude.xls
    download.file(url_xls,"local_latitude.xls")

    # Import the local .xls file with readxl: excel_readxl
    excel_readxl<-read_excel("local_latitude.xls")
    
#### a) Downloading any file, secure or not
In the previous example you've seen how you can read Excel files on the web with readxl's `read_excel()` by first downloading the file with the `download.file()` function.

There's more: with `download.file()` you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also .RData files. An RData file is a very efficient format to store R data.

You can load data from an RData file using the `load()` function, but this function does not accept a URL string as an argument. In this example, you'll first download the RData file securely, and then import the local data file.


    # https URL to the wine RData file.
    url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"

    # Download the wine file to your working directory
    download.file(url_rdata,"wine_local.RData")

    # Load the wine data into your workspace using load()
    load("wine_local.RData")

    # Print out the summary of the wine data
    summary(wine)
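
The local half of this workflow can be sketched without a download. Here a made-up `wine` data frame is saved to a temporary .RData file as a stand-in for the downloaded file; loading into a separate environment (the `wine_env` name is an assumption) is optional, but it avoids silently overwriting objects in your workspace.

```r
# Made-up stand-in for the downloaded file: save an object to a local .RData
wine <- data.frame(alcohol = c(12.4, 13.1), color = c("red", "white"),
                   stringsAsFactors = FALSE)
rdata_path <- tempfile(fileext = ".RData")
save(wine, file = rdata_path)
rm(wine)

# load() restores objects under their saved names and returns those names;
# loading into a fresh environment avoids overwriting workspace objects
wine_env <- new.env()
loaded_names <- load(rdata_path, envir = wine_env)
wine_local <- wine_env$wine
```
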
    
#### b) HTTP? httr! 
Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.

httr provides a convenient function, GET() to execute this GET request. The result is a response object, that provides easy access to the status code, content-type and, of course, the actual content.

You can extract the content from the response using the `content()` function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or as an R object, such as a list. If you don't tell `content()` how to retrieve the content through the `as` argument, it'll try its best to figure out which type is most appropriate based on the content-type.

    # Load the httr package
    library(httr)

    # Get the url, save response to resp
    url <- "http://www.example.com/"
    resp <- GET(url)

    # Print resp
    resp

    # Get the raw content of resp: raw_content
    raw_content <- content(resp, as = "raw")

    # Print the head of raw_content
    head(raw_content)
    
Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp's Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.

Web APIs and JSON are covered in more detail in the next section, but some experimentation never hurts, does it?

    # httr is already loaded

    # Get the url
    url <- "http://www.omdbapi.com/?apikey=ff21610b&t=Annie+Hall&y=&plot=short&r=json"
    resp<-GET(url)

    # Print resp
    resp

    # Print content of resp as text
    content(resp, as="text")

    # Print content of resp
    content(resp)
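
The difference between the raw, text and parsed forms can be mimicked offline. This sketch does not call httr at all: it assumes a hand-written JSON body in place of a real server response and uses base R plus jsonlite to walk through the same three stages.

```r
library(jsonlite)

# Assumption: a hand-written body stands in for a real response from the server
raw_content <- charToRaw('{"Title":"Annie Hall","Type":"movie"}')

# Comparable to as = "raw": the bytes exactly as they were received
head(raw_content)

# Comparable to as = "text": the same bytes decoded into one character string
text_content <- rawToChar(raw_content)

# Comparable to the parsed form: JSON text converted into an R list
parsed_content <- fromJSON(text_content)
parsed_content$Title
```
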

## 3) Importing data from the web II
Importing data from the web is one thing; actually being able to extract useful information is another. Learn more about the JSON format to get one step closer to web domination.

#### a) APIs & JSON
The JSON format is simple, concise and well structured. On top of that, it is human-readable, yet also easy for machines to interpret and generate, which makes it perfect for communicating with Web APIs. An API (Application Programming Interface) is, very generally put, a set of routines and protocols for building software components; it defines how different components interact. Here we focus on Web APIs: interfaces for getting data and processed information from a server, or for adding data to a server.

We will work with the following package:

    ##JSON file
    install.packages("jsonlite")
    library(jsonlite)

#### b) From JSON to R
In the simplest setting, `fromJSON()` can convert character strings that represent JSON data into a nicely structured R list, for example:

    # Load the jsonlite package
    library(jsonlite)

    # wine_json is a JSON
    wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'

    # Convert wine_json into a list: wine
    wine<-fromJSON(wine_json)

    # Print structure of wine
    str(wine)

#### c) Quandl API
`fromJSON()` also works if you pass a URL as a character string or the path to a local file that contains JSON data. Let's try this out on the Quandl API, where you can fetch all sorts of financial and economic data.

    # Definition of quandl_url
    quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"

    # Import Quandl data: quandl_data
    quandl_data<-fromJSON(quandl_url)

    # Print structure of quandl_data
    str(quandl_data)

Note: you may need to run `install.packages('curl')` first.

#### d) OMDb API - (The Open Movie Database)
You have already seen how to fetch information on a movie from OMDb: simply perform a `GET()` call, and then ask for the contents with the `content()` function. This `content()` function, which is part of the `httr` package, uses `jsonlite` behind the scenes to import the JSON data into R.

However, by now you also know that jsonlite can handle URLs itself. Simply passing the request URL to `fromJSON()` will get your data into R.

    # The package jsonlite is already loaded

    # Definition of the URLs
    url_sw4 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0076759&r=json"
    url_sw3 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0121766&r=json"

    # Import two URLs with fromJSON(): sw4 and sw3
    sw4 <- fromJSON(url_sw4)
    sw3 <- fromJSON(url_sw3)

    # Print out the Title element of both lists
    sw4$Title
    sw3$Title

    # Is the release year of sw4 later than sw3?
    sw4$Year>sw3$Year


JSON is built on two structures: objects and arrays. To help you experiment with these, two JSON strings are included in the sample code. It's up to you to change them appropriately and then call jsonlite's `fromJSON()` function on them each time.

    # jsonlite is already loaded

    # Challenge 1
    json1 <- '[1, 2,3 ,4,5, 6]'
    fromJSON(json1)

    # Challenge 2
    json2 <- '{"a": [1, 2, 3],"b":[4,5,6]}'
    fromJSON(json2)
    
Another example:

    # Challenge 1: a 2x2 matrix
    json1 <- '[[1, 2], [3, 4]]'
    fromJSON(json1)

    # Challenge 2: add a third observation
    json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]'
    fromJSON(json2)
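
The challenges above rely on how `fromJSON()` simplifies JSON structures into R objects. The sketch below summarizes that mapping; the variable names are made up, and `simplifyVector = FALSE` is the jsonlite argument that switches the simplification off.

```r
library(jsonlite)

# A JSON array of numbers becomes an atomic vector
vec <- fromJSON('[1, 2, 3]')

# An array of equal-length arrays becomes a matrix
mat <- fromJSON('[[1, 2], [3, 4]]')

# An array of objects with the same keys becomes a data frame
df <- fromJSON('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]')

# simplifyVector = FALSE turns the simplification off and keeps nested lists
lst <- fromJSON('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]', simplifyVector = FALSE)
```
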

#### toJSON()
Apart from converting JSON to R with `fromJSON()`, you can also use `toJSON()` to convert R data to the JSON format. In its most basic use, you simply pass this function the R object you want to convert. The result is an R object of the class json, which is basically a character string representing that JSON.

    # jsonlite is already loaded

    # URL pointing to the .csv file
    url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"

    # Import the .csv file located at url_csv
    water<-read.csv(url_csv, stringsAsFactors=FALSE)

    # Convert the data file according to the requirements
    water_json<-toJSON(water)

    # Print out water_json
    water_json
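
For a quick offline look at what `toJSON()` produces, here is a small sketch with a made-up data frame. It also shows jsonlite's `dataframe` argument, which controls whether a data frame is written row by row (the default) or column by column.

```r
library(jsonlite)

# Small made-up data frame for illustration
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

# Default: one JSON object per row
rows_json <- toJSON(df)

# dataframe = "columns": one JSON array per column
cols_json <- toJSON(df, dataframe = "columns")

# Converting back recovers the original columns
df_back <- fromJSON(rows_json)
```
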

#### Minify and prettify
JSONs can come in different formats. Take these two JSONs, that are in fact exactly the same: the first one is in a minified format, the second one is in a pretty format with indentation, whitespace and new lines:

    # Mini
    {"a":1,"b":2,"c":{"x":5,"y":6}}

    # Pretty
    {
      "a": 1,
      "b": 2,
      "c": {
        "x": 5,
        "y": 6
      }
    }


Unless you're a computer, you surely prefer the second version. However, the standard form that `toJSON()` returns is the minified version, as it is more concise. You can change this behavior by setting the `pretty` argument inside `toJSON()` to TRUE. If you already have a JSON string, you can use `prettify()` or `minify()` to make the JSON pretty or as concise as possible.

    # jsonlite is already loaded

    # Convert mtcars to a pretty JSON: pretty_json
    pretty_json<-toJSON(mtcars, pretty=TRUE)

    # Print pretty_json
    pretty_json

    # Minify pretty_json: mini_json
    mini_json<-minify(pretty_json)

    # Print mini_json
    mini_json
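
`minify()` and `prettify()` also work on a plain JSON string, and they change only the formatting. The sketch below uses a made-up string to check that both forms parse to the same R object.

```r
library(jsonlite)

# Made-up JSON string in a pretty layout
json_pretty <- '{
  "a": 1,
  "b": [1, 2]
}'

mini <- minify(json_pretty)   # strips insignificant whitespace
pretty <- prettify(mini)      # adds indentation and new lines back

# Formatting does not change the data: both forms parse to the same R object
same_data <- identical(fromJSON(mini), fromJSON(pretty))
```
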

## 4) Importing data from statistical software packages
Next to R, there are also other commonly used statistical software packages: SAS, STATA and SPSS. Each of them has their own file format. Learn how to use the haven and foreign packages to get them into R with remarkable ease!

In the rest of this document, we will look at two packages that let us work with these file formats:

1. haven: more consistent, easier to use and faster than foreign
2. foreign: supports more data formats

### Import SAS data with haven
haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:

    SAS: read_sas()
    STATA: read_dta() (or read_stata(), which are identical)
    SPSS: read_sav() or read_por(), depending on the file type.

All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.

You'll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in your current working directory. You can also download the data.

    # Load the haven package
    library(haven)

    # Import sales.sas7bdat: sales
    sales<-read_sas("sales.sas7bdat")

    # Display the structure of sales
    str(sales)

### Import STATA data with haven
Next up are STATA data files; you can use `read_dta()` for these.

When inspecting the result of the `read_dta()` call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it's best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you'll need haven's `as_factor()` function.

In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found at: http://assets.datacamp.com/production/course_1478/datasets/trade.dta

    # haven is already loaded

    # Import the data from the URL: sugar
    sugar<-read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")

    # Structure of sugar
    str(sugar)

    # Convert values in Date column to dates
    sugar$Date<-as.Date(as_factor(sugar$Date))

    # Structure of sugar again
    str(sugar)
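
To see what a labelled vector looks like without downloading any file, the following sketch builds one by hand with haven's `labelled()` constructor and converts it with `as_factor()`. The codes and labels are made up for illustration.

```r
library(haven)

# Made-up labelled vector, similar to what read_dta() or read_sav() can return
gender <- labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2))

# as_factor() replaces the numeric codes with their value labels
gender_factor <- as_factor(gender)
levels(gender_factor)
```
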

### Import SPSS data with haven
The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you're working with, you'll need either `read_sav()` (for .sav files) or `read_por()` (for .por files).

In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological concept including, originally, five dimensions of personality to classify human personality. The SPSS dataset is called person.sav and is available in your working directory.

    # haven is already loaded

    # Import person.sav: traits
    traits<-read_sav("person.sav")

    # Summarize traits
    summary(traits)

    # Print out a subset
    subset(traits, traits$Extroversion>40 & traits$Agreeableness>40)
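
If person.sav is not at hand, the import and subsetting steps can be tried on a file you create yourself. This is a sketch with made-up values: haven's `write_sav()` writes a temporary .sav file, and `read_sav()` reads it back.

```r
library(haven)

# Made-up data standing in for person.sav
traits <- data.frame(Extroversion = c(45, 30, 52),
                     Agreeableness = c(41, 48, 39))

# Write a .sav file to a temporary path, then read it back with read_sav()
sav_path <- tempfile(fileext = ".sav")
write_sav(traits, sav_path)
traits_back <- read_sav(sav_path)

# Same kind of subsetting; columns can be named directly inside subset()
high <- subset(traits_back, Extroversion > 40 & Agreeableness > 40)
```
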

### Factorize, round two
In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It's advised to coerce (or change) these variables to factors or other standard R classes.

The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found on the following URL:

http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav

    # haven is already loaded

    # Import SPSS data from the URL: work
    work<-read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")

    # Display summary of work$GENDER
    summary(work$GENDER)


    # Convert work$GENDER to a factor
    work$GENDER<-as_factor(work$GENDER)


    # Display summary of work$GENDER again
    summary(work$GENDER)


I have not written up what we covered for the "foreign" package.