## Introduction

All data problems begin with a question and end with a narrative construct that provides a clear answer. Once you have the question, the next step is getting your data. As a Data Scientist, you'll spend an incredible amount of time and skill on acquiring, preparing, cleaning, and normalizing your data. In this tutorial, we'll review some of the best tools used in the realm of data acquisition.

But first, let's go into the differences between Data Acquisition, Preparation, and Cleaning. 

### Data Acquisition

Data Acquisition is the process of getting your data, hence the term <i>acquisition</i>. Data doesn't come out of nowhere, so the very first step of any data science problem is going to be getting the data in the first place. 

### Data Preparation

Once you have the data, it might not be in the best format to work with. You might have scraped a bunch of data from a website but need it in the form of a dataframe to work with it more easily. This process is called data preparation: putting your data into the format that's easiest to work with.

### Data Cleaning

Once your data is stored or handled in a proper manner, that still might not be enough. You might have missing values or values that need normalizing. Fixing these inconsistencies before analysis is called data cleaning.


## Reading, Writing, and Handling Data Files

The simplest way of acquiring data is downloading a file, either from a website, straight from your desktop, or elsewhere. Once the data is downloaded, you'll open the files for reading and possibly writing.

### CSV files

Very often, you'll have to work with CSV files. A CSV (comma-separated values) file stores tabular data in plain text.

In the following examples, we'll be working with NBA data, which you can download from [here](https://github.com/ByteAcademyCo/data-acq/blob/master/nba.csv).
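
As a minimal sketch of reading such a file in Python, the standard library's `csv` module can parse each row into a dictionary keyed by the header. The column names and the two rows below are made up for illustration (they are not taken from `nba.csv`), and an in-memory string stands in for the file so the example is self-contained.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a CSV file.
sample = "Name,Team,Number\nPlayer A,Team X,7\nPlayer B,Team Y,23\n"

# csv.DictReader uses the first row as the header and yields one dict per row.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

for row in rows:
    print(row["Name"], row["Team"], row["Number"])
```

With a real file, `io.StringIO(sample)` would be replaced by an open file handle such as `open("nba.csv", newline="")`.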

### R Programming

We've just gone through how to read CSV files in Python. But how do you do this in R? Pretty simply, actually. R has built-in functions to handle CSV files, so you don't even need a library to do what we just did with Python.

In [2]:
data <- read.csv("nba.csv")

### JSON

Because HTTP is a protocol for transferring text, the data you request through a web API (which we'll go through soon enough) needs to be serialized into a string format, usually in JavaScript Object Notation (JSON). JavaScript objects look quite similar to Python dicts, which makes their string representations easy to interpret:

```
{ 
 "name" : "Lesley Cordero",
 "job" : "Data Scientist",
 "topics" : [ "data", "science", "data science"] 
}
```
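
A short sketch of how this looks in Python, using the standard library's `json` module: `json.loads` parses the string into a dict and `json.dumps` turns it back into a string. The variable names are only illustrative.

```python
import json

serialized = """{
 "name" : "Lesley Cordero",
 "job" : "Data Scientist",
 "topics" : [ "data", "science", "data science"]
}"""

# json.loads parses a JSON string into Python objects (here, a dict).
deserialized = json.loads(serialized)
print(deserialized["topics"])

# json.dumps goes the other way, turning the dict back into a string.
reserialized = json.dumps(deserialized)
print(reserialized)
```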

#### jsonlite

Now, in R, working with JSON can be a bit more complicated. Unlike Python, where dictionaries map closely onto JSON objects, R doesn't have a data type that closely resembles JSON. So we have to work with what we do have, which is lists, vectors, and matrices.

Working with the same data from the Python example, we have:


In [3]:
serialized = '{ 
 "name" : "Lesley Cordero",
 "job" : "Data Scientist",
 "topics" : [ "data", "science", "data science"] 
} '

Now, if we want to properly load this into R, we'll be using the `jsonlite` library. 

In [1]:
library("jsonlite")

Once we've loaded the library, we'll use the `fromJSON` function to convert this into a data type R is more familiar with: <b>lists</b>.


In [4]:
l <- fromJSON(serialized, simplifyVector=TRUE)

Notice that `simplifyVector` is set to `TRUE`, which coerces JSON arrays containing only primitive values into atomic vectors. By default, this also enables `simplifyMatrix`, so JSON arrays containing equal-length sub-arrays simplify into a matrix.

And to convert this back to JSON, we type:

In [5]:
toJSON(l, pretty=TRUE)


{
  "name": ["Lesley Cordero"],
  "job": ["Data Scientist"],
  "topics": ["data", "science", "data science"]
} 

## APIs

There are several ways to extract information from the web. Using APIs (Application Programming Interfaces) is probably the best way to extract data from a website. APIs are especially useful if your data is constantly changing. Many websites have public APIs providing data feeds via JSON or some other format.

So far we've seen APIs with Python. Let's take a look at how you can use R to make some simple API calls. We'll be working with the `httr` library and the EPDB API.

### GET request

There are many different types of requests. The simplest is a GET request, which is used to retrieve data. In Python, for example, you can make a GET request to the `OpenNotify` API to get the latest position of the International Space Station. In R, we first load `httr` and define the URL and path of the endpoint:


In [1]:
library("httr")
url  <- "http://api.epdb.eu"
path <- "eurlex/directory_code"

With `httr`, you can make GET requests, like this:


In [3]:
raw.result <- GET(url=url, path=path)
print(raw.result)

Response [http://api.epdb.eu/eurlex/directory_code/]
  Date: 2017-06-07 19:44
  Status: 200
  Content-Type: application/json
  Size: 121 kB



Notice the 'status' label? Great, that'll bring us to status codes. 

### Status Codes

What we just printed was a status code of `200`. Status codes are returned with every request made to a web server and indicate what happened with a request. The following are the most common types of status codes:

- `200` - everything worked as planned!
- `301` - the server is redirecting you to another endpoint.
- `400` - it means you made a bad request by not sending the right data or some other error.
- `401` - you're not authenticated, which means you don't have access to the server.
- `403` - this means access is forbidden. 
- `404` - whatever you tried to access wasn't found. 
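
For reference, Python's standard library exposes the official reason phrase for each of these codes through `http.HTTPStatus`. The helper name `describe` below is hypothetical; this is just a small sketch for looking codes up without making any request.

```python
from http import HTTPStatus


def describe(status_code):
    """Return a short label such as '404 Not Found' for a status code."""
    status = HTTPStatus(status_code)
    return f"{status.value} {status.phrase}"


# Look up the codes from the list above.
for code in (200, 301, 400, 401, 403, 404):
    print(describe(code))
```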

If we try to access something that doesn't exist, we'll get a `404` error. To check the status code of the response we already have, read its `status_code` element:

In [5]:
raw.result$status_code

Now let's try a GET request in Python where the status code returned is `400`.


In [5]:
# python
import requests
response = requests.get("http://api.open-notify.org/iss-pass.json")
print(response.status_code)

400


As mentioned before, this indicates a bad request. The endpoint requires two parameters, as you can see [here](http://open-notify.org/Open-Notify-API/ISS-Pass-Times/).

We set these with the optional `params` argument. You can make a dictionary and then pass it into the `requests.get` function, as follows:


In [6]:
# python
parameters = {"lat": 40.71, "lon": -74}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
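
To see roughly how a parameter dictionary ends up in the URL, here is a sketch using the standard library's `urllib.parse.urlencode`. It only builds the string and makes no request; `requests` does its own encoding internally, which may differ in detail, so treat this as an illustration rather than a description of its implementation.

```python
from urllib.parse import urlencode

base_url = "http://api.open-notify.org/iss-pass.json"
parameters = {"lat": 40.71, "lon": -74}

# urlencode turns the dictionary into "key=value" pairs joined by "&".
query_string = urlencode(parameters)
full_url = base_url + "?" + query_string
print(full_url)
```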

Alternatively, you can skip the dictionary and put the parameters directly in the URL:


In [8]:
# python
response = requests.get("http://api.open-notify.org/iss-pass.json?lat=40.71&lon=-74")
print(response.content)

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1496864501, \n    "latitude": 40.71, \n    "longitude": -74.0, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 645, \n      "risetime": 1496866724\n    }, \n    {\n      "duration": 591, \n      "risetime": 1496872564\n    }, \n    {\n      "duration": 550, \n      "risetime": 1496878434\n    }, \n    {\n      "duration": 609, \n      "risetime": 1496884248\n    }, \n    {\n      "duration": 638, \n      "risetime": 1496890035\n    }\n  ]\n}\n'


This is pretty messy, but luckily, we can parse the JSON into a Python dictionary with:


In [10]:
# python
data = response.json()
print(data)

{'message': 'success', 'request': {'altitude': 100, 'datetime': 1496864501, 'latitude': 40.71, 'longitude': -74.0, 'passes': 5}, 'response': [{'duration': 645, 'risetime': 1496866724}, {'duration': 591, 'risetime': 1496872564}, {'duration': 550, 'risetime': 1496878434}, {'duration': 609, 'risetime': 1496884248}, {'duration': 638, 'risetime': 1496890035}]}
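
Once parsed, the result can be walked like any other Python dictionary. The sketch below uses a shortened copy of the output above instead of a live request, and assumes that `risetime` is a Unix timestamp in seconds and `duration` is a number of seconds.

```python
from datetime import datetime, timezone

# A shortened copy of the parsed response shown above.
data = {
    "message": "success",
    "response": [
        {"duration": 645, "risetime": 1496866724},
        {"duration": 591, "risetime": 1496872564},
    ],
}

passes = []
for entry in data["response"]:
    # Assumption: risetime is seconds since the Unix epoch, in UTC.
    rise = datetime.fromtimestamp(entry["risetime"], tz=timezone.utc)
    passes.append((rise.isoformat(), entry["duration"]))

for when, seconds in passes:
    print(when, seconds)
```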


Now, back in R, let's list the names of the elements in the response from our earlier GET request:


In [4]:
# R
print(names(raw.result))

 [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
 [6] "content"     "date"        "times"       "request"     "handle"     


You can extract each of the elements above with the `$` operator, for example `raw.result$status_code`.

## Web Scraping

Web Scraping tools are specifically developed for extracting information from websites. Web Scraping mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

### HTML

While performing web scraping, we deal with HTML tags, so we need a good understanding of them. Below is the basic syntax of HTML:

``` html
<!DOCTYPE html> 
<html>
	<body>
		<h1> First Heading </h1>
		<p> First Paragraph </p>
	</body>
</html>
```
Let's break down each of these tags:

1. `<!DOCTYPE html>`: This is the initial HTML declaration.
2. `<html>`: The HTML document is going to be contained within this tag.
3. `<body>`: The visible portion of the HTML document goes between the opening and closing `<body>` tags.
4. `<h1>`: This is an HTML heading.
5. `<p>`: HTML paragraphs are defined here. 

We've also got the following tags:

- `<a>`: This tag defines an HTML link, such as:
``` HTML
<a href="http://byteacademy.co">This is Byte Academy's website!</a>
```
- `<table>`: HTML tables are defined with this tag. Note that each `<tr>` defines a row and each `<td>` defines a cell within that row, such as:
``` HTML
<table style="width:100%">
	<tr>
		<td>Lesley</td>
		<td>Cordero</td>
		<td>24</td>
	</tr>
	<tr>
		<td>Helen</td>
		<td>Chen</td>
		<td>22</td>
	</tr>
</table>
```

This will yield the following:

```
Lesley		Cordero		24
Helen		Chen		22
```
- `<li>` defines a single list item. The surrounding `<ul>` or `<ol>` tag determines whether it's an unordered list or an ordered list.
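
To make the tag structure concrete, here is a minimal Python sketch that pulls the cell text out of the table above using only the standard library's `html.parser`. The class name `TableCellParser` is hypothetical, and the parser assumes every `<td>` sits inside a `<tr>`, which holds for this sample but not for arbitrary HTML.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


html = """
<table style="width:100%">
  <tr><td>Lesley</td><td>Cordero</td><td>24</td></tr>
  <tr><td>Helen</td><td>Chen</td><td>22</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(html)
print(parser.rows)
```

This turns the unstructured markup into a list of rows, which is the kind of structured form scraping aims for.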

### rvest

Now we'll try scraping a website with R. R has a library called `rvest` which allows you to scrape the HTML from any webpage. In the following two lines, we load this library and fetch the HTML with the `read_html` function.

In [1]:
library(rvest)
movie <- read_html("http://www.imdb.com/title/tt1490017/")

Loading required package: xml2


Let's now scrape some information from the website. `html_nodes` extracts pieces out of HTML documents using CSS selectors, while `html_text` extracts the text inside the selected nodes. Using these two functions, we can extract the rating for this movie.


In [3]:
rating <- movie %>%
    html_nodes("strong span") %>%
    html_text() %>%
    as.numeric()
print(rating)

[1] 7.8


Next, let's get the cast of the movie:


In [5]:
cast <- movie %>%
    html_nodes("#titleCast .itemprop span") %>%
    html_text()
print(cast)

 [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"     "Alison Brie"    
 [5] "David Burrows"   "Anthony Daniels" "Charlie Day"     "Amanda Farinos" 
 [9] "Keith Ferguson"  "Will Ferrell"    "Will Forte"      "Dave Franco"    
[13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"     


And lastly, we extract the first movie review on the site:


In [6]:
review <- movie %>%
    html_nodes("#titleUserReviewsTeaser p") %>%
    html_text()
print(review)

[1] "This film has great animation and a great story, it has an all star voice cast example Morgan freeman and will Ferrell. This film will be up on the shelve as one of the greatest films ever animated, ever thought about and ever written. When I was a kid playing with Lego I never thought to my self that they will make a film on it now that they have all my Christmases have come at once. Cant wait for the special features on blue ray. This movie will be up there with toy story 1 2 and 3 , the lion king , frozen and wreck it Ralph. The power of good films are in Lego hands people are genius congrats to all the Oscars for the road ahead"


## Advanced Web Scraping


### Sitemaps

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

An example of what a sample XML sitemap might look like is:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>
```
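
A crawler can read the URLs out of a sitemap like this one. The sketch below parses the sample above with Python's standard `xml.etree.ElementTree`; the prefix `sm` is an arbitrary local name chosen here for the sitemap namespace, and a real crawler would first download the file rather than use an inline string.

```python
import xml.etree.ElementTree as ET

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>"""

# Every element lives in the sitemap namespace, so lookups must be qualified.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Parse from bytes so the encoding declaration in the XML header is honoured.
root = ET.fromstring(sitemap.encode("utf-8"))
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```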

### Estimating Website Size

The size of the website will affect how you crawl it. If the website is just a few hundred URLs, such as our example website, efficiency is not important. However, if the website has over a million web pages, downloading each sequentially would take months. 
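
A quick back-of-the-envelope calculation shows the difference in scale. The figure of 5 seconds per page (download time plus a polite delay between requests) is an assumption made up for this sketch, not a measured value.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def sequential_crawl_days(num_pages, seconds_per_page):
    """Days needed to download num_pages one at a time."""
    return num_pages * seconds_per_page / SECONDS_PER_DAY


# Assumed cost of 5 seconds per page for both sites.
small_site = sequential_crawl_days(300, 5)
large_site = sequential_crawl_days(1_000_000, 5)
print(round(small_site, 3), round(large_site, 1))
```

Under that assumption the small site takes minutes, while the million-page site takes roughly two months of continuous downloading.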

### Regular Expressions

A regular expression is a sequence of characters that defines a search pattern for matching strings.

#### Simplest Form

The simplest form of a regular expression is a plain sequence of characters, which matches exactly that sequence. For example, <i>python</i> would be  

``` 
python
```

#### Case Sensitivity

Regular expressions are <b>case sensitive</b>, which means 

``` 
p and P
```
are distinguishable from each other. This means <i>python</i> and <i>Python</i> would have to be represented differently, as follows: 

``` 
python and Python
```

We can check these are different by running:

In [1]:
import re
re1 = re.compile('python')
print(bool(re1.match('Python')))

False


#### Disjunctions

If you want a regular expression to represent both <i>python</i> and <i>Python</i>, however, you can use <b>brackets</b> or the <b>pipe</b> symbol as the disjunction of the two forms. For example, 
``` 
[Pp]ython or Python|python
```
could represent either <i>python</i> or <i>Python</i>. Likewise, 

``` 
[0123456789]
```
would represent a single digit. The pipe symbol is typically used for interchangeable strings, such as in the following example:

```
dog|cat
```
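
The two disjunction forms can be checked with Python's `re` module. This is a small sketch; the test words and the sample sentence are made up for illustration.

```python
import re

bracket = re.compile(r"[Pp]ython")      # one character chosen from a set
pipe = re.compile(r"Python|python")     # one of two whole alternatives
pets = re.compile(r"dog|cat")

# fullmatch succeeds only if the entire word fits the pattern.
for word in ("python", "Python", "PYTHON"):
    print(word, bool(bracket.fullmatch(word)), bool(pipe.fullmatch(word)))

# findall returns every non-overlapping match in the text.
found = pets.findall("my dog chased the cat")
print(found)
```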

#### Ranges

If we want a regular expression to express the disjunction of a range of characters, we can use a <b>dash</b>. For example, instead of the previous example, we can write 

``` 
[0-9]
```
Similarly, we can represent all lowercase letters of the alphabet with 

``` 
[a-z]
```

#### Exclusions

Brackets can also be used to represent what an expression <b>cannot</b> be if you combine it with the <b>caret</b> sign. For example, the expression 

``` 
[^p]
```
represents any single character, special characters included, except <i>p</i>.
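
Ranges and exclusions can be tried out the same way. In this sketch, `re.findall` lists every single character in a made-up sample string that fits each bracket expression.

```python
import re

# [0-9] matches any one digit.
digits = re.findall(r"[0-9]", "route 66, exit 4")

# [a-z] matches any one lowercase letter, so "H" and "M" are skipped.
lowercase = re.findall(r"[a-z]", "Hi Mom")

# [^p] matches any one character that is not a lowercase p.
not_p = re.findall(r"[^p]", "pop!")

print(digits, lowercase, not_p)
```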

#### Question Marks 

A question mark matches zero or one instance of the previous character. For example, 

``` 
colou?r
```
represents either <i>color</i> or <i>colour</i>. Question marks are often used in cases of plurality. For example, 

``` 
computers?
```
can be either <i>computers</i> or <i>computer</i>. If you want to extend this to more than one character, you can put the sequence within parentheses, like this:

```
Feb(ruary)?
```
This would evaluate to either <i>February</i> or <i>Feb</i>.
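
A short Python sketch of the optional forms above; the candidate words are chosen for illustration, including two that should not match.

```python
import re

colour = re.compile(r"colou?r")       # the "u" is optional
month = re.compile(r"Feb(ruary)?")    # the whole group "ruary" is optional

colour_matches = [w for w in ("color", "colour", "colouur") if colour.fullmatch(w)]
month_matches = [w for w in ("Feb", "February", "Febru") if month.fullmatch(w)]

print(colour_matches)
print(month_matches)
```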

#### Kleene Star

To represent expressions containing zero or <b>more</b> instances of the previous character, we use an <b>asterisk</b>, known as the Kleene star. To represent the set of strings <i>a, ab, abb, abbb, ...</i>, the following regular expression would be used:  
```
ab*
```
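
As a quick check in Python, the sketch below tests a handful of made-up candidate strings against this pattern.

```python
import re

star = re.compile(r"ab*")   # an "a" followed by zero or more "b"s

candidates = ("a", "ab", "abbb", "b", "aab")
star_matches = [s for s in candidates if star.fullmatch(s)]
print(star_matches)
```

Note that `a` on its own matches, because the star allows zero occurrences of `b`.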

#### Wildcards

Wildcards represent any single character and are written as a <b>period</b>. For example, 

```
beg.n
```
From this regular expression, the strings <i>begun, begin, began,</i> etc., can be generated. 
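
The sketch below tries the wildcard in Python against a few made-up candidates. It also shows that the period stands for exactly one character of any kind, including a space.

```python
import re

wildcard = re.compile(r"beg.n")   # "." stands for exactly one character

candidates = ("begin", "began", "begun", "begn", "beg n")
wildcard_matches = [w for w in candidates if wildcard.fullmatch(w)]
print(wildcard_matches)
```

By default, Python's `.` does not match a newline unless the `re.DOTALL` flag is passed.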

#### Kleene+

To represent expressions containing <b>one</b> or more instances of the previous character, we use a <b>plus</b> sign. To represent the set of strings <i>ab, abb, abbb, ...</i>, the following regular expression would be used:  

```
ab+
```
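
The difference from the Kleene star shows up on the string `a`. This Python sketch runs the same made-up candidates through both patterns.

```python
import re

plus = re.compile(r"ab+")   # an "a" followed by one or more "b"s
star = re.compile(r"ab*")   # an "a" followed by zero or more "b"s

candidates = ("a", "ab", "abbb")
plus_matches = [s for s in candidates if plus.fullmatch(s)]
star_matches = [s for s in candidates if star.fullmatch(s)]

print(plus_matches)
print(star_matches)
```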