# Data Mining using R

#### R in a nutshell

- A statistical programming environment
- Originally designed and implemented by statisticians
- Widely popular due to its extensive collection of community-contributed packages
- Quickly gaining market share on traditional proprietary data-analytics tools such as SAS and Stata

#### Learning Objectives

- Understand data acquisition: downloading from static links, crawling through entire websites, and streaming data from real-time sources
- Understand data curation: working with hierarchically structured data (text, XML/HTML, JSON)
- Understand data management: organizing data directories, working with databases
- Understand HPC concepts: automating the data-mining process on the Palmetto and Cypress supercomputers

## Where am I?

In [None]:
getwd()

In [None]:
setwd("/home/lngo/data-mining-r/")

In [None]:
getwd()

## Data Curation

- For JSON and XML formats, we will use the [NY Retail Food Store Database from data.gov](https://catalog.data.gov/dataset/retail-food-stores/resource/498a7e81-ea0e-425c-bb8d-a4e36d619f81)


### 2. XML Format

- Extensible Markup Language
- Example of data in XML format: `samples/books.xml` from https://msdn.microsoft.com/en-us/library/ms762271(v=vs.85).aspx

Package `xml2` reads XML data into a tree structure that can then be inspected and processed by the package's other functions.

In [None]:
library(xml2)

In [None]:
sample_xml <- read_xml('./samples/books.xml')

In [None]:
sample_xml

In [None]:
xml_structure(sample_xml)

In [None]:
book_catalog <- as_list(sample_xml)

In [None]:
str(book_catalog)

In [None]:
summary(book_catalog)

For the XML version of the NY food store data, we don't have separate data and metadata. Instead, everything is stored together as tags, attributes, and values.

In [None]:
print(getwd())
stores_file <- file.path('data', 'food_stores.xml')  # backup copy in samples/food_stores.xml
stores_url <- 'https://data.ny.gov/api/views/9a8c-vfzj/rows.xml?accessType=DOWNLOAD'

In [None]:
download.file(stores_url,stores_file,method = "wget",quiet = TRUE)

In [None]:
stores_xml <- read_xml(stores_file)

In [None]:
str(stores_xml)

In [None]:
list_stores <- as_list(stores_xml)

As the output of `as_list` on the much smaller `sample_xml` suggests, displaying the full structure of a lengthy list is risky. Check sizes frequently instead.

In [None]:
print(length(list_stores))

In [None]:
print(length(list_stores[[1]]))

In [None]:
print(length(list_stores[[1]][[1]]))

In [None]:
str(list_stores[[1]][[1]])

Similar to the JSON case, to convert elements of this XML-based list into rows of a data frame, we need to first construct the headers:

- XML's tags become names of the list's elements
- XML's attributes become attributes of associated elements within the list

In [None]:
names(list_stores[[1]][[1]])

We don't want `location` itself but the attributes `latitude` and `longitude` of `location`.

In [None]:
attributes(list_stores[[1]][[1]][['location']])

In [None]:
xml_headers <- names(list_stores[[1]][[1]])
location_attributes <- names(attributes(list_stores[[1]][[1]][['location']]))
stores_headers <- xml_headers[1:(length(xml_headers) - 1)]
stores_headers <- c(stores_headers,
                   location_attributes[[2]],
                   location_attributes[[3]])
print(stores_headers)

While it is possible to *hardcode* header information, an implementation that relies on information in the raw data is more dynamic, maintainable, and reusable.
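
One way to carry the same idea into the construction of the data frame is to derive the columns from the header vector rather than from fixed index ranges. The following is a minimal sketch in base R; `make_empty_df` is a hypothetical helper and the header names are illustrative, not taken from the data set.

```r
# Hypothetical helper: build an empty data frame whose columns come from
# a character vector of headers instead of hardcoded index ranges.
make_empty_df <- function(headers, numeric_cols, n_rows) {
  columns <- lapply(headers, function(h) {
    # Numeric placeholder for numeric columns, character otherwise
    if (h %in% numeric_cols) numeric(n_rows) else character(n_rows)
  })
  names(columns) <- headers
  data.frame(columns, stringsAsFactors = FALSE, check.names = FALSE)
}

# Illustrative headers; in this notebook they would come from stores_headers
demo_headers <- c("county", "entity_name", "latitude", "longitude")
demo_df <- make_empty_df(demo_headers,
                         numeric_cols = c("latitude", "longitude"),
                         n_rows = 3)
str(demo_df)
```

With this design, adding or removing a tag in the raw data changes only the header vector, not the loop bounds.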

In [None]:
stores_counts <- length(list_stores[[1]])

df_stores <- data.frame(character(stores_counts), stringsAsFactors=FALSE)
for (i in 2:14){
    df_stores[,i] <- character(stores_counts)
}

for (i in 15:16){
    df_stores[,i] <- numeric(stores_counts)
}

colnames(df_stores) <- stores_headers

In [None]:
str(df_stores)

In [None]:
for (i in 1:1){
    tmpList <- list_stores[[1]][[i]]
    for (j in 1:14){
        if (!is.null(tmpList[[j]]) && length(tmpList[[j]]) > 0){
            df_stores[i,j] <- tmpList[[j]][[1]]
        }
    }
    location_attributes <- attributes(tmpList[['location']])
    for (j in 15:16){
        if (!is.null(location_attributes[[j-13]])){
            df_stores[i,j] <- as.numeric(location_attributes[[j-13]])
        }
    }
    print (df_stores[i,])
}

In [None]:
for (i in 1:stores_counts){
    tmpList <- list_stores[[1]][[i]]
    for (j in 1:14){
        if (!is.null(tmpList[[j]]) && length(tmpList[[j]]) > 0){
            df_stores[i,j] <- tmpList[[j]][[1]]
        }
    }
    location_attributes <- attributes(tmpList[['location']])
    for (j in 15:16){
        if (!is.null(location_attributes[[j-13]])){
            df_stores[i,j] <- as.numeric(location_attributes[[j-13]])
        }
    }
}

In [None]:
print(list_stores[[1]][[10]])
print(df_stores[10,])

For complex XML data, the recommended approach is to use the [XPath Query Language](https://en.wikipedia.org/wiki/XPath):

- `/node` = top-level node
- `//node` = node at any level
- `node[@attr]` = node that has an attribute named `attr`
- `node[@attr='something']` = node that has an attribute named `attr` with value `'something'`
- `node/@attr` = value of attribute `attr` in nodes that have such an attribute

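XPath queries can be passed to the `xml_find_all` and `xml_find_first` functions of package `xml2` to select the XML elements whose tags and attributes match the query patterns. (The package's `xml_path` function does the reverse: it returns the XPath of a given node.)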
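
The query patterns listed above can be tried on a small in-memory document. This is a minimal sketch, assuming package `xml2` is installed; the catalog below is made up for illustration and is not the content of `samples/books.xml`.

```r
library(xml2)

# Hypothetical catalog; structure and values are illustrative only
toy_catalog <- read_xml(paste0(
  '<catalog>',
  '<book id="bk101" genre="Computer"><title>Book A</title></book>',
  '<book id="bk102" genre="Fantasy"><title>Book B</title></book>',
  '<book id="bk103"><title>Book C</title></book>',
  '</catalog>'
))

# //node: every <title> element at any depth
all_titles <- xml_text(xml_find_all(toy_catalog, "//title"))

# node[@attr]: only <book> elements that carry a genre attribute
books_with_genre <- xml_find_all(toy_catalog, "//book[@genre]")

# node[@attr='something']: titles of books whose genre is exactly 'Fantasy'
fantasy_titles <- xml_text(
  xml_find_all(toy_catalog, "//book[@genre='Fantasy']/title")
)

# node/@attr: the attribute values themselves
book_ids <- xml_text(xml_find_all(toy_catalog, "//book/@id"))

print(all_titles)
print(length(books_with_genre))
print(fantasy_titles)
print(book_ids)
```

Because each query returns a node set, the results can be fed directly into `xml_text` or `xml_attr`, which avoids the nested index loops needed with `as_list`.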

### 3. Download and process HTML pages

### 4. Process text data