# Exercise - Density-based Spatial Clustering of Applications with Noise (DBSCAN) in R
This notebook is designed to help you get familiar with the DBSCAN algorithm in R through a simple exercise.

In this notebook, we will load and explore 'seeds.csv' in R, a dataset containing measurements of geometrical properties of kernels belonging to three different varieties of wheat. Specifically, this exercise covers:

1. Installing and loading the DBSCAN library and the data in R
2. Creating a subset from the dataset
3. Using the dbscan function to obtain clusters
4. Converting the clusters to factors
5. Attaching the clusters to the measurements
6. Visualizing the results


## Installing and loading the DBSCAN library and the data in R
We begin by installing the dbscan package and loading it along with the dataset.



In [None]:
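# Install and load the 'dbscan' package
install.packages("dbscan", dependencies = TRUE)
library('dbscan')
# Download the file in the Data Scientist Workbench
download.file("https://ibm.box.com/shared/static/th5dsj5txtw052dt2k9i5hekkjhydda2.csv", "/resources/seeds.csv")
# Read the downloaded file into a data frame (assumes a CSV file with a header row)
seeds <- read.csv("/resources/seeds.csv")
# Preview the first rows
head(seeds)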

## Using the dbscan function to obtain clusters
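A minimal sketch of this step, assuming the `dbscan` package is installed. The columns of the seeds data are not shown in this notebook, so the built-in `iris` data stands in for it here (four numeric measurements plus a label column). The `eps` and `minPts` values are illustrative assumptions, not tuned settings.

```r
library(dbscan)

# Stand-in for the seeds data: keep only the numeric measurement columns,
# dropping the species label (DBSCAN does not use labels).
measurements <- as.matrix(iris[, 1:4])

# eps is the neighbourhood radius; minPts is the minimum number of points
# required inside that radius for a point to count as a core point.
# Both values are assumptions chosen for illustration.
result <- dbscan(measurements, eps = 0.5, minPts = 5)

# One integer label per row; 0 marks points classified as noise.
clusters <- result$cluster
table(clusters)
```

With the real data, the same call would take the subset of seeds measurements in place of `measurements`, and `eps` would likely need adjusting to the scale of those columns.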


## Attaching the clusters to the measurements
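A small sketch of converting cluster labels to a factor and attaching them to the measurements. The data frame, its column names (`area`, `perimeter`), and the label vector are hypothetical stand-ins for the seeds subset and the `dbscan` output.

```r
# Hypothetical measurements standing in for a subset of the seeds data
measurements <- data.frame(
  area      = c(15.2, 14.9, 18.7, 19.1, 11.0),
  perimeter = c(14.8, 14.6, 16.2, 16.4, 12.9)
)

# Hypothetical cluster labels in the format dbscan returns (0 marks noise)
cluster_ids <- c(1L, 1L, 2L, 2L, 0L)

# Convert the integer labels to a factor so R treats them as categories
cluster_factor <- factor(cluster_ids)

# Attach the factor as a new column next to the measurements
clustered <- cbind(measurements, cluster = cluster_factor)
str(clustered)
```

Storing the labels as a factor keeps plotting and summary functions from treating cluster numbers as a continuous quantity.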


## Visualizing the results
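One possible way to visualize clusters with base R graphics is sketched below. The points, column names, and labels are hypothetical stand-ins for two seeds measurements and the `dbscan` output, and the colour choices are arbitrary.

```r
# Hypothetical two-column measurements and cluster labels (0 = noise)
pts <- data.frame(
  area      = c(15.2, 14.9, 15.0, 18.7, 19.1, 18.9, 11.0),
  perimeter = c(14.8, 14.6, 14.7, 16.2, 16.4, 16.3, 12.9)
)
cluster_ids <- c(1L, 1L, 1L, 2L, 2L, 2L, 0L)

# One colour per label: grey for noise, then one colour per cluster.
# Adding 1 shifts the labels 0, 1, 2 to the vector positions 1, 2, 3.
palette_by_label <- c("grey50", "steelblue", "tomato")
point_cols <- palette_by_label[cluster_ids + 1L]

# Scatter plot coloured by cluster membership
plot(pts$area, pts$perimeter, col = point_cols, pch = 19,
     xlab = "area", ylab = "perimeter", main = "DBSCAN clusters (sketch)")
legend("topleft", legend = c("noise", "cluster 1", "cluster 2"),
       col = palette_by_label, pch = 19)
```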


## Selecting columns

## Filtering Data
Filter the DataFrame to retain only rows with `mpg` less than 18.

In [None]:
SparkR::head(SparkR::filter(sdf, sdf$mpg < 18))

## Operating on Columns
SparkR also provides a number of functions that can be applied directly to columns for data processing and aggregation. The example below uses basic arithmetic to convert the weight column `wt` from pounds to metric tons (the factor 0.45 is approximate and assumes `wt` is recorded in thousands of pounds).

In [None]:
sdf$wtTon <- sdf$wt * 0.45
SparkR::head(sdf)
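The conversion factor can be checked on a local data frame. This sketch assumes `sdf` was created from the built-in `mtcars` data, which the column names suggest but this notebook does not show. In `mtcars`, `wt` is recorded in units of 1000 lb.

```r
# Assumption: sdf mirrors the built-in mtcars data, where wt is in 1000 lb.
kg_per_lb <- 0.45359237   # the pound is defined as exactly this many kilograms
local_cars <- mtcars

# 1000 lb = 1000 * kg_per_lb kg = kg_per_lb metric tons,
# so multiplying wt by kg_per_lb gives metric tons directly.
local_cars$wtTon <- local_cars$wt * kg_per_lb

# Largest difference between the exact factor and the rounded 0.45
max(abs(local_cars$wtTon - local_cars$wt * 0.45))
head(local_cars[, c("wt", "wtTon")])
```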

## Grouping, Aggregation
SparkR data frames support a number of commonly used functions to aggregate data after grouping. For example, we can compute the average weight of cars by number of cylinders as shown below:

In [None]:
SparkR::head(summarize(groupBy(sdf, sdf$cyl), wtavg = avg(sdf$wtTon)))

In [None]:
# We can also sort the output from the aggregation to find the most common cylinder counts
car_counts <- summarize(groupBy(sdf, sdf$cyl), count = n(sdf$wtTon))
SparkR::head(arrange(car_counts, desc(car_counts$count)))
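For comparison, the same grouping and sorting can be sketched on a local data frame with base R. This assumes `sdf` was created from the built-in `mtcars` data, which is not shown in this notebook. The names `local_cars` and `wt_by_cyl` are introduced here for illustration only.

```r
# Assumption: sdf mirrors the built-in mtcars data.
local_cars <- mtcars
local_cars$wtTon <- local_cars$wt * 0.45

# Average weight per cylinder count (counterpart of groupBy + avg)
wt_by_cyl <- aggregate(wtTon ~ cyl, data = local_cars, FUN = mean)
names(wt_by_cyl)[2] <- "wtavg"
wt_by_cyl

# Number of cars per cylinder count, most common first
# (counterpart of groupBy + n, followed by arrange with desc)
car_counts <- as.data.frame(table(cyl = local_cars$cyl), responseName = "count")
car_counts <- car_counts[order(-car_counts$count), ]
car_counts
```

A local check like this is only practical for data that fits in memory; the SparkR version distributes the same aggregation across the cluster.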

### Running SQL Queries from Spark DataFrames
A Spark DataFrame can also be registered as a temporary table in Spark SQL, which allows you to run SQL queries over its data. The `sql` function lets applications run SQL queries programmatically and returns the result as a DataFrame.



In [None]:
# Register this DataFrame as a table.
registerTempTable(sdf, "cars")
# SQL statements can be run by using the sql method
highgearcars <- sql(sqlContext, "SELECT gear FROM cars WHERE cyl >= 4 AND cyl <= 9")
SparkR::head(highgearcars)


NOTE: This tutorial draws heavily from the original 
[Spark Quick Start Guide](http://spark.apache.org/docs/latest/quick-start.html)



<h3>Authors:</h3>
<br>
<a href="https://ca.linkedin.com/in/saeedaghabozorgi">
    <div class="teacher-image" style="    float: left;
        width: 115px;
        height: 115px;
        margin-right: 10px;
        margin-bottom: 10px;
        border: 1px solid #CCC;
        padding: 3px;
        border-radius: 3px;
        text-align: center;"><img class="alignnone wp-image-2258 " src="https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg" alt="Saeed Aghabozorgi" width="178" height="178"/>
    </div>
</a>

<h4>Saeed Aghabozorgi</h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods such as machine learning and statistical modelling on large datasets.</p>

<br>

<a href="https://ca.linkedin.com/in/polonglin">
    <div class="teacher-image" style="    float: left;
        width: 115px;
        height: 115px;
        margin-right: 10px;
        margin-bottom: 10px;
        border: 1px solid #CCC;
        padding: 3px;
        border-radius: 3px;
        text-align: center;"><img class="alignnone size-medium wp-image-2177" src="https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg" alt="Polong Lin" width="300" height="300"/>
    </div>
</a>
<h4>Polong Lin</h4>
<p>
<a href="https://ca.linkedin.com/in/polonglin">Polong Lin</a> is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker at conferences and meetups, and holds an M.Sc. in Cognitive Psychology.</p>

<hr>
Copyright &copy; 2016 [Big Data University](https://bigdatauniversity.com/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).