#DataFrame Column Class

**Data Source**
* One hour of Pagecounts from the English Wikimedia projects captured August 5, 2016, at 12:00 PM UTC.
* Size on Disk: ~23 MB
* Type: Compressed Parquet File
* More Info: <a href="https://dumps.wikimedia.org/other/pagecounts-raw" target="_blank">Page view statistics for Wikimedia projects</a>

**Technical Accomplishments:**
* Continue exploring the `DataFrame` set of APIs.
* Introduce the `Column` class.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) **The Data Source**

We will be using the same data source as our previous notebook.

As such, we can go ahead and start by creating our initial `DataFrame`.

In [0]:
(source, sasEntity, sasToken) = getAzureDataSource()
spark.conf.set(sasEntity, sasToken)

parquetFile = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"

In [0]:
parquetFile

Out[7]: 'wasbs://training@dbtrainwesteurope.blob.core.windows.net//wikipedia/pagecounts/staging_parquet_en_only_clean/'

In [0]:
dbutils.fs.mounts()

Out[8]: [MountInfo(mountPoint='/mnt/training', source='wasbs://training@dbtrainwesteurope.blob.core.windows.net/', encryptionType=''),
 MountInfo(mountPoint='/databricks-datasets', source='databricks-datasets', encryptionType=''),
 MountInfo(mountPoint='/databricks/mlflow-tracking', source='databricks/mlflow-tracking', encryptionType=''),
 MountInfo(mountPoint='/databricks-results', source='databricks-results', encryptionType=''),
 MountInfo(mountPoint='/databricks/mlflow-registry', source='databricks/mlflow-registry', encryptionType=''),
 MountInfo(mountPoint='/', source='DatabricksRoot', encryptionType='')]

In [0]:
%fs head /databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000

In [0]:
%fs head /databricks-datasets/wikipedia-datasets/data-001/pageviews/raw/pageviews_by_second.tsv

In [0]:
%fs ls /databricks-datasets/amazon/data20K


path,name,size,modificationTime
dbfs:/databricks-datasets/amazon/data20K/_SUCCESS,_SUCCESS,0,1456715438000
dbfs:/databricks-datasets/amazon/data20K/_common_metadata,_common_metadata,324,1456715438000
dbfs:/databricks-datasets/amazon/data20K/_metadata,_metadata,16159,1456715438000
dbfs:/databricks-datasets/amazon/data20K/part-r-00000-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,part-r-00000-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,403964,1456715437000
dbfs:/databricks-datasets/amazon/data20K/part-r-00001-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,part-r-00001-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,388283,1456715438000
dbfs:/databricks-datasets/amazon/data20K/part-r-00002-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,part-r-00002-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,389132,1456715438000
dbfs:/databricks-datasets/amazon/data20K/part-r-00003-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,part-r-00003-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,380250,1456715438000
dbfs:/databricks-datasets/amazon/data20K/part-r-00004-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,part-r-00004-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,377018,1456715438000
dbfs:/databricks-datasets/amazon/data20K/part-r-00005-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,part-r-00005-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,364893,1456715438000
dbfs:/databricks-datasets/amazon/data20K/part-r-00006-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,part-r-00006-112e73de-1ab1-447b-b167-0919dd731adf.gz.parquet,395181,1456715438000


In [0]:
%fs head /databricks-datasets/bikeSharing/data-001/day.csv


In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/Migration.csv

In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/UN.csv

In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/States.csv


In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/Salaries.csv


In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/Prestige.csv


In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/Pottery.csv


In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/Greene.csv


In [0]:
%fs head /databricks-datasets/Rdatasets/data-001/csv/car/Chile.csv


In [0]:
file = '/databricks-datasets/Rdatasets/data-001/csv/car/Greene.csv'
# Read the CSV with its header row; without schema inference every column is read as a string.
df = spark.read.option("header", "true").csv(file).cache()
display(df)

_c0,judge,nation,rater,decision,language,location,success
13,Heald,Lebanon,no,no,English,Toronto,-1.09861
15,Heald,Sri.Lanka,no,no,English,Toronto,-0.75377
19,Heald,El.Salvador,no,yes,English,Toronto,-1.04597
30,MacGuigan,Czechoslovakia,no,yes,French,Montreal,0.40547
36,Desjardins,Lebanon,yes,yes,French,Montreal,-1.09861
42,Stone,Lebanon,yes,yes,English,Toronto,-1.09861
45,Pratte,Ghana,no,no,English,Toronto,-1.20831
46,MacGuigan,Somalia,no,no,English,Toronto,-0.99462
51,MacGuigan,Somalia,yes,no,French,Montreal,-0.99462
52,Pratte,China,no,no,English,other,-0.99462


In [0]:

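# Same file, read with the pandas API on Spark (pyspark.pandas).
# Note: this rebinds `df` to a pandas-on-Spark DataFrame, not a Spark DataFrame.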
import pyspark.pandas as ps
 
csvFile = '/databricks-datasets/Rdatasets/data-001/csv/car/Greene.csv'
df = ps.read_csv(csvFile, sep=",")

df.head()

Unnamed: 0,_c0,judge,nation,rater,decision,language,location,success
0,13,Heald,Lebanon,no,no,English,Toronto,-1.09861
1,15,Heald,Sri.Lanka,no,no,English,Toronto,-0.75377
2,19,Heald,El.Salvador,no,yes,English,Toronto,-1.04597
3,30,MacGuigan,Czechoslovakia,no,yes,French,Montreal,0.40547
4,36,Desjardins,Lebanon,yes,yes,French,Montreal,-1.09861


Looking at the pagecounts data, we can see that there is no real rhyme or reason as to how it is sorted.
* We cannot even tell if the column **project** is sorted - we are seeing only the first 1,000 of some 2.3 million records.
* The column **article** is not sorted, as evidenced by the article **A_Little_Boy_Lost** appearing between a bunch of articles starting with numbers and symbols.
* The column **requests** is clearly not sorted.
* And our **bytes_served** contains nothing but zeros.

So let's start by sorting our data. In doing this, we can answer the following question:

What are the top 10 most requested articles?

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) orderBy(..) & sort(..)

If you look at the API docs, `orderBy(..)` is described like this:
> Returns a new Dataset sorted by the given expressions.

Both `orderBy(..)` and `sort(..)` arrange all the records in the `DataFrame` as specified.
* Like `distinct()` and `dropDuplicates()`, `sort(..)` and `orderBy(..)` are aliases for each other.
  * `sort(..)` appeals to functional programmers.
  * `orderBy(..)` appeals to developers with an SQL background.
* Each of these two methods has two variants:
  * `orderBy(Column)`
  * `orderBy(String)`
  * `sort(Column)`
  * `sort(String)`

All we need to do now is sort our previous `DataFrame`.

In [0]:
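# Not working like this: `df` was rebound above to a pandas-on-Spark DataFrame,
# which has no orderBy(..) method, so the first call already fails (see the traceback below).

df.orderBy("language").show(10, False)

# Even on a Spark DataFrame, this string would be read as a column named "language desc",
# not as a descending sort.
df.orderBy("language desc").show(10, False)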

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/databricks/spark/python/pyspark/pandas/frame.py in __getattr__(self, key)
  12502         try:
> 12503             return self.loc[:, key]
  12504         except KeyError:

/databricks/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
    480                 series_name,
--> 481             ) = self._select_cols(cols_sel)
    482

/databricks/spark/python/pyspark/pandas/indexing.p

In [0]:
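from pyspark.sql.functions import col  # needed for col(..)
# Re-read the CSV as a Spark DataFrame and sort by language, descending.
df = spark.read.option("header", "true").csv(file).orderBy(col("language").desc())
display(df)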

_c0,judge,nation,rater,decision,language,location,success
30,MacGuigan,Czechoslovakia,no,yes,French,Montreal,0.40547
36,Desjardins,Lebanon,yes,yes,French,Montreal,-1.09861
51,MacGuigan,Somalia,yes,no,French,Montreal,-0.99462
75,Hugessen,Lebanon,no,no,French,Montreal,-1.09861
77,MacGuigan,China,no,no,French,Montreal,-0.99462
113,Hugessen,Lebanon,no,no,French,Montreal,-1.09861
114,Hugessen,El.Salvador,yes,no,French,Montreal,-1.04597
120,Hugessen,Sri.Lanka,no,no,French,Montreal,-0.75377
135,Hugessen,Somalia,yes,no,French,Montreal,-0.99462
140,Hugessen,Somalia,no,no,French,Montreal,-0.99462


### Review Column Class

`Column` objects provide us with a programmatic way to build up SQL-ish expressions.

Besides the `Column.desc()` operation we used above, we have a number of other operations that can be performed on a `Column` object.

Here is a preview of the various functions - we will cover many of these as we progress through the class:

**Column Functions**
* Various mathematical functions such as add, subtract, multiply & divide
* Various bitwise operators such as AND, OR & XOR
* Various null tests such as `isNull()`, `isNotNull()` & `isNaN()`.
* `as(..)`, `alias(..)` & `name(..)` - Returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
* `between(..)` - A boolean expression that is evaluated to true if the value of this expression is between the given columns.
* `cast(..)` & `astype(..)` - Convert the column into type dataType.
* `asc(..)` - Returns a sort expression based on the ascending order of the given column name.
* `desc(..)` - Returns a sort expression based on the descending order of the given column name.
* `startswith(..)` - String starts with another string literal.
* `endswith(..)` - String ends with another string literal.
* `isin(..)` - A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
* `like(..)` - SQL LIKE expression.
* `rlike(..)` - SQL RLIKE expression (LIKE with Regex).
* `substr(..)` - An expression that returns a substring.
* `when(..)` & `otherwise(..)` - Evaluates a list of conditions and returns one of multiple possible result expressions.

The complete list of functions differs from language to language.

## Next steps

Start the next lesson, [Work with Column expressions]($./2.DataFrame-Column-Expressions)