# 10 minutes to Koalas

This is a short introduction to [Koalas](https://github.com/databricks/koalas), aimed mainly at new users. This notebook shows some key differences between pandas and Koalas.

Customarily, we import Koalas as follows:

In [2]:
import pandas as pd
import numpy as np
import databricks.koalas as ks

import warnings
warnings.filterwarnings("ignore")  # Ignore warnings coming from Arrow optimizations

## Object Creation

Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [5]:
s = ks.Series([1, 3, 5, np.nan, 6, 8])

s

In [6]:
type(s)

Creating a Koalas DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [8]:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

kdf

Unnamed: 0,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


Creating a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [10]:
dates = pd.date_range('20130101', periods=6)

dates

In [11]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

pdf

Unnamed: 0,A,B,C,D
2013-01-01,0.571412,1.130104,-1.047144,-0.994619
2013-01-02,0.160117,-0.003886,0.402744,-1.455676
2013-01-03,-1.491063,-1.240989,-0.396866,-0.875398
2013-01-04,-0.204507,-0.882168,0.206942,-0.365183
2013-01-05,-0.338535,-1.286701,1.109749,-1.51458
2013-01-06,0.046836,-0.586225,-1.734231,0.588693


Now, this pandas DataFrame can be converted to a Koalas DataFrame:

In [13]:
kdf = ks.from_pandas(pdf)

type(kdf)

It looks and behaves the same as a pandas DataFrame, though:

In [15]:
kdf

Unnamed: 0,A,B,C,D
2013-01-01,0.571412,1.130104,-1.047144,-0.994619
2013-01-02,0.160117,-0.003886,0.402744,-1.455676
2013-01-03,-1.491063,-1.240989,-0.396866,-0.875398
2013-01-04,-0.204507,-0.882168,0.206942,-0.365183
2013-01-05,-0.338535,-1.286701,1.109749,-1.51458
2013-01-06,0.046836,-0.586225,-1.734231,0.588693


It is also possible to create a Koalas DataFrame from a Spark DataFrame.

Creating a Spark DataFrame from a pandas DataFrame:

In [17]:
# `spark` is the active SparkSession (predefined in Databricks notebooks and the PySpark shell).
sdf = spark.createDataFrame(pdf)

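Creating a Koalas DataFrame from a Spark DataFrame.
`to_koalas()` is automatically attached to Spark DataFrames and is available as an API once Koalas is imported. Note that the pandas datetime index is not carried through the Spark DataFrame, so the result below has a default integer index.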

In [19]:
kdf = sdf.to_koalas()

kdf

Unnamed: 0,A,B,C,D
0,0.571412,1.130104,-1.047144,-0.994619
1,0.160117,-0.003886,0.402744,-1.455676
2,-1.491063,-1.240989,-0.396866,-0.875398
3,-0.204507,-0.882168,0.206942,-0.365183
4,-0.338535,-1.286701,1.109749,-1.51458
5,0.046836,-0.586225,-1.734231,0.588693


Koalas DataFrame columns have specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Currently, the types that are common to both Spark and pandas are supported.

In [21]:
print(kdf.dtypes)

## Viewing Data

See the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html).

See the top rows of the frame. The results may not be the same as in pandas, though: unlike pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark just takes the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` or `.iloc` instead.

In [24]:
kdf.head()

Unnamed: 0,A,B,C,D
0,0.571412,1.130104,-1.047144,-0.994619
1,0.160117,-0.003886,0.402744,-1.455676
2,-1.491063,-1.240989,-0.396866,-0.875398
3,-0.204507,-0.882168,0.206942,-0.365183
4,-0.338535,-1.286701,1.109749,-1.51458
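
As a rough illustration of the advice above, the following sketch selects rows by label and sorts before taking the head, so the result does not depend on row order. It uses pandas so that it runs without a Spark session; the text above states that Koalas offers the same `.loc` accessor, and `sort_values` and `head` appear elsewhere in this notebook. The names `pdf_small`, `picked`, and `top_two` are made up for this example.

```python
import pandas as pd

# Same 'a' and 'b' columns as the first DataFrame in this notebook,
# built with pandas so the snippet runs without Spark.
pdf_small = pd.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600]},
    index=[10, 20, 30, 40, 50, 60])

# Label-based selection names the rows explicitly instead of relying on order.
picked = pdf_small.loc[[10, 30]]

# Sorting first makes "the first n rows" well defined.
top_two = pdf_small.sort_values(by='b', ascending=False).head(2)
```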


Display the index, the columns, and the underlying NumPy data.

You can also retrieve the index; an index column can be assigned to a DataFrame (see later).

In [26]:
kdf.index

In [27]:
kdf.columns

In [28]:
kdf.to_numpy()

`describe()` shows a quick statistical summary of your data:

In [30]:
kdf.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.20929,-0.478311,-0.243134,-0.769461
std,0.703026,0.918912,1.033775,0.786897
min,-1.491063,-1.286701,-1.734231,-1.51458
25%,-0.338535,-1.240989,-1.047144,-1.455676
50%,-0.204507,-0.882168,-0.396866,-0.994619
75%,0.160117,-0.003886,0.402744,-0.365183
max,0.571412,1.130104,1.109749,0.588693


Transposing your data

In [32]:
kdf.T

Unnamed: 0,0,1,2,3,4,5
A,0.571412,0.160117,-1.491063,-0.204507,-0.338535,0.046836
B,1.130104,-0.003886,-1.240989,-0.882168,-1.286701,-0.586225
C,-1.047144,0.402744,-0.396866,0.206942,1.109749,-1.734231
D,-0.994619,-1.455676,-0.875398,-0.365183,-1.51458,0.588693


Sorting by index:

In [34]:
kdf.sort_index(ascending=False)

Unnamed: 0,A,B,C,D
5,0.046836,-0.586225,-1.734231,0.588693
4,-0.338535,-1.286701,1.109749,-1.51458
3,-0.204507,-0.882168,0.206942,-0.365183
2,-1.491063,-1.240989,-0.396866,-0.875398
1,0.160117,-0.003886,0.402744,-1.455676
0,0.571412,1.130104,-1.047144,-0.994619


Sorting by value:

In [36]:
kdf.sort_values(by='B')

Unnamed: 0,A,B,C,D
4,-0.338535,-1.286701,1.109749,-1.51458
2,-1.491063,-1.240989,-0.396866,-0.875398
3,-0.204507,-0.882168,0.206942,-0.365183
5,0.046836,-0.586225,-1.734231,0.588693
1,0.160117,-0.003886,0.402744,-1.455676
0,0.571412,1.130104,-1.047144,-0.994619


## Missing Data
Koalas primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations.

In [38]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [39]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [40]:
kdf1 = ks.from_pandas(pdf1)

kdf1

Unnamed: 0,A,B,C,D,E
2013-01-01,0.571412,1.130104,-1.047144,-0.994619,1.0
2013-01-02,0.160117,-0.003886,0.402744,-1.455676,1.0
2013-01-03,-1.491063,-1.240989,-0.396866,-0.875398,
2013-01-04,-0.204507,-0.882168,0.206942,-0.365183,


To drop any rows that have missing data:

In [42]:
kdf1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,0.571412,1.130104,-1.047144,-0.994619,1.0
2013-01-02,0.160117,-0.003886,0.402744,-1.455676,1.0


Filling missing data.

In [44]:
kdf1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,0.571412,1.130104,-1.047144,-0.994619,1.0
2013-01-02,0.160117,-0.003886,0.402744,-1.455676,1.0
2013-01-03,-1.491063,-1.240989,-0.396866,-0.875398,5.0
2013-01-04,-0.204507,-0.882168,0.206942,-0.365183,5.0


## Operations

### Stats
Operations in general exclude missing data.

Performing a descriptive statistic:

In [47]:
print(kdf.mean())
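
To make "exclude missing data" concrete, here is a minimal sketch with a tiny Series that contains one `np.nan`. It uses pandas so that it runs without Spark; the text above says Koalas applies the same default. The `skipna=False` call demonstrates pandas behavior only and is not a claim about the Koalas signature.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()             # NaN is skipped: (1.0 + 3.0) / 2
count_default = values.count()           # counts only the non-missing entries
mean_strict = values.mean(skipna=False)  # NaN propagates when skipping is disabled
```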

### Spark Configurations

Various PySpark configurations can be applied internally in Koalas.
For example, you can enable the Arrow optimization to greatly speed up the internal pandas conversion. See the [PySpark Usage Guide for Pandas with Apache Arrow](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).

In [49]:
prev = spark.conf.get("spark.sql.execution.arrow.enabled")  # Keep its default value.
ks.set_option("compute.default_index_type", "distributed")  # Use the distributed default index to avoid overhead.

In [50]:
spark.conf.set("spark.sql.execution.arrow.enabled", True)
%timeit ks.range(300000).to_pandas()

In [51]:
spark.conf.set("spark.sql.execution.arrow.enabled", False)
%timeit ks.range(300000).to_pandas()

In [52]:
ks.reset_option("compute.default_index_type")
spark.conf.set("spark.sql.execution.arrow.enabled", prev)  # Set its default value back.
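
The save, change, and restore pattern used in the cells above can be wrapped in a small context manager so the previous value is restored even if the timed code raises. This is only a sketch: `temporary_conf` is a hypothetical helper, not part of Koalas or PySpark, and `DictConf` is a made-up stand-in for `spark.conf` so the example runs without a Spark session.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the block, then restore the old value."""
    previous = conf.get(key)
    conf.set(key, value)
    try:
        yield conf
    finally:
        conf.set(key, previous)


class DictConf:
    """Hypothetical stand-in for spark.conf, used only to exercise the helper."""

    def __init__(self, initial):
        self._values = dict(initial)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


conf = DictConf({"spark.sql.execution.arrow.enabled": "false"})
with temporary_conf(conf, "spark.sql.execution.arrow.enabled", "true"):
    inside = conf.get("spark.sql.execution.arrow.enabled")
after = conf.get("spark.sql.execution.arrow.enabled")
```

With a real session, one would presumably pass `spark.conf` in place of `conf`, assuming the key already has a value that `get` can return.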

## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
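
The three steps listed above can be sketched in plain Python, without any DataFrame library. The data and the names `rows`, `groups`, and `sums` are invented for this illustration; Koalas performs the equivalent work through `groupby`.

```python
from collections import defaultdict

rows = [("foo", 1.0), ("bar", 2.0), ("foo", 3.0), ("bar", 4.0), ("foo", 5.0)]

# 1. Split: bucket the values by their key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# 2. Apply: reduce each group independently (here with sum).
# 3. Combine: collect the per-group results into one structure.
sums = {key: sum(values) for key, values in groups.items()}
```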

In [54]:
kdf = ks.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

kdf

Unnamed: 0,A,B,C,D
0,foo,one,0.896043,0.508333
1,bar,one,-0.933421,0.413316
2,foo,two,-1.48986,1.345921
3,bar,three,-0.362581,0.115927
4,foo,two,-0.921594,0.334012
5,bar,two,0.156917,2.019296
6,foo,one,-0.232733,0.308869
7,foo,three,0.251413,0.842143


Grouping and then applying the [sum()](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.sum.html#databricks.koalas.groupby.GroupBy.sum) function to the resulting groups.

In [56]:
kdf.groupby('A').sum()

A,C,D
bar,-1.139086,2.54854
foo,-1.496731,3.339278


Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [58]:
kdf.groupby(['A', 'B']).sum()

A,B,C,D
foo,one,0.66331,0.817202
foo,two,-2.411454,1.679932
bar,three,-0.362581,0.115927
foo,three,0.251413,0.842143
bar,two,0.156917,2.019296
bar,one,-0.933421,0.413316
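
As a small deterministic illustration of the hierarchical index, the sketch below groups by two columns and reads individual sums back with an `(A, B)` tuple. It uses pandas with made-up data so that it runs without Spark; the `groupby(...).sum()` call is the same one shown above for Koalas.

```python
import pandas as pd

pdf_groups = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'B': ['one', 'one', 'two', 'one', 'one'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0],
})

summed = pdf_groups.groupby(['A', 'B']).sum()

# Each row of the result is addressed by an (A, B) tuple.
foo_one = summed.loc[('foo', 'one'), 'C']  # 1.0 + 5.0
bar_one = summed.loc[('bar', 'one'), 'C']  # 2.0 + 4.0
n_levels = summed.index.nlevels            # one index level per grouping column
```

pandas sorts the group keys by default, whereas the Koalas output above lists the groups in no particular order; `sort_index()`, shown earlier in this notebook, is one way to get a stable order when it matters.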


## Plotting
See the [Plotting](https://koalas.readthedocs.io/en/latest/reference/frame.html#plotting) docs.

In [60]:
from matplotlib import pyplot as plt
%matplotlib inline

On a DataFrame, the [plot()](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.frame.DataFrame.plot.html#databricks.koalas.frame.DataFrame.plot) method is a convenient way to plot all of the columns with labels:

In [62]:
pdf = pd.DataFrame(np.random.randn(1000, 4),
                   columns=['A', 'B', 'C', 'D'])

In [63]:
kdf = ks.from_pandas(pdf)

In [64]:
cumsum = kdf.cumsum()

In [65]:
cumsum.plot()

## Getting data in/out
See the [Input/Output](https://koalas.readthedocs.io/en/latest/reference/io.html) docs.

### CSV

CSV is straightforward and easy to use. See [`DataFrame.to_csv`](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html#databricks.koalas.DataFrame.to_csv) to write a CSV file and [`read_csv`](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_csv.html#databricks.koalas.read_csv) to read one.

In [68]:
kdf.to_csv('foo.csv')
ks.read_csv('foo.csv').head(10)

Unnamed: 0,A,B,C,D
0,0.693516,1.119301,-0.04444,-0.500197
1,0.833401,-1.935379,-0.796136,-1.862306
2,-0.680454,1.37952,-1.151356,0.863208
3,-1.44319,-1.580401,0.667405,-0.060056
4,-0.303501,-0.593678,-0.37005,-0.204131
5,0.464784,0.230441,-0.726708,-0.873669
6,-0.16764,1.075956,-1.087585,-0.794177
7,0.657539,-0.563552,0.982046,0.079383
8,1.866576,1.958371,-0.749136,1.085927
9,-0.681058,-0.113137,-0.671258,-0.37791


### Parquet

Parquet is an efficient, compact file format that is fast to read and write. See [`DataFrame.to_parquet`](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html#databricks.koalas.DataFrame.to_parquet) to write a Parquet file and [`read_parquet`](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_parquet.html#databricks.koalas.read_parquet) to read one.

In [70]:
kdf.to_parquet('bar.parquet')
ks.read_parquet('bar.parquet').head(10)

Unnamed: 0,A,B,C,D
0,0.406628,1.226314,0.7017,0.213339
1,-0.504778,-0.742683,-2.400714,1.302415
2,-0.239768,0.75075,0.406187,-1.014281
3,-0.942408,-1.773379,-1.72657,-1.430995
4,-0.421397,-1.305309,-1.845884,0.189157
5,1.226664,-1.38374,-2.150283,0.249228
6,-1.225665,-1.858122,0.356379,0.087774
7,-1.071804,0.689514,0.222678,1.379892
8,-0.868509,0.02613,-0.37653,-0.789555
9,-0.986906,0.544499,-1.192561,-0.77651


### Spark IO

In addition, Koalas fully supports Spark's various data sources, such as ORC and external data sources. See [`DataFrame.to_spark_io`](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html#databricks.koalas.DataFrame.to_spark_io) to write to a specified data source and [`read_spark_io`](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_spark_io.html#databricks.koalas.read_spark_io) to read from one.

In [72]:
kdf.to_spark_io('zoo.orc', format="orc")
ks.read_spark_io('zoo.orc', format="orc").head(10)

Unnamed: 0,A,B,C,D
0,0.693516,1.119301,-0.04444,-0.500197
1,0.833401,-1.935379,-0.796136,-1.862306
2,-0.680454,1.37952,-1.151356,0.863208
3,-1.44319,-1.580401,0.667405,-0.060056
4,-0.303501,-0.593678,-0.37005,-0.204131
5,0.464784,0.230441,-0.726708,-0.873669
6,-0.16764,1.075956,-1.087585,-0.794177
7,0.657539,-0.563552,0.982046,0.079383
8,1.866576,1.958371,-0.749136,1.085927
9,-0.681058,-0.113137,-0.671258,-0.37791
