# 10 minutes to Koalas

This is a short introduction to [Koalas](https://github.com/databricks/koalas), geared mainly toward new users. This notebook shows some key differences between pandas and Koalas.

Customarily, we import Koalas as follows:

In [2]:
import pandas as pd
import numpy as np
import databricks.koalas as ks

import warnings
warnings.filterwarnings("ignore")  # Ignore warnings coming from Arrow optimizations

## Object Creation

Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [5]:
s = ks.Series([1, 3, 5, np.nan, 6, 8])

s

In [6]:
type(s)

Creating a Koalas DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [8]:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

kdf

Unnamed: 0,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


Creating a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [10]:
dates = pd.date_range('20130101', periods=6)

dates

In [11]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

pdf

Unnamed: 0,A,B,C,D
2013-01-01,-0.226162,1.674793,-0.854056,1.022752
2013-01-02,0.206535,0.287895,0.541482,-0.859061
2013-01-03,-0.195844,-0.28904,-0.434537,-0.547957
2013-01-04,1.478943,-0.82007,0.641087,0.15408
2013-01-05,1.971336,0.735354,0.717406,0.549194
2013-01-06,2.584431,-0.119188,2.043724,-0.825224


Now, this pandas DataFrame can be converted to a Koalas DataFrame:

In [13]:
kdf = ks.from_pandas(pdf)

type(kdf)

It looks and behaves the same as a pandas DataFrame, though:

In [15]:
kdf

Unnamed: 0,A,B,C,D
2013-01-01,-0.226162,1.674793,-0.854056,1.022752
2013-01-02,0.206535,0.287895,0.541482,-0.859061
2013-01-03,-0.195844,-0.28904,-0.434537,-0.547957
2013-01-04,1.478943,-0.82007,0.641087,0.15408
2013-01-05,1.971336,0.735354,0.717406,0.549194
2013-01-06,2.584431,-0.119188,2.043724,-0.825224


It is also possible to create a Koalas DataFrame from a Spark DataFrame.

Creating a Spark DataFrame from a pandas DataFrame:

In [17]:
sdf = spark.createDataFrame(pdf)

display(sdf)

A,B,C,D
-0.2261615099663738,1.674793217690699,-0.8540562429250863,1.0227517007159157
0.2065352143271834,0.2878952378271717,0.541481901992384,-0.8590614086302373
-0.19584432390161,-0.2890397876684686,-0.43453683867519,-0.5479565029008695
1.4789433136283276,-0.8200704440972859,0.6410874379685658,0.1540804853353233
1.971335976108796,0.735354276539254,0.7174059238777374,0.5491938744166764
2.584430810585542,-0.1191876867273585,2.043724323941565,-0.8252243589777514


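Creating a Koalas DataFrame from a Spark DataFrame.
`to_koalas()` is automatically attached to Spark DataFrames and is available as an API once Koalas is imported. Note that the datetime index of `pdf` is not carried through Spark, so the resulting Koalas DataFrame gets a default integer index.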

In [19]:
kdf = sdf.to_koalas()

kdf

Unnamed: 0,A,B,C,D
0,-0.226162,1.674793,-0.854056,1.022752
1,0.206535,0.287895,0.541482,-0.859061
2,-0.195844,-0.28904,-0.434537,-0.547957
3,1.478943,-0.82007,0.641087,0.15408
4,1.971336,0.735354,0.717406,0.549194
5,2.584431,-0.119188,2.043724,-0.825224


Koalas DataFrame columns have specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Types that are common to both Spark and pandas are currently supported.

In [21]:
print(kdf.dtypes)

## Viewing Data

See the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html).

See the top rows of the frame. The results may not be the same as in pandas, though: unlike pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark will just take the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` or `.iloc` instead.

In [24]:
kdf.head()

Unnamed: 0,A,B,C,D
0,-0.226162,1.674793,-0.854056,1.022752
1,0.206535,0.287895,0.541482,-0.859061
2,-0.195844,-0.28904,-0.434537,-0.547957
3,1.478943,-0.82007,0.641087,0.15408
4,1.971336,0.735354,0.717406,0.549194
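
To see why `head()` is not guaranteed to match pandas, here is a toy model in plain Python. It is not Koalas or Spark code: the two layouts and the helpers `toy_head` and `toy_sorted_head` are hypothetical stand-ins for how rows might be spread across partitions, and real Spark partitioning is more involved.

```python
# Toy model: the same six index labels, spread over partitions in two different ways.
layout_a = [[0, 1, 2], [3, 4, 5]]
layout_b = [[3, 4, 5], [0, 1, 2]]


def toy_head(partitions, n):
    """Take the first n rows encountered, walking the partitions in storage order."""
    rows = [row for part in partitions for row in part]
    return rows[:n]


def toy_sorted_head(partitions, n):
    """Sort by index label first, so the result no longer depends on the layout."""
    rows = sorted(row for part in partitions for row in part)
    return rows[:n]


head_a = toy_head(layout_a, 2)           # depends on which partition comes first
head_b = toy_head(layout_b, 2)           # same data, different rows returned
sorted_a = toy_sorted_head(layout_a, 2)  # stable regardless of layout
sorted_b = toy_sorted_head(layout_b, 2)

print(head_a, head_b, sorted_a, sorted_b)
```

In Koalas terms, sorting first (for example with `sort_index()`, used later in this section) or selecting by label are the ways to get a reproducible set of rows.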


Display the index, columns, and the underlying NumPy data.

You can also retrieve the index; an index column can be assigned to a DataFrame (see later).

In [26]:
kdf.index

In [27]:
kdf.columns

In [28]:
kdf.to_numpy()

`describe()` shows a quick statistical summary of your data:

In [30]:
kdf.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.969873,0.244957,0.442518,-0.084369
std,1.203408,0.876219,1.013803,0.780757
min,-0.226162,-0.82007,-0.854056,-0.859061
25%,-0.195844,-0.28904,-0.434537,-0.825224
50%,0.206535,-0.119188,0.541482,-0.547957
75%,1.971336,0.735354,0.717406,0.549194
max,2.584431,1.674793,2.043724,1.022752
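
As a cross-check, the summary for column `A` can be recomputed with Python's standard `statistics` module. This is a plain-Python sketch, not what Koalas runs internally; the values are copied from the DataFrame output earlier in this notebook, so results agree only to the printed precision.

```python
import statistics

# Column A, copied from the DataFrame output earlier in this notebook.
a = [-0.226162, 0.206535, -0.195844, 1.478943, 1.971336, 2.584431]

count = len(a)
mean = statistics.mean(a)
std = statistics.stdev(a)            # sample standard deviation (divides by n - 1)
exact_median = statistics.median(a)  # averages the two middle values

print(count, round(mean, 6), round(std, 6), round(exact_median, 6))
```

The count, mean, and standard deviation match the table. The interpolated median, however, is about 0.842739, while the `50%` row above shows 0.206535, which is one of the actual column values. This suggests the percentiles in the Koalas output are not interpolated the way pandas interpolates them, so it is safer to treat the quantile rows as approximate, especially on small data.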


Transposing your data

In [32]:
kdf.T

Unnamed: 0,0,1,2,3,4,5
A,-0.226162,0.206535,-0.195844,1.478943,1.971336,2.584431
B,1.674793,0.287895,-0.28904,-0.82007,0.735354,-0.119188
C,-0.854056,0.541482,-0.434537,0.641087,0.717406,2.043724
D,1.022752,-0.859061,-0.547957,0.15408,0.549194,-0.825224


Sorting by its index

In [34]:
kdf.sort_index(ascending=False)

Unnamed: 0,A,B,C,D
5,2.584431,-0.119188,2.043724,-0.825224
4,1.971336,0.735354,0.717406,0.549194
3,1.478943,-0.82007,0.641087,0.15408
2,-0.195844,-0.28904,-0.434537,-0.547957
1,0.206535,0.287895,0.541482,-0.859061
0,-0.226162,1.674793,-0.854056,1.022752


Sorting by value

In [36]:
kdf.sort_values(by='B')

Unnamed: 0,A,B,C,D
3,1.478943,-0.82007,0.641087,0.15408
2,-0.195844,-0.28904,-0.434537,-0.547957
5,2.584431,-0.119188,2.043724,-0.825224
1,0.206535,0.287895,0.541482,-0.859061
4,1.971336,0.735354,0.717406,0.549194
0,-0.226162,1.674793,-0.854056,1.022752


## Missing Data
Koalas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations.

In [38]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [39]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [40]:
kdf1 = ks.from_pandas(pdf1)

kdf1

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.226162,1.674793,-0.854056,1.022752,1.0
2013-01-02,0.206535,0.287895,0.541482,-0.859061,1.0
2013-01-03,-0.195844,-0.28904,-0.434537,-0.547957,
2013-01-04,1.478943,-0.82007,0.641087,0.15408,


Dropping any rows that have missing data:

In [42]:
kdf1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.226162,1.674793,-0.854056,1.022752,1.0
2013-01-02,0.206535,0.287895,0.541482,-0.859061,1.0


Filling missing data.

In [44]:
kdf1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.226162,1.674793,-0.854056,1.022752,1.0
2013-01-02,0.206535,0.287895,0.541482,-0.859061,1.0
2013-01-03,-0.195844,-0.28904,-0.434537,-0.547957,5.0
2013-01-04,1.478943,-0.82007,0.641087,0.15408,5.0
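
The semantics of `dropna(how='any')` and `fillna(value=5)` can be spelled out in plain Python. This is an illustrative sketch over a list of dicts, not Koalas code; the helper names `is_missing`, `drop_any`, and `fill` are made up for the example, and only columns `A` and `E` of `kdf1` are included.

```python
import math

nan = float("nan")

# Columns A and E of kdf1, copied from the output above.
rows = [
    {"A": -0.226162, "E": 1.0},
    {"A": 0.206535, "E": 1.0},
    {"A": -0.195844, "E": nan},
    {"A": 1.478943, "E": nan},
]


def is_missing(value):
    """True for NaN floats, the missing-value marker used in this notebook."""
    return isinstance(value, float) and math.isnan(value)


def drop_any(records):
    """Keep only the rows in which no column is missing."""
    return [r for r in records if not any(is_missing(v) for v in r.values())]


def fill(records, value):
    """Replace every missing entry with `value`, leaving the input untouched."""
    return [{k: (value if is_missing(v) else v) for k, v in r.items()} for r in records]


dropped = drop_any(rows)  # only the first two rows have no missing values
filled = fill(rows, 5)    # E is replaced in the last two rows

print(len(dropped), [r["E"] for r in filled])
```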


## Operations

### Stats
Operations in general exclude missing data.

Performing a descriptive statistic:

In [47]:
print(kdf.mean())
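
To illustrate what excluding missing data means for a statistic, here is a plain-Python sketch using the values of the Series `s` created at the start of the notebook. It is not how Koalas computes the mean; `mean_skipna` is a hypothetical helper.

```python
import math

values = [1, 3, 5, float("nan"), 6, 8]  # same values as the Series `s`


def mean_skipna(xs):
    """Average only the entries that are not NaN."""
    present = [x for x in xs if not math.isnan(x)]
    return sum(present) / len(present)


naive_mean = sum(values) / len(values)  # NaN propagates through the sum
skipna_mean = mean_skipna(values)       # (1 + 3 + 5 + 6 + 8) / 5

print(naive_mean, skipna_mean)
```

The naive average is NaN because a single missing value poisons the sum, whereas the NaN-skipping version averages the five present values.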

### Spark Configurations

Various PySpark configurations can be applied internally in Koalas.
For example, you can enable Arrow optimization to greatly speed up internal pandas conversion. See <a href="https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html">PySpark Usage Guide for Pandas with Apache Arrow</a>.

In [49]:
prev = spark.conf.get("spark.sql.execution.arrow.enabled")  # Remember the current value so it can be restored.
ks.set_option("compute.default_index_type", "distributed")  # Use the distributed default index to avoid indexing overhead.

In [50]:
spark.conf.set("spark.sql.execution.arrow.enabled", True)
%timeit ks.range(300000).to_pandas()

In [51]:
spark.conf.set("spark.sql.execution.arrow.enabled", False)
%timeit ks.range(300000).to_pandas()

In [52]:
ks.reset_option("compute.default_index_type")
spark.conf.set("spark.sql.execution.arrow.enabled", prev)  # Restore the previous value.
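
The save, set, and restore pattern above can be wrapped in a context manager so the previous value is restored even if the timed code raises. This is a sketch under assumptions: `temporary_conf` is a hypothetical helper, not part of Koalas or PySpark, and `FakeConf` is a stand-in so the example runs without a Spark session. With a real session you would pass `spark.conf`, relying only on its `get` and `set` methods as used above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` for the duration of the block, then restore it."""
    previous = conf.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        conf.set(key, previous)  # runs even if the block raises


class FakeConf:
    """Minimal stand-in exposing get/set, only for demonstration."""

    def __init__(self, initial):
        self._values = dict(initial)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


key = "spark.sql.execution.arrow.enabled"
conf = FakeConf({key: "false"})

with temporary_conf(conf, key, "true"):
    inside = conf.get(key)  # value while the block is active

after = conf.get(key)       # value once the block has exited

print(inside, after)
```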

## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

In [54]:
kdf = ks.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

kdf

Unnamed: 0,A,B,C,D
0,foo,one,-0.998764,-0.316316
1,bar,one,0.489795,-1.559307
2,foo,two,0.449214,-0.825489
3,bar,three,-0.129954,0.876669
4,foo,two,0.627043,-0.838689
5,bar,two,-0.933623,0.228107
6,foo,one,1.375377,-0.175813
7,foo,three,0.420416,0.08999


Grouping and then applying the [sum()](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.sum.html#databricks.koalas.groupby.GroupBy.sum) function to the resulting groups.

In [56]:
kdf.groupby('A').sum()

A,C,D
bar,-0.573782,-0.45453
foo,1.873285,-2.066317
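
The three steps listed at the start of this section can be written out by hand. The following is a plain-Python sketch of split-apply-combine for `groupby('A').sum()` restricted to column `C`, using the values printed above; it is illustrative only and says nothing about how Koalas or Spark execute the aggregation.

```python
from collections import defaultdict

# Columns A and C of the frame above.
a = ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"]
c = [-0.998764, 0.489795, 0.449214, -0.129954, 0.627043, -0.933623, 1.375377, 0.420416]

# 1. Split: collect the C values belonging to each key in A.
groups = defaultdict(list)
for key, value in zip(a, c):
    groups[key].append(value)

# 2. Apply: run the aggregation on each group independently.
# 3. Combine: gather the per-group results into one structure, ordered by key.
sums = {key: sum(values) for key, values in sorted(groups.items())}

for key, total in sums.items():
    print(key, round(total, 6))
```

The totals agree with the `C` column of the result above, up to rounding of the printed inputs.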


Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [58]:
kdf.groupby(['A', 'B']).sum()

A,B,C,D
foo,one,0.376612,-0.492129
foo,two,1.076257,-1.664178
bar,three,-0.129954,0.876669
foo,three,0.420416,0.08999
bar,two,-0.933623,0.228107
bar,one,0.489795,-1.559307


## Plotting
See the <a href="https://koalas.readthedocs.io/en/latest/reference/frame.html#plotting">Plotting</a> docs.

In [60]:
from matplotlib import pyplot as plt

On a DataFrame, the <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.frame.DataFrame.plot.html#databricks.koalas.frame.DataFrame.plot">plot()</a> method is a convenience to plot all of the columns with labels:

In [62]:
pdf = pd.DataFrame(np.random.randn(1000, 4),
                   columns=['A', 'B', 'C', 'D'])

In [63]:
kdf = ks.from_pandas(pdf)

In [64]:
cumsum = kdf.cumsum()

In [65]:
display(cumsum.plot.line())

## Getting data in/out
See the <a href="https://koalas.readthedocs.io/en/latest/reference/io.html">Input/Output</a> docs.

### CSV

CSV is straightforward and easy to use. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html#databricks.koalas.DataFrame.to_csv">here</a> to write a CSV file and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_csv.html#databricks.koalas.read_csv">here</a> to read a CSV file.

In [68]:
kdf.to_csv('foo.csv')
ks.read_csv('foo.csv').head(10)

Unnamed: 0,A,B,C,D
0,0.468022,-1.872823,0.595022,0.714824
1,-0.782861,0.025656,-0.402027,0.80435
2,1.475614,-0.245947,0.763012,-0.597314
3,0.501749,-1.88105,0.926464,1.10436
4,0.301102,-1.571696,-0.089858,-0.050796
5,-0.3689,1.404831,0.576704,-0.824325
6,0.011766,-0.858733,-0.06317,1.133711
7,0.149376,-1.004977,0.823343,0.085745
8,-0.539245,-0.366833,0.07386,0.097723
9,-0.037103,1.992246,-2.076988,0.009194


### Parquet

Parquet is an efficient and compact file format that is fast to read and write. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html#databricks.koalas.DataFrame.to_parquet">here</a> to write a Parquet file and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_parquet.html#databricks.koalas.read_parquet">here</a> to read a Parquet file.

In [70]:
kdf.to_parquet('bar.parquet')
ks.read_parquet('bar.parquet').head(10)

Unnamed: 0,A,B,C,D
0,1.24089,-1.253864,0.398638,-0.945324
1,0.205416,-0.352338,-0.001231,0.337425
2,-1.246664,0.636766,1.121499,-0.004852
3,-0.642459,-0.053354,-1.599536,-0.94218
4,-0.120679,0.318601,0.638337,-1.192724
5,-1.695289,-1.626982,-0.235787,1.598541
6,-0.247538,-0.676712,0.106915,-1.044883
7,-1.081124,-0.455626,0.979824,1.443102
8,-1.012571,-1.015224,0.135464,1.006207
9,0.67208,-0.1306,0.131746,0.020615


### Spark IO

In addition, Koalas fully supports Spark's various data sources, such as ORC, as well as external data sources. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html#databricks.koalas.DataFrame.to_spark_io">here</a> to write to the specified data source and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_spark_io.html#databricks.koalas.read_spark_io">here</a> to read from the data source.

In [72]:
kdf.to_spark_io('zoo.orc', format="orc")
ks.read_spark_io('zoo.orc', format="orc").head(10)

Unnamed: 0,A,B,C,D
0,1.24089,-1.253864,0.398638,-0.945324
1,0.205416,-0.352338,-0.001231,0.337425
2,-1.246664,0.636766,1.121499,-0.004852
3,-0.642459,-0.053354,-1.599536,-0.94218
4,-0.120679,0.318601,0.638337,-1.192724
5,-1.695289,-1.626982,-0.235787,1.598541
6,-0.247538,-0.676712,0.106915,-1.044883
7,-1.081124,-0.455626,0.979824,1.443102
8,-1.012571,-1.015224,0.135464,1.006207
9,0.67208,-0.1306,0.131746,0.020615
