# 10 minutes to Koalas

This is a short introduction to Koalas, geared mainly toward new users. This notebook shows some key differences between Pandas and Koalas.

Customarily, we import Koalas as follows:

In [1]:
import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession


From now on, Spark will behave in a way that is closer to Pandas:
 - Spark DataFrames will have a large number of extra functions that mimic the Pandas functions
 - Spark columns will mimic the behavior of Pandas series
 - the `pyspark` package and the `spark` context object will have extra functions that mimic functions found in the `pandas` package.

## Object Creation



Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [2]:
s = ks.Series([1,3,5,np.nan,6,8])

In [3]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: 0, dtype: float64

Creating a Koalas DataFrame by passing a dict of objects that can be converted to series-like.

In [4]:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

In [5]:
kdf

,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


Creating a Pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [8]:
dates = pd.date_range('20130101', periods=6)

In [9]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
pdf = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [11]:
pdf

,A,B,C,D
2013-01-01,-1.393321,0.635416,0.441371,-0.127387
2013-01-02,-1.087722,-0.646794,-2.866093,0.745359
2013-01-03,1.531936,0.05293,-0.909015,-1.051811
2013-01-04,-0.835399,0.961289,-0.47225,-1.211273
2013-01-05,-0.40831,-0.30997,-1.403968,1.159429
2013-01-06,1.834456,0.499968,0.529479,0.03059


Now, this Pandas DataFrame can be converted to a Koalas DataFrame:

In [12]:
kdf = ks.DataFrame(pdf)

In [13]:
type(kdf)

databricks.koalas.frame.DataFrame

It looks and behaves the same as a Pandas DataFrame, though:

In [14]:
kdf

,A,B,C,D
2013-01-01,-1.393321,0.635416,0.441371,-0.127387
2013-01-02,-1.087722,-0.646794,-2.866093,0.745359
2013-01-03,1.531936,0.05293,-0.909015,-1.051811
2013-01-04,-0.835399,0.961289,-0.47225,-1.211273
2013-01-05,-0.40831,-0.30997,-1.403968,1.159429
2013-01-06,1.834456,0.499968,0.529479,0.03059


It is also possible to create a Koalas DataFrame from a Spark DataFrame.

Creating a Spark DataFrame from a Pandas DataFrame:

In [15]:
spark = SparkSession.builder.getOrCreate()

In [16]:
sdf = spark.createDataFrame(pdf)

In [17]:
sdf.show()

+-------------------+--------------------+--------------------+--------------------+
|                  A|                   B|                   C|                   D|
+-------------------+--------------------+--------------------+--------------------+
| -1.393321321393808|  0.6354155415524972|  0.4413706725932695|-0.12738682403468904|
|-1.0877215500653825| -0.6467944355955646| -2.8660925331885347|  0.7453592310682132|
| 1.5319361184249165|0.052929987493802844| -0.9090146316633693| -1.0518105317806044|
| -0.835399423823058|  0.9612886532087072|-0.47225025469198956|  -1.211272799833637|
|-0.4083095042240772|   -0.30996968034992| -1.4039677152153442|  1.1594286473606839|
|  1.834456488218577|  0.4999681110553894|  0.5294791051664477|0.030589700683494946|
+-------------------+--------------------+--------------------+--------------------+



Creating a Koalas DataFrame from a Spark DataFrame:

In [23]:
kdf = ks.DataFrame(sdf)

In [24]:
kdf

,A,B,C,D
0,-1.393321,0.635416,0.441371,-0.127387
1,-1.087722,-0.646794,-2.866093,0.745359
2,1.531936,0.05293,-0.909015,-1.051811
3,-0.835399,0.961289,-0.47225,-1.211273
4,-0.40831,-0.30997,-1.403968,1.159429
5,1.834456,0.499968,0.529479,0.03059


Having specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Types that are common to both Spark and Pandas are currently supported.

In [25]:
kdf.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

## Viewing Data

See the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html).

See the top rows of the frame. The results may differ from Pandas, though: unlike Pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark just takes the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` or `.iloc` instead.

In [26]:
kdf.head()

,A,B,C,D
0,-1.393321,0.635416,0.441371,-0.127387
1,-1.087722,-0.646794,-2.866093,0.745359
2,1.531936,0.05293,-0.909015,-1.051811
3,-0.835399,0.961289,-0.47225,-1.211273
4,-0.40831,-0.30997,-1.403968,1.159429
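
The caveat about `head()` can be pictured with a toy model in plain Python. This is only a conceptual sketch, not how Spark is implemented: the `partitions` lists and the `head` helper are hypothetical stand-ins for a distributed dataset, and scanning partitions in list order is an assumed simplification.

```python
# Toy model: a distributed dataset as a list of partitions (hypothetical names).
def head(partitions, n):
    """Return the first n rows encountered, scanning partitions in order."""
    rows = []
    for part in partitions:
        for row in part:
            if len(rows) == n:
                return rows
            rows.append(row)
    return rows

# The same six rows, laid out across two partitions in two different ways.
layout_1 = [[0, 1, 2], [3, 4, 5]]
layout_2 = [[3, 4, 5], [0, 1, 2]]

first = head(layout_1, 2)
second = head(layout_2, 2)
print(first)   # [0, 1]
print(second)  # [3, 4]

# Selecting rows by an explicit ordering gives the same answer for any layout.
stable_1 = sorted(row for part in layout_1 for row in part)[:2]
stable_2 = sorted(row for part in layout_2 for row in part)[:2]
print(stable_1 == stable_2)  # True
```

The two layouts hold identical data, yet the "first two rows" differ, which is why selecting rows by label or by an explicit sort is the safer habit.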


Display the index, the columns, and the underlying numpy data.

You can also retrieve the index; an index column can be assigned to a DataFrame (see later).

In [30]:
kdf.index

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [31]:
kdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [32]:
kdf.to_numpy()

array([[-1.39332132,  0.63541554,  0.44137067, -0.12738682],
       [-1.08772155, -0.64679444, -2.86609253,  0.74535923],
       [ 1.53193612,  0.05292999, -0.90901463, -1.05181053],
       [-0.83539942,  0.96128865, -0.47225025, -1.2112728 ],
       [-0.4083095 , -0.30996968, -1.40396772,  1.15942865],
       [ 1.83445649,  0.49996811,  0.52947911,  0.0305897 ]])

`describe()` shows a quick statistical summary of your data:

In [33]:
kdf.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.059727,0.198806,-0.780079,-0.075849
std,1.391384,0.60957,1.269563,0.94403
min,-1.393321,-0.646794,-2.866093,-1.211273
25%,-1.087722,-0.30997,-1.403968,-1.051811
50%,-0.835399,0.05293,-0.909015,-0.127387
75%,1.531936,0.635416,0.441371,0.745359
max,1.834456,0.961289,0.529479,1.159429
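
As a rough cross-check of the summary above, the sketch below recomputes a few of the rows for column `A` with Python's standard `statistics` module, using the six rounded values shown earlier. It assumes `std` is the sample standard deviation (n - 1 denominator), which is consistent with the displayed numbers. The quartile rows are left out because the text does not say how they are computed. This is plain Python for illustration, not Koalas code.

```python
import statistics

# Column A as displayed above (values rounded to six decimals).
a = [-1.393321, -1.087722, 1.531936, -0.835399, -0.40831, 1.834456]

summary = {
    "count": len(a),
    "mean": statistics.mean(a),
    "std": statistics.stdev(a),  # sample standard deviation (n - 1 denominator)
    "min": min(a),
    "max": max(a),
}

# Print one row per statistic, similar in spirit to the describe() table.
for name, value in summary.items():
    print(f"{name:>5}  {value:.6f}")
```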


Transposing your data

In [35]:
kdf.T

,0,1,2,3,4,5
A,-1.393321,-1.087722,1.531936,-0.835399,-0.40831,1.834456
B,0.635416,-0.646794,0.05293,0.961289,-0.30997,0.499968
C,0.441371,-2.866093,-0.909015,-0.47225,-1.403968,0.529479
D,-0.127387,0.745359,-1.051811,-1.211273,1.159429,0.03059


Sorting by index

In [38]:
kdf.sort_index(ascending=False)

,A,B,C,D
5,1.834456,0.499968,0.529479,0.03059
4,-0.40831,-0.30997,-1.403968,1.159429
3,-0.835399,0.961289,-0.47225,-1.211273
2,1.531936,0.05293,-0.909015,-1.051811
1,-1.087722,-0.646794,-2.866093,0.745359
0,-1.393321,0.635416,0.441371,-0.127387


Sorting by value

In [39]:
kdf.sort_values(by='B')

,A,B,C,D
1,-1.087722,-0.646794,-2.866093,0.745359
4,-0.40831,-0.30997,-1.403968,1.159429
2,1.531936,0.05293,-0.909015,-1.051811
5,1.834456,0.499968,0.529479,0.03059
0,-1.393321,0.635416,0.441371,-0.127387
3,-0.835399,0.961289,-0.47225,-1.211273


## Missing Data
Koalas primarily uses the value `np.nan` to represent missing data. By default, it is excluded from computations.


In [40]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [41]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [42]:
kdf1 = ks.from_pandas(pdf1)

In [43]:
kdf1

,A,B,C,D,E
2013-01-01,-1.393321,0.635416,0.441371,-0.127387,1.0
2013-01-02,-1.087722,-0.646794,-2.866093,0.745359,1.0
2013-01-03,1.531936,0.05293,-0.909015,-1.051811,
2013-01-04,-0.835399,0.961289,-0.47225,-1.211273,


Dropping any rows that have missing data:

In [44]:
kdf1.dropna(how='any')

,A,B,C,D,E
2013-01-01,-1.393321,0.635416,0.441371,-0.127387,1.0
2013-01-02,-1.087722,-0.646794,-2.866093,0.745359,1.0


Filling missing data:

In [45]:
kdf1.fillna(value=5)

,A,B,C,D,E
2013-01-01,-1.393321,0.635416,0.441371,-0.127387,1.0
2013-01-02,-1.087722,-0.646794,-2.866093,0.745359,1.0
2013-01-03,1.531936,0.05293,-0.909015,-1.051811,5.0
2013-01-04,-0.835399,0.961289,-0.47225,-1.211273,5.0
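
The semantics of dropping, filling, and skipping missing values can be sketched on a single column in plain Python. The list below copies column `E` from the frame above, with `math.nan` standing in for `np.nan`. The variable names are illustrative only, and this is not how Koalas implements these operations.

```python
import math

# Column E from the frame above; math.nan plays the role of np.nan.
e = [1.0, 1.0, math.nan, math.nan]

# NaN never compares equal to itself, so test with math.isnan rather than ==.
is_missing = [math.isnan(v) for v in e]

# dropna-style: keep only the values that are present.
kept = [v for v in e if not math.isnan(v)]

# fillna-style: substitute a constant for each missing value.
filled = [5.0 if math.isnan(v) else v for v in e]

# A statistic that excludes missing data, as described in the text.
mean_skipping_nan = sum(kept) / len(kept)

print(is_missing)         # [False, False, True, True]
print(kept)               # [1.0, 1.0]
print(filled)             # [1.0, 1.0, 5.0, 5.0]
print(mean_skipping_nan)  # 1.0
```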


## Operations

### Stats
Operations in general exclude missing data.

Performing a descriptive statistic:

In [46]:
kdf.mean()

A   -0.059727
B    0.198806
C   -0.780079
D   -0.075849
dtype: float64

## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
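
The three steps above can be sketched in plain Python with a dictionary of lists. The `keys` and `values` data are made up for illustration, and the sketch only shows the idea of split-apply-combine, not how Koalas or Spark executes a group-by.

```python
from collections import defaultdict

# A grouping column and a numeric column (illustrative data).
keys = ["foo", "bar", "foo", "bar", "foo"]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

# 1. Split: bucket the values by their key.
groups = defaultdict(list)
for key, value in zip(keys, values):
    groups[key].append(value)

# 2. Apply: run the same function (here, sum) on each group independently.
# 3. Combine: collect the per-group results into one structure, sorted by key.
totals = {key: sum(group) for key, group in sorted(groups.items())}

print(totals)  # {'bar': 6.0, 'foo': 9.0}
```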

In [47]:
kdf = ks.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [48]:
kdf

,A,B,C,D
0,foo,one,0.484484,1.274721
1,bar,one,0.858435,-1.687321
2,foo,two,0.574069,0.237922
3,bar,three,1.766442,0.13968
4,foo,two,-0.76988,0.394742
5,bar,two,-0.518663,-0.134879
6,foo,one,0.430169,0.435465
7,foo,three,1.58083,-1.45383


Grouping and then applying the [sum()](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.sum.html#databricks.koalas.groupby.GroupBy.sum) function to the resulting groups.

In [49]:
kdf.groupby('A').sum()

A,C,D
bar,2.106214,-1.682521
foo,2.299673,0.88902


Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [56]:
kdf.groupby(['A', 'B']).sum()

A,B,C,D
bar,one,2.199766,0.86435
bar,three,0.209367,-2.30027
bar,two,-0.172825,0.753094
foo,one,-1.562944,0.149773
foo,three,-1.168062,0.530754
foo,two,-1.123872,-1.962764
