# 10 Minutes to Koalas

This is a short introduction to Koalas, geared mainly toward new users. This notebook shows some key differences between Pandas and Koalas.

The original Pandas tutorial is located [here](http://pandas.pydata.org/pandas-docs/stable/10min.html).


Customarily, we import Koalas as follows:

In [1]:
import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession


From now on, Spark will behave in a way that is closer to Pandas:
 - Spark DataFrames will have a large number of extra functions that mimic the Pandas functions
 - Spark columns will mimic the behavior of Pandas Series
 - The `pyspark` package and the `spark` context object will have extra functions that mimic functions found in the `pandas` package

## Object Creation



Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [2]:
s = ks.Series([1, 3, 5, np.nan, 6, 8])

In [3]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: 0, dtype: float64

Creating a Koalas DataFrame by passing a dict of objects that can be converted to series-like objects:

In [4]:
kdf = ks.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': [100, 200, 300, 400, 500, 600],
                    'c': ["one", "two", "three", "four", "five", "six"]},
                   index=[10, 20, 30, 40, 50, 60])

In [5]:
kdf

Unnamed: 0,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


Creating a Pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [6]:
dates = pd.date_range('20130101', periods=6)

In [7]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
pdf = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [9]:
pdf

Unnamed: 0,A,B,C,D
2013-01-01,0.490529,-0.150846,-0.175013,-1.29911
2013-01-02,1.746899,-1.368954,0.116949,1.933727
2013-01-03,-0.300884,-0.075256,0.965327,-1.307201
2013-01-04,1.492372,-0.949363,1.800348,-0.858068
2013-01-05,-0.112239,-0.687341,1.845931,0.120126
2013-01-06,0.80147,1.771401,-1.591521,0.347055


Now, this Pandas DataFrame can be converted to a Koalas DataFrame:

In [10]:
kdf = ks.from_pandas(pdf)

In [11]:
type(kdf)

databricks.koalas.frame.DataFrame

It looks and behaves like a Pandas DataFrame, though:

In [12]:
kdf

Unnamed: 0,A,B,C,D
2013-01-01,0.490529,-0.150846,-0.175013,-1.29911
2013-01-02,1.746899,-1.368954,0.116949,1.933727
2013-01-03,-0.300884,-0.075256,0.965327,-1.307201
2013-01-04,1.492372,-0.949363,1.800348,-0.858068
2013-01-05,-0.112239,-0.687341,1.845931,0.120126
2013-01-06,0.80147,1.771401,-1.591521,0.347055


It is also possible to create a Koalas DataFrame from a Spark DataFrame.

Creating a Spark DataFrame from a Pandas DataFrame:

In [13]:
spark = SparkSession.builder.getOrCreate()

In [14]:
sdf = spark.createDataFrame(pdf)

In [15]:
sdf.show()

+--------------------+--------------------+--------------------+-------------------+
|                   A|                   B|                   C|                  D|
+--------------------+--------------------+--------------------+-------------------+
| 0.49052930207545453|-0.15084630508881658|-0.17501284495155198|-1.2991099950569902|
|  1.7468993664252628| -1.3689544612922313| 0.11694902754450245| 1.9337265767900818|
| -0.3008835579682537|-0.07525599367145301|  0.9653271548144997|-1.3072005352722866|
|  1.4923716789874673| -0.9493631884993887|  1.8003484183785783|-0.8580683828797455|
|-0.11223901147010992| -0.6873410413664283|   1.845930976520679|0.12012562785901669|
|  0.8014697927211806|  1.7714008966600994| -1.5915206075666934|0.34705528025742016|
+--------------------+--------------------+--------------------+-------------------+



Creating a Koalas DataFrame from a Spark DataFrame:

In [16]:
kdf = ks.DataFrame(sdf)

In [17]:
kdf

Unnamed: 0,A,B,C,D
0,0.490529,-0.150846,-0.175013,-1.29911
1,1.746899,-1.368954,0.116949,1.933727
2,-0.300884,-0.075256,0.965327,-1.307201
3,1.492372,-0.949363,1.800348,-0.858068
4,-0.112239,-0.687341,1.845931,0.120126
5,0.80147,1.771401,-1.591521,0.347055


Columns have specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Types that are common to both Spark and Pandas are currently supported.

In [19]:
# Currently a bug on .dtypes, this will be fixed via
# https://github.com/databricks/koalas/commit/b1e9033691cce74236382c23010c1ef4b1572f0e
# targeted 0.2.0
# kdf2.dtypes

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. To try it, type `kdf.` and press the Tab key:

In [22]:
kdf.<TAB>

SyntaxError: invalid syntax (<ipython-input-22-61dc798c35d9>, line 1)

In an interactive session, the columns A, B, C, and D are tab completed along with the public attributes. The `SyntaxError` above appears only because `kdf.<TAB>` was executed literally as a cell: `<TAB>` stands for the Tab key and is not Python syntax.

## Viewing Data

See the [Basics section](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics) 

See the top rows of the frame. The results may not be the same as in Pandas, though: unlike Pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark will just take the requested number of rows from a partition. Do not rely on `head` to return specific rows; use `.loc` instead.

In [18]:
kdf.head()

Unnamed: 0,A,B,C,D
0,0.490529,-0.150846,-0.175013,-1.29911
1,1.746899,-1.368954,0.116949,1.933727
2,-0.300884,-0.075256,0.965327,-1.307201
3,1.492372,-0.949363,1.800348,-0.858068
4,-0.112239,-0.687341,1.845931,0.120126
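
The reason `head` can return different rows is easier to see with a toy model in plain Python. This is only a sketch: the nested lists stand in for partitions, and the helpers `head` and `select_by_label` are hypothetical names for illustration, not Koalas or Spark APIs.

```python
from itertools import chain, islice


def head(partitions, n=5):
    """Return the first n rows met while scanning the partitions in order."""
    return list(islice(chain.from_iterable(partitions), n))


def select_by_label(partitions, labels):
    """Return the rows whose label (first field) is in labels, sorted by label."""
    wanted = set(labels)
    return sorted(row for row in chain.from_iterable(partitions) if row[0] in wanted)


# The same six (label, value) rows, laid out in two different partition orders.
layout_1 = [[(0, "a"), (1, "b"), (2, "c")], [(3, "d"), (4, "e"), (5, "f")]]
layout_2 = [[(3, "d"), (4, "e"), (5, "f")], [(0, "a"), (1, "b"), (2, "c")]]

# Taking the first rows depends on the physical layout...
head_1 = head(layout_1, n=2)  # [(0, 'a'), (1, 'b')]
head_2 = head(layout_2, n=2)  # [(3, 'd'), (4, 'e')]

# ...while selecting by label gives the same answer for both layouts.
picked_1 = select_by_label(layout_1, [0, 1])
picked_2 = select_by_label(layout_2, [0, 1])
```

In this model the label-based selection plays the role that `.loc` plays in the text above: it names the rows it wants instead of depending on where they happen to be stored.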


For now, `tail` is not supported. In the context of Spark, it may be dangerous because it can easily return too many rows, which would saturate the memory of the host computer.

In [19]:
#kdf.tail(3)

PandasNotImplementedError: The method `pd.DataFrame.tail()` is not implemented yet.

Display the index, columns, and the underlying NumPy data.

You can also retrieve the index. An index column can be assigned to a DataFrame, as described later.

In [20]:
kdf.index

KeyError: 'Currently supported only when the DataFrame has a single index.'

In [21]:
kdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

For the same reason that `tail` is unsupported, `values` is not supported: it would usually return too much data for the user to handle.

`describe` shows a quick statistical summary of your data:

In [24]:
# This currently does not work, but will be resolved soon
# kdf.describe()

Transposing your data: transposing is not allowed for now, as it can cause all of the data to be aggregated onto a single machine.

In [23]:
# kdf.T
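
A rough way to see why a transpose is costly: each row of the result needs one value from every row of the input. The sketch below uses plain Python lists with made-up toy data; it is not Koalas code.

```python
# Three input rows with three fields each (toy data).
rows = [
    [1, 100, "one"],
    [2, 200, "two"],
    [3, 300, "three"],
]

# zip(*rows) pairs up the i-th field of every row, which is a transpose.
transposed = [list(column) for column in zip(*rows)]

# Every output row is as long as the number of input rows, so building even
# one of them requires reading all of the input.
output_row_lengths = [len(row) for row in transposed]
```

With a handful of rows this is harmless. With rows spread over many machines, each output row would have to gather values from all of them.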

Sorting by an axis

In [27]:
# TODO
# kdf.sort_index(axis=1, ascending=False)

Sorting by value

In [26]:
kdf.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-01,0.83062,-0.637647,0.424005,0.767639
2013-01-05,-0.533946,-0.468093,-1.199911,0.897936
2013-01-02,1.220761,-0.24743,0.246443,2.964931
2013-01-06,-1.533739,-0.238883,-0.680672,0.026554
2013-01-03,0.162569,0.19774,-0.90009,0.907806
2013-01-04,1.400141,0.334369,-0.283772,-0.724484


## Missing Data
Koalas primarily uses the value `np.nan` to represent missing data. By default, it is excluded from computations.


In [33]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [34]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [38]:
kdf1 = ks.from_pandas(pdf1)

In [39]:
kdf1

Unnamed: 0,A,B,C,D,E
2013-01-01,0.490529,-0.150846,-0.175013,-1.29911,1.0
2013-01-02,1.746899,-1.368954,0.116949,1.933727,1.0
2013-01-03,-0.300884,-0.075256,0.965327,-1.307201,
2013-01-04,1.492372,-0.949363,1.800348,-0.858068,


To drop any rows that have missing data.

In [40]:
kdf1.dropna(how='any')

Unnamed: 0,A,B,C,D,E
2013-01-01,0.490529,-0.150846,-0.175013,-1.29911,1.0
2013-01-02,1.746899,-1.368954,0.116949,1.933727,1.0


Filling missing data.

In [41]:
kdf1.fillna(value=5)

Unnamed: 0,A,B,C,D,E
2013-01-01,0.490529,-0.150846,-0.175013,-1.29911,1.0
2013-01-02,1.746899,-1.368954,0.116949,1.933727,1.0
2013-01-03,-0.300884,-0.075256,0.965327,-1.307201,5.0
2013-01-04,1.492372,-0.949363,1.800348,-0.858068,5.0
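
The two operations above follow simple row-wise rules. The plain-Python sketch below spells them out on the `A` and `E` columns of the example. The helpers `is_missing`, `dropna_any`, and `fillna` are hypothetical stand-ins written for illustration; they are not how Koalas implements these methods.

```python
import math

NAN = float("nan")

# Columns A and E of the example frame, keyed by row label.
rows = {
    "2013-01-01": {"A": 0.490529, "E": 1.0},
    "2013-01-02": {"A": 1.746899, "E": 1.0},
    "2013-01-03": {"A": -0.300884, "E": NAN},
    "2013-01-04": {"A": 1.492372, "E": NAN},
}


def is_missing(value):
    """True when the value is a float NaN."""
    return isinstance(value, float) and math.isnan(value)


def dropna_any(table):
    """Keep only the rows in which no value is missing (the how='any' rule)."""
    return {
        label: row
        for label, row in table.items()
        if not any(is_missing(v) for v in row.values())
    }


def fillna(table, value):
    """Return a copy in which every missing value is replaced by value."""
    return {
        label: {col: (value if is_missing(v) else v) for col, v in row.items()}
        for label, row in table.items()
    }


complete_rows = dropna_any(rows)
filled_rows = fillna(rows, 5)
```

Neither helper modifies `rows`; each builds a new table and leaves the input unchanged.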


## Operations

### Stats
Operations in general exclude missing data.

Performing a descriptive statistic:

In [44]:
kdf.mean()

A    0.686358
B   -0.243393
C    0.493670
D   -0.177245
dtype: float64
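
To make "exclude missing data" concrete, here is a minimal plain-Python sketch of a mean that skips NaN values, using the values of the Series `s` created earlier. The helper `mean_skipna` is a hypothetical name for illustration and is not part of Koalas.

```python
import math


def mean_skipna(values):
    """Average the values that are not NaN; return NaN if none remain."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else float("nan")


values = [1.0, 3.0, 5.0, float("nan"), 6.0, 8.0]

# A naive average lets the NaN propagate into the result.
naive_mean = sum(values) / len(values)

# Skipping the NaN averages the five remaining values: 23 / 5.
skipna_mean = mean_skipna(values)
```

The missing entry is dropped from both the sum and the count, so it does not pull the average toward zero.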

## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

In [53]:
kdf = ks.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [54]:
kdf

Unnamed: 0,A,B,C,D
0,foo,one,-0.975559,-1.03782
1,bar,one,2.199766,0.86435
2,foo,two,-1.529515,-1.0434
3,bar,three,0.209367,-2.30027
4,foo,two,0.405643,-0.919364
5,bar,two,-0.172825,0.753094
6,foo,one,-0.587384,1.187593
7,foo,three,-1.168062,0.530754


Grouping and then applying the [sum()](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.sum.html#databricks.koalas.groupby.GroupBy.sum) function to the resulting groups.

In [55]:
kdf.groupby('A').sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,2.236308,-0.682826
foo,-3.854878,-1.282238


Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [56]:
kdf.groupby(['A', 'B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,2.199766,0.86435
bar,three,0.209367,-2.30027
bar,two,-0.172825,0.753094
foo,one,-1.562944,0.149773
foo,three,-1.168062,0.530754
foo,two,-1.123872,-1.962764
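
The three steps listed at the start of this section (split, apply, combine) can be sketched in plain Python on the same eight rows, using the values as displayed above. The function `group_sum` is a hypothetical helper written for illustration; Koalas delegates the real work to Spark.

```python
from collections import defaultdict

# (A, B, C, D) rows, copied from the displayed frame.
rows = [
    ("foo", "one", -0.975559, -1.03782),
    ("bar", "one", 2.199766, 0.86435),
    ("foo", "two", -1.529515, -1.0434),
    ("bar", "three", 0.209367, -2.30027),
    ("foo", "two", 0.405643, -0.919364),
    ("bar", "two", -0.172825, 0.753094),
    ("foo", "one", -0.587384, 1.187593),
    ("foo", "three", -1.168062, 0.530754),
]


def group_sum(table, key):
    """Sum columns C and D within each group defined by key(row)."""
    # Split: collect the rows that share a key.
    groups = defaultdict(list)
    for row in table:
        groups[key(row)].append(row)
    # Apply: sum C and D independently inside each group.
    sums = {
        k: (sum(r[2] for r in members), sum(r[3] for r in members))
        for k, members in groups.items()
    }
    # Combine: return one mapping, ordered by group key.
    return dict(sorted(sums.items()))


# One grouping column gives one key per group.
by_a = group_sum(rows, key=lambda r: r[0])

# Two grouping columns give tuple keys, the analogue of a hierarchical index.
by_a_b = group_sum(rows, key=lambda r: (r[0], r[1]))
```

Up to rounding in the last displayed digit, `by_a` and `by_a_b` agree with the two tables above.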
