# 10 Minutes to Koalas

This is a short introduction to Koalas, geared mainly for new users. This notebook shows you some key differences between Pandas and Koalas.

The original Pandas tutorial is located here:
http://pandas.pydata.org/pandas-docs/stable/10min.html

Customarily, we import Koalas alongside Pandas and NumPy as follows:

In [3]:
import pandas as pd
import numpy as np
import databricks.koalas as ks

Spark is usually already loaded into the interpreter.

In [8]:
# spark

Activating the PandasOnSpark is a simple matter of importing the following package:

In [10]:
# import pandorable_sparky
#import pyarrow
# pyarrow.__version__

'0.10.0'


From now on, Spark will behave in a way that is closer to Pandas:
 - Spark DataFrames will have a large number of extra functions that mimic the Pandas functions
 - Spark columns will mimic the behavior of Pandas series
 - the `pyspark` package and the `spark` context object will have extra functions that mimic functions found in the `pandas` package.

## Object Creation

See the [Data Structure Intro section](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) 

Creating a Series by passing a list of values, letting Koalas create a default integer index:

In [13]:
s = ks.Series([1,3,5,np.nan,6,8])

In [14]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: 0, dtype: float64

Creating a DataFrame by passing a dict of objects that can be converted to series-like objects:

In [45]:
kdf = ks.DataFrame({'a': [1, 2, 3, 4, 5, 6],'b': [100, 200, 300, 400, 500, 600],
'c': ["one", "two", "three", "four", "five", "six"]},index=[10, 20, 30, 40, 50, 60])

In [48]:
kdf

Unnamed: 0,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [16]:
dates = pd.date_range('20130101', periods=6)

In [20]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [21]:
pdf = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [22]:
pdf

Unnamed: 0,A,B,C,D
2013-01-01,-1.266642,0.919957,1.258046,-0.633785
2013-01-02,0.754034,-1.08401,-0.305329,0.756747
2013-01-03,-0.776311,-1.485204,-0.251275,0.807865
2013-01-04,1.524266,0.723583,0.5831,1.59651
2013-01-05,-2.399535,-1.138287,0.085331,0.508602
2013-01-06,-1.387871,0.035259,0.91216,1.579819


Now, this Pandas DataFrame can be converted to a Koalas DataFrame:

In [11]:
kdf = ks.from_pandas(pdf)

In [12]:
type(kdf)

databricks.koalas.frame.DataFrame

It looks and behaves the same as a Pandas DataFrame though:

In [14]:
kdf

Unnamed: 0,A,B,C,D
2013-01-01,0.83062,-0.637647,0.424005,0.767639
2013-01-02,1.220761,-0.24743,0.246443,2.964931
2013-01-03,0.162569,0.19774,-0.90009,0.907806
2013-01-04,1.400141,0.334369,-0.283772,-0.724484
2013-01-05,-0.533946,-0.468093,-1.199911,0.897936
2013-01-06,-1.533739,-0.238883,-0.680672,0.026554


Creating a DataFrame by passing a dict of objects that can be converted to series-like objects:

In [40]:
kdf2 = ks.DataFrame({'A':1.,
              'B':pd.Timestamp('20130102'),
              'C':pd.Series(1,index=list(range(4)),dtype='float32'),
              'D':np.array([3]*4,dtype='int32'),
              'E':pd.Categorical(["test","train","test","train"]),
              'F':'foo'})

TypeError: data type not understood

In [17]:
kdf2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Types that are common to both Spark and Pandas are currently supported.

In [18]:
# Currently a bug on .dtypes, this will be fixed via
# https://github.com/databricks/koalas/commit/b1e9033691cce74236382c23010c1ef4b1572f0e
# targeted 0.2.0
kdf2.dtypes

AttributeError: 'DataFrame' object has no attribute 'dtypes'
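
For comparison, the plain Pandas counterpart of this frame reports one dtype per column. The sketch below uses Pandas only, not Koalas, so it runs without a Spark session; the name `pdf2` is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Pandas-only counterpart of kdf2; `pdf2` is a name introduced for this sketch.
pdf2 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': np.array([3] * 4, dtype='int32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})

# One dtype per column; scalars such as 'A' and 'F' are broadcast to all four rows.
print(pdf2.dtypes)
```

Once `.dtypes` is available in Koalas, the expectation is that it reports the corresponding types for those columns that Spark also supports.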

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

In [16]:
# kdf2.<TAB>

As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.

## Viewing Data

See the [Basics section](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics) 

See the top rows of the frame. The results may not be the same as in Pandas, though: unlike Pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of index. When asked for the head of a DataFrame, Spark will just take the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` instead.

In [20]:
kdf.head()

Unnamed: 0,A,B,C,D
2013-01-01,0.83062,-0.637647,0.424005,0.767639
2013-01-02,1.220761,-0.24743,0.246443,2.964931
2013-01-03,0.162569,0.19774,-0.90009,0.907806
2013-01-04,1.400141,0.334369,-0.283772,-0.724484
2013-01-05,-0.533946,-0.468093,-1.199911,0.897936
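
To ask for specific rows rather than whatever `head()` happens to return, select them by label. The sketch below uses plain Pandas so that it runs without a Spark session. It assumes that `.loc` on a Koalas DataFrame accepts a similar label-based selection, which may depend on the Koalas version.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Label-based slice: both endpoints are included, so this selects the rows
# labeled 2013-01-01 through 2013-01-03 by their index values.
first_three = pdf.loc['2013-01-01':'2013-01-03']
print(first_three)
```

The selection is defined by the index labels, not by the position of the rows in storage.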


For now, `tail` is not supported. In the context of Spark, it may be dangerous because it can easily return too many rows, which would saturate the memory of the host computer.

In [24]:
#kdf.tail(3)

PandasNotImplementedError: The method `pd.DataFrame.tail()` is not implemented yet.
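
One way to approximate `tail(n)` is to sort in descending order on a column that defines the row order and then take the head. The sketch below shows the idea in plain Pandas. It assumes that `sort_values` on a Koalas DataFrame accepts the same `ascending` argument, which this notebook does not demonstrate.

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': [100, 200, 300, 400, 500, 600]},
                   index=[10, 20, 30, 40, 50, 60])

# Sort descending on the ordering column, keep the first 3 rows,
# then restore ascending order so the result matches tail(3).
last_three = pdf.sort_values(by='a', ascending=False).head(3).sort_values(by='a')
print(last_three)
```

This only works when some column defines the intended order, here the hypothetical ordering column `a`.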

Display the index, columns, and the underlying numpy data.

You can also retrieve the index; an index column can be assigned to a DataFrame, as shown later.

In [29]:
kdf.index

0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
4   2013-01-05
5   2013-01-06
Name: __index_level_0__, dtype: datetime64[ns]

In [30]:
kdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

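For the same reason as `tail`, `values` is not supported: it would usually return too much data for the user to handle. `to_numpy()` is available, but it collects all the data onto a single machine, so use it only on small DataFrames.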

In [31]:
kdf.to_numpy()

array([[ 2.05055926,  0.48966381, -1.39305841,  1.66738977],
       [-0.25132452,  1.61328083,  0.48646894, -1.74489528],
       [ 0.40069194, -0.86943688, -0.04472291, -0.91568735],
       [ 0.71411596,  1.24086838, -0.30526011, -0.17486808],
       [-0.0263938 ,  1.01327006,  0.10317603, -0.58908149],
       [-0.69180705, -0.79513103,  1.68188449,  0.31527069]])
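
Because the conversion to a NumPy array materializes every row on one machine, a cautious pattern is to reduce the data first and convert only the small result. The following is a Pandas-only sketch of that pattern; that the same chain of calls behaves identically on a Koalas DataFrame is an assumption.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Keep two columns and three rows before converting, so the array stays small.
small = pdf[['A', 'B']].head(3).to_numpy()
print(small.shape)
```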

`describe` shows a quick statistical summary of your data:

In [25]:
# This currently does not work, but will be resolved soon
kdf.describe()

PandasNotImplementedError: The method `pd.DataFrame.describe()` is not implemented yet.
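
For reference, this is the shape of the summary that Pandas produces for a frame like this one. The sketch uses Pandas only; the values are random, so only the row and column labels are stable.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

# One column of statistics per numeric column of the input:
# count, mean, std, min, quartiles, and max.
summary = pdf.describe()
print(summary)
```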

Transposing your data: transposing is not allowed for now, as it can cause all the data to be aggregated onto a single machine.

In [23]:
# kdf.T

Sorting by an axis

In [24]:
# TODO
# kdf.sort_index(axis=1, ascending=False)
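
Until sorting along the column axis is available, a similar result can be obtained by selecting the columns in the desired order. The sketch below is written in plain Pandas; it assumes that selecting with a list of column names works the same way on a Koalas DataFrame.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

# Reorder the columns by name in descending order: D, C, B, A.
reordered = pdf[sorted(pdf.columns, reverse=True)]
print(reordered.columns.tolist())
```

In Pandas this matches `pdf.sort_index(axis=1, ascending=False)`, and it only touches column metadata rather than moving rows between machines.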

Sorting by value

In [26]:
kdf.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-01,0.83062,-0.637647,0.424005,0.767639
2013-01-05,-0.533946,-0.468093,-1.199911,0.897936
2013-01-02,1.220761,-0.24743,0.246443,2.964931
2013-01-06,-1.533739,-0.238883,-0.680672,0.026554
2013-01-03,0.162569,0.19774,-0.90009,0.907806
2013-01-04,1.400141,0.334369,-0.283772,-0.724484
