# 10 Minutes to Koalas

This is a short introduction to Koalas, geared mainly toward new users. This notebook shows you some key differences between Pandas and Koalas.

The original Pandas tutorial is located here:
http://pandas.pydata.org/pandas-docs/stable/10min.html

Customarily, we import Koalas alongside Pandas and NumPy as follows:

In [1]:
import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession

Spark is usually already loaded into the interpreter.

In [3]:
# spark

Activating Koalas is a simple matter of importing the package, which the first cell already does with `import databricks.koalas as ks`. The lines below are kept commented out.

In [2]:
# import pandorable_sparky
#import pyarrow
# pyarrow.__version__


From now on, Spark will behave in a way that is closer to Pandas:
 - Spark DataFrames will have a large number of extra functions that mimic the Pandas functions
 - Spark columns will mimic the behavior of Pandas series
 - the `pyspark` package and the `spark` context object will have extra functions that mimic functions found in the `pandas` package.

## Object Creation

See the [Data Structure Intro section](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) 

Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [4]:
s = ks.Series([1, 3, 5, np.nan, 6, 8])

In [5]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: 0, dtype: float64

Creating a Koalas DataFrame by passing a dict of objects that can be converted to series-like objects:

In [6]:
kdf = ks.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6], 'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

In [7]:
kdf

Unnamed: 0,a,b,c
10,1,100,one
20,2,200,two
30,3,300,three
40,4,400,four
50,5,500,five
60,6,600,six


Creating a Pandas DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [8]:
dates = pd.date_range('20130101', periods=6)

In [9]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [11]:
pdf

Unnamed: 0,A,B,C,D
2013-01-01,0.469515,1.692328,2.448141,1.388829
2013-01-02,-2.754278,0.060754,-0.055676,-0.54321
2013-01-03,-0.415742,0.014523,1.194958,0.52363
2013-01-04,-1.683029,0.762305,0.546031,0.418287
2013-01-05,0.325352,1.404649,0.401852,-0.173458
2013-01-06,-0.183483,-0.84805,0.24085,-0.939891


Now, this Pandas DataFrame can be converted to a Koalas DataFrame:

In [12]:
kdf = ks.from_pandas(pdf)

In [13]:
type(kdf)

databricks.koalas.frame.DataFrame

It looks and behaves much like a Pandas DataFrame, though:

In [14]:
kdf

Unnamed: 0,A,B,C,D
2013-01-01,0.469515,1.692328,2.448141,1.388829
2013-01-02,-2.754278,0.060754,-0.055676,-0.54321
2013-01-03,-0.415742,0.014523,1.194958,0.52363
2013-01-04,-1.683029,0.762305,0.546031,0.418287
2013-01-05,0.325352,1.404649,0.401852,-0.173458
2013-01-06,-0.183483,-0.84805,0.24085,-0.939891


It is also possible to create a Koalas DataFrame from a Spark DataFrame.  
First, create a Spark DataFrame from the Pandas DataFrame:

In [15]:
spark = SparkSession.builder.getOrCreate()

In [16]:
sdf = spark.createDataFrame(pdf)

In [17]:
sdf.show()

+--------------------+-------------------+--------------------+--------------------+
|                   A|                  B|                   C|                   D|
+--------------------+-------------------+--------------------+--------------------+
| 0.46951476524593116|   1.69232820806441|  2.4481407646478446|  1.3888290920143131|
| -2.7542776700523692|0.06075365878992547|-0.05567555629170335| -0.5432099471873877|
|-0.41574219128321105|0.01452310955527324|   1.194957745707772|   0.523629782972406|
| -1.6830290270303236| 0.7623052858431563|  0.5460306789778872| 0.41828661593969513|
|  0.3253519159243422| 1.4046490662922437| 0.40185178750169864|-0.17345826704350045|
| -0.1834831800216344|-0.8480495475455248|  0.2408504808453694| -0.9398908994802203|
+--------------------+-------------------+--------------------+--------------------+



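Creating a Koalas DataFrame from the Spark DataFrame. Note that the datetime index of the original Pandas DataFrame is not carried over; the result has a default integer index: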

In [18]:
kdf = ks.DataFrame(sdf)

In [19]:
kdf

Unnamed: 0,A,B,C,D
0,0.469515,1.692328,2.448141,1.388829
1,-2.754278,0.060754,-0.055676,-0.54321
2,-0.415742,0.014523,1.194958,0.52363
3,-1.683029,0.762305,0.546031,0.418287
4,0.325352,1.404649,0.401852,-0.173458
5,-0.183483,-0.84805,0.24085,-0.939891


The columns of a Koalas DataFrame have specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Types that are common to both Spark and Pandas are currently supported.

In [19]:
# Currently a bug on .dtypes, this will be fixed via
# https://github.com/databricks/koalas/commit/b1e9033691cce74236382c23010c1ef4b1572f0e
# targeted 0.2.0
# kdf2.dtypes

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Type `kdf.` and press TAB at the prompt to see the completions. The cell below is not valid Python, so executing it in a notebook raises a `SyntaxError`:

In [22]:
kdf.<TAB>

SyntaxError: invalid syntax (<ipython-input-22-61dc798c35d9>, line 1)

The columns A, B, C, and D are tab completed automatically, along with the public attributes of the DataFrame.

## Viewing Data

See the [Basics section](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics) 

See the top rows of the frame. The results may not be the same as in Pandas, though: unlike Pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark will just take the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` instead.

In [20]:
kdf.head()

Unnamed: 0,A,B,C,D
2013-01-01,0.83062,-0.637647,0.424005,0.767639
2013-01-02,1.220761,-0.24743,0.246443,2.964931
2013-01-03,0.162569,0.19774,-0.90009,0.907806
2013-01-04,1.400141,0.334369,-0.283772,-0.724484
2013-01-05,-0.533946,-0.468093,-1.199911,0.897936
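
The ordering caveat can be illustrated without Spark. The following is a toy, pure-Python model and not how Spark or Koalas implements `head`: the functions `head` and `sorted_head` and the sample partitions are hypothetical names introduced only for this sketch. It assumes rows are `(key, value)` tuples spread over partitions that may arrive in any order.

```python
from itertools import islice


def head(partitions, n):
    """Take the first n rows in whatever order the partitions arrive."""
    rows = (row for part in partitions for row in part)
    return list(islice(rows, n))


def sorted_head(partitions, n):
    """Sort all rows by key first, so partition order no longer matters."""
    rows = [row for part in partitions for row in part]
    return sorted(rows)[:n]


# Hypothetical (key, value) rows split across two partitions.
part_a = [(10, "one"), (20, "two")]
part_b = [(30, "three"), (40, "four")]

# The same data gives different "first rows" when partitions are swapped.
first_ab = head([part_a, part_b], 2)
first_ba = head([part_b, part_a], 2)

# Sorting by key before taking rows gives the same answer either way.
stable_ab = sorted_head([part_a, part_b], 2)
stable_ba = sorted_head([part_b, part_a], 2)

print(first_ab)                # [(10, 'one'), (20, 'two')]
print(first_ba)                # [(30, 'three'), (40, 'four')]
print(stable_ab == stable_ba)  # True
```

The point of the sketch is only that "the first rows" is well defined once an explicit key or ordering is used, which is why selecting by label is the safer choice.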


For now, `tail` is not supported. In the context of Spark it may be dangerous, because it can easily return too many rows and saturate the memory of the host computer.

In [24]:
kdf.tail(3)

PandasNotImplementedError: The method `pd.DataFrame.tail()` is not implemented yet.

Display the index, the columns, and the underlying NumPy data.

You can retrieve the index; an index column can also be assigned to a DataFrame (see later):

In [29]:
kdf.index

0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
4   2013-01-05
5   2013-01-06
Name: __index_level_0__, dtype: datetime64[ns]

In [30]:
kdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

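For the same reason as `tail`, `values` is not supported: it will usually return too much data for the user to handle. `to_numpy()` is available, but it loads all of the data onto a single machine, so use it only when the result is expected to be small: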

In [20]:
# TODO: change it
kdf.to_numpy()

array([[-0.75811496, -0.7331042 , -1.09205119,  0.22310124],
       [ 2.48604039,  0.09742931, -0.22955946, -0.06743535],
       [-0.46871077, -0.65240527, -2.25607653, -0.03395906],
       [-0.89555621,  1.46830139, -0.14267751, -1.014546  ],
       [ 1.63999304,  0.0933624 , -1.2935593 ,  1.09878447],
       [ 0.14056595,  0.32721471, -1.01381956,  0.46292008]])
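
Because converting to a NumPy array brings every row onto one machine, it can help to check the row count first. The sketch below is one possible guard, not part of the Koalas API: `to_numpy_if_small`, `TooLargeToCollect`, `FakeFrame`, and the `max_rows` default are hypothetical names chosen for illustration. It assumes only that the frame supports `len()` and `to_numpy()`; on a Koalas DataFrame, `len()` itself is assumed to trigger a count over the data.

```python
class TooLargeToCollect(Exception):
    """Raised when a frame has more rows than the caller allows."""


def to_numpy_if_small(frame, max_rows=10_000):
    """Call frame.to_numpy() only when len(frame) is within max_rows."""
    n_rows = len(frame)
    if n_rows > max_rows:
        raise TooLargeToCollect(
            f"{n_rows} rows exceeds the limit of {max_rows}")
    return frame.to_numpy()


class FakeFrame:
    """Stand-in for a DataFrame so the sketch runs without Spark."""

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def to_numpy(self):
        # A real frame would return a NumPy array; nested lists suffice here.
        return [list(row) for row in self._rows]


small = FakeFrame([(1, 2), (3, 4)])

# Within the limit: the data is converted.
result = to_numpy_if_small(small, max_rows=5)

# Over the limit: the conversion is refused before any data is collected.
try:
    to_numpy_if_small(small, max_rows=1)
    rejected = False
except TooLargeToCollect:
    rejected = True
```

With a real frame the call would look like `to_numpy_if_small(kdf)`; the limit should be picked to match the memory available on the machine that receives the data.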

`describe` shows a quick statistical summary of your data:

In [25]:
# This currently does not work, but will be resolved soon
kdf.describe()

PandasNotImplementedError: The method `pd.DataFrame.describe()` is not implemented yet.

Transposing your data: transposing is not allowed for now, as it can cause all of the data to be aggregated onto a single machine.

In [23]:
# kdf.T

Sorting by an axis:

In [24]:
# TODO
# kdf.sort_index(axis=1, ascending=False)

Sorting by value:

In [26]:
kdf.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-01,0.83062,-0.637647,0.424005,0.767639
2013-01-05,-0.533946,-0.468093,-1.199911,0.897936
2013-01-02,1.220761,-0.24743,0.246443,2.964931
2013-01-06,-1.533739,-0.238883,-0.680672,0.026554
2013-01-03,0.162569,0.19774,-0.90009,0.907806
2013-01-04,1.400141,0.334369,-0.283772,-0.724484
