# Datatable

Datatable is a Python library for fast data manipulation. It addresses some of the performance limitations of pandas on large datasets and speeds up exploratory data analysis (EDA). Datatable is developed by H2O.ai, an AI/ML company, and its API is similar to pandas and to R's data.table library. It is well documented and works with Python 3.6+.

To read more about it, please refer to [this](https://analyticsindiamag.com/hands-on-guide-to-datatable-library-for-faster-eda/) article.

# Code Implementation

## Installing datatable

In [None]:
!python -m pip install pip --upgrade --user -q
# The PyPI package is named scikit-learn; the old "sklearn" alias is deprecated.
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels tensorflow keras --user -q

In [None]:
!python -m pip install datatable --user -q

In [None]:
import IPython
# Restart the kernel so the freshly installed packages can be imported.
IPython.Application.instance().kernel.do_shutdown(True)

Dataset – [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)

The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers two days of transactions, of which 492 out of 284,807 are frauds. There are 31 columns in total: Time, Amount, Class, and V1 to V28.
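
The class imbalance implied by these counts can be checked with a line of arithmetic. A minimal sketch using only the figures quoted above (the variable names are illustrative):

```python
# Counts quoted in the dataset description above.
n_fraud = 492
n_total = 284_807

# Share of all transactions that are labelled as fraud.
fraud_rate = n_fraud / n_total
print(f"Fraud rate: {fraud_rate:.3%}")  # about 0.173%
```

With well under 1% of rows in the positive class, the dataset is highly imbalanced.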

## Reading files

With pandas:

In [None]:
import pandas as pd
import time
start = time.time()
pandas_df = pd.read_csv('https://gitlab.com/AnalyticsIndiaMagazine/practicedatasets/-/raw/main/LoRAS/creditcard.csv')
end = time.time()
print(end - start)

With datatable:

In [None]:
import datatable as dt
start = time.time()
df = dt.fread('https://gitlab.com/AnalyticsIndiaMagazine/practicedatasets/-/raw/main/LoRAS/creditcard.csv')
end = time.time()
print(end - start)

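In the run reported here, datatable was much faster than pandas: it read the data in about 30 milliseconds, whereas pandas took more than 1.5 seconds. Both timings include fetching the file from the URL, so they will vary with network conditions.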
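
A single `time.time()` measurement can be noisy. One common way to get steadier numbers is to repeat the call and keep the fastest run. The helper below is a hypothetical sketch (the name `best_of` is not part of pandas or datatable) and uses a stand-in workload so that it runs without any downloads:

```python
import time

def best_of(func, repeats=3):
    """Call func() `repeats` times; return (last result, fastest wall time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func()
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in workload. In this notebook it could instead be
# lambda: pd.read_csv(path) or lambda: dt.fread(path).
total, seconds = best_of(lambda: sum(range(100_000)))
print(f"result={total}, best of 3: {seconds:.6f} s")
```

Reading from a local copy of the CSV rather than the URL would also keep download time out of the comparison.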

Dataset size:

In [None]:
print(df.shape) 

Feature Column names

In [None]:
print(df.names)

Column Types

In [None]:
print(df.stypes) 

Convert to numpy array

In [None]:
np_arr = df.to_numpy()

Convert to a pandas DataFrame

In [None]:
df_pd = df.to_pandas()

Convert to a Python list

In [None]:
py_obj = df.to_list()
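
Note that `to_list()` returns the data column by column (one inner list per column), not row by row. If rows are needed, the result can be transposed. A small sketch, using a made-up two-column list as a stand-in for `py_obj`:

```python
# Hypothetical stand-in for the output of df.to_list() on a frame with
# two columns, Time and Amount: one inner list per column.
columns = [
    [0.0, 1.0, 2.0],         # Time
    [149.62, 2.69, 378.66],  # Amount
]

# zip(*columns) pairs up the k-th element of every column, yielding rows.
rows = [list(row) for row in zip(*columns)]
print(rows)
```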

## Sorting a Frame

The `sort()` function returns the rows ordered by the given column, in ascending order.

With pandas:

In [None]:
start = time.time()
pandas_df.sort_values(by="Amount")
end = time.time()
print(end - start)

With datatable:

In [None]:
start = time.time()
df.sort("Amount")
end = time.time()
print(end - start)

## GroupBy

Let us compute the total Amount for each value of V1. In datatable, operations on a Frame are written as `DT[i, j, ...]`, where `i` is the row selector, `j` is the column selector, and `...` stands for additional modifiers such as `by()`. The notation is derived from matrix indexing.

In [None]:
start = time.time()
for i in range(10):  # same number of repetitions as the datatable loop, for a fair comparison
    pandas_df.groupby("V1")["Amount"].sum()
end = time.time()
print(end - start)

In [None]:
start = time.time()
for i in range(10):
    df[:, dt.sum(dt.f.Amount), dt.by(dt.f.V1)]
end = time.time()
print(end - start)

`f` in `dt.f` is the frame proxy: inside the square brackets it refers to the Frame being operated on, so `dt.f.Amount` means the Amount column of `df`.

In the run reported here, datatable took about one quarter of the time pandas did.
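
To make the `DT[i, j, by]` form concrete, here is a pure-Python model of the simple case where `i` is a boolean row filter, `j` sums one column, and `by` groups on another. It does not use datatable itself; the function `grouped_sum` and the sample rows are invented for illustration, and the group labels are made up (in the real dataset V1 is a continuous feature).

```python
from collections import defaultdict

# Tiny made-up table: each dict is one row.
rows = [
    {"V1": "a", "Amount": 10.0},
    {"V1": "b", "Amount": 5.0},
    {"V1": "a", "Amount": 2.5},
    {"V1": "b", "Amount": 1.0},
]

def grouped_sum(rows, i, j, by):
    """Pure-Python model of DT[i, sum(f.<j>), by(f.<by>)].

    i  : predicate choosing which rows to keep (row selector)
    j  : name of the column to sum (column selector)
    by : name of the column whose values define the groups
    """
    totals = defaultdict(float)
    for row in rows:
        if i(row):                     # apply the row filter
            totals[row[by]] += row[j]  # accumulate per group
    return dict(totals)

# Similar in spirit to df[:, dt.sum(dt.f.Amount), dt.by(dt.f.V1)]
all_rows = grouped_sum(rows, i=lambda r: True, j="Amount", by="V1")
# Keep only rows with Amount > 2 before grouping.
filtered = grouped_sum(rows, i=lambda r: r["Amount"] > 2, j="Amount", by="V1")
print(all_rows)
print(filtered)
```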


## Deleting a column

In [None]:
del df[:, 'V27']

## Saving Frames

Save the Frame to disk in the binary Jay format so that it can be reopened almost instantly later.

In [None]:
# df.to_jay("out.jay")
# dt.open() is deprecated; fread() also reads Jay files.
# df_dt = dt.fread("out.jay")

Write the Frame to a CSV file

In [None]:
# df.to_csv('out.csv')