### load data from database
pandas vs blaze

#### Task
* query the database ORM-style, without writing any SQL statements
* run calculations on the returned results
* monitor memory usage

#### datasource
a single table in a remote MySQL database



##### 1. using blaze

In [1]:
from blaze import Data

In [2]:
DB_URL = "mysql://%s:%s@%s:%s/%s::secret_txn_tab" % ('secret')
orders = Data(DB_URL)
# note the warning messages below

Blaze does not understand a SQLAlchemy type.
Blaze provided the following error:
	No SQL-datashape match for type BLOB
Skipping.
Blaze does not understand a SQLAlchemy type.
Blaze provided the following error:
	No SQL-datashape match for type BLOB
Skipping.


When I used blaze in production, I got the error `ValueError: Unsupported string encoding u'utf8mb4_unicode_ci'`.
The target database uses the utf8mb4 charset.

In [3]:
order_by_date = orders[(orders.channelid==70000) & (orders.ctime>=1437840000) & (orders.ctime<1437926400)& (orders.status==1)]
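
The two `ctime` bounds in the filter are Unix timestamps exactly 86400 seconds apart, so they select one day of transactions. A minimal sketch of how such a window could be derived, assuming (the notebook does not say so) that the day is 2015-07-26 in a UTC+8 time zone; `day_window` is a hypothetical helper, not part of blaze or pandas:

```python
from datetime import datetime, timedelta, timezone

def day_window(year, month, day, utc_offset_hours):
    """Return (start, end) Unix timestamps covering one local calendar day."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(year, month, day, tzinfo=tz)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp())

# Assumption: the bounds used in the filter correspond to 2015-07-26 at UTC+8.
start, end = day_window(2015, 7, 26, 8)
print(start, end)  # 1437840000 1437926400
```

The half-open comparison (`>= start`, `< end`) used in the filter avoids counting a transaction at midnight in two consecutive days.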

In [4]:
int(order_by_date.count())

# blaze's count() returns a single row count, unlike pandas' DataFrame.count(), which returns one count per column

256

In [5]:
int(order_by_date.amount.sum())/100000

# same usage as in pandas

212018

In [6]:
order_by_date.userid.distinct().count()

# pandas uses drop_duplicates() instead of distinct()
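
For comparison, here is a small sketch of the pandas spellings of a distinct count. The DataFrame is toy data made up for illustration, not the real transaction table:

```python
import pandas as pd

# Toy data standing in for the transaction table (values are made up).
df = pd.DataFrame({"userid": [11183, 11184, 11184, 11174],
                   "amount": [89000000, 59000000, 69000000, 59000000]})

# Counterpart of blaze's `userid.distinct().count()`.
via_drop = int(df.userid.drop_duplicates().count())

# Shorter spelling; NaN values are ignored by default.
via_nunique = int(df.userid.nunique())

print(via_drop, via_nunique)  # 3 3
```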

##### 2. using pandas

In [7]:
from sqlalchemy import create_engine
import pandas as pd

In [8]:
DB_URL = "mysql://%s:%s@%s:%s/%s" % ('secret')
engine = create_engine(DB_URL)

In [9]:
orders = pd.read_sql('select * from secret_txn_tab',
                 con=engine)

# with pd.read_sql as used here, the SQL statement still has to be written by hand
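
When the connection is a SQLAlchemy engine, pandas can also load a whole table by name with `pd.read_sql_table`, which avoids the SQL string. The sketch below uses an in-memory SQLite database and a hypothetical table `demo_txn_tab` as stand-ins for the remote MySQL table. Note that any filtering would still run in pandas after the rows are loaded, rather than in the database.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the remote MySQL server (assumption for the demo).
engine = create_engine("sqlite://")
pd.DataFrame({"txnid": [1000000, 1000001],
              "status": [0, 1]}).to_sql("demo_txn_tab", engine, index=False)

# No SQL string: the table is named directly; `columns` limits what is fetched.
orders_demo = pd.read_sql_table("demo_txn_tab", con=engine,
                                columns=["txnid", "status"])
print(len(orders_demo))  # 2
```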

In [10]:
orders.head()

|   | txnid | userid | refund_txnid | checkoutid | type | amount | currency | channelid | status | channel_status | channel_txnid | ip | action_country | ctime | vtime | mtime | memo | extra_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000000 | 11183 | 0 | 95 | 0 | 89000000 | THB | 70000 | 0 | 0 |  | 2224250195 | SG | 1432303352 | 0 | 1432303352 |  | {} |
| 1 | 1000001 | 11184 | 0 | 100 | 0 | 59000000 | THB | 71000 | 0 | 200 |  | 3669652968 | SG | 1432307027 | 0 | 1432309356 |  | {"transfer_fields": {"name": "Liu jing", "memo... |
| 2 | 1000002 | 11184 | 0 | 104 | 0 | 69000000 | THB | 70000 | 0 | 0 |  | 3669652968 | SG | 1432307426 | 0 | 1432307426 |  | {} |
| 3 | 1000003 | 11184 | 0 | 104 | 0 | 69000000 | THB | 70000 | 1 | 100 | 4323108605155000001366 | 3669652968 | SG | 1432310801 | 1432310864 | 1432310864 |  | {"card_number": "426569xxxxxx3103", "auth_code... |
| 4 | 1000004 | 11174 | 0 | 111 | 0 | 59000000 | THB | 70000 | 0 | 0 |  | 2088405459 | TH | 1432311210 | 0 | 1432311210 |  | {} |


In [11]:
order_today = orders[(orders.channelid==70000) & (orders.ctime>=1437840000) & (orders.ctime<1437926400)& (orders.status==1)]

In [12]:
total_user = int(order_today.userid.drop_duplicates().count())
total_user

209

In [13]:
total_txn = int(order_today.txnid.count())
total_txn

256

In [14]:
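total_amt = int(order_today.amount.sum()) / 500000
total_amt

# note: the divisor here (500000) differs from the one in the blaze cell (100000), so the two totals are not directly comparable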

42403

##### pandas vs blaze
1. For the API differences, see the [blaze website](http://blaze.pydata.org/en/latest/rosetta-pandas.html).
2. Objects can easily be converted between the two using odo or Data().
3. Blaze can simplify some common IO tasks that one would otherwise do with pandas and make them more readable; these tasks rely on the odo library. In many cases blaze can handle datasets that do not fit into main memory, which is not easily done with pandas. (This time, however, I ran into the charset problems described above.)
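
For the out-of-memory case, pandas offers a partial workaround: passing `chunksize` to `pd.read_sql` makes it return an iterator of DataFrames, so an aggregate can be accumulated chunk by chunk. This is a minimal sketch under stated assumptions: an in-memory SQLite table named `demo_tab` (hypothetical) stands in for a large remote table, and depending on the database driver the full result set may still be buffered on the client side.

```python
import sqlite3
import pandas as pd

# In-memory SQLite table standing in for a large remote table (demo data).
conn = sqlite3.connect(":memory:")
pd.DataFrame({"amount": [10, 20, 30, 40, 50]}).to_sql("demo_tab", conn, index=False)

total = 0
rows = 0
# With chunksize set, read_sql yields DataFrames of at most 2 rows each.
for chunk in pd.read_sql("select amount from demo_tab", con=conn, chunksize=2):
    total += int(chunk.amount.sum())
    rows += len(chunk)

print(rows, total)  # 5 150
```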


In [15]:
def df_size(df):
    """Return the size of a DataFrame in megabytes."""
    total = 0.0
    for col in df:
        total += df[col].nbytes
    return total/1048576

In [16]:
df_size(orders)

2.9620513916015625
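
One caveat about this measurement: for object-dtype columns (such as Python strings), `nbytes` counts only the 8-byte references, not the strings themselves, so the figure is likely an underestimate. A sketch of an alternative based on `DataFrame.memory_usage(deep=True)`; `df_size_mb` is a hypothetical helper and the data is made up:

```python
import pandas as pd

# Toy frame with one numeric and one object (string) column.
df = pd.DataFrame({"txnid": [1000000, 1000001, 1000002],
                   "currency": pd.Series(["THB", "THB", "THB"], dtype=object)})

def df_size_mb(df, deep=True):
    """Size of a DataFrame in megabytes; deep=True also counts string payloads."""
    return df.memory_usage(index=False, deep=deep).sum() / 1048576

shallow = df_size_mb(df, deep=False)  # references only, comparable to nbytes
deep = df_size_mb(df, deep=True)      # includes the Python string objects
print(deep > shallow)  # True
```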

In [17]:
# how many records are in this df?
orders.count()

txnid             21569
userid            21569
refund_txnid      21569
checkoutid        21569
type              21569
amount            21569
currency          21569
channelid         21569
status            21569
channel_status    21569
channel_txnid     21569
ip                21569
action_country    21569
ctime             21569
vtime             21569
mtime             21569
memo              21569
extra_data        21569
dtype: int64

In [18]:
# switch to a larger dataset
DB_URL = "mysql://%s:%s@%s:%s/%s" % ('secret')
engine = create_engine(DB_URL)
records = pd.read_sql('select * from xx_realtime',
                 con=engine)

In [19]:
df_size(records)

16.8321533203125

In [20]:
records.count()

id          367704
type        367704
date        367704
tick        367704
location    367704
data        367704
dtype: int64

##### how to load a column as a string?

In [21]:
DB_URL = "mysql://%s:%s@%s:%s/%s::airpay_daily" % ('secret')
stats = Data(DB_URL)
data = stats[
            (stats.type==4) & (stats.date == '20150726') & (
            stats.location =='TH') & (stats.extra=='Total')]

In [22]:
data.head()

|   | id | type | date | location | extra | data |
|---|---|---|---|---|---|---|
| 0 | 8624 | 4 | 2015-07-26 | TH | Total | {"txn_user": 4147, "txn_value": 1895538.140000... |


In [23]:
data[0].data

# it seems the data value cannot be retrieved like this

In [24]:
# use odo to convert the blaze object to a pandas DataFrame
from odo import odo

In [25]:
df = odo(data, pd.DataFrame)

In [26]:
df

|   | id | type | date | location | extra | data |
|---|---|---|---|---|---|---|
| 0 | 8624 | 4 | 2015-07-26 | TH | Total | {"txn_user": 4147, "txn_value": 1895538.140000... |


In [27]:
x = df.iloc[0]['data']
x

'{"txn_user": 4147, "txn_value": 1895538.1400000025, "txn_num": 5512}'

In [None]:
import json

json.loads(x)
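
As a self-contained sketch of this last step, the string shown above can be parsed with the standard `json` module, and a whole column of such strings can be expanded into regular columns with `pd.json_normalize`. The one-row DataFrame is rebuilt here from the values printed in the notebook rather than loaded from the database:

```python
import json
import pandas as pd

# The string returned for the `data` column, copied from the output above.
x = '{"txn_user": 4147, "txn_value": 1895538.1400000025, "txn_num": 5512}'
parsed = json.loads(x)
print(parsed["txn_user"], parsed["txn_num"])  # 4147 5512

# Expand a column of JSON strings into one column per key.
df = pd.DataFrame({"id": [8624], "data": [x]})
expanded = pd.json_normalize(df["data"].apply(json.loads).tolist())
print(list(expanded.columns))  # ['txn_user', 'txn_value', 'txn_num']
```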