<a href="https://colab.research.google.com/github/chrismarkella/Kaggle-access-from-Google-Colab/blob/master/apply_weakness.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!apt-get -qq install tree

Selecting previously unselected package tree.
(Reading database ... 135004 files and directories currently installed.)
Preparing to unpack .../tree_1.7.0-5_amd64.deb ...
Unpacking tree (1.7.0-5) ...
Setting up tree (1.7.0-5) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


In [0]:
import os

import numpy as np
import pandas as pd

from getpass import getpass 

In [4]:
import json

def access_kaggle():
    """
    Set up Kaggle API access from Google Colab.

    If /root/.kaggle/kaggle.json does not exist, prompt for the
    Kaggle username and API key, then write them to that file
    with owner-only permissions.
    """
    KAGGLE_ROOT = os.path.join('/root', '.kaggle')
    KAGGLE_PATH = os.path.join(KAGGLE_ROOT, 'kaggle.json')

    # Check for the credentials file itself, not just its folder.
    if not os.path.exists(KAGGLE_PATH):
        user = getpass(prompt='Kaggle username: ')
        key  = getpass(prompt='Kaggle API key: ')

        os.makedirs(KAGGLE_ROOT, exist_ok=True)
        # json.dump escapes quotes and backslashes in the credentials.
        with open(KAGGLE_PATH, mode='w') as f:
            json.dump({'username': user, 'key': key}, f)
        # Restrict the credentials file to its owner.
        os.chmod(KAGGLE_PATH, 0o600)
        del user, key
        print('Kaggle is successfully set up. Good to go.')

access_kaggle()


Kaggle username: ··········
Kaggle API key: ··········
Kaggle is successfully set up. Good to go.


In [5]:
!kaggle datasets files benhamner/sf-bay-area-bike-share

name              size  creationDate         
---------------  -----  -------------------  
station.csv        6KB  2019-11-14 06:26:55  
trip.csv          76MB  2019-11-14 06:26:55  
weather.csv      428KB  2019-11-14 06:26:55  
database.sqlite    3GB  2019-11-14 06:26:55  
status.csv         2GB  2019-11-14 06:26:55  


In [6]:
!kaggle datasets download benhamner/sf-bay-area-bike-share -f status.csv

Downloading status.csv.zip to /content
 97% 177M/182M [00:01<00:00, 98.3MB/s]
100% 182M/182M [00:01<00:00, 114MB/s] 


In [7]:
!tree -sh

.
├── [4.0K]  sample_data
│   ├── [1.7K]  anscombe.json
│   ├── [294K]  california_housing_test.csv
│   ├── [1.6M]  california_housing_train.csv
│   ├── [ 17M]  mnist_test.csv
│   ├── [ 35M]  mnist_train_small.csv
│   └── [ 930]  README.md
└── [182M]  status.csv.zip

1 directory, 7 files


In [8]:
!unzip status.csv.zip
!rm status.csv.zip
!tree -sh

Archive:  status.csv.zip
  inflating: status.csv              
.
├── [4.0K]  sample_data
│   ├── [1.7K]  anscombe.json
│   ├── [294K]  california_housing_test.csv
│   ├── [1.6M]  california_housing_train.csv
│   ├── [ 17M]  mnist_test.csv
│   ├── [ 35M]  mnist_train_small.csv
│   └── [ 930]  README.md
└── [1.9G]  status.csv

1 directory, 7 files


In [26]:
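import time
from functools import wraps

def timer(func):
    """Decorator: make func return (result, elapsed_seconds)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # perf_counter is a monotonic, high-resolution clock meant for timing.
        start = time.perf_counter()
        rv = func(*args, **kwargs)
        end = time.perf_counter()
        return rv, end - start
    return wrapper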

@timer
def load_data(csv_path):
    return pd.read_csv(csv_path, sep=',')

df, time_elapsed = load_data('status.csv')
print(f'time elapsed: {time_elapsed}')
print(df.shape)

time elapsed: 82.98482918739319
(71984434, 4)


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
station_id         int64
bikes_available    int64
docks_available    int64
time               object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB


In [22]:
df_100k = df.iloc[:100*10**3]
df_200k = df.iloc[:200*10**3]
df_500k = df.iloc[:500*10**3]
df_1M = df.iloc[:10**6]

for _df in (df_100k, df_200k, df_500k, df_1M):
    print(_df.shape)
    print(f'size: {_df.shape[0]//1000}k')

(100000, 4)
size: 100k
(200000, 4)
size: 200k
(500000, 4)
size: 500k
(1000000, 4)
size: 1000k


In [29]:
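_mean = df.station_id.mean()

def re_mean_for_apply(row):
    # Called once per row; `row` is a Series holding that row's values.
    row.station_id = row.station_id - _mean
    return row

@timer
def re_mean(_df, method):
    """Subtract the mean station_id using `method`.

    The result is discarded on purpose: only the elapsed time matters.
    """
    if method == 'series':
        _df.station_id - _mean
    elif method == 'map':
        _df.station_id.map(lambda st_id: st_id - _mean)
    elif method == 'apply':
        _df.apply(func=re_mean_for_apply, axis='columns')
    else:
        raise ValueError(f'unknown method: {method}')
    return 1  # dummy value; @timer pairs it with the elapsed time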

methods = [
    'series',
    'map',
    'apply',
]

dfs = [
    df_100k,
    df_200k,
]

for _df in dfs:
    print(f'size: {_df.shape[0]//10**3}k')
    for _method in methods:
        _,time_elapsed = re_mean(_df, _method)
        print(f'{_method:6}, time elapsed: {time_elapsed:7.3f}')

size: 100k
series, time elapsed:   0.001
map   , time elapsed:   0.023
apply , time elapsed:  18.892
size: 200k
series, time elapsed:   0.001
map   , time elapsed:   0.045
apply , time elapsed:  37.668


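### `Series` operations are killing `map` and `apply`
- `Series` arithmetic is roughly 20-40 times faster than `map`
- `Series` arithmetic is roughly **20,000-40,000** times faster than row-wise `apply`
- The `Series` timings sit at the 0.001 s print resolution, so these ratios are approximate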
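
The gap comes from where the loop runs. The sketch below is a minimal illustration on a tiny synthetic frame (the `demo` values and the name `mean_id` are made up for this example, not taken from `status.csv`). It checks that all three approaches compute the same result, so the benchmark compares like with like.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of status.csv.
demo = pd.DataFrame({
    "station_id": np.arange(1, 6),
    "bikes_available": [2, 5, 7, 1, 0],
})
mean_id = demo["station_id"].mean()

# 1. Series arithmetic: the whole column is processed in compiled code.
by_series = demo["station_id"] - mean_id

# 2. Series.map: the Python lambda is called once per element.
by_map = demo["station_id"].map(lambda st_id: st_id - mean_id)

# 3. Row-wise apply: pandas builds a Series for each row,
#    then calls the Python function on it.
by_apply = demo.apply(lambda row: row["station_id"] - mean_id, axis="columns")

same = np.allclose(by_series, by_map) and np.allclose(by_series, by_apply)
print(same)
```

Since the results match, the choice between the three is purely about speed, and the vectorized form is also the shortest to write.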

In [30]:
methods = [
    'series',
    'map',
]

dfs = [
    df_500k,
    df_1M,
    df,
]

for _df in dfs:
    print(f'size: {_df.shape[0]//10**3}k')
    for _method in methods:
        _,time_elapsed = re_mean(_df, _method)
        print(f'{_method:6}, time elapsed: {time_elapsed:7.3f}')

size: 500k
series, time elapsed:   0.003
map   , time elapsed:   0.123
size: 1000k
series, time elapsed:   0.002
map   , time elapsed:   0.237
size: 71984k
series, time elapsed:   0.155
map   , time elapsed:  17.716


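### `apply` did not even make it this far
- `apply` was left out of this run: at about 19 s per 100k rows, a linear extrapolation puts it at several hours for the full data.
- `Series` arithmetic is just blazing fast.
- It needs only **155** ms for **72 million** rows, while `map` takes about 17.7 s.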
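
Single-shot timings of a few milliseconds are noisy. One way to make such measurements steadier is to repeat each call and keep the fastest run. The sketch below assumes a hypothetical helper `best_of` and a synthetic `ids` column with made-up station ids; it is not part of the notebook above.

```python
import time

import numpy as np
import pandas as pd

def best_of(func, repeats=5):
    """Return the fastest wall-clock time (seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Synthetic column; the id range is an assumption for illustration only.
rng = np.random.default_rng(0)
ids = pd.Series(rng.integers(1, 85, size=200_000))
mean_id = ids.mean()

t_series = best_of(lambda: ids - mean_id)
t_map = best_of(lambda: ids.map(lambda st_id: st_id - mean_id), repeats=3)
print(f"series: {t_series:.4f} s, map: {t_map:.4f} s")
```

Taking the minimum filters out runs that were slowed down by other activity on the machine, which matters most for the sub-millisecond `Series` case.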