<a href="https://colab.research.google.com/github/chrismarkella/Kaggle-access-from-Google-Colab/blob/master/squeeze_the_dataframe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Reducing the size of a DataFrame.

In [1]:
!apt-get -qq install tree

Selecting previously unselected package tree.
(Reading database ... 135004 files and directories currently installed.)
Preparing to unpack .../tree_1.7.0-5_amd64.deb ...
Unpacking tree (1.7.0-5) ...
Setting up tree (1.7.0-5) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


In [0]:
import os

import numpy as np
import pandas as pd

from getpass import getpass 

In [3]:
def access_kaggle():
    """
    Access Kaggle from Google Colab.
    If /root/.kaggle does not exist, prompt for the Kaggle
    username and API key, then create the kaggle.json
    access file in the /root/.kaggle/ folder.
    """
    KAGGLE_ROOT = os.path.join('/root', '.kaggle')
    KAGGLE_PATH = os.path.join(KAGGLE_ROOT, 'kaggle.json')

    if '.kaggle' not in os.listdir(path='/root'):
        user = getpass(prompt='Kaggle username: ')
        key  = getpass(prompt='Kaggle API key: ')

        !mkdir $KAGGLE_ROOT
        # open(..., mode='w') creates the file, and the with block closes it.
        with open(KAGGLE_PATH, mode='w') as f:
            f.write('{"username":"%s", "key":"%s"}' %(user, key))
        # Restrict the credentials file to its owner.
        !chmod 600 $KAGGLE_PATH
        del user
        del key
        print('Kaggle is successfully set up. Good to go.')

access_kaggle()


Kaggle username: ··········
Kaggle API key: ··········
Kaggle is successfully set up. Good to go.


In [4]:
!kaggle datasets files benhamner/sf-bay-area-bike-share

name              size  creationDate         
---------------  -----  -------------------  
trip.csv          76MB  2019-11-14 06:26:55  
weather.csv      428KB  2019-11-14 06:26:55  
station.csv        6KB  2019-11-14 06:26:55  
database.sqlite    3GB  2019-11-14 06:26:55  
status.csv         2GB  2019-11-14 06:26:55  


In [5]:
!kaggle datasets download benhamner/sf-bay-area-bike-share -f status.csv

Downloading status.csv.zip to /content
 93% 169M/182M [00:01<00:00, 128MB/s]
100% 182M/182M [00:01<00:00, 130MB/s]


In [6]:
!tree -sh

.
├── [4.0K]  sample_data
│   ├── [1.7K]  anscombe.json
│   ├── [294K]  california_housing_test.csv
│   ├── [1.6M]  california_housing_train.csv
│   ├── [ 17M]  mnist_test.csv
│   ├── [ 35M]  mnist_train_small.csv
│   └── [ 930]  README.md
└── [182M]  status.csv.zip

1 directory, 7 files


In [7]:
!unzip status.csv.zip
!rm status.csv.zip
!tree -sh

Archive:  status.csv.zip
  inflating: status.csv              
.
├── [4.0K]  sample_data
│   ├── [1.7K]  anscombe.json
│   ├── [294K]  california_housing_test.csv
│   ├── [1.6M]  california_housing_train.csv
│   ├── [ 17M]  mnist_test.csv
│   ├── [ 35M]  mnist_train_small.csv
│   └── [ 930]  README.md
└── [1.9G]  status.csv

1 directory, 7 files


In [8]:
import time

def timer(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        rv = func(*args, **kwargs)
        end = time.time()
        return rv, end-start
    return wrapper

@timer
def load_data(csv_path):
    return pd.read_csv(csv_path, sep=',')

df, time_elapsed = load_data('status.csv')
print(f'time elapsed: {time_elapsed}')
print(df.shape)

time elapsed: 61.53510284423828
(71984434, 4)


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
station_id         int64
bikes_available    int64
docks_available    int64
time               object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB


In [10]:
df.describe()

| | station_id | bikes_available | docks_available |
|---|---|---|---|
| count | 71984430.0 | 71984430.0 | 71984430.0 |
| mean | 42.53149 | 8.394812 | 9.284729 |
| std | 23.76117 | 3.993586 | 4.175442 |
| min | 2.0 | 0.0 | 0.0 |
| 25% | 24.0 | 6.0 | 6.0 |
| 50% | 42.0 | 8.0 | 9.0 |
| 75% | 63.0 | 11.0 | 12.0 |
| max | 84.0 | 27.0 | 27.0 |


###The ranges of the numeric columns are pretty small.
- `station_id`: 2-84
- `bikes_available`: 0-27
- `docks_available`: 0-27

`int64` is far larger than such small numbers need: every value fits in an unsigned 8-bit integer (0-255).

The current size is 2.1GB. Let's try changing the `int64` columns to `np.uint8`.
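Before casting, it is worth confirming that every value really fits in the target type, since `astype` does not warn about values that fall outside it. The following is a minimal sketch on a toy frame (`toy` and its values are illustrative stand-ins, not the real `status.csv` data) that checks each column against the `uint8` limits and then lets `pd.to_numeric` choose the smallest unsigned type by itself.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame; the values span the ranges listed above.
toy = pd.DataFrame({
    "station_id": [2, 42, 84],
    "bikes_available": [0, 8, 27],
    "docks_available": [0, 9, 27],
})

limits = np.iinfo(np.uint8)  # limits.min == 0, limits.max == 255

# A column is safe to cast only if all of its values lie inside the uint8 range.
fits_uint8 = {
    col: bool(toy[col].min() >= limits.min and toy[col].max() <= limits.max)
    for col in toy.columns
}
print(fits_uint8)

# Alternative: let pandas pick the smallest unsigned integer type per column.
downcast = toy.apply(pd.to_numeric, downcast="unsigned")
print(downcast.dtypes)
```

The same check could be run on `df` itself before the `astype(np.uint8)` calls in the following cells.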

In [11]:
df.station_id = df.station_id.astype(np.uint8)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
station_id         uint8
bikes_available    int64
docks_available    int64
time               object
dtypes: int64(2), object(1), uint8(1)
memory usage: 1.7+ GB


###Changing only `station_id`'s data type saved over 400MB.

Now change the other two numeric columns as well.
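The saving can be checked by hand: each value shrinks from 8 bytes to 1 byte, so one column saves 7 bytes per row. A small sketch of that arithmetic, using the row count reported by `df.info()`:

```python
import numpy as np

n_rows = 71_984_434  # row count reported by df.info()

bytes_per_int64 = np.dtype(np.int64).itemsize  # 8 bytes per value
bytes_per_uint8 = np.dtype(np.uint8).itemsize  # 1 byte per value

# Bytes saved by converting a single column from int64 to uint8.
saved_bytes = n_rows * (bytes_per_int64 - bytes_per_uint8)
saved_gib = saved_bytes / 2**30
print(f"saved per column: {saved_bytes:,} bytes (~{saved_gib:.2f} GiB)")
```

This comes to a little under half a gibibyte per column, consistent with the drop from 2.1 GB to 1.7 GB that `df.info()` shows after rounding.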

In [12]:
df.bikes_available = df.bikes_available.astype(np.uint8)
df.docks_available = df.docks_available.astype(np.uint8)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
station_id         uint8
bikes_available    uint8
docks_available    uint8
time               object
dtypes: object(1), uint8(3)
memory usage: 755.1+ MB
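Two follow-ups may be useful here. First, the `+` in `755.1+ MB` means the figure is a lower bound: the strings in the `time` column are not counted unless a deep measurement is requested. Second, the small dtypes can be requested at load time, so the `int64` columns are never created. The sketch below uses a tiny in-memory CSV with made-up rows (the `time` format shown is an assumption, not taken from the real file).

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory CSV standing in for status.csv; the rows are made up.
csv_text = io.StringIO(
    "station_id,bikes_available,docks_available,time\n"
    "2,2,25,2013/08/29 12:06:01\n"
    "2,3,24,2013/08/29 12:07:01\n"
)

small_dtypes = {
    "station_id": np.uint8,
    "bikes_available": np.uint8,
    "docks_available": np.uint8,
}

# Ask read_csv for uint8 directly instead of converting afterwards.
toy = pd.read_csv(csv_text, dtype=small_dtypes)
print(toy.dtypes)

# deep=True also counts the memory held by the strings in the time column.
shallow = toy.memory_usage(deep=False)
deep = toy.memory_usage(deep=True)
print(deep - shallow)
```

On the full frame, `df.info(memory_usage='deep')` reports the deep figure directly, at the cost of a slower scan of the string column.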


In [13]:
double_df = pd.concat([df.copy(), df.copy()], axis='index')
double_df.shape

(143968868, 4)

In [14]:
df_288M = pd.concat([double_df.copy(), double_df.copy()], axis='index')
print(df_288M.shape)
df_288M.info()

(287937736, 4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 287937736 entries, 0 to 71984433
Data columns (total 4 columns):
station_id         uint8
bikes_available    uint8
docks_available    uint8
time               object
dtypes: object(1), uint8(3)
memory usage: 5.1+ GB


###Bonus
- Let's see how long it takes to re-center a column (subtract its mean) with 288 million entries. Sit back, this may take a while...

In [15]:
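_mean = df.station_id.mean()

# Assumed helper: the 'apply' branch calls it, but the notebook never defines it.
def re_mean_for_apply(row):
    return row.station_id - _mean

@timer
def re_mean(_df, method):
    # The results are discarded on purpose; only the elapsed time matters.
    if method == 'series':
        _df.station_id - _mean
    elif method == 'map':
        _df.station_id.map(lambda st_id: st_id - _mean)
    elif method == 'apply':
        _df.apply(func=re_mean_for_apply, axis='columns')
    return 1

_method = 'series'
_, time_elapsed = re_mean(df_288M, _method)
print(f'{_method:6}, time elapsed: {time_elapsed:7.3f}')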

series, time elapsed:   3.218


###Are you serious!?

The vectorized subtraction took 3.2 seconds for **288 million** rows.
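The `series` method is fast because the subtraction runs as a single vectorized operation over the whole column, whereas `map` calls a Python function once per element. The sketch below compares the two on a much smaller synthetic column (200,000 random station ids, an arbitrary size chosen for illustration) and checks that they produce the same values. Timings will vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic station ids in the 2-84 range; not the real data.
rng = np.random.default_rng(0)
ids = pd.Series(rng.integers(2, 85, size=200_000)).astype(np.uint8)
mean = ids.mean()

# Vectorized: one operation over the whole column.
start = time.perf_counter()
vectorized = ids - mean
t_vec = time.perf_counter() - start

# Element-wise: one Python function call per value.
start = time.perf_counter()
mapped = ids.map(lambda v: v - mean)
t_map = time.perf_counter() - start

print(f"vectorized: {t_vec:.4f}s, map: {t_map:.4f}s")
print("same result:", np.allclose(vectorized, mapped))
```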