# 1. Introduction

The main purpose of this notebook is to compare ways to open standard .csv files on your local machine.

We have two files, which we will open, inspect and transform with three different Python libraries:

- [__Pandas__](https://pandas.pydata.org/),
- [__Dask__](https://dask.org/),
- [__PySpark__](https://spark.apache.org/).



First, let's import the libraries we need:

In [1]:
from os import path
import pandas as pd

Now we can check the size of our data:

In [2]:
medium_file = path.expanduser("./data/raw/Crimes.csv")
big_file = path.expanduser("./data/raw/Vehicles.csv")

file_list = {"Crimes": medium_file,
             "Vehicles": big_file,
            }

# `>> 20` is integer division by 2**20, so sizes are truncated to whole MiB.
for key, value in file_list.items():
    print(f"{key} file size: {path.getsize(value) >> 20:.2f} MB")

Crimes file size: 1725.00 MB
Vehicles file size: 4814.00 MB
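
The bit shift in the cell above always yields a whole number, which is why both sizes end in `.00`. A minimal sketch of the difference between truncating and true division follows; the helper names and the byte count are hypothetical and are not taken from the actual files.

```python
def size_in_mib_truncated(num_bytes: int) -> int:
    """Integer MiB, as produced by the `>> 20` shift in the notebook."""
    return num_bytes >> 20


def size_in_mib(num_bytes: int) -> float:
    """Fractional MiB, using true division by 2**20."""
    return num_bytes / 2**20


# Hypothetical byte count, chosen only for illustration.
example_bytes = 1_809_000_000

truncated = size_in_mib_truncated(example_bytes)
exact = size_in_mib(example_bytes)
formatted = f"{exact:.2f}"

print(f"truncated: {truncated} MiB, exact: {formatted} MiB")
```

With true division the `:.2f` format specifier becomes meaningful, at the cost of a slightly different number than the one reported above.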


# 2. Pandas

We will start by loading the files into DataFrames with __Pandas__, an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

#### 2.1. Medium DataFrame

##### Read data:

In [3]:
%%time
medium_df = pd.read_csv(medium_file)

Wall time: 34.9 s


##### Show last 2 rows:

In [4]:
%%time
medium_df.tail(2)

Wall time: 0 ns


Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Districts,Police Beats
6877120,1392468,G106773,01/01/2001 12:00:00 AM,023XX W HURON ST,1210,DECEPTIVE PRACTICE,THEFT OF LABOR/SERVICES,RESIDENCE,False,False,...,-87.686272,"(41.893824874, -87.686272003)",24.0,21184.0,25.0,546.0,41.0,28.0,15.0,80.0
6877121,1310313,G000070,01/01/2001 12:00:00 AM,109XX S LONGWOOD DR,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,...,-87.671448,"(41.694977232, -87.671447845)",33.0,22212.0,74.0,380.0,42.0,13.0,9.0,257.0


##### Number of rows, columns:

In [5]:
%%time
medium_df.shape

Wall time: 0 ns


(6877122, 30)

##### Basic statistics:

In [6]:
%%time
medium_df.describe()

Wall time: 7 s


Unnamed: 0,ID,Beat,District,Ward,Community Area,X Coordinate,Y Coordinate,Year,Latitude,Longitude,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Districts,Police Beats
count,6877122.0,6877122.0,6877075.0,6262298.0,6263626.0,6811932.0,6811932.0,6877122.0,6811932.0,6811932.0,6792140.0,6811932.0,6794854.0,6797012.0,6794966.0,6794901.0,6795930.0,6795953.0
mean,6326030.0,1190.924,11.30045,22.69056,37.57498,1164523.0,1885727.0,2008.515,41.84203,-87.67178,27.38247,19108.71,38.7246,381.3732,25.53626,31.46023,14.91913,150.5174
std,3097822.0,703.2927,6.945665,13.83333,21.53849,17155.58,32686.63,5.157344,0.08994461,0.06208706,15.26851,5736.594,20.09641,230.0487,14.77743,19.14224,6.45364,78.50072
min,634.0,111.0,1.0,1.0,0.0,0.0,0.0,2001.0,36.61945,-91.68657,1.0,2733.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3467769.0,622.0,6.0,10.0,23.0,1152936.0,1859189.0,2004.0,41.76891,-87.71385,14.0,21184.0,25.0,176.0,12.0,15.0,10.0,83.0
50%,6310243.0,1111.0,10.0,22.0,32.0,1166001.0,1890614.0,2008.0,41.85551,-87.66615,27.0,21560.0,37.0,378.0,26.0,30.0,16.0,153.0
75%,9000136.0,1731.0,17.0,34.0,57.0,1176352.0,1909288.0,2013.0,41.90682,-87.62835,41.0,22243.0,58.0,577.0,37.0,50.0,20.0,221.0
max,11701390.0,2535.0,31.0,50.0,77.0,1205119.0,1951622.0,2019.0,42.02291,-87.52453,53.0,26912.0,77.0,801.0,50.0,61.0,25.0,277.0


##### Conditional statements:

In [7]:
%%time
medium_df.loc[(medium_df["Community Areas"] > 50) & (medium_df["Boundaries - ZIP Codes"] < 60) & (medium_df["Description"] == "AUTOMOBILE")].head(2)

Wall time: 599 ms


Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Districts,Police Beats
214,11694283,JC271784,05/20/2019 06:00:00 PM,009XX W DIVERSEY PKWY,910,MOTOR VEHICLE THEFT,AUTOMOBILE,RESIDENCE,False,False,...,-87.651911,"(41.932671001, -87.651910528)",38.0,21190.0,57.0,681.0,25.0,22.0,5.0,31.0
651,11693300,JC270665,05/20/2019 04:00:00 AM,071XX S GREEN ST,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,-87.645607,"(41.764525607, -87.645606586)",31.0,21559.0,66.0,512.0,32.0,11.0,17.0,215.0


##### Aggregations:

In [8]:
%%time
medium_df.groupby("Police Districts").agg("count").round(2).head(5)

Wall time: 4.37 s


Police Districts,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Beats
1.0,197490,197490,197490,197490,197490,197490,197490,197369,197490,197490,...,197490,197490,197490,197489,197490,197489,197453,197489,197489,197490
2.0,119541,119540,119541,119541,119541,119541,119541,119393,119541,119541,...,119541,119541,119541,119541,119541,119541,119541,119541,119541,119541
3.0,41,41,41,41,41,41,41,41,41,41,...,41,41,41,41,41,0,41,2,41,41
4.0,1453,1453,1453,1453,1453,1453,1453,1453,1453,1453,...,1453,1453,1453,1453,1453,0,505,0,6,1453
5.0,304851,304851,304851,304851,304851,304851,304851,304326,304851,304851,...,304851,304851,304851,304851,304851,304851,304851,304851,304851,304851


After each set of measurements we force a reset of all variables to free memory, and then import the libraries and define the needed variables once more.

In [9]:
%reset -f
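
`%reset -f` is an IPython magic, so it is only available inside a notebook or IPython session. In a plain Python script, a comparable effect for a single object can be sketched with `del` and `gc.collect()`. The sketch below assumes CPython, and the `Payload` class is a hypothetical stand-in for a large DataFrame.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large DataFrame."""

    def __init__(self, n: int) -> None:
        self.data = list(range(n))


big = Payload(1_000_000)
ref = weakref.ref(big)  # weak reference: lets us observe whether the object is alive

del big        # drop the only strong reference to the object
gc.collect()   # ask the garbage collector to reclaim anything unreachable

released = ref() is None
print(f"memory released: {released}")
```

Unlike `%reset -f`, this frees only the named object; imports and other variables stay in place.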

#### 2.2. Big DataFrame:

For the largest dataset we have to set __low_memory__ to __False__ to proceed with the operation; otherwise a DtypeWarning about mixed column types would arise.

In [10]:
from os import path
import pandas as pd

big_file = path.expanduser("./data/raw/Vehicles.csv")

##### Read data:

In [11]:
%%time
big_df = pd.read_csv(big_file,
                     low_memory=False)

Wall time: 8min 55s
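
Besides `low_memory=False`, another way to sidestep mixed-type inference is to declare the dtype of the ambiguous column up front. The sketch below assumes a tiny inline CSV with made-up values; the column names are borrowed from the Vehicles data, but the rows (including the text value `brak`) are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row sample: one column mixes numbers and text.
csv_text = "rok_produkcji,moc_silnika\n2005,66\nbrak,92\n1999,40\n"

# Declaring the column as str keeps every value as text,
# so pandas does not have to guess a type for it.
df = pd.read_csv(io.StringIO(csv_text), dtype={"rok_produkcji": str})

years = df["rok_produkcji"].tolist()
total_power = int(df["moc_silnika"].sum())  # this column is still inferred as numeric

print(years, total_power)
```

Only the listed column is forced to text; the remaining columns are inferred as usual.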


So a local computer can handle big files (> 4 GB), but it is quite a time-consuming task.

##### Show last 2 rows:

In [12]:
%%time
big_df.tail(2)

Wall time: 13.9 ms


Unnamed: 0.1,Unnamed: 0,pojazd_id,marka,kategoria,typ,model,wariant,wersja,rodzaj,podrodzaj,...,siedziba_wlasciciela_woj,siedziba_wlasciciela_pow,siedziba_wlasciciela_gmina,data_pierwszej_rej_w_kraju,createtimestamp,modifytimestamp,siedziba_wlasciciela_woj_teryt,akt_miejsce_rej_wojew_teryt,emisja_co2,emisja_co2_pal_alternatywne1
13063685,6753092,80800114204160,CITROEN,,,XANTIA,,,SAMOCHÓD OSOBOWY,KOMBI,...,MAZOWIECKIE,LEGIONOWSKI,JABŁONNA,1999-07-28,2019-03-15,2019-03-15,14.0,14,,
13063686,6753093,32820186672840,TOYOTA,13.0,,YARIS,,,SAMOCHÓD OSOBOWY,WIELOZADANIOWY,...,MAZOWIECKIE,WARSZAWSKI ZACHODNI,STARE BABICE,2013-10-29,2019-03-15,2019-03-15,14.0,14,104.0,


##### Number of rows, columns:

In [13]:
%%time
big_df.shape

Wall time: 0 ns


(13063687, 72)

##### Basic statistics:

In [14]:
%%time
big_df.describe()

Wall time: 28.3 s


Unnamed: 0.1,Unnamed: 0,pojazd_id,kategoria,pojemnosc_silnika,moc_do_masy,moc_silnika,moc_silnika_hybrydowego,masa_wlasna,masa_pgj,dopuszczalna_masa_calkowita,...,rozstaw_kol_max,rozstaw_kol_sred,rozstaw_kol_min,emisja_co2_redukcja,wersja_rpp,kod_rpp,siedziba_wlasciciela_woj_teryt,akt_miejsce_rej_wojew_teryt,emisja_co2,emisja_co2_pal_alternatywne1
count,13063690.0,13063690.0,4276366.0,12074010.0,1128423.0,9831770.0,12678.0,12516820.0,24215.0,12525750.0,...,1365783.0,1365972.0,1366715.0,24.0,7951085.0,7951085.0,9811910.0,13063690.0,1896983.0,941.0
mean,2514995.0,365230900000000.0,12700040000000.0,1912.105,0.05682439,70.77183,57.085567,1434.54,2145.469131,2809.068,...,1380.414,1549.918,1387.077,15.458333,6.874341,114585.1,12.70523,12.59851,135.7897,138.123273
std,1784089.0,1335397000000000.0,356144800000000.0,1941.93,1.198505,68.7427,34.860855,1855.161,3409.945596,5280.302,...,3644.857,22988.55,7568.882,34.118692,3.533673,239364.7,2.371407,1.606318,102.6126,16.823227
min,0.0,2233634.0,0.0,0.0,0.0,0.0,0.9,0.0,1.0,0.0,...,0.0,0.0,0.0,4.0,1.0,1.0,0.0,10.0,0.0,96.0
25%,1088640.0,26899650000000.0,13.0,1229.0,0.0,40.0,45.0,770.0,1205.0,1336.0,...,1478.0,1467.0,1468.0,5.0,1.0,3.0,12.0,12.0,118.0,131.0
50%,2177281.0,53777460000000.0,13.0,1596.0,0.0142,66.0,53.0,1159.0,1445.0,1690.0,...,1536.0,1529.0,1525.0,5.0,9.0,3.0,14.0,14.0,135.0,132.0
75%,3487172.0,80643250000000.0,11111110.0,1994.0,0.0387,92.0,60.0,1500.0,1885.0,2105.0,...,1574.0,1567.0,1562.0,6.0,9.0,201000.0,14.0,14.0,154.0,155.0
max,6753093.0,9007160000000000.0,1e+16,99973.0,890.0,12902.0,999.0,911940.0,105535.0,920920.0,...,2589258.0,8961559.0,5025828.0,132.0,9.0,1699000.0,73.0,14.0,121067.0,163.0


##### Conditional statements:

In [15]:
%%time
big_df.loc[((big_df["marka"] == "TOYOTA") | (big_df["marka"] == "FORD")) &  (big_df["emisja_co2"] < 100)].head(2)

Wall time: 1.98 s


Unnamed: 0.1,Unnamed: 0,pojazd_id,marka,kategoria,typ,model,wariant,wersja,rodzaj,podrodzaj,...,siedziba_wlasciciela_woj,siedziba_wlasciciela_pow,siedziba_wlasciciela_gmina,data_pierwszej_rej_w_kraju,createtimestamp,modifytimestamp,siedziba_wlasciciela_woj_teryt,akt_miejsce_rej_wojew_teryt,emisja_co2,emisja_co2_pal_alternatywne1
868,868,51955064994346,TOYOTA,13.0,XW3(A),PRIUS,ZVW30(H),ZVW30L-AHXEBW(2C),SAMOCHÓD OSOBOWY,HATCHBACK,...,ŁÓDZKIE,ŁÓDŹ,ŁÓDŹ,2012-01-13,2019-03-15,2019-03-15,10.0,10,89.0,
1620,1620,4125588963370607,TOYOTA,13.0,XP13M(A),YARIS,KSP13(MH),KSP130L-CHMRKW(3L),SAMOCHÓD OSOBOWY,WIELOZADANIOWY,...,,,,2018-05-29,2019-03-15,2019-03-15,,10,99.0,


##### Aggregations:

In [16]:
%%time
big_df.groupby("rodzaj").agg("count").round(2).head(5)

Wall time: 47.7 s


rodzaj,Unnamed: 0,pojazd_id,marka,kategoria,typ,model,wariant,wersja,podrodzaj,przeznaczenie,...,siedziba_wlasciciela_woj,siedziba_wlasciciela_pow,siedziba_wlasciciela_gmina,data_pierwszej_rej_w_kraju,createtimestamp,modifytimestamp,siedziba_wlasciciela_woj_teryt,akt_miejsce_rej_wojew_teryt,emisja_co2,emisja_co2_pal_alternatywne1
AUTOBUS,52082,52082,51424,15830,29143,50963,29158,29158,52027,49195,...,38182,38499,38345,52014,52082,52082,38182,52082,1586,0
AUTOBUSY,3,3,3,0,0,3,0,0,3,0,...,3,3,3,1,3,3,3,3,0,0
BRAK DANYCH,3,3,3,1,0,1,0,0,3,3,...,3,3,3,3,3,3,3,3,0,0
CIAGN. ROLNICZY.,1,1,1,0,0,1,0,0,1,0,...,1,1,1,1,1,1,1,1,0,0
CIAGN. SAMOCHOD.,2,2,2,0,0,2,0,0,2,0,...,2,2,2,2,2,2,2,2,0,0


In [17]:
%reset -f

# 3. Dask

Now we will repeat the above operations with Dask, a library for parallel computing in Python.

#### 3.1. Medium DataFrame:

In [18]:
from os import path
import dask.dataframe as dd

medium_file = path.expanduser("./data/raw/Crimes.csv")

##### Read data:

In [19]:
%%time
medium_dd = dd.read_csv(medium_file,
                        assume_missing=True, # treat integer columns as floats, since Dask reported NaN values in an integer column
                        )

Wall time: 71.4 ms


Creating the DataFrame was suspiciously fast, but we have to remember that Dask evaluates every method __lazily__: to actually compute a result, we have to call the __.compute()__ method. Dask then computes the result block by block, running independent tasks in parallel.

Still, let's check the timing of the basic tasks we would like to perform on our DataFrames.

##### Show last 2 rows:

In [20]:
%%time
medium_dd.tail(2)

Wall time: 477 ms


Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Districts,Police Beats
67808,1392468.0,G106773,01/01/2001 12:00:00 AM,023XX W HURON ST,1210,DECEPTIVE PRACTICE,THEFT OF LABOR/SERVICES,RESIDENCE,False,False,...,-87.686272,"(41.893824874, -87.686272003)",24.0,21184.0,25.0,546.0,41.0,28.0,15.0,80.0
67809,1310313.0,G000070,01/01/2001 12:00:00 AM,109XX S LONGWOOD DR,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,...,-87.671448,"(41.694977232, -87.671447845)",33.0,22212.0,74.0,380.0,42.0,13.0,9.0,257.0


##### Number of rows, columns:

In [21]:
%%time
print(f"shape: ({len(medium_dd)}, {len(medium_dd.columns)})")

shape: (6877122, 30)
Wall time: 25.1 s


##### Basic statistics:

In [22]:
%%time
medium_dd.describe().compute()

Wall time: 1min 20s


Unnamed: 0,ID,Arrest,Domestic,Beat,District,Ward,Community Area,X Coordinate,Y Coordinate,Year,Latitude,Longitude,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Districts,Police Beats
count,6877122.0,6877122.0,6877122.0,6877122.0,6877075.0,6262298.0,6263626.0,6811932.0,6811932.0,6877122.0,6811932.0,6811932.0,6792140.0,6811932.0,6794854.0,6797012.0,6794966.0,6794901.0,6795930.0,6795953.0
mean,6326030.0,0.2762272,0.13186,1190.924,11.30045,22.69056,37.57498,1164523.0,1885727.0,2008.515,41.84203,-87.67178,27.38247,19108.71,38.7246,381.3732,25.53626,31.46023,14.91913,150.5174
std,3097822.0,0.4471306,0.3383385,703.2927,6.945665,13.83333,21.53849,17155.58,32686.63,5.157344,0.08994461,0.06208706,15.26851,5736.594,20.09641,230.0487,14.77743,19.14224,6.45364,78.50072
min,634.0,0.0,0.0,111.0,1.0,1.0,0.0,0.0,0.0,2001.0,36.61945,-91.68657,1.0,2733.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,4274221.0,0.0,0.0,631.0,6.0,15.0,26.0,1153612.0,1861556.0,2005.0,41.77552,-87.7113,15.0,21186.0,25.0,184.0,13.0,16.0,10.0,85.0
50%,7994002.0,0.0,0.0,1123.0,10.0,24.0,52.0,1166922.0,1894366.0,2011.0,41.86585,-87.66321,28.0,21569.0,37.0,382.0,27.0,30.0,16.0,157.0
75%,11604070.0,1.0,0.0,1823.0,17.0,35.0,76.0,1176547.0,1910536.0,2019.0,41.91032,-87.62772,41.0,22248.0,59.0,580.0,39.0,52.0,20.0,224.0
max,11701390.0,1.0,1.0,2535.0,31.0,50.0,77.0,1205119.0,1951622.0,2019.0,42.02291,-87.52453,53.0,26912.0,77.0,801.0,50.0,61.0,25.0,277.0


##### Conditional statements:

In [23]:
%%time
medium_dd.loc[(medium_dd["Community Areas"] > 50) & (medium_dd["Boundaries - ZIP Codes"] < 60) & (medium_dd["Description"] == "AUTOMOBILE")].head(2)

Wall time: 1.58 s


Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Districts,Police Beats
214,11694283.0,JC271784,05/20/2019 06:00:00 PM,009XX W DIVERSEY PKWY,910,MOTOR VEHICLE THEFT,AUTOMOBILE,RESIDENCE,False,False,...,-87.651911,"(41.932671001, -87.651910528)",38.0,21190.0,57.0,681.0,25.0,22.0,5.0,31.0
651,11693300.0,JC270665,05/20/2019 04:00:00 AM,071XX S GREEN ST,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,-87.645607,"(41.764525607, -87.645606586)",31.0,21559.0,66.0,512.0,32.0,11.0,17.0,215.0


##### Aggregations:

In [24]:
%%time
medium_dd.groupby("Police Districts").agg("count").head(5)

Wall time: 29.6 s


Police Districts,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Beats
1.0,197490,197490,197490,197490,197490,197490,197490,197369,197490,197490,...,197490,197490,197490,197489,197490,197489,197453,197489,197489,197490
2.0,119541,119540,119541,119541,119541,119541,119541,119393,119541,119541,...,119541,119541,119541,119541,119541,119541,119541,119541,119541,119541
3.0,41,41,41,41,41,41,41,41,41,41,...,41,41,41,41,41,0,41,2,41,41
4.0,1453,1453,1453,1453,1453,1453,1453,1453,1453,1453,...,1453,1453,1453,1453,1453,0,505,0,6,1453
5.0,304851,304851,304851,304851,304851,304851,304851,304326,304851,304851,...,304851,304851,304851,304851,304851,304851,304851,304851,304851,304851


In [25]:
%reset -f

#### 3.2. Big DataFrame:

In [26]:
from os import path
import dask.dataframe as dd

big_file = path.expanduser("./data/raw/Vehicles.csv")

##### Read data:

In [27]:
%%time
big_dd = dd.read_csv(big_file,
                     assume_missing=True, # treat integer columns as floats, since Dask reported NaN values in an integer column
                     dtype={'rok_produkcji': 'object'}, # set explicitly because Dask's dtype inference failed for this column
                     low_memory=False, # columns have mixed types and Dask warned about them
                     )

Wall time: 172 ms
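
The reason missing values matter for integer columns can be shown with plain pandas: a NumPy integer column cannot hold NaN, so a column with gaps is read as float. The sketch below uses a hypothetical three-row CSV; the nullable `"Int64"` dtype is a pandas feature, and whether Dask's `read_csv` treats it the same way is not verified here.

```python
import io

import pandas as pd

# Hypothetical sample: the second row has no CO2 value.
csv_text = "pojazd_id,emisja_co2\n1,89\n2,\n3,104\n"

# Default inference: the gap forces the whole column to float64.
as_float = pd.read_csv(io.StringIO(csv_text))

# pandas' nullable integer dtype keeps integers and marks the gap as missing.
as_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"emisja_co2": "Int64"})

float_dtype = str(as_float["emisja_co2"].dtype)
nullable_dtype = str(as_nullable["emisja_co2"].dtype)
missing = int(as_nullable["emisja_co2"].isna().sum())

print(float_dtype, nullable_dtype, missing)
```

This is also why the identifiers in the Dask output above appear as floats (for example `6753092.0`) while pandas showed them as integers.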


##### Show last 2 rows:

In [28]:
%%time
big_dd.tail(2)

Wall time: 2.02 s


Unnamed: 0.1,Unnamed: 0,pojazd_id,marka,kategoria,typ,model,wariant,wersja,rodzaj,podrodzaj,...,siedziba_wlasciciela_woj,siedziba_wlasciciela_pow,siedziba_wlasciciela_gmina,data_pierwszej_rej_w_kraju,createtimestamp,modifytimestamp,siedziba_wlasciciela_woj_teryt,akt_miejsce_rej_wojew_teryt,emisja_co2,emisja_co2_pal_alternatywne1
149139,6753092.0,80800110000000.0,CITROEN,,,XANTIA,,,SAMOCHÓD OSOBOWY,KOMBI,...,MAZOWIECKIE,LEGIONOWSKI,JABŁONNA,1999-07-28,2019-03-15,2019-03-15,14.0,14.0,,
149140,6753093.0,32820190000000.0,TOYOTA,13.0,,YARIS,,,SAMOCHÓD OSOBOWY,WIELOZADANIOWY,...,MAZOWIECKIE,WARSZAWSKI ZACHODNI,STARE BABICE,2013-10-29,2019-03-15,2019-03-15,14.0,14.0,104.0,


##### Number of rows, columns:

In [29]:
%%time
print(f"shape: ({len(big_dd)}, {len(big_dd.columns)})")

shape: (13063687, 72)
Wall time: 1min 40s


##### Basic statistics:

In [30]:
%%time
big_dd.describe().compute()

ValueError: No non-trivial arrays found

At the time of this analysis there seems to be a bug in this method; we are waiting for it to be fixed (https://github.com/dask/dask/issues/2792).

##### Conditional statements:

In [31]:
%%time
big_dd.loc[((big_dd["marka"] == "TOYOTA") | (big_dd["marka"] == "FORD")) &  (big_dd["emisja_co2"] < 100)].head(2)

Wall time: 2.32 s


Unnamed: 0.1,Unnamed: 0,pojazd_id,marka,kategoria,typ,model,wariant,wersja,rodzaj,podrodzaj,...,siedziba_wlasciciela_woj,siedziba_wlasciciela_pow,siedziba_wlasciciela_gmina,data_pierwszej_rej_w_kraju,createtimestamp,modifytimestamp,siedziba_wlasciciela_woj_teryt,akt_miejsce_rej_wojew_teryt,emisja_co2,emisja_co2_pal_alternatywne1
868,868.0,51955060000000.0,TOYOTA,13.0,XW3(A),PRIUS,ZVW30(H),ZVW30L-AHXEBW(2C),SAMOCHÓD OSOBOWY,HATCHBACK,...,ŁÓDZKIE,ŁÓDŹ,ŁÓDŹ,2012-01-13,2019-03-15,2019-03-15,10.0,10.0,89.0,
1620,1620.0,4125589000000000.0,TOYOTA,13.0,XP13M(A),YARIS,KSP13(MH),KSP130L-CHMRKW(3L),SAMOCHÓD OSOBOWY,WIELOZADANIOWY,...,,,,2018-05-29,2019-03-15,2019-03-15,,10.0,99.0,


##### Aggregations:

In [32]:
%%time
big_dd.groupby("rodzaj").agg("count").round(2).head(5)

Wall time: 2min 3s


rodzaj,Unnamed: 0,pojazd_id,marka,kategoria,typ,model,wariant,wersja,podrodzaj,przeznaczenie,...,siedziba_wlasciciela_woj,siedziba_wlasciciela_pow,siedziba_wlasciciela_gmina,data_pierwszej_rej_w_kraju,createtimestamp,modifytimestamp,siedziba_wlasciciela_woj_teryt,akt_miejsce_rej_wojew_teryt,emisja_co2,emisja_co2_pal_alternatywne1
AUTOBUS,52082,52082,51424,15830,29143,50963,29158,29158,52027,49195,...,38182,38499,38345,52014,52082,52082,38182,52082,1586,0
AUTOBUSY,3,3,3,0,0,3,0,0,3,0,...,3,3,3,1,3,3,3,3,0,0
BRAK DANYCH,3,3,3,1,0,1,0,0,3,3,...,3,3,3,3,3,3,3,3,0,0
CIAGN. ROLNICZY.,1,1,1,0,0,1,0,0,1,0,...,1,1,1,1,1,1,1,1,0,0
CIAGN. SAMOCHOD.,2,2,2,0,0,2,0,0,2,0,...,2,2,2,2,2,2,2,2,0,0


In [33]:
%reset -f

Interestingly, Dask's describe() raised a ValueError on the big file, where Pandas computed the result without any problem.

# 4. Spark

# 5. Summary

TODO:

https://blog.dask.org/2018/08/28/dataframe-performance-high-level

http://docs.dask.org/en/latest/spark.html