# Prerequisites 

Go to [my Drive folder](https://drive.google.com/drive/u/2/folders/13p-gOYiPuyFNu4iH2qh_SNaR5vPrHtC_), click the three dots, and create a shortcut to it in your own Drive.

**Some rows are duplicated under the same name, because one name can have multiple Wikipedia entries.**
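A minimal sketch of how such duplicates could be spotted and collapsed with pandas. It assumes duplicated rows share the same `quoteID` and differ only in the matched entry (`id`); the toy rows and the `John Smith` QIDs are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the quotes table: quote "q2" appears twice because the
# speaker name matches two different entries (hypothetical QIDs).
quotes = pd.DataFrame(
    {
        "quoteID": ["q1", "q2", "q2", "q3"],
        "speaker": ["Sue Myrick", "John Smith", "John Smith", "None"],
        "id": ["Q367796", "Q100", "Q200", None],
    }
)

# Flag every row whose quoteID occurs more than once.
is_duplicated = quotes.duplicated(subset="quoteID", keep=False)
print(quotes[is_duplicated])

# Keep a single row per quote (here simply the first match).
unique_quotes = quotes.drop_duplicates(subset="quoteID", keep="first")
print(len(quotes), "->", len(unique_quotes))
```

Whether to keep the first match or all matches depends on the analysis; keeping the first is only one possible choice.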

# Imports and data

In [1]:
pip install --upgrade pyarrow

Collecting pyarrow
  Downloading pyarrow-6.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
     |████████████████████████████████| 25.6 MB 61.4 MB/s
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 3.0.0
    Uninstalling pyarrow-3.0.0:
      Successfully uninstalled pyarrow-3.0.0
Successfully installed pyarrow-6.0.1


In [2]:
from google.colab import drive
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

from tqdm import tqdm

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


Read the files as a dataset, as described in https://arrow.apache.org/docs/python/dataset.html

In [4]:
path = "/content/drive/MyDrive/quotes/quotes-2020-enhanced10000"
dataset = ds.dataset(path, format="parquet")

List the files in the dataset.

In [None]:
dataset.files

Count the rows without loading the whole dataset.

In [9]:
%%time
dataset.count_rows()

CPU times: user 3.28 ms, sys: 0 ns, total: 3.28 ms
Wall time: 3.3 ms


13588576

Read the first n rows.

In [None]:
rows = 100
quotes = dataset.head(rows).to_pandas()

Read the whole dataset into memory.

In [None]:
quotes = dataset.to_table().to_pandas()

Read the dataset in batches.

In [None]:
for batch in dataset.to_batches():
  # `quotes` is overwritten on every iteration, so process each batch here
  quotes = batch.to_pandas()

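Load only the columns you need. **This is faster than filtering after loading!** In the timings below, reading the first batch with three columns takes about 81 ms of wall time, versus about 572 ms when every column is loaded and then filtered in pandas.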


In [13]:
%%time
for batch in dataset.to_batches(columns=["quoteID", "speaker", "gender"]):
  quotes = batch.to_pandas()
  break

CPU times: user 56.2 ms, sys: 37.6 ms, total: 93.8 ms
Wall time: 80.6 ms


In [14]:
%%time
for batch in dataset.to_batches():
  quotes = batch.to_pandas()
  quotes = quotes[["quoteID", "speaker", "gender"]]
  break

CPU times: user 466 ms, sys: 364 ms, total: 831 ms
Wall time: 572 ms


Describe the data and inspect its structure.

In [None]:
quotes.head()

| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase | date_of_birth | nationality | gender | ethnic_group | occupation | party | academic_degree | id | candidacy | religion | _merge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-28-000082 | [ D ] espite the efforts of the partners to cr... | | | 2020-01-28 08:04:05 | 1 | [[None, 0.7272], [Prime Minister Netanyahu, 0.... | [http://israelnationalnews.com/News/News.aspx/... | E | | | | | | | | | | | left_only |
| 1 | 2020-01-16-000088 | [ Department of Homeland Security ] was livid ... | Sue Myrick | Q367796 | 2020-01-16 12:00:13 | 1 | [[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy... | [http://thehill.com/opinion/international/4782... | E | [+1941-08-01T00:00:00Z] | United States of America | female | | politician | Republican Party | | Q367796 | | United Methodist Church | both |
| 2 | 2020-02-10-000142 | ... He (Madhav) also disclosed that the illega... | | | 2020-02-10 23:45:54 | 1 | [[None, 0.8926], [Prakash Rai, 0.1074]] | [https://indianexpress.com/article/business/ec... | E | | | | | | | | | | | left_only |
| 3 | 2020-02-15-000053 | ... [ I ] f it gets to the floor, | | | 2020-02-15 14:12:51 | 2 | [[None, 0.581], [Andy Harris, 0.4191]] | [https://patriotpost.us/opinion/68622-trump-bu... | E | | | | | | | | | | | left_only |
| 4 | 2020-01-24-000168 | [ I met them ] when they just turned 4 and 7. ... | Meghan King Edmonds | Q20684375 | 2020-01-24 20:37:09 | 4 | [[Meghan King Edmonds, 0.5446], [None, 0.2705]... | [https://people.com/parents/meghan-king-edmond... | E | [+1984-09-26T00:00:00Z] | | female | | | | | Q20684375 | | | both |


In [None]:
quotes.describe()

| | numOccurrences |
|---|---|
| count | 100.0 |
| mean | 1.49 |
| std | 1.720318 |
| min | 1.0 |
| 25% | 1.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 12.0 |


In [None]:
quotes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   quoteID          100 non-null    object        
 1   quotation        100 non-null    object        
 2   speaker          100 non-null    object        
 3   qids             66 non-null     object        
 4   date             100 non-null    datetime64[ns]
 5   numOccurrences   100 non-null    int64         
 6   probas           100 non-null    object        
 7   urls             100 non-null    object        
 8   phase            100 non-null    object        
 9   date_of_birth    54 non-null     object        
 10  nationality      49 non-null     object        
 11  gender           63 non-null     object        
 12  ethnic_group     3 non-null      object        
 13  occupation       61 non-null     object        
 14  party            12 non-null     object    