# Python Programming Module

# Python Track - Lesson 22/23: Review and Problem Solving

<img align="center" style="padding-right:10px;" src="Figuras/aula-10_fig_01.png">

__Objective__:

Review and consolidate the advanced __Pandas__ content.


## Review of data persistence in Pandas

In this week's practical exercises we worked with data stored in __CSV__ files obtained from different sources.

In data analysis and machine learning, it is common practice to store data in __CSV__ format.

Although __CSV__ stores data in a rectangular, tabular layout, it is not always the best choice for persisting every __Pandas__ ``DataFrame``.

__CSV__ files tend to be slow to read and write, take up more memory and disk space and, most importantly, do not store any information about data types.

### How does CSV handle different data types?

__Pandas__ supports a wide variety of data types and subtypes to make it easier to work with large datasets.

Some of the most common types are ``object``, ``string``, ``timedelta``, ``int``, ``float``, etc.

__Pandas__ uses __NumPy__ arrays to store the data. Each data type, such as ``int``, is therefore stored with a specific width, such as ``int8``, ``int16``, etc., for greater efficiency.
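To make the efficiency argument concrete, the sketch below stores the same 1000 small integers once as ``int64`` and once as ``int8`` and compares the bytes used by the values. The names ``values``, ``as_int64`` and ``as_int8`` are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

# 1000 small integers (0..99) that fit in a single signed byte
values = pd.Series(np.arange(1000) % 100)

as_int64 = values.astype("int64")
as_int8 = values.astype("int8")

# memory_usage(index=False) counts only the bytes used by the values
bytes_int64 = as_int64.memory_usage(index=False)
bytes_int8 = as_int8.memory_usage(index=False)

print("int64:", bytes_int64, "bytes")
print("int8: ", bytes_int8, "bytes")
```

Each ``int64`` value takes 8 bytes and each ``int8`` value takes 1 byte, so the narrower type uses one eighth of the memory for the same data.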

However, __CSV__ does not store any information about data types.

The user has to specify them when calling ``read_csv``, and types such as ``timedelta`` are stored as strings.
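As a minimal sketch of what "specifying the types" looks like, the example below writes a tiny ``DataFrame`` to CSV text in memory and reads it back twice: once letting ``read_csv`` guess, and once passing the ``dtype`` and ``parse_dates`` arguments. The frame ``df_small`` and its column values are invented for this illustration.

```python
import io
import pandas as pd

df_small = pd.DataFrame(
    {
        "Int8": pd.Series([1, -2, 3], dtype="int8"),
        "Category": pd.Series(["a", "b", "a"], dtype="category"),
        "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    }
)

# Serialize to CSV text in memory instead of writing a file
csv_text = df_small.to_csv(index=False)

# Without hints, read_csv has to guess each column's type from the text
guessed = pd.read_csv(io.StringIO(csv_text))

# With hints, the original types are requested explicitly
restored = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Int8": "int8", "Category": "category"},
    parse_dates=["Date"],
)

print(guessed.dtypes)
print(restored.dtypes)
```

Without hints the integer column is widened and the category and date columns come back as generic text columns; with hints each column gets the requested type again, at the cost of having to know and maintain that type list outside the file.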

In [1]:
# Let's read a CSV with Pandas
import pandas as pd
import numpy as np



In [2]:
import random
import string
import datetime
import os

In [3]:
def gera_strings(tamanho):
    '''
    Generate a random word with 3 to `tamanho` letters
    '''
    letras = string.ascii_letters
    palavra = ''.join([random.choice(letras) for i in range(random.randint(3, tamanho))])
    return palavra

def get_sentence(tamanho):
    '''
    Generate a random sentence with 3 to `tamanho` words
    '''
    sentence = ' '.join([gera_strings(random.randint(3, 10)) for i in range(random.randint(3, tamanho))])
    return sentence

def gera_data(inicio, fim, formato):
    '''
    Generate a random date between inicio (inclusive) and fim (exclusive),
    returned as a string in the given format
    '''
    inicio = datetime.datetime.strptime(inicio, formato)
    fim = datetime.datetime.strptime(fim, formato)
    total = fim - inicio
    # Length of the interval in whole seconds
    int_total = (total.days * 24 * 60 * 60) + total.seconds
    random_second = random.randrange(int_total)
    return (inicio + datetime.timedelta(seconds=random_second)).strftime(formato)

def random_data(size):
    data = []
    for i in range(size):
        data.append(
            [random.randint(-127,127), 
             random.randint(-32768,32767),  
             random.randint(-2147483648,2147483647), 
             random.randint(-9223372036854775808 ,9223372036854775807),  
             random.randint(0,255),  
             round(random.uniform(0,10000),2),
             round(random.uniform(0,1000000),2),
             gera_strings(10),
             get_sentence(5),
             random.choice([gera_strings(10) for i in range(25)]),
             gera_data("1900-01-01","2020-05-01","%Y-%m-%d"),
             gera_data("1900-01-01T00:00:00","2020-05-01T23:59:59","%Y-%m-%dT%H:%M:%S"),
             random.choice([True,False])])
    return data

def gera_indice(size):
    data = []
    cont = 0
    while cont < size:
        strIndex = gera_strings(5)
        strIndex = strIndex.upper() + str(random.randint(0,1000))
        while strIndex in data:
            strIndex = gera_strings(5)
            strIndex = strIndex.upper() + str(random.randint(0,1000))
        data.append(strIndex)
        cont += 1
    return data

def test_data(size, drop_timedelta, drop_timezone):
    data = random_data(size)
    index = gera_indice(size)
    labels = ["Int8", "Int16", "Int32", "Int64", "UInt8", 
              "Float32", "Float64", "String", "Sentence", 
              "Category", "Date", "DateTime", "Bool"]
    
    df = pd.DataFrame(data,index=index, columns=labels)

    df["Int8"] = df["Int8"].astype("int8")
    df["Int16"] = df["Int16"].astype("int16")
    df["Int32"] = df["Int32"].astype("int32")
    df["UInt8"] = df["UInt8"].astype("uint8")
    df["Float32"] = df["Float32"].astype("float32")
    df["Category"] = df["Category"].astype("category")
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
    df["DateTime"] = pd.to_datetime(df["DateTime"], format="%Y-%m-%dT%H:%M:%S")
    if not drop_timedelta:
        df["TimeDelta"] = df["DateTime"]-df["Date"]
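    if not drop_timezone:
        # Build the timezone-aware column from the naive "DateTime" column;
        # local times that are ambiguous or nonexistent around DST changes become NaT
        df["DateTime+Zone"] = df["DateTime"].dt.tz_localize(
            'Europe/Vienna', ambiguous="NaT", nonexistent="NaT")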
    return df

In [4]:
dataset_size = 50000
df = test_data(dataset_size, True, True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, GNK434 to IJXU8
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Int8      50000 non-null  int8          
 1   Int16     50000 non-null  int16         
 2   Int32     50000 non-null  int32         
 3   Int64     50000 non-null  int64         
 4   UInt8     50000 non-null  uint8         
 5   Float32   50000 non-null  float32       
 6   Float64   50000 non-null  float64       
 7   String    50000 non-null  object        
 8   Sentence  50000 non-null  object        
 9   Category  50000 non-null  category      
 10  Date      50000 non-null  datetime64[ns]
 11  DateTime  50000 non-null  datetime64[ns]
 12  Bool      50000 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](2), float32(1), float64(1), int16(1), int32(1), int64(1), int8(1), object(2), uint8(1)
memory usage: 4.9+ MB


In [5]:
df.head()

,Int8,Int16,Int32,Int64,UInt8,Float32,Float64,String,Sentence,Category,Date,DateTime,Bool
GNK434,-90,19538,1285599156,-8445666702891533375,50,2805.679932,377550.24,yLDNw,Ysqa BVDvHW BQGvLoOQDI JODo UbxP,ZLjTbu,1981-02-21,1944-01-13 06:39:58,True
HSFG805,102,24520,-1815828633,-1189016105016985607,109,3379.5,202050.33,qqvL,LNu sVQhO xTlcK dGIgL ghzdfhvFL,bhDCy,1927-07-17,1991-09-13 18:54:11,True
OKS957,-2,4245,-1410703770,8183009552400292276,246,7415.27002,274061.23,QxI,sYaK YrAsIrEjR nWoXt,wzyQRvp,1975-05-09,1985-08-24 20:13:12,True
DUB110,-127,560,-9356509,4600095471447551947,24,2470.070068,222552.13,ugBIYIvCz,tJshehcX fmh lAhb bvH,LSdZLkB,1910-04-11,1921-05-17 01:25:52,False
WRLNP791,-89,-18455,-658231333,-2746798921101420016,30,9013.990234,956196.18,dWzgZGsuD,MoLdfu GVR rJqaWFmS jNW FbT,UsIqc,1909-11-06,2009-05-16 07:15:32,True


In [6]:
%timeit -n 1 -r 1  df.to_csv("test_data.csv", index=True, index_label="Index")
%timeit -n 1 -r 1  pd.read_csv("test_data.csv", index_col="Index")
dfCSV = pd.read_csv("test_data.csv", index_col="Index")
dfCSV.info()
dfCSV.head()

305 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
117 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, GNK434 to IJXU8
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Int8      50000 non-null  int64  
 1   Int16     50000 non-null  int64  
 2   Int32     50000 non-null  int64  
 3   Int64     50000 non-null  int64  
 4   UInt8     50000 non-null  int64  
 5   Float32   50000 non-null  float64
 6   Float64   50000 non-null  float64
 7   String    50000 non-null  object 
 8   Sentence  50000 non-null  object 
 9   Category  50000 non-null  object 
 10  Date      50000 non-null  object 
 11  DateTime  50000 non-null  object 
 12  Bool      50000 non-null  bool   
dtypes: bool(1), float64(2), int64(5), object(5)
memory usage: 5.0+ MB


Index,Int8,Int16,Int32,Int64,UInt8,Float32,Float64,String,Sentence,Category,Date,DateTime,Bool
GNK434,-90,19538,1285599156,-8445666702891533375,50,2805.68,377550.24,yLDNw,Ysqa BVDvHW BQGvLoOQDI JODo UbxP,ZLjTbu,1981-02-21,1944-01-13 06:39:58,True
HSFG805,102,24520,-1815828633,-1189016105016985607,109,3379.5,202050.33,qqvL,LNu sVQhO xTlcK dGIgL ghzdfhvFL,bhDCy,1927-07-17,1991-09-13 18:54:11,True
OKS957,-2,4245,-1410703770,8183009552400292276,246,7415.27,274061.23,QxI,sYaK YrAsIrEjR nWoXt,wzyQRvp,1975-05-09,1985-08-24 20:13:12,True
DUB110,-127,560,-9356509,4600095471447551947,24,2470.07,222552.13,ugBIYIvCz,tJshehcX fmh lAhb bvH,LSdZLkB,1910-04-11,1921-05-17 01:25:52,False
WRLNP791,-89,-18455,-658231333,-2746798921101420016,30,9013.99,956196.18,dWzgZGsuD,MoLdfu GVR rJqaWFmS jNW FbT,UsIqc,1909-11-06,2009-05-16 07:15:32,True


In [7]:
file_stats = os.stat("test_data.csv")
print("File size: {} bytes".format(file_stats.st_size))

File size: 7238854 bytes


In [8]:
%timeit -n 1 -r 1  df.to_csv("test_data.zip", index=True, index_label="Index", compression="zip")
%timeit -n 1 -r 1  pd.read_csv("test_data.zip", index_col="Index", compression="zip")
dfZIP = pd.read_csv("test_data.zip", index_col="Index", compression="zip")
dfZIP.info()
dfZIP.head()

546 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
143 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, GNK434 to IJXU8
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Int8      50000 non-null  int64  
 1   Int16     50000 non-null  int64  
 2   Int32     50000 non-null  int64  
 3   Int64     50000 non-null  int64  
 4   UInt8     50000 non-null  int64  
 5   Float32   50000 non-null  float64
 6   Float64   50000 non-null  float64
 7   String    50000 non-null  object 
 8   Sentence  50000 non-null  object 
 9   Category  50000 non-null  object 
 10  Date      50000 non-null  object 
 11  DateTime  50000 non-null  object 
 12  Bool      50000 non-null  bool   
dtypes: bool(1), float64(2), int64(5), object(5)
memory usage: 5.0+ MB


Index,Int8,Int16,Int32,Int64,UInt8,Float32,Float64,String,Sentence,Category,Date,DateTime,Bool
GNK434,-90,19538,1285599156,-8445666702891533375,50,2805.68,377550.24,yLDNw,Ysqa BVDvHW BQGvLoOQDI JODo UbxP,ZLjTbu,1981-02-21,1944-01-13 06:39:58,True
HSFG805,102,24520,-1815828633,-1189016105016985607,109,3379.5,202050.33,qqvL,LNu sVQhO xTlcK dGIgL ghzdfhvFL,bhDCy,1927-07-17,1991-09-13 18:54:11,True
OKS957,-2,4245,-1410703770,8183009552400292276,246,7415.27,274061.23,QxI,sYaK YrAsIrEjR nWoXt,wzyQRvp,1975-05-09,1985-08-24 20:13:12,True
DUB110,-127,560,-9356509,4600095471447551947,24,2470.07,222552.13,ugBIYIvCz,tJshehcX fmh lAhb bvH,LSdZLkB,1910-04-11,1921-05-17 01:25:52,False
WRLNP791,-89,-18455,-658231333,-2746798921101420016,30,9013.99,956196.18,dWzgZGsuD,MoLdfu GVR rJqaWFmS jNW FbT,UsIqc,1909-11-06,2009-05-16 07:15:32,True


In [9]:
file_stats = os.stat("test_data.zip")
print("File size: {} bytes".format(file_stats.st_size))

File size: 4340678 bytes


### Storage formats other than CSV

1. __Pickle__: This is __Python__'s native format for object serialization. Because it serializes the __Python__ object itself, a ``DataFrame`` is restored with its data types and index intact, and there is no need to specify parameters for each column. In the measurements below it is much faster than __CSV__ (39.7 ms to write and 15.8 ms to read, against 305 ms and 117 ms), and with ``bz2`` compression the file shrinks to about half the size of the __CSV__ file.
It is simple to use.

```
dataframe.to_pickle(path)

path: where the data will be stored
```
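The round trip can be sketched on a tiny frame before running it on the 50,000-row dataset. The example below is only an illustration: ``df_small``, its values and the file name ``small.pkl`` are invented here, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile
import pandas as pd

# Small frame with a string index and three non-default dtypes
df_small = pd.DataFrame(
    {
        "UInt8": [0, 128, 255],
        "Category": ["x", "y", "x"],
        "Date": ["1990-05-01", "2000-01-15", "2020-12-31"],
    },
    index=["A1", "B2", "C3"],
)
df_small["UInt8"] = df_small["UInt8"].astype("uint8")
df_small["Category"] = df_small["Category"].astype("category")
df_small["Date"] = pd.to_datetime(df_small["Date"], format="%Y-%m-%d")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "small.pkl")
    df_small.to_pickle(path)          # write
    restored = pd.read_pickle(path)   # read back, no type hints needed

print(restored.dtypes)
print("Identical to the original:", restored.equals(df_small))
```

Unlike the CSV round trip, no ``dtype`` or ``parse_dates`` arguments are needed: the types and the index come back exactly as they were saved.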

In [10]:
%timeit -n 1 -r 1  df.to_pickle("test_data.pkl")
%timeit -n 1 -r 1  pd.read_pickle("test_data.pkl")
dfPickle = pd.read_pickle("test_data.pkl")
dfPickle.info()
dfPickle.head()

39.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
15.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, GNK434 to IJXU8
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Int8      50000 non-null  int8          
 1   Int16     50000 non-null  int16         
 2   Int32     50000 non-null  int32         
 3   Int64     50000 non-null  int64         
 4   UInt8     50000 non-null  uint8         
 5   Float32   50000 non-null  float32       
 6   Float64   50000 non-null  float64       
 7   String    50000 non-null  object        
 8   Sentence  50000 non-null  object        
 9   Category  50000 non-null  category      
 10  Date      50000 non-null  datetime64[ns]
 11  DateTime  50000 non-null  datetime64[ns]
 12  Bool      50000 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](2), float32(1), float64(1)

,Int8,Int16,Int32,Int64,UInt8,Float32,Float64,String,Sentence,Category,Date,DateTime,Bool
GNK434,-90,19538,1285599156,-8445666702891533375,50,2805.679932,377550.24,yLDNw,Ysqa BVDvHW BQGvLoOQDI JODo UbxP,ZLjTbu,1981-02-21,1944-01-13 06:39:58,True
HSFG805,102,24520,-1815828633,-1189016105016985607,109,3379.5,202050.33,qqvL,LNu sVQhO xTlcK dGIgL ghzdfhvFL,bhDCy,1927-07-17,1991-09-13 18:54:11,True
OKS957,-2,4245,-1410703770,8183009552400292276,246,7415.27002,274061.23,QxI,sYaK YrAsIrEjR nWoXt,wzyQRvp,1975-05-09,1985-08-24 20:13:12,True
DUB110,-127,560,-9356509,4600095471447551947,24,2470.070068,222552.13,ugBIYIvCz,tJshehcX fmh lAhb bvH,LSdZLkB,1910-04-11,1921-05-17 01:25:52,False
WRLNP791,-89,-18455,-658231333,-2746798921101420016,30,9013.990234,956196.18,dWzgZGsuD,MoLdfu GVR rJqaWFmS jNW FbT,UsIqc,1909-11-06,2009-05-16 07:15:32,True


In [11]:
file_stats = os.stat("test_data.pkl")
print("File size: {} bytes".format(file_stats.st_size))

File size: 5149621 bytes


In [12]:
%timeit -n 1 -r 1  df.to_pickle("test_data.pkl.bz2", compression="bz2")
%timeit -n 1 -r 1  pd.read_pickle("test_data.pkl.bz2")
dfPickle = pd.read_pickle("test_data.pkl.bz2")
dfPickle.info()
dfPickle.head()

435 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
194 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, GNK434 to IJXU8
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Int8      50000 non-null  int8          
 1   Int16     50000 non-null  int16         
 2   Int32     50000 non-null  int32         
 3   Int64     50000 non-null  int64         
 4   UInt8     50000 non-null  uint8         
 5   Float32   50000 non-null  float32       
 6   Float64   50000 non-null  float64       
 7   String    50000 non-null  object        
 8   Sentence  50000 non-null  object        
 9   Category  50000 non-null  category      
 10  Date      50000 non-null  datetime64[ns]
 11  DateTime  50000 non-null  datetime64[ns]
 12  Bool      50000 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](2), float32(1), float64(1), 

,Int8,Int16,Int32,Int64,UInt8,Float32,Float64,String,Sentence,Category,Date,DateTime,Bool
GNK434,-90,19538,1285599156,-8445666702891533375,50,2805.679932,377550.24,yLDNw,Ysqa BVDvHW BQGvLoOQDI JODo UbxP,ZLjTbu,1981-02-21,1944-01-13 06:39:58,True
HSFG805,102,24520,-1815828633,-1189016105016985607,109,3379.5,202050.33,qqvL,LNu sVQhO xTlcK dGIgL ghzdfhvFL,bhDCy,1927-07-17,1991-09-13 18:54:11,True
OKS957,-2,4245,-1410703770,8183009552400292276,246,7415.27002,274061.23,QxI,sYaK YrAsIrEjR nWoXt,wzyQRvp,1975-05-09,1985-08-24 20:13:12,True
DUB110,-127,560,-9356509,4600095471447551947,24,2470.070068,222552.13,ugBIYIvCz,tJshehcX fmh lAhb bvH,LSdZLkB,1910-04-11,1921-05-17 01:25:52,False
WRLNP791,-89,-18455,-658231333,-2746798921101420016,30,9013.990234,956196.18,dWzgZGsuD,MoLdfu GVR rJqaWFmS jNW FbT,UsIqc,1909-11-06,2009-05-16 07:15:32,True


In [13]:
file_stats = os.stat("test_data.pkl.bz2")
print("File size: {} bytes".format(file_stats.st_size))

File size: 3709211 bytes


2. __Parquet__: This is a compressed, column-oriented storage format used in the __Hadoop__ ecosystem. It can serialize complex nested structures, supports per-column compression and encoding, and offers fast reads. Its advantage over __CSV__ is that it stores the data type of every column, was faster to write in the run below (163 ms against 305 ms) and is widely supported in the __Hadoop__ ecosystem, where it allows fast filtering.

To use it in Python:

```
dataframe.to_parquet(path, engine, compression, index, partition_cols)

path: where the data is stored

engine: pyarrow or fastparquet

compression: the compression method to use

index: whether to save the DataFrame index

partition_cols: the columns by which to partition the dataset, in the order given
```
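Because Parquet stores data by column, a reader can ask for only the columns it needs. The sketch below illustrates this with the ``columns`` argument of ``read_parquet``; ``df_small`` and the file name are invented for the example, and the ``try``/``except`` is there only because pandas raises ``ImportError`` when neither ``pyarrow`` nor ``fastparquet`` is installed.

```python
import os
import tempfile
import pandas as pd

df_small = pd.DataFrame(
    {"Int16": [10, -20, 30], "Float32": [1.5, 2.5, 3.5], "Bool": [True, False, True]}
).astype({"Int16": "int16", "Float32": "float32"})

subset = None
try:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "small.parquet")
        df_small.to_parquet(path, compression="gzip", index=True)
        # Read back only two of the three columns
        subset = pd.read_parquet(path, columns=["Int16", "Bool"])
except ImportError:
    # No Parquet engine is installed in this environment
    print("No Parquet engine available; install pyarrow or fastparquet")

if subset is not None:
    print(subset.dtypes)
```

When an engine is available, the selected columns come back with the types they were saved with, and the ``Float32`` column is never loaded.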


In [14]:
%timeit -n 1 -r 1  df.to_parquet("test_data.parquet", index=True) 
%timeit -n 1 -r 1  pd.read_parquet("test_data.parquet")
dfParquet = pd.read_parquet("test_data.parquet")
dfParquet.info()
dfParquet.head()

163 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
118 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, GNK434 to IJXU8
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Int8      50000 non-null  int8          
 1   Int16     50000 non-null  int16         
 2   Int32     50000 non-null  int32         
 3   Int64     50000 non-null  int64         
 4   UInt8     50000 non-null  uint8         
 5   Float32   50000 non-null  float32       
 6   Float64   50000 non-null  float64       
 7   String    50000 non-null  object        
 8   Sentence  50000 non-null  object        
 9   Category  50000 non-null  category      
 10  Date      50000 non-null  datetime64[ns]
 11  DateTime  50000 non-null  datetime64[ns]
 12  Bool      50000 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](2), float32(1), float64(1), 

,Int8,Int16,Int32,Int64,UInt8,Float32,Float64,String,Sentence,Category,Date,DateTime,Bool
GNK434,-90,19538,1285599156,-8445666702891533375,50,2805.679932,377550.24,yLDNw,Ysqa BVDvHW BQGvLoOQDI JODo UbxP,ZLjTbu,1981-02-21,1944-01-13 06:39:58,True
HSFG805,102,24520,-1815828633,-1189016105016985607,109,3379.5,202050.33,qqvL,LNu sVQhO xTlcK dGIgL ghzdfhvFL,bhDCy,1927-07-17,1991-09-13 18:54:11,True
OKS957,-2,4245,-1410703770,8183009552400292276,246,7415.27002,274061.23,QxI,sYaK YrAsIrEjR nWoXt,wzyQRvp,1975-05-09,1985-08-24 20:13:12,True
DUB110,-127,560,-9356509,4600095471447551947,24,2470.070068,222552.13,ugBIYIvCz,tJshehcX fmh lAhb bvH,LSdZLkB,1910-04-11,1921-05-17 01:25:52,False
WRLNP791,-89,-18455,-658231333,-2746798921101420016,30,9013.990234,956196.18,dWzgZGsuD,MoLdfu GVR rJqaWFmS jNW FbT,UsIqc,1909-11-06,2009-05-16 07:15:32,True


In [15]:
file_stats = os.stat("test_data.parquet")
print("File size: {} bytes".format(file_stats.st_size))

File size: 5631783 bytes


In [16]:
%timeit -n 1 -r 1  df.to_parquet("test_data.parquet.gzip", index=True, compression="gzip") 
%timeit -n 1 -r 1  pd.read_parquet("test_data.parquet.gzip")
dfParquet = pd.read_parquet("test_data.parquet.gzip")
dfParquet.info()
dfParquet.head()

273 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
34.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, GNK434 to IJXU8
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Int8      50000 non-null  int8          
 1   Int16     50000 non-null  int16         
 2   Int32     50000 non-null  int32         
 3   Int64     50000 non-null  int64         
 4   UInt8     50000 non-null  uint8         
 5   Float32   50000 non-null  float32       
 6   Float64   50000 non-null  float64       
 7   String    50000 non-null  object        
 8   Sentence  50000 non-null  object        
 9   Category  50000 non-null  category      
 10  Date      50000 non-null  datetime64[ns]
 11  DateTime  50000 non-null  datetime64[ns]
 12  Bool      50000 non-null  bool          
dtypes: bool(1), category(1), datetime64[ns](2), float32(1), float64(1),

,Int8,Int16,Int32,Int64,UInt8,Float32,Float64,String,Sentence,Category,Date,DateTime,Bool
GNK434,-90,19538,1285599156,-8445666702891533375,50,2805.679932,377550.24,yLDNw,Ysqa BVDvHW BQGvLoOQDI JODo UbxP,ZLjTbu,1981-02-21,1944-01-13 06:39:58,True
HSFG805,102,24520,-1815828633,-1189016105016985607,109,3379.5,202050.33,qqvL,LNu sVQhO xTlcK dGIgL ghzdfhvFL,bhDCy,1927-07-17,1991-09-13 18:54:11,True
OKS957,-2,4245,-1410703770,8183009552400292276,246,7415.27002,274061.23,QxI,sYaK YrAsIrEjR nWoXt,wzyQRvp,1975-05-09,1985-08-24 20:13:12,True
DUB110,-127,560,-9356509,4600095471447551947,24,2470.070068,222552.13,ugBIYIvCz,tJshehcX fmh lAhb bvH,LSdZLkB,1910-04-11,1921-05-17 01:25:52,False
WRLNP791,-89,-18455,-658231333,-2746798921101420016,30,9013.990234,956196.18,dWzgZGsuD,MoLdfu GVR rJqaWFmS jNW FbT,UsIqc,1909-11-06,2009-05-16 07:15:32,True


In [17]:
file_stats = os.stat("test_data.parquet.gzip")
print("File size: {} bytes".format(file_stats.st_size))

File size: 4545526 bytes
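
To compare the formats side by side, the file sizes printed in the cells above can be expressed relative to the plain CSV file. The sketch below only reuses the six byte counts from those outputs; the dictionary ``sizes`` and the other names are introduced here for illustration, and the figures reflect this single run on random data.

```python
# File sizes in bytes, copied from the cell outputs above
sizes = {
    "test_data.csv": 7238854,
    "test_data.zip": 4340678,
    "test_data.pkl": 5149621,
    "test_data.pkl.bz2": 3709211,
    "test_data.parquet": 5631783,
    "test_data.parquet.gzip": 4545526,
}

csv_size = sizes["test_data.csv"]

# Size of each file as a fraction of the uncompressed CSV
relative = {name: size / csv_size for name, size in sizes.items()}

# Print from smallest to largest
for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:<24} {sizes[name]:>9} bytes  {ratio:6.1%} of CSV")

smallest = min(sizes, key=sizes.get)
print("Smallest file:", smallest)
```

In this run the ``bz2``-compressed pickle is the smallest file, at roughly half the size of the CSV, while every alternative is smaller than the uncompressed CSV.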
