# Tabular Data with `DataFrames.jl`

In [2]:
using Pkg
using BenchmarkTools
using CSV
using CategoricalArrays
using DataFrames
using XLSX
using Statistics: mean, std, cor
using Downloads: download

In [3]:
df_1 = DataFrame(x_1=rand(5), x_2=rand(5), x_3=rand(5), y_a=rand(5), y_b=rand(5))

| Row | x_1 | x_2 | x_3 | y_a | y_b |
|-----|-----|-----|-----|-----|-----|
|     | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.58738 | 0.596524 | 0.251127 | 0.705915 | 0.618689 |
| 2 | 0.708135 | 0.995928 | 0.0600837 | 0.559235 | 0.860796 |
| 3 | 0.0413412 | 0.87194 | 0.93334 | 0.271217 | 0.405933 |
| 4 | 0.586104 | 0.0555631 | 0.195194 | 0.538403 | 0.846 |
| 5 | 0.573897 | 0.993305 | 0.666398 | 0.353688 | 0.50758 |


In [4]:
typeof(df_1)

DataFrame

# Information About a DataFrame

* `size(df)`: tuple with the dimensions (similar to `df.shape` in pandas)

* `nrow(df)` and `ncol(df)`: number of rows and number of columns

* `first(df, 5)` and `last(df, 5)`: first or last 5 rows, including the header

* `describe(df)`: similar to `df.describe()` in pandas

* `names(df)`: vector of column names as `String`s

* `propertynames(df)`: vector of column names as `Symbol`s

* `hasproperty(df, :x1)`: returns a `Bool` indicating whether column `x1` ∈ `df`

* `columnindex(df, :x2)`: returns the index of column `x2` ∈ `df`

* `mapcols(sum, df)`: _column-wise_ operations (the older `colwise` was removed from `DataFrames.jl`)

* `df2 = copy(df)`: copies a DataFrame
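The inspection functions listed above can be tried on a small table. The following is a minimal sketch with made-up data (the column names `x1` and `x2` and their values are illustrative assumptions); it uses `mapcols` for the column-wise operation, since `colwise` is no longer available in current `DataFrames.jl` releases.

```julia
using DataFrames

# Small, fixed table so the results are reproducible (hypothetical data).
df = DataFrame(x1 = [1, 2, 3], x2 = [4.0, 5.0, 6.0])

dims = size(df)                       # (rows, columns)
n_rows, n_cols = nrow(df), ncol(df)   # same information, as two integers
col_symbols = propertynames(df)       # column names as Symbols
has_x1 = hasproperty(df, :x1)         # does column x1 exist?
idx_x2 = columnindex(df, :x2)         # position of column x2

# Apply `sum` to every column; the result is a one-row DataFrame.
col_sums = mapcols(sum, df)

# `copy` duplicates the column vectors, so mutating the copy
# leaves the original untouched.
df2 = copy(df)
df2.x1[1] = 100
```

After running this, `df.x1[1]` should still be `1`, while `df2.x1[1]` is `100`.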

In [5]:
size(df_1)

(5, 5)

In [6]:
first(df_1, 3)

| Row | x_1 | x_2 | x_3 | y_a | y_b |
|-----|-----|-----|-----|-----|-----|
|     | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.58738 | 0.596524 | 0.251127 | 0.705915 | 0.618689 |
| 2 | 0.708135 | 0.995928 | 0.0600837 | 0.559235 | 0.860796 |
| 3 | 0.0413412 | 0.87194 | 0.93334 | 0.271217 | 0.405933 |


In [7]:
ncol(df_1)

5

In [8]:
names(df_1)

5-element Vector{String}:
 "x_1"
 "x_2"
 "x_3"
 "y_a"
 "y_b"

> Convert to a `Matrix`

In [9]:
Matrix(df_1)

5×5 Matrix{Float64}:
 0.58738    0.596524   0.251127   0.705915  0.618689
 0.708135   0.995928   0.0600837  0.559235  0.860796
 0.0413412  0.87194    0.93334    0.271217  0.405933
 0.586104   0.0555631  0.195194   0.538403  0.846
 0.573897   0.993305   0.666398   0.353688  0.50758
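The conversion above yields a `Matrix{Float64}` because every column is `Float64`. As a rough sketch of what happens with other column types (the two tables below are hypothetical), the element type of the resulting matrix is the promoted type of all columns:

```julia
using DataFrames

numeric = DataFrame(a = [1, 2], b = [0.5, 1.5])   # Int64 and Float64 columns
mixed   = DataFrame(a = [1, 2], b = ["x", "y"])   # Int64 and String columns

m_numeric = Matrix(numeric)   # integers are promoted to Float64
m_mixed   = Matrix(mixed)     # no common concrete type, so elements are Any
```

A `Matrix{Any}` is usually slower to work with, so it can be worth selecting only the numeric columns before converting.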

# Descriptive Statistics with `describe`

> By default, `describe(df)` is equivalent to `describe(df, :mean, :min, :median, :max, :nmissing, :eltype)`

In [10]:
describe(df_1)

| Row | variable | mean | min | median | max | nmissing | eltype |
|-----|----------|------|-----|--------|-----|----------|--------|
|     | Symbol | Float64 | Float64 | Float64 | Float64 | Int64 | DataType |
| 1 | x_1 | 0.499371 | 0.0413412 | 0.586104 | 0.708135 | 0 | Float64 |
| 2 | x_2 | 0.702652 | 0.0555631 | 0.87194 | 0.995928 | 0 | Float64 |
| 3 | x_3 | 0.421228 | 0.0600837 | 0.251127 | 0.93334 | 0 | Float64 |
| 4 | y_a | 0.485692 | 0.271217 | 0.538403 | 0.705915 | 0 | Float64 |
| 5 | y_b | 0.6478 | 0.405933 | 0.618689 | 0.860796 | 0 | Float64 |


But you can pick whichever statistics you want:

* `:mean`: mean

* `:std`: standard deviation

* `:min`: minimum

* `:q25`: 25th percentile (first quartile)

* `:median`: median

* `:q75`: 75th percentile (third quartile)

* `:max`: maximum

* `:nunique`: number of unique values

* `:nmissing`: number of missing values

* `:first`: first value

* `:last`: last value

* `:eltype`: element type (e.g. `Float64`, `Int64`, `String`)
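To see how some of these statistics behave, here is a small sketch on a hypothetical table in which one column contains `missing` values. The data is invented for illustration; `describe` skips missing entries when computing the numeric summaries and counts them under `:nmissing`.

```julia
using DataFrames

# Hypothetical data: column y has two missing entries.
df = DataFrame(
    x = [1.0, 2.0, 3.0, 4.0, 5.0],
    y = [10, missing, 30, missing, 50],
)

# Five-number summary plus the count of missing values per column.
stats = describe(df, :min, :q25, :median, :q75, :max, :nmissing)
```

The result has one row per column of `df` and one column per requested statistic, preceded by the `variable` column.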

In [11]:
describe(df_1, :mean, :median, :std)

| Row | variable | mean | median | std |
|-----|----------|------|--------|-----|
|     | Symbol | Float64 | Float64 | Float64 |
| 1 | x_1 | 0.499371 | 0.586104 | 0.261819 |
| 2 | x_2 | 0.702652 | 0.87194 | 0.39659 |
| 3 | x_3 | 0.421228 | 0.251127 | 0.364972 |
| 4 | y_a | 0.485692 | 0.538403 | 0.173284 |
| 5 | y_b | 0.6478 | 0.618689 | 0.202274 |


You can even define your own summary functions and pass them as `function => :name` pairs. Here, the bounds are the mean ± 1.96 standard deviations:

In [20]:
ic_sup(x) = mean(x) + 1.96 * std(x)
ic_inf(x) = mean(x) - 1.96 * std(x)

ic_inf (generic function with 1 method)

In [21]:
describe(df_1, :mean, ic_inf => :ic_inf, ic_sup => :ic_sup)

| Row | variable | mean | ic_inf | ic_sup |
|-----|----------|------|--------|--------|
|     | Symbol | Float64 | Float64 | Float64 |
| 1 | x_1 | 0.499371 | -0.0137931 | 1.01254 |
| 2 | x_2 | 0.702652 | -0.0746652 | 1.47997 |
| 3 | x_3 | 0.421228 | -0.294117 | 1.13657 |
| 4 | y_a | 0.485692 | 0.146055 | 0.825329 |
| 5 | y_b | 0.6478 | 0.251342 | 1.04426 |


# Input/Output (IO)

* `CSV.jl`: reads any delimited file (`.csv`, `.tsv`, etc.).

* `XLSX.jl`: reads Excel `.xlsx` files.

* `JSONTables.jl`: reads JSON files (`.json`).

* `Arrow.jl`: Apache Arrow format for Big Data (that does not fit in RAM).

* `JuliaDB.jl`: reading and manipulating Big Data (that does not fit in RAM).

* __Databases__: Julia also works well with databases. See [juliadatabases.org](https://juliadatabases.org/)
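As a minimal sketch of the most common case, the snippet below writes a DataFrame to a CSV file and reads it back with `CSV.jl`. The table contents and the file name `example.csv` are illustrative assumptions; the file is created in a temporary directory.

```julia
using CSV
using DataFrames

# Hypothetical table to round-trip through a CSV file.
df = DataFrame(id = [1, 2, 3], score = [0.5, 1.5, 2.5])

# Write to a temporary location, then read the file back into a DataFrame.
path = joinpath(mktempdir(), "example.csv")
CSV.write(path, df)
df_back = CSV.read(path, DataFrame)
```

The second argument of `CSV.read` is the sink type, so the same call can materialize the file into other table types that follow the `Tables.jl` interface.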