# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 5, 2017**

This is a brief introduction to the basic usage of `DataFrames`, tested under `DataFrames` master on 2017-12-05.
I will try to keep it up to date as the package evolves. This tutorial covers only `DataFrames`, `CSV`, `Missings` and `CategoricalArrays`. It does not show any of the additional packages that can be used with `DataFrames`.

In [1]:
using DataFrames # load package

## Constructors

In [2]:
DataFrame() # empty DataFrame

In [3]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3])) # keyword arguments

| Row | A | B        | C   |
|-----|---|----------|-----|
| 1   | 1 | 0.149461 | diw |
| 2   | 2 | 0.692835 | sk7 |
| 3   | 3 | 0.464265 | Jd3 |


In [4]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x) # from a dictionary; columns are sorted by name

| Row | A | B     | C   |
|-----|---|-------|-----|
| 1   | 1 | true  | 'a' |
| 2   | 2 | false | 'b' |


In [5]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b']) # from pairs

| Row | A | B     | C   |
|-----|---|-------|-----|
| 1   | 1 | true  | 'a' |
| 2   | 2 | false | 'b' |


In [6]:
DataFrame([rand(3) for i in 1:3]) # from vector of vectors

| Row | x1       | x2         | x3        |
|-----|----------|------------|-----------|
| 1   | 0.666839 | 0.807526   | 0.0877875 |
| 2   | 0.513415 | 0.00341089 | 0.847056  |
| 3   | 0.611787 | 0.32078    | 0.275785  |


In [7]:
DataFrame(rand(3)) # edge case: a vector of atoms gives a single row, one column per element

| Row | x1       | x2       | x3       |
|-----|----------|----------|----------|
| 1   | 0.536016 | 0.104192 | 0.944227 |


In [8]:
DataFrame(rand(3), [:A, :B, :C]) # pass second argument to give column names

| Row | A        | B        | C        |
|-----|----------|----------|----------|
| 1   | 0.576838 | 0.563171 | 0.292853 |


In [9]:
DataFrame(rand(3,4)) # from matrix

| Row | x1        | x2        | x3       | x4       |
|-----|-----------|-----------|----------|----------|
| 1   | 0.331417  | 0.0923906 | 0.167039 | 0.848475 |
| 2   | 0.0177684 | 0.533385  | 0.808593 | 0.748703 |
| 3   | 0.798813  | 0.518626  | 0.389481 | 0.936089 |


In [10]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1) # pass column types, names and number of rows
# columns A and B hold uninitialized memory, so their values are arbitrary;
# column C shows missing because Any >: Missing

| Row | A  | B            | C       |
|-----|----|--------------|---------|
| 1   | -1 | 1.98836e-315 | missing |


In [11]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)
# the DataFrame is created correctly, but the String value is #undef, so Jupyter fails to print it

UndefRefError: access to undefined reference

In [12]:
DataFrame([Int, Float64, String], [:A, :B, :C], 0) # columns are created, but there are no rows

| Row | A | B | C |
|-----|---|---|---|


In [13]:
DataFrame(Int, 3, 5) # a quick way to create a homogeneous DataFrame; the values are uninitialized

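| Row | x1 | x2        | x3        | x4        | x5        |
|-----|----|-----------|-----------|-----------|-----------|
| 1   | 0  | 459006896 | 116398000 | 180082888 | 163524656 |
| 2   | 0  | 432728688 | 432728688 | 179123592 | 459010224 |
| 3   | 0  | 459006128 | 111937424 | 179198296 | 111968528 |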


In [14]:
DataFrame([Int, Float64], 4) # similar, but with heterogeneous column types; values are again uninitialized

| Row | x1        | x2           |
|-----|-----------|--------------|
| 1   | 116398000 | 2.26752e-315 |
| 2   | 111939248 | 2.11543e-315 |
| 3   | 111937424 | 2.26752e-315 |
| 4   | 111935984 | 0.0          |
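
The numbers printed by the type-based constructors above are whatever happened to be in memory, so they differ from run to run. If predictable starting values are needed, one option is to build the columns explicitly. The following is a minimal sketch that assumes zero is an acceptable placeholder; the column names `A` and `B` are chosen only for illustration.

```julia
using DataFrames

# Assumption: zeros are a sensible placeholder for every cell.
# Each column is built explicitly, so no uninitialized memory is exposed.
df = DataFrame(A = zeros(Int, 3), B = zeros(3))

size(df) # (3, 2): three rows, two columns
```

This uses only the keyword-argument constructor shown earlier, so the element type of each column follows from the vector that is passed in.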


In [15]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"], D = [1, "a"])
convert(Array, x) # convert DataFrame to Matrix

2×4 Array{Any,2}:
 1  1.0       "a"  1   
 2   missing  "b"   "a"

In [16]:
y = DataFrame(x) # returns the same object, no copy is made
z = copy(x) # shallow copy: a new DataFrame object
(x === y), (x === z), isequal(x, z)

(true, false, true)