In [1]:
using DataFrames

In [2]:
?(DataFrame)

search: DataFrame DataFrame! DataFrames DataFrameRow SubDataFrame



```
DataFrame <: AbstractDataFrame
```

An AbstractDataFrame that stores a set of named columns.

The columns are normally AbstractVectors stored in memory, particularly a Vector or CategoricalVector.

# Constructors

```julia
DataFrame(pairs::Pair...; makeunique::Bool=false, copycols::Bool=true)
DataFrame(pairs::AbstractVector{<:Pair}; makeunique::Bool=false, copycols::Bool=true)
DataFrame(ds::AbstractDict; copycols::Bool=true)
DataFrame(kwargs..., copycols::Bool=true)

DataFrame(columns::AbstractVecOrMat, names::Union{AbstractVector, Symbol};
          makeunique::Bool=false, copycols::Bool=true)

DataFrame(table; copycols::Bool=true)
DataFrame(::DataFrameRow)
DataFrame(::GroupedDataFrame; keepkeys::Bool=true)
```

# Keyword arguments

  * `copycols` : whether vectors passed as columns should be copied; by default set to `true` and the vectors are copied; if set to `false` then the constructor will still copy the passed columns if it is not possible to construct a `DataFrame` without materializing new columns.
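  * `makeunique` : if `false` (the default), an error will be raised if duplicate column names are found; if `true`, duplicate names will be suffixed with `_i` (`i` starting at 1 for the first duplicate).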

(note that not all constructors support these keyword arguments)

# Details on behavior of different constructors

It is allowed to pass a vector of `Pair`s, a list of `Pair`s as positional arguments, or a list of keyword arguments. In this case each pair is considered to represent a column name to column value mapping and column name must be a `Symbol` or string. Alternatively a dictionary can be passed to the constructor in which case its entries are considered to define the column name and column value pairs. If the dictionary is a `Dict` then column names will be sorted in the returned `DataFrame`.
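The following is a minimal sketch of the constructors just described, using made-up column names and values. It builds the same table from positional `Pair`s, a vector of `Pair`s, and keyword arguments, then shows that a `Dict` returns its columns sorted by name.

```julia
using DataFrames

# Three equivalent ways to map column names to column values.
from_pairs  = DataFrame(:b => [1, 2], :a => [3, 4])
from_vector = DataFrame([:b => [1, 2], :a => [3, 4]])
from_kwargs = DataFrame(b = [1, 2], a = [3, 4])

# Pairs and keyword arguments keep the order in which the columns were given.
names(from_pairs)   # ["b", "a"]

# A Dict has no meaningful key order, so the column names come back sorted.
from_dict = DataFrame(Dict(:b => [1, 2], :a => [3, 4]))
names(from_dict)    # ["a", "b"]
```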

In all the constructors described above column value can be a vector which is consumed as is or an object of any other type (except `AbstractArray`). In the latter case the passed value is automatically repeated to fill a new vector of the appropriate length. As a particular rule values stored in a `Ref` or a `0`-dimensional `AbstractArray` are unwrapped and treated in the same way.
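As an illustration of the recycling rule above, the sketch below mixes a vector, a scalar, and a `Ref`-wrapped vector. The names `df_fill`, `id`, `flag`, and `tags` are invented for this example.

```julia
using DataFrames

# `flag` is a scalar, so it is repeated to match the length of `id`.
# `tags` is wrapped in `Ref`, so the whole vector is treated as a single value
# and repeated, instead of being used as a 2-row column.
df_fill = DataFrame(:id => [1, 2, 3], :flag => true, :tags => Ref(["x", "y"]))

df_fill.flag   # [true, true, true]
df_fill.tags   # three entries, each equal to ["x", "y"]
```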

It is also allowed to pass a vector of vectors or a matrix as the first argument. In this case the second argument must be a vector of `Symbol`s or strings specifying column names, or the symbol `:auto` to generate column names `x1`, `x2`, ... automatically.

If a single positional argument is passed to a `DataFrame` constructor then it is assumed to be of a type that implements the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface, which is used to materialize the returned `DataFrame`.

Finally it is allowed to construct a `DataFrame` from a `DataFrameRow` or a `GroupedDataFrame`. In the latter case the `keepkeys` keyword argument specifies whether the resulting `DataFrame` should contain the grouping columns of the passed `GroupedDataFrame` and the order of rows in the result follows the order of groups in the `GroupedDataFrame` passed.
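A short sketch of the last two constructors, on an invented `sales` table. `sort = true` is passed to `groupby` here only to make the group order, and therefore the row order of the result, predictable.

```julia
using DataFrames

sales = DataFrame(:region => ["N", "S", "N", "S"], :amount => [10, 20, 30, 40])

# A single row (a DataFrameRow) becomes a one-row DataFrame.
one_row = DataFrame(sales[2, :])

# Groups are ordered by key because of sort = true.
gdf = groupby(sales, :region, sort = true)

with_keys    = DataFrame(gdf)                    # keeps the :region column
without_keys = DataFrame(gdf, keepkeys = false)  # only :amount remains

with_keys.amount      # [10, 30, 20, 40]: all "N" rows, then all "S" rows
names(without_keys)   # ["amount"]
```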

# Notes

The `DataFrame` constructor by default copies all column vectors passed to it. Pass the `copycols=false` keyword argument (where supported) to reuse vectors without copying them.

By default an error will be raised if duplicates in column names are found. Pass the `makeunique=true` keyword argument (where supported) to accept duplicate names, in which case they will be suffixed with `_i` (`i` starting at 1 for the first duplicate).

If an `AbstractRange` is passed to a `DataFrame` constructor as a column it is always collected to a `Vector` (even if `copycols=false`). As a general rule `AbstractRange` values are always materialized to a `Vector` by all functions in DataFrames.jl before being stored in a `DataFrame`.
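The three notes above can be checked with a small sketch. The variable names are invented for illustration, and `===` is used to test whether a column is the very same object as the vector that was passed in.

```julia
using DataFrames

x = [1, 2, 3]

copied = DataFrame(:a => x)                     # default: the column is a copy of x
shared = DataFrame(:a => x; copycols = false)   # the column is x itself

copied.a === x   # false
shared.a === x   # true, so mutating x also changes shared.a

# Duplicate names raise an error unless makeunique = true is passed.
deduped = DataFrame(:a => [1, 2], :a => [3, 4]; makeunique = true)
names(deduped)   # ["a", "a_1"]

# A range is always materialized to a Vector, even with copycols = false.
from_range = DataFrame(:r => 1:3; copycols = false)
from_range.r isa Vector{Int}   # true
```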

The `DataFrame` type is designed to allow column types to vary and to be changed dynamically after it is constructed. Therefore `DataFrame`s are not type stable. For performance-critical code that requires type stability, either use the functionality provided by the `select`/`transform`/`combine` functions, use the `Tables.columntable` and `Tables.namedtupleiterator` functions, use barrier functions, or provide type assertions for the variables that hold columns extracted from a `DataFrame`.
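One way to apply the barrier-function advice is sketched below. The `scores` table and the `weighted_sum` helper are hypothetical; the point is only that the hot loop lives in a function that receives the columns as arguments.

```julia
using DataFrames

scores = DataFrame(:x => [1.0, 2.0, 3.0], :w => [0.5, 0.25, 0.25])

# Barrier function: inside it, the element types of `x` and `w` are known to
# the compiler, so the loop is type stable even though `scores` itself is not.
function weighted_sum(x::AbstractVector, w::AbstractVector)
    total = zero(promote_type(eltype(x), eltype(w)))
    for i in eachindex(x, w)
        total += x[i] * w[i]
    end
    return total
end

# Extract the columns once, then hand them to the barrier function.
weighted_sum(scores.x, scores.w)   # 1.75

# Alternative: a type assertion on the extracted column.
xcol = scores.x::Vector{Float64}
```

Only the column extraction is dynamically dispatched here; the loop itself is compiled for the concrete vector types.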

# Examples

```julia
julia> DataFrame((a=[1, 2], b=[3, 4])) # Tables.jl table constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> DataFrame([(a=1, b=0), (a=2, b=0)]) # Tables.jl table constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame("a" => 1:2, "b" => 0) # Pair constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame([:a => 1:2, :b => 0]) # vector of Pairs constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame(Dict(:a => 1:2, :b => 0)) # dictionary constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame(a=1:2, b=0) # keyword argument constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame([[1, 2], [0, 0]], [:a, :b]) # vector of vectors constructor
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0

julia> DataFrame([1 0; 2 0], :auto) # matrix constructor
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      0
   2 │     2      0
```


In [3]:
tuples = (:H => [5, 10], :J => [10, 15])

(:H => [5, 10], :J => [10, 15])

In [4]:
typeof(tuples)

Tuple{Pair{Symbol,Array{Int64,1}},Pair{Symbol,Array{Int64,1}}}

In [5]:
df = DataFrame(tuples)

2×2 DataFrame
 Row │ H      J
     │ Int64  Int64
─────┼──────────────
   1 │     5     10
   2 │    10     15


In [6]:
dict = Dict(:H => [5, 10, 15, 20], :J => [1, 2, 3, 4])

Dict{Symbol,Array{Int64,1}} with 2 entries:
  :H => [5, 10, 15, 20]
  :J => [1, 2, 3, 4]

In [7]:
typeof(dict)

Dict{Symbol,Array{Int64,1}}

In [8]:
df = DataFrame(dict)

4×2 DataFrame
 Row │ H      J
     │ Int64  Int64
─────┼──────────────
   1 │     5      1
   2 │    10      2
   3 │    15      3
   4 │    20      4


In [9]:
using CSV

In [10]:
df = CSV.read("car data.csv", DataFrame)

 Row │ Car_Name       Year   Selling_Price  Present_Price  Kms_Driven  Fuel_Type  Seller_Type
     │ String         Int64  Float64        Float64        Int64       String     String
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ ritz            2014           3.35           5.59       27000  Petrol     Dealer
   2 │ sx4             2013           4.75           9.54       43000  Diesel     Dealer
   3 │ ciaz            2017           7.25           9.85        6900  Petrol     Dealer
   4 │ wagon r         2011           2.85           4.15        5200  Petrol     Dealer
   5 │ swift           2014            4.6           6.87       42450  Diesel     Dealer
   6 │ vitara brezza   2018           9.25           9.83        2071  Diesel     Dealer
   7 │ ciaz            2015           6.75           8.12       18796  Petrol     Dealer
   8 │ s cross         2015            6.5           8.61       33429  Diesel     Dealer
   9 │ ciaz            2016           8.75           8.89       20273  Diesel     Dealer
  10 │ ciaz            2015           7.45           8.92       42367  Diesel     Dealer


In [11]:
df = DataFrame(:X => [5,10,15,20], :Y => [5,10,15,20])

4×2 DataFrame
 Row │ X      Y
     │ Int64  Int64
─────┼──────────────
   1 │     5      5
   2 │    10     10
   3 │    15     15
   4 │    20     20


In [12]:
push!(df, [10, 15])

5×2 DataFrame
 Row │ X      Y
     │ Int64  Int64
─────┼──────────────
   1 │     5      5
   2 │    10     10
   3 │    15     15
   4 │    20     20
   5 │    10     15


In [13]:
# Access a column as a property (returns the stored vector, no copy):
df.X

5-element Array{Int64,1}:
  5
 10
 15
 20
 10

In [14]:
# Access a column by name with indexing (`!` returns the stored vector, no copy):
df[!, :X]

5-element Array{Int64,1}:
  5
 10
 15
 20
 10

In [15]:
names(df)

2-element Array{String,1}:
 "X"
 "Y"

In [16]:
df[!, :Z] = [1, 2, 3, 4, 5]

5-element Array{Int64,1}:
 1
 2
 3
 4
 5

In [17]:
show(df)

5×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     5      5      1
   2 │    10     10      2
   3 │    15     15      3
   4 │    20     20      4
   5 │    10     15      5

In [18]:
show(df, allcols = true)

5×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     5      5      1
   2 │    10     10      2
   3 │    15     15      3
   4 │    20     20      4
   5 │    10     15      5

In [19]:
first(df, 2)

2×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     5      5      1
   2 │    10     10      2


In [20]:
last(df, 3)

3×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │    15     15      3
   2 │    20     20      4
   3 │    10     15      5


In [21]:
df[2:4, :]

3×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │    10     10      2
   2 │    15     15      3
   3 │    20     20      4


In [22]:
df[Not(2:4), :]

2×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     5      5      1
   2 │    10     15      5


In [23]:
show(df)

5×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     5      5      1
   2 │    10     10      2
   3 │    15     15      3
   4 │    20     20      4
   5 │    10     15      5

In [24]:
df[2, 2]

10

In [27]:
df2 = DataFrame(:A => [5, 10, 15, 20, 25], :Y => [5, 10, 15, 20, 15])

5×2 DataFrame
 Row │ A      Y
     │ Int64  Int64
─────┼──────────────
   1 │     5      5
   2 │    10     10
   3 │    15     15
   4 │    20     20
   5 │    25     15


In [28]:
innerjoin(df, df2, on = :Y)

7×4 DataFrame
 Row │ X      Y      Z      A
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     5      5      1      5
   2 │    10     10      2     10
   3 │    15     15      3     15
   4 │    15     15      3     25
   5 │    20     20      4     20
   6 │    10     15      5     15
   7 │    10     15      5     25


In [29]:
?(outerjoin)

search: outerjoin



```
outerjoin(df1, df2; on, makeunique=false, indicator=nothing, validate=(false, false),
          renamecols=(identity => identity), matchmissing=:error)
outerjoin(df1, df2, dfs...; on, makeunique = false,
          validate = (false, false), matchmissing=:error)
```

Perform an outer join of two or more data frame objects and return a `DataFrame` containing the result. An outer join includes rows with keys that appear in any of the passed data frames.

The order of rows in the result is undefined and may change in future releases.

# Arguments

  * `df1`, `df2`, `dfs...` : the `AbstractDataFrame`s to be joined

# Keyword Arguments

  * `on` : A column name to join `df1` and `df2` on. If the columns on which `df1` and `df2` will be joined have different names, then a `left=>right` pair can be passed. It is also allowed to perform a join on multiple columns, in which case a vector of column names or column name pairs can be passed (mixing names and pairs is allowed). If more than two data frames are joined then only a column name or a vector of column names are allowed. `on` is a required argument.
  * `makeunique` : if `false` (the default), an error will be raised if duplicate names are found in columns not joined on; if `true`, duplicate names will be suffixed with `_i` (`i` starting at 1 for the first duplicate).
  * `indicator` : Default: `nothing`. If a `Symbol` or string, adds categorical indicator column with the given name for whether a row appeared in only `df1` (`"left_only"`), only `df2` (`"right_only"`) or in both (`"both"`). If the name is already in use, the column name will be modified if `makeunique=true`. This argument is only supported when joining exactly two data frames.
  * `validate` : whether to check that columns passed as the `on` argument define unique keys in each input data frame (according to `isequal`). Can be a tuple or a pair, with the first element indicating whether to run check for `df1` and the second element for `df2`. By default no check is performed.
  * `renamecols` : a `Pair` specifying how columns of the left and right data frames should be renamed in the resulting data frame. Each element of the pair can be a string or a `Symbol`, in which case it is appended to the original column name; alternatively a function can be passed, in which case it is applied to each column name, which is passed to it as a `String`. Note that `renamecols` does not affect `on` columns, whose names are always taken from the left data frame and left unchanged.
  * `matchmissing` : if equal to `:error` throw an error if `missing` is present in `on` columns; if equal to `:equal` then `missing` is allowed and missings are matched (`isequal` is used for comparisons of rows for equality)
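
As a rough illustration of the `on` and `validate` keywords, the sketch below joins two invented tables (`people` and `jobs`) whose key columns have different names, then shows that `validate` rejects duplicate keys. The result is sorted before inspection because the row order of a join is undefined.

```julia
using DataFrames

people = DataFrame(:ID => [1, 2, 3], :Name => ["John Doe", "Jane Doe", "Joe Blogs"])
jobs   = DataFrame(:identifier => [1, 2, 4], :Job => ["Lawyer", "Doctor", "Farmer"])

# Join on differently named key columns and require unique keys on both sides.
joined = outerjoin(people, jobs; on = :ID => :identifier, validate = (true, true))
sort!(joined, :ID)

names(joined)   # ["ID", "Name", "Job"]: the key keeps the name from the left table

# With validation on, duplicate keys in the left table make the join throw.
dup_people = DataFrame(:ID => [1, 1], :Name => ["A", "B"])
validation_failed = try
    outerjoin(dup_people, jobs; on = :ID => :identifier, validate = (true, false))
    false
catch
    true
end
```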

All columns of the returned data frame will support missing values.

It is not allowed to join on columns that contain `NaN` or `-0.0` in real or imaginary part of the number. If you need to perform a join on such values use CategoricalArrays.jl and transform a column containing such values into a `CategoricalVector`.

When merging `on` categorical columns that differ in the ordering of their levels, the ordering of the left data frame takes precedence over the ordering of the right data frame.

If more than two data frames are passed, the join is performed recursively with left associativity. In this case the `indicator` keyword argument is not supported and `validate` keyword argument is applied recursively with left associativity.
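
The left-associative behaviour can be checked with a small sketch on three invented tables: joining all three at once should give the same rows as joining the first two and then joining the result with the third.

```julia
using DataFrames

a = DataFrame(:ID => [1, 2], :A => ["a1", "a2"])
b = DataFrame(:ID => [2, 3], :B => ["b2", "b3"])
c = DataFrame(:ID => [3, 4], :C => ["c3", "c4"])

all_at_once  = outerjoin(a, b, c; on = :ID)
step_by_step = outerjoin(outerjoin(a, b; on = :ID), c; on = :ID)

# Row order is undefined, so sort by key first; isequal treats missing as equal to missing.
isequal(sort(all_at_once, :ID), sort(step_by_step, :ID))   # true
```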

See also: [`innerjoin`](@ref), [`leftjoin`](@ref), [`rightjoin`](@ref), [`semijoin`](@ref), [`antijoin`](@ref), [`crossjoin`](@ref).

# Examples

```julia
julia> name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
3×2 DataFrame
 Row │ ID     Name
     │ Int64  String
─────┼──────────────────
   1 │     1  John Doe
   2 │     2  Jane Doe
   3 │     3  Joe Blogs

julia> job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ ID     Job
     │ Int64  String
─────┼───────────────
   1 │     1  Lawyer
   2 │     2  Doctor
   3 │     4  Farmer

julia> outerjoin(name, job, on = :ID)
4×3 DataFrame
 Row │ ID     Name       Job
     │ Int64  String?    String?
─────┼───────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     3  Joe Blogs  missing
   4 │     4  missing    Farmer

julia> job2 = DataFrame(identifier = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])
3×2 DataFrame
 Row │ identifier  Job
     │ Int64       String
─────┼────────────────────
   1 │          1  Lawyer
   2 │          2  Doctor
   3 │          4  Farmer

julia> rightjoin(name, job2, on = :ID => :identifier, renamecols = "_left" => "_right")
3×3 DataFrame
 Row │ ID     Name_left  Job_right
     │ Int64  String?    String
─────┼─────────────────────────────
   1 │     1  John Doe   Lawyer
   2 │     2  Jane Doe   Doctor
   3 │     4  missing    Farmer

julia> rightjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => lowercase)
3×3 DataFrame
 Row │ ID     NAME      job
     │ Int64  String?   String
─────┼─────────────────────────
   1 │     1  John Doe  Lawyer
   2 │     2  Jane Doe  Doctor
   3 │     4  missing   Farmer
```


In [33]:
df

5×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     5      5      1
   2 │    10     10      2
   3 │    15     15      3
   4 │    20     20      4
   5 │    10     15      5


In [39]:
sort!(df)

5×3 DataFrame
 Row │ X      Y      Z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     5      5      1
   2 │    10     10      2
   3 │    10     15      5
   4 │    15     15      3
   5 │    20     20      4


In [40]:
df = DataFrame(:A => [5, missing, 6], :B => [5, 10, 2])

3×2 DataFrame
 Row │ A        B
     │ Int64?   Int64
─────┼────────────────
   1 │       5      5
   2 │ missing     10
   3 │       6      2


In [41]:
dropmissing(df)

2×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     5      5
   2 │     6      2


In [42]:
df

3×2 DataFrame
 Row │ A        B
     │ Int64?   Int64
─────┼────────────────
   1 │       5      5
   2 │ missing     10
   3 │       6      2


In [43]:
dropmissing(df, :A)

2×2 DataFrame
 Row │ A      B
     │ Int64  Int64
─────┼──────────────
   1 │     5      5
   2 │     6      2
