# Environment setup for data frames tutorial

## Bogumił Kamiński

Welcome to the DataFrames.jl introduction!

This set of Jupyter notebooks is intended to give you an overview of the functionality of DataFrames.jl, based on practical examples.

You can find overviews of DataFrames.jl functionality (organized by task type rather than as exercises, as in this tutorial) in the following locations:
* the official manual at https://juliadata.github.io/DataFrames.jl/stable/
* a tutorial going through all functionalities of DataFrames.jl at https://github.com/bkamins/Julia-DataFrames-Tutorial

We also assume that you have basic knowledge of the Julia language and the Julia ecosystem. There are great tutorials on this topic at [JuliaAcademy](https://juliaacademy.com/), so I encourage you to check them out.

As this is a hands-on tutorial, you can expect the examples to be implemented the way I would write them in an actual project.

The notebooks were prepared under Julia 1.5.3 and tested under Julia 1.6.1. If you have a different version of Julia installed, change the kernel using the *Kernel/Change kernel* menu option (assuming you are on Julia 1.x, all examples should work without problems).

In [1]:
VERSION

v"1.6.3"
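The version check can also be done in code. The snippet below is only a sketch: the threshold `prepared_under` is taken from the text above, and treating it as a lower bound is my assumption for illustration, not a hard requirement of the notebooks.

```julia
# Version the notebooks were prepared under (from the text above);
# using it as a minimum is an assumption made for this sketch
prepared_under = v"1.5.3"

if VERSION.major == 1 && VERSION >= prepared_under
    println("Julia $VERSION: the examples are expected to run")
else
    @warn "Julia $VERSION is older than the version the notebooks were prepared under"
end
```

`VERSION` is a `VersionNumber`, so it can be compared directly with `v"..."` literals.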

Jupyter Notebook automatically activates the project environment if one is found in the working directory.

So first let us check if we have Project.toml and Manifest.toml files present (they should be present if you cloned the repository of this tutorial).

In [2]:
isfile.(["Project.toml", "Manifest.toml"])

2-element BitVector:
 0
 0

You should get `1` printed (meaning `true`) in both entries of the vector.

If both files are present, you can be sure that you are going to use exactly the same versions of the packages that I use when running this tutorial.
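If you want to confirm which environment is actually in use, a minimal sketch like the one below may help. It assumes the notebook is started from the tutorial directory; `has_env` is a name I introduce only for this example.

```julia
using Pkg

# Path to the Project.toml that Julia is currently using
println(Base.active_project())

# The working directory holds a complete environment only if both files exist
has_env = all(isfile, ["Project.toml", "Manifest.toml"])

if has_env
    # Switch to the environment stored in the working directory
    Pkg.activate(".")
else
    @warn "Project.toml or Manifest.toml not found in $(pwd())"
end
```

When the files are missing, the default environment (for example `v1.6`) stays active and package versions are not pinned by the tutorial.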

Let us check what packages (and in what versions) we will use.

In [3]:
] status

[32m[1m      Status[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [7073ff75] [39mIJulia v1.23.2


In [5]:
] add DataFrames

[32m[1m    Updating[22m[39m registry at `C:\Users\CaioSainVallio\.julia\registries\General`
[32m[1m    Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m InvertedIndices ─ v1.1.0
[32m[1m   Installed[22m[39m PooledArrays ──── v1.3.0
[32m[1m   Installed[22m[39m DataFrames ────── v1.2.2
[32m[1m   Installed[22m[39m PrettyTables ──── v1.2.2
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [a93c6f00] [39m[92m+ DataFrames v1.2.2[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [90m [34da2185] [39m[92m+ Compat v3.39.0[39m
 [90m [a8cc5b0e] [39m[92m+ Crayons v4.0.4[39m
 [90m [9a962f9c] [39m[92m+ DataAPI v1.9.0[39m
 [90m [a93c6f00] [39m[92m+ DataFrames v1.2.2[39m
 [90m [864edb3b] [39m[92m+ DataStructures v0.18.10[39m
 [90m [e2d170a0] [39m[9

In [6]:
] add CSV

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m CodecZlib ────────── v0.7.0
[32m[1m   Installed[22m[39m InlineStrings ────── v1.0.1
[32m[1m   Installed[22m[39m WeakRefStrings ───── v1.4.1
[32m[1m   Installed[22m[39m SentinelArrays ───── v1.3.7
[32m[1m   Installed[22m[39m FilePathsBase ────── v0.9.12
[32m[1m   Installed[22m[39m TranscodingStreams ─ v0.9.6
[32m[1m   Installed[22m[39m CSV ──────────────── v0.9.6
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [336ed68f] [39m[92m+ CSV v0.9.6[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [90m [336ed68f] [39m[92m+ CSV v0.9.6[39m
 [90m [944b1d66] [39m[92m+ CodecZlib v0.7.0[39m
 [90m [48062228] [39m[92m+ FilePathsBase v0.9.12[39m
 [90m [842dd82b] [39m[92m+ InlineStrings v1.0.1[39m
 [90m [91c51154] [39m[92m+ SentinelArrays v1.3.7[39m
 [90m [3bb67fe8] [

In [7]:
] add FreqTables

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m FreqTables ──────── v0.4.5
[32m[1m   Installed[22m[39m Combinatorics ───── v1.0.2
[32m[1m   Installed[22m[39m NamedArrays ─────── v0.9.6
[32m[1m   Installed[22m[39m CategoricalArrays ─ v0.10.1
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [da1fdf0e] [39m[92m+ FreqTables v0.4.5[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [90m [324d7699] [39m[92m+ CategoricalArrays v0.10.1[39m
 [90m [861a8166] [39m[92m+ Combinatorics v1.0.2[39m
 [90m [da1fdf0e] [39m[92m+ FreqTables v0.4.5[39m
 [90m [86f7a689] [39m[92m+ NamedArrays v0.9.6[39m
 [90m [ae029012] [39m[92m+ Requires v1.1.3[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mCombinatorics[39m
[32m  ✓ [39m[90mCategoricalArrays[39m
[32m  ✓ [39m[90mNamedArrays[39m
[32m  ✓ [39mFreqTables
  4 de

In [8]:
] add GLM

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m Rmath_jll ──────── v0.3.0+0
[32m[1m   Installed[22m[39m Rmath ──────────── v0.7.0
[32m[1m   Installed[22m[39m PDMats ─────────── v0.11.1
[32m[1m   Installed[22m[39m StatsModels ────── v0.6.27
[32m[1m   Installed[22m[39m GLM ────────────── v1.5.1
[32m[1m   Installed[22m[39m QuadGK ─────────── v2.4.2
[32m[1m   Installed[22m[39m FillArrays ─────── v0.12.7
[32m[1m   Installed[22m[39m OpenSpecFun_jll ── v0.5.5+0
[32m[1m   Installed[22m[39m ShiftedArrays ──── v1.0.0
[32m[1m   Installed[22m[39m SpecialFunctions ─ v1.7.0
[32m[1m   Installed[22m[39m Distributions ──── v0.25.20
[32m[1m   Installed[22m[39m StatsFuns ──────── v0.9.12
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [38e38edf] [39m[92m+ GLM v1.5.1[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [

In [9]:
] add PyPlot

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m PyPlot ─ v2.10.0
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [d330b81b] [39m[92m+ PyPlot v2.10.0[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [90m [3da002f7] [39m[92m+ ColorTypes v0.11.0[39m
 [90m [5ae59095] [39m[92m+ Colors v0.12.8[39m
 [90m [53c48c17] [39m[92m+ FixedPointNumbers v0.8.4[39m
 [90m [b964fa9f] [39m[92m+ LaTeXStrings v1.2.1[39m
 [90m [1914dd2f] [39m[92m+ MacroTools v0.5.8[39m
 [90m [438e738f] [39m[92m+ PyCall v1.92.3[39m
 [90m [d330b81b] [39m[92m+ PyPlot v2.10.0[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39mPyPlot
  1 dependency successfully precompiled in 8 seconds (70 already precompiled)


In [10]:
] add Pipe

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m Pipe ─ v1.3.0
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [b98c9c47] [39m[92m+ Pipe v1.3.0[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [90m [b98c9c47] [39m[92m+ Pipe v1.3.0[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39mPipe
  1 dependency successfully precompiled in 2 seconds (71 already precompiled)


In [11]:
] add Arrow

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m Lz4_jll ───── v1.9.3+0
[32m[1m   Installed[22m[39m CodecZstd ─── v0.7.2
[32m[1m   Installed[22m[39m ArrowTypes ── v1.2.1
[32m[1m   Installed[22m[39m CodecLz4 ──── v0.4.0
[32m[1m   Installed[22m[39m CEnum ─────── v0.4.1
[32m[1m   Installed[22m[39m ExprTools ─── v0.1.6
[32m[1m   Installed[22m[39m Mocking ───── v0.7.3
[32m[1m   Installed[22m[39m Arrow ─────── v2.1.0
[32m[1m   Installed[22m[39m BitIntegers ─ v0.2.5
[32m[1m   Installed[22m[39m TimeZones ─── v1.6.0
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [69666777] [39m[92m+ Arrow v2.1.0[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [90m [69666777] [39m[92m+ Arrow v2.1.0[39m
 [90m [31f734f8] [39m[92m+ ArrowTypes v1.2.1[39m
 [90m [c3b6d118] [39m[92m+ BitIntegers v0.2.5[39m
 [90m [fa961155] [

In [12]:
] add Unitful

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m ConstructionBase ─ v1.3.0
[32m[1m   Installed[22m[39m Unitful ────────── v1.9.0
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [1986cc42] [39m[92m+ Unitful v1.9.0[39m
[32m[1m    Updating[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Manifest.toml`
 [90m [187b0558] [39m[92m+ ConstructionBase v1.3.0[39m
 [90m [1986cc42] [39m[92m+ Unitful v1.9.0[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mConstructionBase[39m
[32m  ✓ [39mUnitful
  2 dependencies successfully precompiled in 26 seconds (84 already precompiled)


In [13]:
] status

[32m[1m      Status[22m[39m `C:\Users\CaioSainVallio\.julia\environments\v1.6\Project.toml`
 [90m [69666777] [39mArrow v2.1.0
 [90m [336ed68f] [39mCSV v0.9.6
 [90m [a93c6f00] [39mDataFrames v1.2.2
 [90m [da1fdf0e] [39mFreqTables v0.4.5
 [90m [38e38edf] [39mGLM v1.5.1
 [90m [7073ff75] [39mIJulia v1.23.2
 [90m [b98c9c47] [39mPipe v1.3.0
 [90m [d330b81b] [39mPyPlot v2.10.0
 [90m [1986cc42] [39mUnitful v1.9.0
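The `] add ...` cells above use the package-manager mode. The same steps could be scripted with the functional `Pkg` API, as in the rough sketch below. The helper names `missing_packages` and `ensure_installed` are hypothetical (they are not part of Pkg), and I assume Julia 1.4 or later, where `Pkg.project()` is available.

```julia
using Pkg

# Packages used in this tutorial (taken from the status output above)
const REQUIRED = ["Arrow", "CSV", "DataFrames", "FreqTables",
                  "GLM", "Pipe", "PyPlot", "Unitful"]

# Hypothetical helper: names from `required` that are not direct
# dependencies of the currently active project
function missing_packages(required)
    installed = keys(Pkg.project().dependencies)
    return [name for name in required if !(name in installed)]
end

# Hypothetical helper: add only what is absent; does nothing otherwise
function ensure_installed(required)
    todo = missing_packages(required)
    isempty(todo) || Pkg.add(todo)
    return todo
end

println(missing_packages(REQUIRED))
```

Calling `ensure_installed(REQUIRED)` would then install whatever is still missing in one step, and `Pkg.status()` prints the same listing as `] status`.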


These notebooks should work with DataFrames versions 0.22 and 1.2.

If checking the status of the packages gives a warning that some of the packages are not downloaded, run the `instantiate` command in the following cell.

In [15]:
] instantiate

<div class="alert alert-block alert-info">
    <p><b>PyPlot.jl configuration:</b></p>
    <p>In some environments automatic installation of PyPlot.jl might fail. If you encounter this issue, please refer to <a href="https://github.com/JuliaPy/PyPlot.jl#installation">the PyPlot.jl installation instructions</a>. </p>
</div>

In particular, executing the following commands:

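```
using Pkg
# An empty PYTHON setting tells PyCall to use its own private Conda-based Python
ENV["PYTHON"] = ""
# Rebuild PyCall so that the new setting takes effect
Pkg.build("PyCall")
```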

should resolve the PyPlot.jl installation issues. However, on OS X sometimes more configuration steps are required. You can find the detailed instructions [here](https://github.com/JuliaPy/PyPlot.jl#os-x).

As you can see, we will use the following packages:

Package | Description
:-|:-
DataFrames.jl | the core package that is the subject of this tutorial; it is used for data manipulation; the status output above shows version 1.2.2 of this package
CSV.jl | a package for reading/writing of CSV files
FreqTables.jl | a very useful package for creating frequency tables
GLM.jl | a package for fitting Generalized Linear Models (as no data science tutorial would be complete without building some predictive model)
PyPlot.jl | a package for plotting; there are many options in the Julia ecosystem to choose from; in this tutorial we use PyPlot.jl as it is based on Matplotlib so if you have experience with the Python data science technology stack it should be familiar
Pipe.jl | a package that makes chaining of operations super powerful (which is something you probably know from `%>%` in R)
Arrow.jl | a package for working with data in Apache Arrow format
Unitful.jl | a package for working with physical units (like kg, cm, ...)
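To give a feeling for what chaining of operations means, here is a small sketch that uses only the `|>` operator from Base Julia. The data is made up for illustration. The commented lines show how I would expect the same chain to look with the `@pipe` macro from Pipe.jl, assuming the package is installed.

```julia
data = [3, 1, 4, 1, 5, 9, 2, 6]

# Base Julia: `|>` passes the left-hand value as the only argument,
# so calls with more arguments need an anonymous function
result = data |> sort |> (v -> filter(isodd, v)) |> sum

println(result)

# With Pipe.jl (assumed installed) the placeholder `_` avoids the anonymous function:
#   using Pipe
#   @pipe data |> sort |> filter(isodd, _) |> sum
```

The placeholder form is closer to what you may know from `%>%` in R, which is why the tutorial relies on Pipe.jl for longer chains.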