## DataFrames part I

In this session we are going to start working with tabular data. The `DataFrames` package helps us analyze our data efficiently. For people with experience in R or Python, the term DataFrame will be familiar.

In [2]:
using CSV
using DataFrames

Let's get our first DataFrame. Keep in mind that the `CSV` package is also needed here.

In [3]:
pedigree = DataFrame(CSV.File("tilapia_pedigree.csv"))

Row,Animal,Father,Mother,Line
Unnamed: 0_level_1,Int64,String15,String7,String15
1,10977965,Sire_01,Dam_01,Wami
2,10978207,Sire_01,Dam_01,Wami
3,10978378,Sire_01,Dam_01,Wami
4,10978460,Sire_01,Dam_01,Wami
5,10977732,Sire_01,Dam_01,Wami
6,10978532,Sire_01,Dam_01,Wami
7,10978612,Sire_01,Dam_01,Wami
8,10977963,Sire_01,Dam_01,Wami
9,10977682,Sire_01,Dam_01,Wami
10,10978295,Sire_01,Dam_01,Wami


Equivalently

In [4]:
pedigree = CSV.read("tilapia_pedigree.csv", DataFrame);

We cannot take for granted that every file is delimited with a comma. In such cases the `delim` argument is useful, as we see below.

In [8]:
pheno = DataFrame(CSV.File("tilapia_pheno.txt", delim=";"))

Row,Animal_Id,Location,Line,Weight_initial,Weight_final,Length_initial,Length_final
Unnamed: 0_level_1,Int64,String15,String15,Float64,Float64,Float64,Float64
1,10977965,Kunduchi,Wami,84.3,201.5,14.2,18.0
2,10978207,Kunduchi,Wami,67.8,0.0,13.4,0.0
3,10978378,Kunduchi,Wami,103.94,225.9,14.4,18.5
4,10978460,Kunduchi,Wami,84.85,239.2,14.5,19.9
5,10977732,Kunduchi,Wami,78.29,195.6,14.0,19.0
6,10978532,Kunduchi,Wami,89.93,220.0,14.7,19.0
7,10978612,Kunduchi,Wami,117.83,270.9,15.0,20.0
8,10977963,Kunduchi,Wami,92.48,209.4,14.9,18.3
9,10977682,Kunduchi,Wami,95.54,178.8,14.9,17.7
10,10978295,Kunduchi,Wami,70.27,135.2,13.3,16.0


Or, if you prefer, the approach below:

In [9]:
pheno = CSV.read("tilapia_pheno.txt", DataFrame; delim=";");

Skimming through the table above we can spot some zeroes. A weight or length of zero is not possible, so most likely zero denotes a missing value here. We will see more about missing values later in the course. At this stage we just want to show that, when loading a dataframe, we can specify the symbol that is used to denote a missing value.

In [10]:
pheno = CSV.read("tilapia_pheno.txt", DataFrame; delim=";", missingstring="0")

Row,Animal_Id,Location,Line,Weight_initial,Weight_final,Length_initial,Length_final
Unnamed: 0_level_1,Int64,String15,String15,Float64,Float64?,Float64,Float64?
1,10977965,Kunduchi,Wami,84.3,201.5,14.2,18.0
2,10978207,Kunduchi,Wami,67.8,missing,13.4,missing
3,10978378,Kunduchi,Wami,103.94,225.9,14.4,18.5
4,10978460,Kunduchi,Wami,84.85,239.2,14.5,19.9
5,10977732,Kunduchi,Wami,78.29,195.6,14.0,19.0
6,10978532,Kunduchi,Wami,89.93,220.0,14.7,19.0
7,10978612,Kunduchi,Wami,117.83,270.9,15.0,20.0
8,10977963,Kunduchi,Wami,92.48,209.4,14.9,18.3
9,10977682,Kunduchi,Wami,95.54,178.8,14.9,17.7
10,10978295,Kunduchi,Wami,70.27,135.2,13.3,16.0
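
The same keyword works for other markers. The sketch below is a minimal, self-contained illustration: the file contents are hypothetical and are read from memory through an `IOBuffer`, and the marker `NA` is an assumption chosen for the example (the tilapia file uses `0`).

```julia
using CSV
using DataFrames

# Hypothetical file contents, kept in memory so the example runs without a file.
data = """
Animal_Id;Weight_initial;Weight_final
1;84.3;201.5
2;67.8;NA
3;103.94;225.9
"""

# Same call pattern as above, with "NA" as the missing-value marker.
toy = CSV.read(IOBuffer(data), DataFrame; delim=";", missingstring="NA")

ismissing(toy.Weight_final[2])   # true
eltype(toy.Weight_final)         # Union{Missing, Float64}
```

Only the column that actually contains the marker gets an element type that allows `missing`, which matches the `Float64?` shown in the table headers above.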


With the following we can see how many columns or rows our dataframe has

In [11]:
ncol(pheno)

7

In [12]:
nrow(pheno)

1752
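
If you want both numbers at once, `size` is an alternative. A minimal sketch on a small hypothetical dataframe (`toy` is made up for illustration):

```julia
using DataFrames

# A small hypothetical dataframe with 3 rows and 2 columns.
toy = DataFrame(Animal_Id = [1, 2, 3], Line = ["Wami", "Wami", "Bwawani"])

size(toy)       # (3, 2): rows first, then columns
size(toy, 1)    # 3, same as nrow(toy)
size(toy, 2)    # 2, same as ncol(toy)
```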

The following function is really useful when working with dataframes in Julia. It instantly gives us a short overview of the numeric columns, as well as the type of data each column holds.

In [13]:
describe(pedigree)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,Animal,10976400.0,10973049,10976100.0,10988600,0,Int64
2,Father,,Sire_01,,Sire_09,0,String15
3,Mother,,Dam_01,,Dam_09,0,String7
4,Line,,Bwawani,,Wami,0,String15


Let's do the same for the second dataframe.

In [14]:
describe(pheno)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,Type
1,Animal_Id,10976400.0,10973049,10976100.0,10988600,0,Int64
2,Location,,Kunduchi,,Pangani,0,String15
3,Line,,Bwawani,,Wami,0,String15
4,Weight_initial,46.0318,8.44,41.8,128.85,0,Float64
5,Weight_final,113.918,18.19,103.5,325.4,359,"Union{Missing, Float64}"
6,Length_initial,11.0472,5.0,11.0,16.7,0,Float64
7,Length_final,14.5936,9.0,14.5,21.5,359,"Union{Missing, Float64}"
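
When the full summary is more than we need, `describe` also accepts the names of the statistics to report. The sketch below assumes a small hypothetical dataframe (`toy`) and asks only for the mean and the number of missing values.

```julia
using DataFrames

# Hypothetical data: one text column and one numeric column with a missing value.
toy = DataFrame(Line = ["Wami", "Bwawani", "Wami"],
                Weight_final = [201.5, missing, 225.9])

# Report only the selected statistics.
overview = describe(toy, :mean, :nmissing)

names(overview)      # ["variable", "mean", "nmissing"]
overview.nmissing    # [0, 1]
```

The mean of a text column is reported as `nothing`, which is displayed as an empty cell in the tables above.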


### Slicing dataframes

Extracting sections of a dataframe is very common. Let's see how we can accomplish that.

In [15]:
pheno[20:30,:]

Row,Animal_Id,Location,Line,Weight_initial,Weight_final,Length_initial,Length_final
Unnamed: 0_level_1,Int64,String15,String15,Float64,Float64?,Float64,Float64?
1,10978586,Kunduchi,Wami,84.97,179.3,14.0,17.1
2,10978426,Kunduchi,Wami,69.47,missing,13.3,missing
3,10978463,Kunduchi,Wami,70.53,196.1,13.4,18.0
4,10977973,Kunduchi,Wami,69.51,171.1,13.3,16.7
5,10978434,Kunduchi,Wami,58.31,134.0,12.4,15.9
6,10978506,Kunduchi,Wami,66.77,176.3,13.0,17.7
7,10977686,Kunduchi,Wami,59.62,205.2,12.6,18.0
8,10978184,Kunduchi,Wami,67.13,176.4,13.1,17.0
9,10978407,Kunduchi,Wami,63.1,124.1,12.7,15.0
10,10977751,Kunduchi,Wami,49.72,142.3,12.2,18.3


Here we also select based on columns. We extract rows 20 to 30 as above and the first 4 columns.

In [16]:
pheno[20:30,1:4]

Row,Animal_Id,Location,Line,Weight_initial
Unnamed: 0_level_1,Int64,String15,String15,Float64
1,10978586,Kunduchi,Wami,84.97
2,10978426,Kunduchi,Wami,69.47
3,10978463,Kunduchi,Wami,70.53
4,10977973,Kunduchi,Wami,69.51
5,10978434,Kunduchi,Wami,58.31
6,10978506,Kunduchi,Wami,66.77
7,10977686,Kunduchi,Wami,59.62
8,10978184,Kunduchi,Wami,67.13
9,10978407,Kunduchi,Wami,63.1
10,10977751,Kunduchi,Wami,49.72


We can also use column names.

In [17]:
pheno[20:30,["Animal_Id","Line"]]

Row,Animal_Id,Line
Unnamed: 0_level_1,Int64,String15
1,10978586,Wami
2,10978426,Wami
3,10978463,Wami
4,10977973,Wami
5,10978434,Wami
6,10978506,Wami
7,10977686,Wami
8,10978184,Wami
9,10978407,Wami
10,10977751,Wami


Alternatively we can refer to column names as follows. It's a special syntax that we briefly mentioned in a previous session. The column names can be represented as so-called `Symbols` in Julia. Keep in mind that `Symbols` are not specific to column names and can be used in other cases as well.

In [31]:
pheno[20:30,[:Animal_Id, :Location, :Line, :Weight_initial]]

Row,Animal_Id,Location,Line,Weight_initial
Unnamed: 0_level_1,Int64,String15,String15,Float64
1,10978586,Kunduchi,Wami,84.97
2,10978426,Kunduchi,Wami,69.47
3,10978463,Kunduchi,Wami,70.53
4,10977973,Kunduchi,Wami,69.51
5,10978434,Kunduchi,Wami,58.31
6,10978506,Kunduchi,Wami,66.77
7,10977686,Kunduchi,Wami,59.62
8,10978184,Kunduchi,Wami,67.13
9,10978407,Kunduchi,Wami,63.1
10,10977751,Kunduchi,Wami,49.72
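
To make the relationship between strings and `Symbols` concrete, here is a minimal sketch on a hypothetical two-row dataframe (`toy`). It shows that both forms select the same column.

```julia
using DataFrames

toy = DataFrame(Animal_Id = [1, 2], Line = ["Wami", "Bwawani"])

col = :Line                        # a Symbol literal
typeof(col)                        # Symbol
Symbol("Line") == :Line            # true: a Symbol can be built from a string

# Indexing with a Symbol or with a string gives the same column.
toy[:, :Line] == toy[:, "Line"]    # true

# Column names as Symbols (names(toy) returns them as strings).
propertynames(toy)                 # [:Animal_Id, :Line]
```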


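We can do more elaborate slicing. Let's say that we want only the rows where `Weight_initial` is 40 or more. The comparison below returns a vector of `true`/`false` values (printed as 1 and 0), one per row, which we can then use to index the rows.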

In [73]:
pheno.Weight_initial .≥ 40 

1752-element BitVector:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 0
 0
 1
 1
 0
 0
 1
 0
 0

In [18]:
pheno[pheno.Weight_initial .≥ 40,:]

Row,Animal_Id,Location,Line,Weight_initial,Weight_final,Length_initial,Length_final
Unnamed: 0_level_1,Int64,String15,String15,Float64,Float64?,Float64,Float64?
1,10977965,Kunduchi,Wami,84.3,201.5,14.2,18.0
2,10978207,Kunduchi,Wami,67.8,missing,13.4,missing
3,10978378,Kunduchi,Wami,103.94,225.9,14.4,18.5
4,10978460,Kunduchi,Wami,84.85,239.2,14.5,19.9
5,10977732,Kunduchi,Wami,78.29,195.6,14.0,19.0
6,10978532,Kunduchi,Wami,89.93,220.0,14.7,19.0
7,10978612,Kunduchi,Wami,117.83,270.9,15.0,20.0
8,10977963,Kunduchi,Wami,92.48,209.4,14.9,18.3
9,10977682,Kunduchi,Wami,95.54,178.8,14.9,17.7
10,10978295,Kunduchi,Wami,70.27,135.2,13.3,16.0
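
Several conditions can be combined into one mask. The sketch below uses a small hypothetical dataframe (`toy`) and the element-wise "and" operator `.&`. Note the parentheses around each condition, which are needed because of operator precedence.

```julia
using DataFrames

toy = DataFrame(Line = ["Wami", "Wami", "Bwawani", "Bwawani"],
                Weight_initial = [84.3, 35.0, 41.2, 12.9])

# Rows that are both at least 40 in weight and from the Wami line.
mask = (toy.Weight_initial .≥ 40) .& (toy.Line .== "Wami")
heavy_wami = toy[mask, :]

# Element-wise "or" works the same way with .|
either = toy[(toy.Weight_initial .≥ 40) .| (toy.Line .== "Wami"), :]

nrow(heavy_wami)   # 1
nrow(either)       # 3
```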


When we want to select all rows we can use either `:` or `!`. Let's see how they differ, starting with their speed.

In [19]:
using BenchmarkTools

In [20]:
@benchmark $pheno[:,[:Location, :Line]]

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.062 μs … 1.656 ms  ┊ GC (min … max):  0.00% … 95.93%
 Time  (median):     1.946 μs             ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.093 μs ± 32.633 μs ┊ GC (mean ± σ):  32.66% ±  3.30%

Now we will try the same, but this time selecting rows using `!`.

In [21]:
@benchmark $pheno[!, [:Location, :Line]]

BenchmarkTools.Trial: 10000 samples with 204 evaluations.
 Range (min … max):  372.343 ns … 45.820 μs ┊ GC (min … max):  0.00% … 98.14%
 Time  (median):     417.892 ns             ┊ GC (median):     0.00%
 Time  (mean ± σ):   509.741 ns ± 1.136 μs  ┊ GC (mean ± σ):  12.61% ±  6.48%

As the benchmarks show, using `!` is faster. It also requires substantially less memory, because `:` creates a copy of the selected columns behind the scenes, while `!` returns the columns of the source dataframe without copying them. As a result, any change we make to a selection obtained with `!` also affects the source dataframe. Keep this in mind, as it can lead to bugs that are difficult to spot.
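
The difference between copying and sharing can be checked directly. This is a minimal sketch on a hypothetical two-row dataframe (`toy`): a change made through the `:` selection leaves the source untouched, while a change made through the `!` selection shows up in the source.

```julia
using DataFrames

toy = DataFrame(Line = ["Wami", "Bwawani"], Weight_initial = [84.3, 41.2])

copied = toy[:, [:Weight_initial]]   # columns are copied
shared = toy[!, [:Weight_initial]]   # columns are shared with toy

# Modifying the copy does not touch the source dataframe.
copied.Weight_initial[1] = 0.0
toy.Weight_initial[1]                # still 84.3

# Modifying the shared selection changes the source dataframe as well.
shared.Weight_initial[1] = 999.0
toy.Weight_initial[1]                # now 999.0
```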

Now let's explore some more ways to select columns. First we will use a regular expression to select the columns that have `Weight` in their name.

In [22]:
pheno[!,r"Weight"]

Row,Weight_initial,Weight_final
Unnamed: 0_level_1,Float64,Float64?
1,84.3,201.5
2,67.8,missing
3,103.94,225.9
4,84.85,239.2
5,78.29,195.6
6,89.93,220.0
7,117.83,270.9
8,92.48,209.4
9,95.54,178.8
10,70.27,135.2


If I needed to do the opposite, i.e. select the columns that don't contain `Weight`, I could use `Not`.

In [23]:
pheno[!,Not(r"Weight")]

Row,Animal_Id,Location,Line,Length_initial,Length_final
Unnamed: 0_level_1,Int64,String15,String15,Float64,Float64?
1,10977965,Kunduchi,Wami,14.2,18.0
2,10978207,Kunduchi,Wami,13.4,missing
3,10978378,Kunduchi,Wami,14.4,18.5
4,10978460,Kunduchi,Wami,14.5,19.9
5,10977732,Kunduchi,Wami,14.0,19.0
6,10978532,Kunduchi,Wami,14.7,19.0
7,10978612,Kunduchi,Wami,15.0,20.0
8,10977963,Kunduchi,Wami,14.9,18.3
9,10977682,Kunduchi,Wami,14.9,17.7
10,10978295,Kunduchi,Wami,13.3,16.0


Below I am selecting all columns between `Line` and `Length_initial`

In [24]:
pheno[!,Between("Line","Length_initial")]

Row,Line,Weight_initial,Weight_final,Length_initial
Unnamed: 0_level_1,String15,Float64,Float64?,Float64
1,Wami,84.3,201.5,14.2
2,Wami,67.8,missing,13.4
3,Wami,103.94,225.9,14.4
4,Wami,84.85,239.2,14.5
5,Wami,78.29,195.6,14.0
6,Wami,89.93,220.0,14.7
7,Wami,117.83,270.9,15.0
8,Wami,92.48,209.4,14.9
9,Wami,95.54,178.8,14.9
10,Wami,70.27,135.2,13.3


In the following we select the columns that start with `Length`.

In [25]:
pheno[!,Cols(startswith("Length"))]

Row,Length_initial,Length_final
Unnamed: 0_level_1,Float64,Float64?
1,14.2,18.0
2,13.4,missing
3,14.4,18.5
4,14.5,19.9
5,14.0,19.0
6,14.7,19.0
7,15.0,20.0
8,14.9,18.3
9,14.9,17.7
10,13.3,16.0


In cases where we are dealing with dataframes that have a large number of columns and we want to perform an elaborate selection the following could be of help.

In [29]:
 names(pheno)

7-element Vector{String}:
 "Animal_Id"
 "Location"
 "Line"
 "Weight_initial"
 "Weight_final"
 "Length_initial"
 "Length_final"

Using the `names` function we can make complex selections in a two-step process: first collect the column names we need, then use them to index the dataframe.

In [30]:
cols_needed = names(pheno, Cols(r"Weight",startswith("Li")))

3-element Vector{String}:
 "Weight_initial"
 "Weight_final"
 "Line"

In [31]:
pheno[!,cols_needed]

Row,Weight_initial,Weight_final,Line
Unnamed: 0_level_1,Float64,Float64?,String15
1,84.3,201.5,Wami
2,67.8,missing,Wami
3,103.94,225.9,Wami
4,84.85,239.2,Wami
5,78.29,195.6,Wami
6,89.93,220.0,Wami
7,117.83,270.9,Wami
8,92.48,209.4,Wami
9,95.54,178.8,Wami
10,70.27,135.2,Wami


The following possibility could also be handy.

In [32]:
names(pheno,Real)

3-element Vector{String}:
 "Animal_Id"
 "Weight_initial"
 "Length_initial"

In this particular case, though, the first column is more of an ID, so we change its type to a string.

In [33]:
pheno.Animal_Id = string.(pheno.Animal_Id)

1752-element Vector{String}:
 "10977965"
 "10978207"
 "10978378"
 "10978460"
 "10977732"
 "10978532"
 "10978612"
 "10977963"
 "10977682"
 "10978295"
 ⋮
 "10977635"
 "10978151"
 "10977918"
 "10977945"
 "10978369"
 "10978013"
 "10978579"
 "10977708"
 "10977975"

In [34]:
pheno

Row,Animal_Id,Location,Line,Weight_initial,Weight_final,Length_initial,Length_final
Unnamed: 0_level_1,String,String15,String15,Float64,Float64?,Float64,Float64?
1,10977965,Kunduchi,Wami,84.3,201.5,14.2,18.0
2,10978207,Kunduchi,Wami,67.8,missing,13.4,missing
3,10978378,Kunduchi,Wami,103.94,225.9,14.4,18.5
4,10978460,Kunduchi,Wami,84.85,239.2,14.5,19.9
5,10977732,Kunduchi,Wami,78.29,195.6,14.0,19.0
6,10978532,Kunduchi,Wami,89.93,220.0,14.7,19.0
7,10978612,Kunduchi,Wami,117.83,270.9,15.0,20.0
8,10977963,Kunduchi,Wami,92.48,209.4,14.9,18.3
9,10977682,Kunduchi,Wami,95.54,178.8,14.9,17.7
10,10978295,Kunduchi,Wami,70.27,135.2,13.3,16.0


If we are interested in selecting columns with different data types, we can combine the types in a `Union`.

In [39]:
names(pheno, Union{String,Real, Missing})

5-element Vector{String}:
 "Animal_Id"
 "Weight_initial"
 "Weight_final"
 "Length_initial"
 "Length_final"

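The reason `Missing` appears in the `Union` above is that a column that may hold missing values has the element type `Union{Missing, Float64}`, which is not a subtype of `Real`. A minimal sketch on a hypothetical dataframe (`toy`) illustrates this:

```julia
using DataFrames

toy = DataFrame(Animal_Id = ["a", "b"],
                Weight_initial = [84.3, 67.8],
                Weight_final = [201.5, missing])

# Only columns whose element type is a subtype of Real.
names(toy, Real)                    # ["Weight_initial"]

# Allowing Missing also picks up the column that contains missing values.
names(toy, Union{Missing, Real})    # ["Weight_initial", "Weight_final"]

# AbstractString matches any string type, not only String.
names(toy, AbstractString)          # ["Animal_Id"]
```

In the same way, `Location` and `Line` are absent from the result above because their type is `String15` rather than `String`. Selecting on `AbstractString` should include them as well.
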
Now let's see how we can get insights from our data.

In [40]:
using Statistics

In [41]:
mean(pheno.Weight_initial)

46.03179223744293

Let's try something a bit more elaborate. We want to create a new column that contains the standardized weights.

In [42]:
pheno.weight_init_std = (pheno.Weight_initial - 
    mean(pheno.Weight_initial))/std(pheno.Weight_initial)

MethodError: MethodError: no method matching -(::Vector{Float64}, ::Float64)
For element-wise subtraction, use broadcasting with dot syntax: array .- scalar
The function `-` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  -(!Matched::Missing, ::Number)
   @ Base missing.jl:123
  -(!Matched::Complex{Bool}, ::Real)
   @ Base complex.jl:329
  -(!Matched::BigFloat, ::Union{Float16, Float32, Float64})
   @ Base mpfr.jl:565
  ...


The above didn't work: we cannot subtract a single number from a vector with plain `-`. Remember the `.` when we want to apply an operation element-wise.

In [47]:
pheno.weight_init_std = (pheno.Weight_initial .- 
    mean(pheno.Weight_initial))/std(pheno.Weight_initial)

1752-element Vector{Float64}:
  1.735548950937706
  0.9872369874364745
  2.6262669487052324
  1.7604926830544134
  1.4629819872624092
  1.990882063332369
  3.256209565252633
  2.1065302758734683
  2.245308130922788
  1.099257020760598
  ⋮
 -0.6998663484572109
 -0.7338805286163578
 -0.2372734982928134
  0.15910508116177838
 -0.8663090700359697
 -1.2703975303266346
  0.04254982381643505
 -0.2812651712986433
 -0.6300238985304293

Let's verify whether it worked.

In [44]:
mean(pheno.weight_init_std)

-3.1228191012287506e-16

In [45]:
std(pheno.weight_init_std)

1.0
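
A column with missing values, such as `Weight_final`, needs one extra step, because `mean` and `std` return `missing` as soon as a single value is missing. The sketch below uses a short hypothetical vector `w` and `skipmissing` to compute the statistics on the observed values only. Missing values are covered in more detail later in the course, so treat this as a preview.

```julia
using Statistics

# Hypothetical weights with one missing value.
w = [201.5, missing, 225.9, 239.2]

m = mean(skipmissing(w))   # mean of the observed values
s = std(skipmissing(w))    # standard deviation of the observed values

# Broadcasting keeps the missing entry missing.
z = (w .- m) ./ s

ismissing(z[2])            # true
```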

Because we assigned the result to `pheno.weight_init_std`, it is stored as a new column of the dataframe, as the summary below confirms.

In [48]:
describe(pheno)

Row,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,Type
1,Animal_Id,,10973049,,10988600,0,String
2,Location,,Kunduchi,,Pangani,0,String15
3,Line,,Bwawani,,Wami,0,String15
4,Weight_initial,46.0318,8.44,41.8,128.85,0,Float64
5,Weight_final,113.918,18.19,103.5,325.4,359,"Union{Missing, Float64}"
6,Length_initial,11.0472,5.0,11.0,16.7,0,Float64
7,Length_final,14.5936,9.0,14.5,21.5,359,"Union{Missing, Float64}"
8,weight_init_std,-3.12282e-16,-1.70487,-0.191921,3.75599,0,Float64


The `insertcols!` function gives us flexibility in controlling the position of the newly added column. Let's see a few examples.

In [49]:
insertcols!(pheno, :country => "Africa")

Row,Animal_Id,Location,Line,Weight_initial,Weight_final,Length_initial,Length_final,weight_init_std,country
Unnamed: 0_level_1,String,String15,String15,Float64,Float64?,Float64,Float64?,Float64,String
1,10977965,Kunduchi,Wami,84.3,201.5,14.2,18.0,1.73555,Africa
2,10978207,Kunduchi,Wami,67.8,missing,13.4,missing,0.987237,Africa
3,10978378,Kunduchi,Wami,103.94,225.9,14.4,18.5,2.62627,Africa
4,10978460,Kunduchi,Wami,84.85,239.2,14.5,19.9,1.76049,Africa
5,10977732,Kunduchi,Wami,78.29,195.6,14.0,19.0,1.46298,Africa
6,10978532,Kunduchi,Wami,89.93,220.0,14.7,19.0,1.99088,Africa
7,10978612,Kunduchi,Wami,117.83,270.9,15.0,20.0,3.25621,Africa
8,10977963,Kunduchi,Wami,92.48,209.4,14.9,18.3,2.10653,Africa
9,10977682,Kunduchi,Wami,95.54,178.8,14.9,17.7,2.24531,Africa
10,10978295,Kunduchi,Wami,70.27,135.2,13.3,16.0,1.09926,Africa


In the following I also specify the position of the new column. By default it is placed before the column I refer to.

In [50]:
insertcols!(pheno, :Line, :Fittness => "Adequate")

Row,Animal_Id,Location,Fittness,Line,Weight_initial,Weight_final,Length_initial,Length_final,weight_init_std,country
Unnamed: 0_level_1,String,String15,String,String15,Float64,Float64?,Float64,Float64?,Float64,String
1,10977965,Kunduchi,Adequate,Wami,84.3,201.5,14.2,18.0,1.73555,Africa
2,10978207,Kunduchi,Adequate,Wami,67.8,missing,13.4,missing,0.987237,Africa
3,10978378,Kunduchi,Adequate,Wami,103.94,225.9,14.4,18.5,2.62627,Africa
4,10978460,Kunduchi,Adequate,Wami,84.85,239.2,14.5,19.9,1.76049,Africa
5,10977732,Kunduchi,Adequate,Wami,78.29,195.6,14.0,19.0,1.46298,Africa
6,10978532,Kunduchi,Adequate,Wami,89.93,220.0,14.7,19.0,1.99088,Africa
7,10978612,Kunduchi,Adequate,Wami,117.83,270.9,15.0,20.0,3.25621,Africa
8,10977963,Kunduchi,Adequate,Wami,92.48,209.4,14.9,18.3,2.10653,Africa
9,10977682,Kunduchi,Adequate,Wami,95.54,178.8,14.9,17.7,2.24531,Africa
10,10978295,Kunduchi,Adequate,Wami,70.27,135.2,13.3,16.0,1.09926,Africa


If I want the new column to be placed after the one I refer to, I can set `after=true`.

In [51]:
insertcols!(pheno, :Line, :harvest => "Yes"; after=true)

Row,Animal_Id,Location,Fittness,Line,harvest,Weight_initial,Weight_final,Length_initial,Length_final,weight_init_std,country
Unnamed: 0_level_1,String,String15,String,String15,String,Float64,Float64?,Float64,Float64?,Float64,String
1,10977965,Kunduchi,Adequate,Wami,Yes,84.3,201.5,14.2,18.0,1.73555,Africa
2,10978207,Kunduchi,Adequate,Wami,Yes,67.8,missing,13.4,missing,0.987237,Africa
3,10978378,Kunduchi,Adequate,Wami,Yes,103.94,225.9,14.4,18.5,2.62627,Africa
4,10978460,Kunduchi,Adequate,Wami,Yes,84.85,239.2,14.5,19.9,1.76049,Africa
5,10977732,Kunduchi,Adequate,Wami,Yes,78.29,195.6,14.0,19.0,1.46298,Africa
6,10978532,Kunduchi,Adequate,Wami,Yes,89.93,220.0,14.7,19.0,1.99088,Africa
7,10978612,Kunduchi,Adequate,Wami,Yes,117.83,270.9,15.0,20.0,3.25621,Africa
8,10977963,Kunduchi,Adequate,Wami,Yes,92.48,209.4,14.9,18.3,2.10653,Africa
9,10977682,Kunduchi,Adequate,Wami,Yes,95.54,178.8,14.9,17.7,2.24531,Africa
10,10978295,Kunduchi,Adequate,Wami,Yes,70.27,135.2,13.3,16.0,1.09926,Africa
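
The examples above fill the new column with a single repeated value. As a minimal sketch on a hypothetical dataframe (`toy`), `insertcols!` can also take a vector or a range with one value per row, and the position can be given as a column number instead of a name.

```julia
using DataFrames

toy = DataFrame(Animal_Id = [1, 2, 3], Line = ["Wami", "Wami", "Bwawani"])

# Insert a running row number as the first column (position 1).
insertcols!(toy, 1, :row_number => 1:nrow(toy))

# Insert a vector of values right after the Line column.
insertcols!(toy, :Line, :Weight_initial => [84.3, 67.8, 41.2]; after=true)

names(toy)   # ["row_number", "Animal_Id", "Line", "Weight_initial"]
```

The length of the inserted vector has to match the number of rows in the dataframe.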


We will finish this session with a core skill: how to save a dataframe to a file.

In [52]:
CSV.write("pheno_modified.csv", pheno)

"pheno_modified.csv"

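`CSV.write` accepts a `delim` argument just like the reading functions. The sketch below writes a small hypothetical dataframe (`toy`) to a temporary folder with `;` as the delimiter and reads it back to check the round trip. The file name `toy_pheno.txt` is made up for the example.

```julia
using CSV
using DataFrames

toy = DataFrame(Animal_Id = [1, 2],
                Line = ["Wami", "Bwawani"],
                Weight_initial = [84.3, 41.2])

# Write to a temporary folder so that no existing file is overwritten.
path = joinpath(mktempdir(), "toy_pheno.txt")
CSV.write(path, toy; delim=';')

# The header line now uses the chosen delimiter.
header = readline(path)     # "Animal_Id;Line;Weight_initial"

# Reading the file back with the same delimiter recovers the data.
toy_back = CSV.read(path, DataFrame; delim=';')
toy_back == toy             # true
```
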
## Exercises

### Exercise 1

We will use a dataset from Kaggle https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand?resource=download on hotel bookings. You can find the file under today's folder with the name `hotel_bookings.csv`.

* Create a dataframe named hotel_bookings. Indicate that NULL denotes a missing value.
* How many rows and columns does it contain?
* Create a data summary.
* Select rows 10000 - 10100 and the columns `reserved_room_type` and `reservation_status`.
* Select rows where `hotel` is `City Hotel`, `arrival_date_month` is `December` and `adr` is 100 or less. How many rows do you end up with?
* Select the column names that contain `previous`.
* Select the column names between `adults` and `is_repeated_guest`.
* Create a new dataframe that contains all the numeric columns from hotel_bookings.
* Create a new column that contains the values of the `adr` column raised to the power of 3. Name it `adr_cube`.
* Write the modified dataframe in a compressed (zip) format to a file named `hotel_bookings_modified.zip`. Use `:` to separate the columns.