# Data Basics

Before running this, please make sure to activate and instantiate the
tutorial-specific package environment, using this
[`Project.toml`](https://raw.githubusercontent.com/juliaai/DataScienceTutorials.jl/gh-pages/__generated/D0-loading/Project.toml) and
[this `Manifest.toml`](https://raw.githubusercontent.com/juliaai/DataScienceTutorials.jl/gh-pages/__generated/D0-loading/Manifest.toml), or by following
[these](https://juliaai.github.io/DataScienceTutorials.jl/#learning_by_doing) detailed instructions.

In [1]:
using Pkg; Pkg.activate("D:/JULIA/6_ML_with_Julia/D0-loading") ; Pkg.instantiate()

  Activating project at `D:\JULIA\6_ML_with_Julia\D0-loading`


LoadError: `StatsBase` is a direct dependency, but does not appear in the manifest. If you intend `StatsBase` to be a direct dependency, run `Pkg.resolve()` to populate the manifest. Otherwise, remove `StatsBase` with `Pkg.rm("StatsBase")`. Finally, run `Pkg.instantiate()` again.

In [2]:
Pkg.resolve()

    Updating `D:\JULIA\6_ML_with_Julia\D0-loading\Project.toml`
  [2913bbd2] + StatsBase v0.33.21
    Updating `D:\JULIA\6_ML_with_Julia\D0-loading\Manifest.toml`
  [d360d2e6] + ChainRulesCore v1.15.6
  [9e997f8a] + ChangesOfVariables v0.1.4
  [ffbed154] + DocStringExtensions v0.9.1
  [3587e190] + InverseFunctions v0.1.8
  [92d709cd] + IrrationalConstants v0.1.1
  [2ab3a3ac] + LogExpFunctions v0.3.18
  [82ae8749] + StatsAPI v1.5.0
  [2913bbd2] + StatsBase v0.33.21
  [7b1f6079] + FileWatching


In [3]:
using Pkg; Pkg.activate("D:/JULIA/6_ML_with_Julia/D0-loading") ; Pkg.instantiate()

  Activating project at `D:\JULIA\6_ML_with_Julia\D0-loading`


In this short tutorial we discuss two ways to easily load data in Julia:

1. loading a standard dataset via `RDatasets.jl`,
1. loading a local file with `CSV.jl`.

## Using RDatasets

The package [RDatasets.jl](https://github.com/JuliaStats/RDatasets.jl) provides access to most of the many datasets listed on [this page](http://vincentarelbundock.github.io/Rdatasets/datasets.html).
These are well-known, standard datasets, such as `iris`, `crabs`, and `Boston`, that can be used to get started with data processing and classical machine learning.

To load such a dataset, you will need to specify which R package it belongs to as well as its name; for instance `Boston` is part of `MASS`.

In [4]:
using RDatasets
import DataFrames

boston = dataset("MASS", "Boston");

The fact that `Boston` is part of `MASS` is clearly indicated on the [list](http://vincentarelbundock.github.io/Rdatasets/datasets.html) linked to earlier.
While it can be a bit slow, loading a dataset via RDatasets is simple and convenient, as you don't have to worry about details such as setting the column names.

The `dataset` function returns a `DataFrame` object from the [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl) package.

In [5]:
typeof(boston)

DataFrame

For a short introduction to DataFrame objects, see [this tutorial](/data/dataframe).
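
As a quick check after loading, the following sketch inspects the `Boston` table and lists what else the `MASS` package offers. It assumes `RDatasets.datasets` (the listing function described in the RDatasets README) is available in your installed version; the variable name `mass_list` is only illustrative.

```julia
using RDatasets
import DataFrames

# Load the Boston dataset from the R package MASS
boston = dataset("MASS", "Boston")

# Dimensions and column names of the resulting DataFrame
println(size(boston))
println(names(boston))

# Summary statistics for every column
println(DataFrames.describe(boston))

# Assumption: RDatasets.datasets("MASS") returns a table of the datasets
# available in the MASS package, with a `Dataset` column holding their names
mass_list = RDatasets.datasets("MASS")
println(first(mass_list, 5))
```

Listing a package this way is an alternative to looking up the dataset on the web page linked earlier.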

## Using CSV

The package [CSV.jl](https://github.com/JuliaData/CSV.jl) offers a powerful way to read arbitrary CSV files efficiently.
In particular, the `CSV.read` function reads a file and returns a DataFrame.

### Basic usage

Let's say you have a file `foo.csv` at some path `fpath=joinpath("data", "foo.csv")` with the content

```
col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true
```

In [6]:
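# Content of the example file; the header matches the listing above
c = """
col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true
"""
# mktemp() returns (path, io); keep only the path and write the content to it
fpath, = mktemp()
write(fpath, c);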

In [7]:
fpath

"C:\\Users\\jeffr\\AppData\\Local\\Temp\\jl_76A2.tmp"

In [8]:
mktemp()

("C:\\Users\\jeffr\\AppData\\Local\\Temp\\jl_8912.tmp", IOStream(<file C:\Users\jeffr\AppData\Local\Temp\jl_8912.tmp>))

* `mktemp()` returns a temporary path together with an input/output stream; it is exported from `Base.Filesystem`.

     - **`mktemp(parent=tempdir(); cleanup=true) -> (path, io)`**

     - Return `(path, io)`, where `path` is the path of a new temporary file in `parent` and `io` is an open file object for this path. The `cleanup` option controls whether the temporary file is automatically deleted when the process exits.
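
The `(path, io)` pair can also be handled with the do-block form of `mktemp`, which closes the stream and removes the file when the block returns. The following is a minimal sketch using only Base; the names `csv_text` and `lines` are illustrative.

```julia
# Sketch: write CSV text to a temporary file and read it back (Base only)
csv_text = """
col1,col2
1,one
2,two
"""

# The do-block form of `mktemp` passes (path, io) to the block and
# deletes the temporary file once the block has finished
lines = mktemp() do path, io
    write(io, csv_text)   # write through the already-open stream
    flush(io)             # make sure the bytes reach the file before reading
    readlines(path)       # read the file back via its path
end

println(lines)
```

This form is convenient when the file is only needed inside the block; the tutorial keeps the plain `mktemp()` call because `fpath` is reused in later cells.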


You can read it with CSV using

In [9]:
using CSV
data = CSV.read(fpath, DataFrames.DataFrame)

| Row | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 |
|---|---|---|---|---|---|---|---|---|
| | Missing | Int64 | Float64 | Float64 | String7 | Date | DateTime | Bool |
| 1 | missing | 1 | 1.0 | 1.0 | one | 2019-01-01 | 2019-01-01T00:00:00 | 1 |
| 2 | missing | 2 | 2.0 | 2.0 | two | 2019-01-02 | 2019-01-02T00:00:00 | 0 |
| 3 | missing | 3 | 3.0 | 3.14 | three | 2019-01-03 | 2019-01-03T00:00:00 | 1 |


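Note that the `joinpath` above is used so that the path is valid on any operating system (the cell itself writes to a temporary file created by `mktemp`), but you can pass any valid path on your system, for instance `CSV.read("path/to/file.csv", DataFrames.DataFrame)`.
The data is returned as a `DataFrame`: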

In [10]:
typeof(data)

DataFrame

Some of the useful arguments for `read` are:

* `header=` to specify whether there's a header, or which line the header is on or to specify a full header yourself,
* `skipto=` to specify the row at which the data starts (by default, the row right after the header),
* `limit=` to specify a maximum number of rows to parse,
* `missingstring=` to specify a string or vector of strings that should be parsed as missing values,
* `delim=','` a char or string to specify how columns are separated.
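
To see how these keyword arguments combine, here is a small sketch on made-up input. The CSV text, the `"NA"` marker and the column names are hypothetical; only the keyword arguments themselves come from the list above.

```julia
using CSV
import DataFrames

# Hypothetical input: a free-text line first, the header on line 2,
# ';' as delimiter and "NA" marking missing values
raw = """
exported by some tool
id;score;label
1;0.5;a
2;NA;b
3;1.5;c
4;2.5;d
"""

df = CSV.read(IOBuffer(raw), DataFrames.DataFrame;
              header=2,            # column names are on line 2
              skipto=3,            # data starts on line 3
              limit=3,             # parse at most 3 data rows
              missingstring="NA",  # treat "NA" as missing
              delim=';')           # columns are separated by ';'

println(df)
```

Because `limit=3`, the last line of the input is not parsed, and the `NA` entry in `score` becomes `missing`.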

For more details see `?CSV.File`.

In [11]:
?CSV.File

```
CSV.File(input; kwargs...) => CSV.File
```

Read a UTF-8 CSV input and return a `CSV.File` object, which is like a lightweight table/dataframe, allowing dot-access to columns and iterating rows. Satisfies the Tables.jl interface, so can be passed to any valid sink, yet to avoid unnecessary copies of data, use `CSV.read(input, sink; kwargs...)` instead if the `CSV.File` intermediate object isn't needed.

The `input` argument can be one of:

  * filename given as a string or FilePaths.jl type
  * a `Vector{UInt8}` or `SubArray{UInt8, 1, Vector{UInt8}}` byte buffer
  * a `CodeUnits` object, which wraps a `String`, like `codeunits(str)`
  * a csv-formatted string can also be passed like `IOBuffer(str)`
  * a `Cmd` or other `IO`
  * a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
  * a `Vector` of any of the above, which will parse and vertically concatenate each source, returning a single, "long" `CSV.File`

To read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or `HTTP.Response` body can be passed like:

```julia
using Downloads, CSV
f = CSV.File(Downloads.download(url))

# or

using HTTP, CSV
f = CSV.File(HTTP.get(url).body)
```

Opens the file or files and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the `types` keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a `Vector` for all columns, or specified per column via name or index in a `Dict`).

When a `Vector` of inputs is provided, the column names and types of each separate file/input must match to be vertically concatenated. Separate threads will be used to parse each input, which will each parse their input using just the single thread. The results of all threads are then vertically concatenated using `ChainedVector`s to lazily concatenate each thread's columns.

For text encodings other than UTF-8, load the [StringEncodings.jl](https://github.com/JuliaStrings/StringEncodings.jl) package and call e.g. `CSV.File(open(read, input, enc"ISO-8859-1"))`.

The returned `CSV.File` object supports the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface and can iterate `CSV.Row`s. `CSV.Row` supports `propertynames` and `getproperty` to access individual row values. `CSV.File` also supports entire column access like a `DataFrame` via direct property access on the file object, like `f = CSV.File(file); f.col1`. Or by getindex access with column names, like `f[:col1]` or `f["col1"]`. The returned columns are `AbstractArray` subtypes, including: `SentinelVector` (for integers), regular `Vector`, `PooledVector` for pooled columns, `MissingVector` for columns of all `missing` values, `PosLenStringVector` when `stringtype=PosLenString` is passed, and `ChainedVector` will chain one of the previous array types together for data inputs that use multiple threads to parse (each thread parses a single "chain" of the input). Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name `a` will become `a_1`). For example, one could iterate over a csv file with column names `a`, `b`, and `c` by doing:

```julia
for row in CSV.File(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
```

By supporting the Tables.jl interface, a `CSV.File` can also be a table input to any other table sink function. Like:

```julia
# materialize a csv file as a DataFrame, copying columns from CSV.File
df = CSV.File(file) |> DataFrame

# to avoid making a copy of parsed columns, use CSV.read
df = CSV.read(file, DataFrame)

# load a csv file directly into an sqlite database table
db = SQLite.DB()
tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")
```

# Arguments

## File layout options:

  * `header=1`: how column names should be determined; if given as an `Integer`, indicates the row to parse for column names; as an `AbstractVector{<:Integer}`, indicates a set of rows to be concatenated together as column names; `Vector{Symbol}` or `Vector{String}` give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a `Vector`, or set `header=0` or `header=false` and column names will be auto-generated (`Column1`, `Column2`, etc.). Note that if a row number header and `comment` or `ignoreemptyrows` are provided, the header row will be the first non-commented/non-empty row *after* the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  * `normalizenames::Bool=false`: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the `tbl.col1` `getproperty` syntax or iterating rows and accessing column values of a row via `getproperty` (e.g. `row.col1`)
  * `skipto::Integer`: specifies the row where the data starts in the csv file; by default, the next row after the `header` row(s) is used. If `header=0`, then the 1st row is assumed to be the start of data; providing a `skipto` argument does *not* affect the `header` argument. Note that if a row number `skipto` and `comment` or `ignoreemptyrows` are provided, the data row will be the first non-commented/non-empty row *after* the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  * `footerskip::Integer`: number of rows at the end of a file to skip parsing.  Do note that commented rows (see the `comment` keyword argument) *do not* count towards the row number provided for `footerskip`, they are completely ignored by the parser
  * `transpose::Bool`: read a csv file "transposed", i.e. each column is parsed as a row
  * `comment::String`: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or `skipto` and `comment` are provided, the header/data row will be the first non-commented/non-empty row *after* the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  * `ignoreemptyrows::Bool=true`: whether empty rows in a file should be ignored (if `false`, each column will be assigned `missing` for that empty row)
  * `select`: an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "selector" function of the form `(i, name) -> keep::Bool`; only columns in the collection or for which the selector function returns `true` will be parsed and accessible in the resulting `CSV.File`. Invalid values in `select` are ignored.
  * `drop`: inverse of `select`; an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "drop" function of the form `(i, name) -> drop::Bool`; columns in the collection or for which the drop function returns `true` will ignored in the resulting `CSV.File`. Invalid values in `drop` are ignored.
  * `limit`: an `Integer` to indicate a limited number of rows to parse in a csv file; use in combination with `skipto` to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the `limit` argument may not result in an exact # of rows parsed; use `threaded=false` to ensure an exact limit if necessary
  * `buffer_in_memory`: a `Bool`, default `false`, which controls whether a `Cmd`, `IO`, or gzipped source will be read/decompressed in memory vs. using a temporary file.
  * `ntasks::Integer=Threads.nthreads()`: [not applicable to `CSV.Rows`] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. `JULIA_NUM_THREADS` environment variable or `julia -t N`); setting `ntasks=1` will avoid any calls to `Threads.@spawn` and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  * `rows_to_check::Integer=30`: [not applicable to `CSV.Rows`] a multithreaded parsed file will be split up into `ntasks` # of equal chunks; `rows_to_check` controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, `lines_to_check` may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
  * `source`: [only applicable for vector of inputs to `CSV.File`] a `Symbol`, `String`, or `Pair` of `Symbol` or `String` to `Vector`. As a single `Symbol` or `String`, provides the column name that will be added to the parsed columns, the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a `Pair`, the 2nd part of the pair should be a `Vector` of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.

## Parsing options:

  * `missingstring`: either a `nothing`, `String`, or `Vector{String}` to use as sentinel values that will be parsed as `missing`; if `nothing` is passed, no sentinel/missing values will be parsed; by default, `missingstring=""`, which means only an empty field (two consecutive delimiters) is considered `missing`
  * `delim=','`: a `Char` or `String` that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  * `ignorerepeated::Bool=false`: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  * `quoted::Bool=true`: whether parsing should check for `quotechar` at the start/end of cells
  * `quotechar='"'`, `openquotechar`, `closequotechar`: a `Char` (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  * `escapechar='"'`: the `Char` used to escape quote characters in a quoted field
  * `dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}`: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an `AbstractDict`, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index `Int`, or name `Symbol` or `String` to the format string for that column.
  * `decimal='.'`: a `Char` indicating how decimals are separated in floats, i.e. `3.14` uses `'.'`, or `3,14` uses a comma `','`
  * `truestrings`, `falsestrings`: `Vector{String}`s that indicate how `true` or `false` values are represented; by default `"true", "True", "TRUE", "T", "1"` are used to detect `true` and `"false", "False", "FALSE", "F", "0"` are used to detect `false`; note that columns with only `1` and `0` values will default to `Int64` column type unless explicitly requested to be `Bool` via `types` keyword argument

## Column Type Options:

  * `types`: a single `Type`, `AbstractVector` or `AbstractDict` of types, or a function of the form `(i, name) -> Union{T, Nothing}` to be used for column types; if a single `Type` is provided, *all* columns will be parsed with that single type; an `AbstractDict` can map column index `Integer`, or name `Symbol` or `String` to type for a column, i.e. `Dict(1=>Float64)` will set the first column as a `Float64`, `Dict(:column1=>Float64)` will set the column named `column1` to `Float64` and, `Dict("column1"=>Float64)` will set the `column1` to `Float64`; if a `Vector` is provided, it must match the # of columns provided or detected in `header`. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or `nothing` to signal the column's type should be detected while parsing.
  * `typemap::Dict{Type, Type}`: a mapping of a type that should be replaced in every instance with another type, i.e. `Dict(Float64=>String)` would change every detected `Float64` column to be parsed as `String`; only "standard" types are allowed to be mapped to another type, i.e. `Int64`, `Float64`, `Date`, `DateTime`, `Time`, and `Bool`. If a column of one of those types is "detected", it will be mapped to the specified type.
  * `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function}=0.25`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`; if `true`, all columns detected as `String` will be pooled; alternatively, the proportion of unique values below which `String` columns should be pooled (by default 0.25, meaning that if the # of unique strings in a column is under 25.0%, it will be pooled); if an `AbstractVector`, each element should be `Bool` or `Real` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool` or `Real` value can be provided for individual columns where the dict key is given as column index `Integer`, or column name as `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Float64`, or `nothing` for each column.
  * `downcast::Bool=false`: controls whether columns detected as `Int64` will be "downcast" to the smallest possible integer type like `Int8`, `Int16`, `Int32`, etc.
  * `stringtype=InlineStrings.InlineString`: controls how detected string columns will ultimately be returned; default is `InlineString`, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to `String`. If `String` is passed, all string columns will just be normal `String` values. If `PosLenString` is passed, string columns will be returned as `PosLenStringVector`, which is a special "lazy" `AbstractVector` that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of `PosLenStringVector` makes it read-only, so operations like `push!`, `append!`, or `setindex!` are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  * `strict::Bool=false`: whether invalid values should throw a parsing error or be replaced with `missing`
  * `silencewarnings::Bool=false`: if `strict=false`, whether invalid value warnings should be silenced
  * `maxwarnings::Int=100`: if more than `maxwarnings` number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to `maxwarnings`
  * `debug::Bool=false`: passing `true` will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
  * `validate::Bool=true`: whether or not to validate that columns specified in the `types`, `dateformat` and `pool` keywords are actually found in the data. If `false` no validation is done, meaning no error will be thrown if `types`/`dateformat`/`pool` specify settings for columns not actually found in the data.

## Iteration options:

  * `reusebuffer=false`: [only supported by `CSV.Rows`] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing `collect(CSV.Rows(file))` because only current iterated row is "valid")


### Example 1

Let's consider [this dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00504/), the content of which we saved in a file at path `fpath`.

In [12]:
c = """
3.26;0.829;1.676;0;1;1.453;3.770
2.189;0.58;0.863;0;0;1.348;3.115
2.125;0.638;0.831;0;0;1.348;3.531
3.027;0.331;1.472;1;0;1.807;3.510
2.094;0.827;0.86;0;0;1.886;5.390
3.222;0.331;2.177;0;0;0.706;1.819
3.179;0;1.063;0;0;2.942;3.947
3;0;0.938;1;0;2.851;3.513
2.62;0.499;0.99;0;0;2.942;4.402
2.834;0.134;0.95;0;0;1.591;3.021
2.405;0.134;0.843;0;0;1.769;3.210
2.728;0.223;0.953;0;0;1.591;2.371
2.512;0.223;0.929;1;0;1.769;3.919
2.834;0.134;1.237;0;0;1.859;3.030
2.819;0.331;1.271;0;1;0.981;2.736
2.126;0.251;1.114;0;0;0.143;2.157
2.834;0.134;1.322;0;0;1.199;2.413
3.014;0.56;1.781;0;0;-0.115;0.898
3.024;0.452;2.698;0;0;1.107;0.450
3.036;0.405;1.205;1;0;1.807;3.733
2.707;0.972;1.889;0;3;-1.169;2.976
2.978;1.246;1.103;0;1;3.988;6.535
3.111;0.732;0.923;0;0;4.068;5.643
"""
fpath, = mktemp()
write(fpath, c);

It doesn't have a header so we have to provide it ourselves.

In [13]:
header = ["CIC0", "SM1_Dz", "GATS1i",
          "NdsCH", "NdssC", "MLOGP", "LC50"]
data = CSV.read(fpath, DataFrames.DataFrame, header=header)
first(data, 3)

| Row | CIC0 | SM1_Dz | GATS1i | NdsCH | NdssC | MLOGP | LC50 |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Int64 | Int64 | Float64 | Float64 |
| 1 | 3.26 | 0.829 | 1.676 | 0 | 1 | 1.453 | 3.77 |
| 2 | 2.189 | 0.58 | 0.863 | 0 | 0 | 1.348 | 3.115 |
| 3 | 2.125 | 0.638 | 0.831 | 0 | 0 | 1.348 | 3.531 |
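
The file in this example separates columns with `;`, which `CSV.read` detected automatically in the call above. If you prefer not to rely on detection, the delimiter can be passed explicitly. The sketch below does so on the first three rows of the data, read from an in-memory `IOBuffer` instead of a file; the variable names are illustrative.

```julia
using CSV
import DataFrames

# First three rows of the dataset shown above (no header line)
raw = """
3.26;0.829;1.676;0;1;1.453;3.770
2.189;0.58;0.863;0;0;1.348;3.115
2.125;0.638;0.831;0;0;1.348;3.531
"""

header = ["CIC0", "SM1_Dz", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50"]

# Passing `delim` explicitly avoids relying on delimiter detection
data = CSV.read(IOBuffer(raw), DataFrames.DataFrame; header=header, delim=';')

println(size(data))
println(data.LC50)
```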


### Example 2

Let's consider [this dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00423/), the content of which we saved at `fpath`.

In [14]:
c = """
1,0,1,0,0,0,0,1,0,1,1,?,1,0,0,0,0,1,0,0,0,0,1,67,137,15,0,1,1,1.53,95,13.7,106.6,4.9,99,3.4,2.1,34,41,183,150,7.1,0.7,1,3.5,0.5,?,?,?,1
0,?,0,0,0,0,1,1,?,?,1,0,0,1,0,0,0,1,0,0,0,0,1,62,0,?,0,1,1,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,1.8,?,?,?,?,1
1,0,1,1,0,1,0,1,0,1,0,0,0,1,1,0,0,0,0,1,0,1,1,78,50,50,2,1,2,0.96,5.8,8.9,79.8,8.4,472,3.3,0.4,58,68,202,109,7,2.1,5,13,0.1,28,6,16,1
1,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,1,1,77,40,30,0,1,1,0.95,2440,13.4,97.1,9,279,3.7,0.4,16,64,94,174,8.1,1.11,2,15.7,0.2,?,?,?,0
1,1,1,1,0,1,0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,1,76,100,30,0,1,1,0.94,49,14.3,95.1,6.4,199,4.1,0.7,147,306,173,109,6.9,1.8,1,9,?,59,15,22,1
1,0,1,0,?,0,0,1,0,?,0,1,0,0,0,0,0,1,1,1,0,0,1,75,?,?,1,1,2,1.58,110,13.4,91.5,5.4,85,3.4,3.5,91,122,242,396,5.6,0.9,1,10,1.4,53,22,111,0
1,0,0,0,?,1,1,1,0,0,1,0,?,0,0,0,0,0,0,0,0,0,1,49,0,0,0,1,1,1.4,138.9,10.4,102,3.2,42000,2.35,2.72,119,183,143,211,7.3,0.8,5,2.6,2.19,171,126,1452,0
1,1,1,0,?,0,0,1,0,1,1,?,0,0,0,0,0,0,1,1,1,0,1,61,?,20,3,1,1,1.46,9860,10.8,92,3,58,3.1,3.2,79,108,184,300,7.1,0.52,2,9,1.3,42,25,706,0
1,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,0,?,1,1,0,0,1,50,100,32,1,1,2,3.14,8.8,11.9,107.5,4.9,70,1.9,3.3,26,59,115,63,6.1,0.59,1,6.4,1.2,85,73,982,1
1,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,43,100,0,0,1,1,1.12,1.8,11.8,87.8,5100,193000,4.2,0.5,71,45,256,303,7.1,0.59,1,9.3,0.7,?,?,?,1
1,0,1,0,0,0,1,1,?,?,0,0,0,0,0,0,0,?,1,1,0,0,1,41,?,?,0,1,2,1.05,100809,13,94.2,5.7,196,4.4,3,90,334,494,236,7.6,0.8,5,?,1.1,?,?,?,0
1,0,1,0,0,0,1,1,1,0,0,0,0,1,0,0,0,?,0,1,0,0,1,74,?,0,0,1,1,1.33,86,15.7,96.7,4,61,3.7,1.3,132,168,113,154,?,7.6,5,1.9,0.3,144,41,277,1
1,0,1,0,0,0,0,1,0,1,1,0,0,1,0,0,?,?,1,1,1,0,0,66,?,30,0,1,1,1.53,60,13.3,90.1,5.5,207000,4.4,8.5,25,36,35,74,8.5,0.73,1,5,0.8,?,?,?,1
1,?,0,0,0,0,1,1,?,?,0,0,0,0,0,0,0,0,0,0,0,0,1,56,0,?,0,1,1,1.2,6.6,13.7,93.8,4.1,91000,4.5,1,103,96,205,70,8.8,0.88,1,22,?,82,24,?,1
1,0,1,0,0,0,0,1,0,?,1,0,0,1,0,0,?,1,1,1,0,0,1,63,?,?,2,2,2,1.25,29,13.5,93,6,128,3.15,10.5,76,116,165,163,7.3,1.07,4,4.5,4.5,197,84,302,1
0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,41,100,0,1,1,2,1.61,4.6,10.2,89.6,5.5,161,3.1,3.1,24,57,163,176,5,0.8,2,2.6,1.3,25,13,60,1
1,0,1,0,0,0,0,1,?,1,1,?,1,0,0,0,?,?,1,1,1,0,1,72,?,?,3,2,1,2.14,60,12.1,99.2,5,58,2.4,9.8,69,63,201,235,6.2,0.96,2,2,2.9,136,95,767,0
1,1,1,0,0,0,0,1,0,1,0,0,?,0,0,0,?,1,1,1,1,1,1,60,100,60,2,1,1,1.05,9.2,10.3,103.7,5.4,159,3.8,0.5,56,91,459,146,5.4,1.23,5,13.5,3.8,187,58,443,1
1,?,1,0,0,0,0,1,?,1,0,?,0,1,1,0,0,1,0,0,0,0,1,64,200,78,1,1,1,1.13,8.8,14.9,94.8,6.3,137,4.3,0.9,16,23,82,180,6.5,4.95,1,5.4,0.9,144,49,295,1
1,1,1,0,0,0,0,1,?,?,0,0,0,1,0,0,0,1,1,1,1,0,1,75,500,?,0,1,3,1.44,34,15.9,103.4,9600,101000,3.4,3.4,27,87,260,147,6.3,0.9,5,2.3,1.6,67,34,774,0
"""
fpath, = mktemp()
write(fpath, c);

It does not have a header, and missing values are indicated by `?`.

In [15]:
data = CSV.read(fpath, DataFrames.DataFrame, header=false, missingstring="?") # parse the string "?" as missing
first(data[:, 1:5], 3)

| Row | Column1 | Column2 | Column3 | Column4 | Column5 |
|---|---|---|---|---|---|
| | Int64 | Int64? | Int64 | Int64 | Int64? |
| 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | missing | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 1 | 0 |
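
Once `?` has been mapped to `missing`, a natural next step is to see how many values are missing in each column. The following sketch does this on a small made-up excerpt in the same format (not the full dataset), read from an `IOBuffer`; `nmissing` and `complete` are illustrative names.

```julia
using CSV
import DataFrames

# Made-up excerpt in the same format: no header, '?' marks missing values
raw = """
1,0,?,4.9
0,?,?,?
1,0,28,8.4
"""

data = CSV.read(IOBuffer(raw), DataFrames.DataFrame;
                header=false, missingstring="?")

# Number of missing entries in each column
nmissing = [count(ismissing, col) for col in DataFrames.eachcol(data)]
println(nmissing)

# Keep only the rows without any missing value
complete = DataFrames.dropmissing(data)
println(complete)
```

Columns that contain at least one `?` get a type such as `Int64?`, which is shorthand for `Union{Missing, Int64}`.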


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*