# Julia: Working with Data
---

__Topics covered in this hands-on exercise:__
* Downloading data
* Reading the data from CSV files
* Performing operations on DataFrames

## Importing Dependencies
---
As a first step, we import all the necessary packages.

In [1]:
using CSV
using DataFrames

## Downloading Data
--- 
Next, let us have a look at how you can download your data using Julia. For this, you can use the `download` function, which is built into Julia's standard library. Let us have a look at the documentation for __download__ using the `?` operator.

In [1]:
?download

search: download



```
download(url::AbstractString, [localfile::AbstractString])
```

Download a file from the given url, optionally renaming it to the given local file name. If no filename is given this will download into a randomly-named file in your temp directory. Note that this function relies on the availability of external tools such as `curl`, `wget` or `fetch` to download the file and is provided for convenience. For production use or situations in which more options are needed, please use a package that provides the desired functionality instead.

Returns the filename of the downloaded file.
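The documentation above notes that `download` is a convenience function. As a minimal sketch of a slightly more defensive approach, the helper below uses the `Downloads` standard library (assuming Julia 1.6 or newer), creates the target directory if it is missing, and skips the download when the file already exists. The name `fetch_dataset` is hypothetical and not part of any library.

```julia
using Downloads

# Hypothetical helper: download `url` to `dest` unless `dest` already exists.
function fetch_dataset(url::AbstractString, dest::AbstractString)
    # Reuse a previously downloaded copy instead of hitting the network again
    isfile(dest) && return dest
    # Make sure the parent directory (e.g. "data/") exists before writing
    dir = dirname(dest)
    isempty(dir) || mkpath(dir)
    # Downloads.download returns the path it wrote to
    return Downloads.download(url, dest)
end

# Example call (requires network access):
# fetch_dataset("https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv",
#               "data/programming_languages.csv")
```

Creating the directory first matters because a download into a folder such as `data/` fails if that folder does not exist yet.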


Now, let us download a dummy dataset that we will be using for this tutorial.

In [2]:
data = download("https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv", 
                "data/programming_languages.csv")

"data/programming_languages.csv"

Now that we have downloaded the dataset, let us load it into a _DataFrame_ so that we can work on it. For this, we will be using the `CSV.read` method. Let us look at the documentation for the _CSV.read_ method.

In [5]:
?CSV.read

`CSV.read(source, sink::T; kwargs...)` => T

Read and parses a delimited file, materializing directly using the `sink` function.

`CSV.read` supports all the same keyword arguments as [`CSV.File`](@ref).


In [11]:
# reading the CSV file using the CSV module
df = CSV.read(data, DataFrame)

| Row | year | language |
|---|---|---|
| | Int64 | String |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |
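Since `CSV.read` accepts the same keyword arguments as `CSV.File`, parsing can be adjusted without changing the rest of the workflow. The sketch below is only illustrative: it parses a small in-memory sample (three rows copied from the dataset, with a semicolon delimiter chosen for the demonstration) and passes the `delim` keyword. The names `sample` and `df_sample` are hypothetical.

```julia
using CSV
using DataFrames

# A tiny semicolon-delimited sample, held in memory instead of on disk
sample = """
year;language
1951;Regional Assembly Language
1952;Autocode
1954;IPL
"""

# Any IO source works; `delim` tells the parser which separator to expect
df_sample = CSV.read(IOBuffer(sample), DataFrame; delim=';')
```

Reading from an `IOBuffer` is a convenient way to try out keyword arguments before applying them to a real file.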


Now, let us experiment with this DataFrame object and see how we can interact with the data stored in it.

In [15]:
# Fetching the names of the columns in the DataFrame
cols = names(df);

In [21]:
# print the first 10 rows of the dataframe
df[1:10, :]

| Row | year | language |
|---|---|---|
| | Int64 | String |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |


In [22]:
# printing the first 10 values in the year column
df.year[1:10]

10-element Array{Int64,1}:
 1951
 1952
 1954
 1955
 1957
 1957
 1958
 1958
 1959
 1959
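The indexing operations above have a few common companions. The following is a small sketch on a hypothetical three-row frame `df_demo` (rows copied from the dataset), assuming a reasonably recent DataFrames.jl in which `names` returns strings.

```julia
using DataFrames

# Hypothetical miniature version of the dataset
df_demo = DataFrame(
    year = [1951, 1952, 1954],
    language = ["Regional Assembly Language", "Autocode", "IPL"],
)

col_names = names(df_demo)        # column names as a vector of strings
dims = size(df_demo)              # (number of rows, number of columns)
head = first(df_demo, 2)          # first two rows, like df_demo[1:2, :]

# A Boolean mask in the row position keeps only the matching rows
later = df_demo[df_demo.year .> 1951, :]
```

`first(df, n)` is often preferred over `df[1:n, :]` because it does not fail when the frame has fewer than `n` rows.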

In [23]:
# statistical description of the df
describe(df)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | year | 1982.99 | 1951 | 1986.0 | 2014 | 0 | Int64 |
| 2 | language | | ALGOL 58 | | dBase III | 0 | String |
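To see where the numbers in such a summary come from, the same statistics can be computed by hand with the `Statistics` standard library. This sketch uses only the ten `year` values printed earlier, so its results differ from the full-dataset summary above; the variable names are illustrative.

```julia
using Statistics

# The first ten values of the year column, as printed earlier
years = [1951, 1952, 1954, 1955, 1957, 1957, 1958, 1958, 1959, 1959]

year_mean = mean(years)                 # arithmetic mean
year_median = median(years)             # middle value of the sorted data
year_min, year_max = extrema(years)     # smallest and largest value
n_missing = count(ismissing, years)     # number of missing entries
```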


Now that we have performed some of the most basic operations on the DataFrame, let us try to tackle some problems using the data.

In [34]:
# Q1: In which year was a given language invented?
function find_year(df, lang::String)
    # findfirst returns `nothing` when no row matches
    index = findfirst(df.language .== lang)
    index !== nothing && return df.year[index]
    error("$lang was not found in the dataframe!")
end


find_year(df, "Python")

1991

In [36]:
# Q2: How many languages were created in a given year?
function find_num_langs(df, year::Int64)
    # count returns an integer (never `nothing`), so check for zero matches
    num = count(df.year .== year)
    num > 0 && return num
    error("No languages were created in $year")
end

find_num_langs(df, 2012)

1
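Whether a year without any languages should raise an error or simply yield zero is a design choice. As a minimal sketch of the second option, the hypothetical helper `count_in_year` below works on a plain vector of years (the ten values printed earlier) and returns `0` when nothing matches.

```julia
# The first ten values of the year column, as printed earlier
years = [1951, 1952, 1954, 1955, 1957, 1957, 1958, 1958, 1959, 1959]

# Hypothetical helper: count how many entries equal `year`
count_in_year(years, year) = count(y -> y == year, years)

n_1957 = count_in_year(years, 1957)   # two entries match
n_1953 = count_in_year(years, 1953)   # no entries match, so the result is 0
```

Returning zero keeps the function usable inside loops or aggregations where an exception would be inconvenient.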

In [42]:
# Q3: Get the list of all the languages that were created in a given year.
function find_list_langs(df, year::Int64)
    indices = findall(df.year .== year)
    # findall returns an empty array (never `nothing`) when no row matches
    if isempty(indices)
        error("No languages were created in $year")
    else
        langs = []
        for i in indices
            push!(langs, df.language[i])
        end
        return langs
    end
end

find_list_langs(df, 2012)

1-element Array{Any,1}:
 "Julia"

In [44]:
# Q4: Find the total number of instances (rows) in the dataset.
length(df.year)

73
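The questions above can also be answered for every year at once with the split-apply-combine functions in DataFrames.jl. The sketch below uses a hypothetical four-row frame `df_demo` (rows copied from the dataset) and assumes a DataFrames.jl version that supports the `nrow => :name` syntax in `combine`.

```julia
using DataFrames

# Hypothetical miniature version of the dataset
df_demo = DataFrame(
    year = [1957, 1957, 1958, 1959],
    language = ["FORTRAN", "COMTRAN", "LISP", "COBOL"],
)

# Number of rows in the frame, equivalent to length(df_demo.year)
total_rows = nrow(df_demo)

# Group the rows by year and count the rows in each group
per_year = combine(groupby(df_demo, :year), nrow => :n_languages)
```

Here `per_year` has one row per distinct year and a column `n_languages` holding the count for that year.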