# Data Analysis in Julia


### Defining and Applying Statistical Tools

If you have collected data into a variable, you can compute several summary statistics with the built-in functions minimum and maximum and with the functions mean, median, std, and var from the Statistics package:

 - minimum(x) calculates the smallest element in x
 - maximum(x) calculates the largest element in x
 - mean(x) calculates the arithmetic mean of elements in x
 - median(x) calculates the median of x
 - std(x) calculates the sample standard deviation of x (dividing by n - 1, not n)
 - var(x) calculates the sample variance of x (dividing by n - 1, not n)
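
The corrected keyword argument of std and var controls whether the sum of squared deviations is divided by n - 1 (the default) or by n. The sketch below shows the difference on a small made-up dataset; the names data and sum_sq are only for illustration.

```julia
using Statistics

# Illustrative data only: eight values whose mean is exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# The squared deviations from the mean add up to 32.
sum_sq = sum((data .- mean(data)) .^ 2)
@show sum_sq

@show var(data)                    # default: sum_sq divided by n - 1 = 7
@show var(data; corrected=false)   # sum_sq divided by n = 8
@show std(data; corrected=false);  # square root of the line above
```

If your own hand calculation of a variance disagrees with var(), check which of the two divisors you used.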

In [None]:
#Make sure to run this code block again if you have to restart the notebook
using Statistics

In [None]:
@show minimum(1:5) #the smallest number in this range is 1
@show maximum(1:5) #and the largest is 5
@show mean(1:5) #the mean of 1,2,3,4,5 is 3 because of symmetry
@show median(1:5) #so is the median!
@show std(1:5);


In [None]:
#TODO: what is the variance of 1:5? 
#With the measurements from above, you can calculate it in at least two different ways.
#Once you have a result, compute the variance by running this code block to check your work.

@show var(1:5);

The StatsBase package provides a few more functions for analysing data:
 - mode(x) returns a mode of x. This function differs from the NuMaSS definition in several ways which we will cover below.
 - quantile(x) returns a vector with five elements:
    - The smallest element in x
    - The first quartile of x
    - The median of x
    - The third quartile of x
    - The largest element in x
 - iqr(x) calculates the interquartile range of x.

 Note that the function above is called "quantile," not "quartile."  
 There's a reason for this which I'll provide below, but because it's not relevant to this exercise, feel free to skip it.

 A "quantile" is a way of separating data into pieces which hold about the same number of data points each.  
 A "quartile" is a specific kind of quantile; it splits data into 4 equally sized blocks.  
 Likewise, a "percentile" is a way of splitting data into 100 equally sized blocks, and a "decile" splits it into 10.  

 The quantile function above returns quartiles by default, but by using some optional arguments, you could use it to construct percentiles, or quintiles (5 parts), or septiles (7 parts).
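
As a rough sketch of the idea above, the cut points can be passed to quantile explicitly as a second argument, given as probabilities between 0 and 1. If calling quantile(x) with a single argument raises a MethodError in your version of Julia, this explicit form should still work. The data and variable names below are made up for illustration; only quantile itself comes from the Statistics package.

```julia
using Statistics

# Illustrative data: the integers 1 through 9.
data = collect(1:9)

# Quartiles: cut points at 0%, 25%, 50%, 75%, and 100% of the data.
quartiles = quantile(data, [0, 0.25, 0.5, 0.75, 1])
@show quartiles   # [1.0, 3.0, 5.0, 7.0, 9.0]

# The interquartile range is the third quartile minus the first.
@show quartiles[4] - quartiles[2]   # 4.0

# Quintiles: the same call with a cut point every 20%.
quintiles = quantile(data, [0, 0.2, 0.4, 0.6, 0.8, 1])
@show length(quintiles);   # 6 cut points bound 5 blocks
```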

 

In [None]:
#Make sure to run this code block again if you have to restart the notebook
using StatsBase 

In [None]:
@show mode(1:5) #the StatsBase mode function does not treat the case where every element appears once specially; under the NuMaSS definition this dataset has no mode.

#because the median is 3, the left dataset is [1,2,3] and the right set is [3,4,5]; so the first and third quartiles are 2 and 4.
@show iqr(1:5); 

In [None]:
#TODO: create a collection with an iqr of 5 and a mean of zero
x = []

@show iqr(x)
@show mean(x);

In [None]:
#TODO: create a collection with more than one mode and run the mode() function on it.
#Is there any pattern when determining which mode is returned? Try a few different collections.



Note that the presentation mentioned several statistical functions that Julia does not provide directly.  
There is no built-in function for the statistical range (Julia's own range function constructs a sequence of numbers instead) or for the trimmed mean, and the StatsBase mode function does not match the NuMaSS definition.

Thankfully, you can implement these functions yourself via the use of Julia functions like sort(), maximum(), and minimum().
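
As a small warm-up for the exercises below, here is a sketch of how these building blocks behave on a made-up vector (the names readings and sorted_readings are only for illustration). It deliberately stops short of the functions you are asked to write.

```julia
# Illustrative data only.
readings = [3, -5, 10, 2]

@show minimum(readings)   # -5
@show maximum(readings)   # 10

# sort returns a new sorted vector and leaves the original untouched.
sorted_readings = sort(readings)
@show sorted_readings     # [-5, 2, 3, 10]
@show readings            # still [3, -5, 10, 2]

# Inside an index, `end` is the last position, so this drops the smallest value.
@show sorted_readings[2:end];   # [2, 3, 10]
```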

In [None]:
"""
range(x) takes in a collection x and returns the range of x.
inputs:
    - x, a collection of data

outputs:
    - m, the difference between the largest and the smallest element of x.

"""
function range(x)

    
    #TODO: complete this function to define range as specified above. 

    return 
end

In [None]:
#run the following code to check your work.
@show range(1:5) == 4 #5-1=4
@show range([1,-5,10,2,3]) == 15 #10 - (-5) = 15
@show range(1) == 0; #the largest and smallest element are the same, so 1-1=0.

In [None]:
"""
trimmed_mean(x,k) takes in a collection x and an integer k and returns the trimmed mean of x.
inputs:
    - x, a collection of data
    - k, an integer. You can assume that 2 * k < length(x).

outputs:
    - m, the mean of x after we remove k of the largest and k of the smallest elements from x

"""
function trimmed_mean(x::AbstractVector,k::Integer) #AbstractVector accepts ranges such as 1:5 as well as vectors

    #TODO: complete this function to define the trimmed mean as specified in the docstring.
    #sort() will be useful here!

    return 
    
end

In [None]:
#run the following code to check your work

@show trimmed_mean(1:5,1) #for symmetric data, trimming does not change the mean (should be 3)
@show trimmed_mean(1:5,2) #even with larger k (should still be 3).
@show trimmed_mean([1,2,3,4,10000],1) #trimming a single outlier allows for some resilience (should be 3)
@show trimmed_mean([1,2,3,1000,10000],1); #but k=1 only accounts for a single outlier in each direction (should be 335)
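
For reference, the definition in the trimmed_mean docstring can also be written as a formula. Writing $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ for the sorted data (this notation is an assumption and may differ from the presentation), the trimmed mean with parameter $k$ is

$$\bar{x}_k = \frac{1}{n - 2k} \sum_{i = k + 1}^{n - k} x_{(i)}$$

The assumption that $2k < n$ guarantees that at least one element remains after trimming. For the dataset \[1,2,3,1000,10000\] with $k = 1$, we have $n = 5$, so the result is $(2 + 3 + 1000)/3 = 335$.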

A mode of a set of data is one of the most common elements in that set of data.  
So the mode of the dataset \[1,1,1,3,3,2\] is 1.
Notably, this means that a dataset can have multiple modes: the dataset \[5,5,4,4,1\] has modes 5 and 4.  
If every element appears exactly once, there is no mode.
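
One way to find the most common elements is to tally how often each value appears. The sketch below does only the tallying step, using a Dict; the names data and tally are made up for illustration, and turning the tally into a set of modes is left for the exercise.

```julia
# The first example dataset from the text above.
data = [1, 1, 1, 3, 3, 2]

# Map each distinct value to the number of times it appears.
tally = Dict{Int,Int}()
for value in data
    # get returns the stored count, or 0 if the value has not been seen yet.
    tally[value] = get(tally, value, 0) + 1
end

@show tally[1]   # 3
@show tally[3]   # 2
@show tally[2];  # 1
```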

In [None]:
"""
numass_mode(x) takes in x and returns a set containing the most common elements of x.
inputs:
    - x, a collection of data we will take the mode of. 

outputs:
    - m::Set, a set containing each mode of x. 
        If there is no mode (each element appears once), this set is empty.

"""
function numass_mode(x)

    #TODO: complete this function to define numass_mode as specified above. 
    #You may want to use the StatsBase mode function, but you can also implement the function without it. 
    #This will be complicated. It might be useful to use a Dict() to store how often each element appears.
    
end

Once you've defined numass_mode(), you can check your work by running the cell below.  

In [None]:

@show numass_mode([1,2,5,5,5,1,2]) #5 is the most common element here, so the result is Set([5])
@show numass_mode([1,2,3,1,2]) #both 1 and 2 appear twice, so the result is Set([1, 2])
@show numass_mode(1:5); #when every element appears once, there is no mode, so the result is an empty set