In [1]:
Pkg.update()

INFO: Updating METADATA...
INFO: Updating cache of DataFrames...
INFO: Updating cache of DataFrames...
INFO: Computing changes...
INFO: No packages to install, update or remove


## Data Frames

### Creating and Populating a Data Frame

In [3]:
using DataFrames, DataArrays
df = DataFrame()    # an empty data frame
da = DataArray()    # fails: DataArray has no zero-argument constructor (see the MethodError below)

LoadError: LoadError: MethodError: no method matching DataArrays.DataArray{T,N}()
Closest candidates are:
  DataArrays.DataArray{T,N}{T,N}(!Matched::Array{T,N}) at C:\Users\ChuKY\.julia\v0.5\DataArrays\src\dataarray.jl:75
  DataArrays.DataArray{T,N}{T,N}(!Matched::Array{T,N}, !Matched::BitArray{N}) at C:\Users\ChuKY\.julia\v0.5\DataArrays\src\dataarray.jl:75
  DataArrays.DataArray{T,N}(!Matched::Array{T,N}, !Matched::Array{Bool,N}) at C:\Users\ChuKY\.julia\v0.5\DataArrays\src\dataarray.jl:94
  ...
while loading In[3], in expression starting on line 3

In [9]:
da = DataArray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# build two data arrays, da1 and da2, and add them to df as columns
da1 = DataArray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
da2 = DataArray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

df[:var1] = da1    # :var1 - the column name for da1
df[:var2] = da2    # :var2 - the column name for da2

10-element DataArrays.DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

### Data Frames Basics

In [10]:
# Variable names in a data frame
names(df)    # shows the variable (column) names, similar to colnames in R

2-element Array{Symbol,1}:
 :var1
 :var2

In [11]:
# rename several variables at once
rename!(df, [:var1, :var2], [:length, :width])

# here's another way of doing it
rename!(df, :width, :height)    # change the name :width to :height

Row,length,height
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
6,6,6
7,7,7
8,8,8
9,9,9
10,10,10


### Accessing Particular Variables in a Data Frame

In [12]:
# To access a particular variable with names
df[:length]

10-element DataArrays.DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

In [14]:
# If the variable name is stored in a string, convert it to a Symbol first:
var_name = "height"
df[Symbol(var_name)]    # use Symbol(); the lowercase symbol() function is deprecated

10-element DataArrays.DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

In [18]:
Symbol(var_name) == :height

true

In [20]:
df[1], df[2]    # indexing columns from data frames

([1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10])

### Exploring a Data Frame

In [21]:
showcols(df)

10×2 DataFrames.DataFrame
│ Col # │ Name   │ Eltype │ Missing │
├───────┼────────┼────────┼─────────┤
│ 1     │ length │ Int64  │ 0       │
│ 2     │ height │ Int64  │ 0       │

In [22]:
head(df)

Row,length,height
1,1,1
2,2,2
3,3,3
4,4,4
5,5,5
6,6,6


In [23]:
tail(df)

Row,length,height
1,5,5
2,6,6
3,7,7
4,8,8
5,9,9
6,10,10


In [24]:
describe(df)

length
Min      1.0
1st Qu.  3.25
Median   5.5
Mean     5.5
3rd Qu.  7.75
Max      10.0
NAs      0
NA%      0.0%

height
Min      1.0
1st Qu.  3.25
Median   5.5
Mean     5.5
3rd Qu.  7.75
Max      10.0
NAs      0
NA%      0.0%



### Filtering Sections of a Data Frame

In [25]:
df[1:5, [:length]]    # the first 5 rows from :length

Row,length
1,1
2,2
3,3
4,4
5,5


In [27]:
df[df[:length] .> 2, :]    # select the rows where :length is greater than 2, keeping all columns

Row,length,height
1,3,3
2,4,4
3,5,5
4,6,6
5,7,7
6,8,8
7,9,9
8,10,10


In [28]:
ind = df[:length] .> 2
println(ind)
df[ind, :]

Bool[false,false,true,true,true,true,true,true,true,true]


Row,length,height
1,3,3
2,4,4
3,5,5
4,6,6
5,7,7
6,8,8
7,9,9
8,10,10


### Applying Functions to a Data Frame's Variables

In [37]:
# applying functions column-wise
colwise(maximum, df)

2-element Array{Any,1}:
 [10]
 [10]

In [38]:
# applying functions to selected columns
colwise(mean, df[[:length, :height]])

2-element Array{Any,1}:
 [5.5]
 [5.5]

### Working with Data Frames

In [40]:

df[:weight] = DataArray([10, 20, -1, 15, 25, 5, 10, 20, -1, 5])
df[df[:weight] .== -1, :weight] = NA
mean(df[:weight])    # returns NA because of two NAs in df[:weight]

NA

In [41]:
# How to find NA
isna(df[:weight])    # returns Boolean values

10-element BitArray{1}:
 false
 false
  true
 false
 false
 false
 false
 false
  true
 false

In [42]:
find(isna(df[:weight]))    # returns indices of NA values

2-element Array{Int64,1}:
 3
 9

In [43]:
# Fill NAs with the mean
m = round(Int64, mean(df[!isna(df[:weight]), :weight]))
df[isna(df[:weight]), :weight] = m

14

In [44]:
show(df[:weight])

[10,20,14,15,25,5,10,20,14,5]

### Altering Data Frames

In [45]:
# How to delete data
delete!(df, :length)    # remove the :length column; the trailing ! means df is modified in place

Row,height,weight
1,1,10
2,2,20
3,3,14
4,4,15
5,5,25
6,6,5
7,7,10
8,8,20
9,9,14
10,10,5


In [46]:
# To add a row, use push!() together with the @data() macro:
push!(df, @data([6, 15]))    # add a row with values of (6, 15)

In [47]:
df

Row,height,weight
1,1,10
2,2,20
3,3,14
4,4,15
5,5,25
6,6,5
7,7,10
8,8,20
9,9,14
10,10,5
11,6,15


In [49]:
# If you wish to delete certain rows, you can do that using the deleterows!() command:
deleterows!(df, 9:11)    # delete rows using range
df

Row,height,weight
1,1,10
2,2,20
3,3,14
4,4,15
5,5,25
6,6,5
7,7,10
8,8,20


In [50]:
deleterows!(df, [1,2,4])    # delete rows using array
df

Row,height,weight
1,3,14
2,5,25
3,6,5
4,7,10
5,8,20


### Sorting the Contents of a Data Frame

In [51]:
# group the rows by :weight and count the rows in each group
by(df, :weight, nrow)

Row,weight,x1
1,5,1
2,10,1
3,14,1
4,20,1
5,25,1


In [52]:
# sort in place by :height, breaking ties by :weight
sort!(df, cols = [order(:height), order(:weight)])

Row,height,weight
1,3,14
2,5,25
3,6,5
4,7,10
5,8,20


In [53]:
# sort in place by a single column
sort!(df, cols = order(:height))

Row,height,weight
1,3,14
2,5,25
3,6,5
4,7,10
5,8,20


## Importing and Exporting Data

### Accessing .JSON Data Files

In [54]:
Pkg.add("JSON")
import JSON

f = open("file.json")    # open a .json file
X = JSON.parse(f)        # read in a .json file and store in variable X
close(f)                 # close a .json file

INFO: No packages to install, update or remove
INFO: Package database updated


### Storing Data in .JSON Files

In [None]:
f = open("test.json", "w")    # open "test.json" in write mode
JSON.print(f, X)              # write X to "test.json" (the function name is lowercase)
close(f)                      # close the file

### Loading Data Files into Data Frames

In [None]:
df = readtable("CaffeineForTheForce.csv")
df = readtable("CaffeineForTheForce.csv", nastrings = ["N/A", "-", ""])    # treat "N/A", "-", and empty fields as NA while reading

### Saving Data Frames into Data Files

In [None]:
writetable("dataset.csv", df)    # comma-separated values
writetable("dataset.tsv", df)    # tab-separated values; the separator is inferred from the file extension

## Cleaning Up Data

### Cleaning Up Numeric Data

### Cleaning Up Text Data
- Punctuation marks
- Numbers
- Symbols ("+", "*", "<", etc.)
- Extra white spaces
- Special characters ("@", "~", etc.)
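Of the items in this list, extra white space is the easiest to handle with a single regular expression. The following is a minimal sketch, not part of the original notebook: the helper name `squeeze_spaces` and the sample string are hypothetical, and the `replace` call uses the pair syntax of Julia 1.x (on Julia 0.5 the equivalent call would be `replace(s, r"\s+", " ")`).

```julia
# Hypothetical helper: collapse every run of whitespace into one space
# and trim the leading and trailing whitespace.
squeeze_spaces(s) = strip(replace(s, r"\s+" => " "))

messy = "  One   efficient \t way  of   stripping  "
clean = squeeze_spaces(messy)
println(clean)
```

Running this after a character filter keeps the two concerns separate: one step decides which characters survive, the other tidies the spacing that is left behind.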

In [56]:
# Keep only the letters and spaces of S, collecting them in Z; everything else listed above is dropped
S = "One efficient way of Stripping a given text(stored in variable S) of most of the irrelevant characters is the following:"
Z = ""
for c in S
    if lowercase(c) in "qwertyuiopasdfghjklzxcvbnm "
        Z = string(Z, c)
    end
end
Z


"One efficient way of Stripping a given textstored in variable S of most of the irrelevant characters is the following"

## Formatting and Transforming Data

### Formatting Numeric Data

In [57]:
x = [1.0, 5.0, 3.0, 78.0, -2.0, -54.0]    # ::Float64
x = convert(Array{Int8}, x)               # ::Int8
show(x)

Int8[1,5,3,78,-2,-54]

In [58]:
x = convert(Array{Float16}, x)            # ::Float16
show(x)

Float16[1.0,5.0,3.0,78.0,-2.0,-54.0]
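The conversions above succeed because every value is a whole number that fits in the target type. As a hedged sketch of what happens otherwise (assuming Julia 1.x; the sample values are made up for illustration), `convert` refuses to drop a fractional part, while rounding first makes the loss of precision explicit:

```julia
x = [1.0, 5.0, 3.0]
y = convert(Array{Int8}, x)        # fine: whole numbers within the Int8 range

# A fractional value cannot be represented exactly, so convert throws InexactError.
failed = try
    convert(Array{Int8}, [1.5, 2.0])
    false
catch err
    isa(err, InexactError)
end

# Rounding element-wise to Int8 states the intended loss of precision.
z = round.(Int8, [1.4, 2.6, -54.2])

println(failed)
show(z)
```

The same `InexactError` is raised for values outside the target range, such as `200.0` for `Int8`.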

### Formatting Text Data

In [59]:
'c' == "c"    # a Char ('c') and a String ("c") are different types, so they never compare equal

false
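To compare the two, bring them to the same type first. A small sketch (the variable names are illustrative only) that converts in either direction:

```julia
c = 'c'    # a Char
s = "c"    # a String of length 1

println(string(c) == s)    # compare as strings
println(c == s[1])         # compare as characters
println(typeof(c), " vs ", typeof(s))
```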

### Importance of Data Types
- It cannot be stressed enough that data types need to be chosen carefully,
  particularly when dealing with large data sets.
- An incorrect data type may waste valuable resources (especially RAM).
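A rough way to see the cost is to compare the storage of the same number of elements under two types. This is only a sketch; the array length `n` is an arbitrary choice for illustration:

```julia
n = 1_000_000
as_float64 = zeros(Float64, n)
as_int8    = zeros(Int8, n)

# sizeof reports the number of bytes taken by the element data.
println(sizeof(as_float64))
println(sizeof(as_int8))
println(div(sizeof(as_float64), sizeof(as_int8)))
```

The ratio printed on the last line is the per-element saving, which carries over to every column stored with the narrower type.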

## Applying Data Transformations to Numeric Data
The main transformations for numeric data are:
- Normalization
- Discretization
- Binarization
- Making a binary variable continuous

### Normalization
- Max-min normalization
- Mean-standard deviation normalization
- Sigmoidal normalization

In [66]:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

In [69]:
# 1. Max-min normalization
norm_x_1 = (x .- minimum(x)) / (maximum(x) - minimum(x))    # .- subtracts the scalar element-wise

10-element Array{Float64,1}:
 0.0     
 0.111111
 0.222222
 0.333333
 0.444444
 0.555556
 0.666667
 0.777778
 0.888889
 1.0     

In [68]:
# 2. Mean-standard deviation normalization
norm_x_2 = (x .- mean(x)) / std(x)

10-element Array{Float64,1}:
 -1.4863  
 -1.15601 
 -0.825723
 -0.495434
 -0.165145
  0.165145
  0.495434
  0.825723
  1.15601 
  1.4863  

In [70]:
# 3. Sigmoidal normalization
norm_x_3 = 1 ./ (1 .+ exp.(-x))    # dot syntax applies each operation element-wise

10-element Array{Float64,1}:
 0.731059
 0.880797
 0.952574
 0.982014
 0.993307
 0.997527
 0.999089
 0.999665
 0.999877
 0.999955

### Discretization (Binning) and Binarization

In [72]:
# turn the age_new variable into three binary variables
age_new = ["young", "young", "mature", "middle-aged", "mature"]
is_young = (age_new .== "young")
is_middle_aged = (age_new .== "middle-aged")
is_mature = (age_new .== "mature")
show(is_young)

Bool[true,true,false,false,false]

In [73]:
# handling missing values
age_new = ["young", "young", "mature", "middle-aged", "mature", "", "NA", "mature", ""]
is_missing = (age_new .== "") | (age_new .== "NA")
show(is_missing)

Bool[false,false,false,false,false,true,true,false,true]

In [74]:
# handling missing values using list comprehension
NA_denotations = ["", "NA"]
age_new = ["young", "young", "mature", "middle-aged", "mature", "", "NA", "mature", ""]
is_missing_lc = [age_value in NA_denotations for age_value in age_new]
show(is_missing_lc)

Bool[false,false,false,false,false,true,true,false,true]
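The cells above start from an `age_new` variable that is already discretized. The binning step itself could look like the following sketch, where the numeric ages, the cut points (30 and 60), and the helper name `age_bin` are all hypothetical choices made for illustration:

```julia
# Hypothetical numeric ages, for illustration only.
ages = [19, 24, 61, 45, 70]

# Map a numeric age to one of three labels using assumed cut points.
function age_bin(age)
    if age < 30
        return "young"
    elseif age < 60
        return "middle-aged"
    else
        return "mature"
    end
end

age_binned = [age_bin(a) for a in ages]
show(age_binned)
```

The resulting labels can then be binarized exactly as shown for `age_new`.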

### Binary to Continuous (Binary Classification Only)
- The relative risk transformation
- The odds ratio
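The notebook names these two measures without defining them. The sketch below uses their standard definitions on a 2×2 contingency table of a binary feature against a binary target; the data are invented, the code assumes Julia 1.x broadcasting (`.&`, `.!`), and how the resulting value is substituted for the binary variable is left open here because the source does not say.

```julia
# Hypothetical binary feature and binary target, for illustration only.
feature = [true, true, true, true, false, false, false, false, false, false]
target  = [true, true, true, false, true, false, false, false, false, false]

# Cells of the 2x2 contingency table.
a = sum(feature .& target)        # feature present, target positive
b = sum(feature .& .!target)      # feature present, target negative
c = sum(.!feature .& target)      # feature absent,  target positive
d = sum(.!feature .& .!target)    # feature absent,  target negative

# Relative risk: positive rate with the feature over positive rate without it.
relative_risk = (a / (a + b)) / (c / (c + d))

# Odds ratio: odds of a positive with the feature over odds without it.
odds_ratio = (a * d) / (b * c)

println(relative_risk)
println(odds_ratio)
```

Both quantities are undefined when a denominator cell is zero, so real data would need a guard or a smoothing constant.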

## Applying Data Transformations to Text Data
- Changing the case of the text
- Turning the whole thing into a vector

### Case Normalization

In [75]:
S = "Mr. Smith is particularly fond of product #2235; What a surprise!"
S_new = lowercase(S)

"mr. smith is particularly fond of product #2235; what a surprise!"

In [76]:
S_upper = uppercase(S)
# lowercase() and uppercase() leave non-alphabetic characters unchanged

"MR. SMITH IS PARTICULARLY FOND OF PRODUCT #2235; WHAT A SURPRISE!"

### Vectorization

In [83]:
X = ["Julia is a relatively new programming language", 
    "Julia can be used in data science", "Data science is used to derive insights from data", 
    "Data is often noisy"]

4-element Array{String,1}:
 "Julia is a relatively new programming language"   
 "Julia can be used in data science"                
 "Data science is used to derive insights from data"
 "Data is often noisy"                              

In [86]:
temp = [split(lowercase(x), " ") for x in X]
vocabulary = unique(temp[1])
for T in temp[2:end]
    vocabulary = union(vocabulary, T)
end
vocabulary = sort(vocabulary)

19-element Array{SubString{String},1}:
 "a"          
 "be"         
 "can"        
 "data"       
 "derive"     
 "from"       
 "in"         
 "insights"   
 "is"         
 "julia"      
 "language"   
 "new"        
 "noisy"      
 "often"      
 "programming"
 "relatively" 
 "science"    
 "to"         
 "used"       

In [94]:
N = length(vocabulary)
n = length(X)

VX = zeros(Int8, n, N)    # Vectorized X
for i in 1:n
    temp = split(lowercase(X[i]))
    for T in temp
        ind = find(T .== vocabulary)
        VX[i, ind] = 1
    end
end
VX

4×19 Array{Int8,2}:
 1  0  0  0  0  0  0  0  1  1  1  1  0  0  1  1  0  0  0
 0  1  1  1  0  0  1  0  0  1  0  0  0  0  0  0  1  0  1
 0  0  0  1  1  1  0  1  1  0  0  0  0  0  0  0  1  1  1
 0  0  0  1  0  0  0  0  1  0  0  0  1  1  0  0  0  0  0

## Preliminary Evaluation of Features

### Regression
- Examine the absolute value of the feature's coefficient in a regression model, such as linear regression, a support vector machine (SVM), or a decision tree
- Calculate the absolute value of the correlation of that feature with the target variable (particularly the rank-based correlation)

In general, the higher each of these two values is, the better the feature.
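The correlation check can be sketched as follows. This assumes Julia 1.x, where `cor` lives in the `Statistics` standard library; the data and the `ranks` helper are hypothetical, and the rank computation shown is only valid when the values contain no ties.

```julia
using Statistics   # provides cor in Julia 1.x

# Hypothetical feature and target values, for illustration only.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target  = [1.2, 3.9, 9.1, 15.8, 25.3, 35.9]

# Rank of each element (1 = smallest); assumes there are no tied values.
ranks(v) = invperm(sortperm(v))

pearson  = abs(cor(feature, target))                  # linear correlation
spearman = abs(cor(ranks(feature), ranks(target)))    # rank-based correlation

println(pearson)
println(spearman)
```

Because the target here grows monotonically but not linearly with the feature, the rank-based value is higher than the linear one, which is why the rank-based correlation is often the more forgiving screen.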

### Classification
- Index of discernibility
- Fisher's Discriminant Ratio
- Similarity index
- Jaccard Similarity
- Mutual Information
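The notebook lists these measures without formulas. As a hedged illustration of two of them, the sketch below uses common textbook definitions: Fisher's Discriminant Ratio for one feature and two classes as the squared difference of class means over the sum of class variances, and Jaccard Similarity of two binary vectors as shared positives over all positives. The data are invented and the code assumes Julia 1.x.

```julia
using Statistics   # provides mean and var in Julia 1.x

# Hypothetical values of one feature in each of two classes.
class_a = [1.0, 2.0, 3.0]
class_b = [7.0, 8.0, 9.0]

# Fisher's Discriminant Ratio: larger values mean better class separation.
fdr = (mean(class_a) - mean(class_b))^2 / (var(class_a) + var(class_b))

# Jaccard Similarity between a binary feature u and a binary target v.
u = [true, true, false, true, false]
v = [true, false, false, true, true]
jaccard = sum(u .& v) / sum(u .| v)

println(fdr)
println(jaccard)
```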