# What is this Notebook about?

This Notebook is part of the Data Science for Management course I'm teaching at the University of Florence (Italy) during the 2016/2017 academic year. A companion tutorial for the same course is [Python Course for Data Science](https://github.com/andrgig/Data-Science-for-Management/blob/master/Python%20Course%20for%20Data%20Science.ipynb).

For suggestions, you can contact me on [LinkedIn](https://it.linkedin.com/in/agigli) or [Twitter](https://twitter.com/andrgig).

*Andrea Gigli, PhD in Applied Statistics, MSc in Big Data Analytics and Social Mining*

---

# What is R

R is a powerful language and environment for doing Data Science (i.e. machine learning, information retrieval, statistical analysis, optimization and visualization). It is a free software project (part of GNU) similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies).

The main advantages of R are that it is free, open-source software and that a lot of help is available online thanks to a very active community. It is quite similar to other numerical computing environments such as MATLAB (not free), and more user-friendly than general-purpose programming languages such as C++ or Java.

Along with Python, it is one of the most important tools to know for people working with data at all levels.

R can be downloaded for free at http://www.r-project.org/

We'll make extensive use of the RStudio interface, available at http://www.rstudio.org/


R can perform many statistical and data analyses thanks to the so-called packages, or libraries, built by the R user community. The standard installation includes the most common packages. To get a list of all installed packages, go to the RStudio packages window or type library() in the console window.

In [1]:
library()

Many more packages are available on the R website than those installed by default. If you want to install and use a package (for example, the package called “geometry”) you should:

1) first click "install packages" in the packages window and type geometry, or type install.packages("geometry") in the console window,

2) then check the box in front of geometry in the packages window or type library("geometry") in the console window.

In [2]:
# In a non-interactive session (such as this notebook) a CRAN mirror must be given explicitly
install.packages("geometry", repos = "https://cloud.r-project.org")


In [3]:
library("geometry")


You can ask R for help on a command using "help()" or "?"; help.search() searches the documentation for a keyword

In [4]:
help(round)
?round
help.search("round")

starting httpd help server ... done


## Working Directory
Your working directory is the folder on your computer in which you are currently working. R always assumes a working directory:
• New files are created by default in this directory
• Files to be read are expected by default in this directory
• getwd() and setwd() are used to manage it.
When you ask R to open a certain file, it will look for it in the working directory, and when you tell R to save a data file or figure, it will save it in the working directory.
Before you start working, set your working directory to the folder where all your data and script files are or should be stored. Type in the console window:

In [5]:
setwd("directoryname")

ERROR: Error in setwd("directoryname"): cannot change working directory


If you want to know your working directory type

In [6]:
getwd()

setwd("../../Dati/") # go up 2 levels, then enter Dati
dir() # list files/folders
dir.create("tmpLavoro") # create a folder
setwd("tmpLavoro") # enter it

source("script.R") runs the R code contained in the text file "script.R" (which you can also edit with Notepad).
This allows you to:
• split a program into separate modules (scripts), to be run by calling them in sequence;
• put a function you use repeatedly in a personal library script, to be loaded when needed.
Example: the beginning of one of my “main scripts”:
source("00_Parametri.R")
source("00_Input.R")
source("00_CaricamentoDati.R")
source("libUtils.R")
source("libFunz.R")

## R as a calculator
You can use R as a calculator and do both basic mathematical operations...

In [7]:
2+2

In [8]:
4-5

In [9]:
4/5

In [10]:
8*9

In [11]:
8^2

In [12]:
exp(2)

In [13]:
log(exp(2))

In [14]:
5%%2

In [15]:
5%/%2

... or more complex ones

In [16]:
log((54*3+2)^2)/4

## Variables 
You can also create variables in R and use them later in your scripts. The sign "<-" is used to assign a value to a variable. The basic variable types are
- Numeric
- Boolean
- Character

In [17]:
#numeric variable
x <- 10

After an assignment, no result is displayed. To view the content of a variable, just type its name

In [18]:
x

You can also wrap the assignment in parentheses to print the assigned value at the same time

In [19]:
(x <- 20)

In [20]:
y <- x + 10
z <- x^(3/4) -pi*y
z

In [21]:
# boolean variable
z > 0

In [22]:
!(z > 0)

In [23]:
#string variable
string <- "pippo"
string

In [24]:
boolean <- TRUE
boolean

"Inf" stands for infinity and "NaN" for Not a Number (an undefined result such as 0/0). Missing data is represented by "NA":

In [25]:
1/0

In [26]:
0/0

In [27]:
10*1/0

In [28]:
10*0/0

is.na() can be used to test whether a value is missing; it returns TRUE for both NA and NaN

In [29]:
is.na(0/0)
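
As a small sketch of the difference between these special values (using only base R functions, on a made-up vector), is.na() flags both NA and NaN, while is.nan() flags only NaN:

```r
# NA marks a missing value, NaN an undefined numeric result
values <- c(1, NA, 0/0, 1/0)

is.na(values)      # TRUE for NA and for NaN
is.nan(values)     # TRUE only for NaN
is.finite(values)  # FALSE for NA, NaN and Inf
```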

## Dates and Time

The as.Date function is a built-in function that handles dates (without times or time zones). The chron library handles dates and times (without time zones). The POSIXct and POSIXlt classes handle dates and times with support for time zones.

### as.Date

The as.Date function accepts a variety of input formats through the format = argument. The default format is a four-digit year, followed by a month, then a day, separated by either dashes or slashes.


In [30]:
as.Date('1915-6-16')

[1] "1915-06-16"

Other input formats can be described by combining the following format codes:

| Code | Value |
| :- | :------------- |
| %d | Day of the month (decimal) |
| %m | Month (decimal) |
| %b | Month (abbreviated name) |
| %B | Month (full name) |
| %y | Year (2 digits) |
| %Y | Year (4 digits) |



In [31]:
as.Date('01/15/2001',format='%m/%d/%Y')

[1] "2001-01-15"

In [32]:
 as.Date('April 26, 2001',format='%B %d, %Y')

[1] "2001-04-26"

**Internally, Date objects are stored as the number of days since January 1, 1970**, using negative numbers for earlier dates. The as.numeric function can be used to convert a Date object to its internal form.
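
A quick way to see this internal representation, as a minimal sketch using only base R (the variable names are illustrative):

```r
epoch <- as.Date("1970-01-01")
later <- as.Date("1970-01-10")

as.numeric(epoch)                   # 0: the reference date itself
as.numeric(later)                   # 9: days elapsed since the reference date
as.numeric(as.Date("1969-12-31"))   # -1: earlier dates are negative
later - epoch                       # subtraction gives a difference in days
```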

To extract the components of the dates, the weekdays, months, days or quarters functions can be used.

In [33]:
bdays = c(tukey=as.Date('1915-06-16'),fisher=as.Date('1890-02-17'),
          cramer=as.Date('1893-09-25'), kendall=as.Date('1907-09-06'))
weekdays(bdays)
months(bdays)

### chron
This class is a good option when you don’t need to deal with timezones. It requires the package chron.

In [34]:
require(chron)

Loading required package: chron


In [35]:
#create some times:

tm1.c <- as.chron("2013-07-24 23:55:26")
tm1.c

[1] (07/24/13 23:55:26)

In [36]:
attributes(tm1.c)

In [37]:
tm2.c <- as.chron("07/25/13 08:32:07", "%m/%d/%y %H:%M:%S")
tm2.c

[1] (07/25/13 08:32:07)

In [38]:
#extract just the date:

dates(tm1.c)

    day  
07/24/13 

In [39]:
#compare times:

tm2.c > tm1.c

In [40]:
#add days:

tm1.c + 10


[1] (08/03/13 23:55:26)

In [41]:
#calculate the difference between times:
tm2.c - tm1.c

[1] 08:36:41

In [42]:
difftime(tm2.c, tm1.c, units = "hours")
## Time difference of 8.611 hours

Time difference of 8.611389 hours

In [43]:
#does not adjust for daylight saving time:

as.chron("2013-03-10 08:32:07") - as.chron("2013-03-09 23:55:26")


[1] 08:36:41

### POSIXct
This class stores a date and time as the number of seconds since January 1, 1970 (UTC), which makes it convenient for storage and arithmetic. (“ct” stands for calendar time and “lt” stands for local time.)

In [44]:
#Get the current time (in POSIXct by default):

Sys.time()

[1] "2016-12-28 16:43:12 CET"

In [45]:
#create some POSIXct objects:

tm1 <- as.POSIXct("2013-07-24 23:55:26")
tm1

[1] "2013-07-24 23:55:26 CEST"

In [46]:
attributes(tm1)

In [47]:
tm2 <- as.POSIXct("25072013 08:32:07", format = "%d%m%Y %H:%M:%S")
tm2

[1] "2013-07-25 08:32:07 CEST"

In [48]:
#specify the time zone:

tm3 <- as.POSIXct("2010-12-01 11:42:03", tz = "GMT")
tm3

[1] "2010-12-01 11:42:03 GMT"

In [49]:
tm2 > tm1

In [50]:
tm2 - tm1

Time difference of 8.611389 hours

In [51]:
tm1
tm1 + 30

[1] "2013-07-24 23:55:26 CEST"

[1] "2013-07-24 23:55:56 CEST"

In [52]:
#adjusts for daylight saving time according to the session time zone
#(no clock change falls in this interval in CET, so the result matches chron):

as.POSIXct("2013-03-10 08:32:07") - as.POSIXct("2013-03-09 23:55:26")

Time difference of 8.611389 hours
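
Whether an adjustment shows up depends on the time zone in use. The output above was produced in a Central European time zone, where the clocks did not change on that night. As a sketch, assuming the `America/New_York` zone is available in the system's time zone database, setting the zone explicitly shows the one-hour shift:

```r
# In US time zones the clocks moved forward on 2013-03-10
start <- as.POSIXct("2013-03-09 23:55:26", tz = "America/New_York")
end   <- as.POSIXct("2013-03-10 08:32:07", tz = "America/New_York")
difftime(end, start, units = "hours")   # one hour less than the wall-clock difference

# The same wall-clock times in UTC, which has no daylight saving time
difftime(as.POSIXct("2013-03-10 08:32:07", tz = "UTC"),
         as.POSIXct("2013-03-09 23:55:26", tz = "UTC"),
         units = "hours")
```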

### POSIXlt
This class enables easy extraction of specific components of a time. (“ct” stands for calendar time and “lt” stands for local time. “lt” also helps one remember that POSIXlt objects are lists.)

In [53]:
#create a time:

tm1.lt <- as.POSIXlt("2013-07-24 23:55:26")
tm1.lt


[1] "2013-07-24 23:55:26 CEST"

In [54]:
attributes(tm1.lt)

In [55]:
unclass(tm1.lt)

In [56]:
unlist(tm1.lt)

In [57]:
#extract components of a time object (here the seconds):

tm1.lt$sec

In [58]:
#truncate or round off the time:

trunc(tm1.lt, "days")


[1] "2013-07-24 CEST"

In [59]:
trunc(tm1.lt, "mins")

[1] "2013-07-24 23:55:00 CEST"

### Useful Properties

In [60]:
typeof(x)

In [61]:
typeof(string)

In [62]:
typeof(boolean)

## Vectors and Lists
The vector is the basic data structure in R. It can be an atomic vector (all elements have the same type) or a list (elements can have different types).


In [63]:
#combine objects into an atomic vector
y <- c(1,2,3)

In [64]:
y

In [65]:
string <- "pippo"
strings <- c(string, "pluto")

In [66]:
strings

Atomic vectors are always flat

In [67]:
c(1,c(2, c(3,4)))

In [68]:
v <- vector(mode = "numeric", length = 10); v # initialize a numeric vector of length 10

Vectors can be created with dedicated built-in functions. A very useful one is seq()

In [69]:
seq(from = 1, to = 10, by = 3)

In [70]:
v <- seq(-1,1,0.01)
v

Vector elements can be accessed in several ways

In [71]:
v[1]

In [72]:
v[1:3]

In [73]:
v[c(10:12,28:29,32)]

**Be aware that R doesn't work like Python**... If you type v[-1] you get the vector without its first element, not the last element of the vector

In [74]:
v[-1]
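
To make the contrast with Python concrete, here is a small sketch (on an illustrative vector `w`) of how negative indices drop elements and how to get the last element instead:

```r
w <- c(10, 20, 30, 40)

w[-1]          # drops the first element: 20 30 40
w[-c(1, 2)]    # drops the first two elements: 30 40
w[length(w)]   # last element: 40
tail(w, 1)     # last element again, using tail()
```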

Here are some useful built-in functions you can apply to a vector

In [75]:
length(v) # length (number of elements) of v
min(v) # minimum of the elements of v
max(v) # maximum of the elements of v
sum(v) # sum of the elements of v
prod(v) # product of the elements of v
sort(v) # returns a vector with the elements of v sorted
head(v,5)  # first 5 values of v
tail(v,5) # last 5 values of v

Lists are another very useful data structure in R. They differ from atomic vectors because a list can hold elements of different data types.

In [76]:
#list
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
x

In [77]:
x[[4]]
typeof(x[[4]])

In [78]:
x[4]
typeof(x[4])

In [79]:
x[[4]][2]
typeof(x[[4]][2])

In [80]:
x[2:4]

In [81]:
L <- list(name = "Andrea", height = 190.0, married = TRUE, number.children = 0, email = "andrgig@gmail.com")
names(L)

In [82]:
L[[1]] ; L[[5]]
L$married
L$number.children 

In [83]:
L$number.children <- 1
L


In [84]:
L<-append(L,list(children.name=c("Arianna")))
L

In [85]:
#lists
l <- list()
l[10] <- "pippo"
l[1] <- 1
l <- append(l, "pluto")
l <- c(l, c(1, 2, 3))
l

## Arrays and Matrices
Multi-dimensional arrays are atomic vectors with an additional attribute, dim, set with dim(). A special case of arrays are matrices, which are two-dimensional arrays. Like any atomic vector, a matrix holds elements of a single type (numeric, character, logical, ...).
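
As a minimal sketch of the role of the dim attribute (variable names are illustrative), a plain vector becomes a matrix as soon as it is given two dimensions:

```r
values <- 1:6
is.matrix(values)        # FALSE: a plain vector has no dim attribute

dim(values) <- c(2, 3)   # 2 rows, 3 columns, filled column by column
is.matrix(values)        # TRUE
attributes(values)       # shows $dim
values[2, 3]             # element in row 2, column 3: 6
```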

In [86]:
#arrays
a <- array(c(1,2,3,4,5,6,7,8,9,10), c(2,5))

In [87]:
a

0,1,2,3,4
1,3,5,7,9
2,4,6,8,10


In [88]:
nrow(a)
ncol(a)

In [89]:
rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c","d","e")
a

Unnamed: 0,a,b,c,d,e
A,1,3,5,7,9
B,2,4,6,8,10


Accessing array elements with []:

In [90]:
a[1]
a[1,1]
a[3]
a[1,2]
a[-1]
a[-c(2,4)]

In [91]:
a1 <- array(seq_len(24), c(2,4,3))
a1
a1[1,1,1]
a1[1,1,3]
a1[2,4,3]

In [92]:
length(a1)

In [93]:
attributes(a1)

In [94]:
#transform array into vectors
dim(a1) <- NULL
a1

In [95]:
attributes(a1)

NULL

In [96]:
length(a1)

In [97]:
# Matrix definition requires 2 scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
a

0,1,2
1,3,5
2,4,6


In [98]:
#or just one of them
m <- matrix(c(1,2,3,4,5,6,7,8,9,10), ncol = 2) # a matrix is a 2D array
m

0,1
1,6
2,7
3,8
4,9
5,10


In [99]:
dim(m)

In [100]:
#get dimensions and set dim names
ncol(m); nrow(m)

Here are some matrix operations you can try

In [101]:
t(m) # transposed matrix
n <- m*2/5
m + n # element-wise sum
m * n # element-wise product (not the matrix product)!


0,1,2,3,4
1,2,3,4,5
6,7,8,9,10


0,1
1.4,8.4
2.8,9.8
4.2,11.2
5.6,12.6
7.0,14.0


0,1
0.4,14.4
1.6,19.6
3.6,25.6
6.4,32.4
10.0,40.0


In [102]:
A <- m %*% t(n); A # matrix (row-by-column) product!
diag(5:7) # creates a 3x3 matrix with zeros off the diagonal
x <- 1:15; dim(x) <- c(3,5) # converts x into a 3x5 matrix
eigen(A) # computes eigenvalues and eigenvectors (A must be square!)
qr(A) # QR decomposition
svd(A) # Singular Value Decomposition

0,1,2,3,4
14.8,17.6,20.4,23.2,26
17.6,21.2,24.8,28.4,32
20.4,24.8,29.2,33.6,38
23.2,28.4,33.6,38.8,44
26.0,32.0,38.0,44.0,50


0,1,2
5,0,0
0,6,0
0,0,7


0,1,2,3,4
-0.3042621,-0.71233741,0.0,0.6324555,0.0
-0.3707409,-0.40317635,-0.02698067,-0.6324555,0.5470576
-0.4372198,-0.09401529,0.44372691,-0.3162278,-0.7093
-0.5036986,0.21514578,-0.8065118,7.21645e-16,-0.222573
-0.5701775,0.52430684,0.38976556,0.3162278,0.3848153


$qr
            [,1]        [,2]          [,3]          [,4]          [,5]
[1,] -46.4671927 -56.6076805 -6.674817e+01 -7.688866e+01 -8.702914e+01
[2,]   0.3787619   0.6086934  1.217387e+00  1.826080e+00  2.434773e+00
[3,]   0.4390194  -0.1498862 -6.301922e-15 -6.669632e-16  3.706437e-15
[4,]   0.4992770  -0.4925889  6.518369e-01  1.737456e-15  5.360213e-16
[5,]   0.5595346  -0.8352916 -3.523443e-02  8.753660e-01 -3.983144e-15

$rank
[1] 2

$qraux
[1] 1.318504e+00 1.192816e+00 1.757540e+00 1.483461e+00 3.983144e-15

$pivot
[1] 1 2 3 4 5

attr(,"class")
[1] "qr"

0,1,2,3,4
-0.3042621,-0.71233741,0.61211795,-0.10250109,0.1216764
-0.3707409,-0.40317635,-0.74624939,-0.32576568,0.1923241
-0.4372198,-0.09401529,-0.15069837,0.86802314,-0.1543562
-0.5036986,0.21514578,0.09167311,-0.34874488,-0.7549656
-0.5701775,0.52430684,0.1931567,-0.09101149,0.5953213

0,1,2,3,4
-0.3042621,-0.71233741,0.33548899,0.334905904,0.41867072
-0.3707409,-0.40317635,-0.01704823,-0.770650749,-0.32527954
-0.4372198,-0.09401529,-0.3404715,0.533736673,-0.63182618
-0.5036986,0.21514578,-0.60986826,-0.095144714,0.56480809
-0.5701775,0.52430684,0.63189901,-0.002847113,-0.02637309


Matrix elements can be accessed in different ways

In [103]:
#indexing
A[1]
A[c(1,3)]
A[1, 2]
A[1,]

To remove all variables from R’s memory, click “clear all” in the workspace window or type

In [104]:
rm(list=ls())

If you only want to remove a single variable, type rm(variable_name)

In [105]:
a <- 5
b <- a
ls()
rm(a)
ls()

## Factor Variables
A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes them behave differently from regular integer vectors, and the levels(), which defines the set of allowed values.

In [106]:
x <- c("female","male","male","female", "female")
x

In [107]:
class(x)

In [108]:
f <- factor(x)
f
class(f)

In [109]:
levels(f)
as.numeric(f)

In [110]:
table(f)

f
female   male 
     3      2 

# Dataframes
Data frames are the best-known R data structure. A data frame represents a table, with rows and columns, where

1) Each column holds data of the same type (a vector)

2) Different columns can have different data types (a list)

3) Every row in a data frame has a row id.

Rows and columns can be accessed through indexes, which can be numeric or associative
- df[ 1, 2 ]
- df$example_column
- df[ 1:3, 4:5 ]


In [19]:
#data frames
df1 <- data.frame(x = seq_len(20), y =rnorm(20), z = 2*seq_len(20))

df1$x

Data Frames can be analyzed using the following commands
• nrow and ncol
• summary
• head
• str

In [20]:
head(df1)

x,y,z
1,0.21935915,2
2,-1.20865293,4
3,-0.67966199,6
4,-0.08903885,8
5,2.21350302,10
6,1.08399879,12


In [21]:
summary(df1)

       x               y                 z       
 Min.   : 1.00   Min.   :-1.9639   Min.   : 2.0  
 1st Qu.: 5.75   1st Qu.:-0.5085   1st Qu.:11.5  
 Median :10.50   Median : 0.3215   Median :21.0  
 Mean   :10.50   Mean   : 0.2531   Mean   :21.0  
 3rd Qu.:15.25   3rd Qu.: 0.8341   3rd Qu.:30.5  
 Max.   :20.00   Max.   : 2.4327   Max.   :40.0  

In [22]:
nrow(df1); ncol(df1)

In [23]:
str(df1)

'data.frame':	20 obs. of  3 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10 ...
 $ y: num  0.219 -1.209 -0.68 -0.089 2.214 ...
 $ z: num  2 4 6 8 10 12 14 16 18 20 ...


In [24]:
names(df1)[1] <- "column1" # rename the first column
df1$column1
df1$x # NULL: the column named x no longer exists
df1$y

NULL

In [68]:
#Load a built-in R data frame
my.airquality <- airquality
head(my.airquality)

Ozone,Solar.R,Wind,Temp,Month,Day
41.0,190.0,7.4,67,5,1
36.0,118.0,8.0,72,5,2
12.0,149.0,12.6,74,5,3
18.0,313.0,11.5,62,5,4
,,14.3,56,5,5
28.0,,14.9,66,5,6


In [69]:
my.airquality[2,]<-data.frame(34,197,7.8,67,5,0)
head(my.airquality)

Ozone,Solar.R,Wind,Temp,Month,Day
41.0,190.0,7.4,67,5,1
34.0,197.0,7.8,67,5,0
12.0,149.0,12.6,74,5,3
18.0,313.0,11.5,62,5,4
,,14.3,56,5,5
28.0,,14.9,66,5,6


In [71]:
my.airquality <- my.airquality[-2,]
#or
#subset(my.airquality, Day != 0 )
#to drop a column instead (here the second one)
#my.airquality[ ,2] <- NULL
head(my.airquality)

Unnamed: 0,Ozone,Wind,Temp,Month,Day
1,41.0,7.4,67,5,1
4,18.0,11.5,62,5,4
5,,14.3,56,5,5
6,28.0,14.9,66,5,6
7,23.0,8.6,65,5,7
8,19.0,13.8,59,5,8


In [58]:
#what's happening here?
subset(airquality[1:40,], Temp > 80, select = c(Ozone, Temp))

Unnamed: 0,Ozone,Temp
29,45.0,81
35,,84
36,,85
38,29.0,82
39,,87
40,71.0,90


In [75]:
airquality[3,]

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
3,12,149,12.6,74,5,3


In [93]:
airquality[,1]

In [76]:
summary(airquality)

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               

In [95]:
airquality[airquality$Ozone>0,]

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
1,41,190,7.4,67,5,1
2,36,118,8.0,72,5,2
3,12,149,12.6,74,5,3
4,18,313,11.5,62,5,4
,,,,,,
6,28,,14.9,66,5,6
7,23,299,8.6,65,5,7
8,19,99,13.8,59,5,8
9,8,19,20.1,61,5,9
NA.1,,,,,,
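
The rows full of NA in the output above appear because a comparison with a missing value returns NA, and an NA in a logical index produces an all-NA row. A minimal sketch of two ways to keep only the rows where the condition is actually TRUE (the names `ozone.known` and `ozone.known2` are illustrative):

```r
# which() keeps only the positions where the condition is TRUE, dropping NA
ozone.known <- airquality[which(airquality$Ozone > 0), ]

# subset() treats NA in the condition as FALSE
ozone.known2 <- subset(airquality, Ozone > 0)

nrow(airquality)    # all records
nrow(ozone.known)   # records with a known, positive Ozone value
```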


In [96]:
#Let's find out which records are complete (no missing values)
complete.cases(airquality)
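
Since complete.cases() returns one logical value per row, counting the complete records is a matter of summing it. A short sketch (the name `complete` is illustrative):

```r
complete <- complete.cases(airquality)

sum(complete)                # number of records with no missing value
sum(!complete)               # number of records with at least one missing value
colSums(is.na(airquality))   # missing values per column
```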

In [114]:
#and select them only
my.airquality <- airquality[complete.cases(airquality),]
my.airquality[my.airquality$Ozone>80,]

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
30,115,223,5.7,79,5,30
62,135,269,4.1,84,7,1
69,97,267,6.3,92,7,8
70,97,272,5.7,92,7,9
71,85,175,7.4,89,7,10
86,108,223,8.0,85,7,25
89,82,213,7.4,88,7,28
99,122,255,4.0,89,8,7
100,89,229,10.3,90,8,8
101,110,207,8.0,90,8,9


In [113]:
subset(my.airquality, Month == 5 & Day > 15)

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
16,14,334,11.5,64,5,16
17,34,307,12.0,66,5,17
18,6,78,18.4,57,5,18
19,30,322,11.5,68,5,19
20,11,44,9.7,62,5,20
21,1,8,9.7,59,5,21
22,11,320,16.6,73,5,22
23,4,25,9.7,61,5,23
24,32,92,12.0,61,5,24
28,23,13,12.0,67,5,28


# Dataframe creation

We've seen that a number of built-in data frame examples are available in R. 

In real life, data frames can be created by combining individual vectors or other data frames, or by reading from external sources (files or databases).

- `read.table()` reads a delimited text file into a data frame and lets you control the file format (separator, header, quoting, etc.); `write.table()` is the equivalent write operation
- `read.csv()` reads a CSV file into a data frame; it is `read.table()` with defaults suited to the CSV format (`write.csv()` is the equivalent write operation)
- Files can be downloaded directly from the web using `download.file()`
- Raw web pages can be fetched using `getURL()` (from the RCurl package)
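
As a sketch of the read/write pair, the example below writes a small data frame to a temporary CSV file and reads it back. The data frame `people` and the temporary file are made up for illustration:

```r
people <- data.frame(name = c("Anna", "Marco"), age = c(31, 45))

# write to a temporary file, without the row names column
csv.file <- tempfile(fileext = ".csv")
write.csv(people, csv.file, row.names = FALSE)

# read it back; stringsAsFactors = FALSE keeps text columns as character
people2 <- read.csv(csv.file, stringsAsFactors = FALSE)
str(people2)

unlink(csv.file)   # remove the temporary file
```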



In [37]:
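##### working with data frames #####
getwd()
# read.csv() expects a header row and comma-separated fields;
# foods.csv must be in the working directory
ds <- read.csv("foods.csv")
# average length of the review text (ReviewText column)
review.length <- nchar(as.character(ds$ReviewText))
mean(review.length)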

"cannot open file 'foods.csv': No such file or directory"

ERROR: Error in file(file, "rt"): cannot open the connection


Data frames can be manipulated in different ways, for example

- Data Frames can be combined with `rbind()` and `cbind()`

In [38]:
df1 <- data.frame(x = seq_len(10), y =rnorm(10), z = 2*seq_len(10))
df2 <- data.frame(xx = 10 + seq_len(10), yy =rnorm(10), zz = 2*(10+seq_len(10)))
df1; df2

x,y,z
1,-0.4016475,2
2,0.9342918,4
3,-1.0732909,6
4,-0.791802,8
5,0.2385133,10
6,-0.7641136,12
7,-2.4276716,14
8,0.1812767,16
9,-0.3311844,18
10,-0.4987595,20


xx,yy,zz
11,1.035665,22
12,-0.81030789,24
13,-1.4121061,26
14,-0.97840458,28
15,-1.14952704,30
16,-0.97054509,32
17,0.01706244,34
18,1.23016052,36
19,-0.97179538,38
20,-0.1375043,40


In [39]:
cbind(df1,df2)

x,y,z,xx,yy,zz
1,-0.4016475,2,11,1.035665,22
2,0.9342918,4,12,-0.81030789,24
3,-1.0732909,6,13,-1.4121061,26
4,-0.791802,8,14,-0.97840458,28
5,0.2385133,10,15,-1.14952704,30
6,-0.7641136,12,16,-0.97054509,32
7,-2.4276716,14,17,0.01706244,34
8,0.1812767,16,18,1.23016052,36
9,-0.3311844,18,19,-0.97179538,38
10,-0.4987595,20,20,-0.1375043,40


In [40]:
df3 <- data.frame(x = 100 + seq_len(5), y =rnorm(5), z = 2*(100 + seq_len(5)))
df3

x,y,z
101,-1.1572085,202
102,-1.0783864,204
103,-0.7911281,206
104,-0.2139881,208
105,0.1983205,210


In [41]:
rbind(df1,df3)

x,y,z
1,-0.4016475,2
2,0.9342918,4
3,-1.0732909,6
4,-0.791802,8
5,0.2385133,10
6,-0.7641136,12
7,-2.4276716,14
8,0.1812767,16
9,-0.3311844,18
10,-0.4987595,20


- Columns can be added by using an associative name and contents

In [42]:
df3$example_col <- c(1,2,3,4,5)
df3

x,y,z,example_col
101,-1.1572085,202,1
102,-1.0783864,204,2
103,-0.7911281,206,3
104,-0.2139881,208,4
105,0.1983205,210,5


- Most operations on vectors can be done on individual rows or columns of a data frame, for example any vector can be sorted using `sort()`, so we can sort the column `y` of dataframe `df1` in the same way

In [44]:
sort(df1$y)

## Dataframe transformations: Sorting, Merging, Binning
Any vector can be sorted using the `sort()` function. A Data Frame can be sorted according to one or more of its columns using the `order()` function.

In [45]:
df1[order(df1$y) , ]

Unnamed: 0,x,y,z
7,7,-2.4276716,14
3,3,-1.0732909,6
4,4,-0.791802,8
6,6,-0.7641136,12
10,10,-0.4987595,20
1,1,-0.4016475,2
9,9,-0.3311844,18
8,8,0.1812767,16
5,5,0.2385133,10
2,2,0.9342918,4


In [46]:
df1[order(-df1$z, df1$y),]

Unnamed: 0,x,y,z
10,10,-0.4987595,20
9,9,-0.3311844,18
8,8,0.1812767,16
7,7,-2.4276716,14
6,6,-0.7641136,12
5,5,0.2385133,10
4,4,-0.791802,8
3,3,-1.0732909,6
2,2,0.9342918,4
1,1,-0.4016475,2


Two Data Frames can be merged using `merge()`. 
- if the two data frames have a column with the same name, `merge()` will use that column for the join operation
- otherwise you have to specify the join column of the left data frame (`by.x`) and of the right data frame (`by.y`)
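
A minimal sketch of the first case, with two made-up data frames (`left` and `right`) that share a column called `id`:

```r
left  <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
right <- data.frame(id = c(2, 3, 4), group = c("a", "b", "c"))

# no by argument needed: the shared column id is used as the key
merged <- merge(left, right)
merged

# by default only matching keys are kept; all.x = TRUE keeps every row of left
merge(left, right, all.x = TRUE)
```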

In [52]:
df4 <- data.frame(k = seq(from = 7, to = 1, by = -1), l = runif(7))
df4; df1

k,l
7,0.2884849
6,0.4861099
5,0.995627
4,0.6489981
3,0.1650115
2,0.2952198
1,0.8211966


x,y,z
1,-0.4016475,2
2,0.9342918,4
3,-1.0732909,6
4,-0.791802,8
5,0.2385133,10
6,-0.7641136,12
7,-2.4276716,14
8,0.1812767,16
9,-0.3311844,18
10,-0.4987595,20


In [53]:
merge(df1,df4, by.x = "x", by.y = "k" )

x,y,z,l
1,-0.4016475,2,0.8211966
2,0.9342918,4,0.2952198
3,-1.0732909,6,0.1650115
4,-0.791802,8,0.6489981
5,0.2385133,10,0.995627
6,-0.7641136,12,0.4861099
7,-2.4276716,14,0.2884849


Continuous data can be converted to categorical data by binning (conversion into pre-defined ranges or bins) using the cut() function

In [124]:
cut(df1$y, c(-Inf,seq(0.5, 1, 0.1), Inf))
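
As a sketch of binning with readable labels (the `ages` vector and the bin boundaries are made up for illustration), cut() returns a factor that can then be tabulated:

```r
ages <- c(5, 17, 18, 34, 65, 80)

# intervals are right-closed by default: (0,17], (17,64], (64,Inf]
age.class <- cut(ages, breaks = c(0, 17, 64, Inf),
                 labels = c("minor", "adult", "senior"))
age.class
table(age.class)
```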

### Apply function
Many R programs need to loop through a data frame and perform an operation on each record. The “apply” family of functions provides a shortcut through which that entire operation can be performed in one function call.

The function applied to each row can also be a user-defined function.

Variants of the “apply” functions include `apply`, `eapply`, `lapply`, `mapply`, `rapply`, `sapply`, `tapply`

In [115]:
mat <- matrix(c(1:20), nrow= 10, ncol= 2)
apply(mat, 1, mean)
apply(mat, 2, mean)

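`apply(matrix, index, function)`
- with index=1 it applies function() to the rows of matrix and, when function() returns a single value, returns a vector as long as the number of rows;
- with index=2 it applies function() to the columns and returns a vector as long as the number of columns.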

In [126]:
nrighe <- 100000; ncolonne <- 10
m <- matrix(rnorm(nrighe*ncolonne),
nrow=nrighe, ncol=ncolonne)
mediacolonne <- apply(m,2,mean)
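
For common summaries such as means and sums, base R also offers dedicated functions (colMeans(), rowMeans(), colSums(), rowSums()) that give the same values as the corresponding apply() call. A sketch on a small made-up matrix:

```r
small <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

apply(small, 2, mean)   # column means: 1.5 3.5 5.5
colMeans(small)         # same values

apply(small, 1, sum)    # row sums: 9 12
rowSums(small)          # same values
```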

`lapply(list, function)`
- applies function() to each element of list;
- returns a list with the same length as list.

In [127]:
L <- list(1:2, 3:7, -1:10)
lapply(L, sd)

`sapply(dataframe, function)`
- function() is applied to each column of dataframe;
- returns a vector as long as the number of columns of dataframe.

In [128]:
clienti <- data.frame(genere = c("M","F","F",NA),
altezza=c(172,186.5,165,180),
peso = c(91, 75, 74, 85))
sapply(clienti, mean)
sapply(clienti[ ,c(2,3)], quantile, prob=0.95)

"argument is not numeric or logical: returning NA"

`tapply(vector, factor, function)`
- vector is split into sub-vectors corresponding to the levels of factor;
- function() is applied to each sub-vector;
- returns a vector as long as the number of levels of factor.

In [129]:
clienti <- data.frame(genere = c("M","F","F",NA),
altezza=c(172,186.5,165,180),
peso = c(91, 75, 74, 85))
tapply(clienti$altezza, clienti$genere,
mean, na.rm=TRUE)

# Control Structures

If-else:

`if (cond) expr`

`if (cond) expr1 else expr2`

For:

`for (var in seq) expr`

While:

`while (cond) expr`

Switch:

`switch(EXPR, ...)`

Ifelse:

`ifelse(test, yes, no)`
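
The following sketch puts the constructs listed above to work on made-up values; the function name `describe` is only illustrative:

```r
# if / else
x <- 7
if (x %% 2 == 0) print("even") else print("odd")

# for: sum the first five integers
total <- 0
for (i in 1:5) total <- total + i
total   # 15

# while: count how many times 100 can be halved before dropping below 1
n <- 100; steps <- 0
while (n >= 1) {
  n <- n / 2
  steps <- steps + 1
}
steps   # 7

# switch: pick a branch by name, with an unnamed default as last argument
describe <- function(type) {
  switch(type, num = "a number", chr = "a string", "something else")
}
describe("chr")

# ifelse: vectorized version of if / else
ifelse(c(-2, 0, 3) < 0, "negative", "non negative")
```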

# User Defined Functions

Functions can be defined and stored in memory like any other object. They take input parameters and return an output


In [130]:
computeSum <- function(x,y) {
    x+y
}

computeSum(3,5)

In [131]:
segno <- function(x) {
    # print whether x is negative or not
    if (x < 0) {
        print("negativo")
    } else {
        print("non negativo")
    }
}

segno(0) # function call

[1] "non negativo"


In [132]:
funz <- function ( x, y = 2, z = 3 ) {
    return(x+y+z) 
}

funz(5)
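
Since y and z have default values, they can be omitted, overridden by position, or overridden by name. A short sketch, repeating the definition so that it runs on its own:

```r
funz <- function(x, y = 2, z = 3) {
  return(x + y + z)
}

funz(5)           # 5 + 2 + 3 = 10
funz(5, 10)       # y given by position: 5 + 10 + 3 = 18
funz(5, z = 10)   # z given by name: 5 + 2 + 10 = 17
```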

## Data Cleansing and Data Transformation
Issues with Data Quality

- Invalid values
- Formats
- Attribute dependencies
- Uniqueness
- Referential integrity
- Missing values
- Misspellings
- Misfielded values (value in the wrong field)
- Wrong references

Finding Data Quality Issues

- Sample Visual Inspection
- Automated validation code
- Schema validation
- Outlier Analysis
- Exploratory Data Analysis

Fixing Data Quality Issues

- Fixing Data Quality issues is regular boilerplate coding in any language the team is comfortable with
- Fix the source if possible
- Find possible loopholes in data processing streams
- Analyze batches coming in and automate fixing
- Libraries and tools available

Data Imputation

- Any “value” present in a dataset, including null, N/A, blank, etc., is used by machine learning algorithms as a valid value
- This makes populating missing data a key step that might affect prediction results
- Techniques for populating missing data:
  - Mean, median, mode
  - Multiple imputation
  - Regression to estimate values from other variables
  - Hybrid approaches
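
As a minimal sketch of the simplest technique in the list, mean imputation, on the built-in airquality data (the names `aq` and `ozone.mean` are illustrative, and whether the mean is an appropriate fill value depends on the data):

```r
aq <- airquality
sum(is.na(aq$Ozone))   # missing Ozone values before imputation

# replace each missing Ozone value with the mean of the observed ones
ozone.mean <- mean(aq$Ozone, na.rm = TRUE)
aq$Ozone[is.na(aq$Ozone)] <- ozone.mean

sum(is.na(aq$Ozone))   # 0 after imputation
mean(aq$Ozone)         # the mean is unchanged by construction
```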

Data Standardization

- Numbers
  - Decimal places
  - Log
- Date and Time
  - Time Zone
  - POSIX
  - Epoch
- Text Data
  - Name formatting (First Last vs Last, First)
  - Lower case / Upper case / Init Case

Binning

- Convert numeric to categorical data
- Pre-defined ranges are used to create bins and individual data records are classified based on this.
- New columns typically added to hold the binned data
- Binning is usually done when the continuous variable is used for classification.

Indicator variables
- Categorical variables are converted into Boolean data by creating indicator variables
- If the variable has n classes, then n-1 new variables are created to indicate the presence or absence of a value
- Each variable has 1/0 values
- The nth value is indicated by a 0 in all the indicator columns
- Indicator variables sometimes work better in predictions than their categorical counterparts
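
A sketch of how indicator variables can be built in base R with model.matrix(); the factor `colour` is made up for illustration. With three classes, two indicator columns are created, and in this sketch the first level is the one encoded as all zeros:

```r
colour <- factor(c("red", "green", "blue", "green"))
levels(colour)   # "blue" "green" "red": blue is the reference level

# the formula ~ colour yields an intercept plus n-1 indicator columns
indicators <- model.matrix(~ colour)
indicators

# drop the intercept column to keep only the indicators
indicators[, -1]
```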

Centering and Scaling
- Standardizes the range of values of a variable while maintaining its signal characteristics
- Makes comparison of two variables easier
- The values are “centered” by subtracting the mean value from them
- The centered values are “scaled” by dividing them by the standard deviation
- Many ML algorithms give better results with standardized values
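
A sketch of centering and scaling in base R, first by hand and then with scale(), on a made-up vector `heights`:

```r
heights <- c(165, 172, 180, 186.5)

# by hand: subtract the mean, then divide by the standard deviation
by.hand <- (heights - mean(heights)) / sd(heights)

# scale() does both steps and returns a one-column matrix
scaled <- scale(heights)

mean(by.hand)   # approximately 0
sd(by.hand)     # 1
all.equal(by.hand, as.numeric(scaled))   # TRUE
```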