# Base R (Some Data Operations)

In this notebook, I record some operations on the iris data set in base R and compare them with their Python pandas equivalents. We also look at the data structure and values that result when a variable is passed from R to Python.

<a name="Outline"></a>Below are the operations in R.
- [Print the structure of the data set.](#PrintStruct)
- [Print the column names and row names.](#PrintNames)
- [Print 20 rows, selected randomly.](#Print20Rows)
- [Create a matrix 'M' using the first 4 columns and all rows.](#CreateM)
- [Select the second column of M, first as a vector, then as a matrix.](#Select2ndColumn)
- [Replace all the entries of M with zeroes.](#SetEntriesZero)
- [Replace M with the value 0.](#SetZero)
- [Select the Species column from iris as a dataframe and then as a factor, storing the factor in a variable 'v'. Convert the column to a vector and store it in a variable 'w'. What is the difference between w and v?](#Factor_vs_Vector)
- [For 'v' and 'w' above, try to add an element 'newspecies'. Describe what R did.](#AddElement)
- [Add a column to iris that has the value of sepal width/sepal length and name it 'Sepal.Ratio'.](#AddColumn)

In [24]:
import numpy as np
import numpy.random as random
import pandas as pd

%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


### <a name="PrintStruct"></a>Print the structure of the data set.
[[Back to Outline]](#Outline)

In [4]:
%%R
data(iris)
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


In [13]:
##### Python #####
iris = %R iris
iris.describe()

       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


### <a name="PrintNames"></a>Print the column names and row names.
[[Back to Outline]](#Outline)

In [15]:
%%R
cat("",
    "=== column names ===\n", colnames(iris), "\n\n",
    "=== row names ===\n",    rownames(iris), "\n")

 === column names ===
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species 

 === row names ===
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 


In [20]:
##### Python #####
print("=== column names ===")
print(iris.columns)
print("\n=== row names ===")
print(iris.index)

=== column names ===
Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')

=== row names ===
Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            141, 142, 143, 144, 145, 146, 147, 148, 149, 150],
           dtype='int64', length=150)
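
The index printed above starts at 1 because it carries over R's row names, so label-based and position-based selection are off by one from each other. A minimal sketch of that difference, using a small hypothetical stand-in frame (`df`) rather than the real `iris` passed from R:

```python
import pandas as pd

# Hypothetical stand-in for the frame passed from R: the index starts at 1,
# mirroring R's default row names.
df = pd.DataFrame(
    {"Sepal.Length": [5.1, 4.9, 4.7], "Species": ["setosa"] * 3},
    index=[1, 2, 3],
)

first_by_label = df.loc[1]      # label-based: the row whose index label is 1
first_by_position = df.iloc[0]  # position-based: the first row

# Both selections refer to the same row.
same_row = first_by_label.equals(first_by_position)
print(same_row)

# Resetting the index gives the usual 0-based pandas labels.
df0 = df.reset_index(drop=True)
print(list(df0.index))
```

Whether to keep the 1-based labels or reset them is a matter of taste; keeping them makes rows easier to cross-check against the R output.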


###  <a name="Print20Rows"></a>Print 20 rows, selected randomly.
[[Back to Outline]](#Outline)

In [21]:
%%R
iris[sample(1:nrow(iris), 20),]

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
9            4.4         2.9          1.4         0.2     setosa
17           5.4         3.9          1.3         0.4     setosa
142          6.9         3.1          5.1         2.3  virginica
85           5.4         3.0          4.5         1.5 versicolor
8            5.0         3.4          1.5         0.2     setosa
18           5.1         3.5          1.4         0.3     setosa
86           6.0         3.4          4.5         1.6 versicolor
1            5.1         3.5          1.4         0.2     setosa
109          6.7         2.5          5.8         1.8  virginica
45           5.1         3.8          1.9         0.4     setosa
117          6.5         3.0          5.5         1.8  virginica
70           5.6         2.5          3.9         1.1 versicolor
116          6.4         3.2          5.3         2.3  virginica
102          5.8         2.7          5.1         1.9  virginica
68           5.8         

In [36]:
##### Python #####
print("===== Select a row ===== ")
print(iris.iloc[0])

print("\n===== Select a row (drop = False) =====")
print(iris.iloc[[0]])

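print("\n===== Print 20 rows, selected randomly =====")
# Note: range(20) restricts the draw to the first 20 rows, and random.choice
# samples with replacement by default, so rows can repeat.
# iris.sample(n=20) would draw 20 distinct rows from the whole data set.
print(iris.iloc[random.choice(range(20),size=20)])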


===== Select a row ===== 
Sepal.Length       5.1
Sepal.Width        3.5
Petal.Length       1.4
Petal.Width        0.2
Species         setosa
Name: 1, dtype: object

===== Select a row (drop = False) =====
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa

===== Print 20 rows, selected randomly =====
    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
11           5.4          3.7           1.5          0.2  setosa
7            4.6          3.4           1.4          0.3  setosa
2            4.9          3.0           1.4          0.2  setosa
16           5.7          4.4           1.5          0.4  setosa
7            4.6          3.4           1.4          0.3  setosa
18           5.1          3.5           1.4          0.3  setosa
13           4.8          3.0           1.4          0.1  setosa
18           5.1          3.5           1.4          0.3  setosa
20           5.1          3.8       
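
The R call `sample(1:nrow(iris), 20)` draws 20 distinct rows from all 150. A rough pandas counterpart is `DataFrame.sample`, sketched here on a hypothetical 150-row stand-in frame (`df`); the fixed `random_state` is only there to make the example reproducible:

```python
import pandas as pd

# Hypothetical 150-row stand-in for iris, with a 1-based index as in R.
df = pd.DataFrame({"value": range(150)}, index=range(1, 151))

# Draw 20 distinct rows from the whole frame (sampling without replacement
# is the default for DataFrame.sample).
picked = df.sample(n=20, random_state=0)

print(len(picked))
print(picked.index.is_unique)
```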

###  <a name="CreateM"></a>Create a matrix 'M' using the first 4 columns and all rows.
[[Back to Outline]](#Outline)

In [38]:
%%R
# Note: iris[, 1:4] is still a data.frame; as.matrix(iris[, 1:4]) would give a true matrix.
M <- iris[, 1:4]
head(M)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4


In [41]:
##### Python #####
M = iris.iloc[:, range(4)]
M.head(6)

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
1           5.1          3.5           1.4          0.2
2           4.9          3.0           1.4          0.2
3           4.7          3.2           1.3          0.2
4           4.6          3.1           1.5          0.2
5           5.0          3.6           1.4          0.2
6           5.4          3.9           1.7          0.4


###  <a name="Select2ndColumn"></a>Select the second column of M, first as a vector, then as a matrix.
[[Back to Outline]](#Outline)

In [51]:
%%R
tmp <- M[, 2]
print(head(tmp))
cat("===================\n")
tmp <- M[, 2, drop=FALSE]
print(head(tmp))

[1] 3.5 3.0 3.2 3.1 3.6 3.9
  Sepal.Width
1         3.5
2         3.0
3         3.2
4         3.1
5         3.6
6         3.9


In [65]:
##### Python ######
tmp = M.iloc[:, 1]
print(tmp.head())
print("Data Type:", type(tmp))
print("\n===================\n")
tmp = M.iloc[:, [1]]
print(tmp.head())
print("Data Type:", type(tmp))

1    3.5
2    3.0
3    3.2
4    3.1
5    3.6
Name: Sepal.Width, dtype: float64
Data Type: <class 'pandas.core.series.Series'>


   Sepal.Width
1          3.5
2          3.0
3          3.2
4          3.1
5          3.6
Data Type: <class 'pandas.core.frame.DataFrame'>


###  <a name="SetEntriesZero"></a>Replace all the entries of M with zeroes.
[[Back to Outline]](#Outline)

In [66]:
%%R
M[] <- 0
head(M)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1            0           0            0           0
2            0           0            0           0
3            0           0            0           0
4            0           0            0           0
5            0           0            0           0
6            0           0            0           0


In [102]:
##### Python #####
# Copy the slice first; assigning into a bare slice of iris raises SettingWithCopyWarning.
M = iris.iloc[:, range(4)].copy()
M.loc[:,:] = 0
M.head()


   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
1           0.0          0.0           0.0          0.0
2           0.0          0.0           0.0          0.0
3           0.0          0.0           0.0          0.0
4           0.0          0.0           0.0          0.0
5           0.0          0.0           0.0          0.0
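
In R, `M[] <- 0` modifies only `M`, because `M` is an independent copy of the selected columns. To get the same guarantee in pandas, one option is to take an explicit `.copy()` before assigning. A minimal sketch on a hypothetical small frame (`df`), not the real `iris`:

```python
import pandas as pd

# Hypothetical stand-in frame with two numeric columns and one label column.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": ["x", "y"]})

# Explicit copy: M owns its data, so writing to it does not touch df.
M = df.iloc[:, :2].copy()
M.loc[:, :] = 0

print(M)
print(df)  # df keeps its original values
```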


###  <a name="SetZero"></a>Replace M with the value 0.
[[Back to Outline]](#Outline)

In [103]:
%%R
M <- 0
print(M)

[1] 0


In [104]:
##### Python #####
M = 0
print(M)

0


###  <a name="Factor_vs_Vector"></a>Select the Species column from iris as a dataframe and then as a factor, storing the factor in a variable 'v'. Convert the column to a vector and store it in a variable 'w'. What is the difference between w and v?
[[Back to Outline]](#Outline)

In [106]:
%%R
# select Species as a dataframe
head(iris[, "Species", drop = FALSE])

  Species
1  setosa
2  setosa
3  setosa
4  setosa
5  setosa
6  setosa


In [107]:
%%R
# select Species as a factor
head(iris$Species)

[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica


In [115]:
%%R
# storing the factor in a variable 'v'. 
# Convert the column to a vector and store it in a variable 'w'. 
# What is the difference between w and v?
v <- iris$Species
w <- as.vector(iris$Species)

# Differences between factor and vector
cat("1st difference: vector does not contain levels\n")
print(levels(v))
print(levels(w))
cat("===================================\n")
cat("2nd difference: factor stores 1,2,... to specify elements in level\n")
str(v)
str(w)

1st difference: vector does not contain levels
[1] "setosa"     "versicolor" "virginica" 
NULL
2nd difference: factor stores 1,2,... to specify elements in level
 Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 chr [1:150] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" ...


In [132]:
##### Python #####
v = iris["Species"].astype("category")
print("===== Data Type =====")
print(type(v))
print("\n===== Categories =====")
print(v.cat.categories)  # a Series exposes its categories through the .cat accessor
print("\n===== Describe =====")
print(v.describe())
print("\n===== Values =====")
print(v.head())

===== Data Type =====
<class 'pandas.core.series.Series'>

===== Describe =====
count           150
unique            3
top       virginica
freq             50
Name: Species, dtype: object

===== Values =====
1    setosa
2    setosa
3    setosa
4    setosa
5    setosa
Name: Species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]


In [159]:
##### Python #####
v = pd.Categorical(v)
print("===== Data Type =====")
print(type(v))
print("\n===== Categories =====")
print(v.categories)
print("\n===== Describe =====")
print(v.describe())
print("\n===== Values =====")
print(v)

===== Data Type =====
<class 'pandas.core.categorical.Categorical'>

===== Categories =====
Index(['setosa', 'versicolor', 'virginica'], dtype='object')

===== Describe =====
            counts     freqs
categories                  
setosa          50  0.333333
versicolor      50  0.333333
virginica       50  0.333333

===== Values =====
[setosa, setosa, setosa, setosa, setosa, ..., virginica, virginica, virginica, virginica, virginica]
Length: 150
Categories (3, object): [setosa, versicolor, virginica]


In [134]:
##### Python #####
w = iris["Species"]
print("===== Data Type =====")
print(type(w))
print("\n===== Describe =====")
print(w.describe())
print("\n===== Values =====")
print(w.head())

===== Data Type =====
<class 'pandas.core.series.Series'>

===== Describe =====
count           150
unique            3
top       virginica
freq             50
Name: Species, dtype: object

===== Values =====
1    setosa
2    setosa
3    setosa
4    setosa
5    setosa
Name: Species, dtype: object


In [185]:
##### More about Python Category ######
v = pd.Categorical(iris["Species"])
print(v.codes)
print("=========================")
print(np.asarray(v)[0:6])  # Categorical.get_values() was removed in pandas 1.0
print("=========================")
print(v.value_counts())

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa']
setosa        50
versicolor    50
virginica     50
dtype: int64
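
The codes printed above are 0-based positions into `v.categories`, whereas the integers R stores in a factor are 1-based. A small sketch of that correspondence, using a short hypothetical list of labels instead of the full Species column:

```python
import pandas as pd

# Hypothetical short label list standing in for iris["Species"].
species = ["setosa", "versicolor", "virginica", "setosa"]
v = pd.Categorical(species)

# pandas codes are 0-based positions into v.categories ...
codes = v.codes.tolist()
# ... while R's as.integer(factor) is 1-based, so add 1 to compare.
r_style_codes = [c + 1 for c in codes]

# Rebuild the labels from the categories and the codes.
rebuilt = pd.Categorical.from_codes(v.codes, categories=v.categories)

print(codes)
print(r_style_codes)
print(list(rebuilt))
```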


###  <a name="AddElement"></a>For 'v' and 'w' above, try to add an element 'newspecies'. Describe what R did.
[[Back to Outline]](#Outline)

In [141]:
%%R
# initialization
v <- iris$Species
w <- as.vector(iris$Species)

# Add newspecies into a factor
v2 <- unlist(list(factor("newspecies"), v), use.names=FALSE)

# Add newspecies into a vector
w2 <- c("newspecies", w)

cat("===== Factor =====\n")
print(head(v2))
cat("\n===== Vector =====\n")
print(head(w2))

===== Factor =====
[1] newspecies setosa     setosa     setosa     setosa     setosa    
Levels: newspecies setosa versicolor virginica

===== Vector =====
[1] "newspecies" "setosa"     "setosa"     "setosa"     "setosa"    
[6] "setosa"    


In [178]:
##### Python #####
v = pd.Categorical(iris["Species"])
w = iris["Species"]

In [186]:
w2 = pd.concat([w, pd.Series("x")])  # Series.append was removed in pandas 2.0
w2.tail()

147    virginica
148    virginica
149    virginica
150    virginica
0              x
dtype: object

I have not found a direct way to append an element to a pandas Categorical.
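
One possible workaround, offered as a sketch rather than a definitive answer: `pandas.api.types.union_categoricals` combines several categoricals into a new one whose categories are the union of the inputs, which resembles what the R `unlist(list(...))` call did for the factor. The short label list below is a hypothetical stand-in for the Species column. `Categorical.add_categories` is shown as well; it only registers a new category and does not add an element.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical short stand-in for pd.Categorical(iris["Species"]).
v = pd.Categorical(["setosa", "versicolor", "virginica"])

# Prepend a new element by combining two categoricals; the resulting
# categories are the union of both inputs, in order of appearance.
v2 = union_categoricals([pd.Categorical(["newspecies"]), v])
print(list(v2))
print(list(v2.categories))

# add_categories only extends the set of allowed categories;
# the values themselves are unchanged.
v3 = v.add_categories("newspecies")
print(list(v3.categories))
```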

###  <a name="AddColumn"></a>Add a column to iris that has the value of sepal width/sepal length and name it 'Sepal.Ratio'
[[Back to Outline]](#Outline)

In [187]:
%%R
data(iris)
iris$Sepal.Ratio <- iris$Sepal.Width / iris$Sepal.Length
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
1          5.1         3.5          1.4         0.2  setosa   0.6862745
2          4.9         3.0          1.4         0.2  setosa   0.6122449
3          4.7         3.2          1.3         0.2  setosa   0.6808511
4          4.6         3.1          1.5         0.2  setosa   0.6739130
5          5.0         3.6          1.4         0.2  setosa   0.7200000
6          5.4         3.9          1.7         0.4  setosa   0.7222222


In [190]:
##### Python #####
iris = %R iris
iris["Sepal.Ratio"] = iris["Sepal.Width"] / iris["Sepal.Length"]
iris.head()

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  Sepal.Ratio
1           5.1          3.5           1.4          0.2  setosa     0.686275
2           4.9          3.0           1.4          0.2  setosa     0.612245
3           4.7          3.2           1.3          0.2  setosa     0.680851
4           4.6          3.1           1.5          0.2  setosa     0.673913
5           5.0          3.6           1.4          0.2  setosa     0.720000
