
<br>
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>


### Prof. José Manuel Magallanes, PhD

* Full Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

* Visiting Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington
_____

# DATA FRAMES IN R and Python

In [None]:
%load_ext rpy2.ipython

**Data frames** are tabular containers of values: rows are cases, columns are variables, and each column can hold a different type of data. The most common analogy is a spreadsheet.

## 1. Creating

In [None]:
namesP=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
agesP=[32,33,28,30,29]
countryP=["China", "Senegal", "España", "Norway","Korea"]
educationP=["Bach", "Bach", "Master", "PhD","PhD"]

classroomP=dict(student=namesP,age=agesP,edu=educationP,country=countryP)

import pandas as pd

# our data frame from a dict of lists:
studentsP=pd.DataFrame(classroomP)
## see it:
studentsP


     student  age     edu  country
0       Qing   32    Bach    China
1  Françoise   33    Bach  Senegal
2       Raúl   28  Master   España
3      Bjork   30     PhD   Norway
4      Marie   29     PhD    Korea


In [None]:

%%R
namesR=c("Qing", "Françoise", "Raúl", "Bjork","Marie")
agesR=c(32,33,28,30,29)
countryR=c("China", "Senegal", "España", "Norway","Korea")
educationR=c("Bach", "Bach", "Master", "PhD","PhD")

classroomR=list(student=namesR,age=agesR,edu=educationR,country=countryR)

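# as.data.frame() on the list keeps each column's type
# (cbind would build a character matrix and turn the ages into text)
studentsR=as.data.frame(classroomR, stringsAsFactors=FALSE)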

studentsR



    student age    edu country
1      Qing  32   Bach   China
2 Françoise  33   Bach Senegal
3      Raúl  28 Master  España
4     Bjork  30    PhD  Norway
5     Marie  29    PhD   Korea


## 2. Accessing

In [None]:
# ":" means 'all'; "iloc" requests positions (indices)
# output is a Series
studentsP.iloc[:,0]

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [None]:
%%R
# leaving the row slot empty means 'all rows'
# output is a vector
studentsR[,1]

[1] "Qing"      "Françoise" "Raúl"      "Bjork"     "Marie"    


In [None]:
# indices in a list
# output is a dataframe
studentsP.iloc[:,[1,2]]

   age     edu
0   32    Bach
1   33    Bach
2   28  Master
3   30     PhD
4   29     PhD


In [None]:
%%R
# indices in a vector
# output is a dataframe
studentsR[,c(2,3)]

  age    edu
1  32   Bach
2  33   Bach
3  28 Master
4  30    PhD
5  29    PhD


In [None]:
# "loc" requires labels (not positions)
studentsP.loc[:,'student']

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [None]:
%%R
# also valid in R
studentsR[,'student']

[1] "Qing"      "Françoise" "Raúl"      "Bjork"     "Marie"    


In [None]:
studentsP.loc[:,['student','edu']]

     student     edu
0       Qing    Bach
1  Françoise    Bach
2       Raúl  Master
3      Bjork     PhD
4      Marie     PhD


In [None]:
%%R
studentsR[,c('student','edu')]

    student    edu
1      Qing   Bach
2 Françoise   Bach
3      Raúl Master
4     Bjork    PhD
5     Marie    PhD


In [None]:
# one Series in Pandas
# using '.'
studentsP.student # compare with studentsP[['student']], which returns a DataFrame

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object
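
The difference between selecting one column as a Series and as a one-column DataFrame can be checked directly. This is a minimal sketch; the small `df` below is rebuilt only so the snippet runs on its own, and the names `as_series` and `as_frame` are illustrative.

```python
import pandas as pd

# Toy data rebuilt here so the snippet is self-contained
df = pd.DataFrame({"student": ["Qing", "Françoise", "Raúl"],
                   "age": [32, 33, 28]})

as_series = df["student"]     # single label: a Series (same as df.student)
as_frame = df[["student"]]    # list of labels: a DataFrame with one column

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```

The dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.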

In [None]:
%%R
# one column in R
# using '$'
studentsR$student

[1] "Qing"      "Françoise" "Raúl"      "Bjork"     "Marie"    


In [None]:
# row with label 2, info about 'student'
studentsP.loc[2,'student']

'Raúl'

In [None]:
# row with position 2, info about 'student'
studentsP.iloc[2,0]

'Raúl'

In [None]:
%%R
studentsR[3,'student']

[1] "Raúl"


In [None]:
studentsP.loc[[2,4],['student','edu']]

   student     edu
2     Raúl  Master
4    Marie     PhD


In [None]:
%%R
studentsR[c(3,5),c('student','edu')]

  student    edu
3    Raúl Master
5   Marie    PhD


## 3. Replacing

In [None]:
studentsP.loc[2,'student']='Lito'
studentsP

     student  age     edu  country
0       Qing   32    Bach    China
1  Françoise   33    Bach  Senegal
2       Lito   28  Master   España
3      Bjork   30     PhD   Norway
4      Marie   29     PhD    Korea


In [None]:
%%R

studentsR[3,'student']='Lito'
studentsR

    student age    edu country
1      Qing  32   Bach   China
2 Françoise  33   Bach Senegal
3      Lito  28 Master  España
4     Bjork  30    PhD  Norway
5     Marie  29    PhD   Korea


In [None]:
studentsP.loc[[2,4],'age']=[32,31]
studentsP

     student  age     edu  country
0       Qing   32    Bach    China
1  Françoise   33    Bach  Senegal
2       Lito   32  Master   España
3      Bjork   30     PhD   Norway
4      Marie   31     PhD    Korea


In [None]:
%%R

studentsR[c(3,5),'age']=c(32,31)
studentsR

    student age    edu country
1      Qing  32   Bach   China
2 Françoise  33   Bach Senegal
3      Lito  32 Master  España
4     Bjork  30    PhD  Norway
5     Marie  31    PhD   Korea


## 4. Deleting

In [None]:
# make copy
studentsP_new=studentsP.copy()
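
Why `.copy()` matters in pandas can be seen in a short sketch. Plain assignment in Python only creates a second name for the same object, whereas R's assignment behaves like a copy (copy-on-modify). The names `original`, `alias` and `backup` are illustrative and not part of the notebook.

```python
import pandas as pd

original = pd.DataFrame({"student": ["Qing", "Marie"], "age": [32, 29]})

alias = original           # a second name for the very same object
backup = original.copy()   # an independent copy

# An in-place change to 'original' is visible through 'alias' but not 'backup'
original.drop(index=[0], inplace=True)

print(len(alias))    # 1
print(len(backup))   # 2
```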

In [None]:
%%R
# make copy
studentsR_new=studentsR

### Deleting rows


In [None]:
byeRows=[2,3]
studentsP_new.drop(index=byeRows,inplace=True) # inplace=True changes studentsP_new itself
#then
studentsP_new

     student  age   edu  country
0       Qing   32  Bach    China
1  Françoise   33  Bach  Senegal
4      Marie   31   PhD    Korea


As you can see, the row labels 2 and 3 disappeared, leaving a gap in the index (0, 1, 4). You can reset the index so the labels run consecutively again:

In [None]:
studentsP_new.reset_index(drop=True,inplace=True)
#then
studentsP_new

     student  age   edu  country
0       Qing   32  Bach    China
1  Françoise   33  Bach  Senegal
2      Marie   31   PhD    Korea
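
Dropping rows is also where `loc` (labels) and `iloc` (positions) stop agreeing. The sketch below rebuilds a comparable data frame to show this; `trimmed` and `renumbered` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"student": ["Qing", "Françoise", "Lito", "Bjork", "Marie"],
                   "age": [32, 33, 32, 30, 31]})

trimmed = df.drop(index=[2, 3])              # remaining labels: 0, 1, 4

by_label = trimmed.loc[4, "student"]         # row whose label is 4
by_position = trimmed.iloc[2, 0]             # third row, first column
print(by_label, by_position)

# trimmed.loc[2, "student"] would raise a KeyError: label 2 no longer exists

renumbered = trimmed.reset_index(drop=True)  # labels become 0, 1, 2
print(renumbered.loc[2, "student"])
```

After `reset_index(drop=True)` labels and positions coincide again, which is why resetting is convenient when later code mixes both styles.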


In [None]:
%%R
byeRows=c(3,4)
studentsR_new=studentsR_new[-byeRows,] # overwriting (NO 'inplace')
#then
studentsR_new

    student age  edu country
1      Qing  32 Bach   China
2 Françoise  33 Bach Senegal
5     Marie  31  PhD   Korea


In [None]:
%%R
#reset indexes
row.names(studentsR_new)=NULL
#then
studentsR_new

    student age  edu country
1      Qing  32 Bach   China
2 Françoise  33 Bach Senegal
3     Marie  31  PhD   Korea


### Deleting columns

In [None]:

byeColumns=['edu'] # you can delete more than one

# drop them in place
studentsP_new.drop(columns=byeColumns,inplace=True)
#then
studentsP_new

     student  age  country
0       Qing   32    China
1  Françoise   33  Senegal
2      Marie   31    Korea


In [None]:
%%R

byeColumns=c('edu') # this doesn't work: studentsR_new[,-byeColumns]
keepCols=setdiff(names(studentsR_new),byeColumns)
studentsR_new=studentsR_new[,keepCols]
#then
studentsR_new

    student age country
1      Qing  32   China
2 Françoise  33 Senegal
3     Marie  31   Korea


### Deleting cells

In [None]:
studentsP_new.loc[2,'country']=pd.NA
#then
studentsP_new


     student  age  country
0       Qing   32    China
1  Françoise   33  Senegal
2      Marie   31     <NA>
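
Once a cell holds a missing value, it is usually handy to locate it. A minimal sketch using `isna()` and `dropna()` follows; the data frame is rebuilt for the example and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"student": ["Qing", "Françoise", "Marie"],
                   "country": ["China", "Senegal", "Korea"]})
df.loc[2, "country"] = pd.NA                 # "delete" one cell

missing_mask = df["country"].isna()          # True where the value is missing
print(missing_mask.tolist())                 # [False, False, True]

# keep only the rows with a known country
complete = df.dropna(subset=["country"])
print(len(complete))                         # 2
```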


In [None]:
%%R

studentsR_new[3,'country']=NA
#then
studentsR_new

## 5. Inserting

In [None]:
#currently
studentsP

In [None]:
femaleP=[True,True,False,False,True]
studentsP1=studentsP.assign(female=femaleP)
#then
studentsP1

In [None]:
#another way
studentsP2=studentsP.copy()
studentsP2['female']=femaleP
#then
studentsP2

In [None]:
#yet another way
studentsP3=studentsP.copy()
studentsP3.loc[:,'female']=femaleP
studentsP3
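
One practical difference among these options is that `assign` returns a new data frame and leaves the original untouched, which is why it needs no `.copy()` first. A small sketch, with illustrative names `base` and `with_flag`:

```python
import pandas as pd

base = pd.DataFrame({"student": ["Qing", "Bjork"], "age": [32, 30]})

# assign() builds a new DataFrame; 'base' keeps its two columns
with_flag = base.assign(female=[True, False])

print(list(base.columns))        # ['student', 'age']
print(list(with_flag.columns))   # ['student', 'age', 'female']
```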

In [None]:
%%R
femaleR=c(T,T,F,F,T)
studentsR1=cbind(studentsR,female=femaleR)
studentsR1

In [None]:
%%R
studentsR2=studentsR
studentsR2$female=femaleR
studentsR2

In [None]:
%%R
studentsR3=studentsR
studentsR3[,'female']=femaleR
studentsR3

We can also insert rows by appending one data frame to another:

In [None]:
namesP1=["Qing", "Françoise", "Raúl"]
agesP1=[32,33,28]
countryP1=["China", "Senegal", "España"]

namesP2=["Bjork","Marie"]
agesP2=[30,29]
countryP2=["Norway","Korea"]

classroomP1=dict(student=namesP1,age=agesP1,country=countryP1)
classroomP2=dict(student=namesP2,age=agesP2,country=countryP2)

# named partP1/partP2 so that studentsP1 (with the 'female' column)
# stays available for the queries below
partP1=pd.DataFrame(classroomP1)
partP2=pd.DataFrame(classroomP2)

# inserting by appending
studentsP12=pd.concat([partP1,partP2])
## see it:
studentsP12
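
By default, `pd.concat` keeps the row labels of each piece, so the result can contain repeated labels. The sketch below compares that default with `ignore_index=True`; the names `first`, `second`, `kept` and `fresh` are illustrative.

```python
import pandas as pd

first = pd.DataFrame({"student": ["Qing", "Françoise", "Raúl"], "age": [32, 33, 28]})
second = pd.DataFrame({"student": ["Bjork", "Marie"], "age": [30, 29]})

kept = pd.concat([first, second])                       # original labels survive
fresh = pd.concat([first, second], ignore_index=True)   # labels renumbered

print(kept.index.tolist())    # [0, 1, 2, 0, 1]
print(fresh.index.tolist())   # [0, 1, 2, 3, 4]
```

With repeated labels, `kept.loc[0]` returns two rows instead of one, so renumbering is usually the safer choice when the old labels carry no meaning.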

In [None]:
%%R

namesR1=c("Qing", "Françoise", "Raúl")
agesR1=c(32,33,28)
countryR1=c("China", "Senegal", "España")

namesR2=c("Bjork","Marie")
agesR2=c(30,29)
countryR2=c("Norway","Korea")

classroomR1=list(student=namesR1,age=agesR1,country=countryR1)
classroomR2=list(student=namesR2,age=agesR2,country=countryR2)

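# named partR1/partR2 so that studentsR1 (with the 'female' column)
# stays available for the queries below;
# as.data.frame() on the list keeps 'age' numeric
partR1=as.data.frame(classroomR1, stringsAsFactors=FALSE)
partR2=as.data.frame(classroomR2, stringsAsFactors=FALSE)

# inserting by appending
studentsR12=rbind(partR1,partR2)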
## see it:
studentsR12

## 6. Other basic operations

In [None]:
# type of data structure: list? tuple? data frame?
type(studentsP)

In [None]:
%%R
class(studentsR)

In [None]:
# type of data in data frame column
studentsP.info()
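
`info()` prints its report and returns nothing, so it cannot be used in later code. For programmatic checks, the `dtypes` attribute and the helpers in `pd.api.types` are one option. This is a sketch on a small rebuilt data frame with illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({"student": ["Qing", "Françoise"], "age": [32, 33]})

summary = df.info()           # prints the report
print(summary)                # None: nothing is returned

column_types = df.dtypes      # a Series mapping column name to dtype
print(column_types)

age_is_numeric = pd.api.types.is_numeric_dtype(df["age"])
student_is_numeric = pd.api.types.is_numeric_dtype(df["student"])
print(age_is_numeric, student_is_numeric)   # True False
```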

In [None]:
%%R
# details of data frame

str(studentsR)

In [None]:
# number of rows and columns
studentsP.shape

In [None]:
%%R
dim(studentsR)

In [None]:
# number of rows:
len(studentsP)

In [None]:
%%R

# nrow() counts rows; length() would count columns
nrow(studentsR)

In [None]:
# first rows
studentsP.head(2) # compare with: studentsP.tail(2)

In [None]:
%%R
head(studentsR,2) # compare with: tail(studentsR,2)

In [None]:
# names of the columns
studentsP.columns

In [None]:
%%R
names(studentsR)

## 7. Queries

In [None]:
studentsP1.iloc[0,1]=33
studentsP1

In [None]:
#who is the oldest?

studentsP1[studentsP1.age==max(studentsP1.age)]

In [None]:
studentsP1[studentsP1.age==studentsP1.age.max()]['student']
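
When two students share the maximum age, the way the query is written decides how many rows come back. The sketch below contrasts `idxmax()`, which (like R's `which.max`) reports only the first match, with a boolean comparison that keeps every tie. The data and names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"student": ["Qing", "Françoise", "Lito"],
                   "age": [33, 33, 32]})

# idxmax() gives the label of the first row holding the maximum
first_oldest = df.loc[df["age"].idxmax(), "student"]

# comparing against the maximum keeps all tied rows
all_oldest = df.loc[df["age"] == df["age"].max(), "student"].tolist()

print(first_oldest)   # Qing
print(all_oldest)     # ['Qing', 'Françoise']
```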

In [None]:
%%R
studentsR1[1,2]=33
studentsR1


In [None]:
%%R

#who is the oldest?

studentsR1[which.max(studentsR1$age),]

In [None]:
%%R
studentsR1[studentsR1$age==max(studentsR1$age),]

In [None]:
%%R
studentsR1[studentsR1$age==max(studentsR1$age),'student']

In [None]:
#who has PhD?

studentsP1[studentsP1.edu=='PhD']

In [None]:
%%R
studentsR1[studentsR1$edu=='PhD',]

In [None]:
#who has PhD or Master?
studentsP1[studentsP1.edu.isin(['PhD','Master'])]

In [None]:
%%R
studentsR1[studentsR1$edu %in% c('PhD','Master'),]

In [None]:
#who does not have a PhD or Master?
studentsP1[~studentsP1.edu.isin(['PhD','Master'])]

In [None]:
%%R
studentsR1[!(studentsR1$edu %in% c('PhD','Master')),]

In [None]:
#the youngest female
studentsP1[studentsP1.female]

In [None]:
studentsP1[studentsP1.female].sort_values(by=['age'],ascending=True).iloc[0,0]

In [None]:
%%R
studentsR1[studentsR1$female,]

In [None]:
%%R
# filter first, then order the filtered rows by age and keep the first one
femR=studentsR1[studentsR1$female,]
head(femR[order(femR$age),],1)

In [None]:
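# parentheses are required: '&' binds tighter than '=='
# this comes back empty when the youngest person overall is not female
studentsP1[studentsP1.female & (studentsP1.age==studentsP1.age.min())]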

In [None]:
femdf=studentsP1[studentsP1.female]
femdf

In [None]:
femdf[femdf.age==femdf.age.min()]
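
The same query can be written without an intermediate data frame by computing the minimum age among females first and then combining both conditions. This is a sketch on rebuilt data; `naive`, `youngest_age` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"student": ["Qing", "Françoise", "Lito", "Bjork", "Marie"],
                   "age": [33, 33, 32, 30, 31],
                   "female": [True, True, False, False, True]})

# Using the overall minimum fails here: the youngest person is not female
naive = df[df["female"] & (df["age"] == df["age"].min())]
print(naive.empty)                      # True

# Minimum age computed among females only
youngest_age = df.loc[df["female"], "age"].min()
result = df[df["female"] & (df["age"] == youngest_age)]
print(result["student"].tolist())       # ['Marie']
```

Each comparison sits inside its own parentheses because `&` has higher precedence than `==` in Python.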