**FOR3004 Data Management Tutorial**

This tutorial shows how to do the following tasks in both R and Python:

(A) Addition and subtraction

(B) Basic ways to store data ("data types") 

(C) How to read a csv file into the computer

(D) How to write a function to calculate the mean and standard deviation for you

In [1]:
%%capture
#Set up the Colab Notebook

%load_ext rpy2.ipython 
%pip install pandas 
%pip install numpy

In [2]:
import numpy as np
import pandas as pd 

**(A) Addition and subtraction**

In [30]:
%%R
# Here is how to add in R

print(1+1)
print(1+2)
print(1+3)
print(11111111111 + 11111111111)

[1] 2
[1] 3
[1] 4
[1] 22222222222


In [4]:
# Here is how to add in Python 

print(1+1)
print(1+2)
print(1+3)
print(11111111111 + 11111111111)

2
3
4
22222222222


Very similar...

In [5]:
%%R
# Here is how to subtract in R

print(1-1)
print(1-2)
print(1-3)
print(1037-1019)

[1] 0
[1] -1
[1] -2
[1] 18


In [6]:
# Here is how to subtract in Python 

print(1-1)
print(1-2)
print(1-3)
print(1037-1019)

0
-1
-2
18


Again, very similar!

**(B) Basic ways to store data ("data types")**

Both R and Python store data using specific "types". Humans do this too: when I count money, I know I am working with numbers, and when I read a book, I am working with sentences (text). You need to care about this because the computer often needs to know which data type it is working with to complete a task correctly.
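As a small illustration of why the type matters, the Python sketch below applies `+` to a number and then to the same digits stored as text. The variable names are made up for this example.

```python
# The same digits stored as two different types
count_as_number = 19        # integer
count_as_text = '19'        # string (text)

# "+" adds numbers but joins (concatenates) text
print(count_as_number + 1)  # 20
print(count_as_text + '1')  # 191

# Convert text to a number before doing arithmetic on it
converted = int(count_as_text)
print(converted + 1)        # 20
```

The same symbol gives different results depending on the type, which is why it helps to check the type of a variable when a result looks odd.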

The major ones you might encounter in R are: 

Characters = text 

Numeric - Double = decimal number 

Numeric - Integer = non-decimal number 

Logical = TRUE or FALSE

Vector = an ordered collection of values of the same type

Dataframe = table 

The major ones in Python are: 

Strings = text 

Integers = non-decimal numbers

Floats = decimal numbers

Boolean = True or False

List = an ordered collection of items (the items can be of different types)

Dictionary = a way of storing information with keys and values (a word in a real dictionary is a "key" and its description is the value of the key)

Pandas Dataframe = table  

Let's go through some examples with these. 


In [7]:
%%R
# Data types in R

print('Example of Character')
tree_species <- 'black_spruce' #Create a variable called "tree_species"
print(tree_species)
print(typeof(tree_species)) #Print its data type

print('Example of Numeric - Integer')
num_trees <- as.integer(19) 
print(num_trees)
print(typeof(num_trees))

print('Example of Numeric - Double')
av_num_trees_in_stand <- 19.2
print(av_num_trees_in_stand)
print(typeof(av_num_trees_in_stand))

print('Example of Vector')
stand_locations <- c('Timmins','Ottawa','Port Hope','Englehart','Hearst')
print(stand_locations)
print('Example of Logical')
print(is.vector(stand_locations)) #Is it a vector or not? 

print('Example of Dataframe')
df <- data.frame (tree_growth_trial_location  = c('Timmins','Ottawa',
                                                  'Port Hope','Englehart',
                                                  'Hearst'), height = 
                  c(1.29,1.86,2.1,1.16,1.99))
print(df)
print('Example of Logical')
print(is.data.frame(df)) #Is it a dataframe or not? 

[1] "Example of Character"
[1] "black_spruce"
[1] "character"
[1] "Example of Numeric - Integer"
[1] 19
[1] "integer"
[1] "Example of Numeric - Double"
[1] 19.2
[1] "double"
[1] "Example of Vector"
[1] "Timmins"   "Ottawa"    "Port Hope" "Englehart" "Hearst"   
[1] "Example of Logical"
[1] TRUE
[1] "Example of Dataframe"
  tree_growth_trial_location height
1                    Timmins   1.29
2                     Ottawa   1.86
3                  Port Hope   2.10
4                  Englehart   1.16
5                     Hearst   1.99
[1] "Example of Logical"
[1] TRUE


In [8]:
# Data types in Python 
print('Example of String')
tree_species = 'black_spruce' #Create a variable called "tree_species"
print(tree_species)
print(type(tree_species)) #Print its data type

print('Example of Integer')
num_trees = 19
print(num_trees)
print(type(num_trees))

print('Example of Float')
av_num_trees_in_stand = 19.2
print(av_num_trees_in_stand)
print(type(av_num_trees_in_stand))

print('Example of List')
stand_locations = ['Timmins','Ottawa','Port Hope','Englehart','Hearst']
print(stand_locations)
print(type(stand_locations)) 

print('Example of Dictionary')
stand_locations = ['Timmins','Ottawa','Port Hope','Englehart','Hearst']
heights = [1.29,1.86,2.1,1.16,1.99]  
trial_dictionary = {} 
for loc,height in zip(stand_locations,heights): 
  trial_dictionary[loc] = height
print(trial_dictionary)
print(type(trial_dictionary)) 

print('Example of Pandas Dataframe')
df = pd.DataFrame()
df['Tree Growth Trial Location'] = stand_locations
df['Height'] = [1.29,1.86,2.1,1.16,1.99] 
print(df)
print(type(df)) 

print('Example of Boolean')
print('Timmins' in trial_dictionary.keys()) #Is Timmins one of 
                                            #the dictionary keys? 


Example of String
black_spruce
<class 'str'>
Example of Integer
19
<class 'int'>
Example of Float
19.2
<class 'float'>
Example of List
['Timmins', 'Ottawa', 'Port Hope', 'Englehart', 'Hearst']
<class 'list'>
Example of Dictionary
{'Timmins': 1.29, 'Ottawa': 1.86, 'Port Hope': 2.1, 'Englehart': 1.16, 'Hearst': 1.99}
<class 'dict'>
Example of Pandas Dataframe
  Tree Growth Trial Location  Height
0                    Timmins    1.29
1                     Ottawa    1.86
2                  Port Hope    2.10
3                  Englehart    1.16
4                     Hearst    1.99
<class 'pandas.core.frame.DataFrame'>
Example of Boolean
True
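The dictionary loop in the cell above can be written more compactly. The sketch below is one alternative, not the only way: it uses Python's built-in `dict()` and `zip()` to pair each location with its height, then looks up a single value by its key. The key 'Sudbury' is made up to show a lookup that fails.

```python
stand_locations = ['Timmins', 'Ottawa', 'Port Hope', 'Englehart', 'Hearst']
heights = [1.29, 1.86, 2.1, 1.16, 1.99]

# zip() pairs the first location with the first height, and so on;
# dict() turns those pairs into keys and values
trial_dictionary = dict(zip(stand_locations, heights))

# Look up one value by its key
print(trial_dictionary['Timmins'])

# Check the type of the whole dictionary
print(type(trial_dictionary))

# Ask whether a key is present (the answer is a Boolean)
print('Sudbury' in trial_dictionary)
```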


**(C) How to read a csv file into the computer** 

In [9]:
%%capture
# Now, we need to get our data. Our data is located on Quercus under "Python 
# Tutorial", in a text file called "NFDB_point_20210916.txt". This is the 
# text file of the Canadian National Forest Fire Database. Please download this
# data and then upload it to your Google Drive in a folder called
# "FOR3004_Python_Tutorial". 

# First, we need to give permission for Google Colab to access our data. You 
# will need to enable all the permissions (there are 8). 

from google.colab import drive
drive.mount('/content/drive')

In [10]:
%%capture
# Now, grab the data from Google Drive. 

%cd /content/drive/MyDrive/FOR3004_Python_Tutorial/


In [11]:
%%R
# Read the file using R
dirname <- '/content/drive/MyDrive/FOR3004_Python_Tutorial/'
fn <- 'NFDB_point_20210916.txt' 
dirfile <- paste(dirname,fn,sep = "")
fires <- read.csv(file = dirfile)
head(fires)

  FID SRC_AGENCY     FIRE_ID FIRENAME LATITUDE LONGITUDE YEAR MONTH DAY
1   0         BC 1953-G00041            59.963  -128.172 1953     5  26
2   1         BC 1950-R00028            59.318  -132.172 1950     6  22
3   2         BC 1950-G00026            59.876  -131.922 1950     6   4
4   3         BC 1951-R00097            59.760  -132.808 1951     7  15
5   4         BC 1952-G00116            59.434  -126.172 1952     6  12
6   5         BC 1951-R00100            59.963  -136.502 1951     8   1
             REP_DATE ATTK_DATE OUT_DATE    DECADE SIZE_HA CAUSE PROTZONE
1 1953-05-26 00:00:00                    1950-1959     8.0     H         
2 1950-06-22 00:00:00                    1950-1959     8.0     L         
3 1950-06-04 00:00:00                    1950-1959 12949.9     H         
4 1951-07-15 00:00:00                    1950-1959   241.1     H         
5 1952-06-12 00:00:00                    1950-1959     1.2     H         
6 1951-08-01 00:00:00                    1950-1959  

In [15]:
# Now in Python 

dirname = '/content/drive/MyDrive/FOR3004_Python_Tutorial/'
fn = 'NFDB_point_20210916.txt' 

fires_df = pd.read_csv(dirname+fn, delimiter = ",",low_memory=False)
print(fires_df.head(10))

   FID SRC_AGENCY      FIRE_ID  ... ECOZ_REF          ECOZ_NAME            ECOZ_NOM
0    0         BC  1953-G00041  ...       12  Boreal Cordillera  CordillCre boreale
1    1         BC  1950-R00028  ...       12  Boreal Cordillera  CordillCre boreale
2    2         BC  1950-G00026  ...       12  Boreal Cordillera  CordillCre boreale
3    3         BC  1951-R00097  ...       12  Boreal Cordillera  CordillCre boreale
4    4         BC  1952-G00116  ...       12  Boreal Cordillera  CordillCre boreale
5    5         BC  1951-R00100  ...       12  Boreal Cordillera  CordillCre boreale
6    6         BC  1952-G00211  ...       12  Boreal Cordillera  CordillCre boreale
7    7         BC  1950-G00035  ...        4        Taiga Plain   Taiga des plaines
8    8         BC  1950-G00039  ...        4        Taiga Plain   Taiga des plaines
9    9         BC  1953-G00043  ...        4        Taiga Plain   Taiga des plaines

[10 rows x 27 columns]
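To try `pd.read_csv` without the Drive file, the sketch below reads a few rows from a text string instead. The rows are invented for illustration (only the column names are borrowed from the database), and `io.StringIO` simply lets pandas treat the string as if it were a file.

```python
import io
import pandas as pd

# A tiny made-up csv: the first line is the header, then one row per fire
csv_text = """FIRE_ID,YEAR,SIZE_HA,CAUSE
A001,1950,8.0,H
A002,1951,,L
A003,1952,1.2,H
"""

# StringIO makes the string behave like an open file
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.shape)    # (number of rows, number of columns)
print(example_df.dtypes)   # data type pandas chose for each column
print(example_df['SIZE_HA'].isna().sum())  # how many sizes are missing
```

The empty SIZE_HA field in the second row is read as a missing value (NaN), which matters when calculating summary statistics.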


**(D) How to write a function to calculate the mean and standard deviation for you**

Functions are reusable blocks of code: you write the steps once, then run them on whatever data you pass in.

In [23]:
%%R
# Calculate the mean size of fires in R using the CNFDB

dirname <- '/content/drive/MyDrive/FOR3004_Python_Tutorial/'
fn <- 'NFDB_point_20210916.txt' 
dirfile <- paste(dirname,fn,sep = "")
fires <- read.csv(file = dirfile)

fire_sizes <- fires[c('SIZE_HA')]
no_nan <- fire_sizes[!is.na(fire_sizes$SIZE_HA),]
mean_size <- mean(no_nan)
print(mean_size)

[1] 311.571


In [24]:
# Now in Python 

fire_sizes = fires_df['SIZE_HA'] 
mean_size = np.nanmean(fire_sizes)
print(mean_size)

311.5710327143166
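`np.nanmean` is used above because it skips missing values (NaN, "not a number"). The sketch below, using a few made-up sizes, compares it with the ordinary `np.mean` when one value is missing.

```python
import numpy as np

# Made-up fire sizes in hectares; np.nan marks a missing value
sizes = np.array([8.0, np.nan, 1.2, 241.1])

plain_mean = np.mean(sizes)        # a single NaN makes the result NaN
skip_nan_mean = np.nanmean(sizes)  # NaN is ignored; mean of the other 3

print(plain_mean)
print(skip_nan_mean)
```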


Okay, but what if we want to be able to do this automatically for other data? To do this, we must create a function. 

In [25]:
%%R
# Mean function in R

dirname <- '/content/drive/MyDrive/FOR3004_Python_Tutorial/'
fn <- 'NFDB_point_20210916.txt' 
dirfile <- paste(dirname,fn,sep = "")
fires <- read.csv(file = dirfile)

calc_mean <- function(data) {
  no_nan <- data[!is.na(data)] #Drop missing values from the input
  mean_value <- mean(no_nan)
  return(mean_value)
}

fire_sizes <- fires[c('SIZE_HA')]
mean_from_function <- calc_mean(fire_sizes$SIZE_HA)
print(mean_from_function)

[1] 311.571


In [26]:
# Mean function in Python 

def calc_mean(data): 
  mean_value = np.nanmean(data)
  return mean_value

fire_sizes = fires_df['SIZE_HA'] 
mean_from_function = calc_mean(fire_sizes)
print(mean_from_function)

311.5710327143166
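Because `calc_mean` takes its data as an argument, the same function works on any list or column of numbers. As a sketch, here it is applied to the tree heights from section (B), and then to the same heights with one value missing (the gap is made up for this example).

```python
import numpy as np

def calc_mean(data):
    # np.nanmean ignores missing values (NaN)
    mean_value = np.nanmean(data)
    return mean_value

heights = [1.29, 1.86, 2.1, 1.16, 1.99]
mean_height = calc_mean(heights)
print(mean_height)

heights_with_gap = [1.29, np.nan, 2.1, 1.16, 1.99]
mean_with_gap = calc_mean(heights_with_gap)
print(mean_with_gap)
```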


What about standard deviation? 

In [28]:
%%R
# Function for calculating standard deviation in R

dirname <- '/content/drive/MyDrive/FOR3004_Python_Tutorial/'
fn <- 'NFDB_point_20210916.txt' 
dirfile <- paste(dirname,fn,sep = "")
fires <- read.csv(file = dirfile)

calc_stdev <- function(data) {
  no_nan <- data[!is.na(data)] #Drop missing values from the input
  stdev_value <- sd(no_nan)
  return(stdev_value)
}

fire_sizes <- fires[c('SIZE_HA')]
stdev_from_function <- calc_stdev(fire_sizes$SIZE_HA)
print(stdev_from_function)


[1] 5605.963


In [29]:
# And in Python... 

def calc_stdev(data): 
  # Note: np.nanstd divides by n by default, whereas R's sd() divides by n - 1
  stdev_value = np.nanstd(data)
  return stdev_value

fire_sizes = fires_df['SIZE_HA'] 
stdev_from_function = calc_stdev(fire_sizes)
print(stdev_from_function)

5605.956716670724
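The R and Python results differ slightly (5605.963 versus 5605.957). This is consistent with the two functions using different divisors: R's `sd` divides by n - 1 (the sample standard deviation), while `np.nanstd` divides by n unless told otherwise. The sketch below shows both versions on a small made-up data set; the `ddof` argument added to `calc_stdev` here is an extension for illustration, and `ddof=1` asks NumPy to divide by n - 1.

```python
import numpy as np

def calc_stdev(data, ddof=0):
    # ddof=0 divides by n (population); ddof=1 divides by n - 1 (sample)
    return np.nanstd(data, ddof=ddof)

values = [2, 4, 4, 4, 5, 5, 7, 9, np.nan]  # made-up data with one gap

population_sd = calc_stdev(values)       # NumPy's default divisor (n)
sample_sd = calc_stdev(values, ddof=1)   # same divisor as R's sd() (n - 1)

print(population_sd)
print(sample_sd)
```

With hundreds of thousands of fires the two divisors give nearly the same answer, but on small data sets the difference can be noticeable.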
