### The R Language: Data Manipulation

<font color="red">File access required:</font> In Colab, this notebook requires first uploading the files **Cities.csv**, **Countries.csv**, **Players.csv**, and **Teams.csv** using the *Files* feature in the left toolbar. If running the notebook on a local computer, simply ensure these files are in the same workspace as the notebook.

In [None]:
# Set-up
%load_ext rpy2.ipython

#### Open CSV files and load data into data frames

In [None]:
C1 = open('Cities.csv').read()
C2 = open('Countries.csv').read()

In [None]:
%%R -i C1 -i C2
cities <- read.csv(text=C1)
countries <- read.csv(text=C2)
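
The cells above read each file into a Python string and pass it to R, where `read.csv(text=...)` parses the string instead of opening a file. A minimal sketch of that parsing step in plain R follows; the CSV content and the name `csv_text` are made up for illustration and are not taken from Cities.csv.

```r
# Hypothetical CSV content standing in for the text read from a file
csv_text <- "city,country,temperature
Aville,Xland,4.1
Btown,Yland,17.8
Cburg,Xland,10.2"

# text= makes read.csv parse the string itself rather than a file path
toy <- read.csv(text = csv_text)

# Column names come from the header line; types are inferred from the values
str(toy)
```

When the files are directly readable from R, `read.csv('Cities.csv')` would do the same job without the Python step.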

#### Data frame introduction

In [None]:
%%R
cities

In [None]:
%%R
print(nrow(cities))
print(ncol(cities))

In [None]:
%%R
cities[1,]

In [None]:
%%R
cities[1:10,]

In [None]:
%%R
for (i in 1:10) { print(cities[i,]) }

In [None]:
%%R
cities[,2]
# Try cities[,4] to select the fourth column instead

In [None]:
%%R
cities[5,4]
# Try cities[5:10,2:4] to select a block of rows and columns

In [None]:
%%R
head(cities)
# head(cities, n) shows the first n rows; tail(cities) shows the last rows
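
The indexing forms used in this section follow one pattern: `df[rows, columns]`, with an empty slot meaning "all". The sketch below repeats them on a small data frame with made-up values (the name `toy` and its contents are assumptions, not data from Cities.csv) and shows what type each form returns.

```r
# Toy data frame with made-up values (not taken from Cities.csv)
toy <- data.frame(
  city = c("Aville", "Btown", "Cburg", "Dport"),
  latitude = c(61.2, 38.5, 47.9, 52.3),
  temperature = c(4.1, 17.8, 10.2, 9.6)
)

one_row <- toy[2, ]                   # a single row is still a data frame
one_col <- toy[, 2]                   # a single column becomes a plain vector
block   <- toy[2:3, 1:2]              # rows 2-3, columns 1-2: a data frame
kept_df <- toy[, 2, drop = FALSE]     # drop = FALSE keeps a one-column data frame

print(class(one_row))
print(class(one_col))
print(dim(block))
```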

#### Basic data operations

*Select single column*

In [None]:
%%R
cities[,'city']

*Select multiple columns*

In [None]:
%%R
cities[,c('city','temperature')]

*Select rows*

In [None]:
%%R
cities[cities$longitude < 0,]

*Select rows and columns*

In [None]:
%%R
cities[cities$latitude > 50 & cities$temperature > 9,
       c('city','latitude','temperature')]
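
Row selection works by building a logical vector with one TRUE/FALSE per row; `&` combines conditions element by element. As a rough illustration on made-up data (the `toy` data frame is hypothetical), the sketch below makes the mask explicit and shows `subset()` as an equivalent base R alternative.

```r
# Toy data frame with made-up values
toy <- data.frame(
  city = c("Aville", "Btown", "Cburg", "Dport"),
  latitude = c(61.2, 38.5, 47.9, 52.3),
  temperature = c(4.1, 17.8, 10.2, 9.6)
)

# One logical value per row; & is the element-wise "and"
mask <- toy$latitude > 50 & toy$temperature > 9
print(mask)

# Bracket form: rows where mask is TRUE, two named columns
a <- toy[mask, c('city', 'temperature')]

# subset() form: column names can be used without the toy$ prefix
b <- subset(toy, latitude > 50 & temperature > 9,
            select = c(city, temperature))
print(a)
```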

*Sort by temperature*

In [None]:
%%R
cities[order(cities$temperature),]
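# Variations to try:
# - descending order: order(cities$temperature, decreasing=TRUE)
# - sort by country first: order(cities$country, cities$temperature)
# - ascending country with descending temperature
# - descending country with ascending temperature?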
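
One possible way to work through the sorting variations is sketched below on made-up data (the `toy` data frame is hypothetical). Negating a numeric key reverses its order; strings cannot be negated, so this sketch uses `xtfrm()` to turn the country names into sortable numbers first. Other approaches exist.

```r
# Toy data frame with made-up values
toy <- data.frame(
  city = c("Aville", "Btown", "Cburg", "Dport"),
  country = c("Xland", "Yland", "Xland", "Yland"),
  temperature = c(4.1, 17.8, 10.2, 9.6)
)

# Descending temperature
desc <- toy[order(toy$temperature, decreasing = TRUE), ]

# Ascending country, then descending temperature within each country
asc_desc <- toy[order(toy$country, -toy$temperature), ]

# Descending country, then ascending temperature:
# xtfrm() maps the strings to numbers that sort the same way
desc_asc <- toy[order(-xtfrm(toy$country), toy$temperature), ]

print(asc_desc)
print(desc_asc)
```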

*Selection plus sort*

In [None]:
%%R
cities2 <- cities[cities$longitude < 0 & cities$temperature > 12,
                  c('city','temperature')]
cities2[order(cities2$temperature),]

### <font color="green">**Your Turn**</font>

*Find all countries that are not in the EU and don't have coastline, together with their populations, sorted by country name in reverse alphabetical order. Note: equality uses '==' and strings can be single (') or double (") quoted.*

In [None]:
%%R
YOUR CODE HERE

#### Aggregation

*Overall average temperature*

In [None]:
%%R
mean(cities$temperature)

*Average temperature of cities in each country*

In [None]:
%%R
aggregate(cities$temperature, by=list(cities$country), FUN=mean)
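
With `by=list(...)`, the result columns get the generic names `Group.1` and `x`. Two ways to get more readable names are sketched below on made-up data (the `toy` data frame is hypothetical): naming the list element, or using the formula interface of `aggregate()`.

```r
# Toy data frame with made-up values
toy <- data.frame(
  city = c("Aville", "Btown", "Cburg", "Dport"),
  country = c("Xland", "Yland", "Xland", "Yland"),
  temperature = c(4.1, 17.8, 10.2, 9.6)
)

# Naming the list element names the grouping column (value column stays "x")
by_named <- aggregate(toy$temperature, by = list(country = toy$country), FUN = mean)

# Formula interface: "temperature grouped by country"; both columns keep their names
by_formula <- aggregate(temperature ~ country, data = toy, FUN = mean)

print(by_named)
print(by_formula)
```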

*More examples*

In [None]:
%%R
print(min(cities$temperature))
print(max(cities$temperature))

In [None]:
%%R
aggregate(countries$population, by=list(countries$EU,countries$coastline), FUN=mean)

*Number of cities west of the Prime Meridian (i.e., longitude < 0) - error then fix*

In [None]:
%%R
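# count() is not base R, so count(cities[cities$longitude < 0,]) raises an error.
# Fix: nrow() returns the number of rows in the filtered data frame.
nrow(cities[cities$longitude < 0,])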
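
Base R has no `count()` function (one is provided by the dplyr package), so row counts are usually obtained another way. The sketch below, on made-up longitudes in a hypothetical `toy` data frame, shows two base R options that agree when the column has no missing values.

```r
# Toy data frame with made-up values
toy <- data.frame(
  city = c("Aville", "Btown", "Cburg", "Dport"),
  longitude = c(-3.2, 12.5, -71.0, 8.4)
)

west <- toy$longitude < 0     # logical vector, one entry per row

n_rows <- nrow(toy[west, ])   # count rows of the filtered data frame
n_true <- sum(west)           # TRUE counts as 1, FALSE as 0

print(n_rows)
print(n_true)
```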

### <font color="green">**Your Turn**</font>

*Considering only cities with latitude < 40, find the average temperature for each country. Then considering only cities with latitude > 60, find the average temperature for each country. Hint: Create temporary dataframes for cities with latitude < 40 and cities with latitude > 60, then use aggregate() on these dataframes. Remember print() is needed to see a result unless it's the last line.*

In [None]:
%%R
YOUR CODE HERE

#### Joining

In [None]:
%%R
merge(cities,countries)

*Cities not in the EU with latitude > 50; return city, country, latitude, and whether country has coastline*

In [None]:
%%R
citiesext <- merge(cities,countries)
citiesext[citiesext$EU == 'no' & citiesext$latitude > 50,
          c('city','country','latitude','coastline')]
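
By default, `merge()` joins on the columns the two data frames have in common and keeps only rows that match in both. The sketch below uses two small made-up data frames (`cities_toy` and `countries_toy` are hypothetical names) to show the join column stated explicitly with `by=`, and `all.x=TRUE` to keep unmatched rows from the first data frame.

```r
# Made-up data: Zland has no entry in countries_toy
cities_toy <- data.frame(
  city = c("Aville", "Btown", "Cburg"),
  country = c("Xland", "Yland", "Zland")
)
countries_toy <- data.frame(
  country = c("Xland", "Yland"),
  EU = c("yes", "no")
)

# Inner join: only countries present in both data frames
inner <- merge(cities_toy, countries_toy, by = "country")

# Left join: keep every city; missing country details become NA
left <- merge(cities_toy, countries_toy, by = "country", all.x = TRUE)

print(inner)
print(left)
```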

#### Miscellaneous features

*String operations - countries with 'ia' in their name*

In [None]:
%%R
countries[grepl('ia',countries$country),]
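
`grepl()` returns one TRUE/FALSE per string, which is why it can be used directly as a row selector. Its first argument is a regular expression, so patterns can also anchor to the start or end of a name. A small sketch on a hand-written vector of country names:

```r
names_toy <- c("Austria", "Italy", "India", "Iceland")

contains_ia <- grepl("ia", names_toy)                      # "ia" anywhere
ends_ia     <- grepl("ia$", names_toy)                     # "ia" at the end
starts_i    <- grepl("^i", names_toy, ignore.case = TRUE)  # starts with i or I
literal_ia  <- grepl("ia", names_toy, fixed = TRUE)        # plain text, no regex

print(names_toy[contains_ia])
print(names_toy[starts_i])
```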

*Add fahrenheit column*

In [None]:
%%R
cities['fahrenheit'] <- (cities$temperature * 9/5) + 32
head(cities, 10)
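
Assigning to a column name that does not exist yet creates the column, and the arithmetic is applied to every row at once. The sketch below shows two other common spellings on made-up data (the `toy` data frame and the `kelvin` column are assumptions for illustration): the `$` form and `transform()`, which returns a modified copy.

```r
# Toy data frame with made-up values
toy <- data.frame(city = c("Aville", "Btown"), temperature = c(0, 100))

# $ form: adds the column to toy in place
toy$fahrenheit <- toy$temperature * 9/5 + 32

# transform(): returns a new data frame and leaves toy unchanged
toy2 <- transform(toy, kelvin = temperature + 273.15)

print(toy)
print(toy2)
```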

*Print using cat( )*

In [None]:
%%R
cat('Minimum temperature:', min(cities$temperature), '\n')
cat('Maximum temperature:', max(cities$temperature), '\n')
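
`cat()` joins its arguments with a space and adds no newline unless one is supplied. When the number format matters, `sprintf()` can build the string first. A minimal sketch on a made-up vector (`temps` is hypothetical):

```r
temps <- c(4.1, 17.8, 10.2)   # made-up values

# cat() separates arguments with a space by default
cat('Number of readings:', length(temps), '\n')

# sprintf() controls the formatting; %.1f means one decimal place
line <- sprintf('Range: %.1f to %.1f', min(temps), max(temps))
cat(line, '\n')
```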

### <font color="green">**Your Turn: World Cup Data**</font>

#### Open CSV files and load data into data frames

In [None]:
P = open('Players.csv').read()
# Named TM rather than T because T is R's shorthand for TRUE
TM = open('Teams.csv').read()

In [None]:
%%R -i P -i TM
players <- read.csv(text=P)
teams <- read.csv(text=TM)

*1) Which player on a team with "ia" in the team name played fewer than 200 minutes and made more than 100 passes? Print the player surname.*

In [None]:
%%R
YOUR CODE HERE

*2) What is the average number of passes made by forwards? By midfielders? Make sure the answer specifies which is which, and don't include other positions in your result.*

In [None]:
%%R
YOUR CODE HERE

*3) Which team has the highest ratio of goalsFor to goalsAgainst? Print the team name only. Hint: Add a "ratio" column to the teams dataframe, then sort and pick the first or last row depending how you sorted.*

In [None]:
%%R
YOUR CODE HERE

*4) How many players who play on a team with ranking < 10 played more than 350 minutes?*

In [None]:
%%R
YOUR CODE HERE

### <font color="green">**Your Turn Extra: Titanic Data**</font>

<font color="red">File access required:</font> In Colab, these extra problems require first uploading **Titanic.csv** using the *Files* feature in the left toolbar. If running the notebook on a local computer, simply ensure this file is in the same workspace as the notebook.

#### Open CSV file and load data into data frame

In [None]:
# Named TI rather than T because T is R's shorthand for TRUE
TI = open('Titanic.csv').read()

In [None]:
%%R -i TI
titanic <- read.csv(text=TI)

*1) How many passengers sailed for free (i.e., fare is zero)?*

In [None]:
%%R
YOUR CODE HERE

*2) How many passengers are missing their age, and what is the average fare paid by these passengers? Note: In R, the function is.na() checks whether a value is missing (NA).*
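
Missing values in R are represented by `NA`, which is distinct from `NULL`. The sketch below shows how `is.na()` behaves on a made-up vector (`ages` is hypothetical and unrelated to Titanic.csv) and how `NA` affects `mean()`.

```r
ages <- c(22, NA, 35, NA, 4)   # made-up values

missing <- is.na(ages)         # one TRUE/FALSE per element
n_missing <- sum(missing)      # number of missing values

mean_all  <- mean(ages)                # NA: any missing value makes the mean NA
mean_known <- mean(ages, na.rm = TRUE) # ignore missing values

print(n_missing)
print(mean_known)
print(is.null(ages))           # FALSE: a vector containing NA is not NULL
```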

In [None]:
%%R
YOUR CODE HERE

*3) For male survivors, female survivors, male non-survivors, and female non-survivors, how many passengers were in each of those four categories and what was their average fare?*

In [None]:
%%R
YOUR CODE HERE

*4) What is the survival rate of passengers in the three classes, i.e., what fraction of passengers in each class survived? What is the survival rate of females versus males? Of children (under 18) versus adults (age 18 or over)?*

In [None]:
%%R
YOUR CODE HERE