# Reviewing Tables in R
## Political Science 3 Discussion Week 3 - Clara Hu

Today we will practice working with data in R, where datasets are stored as tables (dataframes).

### Coding in Style

If you are interested in conventions that are often used in formatting R code (spacing, naming, etc.), and would like to make your code easier to read, check out the [tidyverse style guide](https://style.tidyverse.org/syntax.html). 

## Review of Extracting and Subsetting from Dataframes

Remember that we can extract specific columns in a table using the `$` operator. We can also use the `subset` function to extract specific observations with values that satisfy a specific condition.

In [1]:
# We'll be looking at data about Iris flowers 
# Run this cell! Ignore what I'm doing below. 

iris$setosa <- ifelse(iris$Species == "setosa", 1, 0) # Making dummy variables
iris$versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$virginica <- ifelse(iris$Species == "virginica", 1, 0)

head(iris)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | setosa | versicolor | virginica |
| - | - | - | - | - | - | - | - | - |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | 0 | 0 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | 0 | 0 |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 1 | 0 | 0 |


In [3]:
# Remember we can get values in a column by using $
head(iris$Sepal.Length, 10) 
# I'm using the head function to show only the first 10 values
# instead of all 150

In [4]:
# Remember, we can also give things names and reference that name
# For example: what's the median petal length for all
# flowers in the dataset?
all_petal_lengths <- iris$Petal.Length
median(all_petal_lengths)

## Quick Check

Using the dummy variable columns, find the proportion of flowers in the dataset that are either the species `Iris setosa` or `Iris versicolor`. Save that value as the name `s_v_prop`.

In [5]:
table(iris$setosa, iris$versicolor)

   
     0  1
  0 50 50
  1 50  0

In [10]:
# Use as many lines as you need!
# Proportion of Iris setosa: mean(iris$setosa)
# Proportion of Iris versicolor: mean(iris$versicolor)
# No flower is both species, so we can add the two proportions
s_v_prop <- mean(iris$setosa) + mean(iris$versicolor)
s_v_prop # Tell R to print the value

## Subsetting

Let's focus on only the `Iris virginica` flowers in the dataset. We can do this by using the `subset` function, which takes the following arguments:

`subset(table, column_logical)`

A logical in R is the same thing as a Boolean in Python. In other words, it is a value (or set of values) that is either `TRUE` or `FALSE`. You can also think of them as 1 and 0.
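To make the 1-and-0 idea concrete, here is a small sketch using a made-up logical vector (`is_setosa_example` is a hypothetical name, not part of the `iris` data). Because `TRUE` counts as 1 and `FALSE` counts as 0, `sum` counts the `TRUE` values and `mean` gives their proportion.

```r
# A made-up logical vector for illustration (not from the iris data)
is_setosa_example <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(is_setosa_example) # TRUE becomes 1, FALSE becomes 0
sum(is_setosa_example)        # Number of TRUE values: 3
mean(is_setosa_example)       # Proportion of TRUE values: 0.75
```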

We can get these by doing something called a Boolean comparison: we compare one value to another, and R returns `TRUE` if the condition holds and `FALSE` otherwise. Here are some common comparisons:

| Comparison | R Code |
| - | - |
| does x equal y? | `x == y` |
| does x NOT equal y? | `x != y` |
| is x less than y? | `x < y` |
| is x greater than y? | `x > y` |
| is x less than or equal to y? | `x <= y` |
| is x greater than or equal to y? | `x >= y` |


In [11]:
# Logical example in R
x <- 5
y <- 10

x == y
x < y
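
The same comparisons also work on a whole column at once: R checks each value and returns one `TRUE` or `FALSE` per row. Here is a short sketch with the built-in `iris` data; the name `is_long_sepal` and the cutoff of 7 are just illustrative choices.

```r
# Compare every sepal length to 7; the result has one logical per flower
is_long_sepal <- iris$Sepal.Length > 7

head(is_long_sepal)   # First few TRUE/FALSE values
length(is_long_sepal) # Same as the number of rows in iris
sum(is_long_sepal)    # How many flowers have a sepal length above 7
```

A logical vector like this is what `subset` uses to decide which rows to keep.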

In [12]:
# Let's practice using this to subset:
# subset(table, column_name <comparison> <value>)
virginica <- subset(iris, virginica == 1)
head(virginica)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | setosa | versicolor | virginica |
| - | - | - | - | - | - | - | - | - |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 101 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 0 | 0 | 1 |
| 102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 0 | 0 | 1 |
| 103 | 7.1 | 3.0 | 5.9 | 2.1 | virginica | 0 | 0 | 1 |
| 104 | 6.3 | 2.9 | 5.6 | 1.8 | virginica | 0 | 0 | 1 |
| 105 | 6.5 | 3.0 | 5.8 | 2.2 | virginica | 0 | 0 | 1 |
| 106 | 7.6 | 3.0 | 6.6 | 2.1 | virginica | 0 | 0 | 1 |


In [13]:
# How can we get the virginica flowers using a different column?
virginica2 <- subset(iris, Species == "virginica")
head(virginica2)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | setosa | versicolor | virginica |
| - | - | - | - | - | - | - | - | - |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 101 | 6.3 | 3.3 | 6.0 | 2.5 | virginica | 0 | 0 | 1 |
| 102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 0 | 0 | 1 |
| 103 | 7.1 | 3.0 | 5.9 | 2.1 | virginica | 0 | 0 | 1 |
| 104 | 6.3 | 2.9 | 5.6 | 1.8 | virginica | 0 | 0 | 1 |
| 105 | 6.5 | 3.0 | 5.8 | 2.2 | virginica | 0 | 0 | 1 |
| 106 | 7.6 | 3.0 | 6.6 | 2.1 | virginica | 0 | 0 | 1 |
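
As a side note, base R also offers bracket indexing, which can select the same rows as `subset`. The sketch below uses the hypothetical name `virginica3` and the `[rows, columns]` notation; leaving the column position empty keeps every column.

```r
# Keep rows where Species is "virginica"; the empty spot after the comma keeps all columns
virginica3 <- iris[iris$Species == "virginica", ]

# Build the subset() version again and compare the two results
virginica2 <- subset(iris, Species == "virginica")
all.equal(virginica3, virginica2) # Checks that both approaches give the same table
```

Either style works here; `subset` is usually easier to read because we do not have to repeat `iris$` inside the condition.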


## Subset of a subset

Using the `subset` function and some other R code, create a table that only has rows of Iris setosa flowers that have a sepal length smaller than the mean sepal length of ALL virginica flowers. Call this new table `small_setosas`.

*Hint:* You should use the `virginica` table to help find the mean sepal length of Iris virginica flowers.

In [20]:
avg_virginica_sepal_length <- mean(virginica$Sepal.Length)
setosa <- subset(iris, Species == "setosa")
small_setosas <- subset(setosa, 
                        Sepal.Length < avg_virginica_sepal_length)

head(small_setosas)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | setosa | versicolor | virginica |
| - | - | - | - | - | - | - | - | - |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | 0 | 0 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | 0 | 0 |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 1 | 0 | 0 |


In [21]:
avg_virginica_sepal_length 
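
The two-step approach can also be written as a single `subset` call by joining both conditions with `&` (logical AND). This is only a sketch of an alternative; the name `small_setosas_one_step` is made up for this example, and the mean is recomputed here so the block runs on its own.

```r
# Mean sepal length of the virginica flowers
avg_virginica_sepal_length <- mean(iris$Sepal.Length[iris$Species == "virginica"])

# Keep rows that are setosa AND have a sepal length below that mean
small_setosas_one_step <- subset(
  iris,
  Species == "setosa" & Sepal.Length < avg_virginica_sepal_length
)

nrow(small_setosas_one_step) # Number of rows that satisfy both conditions
```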

## One-way and Two-way Tables

We can use the `table` function to create one-way and two-way tables. A one-way table counts how many observations fall into each category of one variable; a two-way table counts how many fall into each combination of categories of two variables. To use the `table` function, just plug in the column or columns that we want to count.

| One way | Two way |
| - | - |
| `table(data$var1)` | `table(data$var1, data$var2)` |

In [22]:
# Creating a one-way table
# Let's see how many flowers are in each Species category!
table(iris$Species)


    setosa versicolor  virginica 
        50         50         50 
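
If we want proportions instead of counts, one option is base R's `prop.table` function, which divides each count by the total. A minimal sketch (the names `species_counts` and `species_props` are illustrative):

```r
species_counts <- table(iris$Species)       # Counts per species
species_props <- prop.table(species_counts) # Each count divided by the total

species_props
sum(species_props) # Proportions add up to 1
```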

In [23]:
head(iris)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | setosa | versicolor | virginica |
| - | - | - | - | - | - | - | - | - |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | 0 | 0 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | 0 | 0 |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 1 | 0 | 0 |


**Is creating a two-way table useful for the `iris` dataset?**  

Two-way tables summarize counts across two categorical variables. So far, `iris` has only one categorical variable, `Species` (the dummy columns just re-encode it), so a two-way table would not tell us anything new.   
Let's create another dataframe called `iris2` with all of the original `iris` data along with a new variable `long_petal` that categorizes the observations based on petal length.

In [24]:
# Adding a categorical variable
# Run this cell! Ignore what I'm doing below. 
iris2 <- iris
iris2$long_petal <- 'medium'
iris2$long_petal[iris$Petal.Length > quantile(iris$Petal.Length, 0.75)] <- 'long'
iris2$long_petal[iris$Petal.Length < quantile(iris$Petal.Length, 0.25)] <- 'short'
head(iris2)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | setosa | versicolor | virginica | long_petal |
| - | - | - | - | - | - | - | - | - | - |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | 0 | 0 | short |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | 0 | 0 | short |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | 0 | 0 | short |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | 0 | 0 | short |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | 0 | 0 | short |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 1 | 0 | 0 | medium |


In [25]:
# Creating a two-way table
# Let's see how many of each species (rows)
# have short, medium, or long petals (columns)!
table(iris2$Species, iris2$long_petal)

            
             long medium short
  setosa        0     13    37
  versicolor    0     50     0
  virginica    34     16     0
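
Notice that the columns appear in alphabetical order (`long`, `medium`, `short`). One way to get a more natural order is to turn the column into a factor with explicit levels. The sketch below rebuilds `long_petal` on a copy named `iris3` (a hypothetical name) so that it can run on its own.

```r
# Rebuild the petal-length categories on a fresh copy of iris
iris3 <- iris
iris3$long_petal <- "medium"
iris3$long_petal[iris3$Petal.Length > quantile(iris3$Petal.Length, 0.75)] <- "long"
iris3$long_petal[iris3$Petal.Length < quantile(iris3$Petal.Length, 0.25)] <- "short"

# Store the categories as a factor so table() uses this order for the columns
iris3$long_petal <- factor(iris3$long_petal, levels = c("short", "medium", "long"))

petal_table <- table(iris3$Species, iris3$long_petal)
petal_table
```

The counts are the same as before; only the column order changes.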

## Correlation vs Causality

It's important to distinguish between correlation and causality. Correlation describes how strongly two variables are linearly related; it says nothing about cause and effect. By contrast, causality describes a relationship in which one event or process brings about the other.
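
In R, the `cor` function measures the strength of a linear relationship as a number between -1 and 1. As a quick sketch, here is the correlation between petal length and petal width in `iris`; the name `petal_cor` is illustrative. A value near 1 tells us the two measurements rise together, but not why.

```r
# Pearson correlation between two numeric columns
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
petal_cor # Close to 1: flowers with longer petals tend to have wider petals
```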

“A correlates with B” does not necessarily mean “A causes B”.


This may be due to: 

1. **Reverse causality** (the possibility that B actually causes A)  
2. **Omitted variable bias** (a third variable `C` causes both `A` and `B`; note that `C` does not need to be a variable in the given dataset).
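
To see how an omitted variable can create a correlation, the sketch below simulates made-up data in which `var_c` drives both `var_a` and `var_b`, while `var_a` and `var_b` have no direct effect on each other. All names and numbers here are assumptions chosen for illustration.

```r
set.seed(3) # Make the random numbers reproducible

n <- 1000
var_c <- rnorm(n)         # The omitted third variable
var_a <- var_c + rnorm(n) # A depends on C plus random noise
var_b <- var_c + rnorm(n) # B depends on C plus random noise (not on A)

cor_ab <- cor(var_a, var_b)
cor_ab # Clearly positive, even though A does not cause B and B does not cause A
```

If we only had `var_a` and `var_b` in our dataset, we might be tempted to conclude that one causes the other.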

## Spurious correlation

A spurious correlation occurs when two or more events or variables are mathematically associated but not causally related. The association may come from an unseen third factor (a **confounding factor**), or it may be pure coincidence with no confounding variable involved.
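
Coincidence alone can produce strong correlations if we look at enough pairs of variables. The sketch below is a made-up simulation: it draws 1,000 pairs of unrelated random variables with 10 observations each and records the correlation of each pair. The sample sizes and names are illustrative assumptions.

```r
set.seed(3) # Make the random numbers reproducible

# For each of 1000 pairs, draw two independent sets of 10 random numbers
# and compute their correlation
random_cors <- replicate(1000, cor(rnorm(10), rnorm(10)))

max(abs(random_cors))        # Strongest correlation found, despite no real relationship
mean(abs(random_cors) > 0.6) # Share of pairs with a fairly strong correlation
```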

Let's see some online examples here: [https://www.tylervigen.com/spurious-correlations](https://www.tylervigen.com/spurious-correlations). 