### Conditional Probability

Conditional probability, as you have seen in the labs and the video, is the probability of an event X occurring given that another event Y has occurred. Mathematically, it is represented as $P(X | Y)$ which is read as “probability of X given Y”.
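
For reference, the standard definition behind this notation can be written as follows. It assumes $P(Y) > 0$, and $X \cap Y$ denotes the event that both X and Y occur.

$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}$$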

We will continue working with the motor vehicle thefts dataset and apply conditional probability concepts to it. You will also be introduced to a new function, tally(), which can produce the same kind of proportion table as prop.table() and makes it easier to work with conditional probabilities.

#### Read the data

Load the motor vehicle thefts dataset into a variable called vehicle_thefts. The dataset is in the directory '/datasets/motor vehicle thefts/'.

In [1]:
vehicle_thefts <- read.csv("../../../datasets/motor vehicle thefts/mvt.csv",header=TRUE)
head(vehicle_thefts)

ID,Date,LocationDescription,Arrest,Domestic,Beat,District,CommunityArea,Year,Latitude,Longitude
8951354,12/31/2012 23:15,STREET,False,False,623,6,69,2012,41.75628,-87.62164
8951141,12/31/2012 22:00,STREET,False,False,1213,12,24,2012,41.89879,-87.6613
8952745,12/31/2012 22:00,RESIDENTIAL YARD (FRONT/BACK),False,False,1622,16,11,2012,41.96919,-87.76767
8952223,12/31/2012 22:00,STREET,False,False,724,7,67,2012,41.76933,-87.65773
8951608,12/31/2012 21:30,STREET,False,False,211,2,35,2012,41.83757,-87.62176
8950793,12/31/2012 20:30,STREET,True,False,2521,25,19,2012,41.92856,-87.754


In [2]:
DateConvert = strptime(vehicle_thefts$Date, "%m/%d/%Y %H:%M")

#extract the month and the day of the week and add these variables to the data frame vehicle_thefts
# install.packages("lubridate",repo="http://cran.mtu.edu/")
library(lubridate)
library(dplyr)

expand_date <- mdy_hm(as.character(vehicle_thefts$Date)) #Parse input date "12/31/2012 23:15" into a date-time such as "2012-12-31 23:15:00 UTC"

vehicle_thefts$Month = months(DateConvert)  #Extract month from formatted date. 
vehicle_thefts$Weekday = weekdays(DateConvert)   #Extract weekday from formatted date. 

# using base::format
vehicle_thefts$Hour = as.numeric(format(expand_date, "%H")) #Extract hour from formatted date. 
vehicle_thefts$Minutes = as.numeric(format(expand_date, "%M"))  #Extract minutes from formatted date. 


Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date


Attaching package: ‘dplyr’

The following objects are masked from ‘package:lubridate’:

    intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



**Reference:** [format()](https://stat.ethz.ch/R-manual/R-devel/library/base/html/format.html)

**Activity 1:** What is the probability of an arrest being made in the month with the largest number of motor vehicle thefts? First, let's find the month with the most thefts.


In [3]:
#Find the name of the month with largest value among the table() output results
which.max(table(vehicle_thefts$Month))

#A table is usually the best summary for categorical data. Once we have a table, we should look at it and say something sensible.
#Now let's take a look at the relationship between the two categorical variables Month and Arrest.
table(vehicle_thefts$Month, vehicle_thefts$Arrest)

           
            FALSE  TRUE
  April     14028  1252
  August    15243  1329
  December  15029  1397
  February  12273  1238
  January   14612  1435
  July      15477  1324
  June      14772  1230
  March     14460  1298
  May       14848  1187
  November  14807  1256
  October   15744  1342
  September 14812  1248

In [4]:
#P(Arrest | Month = October): arrests in October divided by all thefts in October
table(vehicle_thefts$Month[vehicle_thefts$Arrest==TRUE & vehicle_thefts$Month=="October"])/sum(table(vehicle_thefts$Month[vehicle_thefts$Month=="October"]))


   October 
0.07854384 
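
As a cross-check, the same conditional probability can be recomputed by hand from the October row of the Month by Arrest table. This is a minimal sketch: the counts are copied from that table, and the names `october_counts` and `p_arrest_given_october` are introduced here purely for illustration.

```r
# Counts copied from the October row of the Month x Arrest table above.
october_counts <- c(no_arrest = 15744, arrest = 1342)

# P(Arrest | Month = October) = arrests in October / all thefts in October
p_arrest_given_october <- october_counts[["arrest"]] / sum(october_counts)
print(round(p_arrest_given_october, 8))
```

The result should agree with the 0.07854384 printed by the cell above.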

**Activity 2:** Which month has the largest number of motor vehicle thefts for which an arrest was made?

In [5]:
which.max(table(vehicle_thefts$Month[vehicle_thefts$Arrest==TRUE]))
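
The answer can also be read off the TRUE column of the Month by Arrest table shown earlier. The sketch below assumes those counts are copied correctly. The vector `arrests_by_month` and the variable `top_arrest_month` are illustrative names, not part of the original notebook.

```r
# Arrest counts copied from the TRUE column of the Month x Arrest table above.
arrests_by_month <- c(
  April = 1252, August = 1329, December = 1397, February = 1238,
  January = 1435, July = 1324, June = 1230, March = 1298,
  May = 1187, November = 1256, October = 1342, September = 1248
)

# which.max() returns the named position of the largest count;
# names() keeps only the month label.
top_arrest_month <- names(which.max(arrests_by_month))
print(top_arrest_month)
```

With these counts, the month with the most arrests (January, 1435) is not the month with the most thefts overall (October).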

**Activity 3:**

a) Read the smoke.csv dataset from the directory '/datasets/smoke/' into a variable called smoke_data. 

b) Create a two-way table called smoker_outcome for the variables 'smoker' and 'outcome'. Add the marginal totals to the table using the addmargins() function.

In [6]:
smoke_data <- read.csv("../../../datasets/smoke/smoke.csv",header=TRUE)

smoker_outcome = table(smoke_data$smoker,smoke_data$outcome)
addmargins(smoker_outcome)

      Alive Dead  Sum
  No    502  230  732
  Yes   443  139  582
  Sum   945  369 1314


A table like this is only useful if we can interpret it. The most important question here is whether smoking is a factor in death. 443 of the 945 people who are alive are smokers, while 139 of the 369 who are dead were smokers. These counts are hard to compare until we put them on a common footing by expressing them as proportions or percentages: 443 out of 945, or about 47%, of those alive smoke, and 139 out of 369, or about 38%, of those dead smoked. Smokers are therefore no more common among the dead than among the living, so in this table smoking does not appear to be a factor in deaths for this group of people.
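
The common-denominator step described above can be sketched as follows. To keep the example runnable without smoke.csv, the table is rebuilt from the counts printed above. The name `smoker_share` is an illustrative assumption.

```r
# Rebuild the two-way table from the counts shown above (rows: smoker, columns: outcome).
smoker_outcome <- as.table(matrix(
  c(502, 443, 230, 139), nrow = 2,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
))

# Share of smokers within each outcome group: smokers / column total.
smoker_share <- smoker_outcome["Yes", ] / colSums(smoker_outcome)
print(round(smoker_share, 2))
```

Dividing by the column totals is what turns the raw counts 443 and 139 into the comparable figures of roughly 47% and 38%.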

In [16]:
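# Example: the dataset above recorded smoking status and whether or not the subject was alive at the end
# of 20 years. prop.table() is the counterpart of table(): table() returns frequency counts, while
# prop.table() converts those counts to proportions. With no margin argument, every cell is divided by the
# grand total, giving joint proportions such as P(smoker & dead); these are the inputs for the conditional
# probability of death for smokers and nonsmokers.
# tally() can build the same table of proportions, e.g. (assuming the mosaic package's formula interface):
# tal <- tally(~smoker + outcome, data = smoke_data, format = "proportion")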

tal <- prop.table(smoker_outcome)
addmargins(tal)

          Alive      Dead       Sum
  No  0.3820396 0.1750381 0.5570776
  Yes 0.3371385 0.1057839 0.4429224
  Sum 0.7191781 0.2808219 1.0000000


**Reference:** [tally()](http://finzi.psych.upenn.edu/library/dplyr/html/tally.html)

**Activity 4:**

a) What is the probability that a person is dead, given that they were a smoker?

b) What is the probability that a person is dead, given that they were a non-smoker?

Here, smoker status is the condition.

In [8]:
#P(dead|smoker) = P(dead & smoker)/P(smoker)
0.1057839/0.4429224

In [9]:
#P(dead|nonsmoker) = P(dead & nonsmoker)/P(nonsmoker)
0.1750381/0.5570776
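
Rather than retyping rounded values, the two ratios above can be computed from the table itself. This is a sketch under the assumption that the counts below match smoke.csv (they are copied from the addmargins() output). The names `joint` and `p_dead_given_status` are illustrative.

```r
# Rebuild the two-way table from the counts shown earlier (rows: smoker, columns: outcome).
smoker_outcome <- as.table(matrix(
  c(502, 443, 230, 139), nrow = 2,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
))

# Joint proportions: each cell divided by the grand total.
joint <- prop.table(smoker_outcome)

# P(dead | status) = P(dead & status) / P(status), for status = No and Yes.
p_dead_given_status <- joint[, "Dead"] / rowSums(joint)
print(round(p_dead_given_status, 7))

# The row-wise proportions from prop.table(, 1) give the same numbers directly.
print(all.equal(unname(p_dead_given_status),
                unname(prop.table(smoker_outcome, 1)[, "Dead"])))
```

Applying the definition cell by cell and asking prop.table() for row proportions are two routes to the same conditional probabilities.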

In [10]:
# Use prop.table() with a margin argument if you don't want to do the arithmetic of finding proportions from table() results by hand.
# The "2" tells R to divide each cell by its column total, so each outcome column (Alive, Dead) adds up to 1: P(smoker status | outcome).
# The "1" tells R to divide each cell by its row total, so each smoker-status row (No, Yes) adds up to 1: P(outcome | smoker status).

In [11]:
addmargins(prop.table(smoker_outcome,2))

          Alive      Dead       Sum
  No  0.5312169 0.6233062 1.1545232
  Yes 0.4687831 0.3766938 0.8454768
  Sum 1.0000000 1.0000000 2.0000000


In [12]:
addmargins(prop.table(smoker_outcome,1))

          Alive      Dead       Sum
  No  0.6857923 0.3142077 1.0000000
  Yes 0.7611684 0.2388316 1.0000000
  Sum 1.4469607 0.5530393 2.0000000


The meaning of conditional probabilities is much clearer in these tables than it is in language or mathematical notation. The idea of a conditional probability is that you are looking at a subset of the data. For example, in an election poll we might be interested in the subset of voters who prefer Candidate A, and also in how those voters break down by gender, race, ethnicity, etc.

For the smoke data, we saw that about 44% of the 1314 people smoked. However, within the subset of people who are alive, 443 out of 945, or about 47%, are smokers. Often we want to compare one subset to another: here, 139 of the 369 dead, or about 38%, were smokers. We noted this earlier and found those numbers in the table. In notation, these conditional probabilities are P(smoke | alive) and P(smoke | dead) respectively. They can be found by using "2" in prop.table() because the subsets (conditions) are dead or alive.

In [13]:
#comparing proportions of smokers and non-smokers for subsets of alive and dead.
addmargins(prop.table(smoker_outcome,2))

          Alive      Dead       Sum
  No  0.5312169 0.6233062 1.1545232
  Yes 0.4687831 0.3766938 0.8454768
  Sum 1.0000000 1.0000000 2.0000000


Similarly, we can answer Activity 4 by looking at the subsets (conditions) of smoking status.

In [14]:
#comparing proportions of alive and dead for subsets of nonsmokers and smokers.
addmargins(prop.table(smoker_outcome,1))

          Alive      Dead       Sum
  No  0.6857923 0.3142077 1.0000000
  Yes 0.7611684 0.2388316 1.0000000
  Sum 1.4469607 0.5530393 2.0000000
