# Theoretical questions

#### 1. Basic probability and statistics

The joint probability of variables $x$ and $y$ is modeled by:

$
    p(x,y)=\left\{
                \begin{array}{ll}
                  72(xy^2-xy^3-x^2y^2+x^2y^3) \hspace{1cm}  x,y \in [0,1]\\
                  0  \hspace{5.8cm} \text{otherwise}
                \end{array}
              \right.
  $
  
Are $x$ and $y$ independent? If yes, please answer with number 1, else answer with 0

**Solution:**

$p(x,y) = 72(xy^2 - xy^3 - x^2y^2 + x^2y^3) = 72xy^2(1-y-x+xy) = 72xy^2[(1-y)-x(1-y)] = 72x(1-x)y^2(1-y)$

We have shown that $p(x,y) = f(x) \times g(y)$, with $f(x) = x(1-x)$ and $g(y) = 72y^2(1-y)$ on the rectangular domain $x,y \in [0,1]$, hence $x$ and $y$ are independent.
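As a numerical sanity check on the factorization, the sketch below evaluates the density on a grid and compares it with the product of its marginals. The marginals $p(x) = 6x(1-x)$ and $p(y) = 12y^2(1-y)$ come from integrating the factored form over $[0,1]$; the function names are illustrative and not part of the assignment.

```python
import numpy as np

def joint(x, y):
    # Joint density from the problem statement, valid on [0, 1] x [0, 1].
    return 72 * (x * y**2 - x * y**3 - x**2 * y**2 + x**2 * y**3)

def marginal_x(x):
    # Integrating the factored form over y gives 72 * x(1-x) * (1/12).
    return 6 * x * (1 - x)

def marginal_y(y):
    # Integrating the factored form over x gives 72 * (1/6) * y^2 (1-y).
    return 12 * y**2 * (1 - y)

grid = np.linspace(0.0, 1.0, 21)
X, Y = np.meshgrid(grid, grid)

# Independence holds if the joint equals the product of marginals everywhere.
independent = np.allclose(joint(X, Y), marginal_x(X) * marginal_y(Y))
print(independent)
```

A grid check is not a proof, but it catches algebra slips in the factorization quickly.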

In [2]:
# Your answer here:

q1 = 1

#### 2. Joint and marginal probabilities

$X$ defines the maximum seismic intensity experienced by a site in the next 10 years, and it may assume 4 values: $X_1 = $ 'low', $X_2 = $ 'medium', $X_3 = $ 'high', and $X_4 = $ 'very high'.

$Y$ defines the seismic damage on the building in that site, and it may assume 5 values: $Y_1 = $ 'no damage', $Y_2 = $ 'mild damage', $Y_3 = $ 'medium damage', $Y_4 = $ 'severe damage', and $Y_5 = $ 'collapse'.

Marginal probability of $X$ and conditional probability of $Y$ given $X$ are modeled as:

<img src="hw1_1.png" width="500">

**2.1.** For each damage state, calculate the marginal probability of damage $p(Y)$.

**hint: this is a column vector of length 5; note: Multiply the answer by 100 (percentage) and give the answer up to one decimal.**

**Solution:** 

For each $Y_j$, the law of total probability gives $P(Y_j) = P(Y_j \mid X_1) P(X_1) + P(Y_j \mid X_2) P(X_2) + \dots + P(Y_j \mid X_4) P(X_4)$. We are given the table of $P(Y \mid X)$ and a vector for $P(X)$. The vector $P(Y)$ is a linear combination of the rows of the table, weighted by $P(X)$, so we can build a matrix from the table values, transpose it (we want to combine the rows, not the columns), and multiply it by the column vector $P(X)$.

Good reference for how to multiply matrices : https://www.khanacademy.org/math/precalculus/precalc-matrices/multiplying-matrices-by-matrices/a/multiplying-matrices


In [1]:
#Answer here:
import numpy as np
p_y_given_x = np.matrix([[0.7,0.3,0,0,0],[0.5,0.4,0.1,0,0],[0.3,0.3,0.35,0.04,0.01],[0,0.2,0.6,0.15,0.05]])
p_x = np.matrix([[0.05],[0.25],[0.4],[0.3]])
c = p_y_given_x.T * p_x
d = c *100

In [2]:
q2 = c.T * 100
q2

matrix([[ 28. ,  29.5,  34.5,   6.1,   1.9]])
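The same computation can be sketched with plain NumPy arrays and the `@` operator instead of `np.matrix`. This is an alternative formulation, not the notebook's own method; the values are those entered in the cell above.

```python
import numpy as np

# Rows are X_1..X_4, columns are Y_1..Y_5 (values as entered in the cell above).
p_y_given_x = np.array([
    [0.7, 0.3, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.1, 0.0, 0.0],
    [0.3, 0.3, 0.35, 0.04, 0.01],
    [0.0, 0.2, 0.6, 0.15, 0.05],
])
p_x = np.array([0.05, 0.25, 0.4, 0.3])

# Law of total probability: P(Y_j) = sum_i P(Y_j | X_i) P(X_i).
p_y = p_x @ p_y_given_x
p_y_percent = np.round(p_y * 100, 1)
print(p_y_percent)
```

With a 1-D vector on the left, no transpose is needed: `p_x @ p_y_given_x` weights each row by $P(X_i)$ and sums them.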

**2.2.** For each damage state, calculate the conditional probability of damage given $X = X_2$

**note: we are calculating $p(Y \mid X = X_2)$; Multiply the answer by 100 (percentage) and give the answer up to one decimal (to round to 1 decimal use numpy.round('your vector', 1)).**

**Solution:** We are given the data for $P(Y \mid X)$. We want to extract the row that corresponds to $P(Y \mid X_2)$. 

In [3]:
# Answer here:

q3 = p_y_given_x[1,:] * 100 #The first argument is which row we want (remember we index at 0 so X_2 is at position 1). The second argument is for which column, the ":" means all. 
q3

matrix([[ 50.,  40.,  10.,   0.,   0.]])

**2.3.** For each damage state, calculate the joint probability of having that damage state and $X = X_2$

**note: we are calculating $p(X_2 , Y)$; Multiply the answer by 100 (percentage) and give the answer up to one decimal. (to round to 1 decimal use numpy.round('your vector', 1))**

**Solution:** $P(X_2, Y) = P(X_2) P(Y \mid X_2)$. One method to calculate this is using np.multiply, which multiplies each value of the $P(X)$ vector with the corresponding row of the $P(Y \mid X)$ matrix. Our answer will be the row for $X_2$.

In [4]:
p_xy = np.multiply(p_y_given_x , p_x)
p_xy[1, : ] * 100 #Remember we index at 0, so the row for X_2 is at row 1.

matrix([[ 12.5,  10. ,   2.5,   0. ,   0. ]])

In [5]:
# Your answer here:
q4 = p_xy[1,] * 100
q4

matrix([[ 12.5,  10. ,   2.5,   0. ,   0. ]])

**2.4.** For each seismic intensity, calculate the conditional probability of that intensity given $Y = Y_3$

**note: we are calculating $p(X \mid Y = Y_3)$; Multiply the answer by 100 (percentage) and give the answer up to one decimal. (to round to 1 decimal use numpy.round('your vector', 1))**

**Solution:** The conditional probability is represented by $P(X \mid Y)$. Recall that $P(X \mid Y) = \frac{P(X,Y)}{P(Y)}$.

In [6]:
p_y = np.sum(p_xy,0) #Returns the sum of each column, which gives us P(Y)
p_x_given_y = p_xy / p_y #Returns a matrix with P(X|Y) in each column

In [7]:
"""We want Y=Y_3, so we want the third column."""
# Answer here:

q5 = np.round(p_x_given_y[:,2]*100,1)
q5

array([[  0. ],
       [  7.2],
       [ 40.6],
       [ 52.2]])
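The chain from joint to marginal to posterior can also be sketched with plain arrays and broadcasting. This is a self-contained illustration using the same numbers as above; the variable names are chosen here for clarity.

```python
import numpy as np

p_y_given_x = np.array([
    [0.7, 0.3, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.1, 0.0, 0.0],
    [0.3, 0.3, 0.35, 0.04, 0.01],
    [0.0, 0.2, 0.6, 0.15, 0.05],
])
p_x = np.array([0.05, 0.25, 0.4, 0.3])

# Joint P(X_i, Y_j) = P(Y_j | X_i) P(X_i); broadcasting scales row i by P(X_i).
p_xy = p_y_given_x * p_x[:, None]

# Marginal P(Y_j): sum the joint over X (axis 0).
p_y = p_xy.sum(axis=0)

# Bayes' rule, column by column: P(X_i | Y_j) = P(X_i, Y_j) / P(Y_j).
p_x_given_y = p_xy / p_y

posterior_y3 = np.round(p_x_given_y[:, 2] * 100, 1)
print(posterior_y3)
```

Each column of `p_x_given_y` sums to 1, which is a quick way to confirm the normalization ran along the intended axis.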

#### 3. Continuing on probability

On day zero, the number of trucks in a parking lot is $X_0$. Assume $X_0 = 1$ with probability 20%, $X_0 = 2$ with probability 50% and $X_0 = 3$, with probability 30%. Between day zero and day one, assume one truck can leave (with probability 25%), one truck can be added (with probability 25%), or the number of trucks can stay the same (with probability 50%). $X_1$ is the number of trucks on day one.

**3.1.** What is the probability that $X_1 = 1$?

**Solution:** 

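$p(X_1 = 1) = p(X_0 = 1) p(\Delta X = 0) + p(X_0 = 2) p(\Delta X = -1) + p(X_0 = 3) p(\Delta X = -2)$, where the last term is zero because at most one truck can leave. This gives $0.2 \times 0.5 + 0.5 \times 0.25 = 0.225$.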
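The full distribution of $X_1$ can be sketched by enumerating every combination of starting count and change. This assumes, as the problem statement implies, that the change $\Delta X$ is independent of $X_0$; exact fractions are used to avoid rounding noise.

```python
from collections import defaultdict
from fractions import Fraction

# Distribution of the starting count X_0 and of the daily change (delta).
p_x0 = {1: Fraction(20, 100), 2: Fraction(50, 100), 3: Fraction(30, 100)}
p_delta = {-1: Fraction(25, 100), 0: Fraction(50, 100), 1: Fraction(25, 100)}

# Accumulate P(X_1 = x0 + delta) over all (x0, delta) pairs,
# assuming the change is independent of the starting count.
p_x1 = defaultdict(Fraction)
for x0, p0 in p_x0.items():
    for delta, p_change in p_delta.items():
        p_x1[x0 + delta] += p0 * p_change

print({count: float(prob) for count, prob in sorted(p_x1.items())})
```

Here `p_x1[1]` equals 9/40, i.e. 22.5%, and no key above 4 ever appears, since at most one truck can be added to at most three.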

In [8]:
# Your answer here:

q6 = 22.5

**3.2.** What is the probability that $X_1 = 5$?

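**Solution:** Use the same approach as in 3.1. Since $X_0 \le 3$ and at most one truck can be added, $X_1 \le 4$, so every term is zero and the probability is 0.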

In [9]:
# Your answer here:

q7 = 0

**3.3.** What is the probability that $X_1 > X_0$?

**Solution:** $p(\Delta X > 0)$

In [10]:
# Your answer here:

q8 = 25

**3.4.** If there are 3 trucks at day one, what is the conditional probability that there were 2 trucks at day zero?

**Solution:** 


$p(X_0 = 2 \mid X_1 = 3) = \frac{p(X_0 = 2, X_1 = 3)}{p(X_1 = 3)}$

$p(X_0 = 2, X_1 = 3) = p(X_1 = 3 \mid X_0 = 2) p(X_0 = 2)$ and $p(X_1 = 3)$ can be calculated as in 3.1.
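The two steps above can be sketched with exact fractions. As before, this assumes the daily change is independent of $X_0$; the variable names are illustrative.

```python
from fractions import Fraction

p_x0 = {1: Fraction(20, 100), 2: Fraction(50, 100), 3: Fraction(30, 100)}
p_delta = {-1: Fraction(25, 100), 0: Fraction(50, 100), 1: Fraction(25, 100)}

# Joint P(X_0 = 2, X_1 = 3): start at 2 and gain one truck.
joint = p_x0[2] * p_delta[1]

# Marginal P(X_1 = 3): sum over every starting count that can reach 3.
p_x1_is_3 = sum(
    p0 * p_delta.get(3 - x0, Fraction(0)) for x0, p0 in p_x0.items()
)

posterior = joint / p_x1_is_3
print(posterior, round(float(posterior) * 100, 1))
```

The posterior comes out as 5/11, which rounds to 45.5%.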

In [11]:
# Answer here:

q9 = 45.5

# Real-data exploration

#### 4. Census Data

In this question, load the census data (bay_area_census_age.csv) and implement your method to answer the following questions:

**4.1.** Identify census tracts with a predominantly young population (we define a population as predominantly young if the portion of the population under 29 years old is more than 80% of the total population). Your output should be a list of census tract ids (this is the 'NAME' field in the data table).

**note: Once you find those tracts just enter the numbers below, for example if you find two tracts ('Census Tract 21', 'Census Tract 434'), your answer would be [21,434]**





**Solution:**

The idea behind this problem is to sum, across each row, the columns for the age groups under 29, which gives the young population of each tract. To do this, first filter the table to only include the columns of age groups that are under 29, then use the .apply method to sum the contents of each row. After that, divide that number by the total population and determine which of the tracts exceed the 80% threshold.
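The same steps can be sketched in pandas on a tiny made-up table. The three rows below are hypothetical and exist only to show the mechanics; the column names mirror the ones used in this notebook, and the comparison is strict (`>`) to match the "more than 80%" wording.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real table has one row per tract.
tracts = pd.DataFrame({
    "NAME": ["Census Tract A", "Census Tract B", "Census Tract C"],
    "Under 5 years": [10, 50, 5],
    "5 to 9 years": [10, 50, 5],
    "10 to 14 years": [10, 100, 5],
    "15 to 19 years": [10, 300, 5],
    "20 to 24 years": [10, 300, 5],
    "25 to 29 years": [10, 100, 5],
    "Total Population": [200, 1000, 100],
})

young_columns = ["Under 5 years", "5 to 9 years", "10 to 14 years",
                 "15 to 19 years", "20 to 24 years", "25 to 29 years"]

# Row-wise sum of the young age groups, then the share of the total in percent.
young_share = tracts[young_columns].sum(axis=1) / tracts["Total Population"] * 100
tracts["young perc"] = young_share

young_tracts = tracts.loc[tracts["young perc"] > 80, "NAME"].tolist()
print(young_tracts)
```

Only the made-up "Census Tract B" (900 of 1000 residents) passes the threshold in this toy table.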

In [21]:
from datascience import *

data = Table.read_table("bay_area_census_age.csv")

un29 = data.select(['Under 5 years', '5 to 9 years', '10 to 14 years',
                    '15 to 19 years', '20 to 24 years', '25 to 29 years'])
popsum = un29.apply(sum)
data['young population'] = popsum
data['young perc'] = data['young population'] / data['Total Population'] * 100
ff = data.where(data['young perc'] >= 80).select(['NAME'])
ff



NAME
Census Tract 4226
Census Tract 4227
Census Tract 4228
Census Tract 332.01
Census Tract 5009.02
Census Tract 5116.08
Census Tract 5130


In [11]:
# Answer here:

q10 = [4226,4227,4228,332.01,5009.02,5116.08,5130]

**4.2** Find the three closest tracts to Paris Baguette in Downtown Berkeley (latitude = 37.869941, longitude = -122.268377), and report the mean and standard deviation of total population in those tracts. 

**note: the function to find the distance of tracts to the specific lat/long is provided below. Please round your answers.**

In [22]:
def distance_on_sphere(lat1, long1, lat2, long2):

    # Convert latitude and longitude to spherical coordinates in radians.
    degrees_to_radians = np.pi/180.0
        
    # phi = 90 - latitude
    phi1 = (90.0 - lat1)*degrees_to_radians
    phi2 = (90.0 - lat2)*degrees_to_radians
        
    # theta = longitude
    theta1 = long1*degrees_to_radians
    theta2 = long2*degrees_to_radians
        
    # We can compute spherical distance from spherical coordinates.
    cos = (np.sin(phi1)*np.sin(phi2)*np.cos(theta1-theta2)+
           np.cos(phi1)*np.cos(phi2))
    arc = np.arccos( cos )

    # Multiply arc by the radius of the earth to get length.
    return 3960.*arc #to get distance in miles

def rotate_table(table):
    '''transforms a 2 x n table to be an n x 2 table'''
    return Table().with_columns(['Columns', list(table.labels),
                                 'Values', list(table.to_array()[0])])

**Solution:** Apply the distance_on_sphere function with the latitude and longitude of Paris Baguette as the first two arguments and the values from columns 'INTPTLAT10' and 'INTPTLON10' as the third and fourth. Add the result as a new column to the original table, sort the table by distance in ascending order, and take the first 3 rows. From there we can use built-in numpy functions to calculate the mean and standard deviation.
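A minimal sketch of the nearest-three selection with NumPy alone is shown below. The centroids and populations are hypothetical, the distance function repeats the provided formula, and the `np.clip` call is an addition made here to keep `arccos` inside its domain.

```python
import numpy as np

def distance_on_sphere(lat1, long1, lat2, long2):
    # Same spherical law of cosines as the provided helper, result in miles.
    to_rad = np.pi / 180.0
    phi1, phi2 = (90.0 - lat1) * to_rad, (90.0 - lat2) * to_rad
    theta1, theta2 = long1 * to_rad, long2 * to_rad
    cos = (np.sin(phi1) * np.sin(phi2) * np.cos(theta1 - theta2)
           + np.cos(phi1) * np.cos(phi2))
    # Clipping is an addition here: it guards arccos against rounding above 1.
    return 3960.0 * np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical tract centroids and populations, for illustration only.
lats = np.array([37.870, 37.872, 37.865, 37.800, 37.900])
lons = np.array([-122.268, -122.270, -122.265, -122.270, -122.300])
population = np.array([2000, 3000, 4000, 9000, 500])

dist = distance_on_sphere(37.869941, -122.268377, lats, lons)
nearest = np.argsort(dist)[:3]  # indices of the three smallest distances

mean_pop = population[nearest].mean()
std_pop = population[nearest].std()  # population std (ddof=0), as np.std does by default
print(sorted(nearest.tolist()), mean_pop, round(float(std_pop)))
```

Note that `np.std` divides by $n$ by default; pass `ddof=1` if the sample standard deviation is wanted instead.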

In [25]:
lat1, lon1 = 37.869941, -122.268377
data['distance to Paris'] = data.apply(lambda lat2, lon2 : distance_on_sphere(lat1, lon1, lat2, lon2), 
                                          ['INTPTLAT10', 'INTPTLON10'])


c = data.sort('distance to Paris')
d = c['Total Population'][0:3]




1935.0

In [16]:


# Enter mean here:

q11 = round(np.mean(d),0)

# Enter standard deviation here:

q12 = round(np.std(d),0)

#### 5. Traffic count on Bay Bridge

In this question, load the traffic data on the Bay Bridge (pems_output.csv) and implement your method for:


**5.1** identifying the best (most Flow) lane to travel on for the following period of time: 
between 8pm to 11pm

**note: the functions to get the hours (as we had in Lab 1) is given below** 

**If your answer is Lane 1, enter just number 1, and 2 or 3 or 4 for other lanes respectively.**

**Solution:** Note that the 'Hour' column is a timestamp, so you can use the pandas function to_datetime to extract the hour of the day. We filter the data to only include times between 8pm (20) and 11pm (23). Next we select only the flow columns for each lane. Then we take the sum of each column using np.sum and choose the lane with the highest flow.
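The filter-then-sum idea can be sketched in pure pandas as follows. The flows below are hypothetical and cover only two lanes; the column names mirror those in the dataset, and the explicit `format` string is an assumption about how the timestamps are written.

```python
import pandas as pd

# Hypothetical flows for illustration; column names mirror the dataset's.
flows = pd.DataFrame({
    "Hour": ["1/14/2016 19:00", "1/14/2016 20:00", "1/14/2016 21:00",
             "1/14/2016 22:00", "1/14/2016 23:00"],
    "Lane 1 Flow (Veh/Hour)": [900, 800, 700, 600, 500],
    "Lane 2 Flow (Veh/Hour)": [100, 900, 800, 700, 600],
})

# Parse the timestamp and keep the hour of day (0-23).
hour = pd.to_datetime(flows["Hour"], format="%m/%d/%Y %H:%M").dt.hour

# Keep rows whose hour stamp falls in 20..23, matching the filter in this notebook.
evening = flows[(hour >= 20) & (hour <= 23)]

lane_columns = [c for c in flows.columns if c.startswith("Lane")]
totals = evening[lane_columns].sum()
best_lane = totals.idxmax()
print(best_lane)
```

Swapping `idxmax` for `idxmin` gives the lane with the least flow over the same window.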

In [30]:
data = Table.read_table('pems_output.csv')
data

Hour,Lane 1 Flow (Veh/Hour),Lane 2 Flow (Veh/Hour),Lane 3 Flow (Veh/Hour),Lane 4 Flow (Veh/Hour),Lane 5 Flow (Veh/Hour),Flow (Veh/Hour),# Lane Points,% Observed
1/14/2016 0:00,34,347,372,291,119,1163,60,100
1/14/2016 1:00,20,199,295,230,74,818,60,100
1/14/2016 2:00,17,248,342,267,114,988,60,100
1/14/2016 3:00,158,427,433,347,164,1529,60,100
1/14/2016 4:00,883,1033,912,737,543,4108,60,100
1/14/2016 5:00,2037,1944,1734,1617,1594,8926,60,100
1/14/2016 6:00,1838,1844,1709,1715,1626,8732,60,100
1/14/2016 7:00,1790,1883,1760,1720,1627,8780,60,100
1/14/2016 8:00,1739,1820,1686,1621,1617,8483,60,100
1/14/2016 9:00,1709,1705,1681,1591,1631,8317,60,100


In [32]:
import pandas as pd
import matplotlib.pyplot as plots
%matplotlib inline
data['Hour day'] = pd.to_datetime(data['Hour']).hour
c = data.where((data['Hour day'] >= 20) & (data['Hour day'] <= 23))
d = c.select(['Lane 1 Flow (Veh/Hour)', 'Lane 2 Flow (Veh/Hour)', 'Lane 3 Flow (Veh/Hour)', 'Lane 4 Flow (Veh/Hour)','Lane 5 Flow (Veh/Hour)'])
f = Table.to_array(d)
f = f.tolist()
ff = np.sum(f,0)
np.max(ff)
np.sum(d)



Lane 1 Flow (Veh/Hour),Lane 2 Flow (Veh/Hour),Lane 3 Flow (Veh/Hour),Lane 4 Flow (Veh/Hour),Lane 5 Flow (Veh/Hour)
21698,28251,25420,21102,11924


In [19]:
# Your final answer here:

q13 = 2

**5.2.** identifying the worst (least flow) lane to travel on for the following period of time: 
between 5pm to 8pm

**Solution:** Replicate the work from 5.1 but instead for 5pm (17) and 8pm (20) and pick the lowest flow.


In [33]:
c = data.where((data['Hour day'] >= 17) & (data['Hour day'] <= 20))
d = c.select(['Lane 1 Flow (Veh/Hour)', 'Lane 2 Flow (Veh/Hour)', 'Lane 3 Flow (Veh/Hour)', 'Lane 4 Flow (Veh/Hour)','Lane 5 Flow (Veh/Hour)'])
np.sum(d)



Lane 1 Flow (Veh/Hour),Lane 2 Flow (Veh/Hour),Lane 3 Flow (Veh/Hour),Lane 4 Flow (Veh/Hour),Lane 5 Flow (Veh/Hour)
33020,38159,35711,32095,26025


In [23]:
# Your final answer here:

q14 = 5

#### 6. TAZ data

In this question you will play with some data from the Metropolitan Transportation Commission on travel time from one Traffic Analysis Zone (TAZ) to another. 

##### The Dataset

##### MTC travel skims

The Metropolitan Transportation Commission (MTC) is the regional transportation planning organization for the Bay Area. They host a database with average travel time, cost, and distance from each traffic analysis zone (TAZ) to all other TAZs in the Bay Area. The files have data for driving alone, carpooling, walking to transit, driving to transit, walking, and biking. 

We have pre-processed the data from the morning commute to include only TAZs around San Francisco, Oakland and Berkeley. The file with inter-TAZ travel time is sf_oak_traveltimes_bymode.csv.

More info on the dataset can be found here - http://analytics.mtc.ca.gov/foswiki/Main/SimpleSkims. 
The descriptions of the columns in the data set are shown below:

|column|description|
|---|---|
|origin|Origin transportation analysis zone|
|destination|Destination transportation analysis zone|
|drive alone|Door-to-door time for the drive alone travel mode (i.e. single occupant private automobile)|
|shared ride (2 people)|Door-to-door time for the shared ride 2 travel mode (i.e. double occupant private automobile)|
|shared ride (3 people)|Door-to-door time for the shared ride 3+ travel mode (i.e. three-or-more occupants traveling in a private vehicle)|
|walk|Door-to-door time for walking|
|bike|Door-to-door time for bicycling|
|walk-transit-walk|Door-to-door time for walk to transit to walk paths|
|drive-transit-walk|Door-to-door time for drive to transit to walk paths|
| walk-transit-drive|Door-to-door time for walk to transit to drive paths (returning home on a park-and-ride tour)|


(The raw data with all Bay Area TAZs can be found at https://mtcdrive.app.box.com/2015-03-116)


**6.1:** what is the drive alone travel time from origin TAZ 10 (in downtown SF) to destination TAZ 
1019 (the TAZ at UC Berkeley) according to this dataset?

**Solution:** First filter the data to only include rows that have 10 as the origin, then filter the data to only include data where the destination is 1019. This should result in a table with one row, where the 'drive alone' number is our desired result.

In [24]:
travel_data = Table.read_table("sf_oak_traveltimes_bymode.csv")
travel_data.where('origin',10).where('destination', 1019)['drive alone']

array([ 23.64])

In [20]:
# Answer here:

q15 = 23.64

**6.2:** what is the worst mode of travel (most travel time) for the above mentioned origin-destination according to the data? 

**note: if you get the travel time of -999, that means that mode is not available, simply disregard those**

**Please just enter the name of the travel mode as a string below**

**Solution:** First filter the data to only include rows with 10 as the origin and 1019 as the destination. This should give you a single row, in which you can manually check which mode has the longest travel time, ignoring the -999 entries.

In [25]:
travel_data.where('origin',10).where('destination', 1019)

origin,destination,drive alone,shared ride (2 people),shared ride (3 people),walk,bike,walk-transit-walk,drive-transit-walk,walk-transit-drive
10,1019,23.64,23.64,23.64,-999,-999,51.05,48.76,44.04


In [26]:
# Your final answer here:

q16 = 'walk-transit-walk'
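The manual check can also be done programmatically. A minimal sketch in plain Python, using the travel times from the row shown above:

```python
# Travel times for origin 10 -> destination 1019, copied from the row above.
times = {
    "drive alone": 23.64,
    "shared ride (2 people)": 23.64,
    "shared ride (3 people)": 23.64,
    "walk": -999,
    "bike": -999,
    "walk-transit-walk": 51.05,
    "drive-transit-walk": 48.76,
    "walk-transit-drive": 44.04,
}

# Drop unavailable modes (coded as -999) before comparing.
available = {mode: t for mode, t in times.items() if t != -999}
worst_mode = max(available, key=available.get)
print(worst_mode)
```

Filtering first matters for the opposite question too: without it, `min` would pick one of the unavailable modes.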

**6.3** Among the first 100 destinations, which one is the closest to origin 10, considering only the biking travel mode?

**note: exclude those destinations where biking is not available (travel time is -999)**

**Solution:** For this question, filter the data to only include the first 100 destinations, starting at origin 10, where the bike travel time is not -999, and sort it in ascending order. Choose the top option that is not 10 itself.

In [27]:

(travel_data
.where('destination', are.below_or_equal_to(100)) #First 100 destinations
.where('origin', 10) #Starting at origin 10
.where('bike', are.not_equal_to(-999)) #Only available data
.select('origin', 'destination', 'bike') #Only interested in where it starts, where it ends, and the bike travel time
.sort('bike') #Sort the data from closest to furthest 
)

origin,destination,bike
10,10,1.05
10,20,2.1
10,9,2.45
10,11,2.45
10,79,3.55
10,76,3.7
10,8,3.9
10,30,4.3
10,19,4.35
10,21,4.55


In [28]:
# Your final answer here:

q17 = 20
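Skipping the origin's own zone can be made explicit in code. A small sketch in plain Python, using the first rows of the sorted output above; the variable names are illustrative.

```python
# (destination, bike travel time) pairs for origin 10, taken from the
# first rows of the sorted output above.
bike_times = [(10, 1.05), (20, 2.1), (9, 2.45), (11, 2.45), (79, 3.55)]
origin = 10

# Exclude the origin's own zone, then take the smallest travel time.
closest, travel_time = min(
    ((dest, t) for dest, t in bike_times if dest != origin),
    key=lambda pair: pair[1],
)
print(closest, travel_time)
```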

#### Bonus question

###### Conditional independence

The notation $X \perp Y$ indicates that random variables $X$ and $Y$ are independent. Similarly, $X \perp Y \mid Z$ means that $X$ and $Y$ are conditionally independent given $Z$, that is $p(x, y \mid z) = p(x \mid z)p(y \mid z)$.

Are the following statements about conditional independence true or false? **If a statement is true answer with number 1, if false answer with number 0.**

- $(X \perp (Y,W) \mid Z)$ implies $X \perp Y \mid Z$

**Solution:**

$(X \perp (Y,W) \mid Z) \rightarrow p(X,Y,W \mid Z) = p(X \mid Z) p(Y,W \mid Z)$

Now we sum over $W$ on both sides:

$\sum_W p(X,Y,W \mid Z) = \sum_W p(X \mid Z) p(Y,W \mid Z)$

$\sum_W p(X,Y,W \mid Z) = p(X \mid Z) \sum_W  p(Y,W \mid Z)$

$p(X,Y \mid Z) = p(X \mid Z) p(Y\mid Z) \rightarrow X \perp Y \mid Z$
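The derivation can be spot-checked numerically for one fixed value of $Z$. The sketch below builds a random joint that satisfies the premise and confirms that summing out $W$ leaves a product form; the table sizes and the helper name `random_dist` are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(*shape):
    # Random positive table normalized to sum to 1.
    table = rng.random(shape)
    return table / table.sum()

# For one fixed value of Z, build a joint with X independent of (Y, W) given Z.
p_x = random_dist(3)       # p(x | z)
p_yw = random_dist(4, 5)   # p(y, w | z)
p_xyw = p_x[:, None, None] * p_yw[None, :, :]

# Marginalize W out of both sides, as in the derivation.
p_xy = p_xyw.sum(axis=2)   # p(x, y | z)
p_y = p_yw.sum(axis=1)     # p(y | z)

decomposition_holds = np.allclose(p_xy, np.outer(p_x, p_y))
print(decomposition_holds)
```

A passing check on random tables does not replace the proof, but a failing one would immediately expose an error in it.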



In [22]:
# Your answer here:

q18 = 1

- $(X \perp Y \mid Z)$ and $((X,Y) \perp W \mid Z)$ implies $X \perp W \mid Z$


**Solution:** 

$(X \perp Y \mid Z) \rightarrow p(X,Y \mid Z) = p(X \mid Z) p(Y \mid Z)$

$((X,Y) \perp W \mid Z) \rightarrow p(X,Y,W \mid Z) = p(X,Y \mid Z) p(W \mid Z)$

$p(X,Y,W \mid Z) = p(X \mid Z) p(Y \mid Z) p(W \mid Z)$

$\sum_Y p(X,Y,W \mid Z) = \sum_Y p(X \mid Z)  p(Y \mid Z) p(W \mid Z)$

$\sum_Y p(X,Y,W \mid Z) =  p(X \mid Z)  \sum_Y p(Y \mid Z) p(W \mid Z)$

$p(X,W \mid Z) =  p(X \mid Z) p(W \mid Z) \rightarrow X \perp W \mid Z$


In [29]:
# Your answer here:

q19 = 1

- $(X \perp (Y,W) \mid Z)$ and $(Y \perp W \mid Z)$ implies $(X,W) \perp Y \mid Z$

**Solution:** 

From $(X \perp (Y,W) \mid Z)$ we can conclude that $(X \perp W \mid Z)$ by summing over $Y$, in the same way as in the first statement, so $p(X \mid Z) p(W \mid Z) = p(X,W \mid Z)$.

So, $p(X,Y,W \mid Z) = p(X \mid Z) p(Y,W \mid Z) = p(X \mid Z) p(Y \mid Z) p(W \mid Z)$

$p(X,Y,W \mid Z) = p(X,W \mid Z) p(Y \mid Z) \rightarrow (X,W) \perp Y \mid Z$
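This statement can also be checked numerically for one fixed value of $Z$. The sketch below constructs a joint satisfying both premises and tests the claimed factorization; the table sizes and names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_dist(n):
    # Random probability vector of length n.
    v = rng.random(n)
    return v / v.sum()

p_x, p_y, p_w = random_dist(3), random_dist(4), random_dist(5)

# Premise 2: Y independent of W given Z, so p(y, w | z) = p(y | z) p(w | z).
p_yw = np.outer(p_y, p_w)
# Premise 1: X independent of (Y, W) given Z.
p_xyw = p_x[:, None, None] * p_yw[None, :, :]

# Claim: p(x, y, w | z) = p(x, w | z) p(y | z).
p_xw = p_xyw.sum(axis=1)           # marginalize Y, shape (3, 5)
p_y_marg = p_xyw.sum(axis=(0, 2))  # marginalize X and W, shape (4,)
claim_holds = np.allclose(p_xyw, p_xw[:, None, :] * p_y_marg[None, :, None])
print(claim_holds)
```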

In [30]:
# Your answer here:

q20 = 1

# Load OKpy

In [None]:
from client.api.notebook import Notebook
ok = Notebook('HW1.ok')
_ = ok.auth(inline=True)

# Submit to OKpy 

In [None]:
_ = ok.submit()