
# Advanced Certification Program in Deep Learning
## A program by IISc and TalentSprint
### Assignment 1: Probability and Statistics

## Learning Objectives

At the end of the experiment, you will be able to

* understand terms such as experiment, outcome, sample space and event, as they relate to probability
* check whether events are mutually exclusive
* understand the difference between dependent and independent events
* understand conditional probability and the chain rule
* understand independence and conditional independence
* understand Bayes' theorem


## Information

**Why do we need probability for Data Science?**

Learning probability helps in making informed decisions about the likelihood of events, based on patterns in collected data. In data science, statistical inference is often used to analyze or predict trends from data, and these inferences rely on probability distributions of the data. Using probability, we can model uncertainty in financial transactions and in many other business processes such as risk evaluation, sales forecasting and market research.

**Terminology**

The basic terms related to probability are as follows:

- **Experiment:** an action where the result is uncertain even though all the possible outcomes related to it are known in advance.
- **Outcome:**  a possible result of an experiment or trial.
- **Sample space:** the set of all possible outcomes associated with a random experiment.
- **Event:** a subset of the sample space, i.e. a set of one or more outcomes of an experiment.
- **Mutually exclusive events:** two events are mutually exclusive if the probability of both occurring simultaneously is zero.
- **Dependent events:** two events are dependent if the occurrence of the first changes the probability of the second.
- **Independent events:** two events are independent if the occurrence or non-occurrence of one does not affect the probability of the other.
- **Conditional probability:** a measure of the probability of an event occurring, given that another event has already occurred.
- **Conditional independence:** A and B are conditionally independent given C if and only if, given knowledge that C occurs, knowledge of whether A occurs provides no information on the likelihood of B occurring, and knowledge of whether B occurs provides no information on the likelihood of A occurring.

### Dataset

The dataset chosen for this assignment is [Productivity Prediction of Garment Employees](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees). The dataset is made up of 1197 records and 15 columns. It includes important attributes of the garment manufacturing process and the productivity of the employees. Some of the features are listed below
- date : date
- day : day of the Week
- quarter : a portion of the month. A month was divided into four or five quarters
- department : associated department with the instance
- team : associated team number with the instance

Here, we will use four features, *department*, *day* of the week, *quarter* of the month and *team* number, to cover the learning objectives and see how a selection on one feature affects a selection on another. We will also check whether events are dependent when they occur simultaneously and when they occur one after the other.

To know more about other features of the dataset click [here](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees).

### Setup Steps:


In [9]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2305642" #@param {type:"string"}

In [10]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "9480469334" #@param {type:"string"}

In [12]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython
import requests
import warnings
warnings.filterwarnings("ignore")

ipython = get_ipython()

notebook= "M1_AST_01_Probability_Statistics_C" #name of the notebook

def setup():
    ipython.magic("sx wget https://cdn.iisc.talentsprint.com/CDS/Datasets/garments_worker_productivity.csv")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer1() and getAnswer2() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer1" : Answer1, "answer2" : Answer2, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://dlfa-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer1():
  try:
    if not Answer1:
      raise NameError
    else:
      return Answer1
  except NameError:
    print ("Please answer Question 1")
    return None

def getAnswer2():
  try:
    if not Answer2:
      raise NameError
    else:
      return Answer2
  except NameError:
    print ("Please answer Question 2")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Importing required packages

In [13]:
import numpy as np
import pandas as pd
import scipy                        # scientific computation library
import matplotlib.pyplot as plt     # Visualization
import seaborn as sns               # Advanced Visualization with high level interface
from scipy import integrate         # several integration techniques
sns.set_style('whitegrid')

#### Loading the data

In [14]:
df_ = pd.read_csv('garments_worker_productivity.csv')

#### Explore and preprocess dataset

In [15]:
df_.head()

Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.8865
2,1/1/2015,Quarter1,sweing,Thursday,11,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057
3,1/1/2015,Quarter1,sweing,Thursday,12,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057
4,1/1/2015,Quarter1,sweing,Thursday,6,0.8,25.9,1170.0,1920,50,0.0,0,0,56.0,0.800382


In [17]:
# Consider only five features from dataset
df = df_[['date', 'quarter', 'department', 'day', 'team']]
# Consider records where 'day' is Monday, Thursday or Saturday
df_day = df[df['day'].isin(['Monday', 'Thursday', 'Saturday'])]

# Consider records where 'team' number is 1, 2 or 3
df_day_team = df_day[df_day['team'].isin([1, 2, 3])]
# Consider records where 'quarter' is 'Quarter1' or 'Quarter2'
df_day_team_quarter = df_day_team[df_day_team['quarter'].isin(['Quarter1', 'Quarter2'])]

# Reset the index and store dataset to 'df'
df = df_day_team_quarter.reset_index(drop= True)

In [18]:
# Check for unique values in department column
df['department'].unique()

array(['finishing ', 'sweing', 'finishing'], dtype=object)

In [19]:
# Remove extra space from 'finishing ' department column
df['department'] = df['department'].str.replace(' ', '', regex=False)

# Change department from 'sweing' to 'sewing'
df['department'] = df['department'].replace('sweing', 'sewing')

In [20]:
# Check for unique values in department column
df['department'].unique()

array(['finishing', 'sewing'], dtype=object)

In [21]:
# Display few rows of processed dataset
df.sample(5)

Unnamed: 0,date,quarter,department,day,team
46,2/7/2015,Quarter1,sewing,Saturday,3
39,2/5/2015,Quarter1,sewing,Thursday,2
25,1/10/2015,Quarter2,finishing,Saturday,2
21,1/8/2015,Quarter2,sewing,Thursday,3
36,2/2/2015,Quarter1,sewing,Monday,2


In [22]:
print('Dataset shape before processing: ', df_.shape)
print('Dataset shape after processing: ', df.shape)

Dataset shape before processing:  (1197, 15)
Dataset shape after processing:  (85, 5)


### Experiment

An experiment or trial is any procedure that can be infinitely repeated and has a well-defined set of possible outcomes. An experiment is said to be *random* if it has more than one possible outcome, and *deterministic* if it has only one. For example, selecting a record from the above dataset, tossing a coin and rolling a die are all random experiments.

**Exercise 1:** Select a record from the above given dataset.

In [23]:
i1 = np.random.randint(df.shape[0])      # random index in [0, number of records); the upper bound is exclusive
record = df.iloc[i1:i1+1, :]             # extract record for that index
record

Unnamed: 0,date,quarter,department,day,team
40,2/5/2015,Quarter1,sewing,Thursday,1


### Outcome

Each possible outcome of a particular experiment is unique, and different outcomes are mutually exclusive (only one outcome will occur on each trial of the experiment).

For the experiment where a coin is flipped twice, the four possible outcomes that make up the sample space are (H, T), (T, H), (T, T) and (H, H), where "H" represents a "heads", and "T" represents a "tails".

Similarly, in an experiment of selecting a record from a dataset, the outcome will be that record which got selected.

### Sample space

A sample space is usually denoted using set notation, and the possible outcomes are listed as elements in the set. It is common to refer to a sample space by the labels S, Ω, or U (for "universal set"). The elements of a sample space may be numbers, words, letters, or symbols. A sample space can be finite, countably infinite, or uncountably infinite.

For example, if the experiment is tossing a coin, the sample space is typically the set {head, tail}, commonly written {H, T}. For tossing two coins, the corresponding sample space would be {HH, HT, TH, TT}.
Similarly, for a random experiment of selecting a record from a dataset, all the rows make up its sample space.
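
The coin example above can be enumerated programmatically. The following is a minimal sketch using only the standard library; the variable names `coin` and `two_coins` are illustrative and not part of the assignment code.

```python
from itertools import product

# Sample space for tossing one coin
coin = ['H', 'T']

# Sample space for tossing two coins: every ordered pair of single-toss outcomes
two_coins = [''.join(pair) for pair in product(coin, repeat=2)]

print(two_coins)        # ['HH', 'HT', 'TH', 'TT']
print(len(two_coins))   # 4
```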

**Exercise 2:** Calculate the length of sample space for a random experiment of selecting a record from the above given dataset.

In [24]:
len(df.index)

85

### Event

An event is a set of outcomes of an experiment to which a probability is assigned. A single outcome may be an element of many different events, and different events in an experiment are usually not equally likely, since they may include very different groups of outcomes. Examples of events are getting an even number after rolling a die once and getting at least one head after tossing a coin twice.

**Exercise 3:** Getting a *finishing* department record is an event related to the experiment of selecting a record from the whole dataset. Extract a *finishing* department record.

In [25]:
df_finishing = df[df['department']=='finishing']
i2 = np.random.randint(df_finishing.shape[0])    # upper bound is exclusive, so every record can be picked
selection = df_finishing.iloc[i2:i2+1, :]
selection

Unnamed: 0,date,quarter,department,day,team
0,1/1/2015,Quarter1,finishing,Thursday,1


### Probability of an event

The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates the impossibility of the event and 1 indicates certainty. When all outcomes of an experiment are equally likely, the probability of an event is given by

 $\text{Probability of an event occurring} = \frac{\text{favorable outcomes}}{\text{total outcomes}}$
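
As a small illustration of the counting formula, the sketch below computes the probability of rolling an even number with a fair die. It assumes equally likely outcomes and uses `fractions.Fraction` to keep the result exact; the names `sample_space`, `even` and `p_even` are illustrative.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                  # one roll of a fair die
even = {x for x in sample_space if x % 2 == 0}     # event: the outcome is even

# favorable outcomes / total outcomes (valid because outcomes are equally likely)
p_even = Fraction(len(even), len(sample_space))

print(p_even, float(p_even))   # 1/2 0.5
```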

### Mutually exclusive events

Two events $A$ and $B$ are known as mutually exclusive if the probability of occurrence of both the events simultaneously is zero, i.e. $P(A \cap B) = 0$.

To know more about mutually exclusive events click [here](https://www.mathsisfun.com/data/probability-events-mutually-exclusive.html) .
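
When events are written as sets of outcomes, mutual exclusivity can be checked by testing whether the sets share any outcome. The sketch below uses a single die roll as a hypothetical example, separate from the dataset exercise that follows.

```python
sample_space = set(range(1, 7))     # one roll of a fair die
even = {2, 4, 6}
odd = {1, 3, 5}
greater_than_3 = {4, 5, 6}

# Disjoint sets cannot occur together, so P(A ∩ B) = 0
print(even.isdisjoint(odd))              # True  -> mutually exclusive
print(even.isdisjoint(greater_than_3))   # False -> share the outcomes 4 and 6

p_even_and_odd = len(even & odd) / len(sample_space)
print(p_even_and_odd)                    # 0.0
```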

**Exercise 4:** Show that selecting a *finishing* department record and selecting a *sewing* department record are two mutually exclusive events.

In [26]:
# Select records where department is 'finishing' as well as 'sewing' simultaneously
finishing_and_sewing = np.logical_and(df['department']=='finishing', df['department']=='sewing')
finishing_and_sewing.value_counts()

department
False    85
Name: count, dtype: int64

As seen above, there are no records where the department is *finishing* as well as *sewing* simultaneously.

**Note:** The *True* values are treated as 1 and *False* values are treated as 0. For example, *True+True = 2*.

In [27]:
# Probability of selecting finishing and sewing department records simultaneously
P = finishing_and_sewing.sum()/len(df)
print('P(selecting finishing and sewing department records simultaneously)= ', P)

P(selecting finishing and sewing department records simultaneously)=  0.0


Since the probability of both events occurring simultaneously is zero, the two events mentioned above are mutually exclusive.

Now, let's see the probability of selecting a *finishing* department record first and then a *sewing* department record.

### Dependent events

Two events are called dependent, if the outcome of the first affects the outcome of the second, such that the probability is changed.

To know more about dependent events click [here](https://corporatefinanceinstitute.com/resources/knowledge/other/dependent-events-vs-independent-events/#:~:text=Dependent%20events%20influence%20the%20probability,probability%20of%20another%20event%20happening.).

**Exercise 5:** A record is selected at random from the dataset. **Without replacing it, a second record is selected**. Show that getting a *finishing* department record in the first selection and getting a *sewing* department record in the second selection are dependent events.

**Hint:** Take two cases, one for getting the *finishing* department and another for not getting the *finishing* department in the first selection then check if probability for the second selection changes.

*Case 1:* Getting *finishing* department record in first selection and *sewing* department record in the second selection

In [28]:
# count of finishing department records
finishing = df['department']=='finishing'
finishing.value_counts()

department
False    47
True     38
Name: count, dtype: int64

In [29]:
df_finishing = df[finishing]
# Probability of selecting finishing department record first = count of finishing department records / all records count
P_finishing_first = len(df_finishing) / len(df)    # 38 / 85 = 0.4471
print('P(selecting a finishing department record first)= ', round(P_finishing_first,4))

P(selecting a finishing department record first)=  0.4471


In [30]:
# Randomly selecting any 'finishing' department record
i = np.random.randint(len(df_finishing))               # random position in [0, len); the upper bound is exclusive
selection = df_finishing.iloc[i:i+1, :]                # obtaining a single record with index i
selection

Unnamed: 0,date,quarter,department,day,team
1,1/1/2015,Quarter1,finishing,Thursday,2


In [31]:
# As one record is already selected, the total records available becomes one less than total records
df_new = df.drop(selection.index)

In [32]:
# count of sewing department records
sewing = df_new['department']=='sewing'
sewing.value_counts()

department
True     47
False    37
Name: count, dtype: int64

In [33]:
df_sewing = df_new[sewing]
# Probability of selecting sewing department record second = count of sewing department records / (all records count - 1) = 47 / 84 = 0.5595.
P_sewing_second_given_finishing_first = len(df_sewing) / len(df_new)
print('P(selecting a sewing department record given finishing department record was selected first)= ', round(P_sewing_second_given_finishing_first,4))

P(selecting a sewing department record given finishing department record was selected first)=  0.5595


Note: If the first record were put back before the second selection, the probability of selecting a sewing department record second would remain 47/85 whatever was selected first, which would make the two selections independent events.
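
The effect of replacement can be made explicit with exact fractions. This sketch only reuses the counts printed in the cells above (38 finishing and 47 sewing records); the variable names are illustrative.

```python
from fractions import Fraction

n_finishing, n_sewing = 38, 47        # counts reported in the outputs above
n_total = n_finishing + n_sewing      # 85

# Without replacement: the first pick changes what is left for the second pick
p_sewing_after_finishing = Fraction(n_sewing, n_total - 1)       # 47/84
p_sewing_after_sewing = Fraction(n_sewing - 1, n_total - 1)      # 46/84

# With replacement: the dataset is restored, so the first pick is irrelevant
p_sewing_with_replacement = Fraction(n_sewing, n_total)          # 47/85

print(round(float(p_sewing_after_finishing), 4))    # 0.5595
print(round(float(p_sewing_after_sewing), 4))       # 0.5476
print(round(float(p_sewing_with_replacement), 4))   # 0.5529
```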

In [34]:
P_finishing_sewing = P_finishing_first * P_sewing_second_given_finishing_first
print('P(finishing record first and sewing record second)= ', round(P_finishing_sewing,4))

P(finishing record first and sewing record second)=  0.2501


*Case 2:* Getting non-*finishing* department record in first selection and *sewing* department record in the second selection

In [35]:
# count of non-finishing department records
non_finishing = df['department']!='finishing'
non_finishing.value_counts()

department
True     47
False    38
Name: count, dtype: int64

In [36]:
df_non_finishing = df[non_finishing]
# Probability of selecting non-finishing department record first = count of non-finishing department records / all records count
P_non_finishing_first = len(df_non_finishing) / len(df)           # 47 / 85 = 0.5529
print('P(selecting a non-finishing department record first)= ', round(P_non_finishing_first,4))

P(selecting a non-finishing department record first)=  0.5529


In [37]:
# Randomly selecting any non-'finishing' department record
i = np.random.randint(len(df_non_finishing))    # upper bound is exclusive, so every record can be picked
selection = df_non_finishing.iloc[i:i+1, :]
selection

Unnamed: 0,date,quarter,department,day,team
54,2/12/2015,Quarter2,sewing,Thursday,1


In [38]:
# As one record is already selected, the records available becomes one less than total records
df_new = df.drop(selection.index)

In [39]:
# count of sewing department records
sewing = df_new['department']=='sewing'
sewing.value_counts()

department
True     46
False    38
Name: count, dtype: int64

In [40]:
df_sewing = df_new[sewing]
# Probability of selecting sewing department record second = count of sewing department records / (all records count - 1) = 46 / 84 = 0.5476
P_sewing_second_given_non_finishing_first = len(df_sewing) / len(df_new)
print('P(selecting a sewing department record given non-finishing department record was selected first)= ', round(P_sewing_second_given_non_finishing_first,4))

P(selecting a sewing department record given non-finishing department record was selected first)=  0.5476


In [41]:
P_non_finishing_sewing = P_non_finishing_first * P_sewing_second_given_non_finishing_first
print('P(non-finishing record first and sewing record second)= ', round(P_non_finishing_sewing,4))

P(non-finishing record first and sewing record second)=  0.3028


In [42]:
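# Check for dependency: the conditional probability of a sewing record in the
# second selection differs depending on what was selected first
P_sewing_second_given_finishing_first != P_sewing_second_given_non_finishing_first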

True

As we see, selecting the second record without replacing the first record in the dataset changed the probability of the selection of the second record. This indicates that these are dependent events.
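
As a sanity check on the two cases, the sketch below recomputes them with exact fractions from the counts shown above (38 finishing, 47 sewing). It compares the two conditional probabilities directly and then adds the two joint probabilities; summing over both cases is the law of total probability, which is not named in this notebook and is used here only as a cross-check. All variable names are illustrative.

```python
from fractions import Fraction

p_f1 = Fraction(38, 85)              # P(finishing first)
p_nf1 = Fraction(47, 85)             # P(non-finishing first)
p_s2_given_f1 = Fraction(47, 84)     # P(sewing second | finishing first)
p_s2_given_nf1 = Fraction(46, 84)    # P(sewing second | non-finishing first)

# Dependence: the conditional probability changes with the first selection
print(p_s2_given_f1 != p_s2_given_nf1)    # True

# Adding both cases gives the overall probability of a sewing record second
p_s2 = p_f1 * p_s2_given_f1 + p_nf1 * p_s2_given_nf1
print(p_s2)                               # 47/85
```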

So far, the selections were made from a common dataset. Let's see what happens if they are made from different subsets of the dataset.

### Independent events

Two events $A$ and $B$ are called independent if the occurrence of $A$ does not affect the probability of $B$. For independent events,

$P(A \cap B) = P(A)P(B)$

To know more about independent events click [here](https://corporatefinanceinstitute.com/resources/knowledge/other/dependent-events-vs-independent-events/#:~:text=Dependent%20events%20influence%20the%20probability,probability%20of%20another%20event%20happening.).
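
The product rule for independent events can be verified by enumeration. The sketch below uses a hypothetical experiment (one coin toss followed by one die roll), not the garment dataset; the helper `prob` and the event functions are illustrative names.

```python
from fractions import Fraction
from itertools import product

# Joint sample space: 2 coin faces x 6 die faces = 12 equally likely outcomes
space = list(product('HT', range(1, 7)))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for outcome in space if event(outcome)), len(space))

def heads(outcome):
    return outcome[0] == 'H'

def six(outcome):
    return outcome[1] == 6

p_heads = prob(heads)                              # 1/2
p_six = prob(six)                                  # 1/6
p_both = prob(lambda o: heads(o) and six(o))       # 1/12

print(p_both == p_heads * p_six)                   # True -> independent
```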

**Exercise 6:** One record is selected from those whose day of week is *Monday*, and another record is selected from those whose day of week is *Saturday*. Find the probability of getting a *finishing* department record from the first selection and a *sewing* department record from the second selection, given that the two selections are independent of each other.

In [43]:
# Display different department and day of week
print('Department: ',df['department'].unique())
print('Day: ',df['day'].unique())

Department:  ['finishing' 'sewing']
Day:  ['Thursday' 'Saturday' 'Monday']


In [44]:
# Select records having day = 'Monday'
df_monday = df[df['day']=='Monday']

P_finishing_from_monday = len(df_monday[df_monday['department']=='finishing']) / len(df_monday)
print('P(selecting finishing department record from Monday records)= ', round(P_finishing_from_monday,4))

P(selecting finishing department record from Monday records)=  0.4688


In [45]:
# Select records having day = 'Saturday'
df_saturday = df[df['day']=='Saturday']

P_sewing_from_saturday = len(df_saturday[df_saturday['department']=='sewing']) / len(df_saturday)
print('P(selecting sewing department record from Saturday records)= ', round(P_sewing_from_saturday,4))

P(selecting sewing department record from Saturday records)=  0.5769


In [46]:
# As events are independent,
P_finishing_and_sewing = P_finishing_from_monday * P_sewing_from_saturday
print('P(getting finishing department from first selection and sewing department from second selection)= ', round(P_finishing_and_sewing,4))

P(getting finishing department from first selection and sewing department from second selection)=  0.2704


Earlier we saw that the elements of a sample space can be numbers, words, letters, or symbols. Let's see how we can map them to the set of real numbers.

**Why do we need conditional probability?**

Conditional probability is an essential quantity in a wide range of domains, including classification, decision theory, prediction and diagnostics. That is because one typically makes the classification, decision or prediction based on some evidence. Thus, what one wants to know is the probability of the result given the evidence.

### Conditional Probability

The conditional probability of an event $A$ in relationship to an event $B$ is the probability that event $A$ occurs given that event $B$ has already occurred. The notation for conditional probability is $P(A|B)$ i.e. the probability of occurrence of event $A$, given that $B$ has already occurred.


Furthermore, the conditional probability of event $A$ given event $B$ is calculated as follows:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$, where $P(B) \neq 0$

where $P(A \cap B)$ is the probability of events $A$ and $B$ occurring together, and

$P(B)$ is the probability of observing $B$.
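
The definition translates directly into a small helper when events are sets of equally likely outcomes. This is a sketch under that assumption; the function name `conditional_probability` and the example numbers (integers 1 to 10) are hypothetical.

```python
from fractions import Fraction

def conditional_probability(a, b, sample_space):
    """P(A|B) = P(A ∩ B) / P(B), assuming equally likely outcomes."""
    p_b = Fraction(len(b & sample_space), len(sample_space))
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    p_a_and_b = Fraction(len(a & b & sample_space), len(sample_space))
    return p_a_and_b / p_b

# Hypothetical example: pick an integer from 1..10
space = set(range(1, 11))
a = {3, 6, 9}              # multiples of 3
b = {6, 7, 8, 9, 10}       # numbers greater than 5

print(conditional_probability(a, b, space))   # 2/5
```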


**Special Cases of Conditional Probability**

1. A and B are disjoint: Here $A \cap B = \emptyset$, which means that A and B cannot occur at the same time.

  $P(A|B) = \frac{P(A \cap B)} {P(B)}$

  $P(A|B) = \frac{P(\emptyset)} {P(B)} = 0$

2. B is a subset of A: If B is a subset of A, then whenever B happens, definitely A must have happened too. Therefore, $A \cap B = B$

  $P(A|B) = \frac{P(A \cap B)} {P(B)}$

  $P(A|B) = \frac{P(B)} {P(B)} = 1$, since $A \cap B = B$


3. A is a subset of B: Here, $A \cap B = A$

  $P(A|B) = \frac{P(A \cap B)} {P(B)}$

  $P(A|B) = \frac{P(A)} {P(B)}$, since $A \cap B = A$
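
The three special cases can be checked numerically. The sketch below uses one roll of a fair die with hand-picked, hypothetical events; `p` and `p_given` are illustrative helper names.

```python
from fractions import Fraction

space = set(range(1, 7))          # one roll of a fair die

def p(event):
    return Fraction(len(event), len(space))

def p_given(a, b):
    return p(a & b) / p(b)

# 1. A and B are disjoint -> P(A|B) = 0
print(p_given({1, 2}, {5, 6}))          # 0

# 2. B is a subset of A -> P(A|B) = 1
print(p_given({2, 4, 6}, {2, 4}))       # 1

# 3. A is a subset of B -> P(A|B) = P(A) / P(B)
a, b = {2}, {2, 4, 6}
print(p_given(a, b), p(a) / p(b))       # 1/3 1/3
```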


**Exercise 1:**  A fair die is rolled. Assume, A is an event that the outcome is an even number, i.e., A={2, 4, 6}. Next, let B be the event where the outcome is greater than or equal to 4, i.e., B={4, 5, 6}. Find the following:

  a. Probability of A
  
  b. Probability of A given B

  Calculating the probability of occurrence of event A, i.e. $P(A)$

In [47]:
# Total number of outcomes of a die is 6
S = 6
# Total number of outcomes of event A having even numbers i.e. A = {2, 4, 6} is 3.
A = 3
# Probability of getting an even number when a die is rolled
pa = A/S
print(pa)

0.5


Calculating probability of event A given B i.e. $P(A|B)$

Since all outcomes of a fair die are equally likely, the conditional probability can be computed by counting outcomes: $P(A|B) = \frac{\left | A \cap B \right |}{\left | B \right |}$

Given that event B has occurred, the outcome is one of {4, 5, 6}. Event A is {2, 4, 6}. Therefore, the outcomes of A that remain possible once B has occurred are $A \cap B$ = {4, 6}.

In [48]:
# Number of outcomes in A intersection B, i.e. |{4, 6}| = 2
A_B = 2
# Number of outcomes in B, i.e. |{4, 5, 6}| = 3
B = 3
# Probability of event A given B is calculated as follows
P_AB = (A_B)/B
print(P_AB)

0.6666666666666666


**Exercise 2:**  A card is drawn randomly from a deck of 52 cards. Find the probability of getting a *king* given it is a *red* card.

Let, $K$ – represent the event of getting a king card

$R$ – represent the event of getting a red card

Then the probability of getting a king given it is a red card can be shown as :

$P(K|R) = \frac{P(K \cap R)}{P(R)}$

where,

$P(R)$ – represents probability of getting a red card

$P(K \cap R)$ – represents probability of getting a king and red card simultaneously

In [49]:
P_R = 26/52     # a deck contains half red and half black cards
P_K_and_R = 2/52    #   a deck contains only two king cards which are red
P_K_given_R = P_K_and_R / P_R
print('The probability of getting a king given it is a red card= ', round(P_K_given_R, 4))

The probability of getting a king given it is a red card=  0.0769
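
The same result can be cross-checked by building the deck and counting. This is a sketch that assumes a standard 52-card deck with hearts and diamonds as the red suits; the names `ranks`, `suits`, `deck`, `red` and `red_kings` are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))                            # 52 cards

red = [card for card in deck if card[1] in ('hearts', 'diamonds')]
red_kings = [card for card in red if card[0] == 'K']

# P(K|R) = |K ∩ R| / |R| because every card is equally likely
p_king_given_red = Fraction(len(red_kings), len(red))

print(p_king_given_red, round(float(p_king_given_red), 4))    # 1/13 0.0769
```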


### Chain Rule of Conditional Probabilities

By now, we know about conditional probability. What if we are interested in the probability of an intersection such as $A \cap B$ or $A_{1} \cap A_{2} \cap \cdots \cap A_{n}$? For this, let us use the formula of conditional probability to derive the "**chain rule of conditional probability**".

Rewriting Conditional Probability in the below format:

  $P(A \cap B) = P(A)P(B|A) = P(B)P(A|B) \qquad (1)$

  For three events, treating $B \cap C$ as a single event gives:

  $P(A \cap B \cap C) = P(A \cap (B \cap C)) = P(A)P(B \cap C|A)$

  Now, from equation 1, we know $P(B \cap C) = P(B)P(C|B)$.

  Conditioning both sides on $A$, we get

  $P(B \cap C|A) = P(B|A)P(C|A,B) \qquad (2)$

  Substituting equation 2 into the three-event expression, we get

  $P(A \cap B \cap C) = P(A)P(B|A)P(C|A,B)$

  Finally, the general formula for n events will be

  $P(A_{1} \cap A_{2} \cap \cdots \cap A_{n}) = P(A_{1})P(A_{2}|A_{1})P(A_{3}|A_{1},A_{2}) \cdots P(A_{n}|A_{1},A_{2},\ldots,A_{n-1})$
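
To see the chain rule at work, the sketch below multiplies the successive conditional probabilities for drawing items without replacement and compares the product with a direct combinatorial count. The example (drawing three aces from a 52-card deck) and the function name `prob_all_from_group` are hypothetical; `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def prob_all_from_group(group_size, population_size, draws):
    """Chain rule for draws without replacement, where A_i is the event
    'the i-th draw comes from the group':
    P(A_1 ∩ ... ∩ A_n) = P(A_1) P(A_2|A_1) ... P(A_n|A_1,...,A_{n-1})."""
    p = Fraction(1)
    for i in range(draws):
        # After i successful draws, both the group and the population shrink by i
        p *= Fraction(group_size - i, population_size - i)
    return p

p_three_aces = prob_all_from_group(4, 52, 3)     # 4/52 * 3/51 * 2/50
print(p_three_aces)                              # 1/5525

# Cross-check by counting: ways to choose 3 aces / ways to choose any 3 cards
print(p_three_aces == Fraction(comb(4, 3), comb(52, 3)))   # True
```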


**Exercise 3:** There is a tyre manufacturing factory that produces 250 units per month, 17 of which are defective. We pick 3 units out of 250 units at random. What is the probability that none of them are defective?

**Explanation**: Here, we are picking up 3 units at random. This can be considered as 3 events $A_{1}$, $A_{2}$, and $A_{3}$ of picking up non-defective units. Thus, we have to find the $P(A_{1} \cap A_{2} \cap A_{3})$.

Let us first find the probability of event $A_{1}$

In [50]:
# Declaring units of production, total number of non-defective units, and number of defective units
total_units = 250
defec_u = 17
n_defec = total_units - defec_u # There are 233 not defective units
# Probability of picking up the non-defective units for the first time (Event A1)
P1 = n_defec/total_units
print(P1)

0.932


Now, the next item will be chosen from 232 not defective units and 17 defective units. This means we have to calculate $P(A_{2}|A_{1})$

In [51]:
# The non-defective units will be 1 less than the previous one as we have already picked up one unit
n_defec_2 = n_defec - 1
# Probability of picking up the non-defective units for the second time (Event A2)
P2 = n_defec_2 / (total_units-1)
print(P2)

0.9317269076305221


Further, given that $A_{1}$ and $A_{2}$ have occurred, we have to find the probability of choosing a non-defective unit as the third unit, i.e. $P(A_{3}|A_{1},A_{2})$.

The third unit will be chosen from 231 not defective units and 17 defective units.

In [52]:
# Again, the non-defective units will be 1 less than the previous one (Event A3)
n_defec_3 = n_defec_2 - 1
P3 = n_defec_3/(total_units-2)
print(P3)

0.9314516129032258


The final probability will be calculated as per the below formula:

 $P(A_{1} \cap A_{2} \cap A_{3}) = P(A_{1})P(A_{2}|A_{1})P(A_{3}|A_{1},A_{2})$

In [53]:
# Final probability as per the condition
p_final = P1*P2*P3
print(p_final)

0.8088441507967353
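
The chain-rule result can be cross-checked by counting: the number of ways to choose 3 non-defective units divided by the number of ways to choose any 3 units. This sketch uses only the figures from the exercise (250 units, 17 defective); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

total_units, defective = 250, 17
good = total_units - defective        # 233 non-defective units

# Chain rule: P(A1) * P(A2|A1) * P(A3|A1,A2)
p_chain = (Fraction(good, total_units)
           * Fraction(good - 1, total_units - 1)
           * Fraction(good - 2, total_units - 2))

# Direct count of favorable and total selections of 3 units
p_count = Fraction(comb(good, 3), comb(total_units, 3))

print(p_chain == p_count)             # True
print(round(float(p_chain), 4))       # 0.8088
```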


### Independence of Events

An independent event is one that has no effect on the probability of another event occurring (or not occurring). In other words, when two events are independent, knowing that one has occurred does not change the probability of the other.


Two events A and B are independent if and only if $P(A \cap B)=P(A)P(B)$.

**Exercise 4:** Pick up a random number from the set {1, 2, 3,...,10}, and call it N. Assume that all outcomes are equally likely.

Let us consider two events A and B. Let A be the event that N is less than 5, and let B be the event that N is an odd number. Check whether A and B are independent.

**Hint:** Event A = {1,2,3,4}, B = {1,3,5,7,9}, and $A \cap B$ = {1,3}

In [54]:
# Total number of outcomes
t_events = 10
# Probability of event A = {1,2,3,4} is |A|/t_events. Here, |A| is the number of outcomes in A, i.e. 4
A = 4
p_A = A/t_events
# Probability of event B = {1,3,5,7,9} is |B|/t_events. Here, |B| is the number of outcomes in B, i.e. 5
B = 5
p_B = B/t_events
# Probability of event A intersection B. Here, A intersection B is {1,3} and |A intersection B| = 2.
i_AB = 2
p_AB = i_AB/t_events

In [55]:
# Checking the independence (np.isclose avoids an exact comparison of floats)
if np.isclose(p_A*p_B, p_AB):
  print("A and B are independent events")
else:
  print("A and B are not independent events")

A and B are independent events
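
The same check can be done by building the events as Python sets instead of entering their sizes by hand. This is a minimal sketch: the names `event_A`, `event_B`, and the helper `prob` are hypothetical, and `Fraction` is used so that the comparison is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction

# Sample space: the numbers 1..10, all equally likely
sample_space = set(range(1, 11))

event_A = {n for n in sample_space if n < 5}       # N is less than 5
event_B = {n for n in sample_space if n % 2 == 1}  # N is odd


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))


# A and B are independent iff P(A and B) == P(A) * P(B)
independent = prob(event_A & event_B) == prob(event_A) * prob(event_B)
print(prob(event_A), prob(event_B), prob(event_A & event_B), independent)
```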


### Conditional Independence

Previously, we saw that two events A and B are independent if:

$P(A\cap B) = P(A)P(B)$

Further, this can be extended using conditional probability as follows:

Two events A and B are conditionally independent given an event C with P(C)>0 if

$P(A\cap B|C)=P(A|C)P(B|C)$

**Exercise 5:** A die is rolled and the following events are defined on the outcome:

  A = {1,2}

  B = {2,4,6}

  C = {1,4}

  Determine whether A and B are conditionally independent given C.

  Hint: For conditional independence, the following should be true:

  $P(A \cap B|C) = P(A|C)P(B|C)$


In [56]:
# Number of outcomes of a die, and sizes of events A, B, and C
t_outcome = 6
A_outcome = 2
B_outcome = 3
C_outcome = 2
# A intersection C = {1}, so |A intersection C| = 1
A_in_C = 1
# Probability of event A given C: P(A|C) = |A intersection C| / |C|
P_AC = A_in_C/C_outcome
# B intersection C = {4}, so |B intersection C| = 1
B_in_C = 1
# Probability of event B given C: P(B|C) = |B intersection C| / |C|
P_BC = B_in_C/C_outcome

Now, let us calculate $P(A\cap B|C)$. To do this, we first find the intersection of A and B.

Given A = {1,2} and B = {2,4,6}, let us consider an event E:

$E = A\cap B$ = {2}

Then $P(A\cap B|C)$ can be rewritten as $P(E|C)$ and calculated as follows:

In [57]:
# E intersection C = {2} intersection {1,4} is empty, so |E intersection C| = 0
E_in_C =  0
# Probability of E given C: P(E|C) = |E intersection C| / |C|
P_EC = E_in_C/C_outcome

In [58]:
# Checking the conditional independence (abs_tol is needed because P_EC can be 0)
import math
if math.isclose(P_AC*P_BC, P_EC, abs_tol=1e-12):
  print("A and B are conditionally independent given C")
else:
  print("A and B are not conditionally independent given C")

A and B are not conditionally independent given C
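
The hand-counted intersections above can also be derived directly from the sets. The sketch below assumes, as the calculation above does, that A, B, and C are subsets of the outcomes of one roll of a fair die; the names `event_A`, `event_B`, `event_C`, and `cond_prob` are hypothetical.

```python
from fractions import Fraction

# Events as given in Exercise 5, treated as subsets of the outcomes of one roll
event_A = {1, 2}
event_B = {2, 4, 6}
event_C = {1, 4}


def cond_prob(event, given):
    """P(event | given) for equally likely outcomes: |event & given| / |given|."""
    return Fraction(len(event & given), len(given))


lhs = cond_prob(event_A & event_B, event_C)                      # P(A and B | C)
rhs = cond_prob(event_A, event_C) * cond_prob(event_B, event_C)  # P(A|C) P(B|C)
print(lhs, rhs, lhs == rhs)
```

Because the two sides differ, the sketch reaches the same conclusion as the cell above: A and B are not conditionally independent given C.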


### Bayes Theorem

We already know how to calculate conditional probability. Now suppose that we know P(A|B) but are interested in the probability P(B|A). We can calculate it using Bayes Theorem:

$P(B|A) = \frac {P(A|B)P(B)}{P(A)}$ where $P(A)\neq0$

It can also be written as:

$P(B_{j}|A) = \frac {P(A|B_{j})P(B_{j})}{\sum_{i} P(A|B_{i})P(B_{i})}$

where $B_{1},B_{2},...,B_{n}$ form a partition of the sample space.
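
The two forms agree because the denominator of the second is $P(A)$ expanded with the law of total probability. Since the events $B_{i}$ are mutually exclusive and together cover the sample space,

$P(A) = \sum_{i} P(A \cap B_{i}) = \sum_{i} P(A|B_{i})P(B_{i})$

and substituting this for $P(A)$ in the first formula (with $B = B_{j}$) gives the second.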



**Exercise 6:** Suppose there are three bags where each bag contains 100 marbles:

  a). Bag 1 has 75 red and 25 blue marbles

  b). Bag 2 has 60 red and 40 blue marbles

  c). Bag 3 has 45 red and 55 blue marbles

  We choose one of the bags at random and then pick a random marble from the chosen bag. Further, we observe that the chosen marble is red. What is the probability that the bag chosen was Bag 1?



In [59]:
# Prior probability of choosing Bag 1, Bag 2, and Bag 3 (each bag is equally likely)
P_B1 = 1/3
P_B2 = 1/3
P_B3 = 1/3

# Probability of picking a red marble given Bag 1, Bag 2, and Bag 3
PR_B1 = 0.75
PR_B2 = 0.60
PR_B3 = 0.45

# Probability of the event where chosen marble is red
P_Red = PR_B1*P_B1 + PR_B2*P_B2 + PR_B3*P_B3
print(P_Red)

0.6


Here, we know $P(R|B_{i})$ for each bag. To calculate the probability that Bag 1 was chosen given that the marble is red, $P(B_{1}|R)$, we apply **Bayes rule**:

$P(B_{1}|R) = \frac{P(R|B_{1})P(B_{1})}{P(R)}$

In [60]:
# Probability that Bag 1 was chosen
P_B1R = (PR_B1*P_B1)/P_Red
print(P_B1R)

0.4166666666666667
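
The same calculation extends to every bag at once. The sketch below is one possible way to organise it, using the priors and likelihoods from the exercise; the function name `posterior` and the variable names are hypothetical, and exact fractions are used so that the results can be read off directly.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes rule over a partition: returns P(B_j | A) for every j."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A | B_j) P(B_j)
    evidence = sum(joint)                                 # P(A), by total probability
    return [j / evidence for j in joint]


bag_priors = [Fraction(1, 3)] * 3  # each bag is equally likely to be chosen
red_given_bag = [Fraction(75, 100), Fraction(60, 100), Fraction(45, 100)]

bag_given_red = posterior(bag_priors, red_given_bag)
print(bag_given_red)  # posteriors for Bag 1, Bag 2, Bag 3
```

The first entry, 5/12, matches the value printed above, and the three posteriors sum to 1.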


**Exercise 7:** Suppose that two factories supply machines to the market. Factory X's machines work for over 5000 hours in 99% of cases, whereas factory Y's machines work for over 5000 hours in 95% of cases. It is known that factory X supplies 60% of the machines available and factory Y supplies the remaining 40%. What is the chance that a purchased machine was manufactured by factory X given that it works for longer than 5000 hours?

The desired probability can be expressed using Bayes theorem as:

$P(B_X|A) = \frac{P(A|B_X)P(B_X)}{P(A)}$

where $P(B_X|A)$ is the probability that a machine was manufactured by factory X given that it works for over 5000 hours, and

   $P(A|B_X)$ is the probability that a machine works for over 5000 hours given that it was manufactured by factory X.

In [61]:
# According to question,
P_BX = 0.6
P_BY = 0.4
P_A_given_BX = 0.99
P_A_given_BY = 0.95

# Using total probability theorem
P_A = P_BX * P_A_given_BX + P_BY * P_A_given_BY

In [62]:
# Bayes theorem: P(B_X|A) = P(A|B_X) P(B_X) / P(A)
P_BX_given_A = P_A_given_BX * P_BX / P_A
print('The chance that the machine was manufactured by factory X given that it works for longer than 5000 hours = ', round(P_BX_given_A*100, 2), '%')

The chance that the machine was manufactured by factory X given that it works for longer than 5000 hours =  60.99 %
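
As a sanity check on the result above, the following sketch estimates the same conditional probability by simulation. The seed and the number of simulated machines are arbitrary choices made for this illustration and are not part of the exercise.

```python
import random

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
n_machines = 200_000    # assumed sample size for the simulation

from_x_and_long = 0  # machines from factory X that last over 5000 hours
long_lived = 0       # all machines that last over 5000 hours

for _ in range(n_machines):
    from_x = rng.random() < 0.6        # factory X supplies 60% of machines
    p_long = 0.99 if from_x else 0.95  # P(works over 5000 hours | factory)
    if rng.random() < p_long:
        long_lived += 1
        from_x_and_long += from_x      # True counts as 1, False as 0

# Fraction of long-lived machines that came from factory X
estimate = from_x_and_long / long_lived
print(round(estimate * 100, 2), '%')
```

With this many simulated machines, the estimate should land close to the value obtained from Bayes theorem.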


### Please answer the questions below to complete the experiment:




In [63]:
#@title Q.1. There are two identical bags containing respectively 6 black and 4 red balls, 2 black and 2 red balls. A bag is chosen at random and a ball is drawn from it. If the ball is black, what is the probability that it is from the first bag?
Answer1 = "" #@param ["","3/11", "6/11","5/11","10/11"]


In [None]:
#@title Q.2. A bucket contains 6 blue, 8 red and 9 black pens. If six pens are drawn one by one without replacement, find the probability of getting all black pens? { run: "auto", form-width: "500px", display-mode: "form" }
Answer2 = "4/4807" #@param ["","8/213", "8/4807", "7/4328", "4/4807"]


In [65]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]



In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Didn't use" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Didn't use" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")