# Boolean algebra

In [None]:
%reload_ext postcell
%postcell register

In [None]:
import pandas as pd
from sklearn import datasets

Please also see notebook "basic conditionals and None"

### Boolean algebra

Python provides a data type, called boolean or `bool`, to represent `True` and `False` values.

Data types such as `int`, `float` and `str` can represent a huge range of values. However, `bool` can only take on two values: `True` and `False`.

Just as numbers come with operations such as `+`, `-`, `*`, etc. and strings have built-in functions such as `startswith` and `capitalize`, booleans come with their own operations: `and`, `or` and `not`.
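A minimal sketch of the three operators on plain Python values (the variable names below are illustrative and not part of the original notebook):

```python
# bool is a built-in type with exactly two values.
print(type(True))              # <class 'bool'>

# Hypothetical flags, used only for illustration.
is_raining = True
has_umbrella = False

stays_dry = (not is_raining) or has_umbrella   # False: it rains and there is no umbrella
gets_wet = is_raining and not has_umbrella     # True

print(stays_dry, gets_wet)     # False True
```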

Variables can be assigned boolean values just like any other data type:

In [None]:
homer_is_a_dad = True
marge_is_a_mom = True

marge_is_a_dad = False
homer_is_a_mom = False

More practically, booleans show up in comparisons and constraints. For example:

In [None]:
# Grade thresholds in a grading system
grade_A = 3.5
grade_B = 3.0
grade_C = 2.5
grade_D = 2.0

# Input from grader for Lisa Simpson
student_grade = 3.75

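# Compare against the named thresholds instead of repeating the numbers
if student_grade >= grade_A: print("Student receives grade A")
elif student_grade >= grade_B: print("Student receives grade B")
elif student_grade >= grade_C: print("Student receives grade C")
elif student_grade >= grade_D: print("Student receives grade D")
else: print("Student fails")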

For data scientists and analysts, an even more common reason to use booleans is in data queries (SQL- or Pandas-based).

In [None]:
data_file_location = "../../datasets/life-expectancy/life-expectancy-who.zip2"

who_df = pd.read_csv(data_file_location, compression='zip')
who_df.head()

In [None]:
who_df[who_df.Year > 2014]
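The filter above works in two steps: the comparison produces a Series of booleans (often called a mask), and indexing the DataFrame with that mask keeps only the rows marked `True`. A minimal sketch on a tiny made-up DataFrame (`toy_df` and its values are assumptions, not the WHO data):

```python
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({"Country": ["A", "B", "C"], "Year": [2013, 2015, 2016]})

mask = toy_df.Year > 2014       # a boolean Series, one value per row
print(mask.tolist())            # [False, True, True]

recent = toy_df[mask]           # keeps only the rows where mask is True
print(recent.Country.tolist())  # ['B', 'C']
```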

Boolean _expressions_ can be combined using the operators `and`, `or` and `not` (`&`, `|` and `~` in Pandas).

### `and` / `&`
Boolean expressions can be combined using the `and` operator. In technical terms, this is called a _conjunction_. In set-theoretic or relational algebra terms, `and` is the same as an intersection.

In [None]:
homer_is_a_dad and marge_is_a_mom

In [None]:
homer_is_a_dad and marge_is_a_dad

While core Python uses `and` to combine individual boolean values, Pandas queries unfortunately require the `&` operator instead (also note that each Pandas sub-expression needs to be wrapped in parentheses).
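The parentheses matter because `&` binds more tightly than comparison operators such as `==` and `>`. A small sketch on made-up data (`toy_df` is an assumption, not the WHO data) shows what happens with and without them:

```python
import pandas as pd

toy_df = pd.DataFrame({"Year": [2014, 2015, 2015], "Deaths": [10, 40, 5]})

# With parentheses: each comparison is evaluated first, then combined row by row.
ok = toy_df[(toy_df.Year == 2015) & (toy_df.Deaths > 30)]
print(len(ok))   # 1

# Without parentheses Python reads this as
#   toy_df.Year == (2015 & toy_df.Deaths) > 30
# which is a chained comparison that needs a single True/False,
# so Pandas raises a ValueError instead of guessing.
try:
    toy_df[toy_df.Year == 2015 & toy_df.Deaths > 30]
    raised = False
except ValueError:
    raised = True
print(raised)
```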

In [None]:
who_df[(who_df.Year == 2015) & (who_df.Status != 'Developing') & (who_df['Life expectancy '] < 80)]

**Remember** In order for a combined `and` expression to be true, **all** of its sub-expressions must be true.

In [None]:
True and True and True and True and True and False

In [None]:
who_df[(who_df.Year == 2015) & (who_df.Status != 'Developing') & (who_df['Life expectancy '] < 80) & (who_df['infant deaths'] > 30)]

### `or` / `|`
Boolean expressions can be combined using the `or` operator. In technical terms, this is called a _disjunction_. In set theoretic or relational algebra terms, `or` is the same as a union.

In [None]:
homer_is_a_dad or marge_is_a_dad

The Pandas counterpart to `or` is `|`

In [None]:
who_df[(who_df.Year == 2015) | (who_df.Year == 2014) ]

**Remember** In order for a combined `or` expression to be true, **at least one** of its sub-expressions must be true.
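The two **Remember** rules for `and` and `or` correspond to Python's built-in functions `all()` and `any()`, which apply the same logic to a whole list of values. A short sketch (the list `conditions` is made up for illustration):

```python
conditions = [True, True, True, False]

# all(...) behaves like chaining every item with `and`.
print(all(conditions))   # False, because one item is False

# any(...) behaves like chaining every item with `or`.
print(any(conditions))   # True, because at least one item is True

# Edge case worth knowing: an empty list has no False item and no True item.
print(all([]), any([]))  # True False
```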

In [None]:
True or False or False or False

In [None]:
who_df[(who_df.Year == 2015) | (who_df.Year == 2014)  | (who_df.Year == 1776) ]

**What is the deal with `|` and `&`?**

Having to switch to these single characters instead of the elegant `and`/`or` is indeed annoying. Python does not allow libraries to redefine `and`/`or`, but third-party libraries like Pandas can assign their own logic to `|`/`&`. Also recall that you have actually used these operators before. Remember sets?
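One way to picture this: Python lets a class define what `&`, `|` and `~` mean through the special methods `__and__`, `__or__` and `__invert__`, but offers no such hook for `and`/`or`/`not`. The toy class below (`Mask` is a made-up name, not a Pandas class) is a rough sketch of the idea, not how Pandas is actually implemented:

```python
class Mask:
    """A toy element-wise boolean list (hypothetical, for illustration)."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):   # called for: self & other
        return Mask(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):    # called for: self | other
        return Mask(a or b for a, b in zip(self.values, other.values))

    def __invert__(self):       # called for: ~self
        return Mask(not a for a in self.values)


left = Mask([True, True, False])
right = Mask([True, False, False])

print((left & right).values)   # [True, False, False]
print((left | right).values)   # [True, True, False]
print((~left).values)          # [False, False, True]
```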

In [None]:
simpsons   = set(['homer', 'marge', 'bart', 'lisa', 'maggie', 'barney', 'mr. burns'])
flinstones = set(['fred', 'wilma', 'pebbles', 'barney', 'betty'])

In [None]:
simpsons & flinstones # <== Intersection! (aka simpsons.intersection(flinstones))

In [None]:
simpsons | flinstones # <== Union! (aka simpsons.union(flinstones))

### `not` / `!=` (and even `~`)

Any boolean expression can be negated (converted to its opposite) by putting `not` in front of it.

In [None]:
not True

In [None]:
not False

In [None]:
True and True and True and True and True and False

In [None]:
not (True and True and True and True and True and False)

In [None]:
who_df[who_df.Year == 2015].head()

In [None]:
who_df[who_df.Year != 2015].head()

In Pandas you can also negate the whole expression with `~` ... we will see more about this later

In [None]:
who_df[~(who_df.Year == 2015)].head()
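A word of caution, sketched below: `~` is the element-wise negation for Pandas masks, but on plain Python integers it is bitwise inversion, so it is not a drop-in replacement for `not`. Because `bool` is a kind of `int`, the example converts explicitly with `int(...)` to make that visible.

```python
flag = True

print(not flag)           # False: logical negation
print(~int(flag))         # -2: bitwise inversion of the integer 1
print(bool(~int(flag)))   # True: -2 is "truthy", so ~ did not negate anything
```

For single `True`/`False` values, stick with `not`; reserve `~` for Pandas expressions.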

**Exercise** Show me all countries which are neither developing nor have infant deaths below 60

In [None]:
%%postcell exercise_025_145_a

#Type code here

In [None]:
who_df[~((who_df.Status == 'Developing') | (who_df['infant deaths'] < 60))]

**Be careful when combining ands/ors**

**Example**
Say you are a teacher and you need to give grades to students. A student can only pass if they have:
**exam grade of above 85% and either assignment grade of above 75% or have missed fewer than 3 classes**

In [None]:
students_df = pd.DataFrame({'name':['skipper', 'bobby', 'tommy', 'sally', 'jacky', 'jenny', 'flunky'],  'exam':[83, 79, 68, 60, 54, 92, 0], 'assignments':[67, 89, 93, 74, 23, 76, 0],  'absences':[9, 6, 0, 2, 1, 5, 0], })
students_df

**exam grade of above 85% and either assignment grade of above 75% or have missed fewer than 3 classes**

In [None]:
students_df[(students_df.assignments > 75) | (students_df.absences < 3) & (students_df.exam > 85)]

In [None]:
# The query above is incorrect: & has higher priority than |, so it is evaluated
# as if it had been parenthesized like this:
#students_df[(students_df.assignments > 75) | ( (students_df.absences < 3) & (students_df.exam > 85) ) ]

**exam grade of above 85% &nbsp; &nbsp; &nbsp; &nbsp; and &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; either assignment grade of above 75% or have missed fewer than 3 classes**

            \/                                                       \/ Notice the parentheses

In [None]:
students_df[ ( (students_df.assignments > 75) | (students_df.absences < 3) ) & (students_df.exam > 85)]
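The same rule can be replayed in plain Python on the data from `students_df` to see which students each version lets through (a sketch; the tuple layout is chosen just for illustration):

```python
# (name, exam, assignments, absences), copied from students_df
students = [
    ("skipper", 83, 67, 9), ("bobby", 79, 89, 6), ("tommy", 68, 93, 0),
    ("sally", 60, 74, 2), ("jacky", 54, 23, 1), ("jenny", 92, 76, 5),
    ("flunky", 0, 0, 0),
]

# Without parentheses: `and` is applied before `or`.
wrong = [name for name, exam, asg, absent in students
         if asg > 75 or absent < 3 and exam > 85]

# With parentheses: the either/or part is grouped first.
right = [name for name, exam, asg, absent in students
         if (asg > 75 or absent < 3) and exam > 85]

print(wrong)   # ['bobby', 'tommy', 'jenny']
print(right)   # ['jenny']
```

The unparenthesized version passes two students whose exam grade is below 85%.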

In [None]:
True or False and False, (True or False) and False

**Important**: `and` clauses have higher priority than `or` clauses, so `x or y and z` is the same as `x or (y and z)`. The same holds for `&` and `|`.
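Since there are only eight combinations of `x`, `y` and `z`, the priority rule can be checked exhaustively. A small sketch using `itertools.product`:

```python
from itertools import product

# Try every combination of True/False for x, y and z.
for x, y, z in product([True, False], repeat=3):
    assert (x or y and z) == (x or (y and z))

# The other grouping gives a different answer for some inputs.
differ = [(x, y, z) for x, y, z in product([True, False], repeat=3)
          if (x or y and z) != ((x or y) and z)]
print(differ)   # [(True, True, False), (True, False, False)]
```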

**Exercise** Show all data points from 2015 where either the country was developing or had infant deaths below 60

In [None]:
%%postcell exercise_025_145_b

#Type code here

### Truth tables
Boolean algebra is often described in terms of a **truth table**. Here is the full table for `and` and `or`:

|Statement A | Statement B| A and B | A or B|
| ---        | ---        | ---     | ---   |
| True       | True       | True    | True  |
| True       | False      | False   | True  |
| False      | True       | False   | True  |
| False      | False      | False   | False |
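The table can also be generated rather than memorized. A minimal sketch that loops over every combination of A and B:

```python
from itertools import product

rows = []
for a, b in product([True, False], repeat=2):
    rows.append((a, b, a and b, a or b))

print("A      B      A and B  A or B")
for a, b, a_and_b, a_or_b in rows:
    # !s converts each boolean to text so it can be padded to a fixed width
    print(f"{a!s:<6} {b!s:<6} {a_and_b!s:<8} {a_or_b!s}")
```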

### Negating comparison operators

In [None]:
who_df[who_df.Year > 2013].head()

In [None]:
who_df[ ~ (who_df.Year <= 2013) ].head()

**Negating a comparison operator does NOT simply mean flipping the sign to its opposite; you have to account for the 'equal to' part**

Notice, in boolean algebra notation, this is: 
```
ORIGINAL:     (Year  GREATER THAN 2013) 
NEGATED : NOT (Year  LESS THAN OR EQUAL TO 2013)
```

            <-------------------------------------------------------------->
            <------------------------------------[*************************>
            <*************************************]------------------------>
                                                 ^--- This is where most people make mistakes

**Consequence** Breaking up continuous values requires care!
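A quick sketch of the boundary problem, using three sample years around the cutoff (the list `years` is made up for illustration):

```python
years = [2012, 2013, 2014]

for year in years:
    original = year > 2013
    correct_negation = year <= 2013   # flips > AND adds the "equal to" case
    naive_negation = year < 2013      # only flips the sign

    # A proper negation is always the opposite of the original.
    assert correct_negation == (not original)
    print(year, original, correct_negation, naive_negation)

# Years matched by neither `> 2013` nor `< 2013` fall through the gap.
uncovered = [y for y in years if not (y > 2013) and not (y < 2013)]
print(uncovered)   # [2013]
```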

**Exercise** Fix the bug in the following program:

In [None]:
def get_letter_grade(gpa):
    if gpa > 3.5: return 'A'
    elif gpa < 3.5 and gpa > 3.0: return 'B'
    elif gpa <3.0 and gpa > 2.5: return 'C'
    else: return 'D'

get_letter_grade(3.5)

In [None]:
%%postcell exercise_025_145_c

#Retype code here and fix the bug

### De Morgan's laws

De Morgan's laws are often used to simplify a complex set of boolean expressions. Here is Wikipedia's definition:

- The negation of a disjunction is the conjunction of the negations; and
- the negation of a conjunction is the disjunction of the negations;

or

- the complement of the union of two sets is the same as the intersection of their complements; and
- the complement of the intersection of two sets is the same as the union of their complements;

or

- not (A or B) = not A and not B; and
- not (A and B) = not A or not B

In set theory and Boolean algebra, these are written formally as

$$
\begin{aligned}
\overline{A \cup B} &= \overline{A} \cap \overline{B}, \\
\overline{A \cap B} &= \overline{A} \cup \overline{B},
\end{aligned}
$$
where

$A$ and $B$ are sets,
$\overline{A}$ is the complement of $A$,
∩ is the intersection, and
∪ is the union.
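Both forms of the laws are easy to check in Python. The sketch below tests the boolean form for every combination of values, and the set form on two small example sets (`universe`, `A` and `B` are made up for illustration):

```python
from itertools import product

# Boolean form: check all four combinations of a and b.
for a, b in product([True, False], repeat=2):
    assert (not (a or b)) == ((not a) and (not b))
    assert (not (a and b)) == ((not a) or (not b))

# Set form: the complement is taken relative to a universe of all elements.
universe = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

assert universe - (A | B) == (universe - A) & (universe - B)
assert universe - (A & B) == (universe - A) | (universe - B)

print(sorted(universe - (A | B)))   # [0, 7, 8, 9]
```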

Earlier we saw these two examples:

In [None]:
who_df[who_df.Year != 2015].head()

In [None]:
who_df[~(who_df.Year == 2015)].head()

Notice, in boolean algebra notation, this is: 
```
ORIGINAL:     (Year IS NOT 2015) 
NEGATED : NOT (Year IS 2015)
```

Sometimes the results of De Morgan's laws are not intuitive

In [None]:
tmp_df = who_df[(who_df.Year != 2015) & (who_df.Status != 'Developing')]
tmp_df.head()

In [None]:
tmp_df.GDP.sum()

In [None]:
tmp_df = who_df[ ~ ( (who_df.Year == 2015) | (who_df.Status == 'Developing') ) ]
tmp_df.head()

In [None]:
tmp_df.GDP.sum()

Notice, in boolean algebra notation, this is: 
```
ORIGINAL:     (Year IS NOT 2015) AND (Status IS NOT Developing)
NEGATED : NOT ((Year IS 2015) OR (Status IS Developing))
```

**How to think of it:** ORIGINAL describes the rows to include, while NEGATED describes the rows to filter out and then inverts that selection.
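The equivalence can be confirmed directly by comparing the two masks. A minimal sketch on a made-up four-row DataFrame (`toy_df` is an assumption, not the WHO data):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Year": [2014, 2015, 2014, 2015],
    "Status": ["Developed", "Developed", "Developing", "Developing"],
})

# "Include" form: rows that are not 2015 AND not Developing.
include = (toy_df.Year != 2015) & (toy_df.Status != "Developing")

# "Filter out" form: rows that are 2015 OR Developing.
filter_out = (toy_df.Year == 2015) | (toy_df.Status == "Developing")

# Negating the "filter out" mask gives exactly the "include" mask.
print(include.equals(~filter_out))      # True
print(toy_df[include].index.tolist())   # [0]
```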

Combine negation of comparison operators and De Morgan's laws

In [None]:
tmp_df = who_df[(who_df.Year > 2013) & (who_df.Status != 'Developing')]
tmp_df.head()

In [None]:
tmp_df.GDP.sum()

In [None]:
tmp_df = who_df[ ~ ( (who_df.Year <= 2013) | (who_df.Status == 'Developing') ) ]
tmp_df.head()

In [None]:
tmp_df.GDP.sum()

**Exercise** Use De Morgan's law to rewrite this query

In [None]:
# Include countries which are not developing and which do not have infant deaths of zero
tmp_df = who_df[(who_df.Status != "Developing") & (who_df['infant deaths'] != 0)]
tmp_df

In [None]:
tmp_df.Polio.sum()

In [None]:
%%postcell exercise_025_145_d

#Retype the code here, rewritten using De Morgan's law

# FILTER countries which are ???
who_df[???].Polio.sum()

### Boolean algebra and probability <= We are data scientists, not computer scientists, right?

See Jaynes's "Probability Theory: The Logic of Science" for details: https://www.amazon.com/Probability-Theory-Science-T-Jaynes/dp/0521592712

This topic is well outside the scope of this class, but may be of interest to some students. Boolean values of `True` and `False` are often represented as `1` and `0` in many programming languages, including Python. This also works from a probability perspective, since `True` can be thought of as the probability of absolute certainty that something will happen and `False` as the probability that an event will (almost certainly) not happen. This is where boolean algebra stops. However, probability can handle statements of varying degrees of certainty.

Probability has its own version of negation and De Morgan's laws.
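As a rough illustration of that last remark (a sketch using a fair six-sided die, which is an assumed example and not taken from the reference above): negation becomes P(not A) = 1 - P(A), and De Morgan's law says the event "not (A or B)" is the same event as "(not A) and (not B)", so the two have equal probability.

```python
from fractions import Fraction

outcomes = set(range(1, 7))              # a fair six-sided die
A = {n for n in outcomes if n % 2 == 0}  # event A: the roll is even
B = {n for n in outcomes if n > 4}       # event B: the roll is greater than 4

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# Negation: P(not A) = 1 - P(A)
assert prob(outcomes - A) == 1 - prob(A)

# De Morgan: P(not (A or B)) = P((not A) and (not B))
assert prob(outcomes - (A | B)) == prob((outcomes - A) & (outcomes - B))

print(prob(A), prob(outcomes - (A | B)))   # 1/2 1/3
```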