***
***
***
<br><br><br><br><br>
<h1>Python for Business Analytics</h1>
<em>A Nontechnical Approach for Nontechnical People</em><br><br>
<em><strong>Custom Edition for Hult International Business School</strong></em><br>

Written by Konrad Ilczuk <br>
Hult International Business School <br>
***
***
***

# <u>Chapter X: Sets and Set Operations</u>

In <strong>Chapter 5: Lists and List Operations</strong>, we introduced the list - a data structure that allows us to store multiple elements of more than one type. This is one of <a href="https://docs.python.org/3/tutorial/datastructures.html">Python's built-in data structures</a> (and it is native to many other programming languages as well). As you may recall, compared to declaring several stand-alone objects, <strong>lists are a way to better organize our information</strong>. As we move forward, we will explore more specialized storage objects that are critical for several analytically-related tasks.
<br><br>
Amongst these are <a href="https://docs.python.org/3/tutorial/datastructures.html#sets">sets</a> and <a href="https://docs.python.org/3/tutorial/datastructures.html#dictionaries">dictionaries</a>. The former is often used to represent a collection of unique elements, and the latter allows us to create an association between two values (i.e. two variables). If these are difficult to conceptualize, have no fear! It is our hope that by the end of this chapter you will be well-acquainted with both structures and develop an intuition for their usage. As with the content in previous chapters, the concepts of this chapter will be reinforced as you move forward throughout this book.
<br><br>

***

<h3>X.1 Declaring Sets with Values</h3>

Perhaps the most fundamental way to think about a set is to compare it to an already familiar structure: the <a href="https://docs.python.org/3/library/stdtypes.html#lists">list</a>. As can be recalled from <strong>Chapter 5: Lists and List Operations</strong>, a list is declared with square brackets and its elements are separated by commas, as can be observed in <em>Code XYZ</em>.

In [9]:
## Code XYZ ##

# Declaring a list
tasty_lst = ["avocados", "tomatoes", "pineapples", "nachos", "avocados"]

# Printing the list
print(tasty_lst)

['avocados', 'tomatoes', 'pineapples', 'nachos', 'avocados']


***

Mmmmmm, that is a tasty list! However, it appears that avocados sounded so good that we accidentally put them on <em>tasty_lst</em> twice. We could simply slice out the second occurrence of avocados, as covered in <strong>Chapter 5: Lists and List Operations</strong>. However, this would not be feasible if our list were much longer and/or if duplicate entries were more numerous. Sets effectively solve this problem. For our purposes:<br><br>


<div align="center"><h3>
    Sets are a way to better organize our <font color="dark orange">unique</font> information.
<a class="tocSkip"></h3></div><br>

Sets can be declared as easily as lists, the only difference being that instead of square brackets, we use curly brackets. This is exemplified in <em>Code XYZ</em>.

In [8]:
## Code XYZ ##

# Declaring a set
tasty_set = {"avocados", "tomatoes", "pineapples", "nachos", "avocados"}

# Printing the set
print(tasty_set)
print(tasty_set)

{'tomatoes', 'pineapples', 'nachos', 'avocados'}


***

Upon running the cells above, notice the key difference between <em>tasty_lst</em> and <em>tasty_set</em> - one of the "avocados" objects disappeared when <em>tasty_set</em> was run. This is because unlike lists, which allow duplicate values, <strong>sets always remain in a state where only <font color ="dark orange">unique</font> elements are stored</strong>, ensuring that no value is ever repeated. This makes the set an excellent structure for storing information that we do not want duplicated. Notice also that the elements were printed in a different order than we typed them: unlike lists, sets do not keep track of order.
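
A quick way to check whether a collection of values contains duplicates is to compare how many elements the list holds with how many the set holds. The sketch below reuses the tasty values from above; the name `has_duplicates` is our own and is introduced only for illustration.

```python
# The list and the set from the examples above, each typed with "avocados" twice
tasty_lst = ["avocados", "tomatoes", "pineapples", "nachos", "avocados"]
tasty_set = {"avocados", "tomatoes", "pineapples", "nachos", "avocados"}

# The list keeps both copies, the set keeps only one
print(len(tasty_lst))   # 5
print(len(tasty_set))   # 4

# If the set is shorter than the list, the list contained duplicates
has_duplicates = len(tasty_set) < len(tasty_lst)
print(has_duplicates)   # True
```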


### Transforming a List Into a Set

Picture a scenario where you wrote an app that allows people to enter their name to sign up for your birthday party. You share the link on your social media profile, send it out by email a few days later, and also remind your classmates about it. Some of your friends might accidentally sign up twice. Right before delivering the final names of the participants to your mailbox, your app should transform the list of participants into a set of participants. This can be done by applying the set() function to a list:

In [2]:
## Code XYZ ##

# Transforming a list into a set
participants = ['Chase','Konrad','Jane','Anders','Andrew','Bertil','Jane']
unique_participants = set(participants)
print(unique_participants)

{'Chase', 'Andrew', 'Anders', 'Jane', 'Konrad', 'Bertil'}


This example shows a property that sets share with lists: both are convertible, meaning that we are able to change the data type from a list into a set. We can do the exact opposite as well, which reinforces what we learned about lists being convertible:

In [3]:
## Code XYZ ##

# Transforming a set into a list
list(unique_participants)

['Chase', 'Andrew', 'Anders', 'Jane', 'Konrad', 'Bertil']
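
Because a set does not keep its elements in any particular order, the list we get back may not match the order in which the names were entered. If a predictable order matters, one option is Python's built-in sorted() function, which returns a new, alphabetically sorted list. A minimal sketch (the name `sorted_participants` is our own):

```python
# The sign-up list from above, with 'Jane' entered twice
participants = ['Chase', 'Konrad', 'Jane', 'Anders', 'Andrew', 'Bertil', 'Jane']

# Converting to a set removes the duplicate
unique_participants = set(participants)

# sorted() accepts a set and returns an alphabetically ordered list
sorted_participants = sorted(unique_participants)
print(sorted_participants)
# ['Anders', 'Andrew', 'Bertil', 'Chase', 'Jane', 'Konrad']
```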

### Adding and Removing Elements

The simple cases above show how to deal with pre-defined structures, i.e. ones whose values we type into the code by hand. In most scenarios you will probably find yourself adding elements to a set inside of a loop. Python sets have the add() function, which is the set equivalent of the append() function for lists that we saw in Chapter 5.

In [4]:
## Code XYZ ##

# Manipulating a set - adding elements
unique_participants.add('Johan') # not yet in the set
unique_participants.add('Jane') # already in the set
print(unique_participants)

{'Bertil', 'Chase', 'Anders', 'Konrad', 'Jane', 'Andrew', 'Johan'}


When we add an element to a set, Python first checks whether that element is already there. If it is not there yet, Python adds it - much like append() did with lists. Otherwise, the set stays unchanged and no duplicate is created.
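
The text above mentions that elements are usually added inside of a loop. The sketch below illustrates that idea with a made-up list of sign-ups; the names `signups` and `guest_set` are hypothetical and not part of the earlier cells.

```python
# Hypothetical sign-ups arriving one at a time, some of them repeated
signups = ['Chase', 'Konrad', 'Jane', 'Chase', 'Jane', 'Johan']

# Start with an empty set; we use set() because {} creates an empty dictionary
guest_set = set()

# Add every name; repeated names are silently ignored by the set
for name in signups:
    guest_set.add(name)

print(len(signups))     # 6
print(len(guest_set))   # 4
```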

Sadly, Jane has just informed us that she is not able to join our party. We need to update the guest list to reflect this. As you may have guessed, the remove() function will do this for us. Given the number of changes we have made to the guest list, we also need to see how many people we have now so that we can prepare the right quantities of food.

In [5]:
## Code XYZ ##

# Manipulating a set - removing elements
unique_participants.remove('Jane')
print(unique_participants)
print("The final number of participants is: " + str(len(unique_participants)))

{'Bertil', 'Chase', 'Anders', 'Konrad', 'Andrew', 'Johan'}
The final number of participants is: 6


Try running the cell above twice. The first time you run it, it removes Jane from the set. The second time, the set no longer contains an entry for Jane, so remove() raises a KeyError. If we do not wish to be alerted when an element is missing, we can use the discard() function instead. The discard() function "safely removes" an element: it removes the element if it exists in the set, and otherwise does nothing, allowing your code to continue running smoothly.

In [6]:
## Code XYZ ##

# Manipulating a set - safe removal
unique_participants.discard('Jane')
print(unique_participants)

{'Bertil', 'Chase', 'Anders', 'Konrad', 'Andrew', 'Johan'}
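
To see the difference between remove() and discard() side by side, here is a small sketch on a made-up set called `guests`. It assumes you are comfortable with (or willing to peek ahead at) Python's try/except statement, which lets us catch an error instead of stopping the program.

```python
# A small hypothetical guest set that does not contain 'Jane'
guests = {'Bertil', 'Chase', 'Anders'}

# discard() stays silent when the name is missing
guests.discard('Jane')

# remove() raises a KeyError when the name is missing;
# try/except catches the error so the program keeps running
try:
    guests.remove('Jane')
    outcome = "removed"
except KeyError:
    outcome = "KeyError: Jane is not in the set"

print(outcome)       # KeyError: Jane is not in the set
print(len(guests))   # 3
```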


Mutability was briefly discussed in Chapter 5 in the context of lists. An object is mutable when we can change it, for example by adding or removing elements. If you would like to guarantee the immutability of your set and disallow both the addition and the removal of elements, you can create a frozenset, which acts much like the set we discussed above, except that it lacks the add(), remove() and discard() functions.

In [9]:
## Code XYZ ##

# Immutable set

immutable_set = frozenset(unique_participants)
print(immutable_set)
# None of the calls below will work, because a frozenset is immutable
#immutable_set.add('Roman')
#immutable_set.remove('Jane')
#immutable_set.discard('Jane')

frozenset({'Bertil', 'Chase', 'Anders', 'Konrad', 'Andrew', 'Johan'})
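
If you are curious what happens when we try to change a frozenset anyway, the sketch below attempts an add() call on a made-up frozenset called `guests` and catches the resulting error with try/except. Reading from a frozenset (membership checks, len()) still works as usual.

```python
# A small hypothetical frozenset
guests = frozenset(['Chase', 'Konrad', 'Anders'])

# A frozenset has no add() function, so the call raises an AttributeError
try:
    guests.add('Roman')
    outcome = "added"
except AttributeError:
    outcome = "AttributeError: a frozenset cannot be changed"

print(outcome)             # AttributeError: a frozenset cannot be changed
print('Chase' in guests)   # True
print(len(guests))         # 3
```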


### Set Operations

For practical reasons we will use the words "list" and "set" interchangeably in the text below. Note, however, that a "list of participants" here always refers to a Python set containing the names of participants.

To complicate matters further (and to have a great excuse to introduce set operations!), our friend Jacob has taken the initiative to help us out with the list of people who will come to our party. Unfortunately, he was not aware that we were doing the same, so the work was done twice. Once he learned about our list, he suggested combining both lists (remember - sets!). With the two sets in place, we are able to answer some interesting questions about the data gathered:

- Find out who has already been asked by both us and Jacob
- Combine both lists, so that we can now work as a team
- Find out who is on our list but not on Jacob's, and vice versa
- Find out who is on either our list or Jacob's, but not on both

Luckily, Python comes with ready-made functions for the set operations that answer these questions. Their symbolic expressions are given in parentheses.

- Intersection (A⋂B)

- Union (A⋃B)

- Difference (A\B) or (B\A)

- Symmetric Difference (A⊖B)

Our sets have grown in size, making it impractical for us to simply read the values and figure it out. This presents us with an excellent case to use a programming language to help us find the answers. Jacob was so kind as to have already declared the sets for us: 

In [None]:
## Code XYZ ##

# Declaring two sets
myset = set(['Chase','Konrad','Jane','Anders','Andrew','Bertil','Jane','David','Dimitri','Arshak','Daniel','Henrik','Zhou','Wang','Jonathan'])
jacobs = set(['Cheryl','Lana','Konrad','Jane','Anders','Andrew','Bertil','Jane','Dimitri','Arshak','Zhou','Jonathan'])

We will now visualize these sets and perform a number of operations with them.

#### Intersection (A⋂B)

In the following images, "A" is "myset" (ours) and "B" is "jacobs" (Jacob's). The common part, the intersection, is marked with the symbolic expression "A⋂B".

The intersection returns the elements that are common to both A and B (A⋂B). These are the friends whom both we and Jacob have already asked. We should probably not bother them a third time.

![caption](images/venn_diagram.png)

In [None]:
## Code XYZ ##

# Intersection (myset⋂jacobs)
myset.intersection(jacobs)
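
For reference, the sketch below repeats the intersection on the two sets declared earlier and prints the result in alphabetical order with sorted(), since a set has no fixed order of its own. The name `common` is our own.

```python
# The two sets as declared earlier in this section
myset = set(['Chase','Konrad','Jane','Anders','Andrew','Bertil','Jane','David','Dimitri','Arshak','Daniel','Henrik','Zhou','Wang','Jonathan'])
jacobs = set(['Cheryl','Lana','Konrad','Jane','Anders','Andrew','Bertil','Jane','Dimitri','Arshak','Zhou','Jonathan'])

# Friends who appear on both lists
common = myset.intersection(jacobs)

print(sorted(common))
# ['Anders', 'Andrew', 'Arshak', 'Bertil', 'Dimitri', 'Jane', 'Jonathan', 'Konrad', 'Zhou']
print(len(common))   # 9
```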

#### Union (A⋃B)

![caption](images/venn_diagram_union.png)

Union is the starting point of our collaboration. We combine the two lists and remove the duplicates, which gives us an updated understanding of who has signed up. In terms of element counts, the size of the union is the size of A plus the size of B minus the size of A⋂B. We subtract A⋂B because its elements are included in both A and B and we do not wish to count them twice. The result is a final set that we can now use as our new starting point to continue collecting names of people.

In [14]:
## Code XYZ ##

# Union
myset.union(jacobs)

{'Anders',
 'Andrew',
 'Arshak',
 'Bertil',
 'Chase',
 'Cheryl',
 'Daniel',
 'David',
 'Dimitri',
 'Henrik',
 'Jane',
 'Jonathan',
 'Konrad',
 'Lana',
 'Wang',
 'Zhou'}
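
The counting rule described above (size of A plus size of B minus size of A⋂B) can be checked with len(). This is only a sanity check on the two sets from this section; the `size_...` variable names are our own.

```python
# The two sets as declared earlier in this section
myset = set(['Chase','Konrad','Jane','Anders','Andrew','Bertil','Jane','David','Dimitri','Arshak','Daniel','Henrik','Zhou','Wang','Jonathan'])
jacobs = set(['Cheryl','Lana','Konrad','Jane','Anders','Andrew','Bertil','Jane','Dimitri','Arshak','Zhou','Jonathan'])

size_a = len(myset)                          # 14 unique names
size_b = len(jacobs)                         # 11 unique names
size_both = len(myset.intersection(jacobs))  # 9 names on both lists
size_union = len(myset.union(jacobs))        # 16 names overall

# Names on both lists are counted once in A and once in B, so we subtract them once
print(size_a + size_b - size_both == size_union)   # True
```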

#### Difference (A\B) or (B\A)

<center>(A\B)</center> | <center>(B\A)</center>
- | - 
![caption](images/venn_diagram_relative_difference_ab.png) | ![caption](images/venn_diagram_relative_difference_ba.png)

Differences between sets bring out what is unique to each of them. The difference of A and B consists of the elements that are present in A but not in B. Similarly, the difference of B and A consists of the elements that are present in B but not in A.

The difference tells us who we have not reached out to yet: we have not asked Lana and Cheryl, but luckily Jacob has already contacted them.

In [15]:
## Code XYZ ##

# Difference of sets
print("My set - Jacobs: " + str(myset.difference(jacobs)))
print("Jacobs - my set: " + str(jacobs.difference(myset)))

My set - Jacobs: {'Chase', 'David', 'Wang', 'Henrik', 'Daniel'}
Jacobs - my set: {'Lana', 'Cheryl'}


#### Symmetric Difference (A⊖B)

![caption](images/venn_diagram_symetric_difference.png)

The symmetric difference is a bit trickier to explain. It returns the elements that are in A but absent from B, combined (union) with the elements that are in B but absent from A. In other words, it contains everyone who appears on exactly one of the two lists. This operation is useful when you want to see whom each of us missed. It can be visualized as everything except the intersection A⋂B.

In [16]:
## Code XYZ ##

# Symmetric difference
myset.symmetric_difference(jacobs)

{'Chase', 'Cheryl', 'Daniel', 'David', 'Henrik', 'Lana', 'Wang'}
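
Python also offers operator shorthands for the four set operations: `&` for intersection, `|` for union, `-` for difference and `^` for symmetric difference. The sketch below uses two small made-up sets, `a` and `b`, so that the results are easy to verify by eye.

```python
# Two small hypothetical sets
a = {'Chase', 'Jane', 'Konrad'}
b = {'Jane', 'Konrad', 'Lana'}

print(sorted(a & b))   # intersection: ['Jane', 'Konrad']
print(sorted(a | b))   # union: ['Chase', 'Jane', 'Konrad', 'Lana']
print(sorted(a - b))   # difference: ['Chase']
print(sorted(a ^ b))   # symmetric difference: ['Chase', 'Lana']

# Each operator gives the same result as the matching function
print(a & b == a.intersection(b))           # True
print(a ^ b == a.symmetric_difference(b))   # True
```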

#### Comparing Sets

Let us verify this ourselves. If we subtract B from A and combine that (union) with B minus A, we should get everything that is not in the intersection, which is exactly what the symmetric difference is. Naturally, we can use the "==" operator to assess whether those sets are equal.

In [18]:
## Code XYZ ##

# Asserting that the two sets are equal
myset.difference(jacobs).union(jacobs.difference(myset)) == myset.symmetric_difference(jacobs)

True

One last check we could do is to go back to our visualizations and compare them (in case we have not yet developed a sufficient amount of trust in machine-executed code).


<center>(A\B)</center> | &nbsp; | <center>(B\A)</center> |
- | - | -
![caption](images/venn_diagram_relative_difference_ab.png) | **⋃** | ![caption](images/venn_diagram_relative_difference_ba.png) |


![caption](images/venn_diagram_symetric_difference.png)

Seems like our solution is correct!

# External Reading

- https://python-reference.readthedocs.io/en/latest/docs/sets/
- https://www.w3schools.com/python/python_sets.asp
- https://docs.python.org/3/tutorial/datastructures.html#sets

# Exercises

We made it through the basic cases of sets and introduced the different set operations with their respective symbolic notations. Now the time has come to practice these concepts in a scenario with 3 sets.

![3vd](images/venn_diagram_3.png)

You are working as a Data Scientist in a respected and global educational company. You have collected data on three of the best performing students. Right now you are looking into which teachers taught all three students, believing that identifying them would be a great predictor of the success of future students. 

In [3]:
student1 = {"Mr Beto","Ms Alice","Mr Han","Ms Aline de Silva","Senhor Bruno Ferreira","Mr Damese","Ms Jabrullah","Mr Elif","Ms Ilaf"}
student2 = {"Senhor Bruno Ferreira","Mr Damese","Ms Jabrullah","Ms Ilaf","Mr Geralt","Ms Sandeepa","Ms Xin Guo"}
student3 = {"Mr Han","Mr Damese","Ms Jabrullah","Ms Ilaf","Mr Tomasz"}

First, let's begin by printing out and counting the number of unique teacher names in the sets. Try to write it yourself first in the empty cells provided, and only afterwards compare your answer with our solutions.

In [6]:
len(student1.union(student2).union(student3))

13

In [7]:
### Type your code here

Which teachers did they all have in common?

In [4]:
student1.intersection(student2).intersection(student3)

{'Mr Damese', 'Ms Ilaf', 'Ms Jabrullah'}

In [None]:
### Type your code here

You realize you made a mistake while collecting your data: it turns out that the student2 set actually comes from a higher-level course that no longer exists, and the teachers who taught it are long gone from the school. You will need to remove those teachers and keep only the remaining teachers of the other two students.

In [10]:
student1.union(student3).difference(student2)

{'Mr Beto', 'Mr Elif', 'Mr Han', 'Mr Tomasz', 'Ms Alice', 'Ms Aline de Silva'}

In [None]:
### Type your code here

Which teachers were not shared by all three students?

In [11]:
student1.symmetric_difference(student2).symmetric_difference(student3) # this is a trap :-) it keeps the teachers shared by all three

{'Mr Beto',
 'Mr Damese',
 'Mr Elif',
 'Mr Geralt',
 'Mr Tomasz',
 'Ms Alice',
 'Ms Aline de Silva',
 'Ms Ilaf',
 'Ms Jabrullah',
 'Ms Sandeepa',
 'Ms Xin Guo'}

In [None]:
student1.union(student2).union(student3).difference(student1.intersection(student2).intersection(student3))
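
Why is chaining symmetric_difference() a trap? With three sets, the chained call keeps every element that appears in an odd number of sets, so anything shared by all three sneaks back in, while anything shared by exactly two drops out. The sketch below shows this on three small made-up sets (`s1`, `s2`, `s3`, with hypothetical teacher names) rather than on the exercise data.

```python
# 'Teacher A' is in all three sets, 'Teacher B' is in exactly two
s1 = {'Teacher A', 'Teacher B', 'Teacher C'}
s2 = {'Teacher A', 'Teacher B', 'Teacher D'}
s3 = {'Teacher A', 'Teacher E'}

# Chained symmetric difference: keeps names found in an odd number of sets
chained = s1.symmetric_difference(s2).symmetric_difference(s3)
print(sorted(chained))
# ['Teacher A', 'Teacher C', 'Teacher D', 'Teacher E']

# Everyone minus the teachers shared by all three
not_shared_by_all = s1.union(s2).union(s3).difference(s1.intersection(s2).intersection(s3))
print(sorted(not_shared_by_all))
# ['Teacher B', 'Teacher C', 'Teacher D', 'Teacher E']
```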

In [8]:
### Type your code here