<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<br><br><br>
<h1>Python for Business Analytics</h1>
<em>A Nontechnical Approach for Nontechnical People</em><br><br>
<em><strong>Custom Edition for Hult International Business School</strong></em><br>

Written by Konrad Ilczuk <br>
Hult International Business School <br><br><br>
<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<br>

<h1><u>Chapter X: Sets and Set Operations</u></h1>

In <strong>Chapter 5: Lists and List Operations</strong>, we introduced the list - a data structure that allows us to store multiple elements of more than one type. This is one of <a href="https://docs.python.org/3/tutorial/datastructures.html">Python's built-in data structures</a> (and it is native to many other programming languages as well). As you may recall, compared to declaring several stand-alone objects, <strong>lists are a way to better organize our information</strong>. As we move forward, we will explore more specialized storage objects that are critical for several analytically-related tasks.
<br><br>
Amongst these are <a href="https://docs.python.org/3/tutorial/datastructures.html#sets">sets</a> and <a href="https://docs.python.org/3/tutorial/datastructures.html#dictionaries">dictionaries</a>. The former is often used to represent a collection of unique elements, and the latter allows us to create an association between two values (i.e. two variables). If these are difficult to conceptualize, have no fear! It is our hope that by the end of this chapter you will be well-acquainted with both structures and develop an intuition for their usage. As with the content in previous chapters, the concepts of this chapter will be reinforced as you move forward throughout this book.
<br><br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray"><br>

<h3>X.1 Declaring Sets with Values</h3><br>
Perhaps the most fundamental way to think about a set is to compare it to an already familiar structure: the <a href="https://docs.python.org/3/library/stdtypes.html#lists">list</a>. As you may recall from <strong>Chapter 5: Lists and List Operations</strong>, a list is declared with square brackets and its elements are separated by commas, as shown in <em>Code XYZ</em>.

<br><hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

In [None]:
## Code XYZ ##

# declaring a list
tasty_lst = ["avocados", "tomatoes", "pineapples", "nachos", "avocados"]

# printing the list
print(tasty_lst)

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray"><br>

Mmmmmm, that is a tasty list! However, it appears that avocados sounded so good that we accidentally put them on <em>tasty_lst</em> twice. We could simply slice out the second occurrence of avocados, as covered in <strong>Chapter 5: Lists and List Operations</strong>. However, this would not be feasible if our list were much longer and/or if duplicate entries were more numerous. Sets effectively solve this problem. For our purposes:<br><br>


<div align="center"><h3>
    Sets are a way to better organize our <font color="dark orange">unique</font> information.
<a class="tocSkip"></h3></div><br>

Sets can be declared as easily as lists, the only difference being that instead of square brackets, we use curly brackets. This is exemplified in <em>Code XYZ</em>.
    
<br><hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

In [None]:
## Code XYZ ##

# declaring a set
tasty_set = {"avocados", "tomatoes", "pineapples", "nachos", "avocados"}

# printing the set
print(tasty_set)

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray"><br>

Upon running the cells above, notice the key difference between <em>tasty_lst</em> and <em>tasty_set</em> - one of the "avocados" objects disappeared when <em>tasty_set</em> was created. This is because unlike lists, which allow duplicate values, <strong>sets always remain in a state where only <font color ="dark orange">unique</font> elements are stored</strong>, ensuring that no value is ever repeated. This makes the set an excellent structure for storing information that we do not want duplicated.
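
One quick way to confirm the deduplication is to compare lengths. The sketch below redeclares the same two objects so that it can run on its own. Note that sets are unordered, so the order in which a set prints may differ from the order in which it was typed.

```python
# the same five entries, stored as a list and as a set
tasty_lst = ["avocados", "tomatoes", "pineapples", "nachos", "avocados"]
tasty_set = {"avocados", "tomatoes", "pineapples", "nachos", "avocados"}

# the list keeps both "avocados" entries; the set keeps only one
print(len(tasty_lst))  # 5
print(len(tasty_set))  # 4

# membership checks work the same way for both structures
print("nachos" in tasty_set)  # True
```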

<br><hr style="height:0.5px;border-width:0;color:gray;background-color:gray"><br><br>

<h3>X.2 Transforming a List Into a Set</h3><br>

Picture a scenario where you wrote an app that allows people to enter their name to RSVP for your birthday party. You first share the link on social media. Then, a few days later, you send it out by email and also remind your classmates about it. In such a scenario, some of your friends might accidentally RSVP more than once. Instead of immediately sending the names of your party's attendees to your inbox, the app should eliminate duplicate entries. In other words, it should transform the list of participants into a set. This can be done by applying the <strong>set( )</strong> function, as shown in <em>Code XYZ</em>.

<br><hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

In [None]:
## Code XYZ ##

# declaring a list
participants = ['Chase', 'Konrad', 'Jane', 'Anders',
                'Andrew', 'Bertil', 'Jane']


# transforming the list into a set
unique_participants = set(participants)


# printing the results
print(unique_participants)

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

<em>Code XYZ</em> exhibits a property of sets that also exists for lists: <strong>sets are convertible</strong>. In other words, Python allows us to change an object's type from list to set. We can do the exact opposite as well. As you may recall from <strong>Chapter 5</strong>, convertibility is a key advantage of lists, and we are now ready to build upon this knowledge to master other data storage types.

In [None]:
## Code XYZ ##

# transforming a set into a list
list(unique_participants)

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray"><br>
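
Because sets are unordered, converting a set back into a list does not guarantee any particular order. If a predictable order matters (for example, when printing name tags), one option is the built-in <strong>sorted( )</strong> function, which accepts a set and returns a sorted list. The sketch below assumes the same participant names as above; <em>guest_lst</em> is a name chosen only for this illustration.

```python
participants = ['Chase', 'Konrad', 'Jane', 'Anders',
                'Andrew', 'Bertil', 'Jane']

# list -> set removes the duplicate 'Jane'
unique_participants = set(participants)

# set -> sorted list gives a predictable, alphabetical order
guest_lst = sorted(unique_participants)

print(guest_lst)
# ['Anders', 'Andrew', 'Bertil', 'Chase', 'Jane', 'Konrad']
```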

<h3>X.3 Adding and Removing Set Elements</h3><br>
The simple cases above utilize sets as a stand-alone element. In most scenarios, however, sets are used in tandem with other Python syntax, such as for loops. For example, if we wanted to add elements to a set, we could simply apply the <strong>add( )</strong> method. This is the set equivalent of the <strong>append( )</strong> method for lists, which was covered in <strong>Chapter 5</strong>.

<strong>Note:</strong> One of the beauties of Python is its relatively mild learning curve compared to other programming languages. The time you invested in learning list methods will pay off, as many of these concepts carry over to Python's other data structures.

<br><hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

In [None]:
## Code XYZ ##

# adding elements to a set
unique_participants.add('Johan') # is not present in the set
unique_participants.add('Jane')  # already is in the set


# printing the results
print(unique_participants)

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray"><br>

Sadly, even though Jane RSVP'd more than any other attendee, she has just informed us that she will not be able to attend the party. Thus, we need to update the guest list. In other words, we need a way to remove an element from our set. Using your intuition and understanding of lists, try to guess which method we should use to accomplish this.<br><br>

The <strong>remove( )</strong> method.

<br>

Let's remove Jane from the set. Also, given that we have made a number of changes to the original data, we should recount the number of people that are planning to attend so that we can prepare the right quantities of food.
<br><br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

In [None]:
## Code XYZ ##

# removing elements from a set
unique_participants.remove('Jane')


# printing the results
print(unique_participants)


# printing the current number of participants
print("The final number of participants is: " + str(len(unique_participants)))

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

Try to run the cell above twice. The first time you run it, Jane is removed from the set. The second time, the set no longer contains an entry for Jane, and Python will throw a <a href="https://docs.python.org/3/library/exceptions.html#KeyError">key error</a>. This behavior makes sense, as Python is simply trying to let us know that something went wrong. However, imagine that we wanted to make sure Jane was not on the guest list, but didn't want the code to throw an error in the event that we had already removed her. For this, we can use the <strong>discard( )</strong> method. This method will "safely remove" an element if it exists in the set or do nothing if the element is not present. Try running <em>Code XYZ</em> several times to see that <strong>discard( )</strong> will not throw a key error.<br><br>
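
The contrast between the two methods can be sketched in a single cell. The example below uses a small illustrative set named <em>guests</em> (not the party data above) and catches the key error so that the cell runs to completion.

```python
guests = {'Chase', 'Konrad', 'Jane'}  # illustrative guest set

guests.remove('Jane')       # works: 'Jane' is in the set

try:
    guests.remove('Jane')   # fails: 'Jane' is already gone
except KeyError:
    print("remove( ) raised a KeyError")

guests.discard('Jane')      # does nothing and raises no error

print(sorted(guests))  # ['Chase', 'Konrad']
```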

<h4>Mutability</h4>

Mutability was briefly discussed in <strong>Chapter 5</strong> in the context of lists. As a review, an object is mutable if it can be changed, such as by adding or removing elements. 


If you would like to guarantee the immutability of your set and disallow the addition or removal of elements, you can create a <strong>frozenset</strong>, which acts much like the set we discussed above, except that it lacks the <strong>add( )</strong>, <strong>remove( )</strong>, and <strong>discard( )</strong> methods.

In [None]:
## Code XYZ ##

# safe removal from a set
unique_participants.discard('Jane')


# printing the results
print(unique_participants)

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

In [None]:
## Code XYZ ##

# declaring an immutable set
immutable_set = frozenset(unique_participants)


# printing the results
print(immutable_set)


#!# none of the methods below will work, as the set is now immutable
# immutable_set.add('Roman')
# immutable_set.remove('Jane')
# immutable_set.discard('Jane')

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
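
To see what "will not work" means in practice, the sketch below attempts to modify a frozenset and catches the resulting error. The names are illustrative. One way to think about the benefit: once a guest list has been finalized and sent to the caterer, freezing it helps ensure that no later line of code changes it by accident.

```python
guests = frozenset(['Chase', 'Konrad', 'Johan'])  # illustrative names

try:
    guests.add('Roman')
except AttributeError as error:
    # frozenset objects simply have no add( ) method
    print("Cannot modify a frozenset:", error)

# reading from a frozenset still works as usual
print(len(guests))          # 3
print('Konrad' in guests)   # True
```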

<h3>X.4 Set Operations</h3><br>
<strong>Note:</strong> For practical purposes, we will use the words "list" and "set" interchangeably when referring to the guest lists in this section.

To complicate matters further (and to have a great reason to introduce set operations!), your friend Jacob has been secretly planning a surprise party for your birthday. Two interesting facts about Jacob: (1) he has an enormous number of friends, and (2) he likes to invite people to parties who he thinks will become friends. Additionally, he was not aware of the party we were planning, and now two lists of party guests exist. Once this was discovered, it was suggested to combine the lists (remember - sets!). We could simply combine the lists and remove duplicates by converting the result into a set (<em>Code XYZ</em>). However, this would hinder our ability to address some interesting questions about the gathered data. For example:
<br><br>

<strong>Which people are on both guest lists?</strong><br>
In other words, which of our friends are also friends with Jacob? There are many practical reasons for asking such a question. For example, if you and Jacob got into a huge fight, who would be caught in the crossfire? Sets in Python rely on the method <strong>intersection()</strong> to address such a challenge.
<br><br>

<strong>Who is on either list or both lists?</strong><br> Perhaps by answering this question, we can collaborate with Jacob and prepare the party together. This is a great opportunity to unite and apply the <strong>union()</strong> method to get everyone on the same page.
<br><br>

<strong>Who is only on our list? Who is only on Jacob's list?</strong><br>
There is a high chance that many people at the party will not know each other, especially those appearing on only one of the lists. If a guest only appears on one of the lists, perhaps we can accommodate them better by introducing them to someone that only appears on the other list (in fact, this is how Chase's friend Andrew met his wife Justine). In such situations, we can utilize the <strong>difference()</strong> method.
<br><br>

<strong>Who is on either list, but not on both lists?</strong><br>
Note that in <a href="https://en.wikipedia.org/wiki/Set_theory">mathematical set theory</a>, this is a very different question from the one stated above, and this distinction becomes more pronounced in critical tools such as the various dialects of <a href="https://en.wikipedia.org/wiki/SQL">structured query language (SQL)</a>. For now, we can use the <strong>symmetric_difference()</strong> method to address this question, which will result in one set of all of the people that appeared on only one of the guest lists (not both). In other words, <strong>symmetric_difference()</strong> returns everything in the <strong>union()</strong> that is not in the <strong>intersection()</strong>.
<br><br>

<strong>Next Steps</strong><br>
Run the sets in <em>Code XYZ</em>. Notice that sets are being declared using two different syntax styles. Also, to tie up loose ends and just in case it was bothering you, please assume the following: Jacob <em>was</em> invited to the party. We accidentally sent the invite to the email address that he uses when he's forced to put his email on web forms. So don't feel bad that we may have forgotten to invite him :)
<br><br>

In [None]:
## Code XYZ ##

# declaring my_lst as a list using [] syntax
my_lst = ['Konrad', 'Johan', 'Chase',
          'Andrew', 'Bertil', 'Anders']


# declaring jacobs_lst as a list
jacobs_lst = ['Cheryl', 'Lana', 'Konrad',
              'Jane', 'Anders', 'Andrew',
              'Bertil', 'Jane', 'Dimitri',
              'Arshak', 'Julia', 'Alex',
              'Jonathan', 'Risson', 'Zhou',
              'Rowena', 'Harry', 'Sam',
              'Nick', 'Evan', 'Connie']


# combining the lists and converting the result into a set
set(my_lst + jacobs_lst)

<br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

<br>

In [None]:
## Code XYZ ##

# declaring my_set using {} syntax
my_set = {'Konrad', 'Johan', 'Chase',
          'Andrew', 'Bertil', 'Anders'}


# declaring jacobs_set by converting a list
# into a set
jacobs_set = set(['Cheryl', 'Lana', 'Konrad',
                  'Jane', 'Anders', 'Andrew',
                  'Bertil', 'Jane', 'Dimitri',
                  'Arshak', 'Julia', 'Alex',
                  'Jonathan', 'Risson', 'Zhou',
                  'Rowena', 'Harry', 'Sam',
                  'Nick', 'Evan', 'Connie'])

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

Luckily, the aforementioned approaches to addressing these questions are native to Python, meaning they can be accessed without importing any additional packages. To reiterate the methods mentioned above, please see the following symbolic expressions, which are commonplace in the analytical and technical worlds.

- Intersection (A⋂B)
- Union (A⋃B)
- Difference (A\B) or (B\A)
- Symmetric Difference (A⊖B)

<br>
As sets grow in size, such as in the party guests example above, it becomes impractical to simply read the values and find a solution. This presents yet another excellent case for utilizing a programmatic approach! As you move forward, note that each set operation has been visualized.
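
As a side note, each of the symbolic expressions listed above also has a built-in operator shorthand in Python. The sketch below uses two small illustrative sets (stand-ins for the guest lists, not the full data) to show that each operator returns the same result as its method. One difference worth knowing: the operators require both sides to be sets, whereas the methods also accept other iterables such as lists.

```python
a = {'Konrad', 'Johan', 'Chase'}    # illustrative stand-in for my_set
b = {'Konrad', 'Lana', 'Cheryl'}    # illustrative stand-in for jacobs_set

# each operator mirrors one of the four set methods
print(a & b == a.intersection(b))           # True
print(a | b == a.union(b))                  # True
print(a - b == a.difference(b))             # True
print(a ^ b == a.symmetric_difference(b))   # True
```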

<br><br>

<h4>Intersection (A⋂B)</h4>
In the following images, think of "A" as "my_set" (our guest list) and "B" as "jacobs_set" (Jacob's guest list). The intersection (i.e. the shared part) is labeled with the symbolic expression "A⋂B". <strong>intersection( )</strong> returns the elements that are common between both A and B (A⋂B). In other words, these are guests that both you and Jacob have reached out to. We should probably not bother them a third time. 

![caption](images/venn_diagram.png)

In [None]:
## Code XYZ ##

# intersection (my_set ⋂ jacobs_set)
my_set.intersection(jacobs_set)

<br><br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

<br><br>

<h4>Union (A⋃B)</h4><br>
Remember, union is the starting point for collaboration with Jacob. It combines the two guest lists and removes duplicates, giving you a clearer understanding of who is attending the party. In mathematical terms, the number of people attending the party can be computed from the sizes of A, B, and A⋂B:

<br>

~~~
|A⋃B| = |A| + |B| - |A⋂B|
~~~

Note from the diagram that the population of A⋂B is counted twice, as it is included in both A and B. Thus, it must be subtracted once from the sum of the sizes of A and B. The result presents a final set that we can use as our new starting point to understand who has yet to be invited.

![caption](images/venn_diagram_union.png)

In [43]:
## Code XYZ ##

# union
my_set.union(jacobs_set)

{'Alex',
 'Anders',
 'Andrew',
 'Arshak',
 'Bertil',
 'Chase',
 'Cheryl',
 'Connie',
 'Dimitri',
 'Evan',
 'Harry',
 'Jane',
 'Johan',
 'Jonathan',
 'Julia',
 'Konrad',
 'Lana',
 'Nick',
 'Risson',
 'Rowena',
 'Sam',
 'Zhou'}
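
The counting argument above can be checked directly with <strong>len( )</strong>. The sketch below redeclares both guest sets so that it can run on its own; the names <em>n_a</em>, <em>n_b</em>, <em>n_both</em>, and <em>n_union</em> are chosen only for this illustration.

```python
my_set = {'Konrad', 'Johan', 'Chase',
          'Andrew', 'Bertil', 'Anders'}

jacobs_set = set(['Cheryl', 'Lana', 'Konrad',
                  'Jane', 'Anders', 'Andrew',
                  'Bertil', 'Jane', 'Dimitri',
                  'Arshak', 'Julia', 'Alex',
                  'Jonathan', 'Risson', 'Zhou',
                  'Rowena', 'Harry', 'Sam',
                  'Nick', 'Evan', 'Connie'])

n_a     = len(my_set)                           # 6
n_b     = len(jacobs_set)                       # 20 (duplicate 'Jane' dropped)
n_both  = len(my_set.intersection(jacobs_set))  # 4
n_union = len(my_set.union(jacobs_set))         # 22

# the double-counted guests are subtracted once
print(n_a + n_b - n_both == n_union)  # True
```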

<br><br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

<br><br>

#### Difference (A\B) or (B\A)

When sets are compared, any values unique to a given set are known as the difference of that set. This addresses the question: Who is on our list that is not on Jacob's? In other words, the difference of sets tells us what one set has that the other one does not. The difference of A and B consists of the elements that are present in A but not in B. Similarly, the difference of B and A consists of the elements that are present in B but not in A.

Thanks to Jacob, we have realized that we forgot to invite Lana and Cheryl. Luckily, Jacob has already contacted them. Who would've guessed our social groups would be so similar?

<center>(A\B)</center> | <center>(B\A)</center>
- | - 
![caption](images/venn_diagram_relative_difference_ab.png) | ![caption](images/venn_diagram_relative_difference_ba.png)

In [44]:
## Code XYZ ##

# difference of sets
print("My set - Jacobs: " + str(my_set.difference(jacobs_set)))
print('\n')
print("Jacobs - my set: " + str(jacobs_set.difference(my_set)))

My set - Jacobs: {'Johan', 'Chase'}


Jacobs - my set: {'Cheryl', 'Jonathan', 'Nick', 'Alex', 'Connie', 'Jane', 'Risson', 'Sam', 'Julia', 'Dimitri', 'Arshak', 'Rowena', 'Harry', 'Evan', 'Lana', 'Zhou'}


<br><br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

<br><br>

#### Symmetric Difference (A⊖B)

![caption](images/venn_diagram_symetric_difference.png)

The symmetric difference can be a bit trickier to conceptualize. It returns the union of two differences: the elements of A that are absent from B, together with the elements of B that are absent from A. This operation is useful when you want to know what the sets do not have in common, and can be visualized as everything in A and B except the intersection A⋂B.

In [None]:
## Code XYZ ##

# symmetric difference
my_set.symmetric_difference(jacobs_set)

<br><br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

<br><br>

<h4>Comparing Operational Approaches</h4>

<center>(A\B)</center> | &nbsp; | <center>(B\A)</center> |
- | - | -
![caption](images/venn_diagram_relative_difference_ab.png) | **⋃** | ![caption](images/venn_diagram_relative_difference_ba.png) |


![caption](images/venn_diagram_symetric_difference.png)

Let us verify these approaches. If we subtract B from A and combine (i.e. union) that with the result of subtracting A from B, we should get everything that is not in the intersection, which is exactly what the symmetric difference should give us. Naturally, we can use the "==" operator to assess whether these sets are equivalent.
<br><br>
To additionally verify our work, we could go back to the set operation visualizations and compare them directly... in case we have not yet developed a sufficient amount of trust in machine-executed code :).
<br><br><br><br>

In [48]:
## Code XYZ ##

# asserting that the two sets are equal
my_set.difference(jacobs_set). \
union(jacobs_set.difference(my_set)) == \
my_set.symmetric_difference(jacobs_set)

True

Seems like our solution is correct!
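
A second cross-check follows from the visualization: the symmetric difference should also equal the union with the intersection removed. The sketch below uses two small illustrative sets rather than the guest lists, so it can be run on its own.

```python
a = {'Konrad', 'Johan', 'Chase'}    # illustrative stand-in for my_set
b = {'Konrad', 'Lana', 'Cheryl'}    # illustrative stand-in for jacobs_set

# everything on either list, minus the people on both lists
union_minus_shared = a.union(b).difference(a.intersection(b))

print(union_minus_shared == a.symmetric_difference(b))  # True
```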

<br><br>

<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">
<hr style="height:0.5px;border-width:0;color:gray;background-color:gray">

<br><br>

# External Reading

- https://python-reference.readthedocs.io/en/latest/docs/sets/
- https://www.w3schools.com/python/python_sets.asp
- https://docs.python.org/3/tutorial/datastructures.html#sets

# Exercises

We made it through the basic cases of sets and introduced the different set operations with their respective symbolic notations. Now the time has come to practice these concepts in a scenario with three sets.

![3vd](images/venn_diagram_3.png)

You are working as a Data Scientist at a respected global education company. You have collected data on three of the best-performing students. Right now you are looking into which teachers taught all three students, believing that identifying them would be a great predictor of the success of future students.

In [None]:
student1 = {"Mr Beto","Ms Alice","Mr Han","Ms Aline de Silva","Senhor Bruno Ferreira","Mr Damese","Ms Jabrullah","Mr Elif","Ms Ilaf"}
student2 = {"Senhor Bruno Ferreira","Mr Damese","Ms Jabrullah","Ms Ilaf","Mr Geralt","Ms Sandeepa","Ms Xin Guo"}
student3 = {"Mr Han","Mr Damese","Ms Jabrullah","Ms Ilaf","Mr Tomasz"}

First, let's begin by printing out and counting the unique teacher names across the three sets. Try to write the code yourself first in the cells provided below, and only afterwards compare it with our solutions.

In [None]:
student1.union(student2).union(student3)

In [None]:
len(student1.union(student2).union(student3))

In [None]:
### Type your code here

Which teachers did they all have in common?

In [None]:
student1.intersection(student2).intersection(student3)

In [None]:
### Type your code here

You realize that you made a mistake while collecting your data: the student2 set actually comes from a higher-level course that no longer exists, and the teachers that taught it are long gone from the school. You will need to remove them and only consider the teachers that the other two students had in common.

In [None]:
student1.intersection(student3).difference(student2)

In [None]:
### Type your code here

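Which teachers were not shared by all three students?

In [None]:
# this is a trap :-) chaining symmetric_difference() keeps every teacher
# that appears in an odd number of sets, including those in all three
student1.symmetric_difference(student2).symmetric_difference(student3)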

In [None]:
student1.union(student2).union(student3).difference(student1.intersection(student2).intersection(student3))

In [None]:
### Type your code here
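
To see why the chained symmetric difference is a trap, it may help to trace a tiny example. The sketch below uses three illustrative sets whose element names describe where each element lives; these are made-up labels, not the teacher data. An element shared by all three sets survives the chain, while an element shared by exactly two sets is dropped.

```python
# illustrative sets; each label says which sets contain it
a = {"only_a", "a_and_b", "all_three"}
b = {"only_b", "a_and_b", "all_three"}
c = {"only_c", "all_three"}

# chained symmetric difference: keeps elements found in an odd number of sets
chained = a.symmetric_difference(b).symmetric_difference(c)
print(sorted(chained))
# ['all_three', 'only_a', 'only_b', 'only_c']

# union minus the three-way intersection: drops only what all three share
not_shared_by_all = a.union(b).union(c).difference(
    a.intersection(b).intersection(c))
print(sorted(not_shared_by_all))
# ['a_and_b', 'only_a', 'only_b', 'only_c']
```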
