<a href="https://colab.research.google.com/github/benjamin-du/intermediate-python-topics/blob/main/Working_with_sets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install faker >> /dev/null

In [2]:
%%html 
<p style="text-align: center; font-size: 56px"> SETS </p>



-------------------

In [3]:
from faker import Faker
import pandas as pd

In [4]:
fake = Faker()

In [5]:
id_customers_database = {"ID_customers_database":[fake.uuid4() for i in range(10)]}

In [6]:
id_reimbursement_database = {"ID_reimbursement_database": id_customers_database["ID_customers_database"][:5] + [fake.uuid4() for i in range(5)]}

In [7]:
id_customers_database

{'ID_customers_database': ['ec99c7bc-c780-4126-b95b-5c319caab526',
  '1f76ce85-9485-4046-8187-b626345dd3f8',
  '83f09866-0de5-46c6-b12b-bf8f258be0a4',
  '59fc9344-3532-4984-aaa6-8755ddfc8613',
  '63c1a579-4320-4b8d-a527-82c65d2fab78',
  '5a53b7f5-0309-47b8-aa51-c0aeb4cc058c',
  '0d254945-4ad6-4abd-bc69-ee9cb66807bf',
  '8225c5f4-48e2-47ff-9279-ec0224a0e4a5',
  'e8c7009b-9f4a-44c1-a2fe-86cdbe7bb35e',
  'c30d7f4e-1e7b-43d1-82e2-768fd8143040']}

In [8]:
id_reimbursement_database

{'ID_reimbursement_database': ['ec99c7bc-c780-4126-b95b-5c319caab526',
  '1f76ce85-9485-4046-8187-b626345dd3f8',
  '83f09866-0de5-46c6-b12b-bf8f258be0a4',
  '59fc9344-3532-4984-aaa6-8755ddfc8613',
  '63c1a579-4320-4b8d-a527-82c65d2fab78',
  'da745b1b-7d11-4076-9f6d-985c65d3ce8f',
  '850f65d5-cccb-45c3-9cc0-859ea07a3982',
  '0ca1d38c-4f56-4510-bfd4-94536bfcf2d9',
  'aac86721-c1cc-49a4-b403-29b188358b65',
  '6b224c1e-5c19-41ea-b80c-87c34ebb73b4']}

# Working with sets

## Declaring a set

You can declare a set just like this: 

In [9]:
my_set = {"item1", "item2", "item3", "item4", "item5" }
my_set

{'item1', 'item2', 'item3', 'item4', 'item5'}

Another option is to pass any iterable, such as a list, to the set() function:

In [10]:
my_list = ['apple', 'date', 'citrus', 'banana', 'apple']
my_second_set = set(my_list)
my_second_set

{'apple', 'banana', 'citrus', 'date'}

In [11]:
type(my_set)
type(my_second_set)

set
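`set()` is not limited to lists. Here is a minimal sketch (the variable names are only for illustration) that builds sets from a string, a tuple and a range. It also shows the one restriction to keep in mind: elements must be hashable, so a list cannot be stored inside a set.

```python
# set() accepts any iterable
letters = set("hello")      # a string is iterated character by character
pair = set((1, 2, 2))       # a tuple, duplicates are dropped
digits = set(range(5))      # a range

print(len(letters))                 # 4 -> 'h', 'e', 'l', 'o'
print(pair == {1, 2})               # True
print(digits == {0, 1, 2, 3, 4})    # True

# Elements must be hashable: a list inside a set raises TypeError
try:
    bad = {[1, 2], [3, 4]}
    unhashable_rejected = False
except TypeError:
    unhashable_rejected = True
print(unhashable_rejected)          # True
```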

### Need an empty set? **Be careful:**

In [12]:
# empty braces create a dictionary, not a set:
print(type({}))
# if you want an empty set, use the function set()
print(type(set()))

<class 'dict'>
<class 'set'>


## Key feature 1: no duplicates

You may have noticed that the second set contains only four elements:

In [13]:
my_second_set

{'apple', 'banana', 'citrus', 'date'}

That's because sets do not contain duplicate elements.

In my opinion, this is their most important feature. I often use sets to find out quickly how many unique values a list contains:

In [14]:
len(my_list)

5

In [15]:
len(set(my_list))

4

## Key feature 2: unordered

Another thing you may have noticed in the examples above is that my_second_set is printed in alphabetical order:

In [16]:
# date comes before citrus, citrus comes before banana, etc
print(my_list)

['apple', 'date', 'citrus', 'banana', 'apple']


In [17]:
# this was reordered for printing:
set(my_list)

{'apple', 'banana', 'citrus', 'date'}

Wait, wait, wait! The title says "unordered", and now you're telling me the set has sorted the list... What's going on?

Well, that's because we are in a Jupyter notebook, which seems to sort the elements of a set when displaying it (check <a href="https://stackoverflow.com/questions/62360830/why-does-python-print-a-set-of-numbers-as-ordered">this</a>).

But you can't rely on a set being sorted:

In [18]:
# "great, a Python set sorts my values for me!!!!"
print(list(set([6,3,7])))

# "Hmmm, not so much....."
print(list(set([6,3,8])))

# "In fact, really not!"
print(set("Hello!"))

[3, 6, 7]
[8, 3, 6]
{'o', '!', 'H', 'l', 'e'}


All in all, we've seen that:
- A set doesn't care in which order you gave it your values
- The order in which it gives them back is unpredictable

In short, sets are _unordered_. I don't think there is a use for this feature; it is just something you should be aware of :)
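If you do need a predictable order, you have to ask for it explicitly. The sketch below uses only built-ins: `sorted()` turns a set into a sorted list, and `dict.fromkeys()` is one common way to drop duplicates while keeping the order in which items first appeared (dict keys keep insertion order since Python 3.7).

```python
my_list = ['apple', 'date', 'citrus', 'banana', 'apple']

# sorted() always returns a list in a well-defined order
alphabetical = sorted(set(my_list))
print(alphabetical)   # ['apple', 'banana', 'citrus', 'date']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without shuffling the original order
first_seen = list(dict.fromkeys(my_list))
print(first_seen)     # ['apple', 'date', 'citrus', 'banana']
```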

## Key feature 3: mutable

Mutable means: "I can change it later". In our case, we can add items to our set and remove them without any problem:

In [19]:
# remember to declare an empty set using set():
my_set = set()
my_set.add("A")
print(my_set)
my_set.remove("A")
print(my_set)

{'A'}
set()


My set is now empty. What happens if I try to remove an element that is not in there?

In [20]:
my_set.remove("A")

KeyError: 'A'

To avoid this problem, you can use the discard() method:

In [21]:
# my_set is still empty, yet no problem:
my_set.discard("A")
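To compare the two methods side by side, here is a small sketch (the `basket` set and its contents are made up for the example). It shows `remove()` raising `KeyError`, a membership check that avoids the exception, and `discard()` doing the same job silently.

```python
basket = {"apple", "banana"}

# remove() raises KeyError when the element is missing
try:
    basket.remove("cherry")
    removed = True
except KeyError:
    removed = False
print(removed)   # False

# Checking membership first avoids the exception
if "cherry" in basket:
    basket.remove("cherry")

# discard() never raises, whether or not the element is present
basket.discard("cherry")
basket.discard("apple")
print(basket)    # {'banana'}
```

Use `remove()` when a missing element means something went wrong, and `discard()` when you only care that the element is gone afterwards.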

## Practical uses

Let's say an insurance company gives you two files:
- one contains information about their customers (ID, address, age, type_of_contract, ...)
- the other contains information about reimbursements (customer_id, how_much_he_paid, how_much_we_reimbursed_that_customer, date_of_operation, etc.)

Of course, if a reimbursement was sent to a customer ID that doesn't exist in the customer database, it would raise a red flag.

Sets help you quickly find which customer_ids are in one file and not the other:

In [22]:
id_customers_database = ['365fe38b-4275-4daa-bc1d-450ffd65162b',
                         'c61097a6-a936-4878-a080-21259010d16b',
                         '00f0e885-e7a8-460a-b7b8-21f5af90f5e6',
                         '0053d7e1-2afe-4fec-889b-4689fd39a463',
                         '429b2e6f-f561-4280-98c8-7fa652e8dd7b',
                         '27d4133c-1ca8-41f0-82a3-8c1b6e5724dc',
                         '0fb023a8-a847-45e6-8bba-89a526e413de',
                         '997bc88e-b056-4306-b498-815fbf1d3199',
                         'd0f903c7-c3fb-4cc0-af13-211a1e815464',
                         '2aa42d19-5b2b-482b-bb41-c419d227ce29']

id_reimbursement_database = ['365fe38b-4275-4daa-bc1d-450ffd65162b',
                             'c61097a6-a936-4878-a080-21259010d16b',
                             '00f0e885-e7a8-460a-b7b8-21f5af90f5e6',
                             '365fe38b-4275-4daa-bc1d-450ffd65162b',
                             '429b2e6f-f561-4280-98c8-7fa652e8dd7b',
                             '27d4133c-1ca8-41f0-82a3-8c1b6e5724dc',
                             '365fe38b-4275-4daa-bc1d-450ffd65162b',
                             '365fe38b-4275-4daa-bc1d-450ffd65162b',
                             '4ca03197-da1d-45c1-a856-c5d0337b457a',
                             'd585b7eb-a8e7-4e13-b554-1b51878c02b8']

First, let's put our databases into sets:

In [23]:
id_customers_database_set = set(id_customers_database)
print(len(id_customers_database_set))
id_reimbursement_database_set = set(id_reimbursement_database)
print(len(id_reimbursement_database_set))

10
7


We can see the second set has only 7 elements. That's because one customer (the ID starting with 365fe38b) was reimbursed four times, so that ID appears four times in the list.
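A set tells you how many distinct IDs there are, but not which ones were repeated. One option for that is `collections.Counter` from the standard library. The sketch below uses shortened, made-up IDs in place of the UUIDs above.

```python
from collections import Counter

# Hypothetical short IDs standing in for the UUIDs above
reimbursed_ids = ["id-1", "id-2", "id-3", "id-1", "id-4", "id-1"]

counts = Counter(reimbursed_ids)

# Keep only the IDs that appear more than once
repeated = {cid: n for cid, n in counts.items() if n > 1}
print(repeated)   # {'id-1': 3}

# The gap between the two lengths is the number of duplicate rows
duplicate_rows = len(reimbursed_ids) - len(set(reimbursed_ids))
print(duplicate_rows)   # 2
```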

Now, let's use set operations to find out if some people got reimbursed even though they don't appear in the customer base:

In [24]:
# IDs that were reimbursed but do not appear in id_customers_database:
id_reimbursement_database_set - id_customers_database_set

{'4ca03197-da1d-45c1-a856-c5d0337b457a',
 'd585b7eb-a8e7-4e13-b554-1b51878c02b8'}
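The difference operator is only one of several set operations. As a rough sketch on made-up short IDs (not the UUIDs above), here is how the other operators answer related questions about the two files:

```python
# Hypothetical short IDs standing in for the UUIDs above
customers = {"id-1", "id-2", "id-3", "id-4"}
reimbursed = {"id-1", "id-2", "id-9"}

# Reimbursed but unknown to the customer database (same operation as above)
unknown = reimbursed - customers
print(unknown)                      # {'id-9'}

# Customers who received at least one reimbursement
both = reimbursed & customers
print(sorted(both))                 # ['id-1', 'id-2']

# Customers who were never reimbursed
never = customers - reimbursed
print(sorted(never))                # ['id-3', 'id-4']

# IDs present in exactly one of the two files
one_side_only = reimbursed ^ customers
print(sorted(one_side_only))        # ['id-3', 'id-4', 'id-9']

# Quick yes/no check: is every reimbursed ID a known customer?
all_known = reimbursed <= customers
print(all_known)                    # False
```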

(Note: This example might seem trivial, but it comes from a real case I encountered in my professional experience. I was given two CSV files, each with a customers_id column, and I actually used sets to quickly compare them and find out whether something was wrong with the IDs. It turns out some people had indeed been given reimbursements even though their IDs didn't appear in the customer base!)
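For illustration, here is roughly what such a comparison could look like with the standard `csv` module. Everything specific in this sketch is an assumption: the column name `customer_id`, the helper `id_set` and the file contents are invented, and `io.StringIO` stands in for real files so the example runs on its own.

```python
import csv
import io

# Stand-ins for the two CSV files; column names and rows are assumptions
customers_csv = io.StringIO("customer_id,age\nid-1,34\nid-2,51\nid-3,27\n")
reimbursements_csv = io.StringIO(
    "customer_id,amount\nid-1,120\nid-1,80\nid-7,300\n"
)

def id_set(file_obj, column="customer_id"):
    """Collect the distinct values of one column into a set."""
    return {row[column] for row in csv.DictReader(file_obj)}

known = id_set(customers_csv)
reimbursed = id_set(reimbursements_csv)

# IDs that were reimbursed but are missing from the customer file
suspicious = reimbursed - known
print(suspicious)   # {'id-7'}
```

With real files, you would pass an open file object (for example the result of `open(path, newline="")`) instead of the `StringIO` stand-ins.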

# Exercises

1) Create an empty set and a set containing numbers from 0 to 10

In [25]:
empty_set = set()
# range(11) yields 0 through 10, so the name reflects that
from_zero_to_ten = set(range(11))

2) Iterate over a set:

In [26]:
for n in from_zero_to_ten:
    print(n)

0
1
2
3
4
5
6
7
8
9
10


3) Add "a", "b" and "c" to an empty set:

In [27]:
my_set = set()
my_set.update(["a", "b", "c"])
my_set

{'a', 'b', 'c'}

4) Update a set with a set:

In [29]:
from_zero_to_ten = set(range(11))
my_set = set(["a", "b", "c"])
from_zero_to_ten.update(my_set)
from_zero_to_ten

{0, 1, 10, 2, 3, 4, 5, 6, 7, 8, 9, 'a', 'b', 'c'}

5) Remove "a", "b" and "c" from the set:

In [39]:
my_set = set(list(range(11)) + ["a", "b", "c"])
my_set.discard("a")
my_set.discard("b")
my_set.discard("c")
my_set

# pro version: 

my_set = set(list(range(11)) + ["a", "b", "c"])
my_set.difference_update({"a", "b", "c"})
my_set


{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
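As a side note, `difference_update()` has operator counterparts. The sketch below reuses the set from this exercise to show the difference between `-` (returns a new set), `-=` (modifies the set in place) and the method form, which also accepts non-set iterables.

```python
my_set = set(list(range(11)) + ["a", "b", "c"])

# - returns a new set and leaves the original untouched
only_numbers = my_set - {"a", "b", "c"}
sizes = (len(my_set), len(only_numbers))
print(sizes)                     # (14, 11)

# -= modifies the set in place, like difference_update()
my_set -= {"a", "b", "c"}
print(my_set == only_numbers)    # True

# The method form accepts any iterable; the operators require a set
my_set.difference_update(["a", 0])   # a list works here
print(0 in my_set)               # False
```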

Try passing a dict to the set() function. What happens?
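One possible answer, sketched on a made-up dict (`prices` is not part of the notebook): iterating over a dict yields its keys, so `set()` keeps the keys and silently drops the values.

```python
prices = {"apple": 1.5, "banana": 0.25}   # hypothetical dict

# Iterating over a dict yields its keys, so set() keeps only the keys
keys_only = set(prices)
print(keys_only == {"apple", "banana"})    # True

# To collect the values instead, ask for them explicitly
values_only = set(prices.values())
print(values_only == {1.5, 0.25})          # True
```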

# Resources


# Bonus: Venn diagrams

In [None]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

In [None]:
plt.figure(figsize=(8,8))
v = venn3(subsets=(10,10,10,10,10,10,10),
      set_labels=("Statistics", "Programming", "Domain Knowledge"))

v.get_label_by_id('100').set_text('Statistics')
v.get_label_by_id('010').set_text('Hacking skills')
v.get_label_by_id('001').set_text('Domain Knowledge')
v.get_label_by_id('111').set_text('Data Science')
v.get_label_by_id('110').set_text('Machine Learning')
v.get_label_by_id('011').set_text('Danger Zone')
v.get_label_by_id('101').set_text('Traditional research')
plt.show()

In [None]:
?venn3