# Sets


## Sets

* unordered
* unique elements only - great for getting the unique items out of a collection
* written with curly braces, e.g. {3, 6, 7} - dictionaries also use {}, but with key:value pairs separated by :
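
Because dictionaries and sets share curly braces, empty braces are a special case. A minimal sketch (the variable names are only illustrative) showing which literal produces which type:

```python
empty_dict = {}        # empty braces create a dictionary, not a set
empty_set = set()      # an empty set has to be created with set()
one_item_set = {3}     # braces around bare elements (no colons) make a set

print(type(empty_dict).__name__)    # dict
print(type(empty_set).__name__)     # set
print(type(one_item_set).__name__)  # set
```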

In [1]:
s = {3,3,6,1,3,6,7} # so curly braces are also used for a set not only a dictionary
print(s)

{1, 3, 6, 7}


In [2]:
nset = set((3,3,6,1,3,6,7)) # alternative is to use set which takes any iterable
nset

{1, 3, 6, 7}

In [3]:
character_set = set("abracadbra") # since string is an iterable this is allowed
character_set

{'a', 'b', 'c', 'd', 'r'}

In [4]:
num_set = set([1,2,6,2,7,2,1]) # could pass a list
num_set

{1, 2, 6, 7}

In [5]:
a = set("ķiļķēni un klimpas") # set() takes any iterable, so a string qualifies
a

{' ', 'a', 'i', 'k', 'l', 'm', 'n', 'p', 's', 'u', 'ē', 'ķ', 'ļ'}

### Set from Mixed Data Types

Since this is Python, we can mix and match data types in one set, as long as every element is hashable.

In [6]:
b = {"abracadbra","abba", "dubba", "abba",56,7,2,12,2,2,1,1}
b

{1, 12, 2, 56, 7, 'abba', 'abracadbra', 'dubba'}

In [7]:
bset = set(["abracadbra","abba", "dubba", "abba"])
bset

{'abba', 'abracadbra', 'dubba'}
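
Mixing types has one limit: set elements must be hashable. The sketch below (names such as `points` and `nested` are made up for illustration) shows that tuples and frozensets work as elements, while lists raise a `TypeError`:

```python
points = {(0, 0), (1, 2), (0, 0)}   # tuples are hashable, duplicates collapse
print(len(points))                  # 2

error_message = None
try:
    bad = {[1, 2], [3, 4]}          # lists are mutable, hence unhashable
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# a set of sets needs frozenset for the inner sets
nested = {frozenset({1, 2}), frozenset({2, 1}), frozenset({3})}
print(len(nested))                  # 2
```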

In [12]:
aset = set("abracadbra")
aset

{'a', 'b', 'c', 'd', 'r'}

### Looping through sets

No order guarantee

In [13]:
for c in aset: # notice no guarantee on order 
    print(c)

r
d
c
b
a


In [14]:
# If I wanted a guaranteed order, I would convert the set to a sorted list and loop through that
for c in sorted(aset):
    print(c)

a
b
c
d
r


### Membership check in set - O(1)

Membership checks on sets are very fast: constant time on average, regardless of the size of the set.

Use a set rather than a list when you need to make many membership checks.

Lists use linear-time lookup, so the cost grows with the length of the list.

In [15]:
# this lookup is very quick even for large sets - just like key lookup in dictionaries
'a' in aset, 'b' in aset, 'f' in aset 

(True, True, False)

In [None]:
mylist = sorted(aset) # sorted gives you a list
mylist

['a', 'b', 'c', 'd', 'r']

In [None]:
# list lookup is linear, so it is much slower for large lists (10_000+ items and so on)
'a' in mylist, 'b' in mylist, 'f' in mylist

# of course, making a set from a million-item list is also slow, but it only needs to be done once.


(True, True, False)
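
To get a feel for the difference, here is a rough timing sketch using the standard `timeit` module. The sizes and the names `big_list` and `big_set` are arbitrary choices, and the actual numbers will vary by machine:

```python
import timeit

big_list = list(range(100_000))
big_set = set(big_list)   # one-time conversion, linear in the list length
missing = -1              # not present: the list has to scan every item

# run each membership check 50 times and compare total time
list_time = timeit.timeit(lambda: missing in big_list, number=50)
set_time = timeit.timeit(lambda: missing in big_set, number=50)

print(f"list: {list_time:.4f} s, set: {set_time:.6f} s")
```

On typical hardware the set lookup should come out several orders of magnitude faster, but treat the printed figures as indicative only.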

In [None]:
type(s)

set

In [16]:
a

{' ', 'a', 'i', 'k', 'l', 'm', 'n', 'p', 's', 'u', 'ē', 'ķ', 'ļ'}

In [17]:
# I can always convert a set to a list if I need the items in a fixed order (use sorted() for a sorted one)
myletters = list(a)
myletters

['n', 'i', 'ķ', 'ē', ' ', 'u', 'm', 'l', 'p', 'ļ', 's', 'k', 'a']

In [18]:
"|".join(sorted(a)) # you can join with any character even blank space
# notice that sorting uses Unicode code points, so Latvian letters come after the English ones
# TODO: sort in a locale-specific way

' |a|i|k|l|m|n|p|s|u|ē|ķ|ļ'
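
One possible way to tackle the TODO without depending on installed locales is an explicit alphabet string used as a sort key. This is only a sketch: `LATVIAN_ALPHABET`, `ORDER` and `latvian_key` are names invented here, and the alphabet order is written out by hand as an assumption.

```python
# Assumed lowercase Latvian alphabet order (hand-written for this sketch)
LATVIAN_ALPHABET = "aābcčdeēfgģhiījkķlļmnņoprsštuūvzž"
ORDER = {ch: i for i, ch in enumerate(LATVIAN_ALPHABET)}

def latvian_key(ch):
    # characters outside the alphabet (such as the space) sort first
    return (ORDER.get(ch, -1), ch)

letters = set("ķiļķēni un klimpas")
joined = "|".join(sorted(letters, key=latvian_key))
print(joined)  # ' |a|ē|i|k|ķ|l|ļ|m|n|p|s|u'
```

This key works on single characters only. For sorting whole words, `locale.strxfrm` from the standard library is an option, provided a Latvian locale is installed on the system.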

In [None]:
myletters[:3]

['a', 'ķ', 'u']

In [None]:
al = list(a)
al

['a', 'ķ', 'u', 'l', 'm', 'n', 'p', 's', ' ', 'ē', 'ļ', 'i', 'k']

In [None]:
sorted(al)

[' ', 'a', 'i', 'k', 'l', 'm', 'n', 'p', 's', 'u', 'ē', 'ķ', 'ļ']

In [None]:
s = {1,2,65,2,6,3}
s

{1, 2, 3, 6, 65}

In [19]:
nset = set(range(10))
nset # it looks sorted here, but a set still guarantees no order

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

In [27]:
big_number_set = set(list(range(4,18)) + list(range(-120,100,25)))
big_number_set

{-120,
 -95,
 -70,
 -45,
 -20,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 30,
 55,
 80}

In [28]:
!python --version

Python 3.9.16


## Set Operations

### Subset

![subset](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Venn_A_subset_B.svg/300px-Venn_A_subset_B.svg.png)

In [29]:
print(s)
print(nset)
s.issubset(nset)

{1, 3, 6, 7}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}


True

In [30]:
n_3_7 = set(range(3,8))
n_3_7

{3, 4, 5, 6, 7}

In [31]:
n_3_7.issubset(nset)

True

In [None]:
# Alternative syntax
n_3_7 < nset # proper (strict) subset, meaning n_3_7 can't be equal to nset

True

In [32]:
n_3_7 <= nset # allows equality, meaning a set is also a subset of a set with the same items

True

In [33]:
nset < nset

False

In [34]:
nset <= nset

True

### Superset

The inverse of subset.

In [35]:
nset.issuperset(s)

True

In [36]:
nset, s

({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {1, 3, 6, 7})

In [38]:
s.remove(6) # we can remove elements
s

{1, 3, 7}

In [39]:
nset, s

({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {1, 3, 7})

In [40]:
nset.issuperset(s)

True

In [None]:
# again we have shortcuts: > for proper (strict) superset
# and >= for superset allowing equality
nset > s, nset >= s, nset < s

(True, True, False)

In [41]:
s.issuperset(range(6))

False

In [None]:
nset.issuperset(range(6))

True
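
The method forms accept any iterable, as the `range` above shows, while the comparison operators expect a set on both sides. A small sketch of the difference (`digits` and `operator_error` are illustrative names):

```python
digits = set(range(10))

print(digits.issuperset(range(6)))   # True, a range is accepted
print({1, 2}.issubset([1, 2, 3]))    # True, a list is accepted

operator_error = None
try:
    digits >= range(6)               # operators need sets on both sides
except TypeError as err:
    operator_error = str(err)
    print("TypeError:", operator_error)

print(digits >= set(range(6)))       # True once the range is converted
```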

### Union

Union lets us combine sets - still no duplicates

![union](https://upload.wikimedia.org/wikipedia/commons/thumb/3/30/Venn0111.svg/400px-Venn0111.svg.png)

Mathematical notation: A∪B

In [42]:
n_5_9 = set(range(5,10))
n_5_9

{5, 6, 7, 8, 9}

In [44]:
n_3_9 = n_3_7.union(n_5_9)
n_3_9 # note how duplicates are gone

{3, 4, 5, 6, 7, 8, 9}

In [46]:
# you can take the union of multiple sets
{1,3,51,61,1,3} | {3,1,5,1,6,1} | {3,3,1,1,5,1,63,3}

{1, 3, 5, 6, 51, 61, 63}

In [49]:
# let's make a list of random sets
import random
random.seed(42) # fixed seed so we get the same random numbers on every run
set_list = [] # store our sets
for _ in range(5):
    set_list.append(set(range(random.randint(1,6), random.randint(10,20))))
    print(set_list[-1]) # print the last one - since the seed is fixed I get the same sets


{6, 7, 8, 9, 10}
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}
{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
{2, 3, 4, 5, 6, 7, 8, 9, 10}
{6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}


In [51]:
# I can unpack the list into separate sets as arguments to set.union
set.union(*set_list) # shorter than spelling out each set individually
# we need * because union expects individual sets as arguments

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}
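
One caveat with this unpacking trick: if the list of sets is empty, the call has no arguments at all and fails. A hedged sketch of a common workaround is to start from an empty set instance (`no_sets` and `safe_union` are illustrative names):

```python
no_sets = []

raised = False
try:
    set.union(*no_sets)              # expands to set.union() with no arguments
except TypeError as err:
    raised = True
    print("TypeError:", err)

safe_union = set().union(*no_sets)   # start from an empty set instead
print(safe_union)                    # set()

# the same form still works when there are sets to combine
print(sorted(set().union({1, 2}, {2, 3}, {5})))  # [1, 2, 3, 5]
```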

### Set Intersection

We keep values that are present in BOTH sets (or in ALL of them when there are more than two)

Mathematical notation: A∩B

![intersection](https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Venn0001.svg/440px-Venn0001.svg.png)

In [52]:
n_3_7.intersection(n_5_9)

{5, 6, 7}

In [53]:
n_3_7 & n_5_9 # same as intersection above so only elements in BOTH sets
# again shorter way of writing

{5, 6, 7}

In [54]:
n_5_7 = n_3_7 & n_5_9
n_5_7

{5, 6, 7}

In [55]:
n_5_7 = n_3_7 & n_5_9 & nset # nset is 0 to 9
n_5_7

{5, 6, 7}

In [56]:
n_5_6 = n_3_7 & n_5_9 & set(range(7)) # range goes to 6
n_5_6 # only 5 and 6 are in ALL 3 sets

{5, 6}

In [57]:
# an unexpected result might happen if we create an intersection which includes an empty set
empty_set = set()
empty_set

set()

In [58]:
empty_set & n_3_7 & n_5_7 # we get an empty set because there are NO elements common to all three

set()

In [59]:
# same trick that we used for union will work on intersection for multiples
set.intersection(*set_list)

{6, 7, 8, 9, 10}

### Set Difference

Items that are ONLY in the left set but not in the right set

![Set Difference](https://upload.wikimedia.org/wikipedia/commons/thumb/2/23/Relative_compliment.svg/460px-Relative_compliment.svg.png)

In the picture only the items in B are kept, since B is on the left: B\A

In [60]:
n_3_7.difference(n_5_9) # only elements unique to left side

{3, 4}

In [61]:
n_3_7 - n_5_9, n_5_9 - n_3_7 # so - is syntactic sugar for difference, and the order of operands matters

({3, 4}, {8, 9})

### Symmetric Difference

In [62]:
n_3_7.symmetric_difference(n_5_9) # only elements that are in exactly one of the two sets

{3, 4, 8, 9}

In [63]:
n_3_7 ^ n_5_9 # ^ is short for .symmetric_difference

{3, 4, 8, 9}
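
Symmetric difference can also be expressed through the other operations. A short sketch checking two equivalent formulations (`left`, `right` and `sym` are illustrative names):

```python
left = set(range(3, 8))    # {3, 4, 5, 6, 7}
right = set(range(5, 10))  # {5, 6, 7, 8, 9}

sym = left ^ right
print(sorted(sym))                              # [3, 4, 8, 9]
print(sym == (left - right) | (right - left))   # True: both one-sided differences
print(sym == (left | right) - (left & right))   # True: union minus intersection
```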

In [None]:
s

{1, 2, 3, 6}

### Updating existing sets

In [64]:
print(s)

{1, 3, 7}


In [65]:
# we can update with many different iterable types
s.update({3,3,6,2,7,9},range(4,15), [3,6,7,"Valdis", "Badac"])
s # so no order guaranteed

{1, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9, 'Badac', 'Valdis'}

In [None]:
dir(s)

['__and__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__iand__',
 '__init__',
 '__init_subclass__',
 '__ior__',
 '__isub__',
 '__iter__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__or__',
 '__rand__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__ror__',
 '__rsub__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__xor__',
 'add',
 'clear',
 'copy',
 'difference',
 'difference_update',
 'discard',
 'intersection',
 'intersection_update',
 'isdisjoint',
 'issubset',
 'issuperset',
 'pop',
 'remove',
 'symmetric_difference',
 'symmetric_difference_update',
 'union',
 'update']
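
Several of the methods listed above change a set in place. A minimal sketch of the most common ones, assuming a small throwaway set called `letters`:

```python
letters = {"a", "b"}

letters.add("c")          # add one element; no effect if already present
letters.discard("z")      # discard silently ignores a missing element

missing_key = None
try:
    letters.remove("z")   # remove raises KeyError for a missing element
except KeyError as err:
    missing_key = err.args[0]
    print("KeyError:", missing_key)

letters |= {"d", "e"}     # in-place union with another set
letters -= {"a"}          # in-place difference with another set
print(sorted(letters))    # ['b', 'c', 'd', 'e']
```

Prefer `discard` when a missing element is not an error, and `remove` when it should be noticed.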

In [None]:
# we can check if our set has anything in common with another data structure
n_3_7.isdisjoint(n_5_9) # False because the sets do intersect: 5, 6, 7

False
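
As a quick illustration, `isdisjoint` gives the same answer as checking whether the intersection is empty, and it accepts any iterable. The names `low` and `high` below are stand-ins for the two ranges used above:

```python
low = set(range(3, 8))     # {3, 4, 5, 6, 7}
high = set(range(5, 10))   # {5, 6, 7, 8, 9}

print(low.isdisjoint(high))                        # False, they share 5, 6 and 7
print(low.isdisjoint([0, 1, 2]))                   # True, a list is accepted
print(low.isdisjoint(high) == (not (low & high)))  # True, same answer both ways
```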

## Typical Use Cases for Sets

* Use sets to obtain unique elements, then convert back to other data structures

* Convert to sets, use set operations, then convert back to other structures

* Convert to a set to perform many membership tests
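
A small sketch of the first use case, removing duplicates from a list. The word list is made up; the last variant uses `dict.fromkeys`, which is not a set but relies on the same unique-keys idea and keeps first-seen order:

```python
words = ["pear", "apple", "pear", "fig", "apple"]

unique_any_order = list(set(words))             # unique, but order is arbitrary
unique_sorted = sorted(set(words))              # unique and alphabetical
unique_first_seen = list(dict.fromkeys(words))  # unique, original order kept

print(unique_sorted)      # ['apple', 'fig', 'pear']
print(unique_first_seen)  # ['pear', 'apple', 'fig']
```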

Example: sets are good for comparing word lists from two document corpora.
You could find which words are common and which words are unique to each corpus.

Another way of thinking of sets is as dictionaries with only keys, which, as you remember, have to be unique. Plus we get the set operations here.
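
The word-list comparison could be sketched like this. The two sentences are invented stand-ins for real documents, and splitting on whitespace is a simplification of proper tokenization:

```python
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the lazy cat sleeps while the quick dog runs"

words_a = set(doc_a.split())
words_b = set(doc_b.split())

common = words_a & words_b   # words in both documents
only_a = words_a - words_b   # words unique to the first document
only_b = words_b - words_a   # words unique to the second document

print(sorted(common))  # ['dog', 'lazy', 'quick', 'the']
print(sorted(only_a))  # ['brown', 'fox', 'jumps', 'over']
print(sorted(only_b))  # ['cat', 'runs', 'sleeps', 'while']
```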
