# Introduction to Sets

## Overview

This lesson will introduce the set datatype and detail how to perform operations with sets.


## Objectives

1. Learn the characteristics of sets
2. Create a set
3. Add items to a set
4. Remove items from a set
5. Perform operations between sets
6. Determine use cases for sets
7. Analyze the efficiency of sets 
8. Create functions that make use of sets

## What are sets?

A set is an unordered, mutable collection of unique items. What does this mean? It means that a set is a data structure in which you can store items without caring about their order, knowing that each item will appear *at most* once in the structure.

To help you distinguish between the data types:

* lists : mutable and ordered
* tuples: immutable and ordered
* dictionaries: mutable and accessed by key rather than by position (insertion order is preserved since Python 3.7)
* sets: mutable and unordered
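
To make "unique" concrete, here is a minimal sketch showing that building a set from a list with repeats keeps one copy of each value. The variable names are made up for illustration.

```python
# Hypothetical example data: a list that contains repeated values.
flavors = ['red', 'blue', 'red', 'orange', 'blue']

# Converting to a set keeps at most one copy of each item.
unique_flavors = set(flavors)

print(len(flavors))         # 5 items in the list
print(len(unique_flavors))  # 3 items in the set
print(unique_flavors == {'red', 'blue', 'orange'})  # True
```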

## Creating a Set

The set( ) function turns a single iterable, such as a list, into a set. Empty curly brackets create a *dictionary*, but if items are included in the brackets (and they are not in key:value pair format), a set is created.

In [1]:
set_1 = set(['a','b','c','d'])
set_2 = {'a','b','c','d'}
dic = {}
print(type(set_1))
print(type(set_2))
print(type(dic))

<class 'set'>
<class 'set'>
<class 'dict'>
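
One consequence worth spelling out: since empty curly brackets produce a dictionary, an empty set is normally created by calling set( ) with no arguments. The sketch below also shows that set( ) accepts other iterables, such as a string or a tuple. The variable names are only illustrative.

```python
empty_set = set()            # set() with no arguments gives an empty set
empty_dict = {}              # empty curly brackets give a dictionary

letters = set('hello')       # a string is iterable: one item per character
numbers = set((1, 2, 2, 3))  # a tuple works too; the duplicate 2 is dropped

print(type(empty_set))   # <class 'set'>
print(type(empty_dict))  # <class 'dict'>
print(len(letters))      # 4, because 'l' appears only once in the set
print(len(numbers))      # 3
```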


In [2]:
print(set_1)
print(set_2)

{'b', 'd', 'a', 'c'}
{'b', 'd', 'a', 'c'}


You'll notice that both sets are printed in the same order. That order is an artifact of how the items happen to be stored internally, not something you can rely on: the items in a set are *unordered*. Let's prove it. Below we have a list of my favorite Skittles flavors, along with a set containing the same flavors.

In [3]:
list_a = ['blue','red','orange','purple']
set_a = {'blue', 'red','orange','purple'}

Now, we're going to access one of the items (red) in the list by using its index (1):

In [4]:
list_a[1]

'red'

Because the list is ordered, I'm able to use the index. Let's try and do the same thing with a set.

In [5]:
set_a[2]

TypeError: 'set' object does not support indexing

As you can see, we cannot look up an item in a set by index, because there is no index! The values within a set are unordered.
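
Although a set cannot be indexed, it can still be looped over, and it can be converted to a sorted list when a position is needed. A minimal sketch, reusing the same Skittles flavors under a different, made-up name:

```python
skittles = {'blue', 'red', 'orange', 'purple'}

# Iteration works, but the order in which items come out is arbitrary.
for flavor in skittles:
    print(flavor)

# sorted() returns a new list, which does support indexing.
ordered = sorted(skittles)
print(ordered)     # ['blue', 'orange', 'purple', 'red']
print(ordered[1])  # orange
```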

## Adding items to a Set

Items in a set do not all have to be the same datatype. You can add them one at a time or several at once.

### Single Items

To add an item to a set, use the .add( ) method.

In [6]:
set_1.add(1)
print(set_1)

{'b', 1, 'a', 'c', 'd'}


Notice that the .add( ) method works *in-place*, meaning the set we instantiated earlier has been modified. We are not creating a new set. If I were to assign the result of .add( ) to a new variable, that variable would hold None.

In [7]:
new_set =  set_1.add(1)
print(new_set)

None


Now, let's see if we can add a list to a set.

In [8]:
set_1.add([1,2,3,4,5])

TypeError: unhashable type: 'list'

We can't! The issue here has to do with mutability. Every item in a set must be *hashable*, which in practice means immutable. Therefore, lists, sets, and dictionaries cannot be added to a set.
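
A tuple is the usual stand-in when you want to store a group of values as a single set item. A minimal sketch, with names chosen only for illustration:

```python
points = set()

# A tuple of numbers is immutable, so it can be a set item.
points.add((1, 2))

# A list is mutable, so adding it raises TypeError.
try:
    points.add([3, 4])
except TypeError as err:
    print('could not add list:', err)

print(points)  # {(1, 2)}
```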

### Multiple Items

What if we wanted to add more than one item at a time to a set? To do this, we can use the .update( ) method. *Note: The update method requires an iterable as input.*

In [9]:
set_1.update([1,2,3,4,5])
print(set_1)

{'b', 1, 2, 3, 'a', 'c', 4, 5, 'd'}


#### Bonus Info

If we want to create a set that is *immutable*, in other words one whose contents cannot be changed, we can create a frozen set by using the frozenset( ) function in the same way we use the set( ) function. Because frozen sets are immutable, they can be added to a set.
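
As a rough sketch of how this looks in practice (the names vowels and groups are invented for this example):

```python
# frozenset() takes an iterable, just like set().
vowels = frozenset(['a', 'e', 'i', 'o', 'u'])

# Frozen sets have no .add() method, so they cannot be changed.
print(hasattr(vowels, 'add'))  # False

# Because it is immutable, a frozenset can be an item in a regular set.
groups = set()
groups.add(vowels)
print(vowels in groups)  # True
print(len(groups))       # 1
```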

## Removing Items from a Set
To remove items from a set, we can use the .remove( ) or .discard( ) methods. The .remove( ) method raises a KeyError if the item is not present. The .discard( ) method removes the item if it is present and does nothing, without raising an error, if it is not.

In [10]:
set_1.remove('b')
print(set_1)

{1, 2, 3, 'a', 'c', 4, 5, 'd'}


In [11]:
set_1.discard(5)
print(set_1)

{1, 2, 3, 'a', 'c', 4, 'd'}
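
The difference between the two methods only shows up when the item is missing. A small sketch with a throwaway set (not the set_1 used above):

```python
letters = {'a', 'b', 'c'}

# .discard() does nothing when the item is absent.
letters.discard('z')
print(letters == {'a', 'b', 'c'})  # True

# .remove() raises KeyError when the item is absent.
try:
    letters.remove('z')
except KeyError as err:
    print('KeyError:', err)  # KeyError: 'z'
```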


## Operations with sets

Sets are most useful when we combine or compare them to work out which items belong to which groups. The main operations we can perform are:

* in/not in
* union
* intersection
* difference
* symmetric difference
* subset/superset

Here are examples of these operations. Imagine the two sets below represent the numbers of Girl Scout Cookies ordered by households on one city block versus another.

In [12]:
set_a, set_b = {3,4,8,9,10,0,1,12}, {5,4,3,0,20,1,2}


#### In / Not In
The in operator returns True if a value is contained within a set and False if it is not.

The not in operator does the opposite, returning True if a value is not found within a set and False if a value is.

In [13]:
print(3 in set_a)
print(4 not in set_a)

True
False


#### Union
To find all of the different order numbers from both blocks, we would need to perform a *union*, which combines all elements of both sets. This can be done using the .union( ) method or the **|** operator. *Note: these operations are not done in-place, so you must assign the result to a variable to use it later.*
<img src="images/union.png" width="300px">

In [14]:
print(set_a.union(set_b))
print(set_a | set_b)

{0, 1, 2, 3, 4, 5, 8, 9, 10, 12, 20}
{0, 1, 2, 3, 4, 5, 8, 9, 10, 12, 20}
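
To see what "not in-place" means in practice, the sketch below contrasts the **|** operator with the in-place **|=** operator, which behaves like the .update( ) method. The sets here are small stand-ins, not the cookie orders above.

```python
a = {1, 2, 3}
b = {3, 4}

combined = a | b                  # builds a new set
print(combined == {1, 2, 3, 4})   # True
print(a == {1, 2, 3})             # True: a itself is unchanged

a |= b                            # in-place union, same effect as a.update(b)
print(a == {1, 2, 3, 4})          # True: a has now been modified
```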


#### Intersection

If we wanted to find the overlap in the number of orders from both blocks, we would need to find the *intersection*. These are the items that are included in both original sets. We can perform this operation with the .intersection( ) method or use the **&** operator.
<img src="images/intersection.png" width="300px">

In [15]:
print(set_a.intersection(set_b))
print(set_a & set_b)

{0, 1, 3, 4}
{0, 1, 3, 4}


#### Difference

If we wanted to find the order numbers that were unique to one of the blocks, we would take the *difference*. We can perform this operation with the .difference( ) method or the **-** operator. Note that the order is important: the result contains the items of the first set that are not in the second.

<table>
 <tr>
   <td>  <img src="images/differencea_b.png" width="200px"> </td>
   <td> <img src="images/differenceb_a.png" width="200px"> </td>
 </tr>
</table>

In [16]:
print(set_a.difference(set_b))
print(set_a - set_b)

{8, 9, 10, 12}
{8, 9, 10, 12}


In [17]:
print(set_b.difference(set_a))
print(set_b - set_a)

{2, 20, 5}
{2, 20, 5}


#### Symmetric Difference

If we wanted to find the order numbers that appear in exactly one of the two blocks, we would want the *symmetric difference* of the two sets. To do this, we can use the .symmetric_difference( ) method or the **^** operator.
<img src="images/symmetric_difference.png" width="300px">

In [18]:
print(set_a.symmetric_difference(set_b))

{2, 5, 8, 9, 10, 12, 20}


You might notice that the symmetric difference is the same thing as the union of the two differences.

In [19]:
(set_a - set_b).union(set_b - set_a)

{2, 5, 8, 9, 10, 12, 20}

#### Subset and Superset
A set is a *subset* of another set when every one of its items is also contained in the other set. A *superset* is a set that contains every item of another set. In this diagram, A is a subset of B and B is a superset of A. The .issubset( ) and .issuperset( ) methods test for these relationships and return the Boolean True or False.

<img src="images/sub_superset.png" width="300px">
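
A minimal sketch of both methods, using two small made-up sets in which a plays the role of A in the diagram and b the role of B:

```python
a = {1, 2}
b = {1, 2, 3, 4}

print(a.issubset(b))    # True: every item of a is in b
print(b.issuperset(a))  # True: b contains all of a
print(b.issubset(a))    # False

# The comparison operators are equivalent shorthand.
print(a <= b)           # True, same as a.issubset(b)
print(b >= a)           # True, same as b.issuperset(a)
```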

## Set Use Cases

Sets are very powerful when it comes to testing for membership. It is typically much faster to test for membership in a set than in a list. Let's look at an example where we test whether or not a collection has the number 10 in it.

In [20]:
example_set = set([8,5,3,6,9,0,10,20])
example_list = [8,5,3,6,9,0,10,20]

In [21]:
print(10 in example_list)
print(10 in example_set)

True
True


In [22]:
import timeit
import string
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
%timeit 10 in example_list

84.3 ns ± 0.828 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [26]:
%timeit 10 in example_set

33 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


As we can see, it is faster to test for membership in a set than in a list. This is because Python has to scan a list item by item until it finds a match, whereas for a set it can hash the value and look it up directly in the hash table. Let's look at an example with far more items in each datatype.
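
Outside of a notebook, where the %timeit magic is not available, the same kind of comparison can be sketched with the standard timeit module. The function name and sizes below are arbitrary choices for illustration, and the actual numbers will vary by machine.

```python
import timeit

def time_membership(container, value, repeats=100):
    """Return the total seconds taken by `repeats` membership tests."""
    return timeit.timeit(lambda: value in container, number=repeats)

size = 100_000
big_list = list(range(size))
big_set = set(big_list)

# Looking for the last element forces the list to scan every item,
# while the set finds the value through its hash.
list_seconds = time_membership(big_list, size - 1)
set_seconds = time_membership(big_set, size - 1)

print(f'list: {list_seconds:.6f} s')
print(f'set:  {set_seconds:.6f} s')
```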

In [27]:
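l = list(range(100000))
# Integer literals cannot contain a comma: `99,999 in l` would build the
# tuple (99, 999 in l), so the recorded timing is for finding 999.
%timeit 999 in l
%timeit 1 in l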

9.59 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
36.6 ns ± 0.873 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [28]:
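s = set(range(100000))
# Integer literals cannot contain a comma: `99,999 in s` would build the
# tuple (99, 999 in s), so the recorded timing is for finding 999.
%timeit 999 in s
%timeit 1 in s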

64.7 ns ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
32.8 ns ± 0.886 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


Wow! Notice how, for the list, the time taken depended greatly on the position of the number we were looking for (microseconds versus nanoseconds). For the set, the variation was much smaller. Now let's investigate a common use case for sets in Natural Language Processing.

## NLP Example

Here I am using the "stopwords" corpus from NLTK, a natural language processing library, which was imported earlier. Stopwords are words used so frequently in English that they show up in almost any context. My goal in this next section is to remove the stopwords from a sentence taken from Alice in Wonderland.

In [29]:
stop = stopwords.words('english')
alice_sentence = """So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her."""


Here are the stopwords that we are removing:

In [30]:
print(type(stop))
print(stop)

<class 'list'>
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 

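Here we have a list comprehension that iterates through the words of the Alice in Wonderland sentence and checks whether each one is a stopword. Only the words that pass the check are kept, and they are printed below. Note that the membership test is applied to the raw token, before punctuation is stripped and the word is lowercased, so a few stopwords with a capital letter or attached punctuation ('So', '(as', 'her.') remain in the output.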

In [31]:
print([word.strip(string.punctuation).lower() for word in alice_sentence.split() if word not in stop])

['so', 'considering', 'mind', 'as', 'well', 'could', 'hot', 'day', 'made', 'feel', 'sleepy', 'stupid', 'whether', 'pleasure', 'making', 'daisy-chain', 'would', 'worth', 'trouble', 'getting', 'picking', 'daisies', 'suddenly', 'white', 'rabbit', 'pink', 'eyes', 'ran', 'close', 'her']
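
One possible variation is to clean each token first and test the cleaned form against the stopwords, so that tokens like 'So' or 'her.' are also caught. The sketch below uses a tiny hand-written stopword set and a made-up sentence instead of the NLTK list, so it runs without any downloads.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration.
small_stop = {'so', 'she', 'was', 'in', 'her', 'the', 'a', 'by'}

sentence = "So she was sleepy, when a Rabbit ran close by her."

# Normalize every token first: strip punctuation, then lowercase.
cleaned = [word.strip(string.punctuation).lower() for word in sentence.split()]

# Filter on the normalized form.
kept = [word for word in cleaned if word not in small_stop]
print(kept)  # ['sleepy', 'when', 'rabbit', 'ran', 'close']
```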


Now let's compare the speed difference between using lists and sets.

In [32]:
# strip(string.punctuation) removes leading and trailing punctuation from each word
%timeit [word.strip(string.punctuation).lower() for word in alice_sentence.split() if word not in stop]


90.9 µs ± 772 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [33]:
set_stop = set(stop)

%timeit [word.strip(string.punctuation).lower() for word in alice_sentence.split() if word not in set_stop]



14.8 µs ± 398 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## Practice Problems

1) Create a function that prompts the user for 2 different lists of numbers (integers). The function should then print out the numbers that are present in both lists.

 Example: If you enter the numbers `5, 4, 2, 9, 10` first, and `3, 5, 9, -4, 9` second, your function should print `9, 5`. Make sure to use sets.





2) Imagine you are watching a nature documentary, and you want to keep track of all the animals you see on screen. Write a function that repeatedly asks the user for an animal in string form until they type 'stop', at which point the function should return all of the animals seen. To view the animals seen so far, the user can type in 'all_animals'. All inputs should be case insensitive.



While solving these problems, remember to use these methods:

* set.add
* set.clear
* set.copy
* set.difference
* set.difference_update
* set.discard
* set.intersection
* set.intersection_update
* set.isdisjoint
* set.issubset
* set.issuperset
* set.pop
* set.remove
* set.symmetric_difference
* set.symmetric_difference_update
* set.union
* set.update


In [None]:
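def number_checker():
    """Prompt for two comma-separated lists of integers and print the numbers found in both."""
    list_1 = input('Please enter numbers separated by commas: ')
    list_2 = input('Please enter more numbers separated by commas: ')
    # int() ignores surrounding whitespace, so '5, 4' and '5,4' both work
    first_set = set(int(num) for num in list_1.split(','))
    second_set = set(int(num) for num in list_2.split(','))
    # The intersection holds the numbers present in both inputs
    print(first_set.intersection(second_set))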
    

### References
The first four Venn diagram images are taken from: http://www.texample.net/tikz/examples/set-operations-illustrated-with-venn-diagrams/

The subset/superset image is taken from:
https://en.wikipedia.org/wiki/Subset#/media/File:Venn_A_subset_B.svg
