# **Important Python Data Structures**

We've seen lists so far, but Python has other collections of data that you will find yourself using quite often. Here we'll look at the three main ones: sets, dictionaries, and tuples.

# **Sets** - Unordered Collection of Unique, Hashable Elements

Sets come from mathematics, where they are used to track unique elements. From the Python perspective, a set is an unordered collection data type that is iterable, mutable, and has no duplicate elements.

The major advantage of using a set, as opposed to a list, is fast membership checking. A set is backed by a hash table, so testing whether a specific element is contained in it takes roughly constant time on average, whereas a list has to be scanned element by element.

## Constructing & Adding elements in a set

In [1]:
# Constructing a set
Set = {"a", "b", "c"}
  
print("Set: ") 
print(Set) 
  
# Adding element to the set 
Set.add("d") 
print("\nSet after adding: ") 
print(Set) 

# Adding an element that already exists
Set.add("a") 
print("\nSet after adding: ") 
print(Set) 

Set: 
{'a', 'b', 'c'}

Set after adding: 
{'d', 'a', 'b', 'c'}

Set after adding: 
{'d', 'a', 'b', 'c'}


Sets can also be defined with the built-in function set([iterable]). This function takes as argument an iterable (i.e. any type of sequence, collection, or iterator), returning a set that contains unique items from the input (duplicated values are removed).

In [2]:
### Create a set from ###

# a string
print(set('PyDS'))

# a tuple
print(set(('Madrid', 'Valencia', 'Munich')))

# a dictionary (only the keys end up in the set)
print(set({'hydrogen': 1, 'helium': 2, 'carbon': 6, 'oxygen': 8}))

# a list
print(set(['Madrid', 'Valencia', 'Munich', 'Munich']))

{'D', 'S', 'P', 'y'}
{'Madrid', 'Munich', 'Valencia'}
{'helium', 'carbon', 'hydrogen', 'oxygen'}
{'Madrid', 'Munich', 'Valencia'}


Like lists, sets are iterable, so you can loop over them. Because a set is unordered, the elements come out in an arbitrary order.

In [3]:
#Looping through our Set
for ele in Set:
    print(ele)

d
a
b
c


Membership tests on a set are very fast:

In [4]:
'c' in Set

True
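
To get a feel for the difference, the sketch below times the same membership test on a list and on a set. The container names, the size of 100,000 integers and the number of repetitions are arbitrary choices made for this illustration, and the exact timings will vary from machine to machine.

```python
import timeit

# Hypothetical data: the same 100,000 integers stored as a list and as a set.
items_list = list(range(100_000))
items_set = set(items_list)

target = 99_999  # worst case for the list: the very last element

# Both containers give the same answer...
print(target in items_list, target in items_set)

# ...but the list is scanned element by element, while the set does a hash lookup.
list_time = timeit.timeit(lambda: target in items_list, number=100)
set_time = timeit.timeit(lambda: target in items_set, number=100)
print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

On typical hardware the set lookup should come out several orders of magnitude faster, and the gap widens as the collection grows.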

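But remember: although a set itself is mutable, its elements must be hashable, so you cannot put a mutable datatype such as a list inside a set. And because a set is unordered, it has no indices, so you cannot assign to a position in it.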

In [5]:
# Mutable data types cannot be used inside a set
numbers = {[1,2],3, 4}

#TypeError

TypeError: unhashable type: 'list'

In [6]:
Set[3] = 'f'

TypeError: 'set' object does not support item assignment

## Set Methods

The following methods are available on sets:

![](https://www.pstanalytics.com/blog/wp-content/uploads/2019/10/set1.png)

### Add/Remove/Discard/Pop operations

In [7]:
#Initializing a set of numbers
numbers = {1,2,3,4,5}

#Initializing a set of fruits
fruits = {'apple','orange','grape'}

#Initializing an empty set
empty_set = set()

In [8]:
#Using the add method
fruits.add('banana')
print("Added new fruit -",fruits)

#Adding an already present element to the set
fruits.add('orange')
print("Added another orange -",fruits)

## Note: Sets do not hold duplicates, so the previous add does not put an extra orange into our set

Added new fruit - {'orange', 'banana', 'apple', 'grape'}
Added another orange - {'orange', 'banana', 'apple', 'grape'}


In [9]:
#Using the remove method
fruits.remove('grape')
print("Removed grape from the set -",fruits)

## Note: You cannot remove an element that is not present in the set using .remove(); doing so raises a KeyError

Removed grape from the set - {'orange', 'banana', 'apple'}


In [0]:
#Exception Raised
fruits.remove('grape')

KeyError: 'grape'

To avoid this exception, you can use the set.discard() method. The set.discard(x) method removes an element x from a set if it is present. Unlike the remove method, the discard method does not raise an exception (KeyError) if the element to be removed does not exist.

In [10]:
#Exception Not Raised
fruits.discard('grape')

In [11]:
#Using pop method to remove arbitrary element from the set
print(fruits.pop())

orange


**Note**: In comparison to lists, the pop method does not take any arguments. We can not specify the index we want to remove, since sets are an unordered collection of elements.
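
A small sketch of that difference, using a sample list and set made up for this illustration:

```python
fruit_list = ['apple', 'orange', 'banana']   # sample list for this sketch
fruit_set = {'apple', 'orange', 'banana'}    # sample set for this sketch

# list.pop accepts an index
print(fruit_list.pop(0))   # apple

# set.pop takes no arguments; passing one raises a TypeError
try:
    fruit_set.pop(0)
except TypeError as err:
    print("TypeError:", err)

# set.pop() removes and returns an arbitrary element
removed = fruit_set.pop()
print(removed not in fruit_set, len(fruit_set))   # True 2
```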

### Mathematical Operations - Union, Intersection, Difference, Symmetric Difference

In [12]:
# two sets - one containing mammals and another containing aquatic animals
mammals = {'Tiger', 'Camel', 'Sheep', 'Whale', 'Walrus'}
aquatic = {'Octopus', 'Squid', 'Crab', 'Whale', 'Walrus'}

#### **Union**

The union of two sets A and B is the set containing the elements that are in A, B, or both, and is denoted by A ∪ B.

In [13]:
# union of two sets
# union method
animals = mammals.union(aquatic)
print(animals)

# operator |
animals = mammals | aquatic
print(animals)

# sets mammals and aquatic are not modified
print(mammals)
print(aquatic)

{'Sheep', 'Camel', 'Walrus', 'Octopus', 'Whale', 'Crab', 'Squid', 'Tiger'}
{'Sheep', 'Camel', 'Walrus', 'Octopus', 'Whale', 'Crab', 'Squid', 'Tiger'}
{'Sheep', 'Camel', 'Tiger', 'Walrus', 'Whale'}
{'Squid', 'Walrus', 'Octopus', 'Whale', 'Crab'}


#### **Intersection**

The intersection of two sets A and B is the set containing the elements that are common to both sets and is denoted by A ∩ B.


In [14]:
# intersection of two sets
# intersection method
animals = mammals.intersection(aquatic)
print(animals)

# operator &
animals = mammals & aquatic
print(animals)

# sets mammals and aquatic are not modified
print(mammals)
print(aquatic)

{'Whale', 'Walrus'}
{'Whale', 'Walrus'}
{'Sheep', 'Camel', 'Tiger', 'Walrus', 'Whale'}
{'Squid', 'Walrus', 'Octopus', 'Whale', 'Crab'}


#### **Difference**

The difference of two sets A and B is the set of all elements of set A that are not contained in set B and is denoted by A-B.


In [15]:
# difference between two sets
# difference method
animals = mammals.difference(aquatic)
print(animals)

# operator -
animals = mammals - aquatic
print(animals)

# sets mammals and aquatic are not modified
print(mammals)
print(aquatic)

{'Sheep', 'Camel', 'Tiger'}
{'Sheep', 'Camel', 'Tiger'}
{'Sheep', 'Camel', 'Tiger', 'Walrus', 'Whale'}
{'Squid', 'Walrus', 'Octopus', 'Whale', 'Crab'}


#### **Symmetric Difference**

The symmetric difference of two sets A and B is the set of elements that are in either of the sets A and B, but not in both, and is denoted by A △ B.

In [16]:
# symmetric difference between two sets
# symmetric_difference method
animals = mammals.symmetric_difference(aquatic)
print(animals)

# operator ^
animals = mammals ^ aquatic
print(animals)

# sets mammals and aquatic are not modified
print(mammals)
print(aquatic)

{'Sheep', 'Camel', 'Tiger', 'Squid', 'Octopus', 'Crab'}
{'Sheep', 'Camel', 'Tiger', 'Squid', 'Octopus', 'Crab'}
{'Sheep', 'Camel', 'Tiger', 'Walrus', 'Whale'}
{'Squid', 'Walrus', 'Octopus', 'Whale', 'Crab'}
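
One practical difference between the method form and the operator form is worth a quick sketch: the methods accept any iterable, while the operators require both operands to be sets. Each operation also has an in-place variant that does modify the set. The extra animal name below is a made-up value for illustration.

```python
mammals = {'Tiger', 'Camel', 'Sheep', 'Whale', 'Walrus'}
aquatic = {'Octopus', 'Squid', 'Crab', 'Whale', 'Walrus'}

# The method form accepts any iterable, for example a list
extended = mammals.union(['Dolphin', 'Whale'])
print(sorted(extended))
# ['Camel', 'Dolphin', 'Sheep', 'Tiger', 'Walrus', 'Whale']

# The operator form requires a set on both sides
try:
    mammals | ['Dolphin', 'Whale']
except TypeError as err:
    print("TypeError:", err)

# In-place variant: &= (same as intersection_update) changes mammals itself
mammals &= aquatic
print(sorted(mammals))   # ['Walrus', 'Whale']
```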


## Subsets & Supersets

A set A is a subset of a set B (A ⊆ B) or equivalently set B is a superset of set A (B ⊇ A), if all elements of set A are contained in set B.

In [17]:
# three sets 
mammals = {'Tiger', 'Camel', 'Sheep', 'Whale', 'Walrus'}
aquatic = {'Octopus', 'Squid', 'Crab', 'Whale', 'Walrus'}
aquatic_mammals = {'Whale', 'Walrus'}

# mammals is a superset of aquatic_mammals
mammals.issuperset(aquatic_mammals)

# aquatic_mammals is a subset of mammals
aquatic_mammals.issubset(mammals)

# aquatic is not a subset of mammals
aquatic.issubset(mammals)

False
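
The same checks can be written with comparison operators. A short sketch, reusing the sets from the cell above:

```python
mammals = {'Tiger', 'Camel', 'Sheep', 'Whale', 'Walrus'}
aquatic_mammals = {'Whale', 'Walrus'}

print(aquatic_mammals <= mammals)   # True  (subset)
print(mammals >= aquatic_mammals)   # True  (superset)

# < and > test for a *proper* subset/superset: contained, but not equal
print(aquatic_mammals < mammals)    # True
print(mammals <= mammals)           # True  (every set is a subset of itself)
print(mammals < mammals)            # False (but not a proper subset)
```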

## Disjoint

Two sets are disjoint if they have no elements in common.


In [18]:
# two sets - mammals and reptiles
mammals = {'Tiger', 'Camel', 'Sheep', 'Whale', 'Walrus'}
reptiles = {'Turtle', 'Snake', 'Lizard', 'Crocodile'}

# check if both sets are disjoint sets
mammals.isdisjoint(reptiles)

True

## Frozen Set

A frozenset object is a set that, once created, cannot be changed. We create a frozenset in Python using the frozenset([iterable]) constructor, providing an iterable as input.

Since frozensets are immutable, they do not accept the methods that modify sets in-place such as add, pop, or remove. As shown below, trying to add an element to a frozenset raises an exception (AttributeError).

In [19]:
# create a set
cities = {'Valencia', 'Madrid'}

# sets are mutable and can be modified in-place
cities.add('Munich')
print(cities)
# {'Madrid', 'Munich', 'Valencia'}

# create a frozenset
cities_frozen = frozenset(['Barcelona', 'Berlin'])

# frozensets are immutable 
cities_frozen.add('Stuttgart')
# AttributeError

{'Madrid', 'Munich', 'Valencia'}


AttributeError: 'frozenset' object has no attribute 'add'
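
Being immutable also makes a frozenset hashable, which a regular set is not. That means a frozenset can serve as a dictionary key or as an element of another set. The sketch below illustrates this; the `city_pairs` dictionary and its route labels are invented for the example.

```python
# A frozenset can be used as a dictionary key (labels are hypothetical)
city_pairs = {
    frozenset(['Madrid', 'Valencia']): 'route A',
    frozenset(['Barcelona', 'Berlin']): 'route B',
}
# Order does not matter when looking the pair up
print(city_pairs[frozenset(['Valencia', 'Madrid'])])   # route A

# A regular set cannot be nested inside another set...
try:
    nested = {{'Madrid'}, {'Berlin'}}
except TypeError as err:
    print("TypeError:", err)

# ...but a frozenset can
nested = {frozenset({'Madrid'}), frozenset({'Berlin'})}
print(len(nested))   # 2

# Operations that return a new set still work on frozensets
combined = frozenset(['Barcelona', 'Berlin']) | {'Munich'}
print(sorted(combined))   # ['Barcelona', 'Berlin', 'Munich']
```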

## How can we use sets for Data Science?

To understand the usefulness of sets, let us take a problem statement where you need to read a file containing the text of Julius Caesar and print 100 of its unique words. Note that a set is unordered, so these are an arbitrary 100 words rather than a ranking.

In [20]:
## Read a file, parse lines, and get all UNIQUE words
wordset = set() # a set keeps only unique items

# Read all lines from the text file; the with-block closes the file for us
with open("caesar.txt") as fd:
    lines = fd.readlines()

# Make a list of lists: each inner list holds the words on that line.
# split() with no arguments also drops newlines and surrounding whitespace.
list_of_lines_words = [line.split() for line in lines]

# Take each list of words and add its words to the set; duplicates are ignored
for lines_words in list_of_lines_words:
    wordset.update(lines_words)

# Create a list of the unique words
unique_words = list(wordset)

# Show 100 of these words (an arbitrary 100, since sets are unordered)
unique_words[:100]

['Look',
 'sicken',
 'point,',
 'STRATO,',
 'how',
 'greets',
 "monarch's",
 '&',
 'Alas,',
 'attain',
 'ARTEMIDORUS,',
 'dangerous',
 'eight.',
 'May',
 'amaze',
 'humble',
 'Therein,',
 'Room',
 'fault',
 'necessary',
 'sir;',
 'room',
 'supporting',
 'flesh',
 'strucken',
 "dreams.'",
 'Walk',
 'Cimber',
 'once:',
 'assembly,',
 'any',
 'Exit',
 'yourselves;',
 "harm's",
 'bars,',
 'cavern',
 'heavy',
 'fighting,',
 'Antony;',
 'age!',
 'Pompey',
 'end',
 'mettle;',
 'Cannot,',
 'chidden',
 'rid',
 'soldier,',
 'tokens',
 'Looks',
 'be;',
 'necessities.',
 'feeding',
 'Flourish',
 'strong',
 'reason?',
 'ambitious,',
 'saucy',
 'about',
 'presently?',
 'gone?',
 'number,',
 'sword,',
 'slaughter',
 'women',
 "mix'd",
 'steel',
 'ferret',
 'Vexed',
 'come,',
 'Gave',
 'Is',
 "He's",
 'Publius.',
 'shrieking.',
 'lowest',
 'more,',
 'Rome:',
 'fawn',
 'devil,',
 'were',
 'henceforth,',
 "laugh'd",
 'kingly',
 'trophies.',
 'shores,',
 'thick.',
 'heads!',
 'talk,',
 'pieces!',
 'No,',
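
Notice that the output treats 'Room' and 'room' as different words and keeps punctuation attached ('point,', 'age!'). One possible refinement, sketched below, is to strip punctuation and lowercase each word before adding it to the set. The two sample lines are made up so that the sketch runs without `caesar.txt`; they are not taken from the play.

```python
import string

# Made-up sample lines standing in for the file contents
sample_lines = [
    "Room, make room!",
    "Make way; the room is full.",
]

normalized_words = set()
for line in sample_lines:
    for word in line.split():
        # Remove leading/trailing punctuation and ignore case
        cleaned = word.strip(string.punctuation).lower()
        if cleaned:
            normalized_words.add(cleaned)

# Sorting gives a reproducible order for display
print(sorted(normalized_words))
# ['full', 'is', 'make', 'room', 'the', 'way']
```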

# **Dictionaries** - Mutable and Ordered Collection of Key:Value Pairs

A dictionary is a "bag" of values, each with its own label, called a key. It is the 'most powerful' data collection type, and one I suspect you will find yourself using a lot!

A dictionary is similar to a list in that you can iterate over it, but:

* the order of items usually doesn't matter. Since Python 3.7 a dict preserves insertion order; in older versions, use an OrderedDict when you need this.
* items aren't selected by an index such as 0 or 5.
* instead, a unique 'key' is associated with each 'value'. The 'key' can be any hashable (in practice, immutable) data type: boolean, float, int, tuple, string (but it is often a string).

Dictionaries themselves are mutable: keys can be added or removed, and values can be changed.

You can think of a dictionary as a set of keys in which every key carries a value.
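
A minimal sketch of these rules, assuming Python 3.7 or later for the ordering behaviour. The `lookup` dictionary and its contents are invented for illustration.

```python
# Keys of several hashable types in one dictionary
lookup = {
    'name': 'Alice',
    42: 'an int key',
    (0, 1): 'a tuple key',
}
print(lookup[(0, 1)])   # a tuple key

# A mutable type such as a list cannot be a key
try:
    bad = {[0, 1]: 'a list key'}
except TypeError as err:
    print("TypeError:", err)

# Iteration follows insertion order (Python 3.7+)
print(list(lookup))   # ['name', 42, (0, 1)]

# Keys are unique: assigning to an existing key overwrites its value
lookup['name'] = 'Bob'
print(len(lookup), lookup['name'])   # 3 Bob
```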


In [21]:
type({})

dict

## Creating a dictionary

In [22]:
# Creating a dictionary:

# 1. Using {}
empty_dict = {} 
print (type(empty_dict))
new_dict = { "day":5, "venue": "GJB", "event": "Python Carnival!" }
print(new_dict)

#2. Using dict()
purse = dict(type="wallet", material="leather")
print(purse)

<class 'dict'>
{'day': 5, 'venue': 'GJB', 'event': 'Python Carnival!'}
{'type': 'wallet', 'material': 'leather'}


In [23]:
# Creating a Nested Dict
D ={'to': {'name': 'Alice', 'age':18}}
D

{'to': {'name': 'Alice', 'age': 18}}

In [24]:
# Alternative construction techniques:

# dict_var = dict([(key1, value1),(key2, value2), ...])	
D = dict([('name','Alice'),('age',18)])
print(D)

{'name': 'Alice', 'age': 18}


In [25]:
# Creating dict from keys only

# dict_var = dict.fromkeys([key1, key2, ...])
D = dict.fromkeys(['name','age','place'])
D

{'name': None, 'age': None, 'place': None}

The `dict([...])` call in the earlier cell constructs a dictionary from a list of tuples. We'll see tuples soon.

Notice that with `dict.fromkeys` the values have been assigned a special Python object, None, which is of type NoneType. It exists specifically to represent missing values, and it evaluates as falsy in conditionals.

## Dictionary Operations

#### Conditionals

In [26]:
#Using dict values in conditionals
if D['age']:
    print("Age is", D['age'])
else:
    print("Nothing specified")

Nothing specified


In [27]:
if not D['age']:
    print("Age not given..ask!")

Age not given..ask!
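
One caveat with the truthiness checks above: values such as 0 or an empty string are also falsy, so `if not D['age']` cannot tell "missing" apart from a legitimate zero. A sketch of the more explicit `is None` test, using a hypothetical `person` dictionary:

```python
person = dict.fromkeys(['name', 'age'])
person['age'] = 0   # a legitimate value that happens to be falsy

# The truthiness check treats 0 the same way as None
if not person['age']:
    print("Truthiness check: age looks missing")

# The identity check only matches None
if person['age'] is None:
    print("Identity check: age is missing")
else:
    print("Identity check: age is", person['age'])

print(person['name'] is None)   # True
```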


#### Indexing & Membership Operation in Dict

In [28]:
#Indexing by key dict_var['key']
print (D['age'])

#Membership operation 'key' in dict_var
'place' in D

None


True

#### Dictionaries are iterable, like lists

In [29]:
#Iterating through new_dict
for key in new_dict:
    print(key, ":", new_dict[key])

day : 5
venue : GJB
event : Python Carnival!


In [30]:
#Iterating through new_dict - alternative way
for key, value in new_dict.items():
    print(key, ";", value)

day ; 5
venue ; GJB
event ; Python Carnival!


Some other useful methods:

1. All keys `dict_var.keys()`
2. All values `dict_var.values()`
3. All key + value tuples `dict_var.items()`
4. Copy method `dict_var.copy()`
5. Remove all items `dict_var.clear()`
6. Merge in the items of another dict `dict_var1.update(dict_var2)`
7. Fetch by key, if absent default `dict_var.get(key, default)`
8. Remove by key, if absent default `dict_var.pop(key, default)`
9. Fetch by key, if absent set default `dict_var.setdefault(key, default)`
10. Delete an item by key `del dict_var[key]`
11. Dictionaries can be built with dictionary comprehensions, which look like this:
`output_dict = {key: value for (key, value) in iterable if condition(key, value)}`
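
A short sketch exercising several of these methods on a made-up `stock` dictionary:

```python
stock = {'apple': 3, 'orange': 0}   # hypothetical inventory

# get: fetch with a default, without modifying the dict
print(stock.get('grape', 0))   # 0
print('grape' in stock)        # False

# setdefault: fetch, and insert the default if the key is absent
print(stock.setdefault('grape', 5))   # 5
print(stock['grape'])                 # 5

# update: overwrite existing keys and add new ones
stock.update({'apple': 10, 'kiwi': 2})
print(stock)   # {'apple': 10, 'orange': 0, 'grape': 5, 'kiwi': 2}

# pop: remove by key; the default avoids a KeyError for absent keys
print(stock.pop('orange'))        # 0
print(stock.pop('mango', None))   # None

# dictionary comprehension: keep only items with a count above 2
in_stock = {fruit: count for fruit, count in stock.items() if count > 2}
print(in_stock)   # {'apple': 10, 'grape': 5}
```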

## How can we use dict in Data Science?

Let's continue with the Julius Caesar text from our last example. We will now use a dictionary to store the number of occurrences of each of the 100 unique words we selected.

In [31]:
## Read a file, parse lines, and count occurrences of the selected words
worddict = dict() # maps each of the 100 selected unique words to its number of occurrences

# Read all lines from the text file; the with-block closes the file for us
with open("caesar.txt") as fd:
    lines = fd.readlines()

# Make a list of lists: each inner list holds the words on that line
list_of_lines_words = [line.split() for line in lines]

# Flatten into a single list of words
flat_list = []
for sublist in list_of_lines_words:
    for item in sublist:
        flat_list.append(item)

# Loop through the 100 words selected earlier (unique_words comes from the previous example)
for i in unique_words[:100]:
    # Store how many times the word occurs in the text
    worddict[i] = flat_list.count(i)

print(worddict)



In [32]:
worddict

{'Look': 2,
 'sicken': 1,
 'point,': 1,
 'STRATO,': 2,
 'how': 13,
 'greets': 1,
 "monarch's": 1,
 '&': 2,
 'Alas,': 5,
 'attain': 2,
 'ARTEMIDORUS,': 1,
 'dangerous': 3,
 'eight.': 1,
 'May': 3,
 'amaze': 1,
 'humble': 2,
 'Therein,': 2,
 'Room': 1,
 'fault': 1,
 'necessary': 2,
 'sir;': 1,
 'room': 2,
 'supporting': 1,
 'flesh': 1,
 'strucken': 2,
 "dreams.'": 1,
 'Walk': 1,
 'Cimber': 3,
 'once:': 1,
 'assembly,': 1,
 'any': 23,
 'Exit': 26,
 'yourselves;': 1,
 "harm's": 1,
 'bars,': 1,
 'cavern': 1,
 'heavy': 1,
 'fighting,': 1,
 'Antony;': 3,
 'age!': 1,
 'Pompey': 2,
 'end': 4,
 'mettle;': 1,
 'Cannot,': 1,
 'chidden': 1,
 'rid': 2,
 'soldier,': 3,
 'tokens': 1,
 'Looks': 2,
 'be;': 1,
 'necessities.': 1,
 'feeding': 1,
 'Flourish': 2,
 'strong': 8,
 'reason?': 1,
 'ambitious,': 1,
 'saucy': 3,
 'about': 14,
 'presently?': 1,
 'gone?': 1,
 'number,': 1,
 'sword,': 5,
 'slaughter': 1,
 'women': 1,
 "mix'd": 1,
 'steel': 3,
 'ferret': 1,
 'Vexed': 1,
 'come,': 5,
 'Gave': 1,
 'Is':
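
Calling `flat_list.count(i)` rescans the whole word list for every word. A common alternative, sketched here, counts everything in a single pass, either by hand with `dict.get` or with `collections.Counter` from the standard library. The sample lines are made up so the sketch runs without `caesar.txt`.

```python
from collections import Counter

# Made-up sample lines standing in for the file contents
sample_lines = [
    "the room is full",
    "make room in the room",
]
words = [word for line in sample_lines for word in line.split()]

# Manual single-pass count using dict.get with a default of 0
manual_counts = {}
for word in words:
    manual_counts[word] = manual_counts.get(word, 0) + 1

# Counter is a dict subclass that does the same thing
counts = Counter(words)
print(counts.most_common(2))            # [('room', 3), ('the', 2)]
print(manual_counts == dict(counts))    # True
```

Unlike slicing a list built from a set, `most_common` returns the words ranked by frequency.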

# **Tuples** - Immutable Ordered Collection of Python objects

Tuples are a lightweight kind of sequence that functions much like a list: they have elements which are indexed starting at 0. The key difference is that tuples can't be changed in place. They are immutable, and this guarantee lets Python store them more compactly than lists and makes them hashable, as long as their elements are.

* Ordered collections of arbitrary objects
* Accessed by index
* Of the category "immutable sequence"
* Fixed-length, heterogeneous and arbitrarily nested
* The fixed length is important for performance. Unlike lists, they cannot be grown or shrunk.
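
A brief sketch of what immutability means in practice. The `point` tuple and the `labels` dictionary are made up for illustration.

```python
point = (3, 4)

# No in-place changes: item assignment raises a TypeError
try:
    point[0] = 10
except TypeError as err:
    print("TypeError:", err)

# "Changing" a tuple really means building a new one
moved = (10,) + point[1:]
print(moved, point)   # (10, 4) (3, 4)

# Tuples of hashable elements are hashable, so they can be dict keys
labels = {(0, 0): 'origin', point: 'P'}
print(labels[(3, 4)])   # P

# A tuple that contains a list is not hashable
try:
    hash((1, [2, 3]))
except TypeError as err:
    print("TypeError:", err)
```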


## Creating tuples

In [33]:
# CREATING TUPLEs

# (a) Using tuple()
x = tuple() 
type(x)	

tuple

In [34]:
# (b) Using only ()
t=() 
type(t) 

tuple

The tuples above have length 0 and are not very useful. Because tuples are immutable, the following code will not work.

In [35]:
t[0] = 5

TypeError: 'tuple' object does not support item assignment

In [36]:
#Initializing a tuple whose last element is a (mutable) list
t = (1,2,3,4, [1,2,3])

#A mutable object stored inside a tuple can still be modified
t[4][2] = 4

#Note that the list inside the tuple has changed, although the tuple still holds the same objects
t
t

(1, 2, 3, 4, [1, 2, 4])

You will usually see them defined thus:

In [37]:
#(c) Casual way! 
z = 1,2,3,4 # or z = (1, 2, 3, 4)
type(z) 
print(z)

(1, 2, 3, 4)


## Tuple Operations

#### Nested Tuples

In [38]:
# TUPLE LITERALS AND OPERATIONS
# (a) Nested tuples
T = ('Da Vinci', ('Painter','Sculptor'))

# Print the message below using tuple T
# Da Vinci was a great Painter
# your code here



#### Indexing and Slicing

In [39]:
# Indexing and Slicing
T1 = (1, 2, 3, 4, 5)
print(T1[0:2]) 
print (T1[::-1]) 
print (T1[0], T1[2:4])

(1, 2)
(5, 4, 3, 2, 1)
1 (3, 4)


#### Iteration & Membership

In [40]:
# Iteration and Membership
T1 = (1, 2, 3, 4, 5) 
for ele in T1:
    print(ele)

1
2
3
4
5


#### Tuple Assignment

In [41]:
# TUPLE ASSIGNMENT
# To swap two variables, the conventional method uses a temporary variable:
# temp = a 
# a = b 
# b = temp

# Swapping is simpler with tuple assignment (no 'temp' variable required!)
A = (1, 2, 3) 
B = (4, 5, 6) 
A, B = B, A 
print (A)
print (B)

(4, 5, 6)
(1, 2, 3)


Why do you think the above worked?



### **Difference between (2,), 2,3 and (2) - why do we have to be careful about this?**

In [42]:
#look at 2,3 - is this a tuple?
2,3

(2, 3)

In Python, two values separated by a comma form a tuple, even without parentheses.

In [43]:
# look at (2) - is this a tuple?
(2)

2

Putting parentheses around a single number or string does not make it a tuple; it remains the integer or string it was.

In [44]:
#look at (2,) - is this a tuple?
(2,)

(2,)

This one is a tuple, because of the trailing comma. It is the comma, not the parentheses, that makes a tuple.

In [45]:
a = 2 , 3
print(a)

(2, 3)


When we assign two numbers separated by a comma to a single variable, Python packs them into a tuple.

In [46]:
a,b = 2, 3
print(a)
print(b)

2
3


When we assign two numbers separated by a comma to the same number of variables, the right-hand side is still a tuple, but Python unpacks it and assigns each element to the corresponding variable individually.
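
This packing and unpacking is also why the swap `A, B = B, A` works: the right-hand side is packed into a tuple first and then unpacked into the names on the left. A few more cases are sketched below; the helper `min_max` is a hypothetical function written for this example.

```python
a, b = 2, 3
a, b = b, a          # the right side (3, 2) is built first, then unpacked
print(a, b)          # 3 2

# The number of names must match the number of elements
try:
    x, y = 1, 2, 3
except ValueError as err:
    print("ValueError:", err)

# A starred name collects the remaining elements into a list
first, *rest = (1, 2, 3, 4)
print(first, rest)   # 1 [2, 3, 4]

# Functions that "return several values" actually return one tuple
def min_max(values):
    return min(values), max(values)

low, high = min_max([4, 1, 7])
print(low, high)     # 1 7
```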