# Data Structure

Python has four built-in data structures: list, dictionary, tuple, and set. Each has its own characteristics and typical uses.

# list


First, let's start with lists and delve deeply into each data structure. The most notable characteristics of lists are:

1. A list is mutable: its elements can be added, removed, and replaced.
2. The elements in a list are ordered, meaning each element has a specific position.
3. A list can contain any object in Python.

The elements in a list being mutable means that we can add, delete, and modify elements within the list. Each element in a list corresponds to a position, and we can query the value at that position by inputting the position.

The most common operations on data are adding, deleting, modifying, and querying.

When using the `insert` method, you must specify the position in the list where the new element should be inserted. The actual insertion point is before the element currently at that position. If the specified position does not exist in the list (i.e., it exceeds the current length of the list), the element will be placed at the end of the list.

In [14]:
fruit = ['pineapple','pear']
fruit.insert(2,'grape')
print(fruit)

['pineapple', 'pear', 'grape']
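
As a minimal sketch of the two rules described above (the values are illustrative and not part of the original notebook), inserting at index 0 places the new element before the current first element, while an index past the end simply appends:

```python
fruit = ['pineapple', 'pear']

# Index 0 exists, so 'grape' goes in front of 'pineapple'.
fruit.insert(0, 'grape')

# Index 10 is beyond the end of the list, so 'orange' is placed at the end.
fruit.insert(10, 'orange')

print(fruit)  # ['grape', 'pineapple', 'pear', 'orange']
```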


You can also replace a range of elements at once by assigning a list to a slice:

In [15]:
fruit[0:3] = ['Orange']
print(fruit)

['Orange']


Use the `remove` method to delete an element from the list:

In [16]:
fruit = ['pineapple','pear','grape']
fruit.remove('grape')
print(fruit)

['pineapple', 'pear']


Assign to an index to replace an element in the list:

In [17]:
fruit[0] = 'Grapefruit'
print(fruit)

['Grapefruit', 'pear']


Another method to remove an element is to use the `del` keyword.

In [12]:
del fruit[0:2]
print(fruit)

[]


# Dictionary (dict)

In the programming world, many concepts are based on real-life prototypes. The characteristics of the dictionary data structure are similar to real-world dictionaries, using name-content pairs for data construction. In Python, these correspond to keys and values, commonly referred to as key-value pairs.

The characteristics of dictionaries are summarized as follows:

1. The data in a dictionary must appear in key-value pairs.
2. Logically, keys cannot be duplicated, but values can be duplicated.
3. The keys in a dictionary are immutable, meaning they cannot be changed, while the values are mutable and can be modified. The values can be any object.

In [19]:
NASDAQ_code = {
'BIDU':'Baidu',
'BIDU':'Baidu',
'YOKU':'Youku'
}
print(NASDAQ_code)

{'BIDU': 'Baidu', 'YOKU': 'Youku'}


In a dictionary, keys and values cannot exist independently of each other.

*This raises an error, as shown below:*

In [9]:
NASDAQ_code = {
'BIDU':,
'SINA':'Sina',
'YOKU':'Youku'
}

SyntaxError: invalid syntax (375022434.py, line 2)

Keys and values correspond one-to-one, and keys must be immutable (more precisely, hashable). Trying to use a mutable object such as a list as a key raises a `TypeError`:

In [None]:
key_test = {[]:'a Test'}
print(key_test)
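
The cell above stops with an error. Here is a minimal sketch that catches the error so you can inspect it (the exact wording of the message may vary between Python versions) and then shows that an immutable tuple is accepted in the same position:

```python
try:
    key_test = {[]: 'a Test'}   # a list is mutable, so it cannot be a key
except TypeError as error:
    message = str(error)
    print('TypeError:', message)

# A tuple is immutable, so the same idea works with () instead of [].
key_test = {(): 'a Test'}
print(key_test)  # {(): 'a Test'}
```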

In a dictionary, keys cannot be duplicated. Even if you write the same key twice, it appears only once in the resulting dictionary.

In [None]:
a = {'key':123,'key':123}
print(a)
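
To see which value survives when a key is repeated, here is a small sketch with two different values (the numbers are illustrative). Python keeps the key once, and the later value overwrites the earlier one:

```python
a = {'key': 123, 'key': 456}

print(a)       # {'key': 456}
print(len(a))  # 1
```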

Create a dictionary with key-value pairs:

In [23]:
NASDAQ_code = {'BIDU':'Baidu','SINA':'Sina'}

Dictionaries do not have a method for adding a single element directly, but we can add elements using this approach:

```python
dictionary[key] = value
```

In [24]:
NASDAQ_code['YOKU'] = 'Youku'
print(NASDAQ_code)

{'BIDU': 'Baidu', 'SINA': 'Sina', 'YOKU': 'Youku'}


In lists, there is a method called `extend` used to add multiple elements. Similarly, in dictionaries, there is a method called `update` for adding multiple key-value pairs.

In [26]:
NASDAQ_code.update({'FB':'Facebook','TSLA':'Tesla'})
print(NASDAQ_code)

{'BIDU': 'Baidu', 'SINA': 'Sina', 'YOKU': 'Youku', 'FB': 'Facebook', 'TSLA': 'Tesla'}


Use the `del` statement to delete an entry from the dictionary:

In [27]:
del NASDAQ_code['FB']
print(NASDAQ_code)

{'BIDU': 'Baidu', 'SINA': 'Sina', 'YOKU': 'Youku', 'TSLA': 'Tesla'}


It is important to note that although dictionaries use curly braces, when indexing their contents, you still use square brackets like with lists. However, the brackets must contain a key from the dictionary, meaning you need to use the key to index the value.

In [28]:
NASDAQ_code['TSLA']

'Tesla'

Additionally, dictionaries cannot be sliced, so the following syntax is incorrect when applied to dictionaries:

```python
dictionary[1:3]  # This is incorrect for dictionaries
```

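In dictionaries, you must access elements directly by their keys. Note that the `NameError` shown below only appears because the cell was run in a fresh session in which `NASDAQ_code` had not yet been defined; with the dictionary defined, the slice still fails with an error.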

In [1]:
NASDAQ_code[1:4]

NameError: name 'NASDAQ_code' is not defined
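
With `NASDAQ_code` defined, slicing still fails. The sketch below illustrates this under that assumption; depending on the Python version the error is a `TypeError` or a `KeyError`, so both are caught. If you do need a range of entries, one option is to convert the items to a list first:

```python
NASDAQ_code = {'BIDU': 'Baidu', 'SINA': 'Sina', 'YOKU': 'Youku', 'TSLA': 'Tesla'}

try:
    NASDAQ_code[1:4]
except (TypeError, KeyError) as error:
    error_name = type(error).__name__
    print(error_name)

# Dictionaries keep insertion order, so the items can be sliced as a list.
subset = dict(list(NASDAQ_code.items())[1:4])
print(subset)  # {'SINA': 'Sina', 'YOKU': 'Youku', 'TSLA': 'Tesla'}
```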

# tuple

A tuple can be understood as a stable version of a list, because tuples are immutable. Therefore, list methods that modify the contents (such as `append`, `insert`, and `remove`) are not available on tuples. However, tuples can still be indexed and read.

In [2]:
letters = ('a','b','c','d','e','f','g')
letters[0]

'a'
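
A short sketch of what immutability means for methods (the tuple below is illustrative): read-only methods such as `count` and `index` are available on tuples, while modifying methods such as `append` are not:

```python
letters = ('a', 'b', 'c', 'a')

print(letters.count('a'))   # 2, because 'a' appears twice
print(letters.index('c'))   # 2, the position of the first 'c'

# Methods that would change the contents do not exist on tuples.
has_append = hasattr(letters, 'append')
print(has_append)           # False
```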

You can create a tuple using parentheses (). It can contain elements of any type, including numbers, strings, lists, etc.

In [3]:
my_tuple = (1, 2, 3, "hello", [4, 5])


You can use an index to access a single element of a tuple. Indexing starts at 0.

In [4]:
my_tuple = (1, 2, 3, "hello", [4, 5])
print(my_tuple[0])  # output 1
print(my_tuple[3])  # output "hello"


1
hello


You can use slicing to get a subset of a tuple.

In [5]:
my_tuple = (1, 2, 3, 4, 5)
subset = my_tuple[1:3]  # subset = (2, 3)
subset

(2, 3)

You can unpack elements from a tuple into multiple variables.

In [7]:
my_tuple = (1, 2, 3)
a, b, c = my_tuple
print(a)  # output 1
print(b)  # output 2
print(c)  # output 3


1
2
3


The elements of a tuple cannot be modified, which is the most significant difference between tuples and lists.

In [8]:
my_tuple = (1, 2, 3)
my_tuple[0] = 4  # will cause error because tuples are immutable


TypeError: 'tuple' object does not support item assignment

You can use the len() function to get the number of elements in a tuple.

In [11]:
my_tuple = (1, 2, 3)
length = len(my_tuple)  # length = 3
length

3

When you want data to remain unchanged, you can use a tuple.

When a function returns multiple values, it usually returns a tuple.

In [13]:
def get_info():
    name = "John"
    age = 30
    return name, age

result = get_info()
a,b = result
print(result)  # output ('John', 30)
print(a)


('John', 30)
John


Since tuples are immutable, they can be used as keys in dictionaries or as elements of sets (both require hashable objects), provided that every element of the tuple is itself hashable.

In [16]:
my_dict = {(1, 2): "value"}
my_set = {(1, 2), (3, 4)}
print(my_dict)
print(my_set)

{(1, 2): 'value'}
{(1, 2), (3, 4)}
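
One caveat, shown as a hedged sketch with made-up values: a tuple is only hashable when everything inside it is hashable too. A tuple that contains a list cannot be used as a key:

```python
good_key = (1, 2)
bad_key = (1, [2, 3])   # contains a mutable list

my_dict = {good_key: 'value'}   # works

try:
    my_dict[bad_key] = 'other value'
    failed = False
except TypeError:
    failed = True

print(failed)  # True
```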


# Set

A set is more akin to the mathematical concept of a set. Each element in a set is an unordered, unique object. We can use sets to determine membership of data and sometimes use sets to remove duplicate elements from data structures.

In [9]:
# use {} to create a set
a_set = {1, 2, 3, 4, 5}
print(a_set)

# use set() to create a set
my_set = set([1, 2, 3, 4, 5])

a_set.add(5)
print(a_set)

a_set.discard(5)
print(a_set)

my_set.add(6)    # add the element 6
print(my_set)

my_set.remove(3) # remove 3; raises a KeyError if 3 is not in the set
print(my_set)

my_set.discard(4) # remove 4; no error is raised even if 4 is not in the set
print(my_set)

{1, 2, 3, 4, 5}
{1, 2, 3, 4, 5}
{1, 2, 3, 4}
{1, 2, 3, 4, 5, 6}
{1, 2, 4, 5, 6}
{1, 2, 5, 6}
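
As a sketch of the de-duplication use mentioned earlier (the list of names is made up for illustration), converting a list to a set drops the repeats. Because sets are unordered, `sorted` is used here to get a predictable order back:

```python
names = ['pear', 'grape', 'pear', 'apple', 'grape']

unique_names = set(names)             # duplicates are removed
sorted_unique = sorted(unique_names)  # back to a list, in alphabetical order

print(len(unique_names))       # 3
print(sorted_unique)           # ['apple', 'grape', 'pear']
print('pear' in unique_names)  # True
```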


Union Set

In [13]:
set1 = {1, 2, 3}
set2 = {3, 4, 5}

union_set = set1.union(set2)  # or union_set = set1 | set2
print(union_set)

{1, 2, 3, 4, 5}


Intersection Set

In [15]:
intersection_set = set1.intersection(set2)  # or intersection_set = set1 & set2
intersection_set

{3}

Difference Set

In [16]:
difference_set = set1.difference(set2)  # or difference_set = set1 - set2
difference_set

{1, 2}

In [17]:
if 1 in my_set:
    print("1 is in the set")


1 is in the set


In [19]:
size = len(my_set)  # get size
size


4

In [20]:
my_set.clear()  # clear all elements
my_set

set()

# Data structure techniques

There are many functions whose usage is closely related to data structures. Previously, we learned the basic usage of lists, but in practical operations, we often encounter more problems. For example, when organizing tables or files, sorting by alphabet or date is common. Python also provides similar functionality.

In [22]:
num_list = [6,2,7,4,1,3,5]
print(sorted(num_list))

[1, 2, 3, 4, 5, 6, 7]


The `sorted` function returns the elements in ascending order by default: numbers from smallest to largest and strings alphabetically. Its optional `key` parameter lets you sort by another criterion, such as length. This function is often used in data presentation. An important point to note is that `sorted` does not modify the original list. You can think of it as first copying the list and then sorting the copy.
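
A small sketch of these points (the word list is illustrative): the original list is left untouched, and the `key` parameter of `sorted` changes the criterion, for example to sort by length:

```python
num_list = [6, 2, 7, 4, 1, 3, 5]
sorted_copy = sorted(num_list)

print(sorted_copy)  # [1, 2, 3, 4, 5, 6, 7]
print(num_list)     # [6, 2, 7, 4, 1, 3, 5], unchanged

words = ['pear', 'fig', 'banana']
by_length = sorted(words, key=len)
print(by_length)    # ['fig', 'pear', 'banana']
```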

When using the reverse parameter, the list can be sorted in reverse order.

In [23]:
sorted(num_list,reverse=True)

[7, 6, 5, 4, 3, 2, 1]

When organizing lists, if you need to handle two lists simultaneously, you can use the zip function.

In [24]:
num = [1,2,3,4]
string = ['1','2','3']
for a,b in zip(num,string):
    print(b,'is',a)

1 is 1
2 is 2
3 is 3
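
Note that `num` has four elements but only three lines are printed: `zip` stops as soon as the shorter input runs out. A short sketch that makes the pairs visible and also builds a dictionary from them (the name `lookup` is chosen for this example):

```python
num = [1, 2, 3, 4]
string = ['1', '2', '3']

# zip pairs elements position by position and drops the unmatched 4.
pairs = list(zip(num, string))
print(pairs)    # [(1, '1'), (2, '2'), (3, '3')]

# dict() accepts an iterable of (key, value) pairs.
lookup = dict(zip(string, num))
print(lookup)   # {'1': 1, '2': 2, '3': 3}
```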


List comprehensions, which you may also have seen called list comprehension expressions, are a compact way to build a list.

There are 10 elements to be put into a list, and the ordinary way is as follows:

In [25]:
a = []
for i in range(1,11):
    a.append(i)
print(a)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


The list comprehension version looks like this:


In [26]:
b = [i for i in range(1,11)]
print(b)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


**Tip:** List comprehensions are not only convenient but also usually faster than the equivalent loop. Timing the two approaches makes the difference visible, although the exact numbers depend on your machine.

In [3]:
import time

# Timing the loop method
a = []
start_time = time.perf_counter()
for i in range(1, 20000000):
    a.append(i)
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"Execution time: {elapsed_time} seconds")

# Timing the list comprehension method
start_time = time.perf_counter()
b = [i for i in range(1, 20000000)]
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"Execution time: {elapsed_time} seconds")


Execution time: 2.2309534999999983 seconds
Execution time: 1.3103011000000038 seconds
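
A single run like the one above can be noisy. As a hedged alternative, the sketch below uses the standard library's `timeit` module to repeat each approach several times on a smaller list. The function names `build_with_loop` and `build_with_comprehension` are made up for this example, and the timings will differ from machine to machine:

```python
import timeit

def build_with_loop(n):
    # Append the numbers 1..n-1 one at a time.
    result = []
    for i in range(1, n):
        result.append(i)
    return result

def build_with_comprehension(n):
    # Build the same list with a list comprehension.
    return [i for i in range(1, n)]

# Run each function 20 times and report the total elapsed time.
loop_time = timeit.timeit(lambda: build_with_loop(100_000), number=20)
comp_time = timeit.timeit(lambda: build_with_comprehension(100_000), number=20)

print(f"Loop:          {loop_time:.4f} seconds")
print(f"Comprehension: {comp_time:.4f} seconds")
```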


The usage of list comprehension is also easy to understand. You can think of it as having two parts: the part starting with `for` is the familiar `for` loop expression, and the part before it is the element we want to put in the list. There is no colon between the two parts.

new_list = [item for item in iterable]

In [4]:
a = [i**2 for i in range(1,10)]
c = [j+1 for j in range(1,10)]
k = [n for n in range(1,10) if n % 2 ==0]
z = [letter.lower() for letter in 'ABCDEFGHIGKLMN']
print(a)
print(c)
print(k)
print(z)

[1, 4, 9, 16, 25, 36, 49, 64, 81]
[2, 3, 4, 5, 6, 7, 8, 9, 10]
[2, 4, 6, 8]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'g', 'k', 'l', 'm', 'n']


Dictionary comprehension works slightly differently, because each entry must supply both a key and a value, written as `key: value`.

In [1]:
d = {i:i+1 for i in range(4)}
n = {i:j for i,j in zip(range(1,6),'abcde')}
g = {i:j.upper() for i,j in zip(range(1,6),'abcde')}
print(d)
print(n)
print(g)

{0: 1, 1: 2, 2: 3, 3: 4}
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
{1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E'}
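
Another common use, sketched here with an illustrative dictionary: swapping keys and values. This assumes the values are unique and hashable; otherwise entries would be lost or an error would be raised:

```python
codes = {'BIDU': 'Baidu', 'SINA': 'Sina', 'YOKU': 'Youku'}

# items() yields (key, value) pairs; the comprehension swaps them.
names_to_codes = {name: code for code, name in codes.items()}

print(names_to_codes)  # {'Baidu': 'BIDU', 'Sina': 'SINA', 'Youku': 'YOKU'}
```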


Looping through a list while getting the index of each element: suppose we have a list of letters. How can we get the position of each element while iterating? Lists are ordered, so we can use Python's built-in function `enumerate` for this.

In [2]:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
for num,letter in enumerate(letters):
    print(letter,'is',num + 1)

a is 1
b is 2
c is 3
d is 4
e is 5
f is 6
g is 7
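
Instead of adding 1 by hand, `enumerate` also accepts a `start` argument. A minimal sketch with a shorter list:

```python
letters = ['a', 'b', 'c']

# start=1 makes the counter begin at 1 instead of 0.
numbered = list(enumerate(letters, start=1))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]

for num, letter in enumerate(letters, start=1):
    print(letter, 'is', num)
```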


Finally, we will do a word frequency count.

Use the `split` method to get each word in the string:

In [12]:
lyric = 'The night begin to shine, the night begin to shine 1 2'
words = lyric.split()
print(words)

['The', 'night', 'begin', 'to', 'shine,', 'the', 'night', 'begin', 'to', 'shine', '1', '2']


Next is word frequency counting. We use the `count` method to count the repeated occurrences of words:

In [13]:
for word in words:
    print('{}-{} times'.format(word,words.count(word)))

The-1 times
night-2 times
begin-2 times
to-2 times
shine,-1 times
the-1 times
night-2 times
begin-2 times
to-2 times
shine-1 times
1-1 times
2-1 times


The results are out, but something seems odd. After careful observation, the conclusions are:

1. Some words with punctuation were counted separately.

2. Some words displayed their occurrence count more than once.

There is also a third issue: since Python is case-sensitive, words that begin with a capital letter were counted separately. Now let's adjust our counting method based on these points and preprocess the words a bit.

In [18]:
import string
text= 'The night begin to shine, the night begin to shine 1 2'
# string.punctuation is a string containing the punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
# strip(string.punctuation) removes those characters from the start and end of each word
words = [raw_word.strip(string.punctuation).lower() for raw_word in text.split()]
words_index = set(words)
counts_dict = {index:words.count(index) for index in words_index}
# key=lambda x: counts_dict[x] sorts the words by their counts
for word in sorted(counts_dict,key=lambda x: counts_dict[x],reverse=True):
    print('{} -- {} times'.format(word,counts_dict[word]))

to -- 2 times
night -- 2 times
shine -- 2 times
the -- 2 times
begin -- 2 times
1 -- 1 times
2 -- 1 times
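
The same count can also be done with `collections.Counter` from the standard library. This is an alternative sketch, not the approach used above; it preprocesses the words the same way:

```python
import string
from collections import Counter

text = 'The night begin to shine, the night begin to shine 1 2'
words = [raw_word.strip(string.punctuation).lower() for raw_word in text.split()]

# Counter builds the word -> count mapping in one step.
counts = Counter(words)

# most_common() yields (word, count) pairs from most to least frequent.
for word, count in counts.most_common():
    print('{} -- {} times'.format(word, count))
```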


# Classes and Coca-Cola


"In America, everyone drinks the same Coke, whether it's the President or a homeless person," said pop artist Andy Warhol. If we were to express Andy's idea using programming language, I think the concept of a class would be the most appropriate.

Just as the name "class" suggests, the concept it describes is very similar to the concept of classes in our real life. Living organisms have different classes, foods have different classes, and various commodities in human society also have different classes. Any objects that can be referred to as a class have similar characteristics and behaviors. In other words, a class is a generalization of an abstract concept of a series of things with common characteristics and behaviors.

For Coca-Cola, as long as it is the same brand of Coke, they have the same ingredients, which is called the formula. Just like the uniform standards followed during mass production in a factory, it is because of the same formula that all Coca-Cola achieves the same taste. We can use a class in Python to express this:

We use `class` to define a class, just as simply as using `def` to define a function. Next, you can see a variable `formula` containing a list in the indented area. Variables assigned within the class are class variables, and there is a special term for class variables: we call them class attributes.

Class variables are no different from the variables we encounter. Since strings, lists, dictionaries, integers, and so on can all be variables, they can naturally all become class attributes.

In [19]:
class CocaCola:
    formula = ['caffeine','sugar','water','soda']

**Class Instantiation**

Next, we produce Coca-Cola according to the defined formula. Naturally, no matter how many bottles of Coke are produced according to this formula, they all taste the same.

On the left, we create a variable and on the right, we write the name of the class. This looks very much like an assignment operation, and we call it class instantiation. The object that is instantiated is called an instance, or an instance of the class. For Coca-Cola, the process of producing Coke according to the formula is the process of instantiation.

In [20]:
CocaCola0 = CocaCola()
coke_for_me = CocaCola()
coke_for_you = CocaCola()

**Class Attribute References**

When you type a `.` after the class name, the IDE will automatically suggest the attributes we defined earlier in the class. This is known as attribute references.

Class attributes are shared by all instances of the class, so when you use a `.` after an instance of the class to access an attribute, the attribute value will be exactly the same.

In [21]:
print(CocaCola0.formula)
print(coke_for_me.formula)
print(coke_for_you.formula)

['caffeine', 'sugar', 'water', 'soda']
['caffeine', 'sugar', 'water', 'soda']
['caffeine', 'sugar', 'water', 'soda']


The lines of code above are like saying, "Tell me the formula of Coca-Cola" and "Tell me the formula of the Coke in your hand." The result is exactly the same.
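
A small sketch that checks this directly: every instance refers to the very same list object stored on the class. The class is redefined here so the example is self-contained:

```python
class CocaCola:
    formula = ['caffeine', 'sugar', 'water', 'soda']

coke_for_me = CocaCola()
coke_for_you = CocaCola()

# "is" tests whether two names refer to the same object.
same_as_class = coke_for_me.formula is CocaCola.formula
same_as_other = coke_for_me.formula is coke_for_you.formula

print(same_as_class)  # True
print(same_as_other)  # True
```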

Class attributes are no different from regular variables.

In [22]:
for element in coke_for_me.formula:
    print(element)

caffeine
sugar
water
soda


Coca-Cola's global popularity is closely related to its localized marketing strategies. In 1927, Coca-Cola first entered China. At that time, Chinese people felt distant from this black, bittersweet drink. Moreover, "Coca-Cola" had no official translation and was used in its English form, while locals called this strange drink "kē kē kěn là" (蝌蝌啃蜡). The strange taste combined with the strange name led to a poor reception when Coca-Cola first entered China.

In 1979, China began large-scale opening up of import and export trade. The official, Chinese-labeled "Coca-Cola" that we are familiar with started to appear in people's lives and became popular. The same formula, but with a different name, brought different results. This shows that it is necessary to make some unique localization adjustments during production:

In [24]:
formula = ['caffeine','sugar','water','soda']
coke_for_China = CocaCola()
coke_for_China.local_logo = 'CocaCola'
print(coke_for_China.local_logo) 

CocaCola


Through the code above, we have given the Coca-Cola produced in China its own local logo (in practice, the Chinese characters "可口可乐"). After creating the instance, we assign a value using the form `object.new_attr`, and thus obtain a new instance variable. An instance variable is referred to as an instance attribute.

The formula belongs to Coca-Cola (Class), while the Chinese label (local_logo) belongs to each bottle of Coca-Cola in China (Instance). Labeling the Coca-Cola in China with a Chinese label does not affect the Coca-Cola labels sold in the US or Japan.
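
A minimal sketch of this separation (the second instance, `coke_for_USA`, is a hypothetical addition for illustration): the new attribute exists only on the instance it was assigned to, not on other instances and not on the class:

```python
class CocaCola:
    formula = ['caffeine', 'sugar', 'water', 'soda']

coke_for_China = CocaCola()
coke_for_USA = CocaCola()   # hypothetical second instance

# Attach a new attribute to one instance only.
coke_for_China.local_logo = '可口可乐'

print(hasattr(coke_for_China, 'local_logo'))  # True
print(hasattr(coke_for_USA, 'local_logo'))    # False
print(hasattr(CocaCola, 'local_logo'))        # False
```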

**Instance Methods**

Instances of a class can reference attributes, but we also learned earlier that instances of a class can use methods (see Chapter 3: String Methods). A method is simply a function defined inside a class. Methods are for instances to use, so we can also call them instance methods. When you drink a bottle of Coca-Cola, you gain energy from the caffeine and the large amount of sugar. If we were to represent this "function" of Coca-Cola using a method, it would look like this:

In [25]:
class CocaCola:
    formula = ['caffeine','sugar','water','soda']
    def drink(self): # HERE
        print('Energy!')
coke = CocaCola()
coke.drink()



Energy!


The `self` parameter seems like it hasn't been put to any use. What happens if we give it a different name?

In [26]:
class CocaCola:
    formula = ['caffeine','sugar','water','soda']
    def drink(coke): # HERE
        print('Energy!')
coke = CocaCola()
coke.drink()

Energy!


This parameter is actually the instance itself that is being created. The way to use a function is to place objects as parameters within the function parentheses. Furthermore, once a class is instantiated, we can use it in a way similar to how we use functions:

```python
coke = CocaCola()
coke.drink() == CocaCola.drink(coke)  # both calls do the same thing
```

The instantiated object is silently passed into the parentheses of the method by Python as the first parameter. The two calls above are equivalent, but we more commonly write it in the former way.

In [29]:
class CocaCola:
    formula = ['caffeine','sugar','water','soda']
    def drink(coke): # HERE
        print('Energy!')
coke = CocaCola()
coke.drink()
CocaCola.drink(coke)

Energy!
Energy!
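
So far the first parameter has not been used inside the method body. As a sketch of why it matters, the hypothetical `describe` method below (not part of the original class) uses `self` to reach the attributes of the instance it was called on:

```python
class CocaCola:
    formula = ['caffeine', 'sugar', 'water', 'soda']

    def describe(self):
        # self is the instance the method was called on.
        return 'Made of ' + ', '.join(self.formula)

coke = CocaCola()

via_instance = coke.describe()        # Python passes coke as self
via_class = CocaCola.describe(coke)   # the same call written explicitly

print(via_instance)  # Made of caffeine, sugar, water, soda
print(via_class)     # Made of caffeine, sugar, water, soda
```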


Like functions, methods can also have their own parameters. Let's try making some changes to the `.drink()` method:


In [30]:
class CocaCola:
    formula = ['caffeine','sugar','water','soda']
    def drink(self,how_much):
        if how_much == 'a sip':
            print('Cool')
        elif how_much =='whole bottle':
            print('Headache!')
ice_coke = CocaCola()
ice_coke.drink('a sip')

Cool


In Python classes, there are some methods called "magic methods," and `__init__()` is one of them. The magic of `__init__()` lies in the fact that if you define it within a class, it can automatically handle many things when an instance is created, such as adding instance attributes.

In the code above, we created an instance attribute, but it was done after defining the class.

`__init__()` is short for initialize, which means that even if we don't explicitly call this method when creating an instance, the commands within it will be executed automatically.

In [32]:
class CocaCola():
    formula = ['caffeine','sugar','water','soda']
    def __init__(self):
        self.local_logo = 'CocaCola'
    def drink(self): # HERE
        print('Energy!')
coke = CocaCola()
print(coke.local_logo)

CocaCola


The `__init__()` method provides great flexibility in using classes.

In [36]:
class CocaCola:
    formula = ['caffeine','sugar','water','soda']
    def __init__(self):
        for element in self.formula:
            print('Coke has {}!'.format(element))
    def drink(self):
        print('Energy!')
coke = CocaCola()

Coke has caffeine!
Coke has sugar!
Coke has water!
Coke has soda!


Besides the mandatory `self` parameter, the `__init__()` method can also have its own parameters. Additionally, you don't need to call it explicitly with `obj.__init__()` (since it executes automatically). Instead, you pass the parameters into the parentheses following the class during instantiation, and all the corresponding parameters will be passed to this special `__init__()` method.

In [40]:
class CocaCola:
    formula = ['caffeine','sugar','water','soda']
    def __init__(self,logo_name):
        self.local_logo = logo_name
    def drink(self):
        print('Energy!')
coke = CocaCola('CocaCola')
coke.local_logo
# print(coke.local_logo)

'CocaCola'

If you find `self.local_logo = logo_name` in the code above confusing, we can explain it briefly here. The left side is a variable used as an instance attribute, and the right side is the parameter passed in as a variable. In other words, the value stored in this attribute depends on the parameter `logo_name` passed in during initialization. Whatever is passed in will be the value of the attribute.
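
To make this concrete, here is a small sketch with two instances (the logo strings are illustrative). Each instance stores whatever was passed in when it was created:

```python
class CocaCola:
    formula = ['caffeine', 'sugar', 'water', 'soda']

    def __init__(self, logo_name):
        # Store the argument on this particular instance.
        self.local_logo = logo_name

coke_for_China = CocaCola('可口可乐')
coke_for_USA = CocaCola('Coca-Cola')

print(coke_for_China.local_logo)  # 可口可乐
print(coke_for_USA.local_logo)    # Coca-Cola
```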

**Class Inheritance**

Times are changing, and the variety of consumer goods is constantly increasing. Today is no longer the era of Andy Warhol, when there was only one flavor of Coca-Cola. Moreover, not all Coca-Cola flavors taste the same—if you've ever tried cherry-flavored Coke, you'll understand. The taste of Coca-Cola itself has changed over time to meet modern demands.

Now, let's redefine this class using the latest formula from the official Coca-Cola website:


In [42]:
class CocaCola:
    calories = 140
    sodium = 45
    total_carb = 39
    caffeine = 34
    ingredients = [
        'High Fructose Corn Syrup',
        'Carbonated Water',
        'Phosphoric Acid',
        'Natural Flavors',
        'Caramel Color',
        'Caffeine'
    ]
    def __init__(self,logo_name):
        self.local_logo = logo_name
    def drink(self):
        print('You got {} cal energy!'.format(self.calories))

Different localization strategies and the development of new varieties make production more than just changing the label. Packaging, volume, and even the formula may change, but the one thing that remains constant is that they are always Coca-Cola.

All subcategories will inherit the Coca-Cola brand. In Python, classes naturally have a corresponding concept called inheritance. Let's use caffeine-free Coke as an example:

In [43]:
class CaffeineFree(CocaCola):
    caffeine = 0
    ingredients = [
        'High Fructose Corn Syrup',
        'Carbonated Water',
        'Phosphoric Acid',
        'Natural Flavors',
        'Caramel Color',
    ]
coke_a = CaffeineFree('Cocacola-FREE')
coke_a.drink()
coke_a.local_logo


You got 140 cal energy!


'Cocacola-FREE'

We place `CocaCola` in parentheses after the new class `CaffeineFree`, which indicates that this class inherits from the parent class `CocaCola`, making `CaffeineFree` a subclass of `CocaCola`. Variables and methods in the parent class can be completely inherited by the subclass, but they can also be overridden if special modifications are needed.

You can see that `CaffeineFree` has differences in caffeine content and formula. In the new class, only these two aspects are overridden, while other methods and attributes that are not overridden can still be used as usual.
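
The relationship can be checked with the built-in functions `issubclass` and `isinstance`. The sketch below uses a trimmed-down version of the two classes (only a few attributes are kept, for brevity) to show which values are overridden and which are inherited:

```python
class CocaCola:
    calories = 140
    caffeine = 34

    def __init__(self, logo_name):
        self.local_logo = logo_name

    def drink(self):
        print('You got {} cal energy!'.format(self.calories))

class CaffeineFree(CocaCola):
    caffeine = 0   # overrides the parent's value

coke_a = CaffeineFree('Cocacola-FREE')

print(issubclass(CaffeineFree, CocaCola))  # True
print(isinstance(coke_a, CocaCola))        # True
print(CocaCola.caffeine, coke_a.caffeine)  # 34 0
print(coke_a.calories)                     # 140, inherited unchanged
```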

If a class attribute is reassigned, does it affect what an existing instance sees when it references that attribute?

In [50]:
class TestA:
    attr = 1
obj_a = TestA()
print(obj_a.attr)
TestA.attr = 42
print(obj_a.attr)

1
42
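
A related case, sketched here as an addition to the example above: assigning through an instance does not change the class attribute. It creates an instance attribute with the same name, which hides the class attribute for that one instance only (`obj_b` is a hypothetical second instance):

```python
class TestA:
    attr = 1

obj_a = TestA()
obj_b = TestA()

obj_a.attr = 7      # creates an instance attribute on obj_a only

print(obj_a.attr)   # 7
print(obj_b.attr)   # 1
print(TestA.attr)   # 1
```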


Built-in types in Python are classes too, and they are pre-defined by the interpreter itself. For example, `int` is a class, and `str` is a class.

In Python, we can use the `type()` function to get the type of an object.

Let's take a look at some examples:

In [51]:
obj1 = 1
obj2 = 'String!'
obj3 = []
obj4 = {}
print(type(obj1),type(obj2),type(obj3),type(obj4))

<class 'int'> <class 'str'> <class 'list'> <class 'dict'>


In Python, any kind of object is an instance of a class. The types mentioned above are called built-in types, and their instances can be created directly with literal syntax (such as `1` or `[]`) instead of explicitly calling the class as we did above.
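
Built-in types can still be called like any other class when that is convenient. A short sketch with illustrative values:

```python
number = int('42')      # calls the int class to build an instance
letters = list('abc')   # calls the list class
empty = dict()          # same result as the literal {}

print(number, letters, empty)    # 42 ['a', 'b', 'c'] {}
print(isinstance(number, int))   # True
print(type(int))                 # <class 'type'>
```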

We use `BeautifulSoup` as an example. Hold down `Cmd` (on Windows, `Ctrl`) and click on `BeautifulSoup` to view the complete class definition of the `soup` object.

**Note:** If you haven't installed `BeautifulSoup`, you can install it using the following command:

```bash
pip install beautifulsoup4
```

Additionally, you may also want to install a parser like `lxml` for better performance:

```bash
pip install lxml
```


In [3]:
# or run this cell to install both packages
!pip install beautifulsoup4 lxml



In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup
print(type(soup))

<class 'type'>


For more about BeautifulSoup, please visit https://www.crummy.com/software/BeautifulSoup/bs4/doc/