# "Python for Data Analysis" Course from [O'Reilly](https://learning.oreilly.com/library/view/python-for-data/9781491957653/)
## Chapter 3. Built-in Data Structures, Functions, and Files [Chapter3](https://learning.oreilly.com/library/view/python-for-data/9781491957653/ch03.html)

## Preparations
- **References**
    * Chapter 3. Built-in Data Structures, Functions, and Files

## 3.1 Data Structures and Sequences
- **tuple**
    * A tuple is a fixed-length, **immutable** sequence of Python objects
    * If an object inside a tuple is mutable, such as a list, you can still modify that object in-place, even though the tuple itself cannot be changed
- **list**
    * In contrast with tuples, lists are variable-length and their contents can be **modified** in-place. 
    * You can define them using square brackets [] or using the list type function
    * Lists and tuples are semantically similar (though tuples cannot be modified) and can be used interchangeably in many functions
- **dict**
    * dict is likely the most important built-in Python data structure. A more common name for it is hash map or associative array. It is a flexibly sized collection of key-value pairs, where key and value are Python objects. One approach for creating one is to use curly braces {} and colons to separate keys and values. 
    * While the values of a dict can be any Python object, the **keys** generally have to be **immutable** objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too)
- **set**
    * A set is an unordered collection of unique elements. You can think of them like dicts, but keys only, no values. A set can be created in two ways: via the set function or via a set literal with curly braces
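
The immutability requirement on dict keys (and set elements) is, more precisely, a hashability requirement, which can be checked with the built-in `hash` function. The sketch below illustrates this; the helper name `is_hashable` is made up for this example.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('a string'))      # True
print(is_hashable((1, 2, (2, 3))))  # True: every element is immutable
print(is_hashable((1, 2, [2, 3])))  # False: the tuple contains a list

# A list can be converted to a tuple so that it can serve as a key
d = {}
d[tuple([1, 2, 3])] = 5
print(d)                            # {(1, 2, 3): 5}
```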


In [143]:
# Define a tuple
tup = 4, 5, 6, (7, 8, 9), [10, 11, 12]
print(tup)

# Select element of a tuple
print(tup[2:4])

# Modify mutable object inside a tuple
tup[4].append(13)
tup[4][1] = 'a'

print(tup[4])

# Convert to a tuple or a list
print("Convert to a tuple:", tuple([1, 2, 3]), tuple('string'))
print("Convert to a list:", list(tuple('string')), list(range(10)))

# concatenate tuples using +, *
print(tuple([1, 2, 3]) + tuple('string'))
print(tuple([1, 2, 3])*3 + tuple('string'))

# UNPACKING TUPLES with nested tuples
a, b, c, d, [e, f, g, h] = tup
print("Unpacking Tuples:", d, e, f)

# A common use of variable unpacking is iterating over sequences of tuples or lists:
seq = [(1, 2, 3, 4, 5, 6), (7, 8, 9, 10, 11, 12), (13, 14, 15, 16, 17, 18)]
for a, b, c, *rest in seq:
    print('a={0}, b={1}, c={2}, the rest={3}'.format(a, b, c, rest))
# Use an underscore if you don't need the remaining values
for a, b, c, *_ in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c))

# Swapping variables
d, e, f = e, f, d
print("Swapping variables:", d, e, f)

# Creating a dictionary
mapping = dict(zip(range(5), reversed(range(5))))
print("Create a dictionary using dict:", mapping)
# Find value based on a given key
val1 = mapping.get(6, 'default_value')
val2 = mapping.get(0, 'default_value')
print('Find a key: {1}; Not Find a key: {0}'.format(val1, val2))

# Create a set
set1 = set([2, 2, 2, 1, 3, 3])
set2 = {3, 4, 5, 6, 7, 8}
print(set1)
print('union two sets:',set1.union(set2), set1 | set2)
print('inner join two sets:',set1.intersection(set2), set1 & set2)


(4, 5, 6, (7, 8, 9), [10, 11, 12])
(6, (7, 8, 9))
[10, 'a', 12, 13]
Convert to a tuple: (1, 2, 3) ('s', 't', 'r', 'i', 'n', 'g')
Convert to a list: ['s', 't', 'r', 'i', 'n', 'g'] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
(1, 2, 3, 's', 't', 'r', 'i', 'n', 'g')
(1, 2, 3, 1, 2, 3, 1, 2, 3, 's', 't', 'r', 'i', 'n', 'g')
Unpacking Tuples: (7, 8, 9) 10 a
a=1, b=2, c=3, the rest=[4, 5, 6]
a=7, b=8, c=9, the rest=[10, 11, 12]
a=13, b=14, c=15, the rest=[16, 17, 18]
a=1, b=2, c=3
a=7, b=8, c=9
a=13, b=14, c=15
Swapping variables: 10 a (7, 8, 9)
Create a dictionary using dict: {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
Find a key: 4; Not Find a key: default_value
{1, 2, 3}
union two sets: {1, 2, 3, 4, 5, 6, 7, 8} {1, 2, 3, 4, 5, 6, 7, 8}
inner join two sets: {3} {3}


- **Change/Update Variables**
   * Tuples CANNOT be **modified**, while lists can be **modified**. 
   * WARNING: **insert** is computationally expensive compared with **append**, because references to subsequent elements have to be shifted internally to make room for the new element.
   * Using **extend** to append elements to an existing list is usually faster than concatenating with **+**, because **+** has to create a new list and copy the objects over
   * In a dict, you can delete values using either the **del** keyword or the **pop** method (which returns the value and deletes the key at the same time)
   * The dict **update** method changes the dict in-place, so **any existing keys** in the data passed to update will have their **old values discarded**.
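
The points above about `extend`, `+`, `update`, `pop`, and `del` can be checked with a short sketch. The variable names are illustrative only.

```python
# extend mutates the existing list; + builds a brand-new list
x = [1, 2, 3]
same = x
x.extend([4, 5])
print(same is x, x)   # True [1, 2, 3, 4, 5]

y = x + [6]
print(y is x, y)      # False [1, 2, 3, 4, 5, 6]

# update overwrites the values of keys that already exist
d1 = {'a': 'some value', 'b': 'old'}
d1.update({'b': 'foo', 'c': 12})
print(d1)             # {'a': 'some value', 'b': 'foo', 'c': 12}

# pop returns the removed value; del simply removes the key
removed = d1.pop('a')
del d1['c']
print(removed, d1)    # some value {'b': 'foo'}
```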

In [7]:
ls = list(range(10))
print(ls)

# Add/Insert/Delete elements
ls.append('a')
print("Add an element to the end:", ls)
ls.insert(1, 'a')
print("Insert an element at a specific location:", ls)
ls.extend(['b', 'c', 'd'])
print("Append multiple elements:", ls)
print("Concatenates lists:", ls + ['e', 'f'])
ls.pop(2)
print("Remove an element at a specific location:", ls)
ls.remove('a')
print("Remove the first matched element:", ls)

# Others
print("Check if a list contains 'a' value:", 'a' in ls, 'a' not in ls)

print("----------------------------")
print("### Below for dictionary ###")
# Add element into dictionary
d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
d1[7] = 'an integer'
print("Add element into dictionary:", d1)
# Delete an element from the dictionary
del d1[7]
print("Delete element using del into dictionary:", d1)
d1.pop('a')
print("Delete element using pop into dictionary:", d1)
# Merge element into dictionary
d1.update({'d' : 'foo', 'c' : 12})
print("Merge element into dictionary:", d1)
print("The keys are: {0} and the values are: {1}".format(list(d1.keys()),list(d1.values())))


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Add an element to the end: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'a']
Insert an element at a specific location: [0, 'a', 1, 2, 3, 4, 5, 6, 7, 8, 9, 'a']
Append multiple elements: [0, 'a', 1, 2, 3, 4, 5, 6, 7, 8, 9, 'a', 'b', 'c', 'd']
Concatenates lists: [0, 'a', 1, 2, 3, 4, 5, 6, 7, 8, 9, 'a', 'b', 'c', 'd', 'e', 'f']
Remove an element at a specific location: [0, 'a', 2, 3, 4, 5, 6, 7, 8, 9, 'a', 'b', 'c', 'd']
Remove the first matched element: [0, 2, 3, 4, 5, 6, 7, 8, 9, 'a', 'b', 'c', 'd']
Check if a list contains 'a' value: True False
----------------------------
### Below for dictionary ###
Add element into dictionary: {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
Delete element using del into dictionary: {'a': 'some value', 'b': [1, 2, 3, 4]}
Delete element using pop into dictionary: {'b': [1, 2, 3, 4]}
Merge element into dictionary: {'b': [1, 2, 3, 4], 'd': 'foo', 'c': 12}
The keys are: ['b', 'd', 'c'] and the values are: [[1, 2, 3, 4], 'foo', 12]

- **Slicing**
   * While the element at the start index is included, the stop index is not included, so that the number of elements in the result is stop - start.
   ![pyda_1302.png](./pyda_1302.png)
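   * The full slice syntax is `seq[start:stop:step]`; any of the three parts can be omitted, and a negative step walks the sequence backwards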
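
A few slices on a small list make the start/stop/step rules concrete. This is a minimal sketch; the list contents are arbitrary.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

print(seq[1:5])    # [2, 3, 7, 5]  (5 - 1 = 4 elements)
print(seq[:5])     # [7, 2, 3, 7, 5]
print(seq[3:])     # [7, 5, 6, 0, 1]
print(seq[-4:])    # [5, 6, 0, 1]  (negative indices count from the end)
print(seq[::2])    # [7, 3, 5, 0]  (every other element)
print(seq[::-1])   # [1, 0, 6, 5, 7, 3, 2, 7]  (reversed copy)

# Assigning to a slice replaces that span, even with a different length
seq[3:4] = [6, 3]
print(seq)         # [7, 2, 3, 6, 3, 5, 6, 0, 1]
```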

- **Some Useful Built-in Functions**
   * __enumerate__ iterates over a sequence and yields (i, value) tuples, so you do not have to track the index yourself
   * __zip__ can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence
   * __reversed__ returns a lazy iterator, so it does not create the reversed sequence until materialized (e.g., with list or a for loop)
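
One practical consequence of this laziness, assuming Python 3, is that the objects returned by `reversed` and `zip` can only be consumed once. A small sketch:

```python
letters = ['a', 'b', 'c']

rev = reversed(letters)
print(list(rev))   # ['c', 'b', 'a']
print(list(rev))   # []  (the iterator is already exhausted)

pairs = zip(letters, [1, 2, 3])
first_pass = list(pairs)
second_pass = list(pairs)
print(first_pass)   # [('a', 1), ('b', 2), ('c', 3)]
print(second_pass)  # []

# enumerate accepts an optional start value for the counter
numbered = list(enumerate(letters, start=1))
print(numbered)     # [(1, 'a'), (2, 'b'), (3, 'c')]
```

If the values are needed more than once, materialize them with `list` first.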

In [145]:
# enumerate yields (i, value) tuples while iterating over a sequence
some_list = ['foo', 'bar', 'baz']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
print("Using enumerate to iterate: {0}".format(mapping))

# zip "pairs" up the elements of a number of lists, tuples, or other sequences into tuples (in Python 3 it returns an iterator, not a list)
seq1 = ['foo', 'bar', 'baz']
seq2 = [1, 2, 3]
seq3 = [False, True]
zipped1 = zip(seq1, seq2)
zipped2 = zip(seq1, seq2, seq3)
print("Using zip to pair elements: {0}".format(dict(zipped1)))
print("zip is determined by the shortest sequence: {0}".format(list(zipped2)))

mapping = {}
for i, (a, b) in enumerate(zip(seq1, seq2)):
    mapping[(a,b)] = i
print(mapping)
unzip1, unzip2 = zip(*mapping)
print("To unzip the sequence: {0}, {1}".format(unzip1, unzip2))

# reversed iterates over the elements of a sequence in reverse order
print(list(reversed(range(10))))

Using enumerate to iterate: {'foo': 0, 'bar': 1, 'baz': 2}
Using zip to pair elements: {'foo': 1, 'bar': 2, 'baz': 3}
zip is determined by the shortest sequence: [('foo', 1, False), ('bar', 2, True)]
{('foo', 1): 0, ('bar', 2): 1, ('baz', 3): 2}
To unzip the sequence: ('foo', 'bar', 'baz'), (1, 2, 3)
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


- **List, Set, and Dict Comprehensions**
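   * A nested list comprehension is equivalent to a nested for loop, with the for clauses written in the same order as the loops
   * For example, `[y for x in all_data for y in x]` corresponds to:  
       for x in all_data:
           for y in x:
               result.append(y)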
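
The equivalence between a nested comprehension and a nested for loop can be verified directly. The sketch below filters names containing two or more `'e'` characters both ways; the variable names are illustrative.

```python
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

# Nested for loop
names_of_interest = []
for names in all_data:
    for name in names:
        if name.count('e') >= 2:
            names_of_interest.append(name)

# Equivalent nested list comprehension: same clause order as the loops
result = [name for names in all_data for name in names
          if name.count('e') >= 2]

print(names_of_interest)            # ['Steven']
print(result == names_of_interest)  # True
```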



In [20]:
# list comprehension: [expr for val in collection if condition]
# set comprehension: {expr for value in collection if condition}
# dict comprehension: {key-expr : value-expr for value in collection if condition}
print({i : i**2 for i in range(10) if i>5})
print({val : index for index, val in enumerate('strings')})

# Nested list comprehensions
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]
print([name for names in all_data for name in names if name.count('a') >= 2])


some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
print('flattened nested list: {0}'.format(flattened))

ls2 = [[i, i**2, i**3] for i in range(10)]
print("select even position: {0}".format(ls2[::2]))

{6: 36, 7: 49, 8: 64, 9: 81}
{'s': 6, 't': 1, 'r': 2, 'i': 3, 'n': 4, 'g': 5}
['Maria', 'Natalia']
flattened nested list: [1, 2, 3, 4, 5, 6, 7, 8, 9]
select even position: [[0, 0, 0], [2, 4, 8], [4, 16, 64], [6, 36, 216], [8, 64, 512]]


## 3.2 Functions

- **Functions**
   * Functions are declared with the __def__ keyword and return a value with the __return__ keyword. A function that reaches its end without a return statement returns None.
   * Each function can have __positional__ arguments and __keyword__ arguments. Keyword arguments are most commonly used to specify default values or optional arguments. 
       * The main restriction on function arguments is that keyword arguments must follow the positional arguments (if any). You can specify keyword arguments __in any order__, so you only need to remember the argument names, not the order in which they were declared
   * Functions can access variables in two different scopes: **global** and **local**. An alternative and more descriptive name describing a variable scope in Python is a namespace. Any variables that are assigned within a function by default are assigned to the local namespace. The local namespace is created when the function is called and immediately populated by the function’s arguments. After the function is finished, the local namespace is destroyed (with some exceptions that are outside the purview of this chapter).
       * Assigning variables outside of the function’s scope is possible, but those variables must be declared as global via the **global** keyword
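
The argument-ordering rule and the local/global distinction can be sketched as follows. All function and variable names here (`scale_sum`, `counter`, and so on) are hypothetical, chosen only for illustration.

```python
def scale_sum(x, y, z=1.5):
    """x and y are positional; z is an optional keyword argument."""
    return z * (x + y)

print(scale_sum(5, 6))           # 16.5 (z falls back to its default)
print(scale_sum(5, 6, z=2))      # 22
print(scale_sum(y=6, x=5, z=2))  # 22 (keyword order does not matter)
# scale_sum(z=2, 5, 6) would be a SyntaxError:
# positional arguments must come before keyword arguments

counter = 0

def increment_local():
    counter = 100  # creates a new local variable; the global is untouched

def increment_global():
    global counter
    counter += 1   # rebinds the module-level name

increment_local()
increment_global()
print(counter)     # 1
```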


In [101]:
def my_function(x, y, z = 2):
    if z > 1:
        return z * (x + y), (x + y)
    else:
        return z / (x + y)

print(my_function(x=5, y=6, z=7))
print(my_function(y=6, x=5, z=7))


glb_a = [1, 2]
glb_b = [2, 3]

def my_another():
    global glb_a
    glb_a = None
    glb_b = None

my_another()
print("The difference between local and global: \nglobal(changed): {0}, \nlocal(no change): {1}".format(glb_a, glb_b))

states = ['   Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south   carolina##', 'West virginia?']

import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result


def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings_2nd(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

print('\n{0}\n{1}'.format(clean_strings(states),clean_strings_2nd(states,clean_ops)))


(77, 11)
(77, 11)
The difference between local and global: 
global(changed): None, 
local(no change): [2, 3]

['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida', 'South   Carolina', 'West Virginia']
['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida', 'South   Carolina', 'West Virginia']



- **Anonymous (Lambda) Functions**
   * Python has support for so-called anonymous or lambda functions, which are a way of writing functions consisting of a single expression, the result of which is the return value. They are defined with the **lambda** keyword
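
A lambda is interchangeable with a small named function wherever a function object is expected. The sketch below passes both forms to the same helper; the names `apply_to_list` and `short_function` are illustrative.

```python
def apply_to_list(some_list, f):
    """Apply f to every element and return the results as a list."""
    return [f(x) for x in some_list]

def short_function(x):
    return x * 2

ints = [4, 0, 1, 5, 6]
named_result = apply_to_list(ints, short_function)
lambda_result = apply_to_list(ints, lambda x: x * 2)

print(named_result)   # [8, 0, 2, 10, 12]
print(lambda_result)  # [8, 0, 2, 10, 12]
```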

In [108]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']
strings.sort(key=lambda x: len(set(x)))
print("sort a collection of strings by the number of distinct letters :{0}".format(strings))

sort a collection of strings by the number of distinct letters :['aaaa', 'foo', 'abab', 'bar', 'card']



- **Generators**
   * Having a consistent way to iterate over sequences, like objects in a list or lines in a file, is an important Python feature. This is accomplished by means of the **iterator** protocol, a generic way to make objects iterable
   * To create a generator, use the yield keyword instead of return in a function
   * Another even more concise way to make a generator is by using a generator expression. This is a generator analogue to list, dict, and set comprehensions; to create one, enclose what would otherwise be a list comprehension within **parentheses** instead of brackets
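
The iterator protocol mentioned above can be driven by hand with the built-in `iter` and `next` functions, which is roughly what a for loop does internally. In this sketch, `countdown` is a hypothetical generator function used only for illustration.

```python
some_dict = {'a': 1, 'b': 2, 'c': 3}

# A for loop first asks the object for an iterator...
dict_iterator = iter(some_dict)
# ...and then calls next() on it until StopIteration is raised
print(next(dict_iterator))  # a
print(next(dict_iterator))  # b
remaining = list(dict_iterator)
print(remaining)            # ['c']  (only the items not yet consumed)

def countdown(n):
    """Yield n, n - 1, ..., 1, one value per request."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
first = next(gen)
rest = list(gen)
print(first, rest)          # 3 [2, 1]
```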
   

In [121]:
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

gen = squares()
print(gen)
print(list(gen))

gen2 = (i ** 2 for i in range(10))
print(gen2)
print(list(gen2))
print(sum(i ** 2 for i in range(10)))


<generator object squares at 0x106c836d0>
Generating squares from 1 to 100
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
<generator object <genexpr> at 0x106c834c0>
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
285


- **itertools Module**
   * The standard library itertools module has a collection of generators for many common data algorithms. For example, groupby takes any sequence and a function, grouping consecutive elements in the sequence by return value of the function.

|Function|Description|
|----|--------|
|combinations(iterable, k)|Generates a sequence of all possible k-tuples of elements in the iterable, ignoring order and without replacement (see also the companion function combinations_with_replacement)|
|permutations(iterable, k)|Generates a sequence of all possible k-tuples of elements in the iterable, respecting order|
|groupby(iterable[, keyfunc])|Generates (key, sub-iterator) for each run of consecutive elements that share a key|
|product(*iterables, repeat=1)|Generates the Cartesian product of the input iterables as tuples, similar to a nested for loop|
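
The functions in the table can be tried on a tiny input to see how their outputs differ. A minimal sketch:

```python
import itertools

items = ['a', 'b', 'c']

combos = list(itertools.combinations(items, 2))
print(combos)
# [('a', 'b'), ('a', 'c'), ('b', 'c')]

perms = list(itertools.permutations(items, 2))
print(perms)
# [('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]

cartesian = list(itertools.product([0, 1], repeat=2))
print(cartesian)
# [(0, 0), (0, 1), (1, 0), (1, 1)]

with_repl = list(itertools.combinations_with_replacement('ab', 2))
print(with_repl)
# [('a', 'a'), ('a', 'b'), ('b', 'b')]
```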

In [123]:
import itertools
first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
for letter, name in itertools.groupby(names, first_letter):
    print(letter, list(name)) # name is a generator

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
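
Note that 'A' appears twice in the output above because groupby only groups consecutive elements. To get one group per key, a common approach is to sort by the same key function first, as in this sketch:

```python
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# Sorting by the grouping key makes equal keys adjacent
sorted_names = sorted(names, key=first_letter)
grouped = {letter: list(group)
           for letter, group in itertools.groupby(sorted_names, first_letter)}
print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```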


<div class="alert alert-block alert-warning">
<b>Errors and Exception Handling:</b> Handling Python errors or exceptions gracefully is an important part of building robust programs
</div>

* In some cases, you may not want to suppress an exception, but you want some code to be executed regardless of whether the code in the try block succeeds or not. To do this, use **finally**
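
To see which clauses of a try statement run in each situation, the sketch below records them in a list. The helper `safe_divide` is hypothetical and exists only for this illustration.

```python
def safe_divide(a, b):
    """Divide a by b and record which clauses were executed."""
    steps = []
    try:
        result = a / b
    except ZeroDivisionError:
        steps.append('except')   # runs only when the try block fails
        result = None
    else:
        steps.append('else')     # runs only when the try block succeeds
    finally:
        steps.append('finally')  # always runs
    return result, steps

print(safe_divide(6, 3))  # (2.0, ['else', 'finally'])
print(safe_divide(6, 0))  # (None, ['except', 'finally'])
```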

In [142]:
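def attempt_float(x):
    try:
        value = float(x)
    except (TypeError, ValueError):  # catch only the errors float() is expected to raise
        return x
    else:
        return value                 # runs only if the try block succeeded
    finally:
        print("Done")                # runs in every case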

print(attempt_float('xx'))
print(attempt_float(['100','22']))
print(attempt_float('100'))



Done
xx
Done
['100', '22']
Done
100.0


## 3.3 Files and the Operating System


- **Open File**
   * To open a file for reading or writing, use the built-in **open** function with either a relative or absolute file path
   * By default, the file is opened in read-only mode 'r'. We can then treat the file handle f like a list and iterate over the lines
   * The lines come out of the file with the end-of-line (EOL) markers intact, so you’ll often see code to get an EOL-free list of lines in a file
   * When you use open to create file objects, it is important to explicitly **close the file** when you are finished with it. Closing the file releases its resources back to the operating system
   * One way to make it easier to clean up open files is to use the **with statement**, which automatically closes the file f when exiting the with block
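
The points above can be tried without any sample files by writing to a scratch file first. In this sketch the temporary directory and the file name `example.txt` are assumptions made so that the example is self-contained.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not depend on existing data
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

# Lines come back with their end-of-line markers intact
with open(path, encoding='utf-8') as f:
    raw_lines = f.readlines()
print(raw_lines)   # ['first line\n', 'second line\n']

# rstrip gives an EOL-free list of lines
with open(path, encoding='utf-8') as f:
    lines = [line.rstrip() for line in f]
print(lines)       # ['first line', 'second line']
print(f.closed)    # True: the with block closed the file

# Clean up the scratch file and directory
os.remove(path)
os.rmdir(tmp_dir)
```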

In [200]:
path = './Sample/3.3.sample1.txt'
path2 = './Sample/3.3.sample2.txt'

f = open(path)

for line in f:
    print("Sample 1: {0}".format(line))
print("The read method advances the file handle’s position by the number "
      "of bytes read. tell gives you the current position: {0}"
      .format(f.tell()))

f.close()  # close the first handle before reopening the file

f = open(path)
line2 = [l for l in f]
print("Sample 2: {0}".format(line2))

f.close()


with open(path) as f: 
    line3 = [l.strip() for l in f]
print("Sample 3: {0}".format(line3))

with open(path, 'rb') as f:  # Binary mode
    line4 = [l.strip() for l in f]
print("Sample 4: {0}".format(line4))


with open(path) as src, open(path2, 'w') as dst:  # 'w' truncates any existing content
    dst.writelines(x for x in src if len(x) > 1)  # skip blank lines
    dst.write('\nAdd a new line for a test')

with open(path2, 'r', encoding='utf8') as f:
    line5 = [l.strip() for l in f]
print("Sample 5: {0}".format(line5))


Sample 1: This is a test file

Sample 1: Study from Data Anlaysis
The read method advances the file handle’s position by the number of bytes read. tell gives you the current position: 44
Sample 2: ['This is a test file\n', 'Study from Data Anlaysis']
Sample 3: ['This is a test file', 'Study from Data Anlaysis']
Sample 4: [b'This is a test file', b'Study from Data Anlaysis']
Sample 5: ['This is a test file', 'Study from Data Anlaysis', 'Add a new line for a test']
