# Python for Data Analysis

This notebook covers some Python topics that are worth knowing before using the language for data analysis

## Strings

In [1]:
# Multiline strings
s = """
    1334
    jsnbjks
"""

print(repr(s))
print('-' * 30)
print(s)

'\n    1334\n    jsnbjks\n'
------------------------------

    1334
    jsnbjks



In [2]:
# Raw strings -> add an r before the opening quotation mark; backslashes are then kept literally instead of starting escape sequences
s1 = '\ta\n1'
s2 = r'\ta\n1'

print(repr(s1))
print('-' * 30)
print(s1)

print('\n' + '=' * 30 + '\n')

print(repr(s2))
print('-' * 30)
print(s2)

'\ta\n1'
------------------------------
	a
1


'\\ta\\n1'
------------------------------
\ta\n1


In [3]:
# Useful string methods
s = "hello, world!"

In [4]:
s.index(',')

5

In [5]:
s.startswith('he'), s.endswith('ld')

(True, False)

In [6]:
', w' in s

True

In [7]:
s.replace('l', 'k').replace('wo', '?')

'hekko, ?rkd!'

In [8]:
s.split('l')

['he', '', 'o, wor', 'd!']

In [9]:
'/'.join(['a', 'b'])

'a/b'

In [10]:
print('_%s_' % 'hello')
print('%d %.3d %f %.2f %s' % (20, 20, 0.843978734, 0.343874838, False))

_hello_
20 020 0.843979 0.34 False


In [11]:
print('{a}={b} or {}={}'.format('1', 34, a='a', b=2))

a=2 or 1=34


## Data structures

### Lists

In [12]:
l = []

In [13]:
l.append(2)
l

[2]

In [14]:
l.remove(2)
l

[]

In [15]:
l = [7, 4, 3]
l.pop()

3

In [16]:
l

[7, 4]

In [17]:
l.index(4)

1

In [18]:
l = list(range(10))
l

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [19]:
l[2:7:2]

[2, 4, 6]

In [20]:
l[::-1]

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

### Tuples

Just like lists, but immutable: item assignment is forbidden

In [21]:
t = (1, 2, 3)

t

(1, 2, 3)

In [22]:
t[2] = 3

TypeError: 'tuple' object does not support item assignment

In [23]:
# Notice that a 1-element tuple needs a trailing comma (to distinguish it from an expression in regular parentheses)
(1)

1

In [24]:
(1,)

(1,)

### Zip

Given any number of iterables, zip returns an iterator of tuples, where the i-th tuple contains the i-th element of each iterable

In [25]:
for i, j in zip(range(1, 10), range(11, 20)):
    print(i, j)

1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19


If one of the iterables is shorter than the others, zip stops at the shortest

In [26]:
for x, y, z in zip(range(5), range(6), range(7)):
    print(x, y, z)

0 0 0
1 1 1
2 2 2
3 3 3
4 4 4


### Dicts

In [27]:
d = {1: 'a'}

In [28]:
d.update({'b': 3})

d

{1: 'a', 'b': 3}

In [29]:
d.pop(1)

d

{'b': 3}

In [30]:
d = dict(zip(range(10), range(9, -1, -1)))

d

{0: 9, 1: 8, 2: 7, 3: 6, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1, 9: 0}

In [31]:
print(d.keys())
print(d.values())
print(d.items())

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dict_values([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
dict_items([(0, 9), (1, 8), (2, 7), (3, 6), (4, 5), (5, 4), (6, 3), (7, 2), (8, 1), (9, 0)])


In [32]:
for k, v in d.items():
    print(k, v)

0 9
1 8
2 7
3 6
4 5
5 4
6 3
7 2
8 1
9 0


### Sets

Like lists, but unordered and with the guarantee that their elements are **unique**.

Moreover, checking whether an element is in a set is much faster than checking whether it is in a list
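
As a small illustration of the uniqueness guarantee, the sketch below builds a set from a list with repeated values (the name readings is made up for this example):

```python
# Hypothetical sample data with repeated values
readings = [3, 1, 3, 2, 1]

# Building a set drops the duplicates
unique_readings = set(readings)

print(unique_readings)
print(len(readings), len(unique_readings))
```

The set keeps one copy of each value, so its length is 3 while the list has 5 elements.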

In [33]:
s = {1, 2, 3}

In [34]:
1 in s

True

In [35]:
# Elements can't be indexed though
s[1]

TypeError: 'set' object does not support indexing

In [36]:
s = set(range(10 ** 6))

In [37]:
%time 1000 in s

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.11 µs


True

In [38]:
%time 100000 in s

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 6.91 µs


True

In [39]:
l = list(range(10 ** 6))

In [40]:
%time 1000 in l

CPU times: user 29 µs, sys: 1 µs, total: 30 µs
Wall time: 35 µs


True

In [41]:
%time 100000 in l

CPU times: user 2.09 ms, sys: 98 µs, total: 2.19 ms
Wall time: 2.19 ms


True

## Iterators

A big difference between Python 2 and Python 3 (apart from print 'hello' vs. print('hello')) <br />
is that **iterators are preferred in Python 3**, because they produce their values lazily, which saves memory and often time
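
A rough way to see the memory difference is to compare the size of a list with the size of an equivalent generator. This is only a sketch: sys.getsizeof reports the size of the container object itself (not of the integers it refers to), and the variable names are made up for this example.

```python
import sys

# The list stores references to all 100000 results at once
numbers_list = [x * 2 for x in range(10 ** 5)]

# The generator only remembers how to produce the next value
numbers_gen = (x * 2 for x in range(10 ** 5))

print(sys.getsizeof(numbers_list))
print(sys.getsizeof(numbers_gen))

# Both produce the same values, so their sums match
print(sum(numbers_list) == sum(numbers_gen))
```

The exact byte counts depend on the Python version and platform, but the generator object stays small no matter how many values it will yield.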

Example:

In [42]:
iterator = (
    x * 2
    for x in range(10)
)

iterator

<generator object <genexpr> at 0x1043ec308>

In [43]:
for x in iterator:
    print(x)

0
2
4
6
8
10
12
14
16
18


In [44]:
iterator = (
    x
    for x in range(10)
    if (x % 2) == 1
)

for x in iterator:
    print(x)

1
3
5
7
9


Caution: once an iterator has been consumed, it can't be used again; iterating over it a second time yields nothing

In [45]:
for x in iterator:
    print(x)
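
The cell above prints nothing because the generator was already consumed by the previous loop. A minimal sketch of the same effect, using the made-up name evens:

```python
evens = (x for x in range(6) if x % 2 == 0)

first_pass = list(evens)    # consumes the generator
second_pass = list(evens)   # nothing is left to yield

print(first_pass)
print(second_pass)
```

If the values are needed more than once, one option is to store them in a list first or to create the generator again.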

Iterators can also be defined with functions that use yield (generator functions)

In [46]:
def iterator_func():
    for x in range(10):
        if (x % 2) == 1:
            yield x
            
# In this case, the result of calling the function is the iterator itself
iterator = iterator_func()

iterator

<generator object iterator_func at 0x1043a1d00>

In [47]:
for x in iterator:
    print(x)

1
3
5
7
9


If the constructor of a data structure (list, tuple, set, dict) receives an iterable, <br />
it returns the contents of the iterable in that data structure

In [48]:
list(
    c
    for c in 'hello'
    if c == 'l'
)

['l', 'l']

In [49]:
tuple(x for x in range(10))

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

In [50]:
set(range(10))

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

In [51]:
# Caution: dict is special. It assumes that the iterator elements are 2-tuples (key-value)
dict(zip(range(10), range(11, 21)))

{0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 17, 7: 18, 8: 19, 9: 20}

In [52]:
# Notice that zip returns an iterator
print(zip('abc', 'def'))
print(list(zip('abc', 'def')))

<zip object at 0x1043e9e88>
[('a', 'd'), ('b', 'e'), ('c', 'f')]


## Iterables and iterators

All Python data structures that contain elements can create iterators that yield their elements.

That's why they are called iterables

To get the iterator of an object:

In [53]:
l = [1, 2, 3]

In [54]:
iterator = iter(l)

for x in iterator:
    print(x)

1
2
3


Or use the collection in a for loop directly (this calls iter() behind the scenes)

In [55]:
for x in l:
    print(x)

1
2
3


Types that are iterables:

* list
* tuple
* dict
* set
* str
* iterators
* any instance whose class implements the \_\_iter\_\_ method
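
The last item can be sketched with a small class. Countdown is a hypothetical name used only for illustration; the point is that defining \_\_iter\_\_ is enough for for loops and constructors such as list() to accept the object.

```python
class Countdown:
    """Hypothetical iterable that counts down from `start` to 1."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Using yield makes __iter__ return a fresh generator on every call
        n = self.start
        while n > 0:
            yield n
            n -= 1


countdown = Countdown(3)

for value in countdown:
    print(value)

# Unlike a plain iterator, the object can be iterated again
print(list(countdown))
```

Because \_\_iter\_\_ builds a new iterator each time it is called, the same Countdown instance can be looped over more than once.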

## Comprehensions

It's very common (and useful) to create data structures with comprehensions

A comprehension is written like a generator expression, but enclosed in the brackets of the data structure to build:

In [56]:
l = [
    x
    for x in range(10)
    if (x % 2) == 1
]

l

[1, 3, 5, 7, 9]

Notice that the generator expression must not be wrapped in its own parentheses inside the brackets of a comprehension
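
To see why the parentheses matter, the sketch below compares a list comprehension with a list literal that merely contains a parenthesized generator (the names squares and wrapped are made up for this example):

```python
# A list comprehension: the results are stored in the list
squares = [x * x for x in range(4)]

# Extra parentheses create a generator, and the list holds only that one object
wrapped = [(x * x for x in range(4))]

print(squares)
print(len(wrapped))
print(type(wrapped[0]).__name__)
```

The second list has a single element, the generator object itself, rather than the four computed values.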

In [57]:
d = {
    k: v
    for k, v in zip(range(10), range(11, 21))
}

d

{0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 17, 7: 18, 8: 19, 9: 20}

In [58]:
s = { i for i in range(10) }

s

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

## Lambdas

A function is defined with:

In [59]:
def f(x, y, z):
    return x * y + z

f(1, 2, 3)

5

Functions can also be created (with limitations) in a single line: those are called lambdas

In [60]:
f = lambda x, y, z: x * y + z

f(1, 2, 3)

5

Notice:
* The parameters are not enclosed in parentheses
* return is not needed; the result of the expression is the result of the function
* The lambda must be assigned to a variable (or passed directly as an argument) to be used later

Why do we use lambdas? 

To pass them as arguments to other functions without needing to define a named function first

In [61]:
# map applies a function to every element in an iterable and returns an iterator with the results
m = map(lambda x: x ** 2, range(10))

m

<map at 0x10c27c9b0>

In [62]:
list(m)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [63]:
# sorted returns a list with the elements of an iterable in order
# If a key function is specified, the elements are ordered by the values it returns

print(sorted(range(10), key=lambda x: (x % 2) * 100 + x / 2))

print(sorted(zip(range(10), range(9, -1, -1)), key=lambda pair: pair[1]))

[0, 2, 4, 6, 8, 1, 3, 5, 7, 9]
[(9, 0), (8, 1), (7, 2), (6, 3), (5, 4), (4, 5), (3, 6), (2, 7), (1, 8), (0, 9)]


## \*args and \*\*kwargs

A function can receive an arbitrary number of arguments

In [64]:
def sum_func(*args):
    res = 0
    
    for x in args:
        res += x
        
    return res

print(sum_func(1, 2, 3, 4))
print(sum_func(1, 2, 3, 4, 5))
print(sum_func())

10
15
0


Arguments that are passed without a name and collected by the \*args parameter are called positional arguments

In [65]:
def print_dict(**kwargs):
    print(kwargs)
    
    for k, v in kwargs.items():
        print(k, v)
        
print_dict(a=2, b=3)

print('-' * 20)

print_dict()

{'a': 2, 'b': 3}
a 2
b 3
--------------------
{}


Arguments that are passed with a name and collected by the \*\*kwargs parameter are called keyword arguments
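
Both kinds of arguments can be combined with regular parameters in the same signature. The sketch below uses a hypothetical function named describe to show where each argument ends up:

```python
def describe(first, *args, **kwargs):
    # `first` takes the first positional argument,
    # `args` collects the remaining positional ones as a tuple,
    # `kwargs` collects the name=value arguments as a dict
    return first, args, kwargs


result = describe(1, 2, 3, a=4, b=5)

print(result)
```

Here 1 is bound to first, the tuple (2, 3) goes to args, and the dict {'a': 4, 'b': 5} goes to kwargs.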

The \* and \*\* special marks can be used when calling functions too:

In [66]:
def f(*args, **kwargs):
    print(args)
    print(kwargs)
    
f(*range(10))

print('-' * 20)

f(**dict(zip('abc', range(3))))

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
{}
--------------------
()
{'c': 2, 'a': 0, 'b': 1}


And they can be used in lambdas too:

In [67]:
f = lambda *args, **kwargs: print(args, kwargs)

f(1, 2, 3, a=2)

(1, 2, 3) {'a': 2}


## Implicit type casting

Most objects in Python have some special methods defined <br />
that let them be converted into other types of objects:

**\_\_int\_\_** <br />
**\_\_float\_\_** <br />
**\_\_str\_\_** <br />
**\_\_bool\_\_**
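
As a sketch of how these methods are used, the hypothetical class Basket below defines three of them; bool(), int() and str() then call the matching method behind the scenes.

```python
class Basket:
    """Hypothetical container used only to illustrate the special methods."""

    def __init__(self, items):
        self.items = list(items)

    def __bool__(self):
        # Called by bool(basket) and by `if basket:`
        return len(self.items) > 0

    def __int__(self):
        # Called by int(basket)
        return len(self.items)

    def __str__(self):
        # Called by str(basket) and by print(basket)
        return 'Basket with %d item(s)' % len(self.items)


basket = Basket(['apple', 'pear'])

print(bool(basket), int(basket), str(basket))
print(bool(Basket([])))
```

The first line printed is True 2 Basket with 2 item(s); the empty basket converts to False.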

Basic object types have some default behaviours:

* To bool: if it is a container (or a string), True when it holds at least 1 element, else False. If it is a number, x != 0
* Bool to numeric: False -> 0, True -> 1
* None to bool: False
* None to numeric: not supported; int(None) and float(None) raise a TypeError
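
These defaults can be checked by calling the conversion functions directly. The sketch below uses only built-ins; note that None converts to False but has no numeric conversion, so int(None) raises a TypeError (the flag name is made up for this example).

```python
# Containers: empty -> False, at least one element -> True
print(bool([]), bool([0]), bool(''), bool('a'))

# Numbers: zero -> False, anything else -> True
print(bool(0), bool(0.5))

# Booleans behave like 0 and 1 in arithmetic
print(int(False), int(True), True + True)

# None is falsy...
print(bool(None))

# ...but it cannot be converted to a number
none_to_int_failed = False
try:
    int(None)
except TypeError:
    none_to_int_failed = True

print(none_to_int_failed)
```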

Some expressions perform an implicit type casting:

* for: transforms an iterable into an iterator and assigns each yielded element to a variable
* if/elif: transforms its condition into a boolean and checks if it's True

Thanks to the implicit type casting in if, things like the following are possible:

In [68]:
if []:
    print('not void list')
else:
    print('void list')

void list


In [69]:
for i in range(10):
    if i: # only 0 will cast to False, so that's the one omitted
        print(i)

1
2
3
4
5
6
7
8
9
