# Python Language Basics

## Language Semantics

### INDENTATION, NOT BRACES

Python uses whitespace (tabs or spaces) to structure code instead of using braces as in many other languages like R, C++, Java, and Perl. Consider a for loop from a sorting algorithm:

    for x in array:
        if x < pivot:
            less.append(x)
        else:
            greater.append(x)

As you can see by now, Python statements also do not need to be terminated by semicolons. Semicolons can be used, however, to separate multiple statements on a single line:

    a = 5; b = 6; c = 7

### EVERYTHING IS AN OBJECT

An important characteristic of the Python language is the consistency of its object model. Every number, string, data structure, function, class, module, and so on exists in the Python interpreter in its own “box,” which is referred to as a Python object. Each object has an associated type (e.g., string or function) and internal data. In practice this makes the language very flexible, as even functions can be treated like any other object.
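
As a small illustration of functions being ordinary objects, the sketch below binds a function to a second name, stores functions in a list, and passes one as an argument. The names `shout`, `apply_twice`, and `transforms` are made up for this example.

```python
def shout(text):
    """Return text in upper case followed by an exclamation mark."""
    return text.upper() + '!'

def apply_twice(func, value):
    """Call func on value, then call it again on the result."""
    return func(func(value))

# A function can be bound to another name, just like a list or a number
alias = shout
same_object = alias is shout

# Functions can be stored in data structures and called later
transforms = [str.strip, str.title, shout]
cleaned = ' hello world '
for transform in transforms:
    cleaned = transform(cleaned)

# Functions can be passed as arguments to other functions
doubled = apply_twice(shout, 'hi')
```

Here `cleaned` ends up as `'HELLO WORLD!'` and `doubled` as `'HI!!'`.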

### COMMENTS

Any text preceded by the hash mark (pound sign) # is ignored by the Python interpreter. This is often used to add comments to code. At times you may also want to exclude certain blocks of code without deleting them. An easy solution is to comment out the code:

    results = []
    for line in file_handle:
        # keep the empty lines for now
        # if len(line) == 0:
        #   continue
        results.append(line.replace('foo', 'bar'))

### FUNCTION AND OBJECT METHOD CALLS

You call functions using parentheses and passing zero or more arguments, optionally assigning the returned value to a variable:

    result = f(x, y, z)
    g()

Almost every object in Python has attached functions, known as methods, that have access to the object’s internal contents. You can call them using the following syntax:

    obj.some_method(x, y, z)
    
Functions can take both positional and keyword arguments:

    result = f(a, b, c, d=5, e='foo')
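
To make the distinction concrete, here is a minimal sketch using a hypothetical function `scale` (not part of any library) whose last two parameters have default values and can therefore be passed by keyword or omitted:

```python
def scale(value, factor=2, offset=0):
    """Multiply value by factor, then add offset."""
    return value * factor + offset

r1 = scale(10)                      # defaults used for factor and offset
r2 = scale(10, 3)                   # factor passed positionally
r3 = scale(10, offset=1)            # keyword lets us skip over factor
r4 = scale(10, offset=1, factor=3)  # keywords may appear in any order
```

Positional arguments must come before keyword arguments in a call; the keyword arguments themselves can be given in any order.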

### VARIABLES AND ARGUMENT PASSING

When assigning a variable (or name) in Python, you are creating a reference to the object on the right-hand side of the equals sign. In practical terms, consider a list of integers:

    In [8]: a = [1, 2, 3]

Suppose we assign a to a new variable b:

    In [9]: b = a

In some languages, this assignment would cause the data [1, 2, 3] to be copied. In Python, a and b actually now refer to the same object, the original list [1, 2, 3] (see Figure 2-7 for a mockup). You can prove this to yourself by appending an element to a and then examining b:

    In [10]: a.append(4)

    In [11]: b
    Out[11]: [1, 2, 3, 4]
    
When you pass objects as arguments to a function, new local variables are created referencing the original objects without any copying. If you bind a new object to a variable inside a function, that change will not be reflected in the parent scope. Because the local variable refers to the original object, however, it is possible to alter the internals of a mutable argument. Suppose we had the following function:

    def append_element(some_list, element):
        some_list.append(element)

Then we have:

    In [27]: data = [1, 2, 3]

    In [28]: append_element(data, 4)

    In [29]: data
    Out[29]: [1, 2, 3, 4]
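
The difference between rebinding a name and mutating an object can be seen side by side in the following sketch. The helper names `rebind` and `mutate` are hypothetical and exist only for this illustration.

```python
def rebind(some_list):
    # Binds the local name to a brand-new list; the caller's list is untouched
    some_list = [0, 0, 0]
    return some_list

def mutate(some_list):
    # Modifies the very object that the caller's name also refers to
    some_list.append(99)

data = [1, 2, 3]

rebind(data)
after_rebind = list(data)   # snapshot of data: still [1, 2, 3]

mutate(data)
after_mutate = list(data)   # snapshot of data: now [1, 2, 3, 99]
```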



### DYNAMIC REFERENCES, STRONG TYPES

In contrast with many compiled languages, such as Java and C++, object references in Python have no type associated with them. There is no problem with the following:

    In [12]: a = 5

    In [13]: type(a)
    Out[13]: int

    In [14]: a = 'foo'

    In [15]: type(a)
    Out[15]: str
    
Variables are names for objects within a particular namespace; the type information is stored in the object itself. Some observers might hastily conclude that Python is not a “typed language.” This is not true; consider this example:

    In [16]: '5' + 5
    -------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-16-f9dbf5f0b234> in <module>()
    ----> 1 '5' + 5
    TypeError: must be str, not int

In some languages, such as Visual Basic, the string '5' might get implicitly converted (or cast) to an integer, thus yielding 10. Yet in other languages, such as JavaScript, the integer 5 might be cast to a string, yielding the concatenated string '55'. In this regard Python is considered a strongly typed language, which means that every object has a specific type (or class), and implicit conversions will occur only in certain obvious circumstances, such as the following:

    In [17]: a = 4.5

    In [18]: b = 2

    # String formatting, to be visited later
    In [19]: print('a is {0}, b is {1}'.format(type(a), type(b)))
    a is <class 'float'>, b is <class 'int'>

    In [20]: a / b
    Out[20]: 2.25
    
Knowing the type of an object is important, and it’s useful to be able to write functions that can handle many different kinds of input. You can check that an object is an instance of a particular type using the `isinstance` function:

    In [21]: a = 5

    In [22]: isinstance(a, int)
    Out[22]: True

`isinstance` can accept a tuple of types if you want to check that an object’s type is among those present in the tuple:

    In [23]: a = 5; b = 4.5

    In [24]: isinstance(a, (int, float))
    Out[24]: True

    In [25]: isinstance(b, (int, float))
    Out[25]: True


### ATTRIBUTES AND METHODS

Objects in Python typically have both attributes (other Python objects stored “inside” the object) and methods (functions associated with an object that can have access to the object’s internal data). Both of them are accessed via the syntax obj.attribute_name:

    In [1]: a = 'foo'

    In [2]: a.<Press Tab>
    a.capitalize  a.index       a.join        a.rjust       a.swapcase
    a.center      a.isalnum     a.ljust       a.rpartition  a.title
    a.count       a.isalpha     a.lower       a.rsplit      a.translate
    a.encode      a.isdigit     a.lstrip      a.rstrip      a.upper
    a.endswith    a.islower     a.partition   a.split       a.zfill
    a.expandtabs  a.isspace     a.replace     a.splitlines
    a.find        a.istitle     a.rfind       a.startswith
    a.format      a.isupper     a.rindex      a.strip

Attributes and methods can also be accessed by name via the getattr function:

    In [27]: getattr(a, 'split')
    Out[27]: <function str.split>

In other languages, accessing objects by name is often referred to as “reflection.” While we will not make extensive use of `getattr` and the related functions `hasattr` and `setattr` in this book, they can be used very effectively to write generic, reusable code.
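
As a rough sketch of the kind of generic code this enables, the hypothetical helper `call_if_available` below looks up a method by name and calls it only if the object actually has it. The class `Record` is likewise invented for the example.

```python
def call_if_available(obj, method_name, default=None):
    """Call obj.<method_name>() if it exists and is callable, else return default."""
    method = getattr(obj, method_name, None)  # third argument avoids AttributeError
    if callable(method):
        return method()
    return default

upper_text = call_if_available('foo', 'upper')          # str has an upper method
missing = call_if_available(5, 'upper', default='n/a')  # int does not

class Record:
    pass

rec = Record()
setattr(rec, 'label', 'example')   # same effect as rec.label = 'example'
has_label = hasattr(rec, 'label')
label = getattr(rec, 'label')
```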

### DUCK TYPING

Often you may not care about the type of an object but rather only whether it has certain methods or behavior. This is sometimes called “duck typing,” after the saying “If it walks like a duck and quacks like a duck, then it’s a duck.” For example, you can verify that an object is iterable if it implements the iterator protocol. For many objects, this means it has a `__iter__` “magic method”, though an alternative and better way to check is to try using the iter function:

    def isiterable(obj):
        try:
            iter(obj)
            return True
        except TypeError: # not iterable
            return False

This function would return True for strings as well as most Python collection types:

    In [29]: isiterable('a string')
    Out[29]: True

    In [30]: isiterable([1, 2, 3])
    Out[30]: True

    In [31]: isiterable(5)
    Out[31]: False

I use this functionality all the time when writing functions that can accept multiple kinds of input. A common case is writing a function that can accept any kind of sequence (list, tuple, ndarray) or even an iterator. You can first check if the object is a list (or a NumPy array) and, if it is not, convert it to be one:

    if not isinstance(x, list) and isiterable(x):
        x = list(x)
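
Put together, the pattern might look like the following sketch. The function `first_and_last` is a hypothetical example chosen because indexing needs a real sequence, which a one-shot iterator is not.

```python
def isiterable(obj):
    try:
        iter(obj)
        return True
    except TypeError:  # not iterable
        return False

def first_and_last(x):
    """Return (first, last) element of any non-empty iterable."""
    if not isinstance(x, list) and isiterable(x):
        x = list(x)   # materialize so that indexing works
    return x[0], x[-1]

from_list = first_and_last([1, 2, 3])
from_tuple = first_and_last((1, 2, 3))
from_iterator = first_and_last(iter([1, 2, 3]))
from_range = first_and_last(range(5))
```

Without the conversion step, the call with `iter([1, 2, 3])` would fail, because an iterator does not support indexing.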

### IMPORTS

In Python a module is simply a file with the .py extension containing Python code. Suppose that we had the following module:

    # some_module.py
    PI = 3.14159

    def f(x):
        return x + 2

    def g(a, b):
        return a + b
        
If we wanted to access the variables and functions defined in some_module.py from another file in the same directory, we could do:

    import some_module
    result = some_module.f(5)
    pi = some_module.PI

Or equivalently:

    from some_module import f, g, PI
    result = g(5, PI)

By using the `as` keyword you can give imports different variable names:

    import some_module as sm
    from some_module import PI as pi, g as gf

    r1 = sm.f(pi)
    r2 = gf(6, pi)


### BINARY OPERATORS AND COMPARISONS

Most of the binary math operations and comparisons are as you might expect:

    In [32]: 5 - 7
    Out[32]: -2

    In [33]: 12 + 21.5
    Out[33]: 33.5

    In [34]: 5 <= 2
    Out[34]: False

To check if two references refer to the same object, use the `is` keyword. `is not` is also perfectly valid if you want to check that two objects are not the same:

    In [35]: a = [1, 2, 3]

    In [36]: b = a

    In [37]: c = list(a)

    In [38]: a is b
    Out[38]: True

    In [39]: a is not c
    Out[39]: True

Since `list` always creates a new Python list (i.e., a copy), we can be sure that `c` is distinct from `a`. Comparing with `is` is not the same as using the `==` operator, because in this case we have:

    In [40]: a == c
    Out[40]: True

A very common use of `is` and `is not` is to check if a variable `is None`, since there is only one instance of `None`:

    In [41]: a = None

    In [42]: a is None
    Out[42]: True

### MUTABLE AND IMMUTABLE OBJECTS

Most objects in Python, such as lists, dicts, NumPy arrays, and most user-defined types (classes), are mutable. This means that the object or values that they contain can be modified:

    In [43]: a_list = ['foo', 2, [4, 5]]

    In [44]: a_list[2] = (3, 4)

    In [45]: a_list
    Out[45]: ['foo', 2, (3, 4)]

Others, like strings and tuples, are immutable:

    In [46]: a_tuple = (3, 5, (4, 5))

    In [47]: a_tuple[1] = 'four'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-47-b7966a9ae0f1> in <module>()
    ----> 1 a_tuple[1] = 'four'
    TypeError: 'tuple' object does not support item assignment

Remember that just because you can mutate an object does not mean that you always should. Such actions are known as side effects. For example, when writing a function, any side effects should be explicitly communicated to the user in the function’s documentation or comments. If possible, I recommend trying to avoid side effects and favor immutability, even though there may be mutable objects involved.
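
One way to act on this advice is sketched below with two hypothetical helpers: `append_in_place` has a side effect on its argument, while `with_appended` returns a new list and leaves its input alone.

```python
def append_in_place(values, item):
    """Side effect: modifies the caller's list. Returns None."""
    values.append(item)

def with_appended(values, item):
    """No side effect: returns a new list and leaves values unchanged."""
    return values + [item]

original = [1, 2, 3]

extended = with_appended(original, 4)
unchanged = list(original)      # snapshot: original is still [1, 2, 3]

append_in_place(original, 4)    # original itself is now [1, 2, 3, 4]
```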

## Scalar Types

Python, along with its standard library, has a small set of built-in types for handling numerical data, strings, boolean (True or False) values, and dates and times. These “single value” types are sometimes called scalar types and we refer to them in this book as scalars. See Table 2-4 for a list of the main scalar types. Date and time handling will be discussed separately, as these are provided by the datetime module in the standard library.

### NUMERIC TYPES

The primary Python types for numbers are int and float. An int can store arbitrarily large numbers:

    In [48]: ival = 17239871

    In [49]: ival ** 6
    Out[49]: 26254519291092456596965462913230729701102721

Floating-point numbers are represented with the Python float type. Under the hood each one is a double-precision (64-bit) value. They can also be expressed with scientific notation:

    In [50]: fval = 7.243

    In [51]: fval2 = 6.78e-5

Dividing two integers with / always yields a floating-point number, even when the result is a whole number:

    In [52]: 3 / 2
    Out[52]: 1.5

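To get integer division, which drops the fractional part if the result is not a whole number, use the floor division operator //. Unlike C-style integer division, // rounds toward negative infinity, so the two differ when the result is negative. For positive operands they agree: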

    In [53]: 3 // 2
    Out[53]: 1
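
The behavior of / and // with whole-number and negative results is easy to check directly. The following sketch uses `int()` on a float quotient to mimic truncation toward zero, which is how C integer division rounds:

```python
whole = 4 / 2             # 2.0: true division gives a float even here

pos_floor = 7 // 2        # 3
neg_floor = -7 // 2       # -4: floor division rounds toward negative infinity
neg_trunc = int(-7 / 2)   # -3: truncation rounds toward zero

# Floor division and modulo are defined consistently with each other:
# n == (n // d) * d + n % d
n, d = -7, 2
identity_holds = n == (n // d) * d + n % d
```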


### STRINGS

Many people use Python for its powerful and flexible built-in string processing capabilities. You can write string literals using either single quotes ' or double quotes ":

    a = 'one way of writing a string'
    b = "another way"

For multiline strings with line breaks, you can use triple quotes, either ''' or """:

    c = """
    This is a longer string that
    spans multiple lines
    """

It may surprise you that this string c actually contains four lines of text; the line breaks after the opening """ and after each line of text are included in the string. We can count the newline characters with the count method on c:

    In [55]: c.count('\n')
    Out[55]: 3

Python strings are immutable; you cannot modify a string:

    In [56]: a = 'this is a string'

    In [57]: a[10] = 'f'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-57-5ca625d1e504> in <module>()
    ----> 1 a[10] = 'f'
    TypeError: 'str' object does not support item assignment

    In [58]: b = a.replace('string', 'longer string')

    In [59]: b
    Out[59]: 'this is a longer string'

After this operation, the variable a is unmodified:

    In [60]: a
    Out[60]: 'this is a string'

Many Python objects can be converted to a string using the str function:

    In [61]: a = 5.6

    In [62]: s = str(a)

    In [63]: print(s)
    5.6

Strings are a sequence of Unicode characters and therefore can be treated like other sequences, such as lists and tuples (which we will explore in more detail in the next chapter):

    In [64]: s = 'python'

    In [65]: list(s)
    Out[65]: ['p', 'y', 't', 'h', 'o', 'n']

    In [66]: s[:3]
    Out[66]: 'pyt'

The syntax `s[:3]` is called slicing and is implemented for many kinds of Python sequences. This will be explained in more detail later on, as it is used extensively in this book.

The backslash character \ is an escape character, meaning that it is used to specify special characters like newline \n or Unicode characters. To write a string literal with backslashes, you need to escape them:

    In [67]: s = '12\\34'

    In [68]: print(s)
    12\34

If you have a string with a lot of backslashes and no special characters, you might find this a bit annoying. Fortunately you can preface the leading quote of the string with r, which means that the characters should be interpreted as is:

    In [69]: s = r'this\has\no\special\characters'

    In [70]: s
    Out[70]: 'this\\has\\no\\special\\characters'

The r stands for raw.
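
A quick way to convince yourself that a raw literal and its escaped equivalent are the same string is to compare them. The path-like text below is an arbitrary example:

```python
escaped = 'C:\\data\\new'   # each backslash written twice
raw = r'C:\data\new'        # raw literal: backslashes taken as is

same_text = escaped == raw
length = len(raw)           # counts each backslash once

# Without the r prefix, backslash-n is read as one newline character
with_newline = 'a\nb'
raw_version = r'a\nb'
lengths = (len(with_newline), len(raw_version))
```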

Adding two strings together concatenates them and produces a new string:

    In [71]: a = 'this is the first half '

    In [72]: b = 'and this is the second half'

    In [73]: a + b
    Out[73]: 'this is the first half and this is the second half'

String templating or formatting is another important topic. The number of ways to do so has expanded with the advent of Python 3, and here I will briefly describe the mechanics of one of the main interfaces. String objects have a format method that can be used to substitute formatted arguments into the string, producing a new string:

    In [74]: template = '{0:.2f} {1:s} are worth US${2:d}'

In this string:

- `{0:.2f}` means to format the first argument as a floating-point number with two decimal places.
- `{1:s}` means to format the second argument as a string.
- `{2:d}` means to format the third argument as an exact integer.

To substitute arguments for these format parameters, we pass a sequence of arguments to the format method:

    In [75]: template.format(4.5560, 'Argentine Pesos', 1)
    Out[75]: '4.56 Argentine Pesos are worth US$1'

String formatting is a deep topic; there are multiple methods and numerous options and tweaks available to control how values are formatted in the resulting string. To learn more, I recommend consulting the official Python documentation.
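
One of the other interfaces is the formatted string literal (or f-string), available in Python 3.6 and later, which accepts the same format specifications but takes its values from variables in scope. The sketch below reproduces the result above both ways; the variable names are chosen only for this example.

```python
amount = 4.5560
currency = 'Argentine Pesos'
usd = 1

# The format method with positional placeholders
via_format = '{0:.2f} {1:s} are worth US${2:d}'.format(amount, currency, usd)

# The same format specifications inside an f-string
via_fstring = f'{amount:.2f} {currency:s} are worth US${usd:d}'
```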

I discuss general string processing as it relates to data analysis in more detail in Chapter 8.

### BYTES AND UNICODE

In modern Python (i.e., Python 3.0 and up), Unicode has become the first-class string type to enable more consistent handling of ASCII and non-ASCII text. In older versions of Python, strings were all bytes without any explicit Unicode encoding. You could convert to Unicode assuming you knew the character encoding. Let’s look at an example:

    In [76]: val = "español"

    In [77]: val
    Out[77]: 'español'

We can convert this Unicode string to its UTF-8 bytes representation using the encode method:

    In [78]: val_utf8 = val.encode('utf-8')

    In [79]: val_utf8
    Out[79]: b'espa\xc3\xb1ol'

    In [80]: type(val_utf8)
    Out[80]: bytes

Assuming you know the Unicode encoding of a bytes object, you can go back using the decode method:

    In [81]: val_utf8.decode('utf-8')
    Out[81]: 'español'

While it’s become preferred to use UTF-8 for any encoding, for historical reasons you may encounter data in any number of different encodings:

    In [82]: val.encode('latin1')
    Out[82]: b'espa\xf1ol'

    In [83]: val.encode('utf-16')
    Out[83]: b'\xff\xfee\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'

    In [84]: val.encode('utf-16le')
    Out[84]: b'e\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
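
Because the same text produces different bytes under different encodings, decoding only works reliably when you use the encoding that produced the bytes. The sketch below shows the two typical failure modes of a mismatch: an outright error and silently garbled text.

```python
val = 'español'
latin1_bytes = val.encode('latin1')

# Decoding with the encoding that produced the bytes round-trips cleanly
round_trip = latin1_bytes.decode('latin1')

# Decoding with a different encoding may fail outright...
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# ...or succeed while producing the wrong characters
garbled = val.encode('utf-8').decode('latin1')
```

In the last case no exception is raised, since every byte is a valid Latin-1 character, but the two UTF-8 bytes for ñ come out as two separate characters.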

It is most common to encounter bytes objects in the context of working with files, where implicitly decoding all data to Unicode strings may not be desired.

Though you may seldom need to do so, you can define your own byte literals by prefixing a string with b:

    In [85]: bytes_val = b'this is bytes'

    In [86]: bytes_val
    Out[86]: b'this is bytes'

    In [87]: decoded = bytes_val.decode('utf8')

    In [88]: decoded  # this is str (Unicode) now
    Out[88]: 'this is bytes'

### BOOLEANS

The two boolean values in Python are written as True and False. Comparisons and other conditional expressions evaluate to either True or False. Boolean values are combined with the `and` and `or` keywords:

    In [89]: True and True
    Out[89]: True

    In [90]: False or True
    Out[90]: True


### TYPE CASTING

The str, bool, int, and float types are also functions that can be used to cast values to those types:

    In [91]: s = '3.14159'

    In [92]: fval = float(s)

    In [93]: type(fval)
    Out[93]: float

    In [94]: int(fval)
    Out[94]: 3

    In [95]: bool(fval)
    Out[95]: True

    In [96]: bool(0)
    Out[96]: False


### NONE

`None` is the Python null value type. If a function does not explicitly return a value, it implicitly returns `None`:

    In [97]: a = None

    In [98]: a is None
    Out[98]: True

    In [99]: b = 5

    In [100]: b is not None
    Out[100]: True
    
`None` is also a common default value for function arguments:

    def add_and_maybe_multiply(a, b, c=None):
        result = a + b

        if c is not None:
            result = result * c

        return result
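
Calling the function with and without the third argument shows why the test is written as `c is not None` rather than simply `if c:`. With the latter, a legitimate multiplier of 0 would be treated as if it were missing.

```python
def add_and_maybe_multiply(a, b, c=None):
    result = a + b

    if c is not None:
        result = result * c

    return result

no_multiplier = add_and_maybe_multiply(2, 3)        # c defaults to None
with_multiplier = add_and_maybe_multiply(2, 3, 4)   # (2 + 3) * 4
zero_multiplier = add_and_maybe_multiply(2, 3, 0)   # 0 is a real value, not "missing"
```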
        
While a technical point, it’s worth bearing in mind that None is not only a reserved keyword but also a unique instance of NoneType:

    In [101]: type(None)
    Out[101]: NoneType


### DATES AND TIMES

The built-in Python datetime module provides datetime, date, and time types. The datetime type, as you may imagine, combines the information stored in date and time and is the most commonly used:

    In [102]: from datetime import datetime, date, time

    In [103]: dt = datetime(2011, 10, 29, 20, 30, 21)

    In [104]: dt.day
    Out[104]: 29

    In [105]: dt.minute
    Out[105]: 30

Given a datetime instance, you can extract the equivalent date and time objects by calling methods on the datetime of the same name:

    In [106]: dt.date()
    Out[106]: datetime.date(2011, 10, 29)

    In [107]: dt.time()
    Out[107]: datetime.time(20, 30, 21)

The strftime method formats a datetime as a string:

    In [108]: dt.strftime('%m/%d/%Y %H:%M')
    Out[108]: '10/29/2011 20:30'

Strings can be converted (parsed) into datetime objects with the strptime function:

    In [109]: datetime.strptime('20091031', '%Y%m%d')
    Out[109]: datetime.datetime(2009, 10, 31, 0, 0)

See Table 2-5 for a full list of format specifications.
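
Since strftime and strptime share the same format codes, a string produced with a given format can be parsed back with that same format. A minimal sketch, which also shows that a string not matching the format raises `ValueError`:

```python
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'
dt = datetime(2011, 10, 29, 20, 30, 21)

text = dt.strftime(fmt)                 # datetime -> string
parsed = datetime.strptime(text, fmt)   # string -> datetime
round_trips = parsed == dt

# A string that does not match the format cannot be parsed
try:
    datetime.strptime('10/29/2011', '%Y%m%d')
    parse_failed = False
except ValueError:
    parse_failed = True
```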

When you are aggregating or otherwise grouping time series data, it will occasionally be useful to replace time fields of a series of datetimes—for example, replacing the minute and second fields with zero:

    In [110]: dt.replace(minute=0, second=0)
    Out[110]: datetime.datetime(2011, 10, 29, 20, 0)

Since datetime.datetime is an immutable type, methods like these always produce new objects.

The difference of two datetime objects produces a datetime.timedelta type:

    In [111]: dt2 = datetime(2011, 11, 15, 22, 30)

    In [112]: delta = dt2 - dt

    In [113]: delta
    Out[113]: datetime.timedelta(17, 7179)

    In [114]: type(delta)
    Out[114]: datetime.timedelta

The output `timedelta(17, 7179)` indicates that the timedelta encodes an offset of 17 days and 7,179 seconds.

Adding a timedelta to a datetime produces a new shifted datetime:

    In [115]: dt
    Out[115]: datetime.datetime(2011, 10, 29, 20, 30, 21)

    In [116]: dt + delta
    Out[116]: datetime.datetime(2011, 11, 15, 22, 30)
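
A timedelta can also be built directly and inspected through its `days` and `seconds` attributes or its `total_seconds` method. The following sketch reuses the two datetimes from above:

```python
from datetime import datetime, timedelta

dt = datetime(2011, 10, 29, 20, 30, 21)
dt2 = datetime(2011, 11, 15, 22, 30)
delta = dt2 - dt

days = delta.days                # whole days in the offset
seconds = delta.seconds          # leftover seconds within the last day
total = delta.total_seconds()    # the entire offset expressed in seconds

# Constructing the same offset explicitly
same_delta = timedelta(days=17, seconds=7179) == delta

# Shifting a datetime by a fixed amount
one_week_later = dt + timedelta(weeks=1)
```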


## Control Flow

Python has several built-in keywords for conditional logic, loops, and other standard control flow concepts found in other programming languages.

### IF, ELIF, AND ELSE

The if statement is one of the most well-known control flow statement types. It checks a condition and, if the condition is True, executes the code in the block that follows:

    if x < 0:
        print("It's negative")

An if statement can be optionally followed by one or more elif blocks and a catch-all else block if all of the conditions are False:

    if x < 0:
        print("It's negative")
    elif x == 0:
        print("Equal to zero")
    elif 0 < x < 5:
        print("Positive but smaller than 5")
    else:
        print("Positive and larger than or equal to 5")

If any of the conditions is True, no further elif or else blocks will be reached. With a compound condition using `and` or `or`, conditions are evaluated left to right and will short-circuit:

    In [117]: a = 5; b = 7

    In [118]: c = 8; d = 4

    In [119]: if a < b or c > d:
       .....:     print('Made it')
    Made it

In this example, the comparison c > d never gets evaluated because the first comparison was True.
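
Short-circuiting can be made visible by recording which operands actually run. In this sketch, `check` is a hypothetical helper that logs its label before returning its result:

```python
calls = []

def check(label, result):
    """Record that this check ran, then return its result."""
    calls.append(label)
    return result

# or stops at the first true operand
or_result = check('first', True) or check('second', True)
calls_after_or = list(calls)

# and stops at the first false operand
calls.clear()
and_result = check('first', False) and check('second', True)
calls_after_and = list(calls)

# When the first operand does not decide the outcome, the second one runs
calls.clear()
both_result = check('first', False) or check('second', True)
calls_after_both = list(calls)
```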

It is also possible to chain comparisons:

    In [120]: 4 > 3 > 2 > 1
    Out[120]: True


### FOR LOOPS

`for` loops are for iterating over a collection (like a list or tuple) or an iterator. The standard syntax for a for loop is:

    for value in collection:
        # do something with value

You can advance a for loop to the next iteration, skipping the remainder of the block, using the `continue` keyword. Consider this code, which sums up integers in a list and skips None values:

    sequence = [1, 2, None, 4, None, 5]
    total = 0
    for value in sequence:
        if value is None:
            continue
        total += value

A for loop can be exited altogether with the `break` keyword. This code sums elements of the list until a 5 is reached:

    sequence = [1, 2, 0, 4, 6, 5, 2, 1]
    total_until_5 = 0
    for value in sequence:
        if value == 5:
            break
        total_until_5 += value

The `break` keyword only terminates the innermost for loop; any outer for loops will continue to run:

    In [121]: for i in range(4):
       .....:     for j in range(4):
       .....:         if j > i:
       .....:             break
       .....:         print((i, j))
       .....:
    (0, 0)
    (1, 0)
    (1, 1)
    (2, 0)
    (2, 1)
    (2, 2)
    (3, 0)
    (3, 1)
    (3, 2)
    (3, 3)

As we will see in more detail, if the elements in the collection or iterator are sequences (tuples or lists, say), they can be conveniently unpacked into variables in the for loop statement:

    for a, b, c in iterator:
        # do something
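
For a concrete version of this pattern, the sketch below loops over a list of three-element tuples; the data and variable names are invented for the example.

```python
records = [('a', 1, 2.5), ('b', 2, 3.5), ('c', 3, 4.5)]

labels = []
weighted_sum = 0
for name, count, weight in records:
    # Each tuple is unpacked into three variables on every iteration
    labels.append(name)
    weighted_sum += count * weight
```

Every element must contain exactly as many items as there are variables on the left; otherwise Python raises a `ValueError`.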


### WHILE LOOPS

A while loop specifies a condition and a block of code that is to be executed until the condition evaluates to False or the loop is explicitly ended with `break`:

    x = 256
    total = 0
    while x > 0:
        if total > 500:
            break
        total += x
        x = x // 2
        
### PASS

`pass` is the “no-op” statement in Python. It can be used in blocks where no action is to be taken (or as a placeholder for code not yet implemented); it is only required because Python uses whitespace to delimit blocks:

    if x < 0:
        print('negative!')
    elif x == 0:
        # TODO: put something smart here
        pass
    else:
        print('positive!')

### RANGE

The range function returns a range object, an iterable that lazily yields a sequence of evenly spaced integers:

    In [122]: range(10)
    Out[122]: range(0, 10)

    In [123]: list(range(10))
    Out[123]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

A start, end, and step (which may be negative) can all be given:

    In [124]: list(range(0, 20, 2))
    Out[124]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

    In [125]: list(range(5, 0, -1))
    Out[125]: [5, 4, 3, 2, 1]

As you can see, range produces integers up to but not including the endpoint. A common use of range is for iterating through sequences by index:

    seq = [1, 2, 3, 4]
    for i in range(len(seq)):
        val = seq[i]

While you can use functions like `list` to store all the integers generated by range in some other data structure, often the default iterator form will be what you want. This snippet sums all numbers from 0 to 99,999 that are multiples of 3 or 5:

    total = 0
    for i in range(100000):
        # % is the modulo operator
        if i % 3 == 0 or i % 5 == 0:
            total += i

While the range generated can be arbitrarily large, the memory use at any given time may be very small.
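
The memory difference can be observed with `sys.getsizeof`. The exact byte counts are CPython implementation details and will vary, so this sketch only compares the two sizes rather than relying on specific values:

```python
import sys

lazy = range(1_000_000)
materialized = list(lazy)

lazy_size = sys.getsizeof(lazy)          # small and independent of the length
list_size = sys.getsizeof(materialized)  # grows with the number of elements

# A range still supports len, indexing, and membership tests
length = len(lazy)
last = lazy[-1]
contains = 999_999 in lazy
```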

### TERNARY EXPRESSIONS

A ternary expression in Python allows you to combine an if-else block that produces a value into a single line or expression. The syntax for this in Python is:

    value = true_expr if condition else false_expr

Here, true_expr and false_expr can be any Python expressions. It has the same effect as the more verbose:

    if condition:
        value = true_expr
    else:
        value = false_expr

This is a more concrete example:

    In [126]: x = 5

    In [127]: 'Non-negative' if x >= 0 else 'Negative'
    Out[127]: 'Non-negative'

As with if-else blocks, only one of the expressions will be executed. Thus, the “if” and “else” sides of the ternary expression could contain costly computations, but only the branch selected by the condition is ever evaluated.
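
The following sketch makes this visible with a hypothetical function `costly` that records each call. After the ternary expression runs, only one call has been logged:

```python
evaluated = []

def costly(label):
    """Stand-in for an expensive computation; records that it was called."""
    evaluated.append(label)
    return label

x = 5
result = costly('non-negative') if x >= 0 else costly('negative')
```

Here `evaluated` contains only `'non-negative'`; the expression after `else` was never run.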

While it may be tempting to always use ternary expressions to condense your code, realize that you may sacrifice readability if the condition as well as the true and false expressions are very complex.