## **Python Bootcamp - Unit 3**
---
**Author:** David Dobolyi

**Key Concepts**
- [Data Types](#Data-Types)
    - [Numeric](#Numeric)
    - [Boolean](#Boolean)
        - [Comparison Operators](#Comparison-Operators)
        - [Boolean Operators](#Boolean-Operators)
    - [Text Sequence](#Text-Sequence)
- [Explicit Type Conversion](#Explicit-Type-Conversion)
- [Date and Time Types](#Date-and-Time-Types)

---
### Data Types

As noted in the prior unit, a key aspect of working with Python (and programming languages more generally) is understanding the different data types available to you to work with (e.g., see the [official documentation](https://docs.python.org/3/library/datatypes.html)). From a data analysis perspective, fundamental ones include:

- **Numeric:** int, float
- **Boolean:** bool
- **Text Sequence:** str

Collectively, these four represent Python's core set of basic, *primitive* data types. Note, however, that there are many other data types in Python that are less common, such as complex (i.e., a numeric type used for imaginary numbers) and binary sequences (e.g., `bytes`). Others, such as `None` (with its corresponding `NoneType`), can be used to define a null or nonexistent value, but this keyword is most typically used when setting up function arguments. In other words, this list is not intended to be an exhaustive set of all data types available in Python.
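
As a quick illustration of the less common types just mentioned, the following minimal sketch checks the type of a null value, a complex number, and a binary sequence (the variable names are hypothetical and used only for illustration):

```python
# Inspect the types of a few less common built-in values
none_type = type(None)       # the type of the null value None
complex_type = type(3 + 4j)  # j marks the imaginary part of a complex number
bytes_type = type(b'abc')    # the b prefix creates a binary sequence

print(none_type, complex_type, bytes_type)
```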

Moreover, many additional data types can be imported from modules and packages; examples of these include [datetime](https://docs.python.org/3/library/datetime.html) (i.e., date and time types), NumPy [data types](https://numpy.org/doc/stable/user/basics.types.html), and pandas [data types](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes) (e.g., [*categorical*](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html), which functions similarly to a *factor* in R). We will discuss dates and times briefly at the end of this unit, whereas NumPy and pandas data types will be detailed in a subsequent unit.

Finally, it's important to understand that the term data "type" can also vary depending on the source. For example, within the official documentation, lists and tuples are defined as built-in data types in some [places](https://docs.python.org/3/library/index.html), yet they are referred to as data structures in [others](https://docs.python.org/3/tutorial/datastructures.html). For the purposes of this bootcamp, the term data "type" will focus on data with a value (e.g., `int 4`, `float 5.5`, `str 'Hello world!'`), whereas data "structures" will focus on how data are organized (e.g., lists and tuples, which will be discussed in detail below).

#### Numeric

As the name implies, numeric data deals with numbers, including both integers (i.e., whole numbers abbreviated as `int`) and floats (i.e., non-whole numbers that may contain a fraction abbreviated as `float`). For example, the following number will implicitly be treated as an integer by Python:

In [1]:
4

4

To verify the type of this example, we can use the ***type*** function:

In [2]:
type(4)

int

As expected, the type of 4 is int (i.e., integer). By contrast, a number such as 5.5 will be automatically treated as a float:

In [3]:
5.5

5.5

In [4]:
type(5.5)

float

While technically floats are more general and could cover all integers, it's worth noting the subtle difference between an int and a float conceptually. Specifically, declaring a value as an int formally defines it as being a whole number, whereas floats are less restrictive. Distinguishing between the two data types can be important in certain domains (e.g., when counting the number of individuals in a household, it would only make sense to treat the values as integers).

From a practical perspective, it's worth noting that arithmetic involving integers and floats can result in changes to the data type. For instance, consider the following math examples:

In [5]:
4 + 5

9

In [6]:
type(4 + 5)

int

In [7]:
4 + 5.5

9.5

In [8]:
type(4 + 5.5)

float

In the former operation, the sum of two integers clearly remains an integer, whereas in the latter, the sum of an integer and a float changes to a float, which is the broader of the numeric data types. This latter change of data type is known as *implicit type conversion*. After discussing the other basic data types, we will also briefly touch on alternative ways of changing data types more formally (i.e., *explicit type conversion*).
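
Addition is not the only operation that can change a numeric type. As a minimal sketch (the variable names are hypothetical), note how true division with `/` returns a float even when both inputs are integers, whereas floor division with `//` keeps two integers as an integer:

```python
# Division and multiplication can also trigger implicit type conversion
true_division = 4 / 2    # true division yields a float, even for whole results
floor_division = 4 // 2  # floor division of two ints stays an int
mixed_product = 3 * 1.0  # an int times a float becomes a float

print(type(true_division), type(floor_division), type(mixed_product))
```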

#### Boolean

Certain types of questions result in a True or False answer. For these cases, the appropriate data type is Boolean, which is sometimes referred to as logical data. The two main Boolean keywords (i.e., reserved words) in Python are, unsurprisingly, `True` and `False`:

In [9]:
True

True

In [10]:
False

False

Both `True` and `False` fall under the abbreviated `bool` data type; for example:

In [11]:
type(False)

bool

Note that case matters here: in other words, for Python to recognize the terms "true" and "false" as Boolean, they must be written in title case.

Moreover, it's important to recognize that Boolean data also have a corresponding numeric value. Specifically, `False` is associated with the integer `0`, while `True` is associated with the integer `1`.
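
One practical consequence is that Boolean values can take part in arithmetic. The following short sketch (with hypothetical variable names) illustrates the idea:

```python
# Booleans behave like the integers 0 and 1 in arithmetic
two_trues = True + True                # each True counts as 1
scaled = True * 10                     # 1 * 10
nothing = False * 10                   # 0 * 10
bool_is_int = isinstance(True, int)    # bool is a subtype of int

print(two_trues, scaled, nothing, bool_is_int)
```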

##### **Comparison Operators**

One of the primary uses of Boolean data in Python is for performing comparisons, which result in `True` and `False` outcomes. To make comparisons possible, we need to use Python's various *comparison operators*, which include:

In [12]:
2 < 3  # strictly less than

True

In [13]:
2 <= 3 # less than or equal

True

In [14]:
2 > 3 # strictly greater than

False

In [15]:
2 >= 3 # greater than or equal

False

In [16]:
2 == 3 # equal

False

In [17]:
2 != 3 # not equal

True

In addition to these six, there are two more related to *objects* that we will discuss later:

- is: object identity
- is not: negated object identity

More importantly, comparison operators are useful for making an important distinction regarding the value of data versus the type of data. For instance, consider these comparisons:

In [18]:
True == 1

True

In [19]:
type(True) == type(1)

False

While the values of `True` and `1` are identical, notice how the types of data in question are not (i.e., `bool` versus `int`). As such, it will be important to keep data types in mind as you continue to work with Python, since the formal type of data in question can influence the type of analysis that should be used (e.g., logistic regression involves binomial data, which would correspond to Boolean).
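
The same distinction between value and type holds for the numeric types. As a small sketch (variable names are hypothetical), an integer and a float can be equal in value while still differing in type:

```python
# Equal values do not imply equal types
same_value = (2 == 2.0)             # compares the values only
same_type = (type(2) == type(2.0))  # compares int against float

print(same_value, same_type)
```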

##### **Boolean Operators**

In addition to comparison operators, there are a number of *Boolean operators* (also known as *logical operators*) that can be coupled with comparison operators to make more complex comparisons. These operators include (in ascending order of priority):

- **or**
- **and**
- **not**

More practically, these Boolean operators have clear outcomes when combining multiple comparisons. For instance, notice how **or** evaluates when using Boolean values:

In [20]:
True or True

True

In [21]:
True or False

True

In [22]:
False or True

True

In [23]:
False or False

False

Python treats **or** as a short-circuit operator: if the first argument is true, the result is the first argument and the second is never evaluated; otherwise, the result is the second argument. This makes code evaluation more efficient, and the result should still be logically consistent with your expectations. Python's **and** works in a similar fashion:

In [24]:
True and True

True

In [25]:
True and False

False

In [26]:
False and True

False

In [27]:
False and False

False

For **and**, if the first argument is false, then the result will be the first argument; otherwise it will be the second argument. Finally, regarding **not**, this operator can flip a result:

In [28]:
not False

True

In [29]:
not True

False
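
To make the short-circuit behavior of **or** and **and** more concrete, here is a minimal sketch. It assumes that non-Boolean values can stand in for `True` and `False` (e.g., `0` counts as false, while `1` and non-empty strings count as true), and the variable names are hypothetical:

```python
# "or" returns the first argument if it is true, otherwise the second
first_true = 'first' or 'second'
fallback = 0 or 'fallback'

# "and" returns the first argument if it is false, otherwise the second
stopped_early = 0 and 'never used'
reached_second = 1 and 'second'

# The second argument is skipped entirely when the first decides the result,
# so the division by zero below is never evaluated and no error is raised
skipped = True or (1 / 0)

print(first_true, fallback, stopped_early, reached_second, skipped)
```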

It's worth noting however that order of operations (i.e., priority) matters, so the following two outcomes are not identical:

In [30]:
not True or True

True

In [31]:
not (True or True)

False

As noted in Unit 2, parentheses always take precedence, so they can be helpful for clarifying exactly the comparison you are looking for formally (i.e., feel free to use parentheses liberally).

More practically, comparison and Boolean operators will come into play for managing control flow and making simultaneous comparisons. For example, regarding the latter, we can check the "truth" of two arithmetic calculations at once:

In [32]:
((1 + 3) < 5) and (3 > 2)

True

We will use comparison operators in subsequent units to help with filtering data as well.

#### Text Sequence

We have already used the text sequence data type in Unit 1 when discussing the `'Hello world!'` example, which, as noted earlier, represents a text sequence or `str`:

In [33]:
'Hello world!'

'Hello world!'

In [34]:
type('Hello world!')

str

As the name of the data type implies, text sequences represent one or more characters in a sequence. A single character such as `a` would nevertheless still qualify as a string:

In [35]:
type('a')

str

All text strings must be quoted, and the typical quote used in Python is a single quote. Double quotes will work too, but notice the output still shows a single quote in the latter case:

In [36]:
'cat'

'cat'

In [37]:
"cat"

'cat'

In [38]:
'cat' == "cat"

True

While the type of quote does not matter for defining a string, it can have implications in certain cases where the strings themselves contain quotes. For instance, the following direct quote is not a problem to write since the type of quotes within the string differ from the outer, *encapsulating* quotes:

In [39]:
'David said, "Hello world!"'

'David said, "Hello world!"'

In this example however, note that the type of quotes used for encapsulation is not arbitrary. In other words, we could not use double quotes to encapsulate a string if the string itself contains double quotes:

In [40]:
"David said, "Hello world!""

SyntaxError: invalid syntax (1295707276.py, line 1)

As you can see in the syntax highlighting, the string is not fully in <font color = "red">**red**</font>, suggesting parts of it were not seen as a string. This is due to the matching sets of quotes failing to encapsulate the string fully as intended.

As a potential workaround, we could flip the quotes within the string as follows:

In [41]:
"David said, 'Hello world!'"

"David said, 'Hello world!'"

However, this string is not identical to the one we looked at earlier:

In [42]:
'David said, "Hello world!"' == "David said, 'Hello world!'"

False

This is due to the fundamental difference between the type of quotes being used. In other words:

In [43]:
"'" == '"'

False

The actual workaround to this nested quotes issue is to use a special escape character, the backslash, to write an *escape sequence*:

In [44]:
"David said, \"Hello world!\""

'David said, "Hello world!"'

This escaped string is now identical to the one we used earlier:

In [45]:
"David said, \"Hello world!\"" == 'David said, "Hello world!"'

True

The special utility of the backslash in the string can lead to some confusion. For instance, let's assume we wanted to incorporate a backslash into our string. When doing so, notice how the backslash appears twice in the output of the following command:

In [46]:
'This string is supposed to have one \ in it!'

'This string is supposed to have one \\ in it!'

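Technically, a backslash in a string must itself be escaped to be treated literally. In this case, a backslash followed by a space is not a recognized escape sequence, so Python keeps the backslash as a literal character and displays it as `\\` (i.e., an escaped single backslash). Note that recent Python versions emit a warning for unrecognized escape sequences like this one, so writing `\\` explicitly is the safer habit. To see more clearly what the string actually contains, we can use the print function: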

In [47]:
print('This string is supposed to have one \ in it!')

This string is supposed to have one \ in it!


Again, this occurs due to Python automatically converting the backslash to an escaped one, meaning the following two strings are identical (which may not be the case in a language like R):

In [48]:
'This string is supposed to have one \ in it!' == 'This string is supposed to have one \\ in it!'

True
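
As a related aside, Python also supports *raw* strings (written with an `r` prefix), in which backslashes are not treated as escape characters. The following is only a minimal sketch of that idea; the file path and variable names are hypothetical:

```python
# In a raw string, each backslash is kept literally
escaped_path = 'C:\\new_folder\\table.csv'  # every backslash escaped by hand
raw_path = r'C:\new_folder\table.csv'       # the r prefix turns escapes off
same_path = (escaped_path == raw_path)

# '\n' is a single newline character, whereas r'\n' is a backslash plus an n
newline_length = len('\n')
raw_length = len(r'\n')

print(same_path, newline_length, raw_length)
```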

There are many more predefined escape sequences in Python, which are summarized in the following [link](https://docs.python.org/3/reference/lexical_analysis.html#literals) or available via `help('\\')`. For instance:

In [49]:
print('This string has a newline \n and a tab \t in it...')

This string has a newline 
 and a tab 	 in it...


There is plenty more to discuss when it comes to text sequence data, and much of it is important when it comes to analytic techniques such as *natural language processing (NLP)*. Unsurprisingly, Python contains plenty of functions and methods to help make dealing with strings easier. For example:

In [50]:
len('Hello world!') # count the length of a string

12

This function is helpful for making the backslash issue we discussed earlier more apparent. Note that the length of the following string is 1 rather than 2:

In [51]:
len('\\')

1

Again, Python provides plenty of additional functionality for working with strings, as shown in the following examples:

In [52]:
'Hello' + ' ' + 'world!' # concatenate multiple strings

'Hello world!'

In [53]:
'hello'.upper() # upper case a string

'HELLO'

In [54]:
'HELLO'.lower() # lower case a string

'hello'

In [55]:
'hello'.title() # title case a string

'Hello'

Regarding the last three examples involving case, it's worth mentioning that Python is case sensitive. As such, note the following two strings are not identical:

In [56]:
'hello' == 'Hello'

False

Ultimately, it's up to you to find and use the appropriate functions Python provides to accomplish what you need to do.
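
As a starting point, the following sketch shows a few more built-in string methods that tend to be useful when cleaning text (the example string and variable names are hypothetical):

```python
# A few more string methods for cleaning and inspecting text
messy = '  Hello world!  '
trimmed = messy.strip()                        # remove surrounding whitespace
replaced = trimmed.replace('world', 'Python')  # swap one substring for another
starts_with_hello = trimmed.startswith('Hello')
count_of_l = trimmed.count('l')                # occurrences of the letter l

print(trimmed, replaced, starts_with_hello, count_of_l)
```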

---
### Explicit Type Conversion

While we have now covered the basics of the primitive data types in Python, it's worth touching on how we can convert between data types as needed.

Earlier, we showed how Python can perform *implicit* type conversion when working with numeric data:

In [57]:
4 + 5.5

9.5

In [58]:
type(4 + 5.5)

float

Again, the sum of an integer and a float is a float, since without converting to float, we'd lose precision.

Nevertheless, Python offers us the option to perform *explicit* type conversion using *type casting*:

In [59]:
int(9.5)

9

In [60]:
type(int(9.5))

int

Unsurprisingly, this particular casting led to some data loss, since the integer version is not identical to the original float:

In [61]:
9.5 == int(9.5)

False
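
It may help to see exactly how the fractional part is lost. As a minimal sketch (variable names are hypothetical), `int` truncates toward zero rather than rounding, which differs from the built-in `round` function and from `math.floor` for negative numbers:

```python
import math

# int() drops the fractional part (i.e., truncates toward zero)
truncated_positive = int(9.5)
truncated_negative = int(-9.5)

# Alternatives with different behavior
floored_negative = math.floor(-9.5)  # floor always moves down
rounded = round(9.6)                 # round moves to the nearest integer

print(truncated_positive, truncated_negative, floored_negative, rounded)
```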

While not particularly useful in this case, there are times where we want control of our data types and the option to transition between them. For example, suppose we had a number stored within a text sequence:

In [62]:
'123'

'123'

In [63]:
type('123')

str

Since `'123'` is a string, we cannot perform arithmetic with it:

In [64]:
'123' + 456

TypeError: can only concatenate str (not "int") to str

The result of attempting to sum a `str` and an `int` is a <font color = "red">**TypeError**</font> that lets us know this operation does not really make sense as written.

Instead, we need to use type casting to explicitly tell Python what we want to do:

In [65]:
int('123') + 456

579

This code could be read as follows: take the `int` form of the `str 123` and add it to the `int 456` -- an operation Python understands and aptly performs. While it may not seem useful to be able to do this right now, keep type casting in the back of your mind since many functions require data to be in a specific form (i.e., type) to run.
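
Note that type casting only succeeds when the text actually looks like the target type. The sketch below uses a `try`/`except` block (not yet covered in this bootcamp) purely to show that `int` rejects a string containing a decimal point, whereas `float` accepts it; the variable names are hypothetical:

```python
# int() cannot parse a string containing a decimal point
try:
    int('12.5')
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Going through float first works, at the cost of truncation
as_float = float('12.5')
as_int = int(float('12.5'))

print(conversion_failed, as_float, as_int)
```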

With regard to the type casting functions available, these may be summarised as follows (with examples):

In [66]:
int('123') # convert to an integer

123

In [67]:
float(4)   # convert to a float

4.0

In [68]:
bool(0)    # convert to a Boolean

False

In [69]:
str(123)   # convert to a text sequence (i.e., string)

'123'
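
Casting to Boolean deserves some extra care, since the outcome depends on whether a value is empty or zero rather than on what it says. A minimal sketch with hypothetical variable names:

```python
# Zero and empty values convert to False; everything else converts to True
zero_float = bool(0.0)
negative_int = bool(-1)
empty_string = bool('')
false_as_text = bool('False')  # a non-empty string, so this is True

print(zero_float, negative_int, empty_string, false_as_text)
```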

---
### Date and Time Types

The focus of this unit has been on the primitive data types in Python, although many more data types are provided out-of-the-box as described in the [documentation](https://docs.python.org/3/library/datatypes.html). Among these, date and time types are particularly important, since they are common in data analysis (e.g., for time series analysis).

To work with dates or times in Python, one option is to import the [datetime](https://docs.python.org/3/library/datetime.html) module (alternatively, one could use various NumPy [types](https://numpy.org/doc/stable/reference/arrays.datetime.html), which we will discuss in a later unit). Let's go ahead and import datetime (using the alias dt):

In [70]:
import datetime as dt

Once imported, we can use datetime (i.e., dt) to work with dates or times. For example, if we wanted to see today's date, we can use the *today* method of the *date* data type:

In [71]:
dt.date.today()

datetime.date(2021, 11, 30)

In [72]:
type(dt.date.today())

datetime.date

As shown above, this method returns a date, which includes the year, month, and day, respectively.
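
Dates can also be created directly from a year, month, and day, after which the individual pieces are available as attributes. The following sketch reuses the date from the output above (the variable names are hypothetical):

```python
import datetime as dt

# Build a specific date and pull out its components
unit_date = dt.date(2021, 11, 30)
year, month, day = unit_date.year, unit_date.month, unit_date.day
weekday_index = unit_date.weekday()  # Monday is 0, so Tuesday is 1
iso_text = unit_date.isoformat()     # text in YYYY-MM-DD form

print(year, month, day, weekday_index, iso_text)
```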

Assuming we wanted to know not just the date but also the time, we could use the *now* method of *datetime*:

In [73]:
dt.datetime.now()

datetime.datetime(2021, 11, 30, 21, 40, 53, 731575)

In [74]:
type(dt.datetime.now())

datetime.datetime

As shown, the datetime output again includes the year, month, and day, followed by the hour, minute, second, and microsecond (with the hour shown in military time [e.g., 18 = 6pm]).

Additional data types are available for working with specific aspects of dates and times such as *time*, *timezone*, and *timedelta* (i.e., the difference between two times). For instance, regarding the latter, we can use *timedelta* to find yesterday's date:

In [75]:
dt.date.today() - dt.timedelta(1)

datetime.date(2021, 11, 29)

By default, the first argument of *timedelta* refers to days, but we can also specify other units via keyword arguments. For example, to find the time 3 hours from now, we can use the following code:

In [76]:
dt.datetime.now() + dt.timedelta(hours = 3)

datetime.datetime(2021, 12, 1, 0, 40, 54, 464876)
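
A *timedelta* also results from subtracting one date from another. As a minimal sketch using fixed dates so the results are reproducible (the dates and variable names are hypothetical):

```python
import datetime as dt

# Subtracting two dates yields a timedelta
gap = dt.date(2021, 12, 25) - dt.date(2021, 11, 30)
gap_in_days = gap.days

# A timedelta can be expressed in seconds and added to a fixed datetime
three_hours = dt.timedelta(hours=3)
seconds = three_hours.total_seconds()
later = dt.datetime(2021, 11, 30, 21, 40) + three_hours  # rolls over midnight

print(gap_in_days, seconds, later)
```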

Moreover, several functions are available to improve the formatting and display of dates and times, which can be helpful for various purposes (e.g., formatting the display of time on a plot). For example, we can use [strftime](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) to display today's date in a more presentable format:

In [77]:
dt.date.today().strftime('%A, %B %d, %Y')

'Tuesday, November 30, 2021'

Using a datetime, we could extend this further:

In [78]:
dt.datetime.now().strftime('%A, %B %d, %Y %I:%M %p')

'Tuesday, November 30, 2021 09:40 PM'

For specifics regarding the format codes (e.g., %A shows the full name of a weekday such as `'Monday'`), see the following [link](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).
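
The same format codes also work in the opposite direction: *strptime* converts a string into a datetime. The sketch below parses a fixed timestamp and then formats it again; it assumes the default English locale for the weekday and month names, and the variable names are hypothetical:

```python
import datetime as dt

# Parse text into a datetime using format codes
parsed = dt.datetime.strptime('2021-11-30 21:40', '%Y-%m-%d %H:%M')

# Format the datetime back into more presentable text
formatted = parsed.strftime('%A, %B %d, %Y %I:%M %p')

print(parsed, formatted)
```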

Finally, it's worth noting Python also supports working with timezones, which can add a substantial amount of complexity/confusion when working with data. For instance, to return the current time in Coordinated Universal Time (UTC; see this [link](https://www.nhc.noaa.gov/aboututc.shtml) for more details), you can use the following code:

In [79]:
dt.datetime.now(tz = dt.timezone.utc)

datetime.datetime(2021, 12, 1, 2, 40, 55, 633245, tzinfo=datetime.timezone.utc)

Coupled with the *tzname* method, we can confirm the time is indeed in UTC:

In [80]:
dt.datetime.now(tz = dt.timezone.utc).tzname()

'UTC'

Assuming we wanted to see the time in our local time zone, we can add in *astimezone*: 

In [81]:
dt.datetime.now(tz = dt.timezone.utc).astimezone()

datetime.datetime(2021, 11, 30, 21, 40, 56, 664498, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400), 'EST'))

In [82]:
dt.datetime.now(tz = dt.timezone.utc).astimezone().tzname()

'EST'
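
Since the output above depends on where and when the code is run, the following sketch repeats the conversion with a fixed moment and an explicitly defined timezone. The five-hour offset labeled `'EST'` is an assumption chosen to match the output above, and the variable names are hypothetical:

```python
import datetime as dt

# Define a fixed-offset timezone five hours behind UTC
est = dt.timezone(dt.timedelta(hours=-5), 'EST')

# Convert a fixed UTC moment into that timezone
utc_moment = dt.datetime(2021, 12, 1, 2, 40, tzinfo=dt.timezone.utc)
local_moment = utc_moment.astimezone(est)

# Both objects describe the same instant, just on different clocks
same_instant = (utc_moment == local_moment)

print(local_moment, local_moment.tzname(), same_instant)
```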

As is likely apparent, working with datetimes can quickly become complex, and as such, additional libraries are available to make working with them easier. For example, [dateutil](https://dateutil.readthedocs.io/en/stable/) can be invaluable for simplifying the issue of timezones or complex timedeltas. In particular, we can use *relativedelta* to see the date one month from now (which is not quite the same as something like "4 weeks from now"):

In [83]:
from dateutil.relativedelta import relativedelta

dt.date.today() + relativedelta(months = 1)

datetime.date(2021, 12, 30)
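
For contrast, here is a sketch of the "4 weeks from now" calculation using only the standard library. It starts from the fixed date November 30, 2021 (the date shown in the outputs of this unit), and the variable names are hypothetical:

```python
import datetime as dt

# Four weeks is always exactly 28 days, regardless of month length
start = dt.date(2021, 11, 30)
four_weeks_later = start + dt.timedelta(weeks=4)

print(four_weeks_later)
```

Here the result lands on December 28 rather than December 30, which is why a calendar-aware tool is needed when "one month" is what you actually mean.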

Additionally, we can use *parse* to help convert strings into usable datetime formats without manual preparation:

In [84]:
from dateutil.parser import parse

parse('January 1, 2020 4:12pm')

datetime.datetime(2020, 1, 1, 16, 12)
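
When a string already follows the standard ISO 8601 layout, the built-in *fromisoformat* method is a lighter-weight option that needs no extra library. A minimal sketch, with hypothetical variable names:

```python
import datetime as dt

# Parse an ISO-style string without any third-party package
parsed_iso = dt.datetime.fromisoformat('2020-01-01 16:12')
matches_expected = (parsed_iso == dt.datetime(2020, 1, 1, 16, 12))

print(parsed_iso, matches_expected)
```

Unlike *parse*, this method will not accept free-form text such as `'January 1, 2020 4:12pm'`.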

In an upcoming unit on NumPy and pandas, we will spend more time discussing date formats and how to use them, but for now, this should provide a valuable preview of what's possible in Python.