# Python data types notebook

## 1. Introduction

In this notebook we will be looking at the data types discussed in the Week 2 & 3 content, how they work in Python, and how you can manipulate them. 

Everything in Python is an object. The data type of an object or variable specifies what kind of data is stored inside it and how you can interact with it. Because virtually everything in Python is an object (not just variables), you will see a lot of the dot syntax mentioned in Week 1 in this notebook; it is used to access the methods and attributes of objects of interest (methods, packages, data objects, etc.). If you wish to see the methods and attributes available for a given object, type the object name followed by a dot and then press TAB to show the available list. This can be very useful for avoiding typos.

This notebook is very long, but is intended to be a key reference guide for working with Python data types. *Do not* feel that you need to complete the notebook in 1 sitting, or even 2. Feel free to make notes for yourself by inserting cells throughout the notebook as you go through, or take notes on a separate document of your choosing.  


**Note** there is code throughout this workbook which will intentionally produce an error. Try to read the error messages and see if you can understand what is causing the problem and why. Python error messages tend to be quite helpful and specific (the same cannot always be said for R, so enjoy the specificity while you can!). Reading error messages and not being frightened by them is a key skill for any programmer! 

✅ Remember to RUN ALL THE CELLS IN ORDER AS YOU GO THROUGH THIS NOTEBOOK (if you skip some, you might see some unexpected errors).
✅ To see line numbers within a code cell, you can use the keyboard shortcut SHIFT-L or navigate to View > Toggle Line Numbers

### Python has 5 basic data types.
* Boolean: True, False 
* numeric
    * integer: -1, 2
    * float:  1.0, 15.3 
    * complex: 1+4j 
* string (character): “health care”, “social care”  

### From the `pandas` library we can get the categorical data type 
* category: apple, banana, orange 

### From the `datetime` module we can get date and time datatypes 
* date: 2023-06-01
* time: 15:30:43
* datetime: 2023-06-01 15:30:43

### Python also provides functions to examine data type features, for example 
* `type()` - what is the object’s data type?  
* `len()` - how long is the data? 

* `isinstance()` - to check if an object is a given data type, returns a Boolean value 
 - this function takes 2 arguments, `isinstance(object, datatype)`. You can also pass several data types as a tuple, `isinstance(object, (datatype, datatype, datatype))`, which is the equivalent of an "or statement" (so if the object matches any one of the listed types, the output will be `True`)
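
As a quick illustration of these three functions, here is a minimal sketch. The variable `sample_text` is a made-up example value and is not used elsewhere in this notebook.

```python
sample_text = "health care"  # hypothetical example value

kind = type(sample_text)                 # the class of the object
length = len(sample_text)                # number of characters, including the space
is_text = isinstance(sample_text, str)   # a single data type
is_number = isinstance(sample_text, (int, float, complex))  # True if ANY listed type matches

print(kind, length, is_text, is_number)
```

Passing a tuple of types is handy when you want to accept, for example, any kind of number without writing several separate checks.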

### Casting 
The process of specifying a data type for a variable is called **casting** in Python. Because Python is an OOP (object-oriented programming) language, classes are used to define data types. Therefore, casting is done using constructor functions. 

* `int()` - constructs an integer number from a float or string (if the string represents a whole number)
* `float()` - constructs a floating-point number from an integer or string (if the string represents a float or integer) 
* `str()` - constructs a string from a wide variety of data types including string, integer, float, etc. 
* `bool()` - constructs Boolean data types from integers or floats 
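
The constructor functions listed above can be sketched as follows. The values are arbitrary examples chosen for illustration; note in particular that `int()` truncates a float rather than rounding it, and that a string containing a decimal point cannot be passed to `int()` directly.

```python
whole = int("42")        # string representing a whole number -> integer
truncated = int(3.9)     # float -> int drops the fractional part (no rounding)
decimal = float("3.5")   # string -> float
text = str(15.3)         # float -> string
flag = bool(0)           # zero -> False

try:
    int("3.7")           # a string with a decimal point is not a whole number
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

print(whole, truncated, decimal, text, flag)
```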

Let's discuss each of these Python data types one by one. 

In [None]:
# first lets load the packages and modules we need for this notebook 

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype 
from pandas.api.types import union_categoricals
import datetime as dt 
from dateutil import parser, tz, relativedelta
import re


## 2. Boolean

In Python, the bool data type is built-in, meaning it does not need to be imported and is always available. No other value besides `True` and `False` will have `bool` as its type. The class `bool` is actually a subclass of the class `int` as True and False values are represented as 1 and 0. 
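
The relationship between `bool` and `int` described above can be checked directly. This is only a small sketch; `total_true` is a name invented for the example.

```python
print(issubclass(bool, int))    # bool is a subclass of int
print(isinstance(True, int))    # so True also counts as an int
print(True == 1, False == 0)    # and the values compare equal to 1 and 0

# Because True behaves like 1, summing Booleans counts the True values
total_true = sum([True, False, True, True])
print(total_true)
```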

To work with boolean data types and values, we can use relational, logical, and identity operators. 

In [None]:
type(False)

In [None]:
type(True)

It is possible to assign a Boolean value to a variable, but it is not possible to assign a value to a Boolean value (as `True` and `False` are keywords).

In [None]:
a_true_variable = True 

print(a_true_variable)

In [None]:
type(a_true_variable)

In [None]:
True = 44 # this will produce an error 

As keywords, True and False must have an upper case 1st letter 

In [None]:
f = false # this will produce an error 

Since `True` and `False` are equivalent to `1` and `0`, they can be cast to the respective types with `int()`, `float()`, and `complex()`

In [None]:
print(int(True)) 

print(int(False))

*Note*: the code above is nested (where 1 piece of code is placed inside another). So in the example it is executed/read from the inside out, `int()` then `print()`


In [None]:
print(float(True))
print(float(False))

In [None]:
print(complex(True))
print(complex(False))

You can also convert `True` and `False` to strings using `str()`

In [None]:
print(str(True))
print(str(False))

print(type(str(True))) 
print(type(str(False)))

Due to the concepts of Truthy and Falsey, non-empty strings are considered `True`. So, if you convert `False` to a string and then back to a Boolean type, it will be `True`.

In [None]:
print(bool(str(False))) 
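
To make the idea of truthy and falsy values more concrete, the sketch below applies `bool()` to a handful of example values. The two lists are illustrative and not exhaustive.

```python
# Values that Python treats as False: zeros, None, and empty containers or strings
falsy_values = [False, None, 0, 0.0, 0j, "", [], (), {}, set()]

# Anything else is treated as True, including the non-empty string "False"
truthy_values = [True, 1, -5.4, "False", " ", [0], (None,)]

falsy_results = [bool(value) for value in falsy_values]
truthy_results = [bool(value) for value in truthy_values]

print(falsy_results)
print(truthy_results)
```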

### 2.1 Comparison operators 

Comparison operators compare the value of each operand according to the specified operator and return a boolean output: 

In [None]:
x1 = 10 
x2 = 20 

# equal to 
x1 == x2

In [None]:
# not equal to 
x1 != x2

In [None]:
# less than or equal to
x1 <= x2

In [None]:
# Greater than or equal to
x1 >= x2

### 2.2 Identity operators 

`is` and `is not` are identity operators, which determine whether the given operands refer to the same object (have the same identity). 

You can have 2 objects that are equal, but not identical:


In [None]:
obj0 = 1001 

obj1 = 1000 + 1 

print(obj0, obj1)

In [None]:
print(obj0 == obj1)

In [None]:
print(obj0 is obj1)

print(obj0 is not obj1)

As we can see both objects *equal* 1001, but they do not refer to the same object. We can confirm that they do not reference the same object with `id()`

In [None]:
print(id(obj0))

print(id(obj1))

# indeed obj0 and obj1 have different ids
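
For integers, whether two equal values happen to share an identity can depend on the interpreter and on how the code is run, so it should not be relied upon. The difference between equality and identity is easier to see with lists. The following sketch uses made-up names (`list_a`, `list_b`, `list_c`):

```python
list_a = [1, 2, 3]
list_b = [1, 2, 3]   # a second, separate list with the same contents
list_c = list_a      # a second NAME for the same object, not a copy

print(list_a == list_b, list_a is list_b)   # equal values, different objects
print(list_a == list_c, list_a is list_c)   # equal values, same object
print(id(list_a) == id(list_c))             # identical ids confirm it
```

In practice, use `==` to compare values and keep `is` for identity checks such as `x is None`.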

### 2.3 Logical operators 

Logical operators are used to combine conditional statements, with a boolean output

In [None]:
x1 > 5 and x2 < 25

In [None]:
x1 > 5 or x2 > 10

In [None]:
not(x1 > 5 and x2 < 25)

### 2.4 Casting with Boolean data

Integers and floating point numbers can be cast to boolean type data using the `bool()` function. An `int`, `float` or `complex` number set to zero returns `False`. An `int`, `float` or `complex` number set to any other number, positive or negative, returns `True`.

In [None]:
zero_int = 0 

bool(zero_int)

In [None]:
pos_int = 1

bool(pos_int)

In [None]:
neg_flt = -5.4

bool(neg_flt)

## 3. Numeric

In Python, numeric data types are also built-in, meaning they do not need to be imported and are always available. There are 3 primary numeric data types in Python, including variations from Numpy (denoted with np.X):

1. integer: `int`
2. float: `float`
    - `np.float32`
    - `np.float64`
3. complex: `complex`

There are some built-in functions when working with numeric data that we can use 

* `round()` to round a number to the nearest integer. **Note** when the number ends in .5 `round()` has some unexpected behaviour as it **rounds ties to even**. When you round ties to even, you first look at the digit one decimal place to the left of the last digit in the tie. If that digit is even, then you round down. If the digit is odd, then you round up. You can also specify the number of decimal places (`ndigits`) by passing a second integer argument to `round()` 
* `abs()` to get the absolute value of a number 
* `sum()` used to sum a list of values. The first argument is the iterable while the second optional argument is the start. Start allows us to provide a value to initialize the summation process, if left empty the default is 0 
* `min()` returns the minimum value from an iterable or the smallest of the given arguments 
* `max()` returns the maximum value from an iterable or the largest of the given arguments 
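
The "round ties to even" rule is easiest to understand by trying a few exact ties. The sketch below uses arbitrary example values and also shows the optional arguments of `round()` and `sum()`.

```python
ties = [0.5, 1.5, 2.5, 3.5, -2.5]
rounded_ties = [round(x) for x in ties]   # each tie goes to the nearest EVEN integer
print(rounded_ties)

two_places = round(1234.5678, 2)          # the second argument is ndigits
print(two_places)

total = sum([1, 2, 3], 10)                # the summation starts from 10 instead of 0
print(abs(-5.89), total, min(4, 9, 2), max([4, 9, 2]))
```

Numbers that are not exactly halfway between two integers, such as 2.51 or 2.49, are rounded to the nearest integer as usual.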

Arithmetic operators and expressions are useful when working with numeric data. 

In [None]:
ff = 40.56

type(ff)

In [None]:
ii = 444 

type(ii)

In [None]:
i = 123
print(isinstance(i, int))
print(isinstance(i, (float, str, set, dict)))

In Python you can use `_` in place of a comma or full stop to format entry of large numbers

In [None]:
largenum = 4_000_000_000_000_000_0005

print(largenum, type(largenum))

You can also optionally use scientific notation by inputting `e` or `E` followed by a positive or negative integer. 

In [None]:
scnum0 = 4.2e-4

print(scnum0)

In [None]:
scnum1 = .4E7

print(scnum1)

As mentioned in the Week 2 topic on numeric data, there is a difference in precision between the native/built-in Python float and the Numpy `.float32` and `.float64` data types. Let's look at an example:

In [None]:
a = 58682.7578125

print(type(a))

print(a)

In [None]:
print(type(np.float32(a)))

print(np.float32(a))

In [None]:
print(type(np.float64(a)))

print(np.float64(a))

Complex numbers in Python are constructed as (real + imag*1j). The convention of wrapping complex numbers in parentheses helps to minimize confusion about whether the displayed output represents a string or a mathematical expression.

In [None]:
## lets create a complex number 

cc = 5j + 4 #note here I put the imaginary element before the real element 

cc # when printed Python automatically reformats the number to (real + imaginary) and wraps it in parentheses

In [None]:
type(cc)

In [None]:
cc.real # to access the real component - note the output is a float 

In [None]:
cc.imag #to access the imaginary component - note the output is a float 

### 3.1 Built-in functions 

In [None]:
abs(-5.89)

In [None]:
abs(345)

In [None]:
round(7809.3)

In [None]:
round(3428.9)

In [None]:
round(2.5) # rounded down due to rounding ties to even

In [None]:
round(348.23905) # rounded down because the fractional part is below .5 (this is not a tie)

In [None]:
round(3.5) # rounded up due to rounding ties to even 

In [None]:
round(237.87239465) # rounded up because the fractional part is above .5 (this is not a tie)

In [None]:
round(237.87239465, ndigits = 3)

You can input a list of numbers, which is denoted by square brackets `[]` into the `sum()` function: 

In [None]:
sum([1, 2, 3, 4, 5])

In [None]:
sum([1, 2, 3, 4, 5], 100)  # Positional argument

In [None]:
sum([1, 2, 3, 4, 5], start = 100)  # Keyword argument

**Note** in general, it is good coding practice to use keyword arguments where possible as they are more explicit and readable

You can also input a range of values into `sum()` using `range()`, which works as follows: `range(start, stop, step)`. The `stop` value itself is not included in the range.

In [None]:
# defining a for loop to iterate through the generated sequence of numbers starting from 1 to 10 with a step of 2 using range() function

for each_element in range(1,10,2):

    print("The value is = ", each_element)


In [None]:
sum(range(1, 6))

In [None]:
max([32.67, 60, 73, 723, 4.3])

In [None]:
min([32.67, 60, 73, 723, 4.3])

In the examples above I have mixed integers and floats, which is supported by the functions as they are comparable types. However, if you include a string, you will get a `TypeError`

In [None]:
max([32.67, 60, 73, "723", 4.3]) # produces an error

In [None]:
min([32.67, 60, 73, 723, "4.3"]) # produces an error
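
If you expect that some numbers may have been stored as strings, one option is to catch the `TypeError` and cast the values before comparing. The sketch below assumes every string in the list represents a valid number; `mixed` and `cleaned` are names made up for this example.

```python
mixed = [32.67, 60, 73, "723", 4.3]   # same values as above, one stored as a string

try:
    largest = max(mixed)
except TypeError as err:
    print("TypeError:", err)
    # assumption: every element can be converted with float()
    cleaned = [float(value) for value in mixed]
    largest = max(cleaned)

print(largest)
```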

The `min()` and `max()` functions can be useful when you want to remove the smallest or largest value from a given list

In [None]:
sample = [4, 5, 7, 6, -12, 4, 42, 60, 2, 34]
print(sample)

In [None]:
sample.remove(min(sample))

print(sample)

In [None]:
sample.remove(max(sample))

print(sample)

**Note** that the original object `sample` had 10 values in the list. After running the code block to remove the minimum value, `sample` has 9 values. And after the code block to remove the maximum value, `sample` has 8 values. This is because each time we run the code, the object `sample` is being modified and updated in the global state. Try running each code block a few times to see what happens...

### 3.2 Casting with numeric data 

We can convert between numeric data types using `float()`, `int()`, and `complex()`

In [None]:
# lets remind ourselves what the objects ii and ff were that we created above
print(ii)
print(ff)

In [None]:
print(float(ii))

In [None]:
print(int(ff))

In [None]:
print(complex(ff)) 
# try this with the object ii - what do you expect the difference to be?

### 3.3 Arithmetic operators 

In [None]:
a0 = 5
a1 = 30

In [None]:
# addition 
a0 + a1

In [None]:
# subtraction 
a1 - a0

In [None]:
# multiplication 
a0 * a1

In [None]:
# division 
a1 / a0

In [None]:
# modulus 
a1 % a0

In [None]:
# exponent 
a1 ** a0

In [None]:
# integer/floor division 
a1 // a0
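
The three division-related operators are easy to mix up, so here is a short sketch with values that do not divide evenly. The numbers are arbitrary; `divmod()` is a built-in function that returns the floor quotient and the remainder together.

```python
quotient = 7 / 2         # true division always returns a float
floor_q = 7 // 2         # floor division rounds down to a whole number
remainder = 7 % 2        # modulus gives what is left over
both = divmod(7, 2)      # (floor quotient, remainder) in one call

neg_floor = -7 // 2      # floors toward negative infinity, so -4 rather than -3
neg_rem = -7 % 2         # the remainder takes the sign of the divisor

exact = 30 / 5           # still a float, even though the result is a whole number

print(quotient, floor_q, remainder, both, neg_floor, neg_rem, exact)
```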

You can also use the compound assignment operators as a shorthand when modifying a numeric data typed object

In [None]:
# addition assignment operator 
a1 += a0

print(a1)

In [None]:
# subtraction assignment operator 
a1 -= a0

print(a1)

**Note** that the output above is 30... although when we created the objects above we assigned `a0` to 5 and `a1` to 30. Have a brief thought as to why this might be the case. Reveal the answer below after you have considered this

<details><summary style='color:darkblue'>Solution to why the output of the subtraction assignment operator is 30</summary>

This is because each time we run code relating to the object `a1`, it is being modified and updated in the global state. So if you run the code cells in order, `a1` is 35 based on the addition assignment operator cell and thus in the subtraction assignment operator code cell we are subtracting 35 - 5 = 30 rather than 30 - 5 = 25. If you go through and run the code cells a few more times the output will update accordingly. 

</details>

## 4. String 

Strings are ordered text data which are represented by enclosing the text data in single, double, or triple quotes. Single (`'`) and double (`"`) quotes can be used interchangeably, but triple quotes (`'''` or `"""`) are used to store multiline text. 

There are a series of built-in functions, keywords, and string methods you can use with strings

* `len()` function returns the length of a string (number of characters)
* Keywords `in`, `not in`, and `if` statements
* `find()` returns the index of the first occurrence of the input in the string. If it is not found, `-1` is returned (*not to be confused* with the negative or reverse index `-1`) 
    * `index()` works the same way, though instead of returning `-1` when the input element is not found, it raises a `ValueError`
* `capitalize()` to capitalize the first element in a string 
* `center()` is used to center align the string by specifying the field width
* `endswith()` to check if the given string ends with the specified input character
* `upper()` converts any lower case characters in the string to upper case  
* `lower()` converts any upper case characters in the string to lower case 
* `replace()` replaces the element in the string with the second input element as follows `string.replace("elementinstring", "elementtoreplace")`
* concatenation with `+` 
* `join()` to join all elements of the input string by some specified character 
* `split()` to split into a list of strings 
* `strip()` to delete all leading and trailing whitespace (or other specified characters) 
* `lstrip()` to delete all leading whitespace (or other specified characters) 
* `rstrip()` to delete all trailing whitespace (or other specified characters) 
* `f-string` is a newer way of doing interpolation (substitution of placeholder(s) with an input value), introduced in Python 3.6 (previously done with `format()`). We saw an example of this in the Week 1 tutorial notebook 

In [None]:
string0 = 'Edinburgh is the capital of Scotland'

print(string0, type(string0))

In [None]:
string1 = "edinburgh is the capital of Scotland"

print(string1, type(string1))

In [None]:
string2 = """Edinburgh is 
the capital 
of Scotland"""

print(string2, type(string2))

# change string2 to be single quotes and see if there is a difference 
# you can highlight the full string and type in the single quotes and Jupyter will fill them in on either side for you 

To access elements of a string, we must use square brackets `[]`. When slicing, the colon `:` separates the start index on the left from the end index on the right; the element at the end index itself is not included. If you leave the right hand side empty, the slice will go until the end of the object.

In [None]:
string0[4]

In [None]:
string0[0:12]

In [None]:
string0[13:]
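
A few more slicing patterns are sketched below, using the same text as `string0` under the made-up name `city_fact`. Negative indices and the optional step value are standard Python slicing features.

```python
city_fact = "Edinburgh is the capital of Scotland"  # same text as string0

first_word = city_fact[:9]       # start omitted -> begins at index 0; index 9 is excluded
last_char = city_fact[-1]        # negative indices count from the end
last_word = city_fact[-8:]       # the final 8 characters
every_other = city_fact[0:9:2]   # an optional third value sets the step

print(first_word, last_char, last_word, every_other)
```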

If you want to include a quote character as part of the string itself, you may be tempted to try something as follows:

In [None]:
print('This string contains a single quote (') character.')


But, as you can see this produces a `SyntaxError`. We can do 2 things instead:

1. delimit the string with the other type of quote 
2. use an escape sequence (using the `\` to "escape" from the string)

In [None]:
# option 1 

print("This string contains a single quote (') character.")

print('This string contains a double quote (") character.')


In [None]:
# option 2 
print('This string contains a single quote (\') character.')


To include a literal backslash in a string using escape characters, you need to use two `\\`

In [None]:
print("This string contains a backslash \\ character")

Now lets look at some of the built-in functions described above for working with strings:

In [None]:
# check if Edinburgh is present in the string 
print("Edinburgh" in string0)

In [None]:
# check if Glasgow is NOT present in the string 
print("Glasgow" not in string0)

In [None]:
# we can also use an if-statement. Remember to watch your indentation, as Python uses leading white space to define code blocks (see Week 1 topic 4) 

if "Edinburgh" in string0:
    print("Yes, 'Edinburgh' is present.")

In [None]:
print(string0.find("bur"))

In [None]:
print(string0.find("ee")) # -1 here means it is not found 

Compare `find()` to `index()`:

In [None]:
print(string0.index("bur"))

In [None]:
print(string0.index("ee")) 

With `index()` you can also specify a start and end point as follows `string.index("input"[, start[, end]])`

In [None]:
string0.index("in", 1, 6) 
# here we are searching for "in" somewhere between index positions 1 and 6
## with the output being the index where it is found, if applicable

In [None]:
string0.endswith("y")

In [None]:
string0.endswith("l")

In [None]:
print(string1) # in string1 Edinburgh is not capitalized, but we can change that easily with code! 

print(string1.capitalize())

In [None]:
print(string0.upper())

In [None]:
print(string0.lower())

In [None]:
print(string0.center(70))

With `center()` you can also fill out the spaces with any other character input in the second argument 

In [None]:
print(string0.center(70, "-"))

In [None]:
print(string0.replace("Edinburgh", "Stirling"))  # which is not true today, but it used to be! 

We can concatenate or combine two strings using the `+` operator  

In [None]:
string3 = "Stirling used to be"

In [None]:
print(string0 + ", and " + string3) #change the comma and see what happens

You can also use the `join()` method, which places a separator string between the elements of its input. This is particularly useful with data structures such as lists or tuples of strings (which we will look at next week), as `join()` accepts any iterable of strings. 

In [None]:
help(str.join)

In [None]:
'*'.join(string0) 

In [None]:
# or with list input 
'.'.join(['abc', 'def', 'ghi'])

You can also use a single space `' '` as the separator to join a list of string elements with spaces between them.

In [None]:
text = ['Python', 'is', 'a', 'fun', 'programming', 'language']
print(text)

# join elements of text with space
print(' '.join(text))

To split the elements of a string into a list of strings we can use the `split()` function

In [None]:
string0.split() # notice the output is in [], this is how we know the output is a list data structure

In [None]:
string4 = "             hello            "

In [None]:
# delete all leading and trailing white space
print(string4.strip())

In [None]:
# delete all trailing white space
print(string4.rstrip())

In [None]:
# delete all leading white space

print(string4.lstrip())

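With the `strip()` methods you can also specify which characters should be deleted. The argument is treated as a set of characters, not as a fixed sequence: any combination of the listed characters is removed from the ends of the string until a character that is not in the set is reached, so the order in which you list them does not matter.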

In [None]:
string5 = "   ***----hello---*******     "

In [None]:
string5.strip("*") # this will not remove anything because the outermost characters are spaces, not asterisks

In [None]:
string5.strip(" *")

In [None]:
string5.strip(" *-")
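
The argument of `strip()` is a set of characters to remove from the ends of the string, which the following sketch tries to demonstrate. It reuses the text of `string5` under the made-up name `decorated`.

```python
decorated = "   ***----hello---*******     "  # same text as string5

stripped = decorated.strip(" *-")
same_result = stripped == decorated.strip("-* ")   # listing the characters in another order
only_left = decorated.lstrip(" *-")                # only the leading run is removed
inner_kept = "--he-llo--".strip("-")               # characters inside the string are untouched

print(stripped, same_result)
print(repr(only_left))
print(inner_kept)
```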

**`f-string`** are strings that have an `f` (or `F`) at the beginning and curly braces containing expressions that will be replaced with their values. The `f` in `f-string` is for format, so formatted strings. We saw an example of an `f-string` in Tutorial 1. 

In [None]:
activity = "Data Science"

f"{activity} is fun"

You can also call methods or functions directly within an `f-string`

In [None]:
f"{activity.lower()} is fun"
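
Beyond method calls, an `f-string` can hold any expression, and a format specification after a colon controls how the value is displayed. The sketch below uses made-up values (`patients`, `rate`) purely for illustration.

```python
patients = 1234567   # hypothetical values for illustration
rate = 0.45678

# :,   adds thousands separators
# :.2f shows 2 decimal places
# :.1% multiplies by 100 and adds a percent sign
summary = f"{patients:,} patients, rate {rate:.2f} ({rate:.1%})"
expression = f"{2 + 3} items"   # the expression is evaluated before substitution

print(summary)
print(expression)
```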

#### *A quick aside* 

You can also use `min()` and `max()` as we discussed in the numeric data types section with strings to find the 'smallest' and 'largest' letters in your string object. In this context, smallest means closest to the beginning of the alphabet, and largest means closest to the end of the alphabet. 

**Note** upper case letters come before lower case letters in Unicode, the character set Python uses when comparing strings, so a string with both lower case and upper case letters will produce different output and may not behave as you would expect.

In [None]:
min("abcdefghijklmnopqrstuvwxyz")

In [None]:
max("abcdefghijklmnopqrstuvwxyz")

In [None]:
min("abcdWXYZ")

In [None]:
max("abcdWXYZ")
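
You can inspect the numeric code behind each character with the built-in `ord()` function, which explains the results above. If you would rather compare letters without regard to case, one possible approach (an illustration, not the only option) is to pass `key=str.lower`.

```python
print(ord("Z"), ord("a"))   # 90 97: every upper case letter comes before "a"

mixed_case = "abcdWXYZ"
smallest = min(mixed_case)                                # compares the character codes
smallest_ignoring_case = min(mixed_case, key=str.lower)   # compares lower-cased letters
largest_ignoring_case = max(mixed_case, key=str.lower)

print(smallest, smallest_ignoring_case, largest_ignoring_case)
```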

### 4.1 Regular expressions 

To work with regular expressions in Python, we need the `re` module, which has a set of functions that allows us to search a string for a match:

* `findall()`	returns a list containing all matches in the order they are found, if no matches are found an empty list is returned
* `search()`	returns a Match object if there is a match anywhere in the string. If there is more than one match only the 1st occurrence is returned. If no matches are found, `None` is returned
* `split()` returns a list where the string has been split at each match. The `maxsplit` parameter controls the maximum number of splits 
* `sub()`	replaces one or many matches with an input string. The `count` parameter can be used to control the number of replacements
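
The four functions above can be sketched together as follows. The text is the same as `string0`, stored under the made-up name `sentence`, and the patterns are written as raw strings (with an `r` prefix) so that the backslashes reach the `re` module unchanged.

```python
import re

sentence = "Edinburgh is the capital of Scotland"  # same text as string0

vowels = re.findall(r"[aeiou]", sentence)             # every lower case vowel, in order
first_space = re.search(r"\s", sentence)              # Match object for the first white space
no_match = re.search(r"\d", sentence)                 # there are no digits, so this is None
three_parts = re.split(r"\s", sentence, maxsplit=2)   # at most 2 splits -> 3 pieces
one_sub = re.sub(r"\s", "_", sentence, count=1)       # replace only the first match

print(vowels)
print(first_space.start(), no_match)
print(three_parts)
print(one_sub)
```

Checking whether `search()` returned `None` before calling `.start()` avoids an `AttributeError` when there is no match.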

[This webpage](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf) is an excellent guide for Python regular expressions

In [None]:
# help("re") 
# uncomment above to learn more about the re module 

In [None]:
# just to remind us of the string we were working with at the beginning
print(string0) 

In [None]:
xf0 = re.findall("i", string0)
xf1 = re.findall("y", string0)

print(xf0)
print(xf1)

In [None]:
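s0 = re.search(r"\s", string0)  # the r prefix makes the pattern a raw string (see section 4.2), so the backslash is passed to re unchanged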

print("The first white-space character is located in position:", s0.start())

In [None]:
# or as an fstring 
f"The first white-space character is located in position: {s0.start()}"

In [None]:
print(re.split(r"\B", string0))

In [None]:
print(re.sub(r"\S", "*", string0)) 

# check the regular expression cheat sheet link to confirm what you think the difference is between "\s" and "\S"

### 4.2 Raw string 

Another type of string is the so-called **raw string**, which interprets backslashes as regular characters rather than as the start of an escape sequence. This can be useful when a string needs to contain a backslash, such as for a regular expression or Windows directory path, and you don't want it to be treated as an escape character. 

Despite this key difference between a regular string and a raw string, a raw string is not a distinct data type. This is because a raw string is just a regular string in which each backslash is represented as `\\`.

* You can create a raw string by prefixing a normal string with `r` or `R` (similar to `f-strings` where the `f` input can be upper or lower case) 
* To convert (not cast, as it is not a different data type) a string to its escaped, raw-looking representation use `repr()`, which will return a string wrapped in single quotes 
* A raw string *cannot* end with an odd number of backslashes because the final backslash escapes the closing quotation mark (`'` or `"`).
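
One way to convince yourself that a raw string is still an ordinary `str` is to compare lengths. In this small sketch (the names `normal` and `raw` are made up), the regular string contains a single tab character while the raw string keeps the backslash and the letter as two separate characters.

```python
normal = "a\tb"    # \t is an escape sequence: one tab character
raw = r"a\tb"      # the backslash and the "t" are kept as two ordinary characters

print(len(normal), len(raw))   # 3 4
print(raw == "a\\tb")          # the same text written with an escaped backslash
print(type(raw))               # still str
```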

In [None]:
s = "hi\nHello"

print(s)

In [None]:
rs = r"hi\nHello"

print(rs) # notice the new line character is printed as-is rather than treated as an escape sequence 

In [None]:
print(type(rs)) # there is no "raw string" data type, just str

print(rs == "hi\\nHello")


As I mentioned above, this can be particularly useful when you are representing a Windows path (which is delimited by backslashes `\`) and do not want to escape each backslash as `\\`. Raw strings allow you to write them as is:

In [None]:
path = "C:\\Windows\\system32\\cmd.exe"

rpath = r"C:\Windows\system32\cmd.exe"

print(path)
print(rpath)
print(path == rpath)

In [None]:
# no error here (even number of backslashes)

path2 = 'C:\\Windows\\system32\\' 
print(path2)

In [None]:
# this will produce an error (odd number of backslashes)

rpath2 = r'C:\Windows\system32\' 
print(rpath2)

In [None]:
# this will also produce an error (odd number of backslashes)

rpath3 = r'C:\Windows\system32\\\' 
print(rpath3)

Now let's try to convert a regular string into a raw string:

In [None]:
# from above our regular string is s 

raw_s = repr(s)

print(raw_s) # the output is wrapped in single quotes and the new line escape sequence is shown rather than applied

# print(s) # uncomment this line to compare the two if you need to be reminded from above and do not want to scroll up 

## 5. Categorical

We will be using the `pandas` package to work with categorical data (which we imported at the top of the notebook). 

As a general rule, if you are using categorical data, add some checks to make sure the data is clean and complete before converting to the `pandas` category type. Additionally, it is important to check for `NaN` values after combining or converting dataframes, which we will look at next week.

When working in `pandas`, to check the `pandas` data type (called dtype) we must use the `dtype` attribute: `obj.dtype`.

A few aspects about `Categoricals` are: 

- All values of categorical data are either in `categories` or `np.nan`
- Missing values are not permitted in the Categorical's `categories`, only in values. 
- Order is defined by the order of `categories`, not the lexical order of the values. 
- Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array. This is to say that behind the scenes, categorical data are an array of numerics or **codes**. 
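
The points above can be seen in a small sketch. The `fruit` object and its categories are made up for illustration; the example relies only on `pd.Categorical` and its `categories` and `codes` attributes.

```python
import numpy as np
import pandas as pd

fruit = pd.Categorical(
    ["apple", "banana", "apple", np.nan],
    categories=["banana", "apple", "orange"],   # order is set here, not alphabetically
)

print(fruit.categories)   # the three possible values, in the order given
print(fruit.codes)        # each value's position in categories; -1 marks the missing value
```

Here "banana" has code 0 and "apple" has code 1 because of the order in which the categories were listed, while "orange" is a valid category that simply has no observations.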

In [None]:
c0 = CategoricalDtype(categories = ["a", "b", "c", "d"])

c0

You can use the string `'category'` in place of a `CategoricalDtype` when you want the default behavior of the categories being unordered, and equal to the set of values present in the array. In other words, `dtype='category'` is equivalent to `dtype=CategoricalDtype(None, ordered = False)`.

Two instances of CategoricalDtype compare equal whenever they have the same categories and order. When comparing two unordered categoricals, the order of the categories is not considered. 

In [None]:
c1 = CategoricalDtype(["a", "b", "c"], ordered = False)

# evaluate as equal, since order is not considered when ordered=False
c1 == CategoricalDtype(["b", "c", "a"], ordered = False)

In [None]:
# Evaluate as unequal, since the second CategoricalDtype is ordered
c1 == CategoricalDtype(["a", "b", "c"], ordered = True)

All instances of `CategoricalDtype` compare equal to the string `'category'`.



In [None]:
c1 == 'category'

The function `describe()` can be used to get a summary of your categorical data, producing a nicely formatted table of counts and frequency for each category. 

In [None]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"]) 

cat.describe() # category b is empty, category a appears 1 time, category c twice, and there is one missing value 

In [None]:
print(type(cat))

print(cat.dtype)

In [None]:
print(cat.codes)

Categorical data has a `categories` and an `ordered` property, which list the possible values and whether the ordering matters or not. For a Series of dtype category, these properties are exposed through the `cat` accessor: `series.cat.categories` and `series.cat.ordered`. If you don’t manually specify categories and ordering, they are inferred from the passed arguments. A **pandas Series** is a one-dimensional labelled array and the main data structure in the pandas framework for storing one-dimensional data. We will learn more about Series next week, but the important thing to note for now is that unlike a dataframe, a Series cannot contain multiple columns (i.e., it is 1-dimensional). 

In [None]:
# let's construct a Series from the categorical cat to see these properties 

scat = pd.Series(cat, dtype = "category", copy = True)

print(scat.cat.categories)
print(scat.cat.ordered)

print(scat.dtype)

**Note** when constructing a `Series` from a `Categorical`, the input `Categorical` is not copied. This means that changes to the `Series` will in most cases change the original `Categorical`. To avoid this potential headache, include the `copy = True` argument. 

In [None]:
scat.cat.codes # missing values always have a code of -1 

In [None]:
# you can also specify a Series as category dtype using series.astype("category")
## the outputs of s are the same as in the cell above, but shown to highlight these are just 2 different ways of doing the same thing

scat1 = pd.Series(cat, copy = True)

s = scat1.astype("category")

print(s.cat.categories)
print(s.cat.ordered)
print(s.dtype)

### 5.1 Working with categories 

There are a variety of actions you may wish to take when wrangling and manipulating categorical data. Some of these supported by `pandas` are: 

* `pd.unique()` to see the unique categories 
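* `value_counts()` to see the number of occurrences of each category (available as `Categorical.value_counts()` and `Series.value_counts()`; the top-level `pd.value_counts()` is deprecated in recent pandas versions) 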

* `Categorical.rename_categories()` to rename categories
    - or with `Series`: `Series.cat.rename_categories()` 
* `Categorical.add_categories()` to append new categories  
    - or with `Series`: `Series.cat.add_categories()`
* `Categorical.remove_categories()` to remove specified categories 
    - or with `Series`: `Series.cat.remove_categories()`
* `Categorical.remove_unused_categories()` to remove unused categories (those with no values) 
    - or with `Series`: `Series.cat.remove_unused_categories()`
* `Categorical.set_categories()` to remove and add new categories in 1 step
    - or with `Series`: `Series.cat.set_categories()`
    
* `union_categoricals()` to combine categorical objects that do not necessarily have the same categories. By default, the resulting categories will be ordered as they appear in the data. If you want the categories to be lexically sorted, use the `sort_categories=True` argument. This function also works with `Series` containing categorical data, but the resulting array will always be a plain `Categorical`

When working with ordered categories in particular, there are a few specific operations that are likely to be useful: 

* `Categorical.as_ordered()` to set categorical data to be ordered. This will by default return a new object
    - `as_unordered()` to set ordered categorical data to be unordered. This will by default return a new object 
    - or with `Series`: `Series.cat.as_ordered()` or `Series.cat.as_unordered()`
* `Categorical.min()` / `Categorical.max()` to see the first and last category in an ordered categorical, respectively. Raises a `TypeError` if applied to a categorical that is unordered
* `Categorical.sort_values()` to sort values in the object
* `Categorical.reorder_categories()` to reorder categories. All old categories must be included in the new categories and no new categories are allowed. This will necessarily make the sort order the same as the categories order.
    - or with `Series`: `Series.cat.reorder_categories()`
  

In [None]:
# first let's create a categorical object of unintentional injury types to work with 

injury_cat = pd.Categorical(["Fall", "Transport accident", "Crushing", "Scakld", "Accidental exposure"]*4)

print(injury_cat) # we can see that we have 20 observations (length = 20)

In [None]:
# to see the categories 

injury_cat.categories

In [None]:
injury_cat.ordered

In [None]:
injury_cat.describe()

In [None]:
# to see the unique values in the data 

pd.unique(injury_cat)

In [None]:
# to see the number of duplicate values for each category 

pd.value_counts(injury_cat)

Looking through the summaries of our object, it looks like there is a typo or input error in the "Scald" category (it appears as "Scakld"). We can fix this by renaming the category using `Categorical.rename_categories()`

In [None]:
# map the old name to the new one; a list would instead be matched by position against injury_cat.categories
injury_cat = injury_cat.rename_categories({"Scakld": "Scald"})

print(injury_cat)
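
It is worth knowing how `rename_categories()` matches old and new names. The sketch below uses a small made-up fruit example (the names `demo`, `fixed`, and `mislabelled` are illustrative and not used elsewhere in this notebook). When `pandas` infers categories from sortable values it stores them sorted, and a list of new names is applied by position against that sorted list, whereas a dict only touches the category it names.

```python
import pandas as pd

# Hypothetical data with one misspelled value ("appel")
demo = pd.Categorical(["pear", "appel", "fig"])

# Inferred categories are stored in sorted order, not input order
print(list(demo.categories))   # ['appel', 'fig', 'pear']

# A dict renames only the category it names
fixed = demo.rename_categories({"appel": "apple"})
print(list(fixed))             # ['pear', 'apple', 'fig']

# A list is matched by position against the sorted categories,
# so writing the names in input order silently relabels the values
mislabelled = demo.rename_categories(["pear", "apple", "fig"])
print(list(mislabelled))       # ['fig', 'pear', 'apple']
```

If you do pass a list, check `.categories` first so that the new names line up with the existing order.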

We can also do this by first making our object a Series:

In [None]:
injury_s = pd.Series(injury_cat, dtype = "category", copy = True)

injury_s.describe() # notice that describe() here prints a different sort of output than with injury_cat
## this is due to difference in data structure! 
## injury_cat is a Categorical where as injury_s is a Series data structure with dtype category 

In [None]:
# a dict renames only the named category; keys that are not existing categories are ignored
injury_s = injury_s.cat.rename_categories({"Scakld": "Scald"})

print(injury_s)

In Python, categories do not need to be strings: 

In [None]:
injury_s1 = injury_s.cat.rename_categories([1, 2, 3, 4, 5])

print(injury_s1)

However, categories must be unique or a `ValueError` is raised 

In [None]:
injury_s.cat.rename_categories([1, 1, 1, 1, 1]) # this will produce an error

Unlike in R, when renaming categories in Python, they cannot be `NaN` else a `ValueError` is raised. Missing values are not permitted in the Categorical's `categories` (hence not when renaming categories), only in values. 

In [None]:
injury_s.cat.rename_categories([1, 2, 3, 4, np.nan]) # this will produce an error

Above we made `cat` which had an `np.nan` value, but note the object only has 3 categories, though 4 values. 

In [None]:
print(cat)

In [None]:
# append a new category 

injury_s = injury_s.cat.add_categories([4])

print(injury_s)

In [None]:
pd.value_counts(injury_s)

As we can see from `value_counts`, category 4 has been added, but it is unused. So let's remove it 

In [None]:
injury_srem = injury_s.cat.remove_unused_categories()

pd.value_counts(injury_srem)

You can also remove specified categories, empty or not 

In [None]:
injury_srem1 = injury_s.cat.remove_categories(["Fall"])

injury_srem1.cat.categories

**Note** `Categorical.set_categories()` cannot know whether some category is omitted intentionally or because it is misspelled. This can result in surprising behaviour!

In [None]:
injury_s.cat.categories # just to remind ourselves which categories injury_s currently includes 

In [None]:
injury_set = injury_s.cat.set_categories(["Fall", "Transport accident", "Crushing", "Scald", "Accidental exposure", "Other"]) 

# in 1 go we are removing 4 and adding "Other"

injury_set.cat.categories
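
To make the "surprising behaviour" mentioned in the note above concrete, here is a minimal sketch using a small made-up colour Series (the names `colours` and `changed` are illustrative only). A misspelled category in `set_categories()` does not raise an error: the original category is simply dropped and its values become missing.

```python
import pandas as pd

colours = pd.Series(["red", "green", "red"], dtype="category")

# "gren" is a deliberate misspelling of "green"
changed = colours.cat.set_categories(["red", "gren"])

print(changed)                  # the "green" value is now NaN
print(changed.isna().tolist())  # [False, True, False]
```

Checking `isna().sum()` before and after a call to `set_categories()` is one simple way to spot this.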

To combine categoricals with the same or different categories, you can use `union_categoricals()` 

In [None]:
a_cat = pd.Categorical(["b", "c"])
b_cat = pd.Categorical(["a", "b"])

combo_cat = union_categoricals([a_cat, b_cat])

combo_cat

In [None]:
# to make resulting categories lexically sorted 

combo_cat1 = union_categoricals([a_cat, b_cat], sort_categories = True)

combo_cat1

When combining ordered categoricals, the categories and their order must be identical, else a `TypeError` is raised. You can combine ordered categoricals with different categories or orderings using the `ignore_order = True` argument; the result is then an unordered categorical.

In [None]:
c_cat = pd.Categorical(["a", "b", "c"], ordered=True)

d_cat = pd.Categorical(["c", "b", "d"], ordered=True)

In [None]:
union_categoricals([c_cat, d_cat], ignore_order=True)

In [None]:
union_categoricals([c_cat, d_cat]) # produces an error 

### 5.2 Ordinal categorical data 

As mentioned above, when creating categorical data or a Series you can specify if it is ordinal using `ordered = True`. However, sometimes you have read in data and want to set the categorical data to be ordered. Let's use `injury_cat` again

In [None]:
print(injury_cat)

In [None]:
# confirm it is currently not ordered 

print(injury_cat.ordered) 

In [None]:
injury_catorder = injury_cat.as_ordered()

print(injury_catorder)
# notice the order shown (e.g. a < b < c) follows the order of injury_cat.categories, not the order of the values

In [None]:
print(injury_catorder.ordered)

In [None]:
# to see the categories 

injury_catorder.categories

In [None]:
# to see the first and last category in an ordered categorical

injury_catorder.min(), injury_catorder.max()

In [None]:
# to unorder your object 

injury_catorder.as_unordered()

You can also sort the values in your categorical according to the order of its categories. For this, let's use `injury_set`

In [None]:
# to remind us of how it looks now 

print(injury_set)

In [None]:
# by default, values are sorted according to the order of the categories 
injury_set.sort_values()

In [None]:
# reordering 
injury_set_reorder = injury_set.cat.reorder_categories(['Scald', 'Transport accident', 'Crushing', 'Accidental exposure', 'Fall', 'Other'], ordered = True)

print(injury_set_reorder)

In [None]:
# and to sort the values according to the reordering 

injury_set_reorder.sort_values()

## 6. Date and time 

To work with date and time data, we need the `datetime` module. This was loaded at the top of the notebook as `dt`. We will also be using [`strftime`](https://strftime.org/) to help us navigate date and time objects. 

As discussed in the week 2 content around date and time data, these data can be very complicated. To help avoid communication mistakes, the International Organization for Standardization (ISO) developed ISO 8601. This standard specifies that all dates should be written in order of most-to-least-significant data. This means the format is year, month, day, hour, minute, and second: `YYYY-MM-DD HH:MM:SS`. We can use ISO 8601 format with Python using `datetime`.

To create a `datetime` instance there are multiple methods: 

* `dt.date(year, month, day)`, `dt.time(hour, minute, second)`, or `dt.datetime(year, month, day, hour, minute, second)`. Each argument must be an integer; `year`, `month`, and `day` are required, while the time components default to 0 if omitted
* `dt.date.fromisoformat()` or `dt.datetime.fromisoformat()` to create a date or datetime object from an ISO 8601 formatted string. To use this method, you must provide a string as input 

When you do not know in advance what information will be your input, the following alternative initializers are helpful: 
* `dt.date.today()` creates a `datetime.date` instance with the current local date
* `dt.datetime.now()` creates a `datetime.datetime` instance with the current local date and time
* `dt.datetime.combine()` combines instances of `datetime.date` and `datetime.time` into a single `datetime.datetime` instance



In [None]:
date0 = dt.date(year = 2024, month = 6, day = 1) # remove day and see what happens 

print(date0, type(date0))

In [None]:
time0 = dt.time(hour = 15, minute = 30, second = 43)

print(time0, type(time0))

In [None]:
datetime0 = dt.datetime(year = 2024, month = 6, day = 1, hour = 15, minute = 30, second = 43)

print(datetime0, type(datetime0))

In [None]:
# you can get the current or now time 

now = dt.datetime.now()

print(now, type(now))

In [None]:
# you can get the current or now date

today = dt.date.today()

print(today, type(today))

In [None]:
# using strftime() to get useful bits of information from datetime objects

now.strftime("%d") ## try out some other options! (see the link at the top of this section)

In [None]:
# is our datetime object naive or aware?
## we can determine this by looking at the timezone information 

print(now.tzinfo) # tzinfo is an attribute, not a method, so no parentheses

The object `now` is naive as `datetimeobject.tzinfo` is `None` 
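
The same check can be wrapped in a small helper. This is only a sketch: `is_aware`, `naive_example`, and `aware_example` are illustrative names, and it follows the rule given in the Python `datetime` documentation that an object is aware when `tzinfo` is not `None` and `tzinfo.utcoffset()` does not return `None`.

```python
import datetime as dt

def is_aware(moment):
    """Return True if `moment` carries usable time zone information."""
    return moment.tzinfo is not None and moment.tzinfo.utcoffset(moment) is not None

naive_example = dt.datetime(2024, 6, 1, 15, 30)
aware_example = dt.datetime(2024, 6, 1, 15, 30, tzinfo=dt.timezone.utc)

print(is_aware(naive_example))  # False
print(is_aware(aware_example))  # True
```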

In [None]:
current_time = dt.time(now.hour, now.minute, now.second)

print(dt.datetime.combine(today, current_time))

In [None]:
dt.date.fromisoformat("2023-06-01")
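
Going the other way, `isoformat()` turns a `datetime` object back into an ISO 8601 string. The sketch below (with the illustrative names `stamp` and `iso_text`) shows a round trip. Note that `isoformat()` places a `T` between the date and the time by default, while printing the object shows a space.

```python
import datetime as dt

stamp = dt.datetime(2024, 6, 1, 15, 30, 43)

iso_text = stamp.isoformat()
print(iso_text)                    # 2024-06-01T15:30:43
print(stamp.isoformat(sep=" "))    # 2024-06-01 15:30:43

# parsing the string gives back an equal datetime object
print(dt.datetime.fromisoformat(iso_text) == stamp)  # True
```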

If your date is not already in ISO format, we can use `strftime` format codes to write a format string that describes its layout, and then pass the date string together with this format string to `dt.datetime.strptime()`

In [None]:
date_string = "01-31-2024 14:45:27"
format_string = "%m-%d-%Y %H:%M:%S"

In [None]:
ISOnow = dt.datetime.strptime(date_string, format_string)

print(ISOnow, type(ISOnow))

### 6.1 Working with timezones 


The Python `datetime.tzinfo` documentation recommends using a third-party package called `dateutil` to manage timezones. At the top of the document we imported `tz` from `dateutil`. 

* `datetimeobject.tzname()` to print the name of the timezone attached to a `datetime` object 
* `tz.tzlocal()` to get a concrete instance of `datetime.tzinfo` using your system's local time 
* `tz.gettz()` to create time zones not reported by your system. You must pass the official [IANA name](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) for the input time zone of interest. **Note** this is the recommended method when using Noteable to ensure your local timezone is used

In [None]:
now1 = dt.datetime.now(tz = tz.tzlocal())

print(now1)
print(now1.tzname())

In [None]:
Edinburgh_tz = tz.gettz("Europe/London")

now2 = dt.datetime.now(tz = Edinburgh_tz)

print(now2)
print(now2.tzname())

### 6.2 Arithmetic with `datetime` objects

Python `datetime` instances support several types of arithmetic, which use `dt.timedelta` instances to represent time intervals. `timedelta` instances support addition and subtraction as well as positive and negative integers for all arguments. You can even provide a mix of positive and negative arguments.

In [None]:
# let's reuse now2 from above, where we made an object with the current datetime info using the Europe/London timezone 

tomorrow = dt.timedelta(days = +1)

now2 + tomorrow

You can also use negative values as input arguments in `timedelta`

In [None]:
yesterday = dt.timedelta(days = -1)

now2 + yesterday

In [None]:
# what if we wanted to add 4 days and subtract 5 hours from now 

delta = dt.timedelta(days = +4, hours = -5)

now2 + delta
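
When you mix positive and negative arguments, `timedelta` normalises them into a single interval stored as days, seconds, and microseconds. A small sketch (the name `delta_demo` is illustrative only):

```python
import datetime as dt

delta_demo = dt.timedelta(days=4, hours=-5)

print(delta_demo)          # 3 days, 19:00:00
print(delta_demo.days)     # 3
print(delta_demo.seconds)  # 68400, i.e. 19 hours

# the same interval written as hours only
print(delta_demo == dt.timedelta(hours=91))  # True
```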

The above approach is limited, however, because `timedelta` cannot represent intervals such as a month or a year, whose length in days varies (its largest unit is weeks). This is where `relativedelta` comes to save the day (or month, or year)!

In [None]:
delta2 = relativedelta.relativedelta(years = +4, months = +1)

now + delta2
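
To see why calendar-aware arithmetic matters, the sketch below compares adding one month with `relativedelta` against adding a fixed 30 days with `timedelta`. The date and the name `jan31` are chosen purely for illustration, and the example assumes `dateutil` is installed, as it is for the rest of this section.

```python
import datetime as dt
from dateutil import relativedelta

jan31 = dt.date(2024, 1, 31)

# one calendar month later: there is no 31 February, so the day is clipped
print(jan31 + relativedelta.relativedelta(months=+1))  # 2024-02-29

# a fixed interval of 30 days lands on a different date
print(jan31 + dt.timedelta(days=30))                   # 2024-03-01
```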

You can also use `relativedelta` to calculate the difference between 2 `datetime` instances instead of using the subtraction operator. 

In [None]:
print(now2) # to remind ourselves 
print(now2.tzname())

In [None]:
summer_solstice = dt.datetime(year = 2024, month = 6, day = 20, hour = 21, minute = 50)

print(summer_solstice)

In [None]:
relativedelta.relativedelta(now2, summer_solstice)

The above produces an error... but why? If you read the `TypeError` closely you will get a good hint.

---


The answer lies in the naive and aware status of `datetime` objects! We cannot compare a naive and aware `datetime` object, and `now2` is aware! `now` is naive, so we will use that.

In [None]:
time_till_solstice0 = relativedelta.relativedelta(summer_solstice, now)


print(time_till_solstice0, type(time_till_solstice0))

In [None]:
## or using subtraction operator with naive objects 

time_till_solstice1 = summer_solstice - now

print(time_till_solstice1, type(time_till_solstice1))
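
The naive versus aware clash can be reproduced with the standard library alone. In this sketch (the names `naive_dt`, `aware_dt`, and `mixing_failed` are illustrative), subtracting a naive object from an aware one raises a `TypeError`, which we catch so the cell keeps running.

```python
import datetime as dt

naive_dt = dt.datetime(2024, 6, 20, 21, 50)
aware_dt = dt.datetime(2024, 6, 1, 12, 0, tzinfo=dt.timezone.utc)

try:
    aware_dt - naive_dt
    mixing_failed = False
except TypeError as err:
    mixing_failed = True
    print("TypeError:", err)

print(mixing_failed)  # True
```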

### 6.2.1 Solutions with 2 aware objects 

To make summer solstice aware we can use `datetimeobject.replace(tzinfo = timezone)`

In [None]:
summer_solstice_aware = summer_solstice.replace(tzinfo = Edinburgh_tz)

print(summer_solstice_aware.tzname())
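
Be aware that `replace(tzinfo = ...)` only attaches a time zone label and leaves the clock time unchanged, whereas `astimezone()` converts an aware object to another time zone. The sketch below uses a fixed UTC+1 offset as a stand-in time zone (the names `utc_noon`, `plus_one`, `relabelled`, and `converted` are illustrative only).

```python
import datetime as dt

utc_noon = dt.datetime(2024, 6, 20, 12, 0, tzinfo=dt.timezone.utc)
plus_one = dt.timezone(dt.timedelta(hours=1))  # stand-in fixed UTC+1 zone

relabelled = utc_noon.replace(tzinfo=plus_one)  # same clock time, different instant
converted = utc_noon.astimezone(plus_one)       # same instant, different clock time

print(relabelled.hour, converted.hour)  # 12 13
print(relabelled == utc_noon)           # False
print(converted == utc_noon)            # True
```

Using `replace()` is appropriate when, as with `summer_solstice`, the naive clock time is already expressed in the time zone you are attaching.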

In [None]:
## solution using aware objects and relativedelta 

time_till_solstice2 = relativedelta.relativedelta(summer_solstice_aware, now2)


print(time_till_solstice2, type(time_till_solstice2))

In [None]:
## solution using subtraction operator with aware objects 
time_till_solstice3 = summer_solstice_aware - now2

print(time_till_solstice3, type(time_till_solstice3))

## 7. Missing data 

`None` is its own datatype in Python, whereas `np.nan` is a special floating-point value representing missing data. 

In [None]:
type(None)

In [None]:
type(np.nan)
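
One practical consequence of `np.nan` being a floating-point value is that it is not equal to itself, so `==` cannot be used to find missing values. A minimal sketch, in which `pd.isna()` detects both kinds of missing marker (the names `nan_equals_itself` and `missing_flags` are illustrative only):

```python
import numpy as np
import pandas as pd

nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)  # False

missing_flags = [pd.isna(np.nan), pd.isna(None), pd.isna(0.0)]
print(missing_flags)      # [True, True, False]
```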

`pandas` methods (on Series and DataFrame objects) for working with missing data all work with `pandas` Category data type. These methods include 

* `pdobj.isna()` or `pdobj.isnull()` used to detect missing values in an array-like object, produces a boolean array output 
* `pdobj.notna()` or `pdobj.notnull()` used to detect non-missing values in an array-like object, produces a boolean array output 
* `pdobj.fillna()` used to replace missing values with a specified value. A copy of the data object is returned with missing values filled or imputed. There is a `pd.DataFrame.fillna()` method we will be exploring further next week
    - argument `method = "ffill"` will forward-fill to propagate the previous value forward (newer versions of `pandas` deprecate this argument in favour of `pdobj.ffill()`)
    - argument `method = "bfill"` will back-fill to propagate the next value backward (or `pdobj.bfill()` in newer versions)
* `Categorical.dropna()` to return a copy of the Categorical object with missing values removed. As above, there is a `pd.DataFrame.dropna()` method we will be exploring further next week 


When working with Categorical's `codes`, missing values will always have a code of `-1`. 
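
A quick sketch of this, using a small made-up categorical (the name `codes_demo` is illustrative only):

```python
import numpy as np
import pandas as pd

codes_demo = pd.Categorical(["a", "b", np.nan, "a"])

print(list(codes_demo.categories))  # ['a', 'b'] (NaN is not a category)
print(codes_demo.codes.tolist())    # [0, 1, -1, 0]
```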

In [None]:
print(cat) # remind ourselves of the categorical object we made above with a missing value

In [None]:
cat.isna()

In [None]:
cat.notna()

In [None]:
cat.fillna("c")

**Note** when using `.fillna()` with Categorical data types, you must use a category that already exists within the object; you cannot specify a new category

In [None]:
cat.fillna("missing") # this produces an error
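
One possible workaround, sketched here on a small made-up Series (the names `gaps` and `filled` are illustrative only), is to add the new category first and then fill with it:

```python
import numpy as np
import pandas as pd

gaps = pd.Series(["a", "b", np.nan], dtype="category")

# add the category first, then it becomes a valid fill value
filled = gaps.cat.add_categories(["missing"]).fillna("missing")

print(list(filled))                 # ['a', 'b', 'missing']
print(list(filled.cat.categories))  # ['a', 'b', 'missing']
```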

In [None]:
cat.dropna()

In [None]:
## detecting null values in a non-categorical type Series 

datamix = pd.Series([1, np.nan, 'hello', 56.8, None]*2)

In [None]:
print(datamix, type(datamix))

In [None]:
datamix.isna()

In [None]:
datamix.notna()

In [None]:
# to get a subset of the data that is not null specifically 

datamix[datamix.notnull()]

In [None]:
datamix.dropna()

In [None]:
# to replace all missing data with 0 

datamix.fillna(0)

In [None]:
# forward-fill 

datamix.ffill() # same result as fillna(method = "ffill"), which newer versions of pandas deprecate

In [None]:
# back-fill 

datamix.bfill() # same result as fillna(method = "bfill"), which newer versions of pandas deprecate

That's all for now around missing data - we will return to this topic next week in greater detail with different data structures. 

---

## 8. You did it! 🎉 

Well done for making it to the end of this notebook. If you have not done so yet, move to the RMarkdown data types notebook next. 

---
*Dr. Brittany Blankinship (2024)*