# Getting Started With Python - Part II

Part 2 of this workshop will expand on the concepts we learned last week to explore how to create a custom object type and simple 'package'. We will begin by learning how to read and write from files in Python before creating a class that implements some custom file parsing operations. Finally, we will have a quick gallery-style look at some established Python packages for data analysis in different fields. 

Specifically, we will cover:

- Reading from and writing to files in Python
- Importing packages and understanding namespaces
- Creating a simple class in Python

## 2.0 Lesson I review

Last week, we worked through some introductory Python in the Jupyter environment. We learned that within Jupyter, we can write code and text within *cells*, and that these cells could be set to Code or Markdown (text) using the dropdown menu at the top (or by using `Esc -> Y -> Enter` for Code and `Esc -> M -> Enter` for Markdown). Finally, once we had written the text/code we wanted in a cell, we could use `Shift + Enter` to execute/render the contents of a cell.

We started with a review of a few simple object types: integers, floats, and strings: 

In [1]:
my_int = 3
my_float = 3.0
my_str = 'These are some words.'

print(type(my_int), type(my_float), type(my_str))

<class 'int'> <class 'float'> <class 'str'>


We learned that different object types have different *methods* and *attributes* associated with them, which we can call on by adding a `.` followed by the method/attribute name at the end of the object:

In [2]:
print(my_str.upper())
print(my_str.title())
print(my_int.denominator)

THESE ARE SOME WORDS.
These Are Some Words.
1


Note that attributes like `denominator` do not require parentheses, since they are pieces of information and not functions.

All the methods and attributes belonging to an object can be returned using the `dir` function. This will also list out lots of methods/attributes that start and end with two underscores; these are more advanced and can usually be ignored. 

In [3]:
print(dir(my_str))

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
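
To make the output of `dir` easier to scan, the double-underscore names can be filtered out with a loop. This is only a small sketch; the variable name `public_names` is made up for illustration.

```python
my_str = 'These are some words.'

# Keep only the names that do not start with two underscores
public_names = []
for name in dir(my_str):
    if not name.startswith('__'):
        public_names.append(name)

print(public_names)
```

What remains is the list of 'everyday' string methods such as `upper` and `title`.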


We learned about object types that can store multiple values, like lists and dicts:

In [4]:
my_list = [1, 3, 6]
fruit_counts = {'apple': 3, 'banana': 5}

Finally, we learned about loops, which let us repeat operations over multiple values in objects like the above:

In [5]:
for elem in my_list:
    print(elem)

1
3
6


## 2.1 Reading from and writing to files in Python

To start today's lesson, we're going to cover a key Python skill for all fields of research: reading from and writing to files.

### 2.1.1 Writing to files

Most of the time, data analysis workflows start with reading in some data from one or more files and conclude with _writing_ data (e.g. filtered datasets, model outputs) to files as well. 

To interact with files in Python, we first have to create what is referred to as a *file object*. This is a temporary variable (like the `i` and `elem` variables we used in our for loops earlier) that includes special methods and attributes specific to reading from and writing to files.

The best and safest way to create a file object in Python is using the `with` keyword along with the `open` function. Let's work through this to write some sample text to a file:

In [6]:
with open('my_file.txt', 'w') as f:
    f.write('these are some words')

Using the file tree on the left side of the Jupyter interface, we can now see that Python has created a file called `my_file.txt`. Double clicking on it will open the file in a new Jupyter tab, and will show that this file indeed contains the specified text.

Let's break down what happened here. The `open` function takes in two arguments: first, a filename, and second, the **mode**. The mode determines what Python does to the file. The three modes we're covering today are:

1. `'w'` - **w**rite mode. This means Python is primed to _write_ to the specified filename.
2. `'r'` - **r**ead mode. This means Python is primed to _read_ from the specified filename.
3. `'a'` - **a**ppend mode. This means Python is primed to _write_, but _will not overwrite file contents_ - it will instead append anything it's told to write at the bottom of the file. 

Next, the file we are interacting with using `open` is temporarily assigned to `f`. This means that for the duration of the indented code immediately afterwards (like in a for loop), `f` is an object with several special methods and attributes we can use to read/write to the file. Finally, we indent the next line and use the `.write` method, which takes in a string, to specify the text we want to write.
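
The difference between `'w'` and `'a'` is easiest to see side by side. The sketch below is illustrative only: it works in a temporary folder (using the standard `tempfile` and `os` modules, which are not otherwise part of this section), and the file name `modes_demo.txt` is made up.

```python
import os
import tempfile

# Work in a throwaway folder so no existing files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'modes_demo.txt')

    with open(demo_path, 'w') as f:   # 'w' creates the file
        f.write('first')

    with open(demo_path, 'a') as f:   # 'a' adds to the end
        f.write(' second')

    with open(demo_path, 'r') as f:
        after_append = f.read()

    with open(demo_path, 'w') as f:   # 'w' again wipes the old contents
        f.write('third')

    with open(demo_path, 'r') as f:
        after_overwrite = f.read()

print(after_append)     # first second
print(after_overwrite)  # third
```

The takeaway is that opening an existing file in `'w'` mode discards whatever was in it, so `'a'` is the safer choice when adding to a file you want to keep.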

Note that the `write` method specifically takes in a string. We can convert other object types such as floats and integers to strings using the `str()` function. Let's append some numbers to our file:

In [7]:
with open('my_file.txt', 'a') as f:
    f.write(str(3))

If we open the file again, we notice that Python added the number `3` directly at the end of the existing line. In instances like this, we have to explicitly tell Python to start a new line using the special `\n` character.

In [8]:
with open('my_file.txt', 'a') as f:
    f.write('\n' + str(3))
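
The same pattern extends to writing several values, one per line. Here is a minimal sketch, assuming a made-up list called `values` and a throwaway file in a temporary folder so that `my_file.txt` is left alone:

```python
import os
import tempfile

values = [3, 4.5, 6]

with tempfile.TemporaryDirectory() as tmp_dir:
    numbers_path = os.path.join(tmp_dir, 'numbers.txt')

    with open(numbers_path, 'w') as f:
        for value in values:
            # write() needs a string, and will not add a line break for us
            f.write(str(value) + '\n')

    with open(numbers_path, 'r') as f:
        written_text = f.read()

print(written_text)
```

Putting the `\n` at the end of each value (rather than the start) means the file also ends with a line break, which is what most tools expect.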

### 2.1.2 Reading from files

Reading from files works similarly - the only difference is that we change the mode to `'r'`, which 'unlocks' methods for reading instead. Note that if the mode does not match the method (e.g. if you set it to `'r'` and use the `.write()` method), Python will raise an error, because a file opened for reading cannot be written to.
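
To see what that error looks like without touching `my_file.txt`, here is a small sketch that uses a made-up file in a temporary folder. It also uses `try`/`except`, which we have not covered, purely so the example can keep running after the error:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'read_only_demo.txt')

    with open(demo_path, 'w') as f:
        f.write('some text')

    try:
        with open(demo_path, 'r') as f:
            f.write('more text')   # not allowed in read mode
    except io.UnsupportedOperation as error:
        error_text = str(error)

print(error_text)
```

The message Python reports here is that the file is not writable.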

Let's read from `my_file.txt`, the file we wrote to in the previous section:

In [9]:
with open('my_file.txt', 'r') as f:
    print(f.read())

these are some words3
3


We can of course save the contents of what's been read in to an object:

In [10]:
with open('my_file.txt', 'r') as f:
    contents = f.read()
    
print(contents)

these are some words3
3


However, note that `.read()` saves *all* the contents of the file into a *single string*. Usually, we will want to use something like `.readlines()` instead, which will pull in each line into a separate string:

In [11]:
with open('my_file.txt', 'r') as f:
    content_list = []
    for line in f.readlines():
        content_list.append(line)
    
print(content_list)

['these are some words3\n', '3']


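Since trailing `\n` characters are so common, strings have a method called `.rstrip()` that removes them. Called with no arguments, it strips *all* whitespace (spaces, tabs, and line breaks) from the right-hand end of the string, not just `\n`: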

In [12]:
with open('my_file.txt', 'r') as f:
    content_list = []
    for line in f.readlines():
        content_list.append(line.rstrip())
        
print(content_list)

['these are some words3', '3']
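
If trailing spaces or tabs matter in your data, `.rstrip()` can be told exactly which characters to remove. A quick sketch, using a made-up line with extra whitespace at the end:

```python
line = 'these are some words  \t\n'

stripped_all = line.rstrip()          # removes spaces, tab, and line break
stripped_newline = line.rstrip('\n')  # removes only the line break

print(repr(stripped_all))      # 'these are some words'
print(repr(stripped_newline))  # 'these are some words  \t'
```

`repr` is used here only so that the otherwise invisible whitespace shows up in the printed output.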


Most field-specific packages will have special functions to read in specific data types (e.g. `json` dumps from webservers, `csv` files containing tabular data, etc) but under the hood, the same principles will nearly always apply, and it's likely these special functions are simply doing what we just learned but with a bit more file-specific code tacked on.
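
As a rough illustration of that idea, the sketch below reads a tiny comma-separated file twice: once by hand with the tools from this section, and once with `csv.reader` from Python's built-in `csv` module. The file name and its contents are made up, and the file lives in a temporary folder.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'fruit_counts.csv')

    # Write a tiny made-up table by hand
    with open(csv_path, 'w') as f:
        f.write('fruit,count\n')
        f.write('apple,3\n')
        f.write('banana,5\n')

    # Parse it manually: strip the line break, then split on commas
    manual_rows = []
    with open(csv_path, 'r') as f:
        for line in f.readlines():
            manual_rows.append(line.rstrip('\n').split(','))

    # Let the csv module do the same job
    with open(csv_path, 'r', newline='') as f:
        csv_rows = list(csv.reader(f))

print(manual_rows)
print(manual_rows == csv_rows)  # True
```

For a simple file like this the two approaches agree. The dedicated reader earns its keep on messier files, for example when a value itself contains a comma inside quotes.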

## 2.2 Packages and namespaces

If you've ever seen a Python script before, chances are the first few lines of the script have looked something like this:

```python
import os
import sys
from math import pi
```

What does all this mean? What we see here is in some ways the bread and butter of what makes Python such an excellent, flexible, and popular programming language for research purposes: custom packages. Python allows anyone to bundle functions and even custom object classes designed for specific data types into packages. These packages, once installed, can be loaded into a workspace using the `import` keyword.

Let's have our first look at this using the `os` package, which contains some useful operating system level operations like listing out the files in a folder. This package comes pre-installed with Python. Let's give it a go:

In [13]:
import os

Using `import` yields no output, but we now have a host of functions we can use that belong to this package. For example, the `listdir()` function from `os` lists out the contents of our current directory (e.g. the folder in which your Jupyter notebook is). 

However, we cannot just type out `listdir()`:

In [14]:
listdir()

NameError: name 'listdir' is not defined

Python will be confused and claim that it doesn't know what `listdir()` is. This is because Python is very specific about **namespaces**. A namespace can be thought of as a collection of 'names' - object types, functions, methods, attributes - that all belong to a specific thing. Namespaces allow for Python development to be much cleaner and more organized.

In this case, we have to tell Python that we mean to use the `listdir()` function from the `os` namespace. We've already seen the syntax for how to work with namespaces - it's the same as how we called methods and attributes earlier! Let's now call on `listdir()` from the `os` package:

In [15]:
os.listdir()

['my_file.txt',
 'getting-started-1.ipynb',
 'getting-started-2.ipynb',
 '__pycache__',
 'README.md',
 '.gitignore',
 '.ipynb_checkpoints',
 'fileparser.py',
 '.git']

Why is this so important? Well, it means that more than one function/object named `listdir` can exist in the same Python environment, and both are kept separate by their namespace:

In [16]:
listdir = ['desktop', 'documents'] # should probably be called list_of_dirs! 

print(listdir)
print(os.listdir())

['desktop', 'documents']
['my_file.txt', 'getting-started-1.ipynb', 'getting-started-2.ipynb', '__pycache__', 'README.md', '.gitignore', '.ipynb_checkpoints', 'fileparser.py', '.git']


We can also directly import certain parts of a package using slightly different syntax. For example, if we were to import Python's `math` library, it contains a function called `math.log10`:

In [17]:
import math
print(math.log10(5))

0.6989700043360189


Although this is usually best avoided for individual functions, we *could* load in the `log10` function directly.

Before running this next cell, be sure to **restart your kernel**. This will keep your notebook intact, but reset your Python kernel, which means Python will forget all objects you've created and packages you've imported - as if you just opened Python for the first time. This can be done in Jupyter from `Kernel -> Restart Kernel` in the menu bar up top. 

Let's load in `log10` directly:

In [18]:
from math import log10

log10(5)

0.6989700043360189

We can now use the function without making the namespace explicit; in fact, doing so raises an error, claiming `math` is not defined:

In [None]:
math.log10(5)

This syntax is useful in two instances:

1. Some packages are super complex and contain submodules with their own sets of functions. In that case, we can use the `from` syntax to just import those submodules. For instance, the massive `Bio` package has a big submodule called `SeqIO` - to avoid having to type out `Bio.SeqIO` every time, we can just use `from Bio import SeqIO`. This is the more common and more widely accepted use of this import method. 
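2. You only need one or two names from a package and would like to use them without the prefix. Note that `from package import name` still loads the whole package behind the scenes, so this saves typing rather than memory. Still best avoided but somewhat justifiable in a pinch. 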
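
The first case in the list above can be tried without installing anything. In this sketch, the `path` module that lives inside `os` stands in for `Bio.SeqIO`; the folder and file names are made up.

```python
from os import path  # import just `path` from the `os` package

# `path` can now be used without the `os.` prefix
full_name = path.join('data', 'my_file.txt')
file_only = path.basename(full_name)

print(file_only)  # my_file.txt
```

Because `path` is itself a namespace, its functions still have to be called as `path.join` and `path.basename`, which keeps it clear where they come from.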

Let's restart our kernel once more to remove the standalone `log10` function before moving on.

One final thing to note is that the `dir()` function can also be used to check on the contents of a package:

In [None]:
import math
print(dir(math))

## 2.3 Creating a class/'package' in Python

For our final section of this workshop, we are going to use the read/write operations we learned earlier to create a more feature-rich file parser. In doing so, we will learn both how classes are made in Python as well as what's actually happening in the background when we use the `import` keyword. 

Here, we are going to create a simple class called `fp`, for 'fileparser'. A **class** in Python is a custom object that has its own user-defined methods and attributes. All classes have one special default method, called `__init__` (two underscores, init, two underscores) that Python will run when we make a new object using our class.

Although the custom class will be written in code cells for the remainder of this lesson material, we will actually be writing it in a new text file *during* the lesson. We can open a text file in Jupyter using `File -> New -> Text File`. If you are using Jupyter Notebook instead of Jupyter Lab, head back to the tab that showed a list of files and use the `New` dropdown on the top right to create a text file. Finally, if you are using Spyder, use the `New File` dialog to create a new Python script.

If you're using Lab, the file (`untitled.txt` by default) can be renamed via right clicking on it and selecting `Rename`. Let's name this file `fileparser.py` and get started on our class.

### 2.3.1 Using `__init__` to define attributes

We will start by using the `class` keyword and writing our `__init__` method. Let's write this out and then break it down:

In [38]:
class fp:
    def __init__(self, fname):
        self.fname = fname
        self.fname_length = len(fname)

The first thing to note here is that when we made functions in the past, the `def` keyword was not indented at all. Remember that indentation is very important in Python! The fact that `def` here is indented underneath `class fp` means that this function belongs to `fp`. In other words, it is a method! This is what methods look like 'under the hood'. 

Next, we notice that this function has two arguments, or inputs: `self` and `fname`. `self` may be highlighted in Jupyter, but it is not actually a Python keyword - it is simply the name that, by strong convention, is given to the first argument of every method. Remember how when we used a method like `my_str.upper()`, we didn't provide any input to the parentheses? This is because under the hood, the `upper` method takes in `self` as input - in other words, it understands that its input is `my_str`, or the object it**self**. 

Inside a class, virtually all methods take in the `self` argument at minimum - this is because all methods, by design, act on the object itself. Any other inputs are extra and not necessarily required. However, we are adding an `fname` argument to represent the input filename, which we will have to give as input whenever we want to use this class.
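
We can check this idea of `self` using the string from earlier. In the sketch below, the second call goes through the `str` class directly and hands over the object by hand, which is what Python does for us automatically in the first call:

```python
my_str = 'These are some words.'

# The usual way: Python passes my_str in as `self` automatically
usual_call = my_str.upper()

# The long way: call the method on the class and pass the object ourselves
explicit_call = str.upper(my_str)

print(usual_call == explicit_call)  # True
```

Nobody writes the long form in practice, but it shows that the object before the `.` really is the method's first input.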

Looking inside the contents of the `__init__` function, we see there are two variable assignments using syntax we haven't seen before. As it turns out, this is where **attributes** are assigned. We haven't seen very many attributes in built-in objects, but they are much more common in custom packages. Here, we save two attributes: the filename itself, and then the length of the filename, using the `len` function. These are defined by assigning to attributes of `self`, and we can call attributes anything we want. 

Let's save this text file and then try to import this into our workspace here:

In [39]:
import fileparser

print(dir(fileparser))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'fp']


There it is at the end - `fp`. And this is all the `import` keyword does - read in a Python script that defines custom classes (and more)! 

Of course, to use `fp`, we have to use the full namespace. Let's give `my_file.txt` from earlier as input to a new instance of `fp`:

In [40]:
test_file = fileparser.fp('my_file.txt')

print(dir(test_file))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'fname', 'fname_length']


Now we can use the two attributes we've defined:

In [41]:
print(test_file.fname, test_file.fname_length)

my_file.txt 11


### 2.3.2 Creating methods

Of course, the real action with most object classes is in the methods. So let's create some! Back in our `fileparser.py` file, let's update the class with a simple method that converts the lines of the file to a list and removes any instances of `\n`. 

To add another method, we define another function at the same indent level underneath the `class` keyword. We only need to provide `self` as an argument because we can actually get the filename from the `self.fname` attribute we've already made! This helps simplify our methods even further.

In [42]:
class fp:
    def __init__(self, fname):
        self.fname = fname
        self.fname_length = len(fname)
    
    def convert_to_list(self):
        """
        Loops through lines in file and returns
        contents as a list.
        """
        fname_lines = []
        with open(self.fname, 'r') as f:
            for line in f:
                fname_lines.append(line.rstrip('\n'))
        return fname_lines

A few things to note here:

- Earlier, we used `for line in f.readlines()` to do something similar. However, Python knows this is a common operation, and so if we just loop over `f` directly it will pull in one line at a time. This is actually really powerful as well because it only loads one line into memory each time, which means you can do this over files that are millions of lines long if you'd like! 
- Notice that we are using `self.fname` in `open()`. Since we created `self.fname` in `__init__`, Python will know to look there to get the filename. This is a really powerful feature of attributes - they can be used a ton internally.
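- `__init__` is pretty unique in not having a `return` keyword at the end. Most other methods should have one - a method without `return` hands back `None`, which only makes sense if the method exists purely to *do* something (like printing) rather than to give back a value. 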
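
The point about `return` can be seen with a small sketch. The `Greeter` class and its method names are hypothetical and exist only for this illustration:

```python
class Greeter:  # hypothetical class, for illustration only
    def __init__(self, name):
        self.name = name

    def make_greeting(self):
        return 'Hello, ' + self.name

    def show_greeting(self):
        print('Hello, ' + self.name)  # prints, but has no return statement

person = Greeter('Ada')
returned_value = person.make_greeting()
printed_only = person.show_greeting()

print(returned_value)  # Hello, Ada
print(printed_only)    # None
```

`show_greeting` does display the text, but nothing can be saved from it, which is why `printed_only` ends up as `None`.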

Let's save `fileparser.py` and restart our kernel just to clear out the older imported version of `fileparser`. Then, let's import this updated version of the class:

In [43]:
import fileparser

test_file = fileparser.fp('my_file.txt')

print(dir(test_file))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'convert_to_list', 'fname', 'fname_length']


We can see that `convert_to_list` has been added! Let's give it a whirl: 

In [44]:
test_file.convert_to_list()

['these are some words3', '3']
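
Restarting the kernel is the simplest way to pick up edits to an imported file, and it is what we will keep doing in this lesson. An alternative worth knowing about is `importlib.reload`, from Python's built-in `importlib` module, which re-runs an already imported file. The sketch below is self-contained: it writes a made-up module called `reload_demo.py` into a temporary folder rather than touching `fileparser.py`.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = os.path.join(tmp_dir, 'reload_demo.py')  # hypothetical module

    # First version of the module
    with open(module_path, 'w') as f:
        f.write('version = 1\n')

    sys.path.insert(0, tmp_dir)    # let Python find modules in tmp_dir
    importlib.invalidate_caches()  # make sure the new file is noticed
    import reload_demo
    first_version = reload_demo.version

    # Edit the file, as we would after adding a method to fileparser.py
    with open(module_path, 'w') as f:
        f.write('version = 2\nnote = "edited"\n')

    importlib.reload(reload_demo)  # re-run the edited file
    second_version = reload_demo.version

    # Tidy up so the demo leaves nothing behind
    sys.path.remove(tmp_dir)
    del sys.modules['reload_demo']

print(first_version, second_version)  # 1 2
```

One caveat: objects created *before* the reload (like `test_file`) keep using the old version of the class, so they need to be created again. This is one reason a kernel restart is often the less confusing option.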

### 2.3.3 Dynamically updating attributes with methods

We've seen that methods can call on attributes, but it turns out attributes can also be dynamically created with methods! 

Let's create a method that will return the extension of a file. We can do this using the `rfind` method that strings have, which returns the position at which the *rightmost* instance of a character is. Let's test it out:

In [45]:
test_file_name = 'test_file.sorted.txt'
test_file_name.rfind('.')

16

This tells us that the last instance of a `.` is at position 16. We can use slicing notation, which we learned last week in the context of lists, on strings as well:

In [46]:
test_file_name[16:] # the colon means 'character 16 onwards'

'.txt'

Looks good - let's implement this into a method. Back to `fileparser.py`:

In [47]:
class fp:
    def __init__(self, fname):
        self.fname = fname
        self.fname_length = len(fname)
        
    def convert_to_list(self):
        """
        Loops through lines in file and returns
        contents as a list.
        """
        fname_lines = []
        with open(self.fname, 'r') as f:
            for line in f:
                fname_lines.append(line.rstrip('\n'))
        return fname_lines
    
    def get_extension(self):
        """
        Gets extension of input filename.
        """
        i = self.fname.rfind('.')
        extension = self.fname[i:]
        return extension

Same old song and dance - let's restart the kernel and give it a shot:

In [48]:
import fileparser

test_file = fileparser.fp('my_file.txt')

print(test_file.get_extension())

.txt


Looking good - but wouldn't it be neat if we could save that as an attribute? Turns out we can! 

In [49]:
class fp:
    def __init__(self, fname):
        self.fname = fname
        self.fname_length = len(fname)
        self.extension = self.get_extension()
        
    def convert_to_list(self):
        """
        Loops through lines in file and returns
        contents as a list.
        """
        fname_lines = []
        with open(self.fname, 'r') as f:
            for line in f:
                fname_lines.append(line.rstrip('\n'))
        return fname_lines
    
    def get_extension(self):
        """
        Get extension of input filename.
        """
        i = self.fname.rfind('.')
        extension = self.fname[i:]
        return extension

Same old kernel restart and reimport:

In [50]:
import fileparser

test_file = fileparser.fp('my_file.txt')

test_file.extension

'.txt'

And there we have it! The attribute was dynamically updated in the background as soon as `__init__` was run.
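
One caveat about the `rfind` approach: if the filename contains no `.` at all, `rfind` returns `-1`, and the slice then picks up the last character of the name instead of an empty extension. The sketch below shows this with a made-up filename, alongside `os.path.splitext` from the built-in `os` package, which handles that case. Swapping it in is a suggestion, not part of the lesson's `fileparser.py`.

```python
import os

no_dot_name = 'README'

# rfind returns -1 when there is no '.', so the slice grabs the last letter
i = no_dot_name.rfind('.')
print(i, no_dot_name[i:])  # -1 E

# os.path.splitext splits a filename into (root, extension)
print(os.path.splitext('README'))                # ('README', '')
print(os.path.splitext('test_file.sorted.txt'))  # ('test_file.sorted', '.txt')
```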

#### 2.3.3.1 Internal vs external use

We may actually want to hide the `get_extension` method and keep it for internal use. We can signify this by adding an `_` to the start (i.e. renaming it `_get_extension`) and being careful to update the reference to it in `__init__`. While this doesn't hide the method altogether, it's conventionally used to denote 'private' methods for internal use only - so if you ever see this in a package you are using, chances are you don't need to be using that function! 
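
As a sketch, the renamed version of the class might look like the following. `convert_to_list` is left out here only to keep the example short and runnable on its own:

```python
class fp:
    def __init__(self, fname):
        self.fname = fname
        self.fname_length = len(fname)
        self.extension = self._get_extension()  # updated reference

    def _get_extension(self):
        """
        Get extension of input filename (internal use).
        """
        i = self.fname.rfind('.')
        return self.fname[i:]

test_file = fp('my_file.txt')
print(test_file.extension)  # .txt
```

Users of the class would read `test_file.extension` and never need to call `_get_extension` themselves, although Python will not stop them from doing so.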

### 2.3.4 Function and object imports

To wrap up, let's quickly look at how constants and functions look on the other side. We'll add the following **outside the class** - i.e. these lines should not be indented at all! 

In [51]:
def package_name():
    print('fileparser!')
    
euler = 2.71828

Now, for the last time (promise!) let's restart our kernel and re-import. We can now access this function and this constant as well. 

In [52]:
import fileparser

fileparser.package_name()

fileparser!


In [53]:
fileparser.euler

2.71828

It's important to note that the reason this function and this constant come directly after `fileparser` is that left-aligning them fully makes them part of the main `fileparser` namespace. On the other hand, if we try to access `convert_to_list` similarly -

In [54]:
fileparser.convert_to_list()

AttributeError: module 'fileparser' has no attribute 'convert_to_list'

...it won't work. That's because it's a method of `fileparser.fp`:

In [55]:
fileparser.fp.convert_to_list()

TypeError: convert_to_list() missing 1 required positional argument: 'self'

Notice the error is now that the function is missing an input, not that Python has no idea what we're referring to! 

## 2.4 Package Gallery

That's it for the lesson material - to wrap up, we'll have a quick look at some established Python packages! 

Here are a few good ones that we'll be looking up on a browser:

- [ArcPy](https://pro.arcgis.com/en/pro-app/arcpy/get-started/what-is-arcpy-.htm)
    - [importing ArcPy](https://pro.arcgis.com/en/pro-app/arcpy/get-started/importing-arcpy.htm)
    - [example](https://pro.arcgis.com/en/pro-app/arcpy/functions/dataset-properties.htm)
- [BioPython](https://biopython.org/)
    - [importing BioPython](https://biopython.org/wiki/Documentation)
    - [example](https://biopython.org/wiki/SeqIO)
- [NLTK](https://www.nltk.org/) (Natural Language Toolkit)
- [EarthPy](https://earthpy.readthedocs.io/en/latest/)
    - [example](https://earthpy.readthedocs.io/en/latest/gallery_vignettes/plot_bands_functionality.html#sphx-glr-gallery-vignettes-plot-bands-functionality-py)
- [chempy](https://pypi.org/project/chempy/)
- [`scikit-learn`](https://scikit-learn.org/stable/)
    - [example](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting)

## Appendix

This section contains material that was not originally part of the lesson, but was covered/requested during the workshop.

### Reading a file line by line

Generally, it's good practice to always use the following syntax (`for line in f`) if looping through a file one line at a time:

In [1]:
with open('my_file.txt', 'r') as f:
    for line in f:
        print(line)

these are some words3

3


This will go through the file one line at a time, but will not save anything but the current line in memory as it goes along. For most uses, this is just fine, because many data files are quite large and we may only want one piece of information from a single line in a dataset at a time.

If, however, we want to save all the lines in full, we could create a list and use the `.append` method to save the lines one at a time:

In [2]:
content_lines = []
with open('my_file.txt', 'r') as f:
    for line in f:
        content_lines.append(line.rstrip())

This also allows for some modification of each line (e.g. using the `.rstrip` string method above).

Finally, there is a helpful method that simplifies this by returning a list that already contains all the lines in the file - `.readlines`. The only downside to this is that all the lines are returned as-is, so if you'd like to change them in any way, that would have to be done after the fact.

In [3]:
with open('my_file.txt', 'r') as f:
    all_lines = f.readlines()
    
print(all_lines)

['these are some words3\n', '3']


### Type conversions

Where possible, we can convert a Python object from one type to another by using the name of the target type with 'function syntax'. What this **actually** does is create a new instance of the specified class: `str()` looks like a function, but is similar to `fileparser.fp` above in that it refers to the object class directly, and we are just making a new instance of it. In practice, though, the syntax looks virtually identical to a function call.

In [4]:
my_str = '3'
my_str_as_int = int(my_str)

print(my_str_as_int, type(my_str_as_int))

3 <class 'int'>


This will only work where it makes sense to use the new object type. For instance, changing the string `'This is a string'` to an integer will raise an error since Python won't know how to convert letters into an integer. 
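
Here is what that failure looks like in practice. The sketch uses `try`/`except`, which was not covered in the lesson, only so that the error can be displayed without stopping the notebook:

```python
conversion_failed = False

try:
    int('This is a string')
except ValueError as error:
    conversion_failed = True
    print('Conversion failed:', error)
```

The error type to look for in cases like this is `ValueError`: the type of the input (a string) is acceptable, but its value cannot be read as a number.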

All the Python object types we've learned about can be used this way:

- `str()`
- `int()`
- `float()`
- `list()`
- `dict()`
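
A quick sketch of each one in action, using made-up values:

```python
print(str(3.0))    # 3.0 (now a string)
print(int('3'))    # 3
print(float('3'))  # 3.0
print(list('abc')) # ['a', 'b', 'c']
print(dict([('apple', 3), ('banana', 5)]))  # {'apple': 3, 'banana': 5}
```

Note that `list()` splits a string into its individual characters, and `dict()` expects pairs of keys and values.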