# Welcome to the Center for Computational Biology's Introduction to Programming for Bioinformatics bootcamp!


## Downloading materials
If you haven't downloaded the materials yet, please refer to the link in the reminder email you received on Friday.

You'll need to download a copy of the materials in that link and then upload those materials to your own Google Drive. This way, you have your **own copy** that you can modify and edit as you progress through the bootcamp.

In the meantime, we'll introduce the teaching staff and talk about the logistics of the bootcamp.

## Lecture logistics
### Browser
We highly recommend using [Google Chrome](https://www.google.com/chrome/) to view and edit these notebooks. There shouldn't be an issue with using other browsers, but because Google Colaboratory (more on this soon) is a Google product, we expect that most functions will work best with Google Chrome.

### Zoom (virtual students)
Make sure you have the Zoom desktop or mobile app installed (instead of using it from a browser). Please turn your video off during lecture to minimize lag.

### Slack
Slack is an interactive workspace that allows us to share resources, provide support, and otherwise communicate with you all during the workshop. **If you are a virtual attendee, Slack will be your primary resource for questions and discussion during the bootcamp.** If you haven't signed up already, check the pre-bootcamp email for the invitation link. If you haven't used Slack before, feel free to refer to [this quick-start guide](https://slack.com/help/articles/360059928654-How-to-use-Slack--your-quick-start-guide).

### Lecture
If you're attending in-person, feel free to raise your hand if you'd like to ask a question out loud. Otherwise, TAs will monitor the #lecture-discussion Slack channel for lecture-related questions.

**During coding exercises**: We strongly encourage you to work with fellow bootcampers at your table. Our TAs will walk around and check on how you're doing – please wave one of them down if you'd like to ask a question!

Each day of the bootcamp, the head instructor and TAs will be available during bootcamp hours (9AM - 5PM) to respond to questions and comments in the below channels.

* **Have an issue with Zoom, Google Drive, or your Colab notebook?**: Please DM one of the TAs on Zoom or Slack so they can help you out!
* **Want to introduce yourself, share a cool paper, or chat about science/coding related things?**: We have a *#water-cooler* channel!

## Topics and schedule

### What are you going to learn?

You are going to learn Python! 🐍

Python is a simple and powerful programming language that is used for many applications, from simple tasks to large software development projects. It has become popular both as a first language for beginning students and in everyday use by advanced programmers.

Our goal is to show you how to apply programming to the problems and tasks that you face in the lab – in short, we'll be teaching you **basic bioinformatics**. By the end of this course, you will be able to do the following:
* Parse and read in data from various biological file formats.
* Write custom functions for analyzing your data.
* Find and use different Python packages for analyzing your data.
* Understand how to construct scripts to automate your data analysis.
* Make publication-quality figures with your data.

Our aim is for you to leave this course with a sufficiently generalized knowledge of programming (and the confidence to read the manuals) that you will be able to apply your skills to whatever you happen to be working on.

### What are you *not* going to learn?

With only one week to teach this course, there are many topics that we are not going to cover. :(

We will focus on basic skills that can be employed by biologists at all levels: managing and handling large data sets, automating and speeding up tasks, efficiently using existing software packages to answer biological questions, building pipelines for data analysis and management, and performing basic statistical tests and analyses.

This means we will *not* delve into more complex topics like software development, algorithms, or deploying code on computing clusters. We're happy to refer you to other resources or programs that may be useful for these topics, but **our goal is to get you started with the basics so that you can direct *your own* learning.**


### Weekly Schedule
That being said, we do have a defined list of topics that we'll be going through day by day.<br>

**Monday, January 8th**: Introduction to Colab & Python

**Tuesday, January 9th**: Introduction to numpy

**Wednesday, January 10th**: More on numpy, introduction to pandas
* Mini-project 1: The `badhealth` dataset
* Mini-project 2: The `hepatocellular` dataset

**Thursday, January 11th**: Pandas for tabular data operations

**Friday, January 12th**: Plotting data and putting it all together
* Mini-project 3: The [Tabula Muris Senis](https://www.nature.com/articles/s41586-020-2496-1) cell atlas

### Daily Schedule

Each day is broken up into two interactive lecture modules: instructors will share their screens and lecture while working through the notebook. You are *strongly encouraged* to work through your copy of the notebook side-by-side! (More details on that in a few minutes.)

Below is a rough outline of the schedule: the amount of time we spend on each segment will vary.

- **8:30-9:10am**: Morning coffee and pastries, eat on terrace outside
- **9:10am-12:30pm**: Morning lecture and exercises
- **12:30-1:30pm**: Lunch break. More coffee and pastries available, eat on terrace outside
- **1:40-5:00pm**: Afternoon lecture and exercises
- **End at 5pm!**

Last but not least: we will try to build in some short 5-10 minute breaks throughout the day, but please feel free to take breaks at your own discretion.

## Cheat sheets

During the course of the week, we'll be distributing copies of **cheat sheets** that cover the essential content for each section. These cheat sheets aren't just supplemental handouts: they're actually *essential* for your progress during the bootcamp.

Over the years, we've found that students who focus heavily on trying to memorize how to do each operation often begin to fall behind once we introduce more advanced content. We believe that the key to student success in such a short bootcamp is not memorizing lines of code, but *intuiting* what principles of the code are important and relevant to the task at hand.

We encourage you to frequently refer to the cheat sheet and make it a part of your learning process this week. Whenever you learn about a new concept, take note of the corresponding cheat sheet section so you know where to find the code syntax again. Memorization will come with time and practice, so for now, focus on learning!

# Getting started with Google Colaboratory

During the bootcamp, you'll be writing and executing your code with a tool called **Google Colaboratory** (Google Colab for short).

Colab is like an interactive coding notebook: you can write chunks of text just like you would with a word document or Google Doc, interspersed with chunks of Python code. Colab is especially nifty because it allows you to run your code using Google's cloud computing resources instead of your own computer's hardware. What's more, your code is saved to the cloud in real time, just like with the other tools in Google Drive.

At this point, you should have your own copies of the Colab notebooks we're going to use throughout the week. During the daily lectures, the instructors will be working on their copy of the notebook. **You can (and should!) work through your own copy of the notebook side by side in order to get the most out of the bootcamp.** No worries about mistakes or overwriting: you can always re-download a fresh copy of the materials from Google Drive.

> *Note*: Each day, we'll upload the instructor's "filled out" notebook to the Google Drive. This way you won't miss out on anything if you have to step away from the lecture!

Let's get started by familiarizing ourselves with the basic tools in Colab, many of which may be familiar to you if you've used Google Drive products (Docs, etc) before.

## Adjusting the font size or theme

One of the first things you might want to do is change the font size or dark/light mode theme of the notebook. You can do that easily by going to the top toolbar, then clicking `Tools -> Settings`.

## Using the Table of Contents

The Table of Contents is an invaluable tool for flipping through the notebooks we'll be using each day of the bootcamp. You can access the Table of Contents by clicking on the topmost button of the *left hand side toolbar*.

Try clicking on the "Editing text cells" heading, which will bring you to the section below.

## Editing text cells

Much like with live lectures, you'll probably want to take notes as we go. Luckily, Colab allows you to write notes in a number of different ways. Let's start by editing some text within an already established text cell.

Single click on the sentence below to reveal the text cell. Afterwards, double click to fill in the blanks, then press `Shift + Enter` (both Windows and macOS) to save your changes.


> Today is Monday, January 8th.

My instructor is Monica.



You can edit any of the text cells in Colab using this method. Colab supports basic formatting like **bold**, *italic*, and even


*   bullet
*   points!


Editing your format is super easy! Watch as I turn this line into a block quote, then a numbered list.


Alternatively, you can *create* new text cells anywhere you want. You can do this by hovering over the top/bottom middle portion of an existing text or code cell until the + Code and + Text buttons appear.

Give it a try: hover over the bottom middle portion of this text cell, then click the `+ Text` button.

Your new text cell should appear right above this text cell.

## Connecting to a runtime
A **runtime** in Google Colab is a computing environment that allows us to execute the code in our notebook. The resources for a runtime are hosted using Google's computing resources, which means that you can run code and execute analyses on Colab regardless of your own laptop's computing capabilities.

Click on the "Connect" button on the right hand side of the top toolbar. You should see it flash through a few different status messages. After the runtime is connected, you should see a green checkmark and two bars labeled "RAM" and "Disk".

## Working with code cells

Now that we're connected to our runtime, we can start running some code cells: these are the bread and butter of our interactive notebooks.

When you're ready, go ahead and run the cell by doing one of the following:
1. Hovering over the cell and pressing the Play button that appears on the left side.
2. Clicking on the cell and pressing `Shift + Enter` (both Windows and macOS).

In [None]:
print("Today's a good day to learn Python!")

Today's a good day to learn Python!


You can edit and run code cells just like you would with text cells. In fact, editing cells to annotate code is encouraged: you can do this with **comments**, which are lines of code that are prefaced with `#` characters.

In [None]:
# This is a "comment": programmers use comments to annotate their code.
# Comments are not considered code, so Python doesn't "run" comment lines.
# We've tried to mark up our code with useful comments, and you should write comments in your code as you go along!

print("Comments aren't visible in the output of the code cell: they're just for the reader!")
# print("That means that this bit of code won't actually run!")
print("You can put comments in between or after lines of code, and they won't be visible.") # Look, no comment!

Comments aren't visible in the output of the code cell: they're just for the reader!
You can put comments in between or after lines of code, and they won't be visible.


Last but not least, you can create your own code cells, just like you would with text cells. There are a few ways to do this:

1. Hovering over the top/bottom middle portion of an existing code cell until the `+ Code` and `+ Text` buttons appear.
2. Clicking on an existing cell and using a shortcut:
  - (macOS) `Command + M + B`
  - (Windows) `Ctrl + M + B`

Let's put it all together: try creating a new code cell below, then copy/paste the following and run the code cell.

```
"Hello world!"
```

In [None]:
# try it out
print("Hello world!")

Hello world!


Congratulations! You've just taken your first baby step to writing code in Python. 🐍

# Python Basics

## Cheat sheet
To view the Day 1 cheat sheet, click [here](https://drive.google.com/drive/folders/1JTZ_sJijmZXH1d5OxviX1gnxYIoSNBlz).

## Basic types

Many of the tasks scientists usually want to do with Python involve sorting and manipulating data. However, before we can do that, we have to understand how Python parses the data we give it.

Python follows what we call an **object-oriented programming** structure. In short, Python is a language that centers around manipulation of "objects" of different types. Data, which we would like to analyze, can be represented as certain **types** of objects. Tasks or functions can also be represented as different types of objects.

Although Python can be a very powerful language, it can also be a very *particular* language, especially for learners who are unfamiliar with the (in)flexibility of code. Let's start by going over some basic data types.

### Numerics

There are two *numeric* data types: **integers** and **floats**. You can think of integers as, well, integers, and floats as numbers that contain a decimal point. In Colab, you can identify numeric types by the turquoise color they're displayed with in the code cell.

In [None]:
# a float
100.0

100.0

In [None]:
# an integer
42

42

As you might already be anticipating, you can perform simple mathematical operations with numerics using conventional operators like the addition (`+`), subtraction (`-`), multiplication (`*`), and division (`/`) signs.

In [None]:
5 + 2 - 3

4

In [None]:
5*4

20

In [None]:
10/2

5.0

Take care to mind the order of operations: you can use parentheses to enclose operations as necessary.

In [None]:
(5 + 3) - 3

5

Python also supports advanced numerical operators:

- `//`: Floor division, which divides and then rounds the result down to the nearest whole number (for positive numbers, this is the integer component of the resultant value).



In [None]:
# Try 23/5, then 23//5
print(23/5)
print(23//5)

4.6
4


- `%`: Division, but returns only the remainder of the operation.

In [None]:
# Try 23/5, then 23%5
print(23/5)
print(23%5)

4.6
3


- `**`: Raising the given base to the specified exponent.

In [None]:
2**3

8
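To see how `//` and `%` fit together, here is a small sketch using the same numbers as above (the variable names are our own, chosen only for illustration). The floor-division result and the remainder are two halves of the same division, so you can always rebuild the original number from them.

```python
# Illustrative values only: the same 23 and 5 used in the cells above
dividend = 23
divisor = 5

quotient = dividend // divisor   # whole-number part of the division
remainder = dividend % divisor   # what is left over after the division

print(quotient, remainder)

# Putting the two pieces back together recovers the original number
rebuilt = quotient * divisor + remainder
print(rebuilt)

# Careful: // rounds *down*, which matters for negative numbers
negative_quotient = -23 // 5
print(negative_quotient)  # -5, not -4
```

This pairing comes in handy for tasks like converting a position into a "row" and "column", though we won't need it right away.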

### Strings

**Strings** refer to literal strings of characters. In Colab, you can also identify strings by the <font color = '#ed865c'>**red color**</font> they're shown with in the code cells.


In [None]:
"This is a string!"

'This is a string!'

Strings are most commonly encoded by single (`'`) or double (`"`) quotes. Both yield the same data type (string), but the convention of using one format over another will vary depending on what you're doing with the string.

**Double quotes**: The default mode of encoding strings. Anything between the double quotes is a string.

In [None]:
"This is a string using double quote encoding."

'This is a string using double quote encoding.'

**Single quotes**: Useful for encoding prose or any other text where you conventionally use double quotes to indicate some sort of dialogue.

In [None]:
"She said: "Hello!"" # run me and see what happens

SyntaxError: ignored

In [None]:
'She said: "Hello!"' # now run me and see what happens

'She said: "Hello!"'
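The same trick works in the other direction. As a quick sketch (the example sentences are made up for illustration), a string that contains an apostrophe can be wrapped in double quotes, and the quote style you pick doesn't change the string itself.

```python
# The two quote styles produce exactly the same string
double_quoted = "This is a string!"
single_quoted = 'This is a string!'
print(double_quoted == single_quoted)  # == checks whether two values are equal

# An apostrophe is a single quote, so wrap the string in double quotes
contraction = "It's a good day to learn Python!"
print(contraction)
```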

Python allows you to perform arithmetic-like operations with strings. For example, we can **concatenate** (combine) strings by using the addition operator.

In [None]:
greeting = "Hello! "
order = "I would like to order a pizza. "

print(greeting + order) # order matters when printing
print(order + greeting)

Hello! I would like to order a pizza. 
I would like to order a pizza. Hello! 


We can also repeat the content of a string by using the multiplication operator.

In [None]:
print(greeting * 3)

Hello! Hello! Hello! 


In [None]:
# What happens if you try that with a float multiplier instead of an integer?

print(greeting * 3.0)

TypeError: ignored

Just as with our numerical operations, the order of operations applies with strings as well:

In [None]:
print((greeting * 3) + order)
print((greeting + order) * 3)

Hello! Hello! Hello! I would like to order a pizza. 
Hello! I would like to order a pizza. Hello! I would like to order a pizza. Hello! I would like to order a pizza. 


## Assigning variables
**Variables** are simply representations of the data that you want to work with. For example, we use variables in mathematical equations like `F = ma`, where each variable (`F`, `m`, `a`) represents a value corresponding to force, mass, and acceleration.

To create a variable, you write out your variable name, an assignment operator (`=`), and then what you want it to represent. Spaces around the equal sign are optional but encouraged for code readability.

Let's run the cell below:

In [None]:
todays_date = "January 8th, 2024" # assigning the variable todays_date

Before, when we ran code cells containing strings and numerics, we saw their values in the output. When you assign a variable, however, no output is given to the console. Nevertheless, once you've assigned the variable, it's **stored** in your runtime, meaning that you can access it by **calling** the variable by its name, which will **return** its value to output.

In [None]:
print(todays_date)
print((todays_date + "! ") * 3)

January 8th, 2024
January 8th, 2024! January 8th, 2024! January 8th, 2024! 


## Updating variables
If you want to change a variable's value, you have to "update" it by assigning the variable again. If a variable's value is derived from other variables, you will also need to "update" it whenever those other variables change.

In [None]:
# a simple F = ma calculation
mass = 237.0
acceleration = 9.81
force = mass*acceleration

Again, assigning variables does *not* result in their value being returned to output. If we want to see what the `force` variable contains, we need to call the variable's name.

In [None]:
# try it out:
# call the force variable
print(force)
print(mass * acceleration)

2324.9700000000003
2324.9700000000003


Let's change `mass` to 250 and calculate `force` again.

In [None]:
# try it out:
# reassign mass to 250
mass = 250

In [None]:
# try it out:
# call force again
print(force)

2324.9700000000003


Notice that `force` did not automatically update its value given the new value of `mass`. Thus, we have to update `force` as well.

In [None]:
force = mass*acceleration
force

2452.5

## Built-in functions
A **function**, as its name intuitively reveals, performs some sort of a task using code. Much like a variable, you interact with functions by calling their names, with the additional step of providing inputs to the function if necessary.

> *Tip:* In Colab, you can usually identify functions by the ochre color they're displayed with in the code cell.

Some of the simplest, but most crucial functions in Python are **built-in functions**, which are "built in" to base Python. We'll start out with four important built-ins:
- `len()`
- `print()`
- `type()`
- `help()`

We'll use these built-in functions to prime our understanding of how functions work.

`len()` lets us check the length of an object. The input to `len()` goes between the parentheses.

In [None]:
len("This function is useful for counting the number of characters in a string.")

74

In [None]:
len(8888888) # integers don't have lengths

TypeError: ignored

In [None]:
len(5.0) # and neither do floats

TypeError: ignored

Some functions are particularly special because they tell us about the properties of our code. `print()` is a function that **prints** the given input to the Colab notebook's output. It's a useful function for a multitude of reasons: the first one being that you can use it to check the value of variables.

Go ahead and try to use `print()` to check the contents of the variable `todays_date`.

In [None]:
# try it out:
# use print() with todays_date as the input
print(todays_date)

January 8th, 2024


`print()` can also be used to explicitly show the outputs of operations that would normally not be visible. This is easier shown than explained.

In [None]:
# Will only display last line of code
40 + 100
90**3
500*3.146

1573.0

In [None]:
# Will display each line of code
print(40 + 100) # the operation happens inside of print()
print(90**3)
print(500*3.146)

140
729000
1573.0


Above, we can only observe the output of the *last* operation. By using `print()`, we can display the result of each operation.

`print()` is an example of a function that can take *multiple* inputs. If we use `print()` with multiple inputs, we can display multiple objects in an ordered manner.

In [None]:
# try it out:
# print force, mass, acceleration (in that order)
# simply separate each input with a comma
print(force, mass, acceleration)

# then try printing acceleration, mass, force (in that order)
print(acceleration, mass, force)

2452.5 250 9.81
9.81 250 2452.5


Functions can be **nested** within each other, such that the result of one function can be fed directly into another function. Nested functions are evaluated from the inside out. For example, we can use `len()` nested within `print()` to print the length of multiple strings.

In [None]:
# len() is evaluated first, which gives the length of the string as an integer
# this integer is passed on to print(), yielding a printed integer

print(len('Monday'))
print(len('Tuesday'))
print(len('Wednesday'))

6
7
9


However, be careful: `print()` simply ***displays*** the value of an object or the outcome of a command, so you can't use it to update a variable.

In [None]:
print(mass)
printed_addition = print(mass + 1) # the result is displayed, but not stored -- don't be fooled!

250
251


In [None]:
printed_addition + 1 # we get an error because the variable printed_addition contains nothing (NoneType)

TypeError: ignored
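To make the difference concrete, here is a minimal sketch (the variable `stored_addition` is our own name, not part of the notebook). It shows what `print()` actually hands back, and how to store the result of the operation instead.

```python
mass = 250  # same value as in the cells above

# print() displays 251, but the value it hands back is None
printed_addition = print(mass + 1)
print(printed_addition)
print(type(printed_addition))

# To keep the result, assign the operation itself to a variable
stored_addition = mass + 1
print(stored_addition + 1)  # this works: 251 + 1
```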

The `type()` function is a useful built-in function for checking the type of objects. This can be useful for troubleshooting whether or not something is the type that you expect it to be.

In [None]:
# try it out:
# use print() to check the type of 42, 42.0, and "42"
print(type(42), type(42.0), type("42"))

<class 'int'> <class 'float'> <class 'str'>


Edit the code below and notice the difference between typing `print()` vs. just `print` inside of `type()`. Why is that?

In [None]:
# try it out:
# check the type of print() using type()
print(type(print))

<class 'builtin_function_or_method'>


Functions are the bread and butter of conducting analyses in Python, so we'll get to learn a lot of them as we go. You won't need to remember all of them off the top of your head, and you can always look up important ones using the `help()` function.

In [None]:
help(len) # if you forget about what len does

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



## Intro to data structures
Now that we've familiarized ourselves with some basic data types and manipulations, we can move on to **data structures**. At their simplest, data structures are simply ways to organize units of data. Some data structures are only used to organize specific types of data, and others can be used to organize mixed types of data.

Today, we'll focus on teaching you all about the **built-in data structures**: lists, sets, tuples, and dictionaries. These are the building blocks of the more advanced data structures that we'll learn about tomorrow and use for the rest of the week.

### Lists

Let's begin with a familiar and intuitive data structure: **lists**. In your day-to-day life, you likely come up with simple lists all the time: grocery lists, task lists, so on and so forth.

The structure of lists in Python is quite similar: a list is a data structure that stores an ordered series of values. Lists are typically denoted with square brackets and stored/referenced using a variable name.

In [None]:
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
print(weekdays)
print(type(weekdays)) # type() will identify data structures as well

['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
<class 'list'>


Lists can be used to store a variety of data types. Most lists you'll use will be composed of the same data type, but it is *possible* to store multiple data types in the same list.

In [None]:
mixed_list = ['one', 2, 3.0, 'four', 5]
print(mixed_list)

['one', 2, 3.0, 'four', 5]


If you have a list of numerics, you can perform some simple arithmetic operations using built-in functions:

* `sum()`: Sums values across a list of numerics.
* `min()`: Returns the minimum value in a list of numerics.
* `max()`: Returns the maximum value in a list of numerics.

In [None]:
num_list = [5, 9, 2.3, 14, 3, 2, 10] # a sample list of numerics

# try it out: use sum(), min(), and max() on num_list and see what you get!
print(sum(num_list), min(num_list), max(num_list))

45.3 2 14
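These built-ins combine nicely with `len()`, which counts the elements in a list. As a rough sketch (the variable names `count`, `mean`, and `value_range` are ours), you could compute an average and a range like so:

```python
num_list = [5, 9, 2.3, 14, 3, 2, 10]  # the same sample list as above

count = len(num_list)                        # number of elements in the list
mean = sum(num_list) / count                 # average = total divided by count
value_range = max(num_list) - min(num_list)  # spread between largest and smallest

print(count)
print(mean)
print(value_range)
```

Later in the week, we'll see packages that provide these calculations ready-made, but it's good to know they can be built from the basics.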


Lastly, lists can be *nested* within each other: you can do this by simply including a list as an element within a list.

In [None]:
weekends = ['Saturday', 'Sunday']

week = [weekdays, weekends]
print(week)

[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], ['Saturday', 'Sunday']]
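One thing to watch out for with nesting: the outer list only "sees" its sub-lists as single elements. The sketch below illustrates this with `len()`, and, as an aside we haven't covered above, shows that the `+` operator joins two lists into one flat list, much like it concatenates strings. The name `all_days` is our own.

```python
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
weekends = ['Saturday', 'Sunday']

# A nested list: two elements, each of which is itself a list
week = [weekdays, weekends]
print(len(week))

# Joining with + gives one flat list of seven strings instead
all_days = weekdays + weekends
print(len(all_days))
print(all_days)
```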


## Indexes and slicing

So far, we've taught you about types of data and how to store them. Now, we'll move on to how to *access* the data that you store.

In Python, **iterable** objects that store their elements in a fixed order, such as strings and lists, can be accessed using something called **indexing**. In practical terms, that means that you can access the first, second, third, etc. elements of these objects. Consider the following string.

In [None]:
someString = 'abcdefghij' # first ten letters of the alphabet

Python uses **zero-based indexing**, in which the starting element of a string is indexed as the 0th element rather than the 1st element. (If you've used R in the past, you may stumble a bit while shifting to zero-based indexing. Sorry!)

Below are the indices for all of the characters in `someString`. Although the string itself is ten characters (confirm it with `len()` if you want!), the indices of each character run from 0 to 9.

```
someString:  a b c d e f g h i j
Index:       0 1 2 3 4 5 6 7 8 9
```

You can access the character at a specific index by using square brackets directly after the variable name, then using an integer index.

In [None]:
print("The first character is", someString[0])

print("The last character is", someString[9])

print("The last character is also", someString[-1])

The first character is a
The last character is j
The last character is also j
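If you're wondering how `9` and `-1` can point to the same character, this short sketch may help (the variable `last_index` is our own). Because counting starts at 0, the last index is always one less than the length, and negative indices simply count backwards from the end.

```python
someString = 'abcdefghij'  # the same string as above

# The last index is always len() - 1 because counting starts at 0
last_index = len(someString) - 1
print(last_index)
print(someString[last_index])

# -1 is shorthand for "the last element", -2 for "second to last", etc.
print(someString[-1] == someString[last_index])
print(someString[-2])
```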


We can also use square brackets in conjunction with a colon operator `:` to extract a subset of a string, called a **substring**. The colon operator indicates that we want to retrieve the values starting from the left index, then going up to *but not including* the right index. This means that if you want to include the last element of the string in your substring, you actually have to add 1 to the last index!

In [None]:
print(someString[0:4]) # only characters from index 0 to 3, does not include index 4

print(someString[4:9]) # only characters from index 4 to 8, does not include index 9

print(someString[4:10]) # only characters from index 4 to 9, does not include index 10 (which doesn't exist)

abcd
efghi
efghij
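A handy consequence of the "up to but not including" rule is that the length of a slice is just the right index minus the left index. The sketch below checks this (the names `left`, `right`, and `substring` are ours), and also shows that a right index past the end of the string is quietly treated as "the end" rather than raising an error.

```python
someString = 'abcdefghij'  # the same string as above

left = 4
right = 9
substring = someString[left:right]
print(substring)

# The slice length equals right minus left
print(len(substring) == right - left)

# A right index beyond the end of the string just stops at the end
print(someString[4:100])
```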


Below, we've provided the index guide for the string `'Goodbye!'`, stored in the variable `farewell`.



```
String:  G o o d b y e !
Index:   0 1 2 3 4 5 6 7
```

In [None]:
farewell = "Goodbye!"

# try it out:
# slice and print out the substring 'Good'
print(farewell[0:4])
# try it out:
# slice and print out the substring 'bye!'
print(farewell[4:8])

Good
bye!


Although using explicit left/right indices works well for shorter strings, it can become cumbersome for large strings with large indices. Thankfully, we can efficiently index or slice elements using these nifty little shortcuts:



In [None]:
print(farewell[:4]) # left index defaults to the first index if left empty

print(farewell[4:]) # right index defaults to the last index + 1 if left empty

print(farewell[-4:]) # you can use *negative* indexes as well: here, -4 starts the slice four characters from the end

Good
bye!
bye!
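Because the left slice stops right where the right slice begins, the two shortcuts split a string cleanly in two with nothing lost or repeated. Here is a small sketch of that idea (the names `first_half` and `second_half` are our own):

```python
farewell = "Goodbye!"

first_half = farewell[:4]   # everything before index 4
second_half = farewell[4:]  # everything from index 4 onwards
print(first_half, second_half)

# Gluing the halves back together gives the original string
print(first_half + second_half == farewell)

# Leaving both sides empty returns the whole string
print(farewell[:])
```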


Indexing is an important skill to know for parsing data structures. It turns out that you can use the square index brackets to access single elements or subsets of elements in lists, just as you would with strings.

In [None]:
fruits = ['apple', 'orange', 'banana', 'apple', 'grapes', 'orange']

# try it out:
# use indexing to print the last three fruits in the list
print(fruits[3:])
print(fruits[-3:])

['apple', 'grapes', 'orange']
['apple', 'grapes', 'orange']


If we're working with a nested list, we simply use *two sets* of square brackets to access elements in a sub-list.

In [None]:
print('The nested list: ', week)

print(week[0]) # first sub-list

print(week[1]) # second sub-list

print(week[1][0]) # get first element from the second sub-list

The nested list:  [['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], ['Saturday', 'Sunday']]
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
['Saturday', 'Sunday']
Saturday
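The bracket chain can keep going. Since the elements of the sub-lists here are strings, and strings can be indexed too, a third set of brackets reaches individual characters. This is only a sketch (the variable `first_weekend_day` is our own), but it shows how each bracket works on the result of the one before it.

```python
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
weekends = ['Saturday', 'Sunday']
week = [weekdays, weekends]

# Step by step: sub-list -> element -> character
first_weekend_day = week[1][0]
print(first_weekend_day[0])

# The same thing in a single chain of brackets
print(week[1][0][0])

# Slices work in a chain as well: first three letters of the last weekday
print(week[0][-1][:3])
```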


## Coercing types

Unlike strings and lists, numeric (`int` and `float`) types cannot be indexed. This is slightly frustrating if we want to obtain certain positions in an `int` or `float` value.

In [None]:
# How can we access the hundredths place (the second decimal digit) of acceleration?
acceleration

9.81

In [None]:
# try it out:
# what happens when you try to index the hundredths place?
acceleration[3]

TypeError: ignored

Fortunately, we can convert numerics to strings with ease by using the `str()` function.

Notice that the value is now surrounded by quotes, indicating that it has been converted to a string.

In [None]:
str(acceleration) # try to index as usual *after* the conversion to str

'9.81'
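Once the value is a string, the indexing we tried earlier works as usual. Here is a minimal sketch (the variable `acceleration_text` is our own name). Note that the decimal point counts as a character, and that index 3 only lands on the hundredths place because this particular number has a single digit before the decimal point.

```python
acceleration = 9.81

acceleration_text = str(acceleration)  # '9.81': the characters 9 . 8 1
print(acceleration_text[3])            # index 3 is the hundredths place here

# The character we get back is itself a string, not a number
print(type(acceleration_text[3]))
```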

Performing these types of conversions is also referred to as **coercion** or **typecasting**.

We can use built-in functions to convert one data type to another:
- `int()`: Converts to an integer
- `float()`: Converts to a float
- `list()`: Converts to a list of elements

In [None]:
# try it out:
# convert the string '4950.18' into a float

float('4950.18')

4950.18

In [None]:
# try it out:
# convert the int 49 to a string

str(49)

'49'

In [None]:
# try it out:
# convert the string 'banana' into a list

list('banana')

['b', 'a', 'n', 'a', 'n', 'a']

In [None]:
# try it out:
# convert the string '900' into an int

int('900')

900
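Coercions can also be chained, one after another. The sketch below uses a made-up value (`price_text` and the other variable names are ours) to show two details worth knowing: `int()` drops the decimal part of a float rather than rounding it, and a number can be split into its digits by going through `str()` first.

```python
price_text = '4950.18'  # hypothetical value, for illustration only

price_float = float(price_text)  # string -> float
price_int = int(price_float)     # float -> int (the decimal part is dropped)
print(price_float, price_int)

# int() truncates instead of rounding
truncated = int(4.99)
print(truncated)

# int -> str -> list gives a list of the individual digits (as strings)
digits = list(str(495))
print(digits)
```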

Not all coercions are possible: for example, character-only strings won't become floats or ints, and you can't turn a non-iterable object into a list.

In [None]:
# this won't work
print(greeting)
print(int(greeting))

Hello! 


ValueError: ignored

In [None]:
# this won't work
print(495)
print(list(495))

495


TypeError: ignored

Lastly, the same "updating" rule applies to coercing variables. You can coerce a variable's value to whatever you want (as long as it's a valid coercion!), but the coercion won't actually alter the variable's value unless you update the variable.

In [None]:
acceleration = 9.81
print("The original variable type is", type(acceleration)) # check original type of variable's value
print(str(acceleration)) # coerce the variable's value to a string, then print it
print("The current variable type is", type(acceleration)) # check the type again

acceleration = str(acceleration) # update to coerced value
print(acceleration) # print the variable's value
print("The current variable type is", type(acceleration)) # check the type again

The original variable type is <class 'float'>
9.81
The current variable type is <class 'float'>
9.81
The current variable type is <class 'str'>


# More operations with strings

## Special characters
You can use **special characters** in strings to modify how they print. The most common special characters for string formatting are `\t` and `\n`, which correspond to **tabs** and **newlines**. You can think of the backslash as a signal that tells the computer to treat the next character as something other than its default, so that 'n' becomes a newline when preceded by a backslash, 't' becomes a tab, etc.

Let's examine the following string:

In [None]:
"This is the first line,\n\tthis is the second line,\n\t\tthis is the third line."

'This is the first line,\n\tthis is the second line,\n\t\tthis is the third line.'

This might look messy, but the `print()` function knows how to format the string in accordance with the special characters.

In [None]:
print("This is the first line,\n\tthis is the second line,\n\t\tthis is the third line.")

This is the first line,
	this is the second line,
		this is the third line.




If you want to include an actual backslash (or a literal `\n` or `\t`) in your string, you can simply add another backslash (sort of like a double negative!).

In [None]:
print("This is the first line, \n\t this is the second line.") # a newline and a tab
print("There is only one line in this string,\\not \\two.") # extra backslashes cancel out newline and tab activity

This is the first line, 
	 this is the second line.
There is only one line in this string,\not \two.
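
One way to convince yourself that a backslash changes the meaning of the character after it is to check string lengths. In this small sketch (the variable names are just for illustration), `"\n"` counts as a single character, while `"\\n"` counts as two ordinary characters.

```python
newline = "\n"    # one special character: a newline
literal = "\\n"   # two ordinary characters: a backslash and the letter n

print(len(newline))  # 1
print(len(literal))  # 2
print(literal)       # prints a backslash followed by n, with no line break
```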


Why bother with formatting? It turns out that some file formats use these special characters as part of their unique formatting. For example, **BED (Browser Extensible Data) files**, which are used to view genomic annotations in genome browsers, use tabs (`\t`) to separate the fields of each annotation:

In [None]:
print('''chr7\t127471196\t127472363\tPos1\t0\t+\t127471196\t127472363\t255,0,0
chr7\t127472363\t127473530\tPos2\t0\t+\t127472363\t127473530\t255,0,0''')

chr7	127471196	127472363	Pos1	0	+	127471196	127472363	255,0,0
chr7	127472363	127473530	Pos2	0	+	127472363	127473530	255,0,0
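
Because the fields are separated by tabs, a BED line can also be taken apart again. The sketch below is only an illustration: it uses `str.split()`, a string method that hasn't been introduced in this lecture, on the first line from the example above, then coerces the start and end positions to ints so they can be subtracted.

```python
# one BED line, copied from the example above
bed_line = 'chr7\t127471196\t127472363\tPos1\t0\t+\t127471196\t127472363\t255,0,0'

# split the line wherever a tab occurs; the result is a list of strings
fields = bed_line.split('\t')
print(fields[0:4])

# the start and end positions are still strings, so coerce them to ints before doing math
start = int(fields[1])
end = int(fields[2])
print(end - start)
```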


Similarly, **FASTA/FASTQ files** use newlines with `\n` to mark different sequences within a single file:

In [None]:
print('''>chr1 Jackalope chromosome 1;length=7
GATTACA\n>chr2 Jackalope chromosome 2;length=7\nTTACAGA''')

>chr1 Jackalope chromosome 1;length=7
GATTACA
>chr2 Jackalope chromosome 2;length=7
TTACAGA


Once you're more advanced in your Python journey, you can generate these specially formatted files by simply writing strings and inserting special characters where appropriate.
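
As a rough preview of what that could look like, here is a minimal sketch that builds one BED line and one FASTA record from separate pieces. The variable names are made up for illustration, the values come from the examples above, and `str.join()` is a string method that hasn't been covered yet.

```python
# the first six field values of a BED annotation, stored as strings
bed_fields = ['chr7', '127471196', '127472363', 'Pos1', '0', '+']

# join the fields with tabs to build one BED line
bed_line = '\t'.join(bed_fields)
print(bed_line)

# a FASTA record: a header line starting with '>', a newline, then the sequence
name = 'chr1'
sequence = 'GATTACA'
fasta_record = '>' + name + '\n' + sequence
print(fasta_record)
```

Writing these strings to an actual file is a separate step that this sketch leaves out.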

# Exercises

Let's finish up with some exercises. The solutions to these exercises are available in the Day 1 Solutions notebook, and we'll also post the lecturer's copy of the notebook after the end of each day.

## Set A

**A1**: Assign three variables called month, year, and day with the month, year, and day of your birth.

In [None]:
### write your code below ###
month = "Feb"
year = 1996
day = 13

**A2**: Using `print()`, display the value of each variable in the following order: year, month, day.

In [None]:
### write your code below ###
print(year, month, day)

1996 Feb 13


**A3**: Check the type of each of the variables.

In [None]:
### write your code below ###
print(type(year))
print(type(month))
print(type(day))

<class 'int'>
<class 'str'>
<class 'int'>


**A4**: Check the length of each of the variables.

In [None]:
### write your code below ###
print(len(month))
print(len(str(day)))
print(len(str(year)))

3
2
4


**A5**: Coerce the type of each variable to a string, then check the type of each of the variables again.

In [None]:
### write your code below ###
day = str(day)
year = str(year)
print(type(day))
print(type(year))

<class 'str'>
<class 'str'>


**A6**: Create a list that contains the variables in the following order: year, month, day.

In [None]:
### write your code below ###
b_day = [year, month, day]
print(b_day)

['1996', 'Feb', '13']


## Set B

**B1**: Try printing the concatenation of `stringA` and `stringB`.

In [None]:
stringA = 'AGGAGGU'
stringB = 'AUG'
stringC = 'CAG'

### write your code below ###
print(stringA + stringB)

AGGAGGUAUG


**B2**: Try printing `stringC`, repeated 38 times.

In [None]:
### write your code below ###
print(stringC * 38)

CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG


## Set C

Let's try some last quick exercises before we go off to lunch!

**C1**: Calculate the average of 5, 6, 7, 8, and 9.

In [None]:
### write your code below ###
num_list = [5, 6, 7, 8, 9]
avg_list = sum(num_list) / len(num_list)
print(avg_list)

7.0


**C2**: Correct the errors and/or make the necessary changes to the code below. There are multiple ways to make things right!

In [None]:
### original code ###
# print('0' * 40) # This should print 0.
# print(40 / 8) # This should print 5 (integer value, not float)
# print(60 / 5 % 2) # This should print 60.0
### original code ###

### solution ###
print(0 * 40) # This should print 0.
print(40 // 8) # This should print 5 (integer value, not float)
print(60 / (5 % 2)) # This should print 60.0
### solution ###

0
5
60.0
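
If the third fix seems mysterious, it can help to evaluate the expression one step at a time. The sketch below (with illustrative variable names) shows that `/` and `%` have the same precedence and are evaluated left to right, which is why the parentheses change the result.

```python
# original third line: / and % have the same precedence, so Python works left to right
step_one = 60 / 5         # 12.0
original = step_one % 2   # 12.0 % 2 leaves a remainder of 0.0
print(original)

# solution: parentheses force 5 % 2 to be evaluated first
remainder = 5 % 2             # 1
corrected = 60 / remainder    # 60.0
print(corrected)

# both step-by-step results match the one-line expressions
print(60 / 5 % 2 == original)
print(60 / (5 % 2) == corrected)
```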


**C3**: Add special characters to make the string print in FASTA format (see below).

In [None]:
### original code ###
# fasta = '>gene1ACTAGCTACAGTTCGCNAGC>gene2TCGATCNATCGATNGTCGAT'
# print(fasta)
### original code ###

### solution ###
fasta = '>gene1\nACTAGCTACA\nGTTCGCNAGC\n>gene2\nTCGATCNATC\nGATNGTCGAT'
print(fasta)
### solution ###

>gene1
ACTAGCTACA
GTTCGCNAGC
>gene2
TCGATCNATC
GATNGTCGAT


The result of printing your edited string should look like this:

    >gene1
    ACTAGCTACA
    GTTCGCNAGC
    >gene2
    TCGATCNATC
    GATNGTCGAT