# Intro to Python, pandas, and IPython notebooks

### iPython notebooks 

Welcome to Jupyter Notebook on the HPC (high-performance computing cluster). IPython notebooks (also called Jupyter notebooks) are interactive multimedia environments for writing and running code.

Here is the link: https://hpc3.rcic.uci.edu/biojhub3/
Please choose the "containerized notebook image" with "R4.0

If you have written and run your own programs before, you are probably more familiar with running a program once from start to end. Jupyter notebooks provide a way to run code more interactively.

Each unit in a Jupyter notebook is called a cell. Typically when I use ipynbs, I run one cell at a time. You can run a cell by clicking on it and pressing Shift+Enter. Try it on the cell below.

In [1]:
print('hello world')

hello world


Another really nice feature of ipynbs that is particularly well-suited for data science is the persistence of variables after running a cell of code. That is, if you create a variable in one cell, you can access it in another cell and investigate what it contains. This lets you run small chunks of code at a time, make sure they're doing what you want, and then continue using your variables.

Take the following example: say we have a list and we want to swap out all instances of 'dog' for 'cat'.

In [2]:
pet_list = ['dog', 'cat', 'guinea_pig', 'lizard', 'cat', 'dog', 'mouse']

First, we can see if we can identify which parts of the list contain 'dog' instances.

In [3]:
# loop through the list and find the indices of the list where the element is 'dog'
dog_indices = []
for i, element in enumerate(pet_list):
    if element == 'dog':
        dog_indices.append(i)

In [4]:
# make sure that the element at each index we found is 'dog'
for i in dog_indices:
    print()
    print(pet_list[i])
    print(pet_list[i] == 'dog')


dog
True

dog
True


In [5]:
# now that we can be confident that we've identified all the elements of the list that are 'dog', we can replace them
for i in dog_indices:
    pet_list[i] = 'cat'

In [6]:
# and finally, verify that all of the elements of the list have been replaced
print(pet_list)
print('dog' in pet_list)

['cat', 'cat', 'guinea_pig', 'lizard', 'cat', 'cat', 'mouse']
False


This is a pretty simple example, but hopefully it demonstrates the value of being able to stop and examine what your code is doing while you're writing it, instead of debugging it by running everything over again every single time. This convenience really shines when you're dealing with big data and don't have to bother loading or transforming it (which can be very time-consuming steps) every time you want to try something new.

An additional nice feature of ipynbs is their direct compatibility with Markdown. For each cell that you write, you can choose whether it should be interpreted as code, Markdown, or raw text. Markdown supports *text formatting* **such as this**. You can also add headings for different sections of your Jupyter notebook, which you can see in the other parts of this notebook.

### Python

This notebook is running Python. Python is a programming language that is commonly used in bioinformatics. It is (in my opinion) easy to understand and write, and has really powerful libraries that you can load for data science and visualization. I'm hoping a lot of you know how to program already but I'll go over some basic Python syntax and programming basics here.

In [12]:
import pandas as pd # library for data matrix manipulation
import seaborn as sns # library for plotting pandas-formatted data
# by the way, lines that start with '#' are called comments
# add them to your code to remember what you were doing

In [85]:
# types of variables (simple)
my_int = 1 # whole numbers are 'ints' (integers)
print(my_int)
print(type(my_int))
print()

my_float = 0.003 # numbers with a decimal point are 'floats' (floating-point numbers)
print(my_float)
print(type(my_float))
print()

my_string = 'hello world' # text enclosed in either '' or "" is a 'string'
print(my_string)
print(type(my_string))
print()

1
<class 'int'>

0.003
<class 'float'>

hello world
<class 'str'>



In [86]:
my_list = ['frog', 'bat', 'axlotl'] # list
print(my_list[0]) # indexing a list (Python indexing starts at 0)
print(my_list[1])
print(my_list[2])
print(type(my_list))

frog
bat
axlotl
<class 'list'>


In [87]:
my_dict = {'axlotl': 'amphibian', 'bat': 'mammal', 'frog': 'amphibian'} # dictionary - store key:value pairs
print(my_dict['axlotl'])
print(my_dict['bat'])
print(my_dict['frog'])
print(type(my_dict))

amphibian
mammal
amphibian
<class 'dict'>
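
Looking up a key that is not in a dictionary with `[]` raises a `KeyError`. Here is a minimal sketch of two common ways to handle that, reusing the dictionary from the cell above. The key `'cat'` and the default value `'unknown'` are just made up for illustration.

```python
my_dict = {'axlotl': 'amphibian', 'bat': 'mammal', 'frog': 'amphibian'}

# check whether a key exists before looking it up
has_cat = 'cat' in my_dict
print(has_cat)

# .get() returns a default value instead of raising a KeyError
cat_kind = my_dict.get('cat', 'unknown')
print(cat_kind)

# add a new key:value pair by assigning to a key that doesn't exist yet
my_dict['cat'] = 'mammal'
print(my_dict['cat'])
```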


In [88]:
data = ['amphibian', 'mammal', 'amphibian']
ind = ['axlotl', 'bat', 'frog']
df = pd.DataFrame(data=data, index=ind, columns=['kind']) # pandas data frame - I work with these every day!
print(df)
print(type(df))

             kind
axlotl  amphibian
bat        mammal
frog    amphibian
<class 'pandas.core.frame.DataFrame'>


In [15]:
# for loops

# iterate until a certain number
for i in range(10):
    print(i)
    
print()

# iterate through a list
my_list = ['frog', 'bat', 'axlotl'] # list
for animal in my_list:
    print(animal)
    
print()
    
# iterate through a list while getting the number of each iteration using enumerate()
for i, animal in enumerate(my_list):
    print('Animal at {}: {}'.format(i, animal)) # string formatting - this is also useful

0
1
2
3
4
5
6
7
8
9

frog
bat
axlotl

Animal at 0: frog
Animal at 1: bat
Animal at 2: axlotl
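
The `.format()` call in the last loop above can also be written as an f-string (available in Python 3.6 and later), which puts the variable names directly inside the string. This is just a sketch of the same loop; the `lines` list is only there so the results can be inspected afterwards.

```python
my_list = ['frog', 'bat', 'axlotl']

lines = []
for i, animal in enumerate(my_list):
    # the f prefix lets you put variables inside {} directly
    line = f'Animal at {i}: {animal}'
    lines.append(line)
    print(line)
```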


In [19]:
# if / else blocks - execute code based on whether or not a condition is met
for animal in my_list:
    if animal == 'frog' or animal == 'axlotl': # or logic - one or the other condition is met
        print('Animal {} is an amphibian'.format(animal))
    else:
        print('Animal {} is not an amphibian'.format(animal))
        
print()

first_element = True # boolean variable, can be True or False
for animal in my_list:
    if first_element and animal == 'frog': # and logic - both conditions must be true
        print('First animal is frog')
    elif first_element and animal == 'bat': # elif: checked only if the preceding if condition was False. in this case, it will never run because the first animal is 'frog'.
        print('First animal is bat')
    else:
        # you can index strings the same way you can index lists. here we're just trying to see if the word
        # starts with a vowel
        if animal[0] in 'aeiou':
            print('Found an {}'.format(animal))
        else:
            print('Found a {}'.format(animal))
    
    first_element = False

Animal frog is an amphibian
Animal bat is not an amphibian
Animal axlotl is an amphibian

First animal is frog
Found a bat
Found an axlotl


In [21]:
# while loops - keep executing code as long as a condition is true
# useful when you don't know in advance how many iterations you'll need, i.e. you don't know the exact index
my_list = ['cat', 'bat', 'planarian', 'frog', 'axlotl']
i = 0
while my_list[i] != 'frog' and my_list[i] != 'axlotl': # inequality, check if something is not equal to something else
    i += 1 # += operator, increment by 1
print('First amphibian ({}) occurs at index {}'.format(my_list[i], i))

First amphibian (frog) occurs at index 3
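
One thing to watch out for: the while loop above assumes the list contains at least one amphibian. If it doesn't, `i` runs past the end of the list and Python raises an `IndexError`. Here is a sketch of one way to guard against that by also checking the length of the list. The `amphibians` list and the `result` variable are names I made up for this example.

```python
my_list = ['cat', 'bat', 'planarian']  # no amphibians in this list
amphibians = ['frog', 'axlotl']

i = 0
# stop either when we find an amphibian or when we run out of list
while i < len(my_list) and my_list[i] not in amphibians:
    i += 1

if i < len(my_list):
    result = 'First amphibian ({}) occurs at index {}'.format(my_list[i], i)
else:
    result = 'No amphibian found'
print(result)
```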


In [22]:
# list indexing / slicing

# access individual elements using individual numbers 
# 0 is first element, -1 is last element, -2 is second from last element
print(my_list[0])
print(my_list[-1])
print(my_list[-2])

print()

# slice list using the : operator
print(my_list[:-1]) # all elements of the list but the last one
print(my_list[1:]) # all elements of the list but the first one
print(my_list[2:4]) # some middle elements of the list

cat
axlotl
frog

['cat', 'bat', 'planarian', 'frog']
['bat', 'planarian', 'frog', 'axlotl']
['planarian', 'frog']
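
Slices can also take a third number, the step, written as `start:stop:step`. A small sketch using the same list as above (the variable names are just for illustration):

```python
my_list = ['cat', 'bat', 'planarian', 'frog', 'axlotl']

every_other = my_list[::2]    # every second element, starting from the first
reversed_list = my_list[::-1] # a step of -1 walks the list backwards
first_two = my_list[:2]       # the stop index is not included

print(every_other)
print(reversed_list)
print(first_two)
```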


In [4]:
# list comprehension - python provides a compact way to iterate through lists
# this is a little tricky so I don't entirely recommend it if you aren't solid
# on other programming concepts

# say we want to make a list from this dictionary that includes both 
# the animal and the type of animal it is in the format "axlotl_amphibian"
my_dict = {'axlotl': 'amphibian', 'bat': 'mammal', 'frog': 'amphibian'} 

# original for loop 
new_list = []
for key, item in my_dict.items(): # this is how you iterate through key:item pairs in a dictionary btw
    new_list.append(key+'_'+item) # this is how you add an element to a list and how you concatenate strings together
print(new_list)

print()

# list comprehension
new_list = [key+'_'+item for key, item in my_dict.items()]
print(new_list)

['axlotl_amphibian', 'bat_mammal', 'frog_amphibian']

['axlotl_amphibian', 'bat_mammal', 'frog_amphibian']


In [8]:
# list comprehension with if / else
# you can use if / else logic in list comprehension as well 

# say we want to make a list of True and False values to say
# whether or not each animal in a list is an amphibian
my_list = ['frog', 'bat', 'axlotl']

# original for loop
new_list = []
for animal in my_list:
    if animal == 'frog' or animal == 'axlotl':
        new_list.append(True)
    else:
        new_list.append(False)
print(new_list)

print()

# list comprehension
new_list = [True if animal == 'frog' or animal == 'axlotl' else False for animal in my_list]
print(new_list)

[True, False, True]

[True, False, True]


In [10]:
# list comprehension with if 
# here the if goes at the end because it filters which elements are kept, rather than choosing the value to store

# say we want to get a list of animals that are amphibians from our list
my_list = ['frog', 'bat', 'axlotl']

# original for loop
new_list = []
for animal in my_list:
    if animal == 'frog' or animal == 'axlotl':
        new_list.append(animal)
print(new_list)

print()

# list comprehension
new_list = [animal for animal in my_list if animal == 'frog' or animal == 'axlotl']
print(new_list)

['frog', 'axlotl']

['frog', 'axlotl']
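
The same comprehension syntax works for dictionaries too, using `{key: value for ...}`. This is a minimal sketch with the animal dictionary from earlier; `amphibian_dict` and `name_lengths` are names made up for the example.

```python
my_dict = {'axlotl': 'amphibian', 'bat': 'mammal', 'frog': 'amphibian'}

# keep only the key:value pairs where the value is 'amphibian'
amphibian_dict = {animal: kind for animal, kind in my_dict.items() if kind == 'amphibian'}
print(amphibian_dict)

# build a new dictionary that maps each animal to the length of its name
name_lengths = {animal: len(animal) for animal in my_dict}
print(name_lengths)
```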


### Pandas

Pandas is a really powerful library for data matrix manipulation. I'll go over some common uses of it here. 

If you end up using a lot of pandas though, please feel free to reach out to me via Slack! Big data manipulation is pretty tricky and the language that is used in the field is specific and difficult to master for Googling. If you feel like you're trying to do something with a matrix that _feels_ like there's got to be a better way to do it, you're probably right and I can help you investigate.

In [47]:
# create a dataframe
data = ['amphibian', 'mammal', 'amphibian']
ind = ['axlotl', 'bat', 'frog']
df = pd.DataFrame(data=data, index=ind, columns=['kind'])
df.index.name = 'animal'
df

             kind
animal
axlotl  amphibian
bat        mammal
frog    amphibian


In [48]:
# write a dataframe to a tsv file
df.to_csv('animals.tsv', sep='\t')

In [49]:
# read in a dataframe from a tsv file
df = pd.read_csv('animals.tsv', sep='\t') # https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
df.head() # use df.head() to look at what's in the beginning of the dataframe

   animal       kind
0  axlotl  amphibian
1     bat     mammal
2    frog  amphibian
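
Notice that after the round trip through the tsv file, `animal` came back as a regular column and the rows got a new numeric index. If you would rather keep `animal` as the index, `read_csv` has an `index_col` option. Here is a sketch of the difference. To keep it self-contained it writes to an in-memory buffer (`io.StringIO`) instead of a real file, which is my substitution and not something the cells above do.

```python
import io
import pandas as pd

df = pd.DataFrame(data=['amphibian', 'mammal', 'amphibian'],
                  index=['axlotl', 'bat', 'frog'], columns=['kind'])
df.index.name = 'animal'

# write the tsv to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, sep='\t')

# default behavior: the old index becomes an ordinary column
buffer.seek(0)
df_default = pd.read_csv(buffer, sep='\t')
print(df_default)

# index_col tells pandas which column to use as the index
buffer.seek(0)
df_indexed = pd.read_csv(buffer, sep='\t', index_col='animal')
print(df_indexed)
```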


In [50]:
# use groupby to count the number of occurrences of the different kinds
temp = df.groupby('kind').count().reset_index()
temp.rename({'animal': 'counts'}, axis=1, inplace=True)
temp

        kind  counts
0  amphibian       2
1     mammal       1
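
For simple counting like this, pandas also has a couple of shortcuts. The sketch below rebuilds the small dataframe so it runs on its own, then counts the kinds with `value_counts()` and with `groupby(...).size()`. Both return a Series indexed by kind rather than a dataframe.

```python
import pandas as pd

df = pd.DataFrame({'animal': ['axlotl', 'bat', 'frog'],
                   'kind': ['amphibian', 'mammal', 'amphibian']})

# count how many times each value appears in a single column
counts = df['kind'].value_counts()
print(counts)

# size() counts the number of rows in each group
sizes = df.groupby('kind').size()
print(sizes)
```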


In [51]:
# you can use pandas to perform different groupby operations as well 
# for instance, say we have a column to represent how many animals 
# a person has
df['fairlie'] = [3,0,70] # i don't actually have any of these pets
df['liz'] = [0,4,0] # neither do the other tas
df['jaz'] = [10,2,3]
df['nam'] = [25,6,2] # am i really the only one that doesn't have a 3 letter name?
print(df.head())

print()

temp = df.groupby('kind').sum(numeric_only=True).reset_index() # numeric_only=True skips the text column 'animal'
print(temp.head())

   animal       kind  fairlie  liz  jaz  nam
0  axlotl  amphibian        3    0   10   25
1     bat     mammal        0    4    2    6
2    frog  amphibian       70    0    3    2

        kind  fairlie  liz  jaz  nam
0  amphibian       73    0   13   27
1     mammal        0    4    2    6


In [52]:
# transpose
print(df.head())

temp = df.transpose()

print()

print(temp.head())

   animal       kind  fairlie  liz  jaz  nam
0  axlotl  amphibian        3    0   10   25
1     bat     mammal        0    4    2    6
2    frog  amphibian       70    0    3    2

                 0       1          2
animal      axlotl     bat       frog
kind     amphibian  mammal  amphibian
fairlie          3       0         70
liz              0       4          0
jaz             10       2          3


In [53]:
# access elements in the df using conditions
temp = df.loc[df.kind == 'amphibian'] # only amphibians
print(temp.head())

   animal       kind  fairlie  liz  jaz  nam
0  axlotl  amphibian        3    0   10   25
2    frog  amphibian       70    0    3    2


In [54]:
# access elements in the df using more than one condition (using and logic)

# only amphibians that are also frogs (this is a bad example but I promise it will extend to more complex situations)
temp = df.loc[(df.kind == 'amphibian')&(df.animal == 'frog')]
print(temp.head())

  animal       kind  fairlie  liz  jaz  nam
2   frog  amphibian       70    0    3    2
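
Besides `&` (and), conditions can be combined with `|` (or) and flipped with `~` (not), and `isin()` checks a column against a list of values. A small sketch on a cut-down copy of the dataframe (only the columns needed for the example):

```python
import pandas as pd

df = pd.DataFrame({'animal': ['axlotl', 'bat', 'frog'],
                   'kind': ['amphibian', 'mammal', 'amphibian'],
                   'liz': [0, 4, 0]})

# or logic: mammals, or anything that is a frog
either = df.loc[(df.kind == 'mammal') | (df.animal == 'frog')]
print(either)

# not logic: everything that is not an amphibian
not_amphibian = df.loc[~(df.kind == 'amphibian')]
print(not_amphibian)

# isin: rows where the animal is any of the listed values
listed = df.loc[df.animal.isin(['axlotl', 'frog'])]
print(listed)
```

Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than `==` in Python.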


In [62]:
# perform database-style merges 
# there are a lot of options so please look at the documentation for more info!
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
data = [['frog', 0, 10], ['axlotl', 0, 20]]
# data = [['frog', 'axlotl'], [0,0], [10,20]]

columns = ['animal_new', 'ali', 'kyoko']
new_df = pd.DataFrame(data=data, columns=columns)
print(new_df)
print(df)
print()

# merge new animal ownership info with old 
temp = df.merge(new_df, how='outer', left_on='animal', right_on='animal_new')
print(temp)

print()

# fill NaN (not a number) values with 0s and replace the missing animal_new values
temp['animal_new'] = temp['animal']
temp.fillna(0, inplace=True)
print(temp)

  animal_new  ali  kyoko
0       frog    0     10
1     axlotl    0     20
   animal       kind  fairlie  liz  jaz  nam
0  axlotl  amphibian        3    0   10   25
1     bat     mammal        0    4    2    6
2    frog  amphibian       70    0    3    2

   animal       kind  fairlie  liz  jaz  nam animal_new  ali  kyoko
0  axlotl  amphibian        3    0   10   25     axlotl  0.0   20.0
1     bat     mammal        0    4    2    6        NaN  NaN    NaN
2    frog  amphibian       70    0    3    2       frog  0.0   10.0

   animal       kind  fairlie  liz  jaz  nam animal_new  ali  kyoko
0  axlotl  amphibian        3    0   10   25     axlotl  0.0   20.0
1     bat     mammal        0    4    2    6        bat  0.0    0.0
2    frog  amphibian       70    0    3    2       frog  0.0   10.0
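
When a merge produces NaN values like the ones above, it can help to see which rows actually found a match. The `indicator=True` option of `merge` adds a `_merge` column for this. The sketch below uses cut-down versions of the two dataframes and a left merge on a shared `animal` column, which is a simplification of the cell above rather than the same call.

```python
import pandas as pd

df = pd.DataFrame({'animal': ['axlotl', 'bat', 'frog'],
                   'fairlie': [3, 0, 70]})
new_df = pd.DataFrame({'animal': ['frog', 'axlotl'],
                       'kyoko': [10, 20]})

# how='left' keeps every row of df; indicator=True records where each row came from
merged = df.merge(new_df, how='left', on='animal', indicator=True)
print(merged)

# rows marked 'left_only' had no match in new_df
unmatched = merged.loc[merged['_merge'] == 'left_only', 'animal'].tolist()
print(unmatched)
```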


In [76]:
# melt
# sometimes you need to convert data from wide format, where each row has multiple data entries
# (in this case, each entry corresponds to the number of animals each person has),
# into long format, where each row represents one entry.
# I commonly use melt to coerce the data into a format that is compatible with
# the plotting tools I want to use.
# for me at least, the options are a little confusing, but just play with them until
# melt does what you want
# https://pandas.pydata.org/docs/reference/api/pandas.melt.html

print(df)
print()

temp = df.melt(id_vars=['animal', 'kind'], value_vars=['fairlie', 'liz', 'jaz', 'nam'], 
               var_name='person', value_name='counts')
print(temp)

   animal       kind  fairlie  liz  jaz  nam
0  axlotl  amphibian        3    0   10   25
1     bat     mammal        0    4    2    6
2    frog  amphibian       70    0    3    2

    animal       kind   person  counts
0   axlotl  amphibian  fairlie       3
1      bat     mammal  fairlie       0
2     frog  amphibian  fairlie      70
3   axlotl  amphibian      liz       0
4      bat     mammal      liz       4
5     frog  amphibian      liz       0
6   axlotl  amphibian      jaz      10
7      bat     mammal      jaz       2
8     frog  amphibian      jaz       3
9   axlotl  amphibian      nam      25
10     bat     mammal      nam       6
11    frog  amphibian      nam       2


In [77]:
# pivot
# pivot is the opposite of melt. turn a long-form data matrix into a wide-form one

print(temp)
print()

temp = temp.pivot(index=['animal', 'kind'], columns='person', values='counts')
print(temp)

    animal       kind   person  counts
0   axlotl  amphibian  fairlie       3
1      bat     mammal  fairlie       0
2     frog  amphibian  fairlie      70
3   axlotl  amphibian      liz       0
4      bat     mammal      liz       4
5     frog  amphibian      liz       0
6   axlotl  amphibian      jaz      10
7      bat     mammal      jaz       2
8     frog  amphibian      jaz       3
9   axlotl  amphibian      nam      25
10     bat     mammal      nam       6
11    frog  amphibian      nam       2

person            fairlie  jaz  liz  nam
animal kind                             
axlotl amphibian        3   10    0   25
bat    mammal           0    2    4    6
frog   amphibian       70    3    0    2
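
One catch with `pivot` is that it only works when each index/column combination appears once. If your long-form data has repeated entries, `pivot` raises an error, and `pivot_table` is the usual alternative because it lets you say how to combine the duplicates. A sketch with made-up data where jaz's frogs are listed twice:

```python
import pandas as pd

# hypothetical long-form data: the (frog, jaz) combination appears twice
long_df = pd.DataFrame({'animal': ['frog', 'frog', 'bat', 'bat', 'frog'],
                        'person': ['jaz', 'nam', 'jaz', 'nam', 'jaz'],
                        'counts': [3, 2, 2, 6, 1]})

# aggfunc='sum' adds up the duplicate entries while reshaping
wide = long_df.pivot_table(index='animal', columns='person',
                           values='counts', aggfunc='sum')
print(wide)
```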


In [79]:
# perform mathematical operations on your dataframe
# pandas applies math operations to entire columns at once (vectorized), which is much faster than looping over rows in Python

# create a new column that's the sum of the animals that liz and fairlie have
df['3rd_years'] = df['fairlie']+df['liz']
print(df)

   animal       kind  fairlie  liz  jaz  nam  3rd_years
0  axlotl  amphibian        3    0   10   25          3
1     bat     mammal        0    4    2    6          4
2    frog  amphibian       70    0    3    2         70


In [80]:
# you can also perform operations that involve constants that 
# are not columns in the dataframe

# create a new column that multiplies jaz's animals by 10
df['jaz_x10'] = df['jaz']*10
print(df)

   animal       kind  fairlie  liz  jaz  nam  3rd_years  jaz_x10
0  axlotl  amphibian        3    0   10   25          3      100
1     bat     mammal        0    4    2    6          4       20
2    frog  amphibian       70    0    3    2         70       30


In [81]:
# what if we want to determine what percent of nam's total collection
# each animal makes up?

# use the sum function to compute the total number of
# animals that nam has
nam_total = df['nam'].sum()
print(nam_total)
print()

# create a new column with the percent of nam's
# total animals that each animal makes up
df['nam_percent'] = (df['nam']/nam_total)*100
print(df)

33

   animal       kind  fairlie  liz  jaz  nam  3rd_years  jaz_x10  nam_percent
0  axlotl  amphibian        3    0   10   25          3      100    75.757576
1     bat     mammal        0    4    2    6          4       20    18.181818
2    frog  amphibian       70    0    3    2         70       30     6.060606
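
The same idea extends to several columns at once. As a sketch, the code below sums across the people columns for each animal (`axis=1` means "across columns") and then converts every person's column to percentages in one step with `div`. The `people` list and the `total` and `percents` names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'animal': ['axlotl', 'bat', 'frog'],
                   'fairlie': [3, 0, 70],
                   'liz': [0, 4, 0],
                   'jaz': [10, 2, 3],
                   'nam': [25, 6, 2]})
people = ['fairlie', 'liz', 'jaz', 'nam']

# total number of each animal, summed across all people
df['total'] = df[people].sum(axis=1)
print(df)

# divide each person's column by that person's total, then scale to percent
column_totals = df[people].sum(axis=0)
percents = df[people].div(column_totals, axis=1) * 100
print(percents)
```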


In [82]:
# unique values - get the unique values present in a column of a dataframe

print(df['kind'].tolist())

print(df['kind'].unique().tolist())

['amphibian', 'mammal', 'amphibian']
['amphibian', 'mammal']
