# Introduction to Python

Python is, just like R, an interpreted programming language. Its design emphasizes code readability and a syntax that lets you express concepts in fewer lines of code.

Currently there are two main versions of Python: the older but still more commonly used 2.7.X, and the newer 3.X. The transition from 2.7 to 3.X changed the language enough that existing scripts could not simply be carried over. The same goes for most Python packages, which is why 2.7.X, although no longer actively developed, has not yet been discontinued (the current end-of-life date is set to 2020).

In this class we will be working with Python 2.7.13 which is still the default version when using the Anaconda installer.

# Python Data Types

Numbers

In [33]:
5  # this will not show in a notebook but will in a regular Python interpreter
print 5
a = 12

print a + 2

print a + 3.2

5
14
15.2
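The sums above mix an integer with a float; Python promotes the integer to a float in that case. The following is a minimal sketch of that behaviour and of the two kinds of division. The variable names are illustrative, and `print(...)` is written with parentheses so the sketch runs under both Python 2.7 and Python 3.

```python
a = 12

int_sum = a + 2        # int + int stays an int
float_sum = a + 3.2    # int + float is promoted to a float

print(type(int_sum).__name__)    # int
print(type(float_sum).__name__)  # float

# Floor division (//) discards the remainder in both Python 2 and 3.
floor_result = 7 // 2    # 3
# Dividing by a float always keeps the fractional part.
true_result = 7 / 2.0    # 3.5

print(floor_result)
print(true_result)
```

In Python 2.7 a plain `7 / 2` between two integers also returns `3`, so it is safest to make one operand a float whenever you need the fractional result.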


Strings: string operations in Python are very fast (generally much faster than in R).

In [3]:
'a'

print 'abc'

my_string = 'a' + 'bc'
print my_string

abc
abc
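Beyond concatenation with `+`, strings come with a number of built-in methods. A short sketch of a few common ones follows; the example strings are made up, and `print(...)` uses parentheses so the code runs under both Python 2.7 and 3.

```python
my_string = 'a' + 'bc'

upper = my_string.upper()            # change case: 'ABC'
first_two = my_string[0:2]           # slicing works like on lists: 'ab'
words = 'This Is A Test!'.split()    # split on whitespace into a list
joined = '-'.join(words)             # glue list elements back together

print(upper)
print(first_two)
print(words)
print(joined)
```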


Lists: can store multiple types and are mutable.

In [31]:
my_list = [123, 'aString', 4.5]
print len(my_list)

print my_list + [5, 6, 7]

my_list.append('12')
print my_list

print my_list[0]
print my_list[0:3]

3
[123, 'aString', 4.5, 5, 6, 7]
[123, 'aString', 4.5, '12']
123
[123, 'aString', 4.5]
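Because lists are mutable, two names can refer to the same list, and a change made through one name is visible through the other. A minimal sketch, with illustrative variable names (`alias`, `independent_copy`) that are not part of the class material:

```python
my_list = [123, 'aString', 4.5]

alias = my_list                   # a second name for the SAME list object
independent_copy = list(my_list)  # a new list holding the same elements

my_list[0] = 'changed'            # modify the list in place
my_list.append('12')

print(alias)             # shows both changes
print(independent_copy)  # still the original three elements
print(len(my_list))
```

Use `list(...)` (or a full slice such as `my_list[:]`) whenever you want to keep the original values around before modifying a list.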


Dictionaries: map keys to values and have very fast lookup time!

In [8]:
my_dictionary = {'apples': 3, 'surfing': 'yes'}
my_dictionary['surfing']

'yes'

In [9]:
my_dictionary['running'] = 'nope'
print my_dictionary

{'running': 'nope', 'surfing': 'yes', 'apples': 3}
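Note that the printed order above differs from the order in which the keys were added: dictionaries in Python 2.7 are unordered. The sketch below shows a few other common dictionary operations. The key `'biking'` and the default `'unknown'` are made up for illustration.

```python
my_dictionary = {'apples': 3, 'surfing': 'yes'}
my_dictionary['running'] = 'nope'

# Test whether a key exists before using it.
has_apples = 'apples' in my_dictionary
has_biking = 'biking' in my_dictionary

# my_dictionary['biking'] would raise a KeyError;
# get() returns a default value instead.
biking = my_dictionary.get('biking', 'unknown')

print(has_apples)
print(has_biking)
print(biking)

# Sort the keys to get a predictable order when looping.
for key in sorted(my_dictionary):
    print(key + ': ' + str(my_dictionary[key]))
```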


Tuples: are similar to lists but immutable.

In [10]:
my_tuple = (4, 3, 2, 1)
print my_tuple
print my_tuple.index(3)

(4, 3, 2, 1)
1
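To see what "immutable" means in practice, the sketch below tries to change a tuple element and catches the resulting error. It also shows two things tuples are handy for: unpacking into several variables and serving as dictionary keys. The names `coordinates` and `error_message` are illustrative.

```python
my_tuple = (4, 3, 2, 1)

# Assigning to an element of a tuple raises a TypeError.
error_message = None
try:
    my_tuple[0] = 99
except TypeError as err:
    error_message = str(err)
print(error_message)

# Unpacking: assign every element to its own variable in one statement.
first, second, third, fourth = my_tuple
print(first)

# Because tuples cannot change, they can be used as dictionary keys.
coordinates = {(0, 0): 'origin', (1, 2): 'some point'}
print(coordinates[(1, 2)])
```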


# Python Packages

Similar to R, Python has good default capabilities which can become very powerful once you start adding packages. Some of the most commonly used packages are:

In [21]:
import pandas as pd  # used generally for mixed data types
import numpy as np  # used generally for pure numeric matrices
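
As a rough illustration of the difference between the two packages, the sketch below builds a small numeric matrix with numpy and a small mixed-type table with pandas. All values and column names are made up, and it assumes both packages are installed.

```python
import numpy as np
import pandas as pd

# numpy: every element has the same numeric type, and arithmetic
# is applied to all elements at once.
matrix = np.array([[61, 92],
                   [23, 21]])
doubled = matrix * 2
column_means = matrix.mean(axis=0)   # one mean per column

print(doubled)
print(column_means)

# pandas: each column can hold a different type (text and numbers here).
table = pd.DataFrame({'Geneid': ['geneA', 'geneB'],
                      'count': [61, 23]})
print(table.shape)
```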

In [16]:
data_dir = '/home/ucsd-train40/projects/tardbp_shrna/deseq2/'

# Use the pandas read_table function to read in your data. Make sure the
# file name matches what your data is called on tscc and that it is located
# in the folder that you have defined as data_dir.

# Use comment= to specify the character that comment lines start with;
# we don't want to read those in. Set the gene_id as the index for this
# dataframe with index_col=0
# counts = pd.read_table(data_dir+"all_counts.txt", comment="#", index_col=0)

# I prefer reading in data tables without setting an index column
counts = pd.read_csv(data_dir + "tardbp_counts_for_deseq2.csv", comment="#")

In [17]:
print counts.shape
counts.head()

(16582, 5)


Unnamed: 0,Geneid,NT_shRNA_hepg2_Rep1,NT_shRNA_hepg2_Rep2,TARDBP_shRNA_hepg2_Rep1,TARDBP_shRNA_hepg2_Rep2
0,ENSG00000227232.4,61,92,69,58
1,ENSG00000237683.5,23,21,17,28
2,ENSG00000239906.1,11,2,5,7
3,ENSG00000241860.2,26,32,35,35
4,ENSG00000228463.4,77,69,63,66
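The effect of the `comment` and `index_col` arguments can be tried out without access to the class data. The sketch below reads a tiny, made-up CSV from an in-memory string instead of a file; the gene names, sample names and the comment line are hypothetical.

```python
import io
import pandas as pd

# Hypothetical file contents: one comment line, a header, two data rows.
csv_text = u"""# produced by a hypothetical counting step
Geneid,sample_1,sample_2
geneA,61,92
geneB,23,21
"""

# comment="#" makes pandas skip the line that starts with '#'.
with_default_index = pd.read_csv(io.StringIO(csv_text), comment="#")

# index_col=0 additionally turns the first column into the row index.
with_gene_index = pd.read_csv(io.StringIO(csv_text), comment="#", index_col=0)

print(with_default_index.shape)   # Geneid counts as a regular column
print(with_gene_index.shape)      # Geneid is now the index, not a column
print(list(with_gene_index.index))
```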


In [19]:
# We are going to get rid of the Geneid column by feeding the drop command
# a LIST of column names to drop. Notice the syntax for writing a list is
# ['item1','item2']. Use axis=1 to specify that we are dropping these
# COLUMNS. axis=0 would instead look for those labels among the ROWS. Again,
# look at how this manipulation changed the shape and content of your dataframe.

counts_only = counts.drop(['Geneid'], axis = 1)
counts_only.head()

Unnamed: 0,NT_shRNA_hepg2_Rep1,NT_shRNA_hepg2_Rep2,TARDBP_shRNA_hepg2_Rep1,TARDBP_shRNA_hepg2_Rep2
0,61,92,69,58
1,23,21,17,28
2,11,2,5,7
3,26,32,35,35
4,77,69,63,66
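The difference between `axis=1` and `axis=0` is easiest to see on a small table. The following sketch uses a made-up three-row dataframe (`toy`) and is only meant to illustrate the `drop` call used above.

```python
import pandas as pd

# Hypothetical toy data, not the class data set.
toy = pd.DataFrame({'Geneid': ['geneA', 'geneB', 'geneC'],
                    'sample_1': [61, 23, 11],
                    'sample_2': [92, 21, 2]})

# axis=1: the labels in the list are COLUMN names.
without_column = toy.drop(['Geneid'], axis=1)

# axis=0: the labels in the list are ROW index labels.
without_rows = toy.drop([0, 2], axis=0)

print(toy.shape)             # drop returns a new dataframe; toy is unchanged
print(without_column.shape)
print(without_rows.shape)
```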


In [20]:
# We are going to rename the columns to something more meaningful using the
# function rename. Inside of rename, we pass columns={"oldname":"newname","oldname":"newname"}
# Keep in mind that you can break onto a new line whenever you are inside a parenthesis
# or bracket without breaking your command. This makes your code easier to read. Rename
# all the columns that you want to, and assign that output to a variable called counts.

counts = counts.rename(columns={"NT_shRNA_hepg2_Rep1":"NT_hepg2_Rep1",
                               "NT_shRNA_hepg2_Rep2":"NT_hepg2_Rep2",
                               "TARDBP_shRNA_hepg2_Rep1":"TARDBP_hepg2_Rep1",
                               "TARDBP_shRNA_hepg2_Rep2":"TARDBP_hepg2_Rep2"})

counts.head()

Unnamed: 0,Geneid,NT_hepg2_Rep1,NT_hepg2_Rep2,TARDBP_hepg2_Rep1,TARDBP_hepg2_Rep2
0,ENSG00000227232.4,61,92,69,58
1,ENSG00000237683.5,23,21,17,28
2,ENSG00000239906.1,11,2,5,7
3,ENSG00000241860.2,26,32,35,35
4,ENSG00000228463.4,77,69,63,66
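Two details of `rename` are worth knowing: it returns a new dataframe rather than changing the original, and by default it silently ignores old names that do not exist (so a typo in a column name goes unnoticed). A small sketch with made-up column names:

```python
import pandas as pd

# Hypothetical toy data with two columns.
toy = pd.DataFrame({'NT_shRNA_Rep1': [61, 23],
                    'NT_shRNA_Rep2': [92, 21]})

renamed = toy.rename(columns={'NT_shRNA_Rep1': 'NT_Rep1',
                              'no_such_column': 'ignored'})

print(sorted(renamed.columns))  # only the existing column was renamed
print(sorted(toy.columns))      # the original dataframe is unchanged
```

This is why the cell above assigns the result back to `counts`.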


# Reading and Writing Files

In [22]:
# to write to a file you have to create a file object to which you write.
# after you finish writing, make sure you close your file object, or else it's easy for the output to get corrupted
f = open('test_writing.txt', 'w')
f.write('This\n')
f.write('Is\n')
f.write('A\n')
f.write('Test!\n')
f.close()

# reading files
f = open('test_writing.txt', 'r')
my_text = f.read()
print my_text  # the whole file comes back as one long string, so we have to parse it

my_parsed_file = my_text.split()
print my_parsed_file



This
Is
A
Test!

['This', 'Is', 'A', 'Test!']


In [24]:
# while read is convenient, it is recommended to read line-by-line.
# note: f was already read to its end in the cell above, so this loop finds no lines left
my_other_parsed_file = []
for line in f:
    my_other_parsed_file.append(line)
    
    
print my_other_parsed_file

[]
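The empty list above is a consequence of how file objects work: they remember the current read position, and `read()` moved that position to the end of the file. A sketch of this behaviour follows, together with `seek(0)`, which rewinds the position. The sketch rewrites `test_writing.txt` first so that it is self-contained.

```python
# Recreate the small test file so the sketch is self-contained.
f = open('test_writing.txt', 'w')
f.write('This\nIs\nA\nTest!\n')
f.close()

f = open('test_writing.txt', 'r')
my_text = f.read()               # moves the read position to the end

leftover_lines = f.readlines()   # nothing is left to read
print(leftover_lines)

f.seek(0)                        # rewind to the beginning of the file
rewound_lines = f.readlines()
f.close()
print(rewound_lines)
```

Opening the file again, as the next cell does with `f2`, has the same effect as rewinding.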


In [28]:
f2 = open('test_writing.txt', 'r')

my_other_parsed_file = []
my_other_file = []
for line in f2:
    my_other_parsed_file.append(line.split()[0])  # careful with 0-indexing in python
    my_other_file.append(line)
    
    
print my_other_parsed_file
print my_other_file

['This', 'Is', 'A', 'Test!']
['This\n', 'Is\n', 'A\n', 'Test!\n']
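An alternative that avoids forgetting `close()` is the `with` statement, which closes the file automatically when the indented block ends. The sketch below also removes the trailing newline from each line with `rstrip`. The names `out_file`, `in_file` and `stripped_lines` are illustrative.

```python
# Write the test file; it is closed automatically after the block.
with open('test_writing.txt', 'w') as out_file:
    out_file.write('This\nIs\nA\nTest!\n')

# Read it back line by line, dropping the newline at the end of each line.
stripped_lines = []
with open('test_writing.txt', 'r') as in_file:
    for line in in_file:
        stripped_lines.append(line.rstrip('\n'))

print(stripped_lines)
print(in_file.closed)   # the file was closed when the with block ended
```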


# Other things about Python

A commonly used feature in Python is the 'list comprehension'. It lets you build a list from a loop (optionally with a condition) in a single line, and it is usually a bit faster than the equivalent for loop with append.

In [29]:
normal_loop_output = []
for i in range(0, 10):
    normal_loop_output.append(i)
    
# the comprehension itself builds the list; no append is needed
lc_loop_output = [i for i in range(0, 10)]

print normal_loop_output
print lc_loop_output

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
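A list comprehension can also transform and filter the elements in the same expression. The sketch below computes the squares of the even numbers below 10, once with a regular loop and once as a comprehension; the variable names are illustrative.

```python
# Regular loop: keep even numbers only and square them.
squares_loop = []
for i in range(0, 10):
    if i % 2 == 0:
        squares_loop.append(i ** 2)

# The same logic as a list comprehension:
# [expression  for item in sequence  if condition]
squares_lc = [i ** 2 for i in range(0, 10) if i % 2 == 0]

print(squares_loop)
print(squares_lc)
```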
