# Introduction to Python Packages

### Why are packages useful? We can write less code and we don't need to reinvent the wheel!


In [None]:
#Image displays a picture from a URL inside the notebook
from IPython.display import Image
Image(url="https://i.imgflip.com/42bnn2.jpg", width=300)

Let's look at this example for calculating the mean of a list of numbers. 

In [None]:
nums = [5,3,2] 
calc_mean = sum(nums)/len(nums)
print(calc_mean)

The ```statistics``` package has a function called ```mean```. Let's import it and use the ```mean``` function to calculate the mean of the numbers in the list.

In [None]:
from statistics import mean

nums =[5,3,2] 
print(mean(nums))
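
As a quick check, the two approaches above can be run side by side. This is only a small sketch: the variable names are made up for illustration, and ```median``` and ```stdev``` are included simply to show that the same package offers more than ```mean```.

```python
from statistics import mean, median, stdev

nums = [5, 3, 2]

manual_mean = sum(nums) / len(nums)  # the hand-written calculation
package_mean = mean(nums)            # the same calculation from the package

print(manual_mean, package_mean)
print(median(nums))  # middle value of the sorted list
print(stdev(nums))   # sample standard deviation
```

Both means should agree, which is the point: the package does the work we would otherwise write ourselves.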

### Math-based packages: Numpy and random

### Random: Functions

Now let's start by exploring the ```random``` package. The ```random``` package generates pseudorandom float or int values and has many useful functions and applications. Let's start by importing the package:

In [None]:
import random

To get a random number, we can use the ```random()``` function. It returns a float that is at least 0.0 and less than 1.0.

In [None]:
random.random()

Like regular floats, we can perform mathematical operations on the value returned by ```random()```. For example:

In [None]:
random.random() * 100
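
Scaling by hand works, but the package also has functions for picking a number from a range directly. A minimal sketch, with ranges chosen only for illustration:

```python
import random

scaled = random.random() * 100    # float that is at least 0.0 and less than 100.0
in_range = random.uniform(5, 10)  # float between 5 and 10
whole = random.randint(1, 6)      # int from 1 to 6, both ends included

print(scaled, in_range, whole)
```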

The ```choice()``` function picks a random item from a list. Let's make a list to choose from:

In [None]:
beatles = ['George', 'Paul', 'Ringo', 'John']

In [None]:
random.choice(beatles)

If we'd like to get the same item each time, we can set a seed. A random seed specifies the starting point of the sequence a computer generates when producing random numbers, so the same seed always gives the same sequence.

In [None]:
random.seed(22)
random.choice(beatles)

If we run it again, we should get the same value returned.

In [None]:
random.seed(22)
random.choice(beatles)
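
The seed fixes the whole sequence, not just the first pick. Here is a small sketch that draws five names twice with the same seed and compares the results (the names ```first_run``` and ```second_run``` are made up for this example):

```python
import random

beatles = ['George', 'Paul', 'Ringo', 'John']

random.seed(22)
first_run = [random.choice(beatles) for _ in range(5)]

random.seed(22)  # resetting the seed restarts the same sequence
second_run = [random.choice(beatles) for _ in range(5)]

print(first_run)
print(second_run)
print(first_run == second_run)
```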

### Using Random to create pseudodata

The ```random()``` function is very useful for generating pseudodata. Here we are generating a dataset of 100 float values randomly selected using the ```random()``` function.

In [None]:
#initialize a list which we will add random values to
pseudodata = []

#populating the pseudodata list with 100 float values
for i in range(100):
    pseudodata.append(random.random())

Let's print out the first 10 floats in the pseudodata list.

In [None]:
print(pseudodata[:10])
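
The loop above can also be written in one line as a list comprehension. This is just an alternative sketch; it uses a new name, ```pseudodata_alt```, so it does not overwrite the list we already made.

```python
import random

# Same idea as the for loop: call random.random() 100 times and collect the results
pseudodata_alt = [random.random() for _ in range(100)]

print(len(pseudodata_alt))
print(pseudodata_alt[:3])
```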

### Numpy

```Numpy``` is a Python package that supports large arrays and matrices and provides mathematical functions for working with them. Let's start by importing the package.

In [None]:
import numpy as np

Let's take a simple example of using the function ```sum()``` to find the sum of all data points within pseudodata.

In [None]:
np.sum(pseudodata)

How about the ```mean()```?

In [None]:
np.mean(pseudodata)

And the variance, or ```var()```?

In [None]:
np.var(pseudodata)
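
To see what ```var()``` is doing, we can compare it with a hand calculation on a short list. The numbers below are made up for illustration. By default ```np.var``` divides by the number of values (population variance); passing ```ddof=1``` divides by one less than that (sample variance).

```python
import numpy as np

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Variance by hand: the average squared distance from the mean
mean_value = sum(values) / len(values)
manual_var = sum((v - mean_value) ** 2 for v in values) / len(values)

print(manual_var)
print(np.var(values))          # population variance (ddof=0, the default)
print(np.var(values, ddof=1))  # sample variance divides by n - 1
```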

If we come across a function and we're not sure what it does or how it works, we can use ```help()```.

In [None]:
help(np.sum)

There are times when we will want to transform our data. Some reasons may be that we want to improve the interpretability of the data or gain insight into the relationship between variables. Log10 transformation is a commonly used transformation method.

Let's use ```numpy```'s ```log10()``` function in order to complete a log10 transformation of the data and see the first 10 items that have been log10 transformed.

In [None]:
np.log10(pseudodata)[:10]
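
Since every value in pseudodata is below 1, the transformed values come out negative. A short sketch on a few hand-picked powers of ten (not taken from pseudodata) makes the pattern easier to see, and shows that raising 10 to the result gets the original values back:

```python
import numpy as np

values = np.array([0.01, 0.1, 1.0, 10.0, 100.0])

logged = np.log10(values)  # values below 1 become negative, 1 becomes 0
restored = 10 ** logged    # undo the transformation

print(logged)
print(restored)
```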

### Graphing packages: Seaborn and Matplotlib

Seaborn and Matplotlib are commonly used Python packages for plotting graphs. Why do we need two packages for plotting? Matplotlib is generally used for more basic plotting such as barplots, scatterplots, pie charts, etc. Seaborn is built on top of Matplotlib. While both can work with lists and dataframes, Seaborn is better integrated for use with dataframes, provides a wider range of visualization patterns, and specializes in statistics visualizations, such as distribution plots.

Let's start by importing seaborn as sns.

In [None]:
import seaborn as sns

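Let's start by using seaborn's ```histplot()``` function to plot the distribution of our non-transformed pseudodata. (Older tutorials use ```distplot()```, which has been deprecated since seaborn 0.11.) Setting ```kde=True``` draws a smoothed density curve over the histogram, and ```stat='density'``` scales the y-axis to density instead of counts.

In [None]:
sns.histplot(pseudodata, kde=True, stat='density')

At the moment, the plot has no x-axis label or title, so let's add labels and a title using the ```set()``` function.

In [None]:
a = sns.histplot(pseudodata, kde=True, stat='density')
a.set(xlabel='pseudodata', ylabel='density', title='Distribution of Pseudodata')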

Now let's use our new superpowers of Python and plotting to do some data visualization. Let's start by seeing what happens to the distribution if our pseudodata consists of very large numbers (remember, right now they are all floats between 0.0 and 1.0). To do this, we will multiply all of our floats in *pseudodata* by 1000000 to create a new dataset we will call *pseudodata_large*.

In [None]:
#initialize your new list
pseudodata_large = []

#iterate through all values in pseudodata, multiplying each one by 1000000 before appending it to your new list
for i in pseudodata:
    pseudodata_large.append(i*1000000)    

Let's see what the first 10 values of our new dataset are.

In [None]:
pseudodata_large[:10]
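
Since we have already met ```numpy```, it is worth knowing that an array can be multiplied by a number in one step, without a loop. The sketch below uses a short stand-in list called ```small``` rather than the real pseudodata, and checks that both approaches give the same values.

```python
import numpy as np

small = [0.25, 0.5, 0.75]  # stand-in for pseudodata

# The loop approach, one value at a time
looped = []
for value in small:
    looped.append(value * 1000000)

# The numpy approach: multiply every element at once
vectorized = np.array(small) * 1000000

print(looped)
print(vectorized)
```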

***Exercise 1:*** Now try plotting the distribution of your data. Make sure to set axes labels and a title. Has the distribution shifted and if so, how?

In [None]:
# Try these in your free time!

***Exercise 2:*** When we are dealing with data with very large values, sometimes we will want to transform the data. For this exercise, log transform the pseudodata_large dataset and plot the distribution of the transformed data. Make sure to add axes labels and a title. After transforming the data, has the distribution shifted? And if so, how?

In [None]:
# Try these in your free time!

We can make many different types of plots using seaborn. For example, we can make a violin plot using the ```violinplot()``` function as shown below. To see what other plots can be made with seaborn, you can read the documentation at https://seaborn.pydata.org/

In [None]:
d = sns.violinplot(y=pseudodata) #y= puts the values on the y-axis so they match the ylabel below
d.set(ylabel='pseudodata', title='Sample Violin Plot')

### Reading in our own data and plotting with Matplotlib

Here, we will be reading in our dataset which contains all of the sizes of the different chromosomes that correspond to the hg19 reference. The data file should already be available to you, but for your own reference, the table was taken from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes.


We will start by initializing two lists, x and y, to which we will append data as we read through the file line-by-line using ```with open```. There are other ways to open, read, and close a file in Python. However, we encourage you to start with ```with open``` because it closes the file automatically when the block ends.

In [None]:
#list x will contain chromosome names
#list y will contain the size of the chromosomes
#Important: the index of the chromosome name and its corresponding size in lists x and y have to be the same
x = []
y = []

#opens the file; it is closed automatically when the with block ends
with open('data/hg19.chrom.sizes.txt') as chromSize:
    for line in chromSize: #the for loop allows you to read the file line-by-line
        print('line:', line) #what does the line contain?
        fields = line.rstrip('\n').split('\t') #remove the newline, then split on the tab
        x_value = fields[0]
        y_value = int(fields[1])
        print('x value:', x_value, 'y value:', y_value) #the x and y values we are appending to our lists
        x.append(x_value)
        y.append(y_value)

#just printing the first five values of lists x and y
print(x[:5])
print(y[:5])
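
To see the parsing step on its own, here is a sketch that reads two sample lines from text held in memory instead of from the file. The two lines are stand-ins typed in the same tab-separated layout, and the names ```sample_file```, ```names```, ```sizes```, and ```size_by_name``` are made up for this example. It also shows one way to pair each name with its size in a dictionary.

```python
import io

# Two sample lines in the same tab-separated layout (a stand-in for the real file)
sample_file = io.StringIO("chr1\t249250621\nchr2\t243199373\n")

names = []
sizes = []
for line in sample_file:
    name, size = line.rstrip('\n').split('\t')  # each line has a name and a size
    names.append(name)
    sizes.append(int(size))  # convert the text to an integer

# zip pairs items that share the same index, which is why the order matters
size_by_name = dict(zip(names, sizes))

print(names, sizes)
print(size_by_name['chr1'])
```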


Now that we have our data stored as lists x and y, let's get ready to plot our data using matplotlib! Start by importing matplotlib's pyplot as plt.

In [None]:
import matplotlib.pyplot as plt

Let's make a simple barplot using the ```bar()``` function and our lists, x and y, that contain our data.

In [None]:
plt.bar(x,y)

Great! We have created a plot...but it's a bit of a mess. Let's see if we can clean it up by making it larger, rotating the x-axis tick labels, adding axes labels, and a title. 

In [None]:
plt.figure(figsize=(20,5))
plt.bar(x, y)
plt.xticks(x, rotation='vertical')
plt.ylabel('Chromosome Length (bp)')
plt.xlabel('Chromosome Name')
plt.title('Length of Hg19 Chromosomes')
plt.show()

# Brief Introduction to Pandas

## Import the pandas package. But nickname it 'pd' for short.

In [None]:
import pandas as pd

## Read a CSV (or TSV or anything!)

Here we are reading in a .tsv where the first column is "geneSymbol" and the second column is "chromosome". Because the file is tab-separated, we tell ```read_csv``` to split on tabs with ```sep='\t'```.

We want to save it to a variable so that we can manipulate it in other ways and CHECK IT OUT!

In [None]:
gene_chrom_table = pd.read_csv('data/gene_chrom.tsv', sep='\t')
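
If you want to try ```read_csv``` without the data file, the same call works on text held in memory. The sketch below is an illustration only: the three rows are typed in by hand, they are not the contents of the real file, and the real file may write chromosome names differently.

```python
import io
import pandas as pd

# A few illustrative rows typed in by hand (not the contents of the real file)
tsv_text = "geneSymbol\tchromosome\nTP53\tchr17\nBRCA1\tchr17\nEGFR\tchr7\n"

# io.StringIO lets read_csv treat a string as if it were a file
small_table = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(small_table)
```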

## Check out your data! What are we working with here?


We can sneak a peek at just the first few rows...

In [None]:
gene_chrom_table.head()

In [None]:
gene_chrom_table.head(2)

Or the last few rows...

In [None]:
gene_chrom_table.tail(3)

We can get a sense of the "shape" of the data...

In [None]:
gene_chrom_table.shape

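Note that ```shape``` is an attribute, not a function, so it takes no parentheses. It gives us (number of rows, number of columns). We can also get each number on its own with ```len()```: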

In [None]:
len(gene_chrom_table.columns)

In [None]:
len(gene_chrom_table)
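
The link between ```shape``` and ```len()``` can be checked on a tiny table. This is a sketch using a hand-made dataframe called ```small_table``` (an illustration, not the real data):

```python
import pandas as pd

# A tiny hand-made table for illustration
small_table = pd.DataFrame({
    'geneSymbol': ['TP53', 'BRCA1', 'EGFR'],
    'chromosome': ['chr17', 'chr17', 'chr7'],
})

n_rows, n_cols = small_table.shape  # shape is a (rows, columns) tuple

print(n_rows, n_cols)
print(n_rows == len(small_table))          # len of the table counts rows
print(n_cols == len(small_table.columns))  # len of .columns counts columns
```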

How about selecting a column? And how do we find the rows we care about?

In [None]:
gene_chrom_table['geneSymbol']

Comparing a column to a value returns True or False for every row:

In [None]:
gene_chrom_table.geneSymbol == 'TP53'

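Placing that comparison inside square brackets after the table name keeps only the rows where it is True:

In [None]: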
gene_chrom_table[gene_chrom_table.geneSymbol == 'TP53']

We can also pull out a single row by its index label using ```.loc```:

In [None]:
gene_chrom_table.loc[11939]

To match any of several values at once, we can use ```isin()```:

In [None]:
gene_chrom_table[gene_chrom_table.geneSymbol.isin(['TP53', 'BRCA1'])]
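
Here is a sketch that puts these selection steps together on a tiny hand-made table, so each intermediate result is small enough to read. The table and the variable names are made up for illustration; the index labels are 0, 1, 2 because pandas numbers rows from 0 by default.

```python
import pandas as pd

# A tiny hand-made table for illustration
small_table = pd.DataFrame({
    'geneSymbol': ['TP53', 'BRCA1', 'EGFR'],
    'chromosome': ['chr17', 'chr17', 'chr7'],
})

mask = small_table['geneSymbol'] == 'TP53'  # one True/False value per row
tp53_rows = small_table[mask]               # keep only the rows where mask is True
by_label = small_table.loc[0]               # the row whose index label is 0
several = small_table[small_table['geneSymbol'].isin(['TP53', 'BRCA1'])]

print(mask)
print(tp53_rows)
print(by_label)
print(several)
```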