<a href="https://colab.research.google.com/github/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/blob/main/Intro_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

As you have seen, Python is a very powerful and dynamic programming language with several built-in functions. 

Sometimes, however, importing libraries is essential to perform certain operations without excessive coding from scratch.

Usually, at the beginning of the code or in the first cell of the notebook, you want to import all the libraries that you need. 



In [None]:
# In this example, we will be importing the following libraries :
# pandas, 
# numpy, 
# random
# datetime

# let's import them!

import pandas as pd # the aliases make the libraries quicker to call
import numpy as np
import datetime
import random

# What can Pandas do for you?

Today we will learn how to:

* Import data in the form of DataFrames (tables);
* Get info on your imported DataFrames;
* Subset a DataFrame;
* Merge and concatenate data.



# Intro on Pandas DataFrame

## Create DataFrames

In [None]:
df1 = pd.DataFrame({'patient': ['b', 'a', 'c', 'e', 'f'], 'height [cm]': np.random.randint(140, 200, 5)})
df1

In [None]:
df2 = pd.DataFrame({'patient': ['a', 'b', 'd','f'], 'weight [kg]': np.random.uniform(45, 120, 4)})
df2

In [None]:
df3 = pd.DataFrame({'patient': ['b', 'a', 'c', 'd', 'f'], 'shoe size [EU]': np.random.randint(36, 46, 5)})
df3

#### Create DataFrame from Series

In [None]:
shoe_size = pd.Series(np.random.randint(36, 46, 5))
patient = pd.Series(['b', 'a', 'c', 'd', 'f'])

In [None]:
patient

In [None]:
shoe_size

In [None]:
# start from an empty DataFrame and add one column per Series
df3 = pd.DataFrame()
df3['patient'] = patient
df3['shoe_size'] = shoe_size
df3
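
The same kind of table can also be built in one step by passing the Series in a dictionary. This is a minimal sketch with made-up values; the names `size` and `table` are only for illustration. The Series are matched by their index, so both should share the same index.

```python
import pandas as pd

# Hypothetical example data (not the random values used above)
patient = pd.Series(['b', 'a', 'c'])
size = pd.Series([38, 42, 40])

# Each dictionary key becomes a column name, each Series becomes a column
table = pd.DataFrame({'patient': patient, 'shoe size [EU]': size})
print(table)
```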

## Data import
First and foremost, let's import some data.

Most of the data you will deal with in this course comes as plain text (.txt), comma-separated values (.csv), tab-separated values (.tsv), Excel files (.xlsx), etc.

Pandas will take care of importing different types of data.

It creates different objects to contain the data. We will use DataFrames at first.

#### Importing from web

In [None]:
SNPs = pd.read_csv("https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/pandas/CD93_exomeSNPs_annotation.csv")
SNPs

In [None]:
# what happens when you try to read a tsv file without specifying the separator?
drugs =  pd.read_csv('https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/pandas/drugs.tsv')


In [None]:
drugs =  pd.read_csv('https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/pandas/drugs.tsv', sep='\t')
drugs
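
The effect of `sep` can be seen on a small in-memory table. This is a minimal sketch: the text below is made up for illustration and is not the content of `drugs.tsv`. With real files, reading tab-separated data with the default separator can also fail with a parsing error when some lines happen to contain commas.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a .tsv file
tsv_text = "Name\tType\naspirin\tDrug\nibuprofen\tDrug\n"

# The default separator is a comma, so each line is read as a single field
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep='\t' the tab characters split each line into columns
right = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(wrong.shape)  # number of (rows, columns) with the default separator
print(right.shape)  # number of (rows, columns) with the tab separator
```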

#### Importing from a local drive

In [None]:
# choose file from your computer (this works only in google colab, not in Jupyter notebook)
from google.colab import files
uploaded = files.upload()
file_name = 'kidpackgenes.csv'

In [None]:
import io
genes = pd.read_csv(io.BytesIO(uploaded[file_name]))
# Dataset is now stored in a Pandas Dataframe

#### Importing from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Create a folder on your Google Drive for this course (in this case I called it HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L).

In [None]:
!mkdir -p /content/drive/MyDrive/HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L


In [None]:
# change to the folder of interest
# (%cd persists across cells; !cd would only affect a temporary subshell)
%cd /content/drive/MyDrive/HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L

In [None]:
directory = '/content/drive/MyDrive/HS21-Big_Data_Analysis_in_Biomedical_Research_376-1723-00L'
file_name = 'kidpackgenes.csv'

In [None]:
genes = pd.read_csv(f'{directory}/{file_name}')
genes

#### Importing from your computer (on Jupyter Notebook)

In [None]:
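# in Jupyter Lab on your own computer, pass the path of the file to read_csv
# (run only the line that matches your operating system)

# for instance (on macOS)
genes = pd.read_csv('/Users/elisa/kidpackgenes.csv')

# for instance (on Windows; the r prefix stops Python from treating backslashes as escape characters)
genes = pd.read_csv(r'C:\Documents\kidpackgenes.csv')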
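
Paths can also be assembled with `pathlib`, which uses the right separator for the operating system the code runs on. The sketch below writes a tiny made-up CSV to a temporary folder so that it runs anywhere; the file name `example_genes.csv` and its content are assumptions for illustration, not the course file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny hypothetical CSV into a temporary folder so the example is self-contained
folder = Path(tempfile.mkdtemp())
csv_path = folder / 'example_genes.csv'  # '/' joins path parts on any operating system
csv_path.write_text('gene,count\nA,5\nB,8\n')

# read_csv accepts Path objects as well as strings
genes_example = pd.read_csv(csv_path)
print(genes_example)
```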


## Get info about your DataFrame

#### Show parts of the DataFrame

In [None]:
SNPs.head(10) #shows you the first n rows of the DataFrame

In [None]:
SNPs.tail(5) # shows the end of the DataFrame

In [None]:
SNPs.sample(10) # shows random rows of the DataFrame

#### Show info about size/shape of the DataFrame, column names, data types and null values


In [None]:
SNPs.info()

In [None]:
SNPs.describe()

In [None]:
SNPs.columns

In [None]:
SNPs.dtypes

In [None]:
# how many null values in the data frame?
SNPs.isna().sum()

In [None]:
# how many null values in a specific column?
SNPs['PolyPhen prediction'].isna().sum()
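
How `isna().sum()` counts missing values is easier to follow on a table small enough to check by eye. This is a minimal sketch; the table `toy` and its values are made up and unrelated to the SNP data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with some missing entries
toy = pd.DataFrame({
    'variant': ['v1', 'v2', 'v3', 'v4'],
    'score': [0.1, np.nan, 0.7, np.nan],
    'prediction': ['benign', None, 'benign', 'damaging'],
})

# isna() marks each missing cell as True; sum() counts the True values per column
nulls_per_column = toy.isna().sum()
print(nulls_per_column)

# Summing again gives the total number of missing cells in the whole table
total_nulls = int(toy.isna().sum().sum())
print(total_nulls)
```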

# Getting data from DataFrames


#### Get values from one column

In [None]:
SNPs['Variant name']

In [None]:
#columns are Series!!!

type(SNPs['Variant name'])

In [None]:
type(SNPs)

#### Get a value from a specific position of the dataframe

When you know the row labels (the index) and the column names, use:

.loc[ ]




In [None]:
# index 10 and column Variant name
SNPs.loc[10, 'Variant name']

In [None]:
# a range of rows (with .loc both ends of the range are included) and column Variant name
SNPs.loc[10:20, 'Variant name']

In [None]:
# a range of rows for all columns
SNPs.loc[10:20, :]

When you know the integer positions of the row and the column (counting from 0), use:

.iloc[ ]


In [None]:
SNPs.iloc[10, 2]
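
The difference between the two accessors shows up most clearly when the row labels are not simply 0, 1, 2, and so on. The sketch below uses a made-up table (`toy`) with labels 10 to 40 to contrast them; none of these names come from the SNP data.

```python
import pandas as pd

# Hypothetical table whose row labels differ from the row positions
toy = pd.DataFrame(
    {'gene': ['A', 'B', 'C', 'D'], 'count': [5, 8, 13, 21]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[20, 'gene']   # row labelled 20, column named 'gene'
by_position = toy.iloc[1, 0]     # second row, first column (the same cell)

label_slice = toy.loc[10:30, 'count']  # label slice: the end label 30 is included
position_slice = toy.iloc[0:2, 1]      # position slice: the end position 2 is excluded

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```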

Example: find the variant consequence of the variant rs3746732.

In [None]:
# use a boolean mask as row selection

mask = SNPs['Variant name'] == 'rs3746732'

SNPs.loc[mask, 'Variant consequence']
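
A boolean mask is itself a Series of True/False values, one per row, and `.loc` keeps only the rows where it is True. The sketch below shows this on a made-up three-row table; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    'name': ['v1', 'v2', 'v3'],
    'consequence': ['missense', 'synonymous', 'missense'],
})

# Comparing a column with a value gives one True/False per row
mask = toy['name'] == 'v2'
print(mask.tolist())

# The selection is still a Series and keeps the original row label
selected = toy.loc[mask, 'consequence']

# Take the first (here the only) matching value as a plain string
value = selected.iloc[0]
print(value)
```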

#### Select only specific columns of the DataFrame

In [None]:
drugs_subset = drugs[['PharmGKB Accession Id', 'Name', 'Type']] # list of columns!
drugs_subset

### Exercises



#### Exercise 1

Import a csv file from your computer and get the info on column names, size, data types, number of null values per column.

#### Exercise 2

From the DataFrame above, select only the first 4 columns.

#### Exercise 3

Create a dictionary with the PharmGKB Accession Id as keys and the Name of the drug as values (use the *drugs* DataFrame from above).

In [None]:
drugs.head()

In [None]:
# HINT:
# Use enumerate

#### Exercise 4

Find the row for aspirin in the *drugs* DataFrame above.

# Data wrangling - part 1

### Merge dataframes

In [None]:
help(pd.merge)

In [None]:
pd.merge(df1, df2)

In [None]:
pd.merge(df2, df1, how='inner')

In [None]:
pd.merge(pd.merge(df2, df1, how='outer'), df3, how='outer')

In [None]:
pd.merge(df2, df1, on='patient')
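
When no `on` argument is given, `pd.merge` joins on the columns the two tables have in common, and the default `how` is `'inner'`. Because the tables above hold random numbers, the sketch below uses small made-up tables with fixed values (`heights` and `weights` are hypothetical names) to show how `'inner'` and `'outer'` differ.

```python
import pandas as pd

# Hypothetical tables: patient 'c' has no weight, patient 'd' has no height
heights = pd.DataFrame({'patient': ['a', 'b', 'c'], 'height': [170, 182, 165]})
weights = pd.DataFrame({'patient': ['a', 'b', 'd'], 'weight': [70.5, 88.0, 59.2]})

# inner: keep only the patients present in both tables
inner = pd.merge(heights, weights, on='patient', how='inner')

# outer: keep every patient and fill the gaps with NaN
outer = pd.merge(heights, weights, on='patient', how='outer')

print(inner)
print(outer)
```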

### Concatenate dataframes

In [None]:
help(pd.concat)

In [None]:
df = pd.concat([df2, df1, df3], ignore_index=True)
df
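
Unlike `merge`, `concat` does not match rows on a key: it stacks the tables on top of each other and fills columns that a table lacks with NaN. This is a minimal sketch with made-up tables (`first` and `second` are hypothetical names) to make that visible.

```python
import pandas as pd

# Hypothetical tables sharing only the 'patient' column
first = pd.DataFrame({'patient': ['a', 'b'], 'height': [170, 182]})
second = pd.DataFrame({'patient': ['a', 'c'], 'weight': [70.5, 64.0]})

# Rows are stacked; ignore_index=True renumbers them from 0
stacked = pd.concat([first, second], ignore_index=True)
print(stacked)
```

Patient 'a' appears in two separate rows here, whereas a merge on 'patient' would have combined them into one.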

#### Exercise 5

Merge the following dataframes




In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
