<img src="../figures/HeaDS_logo_large_withTitle.png" width="300">

<img src="../figures/tsunami_logo.PNG" width="600">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Center-for-Health-Data-Science/PythonTsunami/blob/intro/Importing_data/Importing_data.ipynb)

# Importing Data into Python

*Prepared by [Alberto Santos](https://heads.ku.d)*

## Objectives

*   Learn how to import data from different sources into Python

#### *Note*: In Colab, files are accessed a little differently than on your own computer

We first need to download the files to Colab using:

##### create a data folder
`!mkdir -p data`

##### download file
`!wget https://raw.githubusercontent.com/Center-for-Health-Data-Science/PythonTsunami/intro/data/file_name -P data`

In [None]:
#Download data to Colab
!mkdir -p data
!wget https://raw.githubusercontent.com/Center-for-Health-Data-Science/PythonTsunami/intro/data/sample.txt -P data
!wget https://raw.githubusercontent.com/Center-for-Health-Data-Science/PythonTsunami/intro/data/sample.csv -P data
!wget https://raw.githubusercontent.com/Center-for-Health-Data-Science/PythonTsunami/intro/data/iris.tsv -P data

## Importing Text Files

A text file *(.txt)* is the most common file we will deal with. Text files are structured as a sequence of lines, where each line includes a sequence of characters.

To import the contents of a text file, we first need to know where the file is located, i.e. its file path.

### Open file
We first need to open the file. To do so, we use the `open()` built-in function. 

- `open()` has one required argument, the path to the file, and an optional argument that indicates the mode (e.g. `'r'`: open for reading; `'w'`: open for writing; `'a'`: open for appending). If no mode is given, `'r'` is used. The following table lists the valid values of the access mode parameter:

| mode | format | read/write | create new file? | comments |
| :--: | :--:   | :--:       | :--:             | :--:     |
| `'r'`    | text   | read       | no  | Default mode; raises `FileNotFoundError` if the file does not exist |
| `'rb'`   | binary | read       | no  | Raises `FileNotFoundError` if the file does not exist |
| `'r+'`   | text   | read/write | no  | Raises `FileNotFoundError` if the file does not exist |
| `'rb+'`  | binary | read/write | no  | Raises `FileNotFoundError` if the file does not exist |
| `'w'`    | text   | write      | yes | Truncates and overwrites data if file exists |
| `'wb'`   | binary | write      | yes | Truncates and overwrites data if file exists |
| `'w+'`   | text   | read/write | yes | Truncates and overwrites data if file exists |
| `'wb+'`  | binary | read/write | yes | Truncates and overwrites data if file exists |
| `'a'`    | text   | write      | yes | Data is appended at the end of the file |
| `'ab'`   | binary | write      | yes | Data is appended at the end of the file |
| `'a+'`   | text   | read/write | yes | Data is appended at the end of the file |
| `'ab+'`  | binary | read/write | yes | Data is appended at the end of the file |

- `open()` returns a file object that we can then use to read, write or append content.
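
To make the modes more concrete, here is a minimal sketch that writes, appends and reads a throwaway file. The file names `demo.txt` and `missing.txt` are hypothetical and live in a temporary folder, so nothing in `data/` is touched.

```python
import tempfile
from pathlib import Path

# Work in a temporary folder that is deleted at the end of the block.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"  # hypothetical file name

    # 'w' creates the file (or truncates it if it already exists).
    with open(demo_path, "w") as writer:
        writer.write("first line\n")

    # 'a' keeps the existing content and adds new data at the end.
    with open(demo_path, "a") as writer:
        writer.write("second line\n")

    # 'r' opens the file for reading only.
    with open(demo_path, "r") as reader:
        content = reader.read()

    # Opening a file that does not exist in 'r' mode raises FileNotFoundError.
    try:
        open(Path(tmp_dir) / "missing.txt", "r")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(content)
print("Missing file raised an error:", missing_raises)
```

The `with` statement closes the file automatically when the indented block ends, which is why no explicit `close()` call is needed.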

In [None]:
with open('data/sample.txt', 'r') as reader:
    print(reader.read())

### Path to file

We can also define a variable with the path where the file is located by using the `pathlib` module and its `Path` object. This helps avoid problems caused by operating systems using different path separators (i.e. Windows `\`, Unix `/`).

In [6]:
from pathlib import Path

filepath = Path('data/sample.txt')

In [7]:
with open(filepath, 'r') as reader:
    print(reader.read())

Country/Region
Mainland China
Japan
Singapore
Hong Kong
Japan
Thailand
South Korea
Malaysia
Taiwan
Germany
Vietnam
France
Macau
UK
United Arab Emirates
US
Australia
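
A `Path` object offers more than a plain string. The short sketch below shows a few of its attributes; it assumes the same `data/sample.txt` location used above, and `data_dir` is a name introduced only for this illustration.

```python
from pathlib import Path

data_dir = Path("data")             # hypothetical helper variable
filepath = data_dir / "sample.txt"  # the / operator joins path components

print(filepath.name)      # file name with extension
print(filepath.suffix)    # extension only
print(filepath.parent)    # folder that contains the file
print(filepath.exists())  # True only if the file has been downloaded
```

Checking `filepath.exists()` before opening a file is one way to give a clearer message than the default error when a download step was skipped.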


### Reading the file

There are three methods for reading content that can be called on this file object: `read()`, `readline()`, and `readlines()`.

- `read()` reads the whole content of the file; it optionally accepts the maximum number of characters to read
- `readline()` reads one line; it optionally accepts the maximum number of characters to read from that line
- `readlines()` reads all lines and stores them in a list
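
The three methods can be combined on the same file object, because each call continues from where the previous one stopped. The sketch below illustrates this on a small temporary file; the file name `countries.txt` and its three lines are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "countries.txt"  # hypothetical file
    demo_path.write_text("Japan\nSingapore\nThailand\n")

    with open(demo_path, "r") as reader:
        first_three = reader.read(3)      # first 3 characters
        rest_of_line = reader.readline()  # remainder of the first line
        remaining = reader.readlines()    # all lines that are left

    with open(demo_path, "r") as reader:
        partial_line = reader.readline(4)  # at most 4 characters of line 1

print(repr(first_three))   # 'Jap'
print(repr(rest_of_line))  # 'an\n'
print(remaining)           # ['Singapore\n', 'Thailand\n']
print(repr(partial_line))  # 'Japa'
```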

In [17]:
with open(filepath, 'r') as reader:
    print(reader.readline())

Country/Region



In [10]:
with open(filepath, 'r') as reader:
    print(reader.readlines())

['Country/Region\n', 'Mainland China\n', 'Japan\n', 'Singapore\n', 'Hong Kong\n', 'Japan\n', 'Thailand\n', 'South Korea\n', 'Malaysia\n', 'Taiwan\n', 'Germany\n', 'Vietnam\n', 'France\n', 'Macau\n', 'UK\n', 'United Arab Emirates\n', 'US\n', 'Australia']


We can, for instance, read the file line by line using a loop (see [Loops](https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/intro/Loops/Loops.ipynb)).

In [12]:
with open(filepath, 'r') as reader:
    for line in reader:
        print("New line:", line)

New line: Country/Region

New line: Mainland China

New line: Japan

New line: Singapore

New line: Hong Kong

New line: Japan

New line: Thailand

New line: South Korea

New line: Malaysia

New line: Taiwan

New line: Germany

New line: Vietnam

New line: France

New line: Macau

New line: UK

New line: United Arab Emirates

New line: US

New line: Australia
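
The empty lines in the output above appear because every line read from the file still ends with `'\n'`, and `print()` adds a newline of its own. A minimal sketch of one way to avoid this is to strip the trailing newline first. The list `lines` below stands in for the strings a file object would yield, and `cleaned` is a name used only here.

```python
# Stand-in for the strings produced when looping over a file object.
lines = ["Country/Region\n", "Mainland China\n", "Japan\n"]

cleaned = []
for line in lines:
    text = line.rstrip("\n")  # remove the trailing newline character
    cleaned.append(text)
    print("New line:", text)
```

An alternative is `print("New line:", line, end="")`, which keeps the line unchanged and tells `print()` not to add a second newline.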


## Importing CSV Files

Comma-separated values (CSV) files are very common in biology and are used to store tabular data. In this type of file, the fields on each line are separated by a delimiter, which indicates where one field ends and the next one starts.

These files are often either comma-separated (.csv) or tab-separated (.tsv or .txt).

In principle, we can simply read them in the same way as text files.

In [19]:
filepath = Path('data/sample.csv')
with open(filepath, 'r') as reader:
    print(reader.read())

Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,City,Date_last_updated_AEDT,lat,lon
Hubei,China,02/15/2020 23:00,56249,1596,5623,,2020-02-16 15:00:00,31.1517252,112.8783222
Guangdong,China,02/15/2020 23:00,1316,2,442,,2020-02-16 15:00:00,23.1357694,113.1982688
Henan,China,02/15/2020 23:00,1231,13,415,,2020-02-16 15:00:00,34.0000001,113.9999999


However, by doing so we lose the tabular structure of the data and cannot access it in any smart way, for instance to print the 3rd country in our table.


One option is to include a bit of logic to break down each line using the string method `split` and specifying the delimiter. This will at least give us a list that we can then access by index (see [Lists.ipynb](https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/intro/Data_structures/Lists.ipynb)).

In [20]:
my_lines = []
with open(filepath, 'r') as reader:
    for line in reader:
        line_list = line.split(',')
        my_lines.append(line_list)
        
print(my_lines)

[['Province/State', 'Country/Region', 'Last Update', 'Confirmed', 'Deaths', 'Recovered', 'City', 'Date_last_updated_AEDT', 'lat', 'lon\n'], ['Hubei', 'China', '02/15/2020 23:00', '56249', '1596', '5623', '', '2020-02-16 15:00:00', '31.1517252', '112.8783222\n'], ['Guangdong', 'China', '02/15/2020 23:00', '1316', '2', '442', '', '2020-02-16 15:00:00', '23.1357694', '113.1982688\n'], ['Henan', 'China', '02/15/2020 23:00', '1231', '13', '415', '', '2020-02-16 15:00:00', '34.0000001', '113.9999999']]


In [26]:
# 3rd country in our table (index 0 is the header row)
my_lines[3][1]

'China'
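
Note that with `split` every field is still a string, and the last field of each line keeps its `'\n'`. The sketch below shows one possible way to clean this up by stripping the newline and converting the count columns to integers. `raw_lines` is a shortened copy of the sample data (only four columns), and the variable names are chosen for this illustration.

```python
# Shortened stand-in for the lines read from data/sample.csv.
raw_lines = [
    "Province/State,Country/Region,Confirmed,Deaths\n",
    "Hubei,China,56249,1596\n",
    "Guangdong,China,1316,2\n",
]

header = raw_lines[0].rstrip("\n").split(",")

rows = []
for line in raw_lines[1:]:
    fields = line.rstrip("\n").split(",")  # drop '\n', then split
    fields[2] = int(fields[2])             # Confirmed as a number
    fields[3] = int(fields[3])             # Deaths as a number
    rows.append(fields)

total_confirmed = sum(row[2] for row in rows)
print(header)
print(total_confirmed)
```

Once the counts are integers they can be summed or compared, which is not possible in a meaningful way while they are strings.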

### csv library

Python has a dedicated library for reading this type of file: `csv`. The functionality in this library maintains the tabular structure (to some extent) and makes it easier to access the data afterwards.

In [27]:
#Import the library to be able to use it
import csv

In [32]:
# newline='' lets the csv module handle line endings itself
with open(filepath, 'r', newline='') as my_file:
    lines = csv.reader(my_file, delimiter=',')
    for line in lines:
        print(line)

['Province/State', 'Country/Region', 'Last Update', 'Confirmed', 'Deaths', 'Recovered', 'City', 'Date_last_updated_AEDT', 'lat', 'lon']
['Hubei', 'China', '02/15/2020 23:00', '56249', '1596', '5623', '', '2020-02-16 15:00:00', '31.1517252', '112.8783222']
['Guangdong', 'China', '02/15/2020 23:00', '1316', '2', '442', '', '2020-02-16 15:00:00', '23.1357694', '113.1982688']
['Henan', 'China', '02/15/2020 23:00', '1231', '13', '415', '', '2020-02-16 15:00:00', '34.0000001', '113.9999999']
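
The `csv` module also provides `csv.DictReader`, which uses the header row as keys so that fields can be accessed by column name instead of by position. The sketch below uses an in-memory string (`csv_text`, a shortened copy of the sample data) in place of a file, and a made-up quoted row to show a case where a plain `split(',')` would give the wrong result.

```python
import csv
import io

# Shortened stand-in for data/sample.csv, kept in memory for the example.
csv_text = (
    "Province/State,Country/Region,Confirmed\n"
    "Hubei,China,56249\n"
    "Guangdong,China,1316\n"
    "Henan,China,1231\n"
)

# Each row becomes a dictionary keyed by the column names in the header.
records = list(csv.DictReader(io.StringIO(csv_text)))
third_province = records[2]["Province/State"]
print(third_province)

# Hypothetical row with a comma inside a quoted field.
quoted_line = '"Korea, South",28\n'
naive_fields = quoted_line.rstrip("\n").split(",")       # splits inside the quotes
csv_fields = next(csv.reader(io.StringIO(quoted_line)))  # respects the quotes
print(naive_fields)
print(csv_fields)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file object returned by `open()`.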


## Remove some lines

The easiest way to remove lines from a file is actually to keep only the ones you want.
We will need to read the file, store the lines we want in a variable, and then open the file for writing and save the lines we wanted.

**1) We append 2 lines to the file data/sample.txt**

In [None]:
filepath = 'data/sample.txt'

with open(filepath, 'a') as f:
    f.write('\nThis is not a valid line\n')
    f.write('This is one either\n')

**2) We read the file and keep the first 18 lines only**

In [None]:
num_valid_lines = 18
my_valid_lines = []
with open(filepath, 'r') as f:
    # enumerate() counts the lines for us, starting at 0
    for i, line in enumerate(f):
        if i < num_valid_lines:
            my_valid_lines.append(line)

**3) We write back into the file only the valid lines**

In [None]:
with open(filepath, 'w') as f:
    # 'w' truncates the file, so only the kept lines remain
    f.write("".join(my_valid_lines))
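
Keeping a fixed number of lines only works when we know where the unwanted lines are. A possible alternative, sketched below under the assumption that we know the text of the unwanted lines, is to filter by content. The temporary file `sample_copy.txt` and the set `unwanted` are hypothetical and used only for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "sample_copy.txt"  # hypothetical file
    demo_path.write_text(
        "US\nAustralia\nThis is not a valid line\nThis is one either\n"
    )

    # Lines we want to drop (assumed to be known in advance).
    unwanted = {"This is not a valid line", "This is one either"}

    # Read the file and keep only the lines that are not in the set.
    with open(demo_path, "r") as f:
        kept = [line for line in f if line.rstrip("\n") not in unwanted]

    # Overwrite the file with the lines we kept.
    with open(demo_path, "w") as f:
        f.writelines(kept)

    result = demo_path.read_text()

print(result)
```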

## Exercise

1) Read file **iris.tsv** using both approaches (text and csv).

2) Open sample.txt and add a new line with `Denmark/Zealand`.

3) Then, read the file again to see the new content.

## ----------------------------------------------------------------------------------------------------------------------------------
<h2><center>Extra</center></h2>

## ----------------------------------------------------------------------------------------------------------------------------------

## Web Scraping

Web scraping is a technique to automatically access and extract large amounts of information from a website.

### Important notes about web scraping:

- Read the website’s Terms and Conditions to check whether you are allowed to use the data posted on the website.
- Make sure you are not downloading data at too rapid a rate, because this may overload the website or get you blocked.
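
One simple way to limit the download rate is to pause between consecutive requests. The sketch below is only an illustration: `fetch_all`, `fake_fetch` and the one-second default delay are assumptions made for this example, and `fake_fetch` replaces a real download so that the code runs without network access.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for every URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Stand-in for a real download function, so the sketch runs offline.
def fake_fetch(url):
    return "content of " + url

pages = fetch_all(["page1", "page2"], fake_fetch, delay_seconds=0.1)
print(pages)
```

With the `requests` library, `fetch` could for example be `lambda u: requests.get(u, timeout=10).text`.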

In this case, we will scrape the UniProt website to extract the Gene Ontology terms associated with a specific protein.

In [33]:
#specify the url of the website you are interested in, in this case UniProt
url = "http://www.uniprot.org/uniprot/"
# Protein GTPase KRas
protein = "P01116"

Import the necessary libraries:

- [requests](https://requests.readthedocs.io/)
- [bs4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [47]:
import requests
from bs4 import BeautifulSoup

In [51]:
page = requests.get(url+protein)
#Parse the html, store it in Beautiful Soup format
bsf = BeautifulSoup(page.text, "html.parser")

In [108]:
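biological_process_go = []
# Find the list that holds the Biological Process GO terms
ext_data = bsf.find('ul', class_='noNumbering biological_process')
if ext_data is not None:
    # Each list item holds one term; its name is the text of the first link
    for data in ext_data.find_all('li'):
        cells = data.find("a")
        if cells is not None:
            # string=True replaces the deprecated text=True argument
            cells = cells.find(string=True).strip()
            biological_process_go.append(cells)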

In [109]:
biological_process_go

['actin cytoskeleton organization',
 'cytokine-mediated signaling pathway',
 'endocrine signaling',
 'epithelial tube branching involved in lung morphogenesis',
 'female pregnancy',
 'forebrain astrocyte development',
 'homeostasis of number of cells within a tissue',
 'liver development',
 'MAPK cascade',
 'negative regulation of cell differentiation',
 'negative regulation of neuron apoptotic process',
 'positive regulation of cell population proliferation',
 'Zimmermann G.',
 'positive regulation of cellular senescence',
 'positive regulation of gene expression',
 'Oh Y.T.',
 'positive regulation of MAP kinase activity',
 'positive regulation of NF-kappaB transcription factor activity',
 'positive regulation of nitric-oxide synthase activity',
 'positive regulation of protein phosphorylation',
 'Oh Y.T.',
 'positive regulation of Rac protein signal transduction',
 'Ras protein signal transduction',
 'Gaudet P.',
 'regulation of long-term neuronal synaptic plasticity',
 'regulation o

## Exercise

1) Can you do the same for GO Molecular Functions?

2) Extract the pathway information for the human protein Erythropoietin (look at the HTML structure).

## References

- [An Overview Of Importing Data In Python](https://towardsdatascience.com/an-overview-of-importing-data-in-python-ac6aa46e0889)
- [How to Web Scrape with Python in 4 Minutes](https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460)