# Data science in Python

- Course GitHub repo: https://github.com/alanwilter/python-data-science
- Python website: https://www.python.org/ 

## Session 1.2: Using existing Python modules to explore data in files

- [Importing module `statistics`](#Importing-module-statistics)
  - [Exercise 1.2.1](#Exercise-1.2.1)
- [Python file and directory manipulations](#Python-file-and-directory-manipulations)
  - [Exercise 1.2.2](#Exercise-1.2.2)
- [Using the `csv` module](#Using-the-csv-module)
  - [Exercise 1.2.3](#Exercise-1.2.3)

## Mind map

<img src="img/mind_maps/mind_maps.002.jpeg">

## Importing module `statistics`

Like other languages, Python can import external modules (or libraries) into the current program. These modules may be part of the standard library that is automatically included with the Python installation, they may be extra libraries which you install separately, or they may be other Python programs you have written yourself. Whatever its source, a module is imported into a program with an **`import`** statement.

For example, if we wish to access the `mean()` and `median()` functions in Python, we can use the **`import`** keyword to get [the module named `statistics`](https://docs.python.org/3/library/statistics.html) and access its contents with the dot notation:

In [None]:
import statistics
statistics.mean([1, 2, 3, 4, 4])

We can also use the `as` keyword to give the module a different name in our code, which can be useful for brevity and for avoiding name conflicts:

In [None]:
import statistics as stats
stats.mean([1, 2, 3, 4, 4])

Alternatively we can import the separate components using the `from … import` keyword combination:

In [None]:
from statistics import mean, median
mean([1, 2, 3, 4, 4])

We can import multiple components from a single module, either on one line as shown above or on separate lines:

In [None]:
from statistics import mean
from statistics import median
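
As a small sketch of what the imported functions return, the snippet below applies `mean()`, `median()` and `stdev()` (the sample standard deviation, also provided by `statistics`) to the list used above. The variable names are illustrative only.

```python
from statistics import mean, median, stdev

values = [1, 2, 3, 4, 4]

average = mean(values)   # arithmetic mean: 14 / 5
middle = median(values)  # middle value of the sorted data
spread = stdev(values)   # sample standard deviation (divides by n - 1)

print(average, middle, round(spread, 3))
```

Note that `stdev()` needs at least two values; `pstdev()` is the population version.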

### Listing module contents

The [function `dir()`](https://docs.python.org/3/library/functions.html?highlight=dir#dir) lists the names defined in a module when we pass it the module name:

In [None]:
import statistics
dir(statistics)
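
The output of `dir()` also contains internal names that start with an underscore. A minimal sketch of one way to keep only the public names is shown here; the variable name `public_names` is an illustrative choice.

```python
import statistics

# Names starting with an underscore are internal helpers; skip them.
public_names = [name for name in dir(statistics) if not name.startswith("_")]
print(public_names)
```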

### Getting help directly from Jupyter notebook

In [None]:
statistics?

In [None]:
help(statistics)

## Exercise 1.2.1

- Using the `data/gapminder.csv` data, calculate the mean, median and standard deviation of GDP per capita across European countries in 1962, and compare these figures with those for the Americas.

## Python file and directory manipulations

The `os.path` and `os` modules implement useful functions for manipulating pathnames and for accessing the filesystem. To read or write files, we use the built-in `open()` function.
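
As a rough illustration of `open()`, the sketch below writes two lines to a file and reads them back. It works in a temporary directory so no real file is touched; the file name `example.txt` is hypothetical.

```python
import os
import tempfile

# Work in a throwaway directory that is deleted when the block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    # Mode "w" opens the file for writing, creating it if needed.
    with open(example_path, "w") as f:
        f.write("first line\n")
        f.write("second line\n")

    # The default mode "r" opens the file for reading.
    with open(example_path) as f:
        lines = [line.strip() for line in f]

print(lines)
```

Using `with` ensures the file is closed as soon as the indented block ends, even if an error occurs inside it.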

### [`os.path` — Common pathname manipulations](https://docs.python.org/3/library/os.path.html)

- `join(*paths)` : joins the paths together into one long path
- `exists(path)` : returns whether path exists
- `isfile(path)` : returns whether path is a “regular” file (as opposed to a directory)
- `isdir(path)` : returns whether path is a directory
- `dirname(path)` : returns directory containing the path
- `basename(path)` : returns the final component of the path, i.e. the path without its leading `dirname(path)`
- `split(path)` : returns (dirname(path), basename(path))

### [`os` — Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)

- `listdir(path)` : returns a list of files/directories in the directory path

Building the path to your file from its directory and file name with `os.path.join()` lets your script run on any platform.

In [None]:
import os.path
data_filepath = os.path.join("data", "gapminder.csv")
# data/gapminder.csv - Unix
# data\gapminder.csv - Windows
print(data_filepath)

Checking if a file exists before opening it:

In [None]:
os.path.exists(data_filepath)

Checking if it is a file:

In [None]:
os.path.isfile(data_filepath)

or a directory:

In [None]:
os.path.isdir(data_filepath)

Extracting the directory of the file path:

In [None]:
data_dirname = os.path.dirname(data_filepath)
print(data_dirname)

Checking if it is a directory:

In [None]:
os.path.isdir(data_dirname)

Extracting the file name from the file path:

In [None]:
data_filename = os.path.basename(data_filepath)
print(data_filename)

Getting the directory and the file name from the file path using `os.path.split()`, which returns them as a pair (a tuple of two values):

In [None]:
data_dirname, data_filename = os.path.split(data_filepath)
print(data_dirname, data_filename)
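
The functions above fit together: the two parts returned by `split()` match `dirname()` and `basename()`, and joining them gives back the original path. A short self-contained sketch, with illustrative variable names:

```python
import os.path

file_path = os.path.join("data", "gapminder.csv")

# split() returns a (directory, file name) tuple...
directory, file_name = os.path.split(file_path)

# ...whose parts match dirname() and basename().
print(directory == os.path.dirname(file_path))
print(file_name == os.path.basename(file_path))

# Joining the two parts gives back the original path.
rebuilt_path = os.path.join(directory, file_name)
print(rebuilt_path == file_path)
```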

Listing the content of a directory using `os.listdir()` is equivalent to `ls` in the shell:

In [None]:
import os
print(os.listdir(data_dirname))

## Exercise 1.2.2

- List all `.txt` entries in the `data` directory and print the path of each one only if it is a file.
- Check that the file `genes.txt` exists in `data/`, open the tab-separated file, and calculate the length of each gene.

## Using the `csv` module

The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. The `csv` module implements methods to read and write tabular data in CSV format.

The `csv` module's `reader()` and `writer()` functions read and write CSV files. You can also read and write rows as dictionaries using the `DictReader()` and `DictWriter()` classes.

For more information about this built-in Python library go to [CSV File Reading and Writing documentation](https://docs.python.org/3/library/csv.html).
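
Before turning to dictionaries, here is a minimal sketch of the plain `reader()` and `writer()` functions. It assumes a small in-memory sample (via `io.StringIO`) instead of a file on disk, and the sample rows are made up for illustration.

```python
import csv
import io

# In-memory tab-separated text standing in for a file on disk.
text = "gene\tchrom\nBRCA2\t13\nTCF7\t5\n"

# csv.reader yields each row as a list of strings.
rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(rows)

# csv.writer turns lists back into delimited text, here comma-separated.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue())
```

With `reader()`, columns are accessed by position (`row[0]`, `row[1]`), so the code depends on the column order.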

Let's now read our tab-separated file `data/genes.txt` with `csv.DictReader()`, which turns each row into a dictionary keyed by the column headers.

| gene | chrom | start | end |
| -- | -- | -- | -- |
| BRCA2 | 13 | 32889611 | 32973805 |
| TNFAIP3 | 6 | 138188351 | 138204449 |
| TCF7 | 5 | 133450402 | 133487556 |
First, import the `csv` module:

In [None]:
import csv

Read the data and store the dictionary for each row in a list. Note that `DictReader()` returns each row as a dictionary: a regular `dict` since Python 3.8, and an [ordered dictionary](https://docs.python.org/3/library/collections.html#ordereddict-objects) in Python 3.6 and 3.7.

Ordered dictionaries are like regular dictionaries but they remember the order in which items were inserted. Since Python 3.7 regular dictionaries keep insertion order as well, so in either case iterating over a row returns the items in the order of the columns in the header line.
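
The key order can be checked on a small sample. The sketch below assumes an in-memory copy of the header and first row of the table above rather than the real file.

```python
import csv
import io

# Header and first row of the table, as tab-separated text.
text = "gene\tchrom\tstart\tend\nBRCA2\t13\t32889611\t32973805\n"

reader = csv.DictReader(io.StringIO(text), delimiter="\t")
first_row = next(reader)

# Keys come back in the order of the header line.
print(list(first_row.keys()))

# Values are always strings, so convert them before doing arithmetic.
print(type(first_row["start"]))
```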

In [None]:
data = []
with open("data/genes.txt", newline="") as f:  # newline="" is recommended by the csv docs
    reader = csv.DictReader(f, delimiter = "\t")
    for row in reader:
        print(row)
        data.append(row)

for d in data:
    print(d['chrom'], d['gene'], d['start'], d['end'])

`data` is a list of dictionaries, one for each row of the data file:

In [None]:
# accessing first dictionary from the list
print(data[0])

# printing its keys
print(data[0].keys())

# its values
print(data[0].values())

# the value associated with the key 'gene'
print(data[0]['gene'])

In [None]:
# looping over the list to print each gene
for d in data:
    print(d['gene'])

In [None]:
# calculating the length of each gene and adding its value into the dictionary
for d in data:
    d['len'] = int(d['end']) - int(d['start']) + 1
    print(d)

The main advantage of using `DictReader()` from the `csv` module is that the code is easier to read and more flexible. Referring to a column by its name instead of its index makes the code more meaningful, and reading comma- or tab-separated files this way gives you the flexibility to add columns or change their order without having to modify your code.

Let's now have a look at the file `data/genes_withstrand.txt` and spot the differences with `data/genes.txt`. Even though the columns `chrom` and `gene` have been swapped and a column `strand` has been added, the code written previously still works.

In [None]:
data_withstrand = []
with open("data/genes_withstrand.txt", newline="") as f:  # newline="" is recommended by the csv docs
    reader = csv.DictReader(f, delimiter = "\t")
    for row in reader:
        print(row)
        data_withstrand.append(row)

for d in data_withstrand:
    print(d['chrom'], d['gene'], d['start'], d['end'])

In [None]:
# Write a delimited file using the csv module from a list of dictionaries 
with open("gene_lengths.txt", "w", newline="") as f:  # newline="" avoids extra blank lines on Windows
    writer = csv.DictWriter(f, data[0].keys(), delimiter='\t')
    writer.writeheader() # write header

    for d in data:
        writer.writerow(d) # write row

# Open the output file and print out its content
with open("gene_lengths.txt") as f:
    for line in f:
        print(line.strip())
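
The column list passed to `DictWriter()` does not have to come from `data[0].keys()`. As a hedged sketch, the example below passes an explicit list of columns, which fixes the output order, and uses `extrasaction="ignore"` to skip the keys that are not listed. The rows are a made-up in-memory sample and `io.StringIO` stands in for an output file.

```python
import csv
import io

# Sample rows, as they would be read by DictReader (all values are strings).
rows = [
    {"gene": "BRCA2", "chrom": "13", "start": "32889611", "end": "32973805"},
    {"gene": "TCF7", "chrom": "5", "start": "133450402", "end": "133487556"},
]
for row in rows:
    row["len"] = int(row["end"]) - int(row["start"]) + 1

# Explicit column list: only these keys are written, in this order.
fieldnames = ["gene", "len"]

buffer = io.StringIO()  # stands in for an open output file
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t",
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

Without `extrasaction="ignore"`, `DictWriter` raises a `ValueError` when a row contains a key that is not in `fieldnames`.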

## Getting help from the official Python documentation

The most useful information is available online at https://www.python.org/ and should be used as a reference guide.

- [Python3 documentation](https://docs.python.org/3/) is the starting page with links to tutorials and libraries' documentation for Python 3
    - [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
        - [Modules](https://docs.python.org/3/tutorial/modules.html)
        - [Brief Tour of the Standard Library: Mathematics](https://docs.python.org/3/tutorial/stdlib.html#mathematics)
    - [The Python Standard Library Reference](https://docs.python.org/3/library/index.html) is the reference documentation for all libraries included with Python, such as:
        - [`statistics` - Mathematical statistics functions](https://docs.python.org/3/library/statistics.html)
        - [`os.path` — Common pathname manipulations](https://docs.python.org/3/library/os.path.html)
        - [`os` — Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)
        - [`csv` — CSV File Reading and Writing](https://docs.python.org/3/library/csv.html)

## Exercise 1.2.3

- Change the script you wrote for [Exercise 1.2.1](#Exercise-1.2.1) to use the `csv` module to calculate the mean, median and standard deviation of GDP per capita across European countries in 1962 from the `data/gapminder.csv` data, and compare these figures with those for the Americas.

## Next session

Go to our next notebook: [Session 1.3: Creating functions and modules to write reusable code](13_python_data.ipynb)