# Data science in Python

- Course GitHub repo: https://github.com/pycam/python-data-science
- Python website: https://www.python.org/ 

## Session 1.2: Using existing python modules to explore data in files

- [Importing module `statistics`](#Importing-module-statistics)
  - [Exercise 1.2.1](#Exercise-1.2.1)
- [Python file and directory manipulations](#Python-file-and-directory-manipulations)
  - [Exercise 1.2.2](#Exercise-1.2.2)
- [Using the `csv` module](#Using-the-csv-module)
  - [Exercise 1.2.3](#Exercise-1.2.3)

## Mind map

<img src="img/mind_maps/mind_maps.002.jpeg">

## Importing module `statistics`

Like other languages, Python can import external modules (or libraries) into the current program. These modules may be part of the standard library that is automatically included with the Python installation, extra libraries that you install separately, or other Python programs you have written yourself. Whatever its source, a module is imported into a program with an **`import`** statement.

For example, if we wish to access the `mean()` and `median()` functions in Python, we can use the **`import`** keyword to get [the module named `statistics`](https://docs.python.org/3/library/statistics.html) and access its contents with the dot notation:

In [3]:
import statistics #imports the whole module
statistics.mean([1, 2, 3, 4, 4])

2.8

Also we can use the `as` keyword to give the module a different name in our code, which can be useful for brevity and avoiding name conflicts:

In [4]:
import statistics as stats #shortens the name of the module for ease later
stats.mean([1, 2, 3, 4, 4])

2.8

Alternatively we can import the separate components using the `from … import` keyword combination:

In [5]:
from statistics import mean, median  # imports just the functions you want from inside the module
mean([1, 2, 3, 4, 4])

2.8

We can import multiple components from a single module, either on one line as seen above or on separate lines:

In [6]:
from statistics import mean
from statistics import median  # separate lines help when importing from several modules, especially if they contain similarly named functions
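
The point about name conflicts can be illustrated with two standard-library modules that both define a function called `sqrt`. This is only a minimal sketch; the alias `csqrt` is an arbitrary name chosen for this example.

```python
import math
import cmath

# Both modules define sqrt(); the dot notation keeps them apart.
real_root = math.sqrt(4)        # square root of a real number
complex_root = cmath.sqrt(-4)   # square root that allows complex results

# Importing both names directly would make the second import
# overwrite the first, so one of them is renamed with `as`.
from math import sqrt
from cmath import sqrt as csqrt  # csqrt is a name made up for this example

print(real_root, complex_root)
print(sqrt(9), csqrt(-9))
```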

### Listing module contents

The [function `dir()`](https://docs.python.org/3/library/functions.html?highlight=dir#dir) lists the names defined in a module when the imported module is passed to it:

In [7]:
import statistics
dir(statistics)

['Decimal',
 'Fraction',
 'StatisticsError',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_coerce',
 '_convert',
 '_counts',
 '_exact_ratio',
 '_fail_neg',
 '_find_lteq',
 '_find_rteq',
 '_isfinite',
 '_ss',
 '_sum',
 'bisect_left',
 'bisect_right',
 'chain',
 'collections',
 'decimal',
 'groupby',
 'harmonic_mean',
 'math',
 'mean',
 'median',
 'median_grouped',
 'median_high',
 'median_low',
 'mode',
 'numbers',
 'pstdev',
 'pvariance',
 'stdev',
 'variance']
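
Names starting with an underscore are internal to the module. A short sketch of how the listing could be narrowed to the public names only (the variable `public_names` is introduced here for illustration):

```python
import statistics

# Keep only the public names, i.e. those that do not start with an underscore.
public_names = [name for name in dir(statistics) if not name.startswith("_")]
print(public_names)
```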

### Getting help directly from Jupyter notebook

In [8]:
statistics?

In [9]:
help(statistics)

Help on module statistics:

NAME
    statistics - Basic statistics module.

MODULE REFERENCE
    https://docs.python.org/3.6/library/statistics
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides functions for calculating statistics of data, including
    averages, variance, and standard deviation.
    
    Calculating averages
    --------------------
    
    Function            Description
    mean                Arithmetic mean (average) of data.
    harmonic_mean       Harmonic mean of data.
    median              Median (middle value) of data.
    median_low          Low median of data.
    median_high         High median of data.
    median_grouped      Median, or 50th per
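
The help for a whole module can be long. As a hedged sketch, `help()` can also be given a single function, and the same text is available as the function's docstring (the variable `median_doc` is only a name used in this example):

```python
import statistics

# help() also accepts a single function; this prints only its documentation.
help(statistics.median)

# The same text is stored on the function as its docstring.
median_doc = statistics.median.__doc__
print(median_doc)
```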

## Exercise 1.2.1

- Calculate the average GDP per capita per country in Europe in 1962, its median and standard deviation using `data/gapminder.csv` data; and compare these figures with those for the Americas.

In [23]:
from statistics import mean, median, stdev

# Start with empty lists; the for loops below append the values of interest to them
eu_gdppercap_1962 = []
americas_gdppercap_1962 = []

with open("data/gapminder.csv") as f:
    for line in f:
        data = line.strip().split(',')
        if data[1] == 'Europe' and data[2] == '1962':
            # every field is read as a string, so convert it to float before doing statistics
            eu_gdppercap_1962.append(float(data[5]))

print('European GDP per Capita in 1962')
print(eu_gdppercap_1962)
print(mean(eu_gdppercap_1962), median(eu_gdppercap_1962), stdev(eu_gdppercap_1962))

with open("data/gapminder.csv") as f:
    for line in f:
        data = line.strip().split(',')
        if data[1] == 'Americas' and data[2] == '1962':
            americas_gdppercap_1962.append(float(data[5]))

print('Americas GDP per Capita in 1962')
print(americas_gdppercap_1962)
print(mean(americas_gdppercap_1962), median(americas_gdppercap_1962), stdev(americas_gdppercap_1962))

# the raw field read from the file is still a string
type(data[5])

European GDP per Capita in 1962
[2312.888958, 10750.72111, 10991.20676, 1709.683679, 4254.337839, 5477.890018, 10136.86713, 13583.31351, 9371.842561, 10560.48553, 12902.46291, 6017.190733, 7550.359877, 10350.15906, 6631.597314, 8243.58234, 4649.593785, 12790.84956, 13450.40151, 5338.752143, 4727.954889, 4734.997586, 6289.629157, 7481.107598, 7402.303395, 5693.843879, 12329.44192, 20431.0927, 2322.869908, 12477.17707]
8365.4868143 7515.7337375 4199.193906418378
Americas GDP per Capita in 1962
[7133.166023, 2180.972546, 3336.585802, 13462.48555, 4519.094331, 2492.351109, 3460.937025, 5180.75591, 1662.137359, 4086.114078, 3776.803627, 2750.364446, 1796.589032, 2291.156835, 5246.107524, 4581.609385, 3634.364406, 3536.540301, 2148.027146, 4957.037982, 5108.34463, 4997.523971, 16173.14586, 5603.357717, 8422.974165]
4901.5418704 4086.114078 3421.740568771563


str
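
The `float()` conversion is needed because every field read from a text file is a string, as the final `str` output shows. A minimal sketch of what happens without it, using the first two values of the European list above (the variable names are chosen for this example):

```python
from statistics import mean

# Fields split from a line of text are strings, even when they look like numbers.
fields = "2312.888958,10750.72111".split(",")
print(type(fields[0]))

# Passing strings to mean() fails ...
try:
    mean(fields)
except TypeError as error:
    print("mean() failed:", error)

# ... so each value is converted with float() first.
values = [float(field) for field in fields]
print(mean(values))
```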

## Python file and directory manipulations

The `os.path` and `os` modules implement useful functions for manipulating pathnames and accessing the filesystem. To read or write files, we use `open()`.

### [`os.path` — Common pathname manipulations](https://docs.python.org/3/library/os.path.html)

- `join(*paths)` : joins the paths together into one long path
- `exists(path)` : returns whether path exists
- `isfile(path)` : returns whether path is a “regular” file (as opposed to a directory)
- `isdir(path)` : returns whether path is a directory
- `dirname(path)` : returns directory containing the path
- `basename(path)` : returns the path minus the dirname(path) in front
- `split(path)` : returns (dirname(path), basename(path))

### [`os` — Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)

- `listdir(path)` : returns a list of files/directories in the directory path

Building the path to your file from a list of directory and file names lets your script run on any platform.

In [31]:
import os.path
data_filepath = os.path.join("data", "gapminder.csv")
# data/gapminder.csv - Unix
# data\gapminder.csv - Windows
print(data_filepath)

data/gapminder.csv
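
What `os.path.join()` does can be checked against `os.sep`, the path separator of the platform the code is running on. A small sketch (the variable `joined` is a name used only here):

```python
import os
import os.path

# os.sep is the separator used by the current platform ("/" or "\\").
print(repr(os.sep))

# os.path.join() inserts that separator between the parts for us.
joined = os.path.join("data", "gapminder.csv")
print(joined == "data" + os.sep + "gapminder.csv")
```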


Checking if a file exists before opening it:

In [32]:
os.path.exists(data_filepath)

True

Checking if it is a file:

In [33]:
os.path.isfile(data_filepath)

True

or a directory:

In [34]:
os.path.isdir(data_filepath)

False

Extracting the directory of the file path:

In [35]:
data_dirname = os.path.dirname(data_filepath)
print(data_dirname)

data


Checking if it is a directory:

In [36]:
os.path.isdir(data_dirname)

True

Extracting the file name from the file path:

In [37]:
data_filename = os.path.basename(data_filepath)
print(data_filename)

gapminder.csv


Getting both the directory and the file name from the file path using `os.path.split()`, which returns them as a pair of values:

In [38]:
data_dirname, data_filename = os.path.split(data_filepath)
print(data_dirname, data_filename)

data gapminder.csv


Listing the content of a directory using `os.listdir()` is equivalent to `ls` in the shell:

In [39]:
import os
print(os.listdir(data_dirname))

['GRCm38.gff3', 'GRCh38.gff3', 'genes.txt', 'sample.fa', 'genes_withstrand.txt', 'AilMel.gff3', 'gapminder.csv', 'mydata.txt', 'gapminder_gdp_oceania.csv', 'gapminder_gdp_asia.csv', 'GRCz11.gff3', 'gapminder_gdp_americas.csv', 'gapminder_gdp_africa.csv', 'gapminder_gdp_europe.csv', 'glpa.fa']
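
`os.listdir()` returns the names in arbitrary order, as the listing above suggests. The sketch below sorts them for a predictable result; it works in a temporary directory with made-up file names so that it can run anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create three empty files (hypothetical names, for illustration only).
    for name in ["b.txt", "a.csv", "c.txt"]:
        with open(os.path.join(tmp_dir, name), "w"):
            pass

    # sorted() turns the arbitrary order into an alphabetical listing.
    sorted_names = sorted(os.listdir(tmp_dir))
    print(sorted_names)
```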


## Exercise 1.2.2

- List all `.txt` entries in the `data` directory, printing the path of each one only if it is a file.
- Check that the file `genes.txt` exists in `data/`, open the tab-separated file, and calculate the length of each gene.

In [86]:
#part 1
import os
dirname='data'
for filename in os.listdir(dirname):
    filepath = os.path.join(dirname, filename)
    if os.path.isfile(filepath) and filename.endswith('.txt'):
        print(filepath)

data/genes.txt
data/genes_withstrand.txt
data/mydata.txt


In [95]:
#part 2
import os
filepath = os.path.join('data','genes.txt')
if os.path.exists(filepath):
    print('file',filepath,'found')
    with open(filepath) as h:
        header=h.readline().strip().split()
        print(header)
        for line in h:
            data = line.strip().split()  # with no argument, split() splits on any whitespace, including tabs
            gene_length = int(data[3]) - int(data[2]) + 1
            print('gene',data[0],'length is',gene_length)

file data/genes.txt found
['gene', 'chrom', 'start', 'end']
gene BRCA2 length is 84195
gene TNFAIP3 length is 16099
gene TCF7 length is 37155
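
Calling `split()` with no argument works here because no field is empty. The sketch below shows how the two forms differ on a made-up variation of the BRCA2 line in which the `chrom` field has been left blank:

```python
# A tab-separated line whose second field is empty (made up for illustration).
line = "BRCA2\t\t32889611\t32973805\n"

# split() with no argument splits on any run of whitespace,
# so the empty field disappears and the columns shift.
whitespace_fields = line.strip().split()

# split("\t") splits on every single tab and keeps the empty field.
tab_fields = line.strip().split("\t")

print(whitespace_fields)
print(tab_fields)
```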


## Using the `csv` module

The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. The `csv` module provides functions and classes to read and write tabular data in CSV format.

The csv module’s `reader()` and `writer()` functions read and write CSV files. You can also read and write data in dictionary form using the `DictReader` and `DictWriter` classes.

For more information about this built-in Python library go to [CSV File Reading and Writing documentation](https://docs.python.org/3/library/csv.html).
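
One reason to prefer `csv.reader()` over splitting lines by hand is that it understands quoted fields. A minimal sketch, using made-up CSV text held in memory with `io.StringIO` instead of a file:

```python
import csv
import io

# Made-up CSV text: the second field of the data row contains a comma inside quotes.
text = 'name,description\ngeneA,"kinase, putative"\n'

# A plain split(",") cuts the quoted field in two.
naive_fields = text.splitlines()[1].split(",")

# csv.reader() understands the quotes and keeps the field whole.
rows = list(csv.reader(io.StringIO(text)))

print(naive_fields)
print(rows)
```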

Let's now read our `data/genes.txt` tab separated file using the `csv` module into a dictionary based on the column headers using `csv.DictReader()`.

|gene |	chrom |	start |	end |
|-- | -- | -- | -- | 
|BRCA2 |	13 |	32889611 |	32973805 |
|TNFAIP3 |	6 |	138188351 |	138204449 |
|TCF7 |	5 |	133450402 |	133487556 |

First, import the `csv` module:

In [56]:
import csv

Read the data and store each row's dictionary in a list. Note that in Python 3.6 and 3.7, which includes the session shown here, each row returned by `DictReader()` is an [ordered dictionary](https://docs.python.org/3/library/collections.html#ordereddict-objects). From Python 3.8 onwards each row is a regular `dict`, which also keeps its keys in insertion order.

Ordered dictionaries are like regular dictionaries but they remember the order in which items were inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added.

In [57]:
data = []
with open("data/genes.txt") as f:
    reader = csv.DictReader(f, delimiter = "\t")
    for row in reader:
        print(row)
        data.append(row)

for d in data:
    print(d['chrom'], d['gene'], d['start'], d['end'])

OrderedDict([('gene', 'BRCA2'), ('chrom', '13'), ('start', '32889611'), ('end', '32973805')])
OrderedDict([('gene', 'TNFAIP3'), ('chrom', '6'), ('start', '138188351'), ('end', '138204449')])
OrderedDict([('gene', 'TCF7'), ('chrom', '5'), ('start', '133450402'), ('end', '133487556')])
13 BRCA2 32889611 32973805
6 TNFAIP3 138188351 138204449
5 TCF7 133450402 133487556
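
The type printed for each row depends on the Python version (`OrderedDict` in 3.6 and 3.7, plain `dict` from 3.8), but code that accesses the columns by name behaves the same. A small sketch, reading the header and the first row of the genes table from an in-memory string rather than from the file:

```python
import csv
import io

# Header and first row of the genes table shown above, as tab-separated text.
text = "gene\tchrom\tstart\tend\nBRCA2\t13\t32889611\t32973805\n"

reader = csv.DictReader(io.StringIO(text), delimiter="\t")
first_row = next(reader)

# The concrete type depends on the Python version (OrderedDict or dict) ...
print(type(first_row).__name__)

# ... but the keys follow the column order of the header either way.
column_names = list(first_row.keys())
print(column_names)
print(first_row["gene"])
```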


`data` is a list of ordered dictionaries, one for each row of the data file:

In [58]:
# accessing first dictionary from the list
print(data[0])

# printing its keys
print(data[0].keys())

# its values
print(data[0].values())

# the value associated with the key 'gene'
print(data[0]['gene'])

OrderedDict([('gene', 'BRCA2'), ('chrom', '13'), ('start', '32889611'), ('end', '32973805')])
odict_keys(['gene', 'chrom', 'start', 'end'])
odict_values(['BRCA2', '13', '32889611', '32973805'])
BRCA2


In [59]:
# looping over the list to print each gene
for d in data:
    print(d['gene'])

BRCA2
TNFAIP3
TCF7


In [60]:
# calculating the length of each gene and adding its value into the dictionary
for d in data:
    d['len'] = int(d['end']) - int(d['start']) + 1
    print(d)

OrderedDict([('gene', 'BRCA2'), ('chrom', '13'), ('start', '32889611'), ('end', '32973805'), ('len', 84195)])
OrderedDict([('gene', 'TNFAIP3'), ('chrom', '6'), ('start', '138188351'), ('end', '138204449'), ('len', 16099)])
OrderedDict([('gene', 'TCF7'), ('chrom', '5'), ('start', '133450402'), ('end', '133487556'), ('len', 37155)])


The main advantage of using `DictReader()` and the `csv` module is that the code is easier to read and more flexible. Using the name of a column instead of its index makes the code more meaningful to read, and reading comma- or tab-separated files this way gives you the flexibility to add columns or change their order without having to modify your code.

Let's now have a look at the file `data/genes_withstrand.txt` and spot the differences with `data/genes.txt`. Even though columns `chrom` and `gene` have been swapped and column `strand` added, the code written previously still works.

In [62]:
data_withstrand = []
with open("data/genes_withstrand.txt") as f:
    reader = csv.DictReader(f, delimiter = "\t")
    for row in reader:
        print(row)
        data_withstrand.append(row)

for d in data_withstrand:
    print(d['chrom'], d['gene'], d['start'], d['end'])

OrderedDict([('chrom', '13'), ('gene', 'BRCA2'), ('start', '32889611'), ('end', '32973805'), ('strand', '+')])
OrderedDict([('chrom', '6'), ('gene', 'TNFAIP3'), ('start', '138188351'), ('end', '138204449'), ('strand', '+')])
OrderedDict([('chrom', '5'), ('gene', 'TCF7'), ('start', '133450402'), ('end', '133487556'), ('strand', '-')])
13 BRCA2 32889611 32973805
6 TNFAIP3 138188351 138204449
5 TCF7 133450402 133487556


In [63]:
# Write a delimited file using the csv module from a list of dictionaries 
with open("gene_lengths.txt", "w", newline="") as f:  # newline="" is recommended by the csv documentation
    writer = csv.DictWriter(f, data[0].keys(), delimiter='\t')
    writer.writeheader() # write header

    for d in data:
        writer.writerow(d) # write row

# Open the output file and print out its content
with open("gene_lengths.txt") as f:
    for line in f:
        print(line.strip())

gene	chrom	start	end	len
BRCA2	13	32889611	32973805	84195
TNFAIP3	6	138188351	138204449	16099
TCF7	5	133450402	133487556	37155
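
`DictWriter` can also write a subset of the columns. The sketch below assumes only the gene name and its length are wanted. It writes to an in-memory `io.StringIO` buffer instead of a file, and the row is copied from the BRCA2 line above.

```python
import csv
import io

# One row as produced above (values copied from the BRCA2 line).
row = {"gene": "BRCA2", "chrom": "13", "start": "32889611",
       "end": "32973805", "len": 84195}

buffer = io.StringIO()  # in-memory stand-in for an output file

# fieldnames chooses the columns and their order;
# extrasaction="ignore" skips keys that are not listed.
writer = csv.DictWriter(buffer, fieldnames=["gene", "len"],
                        delimiter="\t", extrasaction="ignore")
writer.writeheader()
writer.writerow(row)

written_text = buffer.getvalue()
print(written_text)
```

Without `extrasaction="ignore"`, `writerow()` raises a `ValueError` when the dictionary contains keys that are not in `fieldnames`.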


## Getting help from the official Python documentation

The most useful information is online on the https://www.python.org/ website and should be used as a reference guide.

- [Python3 documentation](https://docs.python.org/3/) is the starting page with links to tutorials and libraries' documentation for Python 3
    - [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
        - [Modules](https://docs.python.org/3/tutorial/modules.html)
        - [Brief Tour of the Standard Library: Mathematics](https://docs.python.org/3/tutorial/stdlib.html#mathematics)
    - [The Python Standard Library Reference](https://docs.python.org/3/library/index.html) is the reference documentation of all libraries included in Python like:
        - [`statistics` - Mathematical statistics functions](https://docs.python.org/3/library/statistics.html)
        - [`os.path` — Common pathname manipulations](https://docs.python.org/3/library/os.path.html)
        - [`os` — Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)
        - [`csv` — CSV File Reading and Writing](https://docs.python.org/3/library/csv.html)

## Exercise 1.2.3

- Change the script you wrote for [Exercise 1.2.1](#Exercise-1.2.1) to make use of the `csv` module to calculate the average GDP per capita per country in Europe in 1962, its median and standard deviation using `data/gapminder.csv` data; and compare these figures with those for the Americas.

In [79]:
import csv  # rows are accessed by column name (key) rather than by index
from statistics import mean, median, stdev

# Start with empty lists; the for loop below appends the values of interest to them
eu_gdppercap_1962 = []
americas_gdppercap_1962 = []

with open("data/gapminder.csv") as f:
    reader = csv.DictReader(f, delimiter = ",")
    for data in reader:
        # every field is read as a string, so convert it to float before doing statistics
        if data['continent'] == 'Europe' and data['year'] == '1962':
            eu_gdppercap_1962.append(float(data['gdpPercap']))
        if data['continent'] == 'Americas' and data['year'] == '1962':
            americas_gdppercap_1962.append(float(data['gdpPercap']))

print('European GDP per Capita in 1962')
print(eu_gdppercap_1962)
print(mean(eu_gdppercap_1962), median(eu_gdppercap_1962), stdev(eu_gdppercap_1962))

print('Americas GDP per Capita in 1962')
print(americas_gdppercap_1962)
print(mean(americas_gdppercap_1962), median(americas_gdppercap_1962), stdev(americas_gdppercap_1962))


European GDP per Capita in 1962
[2312.888958, 10750.72111, 10991.20676, 1709.683679, 4254.337839, 5477.890018, 10136.86713, 13583.31351, 9371.842561, 10560.48553, 12902.46291, 6017.190733, 7550.359877, 10350.15906, 6631.597314, 8243.58234, 4649.593785, 12790.84956, 13450.40151, 5338.752143, 4727.954889, 4734.997586, 6289.629157, 7481.107598, 7402.303395, 5693.843879, 12329.44192, 20431.0927, 2322.869908, 12477.17707]
8365.4868143 7515.7337375 4199.193906418378
Americas GDP per Capita in 1962
[7133.166023, 2180.972546, 3336.585802, 13462.48555, 4519.094331, 2492.351109, 3460.937025, 5180.75591, 1662.137359, 4086.114078, 3776.803627, 2750.364446, 1796.589032, 2291.156835, 5246.107524, 4581.609385, 3634.364406, 3536.540301, 2148.027146, 4957.037982, 5108.34463, 4997.523971, 16173.14586, 5603.357717, 8422.974165]
4901.5418704 4086.114078 3421.740568771563
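
When more continents need comparing, one list per continent becomes unwieldy. The sketch below shows one possible alternative, not the course's solution: a dictionary of lists filled in a single pass. The rows are made up and are not real gapminder values. The column names `continent`, `year` and `gdpPercap` come from the exercise, while the remaining ones are assumptions about the file layout.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; the values are not real data and the
# country, lifeExp and pop column names are assumed.
text = """country,continent,year,lifeExp,pop,gdpPercap
A,Europe,1962,70,1000,10.0
B,Europe,1962,71,2000,20.0
C,Americas,1962,65,3000,5.0
D,Americas,1967,66,4000,50.0
"""

# Map each continent to the list of its 1962 GDP per capita values.
gdp_by_continent = defaultdict(list)
for row in csv.DictReader(io.StringIO(text)):
    if row["year"] == "1962":
        gdp_by_continent[row["continent"]].append(float(row["gdpPercap"]))

for continent, values in gdp_by_continent.items():
    print(continent, mean(values))
```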


## Next session

Go to our next notebook: [Session 1.3: Creating functions and modules to write reusable code](13_python_data.ipynb)