# Modules and The Python Standard Library

In the previous lesson, we learned how to create reusable snippets of code using functions.  In this lesson, we are going to take things a step further and learn how to create collections of reusable functions and other code by creating *modules* and *packages*.  Modules allow you to write a function or other code once and reuse it for different projects.  A package is a collection of related modules.  

Before learning how to create your own modules and packages, we will explore the [Python Standard Library](https://docs.python.org/3/library/index.html) of packages and modules.  One of the things that makes Python so popular is its "batteries included" philosophy.  It is important for you to understand what is already included in the standard library, so you don't end up "reinventing the wheel." 

After learning how to import and reuse code written by others, you will go on to create your own reusable modules and packages. We will also learn how to create Python scripts and use Python from the command line.  

Lastly, while Jupyter Notebook is good for ad hoc Python code, it is not an ideal tool for creating reusable Python modules.  At this point, we will introduce you to an integrated development environment (IDE) in the form of [PyCharm](https://www.jetbrains.com/pycharm/).  

## Project Structure

Up until this point, you have been able to complete all of your assignments and examples in a single Jupyter Notebook.  Real-world data science projects are rarely this simple and will require multiple files including Python source code, HTML reports, PDF reference material, source data, processed data, and documentation.  

To manage the complexity of a larger data analysis project, we need to organize and structure our project properly.  For this class, we will use a project structure loosely based on [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/). 

Below is the directory structure we will be using throughout the rest of the course.  You may not need every directory and file for each assignment or project.  For instance, the `.gitignore` file is used by the Git version control system, which we will introduce in later lessons.  Similarly, the `docs` folder is for project-level documentation and will not be needed until the final project.  


```nohighlight
project-name/            <- Name of your analysis project.
├── .gitignore           <- Tells version control to ignore files like .pyc files.
├── README.md            <- The top-level README for developers using this project.       
├── data
│   ├── interim          <- Intermediate data that has been transformed.
│   ├── processed        <- The final, canonical data sets for modeling.
│   └── raw              <- The original, immutable data dump.
├── docs                 <- Default location for project-level documentation
├── notebooks            <- Jupyter notebooks.
├── references           <- Data dictionaries, manuals, and all other explanatory materials.
├── reports              <- Generated analysis as HTML, PDF, PNG, etc.
├── requirements.txt     <- The requirements file for reproducing the analysis environment, e.g.
│                           generated with `pip freeze > requirements.txt`
└── src                  <- Source code for use in this project.
    ├── mypackage        <- Place for your custom code
    │   ├── __init__.py  <- Makes mypackage a package
    │   ├── mymodule1.py <- Module in `mypackage`
    │   └── mymodule2.py <- Module in `mypackage`
    ├── script1.py       <- Script to prepare data, generate reports, etc...
    ├── script2.py       <- Script to prepare data, generate reports, etc...
    └── tests            <- Directory for the project's unit tests. 
        │                   Run using a unit test runner like pytest.
        ├── test_123.py  <- Unit tests for the project.
        ├── test_321.py  <- Unit tests for the project.
        └── __init__.py  
```
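
As a rough illustration, the layout above could be scaffolded with a short script instead of being created by hand. This is only a sketch using the standard library: the `create_skeleton` helper and the `demo-project` name are hypothetical, and the placeholder files are created empty.

```python
import tempfile
from pathlib import Path

# Directories from the tree above, relative to the project root
DIRECTORIES = [
    'data/interim',
    'data/processed',
    'data/raw',
    'docs',
    'notebooks',
    'references',
    'reports',
    'src/mypackage',
    'src/tests',
]

# Placeholder files from the tree above (created empty)
FILES = [
    '.gitignore',
    'README.md',
    'requirements.txt',
    'src/mypackage/__init__.py',
    'src/tests/__init__.py',
]


def create_skeleton(root):
    """Create the project directories and empty placeholder files (hypothetical helper)."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for filename in FILES:
        (root / filename).touch()
    return root


# Build the skeleton in a throwaway temporary directory
with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(Path(tmp) / 'demo-project')
    created = sorted(p.name for p in project.iterdir())
    has_package_init = (project / 'src' / 'mypackage' / '__init__.py').is_file()
    print(created)
```

For a real project you would pass the actual project path rather than a temporary directory.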

## Assigned Reading and Viewing

### Modules, Packages, and Scripts 

* Read Chapters 22 to 24 in Learning Python
* Read [Modules](https://python.swaroopch.com/modules.html) from a *Byte of Python*
* Read [__main__ — Top-level script environment](https://docs.python.org/3/library/__main__.html)

### The Python Standard Library 

* Read [Introduction](https://docs.python.org/3/library/intro.html) to the Python Standard Library and review [The Python Standard Library index](https://docs.python.org/3/library/index.html). 
* Familiarize yourself with the [datetime module](https://docs.python.org/3/library/datetime.html).  
* Familiarize yourself with the [csv module](https://docs.python.org/3/library/csv.html).

### Getting Started with PyCharm

* Read the [Quick Start Guide](https://www.jetbrains.com/help/pycharm/quick-start-guide.html) to PyCharm. 
* Read [Creating and Running Your First Python Project](https://www.jetbrains.com/help/pycharm/step-1-creating-and-running-your-first-python-project.html).  Use this and the Quick Start Guide to assist you with your homework. 

## Standard Library

Here are a few key modules from Python's Standard Library that you as a data scientist should be aware of.  

### datetime

Dates and times are often found in real-world datasets.  The Python [datetime](https://docs.python.org/3/library/datetime.html) module provides classes and functions for manipulating dates and times. Here are a few examples of the datetime module in action. 

In [1]:
import datetime

# Get the timestamp for current time
now = datetime.datetime.now()

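# Get the timestamp for the current time in the UTC timezone.
# datetime.utcnow() is deprecated as of Python 3.12; this returns a timezone-aware value.
utcnow = datetime.datetime.now(datetime.timezone.utc)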

# Get today's date
today = datetime.date.today()

# Get the date 60 days from now
later = today + datetime.timedelta(days=60)
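
Dates in source data usually arrive as text, so parsing and formatting are worth seeing alongside the examples above. The following is a minimal sketch; the date strings and the `%Y-%m-%d` format are assumptions chosen for illustration.

```python
import datetime

# Parse a date string into a date object (the format is an assumption about the data)
start = datetime.datetime.strptime('2018-03-15', '%Y-%m-%d').date()
end = datetime.date(2018, 5, 14)

# Subtracting two dates gives a timedelta
elapsed = end - start
print('Days elapsed: {}'.format(elapsed.days))

# Format a date as text using an explicit format string
formatted = end.strftime('%Y/%m/%d')
print(formatted)

# ISO 8601 format is a common choice for file names and logs
iso = end.isoformat()
print(iso)
```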

### os

The [os](https://docs.python.org/3/library/os.html) module provides a portable way of using operating system functionality. 

In [2]:
import os

# Get the current working directory
working_directory = os.getcwd()

# List the files and directories in the working directory
dir_contents = os.listdir(working_directory)

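# Get information on the current operating system.
# Note: os.uname() is only available on Unix-like systems; platform.uname() works on all platforms.
uname = os.uname()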
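
A few other `os` functions come up often when preparing data directories. The sketch below works in a temporary directory so it leaves nothing behind; the `MY_PROJECT_DATA` environment variable is a made-up name used only for illustration.

```python
import os
import tempfile

# os.name identifies the platform family, for example 'posix' or 'nt'
platform_family = os.name

# Read an environment variable, falling back to a default if it is not set.
# 'MY_PROJECT_DATA' is a hypothetical variable name.
data_setting = os.environ.get('MY_PROJECT_DATA', 'data')

with tempfile.TemporaryDirectory() as tmp:
    # Create nested directories; exist_ok avoids an error if they already exist
    raw_dir = os.path.join(tmp, 'data', 'raw')
    os.makedirs(raw_dir, exist_ok=True)
    os.makedirs(raw_dir, exist_ok=True)  # Safe to call a second time

    # List what was created under the 'data' directory
    listing = os.listdir(os.path.join(tmp, 'data'))
    print(listing)
```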

### os.path

The [os.path](https://docs.python.org/3/library/os.path.html) module provides useful functions for acting on file paths.  It handles operating-system-specific details, such as whether the file system uses forward slashes or backslashes. 

In [3]:
import os

# Get an absolute path 
working_directory = os.getcwd()
abspath_dir = os.path.abspath(working_directory)

# Get a path for 'foo.txt' in the working directory
foo_path = os.path.join(working_directory, 'foo.txt')

# Return base name of the path. In this case it is the filename
foo_basename = os.path.basename(foo_path)

# Return the directory name of the path
foo_dir = os.path.dirname(foo_path)

# Tests if the path exists
foo_exists = os.path.exists(foo_path)
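
Splitting a path into its parts is a common step when deriving an output file name from an input file name. This is a small sketch; the `survey_2018.csv` file name and the `_clean` suffix are hypothetical.

```python
import os

# Build a path to a (hypothetical) raw data file
raw_path = os.path.join('data', 'raw', 'survey_2018.csv')

# Split the path into the directory and the file name
directory, filename = os.path.split(raw_path)

# Split the file name into the stem and the extension
stem, extension = os.path.splitext(filename)

# Derive the path of a processed file from the raw file name
processed_path = os.path.join('data', 'processed', stem + '_clean' + extension)
print(processed_path)
```

Because the separator depends on the operating system, the printed path will use backslashes on Windows and forward slashes elsewhere.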

### collections

Python's [collections](https://docs.python.org/3/library/collections.html) module contains useful alternatives and extensions to Python's built-in dictionaries, lists, tuples, and sets. Here are examples of a few useful collection data types. 

In [4]:
# Counter is a subclass of dict used to count hashable objects
from collections import Counter

items = ['a', 'b', 'b', 'c', 'd', 'a', 'a', 'a', 'c', 'b']

for item, count in Counter(items).most_common(3):
    print('{} - {}'.format(item, count))

a - 4
b - 3
c - 2
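
Beyond `most_common`, a `Counter` behaves like a dictionary with a few conveniences. The sketch below uses a made-up sentence as input.

```python
from collections import Counter

# Count the words in a (made-up) sentence
words = 'the quick brown fox jumps over the lazy dog the end'.split()
counts = Counter(words)

# Look up a count like a dictionary value
the_count = counts['the']

# Missing items have a count of zero instead of raising a KeyError
cat_count = counts['cat']

# Add more observations to an existing Counter
counts.update(['fox', 'cat'])

# The total number of items counted so far
total = sum(counts.values())

print('the: {}, cat (before update): {}, total: {}'.format(the_count, cat_count, total))
```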


In [5]:
# namedtuple allows you to access tuple data using names instead of position
from collections import namedtuple

# Define a new type called Coordinates
Coordinates = namedtuple('Coordinates', ['x', 'y', 'z'])

# Create a normal tuple and a namedtuple
coords_tuple = (32, -12, 100)
coords_named = Coordinates(32, -12, 100)

# Printing the normal tuple and the namedtuple

print('Normal tuple: ')
print(coords_tuple)
print()

print('Named tuple: ')
print(coords_named)

Normal tuple: 
(32, -12, 100)

Named tuple: 
Coordinates(x=32, y=-12, z=100)
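
The benefit of a namedtuple shows up when you read its fields. This short sketch reuses the `Coordinates` type from above; the variable names are chosen for illustration.

```python
from collections import namedtuple

Coordinates = namedtuple('Coordinates', ['x', 'y', 'z'])
point = Coordinates(32, -12, 100)

# Fields can be read by name or by position
same_value = point.x == point[0]

# A namedtuple still unpacks like a normal tuple
x, y, z = point

# namedtuples are immutable; _replace returns a modified copy
moved = point._replace(z=0)

# Convert to a dictionary, for example before writing to JSON
as_dict = dict(point._asdict())

print(moved)
print(as_dict)
```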


In [6]:
# A dictionary that remembers the order of the items.
# Note: since Python 3.7, regular dicts also preserve insertion order.
from collections import OrderedDict

ordered_dict = OrderedDict([('b', 3), ('a', 4), ('c', 1), ('d', 2)])
ordered_dict

OrderedDict([('b', 3), ('a', 4), ('c', 1), ('d', 2)])
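
`OrderedDict` still offers a few features that a regular dictionary does not, such as reordering methods and order-sensitive equality. A minimal sketch with made-up keys and values:

```python
from collections import OrderedDict

scores = OrderedDict([('b', 3), ('a', 4), ('c', 1)])

# Move an existing key to the end of the ordering
scores.move_to_end('b')
order_after_move = list(scores)

# Remove and return the first item (last=False pops from the front)
first_item = scores.popitem(last=False)

# Equality between two OrderedDicts takes order into account
ordered_equal = OrderedDict([('a', 1), ('b', 2)]) == OrderedDict([('b', 2), ('a', 1)])

# Equality between regular dicts ignores order
plain_equal = {'a': 1, 'b': 2} == {'b': 2, 'a': 1}

print(order_after_move, first_item, ordered_equal, plain_equal)
```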

### statistics

Python 3.4 added the [statistics](https://docs.python.org/3/library/statistics.html) module to the standard library.  It is useful for calculating common statistics such as the mean, median, mode, and variance. In most real-world data science applications, you will probably use [NumPy](http://www.numpy.org/) or a similar library instead of this one. 

In [7]:
import statistics

values = [1, 2, 3, 4, 4]

# Compute the mean
mean = statistics.mean(values)

# Compute the median
median = statistics.median(values)

# Compute the mode
mode = statistics.mode(values)

# Compute the variance
var = statistics.variance(values)

print('Mean: {}, Median: {}, Mode: {}, Variance: {}'.format(mean, median, mode, var))

Mean: 2.8, Median: 3, Mode: 4, Variance: 1.7
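
Note that `statistics.variance` computes the sample variance, which divides by n - 1. The module also provides population versions and standard deviations. The sketch below reuses the same values to show the difference.

```python
import statistics

values = [1, 2, 3, 4, 4]

# Sample variance divides the sum of squared deviations by n - 1
sample_var = statistics.variance(values)

# Population variance divides by n instead
population_var = statistics.pvariance(values)

# Sample standard deviation is the square root of the sample variance
sample_std = statistics.stdev(values)

print('Sample variance: {}'.format(sample_var))
print('Population variance: {}'.format(population_var))
print('Sample standard deviation: {:.4f}'.format(sample_std))
```

Use the sample versions when your data is a sample drawn from a larger population, and the population versions when your data covers the entire population.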


### math

Python's built-in [math](https://docs.python.org/3/library/math.html) module implements common mathematical functions such as square root, sine, cosine, and tangent.  In most real-world data science applications, you will probably use [NumPy](http://www.numpy.org/) or a similar library instead of this one. 

In [8]:
from math import sqrt, log10

# Calculate the square root of 16
print('The square root of 16 is {}'.format(sqrt(16)))

# Calculate the base 10 logarithm of 1 million
print('The base 10 log of 1 million is {}'.format(log10(1000000)))

The square root of 16 is 4.0
The base 10 log of 1 million is 6.0
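
A few other `math` functions are handy for everyday numeric work, particularly `isclose` for comparing floating point numbers. This is a short sketch; the values are chosen only for illustration.

```python
import math

# Floating point arithmetic is not exact, so direct comparison can fail
exactly_equal = (0.1 + 0.2) == 0.3

# math.isclose compares within a tolerance instead
nearly_equal = math.isclose(0.1 + 0.2, 0.3)

# Round down and round up to the nearest integer
rounded_down = math.floor(2.7)
rounded_up = math.ceil(2.2)

# Length of the hypotenuse of a right triangle with sides 3 and 4
hypotenuse = math.hypot(3, 4)

# Constants such as pi are also available
circle_area = math.pi * 2 ** 2

print(exactly_equal, nearly_equal, rounded_down, rounded_up, hypotenuse)
```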
