# Programming for Chemists: File Input/Output and Plotting

For scientists, one of the key applications of programming is processing and analysing data. These data can come from a piece of lab equipment, a computer program, a manually conducted experiment and so on. There are many ways to read data into Python, but the most important tool for data scientists and analysts working in Python today is the [Pandas](https://pandas.pydata.org/) library, which forms the backbone of most projects involving data. In this session we will cover the basics of the Pandas and matplotlib libraries.

## Reading data into Python

### Excel vs. Pandas
Conventionally Microsoft Excel is used for data processing and analysis due to its versatility, ease of use and reliability, but it has inherent limitations:

* It will slow down as the data sets become larger.
* Excel has a limit of 1,048,576 rows in a spreadsheet.
* It is harder to write, reuse and check mathematical formulas applied to the data.

Pandas is a solution which overcomes all these limitations:

* It is fast at processing very large volumes of data. Vectorised operations let you apply a calculation to millions of data points in a fraction of a second.
* The only limit on the amount of data is the computing power and memory of the computer it is running on. If Public Health England had used pandas in place of their Excel spreadsheet, they would not have hit the row limit that cost them nearly 16,000 test results.
* It is built on NumPy and works with other fast numerical libraries such as SciPy, offering mathematical operations at close to the speed of the C programming language.
* It infers the data type of each column when reading data and provides tools for cleaning it, often far more easily than Excel. Many such steps can be automated, including filling holes in the data and eliminating duplicates.
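
As a rough illustration of the points above, the sketch below applies one vectorised calculation to a million values, then cleans a small table by dropping a duplicated row and filling a gap. The column names and numbers are made up for illustration, and the `import` lines are explained later in this section.

```python
import numpy as np
import pandas as pd

# Vectorised arithmetic: one expression acts on a million values at once.
# Hypothetical data: absorbance A converted to percent transmittance,
# using %T = 100 * 10^(-A).
absorbance = pd.Series(np.linspace(0.0, 2.0, 1_000_000))
transmittance = 100 * 10 ** (-absorbance)

# A small, made-up table with one duplicated row and one missing reading.
raw = pd.DataFrame({
    "sample": ["a", "b", "b", "c", "d"],
    "mass_g": [1.0, 2.0, 2.0, np.nan, 4.0],
})

# Remove exact duplicate rows and renumber the index.
deduplicated = raw.drop_duplicates().reset_index(drop=True)

# Fill the gap by linear interpolation between neighbouring rows. This is
# only sensible when neighbouring rows are genuinely related.
cleaned = deduplicated.assign(mass_g=deduplicated["mass_g"].interpolate())

print(cleaned)
```

Whether interpolation, a fixed fill value (`fillna`) or simply dropping incomplete rows (`dropna`) is appropriate depends on the experiment, so treat the choice here as a placeholder.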

Pandas is not necessarily a replacement for Excel, and the best course of action is often to use the two together. You can start a project in Excel, save the worksheet as a `.csv` file and port it over to Pandas, which reads `.csv` files easily. For reference, the name Pandas is derived from "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
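
A minimal sketch of that hand-over is shown here. To keep it self-contained, an in-memory string stands in for a worksheet saved from Excel as CSV; the column names and values are hypothetical, and in practice you would pass a file path to `pd.read_csv` instead. Importing the library is covered in the next few paragraphs.

```python
import io

import pandas as pd

# Stand-in for a CSV file exported from Excel (made-up calibration data).
csv_text = """sample,concentration_mM,absorbance
blank,0.0,0.002
std1,0.5,0.251
std2,1.0,0.498
"""

# pd.read_csv accepts a file path or any file-like object.
calibration = pd.read_csv(io.StringIO(csv_text))

print(calibration.dtypes)  # column types are inferred while reading
print(calibration)
```

Pandas also provides `pd.read_excel` for reading workbooks directly, although it relies on an extra package such as `openpyxl` being installed.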

As Pandas is a separate Python library we have to import it in order to use it. Importing libraries in Python is simple:

In [1]:
import pandas

This works fine, but when we use functions from the library in the next section we will have to write `pandas.name_of_func` each time. We can instead import the library under a shorter alias:

In [None]:
import pandas as pd

Now we can refer to the library's functions as `pd.name_of_func`, which is faster to write and less prone to typos! Pandas has traditionally provided the following three data structures:

| Data Structure | Dimensions     | Description                                                               |
|:---------------|:---------------|:--------------------------------------------------------------------------|
| Series         | 1              | 1D labeled homogeneous array, size immutable.                             |
| DataFrame      | 2              | General 2D labeled, size-mutable tabular structure.                       |
| Panel          | 3              | General 3D labeled, size-mutable array. Removed from pandas 1.0 onwards.  |

All three data structures are **value**-mutable (their contents can be changed), and all except Series are size-mutable. DataFrames are by far the most widely used and important of the three, so they make up the entirety of this Pandas tutorial.
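
The difference in dimensions and mutability can be seen in a short sketch. The labels and values here are invented purely for illustration.

```python
import pandas as pd

# A Series is one-dimensional and labelled by an index.
masses = pd.Series([1.2, 3.4, 5.6], index=["a", "b", "c"])
masses["b"] = 9.9  # value-mutable: an existing entry is overwritten

# A DataFrame is two-dimensional, labelled by an index and by column names.
table = pd.DataFrame({"mass_g": masses})
table["mass_mg"] = table["mass_g"] * 1000  # size-mutable: a column is added

print(masses.ndim, table.ndim)
print(table.shape)
```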

Let's begin by reading in the file called `tablet-spectra.csv`, which contains spectra, measured in transmittance mode, of 460 pharmaceutical tablets; readings run from 600 to 1898 nm in 2 nm increments. The data set is free to use and is taken from [OpenMV](https://openmv.net), which hosts a variety of example data sets including [tablet-spectra](https://openmv.net/info/tablet-spectra).
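
Before opening the file, it helps to know what size of table to expect. The sketch below rebuilds the wavelength axis from the description above; it assumes one row per tablet and one column per wavelength, so the real file may differ (for example, if it carries an extra label column).

```python
import numpy as np

# Wavelengths from 600 nm to 1898 nm inclusive, in 2 nm steps.
# np.arange excludes the stop value, so stop one step beyond 1898.
wavelengths_nm = np.arange(600, 1900, 2)

n_tablets = 460
expected_shape = (n_tablets, wavelengths_nm.size)
print(expected_shape)  # (460, 650)
```

An axis like `wavelengths_nm` is also handy later as the x-values when plotting a spectrum.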

### DataFrame

A DataFrame is a two-dimensional, labelled table whose columns can each hold a different type of data (heterogeneous data). Consider the following example:

| Name  | Age  | Grade |
|:------|:-----|:------|  
| Rob   | 27   |  A    |
| Susan | 34   |  C    |
| Jane  | 71   |  A    |
| Tom   | 62   |  D    |
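
One way to build this table in pandas is from a dictionary that maps each column name to a list of values, as in the following sketch. The variable name `students` is just an illustrative choice.

```python
import pandas as pd

# Each dictionary key becomes a column; each list supplies that column's rows.
students = pd.DataFrame({
    "Name": ["Rob", "Susan", "Jane", "Tom"],
    "Age": [27, 34, 71, 62],
    "Grade": ["A", "C", "A", "D"],
})

print(students.dtypes)  # Age is stored as integers; Name and Grade hold text
print(students[students["Grade"] == "A"])  # keep only rows with grade A
```

Because each column keeps its own data type, numerical columns such as `Age` can be used in calculations while text columns such as `Grade` are used for grouping or filtering.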