# Programming for Chemists: File Input/Output and Plotting

As scientists, one of our key applications of programming is processing and analysing data. These data can come from a piece of lab equipment, a computer program, a manually conducted experiment, and so on. There are many ways to read data into Python, but the most important tool at the disposal of data scientists and analysts working in Python today is the [pandas](https://pandas.pydata.org/) library, which forms the backbone of most projects involving data. In this session we will cover the basics of pandas, along with matplotlib, which is used to visualise data.

## Excel vs. Pandas
Conventionally Microsoft Excel is used for data processing and analysis due to its versatility, ease of use and reliability, but it has inherent limitations:

* It will slow down as the data sets become larger.
* Excel has a limit of 1,048,576 rows in a spreadsheet.
* It is harder to create and use mathematical equations on the data. 

Pandas is a solution which overcomes all these limitations:

* It is fast at processing very large volumes of data. Vectorised operations apply a computation to millions of data points at once, without an explicit loop.
* The only limitation on the amount of data is the computing power and memory of the computer it is running on. If Public Health England had used pandas in place of their Excel spreadsheet, they might never have lost 16,000 COVID test results.
* It works with fast numerical libraries such as NumPy and SciPy, offering mathematical operations at speeds close to those of compiled C code.
* It has built-in tools for reading and cleaning data. It can clean up data much more easily than Excel, and many steps can be automated, including filling in missing values and eliminating duplicates.
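As a rough illustration of the cleaning and vectorised arithmetic described above, here is a minimal sketch. The sample names and masses are hypothetical values made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one duplicated row and one missing value
raw = pd.DataFrame({
    'sample': ['A', 'B', 'B', 'C'],
    'mass_g': [1.02, 0.98, 0.98, np.nan],
})

# Remove exact duplicate rows, then fill the missing mass with the column mean
clean = raw.drop_duplicates().copy()
clean['mass_g'] = clean['mass_g'].fillna(clean['mass_g'].mean())

# Vectorised arithmetic: convert every mass to milligrams in one step
clean['mass_mg'] = clean['mass_g'] * 1000

print(clean)
```

Filling a gap with the column mean is only one possible choice; whether it is appropriate depends on the experiment.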

Pandas is not necessarily a replacement for Excel, and the best course of action is often to use the two together. You can start a project in Excel, save it as a `.csv` file, and port it over to pandas, which reads `.csv` files easily. For reference, the name pandas is derived from "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
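Reading a `.csv` file might look like the sketch below. To keep the example self-contained, the CSV text is held in a string and wrapped in `io.StringIO`; the contents and the file name `my_data.csv` in the comment are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text, standing in for a file exported from Excel
csv_text = """Name,Age,Grade
Rob,27,A
Susan,34,C
"""

# pd.read_csv accepts a file path or any file-like object;
# with a real file you would write pd.read_csv('my_data.csv')
people = pd.read_csv(io.StringIO(csv_text))

print(people)
```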

As pandas is a separate Python library we have to import it in order to use it:

In [2]:
import pandas as pd

Pandas provides two main data structures:

| Data Structure | Dimensions     | Description                                         |
|:---------------|:---------------|:----------------------------------------------------|
| Series         | 1              | 1D labeled array holding a single data type.        |
| DataFrame      | 2              | General 2D labeled, size-mutable tabular structure. |

Older versions of pandas also included a three-dimensional `Panel` structure, but it was removed in pandas 0.25. Both remaining structures are **value** mutable (their contents can be changed). DataFrames are the more widely used and important of the two, so they will make up the entirety of this pandas tutorial.
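The relationship between a Series and a DataFrame can be sketched as follows. The variable names are illustrative only.

```python
import pandas as pd

# A Series is a single labeled column of values
ages = pd.Series([27, 34, 71, 62], name='Age')

# A DataFrame is a table; each of its columns is a Series
table = pd.DataFrame({'Age': [27, 34, 71, 62], 'Grade': ['A', 'C', 'A', 'D']})

# Values can be changed in place (value mutable)
ages.loc[0] = 28

# A DataFrame can gain a new column (size mutable)
table['Passed'] = table['Grade'] != 'D'

print(ages)
print(table)
```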

## DataFrame

A DataFrame is a two-dimensional labeled table that can hold heterogeneous data, meaning that each column can have its own data type (for example text, integers or floats). Consider the following example:

| Name  | Age  | Grade |
|:------|:-----|:------|  
| Rob   | 27   |  A    |
| Susan | 34   |  C    |
| Jane  | 71   |  A    |
| Tom   | 62   |  D    |

We can implement this in pandas using the following constructor:

`pd.DataFrame(data, index, columns, dtype)`

* **data:** the data itself, which can take various forms such as an `ndarray`, `Series`, `list`, `dict`, a constant, or another DataFrame.
* **index:** the row labels. If no index is passed and the data provide none, the rows are numbered from 0 to n-1 (equivalent to `np.arange(n)`).
* **columns:** the column labels. If no columns are passed and the data provide none, the columns are numbered in the same way (equivalent to `np.arange(n)`).
* **dtype:** the data type to force on the data. If omitted, pandas infers a type for each column.

In [25]:
# create a nested list containing our data
data = [['Rob', 27, 'A'], ['Susan', 34, 'C'], ['Jane', 71, 'A'], ['Tom', 62, 'D']]
# call the DataFrame constructor from the pandas library and assign names to each column
df = pd.DataFrame(data, columns=['Name', 'Age', 'Grade'])

print(df)

    Name  Age Grade
0    Rob   27     A
1  Susan   34     C
2   Jane   71     A
3    Tom   62     D
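The same table can also be built from a dictionary, which is one of the other forms the `data` argument accepts. The sketch below also passes an `index`; the row labels `r1` to `r4` are hypothetical and chosen only to show the effect.

```python
import pandas as pd

# Build the table from a dictionary: the keys become the column labels
data_dict = {'Name': ['Rob', 'Susan', 'Jane', 'Tom'],
             'Age': [27, 34, 71, 62],
             'Grade': ['A', 'C', 'A', 'D']}

# Hypothetical row labels, passed through the index argument
labelled = pd.DataFrame(data_dict, index=['r1', 'r2', 'r3', 'r4'])

# Without an index argument the rows are numbered 0, 1, 2, ...
default = pd.DataFrame(data_dict)

print(labelled)
```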


### Viewing Data

There are a multitude of ways to manipulate and view your data:

* View the top and bottom rows of the frame:

In [18]:
# print the top two rows
print(df.head(2))

# print the bottom two rows
print(df.tail(2))

    Name  Age Grade
0    Rob   27     A
1  Susan   34     C
   Name  Age Grade
2  Jane   71     A
3   Tom   62     D


* Display the index labels of each row:

In [29]:
df.index

RangeIndex(start=0, stop=4, step=1)

* Display the labels of each column:

In [28]:
df.columns

Index(['Name', 'Age', 'Grade'], dtype='object')

* `describe` shows a statistical summary of your DataFrame:

In [27]:
df.describe()

             Age
count   4.000000
mean   48.500000
std    21.299452
min    27.000000
25%    32.250000
50%    48.000000
75%    64.250000
max    71.000000


* Transpose data:

In [26]:
df.T

         0      1     2    3
Name   Rob  Susan  Jane  Tom
Age     27     34    71   62
Grade    A      C     A    D


* Sort the rows by the values in a specific column:

In [33]:
# sort the rows by the values in the 'Age' column
df.sort_values(by='Age')

    Name  Age Grade
0    Rob   27     A
1  Susan   34     C
3    Tom   62     D
2   Jane   71     A
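`sort_values` also accepts a list of columns and an `ascending` argument. The following sketch rebuilds the example table so that it runs on its own, then sorts by grade and, within each grade, by age from oldest to youngest.

```python
import pandas as pd

people = pd.DataFrame(
    [['Rob', 27, 'A'], ['Susan', 34, 'C'], ['Jane', 71, 'A'], ['Tom', 62, 'D']],
    columns=['Name', 'Age', 'Grade'],
)

# Sort by Grade (ascending), breaking ties by Age (descending)
ordered = people.sort_values(by=['Grade', 'Age'], ascending=[True, False])

print(ordered)
```

Note that `sort_values` returns a new DataFrame and leaves the original row order untouched.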


### Selecting data

* Selecting a single column is done using the same syntax we used previously for dictionaries:

In [38]:
df['Name']

0      Rob
1    Susan
2     Jane
3      Tom
Name: Name, dtype: object
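The same square-bracket syntax extends to several columns at once and to selecting rows by a condition. A minimal sketch, rebuilding the example table so that it runs on its own:

```python
import pandas as pd

people = pd.DataFrame(
    [['Rob', 27, 'A'], ['Susan', 34, 'C'], ['Jane', 71, 'A'], ['Tom', 62, 'D']],
    columns=['Name', 'Age', 'Grade'],
)

# Pass a list of labels to select several columns
names_and_grades = people[['Name', 'Grade']]

# Pass a boolean condition to keep only the rows where it is True
over_fifty = people[people['Age'] > 50]

print(names_and_grades)
print(over_fifty)
```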

* We can also slice our data like we did in the previous session:

In [45]:
# slice the DataFrame, selecting the first 3 rows
df[0:3]

    Name  Age Grade
0    Rob   27     A
1  Susan   34     C
2   Jane   71     A


* Extract a cross section using a label:

In [47]:
df.loc[0]

Name     Rob
Age       27
Grade      A
Name: 0, dtype: object
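Label-based selection with `.loc` is easier to tell apart from position-based selection with `.iloc` when the row labels are not plain integers. The sketch below assumes, purely for illustration, that the `Name` column is used as the index.

```python
import pandas as pd

people = pd.DataFrame(
    [['Rob', 27, 'A'], ['Susan', 34, 'C'], ['Jane', 71, 'A'], ['Tom', 62, 'D']],
    columns=['Name', 'Age', 'Grade'],
)

# Use the Name column as row labels so that labels and positions differ
by_name = people.set_index('Name')

# .loc selects by label: the row labelled 'Jane'
jane = by_name.loc['Jane']

# .loc also takes a row label and a column label together
jane_age = by_name.loc['Jane', 'Age']

# .iloc selects by integer position: the first row, whatever its label
first = by_name.iloc[0]

print(jane)
print(first)
```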

Let's begin by reading in the file called `tablet-spectra.csv`, which contains spectra, measured in transmittance mode, of 460 pharmaceutical tablets; readings run from 600 to 1898 nm in 2 nm increments. The data set is free to use and is taken from [OpenMV](https://openmv.net), which hosts a variety of example data sets including [tablet-spectra](https://openmv.net/info/tablet-spectra).