# Data manipulation using Pandas
### Workshop 1
*Jun 27 - IACS-MACI Internship*

**Pandas** is a Python library containing tools for data manipulation. It is built on top of `numpy`, which provides the underlying operations on multidimensional arrays.

The main class of Pandas is `DataFrame`, which is similar to a table in Excel. In other words, **Pandas allows us to work with tabular data in Python**.

<img src="images/pandas_excel.jpeg" width="300"/>

Data scientists use Pandas a lot! This library provides **high-level functions** that **accelerate** the **analysis** and **preprocessing** of our data.



## The DataFrame class
The [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) class contains all you need to reproduce or create an Excel table.
<img src="images/dataframe_excel.png" width="900"/>

Let's create a `DataFrame` instance from scratch.

First we need to import the Pandas library. If it is not installed yet, install it with `pip install pandas` or `conda install pandas`.

In [1]:
import pandas as pd # conventionally we rename pandas as 'pd'

In [5]:
data = [
    [1, 'a', .1],
    [2, 'b', .2],
    [3, 'c', .3],
    [4, 'd', .4],
    [5, 'e', .5],
    [6, 'd', .6]
]
indices = [2,3,4,5,6,7]

columns = ['col1', 'col2', 'col3']

In [7]:
df = pd.DataFrame(data=data, index=indices, columns=columns)

In [8]:
df

Unnamed: 0,col1,col2,col3
2,1,a,0.1
3,2,b,0.2
4,3,c,0.3
5,4,d,0.4
6,5,e,0.5
7,6,d,0.6


Once the `DataFrame` is created, we can operate on it using `pandas` functions.
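As a minimal sketch of what "operating on" a `DataFrame` can look like, the cell below rebuilds the same table and applies a few common calls. The variable names (`shape`, `first_rows`, and so on) are chosen here for illustration only.

```python
import pandas as pd

# Same table as in the cells above
df = pd.DataFrame(
    data=[[1, 'a', .1], [2, 'b', .2], [3, 'c', .3],
          [4, 'd', .4], [5, 'e', .5], [6, 'd', .6]],
    index=[2, 3, 4, 5, 6, 7],
    columns=['col1', 'col2', 'col3'],
)

shape = df.shape                # (number of rows, number of columns)
first_rows = df.head(3)         # the first three rows
row_by_label = df.loc[2]        # the row whose index label is 2
row_by_position = df.iloc[0]    # the first row, whatever its label is
col1_total = df['col1'].sum()   # sum of a numeric column
```

Because the index starts at 2, `df.loc[2]` and `df.iloc[0]` refer to the same row here: `loc` looks up labels, while `iloc` counts positions.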

### Already have an Excel file?

Most of the time, we receive tabular data already stored in a file.

In this case, we can load tables from different formats such as xlsx, csv, sql, etc.

Some formats (like `xlsx`) require extra modules to work. If one is missing, Python displays an error indicating the name of the module, and you only have to install it.

<img src="images/error_xlsx.png" width="700"/>
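The check below is a small sketch for verifying the dependency before loading a file. It assumes the missing module is `openpyxl`, which recent pandas versions use to read `.xlsx` files; if the error message on your machine names a different module, install that one instead.

```python
import importlib.util

# Assumption: pandas reads .xlsx files through the third-party package openpyxl
has_openpyxl = importlib.util.find_spec("openpyxl") is not None

if has_openpyxl:
    print("openpyxl is installed: pd.read_excel can open .xlsx files")
else:
    print("openpyxl is missing: install it with `pip install openpyxl`")
```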

In [11]:
df = pd.read_excel('./data/example_1.xlsx')

In [12]:
df

Unnamed: 0,col1,col2,col3
0,1.0,a,0.1
1,2.0,b,0.2
2,3.0,c,0.3
3,4.0,d,0.4
4,5.0,e,0.5
5,6.0,d,0.6


Notice that the index and the column formats differ from the table we built by hand: the index now starts at 0 and `col1` holds floats. Since we are working with a `DataFrame` object, we can modify everything without changing the `xlsx` file.

In [15]:
df.index = [2, 3, 4, 5, 6, 7] # Assigning to the index attribute replaces the row labels

In [16]:
df

Unnamed: 0,col1,col2,col3
2,1.0,a,0.1
3,2.0,b,0.2
4,3.0,c,0.3
5,4.0,d,0.4
6,5.0,e,0.5
7,6.0,d,0.6
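A short sketch of a few related ways to handle the index, using a smaller hypothetical table so the example is self-contained. It also shows that the new index must contain exactly one label per row.

```python
import pandas as pd

# Small stand-in table (three rows instead of six)
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': ['a', 'b', 'c']})

# Option 1: assign the labels directly, as in the cell above
df.index = [2, 3, 4]

# Option 2: build the labels from a range instead of typing them out
df.index = range(2, 2 + len(df))
labels = list(df.index)

# A list with the wrong number of labels is rejected
try:
    df.index = [2, 3]
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# reset_index(drop=True) returns a copy with the default 0, 1, 2, ... labels
df_default = df.reset_index(drop=True)
```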


To change the data type of a column, we use the `astype()` method of the [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) class.

This works because every column in a `DataFrame` is a `Series` object:

In [27]:
type(df), type(df['col1'])

(pandas.core.frame.DataFrame, pandas.core.series.Series)

The `Series` object has methods that operate on the column data.
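For illustration, here are a few `Series` methods applied to the columns of the example table. This is only a small sample chosen for this sketch; the `Series` documentation linked above lists many more.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'd'],
                   'col3': [.1, .2, .3, .4, .5, .6]})

col1_mean = df['col1'].mean()        # average of a numeric column
distinct = df['col2'].unique()       # distinct values, in order of appearance
counts = df['col2'].value_counts()   # how many times each value occurs
upper = df['col2'].str.upper()       # string methods live under the .str accessor
```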

In the example, we need to change the data type of `col1` from floats to integers.

In [28]:
df['col1'] = df['col1'].astype(int)

In [29]:
df

Unnamed: 0,col1,col2,col3
2,1,a,0.1
3,2,b,0.2
4,3,c,0.3
5,4,d,0.4
6,5,e,0.5
7,6,d,0.6
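One caveat worth sketching: `astype(int)` only works when every value can be represented as an integer. The example below uses hypothetical data with a missing value to show the failure, and the nullable `'Int64'` dtype as one possible alternative.

```python
import numpy as np
import pandas as pd

# A clean float column converts without problems
col = pd.Series([1.0, 2.0, 3.0])
as_int = col.astype(int)

# A column with a missing value (NaN) cannot become a plain integer column
with_gap = pd.Series([1.0, np.nan, 3.0])
try:
    with_gap.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# The nullable 'Int64' dtype (capital I) keeps integers and marks the gap as <NA>
nullable = with_gap.astype('Int64')
```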


Of course, we can also save the `DataFrame` in the desired format.

In [32]:
df.to_csv('./data/example_1.csv', index=False) # index=False avoids writing the index as an extra column

In [33]:
pd.read_csv('./data/example_1.csv')

Unnamed: 0,col1,col2,col3
0,1,a,0.1
1,2,b,0.2
2,3,c,0.3
3,4,d,0.4
4,5,e,0.5
5,6,d,0.6
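To see what `index=False` changes, the sketch below round-trips a small hypothetical table through CSV text in memory (using `io.StringIO` instead of a file on disk) with and without the index.

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}, index=[2, 3, 4])

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Default (index=True): the index is written as an unnamed first column,
# which read_csv then loads as a regular column called 'Unnamed: 0'
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index_col=0 tells read_csv to use that first column as the index again
restored = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
```

Dropping the index is convenient when it carries no information. When the labels matter, writing the index and reading it back with `index_col=0` preserves them.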


Finally, Pandas can also load data from remote sources (a `url`). For example, the Chilean science ministry publishes climate change datasets for research. Here we use [the amount of water falling in 24 hours](https://github.com/MinCiencia/Datos-CambioClimatico/tree/main/output/agua24_dmc).

The link to the file is

`https://github.com/MinCiencia/Datos-CambioClimatico/blob/main/output/agua24_dmc/1955/1955_agua24_dmc.csv`

which displays a rendered view of the tabular data. However, to load the csv into a `DataFrame` we must use the link to the raw file:


<img src="images/raw_csv.png" width="900"/>

In [37]:
pd.read_csv('https://raw.githubusercontent.com/MinCiencia/Datos-CambioClimatico/main/output/agua24_dmc/1955/1955_agua24_dmc.csv')

Unnamed: 0,time,latitud,longitud,RRR24_Valor,Traza_Valor,CodigoNacional,nombreEstacion
0,1955-08-11 12:00:00,-45.91833,-71.67778,2.0,0.0,450005.0,Balmaceda Ad.
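The two links above differ only in the host name and the `/blob/` path segment. As a sketch, a small helper can derive one from the other. `github_blob_to_raw` is a hypothetical function written for this example, not part of pandas, and it assumes the URL follows the GitHub pattern shown above.

```python
def github_blob_to_raw(url: str) -> str:
    """Hypothetical helper: turn a GitHub 'blob' page URL into its raw-file URL."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"not a GitHub blob URL: {url}")
    # Drop the first '/blob/' segment and switch to the raw-content host
    path = url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


blob_url = ("https://github.com/MinCiencia/Datos-CambioClimatico/blob/main/"
            "output/agua24_dmc/1955/1955_agua24_dmc.csv")
raw_url = github_blob_to_raw(blob_url)

# pd.read_csv(raw_url) would then download and parse the file (needs network access)
```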


## Pandas operations

- Concat
- Merge
- Join
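As a first taste of the three operations listed above, here is a minimal sketch on two small hypothetical tables (`left` and `right` are names invented for this example).

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Concat: stack the tables on top of each other; columns missing in one
# table are filled with NaN
stacked = pd.concat([left, right], ignore_index=True)

# Merge: combine rows that share a value in the 'key' column
merged = pd.merge(left, right, on='key', how='inner')

# Join: combine tables by their index; a left join keeps every row of `left`
joined = left.set_index('key').join(right.set_index('key'), how='left')
```

Here `merged` contains only the keys present in both tables (`b` and `c`), while `joined` keeps `a` as well, with a missing value in `y`.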