## Data Manipulation with Pandas



<br>

**Fernando Batista**

*fernando.batista[at]iscte-iul.pt*
<br><br>


**Credits**: slides [pandas: Powerful data analysis tools for Python](https://www.slideshare.net/wesm/pandas-powerful-data-analysis-tools-for-python)

# About me ...

* Lecturer at ISCTE-IUL and researcher at [INESC-ID](http://www.inesc-id.pt)
* Research interests
  * NLP (Natural Language Processing)
  * _Machine Learning_
  * Speech Processing
  * _Social Media_
* Courses (most recently taught)
  * Computational Language Processing
  * Text Mining
  * Probabilistic Learning for NLP
  * Operating Systems

## How to run these slides?

* If you just want the notebook, go to my GitHub Repository\
https://github.com/fmmb/Estruturas-Dados.git

* If you want to execute the code, use the following URL\
https://mybinder.org/v2/gh/fmmb/Estruturas-Dados.git/master \
If that does not work, do the following:
  * Go to https://mybinder.org
  * Paste the address of my GitHub repository
  * Copy the resulting URL; you can also use it to share your Binder with others
  
Optionally, you may also want to install [RISE](https://rise.readthedocs.io/) before opening the notebook (or just reload the notebook afterwards):
```
 ! pip install RISE
```

Creating the Slides: 
`jupyter nbconvert Pandas.ipynb --to slides`

# Pandas
* http://pandas.pydata.org \
Free software released under the three-clause BSD license
* Rich relational data tool built on top of **NumPy**
  * Like R's `data.frame` on steroids
  * Excellent performance
  * Easy-to-use, consistent API
* A foundation for data analysis in Python

# Pandas
<div style="float:right"> <img src="attachment:related_tags_over_time.png" width="500"/> </div>

* Initial release in 2008, now in heavy production use in industry
* Generally much better performance than other open-source alternatives (e.g. R)
* Hope: basis for the “next generation” statistical computing and analysis environment

# What can you do with pandas?
- Calculate statistics and answer questions about the data, like
    - What's the average, median, max, or min of each column? 
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more. 
- Store the cleaned, transformed data back into a CSV file, another file format, or a database
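
The tasks above can be sketched in a few lines. This is only a minimal illustration on hypothetical toy data: the column names `A`, `B`, `C` mirror the questions above, and the values are made up.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, np.nan],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "C": [5.0, 3.0, 8.0, 1.0, 7.0],
})

# Statistics: mean, median, max and min of each column (NaN is skipped).
summary = df.agg(["mean", "median", "max", "min"])

# Does column A correlate with column B? (rows with NaN are ignored)
corr_ab = df["A"].corr(df["B"])

# What does the distribution of column C look like?
c_distribution = df["C"].describe()

# Clean: remove rows with missing values, then filter rows by a criterion.
clean = df.dropna()
filtered = clean[clean["C"] > 2]

# Store the result as CSV; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Plotting (for example `df["C"].plot.hist()`) additionally requires Matplotlib and is left out of this sketch.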

# Pandas
<div style="float:right"> <img src="attachment:series-and-dataframe.png" width="600"/> </div>

* Efficient implementation of a **DataFrame**
* **DataFrames**: multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data
* Implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs
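
As a rough illustration of these points, the sketch below builds a small DataFrame with row labels, columns of different types, and one missing value, and then applies two spreadsheet-style operations. The names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: three labelled rows, three typed columns.
df = pd.DataFrame(
    {
        "name": ["Ana", "Bruno", "Carla"],   # text
        "age": [31, 25, 40],                 # integers
        "score": [8.5, np.nan, 9.1],         # floats, one missing value
    },
    index=["r1", "r2", "r3"],
)

# Each column keeps its own dtype.
dtypes = df.dtypes

# Missing data is represented as NaN and can be counted directly.
n_missing = int(df["score"].isna().sum())

# Spreadsheet/database-style operations: sort rows, then select rows and columns.
oldest_first = df.sort_values("age", ascending=False)
over_30 = df.loc[df["age"] > 30, ["name", "age"]]
```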

# Pandas objects
* Series
  * analog to one-dimensional array with flexible indices
* DataFrame
  * analog to a two-dimensional array with both flexible row indices and flexible column names
  * you can think of a DataFrame as a sequence of aligned Series objects.
    * "aligned" means that they share the same index.
* Index
  * immutable, array-like object that holds the axis labels of a Series or DataFrame
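
A small sketch of the "aligned Series" idea, using hypothetical labels and values: two Series with the same labels in a different order are combined into one DataFrame, and pandas matches them by label rather than by position.

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order.
height = pd.Series([1.70, 1.82, 1.65], index=["a", "b", "c"])
weight = pd.Series([80.0, 68.0, 59.0], index=["b", "a", "c"])

# Building a DataFrame from them aligns the values on the index labels.
df = pd.DataFrame({"height": height, "weight": weight})

# Each column is again a Series, and all columns share the DataFrame's Index.
height_column = df["height"]
shared_index = df.index

# Label "a" received the weight stored under "a" (68.0), not the first value.
weight_a = df.loc["a", "weight"]
```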


# Series object

<div style="float:right"> <img src="attachment:pandas-series.png" width="200"/> </div>

* Built on top of a NumPy array (early pandas versions made Series a subclass of `numpy.ndarray`; current versions do not)
* Data: any type
* Index labels need not be ordered
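
A minimal sketch of these properties, with made-up values: the index labels are deliberately unsorted, values can be reached by label or by position, and a Series can hold mixed Python objects.

```python
import numpy as np
import pandas as pd

# Labels do not need to be sorted.
s = pd.Series([10, 20, 30], index=["z", "a", "m"])

by_label = s["a"]        # lookup by index label
by_position = s.iloc[0]  # lookup by integer position

# The underlying data is available as a NumPy array.
values = s.to_numpy()

# Data of any type: mixed values are stored with dtype object.
mixed = pd.Series([1, "two", 3.0])
```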

# DataFrame object

<div style="float:right"> <img src="attachment:pandas-dataframe.png" width="400"/> </div>

* `NumPy` array-like
* Each column can have a different type
* Row and column index
* Size mutable: insert and delete columns
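
The "size mutable" point can be sketched as follows, assuming a small hypothetical DataFrame: columns are added by assignment or with `insert`, and removed with `del` or `drop`.

```python
import pandas as pd

# Hypothetical starting data with two columns of different types.
df = pd.DataFrame({"x": [1, 2, 3], "label": ["a", "b", "c"]})

# Insert a new column at the end by assignment.
df["y"] = df["x"] * 2.0

# Insert a new column at a given position (here: first).
df.insert(0, "id", [100, 101, 102])

# Delete a column in place.
del df["label"]

# drop() returns a new DataFrame and leaves df itself unchanged.
without_y = df.drop(columns="y")
```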

# Let's have a little fun

## with some practical stuff

<br>

* [Pandas fundamentals](pandas-fundamentals.ipynb)
* [Practical example: IMDb database](pactical-example.ipynb)

![pandas-operations.png](attachment:pandas-operations.png)

* Binary operations are joins: the operands are aligned on their index labels before the operation is applied
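
A short sketch of what "joins" means here, using two hypothetical Series whose labels only partly overlap. Adding them behaves like an outer join on the index; labels present on only one side yield `NaN` unless a fill value is given.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# The result index is the union of both indexes (outer join).
# "a" and "d" exist on one side only, so their result is NaN.
total = s1 + s2

# add() with fill_value treats a label missing on one side as 0.
filled = s1.add(s2, fill_value=0)

# align() exposes the join explicitly; an inner join keeps the shared labels.
left, right = s1.align(s2, join="inner")
```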


# Pandas GroupBy

![pandas-groupby.png](attachment:pandas-groupby.png)
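
A minimal split-apply-combine sketch on hypothetical data (the column names `key` and `data` are illustrative assumptions): rows are split by key, an aggregation is applied to each group, and the results are combined into one object indexed by key.

```python
import pandas as pd

# Hypothetical data: each key appears twice.
df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": [1, 2, 3, 4, 5, 6],
})

# Split by key, apply sum to each group, combine into a Series.
sums = df.groupby("key")["data"].sum()

# Several aggregations at once give one column per function.
stats = df.groupby("key")["data"].agg(["min", "max", "mean"])
```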

# Hierarchical indexes


![pandas-hierarq-index.png](attachment:pandas-hierarq-index.png)

* Semantics: a tuple at each tick
* Enables easy group selection
* Terminology: “multiple levels”
* Natural part of GroupBy and reshape operations
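
The points above can be sketched with a small two-level index. The level names `group` and `year` and all values are hypothetical.

```python
import pandas as pd

# Each index entry ("tick") is a tuple, one element per level.
index = pd.MultiIndex.from_tuples(
    [("A", 2019), ("A", 2020), ("B", 2019), ("B", 2020)],
    names=["group", "year"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Easy group selection: all entries whose first level is "A".
group_a = s.loc["A"]

# A full tuple selects a single value.
b_2020 = s.loc[("B", 2020)]

# Reshape: move the "year" level from the rows to the columns.
wide = s.unstack("year")

# GroupBy on several keys produces a hierarchical index again.
flat = s.reset_index(name="value")
grouped = flat.groupby(["group", "year"])["value"].sum()
```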