## Data Manipulation with Pandas

**Credits**: slides [pandas: Powerful data analysis tools for Python](https://www.slideshare.net/wesm/pandas-powerful-data-analysis-tools-for-python)

# Contents

- Introducing Pandas Objects: Series, DataFrame, Index
- Data Indexing and Selection
- Operating on Data in Pandas
- Practical Example: Exploring the IMDB dataset

# Pandas
* http://pandas.pydata.org
* Free software released under the three-clause BSD license
* One of the most popular libraries among data scientists
* Rich relational data tool built on top of **NumPy**
  * Like R's `data.frame` on steroids
  * Excellent performance
  * Easy-to-use, consistent API
* A foundation for data analysis in Python

# Pandas
* Labeled axes to avoid misalignment of data
    * Does `data[:, 2]` hold weight or weight2?
    * When merging two tables, some rows may not match
* Missing values or special values may need to be removed or replaced

![image.png](attachment:1e8184aa-3273-4931-82b8-6562749b5bfb.png)
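The two points above can be illustrated with a minimal sketch. The names `weight` and `weight2` and all values are made up for illustration; the point is only that pandas aligns on labels rather than positions and marks unmatched labels as missing.

```python
import pandas as pd

# Two hypothetical measurements: labels are in a different order and only partly overlap
weight = pd.Series([70.0, 82.5, 64.0], index=["alice", "bob", "carol"])
weight2 = pd.Series([83.0, 71.0, 90.0], index=["bob", "alice", "dave"])

# Arithmetic is aligned on the index labels, not on positions
diff = weight2 - weight

# Labels present in only one Series ("carol", "dave") become NaN
n_missing = diff.isna().sum()

# Missing values can then be removed or replaced
cleaned = diff.dropna()
filled = diff.fillna(0.0)

print(diff)
```

With plain positional arrays, the same subtraction would silently pair "alice" with "bob".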

# Pandas
<img src="images/related_tags_over_time.png" width="500"/>

* Initial release in 2008, now in heavy production use in the industry
* Generally much better performance than other open-source alternatives (e.g. R)
* Hope: basis for the “next generation” statistical computing and analysis environment

# What can you do with pandas?
- Calculate statistics and answer questions about the data, like
    - What's the average, median, max, or min of each column? 
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more. 
- Store the cleaned, transformed data back into a CSV file, another file format, or a database
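A short sketch of these tasks on a toy table follows. The column names `A`, `B`, `C` mirror the questions above, the values are invented, and an in-memory buffer stands in for a real CSV file; plotting is left out to keep the example dependency-free.

```python
import io
import pandas as pd

# Toy data: column A contains one missing value
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, None],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "C": [5, 3, 5, 1, 5],
})

# Summary statistics for every numeric column (count, mean, min, max, quartiles)
summary = df.describe()
mean_a = df["A"].mean()        # missing values are skipped by default
median_b = df["B"].median()

# Does column A correlate with column B?
corr_ab = df["A"].corr(df["B"])

# What does the distribution of column C look like?
counts_c = df["C"].value_counts()

# Clean: drop rows with missing values, then filter rows by a criterion
clean = df.dropna()
subset = clean[clean["B"] > 2]

# Store the result as CSV and read it back
buffer = io.StringIO()
subset.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(summary)
```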

# Pandas
<img src="images/series-and-dataframe.png" width="600"/>

* Efficient implementation of a **DataFrame**
* **DataFrames**: multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data
* Implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs

# Pandas objects
* Series
  * analogous to a one-dimensional array with flexible indices
* DataFrame
  * analogous to a two-dimensional array with both flexible row indices and flexible column names
  * you can think of a DataFrame as a sequence of aligned Series objects.
    * "aligned" means that they share the same index.
* Index
  * immutable sequence of labels used by both Series and DataFrame
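As a rough illustration of how the three objects relate, the sketch below builds a DataFrame from two Series whose labels are in different orders. The names `height` and `weight` and their values are hypothetical.

```python
import pandas as pd

# Two Series with the same labels in a different order
height = pd.Series([1.70, 1.82, 1.65], index=["a", "b", "c"])
weight = pd.Series([82.0, 65.0, 70.0], index=["b", "c", "a"])

# A DataFrame built from a dict of Series aligns them on their labels
df = pd.DataFrame({"height": height, "weight": weight})

# Each column is a Series that shares the DataFrame's row index
col = df["height"]
shares_index = col.index.equals(df.index)

# Row and column labels are both Index objects
row_labels = df.index
col_labels = df.columns

# An Index is immutable: item assignment raises TypeError
try:
    row_labels[0] = "z"
    index_is_immutable = False
except TypeError:
    index_is_immutable = True

print(df)
```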


# Series object

<div style="float:right"> <img src="images/pandas-series.png" width="200"/> </div>

* Built on a NumPy array (a subclass of `numpy.ndarray` only in pandas versions before 0.13)
* Data: any type
* Index labels need not be ordered
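A minimal sketch of these properties, with invented values. It assumes a recent pandas release, where a Series wraps a NumPy array rather than subclassing `numpy.ndarray`.

```python
import numpy as np
import pandas as pd

# Index labels do not have to be sorted
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["d", "a", "c", "b"])

by_label = s["a"]        # label-based access
by_position = s.iloc[0]  # position-based access

# The underlying data is available as a NumPy array
values = s.to_numpy()
is_array = isinstance(values, np.ndarray)
series_is_array = isinstance(s, np.ndarray)  # False in recent pandas

# Data can be of any type; mixed Python objects fall back to dtype object
mixed = pd.Series([1, "two", 3.0])

# Labels can be sorted on demand
sorted_s = s.sort_index()

print(s)
```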

# DataFrame object

<div style="float:right"> <img src="images/pandas-dataframe.png" width="400"/> </div>

* `NumPy` array-like
* Each column can have a different type
* Row and column index
* Size-mutable: columns can be inserted and deleted
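The sketch below shows per-column types and column insertion and deletion on a small table. Column names, row labels, and values are hypothetical.

```python
import pandas as pd

# Each column has its own type; rows and columns both carry labels
df = pd.DataFrame(
    {"name": ["x", "y", "z"], "qty": [1, 2, 3], "score": [0.5, 1.5, 2.5]},
    index=["r1", "r2", "r3"],
)
print(df.dtypes)

# Append a new column at the end
df["double"] = df["qty"] * 2

# Insert a column at a given position (here: second column)
df.insert(1, "flag", [True, False, True])

# Delete columns, either in place or by returning a new DataFrame
del df["score"]
df = df.drop(columns=["double"])

print(df)
```

`del` and `insert` modify the DataFrame in place, while `drop` returns a new object by default.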