Data Science Fundamentals: Python |
[Table of Contents](../index.ipynb)
- - - 
<!--NAVIGATION-->
Module 10. | **[Introduction to Pandas](./01_introduction_to_pandas.ipynb)** | [Introducing Pandas Objects](./02_introducing-pandas-objects.ipynb) | [Data Manipulation with Pandas](./03_data_manipulation_pandas.ipynb) | [Getting Started with Pandas](./04_getting_started_pandas.ipynb) | [ZachHallRepo](./05_ZachHallRepo.ipynb) | [Exercises](./06_pandas_exercises.ipynb)

# What is pandas?
Now we begin to see one of the reasons why Python is so popular: **libraries**. Libraries are self-contained packages of coded capabilities that can be imported into Python to extend its functionality. One of the most fundamental libraries for working with data is called _pandas_.

>"pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python."  http://pandas.pydata.org/pandas-docs/stable/
    
The main way pandas does this is through the implementation of two data structures: Series and DataFrame. We will learn more about them soon but for now let's load some data and see some of the things we can do.
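As a first taste of those two structures, here is a minimal sketch. The names and values (`scores`, `students`, and the data in them) are made up for illustration and are not part of any course dataset.

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels.
scores = pd.Series([90, 85, 77], index=["ann", "bob", "cy"])

# A DataFrame is a two-dimensional table; each column behaves like a Series.
students = pd.DataFrame(
    {
        "score": [90, 85, 77],
        "city": ["Boston", "Denver", "Austin"],
    },
    index=["ann", "bob", "cy"],
)

print(scores["bob"])        # look up a single value by its label
print(students["city"])     # select one column (returned as a Series)
print(students.loc["cy"])   # select one row by its label
```

Both objects carry labels alongside the values, which is what lets us ask for `"bob"` or `"city"` by name rather than by position.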

## Why use a library?
Have you ever seen a data file where the first row contains column headers? Of course you have! That layout is far more common than a file containing only rows of data. By using the correct _parser_ in pandas, we can read the headers automatically. Of course, we could program that ourselves in base Python, but why would we want to? Someone else has already done a good job of it, so let's use their code.
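To make the header example concrete, the sketch below parses a tiny CSV with `pd.read_csv`. The file contents are hypothetical, and `io.StringIO` stands in for a real file on disk so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical file contents; io.StringIO stands in for a file on disk.
csv_text = """name,age,city
Ann,34,Boston
Bob,27,Denver
"""

# By default, read_csv treats the first row as the column headers.
people = pd.read_csv(io.StringIO(csv_text))

print(list(people.columns))  # ['name', 'age', 'city']
print(people.dtypes)         # the parser also infers a type for each column
```

With a real file, you would pass its path to `pd.read_csv` instead of the `StringIO` object.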

That same idea applies to many other computing tasks, too. If the functionality is already available in a library, we can put it to work instead of having to write the code ourselves.

# Data Manipulation with Pandas

In the previous chapter, we dove into detail on NumPy and its ``ndarray`` object, which provides efficient storage and manipulation of dense typed arrays in Python.
Here we'll build on this knowledge by looking in detail at the data structures provided by the Pandas library.
Pandas is a newer package built on top of NumPy that provides an efficient implementation of a ``DataFrame``.
``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

As we saw, NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.
While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data or working with missing data) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings and pivots). Each of these is an important part of analyzing the less structured data found in many forms in the world around us.
Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.
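The following sketch illustrates two of the tasks mentioned above, missing data and grouping, on a small made-up table. The column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sales records with one missing value.
sales = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "units": [10.0, np.nan, 5.0, 8.0],
    }
)

# Missing data: count the gaps in a column.
n_missing = sales["units"].isna().sum()

# Grouping: total units per region. sum() skips NaN by default.
units_by_region = sales.groupby("region")["units"].sum()

print(n_missing)
print(units_by_region)
```

Doing the same with a bare ``ndarray`` would mean tracking the region labels and the missing entries by hand.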

In this chapter, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively.
We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

# Minimally sufficient pandas
There is an approach called "minimally sufficient pandas" that I try to follow. I will let the original author explain why.

>The whole point of a data analysis library should be to provide you with the tools so that you can focus on the data analysis. While Pandas does provide you with the right tools, it doesn’t do so in a way that allows you to focus on the analysis. Instead, users are forced to tread through the complex and overabundant syntax.
>
>I endorse the following as my definition for Minimally Sufficient Pandas.
>
>	- It is a small subset of the library that is sufficient to accomplish nearly everything that it has to offer.
>	- It allows you to focus on doing data analysis and not the syntax
>
>With this minimally sufficient subset of Pandas:
>
>   - Your code will be simple, explicit, straightforward, and boring
>   - You will choose one obvious way to accomplish a task
>   - You will use this obvious way every single time
>   - You won’t have to retain as many commands in working memory
>   - Your code will be easier to understand by others and by you

Source: https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
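One way to read the "one obvious way" advice is to settle on a single style for a common task such as selecting a column. The sketch below is an illustration of that idea on made-up data, not an excerpt from the original article.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob"], "test score": [90, 85]})

# Two ways to select the same column.
by_brackets = df["name"]
by_attribute = df.name

# Only brackets work when a column name contains a space,
# so brackets can serve as the single style used every time.
scores = df["test score"]

print(by_brackets.equals(by_attribute))
print(scores.tolist())
```

Picking the bracket form everywhere means there is one less decision to make per line of code.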

## Installing and Using Pandas

Installing Pandas requires NumPy, and building the library from source additionally requires the tools needed to compile the C and Cython sources on which Pandas is built.
Details on this installation can be found in the [Pandas documentation](http://pandas.pydata.org/).
If you followed the advice outlined in the Preface and used the Anaconda stack, you already have Pandas installed.

Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``. This import convention will be used throughout the remainder of this repo.

Once Pandas is installed, you can import it and check the version:

In [1]:
import pandas as pd
pd.__version__

'1.0.5'

## Reminder about Built-In Documentation

As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ``?`` character). (Refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb) if you need a refresher on this.)

For example, to display all the contents of the pandas namespace, you can type

```ipython
In [3]: pd.<TAB>
```

And to display Pandas's built-in documentation, you can use this:

```ipython
In [4]: pd?
```

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.
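Outside IPython, similar information is available through plain Python introspection. The sketch below is a rough equivalent under that assumption: `dir` lists the names in the namespace, and a function's `__doc__` attribute holds the text that `?` would display.

```python
import pandas as pd

# Public names in the pandas namespace, roughly what tab completion lists.
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names))
print("DataFrame" in public_names)

# First line of a docstring, a small part of what `pd.read_csv?` displays.
first_doc_line = pd.read_csv.__doc__.strip().splitlines()[0]
print(first_doc_line)
```

Calling `help(pd.read_csv)` prints the full docstring in any Python session.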

In [4]:
pd?

- - -

Copyright © 2020 Qualex Consulting Services Incorporated.