# Practical Data Analysis Using Jupyter Notebook

## Ch. 4: Creating Thy First `pandas` DataFrame 
---

## Introduction

This here is an introduction to `pandas` DataFrames and my notes from the excellent book *Practical Data Analysis Using Jupyter Notebook*.  Please enjoy!

## Topics
* Techniques for manipulating tabular data
* Understanding pandas and DataFrames
* Handling essential data formats
* Data dictionaries and data types
* Creating your first DataFrame

## Imports

In [1]:
import pandas as pd
pd.__version__

'1.4.3'

---

## Techniques for manipulating tabular data

The `pandas` library takes its name from the term **panel data**: its creator, Wes McKinney, shortened and combined the two words to get *pan* and *da*. Panel data consists of observations of multiple dimensions measured over a period of time, and it is very common in statistical studies and research papers.

Panel data is presented in tabular form with rows and columns and comes in a few different types, such as balanced, unbalanced, long, short, fixed, and rotating.

![Panel Data](../data/source/panel_data.JPG)

Imagine, if you will, we had a summary cross table like this:

![Panel Data](../data/source/begin_table.JPG)

But what if we had 100 cities and the data went back 10 years? Adding that many columns would make the table very hard to navigate.

A best practice in data analysis is to begin with the end in mind. So, for this example, the output table we want to produce will look similar to the following table:

![final_table](../data/source/final_table.JPG)

Structuring data the way it appears in the preceding output table has several advantages:

* Each column, which is also known as a dimension or axis, has a single conformed data type.
* Statistical analysis becomes much easier because each dimension can be treated as an independent array of values of the same data type, on which calculations can be performed using NumPy.
* The table can be sorted by any field without the data values in each row (tuple) becoming misaligned or inconsistent.

Keeping the integrity of your data builds trust in your process and helps ensure your analysis will be accurate.
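
The move from a wide cross table to the long layout described above can be sketched with `melt`. This is only an illustration: the city names, years, and values below are made up, and the column names `City`, `Year`, and `Value` are assumptions rather than the ones used in the book's figures.

```python
import pandas as pd

# Hypothetical wide "cross table": one row per city, one column per year.
# All names and numbers here are invented for illustration.
wide = pd.DataFrame({
    "City": ["Boston", "Chicago"],
    "2019": [10, 20],
    "2020": [12, 18],
})

# Reshape to long form: one row per (City, Year) observation,
# so every column holds values of a single type.
long = wide.melt(id_vars="City", var_name="Year", value_name="Value")
long["Year"] = long["Year"].astype(int)

# Sorting by any field moves whole rows, so values stay aligned.
long = long.sort_values(["City", "Year"]).reset_index(drop=True)
print(long)
```

Adding more cities or years to the wide table only adds rows to the long one; the three columns stay the same.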


## Understanding `pandas` and DataFrames

A `pandas` DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, the data is aligned in a tabular fashion in rows and columns, and a DataFrame consists of three principal components: the data, the rows, and the columns.

Some key benefits of using DataFrames include the following:

* It allows you to convert all source files into readable data objects for easier merging and analysis.
* It provides automatic or user-defined indexing to help with looking up a value or selecting a cross-section of your DataFrame, which is also known as a data slice.
* Each column can be treated as a single NumPy array with its own data type, so the DataFrame as a whole can hold multiple data types.
* It really excels at fixing data alignment and missing data elements, which are displayed and referenced as Not a Number (NaN).
* It allows pivoting and reshaping data without going back to the source of record for each dataset.
* It is easy to add, remove, or change data using single Python commands to expedite the analysis of one or more data sources.
* It allows aggregations, such as Group By, and other calculations against metrics, such as sum, min, and max, to be performed against the DataFrame.
* It allows merging, sorting, joining, and filtering against one or more DataFrames.

What I enjoy about using pandas and DataFrames is the flexibility of the built-in commands that are provided to you as a data analyst. Let's walk through a few examples.

In [None]:
# make a bit of test data in a dict
product_data = {
    'product a': [13, 20, 0, 10],
    'product b': [10, 30, 17, 20],
    'product c': [6, 9, 10, 0]
}

# creating a dataframe
purchase_data = pd.DataFrame(product_data)
purchase_data.head()
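
Two of the benefits listed earlier, per-column data types and automatic alignment with NaN for missing labels, can be seen in a small sketch. The `orders`, `q1`, and `q2` objects and their values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical mixed-type data: each column keeps its own dtype.
orders = pd.DataFrame({
    "product": ["a", "b", "c"],
    "units": [13, 20, 0],
    "price": [1.5, 2.0, 0.75],
})
print(orders.dtypes)

# A single column converts to a NumPy array with one dtype.
units = orders["units"].to_numpy()
print(units.sum())

# Alignment: adding two Series matches values on index labels.
# Labels present on only one side produce NaN in the result.
q1 = pd.Series([13, 20], index=["a", "b"])
q2 = pd.Series([10, 6], index=["b", "c"])
total = q1 + q2
print(total)
```

Only label `b` appears in both Series, so it is the only entry in `total` with a numeric result.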

## Handling essential data formats

Source files can come in multiple formats, including:

* CSV
* JSON
* XML
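
As a rough sketch of how two of these formats end up in the same tabular shape, the snippet below loads a small CSV string and an equivalent JSON string into DataFrames. The sample records and field names are invented, and `io.StringIO` stands in for real files; XML is left out to keep the sketch short.

```python
import io
import pandas as pd

# Hypothetical sample data, expressed once as CSV and once as JSON.
csv_text = "city|year|value\nBoston|2019|10\nChicago|2019|20\n"
json_text = (
    '[{"city": "Boston", "year": 2019, "value": 10},'
    ' {"city": "Chicago", "year": 2019, "value": 20}]'
)

# io.StringIO lets the readers treat a string like an open file.
from_csv = pd.read_csv(io.StringIO(csv_text), sep="|")
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```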

### Data hierarchy

Data hierarchies are defined, consistent groupings of data fields or records. A hierarchy can seem obvious (for example, a son has a father and a mother), but from a data perspective that relationship must be defined explicitly. In the XML file format, the hierarchy is expressed as an XML tree, in which elements are nested inside other elements.

This hierarchy relationship is commonly known as a parent-child relationship.
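
A parent-child relationship in an XML tree can be walked with Python's standard `xml.etree.ElementTree` module. The `family`, `parent`, and `child` element names and the people in them are hypothetical, chosen only to mirror the example in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: nesting defines who is a child of whom.
xml_text = """
<family>
  <parent name="Pat">
    <child name="Sam"/>
    <child name="Alex"/>
  </parent>
</family>
"""

root = ET.fromstring(xml_text.strip())

# Walk the tree one level at a time and record each parent-child pair.
pairs = [
    (parent.get("name"), child.get("name"))
    for parent in root.findall("parent")
    for child in parent.findall("child")
]
print(pairs)
```

Flattening the tree into pairs like this is one way to turn a hierarchy into rows that a DataFrame can hold.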



## Data dictionaries and data types

A data dictionary can come in all shapes and sizes. It may be documented outside the source file, commonly on a help page, a wiki, or a blog, or it may be embedded within the source data itself (as in JSON or XML).

Remember that you may need to convert a data type between multiple sources, especially when blending data from different systems and file formats. For example, a JSON number with a fractional part (a real number) becomes a `float` in Python.
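
The type conversion can be checked directly with the standard `json` module. The record below is a made-up example; the point is only to show which Python type each JSON value is decoded into.

```python
import json

# Hypothetical JSON record covering the common JSON value types.
raw = '{"id": 7, "price": 19.99, "name": "widget", "in_stock": true, "tags": null}'
record = json.loads(raw)

# Map each field to the name of the Python type it was decoded as.
types = {key: type(value).__name__ for key, value in record.items()}
print(types)
```

Whole numbers decode to `int`, numbers with a fractional part to `float`, `true`/`false` to `bool`, and `null` to `None`.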

## Creating thy first DataFrame

Here are a few useful commands to run in `pandas`:

* `pd.read_csv('import_filename.csv', header=1)`: Reads data from a CSV file directly into a pandas DataFrame (`header=1` uses the second line of the file as the column names)
* `my_df.to_csv('export_filename')`: Exports the DataFrame directly to a CSV file on your workstation
* `my_df.shape`: Provides the number of rows and columns of your DataFrame
* `my_df.info()`: Provides metadata about your DataFrame, including data types for each column
* `my_df.describe()`: Provides summary statistics for every numeric column: count, mean, standard deviation, min, max, and percentiles (25th, 50th, and 75th)
* `my_df.head(2)`: Displays the first two records from the DataFrame
* `my_df.tail(2)`: Displays the last two records from the DataFrame
* `my_df.sort_index(axis=1)`: Sorts by the labels along an axis; in this example, by the column label headers alphabetically from left to right
* `my_df.isnull()`: Returns a DataFrame of the same shape with a True/False indicator for each value showing whether it is null
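
To try these commands without a file on disk, here is a minimal sketch that builds `my_df` from an in-memory CSV string. The column names and values are hypothetical, and the columns are deliberately out of alphabetical order so that sorting by column label has a visible effect.

```python
import io
import pandas as pd

# Hypothetical CSV content; column "a" has one missing value.
csv_text = "b,a,c\n1,4.0,x\n2,,y\n3,6.0,z\n"
my_df = pd.read_csv(io.StringIO(csv_text))

print(my_df.shape)        # (rows, columns)
my_df.info()              # dtypes and non-null counts per column
print(my_df.describe())   # summary statistics for numeric columns
print(my_df.head(2))      # first two records
print(my_df.tail(2))      # last two records

# Sort columns by label; axis is passed by keyword.
by_label = my_df.sort_index(axis=1)
print(by_label)

# Cell-by-cell null indicator, then a per-column count of nulls.
null_mask = my_df.isnull()
print(null_mask.sum())
```

Chaining `.sum()` onto `isnull()` is a common way to turn the True/False grid into a count of missing values per column.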

In [None]:
# CSV example
df = pd.read_csv('../data/source/evolution_of_data_analysis.csv', header=0, sep="|")
df.info()

### Business Q: How many milestone events occurred by decade?

To answer this question, we can use the `groupby` method.

In [10]:
df = pd.read_csv('../data/source/evolution_of_data_analysis.csv', header=0, sep="|")
df.groupby(['Decade']).agg({'Year': 'count'})  # agg applies the given aggregation ('count') to the 'Year' column within each group

| Decade | Year |
|--------|------|
| 1940s  | 2    |
| 1950s  | 2    |
| 1960s  | 1    |
| 1970s  | 2    |
| 1980s  | 5    |
| 1990s  | 9    |
| 2000s  | 14   |
| 2010s  | 7    |
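
For a closer look at what `agg` does in the `groupby` call above, here is a small sketch on made-up data. The `milestones` DataFrame and its values are hypothetical and are not taken from `evolution_of_data_analysis.csv`; only the `Decade` and `Year` column names are borrowed from it.

```python
import pandas as pd

# Hypothetical stand-in for the milestone data.
milestones = pd.DataFrame({
    "Decade": ["1940s", "1940s", "1950s"],
    "Year": [1943, 1947, 1954],
})

# Dict form: map a column name to the aggregation to apply per group.
by_decade = milestones.groupby("Decade").agg({"Year": "count"})
print(by_decade)

# Named aggregation: choose the output column names and
# compute several aggregations in one pass.
summary = milestones.groupby("Decade").agg(
    event_count=("Year", "count"),
    first_year=("Year", "min"),
)
print(summary)
```

The dict form keeps the original column name (`Year`) even though the values are now counts, which is why the named form can be easier to read in a report.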
