<a href="https://colab.research.google.com/github/gg5d/DS-1002/blob/main/pandas1_student_F23.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas DataFrames I


### University of Virginia
### Programming for Data Science
### Last Updated: September 13, 2023
---  

### PREREQUISITES
- variables
- data types
- operators
- list comprehensions (not essential)
- numpy arrays (not essential)


### SOURCES
- ten minutes to pandas  
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html


- sort_values()  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html


- value_counts()  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html


- to_csv() : saving to CSV file  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html


- read_csv() : load CSV file into DataFrame  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


- dropna() : drop missing data  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html


- fillna() : impute missing data  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html




### OBJECTIVES
- Introduce pandas dataframes and the essential operations

# Pandas DataFrames

The **Series**: a 1-dimensional labeled array capable of holding any data type.

The **DataFrame**: a 2-dimensional labeled data structure with columns of potentially different types.

> Note: Pandas used to have a 3-dimensional structure called a **panel**, but it has been removed from the library.\
Ironically, the name "pandas" was partly derived from the word "panel": $pan(el)-da(ta)-s$.\
To handle higher-dimensional data, the Pandas team suggests using [XArray](https://xarray.pydata.org/en/stable/), which also builds on NumPy arrays.

By far, the most important data structure in Pandas is the dataframe, with the series playing a supporting role.

In fact, dataframe objects are built out of series objects.

So, **to understand what a dataframe is and how it behaves, you need to understand what a series is and how it is constructed.**

Before going into that, here are two quick observations about dataframes:

First, dataframes are **inspired by the R structure** of the same name. They have many similarities, but there are fundamental differences between the two that go beyond mere language differences. Most important is that Pandas dataframes have **indexes**, whereas R dataframes do not.

Second, it is helpful to think of Pandas as a wrapper around NumPy and Matplotlib that makes it much easier to perform common operations, like selecting data by column name or plotting. But this comes at a cost: Pandas is often slower than NumPy. This represents the classic trade-off between **ease-of-use** for humans and machine **performance**.

For shorthand, `df` will refer to pandas DataFrames.  

DataFrames can be created with pandas.    
Various formats (`csv`,`json`,...) can be loaded into DataFrames.   

The [ten minutes to pandas link](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) above gives a good, brief overview of pandas. Be sure to review.

Import pandas like this, where the alias `pd` is convention:

In [None]:
import numpy as np
import pandas as pd

## Series

A **series** is a one-dimensional array-like object containing

- a sequence of values of the **same type** (the types are similar to NumPy types) and
- an associated array of data labels, called its **index**

A **series** is at heart a one-dimensional array with **labels** along its axis.

Think of **the index as a separate data structure** that is attached to the array.
* The array holds the data.
* The index holds the names of the observations or things that the data are about.

So, Pandas moves us out of what we called anonymous data.

Why have an index?
* Provides a way to access elements of the array by name
* Allows series objects that share index labels to be combined

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj

 You can get the array representation and index object of the Series via its *`array`* and *`index`* attributes, respectively:

[An **attribute** is a <ins>value</ins> associated with an object, usually referenced by name using dotted expressions. For example, if an object `o` has an attribute `a`, it is referenced as `o.a`.]

In [None]:
  # call .array

The result of the `.array` attribute is a `PandasArray` (renamed `NumpyExtensionArray` in pandas 2.1), which usually wraps a NumPy array but can also contain special extension array types, which will be discussed more in Ch 7.3: Extension Data Types.
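As a minimal sketch of the difference between the wrapper and the underlying NumPy array, the example below uses a hypothetical Series named `temps` (not one of the notebook's objects). The exact class name printed for `.array` depends on the installed pandas version, so no output is shown for it.

```python
import numpy as np
import pandas as pd

# Hypothetical example Series, used only for illustration
temps = pd.Series([21.5, 19.0, 23.2])

arr = temps.array          # pandas extension array wrapping the data
values = temps.to_numpy()  # plain NumPy ndarray

print(type(arr).__name__)     # class name varies by pandas version
print(type(values).__name__)  # ndarray
print(temps.index)            # RangeIndex(start=0, stop=3, step=1)
```

When you need a real NumPy array for numerical code, `.to_numpy()` is usually the more direct choice.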

In [None]:
  # call .index

Often, you'll want to create a Series with an index identifying each data point with a label:

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

In [None]:
obj2.index

You can use labels in the index when selecting single values or a set of values:

In [None]:
obj2["a"]

In [None]:
obj2["d"] = 6
obj2

In [None]:
obj2[["c", "a", "d"]]

You can use NumPy functions or NumPy-like operations, such as filtering with a Boolean array or multiplying by a scalar:

In [None]:
obj2[obj2 > 0]

In [None]:
obj2 * 2
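NumPy functions themselves can also be applied to a Series. The sketch below uses a hypothetical Series named `areas` to show that the result is still a Series and that the index labels are preserved.

```python
import numpy as np
import pandas as pd

# Hypothetical example Series, used only for illustration
areas = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

sides = np.sqrt(areas)   # element-wise; the labels a, b, c are kept

print(type(sides).__name__)   # Series
print(sides["b"])             # 3.0
print(areas[areas > 5].index.tolist())   # ['b', 'c']
```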

Another way to think about a Series is as a fixed-length, **ordered dictionary**, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dictionary:
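For instance, membership tests and key/value iteration work much as they do for a dictionary. This is a small sketch on a hypothetical Series named `scores`; note that `in` checks the index labels, not the values.

```python
import pandas as pd

# Hypothetical example Series, used only for illustration
scores = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

print("b" in scores)          # True: membership tests the index labels
print("e" in scores)          # False
print(list(scores.keys()))    # ['d', 'b', 'a', 'c']

# Iterate over (label, value) pairs, like dict.items()
for label, value in scores.items():
    print(label, value)
```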

Given a Python dictionary, you can create a Series from it by passing the dictionary:

In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
sdata

In [None]:
type(sdata)

In [None]:
obj3 = pd.Series(sdata)
obj3

A Series can be converted back to a dictionary with its *`to_dict`* method:

A *`method`* is a function which is defined inside a class body. If called as an attribute of an instance of that class, the method will get the instance object as its first argument (which is usually called self).

In [None]:
obj3.to_dict()

You can override the order by passing an index with the dictionary keys in the order you want them to appear in the resulting Series:

In [None]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

Since no value for "California" was found, it appears as NaN (Not a Number), which pandas uses to mark missing or NA values. "Utah" is excluded because it is not in `states`.

The **isna** and **notna** functions in pandas should be used to detect missing data:

A *`function`* is a series of statements which returns some value to a caller. It can also be passed zero or more arguments which may be used in the execution of the body.

In [None]:
pd.isna(obj4)

In [None]:
pd.notna(obj4)

Series also has these as instance methods:

[An *`instance`* is an individual object of a certain class.]

In [None]:
obj4.isna()
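Once missing values are detected, they are typically counted, dropped, or filled. The sketch below uses a hypothetical Series named `pop` together with `dropna()` and `fillna()` from the SOURCES list; filling with 0 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical example Series with one missing value
pop = pd.Series(
    [np.nan, 35000.0, 16000.0, 71000.0],
    index=["California", "Ohio", "Oregon", "Texas"],
)

n_missing = pop.isna().sum()   # True counts as 1, so this counts the NaNs
dropped = pop.dropna()         # new Series without the missing entry
filled = pop.fillna(0)         # new Series with NaN replaced by 0

print(n_missing)               # 1
print(dropped.index.tolist())  # ['Ohio', 'Oregon', 'Texas']
print(filled["California"])    # 0.0
```

Both `dropna()` and `fillna()` return new objects here; `pop` itself still contains the NaN.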

A useful Series feature for many applications is that it **automatically aligns by index label** in arithmetic operations: **(this is cool!)**

In [None]:
obj3

In [None]:
obj4

In [None]:
obj3 + obj4
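Labels that appear in only one of the two Series produce NaN in the sum. As a hedged sketch on two hypothetical Series `a` and `b`, the `add()` method with `fill_value=0` treats a label missing from one side as 0 instead.

```python
import pandas as pd

# Hypothetical example Series with partially overlapping labels
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "w"])

total = a + b                    # aligned on labels; the result index is the union
filled = a.add(b, fill_value=0)  # a label missing on one side counts as 0

print(total["y"])                # 12.0
print(int(total.isna().sum()))   # 3 (w, x and z each appear on one side only)
print(filled.to_dict())          # {'w': 20.0, 'x': 1.0, 'y': 12.0, 'z': 3.0}
```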

Both the Series object itself and its index have a **name** attribute, which integrates with other areas of pandas functionality:

In [None]:
obj4.name = "population"
obj4.index.name = "state"
obj4

A Series’s index can be altered in place by assignment:

In [None]:
obj

In [None]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

## Dataframe



A **dataframe is a collection of series** with a common index.

To this collection of series the dataframe adds a set of labels along the horizontal axis.
* The index is **axis 0**
* The columns are another kind of index, called **axis 1**
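To make this concrete, here is a minimal sketch that builds a dataframe from two hypothetical Series, `height` and `weight`, that share an index. Selecting a column gives back a Series carrying the dataframe's index.

```python
import pandas as pd

# Two hypothetical Series with the same index labels
height = pd.Series([150, 160], index=["ann", "bob"])
weight = pd.Series([55.0, 70.5], index=["ann", "bob"])

people = pd.DataFrame({"height": height, "weight": weight})

print(type(people["height"]).__name__)   # Series
print(people.index.tolist())             # axis 0: ['ann', 'bob']
print(people.columns.tolist())           # axis 1: ['height', 'weight']
print(people["height"].index.equals(people.index))   # True
```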

Note that both index and column labels can be **multidimensional**.
* These are called hierarchical indexes and go by the technical name `MultiIndex`.
* As an example, consider that a table of text data might have a two-column index: `(book_id, chap_id)`
* See [the Pandas documentation](https://pandas.pydata.org/docs/user_guide/advanced.html).
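The `(book_id, chap_id)` example can be sketched as follows. The dataframe `chapters`, its `n_words` column, and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-level index: (book_id, chap_id)
idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (2, 1)],
    names=["book_id", "chap_id"],
)
chapters = pd.DataFrame({"n_words": [1200, 950, 1800]}, index=idx)

book1 = chapters.loc[1]                 # all chapters of book 1
one = chapters.loc[(2, 1), "n_words"]   # a single (book, chapter) cell

print(chapters.index.nlevels)   # 2
print(len(book1))               # 2
print(one)                      # 1800
```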

It is **crucial** to understand the difference between the index of a dataframe and its data in order to understand how dataframes work.

Many a headache is caused by not understanding this difference :-)

**Indexes are powerful and controversial.**
* They allow for all kinds of magic to take place when combining and accessing data.
* But they are expensive and sometimes hard to work with (especially multiindexes).
* They are especially difficult if you are coming from R and expecting dataframes to behave a certain way.

## Some visuals to help

<img src="https://pynative.com/wp-content/uploads/2021/02/dataframe.png" width="50%" height="50%"/>

<img src="https://miro.medium.com/max/700/1*KOBhtOeFntu6CyJUsCdN0g.jpeg" width="50%" height="50%"/>

# DataFrames Constructors

There are several ways to create pandas dataframes.

**Passing a dictionary of objects:**

In [None]:
# x, y, z, are lists in the dict

df = pd.DataFrame({
    'x': [0, 2, 1, 5],
    'y': [1, 1, 0, 0],
    'z': [True, False, False, False]
})

**`.index`**  
https://pandas.pydata.org/docs/reference/api/pandas.Index.html

In [None]:
df.index

In [None]:
type(df.index)

**`list()`**  
casts an object to a list  
here it gives you the df index as a list

In [None]:
                   # call list() on the index of df

**`.columns`**  
gives you the column labels

In [None]:
                    # call .columns to get the column labels

**`object`** = the dtype pandas traditionally uses for text (string) data

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-types

In [None]:
# can also cast as a list
                                # call list() on the column labels of df

**`.values`**  
gives a NumPy representation of the dataframe (more on NumPy later)  
can also use `.to_numpy()`

In [None]:
                    # call .values on the DataFrame to get the values

In [None]:
                      # check the data type
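A NumPy array has a single data type, so the result depends on the column types. The sketch below, using two hypothetical dataframes, shows that all-numeric columns give a numeric array while mixing numbers and text falls back to `object`.

```python
import pandas as pd

# Hypothetical dataframes, used only for illustration
numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1, 2], "label": ["u", "v"]})

numeric_arr = numeric.to_numpy()
mixed_arr = mixed.to_numpy()

print(numeric_arr.dtype)   # float64: ints are upcast to match the floats
print(mixed_arr.dtype)     # object: no common numeric type exists
print(mixed_arr.shape)     # (2, 2)
```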

**Passing the three pieces explicitly:**
- columns as list
- index as list
- data as list of lists

In [None]:
df2 = pd.DataFrame(
    columns=['x','y'],
    index=['row1','row2','row3'],
    data=[[9,3],[1,2],[4,6]])

In [None]:
df2

**Passing a nested list (or list-like):**

No index is provided, so pandas assigns the default integer index (0, 1, ...).

In [None]:
my_data = [
    ('a', 1, True),
    ('b', 2, False)
]
df3 = pd.DataFrame(my_data, columns=['f1', 'f2', 'f3'])

In [None]:
df3

# Naming indexes

In [None]:
df3.index.name = 'obs_id'

In [None]:
df3

# Copying DataFrames with `copy()`

Use `copy()` to give the new df a clean break from the original.  

Plain assignment (`df_new = df`) does not copy anything: the new name points to the same object as the original.

In [None]:
df = pd.DataFrame({'x':[0,2,1,5], 'y':[1,1,0,0], 'z':[True,False,False,False]})

In [None]:
df_deep    = df.copy()  # deep copy (the default); changes to df will not pass through
df_shallow = df         # not a copy: a second name for the same object, so changes to df pass through

In [None]:
print('--df')
df

In [None]:
# update values in df.x
df.x = 1

In [None]:
print('--Updated df')
df

In [None]:
print('--df_shallow')
print(df_shallow)
print('--df_deep')
print(df_deep)

In [None]:
# rebuild df
df = pd.DataFrame({'x':[0,2,1,5], 'y':[1,1,0,0], 'z':[True,False,False,False]})
df

Notice that `df_shallow` mirrors changes to `df`, since both names refer to the same object.  
`df_deep` is an independent copy, so changes to `df` do not affect `df_deep`. (A true shallow copy is made with `df.copy(deep=False)`.)
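The identity check `is` makes the distinction explicit. This is a small sketch with hypothetical names (`original`, `alias`, `independent`), not the notebook's own objects.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
original = pd.DataFrame({"x": [0, 2, 1, 5]})

alias = original               # no copy: both names refer to one object
independent = original.copy()  # deep copy by default

original.loc[0, "x"] = 99      # modify the original in place

print(alias is original)          # True
print(independent is original)    # False
print(alias.loc[0, "x"])          # 99
print(independent.loc[0, "x"])    # 0
```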

# Column Data Types

**With `.dtypes`**

In [None]:
df.dtypes

**With `.info()`**

In [None]:
df.info()

# Column Renaming with `.rename()`

You can rename one or more fields at once by passing a dict that maps old names to new names.  

Rename the field `z` to `is_label`:

In [None]:
df = df.rename(columns={'z': 'is_label'})

In [None]:
df
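Renaming several columns in one call works the same way. In this sketch the dataframe `raw` and the new name `feature_1` are hypothetical. Note that `.rename()` returns a new dataframe, which is why the result is assigned back to a name.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
raw = pd.DataFrame({"x": [0, 2], "y": [1, 1], "z": [True, False]})

# Map each old name to its new name; unlisted columns are left alone
renamed = raw.rename(columns={"x": "feature_1", "z": "is_label"})

print(raw.columns.tolist())       # ['x', 'y', 'z'] (unchanged)
print(renamed.columns.tolist())   # ['feature_1', 'y', 'is_label']
```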

# Column Referencing

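Use bracket notation or dot notation.  
- bracket notation: the column name is passed as a string
- dot notation: the column name is written as an attribute, so it must be a valid Python name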

**Bracket**

In [None]:
df['y']

**Dot** (i.e. as object attribute)

In [None]:
df.y

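Dot notation is very convenient: because columns are exposed as object attributes, they can be tab-completed in various editing environments.

But:
- It only works if the column name is a valid Python identifier, is not a reserved word, and does not clash with an existing DataFrame attribute or method
- It can't be used when creating a new column (see below)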
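The first limitation can be seen with a hypothetical dataframe `sales` whose column names were chosen to cause trouble: `count` clashes with the `DataFrame.count` method, and `unit price` contains a space.

```python
import pandas as pd

# Hypothetical dataframe with awkward column names
sales = pd.DataFrame({"count": [3, 5], "unit price": [2.5, 4.0]})

# Dot notation finds the DataFrame.count method, not the column
print(callable(sales.count))          # True

# Bracket notation always refers to the column
print(sales["count"].tolist())        # [3, 5]
print(sales["unit price"].tolist())   # [2.5, 4.0]
```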

Column attributes and methods work with both:
 - example: `.values`

In [None]:
df.y.values, df['y'].values

Show only the first value by indexing:

In [None]:
df.y.values[0]

# Column Selection

You select columns from a dataframe by passing a column label or a list of labels (or any expression that evaluates to a list).

In [None]:
# single bracket
df['x']  # this is a Series

In [None]:
# double bracket
df[['x']]  # this is a DataFrame

In [None]:
df[['y', 'x']]
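Since any expression that evaluates to a list works, a list comprehension can build the selection. The sketch below uses a hypothetical dataframe `frame` and keeps only the columns whose names start with `x`.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
frame = pd.DataFrame({"x": [0, 2], "y": [1, 1], "x_plus_y": [1, 3]})

# Build the list of wanted column labels, then select with it
x_cols = [c for c in frame.columns if c.startswith("x")]
subset = frame[x_cols]

print(x_cols)                    # ['x', 'x_plus_y']
print(type(subset).__name__)     # DataFrame
```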

### TRY FOR YOURSELF (UNGRADED EXERCISES)

1) Create a dataframe called `dat` by passing a dictionary of inputs. Here are the requirements:
- has a column named `features` containing at least four floating-point numbers
- has a column named `labels` containing integers 0, 1, 2  

Print the df.

In [None]:
dat =
dat

2) Rename the `labels` column in `dat` to `label`.  
[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)

In [None]:
dat =
dat

# Adding New Columns

It is typical to create a new column from existing columns.  

In this example, a new column (or field) is created by summing `x` and `y`:

In [None]:
df['x_plus_y'] = df.x + df.y

In [None]:
df

Notice the components:

- the left side has form: DataFrame name, bracket notation, new column name
- the assignment operator `=` is used
- the right side contains an expression; here, two df columns are summed

Bracket notation also works on the fields, but it's more typing:

In [None]:
df['x_plus_y'] = df['x'] + df['y']
df

The bracket notation must be used when assigning to a new column. This will break:

In [None]:
# SyntaxError: dot notation cannot take a quoted name
df.'x_plus_y' = df.x + df.y
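Dropping the quotes does not help either. As a hedged sketch on a hypothetical dataframe `frame`: in the pandas versions I am aware of, assigning to a new attribute name runs without an error, but pandas emits a warning and sets an ordinary Python attribute rather than creating a column.

```python
import warnings

import pandas as pd

# Hypothetical dataframe, used only for illustration
frame = pd.DataFrame({"x": [0, 2], "y": [1, 1]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the pandas warning for this demo
    frame.total = frame.x + frame.y   # sets an attribute, not a column

created_by_dot = "total" in frame.columns

frame["total"] = frame["x"] + frame["y"]   # bracket notation creates the column
created_by_bracket = "total" in frame.columns

print(created_by_dot)       # False
print(created_by_bracket)   # True
```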

# Removing Columns with `del` and `.drop()`

## `del`

`del` can delete an entire DataFrame variable or a single column from the frame.

In [None]:
df_drop = df.copy()

In [None]:
df_drop

In [None]:
# delete the column 'x'
del df_drop['x']

In [None]:
df_drop

## `.drop()`

Can drop one or more columns/rows.

Takes an `axis` parameter:
- axis=0 refers to rows  
- axis=1 refers to columns  
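Here is a minimal sketch of both axes on a hypothetical dataframe `frame` (different data from the cells that follow). The `columns=` keyword is an equivalent alternative to `axis=1`, and `.drop()` returns a new dataframe rather than changing the original.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

no_b = frame.drop("b", axis=1)         # drop a column by label
no_b_kw = frame.drop(columns=["b"])    # same result using the keyword form
no_row1 = frame.drop(1, axis=0)        # drop the row with index label 1

print(no_b.columns.tolist())    # ['a', 'c']
print(no_row1.index.tolist())   # [0, 2]
print(frame.shape)              # (3, 3): the original is unchanged
```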

In [None]:
df_drop =                                      # drop x_plus_y and is_label columns

df_drop

In [None]:
df

In [None]:
df_dropp =    # drop   first and third  rows
df_dropp