Title: "Python Workshop: Introduction to Python - Part IV"

Author: "Dr. Armin Hatefi"

Date: "Monday, January 16, 2023"

**Note:** This content is protected and may not be shared, uploaded, or distributed.

In [2]:
import numpy as np
import pandas as pd

## DataFrame
A `DataFrame` represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). 

In [3]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2
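
As a small illustration of the "different value type per column" point, the sketch below rebuilds the same `frame` and inspects its `dtypes` and `shape` attributes. The exact dtype reported for the string column can vary between pandas versions, so no output is shown for it.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# Each column keeps its own dtype: strings, integers, and floats side by side
print(frame.dtypes)

# shape is (number of rows, number of columns)
print(frame.shape)
```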


In [4]:
frame.head()

,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [5]:
pd.DataFrame(data, columns=["year", "state", "pop"])

,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


If you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result:

In [6]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,
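
One way to confirm that the new column really holds missing values is to count them with `isna()`. This is a minimal sketch that rebuilds `frame2` from the same dictionary; the variable names `missing_per_column` and `all_missing` are only illustrative.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

# Count missing entries per column; only "debt" should have any
missing_per_column = frame2.isna().sum()
print(missing_per_column)

# True when every entry of the "debt" column is missing
all_missing = bool(frame2["debt"].isna().all())
print(all_missing)
```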


In [7]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [8]:
frame2["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [9]:
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

**Note**: `frame2[column]` works for any column name, but `frame2.column` works only when the column name is a valid Python variable name and does not conflict with any DataFrame method or attribute name. For example, if a column's name contains whitespace or symbols other than underscores, it cannot be accessed with dot attribute notation.
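
The difference can be seen with a small, made-up DataFrame. The column names `"unit price"` and `"count"` below are hypothetical and chosen only because one contains whitespace and the other clashes with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical columns: one with whitespace, one that shadows a method name
df = pd.DataFrame({"unit price": [1.0, 2.5],
                   "count": [3, 4],
                   "year": [2001, 2002]})

bracket_price = df["unit price"]  # brackets accept any column name
bracket_count = df["count"]       # brackets return the column
dot_count = df.count              # dot access finds the method, not the column
dot_year = df.year                # valid identifier with no clash: same as df["year"]

print(callable(dot_count))
print(dot_year.equals(df["year"]))
```

Because of cases like `df.count`, bracket access is the safer default in scripts, while dot access is mostly a convenience for interactive work.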

In [10]:
frame2["debt"] = np.arange(6.)
frame2

,year,state,pop,debt
0,2000,Ohio,1.5,0.0
1,2001,Ohio,1.7,1.0
2,2002,Ohio,3.6,2.0
3,2001,Nevada,2.4,3.0
4,2002,Nevada,2.9,4.0
5,2003,Nevada,3.2,5.0


In [11]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

,year,state,pop,debt,eastern
0,2000,Ohio,1.5,0.0,True
1,2001,Ohio,1.7,1.0,True
2,2002,Ohio,3.6,2.0,True
3,2001,Nevada,2.4,3.0,False
4,2002,Nevada,2.9,4.0,False
5,2003,Nevada,3.2,5.0,False


In [12]:
del frame2["eastern"]
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

## Reindexing Functionality 

In [13]:
index = frame2.index
index[:3]

RangeIndex(start=0, stop=3, step=1)

Index objects are immutable and thus can’t be modified by the user:

In [14]:
index[1] = 'a'

TypeError: Index does not support mutable operations

In [15]:
frame2.columns[1] = 'State2'

TypeError: Index does not support mutable operations
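
Since an Index cannot be changed element by element, relabeling is usually done by building a new object. The sketch below uses a shortened, two-row version of `frame2` and the `rename` method; the `mutated` flag is only there to record whether the in-place assignment succeeded.

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Ohio"]})

# Assigning into an Index raises TypeError
try:
    frame2.columns[1] = "State2"
    mutated = True
except TypeError:
    mutated = False

# rename returns a new DataFrame with relabeled columns; frame2 is unchanged
renamed = frame2.rename(columns={"state": "State2"})

print(mutated)
print(list(frame2.columns))
print(list(renamed.columns))
```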

In [16]:
'year' in frame2.columns

True

In [17]:
2 in frame2.index

True

An important method on pandas objects is `reindex`, which creates a new object with the values rearranged to align with the new index. Labels that are not present in the original index are filled with missing values.

In [18]:
frame2.reindex([1,0,3,4,5,6])

,year,state,pop,debt
1,2001.0,Ohio,1.7,1.0
0,2000.0,Ohio,1.5,0.0
3,2001.0,Nevada,2.4,3.0
4,2002.0,Nevada,2.9,4.0
5,2003.0,Nevada,3.2,5.0
6,,,,
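
The `year` values above print as floats because the new label introduces missing values, which an integer column cannot hold by default. The following sketch, on a shortened numeric-only frame, shows that effect and the `fill_value` argument of `reindex`, which supplies a value for new labels instead of leaving them missing.

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]})

# Label 5 does not exist, so its row is filled with missing values
reindexed = frame2.reindex([1, 0, 5])
print(reindexed)

# fill_value replaces the default missing value for new labels
filled = frame2.reindex([1, 0, 5], fill_value=0)
print(filled)

# The original frame keeps its own index
print(list(frame2.index))
```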


In [19]:
labels = ['state','year','pop','Debt']

In [20]:
frame2.reindex(columns=labels)

,state,year,pop,Debt
0,Ohio,2000,1.5,
1,Ohio,2001,1.7,
2,Ohio,2002,3.6,
3,Nevada,2001,2.4,
4,Nevada,2002,2.9,
5,Nevada,2003,3.2,


You can also reindex by using the `loc` indexer, and many users prefer to always do it this way. This works only if all of the new index labels already exist in the DataFrame (whereas `reindex` will insert missing data for new labels):

In [21]:
frame.loc[[2,1],['state','year','pop']]

,state,year,pop
2,Ohio,2002,3.6
1,Ohio,2001,1.7
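
The contrast between `loc` and `reindex` for labels that do not exist can be sketched as follows, using a shortened three-row frame. In recent pandas versions, `loc` raises a `KeyError` when any requested label is missing; the `raised` flag is only illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Ohio", "Ohio"],
                      "year": [2000, 2001, 2002]})

# All labels exist: loc selects and reorders rows and columns
subset = frame.loc[[2, 1], ["state", "year"]]
print(subset)

# Label 6 does not exist: loc refuses, reindex inserts a row of missing values
try:
    frame.loc[[2, 1, 6]]
    raised = False
except KeyError:
    raised = True
print(raised)

with_missing = frame.reindex([2, 1, 6])
print(with_missing)
```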


In [22]:
frame.drop(index=[2,4])

,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
3,Nevada,2001,2.4
5,Nevada,2003,3.2


In [23]:
frame.drop(columns=['year'])

,state,pop
0,Ohio,1.5
1,Ohio,1.7
2,Ohio,3.6
3,Nevada,2.4
4,Nevada,2.9
5,Nevada,3.2


In [24]:
frame.drop(['year'],axis=1)

,state,pop
0,Ohio,1.5
1,Ohio,1.7
2,Ohio,3.6
3,Nevada,2.4
4,Nevada,2.9
5,Nevada,3.2
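
The two `drop` calls above produce the same result, and neither changes `frame` itself. A minimal check on a shortened two-row frame (the names `by_keyword` and `by_axis` are only illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

by_keyword = frame.drop(columns=["year"])  # name the axis with a keyword
by_axis = frame.drop(["year"], axis=1)     # same request via the axis argument

print(by_keyword.equals(by_axis))

# drop returns a new object, so the original still has the "year" column
print("year" in frame.columns)
```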


In [25]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
frame

,b,d,e
Utah,-1.532584,-0.762912,1.16924
Ohio,0.725733,1.466095,-1.164577
Texas,0.35471,0.219619,-0.927728
Oregon,-0.501022,0.417725,-0.220918


In [26]:
def frange(x):
    return x.max() - x.min()

In [27]:
frame.apply(frange)

b    2.258317
d    2.229007
e    2.333818
dtype: float64

If you pass `axis="columns"` to `apply`, the function will be invoked once per row instead. A helpful way to think about this is as "apply across the columns":

In [28]:
frame.apply(frange, axis="columns")

Utah      2.701824
Ohio      2.630673
Texas     1.282438
Oregon    0.918747
dtype: float64
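
Because the frame above holds random numbers, the two directions are easier to verify by hand on fixed values. The sketch below uses a small made-up frame with the same column labels; the numbers are chosen only so that the ranges can be computed mentally.

```python
import pandas as pd

# Made-up values so the ranges can be checked by hand
frame = pd.DataFrame([[1.0, 5.0, 2.0],
                      [4.0, 0.0, 9.0]],
                     columns=list("bde"),
                     index=["Utah", "Ohio"])

def frange(x):
    return x.max() - x.min()

# Default: one call per column, result indexed by column label
per_column = frame.apply(frange)       # b: 4-1, d: 5-0, e: 9-2
print(per_column)

# axis="columns": one call per row, result indexed by row label
per_row = frame.apply(frange, axis="columns")  # Utah: 5-1, Ohio: 9-0
print(per_row)
```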

Many of the most common array statistics (like `sum` and `mean`) are DataFrame methods, so `apply` is not needed for them.
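
For instance, the range computed earlier with `apply` can also be written directly with the built-in `max` and `min` methods. This is a sketch on a small made-up frame, not the random frame from the cell above.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, 5.0, 2.0],
                      [4.0, 0.0, 9.0]],
                     columns=list("bde"),
                     index=["Utah", "Ohio"])

# Column-wise range using built-in reductions instead of apply
col_range = frame.max() - frame.min()
print(col_range)

# Reductions accept an axis argument as well: one mean per row
row_mean = frame.mean(axis="columns")
print(row_mean)
```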

The function passed to `apply` need not return a scalar value; it can also return a Series with multiple values:

In [29]:
def f2(x):
    return pd.Series([x.min(), x.max(),x.mean()], index=["min", "max","mean"])

In [30]:
frame.apply(f2)

,b,d,e
min,-1.532584,-0.762912,-1.164577
max,0.725733,1.466095,1.16924
mean,-0.238291,0.335132,-0.285996
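
The same Series-returning function can be applied row by row. In that case the labels of the returned Series become the columns of the result. The following sketch uses a small made-up frame so the values are easy to check.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, 5.0, 2.0],
                      [4.0, 0.0, 9.0]],
                     columns=list("bde"),
                     index=["Utah", "Ohio"])

def f2(x):
    return pd.Series([x.min(), x.max(), x.mean()],
                     index=["min", "max", "mean"])

# Per column: rows of the result are "min", "max", "mean"
by_column = frame.apply(f2)
print(by_column)

# Per row: the Series labels become the columns of the result
by_row = frame.apply(f2, axis="columns")
print(by_row)
```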


In [31]:
frame.describe()

,b,d,e
count,4.0,4.0,4.0
mean,-0.238291,0.335132,-0.285996
std,1.004194,0.913773,1.04971
min,-1.532584,-0.762912,-1.164577
25%,-0.758913,-0.026013,-0.98694
50%,-0.073156,0.318672,-0.574323
75%,0.447466,0.679817,0.126622
max,0.725733,1.466095,1.16924


In [32]:
frame.quantile(q=[0.025,0.5,0.975])

,b,d,e
0.025,-1.455217,-0.689222,-1.146814
0.5,-0.073156,0.318672,-0.574323
0.975,0.697907,1.387467,1.064978
