### <font color="brown">Pandas - Continued</font>

In [1]:
from pandas import Series
from pandas import DataFrame
import numpy as np
import pandas as pd

---

#### <font color="brown">Creating DataFrames - Continued</font>

**3. Creating a DataFrame from a 2D NumPy array**

In [2]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

Unnamed: 0,0,1
0,0.540759,0.863442
1,0.734473,0.825567
2,0.245405,0.291765


**Change index and column names**

In [5]:
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

Unnamed: 0,first,second
one,0.540759,0.863442
two,0.734473,0.825567
three,0.245405,0.291765


**Or set them up at creation time**

In [6]:
randdf = DataFrame(rand2d, index=['one', 'two', 'three'],
                   columns = ['first', 'second'])
randdf

Unnamed: 0,first,second
one,0.540759,0.863442
two,0.734473,0.825567
three,0.245405,0.291765


---

#### <font color="brown">Columns</font>

**Membership**

In [7]:
popdat = {'state': ['Arizona','Arizona','Arizona','Virginia','Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat)
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


In [8]:
'debt' in popdf.columns

False

**Each column is a Series**

**A column can be referenced by using its name as an index into the DataFrame**

In [9]:
print(popdf['state'])
print(popdf['state'].name)
print(popdf['state'].values)
print(popdf['state'].index)

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object
state
['Arizona' 'Arizona' 'Arizona' 'Virginia' 'Virginia']
RangeIndex(start=0, stop=5, step=1)


**Alternatively, a column can be referenced as an attribute of the dataframe**

In [10]:
popdf.state

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object
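
Attribute access is convenient, but it has a limit worth knowing. As a minimal sketch (the two-row frame below is made up for illustration), a column named `pop` collides with the existing `DataFrame.pop` method, so `df.pop` gives the method rather than the column. Bracket indexing always gives the column.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustration
df = pd.DataFrame({'state': ['Arizona', 'Virginia'], 'pop': [6.6, 7.9]})

state_col = df.state   # works: 'state' does not clash with any DataFrame attribute
pop_attr = df.pop      # the bound method DataFrame.pop, NOT the 'pop' column
pop_col = df['pop']    # bracket indexing always returns the column

print(callable(pop_attr))
print(pop_col.tolist())
```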

**A subset of columns can be selected with a list of column names, similar to indexing an ndarray or a Series with a list**

In [11]:
popdf[['state','pop']]

Unnamed: 0,state,pop
0,Arizona,5.9
1,Arizona,6.6
2,Arizona,6.8
3,Virginia,7.9
4,Virginia,8.3
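
The difference between a single name and a list of names matters even for one column. The sketch below, using a small made-up frame, suggests that a single name returns a Series while a one-element list returns a DataFrame.

```python
import pandas as pd

# Hypothetical frame, only for illustration
df = pd.DataFrame({'state': ['Arizona', 'Virginia'],
                   'year': [2010, 2015],
                   'pop': [6.6, 8.3]})

as_series = df['pop']     # single name: a Series
as_frame = df[['pop']]    # list with one name: a DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```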


**Changing column names**

In [12]:
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


In [13]:
popdf.columns = ['year','state','pop']
popdf

Unnamed: 0,year,state,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


<font color="red">**Warning: Changing column names assigns new names, does NOT rearrange!**</font>
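
To actually rearrange columns, or to rename just one of them, there are separate operations. The following is a minimal sketch on a made-up frame: selecting with a list of names reorders the data together with the names, and `rename` changes only the names listed in its mapping. The new name `population` is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical frame, only for illustration
df = pd.DataFrame({'state': ['Arizona', 'Virginia'],
                   'year': [2010, 2015],
                   'pop': [6.6, 8.3]})

# Reorder: the data moves along with the column names
reordered = df[['year', 'state', 'pop']]

# Rename a single column; returns a new DataFrame, df itself is unchanged
renamed = df.rename(columns={'pop': 'population'})

print(reordered)
print(renamed)
```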

In [14]:
# restore to original
popdf.columns = ['state','year','pop']
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


---

#### <font color="brown">Indexing and Manipulating rows and columns</font>

**Row indexing by label, using loc (here the default index labels happen to be the integers 0 to 4)**

In [15]:
popdf.loc[1]

state    Arizona
year        2010
pop          6.6
Name: 1, dtype: object

**Row of a DataFrame is a Series**

In [16]:
print(popdf.loc[1].name)
print(popdf.loc[1].values)

1
['Arizona' 2010 6.6]


**Range of rows**

In [17]:
popdf.loc[1:3]

Unnamed: 0,state,year,pop
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9


**<font color="red">Note above, end value of range of rows is INCLUSIVE!</font>**
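
For comparison, here is a small sketch (with a made-up single-column frame) of the same slice written with `loc` and with `iloc`. With `loc` the slice is over labels and includes the end label; with `iloc` it is over positions and excludes the end, as in ordinary Python slicing.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4
df = pd.DataFrame({'x': [10, 20, 30, 40, 50]})

by_label = df.loc[1:3]      # labels 1, 2, 3 (end included)
by_position = df.iloc[1:3]  # positions 1, 2 (end excluded)

print(len(by_label), len(by_position))
```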

**Subset of rows, subset of columns**

In [18]:
popdf.loc[[0,2],['state','pop']]  

Unnamed: 0,state,pop
0,Arizona,5.9
2,Arizona,6.8


**Adding a column**

In [None]:
# assign same value to all rows in the column
popdf['debt'] = 1.5
popdf

In [None]:
# Assign different value for each row
popdf['debt'] = np.arange(1,6)
popdf

In [None]:
popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
popdf2

In [None]:
# Different value for each row
popdf2['NJ'] = [8.2, 8.4, 8.6]
popdf2

**What if fewer values are assigned than there are rows?**

In [None]:
debts = Series([1.2, 1.5, 1.7])
popdf['debt'] = debts
popdf

**The Series is aligned on its index (0, 1, 2); rows with no matching label are filled with NaN**
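
The padding comes from index alignment, not from position. As a rough sketch on a made-up frame: a Series whose labels are 0, 2, 4 fills exactly those rows, whereas a plain list of the wrong length has no index to align on and is rejected.

```python
import pandas as pd

# Hypothetical five-row frame, only for illustration
df = pd.DataFrame({'state': ['a', 'b', 'c', 'd', 'e']})

# Values land on the rows whose labels match; the rest become NaN
df['debt'] = pd.Series([1.2, 1.5, 1.7], index=[0, 2, 4])
print(df)

# A plain list must match the number of rows exactly
try:
    df['debt2'] = [1.2, 1.5, 1.7]
    list_error = None
except ValueError as err:
    list_error = err
print(list_error)
```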

**Creating a new column with values as a function of the other columns**

In [None]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

In [None]:
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

In [None]:
randdf['third'] = randdf['first'] > randdf['second']
randdf

**Row (index) membership**

In [None]:
'three' in randdf.index

In [None]:
randdf['three']   # raises KeyError: 'three' is a row label, not a column name

**<font color="red">Above syntax of dataframe['name'] can only be used with column names</font>**

**Row indexing by labels, using loc**

In [None]:
randdf.loc['two']

In [None]:
randdf.loc['two':'three']

In [None]:
randdf.loc[1:2]   # raises an error: integer slice on a string-labeled index

**<font color="red">Can't use numeric indexes here because dataframe is indexed by string labels</font>**
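
When rows have string labels, positions are still available through `iloc`. The sketch below uses a made-up frame and catches the error from the integer `loc` slice; the exact exception type is assumed to be `TypeError` in current pandas, so `KeyError` is caught as well to be safe.

```python
import pandas as pd

# Hypothetical frame indexed by string labels
df = pd.DataFrame({'first': [0.1, 0.2, 0.3]}, index=['one', 'two', 'three'])

by_label = df.loc['two':'three']   # label slice, end included
by_position = df.iloc[1:3]         # position slice, end excluded

try:
    df.loc[1:2]                    # integers are not valid labels here
    failed = False
except (TypeError, KeyError):
    failed = True

print(by_label.equals(by_position), failed)
```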

---

**Adding a row using loc**

In [None]:
popdf2

In [None]:
popdf2.loc[2020] = [7.2, 8.6, 8.9]
popdf2

In [None]:
popdf2.loc[[2010,2020]]

**Deleting a column with del operation**

In [None]:
popdf

In [None]:
del popdf['debt']
popdf

In [None]:
randdf

In [None]:
del randdf['second']
randdf
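
Note that `del` modifies the DataFrame in place. If the original should be kept, one alternative is the `drop` method, which returns a new DataFrame. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame, only for illustration
df = pd.DataFrame({'first': [1, 2], 'second': [3, 4]})

trimmed = df.drop(columns=['second'])  # new DataFrame without 'second'

print(list(df.columns))       # original still has both columns
print(list(trimmed.columns))
```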

---

#### <font color="brown">Indexing a DataFrame with iloc (using integer indices)</font>

In [None]:
popdf

In [None]:
popdf.loc[1]

In [None]:
popdf.loc[1,'year']

**Using iloc**

In [None]:
popdf.iloc[1,0]   # use index for rows and columns

In [None]:
popdf.iloc[1:4]   # slice rows

In [None]:
popdf.iloc[[1,2,3]]  # list of row indexes

In [None]:
popdf.iloc[:,'state']   # raises an error: iloc does not accept labels

**<font color="red">With iloc you can only use integer indexes for rows and columns</font>**

In [None]:
popdf.iloc[:,0]

*same as*

In [None]:
popdf['state']
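
If only the column name is known but `iloc` is needed, the position can be looked up first. This is a small sketch on a made-up frame, assuming column names are unique (in which case `Index.get_loc` returns a single integer).

```python
import pandas as pd

# Hypothetical frame, only for illustration
df = pd.DataFrame({'state': ['Arizona', 'Virginia'],
                   'year': [2010, 2015]})

year_pos = df.columns.get_loc('year')  # integer position of the 'year' column
value = df.iloc[1, year_pos]           # row position 1, column position year_pos

print(year_pos, value)
```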

In [None]:
popdf2

In [None]:
popdf2.iloc[2:,[0,2]]

---

#### <font color="brown">Creating a DataFrame from a CSV file (typical usage)</font>

**Using the Pandas method read_csv**

In [4]:
# read_csv accepts a file path directly and handles opening and closing the file
mpgs = pd.read_csv("auto_mpg_original.csv")
mpgs

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
1,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
2,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
3,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
4,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino
...,...,...,...,...,...,...,...,...,...
401,27.0,4.0,140.0,86.0,2790.0,15.6,82.0,1.0,ford mustang gl
402,44.0,4.0,97.0,52.0,2130.0,24.6,82.0,2.0,vw pickup
403,32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage
404,28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger


In [None]:
mpgs.shape

**Note: NA entries in the file are read in as NaN, the pandas marker for a missing/null value.**
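
To see this without the data file, here is a self-contained sketch that reads a tiny made-up CSV from a string. The contents are invented for illustration; the point is that the token `NA` is among the strings `read_csv` treats as missing by default, and the column still comes out as a float column.

```python
import io
import pandas as pd

# Hypothetical CSV text, only for illustration
csv_text = "mpg,horsepower\n18.0,130.0\nNA,165.0\n15.0,NA\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)                    # both columns are float64 despite the NA tokens
print(df['mpg'].isnull().tolist())
```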

In [None]:
# first 15 rows
mpgs.head(15)

**Metadata - descriptive information - of dataframe**

In [None]:
mpgs.info()

**In the info above, note that each numeric column now has an inferred numeric datatype, not object.<br>
Also note the number of non-null values per column. For instance, mpg has 8 missing values, and horsepower has 6 missing values.**
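
The missing-value counts can also be computed directly instead of being read off the `info()` output. A minimal sketch on a made-up frame: `isnull().sum()` counts the missing values per column, and `count()` gives the non-null counts.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
df = pd.DataFrame({'mpg': [18.0, np.nan, 15.0, np.nan],
                   'horsepower': [130.0, 165.0, np.nan, 150.0]})

missing = df.isnull().sum()  # number of NaN values in each column
present = df.count()         # number of non-null values in each column

print(missing)
print(present)
```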

##### <font color="brown">Get all rows for which mpg column has a null value</font>

In [None]:
mpgs[mpgs['mpg'].isnull()]

##### <font color="brown">Get all rows for which horsepower column has a null value</font>

In [None]:
mpgs[mpgs['horsepower'].isnull()]

##### <font color="brown">Get summary stats for numeric columns</font>
**describe** method

In [None]:
mpgs.describe()

---

#### <font color="brown">Numpy ufuncs work with DataFrames</font>

In [None]:
df = DataFrame(np.random.randn(4,3),columns=list("ABC"),index=["One","Two","Three",'Four'])
df

In [None]:
np.abs(df)  

In [None]:
df  # is original changed?

**Original is not changed**

**Alternatively, use the DataFrame method abs(); this does not change the original df either**

In [None]:
df.abs()

In [None]:
df

**Assign the result to a name to keep it**

In [None]:
dfabs = df.abs()
dfabs

In [None]:
dfabs.mean()

**<font color="red">Note: default axis is 0, so above gets column means</font>**
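
Because the frames above hold random numbers, the effect of `axis` is easier to check on fixed values. A small sketch with a made-up 2x2 frame:

```python
import pandas as pd

# Hypothetical frame with easy-to-check values
df = pd.DataFrame({'A': [1.0, 3.0], 'B': [2.0, 6.0]}, index=['one', 'two'])

col_means = df.mean()        # axis=0: one mean per column
row_means = df.mean(axis=1)  # axis=1: one mean per row

print(col_means)
print(row_means)
```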

In [None]:
dfabs.mean(axis=1)  # row means

In [None]:
dfabs.cumsum(axis=1)   # cumulative sums of rows

In [None]:
dfabs.sum()     # sum of each column

##### **What if there are NaN values?**

In [None]:
dfabs2 = dfabs.copy()
dfabs2

In [None]:
dfabs2.iloc[1,1] = np.nan
dfabs2

In [None]:
dfabs2['B'].sum()   

**NaNs are skipped when summing**<br>
**To propagate them instead, set the skipna parameter to False; any column containing a NaN then yields NaN**
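
The two behaviours can be compared side by side on fixed values. A minimal sketch with a made-up column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({'B': [1.0, np.nan, 3.0]})

total_skip = df['B'].sum()              # NaN ignored: 1.0 + 3.0
mean_skip = df['B'].mean()              # NaN ignored: mean of the two remaining values
total_keep = df['B'].sum(skipna=False)  # NaN propagates into the result

print(total_skip, mean_skip, total_keep)
```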

In [None]:
dfabs2.mean(skipna=False)  

In [None]:
dfabs

In [None]:
dfabs['C'].argmax()

In [None]:
dfabs.loc['Three'].argmax()
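
In current pandas versions, `argmax` on a Series gives the integer position of the largest value, not its label. If the label is wanted, `idxmax` is the method to use. A short sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical Series with string labels
s = pd.Series([0.2, 1.7, 0.9], index=['One', 'Two', 'Three'])

position = s.argmax()  # integer position of the maximum
label = s.idxmax()     # index label of the maximum

print(position, label)
```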