# Lecture 7 – Table Fundamentals

### Spark 010, Spring 2024

In [None]:
# Just run me
import numpy as np

### String Containment

Another useful Python **operator** is the `in` keyword, which checks if the first value is present within the second value:

In [None]:
'berkeley' in 'uc berkeley'

In [None]:
'stanford' in 'uc berkeley'

In [None]:
'berkeley' in 'UC BERKELEY'
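The last check above evaluates to `False` because `in` compares characters exactly, so uppercase and lowercase letters do not match. Here is a minimal sketch of a case-insensitive check, assuming we are happy to lowercase both strings first. The variable names are illustrative only.

```python
# `in` compares characters exactly, so case matters
needle = 'berkeley'
haystack = 'UC BERKELEY'

exact_match = needle in haystack                   # no match: 'b' is not 'B'
ignore_case = needle.lower() in haystack.lower()   # compare after lowercasing both

print(exact_match, ignore_case)
```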

## Boolean Indexing

Comparing an array to a value produces a Boolean array. We can use that Boolean array to index the entries of the original array where the condition returned `True`.

In [None]:
hs_or_higher = np.array([86.9, 83.9, 88.5, 87.2, 84.4])
above_85_hs = hs_or_higher > 85
above_85_hs

In [None]:
hs_or_higher[above_85_hs]

Or we can index the entries which returned `False`.

In [None]:
hs_or_higher[~above_85_hs]
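Two related patterns are counting how many entries satisfy a condition and combining two conditions. The sketch below reuses the same values. The name `between_85_and_88` and the upper cutoff of 88 are made up for illustration.

```python
import numpy as np

# Same values as the example above
hs_or_higher = np.array([86.9, 83.9, 88.5, 87.2, 84.4])

above_85_hs = hs_or_higher > 85     # one True/False per entry
num_above = np.sum(above_85_hs)     # True counts as 1, False as 0

# Combine two conditions element-wise with & (each side needs parentheses)
between_85_and_88 = hs_or_higher[(hs_or_higher > 85) & (hs_or_higher < 88)]

print(num_above)
print(between_85_and_88)
```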

## Introduction to DataFrames

DataFrames (or tables) allow us to organize data in a systematic and easy-to-work-with way. Each table consists of **columns**, which represent variables, and **rows**, which represent one individual or observation.

Most of our datasets will be stored in `.csv` files (CSV stands for "Comma Separated Values"), which we will _import_ into our notebook using the `pd.read_csv(...)` function.

First, we import the library.

In [None]:
import pandas as pd

We can load the same dataset of California public universities from the first lecture by passing in the _filepath_ string that says where our `.csv` file sits in our computer's folder structure. (Don't worry, you don't need to know how this works.)

In [None]:
schools = pd.read_csv('data/cal_unis.csv')
schools.head()

One of the first things we often want to know about a table is how big it is. We can use the `len` function we already learned about to get the number of rows, or we can use the `.shape` attribute to get the table's dimensions as a (rows, columns) pair.

In [None]:
schools.shape[0] # Find the number of rows in the `schools` table

In [None]:
schools.shape[1] # Find the number of columns in the `schools` table

In [None]:
len(schools)

In [None]:
schools.shape
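To see how `len` and `.shape` relate without needing the data file, here is a small sketch on a made-up table. `toy` and its values are hypothetical stand-ins, not the real `cal_unis.csv` data.

```python
import pandas as pd

# Hypothetical stand-in table (not the real dataset)
toy = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Enrollment': [100, 200, 300]})

n_rows, n_cols = toy.shape   # .shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy) == n_rows)    # len counts rows only
```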

### Accessing the first few Rows

We will take a subset of the first five schools in the table for illustration purposes.

In [None]:
some_schools = schools.head(5) # Take the first five rows of the table
some_schools

In [None]:
other_schools = schools.tail(3) # Take the last three rows of the table
other_schools

Each column in a DataFrame behaves like an **array**, which is useful when we want to perform arithmetic on entire columns. We can extract a particular column with `df.loc[...]`. Note that when we talk about DataFrame methods in the `pandas` library, we will use `df` to refer to the name of a general DataFrame. When using these methods, remember to replace `df` with the name of the table you're working with.

### Accessing Columns and Rows

In [None]:
some_schools.loc[:,'City']

In [None]:
some_schools.iloc[:,3]

In [None]:
some_schools.loc[2,:]
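The three cells above use two different kinds of lookup: `.loc` selects by label, while `.iloc` selects by integer position (counting from 0). Here is a minimal sketch on a made-up table. `toy` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in table
toy = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'City': ['X', 'Y', 'Z'],
                    'Enrollment': [100, 200, 300]})

by_label = toy.loc[:, 'City']     # column chosen by its label
by_position = toy.iloc[:, 1]      # column chosen by position (0-based)
row_1 = toy.loc[1, :]             # row whose index label is 1

print(by_label.equals(by_position))
print(row_1['Name'])
```

In this toy table the row labels are the default 0, 1, 2, so the row label 1 happens to also be the second row by position.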

### Quick Check 1

In [None]:
states = pd.read_csv('data/us-state-capitals.csv')

In [None]:
states.head()

What should we pass into `.loc[]` in order to get the latitudes of each state capital as an array?

In [None]:
states.loc[...].values # Replace the three dots with your answer

## `.loc` and `drop`

A common workflow when working with tables is to **import** the table, **identify** relevant columns, and then make a **new table** with only the columns we want to work with. `.loc[]` and the `.drop()` table method allow us to do just that. Notice how both achieve the same result, just by slightly different means.

In [None]:
some_schools

What if we only want to display the columns `Name` and `Enrollment`?

In [None]:
some_schools.loc[:,['Name', 'Enrollment']] # Select only the columns 'Name' and 'Enrollment'

We can also do this by specifying the column labels we *don't* want to display.

In [None]:
some_schools.drop(columns = ['Founded', 'County', 'Institution', 'City']) # Drop columns so that you are left with only 'Name' and 'Enrollment'

**Remember** that _the above_ table methods return a **new table**, so the original `some_schools` table is not modified!

In [None]:
some_schools
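To make both points concrete (same result, original untouched), here is a small sketch on a made-up table. `toy`, `kept`, and `dropped` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in table
toy = pd.DataFrame({'Name': ['A', 'B'],
                    'City': ['X', 'Y'],
                    'Enrollment': [100, 200]})

kept = toy.loc[:, ['Name', 'Enrollment']]   # say which columns to keep
dropped = toy.drop(columns=['City'])        # say which columns to remove

print(kept.equals(dropped))    # both approaches give the same table
print(list(toy.columns))       # the original still has all three columns
```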

## Adding Columns

Another thing we might want to do with a table is add columns that provide additional information. We can use the `df.insert()` method to add columns to an existing table.

In [None]:
some_schools

In [None]:
# Add a column with the nicknames for each of the five schools (Cal, UCD, UCI, UCLA, UCM)
some_schools.insert(1,'Nickname',['Cal', 'UCD', 'UCI', 'UCLA', 'UCM'])
some_schools

In [None]:
# Add a column to `some_schools` with how old each school is
some_schools.insert(5,'Years Old', 2022 - some_schools.loc[:,'Founded'])

Notice that the method `df.insert()` *does* change the table.

In [None]:
some_schools
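The in-place behavior can be seen directly in a sketch on a made-up table (`toy` and its values are hypothetical): `insert` hands back nothing, and the change shows up in the table itself.

```python
import pandas as pd

# Hypothetical stand-in table
toy = pd.DataFrame({'Name': ['A', 'B'],
                    'Founded': [1900, 1950]})

# insert(position, column label, values) changes `toy` itself and returns None
result = toy.insert(1, 'Nickname', ['a', 'b'])

print(result)
print(list(toy.columns))
```

One practical consequence in a notebook: running the same `insert` cell a second time raises an error by default, because a column with that label already exists.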

### Creating tables from scratch

We can also use `pd.DataFrame()` to make an entirely new table from scratch.

In [None]:
pd.DataFrame()

In [None]:
type(pd.DataFrame())

In [None]:
states = pd.DataFrame({'State':['California', 'New York', 'Florida', 'Texas', 'Pennsylvania'],
                       'Code':['CA', 'NY', 'FL', 'TX', 'PA'],
                       'Population':[39.3, 19.3, 21.7, 29.3, 12.8]})
states

### Quick Check 2

Given the table `states`, fill in the blanks in the two cells below to create a new table that matches the following table:

| State | Code | FedVote |
| --- | --- | --- |
| California | CA | D |
| New York | NY | D |
| Florida | FL | R |
| Texas | TX | R |
| Pennsylvania | PA | D |

In [None]:
# Fill in the ... to drop the appropriate column
states = states.drop(columns = ...)
states

In [None]:
# Fill in the ... to insert the appropriate column
states.insert(...)
states