# Lecture 8 – Table Fundamentals and Visualization

### Spark 010, Spring 2024

In [3]:
# Let's import numpy and pandas
import numpy as np
import pandas as pd

## DataFrames

DataFrames (or tables) allow us to organize data in a systematic and easy-to-work-with way. Each table consists of **columns**, which represent variables, and **rows**, which represent one individual or observation.

Most of our datasets will be stored in `.csv` files (CSV stands for "Comma Separated Values"), which we will _import_ into our notebook using the `pd.read_csv(...)` function.

We can load the same dataset of California public universities from the first lecture by passing in the _filepath_ string that says where our `.csv` file lives in our computer's folder structure. (Don't worry, you don't need to know how this works.)

In [3]:
schools = pd.read_csv('data/cal_unis.csv')
schools.head()

|  | Name | Institution | City | County | Enrollment | Founded |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | University of California, Berkeley | UC | Berkeley | Alameda | 42519 | 1869 |
| 1 | University of California, Davis | UC | Davis | Yolo | 39152 | 1905 |
| 2 | University of California, Irvine | UC | Irvine | Orange | 35220 | 1965 |
| 3 | University of California, Los Angeles | UC | Los Angeles | Los Angeles | 45428 | 1882 |
| 4 | University of California, Merced | UC | Merced | Merced | 8544 | 2005 |
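
To see what `pd.read_csv(...)` does with the text inside a `.csv` file, here is a minimal sketch that reads CSV text from memory instead of from disk. The names `csv_text` and `demo` are made up for this illustration, and the two data rows are copied from the output above; the real `data/cal_unis.csv` file may contain more columns or rows.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
# The first line is the header; quotes protect the comma inside each name.
csv_text = '''Name,Institution,City,County,Enrollment,Founded
"University of California, Berkeley",UC,Berkeley,Alameda,42519,1869
"University of California, Davis",UC,Davis,Yolo,39152,1905
'''

# pd.read_csv accepts a file path or a file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)          # (number of rows, number of columns)
print(list(demo.columns))  # column names taken from the header line
print(demo.head(1))        # only the first row
```

Each line of text becomes one row, and each comma-separated value lands in the column named by the header line.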


### Creating DataFrames From Scratch

We can also use `pd.DataFrame()` to make an entirely new table from scratch.

In [4]:
pd.DataFrame()

In [5]:
type(pd.DataFrame())

pandas.core.frame.DataFrame

In [6]:
states = pd.DataFrame({'State':['California', 'New York', 'Florida', 'Texas', 'Pennsylvania'],
                       'Code':['CA', 'NY', 'FL', 'TX', 'PA'],
                       'Population':[39.3, 19.3, 21.7, 29.3, 12.8]})
states

|  | State | Code | Population |
| --- | --- | --- | --- |
| 0 | California | CA | 39.3 |
| 1 | New York | NY | 19.3 |
| 2 | Florida | FL | 21.7 |
| 3 | Texas | TX | 29.3 |
| 4 | Pennsylvania | PA | 12.8 |
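
The dictionary above describes the table one column at a time. As a rough sketch of an alternative, `pd.DataFrame` also accepts a list of dictionaries, one per row. The names `rows` and `states_demo` are invented for this example, and only three of the states are used to keep it short.

```python
import pandas as pd

# The same kind of data, written row by row instead of column by column
rows = [
    {'State': 'California', 'Code': 'CA', 'Population': 39.3},
    {'State': 'New York', 'Code': 'NY', 'Population': 19.3},
    {'State': 'Florida', 'Code': 'FL', 'Population': 21.7},
]
states_demo = pd.DataFrame(rows)

print(states_demo.shape)                # (rows, columns)
print(states_demo.dtypes)               # the type pandas inferred for each column
print(states_demo['Population'].max())  # largest value in one column
```

Either way of writing the data should produce the same kind of table; which one is more convenient depends on whether your data arrives as columns or as individual records.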


### Quick Check 1

Given the table `states`, fill in the blanks in the two cells below to create a new table that matches the following table:

| State | Code | FedVote |
| --- | --- | --- |
| California | CA | D |
| New York | NY | D |
| Florida | FL | R |
| Texas | TX | R |
| Pennsylvania | PA | D |

In [None]:
# Fill in the ... to drop the appropriate column
states = states.drop(columns = ...)
states

In [None]:
# Fill in the ... to insert the appropriate column
states.insert(...)
states
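
For reference, here is a small sketch of how `drop` and `insert` behave on an unrelated table. The `fruits` table and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical table, unrelated to `states`
fruits = pd.DataFrame({'Fruit': ['apple', 'banana', 'cherry'],
                       'Color': ['red', 'yellow', 'red'],
                       'Price': [0.5, 0.25, 3.0]})

# drop returns a new table; `fruits` itself is unchanged unless we reassign it
no_price = fruits.drop(columns='Price')

# insert changes the table in place and takes three arguments:
# the position of the new column, its name, and its values
no_price.insert(2, 'InSeason', [True, True, False])

print(no_price)
print(list(fruits.columns))  # the original table still has its Price column
```

Note the difference between the two methods: the result of `drop` has to be assigned to a name to be kept, while `insert` updates the table directly and returns nothing.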

### Variable Types

When we work with data, we have samples (rows in the DataFrame) and attributes (columns in the DataFrame).

The columns (attributes) usually have different types, which roughly correspond to the different types of variables in Python (or other programming languages).

- Numerical (aka Quantitative)
    - *Discrete*: Whole numbers that can be counted
    - *Continuous*: Floating point, decimal, measurements
- Categorical (aka Qualitative)
    - *Ordinal*: Categories with an inherent ordering (short, tall, grande, venti)
    - *Nominal*: Categories with no inherent ordering (color, shape)
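
One way to see these four types in pandas is sketched below. The `drinks` table and its values are invented for this illustration; the ordered category uses the drink sizes from the list above.

```python
import pandas as pd

# Hypothetical table with one column of each variable type
drinks = pd.DataFrame({
    'Shots': [1, 2, 3],                       # numerical, discrete
    'Ounces': [8.0, 12.0, 20.0],              # numerical, continuous
    'Size': ['short', 'tall', 'venti'],       # categorical, ordinal
    'Flavor': ['mocha', 'vanilla', 'mocha'],  # categorical, nominal
})

# Tell pandas that Size has an order: short < tall < grande < venti
drinks['Size'] = pd.Categorical(drinks['Size'],
                                categories=['short', 'tall', 'grande', 'venti'],
                                ordered=True)

# Flavor is a category too, but with no order
drinks['Flavor'] = drinks['Flavor'].astype('category')

print(drinks.dtypes)              # integer, float, and two category columns
print(drinks['Size'].max())       # the largest size in the column
print(drinks['Size'] < 'grande')  # comparisons only make sense for ordered categories
```

Pandas will not guess that a text column is ordinal, so the ordering has to be supplied by hand, as in the `pd.Categorical` call.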



### Quick Check 2

Determine the variable type of each of the following variables.

| Name | Type |
| --- | --- |
| Fuel economy (miles per gallon) | ... |
| Number of Semesters at UC Merced | ... |
| Class Standing (F, S, J, S) | ... |
| Income Bracket (Low, Med, High) | ... |
| Bank Account Number | ... |

### Quick Check 3

How many variables are being encoded in this plot? Explain your reasoning.

![alt text](data/curiousPlot.png "Title")



*Put your answer in this Cell*