# Video: Introduction to Pandas Data Frames

This video shows the basic usage of Pandas data frames.

## What to Look for with Pandas Data Frames

* Index - what row is this?
* Column names
* Column types

Script:
* We are going to take a quick look at a new kind of data structure in pandas called a data frame.
* Data frames are another data structure to hold data, and have some important differences from the NumPy arrays that we used recently.
* Three differences that I'd like to call out as we get started are indexes, column names, and column types.

## Code Example: Looking at Data Frames

In [None]:
import pandas as pd

Script:
* Like Matplotlib and NumPy, pandas has its own de facto import alias with a very short name: pd.
* Even the pandas documentation uses it in all the examples, so we will use it too.

In [None]:
fruits = pd.DataFrame(data={"name": ["apple", "orange", "grape"], "color": ["red", "orange", "purple"], "flavor": [3, 3, 4], "score": [3, 2, 4]})
fruits

,name,color,flavor,score
0,apple,red,3,3
1,orange,orange,3,2
2,grape,purple,4,4


Script:
* So I just made up some data to show off data frames.
* Pandas data frames have a much nicer presentation in Jupyter notebooks.
* Sorry NumPy.
* Don't worry about the details of making the data frame yet.
* For now, pay attention to what is different from just using a NumPy array.
* First off, there are column names along the top, and a bunch of numbers down the left side with no label above them.
* Those numbers on the left are the index values.
* We can use those to refer to particular rows.
* Let me take a moment to make this a little nicer.

In [None]:
fruits = pd.DataFrame(index=["apple", "orange", "grape"], data={"color": ["red", "orange", "purple"], "flavor": [3, 3, 4], "score": [3, 2, 4]})
fruits

,color,flavor,score
apple,red,3,3
orange,orange,3,2
grape,purple,4,4
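
Another way to get the same layout, sketched here as an illustration rather than as the approach used in the video, is to start from the first data frame and move the name column into the index with `set_index`. The variable names `fruits_default` and `fruits_by_name` are made up for this example.

```python
import pandas as pd

# Same made-up fruit data as above, with the names as an ordinary column.
fruits_default = pd.DataFrame(
    data={
        "name": ["apple", "orange", "grape"],
        "color": ["red", "orange", "purple"],
        "flavor": [3, 3, 4],
        "score": [3, 2, 4],
    }
)

# With no index given, pandas assigns sequential integers starting at 0.
print(fruits_default.index)

# set_index returns a new data frame with the "name" column moved to the index.
fruits_by_name = fruits_default.set_index("name")
print(fruits_by_name.index)
print(list(fruits_by_name.columns))
```

One small difference from passing `index=` directly: the index built by `set_index` keeps the label "name", so the displayed table shows that label above the index values.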


Script:
* I separated out the fruit names and made them the index.
* So now the index values read apple, orange, and grape instead of 0, 1, 2.
* Generally you should pick the index to match your mental model of how to identify a row.
* Here, I am thinking of the rows as kinds of fruit, not individual pieces of fruit.
* Let me know how I mis-scored your favorite fruit in Yellowdig.
* Index values are not required to be unique, but you'll probably want to keep them unique anyway to avoid confusion.
* You can look up rows by their index value using the loc attribute.

In [None]:
fruits.loc["apple"]

color     red
flavor      3
score       3
Name: apple, dtype: object
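
Since index values are not required to be unique, it may help to see what `loc` does with a repeated label. This is a minimal sketch using a made-up data frame called `repeated`; it is not part of the fruit data above.

```python
import pandas as pd

# Hypothetical data: two rows share the index label "apple".
repeated = pd.DataFrame(
    index=["apple", "apple", "grape"],
    data={"color": ["red", "green", "purple"], "score": [3, 4, 4]},
)

# The label "apple" appears twice, so loc returns a data frame with two rows.
apples = repeated.loc["apple"]
# The label "grape" appears once, so loc returns a single series.
grape = repeated.loc["grape"]

print(type(apples).__name__, apples.shape)
print(type(grape).__name__, grape.shape)

# is_unique reports whether any index label is repeated.
print(repeated.index.is_unique)
```

The return type changing with the number of matches is one reason to keep index values unique when you can.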

Script:
* If there were multiple apple rows, you would get multiple rows back.
* Next week, you will see how to use the index to join data frames together and combine their data.
* You can use the iloc attribute to get a row by its position counting from the top.

In [None]:
fruits.iloc[0]

color     red
flavor      3
score       3
Name: apple, dtype: object

Script:
* loc and iloc need to be separate since the index might be integers too, so there would be no way to automatically figure out whether you wanted to look up a row by index value or position.
* As you saw with the first data frame that I made, the default is to assign sequential integer indexes like a range sequence, so this would happen a lot.
* If you pass iloc a tuple of two integers, it will return a specific element, treating the data frame like a 2-dimensional array.

In [None]:
fruits.iloc[0, 0]

'red'
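
To illustrate why `loc` and `iloc` are kept separate, here is a small sketch with integer index labels that deliberately do not match the row positions. The data frame `shelf` is hypothetical and exists only for this example.

```python
import pandas as pd

# Hypothetical data frame whose integer index labels differ from row positions.
shelf = pd.DataFrame(
    index=[2, 0, 1],
    data={"name": ["grape", "apple", "orange"]},
)

# loc looks up by index label: the row labeled 0 is the second row.
by_label = shelf.loc[0, "name"]
# iloc looks up by position: row 0, column 0 is the first row.
by_position = shelf.iloc[0, 0]

print(by_label)     # apple
print(by_position)  # grape
```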

Script:
* This indexing works just like NumPy with a 2-dimensional array.
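
As a rough check of that parallel, the sketch below converts the data frame to a NumPy array with `to_numpy` and compares positional lookups. It rebuilds the same fruit data so that it runs on its own; the name `as_array` is made up.

```python
import pandas as pd

fruits = pd.DataFrame(
    index=["apple", "orange", "grape"],
    data={
        "color": ["red", "orange", "purple"],
        "flavor": [3, 3, 4],
        "score": [3, 2, 4],
    },
)

# to_numpy returns the values as a 2-dimensional NumPy array (no index or column names).
as_array = fruits.to_numpy()
print(as_array.shape)

# The same (row, column) positions pick out the same elements.
print(fruits.iloc[0, 0], as_array[0, 0])
print(fruits.iloc[2, 1], as_array[2, 1])

# Slices work positionally too; the result of iloc keeps its labels.
print(fruits.iloc[0:2, 1:3])
```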

In [None]:
fruits

,color,flavor,score
apple,red,3,3
orange,orange,3,2
grape,purple,4,4


Script:
* Looking at the data frame again, all of the data columns are clearly labeled: color, flavor, and score for this data frame.
* The default indexing on a data frame will select a column by name.

In [None]:
fruits["color"]

apple        red
orange    orange
grape     purple
Name: color, dtype: object

Script:
* If you index a data frame with a column name, you get just that column back.
* Here, you get a series back.
* A series is the pandas abstraction for a single column, and has an index, an optional name, and all the values.
* If you index a pandas data frame with a list of strings, you get back a data frame with the same index and those columns.

In [None]:
fruits[["color"]]

,color
apple,red
orange,orange
grape,purple
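
The difference between indexing with a string and indexing with a list of strings can be checked directly. This is a minimal sketch that rebuilds the fruit data so it runs on its own; `color_series` and `color_frame` are names made up for the example.

```python
import pandas as pd

fruits = pd.DataFrame(
    index=["apple", "orange", "grape"],
    data={
        "color": ["red", "orange", "purple"],
        "flavor": [3, 3, 4],
        "score": [3, 2, 4],
    },
)

color_series = fruits["color"]    # a single string selects one column as a series
color_frame = fruits[["color"]]   # a list of strings selects a data frame

print(type(color_series).__name__, color_series.shape)
print(type(color_frame).__name__, color_frame.shape)

# The three parts of a series: its name, its index, and its values.
print(color_series.name)
print(list(color_series.index))
print(list(color_series))
```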


Script:
* I selected one column first to compare to the previous output.
* Now let's select more than one.

In [None]:
fruits[["color", "flavor"]]

,color,flavor
apple,red,3
orange,orange,3
grape,purple,4


Script:
* Or change the order of the columns.

In [None]:
fruits[["flavor", "score", "color"]]

,flavor,score,color
apple,3,3,red
orange,3,2,orange
grape,4,4,purple


Script:
* New data frames made this way keep the same index as the original, so you can remix columns freely; whether the underlying storage is copied or shared with the original depends on the pandas version and settings.
* Leaning into the idea that a data frame is like a dictionary of series, if you iterate over a data frame, you get back the column names.

In [None]:
for c in fruits:
    print(repr(c))

'color'
'flavor'
'score'
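
The dictionary comparison goes a little further than iteration. As a sketch, a few other dictionary-style operations also work on column names; the fruit data is rebuilt here so the block runs on its own.

```python
import pandas as pd

fruits = pd.DataFrame(
    index=["apple", "orange", "grape"],
    data={
        "color": ["red", "orange", "purple"],
        "flavor": [3, 3, 4],
        "score": [3, 2, 4],
    },
)

# keys() returns the column names, like the keys of a dictionary.
print(list(fruits.keys()))

# Membership tests check column names, not index values.
print("color" in fruits)
print("apple" in fruits)

# items() yields (column name, series) pairs.
for column_name, series in fruits.items():
    print(column_name, list(series))
```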


Script:
* Each series in the data frame has its own type.

In [None]:
for c in fruits:
    print(c, fruits[c].dtype)

color object
flavor int64
score int64
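
The same information is available without a loop through the `dtypes` attribute. The sketch below also pulls one column out as a NumPy array with `to_numpy`; the name `flavor_values` is made up, and the exact type reported for the string column may differ between pandas versions.

```python
import pandas as pd

fruits = pd.DataFrame(
    index=["apple", "orange", "grape"],
    data={
        "color": ["red", "orange", "purple"],
        "flavor": [3, 3, 4],
        "score": [3, 2, 4],
    },
)

# dtypes is a series mapping each column name to that column's type.
print(fruits.dtypes)

# A numeric column can be handed to NumPy code as an array.
flavor_values = fruits["flavor"].to_numpy()
print(type(flavor_values).__name__, flavor_values.dtype)
print(flavor_values.sum())
```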


Script:
* So each column in the data frame has its own type.
* Like we saw with NumPy, this will allow pandas to store lots of values efficiently.
* In fact, pandas defaults to using NumPy arrays to store each series, so pandas efficiency defaults to NumPy efficiency.
* Each of the three things called out in this video is simple on its own, and you could implement them yourself pretty easily.
* In the next video, I will talk about how having them all together is more than just a minor convenience and enables a lot more library support to make your job easier.