In [1]:
import pandas as pd
import matplotlib.pyplot as plt

Here we define the path to our data file. We'll reuse the Pettigrew data that we saw in our SQL lesson, so we can use a relative path back to that other folder.  The `..` means that we want to look up one level in the directory tree; from there the path drills down through the folders we've named.

In [4]:
datafilepath = '../SQL Lesson Materials/pettigrew.csv'
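
To see how the `..` gets resolved, here is a minimal sketch. The notebook folder `/home/student/course/Pandas Lesson` is a made-up location used only for illustration; your own folders will differ.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical folder that the notebook lives in (an assumption, not the real layout).
notebook_dir = PurePosixPath('/home/student/course/Pandas Lesson')
datafilepath = '../SQL Lesson Materials/pettigrew.csv'

# Joining the two keeps the '..' in place...
joined = notebook_dir / datafilepath

# ...and normpath collapses it by stepping up one level.
resolved = posixpath.normpath(str(joined))

print(joined)
print(resolved)
```

The second printed path no longer contains `Pandas Lesson`: the `..` cancelled it out, and the rest of the path drilled down from the parent folder.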

Once we have our path, we can pass that string to the `read_csv()` function from pandas. This replaces the need to use the `csv` module, and loads the data into a pandas dataframe.  Because I've imported pandas as `pd`, I need to write `pd.` before every function that I call from that library.  The function returns a dataframe object, which we need to assign to a variable to keep it in memory.  The variable name usually used for this is `df` (dataframe, you see it, right?).

In [11]:
df = pd.read_csv(datafilepath)

When you are working in Jupyter Notebooks there are two ways to look at the contents of a dataframe.  You can use `print()` like you normally would, but the output isn't very pretty: it's plain text, which isn't much fun.  Instead, you can evaluate just the name of the dataframe variable as the last line of a cell, and the nice styling of Jupyter will take over.

Take a second look at this output.  There are several important things that you should notice.

* the left side has an unlabelled column that's bolded.  This is the index that pandas has added on.  We can change that later.
* the top row of the file has automatically been interpreted as column names, and these look different from our rows of data.
* the original order of our data has been preserved.
* the middle of the table is skipped, so it shows us just the beginning and the end of our data table.
* at the very end of the table it shows us how many rows and columns there are in our table.

Compared to how we were working with our data before, we can see a really important concept about this data structure: it both holds our data and carries its own metadata about that data.  For example, instead of having a separate list of strings that we know correspond to our column headers, the dataframe carries that information inside itself.
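
Here is a small sketch of that difference. It reads the same text twice, once with the `csv` module and once with pandas. The two data rows are copied from the start of the Pettigrew table; `csv_text`, `small_df`, and the other names are just for this example.

```python
import csv
import io

import pandas as pd

# Header plus the first two rows of the table, held in a string so the
# example runs without the real file.
csv_text = (
    "BoxNumber,FolderNumber,Contents,Date\n"
    "1,1,[Provenance documents and biographical sources],n.d.\n"
    "1,2,[Provenance documents and biographical sources],n.d.\n"
)

# csv module: the headers and the rows end up in two separate objects
# that we have to keep track of ourselves.
reader = csv.reader(io.StringIO(csv_text))
headers = next(reader)
rows = list(reader)

# pandas: the header names travel inside the dataframe.
small_df = pd.read_csv(io.StringIO(csv_text))

print(headers)
print(list(small_df.columns))
```

Both `print` calls show the same four names, but in the pandas version there is only one object to pass around.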

In [9]:
df

Unnamed: 0,BoxNumber,FolderNumber,Contents,Date
0,1,1,[Provenance documents and biographical sources],n.d.
1,1,2,[Provenance documents and biographical sources],n.d.
2,1,3,[Provenance documents and biographical sources],n.d.
3,1,4,"Aberdeen, George Hamilton-Gordon, 4th Earl of,...",1846 Aug
4,1,5,"Abingdon, Montagu Bertie, 6th Earl of, 1808-18...","1859, 1860"
5,1,6,"Acland, Sir Thomas Dyke, bart., 1787-1871. ALS...",n.d.
6,1,7,"Aikin, Arthur, 1773-1854. 3 ALS to T. J. Petti...",1823-30
7,1,8,"Ailesbury, George William Frederick Brudnell-B...",1858-59
8,1,9,"Ainslie, Sir Whitelaw, 1767-1837. 1 ALS and 2 ...",1828 Jun
9,1,10,"Ainsworth, William Francis, 1807-1896. 5 ALS t...",1843-56


Now that we have a dataframe, we can ask for some of this metadata.  It's important to note that, in Python, there is a distinction between asking an object to execute a method (roughly, a function that lives inside an object in your program) and asking for the value of an object's attribute.  Sometimes objects will have special methods for getting this data out, and other times it's directly accessible.  This all depends on what you are working with, so you will depend on tutorials, documentation, or the source code to tell you what is available.  Sometimes there are nice introspection tools available.  Anyhow.

Suppose you're looking at something that is tacked onto the end of a variable or other data object with a dot (so basically not just a name hanging out on its own).  You have something like:

`object.doodad` or `object.pizza()`

* it's a method call if there are `()` on the end
* it's an attribute access if there are *no* parens

So which one is which?

`object.doodad` is an attribute access and `object.pizza()` calls the `pizza()` method of `object`.  Of course, this all presumes that the object in question knows what to do with those names.
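
As a runnable version of the same idea, here is a tiny sketch. The `Folder` class is invented purely for illustration and has nothing to do with pandas.

```python
class Folder:
    """A hypothetical class, made up for this example."""

    def __init__(self, number):
        # Attribute: a value stored on the object.
        self.number = number

    def label(self):
        # Method: a function that lives on the object.
        return f"Folder {self.number}"


folder = Folder(4)

print(folder.number)           # attribute access, no parens
print(folder.label())          # method call, with parens
print(callable(folder.label))  # a method without parens is still a callable thing
```

Leaving the parens off a method doesn't raise an error. You just get the method object back instead of running it, which is an easy mistake to make.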

Let's get back to pandas.

There is a series of metadata attributes on a dataframe that you can access to get some information about it.

In [12]:
print(df.shape)

(601, 4)


This gives us the shape of our dataframe as a tuple (so we can compute on it if we need to), with the number of rows first and the number of columns second.
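
Because the shape is an ordinary tuple, we can unpack it and do arithmetic with it. A minimal sketch, using a tiny stand-in dataframe (`small_df` is a name made up for this example, not the Pettigrew data):

```python
import pandas as pd

# A three-row, two-column stand-in for the real dataframe.
small_df = pd.DataFrame({
    "BoxNumber": [1, 1, 1],
    "FolderNumber": [1, 2, 3],
})

# Tuple unpacking: rows come first, columns second.
n_rows, n_cols = small_df.shape
n_cells = n_rows * n_cols

print(small_df.shape)
print(n_cells)
```

Note that `shape` has no parens on the end: it's an attribute, not a method.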

In [13]:
print(df.columns)

Index(['BoxNumber', 'FolderNumber', 'Contents', 'Date'], dtype='object')


This gives you, roughly, the names of the columns in your dataframe.  I say roughly because from a content perspective, that's true, but this is a special object.  This gives us a good segue into talking about data types and friends in dataframes.

This isn't just a list of strings. This is an `Index` object that, yes, will roughly look and stink like a list, but comes with special metadata.  See that bit at the end?  It says `object`. We're going to talk about what that specifically means in a minute, but we first need to talk about what it means for a collection-like object to have a data type.  So this is an `Index` type object, and we can see that clearly.

In [14]:
print(type(df.columns))

<class 'pandas.core.indexes.base.Index'>
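
If you really do want a plain list of the column names, you can convert the `Index`. A short sketch, again using a small stand-in dataframe rather than the real data:

```python
import pandas as pd

# Stand-in dataframe with two of the column names from the real table.
small_df = pd.DataFrame({"BoxNumber": [1, 1], "Date": ["n.d.", "n.d."]})

cols = small_df.columns   # an Index object, not a list
as_list = list(cols)      # an ordinary Python list of strings

print(type(cols))
print(as_list)
print("Date" in cols)     # membership tests work on the Index directly
```

For most everyday work you can leave it as an `Index`, since looping and `in` checks behave the way you'd expect from a list.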


But there's also the data type of the contents.  This is actually a pretty normal concept in other programming languages, but completely abnormal if you only know Python.  What pandas generally requires is that every 1D (one dimensional) object has a single data type for all of its contents.  For example, such and such collection is all floating point numbers, while so and so collection is all strings.  This is because there are nice fancy things that you can do with some of these special kinds of collections.

The most permissive data type is the `object` type, which allows you to have just a pile of whatever.  In our case, we've got a bunch of strings.
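
Here is a quick sketch of the one-type-per-collection idea, using two made-up `Series` (the pandas 1D object). The values are invented for illustration.

```python
import pandas as pd

# Every value is a floating point number, so pandas picks a float type.
numbers = pd.Series([1.5, 2.0, 3.25])

# A pile of whatever: an integer, a string, and a float together.
mixed = pd.Series([1, "two", 3.0])

print(numbers.dtype)
print(mixed.dtype)
```

The first one reports a float type, and the mixed one falls back to `object` because no narrower type fits everything in it.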

The other data types are two different kinds of numbers