# The Pandas DataFrame - Working With Data Efficiently

## Getting to Know Pandas DataFrames

In [1]:
import pandas as pd

In [2]:
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

In [3]:
index = range(101, 108)

For the row labels, I’ll use Python’s `range()` function. The labels should start at `101` and go up to `107`, and because the stop value is exclusive, I pass `108`. Next, let’s use the constructor for the `DataFrame` object.

We’ll pass in the data and then specify the row labels with the keyword argument `index`. The labels are stored in the `range` object named `index` that we just created, although a `list` or a `tuple` would have worked too.

In [4]:
df = pd.DataFrame(data, index=index)

That saves the `DataFrame` in the variable `df`. Let’s look at its type:

In [5]:
type(df)

pandas.core.frame.DataFrame

We’ve got the `pandas` package, then the modules nested inside it (`core.frame`), and the final type is called `DataFrame`.

Let’s take a look at two important attributes of a `DataFrame` object. The first is `.index`, which in this case is a `RangeIndex` object.
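
As a quick check of that dotted path, here is a minimal sketch that builds a small frame and inspects its type. The one-column data and the name `frame_type` are only for illustration; `__module__` and `__name__` are standard Python class attributes.

```python
import pandas as pd

# A trimmed-down frame; one column is enough to inspect the type.
df = pd.DataFrame({'age': [41, 28, 33]}, index=range(101, 104))

frame_type = type(df)
print(frame_type.__module__)        # module path where the class is defined
print(frame_type.__name__)          # name of the class itself
print(frame_type is pd.DataFrame)   # the top-level name points to the same class
```

The last line suggests why you rarely need the long path: `pd.DataFrame` refers to the same class.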

In [6]:
df.index

RangeIndex(start=101, stop=108, step=1)

It starts at `101`, stops before `108`, and has a step of `1`. So this is essentially like the `range` object that you’re used to in Python.
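
The comparison with `range` can be sketched as follows. The example uses a single-column copy of the data, and the names `df_range`, `df_list`, and `idx` are only illustrative. It also shows what happens when the same labels are passed as a list instead.

```python
import pandas as pd

ages = {'age': [41, 28, 33, 34, 38, 31, 37]}  # one column is enough here

df_range = pd.DataFrame(ages, index=range(101, 108))
df_list = pd.DataFrame(ages, index=[101, 102, 103, 104, 105, 106, 107])

idx = df_range.index
print(idx.start, idx.stop, idx.step)  # stop is exclusive, as with range()
print(list(idx))                      # the labels that are actually present

# A list of labels gives a regular Index rather than a RangeIndex,
# but the labels themselves are the same.
print(type(df_list.index).__name__)
print(list(df_list.index) == list(idx))
```

A `RangeIndex` only stores the start, stop, and step, so it is a compact way to hold evenly spaced integer labels.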

Then we’ve got the `.columns` attribute, which is a pandas `Index` object. Let’s look at it and then at its type:


In [7]:
df.columns

Index(['name', 'city', 'age', 'py-score'], dtype='object')

In [8]:
type(df.columns)

pandas.core.indexes.base.Index

This is the `Index` type from pandas, which is also one of the main data structures in the library.
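
To get a feel for the column `Index`, here is a small sketch of a few things you can do with it. It rebuilds the same `df`; the variable name `cols` is just for illustration.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
df = pd.DataFrame(data, index=range(101, 108))

cols = df.columns
print(list(cols))                # column labels as a plain Python list
print('age' in cols)             # membership test on the labels
print(cols.get_loc('py-score'))  # integer position of a label
```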

## Working With Rows and Columns

In [9]:
df['city']

101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

So in this case, you access the column with the label `'city'`, and this returns a pandas `Series` object. You can think of a pandas `Series` object as either an entire row or an entire column of a `DataFrame`.

In [24]:
type(df['city'])

pandas.core.series.Series

Let’s save this column in a variable, say `cities`.

In [11]:
cities = df['city']

And if we take a look at this again, we see that we extracted not only the data in that column but also the index, that is, the row labels. So a pandas `Series` object also has an `.index` attribute:

In [12]:
cities

101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

In [13]:
cities.index

RangeIndex(start=101, stop=108, step=1)

In this case, it is the same as the `.index` of the DataFrame, because we extracted an entire column of the `DataFrame`.
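
That relationship between the column and the frame can be checked directly. The sketch below uses a two-column copy of the data and compares the two indexes with `Index.equals()`.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
}
df = pd.DataFrame(data, index=range(101, 108))

cities = df['city']
print(cities.index.equals(df.index))  # the column carries the row labels
print(cities.name)                    # the column label becomes the Series name
print(len(cities) == len(df))         # one value per row
```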

Another way to extract a column is dot notation, but **this only works if the name of the column that you want to extract is a string that’s a valid Python identifier**. So, for example, to extract the `age` column, we simply type `df.age`, and we get that `Series` object.

In [15]:
df.age

101    41
102    28
103    33
104    34
105    38
106    31
107    37
Name: age, dtype: int64

But if we tried this with the Python score column, so `df.py-score`, we’d get an `AttributeError`. Python parses that expression as a subtraction: it looks up an attribute called `py` on the DataFrame, intending to subtract a variable called `score` from it, and the DataFrame has no `py` attribute.

So to extract the `py-score` column, we have to use bracket notation and write out the full column name.

In [17]:
df['py-score']

101    88.0
102    79.0
103    81.0
104    80.0
105    68.0
106    61.0
107    84.0
Name: py-score, dtype: float64
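
The difference between the two notations can be sketched as follows. The example uses a two-column copy of the data, and the names `same`, `outcome`, and `scores` are only illustrative. `str.isidentifier()` is a built-in string method that tells you whether a name would work with dot notation at all.

```python
import pandas as pd

data = {
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
df = pd.DataFrame(data, index=range(101, 108))

print('age'.isidentifier())       # True: usable with dot notation
print('py-score'.isidentifier())  # False: the hyphen is not allowed

# Dot notation and bracket notation return the same column.
same = df.age.equals(df['age'])
print(same)

# `df.py-score` is parsed as `df.py - score`, and `df.py` is evaluated first.
try:
    df.py
    outcome = 'no error'
except AttributeError:
    outcome = 'AttributeError'
print(outcome)

# Bracket notation works for any column label.
scores = df['py-score']
print(scores.dtype)
```

Bracket notation is the safer default, since it works for every column label, including ones with spaces or hyphens.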

In [18]:
df.index

RangeIndex(start=101, stop=108, step=1)

To access a whole row, say the one with the index label `103`, we use the `.loc` accessor. We write the name of the `DataFrame`, then `.loc`, and then, in square brackets, the label of the row that we want to access, so let’s say `103`.

In [19]:
df.loc[103]

name          Jana
city        Prague
age             33
py-score      81.0
Name: 103, dtype: object

In [20]:
type(df.loc[103])

pandas.core.series.Series
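
To see how the row comes back, here is a short sketch that rebuilds `df` and inspects the result of `.loc`. The variable name `row` is only for illustration.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
df = pd.DataFrame(data, index=range(101, 108))

row = df.loc[103]
print(list(row.index))       # the column labels become the index of the row
print(row.name)              # the row label becomes the name of the Series
print(row['city'])           # one value from the row
print(df.loc[103, 'city'])   # row label and column label in a single step
```

Because the row mixes strings and numbers, its `dtype` is `object`, which matches the output shown above.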

So a single row comes back as a pandas `Series`. Now, if you recall, we also have the `cities` `Series`, whose index is exactly the same as the DataFrame’s. With a `DataFrame` we had to use the `.loc` accessor to select a row by its label, but with a `Series` object **we can access a value by its index label directly, using bracket notation like this**:

In [21]:
cities

101    Mexico City
102        Toronto
103         Prague
104       Shanghai
105     Manchester
106          Cairo
107          Osaka
Name: city, dtype: object

In [22]:
cities[103]

'Prague'
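
For comparison, the same value can be reached in a few ways. The sketch below uses a one-column copy of the data. It shows plain bracket notation next to the explicit `.loc` (by label) and `.iloc` (by position) accessors, which also exist on `Series` objects.

```python
import pandas as pd

data = {
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
}
df = pd.DataFrame(data, index=range(101, 108))
cities = df['city']

print(cities[103])      # bracket notation looks up the label 103
print(cities.loc[103])  # the same lookup, spelled out as label-based
print(cities.iloc[2])   # the third value by position
```

When the labels are integers, as they are here, writing `.loc` or `.iloc` makes it obvious whether a label or a position is meant.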