# Accessing elements in Pandas DataFrames

For me, even though I have used Pandas off and on over many years, I still occasionally get confused about the syntax for accessing and modifying elements in Pandas DataFrames!

**I feel like the basic problem is that some of the Pandas syntax is too flexible**

- Similar notation lets you do multiple things!
- There are multiple ways of doing the same thing!

**So, what I'll try to do is point out the confusion points, or ambiguities in the syntax, and advise on what methods to avoid.**

For a very complete reference, see the
[Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
section of the Pandas User's Guide.

*As Alex Olteanu does in his blog post 
[How to make 538 plots](https://www.dataquest.io/blog/making-538-plots/), 
"We’ll work with data describing the percentages of Bachelors conferred to women in the US 
from 1970 to 2011. We’ll use a dataset compiled by data scientist 
[Randal Olson](http://www.randalolson.com/2014/06/14/percentage-of-bachelors-degrees-conferred-to-women-by-major-1970-2012/), 
who collected the data from the 
[National Center for Education Statistics](https://nces.ed.gov/about/)."*

In [1]:
import pandas as pd

## Load DataFrame from CSV – integer default index

If we load data into a DataFrame from a CSV (comma-separated value) text file, Pandas will automatically create row labels, called an `index`, for the DataFrame, consisting of sequential integers.

The `.head()` DataFrame (and Series) method returns the first five rows by default. You can put a number in the parentheses to specify the number of rows to return. You can also use the `.tail()` method to see the last rows.

In [2]:
df = pd.read_csv('data/women_percent_deg_usa_subset.csv')
df.head()

,Year,Agriculture,Business,Engineering,Health,Psychology
0,1970,4.229798,9.064439,0.8,77.1,44.4
1,1971,5.452797,9.503187,1.0,75.5,46.2
2,1972,7.42071,10.558962,1.2,76.9,47.6
3,1973,9.653602,12.804602,1.6,77.4,50.4
4,1974,14.074623,16.20485,2.2,77.9,52.6


All you have to do is put an integer in the parentheses to see that many rows.

In [3]:
df.tail(3)

,Year,Agriculture,Business,Engineering,Health,Psychology
39,2009,48.667224,48.840474,16.8,85.1,77.1
40,2010,48.730042,48.757988,17.2,85.0,77.0
41,2011,50.037182,48.180418,17.5,84.8,76.7


And even though the `index` looks like integers, it's really a special type of `Index` class under the hood. 

In [4]:
df.index

RangeIndex(start=0, stop=42, step=1)

## Read CSV specifying index column

For this tutorial, though, **I want it to be clear how Pandas behaves when we use integers versus row labels for selecting DataFrame elements**, so I'm going to explicitly tell Pandas to use the Year column from the data as the `index`.

*Note: I could have just used the `df.set_index('Year')` method on the previous DataFrame, but I wanted to show you how to do it all in one step during the load.*

In [5]:
df = pd.read_csv('data/women_percent_deg_usa_subset.csv', index_col='Year')
df.head()

Year,Agriculture,Business,Engineering,Health,Psychology
1970,4.229798,9.064439,0.8,77.1,44.4
1971,5.452797,9.503187,1.0,75.5,46.2
1972,7.42071,10.558962,1.2,76.9,47.6
1973,9.653602,12.804602,1.6,77.4,50.4
1974,14.074623,16.20485,2.2,77.9,52.6
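
The `set_index` route mentioned in the note can be sketched on a few rows held in memory. This is only an illustration: the CSV text below is a hand-copied stand-in for the real data file, and the names `csv_text`, `df_one_step` and `df_two_step` are made up for the example.

```python
import io
import pandas as pd

# A few rows copied from the table above, as in-memory CSV text
# (a stand-in for the real data file, which is not read here).
csv_text = """Year,Agriculture,Business
1970,4.229798,9.064439
1971,5.452797,9.503187
1972,7.42071,10.558962
"""

# One step: name the index column while loading.
df_one_step = pd.read_csv(io.StringIO(csv_text), index_col='Year')

# Two steps: load with the default integer index, then promote 'Year'.
df_two_step = pd.read_csv(io.StringIO(csv_text)).set_index('Year')

print(df_one_step.equals(df_two_step))  # True
print(df_one_step.index.name)           # Year
```

Note that `set_index` returns a new DataFrame rather than changing the original, so the result has to be assigned to a variable.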


### Index values

This time the Index has a name, and is not just a sequential range of integers, but a list of the specific integers present in the Year column of the data.

In [6]:
df.index

Int64Index([1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980,
            1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991,
            1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002,
            2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011],
           dtype='int64', name='Year')

### Column names

- We can access the column names with the `df.columns` attribute
- That also happens to return a Pandas `Index` – this time with a data type (dtype) of `'object'`, which is how Pandas deals with strings.

In [7]:
df.columns

Index(['Agriculture', 'Business', 'Engineering', 'Health', 'Psychology'], dtype='object')

---

# Single-axis indexing/selecting

## Square brackets notation: `df[]`

**One of the most confusing things about Pandas is how many different things you can do with a set of square brackets after the DataFrame variable name!**

From what I have gathered, this single square brackets `df[]` notation was the original Pandas syntax, but in trying to make it flexible, it became confusing to write and read, and they eventually added the more explicit `df.loc[]` and `df.iloc[]` methods you'll see in a bit. 

*It's still quite handy to use the single brackets for grabbing a single DataFrame column, but for other uses I would recommend the more explicit methods!*

## `df[]` with name inside for a single column

Like with a Python dictionary, you can use the (quoted) name of a column in square brackets to grab a single column out of a DataFrame. The column is returned as a Pandas `Series`, which is the standard 1D data structure, consisting of an `index` and the column's values.

In [8]:
df['Health'].head()

Year
1970    77.1
1971    75.5
1972    76.9
1973    77.4
1974    77.9
Name: Health, dtype: float64

## `df[]` with list inside for multiple columns

If you instead put a list of column names inside the square brackets, Pandas will return an ordered set of those columns in a `DataFrame`. 

**This looks a little weird because a list also has square brackets around it!** 

And, if you try to use this notation to assign just those columns to another variable, there is a problem you'll see below.

In [9]:
df[['Business','Health']].head()

Year,Business,Health
1970,9.064439,77.1
1971,9.503187,75.5
1972,10.558962,76.9
1973,12.804602,77.4
1974,16.20485,77.9
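
To make the name-versus-list difference concrete, here is a small sketch on a hand-built two-row frame (the frame `df_small` is an assumption for illustration, using values from the tables above). A bare name gives back a `Series`, while a list gives back a `DataFrame`, even if the list has only one item.

```python
import pandas as pd

# Tiny stand-in for the tutorial's DataFrame.
df_small = pd.DataFrame(
    {'Business': [9.064439, 9.503187], 'Health': [77.1, 75.5]},
    index=pd.Index([1970, 1971], name='Year'),
)

single = df_small['Health']                   # column name -> Series
as_frame = df_small[['Health']]               # one-item list -> DataFrame
reordered = df_small[['Health', 'Business']]  # columns come back in list order

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(list(reordered.columns))  # ['Health', 'Business']
```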


### This format isn't great for assignment – **SettingWithCopyWarning**

If you use this "list inside of square brackets" notation and assign to a new variable as a way to just keep a couple columns of your DataFrame, you will run into a **SettingWithCopyWarning** if you try to change that new DataFrame. 

*Because of some Pandas inner-workings, when you slice or index into a DataFrame it's not actually clear whether it will create a "view" into the original DataFrame or a copy of the data!* 

- **If you are assigning a subset of a DataFrame to a new variable**, with the intention of creating a copy that you'll work on independently of the original, **create an explicit copy by chaining the `.copy()` function to the end!**
- If you're trying to assign values to a subset of your original DataFrame, **use the `df.loc[,]` notation shown below.**

- Additional learning resources:
    - [The best explanation of the SettingWithCopyWarning I've seen](https://www.dataquest.io/blog/settingwithcopywarning/)
    - [A complicated explanation in the Pandas documentation](https://pandas.pydata.org/pandas-docs/version/0.22/indexing.html#indexing-view-versus-copy)

Here's an example of when the problem arises. We try to put a couple columns into a new DataFrame, and then change the new DataFrame by adding an additional column:

In [10]:
df_temp = df[['Business','Health']]
df_temp['ratio'] = df['Business']/df['Health']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temp['ratio'] = df['Business']/df['Health']


But if we make an explicit `.copy()` of the data, we don't have a problem:

In [11]:
df_temp = df[['Business','Health']].copy()
df_temp['ratio'] = df['Business']/df['Health']
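
As a quick check that the explicit copy really is independent, this sketch (again on a made-up two-row frame, `df_small`) adds a column to the copy and confirms the original keeps its two columns.

```python
import pandas as pd

# Tiny stand-in for the tutorial's DataFrame.
df_small = pd.DataFrame(
    {'Business': [9.064439, 9.503187], 'Health': [77.1, 75.5]},
    index=pd.Index([1970, 1971], name='Year'),
)

# Explicit copy: changes to df_copy cannot reach df_small.
df_copy = df_small[['Business', 'Health']].copy()
df_copy['ratio'] = df_copy['Business'] / df_copy['Health']

print(list(df_copy.columns))   # ['Business', 'Health', 'ratio']
print(list(df_small.columns))  # ['Business', 'Health']
```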

## `df[]` with integer "slice" notation

**Here's one place where things get screwy!**

If you put a range of ordered integers into the square brackets, using Python's "slice notation" with a colon between the integers, Pandas will return a set of **rows** from the DataFrame, including all the columns!!

Note that, just like with a Python list, what's returned is inclusive of the initial integer index, but excludes the final value.

### *I would advise against using this notation!*

*There are other ways of doing the same thing, and if you don't ever use it, maybe you'll forget it exists!*

In [12]:
df[0:3]

Year,Agriculture,Business,Engineering,Health,Psychology
1970,4.229798,9.064439,0.8,77.1,44.4
1971,5.452797,9.503187,1.0,75.5,46.2
1972,7.42071,10.558962,1.2,76.9,47.6
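
One of those "other ways" is position-based `.iloc`, which says out loud that the integers are positions. The sketch below uses a small hand-built frame (`df_small` is an assumption for illustration) and assumes the integer-slice behavior shown in the cell above, where `df[0:3]` selects rows by position.

```python
import pandas as pd

# Tiny stand-in with an integer 'Year' index, like the tutorial's DataFrame.
df_small = pd.DataFrame(
    {'Health': [77.1, 75.5, 76.9, 77.4]},
    index=pd.Index([1970, 1971, 1972, 1973], name='Year'),
)

by_brackets = df_small[0:3]    # ambiguous-looking: rows 0, 1, 2 by position
by_iloc = df_small.iloc[0:3]   # explicit: rows 0, 1, 2 by position

print(by_brackets.equals(by_iloc))  # True
print(list(by_iloc.index))          # [1970, 1971, 1972]
```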


## `df[]` with boolean series for multiple rows, all columns

*Another potentially confusing notation, but in this case, convenient enough that you'll see it quite often. Note that you can easily use the `df.loc[,:]` notation below to do the same thing, which is somewhat more clear and readable.*

**It's very common to want all the rows from your DataFrame which pass a certain test**, or set of criteria. You can write the test itself in a very straightforward way, returning a `Series` of True/False boolean values.

Here we'll test which rows, or years, of the Business category had less than 15% women.

In [13]:
df['Business'] < 15

Year
1970     True
1971     True
1972     True
1973     True
1974    False
1975    False
1976    False
1977    False
1978    False
1979    False
1980    False
1981    False
1982    False
1983    False
1984    False
1985    False
1986    False
1987    False
1988    False
1989    False
1990    False
1991    False
1992    False
1993    False
1994    False
1995    False
1996    False
1997    False
1998    False
1999    False
2000    False
2001    False
2002    False
2003    False
2004    False
2005    False
2006    False
2007    False
2008    False
2009    False
2010    False
2011    False
Name: Business, dtype: bool

### Matching rows from conditional (boolean Series) in square brackets

If you put that conditional test statement in square brackets after the DataFrame variable name, Pandas will return all the columns in all the rows that returned True in the test's boolean Series. As we saw above, only four rows pass the test:

In [14]:
df[df['Business'] < 15]

Year,Agriculture,Business,Engineering,Health,Psychology
1970,4.229798,9.064439,0.8,77.1,44.4
1971,5.452797,9.503187,1.0,75.5,46.2
1972,7.42071,10.558962,1.2,76.9,47.6
1973,9.653602,12.804602,1.6,77.4,50.4


#### More complicated conditionals

You can combine multiple conditions if you put them in parentheses and put a logical operator between.

- `&` = "and"
- `|` = "or"  *(It's called a "pipe" character, and it's `shift-\` above the Enter/Return key)*

In [15]:
df[(df['Agriculture'] <= 10) | (df['Agriculture'] > 50)]

Year,Agriculture,Business,Engineering,Health,Psychology
1970,4.229798,9.064439,0.8,77.1,44.4
1971,5.452797,9.503187,1.0,75.5,46.2
1972,7.42071,10.558962,1.2,76.9,47.6
1973,9.653602,12.804602,1.6,77.4,50.4
2011,50.037182,48.180418,17.5,84.8,76.7
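
The parentheses matter because Python's `&` and `|` operators bind more tightly than comparisons like `<=`. One way to sidestep that, sketched here on a hand-built four-row frame (`df_small`, `low` and `high` are names made up for the example), is to store each test in its own variable. The sketch also shows `~`, which means "not".

```python
import pandas as pd

# Tiny stand-in using four Agriculture values from the tables in this tutorial.
df_small = pd.DataFrame(
    {'Agriculture': [4.229798, 9.653602, 30.75939, 50.037182]},
    index=pd.Index([1970, 1973, 1980, 2011], name='Year'),
)

# Each test is its own boolean Series, so no precedence surprises.
low = df_small['Agriculture'] <= 10
high = df_small['Agriculture'] > 50

either = df_small[low | high]      # "or"
neither = df_small[~(low | high)]  # "not (low or high)"

print(list(either.index))   # [1970, 1973, 2011]
print(list(neither.index))  # [1980]
```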


---

# Multi-axis indexing/selecting

So far we've been returning either all rows of a single column, or all columns from rows matching a condition. There are two simple ways of selecting along rows and columns simultaneously: 

- The first based on "labels" like column names and index values *(preferred, when possible)*
- The second based on integer position along rows and columns

## `df.loc[row,col]` for label-based, multi-axis indexing

The `df.loc[row,col]` method lets you select along rows and columns simultaneously using row and column "labels", which are the row index values and column names. You can use [(quoting from the documentation)](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing):

- A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
- A list or array of labels ['a', 'b', 'c'].
- A slice object with labels 'a':'f' *(Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See [Slicing with labels](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-slicing-with-labels) and [Endpoints are inclusive](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-endpoints-are-inclusive).)*
- A boolean array

### Row and Column specified to return single value

If you give one specific row and column label, you'll get back a single value of the data type contained in that cell.

In [16]:
df.loc[1980, 'Agriculture']

30.75938956

### Row or column list to return a Series or DataFrame

- A single list will return a 1D result, so a Series
- Two lists will return a 2D result, so a DataFrame

In [17]:
df.loc[1975, ['Business','Health']]

Business    19.686249
Health      78.900000
Name: 1975, dtype: float64

In [18]:
df.loc[[1980,1990,2000],['Agriculture','Engineering']]

Year,Agriculture,Engineering
1980,30.75939,10.3
1990,32.703444,14.1
2000,45.057766,18.4


### Slice notation is like giving the bounds of a list

Remember that *both the rows and columns of a DataFrame are ordered*, and a label slice follows that stored order rather than the numeric or alphabetical order of the labels. Our DataFrame is in the data order from the CSV file, but things like sorting can change that order. So, when you specify "slice" notation to say `start:end`, Pandas will take the order into account, and will give you an empty result if the `end` label sits before the `start` label in the current order!

*You can also leave off the start or end of a slice range to include the beginning or end of the series.*

In [19]:
df.loc[1980:1985, 'Health']

Year
1980    83.5
1981    84.1
1982    84.4
1983    84.6
1984    85.1
1985    85.3
Name: Health, dtype: float64
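
Since label slices include the end label while position slices exclude the end position, it can help to see the two side by side. This is a small sketch on a hand-built `Series` (the name `health` is an assumption; the values are the first four rows of the output above).

```python
import pandas as pd

# Tiny stand-in for the 'Health' column.
health = pd.Series(
    [83.5, 84.1, 84.4, 84.6],
    index=pd.Index([1980, 1981, 1982, 1983], name='Year'),
    name='Health',
)

by_label = health.loc[1980:1982]  # labels: the end label 1982 is included
by_position = health.iloc[0:2]    # positions: the end position 2 is excluded

print(list(by_label.index))     # [1980, 1981, 1982]
print(list(by_position.index))  # [1980, 1981]
```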

In [20]:
df.loc[2000, 'Engineering':'Psychology']

Engineering    18.4
Health         83.5
Psychology     77.5
Name: 2000, dtype: float64

In [21]:
df.loc[2005:, :'Health']

Year,Agriculture,Business,Engineering,Health
2005,47.672754,49.791851,17.9,86.0
2006,46.7903,49.210914,16.8,85.9
2007,47.605026,49.000459,16.8,85.4
2008,47.570834,48.888027,16.5,85.2
2009,48.667224,48.840474,16.8,85.1
2010,48.730042,48.757988,17.2,85.0
2011,50.037182,48.180418,17.5,84.8


#### Non-intuitive result when Index isn't sorted

*Be careful! If we sort by the Health column values, the Year index is now not in time order.*

In [22]:
df_health_sort = df.sort_values(by='Health')
df_health_sort.head()

Year,Agriculture,Business,Engineering,Health,Psychology
1971,5.452797,9.503187,1.0,75.5,46.2
1972,7.42071,10.558962,1.2,76.9,47.6
1970,4.229798,9.064439,0.8,77.1,44.4
1973,9.653602,12.804602,1.6,77.4,50.4
1974,14.074623,16.20485,2.2,77.9,52.6


**So if we try to make a slice selection on the years, we get an empty result!**

In [23]:
df_health_sort.loc[1970:1972, 'Health']

Series([], Name: Health, dtype: float64)
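
If you hit this, one possible fix is to put the index back in order with `.sort_index()` before slicing. The sketch below rebuilds the situation on a four-row frame (the name `df_unsorted` is made up for the example, with rows in the same Health-sorted order as above).

```python
import pandas as pd

# Rows in Health order, so the Year index is out of order.
df_unsorted = pd.DataFrame(
    {'Health': [75.5, 76.9, 77.1, 77.4]},
    index=pd.Index([1971, 1972, 1970, 1973], name='Year'),
)

# 1970 sits *after* 1972 in the stored order, so the slice is empty.
empty = df_unsorted.loc[1970:1972, 'Health']

# Sorting the index first restores the expected label slice.
restored = df_unsorted.sort_index().loc[1970:1972, 'Health']

print(len(empty))            # 0
print(list(restored.index))  # [1970, 1971, 1972]
```

Like most Pandas methods, `.sort_index()` returns a new object, so `df_unsorted` itself keeps its Health ordering.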

### `:` by itself to specify whole axis

For either rows or columns, you can specify a `:` by itself to denote all values along that axis, returned as a `Series` or `DataFrame`, again depending on the dimensionality of the result!

In [24]:
df.loc[1970, :]

Agriculture     4.229798
Business        9.064439
Engineering     0.800000
Health         77.100000
Psychology     44.400000
Name: 1970, dtype: float64

In [25]:
df.loc[:, 'Agriculture'].head()

Year
1970     4.229798
1971     5.452797
1972     7.420710
1973     9.653602
1974    14.074623
Name: Agriculture, dtype: float64

In [26]:
df.loc[:,:].head()

Year,Agriculture,Business,Engineering,Health,Psychology
1970,4.229798,9.064439,0.8,77.1,44.4
1971,5.452797,9.503187,1.0,75.5,46.2
1972,7.42071,10.558962,1.2,76.9,47.6
1973,9.653602,12.804602,1.6,77.4,50.4
1974,14.074623,16.20485,2.2,77.9,52.6


### Boolean array from condition for either axis

**Note: This is the more explicit (and perhaps more clear) version of the earlier syntax:**

`df[df['Business'] < 15]`

In [27]:
df.loc[df['Business'] < 15, :]

Year,Agriculture,Business,Engineering,Health,Psychology
1970,4.229798,9.064439,0.8,77.1,44.4
1971,5.452797,9.503187,1.0,75.5,46.2
1972,7.42071,10.558962,1.2,76.9,47.6
1973,9.653602,12.804602,1.6,77.4,50.4
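
A practical payoff of the `.loc` form is that the column selection can go in the same call as the row test. Here's a minimal sketch on a hand-built three-row frame (`df_small` and `subset` are names made up for the example).

```python
import pandas as pd

# Tiny stand-in for the tutorial's DataFrame.
df_small = pd.DataFrame(
    {
        'Business': [9.064439, 12.804602, 16.20485],
        'Health': [77.1, 77.4, 77.9],
        'Psychology': [44.4, 50.4, 52.6],
    },
    index=pd.Index([1970, 1973, 1974], name='Year'),
)

# Rows where the test is True, and only two of the columns, in one step.
subset = df_small.loc[df_small['Business'] < 15, ['Business', 'Health']]

print(subset.shape)        # (2, 2)
print(list(subset.index))  # [1970, 1973]
```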


#### String/Text functions of Index and Series are under the `.str` accessor

Pandas documentation: 
[Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)

In [28]:
df.columns.str.contains('i')

array([ True,  True,  True, False, False])

In [29]:
df.loc[2000, df.columns.str.contains('i')]

Agriculture    45.057766
Business       49.803616
Engineering    18.400000
Name: 2000, dtype: float64

---

## `df.loc[row,col]` for setting values

So far we've been accessing values in DataFrames to read/get what's already there, but **the same methods can be used for setting values on DataFrame subsets**!

As was mentioned above, explicitly invoking the `.copy()` method is the best way to ensure you're making a copy of a subset of a DataFrame, if your intention is to work on a copy. This avoids the dreaded *SettingWithCopyWarning*.

The `df.loc[,]` method is the preferred way to index into a DataFrame for setting values on a slice of the data, and making sure you're not doing what's called "Chained indexing", which often leads to a *SettingWithCopyWarning*.

In [30]:
df_set = df.loc[:,['Engineering','Health']].copy()
df_set.head()

Year,Engineering,Health
1970,0.8,77.1
1971,1.0,75.5
1972,1.2,76.9
1973,1.6,77.4
1974,2.2,77.9


In [31]:
df_set.loc[1970,:] = 50
df_set.head()

Year,Engineering,Health
1970,50.0,50.0
1971,1.0,75.5
1972,1.2,76.9
1973,1.6,77.4
1974,2.2,77.9


In [32]:
df_set.loc[df_set['Engineering']<2.0, 'Engineering'] = 0
df_set.head()

Year,Engineering,Health
1970,50.0,50.0
1971,0.0,75.5
1972,0.0,76.9
1973,0.0,77.4
1974,2.2,77.9
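
To spell out what "chained indexing" looks like next to the single `.loc` call, here is a sketch on a three-row frame built by hand. The chained form appears only in a comment, since whether it modifies the original depends on Pandas internals and version.

```python
import pandas as pd

# Tiny stand-in for df_set.
df_set = pd.DataFrame(
    {'Engineering': [0.8, 1.0, 2.2], 'Health': [77.1, 75.5, 77.9]},
    index=pd.Index([1970, 1971, 1974], name='Year'),
)

mask = df_set['Engineering'] < 2.0

# Chained form (avoid):  df_set[mask]['Engineering'] = 0
# The first [] may hand back a temporary copy, so the assignment can be lost.

# Single .loc call: rows and column are resolved together on df_set itself.
df_set.loc[mask, 'Engineering'] = 0

print(df_set['Engineering'].tolist())  # [0.0, 0.0, 2.2]
```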


---

## `df.iloc[row,col]` for integer position-based, multi-axis indexing

**I don't use this as often, but it's good to know it exists, in case you want to select certain positions without needing to know the labels. Label-based selection tends to be less error-prone, though!**

The `df.iloc[row,col]` method lets you select along rows and columns simultaneously using row and column *integer position (from 0 to length-1 of the axis).* You can use [(quoting from the documentation)](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing):

- An integer e.g. 5.
- A list or array of integers [4, 3, 0].
- A slice object with ints 1:7.
- A boolean array

### Single value with two specific positions

Notice that we get the value, but we have to look back at our DataFrame to make sure we've gotten what we wanted! (Here, as long as we haven't re-sorted, `df.iloc[0,0]` happens to be the same as `df.loc[1970,'Agriculture']`.)

In [33]:
df.iloc[0,0]

4.22979798

In [34]:
df.iloc[[20,10,0], [3,2,1]]

Year,Health,Engineering,Business
1990,83.9,14.1,47.200851
1980,83.5,10.3,36.765725
1970,77.1,0.8,9.064439


#### Location handy when Index is out of order

Remember when we sorted the rows by increasing Health percentage, the years ended up out of order.

In [35]:
df_health_sort.iloc[:3, 2:4]

Year,Engineering,Health
1971,1.0,75.5
1972,1.2,76.9
1970,0.8,77.1


#### Boolean tests

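Even if the rows are out of order, we can still test their values. The boolean mask is applied by position, so it must be built from the index of the same (sorted) DataFrame we are selecting from:

In [36]:
df_health_sort.iloc[df_health_sort.index <= 1975, df_health_sort.columns.str.contains('i')]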

Year,Agriculture,Business,Engineering
1971,5.452797,9.503187,1.0
1972,7.42071,10.558962,1.2
1970,4.229798,9.064439,0.8
1973,9.653602,12.804602,1.6
1974,14.074623,16.20485,2.2
1975,18.333162,19.686249,3.2
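
One detail worth knowing about boolean masks with `.iloc`: it works with plain boolean arrays (which is what a comparison on an `Index` returns), but generally rejects a boolean `Series`. A sketch of the workaround, on a hand-built three-row frame (`df_sorted` and the mask names are assumptions for illustration), is to convert the Series with `.to_numpy()` first.

```python
import pandas as pd

# Tiny stand-in for df_health_sort, with the Year index out of order.
df_sorted = pd.DataFrame(
    {'Business': [9.503187, 36.765725, 9.064439], 'Health': [75.5, 83.5, 77.1]},
    index=pd.Index([1971, 1980, 1970], name='Year'),
)

row_mask = df_sorted.index <= 1975               # boolean array from the Index
col_mask = df_sorted.columns.str.contains('i')   # boolean array from the columns
picked = df_sorted.iloc[row_mask, col_mask]

# A test on a column gives a boolean Series; convert it before using .iloc.
health_mask = (df_sorted['Health'] < 80).to_numpy()
picked_health = df_sorted.iloc[health_mask, :]

print(list(picked.index))         # [1971, 1970]
print(list(picked.columns))       # ['Business']
print(list(picked_health.index))  # [1971, 1970]
```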


---

# A method you'll see, but don't really need

A very good article on more advanced Pandas features where there are multiple ways of doing similar things, and [Ted Petrou's](https://medium.com/@petrou.theodore) 
opinions on which to use, is 
[Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428). 

*These pieces are taken from that article, but there's a lot more great content there that we don't have time to cover.*

## Selecting a single column with the "dot" notation

A very common alternative to the `df['name']` bracket notation for selecting a single column is what's called the "dot" notation, where you follow the DataFrame name with a dot and the column name: `df.name`. You'll see it all the time.

In [37]:
min_suff_df = pd.read_csv('data/min_suff_data.csv', index_col='name')
min_suff_df

name,state,favorite food,age,height,count
Jane,NY,Steak,30,165,10
Niko,TX,Lamb,2,70,4
Aaron,FL,Mango,12,120,3
Penelope,AL,Apple,4,80,12
Dean,AK,Cheese,32,180,8
Christina,TX,Melon,33,172,99
Cornelia,TX,Beans,69,150,44


In [38]:
min_suff_df['state']

name
Jane         NY
Niko         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX
Name: state, dtype: object

In [39]:
min_suff_df.state

name
Jane         NY
Niko         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX
Name: state, dtype: object

## Issues with the "dot" notation

There are three issues with using dot notation. It doesn’t work in the following situations:

- When there are spaces in the column name
- When the column name is the same as a DataFrame method
- When the name of a column you want to access is stored in a variable

### Spaces in the column name

In [40]:
min_suff_df['favorite food']

name
Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: favorite food, dtype: object

In [41]:
min_suff_df.favorite food

SyntaxError: invalid syntax (<ipython-input-41-01679cbbde06>, line 1)

### Column name is the same as a DataFrame method

`count` is the name of one of our columns, and it's easy to access those values with the "quoted name in square brackets" notation:

In [42]:
min_suff_df['count']

name
Jane         10
Niko          4
Aaron         3
Penelope     12
Dean          8
Christina    99
Cornelia     44
Name: count, dtype: int64

But if we try to access that column's values with the "dot" notation, there is a problem. The output here will seem confusing, but it's basically saying that `.count` is a method (built-in function) of the DataFrame.

In [43]:
min_suff_df.count

<bound method DataFrame.count of           state favorite food  age  height  count
name                                             
Jane         NY         Steak   30     165     10
Niko         TX          Lamb    2      70      4
Aaron        FL         Mango   12     120      3
Penelope     AL         Apple    4      80     12
Dean         AK        Cheese   32     180      8
Christina    TX         Melon   33     172     99
Cornelia     TX         Beans   69     150     44>

### The column name is stored in a variable

It's not uncommon to want to access a column whose name string has been stored in a variable:

In [44]:
col_name = 'height'
min_suff_df[col_name]

name
Jane         165
Niko          70
Aaron        120
Penelope      80
Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64

But, again, that doesn't work with the "dot" notation:

In [45]:
min_suff_df.col_name

AttributeError: 'DataFrame' object has no attribute 'col_name'
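
All three cases can be recapped in one place with the bracket notation. This sketch uses a two-row frame built by hand from the table above (the name `people` is made up for the example) rather than the CSV file.

```python
import pandas as pd

# Two rows from the table above, built by hand.
people = pd.DataFrame(
    {'favorite food': ['Steak', 'Lamb'], 'count': [10, 4], 'height': [165, 70]},
    index=pd.Index(['Jane', 'Niko'], name='name'),
)

col_name = 'height'

print(people['favorite food'].tolist())  # ['Steak', 'Lamb']  (space in the name)
print(people['count'].tolist())          # [10, 4]            (clashes with a method)
print(people[col_name].tolist())         # [165, 70]          (name held in a variable)

# With the dot, 'count' resolves to the DataFrame method, not the column.
print(callable(people.count))            # True
```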

## Lots of Pandas is written with the dot notation. Why?

Many tutorials make use of the dot notation to select a single column of data. Why is this done when the brackets seem to be clearly superior? 

- **It might be because the official documentation contains plenty of examples that use it.**
- **It also *uses three fewer characters which entices the very laziest amongst us*.**