# Accessing your data in DataFrames

*Note: There is also a longer, more complete version of this information in the notebook [FullAccessingDataFrames](FullAccessingDataFrames.ipynb)*

One big problem with Pandas is that, sometimes for historical reasons, there are multiple ways of doing the same thing. **The syntax for accessing DataFrames has always been one of the most confusing aspects for me!**

Let's make a toy DataFrame to see some of the features.

---

*To preserve the mystery, select from the notebook menus*

`Edit -> Clear All Outputs`

---

In [1]:
import pandas as pd

In [2]:
data_dict = {'letters':['A','B','c','D','eee'], 
             'hundreds':[100,200,300,400,500], 
             'tens':[10.0,20.0,30.0,40.0,50.0],
             'boolean':[True,False,True,True,False]}

In [3]:
df = pd.DataFrame(data_dict)
df

,letters,hundreds,tens,boolean
0,A,100,10.0,True
1,B,200,20.0,False
2,c,300,30.0,True
3,D,400,40.0,True
4,eee,500,50.0,False


## DataFrame Attributes

### Each column has a data "type"

- **object** is how Pandas refers to strings of text
- **int64** is a 64-bit integer (whole number). The number of bits is just the amount of internal storage used for that number. *For integers it limits how big the number can be.* **Note that the default int64 cannot store NaN/Null values!** See the [integer NA documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html) for more information.
- **float64** is a "floating point" number (number with decimal places). *For floats the number of bits limits the precision of the number.*
- **bool** is a boolean value, which is just True/False

In [4]:
df.dtypes

letters      object
hundreds      int64
tens        float64
boolean        bool
dtype: object
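
The note above about `int64` and missing values can be seen in a small sketch. The Series names here are made up for illustration, and `"Int64"` (with a capital I) is the nullable integer dtype described in the integer NA documentation linked above.

```python
import pandas as pd

# With default inference, a missing value forces the integers to become floats
s_default = pd.Series([1, 2, None])
print(s_default.dtype)

# Requesting the nullable "Int64" dtype keeps whole numbers and marks the gap as <NA>
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)
print(s_nullable)
```

If a column you expected to be integers shows up as `float64`, a missing value somewhere in the column is a likely cause.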

### DataFrame index

Notice the column of sequential integers off to the left-hand side of the DataFrame output. That is the DataFrame's **index**. 

- **The index contains the names of the rows**
- Because we didn't explicitly specify an index column, Pandas created one for us

In [5]:
df.index

RangeIndex(start=0, stop=5, step=1)

### DataFrame columns

There is a separate index of column names

In [6]:
df.columns

Index(['letters', 'hundreds', 'tens', 'boolean'], dtype='object')

---

## `df[]` with a name for column indexing/selecting

The most common and concise way of selecting a column out of a DataFrame is to use square brackets with the column name inside, similar to how you access a dictionary value using its key.

### Single name gives single column

This returns a Series

In [7]:
df['hundreds']

0    100
1    200
2    300
3    400
4    500
Name: hundreds, dtype: int64

### List of column names inside the square brackets gives multiple columns

You can select multiple columns by putting a *list of column names* inside the square brackets. This returns a DataFrame

In [8]:
df[['tens','hundreds']]

,tens,hundreds
0,10.0,100
1,20.0,200
2,30.0,300
3,40.0,400
4,50.0,500
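
The difference between a single name and a list of names is easy to check directly. This is a minimal sketch using a small made-up DataFrame (`df_demo` is not part of the notebook's data): even a list with only one name in it gives back a DataFrame rather than a Series.

```python
import pandas as pd

# Small illustrative DataFrame, separate from the notebook's df
df_demo = pd.DataFrame({'hundreds': [100, 200], 'tens': [10.0, 20.0]})

one_col_series = df_demo['tens']     # single name -> Series
one_col_frame = df_demo[['tens']]    # list containing one name -> DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(one_col_frame.shape)
```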


---

## `df[]` with boolean Series returns multiple rows, all columns

### Boolean Series

**It's very common to want all the rows from your DataFrame which pass a certain test**, or set of criteria. You can write the test itself in a very straightforward way, returning a `Series` of True/False boolean values.

In [9]:
df['tens'] < 35

0     True
1     True
2     True
3    False
4    False
Name: tens, dtype: bool

### Rows are returned where boolean Series == True

**Things are getting screwy! If we use a single bracket with a boolean Series inside, we get back rows instead of columns!**

*This can be confusing, but it's convenient enough that you'll see it quite often. Note that the `df.loc[rows, :]` notation described below does the same thing and is somewhat clearer and more readable.*


In [10]:
df[df['tens'] < 35]

,letters,hundreds,tens,boolean
0,A,100,10.0,True
1,B,200,20.0,False
2,c,300,30.0,True


In [11]:
df[df['boolean']]

,letters,hundreds,tens,boolean
0,A,100,10.0,True
2,c,300,30.0,True
3,D,400,40.0,True


### More complicated conditionals

You can combine multiple conditions if you put them in parentheses and put a logical operator between.

- `&` = "and"
- `|` = "or"  *(It's called a "pipe" character, and it's `shift-\` above the Enter/Return key)*

In [12]:
df[(df['tens'] < 35) & (df['hundreds'] > 200)]

,letters,hundreds,tens,boolean
2,c,300,30.0,True
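
The "or" operator works the same way as the `&` example, and a condition can be inverted with `~` ("not"). Here is a hedged sketch on a made-up DataFrame (`df_demo` is illustrative only). The parentheses matter because Python evaluates `&` and `|` before comparisons such as `<` and `>`.

```python
import pandas as pd

# Illustrative data, separate from the notebook's df
df_demo = pd.DataFrame({'hundreds': [100, 200, 300, 400, 500],
                        'tens': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Rows where EITHER condition is True
either = df_demo[(df_demo['tens'] < 15) | (df_demo['hundreds'] > 400)]
print(either)

# ~ flips a boolean Series, so this keeps rows where tens is NOT below 35
not_small = df_demo[~(df_demo['tens'] < 35)]
print(not_small)
```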


---

*To try the exercise below, if you haven't been executing the cells above, select this cell and from the Jupyter menus choose*

`Run -> Run All Above Selected Cell`

### **EXERCISE**

**Return all rows in `df` where the "hundreds" column value is greater than or equal to 400**

*Note: Type instead of using copy/paste for better retention*

---

## Row Index doesn't have to be integers!

The Index values aren't "row numbers". Instead, **the Index values are the names of the rows, so you can use things like strings or dates for the Index**. Let's start by making a new column.

*Note that if we are creating a new column from a list, we need to make sure the order of the list matches the order of the DataFrame rows! A more reliable solution is to create a Series with explicit Index values and assign the new column from that. If you initialize a Series with a dictionary, the keys are used as the row Index. Try reordering the key-value pairs and verify that the row assignments stay the same!*

*Better:* `df['spelled_out'] = pd.Series({0:'One', 1:'Two', 2:'Three', 3:'Four', 4:'Five'})`
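
To see the Index-based alignment in action, here is a small sketch with a made-up three-row DataFrame (`df_demo` and `shuffled` are illustrative names). The dictionary keys are deliberately out of order, but each value still lands in the row whose Index label matches its key.

```python
import pandas as pd

# Illustrative DataFrame with the default 0, 1, 2 Index
df_demo = pd.DataFrame({'letters': ['A', 'B', 'c']})

# Keys are in scrambled order on purpose
shuffled = pd.Series({2: 'Three', 0: 'One', 1: 'Two'})

# Assignment matches on Index labels, not on position
df_demo['spelled_out'] = shuffled
print(df_demo)
```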

Here we'll just do it the simpler way...

In [13]:
df['spelled_out'] = ['One','Two','Three','Four','Five']
df

,letters,hundreds,tens,boolean,spelled_out
0,A,100,10.0,True,One
1,B,200,20.0,False,Two
2,c,300,30.0,True,Three
3,D,400,40.0,True,Four
4,eee,500,50.0,False,Five


### We can set one of the columns as the Index

In [14]:
df2 = df.set_index('spelled_out')
df2

spelled_out,letters,hundreds,tens,boolean
One,A,100,10.0,True
Two,B,200,20.0,False
Three,c,300,30.0,True
Four,D,400,40.0,True
Five,eee,500,50.0,False


### Every column's Series has this same Index

In [15]:
df2['letters']

spelled_out
One        A
Two        B
Three      c
Four       D
Five     eee
Name: letters, dtype: object

In [16]:
df2['tens']

spelled_out
One      10.0
Two      20.0
Three    30.0
Four     40.0
Five     50.0
Name: tens, dtype: float64

---

## `df.loc[row,col]` for label-based, multi-axis indexing

**This is the best access/selection method!** It lets you select along rows and along columns simultaneously, using row and column "labels", which are the row Index values and the column names.

In [17]:
df2.loc['One','letters']

'A'

### Colon `:` for selecting whole or slices

A notation that comes from accessing lists in Python is the "slice" operator, which is specified with a colon between two values. 

- **The colon by itself denotes all Index entries in a row or column.**

So, here we return all rows and a single column.

In [18]:
df2.loc[:,'hundreds']

spelled_out
One      100
Two      200
Three    300
Four     400
Five     500
Name: hundreds, dtype: int64

### Slice end-points are included in Pandas selections

Unlike Python lists, label-based slices in Pandas (such as the ones used with `.loc`) include both end points!


In [19]:
df2.loc[:,'hundreds':'tens']

spelled_out,hundreds,tens
One,100,10.0
Two,200,20.0
Three,300,30.0
Four,400,40.0
Five,500,50.0
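
A side-by-side comparison may make the end-point rule easier to remember. This is a minimal sketch with made-up data (`letters_list` and `s` are illustrative): a plain Python list slice and a position-based `.iloc` slice both leave out the stop position, while a label-based `.loc` slice includes the stop label.

```python
import pandas as pd

# Plain Python list: the stop position (3) is excluded
letters_list = ['a', 'b', 'c', 'd']
print(letters_list[1:3])

# Series with string labels for the Index
s = pd.Series([10, 20, 30, 40], index=['w', 'x', 'y', 'z'])

# Label-based slice: the stop label 'z' IS included
by_label = s.loc['x':'z']
print(by_label)

# Position-based slice: behaves like the list, stop position excluded
by_position = s.iloc[1:3]
print(by_position)
```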


---

*To try the exercise below, if you haven't been executing the cells above, select this cell and from the Jupyter menus choose*

`Run -> Run All Above Selected Cell`

### **EXERCISE**

**Do you know how to use the slice notation to specify just the rows up to and including "Three"?**

*Note: Type instead of using copy/paste for better retention*

---

### Lists for combinations

Lists of values work the same as with the `df[]` notation. *Note that again, order matters for what is returned!*

In [20]:
df2.loc[:,['tens','letters']]

spelled_out,tens,letters
One,10.0,A
Two,20.0,B
Three,30.0,c
Four,40.0,D
Five,50.0,eee


### Single rows are a Series, too

- Remember, any 1D result, row or column, will be a Series in Pandas
- When you're returning a row, the Index will be the column names

In [21]:
df2.loc['Three',:]

letters        c
hundreds     300
tens        30.0
boolean     True
Name: Three, dtype: object
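
Notice the `dtype: object` in the output above. A row usually mixes several column types, so Pandas falls back to the general `object` type to hold them all in one Series. The sketch below shows this on a made-up DataFrame (`df_demo` is illustrative only).

```python
import pandas as pd

# Illustrative DataFrame with a string, an integer, and a float column
df_demo = pd.DataFrame({'letters': ['A', 'B'],
                        'hundreds': [100, 200],
                        'tens': [10.0, 20.0]},
                       index=['One', 'Two'])

row = df_demo.loc['Two', :]

print(list(row.index))            # the column names become the row's Index
print(row.dtype)                  # mixed types in one Series
print(df_demo['hundreds'].dtype)  # a single column keeps its own type
```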

### Boolean Series can be used for selecting True rows or columns

But notice with this syntax you can tell easily whether you're using the boolean Series to select rows or columns, unlike with the `df[]` syntax.

In [22]:
df2.loc[df2['tens']<35,:]

spelled_out,letters,hundreds,tens,boolean
One,A,100,10.0,True
Two,B,200,20.0,False
Three,c,300,30.0,True
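
The example above selects rows, but a boolean Series in the second position selects columns instead. The sketch below assumes a made-up, all-numeric DataFrame (`df_demo` and `big_columns` are illustrative names) and keeps only the columns whose maximum value is over 100. The boolean Series has to be indexed by column name for this to work.

```python
import pandas as pd

# Illustrative numeric-only DataFrame
df_demo = pd.DataFrame({'hundreds': [100, 200, 300],
                        'tens': [10.0, 20.0, 30.0]},
                       index=['One', 'Two', 'Three'])

# One True/False value per column, indexed by column name
big_columns = df_demo.max() > 100
print(big_columns)

# All rows, only the columns marked True
selected = df_demo.loc[:, big_columns]
print(selected)
```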


---

*To try the exercise below, if you haven't been executing the cells above, select this cell and from the Jupyter menus choose*

`Run -> Run All Above Selected Cell`

### **EXERCISE**

**Using the `df2.loc[]` notation, return all columns for all rows in which the "boolean" column is True**

*Note: Type instead of using copy/paste for better retention*

---

## `df.query()` for selecting rows like in SQL "where" statement

**The idea is to return rows where column content meets certain criteria, like when you use a boolean series inside of a `df[]` statement**

This is not a syntax I'm very experienced with, but people coming from the SQL (Structured Query Language) relational database query world might feel more comfortable with this form. It does sometimes have the advantage of shorter, more readable conditional expressions.

Some bulleted text below is from the Pandas [DataFrame.query documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html):

### Format your query as a string in quotes

- You can refer to column names (that don't have spaces) directly in the expression

In [23]:
df2.query('tens < 35')

spelled_out,letters,hundreds,tens,boolean
One,A,100,10.0,True
Two,B,200,20.0,False
Three,c,300,30.0,True


### Query can refer to variables in the environment

- You can refer to variables in the environment by prefixing them with an ‘@’ character

In [24]:
cutoff_value = 35
df2.query('tens < @cutoff_value')

spelled_out,letters,hundreds,tens,boolean
One,A,100,10.0,True
Two,B,200,20.0,False
Three,c,300,30.0,True


### Need backticks around column names with spaces

- You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuations (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as \`Area (cm^2)\`). 
- Strings in the conditional need the opposite type of quotes than you've used to surround the expression itself


*Note that this returns a DataFrame even for a single row, which is actually the same behavior as*

```
df2[df2['more letters'] == 'DD']
```

In [25]:
df2['more letters'] = df2['letters'] + df2['letters']

df2.query("`more letters` == 'DD'")

spelled_out,letters,hundreds,tens,boolean,more letters
Four,D,400,40.0,True,DD


### Complex conditionals use "and" / "or"

- Logical expressions can either use "and" or "or", or "&" or "|" for more complex conditionals.

**This is where the queries are slightly more readable than the other form**

```
df2[(df2['tens'] < 35) & (df2['boolean'] == True)]
```

In [26]:
df2.query("tens < 35 and boolean == True")

spelled_out,letters,hundreds,tens,boolean,more letters
One,A,100,10.0,True,AA
Three,c,300,30.0,True,cc
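
The "or" form works the same way. As a quick check that the two styles really are equivalent, this sketch runs the same condition both ways on a made-up DataFrame (`df_demo` is illustrative only) and compares the results.

```python
import pandas as pd

# Illustrative data, separate from the notebook's df2
df_demo = pd.DataFrame({'hundreds': [100, 200, 300, 400, 500],
                        'tens': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Query string version
result_query = df_demo.query("tens < 15 or hundreds > 400")

# Boolean Series version of the same condition
result_brackets = df_demo[(df_demo['tens'] < 15) | (df_demo['hundreds'] > 400)]

print(result_query)
print(result_query.equals(result_brackets))
```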


---

## I've skipped the integer selection methods

Indexing by integers using the `df.iloc[row index, column index]` method is good to know about, but frankly, I hardly ever find myself using that method – I prefer to use labels. 

*I never use integers inside the `df[]` notation – it is just a historical leftover that should be avoided!*

See the [FullAccessingDataFrames notebook](FullAccessingDataFrames.ipynb) to learn more.
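
For reference, here is a very short sketch of what `.iloc` looks like, using a made-up DataFrame (`df_demo` is illustrative only). Positions start at 0, and slices leave out the stop position, just like Python lists.

```python
import pandas as pd

# Illustrative DataFrame with string row labels
df_demo = pd.DataFrame({'letters': ['A', 'B', 'c'],
                        'hundreds': [100, 200, 300]},
                       index=['One', 'Two', 'Three'])

# First row, second column, by position
by_position = df_demo.iloc[0, 1]
print(by_position)

# The same cell by label
by_label = df_demo.loc['One', 'hundreds']
print(by_label)

# First two rows, all columns (position 2 is excluded)
first_two = df_demo.iloc[:2, :]
print(first_two)
```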

---

## SettingWithCopyWarning

If you select a subset of your DataFrame and assign to a new variable as a way to keep just a couple columns of your DataFrame, you will run into a **SettingWithCopyWarning** if you try to change that new DataFrame. 

*Because of some Pandas inner-workings, when you slice or index into a DataFrame it's not actually clear whether it will create a "view" into the original DataFrame or a copy of the data!* 

In [27]:
df_nums = df2[['hundreds','tens']]
df_nums

spelled_out,hundreds,tens
One,100,10.0
Two,200,20.0
Three,300,30.0
Four,400,40.0
Five,500,50.0


### It does the operation but gives you a confusing warning

You don't want these warnings all over the place in your notebook or you'll miss really important errors and warnings, and there's a chance you might be doing something wrong, anyway!

In [28]:
df_nums['sums'] = df_nums['hundreds'] + df_nums['tens']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_nums['sums'] = df_nums['hundreds'] + df_nums['tens']


### Create an explicit `.copy()`

**If you are assigning a subset of a DataFrame to a new variable**, with the intention of creating a copy that you'll work on independently of the original, **create an explicit copy by chaining the `.copy()` method to the end!**

Additional learning resources:
- [The best explanation of the SettingWithCopyWarning I've seen](https://www.dataquest.io/blog/settingwithcopywarning/)
- [A complicated explanation in the Pandas documentation](https://pandas.pydata.org/pandas-docs/version/0.22/indexing.html#indexing-view-versus-copy)


In [29]:
df_nums = df2[['hundreds','tens']].copy()
df_nums['sums'] = df_nums['hundreds'] + df_nums['tens']
df_nums

spelled_out,hundreds,tens,sums
One,100,10.0,110.0
Two,200,20.0,220.0
Three,300,30.0,330.0
Four,400,40.0,440.0
Five,500,50.0,550.0


---

## `df.loc[row,col]` for setting values

So far we've been accessing values in DataFrames to read/get what's already there, but **the same methods can be used for setting values on DataFrame subsets**!

In [30]:
df_set = df2.loc[:,['tens','hundreds']].copy()
df_set

spelled_out,tens,hundreds
One,10.0,100
Two,20.0,200
Three,30.0,300
Four,40.0,400
Five,50.0,500


In [31]:
df_set.loc['Three',:] = 50
df_set

spelled_out,tens,hundreds
One,10.0,100
Two,20.0,200
Three,50.0,50
Four,40.0,400
Five,50.0,500


In [32]:
df_set.loc[df_set['tens'] <= 45.0, 'hundreds'] = 0
df_set

spelled_out,tens,hundreds
One,10.0,0
Two,20.0,0
Three,50.0,50
Four,40.0,0
Five,50.0,500


---

## Selection method you'll see, but don't really need!

A very good article on more advanced Pandas features where there are multiple ways of doing similar things, and [Ted Petrou's](https://medium.com/@petrou.theodore) 
opinions on which to use, is 
[Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428). 

*These pieces are taken from that article, but there's a lot more great content there that we don't have time to cover.*

### Selecting a single column with the "dot" notation

A very common alternative to the `df['name']` bracket notation for selecting a single column is the "dot" notation, where you follow the DataFrame name with a dot and the column name: `df.name`. You'll see it all the time.

In [33]:
df2.hundreds

spelled_out
One      100
Two      200
Three    300
Four     400
Five     500
Name: hundreds, dtype: int64

### Issues with the "dot" notation

There are three issues with using dot notation. It doesn’t work in the following situations:

- When there are spaces in the column name
- When the column name is the same as a DataFrame method
- When the name of a column you want to access is stored in a variable

#### Spaces in the column name

In [34]:
df2['more letters']

spelled_out
One          AA
Two          BB
Three        cc
Four         DD
Five     eeeeee
Name: more letters, dtype: object

In [35]:
df2.more letters

SyntaxError: invalid syntax (192981390.py, line 1)

#### Column name is the same as a DataFrame method

`sum` can be the name of one of our columns, and it's easy to access those values with the "quoted name in square brackets" notation:

In [36]:
df2['sum'] = df2['tens'] + df2['hundreds']
df2['sum']

spelled_out
One      110.0
Two      220.0
Three    330.0
Four     440.0
Five     550.0
Name: sum, dtype: float64

**But if we try to access that column's values with the "dot" notation, there is a problem.** The output here will seem confusing, but it's basically saying that `.sum` is a method (built-in function) of the DataFrame.

In [37]:
df2.sum

<bound method NDFrame._add_numeric_operations.<locals>.sum of             letters  hundreds  tens  boolean more letters    sum
spelled_out                                                     
One               A       100  10.0     True           AA  110.0
Two               B       200  20.0    False           BB  220.0
Three             c       300  30.0     True           cc  330.0
Four              D       400  40.0     True           DD  440.0
Five            eee       500  50.0    False       eeeeee  550.0>

#### The column name is stored in a variable

It's not uncommon to want to access a column whose name string has been stored in a variable.

In [38]:
column_name = 'tens'
df2[column_name]

spelled_out
One      10.0
Two      20.0
Three    30.0
Four     40.0
Five     50.0
Name: tens, dtype: float64

**But, again, that doesn't work with the "dot" notation**

In [39]:
df2.column_name

AttributeError: 'DataFrame' object has no attribute 'column_name'

### Lots of Pandas is written with the dot notation. Why?

Many tutorials make use of the dot notation to select a single column of data. Why is this done when the brackets seem to be clearly superior? 

- It might be because **the official documentation contains plenty of examples that use it**
- It also ***uses three fewer characters which entices the very laziest amongst us***

---