# Assigning subsets

## Assignment
You have used the assignment operation so often that it is second nature by now. Assigning to a subset of a DataFrame or Series follows the same pattern:

```{code-block} python
variable = expression
subset_of_DataFrame_or_Series = new_values
```

In [1]:
import pandas as pd
import numpy as np

We will keep the first column of the dataset as the `index` of the DataFrame.

In [2]:
df = pd.read_csv('../data/employee_sample.csv', index_col=0)
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY
Tom,Male,White,Engineering,23,107962
Niko,Male,Black,Engineering,1,30347
Penelope,Female,White,Engineering,12,60258
Aria,Female,Black,Engineering,8,43618
Sofia,Female,Black,Parks & Recreation,23,26125
Dean,Male,Black,Parks & Recreation,3,33592
Zach,Male,White,Parks & Recreation,4,37565


## Creating a new column
Before we change any of the data in this DataFrame, we will add a single column to the end. There are multiple ways of doing so, but we will begin by using `[ ]`. Place the new column name as a string inside the brackets and make this the left-hand side of the assignment.

The right-hand side can consist of any of the following:
* A scalar value
* A list or array with the same length as the DataFrame
* A pandas Series with an index that matches the index of the DataFrame (a little tricky!)

### Using a scalar value

A **scalar** value is simply one single value, like an integer, string, boolean or date. When using a scalar for column assignment, each value in the column will be the same. Let's create a column **SCORE** and assign it the value 99.

In [3]:
df['SCORE'] = 99
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE
Tom,Male,White,Engineering,23,107962,99
Niko,Male,Black,Engineering,1,30347,99
Penelope,Female,White,Engineering,12,60258,99
Aria,Female,Black,Engineering,8,43618,99
Sofia,Female,Black,Parks & Recreation,23,26125,99
Dean,Male,Black,Parks & Recreation,3,33592,99
Zach,Male,White,Parks & Recreation,4,37565,99


### Using list or array

Instead of creating a new column with all the same values, we can use a list or NumPy array with different values for each row. The only stipulation is that the number of new values in the list/array must be the same as the number of rows in the DataFrame.

Let's create the column **BONUS RATE**, with a list of numbers between 0 and 1.

In [4]:
df['BONUS RATE'] = [.2, .1, 0, .15, .12, .3, 0.5]
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE
Tom,Male,White,Engineering,23,107962,99,0.2
Niko,Male,Black,Engineering,1,30347,99,0.1
Penelope,Female,White,Engineering,12,60258,99,0.0
Aria,Female,Black,Engineering,8,43618,99,0.15
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3
Zach,Male,White,Parks & Recreation,4,37565,99,0.5
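
To see why the length stipulation matters, here is a minimal sketch on a small, hypothetical three-row DataFrame (`demo` and its values are made up for illustration and are not part of the employee dataset). Assigning a list of the wrong length should raise a `ValueError` and leave the DataFrame unchanged.

```python
import pandas as pd

# Hypothetical three-row DataFrame, used only for illustration
demo = pd.DataFrame({'SALARY': [100, 200, 300]}, index=['a', 'b', 'c'])

try:
    # Two values for three rows: the lengths do not match
    demo['BONUS RATE'] = [0.1, 0.2]
    error_raised = False
except ValueError as err:
    error_raised = True
    print(err)

# The failed assignment did not add the column
print('BONUS RATE' in demo.columns)
```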


We could just as easily have used a one-dimensional NumPy array to get exactly the same result. Let's do just that and create a random array of integers to represent the floor that each employee works on.

We use the `randint()` function from NumPy's `random` module. Use the *low* (inclusive) and *high* (exclusive) parameters to bound the range of possible integers. `len(df)` returns the number of rows in the DataFrame, ensuring that the size of the array is correct.

In [5]:
floor = np.random.randint(low=1, high=10, size=len(df))
floor

array([2, 4, 6, 2, 9, 6, 8])
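
The bounds can be checked directly. The sketch below uses a hypothetical `n_rows = 7` as a stand-in for `len(df)`. It also shows NumPy's `Generator` interface as an alternative that makes the result reproducible; the seed value is an arbitrary choice for illustration.

```python
import numpy as np

n_rows = 7  # stand-in for len(df); assumed for this sketch

floor_demo = np.random.randint(low=1, high=10, size=n_rows)

# low is inclusive and high is exclusive, so values fall in 1..9
print(floor_demo.min() >= 1, floor_demo.max() < 10)

# Alternative: a seeded Generator gives the same values on every run
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
floor_seeded = rng.integers(low=1, high=10, size=n_rows)
print(floor_seeded)
```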

Then assign this to the **FLOOR** column:

In [6]:
df['FLOOR'] = floor
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR
Tom,Male,White,Engineering,23,107962,99,0.2,2
Niko,Male,Black,Engineering,1,30347,99,0.1,4
Penelope,Female,White,Engineering,12,60258,99,0.0,6
Aria,Female,Black,Engineering,8,43618,99,0.15,2
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,9
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,6
Zach,Male,White,Parks & Recreation,4,37565,99,0.5,8


### Using a Series (tricky!)

Let's create a new pandas Series and see what happens when we attempt to assign it as a new column in our DataFrame.

Let's try and add a column for the last name of each person.

In [7]:
last_name = pd.Series(['Smith', 'Jones', 'Williams', 'Green', 'Brown'])
last_name

0       Smith
1       Jones
2    Williams
3       Green
4       Brown
dtype: object

In [8]:
df['last name'] = last_name
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name
Tom,Male,White,Engineering,23,107962,99,0.2,2,
Niko,Male,Black,Engineering,1,30347,99,0.1,4,
Penelope,Female,White,Engineering,12,60258,99,0.0,6,
Aria,Female,Black,Engineering,8,43618,99,0.15,2,
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,9,
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,6,
Zach,Male,White,Parks & Recreation,4,37565,99,0.5,8,


In [9]:
last_name.index

RangeIndex(start=0, stop=5, step=1)

In [10]:
df.index

Index(['Tom', 'Niko', 'Penelope', 'Aria', 'Sofia', 'Dean', 'Zach'], dtype='object')

So the DataFrame `df` and the Series `last_name` have different indexes, and that is causing the problem. Let's try to fix it by giving the Series the first five labels of the DataFrame's index.

In [11]:
last_name.index = df.index[:len(last_name)]; last_name

Tom            Smith
Niko           Jones
Penelope    Williams
Aria           Green
Sofia          Brown
dtype: object

Make the assignment like we have done above:

In [12]:
df['LAST NAME'] = last_name
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME
Tom,Male,White,Engineering,23,107962,99,0.2,2,,Smith
Niko,Male,Black,Engineering,1,30347,99,0.1,4,,Jones
Penelope,Female,White,Engineering,12,60258,99,0.0,6,,Williams
Aria,Female,Black,Engineering,8,43618,99,0.15,2,,Green
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,9,,Brown
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,6,,
Zach,Male,White,Parks & Recreation,4,37565,99,0.5,8,,


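**Missing values!**

The **last name** column from our first attempt contains only missing values, and **LAST NAME** is still missing for Dean and Zach, whose labels do not appear in the five-element Series. Both results follow from the way pandas combines two pandas objects.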

#### Automatic alignment

Whenever two pandas objects are combined in some fashion, the row/column index of one is aligned with the row/column index of the other. This all happens silently and implicitly behind the scenes, so if you are unaware of it, you will be completely taken by surprise.

Our first assignment failed to add the last names because the index of our Series was the integers 0 through 4, while the index of the DataFrame consists of the names of the employees. There were no index values in common between the two objects, so pandas filled every row with NaN ("not a number"), its marker for missing values.
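
The alignment rule can be reproduced on a small scale. The following sketch uses a hypothetical three-row DataFrame (`demo`) and made-up names: one Series shares no labels with the DataFrame, the other shares only some of them.

```python
import pandas as pd

# Hypothetical DataFrame indexed by first name
demo = pd.DataFrame({'SALARY': [100, 200, 300]},
                    index=['Tom', 'Niko', 'Aria'])

# Default integer index (0, 1, 2): no labels in common with demo
no_overlap = pd.Series(['Smith', 'Jones', 'Green'])
demo['NO OVERLAP'] = no_overlap

# Only 'Tom' and 'Aria' are present, so 'Niko' is left missing
partial = pd.Series(['Smith', 'Green'], index=['Tom', 'Aria'])
demo['PARTIAL'] = partial

print(demo)
```

Even though `no_overlap` has exactly as many values as the DataFrame has rows, its values are matched by label rather than by position, so none of them are used.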

#### Same Index as DataFrame

To use a Series to create a new column, its index must match the index of the DataFrame being modified. Let's re-create our Series with the same index as the DataFrame.

In [13]:
last_name = pd.Series(data=['Smith', 'Jones', 'Williams', 'Green', 'Brown', 'Simpson', 'Peters'],
                      index=df.index)
last_name

Tom            Smith
Niko           Jones
Penelope    Williams
Aria           Green
Sofia          Brown
Dean         Simpson
Zach          Peters
dtype: object

Let's try that assignment again. This will overwrite our previous **LAST NAME** column.

In [14]:
df['LAST NAME'] = last_name
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME
Tom,Male,White,Engineering,23,107962,99,0.2,2,,Smith
Niko,Male,Black,Engineering,1,30347,99,0.1,4,,Jones
Penelope,Female,White,Engineering,12,60258,99,0.0,6,,Williams
Aria,Female,Black,Engineering,8,43618,99,0.15,2,,Green
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,9,,Brown
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,6,,Simpson
Zach,Male,White,Parks & Recreation,4,37565,99,0.5,8,,Peters
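
Because values are matched by label, the order of the Series does not matter as long as the labels agree. The sketch below, on a hypothetical `demo` DataFrame, illustrates this and also shows one way to opt out of alignment: converting the Series to a NumPy array with `to_numpy()`, which assigns purely by position.

```python
import pandas as pd

demo = pd.DataFrame({'SALARY': [100, 200, 300]},
                    index=['Tom', 'Niko', 'Aria'])

# Same labels as demo, but listed in a different order
shuffled = pd.Series(['Green', 'Smith', 'Jones'],
                     index=['Aria', 'Tom', 'Niko'])

# Assigned by label: each name lands on its matching row
demo['BY LABEL'] = shuffled

# Assigned by position: the index is dropped, so the order is kept as-is
demo['BY POSITION'] = shuffled.to_numpy()

print(demo)
```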


### Using Expressions

We can create a new column by combining any number of other columns. One primary way of doing that is through a mathematical expression. For instance, let's create a new column **`BONUS`** by multiplying the **`BONUS RATE`** and **`SALARY`** columns together.

```{note}
Output before assignment

Before adding this new column to your DataFrame, you might want to consider viewing the output before making the assignment. This gives you a little preview so that you can check your work before doing the more permanent assignment.
```

Let's multiply our two columns without assignment:

In [15]:
df['BONUS RATE'] * df['SALARY']

Tom         21592.4
Niko         3034.7
Penelope        0.0
Aria         6542.7
Sofia        3135.0
Dean        10077.6
Zach        18782.5
dtype: float64

Everything appears to be OK, so go ahead and make the assignment. Notice that the result is a Series with the same index as the DataFrame, so it aligns with every row.

In [16]:
df['BONUS'] = df['BONUS RATE'] * df['SALARY']
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,99,0.2,2,,Smith,21592.4
Niko,Male,Black,Engineering,1,30347,99,0.1,4,,Jones,3034.7
Penelope,Female,White,Engineering,12,60258,99,0.0,6,,Williams,0.0
Aria,Female,Black,Engineering,8,43618,99,0.15,2,,Green,6542.7
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,9,,Brown,3135.0
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,6,,Simpson,10077.6
Zach,Male,White,Parks & Recreation,4,37565,99,0.5,8,,Peters,18782.5
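
Another way to preview a computed column is the `assign()` method, which returns a new DataFrame and leaves the original alone. This is a sketch on a hypothetical two-row DataFrame, offered as an alternative to the bracket approach used above rather than a replacement for it.

```python
import pandas as pd

# Hypothetical data, chosen so the products are easy to check by hand
demo = pd.DataFrame({'SALARY': [1000, 4000],
                     'BONUS RATE': [0.5, 0.25]},
                    index=['Tom', 'Niko'])

# assign() returns a new DataFrame that includes the BONUS column
preview = demo.assign(BONUS=demo['BONUS RATE'] * demo['SALARY'])
print(preview)

# The original DataFrame is untouched until we assign explicitly
print('BONUS' in demo.columns)
```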


## Actual Subset Assignment

So far, we have just added new columns to our DataFrame. We did not change any of the pre-existing values. Let's begin doing this by changing each person's **SCORE** to 100.

The syntax is the same, whether it's adding a new column or changing an existing column:

In [17]:
df['SCORE'] = 100
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,100,0.2,2,,Smith,21592.4
Niko,Male,Black,Engineering,1,30347,100,0.1,4,,Jones,3034.7
Penelope,Female,White,Engineering,12,60258,100,0.0,6,,Williams,0.0
Aria,Female,Black,Engineering,8,43618,100,0.15,2,,Green,6542.7
Sofia,Female,Black,Parks & Recreation,23,26125,100,0.12,9,,Brown,3135.0
Dean,Male,Black,Parks & Recreation,3,33592,100,0.3,6,,Simpson,10077.6
Zach,Male,White,Parks & Recreation,4,37565,100,0.5,8,,Peters,18782.5


### Overwriting a column

You can use the column you are assigning to in the expression on the right-hand side of the equals sign. For instance, if we want to remove the decimals from the **BONUS** column, we can call the `astype()` method on it and assign the result back to the same column. Note that converting to `int` drops the fractional part rather than rounding.

In [18]:
df['BONUS'] = df['BONUS'].astype(int)
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,100,0.2,2,,Smith,21592
Niko,Male,Black,Engineering,1,30347,100,0.1,4,,Jones,3034
Penelope,Female,White,Engineering,12,60258,100,0.0,6,,Williams,0
Aria,Female,Black,Engineering,8,43618,100,0.15,2,,Green,6542
Sofia,Female,Black,Parks & Recreation,23,26125,100,0.12,9,,Brown,3135
Dean,Male,Black,Parks & Recreation,3,33592,100,0.3,6,,Simpson,10077
Zach,Male,White,Parks & Recreation,4,37565,100,0.5,8,,Peters,18782
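
The difference between truncating and rounding is visible in the output above: 6542.7 became 6542. If rounding is what you want, one option is to call `round()` before converting. The sketch below uses three of the bonus values from the table in a stand-alone Series.

```python
import pandas as pd

bonus = pd.Series([21592.4, 6542.7, 10077.6],
                  index=['Tom', 'Aria', 'Dean'])

# astype(int) drops the fractional part
truncated = bonus.astype(int)

# round() first, then convert, to get the nearest integer
rounded = bonus.round().astype(int)

print(truncated.tolist())  # [21592, 6542, 10077]
print(rounded.tolist())    # [21592, 6543, 10078]
```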


### Assigning a subset of rows

Now that we can change all the values in a single column at once, let's learn how to change just a subset of them.

For instance, let's change the **`FLOOR`** for **`Niko`**, **`Penelope`**, and **`Aria`**. Before doing so, let's remember how to make that subset selection with **`.loc`**:

In [19]:
df.loc[['Niko', 'Penelope', 'Aria'], ['FLOOR']] 

Unnamed: 0,FLOOR
Niko,4
Penelope,6
Aria,2


The **`.loc`** indexer allows for row and column selection separated by a comma. It only makes selections based on row/column **labels**. Once we have correctly selected our subset, let's assign it a list of three new integers.

In [20]:
df.loc[['Niko', 'Penelope', 'Aria'], 'FLOOR'] = [3, 6, 4]
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,100,0.2,2,,Smith,21592
Niko,Male,Black,Engineering,1,30347,100,0.1,3,,Jones,3034
Penelope,Female,White,Engineering,12,60258,100,0.0,6,,Williams,0
Aria,Female,Black,Engineering,8,43618,100,0.15,4,,Green,6542
Sofia,Female,Black,Parks & Recreation,23,26125,100,0.12,9,,Brown,3135
Dean,Male,Black,Parks & Recreation,3,33592,100,0.3,6,,Simpson,10077
Zach,Male,White,Parks & Recreation,4,37565,100,0.5,8,,Peters,18782
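
The right-hand side of a `.loc` assignment can be a list with one value per selected row, or a single scalar that is applied to every selected row. Here is a minimal sketch on a hypothetical four-row DataFrame; note that the list values follow the order of the labels in the selection, not the order of the rows in the DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({'FLOOR': [2, 4, 6, 2]},
                    index=['Tom', 'Niko', 'Penelope', 'Aria'])

# List: values are paired with the labels in the order given
demo.loc[['Aria', 'Niko'], 'FLOOR'] = [5, 3]

# Scalar: the same value goes to every selected row
demo.loc[['Tom', 'Penelope'], 'FLOOR'] = 1

print(demo)
```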


### Assigning subsets with `.iloc`
Similarly, we can use the **`.iloc`** indexer which only makes selections via *integer location*.

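Let's assign the value 99 to the rows at integer locations 3 through 5 (the fourth through sixth rows) of the **SCORE** column (integer location 5). As with Python slices, the stop value 6 is excluded.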

In [21]:
df.iloc[3:6, 5] = 99

In [22]:
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,100,0.2,2,,Smith,21592
Niko,Male,Black,Engineering,1,30347,100,0.1,3,,Jones,3034
Penelope,Female,White,Engineering,12,60258,100,0.0,6,,Williams,0
Aria,Female,Black,Engineering,8,43618,99,0.15,4,,Green,6542
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,9,,Brown,3135
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,6,,Simpson,10077
Zach,Male,White,Parks & Recreation,4,37565,100,0.5,8,,Peters,18782
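
Hard-coding a column's integer location is fragile, because the number changes whenever columns are added or reordered. One option is to look the position up with `columns.get_loc()`. The sketch below uses a hypothetical five-row DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({'GENDER': ['Male', 'Male', 'Female', 'Female', 'Male'],
                     'SALARY': [10, 20, 30, 40, 50],
                     'SCORE': [100, 100, 100, 100, 100]},
                    index=['Tom', 'Niko', 'Penelope', 'Aria', 'Dean'])

# Look up the integer location of the column instead of counting by hand
score_pos = demo.columns.get_loc('SCORE')
print(score_pos)  # 2

# Rows at integer locations 1, 2 and 3 (the stop value 4 is excluded)
demo.iloc[1:4, score_pos] = 99
print(demo)
```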


### Assigning an entire column 

Normally, `[ ]` is used to change values of an entire column, but it's also possible to do it with both `.loc` and `.iloc`.

In [23]:
df_selected = df[['GENDER', 'YEARS EXPERIENCE', 'SCORE']].copy()

In [24]:
df_selected.loc[df_selected['YEARS EXPERIENCE'] > 20, 'SCORE'] = 100

In [25]:
df_selected

Unnamed: 0,GENDER,YEARS EXPERIENCE,SCORE
Tom,Male,23,100
Niko,Male,1,100
Penelope,Female,12,100
Aria,Female,8,99
Sofia,Female,23,100
Dean,Male,3,99
Zach,Male,4,100


You have to remember that the first selection made by both of these indexers is the rows. To select all rows, use the colon `:`. Let's see this in action by changing all values in the **FLOOR** column.

In [26]:
df_orig = df.copy()
df.loc[:, 'FLOOR'] = 33
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,100,0.2,33,,Smith,21592
Niko,Male,Black,Engineering,1,30347,100,0.1,33,,Jones,3034
Penelope,Female,White,Engineering,12,60258,100,0.0,33,,Williams,0
Aria,Female,Black,Engineering,8,43618,99,0.15,33,,Green,6542
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,33,,Brown,3135
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,33,,Simpson,10077
Zach,Male,White,Parks & Recreation,4,37565,100,0.5,33,,Peters,18782


And with **`.iloc`**:

In [27]:
df_orig = df.copy()
df.iloc[:, 7] = 22
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,100,0.2,22,,Smith,21592
Niko,Male,Black,Engineering,1,30347,100,0.1,22,,Jones,3034
Penelope,Female,White,Engineering,12,60258,100,0.0,22,,Williams,0
Aria,Female,Black,Engineering,8,43618,99,0.15,22,,Green,6542
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,22,,Brown,3135
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,22,,Simpson,10077
Zach,Male,White,Parks & Recreation,4,37565,100,0.5,22,,Peters,18782


### Assigning with boolean selection
In practice, assignments to subsets are made more often with boolean selection than by selecting labels or integer locations directly.

For instance, let's say we wanted to give everyone in the engineering department a $5,000 bonus on top of what they already have.

Before making the assignment, let's properly select the data with boolean indexing.

In [28]:
df.loc[df['DEPARTMENT'] == 'Engineering', 'BONUS']

Tom         21592
Niko         3034
Penelope        0
Aria         6542
Name: BONUS, dtype: int64

Once we have confirmed that our selection works, we can make the assignment. The **`+=`** operator shortens the syntax considerably: it adds to the selected values and assigns the result back to the same selection.

In [29]:
df_orig = df.copy()

df.loc[df['DEPARTMENT'] == 'Engineering', 'BONUS'] += 5000
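
To make the shorthand concrete, the sketch below applies `+=` and the equivalent long form to two copies of a hypothetical three-row DataFrame and checks that they agree.

```python
import pandas as pd

demo = pd.DataFrame({'DEPARTMENT': ['Engineering', 'Parks & Recreation', 'Engineering'],
                     'BONUS': [100, 200, 300]},
                    index=['Tom', 'Sofia', 'Aria'])
long_form = demo.copy()

is_eng = demo['DEPARTMENT'] == 'Engineering'

# Shorthand: add 5000 to the selected values
demo.loc[is_eng, 'BONUS'] += 5000

# Long form: select, add, and assign back to the same selection
long_form.loc[is_eng, 'BONUS'] = long_form.loc[is_eng, 'BONUS'] + 5000

print(demo['BONUS'].tolist())  # [5100, 200, 5300]
```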

### Assigning with multiple conditions
Let's do an example with multiple boolean conditions. Let's subtract 10 from the **SCORE** of all the black females and white males.

In [30]:
# check our logic first
white_male = (df['GENDER'] == 'Male') & (df['RACE'] == 'White')
black_female = (df['GENDER'] == 'Female') & (df['RACE'] == 'Black')
criteria = white_male | black_female
df[criteria]

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,100,0.2,22,,Smith,26592
Aria,Female,Black,Engineering,8,43618,99,0.15,22,,Green,11542
Sofia,Female,Black,Parks & Recreation,23,26125,99,0.12,22,,Brown,3135
Zach,Male,White,Parks & Recreation,4,37565,100,0.5,22,,Peters,18782


In [31]:
df_orig = df.copy()
df.loc[criteria, 'SCORE'] -= 10
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,SCORE,BONUS RATE,FLOOR,last name,LAST NAME,BONUS
Tom,Male,White,Engineering,23,107962,90,0.2,22,,Smith,26592
Niko,Male,Black,Engineering,1,30347,100,0.1,22,,Jones,8034
Penelope,Female,White,Engineering,12,60258,100,0.0,22,,Williams,5000
Aria,Female,Black,Engineering,8,43618,89,0.15,22,,Green,11542
Sofia,Female,Black,Parks & Recreation,23,26125,89,0.12,22,,Brown,3135
Dean,Male,Black,Parks & Recreation,3,33592,99,0.3,22,,Simpson,10077
Zach,Male,White,Parks & Recreation,4,37565,90,0.5,22,,Peters,18782
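
Each comparison above is wrapped in parentheses. This matters because Python's `&` and `|` operators bind more tightly than comparisons such as `==`, so leaving the parentheses out changes how the expression is grouped. The sketch below repeats the pattern on a hypothetical four-row DataFrame.

```python
import pandas as pd

demo = pd.DataFrame({'GENDER': ['Male', 'Male', 'Female', 'Female'],
                     'RACE': ['White', 'Black', 'White', 'Black'],
                     'SCORE': [100, 100, 100, 100]},
                    index=['Tom', 'Niko', 'Penelope', 'Aria'])

# Parentheses make sure each comparison is evaluated before & and |
white_male = (demo['GENDER'] == 'Male') & (demo['RACE'] == 'White')
black_female = (demo['GENDER'] == 'Female') & (demo['RACE'] == 'Black')
criteria = white_male | black_female

# Check how many rows match before assigning
print(criteria.sum())  # 2

demo.loc[criteria, 'SCORE'] -= 10
print(demo['SCORE'].tolist())  # [90, 100, 100, 90]
```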


## Assigning data in a Series
Assigning to subsets of a pandas Series is a less common operation, but it works analogously to assignment in a DataFrame.

Let's first select a copy of the **SALARY** column from our above DataFrame:

In [32]:
s = df['SALARY'].copy()
s

Tom         107962
Niko         30347
Penelope     60258
Aria         43618
Sofia        26125
Dean         33592
Zach         37565
Name: SALARY, dtype: int64

We didn't have to use the `copy()` method, but we did so to avoid the `SettingWithCopyWarning`. This is a common warning when making assignments to the result of a subset selection. Working on a copy also means that changes to `s` do not affect `df`.
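
The effect of `copy()` can be verified on a small example. In this sketch (hypothetical two-row DataFrame), the copied Series is changed while the DataFrame it came from keeps its original value.

```python
import pandas as pd

demo = pd.DataFrame({'SALARY': [100, 200]}, index=['Tom', 'Niko'])

# An independent copy of the column
s_demo = demo['SALARY'].copy()
s_demo.loc['Tom'] = 999

print(s_demo.loc['Tom'])          # 999
print(demo.loc['Tom', 'SALARY'])  # 100, the DataFrame is unchanged
```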

### Assigning with `.loc`

Since Series do not have columns, we don't use `[ ]` with them (unless we are doing boolean selection). Brackets can be used to select rows, but doing so is ambiguous and confusing, so we avoid it. Everything needed to explicitly select particular Series values is provided by `.loc` and `.iloc`.

Let's change the salary of **Tom**, **Sofia**, and **Zach**.

In [33]:
s.loc[['Tom', 'Sofia', 'Zach']] = [99999, 39999, 49999]
s

Tom         99999
Niko        30347
Penelope    60258
Aria        43618
Sofia       39999
Dean        33592
Zach        49999
Name: SALARY, dtype: int64

### Assigning with `.iloc`
Let's set the salary of **Tom**, **Sofia**, and **Zach** to 40000, this time selecting them by integer location.

In [34]:
s.iloc[[0, 4, 6]] = [40000, 40000, 40000]
s

Tom         40000
Niko        30347
Penelope    60258
Aria        43618
Sofia       40000
Dean        33592
Zach        40000
Name: SALARY, dtype: int64

### Assigning with boolean indexing
We can use boolean indexing to make assignments as well. Using `[ ]` is acceptable here. Let's double all the salaries below 40,000.

In [35]:
s_orig = s.copy()
s[s < 40000] *= 2

Both `[ ]` and `.loc` work the same way when doing boolean indexing on a Series. However, as mentioned in the previous chapter, `.iloc` should almost never be used for boolean indexing, as it is not fully implemented.
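
As a quick check of that equivalence, the sketch below doubles the values under 40,000 in two copies of a hypothetical Series, once with `[ ]` and once with `.loc`. A value of exactly 40,000 is left alone because the comparison is strict.

```python
import pandas as pd

s_demo = pd.Series([40000, 30347, 60258],
                   index=['Tom', 'Niko', 'Penelope'], name='SALARY')

via_brackets = s_demo.copy()
via_brackets[via_brackets < 40000] *= 2

via_loc = s_demo.copy()
via_loc.loc[via_loc < 40000] *= 2

print(via_brackets.tolist())  # [40000, 60694, 60258]
print(via_brackets.equals(via_loc))
```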

## Conclusion

### Exercise

Try [this exercise](../nbs/Pandas_Exercise.html#subset-assigning), which will test your understanding of the concepts in this chapter.

### Further Reading

- There isn't one particular sub-section of the documentation that covers this topic precisely. Several examples are available throughout the entire [indexing section](http://pandas.pydata.org/pandas-docs/stable/indexing.html).

- One very important concept when working with pandas is the `inplace` argument. Most pandas methods do not change your data; they return a new object and leave the original untouched. To keep the change, assign the result back to a variable or, for methods that support it, pass `inplace=True`. Here is a great [blog](https://medium.com/@jman4190/explaining-the-inplace-parameter-for-beginners-5de7ffa18d2e) discussing the `inplace` parameter in detail.
 
- `SettingWithCopyWarning` is one of the most frequently encountered warnings in pandas. Here is another great [blog](https://medium.com/@lengyi/dont-overlook-the-settingwithcopywarning-in-python-51e52b282891) explaining what the `SettingWithCopyWarning` is and how to avoid it.
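
As a small illustration of the `inplace` point above, the sketch below sorts a hypothetical DataFrame with `sort_values()`. Without `inplace=True` the method returns a new, sorted DataFrame and the original keeps its order; with `inplace=True` the original itself is reordered.

```python
import pandas as pd

demo = pd.DataFrame({'SALARY': [300, 100, 200]},
                    index=['Tom', 'Niko', 'Aria'])

# Returns a new DataFrame; demo keeps its original order
sorted_demo = demo.sort_values('SALARY')
order_before = demo.index.tolist()

# Modifies demo itself
demo.sort_values('SALARY', inplace=True)
order_after = demo.index.tolist()

print(order_before)  # ['Tom', 'Niko', 'Aria']
print(order_after)   # ['Niko', 'Aria', 'Tom']
```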

```{note} 
Make sure to read both blog posts before you start the next chapter.
```