# Editing Data in DataFrames 

## Outline
* Setting Columns
* Transforming Columns
* Setting data with `loc[]`



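Creating a DataFrame is only the first step; we will also want to alter DataFrames after creating them by adding, replacing, removing, and transforming columns, and by setting individual values.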

In [123]:
import pandas as pd
import numpy as np
from pathlib import Path

original_df = pd.read_csv(Path('data/employee_attrition.csv'))
print(original_df.shape)
original_df.head()

(1470, 26)


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2


**Note about the code:**  Throughout the examples, you will see the code sections start with `df = original_df.copy()`.  The rest of the example will then typically work with the `df` variable.  All this does is copy the contents of our "original dataframe" (`original_df`) to a local variable so that the examples don't interfere with each other!

## Setting columns

Any column of a DataFrame can be read with `.` or `[]` (like `df.Age` or `df['Age']`), and an existing column can be assigned to in the same way. To create a brand-new column, use the `[]` form, because attribute-style assignment only works for columns that already exist. A DataFrame can be thought of as a collection of Series (columns), and the pandas library supports adding or replacing them in the DataFrame.
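
As a minimal sketch of the two assignment styles, the snippet below uses a hypothetical three-row stand-in for the employee data (the `AgeNextYear` and `Flag` names are made up for illustration). It suggests why bracket notation is the safer habit when adding columns.

```python
import pandas as pd

# Hypothetical three-row stand-in for the employee data
df = pd.DataFrame({"Age": [41, 49, 37], "Gender": ["Female", "Male", "Male"]})

# Bracket assignment adds the column when the name is new
df["AgeNextYear"] = df["Age"] + 1

# Attribute assignment replaces a column that already exists
df.Age = df.Age * 2

# A new attribute name does NOT add a column; it only sets a Python attribute
df.Flag = 0

print(list(df.columns))
```

Only `AgeNextYear` shows up as a new column; `Flag` is silently stored as an ordinary attribute on the object.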

### Add a new column

In [125]:
df = original_df.copy()

# Remember range() just generates a sequence of numbers... Pandas knows how to turn it into a Series for you!
new_column = range(len(df))  # one value per row (1470 here), without hard-coding the length

df['new_column'] = new_column

df # notice new 'new_column' at the right edge of the dataframe

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,new_column
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,1,0,8,0,1,6,4,0,5,0
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,1,10,3,3,10,7,1,7,1
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,2,0,7,3,3,0,0,0,0,2
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,0,8,3,3,8,7,3,0,3
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,4,1,6,3,3,2,2,2,2,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,23,3,Male,41,4,2,...,3,1,17,3,3,5,2,0,3,1465
1466,39,No,Travel_Rarely,613,6,4,Male,42,2,3,...,1,1,9,5,3,7,7,1,7,1466
1467,27,No,Travel_Rarely,155,4,2,Male,87,4,2,...,2,1,6,0,3,6,2,0,3,1467
1468,49,No,Travel_Frequently,1023,2,4,Male,63,2,2,...,4,0,17,3,2,9,6,0,8,1468
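
The value assigned to a new column needs to line up with the number of rows. The following sketch, on a hypothetical three-row frame with made-up column names, shows a sequence sized from the DataFrame, a single value repeated for every row, and what happens when the lengths do not match.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame({"Age": [41, 49, 37]})

# A sequence sized from the DataFrame itself always fits
df["row_number"] = range(len(df))

# A single value is repeated for every row
df["company"] = "Acme"

# A sequence of the wrong length is rejected
try:
    df["too_short"] = [1, 2]
except ValueError as err:
    print("Rejected:", err)

print(df)
```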


### Replace an existing column

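Replacing a column uses the same assignment syntax as creating one. Because the column already exists, attribute-style assignment (`df.Gender = ...`) works here too.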

In [147]:
df = original_df.copy()

# This looks like a bad idea, but you can assign one column to another!
df.Gender = df.Age

df # notice the Gender column now displays Age

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,41,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,49,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,37,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,33,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,27,40,3,1,...,3,4,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,23,3,36,41,4,2,...,3,3,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,6,4,39,42,2,3,...,3,1,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,4,2,27,87,4,2,...,4,2,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,2,4,49,63,2,2,...,3,4,0,17,3,2,9,6,0,8


We'll see more useful applications of setting columns when we cover Transforming columns later in this lesson.

### Removing columns
To explicitly remove columns, we can use the DataFrame's `.drop()` method. Note that `drop()` returns a new DataFrame without the dropped columns; it does not mutate the original, which is why the example assigns the result back to `df`.

In [148]:
df = original_df.copy()

df = df.drop(columns=['Gender', 'Age'])

df

Unnamed: 0,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MaritalStatus,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,Yes,Travel_Rarely,1102,1,2,94,3,2,4,Single,...,3,1,0,8,0,1,6,4,0,5
1,No,Travel_Frequently,279,8,3,61,2,2,2,Married,...,4,4,1,10,3,3,10,7,1,7
2,Yes,Travel_Rarely,1373,2,4,92,2,1,3,Single,...,3,2,0,7,3,3,0,0,0,0
3,No,Travel_Frequently,1392,3,4,56,3,1,3,Married,...,3,3,0,8,3,3,8,7,3,0
4,No,Travel_Rarely,591,2,1,40,3,1,2,Married,...,3,4,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,No,Travel_Frequently,884,23,3,41,4,2,4,Married,...,3,3,1,17,3,3,5,2,0,3
1466,No,Travel_Rarely,613,6,4,42,2,3,1,Married,...,3,1,1,9,5,3,7,7,1,7
1467,No,Travel_Rarely,155,4,2,87,4,2,2,Married,...,4,2,1,6,0,3,6,2,0,3
1468,No,Travel_Frequently,1023,2,4,63,2,2,2,Married,...,3,4,0,17,3,2,9,6,0,8
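
To make the "returns a new copy" behavior concrete, here is a small sketch on a hypothetical two-row frame. It also uses the `errors="ignore"` option of `drop()`, which skips labels that are not present instead of raising.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
df = pd.DataFrame({
    "Age": [41, 49],
    "Gender": ["Female", "Male"],
    "DailyRate": [1102, 279],
})

# drop() returns a new DataFrame; df itself still has all three columns
trimmed = df.drop(columns=["Gender"])

# errors="ignore" skips names that do not exist ("NotAColumn" is made up)
safe = df.drop(columns=["Gender", "NotAColumn"], errors="ignore")

print(list(df.columns))
print(list(trimmed.columns))
```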


However, remember that you can always use `loc[]`, `iloc[]`, or any other data selection strategy to get a new DataFrame that contains exactly what you want. For instance, to "drop all but the first 4 columns", we can reassign the variable to an `iloc[]` selection of just that portion of the DataFrame.

In [150]:
df = original_df.copy()

df = df.iloc[:,0:4]

df

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate
0,41,Yes,Travel_Rarely,1102
1,49,No,Travel_Frequently,279
2,37,Yes,Travel_Rarely,1373
3,33,No,Travel_Frequently,1392
4,27,No,Travel_Rarely,591
...,...,...,...,...
1465,36,No,Travel_Frequently,884
1466,39,No,Travel_Rarely,613
1467,27,No,Travel_Rarely,155
1468,49,No,Travel_Frequently,1023
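
Selecting by position is only one option. As a sketch on a hypothetical two-row frame, the same four columns can also be kept by name or by a label slice, which may be easier to read and does not depend on column order.

```python
import pandas as pd

# Hypothetical two-row frame with five columns
df = pd.DataFrame({
    "Age": [41, 49],
    "Attrition": ["Yes", "No"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently"],
    "DailyRate": [1102, 279],
    "Gender": ["Female", "Male"],
})

by_position = df.iloc[:, 0:4]  # positions 0-3; the end position is excluded
by_name = df[["Age", "Attrition", "BusinessTravel", "DailyRate"]]
by_label_slice = df.loc[:, "Age":"DailyRate"]  # label slices include the end label

print(by_position.equals(by_name))
```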


## Transforming columns

It is valuable to use existing data when setting new columns, but you will often want to transform that data first. Perhaps you want to standardize the Gender column in this example: instead of 'Male' and 'Female', we just want 'm' or 'f'. How can we change the data in that column to match what we want?

### `.map()`
Map is a common concept in programming: it takes a collection as input, applies a function to each element in the collection, and returns the function's return values as a new collection. In our case, we'd like a function that turns each value in the Gender column into either 'm' or 'f' and gives us back a new column of data. The `.map()` method on the column's Series does just that!

In [129]:
df = original_df.copy()

# create our new column
new_gender = df.Gender.map(lambda g: 'f' if g == 'Female' else 'm')

# assign our new column to the 'Gender' column of the Dataframe:
df.Gender = new_gender

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,f,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,m,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,m,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,f,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,m,40,3,1,...,3,4,1,6,3,3,2,2,2,2
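
Note that the lambda above maps every value other than 'Female' to 'm'. `Series.map()` also accepts a dictionary, in which case values without a matching key become `NaN`, which can make unexpected values easier to spot. A minimal sketch on a hypothetical Series (the 'Unknown' entry is made up for illustration):

```python
import pandas as pd

# Hypothetical column containing one unexpected value
gender = pd.Series(["Female", "Male", "Unknown"])

# The lambda sends anything that is not 'Female' to 'm'
with_lambda = gender.map(lambda g: "f" if g == "Female" else "m")

# A dict maps only the listed keys; everything else becomes NaN
with_dict = gender.map({"Female": "f", "Male": "m"})

print(with_lambda.tolist())
print(with_dict.isna().tolist())
```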


### Arithmetic operations with columns

Let's look at another example of transforming data. Let's say we want to identify people who have worked their entire career at this company. If their "TotalWorkingYears" is equal to their "YearsAtCompany" value, then we want the value in the "Lifer" column to equal True, otherwise False.

We can do this by comparing two columns as if they were single values:

In [151]:
df = original_df.copy()

# Comparing two columns element-wise produces a Series of boolean values
lifer_col = df.TotalWorkingYears == df.YearsAtCompany

df['Lifer'] = lifer_col
df

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Lifer
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,1,0,8,0,1,6,4,0,5,False
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,1,10,3,3,10,7,1,7,True
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,2,0,7,3,3,0,0,0,0,False
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,0,8,3,3,8,7,3,0,True
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,4,1,6,3,3,2,2,2,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,23,3,Male,41,4,2,...,3,1,17,3,3,5,2,0,3,False
1466,39,No,Travel_Rarely,613,6,4,Male,42,2,3,...,1,1,9,5,3,7,7,1,7,False
1467,27,No,Travel_Rarely,155,4,2,Male,87,4,2,...,2,1,6,0,3,6,2,0,3,True
1468,49,No,Travel_Frequently,1023,2,4,Male,63,2,2,...,4,0,17,3,2,9,6,0,8,False
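
A boolean column like this is useful beyond display. The sketch below, on a hypothetical four-row frame, counts the `True` values, filters rows with the column, and combines it with a second condition (the 10-year threshold is an arbitrary choice for illustration).

```python
import pandas as pd

# Hypothetical four-row frame for illustration
df = pd.DataFrame({
    "TotalWorkingYears": [8, 10, 7, 8],
    "YearsAtCompany": [6, 10, 0, 8],
})

df["Lifer"] = df["TotalWorkingYears"] == df["YearsAtCompany"]

# True counts as 1, so sum() gives the number of lifers
lifer_count = int(df["Lifer"].sum())

# A boolean Series can be used directly as a row filter
lifers = df[df["Lifer"]]

# Combine conditions with & (and) or | (or); wrap each comparison in parentheses
veteran_lifer = df["Lifer"] & (df["YearsAtCompany"] >= 10)

print(lifer_count)
```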


Just to demonstrate, we can do all sorts of arithmetic operations on columns. They are applied element-wise, and `+` on two string columns concatenates the strings:

In [152]:
df = original_df.copy()

df['unfair_compensation'] = df.DailyRate * df.JobLevel
df['BusinessTravel+Gender'] = df.BusinessTravel + df.Gender

df

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,unfair_compensation,BusinessTravel+Gender
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,0,8,0,1,6,4,0,5,2204,Travel_RarelyFemale
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,1,10,3,3,10,7,1,7,558,Travel_FrequentlyMale
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,0,7,3,3,0,0,0,0,1373,Travel_RarelyMale
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,0,8,3,3,8,7,3,0,1392,Travel_FrequentlyFemale
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,1,6,3,3,2,2,2,2,591,Travel_RarelyMale
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,23,3,Male,41,4,2,...,1,17,3,3,5,2,0,3,1768,Travel_FrequentlyMale
1466,39,No,Travel_Rarely,613,6,4,Male,42,2,3,...,1,9,5,3,7,7,1,7,1839,Travel_RarelyMale
1467,27,No,Travel_Rarely,155,4,2,Male,87,4,2,...,1,6,0,3,6,2,0,3,310,Travel_RarelyMale
1468,49,No,Travel_Frequently,1023,2,4,Male,63,2,2,...,0,17,3,2,9,6,0,8,2046,Travel_FrequentlyMale
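
Columns can also be combined with plain Python values, not only with other columns. In this sketch on a hypothetical two-row frame, `WeeklyRate` assumes a five-day week and is a made-up column, as is `TravelGender`.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
df = pd.DataFrame({
    "DailyRate": [1102, 279],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently"],
    "Gender": ["Female", "Male"],
})

# A scalar is applied to every row of the column
df["WeeklyRate"] = df["DailyRate"] * 5

# A literal string can act as a separator between two string columns
df["TravelGender"] = df["BusinessTravel"] + "_" + df["Gender"]

print(df)
```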


## Setting data with `loc[]`

Remember how useful the `loc[]` indexer was for reading data from a DataFrame? It turns out it is just as useful for setting data within a DataFrame.


First, let's build a smaller DataFrame to demonstrate how this works!


In [153]:
columns = list('abcdef') # Create a list of chars from a string (yay python)
data = [[False for j in columns] for i in range(0, 10)] # create a list of lists of "False"s for our dataframe
sdf = pd.DataFrame(data, columns=columns)

sdf

Unnamed: 0,a,b,c,d,e,f
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [154]:
print(sdf.loc[0, 'a'])

sdf.loc[0, 'a'] = True

sdf

False


Unnamed: 0,a,b,c,d,e,f
0,True,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [155]:
sdf.loc[3] = True

sdf

Unnamed: 0,a,b,c,d,e,f
0,True,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,True,True,True,True,True,True
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [156]:
sdf.loc[7, ['a', 'c', 'f']] = True

sdf

Unnamed: 0,a,b,c,d,e,f
0,True,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,True,True,True,True,True,True
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,True,False,True,False,False,True
8,False,False,False,False,False,False
9,False,False,False,False,False,False


Instead of passing a single value to be assigned, you can pass data that matches the shape of your selection to set exactly that data:

In [157]:
sdf.loc[9] = [True, False, True, False, True, False]

sdf

Unnamed: 0,a,b,c,d,e,f
0,True,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,True,True,True,True,True,True
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,True,False,True,False,False,True
8,False,False,False,False,False,False
9,True,False,True,False,True,False


In [158]:
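# Here we use the slice operation to select rows 3 and 4, and columns d, e and f
# (label slices with loc include both endpoints).
# Note: writing strings into boolean columns changes their dtype; recent pandas versions warn about (or may reject) this.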
sdf.loc[3:4, 'd':'f'] = [['a','a','a'], ['b','b','b']]

sdf

Unnamed: 0,a,b,c,d,e,f
0,True,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,True,True,True,a,a,a
4,False,False,False,b,b,b
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,True,False,True,False,False,True
8,False,False,False,False,False,False
9,True,False,True,False,True,False
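
`loc[]` also accepts a boolean condition as the row selector, which lets you set values only where the condition holds. A minimal sketch, assuming a hypothetical four-row frame and a made-up `AgeBand` column:

```python
import pandas as pd

# Hypothetical four-row frame for illustration
df = pd.DataFrame({"Age": [41, 49, 37, 33]})

# Start with a default value for every row
df["AgeBand"] = "under 40"

# Then overwrite only the rows where the condition is True
df.loc[df["Age"] >= 40, "AgeBand"] = "40 and over"

print(df)
```

Doing the row and column selection in a single `loc[]` call keeps the assignment on the original DataFrame rather than on a temporary selection of it.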
