# Adding, deleting, and selecting data with DataFrames

Regardless of the original data source, once you have data loaded into a DataFrame, you gain the ability to manipulate your data. For example, you can do the following:

  - Add or make new columns
  - Delete or drop columns
  - Select rows and columns
  - Handle missing data with `dropna()` and `fillna()`
    
To explore this, read in a data table consisting of the salaries and personal statistics of major league baseball players.

In [None]:
import pandas as pd

players_df = pd.read_csv("https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/players.csv")

First, examine your DataFrame using some of the techniques that you looked at in the previous lesson.

In [None]:
players_df.shape

(26428, 14)

In [None]:
players_df.describe()

Unnamed: 0,birthyear,deathyear,weight,height,yearid,salary
count,26428.0,492.0,26428.0,26428.0,26428.0,26428.0
mean,1971.389133,2008.833333,199.022136,73.509006,2000.878727,2085634.0
std,9.679736,7.188803,22.631696,2.284665,8.909314,3455348.0
min,1925.0,1989.0,140.0,66.0,1985.0,0.0
25%,1964.0,2006.0,185.0,72.0,1994.0,294702.0
50%,1971.0,2011.0,195.0,74.0,2001.0,550000.0
75%,1979.0,2015.0,215.0,75.0,2009.0,2350000.0
max,1995.0,2018.0,315.0,83.0,2016.0,33000000.0


In [None]:
players_df.columns

Index(['playerid', 'birthyear', 'birthcountry', 'deathyear', 'namefirst',
       'namelast', 'weight', 'height', 'bats', 'throws', 'yearid', 'teamid',
       'lgid', 'salary'],
      dtype='object')

In [None]:
players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26428 entries, 0 to 26427
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   playerid      26428 non-null  object 
 1   birthyear     26428 non-null  int64  
 2   birthcountry  26428 non-null  object 
 3   deathyear     492 non-null    float64
 4   namefirst     26428 non-null  object 
 5   namelast      26428 non-null  object 
 6   weight        26428 non-null  int64  
 7   height        26428 non-null  int64  
 8   bats          26428 non-null  object 
 9   throws        26428 non-null  object 
 10  yearid        26428 non-null  int64  
 11  teamid        26428 non-null  object 
 12  lgid          26428 non-null  object 
 13  salary        26428 non-null  int64  
dtypes: float64(1), int64(5), object(8)
memory usage: 2.8+ MB


In [None]:
players_df

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000
2,benedbr01,1955,USA,,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000
3,campri01,1953,USA,2013.0,Rick,Camp,195,73,R,R,1985,ATL,NL,633333
4,ceronri01,1954,USA,,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26423,strasst01,1988,USA,,Stephen,Strasburg,235,76,R,R,2016,WAS,NL,10400000
26424,taylomi02,1991,USA,,Michael,Taylor,210,75,R,R,2016,WAS,NL,524000
26425,treinbl01,1988,USA,,Blake,Treinen,225,77,R,R,2016,WAS,NL,524900
26426,werthja01,1979,USA,,Jayson,Werth,235,77,R,R,2016,WAS,NL,21733615


## Missing values

`NaN` in pandas represents a null value, meaning that a value is missing. Note that a null value doesn't necessarily mean that the number is `0`. It could be missing due to something like a recording error, its inapplicability, or sampling bias. In later modules, you will take a closer look at how to diagnose missing values.  

The `isnull()` method returns a boolean result indicating whether each value in a DataFrame is missing. `True` equates to a missing value.

In [None]:
players_df.isnull()

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary
0,False,False,False,True,False,False,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26423,False,False,False,True,False,False,False,False,False,False,False,False,False,False
26424,False,False,False,True,False,False,False,False,False,False,False,False,False,False
26425,False,False,False,True,False,False,False,False,False,False,False,False,False,False
26426,False,False,False,True,False,False,False,False,False,False,False,False,False,False


Immediately, you can see that there are quite a few missing values in the *deathyear* field. It makes sense that there are missing values here—if a player is still living, then their death year is indeed unknown. Making this connection shows the importance of understanding how data is collected; the data itself may not provide the answers.
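A full table of booleans is hard to scan on a DataFrame of this size. As a minimal sketch, chaining `sum()` onto `isnull()` counts the missing values in each column, because each `True` counts as 1. The three-row *sample_df* below is a hypothetical stand-in for *players_df*, built by hand so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for players_df
sample_df = pd.DataFrame({
    "namelast": ["Barker", "Bedrosian", "Camp"],
    "deathyear": [np.nan, np.nan, 2013.0],
})

# Each True counts as 1, so the sum is the number of missing values per column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

Running the same call on *players_df* should point to *deathyear* as the only column with missing values, which matches the non-null counts reported by `info()`.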

Should you want to exclude all records with a missing value, you can call the `dropna()` method. By default, this drops any row that has a missing value in any column. The method also accepts parameters that change this behavior; to learn more, check the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).

In [None]:
players_df_filtered = players_df.dropna()
players_df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 492 entries, 3 to 25988
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   playerid      492 non-null    object 
 1   birthyear     492 non-null    int64  
 2   birthcountry  492 non-null    object 
 3   deathyear     492 non-null    float64
 4   namefirst     492 non-null    object 
 5   namelast      492 non-null    object 
 6   weight        492 non-null    int64  
 7   height        492 non-null    int64  
 8   bats          492 non-null    object 
 9   throws        492 non-null    object 
 10  yearid        492 non-null    int64  
 11  teamid        492 non-null    object 
 12  lgid          492 non-null    object 
 13  salary        492 non-null    int64  
dtypes: float64(1), int64(5), object(8)
memory usage: 57.7+ KB


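The filtered DataFrame, *players_df_filtered*, contains only 492 rows: every row where the player's death year is missing has been dropped. The original *players_df* is unchanged, because `dropna()` returns a new DataFrame.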
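Dropping every row with any missing value can discard far more data than intended. As a hedged sketch, the `subset` parameter of `dropna()` limits the check to the columns that you name. The *sample_df* below is a hypothetical stand-in with made-up gaps in the *salary* column; the real dataset has no missing salaries.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players_df with gaps in two different columns
sample_df = pd.DataFrame({
    "namelast": ["Barker", "Camp", "Cerone"],
    "deathyear": [np.nan, 2013.0, np.nan],
    "salary": [870000.0, np.nan, 625000.0],
})

# Default behavior: drop a row if any of its columns is missing
dropped_any = sample_df.dropna()

# Only look at the salary column when deciding which rows to drop
dropped_salary = sample_df.dropna(subset=["salary"])

print(len(dropped_any), len(dropped_salary))
```

In this sample, every row has at least one gap, so the default call keeps nothing, while the `subset` version keeps the rows that have a salary.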

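But what if instead of dropping these rows, you want to fill in the missing values with something other than `NaN`? You can do this using the `fillna()` method, which returns a new DataFrame with every missing value replaced by the value that you pass in.

This can be useful for making a DataFrame more legible, or for assigning a special value or character to nulls. Note that filling a numeric column with text changes how the column is stored: in the output below, *deathyear* shows `2013` rather than `2013.0` because the column no longer holds only numbers.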


In [None]:
players_df.fillna('To be determined')

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary
0,barkele01,1955,USA,To be determined,Len,Barker,225,77,R,R,1985,ATL,NL,870000
1,bedrost01,1957,USA,To be determined,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000
2,benedbr01,1955,USA,To be determined,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000
3,campri01,1953,USA,2013,Rick,Camp,195,73,R,R,1985,ATL,NL,633333
4,ceronri01,1954,USA,To be determined,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26423,strasst01,1988,USA,To be determined,Stephen,Strasburg,235,76,R,R,2016,WAS,NL,10400000
26424,taylomi02,1991,USA,To be determined,Michael,Taylor,210,75,R,R,2016,WAS,NL,524000
26425,treinbl01,1988,USA,To be determined,Blake,Treinen,225,77,R,R,2016,WAS,NL,524900
26426,werthja01,1979,USA,To be determined,Jayson,Werth,235,77,R,R,2016,WAS,NL,21733615
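Passing a single value to `fillna()` applies it to every column. As a minimal sketch of a more targeted approach, `fillna()` also accepts a dictionary that maps column names to fill values. The *sample_df* below is a hypothetical stand-in for *players_df*, and `-1` is an arbitrary placeholder chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for players_df
sample_df = pd.DataFrame({
    "namelast": ["Barker", "Camp", "Cerone"],
    "deathyear": [np.nan, 2013.0, np.nan],
})

# Fill only the deathyear column; -1 is an arbitrary placeholder value
filled_df = sample_df.fillna({"deathyear": -1})
print(filled_df)

# fillna() returns a new DataFrame, so the original still has its NaN values
print(sample_df["deathyear"].isnull().sum())
```

A numeric placeholder keeps the column numeric, but it will distort statistics such as the mean, so it is usually worth filtering it out before calculating anything.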


## Adding new columns

Sometimes, you'll need to add new columns by deriving values from calculations involving other columns. For example, with each player's weight in pounds and height in inches, you can compute a new column for body mass index (BMI) using the following formula:

$$ \text{bmi} = \frac{703 \times \text{weight}}{\text{height}^2} $$

To create a column of values calculated from the values in other columns, use the `assign()` method of the DataFrame.

In [None]:
players_df = players_df.assign(bmi = (703 * players_df.weight) / (players_df.height**2))
players_df.head()

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000,24.995556
2,benedbr01,1955,USA,,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000,23.085945
3,campri01,1953,USA,2013.0,Rick,Camp,195,73,R,R,1985,ATL,NL,633333,25.724339
4,ceronri01,1954,USA,,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000,26.77564


Or you could simply add a new column that holds the same default value in every row:

In [None]:
players_df['X'] = 0
players_df.head()

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi,X
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192,0
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000,24.995556,0
2,benedbr01,1955,USA,,Bruce,Benedict,175,73,R,R,1985,ATL,NL,545000,23.085945,0
3,campri01,1953,USA,2013.0,Rick,Camp,195,73,R,R,1985,ATL,NL,633333,25.724339,0
4,ceronri01,1954,USA,,Rick,Cerone,192,71,R,R,1985,ATL,NL,625000,26.77564,0


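Both of these approaches produce one value for every row. The BMI calculation is applied element by element to the *weight* and *height* columns, and the single value `0` is repeated for each row.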
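As a minimal sketch of what happens in both cases, the snippet below repeats the two operations on a hypothetical two-row *sample_df* that reuses the weights and heights of the first two players. It passes a function to `assign()`, which is an alternative to referring to the DataFrame by name.

```python
import pandas as pd

# Hypothetical two-row stand-in using the first two players' weight and height
sample_df = pd.DataFrame({"weight": [225, 200], "height": [77, 75]})

# assign() also accepts a function that receives the DataFrame;
# the arithmetic is applied to each row's weight and height
sample_df = sample_df.assign(bmi=lambda df: 703 * df["weight"] / df["height"] ** 2)

# A single value is repeated for every row
sample_df["X"] = 0

print(sample_df)
```

The *bmi* values should agree with the first two rows of the output shown earlier.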

## Deleting columns

To remove a column, use the `drop()` method. Because this method can drop both rows and columns, you need to specify the axis: `axis=1` for columns or `axis=0` (the default) for rows. Try it out by removing the new column that you just created. By default, `drop()` does not modify the DataFrame; it returns a new DataFrame with the changes and leaves the existing one untouched. Keep this behavior in mind, because calling `drop()` without storing the result has no lasting effect.

As an alternative, you can set the `inplace` parameter to `True` to make the operation work on the existing DataFrame. The following code snippet first drops a column and assigns the new modified DataFrame to a new variable named *players_df_changed*. The original DataFrame, *players_df*, still exists. Next, the original DataFrame is modified by dropping the same column while passing the `inplace` parameter set to `True`.

In [None]:
# Delete the 'X' column and store the new DataFrame in a new variable
players_df_changed = players_df.drop(['X'], axis=1)

print('The columns of the new DataFrame')
print(players_df_changed.columns)

print('The columns of the original DataFrame')
print(players_df.columns)

# Alternatively, modify the existing DataFrame by setting the `inplace` parameter to `True`
players_df.drop(['X'], axis=1, inplace=True)
print('The columns of the DataFrame after deleting inplace')
print(players_df.columns)


The columns of the new DataFrame
Index(['playerid', 'birthyear', 'birthcountry', 'deathyear', 'namefirst',
       'namelast', 'weight', 'height', 'bats', 'throws', 'yearid', 'teamid',
       'lgid', 'salary', 'bmi'],
      dtype='object')
The columns of the original DataFrame
Index(['playerid', 'birthyear', 'birthcountry', 'deathyear', 'namefirst',
       'namelast', 'weight', 'height', 'bats', 'throws', 'yearid', 'teamid',
       'lgid', 'salary', 'bmi', 'X'],
      dtype='object')
The columns of the DataFrame after deleting inplace
Index(['playerid', 'birthyear', 'birthcountry', 'deathyear', 'namefirst',
       'namelast', 'weight', 'height', 'bats', 'throws', 'yearid', 'teamid',
       'lgid', 'salary', 'bmi'],
      dtype='object')
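As an aside, `drop()` also accepts a `columns` keyword, which some readers find clearer than remembering which axis number means columns. The sketch below uses a hypothetical two-row *sample_df* rather than the full dataset.

```python
import pandas as pd

# Hypothetical stand-in with a throwaway column named X
sample_df = pd.DataFrame({"weight": [225, 200], "height": [77, 75], "X": [0, 0]})

# Equivalent to sample_df.drop(["X"], axis=1)
without_x = sample_df.drop(columns=["X"])

print(list(without_x.columns))
print(list(sample_df.columns))  # the original still has X
```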


## Selecting rows

When you're exploring data in pandas, one thing that you'll need to do constantly is select a subset of rows based on some criterion. Suppose, for example, that you need to see the data for just a single year. Or maybe you need to select some rows by numbers.

### Selecting by number

You have already seen that you can get the first `n` rows or the last `n` rows of a DataFrame using `head(n)` and `tail(n)`, respectively. But what if you wanted to select rows 20-25? To select rows based on position, use `iloc`.  



In [None]:
# Select the first row
players_df.iloc[0]

playerid        barkele01
birthyear            1955
birthcountry          USA
deathyear             NaN
namefirst             Len
namelast           Barker
weight                225
height                 77
bats                    R
throws                  R
yearid               1985
teamid                ATL
lgid                   NL
salary             870000
bmi               26.6782
Name: 0, dtype: object

The `iloc` selector can take inputs in several different ways:

 - An integer
 - A list of integers
 - A slice object
 - A boolean array
 
When a single integer is passed, the result is a *Series*: a one-dimensional set of values, one per column, indexed by the column names. When a list of integers is provided, the result is a DataFrame instead, even if the list contains only one integer.

In [None]:
# Select the first row, this time with a list of integers
players_df.iloc[[0]]

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192
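The difference between the two forms is easy to check by inspecting the type of each result. This is a small sketch on a hypothetical two-row *sample_df*, not on the full dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for players_df
sample_df = pd.DataFrame({"namelast": ["Barker", "Bedrosian"], "weight": [225, 200]})

row_as_series = sample_df.iloc[0]    # a single integer gives a Series
row_as_frame = sample_df.iloc[[0]]   # a list of integers gives a DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(row_as_frame.shape)
```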


To select more than one row, just list the row numbers.

In [None]:
players_df.iloc[[0, 1, 5, 8, 10]]

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192
1,bedrost01,1957,USA,,Steve,Bedrosian,200,75,R,R,1985,ATL,NL,550000,24.995556
5,chambch01,1948,USA,,Chris,Chambliss,195,73,L,R,1985,ATL,NL,800000,25.724339
8,garbege01,1947,USA,,Gene,Garber,175,70,R,R,1985,ATL,NL,772000,25.107143
10,hornebo01,1957,USA,,Bob,Horner,195,73,R,R,1985,ATL,NL,1500000,25.724339


To select a run of consecutive rows, use a slice. In Python, slice notation takes the following form:

```
[start:end]
```

Here, `start` and `end` are integers. The slice covers the positions from `start` up to, but not including, `end`. For example, `[2:5]` selects the same positions as `[2, 3, 4]`.
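The same rule applies to ordinary Python lists, which makes it easy to try out. This is a minimal sketch; the list named *numbers* is made up for illustration.

```python
numbers = list(range(10))

# The slice starts at position 2 and stops before position 5
picked = numbers[2:5]
print(picked)

# The number of items selected is end - start
print(len(picked))
```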



In [None]:
# Select rows 5-10
players_df.iloc[5:11]

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
5,chambch01,1948,USA,,Chris,Chambliss,195,73,L,R,1985,ATL,NL,800000,25.724339
6,dedmoje01,1960,USA,,Jeff,Dedmon,200,74,L,R,1985,ATL,NL,150000,25.675676
7,forstte01,1952,USA,,Terry,Forster,200,75,L,L,1985,ATL,NL,483333,24.995556
8,garbege01,1947,USA,,Gene,Garber,175,70,R,R,1985,ATL,NL,772000,25.107143
9,harpete01,1955,USA,,Terry,Harper,195,76,R,R,1985,ATL,NL,250000,23.733553
10,hornebo01,1957,USA,,Bob,Horner,195,73,R,R,1985,ATL,NL,1500000,25.724339


Alternatively, `iloc` may take a list of booleans that is the same length as the index and only return the rows that are `True`. For example, if the list `[False, True, True, False]` is used to select from a DataFrame with four rows, it will skip the first row, take the second and third rows, and skip the last row.
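The four-row case can be sketched directly. The *sample_df* below is a hypothetical stand-in that holds only the last names of the first four players.

```python
import pandas as pd

# Hypothetical four-row stand-in for players_df
sample_df = pd.DataFrame({"namelast": ["Barker", "Bedrosian", "Benedict", "Camp"]})

# Keep the second and third rows, skip the first and last
selected = sample_df.iloc[[False, True, True, False]]
print(selected)
```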

Your DataFrame has over 26,000 rows, and typing a list of 26,000 boolean values is just not practical. But you could generate a list based on some value in the DataFrame. For example, suppose that you want to get a list of all players who weigh over 200 pounds. You could generate a list of booleans like this:

In [None]:
# Create a Series of booleans
over_200 = players_df['weight'] > 200
over_200.head(10)

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: weight, dtype: bool

The *over_200* Series now contains a boolean value for each row in the DataFrame. You can see how many `True` values and how many `False` values there are with the `value_counts()` method.

In [None]:
over_200.value_counts()

False    15829
True     10599
Name: weight, dtype: int64

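To select only the players corresponding to a `True` value, pass the booleans to `iloc`. The code below uses `over_200.values` to hand `iloc` the underlying array of booleans, because `iloc` expects a plain boolean array rather than a Series.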

In [None]:
players_df.iloc[over_200.values]


Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
0,barkele01,1955,USA,,Len,Barker,225,77,R,R,1985,ATL,NL,870000,26.678192
15,murphda05,1956,USA,,Dale,Murphy,210,76,R,R,1985,ATL,NL,1625000,25.559211
24,davisst02,1961,USA,,Storm,Davis,207,76,R,R,1985,BAL,AL,437500,25.194079
29,grosswa01,1952,USA,,Wayne,Gross,210,74,L,R,1985,BAL,AL,483333,26.959459
39,roeniga01,1954,USA,,Gary,Roenicke,205,75,R,R,1985,BAL,AL,558333,25.620444
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26423,strasst01,1988,USA,,Stephen,Strasburg,235,76,R,R,2016,WAS,NL,10400000,28.601974
26424,taylomi02,1991,USA,,Michael,Taylor,210,75,R,R,2016,WAS,NL,524000,26.245333
26425,treinbl01,1988,USA,,Blake,Treinen,225,77,R,R,2016,WAS,NL,524900,26.678192
26426,werthja01,1979,USA,,Jayson,Werth,235,77,R,R,2016,WAS,NL,21733615,27.863889


The result is a DataFrame with 10,599 rows. Here is another example with multiple conditions: select all players over 200 pounds who bat left-handed and throw right-handed.

In [None]:
over_200_L_R = (players_df['weight'] > 200) & (players_df['bats'] == 'L') & (players_df['throws'] == 'R')
players_df.iloc[over_200_L_R.values]

Unnamed: 0,playerid,birthyear,birthcountry,deathyear,namefirst,namelast,weight,height,bats,throws,yearid,teamid,lgid,salary,bmi
29,grosswa01,1952,USA,,Wayne,Gross,210,74,L,R,1985,BAL,AL,483333,26.959459
40,sheetla01,1959,USA,,Larry,Sheets,217,75,L,R,1985,BAL,AL,60000,27.120178
55,gedmari01,1959,USA,,Rich,Gedman,210,72,L,R,1985,BOS,AL,477500,28.478009
117,walkegr01,1959,USA,,Greg,Walker,205,75,L,R,1985,CHA,AL,195000,25.620444
137,sutclri01,1956,USA,,Rick,Sutcliffe,215,79,L,R,1985,CHN,NL,1260000,24.218074
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26394,saundmi01,1986,CAN,,Michael,Saunders,225,76,L,R,2016,TOR,AL,2900000,27.384868
26399,tholejo01,1986,USA,,Josh,Thole,205,73,L,R,2016,TOR,AL,800000,27.043535
26404,drewst01,1983,USA,,Stephen,Drew,220,72,L,R,2016,WAS,NL,3000000,29.834105
26407,harpebr03,1992,USA,,Bryce,Harper,215,75,L,R,2016,WAS,NL,5000000,26.870222
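As a side note, pandas offers other selectors that accept a boolean Series directly, without the `.values` step. The sketch below compares `iloc` with `loc` and with plain square brackets on a hypothetical four-row *sample_df*; it is meant as an illustration rather than a recommendation of one form over another.

```python
import pandas as pd

# Hypothetical four-row stand-in for players_df
sample_df = pd.DataFrame({
    "namelast": ["Barker", "Gross", "Sheets", "Forster"],
    "weight": [225, 210, 217, 200],
    "bats": ["R", "L", "L", "L"],
    "throws": ["R", "R", "R", "L"],
})

# Each condition is wrapped in parentheses and combined with &
mask = (sample_df["weight"] > 200) & (sample_df["bats"] == "L") & (sample_df["throws"] == "R")

via_iloc = sample_df.iloc[mask.values]  # needs the underlying array
via_loc = sample_df.loc[mask]           # accepts the Series itself
via_brackets = sample_df[mask]          # shorthand for row filtering

print(via_iloc.equals(via_loc) and via_loc.equals(via_brackets))
```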


### Selecting on both axes

Sometimes you want to select only some columns of the DataFrame. The `iloc` selector allows that with a second value, placed after a comma, that selects along the column axis. Like the rows, the columns are numbered from `0`. So, to select the first name, last name, weight, height, bats, and throws columns for the first 10 rows of the DataFrame, you would need to specify a slice of `[4:10]` for the columns, which covers column positions 4 through 9.



In [None]:
# Rows 0-9 and columns 4-9
players_df.iloc[0:10, 4:10]

Unnamed: 0,namefirst,namelast,weight,height,bats,throws
0,Len,Barker,225,77,R,R
1,Steve,Bedrosian,200,75,R,R
2,Bruce,Benedict,175,73,R,R
3,Rick,Camp,195,73,R,R
4,Rick,Cerone,192,71,R,R
5,Chris,Chambliss,195,73,L,R
6,Jeff,Dedmon,200,74,L,R
7,Terry,Forster,200,75,L,L
8,Gene,Garber,175,70,R,R
9,Terry,Harper,195,76,R,R


To make the rest of this discussion a little easier to illustrate, select a subset of the data: all players who played for the team CLE in 2015. Limit yourself to the following columns: *playerid*, *birthyear*, *namefirst*, *namelast*, *weight*, *height*, *bats*, *throws*, and *salary*.

In [None]:
# Create a Series of booleans for the row selection
cle_options = (players_df['teamid'] == 'CLE') & (players_df['yearid'] == 2015)
cle_2015 = players_df.iloc[cle_options.values, [0, 1, 4, 5, 6, 7, 8, 9, 13]]
cle_2015

Unnamed: 0,playerid,birthyear,namefirst,namelast,weight,height,bats,throws,salary
24953,adamsau01,1986,Austin,Adams,200,71,R,R,507700
24954,allenco01,1988,Cody,Allen,210,73,R,R,547100
24955,atchisc01,1976,Scott,Atchison,200,74,R,R,900000
24956,avilemi01,1981,Mike,Aviles,205,70,R,R,3500000
24957,bauertr01,1991,Trevor,Bauer,190,73,R,R,1940000
24958,bournmi01,1982,Michael,Bourn,190,71,L,R,13500000
24959,brantmi02,1987,Michael,Brantley,200,74,L,L,5875000
24960,carraca01,1987,Carlos,Carrasco,212,75,R,R,2337500
24961,chiselo01,1988,Lonnie,Chisenhall,190,74,L,R,2250000
24962,crockky01,1991,Kyle,Crockett,175,74,L,L,510900
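Counting column positions by hand is error-prone. As a hedged sketch, the `get_indexer()` method of a pandas `Index` looks up the positions of a list of names. The snippet rebuilds the column index from the names printed earlier (including *bmi*) so that it runs on its own; with the real data you would call `players_df.columns.get_indexer(...)` instead.

```python
import pandas as pd

# Column names as printed by players_df.columns after the bmi column was added
columns = pd.Index([
    "playerid", "birthyear", "birthcountry", "deathyear", "namefirst",
    "namelast", "weight", "height", "bats", "throws", "yearid", "teamid",
    "lgid", "salary", "bmi",
])

wanted = ["playerid", "birthyear", "namefirst", "namelast", "weight",
          "height", "bats", "throws", "salary"]

# Look up the integer position of each wanted column name
positions = columns.get_indexer(wanted).tolist()
print(positions)
```

The resulting list can be passed to `iloc` as the column selector in place of the hand-typed positions.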
