## Updating data frames

## Data: Palmer penguins

In [2]:
# import libraries
import pandas as pd
import numpy as np

# import seaborn with its standard abbreviation
import seaborn as sns

# will use the random library to create some random numbers
import random

In [4]:
# import data from seaborn
penguins = sns.load_dataset('penguins')

# look at the first five rows
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Adding a column

The general syntax to add a single column is the following:
```
df['new_col_name'] = new_col_values
```
where `new_col_values` can be:
- a pandas Series or a NumPy array with the same length as the data frame, or
- a single number or a single string, which is repeated for every row.
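
As a minimal sketch of the two cases above, the following uses a small made-up data frame (not the penguins data) to add a column from a single value and from an array, and shows that a list of the wrong length is rejected. The column names `study` and `too_short` are hypothetical.

```python
import numpy as np
import pandas as pd

# small hypothetical data frame standing in for penguins
df = pd.DataFrame({'body_mass_g': [3750.0, 3800.0, 3250.0]})

# a single value is repeated for every row
df['study'] = 'palmer'

# an array must have one value per row
df['body_mass_kg'] = np.array([3.75, 3.8, 3.25])

# a list of the wrong length raises a ValueError
try:
    df['too_short'] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(df)
print(length_error)
```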

Example: create a new column where body mass is in kg instead of grams

In [10]:
# convert body mass from grams to kilograms
penguins['body_mass_kg'] = penguins.body_mass_g/1000

# confirm the new column is in the data frame
print('body_mass_kg' in penguins.columns)

True


To create a new column and insert it at a particular position we use `insert()`:
```
df.insert(loc = integer_index,
        column = new_col_name,
        value = new_col_values)
```
Example: suppose each penguin gets a unique code as a 3-digit number. Add this column at the beginning of the data frame.

In [12]:
# create random 3-digit codes
# sample is without replacement
codes = random.sample(range(100,1000), len(penguins))

# add the codes as a column
# works inplace, so you don't need to reassign
penguins.insert(loc = 0,
               column = "code",
               value = codes)

In [13]:
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,552,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75
1,374,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8
2,957,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25
3,235,Adelie,Torgersen,,,,,,
4,902,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45


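What happens if we reassign with `insert()`? Tip: don't do this. `insert()` modifies the data frame in place and returns `None`, so `penguins = penguins.insert(...)` would replace the data frame with `None`.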
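
A quick way to see this, sketched on a small made-up data frame rather than on `penguins`, is to capture the return value of `insert()`:

```python
import pandas as pd

# small hypothetical data frame
df = pd.DataFrame({'species': ['Adelie', 'Gentoo']})

# insert() changes df itself and returns None
result = df.insert(loc=0, column='code', value=[101, 102])

print(result)
print(list(df.columns))
```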

## Adding multiple columns
We can add multiple columns in the same call using `assign()`. It returns a new data frame, so we reassign the result to keep the new columns.
```
df.assign(new_col1_name = new_col1_values,
           new_col2_name = new_col2_values)
```
Example: add two columns:
- flipper length converted from mm to cm
- a code representing the observer

In [15]:
# create codes for observers
observers = random.choices(['A', 'B', 'C'],
                          k = len(penguins))

penguins = penguins.assign(flipper_length_cm = penguins.flipper_length_mm/10,
                          observer = observers)

penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg,flipper_length_cm,observer
0,552,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75,18.1,C
1,374,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8,18.6,B
2,957,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25,19.5,A
3,235,Adelie,Torgersen,,,,,,,,A
4,902,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45,19.3,B
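
The sketch below, on a small made-up data frame, illustrates why the result of `assign()` is stored in a variable: the original data frame is left untouched. The name `with_cm` is hypothetical.

```python
import pandas as pd

# small hypothetical data frame
df = pd.DataFrame({'flipper_length_mm': [181.0, 186.0]})

# assign() returns a new data frame; df itself is not changed
with_cm = df.assign(flipper_length_cm=df.flipper_length_mm / 10,
                    observer=['A', 'B'])

print(list(df.columns))
print(list(with_cm.columns))
```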


## Remove columns

We remove columns using `drop()`:
```
df = df.drop(columns = col_names)
```
where `col_names` can be a single column name or a list of column names.

Example:
We want to drop the flipper length in mm and the body mass in grams

In [18]:
penguins = penguins.drop(columns = ['flipper_length_mm', 'body_mass_g'])
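
As a small illustration of both forms of `col_names`, here is a sketch on a made-up data frame with columns `a`, `b` and `c` (hypothetical names):

```python
import pandas as pd

# small hypothetical data frame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop a single column by name
one_dropped = df.drop(columns='a')

# drop several columns with a list of names
two_dropped = df.drop(columns=['a', 'b'])

print(list(one_dropped.columns))
print(list(two_dropped.columns))
```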

## Updating values

### A single value
We can access a single value in a `pd.DataFrame` using `at[]`, with a row label and a column name:

In [19]:
# bill length of a penguin in the fourth row
penguins.at[3, 'bill_length_mm']

nan

We got an NA, so let's update it to 38.3 mm. We also do this using `at[]`.

In [22]:
penguins.at[3, 'bill_length_mm'] = 38.3

In [None]:
# iat works like at, but takes integer positions for the row and column instead of labels
# here: second row, first column (the code column)
penguins.iat[1, 0] = 999
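
The difference between the two accessors is easier to see when the row labels are not integers. This is a minimal sketch on a made-up data frame with hypothetical row labels `p1` and `p2`:

```python
import pandas as pd

# small hypothetical data frame with string row labels
df = pd.DataFrame({'code': [552, 374], 'bill_length_mm': [39.1, 39.5]},
                  index=['p1', 'p2'])

# at: row label and column name
df.at['p2', 'bill_length_mm'] = 40.0

# iat: integer positions (second row, first column)
df.iat[1, 0] = 999

print(df)
```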

## Updating multiple values in a column

### By condition
This is similar to `case_when` in R.

Example: we want to classify penguins such that:

- small penguins: body mass < 3 kg
- medium penguins: 3 kg <= body mass < 5 kg
- large penguins: body mass >= 5 kg

We can use `numpy.select()` to create the new column:

In [24]:
# create a list with the conditions
conditions = [penguins.body_mass_kg < 3,
             (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5),
             5<= penguins.body_mass_kg]

# create a list with the choices
choices = ['small',
          'medium',
          'large']

# default = value for anything that doesn't satisfy the conditions
penguins['size'] = np.select(conditions, choices, default = np.nan)

penguins

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,flipper_length_cm,observer,size
0,552,Adelie,Torgersen,39.1,18.7,Male,3.75,18.1,C,medium
1,374,Adelie,Torgersen,39.5,17.4,Female,3.80,18.6,B,medium
2,957,Adelie,Torgersen,40.3,18.0,Female,3.25,19.5,A,medium
3,235,Adelie,Torgersen,38.3,,,,,A,
4,902,Adelie,Torgersen,36.7,19.3,Female,3.45,19.3,B,medium
...,...,...,...,...,...,...,...,...,...,...
339,349,Gentoo,Biscoe,,,,,,B,
340,449,Gentoo,Biscoe,46.8,14.3,Female,4.85,21.5,B,medium
341,511,Gentoo,Biscoe,50.4,15.7,Male,5.75,22.2,C,large
342,925,Gentoo,Biscoe,45.2,14.8,Female,5.20,21.2,C,large
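
Here is the same pattern as a self-contained sketch on a small made-up data frame. One difference from the cell above is an assumption of this sketch: the default is the string `'unknown'` rather than `np.nan`, because mixing string choices with a float default may, depending on the NumPy version, either turn the missing value into the text `'nan'` or raise a type error.

```python
import numpy as np
import pandas as pd

# small hypothetical data frame, including one missing body mass
df = pd.DataFrame({'body_mass_kg': [2.9, 3.75, 5.2, np.nan]})

# conditions are checked in order; the first one that is True wins
conditions = [df.body_mass_kg < 3,
              (3 <= df.body_mass_kg) & (df.body_mass_kg < 5),
              5 <= df.body_mass_kg]

choices = ['small', 'medium', 'large']

# rows matching no condition (here, the missing value) get the default
df['size'] = np.select(conditions, choices, default='unknown')

print(df)
```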


### Update a column by selecting values

Sometimes we just want to update a few values that satisfy a condition.
We can do this by selecting using `loc` and then assigning a new value
```
# modifies in place
df.loc[row_selection, col_name] = new_values
```
where

- `row_selection` = the rows we want to update (for example, a boolean condition),
- `col_name` = a single column name,
- `new_values` = the value or values to assign.


In [27]:
# change "Male" in the sex column to "M"
penguins.loc[penguins.sex == "Male", 'sex'] = "M"

penguins.sex.unique()

array(['M', 'Female', nan], dtype=object)
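
The same kind of update, sketched on a small made-up data frame, shows that only the selected rows change and that rows with missing values are left alone:

```python
import pandas as pd

# small hypothetical data frame with one missing value
df = pd.DataFrame({'sex': ['Male', 'Female', None, 'Male']})

# only rows where the condition is True are updated
df.loc[df.sex == 'Female', 'sex'] = 'F'

print(df.sex.tolist())
```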

Something that won't work...

In [28]:
penguins[penguins.sex == 'Female']['sex'] = 'F'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  penguins[penguins.sex == 'Female']['sex'] = 'F'


We get this warning when we select with chained indexing (`[][]`) instead of `loc`. The `sex` column in `penguins` is not updated.
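
To check that the chained assignment really leaves the data unchanged, here is a sketch on a small made-up data frame. The warning is silenced only to keep the demo output clean, which is a choice made for this sketch and not a recommendation.

```python
import warnings
import pandas as pd

# small hypothetical data frame
df = pd.DataFrame({'sex': ['Male', 'Female', 'Female']})

# chained indexing: the assignment lands on a temporary object
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the warning for this demo only
    df[df.sex == 'Female']['sex'] = 'F'

after_chained = df.sex.tolist()

# loc: the assignment is applied to df itself
df.loc[df.sex == 'Female', 'sex'] = 'F'

after_loc = df.sex.tolist()

print(after_chained)
print(after_loc)
```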

## Views and copies

Some `pandas` operations return a view of your data, while others return a copy.

- **Views** are actual subsets of our original data. When updated, we're modifying the original data frame

- **Copies** are unique objects, independent of our original data frame. When we update a copy we are not modifying the original dataframe

Depending on what we're doing, we may want to update the original data frame or work on an independent copy.

### Another `SettingWithCopyWarning`

Another common situation where this warning comes up is when we try to update a subset of our data frame that is already stored in a variable.

In [29]:
# example
biscoe = penguins[penguins.island == 'Biscoe']

biscoe['sample_column'] = 100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Essentially what we did was

```python
penguins[penguins.island == 'Biscoe']['sample_column'] = 100
```

In [30]:
# make an explicit copy with copy() so that updating the subset does not raise the warning
biscoe = penguins[penguins.island=='Biscoe'].copy()

biscoe['sample_column'] = 100
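
As a final check, sketched on a small made-up data frame, we can confirm that a subset created with `copy()` is independent: adding a column to it does not touch the original. The name `subset` is hypothetical.

```python
import pandas as pd

# small hypothetical data frame
df = pd.DataFrame({'island': ['Biscoe', 'Dream', 'Biscoe'],
                   'code': [1, 2, 3]})

# explicit copy of the selected rows
subset = df[df.island == 'Biscoe'].copy()

# updating the copy does not affect df
subset['sample_column'] = 100

print(list(df.columns))
print(subset)
```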