# Modifying DataFrames

In the previous lesson, you learned what a DataFrame is and how to select subsets of data from one.

In this lesson, you’ll learn how to modify an existing DataFrame. Some of the skills you’ll learn include:
 - Adding columns to a DataFrame
 - Using lambda functions to calculate complex quantities
 - Renaming columns

### Adding a Column I

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way to add a new column is to assign a list with one value per row, so the list must be the same length as the existing DataFrame.

The DataFrame `df` contains information on products sold at a hardware store. Add a column to `df` called `'Sold in Bulk?'`, which indicates if the product is sold in bulk or individually.

In [1]:
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add columns here
df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']
print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?
0           1  3 inch screw                  0.5   0.75           Yes
1           2   2 inch nail                  0.1   0.25           Yes
2           3        hammer                  3.0   5.50            No
3           4   screwdriver                  2.5   3.00            No
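
The list has to supply exactly one value per row. As a minimal sketch of what happens otherwise, the cell below rebuilds a shortened version of the product table (two columns only, for brevity) and tries to assign three values to four rows. The variable `length_error` is only an illustrative name.

```python
import pandas as pd

# Shortened copy of the four-row product table used above.
df = pd.DataFrame(
    [[1, '3 inch screw'], [2, '2 inch nail'], [3, 'hammer'], [4, 'screwdriver']],
    columns=['Product ID', 'Description'],
)

try:
    # Three values for four rows: pandas rejects the assignment.
    df['Sold in Bulk?'] = ['Yes', 'Yes', 'No']
    length_error = None
except ValueError as err:
    length_error = err

print(length_error)       # the message reports the length mismatch
print(list(df.columns))   # the new column was not added
```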


### Adding a Column II

We can also add a new column that is the same for all rows in the DataFrame.

Add a column to `df` called `'Is taxed?'`, which indicates whether or not to collect sales tax on the product.

In [2]:
df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add columns here
df["Is taxed?"] = 'Yes'
print(df)

   Product ID   Description  Cost to Manufacture  Price Is taxed?
0           1  3 inch screw                  0.5   0.75       Yes
1           2   2 inch nail                  0.1   0.25       Yes
2           3        hammer                  3.0   5.50       Yes
3           4   screwdriver                  2.5   3.00       Yes


### Adding a Column III

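Finally, you can add a new column by computing it from existing columns. Arithmetic between columns is applied row by row, so the example below calculates each product's margin as its price minus its cost to manufacture.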

In [3]:
df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add column here
df['Margin'] = df.Price - df['Cost to Manufacture']
print(df)

   Product ID   Description  Cost to Manufacture  Price  Margin
0           1  3 inch screw                  0.5   0.75    0.25
1           2   2 inch nail                  0.1   0.25    0.15
2           3        hammer                  3.0   5.50    2.50
3           4   screwdriver                  2.5   3.00    0.50
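
Other arithmetic operators work the same way, and a newly created column can be used in the next calculation. The sketch below is one possible extension, not part of the original exercise: the column name `'Margin Share'` is hypothetical, and it stores the margin as a fraction of the price, rounded to two decimals.

```python
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Subtraction between two columns is applied row by row.
df['Margin'] = df['Price'] - df['Cost to Manufacture']

# Hypothetical follow-up column: margin divided by price, row by row.
df['Margin Share'] = (df['Margin'] / df['Price']).round(2)
print(df[['Description', 'Margin', 'Margin Share']])
```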


### Performing Column Operations

In the previous exercise, we learned how to add columns to a DataFrame.

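Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition. In that case we can use the `apply` method, which calls a function on every value in a column. The example below uses it to lowercase each name with `str.lower`.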

In [4]:
df = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  ['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])

# Add columns here
df['Lowercase Name'] = df.Name.apply(str.lower)
print(df)

         Name                 Email Lowercase Name
0  JOHN SMITH  john.smith@gmail.com     john smith
1    Jane Doe        jdoe@yahoo.com       jane doe
2   joe schmo  joeschmo@hotmail.com      joe schmo
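
For common string operations, pandas also provides the `.str` accessor, which may be used instead of `apply`. The sketch below compares the two approaches on the same data; `via_apply` and `via_str` are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  ['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])

# apply calls str.lower once for every value in the column.
via_apply = df['Name'].apply(str.lower)

# The .str accessor exposes the same string method for the whole column.
via_str = df['Name'].str.lower()

print(via_apply.tolist() == via_str.tolist())
```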


### Reviewing Lambda Function

A lambda function is a way of defining a function in a single line of code. We usually assign it to a variable so that we can call it by name.

In [5]:
mylambda = lambda x: x[0] + x[-1]
print(mylambda('This is a string'))

Tg
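
To make the comparison concrete, here is the same logic written as an ordinary `def` function. The name `first_and_last` is just an illustrative choice. Because indexing and `+` also work on lists, the lambda is not limited to strings.

```python
# The lambda from the cell above.
mylambda = lambda x: x[0] + x[-1]

# An equivalent function written with def.
def first_and_last(x):
    return x[0] + x[-1]

text = 'This is a string'
print(first_and_last(text) == mylambda(text))

# With a list of numbers, the first and last elements are added.
print(mylambda([1, 2, 3]))
```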


### Reviewing Lambda Function: If Statements

We can make our lambdas more complex by using a conditional expression, a one-line form of an if statement written as `value_if_true if condition else value_if_false`.

In [7]:
mylambda = lambda age: "Welcome to BattleCity!" if age >= 13 else "You must be 13 or older"

print(mylambda(12))
print(mylambda(13))

You must be 13 or older
Welcome to BattleCity!
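
For comparison, the sketch below writes the same check as a regular function with a full if/else statement. The name `check_age` is hypothetical; the lambda is repeated so the cell runs on its own.

```python
mylambda = lambda age: "Welcome to BattleCity!" if age >= 13 else "You must be 13 or older"

# The same logic as a multi-line function.
def check_age(age):
    if age >= 13:
        return "Welcome to BattleCity!"
    else:
        return "You must be 13 or older"

# Both versions agree on every age tried here.
for age in (5, 12, 13, 40):
    print(age, check_age(age) == mylambda(age))
```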


### Applying a Lambda to a Column

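In pandas, we often use lambda functions to perform complex operations on columns. The lambda below splits each name on whitespace and keeps the last piece, and `apply` runs it once for every value in the `name` column.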

In [9]:
df = pd.read_csv('employees.csv')

get_last_name = lambda name: name.split()[-1]
df["last_name"] = df.name.apply(get_last_name)
print(df)

       id               name  hourly_wage  hours_worked  last_name
0   10310      Lauren Durham           19            43     Durham
1   18656      Grace Sellers           17            40    Sellers
2   61254  Shirley Rasmussen           16            30  Rasmussen
3   16886        Brian Rojas           18            47      Rojas
4   89010    Samantha Mosley           11            38     Mosley
5   87246       Louis Guzman           14            39     Guzman
6   20578     Denise Mcclure           15            40    Mcclure
7   12869      James Raymond           15            32    Raymond
8   53461       Noah Collier           18            35    Collier
9   14746    Donna Frederick           20            41  Frederick
10  71127       Shirley Beck           14            32       Beck
11  92522    Christina Kelly            8            44      Kelly
12  22447        Brian Noble           11            39      Noble
13  61654          Randy Key           16            38       
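
Since `employees.csv` is not included here, the sketch below rebuilds a few rows inline so the cell can run on its own. The first two names come from the output above; `'Mary Ann Lee'` is a made-up name added to show that the lambda keeps only the final word of a longer name.

```python
import pandas as pd

# Small stand-in for employees.csv (the third name is hypothetical).
df = pd.DataFrame({
    'name': ['Lauren Durham', 'Randy Key', 'Mary Ann Lee'],
    'hourly_wage': [19, 16, 15],
})

# split() breaks the name on whitespace; [-1] takes the last piece.
get_last_name = lambda name: name.split()[-1]

# Bracket syntax is used here; for this column it is equivalent to df.name.
df['last_name'] = df['name'].apply(get_last_name)
print(df)
```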

### Applying a Lambda to a Row

We can also operate on multiple columns at once. If we call `apply` on the DataFrame itself rather than on a single column, and add the argument `axis=1`, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax `row.column_name` or `row['column_name']`.

In [10]:
df = pd.read_csv('employees.csv')

# Hours beyond 40 are paid at 1.5 times the hourly wage.
total_earned = lambda row: (
    (40 * row['hourly_wage']) + ((row['hours_worked'] - 40) * row['hourly_wage'] * 1.5)
    if row['hours_worked'] > 40
    else row['hourly_wage'] * row['hours_worked']
)

df['total_earned'] = df.apply(total_earned, axis=1)
print(df)

       id               name  hourly_wage  hours_worked  total_earned
0   10310      Lauren Durham           19            43         845.5
1   18656      Grace Sellers           17            40         680.0
2   61254  Shirley Rasmussen           16            30         480.0
3   16886        Brian Rojas           18            47         909.0
4   89010    Samantha Mosley           11            38         418.0
5   87246       Louis Guzman           14            39         546.0
6   20578     Denise Mcclure           15            40         600.0
7   12869      James Raymond           15            32         480.0
8   53461       Noah Collier           18            35         630.0
9   14746    Donna Frederick           20            41         830.0
10  71127       Shirley Beck           14            32         448.0
11  92522    Christina Kelly            8            44         368.0
12  22447        Brian Noble           11            39         429.0
13  61654          R
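
The sketch below repeats the overtime calculation on the first four rows of the table, rebuilt inline so that it runs without `employees.csv`, and writes the logic as a named function for readability. It also shows one caveat with the `row.column_name` syntax: a column called `name` collides with the built-in `name` attribute of a row, which holds the row's index label, so `row['name']` is needed for that column. The helper names `row_labels` and `row_names` are illustrative.

```python
import pandas as pd

# First four rows of the employees table shown above.
df = pd.DataFrame({
    'name': ['Lauren Durham', 'Grace Sellers', 'Shirley Rasmussen', 'Brian Rojas'],
    'hourly_wage': [19, 17, 16, 18],
    'hours_worked': [43, 40, 30, 47],
})

def total_earned(row):
    # Both access styles work for these two columns.
    wage = row['hourly_wage']
    hours = row.hours_worked
    if hours > 40:
        # Overtime hours are paid at 1.5 times the hourly wage.
        return 40 * wage + (hours - 40) * wage * 1.5
    return wage * hours

df['total_earned'] = df.apply(total_earned, axis=1)
print(df)

# row.name is the index label of the row, not the 'name' column.
row_labels = df.apply(lambda row: row.name, axis=1)
row_names = df.apply(lambda row: row['name'], axis=1)
print(row_labels.tolist())
print(row_names.tolist())
```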

### Renaming Columns
When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use `df.column_name` (which tab-completes) rather than `df['column_name']` (which takes up extra space).

You can change all of the column names at once by assigning a new list to the `.columns` attribute. This is convenient when every name needs to change, but be careful: the new names are matched to the columns by position, so getting the order wrong mislabels columns without raising an error.

In [11]:
df = pd.read_csv('imdb.csv')

# Rename columns here
df.columns = ['ID', 'Title', 'Category', 'Year Released', 'Rating']
print(df)

      ID                                      Title Category  Year Released  \
0      1                                     Avatar   action           2009   
1      2                             Jurassic World   action           2015   
2      3                               The Avengers   action           2012   
3      4                            The Dark Knight   action           2008   
4      5  Star Wars: Episode I - The Phantom Menace   action           1999   
..   ...                                        ...      ...            ...   
215  216                                   Hannibal    drama           2001   
216  217                        Catch Me If You Can    drama           2002   
217  218                                  Big Daddy    drama           1999   
218  219                                      Se7en    drama           1995   
219  220                                      Seven    drama           1979   

     Rating  
0       7.9  
1       7.3  
2       8
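
Both failure modes can be seen on a small example. The sketch below uses two rows in the shape of the IMDb table, built inline because `imdb.csv` is not included; the original column names are assumed to be the ones printed in the `.rename` output later in this lesson, and `length_error` is an illustrative variable name.

```python
import pandas as pd

# Two rows shaped like the IMDb table.
df = pd.DataFrame(
    [[1, 'Avatar', 'action', 2009, 7.9],
     [2, 'Jurassic World', 'action', 2015, 7.3]],
    columns=['id', 'name', 'genre', 'year', 'imdb_rating'],
)

# A list with too few names is rejected with a ValueError.
try:
    df.columns = ['ID', 'Title', 'Category', 'Year Released']
    length_error = None
except ValueError as err:
    length_error = err
print(length_error)

# A list of the right length but in the wrong order is accepted silently:
# 'Category' and 'Title' are swapped here on purpose.
df.columns = ['ID', 'Category', 'Title', 'Year Released', 'Rating']
print(df['Category'].tolist())  # movie titles now sit under 'Category'
```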

### Renaming Columns II

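You can also rename individual columns by using the `.rename` method. Pass a dictionary that maps each old name to its new name as the `columns` keyword argument. Adding `inplace=True` modifies `df` directly instead of returning a new DataFrame.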

In [12]:
df = pd.read_csv('imdb.csv')

# Rename columns here
df.rename(columns={'name': 'movie_title'}, inplace=True)
print(df)

      id                                movie_title   genre  year  imdb_rating
0      1                                     Avatar  action  2009          7.9
1      2                             Jurassic World  action  2015          7.3
2      3                               The Avengers  action  2012          8.1
3      4                            The Dark Knight  action  2008          9.0
4      5  Star Wars: Episode I - The Phantom Menace  action  1999          6.6
..   ...                                        ...     ...   ...          ...
215  216                                   Hannibal   drama  2001          6.7
216  217                        Catch Me If You Can   drama  2002          8.0
217  218                                  Big Daddy   drama  1999          6.4
218  219                                      Se7en   drama  1995          8.6
219  220                                      Seven   drama  1979          6.1

[220 rows x 5 columns]
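
Two details of `.rename` are easy to miss, and the sketch below illustrates them on a one-row stand-in for the IMDb table. Without `inplace=True` the original DataFrame is left unchanged, and by default a name that is not a column is ignored rather than reported. The `'director'` key is a hypothetical name that does not exist in the data, used only to show that behavior.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Avatar', 'action', 2009, 7.9]],
    columns=['id', 'name', 'genre', 'year', 'imdb_rating'],
)

# rename returns a new DataFrame; df keeps its original column names.
# 'director' is not a column, so that entry is silently ignored.
renamed = df.rename(columns={'name': 'movie_title', 'director': 'made_by'})
print(list(df.columns))
print(list(renamed.columns))

# errors='raise' turns a missing name into a KeyError instead.
try:
    df.rename(columns={'director': 'made_by'}, errors='raise')
    missing_error = None
except KeyError as err:
    missing_error = err
print(missing_error is not None)
```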


### Review

In [15]:
orders = pd.read_csv('shoefly2.csv')
print(orders.head(5))

vegan_lambda = lambda x: 'animal' if x == 'leather' else 'vegan'
orders['shoe_source'] = orders.shoe_material.apply(vegan_lambda)

salutation_lambda = lambda row: 'Dear Mr. {}'.format(row.last_name) if row.gender == 'male' else 'Dear Ms. {}'.format(row.last_name)

orders['salutation'] = orders.apply(salutation_lambda, axis=1)

print(orders.head(5))

      id first_name last_name  gender                         email  \
0  54791    Rebecca   Lindsay  female  RebeccaLindsay57@hotmail.com   
1  53450      Emily     Joyce  female        EmilyJoyce25@gmail.com   
2  91987      Joyce    Waller  female        Joyce.Waller@gmail.com   
3  14437     Justin  Erickson    male   Justin.Erickson@outlook.com   
4  79357     Andrew     Banks    male              AB4318@gmail.com   

      shoe_type shoe_material shoe_color  
0         clogs  faux-leather      black  
1  ballet flats  faux-leather       navy  
2       sandles        fabric      black  
3         clogs  faux-leather        red  
4         boots       leather      brown  
      id first_name last_name  gender                         email  \
0  54791    Rebecca   Lindsay  female  RebeccaLindsay57@hotmail.com   
1  53450      Emily     Joyce  female        EmilyJoyce25@gmail.com   
2  91987      Joyce    Waller  female        Joyce.Waller@gmail.com   
3  14437     Justin  Erickson  
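
Because the printed output above is cut off before the new columns appear, here is a self-contained sketch of the same two steps. It rebuilds only the columns the lambdas need from the five rows shown above, rather than reading `shoefly2.csv`.

```python
import pandas as pd

# The five orders printed above, reduced to the columns used below.
orders = pd.DataFrame({
    'first_name': ['Rebecca', 'Emily', 'Joyce', 'Justin', 'Andrew'],
    'last_name': ['Lindsay', 'Joyce', 'Waller', 'Erickson', 'Banks'],
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'shoe_material': ['faux-leather', 'faux-leather', 'fabric', 'faux-leather', 'leather'],
})

# Column-wise apply: the lambda receives one shoe_material value at a time.
vegan_lambda = lambda x: 'animal' if x == 'leather' else 'vegan'
orders['shoe_source'] = orders['shoe_material'].apply(vegan_lambda)

# Row-wise apply: the lambda receives a whole row and reads two columns.
salutation_lambda = lambda row: (
    'Dear Mr. {}'.format(row.last_name)
    if row.gender == 'male'
    else 'Dear Ms. {}'.format(row.last_name)
)
orders['salutation'] = orders.apply(salutation_lambda, axis=1)

print(orders[['last_name', 'shoe_source', 'salutation']])
```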