---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.15 (Pandas-07)</h1>

## _Modifying Dataframes.ipynb_

## Learning agenda of this notebook

1. How to Add/Delete a Column in a Dataframe (unconditionally) 
2. How to Add/Delete a Row in a dataframe (unconditionally)
3. Adding a New Column with Conditional Values
4. Delete a Row Based on Specific Condition
5. Delete a Column Based on Specific Condition

##  Read a Sample Dataframe

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('../course-datasets/groupdata.csv')
df.head()


Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [None]:
df.shape

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.dtypes

## 1. How to Add/Delete a Column in a Dataframe (unconditionally)

### a. Add a New Column to a Dataframe
- To add a new column to a dataframe, create an appropriate series and then assign it to the dataframe.
- Once a series is added as a column, its name also becomes an attribute of that dataframe, provided the name is a valid Python identifier (a name such as `roll no` can only be accessed with the `[]` operator).
- The series can be created from scratch, which is cumbersome if the dataframe has thousands of rows.
- Another common way to add a column is to construct a series from the existing data within the dataframe.
- Let us understand this with an example.

In [None]:
df.columns

In [None]:
df.subj1 + df.subj2

In [None]:
df.subj1.add(df.subj2)

In [None]:
ser1 = df.subj1.add(df.subj2, fill_value=0)
ser1
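
The three cells above differ only in how missing values are handled. The following is a minimal sketch on a hypothetical three-row frame (the name `marks` and its values are assumptions, not taken from the course dataset) showing that `+` and `.add()` propagate `NaN`, while `fill_value=0` treats a single missing operand as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical marks; one value is missing in each column
marks = pd.DataFrame({
    'subj1': [70.0, np.nan, 90.0],
    'subj2': [80.0, 75.0, np.nan],
})

plain_sum = marks.subj1 + marks.subj2                      # NaN wherever either operand is NaN
same_as_plus = marks.subj1.add(marks.subj2)                # behaves like the + operator
filled_sum = marks.subj1.add(marks.subj2, fill_value=0)    # a lone NaN is replaced by 0 first

print(plain_sum.tolist())
print(filled_sum.tolist())
```

If both operands are missing in the same row, the result stays `NaN` even with `fill_value`.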

In [None]:
# On the left side of assignment you must use `[]` operator, while on the right you can use dot operator as well
df['total'] = ser1

Note that when a Jupyter notebook cell produces no output, the statement has still been executed in the background. Here, a new column named `total` has been added to the dataframe `df`. Let us confirm this:

In [None]:
df.head()

**Dear students, how can we add a column in the beginning or in between instead of adding it at the end?**
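
One possible answer to the question above is the `DataFrame.insert()` method, which takes the target position as its first argument. The sketch below uses a hypothetical three-row frame (`df_demo` and its values are assumptions standing in for the course dataset).

```python
import pandas as pd

# Hypothetical three-row frame standing in for the course dataset
df_demo = pd.DataFrame({
    'name': ['Rauf', 'Arif', 'Shaista'],
    'subj1': [78.0, 70.0, 65.0],
    'subj2': [84.0, 60.0, 75.0],
})

total = df_demo.subj1 + df_demo.subj2

# insert(loc, column, value) modifies the dataframe in place and returns None
df_demo.insert(0, 'total', total)              # loc=0 puts the column first

# Place another column right after 'name' by looking up its position
pos = df_demo.columns.get_loc('name') + 1
df_demo.insert(pos, 'average', total / 2)

print(df_demo.columns.tolist())
```

Unlike `df['total'] = ...`, `insert()` raises an error if the column name already exists, unless `allow_duplicates=True` is passed.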

### b. Delete a Column from a Dataframe
- You can use any of the following ways to delete a column from a dataframe:
    - `del df['colname']` removes the column in place, but does not return it
    - `df.pop('colname')` removes the column in place and also returns the deleted column as a series
    - `df.drop()` is more flexible than the above two. It can delete more than one column and is not in place by default, i.e., you can use it to preview the result, and once you are sure, make the change permanent with `inplace=True` or by reassigning the result

**Option 1: `del df['colname']`**

In [None]:
df.columns

In [None]:
del df['total']

In [None]:
df.head()

**Option 2: `df.pop('colname')`**

In [3]:
df.pop('address')

0        Lahore
1     Islamabad
2       Karachi
3        Lahore
4      Peshawer
5        Lahore
6       Sialkot
7        Multan
8       Karachi
9        Lahore
10    Islamabad
11      Karachi
12       Lahore
13       Multan
14      Sialkot
15       Multan
Name: address, dtype: object

In [4]:
df.head()

Unnamed: 0,roll no,name,age,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,AFT,group D,Female,65.9,72.8,3500.0


**Option 3:**
```
df.drop(columns=[---],  axis=1, inplace=False)
```
- If you want to drop column(s), pass the names of the columns to be deleted as a Python list to the `columns` parameter
- The `axis` argument specifies the direction of operation
    - `axis=1` or `axis='columns'` targets column labels and is used to delete a column from the dataframe. When the `columns` parameter is used, `axis` is redundant
- Most Pandas methods that return a dataframe have an `inplace` parameter with a default value of `False`. It means the method returns a new dataframe and leaves the original dataframe unchanged

In [38]:
import pandas as pd
df = pd.read_csv('../course-datasets/groupdata.csv')
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [40]:
# Remember, axis is the direction of operation; axis=1 refers to columns
df.drop(columns=['age', 'address'], axis=1)

Unnamed: 0,roll no,name,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,Mohid,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,AFT,group B,Female,90.2,,4000.0
7,MS08,Idrees,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,Jamil,AFT,group C,Male,90.5,81.3,3500.0
9,MS10,Shahid,AFTERNOON,group D,Male,90.5,81.3,3800.0


In [41]:
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [42]:
df.drop([])

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0


## 2. How to Add/Delete a Row in a dataframe (unconditionally)

In [28]:
import numpy as np
import pandas as pd
df = pd.read_csv('../course-datasets/groupdata.csv')
df.tail()


Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,
12,MS13,Mujahid,18,Lahore,MORNING,group D,Male,,76.5,7000.0
13,MS14,Sara,28,Multan,AFTERNOON,group A,Female,84.1,76.0,8000.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0
15,MS16,Kakamanna,42,Multan,AFTERNOON,group A,Male,90.5,81.3,3800.0


### b. Add a Row in the Dataframe
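- To add a new row to a dataframe, create a dataframe holding that row and then use the `df.append()` method, which returns a new dataframe with the row added.
```
df.append(other, ignore_index=False)
```
- Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, `pd.concat([df, other], ignore_index=True)` gives the same result.

**More on append in next session**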

In [29]:
newdf = pd.DataFrame(data=[['MS222', 'New Student', 100, 'Kamokey', 'AFT', 'group D', 'Male', 55.0, 55.0, 9999]],
     columns=['roll no', 'name', 'age', 'address', 'session', 'group', 'gender','subj1', 'subj2', 'scholarship'])
newdf

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS222,New Student,100,Kamokey,AFT,group D,Male,55.0,55.0,9999


If the new row has fewer values than the number of column names passed to `pd.DataFrame()`, the constructor raises a `ValueError`.

In [30]:
df1 = df.append(newdf, ignore_index=True)
df1

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0
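
On pandas 2.0 and later, where `DataFrame.append()` is no longer available, the same row addition can be sketched with `pd.concat()`. The frames below (`df_demo`, `new_row`) are hypothetical, shortened stand-ins for `df` and `newdf`.

```python
import pandas as pd

# Hypothetical, shortened versions of df and newdf
df_demo = pd.DataFrame({
    'roll no': ['MS01', 'MS02'],
    'name': ['Rauf', 'Arif'],
    'age': [52, 51],
})
new_row = pd.DataFrame([['MS222', 'New Student', 100]],
                       columns=['roll no', 'name', 'age'])

# ignore_index=True renumbers the result 0..n-1 instead of keeping
# each frame's own index labels
combined = pd.concat([df_demo, new_row], ignore_index=True)
print(combined)
```

Like `append()`, `pd.concat()` returns a new dataframe and leaves its inputs unchanged.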


### c. Delete a Row in the Dataframe
- The `df.drop()` method, used above to delete columns, can also be used to delete one or more rows from a dataframe
```
df.drop(index=[---],  axis=0, inplace=False)
```
- If you want to drop row(s), pass the row index labels to be deleted as a Python list to the `index` parameter
- The `axis` argument specifies the direction of operation
    - `axis=0` or `axis='index'` targets row labels and is used to delete a row from the dataframe
- The default value of the `inplace` argument in most Pandas methods is `False`, meaning the method returns a new dataframe after performing the operation and the original dataframe remains unchanged

In [31]:
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [32]:
df.drop(index=[0,1], axis=0).head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
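
As the output above shows, the surviving rows keep their original index labels (2, 3, ...). If a fresh 0-based index is wanted, one option is `reset_index(drop=True)`. This is a minimal sketch on a hypothetical four-row frame (`df_demo` is an assumption).

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['Rauf', 'Arif', 'Shaista', 'Hadeed']})

# drop() returns a new dataframe; the remaining rows keep labels 2 and 3
trimmed = df_demo.drop(index=[0, 1])

# reset_index(drop=True) renumbers from 0 and discards the old labels
renumbered = trimmed.reset_index(drop=True)

print(trimmed.index.tolist())
print(renumbered.index.tolist())
```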


## 3. Adding a New Column with Conditional Values

**Create a Simple Dataframe**

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv('../course-datasets/groupdata.csv')
df.head()



**Add a new column 'Grade Subj1', containing string "Good" if subj1 >= 80, else "Bad"**

In [None]:
df['Grade Subj1'] = ['Good' if i >=80 else 'Bad' for i in df.subj1]
df

**Add a new column 'Grade Subj2', containing string "A", "B", "C" or "Bad Grade" based on multiple conditions on subj2 marks**

In [None]:
# Create a new column named 'Grade Subj2', holding a letter grade derived from the subj2 marks
# The conditions are evaluated from left to right; the first one that holds decides the grade
df['Grade Subj2'] = ['A' if x >= 80 else 'B' if x >= 70 else 'C' if x >= 60 else "Bad Grade" for x in df.subj2]
df
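
The list comprehensions above loop over the values in Python. A vectorized alternative, shown here only as a sketch on a hypothetical frame (`marks` and its values are assumptions), is `np.where()` for a two-way choice and `np.select()` for several ordered conditions.

```python
import numpy as np
import pandas as pd

# Hypothetical marks, including one missing value
marks = pd.DataFrame({
    'subj1': [78.3, 82.0, np.nan],
    'subj2': [84.4, 72.8, 55.0],
})

# Two-way choice: np.where(condition, value_if_true, value_if_false)
# A NaN mark compares as False, so it ends up as 'Bad'
marks['Grade Subj1'] = np.where(marks.subj1 >= 80, 'Good', 'Bad')

# Multi-way choice: conditions are checked in order, the first match wins
conditions = [marks.subj2 >= 80, marks.subj2 >= 70, marks.subj2 >= 60]
choices = ['A', 'B', 'C']
marks['Grade Subj2'] = np.select(conditions, choices, default='Bad Grade')

print(marks)
```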

## 4. Delete a Row Based on Specific Condition

In [None]:
df

In [None]:
# Let us drop an entire row from the dataframe, in which name is 'Hadeed'
# Get the index labels where name == 'Hadeed' using the .index attribute
count = df[df['name'] == 'Hadeed'].index
count


In [None]:
# Pass those indices to the drop method to delete those rows
df.drop(count, inplace = True)
df
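
Another common way to remove rows by condition is to keep the rows that do not match, using a boolean mask. The sketch below uses a hypothetical three-row frame (`df_demo` is an assumption) and compares both approaches.

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['Rauf', 'Hadeed', 'Zara'], 'age': [52, 20, 40]})

# Keep every row whose name is NOT 'Hadeed'; df_demo itself is left untouched
kept = df_demo[df_demo['name'] != 'Hadeed']

# Same result through drop(), using the index labels of the matching rows
dropped = df_demo.drop(df_demo[df_demo['name'] == 'Hadeed'].index)

print(kept)
```

The mask version needs no intermediate index object, while `drop()` makes the intent of deleting explicit.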

## 5. Delete a Column Based on Specific Condition

In [None]:
# Let us drop every column of the dataframe that contains more than 2 NaN values
# drop() returns a new dataframe here; df itself is not modified

df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > 2)], axis=1)
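
The same rule (drop columns with more than 2 missing values) can also be sketched with a per-column NaN count or with `dropna()` and its `thresh` parameter. The frame `df_demo` below is hypothetical.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e'],
    'subj1': [1.0, np.nan, 3.0, np.nan, 5.0],            # 2 NaN, kept
    'scholarship': [np.nan, np.nan, np.nan, 4.0, 5.0],   # 3 NaN, dropped
})

# Count missing values per column and keep the columns with at most 2
nan_counts = df_demo.isnull().sum()
by_mask = df_demo.loc[:, nan_counts <= 2]

# With dropna: a column survives only if it has at least
# len(df_demo) - 2 non-missing values
by_dropna = df_demo.dropna(axis=1, thresh=len(df_demo) - 2)

print(by_mask.columns.tolist())
print(by_dropna.columns.tolist())
```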

## Change Datatype of a Pandas Series

### a. Changing Datatype from object to float

In [2]:
import numpy as np
import pandas as pd


df = pd.read_csv('../course-datasets/chiporders.csv')
df.head()


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [3]:
df.dtypes

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [4]:
#Suppose we want to change the quantity column to float64 dtype
import numpy as np
df['quantity'] = df.quantity.astype(float)
df.dtypes

order_id                int64
quantity              float64
item_name              object
choice_description     object
item_price             object
dtype: object

In [1]:
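# The following line raises a ValueError, because strings such as '$2.39 ' cannot be parsed as floats
#df['item_price'] = df.item_price.astype(float)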

In [9]:
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
df.item_price.str.replace('$', '', regex=False)
#df['item_price'] = df.item_price.astype(float)


0        2.39 
1        3.39 
2        3.39 
3        2.39 
4       16.98 
         ...  
4617    11.75 
4618    11.75 
4619    11.25 
4620     8.75 
4621     8.75 
Name: item_price, Length: 4622, dtype: object

In [13]:
df['item_price'] = df.item_price.str.replace('$', '', regex=False)
df['item_price']


0        2.39 
1        3.39 
2        3.39 
3        2.39 
4       16.98 
         ...  
4617    11.75 
4618    11.75 
4619    11.25 
4620     8.75 
4621     8.75 
Name: item_price, Length: 4622, dtype: object

In [15]:
df['item_price'] = df.item_price.str.replace('$', '', regex=False).astype(float)
df['item_price']


0        2.39
1        3.39
2        3.39
3        2.39
4       16.98
        ...  
4617    11.75
4618    11.75
4619    11.25
4620     8.75
4621     8.75
Name: item_price, Length: 4622, dtype: float64

In [16]:
df.item_price.mean()

7.464335785374297
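
`astype(float)` fails as soon as a single value cannot be parsed. When the data may contain such values, `pd.to_numeric()` with `errors='coerce'` is one alternative. The sketch below uses a hypothetical series (`prices` and the `'n/a'` entry are assumptions, not values from chiporders.csv).

```python
import pandas as pd

# Hypothetical raw values; the last one is not a number
prices = pd.Series(['$2.39 ', '$16.98 ', 'n/a'])

# Remove the literal '$' and the surrounding whitespace
cleaned = prices.str.replace('$', '', regex=False).str.strip()

# errors='coerce' turns anything unparseable into NaN instead of raising
as_float = pd.to_numeric(cleaned, errors='coerce')

print(as_float)
```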

### b. Changing Datatype from string to boolean

In [17]:
import numpy as np
import pandas as pd
df = pd.read_csv('../course-datasets/groupdata.csv')
df.head()


Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [18]:
df.gender.str.contains('Male')

0      True
1      True
2     False
3      True
4     False
5     False
6     False
7      True
8      True
9      True
10     True
11     True
12     True
13    False
14    False
15     True
Name: gender, dtype: bool

In [19]:
df.gender.str.contains('Female')

0     False
1     False
2      True
3     False
4      True
5      True
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14     True
15    False
Name: gender, dtype: bool

In [20]:
df.gender.str.contains('Female').astype(int)

0     0
1     0
2     1
3     0
4     1
5     1
6     1
7     0
8     0
9     0
10    0
11    0
12    0
13    1
14    1
15    0
Name: gender, dtype: int64

In [21]:
df['gender'] = df.gender.str.contains('Male').astype(int)

In [22]:
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,1,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,1,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,0,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,1,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,0,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,0,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,0,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,1,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,1,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,1,90.5,81.3,3800.0
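
`str.contains()` performs a substring search, which works here only because the match is case-sensitive. An exact comparison avoids that dependence. This is a minimal sketch on a hypothetical series (`gender` is an assumption standing in for `df.gender`).

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Male'])

is_male = gender == 'Male'       # exact match, dtype bool
as_int = is_male.astype(int)     # True becomes 1, False becomes 0

# 'Female' contains 'male', so a case-insensitive substring search
# flags every row
loose = gender.str.contains('male', case=False)

print(as_int.tolist())
print(loose.tolist())
```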
