---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.16 (Pandas-08)</h1>

<a href="https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-3-Python-for-Data-Scientists/Lec-3.16(Pandas-08-Handling-Missing-Data).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Handling Missing Data.ipynb_

## Learning agenda of this notebook

1. Have an insight about the Dataset
2. Identify the Columns having Null/Missing values using `df.isna()` method
3. Handle/Impute the Null/Missing Values under the `math` Column using `df.loc[mask,col]=value`
4. Handle/Impute the Null/Missing Values under the `group` Column using `df.loc[mask,col]=value`
5. Handle Missing values under a Numeric/Categorical Column using `fillna()`
6. Handle Repeating Values (for same information) under the `session` Column
7. Create a new Column by Modifying an Existing Column
8. Delete Rows Having NaN values using `df.dropna()` method
9. Convert Categorical Variables into Numerical

### Quick Recap Notes

**object.fillna(value, method, inplace=True)**: pass either ``value`` or ``method`` in a single call, never both
- **df.math.fillna(value=df.math.mean(), inplace = True)** ``math`` is the column name here
- **df.fillna(method='ffill', inplace = True)**
- **df.fillna(method='bfill', inplace= True)**

**Series.map(mapping, na_action=None)** (called on a single column, e.g. ``df.session.map(dict1)``)
- in place of ``mapping`` you pass a ``dictionary`` containing the old values as keys and the new values as values
- ``na_action='ignore'`` propagates any NaN values in the column as they are, without passing them to the mapping

**df.dropna(axis, how, subset, inplace)**
- ``axis = 0`` means rows and ``axis=1`` means columns
- ``how = 'any'`` means if any one value is NaN, then drop this column/row
- ``how = 'all'`` means if all the values are NaN, then drop this column/row
- ``subset=[col_name]`` or ``subset=[colname1, colname2]`` lists the columns that are checked for NaN when deciding what to drop
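The `dropna()` arguments listed above can be tried out on a small made-up frame. This is only a sketch: `toy` and its column names are illustrative and are not part of the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the course dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": ["x", np.nan, "z", "w"],
})

rows_any = toy.dropna(axis=0, how="any")   # drop rows with at least one NaN
rows_all = toy.dropna(axis=0, how="all")   # drop rows where every value is NaN
rows_sub = toy.dropna(subset=["a"])        # only column "a" decides which rows go

print(rows_any.index.tolist())   # [2]
print(rows_all.index.tolist())   # [0, 2, 3]
print(rows_sub.index.tolist())   # [0, 2]
```

Passing `axis=1` applies the same rules to columns instead of rows.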

**pd.get_dummies(df)** converts every categorical (object/category) column of ``df`` into indicator (dummy) columns; numeric columns are left unchanged

**pd.get_dummies(df.gender, drop_first=False)** converts only the ``gender`` column into indicator columns
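A minimal sketch of the two `get_dummies()` calls, assuming a tiny made-up frame (`toy` is hypothetical, with only a `gender` and an `age` column). It only inspects the resulting column names, because the dtype of the indicator columns differs between pandas versions.

```python
import pandas as pd

# Hypothetical toy frame, not the course dataset
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "age": [28, 54, 34],
})

all_dummies = pd.get_dummies(toy)                         # age kept, gender expanded
gender_only = pd.get_dummies(toy.gender)                  # one column per category
gender_k1 = pd.get_dummies(toy.gender, drop_first=True)   # first category dropped

print(all_dummies.columns.tolist())   # ['age', 'gender_female', 'gender_male']
print(gender_only.columns.tolist())   # ['female', 'male']
print(gender_k1.columns.tolist())     # ['male']
```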

## 1. Have an Insight about the Dataset

In [None]:
! cat datasets/group-marks.csv

In [1]:
# import the pandas library
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


In [2]:
df.shape

(50, 10)

- Whenever **`pd.read_csv()`** detects a missing value (nothing between two commas in a csv file or an empty cell in Excel), **it flags it with NaN**. There can be many reasons for these NaN values; for example, the data may have been gathered from people via a Google Form in which this field was optional and was skipped.
- Another scenario is that a user has entered some text under a numeric field for which he/she does not have any information.
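Both scenarios can be reproduced without a file on disk. The sketch below assumes a small inline CSV (`csv_text` is made up for illustration) in which one field is empty and another holds text in a numeric column.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a file on disk
csv_text = """rollno,math,english
MS01,No Idea,72
MS02,69,
MS03,,95
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Empty fields are flagged as NaN; the text 'No Idea' is kept as a string
print(raw.isna().sum())
# The stray text stops pandas from reading math as a numeric column
print(pd.api.types.is_numeric_dtype(raw.math))      # False
print(pd.api.types.is_numeric_dtype(raw.english))   # True
```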

## 2. Identify the Columns having Null/Missing values
- The **`df.isna()`** method is recommended over `df.isnull()` (the two are aliases and behave identically). It returns a **boolean same-sized** object that indicates whether each element is an NA value. **Missing values get mapped to True**. Everything **else gets mapped to False**. Remember, characters such as **empty strings ``''`` or `numpy.inf` are not considered NA values**.
- The **`df.notna()`** method is recommended over its alias `df.notnull()`. It returns a boolean same-sized object that indicates whether each element is *not* an NA value. **Non-missing values get mapped to True.**
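A short sketch of these rules on two made-up Series (`s` and `t` are illustrative names):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.inf, None])
t = pd.Series(["a", "", None])

print(s.isna().tolist())             # [False, True, False, True]
print(t.isna().tolist())             # [False, False, True]
print(s.notna().tolist())            # [True, False, True, False]
print(s.isnull().equals(s.isna()))   # True
```

Note that `numpy.inf` and the empty string are both reported as non-missing.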

In [80]:
df.math.isna()
#df.math.isnull() # isnull() is an alias of isna(), so both return the same result

0     False
1     False
2      True
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23     True
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44     True
45    False
46    False
47    False
48    False
49    False
Name: math, dtype: bool

In [3]:
df.loc[df.math.isna(), :]

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
23,MS24,LAIBA,female,group C,AFTERNOON,37,3000,,73.0,73
44,MS45,ZAINAB,female,group E,MOR,28,3500,,56.0,54


In [4]:
df.isna().head() # isna() is applied to the whole dataframe; head() shows the first five rows

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False


In [5]:
df.notna().head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True
2,True,True,True,False,True,True,True,False,True,True
3,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True


In [6]:
# Now we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.isna().sum() # called on the whole dataframe

rollno         0
name           0
gender         0
group          3
session        0
age            0
scholarship    0
math           4
english        3
urdu           0
dtype: int64

In [7]:
# Similarly, we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.notna().sum()

rollno         50
name           50
gender         50
group          47
session        50
age            50
scholarship    50
math           46
english        47
urdu           50
dtype: int64

In [8]:
df.math.notna().sum()
#df['math'].notna().sum() #both are same

46

## 3. Handle/Impute the Null/Missing Values under the `math` Column

### a. Identify the Rows under the `math` Column having Null/Missing values
- The `isna()` method works equally well on **Series objects**

In [9]:
mask = df.math.isna()
mask

0     False
1     False
2      True
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23     True
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44     True
45    False
46    False
47    False
48    False
49    False
Name: math, dtype: bool

In [10]:
df.loc[df.math.isna(), :]

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
23,MS24,LAIBA,female,group C,AFTERNOON,37,3000,,73.0,73
44,MS45,ZAINAB,female,group E,MOR,28,3500,,56.0,54


In [12]:
# This will return only those rows of dataframe having null values under the math column
#df[mask]         
#df[df.math.isna()] # same as df[mask]

#df.loc[mask, :]  
df.loc[df.math.isna(), :] # same as df.loc[mask, :]

# All four expressions above return the same rows

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
23,MS24,LAIBA,female,group C,AFTERNOON,37,3000,,73.0,73
44,MS45,ZAINAB,female,group E,MOR,28,3500,,56.0,54


### b. Replace the Null/Missing Values under the `math` Column of DataFrame df
- After detecting the NaN values, the next question is, what value we should write in the cells where we have Null/Missing values under the `math` column
- Suppose, we want to **put the average values at the place of missing values**.

In [13]:
df

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
6,MS07,SARA,female,group B,EVENING,47,3000,88,95.0,92
7,MS08,ABDULLAH,male,group B,EVE,33,2000,40,43.0,39
8,MS09,KHAN,male,group D,MORNING,27,2500,64,,67
9,MS10,HASEENA,female,group B,AFT,33,2800,38,60.0,50


In [14]:
# Compute the mean of math column
df.math.mean() #it will give an error

TypeError: can only concatenate str (not "int") to str

> Judging by the error, it appears that the `math` column does not have the `int64` or `float64` type. Let us check this out

In [15]:
# Check out the data type of math column
df['math'].dtypes

dtype('O')

In [93]:
# We can also use the `df.info()` method to display the column names, the count of non-null values in each column,
# their datatypes and the memory usage of the dataframe.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rollno       50 non-null     object 
 1   name         50 non-null     object 
 2   gender       50 non-null     object 
 3   group        47 non-null     object 
 4   session      50 non-null     object 
 5   age          50 non-null     int64  
 6   scholarship  50 non-null     int64  
 7   math         46 non-null     object 
 8   english      47 non-null     float64
 9   urdu         50 non-null     int64  
dtypes: float64(1), int64(3), object(6)
memory usage: 4.0+ KB


- **What can be the reason for this?**
- Let us check out the values under this column

In [94]:
df['math']

0     No Idea
1          69
2         NaN
3          47
4          76
5         NaN
6          88
7          40
8          64
9          38
10         58
11         40
12         65
13         78
14         50
15         69
16         88
17         18
18         46
19         54
20         66
21         65
22         44
23        NaN
24         74
25         73
26         69
27         67
28         70
29         62
30         69
31         63
32         56
33         40
34         97
35         81
36         74
37         50
38         75
39         57
40         55
41         58
42         53
43         59
44        NaN
45         65
46         55
47         66
48         57
49         66
Name: math, dtype: object
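Scanning a long column by eye for stray text does not scale. One alternative, not used in the lecture itself, is `pd.to_numeric()` with `errors='coerce'`, which turns every entry that cannot be parsed into NaN. The sketch below assumes a short made-up Series (`marks`) in place of the real column.

```python
import pandas as pd

# Hypothetical stand-in for the math column
marks = pd.Series(["No Idea", "69", None, "47"], name="math")

# Entries that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(marks, errors="coerce")

# Rows that were present in the data but failed to parse are the offending text
bad = marks[numeric.isna() & marks.notna()]

print(bad.tolist())     # ['No Idea']
print(numeric.mean())   # 58.0
```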

In [16]:
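# We can replace all such values using the `replace()` method
import numpy as np
# Caution: head() returns only the first five rows. Assigning that result back aligns on the index,
# so math becomes NaN for every row from 5 onward (see the output below). To only preview the
# replacement, call df.math.replace('No Idea', np.nan).head() without assigning it.
df['math'] = df.math.replace('No Idea', np.nan).head() # "No Idea" is a string, which makes the whole column object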

In [17]:
df

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
6,MS07,SARA,female,group B,EVENING,47,3000,,95.0,92
7,MS08,ABDULLAH,male,group B,EVE,33,2000,,43.0,39
8,MS09,KHAN,male,group D,MORNING,27,2500,,,67
9,MS10,HASEENA,female,group B,AFT,33,2800,,60.0,50
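Rows 5 to 9 above have lost their math marks. The reason is that a Series assigned to a column is aligned on the index, so any row missing from the right-hand side becomes NaN. A minimal sketch of the effect, using a made-up seven-row frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the course dataset
toy = pd.DataFrame({"math": ["No Idea", "69", "47", "76", "88", "40", "62"]})

# Pitfall: head() returns rows 0-4 only, so rows 5 and 6 align to NaN
truncated = toy.copy()
truncated["math"] = truncated.math.replace("No Idea", np.nan).head()
print(truncated.math.isna().sum())   # 3

# Assigning the full result keeps every row
cleaned = toy.copy()
cleaned["math"] = cleaned.math.replace("No Idea", np.nan).astype(float)
print(cleaned.math.isna().sum())     # 1
```

Only the `No Idea` row is missing in `cleaned`, whereas `truncated` has also lost the last two marks.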


In [18]:
df.math.replace('No Idea', np.nan, inplace=True)

In [19]:
# Note the marks of Saadia in math are changed from string `No Idea` to `NaN`
# Since this seems to be working fine, let us use inplace=True to make these changes in the original dataframe
df.replace('No Idea', np.nan, inplace=True)

In [20]:
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55


In [21]:
# Let us check the data type of math column
df['math'].dtypes

dtype('O')

In [22]:
# It is still object, which is expected because replace() does not change the dtype; we can convert it with the astype() method
df['math'] = df['math'].astype(float)

In [23]:
# Let us check the data type of math column
df['math'].dtypes

dtype('float64')

In [24]:
# Let us compute the average of math marks again 
df.math.mean() 

64.0

In [104]:
# List only those records under math column having Null values
mask = df.math.isna()
df.loc[mask, 'math']

#df[df.math.isna(),'math'] # raises an error: the [] operator accepts a single selector (rows or columns), not a (rows, columns) pair
#df['math'] # a single column label is fine
#df.loc[df.math.isna(), 'math'] # loc accepts a row mask and a column label together

0    NaN
2    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
19   NaN
20   NaN
21   NaN
22   NaN
23   NaN
24   NaN
25   NaN
26   NaN
27   NaN
28   NaN
29   NaN
30   NaN
31   NaN
32   NaN
33   NaN
34   NaN
35   NaN
36   NaN
37   NaN
38   NaN
39   NaN
40   NaN
41   NaN
42   NaN
43   NaN
44   NaN
45   NaN
46   NaN
47   NaN
48   NaN
49   NaN
Name: math, dtype: float64

In [25]:
df.dtypes

rollno          object
name            object
gender          object
group           object
session         object
age              int64
scholarship      int64
math           float64
english        float64
urdu             int64
dtype: object

In [26]:
df.math.mean()

64.0

In [27]:
# Let us replace these values with mean value of the math column

#df[df.math.isna()] = df.math.mean() # would write the mean (64.0) into every column of the rows where df.math.isna() is True
#df[df.math.isna(), 'math'] = df.math.mean() # not possible: the [] operator takes a single selector, not a (rows, columns) pair
df.loc[df.math.isna(),'math'] = df.math.mean() # loc takes the row mask and the column label together
#df.math.replace('NaN', df.math.mean()) # has no effect: missing values are NaN, not the string 'NaN'

In [28]:
# Confirm the result
df.isna().sum()
#df.info()

rollno         0
name           0
gender         0
group          3
session        0
age            0
scholarship    0
math           0
english        3
urdu           0
dtype: int64
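The difference between a row-only assignment and a `loc` assignment is easier to see on a small frame. The sketch below uses made-up data (`toy`, with hypothetical `urdu` and `math` columns).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the course dataset
toy = pd.DataFrame({
    "urdu": [44, 67, 93],
    "math": [40.0, np.nan, 60.0],
})
mask = toy.math.isna()
mean_math = toy.math.mean()   # 50.0, NaN is skipped

# Row-only assignment overwrites every column of the matching rows
overwritten = toy.copy()
overwritten[mask] = mean_math
print(overwritten.urdu.tolist())   # row 1 of urdu has been changed to 50 as well

# loc with a row mask and a column label touches only that column
fixed = toy.copy()
fixed.loc[mask, "math"] = mean_math
print(fixed.urdu.tolist())         # [44, 67, 93]
print(fixed.math.tolist())         # [40.0, 50.0, 60.0]
```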

In [29]:
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,64.0,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,64.0,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55



## 4. Handle/Impute the Null/Missing Values under the `group` Column of Dataframe df
- The `group` column contains categorical values, i.e., a value that can take on one of a limited, and usually fixed, number of possible values.

### a. Identify the Rows under the `group` Column having Null/Missing values

In [32]:
df

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,64.0,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,64.0,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55
5,MS06,SAFIA,female,group B,AFT,23,3800,64.0,83.0,78
6,MS07,SARA,female,group B,EVENING,47,3000,64.0,95.0,92
7,MS08,ABDULLAH,male,group B,EVE,33,2000,64.0,43.0,39
8,MS09,KHAN,male,group D,MORNING,27,2500,64.0,,67
9,MS10,HASEENA,female,group B,AFT,33,2800,64.0,60.0,50


In [33]:
mask = df.group.isna()
mask

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32     True
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
Name: group, dtype: bool

In [34]:
df[mask]          
# df[df.group.isna()] #both are same
df.loc[mask, :]  
# df.loc[df.group.isna()] #both are same

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
2,MS03,ARIFA,female,,EVENING,34,3500,64.0,95.0,93
12,MS13,MAHOOR,female,,MOR,25,2345,64.0,81.0,73
32,MS33,SHAISTA,female,,MORNING,29,3500,64.0,72.0,65


### b. Replace the Null/Missing Values under the `group` Column
- After detecting the NaN values, the next question is, **what value we should write in the cells where we have Null/Missing values**
- Since this is a categorical column having datatype object (group A, group B, group C, ...), so let us replace it with the **value inside the column having the maximum frequency**

In [35]:
# Use the value_counts() method, which returns a Series containing counts of unique values in descending order,
# so the most frequently occurring element comes first. It excludes NA values by default.
df.group.value_counts() # check its arguments with shift + tab
#df.count() 

group C    14
group B    13
group D    12
group A     5
group E     3
Name: group, dtype: int64

In [36]:
# Another way of doing this is to use the mode() method on the column
df.group.mode() 

0    group C
Name: group, dtype: object

In [127]:
# List only those records under group column having Null values
mask = df.group.isna()
df.loc[mask, 'group']     # same as df.loc[df.group.isna(), 'group']

2     NaN
12    NaN
32    NaN
Name: group, dtype: object

In [37]:
# Let us replace these values with maximum occurring value in the `group` column
df.loc[(df.group.isna()),'group'] = 'group C'

In [38]:
# Confirm the result
df.isna().sum()
#df.info()

rollno         0
name           0
gender         0
group          0
session        0
age            0
scholarship    0
math           0
english        3
urdu           0
dtype: int64

In [39]:
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,64.0,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,group C,EVENING,34,3500,64.0,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55


>Note that in the original dataframe Arifa's group information was missing, and now it is `group C`
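The cell above hard-codes `'group C'`. As a sketch of a variant that does not depend on reading the value off the output, the most frequent category can be taken from `mode()` directly. The Series `groups` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the group column
groups = pd.Series(["group B", "group C", np.nan, "group C", np.nan], name="group")

# mode() returns a Series (there can be ties), so take its first entry
most_common = groups.mode()[0]

filled = groups.copy()
filled.loc[filled.isna()] = most_common

print(most_common)            # group C
print(filled.isna().sum())    # 0
```

When several categories tie for the highest count, `mode()` returns all of them in sorted order, so `[0]` picks the first one alphabetically.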

## 5. Handle Missing values under a Numeric/Categorical Column using `fillna()`

### a. Replace the Null/Missing Values under the math Column using `fillna()`
- Using `fillna()` is the **more recommended way** of filling in the Null values within columns of your dataset, **rather than** assigning through the `loc` indexer.

**object.fillna(value, method, inplace=True)**

- The only required argument is either the `value`, with which we want to replace the missing values **OR** the `method` to be used to replace the missing values
- Returns object **with missing values filled** or **None if ``inplace=True``**
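A minimal sketch of the return behaviour, on a made-up frame (`toy` is hypothetical). It assigns the result instead of using `inplace=True`, since recent pandas releases may warn when `inplace=True` is used on a column selected from a DataFrame. It also shows that `value` can be a dict with one fill value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the course dataset
toy = pd.DataFrame({
    "math": [40.0, np.nan, 60.0],
    "group": ["group A", np.nan, "group A"],
})

# fillna() returns a new object; toy itself is unchanged at this point
preview = toy.math.fillna(toy.math.mean())
print(preview.tolist())         # [40.0, 50.0, 60.0]
print(toy.math.isna().sum())    # 1

# A dict fills several columns in one call, each with its own value
filled = toy.fillna({"math": toy.math.mean(), "group": toy.group.mode()[0]})
print(filled.isna().sum().sum())  # 0
```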

In [132]:
# Let us read the dataset again with NA values under math column
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')

In [133]:
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


>- Before proceeding, let us this time handle the string value `No Idea` under the math column while reading the csv file, **instead of doing it** afterwards in the dataframe **using the `replace()`** method as we have done above.
>- For this we will use the `na_values` argument of `pd.read_csv()`, to which you can pass a **single value or a list** of values **to be replaced** with **NaN**

In [134]:
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')

In [135]:
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55
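Besides a single value, `na_values` also accepts a list, or a dict that limits each marker to one column. The sketch below assumes a small inline CSV (`csv_text` and the marker `unknown` are made up for illustration).

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a file on disk
csv_text = """rollno,group,math
MS01,group B,No Idea
MS02,unknown,69
MS03,,47
"""

# A list of markers applies to every column
df_list = pd.read_csv(io.StringIO(csv_text), na_values=["No Idea", "unknown"])
# A dict restricts each marker to the named column
df_dict = pd.read_csv(io.StringIO(csv_text), na_values={"math": ["No Idea"]})

print(df_list.isna().sum())   # group: 2, math: 1
print(df_dict.isna().sum())   # group: 1 (only the empty field), math: 1
print(df_list.math.dtype)     # float64
```

Because `No Idea` is treated as missing at read time, the math column arrives as `float64` and needs no later `astype()` call.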


In [136]:
df.isna().sum()

rollno         0
name           0
gender         0
group          3
session        0
age            0
scholarship    0
math           5
english        3
urdu           0
dtype: int64

In [137]:
df.loc[df.math.isna()]

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,,72.0,74
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
23,MS24,LAIBA,female,group C,AFTERNOON,37,3000,,73.0,73
44,MS45,ZAINAB,female,group E,MOR,28,3500,,56.0,54


In [138]:
# This time, instead of loc, use the fillna() method with just two arguments
# The inplace=True parameter ensures that this happens in the original dataframe

df.math.fillna(value=df.math.mean(), inplace=True)

In [139]:
# Confirm the result
df.isna().sum()
#df.info()

rollno         0
name           0
gender         0
group          3
session        0
age            0
scholarship    0
math           0
english        3
urdu           0
dtype: int64

In [140]:
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,61.644444,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,61.644444,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55


### b. Replace the Null/Missing Values under the `group` Column using `fillna()`

In [141]:
# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55


In [142]:
df.isna().sum()

rollno         0
name           0
gender         0
group          3
session        0
age            0
scholarship    0
math           5
english        3
urdu           0
dtype: int64

In [144]:
# Once again, instead of loc, let us use the fillna() method with just two arguments

#df.group.fillna(value='group C', inplace=True)
df.group.fillna('group C', inplace=True) #both are same

In [145]:
# Confirm the result
df.isna().sum()
#df.info()

rollno         0
name           0
gender         0
group          0
session        0
age            0
scholarship    0
math           5
english        3
urdu           0
dtype: int64

### IMPORTANT

In [147]:
# Let us also fill the math and english columns (scholarship has no missing values, so its fillna() call changes nothing)
df.math.fillna(df.math.mean(), inplace=True)
df.english.fillna(df.english.mean(), inplace=True)
df.scholarship.fillna(df.scholarship.mean(), inplace=True)

In [148]:
# Confirm the result
df.isna().sum()


rollno         0
name           0
gender         0
group          0
session        0
age            0
scholarship    0
math           0
english        0
urdu           0
dtype: int64

### c. Replace the Null/Missing Values under the `math` and `group` Column using `ffill` and `bfill` Arguments
- In above examples, we have used the **mean value in case of numeric column** and **mode value in case of a categorical column** as the filling value to the `fillna()` method

**object.fillna(value, method, inplace=True)**: pass either ``value`` or ``method`` in a single call, never both


- We can pass `ffill` or `bfill` as **method argument** to the `fillna()` method. This will replace the null values with other values from the DataFrame
- `ffill` (Forward fill): It fills the NaN value with the **previous value**
- `bfill` (Back fill): It fills the NaN value with the **Next/Upcoming value**
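The two directions can be compared on a short made-up Series (`s` is illustrative). The sketch uses the `ffill()` and `bfill()` methods, which do the same job as `fillna(method='ffill')` and `fillna(method='bfill')`; newer pandas releases mark the `method` argument of `fillna()` as deprecated in favour of these methods.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 69.0, np.nan, 47.0, np.nan])

forward = s.ffill()        # copy the previous value forward
backward = s.bfill()       # copy the next value backward
both = s.ffill().bfill()   # backward fill covers the leading NaN left by ffill

print(forward.tolist())    # [nan, 69.0, 69.0, 47.0, 47.0]
print(backward.tolist())   # [69.0, 69.0, 47.0, 47.0, nan]
print(both.tolist())       # [69.0, 69.0, 69.0, 47.0, 47.0]
```

A leading NaN survives a forward fill and a trailing NaN survives a backward fill, which is why the two are often chained.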

<img align="right" width="490" height="100"  src="images/bfill.png"  >
<img align="left" width="490" height="100"  src="images/ffill.png"  >

In [149]:
# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')
df.head(20)

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
6,MS07,SARA,female,group B,EVENING,47,3000,88.0,95.0,92
7,MS08,ABDULLAH,male,group B,EVE,33,2000,40.0,43.0,39
8,MS09,KHAN,male,group D,MORNING,27,2500,64.0,,67
9,MS10,HASEENA,female,group B,AFT,33,2800,38.0,60.0,50


In [150]:
df.isna().sum()

rollno         0
name           0
gender         0
group          3
session        0
age            0
scholarship    0
math           5
english        3
urdu           0
dtype: int64

In [153]:
# Forward fill (the 'ffill' option of the method argument)
# If a value is NaN, carry forward the previous value in that column

#df.fillna(value = df.math.mean(), method = 'ffill', inplace=True) # raises an error: pass either value or method, not both
df.fillna(method = 'ffill', inplace=True) # called on the whole dataframe df this time, not on a single column such as df.math
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,group C,EVENING,34,3500,69.0,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55


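>Is it working fine? Not completely: row 0 has no previous row to copy from, so its `math` value is still NaN after the forward fill. The backward fill in the next cell takes care of it.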

In [154]:
df.fillna(method = 'bfill', inplace=True)
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,69.0,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69.0,90.0,88
2,MS03,ARIFA,female,group C,EVENING,34,3500,69.0,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47.0,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76.0,78.0,55


In [155]:
# Confirm the result
df.isna().sum()

rollno         0
name           0
gender         0
group          0
session        0
age            0
scholarship    0
math           0
english        0
urdu           0
dtype: int64

## 6. Handle Repeating Values (for same information) under the `session` Column
- If you look at the values under the `session` column, you will notice that it is a categorical column containing six different categories (as values).
    - Notice that the categories `MORNING` and `MOR` are same
    - Similarly, `AFTERNOON` and `AFT` are same
    - Similarly, `EVENING` and `EVE` are same
- This happens when data is collected from different sources, where the same information is written in different ways
- So the `session` column has six different categories (as values) but should have only three

In [40]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv' )
df

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
6,MS07,SARA,female,group B,EVENING,47,3000,88,95.0,92
7,MS08,ABDULLAH,male,group B,EVE,33,2000,40,43.0,39
8,MS09,KHAN,male,group D,MORNING,27,2500,64,,67
9,MS10,HASEENA,female,group B,AFT,33,2800,38,60.0,50


In [41]:
df.session

0       MORNING
1     AFTERNOON
2       EVENING
3           MOR
4     AFTERNOON
5           AFT
6       EVENING
7           EVE
8       MORNING
9           AFT
10          MOR
11      MORNING
12          MOR
13    AFTERNOON
14          AFT
15      EVENING
16          MOR
17    AFTERNOON
18          AFT
19      MORNING
20    AFTERNOON
21      EVENING
22          MOR
23    AFTERNOON
24          AFT
25      EVENING
26          EVE
27      MORNING
28          AFT
29          MOR
30      EVENING
31          EVE
32      MORNING
33      EVENING
34      MORNING
35    AFTERNOON
36      EVENING
37          MOR
38    AFTERNOON
39          AFT
40      EVENING
41          EVE
42      MORNING
43          AFT
44          MOR
45      EVENING
46          EVE
47      MORNING
48          MOR
49    AFTERNOON
Name: session, dtype: object

In [42]:
# Let us check out the counts of unique values inside the session Column
df.session.value_counts()

EVENING      10
MORNING       9
AFTERNOON     9
MOR           9
AFT           8
EVE           5
Name: session, dtype: int64

In [43]:
df.session.unique()

array(['MORNING', 'AFTERNOON', 'EVENING', 'MOR', 'AFT', 'EVE'],
      dtype=object)

### Handle the Repeating Values under the session Column using `map()`
- To keep the data clean, we will map all these values to only three categories, `MOR`, `AFT` and `EVE`, using the `map()` method of the `session` Series.
```
Series.map(arg, na_action=None)
```
- The `map()` method **substitutes each value in a Series with another value**, which may be **looked up in a** `dict` passed as `arg`. It **returns a new Series** after performing the mapping; any value that has no key in the `dict` becomes NaN.
- Passing `na_action='ignore'` propagates NaN values to the result as they are, without passing them to the mapping correspondence.
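
The behaviour described above can be sketched on a small, hypothetical Series (the labels below, including `NIGHT`, are made up for illustration and are not taken from the dataset). It shows that values missing from the dictionary come back as NaN, and what `na_action='ignore'` does when a function is mapped instead of a dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics the mixed labels of the session column
session = pd.Series(['MORNING', 'MOR', 'AFT', 'NIGHT', np.nan])

mapping = {'MORNING': 'MOR', 'MOR': 'MOR', 'AFTERNOON': 'AFT', 'AFT': 'AFT'}

# 'NIGHT' has no key in the dictionary, so it becomes NaN in the result
mapped = session.map(mapping)

# When mapping a function, na_action='ignore' skips NaN values
# instead of passing them to the function
lowered = session.map(lambda s: s.lower(), na_action='ignore')

print(mapped.tolist())
print(lowered.tolist())
```

Because unmatched values silently turn into NaN, it is worth running `value_counts(dropna=False)` after a dictionary-based `map()` to confirm that nothing was lost.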

In [44]:
# To do this, let us create a new mapping (dictionary) 
dict1 = {
    'MORNING' : 'MOR',
    'MOR' : 'MOR',
    'AFTERNOON' : 'AFT',
    'AFT': 'AFT',
    'EVENING' : 'EVE',
    'EVE': 'EVE'
}

In [45]:
# map() returns a new Series with the same index as the caller; the original Series remains unchanged.
# Here we only preview the result; the next cell assigns it back to `df.session`
df.session.map(dict1)

0     MOR
1     AFT
2     EVE
3     MOR
4     AFT
5     AFT
6     EVE
7     EVE
8     MOR
9     AFT
10    MOR
11    MOR
12    MOR
13    AFT
14    AFT
15    EVE
16    MOR
17    AFT
18    AFT
19    MOR
20    AFT
21    EVE
22    MOR
23    AFT
24    AFT
25    EVE
26    EVE
27    MOR
28    AFT
29    MOR
30    EVE
31    EVE
32    MOR
33    EVE
34    MOR
35    AFT
36    EVE
37    MOR
38    AFT
39    AFT
40    EVE
41    EVE
42    MOR
43    AFT
44    MOR
45    EVE
46    EVE
47    MOR
48    MOR
49    AFT
Name: session, dtype: object

In [162]:
df.session = df.session.map(dict1)

In [163]:
# Count of new categories in the column session
# Observe that the session column now contains only the three intended categories
df.session.value_counts()

MOR    18
AFT    17
EVE    15
Name: session, dtype: int64

In [164]:
# Let us verify the result
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MOR,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFT,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVE,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFT,54,2100,76,78.0,55


## 7. Create a new Column by Modifying an Existing Column
- We have a column `scholarship` in the dataset, which is in Pak Rupees
- Suppose you want a new column that represents the scholarship in US Dollars
- For that, we add a new column by dividing each value of `scholarship` by 150
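
As a minimal sketch of this step, the snippet below builds a tiny hypothetical dataframe and derives the dollar column in two ways: with `apply()` and a lambda, and with plain vectorized division. The names `df_demo`, `PKR_PER_USD` and `scholarship_usd` are illustrative assumptions, and 150 is simply the rate assumed in this lecture.

```python
import pandas as pd

# Hypothetical scholarship amounts in Pak Rupees
df_demo = pd.DataFrame({'scholarship': [2562, 2800, 3500]})

PKR_PER_USD = 150  # rate assumed in this lecture, not a live exchange rate

# Option 1: call a function on every element
via_apply = df_demo['scholarship'].apply(lambda x: x / PKR_PER_USD)

# Option 2: vectorized arithmetic on the whole column at once
df_demo['scholarship_usd'] = df_demo['scholarship'] / PKR_PER_USD

print(df_demo)
```

Both options give the same numbers; for simple arithmetic the vectorized form is usually shorter and faster, while `apply()` is handy when the transformation needs arbitrary Python logic.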

In [165]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv' )
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


In [167]:
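# Preview of the conversion; note that this cell divides by 170, whereas the new column created below uses 150
df.scholarship.apply(lambda x: x/170)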

0     15.070588
1     16.470588
2     20.588235
3     11.764706
4     12.352941
5     22.352941
6     17.647059
7     11.764706
8     14.705882
9     16.470588
10    17.647059
11    19.482353
12    13.794118
13    15.611765
14    12.570588
15    15.100000
16    20.588235
17    14.705882
18    17.647059
19    12.941176
20    20.588235
21    11.764706
22    14.705882
23    17.647059
24    14.705882
25    20.588235
26    14.705882
27    17.647059
28    23.529412
29    20.588235
30    14.705882
31    17.647059
32    20.588235
33    17.647059
34    14.705882
35    20.588235
36    17.647059
37    14.705882
38    20.588235
39    17.647059
40    14.705882
41    20.588235
42    17.647059
43    14.705882
44    20.588235
45    17.647059
46    14.705882
47    20.588235
48    14.705882
49    17.647059
Name: scholarship, dtype: float64

In [168]:
df['Scholarship_in_$'] = df.scholarship.apply(lambda x : x/150)

In [169]:
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu,Scholarship_in_$
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74,17.08
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88,18.666667
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93,23.333333
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44,13.333333
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55,14.0


## 8. Delete Rows Having NaN values using `df.dropna()` method
```
df.dropna(axis, how, subset, inplace)
```
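
Before applying `dropna()` to the real dataset, the effect of the `axis`, `how` and `subset` parameters can be sketched on a small hypothetical frame (the values below are made up so that one row is entirely NaN and two rows are partly NaN).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely NaN, rows 1 and 3 are partly NaN
demo = pd.DataFrame({
    'math':    [69.0, np.nan, np.nan, 47.0],
    'english': [90.0, 95.0,   np.nan, np.nan],
    'urdu':    [88.0, 93.0,   np.nan, 44.0],
})

rows_any = demo.dropna(axis=0, how='any')   # keep only rows without any NaN
rows_all = demo.dropna(axis=0, how='all')   # drop only rows that are entirely NaN
rows_sub = demo.dropna(subset=['math'])     # consider NaN in the math column only
cols_any = demo.dropna(axis=1, how='any')   # drop every column that contains a NaN

print(rows_any.shape, rows_all.shape, rows_sub.shape, cols_any.shape)
```

All four calls return a new dataframe and leave `demo` untouched; passing `inplace=True` would instead modify the caller and return `None`.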

In [170]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


In [None]:
df.shape

In [46]:
# dropna() with no arguments drops every row that contains at least one NaN value
df1 = df.dropna()
df1.shape

(41, 10)

In [47]:
df1.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55
6,MS07,SARA,female,group B,EVENING,47,3000,88,95.0,92


In [48]:
# The same call with the default arguments written out explicitly
df2 = df.dropna(axis=0, how='any')
df2.shape

(41, 10)

In [174]:
# If we set how='all', a row is dropped only if all of its values are NA
df2 = df.dropna(axis=0, how='all')
df2.shape

(50, 10)

In [175]:
# The subset argument takes a list of columns; only NA values in those columns are considered when dropping a row
df2 = df.dropna(axis=0, how='any', subset=['math'])
df2.shape

(46, 10)

In [176]:
# Use of subset argument: the session column has no NA values, so no row is dropped
df2 = df.dropna(axis=0, how='any', subset=['session'])
df2.shape

(50, 10)

In [50]:
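# Having `how='any'` and `subset=listofcolumnnames`, a row is dropped if any of the listed columns has a NA value in that row.
# (With `how='all'`, a row would be dropped only if all the listed columns had a NA value in that row.)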
df2 = df.dropna(axis=0, how='any', subset=['math', 'session'])
df2.shape

(46, 10)

In [51]:
# If we set axis=1 and how='all', a column is dropped only if all the values under it are NA
df2 = df.dropna(axis=1, how='all')
df2.shape

(50, 10)

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   rollno       50 non-null     object 
 1   name         50 non-null     object 
 2   gender       50 non-null     object 
 3   group        47 non-null     object 
 4   session      50 non-null     object 
 5   age          50 non-null     int64  
 6   scholarship  50 non-null     int64  
 7   math         46 non-null     object 
 8   english      47 non-null     float64
 9   urdu         50 non-null     int64  
dtypes: float64(1), int64(3), object(6)
memory usage: 4.0+ KB


In [54]:
# If we set axis=1 and how='any', a column is dropped if any value under it is NA
df2 = df.dropna(axis=1, how='any')
df2.shape

(50, 7)

In [None]:
df2.head()

## 9. Convert Categorical Variables into Numerical
- Most machine learning algorithms do not accept categorical variables, so we need to convert them into numerical ones.
- We can do this using the Pandas function `pd.get_dummies()`, which creates a binary column for each of the categories.
```
pd.get_dummies(data, drop_first=False)
```
- The only required argument is `data`, which can be a dataframe or a series
- The parameter `drop_first` (bool, default `False`) controls whether to get k-1 dummies out of k categorical levels by removing the first level.

**Note:** Making dummy variables takes all the `K` distinct values in one column and makes `K` columns out of them
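
To make the `K` versus `K-1` distinction concrete, here is a minimal sketch on a hypothetical Series with three categories. The `dtype=int` argument is used so the dummies are shown as 0/1, as in the outputs of this notebook; recent pandas releases return `True`/`False` by default.

```python
import pandas as pd

# Hypothetical column with K = 3 distinct categories
session = pd.Series(['MOR', 'AFT', 'EVE', 'MOR'], name='session')

# One indicator column per category (K columns)
full = pd.get_dummies(session, dtype=int)

# drop_first=True removes the first level, leaving K-1 columns
reduced = pd.get_dummies(session, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With `drop_first=True`, a row belonging to the dropped level is still identifiable: it is the row in which all remaining dummy columns are 0.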

### a. Convert all categorical variables into dummy/indicator variables

In [55]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


In [185]:
# currently we have 10 columns in the data
df.shape

(50, 10)

In [56]:
# Convert all categorical variables into dummy/indicator variables
df = pd.get_dummies(df)

In [57]:
# Let us view the dataframe, keep a note on the number of columns
df.head()

Unnamed: 0,age,scholarship,english,urdu,rollno_MS01,rollno_MS02,rollno_MS03,rollno_MS04,rollno_MS05,rollno_MS06,...,math_70,math_73,math_74,math_75,math_76,math_78,math_81,math_88,math_97,math_No Idea
0,28,2562,72.0,74,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,33,2800,90.0,88,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,34,3500,95.0,93,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,44,2000,57.0,44,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,54,2100,78.0,55,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0


In [194]:
# The number of columns has gone up to 142 now
df.shape

(50, 142)

In [None]:
df

- So we now have 142 columns
- This adds a lot of dimensionality to your data, i.e., it increases the number of columns
- It also becomes difficult to deal with that many columns
- This is a trade-off, which can be handled by a technique called dimensionality reduction
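
One simple way to limit this growth, sketched below on a hypothetical four-row frame, is to encode only the columns that are genuinely categorical through the `columns` parameter of `pd.get_dummies()`, so that identifier-like columns such as `rollno` are left alone. The frame `demo` and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical slice resembling the dataset used in this notebook
demo = pd.DataFrame({
    'rollno':  ['MS01', 'MS02', 'MS03', 'MS04'],
    'gender':  ['female', 'female', 'male', 'male'],
    'session': ['MOR', 'AFT', 'EVE', 'MOR'],
    'age':     [28, 33, 34, 44],
})

# Encoding everything also creates one column per roll number
everything = pd.get_dummies(demo, dtype=int)

# Encoding only the chosen columns keeps rollno and age as they are
selected = pd.get_dummies(demo, columns=['gender', 'session'], dtype=int)

print(everything.shape, selected.shape)
```

On the full dataset the difference is much larger, because columns like `rollno` and `name` hold a distinct value in almost every row.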

### b. Perform One-Hot Encoding for Categorical Column `gender` Only
- In our dataframe, the gender column is a categorical column having two values 'male' and 'female'
- `pd.get_dummies()` will create one binary dummy column for each of these values.
- This is also known as `One Hot Encoding`. You will learn more encoding techniques in the data pre-processing module.


In [58]:
import pandas as pd
df1 = pd.read_csv('datasets/group-marks.csv')
df1.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


In [59]:
# Convert only gender variable into dummy/indicator variables
df2 = pd.get_dummies(df1[['gender']])
df2.head()

Unnamed: 0,gender_female,gender_male
0,1,0
1,1,0
2,1,0
3,1,0
4,0,1


In [60]:
# Since one column is enough to encode two categories, simply use the `drop_first` argument of get_dummies to drop the first one
df2 = pd.get_dummies(df1[['gender']], drop_first=True)
df2.head()

Unnamed: 0,gender_male
0,0
1,0
2,0
3,0
4,1


In [61]:
# We will talk about join in the next session in detail.
df3 = df1.join(df2['gender_male'])
df3.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu,gender_male
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74,0
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88,0
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93,0
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44,0
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55,1


In [62]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55


In [63]:
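df.session.value_counts()
# This dictionary covers only two of the long labels; it has no entry for 'EVENING'
dict1 = {
    'MORNING' : 'MOR',
    'AFTERNOON' : 'AFT',
}
# map() would turn every value missing from the dictionary into NaN,
# whereas replace() leaves unmatched values (such as 'EVENING') unchanged
#df.session = df.session.map(dict1)
df.session = df.session.replace(dict1)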
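
The difference between `replace()` and `map()` with a partial dictionary can be sketched on a short hypothetical Series (the four labels below are illustrative, not read from the dataset):

```python
import pandas as pd

# Hypothetical sample of session labels
session = pd.Series(['MORNING', 'AFTERNOON', 'EVENING', 'EVE'])

# Partial dictionary: no entry for 'EVENING' or 'EVE'
partial = {'MORNING': 'MOR', 'AFTERNOON': 'AFT'}

replaced = session.replace(partial)  # unmatched values are kept as they are
mapped = session.map(partial)        # unmatched values become NaN

print(replaced.tolist())
print(mapped.tolist())
```

So `replace()` is the safer choice when the dictionary lists only the values to be changed, while `map()` expects the dictionary to cover every value in the column.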



In [201]:
df

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu
0,MS01,SAADIA,female,group B,MOR,28,2562,No Idea,72.0,74
1,MS02,JUMAIMA,female,group C,AFT,33,2800,69,90.0,88
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44
4,MS05,DANISH,male,group C,AFT,54,2100,76,78.0,55
5,MS06,SAFIA,female,group B,AFT,23,3800,,83.0,78
6,MS07,SARA,female,group B,EVENING,47,3000,88,95.0,92
7,MS08,ABDULLAH,male,group B,EVE,33,2000,40,43.0,39
8,MS09,KHAN,male,group D,MOR,27,2500,64,,67
9,MS10,HASEENA,female,group B,AFT,33,2800,38,60.0,50


In [202]:
df1 = pd.get_dummies(df.session)
df1.head()

Unnamed: 0,AFT,EVE,EVENING,MOR
0,0,0,0,1
1,1,0,0,0
2,0,0,1,0
3,0,0,0,1
4,1,0,0,0


In [211]:
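# IMPORTANT: passing a Series (df.gender) names the dummy column 'male', whereas
# passing a one-column DataFrame (df[['gender']]) prefixes it with the column name: 'gender_male'
#pd.get_dummies(df.gender, drop_first=True)
pd.get_dummies(df[['gender']], drop_first=True)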

Unnamed: 0,gender_male
0,0
1,0
2,0
3,0
4,1
5,0
6,0
7,1
8,1
9,0


In [212]:
df1 = pd.get_dummies(df.gender, drop_first=True)
df3 = df.join(df1['male'])
df3.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu,male
0,MS01,SAADIA,female,group B,MOR,28,2562,No Idea,72.0,74,0
1,MS02,JUMAIMA,female,group C,AFT,33,2800,69,90.0,88,0
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93,0
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44,0
4,MS05,DANISH,male,group C,AFT,54,2100,76,78.0,55,1


In [22]:
df1 = pd.get_dummies(df[['gender']], drop_first=True)
df3 = df.join(df1['gender_male'])
df3.head()

Unnamed: 0,rollno,name,gender,group,session,age,scholarship,math,english,urdu,gender_male
0,MS01,SAADIA,female,group B,MORNING,28,2562,No Idea,72.0,74,0
1,MS02,JUMAIMA,female,group C,AFTERNOON,33,2800,69,90.0,88,0
2,MS03,ARIFA,female,,EVENING,34,3500,,95.0,93,0
3,MS04,SAADIA,female,group A,MOR,44,2000,47,57.0,44,0
4,MS05,DANISH,male,group C,AFTERNOON,54,2100,76,78.0,55,1


