Learn to modify DataFrames: creating new columns from existing data, dropping columns, and finding/handling missing data (NaN).

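df.drop(labels, axis=0): axis=0 (the default) drops rows, matched by index label.

df.drop(labels, axis=1): axis=1 drops columns, matched by column name.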
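
A minimal sketch of the two axis settings, using a small made-up frame (`toy` and its columns `a` and `b` are hypothetical, not part of the iris data). It also shows that `drop()` returns a new DataFrame and leaves the original alone.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows_dropped = toy.drop(0, axis=0)    # removes the row whose index label is 0
cols_dropped = toy.drop("b", axis=1)  # removes the column named "b"

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
print(toy.shape)           # (3, 2): the original frame is unchanged
```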

df.isnull() (or df.isna()): Creates a boolean DataFrame showing True for every NaN value.

*df.isnull().sum():* Chains .sum() onto the boolean DataFrame from df.isnull(). Because True counts as 1, the result is the number of missing values in each column. This is usually one of the first things you'll run on a new dataset.

*df.dropna():* The "easy" solution. It drops any row containing a NaN value.

*df.fillna(value):* The "smarter" solution. It fills all NaNs with a value you provide (e.g., 0, or the column's average).
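
To see why chaining `.sum()` works, here is a small sketch on a made-up frame (`toy`, `x`, and `y` are hypothetical names). The boolean mask is summed column by column, with each `True` counted as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three NaN values in total
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

mask = toy.isnull()               # boolean DataFrame: True where a value is NaN
missing_per_column = mask.sum()   # True counts as 1, so this counts NaNs per column
total_missing = int(missing_per_column.sum())  # grand total across all columns

print(mask)
print(missing_per_column)
print(total_missing)
```

`toy.isna()` gives the same mask as `toy.isnull()`; the two names are aliases.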

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'

In [3]:
df = pd.read_csv(url)
print(df.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


# Creating a New Column

In [4]:
df['sepal_area'] = df['sepal_length']*df['sepal_width']
print(df.head())

   sepal_length  sepal_width  petal_length  petal_width species  sepal_area
0           5.1          3.5           1.4          0.2  setosa       17.85
1           4.9          3.0           1.4          0.2  setosa       14.70
2           4.7          3.2           1.3          0.2  setosa       15.04
3           4.6          3.1           1.5          0.2  setosa       14.26
4           5.0          3.6           1.4          0.2  setosa       18.00
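
As an alternative sketch, `DataFrame.assign` builds the new column without modifying the original frame. The frame `toy` and its values below are made up for illustration; only the column names mirror the iris data.

```python
import pandas as pd

# Hypothetical toy frame with round numbers so the products are easy to check
toy = pd.DataFrame({"sepal_length": [2.0, 4.0], "sepal_width": [3.0, 0.5]})

# assign() returns a new DataFrame; 'toy' itself keeps its two columns
with_area = toy.assign(sepal_area=toy["sepal_length"] * toy["sepal_width"])

print(with_area)
print(list(toy.columns))  # still only the two original columns
```

Direct assignment with `df['new'] = ...` changes the frame in place, while `assign` is handy when the original should stay as it is.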


# Dropping a Column

In [8]:
# axis=1 tells pandas to drop a column, not a row.
# drop() returns a new DataFrame, so 'df' itself is unchanged here;
# passing inplace=True would modify 'df' directly instead.
df_dropped = df.drop('sepal_area', axis=1)
print(df_dropped)

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]


In [9]:
print(df)

     sepal_length  sepal_width  petal_length  petal_width    species  \
0             5.1          3.5           1.4          0.2     setosa   
1             4.9          3.0           1.4          0.2     setosa   
2             4.7          3.2           1.3          0.2     setosa   
3             4.6          3.1           1.5          0.2     setosa   
4             5.0          3.6           1.4          0.2     setosa   
..            ...          ...           ...          ...        ...   
145           6.7          3.0           5.2          2.3  virginica   
146           6.3          2.5           5.0          1.9  virginica   
147           6.5          3.0           5.2          2.0  virginica   
148           6.2          3.4           5.4          2.3  virginica   
149           5.9          3.0           5.1          1.8  virginica   

     sepal_area  
0         17.85  
1         14.70  
2         15.04  
3         14.26  
4         18.00  
..          ...  
145      

# Finding Missing Data

In [10]:
# First, make an independent copy of the data to experiment on

df_exp = df.copy()
print(df_exp)

     sepal_length  sepal_width  petal_length  petal_width    species  \
0             5.1          3.5           1.4          0.2     setosa   
1             4.9          3.0           1.4          0.2     setosa   
2             4.7          3.2           1.3          0.2     setosa   
3             4.6          3.1           1.5          0.2     setosa   
4             5.0          3.6           1.4          0.2     setosa   
..            ...          ...           ...          ...        ...   
145           6.7          3.0           5.2          2.3  virginica   
146           6.3          2.5           5.0          1.9  virginica   
147           6.5          3.0           5.2          2.0  virginica   
148           6.2          3.4           5.4          2.3  virginica   
149           5.9          3.0           5.1          1.8  virginica   

     sepal_area  
0         17.85  
1         14.70  
2         15.04  
3         14.26  
4         18.00  
..          ...  
145      

In [13]:
# Poke some holes in it: set some values to NaN

df_exp.iloc[0, 0] = np.nan  # row 0, column 0 (sepal_length)
df_exp.iloc[2, 1] = np.nan  # row 2, column 1 (sepal_width)
df_exp.iloc[5, 3] = np.nan  # row 5, column 3 (petal_width)

print(df_exp.head(7))

# Count the missing values in each column
print("\n Missing Values per column :")
print(df_exp.isnull().sum())

   sepal_length  sepal_width  petal_length  petal_width species  sepal_area
0           NaN          3.5           1.4          0.2  setosa       17.85
1           4.9          3.0           1.4          0.2  setosa       14.70
2           4.7          NaN           1.3          0.2  setosa       15.04
3           4.6          3.1           1.5          0.2  setosa       14.26
4           5.0          3.6           1.4          0.2  setosa       18.00
5           5.4          3.9           1.7          NaN  setosa       21.06
6           4.6          3.4           1.4          0.3  setosa       15.64

 Missing Values per column :
sepal_length    1
sepal_width     1
petal_length    0
petal_width     1
species         0
sepal_area      0
dtype: int64


In [17]:
# Handling Missing Data

# Method 1: Drop all rows with any NaN
df_expcl = df_exp.dropna()
print(f"Original Shape: {df_exp.shape}")
print(f"Shape After Dropping NaN: {df_expcl.shape}")

Original Shape: (150, 6)
Shape After Dropping NaN: (147, 6)
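
Dropping every row with any NaN can throw away more data than needed. A hedged sketch of a narrower option: the `subset` parameter of `dropna` limits the check to the listed columns. The frame `toy` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one NaN in 'length' (row 0), one in 'width' (row 2)
toy = pd.DataFrame({
    "length": [np.nan, 4.9, 4.7, 4.6],
    "width": [3.5, 3.0, np.nan, 3.1],
})

any_nan = toy.dropna()                        # drops rows 0 and 2
only_length = toy.dropna(subset=["length"])   # drops row 0 only

print(any_nan.shape)      # (2, 2)
print(only_length.shape)  # (3, 2)
```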


In [20]:
# Method 2: Fill all NaN values with 0

df_expfl = df_exp.fillna(0)
print(f"Filling 0 in place of NaN : \n{df_expfl.head(7)}")

Filling 0 in place of NaN : 
   sepal_length  sepal_width  petal_length  petal_width species  sepal_area
0           0.0          3.5           1.4          0.2  setosa       17.85
1           4.9          3.0           1.4          0.2  setosa       14.70
2           4.7          0.0           1.3          0.2  setosa       15.04
3           4.6          3.1           1.5          0.2  setosa       14.26
4           5.0          3.6           1.4          0.2  setosa       18.00
5           5.4          3.9           1.7          0.0  setosa       21.06
6           4.6          3.4           1.4          0.3  setosa       15.64
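
Filling with 0 can distort a measurement column, since a sepal length of 0.0 is not a realistic value. One common alternative, mentioned in the notes above, is the column's average. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with a NaN, one text column
toy = pd.DataFrame({
    "length": [np.nan, 4.0, 6.0],
    "species": ["a", "b", "c"],
})

# Mean of each numeric column; NaN values are skipped by default
column_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value under its own label
filled = toy.fillna(column_means)

print(column_means)
print(filled)
```

Here the mean of `length` is 5.0, so the missing entry in row 0 becomes 5.0 and the `species` column is left untouched.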
