## Data Preprocessing, Data Loading and Dropping Null Values

Data collected from various sources often contains noisy and null (missing) values. Before analysis, it is important to preprocess the dataset by removing them.

## Step 1: Import the Required Libraries

Let's understand how to drop null values.

- Import the pandas and NumPy libraries


In [1]:
import pandas as pd
import numpy as np

## Step 2: Create a Sample Dataset with Null Values

- Create a dictionary containing data with null values
- Convert the dictionary to a DataFrame


In [2]:
data = {'a':[1,2,np.nan],'b':[1,np.nan,3],'c':[1,np.nan,np.nan]}
df = pd.DataFrame(data)

In [None]:
df

     a    b    c
0  1.0  1.0  1.0
1  2.0  NaN  NaN
2  NaN  3.0  NaN


**Observation**

Here, we can see that the DataFrame contains several null values.

## Step 3: Count the Null Values in the DataFrame

Let's count the number of null values in each column of the DataFrame.

- The function isnull() checks every cell and returns a DataFrame of the same shape, containing True where the value is null and False otherwise.

- When summed, True counts as 1 and False as 0, so calling sum() on the result gives the total number of null values in each column.


In [3]:
df.isnull().sum()

a    1
b    1
c    2
dtype: int64

**Observation**

Here, we can see column __a__ has 1 null value, column __b__ has 1 null value and column __c__ has 2 null values.
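To make the counting step concrete, here is a minimal sketch that rebuilds the sample DataFrame and inspects the Boolean mask that isnull() produces. The variable names (`mask`, `nulls_per_column`, `nulls_per_row`, `total_nulls`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 2.
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [1, np.nan, 3], 'c': [1, np.nan, np.nan]})

# isnull() returns a Boolean DataFrame with the same shape as df.
mask = df.isnull()

# Summing down each column counts the True entries (True is treated as 1).
nulls_per_column = mask.sum()

# Summing across each row (axis=1) gives the null count per row instead.
nulls_per_row = mask.sum(axis=1)

# Adding up the per-column counts gives the total for the whole DataFrame.
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column.to_dict())  # {'a': 1, 'b': 1, 'c': 2}
print(nulls_per_row.tolist())      # [0, 2, 2]
print(total_nulls)                 # 4
```

The per-row counts are useful later, when rows are dropped based on a threshold.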

## Step 4: Drop the Rows and Columns with Null Values


To clean the data, we need to drop these null values.

- Use the dropna() function, which by default drops every row that contains at least one null value:

In [4]:
df.dropna()

     a    b    c
0  1.0  1.0  1.0


**Observation**

We can see that all rows with null values have been removed.
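Note that dropna() returns a new DataFrame and leaves the original untouched unless the result is assigned back. The following sketch illustrates this. The name `cleaned` is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 2.
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [1, np.nan, 3], 'c': [1, np.nan, np.nan]})

# dropna() with default arguments drops rows that contain any null value
# and returns the result as a new DataFrame.
cleaned = df.dropna()

# The original DataFrame still has all three rows.
print(df.shape)                # (3, 3)
print(cleaned.shape)           # (1, 3)
print(cleaned.index.tolist())  # [0]
```

Only row 0 survives, and it keeps its original index label.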

- Drop the columns that contain at least one null value:

- Pass **axis=1** to the dropna() function to indicate that columns, rather than rows, should be dropped:

In [5]:
df.dropna(axis=1)

0
1
2


**Observation**

We can see that no columns remain, since every column contains at least one null value.
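As a quick check of this result, the sketch below confirms that every column has at least one null value. For contrast, it also uses the `how='all'` option of dropna(), which the original walkthrough does not cover: it drops a column only when all of its values are null.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 2.
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [1, np.nan, 3], 'c': [1, np.nan, np.nan]})

# True for each column that contains at least one null value.
has_null = df.isnull().any()
print(has_null.to_dict())  # {'a': True, 'b': True, 'c': True}

# Default behaviour (how='any'): every column is dropped, only the index remains.
dropped_any = df.dropna(axis=1)
print(dropped_any.shape)   # (3, 0)

# how='all' drops a column only if all of its values are null,
# so no column is dropped here.
dropped_all = df.dropna(axis=1, how='all')
print(dropped_all.shape)   # (3, 3)
```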

## Step 5: Drop Columns Based on a Non-Null Value Threshold

- Drop columns that do not contain two or more non-null values

- The **thresh** parameter sets the minimum number of non-null values a column must have to be kept, so pass **thresh=2** to dropna()


In [6]:
df.dropna(thresh=2,axis=1)

     a    b
0  1.0  1.0
1  2.0  NaN
2  NaN  3.0


**Observation**

We can see that column __c__ has been dropped since it has fewer than two non-null values, but __a__ and __b__ remain because they each have at least two non-null values.
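One way to see what **thresh** does is to reproduce it by hand. This sketch counts the non-null values per column and keeps the columns that reach the threshold. The manual selection with notna() is an equivalent written for illustration, not a description of how pandas implements dropna() internally.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 2.
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [1, np.nan, 3], 'c': [1, np.nan, np.nan]})

# Count the non-null values in each column.
non_null_per_column = df.notna().sum()
print(non_null_per_column.to_dict())  # {'a': 2, 'b': 2, 'c': 1}

# Keep only the columns with at least two non-null values.
kept_manually = df.loc[:, non_null_per_column >= 2]

# The built-in threshold gives the same result.
kept_with_thresh = df.dropna(thresh=2, axis=1)

print(kept_with_thresh.columns.tolist())       # ['a', 'b']
print(kept_manually.equals(kept_with_thresh))  # True
```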

## Step 6: Drop Rows Based on a Non-Null Value Threshold

- Drop rows based on different non-null value thresholds:


In [7]:
df.dropna(thresh=1)

     a    b    c
0  1.0  1.0  1.0
1  2.0  NaN  NaN
2  NaN  3.0  NaN


**Observation**

No rows were dropped, since every row has at least one non-null value.

In [8]:
df.dropna(thresh=2)

     a    b    c
0  1.0  1.0  1.0


**Observation**

All rows with fewer than two non-null values are dropped. Rows 1 and 2 each have only one non-null value, so only row 0 remains.
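The row thresholds above can be compared side by side. This sketch counts the non-null values per row and records which row labels survive for several values of **thresh**. The loop and the `kept_rows` dictionary are illustrative additions.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 2.
df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [1, np.nan, 3], 'c': [1, np.nan, np.nan]})

# Number of non-null values in each row.
non_null_per_row = df.notna().sum(axis=1)
print(non_null_per_row.tolist())  # [3, 1, 1]

# For each threshold, record the index labels of the rows that are kept.
kept_rows = {}
for t in (1, 2, 3):
    kept_rows[t] = df.dropna(thresh=t).index.tolist()

print(kept_rows)  # {1: [0, 1, 2], 2: [0], 3: [0]}
```

A row is kept only when its non-null count is at least the threshold, which is why rows 1 and 2 disappear as soon as the threshold reaches 2.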

## Step 7: Modify DataFrame In Place

- Drop rows with null values, and modify the DataFrame in place:


In [9]:
df.dropna(inplace=True)

In [None]:
df

     a    b    c
0  1.0  1.0  1.0
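An alternative to **inplace=True** is to assign the result of dropna() back to a variable. The sketch below applies both approaches to separate copies of the sample data and checks that they end up identical. The names `df_inplace` and `df_reassigned` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

data = {'a': [1, 2, np.nan], 'b': [1, np.nan, 3], 'c': [1, np.nan, np.nan]}

# Approach 1: modify the DataFrame in place, as in Step 7.
df_inplace = pd.DataFrame(data)
df_inplace.dropna(inplace=True)

# Approach 2: assign the returned DataFrame back to the variable.
df_reassigned = pd.DataFrame(data)
df_reassigned = df_reassigned.dropna()

# Both approaches leave only the row without null values.
print(df_inplace.shape)                     # (1, 3)
print(df_inplace.equals(df_reassigned))     # True
```

Reassignment makes the change visible at the point of the call, which some readers find easier to follow in longer notebooks.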
