# Data Cleaning
Data cleaning means fixing bad data in your data set.

Bad data could be:

- Empty cells
- Data in wrong format
- Wrong data
- Duplicates
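
As a rough illustration of how these kinds of bad data can be spotted, the sketch below counts empty cells and duplicate rows and lists the column types. The `sample` frame and its values are hypothetical, made up for this example, and are not the data set loaded later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the data set used in this notebook.
sample = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.3, 2.0],
    "Salary": [39343.0, 46205.0, 46205.0, np.nan],
})

missing_per_column = sample.isna().sum()         # empty cells in each column
duplicate_rows = int(sample.duplicated().sum())  # rows identical to an earlier row
column_types = sample.dtypes                     # an unexpected dtype hints at a wrong format

print(missing_per_column)
print(duplicate_rows)
print(column_types)
```

A numeric column that shows up as `object` in `dtypes` is often a sign that some cells hold text instead of numbers.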

In [130]:
# Empty cells can give you a wrong result when you analyze data.

# Remove Rows
# One way to deal with empty cells is to remove the rows that contain them.

# This is usually acceptable when the data set is large and only a few rows
# are affected, because dropping them has little impact on the result.

# Empty Cell or Missing Data

In [131]:
import pandas as pd

# Raw string (r'...') so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'Z:\Study\ziggy std\Mr. Prakash Senapathi\dataset/new missing data.csv')
df

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
5              2.9  56642.0
6              3.0      NaN
7              3.2  54445.0
8              3.2  64445.0
9              3.7  57189.0


In [132]:
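# dropna() returns a new DataFrame without the rows that contain empty cells; df itself is unchanged
new_df = df.dropna()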

print(new_df)

    YearsExperience    Salary
0               1.1   39343.0
1               1.3   46205.0
2               1.5   37731.0
3               2.0   43525.0
4               2.2   39891.0
5               2.9   56642.0
7               3.2   54445.0
8               3.2   64445.0
9               3.7   57189.0
10              3.9   63218.0
11              4.0   55794.0
12              4.0   56957.0
14              4.5   61111.0
15              4.9   67938.0
16              5.1   66029.0
17              5.3   83088.0
18              5.9   81363.0
19              6.0   93940.0
20              6.8   91738.0
21              7.1   98273.0
22              7.9  101302.0
24              8.7  109431.0
25              9.0  105582.0
26              9.5  116969.0
27              9.6  112635.0
28             10.3  122391.0
29             10.5  121872.0
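
Note that the row labels of the dropped rows (6, 13 and 23) are missing from the output above, because dropna() keeps the original index. The sketch below shows the same behaviour on a small frame, and also uses the `subset` parameter of dropna() to drop rows only when a chosen column is empty. The `sample` frame is hypothetical and stands in for the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the data set used in this notebook.
sample = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0],
    "Salary": [39343.0, np.nan, 37731.0, 43525.0],
})

# Count the rows that hold at least one empty cell before removing anything
rows_with_gaps = int(sample.isna().any(axis=1).sum())

# Drop a row only when its Salary cell is empty; sample itself is not modified
cleaned = sample.dropna(subset=["Salary"])

print(rows_with_gaps)
print(len(sample), len(cleaned))
print(cleaned.index.tolist())
```

If a continuous index is needed afterwards, `cleaned.reset_index(drop=True)` renumbers the rows from 0.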


# Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty cells.

The fillna() method allows us to replace empty cells with a value:

In [133]:
# Raw string (r'...') so the backslashes in the Windows path are not treated as escape sequences
df1 = pd.read_csv(r'Z:\Study\ziggy std\Mr. Prakash Senapathi\dataset/new missing data.csv')
df1

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
5              2.9  56642.0
6              3.0      NaN
7              3.2  54445.0
8              3.2  64445.0
9              3.7  57189.0


In [134]:
# Replace every empty cell in the DataFrame with 1300, modifying df1 in place
df1.fillna(1300, inplace=True)
df1

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
5              2.9  56642.0
6              3.0   1300.0
7              3.2  54445.0
8              3.2  64445.0
9              3.7  57189.0
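
A minimal sketch of the alternative to `inplace=True`: called without it, fillna() returns a new DataFrame and leaves the original untouched, which makes it easy to compare the data before and after. The `sample` frame is hypothetical, and the fill value 1300 is only a placeholder, as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the data set used in this notebook.
sample = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5],
    "Salary": [39343.0, np.nan, 37731.0],
})

# Without inplace=True, fillna() returns a filled copy
filled = sample.fillna(1300)

print(int(sample["Salary"].isna().sum()))  # empty cells left in the original
print(int(filled["Salary"].isna().sum()))  # empty cells left in the copy
print(filled.loc[1, "Salary"])
```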


# Replace Only For Specified Columns
The example above replaces all empty cells in the whole DataFrame. To replace empty values in only one column, select that column before calling fillna().

In [135]:
# Raw string (r'...') so the backslashes in the Windows path are not treated as escape sequences
df1 = pd.read_csv(r'Z:\Study\ziggy std\Mr. Prakash Senapathi\dataset/new missing data.csv')
df1

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
5              2.9  56642.0
6              3.0      NaN
7              3.2  54445.0
8              3.2  64445.0
9              3.7  57189.0


In [136]:
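# Assign the result back to the column. Calling fillna(inplace=True) on a selected
# column is deprecated in pandas 2.x and does not update df1 under Copy-on-Write.
df1['Salary'] = df1['Salary'].fillna(1303)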
df1

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
5              2.9  56642.0
6              3.0   1303.0
7              3.2  54445.0
8              3.2  64445.0
9              3.7  57189.0
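
Another way to limit the replacement to specific columns is to pass fillna() a dictionary that maps column names to fill values. The sketch below assumes a hypothetical `sample` frame with gaps in both columns, and fills only `Salary`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the data set used in this notebook.
sample = pd.DataFrame({
    "YearsExperience": [1.1, np.nan, 1.5],
    "Salary": [39343.0, np.nan, 37731.0],
})

# Only the columns named in the dictionary are filled; other columns keep their empty cells
filled = sample.fillna({"Salary": 1303})

print(filled)
```

The empty cell in `YearsExperience` is left as NaN, because that column is not listed in the dictionary.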


# Replace Using Mean, Median, or Mode
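A common way to replace empty cells is to fill them with the mean, median, or mode of the column.

Pandas provides the mean(), median(), and mode() methods to calculate these values for a specified column. Because several values can tie for most frequent, mode() returns a Series rather than a single number. The cell below uses the mean: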

In [137]:
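# Raw string (r'...') so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'Z:\Study\ziggy std\Mr. Prakash Senapathi\dataset/new missing data.csv')

# Mean of the non-empty Salary values (empty cells are skipped by default)
x = df["Salary"].mean()

# Assign the result back instead of using inplace=True on a selected column
df["Salary"] = df["Salary"].fillna(x)
df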

   YearsExperience       Salary
0              1.1      39343.0
1              1.3      46205.0
2              1.5      37731.0
3              2.0      43525.0
4              2.2      39891.0
5              2.9      56642.0
6              3.0  75890.62963
7              3.2      54445.0
8              3.2      64445.0
9              3.7      57189.0
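
The median and the mode can be used in the same way. The sketch below is a minimal illustration on a hypothetical `sample` frame, with values chosen so that the three statistics differ. Since mode() returns a Series, the first entry is taken with `[0]`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the data set used in this notebook.
sample = pd.DataFrame({
    "Salary": [40000.0, 40000.0, 50000.0, np.nan, 90000.0, 60000.0],
})

mean_value = sample["Salary"].mean()      # average of the non-empty values
median_value = sample["Salary"].median()  # middle value of the sorted non-empty values
mode_value = sample["Salary"].mode()[0]   # most frequent value (first one if there is a tie)

# Fill the empty cell with the median and assign the result to a new column
sample["SalaryFilled"] = sample["Salary"].fillna(median_value)

print(mean_value, median_value, mode_value)
print(sample)
```

The median is less sensitive to extreme values than the mean, so it can be the safer choice when a column contains outliers.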
