## Data Preprocessing / Cleaning

Data preprocessing converts raw data into a clean dataset. A dataset is considered dirty if it has any of the following problems:
-    Missing Values
-    Duplicate Values
-    Wrong Values
-    Type Mismatch
-    Unnecessary Rows or Columns

Let's start with missing values. To flag every missing entry in the dataset, we can use:

In [None]:
df.isnull()

Since `isnull()` returns a Boolean (True/False) value for every cell, we can chain `.sum()` to count the missing values in each column:

In [None]:
df.isnull().sum()
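As a minimal sketch of the check above, the following builds a small, made-up DataFrame (the column names and values are hypothetical, not taken from the dataset used in this notebook) and counts the missing entries per column. Taking the mean of the Boolean mask, an addition not shown above, gives the fraction of missing entries instead of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Karachi", "Lahore", None, "Multan"],
})

# True counts as 1 and False as 0, so summing the mask counts missing cells
missing_per_column = df.isnull().sum()

# The mean of the mask is the fraction of missing cells per column
missing_share = df.isnull().mean()

# Summing twice gives the total number of missing cells in the whole frame
total_missing = int(df.isnull().sum().sum())

print(missing_per_column)
print(missing_share)
print(total_missing)
```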

### How to tackle Missing Values?

We could simply delete the rows with missing values, but we usually want to keep as many data points as possible. Replacing missing values with zeros is rarely a good idea either, because an age of 0 or a price of 0 has an actual meaning and would distort the data.

Therefore, a good replacement value is one that doesn't affect the data too much, such as the mean or median (or the mode, for categorical columns). The `fillna` method replaces every NaN (not a number) entry with the given value:

#### Filling with Mean, Median and Mode

In [None]:
df[col] = df[col].fillna(df[col].mean())

In [None]:
df[col] = df[col].fillna(df[col].median())

In [None]:
df[col] = df[col].fillna(df[col].mode()[0])
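To see how the three choices differ, here is a small sketch on made-up data (the `price` and `city` columns are hypothetical). The numeric column contains one large value, which pulls the mean upward while leaving the median unchanged. The mode is used for the text column, where a mean or median is not defined.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 100.0],
    "city": ["Karachi", "Karachi", None, "Lahore"],
})

# The mean is sensitive to the outlier 100.0
mean_filled = df["price"].fillna(df["price"].mean())

# The median is more robust to outliers
median_filled = df["price"].fillna(df["price"].median())

# mode() returns a Series (there can be ties), so take the first entry
mode_filled = df["city"].fillna(df["city"].mode()[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

Which statistic fits best depends on the column: the median is often a safer default for skewed numeric data.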

#### Filling with Interpolation

In [None]:
df[col] = df[col].interpolate(method='linear', limit_direction='forward')
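The following sketch shows what linear interpolation does on a small, made-up Series. Gaps between two known values are filled with evenly spaced values. With `limit_direction='forward'`, a missing value at the very start has no earlier point to interpolate from and stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: one leading gap and one interior gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Fill interior gaps on a straight line between neighbouring known values
filled = s.interpolate(method="linear", limit_direction="forward")

print(filled.tolist())
```

If leading gaps should be filled as well, `limit_direction='both'` is one option to consider.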

#### Filling with backward or forward fill

In [None]:
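# Forward fill: propagate the last valid value forward
# (fillna(method='ffill') is deprecated since pandas 2.1)
df[col] = df[col].ffill()

In [None]:
# Backward fill: propagate the next valid value backward
df[col] = df[col].bfill()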
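A short sketch on a made-up Series shows the difference between the two directions, using the `ffill()` and `bfill()` methods. Forward fill cannot fill a gap at the start, and backward fill cannot fill a gap at the end, because there is no neighbouring value to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with gaps at the start, middle, and end
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()   # each gap takes the previous valid value
backward = s.bfill()  # each gap takes the next valid value

# Chaining both directions fills every gap in this series
both = s.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```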

We can also remove the missing values from our dataset using:

#### Removing Rows (if all values are missing)

In [None]:
df = df.dropna(how='all')

#### Removing Missing Values (column-wise)

In [None]:
# Drop every column that contains at least one missing value
df = df.dropna(axis=1)

#### Removing Missing Values (row-wise)

In [None]:
# Drop every row that contains at least one missing value
df = df.dropna(axis=0)
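The three variants remove very different amounts of data, which the following sketch illustrates on a made-up DataFrame (the column names `a`, `b`, and `c` are hypothetical). Each call returns a new DataFrame, so the results are stored under separate names for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is partly missing, row 2 is entirely missing
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, np.nan],
})

# Drop only the rows where every value is missing (removes row 2)
no_empty_rows = df.dropna(how="all")

# Drop every row with at least one missing value (keeps only row 0)
complete_rows = df.dropna(axis=0)

# After removing empty rows, drop columns that still contain a missing value
complete_cols = no_empty_rows.dropna(axis=1)

print(no_empty_rows.shape, complete_rows.shape, complete_cols.shape)
```

Dropping column-wise on the original `df` here would remove all three columns, since the empty row puts a missing value in each of them. That is one reason to remove fully empty rows first.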

### Taking care of Duplicate Values

We can drop duplicate rows from our dataset using the `drop_duplicates` method:

In [None]:
# keep='first' retains the first occurrence of each duplicated row
df.drop_duplicates(keep='first', inplace=True)
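Before dropping anything, it can help to count how many duplicates there are. The sketch below does that on a made-up DataFrame (the `name` and `city` columns are hypothetical) and then removes the repeats. The `subset` argument, not used above, restricts the comparison to selected columns.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical
df = pd.DataFrame({
    "name": ["Ali", "Ali", "Sara", "Sara"],
    "city": ["Karachi", "Karachi", "Lahore", "Multan"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each fully identical row
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)

# Compare on the "name" column only: one row per name survives
deduped_by_name = df.drop_duplicates(subset=["name"], keep="first")

print(n_duplicates)
print(deduped)
print(deduped_by_name)
```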

### Type Mismatch

We can change the data type of any column using:

In [None]:
df[col] = df[col].astype('int64')

In [None]:
df[col] = df[col].astype('object')

In [None]:
df[col] = df[col].astype('float64')
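A direct `astype('int64')` raises an error when a column contains text that cannot be parsed as a number, or missing values. One possible workaround, sketched here on a made-up Series, is to convert with `pd.to_numeric(..., errors='coerce')` first, which turns unparseable entries into NaN so that they can be handled like any other missing value.

```python
import pandas as pd

# Hypothetical toy column: numbers stored as text, plus one bad entry
raw = pd.Series(["1", "2", "three"])

# Unparseable entries become NaN instead of raising an error
nums = pd.to_numeric(raw, errors="coerce")

# Once the missing entries are dealt with, the integer cast succeeds
ints = nums.dropna().astype("int64")

print(nums.tolist())
print(ints.tolist())
```

Dropping the bad entry is only one choice here. Any of the filling strategies from the missing-values section could be applied before the cast instead.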

### Handling Wrong Values

We can replace a wrongly entered value in a column using:

In [None]:
# replace() returns a new Series, so assign the result back to the column
df[col] = df[col].replace(to_replace=st1, value=st2)

Alternatively, we can look up the indexes of the affected rows and correct them one by one:

In [None]:
# Indexes of the rows where "karchi" is written instead of "Karachi"
khi_wrong_values = df[df['Cities'] == 'karchi'].index.tolist()
print(khi_wrong_values)
# [119, 147, 156, 388, 453, 471, 487]

# Replace "karchi" with "Karachi" at each of those indexes
for index in khi_wrong_values:
    df.loc[index, 'Cities'] = "Karachi"
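The loop above can also be written without explicit iteration. The sketch below, on a made-up `Cities` column, assigns through a Boolean mask with `df.loc`, which updates every matching row in one step. Passing a dictionary to `replace` is another option when several misspellings need fixing at once. The second misspelling, `lahor`, is a hypothetical example.

```python
import pandas as pd

# Hypothetical toy data with two kinds of misspelling
df = pd.DataFrame({"Cities": ["karchi", "Lahore", "karchi", "lahor"]})

# Boolean mask: True for every row where the city is misspelled as "karchi"
mask = df["Cities"] == "karchi"

# Update all matching rows in one assignment
df.loc[mask, "Cities"] = "Karachi"

# Map several wrong spellings to their corrections in a single call
df["Cities"] = df["Cities"].replace({"lahor": "Lahore"})

print(df["Cities"].tolist())
```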