## Data Cleansing

Data cleaning is a crucial step in the data preprocessing pipeline. It involves identifying and rectifying issues in your dataset to ensure that it’s ready for analysis.

In [None]:
# imports: numpy, pandas, matplotlib (for visualization), and seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
# load the dataset
df = pd.read_csv('Train.csv')

In [None]:
df.head(5)

In [None]:
df.info()

# Common Data Cleaning Tasks in Python:

### 1. Handling Missing Values
Incomplete data sets frequently contain missing values, posing a challenge for analysis. Techniques like imputation, which involves filling missing values with estimates, or dropping rows/columns with significant missing data are commonly employed to handle this issue. These methods help ensure that the dataset remains usable and provides reliable insights.
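
Before choosing between imputation and dropping, it helps to see how much is missing in each column. The following is a minimal sketch on a hypothetical toy frame (the column names `age`, `city`, and `score` and the 50% cutoff are illustrative assumptions, not part of the dataset used in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Oslo", "Lima", None, "Pune"],
    "score": [np.nan, np.nan, np.nan, 7.5],
})

# Count and share of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})
print(missing)

# Drop columns where more than half of the values are missing
# (the 0.5 cutoff is an arbitrary choice for this sketch),
# then drop any rows that still contain a missing value.
trimmed = toy.loc[:, toy.isna().mean() <= 0.5]
complete_rows = trimmed.dropna()
print(complete_rows)
```

Dropping is the simpler option but discards information, which is why the rest of this section relies on imputation.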

- We will handle missing values in the column "Var_1" through imputation.

Imputation replaces missing values with a statistical measure of the column, such as the mean, median, or mode.
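
As a small illustration of the three measures, here is a sketch on hypothetical toy series (the names `s_num` and `s_cat` and their values are made up for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series with one missing value.
s_num = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

# mean() and median() skip NaN by default.
mean_filled = s_num.fillna(s_num.mean())      # fills with 3.75
median_filled = s_num.fillna(s_num.median())  # fills with 2.0

# Hypothetical categorical series; the mode is the most frequent value.
s_cat = pd.Series(["a", "b", "a", None])
mode_filled = s_cat.fillna(s_cat.mode().iloc[0])  # fills with "a"
```

Note how the outlier `10.0` pulls the mean well above the median, which is one reason the median is often preferred for skewed numeric data.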

In [None]:
# count the occurrences of each unique value in the 'Var_1' column

df['Var_1'].value_counts()

In [None]:
# count the number of missing (NaN) values

df['Var_1'].isna().sum()

In [None]:
# replaces all NaN values in the 'Var_1' column with the calculated mode value.

val = df['Var_1'].mode().values[0]

df['Var_1'] = df['Var_1'].fillna(val)

In [None]:
# the result

df['Var_1'].value_counts()

This method of imputation is useful for categorical or discrete data where replacing missing values with the most common value can preserve the distribution of the data to some extent.
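
The "to some extent" matters: every filled value goes to the most common category, so its share grows. A minimal sketch on a hypothetical toy series (not the notebook's data) makes the shift visible:

```python
import pandas as pd

# Hypothetical categorical series with two missing values.
s = pd.Series(["a", "a", "b", None, None])

# Category shares before imputation (NaN is excluded by default).
before = s.value_counts(normalize=True)

# Fill with the mode and recompute the shares.
filled = s.fillna(s.mode().iloc[0])
after = filled.value_counts(normalize=True)

print(before)
print(after)
```

Here the share of `"a"` rises from two thirds to four fifths, so the more values are missing, the more the imputed column over-represents the mode.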

- We will handle missing values in the column "Work_Experience" through imputation.

In [None]:
# we'll fill the missing values in Work_Experience
# first, check the number of rows

df.shape[0]

In [None]:
df['Work_Experience'].isna().sum()

In [None]:
df['Work_Experience'].isna().sum()/df.shape[0]

In [None]:
# the number of unique values is small, so filling the NaNs with a single value is reasonable

df.Work_Experience.nunique()

In [None]:
# plot a histogram of the column
df.Work_Experience.plot(kind='hist');

From the histogram, the distribution appears negatively skewed.
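
A visual read of skewness can be cross-checked numerically with `Series.skew()`, which is positive for a long right tail and negative for a long left tail. The sketch below uses hypothetical toy data, and the helper `choose_fill` with its `0.5` threshold is an illustrative assumption rather than a standard rule:

```python
import numpy as np
import pandas as pd

def choose_fill(s: pd.Series, threshold: float = 0.5) -> float:
    """Return the median for a skewed series, otherwise the mean.

    The threshold is an arbitrary cutoff chosen for this sketch.
    """
    if abs(s.skew()) > threshold:
        return s.median()
    return s.mean()

# Hypothetical right-skewed data (long tail toward large values).
right = pd.Series([0, 0, 1, 1, 1, 2, 3, 9, np.nan])
# Mirroring the values flips the sign of the skewness.
left = right.max() - right

print(right.skew())  # positive
print(left.skew())   # negative

# Skewed data: the helper picks the median.
fill = choose_fill(right)
filled = right.fillna(fill)

# Symmetric data: the helper picks the mean.
symmetric_fill = choose_fill(pd.Series([1, 2, 3, 4, 5]))
```

Running `df.Work_Experience.skew()` in the notebook would give the sign of the skew for the real column.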

In [None]:
# fill the missing values
# for a skewed distribution (negative or positive), use the median; for a roughly symmetric distribution, the mean is a reasonable choice

val = df.Work_Experience.median()
df['Work_Experience'] = df.Work_Experience.fillna(val)

In [None]:
# check the info
df.info()

In [None]:
# check the chart again
df.Work_Experience.plot(kind='hist');

- We will handle missing values in the column "Graduated" through imputation.

In [None]:
df['Graduated'].value_counts()

In [None]:
val = df['Graduated'].mode().values[0]

df['Graduated'] = df['Graduated'].fillna(val)

In [None]:
# the result

df['Graduated'].value_counts()

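- We will fill missing values in the column "Profession" with the placeholder 'unknown'.

In [None]:
# assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas versions
df['Profession'] = df['Profession'].fillna('unknown')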

- We will fill missing values in the column "Ever_Married" with the placeholder 'unknown'.

In [None]:
df['Ever_Married'] = df['Ever_Married'].fillna('unknown')

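- We will fill missing values in the column "Family_Size" with the placeholder 'unknown'.

In [None]:
# note: if Family_Size is numeric, filling it with a string turns the column into a non-numeric (object) column
df['Family_Size'] = df['Family_Size'].fillna('unknown')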
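
If a column such as `Family_Size` is numeric (an assumption here, since the `df.info()` output is not shown), a string placeholder mixes text into the numbers and the column can no longer be used directly in arithmetic or plots. One possible alternative, sketched on a hypothetical toy frame and not what this notebook does, is to record which rows were missing in a separate flag column and fill the numeric column with its median:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column name mirrors the dataset.
family = pd.DataFrame({"Family_Size": [1.0, 2.0, np.nan, 4.0]})

# Keep a record of which rows were originally missing.
# The flag column name is an illustrative choice.
family["Family_Size_missing"] = family["Family_Size"].isna()

# Fill with the median so the column stays numeric.
median_size = family["Family_Size"].median()
family["Family_Size"] = family["Family_Size"].fillna(median_size)

print(family)
print(family.dtypes)
```

The flag column preserves the information that a value was absent, which is what the 'unknown' placeholder conveys for the categorical columns.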

In [None]:
df.info()