### Data Cleansing Overview
**Data cleaning** refers to identifying and correcting errors in a dataset that may negatively impact a predictive model.
* With structured data, many statistical analysis and data visualization techniques can be used to explore the data and decide which cleansing operations to perform.
* Before applying more sophisticated techniques, there are some very simple yet crucial steps that are easy to overlook. If they are skipped, the model may break or report overly optimistic performance.

#### Step 1: Identify and Delete Columns That Contain a Single Value
Example:
from numpy import loadtxt
from numpy import unique

data = loadtxt('oil-spill.csv', delimiter=',')
* summarize the number of unique values in each column
for i in range(data.shape[1]):
	print(i, len(unique(data[:, i])))
**OR**
from pandas import read_csv
df = read_csv('oil-spill.csv', header=None)
print(df.nunique())
**Then**
from pandas import read_csv
* load the dataset
df = read_csv('oil-spill.csv', header=None)
print(df.shape)
* get number of unique values for each column
counts = df.nunique()
* record columns to delete
to_del = [i for i,v in enumerate(counts) if v == 1]
print(to_del)
* drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)
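
The snippets above depend on `oil-spill.csv` and on integer column positions, which only line up with the column labels because the file is read with `header=None`. As a minimal sketch, assuming a small made-up DataFrame with named columns (`a`, `b`, `c` are hypothetical), the same idea can be expressed by label:

```python
import pandas as pd

# Hypothetical toy data: column "b" holds the same value in every row.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0.1, 0.1, 0.2, 0.3],
})

# Number of distinct values per column (NaN is not counted by default).
counts = df.nunique()

# Collect the labels of single-value columns, then drop them by label.
single_valued = counts[counts == 1].index.tolist()
cleaned = df.drop(columns=single_valued)

print(single_valued)
print(cleaned.shape)
```

Dropping by label rather than by position keeps the code valid when the file has a header row.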

#### Step 2: Identify Columns with Few Values
* Example
* from numpy import loadtxt
* from numpy import unique
* data = loadtxt('oil-spill.csv', delimiter=',') 
* ''summarize the number of unique values in each column'' 
* for i in range(data.shape[1]):
	num = len(unique(data[:, i]))
	percentage = float(num) / data.shape[0] * 100
	if percentage < 1:
		print('%d, %d, %.1f%%' % (i, num, percentage))
*  ''delete columns where number of unique values is less than 1% of the rows'' 
* from pandas import read_csv 
*  ''load the dataset'' 
* df = read_csv('oil-spill.csv', header=None)
* print(df.shape)
* ''get number of unique values for each column'' 
* counts = df.nunique()
* ''record columns to delete''
* to_del = [i for i,v in enumerate(counts) if (float(v)/df.shape[0]*100) < 1]
* print(to_del)
* ''drop useless columns''
* df.drop(to_del, axis=1, inplace=True)
* print(df.shape)
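
A column with few unique values is not automatically useless, since it may be a valid categorical or ordinal feature. The sketch below, on synthetic data with hypothetical column names, only flags candidates using the same 1% rule and leaves the decision to drop them as a separate step:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500

# Hypothetical data: one continuous column and two low-cardinality columns.
df = pd.DataFrame({
    "continuous": rng.normal(size=n_rows),     # nearly every value distinct
    "flag": rng.integers(0, 2, size=n_rows),   # at most two distinct values
    "level": rng.integers(0, 4, size=n_rows),  # at most four distinct values
})

# Unique values per column as a percentage of the number of rows.
pct_unique = df.nunique() / len(df) * 100

# Flag columns below the 1% threshold for manual review.
candidates = pct_unique[pct_unique < 1].index.tolist()

print(pct_unique.round(1))
print(candidates)
```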

#### Step 3: Remove Columns with Low Variance
* Example of applying the variance threshold for feature selection
* from pandas import read_csv
* from sklearn.feature_selection import VarianceThreshold
* ''load the dataset''
* df = read_csv('oil-spill.csv', header=None)
* ''split data into inputs and outputs''
* data = df.values
* X = data[:, :-1]
* y = data[:, -1]
* print(X.shape, y.shape)
* '' define the transform ''
* transform = VarianceThreshold()
* ''transform the input data''
* X_sel = transform.fit_transform(X)
* print(X_sel.shape)
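
With its default `threshold=0.0`, `VarianceThreshold` removes only constant columns. A rough sketch of how the threshold changes the number of retained features, using synthetic data (the three columns and the threshold values are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
n = 200

# Synthetic inputs with clearly different variances.
X = np.column_stack([
    rng.normal(0, 1.0, n),   # variance near 1
    rng.normal(0, 0.1, n),   # variance near 0.01
    np.full(n, 3.0),         # zero variance (constant column)
])

kept = {}
for t in (0.0, 0.05, 0.5):
    selector = VarianceThreshold(threshold=t)
    X_sel = selector.fit_transform(X)
    kept[t] = X_sel.shape[1]
    # get_support() shows which input columns survived.
    print(t, X_sel.shape, selector.get_support())
```

Variance depends on the scale of each feature, so thresholds above zero are usually easier to reason about when the features share comparable units.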

### Step 4: Identify and Remove Rows That Contain Duplicate Data
*  ''locate rows of duplicate data''
* from pandas import read_csv
* ''load the dataset''
* df = read_csv('iris.csv', header=None)
*  ''calculate duplicates''
* dups = df.duplicated()
* ''report if there are any duplicates''
* print(dups.any())
* ''list all duplicate rows''
* print(df[dups])
#### **OR, an even easier version that deletes the duplicates directly**
* ''delete rows of duplicate data from the dataset''
* from pandas import read_csv
* '' load the dataset''
* df = read_csv('iris.csv', header=None)
* print(df.shape)
* ''delete duplicate rows''
* df.drop_duplicates(inplace=True)
* print(df.shape)
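
As a small self-contained illustration (the rows below are made up, not taken from `iris.csv`), `duplicated(keep=False)` shows every member of a duplicate group, and `reset_index(drop=True)` renumbers the rows after the repeats are dropped:

```python
import pandas as pd

# Hypothetical rows: the first and third are identical.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 6.2],
    "species": ["setosa", "setosa", "setosa", "virginica"],
})

# keep=False marks all rows in a duplicate group, not only the repeats.
all_dups = df[df.duplicated(keep=False)]
print(all_dups)

# Drop the repeats (the first occurrence is kept) and renumber the rows.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```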

### Step 5: Identifying and Removing Outliers
* Not all outliers should be removed; whether to remove them is a judgment call that depends on the dataset.
* For a quick check, a scatterplot helps a lot.
#### First Approach : IQR
* ''identify outliers with interquartile range''
* from numpy.random import seed
* from numpy.random import randn
* from numpy import percentile
* ''seed the random number generator''
* seed(1)
* ''generate univariate observations''
* data = 5 * randn(10000) + 50
* ''calculate interquartile range''
* q25, q75 = percentile(data, 25), percentile(data, 75)
* iqr = q75 - q25
* print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))
* ''calculate the outlier cutoff''
* cut_off = iqr * 1.5
* lower, upper = q25 - cut_off, q75 + cut_off
* ''identify outliers''
* outliers = [x for x in data if x < lower or x > upper]
* print('Identified outliers: %d' % len(outliers))
* ''remove outliers''
* outliers_removed = [x for x in data if x >= lower and x <= upper]
* print('Non-outlier observations: %d' % len(outliers_removed))
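
The same IQR rule can be written with NumPy boolean masks instead of list comprehensions, which keeps the result as an array. This is a sketch on freshly generated data (a different random generator than above, so the counts will not match the earlier snippet):

```python
import numpy as np

rng = np.random.default_rng(1)
data = 5 * rng.standard_normal(10_000) + 50

# Quartiles and interquartile range.
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25

# Points more than 1.5 * IQR outside the quartiles count as outliers.
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
is_outlier = (data < lower) | (data > upper)

outliers = data[is_outlier]
kept = data[~is_outlier]
print(len(outliers), len(kept))
```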
#### Second Approach: Automatic Outlier Detection
* look at related notebook
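
The related notebook is not reproduced here. As one possible illustration only, and assuming scikit-learn's `LocalOutlierFactor` as the detector (the notes do not say which method the notebook uses), automatic detection might look like this on synthetic data with three planted far-away points:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)

# Synthetic 2-D cluster with three planted points far from it.
X = rng.normal(0, 1, size=(200, 2))
X[:3] = [[8, 8], [-9, 7], [10, -10]]

# fit_predict returns -1 for outliers and 1 for inliers.
lof = LocalOutlierFactor()
labels = lof.fit_predict(X)

# Keep only the rows labeled as inliers.
mask = labels != -1
X_clean = X[mask]
print(X.shape, X_clean.shape)
```

When a target vector exists, the same mask has to be applied to it so that rows stay aligned.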

### Step 6: Mark Missing Values
* Some columns may contain 0 instead of a missing value. First check whether 0 could be a true observation by using the .describe() function and looking at the min and max. Once we are sure the zeros are missing values, we can replace them with NaN using .replace(0, nan).
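
A minimal sketch of that check, assuming a made-up table where zero is plausible in one column but not in the other two (all column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pregnancies": [0, 1, 3, 0],      # zero is a plausible value here
    "glucose": [148, 0, 183, 89],     # zero is not physically plausible
    "bmi": [33.6, 26.6, 0.0, 28.1],   # zero is not physically plausible
})

# A minimum of 0 in a column where 0 is impossible hints at hidden missing values.
print(df.describe().loc[["min", "max"]])

# Replace zeros with NaN only in the columns where zero means "missing".
zero_means_missing = ["glucose", "bmi"]
df[zero_means_missing] = df[zero_means_missing].replace(0, np.nan)

print(df.isna().sum())
```

Restricting the replacement to selected columns avoids erasing legitimate zeros elsewhere.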
#### First Approach: Removing missing values (not recommended, especially with a large number of NaN values or a small dataset)
* we can use df.dropna(inplace=True).
#### Second Approach: Statistical Imputation (mean: most common, mode, median)
* from sklearn.impute import SimpleImputer
* imputer = SimpleImputer(strategy='mean') *the strategy can also be 'median' or 'most_frequent' (the mode)
* imputer.fit(X)
* Xtrain = imputer.transform(X)
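
To see what each strategy actually fills in, the fitted values are exposed in the imputer's `statistics_` attribute. A small sketch on a made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 20.0],
])

stats = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    X_filled = imputer.fit_transform(X)
    # statistics_ holds the per-column fill value learned during fit.
    stats[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_)
```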
#### Third Approach : Simple Imputer with Model Evaluation
* Here we want to handle missing values in the train and test data separately in order to prevent data leakage. To do so, we use a Pipeline, so the imputer is fit only on the training portion of each split.
* model = could be any model
* imputer = SimpleImputer(strategy='mean')
* pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
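
Filled out into a runnable sketch, assuming a synthetic classification problem with roughly 10% of the values knocked out at random (the data, the random forest settings, and the fold counts are all assumptions made for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data with missing values injected at random.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is refit on the training folds of every split,
# so statistics from held-out rows never leak into training.
pipeline = Pipeline(steps=[
    ("i", SimpleImputer(strategy="mean")),
    ("m", RandomForestClassifier(n_estimators=50, random_state=1)),
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print("Mean accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```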
#### Fourth Approach: Compare Different Statistical Imputation Strategies
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
* '' load dataset''
* url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
* dataframe = read_csv(url, header=None, na_values='?')
* ''split into input and output elements''
* data = dataframe.values
* ix = [i for i in range(data.shape[1]) if i != 23]
* X, y = data[:, ix], data[:, 23]
* ''evaluate each strategy on the dataset''
* results = list()
* strategies = ['mean', 'median', 'most_frequent', 'constant']
* for s in strategies:
	* ''create the modeling pipeline''
	* pipeline = Pipeline(steps=[('i', SimpleImputer(strategy=s)), ('m', RandomForestClassifier())])
	* ''evaluate the model''
	* cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	* scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	* ''store results''
	* results.append(scores)
	* print('>%s %.3f (%.3f)' % (s, mean(scores), std(scores)))
* **Once we know which strategy works best, we can use that strategy for our model**
#### Fifth Approach: K-Nearest Neighbors Imputation (model that can predict missing values)
* from sklearn.impute import KNNImputer
* load the dataset and split it into X and y
* summarize total missing (isnan comes from numpy) *sum(isnan(X).flatten())
* define imputer *imputer=KNNImputer()
* fit on the dataset *imputer.fit(X)
* transform dataset *Xtrans=imputer.transform(X)
* summarize total missing *sum(isnan(Xtrans).flatten())
* using a pipeline works the same as in the previous approach, just changing the imputer to **KNNImputer**
    *model=RandomForestClassifier()
    *imputer=KNNImputer()
    *pipeline=Pipeline(steps=[('i', imputer), ('m', model)])
    *cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    *scores=cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    *print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
* see the related notebook for analyzing which value of k (number of neighbors) works better
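
The fit and transform steps can be tried on a tiny array without any dataset file. This is a sketch on made-up values, with `n_neighbors=2` chosen only because the array has four rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
n_missing_before = int(np.isnan(X).sum())

# Each missing entry is filled from the rows closest on the observed features.
imputer = KNNImputer(n_neighbors=2)
X_trans = imputer.fit_transform(X)
n_missing_after = int(np.isnan(X_trans).sum())

print(n_missing_before, n_missing_after)
print(X_trans)
```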
#### Sixth Approach: Iterative Imputation 
* In this approach, each feature with missing values is modeled as a function of the other features, and the model's predictions fill in the missing values. It is iterative because the process is repeated feature by feature, with each pass refining the earlier estimates.
* A common model for this is linear regression: we predict the missing value from the other columns (features). By default, scikit-learn's IterativeImputer uses BayesianRidge, a regularized linear regression.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
* defining imputer *imputer=IterativeImputer()
* fit and then transform the dataset
    *imputer.fit(X)
    *Xtrans=imputer.transform(X)
* For implementing IterativeImputer with model evaluation, see the related notebook.
* By default, IterativeImputer runs up to 10 iterations (max_iter=10). A large number of iterations may begin to bias or skew the estimates, so fewer iterations may be preferred. To check which number of iterations works better, see the related notebook.
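
As a rough sketch of such a comparison, assuming synthetic data where the second column is close to twice the first (so the true values of the removed entries are known), the imputation error can be measured for a few `max_iter` settings:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: column 1 is roughly 2 * column 0 plus small noise.
a = rng.normal(size=100)
X_full = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=100)])

# Hide about 20% of column 1 so the imputer has something to fill.
X = X_full.copy()
missing = rng.random(100) < 0.2
X[missing, 1] = np.nan

errors = {}
for n_iter in (1, 5, 10):
    imputer = IterativeImputer(max_iter=n_iter, random_state=0)
    X_trans = imputer.fit_transform(X)
    # Mean absolute error on the entries that were hidden.
    errors[n_iter] = np.abs(X_trans[missing, 1] - X_full[missing, 1]).mean()
    print(n_iter, round(errors[n_iter], 3))
```

scikit-learn may emit a convergence warning for small `max_iter` values; that is a warning rather than an error. On real data the true values are unknown, so the comparison is normally done through model evaluation in a pipeline instead.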

