Variable:
- Qualitative (categorical, no arithmetic operations)
    - Nominal
    - Ordinal
- Quantitative
    - Discrete
    - Continuous

Descriptive data measures
- Measure of Central Tendency
    - Mean
    - Median
    - Mode
- Measure of Dispersion
    - Variance: average of the squared differences from the mean
    - Standard Deviation: typical distance of the datapoints from the mean, in the same units as the data
        - sqrt of variance
    - Range: Max(X)- Min(X)
        - X is a numeric variable
    - Quartiles
        - First Quartile (Q1): value below which the smallest 25% of the data lie
        - Second Quartile (Q2, the median): value below which 50% of the data lie
        - Third Quartile (Q3): value below which the smallest 75% of the data lie
    - IQR (Inter Quartile Range)
        - IQR = Q3 - Q1
    - Coefficient of variation:
        - compares the spread of distributions irrespective of their scale
        - CV = (stddev/samplemean) * 100
        - larger CV => larger variability relative to the mean
    - Z-score
        - z = (x-samplemean)/stddev
        - z = 0 => datapoint is mean
        - +ve or -ve z-score => the value lies above or below the mean
        - magnitude of z-score => how many stddevs the value lies away from the mean
- Five point summary
    - Minimum, Q1, Q2, Q3, Maximum
    - Boxplot            
    - Outliers:
        - points beyond the whiskers/fences are flagged as potential outliers; whether they are genuine outliers needs domain judgement
        - Box plot rule: any point greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR is flagged as an outlier
- Shape of distribution
    - Skewness
        - left skewed: -ve skewed: typically mean < median
        - zero skewed: symmetric distribution: mean = median
        - right skewed: +ve skewed: typically mean > median
- Covariance
    - how two variables change together
    - cov(x,y)
    ![image.png](attachment:image.png)
    - sign gives the direction: +ve (x increases, y increases) or -ve (x increases, y decreases)
    - magnitude does not tell the strength of the relation
    - depends on the scale (units) of the variables
- Coefficient of Correlation
    - determine direction and strength of relationship (linear)
      ![image-2.png](attachment:image-2.png)
    - range: -1 to +1
        - -1, +1 => perfect -ve and +ve linear correlation
        - 0 => no linear relation
    - NOTE: Correlation does not imply causation
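
The measures above can be checked on a tiny array. This is a minimal sketch with made-up numbers; it assumes population formulas (`ddof=0`), whereas pandas defaults to the sample versions (`ddof=1`), and NumPy's default linear interpolation for percentiles.

```python
import numpy as np

# Toy data (hypothetical values chosen so the arithmetic is easy to follow)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = 2.0 * x + 1.0  # exactly linear in x

# Central tendency and dispersion (population versions, ddof=0)
mean = x.mean()
var = x.var(ddof=0)
std = x.std(ddof=0)
cv = std / mean * 100          # coefficient of variation, in percent
z = (x - mean) / std           # z-score of every datapoint

# Quartiles, IQR and the box plot fences
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]

# Covariance depends on scale, correlation does not
cov_xy = np.cov(x, y, ddof=0)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

print(mean, var, std, cv)
print(q1, q2, q3, iqr, outliers)
print(cov_xy, corr_xy)
```

Rescaling `y` (for example multiplying it by 10) changes `cov_xy` but leaves `corr_xy` at 1, which is the scale dependence noted above.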

### Handy functions for EDA

Libraries:
- import numpy as np
- import pandas as pd
- import matplotlib.pyplot as plt
- import seaborn as sns
- import warnings
- warnings.filterwarnings("ignore")
- from PIL import Image 
- from scipy import stats


Importing Data:
- pd.read_csv(file_name): Read from a CSV file
- pd.read_excel(file_name): Read from an Excel file
- pd.read_sql(sql_query, connection_object): Read from a database
- pd.read_json("string, url or file"): Read from a JSON string, URL, or file
- pd.read_html(URL): Read all HTML tables from a URL or a file (returns a list of DataFrames)

Data Exploration:
- df.info(): Provides information about datatype, shape, and memory usage
- df.describe(): Provides statistical summary (count, mean, min, max, std, quantiles)
- df.shape: Returns the shape of the dataset (rows, columns)
- df.head(): Prints the top 5 rows of the dataset
- df.tail(): Prints the last 5 rows of the dataset
- df.col.value_counts(): Returns count of unique classes in a column
- df.count(): Returns total number of observations in each column
- df.col.unique(): Returns unique classes in the column
- df.col.nunique(): number of unique classes
- concatenated_df = pd.concat([df1, df2], axis=1): column wise
- merged_df = df1.merge(df2, on='common_column', how='inner')  # 'inner','outer','left','right'
- result_df = df1.join(df2, on='common_column', how='inner')  # matches df1['common_column'] against the index of df2; prefer join when joining on indices
- cross_tab = pd.crosstab(df['catcol1'], df['catcol2'])
- pivot_table = df.pivot(index='index_column', columns='columns_to_pivot',values='values_column')
- melted_df = pd.melt(df, id_vars=['id_columns'], value_vars=['value_columns'])
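
As a rough illustration of the reshaping and combining calls listed above, the sketch below uses two small hypothetical frames (`left`, `right`, `long_df` and their column names are invented for this example).

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "city": ["A", "B", "A"]})
right = pd.DataFrame({"id": [2, 3, 4], "sales": [20, 30, 40]})

# inner keeps only ids present in both frames, outer keeps all ids
inner = left.merge(right, on="id", how="inner")
outer = left.merge(right, on="id", how="outer")

long_df = pd.DataFrame({
    "store": ["s1", "s1", "s2", "s2"],
    "month": ["jan", "feb", "jan", "feb"],
    "sales": [10, 12, 7, 9],
})

# long -> wide: one row per store, one column per month
wide = long_df.pivot(index="store", columns="month", values="sales")

# wide -> long again
back = wide.reset_index().melt(id_vars="store",
                               value_vars=["jan", "feb"],
                               value_name="sales")

# frequency table of two categorical columns
counts = pd.crosstab(long_df["store"], long_df["month"])

print(inner)
print(wide)
print(counts)
```

Note that `df.pivot` raises an error if an index/column pair occurs more than once; `pd.pivot_table` with an aggregation function is the usual alternative in that case.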

Filter Data:
- df.loc[condition]: Returns rows based on one condition
- df[(condition) & (condition)]: Returns rows based on two conditions (AND)
- df[(condition) | (condition)]: Returns rows based on two conditions (OR)
- df.loc[(condition) & (condition)]: Returns rows based on two conditions using loc
- df.loc[(condition) | (condition)]: Returns rows based on two conditions using loc

Renaming Columns and Indices:
- df.columns = ['Column 1', 'Column 2', ...]: Rename columns using a list
- df.rename(columns={'old_name': 'new_name'}): Rename columns using rename function
- df.rename(index={'old_name': 'new_name'}): Rename indices using rename function
- df.set_index("col"): Set a column as indices

Statistical Functions:
- df.mean(numeric_only=True): Find the mean of every numeric column
- df.median(numeric_only=True): Find the median of every numeric column
- df.column_name.mode(): Find the mode(s) of a column (returns a Series, since there can be several)
- df.column_name.mode()[0]: if there are two modes, pick the first one by indexing
- df.max(): Find the max value of each column
- df.min(): Find the min value of each column
- df.std(): Find the standard deviation of each column (sample version, ddof=1)
- df.var(): Find the variance of each column (sample version, ddof=1)
- df.cov(): Create a covariance matrix (how two variables change together)
- df.corr(): Create a correlation matrix (Pearson by default)
- df.quantile(q=0.25): find the value below which 25% of the data lies
- df.quantile(q=0.50): find the value below which 50% of the data lies
- df.quantile(q=0.75): find the value below which 75% of the data lies
- IQR: df.col1.quantile(q=0.75) - df.col1.quantile(q=0.25)
- Range: df.max() - df.min() (numeric columns only)
- Skewness: df.skew()
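
A small sketch of these calls on a hypothetical column `col1`, chosen so that it has two modes and a long right tail:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 2, 3, 3, 10]})

modes = df["col1"].mode()        # Series holding every mode, sorted
first_mode = modes[0]            # pick one mode by indexing

iqr = df["col1"].quantile(0.75) - df["col1"].quantile(0.25)
value_range = df["col1"].max() - df["col1"].min()
sample_std = df["col1"].std()    # ddof=1 by default in pandas
skewness = df["col1"].skew()     # positive for a right-skewed column

print(modes.tolist(), first_mode, iqr, value_range, sample_std, skewness)
```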

Sort and Group By:
- df.sort_values(col, ascending): Sort the dataframe based on a column
- df.groupby(column_name): Group a dataframe by a column name
- df.groupby([column_1, column_2, ...]): Group a dataframe by multiple column names
- df.groupby(column_1)[column_2].mean(): Find the mean of a column from the group
- df.groupby(column_1).agg('mean'): Find the mean of all columns per group (pass the function name or np.mean itself, not the call np.mean())
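
The following sketch shows the grouping calls on a hypothetical frame (the `team`, `score` and `age` columns are invented for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [10, 20, 30, 40, 50],
    "age": [21, 23, 30, 32, 34],
})

# mean of one column per group
mean_score = df.groupby("team")["score"].mean()

# mean of every remaining column per group
all_means = df.groupby("team").agg("mean")

# different aggregations per column, with explicit output names
summary = df.groupby("team").agg(score_max=("score", "max"),
                                 age_mean=("age", "mean"))

# sort rows by a column, largest first
ranked = df.sort_values("score", ascending=False)

print(mean_score)
print(summary)
```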

Null Value Analysis and Data Cleaning:
- df.isnull(): Returns True where the value is null (NaN, None or NaT)
- df.isnull().sum(): Returns count of null values in each column
- df.isnull().sum().sum(): Returns total count of null values in all columns
- df.count(): Returns count of non-null values in each column
- df.isnull().any(): Returns True if the column has at least 1 null
- df.isnull().all(): Returns True if all the values in the column are null
- df.notnull(): Returns True where the value is not null
- df.dropna(axis=0, thresh=n, inplace=True): Drops rows (axis=0) or columns (axis=1) that have fewer than n non-null values
- df.fillna(value): Fill null values with the passed value
- df['col1'] = df['col1'].fillna(df['col1'].mean()): Fill null values with the column mean
- df.replace('old_value', 'new_value'): Replace a value with a new value
- df.replace([old_1, old_2], [new_1, new_2]): Replace multiple values with new values
- df.column_name.astype('data_type'): Change the data type of a column
- df.drop(col, axis=1, inplace=True): Drop a column
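
A minimal sketch of a null-value pass over a hypothetical frame; the column names and the choice to fill with the mean are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

null_per_col = df.isnull().sum()        # nulls in each column
non_null_per_col = df.count()           # non-null values in each column
all_null_cols = df.columns[df.isnull().all()]

# drop columns that are entirely null, then rows with no usable value
cleaned = df.drop(columns=all_null_cols)
cleaned = cleaned.dropna(thresh=1)      # keep rows with at least 1 non-null value

# fill the remaining numeric gaps with the column mean
cleaned["a"] = cleaned["a"].fillna(cleaned["a"].mean())

print(null_per_col)
print(cleaned)
```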

Selecting Rows and Columns:
- df.column_name: Select a column
- df["column_name"]: Select a column
- df[["column_name_1", "column_name_2", ...]]: Select multiple columns
- df.iloc[ : , : ]: Extract selected rows and columns using indices
- df.iloc[index_position]: Extract rows using index position
- df.loc[index_value]: Extract rows using index value

Write Data:
- df.to_csv(file_name): Write data to a CSV file
- df.to_excel(file_name): Write data to an Excel file
- df.to_html(file_name): Write data to an HTML file
- df.to_sql(table_name, connection_object): Write data to a table in a database
- df.to_json(file_name): Write data to a JSON file

Duplicates:
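- dup = df.duplicated(keep='first'): Boolean mask that marks every repeat of a row except its first occurrence
- dup.sum(): Number of duplicate rows (0 => no duplicates)
- df.drop_duplicates(keep='first', inplace=True): Drop duplicate rows, keeping the first occurrence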
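
For illustration, here is the duplicate check on a hypothetical five-row frame:

```python
import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2, 2, 2], "v": ["a", "a", "b", "b", "c"]})

# True for every repeated row except its first occurrence
dup = df.duplicated(keep="first")
n_dup = int(dup.sum())

# non-inplace variant: returns a new frame and leaves df untouched
deduped = df.drop_duplicates(keep="first")

print(dup.tolist(), n_dup)
print(deduped)
```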

Outliers:
- Z-Score:
    - z = np.abs(stats.zscore(df))  # numeric columns only
    - outlier_rows = np.where((z > threshold).any(axis=1))[0]  # row positions; threshold = 3
    - outliers = df.iloc[outlier_rows]
    - df = df.drop(df.index[outlier_rows])
- IQR:
    - Q1 = df.quantile(0.25)
    - Q3 = df.quantile(0.75)
    - IQR = Q3 - Q1
    - outlier_rows = np.where(((df < (Q1 - 1.5*IQR)) | (df > (Q3 + 1.5*IQR))).any(axis=1))[0]
    - outliers = df.iloc[outlier_rows]
    - df = df.drop(df.index[outlier_rows])
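
Both rules can be sketched with boolean masks on a hypothetical all-numeric frame. The z-scores are computed directly with pandas here (population standard deviation, which is what `stats.zscore` uses by default), and the data values are invented.

```python
import pandas as pd

# Column "a" contains one extreme value (100); column "b" has none
df = pd.DataFrame({"a": list(range(1, 20)) + [100], "b": list(range(20))})

# Z-score rule: flag rows where any column is more than 3 stddevs from its mean
z = ((df - df.mean()) / df.std(ddof=0)).abs()
z_mask = (z > 3).any(axis=1)

# IQR rule: flag rows where any column falls outside the box plot fences
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
iqr_mask = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).any(axis=1)

outliers = df[iqr_mask]
df_clean = df[~iqr_mask]

print(df.index[z_mask].tolist(), df.index[iqr_mask].tolist())
print(len(df_clean))
```

With small samples the z-score rule can miss outliers, because a single extreme value also inflates the standard deviation it is measured against.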
        
Handling categorical data:
- One hot encoding: (Nominal)
    - encoded_df = pd.get_dummies(df, prefix='eg_park', columns=['catcol'])
    - Or,
    - from sklearn.preprocessing import OneHotEncoder
    - hotencoder = OneHotEncoder()
    - encoded = hotencoder.fit_transform(df[['catcol']]).toarray()
- Label encoding: (Ordinal; codes follow the sorted order of the labels, so check that this matches the intended order)
    - from sklearn.preprocessing import LabelEncoder  
    - labelencoder = LabelEncoder() 
    - df['catcol'] = labelencoder.fit_transform(df['catcol']) 
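
A pandas-only sketch of the two encodings, assuming a hypothetical nominal column `color` and an ordinal column `size` whose order (`small < medium < large`) is defined by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "red"],        # nominal: no order
    "size": ["small", "large", "medium"],    # ordinal: small < medium < large
})

# One hot encoding for the nominal column
onehot = pd.get_dummies(df, columns=["color"], prefix="color")

# Ordinal codes that respect an explicit, user-defined order
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

print(onehot.columns.tolist())
print(df[["size", "size_code"]])
```

An encoder that sorts labels alphabetically would give `large` the smallest code here, which is why the order is passed explicitly.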

Normalization and Scaling:
- from sklearn.preprocessing import StandardScaler
- std_scale = StandardScaler()
- df['col'] = std_scale.fit_transform(df[['col']]) # returns z-scores of the values of the attribute
- And,
- from sklearn.preprocessing import MinMaxScaler
- minmax_scale = MinMaxScaler()
- df['col'] = minmax_scale.fit_transform(df[['col']])

Transformation:
- from sklearn.preprocessing import FunctionTransformer
- log_transformer = FunctionTransformer(np.log1p) # log transform
- df['col'] = log_transformer.fit_transform(df[['col']])
- And,
- exp_transformer = FunctionTransformer(np.exp) # Exponential transform
- df['col'] = exp_transformer.fit_transform(df[['col']])
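
To see what the scalers and the log transform do to actual numbers, here is a small sketch on a hypothetical single-column frame (values chosen so that `log1p` gives multiples of `ln 2`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, StandardScaler

df = pd.DataFrame({"col": [0.0, 1.0, 3.0, 7.0, 15.0]})

# z-scores: mean 0 and (population) standard deviation 1
z = StandardScaler().fit_transform(df[["col"]]).ravel()

# min-max normalization: smallest value -> 0, largest value -> 1
mm = MinMaxScaler().fit_transform(df[["col"]]).ravel()

# log transform: log(1 + x), which is safe at x = 0
log_col = np.asarray(FunctionTransformer(np.log1p).fit_transform(df[["col"]])).ravel()

print(z)
print(mm)
print(log_col)
```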

Plots:
- Univariate:
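    - sns.kdeplot(df['col'])  # replaces sns.distplot(df, hist=False); distplot is deprecated in recent seaborn
    - plt.hist(df['col'], bins=50)
    - sns.violinplot(x=df['col'])
    - Cumulative distribution:
        - sns.ecdfplot(df['col'])  # replaces sns.distplot(data, hist_kws=dict(cumulative=True), kde_kws=dict(cumulative=True))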
- Multivariate:
    - sns.pairplot(data=df, kind='reg')
    - sns.scatterplot(x='col1', y='col2', hue='col3', data=df)
    - sns.heatmap(df.corr(), annot=True)
- Bivariate:
    - Num vs Num:
        - scatter plot: plt.scatter(df['col1'], df['col2'])
        - Line plot: plt.plot(df['col1'], df['col2'])
        - Heat map for correlation: sns.heatmap(df.corr(), annot=True)
        - Joint plot: sns.jointplot(x='col1', y='col2', data=df)
    - Cat vs Num:
        - Bar plot: sns.barplot(x='catcol', y='numcol', data=df)
        - Violin plot: sns.violinplot(x='catcol', y='numcol', data=df)
        - Categorical box plot: sns.boxplot(x='catcol', y='numcol', data=df)
        - Swarm plot: sns.swarmplot(x='catcol', y='numcol', data=df)
    - Cat vs Cat:
        - Bar plot: sns.countplot(x='catcol1', hue='catcol2', data=df)
        - Grouped bar plot: sns.barplot(x='catcol1', y='numcol', hue='catcol2', data=df)  # bar height must be numeric
        - Point plot: sns.pointplot(x='catcol1', y='numcol', hue='catcol2', data=df)  # point height must be numeric

### Data Preparation

- **Data Cleaning:** 
    - Identifying and correcting mistakes or errors in the data.
- **Feature Selection:** 
    - Identifying those input variables that are most relevant to the task.
- **Data Transforms:** 
    - Changing the scale or distribution of variables.
- **Feature Engineering:** 
    - Deriving new variables from available data.
- **Dimensionality Reduction:** 
    - Creating compact projections of the data.

Forms of input data:
- image
- time series
- text
- video, ...
- **tabular data**
  - most common format
  - structured data
  - in Excel or DB or in the CSV files (our focus)
  - matrix form of data
  - have **rows** (an example or an instance or a case or an observation)
    - **train set** - used for training model
    - **test set** - used to evaluate the model
  - have **columns** (properties or variables or features or attributes)
    - have **input variables** - used for training model
    - have **output variables** - to be predicted by the model

ML algos have requirements:
- linear regression: works best when variables are approximately Gaussian
- SVM, kNN, Random forests: generally need many training instances for better performance
- Garbage in garbage out
- Focus on data
- Focus on model
- 80% of Predictive modelling is data preparation

Prevent data leakage
- split dataset -> train, test
- fit data preparation on train set
    - data transforms
    - feature selection
    - dimensionality reduction
    - feature engineering
- apply data preparation on train and test set
- evaluate models

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
random_state=7)

In [3]:
# naive approach (leaks): normalize ALL the data first, then split and evaluate

# normalize the dataset; the scaler sees the test rows, so test-set statistics leak into training
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 84.848


In [4]:
# correct approach: split into train and test sets BEFORE fitting any scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# define the scaler
scaler = MinMaxScaler()

# fit on the training dataset
scaler.fit(X_train)

# scale the training dataset
X_train = scaler.transform(X_train)

# scale the test dataset
X_test = scaler.transform(X_test)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 85.455


Avoid data leakage:
1. Split into train and test sets
2. Scale (normalize)
    - fit scaler on train set
    - transform train and test sets
3. fit the model on train set
4. evaluate model on test set

In [5]:
# naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# define dataset
X, y = make_classification(n_samples=1000, 
                           n_features=20, 
                           n_informative=15, 
                           n_redundant=5,
                           random_state=7)

In [6]:
# normalize the dataset (naive: the scaler is fit on all rows, including every future test fold)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# define the model
model = LogisticRegression()

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Accuracy: 85.300 (3.607)


In [7]:
from sklearn.pipeline import Pipeline

# define the pipeline: inside each CV split the scaler is re-fit on the training folds only
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))

pipeline = Pipeline(steps=steps)

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Accuracy: 85.433 (3.471)


- In general, k-fold cross-validation gives a more reliable performance estimate than a single train-test split, at the cost of fitting the model k times

Data Cleaning:
- delete columns with a single value
    - to_del = df.columns[df.nunique() == 1]
    - df.drop(columns=to_del, inplace=True)
- delete duplicate rows
    - dups = df.duplicated()  # boolean mask of repeated rows
    - df.drop_duplicates(inplace=True)
- outlier removal
    - Standard deviation method (3-sigma rule, equivalent to |z-score| > 3)
        - lower = mean(data)-(std(data)*3)
        - upper = mean(data)+(std(data)*3)
        - outliers = [x for x in data if x < lower or x > upper]
        - afterremovingoutliers = [x for x in data if x >= lower and x <= upper]
    - IQR method:
        - q25 = np.percentile(data,25)
        - q75 = np.percentile(data,75)
        - iqr = q75 - q25
        - lower = q25 - 1.5*iqr
        - upper = q75 + 1.5*iqr
        - outliers = [x for x in data if x < lower or x > upper]
        - afterremovingoutliers = [x for x in data if x >= lower and x <= upper]
    - multi-dimensional outliers: LocalOutlierFactor: based on distance from local neighbours
        - marking
            - normal : 1
            - outlier : -1
        - from sklearn.neighbors import LocalOutlierFactor
        - lof = LocalOutlierFactor()
        - marking = lof.fit_predict(X_train)
        - notoutliers = marking != -1
        - train set without outliers
            - X_train, y_train = X_train[notoutliers, :], y_train[notoutliers]
- remove missing data
    - garbage_values = ['garbage', 'unknown', 'NA', 'missing', '?', '-1', '0'] 
    - df.replace(garbage_values,np.nan, inplace=True)
    - df.isnull().sum()
    - df.dropna(inplace=True)
- imputing missing values
    - SimpleImputer
        - from sklearn.impute import SimpleImputer
        - imputer = SimpleImputer(strategy='mean')
        - imputer.fit(X)
        - Xtrans = imputer.transform(X)
    - KNN imputation
        - from sklearn.impute import KNNImputer
        - imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean') # default neighbors = 5
        - imputer.fit(X)
        - Xtrans = imputer.transform(X)
    - IterativeImputer
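        - from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
        - from sklearn.impute import IterativeImputer
        - imputer = IterativeImputer() # replace NaN with iteratively estimated values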
        - imputer.fit(X)
        - Xtrans = imputer.transform(X)
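
The imputers follow the same fit-on-train, transform-everything pattern used to avoid leakage. A minimal sketch with hypothetical arrays `X_train` and `X_test`:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X_train = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])
X_test = np.array([[np.nan, 10.0]])

# Mean imputation: the column means are learned from the training rows only
mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(X_train)
X_train_mean = mean_imputer.transform(X_train)
X_test_mean = mean_imputer.transform(X_test)   # gap filled with the TRAIN mean

# KNN imputation: each gap is filled from the nearest rows that have the value
knn_imputer = KNNImputer(n_neighbors=2)
X_train_knn = knn_imputer.fit_transform(X_train)

print(mean_imputer.statistics_)
print(X_test_mean)
print(X_train_knn)
```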

Data transforms
- Scale numeric data
    - Normalization with MinMaxScaler()
        - from sklearn.preprocessing import MinMaxScaler
        - scaler = MinMaxScaler()
        - scaler.fit(data)
        - scaler.transform(data)
        - Avoid data leakage: 
            - fit on train data
            - transform on other data when required in the further process
        - range: 0 to 1
    - Standardization using StandardScaler()
        - from sklearn.preprocessing import StandardScaler
        - scaler = StandardScaler()
        - scaled = scaler.fit_transform(data)
        - result: mean 0, stddev 1; most values fall within -3 to +3 for roughly Gaussian data
            - preferred for models trained with gradient descent (GD)
            - avoid if the data has extreme outliers
    - Scale data with outliers
        - RobustScaler class
        - (x - median)/(p75 - p25)
        - from sklearn.preprocessing import RobustScaler
        - trans = RobustScaler()
        - data = trans.fit_transform(data)
            - median => 0, IQR => 1
- Encode Categorical data
    - Ordinal Encoding
        - from sklearn.preprocessing import OrdinalEncoder
        - ordinal = OrdinalEncoder()
        - ordinal.fit(X_train)
        - X_train = ordinal.transform(X_train)
        - X_test = ordinal.transform(X_test)
        - expects data in matrix form
        - ordinal data
            - user-defined order-to-value mapping via OrdinalEncoder(categories=[[...]]); the default is sorted order
    - Label Encoder
        - from sklearn.preprocessing import LabelEncoder
        - expects data in 1-d format
        - label_encoder = LabelEncoder()
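        - label_encoder.fit(y_train.astype('str'))
        - y_train = label_encoder.transform(y_train.astype('str'))
        - y_test = label_encoder.transform(y_test.astype('str'))
        - ordinal data
            - codes follow the sorted order of the labels, so use it only when that order is acceptable (typically for the target)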
    - One Hot Encoding
        - from sklearn.preprocessing import OneHotEncoder
        - encoder = OneHotEncoder(drop='first', sparse_output=False)  # sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
        - encoder.fit(X_train)
        - X_train = encoder.transform(X_train)
        - X_test = encoder.transform(X_test)
        - nominal data
    - Dummy variable encoding
        - encoded_df = pd.get_dummies(df, columns=['catcol'], prefix=['Cat'], drop_first=True)  # drop_first=True gives k-1 columns for k categories
- Numerical transformations
    - PowerTransformer 
        - make distribution more Gaussian
        - for numerical variables
        - remove skewness
        - log, sqrt, inverse transforms
        - box-cox transform (strictly positive data only)
        - yeo-johnson transform (also handles zero and negative values)
            - from sklearn.preprocessing import PowerTransformer
            - power = PowerTransformer(method='yeo-johnson', standardize=False)
            - data_trans = power.fit_transform(data)
    - QuantileTransformer
        - any distribution -> normal or uniform
        - from sklearn.preprocessing import QuantileTransformer
        - quantile = QuantileTransformer(n_quantiles=100, output_distribution='normal')
        - data_trans = quantile.fit_transform(data)
        - And,
        - quantile = QuantileTransformer() # uniform distribution by default
        - data_trans = quantile.fit_transform(data)
            - n_quantiles: when tuned over range(1, 100), model performance typically improves sharply up to some value (e.g. 10) and then plateaus
            - choose the n_quantiles value at which performance levels off
- Numerical to categorical
    - Discretization Transforms
        - Uniform: each bin has same width
        - Quantile: each bin has equal number of samples
        - Clustered: clusters are identified and labelled respective groups
        - KBinsDiscretizer
            - strategy: uniform, quantile, kmeans
            - n_bins, per strategy: uniform: flexible; quantile: fewer than the number of samples; kmeans: a reasonable number of clusters
            - encode: ordinal, onehot (default) or onehot-dense
        - from sklearn.preprocessing import KBinsDiscretizer
        - kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform') # bins = categories
        - data_trans = kbins.fit_transform(data)
        - And,
        - trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
        - data = trans.fit_transform(data)
- Polynomial Feature Transform
    - from sklearn.preprocessing import PolynomialFeatures
    - trans = PolynomialFeatures(degree=2) # const,a,b,a^2,ab,b^2 
    - data = trans.fit_transform(data)
    - degree=1 => input features are unchanged; only the constant (bias) column is added unless include_bias=False
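
A short sketch of two of the transforms above on hypothetical toy inputs: polynomial expansion of a single row `(a, b) = (2, 3)`, and uniform binning of the values 0 to 7 into four bins.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

row = np.array([[2.0, 3.0]])  # a = 2, b = 3

# degree 2 expands to: const, a, b, a^2, ab, b^2
deg2 = PolynomialFeatures(degree=2).fit_transform(row)

# degree 1 keeps a and b and only adds the constant column
deg1 = PolynomialFeatures(degree=1).fit_transform(row)

# the constant column can be switched off
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(row)

# uniform discretization: 4 equal-width bins over [0, 7]
values = np.arange(8, dtype=float).reshape(-1, 1)
kbins = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
bins = kbins.fit_transform(values).ravel()

print(deg2)
print(deg1)
print(bins)
```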

Advanced Transforms
- ColumnTransformer
    - ColumnTransformer applies different preprocessing steps to different groups of columns and combines the results, so mixed data types can be handled inside a single pipeline.


**Example of applying a single data transform.**

**Example of applying a sequence of data transforms.**

**As a single Task on Full dataset: Use ColumnTransformer**
1. impute numerical column missing values with median
2. scale the numerical columns
3. impute categorical column missing values with mode
4. one hot encode the categories

### How to use the ColumnTransformer

- each transformer is given as a tuple: (Name, Object, Columns)

**Example of applying ColumnTransformer to encode categorical variables.**

**Example of applying ColumnTransformer - impute different columns in different ways**

**Example of using the ColumnTransformer to only encode categorical variables and pass through the rest without dropping**

**Example of configuring and using the ColumnTransformer for data transformation.**

**Example of configuring and using the ColumnTransformer during model evaluation.**
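
The code for these examples is not present in this section, so the following is only a minimal sketch of the four-step task listed earlier (median impute and scale the numeric columns, mode impute and one hot encode the categorical ones). The frame, its column names and the choice of `MinMaxScaler` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical mixed-type data with gaps in every column
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "income": [10.0, 20.0, np.nan, 40.0],
    "city": ["x", "y", np.nan, "x"],
})
num_cols = ["age", "income"]
cat_cols = ["city"]

# Steps 1-2: median imputation, then scaling, for numeric columns
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Steps 3-4: mode imputation, then one hot encoding, for categorical columns
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Each entry is (name, transformer object, columns)
ct = ColumnTransformer(
    transformers=[("num", numeric, num_cols), ("cat", categorical, cat_cols)],
    remainder="drop",        # use "passthrough" to keep untouched columns
    sparse_threshold=0,      # always return a dense array
)
Xt = ct.fit_transform(df)
print(Xt)
```

For model evaluation, `ct` can be placed as the first step of a `Pipeline` followed by the model, so that cross-validation re-fits the preprocessing on the training folds only.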

### Feature Selection

![image.png](attachment:image.png)

Chi-squared test for feature selection when you have categorical input features and a categorical output (target) feature:
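
The original code cell is missing here; as a hedged sketch, chi-squared selection can be done with `SelectKBest` and `chi2` from scikit-learn. The data is a made-up example in which the first (already ordinal-encoded, non-negative) feature copies the target and the second is unrelated to it.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical encoded categorical inputs and a binary target
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    y,                                   # feature 0: identical to the target
    np.array([0, 1, 0, 1, 0, 1, 0, 1]),  # feature 1: independent of the target
])

# Keep the single feature with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)
print(selector.get_support())
```

`chi2` expects non-negative feature values, so categorical inputs are usually ordinal or one hot encoded first.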

ANOVA F-test for feature selection when you have numerical input features and a categorical output (target) feature:
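
The original code cell is missing here as well; one possible sketch uses `SelectKBest` with `f_classif`. The synthetic data is an assumption: one numeric feature whose class means differ strongly, and one pure-noise feature.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)                      # categorical target, 50 rows per class

informative = y * 5.0 + rng.normal(size=100)   # class means differ by about 5 stddevs
noise = rng.normal(size=100)                   # unrelated to the target
X = np.column_stack([informative, noise])

# Keep the single feature with the highest ANOVA F-statistic
selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)

print(selector.scores_)
print(selected_idx)
```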

ANOVA F-test for feature selection when you have categorical input features and a numerical output (target) feature: