# Data Science Interview Questions
## Part 1 : Data cleaning, pre-processing, EDA
This Jupyter notebook serves as a comprehensive resource for data science interview questions, focusing specifically on topics related to data cleaning, preprocessing, and analysis. Here, you'll find a curated collection of questions that span various aspects of data preparation, transformation, and exploratory data analysis. Whether you're preparing for an interview or seeking to deepen your understanding of essential data science concepts, this notebook aims to provide a structured and informative guide to help you navigate through key challenges in the field. Explore the questions, test your knowledge, and enhance your proficiency in the fundamental stages of data science workflows.

### 1- How to deal with missing values ?


### 2- How to detect outliers in a data set ?
 

### 3- How to handle duplicates ?

### 4- What does normal distribution mean ?
The normal distribution is very useful in machine learning because it has well-defined statistical characteristics (it is fully described by its mean and standard deviation) and because many statistical methods assume normally distributed data or errors. It is symmetric around its center, so mode = mean = median:

- Mean: also called the average of a data set; it is found by summing all the values in the data set and then dividing by the number of values.
- Mode : it is the value that appears most often in a set of data values.
- Median : the middle number; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).
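
The three measures above can be computed with Python's standard `statistics` module. This is a minimal sketch on a made-up sample (the values are hypothetical). Because this sample is not symmetric, the three measures differ; for normally distributed data they would coincide.

```python
import statistics

# Hypothetical sample; the values are made up for illustration.
data = [2, 3, 3, 5, 7, 8, 3, 9]

mean_value = statistics.mean(data)      # sum of the values divided by their count
median_value = statistics.median(data)  # middle value (mean of the two middle values here)
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)
```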

### 5- What do percentile and quantile mean ?

## Part 2: Feature Engineering

### 1- What does feature engineering mean ?

Feature engineering refers to the process of raw data manipulation such as addition, deletion, combination, mutation etc. It encompasses the process of creating new features or modifying existing ones to improve the performance of a machine learning model. 

Here is a range of significant activities used in Feature Engineering :

- Feature Selection
- Data Transformation
- Text Data Processing
- Time-Series Feature Engineering

### 2- What does data transformation mean ?

Data transformation is one subtask within the broader field of feature engineering in machine learning. It involves modifying the raw data to make it more suitable for the learning algorithm.
It includes : 
- Feature Scaling
- Feature encoding
- Feature extraction
- Binning or Discretization
- Creating Interaction Terms

### 3- What does feature scaling mean ?
Feature scaling is a preprocessing step in machine learning that involves transforming the numerical features of a dataset to a common scale. Feature scaling is particularly important for algorithms that rely on distance metrics or gradient descent optimization.

Here are common techniques for feature scaling:
- Normalization
- Standard scaling : rescales features to have a mean of 0 and a standard deviation of 1 (by subtracting the mean and dividing by the standard deviation)
- Log scaling or Log transformation
- Polynomial transformation
- Robust scaling

#### 3. 1- Why do we need to perform feature scaling ? 
The goal is to ensure that all features contribute equally to the learning process and to prevent certain features from dominating due to differences in their magnitudes.

#### 3. 2- Normalization - Min-Max Scaling
- Scales the feature values to a specific range, usually between 0 and 1
- Formula : $X_{normalized}= {X-X_{min}\over X_{max}-X_{min}}$
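
The formula above can be sketched in plain Python as follows. The function name `min_max_scale` and the sample values are hypothetical, and returning zeros for a constant feature is an assumed convention, not something the formula prescribes. In practice, scikit-learn's `MinMaxScaler` provides this kind of scaling.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Constant feature: the formula would divide by zero.
        # Returning zeros is an assumed convention for this sketch.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]

ages = [18, 25, 40, 60]  # hypothetical feature values
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

The smallest value maps to 0 and the largest to 1; everything else falls in between, in proportion to its distance from the minimum.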

#### 3. 3- Standard scaling - Z-score normalization
- Centers the feature values around zero with a standard deviation of 1.
- Suitable for algorithms that assume a normal distribution of features.
- Formula: $X_{standardized} ={ X - mean(X) \over std(X)}$
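
A minimal sketch of the formula above, using only the standard library. The name `standardize` and the sample values are hypothetical. The sketch assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would give slightly different values.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(x - mean) / std for x in values]

heights = [150, 160, 170, 180, 190]  # hypothetical feature values
z_scores = standardize(heights)
print(z_scores)
```

After the transformation the feature has a mean of 0 and a standard deviation of 1, whatever its original unit was.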

#### 3. 4- Robust Scaling
- Scales the features based on the interquartile range (IQR) to handle outliers.
- Formula: $X_{robust} = {X - median(X)\over IQR(X)}$
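
The sketch below applies the formula above to a made-up feature that contains one extreme value. The name `robust_scale` and the data are hypothetical. Quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is one of several quartile conventions, so other tools may return slightly different numbers.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    median = statistics.median(values)
    iqr = q3 - q1
    return [(x - median) / iqr for x in values]

incomes = [20, 22, 25, 27, 30, 500]  # hypothetical values; 500 is an outlier
scaled_incomes = robust_scale(incomes)
print(scaled_incomes)
```

Because the median and the IQR barely react to the outlier, the typical values stay close to 0 while the outlier remains clearly separated.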

#### a. IQR : interquartile range
- The IQR is the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1
- Q1 : It represents the median of the lower 50% of the data.
- Q3 : It represents the median of the upper 50% of the data

Here's how you calculate the IQR: 
- 1. Order the dataset: arrange the values in the dataset in ascending order
- 2. Determine the median (Q2): which is the middle value of the dataset. If the dataset has an odd number of observations, the median is the middle value. If it has an even number, the median is the average of the two middle values.
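- 3. Find the first quartile (Q1): the median of the lower half of the ordered data
- 4. Find the third quartile (Q3): the median of the upper half of the ordered data
- 5. Calculate the IQR: IQR = Q3 - Q1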
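
The steps above can be sketched as follows. The function name `iqr_by_halves` is hypothetical, and the sketch assumes one common convention: when the number of observations is odd, the median itself is excluded from both halves.

```python
import statistics

def iqr_by_halves(values):
    """Return (Q1, Q2, Q3, IQR) using the median-of-halves method.

    Needs at least two values, otherwise the halves are empty.
    """
    ordered = sorted(values)                 # step 1: order the dataset
    n = len(ordered)
    half = n // 2
    q2 = statistics.median(ordered)          # step 2: median
    lower = ordered[:half]                   # values below the median position
    # For odd n, skip the median itself (assumed convention).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = statistics.median(lower)            # step 3: first quartile
    q3 = statistics.median(upper)            # step 4: third quartile
    return q1, q2, q3, q3 - q1               # step 5: IQR

q1, q2, q3, iqr = iqr_by_halves([7, 1, 3, 9, 5, 11, 13])
print(q1, q2, q3, iqr)
```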

The IQR provides a robust measure of the spread of the middle 50% of the data, making it less sensitive to extreme values or outliers. It is commonly used in box plots to visually represent the dispersion of data.

#### 3. 5- Log Transformation

- The log transformation is one of the most widely used transformations in machine learning.
- It aims to make highly skewed distributions (typically right-skewed features with a long tail of large values) less skewed.
- The logarithm used is usually the natural logarithm (base e), although the common logarithm (base 10) is also used. The logarithm is only defined for strictly positive values.
- If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution.
- Real raw data do not always follow a normal distribution. They are often so skewed that the results of statistical analyses that assume normality become unreliable. That’s where the log transformation comes in.
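
To illustrate the effect, the sketch below compares the skewness of a made-up right-skewed feature before and after taking the natural logarithm. The data and the helper `skewness` (a simple moment-based estimate) are hypothetical. All values must be strictly positive for `math.log`; `math.log1p`, which computes log(1 + x), is a common alternative when zeros are present.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((x - mean) / std) ** 3 for x in values])

# Hypothetical right-skewed feature with a long tail of large values.
prices = [1, 2, 2, 3, 4, 5, 8, 20, 150, 1000]
log_prices = [math.log(p) for p in prices]  # natural logarithm (base e)

print("skewness before:", skewness(prices))
print("skewness after: ", skewness(log_prices))
```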

#### 3. 6- Polynomial transformation
- It is a feature engineering technique used in machine learning and statistics to capture non-linear relationships between variables.
- It involves transforming input features by raising them to the power of an integer, creating polynomial terms. The most common form is the quadratic transformation (squared terms), but higher-order polynomials can also be used.
- Such transformations are often beneficial for machine learning algorithms, particularly in tasks involving numerical input variables, improving predictive accuracy, especially in regression tasks.
- If X is one input feature ==> $X^2$ is its degree-2 polynomial feature.
- The “degree” of the polynomial is used to control the number of features added, e.g. a degree of 3 will add two new variables ($X^2$ and $X^3$) for each input variable. Typically a small degree is used such as 2 or 3. Choosing the polynomial degree carefully is important because it determines the number of input features created. 
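
A minimal sketch of the single-feature case described above; the function name `polynomial_features` is hypothetical. It only raises one input to successive powers. Library implementations such as scikit-learn's `PolynomialFeatures` also generate interaction terms between different features and a bias column by default, so their output is wider.

```python
def polynomial_features(x, degree):
    """Return [x, x**2, ..., x**degree] for one numeric input."""
    return [x ** power for power in range(1, degree + 1)]

row = polynomial_features(3, degree=3)
print(row)  # [3, 9, 27]
```

With a degree of 3, the original value is kept and two new variables (the square and the cube) are added.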

**More notes :** 

- Higher-degree polynomials (Degree > 2) can lead to overfitting, capturing noise in the data rather than true underlying patterns. Regularization techniques may be needed to mitigate this.
- It's important to scale features before applying polynomial transformations to prevent features with larger scales from dominating the transformed values.

### 4- How to deal with categorical values ?
- Drop categorical variables
- Perform feature encoding

#### 4.1- What does feature encoding mean ? 

Feature encoding is the process of converting categorical data or text data into a numerical format that can be easily used for machine learning algorithms. In many machine learning models, the input features are expected to be numerical, and encoding is necessary when dealing with non-numeric data.

Here are some common encoding methods: 
- Ordinal encoding: Assign numerical values based on the inherent order of categories
- One-hot encoding : Create binary columns for each category, indicating its presence (1) or absence (0)
- Label Encoding : Assign a unique numerical label to each category in a categorical variable
- Binary Encoding : Convert each category to an integer, then write that integer in binary, with one column per binary digit
- Frequency (Count) Encoding: Replace each category with its frequency or count in the dataset


**!! Notes :**
- Ordinal encoding is a good choice when the categories have a natural ranking (low, medium, high); it is mostly used with decision trees and random forests.
- One-hot encoding is more commonly used when there is no ranking among the categories.
- If a categorical variable has high cardinality (many distinct categories), one-hot encoding can greatly increase the number of columns in the dataset.
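
Three of the encodings listed above can be sketched in plain Python. The column `sizes` and the mapping `order` are hypothetical. In practice, pandas or scikit-learn (for example `OrdinalEncoder` and `OneHotEncoder`) would usually handle this.

```python
from collections import Counter

# Hypothetical categorical column with a natural ranking.
sizes = ["low", "high", "medium", "low", "low"]

# Ordinal encoding: map each category to its rank (the mapping is chosen by hand).
order = {"low": 0, "medium": 1, "high": 2}
ordinal = [order[s] for s in sizes]

# One-hot encoding: one binary column per distinct category.
categories = sorted(set(sizes))  # ['high', 'low', 'medium']
one_hot = [[1 if s == c else 0 for c in categories] for s in sizes]

# Frequency (count) encoding: replace each category with its count.
counts = Counter(sizes)
frequency = [counts[s] for s in sizes]

print(ordinal)
print(one_hot)
print(frequency)
```

The one-hot result has as many columns as there are distinct categories, which is why high-cardinality variables make the dataset much wider.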

### 5- What does feature extraction mean ?
One of the primary goals of feature extraction is to reduce the dimensionality of the dataset. High-dimensional data can lead to the curse of dimensionality, making it challenging for models to generalize well.

Feature extraction aims to retain the most relevant information from the original data. This involves identifying features that contribute significantly to the variability and patterns within the dataset while discarding redundant or irrelevant information.

Here are some common feature extraction techniques:

- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Independent Component Analysis (ICA)
- Bag-of-Words (BoW)
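
As a small illustration of the last item, Bag-of-Words turns each text into a vector of word counts over a shared vocabulary. The sketch below is a simplified version with made-up documents; it assumes lowercase, whitespace-based tokenization, which is much cruder than what real text pipelines use.

```python
from collections import Counter

# Hypothetical documents.
docs = ["the cat sat", "the dog sat on the mat"]

# Naive tokenization: lowercase and split on whitespace (an assumption).
tokenized = [doc.lower().split() for doc in docs]

# Shared vocabulary, sorted so that column order is stable.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One count vector per document, one column per vocabulary word.
bow = []
for doc in tokenized:
    word_counts = Counter(doc)
    bow.append([word_counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```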

#### 5. 1- What does Principal Component Analysis (PCA) mean ? 

#### 5. 2- What does Singular Value Decomposition (SVD) mean ? 
#### 5. 3- What does Independent Component Analysis (ICA) mean ? 