In [10]:
import pandas as pd 
import numpy as np

# Handling Missing Data Questions:
**1.1 How do you identify and handle missing values in a Pandas DataFrame?**
• Identifying missing values
Pandas provides two methods for identifying missing values: isnull() and notnull().
The isnull() method detects missing values. It returns a boolean DataFrame where True marks a missing value.

In [11]:
data = {'A': [1, 2, 3, np.nan, 6, np.nan],
        'B': [6, 2, 5, np.nan, 3, 0],
        'C': ['A', 'B', 1, np.nan, 'Z', 'O'],
        'D': ['L', 'D', 2, 3, 10, 12]}
df = pd.DataFrame(data)
print(df.isnull())

       A      B      C      D
0  False  False  False  False
1  False  False  False  False
2  False  False  False  False
3   True   True   True  False
4  False  False  False  False
5   True  False  False  False


notnull() method in Pandas is used to detect non-missing values in a DataFrame. It returns a boolean DataFrame where True indicates that the value is not missing, and False indicates that the value is missing.

In [12]:
print(df.notnull())

       A      B      C     D
0   True   True   True  True
1   True   True   True  True
2   True   True   True  True
3  False  False  False  True
4   True   True   True  True
5  False   True   True  True
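
A boolean mask is often easier to read once it is summarized. The following is a minimal sketch, assuming a smaller frame named `df_missing` (a hypothetical name, built from the same A and B values as above), that counts missing values per column and selects the rows containing at least one of them:

```python
import numpy as np
import pandas as pd

# Hypothetical frame reusing the A and B columns from the example above
df_missing = pd.DataFrame({
    'A': [1, 2, 3, np.nan, 6, np.nan],
    'B': [6, 2, 5, np.nan, 3, 0],
})

# Number of missing values in each column
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df_missing[df_missing.isnull().any(axis=1)]
print(rows_with_missing)
```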


+ Handling missing values
+ Empty cells can give you a wrong result when you analyze data.
One way to deal with empty values is to remove the rows that contain them with the dropna() method. By default, dropna() returns a new DataFrame and does not change the original.

In [13]:
new_df = df.dropna()

print(new_df.to_string())

     A    B  C   D
0  1.0  6.0  A   L
1  2.0  2.0  B   D
2  3.0  5.0  1   2
4  6.0  3.0  Z  10


Another way to deal with empty cells is to insert a new value instead. This way you don't have to delete entire rows because of a few empty cells.

In [14]:
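# Assign the result back instead of using inplace=True on df["A"], which newer pandas versions warn about
df["A"] = df["A"].fillna(6)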
print(df)

     A    B    C   D
0  1.0  6.0    A   L
1  2.0  2.0    B   D
2  3.0  5.0    1   2
3  6.0  NaN  NaN   3
4  6.0  3.0    Z  10
5  6.0  0.0    O  12


Here, all missing values in column A were replaced with 6.
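
Both methods also accept arguments that narrow what they act on. The sketch below uses a hypothetical frame `df_fill` with made-up values; it fills each column with its own replacement by passing a dict to fillna(), and drops only the rows where one chosen column is missing by passing `subset` to dropna():

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df_fill = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [np.nan, 5, 6],
    'C': ['x', None, 'z'],
})

# Fill each column with its own replacement value; a new DataFrame is returned
filled = df_fill.fillna({'A': 0, 'B': df_fill['B'].mean(), 'C': 'unknown'})
print(filled)

# Drop only the rows where column 'A' is missing
dropped = df_fill.dropna(subset=['A'])
print(dropped)
```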

**1.2: What is imputation, and why might it be useful in dealing with missing data?**
• To address missing values in a dataset, imputation is a commonly used technique where missing values are filled in by estimating or deriving values based on available data. Imputation is a crucial step in data preprocessing as it ensures that datasets are complete and suitable for analysis and modeling.

Various imputation techniques have been developed to handle missing data, particularly when the missingness is at random and maintaining the structure of the data is important. These techniques include mean or median imputation, mode imputation, regression imputation, k-nearest neighbors imputation, and more. Each technique has its own advantages and considerations, and the choice of imputation method depends on factors such as the nature of the data and the analysis goals.

Common imputation methods include:

1. **Mean/Median Imputation**:
• A common way to replace empty cells is to use the mean or median of the column.
• Pandas provides the mean() and median() methods to calculate these values for a specified column. The mean is the average value; the median is the middle value when the data is sorted.
2. **Mode Imputation**:
• Substituting missing categorical values with the mode, which represents the most frequently occurring value for each attribute, is a suitable approach for imputation. 
• This method is particularly relevant for categorical features with missing values.
3. **Regression Imputation**:
• Use regression models trained on the non-missing values of the dataset to predict the missing ones.
• This technique can be used for both numerical and categorical features; it exploits the relationships among features to impute missing values.
4. **K-Nearest Neighbors (KNN) Imputation**:
• Missing values are estimated from the most similar instances (the nearest neighbors) in the dataset.
• The method uses the distances between instances, computed from their attributes, to decide which neighbors supply the value to insert.
5. **Multiple imputation**:
• Generate multiple datasets, each with plausible values filled in for the missing entries.
• Run the statistical analysis on each dataset and combine the results to account for the uncertainty introduced by imputation.
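
The simplest of these methods can be sketched directly in pandas. The example below assumes a hypothetical frame `df_imp` with made-up values and shows mean, median, and mode imputation; the regression, KNN, and multiple imputation approaches need additional tooling and are not shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one numerical and one categorical column
df_imp = pd.DataFrame({
    'age': [20.0, 30.0, np.nan, 40.0, 100.0],
    'city': ['Rome', 'Paris', 'Rome', None, 'Rome'],
})

# Mean and median imputation for the numerical column
age_mean_filled = df_imp['age'].fillna(df_imp['age'].mean())
age_median_filled = df_imp['age'].fillna(df_imp['age'].median())

# Mode imputation for the categorical column
# (mode() returns a Series, so take its first value)
city_mode_filled = df_imp['city'].fillna(df_imp['city'].mode()[0])

print(age_mean_filled.tolist())
print(age_median_filled.tolist())
print(city_mode_filled.tolist())
```

Note how the value 100 pulls the mean (47.5) well above the median (35.0); this is one reason the median is often preferred for skewed columns.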

Imputation fills in missing values by estimating them from the available data. It is useful in dealing with missing data for several reasons:

1. **Preservation of Data Integrity**: 
Imputation preserves the structure of the dataset: no rows or variables have to be dropped before the data is passed to statistical analyses or machine learning algorithms.

2. **Enhancement of Statistical Power**: 
Because imputation keeps all available observations, the sample size stays larger, which gives statistical analyses more power and makes their results more reliable.

3. **Mitigation of Bias**: 
Ignoring rows with missing values (complete case analysis) can bias the results, especially if the missingness is related to the outcome variable (the variable we want to test). Imputation retains those observations and can reduce this bias.

4. **Compatibility with Analysis Techniques**: 
Many statistical methods and ML algorithms accept only complete datasets as input. Imputation allows such methods to be used without modification.

5. **Improved Interpretability**: 
Imputation lets researchers analyze the whole dataset, which makes results easier to interpret and limits the information lost because of missing values.

In a nutshell, imputation is a valuable tool in data analysis: it handles missing data in a principled way so that the analysis is based on the most complete representation of the dataset available.

In [15]:
df = pd.read_csv('C:/Users/User/Downloads/data1.csv')
x = df["Pulse"].mean()
y = df["Pulse"].median()
print(x)
print(y)

107.49704142011835
105.0


These are the mean and median values of the "Pulse" column.

# Data Transformation Questions:
2.1: **How can you encode categorical variables in a Pandas DataFrame?**
    

In Pandas, categorical variables can be encoded using various techniques. Here are some common methods:
1. **Label encoding**:
Label encoding assigns a unique integer to each class of the variable. Pandas provides the cat.codes accessor, which can be used for label encoding. However, label encoding may be unsuitable for categorical variables that have no inherent order, because the integers imply one.
Sample code: `df['category_column_encoded'] = df['category_column'].astype('category').cat.codes`
2. **Ordinal encoding**:
Ordinal encoding assigns integers to categories based on a predefined order. This method is useful when there is an inherent order among the categories.
Sample code:
`category_order = ['low', 'medium', 'high']`
`df['category_column_encoded'] = df['category_column'].map({cat: idx for idx, cat in enumerate(category_order)})`
3. **One-hot encoding**:
One-hot encoding is explained in the next question.
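
The difference between label and ordinal encoding is easiest to see side by side. The sketch below assumes a hypothetical frame `df_enc` with a `size` column; for the ordinal case it uses an ordered `pd.Categorical` as an alternative to the map() approach:

```python
import pandas as pd

# Hypothetical example data
df_enc = pd.DataFrame({'size': ['low', 'high', 'medium', 'low']})

# Label encoding: codes follow the sorted category names
# (high=0, low=1, medium=2), which ignores the real order
df_enc['size_label'] = df_enc['size'].astype('category').cat.codes

# Ordinal encoding: codes follow an explicit order (low=0, medium=1, high=2)
category_order = ['low', 'medium', 'high']
df_enc['size_ordinal'] = pd.Categorical(
    df_enc['size'], categories=category_order, ordered=True
).codes

print(df_enc)
```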

2.2: **What is one-hot encoding, and when would you use it in data preprocessing?**
**One-hot encoding:**
One-hot encoding is a common technique in which a categorical variable is encoded as binary vectors. Each category is mapped to a vector in which the element corresponding to that category is one and all other elements are zero.

For example, consider a categorical variable "Color" with three categories: Red, Green, and Blue. After one-hot encoding, this variable is transformed into three binary variables named "Red", "Green", and "Blue", each representing one color. Each data point has a value of 1 for the category it belongs to and 0 for all other categories.

One-hot encoding is widely used in data preprocessing, for example in machine learning, whenever categorical variables have to be handled. Here are some scenarios where one-hot encoding is commonly used:

1. **Nominal Categorical Variables**: One-hot encoding is well suited to nominal categorical variables, whose categories have no inherent order. Examples include gender, country, or car type.

2. **Algorithms That Require Numerical Input**: Many machine learning algorithms, such as linear regression, logistic regression, and neural networks, accept only numeric input. Representing categories as one-hot vectors makes these variables numeric and ready to use with such algorithms.

3. **Preventing Ordinal Assumptions**: One-hot encoding represents each category as a separate binary variable, which avoids implying an order among the categories. This matters when the categories have no natural order.

4. **Interpretability**: One-hot encoding represents each category explicitly as its own feature, which makes it easier to see how each category relates to other variables.

In short, one-hot encoding is a fundamental preprocessing technique for nominal categorical variables in machine learning, particularly when the algorithms require numerical input.
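
In pandas, one common way to do this is `pd.get_dummies`. The sketch below assumes a hypothetical frame `df_color`; `dtype=int` is passed so the result contains 0/1 integers regardless of the pandas version's default:

```python
import pandas as pd

# Hypothetical example data
df_color = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# One binary column per category; column names get the 'Color_' prefix
one_hot = pd.get_dummies(df_color['Color'], prefix='Color', dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column that matches its original category.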

# Removing Duplicates Questions:
3.1 **How do you identify and remove duplicate rows from a DataFrame?**
Duplicate rows in a DataFrame are rows that have identical values across all columns.

To identify duplicate rows in a DataFrame in Pandas, we can use the duplicated() method. This method returns a boolean Series indicating if each row is a duplicate of a previous row. For instance:

In [17]:
data_duplic = {'1st column': [3, 4, 6, 7, 4],
        '2nd column': ['a', 'b', 'c', 'd', 'b']}
df_duplic = pd.DataFrame(data_duplic)

duplicate = df_duplic.duplicated()

print("Original DF:\n")
print(df_duplic)
print("Duplicate :")
print(duplicate)

Original DF:

   1st column 2nd column
0           3          a
1           4          b
2           6          c
3           7          d
4           4          b
Duplicate :
0    False
1    False
2    False
3    False
4     True
dtype: bool


To remove duplicate rows, we can use the drop_duplicates() method. It removes rows that are duplicates of others and keeps only one occurrence of each. For example:

In [18]:
data_remove_duplic = {'1st column': [3, 4, 6, 7, 4],
        '2nd column': ['a', 'b', 'c', 'd', 'b']}
df_remove_duplic = pd.DataFrame(data_remove_duplic)

cleaned_df = df_remove_duplic.drop_duplicates()

print("Original DF:\n")
print(df_remove_duplic)
print("DataFrame without duplicates:") 
print(cleaned_df)

Original DF:

   1st column 2nd column
0           3          a
1           4          b
2           6          c
3           7          d
4           4          b
DataFrame without duplicates:
   1st column 2nd column
0           3          a
1           4          b
2           6          c
3           7          d


3.2 **Can you explain the difference between the duplicated() and drop_duplicates() methods in Pandas?**
Both duplicated() and drop_duplicates() are Pandas methods for handling duplicate rows in a DataFrame, but they serve different purposes:


duplicated():

+ The duplicated() method is used to identify duplicate rows in a DataFrame.
+ It returns a boolean Series where each value indicates whether the corresponding row is a duplicate.
+ By default, it considers all columns when checking for duplicates.
+ You can also specify a subset of columns to focus on.

drop_duplicates():
+ The drop_duplicates() method is used to remove duplicate rows from a DataFrame.
+ It returns a new DataFrame with duplicate rows removed.
+ By default, it keeps the first occurrence of each duplicate row.
+ You can specify the keep parameter to control which duplicates to retain:
    - 'first': Keep the first occurrence (default).
    - 'last': Keep the last occurrence.
    - False: Drop all duplicates.
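
The `subset` and `keep` options can be sketched on a small example. The frame `df_dup` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical example data
df_dup = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Ann', 'Ann'],
    'score': [1, 2, 1, 3],
})

# Full-row check: only row 2 repeats an earlier row exactly
full_row_mask = df_dup.duplicated()

# Check on 'name' only: rows 2 and 3 repeat 'Ann'
name_mask = df_dup.duplicated(subset=['name'])

# Keep the last occurrence of each name
keep_last = df_dup.drop_duplicates(subset=['name'], keep='last')

# Drop every row whose name appears more than once
drop_all = df_dup.drop_duplicates(subset=['name'], keep=False)

print(full_row_mask.tolist())
print(name_mask.tolist())
print(keep_last)
print(drop_all)
```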

# Data Scaling and Normalization Questions:
4.1 **Discuss the importance of feature scaling in machine learning.**
Feature scaling is an essential part of machine learning preprocessing that transforms all numerical features to a common scale. It contributes significantly to stable model training and good performance. Scaling techniques normalize the ranges and magnitudes of features so that the model treats them consistently.
Feature scaling plays a crucial role in machine learning for a variety of reasons:

+ Many machine learning algorithms use distance-based calculations to make predictions. If the features are not scaled, those with larger values can have a disproportionate impact on the results.

+ Feature scaling can help improve the convergence speed and performance of some optimization algorithms.

+ Some scaling methods can also help with skewed data and outliers, which can influence the model’s behavior.


1. Importance of feature scaling in machine learning:
+ **Enhancing Model Performance**:
   - Feature scaling can boost the performance of machine learning models. When the features are on a similar scale, optimization algorithms search the parameter space more efficiently and are more likely to find a good solution.
   - It often leads to faster convergence and better predictions, especially for algorithms such as k-nearest neighbors, support vector machines, and neural networks.

+ **Addressing Skewed Data and Outliers**:
   - Skew, outliers, and noise in the data are frequent causes of poor model performance. Scaling the features can help with this problem: transforming the data into a standardized range reduces the impact of extreme values and makes the model more reliable.
   - Left unaddressed, extreme values can introduce systematic bias. This is most evident in algorithms that are sensitive to outliers or assume normally distributed data, such as linear regression.
+ **Balanced Feature Influence**:
   - When features are on different scales, a large-scale feature can dominate the model while a small-scale feature is effectively ignored. Feature scaling lets all features contribute to the model without being overshadowed by others merely because of their scale.
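
The effect of scale on distance-based methods can be illustrated with two made-up data points. In this sketch the feature values and the `ranges` used for scaling are assumptions chosen only for illustration:

```python
import numpy as np

# Two hypothetical people described by (income, age)
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 65.0])

# Unscaled Euclidean distance: dominated by the income difference of 1000,
# even though the age difference of 40 years is arguably more meaningful
raw_distance = np.linalg.norm(a - b)

# Divide each feature by an assumed range so both lie on a comparable scale
ranges = np.array([100_000.0, 80.0])
scaled_distance = np.linalg.norm((a - b) / ranges)

print(raw_distance)     # about 1000.8
print(scaled_distance)  # about 0.5
```

After scaling, the age difference (0.5 of its range) outweighs the income difference (0.01 of its range), so the distance reflects both features rather than only the one with the larger numbers.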

4.2 **Explain the difference between min-max scaling and z-score normalization.**
The differences between **min-max scaling** and **z-score normalization** are as follows:

1. **Min-Max Scaling**:
   - **Objective**: Min-max scaling ensures that all features have the same scale, typically between 0 and 1.
   - **Method**:
     - Subtract the minimum value of the feature from each data point.
     - Divide the result by the range of the feature (i.e., the difference between the maximum and minimum values).
   - **Advantages**:
     - Simple and intuitive.
     - Guarantees consistent scaling across features.
   - **Disadvantages**:
     - Not robust to outliers: Extreme values can disproportionately affect the scaling.
     - May not work well if the data distribution is not uniform.
   - **Formula**:
   $$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

2. **Z-Score Normalization (Standardization)**:
   - **Objective**: Z-score normalization scales data to have a mean of 0 and a standard deviation of 1.
   - **Method**:
     - Subtract the mean of the feature from each data point.
     - Divide the result by the standard deviation of the feature.
   - **Advantages**:
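     - Less distorted by a single extreme value than min-max scaling, because values are not squeezed into a fixed range; note that the mean and standard deviation are themselves affected by outliers.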
     - Useful for algorithms that assume normally distributed data.
   - **Disadvantages**:
     - Does not guarantee the same range for all features.
     - May not be suitable for non-Gaussian distributions.
   - **Formula**:
   $$z = \frac{x - \mu}{\sigma}$$

In summary, min-max scaling maps every feature to the same fixed range but is strongly affected by outliers, while z-score normalization is less affected by outliers but does not bound features to an identical range.
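
Both methods can be written in a few lines of pandas. This is a minimal sketch on a hypothetical Series named `values`; `ddof=0` is passed so that the population standard deviation is used and the scaled result has a standard deviation of exactly 1:

```python
import pandas as pd

# Hypothetical example data
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: subtract the minimum, divide by the range
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: subtract the mean, divide by the standard deviation
z_score = (values - values.mean()) / values.std(ddof=0)

print(min_max.tolist())
print(z_score.tolist())
```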

# Handling Outliers Questions:
5.1 **What are outliers, and why might they impact machine learning models?**
+ Outliers are data points that lie far from the rest of the data, above or below the expected range.
+ Outliers can arise for several reasons, including measurement errors, experimental abnormalities, and genuinely exceptional observations.
+ Outliers can skew statistics and findings, lowering both reliability and accuracy.

**Impact**: 
+ Distorted Distribution:
  - Outliers can affect the whole distribution of the data. When a machine learning algorithm encounters extreme values, it may struggle to identify meaningful patterns.
  - Imagine a scatter plot where most of the points are grouped in one place but a few lie far away from the others. Such points can harm the model's accuracy.
+ Faulty Conclusions:
  - Outliers can lead to faulty conclusions about the data. For instance, if we’re predicting housing prices based on features like square footage and location, an outlier (e.g., an unusually expensive mansion) might skew the model’s understanding.
  - The model could mistakenly learn that all houses are expensive, even though most fall within a reasonable price range.
+ Sensitive Models:
  - Some machine learning algorithms are sensitive to outliers. Linear regression, for example, tries to fit a line that minimizes the sum of squared errors. Outliers can disproportionately influence this error term.
  - Robust models, such as decision trees or random forests, are less affected by outliers because they don’t rely on assumptions about data distribution.
+ Anomaly Detection: 
  - Outliers are often associated with anomalies or unusual observations. Detecting anomalies is crucial in various domains (fraud detection, fault diagnosis, etc.).
  - Anomaly detection models specifically focus on identifying outliers, either in an unsupervised or semi-supervised manner.

5.2 **Describe different methods for detecting outliers in a dataset in Python**
1. Z-score:
  - Measures how many standard deviations a data point is from the mean.
  - Steps:
     - Calculate the z-score for each data point.
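     - Set a threshold (commonly an absolute z-score of 3).
     - Data points whose z-scores fall beyond this threshold (above 3 or below -3) are considered outliers.
  - The method is itself sensitive to extreme values, because they shift the mean and inflate the standard deviation.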
   ![image.png](attachment:image.png)
   ![image-4.png](attachment:image-4.png)
   
2. Visual Inspection:
  - Box plots, scatter plots, and histograms can visually reveal potential outliers.
  - Look for data points that fall far from the central distribution.
  - Visual inspection is a quick way to identify extreme values, but it doesn’t provide precise quantification.   
  
3. Median Absolute Deviation (MAD):
  - Based on the median rather than the mean.
  - Steps:
     - Calculate the median of the dataset.
     - Compute the absolute deviation of each data point from the median.
     - Take the median of these absolute deviations; this value is the MAD.
     - Set a threshold (e.g., 3 times the MAD), and flag any value whose absolute deviation exceeds it as an outlier.
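
The z-score and MAD methods can be compared on the same data. The sketch below uses a hypothetical array `sample` with one obvious extreme value (50) and the thresholds mentioned above. In such a small sample the extreme value inflates the mean and standard deviation enough that its z-score stays below 3, while the median-based rule still flags it:

```python
import numpy as np

# Hypothetical example data with one extreme value
sample = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 50.0])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (sample - sample.mean()) / sample.std()
z_outliers = sample[np.abs(z) > 3]

# MAD method: flag points whose deviation from the median exceeds 3 * MAD
median = np.median(sample)
abs_dev = np.abs(sample - median)
mad = np.median(abs_dev)
mad_outliers = sample[abs_dev > 3 * mad]

print("z-score outliers:", z_outliers)
print("MAD outliers:", mad_outliers)
```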

5.3 **How can you handle outliers in a continuous numerical variable in Python?**
+ There are different ways to handle outliers in a numerical variable:
  - Trimming: Remove extreme values from the dataset based on a percentile threshold.

In [19]:
import numpy as np

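# Ten values drawn from a normal distribution, with one injected outlier
data = np.random.normal(loc=10, scale=2, size=10)
data[0] = 150 

# Percentile cut-offs for trimming
lower_bound = np.percentile(data, 5)
upper_bound = np.percentile(data, 95)

# Keep only the values between the 5th and 95th percentiles
trim_data = data[(data >= lower_bound) & (data <= upper_bound)]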

print("Original data:", data)
print("Trimmed data:", trim_data)

Original data: [150.           9.11916628  12.31122508  12.97411161   8.75293769
   7.14027685   6.44782812   8.50952307  12.43742008  13.87258812]
Trimmed data: [ 9.11916628 12.31122508 12.97411161  8.75293769  7.14027685  8.50952307
 12.43742008 13.87258812]


As you can see, trimming removed both the largest value (150) and the smallest value (about 6.45).
  - Transformation: Apply mathematical transformations such as log, square root, or Box-Cox transformation to make the data more normally distributed and less affected by outliers.
  - Machine learning algorithms: Some algorithms are inherently robust to outliers, such as tree-based methods like Random Forests and Gradient Boosting.
  - Outlier detection algorithms: Utilizing algorithms like Isolation Forest, Local Outlier Factor (LOF), or One-Class SVM to identify and remove outliers.
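
As an illustration of the transformation approach, the sketch below applies a log transform to a hypothetical array `skewed` of positive values. `np.log1p` computes log(1 + x), and `np.expm1` reverses it; the ratio of the maximum to the median is used here only as a rough indicator of how far the extreme value sits from the bulk of the data:

```python
import numpy as np

# Hypothetical positive-valued data with one extreme value
skewed = np.array([8.0, 9.0, 10.0, 11.0, 12.0, 150.0])

# Log transform compresses large values far more than small ones
log_data = np.log1p(skewed)

# Rough indicator of how far the maximum lies from the typical value
ratio_before = skewed.max() / np.median(skewed)
ratio_after = log_data.max() / np.median(log_data)

print("max/median before:", ratio_before)
print("max/median after:", ratio_after)

# The transform is reversible, so predictions can be mapped back
restored = np.expm1(log_data)
```

Unlike trimming, this keeps every observation; it only reduces how strongly the extreme value influences the scale of the data.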