# <h3 align="center">__Module 3 Activity__</h3>
# <h3 align="center">__Assigned at the start of Module 3__</h3>
# <h3 align="center">__Due at the end of Module 3__</h3><br>



# Weekly Discussion Forum Participation

Each week, you are required to participate in the module’s discussion forum. The forum centers on the week's Module Activity, which is released at the beginning of the module. You must complete (or at least attempt) the activity before posting about it or about anything related to the topic.

## Grading of the Discussion

### 1. Initial Post:
Create your thread by **Day 5 (Saturday night at midnight, PST).**

### 2. Responses:
Respond to at least two other posts by **Day 7 (Monday night at midnight, PST).**

---

## Grading Criteria:

Your participation will be graded as follows:

### Full Credit (100 points):
- Submit your initial post by **Day 5.**
- Respond to at least two other posts by **Day 7.**

### Half Credit (50 points):
- If your initial post is late but you respond to two other posts.
- If your initial post is on time but you fail to respond to at least two other posts.

### No Credit (0 points):
- If both your initial post and responses are late.
- If you fail to submit an initial post and do not respond to any others.

---

## Additional Notes:

- **Late Initial Posts:** Late posts will automatically receive half credit if two responses are completed on time.
- **Substance Matters:** Responses must be thoughtful and constructive. Comments like “Great post!” or “I agree!” without further explanation will not earn credit.
- **Balance Participation:** Aim to engage with threads that have fewer or no responses to ensure a balanced discussion.

---

## Avoid:
- Many posts within a very short time frame, especially immediately before the posting deadline.
- Posts that compliment another post and then merely summarize it.


# Module Activity: Building a Preprocessing Pipeline

## Objective
Learn how to build a preprocessing pipeline in scikit-learn and apply it to the famous Iris dataset. Gain hands-on experience in handling missing values, scaling features, and understanding the importance of preprocessing pipelines.

---

## Sample Code for Pipeline Syntax
Here’s an example to help you understand how to create a pipeline. This pipeline imputes missing values using the mean:

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Example dataset with missing values
data = pd.DataFrame({
    'Feature1': [1.0, np.nan, 3.0],
    'Feature2': [np.nan, 2.0, 3.0]
})

# Define a pipeline with an imputer
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# Fit and transform the data
processed_data = pipeline.fit_transform(data)

print("Original Data:")
print(data)
print("\nProcessed Data:")
print(processed_data)
```

# Activity Instructions

## Dataset Preparation
We will use the Iris dataset, randomly remove values to simulate missing data, and keep it in a Pandas DataFrame for you to preprocess.

---

## Your Task
Build a preprocessing pipeline that:
- Imputes missing values using the median.
- Scales features to a `[0, 1]` range using `MinMaxScaler`.
- Adds at least one more preprocessing step of your choice.
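
For intuition about the scaling step, the sketch below applies `MinMaxScaler` to a small made-up column and compares the result with the min-max formula `(x - min) / (max - min)` computed by hand. The toy values are an assumption for illustration only and are not part of the Iris data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature column (made-up values, not from the Iris dataset)
X = np.array([[1.0], [3.0], [5.0]])

# Scale with scikit-learn's MinMaxScaler (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(X)

# The same computation by hand: (x - min) / (max - min), per column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
print(np.allclose(scaled, manual))  # True
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.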

### Reflection
At the end of the activity, answer the following questions:
1. What challenges did you face while handling missing data?
2. Why is it important to use a pipeline for preprocessing?

---

## Dataset Setup
Run the following code to import the Iris dataset and simulate missing data. You will use this dataset for the activity.




```python
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Randomly introduce missing values in random cells
np.random.seed(42)
total_cells = data.size
num_missing = int(0.1 * total_cells)  # 10% of total cells
missing_indices = [(row, col) for row in range(data.shape[0]) for col in range(data.shape[1])]
random_missing_indices = np.random.choice(len(missing_indices), size=num_missing, replace=False)

for index in random_missing_indices:
    row, col = missing_indices[index]
    data.iat[row, col] = np.nan

print("Dataset with Missing Values:")
print(data.head(10))
```


```
Dataset with Missing Values:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                NaN               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                NaN               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
5                5.4               3.9                1.7               0.4
6                NaN               3.4                1.4               0.3
7                5.0               NaN                NaN               0.2
8                4.4               2.9                1.4               0.2
9                4.9               3.1                1.5               0.1
```
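
As a sanity check on the setup, the sketch below shows a more compact way to blank out 10% of the cells and then counts the missing values per column. It is a hypothetical alternative, not the setup code above: it assumes `np.random.default_rng` and `np.unravel_index` in place of the explicit index list, so the cells it removes will not match the output shown above. The name `data_missing` is introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris into a DataFrame, as in the setup code
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Draw 10% of the flat cell positions without replacement
rng = np.random.default_rng(42)  # different generator than np.random.seed(42)
num_missing = int(0.1 * data.size)
flat_idx = rng.choice(data.size, size=num_missing, replace=False)

# Convert flat positions to (row, column) pairs and blank those cells
rows, cols = np.unravel_index(flat_idx, data.shape)
values = data.to_numpy(copy=True)
values[rows, cols] = np.nan
data_missing = pd.DataFrame(values, columns=data.columns)

# Count missing cells per column and in total
missing_per_column = data_missing.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing cells:", total_missing)
```

Because the positions are drawn without replacement, the total number of missing cells equals `num_missing` exactly.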


## Next Steps

1. **Build your pipeline** to preprocess the dataset.
2. **Test your pipeline** by fitting it to the Iris dataset and transforming it.
3. **Review the processed data** and reflect on how the pipeline simplifies your workflow.


## 1. Build Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif

pipeline = Pipeline([
    # handle missing values by replacing them with the column median
    ('imputer', SimpleImputer(strategy='median')),
    # keep the 2 features with the highest mutual information with the target
    ('feature_selector', SelectKBest(mutual_info_classif, k=2)),
    # scale the selected features to the [0, 1] range
    ('scaler', MinMaxScaler()),
])
```

## 2. Test Pipeline

```python
# Fit the pipeline with the data and target variable
data_preprocessed = pipeline.fit_transform(data, iris.target)

# get the names of the selected features
selected_features = pipeline.named_steps['feature_selector'].get_feature_names_out(data.columns)

# convert the preprocessed data to a DataFrame
data_preprocessed = pd.DataFrame(data_preprocessed, columns=selected_features)

# display the first 10 rows of the preprocessed data
print("Preprocessed data (first 10 rows):")
print(data_preprocessed[:10])
```

```
Preprocessed data (first 10 rows):
   petal length (cm)  petal width (cm)
0           0.576271          0.041667
1           0.067797          0.041667
2           0.576271          0.041667
3           0.084746          0.041667
4           0.067797          0.041667
5           0.118644          0.125000
6           0.067797          0.083333
7           0.576271          0.041667
8           0.067797          0.041667
9           0.084746          0.000000
```


## 3. Review Processed Data

```python
print("DataFrame after processing:")
print(data_preprocessed.head(10))
```

```
DataFrame after processing:
   petal length (cm)  petal width (cm)
0           0.576271          0.041667
1           0.067797          0.041667
2           0.576271          0.041667
3           0.084746          0.041667
4           0.067797          0.041667
5           0.118644          0.125000
6           0.067797          0.083333
7           0.576271          0.041667
8           0.067797          0.041667
9           0.084746          0.000000
```


```python
# check for missing values by summing the missing values in each column
print("Missing values (will display 0 if none are found in that column):")
print(data_preprocessed.isnull().sum())

# showcase the min and max values of each column
print("\nMin values:")
print(data_preprocessed.min())
print("\nMax values:")
print(data_preprocessed.max())
```

```
Missing values (will display 0 if none are found in that column):
petal length (cm)    0
petal width (cm)     0
dtype: int64

Min values:
petal length (cm)    0.0
petal width (cm)     0.0
dtype: float64

Max values:
petal length (cm)    1.0
petal width (cm)     1.0
dtype: float64
```


# Reflection
## Challenges Faced
My main difficulty was handling the missing values. I first tried to run feature selection before imputation, but realized this was not the right order: the missing values would cause problems in later steps unless they were handled first.

I also struggled to confirm that the missing values had been fully removed. Adding the `data_preprocessed.isnull().sum()` check solved this.
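
A minimal sketch of that check, using a made-up two-column frame rather than the Iris data (the column names `a` and `b` and the values are placeholders for illustration): it counts missing cells before and after median imputation and prints the medians the imputer learned.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (made-up values)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 10.0],
    "b": [np.nan, 2.0, 4.0, 6.0],
})

# Count missing cells before imputation
missing_before = int(toy.isnull().sum().sum())
print("Missing before:", missing_before)  # Missing before: 2

# Replace each missing cell with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Count missing cells after imputation
missing_after = int(filled.isnull().sum().sum())
print("Missing after:", missing_after)  # Missing after: 0

# Medians learned from the non-missing values of each column
print(imputer.statistics_.tolist())  # [3.0, 4.0]
```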

## Importance of Using Pipeline for Preprocessing
A pipeline keeps preprocessing organized by running the steps automatically in a fixed, predefined order. The data is fit and transformed in a single call, so handling missing values, scaling, and other transformations all happen in one process.
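
One further benefit can be sketched under an assumption the activity itself does not make, namely that the data is split into training and test sets. A fitted pipeline can then reuse the statistics learned from the training rows (medians, minimums, maximums) on new rows. The split parameters and variable names below are illustrative choices, not part of the original work.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Load Iris and hold out 20% of the rows (assumed split, for illustration)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler()),
])

# Learn medians and min/max from the training rows only
X_train_scaled = pipeline.fit_transform(X_train)

# Apply the same learned statistics to the held-out rows, without refitting
X_test_scaled = pipeline.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (120, 4) (30, 4)
```

Since the scaler is fit on the training rows only, the held-out values are not guaranteed to stay inside `[0, 1]`; that is expected and shows that no information from the test rows was used during fitting.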