<a href="https://colab.research.google.com/github/Uzma-Jawed/python-class_work-and-practice/blob/main/27_DataPreprocessingandEncoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

📘 Uzma Jawed

📅 Class Work - August 10


---



### Data Preprocessing and Encoding in Python


---



In this notebook, we explore:
1. Basic Pandas operations
2. Calculating areas using NumPy
3. Encoding categorical variables:
   - Label Encoding
   - Ordinal Encoding
   - One-Hot Encoding
4. Combining encoded data
5. Using `ColumnTransformer` for preprocessing
6. Creating a `Pipeline` for machine learning

These steps are typical preprocessing tasks carried out before training machine learning models.


---




In [11]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline



---


### Part 1: Calculating Circle Areas

We start with a list of circle radii and compute their areas.

Here, NumPy supplies the constant `π` (`np.pi`), and Pandas holds the data in tabular form and applies the squaring operation element-wise to the whole `radius` column.


---



In [12]:
# Create DataFrame with circle radii
data = {"radius": [2, 5, 7, 8, 10]}
data_circle = pd.DataFrame(data)

# Calculate area of each circle
data_circle['area'] = np.pi * data_circle['radius']**2
data_circle

| | radius | area |
|---|---|---|
| 0 | 2 | 12.566371 |
| 1 | 5 | 78.539816 |
| 2 | 7 | 153.93804 |
| 3 | 8 | 201.06193 |
| 4 | 10 | 314.159265 |




---


### Part 2: Label Encoding

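Label Encoding converts categories into integer labels. `LabelEncoder` assigns the codes in sorted (alphabetical) order of the labels.  
For example:
- apple → 0
- banana → 1
- mango → 2

It is simple, but it is **not suitable for nominal feature columns** in many ML algorithms because the integer codes imply an order (mango > banana > apple) that does not exist. scikit-learn's `LabelEncoder` is intended for encoding the target `y`; for nominal input features, One-Hot Encoding (Part 4) is usually the safer choice.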


---



In [13]:
# Sample dataset
data = pd.DataFrame({
    "fruits": ["apple", "mango", "banana", "apple", "mango"]
})

# Apply Label Encoding
le = LabelEncoder()
data['transform_fruits'] = le.fit_transform(data["fruits"])
data

| | fruits | transform_fruits |
|---|---|---|
| 0 | apple | 0 |
| 1 | mango | 2 |
| 2 | banana | 1 |
| 3 | apple | 0 |
| 4 | mango | 2 |

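To see which integer was assigned to which label, the fitted encoder can be inspected. The sketch below rebuilds the same fruit column as a standalone Series (the names `fruits`, `mapping`, and `restored` are illustrative) and reads the mapping from `classes_`, then reverses it with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruits = pd.Series(["apple", "mango", "banana", "apple", "mango"])

le = LabelEncoder()
codes = le.fit_transform(fruits)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```

Because the codes follow alphabetical order, adding a new fruit and refitting can change the code of an existing fruit.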



---


### Part 3: Ordinal Encoding




Ordinal Encoding is used when categories have a **natural order**.  
Example: small < medium < large.

Here, we manually define the order of categories.


---



In [14]:
# Sample dataset
data = pd.DataFrame({
    "size": ["small", "medium", "medium", "large", "small", "large"]
})

# Apply Ordinal Encoding with defined order
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
data['transform_order'] = oe.fit_transform(data[["size"]])

# Check feature names
oe.feature_names_in_

# Inverse transform (convert back to original categories)
data[['inverse_data']] = oe.inverse_transform(data[['transform_order']])
data

| | size | transform_order | inverse_data |
|---|---|---|---|
| 0 | small | 0.0 | small |
| 1 | medium | 1.0 | medium |
| 2 | medium | 1.0 | medium |
| 3 | large | 2.0 | large |
| 4 | small | 0.0 | small |
| 5 | large | 2.0 | large |

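The explicit `categories` list matters: without it, `OrdinalEncoder` sorts the labels alphabetically, which would not match small < medium < large. A minimal sketch comparing the two (the variable names are illustrative, not from the notebook):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "medium", "large", "small", "large"]})

# Without `categories`, the encoder sorts the labels alphabetically
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(sizes[["size"]]).ravel()

# With an explicit list, the codes follow the intended small < medium < large order
ordered_oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered_oe.fit_transform(sizes[["size"]]).ravel()

comparison = sizes.assign(default=default_codes, ordered=ordered_codes)
print(default_oe.categories_[0])
print(comparison)
```

In the `default` column, `large` receives the smallest code and `small` the largest, which reverses the intended ranking.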



---


### Part 4: One-Hot Encoding

One-Hot Encoding creates a separate binary column for each category.  
This is useful for nominal categories where no ordering exists.


---

In [15]:
# Apply One-Hot Encoding
encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[['size']]).toarray()

# Get new column names
cols = encoder.get_feature_names_out()

# Create new DataFrame with encoded data
data_encoded = pd.DataFrame(encoded, columns=cols)
data_encoded

| | size_large | size_medium | size_small |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |




---

### Part 5: Combining Encoded Data

We combine the original dataset with the one-hot encoded columns using `pd.concat()`.


---



In [16]:
# Combine original and encoded DataFrames
correct_data = pd.concat([data, data_encoded], axis=1)
correct_data

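| | size | transform_order | inverse_data | size_large | size_medium | size_small |
|---|---|---|---|---|---|---|
| 0 | small | 0.0 | small | 0.0 | 0.0 | 1.0 |
| 1 | medium | 1.0 | medium | 0.0 | 1.0 | 0.0 |
| 2 | medium | 1.0 | medium | 0.0 | 1.0 | 0.0 |
| 3 | large | 2.0 | large | 1.0 | 0.0 | 0.0 |
| 4 | small | 0.0 | small | 0.0 | 0.0 | 1.0 |
| 5 | large | 2.0 | large | 1.0 | 0.0 | 0.0 |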




---


### Part 6: Using ColumnTransformer

`ColumnTransformer` allows us to apply different preprocessing steps to different columns in a single step.

Here:
- Numeric columns are scaled with `StandardScaler`
- Categorical columns are one-hot encoded

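As a quick check of what "scaled with `StandardScaler`" means, the sketch below computes the z-score by hand for a weight column and compares it with the scaler's output. It assumes the standard formula (value minus column mean, divided by the population standard deviation).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

weights = np.array([[70.0], [80.0], [60.0], [90.0]])

# Manual z-score; np.std defaults to the population standard deviation (ddof=0)
manual = (weights - weights.mean(axis=0)) / weights.std(axis=0)

# StandardScaler applies the same formula column by column
scaled = StandardScaler().fit_transform(weights)

print(np.allclose(manual, scaled))
print(scaled.ravel().round(6))
```

After scaling, each numeric column has mean 0 and unit variance, so columns measured in different units contribute on a comparable scale.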

---



In [17]:
# Sample dataset
df = pd.DataFrame({
    'Weight': [70, 80, 60, 90],
    'Height': [1.75, 1.80, 1.65, 1.82],
    'City': ['Lahore', 'Karachi', 'Islamabad', 'Lahore']
})

# Define numeric and categorical columns
num_cols = ['Weight', 'Height']
cat_cols = ['City']

# Create ColumnTransformer
ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(sparse_output=False), cat_cols)
    ],
    remainder='passthrough'
)

# Fit and transform the data
result = ct.fit_transform(df)

# Get new column names
num_names = num_cols
cat_names = ct.named_transformers_['cat'].get_feature_names_out(cat_cols).tolist()
final_cols = num_names + cat_names

# Create transformed DataFrame
df_transformed = pd.DataFrame(result, columns=final_cols)
df_transformed

| | Weight | Height | City_Islamabad | City_Karachi | City_Lahore |
|---|---|---|---|---|---|
| 0 | -0.447214 | -0.076029 | 0.0 | 0.0 | 1.0 |
| 1 | 0.447214 | 0.684257 | 0.0 | 1.0 | 0.0 |
| 2 | -1.341641 | -1.5966 | 1.0 | 0.0 | 0.0 |
| 3 | 1.341641 | 0.988372 | 0.0 | 0.0 | 1.0 |




---


### Part 7: Machine Learning Pipeline

We combine preprocessing and model training in a single pipeline.  
This ensures that any new data will go through the same preprocessing steps before prediction.


---



In [18]:
# Create full pipeline: preprocessing + model
model_pipeline = Pipeline([
    ('preprocessing', ct),
    ('classifier', LogisticRegression())
])

# Example (commented out because we don't have target variable y):
# model_pipeline.fit(X, y)
# preds = model_pipeline.predict(X)
# print("Predictions:", preds)



---


### Conclusion

In this notebook, we learned:
- How to calculate derived values (circle areas) with Pandas + NumPy
- The differences between Label, Ordinal, and One-Hot Encoding
- How to combine encoded data into a single DataFrame
- How to use `ColumnTransformer` to apply different preprocessing to different columns
- How to integrate preprocessing and model training into a single pipeline


---

### Extra Example: Running the Full Pipeline

Create a small dataset, fit our pipeline, and make predictions.

This will show the complete workflow from raw data → preprocessing → model training → predictions.


---



In [19]:
# Small dataset with target variable
df_example = pd.DataFrame({
    'Weight': [70, 80, 60, 90, 65, 85],
    'Height': [1.75, 1.80, 1.65, 1.82, 1.70, 1.78],
    'City': ['Lahore', 'Karachi', 'Islamabad', 'Lahore', 'Karachi', 'Islamabad'],
    'Sport': ['Football', 'Cricket', 'Cricket', 'Football', 'Cricket', 'Football']
})

# Features and target
X = df_example[['Weight', 'Height', 'City']]
y = df_example['Sport']

# Fit the pipeline
model_pipeline.fit(X, y)

# Make predictions
preds = model_pipeline.predict(X)

# Show results
results_df = X.copy()
results_df['Actual'] = y
results_df['Predicted'] = preds
results_df

| | Weight | Height | City | Actual | Predicted |
|---|---|---|---|---|---|
| 0 | 70 | 1.75 | Lahore | Football | Football |
| 1 | 80 | 1.8 | Karachi | Cricket | Cricket |
| 2 | 60 | 1.65 | Islamabad | Cricket | Cricket |
| 3 | 90 | 1.82 | Lahore | Football | Football |
| 4 | 65 | 1.7 | Karachi | Cricket | Cricket |
| 5 | 85 | 1.78 | Islamabad | Football | Football |




---


### Understanding the Output

The final table shows:

- **Weight**, **Height**, **City** → Original feature values  
- **Actual** → The real sport played by the person (from our dataset)  
- **Predicted** → The sport predicted by our Logistic Regression model after preprocessing

Here the predictions match the actual labels for all six rows. Because the model is evaluated on the same rows it was trained on, and the dataset is small and artificial, this does not show how well the model would generalize to new data.  
What the result does confirm is that the pipeline:
1. Scales numeric features
2. One-hot encodes categorical features
3. Trains a machine learning model
4. Applies the same preprocessing steps automatically when making predictions

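The last point is easiest to see with rows the model has not been trained on. The sketch below rebuilds an equivalent pipeline from scratch so it runs on its own, then predicts for two hypothetical people (`new_people` is invented for illustration). It also sets `handle_unknown="ignore"` on the encoder, which is an assumption added here so that an unseen city would not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "Weight": [70, 80, 60, 90, 65, 85],
    "Height": [1.75, 1.80, 1.65, 1.82, 1.70, 1.78],
    "City": ["Lahore", "Karachi", "Islamabad", "Lahore", "Karachi", "Islamabad"],
    "Sport": ["Football", "Cricket", "Cricket", "Football", "Cricket", "Football"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Weight", "Height"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
])
pipeline = Pipeline([
    ("preprocessing", preprocess),
    ("classifier", LogisticRegression()),
])
pipeline.fit(train[["Weight", "Height", "City"]], train["Sport"])

# Hypothetical new rows, not part of the original dataset
new_people = pd.DataFrame({
    "Weight": [75, 62],
    "Height": [1.77, 1.68],
    "City": ["Karachi", "Lahore"],
})

# The pipeline reuses the scaling and encoding learned during fit
new_preds = pipeline.predict(new_people)
print(new_preds)
```

No manual scaling or encoding of `new_people` is needed: calling `predict` on the raw columns is enough, because the fitted preprocessing is stored inside the pipeline.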

---


