In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Step 2: Load the dataset
# (the file is whitespace-delimited with no header row, so we pass a regex separator and our own column names)
col_names = [
    "CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX",
    "PTRATIO","B","LSTAT","MEDV"  # MEDV is the target (house price)
]

df = pd.read_csv("/content/4) house Prediction Data Set.csv", sep=r"\s+", header=None, names=col_names, engine="python")

print("Dataset shape:", df.shape)
print(df.head())


Dataset shape: (506, 14)
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0   
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0   
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0   
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0   
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0   

   PTRATIO       B  LSTAT  MEDV  
0     15.3  396.90   4.98  24.0  
1     17.8  396.90   9.14  21.6  
2     17.8  392.83   4.03  34.7  
3     18.7  394.63   2.94  33.4  
4     18.7  396.90   5.33  36.2  
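
To see what the regex separator does, here is a minimal sketch that parses two rows in the same layout from an in-memory string. The values are copied from the first two rows printed above; the exact spacing of the raw file is not shown, so the spacing in `sample` is an assumption, and `demo` is a name used only for this illustration.

```python
import io
import pandas as pd

col_names = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX",
    "PTRATIO", "B", "LSTAT", "MEDV",
]

# Two rows rebuilt from the printed head(); the irregular spacing is assumed.
sample = (
    "0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00\n"
    "0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60\n"
)

# r"\s+" treats any run of whitespace as a single delimiter.
demo = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=col_names)
print(demo.shape)
print(demo[["CRIM", "RAD", "MEDV"]])
```

Each run of spaces counts as one delimiter, so both rows land in the 14 named columns regardless of how many spaces separate the values.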


In [None]:
# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Basic stats for numeric columns
print("\nDescriptive statistics:")
print(df.describe())



Missing values per column:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

Descriptive statistics:
             CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000   

              AGE     

In [None]:
# Target = MEDV (median house value)
X = df.drop(columns=["MEDV"])
y = df["MEDV"]

# Split into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)


Train shape: (404, 13) Test shape: (102, 13)
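
The 404/102 split is not exactly 80/20 of 506 rows because the test count has to be a whole number. A small arithmetic sketch, assuming a fractional `test_size` is rounded up and the remaining rows go to training:

```python
import math

n_samples = 506
test_size = 0.2

# Assumption: the fractional test size is rounded up to a whole number of rows,
# and every remaining row is assigned to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the counts agree with the shapes printed above.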


In [None]:
# CHAS (river adjacency) and RAD (highway access index) are categorical
categorical_cols = ["CHAS", "RAD"]
numeric_cols = [col for col in X.columns if col not in categorical_cols]

print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)


Categorical: ['CHAS', 'RAD']
Numeric: ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
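
Treating CHAS and RAD as categorical is a modeling choice. One quick way to support it is to count distinct values per column. The sketch below uses only the five rows printed by `head()` earlier, so the counts are illustrative; on the full DataFrame the same call would be `df.nunique()`. The name `head_sample` is introduced here for illustration.

```python
import pandas as pd

# First five rows of three columns, copied from the head() output.
head_sample = pd.DataFrame({
    "CHAS": [0, 0, 0, 0, 0],
    "RAD": [1, 2, 2, 3, 3],
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
})

# Columns with few distinct values relative to the row count are
# candidates for categorical treatment; continuous ones rarely repeat.
distinct_counts = head_sample.nunique()
print(distinct_counts)
```

RM never repeats in this sample, while CHAS and RAD do, which is the pattern one would expect from a flag and an index.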


In [None]:
# Step 5: Preprocessing pipeline (handles the OneHotEncoder API change across scikit-learn versions)

# First, check your sklearn version
import sklearn
print("scikit-learn version:", sklearn.__version__)

# The dense-output flag is `sparse_output` in scikit-learn >= 1.2.
# Older releases only know `sparse`, so they raise TypeError on the new name.
try:
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)

# One-hot encode the categorical columns and standardize the numeric ones
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", encoder, categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ]
)

# Fit on training data
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)

print("Processed train shape:", X_train_proc.shape)
print("Processed test shape:", X_test_proc.shape)


scikit-learn version: 1.6.1
Processed train shape: (404, 22)
Processed test shape: (102, 22)
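
The cell above fits the preprocessor on the training split and only transforms the test split. A minimal sketch of that fit/transform separation for a single numeric column follows. The numbers are made up and not taken from the dataset, and the sketch assumes standardization with the population standard deviation (ddof=0).

```python
from statistics import fmean, pstdev

# Hypothetical values for one numeric column (not taken from the dataset).
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

# "Fit": learn the mean and standard deviation from the training values only.
mean = fmean(train_values)
std = pstdev(train_values)  # population standard deviation (ddof=0)


def standardize(values):
    """Scale values using the statistics learned from the training data."""
    return [(v - mean) / std for v in values]


# "Transform": apply the same training statistics to both splits.
train_scaled = standardize(train_values)
test_scaled = standardize(test_values)

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally do not, because they are scaled with statistics they did not contribute to.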


In [None]:
# Get feature names after preprocessing
ohe = preprocessor.named_transformers_["cat"]
ohe_features = ohe.get_feature_names_out(categorical_cols)
all_features = np.concatenate([ohe_features, numeric_cols])

print("\nFeature names after preprocessing:")
print(all_features)



Feature names after preprocessing:
['CHAS_0' 'CHAS_1' 'RAD_1' 'RAD_2' 'RAD_3' 'RAD_4' 'RAD_5' 'RAD_6' 'RAD_7'
 'RAD_8' 'RAD_24' 'CRIM' 'ZN' 'INDUS' 'NOX' 'RM' 'AGE' 'DIS' 'TAX'
 'PTRATIO' 'B' 'LSTAT']
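
The 22 columns break down as 2 CHAS levels, 9 RAD levels, and 11 numeric columns. The sketch below rebuilds the name list from the category levels visible in the printed output, as a sanity check on that count; `chas_levels` and `rad_levels` are names introduced here for illustration.

```python
# Category levels and numeric columns as they appear in the printed feature names.
chas_levels = [0, 1]
rad_levels = [1, 2, 3, 4, 5, 6, 7, 8, 24]
numeric_cols = [
    "CRIM", "ZN", "INDUS", "NOX", "RM", "AGE", "DIS", "TAX",
    "PTRATIO", "B", "LSTAT",
]

# One-hot columns come first, then the scaled numeric columns,
# matching the order of the transformers ("cat" before "num").
feature_names = (
    [f"CHAS_{level}" for level in chas_levels]
    + [f"RAD_{level}" for level in rad_levels]
    + numeric_cols
)

print(len(chas_levels), len(rad_levels), len(numeric_cols))
print(len(feature_names))
```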


📝 Task 1: Data Preprocessing — Summary

In this task, we prepared the House Prediction dataset for machine learning.

Steps we performed:

1. Loaded & fixed parsing
   - The dataset was whitespace-delimited, so we loaded it with a whitespace separator and assigned column names.
   - Final dataset shape: 506 rows × 14 columns.

2. Data audit
   - Checked for missing values: none found.
   - Inspected summary statistics of the numeric columns.

3. Feature/target split
   - Features: 13 columns (crime rate, rooms, tax, etc.).
   - Target: MEDV (median value of owner-occupied homes, in $1000s).

4. Categorical vs. numeric variables
   - Categorical: CHAS (river adjacency), RAD (highway access index).
   - Numeric: all other features (e.g., CRIM, RM, TAX, LSTAT).

5. Preprocessing
   - Split the data into training (80%) and testing (20%) sets.
   - Applied one-hot encoding to the categorical variables (CHAS, RAD).
   - Applied standard scaling to the numeric variables.
   - Fitted both transformers on the training set only, then applied them to the test set.

6. Final processed dataset
   - The original 13 features expanded into 22 features after preprocessing.
   - Training shape: (404, 22)
   - Testing shape: (102, 22)



Key Findings:

No missing values were present, so no imputation was needed.

Preprocessing combined one-hot encoding of the two categorical columns with standard scaling of the 11 numeric columns.

The final feature space is well-structured and ready for machine learning models.