# Instructions

Dataset Description
The following is a description of the columns in the dataset.

* CRIM: per capita crime rate in the vicinity
* ZN: amount of residential land reserved in the vicinity.
* INDUS: proportion of industrial land reserved nearby (in square kilometers)
* RIVERSIDE: If the boundary faces river side (= 1 if tract bounds river; 0 otherwise)
* POLINDEX: pollution index
* RM: number of rooms in the house.
* AGE: Age of the property in years.
* DIS: weighted distances to the major economic centres (in kilometers)
* HIGHWAYCOUNT: Number of highways within 5 KM of distance.
* TAX: full-value property-tax rate per 1 lac.
* PTRATIO: student-teacher ratio in the vicinity.
* IMM: Immigration index in the vicinity.
* BPL: % of below poverty line population in the vicinity.
* PRICE: Price of the home in lacs, this is the target column.

Note: For numerical type questions, always enter the answer correct up to 3 decimal places without rounding off, unless otherwise stated.
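
"Without rounding off" means truncating rather than rounding. A minimal sketch of the difference, using a hypothetical helper `truncate` (not part of the exam material) and the Q3 value as input:

```python
import math


def truncate(value: float, decimals: int = 3) -> float:
    """Drop every digit beyond `decimals` places without rounding (hypothetical helper)."""
    factor = 10 ** decimals
    return math.trunc(value * factor) / factor


mean_price = 24.355923220694248  # value reported for Q3

print(truncate(mean_price))   # 24.355
print(round(mean_price, 3))   # 24.356
```

Because floats are stored in binary, a value sitting exactly on a decimal boundary may truncate one digit lower than expected, so it is worth eyeballing the raw output as well.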


Dataset Link : [Dataset](https://drive.google.com/file/d/1DRtaP8QnU7SFrhMsdR67TBNuq5aOusUe/view)

# 1 - Import Libs & Data

In [19]:
import pandas as pd
import numpy as np

# 2 - Load Data

In [20]:
df = pd.read_csv("NPPE1_Preprocessing1.csv")
df.head()

Unnamed: 0,CRIM,ZN,INDUS,POLINDEX,RM,AGE,DIS,HIGHWAYCOUNT,TAX,PTRATIO,IMM,BPL,PRICE,RIVERSIDE
0,1.026769,1.429034,7.8513,1.134216,6.0,42.0,5.251911,5,279.201277,20.689586,398.81196,10.461456,22.991633,NO
1,0.848089,0.255543,6.263434,1.245993,7.0,63.0,4.305546,8,307.444529,17.465398,377.153649,11.61969,24.551055,NO
2,10.925905,0.441022,18.32296,2.824833,8.0,-2.0,2.409495,25,666.492973,20.351601,387.061355,19.36607,15.875346,NO
3,0.559027,1.041175,11.11492,0.794952,6.0,9.0,6.898669,4,305.514181,19.787314,391.778647,6.20682,23.007756,NO
4,0.905063,81.167963,3.673369,1.02903,8.0,20.0,10.246463,1,315.91396,17.360439,395.833166,10.827105,21.503177,NO



---

# Questions

## Q1

Which dataset are you using for this exam?

NPPE1_Preprocessing1.csv

NPPE1_Preprocessing2.csv

NPPE1_Preprocessing3.csv

NPPE1_Preprocessing4.csv

In [21]:
"NPPE1_Preprocessing1.csv"

'NPPE1_Preprocessing1.csv'

## Q2

How many samples are there in the dataset?

In [22]:
df.shape[0]

4000

## Q3

What is the average house price (in lacs)?

In [23]:
df['PRICE'].mean()

24.355923220694248

## Q4

How many houses have 5 or more rooms?

When filtering, use the syntax: `df[df[condition]]`

In [24]:
df[df['RM'] >=5].shape[0]

3953

## Q5

What is the average price of the top 10 most expensive houses (in lacs)?

In [30]:
top_10 = df.sort_values(by = "PRICE", ascending = False).head(10)
top_10['PRICE'].mean()

52.36590175716407

## Q6

What is the total number of missing or unknown values in the number of rooms feature?

(Hint: carefully look at the values the feature takes and find out implausible value.)

40

71

99

61

68

None of these

In [38]:
df['RM'].isna().sum() # check for NaN first: not the case here

# check unique values
df['RM'].unique() # contains -1, which can't be a number of rooms

# filter rows where RM is -1; shape[0] gives the number of such rows
df[df['RM']==-1].shape[0]

40

## Q7

What is the total number of missing or unknown values in the age feature?

(Hint: carefully look at the values the feature takes and find out implausible value.)

50

83

74

64

59

None of these

In [43]:
df['AGE'].isna().sum() # check for NaN first: not the case here

# check unique values
df['AGE'].unique() # contains -2, which can't be an age

# filter rows where AGE is -2; shape[0] gives the number of such rows
df[df['AGE'] == -2].shape[0]

50

## Q8

What is the total number of missing or unknown values in the RIVERSIDE feature?

(Hint: carefully look at the values the feature takes and find out implausible value.)

88

101

56

62

80

None of these

In [49]:
# check for NaN first
df['RIVERSIDE'].isna().sum() # not the case here

# check unique values
df['RIVERSIDE'].unique() # 'UNKNOWN' is not a valid riverside value

# filter rows where RIVERSIDE is 'UNKNOWN'
df[df['RIVERSIDE']=="UNKNOWN"].shape[0]

88

In [48]:
df['RIVERSIDE'].unique()

array(['NO', 'UNKNOWN', 'YES'], dtype=object)
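
Q6 to Q8 follow the same pattern: `isna()` finds nothing, so the missing entries are encoded as sentinel values. A minimal sketch of counting them in one pass, on a made-up `toy` frame (the sentinel values -1, -2 and "UNKNOWN" are the ones found above; the data itself is an assumption):

```python
import pandas as pd

# Toy data, assumed for illustration only
toy = pd.DataFrame({
    "RM": [6.0, -1.0, 7.0, 8.0],
    "AGE": [42.0, 63.0, -2.0, -2.0],
    "RIVERSIDE": ["NO", "UNKNOWN", "YES", "NO"],
})

# Sentinel per column, as identified from the unique values above
sentinels = {"RM": -1, "AGE": -2, "RIVERSIDE": "UNKNOWN"}

# Count rows equal to the sentinel in each column
sentinel_counts = {col: int((toy[col] == marker).sum()) for col, marker in sentinels.items()}

print(sentinel_counts)                # {'RM': 1, 'AGE': 2, 'RIVERSIDE': 1}
print(int(toy.isna().sum().sum()))    # 0, so isna() alone would miss all of them
```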

## Q9

How many houses are on riverside and were built within the last 50 years (i.e. a house 50 years old or younger)?

For this question, ignore the rows that have missing values in either riverside feature or age feature.

In [55]:
# How many houses are on riverside and were built within the last 50 years (i.e. a house 50 years old or younger)?
filter_condition = (df['RIVERSIDE']=="YES") & (df['AGE'] <= 50) & (df['AGE']>=0)
df[filter_condition].shape[0]

44

## Q10

How many houses are near to exactly 6, 7 or 8 highways (all three inclusive)?

1211

1174

1234

938

1209

None of these

In [60]:
filter_condition = (df['HIGHWAYCOUNT']>=6) & (df['HIGHWAYCOUNT']<=8)
df[filter_condition].shape[0]


1211
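
The same filter can be written with `isin` or `between`, which may read closer to the wording of the question. A small sketch on a made-up `highways` series (assumed data), checking that all three forms agree:

```python
import pandas as pd

# Toy data, assumed for illustration only
highways = pd.Series([1, 6, 7, 8, 8, 25, 5, 6], name="HIGHWAYCOUNT")

by_range = int(((highways >= 6) & (highways <= 8)).sum())   # form used above
by_isin = int(highways.isin([6, 7, 8]).sum())               # explicit list of values
by_between = int(highways.between(6, 8).sum())              # inclusive on both ends by default

print(by_range, by_isin, by_between)  # 5 5 5
```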

## Q11

Create a column 'CATEGORY' and divide the houses into categories as follows:

* Category 1: house price <10 lacs
* Category 2: 10 lacs <= house price <20 lacs
* Category 3: 20 lacs <= house price <30 lacs
* Category 4: 30 lacs <= house price <40 lacs
* Category 5: house price >=40 lacs

Which category has the highest number of records?

1

2

3

4

5

There is a tie between multiple categories

In [61]:
# create fn
def category(x):
  if x<10:
    return 1
  elif 10 <=x <20:
    return 2
  elif 20 <= x <30:
    return 3
  elif 30 <= x <40:
    return 4
  else:
    return 5

In [63]:
# create the new column by applying the function
df['CATEGORY'] = df['PRICE'].apply(category)

In [69]:
# count records per category; the first row is the largest
df['CATEGORY'].value_counts()

CATEGORY,count
3,2028
2,1158
4,503
5,268
1,43
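
The binning can also be done with `pd.cut` instead of a hand-written function. This is a sketch of an alternative, not the method used above; `right=False` makes each bin closed on the left, matching the `<=` / `<` boundaries in the question. The `prices` values are made up to hit every boundary:

```python
import numpy as np
import pandas as pd

# Toy prices chosen to sit on the category boundaries (assumed data)
prices = pd.Series([9.99, 10.0, 19.99, 20.0, 30.0, 39.99, 40.0, 55.0], name="PRICE")

# Bin edges: [-inf, 10), [10, 20), [20, 30), [30, 40), [40, inf)
bins = [-np.inf, 10, 20, 30, 40, np.inf]
category = pd.cut(prices, bins=bins, labels=[1, 2, 3, 4, 5], right=False)

print(category.tolist())  # [1, 2, 2, 3, 4, 4, 5, 5]
```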


## Q12

**PREPROCESSING**

Divide the data into training and test sets

1. Replace the respective missing or unknown values in features room count, riverside and age with np.nan.
2. Keep 30% of the data as test set.
3. Use random_state as 0
4. PRICE is the target, rest of the columns are the features.
5. Apply train test split.

Hint: look for the documentation of the usual function that divides the data into training and test datasets.

What is the number of samples in the training set?

In [70]:
# Replace the respective missing or unknown values in features room count, riverside and age with np.nan.
df['RM'] = df['RM'].replace(-1, np.nan)
df['RIVERSIDE'] = df['RIVERSIDE'].replace("UNKNOWN", np.nan)
df['AGE'] = df['AGE'].replace(-2, np.nan)

In [71]:
# Define X and y
X = df.drop(columns = "PRICE")
y = df["PRICE"]

In [72]:
# test set and random state
test_size = 0.3
random_state = 0


In [73]:
# import and split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = random_state)

X_train.shape[0]

2800
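
The training size can be sanity-checked by hand: with a float `test_size`, `train_test_split` takes the ceiling of `test_size * n_samples` as the test count and gives the rest to training. A minimal sketch on placeholder arrays of the same length as the dataset (`X_toy` and `y_toy` are assumptions, not the real features):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 4000  # number of rows found in Q2

# Placeholder arrays; only their length matters here
X_toy = np.arange(n_samples).reshape(-1, 1)
y_toy = np.arange(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Hand calculation of the expected sizes
expected_test = math.ceil(0.3 * n_samples)
expected_train = n_samples - expected_test

print(len(X_tr), len(X_te))            # 2800 1200
print(expected_train, expected_test)   # 2800 1200
```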

## Q13 (most time consuming - took 10 min to solve)

Apply following preprocessing steps:

1. Drop CATEGORY column
2. CRIM: min max scaling
3. ZN: min max scaling
4. INDUS: standard scaling
5. POLINDEX: min max scaling
6. DIS: min max scaling
7. HIGHWAYCOUNT: min max scaling
8. TAX: min max scaling
9. PTRATIO: min max scaling
10. IMM: min max scaling
11. BPL: min max scaling
12. RM: impute with median then min max scaling
13. AGE: impute with mean then min max scaling
14. RIVERSIDE: Impute with most frequent value then one hot encode.

NOTE:
1. Make sure to preprocess the features in exactly the above order. The answer to Q.16 depends on the correct order of feature processing.
2. You may have to use multiple instances of a transformer for this question.


How many features are there after performing above transformation?

In [74]:
# LOAD LIBS INCLUDING PIPELINE AND COLUMN TRANSFORMER
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer


In [75]:
# drop the category column
df.drop(columns = "CATEGORY", inplace = True)

In [79]:
# instantiate the min max scaler, the simple imputer and others
min_max_scaler = MinMaxScaler()
simple_imputer = SimpleImputer()

In [80]:
# set up separate pipelines for combined processing
mean_impute_min_max = Pipeline([
    ("mean_imputer", SimpleImputer(strategy = "mean")),
    ("min_max_scaler", min_max_scaler)
])

median_impute_min_max = Pipeline([
    ("imputer", SimpleImputer(strategy = "median")),
    ("min_max_scaler", min_max_scaler)
])

most_frequent_impute_one_hot = Pipeline([
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("one_hot_encoder", OneHotEncoder())
])


In [81]:
# define column transformer
processing_pipeline = ColumnTransformer([
    ("min_max_scaling", min_max_scaler, ['CRIM', 'ZN', 'POLINDEX', 'DIS', 'HIGHWAYCOUNT', 'TAX', 'PTRATIO', 'IMM', 'BPL']),
    ("standard_scaling", StandardScaler(), ['INDUS']),
    ("median_impute_min_max", median_impute_min_max, ['RM']),
    ("mean_impute_min_max", mean_impute_min_max, ['AGE']),
    ("most_frequent_impute_one_hot", most_frequent_impute_one_hot, ['RIVERSIDE'])
], remainder = "passthrough")

In [82]:
# see pipeline
processing_pipeline

In [83]:
# fit the transformer
df_processed = processing_pipeline.fit_transform(df)

In [87]:
# get no of features
df_processed.shape[1]

15
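
Note that the cell above fits on `df`, which still contains PRICE, and `remainder = "passthrough"` keeps that column in the output. One possible variant, sketched below under stated assumptions, fits on the training features only, gives every step its own transformer instance, and drops any column not listed. The frames `toy_train` and `toy_test` are made up and use only a subset of the columns, so the feature count here is not the exam answer:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy frames, assumed for illustration only
toy_train = pd.DataFrame({
    "CRIM": [0.5, 1.0, 10.0, 0.8],
    "INDUS": [7.8, 6.2, 18.3, 11.1],
    "RM": [6.0, np.nan, 8.0, 6.0],
    "AGE": [42.0, 63.0, np.nan, 9.0],
    "RIVERSIDE": ["NO", "YES", np.nan, "NO"],
})
toy_test = pd.DataFrame({
    "CRIM": [2.0], "INDUS": [9.0], "RM": [np.nan], "AGE": [30.0], "RIVERSIDE": ["YES"],
})

# One entry per feature, in the order the output columns should appear
preprocessor = ColumnTransformer(
    [
        ("crim", MinMaxScaler(), ["CRIM"]),
        ("indus", StandardScaler(), ["INDUS"]),
        ("rm", Pipeline([("impute", SimpleImputer(strategy="median")),
                         ("scale", MinMaxScaler())]), ["RM"]),
        ("age", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", MinMaxScaler())]), ["AGE"]),
        ("riverside", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                                ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["RIVERSIDE"]),
    ],
    remainder="drop",      # unlisted columns (e.g. a target or CATEGORY) are left out
    sparse_threshold=0,    # always return a dense array
)

train_out = preprocessor.fit_transform(toy_train)  # statistics learned from training rows only
test_out = preprocessor.transform(toy_test)        # test rows reuse those statistics

print(train_out.shape, test_out.shape)  # (4, 6) (1, 6)
```

RIVERSIDE expands into two one-hot columns (NO and YES), which is why five input columns become six output features in this toy case.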

## Q14

What is the mean of the transformed test data (features only)?
Note : Compute the mean of the whole feature matrix i.e. mean of all values in the transformed test feature matrix


In [92]:
# apply the pipeline to test data
X_test_processed = processing_pipeline.transform(X_test)

In [94]:
X_test_processed.mean()

0.5546784035227021