# Problem Statement:-

1. A construction firm wants to develop a suburban locality with new infrastructure, but it risks losses if the properties do not sell. To reduce that risk, the firm has asked an analytics firm for insights into how densely the area is populated and the income levels of its residents. Apply the Support Vector Machines (SVM) algorithm to the given dataset, draw out insights, and comment on the viability of investing in the area.

# 🔷 Business Objective:-

To assist the construction firm in making informed investment decisions by predicting the viability of infrastructure development in a suburban locality. This will be done by analyzing population density and income levels using the Support Vector Machines (SVM) algorithm. The ultimate goal is to identify whether the locality is suitable for profitable property development and sales.



# 🔶 Business Constraints:-

1. Limited Time – Analysis must be completed quickly to align with development plans.
2. Data Quality – Predictions depend on accurate and complete data.
3. Budget Limits – Must stay within the firm's analytics budget.
4. Model Interpretability – Results should be easy to understand for business stakeholders.
5. Compliance – Must follow data privacy and ethical standards.



In [1]:
import pandas as pd
import numpy as np

In [2]:
train_df=pd.read_csv("SalaryData_Train.csv")
train_df.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
train_df.shape

(30161, 14)

# Data Exploration:-

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30161 entries, 0 to 30160
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   age            30161 non-null  int64 
 1   workclass      30161 non-null  object
 2   education      30161 non-null  object
 3   educationno    30161 non-null  int64 
 4   maritalstatus  30161 non-null  object
 5   occupation     30161 non-null  object
 6   relationship   30161 non-null  object
 7   race           30161 non-null  object
 8   sex            30161 non-null  object
 9   capitalgain    30161 non-null  int64 
 10  capitalloss    30161 non-null  int64 
 11  hoursperweek   30161 non-null  int64 
 12  native         30161 non-null  object
 13  Salary         30161 non-null  object
dtypes: int64(5), object(9)
memory usage: 3.2+ MB


In [5]:
train_df.describe()

Unnamed: 0,age,educationno,capitalgain,capitalloss,hoursperweek
count,30161.0,30161.0,30161.0,30161.0,30161.0
mean,38.438115,10.121316,1092.044064,88.302311,40.931269
std,13.13483,2.550037,7406.466611,404.121321,11.980182
min,17.0,1.0,0.0,0.0,1.0
25%,28.0,9.0,0.0,0.0,40.0
50%,37.0,10.0,0.0,0.0,40.0
75%,47.0,13.0,0.0,0.0,45.0
max,90.0,16.0,99999.0,4356.0,99.0


In [6]:
train_df.isnull().sum()

age              0
workclass        0
education        0
educationno      0
maritalstatus    0
occupation       0
relationship     0
race             0
sex              0
capitalgain      0
capitalloss      0
hoursperweek     0
native           0
Salary           0
dtype: int64

In [7]:
train_df.columns

Index(['age', 'workclass', 'education', 'educationno', 'maritalstatus',
       'occupation', 'relationship', 'race', 'sex', 'capitalgain',
       'capitalloss', 'hoursperweek', 'native', 'Salary'],
      dtype='object')

# Data Preprocessing:-

In [8]:
test_df=pd.read_csv("SalaryData_Test.csv")
test_df.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,34,Private,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K


In [9]:
# Encode categorical features (Label or One-Hot Encoding)

In [10]:
from sklearn.preprocessing import LabelEncoder

In [11]:
# Create copies of the datasets to preprocess
train_preprocessed = train_df.copy()
test_preprocessed = test_df.copy()

In [12]:
# Identify categorical columns
categorical_columns = train_preprocessed.select_dtypes(include=['object']).columns
categorical_columns

Index(['workclass', 'education', 'maritalstatus', 'occupation', 'relationship',
       'race', 'sex', 'native', 'Salary'],
      dtype='object')

In [13]:
# Initialize label encoder
le = LabelEncoder()
le


In [14]:
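# Label-encode every categorical column in both train and test sets.
# Each column gets its own encoder, fitted on the training data only,
# so the mapping is shared by both sets and can be inverted later.
encoders = {}
for col in categorical_columns:
    encoders[col] = LabelEncoder()
    train_preprocessed[col] = encoders[col].fit_transform(train_preprocessed[col])
    test_preprocessed[col] = encoders[col].transform(test_preprocessed[col])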
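
Label encoding assigns arbitrary integer ranks to nominal columns such as `workclass` or `native`, and an SVM treats those integers as distances. The comment in cell 9 mentions one-hot encoding as the other option; the sketch below shows what that could look like. The two toy frames are illustrative stand-ins, not the real dataset, and `handle_unknown="ignore"` is an assumed choice for categories that appear only in the test set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative frames standing in for train_df / test_df (not the real data).
toy_train = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "sex": ["Male", "Male", "Female"],
})
toy_test = pd.DataFrame({
    "age": [25, 44],
    "workclass": ["Private", "Local-gov"],  # "Local-gov" is unseen in toy_train
    "sex": ["Male", "Female"],
})

nominal_cols = ["workclass", "sex"]

# One-hot encode the nominal columns and pass numeric columns through unchanged.
# handle_unknown="ignore" encodes unseen categories as all zeros instead of failing.
encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_cols)],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

X_toy_train = encoder.fit_transform(toy_train)  # fit on training data only
X_toy_test = encoder.transform(toy_test)        # reuse the fitted mapping

print(X_toy_train.shape, X_toy_test.shape)
```

The encoder is fitted on the training frame only, mirroring the fit/transform split used with `LabelEncoder` above.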


In [15]:

# Check the first few rows after encoding
train_preprocessed.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,5,9,13,4,0,1,4,1,2174,0,40,37,0
1,50,4,9,13,2,3,0,4,1,0,0,13,37,0
2,38,2,11,9,0,5,1,4,1,0,0,40,37,0
3,53,2,1,7,2,5,0,2,1,0,0,40,37,0
4,28,2,9,13,2,9,5,2,0,0,0,40,4,0


# Train SVM Model:-

In [16]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [17]:
# Separate features and target
X_train = train_preprocessed.drop("Salary", axis=1)
y_train = train_preprocessed["Salary"]


In [18]:
X_test = test_preprocessed.drop("Salary", axis=1)
y_test = test_preprocessed["Salary"]


In [19]:
# Initialize SVM with the RBF kernel (the scikit-learn default for SVC)
svm_model = SVC(kernel='rbf')
svm_model

In [20]:
# Train the model
svm_model.fit(X_train,y_train)
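
The RBF kernel is distance-based, so a feature with a large numeric range (here `capitalgain`, which reaches 99999) can dominate features such as `educationno` (1 to 16). A common remedy is to standardize features before fitting. The sketch below shows one way to do that with a pipeline; it runs on synthetic data, and the variable names are illustrative rather than part of the notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration): one informative feature with a
# small range and one uninformative feature with a very large range.
rng = np.random.default_rng(0)
n_samples = 200
small_feature = rng.normal(size=n_samples)
large_feature = rng.normal(scale=10_000, size=n_samples)
X_demo = np.column_stack([small_feature, large_feature])
y_demo = (small_feature > 0).astype(int)

# The scaler is fitted inside the pipeline, so it only ever sees the data
# passed to fit() and the same scaling is applied again at predict time.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_demo, y_demo)

demo_pred = scaled_svm.predict(X_demo)
print(scaled_svm.score(X_demo, y_demo))
```

With the notebook's variables, the same pipeline could be fitted with `scaled_svm.fit(X_train, y_train)` and evaluated on `X_test`; whether it improves the scores reported here would need to be checked by running it.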

In [21]:
# Predict on test data
y_pred=svm_model.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [22]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)


In [23]:
print(accuracy)
print(conf_matrix)
print(class_report)

0.7964143426294821
[[10997   363]
 [ 2703   997]]
              precision    recall  f1-score   support

           0       0.80      0.97      0.88     11360
           1       0.73      0.27      0.39      3700

    accuracy                           0.80     15060
   macro avg       0.77      0.62      0.64     15060
weighted avg       0.79      0.80      0.76     15060
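
The headline accuracy hides how the model treats the two classes. The arithmetic below recomputes the class 1 (`>50K`) precision and recall directly from the confusion matrix reported above and compares the accuracy with a majority-class baseline, that is, always predicting class 0. Only the four counts from the output are used; nothing is refitted.

```python
import numpy as np

# Confusion matrix copied from the RBF model's output above
# (rows = true class, columns = predicted class).
conf_matrix = np.array([[10997, 363],
                        [2703, 997]])

tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of predicted >50K, how many were correct
recall_1 = tp / (tp + fn)      # of actual >50K, how many were found
baseline = (tn + fp) / total   # accuracy of always predicting class 0

print(f"accuracy    = {accuracy:.3f}")
print(f"precision_1 = {precision_1:.3f}")
print(f"recall_1    = {recall_1:.3f}")
print(f"baseline    = {baseline:.3f}")
```

The model finds only about 27% of the higher-income residents, and its accuracy of about 0.796 is roughly four points above the 0.754 obtained by always predicting `<=50K`. For an investment decision that depends on identifying higher-income residents, recall on class 1 is the figure to watch.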

In [24]:
# Try SVM with a linear kernel
svm_linear = SVC(kernel='linear')

In [None]:
svm_linear.fit(X_train, y_train)

In [None]:
# Predict and evaluate
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)
conf_matrix_linear = confusion_matrix(y_test, y_pred_linear)
class_report_linear = classification_report(y_test, y_pred_linear)

print(accuracy_linear)
print(conf_matrix_linear)
print(class_report_linear)
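
The linear-kernel cells carry no execution count or output. Training time for `SVC` grows at least quadratically with the number of samples, so a linear kernel on roughly 30,000 unscaled rows can take a long time. One possible alternative is `LinearSVC`, which is designed for larger datasets. It is not an identical model to `SVC(kernel='linear')`: it uses a different solver and, by default, a squared hinge loss. The sketch below runs on synthetic data, and `C` and `max_iter` are assumed values, not tuned ones.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in with 13 features, matching the number of predictors
# in the notebook; the values themselves are unrelated to the real data.
X_demo, y_demo = make_classification(n_samples=500, n_features=13, random_state=0)

# Scaling helps the linear solver converge; C and max_iter are illustrative.
linear_model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
linear_model.fit(X_demo, y_demo)

demo_accuracy = linear_model.score(X_demo, y_demo)
print(demo_accuracy)
```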


In [None]:
# Try SVM with a polynomial kernel
svm_poly = SVC(kernel='poly', degree=3)  # default degree is 3
svm_poly.fit(X_train, y_train)



In [None]:
# Predict and evaluate
y_pred_poly = svm_poly.predict(X_test)
accuracy_poly = accuracy_score(y_test, y_pred_poly)
conf_matrix_poly = confusion_matrix(y_test, y_pred_poly)
class_report_poly = classification_report(y_test, y_pred_poly)

print(accuracy_poly)
print(conf_matrix_poly)
print(class_report_poly)
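
The three kernels above are fitted and evaluated in separate, near-identical cells. A compact way to compare them is a loop with cross-validation, sketched below on synthetic data. The F1 scoring, the three folds, and the sample size are assumptions chosen to keep the example fast; on the real data a subsample of `X_train` could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the salary dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

kernel_scores = {}
for kernel in ("rbf", "linear", "poly"):
    # Same preprocessing for every kernel so the comparison is like for like.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="f1")
    kernel_scores[kernel] = fold_scores.mean()

for kernel, score in kernel_scores.items():
    print(f"{kernel:>6}: mean F1 = {score:.3f}")
```

Scoring on F1 rather than accuracy keeps the comparison sensitive to the minority class, which matters given the low class 1 recall of the RBF model.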
