
# **CASE STUDY ON REGRESSION**

**DATASET: `car_age_price.csv`**

---


### **IMPORTING NECESSARY LIBRARIES**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### **LOAD AND DISPLAY THE DATA**

In [3]:
data=pd.read_csv(r"/content/drive/MyDrive/car_age_price.csv")
data

| | Year | Price |
|---|---|---|
| 0 | 2018 | 465000 |
| 1 | 2019 | 755000 |
| 2 | 2019 | 700000 |
| 3 | 2018 | 465000 |
| 4 | 2018 | 465000 |
| ... | ... | ... |
| 107 | 2016 | 375000 |
| 108 | 2014 | 300000 |
| 109 | 2015 | 425000 |
| 110 | 2016 | 420000 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112 entries, 0 to 111
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Year    112 non-null    int64
 1   Price   112 non-null    int64
dtypes: int64(2)
memory usage: 1.9 KB


In [5]:
data.describe()

| | Year | Price |
|---|---|---|
| count | 112.0 | 112.0 |
| mean | 2016.669643 | 483866.044643 |
| std | 1.629616 | 91217.450533 |
| min | 2013.0 | 300000.0 |
| 25% | 2015.0 | 423750.0 |
| 50% | 2017.0 | 500000.0 |
| 75% | 2017.0 | 550000.0 |
| max | 2020.0 | 755000.0 |


In [6]:
data.isnull().sum()

Year     0
Price    0
dtype: int64

The dataset contains no null or missing values.

### **CHECKING OUTLIERS USING Z-SCORE**

In [7]:
# Set a threshold for identifying outliers (|z-score| greater than 3)
threshold = 3

# Run the same check on each column
for column in ['Price', 'Year']:
    # z-score: distance from the column mean in units of the sample standard deviation
    z_scores = np.abs((data[column] - data[column].mean()) / data[column].std())

    # Identify and print the outliers
    outliers = data[z_scores > threshold]
    print("Outliers:")
    print(outliers)

Outliers:
Empty DataFrame
Columns: [Year, Price]
Index: []
Outliers:
Empty DataFrame
Columns: [Year, Price]
Index: []


No outliers are flagged in either column at a z-score threshold of 3.
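To see what the z-score check looks like when it does flag something, here is a minimal sketch on made-up prices. The helper name `zscore_outliers` and the toy values are illustrative assumptions, not part of the notebook or the dataset.

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the entries whose absolute z-score exceeds the threshold."""
    # Series.std() is the sample standard deviation (ddof=1), as in the cell above
    z = (series - series.mean()).abs() / series.std()
    return series[z > threshold]

# Hypothetical prices: 20 evenly spaced values plus one planted extreme value
toy_prices = pd.Series([300_000 + 5_000 * i for i in range(20)] + [5_000_000])

flagged = zscore_outliers(toy_prices)
print(flagged)
```

Only the planted value is returned here. Note that a single extreme value also inflates the standard deviation, so in small samples the z-score test can be conservative.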

### **PREPROCESS THE DATA**

In [8]:
# Preprocess the data
X = data[['Year']]  # Features (Year)
y = data['Price']   # Target variable (Price)

### **SPLIT THE DATA**

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
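As a quick sanity check on the split sizes, the sketch below applies the same `train_test_split` arguments to a synthetic stand-in with 112 rows. The random values are an assumption for illustration; only the row count mirrors the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (112)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Year": rng.integers(2013, 2021, size=112),
    "Price": rng.integers(300_000, 755_001, size=112),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["Year"]], toy["Price"], test_size=0.2, random_state=42
)

# The test size is rounded up: ceil(0.2 * 112) = 23, leaving 89 rows for training
print(len(X_tr), len(X_te))  # prints: 89 23
```

A test set of 23 rows is small, so metrics computed on it can vary noticeably with the choice of `random_state`.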

### **MODEL TRAINING**

In [10]:
# Train the linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [11]:
# Train the Lasso regression model
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
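With a single feature and `alpha=0.1`, the Lasso penalty is tiny compared with the scale of the prices, so the fitted slope should be almost the same as the ordinary least squares slope. The sketch below illustrates this on synthetic data; the slope, intercept and noise level are assumptions chosen only to resemble the price scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data (not the real dataset): price rises by about 47,600 per model year
rng = np.random.default_rng(0)
years = rng.integers(2013, 2021, size=112).reshape(-1, 1)
prices = 315_000.0 + 47_600.0 * (years.ravel() - 2013) + rng.normal(0, 60_000, size=112)

ols = LinearRegression().fit(years, prices)
lasso = Lasso(alpha=0.1).fit(years, prices)

# With one feature, the L1 penalty shrinks the slope by roughly alpha / Var(Year)
shrinkage = ols.coef_[0] - lasso.coef_[0]
print(ols.coef_[0], lasso.coef_[0], shrinkage)
```

A shrinkage of well under one currency unit on a slope of tens of thousands is why the two models end up with almost identical predictions; a much larger `alpha` would be needed for Lasso to behave differently here.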

### **PREDICTION**

In [12]:
# Predict the prices for the test set using both models
linear_predictions = linear_model.predict(X_test)
lasso_predictions = lasso_model.predict(X_test)

print("linear predictions: ",linear_predictions)
print("\n lasso predictions: ",lasso_predictions)

linear predictions:  [600775.91252081 505558.77690466 553167.34471273 553167.34471273
 553167.34471273 410341.6412885  505558.77690466 553167.34471273
 600775.91252081 600775.91252081 315124.50567235 505558.77690466
 410341.6412885  648384.48032889 553167.34471273 600775.91252081
 315124.50567235 410341.6412885  505558.77690466 505558.77690466
 505558.77690466 505558.77690466 505558.77690466]

 lasso predictions:  [600775.81201603 505558.75884159 553167.28542881 553167.28542881
 553167.28542881 410341.70566715 505558.75884159 553167.28542881
 600775.81201603 600775.81201603 315124.65249272 505558.75884159
 410341.70566715 648384.33860324 553167.28542881 600775.81201603
 315124.65249272 410341.70566715 505558.75884159 505558.75884159
 505558.75884159 505558.75884159 505558.75884159]


In [13]:
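# Predict the price of a 2022 model second-hand Grand i10
# Pass a DataFrame with the same column name used in training to avoid a feature-name warning
year_2022 = pd.DataFrame({'Year': [2022]})
predicted_price_linear = linear_model.predict(year_2022)
predicted_price_lasso = lasso_model.predict(year_2022)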

print("predicted_price_linear: ",predicted_price_linear)
print("\n predicted_price_lasso: ",predicted_price_lasso)


predicted_price_linear:  [743601.61594504]

 predicted_price_lasso:  [743601.39177768]
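The 2022 estimate can be related back to the test-set predictions with a little arithmetic. The sketch below copies values from the printed linear-regression output and assumes that the two prediction levels 505558.78 and 553167.34 correspond to consecutive model years; the variable names are illustrative.

```python
# Values copied from the printed linear-regression predictions above
pred_a = 505558.77690466
pred_b = 553167.34471273
pred_max_test = 648384.48032889   # largest prediction in the test set
pred_2022 = 743601.61594504

# Assumption: pred_a and pred_b are one model year apart, so this is the fitted slope
slope_per_year = pred_b - pred_a

# How many slope-steps beyond the largest test-set prediction is the 2022 estimate?
steps_beyond = (pred_2022 - pred_max_test) / slope_per_year
print(round(slope_per_year, 2), round(steps_beyond, 6))
```

Under that assumption, the model adds about 47,600 per model year, and the 2022 estimate sits two such steps above the highest prediction in the test set, which is consistent with the data ending at Year 2020.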




### **MODEL EVALUATION**

In [14]:
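# Evaluate the models
# RMSE is taken as the square root of the MSE; the `squared=False` argument
# is deprecated in recent scikit-learn releases
linear_rmse = np.sqrt(mean_squared_error(y_test, linear_predictions))
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_predictions))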
linear_r2 = r2_score(y_test, linear_predictions)
lasso_r2 = r2_score(y_test, lasso_predictions)
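For reference, both metrics can be computed by hand from the residuals. The sketch below uses made-up values (not taken from the dataset) and cross-checks the manual results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values for illustration only
y_true = np.array([400_000.0, 500_000.0, 600_000.0, 450_000.0, 550_000.0])
y_pred = np.array([430_000.0, 480_000.0, 560_000.0, 470_000.0, 520_000.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the same units as the target
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Cross-check against scikit-learn
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

An R^2 of 0 corresponds to always predicting the mean of the test targets, so the score measures how much better the model does than that baseline.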

### **RESULTS**

In [15]:
# Print the results
print("Linear Regression RMSE:", linear_rmse)
print("Lasso Regression RMSE:", lasso_rmse)
print("Linear Regression R^2 Score:", linear_r2)
print("Lasso Regression R^2 Score:", lasso_r2)
print("Predicted price of a 2022 model using Linear Regression:", predicted_price_linear)
print("Predicted price of a 2022 model using Lasso Regression:", predicted_price_lasso)

Linear Regression RMSE: 65779.22359552195
Lasso Regression RMSE: 65779.18826038415
Linear Regression R^2 Score: 0.36759313425902185
Lasso Regression R^2 Score: 0.36759381368868127
Predicted price of a 2022 model using Linear Regression: [743601.61594504]
Predicted price of a 2022 model using Lasso Regression: [743601.39177768]


# **INFERENCE**

1. The Linear Regression and Lasso Regression models have nearly identical RMSE values (about 65,779), so their average prediction error on the test set is effectively the same.

2. Both models have an R-squared score of about 0.368, meaning that Year alone explains roughly 36.8% of the variability in Price on the test set. Most of the price variation is left unexplained by this single feature.

3. Both models predict a price of about 743,600 for a 2022 model of the second-hand Hyundai Grand i10. Since the latest Year in the data is 2020, this is an extrapolation beyond the observed range and should be treated with caution.

4. Based on these observations, the two models have **effectively the same performance** in terms of RMSE, R-squared score, and the predicted price for a 2022 model. This is expected: with `alpha=0.1` and prices in the hundreds of thousands, the Lasso penalty barely changes the fitted slope. Neither model can be declared better on these metrics alone.



---


