(a) Write a program to create a synthetic dataset of car resale prices with three numerical
variables:
1. price: Target variable (in dollars or Rs.).
2. mileage: Independent variable.
3. age of the car: Independent variable, calculated as (current year) - (year of
manufacture), where the current year is 2025.

(b) Write a program to calculate Pearson’s correlation coefficient for every pair of
variables and display the results. Determine the pairs with the highest positive and
negative correlations and interpret the results.
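
For reference, Pearson's coefficient for two variables x and y is the sum of the products of their deviations from the mean, divided by the square root of the product of their summed squared deviations. The sketch below computes it directly from that definition, as a cross-check on what a library call returns. The function name `pearson_r` and the sample values are illustrative assumptions and are not part of the assignment.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from the definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up sample values, for illustration only
age = [1, 3, 5, 8, 12]
price = [900, 750, 640, 480, 300]

r = pearson_r(age, price)
print(f"r from definition: {r:.4f}")
print(f"r from np.corrcoef: {np.corrcoef(age, price)[0, 1]:.4f}")
```

The two printed values should agree up to floating-point error, and the sign is negative because price falls as age rises in this toy sample.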

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
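# Fix the seed so the synthetic dataset is reproducible
np.random.seed(42)

n = 100
current_year = 2025

# Manufacture years 2005..2024 (upper bound is exclusive), so age ranges from 1 to 20
year_of_manufacture = np.random.randint(2005, 2025, size=n)
age = current_year - year_of_manufacture

# Mileage (km): yearly usage times age, plus Gaussian noise, never negative
avg_km_per_year = np.random.uniform(8_000, 20_000, size=n)
mileage = avg_km_per_year * age + np.random.normal(0, 5_000, size=n)
mileage = np.clip(mileage, 0, None)

# Price: exponential depreciation with age, a per-km penalty, and Gaussian noise,
# floored at 50,000 so that no price is unrealistically low or negative
base_price_new = np.random.uniform(800_000, 1_800_000, size=n)
price = base_price_new * np.exp(-0.08 * age) - 0.8 * mileage + np.random.normal(0, 50_000, size=n)
price = np.clip(price, 50_000, None)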

df = pd.DataFrame({
    'price': price,
    'mileage': mileage,
    'age': age,
    'year_of_manufacture': year_of_manufacture
})

print(df.head())

           price        mileage  age  year_of_manufacture
0  231890.748209  235711.137634   14                 2011
1  786661.169744   22186.403428    1                 2024
2  432489.609656  103841.901293    6                 2019
3  711695.669292  137588.960312   10                 2015
4  438168.905955  123956.116338   13                 2012


In [4]:
df.to_csv('1_6_car_resell_data.csv', index=False)

### Correlation coefficient/matrix

In [None]:
import seaborn as sns

# Pearson correlation matrix
corr = df[['price','mileage','age']].corr(method='pearson')
print("Pearson correlation matrix:\n", corr, "\n")

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape), 1).astype(bool))
pairs = upper.unstack().dropna()
# idxmax/idxmin return the (column, row) label of the largest and smallest coefficient
max_pos_pair, max_pos_val = pairs.idxmax(), pairs.max()
max_neg_pair, max_neg_val = pairs.idxmin(), pairs.min()

print(f"Strongest positive correlation: {max_pos_pair} = {max_pos_val:.3f}")
print(f"Strongest negative correlation: {max_neg_pair} = {max_neg_val:.3f}\n")

# Interpret each pair reported above; frozenset makes the check independent of label order
top_pairs = {frozenset(max_pos_pair), frozenset(max_neg_pair)}
if frozenset(('age', 'mileage')) in top_pairs:
    print("Interpretation: Age and mileage are strongly positively correlated (older cars accumulated more mileage).")
if frozenset(('price', 'age')) in top_pairs:
    print("Interpretation: Price is negatively correlated with age (older cars depreciate).")
if frozenset(('price', 'mileage')) in top_pairs:
    print("Interpretation: Price is negatively correlated with mileage (higher usage lowers resale value).")

Pearson correlation matrix:
             price   mileage       age
price    1.000000 -0.785850 -0.855167
mileage -0.785850  1.000000  0.875149
age     -0.855167  0.875149  1.000000 

Strongest positive correlation: ('age', 'mileage') = 0.875
Strongest negative correlation: ('age', 'price') = -0.855

Interpretation: Age and mileage are strongly positively correlated (older cars accumulated more mileage).
Interpretation: Price is negatively correlated with age (older cars depreciate).
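
The output above reports only the two extreme pairs. To compare all three pairs at once, one option is to rank them by absolute correlation. The sketch below does that with the coefficients copied from the printed matrix. The helper name `rank_pairs` is an assumption introduced here for illustration.

```python
import numpy as np
import pandas as pd

def rank_pairs(corr: pd.DataFrame) -> pd.Series:
    """List each variable pair once, strongest |r| first (hypothetical helper)."""
    # Upper triangle without the diagonal, so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).unstack().dropna()
    # Sort by magnitude but keep the signed coefficient as the value
    return pairs.sort_values(key=np.abs, ascending=False)

# Coefficients copied from the correlation matrix printed above
cols = ['price', 'mileage', 'age']
corr = pd.DataFrame(
    [[1.000000, -0.785850, -0.855167],
     [-0.785850, 1.000000, 0.875149],
     [-0.855167, 0.875149, 1.000000]],
    index=cols, columns=cols,
)

ranked = rank_pairs(corr)
for pair, r in ranked.items():
    print(f"{pair}: r = {r:+.3f}")
```

Ranked this way, all three pairs are strong (|r| above 0.78): age and mileage first, then age and price, then mileage and price. The age and mileage relationship is high by construction, since mileage was generated as yearly usage multiplied by age, so the two predictors carry overlapping information about price.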
