## Import libraries

In [13]:
import pandas as pd
import numpy as np

## Load data

In [4]:
df = pd.read_csv('car_fuel_efficiency.csv')

In [5]:
print(f"Number of records in the dataset: {len(df)}")

Number of records in the dataset: 9704


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9704 entries, 0 to 9703
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   engine_displacement  9704 non-null   int64  
 1   num_cylinders        9222 non-null   float64
 2   horsepower           8996 non-null   float64
 3   vehicle_weight       9704 non-null   float64
 4   acceleration         8774 non-null   float64
 5   model_year           9704 non-null   int64  
 6   origin               9704 non-null   object 
 7   fuel_type            9704 non-null   object 
 8   drivetrain           9704 non-null   object 
 9   num_doors            9202 non-null   float64
 10  fuel_efficiency_mpg  9704 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 834.1+ KB


## Q3. How many fuel types are present in the dataset?

In [7]:
df['fuel_type'].nunique()

2

## Q4. Columns with missing values

In [9]:
print(f"{df.isnull().any().sum()} columns have missing values")

4 columns have missing values
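
The count above does not say which columns are affected. A minimal sketch of a per-column breakdown follows; it uses a small made-up frame named `toy` (hypothetical, not the real dataset) so that it runs on its own. The same two lines should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "horsepower": [150.0, np.nan, 160.0, np.nan],
    "num_doors": [2.0, 4.0, np.nan, 4.0],
    "model_year": [2003, 2007, 2018, 2009],
})

# Count missing values per column, then keep only the columns that have any.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

print(missing_counts)
print(f"{len(missing_counts)} columns have missing values")
```

On the real data, the `df.info()` output suggests the affected columns are num_cylinders, horsepower, acceleration and num_doors.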


## Q5. Max fuel efficiency
What's the maximum fuel efficiency of cars from Asia?

In [10]:
max_efficiency_asia = df[df['origin'] == 'Asia']['fuel_efficiency_mpg'].max()
print(f"Maximum fuel efficiency of cars from Asia: {max_efficiency_asia}")

Maximum fuel efficiency of cars from Asia: 23.759122836520497
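
As a cross-check, the same maximum can be read off a per-origin summary. The sketch below assumes a made-up frame named `toy` (hypothetical values, not the real dataset) and uses `groupby` to compute the maximum for every origin at once.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "origin": ["Asia", "Europe", "Asia", "USA"],
    "fuel_efficiency_mpg": [14.2, 16.9, 18.5, 12.1],
})

# Maximum fuel efficiency for each origin in one pass.
max_by_origin = toy.groupby("origin")["fuel_efficiency_mpg"].max()

print(max_by_origin)
print(f"Asia: {max_by_origin['Asia']}")
```

Applied to `df`, the `Asia` entry should match the value printed above.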


## Q6. Median value of horsepower
1. Find the median value of the horsepower column in the dataset.
2. Next, calculate the most frequent value of the same horsepower column.
3. Use the fillna method to fill the missing values in the horsepower column with the most frequent value from the previous step.
4. Now, calculate the median value of horsepower once again.

Has it changed?
- Yes, it increased
- Yes, it decreased
- No

In [11]:
# 1. Median value of horsepower before filling missing values
median_before = df['horsepower'].median()

# 2. Most frequent value (mode) of horsepower
mode_horsepower = df['horsepower'].mode()[0]

# 3. Fill missing values in horsepower with the mode
df['horsepower'] = df['horsepower'].fillna(mode_horsepower)

# 4. Median value of horsepower after filling missing values
median_after = df['horsepower'].median()

print(f"Median before: {median_before}")
print(f"Most frequent value: {mode_horsepower}")
print(f"Median after: {median_after}")

if median_after > median_before:
    print("Yes, it increased")
elif median_after < median_before:
    print("Yes, it decreased")
else:
    print("No")

Median before: 149.0
Most frequent value: 152.0
Median after: 152.0
Yes, it increased
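
The increase is expected: the mode (152.0) is larger than the original median (149.0), so filling the 708 missing values (9704 rows minus 8996 non-null) adds points only above the old median. The sketch below reproduces the effect on a small made-up series named `toy_hp` (hypothetical values chosen for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy series; the values are made up to mirror the effect above.
toy_hp = pd.Series([140.0, 145.0, 149.0, 152.0, 152.0, np.nan, np.nan, np.nan])

# median() skips NaN by default, so this uses the five observed values.
median_before = toy_hp.median()

# mode() returns a Series sorted ascending; take the first (smallest) mode.
mode_value = toy_hp.mode().iloc[0]

# Filling with the mode adds three more points above the old median.
median_after = toy_hp.fillna(mode_value).median()

print(median_before, mode_value, median_after)
```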


## Q7. Sum of weights
1. Select all the cars from Asia
2. Select only the columns vehicle_weight and model_year
3. Select the first 7 values
4. Get the underlying NumPy array. Let's call it X.
5. Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX. Then invert XTX.
6. Create an array y with values [1100, 1300, 800, 900, 1000, 1100, 1200].
7. Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.
8. What's the sum of all the elements of the result?

In [None]:
# 1-3. Filter, select columns, take first 7 rows
asia_subset = df.loc[df['origin'] == 'Asia', ['vehicle_weight', 'model_year']].head(7)

# 4. Underlying NumPy array
X = asia_subset.to_numpy()

# 5. XTX and its inverse
XTX = X.T @ X
try:
    XTX_inv = np.linalg.inv(XTX)
except np.linalg.LinAlgError:
    # Fallback in case of singular matrix (unlikely here)
    XTX_inv = np.linalg.pinv(XTX)

# 6. y array
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# 7. Compute w
w = XTX_inv @ X.T @ y

# 8. Sum of elements of w
sum_w = w.sum()
print("w:", w)
print("Sum of elements of w:", sum_w)

w: [0.01386421 0.5049067 ]
Sum of elements of w: 0.5187709081074025
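
The steps above compute the least-squares solution through the normal equation, w = (XᵀX)⁻¹Xᵀy. As a sanity check, the same w can be obtained without forming an explicit inverse, using `np.linalg.solve` or `np.linalg.lstsq`. The sketch below assumes a made-up 7×2 matrix named `X_toy` (hypothetical values, not the rows from the dataset) and confirms that the three routes agree.

```python
import numpy as np

# Hypothetical 7x2 matrix standing in for X (vehicle_weight, model_year).
# The values are made up for illustration.
X_toy = np.array([
    [2700.0, 2016.0],
    [2800.0, 2010.0],
    [3600.0, 2007.0],
    [2200.0, 2011.0],
    [2650.0, 2016.0],
    [2850.0, 2014.0],
    [3750.0, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Route 1: explicit inverse, as in the steps above.
XTX = X_toy.T @ X_toy
w_normal = np.linalg.inv(XTX) @ X_toy.T @ y

# Route 2: solve the linear system XTX @ w = X.T @ y directly.
w_solve = np.linalg.solve(XTX, X_toy.T @ y)

# Route 3: least-squares solver applied to X and y.
w_lstsq, *_ = np.linalg.lstsq(X_toy, y, rcond=None)

print("inverse:", w_normal)
print("solve:  ", w_solve)
print("lstsq:  ", w_lstsq)
print("all agree:", np.allclose(w_normal, w_solve) and np.allclose(w_normal, w_lstsq))
```

Avoiding the explicit inverse is generally the more numerically stable choice, although for a well-conditioned 2×2 matrix like this one the difference should be negligible.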
