In [2]:
%pip install --upgrade pandas numpy matplotlib seaborn

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Q1. Pandas version
What's the version of Pandas that you installed?

You can get the version information from the `__version__` attribute:

In [3]:
pd.__version__

'2.3.2'

In [5]:
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv"
df = pd.read_csv(url)

df.head()


|   | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


## Q2. Records count
How many records are in the dataset?

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9704 entries, 0 to 9703
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   engine_displacement  9704 non-null   int64  
 1   num_cylinders        9222 non-null   float64
 2   horsepower           8996 non-null   float64
 3   vehicle_weight       9704 non-null   float64
 4   acceleration         8774 non-null   float64
 5   model_year           9704 non-null   int64  
 6   origin               9704 non-null   object 
 7   fuel_type            9704 non-null   object 
 8   drivetrain           9704 non-null   object 
 9   num_doors            9202 non-null   float64
 10  fuel_efficiency_mpg  9704 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 834.1+ KB
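
The `RangeIndex` line in the `df.info()` output gives the answer (9704 entries). When only the number is needed, `len(df)` or `df.shape[0]` return it directly. A minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for `df`):

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "origin": ["Asia", "Europe", "USA"],
    "fuel_efficiency_mpg": [14.2, 13.1, 15.7],
})

n_rows_len = len(toy)         # number of rows
n_rows_shape = toy.shape[0]   # first element of the (rows, columns) tuple
print(n_rows_len, n_rows_shape)
```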


## Q3. Fuel types
How many fuel types are present in the dataset?

In [8]:
df['fuel_type'].nunique()

2
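
`nunique()` only returns the count and, by default, ignores missing values. To also see which categories exist and how often each occurs, `value_counts()` can help. A small sketch on hypothetical data (the `toy` frame is not the homework dataset):

```python
import pandas as pd

# Hypothetical stand-in column; the values are made up for illustration.
toy = pd.DataFrame(
    {"fuel_type": ["Gasoline", "Diesel", "Gasoline", "Diesel", "Gasoline"]}
)

n_types = toy["fuel_type"].nunique()      # number of distinct non-null values
counts = toy["fuel_type"].value_counts()  # rows per category, most frequent first
print(n_types)
print(counts)
```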

## Q4. Missing values
How many columns in the dataset have missing values?

In [10]:
df.isna().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64
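
In the output above, four columns have a nonzero count: `num_cylinders`, `horsepower`, `acceleration`, and `num_doors`. Instead of counting by eye, the number of such columns can be computed directly. A minimal sketch, assuming a small made-up frame in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; two of the three columns contain a NaN.
toy = pd.DataFrame({
    "horsepower": [159.0, np.nan, 78.0],
    "num_doors": [0.0, 2.0, np.nan],
    "model_year": [2003, 2007, 2018],
})

missing_per_column = toy.isna().sum()

# Count the columns whose missing-value total is above zero.
n_columns_with_missing = int((missing_per_column > 0).sum())

# Equivalent: flag columns that contain any NaN, then count the flags.
n_columns_alt = int(toy.isna().any().sum())
print(n_columns_with_missing, n_columns_alt)
```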

## Q5. Max fuel efficiency
What's the maximum fuel efficiency of cars from Asia?

In [18]:
asia_df = df[df['origin']=='Asia']
asia_df['fuel_efficiency_mpg'].max()

23.759122836520497
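
The filter-then-`max()` approach answers the question for one region. If the maximum for every region is of interest, a `groupby` gives all of them at once. A sketch on hypothetical data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "origin": ["Asia", "Europe", "Asia", "USA"],
    "fuel_efficiency_mpg": [14.0, 13.5, 18.25, 16.0],
})

# Same idea as the cell above: boolean mask, then max of one column.
asia_max = toy.loc[toy["origin"] == "Asia", "fuel_efficiency_mpg"].max()

# Maximum per origin in a single pass.
max_by_origin = toy.groupby("origin")["fuel_efficiency_mpg"].max()
print(asia_max)
print(max_by_origin)
```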

## Q6. Median value of horsepower
1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?

In [23]:
df['horsepower'].mode()

0    152.0
Name: horsepower, dtype: float64

In [24]:
df['horsepower'].median()

149.0

In [25]:
df['horsepower'] = df['horsepower'].fillna(152)  # 152.0 is the mode computed above

In [26]:
df['horsepower'].median()

152.0
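
Comparing the two outputs, the median moved from 149.0 to 152.0 after the fill. The cell above hard-codes the mode; the same steps can be written without the literal by reading the value from `mode()`. A minimal sketch on a made-up series (the name `hp` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values with three missing entries.
hp = pd.Series([100.0, 120.0, 150.0, 150.0, np.nan, np.nan, np.nan],
               name="horsepower")

median_before = hp.median()         # NaNs are skipped by default

# mode() returns a Series (there can be ties); take the first value.
most_frequent = hp.mode().iloc[0]

hp_filled = hp.fillna(most_frequent)
median_after = hp_filled.median()

print(median_before, most_frequent, median_after)
```

Filling many missing entries with a single value pulls the median toward that value, which is the same effect seen in the notebook.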

## Q7. Sum of weights
1. Select all the cars from Asia.
2. Select only the columns `vehicle_weight` and `model_year`.
3. Select the first 7 values.
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

In [39]:
df.loc[df['origin'] == 'Asia',['vehicle_weight','model_year']].iloc[:7]

|    | vehicle_weight | model_year |
|---|---|---|
| 8  | 2714.21931 | 2016 |
| 12 | 2783.868974 | 2010 |
| 14 | 3582.687368 | 2007 |
| 20 | 2231.808142 | 2011 |
| 21 | 2659.431451 | 2016 |
| 34 | 2844.227534 | 2014 |
| 38 | 3761.994038 | 2019 |


In [42]:
X = np.array(df.loc[df['origin'] == 'Asia',['vehicle_weight','model_year']].iloc[:7])
print(X)

[[2714.21930965 2016.        ]
 [2783.86897424 2010.        ]
 [3582.68736772 2007.        ]
 [2231.8081416  2011.        ]
 [2659.43145076 2016.        ]
 [2844.22753389 2014.        ]
 [3761.99403819 2019.        ]]


In [43]:
XTX = X.T.dot(X)
print(XTX)

[[62248334.33150762 41431216.5073268 ]
 [41431216.5073268  28373339.        ]]


In [44]:
XTX_inv = np.linalg.inv(XTX)

In [45]:
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

In [47]:
w = XTX_inv.dot(X.T).dot(y)
print(w)


[0.01386421 0.5049067 ]


In [48]:
w_sum = w.sum()
print("Sum of w:", w_sum)

Sum of w: 0.5187709081074007
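
The steps above compute $w = (X^\top X)^{-1} X^\top y$, the normal-equation solution of a least-squares fit. The explicit inverse follows the exercise as written. As a cross-check, `np.linalg.solve` and `np.linalg.lstsq` should give the same weights without forming the inverse, which is usually the numerically safer choice. A sketch on a small made-up problem (`X_demo` and `y_demo` are hypothetical, chosen so the exact weights are `[1, 2]`):

```python
import numpy as np

# Hypothetical 4x2 feature matrix; targets follow y = 1*x1 + 2*x2 exactly.
X_demo = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [3.0, 4.0],
                   [4.0, 3.0]])
y_demo = np.array([5.0, 4.0, 11.0, 10.0])

XTX_demo = X_demo.T @ X_demo

# Same recipe as the notebook: explicit inverse of X^T X.
w_inv = np.linalg.inv(XTX_demo) @ X_demo.T @ y_demo

# Alternative 1: solve the linear system (X^T X) w = X^T y.
w_solve = np.linalg.solve(XTX_demo, X_demo.T @ y_demo)

# Alternative 2: least-squares solver applied to X and y directly.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(w_inv, w_solve, w_lstsq)
print("Sum of w:", w_inv.sum())
```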
