# Machine Learning Zoomcamp 2024 | DataTalks.club

## **Homework #1:** Introduction to Machine Learning

### Set up the environment

Import the `Pandas` and `NumPy` libraries.

In [1]:
# Import libraries
import pandas as pd
import numpy  as np

### Question 1 

What's the version of Pandas that you installed?

You can get the version information from the `__version__` attribute.

In [2]:
# Show the installed Pandas version
pd.__version__

'2.2.2'

### Getting the data 

For this homework, we'll use the Laptops Price dataset, available [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv). Download and read the `laptops.csv` file with `Pandas`, or read it directly from the web link.

In [3]:
# Read the laptops.csv file and view the DataFrame
laptops = pd.read_csv('laptops.csv')
laptops.head()

|  | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD |  | 15.6 | No | 1009.0 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD |  | 15.6 | No | 299.0 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD |  | 15.6 | No | 789.0 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.0 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD |  | 15.6 | No | 669.01 |
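The text above mentions that the file can also be read directly from the web link. A minimal sketch of that option follows; the helper name `load_laptops` and the constant `DATA_URL` are hypothetical names introduced here, and the two in-memory rows are copied from the preview above so the example runs offline.

```python
import io

import pandas as pd

# URL taken from the dataset link above
DATA_URL = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv"


def load_laptops(source=DATA_URL):
    """Read the data from a local path, a URL, or a file-like object.

    `pd.read_csv` accepts all three, so the same call covers both options.
    """
    return pd.read_csv(source)


# Offline demonstration with a tiny in-memory CSV (two rows from the preview)
sample_csv = io.StringIO("Brand,Final Price\nAsus,1009.0\nHP,669.01\n")
sample = load_laptops(sample_csv)
print(sample.shape)
```

Calling `load_laptops()` with no argument would fetch the full file from the URL, which requires network access.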


### Question 2 

How many records are in the dataset?

In [4]:
laptops.shape[0]   # N(Records) = number of rows in the DataFrame

2160

### Question 3

How many laptop brands are represented in the dataset?

In [5]:
brands = laptops['Brand'].unique()  # Extract unique entries in the 'Brand' column
len(brands)                         # Returns the number of brands

27
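As a possible alternative, `Series.nunique()` counts distinct values in one call. The sketch below uses a made-up toy DataFrame (not the laptops data) to show the one difference worth knowing: `unique()` keeps a missing value as its own entry, while `nunique()` ignores it by default. Since `Brand` has no missing values here, both approaches should agree on this dataset.

```python
import pandas as pd

# Toy data (hypothetical), with one missing brand to expose the difference
toy = pd.DataFrame({"Brand": ["Asus", "HP", "Asus", "MSI", None]})

n_via_unique = len(toy["Brand"].unique())  # the missing value counts as an entry
n_via_nunique = toy["Brand"].nunique()     # missing values are dropped by default

print(n_via_unique, n_via_nunique)
```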

### Question 4

How many columns in the dataset have missing values?

In [6]:
laptops.isna().sum()    # `isna()` detects missing (NaN) values
                        # `sum()` returns the number of NaN values in each column

Laptop             0
Status             0
Brand              0
Model              0
CPU                0
RAM                0
Storage            0
Storage type      42
GPU             1371
Screen             4
Touch              0
Final Price        0
dtype: int64

In [7]:
# Count the columns that contain at least one missing value
laptops.isnull().any().sum()    # `isnull()` (an alias of `isna()`) flags every missing value
                                # `any()` reduces each column to True if it has at least one missing value
                                # `sum()` counts the True entries, i.e. the columns with missing data

3
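To see which columns are affected, and not only how many, the per-column counts can be filtered. This is a small sketch on a made-up toy DataFrame; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'GPU' and 'Screen' contain missing values
toy = pd.DataFrame({
    "RAM": [8, 16, 8],
    "GPU": [np.nan, "RTX 3050", np.nan],
    "Screen": [15.6, np.nan, 14.1],
})

missing_counts = toy.isna().sum()                                   # missing values per column
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()  # keep only affected columns

print(cols_with_missing)
```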

### Question 5

What's the maximum final price of Dell notebooks in the dataset?

In [8]:
# Select the 'Final Price' values of the rows whose 'Brand' is Dell,
# then take the maximum of that filtered column
laptops.loc[laptops.Brand == 'Dell', 'Final Price'].max()

3936.0
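The one-liner above combines a boolean mask with `.loc`. The sketch below breaks it into steps on a made-up toy DataFrame, and also shows `groupby` as one way to get the maximum for every brand at once; the prices here are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical prices)
toy = pd.DataFrame({
    "Brand": ["Dell", "HP", "Dell", "Asus"],
    "Final Price": [999.0, 669.01, 1500.0, 1009.0],
})

mask = toy["Brand"] == "Dell"                   # boolean Series, True for Dell rows
dell_max = toy.loc[mask, "Final Price"].max()   # rows picked by mask, column by label

# Maximum price per brand in a single pass
max_by_brand = toy.groupby("Brand")["Final Price"].max()

print(dell_max)
print(max_by_brand)
```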

### Question 6

Median value of `Screen`.

1. Find the median value of the `Screen` column in the dataset.
2. Next, calculate the most frequent value of the same column.
3. Use the `fillna` method to fill the missing values in the `Screen` column with the most frequent value from the previous step.
4. Now, calculate the median value of `Screen` once again.

> Hint: refer to existing `mode` and `median` functions to complete the task.

In [9]:
# Step 1: Find the median value of the laptops DataFrame's 'Screen' column
screen_median = laptops['Screen'].median()
print('screen_median:', screen_median)

screen_median: 15.6


In [10]:
# Step 2: Calculate the 'Screen' column's mode
screen_mode = laptops['Screen'].mode()[0]
print('screen_mode:', screen_mode)

screen_mode: 15.6


In [11]:
# Step 3: Use the fillna method to replace the NaN values in the 'Screen' column with its mode
laptops['Screen'] = laptops['Screen'].fillna(screen_mode)

# Ensure the NaN values have all been replaced
print(f"Have all the NaN values in laptops['Screen'] been replaced?: {not laptops['Screen'].isnull().any()}")

Have all the NaN values in laptops['Screen'] been replaced?: True


In [12]:
# Step 4: Recalculate the median value of the 'Screen' column
new_screen_median = laptops['Screen'].median()
print('new_screen_median:', new_screen_median)

# Has it changed?
print(f"Has the 'Screen' column's median changed?: {screen_median != new_screen_median}")

new_screen_median: 15.6
Has the 'Screen' column's median changed?: False
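The median stays the same here because the mode used for filling equals the original median (both are 15.6), and only four values were missing. That is not guaranteed in general. The sketch below uses a made-up Series in which filling with the mode does shift the median.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the mode (3.0) differs from the median of the known values
s = pd.Series([1.0, 2.0, 3.0, 3.0, np.nan, np.nan, np.nan])

median_before = s.median()      # NaN is skipped: median of [1, 2, 3, 3]
mode_value = s.mode()[0]        # most frequent non-missing value
filled = s.fillna(mode_value)   # replace the three NaN values with the mode
median_after = filled.median()  # median of [1, 2, 3, 3, 3, 3, 3]

print(median_before, mode_value, median_after)
```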


### Question 7

Sum of weights.

1. Select all the `Innjoo` laptops from the dataset.
2. Select only columns `RAM`, `Storage`, `Screen`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the sum of all the elements of the result?

In [14]:
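# Step 1: Select all rows whose 'Brand' is 'Innjoo'
innjoo_laptops = laptops.loc[laptops.Brand == 'Innjoo']

# Step 2: Keep only the 'RAM', 'Storage', and 'Screen' columns
innjoo_laptops = innjoo_laptops.filter(['RAM', 'Storage', 'Screen'])

innjoo_laptops  # View the filtered 'Innjoo' DataFrame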

|  | RAM | Storage | Screen |
|---|---|---|---|
| 1478 | 8 | 256 | 15.6 |
| 1479 | 8 | 512 | 15.6 |
| 1480 | 4 | 64 | 14.1 |
| 1481 | 6 | 64 | 14.1 |
| 1482 | 6 | 128 | 14.1 |
| 1483 | 6 | 128 | 14.1 |


In [15]:
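# Step 3: Get the underlying NumPy array (6 rows x 3 columns)
X = innjoo_laptops.values

# Step 4: Multiply the transpose of X by X (3 x 3 matrix)
XTX = X.T.dot(X)

# Step 5: Invert XTX
XTX_inv = np.linalg.inv(XTX)

# Step 6: Create the target array y
y = np.array([1100, 1300, 800, 900, 1000, 1100])

# Step 7: Multiply the inverse of XTX by the transpose of X, then by y
w = (XTX_inv @ X.T) @ y

# Step 8: Sum the elements of w, rounded to two decimals
round(w.sum(), 2)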

91.3
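Steps 4 to 7 compute $w = (X^T X)^{-1} X^T y$, the normal-equation solution of a least-squares fit without an intercept. As a cross-check, the same `w` can be obtained by solving the linear system $X^T X \, w = X^T y$ with `np.linalg.solve`, which avoids forming the inverse explicitly. The sketch below hard-codes `X` from the Innjoo table above so that it runs on its own; `w_inv` and `w_solve` are names introduced here for the comparison.

```python
import numpy as np

# X copied from the Innjoo table above (columns: RAM, Storage, Screen)
X = np.array([
    [8, 256, 15.6],
    [8, 512, 15.6],
    [4, 64, 14.1],
    [6, 64, 14.1],
    [6, 128, 14.1],
    [6, 128, 14.1],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100])

XTX = X.T @ X

# Approach used in the homework: explicit inverse
w_inv = np.linalg.inv(XTX) @ X.T @ y

# Alternative: solve XTX @ w = X.T @ y directly
w_solve = np.linalg.solve(XTX, X.T @ y)

print(np.allclose(w_inv, w_solve))
print(round(w_solve.sum(), 2))
```

Both approaches should agree up to floating-point error; solving the system is generally the more numerically stable choice for larger problems.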