## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from
[06-environment.md](../../../01-intro/06-environment.md).
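
Once the installation is done, a quick check like the one below can confirm that every library is visible to the interpreter. It is a minimal sketch that uses only the standard library; the helper name `installed_versions` and the `REQUIRED` list are illustrative and not part of the course materials.

```python
from importlib import metadata

# Distribution names as published on PyPI, mirroring the libraries listed above.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Return {package: version string, or None when the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(REQUIRED)
for name, version in versions.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any line that prints `NOT INSTALLED` points to a package that still needs to be installed.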

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version from the `__version__` attribute:

```python
pd.__version__
```

In [1]:
import pandas as pd

In [2]:
pd.__version__

'2.2.2'

### Getting the data 

For this homework, we'll use the Laptops Price dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.
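
Reading the file comes down to a single `pd.read_csv` call. The sketch below runs without the downloaded file by parsing a small in-memory sample; the two rows are made up for illustration and only the column names are borrowed from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the values are invented, only the column
# names match the real dataset.
sample_csv = io.StringIO(
    "Brand,RAM,Storage,Screen,Final Price\n"
    "Dell,8,256,15.6,699.0\n"
    "Innjoo,4,64,,311.37\n"
)

# read_csv accepts a local path, a URL, or a file-like object such as this one.
# For the downloaded file the call would be: pd.read_csv("laptops.csv")
sample = pd.read_csv(sample_csv)

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # empty fields are parsed as NaN, so Screen is float64
```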

In [3]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv

--2024-10-02 11:44:40--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 298573 (292K) [text/plain]
Saving to: ‘laptops.csv.1’


2024-10-02 11:44:40 (43.1 MB/s) - ‘laptops.csv.1’ saved [298573/298573]



### Q2. Records count

How many records are in the dataset?

- 12
- 1000
- 2160
- 12160

In [4]:
df = pd.read_csv("laptops.csv")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2160 entries, 0 to 2159
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Laptop        2160 non-null   object 
 1   Status        2160 non-null   object 
 2   Brand         2160 non-null   object 
 3   Model         2160 non-null   object 
 4   CPU           2160 non-null   object 
 5   RAM           2160 non-null   int64  
 6   Storage       2160 non-null   int64  
 7   Storage type  2118 non-null   object 
 8   GPU           789 non-null    object 
 9   Screen        2156 non-null   float64
 10  Touch         2160 non-null   object 
 11  Final Price   2160 non-null   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 202.6+ KB


### Q3. Laptop brands

How many laptop brands are present in the dataset?

In [6]:
df["Brand"].nunique()

27

### Q4. Missing values

How many columns in the dataset have missing values?

In [7]:
df.isnull().sum()

Laptop             0
Status             0
Brand              0
Model              0
CPU                0
RAM                0
Storage            0
Storage type      42
GPU             1371
Screen             4
Touch              0
Final Price        0
dtype: int64
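
In the output above, three columns have a non-zero count: `Storage type`, `GPU` and `Screen`. The count can also be computed directly instead of being read off the table. The sketch below shows the idea on a hypothetical miniature frame (`toy`), where only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Brand": ["Dell", "Asus", "HP"],
    "GPU": [None, "RTX 3050", None],
    "Screen": [15.6, np.nan, 14.0],
})

# Missing values per column, then the number of columns where that count is non-zero.
missing_per_column = toy.isnull().sum()
columns_with_missing = int((missing_per_column > 0).sum())

print(columns_with_missing)
```

The same two lines applied to `df` should agree with the count read from the table.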

### Q5. Maximum final price

What's the maximum final price of Dell notebooks in the dataset?

In [8]:
df[
    df['Brand'] == 'Dell'
]['Final Price'].describe()

count      84.000000
mean     1153.839881
std       671.795071
min       379.000000
25%       699.000000
50%      1003.000000
75%      1313.810000
max      3936.000000
Name: Final Price, dtype: float64
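
The `max` row of the summary above holds the answer. When only the maximum is needed, a boolean mask combined with `.loc` and `.max()` returns it as a single number. The sketch below uses a hypothetical three-row frame (`toy`) rather than the real data.

```python
import pandas as pd

# Hypothetical rows; only the column names match the dataset.
toy = pd.DataFrame({
    "Brand": ["Dell", "Dell", "Asus"],
    "Final Price": [699.0, 1313.81, 2500.0],
})

# Select the Dell rows and the price column in one step, then take the maximum.
dell_max = toy.loc[toy["Brand"] == "Dell", "Final Price"].max()

print(dell_max)
```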

### Q6. Median value of Screen

1. Find the median value of the `Screen` column in the dataset.
2. Next, calculate the most frequent value of the same `Screen` column.
3. Use the `fillna` method to fill the missing values in the `Screen` column with the most frequent value from the previous step.
4. Now, calculate the median value of `Screen` once again.

Has it changed?

> Hint: refer to existing `mode` and `median` functions to complete the task.

- Yes
- No

In [9]:
df["Screen"].describe()

count    2156.000000
mean       15.168112
std         1.203329
min        10.100000
25%        14.000000
50%        15.600000
75%        15.600000
max        18.000000
Name: Screen, dtype: float64

In [10]:
df["Screen"].value_counts().head()

Screen
15.6    1009
14.0     392
16.0     174
17.3     161
13.3     131
Name: count, dtype: int64

In [11]:
df_upd = df.fillna({"Screen": 15.6})

In [12]:
df_upd["Screen"].describe()

count    2160.000000
mean       15.168912
std         1.202357
min        10.100000
25%        14.000000
50%        15.600000
75%        15.600000
max        18.000000
Name: Screen, dtype: float64

Answer: No. The median (the 50% row) is 15.6 in both `df` and `df_upd`.
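
The four steps of this question can also be written with `median` and `mode` directly, as the hint suggests. The sketch below is one possible version and runs on a hypothetical seven-value series rather than the real `Screen` column.

```python
import numpy as np
import pandas as pd

# Hypothetical screen sizes with two missing entries.
screen = pd.Series([15.6, 14.0, 15.6, np.nan, 17.3, 15.6, np.nan], name="Screen")

# Step 1: median of the column (NaN values are skipped by default).
median_before = screen.median()

# Step 2: mode() returns a Series because ties are possible; take the first value.
most_frequent = screen.mode()[0]

# Step 3: fill the gaps with the most frequent value.
filled = screen.fillna(most_frequent)

# Step 4: median after filling, compared with the original one.
median_after = filled.median()

print(median_before, most_frequent, median_after)
print("Changed:", median_before != median_after)
```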

### Q7. Sum of weights

1. Select all the "Innjoo" laptops from the dataset.
2. Select only columns `RAM`, `Storage`, `Screen`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the sum of all the elements of the result?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- 0.43
- 45.29
- 45.58
- 91.30

In [13]:
import numpy as np

In [14]:
df[df['Brand'] == 'Innjoo'][["RAM", "Storage", "Screen"]]

| | RAM | Storage | Screen |
|---|---|---|---|
| 1478 | 8 | 256 | 15.6 |
| 1479 | 8 | 512 | 15.6 |
| 1480 | 4 | 64 | 14.1 |
| 1481 | 6 | 64 | 14.1 |
| 1482 | 6 | 128 | 14.1 |
| 1483 | 6 | 128 | 14.1 |


In [15]:
X = df[df['Brand'] == 'Innjoo'][["RAM", "Storage", "Screen"]].values

In [16]:
X

array([[  8. , 256. ,  15.6],
       [  8. , 512. ,  15.6],
       [  4. ,  64. ,  14.1],
       [  6. ,  64. ,  14.1],
       [  6. , 128. ,  14.1],
       [  6. , 128. ,  14.1]])

In [17]:
XTX = np.dot(X.T, X)

In [18]:
XTX

array([[2.52000e+02, 8.32000e+03, 5.59800e+02],
       [8.32000e+03, 3.68640e+05, 1.73952e+04],
       [5.59800e+02, 1.73952e+04, 1.28196e+03]])

In [19]:
XTX_inv = np.linalg.inv(XTX)

In [22]:
y = np.array([1100, 1300, 800, 900, 1000, 1100])

In [25]:
w = XTX_inv.dot(X.T).dot(y)

In [27]:
sum(w)

91.2998806299555
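
For reference, the whole computation fits in one self-contained snippet. The sketch below reuses the `X` array printed above and the `y` values from the task, so it runs without the CSV file. The `np.linalg.solve` line is an alternative that the steps do not ask for; it is included only as a cross-check of the inverse-based result.

```python
import numpy as np

# Feature matrix of the six Innjoo laptops (RAM, Storage, Screen), copied from the output above.
X = np.array([
    [8.0, 256.0, 15.6],
    [8.0, 512.0, 15.6],
    [4.0, 64.0, 14.1],
    [6.0, 64.0, 14.1],
    [6.0, 128.0, 14.1],
    [6.0, 128.0, 14.1],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100])

# Steps 4-7: w = inverse(X^T X) times X^T times y.
XTX = X.T @ X
w_inv = np.linalg.inv(XTX) @ X.T @ y

# Cross-check (not part of the task): solve XTX w = X^T y without forming the inverse.
w_solve = np.linalg.solve(XTX, X.T @ y)

print(w_inv)
print(w_inv.sum())
print(np.allclose(w_inv, w_solve))
```

Both routes should give the same weights up to floating-point error. Solving the system is usually preferred over forming the inverse explicitly when the matrix is poorly conditioned.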