### **Anaconda and Conda**

The easiest way to set up the environment is to use Anaconda or Miniconda.

Anaconda comes with everything we need (and much more). Miniconda is a minimal version of Anaconda that contains only conda, Python, and a small number of supporting packages.

Follow the instructions on the download page to install the correct package for your system. The site automatically detects your operating system and suggests the correct package.

- Anaconda
- Miniconda

If you are using Windows, you can use WSL, but the plain Windows version should work too.

Anaconda is recommended.

**(Optional) Create environment for course**
It is a good idea to set up a dedicated environment for the course.

In your terminal, run this command to create the environment:
- ``` conda create -n ml-zoomcamp python=3.11 ``` 

Activate it:
- ``` conda activate ml-zoomcamp ``` 

Installing libraries:
- ``` conda install numpy pandas scikit-learn seaborn jupyter ```

Later in the course you will also need to install XGBoost and TensorFlow, but we can skip this part for now.
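
To confirm that the environment is ready, a quick check like the one below can help. It is only a sketch: the helper name `find_missing` is made up for this example, and the list uses import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names of the libraries installed above.
# Note: the scikit-learn package is imported as "sklearn".
required = ["numpy", "pandas", "sklearn", "seaborn"]


def find_missing(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are available.")
```

Run it inside the activated `ml-zoomcamp` environment; any name it reports as missing can be installed with `conda install`.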

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` attribute:

``` pd.__version__ ```

**Getting the data**
For this homework, we'll use the Laptops Price dataset. Download it from here.

You can do it with wget:

``` wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv ``` 
Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [3]:
import pandas as pd
import numpy as np

In [4]:
pd.__version__

'2.2.2'

In [17]:
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv"

laptop_data = pd.read_csv(url)
df = laptop_data.copy()
df.head()

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD | NaN | 15.6 | No | 1009.0 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD | NaN | 15.6 | No | 299.0 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD | NaN | 15.6 | No | 789.0 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.0 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD | NaN | 15.6 | No | 669.01 |


### Q2. Records count

How many records are in the dataset?

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2160 entries, 0 to 2159
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Laptop        2160 non-null   object 
 1   Status        2160 non-null   object 
 2   Brand         2160 non-null   object 
 3   Model         2160 non-null   object 
 4   CPU           2160 non-null   object 
 5   RAM           2160 non-null   int64  
 6   Storage       2160 non-null   int64  
 7   Storage type  2118 non-null   object 
 8   GPU           789 non-null    object 
 9   Screen        2156 non-null   float64
 10  Touch         2160 non-null   object 
 11  Final Price   2160 non-null   float64
dtypes: float64(2), int64(2), object(8)
memory usage: 202.6+ KB
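
`df.info()` reports the record count in its `RangeIndex` line. If only the number is needed, `len(df)` or `df.shape` returns it directly. A minimal sketch on a toy frame (the values below are invented for illustration, not taken from the laptops dataset):

```python
import pandas as pd

# Toy stand-in for the laptops data (hypothetical values)
toy = pd.DataFrame({"Brand": ["Asus", "HP", "Dell"], "RAM": [8, 16, 8]})

n_rows, n_cols = toy.shape  # (number of records, number of columns)
print(n_rows)               # record count
print(len(toy))             # same record count
```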


### Q3. Laptop brands

How many laptop brands are present in the dataset?

In [19]:
df["Brand"].nunique()

27
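
One detail worth knowing about `nunique()`: it ignores missing values by default. The sketch below uses a toy Series (hypothetical values) to show the difference, along with `value_counts()` for a per-brand breakdown.

```python
import numpy as np
import pandas as pd

# Toy brand column with one missing entry (hypothetical values)
brands = pd.Series(["Asus", "HP", "Asus", np.nan, "Dell"])

n_unique = brands.nunique()                # NaN is excluded by default
n_with_nan = brands.nunique(dropna=False)  # counts NaN as its own value
counts = brands.value_counts()             # rows per brand, NaN excluded

print(n_unique, n_with_nan)
print(counts)
```

In this dataset the `Brand` column has no missing values, so both calls would agree there.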

### Q4. Missing values

How many columns in the dataset have missing values?

In [20]:
laptop_columns = list(df.columns)
missing_counter = 0

for column in laptop_columns:
    # Count the missing values in this column once and reuse the result
    columns_missing = df[column].isnull().sum()
    if columns_missing > 0:
        missing_counter += 1
    print(column, columns_missing)
print("Number of Columns: ", missing_counter)

Laptop 0
Status 0
Brand 0
Model 0
CPU 0
RAM 0
Storage 0
Storage type 42
GPU 1371
Screen 4
Touch 0
Final Price 0
Number of Columns:  3
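
The same count can be obtained without an explicit loop, because `isnull().sum()` works on a whole DataFrame and returns one count per column. A minimal sketch on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of its three columns (hypothetical values)
toy = pd.DataFrame({
    "Brand": ["Asus", "HP", "Dell"],
    "GPU": [np.nan, "RTX 3050", np.nan],
    "Screen": [15.6, np.nan, 14.0],
})

missing_per_column = toy.isnull().sum()                     # missing values per column
columns_with_missing = int((missing_per_column > 0).sum())  # columns with at least one gap

print(missing_per_column)
print("Columns with missing values:", columns_with_missing)
```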


### Q5. Maximum final price

What's the maximum final price of Dell notebooks in the dataset?

In [21]:
max(df[df["Brand"] == "Dell"]["Final Price"])

3936.0
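
An equivalent and arguably more idiomatic form selects the rows and the column in one `.loc` call and uses the Series `max()` method. `groupby` gives the maximum for every brand at once. A sketch on toy data (the prices below are invented):

```python
import pandas as pd

# Toy price table (hypothetical values)
toy = pd.DataFrame({
    "Brand": ["Dell", "HP", "Dell", "HP"],
    "Final Price": [999.0, 650.0, 1499.0, 720.0],
})

# Boolean mask for the rows, column label for the column
dell_max = toy.loc[toy["Brand"] == "Dell", "Final Price"].max()

# Maximum price per brand
max_by_brand = toy.groupby("Brand")["Final Price"].max()

print(dell_max)
print(max_by_brand)
```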

### Q6. Median value of Screen

1. Find the median value of the Screen column in the dataset.

In [22]:
df.Screen.median()

15.6

2. Next, calculate the most frequent value of the same Screen column.

In [23]:
df.Screen.mode()

0    15.6
Name: Screen, dtype: float64

3. Use the `fillna` method to fill the missing values in the Screen column with the most frequent value from the previous step.

In [24]:
# mode() returns a Series, so take its first element to get a scalar fill value
df.Screen.fillna(df.Screen.mode()[0])
# df["Screen"] = df["Screen"].fillna(df["Screen"].mode()[0])  # uncomment to save the change

0       15.6
1       15.6
2       15.6
3       15.6
4       15.6
        ... 
2155    17.3
2156    17.3
2157    17.3
2158    13.4
2159    13.4
Name: Screen, Length: 2160, dtype: float64

In [25]:
df.Screen.isnull().sum()

4

4. Now, calculate the median value of Screen once again.

In [26]:
df.Screen.fillna(df.Screen.mode()[0]).median()  # No, the median has not changed

15.6
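
A pitfall in step 3 is easy to miss: `Series.mode()` returns a Series, not a scalar, and `fillna` aligns a Series argument by index. Passing the mode Series directly therefore fills only the rows whose index labels appear in it (normally just label `0`). The sketch below shows this on a toy Series with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy screen sizes with gaps at index 0 and index 3 (hypothetical values)
s = pd.Series([np.nan, 15.6, 15.6, np.nan, 14.0])

mode_series = s.mode()  # a Series holding 15.6 at index 0

# Passing the Series aligns on index: only row 0 can be filled
filled_wrong = s.fillna(mode_series)

# Passing the scalar fills every missing row
filled_right = s.fillna(mode_series[0])

print(filled_wrong.isnull().sum())  # 1 gap remains (index 3)
print(filled_right.isnull().sum())  # 0 gaps remain
```

Since the fill value here equals the median, the median stays the same after filling, which matches the result of step 4.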

### Q7. Sum of weights

1. Select all the "Innjoo" laptops from the dataset.

In [None]:
df[df["Brand"] == "Innjoo"]

2. Select only columns RAM, Storage, Screen.

In [34]:
df[df["Brand"] == "Innjoo"][["RAM", "Storage", "Screen"]]

| | RAM | Storage | Screen |
|---|---|---|---|
| 1478 | 8 | 256 | 15.6 |
| 1479 | 8 | 512 | 15.6 |
| 1480 | 4 | 64 | 14.1 |
| 1481 | 6 | 64 | 14.1 |
| 1482 | 6 | 128 | 14.1 |
| 1483 | 6 | 128 | 14.1 |


3. Get the underlying NumPy array. Let's call it X.

In [44]:
X = df[df["Brand"] == "Innjoo"][["RAM", "Storage", "Screen"]].to_numpy()
# X = np.array(df[df["Brand"] == "Innjoo"][["RAM", "Storage", "Screen"]])

4. Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX.

In [48]:
XTX = np.dot(X.T, X)

In [49]:
XTX

array([[2.52000e+02, 8.32000e+03, 5.59800e+02],
       [8.32000e+03, 3.68640e+05, 1.73952e+04],
       [5.59800e+02, 1.73952e+04, 1.28196e+03]])

5. Compute the inverse of XTX.

In [51]:
inverse_XTX = np.linalg.inv(XTX)

6. Create an array y with values [1100, 1300, 800, 900, 1000, 1100].

In [52]:
y = np.array([1100, 1300, 800, 900, 1000, 1100])

7. Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.

In [53]:
w = np.dot(np.dot(inverse_XTX, X.T), y)

8. What's the sum of all the elements of the result?

In [54]:
sum(w)

91.29988062995496
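
Steps 4 to 7 compute w = (XᵀX)⁻¹Xᵀy, the normal-equation solution of linear regression. The sketch below repeats the calculation on the six Innjoo rows shown above so that it runs without downloading the dataset. It also compares the result with `np.linalg.solve`, which solves the same system without forming the inverse explicitly. That alternative is not part of the homework; it is shown only as a cross-check.

```python
import numpy as np

# RAM, Storage, Screen for the six Innjoo laptops listed above
X = np.array([
    [8, 256, 15.6],
    [8, 512, 15.6],
    [4, 64, 14.1],
    [6, 64, 14.1],
    [6, 128, 14.1],
    [6, 128, 14.1],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100])

XTX = X.T @ X

# Homework approach: explicit inverse
w_inv = np.linalg.inv(XTX) @ X.T @ y

# Cross-check: solve XTX w = X^T y directly
w_solve = np.linalg.solve(XTX, X.T @ y)

print(w_inv.sum())
print(np.allclose(w_inv, w_solve))
```

For small, well-behaved matrices like this one both approaches agree; solving the system directly is generally the numerically safer habit.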