# Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn.

Done and ready to go.
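One way to confirm the installation is to ask Python which versions it can see. The sketch below is not part of the original homework; the `PACKAGES` list and the `installed_versions` helper are illustrative names, and the package names are assumed to match their PyPI distribution names.

```python
from importlib import metadata

# Distribution names as published on PyPI (an assumption; adjust if your setup differs)
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as `not installed` needs to be installed before continuing.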

## Imports

In [1]:
import numpy as np
import pandas as pd 

---

### Question 1

What's the version of NumPy that you installed? 

You can get the version information from the `__version__` attribute:

In [2]:
np.__version__

'1.23.2'

---

## Getting the data 

For this homework, we'll use the Car price dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

You can do it with wget:

In [3]:
!wget "https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv"

--2022-09-08 19:09:42--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1475504 (1.4M) [text/plain]
Saving to: ‘data.csv.2’


2022-09-08 19:09:43 (8.30 MB/s) - ‘data.csv.2’ saved [1475504/1475504]
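The log above shows wget saving to `data.csv.2`, which is what it does when `data.csv` already exists from an earlier run. If repeated downloads are a nuisance, or wget is not available, a small helper along these lines may be enough. It is only a sketch: `download_if_missing` and `DATA_URL` are hypothetical names, not part of the homework.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = (
    "https://raw.githubusercontent.com/alexeygrigorev/"
    "mlbookcamp-code/master/chapter-02-car-price/data.csv"
)


def download_if_missing(url, path="data.csv"):
    """Download url to path unless the file is already there.

    Returns the target path and a flag telling whether a download happened.
    """
    target = Path(path)
    if target.exists():
        # Reuse the existing copy instead of creating data.csv.1, data.csv.2, ...
        return target, False
    urlretrieve(url, str(target))
    return target, True


# Usage (needs network access the first time):
# path, downloaded = download_if_missing(DATA_URL)
```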



In [4]:
# Read the data into a pandas DataFrame
df = pd.read_csv('data.csv')

---

### Question 2

How many records are in the dataset?

Here you need to specify the number of rows.

- 16
- 6572
- **11914**
- 18990

In [5]:
# Use df.shape to get the shape of the Dataframe
# The first element of the shape tuple is the number of rows
df.shape

(11914, 16)
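For reference, the row count can be read either from the shape tuple or with `len`. The snippet below uses a made-up three-row frame (`toy`) rather than the car dataset, purely to show that both give the same number.

```python
import pandas as pd

# Toy frame standing in for the car dataset (values are made up)
toy = pd.DataFrame({"Make": ["Audi", "Ford", "Ford"], "Year": [2015, 2016, 2017]})

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of rows
print(len(toy))
```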

---

### Question 3

Who are the most frequent car manufacturers (top-3) according to the dataset?

- Chevrolet, Volkswagen, Toyota
- Chevrolet, Ford, Toyota
- Ford, Volkswagen, Toyota
- **Chevrolet, Ford, Volkswagen**

> **Note**: You should rely on the "Make" column in this question.

In [6]:
# Group by the Make
#   get the size of each group
#   reset the index and name the count column "count"
#   sort on the count column, descending
#   we are only interested in the top 3 values
df.groupby(["Make"]).size().reset_index(name = 'count').sort_values(['count'], ascending = [False]).head(3)

|    | Make       | count |
|----|------------|-------|
| 9  | Chevrolet  | 1123  |
| 14 | Ford       | 881   |
| 46 | Volkswagen | 809   |
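The same kind of ranking can also be obtained with `value_counts`, which counts and sorts in descending order in one call. This is an alternative sketch on a made-up `toy` frame, not the approach used in the cell above.

```python
import pandas as pd

# Toy frame standing in for the car dataset (values are made up)
toy = pd.DataFrame({"Make": ["Ford", "Audi", "Ford", "BMW", "Audi", "Ford", "Kia"]})

# value_counts() returns counts sorted from most to least frequent;
# head(2) keeps the two most frequent makes
top = toy["Make"].value_counts().head(2)
print(top)
```

On the real data, the equivalent call would be `df["Make"].value_counts().head(3)`.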


---

### Question 4

What's the number of unique Audi car models in the dataset?

- 3
- 16
- 26
- **34**

In [7]:
# Filter on the Make "Audi"
#   Get the unique number of Models
df[df['Make'] == 'Audi']['Model'].nunique()

34

---

### Question 5

How many columns in the dataset have missing values?

- **5**
- 6
- 7
- 8

In [8]:
# Get the number of nulls in the DataFrame
#   sum by column
#   reset the index and name the count column "count"
#   sort on the count column, descending
#   assign to a temporary DataFrame
#   from the temporary DataFrame we are only interested in values > 0
df_tmp = df.isnull().sum().reset_index(name = 'count').sort_values(['count'], ascending = [False])
df_tmp[df_tmp['count'] > 0]

|   | index            | count |
|---|------------------|-------|
| 9 | Market Category  | 3742  |
| 4 | Engine HP        | 69    |
| 5 | Engine Cylinders | 30    |
| 8 | Number of Doors  | 6     |
| 3 | Engine Fuel Type | 3     |
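If only the number of affected columns is needed, rather than the per-column counts, `isnull().any()` followed by `sum()` gives it directly. A small sketch on a made-up `toy` frame:

```python
import pandas as pd

# Toy frame with missing values in columns "a" and "c" (values are made up)
toy = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": ["x", "y", "z"],
    "c": [None, None, 1.0],
})

# any() flags each column that contains at least one null;
# summing the booleans counts those columns
n_columns_with_missing = int(toy.isnull().any().sum())
print(n_columns_with_missing)

# Per-column counts, keeping only columns that have nulls
missing_per_column = toy.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```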


---

### Question 6

* Find the median value of the "Engine Cylinders" column in the dataset.
* Next, calculate the most frequent value of the same "Engine Cylinders" column.
* Use the `fillna` method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step.
* Now, calculate the median value of "Engine Cylinders" once again.

Has it changed?

> Hint: refer to existing `mode` and `median` functions to complete the task.

- Yes
- **No**

In [9]:
# Find the median value of "Engine Cylinders"
df['Engine Cylinders'].median()

6.0

In [10]:
# Calculate the most frequent value of the same "Engine Cylinders"
df['Engine Cylinders'].mode()

0    4.0
Name: Engine Cylinders, dtype: float64

In [11]:
# Use the fillna method to fill the missing values in "Engine Cylinders" with the most frequent value
df['Engine Cylinders'] = df['Engine Cylinders'].fillna(df['Engine Cylinders'].mode()[0])

In [12]:
# Find the median value of "Engine Cylinders" once again, after filling the missing values
df['Engine Cylinders'].median()

6.0
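The median staying at 6.0 is plausible: only 30 of the 11914 values were missing, so filling them with 4.0 adds few points below the middle of the distribution. The toy example below (made-up values, not the car data) shows the same effect on a small scale.

```python
import pandas as pd

# Toy column: the mode is 4, the median is 6, and one value is missing
cylinders = pd.Series([4, 4, 4, 4, 6, 6, 6, 8, 8, 8, None], dtype="float64")

median_before = cylinders.median()   # NaN is skipped by default
most_frequent = cylinders.mode()[0]  # mode() returns a Series, so take the first entry

filled = cylinders.fillna(most_frequent)
median_after = filled.median()

print(median_before, most_frequent, median_after)
```

With enough missing values relative to the column size, the median could shift, so the answer depends on the data.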

---

### Question 7

* Select all the "Lotus" cars from the dataset.
* Select only the columns "Engine HP" and "Engine Cylinders".
* Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 9 rows).
* Get the underlying NumPy array. Let's call it `X`.
* Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
* Invert `XTX`.
* Create an array `y` with values `[1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800]`.
* Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
* What's the value of the first element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -0.0723
- **4.5949**
- 31.6537
- 63.5643

In [13]:
# Select all the "Lotus" cars from the dataset.
# Select only columns "Engine HP", "Engine Cylinders".
# Now drop all duplicated rows using drop_duplicates
df_lotus = df[df["Make"] == "Lotus"][["Engine HP", "Engine Cylinders"]].drop_duplicates()

In [14]:
# Get the underlying NumPy array. Let's call it X
X = df_lotus.to_numpy()
X

array([[189.,   4.],
       [218.,   4.],
       [217.,   4.],
       [350.,   8.],
       [400.,   6.],
       [276.,   6.],
       [345.,   6.],
       [257.,   4.],
       [240.,   4.]])

In [15]:
# Compute matrix-matrix multiplication between the transpose of X and X. 
# To get the transpose, use X.T. Let's call the result XTX
XTX = X.T.dot(X)

In [16]:
# Invert XTX
XTX_inverse = np.linalg.inv(XTX)

In [17]:
# Create an array y with values ...
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

In [18]:
# Multiply the inverse of XTX with the transpose of X
# Then multiply the result by y. 
# Call the result w
w = XTX_inverse.dot(X.T).dot(y)

In [19]:
# What's the value of the first element of `w`?
w[0]

4.5949448100945744
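The steps above compute `w = (XᵀX)⁻¹ Xᵀ y`. As a cross-check, the same `w` can be found without forming the inverse explicitly, by solving the linear system `XᵀX w = Xᵀy` with `np.linalg.solve`. This is an alternative sketch, not the method the homework asks for; `X` and `y` are copied from the cells above.

```python
import numpy as np

# X: "Engine HP" and "Engine Cylinders" for the 9 deduplicated Lotus rows shown above
X = np.array([
    [189., 4.],
    [218., 4.],
    [217., 4.],
    [350., 8.],
    [400., 6.],
    [276., 6.],
    [345., 6.],
    [257., 4.],
    [240., 4.],
])
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800], dtype=float)

XTX = X.T @ X   # 2x2 Gram matrix
XTy = X.T @ y   # right-hand side of the normal equations

# Solve XTX @ w = XTy directly instead of inverting XTX
w = np.linalg.solve(XTX, XTy)
print(w[0])
```

The first element should agree with the value of `w[0]` obtained above, up to floating-point rounding.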

---

# Learning in Public

* https://twitter.com/David__Colton/status/1566819448941465605?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1566852824243359744?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1566853269783199746?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1566869459742105600?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1566874640772960256?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1566876044065390595?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1566880715270144000?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1567178518647152641?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1567593259991539714?s=20&t=GJGuC9OEzKMWn1lEfqwrZg
* https://twitter.com/David__Colton/status/1567637214703587328?s=20&t=GJGuC9OEzKMWn1lEfqwrZg

---