## Assignment 2: NumPy and pandas

This assignment contains 3 questions, detailed below. It is due on Friday, October 9, 2020, at 23:59. Each late day results in a loss of 20% of the total points.

### Question 1 (30 points) Just another ordinary yet least square

Every college student with a business degree may know linear regression pretty well. Essentially, linear regression models the linear relationship between a scalar variable (the dependent variable) and a list of independent variables, expressed in matrix notation as:

\begin{equation}
\mathbf{y} = X\boldsymbol\theta + \boldsymbol\varepsilon
\end{equation}


Ordinary least squares (OLS) lets us find the value of $\boldsymbol\theta$ with a *closed-form solution*, in other words, a mathematical equation that gives the result directly. This is called the *Normal Equation*:

\begin{equation}
\hat{\boldsymbol\theta} = (X^T X)^{-1} X^T \mathbf{y}
\end{equation}
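As a minimal sketch of the Normal Equation, the snippet below applies it to synthetic data. The data, the intercept column in `X`, and the names `true_theta` and `theta_hat` are assumptions made for illustration; they are not the assignment files.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the assignment files).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.01 * rng.normal(size=50)  # small noise term

# Normal equation: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta_hat)
```

With so little noise, `theta_hat` should land close to `true_theta`. Note that `@` is matrix multiplication, which is what the equation requires; `*` would multiply element by element.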

**Question 1.1** (5 points): Load the data from ```independent_variable.npy``` and ```dependent_variable.npy```, and list the dimensions of the independent variables and the dependent variable, respectively.
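For reference, a `.npy` file can be read with `np.load`, and the `shape` attribute gives its dimensions. The sketch below writes a toy array to a temporary file first so that it runs on its own; the file name `toy.npy` and the array are hypothetical.

```python
import os
import tempfile

import numpy as np

toy = np.arange(12.0).reshape(4, 3)  # hypothetical array: 4 rows, 3 columns

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npy")  # hypothetical file name
    np.save(path, toy)                   # write the array to disk
    loaded = np.load(path)               # read it back

print(loaded.shape)  # number of rows, number of columns
print(loaded.ndim)   # number of dimensions
```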

**Question 1.2** (15 points) Now implement the normal equation of Ordinary Least Squares using numpy to estimate $\boldsymbol\theta$. Show the estimated value, denoted $\hat{\boldsymbol\theta}$. Also solve the same least-squares problem with the ```np.linalg.lstsq``` function (a linear fit, i.e. polynomial degree 1), and check whether the $\hat{\boldsymbol\theta}$ from your own implementation equals the estimate it returns.
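One way to make such a comparison is sketched below on synthetic data (the data and variable names are assumptions). Because both results are floating-point numbers, `np.allclose` is usually a safer check than `==`.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([0.5, 2.0]) + 0.1 * rng.normal(size=30)

# Estimate 1: normal equation.
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Estimate 2: np.linalg.lstsq returns (solution, residuals, rank, singular values).
theta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

# Compare with a tolerance rather than exact equality.
print(np.allclose(theta_normal, theta_lstsq))
```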

**Question 1.3** (10 points) Now use the estimated $\hat{\boldsymbol\theta}$ to *predict* the value of $y$ using the equation:

\begin{equation}
\mathbf{\hat{y}} = X\hat{\boldsymbol\theta}
\end{equation}

Calculate the prediction error of the linear regression model, i.e. the sum of squared differences between $\hat{y}$ and $y$:

\begin{equation}
\mathbf{E} = \sum_{j=0}^n |\hat{y}_j - y_j|^2
\end{equation}
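The two equations above can be sketched on a tiny hand-made example. The values of `X`, `y` and `theta_hat` below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical toy values, chosen so the result is easy to verify by hand.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 4.0])
theta_hat = np.array([1.0, 2.0])

y_hat = X @ theta_hat               # predictions: [1, 3, 5]
E = np.sum(np.abs(y_hat - y) ** 2)  # squared errors: 0 + 0 + 1
print(E)
```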


### Question 2 (30 points) Is it Instagrammable?

Consider any photo you take at Carcavelos beach:

![carcavelos](https://www.cm-oeiras.pt/pt/descobrir/patrimonio/PublishingImages/Paginas/fortesaojuliaobarra/CM145328.JPG)

<br>

An image is composed of three matrices, one for each RGB (red, green and blue) channel. Each matrix holds values between 0 and 255.


![image](https://static.packt-cdn.com/products/9781789613964/graphics/e91171a3-f7ea-411e-a3e1-6d3892b8e1e5.png)

In this exercise, you will apply a filter to the image in a process called convolution. The filter is a small matrix that slides over the image as a window: at each position, the filter and the image values under it are multiplied element by element, and the products are summed to give one output value. In the following animation, the filter is shown in yellow, the image channel in green, and the convolution result in red.

![convolution](https://icecreamlabs.com/wp-content/uploads/2018/08/33-con.gif)


Note that the resulting matrix has a smaller shape than the original. To keep the same shape, pad the image with a border of zeros so that the filter can be centered on every pixel, as shown in the following image:

<br>

![padding](https://media5.datahacker.rs/2018/11/sl_1.png)
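The sliding-window process with zero padding can be sketched for a single channel as follows. The function name `convolve2d_same` is hypothetical, and the loop version is written for readability rather than speed. It slides the filter without flipping it, as in the animation; for symmetric filters this gives the same result as a flipped filter.

```python
import numpy as np

def convolve2d_same(channel, kernel):
    """Slide `kernel` over a zero-padded 2-D `channel`; output keeps the input shape."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2  # padding needed on each side (odd-sized kernel assumed)
    padded = np.pad(channel, ((ph, ph), (pw, pw)), mode="constant", constant_values=0)
    out = np.zeros(channel.shape, dtype=float)
    for i in range(channel.shape[0]):
        for j in range(channel.shape[1]):
            window = padded[i:i + kh, j:j + kw]  # patch under the filter
            out[i, j] = np.sum(window * kernel)  # multiply element-wise, then sum
    return out

channel = np.arange(1, 10, dtype=float).reshape(3, 3)  # toy 3x3 "image channel"
kernel = np.ones((3, 3))                               # toy filter of ones
result = convolve2d_same(channel, kernel)
print(result.shape)  # same shape as `channel`
print(result)
```

With a filter of ones, each output value is simply the sum of the pixel's neighbourhood, so the center value is the sum of all nine entries and the corner values only see four non-zero entries.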

<br><br>

This question requires the `numpy` and `Pillow` libraries, which may already be installed in your environment. If you have a problem importing them, install them via pip. Example for Pillow: `pip install Pillow`

<br>

Consider the following code to convert an image into a numpy array:

```python
from PIL import Image
from numpy import asarray

# load the image
image = Image.open('carcavelos.jpg')

# convert image to numpy array
data = asarray(image)
```



You can save a numpy array as an image with the following code:

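```python
import numpy as np
from PIL import Image

# --- color image ---
# `data` is assumed to be a uint8 array of shape (height, width, 3)
# convert numpy array to image
img = Image.fromarray(data, 'RGB')

# save image as .png
img.save('image.png')

# --- grayscale image ---
# here `data` is assumed to be a 2-D array of shape (height, width)
# convert numpy array to image
data = data.astype(np.uint8)
img = Image.fromarray(data)

# save grayscale image as .png
img.save('image.png')
```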

Answer the following questions:

**Question 2.1** (5 points) What is the shape of the image? Answer with the shape and indicate what each dimension represents.

**Question 2.2** (10 points) Save the "carcavelos.jpg" image as a grayscale image into "carcavelos-grayscale.png". The image should be built from a numpy array with only one channel instead of 3. Use the following weights for each channel: `0.30*R + 0.59*G + 0.11*B`
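As a hedged illustration of the weighted sum, the sketch below collapses a hypothetical 2x2 RGB array into one channel. The toy pixel values and the rounding step are assumptions; the weights are the ones given above.

```python
import numpy as np

# Hypothetical 2x2 RGB image: red, green, blue and white pixels.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

weights = np.array([0.30, 0.59, 0.11])  # weights for R, G, B

# The matrix product contracts the last axis (the 3 channels): (2, 2, 3) -> (2, 2).
gray = rgb @ weights

# Round and convert back to 8-bit integers, as image libraries expect.
gray_u8 = np.round(gray).astype(np.uint8)
print(gray_u8.shape)
print(gray_u8)
```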

**Question 2.3** (15 points) Apply convolution operations as described above to "carcavelos.jpg". 
1. Save an image into "carcavelos-sharpen.png" with the applied filter: `np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])`
2. Save an image into "carcavelos-blur.png" with the applied filter: `np.array([[1,1,1],[1,1,1],[1,1,1]])/9`
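One detail worth keeping in mind: a filter with negative entries, such as the sharpen filter, can produce values below 0 or above 255, and casting those directly to 8-bit integers does not give meaningful pixel values. A possible approach, sketched here on hypothetical filtered channels, is to clip each channel before stacking the three back into one array. Clipping is a suggestion, not a requirement stated in the question.

```python
import numpy as np

# Hypothetical filtered channels (floats), including out-of-range values.
r = np.array([[-40.0, 12.7], [300.2, 255.0]])
g = np.array([[10.0, 20.0], [30.0, 40.0]])
b = np.array([[0.0, 128.0], [260.0, -1.0]])

# Clip each channel to [0, 255], then stack into a (height, width, 3) array.
filtered = np.stack([np.clip(c, 0, 255) for c in (r, g, b)], axis=-1)

# Convert to 8-bit integers; fractional parts are truncated.
pixels = filtered.astype(np.uint8)
print(pixels.shape)
```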

### Question 3 (40 points) My Heart Will Go On

![](https://camo.githubusercontent.com/78ca11f9a2e6c36bbee928124a7d3f9abc3abb2b/68747470733a2f2f696d672d73332e6f6e6564696f2e636f6d2f69642d3537616336353563393365613835613733323935343639652f7265762d302f7261772f732d613730613530323939633033303464336535383266356230373338613366653730396533613564662e6a7067)

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. The RMS Titanic was the largest ship afloat at the time it entered service and was the second of three Olympic-class ocean liners operated by the White Star Line. The Titanic was built by the Harland and Wolff shipyard in Belfast. Thomas Andrews, her architect, died in the disaster. Incorporating both historical and fictionalized aspects, the film Titanic is a 1997 American epic romance and disaster film based on accounts of the sinking of the RMS Titanic. It was directed, written, co-produced, and co-edited by James Cameron, and stars Leonardo DiCaprio and Kate Winslet as members of different social classes who fall in love aboard the ship during its ill-fated maiden voyage.

**Titanic dataset (titanic.csv)**

The `titanic.csv` file contains detailed information on the passengers aboard; each column is described in the data dictionary below.

Data Dictionary 

| Variable        | Definition           | Key  |
| ------------- |:-------------:| -----:|
| survived      | Survival | 0 = No, 1 = Yes |
| pclass      | Ticket class      |   1 = 1st, 2 = 2nd, 3 = 3rd |
| sex         | Gender   |      |
| age | Age in years      |     |
| sibsp | # of siblings / spouses aboard the Titanic      |   Sibling = brother, sister; Spouse = husband, wife |
| parch | # of parents / children aboard the Titanic      |     |
| fare | Passenger fare      |     |
| cabin | Cabin number      |     |
| embarked | Port of Embarkation     |   C = Cherbourg, Q = Queenstown, S = Southampton  |
| class | Class of tickets      |  First, Second, Third class   |
| who   | Identity              |  man, woman, child            |
| adult_male |  Is an adult male or not | True, False              |
| embark_town | The town of embarkation  | Cherbourg, Queenstown, Southampton |
| alive       | Same as `survived`  | no, yes |
| alone       | Is alone or not       | True, False |


Answer the following questions using the provided dataset. You can write down intermediate results obtained while working towards the final answers.

**Question 3.1** (10 points)

Read `titanic.csv` and show how many passenger records are in the data.

Due to errors in the historical archives, there are several problems you need to fix first in order to obtain the correct data:

1. In the *`sibsp`* column, the value 1 is mistakenly recorded as -1
2. In the *`survived`* column, the value 0 is mistakenly recorded as NaN
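The two corrections above can be sketched on a small hypothetical DataFrame. The toy values are made up; `replace` and `fillna` are standard pandas methods, and converting `survived` to integers afterwards is an optional extra step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reproducing the two recording errors.
df = pd.DataFrame({
    "sibsp": [-1, 0, 2, -1],
    "survived": [1.0, np.nan, np.nan, 1.0],
})

df["sibsp"] = df["sibsp"].replace(-1, 1)               # -1 was meant to be 1
df["survived"] = df["survived"].fillna(0).astype(int)  # NaN was meant to be 0

print(len(df))  # number of records (rows)
print(df)
```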

**Question 3.2** (5 points) Show how many male and female passengers there are, as a percentage of the total number of passengers:
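One way to get shares of a categorical column is `value_counts(normalize=True)`, sketched here on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data: 3 male passengers, 1 female passenger.
toy = pd.DataFrame({"sex": ["male", "female", "male", "male"]})

# normalize=True returns fractions of the total; multiply by 100 for percentages.
share = toy["sex"].value_counts(normalize=True) * 100
print(share)
```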

**Question 3.3** (5 points)
Show the average number of siblings/spouses for passengers who embarked at Southampton

**Question 3.4** (5 points) Show the median age of adult male passengers:
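Questions like this one combine a row filter with an aggregate. A minimal sketch with a boolean mask and `.loc`, on hypothetical toy data, could look like this; note that `median` skips missing ages by default.

```python
import pandas as pd

# Hypothetical toy data; one age is missing.
toy = pd.DataFrame({
    "adult_male": [True, False, True, True],
    "age": [30.0, 25.0, None, 40.0],
})

# Keep only rows where `adult_male` is True, select `age`, then aggregate.
median_age = toy.loc[toy["adult_male"], "age"].median()
print(median_age)
```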

**Question 3.5** (5 points) Show the difference between the mean fares of First Class and Third Class passengers:

**Question 3.6** (5 points) Show the survival status of passengers with the top 10 highest fares:
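For "top N" questions, `DataFrame.nlargest` is one option. The sketch below uses hypothetical toy data and takes the top 3 rather than the top 10 to keep the output short; when fares are tied, `nlargest` keeps the first occurrences by default.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "fare": [7.25, 512.33, 71.28, 8.05, 263.0],
    "survived": [0, 1, 1, 0, 0],
})

# Rows with the 3 highest fares, sorted from highest to lowest.
top = toy.nlargest(3, "fare")[["fare", "survived"]]
print(top)
```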

**Question 3.7** (5 points)
Show the survival rate of men, women and children, respectively:
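Because `survived` is coded as 0 and 1, the mean of that column within a group is the group's survival rate. A sketch with `groupby` on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "who": ["man", "woman", "child", "man", "woman", "child"],
    "survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors in that group.
rate = toy.groupby("who")["survived"].mean()
print(rate)
```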