# Exercise

During this exercise, you may want to refer to the tutorial 3 and 4 notebooks on the Pandas library for the methods you need. The main goal is to apply as much as possible of what you covered in Exploratory Data Analysis.

Import necessary libraries.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read the CSV file of the Titanic dataset.

Start exploring the dataset with attributes and functions such as shape, dtypes, columns, head(), and more.

Use the describe() method to get a general description of the dataset.

Use the count() method on the DataFrame. What do you see?

Use the isna() method with sum() to explore the number of missing values in more detail.
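
As a hint for the two steps above, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its column names are assumptions for illustration, not the Titanic file). It shows that count() reports non-missing values per column, while isna().sum() reports the missing ones.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the Titanic file) with a few missing values.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": ["C85", None, None, None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

non_null = toy.count()       # non-missing values per column
missing = toy.isna().sum()   # missing values per column

print(non_null)
print(missing)
```

For every column, the two numbers add up to the number of rows, which is why unequal counts across columns are a quick sign of missing data.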

Choose one **numerical** column and apply mean(), median(), std(), var(), and count(). You can use the num_df variable below, as it contains only the numerical columns of the original DataFrame.

In [None]:
# np.number matches all numeric dtypes (integers and floats)
num_df = df.select_dtypes(include=np.number)
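
The same idea can be tried on a small made-up DataFrame first. This is only a sketch: `toy`, `toy_num`, and the column names are assumptions for illustration. Note that Pandas std() and var() compute the sample versions (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); "name" is non-numeric and gets dropped.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "age": [20.0, 30.0, 40.0, 50.0],
    "fare": [10.0, 10.0, 20.0, 40.0],
})

# Keep only the numeric columns.
toy_num = toy.select_dtypes(include=np.number)

age = toy_num["age"]
summary = {
    "mean": age.mean(),
    "median": age.median(),
    "std": age.std(),      # sample standard deviation (ddof=1)
    "var": age.var(),      # sample variance (ddof=1)
    "count": age.count(),  # number of non-missing values
}
print(summary)
```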

Investigate categorical columns with value_counts(), unique(), and nunique() methods.

Apply boolean filtering with **loc** on any column. Challenge yourself with multiple conditions, and include categorical columns as well.
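
A minimal sketch of combining conditions with loc, using a made-up DataFrame (the `toy` frame and its columns are assumptions for illustration). Each condition needs its own parentheses, and the element-wise operators `&`, `|`, and `~` replace Python's `and`, `or`, and `not`.

```python
import pandas as pd

# Toy data (hypothetical) mixing categorical and numerical columns.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22, 38, 26, 54],
    "pclass": [3, 1, 3, 1],
})

# Rows where both conditions hold.
mask = (toy["sex"] == "female") & (toy["age"] > 30)
older_women = toy.loc[mask]

# Rows where either condition holds, keeping only two columns.
first_or_young = toy.loc[(toy["pclass"] == 1) | (toy["age"] < 25), ["sex", "age"]]

print(older_women)
print(first_or_young)
```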

Create a column that represents the **Embarked** column numerically.
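
One possible approach is a dictionary passed to map(). The sketch below uses a made-up frame; the code assignment in `port_codes` and the new column name `Embarked_num` are arbitrary assumptions, and any consistent mapping would do.

```python
import pandas as pd

# Toy data (hypothetical) with one missing port.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", None]})

# Hypothetical code assignment; any consistent mapping works.
port_codes = {"S": 0, "C": 1, "Q": 2}
toy["Embarked_num"] = toy["Embarked"].map(port_codes)

print(toy)
```

Values that are not keys of the dictionary, including missing ones, become NaN, so the new column ends up with a float dtype whenever the original has gaps.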

Plot a scatter chart to see the relationship between any two numerical columns.

# Your Turn

Try applying some of the other Pandas functionality we covered.

# Modelling

You will now use a Linear Regression model in pure NumPy (with Pandas). This is the simplest form of the model, since we skip the training part and only make predictions. Follow these steps:
* Read the random_sample.csv file and check the DataFrame's shape. Select the columns column_1, column_2, and column_3, and assign the resulting DataFrame to X. Get the NumPy array of X with the *values* attribute, then check X.shape.
* Select the target column from the DataFrame, get its NumPy array, and name it y. Check y.shape. If it is 1D, reshape it to (1000, 1).
* Define the weights **[25.97, 98.36, 81.87]** as a NumPy array and reshape it to (3, 1).
* Each weight corresponds to the column at the same index. For instance, the weight 25.97 belongs to column_1.

In [20]:
df = pd.read_csv('data/random_sample.csv')

In [21]:
df.shape

(1000, 4)

In [22]:
df.head()

Unnamed: 0,column_1,column_2,column_3,target
0,-0.539123,-0.607875,-0.837262,-142.338105
1,-1.406661,-0.611518,-0.755383,-158.520012
2,0.681891,1.044161,-0.489439,80.334597
3,0.015618,-1.405567,-0.370508,-168.176389
4,1.305479,0.81351,0.825416,181.493775


In [23]:
df.columns

Index(['column_1', 'column_2', 'column_3', 'target'], dtype='object')

In [30]:
X = df[['column_1', 'column_2', 'column_3']]

In [31]:
X

Unnamed: 0,column_1,column_2,column_3
0,-0.539123,-0.607875,-0.837262
1,-1.406661,-0.611518,-0.755383
2,0.681891,1.044161,-0.489439
3,0.015618,-1.405567,-0.370508
4,1.305479,0.813510,0.825416
...,...,...,...
995,0.224685,-0.399636,1.726964
996,-0.147780,-0.508140,-1.220712
997,0.640480,-0.466495,0.996571
998,-1.138833,0.622207,0.300474


In [32]:
X = X.values

In [33]:
X

array([[-0.5391227 , -0.60787526, -0.83726243],
       [-1.4066611 , -0.6115178 , -0.75538293],
       [ 0.68189149,  1.04416088, -0.48943944],
       ...,
       [ 0.6404798 , -0.46649538,  0.99657051],
       [-1.13883312,  0.62220714,  0.30047436],
       [-0.95643638, -1.29327296,  0.71975794]])

In [34]:
X.shape

(1000, 3)

In [46]:
# y = df['target']
# y = y.values
# y = y.reshape(1000, 1)
# y.shape

(1000, 1)

In [47]:
y = df[['target']]

In [49]:
y

Unnamed: 0,target
0,-142.338105
1,-158.520012
2,80.334597
3,-168.176389
4,181.493775
...,...
995,107.920980
996,-153.761065
997,52.341674
998,56.227124


In [50]:
y = y.values

In [51]:
y.shape

(1000, 1)

In [52]:
w = [25.97, 98.36, 81.87]

In [55]:
df.columns

Index(['column_1', 'column_2', 'column_3', 'target'], dtype='object')

In [56]:
w = np.array(w)

In [57]:
w.shape

(3,)

In [58]:
w = w.reshape(3, 1)

In [59]:
w.shape

(3, 1)

In [60]:
# weights
w

array([[25.97],
       [98.36],
       [81.87]])

To make a prediction, you just need to write the following in your code.

![prediction.png](attachment:prediction.png)

Here, **y** in the formula is not the same as the **y** we defined for our target column. It is the set of predictions computed from the given X features and w parameters. You can name it "pred" or "prediction" in the code.
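
The formula image is not reproduced here. Judging by the code that follows, the prediction is presumably the matrix product of X and w. A minimal sketch on two made-up samples (`X_toy`, `w_toy`, and `pred` are illustrative names) makes the shapes easy to follow:

```python
import numpy as np

# Two made-up samples with three features each: shape (2, 3).
X_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0]])

# Weights from the exercise, as a column vector: shape (3, 1).
w_toy = np.array([25.97, 98.36, 81.87]).reshape(3, 1)

# Matrix product; equivalent to X_toy.dot(w_toy). Result shape: (2, 1).
pred = X_toy @ w_toy

print(pred.shape)
print(pred)
```

Each prediction is the weighted sum of one row's features, so a (n, 3) matrix times a (3, 1) vector gives one prediction per row.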

In [63]:
w.shape

(3, 1)

In [62]:
X.shape

(1000, 3)

In [64]:
y_pred = X.dot(w)

In [65]:
y_pred.shape

(1000, 1)

In [66]:
y.shape

(1000, 1)

The next task is to measure how far the predictions deviate from y. Apply the formulas below in your code and get their results.

![error%20formula-2.png](attachment:error%20formula-2.png)

In [68]:
y_diff = y_pred - y

In [69]:
y_diff.shape

(1000, 1)

In [72]:
y_diff = np.abs(y_diff)

In [75]:
y_diff.shape

(1000, 1)

In [77]:
y_diff_sum = y_diff.sum()

In [78]:
y_diff_sum

4.840896983294722

In [80]:
n_samples = y_diff.size
n_samples

1000

In [82]:
mae = y_diff_sum / n_samples

In [83]:
mae

0.004840896983294722
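
The steps above amount to the mean absolute error (MAE), which can also be computed in a single expression. The sketch below uses made-up arrays (`y_true` and `y_hat` are illustrative names) and also shows why the earlier reshape to a column vector matters.

```python
import numpy as np

# Made-up targets and predictions, both as column vectors: shape (3, 1).
y_true = np.array([[3.0], [-1.0], [2.0]])
y_hat = np.array([[2.5], [0.0], [2.0]])

# Step by step, as in the cells above.
mae_steps = np.abs(y_hat - y_true).sum() / y_true.size

# The same value in one expression.
mae_short = np.mean(np.abs(y_hat - y_true))

# Pitfall: subtracting a 1D array from a column vector broadcasts
# to a (3, 3) matrix instead of giving element-wise differences.
broadcast_shape = (y_hat - y_true.ravel()).shape

print(mae_steps, mae_short, broadcast_shape)
```

Checking that both arrays have the same shape before subtracting is a cheap way to avoid a silently wrong error value.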

In the previous exercise, we already had predefined weights (parameters) for making predictions on the dataset. But how were these parameters obtained so that they give such precise predictions?

Get ready to dive into calculus in the upcoming tutorials.