
# Module 1: Data Science Fundamentals
## Sprint 3: Intro to Modeling
## Subproject 1: Linear Algebra Refresher

Welcome to the final sprint of the Data Science Fundamentals module! We've done EDA, and we've learned to verify hypotheses about a change by testing whether that change is statistically significant. There's one more skill set every data scientist should have - modeling. Modeling comes in many flavors, such as statistical modeling and machine learning.

We will explore the fundamental idea that **data can be represented in a compressed way - with a model**. You can imagine the model as an approximation of a dataset. Using models not only allows us, humans, to understand data better, but also lets us make predictions about unseen data (we'll explore the concept of automatically learning relationships in the data).
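As a rough illustration of the "model as compression" idea, the sketch below fits a straight line to a synthetic dataset. The data, the seed, and the choice of `np.polyfit` are assumptions made for this example only; they are not part of the course material.

```python
import numpy as np

# Hypothetical dataset: 100 noisy points scattered around the line y = 2x + 1
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# Fit a straight line: the 100 points are summarized by just two numbers
slope, intercept = np.polyfit(x, y, deg=1)

# The fitted model can also predict y for an x that was not in the data
x_new = 12.0
y_pred = slope * x_new + intercept
print(slope, intercept, y_pred)
```

Here the two fitted parameters stand in for the whole dataset, which is what makes the model both a compact description and a tool for prediction.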

## Learning outcomes

- Basics of linear algebra: vector operations, matrix operations.
- Intermediate NumPy proficiency: broadcasting, vectorization, higher-level APIs.

## Linear Algebra Refresher

Go through all videos and exercises in [Khan Academy's linear algebra module](https://www.khanacademy.org/math/linear-algebra). After the course, you should have refreshed your knowledge of linear algebra basics: vectors, matrices, their operations, and the inner product.
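The operations listed above map directly onto NumPy. The following is a minimal sketch with made-up vectors and matrices, intended only as a quick reference while you work through the videos.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

vector_sum = u + v      # element-wise addition: [5, 7, 9]
scaled = 2 * u          # scalar multiplication: [2, 4, 6]
inner = np.dot(u, v)    # inner product: 1*4 + 2*5 + 3*6 = 32

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

product = A @ B         # matrix product (not element-wise): [[2, 1], [4, 3]]
transposed = A.T        # transpose: rows become columns
```

Note that `*` between two arrays is element-wise, while `@` performs the matrix product from linear algebra.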

## Advanced NumPy

We've had a chance to try out some NumPy in Sprint 1. We've barely scratched the surface - there's so much more you should know about NumPy APIs as a data scientist, which we'll focus on throughout this subproject. This subproject will mostly consist of exercises.

### Aggregations

By this point, you should be familiar with Pandas aggregation techniques. Often, you need to crunch big data with pure NumPy (after all, a Pandas Series is typically backed by a NumPy array under the hood), so it's worth being familiar with NumPy data aggregation techniques: sums and summary statistics (mean, median, quantiles, percentiles).

Go through [this](https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html) tutorial on NumPy aggregation APIs. By the end of this tutorial, you should be able to apply basic NumPy aggregation techniques, when needed.
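For orientation, here is a small sketch of the aggregation calls mentioned above, applied to a toy 2x3 array chosen for this example. The `axis` argument selects whether the aggregation collapses rows or columns.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

total = data.sum()               # sum of all elements: 21
col_sums = data.sum(axis=0)      # collapse rows, one sum per column: [5, 7, 9]
row_means = data.mean(axis=1)    # collapse columns, one mean per row: [2., 5.]
median = np.median(data)         # median of all six values: 3.5
p90 = np.percentile(data, 90)    # 90th percentile (linear interpolation): 5.5
q25 = np.quantile(data, 0.25)    # same idea with a fraction instead of a percent: 2.25
```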

### Broadcasting and vectorization

Get familiar with the concept of [broadcasting](https://cs231n.github.io/python-numpy-tutorial/#broadcasting). It allows NumPy to perform element-wise operations on arrays of different shapes.
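A minimal sketch of broadcasting, using small arrays invented for this example: a shape `(3,)` row and a shape `(2, 1)` column are each combined with a shape `(2, 3)` matrix without any explicit loop or copy.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
col = np.array([[100], [200]])       # shape (2, 1)

# The row is virtually repeated for each row of the matrix
plus_row = matrix + row              # [[11, 22, 33], [14, 25, 36]]

# The column is virtually repeated for each column of the matrix
plus_col = matrix + col              # [[101, 102, 103], [204, 205, 206]]

# A common use: subtract the per-column mean (shape (3,)) from every row
centered = matrix - matrix.mean(axis=0)
```

Shapes are compared from the trailing dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.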

Afterwards, get used to the concept of [vectorization](https://realpython.com/numpy-array-programming/#what-is-vectorization). This technique replaces explicit Python loops with whole-array operations, which often makes loop-based algorithms run many times faster.
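To make the idea concrete, the sketch below computes a sum of squares twice: once with an explicit Python loop and once with a single array expression. The function names are hypothetical and the array size is arbitrary; the actual speedup depends on your machine and on the array size, so try timing both with `%timeit`.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def sum_of_squares_loop(arr):
    """Loop-based version: one Python-level multiplication per element."""
    total = 0.0
    for x in arr:
        total += x * x
    return total

def sum_of_squares_vectorized(arr):
    """Vectorized version: the loop runs inside NumPy's compiled code."""
    return np.sum(arr * arr)

loop_result = sum_of_squares_loop(values)
vectorized_result = sum_of_squares_vectorized(values)
print(np.isclose(loop_result, vectorized_result))
```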

### Exercises

Implement exercises [26](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md#26-what-is-the-output-of-the-following-script-), [13](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md#13-create-a-10x10-array-with-random-values-and-find-the-minimum-and-maximum-values-) and [45](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md#45-create-random-vector-of-size-10-and-replace-the-maximum-value-by-0-). Check the solutions [here](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints_with_solutions.md). You can implement them in this notebook.

In [None]:
import numpy as np

#### 13. Create a 10x10 array with random values and find the minimum and maximum values

In [None]:
array = np.random.random((10,10))
print(array)

[[0.11349589 0.36859183 0.38280257 0.96000403 0.01918696 0.4770775
  0.18481897 0.2939184  0.82863523 0.31940507]
 [0.47502892 0.9662604  0.47836859 0.92811399 0.06791091 0.57020302
  0.42738582 0.28196161 0.38250416 0.28311315]
 [0.15899944 0.09360673 0.45639867 0.85185694 0.35963508 0.24309713
  0.02140269 0.06970917 0.36578393 0.57004118]
 [0.92729871 0.43606201 0.27762668 0.35657565 0.47270959 0.46650929
  0.4423446  0.88780343 0.26656417 0.7235139 ]
 [0.00128418 0.00755258 0.86771782 0.51737338 0.41042256 0.33151126
  0.55138    0.97679889 0.11247594 0.89366934]
 [0.46285533 0.82886194 0.73242847 0.00131373 0.18919213 0.89490859
  0.9420891  0.57686663 0.65071431 0.38639396]
 [0.45953573 0.75616008 0.01182896 0.13750912 0.39969753 0.67298642
  0.49124397 0.69700449 0.70220217 0.03610937]
 [0.88999946 0.37925733 0.18218067 0.05225879 0.1126304  0.50345763
  0.43461621 0.73362919 0.63560381 0.01020375]
 [0.16010764 0.59823547 0.91319821 0.97366681 0.39708826 0.51799925
  0.50499881 

In [None]:
array.min(),array.max()

(0.0012841844245141676, 0.9767988941203034)

#### 26. What is the output of the following script?

In [None]:
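# Built-in sum: the second argument is the start value, so 10 + (-1) = 9
print(sum(range(5), -1))
from numpy import *  # shadows the built-in sum with numpy.sum
# numpy.sum: the second argument is the axis, so the result is 10
print(sum(range(5), -1))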

9
10


#### 45. Create random vector of size 10 and replace the maximum value by 0

In [None]:
array_x = np.random.random(10)
array_x[array_x.argmax()] = 0
print(array_x)

[0.         0.39031574 0.73055401 0.50928489 0.1179906  0.6531963
 0.74946352 0.44411752 0.23223249 0.01877596]


To finish this subproject, we invite you to solve some more NumPy exercises, this time more advanced ones. From the document [here](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints.md), choose 2 exercises of difficulty level 3 and solve them. Afterwards, verify your solutions using the answer sheet [here](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hints_with_solutions.md).

### Subproject

-----

## Summary

Linear algebra basics will be fundamental for the upcoming subprojects of this sprint. You should be comfortable with vector and matrix operations by now, and be able to use NumPy, the de facto standard linear algebra library for Python. Be aware that although NumPy is usually much faster than pure Python, knowing how to vectorize your code is crucial to make it blazing fast.