## Exercise 1:

#### The Breast Cancer Data was loaded into the data and target variables.

#### All data was split into train and test sets:

- train set: X_train, y_train

- test set: X_test, y_test

#### Find the distribution of target, y_train and y_test arrays (in percent).

#### In response, print the result to the console as shown below.

Tip: You can use np.unique(): https://numpy.org/doc/stable/reference/generated/numpy.unique.html

In [8]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 200)
np.set_printoptions(precision=2, suppress=True, linewidth=100)
raw_data = load_breast_cancer()

data = raw_data['data']
target = raw_data['target']

X_train, X_test, y_train, y_test = train_test_split(
    data, target, random_state=40, test_size=0.25
)


print(f'Target: {np.unique(target, return_counts=True)[1] / len(target)}')
print(f'y_train: {np.unique(y_train, return_counts=True)[1] / len(y_train)}')
print(f'y_test: {np.unique(y_test, return_counts=True)[1] / len(y_test)}')


Target: [0.37 0.63]
y_train: [0.39 0.61]
y_test: [0.31 0.69]


#### Notes:

- np.unique(target, return_counts=True): Finds the sorted unique elements of the target array and counts how many times each one occurs. It returns a tuple of two arrays: the unique elements and their respective counts.

- [1]: Selects the second array of that tuple, which holds the counts. Dividing it by len(target) gives each class's share as a fraction, so 0.37 corresponds to 37%.
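
To make the two notes above concrete, here is a minimal sketch on a small made-up label array (the `labels` values are an assumption for illustration, not the Breast Cancer targets). It unpacks both arrays returned by np.unique instead of indexing with [1], and shows how the fractions relate to percentages.

```python
import numpy as np

# Hypothetical labels, used only to illustrate the call
labels = np.array([0, 1, 1, 0, 1, 1, 1, 0])

# Unpack the tuple: sorted unique values and their counts
values, counts = np.unique(labels, return_counts=True)

# Share of each class as a fraction, and as a percentage
fractions = counts / len(labels)
percentages = 100 * fractions

print(f'values: {values}')
print(f'counts: {counts}')
print(f'fractions: {fractions}')
print(f'percentages: {percentages}')
```

Unpacking into named variables does the same work as indexing with [1], but keeps the class labels next to their counts.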

## Exercise 2:
#### Keep the same distribution of values in the y_train and y_test arrays as in target array.
#### In response, print the percentage distribution of target, y_train and y_test arrays.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    data, target, random_state=40, test_size=0.25, stratify=target
)


print(f'Target: {np.unique(target, return_counts=True)[1] / len(target)}')
print(f'y_train: {np.unique(y_train, return_counts=True)[1] / len(y_train)}')
print(f'y_test: {np.unique(y_test, return_counts=True)[1] / len(y_test)}')

Target: [0.37 0.63]
y_train: [0.37 0.63]
y_test: [0.37 0.63]


#### Notes:

- The stratify parameter in the train_test_split() function is used to ensure that the training and test sets maintain the same proportion of classes as in the original dataset.
- Without stratify, the split might result in an uneven distribution of classes, which could lead to a biased model or poor performance evaluation, especially in cases of imbalanced datasets.
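
The idea behind stratification can be sketched by hand with numpy alone: shuffle and split each class separately, so that both parts inherit the class mix. This is not scikit-learn's implementation. The helper name `stratified_split_indices`, its parameters and the synthetic 40/60 labels are all assumptions made for illustration.

```python
import numpy as np


def stratified_split_indices(labels, test_size=0.25, seed=0):
    """Hypothetical helper: split every class separately into train/test indices."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)  # positions of this class
        rng.shuffle(cls_idx)                     # shuffle within the class
        n_test = int(round(test_size * len(cls_idx)))
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)


# Synthetic labels: 40% class 0, 60% class 1
labels = np.array([0] * 40 + [1] * 60)
train_idx, test_idx = stratified_split_indices(labels)

train_frac = np.unique(labels[train_idx], return_counts=True)[1] / len(train_idx)
test_frac = np.unique(labels[test_idx], return_counts=True)[1] / len(test_idx)

print(f'train: {train_frac}')
print(f'test: {test_frac}')
```

Because the split is done per class, both parts keep the 40/60 ratio regardless of the seed. In practice, passing stratify=target to train_test_split() is the simpler route.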

## Exercise 3:

#### Using the normal equation and the numpy package, find the linear regression equation for the data in df (salary as a function of years).

#### In response, print the result to the console as shown below.

In [10]:
df = pd.DataFrame(
    {
        'years': [1, 2, 3, 4, 5, 6],
        'salary': [4000, 4250, 4500, 4750, 5000, 5250],
    }
)

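m = len(df)  # number of observations

X1 = df['years'].values  # feature
Y = df['salary'].values  # target

# Design matrix: a column of ones (intercept) next to the feature column
X1 = X1.reshape(m, 1)
bias = np.ones((m, 1))
X = np.append(bias, X1, axis=1)

# Normal equation: coefs = (X^T X)^-1 X^T Y
coefs = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, Y))
print(f'Linear regression: {coefs[0]:.2f} + {coefs[1]:.2f}x')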

Linear regression: 3750.00 + 250.00x


#### Notes: 

- np.dot(X.T, X): Computes the matrix product of the transpose of X with X (for 2-D arrays, np.dot performs matrix multiplication). This is the X^T X term of the normal equation, coefs = (X^T X)^-1 X^T Y.
- np.linalg.inv(...): Calculates the inverse of the matrix resulting from np.dot(X.T, X).
- np.dot(X.T, Y): Computes the product of the transpose of X with the target vector Y.
- The result, coefs, is an array containing the coefficients of the linear regression model: [intercept, slope].
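
As a cross-check on the steps above, the same coefficients can be obtained without forming the inverse explicitly. The sketch below is an alternative, not the solution the exercise asks for. It solves the normal equations with np.linalg.solve and compares the result with numpy's least-squares solver np.linalg.lstsq. The variable names are chosen for illustration.

```python
import numpy as np

years = np.array([1, 2, 3, 4, 5, 6], dtype=float)
salary = np.array([4000, 4250, 4500, 4750, 5000, 5250], dtype=float)

# Design matrix: intercept column of ones, then the feature
X = np.column_stack([np.ones_like(years), years])

# Solve (X^T X) coefs = X^T y as a linear system instead of inverting X^T X
coefs_solve = np.linalg.solve(X.T @ X, X.T @ salary)

# Least-squares solver applied directly to X and y
coefs_lstsq, *_ = np.linalg.lstsq(X, salary, rcond=None)

print(f'Linear regression: {coefs_solve[0]:.2f} + {coefs_solve[1]:.2f}x')
print(f'Solvers agree: {np.allclose(coefs_solve, coefs_lstsq)}')
```

Both approaches should reproduce the intercept and slope found with np.linalg.inv. Solving the system directly is generally preferred numerically, since inverting X^T X can lose accuracy when the columns of X are nearly collinear.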