# Fundamentals of Data Analysis 2022.1

# Dataset

_Naval Propulsion Plants_: multiple regression (2 output variables), with each output variable estimated separately:
- 11934 samples;
- 16 real-valued features;
- 2 real-valued target variables, but only _GT Compressor decay state coefficient_ is estimated (_GT Turbine decay state coefficient_ is removed).

# 01. Download the corresponding dataset.

Link: http://archive.ics.uci.edu/ml/datasets/condition+based+maintenance+of+naval+propulsion+plants

After the download, the data were saved to _"../data/naval_data.txt"_.

# 02. Read the data.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
column_names = [
    "Lever position",
    "Ship speed",
    "Gas Turbine shaft torque",
    "GT rate of revolutions",
    "Gas Generator rate of revolutions",
    "Starboard Propeller Torque",
    "Port Propeller Torque",
    "Hight Pressure Turbine exit temperature",
    "GT Compressor inlet air temperature",
    "GT Compressor outlet air temperature",
    "HP Turbine exit pressure",
    "GT Compressor inlet air pressure",
    "GT Compressor outlet air pressure",
    "GT exhaust gas pressure",
    "Turbine Injecton Control",
    "Fuel flow",
    "GT Compressor decay state coefficient",
    "GT Turbine decay state coefficient"
]

In [4]:
# alternative: read data using read_csv
# raw_data = pd.read_csv("../data/naval_data.txt", sep="   ", header=None, engine='python')

# read data using read_fwf
raw_data = pd.read_fwf("../data/naval_data.txt", header=None)
raw_data.columns = column_names
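
Because `read_fwf` infers the column boundaries from the text layout, it can be worth checking its result against a plain whitespace parser. The following is a minimal sketch, assuming the file is whitespace-aligned like the two rows below. The `sample` text is illustrative: it is cut down to three columns and its spacing is an assumption, not a copy of the real file.

```python
import io

import numpy as np
import pandas as pd

# Illustrative sample: three columns only; the spacing is an assumption
# about the file layout, not taken from the real data file.
sample = (
    "   1.138   3.000    289.964\n"
    "   2.088   6.000   6960.180\n"
)

# Parse the same text twice: once by fixed-width inference, once by whitespace.
df_fwf = pd.read_fwf(io.StringIO(sample), header=None)
arr_txt = np.loadtxt(io.StringIO(sample))

print(df_fwf.shape)
print(np.allclose(df_fwf.to_numpy(), arr_txt))
```

If the two parsers disagreed on the real file, that would point to a column boundary that `read_fwf` inferred incorrectly.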

In [5]:
raw_data.head()

| | Lever position | Ship speed | Gas Turbine shaft torque | GT rate of revolutions | Gas Generator rate of revolutions | Starboard Propeller Torque | Port Propeller Torque | Hight Pressure Turbine exit temperature | GT Compressor inlet air temperature | GT Compressor outlet air temperature | HP Turbine exit pressure | GT Compressor inlet air pressure | GT Compressor outlet air pressure | GT exhaust gas pressure | Turbine Injecton Control | Fuel flow | GT Compressor decay state coefficient | GT Turbine decay state coefficient |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.138 | 3.0 | 289.964 | 1349.489 | 6677.38 | 7.584 | 7.584 | 464.006 | 288.0 | 550.563 | 1.096 | 0.998 | 5.947 | 1.019 | 7.137 | 0.082 | 0.95 | 0.975 |
| 1 | 2.088 | 6.0 | 6960.18 | 1376.166 | 6828.469 | 28.204 | 28.204 | 635.401 | 288.0 | 581.658 | 1.331 | 0.998 | 7.282 | 1.019 | 10.655 | 0.287 | 0.95 | 0.975 |
| 2 | 3.144 | 9.0 | 8379.229 | 1386.757 | 7111.811 | 60.358 | 60.358 | 606.002 | 288.0 | 587.587 | 1.389 | 0.998 | 7.574 | 1.02 | 13.086 | 0.259 | 0.95 | 0.975 |
| 3 | 4.161 | 12.0 | 14724.395 | 1547.465 | 7792.63 | 113.774 | 113.774 | 661.471 | 288.0 | 613.851 | 1.658 | 0.998 | 9.007 | 1.022 | 18.109 | 0.358 | 0.95 | 0.975 |
| 4 | 5.14 | 15.0 | 21636.432 | 1924.313 | 8494.777 | 175.306 | 175.306 | 731.494 | 288.0 | 645.642 | 2.078 | 0.998 | 11.197 | 1.026 | 26.373 | 0.522 | 0.95 | 0.975 |


In [8]:
# convert the DataFrame into a Numpy array
naval_array = raw_data.to_numpy()
naval_array

array([[1.1380000e+00, 3.0000000e+00, 2.8996400e+02, ..., 8.2000000e-02,
        9.5000000e-01, 9.7500000e-01],
       [2.0880000e+00, 6.0000000e+00, 6.9601800e+03, ..., 2.8700000e-01,
        9.5000000e-01, 9.7500000e-01],
       [3.1440000e+00, 9.0000000e+00, 8.3792290e+03, ..., 2.5900000e-01,
        9.5000000e-01, 9.7500000e-01],
       ...,
       [7.1480000e+00, 2.1000000e+01, 3.9003867e+04, ..., 8.3400000e-01,
        1.0000000e+00, 1.0000000e+00],
       [8.2060000e+00, 2.4000000e+01, 5.0992579e+04, ..., 1.1490000e+00,
        1.0000000e+00, 1.0000000e+00],
       [9.3000000e+00, 2.7000000e+01, 7.2775130e+04, ..., 1.7040000e+00,
        1.0000000e+00, 1.0000000e+00]])

# 03. If necessary, split the data into training (70%) and test (30%) sets using the appropriate scikit-learn function. Four NumPy arrays must be created: X_train, y_train, X_test, and y_test.

In [9]:
data = raw_data.copy()
data.drop(["GT Turbine decay state coefficient"],
          axis=1,
          inplace=True)

print(data.shape)

X = data.drop(["GT Compressor decay state coefficient"],
              axis=1)

y = data[["GT Compressor decay state coefficient"]]

X.shape

(11934, 17)


(11934, 16)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

In [11]:
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
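
The sizes produced by a 70/30 split of 11934 samples can be worked out by hand. The sketch below assumes the usual rounding rule for a fractional `test_size` (the test count is rounded up and the training set receives the remainder). The shuffled index split is only an illustration of the idea: it is a NumPy stand-in and does not reproduce the rows selected by `train_test_split` with `random_state=12`.

```python
import math

import numpy as np

n_samples = 11934
test_size = 0.3

# Assumed rounding rule: test count rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Illustrative shuffle split on row indices (not scikit-learn's own shuffling).
rng = np.random.default_rng(12)
indices = rng.permutation(n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

The resulting training size can be compared with `X_train.shape[0]` after the split.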

# 04. Append a column of 1s ([1 1 . . . 1]^T) as the last column of the training matrix X_train (call the result X_train_2). Repeat the procedure for the test matrix, calling the result X_test_2.

[StackOverflow: How to add an extra column to a NumPy array](https://stackoverflow.com/questions/8486294/how-to-add-an-extra-column-to-a-numpy-array)

In [11]:
X_train.shape

(8353, 16)

In [13]:
def add_ones_column(data_array: np.ndarray) -> np.ndarray:
    """Return a copy of data_array with a column of 1s appended as the last column."""
    length = data_array.shape[0]
    return np.c_[data_array, np.ones(length)]
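
A quick way to see what the helper does is to run it on a tiny array. This is a minimal sketch: the `toy` array is made up for illustration, and the function is repeated so that the block runs on its own.

```python
import numpy as np


def add_ones_column(data_array: np.ndarray) -> np.ndarray:
    # Append a column of 1s as the last column.
    length = data_array.shape[0]
    return np.c_[data_array, np.ones(length)]


# Made-up 3x2 array used only for illustration.
toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_2 = add_ones_column(toy)

print(toy_2.shape)
print(toy_2[:, -1])
```

The same result could be obtained with `np.hstack([toy, np.ones((toy.shape[0], 1))])`; `np.c_` is simply the shorter spelling.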

# 05. Compute the rank of the matrices X_train_2 and X_test_2. If necessary, adjust the matrices X_train_2 and X_test_2.

In [14]:
np.linalg.matrix_rank(add_ones_column(X_train))

14

In [17]:
np.linalg.matrix_rank(add_ones_column(X_test))

14

In [16]:
add_ones_column(X_train).shape

(8353, 17)
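
A rank of 14 with 17 columns means the matrices do not have full column rank, so an adjustment is needed. One possible adjustment is sketched below on synthetic data. It assumes the deficiency comes from constant columns (which become multiples of the column of 1s) and exactly duplicated columns; the first rows of the dataset suggest this but do not prove it. The helper `drop_redundant_columns` and the toy matrix are hypothetical, introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 1))
b = rng.normal(size=(50, 1))
constant = np.full((50, 1), 288.0)

# Toy matrix with one duplicated column and one constant column.
X_toy = np.hstack([a, a, b, constant])


def drop_redundant_columns(X):
    """Hypothetical helper: drop constant columns and exact duplicates."""
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        is_constant = col.max() == col.min()
        is_duplicate = any(np.array_equal(col, X[:, k]) for k in keep)
        if not (is_constant or is_duplicate):
            keep.append(j)
    return X[:, keep], keep


# Before: appending the 1s column leaves the matrix rank deficient.
X_toy_2 = np.c_[X_toy, np.ones(len(X_toy))]
print(np.linalg.matrix_rank(X_toy_2), X_toy_2.shape[1])

# After: the reduced matrix plus the 1s column has full column rank.
X_reduced, kept = drop_redundant_columns(X_toy)
X_reduced_2 = np.c_[X_reduced, np.ones(len(X_reduced))]
print(kept, np.linalg.matrix_rank(X_reduced_2), X_reduced_2.shape[1])
```

The column indices in `kept` should be chosen on the training matrix and then reused for the test matrix (for example `X_test[:, kept]`), so that both matrices keep the same features.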

# 06. Compute the QR decomposition of the training matrix, X_train_2 = QR, using the appropriate NumPy function.

In [18]:
Q, R = np.linalg.qr(add_ones_column(X_train))

In [19]:
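# Solve R @ coefs = Q.T @ y_train for the linear coefficients.
# Caution: the matrix has rank 14 but 17 columns, so R is numerically singular
# and the coefficients from this solve are not reliable.
coefs_lineares = np.linalg.solve(R, np.dot(Q.T, y_train))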
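
Solving a least-squares problem through QR works cleanly when the matrix has full column rank. The sketch below shows that case on synthetic, noiseless data, where the coefficients are known in advance. All names and values in it (`A`, `true_coefs`, `y_toy`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
A = np.c_[X_toy, np.ones(100)]  # full column rank: 3 random features + 1s column

# Made-up coefficients; the last entry plays the role of the intercept.
true_coefs = np.array([[2.0], [-1.0], [0.5], [3.0]])
y_toy = A @ true_coefs  # noiseless targets

# Reduced QR: Q is (100, 4) with orthonormal columns, R is (4, 4) upper triangular.
Q, R = np.linalg.qr(A)
coefs_qr = np.linalg.solve(R, Q.T @ y_toy)

# Reference solution from NumPy's least-squares routine.
coefs_lstsq, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print(np.allclose(coefs_qr, true_coefs))
print(np.allclose(coefs_qr, coefs_lstsq))
```

When the matrix is rank deficient, `np.linalg.lstsq` still returns a well-defined (minimum-norm) solution, whereas solving with a singular `R` does not.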

In [43]:
# a (near-)zero diagonal entry of R marks a column that depends on the columns before it
independent_vectors = np.abs(np.diag(R)) >= 1e-12
independent_vectors

array([ True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True, False,  True,  True,  True,  True, False])

In [46]:
np.allclose(add_ones_column(X_train), np.dot(Q, R))

True

In [59]:
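# X_train is a NumPy array at this point, so take the names from the DataFrame X;
# the appended column of 1s is the last column of the matrix passed to qr.
feature_names = list(X.columns) + ["column of 1s"]
for index, is_independent in enumerate(independent_vectors):
    if not is_independent:
        print(feature_names[index])

Port Propeller Torque
GT Compressor inlet air pressure
column of 1s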
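
The diagonal test on `R` can be tried on a small synthetic matrix where the dependent columns are known in advance, which also makes the mapping from position to column name easy to verify. The sketch below makes two assumptions: the column names and data are made up, and the tolerance is taken relative to the largest diagonal entry rather than as a fixed `1e-12`.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=50)
b = rng.normal(size=50)

# Column 1 duplicates column 0; the 1s column is a multiple of the constant column.
names = ["a", "copy of a", "b", "constant", "ones"]
A = np.column_stack([a, a, b, np.full(50, 288.0), np.ones(50)])

# mode="r" returns only the upper-triangular factor.
R = np.linalg.qr(A, mode="r")
diag = np.abs(np.diag(R))

# Relative tolerance (an assumption; the fixed 1e-12 threshold is an alternative).
tol = 1e-10 * diag.max()
dependent = [names[j] for j in np.flatnonzero(diag < tol)]
print(dependent)
```

Position `j` on the diagonal corresponds to column `j` of the matrix that was factorized, so the names list has to include the appended column of 1s. This test flags a column only when it depends on the columns to its left; `scipy.linalg.qr` with `pivoting=True` is a more robust option when that ordering cannot be relied on.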
