## Crew Number PyTorch

### This notebook shows, step by step, how to train a model to predict the crew number


### 1. Import dependencies

In [1]:
import pandas as pd
import numpy as np
import torchvision
import torch.nn as nn
import torch
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.preprocessing import OneHotEncoder
import visuals as vs
from torch import optim
import os

### 2. Load the data and show preview

In [None]:
data = pd.read_csv("cruise_ship_info.csv")
data

### 3. Separate the crew (the value we want to predict) from the features

In [None]:
crew = data['crew']
features = data.drop('crew', axis = 1)
features.shape, crew.shape

### 4. Show basic statistics about the dataset

In [None]:
print("Statistics for Cruise Ship dataset:\n")
print("Minimum crew: {}".format(crew.min())) 
print("Maximum crew: {}".format(crew.max()))
print("Mean crew: {}".format(crew.mean()))
print("Median crew: {}".format(np.median(crew)))
print("Standard deviation of crew value: {}".format(crew.std()))
print("Number of crew values: {}".format(crew.count()))

The difference between the minimum and maximum crew size is large, so we can infer that the dataset covers both very small and very large ships.

### 5. Choose the best features for a crew predictor
At first I would certainly remove **Ship_name**, since it is unique to each ship and therefore carries no predictive information. Intuitively I would also remove **Age** and **Cruise_line**, but I would rather check the feature importance first. In my view **Tonnage**, **passengers**, **length**, **passenger_density** and **cabins** are all important, since the crew size should scale with the demand these features represent (the higher their values, the larger the crew).

In [None]:
# Delete the Ship_name column from the input data 
data_clean = features.drop('Ship_name', axis = 1)
data_clean[:3]

#### 5.1. One Hot Encode the Cruise_line feature

In [None]:
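# Get the one-hot encoding of the Cruise_line column
# (dtype=float keeps the indicator columns numeric, so the array converts cleanly to a tensor later)
one_hot = pd.get_dummies(data_clean['Cruise_line'], dtype=float)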
# Drop column Cruise_line as it is now encoded
data_clean = data_clean.drop('Cruise_line',axis = 1)
# Join the encoded df
data_clean = data_clean.join(one_hot)
print(data_clean.shape)
data_clean[:3]
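
To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row frame. The `toy` frame and its values are invented for illustration and are not taken from `cruise_ship_info.csv`.

```python
import pandas as pd

# Hypothetical toy frame, not taken from the real dataset
toy = pd.DataFrame({
    "Cruise_line": ["Azamara", "Carnival", "Azamara"],
    "Tonnage": [30.3, 47.3, 30.3],
})

# One indicator column per distinct cruise line; dtype=float keeps the frame numeric
toy_one_hot = pd.get_dummies(toy["Cruise_line"], dtype=float)
# Replace the text column with its indicator columns
toy_encoded = toy.drop("Cruise_line", axis=1).join(toy_one_hot)

print(toy_encoded)
```

Each row gets a 1.0 in the column of its own cruise line and 0.0 elsewhere, so the number of columns grows by the number of distinct cruise lines minus the one column that was dropped.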


#### 5.3 Check the feature importance using all the features (with Cruise_line one-hot encoded)

In [None]:
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot

X = data_clean.values
y = crew.values

lr = LinearRegression()
lr.fit(X,y)

In [None]:
importance = lr.coef_
# summarize feature importance
vs.feature_plot(importance,data_clean,crew)

#### 5.4 Drop columns with less important features
After checking the feature importance, we drop the columns that are not useful for our model.

In [None]:
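# Sort the feature indices from the largest to the smallest coefficient
indices = np.argsort(importance)[::-1]
# Count the features with a strictly positive coefficient
useful_features = len([v for v in importance if v > 0])
# Everything after the first `useful_features` indices has a coefficient <= 0
less_important = data_clean.columns.values[indices[useful_features:]]
print("Less important features: ", less_important)
data_clean = data_clean.drop(less_important, axis=1)
X = data_clean.values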
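
The selection rule above can be traced on a small example. This is a sketch with made-up coefficients; the names `toy_columns` and `toy_importance` are hypothetical and only illustrate the indexing.

```python
import numpy as np

# Hypothetical coefficients for four features (illustrative values only)
toy_columns = np.array(["a", "b", "c", "d"])
toy_importance = np.array([0.5, -0.2, 1.5, 0.0])

# Indices sorted from the largest to the smallest coefficient
order = np.argsort(toy_importance)[::-1]
# Number of strictly positive coefficients
n_useful = int((toy_importance > 0).sum())

kept = toy_columns[order[:n_useful]]
dropped = toy_columns[order[n_useful:]]
print("kept:", kept, "dropped:", dropped)
```

Note that this rule keeps a feature only if its coefficient is positive, so a feature with a large negative coefficient is dropped as well.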

### 6. Convert the data to PyTorch format

In [None]:
train_len = 95
valid_len = 63

# Convert dataset to PyTorch
dataset = TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32))
train_ds, val_ds = random_split(dataset, [train_len, valid_len])
# Use a batch size of 1, i.e. one ship per optimization step
train_loader = DataLoader(train_ds, batch_size=1)
val_loader = DataLoader(val_ds, batch_size=1)
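
The split sizes above are hard-coded for a dataset of 158 rows. As a hedged alternative, the lengths could be derived from the dataset size. The sketch below uses random stand-in tensors and an assumed 60/40 ratio; the `toy_*` names and the seed are illustrative choices, not part of the notebook.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in data: 158 rows, 5 features (random values)
toy_X = torch.randn(158, 5)
toy_y = torch.randn(158)
toy_dataset = TensorDataset(toy_X, toy_y)

# Assumed 60/40 split; for 158 samples this gives 95 and 63 rows
toy_train_len = round(0.6 * len(toy_dataset))
toy_valid_len = len(toy_dataset) - toy_train_len

# A seeded generator makes the random split reproducible across runs
generator = torch.Generator().manual_seed(0)
toy_train_ds, toy_val_ds = random_split(
    toy_dataset, [toy_train_len, toy_valid_len], generator=generator
)
print(len(toy_train_ds), len(toy_val_ds))
```

Computing the second length as the remainder guarantees that the two lengths always sum to the dataset size, which `random_split` requires.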

### 7. Create the architecture of the model

In [None]:
class ShipModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.linear = nn.Linear(input_size, 1)
        
    def forward(self, input_): 
        out = self.linear(input_)
        return out
model = ShipModel(data_clean.shape[1])
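
Since the model is a single `nn.Linear` layer, its forward pass is an affine map of the input. The following sketch checks this on random data; the input size of 4 and the batch of 3 rows are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: 4 input features, batch of 3 rows
layer = nn.Linear(4, 1)
x = torch.randn(3, 4)

# nn.Linear computes x @ W^T + b
manual = x @ layer.weight.T + layer.bias
out = layer(x)

print(out.shape)
print(torch.allclose(out, manual, atol=1e-6))
```

The output has one column per sample batch row, i.e. shape `(batch, 1)`, which is worth keeping in mind when comparing it with a target tensor of shape `(batch,)`.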

### 8. Set the main hyper-parameters
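The hyper-parameters must be chosen so that the model converges. If the learning rate is too high, the updates can overshoot and the loss may oscillate or diverge; if it is too small, training is slow and the model may still underfit after the chosen number of epochs. The number of epochs matters in a similar way: too few leaves the model underfit, while too many can lead to overfitting.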

In [None]:
# Mean squared error was chosen since it is a standard loss for regression problems
criterion = nn.MSELoss()
# The learning rate was tuned to fit the data without overfitting or underfitting
learning_rate = 0.000004
# SGD was used since it is a good general-purpose optimizer
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
# Epochs
epochs = 25
# Assign to the device
model.to("cpu")
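
To show what the learning rate controls, here is a minimal sketch of a single plain SGD step on a toy linear layer. The learning rate of 0.1 and the input values are illustrative assumptions, not the values used in this notebook.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

toy_model = nn.Linear(2, 1)
toy_lr = 0.1  # illustrative value, not the notebook's learning rate
toy_opt = optim.SGD(toy_model.parameters(), lr=toy_lr)

x = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[3.0]])

# Compute the loss and its gradients
loss = nn.MSELoss()(toy_model(x), target)
loss.backward()

# Expected weights after one plain SGD step: w - lr * grad
expected = toy_model.weight.detach() - toy_lr * toy_model.weight.grad
toy_opt.step()

print(torch.allclose(toy_model.weight.detach(), expected))
```

The step size is proportional to both the gradient and the learning rate, which is why unscaled features with large values tend to require a very small learning rate.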

### 9. Define training function

In [None]:
def train(model, trainloader, testloader, epochs, criterion, optimizer, model_name):
    steps = 0
    running_loss = 0
    print_every = 1
    train_losses, test_losses = [], []
    # Create the checkpoint directory (including 'models/') if it is missing
    save_dir = os.path.join('models', model_name)
    os.makedirs(save_dir, exist_ok=True)
    for epoch in range(epochs):
        for inputs, values in trainloader:
            steps += 1
            inputs, values = inputs.to("cpu"), values.to("cpu")
            # Forward pass; squeeze (batch, 1) to (batch,) to match the targets
            preds = model(inputs).squeeze(1)
            loss = criterion(preds, values)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            running_loss += loss.item()
            if steps % print_every == 0:
                # Evaluate on the validation set without tracking gradients
                test_loss = 0
                model.eval()
                with torch.no_grad():
                    for inputs, values in testloader:
                        inputs, values = inputs.to("cpu"), values.to("cpu")
                        preds = model(inputs).squeeze(1)
                        batch_loss = criterion(preds, values)
                        test_loss += batch_loss.item()

                train_losses.append(running_loss/print_every)
                test_losses.append(test_loss/len(testloader))

                val_loss_fix = round(test_loss/len(testloader),3)

                print(f"Epoch {epoch+1}/{epochs}.. "
                       f"Train loss: {running_loss/print_every:.3f}.. "
                       f"Validation loss: {val_loss_fix}.. ")
                running_loss = 0
                model.train()
        # Save once per epoch
        model_full_name = os.path.join(save_dir, 'epoch_'+str(epoch)+'_name_'+model_name+'_.pt')
        # This pickles the whole model object; the PyTorch docs recommend saving model.state_dict() instead
        torch.save(model, model_full_name)
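
An alternative to pickling the whole model object is to save only its `state_dict`, which is the approach the PyTorch documentation recommends. The sketch below is a stand-alone illustration: it uses a plain `nn.Linear(3, 1)` as a stand-in for `ShipModel`, and the file name `toy_checkpoint.pt` is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(3, 1)  # stand-in for ShipModel with 3 input features

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_checkpoint.pt")
    # Save only the learned parameters
    torch.save(toy_model.state_dict(), path)

    # Rebuild the same architecture, then load the parameters into it
    restored = nn.Linear(3, 1)
    restored.load_state_dict(torch.load(path))

# Both models should now produce the same prediction
x = torch.ones(1, 3)
same_output = torch.allclose(toy_model(x), restored(x))
print(same_output)
```

With this approach the class definition must be available when loading, but the checkpoint no longer depends on the exact module path under which the model was pickled.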

### 10. Train the model!

In [None]:
train(model, train_loader, val_loader, epochs, criterion, optimizer, "model_all")

### 11. Show some predictions and compare with ground truth

In [None]:
for i in range(10):
    print("ground-truth: ",float(val_ds[i][1]), " Predicted: ",float(model(val_ds[i][0])))

### 12. Measure Pearson Correlation Coefficient

In [None]:
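# Allocate arrays for the predictions and targets of each split
pred_train = np.ones(len(train_ds), dtype=np.float32)
y_train = np.ones(len(train_ds), dtype=np.float32)

pred_val = np.ones(len(val_ds), dtype=np.float32)
y_val = np.ones(len(val_ds), dtype=np.float32)

# Collect the predictions without tracking gradients
model.eval()
with torch.no_grad():
    for count, point in enumerate(train_ds):
        pred_train[count] = float(model(point[0]))
        y_train[count] = float(point[1])

    for count, point in enumerate(val_ds):
        pred_val[count] = float(model(point[0]))
        y_val[count] = float(point[1])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entries are Pearson's r
# (the third positional argument of np.corrcoef is `rowvar`, not a dtype, so it is omitted here)
corr_train = np.corrcoef(y_train, pred_train)
print(corr_train)

corr_val = np.corrcoef(y_val, pred_val)
print(corr_val)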

The Pearson correlation coefficient is close to 1, which indicates a strong linear relationship between the predicted and the true crew values.
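
For reference, the coefficient can also be computed by hand from its definition, the covariance divided by the product of the standard deviations. The sketch below uses made-up targets and predictions (`y_true` and `y_pred` are hypothetical, not outputs of the model above) and compares the result with `np.corrcoef`.

```python
import numpy as np

# Hypothetical targets and predictions (illustrative values only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Pearson r = sum(dt * dp) / sqrt(sum(dt^2) * sum(dp^2)), with centered values
dt = y_true - y_true.mean()
dp = y_pred - y_pred.mean()
r_manual = (dt * dp).sum() / np.sqrt((dt ** 2).sum() * (dp ** 2).sum())

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(y_true, y_pred)[0, 1]
print(r_manual, r_numpy)
```

Keep in mind that Pearson's r measures linear association only: predictions that are consistently offset or scaled can still reach a value near 1, so it is best read alongside the validation loss.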

### 13. Regularization
Regularization is used to avoid overfitting. PyTorch's SGD optimizer supports L2 regularization through its `weight_decay` argument. Note that `weight_decay` defaults to 0, so the optimizer defined in section 8 applies no L2 penalty unless that argument is set.
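
As a minimal sketch of how L2 regularization can be switched on in `optim.SGD`, the example below sets `weight_decay` on a toy layer and checks one update by hand. The learning rate, decay value, and data are illustrative assumptions, not tuned values for this notebook.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

toy_model = nn.Linear(2, 1, bias=False)
toy_lr, toy_decay = 0.1, 0.01  # illustrative values only
toy_opt = optim.SGD(toy_model.parameters(), lr=toy_lr, weight_decay=toy_decay)

x = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[3.0]])
nn.MSELoss()(toy_model(x), target).backward()

# Keep copies of the weights and gradient from before the update
w_before = toy_model.weight.detach().clone()
grad = toy_model.weight.grad.clone()
toy_opt.step()

# With weight decay, the step uses grad + weight_decay * w
expected = w_before - toy_lr * (grad + toy_decay * w_before)
print(torch.allclose(toy_model.weight.detach(), expected))
```

Adding `weight_decay * w` to the gradient corresponds to an L2 penalty of `(weight_decay / 2) * ||w||^2` on the loss, which pulls the weights toward zero at every step.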