## Regression with a Crab Age Dataset

### Training a regression model

This notebook trains a regression model on the Regression with a Crab Age Dataset provided on Kaggle.
First, we import the libraries needed for data manipulation and modeling: pandas, NumPy, and scikit-learn.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

### Loading and pre-processing the data

Now we load the training data, downloaded from the Kaggle website as train.csv, into a pandas DataFrame with the read_csv function.

In [8]:
data = pd.read_csv('train.csv', header=0, sep=',')
print(data.head())

   id Sex  Length  Diameter  Height     Weight  Shucked Weight  \
0   0   I  1.5250    1.1750  0.3750  28.973189       12.728926   
1   1   I  1.1000    0.8250  0.2750  10.418441        4.521745   
2   2   M  1.3875    1.1125  0.3750  24.777463       11.339800   
3   3   F  1.7000    1.4125  0.5000  50.660556       20.354941   
4   4   I  1.2500    1.0125  0.3375  23.289114       11.977664   

   Viscera Weight  Shell Weight  Age  
0        6.647958      8.348928    9  
1        2.324659      3.401940    8  
2        5.556502      6.662133    9  
3       10.991839     14.996885   11  
4        4.507570      5.953395    8  
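
Before modeling, it can help to check for missing values and to see which columns are non-numeric. The following is a minimal sketch, assuming only the first three rows printed above (inlined as text so it runs without train.csv); the names `sample`, `missing_per_column`, and `non_numeric_columns` are illustrative and not part of the original notebook. In the notebook, the same calls could be applied to `data`.

```python
import io
import pandas as pd

# First three rows of the output above, inlined so the sketch runs without train.csv.
csv_text = """id,Sex,Length,Diameter,Height,Weight,Shucked Weight,Viscera Weight,Shell Weight,Age
0,I,1.5250,1.1750,0.3750,28.973189,12.728926,6.647958,8.348928,9
1,I,1.1000,0.8250,0.2750,10.418441,4.521745,2.324659,3.401940,8
2,M,1.3875,1.1125,0.3750,24.777463,11.339800,5.556502,6.662133,9
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column and list the columns that are not numeric.
missing_per_column = sample.isna().sum()
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

On this sample, `Sex` is the only non-numeric column, which is why it needs encoding before it can be passed to a linear model.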


### Splitting the data into features and target

Next, we separate the features (independent variables) from the target variable (Age). The id column is dropped from the features as well, since it only identifies the row.

In [9]:
X = data.drop(['id', 'Age'], axis=1)
y = data['Age']

### Splitting into training and testing sets

Next, we split the data into training and testing sets so that we can evaluate the trained model: 20% of the data is allocated to testing and the remaining 80% to training. Before splitting, we one-hot encode the 'Sex' column, which creates a binary variable for each unique value ('I', 'M', 'F'), because the LinearRegression model expects numerical input only.

In [13]:
X_encoded = pd.get_dummies(X, columns=['Sex'])
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
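
To make the two steps above concrete, here is a small sketch on a made-up ten-row frame (the values and the names `toy`, `toy_target`, and `toy_encoded` are hypothetical, not taken from the crab data). It shows the indicator columns that `pd.get_dummies` produces and the row counts that `test_size=0.2` yields.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same categorical column as the crab data (values are made up).
toy = pd.DataFrame({
    "Sex": ["I", "M", "F", "I", "M", "F", "I", "M", "F", "I"],
    "Length": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9],
})
toy_target = pd.Series(range(10), name="Age")

# One indicator column per category: Sex_F, Sex_I, Sex_M.
toy_encoded = pd.get_dummies(toy, columns=["Sex"])

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_encoded, toy_target, test_size=0.2, random_state=42
)

print(list(toy_encoded.columns))
print(len(X_tr), len(X_te))
```

With ten rows, the split leaves eight rows for training and two for testing; fixing `random_state` makes the split reproducible between runs.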

### Training the regression model

At this point we create an instance of the linear regression model and fit it to the training data.

In [14]:
model = LinearRegression()
model.fit(X_train, y_train)
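
A fitted LinearRegression exposes its learned weights through `coef_` and `intercept_`. The sketch below uses synthetic data with a known relationship (not the crab dataset) to show how the coefficients can be paired with column names; `features`, `target`, and `demo_model` are illustrative names. In the notebook, the analogous call would be `pd.Series(model.coef_, index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (not the crab dataset): target = 2*a - 3*b + 5, with no noise.
rng = np.random.default_rng(0)
features = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
target = 2 * features["a"] - 3 * features["b"] + 5

demo_model = LinearRegression().fit(features, target)

# Pair each learned coefficient with its column name.
coefficients = pd.Series(demo_model.coef_, index=features.columns)
print(coefficients)
print(demo_model.intercept_)
```

Because the synthetic target is an exact linear function of the inputs, the fit recovers coefficients close to 2 and -3 and an intercept close to 5.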

### Making predictions on the test set

We use the trained model to make predictions on the test set.

In [15]:
y_pred = model.predict(X_test)

### Evaluating the model's performance

Finally, we calculate the mean absolute error (MAE) between the predicted and actual ages of the crabs.

In [16]:
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Mean Absolute Error: 1.488616915780394
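
MAE is the average of the absolute differences between actual and predicted values. As a rough illustration, the sketch below computes it by hand on five made-up ages, checks the result against `mean_absolute_error`, and compares it with a naive baseline that always predicts the median. The arrays are hypothetical; in the notebook, a comparable baseline would use the median of `y_train`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up ages for illustration only.
actual = np.array([9, 8, 9, 11, 8])
predicted = np.array([8.5, 8.0, 9.5, 10.0, 8.5])

# MAE = mean of |actual - predicted|.
mae_manual = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

# Baseline: always predict the median of the actual ages.
baseline = np.full(actual.shape, np.median(actual), dtype=float)
mae_baseline = mean_absolute_error(actual, baseline)

print(mae_manual, mae_sklearn, mae_baseline)
```

Here the hand computation and scikit-learn agree at 0.5, against 0.8 for the median baseline. A model is only useful if its MAE is clearly below that of such a baseline.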


### Making predictions with test data

We now use the trained model to make predictions on the test data provided by Kaggle.

### Pre-processing the data

We load the test dataset, downloaded from Kaggle as test.csv, with the pandas read_csv function.

In [18]:
test_data = pd.read_csv('test.csv', header=0, sep=',')
print(test_data.head())

      id Sex  Length  Diameter  Height     Weight  Shucked Weight  \
0  74051   I  1.0500    0.7625  0.2750   8.618248        3.657085   
1  74052   I  1.1625    0.8875  0.2750  15.507176        7.030676   
2  74053   F  1.2875    0.9875  0.3250  14.571643        5.556502   
3  74054   F  1.5500    0.9875  0.3875  28.377849       13.380964   
4  74055   I  1.1125    0.8500  0.2625  11.765042        5.528153   

   Viscera Weight  Shell Weight  
0        1.729319      2.721552  
1        3.246018      3.968930  
2        3.883882      4.819415  
3        6.548735      7.030676  
4        2.466407      3.331066  


Then we apply to the test data the same preprocessing steps used for the training data: dropping the unnecessary id column and one-hot encoding the categorical Sex column.

In [19]:
test_data = test_data.drop('id', axis=1)
test_data_encoded = pd.get_dummies(test_data, columns=['Sex'])
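
Encoding the training and test frames separately works only if both contain the same set of categories. If a category were missing from the test data, the encoded columns would no longer match what the model was fitted on. A hedged sketch of one way to guard against this, using made-up toy frames and `DataFrame.reindex`:

```python
import pandas as pd

# Toy frames (made-up values): the test frame happens to contain no 'F' rows.
train_toy = pd.DataFrame({"Sex": ["I", "M", "F"], "Length": [1.0, 1.2, 1.4]})
test_toy = pd.DataFrame({"Sex": ["M", "I"], "Length": [1.1, 1.3]})

train_enc = pd.get_dummies(train_toy, columns=["Sex"])
test_enc = pd.get_dummies(test_toy, columns=["Sex"])

# Match the training column order; categories absent from the test data
# become all-zero columns.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(list(test_aligned.columns))
```

In the notebook, the equivalent step would be `test_data_encoded.reindex(columns=X_encoded.columns, fill_value=0)`. Here it would act only as a safety check, since the prediction step runs on the separately encoded frames as they are.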

### Predictions

Now we use the trained model to make predictions on the preprocessed test data and examine the predicted ages.

In [20]:
predictions = model.predict(test_data_encoded)
print(predictions)

[ 7.72888916  7.69986598 10.40461034 ... 12.34460125 10.00759867
 12.65155756]


### Exporting predictions

We create a submission CSV file containing the id of each crab and its predicted Age. Because we dropped the id column earlier, we read test.csv again to recover it, then build the submission with the pandas DataFrame constructor. Note that astype(int) truncates each prediction to its integer part rather than rounding it.

In [26]:
test_data_fin = pd.read_csv('test.csv', header=0, sep=',')
submission_data = pd.DataFrame({
    'id': test_data_fin['id'],
    'Age': predictions.astype(int)
})
submission_data.to_csv('submission.csv', index=False)
print(submission_data.head())

      id  Age
0  74051    7
1  74052    7
2  74053   10
3  74054    9
4  74055    7
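
Converting with `astype(int)` drops the fractional part, so a prediction of 7.73 becomes 7. The sketch below contrasts that with rounding to the nearest integer, using the first three predicted values printed earlier. Whether integer ages are required at all is an assumption here; if they are, rounding will usually land closer to the model's raw prediction than truncation does.

```python
import numpy as np

# First three predicted values from the output shown earlier.
preds = np.array([7.72888916, 7.69986598, 10.40461034])

truncated = preds.astype(int)         # drops the fractional part
rounded = np.rint(preds).astype(int)  # rounds to the nearest integer

print(truncated)
print(rounded)
```

For these three values, truncation gives 7, 7, 10 while rounding gives 8, 8, 10.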
