
## Abalone Age Prediction Model

### Data Overview:
- The dataset contains information about abalones, a type of marine mollusk, and their physical measurements.
- Features include sex, length, diameter, height, and several weight measurements.
- The target variable is the number of rings, which is commonly used as an indicator of the abalone's age.

### Data Preprocessing:
- **Handling Missing Values**: No missing values were found in the training data.
- **Feature Engineering**: The categorical 'Sex' column was one-hot encoded with the first category dropped, turning it into numeric indicator columns.
- **Feature Scaling**: Features were standardized with StandardScaler (zero mean, unit variance). The scaler was fitted on the training split and then applied to the validation and test data.
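
The two preprocessing steps above can be sketched on a tiny, made-up table. The values and the name `toy` are illustrative assumptions, not rows from the competition data, and `dtype=float` is passed to `get_dummies` only to keep the indicator columns numeric in this sketch.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the abalone features (values are made up for illustration)
toy = pd.DataFrame({
    "Sex": ["F", "I", "M", "F"],
    "Length": [0.55, 0.16, 0.60, 0.63],
    "Height": [0.150, 0.025, 0.150, 0.145],
})

# One-hot encode 'Sex'; drop_first=True drops the first category ('F'),
# leaving the indicator columns Sex_I and Sex_M
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=float)

# Standardize every column to zero mean and unit variance
toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(encoded)

print(list(encoded.columns))
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Dropping the first category avoids a redundant indicator column, since a row that is neither 'I' nor 'M' must be 'F'.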

### Model Training:
- **Splitting Data**: The dataset was split into training and validation sets using an 80-20 split ratio (`test_size=0.2`, `random_state=42`).
- **Model Selection**: Linear Regression was chosen as the predictive model due to its simplicity and interpretability.
- **Model Training**: The model was trained on the training data after preprocessing.
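
As a rough illustration of the split-then-fit workflow and of why a linear model is easy to interpret, the sketch below uses synthetic data. The arrays `X_toy` and `y_toy` and the chosen coefficients are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target is a known linear function of three features plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.0, 1.0, size=(100, 3))
y_toy = 3.0 + X_toy @ np.array([5.0, 2.0, 1.0]) + rng.normal(0.0, 0.1, size=100)

# 80-20 split, mirroring the ratio used in this notebook
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit on the training part only
toy_model = LinearRegression().fit(X_tr, y_tr)

print(X_tr.shape, X_va.shape)
# Each coefficient is the change in the prediction per unit change of one feature
print(toy_model.coef_.round(2))
```

The fitted coefficients land close to the ones used to generate the data, which is the sense in which the model can be read directly.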

### Model Evaluation:
- **RMSLE Metric**: Root Mean Squared Logarithmic Error (RMSLE) was used to evaluate the model's predictions on the validation set. It penalizes relative rather than absolute errors; the model scored about 0.167.
- **Clipping Predictions**: Linear Regression can output negative values, which the RMSLE computation does not accept, so predictions were clipped at zero before calculating the metric.
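
To make the metric concrete, here is a minimal sketch that computes RMSLE by hand and compares it with the scikit-learn helper. The arrays `y_true` and `y_raw` are hypothetical values chosen so that one prediction is negative.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([11.0, 6.0, 8.0])
y_raw = np.array([10.0, -0.5, 9.0])  # hypothetical predictions, one of them negative

# Clip negatives to zero; mean_squared_log_error rejects negative inputs
y_clipped = np.clip(y_raw, a_min=0, a_max=None)

# RMSLE = sqrt(mean((log(1 + prediction) - log(1 + truth))^2))
rmsle_manual = np.sqrt(np.mean((np.log1p(y_clipped) - np.log1p(y_true)) ** 2))
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_clipped))

print(rmsle_manual, rmsle_sklearn)
```

Both values agree, which confirms that taking the square root of `mean_squared_log_error` gives the RMSLE used here.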

### Making Predictions on Test Data:
- After validation, the trained model was used to predict on the test dataset.
- The test data went through the same preprocessing as the training data (one-hot encoding, then scaling with the already fitted scaler) to keep the features consistent.
- The predicted values were saved to `submission.csv` for submission to the competition.
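
One practical detail of "same preprocessing" is that the encoded test columns must match the columns seen at fit time, in the same order. The sketch below shows one way to enforce this with `DataFrame.reindex`; the list `train_cols` and the toy frame `test_toy` are assumptions for illustration.

```python
import pandas as pd

# Columns the model was fitted on (toy example)
train_cols = ["Length", "Sex_I", "Sex_M"]

# Toy test set in which the category 'M' never occurs
test_toy = pd.DataFrame({"Sex": ["F", "I"], "Length": [0.5, 0.2]})

# Encoding alone yields only Sex_I, because 'M' is absent from this frame
test_enc = pd.get_dummies(test_toy, columns=["Sex"], drop_first=True, dtype=float)

# Reindex to the training columns: missing indicators are filled with 0,
# unexpected columns are dropped, and the order matches the training data
aligned = test_enc.reindex(columns=train_cols, fill_value=0.0)

print(aligned)
```

With all three 'Sex' categories present in both files the columns already line up, so this is a safeguard rather than a required step.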



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

In [2]:
# Load the data
train_data = pd.read_csv("/kaggle/input/playground-series-s4e4/train.csv")
test_data = pd.read_csv("/kaggle/input/playground-series-s4e4/test.csv")

In [3]:
train_data.head(3)

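| | id | Sex | Length | Diameter | Height | Whole weight | Whole weight.1 | Whole weight.2 | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | F | 0.55 | 0.43 | 0.15 | 0.7715 | 0.3285 | 0.1465 | 0.24 | 11 |
| 1 | 1 | F | 0.63 | 0.49 | 0.145 | 1.13 | 0.458 | 0.2765 | 0.32 | 11 |
| 2 | 2 | I | 0.16 | 0.11 | 0.025 | 0.021 | 0.0055 | 0.003 | 0.005 | 6 |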


In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90615 entries, 0 to 90614
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              90615 non-null  int64  
 1   Sex             90615 non-null  object 
 2   Length          90615 non-null  float64
 3   Diameter        90615 non-null  float64
 4   Height          90615 non-null  float64
 5   Whole weight    90615 non-null  float64
 6   Whole weight.1  90615 non-null  float64
 7   Whole weight.2  90615 non-null  float64
 8   Shell weight    90615 non-null  float64
 9   Rings           90615 non-null  int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 6.9+ MB


In [5]:
# Summary statistics for the numeric columns
train_data.describe()

In [6]:
train_data.tail(3)

| | id | Sex | Length | Diameter | Height | Whole weight | Whole weight.1 | Whole weight.2 | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|---|
| 90612 | 90612 | I | 0.435 | 0.33 | 0.095 | 0.3215 | 0.151 | 0.0785 | 0.0815 | 6 |
| 90613 | 90613 | I | 0.345 | 0.27 | 0.075 | 0.2 | 0.098 | 0.049 | 0.07 | 6 |
| 90614 | 90614 | I | 0.425 | 0.325 | 0.1 | 0.3455 | 0.1525 | 0.0785 | 0.105 | 8 |


In [7]:
# Check for missing values in each column
train_data.isnull().sum()

id                0
Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Whole weight.1    0
Whole weight.2    0
Shell weight      0
Rings             0
dtype: int64

In [8]:
# Splitting features and target variable
X = train_data.drop(columns=['Rings'])
y = train_data['Rings']

In [9]:
# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [10]:
# One-hot encoding for 'Sex' column
X_train_encoded = pd.get_dummies(X_train, columns=['Sex'], drop_first=True)
X_val_encoded = pd.get_dummies(X_val, columns=['Sex'], drop_first=True)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_val_scaled = scaler.transform(X_val_encoded)

In [11]:
# Train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val_scaled)

# Clip negative predictions to ensure they are non-negative
y_pred_clipped = np.clip(y_pred, a_min=0, a_max=None)

# Calculate RMSLE on clipped predictions
rmsle = np.sqrt(mean_squared_log_error(y_val, y_pred_clipped))
print("RMSLE with clipped predictions:", rmsle)


RMSLE with clipped predictions: 0.1667849272400637


In [12]:
# One-hot encode 'Sex' the same way as for the training data
test_data_processed_encoded = pd.get_dummies(test_data, columns=['Sex'], drop_first=True)

# Keep exactly the feature columns the scaler was fitted on, in the same order
# (this also drops 'Rings' if that column is present)
test_data_processed_encoded = test_data_processed_encoded.reindex(
    columns=X_train_encoded.columns, fill_value=0
)

# Scale the test data using the scaler fitted on the training data
test_scaled = scaler.transform(test_data_processed_encoded)

# Make predictions and clip them to be non-negative, as done for validation
test_predictions = np.clip(model.predict(test_scaled), a_min=0, a_max=None)

# Sanity check: there should be one prediction per test row
if test_predictions.shape[0] != test_data.shape[0]:
    print("Error: Number of predictions does not match the number of samples in the test dataset.")
else:
    # Create a submission DataFrame
    submission_df = pd.DataFrame({'id': test_data['id'], 'Rings': test_predictions})

    # Save the submission file
    submission_df.to_csv('submission.csv', index=False)


In [13]:
pd.read_csv("/kaggle/input/playground-series-s4e4/sample_submission.csv")

| | id | Rings |
|---|---|---|
| 0 | 90615 | 10 |
| 1 | 90616 | 10 |
| 2 | 90617 | 10 |
| 3 | 90618 | 10 |
| 4 | 90619 | 10 |
| ... | ... | ... |
| 60406 | 151021 | 10 |
| 60407 | 151022 | 10 |
| 60408 | 151023 | 10 |
| 60409 | 151024 | 10 |
