# Linear Regression with scikit-learn – Predicting Cat Body Length

### Project Description:
This project applies linear regression techniques to predict the body length of cats using features such as weight and age. Both ordinary least squares (Linear Regression) and L2-regularized linear regression (Ridge) were implemented using scikit-learn to compare model performance.
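
For reference, the two objectives being compared can be written as follows. This is a sketch in notation introduced here, not taken from the notebook: $w$ are the feature coefficients, $b$ the intercept, $x_i$ the feature vector (weight, age) and $y_i$ the body length of cat $i$, and $\alpha$ the Ridge penalty strength. It follows the formulation in the scikit-learn documentation, where the intercept is not penalized.

```latex
J_{\text{OLS}}(w, b) = \sum_{i=1}^{n} \left( y_i - x_i^\top w - b \right)^2

J_{\text{Ridge}}(w, b) = \sum_{i=1}^{n} \left( y_i - x_i^\top w - b \right)^2 + \alpha \lVert w \rVert_2^2
```

Setting $\alpha = 0$ in the Ridge objective recovers ordinary least squares.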

### Objectives:
* Predict cat body length from weight and age using linear regression.
* Compare performance of unregularized and L2-regularized models (Ridge).
* Evaluate model accuracy using mean squared error (MSE) and R² score.
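
As a quick illustration of the two metrics named above, the following minimal sketch computes MSE and R² by hand and checks them against scikit-learn. The arrays `y_true` and `y_hat` are made-up values for illustration only and are not taken from the cat dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values for illustration only (not from the cat dataset)
y_true = np.array([19.0, 25.0, 37.0, 38.0])
y_hat = np.array([21.0, 24.0, 35.0, 40.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both hand-computed values should agree with scikit-learn
print("MSE matches:", np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print("R2 matches:", np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

MSE is expressed in squared units of the target, while R² is unitless and compares the model against always predicting the mean.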

### Public dataset source:
[Kaggle Cat Dataset](https://www.kaggle.com/datasets/joannanplkrk/its-raining-cats?select=cat_breeds_dirty.csv)
This dataset contains roughly 1,000 records covering 3 cat breeds (Maine Coon, Ragdoll and Angora). It includes each animal's breed, age, gender, body length, weight, fur colour and pattern, eye colour, sleeping and playing time, country (including latitude and longitude), etc. The data was artificially generated.

Data cleaning and exploratory data analysis were conducted in a separate notebook, available [here](https://github.com/emmaricci/machine-learning/blob/main/Data%20Wrangling/cats_wrangling.ipynb).

In [17]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score


In [3]:
# Set the file path and load the cleaned data
path = 'raining_cats_cleaned.csv'
df = pd.read_csv(path)
df.head()

| Unnamed: 0 | Breed | Age_in_years | Age_in_months | Gender | Neutered_or_spayed | Body_length | Weight | Fur_colour_dominant | Fur_pattern | Eye_colour | ... | Sleep_time_hours | Country | Latitude | Longitude | Age_bracket | Fur_colour_dominant_encoded | Fur_pattern_encoded | Eye_colour_encoded | Gender_encoded | Breed_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Angora | 0.25 | 3.0 | female | False | 19.0 | 2.0 | white | solid | blue | ... | 16.0 | France | 43.296482 | 5.36978 | 0-1 | 0 | 0 | 0 | 0 | 0 |
| 1 | Angora | 0.33 | 4.0 | male | False | 19.0 | 2.5 | white | solid | blue | ... | 16.0 | France | 43.61166 | 3.87771 | 0-1 | 0 | 0 | 0 | 1 | 0 |
| 2 | Angora | 3.0 | 36.0 | male | True | 38.0 | 5.0 | white | solid | yellow | ... | 14.0 | France | 43.296482 | 5.36978 | 3-4 | 0 | 0 | 2 | 1 | 0 |
| 3 | Angora | 1.17 | 14.04 | female | True | 25.0 | 3.0 | white | solid | yellow | ... | 17.0 | France | 45.76342 | 4.834277 | 1-2 | 0 | 0 | 2 | 0 | 0 |
| 4 | Angora | 5.83 | 69.96 | male | True | 37.0 | 4.6 | black | solid | green | ... | 16.0 | France | 48.864716 | 2.349014 | 5-6 | 2 | 0 | 1 | 1 | 0 |


In [4]:
# Select the features (weight, age in months) and the target (body length)
cols = ['Weight', 'Age_in_months']
X = df[cols].to_numpy()
y = df['Body_length'].to_numpy()

In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [20]:
# Create and fit the ordinary least squares model (no regularization)
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = lr.predict(X_test)

# Evaluate on the test set
mse = mean_squared_error(y_test, y_pred)
print("MSE is:", mse)
r2 = r2_score(y_test, y_pred)
print("R2 is:", r2)

MSE is: 100.28612981575007
R2 is: 0.6599603734898145
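
Because MSE is expressed in squared units, its square root (RMSE) is often easier to read. The short sketch below converts the MSE printed above into an RMSE; the value is copied from that output, and the result is in the same units as `Body_length`.

```python
import math

# Test-set MSE copied from the LinearRegression output above
mse = 100.28612981575007

# RMSE is in the same units as the target (Body_length)
rmse = math.sqrt(mse)
print(f"RMSE is: {rmse:.2f}")  # prints "RMSE is: 10.01"
```

An R² of about 0.66 means the two features account for roughly two thirds of the variance in body length on the test set.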


In [23]:
# Create and fit the Ridge (L2-regularized) model with a small penalty
model = Ridge(alpha=0.1)
model.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred = model.predict(X_test)

# Evaluate on the test set
mse = mean_squared_error(y_test, y_pred)
print("MSE is:", mse)
r2 = r2_score(y_test, y_pred)
print("R2 is:", r2)

MSE is: 100.28730078757665
R2 is: 0.6599564030821091


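With a small penalty (alpha=0.1), Ridge behaves almost identically to the unregularized linear regression: test MSE is 100.2873 versus 100.2861, and R² is 0.65996 for both models to five decimal places. With only two features and a penalty this small, the L2 term contributes little to the objective, so little coefficient shrinkage is expected.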
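
To see how the penalty strength changes this picture, here is a minimal sketch that refits Ridge over several `alpha` values and measures how far the coefficients move from the unregularized fit. It uses synthetic stand-in data, not the cat dataset: the feature ranges, coefficients, noise level and the list of `alpha` values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in data (assumption): two features loosely modelled on
# Weight and Age_in_months; the coefficients and noise level are made up.
rng = np.random.default_rng(42)
n = 800
weight = rng.uniform(2.0, 9.0, size=n)
age_months = rng.uniform(3.0, 120.0, size=n)
X = np.column_stack([weight, age_months])
y = 15.0 + 3.0 * weight + 0.05 * age_months + rng.normal(0.0, 5.0, size=n)

# Unregularized reference fit
ols = LinearRegression().fit(X, y)

# Refit Ridge with increasing penalties and record the L2 distance
# between the Ridge coefficients and the OLS coefficients
gaps = {}
for alpha in [0.1, 10.0, 1000.0, 100000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    gaps[alpha] = float(np.linalg.norm(ridge.coef_ - ols.coef_))
    print(f"alpha={alpha:>8}: coef={ridge.coef_}, gap to OLS={gaps[alpha]:.6f}")
```

On data like this, the gap is tiny at `alpha=0.1` and grows as the penalty increases, which is consistent with the near-identical scores reported above. Repeating the sweep on the real training split would show whether any larger `alpha` helps on the test set.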