# Lab 04: Predicting a Continuous Target (Fare Prediction)
**Name:** David Rodriguez-Mayorquin
**Date:** April 5, 2025  

## Introduction  
In this lab, we focus on regression, a type of machine learning used to predict continuous numeric targets. We will work with the Titanic passenger dataset from seaborn, which records details about each passenger's journey and the fare paid. Our objective is to build models that predict the fare accurately from the other features.  

We will begin by exploring and preparing the data, selecting appropriate features, and training a Linear Regression model. We will then compare it with alternative models like Ridge, Elastic Net, and Polynomial Regression to evaluate performance and interpretability.


## Section 1: Import and Inspect the Data

In [8]:
# Import libraries
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [9]:
# Load Titanic dataset from seaborn
titanic = sns.load_dataset("titanic")

# Display first 5 rows
titanic.head()

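| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |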


## Section 2: Data Exploration and Preparation

In [10]:
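# Impute missing ages with the median.
# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated.
titanic['age'] = titanic['age'].fillna(titanic['age'].median())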

# Drop rows where 'fare' is missing
titanic = titanic.dropna(subset=['fare'])

# Create 'family_size' feature
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# Convert categorical features
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})

# Preview the updated dataset
titanic.head()

# (Optional) Check how many rows remain after cleaning
titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    int64   
 3   age          891 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
 15  family_size  891 non-null    int64   
dtypes: bool(2), category(2), float64(2), int64(6), object(4)
memory usage: 87.6+ KB


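The cleaning steps in this section can be tried in isolation on a small frame. The sketch below is not the lab's actual data: `toy` is a hypothetical stand-in with made-up values, used only to show median imputation by assignment (which avoids the chained `inplace=True` pattern), dropping rows with a missing target, and the two derived columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (illustrative values only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "sibsp": [1, 1, 0, 0],
    "parch": [0, 0, 0, 2],
    "sex": ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, np.nan, 8.05],
})

# Assign the imputed column back instead of using fillna(..., inplace=True)
toy["age"] = toy["age"].fillna(toy["age"].median())

# Drop rows with a missing target; copy so later assignments act on an independent frame
toy = toy.dropna(subset=["fare"]).copy()

# Derived features, mirroring the cell above
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# Count remaining missing values per column
print(toy.isna().sum())
```

Checking `isna().sum()` after cleaning is a quick way to confirm that no missing values remain in the columns used for modeling.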


## Section 3: Feature Selection and Justification

In [12]:
# Case 1: age
X1 = titanic[['age']]
y1 = titanic['fare']

# Case 2: family_size
X2 = titanic[['family_size']]
y2 = titanic['fare']

# Case 3: age and family_size
X3 = titanic[['age', 'family_size']]
y3 = titanic['fare']

# Case 4: age, family_size, and sex
X4 = titanic[['age', 'family_size', 'sex']]
y4 = titanic['fare']
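Because all four cases share the same target, the feature sets can also be kept in one mapping and built in a loop. This is only a sketch of an alternative layout, not the lab's required structure: `feature_cases` and `datasets` are hypothetical names, and `toy` reuses the first four rows shown in Section 1 after the Section 2 transformations.

```python
import pandas as pd

# First four rows from Section 1, with sex mapped to 0/1 and family_size derived
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "family_size": [2, 2, 1, 2],
    "sex": [0, 1, 1, 1],
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

# Hypothetical mapping: case name -> list of feature columns
feature_cases = {
    "case1": ["age"],
    "case2": ["family_size"],
    "case3": ["age", "family_size"],
    "case4": ["age", "family_size", "sex"],
}

# The target is the same for every case
y = toy["fare"]

# Build (X, y) pairs for each case
datasets = {name: (toy[cols], y) for name, cols in feature_cases.items()}

for name, (X, target) in datasets.items():
    print(name, X.shape, target.shape)
```

Keeping the cases in one dictionary makes it easier to train and score every case with the same loop later on.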


### Reflection

**Why might these features affect a passenger’s fare?**  
- age: Older passengers may afford or choose higher-class tickets.
- family_size: Larger families may pay more in total or get group pricing.
- sex: There may be subtle patterns in fare by gender, depending on travel behavior or class selection.
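
These hypotheses can be checked against the data before modeling. The sketch below shows one possible check, assuming the cleaned columns from Section 2. It uses only the five rows displayed in Section 1, which are far too few to support any conclusion, so treat the printed numbers as a demonstration of the method rather than a result.

```python
import pandas as pd

# The five rows shown in Section 1, with sex mapped to 0/1 and family_size derived
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "family_size": [2, 2, 1, 2, 1],
    "sex": [0, 1, 1, 1, 0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Pearson correlation of each candidate feature with the target
corr_with_fare = toy.corr()["fare"].drop("fare")
print(corr_with_fare)

# Mean fare per group for the binary feature
mean_fare_by_sex = toy.groupby("sex")["fare"].mean()
print(mean_fare_by_sex)
```

On the full dataset, the same two calls would give a first indication of which features carry signal for fare.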

**List all available features:**  
`survived`, `pclass`, `sex`, `age`, `sibsp`, `parch`, `fare`, `embarked`, `class`, `who`, `adult_male`, `deck`, `embark_town`, `alive`, `alone`, `family_size`

**Which other features could improve predictions and why?**  
- pclass: Strongly tied to fare (1st class is more expensive).
- embarked: Records the port of embarkation, and fares may differ by departure port.
- deck: Cabin deck may correlate with fare, but it is recorded for only 203 of the 891 passengers.
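
Before these columns can feed a linear model, the categorical ones need a numeric encoding. The following is a minimal sketch of one possible approach, not the method used in this lab: it one-hot encodes `embarked` with `pd.get_dummies` and replaces the mostly missing `deck` with a hypothetical `deck_known` indicator. The values come from the five rows shown in Section 1.

```python
import pandas as pd

# pclass, embarked, deck, and fare for the five rows shown in Section 1
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3],
    "embarked": ["S", "C", "S", "S", "S"],
    "deck": [None, "C", None, "C", None],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Hypothetical indicator: flag whether a deck was recorded instead of imputing it
toy["deck_known"] = toy["deck"].notna().astype(int)

# One-hot encode the port; drop_first removes one redundant dummy column
X = pd.get_dummies(
    toy[["pclass", "embarked", "deck_known"]],
    columns=["embarked"],
    drop_first=True,
    dtype=int,
)
y = toy["fare"]

print(X)
```

Note that `pclass` is left as an integer here, which treats class as ordinal. One-hot encoding it as well would be an equally reasonable choice.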

**How many variables are in your Case 4?**  
**3 variables**: age, family_size, and sex

**Which variables did you choose for Case 4 and why?**  
I chose age, family_size, and sex because each may relate to social class and affordability, and therefore to ticket price.

