In [8]:
import pandas as pd
import numpy as np

In [18]:
df = pd.read_csv('/content/titanic.csv')

In [19]:
# Handle missing values in 'Age' column by filling with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Handle missing values in 'Embarked' column by filling with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop irrelevant features
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

display(df.head())

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [12]:
# Separate features and target variable
X = df.drop('Survived', axis=1)
y = df['Survived']

# Perform one-hot encoding for categorical features
X = pd.get_dummies(X, columns=['Sex', 'Embarked'], drop_first=True)

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# display(X_train.head())
# display(X_test.head())
# display(y_train.head())
# display(y_test.head())

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
331,1,45.5,0,0,28.5,True,False,True
733,2,23.0,0,0,13.0,True,False,True
382,3,32.0,0,0,7.925,True,False,True
704,3,26.0,1,0,7.8542,True,False,True
813,3,6.0,4,2,31.275,False,False,True


Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
709,3,29.699118,1,1,15.2458,True,False,False
439,2,31.0,0,0,10.5,True,False,True
840,3,20.0,0,0,7.925,True,False,True
720,2,6.0,0,1,33.0,False,False,True
39,3,14.0,1,0,11.2417,False,False,False


Unnamed: 0,Survived
331,0
733,0
382,0
704,0
813,0


Unnamed: 0,Survived
709,1
439,0
840,0
720,1
39,1


In [20]:
from sklearn.linear_model import LinearRegression

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)



In [21]:
# Extract model coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

print("Model Coefficients:", coefficients)
print("Model Intercept:", intercept)



Model Coefficients: [-1.54594862e-01 -4.78914212e-03 -3.87912960e-02 -2.00730979e-02
  3.53564488e-04 -5.13381987e-01 -2.15383271e-02 -7.21893188e-02]
Model Intercept: 1.2876125346315552


In [22]:
# Predict on test data
y_pred = model.predict(X_test)

display(y_pred[:5])

array([0.11473865, 0.24810053, 0.1452758 , 0.86909317, 0.72196333])

In [14]:
from sklearn.metrics import mean_squared_error
import numpy as np

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

Mean Squared Error (MSE): 0.1350060004114436
Root Mean Squared Error (RMSE): 0.3674316268524575


In [15]:
from sklearn.metrics import r2_score

# Calculate R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")

# You can also calculate R² on the training set to check for overfitting
y_train_pred = model.predict(X_train)
r2_train = r2_score(y_train, y_train_pred)
print(f"R-squared (R²) on training set: {r2_train}")

R-squared (R²): 0.44327834502148467
R-squared (R²) on training set: 0.3845118957517023


In [25]:
# Train the model with a subset of features (e.g., 'Pclass', 'Fare')
X_train_subset = X_train[['Pclass', 'Fare']]
X_test_subset = X_test[['Pclass', 'Fare']]

model_subset = LinearRegression()
model_subset.fit(X_train_subset, y_train)



In [26]:

y_pred_subset = model_subset.predict(X_test_subset)


In [24]:
# Calculate R-squared score for the subset model
r2_subset = r2_score(y_test, y_pred_subset)
print(f"R-squared (R²) with Pclass and Fare features: {r2_subset}")

R-squared (R²) with Pclass and Fare features: 0.15918314079270957


### Limitations of R² and When to Use Other Metrics

While R-squared (R²) is a commonly used metric for evaluating linear regression models, it has several limitations:

-   **Does not indicate causality:** A high R² does not mean that the independent variables cause the changes in the dependent variable.
-   **Increases with more predictors:** On the training data, R² never decreases when you add more independent variables to the model, even if those variables are unrelated to the target. A higher training R² can therefore reward needless complexity and mask overfitting.
-   **Does not indicate the correctness of the model:** A high R² does not necessarily mean that the model is a good fit for the data, nor does a low R² necessarily mean that the model is a bad fit.
-   **Sensitive to outliers:** Outliers can significantly impact the R² value.
-   **Not appropriate for non-linear relationships:** R² measures how much of the target's variance a linear fit explains. When the true relationship is non-linear, a low R² may reflect the wrong functional form rather than a lack of predictive signal.
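
The point about R² growing with the number of predictors can be checked on synthetic data. The sketch below is illustrative only: it uses randomly generated data rather than the Titanic set, and `r2_ols` is a hypothetical helper that fits ordinary least squares with NumPy and scores it on the same data it was fitted on.

```python
import numpy as np

def r2_ols(X, y):
    """Fit OLS with an intercept and return R² on the data used for fitting."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    ss_res = np.sum((y - A @ beta) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)                     # synthetic data, fixed seed
n = 200
x_signal = rng.normal(size=(n, 1))                 # one predictor that drives y
y = 2.0 * x_signal[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 10))                   # ten predictors unrelated to y

r2_base = r2_ols(x_signal, y)
r2_noisy = r2_ols(np.column_stack([x_signal, noise]), y)

print(f"Training R² with 1 real predictor:      {r2_base:.4f}")
print(f"Training R² after adding 10 noise cols: {r2_noisy:.4f}")
```

The second value cannot be lower than the first, because the larger model can always reproduce the smaller one by setting the extra coefficients to zero. That guarantee holds only for the data used to fit the model, not for held-out data.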

Given these limitations, it's important to use R² in conjunction with other evaluation metrics and techniques, such as:

-   **Adjusted R²:** This metric adjusts the R² value based on the number of predictors in the model, penalizing the inclusion of unnecessary variables.
-   **Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):** MSE is the average of the squared errors, so large errors are penalized heavily. RMSE is its square root, which puts the error back in the units of the target variable.
-   **Mean Absolute Error (MAE):** This metric is the average absolute magnitude of the errors and is less sensitive to outliers than MSE or RMSE.
-   **Residual analysis:** Plotting residuals can help identify patterns or violations of the assumptions of linear regression.
-   **Cross-validation:** This technique helps assess how well the model generalizes to new data.
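
As a rough illustration of two of the metrics above, the sketch below computes adjusted R² from its usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), and contrasts MAE with RMSE when one error is much larger than the rest. The helper names (`adjusted_r2`, `mae`, `rmse`) and the toy numbers are assumptions made for this example, not values from the notebook. scikit-learn also provides `mean_absolute_error` in `sklearn.metrics`.

```python
import numpy as np

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Toy values (hypothetical): three small errors and one large one
y_true = np.array([0.0, 0.0, 0.0, 0.0])
y_pred = np.array([1.0, 1.0, 1.0, 10.0])

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
adj = adjusted_r2(r2=0.5, n_samples=101, n_features=8)

print("MAE: ", mae_value)
print("RMSE:", rmse_value)      # pulled upward by the single large error
print("Adjusted R²:", adj)      # lower than the raw R² of 0.5
```

In this notebook, a call such as `adjusted_r2(r2, len(y_test), X_test.shape[1])` would apply the same correction to the test-set score.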

### Comparison of Performance Between Training and Testing Sets

We trained the linear regression model on the training data and evaluated its performance on both the training and testing sets using the R² score.

-   **R² on the training set:** 0.385
-   **R² on the testing set:** 0.443

Comparing these values helps us detect potential overfitting or underfitting:

-   **Overfitting:** If the R² on the training set is significantly higher than on the testing set, it suggests that the model has learned the training data too well, including its noise and specific patterns, and does not generalize well to unseen data.
-   **Underfitting:** If the R² is low on both the training and testing sets, it may indicate that the model is too simple to capture the underlying patterns in the data, or that a linear model is not appropriate for this dataset.

In our case, the R² on the training set (0.385) is slightly lower than on the testing set (0.443). A model does not normally fit unseen data better than its own training data, so this gap most likely reflects the particular random split rather than a real advantage. Either way, there is no sign of overfitting. However, both R² values are relatively low, which is plausible given that `Survived` is a binary 0/1 outcome. It indicates that linear regression may not be the best fit for this dataset, or that factors influencing the target are missing from our current features.
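
A single train/test split gives only one draw of this comparison. The sketch below repeats an 80/20 split many times on synthetic data (not the Titanic set) and records the difference between training and testing R² each time. The helpers `fit_ols` and `r2` are hypothetical NumPy stand-ins for `LinearRegression` and `r2_score`, used only to keep the example self-contained.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients (intercept first) fitted on X, y."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r2(X, y, beta):
    """R² of the fitted coefficients on the given data."""
    A = np.column_stack([np.ones(len(y)), X])
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)          # synthetic data, fixed seed
n = 300
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=1.5, size=n)

gaps = []
for seed in range(50):
    idx = np.random.default_rng(seed).permutation(n)
    test_idx, train_idx = idx[:60], idx[60:]          # hold out 20%
    beta = fit_ols(X[train_idx], y[train_idx])
    r2_tr = r2(X[train_idx], y[train_idx], beta)
    r2_te = r2(X[test_idx], y[test_idx], beta)
    gaps.append(r2_tr - r2_te)

gaps = np.array(gaps)
print(f"mean (train R² - test R²): {gaps.mean():.3f}")
print(f"splits where test R² exceeds train R²: {(gaps < 0).sum()} of {len(gaps)}")
```

If a noticeable share of splits shows a higher test score, a small gap in either direction from one split should not be over-interpreted.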

### Impact of Feature Selection on Bias and Variance

Feature selection plays a crucial role in balancing bias and variance in a model:

-   **Bias:** Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model. A model with high bias makes strong assumptions about the underlying relationship between features and the target variable.
-   **Variance:** Variance refers to the model's sensitivity to small fluctuations in the training data. A model with high variance is highly influenced by the training data and may not generalize well to new data.

Here's how feature selection affects bias and variance:

-   **Removing irrelevant features:** Removing features that are not related to the target variable can help reduce variance by simplifying the model and making it less sensitive to noise in the data. This can potentially increase bias if important information is lost.
-   **Removing redundant features:** Removing features that are highly correlated with each other can also help reduce variance and improve model interpretability.
-   **Adding relevant features:** Including features that have a strong relationship with the target variable can help reduce bias by allowing the model to capture more complex patterns. This can potentially increase variance if the added features introduce noise or make the model too complex.

In our experiment, we trained a model on a subset of features ('Pclass' and 'Fare'). The test R² for this subset model (0.159) was much lower than the test R² for the model trained with all features (0.443). This suggests that removing features increased the bias of the model here: without predictors such as sex and age, it explains far less of the variation in survival. The simpler model likely has lower variance in the bias-variance sense, but at the cost of increased bias. The goal of feature selection is to find the balance between bias and variance that gives the best performance on unseen data.
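
One way to compare feature subsets more reliably than with a single split is k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic data. The column layout, coefficients, and subset names are assumptions made for illustration and do not correspond to the Titanic features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)                 # synthetic data, fixed seed
n = 400
X_full = rng.normal(size=(n, 4))
# Columns 0-2 carry signal; column 3 is pure noise.
y = 1.5 * X_full[:, 0] - 1.0 * X_full[:, 1] + 0.5 * X_full[:, 2] + rng.normal(size=n)

subsets = {
    "first feature only": [0],
    "first two features": [0, 1],
    "all features": [0, 1, 2, 3],
}

results = {}
for name, cols in subsets.items():
    # 5-fold cross-validated R² for a linear model restricted to these columns
    scores = cross_val_score(LinearRegression(), X_full[:, cols], y, cv=5, scoring="r2")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>20}: mean R² = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The mean score tracks bias (dropping informative columns lowers it), while the spread across folds gives a rough sense of variance. In this notebook, the same comparison could be run by passing `X[['Pclass', 'Fare']]` and the full `X` together with `y`.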