**Low Variance Features Considered**

In [66]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

In [68]:
titanic_df = pd.read_csv('titanic_dataset.csv')

# Select numeric columns for analysis
numeric_columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

# 1. Calculate the variance of the numeric features before scaling (on raw data)
variance_before_scaling = titanic_df[numeric_columns].var()

# Print the variance before scaling
print("Variance before scaling (on raw data):")
print(variance_before_scaling)

# 2. Identify low variance features before scaling (variance < 1.00)
low_variance_features_before_scaling = variance_before_scaling[variance_before_scaling < 1.00]

# Print low variance features before scaling
print("\nLow variance features before scaling (variance < 1.00):")
print(low_variance_features_before_scaling)

# 3. Standardize the data using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(titanic_df[numeric_columns])

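# 4. Calculate the variance of the scaled features
# Note: pandas .var() uses the sample variance (ddof=1), while StandardScaler
# standardizes with the population variance (ddof=0), so the values printed
# here are n / (n - 1) rather than exactly 1.
variance_after_scaling = pd.DataFrame(X_scaled, columns=numeric_columns).var()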

# Print the variance after scaling
print("\nVariance after scaling:")
print(variance_after_scaling)


Variance before scaling (on raw data):
Pclass       0.699015
Age        211.019125
SibSp        1.216043
Parch        0.649728
Fare      2469.436846
dtype: float64

Low variance features before scaling (variance < 1.00):
Pclass    0.699015
Parch     0.649728
dtype: float64

Variance after scaling:
Pclass    1.001124
Age       1.001403
SibSp     1.001124
Parch     1.001124
Fare      1.001124
dtype: float64
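
The scaled variances above are slightly larger than 1 because the two libraries use different denominators, not because scaling failed. A value of 1.001124 is consistent with 891 / 890 (assuming the usual 891-row Titanic training file), and the slightly larger value for Age is consistent with that column having fewer non-missing entries. The sketch below reproduces the effect on a small made-up frame; the `toy` data and its column names are hypothetical and only stand in for the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the Titanic dataset), used only to show the effect
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# StandardScaler divides by the population standard deviation (ddof=0)
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

n = len(toy)
var_sample = scaled.var()            # pandas default: ddof=1, gives n / (n - 1)
var_population = scaled.var(ddof=0)  # same convention as StandardScaler, gives 1

print("ddof=1:", var_sample.to_dict())
print("ddof=0:", var_population.to_dict())
print("n / (n - 1):", n / (n - 1))
```

Passing `ddof=0` to `.var()` is enough to get values of 1 after standardization. Either way, every standardized feature has the same variance, so a variance threshold only carries information when applied before scaling.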


In [54]:
# The following features most likely have no mechanical influence on the survival rate: Name, Embarked, PassengerId, and Ticket.
# Additionally, because their variances before scaling are below the 1.00 threshold, I will add Pclass and Parch to the list of
# features with no mechanical influence on the survival rate. The removal of these features can be found in notebookI.
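
The manual filter `variance < 1.00` can also be expressed with scikit-learn's `VarianceThreshold`. This is an alternative formulation, not the method used in this notebook, and it is shown on a hypothetical toy frame (the column names `narrow` and `wide` are made up). Note that `VarianceThreshold` uses the population variance (ddof=0), so its values differ slightly from pandas `.var()`, and that raw variance depends on each column's units.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data (not the Titanic dataset)
toy = pd.DataFrame({
    "narrow": [1, 2, 1, 2, 1, 2],     # population variance 0.25
    "wide": [0, 10, 20, 30, 40, 50],  # population variance well above 1
})

# Features whose variance does not exceed the threshold are flagged for removal
selector = VarianceThreshold(threshold=1.0)
selector.fit(toy)

mask = selector.get_support()          # boolean array, True = keep
kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()

print("variances:", dict(zip(toy.columns, selector.variances_)))
print("kept:", kept)
print("dropped:", dropped)
```

Because the threshold is applied on the raw scale, a column measured in small units (such as a class label from 1 to 3) will look "low variance" next to a column such as Fare regardless of how informative it is.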

In [70]:
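# Remove the features selected above (low variance or judged uninformative).
# Note: PassengerId is listed above but is not dropped here, so it remains in the output.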

# List of columns to remove
columns_to_remove = ['Name', 'Embarked', 'Ticket', 'Parch', 'Pclass']

# Remove the specified columns
titanic_df = titanic_df.drop(columns=columns_to_remove)

# Print the first few rows of the cleaned dataset (Visualize Final Data Frame)
print("\nFirst few rows of the cleaned dataset:")
print(titanic_df.head())



First few rows of the cleaned dataset:
   PassengerId  Survived     Sex   Age  SibSp     Fare Cabin
0            1         0    male  22.0      1   7.2500   NaN
1            2         1  female  38.0      1  71.2833   C85
2            3         1  female  26.0      0   7.9250   NaN
3            4         1  female  35.0      1  53.1000  C123
4            5         0    male  35.0      0   8.0500   NaN
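
Since `titanic_df` is overwritten in place, running the removal cell a second time raises a `KeyError`, because the columns are already gone. A minimal sketch of a re-runnable version, assuming that silently skipping missing columns is acceptable, uses the `errors="ignore"` option of `DataFrame.drop`. The one-row frame `df` below is hypothetical and only illustrates the behaviour.

```python
import pandas as pd

# Hypothetical one-row frame containing only some of the columns to remove
df = pd.DataFrame({"Name": ["x"], "Fare": [7.25], "Parch": [0]})

columns_to_remove = ["Name", "Embarked", "Ticket", "Parch", "Pclass"]

# errors="ignore" skips labels that are not present instead of raising KeyError
cleaned = df.drop(columns=columns_to_remove, errors="ignore")

# Running the same drop again is harmless
cleaned_again = cleaned.drop(columns=columns_to_remove, errors="ignore")

print(list(cleaned.columns))
print(list(cleaned_again.columns))
```

The trade-off is that a misspelled column name is also skipped without warning, so the default `errors="raise"` remains the safer choice when the cell runs only once.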
