# Elite Sports Cars Analysis

This notebook performs an in-depth analysis of the Elite Sports Cars dataset, which contains information about 5,000 sports cars including features like horsepower, price, fuel efficiency, and more.

## Analysis Overview
1. Data Loading and Initial Exploration
2. Feature Engineering and Preprocessing
3. Exploratory Data Analysis
4. Prepare Data for Modeling
5. Model Training and Evaluation
6. Feature Importance Analysis
7. Market Segment Analysis
8. Brand Analysis
9. Conclusions

Let's begin by importing the necessary libraries and loading our data.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, RobustScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor, IsolationForest
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

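# Set style for better visualizations
# The 'seaborn' matplotlib style was renamed 'seaborn-v0_8' in matplotlib 3.6 and
# the old name was later removed; sns.set_theme() works across versions.
sns.set_theme(palette="husl")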
%matplotlib inline

## 1. Data Loading and Initial Exploration

Let's load our dataset and examine its basic properties.

In [None]:
# Load the dataset
df = pd.read_csv('Elite Sports Cars in Data.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
display(df.head())

print("\nDataset Info:")
df.info()  # prints its summary directly and returns None, so display() is not needed

print("\nBasic Statistics:")
display(df.describe())
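
Before engineering features, it can help to check each column for missing values and cardinality, since the later steps appear to assume complete, correctly typed columns. A minimal sketch follows; the helper name `summarize_columns` and the toy frame are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and unique count for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Brand": ["Ferrari", "Porsche", None, "Porsche"],
    "Horsepower": [710.0, 443.0, 503.0, np.nan],
    "Price": [330000, 120000, 150000, 118000],
})

summary = summarize_columns(toy)
print(summary)
print("Duplicate rows:", toy.duplicated().sum())
```

On the real data, the same helper could be called as `summarize_columns(df)` right after loading.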

## 2. Feature Engineering and Preprocessing

We'll create several helper functions to process our data and engineer new features.

In [None]:
def detect_outliers(df, columns):
    """Detect outliers using Isolation Forest"""
    iso_forest = IsolationForest(contamination=0.1, random_state=42)
    outliers = iso_forest.fit_predict(df[columns])
    return outliers == -1

def preprocess_data(df):
    """Preprocess data and engineer features"""
    # Create label encoders for categorical variables
    le_dict = {}
    categorical_cols = ['Brand', 'Country', 'Condition', 'Fuel_Type', 'Drivetrain', 
                       'Transmission', 'Popularity', 'Market_Demand']
    
    for col in categorical_cols:
        le_dict[col] = LabelEncoder()
        df[f'{col}_encoded'] = le_dict[col].fit_transform(df[col])
    
    # Create advanced features
    df['Age'] = 2025 - df['Year']
    df['Power_to_Weight'] = df['Horsepower'] / df['Weight']
    df['Performance_Index'] = (df['Horsepower'] * df['Torque']) / df['Weight']
    df['Maintenance_Score'] = df['Insurance_Cost'] * df['Number_of_Owners']
    df['Rarity_Score'] = 1 / df['Production_Units']
    
    # Log transform appropriate numerical features
    numeric_cols = ['Price', 'Mileage', 'Insurance_Cost', 'Production_Units']
    for col in numeric_cols:
        df[f'{col}_log'] = np.log1p(df[col])
    
    # Detect outliers
    outlier_cols = ['Price', 'Horsepower', 'Mileage', 'Insurance_Cost']
    df['is_outlier'] = detect_outliers(df, outlier_cols)
    
    # Create price segments
    df['PriceSegment'] = pd.qcut(df['Price'], q=4, labels=['Budget', 'Mid-Range', 'Premium', 'Luxury'])
    
    return df, le_dict

# Process the data
df, le_dict = preprocess_data(df)

# Display new features
print("Newly created features:")
new_features = ['Age', 'Power_to_Weight', 'Performance_Index', 'Maintenance_Score', 'Rarity_Score']
display(df[new_features].describe())
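
Note that `contamination=0.1` fixes the share of rows flagged: Isolation Forest sets its threshold so that roughly 10% of the fitted rows are labelled outliers, whether or not the data contains that many anomalies. The sketch below illustrates this on synthetic Gaussian data with no planted outliers (the data and variable names are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic, well-behaved data: no anomalies are planted here
toy = pd.DataFrame({
    "Price": rng.normal(150_000, 20_000, size=500),
    "Horsepower": rng.normal(500, 50, size=500),
})

iso_forest = IsolationForest(contamination=0.1, random_state=42)
flags = iso_forest.fit_predict(toy) == -1  # -1 marks rows scored as outliers

flagged_share = flags.mean()
print(f"Share flagged: {flagged_share:.1%}")
```

If the share of genuine anomalies is unknown, `contamination='auto'` or a look at the raw `score_samples` output may be a safer starting point than a fixed 10%.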

## 3. Exploratory Data Analysis

Let's visualize some key relationships in our data.

In [None]:
plt.figure(figsize=(20, 15))

# 1. Price Distribution
plt.subplot(2, 2, 1)
sns.histplot(data=df, x='Price_log', bins=30)
plt.title('Price Distribution (Log Scale)')

# 2. Price vs Horsepower
plt.subplot(2, 2, 2)
sns.scatterplot(data=df, x='Horsepower', y='Price', hue='Condition', alpha=0.6)
plt.title('Price vs Horsepower by Condition')

# 3. Price by Market Demand
plt.subplot(2, 2, 3)
sns.boxplot(data=df, x='Market_Demand', y='Price')
plt.title('Price Distribution by Market Demand')
plt.xticks(rotation=45)

# 4. Price vs Age
plt.subplot(2, 2, 4)
sns.scatterplot(data=df, x='Age', y='Price', hue='Popularity', alpha=0.6)
plt.title('Price vs Age by Popularity')

plt.tight_layout()
plt.show()
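
The plots above can be complemented with a numeric summary. One possible sketch ranks numeric columns by their Spearman correlation with price, which captures monotonic but non-linear relationships; the helper name `rank_correlations` and the toy values are assumptions.

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Spearman correlation of every numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr(method="spearman")[target].drop(target)
    # Sort by absolute strength while keeping the sign in the result
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

toy = pd.DataFrame({
    "Horsepower": [300, 400, 500, 600, 700],
    "Age": [9, 3, 7, 1, 5],
    "Price": [50_000, 80_000, 130_000, 210_000, 340_000],
    "Condition": ["used", "used", "new", "new", "new"],  # ignored: not numeric
})

correlations = rank_correlations(toy, "Price")
print(correlations)
```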

## 4. Prepare Data for Modeling

Now let's prepare our features for the price prediction model.

In [None]:
def prepare_features(df):
    features = [
        'Age', 'Engine_Size', 'Horsepower', 'Torque', 'Weight', 
        'Top_Speed', 'Acceleration_0_100', 'Fuel_Efficiency',
        'CO2_Emissions', 'Mileage_log', 'Safety_Rating',
        'Number_of_Owners', 'Insurance_Cost_log', 'Production_Units_log',
        'Power_to_Weight', 'Performance_Index', 'Maintenance_Score',
        'Rarity_Score', 'Brand_encoded', 'Country_encoded',
        'Condition_encoded', 'Fuel_Type_encoded', 'Drivetrain_encoded',
        'Transmission_encoded', 'Market_Demand_encoded'
    ]
    
    X = df[features]
    y = df['Price_log']
    
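    # Scale features (note: fitting on all rows before the split lets test rows
    # influence the scaling statistics; tree-based models do not require scaling)
    scaler = RobustScaler()
    X_scaled = scaler.fit_transform(X)
    # Keep the original index so X stays aligned with y
    X_scaled = pd.DataFrame(X_scaled, columns=features, index=X.index)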
    
    return X_scaled, y, features

# Remove outliers and prepare features
df_clean = df[~df['is_outlier']].copy()
X, y, features = prepare_features(df_clean)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
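
`prepare_features` fits `RobustScaler` on every row before the split, so the test rows contribute to the scaling statistics. One way to avoid this is to place the scaler inside a scikit-learn `Pipeline`, which refits it on the training portion of each fold. The sketch below uses synthetic data and a `RandomForestRegressor` as a stand-in for the XGBoost model; both are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
# Synthetic features and target (not the real dataset)
X_toy = rng.normal(size=(300, 3))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=300)

pipeline = Pipeline([
    ("scale", RobustScaler()),  # fitted only on each fold's training rows
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# 5-fold cross-validated R²; the scaler never sees the held-out fold
scores = cross_val_score(pipeline, X_toy, y_toy, cv=5, scoring="r2")
print(f"R² per fold: {np.round(scores, 3)}")
print(f"Mean R²: {scores.mean():.3f}")
```

Cross-validation also gives a spread of scores rather than a single number from one 80/20 split, which makes the estimate less dependent on which rows land in the test set.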

## 5. Model Training and Evaluation

We'll use XGBoost for our price prediction model.

In [None]:
# Train model
model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=7,
    random_state=42
)
model.fit(X_train, y_train)

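# Model evaluation
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)  # computed on the log1p(Price) scale

# The target is log1p(Price), so invert with expm1 before computing dollar errors.
# Exponentiating a log-scale RMSE or MAE does not yield a dollar amount.
y_test_price = np.expm1(y_test)
y_pred_price = np.expm1(y_pred)
rmse = np.sqrt(mean_squared_error(y_test_price, y_pred_price))
mae = mean_absolute_error(y_test_price, y_pred_price)

print("Model Performance:")
print(f"R² Score (log scale): {r2:.3f}")
print(f"RMSE: ${rmse:,.2f}")
print(f"MAE: ${mae:,.2f}")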
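
Because the target is `log1p(Price)`, errors computed directly on `y_test` and `y_pred` are in log units. Exponentiating such an error gives a multiplicative factor, not a dollar amount; dollar errors need the predictions mapped back with `np.expm1` first. A small sketch with made-up prices shows the difference.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted prices, in dollars
true_price = np.array([100_000.0, 200_000.0])
pred_price = np.array([110_000.0, 180_000.0])

# What the model actually sees and predicts: log1p-transformed prices
y_true_log = np.log1p(true_price)
y_pred_log = np.log1p(pred_price)

mae_log = mean_absolute_error(y_true_log, y_pred_log)
# Back-transform first, then measure the error in dollars
mae_dollars = mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

print(f"MAE in log units: {mae_log:.4f}")
print(f"exp(MAE in log units): {np.exp(mae_log):.4f}  (a ratio, roughly 1.1x)")
print(f"MAE in dollars: {mae_dollars:,.2f}")
```

Here the predictions are off by $10,000 and $20,000, so the dollar MAE is $15,000, while the exponentiated log-scale MAE only says that predictions are off by a factor of about 1.1 on average.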

## 6. Feature Importance Analysis

In [None]:
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Most Important Features')
plt.xlabel('Feature Importance')
plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
display(feature_importance.head(10))
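
The `feature_importances_` attribute reflects how the trees used each feature during training, and it can be skewed when features are correlated, as the engineered ratios here are with their inputs. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data and a `RandomForestRegressor` in place of the notebook's XGBoost model; the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data: only 'signal' drives the target, 'noise' is irrelevant
X_toy = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the test set and record the average drop in R²
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
perm_importance = pd.Series(result.importances_mean, index=X_toy.columns)
print(perm_importance.sort_values(ascending=False))
```

The same call should work with the fitted `model`, `X_test`, and `y_test` from the previous section, since `XGBRegressor` follows the scikit-learn estimator interface.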

## 7. Market Segment Analysis

In [None]:
# Analyze price segments
segment_stats = df.groupby('PriceSegment', observed=True).agg({
    'Price': ['count', 'mean', 'std'],
    'Horsepower': 'mean',
    'Mileage': 'mean'
}).round(2)

print("Price Segment Statistics:")
display(segment_stats)

# Visualize price segments
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='PriceSegment', y='Price')
plt.title('Price Distribution by Segment')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
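
Since `pd.qcut` with `q=4` splits on price quartiles, the four segments hold roughly equal numbers of cars and differ in price by construction. A sketch on evenly spaced, made-up prices illustrates this.

```python
import numpy as np
import pandas as pd

# 100 made-up prices, evenly spaced between $20k and $500k
prices = pd.Series(np.linspace(20_000, 500_000, 100), name="Price")

segments = pd.qcut(prices, q=4, labels=["Budget", "Mid-Range", "Premium", "Luxury"])

# Number of cars per segment, in segment order
counts = segments.value_counts().sort_index()
print(counts)

# Price range covered by each segment
ranges = prices.groupby(segments, observed=True).agg(["min", "max"])
print(ranges.round(0))
```

The interesting comparison is therefore how other columns, such as horsepower and mileage, vary across the segments, as in `segment_stats` above.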

## 8. Brand Analysis

In [None]:
# Analyze brands
brand_stats = df.groupby('Brand').agg({
    'Price': ['mean', 'count'],
    'Horsepower': 'mean',
    'Market_Demand': lambda x: x.mode().iloc[0] if not x.empty else None
}).round(2)

# Sort by average price
brand_stats = brand_stats.sort_values(('Price', 'mean'), ascending=False)

print("Top 10 Brands by Average Price:")
display(brand_stats.head(10))

# Visualize top brands
plt.figure(figsize=(12, 6))
top_brands = brand_stats.head(10).index
sns.boxplot(data=df[df['Brand'].isin(top_brands)], x='Brand', y='Price')
plt.title('Price Distribution for Top 10 Brands')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
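
Aggregating with a dict of lists produces MultiIndex columns such as `('Price', 'mean')`. Named aggregation is an alternative that yields flat, descriptive column names. The sketch below uses a made-up frame; the output column names (`avg_price`, `n_cars`, `typical_demand`) are illustrative choices.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Brand": ["A", "A", "B", "B", "B"],
    "Price": [100_000, 200_000, 300_000, 320_000, 280_000],
    "Market_Demand": ["High", "Low", "Medium", "Medium", "High"],
})

brand_summary = (
    toy.groupby("Brand")
    .agg(
        avg_price=("Price", "mean"),
        n_cars=("Price", "size"),
        # Most frequent demand level; ties resolve to the first value in sorted order
        typical_demand=("Market_Demand", lambda s: s.mode().iloc[0]),
    )
    .sort_values("avg_price", ascending=False)
)
print(brand_summary)
```

With flat columns, sorting and plotting can refer to `avg_price` directly instead of the tuple `('Price', 'mean')`.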

## 9. Conclusions

Key findings from our analysis:

1. **Model Performance**:
   - The R² score, RMSE, and MAE from Section 5 indicate how much of the price variation the chosen features explain; car pricing is influenced by many factors
   - Log-transforming the price compresses its wide range, so the most expensive cars do not dominate the fit

2. **Important Features**:
   - Performance-related features (Horsepower, Torque) are significant price determinants
   - Market factors (Rarity, Brand) also play important roles

3. **Market Segments**:
   - The four segments are price quartiles, so they separate by price by construction
   - The segment statistics show how average horsepower and mileage differ across these quartiles

4. **Brand Impact**:
   - Average prices vary considerably between brands
   - The brand comparison does not control for specifications, so it cannot by itself separate a brand premium from differences in performance

This analysis provides valuable insights for understanding the sports car market and the factors that influence car prices.