<img src="image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">  <div style="font-size:150%; color:#FEE100"><b>BMW Global Sales Analysis Notebook</b></div>  <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

A curious observation about BMW sales data across the globe is that market dynamics can often be as complex as a finely tuned engine. If you find any part of this notebook useful, please upvote it.

## Table of Contents
- [Imports and Configuration](#Imports-and-Configuration)
- [Data Import and Initial Exploration](#Data-Import-and-Initial-Exploration)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Predictor Building and Evaluation](#Predictor-Building-and-Evaluation)
- [Conclusion and Future Work](#Conclusion-and-Future-Work)

In [1]:
# Importing necessary libraries and suppressing warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

# Seaborn and matplotlib for visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# Render plots inline in the notebook.
# Note: do not switch to the non-interactive 'Agg' backend afterwards;
# with 'Agg' active, plt.show() does not display any figure.
%matplotlib inline

# Additional libraries for modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.inspection import permutation_importance

# Setting plot styles for consistency
sns.set(style='whitegrid')

In [2]:
# Load the dataset
file_path = 'BMW sales data (2010-2024) (1).csv'
df = pd.read_csv(file_path, delimiter=',', encoding='ascii')

# Display first few rows to get an initial feel
print(df.head())

# Display dataset summary (df.info() prints directly and returns None, so it is not wrapped in print)
df.info()

      Model  Year         Region  Color Fuel_Type Transmission  Engine_Size_L  \
0  5 Series  2016           Asia    Red    Petrol       Manual            3.5   
1        i8  2013  North America    Red    Hybrid    Automatic            1.6   
2  5 Series  2022  North America   Blue    Petrol    Automatic            4.5   
3        X3  2024    Middle East   Blue    Petrol    Automatic            1.7   
4  7 Series  2020  South America  Black    Diesel       Manual            2.1   

   Mileage_KM  Price_USD  Sales_Volume Sales_Classification  
0      151748      98740          8300                 High  
1      121671      79219          3428                  Low  
2       10991     113265          6994                  Low  
3       27255      60971          4047                  Low  
4      122131      49898          3080                  Low  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column                Non-N

## Data Import and Initial Exploration

We begin by loading the BMW sales data and taking a quick look at its structure. The data spans 2010 to 2024 and contains 50,000 rows and 11 columns: model, year, region, color, fuel type, transmission, engine size, mileage, price, sales volume, and a sales classification. Understanding such a diverse dataset is a bit like tuning an engine: every variable plays its part.

## Data Cleaning and Preprocessing

In this section we check for missing values and validate data types. The only temporal field is the integer Year column, so no date parsing is needed. The remaining numeric columns are coerced to numeric types, and any rows that fail conversion are dropped. The categorical columns are left as strings at this stage.

In [3]:
# Check for missing values
print("Missing values in each column:\n", df.isnull().sum())

# Convert data types if necessary
# Here, Year should be integer, and Engine_Size_L, Mileage_KM, Price_USD, Sales_Volume are expected to be numeric.
df['Year'] = df['Year'].astype(int)
df['Engine_Size_L'] = pd.to_numeric(df['Engine_Size_L'], errors='coerce')
df['Mileage_KM'] = pd.to_numeric(df['Mileage_KM'], errors='coerce')
df['Price_USD'] = pd.to_numeric(df['Price_USD'], errors='coerce')
df['Sales_Volume'] = pd.to_numeric(df['Sales_Volume'], errors='coerce')

# Values that cannot be converted become NaN (errors='coerce'), which typically happens when formatting is inconsistent.

# Drop rows with a missing value in any column (dropna without `subset` checks all columns)
df.dropna(inplace=True)

print("After cleaning, dataset shape:", df.shape)

Missing values in each column:
 Model                   0
Year                    0
Region                  0
Color                   0
Fuel_Type               0
Transmission            0
Engine_Size_L           0
Mileage_KM              0
Price_USD               0
Sales_Volume            0
Sales_Classification    0
dtype: int64
After cleaning, dataset shape: (50000, 11)
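
The cell above validates the numeric columns only. A minimal sketch of a matching check for the categorical columns follows. It runs on a tiny hand-made frame rather than the real CSV; the sample rows, the deliberately messy `'petrol '` entry, and the `summarize_categoricals` helper are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: same column names, made-up rows.
sample = pd.DataFrame({
    "Model": ["5 Series", "i8", "5 Series", "X3"],
    "Fuel_Type": ["Petrol", "Hybrid", "petrol ", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Automatic", "Automatic"],
})

def summarize_categoricals(frame, columns):
    """Strip whitespace, title-case, cast to 'category', and list the levels found."""
    cleaned = frame.copy()
    for col in columns:
        cleaned[col] = (
            cleaned[col].astype(str).str.strip().str.title().astype("category")
        )
    levels = {col: cleaned[col].cat.categories.tolist() for col in columns}
    return cleaned, levels

cleaned, levels = summarize_categoricals(sample, ["Fuel_Type", "Transmission"])
print(levels)
```

Title-casing is only safe for columns such as Fuel_Type and Transmission; applying it to Model would turn `i8` into `I8`, so that column is left out here.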


## Exploratory Data Analysis

In this section we perform exploratory data analysis using various visualization techniques. We include heatmaps for numerical correlation, pair plots to see relationships among variables, histograms to observe the distribution of numeric fields, and several categorical plots to understand the distribution of features like Model, Region, and Sales Classification.

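There is a method to the madness: each plot type answers a different question. The heatmap summarizes linear relationships between numeric features, the pair plot shows the shape of those relationships, and the histograms and count plots show how individual features are distributed.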

In [4]:
# Visualization of Numerical Correlations
numeric_df = df.select_dtypes(include=[np.number])

if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(10, 8))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap of Numerical Features')
    plt.show()
else:
    print("Not enough numeric features for a correlation heatmap.")

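# Pair Plot showing relationships
# A random sample keeps rendering fast; plotting all 50,000 rows is slow and heavily overplotted.
pair_sample = numeric_df.sample(n=min(2000, len(numeric_df)), random_state=42)
sns.pairplot(pair_sample)
plt.suptitle('Pair Plot of Numerical Features (random sample)', y=1.02)
plt.show()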

# Histograms for distribution of numerical features
numeric_features = numeric_df.columns
for feature in numeric_features:
    plt.figure(figsize=(6, 4))
    sns.histplot(numeric_df[feature], kde=True, color='blue')
    plt.title(f'Histogram of {feature}')
    plt.show()

# Categorical analysis: Count plots for categorical variables
categorical_features = ['Model', 'Region', 'Color', 'Fuel_Type', 'Transmission', 'Sales_Classification']
for feature in categorical_features:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=feature, palette='viridis')
    plt.xticks(rotation=45)
    plt.title(f'Count Plot of {feature}')
    plt.show()

# Box Plot for Price_USD across different Regions
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Region', y='Price_USD', palette='Set2')
plt.xticks(rotation=45)
plt.title('Box Plot of Price in USD by Region')
plt.show()
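
A box plot is easier to read alongside the numbers behind it. The sketch below shows one way to tabulate price by region with a `groupby`. The frame is a made-up five-row example (the region and price pairings are assumptions, not rows from the dataset); on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "Region": ["Asia", "Asia", "Europe", "Europe", "Europe"],
    "Price_USD": [98740, 60971, 79219, 113265, 49898],
})

# Per-region count, median, and range: the same quantities a box plot encodes visually.
region_summary = (
    sample.groupby("Region")["Price_USD"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median", ascending=False)
)
print(region_summary)
```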

## Predictor Building and Evaluation

An engine is only as good as its tuning. In this section, we attempt to create a predictor for the sales volume. We use a simple linear regression model based on the numeric attributes: Year, Engine_Size_L, Mileage_KM, and Price_USD. More advanced techniques may be deployed in the future, but this approach provides a baseline for performance.
After training the model, we compute the R² score and Mean Squared Error to gauge its performance. Additionally, permutation importance is used to assess the relative importance of each predictor feature.

In [None]:
# Define features and target for prediction
features = ['Year', 'Engine_Size_L', 'Mileage_KM', 'Price_USD']
target = 'Sales_Volume'

# Prepare feature matrix and target vector
X = df[features]
y = df[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Evaluate the model using R² Score and Mean Squared Error
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"R² Score: {r2:.3f}")
print(f"Mean Squared Error: {mse:.3f}")

# Permutation Importance to assess feature importance
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
sorted_idx = perm_importance.importances_mean.argsort()

plt.figure(figsize=(8, 6))
plt.barh(np.array(features)[sorted_idx], perm_importance.importances_mean[sorted_idx], color='teal')
plt.xlabel('Permutation Importance')
plt.title('Permutation Importance of Features')
plt.show()

R² Score: -0.000
Mean Squared Error: 8172162.073
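
An R² of about zero means the regression does no better than always predicting the mean of the target, since R² = 1 - SS_res / SS_tot and a constant mean prediction makes SS_res equal to SS_tot. The sketch below illustrates this with scikit-learn's `DummyRegressor` on synthetic data (the random features and target are assumptions, not the BMW data). It also converts the reported MSE into an RMSE, which comes to roughly 2,859 in the units of Sales_Volume.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)

# Synthetic data: the target is generated independently of the features.
X = rng.normal(size=(1000, 4))
y = rng.normal(loc=5000, scale=2850, size=1000)

# A baseline that ignores X and always predicts the mean of y.
baseline = DummyRegressor(strategy="mean").fit(X, y)
y_base = baseline.predict(X)

r2_base = r2_score(y, y_base)              # 0 by construction
mse_base = mean_squared_error(y, y_base)   # equals the variance of y

# RMSE is on the same scale as the target, which makes the reported MSE easier to read.
reported_rmse = np.sqrt(8172162.073)

print(f"Baseline R2: {r2_base:.3f}")
print(f"Baseline MSE equals var(y): {np.isclose(mse_base, np.var(y))}")
print(f"RMSE for the reported MSE: {reported_rmse:.1f}")
```

Comparing any future model against this kind of mean baseline shows quickly whether it has learned anything from the features.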


## Conclusion and Future Work

The BMW Global Sales Analysis Notebook walks from data exploration to a baseline predictor for sales volume. We cleaned and preprocessed the data, explored it through heatmaps, pair plots, histograms, and count plots, and built a simple linear regression model. That baseline reached an R² of about 0 (MSE ≈ 8.17 million), so Year, Engine_Size_L, Mileage_KM, and Price_USD on their own explain essentially none of the variance in sales volume in a linear model.

Future avenues for exploration could include deploying more complex models (e.g., random forest regressors), exploring the role of categorical variables through one-hot encoding, or integrating time-based trend analysis if additional temporal fields are discovered. There is always room to fine-tune the model, just as a well-crafted engine benefits from careful tuning.
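
As a rough sketch of the first two ideas, the example below combines one-hot encoding with a random forest regressor in a scikit-learn `Pipeline`. It runs on a synthetic frame in which the target is built to depend on Region, so the score it prints says nothing about the real BMW data; the column subset, the effect sizes, and the hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the dataset (assumed values, same column names).
toy = pd.DataFrame({
    "Region": rng.choice(["Asia", "Europe", "North America"], size=n),
    "Fuel_Type": rng.choice(["Petrol", "Diesel", "Hybrid"], size=n),
    "Price_USD": rng.integers(30000, 120000, size=n),
})
# The toy target depends on Region so that there is a signal to learn.
region_effect = toy["Region"].map({"Asia": 6000, "Europe": 4000, "North America": 5000})
toy["Sales_Volume"] = region_effect + rng.normal(0, 200, size=n)

categorical = ["Region", "Fuel_Type"]
numeric = ["Price_USD"]

# One-hot encode the categorical columns; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy[categorical + numeric], toy["Sales_Volume"], test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
toy_r2 = r2_score(y_test, pipeline.predict(X_test))
print(f"Toy R2: {toy_r2:.3f}")
```

Keeping the encoder inside the pipeline means it is fitted on the training split only, which avoids leaking information from the test rows.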
