<div class="markdown-google-sans">

# <strong>Bosch Assessment</strong>

## <strong>Overview</strong>
</div>

The objective is to analyze the UCI Automobile dataset (imports-85) thoroughly and build a machine learning model that predicts the target variable, price.

<div class="markdown-google-sans">
  <h3>Packages Installation</h3>
</div>

In [5]:
%pip install ucimlrepo
%pip install pandas
%pip install matplotlib

Note: you may need to restart the kernel to use updated packages.


<div class="markdown-google-sans">
  <h3>Imports</h3>
</div>

In [6]:
import pandas as pd

<div class="markdown-google-sans">

## <strong>Exploratory Data Analysis (EDA)</strong>
</div>

Conduct a thorough exploratory data analysis. 

This should include understanding the distribution of data, detecting outliers, and exploring relationships between features. 

Visualize important features and correlations.
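
The three steps above could be approached roughly as follows. This is a minimal sketch on a small made-up frame, not the analysis itself: the toy values, the helper name `iqr_outliers`, and the 1.5 x IQR rule are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy data for illustration only; these are NOT rows from the real dataset.
toy_df = pd.DataFrame({
    "engine_size": [90, 100, 110, 120, 130, 300],
    "price": [6000.0, 7000.0, 8000.0, 9000.0, 10000.0, 40000.0],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Distribution: count, mean, std, min, quartiles, and max per column.
summary = toy_df.describe()

# Outliers: flag prices far outside the interquartile range.
outlier_mask = iqr_outliers(toy_df["price"])

# Relationships: pairwise Pearson correlation between numeric columns.
correlations = toy_df.select_dtypes("number").corr()
```

On the real frame, the same calls would apply to `automobile_df_og`; `select_dtypes("number")` keeps the categorical columns such as make out of the correlation matrix.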

In [7]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

automobile_df_og = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
    header=None,
    names=headers,
    na_values="?",
)
automobile_df_og.head()


Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


<div class="markdown-google-sans">
  <h3>Missing Values</h3>
</div>

Observation: The data contains NaN values.

Question: How many rows contain at least one NaN value?

Follow-up: If only a few rows are affected, remove them.

In [43]:
print("Number of rows: ", len(automobile_df_og.index))
print("Number of rows with NaN values: ", automobile_df_og.isna().any(axis=1).sum())
print("Number of NaNs per column: \n", automobile_df_og.isna().sum())

Number of rows:  205
Number of rows with NaN values:  46
Number of NaNs per column: 
 symboling             0
normalized_losses    41
make                  0
fuel_type             0
aspiration            0
num_doors             2
body_style            0
drive_wheels          0
engine_location       0
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_type           0
num_cylinders         0
engine_size           0
fuel_system           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64


Observation: 41 rows (20% of the data) have no normalized_losses value.

Follow-up: With more time, I would predict the missing values from the other columns. A simpler approach would be to replace them with the column mean, but that could introduce error into the price prediction. I will therefore drop this column, aware that doing so may reduce the accuracy of the final model.
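
To make the trade-off concrete, the options discussed here can be compared side by side. The sketch below uses made-up values, and the group-wise mean per make is only an assumed, crude stand-in for "predicting from other columns"; it is not what this notebook goes on to do.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; values are made up.
toy_df = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw"],
    "normalized_losses": [164.0, np.nan, 192.0, np.nan],
    "price": [10000.0, 12000.0, 14000.0, 16000.0],
})

# Option 1: drop the column entirely (the choice made in this notebook).
dropped = toy_df.drop(columns=["normalized_losses"])

# Option 2: fill gaps with the overall column mean.
mean_filled = toy_df.assign(
    normalized_losses=toy_df["normalized_losses"].fillna(
        toy_df["normalized_losses"].mean()
    )
)

# Option 3: fill gaps with the mean within each make, a crude way of
# using information from another column.
group_mean = toy_df.groupby("make")["normalized_losses"].transform("mean")
group_filled = toy_df.assign(
    normalized_losses=toy_df["normalized_losses"].fillna(group_mean)
)
```

Option 2 pulls every missing entry toward the same value, which is the source of error mentioned above; option 3 preserves some of the variation between makes but fails for any make with no observed value at all.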

In [44]:
automobile_df = automobile_df_og.drop(['normalized_losses'], axis=1)
print("Number of rows: ", len(automobile_df.index))
print("Number of rows with NaN values: ", automobile_df.isna().any(axis=1).sum())
print("Number of NaNs per column: \n", automobile_df.isna().sum())

Number of rows:  205
Number of rows with NaN values:  12
Number of NaNs per column: 
 symboling            0
make                 0
fuel_type            0
aspiration           0
num_doors            2
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_cylinders        0
engine_size          0
fuel_system          0
bore                 4
stroke               4
compression_ratio    0
horsepower           2
peak_rpm             2
city_mpg             0
highway_mpg          0
price                4
dtype: int64


Observation: Only 12 rows (about 6% of the data) now have missing values, so they can be removed.

Follow-up: Drop the rows that contain NaN values.

In [45]:
automobile_df = automobile_df.dropna()
print("Number of rows: ", len(automobile_df.index))
print("Number of rows with NaN values: ", automobile_df.isna().any(axis=1).sum())
print("Number of NaNs per column: \n", automobile_df.isna().sum())

Number of rows:  193
Number of rows with NaN values:  0
Number of NaNs per column: 
 symboling            0
make                 0
fuel_type            0
aspiration           0
num_doors            0
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_cylinders        0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_ratio    0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
dtype: int64


<div class="markdown-google-sans">

## <strong>Feature Engineering and Selection</strong>
</div>

Based on your EDA, engineer new features and select the most relevant ones for your model. Justify your choices.
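
As a starting point, a few common transformations for a frame like this one might look as follows. This is a sketch under assumptions: the toy values are made up, and the `power_to_weight` feature is a hypothetical example rather than a choice justified by the EDA above.

```python
import pandas as pd

# Toy data for illustration only; values are made up.
toy_df = pd.DataFrame({
    "fuel_type": ["gas", "diesel", "gas"],
    "num_doors": ["two", "four", "four"],
    "horsepower": [100.0, 80.0, 150.0],
    "curb_weight": [2000, 2500, 3000],
})

# Map the spelled-out door count to an integer.
toy_df["num_doors"] = toy_df["num_doors"].map({"two": 2, "four": 4})

# Hypothetical engineered feature: horsepower per unit of curb weight.
toy_df["power_to_weight"] = toy_df["horsepower"] / toy_df["curb_weight"]

# One-hot encode the remaining categorical column.
features = pd.get_dummies(toy_df, columns=["fuel_type"], dtype=int)
```

Whether any such feature is worth keeping would still need to be checked against the target, for example through its correlation with price.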

<div class="markdown-google-sans">

## <strong>Machine Learning Model</strong>
</div>

Build a machine learning model to predict the "price" variable.
Explain your choice of model and any hyperparameters you tune. Use appropriate validation techniques.
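
One possible baseline with a validation loop is sketched below. It is a NumPy-only illustration of k-fold cross-validation around an ordinary least-squares fit; the function name `k_fold_rmse`, the fold count, and the synthetic data are assumptions, and a dedicated library would likely be used in practice.

```python
import numpy as np


def k_fold_rmse(X: np.ndarray, y: np.ndarray, k: int = 5, seed: int = 0) -> float:
    """Mean RMSE of an ordinary least-squares fit across k shuffled folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Add an intercept column, then solve the least-squares problem.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = y[test_idx] - A_test @ coef
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(scores))


# Synthetic data for illustration only: the target is exactly linear here,
# so the cross-validated error should be close to zero.
X_toy = np.arange(20, dtype=float).reshape(-1, 1)
y_toy = 1000.0 + 50.0 * X_toy[:, 0]
cv_rmse = k_fold_rmse(X_toy, y_toy, k=5)
```

With only 193 rows left after cleaning, averaging the error over several folds gives a steadier estimate than a single train/test split would.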

<div class="markdown-google-sans">

## <strong>Evaluation and Interpretation</strong>
</div>

Evaluate the performance of your model using appropriate metrics. 
Interpret your model's predictions, and discuss its strengths and weaknesses.
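
For a regression target such as price, commonly used metrics include the mean absolute error, the root mean squared error, and the coefficient of determination. The helper below is a minimal sketch of those three; the name `regression_metrics` and the example numbers are made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Made-up prices and predictions, for illustration only.
metrics = regression_metrics([10000, 20000, 30000], [11000, 19000, 32000])
```

MAE and RMSE are both expressed in the units of price, which makes them easy to interpret; RMSE penalizes large misses more heavily, so a wide gap between the two hints at a few badly predicted cars.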