**Scenario**: You work as a data scientist for a US used car dealer. The dealer buys used cars at low prices in online auctions and from other car dealers in order to resell them profitably on its own platform. It is not always easy to tell whether a used car is worth buying: one of the biggest challenges in used car auctions is the risk that a car has problems so serious that it cannot be resold to customers. Such cars are referred to as "lemons": cars that have significant defects from the outset due to production faults, where the defects significantly affect the safety, use or value of the car and cannot be repaired with a reasonable number of repair attempts or within a certain period of time. In cases like this, the customer has the right to be refunded the purchase price. In addition to the purchase costs, buying a lemon leads to considerable follow-up costs, such as storage and repair of the car, which can result in losses when the vehicle is resold.

That is why it is important for your boss to rule out as many bad purchases as possible. To help the buyers in the company with the large number of cars on offer, you are to develop a model that predicts whether a car would be a bad buy, a so-called lemon. However, this must not lead to too many good purchases being excluded. You won't receive more detailed information on the costs and profits of the respective purchases for developing the prototype just yet.


Each row of the dataset corresponds to a car that was first auctioned and then resold. The data dictionary looks like this:

Column number | Column name | Type | Description
:---|:---|:---|:----  
1 | `'IsBadBuy'` | categorical (nominal) | Identifies whether the auctioned car is a "lemon", and therefore whether it was a bad buy (`0`= not a lemon, `1`= lemon)
2  |  `'PurchDate'` | continuous (`datetime`) | The date the vehicle was purchased at the auction
3 | `'Auction'` | categorical (nominal) | Auction provider the vehicle was purchased from
4  |  `'VehYear'` | continuous (`int`) | Vehicle's year model 
5  |  `'VehicleAge'` | continuous (`int`) | The age of the car at the time of the auction
6  |  `'Make'` | categorical (nominal) | Car manufacturer
7  |  `'Model'` | categorical (nominal) | Car model
8  |  `'Trim'` | categorical (nominal) | Vehicle trim
9  |  `'SubModel'` | categorical (nominal) | Car submodel
10  |  `'Color'` | categorical (nominal) | Vehicle color
11  |  `'Transmission'` | categorical (nominal) | Vehicle transmission type (automatic, manual)
12  |  `'WheelTypeID'` | categorical (nominal) | The type ID of the wheel rims 
13  |  `'WheelType'` | categorical (nominal) | The type of wheel rims
14  |  `'VehOdo'` | continuous (`int`) | Vehicle mileage
15  |  `'Nationality'` | categorical (nominal) | Manufacturer's country
16  |  `'Size'` | categorical (nominal) | The size class of the vehicle (compact, SUV, etc.)
17 | `'TopThreeAmericanName'` | categorical (nominal) | Indicates whether the manufacturer is one of the three leading American car manufacturers
18  |  `'MMRAcquisitionAuctionAveragePrice'` | continuous (`int`) | Purchase price in US dollars for this vehicle in average condition at the time of purchase
19  |  `'MMRAcquisitionAuctionCleanPrice'` | continuous (`int`) | Purchase price in US dollars for this vehicle in above average condition at the time of purchase
20  |  `'MMRAcquisitionRetailAveragePrice'` | continuous (`int`) | Purchase price in US dollars for this vehicle at retail in average condition at the time of purchase
21  |  `'MMRAcquisitonRetailCleanPrice'` | continuous (`int`) | Purchase price in US dollars for this vehicle at retail in above-average condition at the time of purchase
22  |  `'MMRCurrentAuctionAveragePrice'` | continuous (`int`) | Current day purchase price in US dollars for this vehicle in average condition
23  |  `'MMRCurrentAuctionCleanPrice'` | continuous (`int`) | Current day purchase price in US dollars for this vehicle in above-average condition
24  |  `'MMRCurrentRetailAveragePrice'` | continuous (`int`) | Current day purchase price in US dollars for this vehicle at retail in average condition
25  |  `'MMRCurrentRetailCleanPrice'` | continuous (`int`) | Current day purchase price in US dollars for this vehicle at retail in above-average condition
26  |  `'PRIMEUNIT'` | categorical (nominal) | Indicates whether the vehicle would have a higher demand than a standard purchase 
27  |  `'AUCGUART'` | categorical (nominal) | The guarantee level given for the vehicle by the auction platform (`'GREEN'` - guarantee  present, `'YELLOW'` - guarantee unclear, `'RED'` - no guarantee)
28 | `'BYRNO'` | categorical (nominal) | Unique number assigned to the buyer who bought the vehicle
29  |  `'VNZIP1'` | categorical (nominal) | Zip code where the vehicle was purchased
30  |  `'VNST'` | categorical (nominal) | US state in which the vehicle was purchased
31 | `'VehBCost'` | continuous (`int`) | Purchase costs in US dollars paid for the vehicle at the time of purchase
32  |  `'IsOnlineSale'` | categorical (nominal) |  Indicates whether the vehicle was originally purchased online.
33 | `'WarrantyCost'` | continuous (`int`) | Cost of the warranty for a term of 36 months

**Tip**: This data dictionary is also located in the file *Data_dictionary.ipynb*. You can open this in a separate window with your file browser.


# Preparation

***What problem should the model solve?***
The model should aim to predict whether a given used car is likely to be a "lemon" (a car with significant defects) 
that would result in a financial loss for the company. By identifying these potentially problematic cars before purchase, 
the model helps to minimize financial risks and ensures that only cars in good condition are acquired.

_____________________________________________

    
***What is the nature of the problem?***
This is a classification problem. The task is to classify each car into one of two categories:

Lemon (problematic car).

Not a Lemon (non-problematic car).

_____________________________________________

***What would an application using your model look like?***
**Application Appearance and Functionality:**

**User Interface:** The application could have a web-based or mobile interface accessible by company buyers.

**Input Fields:** Buyers can input various features of a car (e.g., make, model, year, mileage, service history, etc.).

**Prediction Output:** The application processes the input data and returns a prediction indicating whether the car is likely to be a "lemon".

**Recommendation System:** Provide insights or recommendations based on the prediction, such as advising against the purchase or suggesting further inspection.

_____________________________________________

***Requirements Specified by Employer:***

**Accuracy and Recall:** The model must effectively distinguish between lemons and non-problematic cars while minimizing the exclusion of good purchases.

**Ease of Use:** The application should be user-friendly and provide clear, actionable insights.

**Scalability:** The model should handle a large volume of data and predictions efficiently.

**Explainability:** The model should provide some level of interpretability or explainability to help users understand the reasons behind each prediction.
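
The trade-off in the first requirement (catch lemons without excluding too many good purchases) can be expressed with recall and precision for the lemon class. The following is a minimal sketch; the confusion-matrix counts are hypothetical and are not taken from the dataset.

```python
# Hypothetical counts for illustration only; "positive" means "lemon".
tp = 40   # lemons correctly flagged
fn = 60   # lemons missed (bad purchases that slip through)
fp = 30   # good cars wrongly flagged (good purchases excluded)
tn = 870  # good cars correctly passed

recall = tp / (tp + fn)       # share of all lemons that the model catches
precision = tp / (tp + fp)    # share of flagged cars that really are lemons
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.2f}, precision={precision:.2f}, f1={f1:.2f}")
```

A higher recall means fewer bad purchases, while a higher precision means fewer good purchases are excluded; the F1 score summarizes both in one number.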

_____________________________________________


***What data do you need so that you can build your model?***
To build an effective model, the following data is required:

**Car Features:**

Make and model.

Year of manufacture.

Mileage.

Service and repair history.

Previous ownership details.

Condition reports.

Auction or seller details.

**Historical Data:**

Records of past car purchases, including those identified as lemons and non-problematic cars.

Detailed description of the issues faced by lemons.

Financial data on purchase and repair costs, as well as resale values.

**Label Data:**

Binary labels indicating whether each car in the historical dataset was a lemon.

**External Data:**

Industry reports and statistics on common defects and their occurrence in specific car models or makes.

Customer feedback or reviews that might indicate recurring issues.

# Importing libraries

In [None]:
# import modules
import os
import pickle as p
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures, OneHotEncoder
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import (f1_score, recall_score, precision_score, accuracy_score,
                             confusion_matrix, classification_report, euclidean_distances)
from sklearn.exceptions import DataConversionWarning

# Keras
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# imbalanced-learn
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

warnings.filterwarnings(action='ignore', category=DataConversionWarning)

plt.style.use('ggplot')


# Importing data + EDA 

In [None]:
df = pd.read_csv("data_train.csv")
df['PurchDate'] = pd.to_datetime(df['PurchDate'], unit='s')


def space(distance_between_outputs):
    """Print blank lines to separate the outputs."""
    result = '\n' * distance_between_outputs
    print(result)

# dimensions
print("DIMENSION:")
print(df.shape)
space(3)


# describe()
print("DESCRIBE-OVERVIEW:")
display(df.describe())
space(3)

# info() prints its summary directly and returns None
print("INFO-OVERVIEW:")
df.info()
space(3)


# missing values per column
print("NAN VALUES:")
print(df.isna().sum())
space(3)

# duplicated rows
print("DUPLICATES:")
print('duplicate count', df.duplicated().sum())
space(3)

![image.png](attachment:42c6bb54-2675-4966-908d-f9886b6cea81.png)


![image.png](attachment:5b8820c6-ada6-4835-aa3d-df2b856f7420.png)

![image.png](attachment:d91d4d0b-0215-4d41-8799-33bb87268083.png)

![image.png](attachment:0ae90550-0be8-4ff4-92de-56e6f6d49a9c.png)

# Train - Test - Split

In [None]:
# perform train-test-split

df_features = df.drop(columns=['IsBadBuy'])
df_target = df['IsBadBuy']

features_train, features_test, target_train, target_test = train_test_split(df_features,
                                                                           df_target,
                                                                           test_size = 0.1,
                                                                           random_state = 42)


In [None]:
# save features_test as 'features_test.csv'
features_test.to_csv('features_test.csv', index=True)
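
The resampling step later in this notebook suggests that lemons are much rarer than good cars. In that situation a plain random split can leave the two parts with different lemon shares. The following sketch, on hypothetical toy data, shows the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. It is an optional variation and not what the cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 cars, 10 of them lemons.
toy = pd.DataFrame({"VehOdo": range(100),
                    "IsBadBuy": [1] * 10 + [0] * 90})

toy_features = toy.drop(columns=["IsBadBuy"])
toy_target = toy["IsBadBuy"]

# stratify keeps the lemon share the same in both parts of the split
f_train, f_test, t_train, t_test = train_test_split(
    toy_features, toy_target, test_size=0.1, random_state=42, stratify=toy_target
)

print("lemon share train:", t_train.mean())
print("lemon share test: ", t_test.mean())
```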

# Data Preparation + Working with features_train

### Getting cat and num columns from features_train

In [None]:
cat_cols = features_train.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = features_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [None]:
for col in cat_cols:

    # frequency of each category, number of distinct categories and dtype
    print(f'{col} VALUE COUNTS:\n', features_train[col].value_counts(),'\n')
    print(f'{col} NUMBER OF UNIQUE VALUES:\n', features_train[col].nunique(),'\n')
    print('DATATYPE:', features_train[col].dtype)
    print('-----------------------------------------------')

![image.png](attachment:e5d9deaf-a4ab-4f8c-b206-fa975d6e2079.png)

![image.png](attachment:edb7531d-7e60-4d4a-9d96-c0b20a5a345d.png)

![image.png](attachment:5268e8a6-7e55-41fd-b945-2c254b3916dd.png)

![image.png](attachment:549bff58-ee8e-4b41-bc4c-21a6a6d39da7.png)

In [None]:
cols = 3
rows = int(np.ceil(len(num_cols) / cols))

fig, axes = plt.subplots(rows, cols, figsize = (cols*10, rows*8))


for i, col in enumerate(num_cols):
    # position of the current plot in the grid
    row = i // cols
    col_pos = i % cols
    sns.histplot(data = features_train,
                x = col,
                ax=axes[row][col_pos])
    axes[row][col_pos].set_title(f'Histogram - {col}')

![image.png](attachment:77a2b5ee-b69a-4ef0-91f3-08cd7fb099f5.png)

### Cleaning the NaN values

In [None]:
def data_cleaner(df, df_referenz = features_train):
    """Impute missing values with statistics from df_referenz and drop unused columns."""

    df = df.drop('PurchDate', axis = 1)

    # fill values are computed from the reference (training) data only
    dict_referenz = {
        'Auction':df_referenz['Auction'].mode()[0],
        'VehYear':df_referenz['VehYear'].mode()[0],
        'VehicleAge' : df_referenz['VehicleAge'].mode()[0],
        'Make':df_referenz['Make'].mode()[0],
        'Model':df_referenz['Model'].mode()[0],
        'Trim': df_referenz['Trim'].mode()[0], # could also be replaced by 'Unknown'
        'SubModel':df_referenz['SubModel'].mode()[0],
        'Color':df_referenz['Color'].mode()[0],
        'Transmission':df_referenz['Transmission'].mode()[0],
        'WheelTypeID': df_referenz['WheelTypeID'].mode()[0],
        'WheelType': df_referenz['WheelType'].mode()[0],
        'VehOdo': df_referenz['VehOdo'].median(),
        'Nationality':df_referenz['Nationality'].mode()[0],
        'Size':df_referenz['Size'].mode()[0],
        'TopThreeAmericanName' :df_referenz['TopThreeAmericanName'].mode()[0],
        'BYRNO':df_referenz['BYRNO'].mode()[0],
        'VNZIP1':df_referenz['VNZIP1'].mode()[0],
        'VNST':df_referenz['VNST'].mode()[0],
        'IsOnlineSale': df_referenz['IsOnlineSale'].mode()[0],
        'WarrantyCost':  df_referenz['WarrantyCost'].mode()[0],
        'MMRAcquisitionAuctionAveragePrice': df_referenz['MMRAcquisitionAuctionAveragePrice'].median(),
        'MMRAcquisitionAuctionCleanPrice': df_referenz['MMRAcquisitionAuctionCleanPrice'].median(),
        'MMRAcquisitionRetailAveragePrice': df_referenz['MMRAcquisitionRetailAveragePrice'].median(),
        'MMRAcquisitonRetailCleanPrice': df_referenz['MMRAcquisitonRetailCleanPrice'].median(),
        'MMRCurrentAuctionAveragePrice': df_referenz['MMRCurrentAuctionAveragePrice'].median(),
        'MMRCurrentAuctionCleanPrice': df_referenz['MMRCurrentAuctionCleanPrice'].median(),
        'MMRCurrentRetailAveragePrice': df_referenz['MMRCurrentRetailAveragePrice'].median(),
        'MMRCurrentRetailCleanPrice': df_referenz['MMRCurrentRetailCleanPrice'].median(),
        'VehBCost': df_referenz['VehBCost'].mode()[0]
        }

    df = df.fillna(dict_referenz)

    # these two columns are dropped entirely instead of being imputed
    df = df.drop('PRIMEUNIT', axis = 1)
    df = df.drop('AUCGUART', axis = 1)

    # dtype change: numeric columns to int, text columns to category
    num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
    cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

    for col in num_cols:
        df[col] = df[col].astype('int')

    for col in cat_cols:
        df[col] = df[col].astype('category')

    return df

In [None]:
print("shape:", features_train.shape)
print("Total cells with missing values before:", features_train.isna().sum().sum())
print("Number of columns:                     ", features_train.shape[1])

space(1)

features_train_cleaned = data_cleaner(features_train)
features_test_cleaned = data_cleaner(features_test)

print("shape:", features_train_cleaned.shape)
print("Total cells with missing values after: ", features_train_cleaned.isna().sum().sum())
print("Number of columns:                     ", features_train_cleaned.shape[1])

![image.png](attachment:43e8934d-0ee8-46a5-b269-77708fcdd275.png)
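
The idea behind the `df_referenz` argument is that fill values are learned from the training data and then reused for any other data, so no information from the test set leaks into the preparation. A minimal sketch of this pattern on hypothetical toy frames (the values are invented for illustration):

```python
import pandas as pd

# Toy frames (hypothetical values) with missing entries
train = pd.DataFrame({"Color": ["RED", "RED", "BLUE", None],
                      "VehOdo": [50000.0, 70000.0, None, 90000.0]})
test = pd.DataFrame({"Color": [None, "BLUE"],
                     "VehOdo": [None, 10000.0]})

# Fill values are learned from the training data only
fill_values = {"Color": train["Color"].mode()[0],    # most frequent category
               "VehOdo": train["VehOdo"].median()}   # robust center for numbers

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)   # test reuses the training statistics

print(test_filled)
```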

# Deal with outliers

In [None]:
col = "VehBCost"

# quartiles and interquartile range of the column
q1 = features_train_cleaned.loc[:, col].quantile(0.25)
q3 = features_train_cleaned.loc[:, col].quantile(0.75)

limit = 3

IQR = q3 - q1

lower_bound = q1 - limit * IQR
upper_bound = q3 + limit * IQR

# True for every row that lies outside the bounds
mask = (features_train_cleaned.loc[:, col] > upper_bound) | (features_train_cleaned.loc[:, col] < lower_bound)

ax = sns.scatterplot(x = 'VehOdo',
                    y =  'VehBCost',
                    data = features_train_cleaned,
                    hue = mask)

ax.set_title("Scatterplot: VehBCost - VehOdo")


![image.png](attachment:cfeb11da-ee2c-49e4-b581-6a815ec18b2e.png)
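
To make the IQR rule easier to follow, here is the same calculation on a small hypothetical price series where the bounds can be checked by hand. The numbers are invented; only the multiplier `limit = 3` matches the cell above.

```python
import pandas as pd

# Toy purchase prices (hypothetical); the last value is an obvious outlier
prices = pd.Series([5000, 6000, 6500, 7000, 7500, 8000, 60000])

q1 = prices.quantile(0.25)   # 6250.0
q3 = prices.quantile(0.75)   # 7750.0
iqr = q3 - q1                # 1500.0
limit = 3

lower_bound = q1 - limit * iqr   # 1750.0
upper_bound = q3 + limit * iqr   # 12250.0

is_outlier = (prices < lower_bound) | (prices > upper_bound)
print(prices[is_outlier])
```

Only the value 60000 lies outside the interval from 1750 to 12250 and is marked as an outlier.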

### Function - IQR as a strategy to deal with outliers

In [None]:
num_cols = features_train_cleaned.select_dtypes(include=['int64', 'float64']).columns.tolist()


def iqr_cleaner(col_list=None, limit=3, data=None):
    if col_list is None:
        col_list = num_cols
    if data is None:
        data = features_train_cleaned 

    cols = 2
    rows = int(np.ceil(len(col_list) / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 10, rows * 6))
    
    for i, col in enumerate(col_list):
        row = i // cols
        col_pos = i % cols

        # Ensure the column is numeric
        if pd.api.types.is_numeric_dtype(data[col]):
            q1 = data.loc[:, col].quantile(0.25)
            q3 = data.loc[:, col].quantile(0.75)
            IQR = q3 - q1
            lower_bound = q1 - limit * IQR
            upper_bound = q3 + limit * IQR

            mask = (data.loc[:, col] > upper_bound) | (data.loc[:, col] < lower_bound)
            sns.scatterplot(x='VehBCost', y=col, data=data, hue=mask, ax=axes[row, col_pos])
            axes[row, col_pos].set_title(f"Scatterplot: VehBCost - {col}")
    
    fig.tight_layout()
    plt.show()


iqr_cleaner()


![image.png](attachment:29abcb4a-0037-43a1-8282-8e3cc3253c5f.png)

![image.png](attachment:2b8a8fbd-1dfa-446c-a2a8-5498fba76e4a.png)

In [None]:
def sampling_data(df = features_train_cleaned):
    
    # Define the filtering criteria
    filters = [
        ('MMRAcquisitionAuctionAveragePrice', 100000, 10),
        ('MMRAcquisitionAuctionCleanPrice', 100000, 10),
        ('MMRAcquisitionRetailAveragePrice', 100000, 10),
        ('MMRAcquisitonRetailCleanPrice', 100000, 10),
        ('MMRCurrentAuctionAveragePrice', 100000, 10),
        ('MMRCurrentAuctionCleanPrice', 100000, 10),
        ('MMRCurrentRetailAveragePrice', 100000, 10),
        ('MMRCurrentRetailCleanPrice', 100000, 10)
    ]

    # Apply the filters
    for col, upper_bound, lower_bound in filters:
        df = df[(df[col] <= upper_bound) & (df[col] > lower_bound)]

    
    return df

features_train_cleaned = sampling_data()

### Check new dataset

In [None]:
iqr_cleaner(data = features_train_cleaned)

![image.png](attachment:455ccc18-a74a-4849-9e0e-e10d91af001f.png)

![image.png](attachment:9fb43563-57ed-4289-9b38-c3b586661c89.png)

# Modeling

### Re-importing the data

In [None]:
# Read data
df = pd.read_csv('data_train.csv')

# Split the data
df_train, df_test = train_test_split(df, test_size=0.1, random_state=41)

##################################
#TRAIN DATA
##################################

# Impute missing values and drop unneeded columns
df_train = data_cleaner(df_train,df_train)

# Remove rows with implausible MMR prices
df_train = sampling_data(df = df_train)


features_train = df_train.drop(columns=['IsBadBuy'])
target_train = df_train['IsBadBuy']

##################################
#TEST DATA
##################################

# Impute missing values (using the training statistics) and drop unneeded columns
df_test= data_cleaner(df_test,df_train)


features_test = df_test.drop(columns=['IsBadBuy'])
target_test = df_test['IsBadBuy']

In [None]:
print(features_test.shape)
print(features_train.shape)

![image.png](attachment:7d1b35eb-7132-4c65-96ae-d63be25f15bb.png)

In [None]:
num_cols = features_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = features_train.select_dtypes(include=['object', 'category']).columns.tolist()

print(cat_cols)
space(2)
print(num_cols)

![image.png](attachment:08aceb58-0317-4d27-9ca7-57115c27278e.png)

### Function: results_add

In [None]:
cols = 5
arrays = [["F1"]*cols + ["PRECISION"]*cols + ["RECALL"]*cols + ["ACCURACY"]*cols,
            [' ', ' ', 'Train', 'Val', 'Test']*4]

columns = pd.MultiIndex.from_arrays(arrays, names=('Metric', 'Dataset'))
results = pd.DataFrame(columns=columns)

def results_add(model, results, model_name="model", selection="original"):
    """Evaluate model on the pickled train/val/test sets and add the scores to results."""

    if selection == "original":
        features_train = pd.read_pickle("features_train.p")
        features_val = pd.read_pickle("features_val.p")
        features_test = pd.read_pickle("features_test.p")

    elif selection == "set_1":
        features_train = pd.read_pickle("features_train_sel_1.p")
        features_val = pd.read_pickle("features_val_sel_1.p")
        features_test = pd.read_pickle("features_test_sel_1.p")

    target_train = pd.read_pickle("target_train.p")
    target_val = pd.read_pickle("target_val.p")
    target_test = pd.read_pickle("target_test.p")


    for col, features, target in [("Test", features_test, target_test),
                                ("Val", features_val, target_val),
                                ("Train", features_train, target_train)]:

        if "model_ANN" in model_name:
            # Sequential.predict_classes was removed in TensorFlow 2.6;
            # threshold the predicted probabilities instead
            target_pred = (model.predict(features) > 0.5).astype("int32").ravel()

        else:
            target_pred = model.predict(features)

        for metric_name, metric in [("F1", f1_score),
                                    ("PRECISION", precision_score),
                                    ("RECALL", recall_score),
                                    ("ACCURACY", accuracy_score)]:
            results.loc[model_name, (metric_name, col)] = str(round(metric(target, target_pred) * 100, 1)) + "%"

    results = results.fillna("")

    return results
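
For a Keras model with a single sigmoid output, `predict` returns probabilities rather than class labels, so they have to be turned into 0/1 labels before the scikit-learn metrics can be applied. A minimal sketch with hypothetical probabilities, assuming a decision threshold of 0.5:

```python
import numpy as np

# Hypothetical sigmoid outputs of a binary model, shape (n_samples, 1)
probabilities = np.array([[0.10], [0.73], [0.50], [0.95]])

# Threshold at 0.5 and flatten to a 1-D array of class labels
predicted_classes = (probabilities > 0.5).astype(int).ravel()
print(predicted_classes)  # [0 1 0 1]
```

Lowering the threshold flags more cars as lemons, which raises recall at the cost of precision.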

# Feature Engineering

### Sorting the columns

In [None]:
cols_ready_cat = ["Auction",
                "Make",
                "Color",
                "Transmission",
                "WheelType",
                "Nationality",
                "TopThreeAmericanName"]



cols_ready_num = ["VehOdo",
                    "IsOnlineSale",
                    "WarrantyCost"]


cols_pca_MMR = ["MMRAcquisitionAuctionAveragePrice",
                "MMRAcquisitionAuctionCleanPrice",
                "MMRAcquisitionRetailAveragePrice",
                "MMRAcquisitonRetailCleanPrice",
                "MMRCurrentAuctionAveragePrice",
                "MMRCurrentAuctionCleanPrice",
                "MMRCurrentRetailAveragePrice",
                "MMRCurrentRetailCleanPrice",
                "VehBCost"]


cols_pca_veh_age_year = ["VehYear",
                        "VehicleAge"]


cols_drop = ["Trim",
            "WheelTypeID",
            "BYRNO",
            "VNZIP1",
            "VNST",
            "Model",
            "SubModel",
             "Size"]


features_train_engineered = features_train.drop(columns= cols_drop)



### OneHotEncoding

In [None]:
# Define the ColumnTransformer
# (scikit-learn >= 1.2; older versions use sparse=False instead of sparse_output=False)
transformer_OHE = ColumnTransformer(
    transformers=[
        ("OHE", OneHotEncoder(drop="first", sparse_output=False), cols_ready_cat)
    ],
    remainder="drop",
    n_jobs=1
)

# Apply the transformation
transformed_data = transformer_OHE.fit_transform(features_train_engineered)

# Names of the new columns
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
col_OHE = transformer_OHE.named_transformers_["OHE"].get_feature_names_out(cols_ready_cat)
features_train_OHE = pd.DataFrame(transformed_data, dtype="int", columns=col_OHE, index=features_train_engineered.index)

# Combine into a new DataFrame
features_train_engineered_OHE = pd.concat([features_train_engineered.loc[:, cols_ready_num 
                                                                         + cols_pca_MMR 
                                                                         + cols_pca_veh_age_year], 
                                                                           features_train_OHE], axis=1)


In [None]:
features_train_engineered_OHE.shape

![image.png](attachment:bcd50ca5-0003-41a4-b095-b07533567728.png)

### Function: engineer_features

In [None]:
def engineer_features(df,
                    cols_drop = cols_drop,
                    cols_new_OHE = col_OHE,
                    cols_passthrough = cols_ready_num + cols_pca_MMR + cols_pca_veh_age_year,
                    transformer_OHE = transformer_OHE):
    """Apply the column selection and the fitted one-hot encoder to df."""

    df_copy  = df.drop(columns= cols_drop)

    # OneHotEncoding with the transformer that was fitted on the training data
    transformed_data = transformer_OHE.transform(df_copy)
    df_copy_OHE = pd.DataFrame(transformed_data, dtype="int", columns=cols_new_OHE, index=df_copy.index)
    df_copy = pd.concat([df_copy.loc[:, cols_passthrough], df_copy_OHE], axis=1)

    return df_copy

In [None]:
features_test_engineered_OHE = engineer_features(features_test)

In [None]:
features_test_engineered_OHE.shape

In [None]:
print("TRAIN:")
print("Before:", features_train.shape[1])
print("After: ", features_train_engineered_OHE.shape[1])

# Create space between the sections
print("\n")

print("TEST:")
print("Before:", features_test.shape[1])
print("After: ", features_test_engineered_OHE.shape[1])


![image.png](attachment:f0fad584-2246-4230-9d41-f65fd9320803.png)
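
One thing to keep in mind when an encoder fitted on the training data is applied to other data: by default `OneHotEncoder` raises an error if it meets a category it did not see during `fit` (for example a rare `Make`). The sketch below, on hypothetical toy data, shows the `handle_unknown="ignore"` option, which encodes such rows as all zeros instead. It is an optional variation, not the configuration used above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "TESLA" appears only in the test frame
train_cat = pd.DataFrame({"Make": ["FORD", "DODGE", "FORD"]})
test_cat = pd.DataFrame({"Make": ["DODGE", "TESLA"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cat)

# the default output is a sparse matrix; convert it for display
encoded = encoder.transform(test_cat).toarray()

print(encoder.categories_[0])
print(encoded)
```

The first test row gets a 1 in the `DODGE` column, while the unseen `TESLA` row consists only of zeros.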

# Scaler

In [None]:
scaler = StandardScaler()

features_train_std_arr = scaler.fit_transform(features_train_engineered_OHE)

features_train_engineered_OHE_std = pd.DataFrame(features_train_std_arr,
                                                columns = features_train_engineered_OHE.columns,
                                                index = features_train_engineered_OHE.index)

# Dimensionality Reduction

In [None]:
# PCA Test
# Test MMR
features_train_engineered_MMR = features_train_engineered_OHE_std.loc[:, cols_pca_MMR]
ratio = []
for i in range(1, len(cols_pca_MMR) - 1):
    
    pca_MMR = PCA(n_components=i, random_state=42)
    pca_MMR.fit(features_train_engineered_MMR)
    
    ratio.append(np.cumsum(pca_MMR.explained_variance_ratio_)[-1])
    
    print(i, "Components: ", round(ratio[-1]*100, 1))
    
sns.lineplot(x = range(1, len(cols_pca_MMR) - 1),y = ratio);

![image.png](attachment:dbe13b90-af1c-4f11-9332-cbca939ee096.png)
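
The same curve can be obtained from a single PCA fit, because `explained_variance_ratio_` of a full PCA already contains the ratio of every component. A minimal sketch on hypothetical, strongly correlated toy data; the 95 % target is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data (hypothetical): 4 strongly correlated columns, similar in spirit to the MMR prices
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

pca = PCA()   # keep all components
pca.fit(X)

# cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components that explains at least 95 % of the variance
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3), n_needed)
```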

# PCA

In [None]:

cols_pca = ["MMR_Component_1", "MMR_Component_2", "MMR_Component_3", "veh_age_year_Component"]

transformer_pca = ColumnTransformer(
    transformers=[
        ("pca_MMR", PCA(n_components=3, random_state=42), cols_pca_MMR),
        ("pca_veh_age_year", PCA(n_components=1, random_state=42), cols_pca_veh_age_year)
    ],
    remainder="drop",
    n_jobs=1
)

# Fit the transformer and transform the data

transformed_data = transformer_pca.fit_transform(features_train_engineered_OHE_std)

features_train_engineered_MMR = pd.DataFrame(
    transformed_data,
    columns=cols_pca,
    index=features_train_engineered_OHE_std.index
)

features_train_engineered_ready = pd.concat([features_train_engineered_MMR, features_train_engineered_OHE_std], axis=1)

features_train_engineered_ready = features_train_engineered_ready.drop(cols_pca_MMR + cols_pca_veh_age_year, axis=1)


In [None]:
print("TRAIN:")
print("Before:", features_train_engineered_OHE.shape[1])
print("After: ", features_train_engineered_ready.shape[1])


![image.png](attachment:f993e500-01ba-4b64-bbfe-2fff7f26722d.png)

### Component composition

In [None]:
data_arr = transformer_pca.named_transformers_["pca_MMR"].components_
components_MMR = pd.DataFrame(data_arr, columns=cols_pca_MMR, index=cols_pca[:3]).T

# one plot per MMR component plus one for the age/year component
fig_components, axes = plt.subplots(components_MMR.shape[1]+1,1, figsize=(15,20))

for i, component in enumerate(components_MMR.columns):
    
    sns.barplot(x=components_MMR.loc[:, component], y=components_MMR.index, ax=axes[i])
    
    axes[i].set_title(f"Component composition {cols_pca[i]}")
    
data_arr = transformer_pca.named_transformers_["pca_veh_age_year"].components_
components_veh_age_year = pd.DataFrame(data_arr, columns=cols_pca_veh_age_year, index=[cols_pca[3]]).T
    
sns.barplot(x=components_veh_age_year.loc[:, cols_pca[3]], y=components_veh_age_year.index, ax=axes[3])
    
axes[3].set_title(f"Component composition {cols_pca[3]}")
fig_components.tight_layout()

![image.png](attachment:7e54a340-24c0-4231-ac4b-455cd9c37dc7.png)

In [None]:
components_MMR

![image.png](attachment:59e81cfc-5c80-4539-88bb-32d655c66a6b.png)

# Function: scale_features & pca_features

In [None]:
def scale_features(df, scaler = scaler):
    df_copy = df.copy()
    transformed_data = scaler.transform(df_copy)
    
    df_copy = pd.DataFrame(transformed_data,
                columns = df_copy.columns,
                index = df_copy.index)
    
    return df_copy
    
def pca_features (df, 
                  cols_pca_new = cols_pca,
                  transformer_pca = transformer_pca,
                  cols_pca_old = cols_pca_MMR + cols_pca_veh_age_year):
    
    df_copy = df.copy()
    
    transformed_data = transformer_pca.transform(df_copy)
    
    df_copy_MMR = pd.DataFrame(transformed_data, columns=cols_pca_new, index=df_copy.index)
    
    df_copy = pd.concat([df_copy_MMR, df_copy], axis=1)
    
    df_copy = df_copy.drop(cols_pca_old, axis=1)
    
    
    return df_copy

In [None]:
features_test_engineered_OHE_scaled = scale_features(features_test_engineered_OHE)

features_test_engineered_ready = pca_features(features_test_engineered_OHE_scaled)

In [None]:
features_test_engineered_ready.shape
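
Both helper functions only call `transform`, never `fit`, so the test data is scaled with the mean and standard deviation learned from the training data. A minimal sketch of that behavior on hypothetical toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical)
train_values = np.array([[1.0], [2.0], [3.0]])
test_values = np.array([[2.0], [4.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(train_values)   # learns mean and standard deviation from train only

# transform reuses the training statistics, it does not refit on the test data
test_scaled = toy_scaler.transform(test_values)

print(toy_scaler.mean_, test_scaled.ravel())
```

The test value 2.0 equals the training mean and is therefore mapped to 0, although the mean of the test values themselves is 3.0.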

# Create Validation Set

In [None]:
# Create the validation set and save everything as pickle files
# Split off the validation set from the original features
features_train_engineered_ready, features_val_engineered_ready, target_train, target_val = train_test_split(
    features_train_engineered_ready, target_train, test_size=0.3, random_state=42
)

# Split off the validation set from the first feature selection
# Note: the *_poly variables are not created in any cell above and must exist before this cell runs
features_train_engineered_ready_poly, features_val_engineered_ready_poly, target_train_poly, target_val_poly = train_test_split(
    features_train_engineered_ready_poly, target_train_poly, test_size=0.3, random_state=42
)

# Save the original data to avoid repeating the preparation and to save time
features_train_engineered_ready.to_pickle("features_train.p")
features_val_engineered_ready.to_pickle("features_val.p")
features_test_engineered_ready.to_pickle("features_test.p")

target_train.to_pickle("target_train.p")
target_val.to_pickle("target_val.p")
target_test.to_pickle("target_test.p")

# Save the data of the first selection
features_train_engineered_ready_poly.to_pickle("features_train_sel_1.p")
features_val_engineered_ready_poly.to_pickle("features_val_sel_1.p")
features_test_engineered_ready_poly.to_pickle("features_test_sel_1.p")


# Load Dataset

In [None]:
# Load the original features
features_train_engineered_ready = pd.read_pickle("features_train.p")
features_val_engineered_ready = pd.read_pickle("features_val.p")
features_test_engineered_ready = pd.read_pickle("features_test.p")

target_train = pd.read_pickle("target_train.p")
target_val = pd.read_pickle("target_val.p")
target_test = pd.read_pickle("target_test.p")


# Load the data of the first selection
features_train_engineered_ready_poly = pd.read_pickle("features_train_sel_1.p")
features_val_engineered_ready_poly = pd.read_pickle("features_val_sel_1.p")
features_test_engineered_ready_poly = pd.read_pickle("features_test_sel_1.p")


target_train_poly = target_train
target_val_poly = target_val
target_test_poly = target_test

# Resampling

In [None]:
smotesampler = SMOTE(random_state=42)
oversampler = RandomOverSampler(random_state=42)
undersampler = RandomUnderSampler(random_state=42)

# Resample the original features
features_train_engineered_ready_SMOTE, target_train_SMOTE = smotesampler.fit_resample(features_train_engineered_ready,target_train)
features_train_engineered_ready_Over, target_train_Over = oversampler.fit_resample(features_train_engineered_ready,target_train)
features_train_engineered_ready_Under, target_train_Under = undersampler.fit_resample(features_train_engineered_ready,target_train)

# Resample the first selection (using the *_poly features and targets)
features_train_engineered_ready_poly_SMOTE, target_train_poly_SMOTE = smotesampler.fit_resample(features_train_engineered_ready_poly,target_train_poly)
features_train_engineered_ready_poly_Over, target_train_poly_Over = oversampler.fit_resample(features_train_engineered_ready_poly,target_train_poly)
features_train_engineered_ready_poly_Under, target_train_poly_Under = undersampler.fit_resample(features_train_engineered_ready_poly,target_train_poly)
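
As a rough illustration of what random over- and undersampling do to the class counts, here is a pure-Python sketch on made-up labels. It is not the `imblearn` implementation, and `oversample_indices` and `undersample_indices` are hypothetical helpers introduced only for this example. As in the cell above, only the training data would be resampled; the validation and test sets keep their original class distribution.

```python
import random
from collections import Counter


def _rows_by_class(labels):
    """Group row indices by their class label."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    return by_class


def oversample_indices(labels, seed=42):
    """Repeat randomly chosen minority rows until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = _rows_by_class(labels)
    target = max(len(rows) for rows in by_class.values())
    chosen = []
    for rows in by_class.values():
        chosen.extend(rows)
        chosen.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return chosen


def undersample_indices(labels, seed=42):
    """Keep a random subset of every class that is as large as the smallest one."""
    rng = random.Random(seed)
    by_class = _rows_by_class(labels)
    target = min(len(rows) for rows in by_class.values())
    chosen = []
    for rows in by_class.values():
        chosen.extend(rng.sample(rows, target))
    return chosen


# Made-up imbalanced target: 90 good buys (0) and 10 lemons (1)
toy_target = [0] * 90 + [1] * 10

over_counts = Counter(toy_target[i] for i in oversample_indices(toy_target))
under_counts = Counter(toy_target[i] for i in undersample_indices(toy_target))
print("oversampled:", dict(over_counts))
print("undersampled:", dict(under_counts))
```

SMOTE differs from plain oversampling in that it creates synthetic minority samples by interpolating between neighbours instead of repeating existing rows.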

In [None]:
import warnings
warnings.filterwarnings("ignore", message=".*indexing past lexsort depth.*")

# Trying models

### model_RF_1

In [None]:



dateipfad = '/home/jovyan/work/dsk3de/module-03/model_RF_1.p'

if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_RF_1 = p.load(file)

else:

    model_RF_1 = RandomForestClassifier(n_estimators = 10,
                                        max_depth = 10,
                                        class_weight = "balanced",
                                        random_state = 42)
    model_RF_1.fit(features_train_engineered_ready, target_train)

    # Save the fitted model; the with-block makes sure the file is closed again
    with open(dateipfad, "wb") as file:
        p.dump(model_RF_1, file)
    
results = results_add(model_RF_1,
                     results = results,
                      model_name = "model_RF_1")

results.loc[["model_RF_1"],:]

![image.png](attachment:186930da-4481-4e37-8d6d-ffdad2ca1615.png)
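
The load-or-train pattern used in the cell above recurs for every model in this section. As a minimal sketch that is not part of the original notebook, it could be wrapped in a small helper. The names `load_or_fit` and `build_and_fit` are hypothetical, and the demonstration uses a temporary file and a plain dictionary in place of a real model.

```python
import os
import pickle
import tempfile


def load_or_fit(path, build_and_fit):
    """Return the pickled object stored at `path`, or build it, pickle it and return it.

    `build_and_fit` is a hypothetical zero-argument callable that returns a fitted model.
    """
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    model = build_and_fit()
    with open(path, "wb") as file:
        pickle.dump(model, file)
    return model


# Demonstration with a stand-in "model" (a dict) inside a temporary directory
calls = []


def fake_training():
    calls.append("fit")  # record that training actually ran
    return {"max_depth": 10}


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model_demo.p")
    first = load_or_fit(demo_path, fake_training)   # trains and saves
    second = load_or_fit(demo_path, fake_training)  # loads from disk, no retraining

print(first, second, calls)
```

With such a helper, a model cell could shrink to a single call such as `load_or_fit(dateipfad, lambda: RandomForestClassifier(...).fit(features_train_engineered_ready, target_train))`, since scikit-learn's `fit` returns the fitted estimator.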

### model_RF_Grid_1

In [None]:
#Random Forest GridSearch
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_RF_Grid_1.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_RF_Grid_1 = p.load(file)
        
else:

    search_dict = {"max_depth": range(1,50),
                    "class_weight": [None, "balanced", "balanced_subsample"]}
    model_RF_Grid_1 = GridSearchCV(estimator = RandomForestClassifier(random_state = 42, n_estimators = 10),
                                    param_grid = search_dict,
                                    scoring = "f1",
                                    cv = 5,
                                    n_jobs = -1)
    
    model_RF_Grid_1.fit(features_train_engineered_ready, target_train)
    with open(dateipfad, "wb") as file:
        p.dump(model_RF_Grid_1, file)
    
results = results_add(model_RF_Grid_1,
                    results = results,
                    model_name = "model_RF_Grid_1")
results.loc[["model_RF_Grid_1"],:]

![image.png](attachment:e935d510-d5ab-4796-a920-3786e32c87ae.png)

### model_RF_Grid_2

In [None]:
# Random Forest GridSearch
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_RF_Grid_2.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_RF_Grid_2 = p.load(file)
else:
    search_dict = {"max_depth": range(1,50),
                "class_weight": [None, "balanced", "balanced_subsample"]}
    
    model_RF_Grid_2 = GridSearchCV(estimator = RandomForestClassifier(random_state = 42, n_estimators = 10),
                                    param_grid = search_dict,
                                    scoring = "precision",
                                    cv = 5,
                                    n_jobs = -1)
    
    model_RF_Grid_2.fit(features_train_engineered_ready, target_train)
    with open(dateipfad, "wb") as file:
        p.dump(model_RF_Grid_2, file)
    
results = results_add(model_RF_Grid_2,
                    results = results,
                    model_name = "model_RF_Grid_2")

results.loc[["model_RF_Grid_2"],:]

![image.png](attachment:873c6c57-79f8-4dc3-b756-aac1cfbffb69.png)

### model_Log_1

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_Log_1.p'


if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_Log_1 = p.load(file)


else:
    model_Log_1 = LogisticRegression(penalty = "l2",
                                            C = 1,
                                            solver = "lbfgs",
                                            random_state = 42)
    
    model_Log_1.fit(features_train_engineered_ready, target_train)
    with open(dateipfad, "wb") as file:
        p.dump(model_Log_1, file)

results = results_add(model_Log_1,
                    results = results,
                    model_name = "model_Log_1")

results.loc[["model_Log_1"],:]

![image.png](attachment:e4391e96-3747-4ae2-b63f-e76909c723dd.png)

### model_Log_Grid_1

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_Log_Grid_1.p'


if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_Log_Grid_1 = p.load(file)
    
else:
    search_dict = {"C": np.geomspace(0.001,1000,14),
                    "penalty": ["l1", "l2"],
                    "solver": ["liblinear", "saga"],
                    "class_weight": [None, "balanced"]}
    model_Log_Grid_1 = GridSearchCV(estimator = LogisticRegression(random_state = 42, max_iter=100),
                                                param_grid = search_dict,
                                                scoring = "f1",
                                                cv = 5,
                                                n_jobs = -1)
    
    model_Log_Grid_1.fit(features_train_engineered_ready, target_train)
    with open(dateipfad, "wb") as file:
        p.dump(model_Log_Grid_1, file)
    
results = results_add(model_Log_Grid_1,
                results = results,
                model_name = "model_Log_Grid_1")

results.loc[["model_Log_Grid_1"],:]

![image.png](attachment:dba22ea3-a5eb-4744-b11c-16eab64c285a.png)

### model_Log_Grid_2

In [None]:
import os
import pickle as p
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Define paths
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_Log_Grid_2.p'

# Check if the model already exists
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_Log_Grid_2 = p.load(file)
else:
    # Define parameter grid for GridSearchCV
    search_dict = {
        "C": np.geomspace(0.001, 1000, 14),
        "penalty": ["l1", "l2"],
        "solver": ["liblinear", "saga"],
        "class_weight": [None, "balanced"]
    }

    # Create and fit the GridSearchCV model
    model_Log_Grid_2 = GridSearchCV(
        estimator=LogisticRegression(random_state=42, max_iter=1000),  # Increased max_iter
        param_grid=search_dict,
        scoring="precision",
        cv=5,
        n_jobs=-1
    )

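    # Standardize the features once more (they already went through scale_features);
    # any data passed to this model later must be transformed with the same scaler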
    scaler = StandardScaler()
    features_train_scaled = scaler.fit_transform(features_train_engineered_ready)
    
    # Fit the model
    model_Log_Grid_2.fit(features_train_scaled, target_train)
    
    # Save the model
    with open(dateipfad, "wb") as file:
        p.dump(model_Log_Grid_2, file)

# Add results to the results DataFrame
results = results_add(model_Log_Grid_2, results=results, model_name="model_Log_Grid_2")
results.loc[["model_Log_Grid_2"], :]


![image.png](attachment:53f14804-3997-4e49-828e-ffab9433f0af.png)

### model_KNN_1

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_KNN_1.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_KNN_1 = p.load(file)
    
else:
    model_KNN_1 = KNeighborsClassifier(n_neighbors = 2,
                                        weights = "uniform")
    
    model_KNN_1.fit(features_train_engineered_ready_SMOTE, target_train_SMOTE)
    with open(dateipfad, "wb") as file:
        p.dump(model_KNN_1, file)
    
results = results_add(model_KNN_1,
            results = results,
            model_name = "model_KNN_1")

results.loc[["model_KNN_1"],:]

![image.png](attachment:6975890d-43ec-453c-9717-5f3fb29d4efc.png)

### model_SVM_1 (rbf)

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_SVM_1.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_SVM_1 = p.load(file)
else:
    model_SVM_1 = SVC(C = 3,
                        kernel = "rbf",
                        gamma = "scale",
                        class_weight = "balanced",
                        random_state = 42)
    
    model_SVM_1.fit(features_train_engineered_ready, target_train)
    with open(dateipfad, "wb") as file:
        p.dump(model_SVM_1, file)
    
results = results_add(model_SVM_1,
                        results = results,
                        model_name = "model_SVM_1")

results.loc[["model_SVM_1"],:]

![image.png](attachment:3dcbc598-1906-401d-9bdb-263c64113e9a.png)

### model_SVM_2 (rbf)

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_SVM_2.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_SVM_2 = p.load(file)
        
else:
    model_SVM_2 = SVC(C = 3,
                    kernel = "rbf",
                    gamma = "scale",
                    random_state = 42)
    
    model_SVM_2.fit(features_train_engineered_ready, target_train)
    
    with open(dateipfad, "wb") as file:
        p.dump(model_SVM_2, file)
    
results = results_add(model_SVM_2,
                    results = results,
                    model_name = "model_SVM_2")

results.loc[["model_SVM_2"],:]

![image.png](attachment:018026ce-a441-4467-a068-a339c505de16.png)

### model_SVM_3 (Poly, 2)

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_SVM_3.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_SVM_3 = p.load(file)
else:
    model_SVM_3 = SVC(C = 3,
                    kernel = "poly",
                    degree = 2,
                    gamma = "scale",
                    random_state = 42)
    
    model_SVM_3.fit(features_train_engineered_ready, target_train)

    with open(dateipfad, "wb") as file:
        p.dump(model_SVM_3, file)
    
results = results_add(model_SVM_3,
                    results = results,
                    model_name = "model_SVM_3")

results.loc[["model_SVM_3"],:]

![image.png](attachment:8a93d302-bc1c-4318-bf70-ae199e4f12f7.png)

### model_SVM_5 (Poly, 4)

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_SVM_5.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_SVM_5 = p.load(file)
        
else:
    model_SVM_5 = SVC(C = 3,
                    kernel = "poly",
                    degree = 4,
                    gamma = "scale",
                    random_state = 42)

    model_SVM_5.fit(features_train_engineered_ready, target_train)
    with open(dateipfad, "wb") as file:
        p.dump(model_SVM_5, file)
    
results = results_add(model_SVM_5,
                    results = results,
                    model_name = "model_SVM_5")

results.loc[["model_SVM_5"],:]

![image.png](attachment:f4a37161-e212-4d2e-8708-30271787b534.png)

### model_SVM_6

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_SVM_6.p'

# Check if the model already exists
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_SVM_6 = p.load(file)
else:
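    # Standardize the features once more (they already went through scale_features);
    # any data passed to this model later must be transformed with the same scaler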
    scaler = StandardScaler()
    features_train_scaled = scaler.fit_transform(features_train_engineered_ready)
    
    # Train the model
    model_SVM_6 = SVC(C=1, kernel="rbf", gamma="scale", random_state=42, cache_size=500)
    model_SVM_6.fit(features_train_scaled, target_train)
    
    # Save the model
    with open(dateipfad, "wb") as file:
        p.dump(model_SVM_6, file)

results = results_add(model_SVM_6, results=results, model_name="model_SVM_6")
results.loc[["model_SVM_6"], :]


![image.png](attachment:b45bef9f-b849-414d-b2d4-4519b218b721.png)

### model_SVM_Grid_1

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_SVM_Grid_1.p'
if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_SVM_Grid_1 = p.load(file)
else:
    search_dict = {"C": [1, 5, 10]}
    model_SVM_Grid_1 = GridSearchCV(estimator = SVC(random_state=42, gamma="scale", kernel = "rbf"),
                                                    param_grid = search_dict,
                                                    scoring = "precision",
                                                    cv = 5,
                                                    n_jobs = -1)
    model_SVM_Grid_1.fit(features_train_engineered_ready, target_train)
    with open(dateipfad, "wb") as file:
        p.dump(model_SVM_Grid_1, file)
    
results = results_add(model_SVM_Grid_1,
                    results = results,
                    model_name = "model_SVM_Grid_1")
results.loc[["model_SVM_Grid_1"],:]

![image.png](attachment:34a1f240-849c-43a8-a08b-f2628d523241.png)

### model_SVM_Lin_1

In [None]:
dateipfad = '/home/jovyan/work/dsk3de/module-03/model_SVM_Lin_1.p'

if os.path.exists(dateipfad):
    with open(dateipfad, "rb") as file:
        model_SVM_Lin_1 = p.load(file)
        
else:
    model_SVM_Lin_1 = LinearSVC(penalty = "l2",
                        loss = 'squared_hinge',
                        C = 1,
                        class_weight = "balanced",
                        random_state = 42)

    model_SVM_Lin_1.fit(features_train_engineered_ready, target_train)

    with open(dateipfad, "wb") as file:
        p.dump(model_SVM_Lin_1, file)

results = results_add(model_SVM_Lin_1,
                        results = results,
                        model_name = "model_SVM_Lin_1")
results.loc[["model_SVM_Lin_1"],:]

![image.png](attachment:99dae75a-b1eb-45ab-872a-bd9bf4b80837.png)

### model_ANN Test SMOTE, OVER, UNDER

In [None]:
import os
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
import matplotlib.pyplot as plt

# Neural networks
dateipfad_SMOTE = '/home/jovyan/work/dsk3de/module-03/model_ANN_SMOTE_0.h5'
dateipfad_OVER = '/home/jovyan/work/dsk3de/module-03/model_ANN_OVER_0.h5'
dateipfad_UNDER = '/home/jovyan/work/dsk3de/module-03/model_ANN_UNDER_0.h5'

dateipfade = [dateipfad_SMOTE, dateipfad_OVER, dateipfad_UNDER]

model_ANN_SMOTE_0 = Sequential()
model_ANN_OVER_0 = Sequential()
model_ANN_UNDER_0 = Sequential()

models = ["SMOTE", "OVER", "UNDER"]

for i, (model, features_ANN, target_ANN) in enumerate([
    (model_ANN_SMOTE_0, features_train_engineered_ready_SMOTE, target_train_SMOTE),
    (model_ANN_OVER_0, features_train_engineered_ready_Over, target_train_Over),
    (model_ANN_UNDER_0, features_train_engineered_ready_Under, target_train_Under)
]):
    if os.path.exists(dateipfad_SMOTE) and os.path.exists(dateipfad_OVER) and os.path.exists(dateipfad_UNDER):
        if i == 0:
            model_ANN_SMOTE_0 = load_model(dateipfad_SMOTE)
            model_ANN_OVER_0 = load_model(dateipfad_OVER)
            model_ANN_UNDER_0 = load_model(dateipfad_UNDER)
    else:
        units = 50
        
        hidden_first = Dense(units=units, activation='relu', input_dim=features_ANN.shape[1])
        hidden_second = Dense(units=units, activation='relu')
        hidden_third = Dense(units=units, activation='relu')
        hidden_fourth = Dense(units=units, activation='relu')     
        hidden_fifth = Dense(units=units, activation='relu')
        
        model.add(hidden_first)
        model.add(hidden_second)
        model.add(hidden_third)
        model.add(hidden_fourth)
        model.add(hidden_fifth)
                        
        output_layer = Dense(units=1, activation='sigmoid')
        model.add(output_layer)
    
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        
        epochs = 20
                                           
        hist = model.fit(features_ANN,
                         target_ANN,
                         epochs=epochs,
                         batch_size=64,
                         validation_data=(features_val_engineered_ready, target_val))
                                           
        model.save(dateipfade[i])
                                           
        if i == 0:
            fig, axs = plt.subplots(3, 2, figsize=(15, 15))
                                           
        axs[i][0].plot(hist.history['loss'])
        axs[i][0].plot(hist.history['val_loss'])
        axs[i][0].set(title=f'Model {models[i]} Loss', ylabel='Loss', xlabel='Epoch')
        axs[i][0].legend(['train', 'val'])
            
        axs[i][1].plot(hist.history['accuracy'])
        axs[i][1].plot(hist.history['val_accuracy'])
        axs[i][1].set(title=f'Model {models[i]} Accuracy', ylabel='Accuracy', xlabel='Epoch')                                
        axs[i][1].legend(['train', 'val'], loc='upper left')

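# Adjust layout to prevent overlap; a figure only exists if the models were trained
# in this run (when they are loaded from disk, `fig` is never created)
if plt.get_fignums():
    fig.tight_layout()
    plt.show()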


In [None]:
model_ANN_OVER_1 = Sequential()
model_ANN_OVER_2 = Sequential()
model_ANN_OVER_3 = Sequential()
model_ANN_OVER_4 = Sequential()
model_ANN_OVER_5 = Sequential()

models = [model_ANN_OVER_1, model_ANN_OVER_2, model_ANN_OVER_3, model_ANN_OVER_4, model_ANN_OVER_5]

for epochs in range(1,6):
    
    model = models[epochs-1]
    
    units = 50
    hidden_first = Dense(units=units, activation='relu', input_dim=features_train_engineered_ready_Over.shape[1])
    hidden_second = Dense(units=units, activation='relu')
    hidden_third = Dense(units=units, activation='relu')
    hidden_fourth = Dense(units=units, activation='relu')
    hidden_fifth = Dense(units=units, activation='relu')
    
    
    model.add(hidden_first)
    model.add(hidden_second)
    model.add(hidden_third)
    model.add(hidden_fourth)
    model.add(hidden_fifth)
        
    output_layer = Dense(units=1, activation='sigmoid')
    
    model.add(output_layer)
    model.compile(optimizer="adam",loss='binary_crossentropy',metrics=["accuracy"])
    
    # Model i is trained for i epochs (1 to 5)
    hist = model.fit(features_train_engineered_ready_Over,
                     target_train_Over,
                     epochs=epochs,
                     batch_size=64,
                     validation_data=(features_val_engineered_ready, target_val))
    results = results_add(model,
                          results=results,
                          model_name=f"model_ANN_OVER_{epochs}")
results.loc[["model_ANN_OVER_1","model_ANN_OVER_2","model_ANN_OVER_3","model_ANN_OVER_4","model_ANN_OVER_5"],:]

![image.png](attachment:73c510cd-0f3d-431e-ad9a-f9f11ad26e1f.png)

# Model Selection

In [None]:
results

![image.png](attachment:6b39079b-1ac6-4cef-840e-d0823f8a8277.png)
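
The scenario asks for a model that rules out as many lemons as possible without rejecting too many good cars, which corresponds to a trade-off between recall and precision on the `IsBadBuy = 1` class. The columns of `results` are not visible in this export, so the following is only a sketch of how these metrics follow from confusion-matrix counts. The counts are made up and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: the validation set contains 100 lemons.
# The model flags 80 cars as lemons, 40 of them correctly.
precision, recall, f1 = precision_recall_f1(tp=40, fp=40, fn=60)

print(f"precision: {precision:.3f}")  # share of flagged cars that really are lemons
print(f"recall:    {recall:.3f}")     # share of all lemons that were flagged
print(f"f1:        {f1:.3f}")         # harmonic mean of the two
```

In this reading, low precision means many good purchases are excluded, while low recall means many lemons slip through.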

# Final Model

In [None]:

model_final = model_RF_1
print("Parameter:")
print(f"n_estimators: {model_final.get_params()['n_estimators']}")
print(f"max_depth: {model_final.get_params()['max_depth']}")
print(f"class_weight: {model_final.get_params()['class_weight']}")


![image.png](attachment:68f7555a-788b-4027-944f-614c3eb55e4e.png)

# The Final Data Pipeline

In [None]:
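def data_pred(df, model=model_final):
    """Run the full preparation pipeline on raw data and return the model's predictions."""
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # PurchDate is interpreted as a Unix timestamp in seconds
    df["PurchDate"] = pd.to_datetime(df["PurchDate"], unit="s")
    df = clean_data(df)
    df = engineer_features(df)
    df = scale_features(df)
    df = pca_features(df)
    # Note: the resulting columns must match those the model was trained on
    df = poly_features(df)

    return model.predict(df)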

# Project Completion

In [None]:
features_aim = pd.read_csv("features_aim.csv")
# Make the predictions
target_aim_pred = pd.DataFrame(data_pred(features_aim), columns=["IsBadBuy_pred"])

# Save the predictions as a CSV file
target_aim_pred.to_csv("predictions_aim.csv")