<a href="https://www.kaggle.com/code/gizemnalbantarslan/car-price-prediction-linear-regression?scriptVersionId=199070340" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **CAR PRICE PREDICTION**

Geely Auto, an automobile company, wants to set up its production unit in the US to compete with its counterparts here. First, it hired a consulting company to prepare a database of its competitors' vehicles on the market, including their features and prices. 

The company wants to know:

* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

We will model car prices with the available independent variables, and this model will be used by management to understand exactly how prices vary with the independent variables.

In [None]:
# 1.Import and Requirements

In [None]:
# import and requirements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler

warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# 2.Reading and analyzing the dataset

In [None]:
df_ = pd.read_csv("../input/car-price-prediction/CarPrice_Assignment.csv")
# Work on a copy so the CSV does not have to be re-read while experimenting.
df=df_.copy()
df = df.drop("car_ID", axis=1)
df.head()

In [None]:
def check_df(dataframe):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(3))
    print("##################### Tail #####################")
    print(dataframe.tail(3))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### describe #####################")
    print(dataframe.describe())


check_df(df)

We see that there are no missing values, so no imputation is needed.

Finally, let's look at the unique values of the “CarName” variable to catch spelling errors introduced when the dataset was created.

But first, to simplify the analysis and avoid noise, we will keep only the company name of each vehicle.

In [None]:
df['CarName'] = df['CarName'].str.split(' ',expand=True)[0]
df['CarName'].unique()

**OBSERVATION**

As we can see, some company names are misspelled or written inconsistently. This splits a single company across several categories and distorts the per-company statistics. Let's correct these entries.

Misspellings and their corrections:
* maxda → mazda
* Nissan → nissan
* porcshce → porsche
* toyouta → toyota
* vokswagen, vw → volkswagen

In [None]:
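def replace(f, t):
    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which newer pandas versions warn about.
    df["CarName"] = df["CarName"].replace(f, t)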

replace('maxda','mazda')
replace('porcshce','porsche')
replace('toyouta','toyota')
replace('vokswagen','volkswagen')
replace('vw','volkswagen')
replace('Nissan','nissan')

In [None]:
# check it
df['CarName'].unique()
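
The same cleanup can also be written as a single mapping. The sketch below is only an illustration: the `raw` values are a hypothetical sample, not rows from the dataset, and `corrections` is a name introduced here.

```python
import pandas as pd

# Hypothetical sample of raw CarName values (not taken from the real dataset).
raw = pd.Series([
    "maxda rx3", "Nissan versa", "porcshce panamera",
    "toyouta tercel", "vokswagen rabbit", "vw dasher", "audi 100 ls",
])

# Keep only the first token, i.e. the company name.
brand = raw.str.split(" ").str[0]

# One dictionary holds every correction; values without an entry stay untouched.
corrections = {
    "maxda": "mazda",
    "Nissan": "nissan",
    "porcshce": "porsche",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
}
brand = brand.replace(corrections)
print(brand.tolist())
```

Keeping the corrections in one dictionary makes it easy to review them at a glance and to add new ones later.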

# 3.Variable analysis

Some variables may not be of the type their dtype suggests, so we'll analyze them with the function “grab_col_names”.

In [None]:
def grab_col_names(dataframe, cat_th=10, car_th=20):

    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]

    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]

    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]

    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')

    return cat_cols, cat_but_car, num_cols

In [None]:
cat_cols, cat_but_car, num_cols = grab_col_names(df,10,30)

As a result of this analysis, we can see that the numerical variable “symboling” is actually categorical. This variable is shown in “cat_cols” as the output of the function.
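
The idea behind this reclassification can be shown on a toy example. The sketch below does not call `grab_col_names`; it re-creates only the cardinality check on a made-up DataFrame, and the threshold of 5 is an assumption chosen to fit the toy data.

```python
import pandas as pd

# Made-up data: "symboling" is stored as integers but has only a few levels.
toy = pd.DataFrame({
    "symboling": [3, 1, 2, 0, 3, 1, 2, 0],
    "fueltype": ["gas", "diesel"] * 4,
    "horsepower": [111, 154, 102, 115, 110, 140, 160, 101],
})

cat_th = 5  # assumed threshold for this toy example

numeric = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]

# Numeric columns with fewer than cat_th distinct values are treated as categorical.
num_but_cat = [c for c in numeric if toy[c].nunique() < cat_th]
num_cols = [c for c in numeric if c not in num_but_cat]
cat_cols = [c for c in toy.columns if c not in numeric] + num_but_cat

print("num_but_cat:", num_but_cat)
print("num_cols:", num_cols)
print("cat_cols:", cat_cols)
```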

> # 3.1 Analysis of Categorical Variables

In [None]:
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))

    if plot:
        plt.figure(figsize=(7,6))
        plt.xticks(rotation=90)
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show()


for col in cat_cols:
    cat_summary(df, col,True)

In [None]:
# Among these variables, seeing the CarName variable visually can give us a meaningful insight.
sns.countplot(data=df, x="CarName", order=df['CarName'].value_counts().index)
plt.xticks(rotation=90, horizontalalignment='right',fontweight='light',fontsize='x-large')
plt.show()

**OBSERVATION**

* About 90% of the cars use gas as fueltype, and about 98% have the engine at the front (enginelocation).

* The count plot also shows that the most frequently represented brands are Japanese (for example toyota, nissan and mazda).

> # 3.2.Analysis of Numerical Variables

In [None]:
def num_summary(dataframe, numerical_col, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=50)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show()

    print("#####################################")


for col in num_cols:
    num_summary(df, col, True)

> # 3.3.Analysis of Target Variable

Let's analyze the relationships between these variables and the target variable.

In [None]:
def target_summary_with_cat(dataframe, target, categorical_col):
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean()}), end="\n\n\n")

In [None]:
for col in cat_cols:
    target_summary_with_cat(df,"price",col)

In [None]:
df["price"].hist(bins=100)
plt.show()

We have already seen which vehicle brand is more preferred, but examining which vehicle brand has a higher price can also be useful for senior management in making decisions. 

Let's examine the relationship between average price and brand:

In [None]:
plt.subplot(1,1,1)
x = pd.DataFrame(df.groupby("CarName")["price"].mean().sort_values(ascending=False))
sns.barplot(x=x.index,y="price",data=x) 
plt.xticks(rotation=90)
plt.title("Car Company vs Average Price", pad=10, fontweight="black", fontsize=20)
plt.tight_layout()
plt.show()

# 4.Analysis of Correlation

Let us examine how the numerical variables correlate with each other using a correlation matrix and a heatmap.

In [None]:
corr = df[num_cols].corr()

In [None]:
sns.set(rc={'figure.figsize': (12, 12)})
sns.heatmap(corr, cmap="RdBu")
plt.show()
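
A heatmap is easier to act on when paired with a ranked list of correlations against the target. The following is a minimal sketch on synthetic data; the column names mirror the dataset, but the values are generated here and are not results from it.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on enginesize, citympg falls as enginesize grows,
# and "noise" is unrelated to everything else.
rng = np.random.default_rng(0)
n = 200
enginesize = rng.normal(130, 40, n)
toy = pd.DataFrame({
    "enginesize": enginesize,
    "citympg": 50 - 0.15 * enginesize + rng.normal(0, 2, n),
    "noise": rng.normal(0, 1, n),
    "price": 100 * enginesize + rng.normal(0, 1500, n),
})

# Correlation of every feature with the target, ordered by absolute strength.
corr_with_price = (
    toy.corr()["price"]
    .drop("price")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_price)
```

Sorting by absolute value keeps strongly negative relationships, such as fuel economy against price, near the top of the list.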

# 5. Feature Engineering

**Outlier analysis**

Let's examine the outliers and see if they need to be suppressed. 

In [None]:
def outlier_thresholds(dataframe, variable, low_quantile=0.10, up_quantile=0.90):
    quantile_one = dataframe[variable].quantile(low_quantile)
    quantile_three = dataframe[variable].quantile(up_quantile)
    interquantile_range = quantile_three - quantile_one
    up_limit = quantile_three + 1.5 * interquantile_range
    low_limit = quantile_one - 1.5 * interquantile_range
    return low_limit, up_limit

In [None]:
def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False

In [None]:
for col in num_cols:
    if col != "price":
        print(col, check_outlier(df, col))

When we look at the results, we see that the enginesize and compressionratio variables contain outliers, but industry knowledge tells us that such values do occur in real cars. For this reason, we continue without suppressing the outliers.
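
For reference, this is roughly what suppression would look like if we chose to apply it. The sketch uses the same 10%/90% quantile rule on a made-up series; `outlier_bounds` is a helper name introduced here for illustration.

```python
import pandas as pd

def outlier_bounds(series, low_q=0.10, up_q=0.90):
    """Return (low, up) limits: the quantiles widened by 1.5x their spread."""
    q_low = series.quantile(low_q)
    q_up = series.quantile(up_q)
    spread = q_up - q_low
    return q_low - 1.5 * spread, q_up + 1.5 * spread

# Made-up compression-ratio-like values with one extreme entry.
values = pd.Series([8.0, 8.5, 9.0, 9.0, 9.2, 9.4, 9.5, 10.0, 10.0, 23.0])

low, up = outlier_bounds(values)
is_outlier = (values < low) | (values > up)

# Suppression: values outside the limits are pulled back to the nearest limit.
clipped = values.clip(lower=low, upper=up)

print("limits:", low, up)
print("outliers found:", int(is_outlier.sum()))
print("max after clipping:", clipped.max())
```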

> # 5.1 Rare Encoding

With rare encoding, we try to reduce redundancy by merging infrequent categories into a single group.

In [None]:
def rare_analyser(dataframe, target, cat_cols):
    for col in cat_cols:
        print(col, ":", len(dataframe[col].value_counts()))
        print(pd.DataFrame({"COUNT": dataframe[col].value_counts(),
                            "RATIO": dataframe[col].value_counts() / len(dataframe),
                            "TARGET_MEAN": dataframe.groupby(col)[target].mean()}), end="\n\n\n")

rare_analyser(df, "price", cat_cols)

We can see that some categories are very rare, with a ratio of about 0.01 or less of the observations. However, in most variables only one or two subcategories would be merged as rare, so this change would not lead to a meaningful result in our dataset. Therefore, instead of making a permanent change with a “rare_encoder”, it makes more sense to decide on meaningful variables based on our analysis.
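
The notebook does not define “rare_encoder”, so the following is only a sketch of what such a helper might look like. The function signature, the 5% threshold and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def rare_encoder(dataframe, columns, rare_perc=0.05, rare_label="Rare"):
    """Return a copy where levels rarer than rare_perc are merged into rare_label."""
    out = dataframe.copy()
    for col in columns:
        freq = out[col].value_counts(normalize=True)
        rare_levels = freq[freq < rare_perc].index
        # Keep frequent levels as they are; overwrite rare ones with the shared label.
        out[col] = out[col].where(~out[col].isin(rare_levels), rare_label)
    return out

# Made-up column: two levels appear only once in 24 rows (about 4% each).
toy = pd.DataFrame({"fuelsystem": ["mpfi"] * 12 + ["2bbl"] * 10 + ["spfi", "mfi"]})

encoded = rare_encoder(toy, ["fuelsystem"], rare_perc=0.05)
print(encoded["fuelsystem"].value_counts())
```

Because the function works on a copy, the original DataFrame stays available for comparison.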

> OBSERVATION

* When we compare brands, some are sold at high prices while others are sold at low prices. It may make sense to group these prices into categories rather than examining them separately.
* We see that fueltype, aspiration, drivewheel and enginetype have a large proportional effect on the target variable, while doornumber, carbody, cylindernumber and fuelsystem affect the target variable in quantity.

The categorical variables that make a significant difference on the price target variable are as follows:

* CarName
* fueltype
* aspiration
* doornumber
* carbody
* drivewheel
* enginetype
* cylindernumber
* fuelsystem

The numerical variables that make a significant difference on the price target variable are as follows:

* wheelbase
* carlength
* carwidth
* curbweight
* enginesize
* boreratio
* horsepower
* citympg
* highwaympg


> # 5.2.Creation of new variables

We have already established that we need to categorize the price variable. Now let's analyze the variable and decide on our cut points.

In [None]:
num_summary(df, "price")

We can choose cut points near the 50%, 90% and 95% quantiles as meaningful boundaries between price ranges.

In [None]:
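# The top bin is open-ended so that any price of 40,000 or more still gets a label.
# Note: CarsRange is derived from the target (price) itself.
bins = [0, 10000, 20000, np.inf]
cars_bin = ['Budget', 'Medium', 'Highend']
df['CarsRange'] = pd.cut(df["price"], bins, right=False, labels=cars_bin)
df.head()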

In [None]:
df.head()
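
The choice of the last bin edge matters with `pd.cut`: values outside the outer edges are not assigned to any bin. The sketch below compares a capped and an open-ended top edge on hypothetical prices (these are illustrative numbers, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including values exactly on the bin edges and one above 40,000.
prices = pd.Series([5118.0, 9999.0, 10000.0, 19999.0, 20000.0, 45400.0])
labels = ["Budget", "Medium", "Highend"]

# right=False makes each bin closed on the left: [0, 10000), [10000, 20000), ...
capped = pd.cut(prices, [0, 10000, 20000, 40000], right=False, labels=labels)
open_ended = pd.cut(prices, [0, 10000, 20000, np.inf], right=False, labels=labels)

print(pd.DataFrame({"price": prices, "capped": capped, "open_ended": open_ended}))
print("unlabelled rows with a capped top edge:", int(capped.isna().sum()))
```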

Now let's create new ratio variables, adding the prefix “NEW” to the name of each variable we create.

In [None]:
df["NEW_cmpgrpm"] = df["peakrpm"] / df["citympg"]
df["NEW_hmpgrpm"] = df["peakrpm"] / df["highwaympg"]
df["NEW_horserpm"] = df["peakrpm"] / df["horsepower"]
# The original cell assigned NEW_horseng twice; the second assignment repeated
# the NEW_horserpm ratio and overwrote this one, so only this one is kept.
df["NEW_horseng"] = df["horsepower"] / df["enginesize"]
df["NEW_engcmpg"] = df["enginesize"] / df["citympg"]
df["NEW_enghmpg"] = df["enginesize"] / df["highwaympg"]
df["NEW_compcmpg"] = df["citympg"] / df["compressionratio"]
df["NEW_comphmpg"] = df["highwaympg"] / df["compressionratio"]
df["NEW_comphorse"] = df["horsepower"] / df["compressionratio"]

In [None]:
df.head()

> # 5.3 Label Encoding & One-Hot Encoding

In [None]:
cat_cols, cat_but_car, num_cols = grab_col_names(df,10,30)

In [None]:
def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

binary_cols = [col for col in df.columns if df[col].dtypes == "O" and len(df[col].unique()) == 2]

for col in binary_cols:
    label_encoder(df, col)

In [None]:
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, cat_cols, drop_first=True)
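
The effect of `drop_first=True` can be seen on a small example. This is an illustrative sketch with a made-up column, showing that the first level (in sorted order) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Made-up column with three levels.
toy = pd.DataFrame({"drivewheel": ["fwd", "rwd", "4wd", "fwd"]})

full = pd.get_dummies(toy, columns=["drivewheel"])
reduced = pd.get_dummies(toy, columns=["drivewheel"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters most for linear models.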

In [None]:
scaler = StandardScaler()

In [None]:
df[num_cols] = scaler.fit_transform(df[num_cols])
df.head()
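
An alternative worth considering is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that the test set does not influence the scaling. The sketch below shows that pattern on synthetic data; it is not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for the numerical columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # mean and std learned from training rows only
X_test_s = scaler.transform(X_test)        # same statistics reused for the test rows

print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
print(X_test_s.shape)
```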

# 6. MODELING

In [None]:
y = df['price']
x = df.drop(["price"], axis=1)

> # 6.1.Splitting

In [None]:
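# random_state is fixed (arbitrary value) so that the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)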

> # 6.2.Model

Since our dependent variable is numeric, we use regression models.

In [None]:
models = [('LR', LinearRegression()),
          #("Ridge", Ridge()),
          #("Lasso", Lasso()),
          #("ElasticNet", ElasticNet()),
          ('KNN', KNeighborsRegressor()),
          ('CART', DecisionTreeRegressor()),
          ('RF', RandomForestRegressor()),
          #('SVR', SVR()),
          ('GBM', GradientBoostingRegressor()),
          ("XGBoost", XGBRegressor(objective='reg:squarederror'))]

In [None]:
for name, regressor in models:
    rmse = np.mean(np.sqrt(-cross_val_score(regressor, x, y, cv=5, scoring="neg_mean_squared_error")))
    print(f"RMSE: {round(rmse, 4)} ({name}) ")
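
The loop above turns scikit-learn's negated MSE scores into a per-fold RMSE and then averages them. The sketch below reproduces that computation on synthetic data and checks it against the built-in `neg_root_mean_squared_error` scorer. The data and the `KFold` settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the car data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A fixed splitter so both scoring runs see exactly the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so MSE is returned negated; flip the sign first.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")
rmse_from_mse = np.sqrt(-neg_mse)

# The built-in RMSE scorer gives the same per-fold values, also negated.
neg_rmse = cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_root_mean_squared_error")

print("per-fold RMSE:", rmse_from_mse.round(3))
print("mean RMSE:", rmse_from_mse.mean().round(3))
```

Note that the RMSE is expressed in the units of the target as passed to the model, so it depends on whether the target was scaled beforehand.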

# OBSERVATION

* We get the best model performance from the linear regression model. This is a plausible result when a numerical dependent variable depends on several independent variables in proportion to their weights.
* However, 77% may not be enough for us. As a next step, if desired, we could apply hyperparameter optimization, or rebuild the model using only the variables selected by feature importance.
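
The two next steps mentioned above could be sketched as follows. This is a minimal illustration on synthetic data: the parameter grid, the choice of a random forest and the feature names `f0`..`f5` are all assumptions, not tuned settings for the car dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with only two informative features out of six.
X, y = make_regression(n_samples=150, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# Feature importances of the best model, highest first.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)

print("best params:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
print(importances)
```

The highest-ranked features could then be used to refit a smaller model and compare its cross-validated RMSE with the full one.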