## Importing necessary Python libraries
- Pandas        : used for data manipulation, analysis & working with structured data (tables, spreadsheets)
- NumPy         : used for numerical computing; it provides large multi-dimensional arrays and matrices, plus math functions to operate on them
- Matplotlib    : used for creating static, animated & interactive visualizations. pyplot is a module in that library for creating figures, plotting data, and formatting plots

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.style.use('ggplot')

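- If this cell fails with ModuleNotFoundError: No module named 'matplotlib', install matplotlib in the notebook's environment (for example with %pip install matplotlib) and re-run the cell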

- pd.read_csv() is the pandas function that reads the CSV dataset file into a DataFrame

In [None]:
car=pd.read_csv('quikr_car.csv')

- .head() returns the first 5 rows by default (row labels start at 0)

In [None]:
car.head()

- shape is an attribute of a pandas DataFrame that returns a tuple (rows, columns) giving the dimensions of the DataFrame

In [None]:
car.shape

- .info() gives a concise summary of your DataFrame

In [None]:
car.info()

## Creating backup copy

In [None]:
backup=car.copy()

## Quality

- names are pretty inconsistent
- names have company names attached to them
- some names are spam like 'Maruti Ertiga showroom condition with' and 'Well mentained Tata Sumo'
- company: many of the values are not company names at all, like 'Used', 'URJENT', and so on
- year has many non-year values
- year is stored as object; change to integer
- Price has 'Ask For Price' entries
- Price has commas in its values and is stored as object
- kms_driven has object values with 'kms' at the end
- kms_driven has nan values, and two rows have 'Petrol' in them
- fuel_type has nan values

- filtering out any rows from the car DataFrame where the 'year' column contains non-numeric values.

In [None]:
car=car[car['year'].str.isnumeric()]

- converting the data type of the 'year' column from object to integer

In [None]:
car['year']=car['year'].astype(int)
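
- The two year-cleaning steps above can be sketched on a tiny made-up frame. The values below are invented for illustration and are not taken from quikr_car.csv; toy is a hypothetical name.

```python
import pandas as pd

# Toy data, invented for illustration (not rows from quikr_car.csv)
toy = pd.DataFrame({'year': ['2007', '2014', 'sale', '/-Rs', '2019']})

# Keep only rows whose 'year' string consists entirely of digits
toy = toy[toy['year'].str.isnumeric()].copy()

# Cast the remaining strings to integers
toy['year'] = toy['year'].astype(int)

print(toy['year'].tolist())
```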

- removing rows where the 'Price' column contains the string 'Ask For Price'

In [None]:
car=car[car['Price']!='Ask For Price']

- clean and convert the 'Price' column in the car DataFrame from a string format (with commas) to an integer format

In [None]:
car['Price']=car['Price'].str.replace(',','').astype(int)
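
- A minimal sketch of the same Price cleanup on invented values, assuming the column mixes comma-formatted numbers with the literal text 'Ask For Price'. The frame name toy is hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({'Price': ['80,000', 'Ask For Price', '4,25,000', '3,50,000']})

# Drop the placeholder rows first, otherwise the integer cast would fail
toy = toy[toy['Price'] != 'Ask For Price'].copy()

# Remove the thousands separators, then cast to int
toy['Price'] = toy['Price'].str.replace(',', '', regex=False).astype(int)

print(toy['Price'].tolist())
```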

- clean the 'kms_driven' column by extracting the numerical part and removing commas

In [None]:
car['kms_driven']=car['kms_driven'].str.split().str.get(0).str.replace(',','')

- filter the DataFrame to keep only rows where the 'kms_driven' column contains numeric values

In [None]:
car=car[car['kms_driven'].str.isnumeric()]

- convert the 'kms_driven' column to integer type after ensuring all values are numeric

In [None]:
car['kms_driven']=car['kms_driven'].astype(int)
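
- The three kms_driven steps above, sketched end to end on invented values (including a stray 'Petrol' entry like the one noted in the Quality section). The frame name toy is hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({'kms_driven': ['45,000 kms', '40 kms', 'Petrol', '1,75,000 kms']})

# Take the first whitespace-separated token and strip the commas from it
toy['kms_driven'] = (
    toy['kms_driven'].str.split().str.get(0).str.replace(',', '', regex=False)
)

# 'Petrol' is not numeric, so this filter removes that row
toy = toy[toy['kms_driven'].str.isnumeric()].copy()

# Every remaining value is a digit string, so the cast is safe
toy['kms_driven'] = toy['kms_driven'].astype(int)

print(toy['kms_driven'].tolist())
```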

- filter the DataFrame to remove rows where the 'fuel_type' column contains missing (NaN) values

In [None]:
car=car[~car['fuel_type'].isna()]

In [None]:
car.shape

- truncates the 'name' column to keep only the first three words of each entry

In [None]:
car['name']=car['name'].str.split().str.slice(start=0,stop=3).str.join(' ')
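
- A small sketch of the split / slice / join chain on invented names, showing that names with fewer than three words are left as they are. names and short are hypothetical variable names.

```python
import pandas as pd

# Toy names, invented for illustration
names = pd.Series([
    'Maruti Suzuki Swift VDi BS IV',
    'Hyundai Santro Xing XO eRLX Euro III',
    'Tata Sumo',
])

# Split on whitespace, keep at most the first three words, join them back
short = names.str.split().str.slice(start=0, stop=3).str.join(' ')

print(short.tolist())
```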

- resets the DataFrame index to a default integer index, dropping the old index and avoiding the creation of a new column for it

In [None]:
car=car.reset_index(drop=True)
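
- Why the reset is needed: filtering keeps the original row labels, so the index has gaps afterwards. A minimal sketch on invented data (toy, filtered and reset are hypothetical names):

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({'year': [2007, 2014, 2019, 2012]})

# Filtering keeps the old labels, so the index is no longer 0..n-1
filtered = toy[toy['year'] > 2010]

# drop=True discards the old labels instead of adding them as an 'index' column
reset = filtered.reset_index(drop=True)

print(list(filtered.index), list(reset.index), list(reset.columns))
```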

In [None]:
car

- saving the cleaned DataFrame car to a CSV file named 'Cleaned_Car_data.csv'

In [None]:
car.to_csv('Cleaned_Car_data.csv')

In [None]:
car.info()

- get summary of statistics for all columns in the DataFrame, including count, unique values, top (most frequent) values, and frequency for non-numeric data, as well as standard statistical measures for numeric data

In [None]:
car.describe(include='all')

- filters the DataFrame to keep only rows where the 'Price' column values are less than 6,000,000

In [None]:
car=car[car['Price']<6000000]

In [None]:
car['company'].unique()

In [None]:
import seaborn as sns

- creates a boxplot using Seaborn to visualize the distribution of car prices (Price) across different companies (company) from the car dataset. The x-axis labels are rotated 40 degrees and right-aligned for better readability, and the figure size is set to 15x7 inches.

In [None]:
plt.subplots(figsize=(15,7))
ax=sns.boxplot(x='company',y='Price',data=car)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()

- This code generates a swarm plot using Seaborn to display the distribution of car prices (Price) by year (year) from the car dataset. The plot's x-axis labels are rotated 40 degrees for clarity, with the figure size set to 20x10 inches for better visibility

In [None]:
plt.subplots(figsize=(20,10))
ax=sns.swarmplot(x='year',y='Price',data=car)
ax.set_xticklabels(ax.get_xticklabels(),rotation=40,ha='right')
plt.show()

- This code creates a relational plot using Seaborn to visualize the relationship between kilometers driven (kms_driven) and car prices (Price) in the car dataset. The plot is sized with a height of 7 inches and an aspect ratio of 1.5 to ensure a wider view.

In [None]:
sns.relplot(x='kms_driven',y='Price',data=car,height=7,aspect=1.5)

- This code generates a boxplot using Seaborn to compare the distribution of car prices (Price) across different fuel types (fuel_type) in the car dataset. The plot is displayed with a figure size of 14x7 inches to enhance visibility.

In [None]:
plt.subplots(figsize=(14,7))
sns.boxplot(x='fuel_type',y='Price',data=car)

- This code creates a relational plot using Seaborn to visualize car prices (Price) across different companies (company), with points colored by fuel type (hue='fuel_type') and sized based on the year of manufacture (size='year'). The x-axis labels are rotated 40 degrees for readability, and the plot is sized with a height of 7 inches and an aspect ratio of 2 for a wider display.

In [None]:
ax=sns.relplot(x='company',y='Price',data=car,hue='fuel_type',size='year',height=7,aspect=2)
ax.set_xticklabels(rotation=40,ha='right')

- separate the features (X) and the target variable (y) from the car dataset. The features include name, company, year, kms_driven, and fuel_type, while the target variable is Price, which you intend to predict.

In [None]:
X=car[['name','company','year','kms_driven','fuel_type']]
y=car['Price']

In [None]:
X

In [None]:
y.shape

- splits the dataset into training and testing sets using Scikit-learn's train_test_split function. It allocates 80% of the data for training (X_train, y_train) and 20% for testing (X_test, y_test).

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
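
- To see the 80/20 proportions concretely, here is a sketch on 10 invented samples. random_state=0 is added only to make the sketch repeatable; X_toy, y_toy and the split names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each, invented for illustration
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 sends 2 of the 10 samples to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```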

- from the sklearn library's linear_model module, import the LinearRegression class

In [None]:
from sklearn.linear_model import LinearRegression

- OneHotEncoder for converting categorical variables into a binary (one-hot) encoded format, make_column_transformer for applying different transformations to specific columns, make_pipeline for creating a sequential pipeline of transformations and models, and r2_score for evaluating the performance of a regression model by calculating the R² score.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

- initializes a OneHotEncoder object and fits it to the categorical columns name, company, and fuel_type in the X dataset. This prepares the encoder to convert these categorical variables into binary (one-hot) encoded vectors for use in machine learning models.

In [None]:
ohe=OneHotEncoder()
ohe.fit(X[['name','company','fuel_type']])

- creates a ColumnTransformer that applies one-hot encoding to the name, company, and fuel_type columns using the predefined categories from ohe. The remainder='passthrough' parameter ensures that all other columns not specified (e.g., year and kms_driven) are left unchanged during the transformation.

In [None]:
column_trans=make_column_transformer((OneHotEncoder(categories=ohe.categories_),['name','company','fuel_type']),
                                    remainder='passthrough')
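
- A minimal sketch of what this transformer produces, using an invented two-column frame. Fixing the categories from an encoder fitted on all rows means a category that happens to be absent from a training split is still known. ohe_toy, ct and encoded are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration
toy = pd.DataFrame({
    'fuel_type': ['Petrol', 'Diesel', 'Petrol', 'LPG'],
    'year': [2010, 2015, 2018, 2012],
})

# Learn the full category list once, from all rows
ohe_toy = OneHotEncoder()
ohe_toy.fit(toy[['fuel_type']])

# One-hot encode 'fuel_type' with those fixed categories; pass 'year' through
ct = make_column_transformer(
    (OneHotEncoder(categories=ohe_toy.categories_), ['fuel_type']),
    remainder='passthrough',
)
encoded = ct.fit_transform(toy)

# The result may be sparse or dense depending on its density
encoded = encoded.toarray() if hasattr(encoded, 'toarray') else np.asarray(encoded)

# Columns: Diesel, LPG, Petrol (categories are sorted), then year
print(encoded)
```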

- initializes a LinearRegression object from Scikit-learn, which will be used to build and train a linear regression model. This model aims to predict the target variable by fitting a linear relationship between the features and the target

In [None]:
lr=LinearRegression()

- creates a pipeline that first applies the column_trans transformations (e.g., one-hot encoding) to the data and then fits a LinearRegression model. The pipeline streamlines the process by chaining data preprocessing and model training into a single step.

In [None]:
pipe=make_pipeline(column_trans,lr)

- trains the pipeline on the training data (X_train and y_train). It applies the preprocessing steps defined in the pipeline and then fits the LinearRegression model to the transformed training data.

In [None]:
pipe.fit(X_train,y_train)

- uses the trained pipeline to make predictions on the test data (X_test). It applies the same preprocessing steps and then predicts the target variable values (y_pred) based on the test features.

In [None]:
y_pred=pipe.predict(X_test)
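
- The fit / predict flow can be sketched end to end on invented data. The toy prices below were constructed to be exactly linear (10 per year plus 50 for Diesel) so the result is easy to check by hand; toy_pipe and new_car are hypothetical names.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented so that Price = 100 + 10*(year-2010) + 50 if Diesel
toy = pd.DataFrame({
    'fuel_type': ['Petrol', 'Diesel', 'Petrol', 'Diesel', 'Petrol', 'Diesel'],
    'year': [2010, 2010, 2014, 2014, 2018, 2018],
    'Price': [100, 150, 140, 190, 180, 230],
})
X_toy = toy[['fuel_type', 'year']]
y_toy = toy['Price']

# Preprocessing and model chained into one estimator
toy_pipe = make_pipeline(
    make_column_transformer((OneHotEncoder(), ['fuel_type']), remainder='passthrough'),
    LinearRegression(),
)
toy_pipe.fit(X_toy, y_toy)

# predict() re-applies the fitted encoding before calling the model
new_car = pd.DataFrame([['Diesel', 2016]], columns=['fuel_type', 'year'])
pred = toy_pipe.predict(new_car)
print(pred)
```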

- calculates the R² score to evaluate the performance of the model by comparing the predicted values (y_pred) against the actual values (y_test). The R² score measures how well the model's predictions match the true values, with a higher score indicating better performance.

In [None]:
r2_score(y_test,y_pred)
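
- For reference, R² is 1 minus the ratio of the residual sum of squares to the total sum of squares. A small worked sketch on invented numbers, checked against r2_score (y_true, y_hat and manual_r2 are hypothetical names):

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, invented for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

# Residual sum of squares: squared prediction errors
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: squared deviations from the mean of the true values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot
print(manual_r2, r2_score(y_true, y_hat))
```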

- Performs a repeated evaluation of the model's performance by splitting the data into training and test sets with varying random states.
- It trains a LinearRegression model within a pipeline, makes predictions, and calculates the R² score for each split, storing the scores in a list to assess variability and model stability.
- Random State: Think of it like a "seed" for random number generation. If you plant the same seed in the ground, you'll grow the same plant every time. Similarly, setting a specific random_state ensures that every time you run your code, the random processes (like shuffling or splitting data) will produce the same results. This makes your experiments and results repeatable and consistent.

In [None]:
scores=[]
for i in range(1000):
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.1,random_state=i)
    lr=LinearRegression()
    pipe=make_pipeline(column_trans,lr)
    pipe.fit(X_train,y_train)
    y_pred=pipe.predict(X_test)
    scores.append(r2_score(y_test,y_pred))
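
- The "seed" behaviour of random_state can be checked directly. In this sketch on invented data, two splits with the same seed come out identical. The seed values 42 and 7 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration
data = np.arange(10)

# Same seed twice: the shuffling is repeated exactly
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)

# A different seed generally gives a different split
c_train, c_test = train_test_split(data, test_size=0.2, random_state=7)

same = np.array_equal(a_test, b_test) and np.array_equal(a_train, b_train)
print(same, a_test, c_test)
```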

- code finds the index of the maximum R² score from the scores list. 
- It identifies which iteration of the model evaluation yielded the highest performance score

In [None]:
np.argmax(scores)

- It displays the maximum score, i.e. the value at the index found in the previous cell (here 302)

In [None]:
scores[np.argmax(scores)]
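
- A small sketch of how np.argmax behaves on a list of scores. The numbers are invented; note that when the maximum occurs more than once, the index of its first occurrence is returned.

```python
import numpy as np

# Toy scores, invented for illustration (0.84 appears twice)
scores_toy = [0.61, 0.84, 0.79, 0.84]

# Index of the first occurrence of the largest value
best_index = int(np.argmax(scores_toy))

# Looking that index up in the list gives the score itself
best_score = scores_toy[best_index]

print(best_index, best_score)
```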

- code uses the trained pipeline to make a prediction based on a new input with the features 'Maruti Suzuki Swift', 'Maruti', 2019, 100, and 'Petrol'. The input data is formatted as a DataFrame with the same column names as X_test to ensure compatibility with the pipeline.

In [None]:
# build the row from a list so year and kms_driven stay integers
pipe.predict(pd.DataFrame([['Maruti Suzuki Swift','Maruti',2019,100,'Petrol']],columns=X_test.columns))
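
- A note on building a one-row input like this: a NumPy array holds a single dtype, so np.array([...]) over mixed strings and numbers stores every element as a string, while a list of rows lets pandas infer a dtype per column. A minimal sketch (via_array and via_list are hypothetical names):

```python
import numpy as np
import pandas as pd

row = ['Maruti Suzuki Swift', 'Maruti', 2019, 100, 'Petrol']
cols = ['name', 'company', 'year', 'kms_driven', 'fuel_type']

# Through np.array: all five values are coerced to strings
via_array = pd.DataFrame(columns=cols, data=np.array(row).reshape(1, 5))

# Through a list of rows: year and kms_driven remain integers
via_list = pd.DataFrame([row], columns=cols)

print(type(via_array['year'].iloc[0]), via_list['year'].dtype)
```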

- code splits the dataset into training and test sets using the best random state from previous evaluations to ensure consistency. It then trains a LinearRegression model within a pipeline on this split, makes predictions on the test set, and calculates the R² score to evaluate model performance.

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.1,random_state=np.argmax(scores))
lr=LinearRegression()
pipe=make_pipeline(column_trans,lr)
pipe.fit(X_train,y_train)
y_pred=pipe.predict(X_test)
r2_score(y_test,y_pred)

- Importing pickle module, which allows you to serialize (save) and deserialize (load) Python objects.
- This is useful for saving the state of a model or other data structures to a file, and then loading them later without needing to recreate or retrain them.

In [None]:
import pickle

- Code serializes and saves the pipe object, which includes the trained pipeline, to a file named 'LinearRegressionModel.pkl' in binary mode ('wb').
- w: Indicates that the file is being opened in write mode, which allows you to write data to the file.
- b: Indicates that the file is being opened in binary mode, which is necessary for handling non-text data, such as serialized objects.
- This allows you to persist the trained model and reload it later for future use without retraining.

In [None]:
with open('LinearRegressionModel.pkl','wb') as f:
    pickle.dump(pipe,f)
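
- A sketch of the full save / load round trip, using a tiny invented model and a temporary file so nothing in the working directory is touched. The file name toy_model.pkl and the variable names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration: y = 2 * x
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_toy, y_toy)

# Write the fitted model in binary mode ('wb')
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Read it back in binary mode ('rb'); no retraining is needed
with open(path, 'rb') as f:
    restored = pickle.load(f)

same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

- Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.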

- code uses the trained pipeline to make predictions based on a new input with the specified feature values ('Maruti Suzuki Swift', 'Maruti', 2019, 100, and 'Petrol').
- The input data is formatted as a DataFrame with the correct column names ('name', 'company', 'year', 'kms_driven', 'fuel_type') to match the model's expected input format.

In [None]:
# build the row from a list so year and kms_driven stay integers
pipe.predict(pd.DataFrame([['Maruti Suzuki Swift','Maruti',2019,100,'Petrol']],columns=['name','company','year','kms_driven','fuel_type']))

In [None]:
pipe.steps[0][1].transformers[0][1].categories[0]

In [None]:
# Assuming you have `y_test` and `y_pred` from your model
plt.figure(figsize=(10, 6))

# Create a DataFrame for plotting
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Plot the results
sns.scatterplot(x='Actual', y='Predicted', data=results)
plt.plot([results['Actual'].min(), results['Actual'].max()],
         [results['Actual'].min(), results['Actual'].max()],
         'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()