## Supervised learning using Regression

## Predicting Price

## Objectives

On completing this assignment, you will learn how to write a simple AI application involving supervised learning using regression.

## Description

Write an AI application which, when provided with a diamond's attributes, will predict its price. For training and testing the application, please use the labeled data set provided in the file sb_diamonds.csv. The data set contains data on 53,940 diamonds with 10 attributes each, including the price. Use 80% of the data items for training and the remaining 20% for testing. Use sklearn's Linear Regression (LinearRegression) model for training. After the model is trained, test it using the test data and produce the Mean Absolute Percentage Error (MAPE), available as mean_absolute_percentage_error in sklearn.metrics, reflecting its performance. Also produce the trained model's coefficient (coef_) and intercept (intercept_) values.
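The workflow described above can be sketched as follows. This is a minimal illustration, not the assignment solution: it assumes a small synthetic data frame in place of sb_diamonds.csv, and the price rule, the `random_state` values, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a few numeric diamond columns (not real data).
rng = np.random.default_rng(0)
n_rows = 500
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, n_rows),
    "depth": rng.uniform(55.0, 70.0, n_rows),
    "table": rng.uniform(50.0, 65.0, n_rows),
})
# Made-up price rule plus noise, only so the model has something to learn.
df["price"] = 500 + 4000 * df["carat"] + 10 * df["depth"] + rng.normal(0, 100, n_rows)

X = df.drop(columns="price")
y = df["price"]

# 80% of the rows for training, the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# MAPE is returned as a fraction (0.05 means 5%).
mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
print(f"MAPE: {mape:.4f}")
print("Coefficients:", dict(zip(X.columns, model.coef_)))
print("Intercept:", model.intercept_)
```

With the real data set, the data frame would instead come from reading sb_diamonds.csv and would include the encoded categorical columns.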

#### Regressor models to be used

Altogether, try out the following regression models of sklearn's library and compare their performance using Mean Absolute Percentage Error (MAPE) values.

- Linear Regressor (LinearRegression) from sklearn.linear_model

- KNeighbors Regressor (KNeighborsRegressor) from sklearn.neighbors (using n_neighbors=5)
  
- Support Vector Regressor (SVR) from sklearn.svm
  
- Random Forest Regressor (RandomForestRegressor) from sklearn.ensemble 
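One way to run the comparison over the four models listed above is to loop over a dictionary of estimators. The sketch below is hedged: it uses synthetic data, default hyperparameters apart from `n_neighbors=5`, and standardized features (the scaling step is an assumption here, included because distance- and kernel-based models such as KNN and SVR are sensitive to feature scale).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and target (placeholders for the diamond data).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 1000 + 3000 * X[:, 0] + 500 * X[:, 1] ** 2 + rng.normal(0, 50, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

# Train each model and record its MAPE on the held-out test set.
mape_by_model = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    predictions = model.predict(X_test_s)
    mape_by_model[name] = mean_absolute_percentage_error(y_test, predictions)

# Print the models from lowest (best) to highest MAPE.
for name, mape in sorted(mape_by_model.items(), key=lambda item: item[1]):
    print(f"{name:>22}: {mape:.4f}")

best_model_name = min(mape_by_model, key=mape_by_model.get)
print("Best model:", best_model_name)
```

The ranking produced on synthetic data says nothing about how the models rank on the diamond data; only the loop structure carries over.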

#### Individual Values

Also, try out made-up attribute values of a few diamonds with the best performing model from the above list and report the attribute values used and predicted prices received from the model.
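Predicting prices for made-up diamonds could look like the sketch below. It assumes a random forest as a stand-in for whichever model performs best, trains it on synthetic rows, and uses invented attribute values. The made-up rows must have the same columns, in the same order and with the same preprocessing, as the training features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data (placeholder for the cleaned diamond features).
rng = np.random.default_rng(2)
train = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, 300),
    "depth": rng.uniform(55.0, 70.0, 300),
    "table": rng.uniform(50.0, 65.0, 300),
})
train_price = 500 + 4000 * train["carat"] + rng.normal(0, 100, 300)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(train, train_price)

# Made-up diamonds with the same columns, in the same order, as the training data.
made_up = pd.DataFrame({
    "carat": [0.5, 1.0, 2.0],
    "depth": [61.0, 62.5, 60.0],
    "table": [55.0, 57.0, 58.0],
})

# Report the attribute values next to the predicted prices.
report = made_up.assign(predicted_price=model.predict(made_up))
print(report)
```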

## Implementation 

#### Preprocessing

- Remove rows containing missing or null values
- Remove duplicate rows
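Both preprocessing steps map onto standard pandas calls. The following sketch uses a tiny made-up data frame (the name `raw` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy rows: one missing carat, one missing cut, and one exact duplicate.
raw = pd.DataFrame({
    "carat": [0.5, 0.7, np.nan, 0.5, 0.9],
    "cut": ["Ideal", "Premium", "Good", "Ideal", None],
    "price": [1000, 2000, 1500, 1000, 3000],
})

clean = (
    raw.dropna()             # remove rows containing missing or null values
       .drop_duplicates()    # remove duplicate rows
       .reset_index(drop=True)
)
print(f"{len(raw)} rows before, {len(clean)} rows after")
```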

#### Columns Used

Use all columns provided.

#### Column Cleaning

- carat, depth, table, x, y, z, and price column values are already numerical. So, leave them as they are.
  
- cut and clarity column values seem to be of ordinal type. So, we need to convert them into numerical values using sklearn.preprocessing's label encoder (LabelEncoder).
  
- color column values seem to be of nominal type. So, we need to convert them into numerical values using pandas' get_dummies function (one-hot encoding).

## Discussion

#### Column Data Types

Data values are of either quantitative or qualitative type.

#### Quantitative (Numerical) Values

We can recognize quantitative (numerical) type values from the fact that they can be shown along a number line and we can perform mathematical operations (+, -, *, /) on them. Quantitative (numerical) type values can be of either discrete or continuous type.

##### Continuous Values

When quantitative (numerical) type values lie along a number line within a range and all possible values within the range are permitted, they are considered to be of continuous type. For example, height and weight values are considered continuous because all height and weight values within a range are permitted.

##### Discrete Values
 
When quantitative (numerical) type values lie along a number line within a range but some values within the range are not included, they are considered to be of discrete type. For example, clothes and shoe size values are considered discrete because only certain clothes and shoe sizes exist within a range.

To differentiate between discrete and continuous type values, consider shoe size and foot size values. Shoe size values are considered discrete because only certain shoe size values are permitted (shoe sizes of 8.11, 8.12, etc. do not exist). On the other hand, foot size values are considered continuous because we can specify a foot size of any value within a range.

##### Regressors versus Classifiers

In our supervised learning problems, if the target (label) values are continuous, such as prices (a price can have any value within the range), then we use regressors to solve them. However, when the target can have only certain values or can belong to certain categories, we use classifiers to solve them.


#### Qualitative (Categorical) (Non-numerical) Values

We can recognize qualitative (categorical) (non-numerical) type values from the fact that they cannot be meaningfully shown along a number line and we cannot perform mathematical operations (+, -, *, /) on them. Qualitative (categorical) type values can be of either nominal or ordinal type.

##### Nominal Values
 
When data values are just names without any ranking or order to them, they are considered nominal values. For example, if a hair-color column contains values such as black, brown, red, etc., then these values are considered nominal values because there is no ranking attached to them.

##### Ordinal Values

When data values are names but there is an implied ranking or order attached to them, they are considered ordinal values. For example, if a job satisfaction column contains values such as unsatisfied, satisfied, very satisfied, etc., then these values are considered ordinal values because there is an implied ranking or order attached to them.

##### Implementing Nominal and Ordinal Values

In our problem, both nominal and ordinal column values are converted to numerical values. For converting nominal values, we use pandas' get_dummies function. It creates a separate column for each distinct name value. So, in our hair color example above, it will create a column for "black", a column for "brown", a column for "red", etc., and assign 0 or 1 in each column indicating the presence or absence of that color in the individual.
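The hair color example can be reproduced in a few lines. This is a small sketch on a made-up column; `dtype=int` is passed so the indicator columns hold 0 and 1 rather than booleans.

```python
import pandas as pd

# Made-up nominal column from the hair color example.
people = pd.DataFrame({"hair_color": ["black", "brown", "red", "brown"]})

# One new 0/1 column per distinct value; the original column is dropped.
encoded = pd.get_dummies(people, columns=["hair_color"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the new `hair_color_*` columns, marking the color present in that row.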

On the other hand, for an ordinal column, we use the sklearn.preprocessing module's label encoder (LabelEncoder). The encoder does not create any new columns. Instead, it substitutes the values 0, 1, 2, 3, etc. for the different name values. Note that LabelEncoder numbers the values in sorted (alphabetical) order, which does not necessarily match their rank.
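The sketch below applies LabelEncoder to the job satisfaction example and shows the integers it assigns. Because the encoder sorts the labels, the codes follow alphabetical order. The explicit `rank` mapping at the end is an illustrative alternative (an assumption, not part of the assignment) for cases where the codes should follow the intended ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up ordinal column from the job satisfaction example.
satisfaction = pd.Series(["unsatisfied", "satisfied", "very satisfied", "satisfied"])

encoder = LabelEncoder()
codes = encoder.fit_transform(satisfaction)

# Classes are stored in sorted order, and each code is an index into that list.
print(encoder.classes_.tolist())  # ['satisfied', 'unsatisfied', 'very satisfied']
print(codes.tolist())             # [1, 0, 2, 0]

# Hypothetical alternative: an explicit mapping that preserves the ranking.
rank = {"unsatisfied": 0, "satisfied": 1, "very satisfied": 2}
ranked_codes = satisfaction.map(rank)
print(ranked_codes.tolist())      # [0, 1, 2, 1]
```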


## Implementation Notes


#### Dataset source

The data set was downloaded fity-data-determining-factors


## Submittal

The uploaded submittal should contain the following:

- ipynb file after running the application from start to finish containing the marked source code, output, and your interaction.
  
- the corresponding html file.

## Keith Yrisarri Stateson
June 23, 2024. Python 3.11.0

## Title: Diamond Price Prediction Using Supervised Learning

## Summary
This program is an AI application that predicts the prices of diamonds based on their attributes. Supervised learning and regression techniques are used to train and evaluate multiple models on a provided dataset. The goal is to determine which model performs best in predicting diamond prices and to understand the influence of various features on the price.

For each model:
- Model Training and Evaluation
    - Train various regression models and evaluate their performance using MAPE
- Model Prediction
    - Predict diamond prices for new, made-up attributes
    
Part 1: DataFrame Cleaning

Part 2: Evaluate the Features and Target variable

Part 3: Data Cleaning

Part 4: Feature Engineering

Part 5: Train-Test Split and Feature Scaling

Part 6: Modeling
- Linear Regression Model
- Random Forest Regressor Model
- KNN Model
- SVR Model

Part 7: Identify the best performing model to predict diamond prices

## Part 1: DataFrame Cleaning

Evaluate the dataframe for missing values, empty rows and columns, and duplicate entries.
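These checks can be expressed with a handful of pandas calls. The sketch below runs them on a made-up data frame; the column names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values, an entirely empty column, and a duplicate row.
df = pd.DataFrame({
    "carat": [0.3, 0.3, np.nan, 1.2],
    "color": ["E", "E", "G", None],
    "unused": [np.nan] * 4,
})

missing_per_column = df.isna().sum()                     # missing values per column
empty_columns = df.columns[df.isna().all()].tolist()     # columns with no data at all
empty_row_count = int(df.isna().all(axis=1).sum())       # rows with no data at all
duplicate_row_count = int(df.duplicated().sum())         # repeats of an earlier row

print(missing_per_column)
print("Empty columns:", empty_columns)
print("Empty rows:", empty_row_count)
print("Duplicate rows:", duplicate_row_count)
```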

## Part 2: Evaluate the Features and Target variable

## Part 3: Data Cleaning - Features and Target variable

*Conversion of a pandas Series into a NumPy Array.*  
Many machine learning libraries, such as scikit-learn, expect input data to be in the form of NumPy arrays rather than pandas Series.  
Converting the target to a NumPy array ensures compatibility with these libraries.
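A minimal sketch of the conversion, assuming the target column is named `price` and using made-up values. Note that scikit-learn estimators generally accept pandas objects as well, so this step is a compatibility choice, not a strict requirement.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the shape of the operation matters here.
df = pd.DataFrame({"carat": [0.3, 0.7, 1.1], "price": [400.0, 2100.0, 5200.0]})

# Series.to_numpy() returns the column as a one-dimensional NumPy array.
target = df["price"].to_numpy()
print(type(target).__name__, target.shape)
```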

## Part 4: Feature Engineering

Transform nominal categorical data to numerical using pandas.get_dummies, drop the categorical column, and add the newly created numerical version to the features dataframe.

Transform ordinal categorical data to numerical using the LabelEncoder.

## Part 5: Train-Test Split and Feature Scaling

Assign features and target.  
Split the dataset into 80% training, 20% testing.  
Standardize the training and test feature data.  
Apply the transformation to both the training and test datasets.
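The steps above might be sketched as follows. The data is synthetic, and fitting the scaler on the training rows only (then reusing it for the test rows) is an assumption about how the transformation is applied. It is a common way to keep test-set statistics out of training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (placeholders for the diamond data).
rng = np.random.default_rng(3)
features = rng.normal(loc=[1.0, 60.0], scale=[0.5, 2.0], size=(100, 2))
target = rng.uniform(300, 18000, 100)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training features only,
# then apply the same transformation to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train shape:", X_train_scaled.shape, "Test shape:", X_test_scaled.shape)
```

After scaling, each training feature has mean 0 and standard deviation 1. The test features are only approximately standardized because they reuse the training statistics.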

## Part 6: Modeling

Linear Regression  
Random Forest Regressor  
KNN Model  
SVR

## Part 7: Identify the best performing model to predict diamond prices