## Supervised learning using regression 

## Predicting Price

## Objectives

On completing the assignment, you will be able to write a simple AI supervised application that uses regression.

## Description

In the last assignment, you were provided a data set that had 17 columns including the Price column. That assignment used a few of those columns. In this assignment, you will use another set of columns and they are listed below.

#### Columns to be used

Use the following columns:

Rating, Spec_score, Ram, Display, Screen_resolution, Company, Price

#### Regressor Models

Use the following regression models of sklearn's library and compare their performance using Mean Absolute Percentage Error (MAPE) values. 

- Linear Regressor (LinearRegression) from sklearn.linear_model
- KNeighbors Regressor (KNeighborsRegressor) from sklearn.neighbors using n_neighbors=5
- Support Vector Regressor (SVR) from sklearn.svm
- Random Forest Regressor (RandomForestRegressor) from sklearn.ensemble 
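
The comparison described above can be sketched as follows. This is a minimal illustration, not the assignment solution: the feature matrix and target are synthetic stand-ins (an assumption), and all names such as `models` and `mape_scores` are hypothetical. Note that sklearn's `mean_absolute_percentage_error` returns a fraction (0.05 means 5%), not a percentage.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 3))
y = 5000 + 1500 * X[:, 0] + 800 * X[:, 1] + rng.normal(0, 500, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The four models named in the assignment, with default settings
# except n_neighbors=5 for the k-neighbors regressor.
models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

# Fit each model on the training split and score it on the test split.
mape_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mape_scores[name] = mean_absolute_percentage_error(y_test, predictions)

# Print the models from best (lowest MAPE) to worst.
for name, score in sorted(mape_scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAPE = {score:.3f}")
```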

#### Individual Values

Also, try out made-up attribute values of a few cell phones with the best performing model from the above list and report the attribute values used and predicted prices received from the model.
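
Trying made-up values might look like the sketch below. The tiny training frame, the reduced column list, and the choice of LinearRegression are all assumptions made only to keep the example self-contained; in the assignment, the fitted top performing model and the full set of cleaned columns would be used instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up training frame (an assumption, not the assignment data).
train = pd.DataFrame({
    "Rating": [4.0, 4.2, 4.5, 4.7],
    "Spec_score": [60, 70, 80, 90],
    "Ram": [4.0, 6.0, 8.0, 12.0],
    "Price": [8000.0, 12000.0, 18000.0, 30000.0],
})
feature_columns = ["Rating", "Spec_score", "Ram"]
model = LinearRegression().fit(train[feature_columns], train["Price"])

# Made-up phones must use the same columns, in the same order, as training.
made_up_phones = pd.DataFrame(
    [[4.3, 75, 8.0], [4.6, 85, 12.0]],
    columns=feature_columns,
)
predicted_prices = model.predict(made_up_phones)

# Report the attribute values next to the predicted price.
for (_, phone), price in zip(made_up_phones.iterrows(), predicted_prices):
    print(phone.to_dict(), "->", round(float(price), 2))
```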

#### Final File  

At the end, do the following and submit the file

- Turn off all warnings
- Run your application from start to finish with the top performing model
- Add a short paragraph at the end describing whether you found much performance difference among the various models you tried.
- Also, list the names of the models tried along with their corresponding values for Mean Absolute Percentage Error (MAPE)
- Also, list two made up cell phone attribute values you tried and the predicted price you received from the top performing model.  

## Implementation

#### Preprocessing

- Remove rows containing missing or null values
- Remove duplicate rows
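
The two preprocessing steps above can be sketched with pandas as follows. The small data frame is made up for illustration (an assumption); in the assignment the frame would come from the provided data set.

```python
import numpy as np
import pandas as pd

# Made-up sample: row 1 duplicates row 0, rows 2 and 3 have missing values.
df = pd.DataFrame({
    "Rating": [4.5, 4.5, np.nan, 4.1],
    "Price": ["9,999", "9,999", "12,499", None],
})

df = df.dropna()                # drop rows with any missing or null value
df = df.drop_duplicates()       # drop repeated rows, keeping the first
df = df.reset_index(drop=True)  # renumber the remaining rows from 0
print(df)
```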

#### Columns Used

Although the data set provided has 17 columns including the price, use only the following columns:

Rating, Spec_score, Ram, Display, Screen_resolution, Company, Price

#### Making columns values numerical

We try to make all column values numerical. Columns Rating and Spec_score already have numerical values. Ram, Display, and Price values can be made numerical through some text cleaning.

However, columns Screen_resolution and Company are non-numerical. The Company values are nominal and the Screen_resolution values are ordinal. (When there is no ranking or hierarchy among non-numerical values, they are nominal. Otherwise, they are ordinal. See the discussion on column data types later below.) We make nominal values numerical using pandas' get_dummies function and make ordinal values numerical using sklearn's ordinal encoder (OrdinalEncoder).

#### Column Cleaning

- Rating and Spec_score column values are already numerical.

- Ram values need to be cleaned because they are strings given in a form such as '4 GB RAM'. We need to remove GB and RAM and convert the values into float.

- Display values need to be cleaned because they are strings given in a form such as '4 inches'. We need to remove inches and convert the values into float.

- Screen_resolution values are strings such as '2408 x 1080 px Display with Water Drop Notch' and '720 x 1560 px Display with Punch Hole', and they seem to be of ordinal type. So, we need to convert them into numeric values using sklearn.preprocessing's ordinal encoder (OrdinalEncoder).

- Company values contain a large number of different company names. We need to keep the top 5 names and change the remaining ones to "Others". Company names are strings of nominal type. So, we need to convert them into numeric values using pandas' get_dummies function (one-hot encoding).

- Price values need cleaning because they are given as strings with commas in them, such as 9,999. So, we need to remove the commas and convert the values into float.
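
The Ram, Display, and Company steps above can be sketched as follows. The function names (`cleanup_ram`, `cleanup_display`, `keep_top_companies`) and the sample values are hypothetical, and the example keeps the top 2 companies only because the sample is tiny; the assignment asks for the top 5.

```python
import re
import pandas as pd

def cleanup_ram(item):
    # Remove the units from strings such as "4 GB RAM", then convert.
    item = re.sub(r"GB|RAM", "", item, flags=re.IGNORECASE)
    return float(item.strip())

def cleanup_display(item):
    # Remove "inch" or "inches" from strings such as "6.5 inches".
    item = re.sub(r"inch(es)?", "", item, flags=re.IGNORECASE)
    return float(item.strip())

def keep_top_companies(company, top_n=5):
    # Keep the top_n most frequent names; replace the rest with "Others".
    top_names = company.value_counts().nlargest(top_n).index
    return company.where(company.isin(top_names), "Others")

ram = pd.Series(["4 GB RAM", "8 GB RAM"]).apply(cleanup_ram)
display = pd.Series(["6.5 inches", "6.67 inches"]).apply(cleanup_display)

company = pd.Series(["Samsung"] * 3 + ["Realme"] * 2 + ["Vivo", "Poco"])
company = keep_top_companies(company, top_n=2)

print(ram.tolist(), display.tolist(), company.tolist())
```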

#### Column Cleaning Support

In pandas, every column is represented by a Series object, and the Series (column) object provides a method called apply which is often used for column clean up. We call the method apply only once and supply it the name of our clean up function. Then, the apply method calls our clean up function repeatedly, once for each column cell, and passes it the contents of the cell. Our function cleans up (modifies) the content and returns the new value for the cell. When all done, the apply method returns a new Series (column) object containing the cleaned values. The original Series is not changed in place, so we assign the result back to the column.

In the code fragment below, we call the method apply on the Series (column) object price and supply it our function cleanup_price. The apply method calls cleanup_price once for each column cell, collects the returned values, and returns the updated Series, which we assign back to price.

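```python
import re

def cleanup_price(item):
    item = re.sub('[,]', '', item)    # remove commas
    item = re.sub(r'\s+', '', item)   # remove whitespace
    return float(item)

# price is the Price column (a pandas Series of strings)
price = price.apply(cleanup_price)
```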
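
To see the fragment above run end to end, here is a self-contained version. The sample Series is made up for illustration (an assumption); in the assignment, price would be the Price column of the data frame.

```python
import re
import pandas as pd

def cleanup_price(item):
    item = re.sub('[,]', '', item)    # remove commas
    item = re.sub(r'\s+', '', item)   # remove whitespace
    return float(item)

# Made-up price strings in the same style as the data set.
price = pd.Series(["9,999", " 12,499 ", "1,29,999"])

# apply calls cleanup_price once per cell and returns a new Series.
price = price.apply(cleanup_price)
print(price.tolist())  # [9999.0, 12499.0, 129999.0]
```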

#### Cleanup with regular expressions

In the above code fragment, our cleanup_price function receives an item containing the content of a column cell as text. We need to work on this text and modify it. At times, we need to search for certain patterns in the text (such as multiple spaces) and eliminate or modify them. For this purpose, we use a mechanism called 'regular expressions'.

A regular expression is just some text that specifies a pattern according to well defined rules. The module re (short for regular expression) provides functions that search a given text for a regular expression (pattern) and may replace the matching text with another text.

In the code fragment above, we use re module's function sub (short for substitute). The function sub searches for the provided regular expression pattern (provided as the first parameter) in a provided text (provided as the third parameter) and when found substitutes it with another text (provided as second parameter).

For example, in the first call to the sub function above, we provide the following values as parameters:

- regular expression pattern to be searched for a match: '[,]'
- substituting text: '' (empty string)
- the text to be searched: item (contents of the column cell)

The function sub will search the item content for a match with the comma character and, when a match is found, it will replace it with an empty string. The result will be that all comma characters will be eliminated from the item contents.

In the second call to the sub function above, we specify the following values as parameters:

- regular expression pattern to be searched: r'\s+'
- substituting text: '' (empty string)
- the text to be searched: item (contents of the column cell with comma characters eliminated)

The function sub will search the item content for a match with one or more consecutive whitespace characters and, when found, it will replace them with an empty string. The result will be that all the whitespace characters will be eliminated from the item content.

The overall result of the two calls to function sub in the above code will be that all the comma characters and whitespace characters will be eliminated and the item will be left with a sequence of digits as text. So, we can safely convert its contents to a float and return it as the new content for the column cell.

#### Interpreting regular expression patterns

- '[,]' matches a comma character in the searched string. (The square brackets define a set of characters to match; here the set contains only the comma.)
- r'\s+' matches one or more consecutive whitespace characters in the searched string.
  - The letter r before the pattern tells Python that it is a raw string, so the backslash is passed to the re module unchanged.
  - \s indicates a whitespace character (space, tab, newline, etc.).
  - + indicates one or more.
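
The two patterns can be tried directly on sample strings, as in the short sketch below. The sample strings and variable names are made up for illustration.

```python
import re

# '[,]' removes every comma.
no_commas = re.sub('[,]', '', '1,29,999')

# r'\s+' removes runs of whitespace, including tabs and newlines.
no_spaces = re.sub(r'\s+', '', ' 9\t999 \n')

print(no_commas)  # 129999
print(no_spaces)  # 9999
```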

#### Further study on regular expressions

Regular expressions are also widely used for input validation. When we mistype an email, phone number, social security number, or birth date in an online application form, the information is often checked against a regular expression pattern, and when the entered information does not match the specified pattern, the application issues an error. For further study, consult the documentation of Python's re module or other online resources.


## Column Data Types

Data values are of either quantitative or qualitative type.

#### Quantitative (Numerical) Values

We can recognize quantitative (numerical) type values from the fact that they can be shown along a number line and we can perform mathematical operations (+, -, *, /) on them. Quantitative (numerical) type values can be of either discrete or continuous type.

##### Continuous Values

When quantitative (numerical) type values lie along a number line within a range and all possible values within the range are permitted, they are considered to be of continuous type. For example, height and weight values are considered continuous because all height and weight values within a range are permitted.

##### Discrete Values
 
When quantitative (numerical) type values lie along a number line within a range but some values within the range are not included, they are considered to be of discrete type. For example, clothes and shoe size values are considered discrete because only certain clothes and shoe sizes exist within a range.

To differentiate between discrete and continuous type values, consider shoe size and foot size values. Shoe size values are considered discrete because only certain shoe size values are permitted (shoe sizes of 8.11, 8.12, etc. do not exist). On the other hand, foot size values are considered continuous because we can specify a foot size of any value within a range.

##### Regressors versus Classifiers

In our supervised learning problems, if the target (label) values are continuous, such as prices (a price can have any value within the range), then we use regressors to solve them. However, when the target can have only certain values or can belong to certain categories, we use classifiers to solve them.


#### Qualitative (Categorical) (Non-numerical) Values

We can recognize qualitative (categorical) (non-numerical) type values from the fact that they cannot be shown along a number line and we cannot perform mathematical operations (+, -, *, /) on them. Qualitative (categorical) type values can be of either nominal or ordinal type.

##### Nominal Values
 
When data values are just names without any ranking or order to them, they are considered nominal values. For example, if a hair-color column contains values such as black, brown, red, etc., then these values are considered nominal because there is no ranking attached to them.

##### Ordinal Values

When data values are names but there is an implied ranking or order attached to them, they are considered ordinal values. For example, if a job satisfaction column contains values such as unsatisfied, satisfied, very satisfied etc. then these values are considered ordinal values because there is an implied ranking or order attached to them.

## Making nominal and ordinal values numerical

In our problem, both nominal and ordinal column values are converted to numerical values. For converting nominal values, we use pandas' get_dummies function. It creates a separate column for each different name value. So, in our hair color example above, it will create a column for "black", a column for "brown", a column for "red", etc. and assign 0 or 1 in each column indicating the absence or presence of that color in the individual.

On the other hand, for ordinal column values, we use sklearn.preprocessing module's ordinal encoder (OrdinalEncoder). The encoder does not create any new columns. Instead, it substitutes the values 0, 1, 2, 3, etc. for the different name values. To get the order desired, the ordered list of names can be supplied to the encoder; otherwise, the names are numbered in sorted order.
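
Both conversions can be sketched on the hair color and job satisfaction examples used earlier. The data frame and the new column name satisfaction_code are made up for illustration. Passing categories to OrdinalEncoder fixes the ranking explicitly; without it, the encoder numbers the names in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "hair_color": ["black", "brown", "red", "black"],
    "satisfaction": ["satisfied", "unsatisfied", "very satisfied", "satisfied"],
})

# Nominal: one 0/1 column per distinct hair color.
dummies = pd.get_dummies(df["hair_color"], prefix="hair_color", dtype=int)

# Ordinal: a single numeric code per value, in the ranking we specify.
encoder = OrdinalEncoder(
    categories=[["unsatisfied", "satisfied", "very satisfied"]]
)
df["satisfaction_code"] = encoder.fit_transform(df[["satisfaction"]]).ravel()

print(dummies)
print(df[["satisfaction", "satisfaction_code"]])
```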

## Implementation Notes


#### Dataset source



## Submittal

The uploaded submittal should contain the following:

- the ipynb file, saved after running the application from start to finish, containing the marked source code, output, and your interaction.
  
- the corresponding html file.

## Coding
