# Course Project - MATH2319 Machine Learning | Sem 1, 2020
## Honour Code

We solemnly swear that we have not discussed our assignment solutions with anyone in any way and the solutions we are submitting are our own personal work.

Full Name: Ankit Munot (s3764950), Akshay Sunil Salunke (s3730440) | Group 12

## Table of Contents
* [Objective](#1)
* [Source and description of dataset](#2)
  - [Dataset](#2.1)
  - [Target feature](#2.2)
  - [Descriptive features](#2.3)
* [Data preprocessing](#3)
  - [Preliminaries](#3.1)
  - [Data cleaning and transformation](#3.2)
  - [Summary of features](#3.3)
* [Data Exploration and Visualisation](#4)
  - [Univariate visualizations](#4.1)
  - [Bivariate visualizations](#4.2)
  - [Multivariate visualizations](#4.3)
* [Methodology](#5)
  - [Feature selection](#5.1)
  - [Model fitting](#5.2)
    - [K-Nearest Neighbor (KNN)](#5.2.1)
    - [Decision Tree Classifier](#5.2.2)
    - [Random Forest Classifier](#5.2.3)
    - [Support Vector Machine](#5.2.4)
  - [Performance Comparison](#5.3)
* [Model Evaluation](#6)
* [Summary and Conclusion](#7)
* [References](#8)

# Objective
The objective of this project is to predict the price range of a mobile phone from its features. We predict the price as a price bucket rather than as a continuous value, so this is a multinomial (multi-class) classification problem. The project covers the following steps: data preprocessing, data visualisation, feature selection (using F-score and random forest importance), and fitting and evaluating four models.

# Source and description of dataset
## Dataset
The dataset for this project was sourced from [kaggle.com](https://www.kaggle.com/iabhishekofficial/mobile-price-classification). A copy of the dataset, as accessed on `26 May 2020`, is included in the `/data` folder. The dataset originally has 2,000 rows, but we randomly sampled 1,200 rows so that our laptops could cope with the computation.
The dataset has 20 features (excluding the target). Some are binary (e.g. `wifi`, `bluetooth`) and denote whether a phone has that particular feature, whereas others are continuous (e.g. `battery_power`, `ram`).

## Target feature
The target feature for this dataset is `price_range`. It takes the values `0, 1, 2, 3`, which correspond to the price categories `low, mid, high, v.high`.

## Descriptive features
The dataset has the following descriptive features:
- `battery_power`: Total energy a battery can store in one time measured in *mAh.*
- `bluetooth`: Has bluetooth or not.
- `clock_speed`: Speed at which microprocessor executes instructions in *GHz*.
- `dual_sim`: Has dual sim support or not.
- `front_cam_mp`: Front Camera mega pixels.
- `four_g`: Has 4G or not.
- `int_memory`: Internal Memory in *Gigabytes*.
- `n_cores`: Number of cores of processor.
- `back_cam_mp`: Primary Camera in *Megapixels*.
- `px_height`: Pixel Resolution Height in *pixels*.
- `px_width`: Pixel Resolution Width in *pixels*.
- `ram`: Random Access Memory in *Megabytes*.
- `sc_h`: Screen Height of mobile in *cm*.
- `sc_w`: Screen Width of mobile in *cm*.
- `talk_time`: Longest time that a single battery charge will last during a call, in *Hours*.
- `three_g`: Has 3G or not.
- `touch_screen`: Has touch screen or not.
- `wifi`: Has wifi or not.
- `screen_size`: Screen size diagonally in *inches*.

Some features have been transformed into new features. For example, `sc_h` and `sc_w` have been combined into `screen_size`, the diagonal screen size in inches, which is the industry standard for measuring phone screens.

# Data preprocessing
## Preliminaries
We import the data science libraries used throughout the notebook and suppress warnings to keep the output readable.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
import sklearn
import warnings 
import seaborn as sns
warnings.filterwarnings('ignore')
%matplotlib inline

We now read the data into a pandas DataFrame. We randomly sample 1,200 rows with a fixed `random_state` so that the results are reproducible.

In [None]:
df = pd.read_csv("data/train.csv")
df = df.sample(1200, random_state=786)

Here is what the sampled dataset looks like:

In [None]:
df.head()

And here is its shape (rows, columns):

In [None]:
df.shape

## Data cleaning and transformation
In this section we remove irrelevant features and transform existing features into new ones.

We first check all of `df` for null values.

In [None]:
df.isnull().any()

However, `isnull()` only detects values that are `NaN` or `None`. This is not enough: we still need to check `df` for missing values encoded in other ways, for example as `?` or `0`.
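
A minimal sketch of such a check is shown below. It uses a small made-up frame (the values are illustrative and not taken from the dataset) and counts, per column, how many entries are `NaN`, how many equal `0`, and how many equal the placeholder `?`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'screen_wt': [4.0, 0.0, 6.0, np.nan],
    'n_cores': ['4', '?', '8', '2'],
})

# One row per column of `toy`, one column per kind of "missing" encoding.
missing_report = pd.DataFrame({
    'nan': toy.isnull().sum(),
    'zero': toy.isin([0]).sum(),
    'question_mark': toy.isin(['?']).sum(),
})
print(missing_report)
```

Whether a `0` really means "missing" depends on the feature: a width of `0` cannot be real, whereas `0` in a binary feature is a valid value.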

We first rename the columns to more readable and understandable names.

In [None]:
df = df.rename(columns={'blue':'bluetooth', 'fc':'front_cam_mp', 'sc_h':'screen_ht', 'sc_w':'screen_wt', 'pc':'back_cam_mp'})

Then we drop `mobile_wt` and `m_dep`, which we consider to be by-products of the manufacturing process rather than deciding factors for a phone's price.
We also drop `px_height` and `px_width`, on the assumption that they play little to no role in the price for the general consumer, since most consumers focus on the physical screen size rather than on how many pixels a display has.

In [None]:
df = df.drop(columns=['m_dep', 'mobile_wt', 'px_height', 'px_width'])

We now bin the continuous feature `clock_speed` into three bins, `low, mid, high`, which correspond to the ranges `0-1, 1-2, 2-3` *GHz*.

In [None]:
df['clock_speed'] = pd.cut(df['clock_speed'], bins=[0, 1, 2, 3], labels = ['low', 'mid', 'high'])
level_map = {'low':0, 'mid':1, 'high':2}
df['clock_speed'] = df['clock_speed'].replace(level_map)
df.head()
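
To illustrate how `pd.cut` assigns values that fall on a bin edge, here is a small sketch on hypothetical clock speeds (not taken from the dataset). With the default `right=True`, each bin includes its right edge, so the intervals are (0, 1], (1, 2] and (2, 3].

```python
import pandas as pd

# Hypothetical clock speeds in GHz, chosen to sit on and between the bin edges.
speeds = pd.Series([0.5, 1.0, 1.5, 2.0, 2.8])

# Same bins and labels as in the cell above.
binned = pd.cut(speeds, bins=[0, 1, 2, 3], labels=['low', 'mid', 'high'])
print(binned.tolist())
```

Note that a value of exactly `0` would fall outside the first interval and become `NaN` unless `include_lowest=True` is passed.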

We also transform the features `screen_wt` and `screen_ht` which are in *cm*, to new feature `screen_size` which is the diagonal screen size in *inches*, which is the industry standard for measuring screen sizes of phones.

In [None]:
df[['screen_ht', 'screen_wt']].describe()

# Convert column datatypes to float.
df['screen_ht'] = df['screen_ht'].astype(float)
df['screen_wt'] = df['screen_wt'].astype(float)

We first handle the missing values in `screen_wt`, which are recorded as `0`. For each height that has a missing width, we calculate the mean width for that height across the whole `df` and use this mean to fill in the missing value.

In [None]:
# Create a list of the unique screen heights that have at least one screen width of 0.
x = df[df['screen_wt']==0]['screen_ht'].value_counts().index.tolist()

# Calculate the mean width for each of those screen heights.
# Widths of 0 are missing values, so they are skipped when computing the mean.
# mean_width maps each unique screen height to its mean width.
mean_width = {}
for d in x:
    total = 0
    n = 0
    for width in df.loc[df['screen_ht'] == d]['screen_wt']:
        if width == 0:
            continue
        total += width
        n += 1
    # Fall back to 0 if no non-missing width exists for this height.
    mean = round(total/n, 2) if n else 0.0
    print("Mean width for height", d, "=", mean)
    mean_width[d] = mean

# Replace every missing screen width with the mean screen width for that specific height.
for z in x:
    df['screen_wt'] = np.where(((df['screen_wt']==0.0) & (df['screen_ht']==z)), mean_width.get(z), df['screen_wt'])
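
The same idea can be expressed more compactly with `groupby`. The sketch below is an alternative formulation on a made-up frame, not the code used in this notebook: zero widths are first marked as missing, then filled with the mean of the remaining widths that share the same height.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two heights, each with one missing (zero) width.
toy = pd.DataFrame({
    'screen_ht': [10, 10, 10, 12, 12],
    'screen_wt': [4.0, 0.0, 6.0, 0.0, 8.0],
})

# Treat zero widths as missing so they do not enter the mean.
widths = toy['screen_wt'].replace(0, np.nan)

# Mean of the known widths per height, broadcast back to every row.
group_mean = widths.groupby(toy['screen_ht']).transform('mean')

# Fill only the missing widths with their group mean.
toy['screen_wt_imputed'] = widths.fillna(group_mean).round(2)
print(toy)
```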

In [None]:
df.head()

We now combine the `screen_ht` and `screen_wt` features into the `screen_size` feature, using the formula for the diagonal of a rectangle, $\sqrt{width^2 + height^2}$. We then convert to *inches* by dividing by 2.54.

In [None]:

df['screen_size'] = df['screen_ht']**2 + df['screen_wt']**2
df['screen_size'] = np.sqrt(df['screen_size'])
df['screen_size'] = df['screen_size']/2.54
df['screen_size'] = df['screen_size'].round(2)
df.drop(columns=['screen_ht', 'screen_wt'], inplace=True)

p = pd.DataFrame(df['price_range'])
df.drop(columns=['price_range'], inplace=True)
df = df.join(p)
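
As a quick sanity check of the diagonal conversion, here is a worked example in plain Python. The helper name `diagonal_inches` and the 12 cm by 5 cm screen are made up for illustration.

```python
import math

CM_PER_INCH = 2.54

def diagonal_inches(height_cm, width_cm):
    """Diagonal of a height x width rectangle, converted from cm to inches."""
    return round(math.hypot(height_cm, width_cm) / CM_PER_INCH, 2)

# A 12 cm x 5 cm screen has a diagonal of exactly 13 cm.
example = diagonal_inches(12, 5)
print(example)
```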

We also separate the categorical features, the continuous features and the target feature into separate variables.

In [None]:
categorical_features = ['bluetooth', 'dual_sim', 'four_g', 'three_g', 'touch_screen', 'wifi']
continuous_features = ['battery_power', 'front_cam_mp', 'int_memory', 'n_cores', 'back_cam_mp', 'ram', 'clock_speed', 'talk_time']
TARGET = ['price_range']

## Summary of features
We describe the continuous features to understand the dataset and draw some insights.

In [None]:
df[continuous_features].describe(include='all')

Here is the shape of our `df` after dropping the columns we do not need.

In [None]:
df.shape

We now print unique values for all features to find any missing/outlier values.

In [None]:
df['bluetooth'].unique()

In [None]:
df['dual_sim'].value_counts()

In [None]:
df['front_cam_mp'].unique()

In [None]:
df['back_cam_mp'].unique()

We assume that the `0` values in `front_cam_mp` and `back_cam_mp` are not outliers but instead mean that those phones lack that specific camera.

In [None]:
df['four_g'].value_counts()

In [None]:
df['int_memory'].describe()

The extreme values for `int_memory` are 2 GB and 64 GB, both of which are available as internal memory sizes on phones. Hence there are no outliers.

In [None]:
df['n_cores'].value_counts()

In [None]:
df['ram'].describe()

In [None]:
df['talk_time'].describe()

In [None]:
df['three_g'].value_counts()

In [None]:
df['touch_screen'].value_counts()

In [None]:
df['wifi'].value_counts()

In [None]:
df['price_range'].value_counts()

All categorical features take only valid values; there are no missing values or outliers.

In [None]:
df.head()

# Data Exploration and Visualisation
In this section we visualize the data and try to get some insights from it.
## Univariate visualizations
In this section we try to visualize and analyze one feature at a time.

First we plot a bar plot of the counts of `n_cores` across the whole dataset. There is no clear pattern, but some core counts (such as `4`, `7` and `8`) appear more frequently than others. This is consistent with quad-core and octa-core being standard configurations for many major processor brands (like Qualcomm and MediaTek).

In [None]:
# Bar Plot
fig = plt.figure(figsize = (6, 4))
title = fig.suptitle("No. of Cores", fontsize = 14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel("No.of Cores")
ax.set_ylabel("Count #") 
w_q = df['n_cores'].value_counts()
w_q = (list(w_q.index), list(w_q.values))
ax.tick_params(axis='both', which='major', labelsize=8.5)
bar = ax.bar(w_q[0], w_q[1], color='steelblue', 
        edgecolor='black', linewidth=1)

Next, we plot a pie chart of the `clock_speed` feature. The `low` clock speed appears most often, which suggests that the majority of the phones run on a processor between `0-1 GHz`, followed by `1-2 GHz` and lastly the high-end phones with `2-3 GHz`.

In [None]:
labels = ['low', 'mid', 'high']
df['clock_speed'].value_counts().plot(kind='pie', autopct='%.2f')
plt.tight_layout()
plt.legend(labels)
plt.show()

Since `screen_size` is a continuous feature, we plot a density graph to find which screen sizes are most prominent. We find that screen sizes fall into 2 distinct categories, which aligns with today's trend of smaller 4.7 *inch* screens and larger 6+ *inch* screens (Apple iPhone SE 2020: 4.7 *inches*, iPhone 11 Pro Max: 6.5 *inches*).

In [None]:
df['screen_size'].plot(kind='density', title="Screen size density plot")

## Bivariate visualizations
In this section we try to visualize and analyze two features at a time.

We start by plotting a 2D histogram of `ram` against `price_range`, which shows the density of RAM sizes for each price range. The typical RAM sizes for each price range are:
- `low` : 0 - 750 MB
- `mid` : 1250 - 1800 MB
- `high` : ~2500 MB
- `v.high` : 3500 - 4000 MB

The pattern is clear: as the price range increases, the typical RAM size also increases.

In [None]:
labels = ['low', 'mid', 'high', 'v.high']
h, x, y, i = plt.hist2d(df['price_range'], df['ram'], bins=(4, 16), cmap='Blues')
bin_w = (max(x) - min(x)) / (len(x) - 1)
plt.xticks(np.arange(min(range(0,4))+bin_w/2, max(range(0, 4)), bin_w), labels)
plt.title("Price range vs. RAM")
plt.xlabel("Price range")
plt.ylabel("RAM size in MB")
cb = plt.colorbar(i)
cb.set_label('ram counts in bin')
plt.show()

Now we plot bar graphs of `talk_time` for phones with and without `dual_sim`. Even though there is no clear pattern, the talk time of non dual sim phones tends to fall at the higher end, and vice versa. This suggests that dual sim phones have less talk time on average than non dual sim phones.

In [None]:
# Using subplots or facets along with Bar Plots
fig = plt.figure(figsize = (10, 4))
title = fig.suptitle("Dual_sim vs. talk_time", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

# Non Dual Sim
ax1 = fig.add_subplot(1,2, 1)
ax1.set_title("Non Dual Sim")
ax1.set_xlabel("Talk-Time in Hrs.")
ax1.set_ylabel("Frequency") 
rw_q = df[df['dual_sim'] == 0]['talk_time'].value_counts()
rw_q = (list(rw_q.index), list(rw_q.values))
ax1.set_ylim([0,70])
ax1.tick_params(axis='both', which='major', labelsize=8.5)
bar1 = ax1.bar(rw_q[0], rw_q[1], color='red', 
               edgecolor='black', linewidth=1)

# Dual Sim
ax2 = fig.add_subplot(1,2, 2)
ax2.set_title("Dual Sim")
ax2.set_xlabel("Talk-time in Hrs.")
ax2.set_ylabel("Frequency") 
ww_q = df[df['dual_sim'] == 1]['talk_time'].value_counts()
ww_q = (list(ww_q.index), list(ww_q.values))
ax2.set_ylim([0, 70])
ax2.tick_params(axis='both', which='major', labelsize=8.5)
bar2 = ax2.bar(ww_q[0], ww_q[1], color='white', 
               edgecolor='black', linewidth=1)

Lastly, we plot a boxplot of `battery_power` against `price_range`. In general, battery capacity tends to increase as the price range increases. The exception is the step from `mid` to `high`: `high` range phones show very little increase in battery capacity compared to `mid` range phones.

In [None]:
df.boxplot(column='battery_power', by='price_range', figsize=(6,5))

## Multivariate visualizations
In this section we try to visualize and analyze three or more features at a time.

To start with, we plot 5 different features against each other in a pair plot to find any dependence or pattern between them. The following observations can be made from the pair plot:
- The `ram` feature shows a distinct separation along the target feature, meaning that average RAM differs most between price ranges.
- `low` and `v.high` priced phones tend to have lower and higher internal memory respectively, whereas `mid` and `high` priced phones come with a wide range of internal memory sizes.
- All phones tend to have a bigger `screen_size`, except `high` priced phones. This may mean that some people purchase high priced phones but look for smaller screen sizes.
- In general, only `v.high` priced phones offer higher battery capacity.

In [None]:
# Pairwise plots of selected features, coloured by price range.
cols = ['ram', 'int_memory', 'screen_size', 'battery_power','price_range']
# `height` replaces the `size` argument, which was renamed in newer seaborn versions.
pp = sns.pairplot(df[cols], hue='price_range', height=1.8, aspect=1.8, 
                  palette={0: "#FF9999", 1: "#FFE888", 2:"#2A9D8F", 3:"#E63946"},
                  plot_kws=dict(edgecolor="black", linewidth=0.5))
fig = pp.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Phone Attributes Pairwise Plots', fontsize=14)

Secondly, we plot a relationship plot of `int_memory` vs. `four_g` vs. `front_cam_mp`. As the front camera megapixels increase, the internal memory of phones also tends to increase, presumably to cope with the multimedia possibilities that a good camera module opens up.

In [None]:
sns.relplot(y="int_memory", x="front_cam_mp", hue='four_g', kind="line", data=df)

In [None]:
# Importing Axes3D registers the 3D projection in older matplotlib versions.
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['battery_power'], df['back_cam_mp'], df['screen_size'], c='orange')

ax.set_xlabel('Battery Power')
ax.set_ylabel('Back_Camera_Px')
ax.set_zlabel('Screen_Size')

plt.show()

# Methodology

We treat this machine learning project as a classification task. The 4 algorithms used to fit the models are:<br><br>
1. K-Nearest Neighbors (commonly known as KNN)<br>
2. Decision Tree<br>
3. Random Forest<br>
4. Support Vector Machine (also known as SVM)<br><br>

Firstly, we applied repeated cross-validation to the full set of features present in the dataset after preprocessing and found a CV score of **0.38**. As the score was low, we decided to apply feature selection to see whether the accuracy improved. We applied the `f-score` and `random forest importance` methods for feature selection and compared the accuracy of both. We finally chose the F-score as the feature selector and used its features for further analysis.


We then applied the above-mentioned algorithms to the best features given by the F-score method and estimated the accuracy of all models. For each algorithm, we also tuned the hyperparameters and visualised the results to find the best accuracy score for the corresponding model.

Lastly, we evaluated the algorithms using performance metrics and compared their performance using a paired t-test.






# Feature selection
Feature selection is the process where you automatically select features which contribute most to your prediction variable. Sometimes having many features can decrease the accuracy of model.<br><br>

1. Performance with the full set of features:
We first assessed the performance using all the features of our data. We used repeated stratified k-fold cross-validation with `splits = 5` and `repetitions = 3`, set the scoring metric to accuracy, and computed the result using `cross_val_score()`.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score,RepeatedStratifiedKFold

Data= df.drop(columns=['price_range'])
target = df[TARGET]
Data = preprocessing.MinMaxScaler().fit_transform(Data)

clf = KNeighborsClassifier(n_neighbors=1)
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=786)
scoring_metric = 'accuracy'
cv_results_full = cross_val_score(estimator=clf, X=Data, y=target, cv=cv_method,scoring=scoring_metric)

cv_results_full.mean().round(2)

With the full set of features and a `1`-nearest-neighbor classifier, we achieved an accuracy score of **38%**.
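
To make the cross-validation setup concrete, the sketch below runs the same kind of splitter on a small synthetic target (40 made-up rows, 10 per class). It shows that 5 splits repeated 3 times yield 15 train/test partitions, and that stratification keeps the class balance in every test fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic data: 40 rows, four balanced classes (10 rows each).
X = np.zeros((40, 2))
y = np.repeat([0, 1, 2, 3], 10)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
splits = list(cv.split(X, y))

# One model fit per (repeat, fold) pair.
n_fits = len(splits)

# Class counts inside each test fold.
test_class_counts = [np.bincount(y[test_idx], minlength=4).tolist()
                     for _, test_idx in splits]
print(n_fits, test_class_counts[0])
```

`cross_val_score` returns one score per partition, which is why the mean of the 15 scores is reported.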

2. Feature selection using F-score:<br><br>
The F-score method ranks features by the strength of the relationship between each descriptive feature and the target feature, measured with the ANOVA F-statistic. We set the number of features to `8`. `fs_indices_fscore` holds the indices of the top 8 features, sorted from highest to lowest score.
 

In [None]:
Data = df.drop(columns=['price_range'])
target = df[TARGET]
Data = preprocessing.MinMaxScaler().fit_transform(Data)

In [None]:
from sklearn import feature_selection as fs
num_features = 8
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=num_features)
fs_fit_fscore.fit_transform(Data, target)
fs_indices_fscore = np.argsort(np.nan_to_num(fs_fit_fscore.scores_))[::-1][0:num_features]
fs_indices_fscore
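
For intuition about what `f_classif` measures, the sketch below computes the one-way ANOVA F-statistic by hand on synthetic data and compares it with scikit-learn's result. The data and the helper `anova_f` are made up for illustration: one feature shifts with the class, the other is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2, 3], 50)                 # four balanced classes
informative = y + rng.normal(0, 0.5, y.size)    # mean shifts with the class
noise = rng.normal(0, 1, y.size)                # unrelated to the class
X = np.column_stack([informative, noise])

def anova_f(x, y):
    """One-way ANOVA F: between-class variance over within-class variance."""
    classes = np.unique(y)
    grand_mean = x.mean()
    ss_between = sum(x[y == c].size * (x[y == c].mean() - grand_mean) ** 2
                     for c in classes)
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum()
                    for c in classes)
    df_between = classes.size - 1
    df_within = x.size - classes.size
    return (ss_between / df_between) / (ss_within / df_within)

manual_scores = np.array([anova_f(X[:, j], y) for j in range(X.shape[1])])
sklearn_scores, _ = f_classif(X, y)

# Indices of the features, best first (same idiom as in the cell above).
ranking = np.argsort(sklearn_scores)[::-1]
print(manual_scores, sklearn_scores, ranking)
```

A feature whose class means are far apart relative to the spread within each class gets a large F-score, which is why it ranks first here.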

In [None]:
best_features_fscore = df.columns[fs_indices_fscore].values
best_features_fscore

* We got `ram, battery_power, int_memory, clock_speed, screen_size, n_cores, talk_time` and `front_cam_mp` as best features based on F-score.

In [None]:
feature_importances_fscore = fs_fit_fscore.scores_[fs_indices_fscore]
feature_importances_fscore

In [None]:
import altair as alt

def plot_imp(best_features, scores, method_name, color):
    
    df = pd.DataFrame({'features': best_features, 
                       'importances': scores})
    
    chart = alt.Chart(df, 
                      width=500, 
                      title=method_name + ' Feature Importances'
                     ).mark_bar(opacity=0.75, 
                                color=color).encode(
        alt.X('features', title='Feature', sort=None, axis=alt.AxisConfig(labelAngle=45)),
        alt.Y('importances', title='Importance')
    )
    
    return chart

* Plotting `best_features_fscore` to visualise the feature importance.

In [None]:
plot_imp(best_features_fscore, feature_importances_fscore, 'F-Score', 'red')


*Assessing the performance of the selected features using cross-validation.*

In [None]:
cv_results_fscore = cross_val_score(estimator=clf,
                             X=Data[:, fs_indices_fscore],
                             y=target, 
                             cv=cv_method, 
                             scoring=scoring_metric)
cv_results_fscore.mean().round(3)

3. Feature selection using Random Forest Importance<br><br>
Random Forest importance (RFI) is a widely used feature selector because of its accuracy, robustness and ease of use. The importance of a feature reflects how much the Gini impurity decreases, on average across the trees, when that feature is chosen to split a node.


In [None]:
Data= df.drop(columns=['price_range'])
target=df[TARGET]
Data=preprocessing.MinMaxScaler().fit_transform(Data)

In [None]:
Data

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_rfi = RandomForestClassifier(n_estimators=100)
model_rfi.fit(Data, target)
fs_indices_rfi = np.argsort(model_rfi.feature_importances_)[::-1][0:num_features]
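
As a small illustration of how these importances behave, the sketch below fits a forest on synthetic data with one informative feature and one noise feature (both made up, and with a fixed `random_state`, which the cell above does not set). The importances are normalised to sum to 1, and the informative feature receives most of the weight.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2, 3], 50)
X = np.column_stack([
    y + rng.normal(0, 0.5, y.size),   # informative: mean shifts with the class
    rng.normal(0, 1, y.size),         # noise: unrelated to the class
])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Indices of the features, most important first.
ranking = np.argsort(importances)[::-1]
print(importances, ranking)
```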

In [None]:
best_features_rfi = df.columns[fs_indices_rfi].values
best_features_rfi

* We got `ram`, `battery_power`, `screen_size`, `int_memory`, `talk_time`, `front_cam_mp`, `back_cam_mp` and `n_cores` as the best features based on random forest importance.

In [None]:
feature_importances_rfi = model_rfi.feature_importances_[fs_indices_rfi]
feature_importances_rfi

* Plotting the `best_features_rfi` to visualise the feature importance

In [None]:
plot_imp(best_features_rfi, feature_importances_rfi, 'Random Forest', 'green')

*Assessing the performance of the selected features using cross-validation.*

In [None]:
cv_results_rfi = cross_val_score(estimator=clf,
                             X=Data[:, fs_indices_rfi],
                             y=target, 
                             cv=cv_method, 
                             scoring=scoring_metric)
cv_results_rfi.mean().round(3)

Comparing the overall performance, we found that the F-score feature selector gives a better accuracy score than random forest importance.

Hence we use `best_features_fscore` for fitting the models.


In [None]:
print('Full Set of Features:', cv_results_full.mean().round(3))
print('F-Score:', cv_results_fscore.mean().round(3))
print('RFI:', cv_results_rfi.mean().round(3))

### Splitting the data into training and test set

We use the sample of 1,200 rows (out of the 2,000 rows in the full dataset) for model fitting and evaluation. We split the data in a `70:30` ratio, i.e. 70% of the data is used to build a model and 30% to test it, so that accuracy is measured on unseen data.

In [None]:
Data = df[best_features_fscore].copy()
target = df[TARGET]
Data = preprocessing.MinMaxScaler().fit_transform(Data)

In [None]:
from sklearn.model_selection import train_test_split

D_train, D_test, t_train, t_test = train_test_split(Data, target, test_size=0.3, random_state=786)
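
For reference, the sketch below applies the same split to placeholder arrays of the same size (1,200 rows and 8 features, all zeros) to show the resulting set sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the selected features.
X = np.zeros((1200, 8))
y = np.repeat([0, 1, 2, 3], 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=786)
print(X_train.shape, X_test.shape)
```

Passing `stratify=y` would additionally keep the class proportions identical in both parts; the split in this notebook does not use it.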

# Model fitting

## 1. K-Nearest Neighbor (KNN)
We fit a `KNeighborsClassifier` with the default parameter values `n_neighbors = 5` and `p = 2`. `n_neighbors` is the number of neighbors used, and `p = 2` selects the Euclidean distance metric.
The `score` function returns the accuracy of the classifier on the test data. Accuracy is the ratio of correctly predicted observations to the total number of observations. The computed accuracy was 59.44%.<br>


In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5, p=2)
knn_classifier.fit(D_train, t_train) 
knn_classifier.score(D_test, t_test)
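
The accuracy definition above can be written out in a few lines of plain Python. The labels below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical price ranges: 6 of the 8 predictions are correct.
y_true = [0, 1, 2, 3, 3, 2, 1, 0]
y_pred = [0, 1, 2, 3, 2, 2, 0, 0]
example_accuracy = accuracy(y_true, y_pred)
print(example_accuracy)
```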

## Hyperparameter tuning using Grid Search
Grid search is used to find the hyperparameter values of a model that give the most *accurate* predictions.<br>

Below we define a `grid_search` function to which we pass the classifier (KNN, DT, RF, or SVM) and the training data.
* We define a different set of parameters for each algorithm in the `grid_params` dictionary.
* The function returns the best model, its parameters and the full grid search results.
* In addition, we use the repeated stratified CV method.
* We also tell the sklearn library which metric to optimize, i.e. accuracy in our case.<br>

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def grid_search(D_train, t_train, clf):
   
    if isinstance(clf, KNeighborsClassifier): 
        grid_params = {
        'n_neighbors':[3, 5, 7, 9, 11, 13, 15],
        'p':[1, 2, 3]
        }
    elif isinstance(clf, DecisionTreeClassifier): 
        grid_params = {
        'criterion':['gini','entropy'],
        'min_samples_split':[2, 3, 4],
        'max_depth':[1, 2, 3, 4, 5, 6, 7, 8]
        }
    elif isinstance(clf, RandomForestClassifier):
        grid_params = {
        'n_estimators':[110, 130, 150, 200],
        'criterion':['gini','entropy'],
        'min_samples_split':[2, 3, 4],
        'max_depth':[3, 4, 5]
        
        }
    elif isinstance(clf, SVC):
        grid_params = {
            'C':[1, 10, 50, 100],
            'gamma':[1, 0.1, 0.05, 0.001],
            'kernel':['rbf', 'poly', 'sigmoid']
        }
    else : 
        raise ValueError("unknown classifier")

    gs = GridSearchCV(
        estimator = clf,
        param_grid = grid_params,
        scoring = scoring_metric,  # 'accuracy', as defined in the feature selection section
        verbose = 3,
        cv = cv_method,
        n_jobs = -1,
        refit = True   
    )

    gs_results = gs.fit(D_train, t_train)
    p = gs_results.best_params_
    model = gs_results.best_estimator_
    return model, p, gs_results
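
To get a feel for the cost of each search, the sketch below counts the parameter combinations in each grid and multiplies by the 15 cross-validation folds (5 splits, 3 repeats). It only counts the cross-validation fits and leaves out the single final refit on the full training data.

```python
from math import prod

# The same grids as in grid_search above.
grids = {
    'KNN': {'n_neighbors': [3, 5, 7, 9, 11, 13, 15], 'p': [1, 2, 3]},
    'DT': {'criterion': ['gini', 'entropy'],
           'min_samples_split': [2, 3, 4],
           'max_depth': [1, 2, 3, 4, 5, 6, 7, 8]},
    'RF': {'n_estimators': [110, 130, 150, 200],
           'criterion': ['gini', 'entropy'],
           'min_samples_split': [2, 3, 4],
           'max_depth': [3, 4, 5]},
    'SVM': {'C': [1, 10, 50, 100],
            'gamma': [1, 0.1, 0.05, 0.001],
            'kernel': ['rbf', 'poly', 'sigmoid']},
}

n_folds = 5 * 3  # n_splits * n_repeats

cv_fits = {}
for name, grid in grids.items():
    # Number of candidates = product of the number of values per parameter.
    n_candidates = prod(len(values) for values in grid.values())
    cv_fits[name] = n_candidates * n_folds
    print(name, n_candidates, "candidates,", cv_fits[name], "fits")
```

The random forest search is the most expensive one, both because it has the most candidates and because each fit trains over a hundred trees.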

* With `n_neighbors: [3, 5, 7, 9, 11, 13, 15]` and `p: [1, 2, 3]`, the grid search function finds the best parameter values and calculates the model score.

In [None]:
knn_model, knn_best_estimate, knn_result = grid_search(D_train, t_train, knn_classifier)


In [None]:
knn_best_estimate

In [None]:
knn_model.score(D_test, t_test)

* The KNN classifier with `n_neighbors = 15` and `p = 1` achieved a test accuracy of 68.8%.

In [None]:
results_KNN = pd.DataFrame(knn_result.cv_results_['params'])
results_KNN['test_score'] = knn_result.cv_results_['mean_test_score']
results_KNN.head()


In [None]:
results_KNN['metric'] = results_KNN['p'].replace([1, 2, 3], ["Manhattan", "Euclidean", "Minkowski"])
results_KNN.head()
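
The three values of `p` correspond to different orders of the Minkowski distance. A minimal sketch in plain Python, using two made-up points:

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)

# p=1 is the Manhattan distance, p=2 the Euclidean distance.
distances = {p: minkowski_distance(a, b, p) for p in (1, 2, 3)}
print(distances)
```

For the same pair of points, the distance shrinks as `p` grows, because larger `p` puts more weight on the largest coordinate difference.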

### Plotting the KNN Performance comparison. 
We now visualise the hyperparameter tuning results from cross-validation, using the altair module.<br> The plot shows that the Manhattan distance (`p=1`) outperforms the others at all values of `K`.

In [None]:
import altair as alt

alt.Chart(results_KNN, 
          title='KNN Performance Comparison'
         ).mark_line(point=True).encode(
    alt.X('n_neighbors', title='Number of Neighbors'),
    alt.Y('test_score', title='Mean CV Score', scale=alt.Scale(zero=False)),
    color='metric'
)

#### Advantages of KNN Classifier:
*   The algorithm is simple and easy to implement.
*   The algorithm is versatile and can be used for classification, regression and search.<br><br>

#### Disadvantages:
*   The algorithm gets slower as the number of independent variables increases, which is a problem when predictions need to be made rapidly.

#### Limitations:
*   Needs high computing resources to handle the data quickly.

## 2. Decision Tree Classification

Decision trees are a non-parametric supervised learning method used for classification. The aim is to build a model that predicts the value of the target feature by learning decision rules inferred from the descriptive features.<br><br>
We fit the decision tree classifier with default values and `random_state = 786`, the seed used since the very beginning.<br><br>
The `score` function returns the accuracy of the classifier on the test data. Accuracy is the ratio of correctly predicted observations to the total number of observations. The accuracy measured was 76.38%.<br>


In [None]:
dt_classifier = DecisionTreeClassifier(random_state=786)
dt_classifier.fit(D_train, t_train)
dt_classifier.score(D_test, t_test)
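
To show what a learned decision rule looks like, the sketch below fits a depth-1 tree on four made-up RAM values and prints its rule with `export_text`. The data are hypothetical and only meant to illustrate the idea.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical RAM sizes in MB with a binary "cheap" (0) / "expensive" (1) label.
X = [[512], [1024], [3072], [4096]]
y = [0, 0, 1, 1]

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Human-readable form of the learned split.
rules = export_text(stump, feature_names=['ram'])
print(rules)

predictions = stump.predict([[800], [3500]]).tolist()
print(predictions)
```

The tree places its threshold halfway between the two closest values of opposite classes, here between 1024 and 3072.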

* Out of `criterion = ['gini', 'entropy']`, `min_samples_split = [2, 3, 4]` and `max_depth = [1, 2, 3, 4, 5, 6, 7, 8]`, the grid search function finds the best parameter values and calculates the model score.

In [None]:
dt_model, dt_best_estimate, dt_result = grid_search(D_train, t_train, dt_classifier)

In [None]:
dt_best_estimate

In [None]:
dt_model.score(D_test, t_test)

*With criterion `gini`, `max_depth = 4` and `min_samples_split = 2`, the model reaches an accuracy of 76.6%.*

In [None]:
results_DT = pd.DataFrame(dt_result.cv_results_['params'])
results_DT['test_score'] = dt_result.cv_results_['mean_test_score']
results_DT.head()

### Plotting the DT Performance Comparison 
The plot also shows that the best hyperparameters are `gini` and `max_depth = 4`.


In [None]:
alt.Chart(results_DT, 
          title='DT Performance Comparison'
         ).mark_line(point=True).encode(
    alt.X('max_depth', title='Maximum Depth'),
    alt.Y('test_score', title='Mean CV Score', aggregate='average', scale=alt.Scale(zero=False)),
    color='criterion'
)


#### Advantages of Decision tree classifier
*	Inexpensive to construct. 
*	Easy to interpret for small size trees. 
*	Fast at classifying unknown records.

#### Disadvantages
*	Decision tree models are often biased towards splits on features that have a large number of levels.
*	Large trees can be difficult to interpret.
*	Small change in training data can account for large change to decision logic.


## 3. Random Forest classifier
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the accuracy and control over-fitting.<br><br>
We fit the random forest classifier with `n_estimators = 100` (the number of trees in the forest), criterion `gini` and `max_depth = 2`.<br>

The `score` function returns the accuracy of the classifier on the test data. Accuracy is the ratio of correctly predicted observations to the total number of observations. The accuracy measured was *73.6%*.


In [None]:
rf_classifier = RandomForestClassifier(random_state=786,n_estimators=100,max_depth=2,criterion='gini')
rf_classifier.fit(D_train, t_train)
rf_classifier.score(D_test, t_test)

* From the parameters passed to the grid search function, `criterion = ['gini', 'entropy']`, `n_estimators = [110, 130, 150, 200]`, `max_depth = [3, 4, 5]` and `min_samples_split = [2, 3, 4]`, it finds and returns the best parameters together with the model score.

In [None]:
rf_model, rf_best_estimate, rf_result = grid_search(D_train, t_train, rf_classifier)

In [None]:
rf_best_estimate

In [None]:
rf_model.score(D_test, t_test)

*The model reaches an accuracy score of 76.9%.*

In [None]:
results_RF = pd.DataFrame(rf_result.cv_results_['params'])
results_RF['test_score'] = rf_result.cv_results_['mean_test_score']
results_RF.head()

### Plotting the RF Performance Comparison 
The plot shows that at `max_depth = 4`, `gini` outperforms `entropy`.


In [None]:
alt.Chart(results_RF, 
          title='RF Performance Comparison'
         ).mark_line(point=True).encode(
    alt.X('max_depth', title='Maximum Depth'),
    alt.Y('test_score', title='Mean CV Score', aggregate='average', scale=alt.Scale(zero=False)),
    color='criterion')


#### Advantages of Random forest classifier
*	No explicit feature selection is needed.
*	The trees are independent, so the model is easy to build in parallel.
*	If a large part of the features is missing, accuracy can still be maintained.

#### Disadvantages
*	Can over-fit on some noisy datasets.
*	Time complexity: much harder and more time consuming to construct than a single tree.

#### Limitations
*	Requires heavy computational resources.
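
As a small illustration of the first two advantages, the sketch below fits a forest in parallel (`n_jobs=-1`) and reads the per-feature importance scores from the fitted model. The data is synthetic (`make_classification`), so the feature indices are placeholders rather than columns of the mobile price dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: with shuffle=False the first 4 columns are
# informative and the remaining 4 are pure noise
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=4, shuffle=False,
                           random_state=0)

# n_jobs=-1 builds the trees in parallel on all available cores
forest = RandomForestClassifier(n_estimators=100, random_state=786, n_jobs=-1)
forest.fit(X, y)

# Impurity-based importances, normalised to sum to 1
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {importances[idx]:.3f}")
```

The forest down-weights uninformative columns on its own, which is why an explicit selection step is often optional for this model.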


## 4.Support Vector Machine classifier 
At its core, SVM is a linear model for classification problems, and the idea is simple: the algorithm finds <br>a line or hyperplane that separates the data into classes.<br><br>
We fit the model with the default parameters, i.e. kernel `rbf` and regularisation value `C = 1.0`.<br><br>
The `score` function returns the accuracy of the classifier on the test data. Accuracy is the ratio of<br> correctly predicted observations to the total number of observations. The accuracy measured was 78.6%.
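
To illustrate the separating hyperplane, here is a minimal sketch with a linear kernel on two hypothetical, well separated 2D clusters (not the project data, and not the `rbf` kernel used in this notebook). For a linear kernel the fitted model exposes the hyperplane as `coef_` (the weight vector `w`) and `intercept_` (the offset `b`), and a point is assigned to the second class when `w·x + b > 0`.

```python
import numpy as np
from sklearn.svm import SVC

# Two hypothetical, linearly separable clusters in 2D
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(20, 2)),  # class 0
    rng.normal(loc=2.0, scale=0.5, size=(20, 2)),   # class 1
])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Hyperplane parameters: w·x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]

# Classify by the side of the hyperplane each point falls on
manual_pred = (X @ w + b > 0).astype(int)
print("w =", w, "b =", b)
print("Training accuracy:", (manual_pred == y).mean())
```

With non-linear kernels such as `rbf` the same idea applies in an implicit feature space, so `coef_` is not available and the boundary in the original space is curved.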


In [None]:
svm_classifier = SVC()
svm_classifier.fit(D_train, t_train)
svm_classifier.score(D_test, t_test)

* The parameters passed to the grid search function were `gamma = [0.1, 0.05, 0.001, 1]` (all candidate values lie between 0.001 and 1), <br> `kernel = ['rbf', 'poly', 'sigmoid']` and `C = [1, 10, 50, 100]`.

In [None]:
svm_model, svm_best_estimate, svm_result = grid_search(D_train, t_train, svm_classifier)

In [None]:
svm_best_estimate

In [None]:
svm_model.score(D_train, t_train)

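*With the best parameters, the model reaches an accuracy of 83.4%. Note that the cell above scores the model on the training data (`D_train`), whereas the other models were scored on the test data.*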

In [None]:
results_SVM = pd.DataFrame(svm_result.cv_results_['params'])
results_SVM['test_score'] = svm_result.cv_results_['mean_test_score']
results_SVM.head()

### Plotting the SVM Performance Comparison 
The plot shows the mean cross-validation score against the regularisation parameter `C` for each kernel.


In [None]:
alt.Chart(results_SVM, 
          title='SVM Performance Comparison'
         ).mark_line(point=True).encode(
    alt.X('C', title='Regularisation Parameter'),
    alt.Y('test_score', title='Mean CV Score', aggregate='average', scale=alt.Scale(zero=False)),
    color='kernel'
)

#### Advantages of Support Vector Machine
*	Works well when there is a clear margin of separation.
*	Memory efficient.
#### Disadvantages
*	Not suitable for larger data sets.
*	SVM does not perform well when the data set is noisy or the target classes overlap.


## Performance comparison
After tuning each classifier on the training data using cross-validation, <br>we now perform paired t-tests to check whether the difference in performance between any two classifiers is statistically<br> significant.<br><br>
First we calculate the `cross_val_score` for each model, and then compare the models pairwise:
*	KNN-DT
*	KNN-RF
*	KNN-SVM
*	DT-RF
*	DT-SVM
*	RF-SVM<br>

We import the `stats` module from the `scipy` library to run the t-tests.


In [None]:
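from sklearn.model_selection import cross_val_score, StratifiedKFold

# Use the same 10 stratified folds for every model so that the t-tests are paired.
# random_state is omitted: it only has an effect when shuffle=True, and recent
# scikit-learn versions raise an error if it is set without shuffling.
cv_method_ttest = StratifiedKFold(n_splits=10)
cv_results_KNN = cross_val_score(estimator=knn_model, X=Data, y=target, cv=cv_method_ttest, n_jobs=-1, scoring='accuracy')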

In [None]:
cv_results_KNN.mean()

In [None]:
cv_results_RF = cross_val_score(estimator=rf_model, X=Data, y=target, cv=cv_method_ttest, n_jobs=-1, scoring='accuracy')
cv_results_RF.mean()

In [None]:
cv_results_DT = cross_val_score(estimator=dt_model, X=Data, y=target, cv=cv_method_ttest, n_jobs=-1, scoring='accuracy')
cv_results_DT.mean()

In [None]:
cv_results_SVM = cross_val_score(estimator=svm_model, X=Data, y=target, cv=cv_method_ttest, n_jobs=-1, scoring='accuracy')
cv_results_SVM.mean()

In [None]:
from scipy import stats

print(stats.ttest_rel(cv_results_KNN, cv_results_DT))
print(stats.ttest_rel(cv_results_KNN, cv_results_RF))
print(stats.ttest_rel(cv_results_KNN, cv_results_SVM))

print(stats.ttest_rel(cv_results_DT, cv_results_RF))
print(stats.ttest_rel(cv_results_DT, cv_results_SVM))

print(stats.ttest_rel(cv_results_RF, cv_results_SVM))

*The KNN-SVM pair gives a p-value of 0.0002, which is below 0.05, so the difference in performance between these two classifiers is statistically significant at the 5% level.*
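
For reference, the paired t-test works on the per-fold differences between two classifiers. The sketch below uses two hypothetical arrays of fold accuracies (made-up numbers, not the results of this notebook) and recomputes the statistic by hand as the mean difference divided by its standard error, with `n - 1` degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies of two classifiers on the same 10 CV folds
scores_a = np.array([0.80, 0.78, 0.82, 0.79, 0.81, 0.77, 0.83, 0.80, 0.79, 0.81])
scores_b = np.array([0.84, 0.83, 0.85, 0.82, 0.86, 0.81, 0.85, 0.84, 0.83, 0.86])

# Manual computation on the per-fold differences
diff = scores_a - scores_b
n = len(diff)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

# Same test via scipy
result = stats.ttest_rel(scores_a, scores_b)
t_scipy, p_scipy = result.statistic, result.pvalue

significant = p_scipy < 0.05
print(t_manual, p_manual, significant)
```

The test is only valid as a paired test when both score arrays come from the same folds, which is why a single cross-validation splitter is reused for all four models.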

# Model evaluation 
Model evaluation is an important step in determining the best model and how well it will perform.<br><br>
The target variable for our dataset is multinomial; that is, the target feature is categorical with 4 different <br>levels {0, 1, 2, 3}, which refer to the price ranges {'low', 'mid', 'high', 'v high'}. <br>Hence we cannot directly use a binary metric such as the ROC AUC curve to evaluate a multinomial classifier.<br><br>
Below we compute the accuracy, confusion matrix, classification report and average model accuracy<br> for each model.
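
As a small worked example of these metrics, the sketch below uses hypothetical, deliberately imbalanced labels (not the project data) to show how plain accuracy and the "average model accuracy" (scikit-learn's `balanced_accuracy_score`, the mean of the per-class recalls) are both derived from the confusion matrix.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted price ranges (0 = low ... 3 = very high)
t_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 3])
t_pred = np.array([0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3, 2])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(t_true, t_pred)

# Plain accuracy: correct predictions (diagonal) over all predictions
accuracy = cm.diagonal().sum() / cm.sum()

# Per-class recall: correct predictions for a class over its true count
per_class_recall = cm.diagonal() / cm.sum(axis=1)

# Balanced accuracy: unweighted mean of the per-class recalls
balanced = per_class_recall.mean()

print(cm)
print("accuracy:", accuracy, "balanced accuracy:", balanced)
```

When the classes are equally frequent the two numbers coincide; they diverge only when some price ranges have more observations than others.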


In [None]:
from sklearn import metrics
def print_model_stats(model, D_test, t_test):
    pred = model.predict(D_test)
    print("=========={model_name} Model Statistics=============".format(model_name=model.__class__.__name__))
    print("Accuracy score:", metrics.accuracy_score(t_test, pred))
    print("Confusion Matrix:\n", metrics.confusion_matrix(t_test, pred))
    print("Classification report:\n", metrics.classification_report(t_test, pred))
    print("Average model accuracy:", metrics.balanced_accuracy_score(t_test, pred))

In [None]:
print_model_stats(knn_model, D_test, t_test)

In [None]:
print_model_stats(dt_model, D_test, t_test)

In [None]:
print_model_stats(rf_model, D_test, t_test)

In [None]:
print_model_stats(svm_model, D_test, t_test)

*Hence we conclude that SVM gives the best model accuracy and should be used for predicting this target feature.*

# Summary and Conclusion
After cleaning and visualizing the dataset, we found a clear pattern for features like `ram`, which was proportional to the `price_range`, whereas some features, like `bluetooth` and `wifi`, had very little information gain. We also noticed a pattern in `screen-size`: two distinct sizes were manufactured the most, which aligns with the popular screen sizes offered by big brands (like Apple's iPhone).

The case study was to predict the cell phone price range based on the descriptive features. We successfully built models using the parameters given by grid search; that is, we fine-tuned the parameters and applied the best ones when training each model. Each model was then tested and its accuracy was computed. Out of the 4 algorithms, SVM gave us the best accuracy with 84%.<br><br> 

We also performed paired t-tests to determine whether the difference in performance between any two classifiers is statistically significant, and found the KNN-SVM difference to be significant. Last but not least, we used model evaluation metrics to verify the accuracy of the multinomial classifiers, and they gave the same result.

There were also some limitations. The f-score method does not reveal mutual information among the features, but we still used it because it gave a higher score than random forest importance.

We also considered only a few cases for feature selection and parameter tuning. We could have explored more parameters and more feature selection methods, which might have given us a better model.

# References
- Sharma, A. (2018). Mobile Price Classification [online]. Retrieved from https://www.kaggle.com/iabhishekofficial/mobile-price-classification

- Vural, A. (2019). Feature Ranking [online]. Retrieved from http://www.featureranking.com

- Pupale, R. (2018). Support Vector Machines(SVM) — An Overview [online]. Retrieved from https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989
