# DSC 478 Spring Quarter 2021

## Project Type: Analysis
## Team Members: Cassandra Steffey, Daniel O’Brien, Dina Allen

### Link to Data Set: 
https://www.kaggle.com/iabhishekofficial/mobile-price-classification?select=train.csv


# Introduction

#### The price of a mobile phone is difficult to determine because of the number of components that go into a phone and the rapid advance of the technology over the last two decades. Many of the components that make a phone expensive, such as RAM, are hidden from the buyer. Retail sellers need to understand the complexity of a phone in order to price it for optimal profit.



# Abstract

#### The dataset we worked with contains sales information about mobile phones. To determine the best model for predicting the price range of a phone, we used dimensionality reduction (PCA), clustering (K-Means), and classification methods: KNN, Support Vector Machines, Naive Bayes and Decision Trees.

# Methods

## Data Pre-Processing

#### Prior to analysis we pre-processed the data. Pre-processing consisted of checking for missing values, examining the distribution of the data, and normalizing it. We did not find any missing values in the dataset. The normalization technique we used was z-score normalization. Finally, we split the data into training and testing sets, setting aside 20% of the data for testing and using the remaining 80% for training.
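
#### The steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the data frame is a small synthetic stand-in for train.csv, and the column names `ram`, `battery_power` and `price_range` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train.csv (assumption: the real file is not read here).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ram": rng.integers(256, 4000, size=200),
    "battery_power": rng.integers(500, 2000, size=200),
    "price_range": rng.integers(0, 4, size=200),
})

# Count missing values in every column.
missing_per_column = df.isna().sum()

X = df.drop(columns="price_range")
y = df["price_range"]

# Hold out 20% of the rows for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# z-score normalization: fit on the training rows, then apply to both splits.
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)

print(missing_per_column)
print(X_train_norm.shape, X_test_norm.shape)
```

#### Fitting the scaler on the training rows only is one common way to keep information from the test set out of the normalization; the notebook may order these steps differently.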

See [Data Pre-Processing](./DataPre-Processing.ipynb)

## PCA

#### After data cleaning and pre-processing, we performed Principal Component Analysis to reduce the dimensionality of the data, with the aim of improving model accuracy. We found that 99% of the variance in the data could be explained by two principal components. Every model was fit on three versions of the data: the normalized data, the first principal component alone, and the first two principal components.
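
#### A minimal sketch of how the number of components needed for 99% of the variance can be read off the cumulative explained variance ratio. The data here is synthetic (an assumption for illustration), so the printed numbers are not the project's results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data driven by two latent directions plus a little noise (assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))
X_norm = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X_norm)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.99.
n_components_99 = int(np.searchsorted(cumulative, 0.99)) + 1

# The two reduced data sets used alongside the normalized data.
X_1pc = PCA(n_components=1).fit_transform(X_norm)
X_2pc = PCA(n_components=2).fit_transform(X_norm)

print(cumulative)
print(n_components_99)
```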

See [PCA](./PrincipalComponentAnalysis.ipynb)

## Clustering

#### Before performing K-Means clustering, we examined how the distortion and inertia change with the number of clusters in order to choose the ideal number of clusters. Both the distortion, which measures the average of the squared distances from each point to its cluster center, and the inertia, which measures the sum of the squared distances from each point to its cluster center, indicated that the ideal number of clusters is somewhere between 3 and 5.
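
#### The sketch below computes both quantities for a range of cluster counts, using the definitions given above (inertia as the sum of squared distances, distortion as their average). The blob data is a synthetic assumption, and the range of 1 to 8 clusters is chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four well-separated groups (assumption).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = {}
distortions = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Sum of squared distances from each point to its assigned center.
    inertias[k] = km.inertia_
    # Average squared distance per point.
    distortions[k] = km.inertia_ / X.shape[0]

for k in inertias:
    print(k, round(inertias[k], 1), round(distortions[k], 3))
```

#### Plotting either quantity against the number of clusters and looking for the point where the curve flattens is the usual "elbow" reading of these values.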

#### K-Means performed best with 5 clusters, although the results were poor for every number of clusters tried. With 5 clusters, the testing data returned a completeness score of 0.22, a homogeneity score of 0.24 and a mean silhouette value of 0.52. The low homogeneity score indicates that clusters contained more than one class, and the low completeness score indicates that members of the same class were not placed in the same cluster. The mean silhouette value compares how close each point is to the other points in its own cluster with how close it is to points in other clusters. A value of 0.52 indicates that some points were far from other points of the same cluster or close to points of other clusters.
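
#### A minimal sketch of how these three scores can be computed with scikit-learn on a held-out set. The data is synthetic (an assumption), so the values printed will differ from those reported above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score, homogeneity_score, silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic four-class data with overlapping groups (assumption).
X, y = make_blobs(n_samples=400, centers=4, cluster_std=2.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit five clusters on the training data, then assign the test points.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
test_clusters = km.predict(X_test)

# Completeness and homogeneity compare cluster assignments with true classes.
completeness = completeness_score(y_test, test_clusters)
homogeneity = homogeneity_score(y_test, test_clusters)
# The silhouette score uses only the features and the cluster assignments.
silhouette = silhouette_score(X_test, test_clusters)

print(completeness, homogeneity, silhouette)
```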

See [Clustering Notebook](./ClusterProject.ipynb)


## KNN

#### K-Nearest Neighbors was performed on the normalized data and on the principal components found during the principal component analysis. Using grid search, we determined the best parameters for KNN before creating the classifier and calculating its metrics. The grid search returned the parameters n_neighbors = 7 and weights = 'uniform'. This gave a training accuracy of 0.70 and a testing accuracy of 0.64 on the normalized data. The small gap between these values does not suggest overfitting, but the accuracy is not ideal for a prediction model. The model using the PCA data performed worse than the model using the normalized data alone.
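
#### A minimal sketch of a grid search over `n_neighbors` and `weights`. The synthetic data, the candidate values in the grid and the 5-fold cross-validation are all assumptions for illustration; the notebook's actual grid is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data standing in for the phone data (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Hypothetical grid of candidate parameters.
param_grid = {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set by default.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)

print(search.best_params_, train_acc, test_acc)
```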

#### Looking at the outputs, there is a lot of misclassification, with groups one and three often labeled as group two. Groups zero and two also have many instances that are predicted as group one. Group zero was predicted the most accurately and group one the least accurately. Overall, this model did not perform as well as we would like, but it did classify certain groups well.
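
#### Per-group behavior of this kind is typically read from a confusion matrix. The sketch below builds one on synthetic data (an assumption) with the reported KNN parameters; the counts it prints are not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class data (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=7, weights="uniform").fit(X_train, y_train)

# Rows are the true groups, columns are the predicted groups.
cm = confusion_matrix(y_test, knn.predict(X_test), labels=[0, 1, 2, 3])

# Fraction of each true group that was predicted correctly.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```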

See [KNN Notebook](./KNearestNeighbors.ipynb)

## Decision Trees

#### Initially the decision tree classifier returned a training score of 0.99 and a testing score of 0.57. This indicates that the classifier fit the training data too closely and therefore performed poorly on the testing data. This overfitting can be addressed by either pre-pruning or post-pruning the decision tree. We used cost complexity pruning, which is controlled by the ccp_alpha parameter of the decision tree classifier: the higher the ccp_alpha value, the more nodes are removed from the tree. We examined the impact of different ccp_alpha values on the number of nodes, the tree depth and the level of impurity. The final analysis produced a graph of training and testing accuracy against the different alpha values.
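
#### The analysis described above can be sketched with scikit-learn's `cost_complexity_pruning_path`. The data is synthetic (an assumption), so the alpha values and accuracies printed here are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class data (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Effective alphas at which nodes are pruned, with the matching leaf impurities.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating point error.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Fit one tree per alpha and record its size and accuracy.
results = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    results.append((
        alpha,
        tree.tree_.node_count,
        tree.get_depth(),
        tree.score(X_train, y_train),
        tree.score(X_test, y_test),
    ))

# Alpha with the highest testing accuracy, as in the description above.
best_alpha, _, _, best_train_acc, best_test_acc = max(results, key=lambda r: r[4])
print(best_alpha, best_train_acc, best_test_acc)
```

#### Selecting alpha on the testing split mirrors the description above; choosing it on a separate validation split or by cross-validation would give a less optimistic estimate of the final accuracy.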

#### A ccp_alpha value of 0.01 was selected because it coincided with peak performance on the testing data and similar performance on the training data. This indicates that the pruned tree no longer overfits the training data and gives a realistic picture of how it performs on the testing data. The final decision tree classifier returned a training accuracy of 0.67 and a testing accuracy of 0.63. This is an improvement over the unpruned testing accuracy of 0.57; however, the modest accuracy may indicate that a decision tree is not an ideal classification model for this dataset.

See [Decision Tree Notebook](./decisiontree.ipynb)

## Naive Bayes

#### Naive Bayes was performed using the normalized data and the first and second principal components. Multinomial Naive Bayes performed very poorly, as did both the multinomial and Gaussian models when using the principal components. The Gaussian model on the full normalized data set resulted in a 53.75% training score and a 55.5% test score. While this is not ideal for a final model, it was the most promising of the Naive Bayes models that were run.

#### Because Gaussian Naive Bayes assumes that predictors are independent and normally distributed, it struggled to classify test instances accurately. Data pre-processing indicated that many of the variables are uniformly distributed, which could contribute to the poor performance of the Naive Bayes model. The final model selected specifically struggled to classify the third price range category: of the 53 group 3 records in the test set, only 5 were categorized correctly. This could be a result of it being the most expensive price category. Our data pre-processing also showed that the different components of the phones did not vary significantly between price categories.
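
#### A minimal sketch comparing the two Naive Bayes variants on synthetic data (an assumption). scikit-learn's `MultinomialNB` rejects negative feature values, so this sketch rescales the features to the range 0 to 1 for that model; whether the notebook did the same is not known.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic four-class data (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Gaussian Naive Bayes on z-score normalized features.
z = StandardScaler().fit(X_train)
gnb = GaussianNB().fit(z.transform(X_train), y_train)
gnb_train_acc = gnb.score(z.transform(X_train), y_train)
gnb_test_acc = gnb.score(z.transform(X_test), y_test)

# Multinomial Naive Bayes needs non-negative inputs, so rescale to [0, 1].
mm = MinMaxScaler().fit(X_train)
X_train_mm = mm.transform(X_train)
X_test_mm = np.clip(mm.transform(X_test), 0.0, 1.0)
mnb = MultinomialNB().fit(X_train_mm, y_train)
mnb_test_acc = mnb.score(X_test_mm, y_test)

print(gnb_train_acc, gnb_test_acc, mnb_test_acc)
```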

See [Naive Bayes and SVM Notebook](./NaiveBayesandSVM.ipynb)

## Support Vector Machines

#### The normalized data and the first two principal components were used to fit support vector machine models. Linear and Radial Basis Function (RBF) kernels were used, with grid search to determine the best value of C for the linear kernel and the best combination of C and gamma for the RBF kernel. The final model selected used the data transformed onto the first two principal components and an RBF kernel with C equal to 10 and gamma equal to 0.01. It resulted in 49.75% accuracy. C controls the trade-off between a wide margin and misclassified training points, while gamma controls how far the influence of a single training point reaches.
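
#### A minimal sketch of a grid search over C and gamma for an RBF kernel on the first two principal components. The synthetic data, the candidate values and the 5-fold cross-validation are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Normalize, project onto two principal components, then fit an RBF SVM.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), SVC(kernel="rbf"))

# Hypothetical grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

test_acc = search.score(X_test, y_test)
print(search.best_params_, test_acc)
```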

#### We tried an SVM model because of how poorly Naive Bayes performed. Instead of modeling the distribution of each class, we hypothesized that finding a curve that separates the classes would perform better. Unfortunately, this hypothesis did not hold, and we were not able to create a promising SVM model.

See [Naive Bayes and SVM Notebook](./NaiveBayesandSVM.ipynb)

# Conclusion

#### Finding an effective and efficient method to accurately predict the price of mobile phones was our objective. Throughout the analysis, we discovered several methods that proved ineffective at meeting our goal. 

#### The top-performing method was K-Nearest Neighbors. KNN returned a training score of 0.70 and a testing score of 0.64. KNN performed better on the normalized data than on the results from PCA. Although the testing score shows that some misclassification remains, KNN returned better results than the other methods used.

# Discussion 

#### Our data set contained 2000 observations. To further our analysis, we would like to obtain more data from multiple mobile phone resale stores. With newer phone data and more accurate pricing, possibly taking consumer demand into account as well, we hope to see more variance between the price range categories. This could also help clustering identify the time periods associated with each phone, which could be confirmed with a supplemental data set containing mobile phone feature release dates.
