# Project Description

## Context & Objective

The second-hand phone industry is poised for significant growth in the near to medium term, with the IDC (International Data Corporation) predicting a market worth a whopping \$52.7bn by 2023 and a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023.

There are clear advantages to buying a used cell phone:  
1. Significant cost savings, with warranties
2. A longer useful life for the cell phone, which is environmentally friendly

ReCell is a start-up that wants to take advantage of the potential boom in the industry and plans to use machine learning (ML) to understand the market dynamics.

The objective is to build a linear regression model that predicts the price of a used phone and to identify the factors that significantly influence that price.  
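
In general form, a model of this kind can be written as below. This is only a sketch of the setup: $x_1, \dots, x_p$ stand for whichever numeric or encoded predictors from the data description end up in the model (not decided at this point), $\beta_0, \dots, \beta_p$ are the coefficients to be estimated, and $\varepsilon$ is the error term.

$$
\text{used\_price} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon
$$

Identifying the factors that significantly influence the price then amounts to checking which $\beta_j$ are statistically distinguishable from zero.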


### Data Information:  
brand_name: Name of manufacturing brand  
os: OS on which the phone runs  
screen_size: Size of the screen in cm  
4g: Whether 4G is available or not  
5g: Whether 5G is available or not  
main_camera_mp: Resolution of the rear camera in megapixels  
selfie_camera_mp: Resolution of the front camera in megapixels  
int_memory: Amount of internal memory (ROM) in GB  
ram: Amount of RAM in GB  
battery: Energy capacity of the phone battery in mAh  
weight: Weight of the phone in grams  
release_year: Year when the phone model was released  
days_used: Number of days the used/refurbished phone has been used  
new_price: Price of a new phone of the same model in euros  
used_price: Price of the used/refurbished phone in euros  

### 1. Import Libraries

In [1]:
import pandas as pd 
import numpy as np 

#Visualization Libraries:
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns 

#statistical and regression Libraries 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split # Sklearn package's randomized data splitting function
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.stats.diagnostic as sms

pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 200)


import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read the data into the dataframe
df=pd.read_csv('used_phone_data.csv')
df.shape

(3571, 15)

The data set has 3,571 rows and 15 columns.
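
The 15 columns should correspond to the 15 fields listed under Data Information. A minimal sketch of how that could be checked is shown here; `EXPECTED_COLUMNS` and `check_columns` are hypothetical names introduced for illustration, and `demo` is a one-row stand-in (values copied from the first row of the data) rather than the real `df`.

```python
import pandas as pd

# Column names taken from the "Data Information" list.
EXPECTED_COLUMNS = [
    "brand_name", "os", "screen_size", "4g", "5g", "main_camera_mp",
    "selfie_camera_mp", "int_memory", "ram", "battery", "weight",
    "release_year", "days_used", "new_price", "used_price",
]


def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [c for c in expected if c not in frame.columns]
    unexpected = [c for c in frame.columns if c not in expected]
    return missing, unexpected


# One-row stand-in for the real data (hypothetical; not read from the CSV).
row = ["Honor", "Android", 23.97, "yes", "no", 13.0, 5.0, 64.0, 3.0,
       3020.0, 146.0, 2020, 127, 111.62, 86.96]
demo = pd.DataFrame([row], columns=EXPECTED_COLUMNS)

# Both lists are empty when the columns match the data description.
print(check_columns(demo))

# Dropping a column shows how a mismatch would be reported.
print(check_columns(demo.drop(columns=["ram"])))
```

With the real data, the same call would be `check_columns(df)`.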

In [3]:
# Taking a look at the first five rows
df.head()

Unnamed: 0,brand_name,os,screen_size,4g,5g,main_camera_mp,selfie_camera_mp,int_memory,ram,battery,weight,release_year,days_used,new_price,used_price
0,Honor,Android,23.97,yes,no,13.0,5.0,64.0,3.0,3020.0,146.0,2020,127,111.62,86.96
1,Honor,Android,28.1,yes,yes,13.0,16.0,128.0,8.0,4300.0,213.0,2020,325,249.39,161.49
2,Honor,Android,24.29,yes,yes,13.0,8.0,128.0,8.0,4200.0,213.0,2020,162,359.47,268.55
3,Honor,Android,26.04,yes,yes,13.0,8.0,64.0,6.0,7250.0,480.0,2020,345,278.93,180.23
4,Honor,Android,15.72,yes,no,13.0,8.0,64.0,3.0,5000.0,185.0,2020,293,140.87,103.8


In [5]:
# Sampling random data points to see if the observation holds true
np.random.seed(1)
df.sample(n=10) 

Unnamed: 0,brand_name,os,screen_size,4g,5g,main_camera_mp,selfie_camera_mp,int_memory,ram,battery,weight,release_year,days_used,new_price,used_price
2501,Samsung,Android,13.49,yes,no,13.0,13.0,32.0,4.0,3600.0,181.0,2017,683,198.68,79.47
2782,Sony,Android,13.81,yes,no,,8.0,32.0,4.0,3300.0,156.0,2019,195,198.15,149.1
605,Others,Android,12.7,yes,no,8.0,5.0,16.0,4.0,2400.0,137.0,2015,1048,161.47,48.39
2923,Vivo,Android,19.37,yes,no,13.0,16.0,64.0,4.0,3260.0,149.3,2019,375,211.88,138.31
941,Others,Others,5.72,no,no,0.3,0.3,32.0,0.25,820.0,90.0,2013,883,29.81,8.92
1833,LG,Android,13.49,no,no,8.0,1.3,32.0,4.0,3140.0,161.0,2013,670,240.54,96.18
671,Apple,iOS,14.92,yes,no,12.0,7.0,64.0,4.0,5493.0,48.0,2018,403,700.15,350.08
1796,LG,Android,17.78,yes,no,5.0,0.3,16.0,4.0,4000.0,294.8,2014,708,189.3,75.94
757,Asus,Android,13.49,yes,no,13.0,8.0,32.0,4.0,5000.0,181.0,2017,612,270.5,108.13
3528,Realme,Android,15.72,yes,no,,16.0,64.0,4.0,4035.0,184.0,2019,433,159.885,80.0


The columns describe the properties of the phones. The brand name, the operating system, and the 4G and 5G flags are of object data type, as expected; the remaining columns are numerical. There appear to be some NaN values in the main camera megapixel column. Looking ahead to the linear regression, some of the object variables are likely to be important in determining the price of used phones and should be converted to categorical variables: the OS and the connectivity flags, 4G and 5G.
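
The two observations above (missing values and object columns that should become categorical) can be sketched as follows. This is a minimal illustration, not the final preprocessing: `sample` is a small hand-built frame with four rows and a subset of columns copied from the random sample shown earlier, standing in for the full `df`.

```python
import numpy as np
import pandas as pd

# Four rows copied from the df.sample() output (subset of columns);
# NaN marks the missing main_camera_mp values seen there.
sample = pd.DataFrame(
    {
        "brand_name": ["Samsung", "Sony", "Others", "Realme"],
        "os": ["Android", "Android", "Others", "Android"],
        "4g": ["yes", "yes", "no", "yes"],
        "5g": ["no", "no", "no", "no"],
        "main_camera_mp": [13.0, np.nan, 0.3, np.nan],
        "used_price": [79.47, 149.1, 8.92, 80.0],
    }
)

# Count missing values per column.
missing = sample.isnull().sum()
print(missing)

# Convert the OS and connectivity columns to the pandas categorical dtype.
cat_cols = ["os", "4g", "5g"]
for col in cat_cols:
    sample[col] = sample[col].astype("category")

print(sample.dtypes)
```

On the full data, the same two steps would be applied to `df`; whether `brand_name` should also be treated as categorical is left open here.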
