### Dataset used: https://www.kaggle.com/iabhishekofficial/mobile-price-classification/data

A short description of the dataset is given in the README.

First, we'll import the necessary libraries:

In [8]:
%pylab inline
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn import model_selection
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

Populating the interactive namespace from numpy and matplotlib


Then we'll load the dataset and preview its first 5 rows:

In [5]:
train = pd.read_csv('train.csv')

print(train[0:5])

   battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  m_dep  \
0            842     0          2.2         0   1       0           7    0.6   
1           1021     1          0.5         1   0       1          53    0.7   
2            563     1          0.5         1   2       1          41    0.9   
3            615     1          2.5         0   0       0          10    0.8   
4           1821     1          1.2         0  13       1          44    0.6   

   mobile_wt  n_cores     ...       px_height  px_width   ram  sc_h  sc_w  \
0        188        2     ...              20       756  2549     9     7   
1        136        3     ...             905      1988  2631    17     3   
2        145        5     ...            1263      1716  2603    11     2   
3        131        6     ...            1216      1786  2769    16     8   
4        141        2     ...            1208      1212  1411     8     2   

   talk_time  three_g  touch_screen  wifi  price_range  
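
The `...` column in the preview appears because pandas abbreviates wide frames when printing. As a minimal sketch of how the full preview could be shown, the snippet below uses a small made-up frame (`demo` and its values are hypothetical stand-ins, not the contents of `train.csv`) and temporarily lifts the column limit with `pd.option_context`:

```python
import pandas as pd

# Hypothetical stand-in for train.csv: a few columns with made-up values.
demo = pd.DataFrame({
    "battery_power": [842, 1021, 563],
    "ram": [2549, 2631, 2603],
    "price_range": [1, 2, 2],
})

# head(n) returns the first n rows (5 by default).
preview = demo.head(2)

# Temporarily lift the column limit so wide frames are not abbreviated.
with pd.option_context("display.max_columns", None, "display.width", 200):
    print(preview)

# Basic sanity checks: dimensions and missing values per column.
print(demo.shape)
missing = demo.isna().sum()
print(missing)
```

The display options are restored automatically when the `with` block ends, so the rest of the notebook keeps its default output width.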

To simplify the task, we'll first reduce it to binary classification by mapping the classes [0, 1, 2, 3] to [0, 1] (0 and 1 become 0; 2 and 3 become 1), and then separate the target variable from the features:

In [6]:
y = train.price_range

# Class 0 stays 0; class 1 becomes 0; classes 2 and 3 become 1
y = y.replace({1: 0,
               2: 1,
               3: 1})

print(y[0: 10])

X = train.drop('price_range', axis=1)

print(X[0:5])

0    0
1    1
2    1
3    1
4    0
5    0
6    1
7    0
8    0
9    0
Name: price_range, dtype: int64
   battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  m_dep  \
0            842     0          2.2         0   1       0           7    0.6   
1           1021     1          0.5         1   0       1          53    0.7   
2            563     1          0.5         1   2       1          41    0.9   
3            615     1          2.5         0   0       0          10    0.8   
4           1821     1          1.2         0  13       1          44    0.6   

   mobile_wt  n_cores  pc  px_height  px_width   ram  sc_h  sc_w  talk_time  \
0        188        2   2         20       756  2549     9     7         19   
1        136        3   6        905      1988  2631    17     3          7   
2        145        5   6       1263      1716  2603    11     2          9   
3        131        6   9       1216      1786  2769    16     8         11   
4        141        2 
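
The same binarization can also be written as a threshold, which may be easier to read than a value-by-value mapping. The sketch below is only an illustration on made-up labels (`labels` is hypothetical, not taken from the dataset) and assumes the original classes are the integers 0 to 3:

```python
import pandas as pd

# Hypothetical labels covering all four original classes.
labels = pd.Series([1, 2, 2, 2, 1, 1, 3, 0, 0, 0], name="price_range")

# Classes 0 and 1 become 0 (lower half); classes 2 and 3 become 1 (upper half).
binary = (labels >= 2).astype(int)

# Equivalent explicit mapping; map() turns any unlisted value into NaN,
# which makes an unexpected label easy to spot.
mapped = labels.map({0: 0, 1: 0, 2: 1, 3: 1})

print(binary.tolist())
# Count how many samples fall into each of the two new classes.
counts = binary.value_counts().sort_index().to_dict()
print(counts)
```

Checking the class counts after the mapping is a quick way to confirm that the two new classes are roughly balanced.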

Now let's split the data into training and test sets in a 7:3 ratio:

In [7]:
# Draw a random seed and store it so the same split can be reused later in the session
divide_seed = np.random.randint(1, 100)

# Then split the data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=divide_seed)

# Check that the split sizes are as expected
print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))

1400
600
1400
600
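
A randomly drawn seed gives a different split on every run of the notebook. If a split that is repeatable across runs is wanted, one option is a fixed `random_state`, optionally combined with `stratify` so both parts keep the same class proportions. The sketch below uses synthetic data; `X_demo`, `y_demo`, and the value of `SEED` are assumptions made only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, balanced binary target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([0, 1] * 50, name="target")

SEED = 42  # arbitrary fixed value, chosen only for illustration

# stratify=y_demo keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=SEED, stratify=y_demo)

# Sizes of the two parts and the share of class 1 in each.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

With a fixed seed, rerunning the cell selects the same rows for the training part each time.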


We also need to normalize and scale the data, so we'll use StandardScaler:

In [9]:
scaler = StandardScaler()

# Fit the scaler on the training sample only, then normalize it
scaler.fit(X_train)
X_train = scaler.transform(X_train)

# Normalize the test sample with the statistics learned from the training sample
X_test = scaler.transform(X_test)
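
A common convention is to learn the scaling statistics (mean and standard deviation) from the training part only and reuse them for the test part, so that both parts are transformed in exactly the same way. The sketch below illustrates this on synthetic arrays; `train_part`, `test_part`, and `demo_scaler` are hypothetical names, not objects from this notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in splits with made-up values.
rng = np.random.default_rng(0)
train_part = rng.normal(loc=10.0, scale=3.0, size=(70, 2))
test_part = rng.normal(loc=10.0, scale=3.0, size=(30, 2))

demo_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training part.
train_scaled = demo_scaler.fit_transform(train_part)
# transform reuses those training statistics without refitting.
test_scaled = demo_scaler.transform(test_part)

# The training part has zero mean and unit variance per feature;
# the test part is only approximately standardized.
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(2))
```

The test part will usually not have exactly zero mean after the transform, which is expected: it is scaled with the training statistics rather than its own.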

Now our data is ready to be used for training and evaluating predictive models.