# Phone Dataset Analysis

## Import Libraries

The pandas library ("import pandas as pd") is used for reading, cleaning, and analyzing the data.

In [1]:
import pandas as pd

## Data Wrangling

### Gathering Data

Loading the dataset "phone_dataset_raw.csv" from the "Data" folder into "dataset_df".

In [2]:
dataset_df = pd.read_csv("../Data/phone_dataset_raw.csv")

Displaying the first five rows of "dataset_df".

In [3]:
dataset_df.head()

Unnamed: 0,id,brand,name,image,release_date,resolution,weight,os,chipset,memory,...,main_camera_1,main_camera_2,main_camera_3,selfie_camera,earphone_jack,battery,charging,colors,nfc,price
0,1,Oppo,Oppo Reno 11 Pro,https://image.oppo.com/content/dam/oppo/common...,18/01/2023,1080x2412,181.0,Android 14,Mediatek Dimensity 8200,512,...,50,32.0,8.0,32,True,4600,80.0,Pearl White; Rock Grey,True,8999000
1,2,Oppo,Oppo Reno 11,https://image.oppo.com/content/dam/oppo/common...,25/01/2024,1080x2412,182.0,Android 14,Mediatek Dimensity 7050,256,...,50,32.0,8.0,32,False,5000,67.0,Wave Green; Rock Grey,True,5999000
2,3,Oppo,Oppo Reno 11F,https://fdn2.gsmarena.com/vv/bigpic/oppo-reno1...,08/02/2024,1080x2412,177.0,Android 14,Mediatek Dimensity 7050,256,...,64,8.0,2.0,32,False,5000,67.0,Palm Green; Ocean Blue; Coral Purple,True,4499000
3,4,Oppo,Oppo Reno 10,https://image.oppo.com/content/dam/oppo/common...,15/07/2023,1080x2412,185.0,Android 13,Mediatek Dimensity 7050,256,...,64,32.0,8.0,32,False,5000,67.0,Silvery Grey; Ice Blue,True,5499000
4,5,Oppo,Oppo Reno 10 Pro,https://fdn2.gsmarena.com/vv/bigpic/oppo-reno1...,08/07/2023,1080x2412,185.0,Android 13,Qualcomm SM7325 Snapdragon 778G 5G,256,...,50,32.0,8.0,32,False,4600,80.0,Silvery Grey; Glossy Purple,True,8499000


### Assessing Data

Inspecting the data type and non-null count of each column in "dataset_df".

In [4]:
dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             96 non-null     int64  
 1   brand          96 non-null     object 
 2   name           96 non-null     object 
 3   image          96 non-null     object 
 4   release_date   96 non-null     object 
 5   resolution     96 non-null     object 
 6   weight         96 non-null     float64
 7   os             96 non-null     object 
 8   chipset        96 non-null     object 
 9   memory         96 non-null     int64  
 10  ram            96 non-null     int64  
 11  main_camera_1  96 non-null     int64  
 12  main_camera_2  94 non-null     float64
 13  main_camera_3  57 non-null     float64
 14  selfie_camera  96 non-null     int64  
 15  earphone_jack  96 non-null     bool   
 16  battery        96 non-null     int64  
 17  charging       96 non-null     float64
 18  colors      

Checking whether "dataset_df" contains duplicate rows.

In [5]:
print("The amount of duplications in the dataset: ", dataset_df.duplicated().sum())

The amount of duplications in the dataset:  0
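Because every row carries its own "id", a full-row comparison can report 0 even when the same phone is listed twice. The sketch below uses a hypothetical toy frame (`toy_df` is not part of the dataset) to show how the `subset` argument of `duplicated()` compares only the chosen columns. For the real data this makes no difference, since the summary statistics later in this notebook report 96 unique names for 96 rows.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 describe the same phone under different ids.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "brand": ["Oppo", "Oppo", "Vivo"],
    "name": ["Oppo Reno 11", "Oppo Reno 11", "Vivo V29"],
})

# Comparing whole rows: the differing ids hide the repeated phone.
full_row_dupes = toy_df.duplicated().sum()

# Comparing only brand and name: the second listing is flagged.
same_phone_dupes = toy_df.duplicated(subset=["brand", "name"]).sum()

print("Full-row duplicates:", full_row_dupes)
print("Duplicates by brand and name:", same_phone_dupes)
```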


Checking whether "dataset_df" contains missing values.

In [6]:
print("The amount of missing data in the dataset:\n", dataset_df.isna().sum())

The amount of missing data in the dataset:
 id                0
brand             0
name              0
image             0
release_date      0
resolution        0
weight            0
os                0
chipset           0
memory            0
ram               0
main_camera_1     0
main_camera_2     2
main_camera_3    39
selfie_camera     0
earphone_jack     0
battery           0
charging          0
colors            0
nfc               0
price             0
dtype: int64
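When only a few columns have gaps, it can help to list just those columns together with the share of rows affected. A minimal sketch on a hypothetical toy frame (the names `toy_df` and `missing_report` are illustrative, not from the notebook):

```python
import pandas as pd

# Hypothetical toy data mirroring the two camera columns that contain gaps.
toy_df = pd.DataFrame({
    "name": ["Phone A", "Phone B", "Phone C", "Phone D"],
    "main_camera_2": [8.0, None, 2.0, 12.0],
    "main_camera_3": [2.0, None, None, 5.0],
})

# Count of missing values per column, and the same as a percentage of rows.
missing_count = toy_df.isna().sum()
missing_share = toy_df.isna().mean() * 100

# Keep only the columns that actually have missing values.
missing_report = pd.DataFrame({"missing": missing_count, "percent": missing_share})
missing_report = missing_report[missing_report["missing"] > 0]
print(missing_report)
```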


Summary statistics of "dataset_df", used as a sanity check on the values in each column.

In [7]:
dataset_df.describe(include="all")

Unnamed: 0,id,brand,name,image,release_date,resolution,weight,os,chipset,memory,...,main_camera_1,main_camera_2,main_camera_3,selfie_camera,earphone_jack,battery,charging,colors,nfc,price
count,96.0,96,96,96,96,96,96.0,96,96,96.0,...,96.0,94.0,57.0,96.0,96,96.0,96.0,96,96,96.0
unique,,10,96,93,67,34,,14,50,,...,,,,,2,,,86,2,
top,,Oppo,Oppo Reno 11 Pro,https://southeast-asia.pro.infinixmobility.com...,27/10/2022,1080x2400,,Android 13,Mediatek Dimensity 7050,,...,,,,,False,,,Black; Green; White; Gold,True,
freq,,15,1,2,5,23,,35,5,,...,,,,,55,,,3,83,
mean,48.5,,,,,,197.385417,,,398.666667,...,61.604167,14.275957,9.732281,20.416667,,4839.84375,47.744792,,,8126707.0
std,27.856777,,,,,,39.334562,,,269.693419,...,40.812145,17.396596,12.779297,14.545512,,705.729504,27.237937,,,7054422.0
min,1.0,,,,,,144.0,,,128.0,...,8.0,0.08,0.08,4.0,,2018.0,10.0,,,1199000.0
25%,24.75,,,,,,185.0,,,256.0,...,50.0,2.0,2.0,8.0,,4775.0,25.0,,,2999000.0
50%,48.5,,,,,,190.0,,,256.0,...,50.0,8.0,5.0,13.0,,5000.0,44.0,,,4999000.0
75%,72.25,,,,,,199.25,,,512.0,...,64.0,12.0,12.0,32.0,,5000.0,67.0,,,11249000.0


### Cleaning Data

Although the check above found no duplicate rows, we drop duplicates from "dataset_df" as a precaution and then count the duplicates again.

In [8]:
dataset_df.drop_duplicates(inplace=True)
print("The amount of duplications in the dataset: ", dataset_df.duplicated().sum())

The amount of duplications in the dataset:  0


The missing-value check above found 41 missing values in "dataset_df": 2 in the "main_camera_2" column and 39 in the "main_camera_3" column. We fill these 41 values with 0, which treats a missing entry as a camera the phone does not have, and then count the missing values again.

In [9]:
dataset_df.fillna(value=0, inplace=True)
print("The amount of missing data in the dataset:\n", dataset_df.isna().sum())

The amount of missing data in the dataset:
 id               0
brand            0
name             0
image            0
release_date     0
resolution       0
weight           0
os               0
chipset          0
memory           0
ram              0
main_camera_1    0
main_camera_2    0
main_camera_3    0
selfie_camera    0
earphone_jack    0
battery          0
charging         0
colors           0
nfc              0
price            0
dtype: int64
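Calling `fillna(0)` on the whole frame would also put a 0 into any other column that happened to have gaps. A minimal sketch, on a hypothetical toy frame, of limiting the fill to the two camera columns (the names `toy_df`, `camera_cols`, and `filled_df` are illustrative):

```python
import pandas as pd

# Hypothetical toy data with gaps only in the secondary camera columns.
toy_df = pd.DataFrame({
    "name": ["Phone A", "Phone B", "Phone C"],
    "main_camera_1": [50, 13, 48],
    "main_camera_2": [8.0, None, 12.0],
    "main_camera_3": [2.0, None, None],
})

# Fill only the chosen columns; all other columns are left untouched.
camera_cols = ["main_camera_2", "main_camera_3"]
filled_df = toy_df.copy()
filled_df[camera_cols] = filled_df[camera_cols].fillna(0)

print(filled_df.isna().sum())
```

Working on a copy keeps the original frame available for comparison, which is a design choice of this sketch rather than something the notebook does.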


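Converting the "release_date" column to a datetime type, since it holds dates. The raw values are written day-first (for example "18/01/2023"), so the format is given explicitly to keep pandas from guessing the order of day and month.

In [10]:
dataset_df['release_date'] = pd.to_datetime(dataset_df['release_date'], format="%d/%m/%Y")
dataset_df.info()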

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             96 non-null     int64         
 1   brand          96 non-null     object        
 2   name           96 non-null     object        
 3   image          96 non-null     object        
 4   release_date   96 non-null     datetime64[ns]
 5   resolution     96 non-null     object        
 6   weight         96 non-null     float64       
 7   os             96 non-null     object        
 8   chipset        96 non-null     object        
 9   memory         96 non-null     int64         
 10  ram            96 non-null     int64         
 11  main_camera_1  96 non-null     int64         
 12  main_camera_2  96 non-null     float64       
 13  main_camera_3  96 non-null     float64       
 14  selfie_camera  96 non-null     int64         
 15  earphone_jack  96 non-nul
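The raw dates are written day-first, as in "18/01/2023" in the first rows shown earlier. The sketch below takes three of those values in a hypothetical `raw_dates` series and shows how an explicit `format` fixes the day and month order, while a bare call on an ambiguous string such as "08/02/2024" reads the month first.

```python
import pandas as pd

# Three release dates copied from the first rows of the dataset (day/month/year).
raw_dates = pd.Series(["18/01/2023", "08/02/2024", "08/07/2023"])

# An explicit format removes any guessing about day and month order.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2023-01-18', '2024-02-08', '2023-07-08']

# Without a format, an ambiguous string is read month-first by default.
ambiguous = pd.to_datetime("08/02/2024")
print(ambiguous.month)  # 8, i.e. August instead of February
```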



## Exploratory Data Analysis (EDA)

Drawing 5 random sample rows from "dataset_df".

In [11]:
dataset_df.sample(5)

Unnamed: 0,id,brand,name,image,release_date,resolution,weight,os,chipset,memory,...,main_camera_1,main_camera_2,main_camera_3,selfie_camera,earphone_jack,battery,charging,colors,nfc,price
86,87,Xiaomi,Redmi Note 13 Pro 4G,https://i02.appmifile.com/878_item_id/19/02/20...,2024-02-28,1080x2400,188.0,Android 13,Mediatek Helio G99 Ultra,512,...,200,8.0,2.0,16,True,5000,67.0,Midnight Black; Lavender Purple; Forest Green,True,3799000
19,20,Infinix,Infinix Smart 8,https://southeast-asia.pro.infinixmobility.com...,2023-11-09,720x1612,184.0,Android 13,Unisoc T606,128,...,13,0.08,0.0,8,True,5000,10.0,Timber Black; Shiny Gold; Crystal Green; Galax...,False,1399000
17,18,Asus,Asus ROG Phone 8 Pro,https://dlcdnwebimgs.asus.com/gain/BF826D12-86...,2024-01-18,1080x2400,225.0,Android 14,Qualcomm SM8650-AB Snapdragon 8 Gen 3,1024,...,50,32.0,13.0,32,True,5500,65.0,Phantom Black,True,19999000
31,32,Samsung,Samsung Galaxy S24+,https://images.samsung.com/is/image/samsung/p6...,2024-01-24,1440x3120,196.0,Android 14,Exynos 2400,512,...,50,10.0,12.0,12,False,4900,45.0,Onyx Black; Marble Grey; Cobalt Violet; Amber ...,True,18999000
67,68,Apple,iPhone 15,https://www.apple.com/v/iphone-15/c/images/ove...,2022-10-27,1179x2556,171.0,iOS 17,Apple A16 Bionic,512,...,48,12.0,0.0,12,False,3349,20.0,Black; Blue; Green; Yellow; Pink,True,14249000


Several columns are not needed for model training: "image", "release_date", "resolution", and "colors". We therefore create "new_dataset_df", a copy of "dataset_df" with these four columns dropped.

In [12]:
new_dataset_df = dataset_df.drop(['image', 'release_date', 'resolution', 'colors'], axis=1)
new_dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             96 non-null     int64  
 1   brand          96 non-null     object 
 2   name           96 non-null     object 
 3   weight         96 non-null     float64
 4   os             96 non-null     object 
 5   chipset        96 non-null     object 
 6   memory         96 non-null     int64  
 7   ram            96 non-null     int64  
 8   main_camera_1  96 non-null     int64  
 9   main_camera_2  96 non-null     float64
 10  main_camera_3  96 non-null     float64
 11  selfie_camera  96 non-null     int64  
 12  earphone_jack  96 non-null     bool   
 13  battery        96 non-null     int64  
 14  charging       96 non-null     float64
 15  nfc            96 non-null     bool   
 16  price          96 non-null     int64  
dtypes: bool(2), float64(4), int64(7), object(4)
memory usage
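`drop` returns a new frame and leaves the original untouched, which is why both frames remain available afterwards. A minimal sketch on a hypothetical toy frame, using the `columns=` keyword as an equivalent spelling of `axis=1` (the names `toy_df`, `unused_cols`, and `model_df` are illustrative):

```python
import pandas as pd

# Hypothetical toy data with two columns that a model would not use.
toy_df = pd.DataFrame({
    "name": ["Phone A", "Phone B"],
    "image": ["https://example.com/a.png", "https://example.com/b.png"],
    "colors": ["Black; White", "Blue"],
    "price": [4999000, 8999000],
})

# drop() without inplace=True returns a new frame; toy_df keeps all columns.
unused_cols = ["image", "colors"]
model_df = toy_df.drop(columns=unused_cols)

print(list(model_df.columns))
print(list(toy_df.columns))
```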

Summary statistics of "new_dataset_df", used as a sanity check on the values in each column.

In [13]:
new_dataset_df.describe(include="all")

Unnamed: 0,id,brand,name,weight,os,chipset,memory,ram,main_camera_1,main_camera_2,main_camera_3,selfie_camera,earphone_jack,battery,charging,nfc,price
count,96.0,96,96,96.0,96,96,96.0,96.0,96.0,96.0,96.0,96.0,96,96.0,96.0,96,96.0
unique,,10,96,,14,50,,,,,,,2,,,2,
top,,Oppo,Oppo Reno 11 Pro,,Android 13,Mediatek Dimensity 7050,,,,,,,False,,,True,
freq,,15,1,,35,5,,,,,,,55,,,83,
mean,48.5,,,197.385417,,,398.666667,9.5,61.604167,13.978542,5.778542,20.416667,,4839.84375,47.744792,,8126707.0
std,27.856777,,,39.334562,,,269.693419,3.565921,40.812145,17.334109,10.92495,14.545512,,705.729504,27.237937,,7054422.0
min,1.0,,,144.0,,,128.0,4.0,8.0,0.0,0.0,4.0,,2018.0,10.0,,1199000.0
25%,24.75,,,185.0,,,256.0,8.0,50.0,2.0,0.0,8.0,,4775.0,25.0,,2999000.0
50%,48.5,,,190.0,,,256.0,8.0,50.0,8.0,2.0,13.0,,5000.0,44.0,,4999000.0
75%,72.25,,,199.25,,,512.0,12.0,64.0,12.0,8.0,32.0,,5000.0,67.0,,11249000.0


Drawing 5 random sample rows from "new_dataset_df".

In [14]:
new_dataset_df.sample(5)

Unnamed: 0,id,brand,name,weight,os,chipset,memory,ram,main_camera_1,main_camera_2,main_camera_3,selfie_camera,earphone_jack,battery,charging,nfc,price
40,41,Vivo,Vivo V30e,188.0,Funtouch 14,Qualcomm SM6450 Snapdragon 6 Gen 1,256,8,50,8.0,0.0,32,False,5500,44.0,True,4999000
49,50,Vivo,Vivo V29,186.0,Funtouch 13,Qualcomm SM7325 Snapdragon 778G,512,12,50,8.0,2.0,50,False,4600,80.0,True,6999000
21,22,Infinix,Infinix GT 20 Pro,194.0,Android 14,Mediatek Dimensity 8200 Ultimate,256,12,108,2.0,2.0,32,False,5000,45.0,True,4399000
26,27,Infinix,Infinix Zero 30 5G,185.0,Android 13,Mediatek Dimensity 8020,256,12,108,13.0,2.0,50,False,5000,68.0,True,4999000
92,93,Poco,Poco F5,181.0,Android 13,Qualcomm SM7475-AB Snapdragon 7+ Gen 2,256,8,64,8.0,2.0,16,True,5000,67.0,True,4999000


Saving "dataset_df" to the CSV file 'phone_dataset_clean.csv' and "new_dataset_df" to the CSV file 'model_dataset.csv', both without the index.

In [15]:
dataset_df.to_csv('phone_dataset_clean.csv', index=False)
new_dataset_df.to_csv('model_dataset.csv', index=False)
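A CSV file stores no column types, so a datetime column comes back as plain text unless it is parsed again on load. The sketch below round-trips a hypothetical toy frame through an in-memory buffer instead of a file, so it touches nothing on disk. The names `toy_df`, `buffer`, `plain`, and `typed` are illustrative.

```python
import io

import pandas as pd

# Hypothetical toy data with a datetime column, like the cleaned dataset.
toy_df = pd.DataFrame({
    "name": ["Phone A", "Phone B"],
    "release_date": pd.to_datetime(["18/01/2023", "08/02/2024"], format="%d/%m/%Y"),
    "price": [4999000, 8999000],
})

# Write to an in-memory buffer; index=False keeps the row index out of the file.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# A plain read returns the dates as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime type for the named column.
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["release_date"])

print(plain.dtypes)
print(typed.dtypes)
```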