I work on a dataset that can be found at the following link:

https://www.kaggle.com/datasets/ahsan81/used-handheld-device-data?resource=download

This dataset contains the different attributes of used/refurbished phones and tablets.

I start by importing the libraries I need to work with the data.

In [1]:
import pandas as pd
import numpy as np

I load the dataset into a DataFrame.

In [2]:
used_device_df = pd.read_csv('used_device_data.csv')

## Data exploration

I print a summary of the DataFrame.

In [4]:
used_device_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   device_brand           3454 non-null   object 
 1   os                     3454 non-null   object 
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object 
 4   5g                     3454 non-null   object 
 5   rear_camera_mp         3275 non-null   float64
 6   front_camera_mp        3452 non-null   float64
 7   internal_memory        3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64  
 12  days_used              3454 non-null   int64  
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: f

The values are of three types: object, float64 and int64.

Six variables have missing values: rear_camera_mp, front_camera_mp, internal_memory, ram, battery and weight.
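
The missing values can also be counted directly instead of being read off the summary. The following is a minimal sketch, assuming a small toy frame `sample_df` (a hypothetical stand-in with made-up values) in place of `used_device_df`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for used_device_df (values are illustrative only).
sample_df = pd.DataFrame({
    "device_brand": ["Honor", "Asus", "Alcatel"],
    "rear_camera_mp": [13.0, np.nan, 13.0],
    "weight": [146.0, 190.0, np.nan],
})

# Count missing values per column, then keep only the columns that have any.
missing_counts = sample_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Applied to `used_device_df`, the same two lines should list the six columns whose non-null count is below 3454.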

In [6]:
len(used_device_df.index)

3454

There are 3454 rows, so the dataset records 3454 devices (phones and tablets).


In [7]:
len(used_device_df.columns)

15
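
The row and column counts can also be read in a single step from the `shape` attribute. A minimal sketch, assuming a toy frame `sample_df` (hypothetical, with made-up values) in place of `used_device_df`:

```python
import pandas as pd

# Toy frame standing in for used_device_df (values are illustrative only).
sample_df = pd.DataFrame({
    "os": ["Android", "iOS"],
    "ram": [3.0, 4.0],
    "days_used": [127, 325],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)
```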

There are 15 variables:
- **device_brand** is the name of manufacturing brand;
- **os** is the operating system on which the device runs;
- **screen_size** is the size of the screen in cm;
- **4g** is a string declaring whether 4G is available or not;
- **5g** is a string declaring whether 5G is available or not;
- **front_camera_mp** is the resolution of the front camera in megapixels;
- **rear_camera_mp** is the resolution of the rear camera in megapixels;
- **internal_memory** is the amount of internal memory (ROM) in GB;
- **ram** is the amount of RAM in GB;
- **battery** is the energy capacity of the device battery in mAh;
- **weight** is the weight of the device in grams;
- **release_year** is the year when the device model was released;
- **days_used** is the number of days the used/refurbished device has been used;
- **normalized_new_price** is the normalized price of a new device of the same model;
- **normalized_used_price** is the normalized price of the used/refurbished device.

In [9]:
used_device_df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3454.0 | 13.713115 | 3.80528 | 5.08 | 12.7 | 12.83 | 15.34 | 30.71 |
| rear_camera_mp | 3275.0 | 9.460208 | 4.815461 | 0.08 | 5.0 | 8.0 | 13.0 | 48.0 |
| front_camera_mp | 3452.0 | 6.554229 | 6.970372 | 0.0 | 2.0 | 5.0 | 8.0 | 32.0 |
| internal_memory | 3450.0 | 54.573099 | 84.972371 | 0.01 | 16.0 | 32.0 | 64.0 | 1024.0 |
| ram | 3450.0 | 4.036122 | 1.365105 | 0.02 | 4.0 | 4.0 | 4.0 | 12.0 |
| battery | 3448.0 | 3133.402697 | 1299.682844 | 500.0 | 2100.0 | 3000.0 | 4000.0 | 9720.0 |
| weight | 3447.0 | 182.751871 | 88.413228 | 69.0 | 142.0 | 160.0 | 185.0 | 855.0 |
| release_year | 3454.0 | 2015.965258 | 2.298455 | 2013.0 | 2014.0 | 2015.5 | 2018.0 | 2020.0 |
| days_used | 3454.0 | 674.869716 | 248.580166 | 91.0 | 533.5 | 690.5 | 868.75 | 1094.0 |
| normalized_used_price | 3454.0 | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.7557 | 6.619433 |


I get some descriptive statistics.

The first interesting thing these values show is that the devices were released between 2013 and 2020 and have been used for at most about three years (1094 days).
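
The conversion from days to years can be checked with a short calculation. This is a minimal sketch on a toy frame `sample_df` (a hypothetical stand-in for `used_device_df`): the extreme values are taken from the statistics above, and the middle row is made up.

```python
import pandas as pd

# Toy frame: the min/max values match the describe() output, the rest is illustrative.
sample_df = pd.DataFrame({
    "release_year": [2013, 2020, 2015],
    "days_used": [91, 1094, 690],
})

# Longest usage period expressed in years (365 days per year, an approximation).
max_years_used = sample_df["days_used"].max() / 365

# Earliest and latest release year.
year_range = (sample_df["release_year"].min(), sample_df["release_year"].max())

print(f"Released between {year_range[0]} and {year_range[1]}")
print(f"Longest usage: {max_years_used:.2f} years")
```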

Another interesting fact is that the prices look unusual at first glance because they have been normalized. There are usually two reasons for this choice:

1. The general range of prices changes over time
2. The price of a smartphone depends also on where you buy it

In [22]:
used_device_df.head()

| | device_brand | os | screen_size | 4g | 5g | rear_camera_mp | front_camera_mp | internal_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.5 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.307572 | 4.7151 |
| 1 | Honor | Android | 17.3 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.162097 | 5.519018 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.111084 | 5.884631 |
| 3 | Honor | Android | 25.5 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.135387 | 5.630961 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.389995 | 4.947837 |


In [21]:
used_device_df.tail()

| | device_brand | os | screen_size | 4g | 5g | rear_camera_mp | front_camera_mp | internal_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3449 | Asus | Android | 15.34 | yes | no | | 8.0 | 64.0 | 6.0 | 5000.0 | 190.0 | 2019 | 232 | 4.492337 | 6.483872 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.0 | 8.0 | 128.0 | 8.0 | 4000.0 | 200.0 | 2018 | 541 | 5.037732 | 6.251538 |
| 3451 | Alcatel | Android | 15.8 | yes | no | 13.0 | 5.0 | 32.0 | 3.0 | 4000.0 | 165.0 | 2020 | 201 | 4.35735 | 4.528829 |
| 3452 | Alcatel | Android | 15.8 | yes | no | 13.0 | 5.0 | 32.0 | 2.0 | 4000.0 | 160.0 | 2020 | 149 | 4.349762 | 4.624188 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.0 | 5.0 | 16.0 | 2.0 | 4000.0 | 168.0 | 2020 | 176 | 4.132122 | 4.279994 |


I look at the first 5 rows and the last 5 rows. From these it may seem that we are only working with Android devices, so I check whether other operating systems are present.

In [16]:
used_device_df['os'].unique()

array(['Android', 'Others', 'iOS', 'Windows'], dtype=object)

This confirms that the dataset also includes other operating systems.
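
Beyond listing the distinct values, `value_counts` shows how many devices run each operating system. This is a minimal sketch on a toy series `sample_os` (hypothetical, with made-up counts) in place of `used_device_df['os']`:

```python
import pandas as pd

# Toy series standing in for used_device_df['os'] (counts are illustrative only).
sample_os = pd.Series(
    ["Android", "Android", "iOS", "Others", "Android", "Windows"],
    name="os",
)

# Absolute number of devices per operating system, most frequent first.
os_counts = sample_os.value_counts()

# Same breakdown as a fraction of all devices.
os_share = sample_os.value_counts(normalize=True)

print(os_counts)
print(os_share.round(2))
```

On the full dataset this would show how strongly Android dominates, which the first and last rows only hint at.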