# Mobile Phones Market Data

<img src="https://storage.googleapis.com/kaggle-datasets-images/1137520/1908197/93084e667e82983099e2b7611faa9407/dataset-cover.png?t=2021-02-04-08-14-14" style="align:center">

<br>

# Introduction

This notebook uses the [Mobile Phones Market Data](https://www.kaggle.com/artempozdniakov/ukrainian-market-mobile-phones-data) dataset (*Data with prices and parameters of smartphones, which can be bought in Ukraine.*) published by [Artem Pozdniakov](https://www.kaggle.com/artempozdniakov).

The objective of this notebook is to accomplish the following tasks:
- **Predict Prices** 
- **Exploratory Data Analysis**

> *The dataset set contains data about the mobile phones which were released in past 4 years and which can be bought in Ukraine. Dataset contains the model name, brand name and operating system of the phone and it's popularity. It also has it's financial characteristics like lowest/highest/best price and sellers amount. And some of the characteristics like screen/battery size, memory amount and release date. This data can be useful for improving your machine learning, analysis and vizualization, missing data filling skills. I'm waiting for your notebooks! :) Good luck!* - **@artempozdniakov**

# Table of contents

- Imports
- Load the data
- Basic insights
- EDA | Exploratory Data Analysis
  1. Numerical variables
  2. Categorical variables
  3. Time-Series
- Modelling
  1. Pre-Processing
  2. Model
- Post-Modelling
  1. Feature importance
  2. Predictions analysis
- Conclusion

# Imports

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # plotting
import matplotlib.pyplot as plt # plot handling
import time # timer and stuff
import warnings # warning handling

# Kaggle file system setup
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
warnings.filterwarnings("ignore")

___

# Load the data

In [None]:
# Load the csv file
mobiles = pd.read_csv('../input/ukrainian-market-mobile-phones-data/phones_data.csv', index_col=0)

## Dimensions

In [None]:
print(f"There are {mobiles.shape[0]} rows and {mobiles.shape[1]} columns.")
print(f"There are {mobiles.isna().sum().sum()} missing values which represents {round((mobiles.isna().sum().sum() / (mobiles.shape[0] * mobiles.shape[1])) * 100, 2)}% of the data.")
print(f"Columns : {mobiles.columns.tolist()}")

___

# Basic insights

In [None]:
mobiles.head()

> First 5 rows of the DataFrame.

In [None]:
mobiles.describe()

> Some descriptive stats of the numerical variables.

In [None]:
mobiles.describe(include=['object'])

> Some descriptive stats of the categorical variables.

In [None]:
((mobiles.isna().sum()[mobiles.isna().sum()  > 0 ] / mobiles.shape[0] * 100).apply(lambda x: round(x, 1))).astype(str) + '%'

> This is the percentage of missing data by columns/variables.

## Data Cleaning

In [None]:
# Dropping the columns that I can't handle
mobiles_names = mobiles['model_name']
mobiles       = mobiles.drop(columns=['model_name'])

# Convert release_date to datetime type
mobiles['release_date'] = pd.to_datetime(mobiles['release_date'])

In [None]:
mobiles.head()

___

# EDA a.k.a Exploratory Data Analysis

In [None]:
# Extract columns that are neither object nor datetime
numericals   = mobiles.dtypes[(mobiles.dtypes!='O') & (mobiles.dtypes!='<M8[ns]')].index.tolist()

# Extract categorical variables which are objects here
categoricals = mobiles.dtypes[mobiles.dtypes == 'O'].index.tolist()

In [None]:
# Constants for the EDA plots
WIDTH  = 20
HEIGHT = 8

## 1. Numerical variables

### Functions

In [None]:
def plot_numerical(frame, column, categorical=None, ax=None, n_row=None, n_col=None):
    # Simple
    if categorical is None:
        sns.histplot(data=frame, x=column, ax=ax[n_row][n_col])
    
    # With category
    else:
        sns.histplot(data=frame, x=column, hue=categorical, ax=ax[n_row][n_col], legend=False)

### Analysis

#### Distribution

In [None]:
n_row = -1

# Setup a grid of (no. of rows, no. of plots on a row) with figure size
fig, ax = plt.subplots(len(numericals), 1 + len(categoricals), figsize=(WIDTH, HEIGHT * (2 + len(categoricals))))

# Plot the figure for numericals
for numerical in numericals:
    # Increment
    n_col = 0
    n_row += 1
    
    # Single distribution plotting
    plot_numerical(mobiles, numerical, categorical=None, ax=ax, n_row=n_row, n_col=n_col)
    n_col += 1
    
    # Distribution plotting by category
    for categorical in categoricals:
        plot_numerical(mobiles, numerical, categorical=categorical, ax=ax, n_row=n_row, n_col=n_col)
        n_col += 1

# Display the plot
plt.show()

### Plot pairing

#### By os

In [None]:
by_col = 'os'

sns.pairplot(mobiles[numericals + [by_col]], hue=by_col)
plt.show()

#### By brand

In [None]:
by_col = 'brand_name'

sns.pairplot(mobiles[numericals + [by_col]], hue=by_col)
plt.show()

## 2. Categorical variables

### Functions

In [None]:
def plot_categorical(frame, column):
    # Count plot
    sns.countplot(data=frame, x=column)

### Analysis

In [None]:
titles = ['Brand', 'Operating System']

for categorical, title in zip(categoricals, titles):
    plt.figure(figsize=(WIDTH, HEIGHT))
    plot_categorical(mobiles, categorical)
    plt.title(title)
    
    if title == 'Brand':
        plt.xticks(rotation=90)

    plt.xlabel('')
    plt.show()

## 3. Time-Series

### Date interval

In [None]:
earliest = mobiles['release_date'].min().strftime("%B %d, %Y")
latest   = mobiles['release_date'].max().strftime("%B %d, %Y")

print(f"Release dates are between {earliest} and {latest}")

### Prices evolution

> Plotting only by os because there are too many brand names

In [None]:
for price in [ f"{e}_price" for e in ['lowest', 'highest', 'best']]:
    plt.figure(figsize=(WIDTH, HEIGHT))
    sns.lineplot(data=mobiles, x='release_date', y=price, hue='os')
    plt.show()

> Plotting lowest_price, highest_price and best_price

## Miscellaneous

In [None]:
def print_phone(phone):
    space = 30
    print(f"{'Name'.rjust(space)} : {phone['model_name']}")
    print(f"{'Price'.rjust(space)} : [{phone['lowest_price']}; {phone['highest_price']}]")
    print(f"{'Popularity'.rjust(space)} : {phone['popularity']}")
    print(f"{'Brand (OS)'.rjust(space)} : {phone['brand_name']} ({phone['os']})")

### Top 5 priciest mobile phones

In [None]:
for i in range(5):
    phone = mobiles.nlargest(5, 'highest_price').join(mobiles_names[mobiles.nlargest(5, 'highest_price').index]).iloc[i]
    print_phone(phone)
    print()

### Top 5 cheapest mobile phones

In [None]:
for i in range(5):
    phone = mobiles.nsmallest(5, 'highest_price').join(mobiles_names[mobiles.nsmallest(5, 'highest_price').index]).iloc[i]
    print_phone(phone)
    print()

___

# Modelling

To accomplish the price prediction task, we choose the **best_price** variable as the target. Here are the details of the problem we are dealing with:

- Type of problem : **Regression**
- Type of variables : **Numerical, Categorical and Time-Series**

In [None]:
data = mobiles.copy()

features = [
    'brand_name',
    'os',
    'popularity',
    'sellers_amount',
    'screen_size',
    'memory_size',
    'battery_size',
]

nums  = ['popularity', 'sellers_amount', 'screen_size', 'memory_size', 'battery_size']
cats  = ['brand_name', 'os']

TARGET = 'best_price'

In [None]:
fig, ax= plt.subplots(1, 2, figsize=(WIDTH, HEIGHT/2))

sns.histplot(mobiles[TARGET], ax=ax[0])
ax[0].title.set_text(f'{TARGET} distribution')

sns.histplot(mobiles[TARGET].apply(np.log), ax=ax[1])
ax[1].title.set_text(f'Log scaled {TARGET} distribution')

plt.show()

> Because the target looks closer to *Gaussian* on a log scale, we predict the log of the target variable and then apply the exponential to recover prices on the original scale.

In [None]:
data[TARGET] = data[TARGET].apply(np.log)
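The log transform and its inverse can be sketched as a round trip. This is a minimal illustration on hypothetical prices (not taken from the dataset), and the "predictions" are simply the log prices themselves.

```python
import numpy as np

# Hypothetical prices, for illustration only
prices = np.array([1500.0, 4200.0, 12999.0, 38000.0])

# The model is trained on the log scale...
log_prices = np.log(prices)

# ...and predictions are mapped back with the inverse transform.
# Here the "predictions" are just the log prices themselves.
recovered = np.exp(log_prices)

print(np.allclose(recovered, prices))
```

One consequence of this choice is that errors measured on the log scale are multiplicative on the price scale: an error of 0.1 in log units corresponds to a factor of about exp(0.1) ≈ 1.105.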

## 1. Pre-Processing

### Date variable
The datetime type can be converted to a timestamp (float) for the modelling part: the bigger the timestamp, the later the phone was released.
This version of the notebook does not take the date into account, so the conversion below is left commented out.

In [None]:
# data['release_date'] = data['release_date'].apply( lambda x: x.timestamp())
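As a small sketch of what the commented-out conversion would do, the example below applies it to two hypothetical release dates (not taken from the dataset) and checks the ordering of the resulting timestamps.

```python
import pandas as pd

# Hypothetical release dates, for illustration only
dates = pd.to_datetime(pd.Series(['2017-03-01', '2020-10-15']))

# Seconds since the Unix epoch, as floats
timestamps = dates.apply(lambda d: d.timestamp())

# The more recent date yields the larger timestamp
print(timestamps.iloc[1] > timestamps.iloc[0])
```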

### Numerical variables - Scaling/Normalizing

In this part you can choose to either scale (standardize) or normalize your data. It may help some models get better results, though this is a hypothesis to check rather than a guarantee.

In [None]:
from sklearn.preprocessing import StandardScaler

# Init. the scaler
scaler = StandardScaler()

# Fit the scaler and scale in place
# (assigning the array directly keeps the original index of `data`)
data[nums] = scaler.fit_transform(data[nums])
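To make the difference between scaling and normalizing concrete, here is a minimal sketch on a toy frame with hypothetical values. It assumes `MinMaxScaler` as the normalizing option, which the notebook itself does not use.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy frame with hypothetical values; the NaN shows that missing values pass through
toy = pd.DataFrame({
    'screen_size':  [5.0, 6.1, 6.7, np.nan],
    'battery_size': [3000.0, 4000.0, 5000.0, 4500.0],
})

# Standardization: zero mean and unit variance per column
standardised = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# Min-max normalization: each column rescaled to the [0, 1] range
normalised = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

print(standardised.round(2))
print(normalised.round(2))
```

Both scalers ignore missing values when fitting and leave them missing in the output, so imputation can still happen afterwards.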

### Numerical & Categorical variables - Imputation
- Fill missing categorical values with `'Unknown'`.
- Impute the missing numerical values within each (`brand_name`, `os`) group, here with a forward fill followed by a backward fill.

In [None]:
data[cats] = data[cats].fillna('Unknown')

In [None]:
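# group_keys=False keeps the original row index, so the result can be assigned back with .loc
fill_data  = data.groupby(cats, sort=False, group_keys=False)[nums].apply(lambda x: x.ffill().bfill())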

data.loc[fill_data.index, nums] = fill_data

In [None]:
print(f"There are {data[nums].isna().sum().sum()} missing values which represents {round((data[nums].isna().sum().sum() / (data[nums].shape[0] * data[nums].shape[1])) * 100, 2)}% of the data.")

Some values are still missing because certain groups have no observed value to fill from, so we impute the rest with the global median.

In [None]:
data[nums] = data[nums].fillna(data[nums].median())

In [None]:
print(f"There are {data[nums].isna().sum().sum()} missing values which represents {round((data[nums].isna().sum().sum() / (data[nums].shape[0] * data[nums].shape[1])) * 100, 2)}% of the data.")
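The two-stage imputation above (a fill within each group, then a global fallback) can be sketched on toy data. This sketch uses a group median for the first stage, which is one possible choice and not necessarily what the cells above do. All values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'brand_name':   ['A', 'A', 'A', 'B', 'B'],
    'os':           ['Android'] * 3 + ['Unknown'] * 2,
    'battery_size': [3000.0, np.nan, 5000.0, np.nan, np.nan],
})

# Stage 1: fill with the median of each (brand_name, os) group
group_median = toy.groupby(['brand_name', 'os'])['battery_size'].transform('median')
toy['battery_size'] = toy['battery_size'].fillna(group_median)

# Brand B has no observed value at all, so its rows are still missing here
still_missing = int(toy['battery_size'].isna().sum())

# Stage 2: fall back to the global median
toy['battery_size'] = toy['battery_size'].fillna(toy['battery_size'].median())

print(still_missing, toy['battery_size'].tolist())
```

The group with no observed value is the reason a second, global pass is needed.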

### Categorical variables - One-hot encoding

In [None]:
# One-hot encoding
oh_cats = pd.get_dummies(data[cats])

# Concatenate the one-hot encoded categorical variables to the data frame
data = pd.concat([
    data.drop(columns=cats),
    oh_cats
], axis=1)

# Correct features
for cat in cats:
    if cat in features:
        features.remove(cat)
        
features = features + oh_cats.columns.tolist()
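To show what the encoding step produces, here is a minimal sketch of `pd.get_dummies` on a toy frame with hypothetical rows. Each category becomes its own indicator column, named `<column>_<value>`.

```python
import pandas as pd

# Toy frame with hypothetical rows
toy = pd.DataFrame({
    'brand_name': ['Apple', 'Samsung', 'Apple'],
    'os':         ['iOS', 'Android', 'iOS'],
})

# One indicator column per (column, category) pair
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```

This naming is why the feature list is rebuilt from `oh_cats.columns` after the original categorical columns are removed.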

## 2. Model

### Modelling function 

In [None]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error

SEED  = 42

def modelling(X, y, model, f_importance=False, fit=False):
    # Type of modelling : Train & Test basic splitting
    importance, tt_train_score, tt_test_score  = train_test_model(X, y, model, f_importance=f_importance)
    
    # Type of modelling : KFold Train & Test splitting
    kf_train_score, kf_test_score = kfold_model(X, y, model)
    
    if fit:
        model.fit(X, y)
        return model, tt_test_score, kf_test_score
    
    return (importance, tt_train_score, tt_test_score, kf_train_score, kf_test_score) if f_importance else (tt_train_score, tt_test_score, kf_train_score, kf_test_score)

def train_test_model(X, y, model, f_importance=True):
    
    importance = None
    
    # Train & test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=SEED)
    
    # Fitting
    model.fit(X_train, y_train)
    
    # Scores (reuse the predictions instead of predicting twice)
    train_pred = model.predict(X_train)
    test_pred  = model.predict(X_test)
    
    train_score = mean_squared_error(y_train, train_pred)
    test_score  = mean_squared_error(y_test, test_pred)
    
    # Feature importances (tree models) or coefficients (linear models)
    if f_importance:
        values = getattr(model, 'feature_importances_', None)
        if values is None:
            values = getattr(model, 'coef_', None)
        if values is not None:
            importance = pd.Series(index=X.columns.tolist(), data=values)
        
    # Importance (or None), MSE on train, MSE on test
    return importance, train_score, test_score

def kfold_model(X, y, model):
    # Parameters & variables
    K            = 5
    kf           = KFold(K)
    train_scores = list() 
    test_scores  = list() 
    
    # Looping over the folds
    for train_index, test_index in kf.split(X):
        
        # Define datasets
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        
        # Fitting
        model.fit(X_train, y_train)
        
        # Scores (reuse the predictions instead of predicting twice)
        train_pred = model.predict(X_train)
        test_pred  = model.predict(X_test)
        
        train_score = mean_squared_error(y_train, train_pred)
        test_score  = mean_squared_error(y_test, test_pred)
        
        # Increments
        train_scores.append(train_score)
        test_scores.append(test_score)
    
    kf_train_score = np.mean(train_scores)
    kf_test_score  = np.mean(test_scores)
    
    return kf_train_score, kf_test_score
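For comparison, the manual K-fold loop in `kfold_model` can be sketched with scikit-learn's `cross_val_score`. The example below uses synthetic data from `make_regression` as a stand-in for `X` and `y`, and `Ridge` as an arbitrary model. It only reproduces the test score, not the train score.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

kf = KFold(n_splits=5)

# scikit-learn maximises scores, so the MSE comes back negated
neg_mse = cross_val_score(Ridge(), X_toy, y_toy, cv=kf, scoring='neg_mean_squared_error')
cv_mse = -neg_mse.mean()

# Manual loop over the same folds, for comparison
manual = []
for train_idx, test_idx in kf.split(X_toy):
    fitted = Ridge().fit(X_toy[train_idx], y_toy[train_idx])
    manual.append(mean_squared_error(y_toy[test_idx], fitted.predict(X_toy[test_idx])))
manual_mse = np.mean(manual)

print(np.isclose(cv_mse, manual_mse))
```

The manual loop remains useful when both train and test scores are wanted per fold, as in the function above.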

In [None]:
# Classic linear regressor
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor

# Regressors with variable selection
from sklearn.linear_model import ElasticNet, Lars, Lasso, LassoLars

# Bayesian regressor
from sklearn.linear_model import ARDRegression, BayesianRidge

# XGBoost
from xgboost import XGBRegressor

models = [
    LinearRegression(),
    Ridge(),
    SGDRegressor(),
    ElasticNet(),
    Lars(),
    LassoLars(),
    ARDRegression(),
    BayesianRidge(),
    XGBRegressor()
]

### Training and evaluation

In [None]:
result_cols = ['name', 'basic_train', 'basic_test', 'kf_train', 'kf_test']
importances = dict()

# Rows of the model analysis table, collected as dicts so the scores stay numeric
rows = []

# Splitting X & y
X = data[features]
y = data[TARGET]

for model in models:
    print(f"{type(model).__name__.rjust(20)}...", end='')
    
    # Function for modelling
    importance, tt_train_score, tt_test_score, kf_train_score, kf_test_score = modelling(X, y, model, f_importance=True, fit=False)
    
    # Add data from modelling
    rows.append(dict(zip(result_cols, [
        type(model).__name__,
        tt_train_score,
        tt_test_score,
        kf_train_score,
        kf_test_score
    ])))
    
    # Add data for importance analysis
    importances[type(model).__name__] = importance
    print(" ended !")

# Model analysis DataFrame (DataFrame.append was removed in pandas 2.0)
model_analysis = pd.DataFrame(rows, columns=result_cols)

In [None]:
print('Ranking based on test MSE (log scale) : ')
print()
print(model_analysis[['name', 'basic_test', 'kf_test']].sort_values(by=['basic_test'], ascending=True))

In [None]:
print('Ranking based on cross validation test MSE (log scale) : ')
print()
print(model_analysis[['name', 'basic_test', 'kf_test']].sort_values(by=['kf_test'], ascending=True))

In [None]:
def plot_importance(series):
    # Sort values
    data = series.apply(np.abs).sort_values(ascending=False)
    
    # Plot
    plt.figure(figsize=(WIDTH, HEIGHT))
    data.plot(kind='bar')
    plt.title(data.name)
    plt.show()

_ = pd.DataFrame.from_dict(importances)[['BayesianRidge', 'SGDRegressor', 'XGBRegressor']].apply(lambda x: plot_importance(x), axis=0)

___

# Post-modelling analysis

## Interesting features

- `brand_name` : The brand name has a big influence in most of the models; it usually ranks at least third among the features. Most of the models also give the ***Apple*** brand a large importance. We suppose that ***Apple*** smartphones are the most expensive, so they are easy to identify.

- `screen_size` & `memory_size` : Setting the ***Apple*** brand aside, we suppose that these technical characteristics are the most important when evaluating the price of a smartphone.

## Predictions analysis

In [None]:
data[TARGET]

In [None]:
# Model init
model = XGBRegressor()

# Training
_ = model.fit(X, y)

In [None]:
for brand in mobiles['brand_name'].dropna().unique():
    # Create plot
    plt.figure(figsize=(WIDTH, HEIGHT))
    
    # Filter with a boolean mask (mobiles and X share the same index labels)
    mask = mobiles['brand_name'] == brand
    
    # Real data
    sns.lineplot(data=mobiles[mask], x='release_date', y=TARGET, label="Real data")
    
    # Predictions, converted back from the log scale
    sns.lineplot(x=mobiles.loc[mask, 'release_date'], y=np.exp(model.predict(X[mask])), label="Predictions")
    
    # Display
    plt.title(f"Temporal evolution of {brand} smartphone prices : Real data vs predictions")
    plt.show()

___

# Conclusion

It was a very interesting dataset to use! The main remaining challenge is to select the most useful features in order to predict future prices with the time-series variable (I don't know how to do this yet, I am working on it).

I hope you enjoyed it. Don't forget to upvote, thank you.