# AI - Final
## Regression

### AmirHossein Habibvand - 810196447

In this assignment we are going to predict product prices from ads. We have a dataset from `divar`, an advertisement
platform, which has the title, description, brand, and price of products. Here all of the products are cell phones.

There are 7 main steps in machine learning problems:
- Gathering data
- Preparing that data
- Choosing a model
- Training
- Evaluation
- Hyperparameter tuning
- Prediction

We will go through them.

### Gathering data
We have the dataset from divar, which can be downloaded from [here](https://research.cafebazaar.ir/visage/divar_datasets/).

In [4]:
from collections import Counter
from math import sqrt

import hazm
import matplotlib.pyplot as plt
import pandas as pd
from lightgbm import LGBMRegressor
from nltk.corpus import wordnet
from scipy.stats import randint as sp_randint
from scipy.stats import uniform
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (LabelEncoder, MinMaxScaler, OneHotEncoder,
                                   StandardScaler)
from sklearn.svm import SVR
from unidecode import unidecode

### Preparing that data

The data is in CSV format, so let's load it into a pandas dataframe.

In [5]:
df = pd.read_csv('./mobile_phone_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,brand,city,title,desc,image_count,created_at,price
0,0,Nokia::نوکیا,Qom,نوکیا6303,سلام.یه گوشیه6303سالم که فقط دوتا خط کوچیک رو ...,2,Wednesday 07AM,60000
1,1,Apple::اپل,Tehran,ایفون ٥اس٣٢گیگ,درحد نو سالم اصلى بدون ضربه مهلت تست میدم,0,Wednesday 11AM,1150000
2,2,Samsung::سامسونگ,Mashhad,سامسونگ j5,گوشى بسیار بسیار تمیز و فقط سه هفته کارکرده و ...,2,Wednesday 02PM,590000
3,3,Apple::اپل,Karaj,گرى 5s ایفون 32گیگ,گلس پشت و رو .کارت اپل ای دی. لوازم جانبی اصلی...,3,Wednesday 04PM,1100000
4,4,Samsung::سامسونگ,Tehran,galaxy S5 Gold در حد آک,کاملا تمیز و بدون حتی 1 خط و خش\nبه همراه گلاس...,2,Friday 01PM,900000


The data has 8 columns. Let's define some constants to use afterward; our target is price:

In [6]:
COLUMN_TARGET = 'price'
COLUMN_BRAND = 'brand'
COLUMN_CITY = 'city'
COLUMN_TITLE = 'title'
COLUMN_DESCRIPTION = 'desc'
COLUMN_IMAGE_COUNT = 'image_count'

Now let's check out the different columns:

In [13]:
df.isna().sum()

Unnamed: 0     0
brand          0
city           0
title          0
desc           0
image_count    0
created_at     0
price          0
dtype: int64

In [14]:
df[COLUMN_BRAND].describe()

count                59189
unique                   9
top       Samsung::سامسونگ
freq                 19760
Name: brand, dtype: object

In [15]:
df[COLUMN_CITY].describe()

count      59189
unique         9
top       Tehran
freq       21860
Name: city, dtype: object

In [16]:
df[COLUMN_IMAGE_COUNT].describe()

count    59189.000000
mean         1.642974
std          1.371340
min          0.000000
25%          0.000000
50%          2.000000
75%          3.000000
max         11.000000
Name: image_count, dtype: float64

In [17]:
df[COLUMN_TARGET].describe()

count    5.918900e+04
mean     6.202780e+05
std      5.616647e+05
min     -1.000000e+00
25%      2.000000e+05
50%      4.500000e+05
75%      9.000000e+05
max      2.800000e+06
Name: price, dtype: float64

We can see that every cell in our data is filled, and that we have ads from 9 different cities and 9 different brands. Products with a price of -1 do not have any price specified, so we filter them out. There are also some ads with very low prices, like 10 thousand tomans, which are probably advertiser mistakes, so we replace every price that is not above `INVALID_PRICE_THRESHOLD` with the mean price of its brand:

In [22]:
INVALID_PRICE_THRESHOLD = 50000
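# Drop ads without a price (-1); copy so the later column assignment is safe.
df = df[df[COLUMN_TARGET] > 0].copy()

# Select the price column before averaging so non-numeric columns are ignored.
mean_price_by_brand = df.groupby(COLUMN_BRAND)[COLUMN_TARGET].mean()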
df[COLUMN_TARGET] = df.apply(
    lambda row:
        row[COLUMN_TARGET] if row[COLUMN_TARGET] > INVALID_PRICE_THRESHOLD
        else mean_price_by_brand[row[COLUMN_BRAND]],
    axis=1,
)
target = df[COLUMN_TARGET]

df[COLUMN_TARGET].describe()

count    5.330100e+04
mean     7.037792e+05
std      5.422700e+05
min      5.010000e+04
25%      3.000000e+05
50%      5.500000e+05
75%      9.700000e+05
max      2.800000e+06
Name: price, dtype: float64

As we can see, we have one numerical column (image_count), 2 categorical columns (brand and city), and 2 textual columns (title and description). There is also a created_at column, which in my opinion will not give any information about the price. So we will build our features dataframe from the other columns.

As we know, the models can only process numbers, so we have to convert our data to numbers.
- For the numerical column, `MinMaxScaler` from sklearn is used.
- For the categorical columns, we can use `LabelEncoder` or `OneHotEncoder`. `OneHotEncoder` is used here.
- For the textual columns, the texts must be processed first. The Python library [`hazm`](https://github.com/sobhe/hazm) is used for processing Persian text: its normalizer, lemmatizer, and stopword list are used here. The part-of-speech tagger, whose tags are passed to the lemmatizer, needs hazm's trained model, which can be downloaded from [here](https://github.com/sobhe/hazm/releases/download/v0.5/resources-0.5.zip) and must be placed next to the code.
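
The first two conversions in the list above can be illustrated on toy values. This is a minimal pure-Python sketch of what min-max scaling and one-hot encoding compute. The helper names `min_max_scale` and `one_hot` and the sample values are illustrative assumptions, not part of the notebook, which uses the sklearn classes instead.

```python
# Toy values for illustration only; they are not taken from the dataset.
image_counts = [0, 2, 3, 11]
brands = ["Apple", "Nokia", "Samsung", "Apple"]


def min_max_scale(values):
    """Map values linearly onto [0, 1] (assumes at least two distinct values)."""
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


scaled_counts = min_max_scale(image_counts)
categories, encoded_brands = one_hot(brands)

print(scaled_counts)
print(categories)
print(encoded_brands)
```

Each brand becomes its own 0/1 column, so no artificial order is imposed on the brands, which is the usual reason to prefer one-hot encoding over integer labels for input features.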

In [24]:
normalizer = hazm.Normalizer(persian_numbers=False)
post_tagger = hazm.POSTagger(model='./resources-0.5/postagger.model')
lemmatizer = hazm.InformalLemmatizer()
stopwords = set(hazm.stopwords_list())  # a set makes membership tests fast


def clean_text(text):
    text = text.replace('\n', ' ').replace('/', ' ').replace('آ', 'ا').lower()
    result = []
    for sentence in hazm.sent_tokenize(normalizer.normalize(text)):
        tagged_sentence = [
            (word, tag)
            for (word, tag) in post_tagger.tag(hazm.word_tokenize(sentence))
            if len(word) > 1 and word not in stopwords
        ]

        for word, tag in tagged_sentence:
            if tag:
                word = lemmatizer.lemmatize(word, tag)
            else:
                word = lemmatizer.lemmatize(word)

            result.append(unidecode(word))

    return result

def vectorize_text_column(df, column_name):
    df[column_name] = df[column_name].apply(clean_text)
    word_counter = Counter()
    for _, words in df[column_name].items():
        word_counter.update(words)

    new_columns = []
    most_common_words = [word for word, _ in word_counter.most_common(200)]
    for word in most_common_words:
        new_column_name = f'has_{word}_in_{column_name}'
        new_columns.append(new_column_name)
        df[new_column_name] = df[column_name].apply(
            lambda words, word=word: int(word in words))

    return new_columns


The `clean_text` helper function takes a text as its only argument and returns a list of lemmatized words that are not in the stopword list. Single-character words are dropped as well. It uses the `hazm` part-of-speech tagger for better lemmatization. Finally, each word is transliterated to ASCII characters with `unidecode`.

Another helper function converts a textual column to a vector of numbers. It first uses `clean_text` to get the list of cleaned words of each text, and then finds the 200 most common words in the column with Python's built-in `Counter`. At most 200 columns are then added to the dataframe, each with a value of 1 if that word exists in the row's text and 0 otherwise. The names of the newly added columns are returned.
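
The encoding described above can be tried on a few already-cleaned token lists. This is a minimal sketch without pandas or hazm. The function name `binary_bag_of_words`, the toy documents, and the vocabulary size of 2 are illustrative assumptions.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one list per ad).
documents = [
    ["samsung", "galaxy", "gold"],
    ["iphone", "gold", "gold"],
    ["samsung", "j5"],
]


def binary_bag_of_words(docs, vocabulary_size):
    """Keep the most common words and mark their presence (1) or absence (0)."""
    counter = Counter()
    for words in docs:
        counter.update(words)  # counts every occurrence, including repeats

    vocabulary = [word for word, _ in counter.most_common(vocabulary_size)]
    rows = [[1 if word in words else 0 for word in vocabulary]
            for words in docs]
    return vocabulary, rows


vocabulary, rows = binary_bag_of_words(documents, vocabulary_size=2)
print(vocabulary)
print(rows)
```

Here "gold" occurs three times and "samsung" twice, so they form the vocabulary, and each ad is reduced to two indicator values.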


In [25]:
title_columns = vectorize_text_column(df, COLUMN_TITLE)
description_columns = vectorize_text_column(df, COLUMN_DESCRIPTION)

As said before, the city and brand columns are going to be one-hot encoded, so we collect their unique values to use as column titles. The Persian part is also removed from the brand name:

In [34]:
df[COLUMN_BRAND] = df[COLUMN_BRAND].apply(lambda brand: brand.split('::')[0])

BRANDS = sorted(df[COLUMN_BRAND].unique())
CITIES = sorted(df[COLUMN_CITY].unique())

Now we can build our features dataframe:

In [35]:
features_df = pd.DataFrame(
    data=ColumnTransformer(
        transformers=[
            (
                'cleaner',
                'drop',
                [
                    'Unnamed: 0',
                    'created_at',
                    COLUMN_IMAGE_COUNT,
                    COLUMN_TITLE,
                    COLUMN_DESCRIPTION,
                    COLUMN_TARGET,
                ]
            ),
            (
                'scaler',
                MinMaxScaler(),
                [
                    COLUMN_IMAGE_COUNT,
                ]
            ),
            (
                'encoder',
                OneHotEncoder(),
                [
                    COLUMN_BRAND,
                    COLUMN_CITY,
                ]
            ),
        ],
        remainder='passthrough',
        sparse_threshold=0,
    ).fit_transform(df),
    columns=[
        COLUMN_IMAGE_COUNT,
        *BRANDS,
        *CITIES,
        *title_columns,
        *description_columns,
    ],
).infer_objects()
features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53301 entries, 0 to 53300
Columns: 419 entries, image_count to has_wsh_in_desc
dtypes: float64(419)
memory usage: 170.4 MB


### Choosing a model, Training, Evaluation, Hyperparameter tuning
Now we have our features and can choose a model to train. There are different regressors available; I've tested three: `LinearRegression` and `Ridge` from `sklearn`, and `LGBMRegressor` from `lightgbm`. `Ridge` is just `LinearRegression` with L2 regularization, and `LGBMRegressor` is a gradient boosting model that uses tree-based learning algorithms.
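
To see what L2 regularization does, consider the simplest possible case: one feature and no intercept. This is a simplified sketch, not how `sklearn` fits `Ridge` (which also handles an intercept and many features). The function name `fit_slope` and the toy data are illustrative assumptions. Minimizing `sum((y - w*x)^2) + alpha * w^2` gives the closed form `w = sum(x*y) / (sum(x^2) + alpha)`.

```python
def fit_slope(xs, ys, alpha=0.0):
    """Closed-form weight for one feature without intercept.

    alpha=0 is ordinary least squares; alpha>0 adds the L2 penalty alpha * w**2.
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + alpha
    return numerator / denominator


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

ols_slope = fit_slope(xs, ys)                # no penalty
ridge_slope = fit_slope(xs, ys, alpha=14.0)  # strong penalty shrinks the weight

print(ols_slope, ridge_slope)
```

The penalty only appears in the denominator, so a larger `alpha` always pulls the weight toward zero; with a small `alpha` relative to `sum(x^2)`, the two models are nearly identical.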

Some helper functions are defined and then each model is trained and evaluated to find the best model. `train_test_split` from `sklearn` is used to split the dataset into train and test sets.

MAE (mean absolute error) and RMSE (root mean squared error) are used as metrics to evaluate the models; the lower these metrics are, the better the model is.
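
As a quick illustration of the two metrics, here is a minimal pure-Python sketch on made-up numbers. The toy values and function names are assumptions for illustration; the notebook itself relies on `sklearn.metrics`.

```python
from math import sqrt


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(y_true))


# Made-up prices and predictions; the errors are 10, 10, 0 and 40.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 300, 440]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, rmse)
```

Because the errors are squared before averaging, the single error of 40 weighs more heavily in RMSE than in MAE, so RMSE is never smaller than MAE.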

In [36]:

x_train, x_test, y_train, y_test = train_test_split(features_df, target)


def calculate_metrics(y_true, y_pred):
    return {
        'mae': metrics.mean_absolute_error(y_true, y_pred),
        'rmse': sqrt(metrics.mean_squared_error(y_true, y_pred)),
    }


def evaluate_model(Model, **kwargs):
    model = Model(**kwargs).fit(x_train, y_train)

    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    return Model.__name__, {
        'train': calculate_metrics(y_train, y_train_pred),
        'test': calculate_metrics(y_test, y_test_pred),
    }


def find_best_hyper_param_index(res):
    # Index of the result with the lowest test RMSE (the first one wins on ties).
    return min(range(len(res)), key=lambda i: res[i][1]['test']['rmse'])


In [30]:
evaluate_model(LinearRegression)

('LinearRegression',
 {'train': {'mae': 211132.58100846468, 'rmse': 297561.85882413964},
  'test': {'mae': 210419.85624346536, 'rmse': 296653.4701782954}})

In [31]:
ridge_res = [
    evaluate_model(Ridge, solver="sag", random_state=42, alpha=alpha)
    for alpha in [1, 2, 3, 3.5, 4, 4.5, 5, 6, 7]
]
ridge_res[find_best_hyper_param_index(ridge_res)]

('Ridge',
 {'train': {'mae': 211168.32239630638, 'rmse': 297564.5153490412},
  'test': {'mae': 210441.69538242024, 'rmse': 296660.1494826713}})

In [37]:
evaluate_model(LGBMRegressor, subsample=0.9)

('LGBMRegressor',
 {'train': {'mae': 186538.6156227561, 'rmse': 269330.4322410771},
  'test': {'mae': 193156.03616018448, 'rmse': 279520.24174821447}})

The hyperparameters of the LGBM regressor can also be tuned with a randomized search, as described in [this article](https://towardsdatascience.com/mercari-price-suggestion-97ff15840dbd):

In [38]:

params = {
    'learning_rate': uniform(0, 1),
    'n_estimators': sp_randint(200, 1500),
    'num_leaves': sp_randint(20, 200),
    'max_depth': sp_randint(2, 15),
    'min_child_weight': uniform(0, 2),
    'colsample_bytree': uniform(0, 1),
}
best_params = RandomizedSearchCV(
    estimator=LGBMRegressor(subsample=0.9), param_distributions=params, n_iter=10, cv=3, random_state=42,
    scoring='neg_root_mean_squared_error', verbose=10, return_train_score=True, n_jobs=-1
).fit(x_train, y_train).best_params_

print(best_params)
evaluate_model(LGBMRegressor, **best_params,
                     subsample=0.9, random_state=42, n_jobs=-1)


Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   14.0s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  19 out of  30 | elapsed:  1.7min remaining:   58.9s
[Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed:  1.8min remaining:   33.2s
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:  2.0min remaining:   13.3s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.1min finished


{'colsample_bytree': 0.44583275285359114, 'learning_rate': 0.09997491581800289, 'max_depth': 12, 'min_child_weight': 1.7323522915498704, 'n_estimators': 1323, 'num_leaves': 123}


('LGBMRegressor',
 {'train': {'mae': 110422.51712501375, 'rmse': 170358.44297517143},
  'test': {'mae': 189351.13891946105, 'rmse': 275360.2751925946}})
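
The idea behind the randomized search used above can be pictured with a small stand-alone sketch: draw a fixed number of random candidates, score each one, and keep the best. Everything here is an illustrative assumption: `toy_validation_error` is a made-up stand-in for a cross-validated RMSE, and the real search additionally trains and cross-validates a model for every candidate.

```python
import random


def toy_validation_error(learning_rate, max_depth):
    """Hypothetical stand-in for a validation error; lowest at (0.1, 8)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 8) ** 2


def random_search(n_iter, seed):
    """Sample n_iter random candidates and return the best one with its error."""
    rng = random.Random(seed)
    best_candidate, best_error = None, float("inf")
    for _ in range(n_iter):
        candidate = {
            "learning_rate": rng.uniform(0.0, 1.0),
            "max_depth": rng.randint(2, 14),  # both bounds are inclusive here
        }
        error = toy_validation_error(**candidate)
        if error < best_error:
            best_candidate, best_error = candidate, error
    return best_candidate, best_error


toy_best_params, toy_best_error = random_search(n_iter=10, seed=42)
print(toy_best_params, toy_best_error)
```

With only 10 candidates the search covers a small part of the space, so raising the number of iterations is a simple way to trade run time for a better chance of finding good values.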

### Prediction

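We can see that the best model is the LGBM regressor with tuned hyperparameters: its test RMSE is about 275,360, against about 279,520 for the untuned LGBM regressor and about 296,650 for the linear models. Its training RMSE (about 170,358) is much lower than its test RMSE, which suggests some overfitting. This model is used as our final model to predict the prices: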

In [40]:
model = LGBMRegressor(**best_params, subsample=0.9, random_state=42, n_jobs=-1)
model.fit(x_train, y_train)
model.predict(x_test)

array([ 431100.9737501 , 1288947.54960207,  499408.84773243, ...,
       1426617.59646782,  830701.96048179, 1929347.09398051])

Other text encoders, such as TF-IDF, could be used instead of the binary word indicators. We could also extract more features from the text, such as the memory size, or whether the cellphone is new or has any problems, to get better predictions.
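
Two of these ideas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a tested extension of the notebook: `tf_idf` uses the plain `tf * log(N / df)` weighting (libraries often use smoothed variants), and `extract_memory_gb` with its regular expression is a hypothetical helper that only recognizes a number followed by "gb", "gig", or the Persian "گیگ".

```python
import math
import re
from collections import Counter


def tf_idf(docs):
    """Plain TF-IDF: term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(word for words in docs for word in set(words))
    weights = []
    for words in docs:
        counts = Counter(words)
        weights.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical pattern: digits (Latin or Persian/Arabic) followed by a unit.
MEMORY_PATTERN = re.compile(r"(\d+)\s*(?:gb|gig|گ[یي]گ)", re.IGNORECASE)


def extract_memory_gb(text):
    """Return the first memory size found in the text, or None."""
    match = MEMORY_PATTERN.search(text)
    return int(match.group(1)) if match else None


# Toy token lists for illustration only.
toy_docs = [["iphone", "gold"], ["iphone", "silver"], ["samsung", "gold", "gold"]]
tfidf_weights = tf_idf(toy_docs)

print(tfidf_weights[1])
print(extract_memory_gb("iphone 5s 32 GB"))
```

Unlike the 0/1 indicators, TF-IDF gives rarer words a larger weight, and a numeric memory feature lets the model use a quantity that is otherwise hidden inside the title or description.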