# **Project Name**    -  **Mobile Price Range Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**

In the competitive mobile phone market, companies want to understand the sales data of mobile phones and the factors that drive their prices. The objective is to find the relationship between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. In this problem, we do not have to predict the actual price, only a price range that indicates how high the price is.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # `from numpy import math` fails on recent NumPy releases
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
from xgboost import XGBRFClassifier
from sklearn.tree import export_graphviz

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Loading Dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/arshadmujawar2408/Mobile-Price-Range-Prediction-ML/main/data_mobile_price_range.csv")

### Dataset First View

In [None]:
# Dataset First
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(dataset[dataset.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(dataset.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

As we have seen, the dataset consists of 2000 rows and 21 columns (20 features plus the target `price_range`). Fortunately, there are no null values or duplicate rows.
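The checks above can be bundled into one small helper. This is only a sketch: the function name `summarize_frame` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
    }

# Toy data standing in for the real dataset (made-up values)
toy = pd.DataFrame({"ram": [512, 1024, 1024, None],
                    "price_range": [0, 1, 1, 3]})
summary = summarize_frame(toy)
print(summary)
```

On the real data the same call would be `summarize_frame(dataset)`.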

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe

dataset.describe(include='all')

### Variables Description

* **battery_power:** Total energy a battery can store at one time, measured in mAh

* **blue:** Has Bluetooth or not

* **clock_speed:** Speed at which the microprocessor executes instructions

* **dual_sim:** Has dual SIM support or not

* **fc:** Front Camera mega pixels

* **four_g:** Has 4G or not

* **int_memory:** Internal Memory in Gigabytes

* **m_dep:** Mobile Depth in cm

* **mobile_wt:** Weight of mobile phone

* **n_cores:** Number of cores of processor

* **pc:** Primary Camera mega pixels

* **px_height:** Pixel Resolution Height

* **px_width:** Pixel Resolution Width

* **ram:** Random Access Memory in Mega Bytes

* **sc_h:** Screen Height of mobile in cm

* **sc_w:** Screen Width of mobile in cm

* **talk_time:** Longest time that a single battery charge will last during calls

* **three_g:** Has 3G or not

* **touch_screen:** Has touch screen or not

* **wifi:** Has wifi or not

* **price_range:** The target variable, with values 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost).
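For plots and reports it can help to show the target as readable labels instead of codes. The mapping below is a small sketch that uses the four labels from the description above; the names `PRICE_LABELS` and `label_price_range` are assumptions introduced here.

```python
import pandas as pd

# Labels taken from the variable description; the dict name is hypothetical
PRICE_LABELS = {0: "low cost", 1: "medium cost", 2: "high cost", 3: "very high cost"}

def label_price_range(codes: pd.Series) -> pd.Series:
    """Translate numeric price_range codes into their text labels."""
    return codes.map(PRICE_LABELS)

codes = pd.Series([0, 3, 1, 2])
labels = label_price_range(codes)
print(labels.tolist())
```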

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print(f"No. of unique values in {i} is {dataset[i].nunique()}.")

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Bar plot (Univariate)

In [None]:
# Chart - 1 visualization code
sns.set()
price_plot=dataset['price_range'].value_counts().plot(kind='bar')
plt.xlabel('price_range')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart compares a quantity across discrete categories. Here each bar shows the number of phones in one price range, which makes it easy to check whether the target classes are balanced.


##### 2. What is/are the insight(s) found from the chart?

There are mobile phones in 4 price ranges, and the number of phones in each range is almost the same.
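The balance seen in the bar chart can also be checked numerically. The sketch below uses a made-up target column; `class_shares` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def class_shares(target: pd.Series) -> pd.Series:
    """Fraction of rows in each class, ordered by class label."""
    return target.value_counts(normalize=True).sort_index()

# Toy target with two phones per price range (made-up values)
toy_target = pd.Series([0, 1, 2, 3, 0, 1, 2, 3])
shares = class_shares(toy_target)
print(shares)
```

With the real data, `class_shares(dataset['price_range'])` gives the share of each price range.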

#### Chart - 2 - Displot (Univariate)

In [None]:
# Chart - 2 visualization code
sns.set(rc={'figure.figsize':(5,5)})
ax=sns.displot(dataset["battery_power"])
plt.show()

##### 1. Why did you pick the specific chart?

A displot, short for distribution plot, shows the distribution of a univariate data set. By default `sns.displot` draws a histogram; passing `kde=True` overlays a kernel density estimate for a smoother view of the distribution.

##### 2. What is/are the insight(s) found from the chart?

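This plot shows how battery capacity (mAh) is spread across the dataset. Battery power increases gradually as the price range increases, though seeing that directly requires splitting the plot by `price_range`.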

#### Chart - 3 - Barplot(Bivariate)

In [None]:
# Chart - 3 visualization code
fig,ax=plt.subplots(figsize=(10,5))
sns.barplot(data=dataset,x='blue',y='price_range',ax=ax)

##### 1. Why did you pick the specific chart?

A bar chart compares a quantity across discrete categories. Here, `sns.barplot` shows the mean `price_range` for phones with and without Bluetooth.
  

##### 2. What is/are the insight(s) found from the chart?

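Roughly half of the devices have Bluetooth and half do not. The bar heights themselves show the mean `price_range` of each group, so the split is read from the value counts rather than from the bars.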

#### Chart - 4 - Scatter plot (Bivariate)

In [None]:
# Chart - 4 visualization code
dataset.plot(x='price_range',y='ram',kind='scatter')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.



##### 2. What is/are the insight(s) found from the chart?

RAM increases steadily with price range, moving from low cost to very high cost.
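One way to back up what the scatter plot suggests is to compare a summary statistic per price range. The snippet below is a sketch on made-up numbers; `median_by_class` is a hypothetical helper.

```python
import pandas as pd

def median_by_class(df: pd.DataFrame, feature: str, target: str = "price_range") -> pd.Series:
    """Median of one feature within each target class."""
    return df.groupby(target)[feature].median()

# Toy frame standing in for the real dataset (made-up values)
toy = pd.DataFrame({
    "ram": [500, 700, 1500, 1700, 2500, 2700, 3500, 3700],
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
})
medians = median_by_class(toy, "ram")
print(medians)
```

If the medians rise from class 0 to class 3 on the real data, that supports the trend seen in the chart.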

#### Chart - 5 - lmplot (Bivariate)

In [None]:
# Chart - 5 visualization code
sns.lmplot(x='ram', y='price_range', data=dataset, line_kws={'color': 'purple'})
plt.yticks([0, 1, 2, 3])
plt.xlabel('Ram')
plt.ylabel('Price Range')
plt.show()

##### 1. Why did you pick the specific chart?

 An lmplot is a type of plot commonly used in statistical data visualization that combines scatter plots and linear regression models. It is part of the seaborn library in Python.

An lmplot allows you to explore the relationship between two variables and fit a linear regression model to the data. It provides a visual representation of how the variables are related and whether there is a linear trend in the data.

##### 2. What is/are the insight(s) found from the chart?

The plot above shows the high correlation between RAM and price range. The general pattern is that as RAM increases, a mobile's price range increases.

#### Chart - 6 - Correlation heatmap (Multivariate)

In [None]:
# Chart - 6 visualization code
# seaborn and matplotlib are already imported at the top of the notebook
corr = dataset.corr()
fig = plt.figure(figsize=(5,5))
r = sns.heatmap(corr, cmap='Reds')
r.set_title("Correlation")
plt.show()

##### 1. Why did you pick the specific chart?

 A heatmap is a graphical representation of data where values are encoded as colors in a matrix. It is a popular visualization technique used to explore relationships and patterns in tabular data. Heatmaps are particularly useful for visualizing correlation matrices, displaying the magnitude of values across different categories or variables.

##### 2. What is/are the insight(s) found from the chart?

As we can see, the target `price_range` has a high positive correlation with `ram`. The following feature pairs are also positively correlated:

*   3G and 4G
*   pc (Primary Camera mega pixels) and fc (Front Camera mega pixels)
*   sc_w (Screen Width of mobile in cm) and sc_h (Screen Height of mobile in cm)

For example, as sc_w (screen width) increases, sc_h (screen height) tends to increase as well.
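Reading pairs off a heatmap by eye is error-prone, so the strongest pairs can also be listed programmatically. This is a minimal sketch on toy data; `top_correlated_pairs` is a hypothetical helper that ranks pairs by absolute Pearson correlation.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    cols = corr.columns
    # Indices of the upper triangle, excluding the diagonal,
    # so each pair appears once and self-correlations are skipped
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(corr.to_numpy()[i, j],
                      index=pd.MultiIndex.from_arrays([cols[i], cols[j]]))
    return pairs.sort_values(ascending=False).head(n)

# Toy data: b is exactly 2 * a, c is only loosely related (made-up values)
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],
                    "c": [5, 3, 4, 1, 2]})
pairs = top_correlated_pairs(toy)
print(pairs)
```

On the real data, `top_correlated_pairs(dataset)` would list the pairs to compare against the heatmap.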

#### Chart - 7 - Pie Chart (Univariate)

In [None]:
# Chart - 7 visualization code
# Reindex so the counts line up with the labels (1 = supported, 0 = not supported)
values4g = dataset['four_g'].value_counts().reindex([1, 0]).values
labels4g = ['4G-supported', 'Not supported']
fig1, ax1 = plt.subplots()
ax1.pie(values4g, labels=labels4g, autopct='%1.1f%%', shadow=True, startangle=90)
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes a percentage comparison easy to read from the area each slice covers. I used it to get the percentage split of the `four_g` feature.

##### 2. What is/are the insight(s) found from the chart?

About 52% of the mobile phones support 4G, while 48% do not.

#### Chart - 8 - KDE Plot and Box plot (Bivariate)

In [None]:
fig, axs = plt.subplots(1,2, figsize=(15,5))
sns.kdeplot(data=dataset, x='px_width', hue='price_range', ax=axs[0])
sns.boxplot(data=dataset, x='price_range', y='px_width', ax=axs[1])
plt.show()

##### 1. Why did you pick the specific chart?

A kdeplot, short for kernel density estimate plot, is a type of data visualization that represents the estimated probability density function (PDF) of a continuous variable. It is commonly used to visualize the distribution of a single variable or compare multiple distributions.

A boxplot, also known as a box-and-whisker plot, is a type of data visualization that provides a summary of the distribution of a continuous variable. It is particularly useful for comparing multiple distributions or identifying potential outliers in the data

##### 2. What is/are the insight(s) found from the chart?

Pixel width does not increase steadily as we move from low cost to very high cost. Mobiles in the 'Medium cost' and 'High cost' ranges have almost equal pixel width. Even so, pixel width appears to be one of the driving factors in deciding price_range.

## ***5. Feature Engineering & Data Pre-processing***

### 1. X and Y array

In [None]:
X=dataset.drop('price_range',axis=1)

In [None]:
y=dataset['price_range']

In [None]:
# describes info about X
X.shape

In [None]:
# describes info about y
y.shape

In [None]:
# Scaling values of X
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

### 2. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.20, random_state = 42)

# Shapes of the train and test sets
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
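In this notebook the scaler is fitted on the full feature matrix before the split. A common alternative is to split first and fit the scaler on the training rows only, so that no statistics from the test set reach the model. The sketch below shows that ordering on synthetic data; all `*_demo` names and the `stratify` argument are assumptions for illustration, not the notebook's own settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real features and target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = np.repeat([0, 1, 2, 3], 50)

# Split first; stratify keeps the four classes balanced in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Scaled test values can fall slightly outside [0, 1] with this approach, which is expected because the minimum and maximum come from the training rows.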

## ***7. ML Model Implementation***

### ML Model - 1 - **Implementing Logistic Regression**

In [None]:
# ML Model - 1 Implementation
clf = LogisticRegression()
# Fit the Algorithm
clf.fit(X_train, y_train)

In [None]:
# Checking the coefficients
clf.coef_

In [None]:
# Checking the intercept value
clf.intercept_

In [None]:
# Prediction
y_pred_test = clf.predict(X_test)
y_pred_train = clf.predict(X_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Evaluation metrics for test
from sklearn.metrics import classification_report
print('Classification report for Logistic Regression (Test set)= ')
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred_test))

In [None]:
# Evaluation metrics for train

from sklearn.metrics import classification_report
print('Classification report for Logistic Regression (Train set)= ')
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_train, y_pred_train))

In [None]:
# Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

# fmt='d' prints the counts as integers instead of scientific notation
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')

ax.set_title('Confusion Matrix: Logistic Regression (Test set)\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

# Tick labels follow the sorted class order 0-3
ax.xaxis.set_ticklabels([0, 1, 2, 3])
ax.yaxis.set_ticklabels([0, 1, 2, 3])

# Display the confusion matrix
plt.show()


I used the Logistic Regression algorithm to create the model, and it gave good results.

Both reports show the precision, recall, and F1-score for each class, as well as the overall accuracy and average metrics. The test set achieved an accuracy of 0.94, with similar precision, recall, and F1-score values for each class. The train set also achieved an accuracy of 0.94, with comparable precision, recall, and F1-score values for all classes.
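To make the link between the confusion matrix and the report explicit, the per-class metrics can be derived by hand. The sketch below uses a made-up 4x4 matrix (rows are actual classes, columns are predicted classes, as in scikit-learn); `per_class_metrics` is a hypothetical helper and the numbers are not results from this notebook.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and F1 per class, plus overall accuracy.

    Assumes every class is predicted at least once and occurs at least once,
    otherwise the divisions below would hit zero.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                    # correct predictions per class
    precision = tp / cm.sum(axis=0)     # column sums = predicted counts
    recall = tp / cm.sum(axis=1)        # row sums = actual counts
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Made-up confusion matrix for illustration only
cm_demo = [[9, 1, 0, 0],
           [1, 8, 1, 0],
           [0, 1, 8, 1],
           [0, 0, 1, 9]]
precision, recall, f1, accuracy = per_class_metrics(cm_demo)
print(precision, recall, f1, accuracy)
```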


### ML Model - 2 - **Implementing Random Forest Classifier**

In [None]:
# ML Model - 2 Implementation
# taking 300 trees
clsr = RandomForestClassifier(n_estimators=300)
clsr.fit(X_train, y_train)

In [None]:
y_pred = clsr.predict(X_test)
test_score= accuracy_score(y_test, y_pred)
test_score

In [None]:
y_pred_train = clsr.predict(X_train)
train_score = accuracy_score(y_train, y_pred_train)
train_score

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# classification report on test
print('Classification report for Random Forest (Test set)= ')
print(classification_report(y_test, y_pred))

In [None]:
# Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)

# fmt='d' prints the counts as integers instead of scientific notation
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')

ax.set_title('Confusion Matrix: Random Forest (Test set)\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

# Tick labels follow the sorted class order 0-3
ax.xaxis.set_ticklabels([0, 1, 2, 3])
ax.yaxis.set_ticklabels([0, 1, 2, 3])

# Display the confusion matrix
plt.show()

Then, I used the Random Forest algorithm to create the model. The result was not as good, so next I try to improve the score with hyperparameter tuning.

#### 2. Feature Importance

In [None]:
feature_importance = pd.DataFrame({'Feature':X.columns,
                                   'Score':clsr.feature_importances_}).sort_values(by='Score', ascending=False).reset_index(drop=True)
feature_importance.head()

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax = sns.barplot(x=feature_importance['Score'], y=feature_importance['Feature'])
plt.show()

#### 3. Hyperparameter Tuning

In [None]:
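from sklearn.model_selection import GridSearchCV
params = {'n_estimators': [10, 50, 100],
          'max_depth': [10, 20, 30, 40],
          'min_samples_split': [2, 4, 6],
          # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
          'max_features': ['sqrt', 4, 'log2'],
          'max_leaf_nodes': [10, 20, 30]
          }
rf = RandomForestClassifier()
clsr = GridSearchCV(rf, params, scoring='accuracy', cv=3)
# Fit on the training split only so the test set stays unseen during tuning
clsr.fit(X_train, y_train)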

In [None]:
clsr.best_params_

In [None]:
clsr.best_estimator_

In [None]:
clsr.best_score_

In [None]:
# Re-split from the unscaled features; tree-based models do not need scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Refit with the tuned values; every other argument keeps its scikit-learn default.
# max_features='sqrt' replaces the removed 'auto' option (same behaviour for classifiers).
clsr = RandomForestClassifier(max_depth=10, max_features='sqrt',
                              max_leaf_nodes=30, min_samples_split=2,
                              n_estimators=50)
clsr.fit(X_train, y_train)

In [None]:
y_pred = clsr.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
# Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)

# fmt='d' prints the counts as integers instead of scientific notation
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')

ax.set_title('Confusion Matrix: Tuned Random Forest (Test set)\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

# Tick labels follow the sorted class order 0-3
ax.xaxis.set_ticklabels([0, 1, 2, 3])
ax.yaxis.set_ticklabels([0, 1, 2, 3])

# Display the confusion matrix
plt.show()

In [None]:
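# Keep train predictions in their own variable so the test predictions are not overwritten
y_pred_train = clsr.predict(X_train)
accuracy_score(y_train, y_pred_train)

In [None]:
print(classification_report(y_train, y_pred_train))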

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which applies the grid search technique to find the hyperparameter values that maximize model performance.

The goal is to find the hyperparameter values that give the best predictions. One option is a manual, trial-and-error search, but that takes a long time even for a single model.

Methods such as random search and grid search were introduced for this reason. Grid search evaluates every combination of the specified hyperparameter values and selects the best-scoring one. The cost grows quickly with the number of hyperparameters and values, so it can be time-consuming and expensive.

GridSearchCV also cross-validates each combination on the training data, so every candidate is scored on several folds rather than on a single split.

That is why I used GridSearchCV for hyperparameter optimization.
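To make the cost of a grid concrete, the number of candidates can be counted before fitting. The sketch below is run on synthetic data with a deliberately small grid; the `*_demo` names and grid values are assumptions chosen for a quick run, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Small illustrative grid: 2 values x 3 values = 6 candidates
param_grid_demo = {"n_estimators": [10, 50], "max_depth": [3, 5, None]}
n_candidates = len(ParameterGrid(param_grid_demo))

# Synthetic 4-class problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=4, random_state=0
)

# With cv=3 the search trains 6 candidates x 3 folds = 18 models,
# plus one final refit of the best candidate on all the data passed to fit
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid_demo, scoring="accuracy", cv=3)
search.fit(X_demo, y_demo)

print(n_candidates, search.best_params_, round(search.best_score_, 3))
```

The same count for the Random Forest grid in this notebook shows why the search takes noticeably longer there.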

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

After hyperparameter tuning, the accuracy of the model increased to 92%.

### ML Model - 3 - **Implementing DecisionTree Classifier**

In [None]:
# ML Model - 3 Implementation
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth = 5)
dtc.fit(X_train, y_train)

In [None]:
# Prediction
y_pred_test = dtc.predict(X_test)
y_pred_train = dtc.predict(X_train)

In [None]:
accuracy_score(y_test, y_pred_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Evaluation metrics for test

print('Classification report for Decision Tree (Test set)= ')
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred_test))

Then, I used the Decision Tree algorithm to create the model. The result was not so good, so next I try to improve the score with hyperparameter tuning.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
grid = GridSearchCV(dtc, param_grid = {'max_depth': (5, 30), 'max_leaf_nodes': (10, 100)}, scoring = 'accuracy', cv = 5, verbose = 24)
grid.fit(X_train, y_train)

In [None]:
# Prediction
y_pred_test = grid.predict(X_test)
y_pred_train = grid.predict(X_train)

In [None]:
# Evaluation metrics for test
print('Classification Report for Decision Tree (Test set)= ')
print(classification_report(y_test, y_pred_test))

In [None]:
# Evaluation metrics for train
print('Classification Report for Decision Tree (Train set)= ')
print(classification_report(y_train, y_pred_train))

In [None]:
# Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

# fmt='d' prints the counts as integers instead of scientific notation
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')

ax.set_title('Confusion Matrix: Tuned Decision Tree (Test set)\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

# Tick labels follow the sorted class order 0-3
ax.xaxis.set_ticklabels([0, 1, 2, 3])
ax.yaxis.set_ticklabels([0, 1, 2, 3])

# Display the confusion matrix
plt.show()

### ML Model - 4 - **Implementing XGboost Classifier**

In [None]:
# ML Model - 4 Implementation
xgb = XGBClassifier(max_depth = 5, learning_rate = 0.1)
# Fit the Algorithm
xgb.fit(X_train, y_train)

In [None]:
# Prediction
y_pred_train = xgb.predict(X_train)
y_pred_test = xgb.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Evaluation metrics for test
score = classification_report(y_test, y_pred_test)
print('Classification Report for XGBoost(Test set)= ')
print(score)

In [None]:
# Evaluation metrics for train
score = classification_report(y_train, y_pred_train)
print('Classification Report for XGBoost(Train set)= ')
print(score)

Then, I used the XGBoost algorithm to create the model, and it gave good results. Next, I try to improve the score with hyperparameter tuning.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Cross validation
grid = GridSearchCV(xgb, param_grid={'n_estimators': (10, 200), 'learning_rate': [1, 0.5, 0.1, 0.01, 0.001], 'max_depth': (5, 10),
                                     'gamma': [1.5, 1.8], 'subsample': [0.3, 0.5, 0.8]}, cv = 5, scoring = 'accuracy', verbose = 10)
grid.fit(X_train,y_train)

In [None]:
# Predict with the tuned model (GridSearchCV refits the best estimator)
y_pred_test = grid.predict(X_test)

# Generate the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

# fmt='d' prints the counts as integers instead of scientific notation
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')

ax.set_title('Confusion Matrix: Tuned XGBoost (Test set)\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

# Tick labels follow the sorted class order 0-3
ax.xaxis.set_ticklabels([0, 1, 2, 3])
ax.yaxis.set_ticklabels([0, 1, 2, 3])

# Display the confusion matrix
plt.show()

In [None]:
score = classification_report(y_test, y_pred_test)
print('Classification Report for XGBoost(Test set)= ')
print(score)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I would go with both recall and precision; the F1 score combines the two into a single metric.


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I have chosen the Logistic Regression model because it has higher accuracy than the other models used here.
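One way to make such a comparison more robust is to score every candidate on the same cross-validation folds. The following is a sketch under stated assumptions: it uses synthetic data and only two of the models, and the names `candidates`, `mean_accuracy` and `best_name` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)

# Fixed folds so every model is scored on identical splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    # The pipeline refits the scaler inside each fold
    "logistic_regression": make_pipeline(MinMaxScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

mean_accuracy = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, best_name)
```

The ranking on synthetic data says nothing about the phone dataset; the point is only the comparison procedure.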

# **Data Preprocessing Blog**


https://medium.com/almabetter/data-preprocessing-ea09fac6a7f7

# **Conclusion**

Working individually, I followed a standard workflow for data analysis and model building. Here is a summary of the steps I took:

- Dataset Understanding and EDA: I began by gaining a basic understanding of the dataset, exploring its contents, and performing Exploratory Data Analysis (EDA). This step helped me uncover valuable insights and characteristics of the dataset.

- Feature Engineering: To prepare the data for modeling, I separated the features from the target and scaled the features with Min-Max scaling.

- Model Building: The core of my work was focused on building predictive models. I constructed four different models: Logistic Regression, Decision Tree, Random Forest, and XGBoost. Initially, I trained each model without any hyperparameter search.

- Hyperparameter Optimization: To improve performance, I tuned the Random Forest, Decision Tree, and XGBoost models with grid search and cross-validation (GridSearchCV).

- Model Selection: Based on the evaluation results, I selected Logistic regression as the preferred model due to its good performance on the dataset.
  
By following these steps, I was able to perform data analysis, build models, and select the best model.