# Pattern Discovery and Building Predictive Models
### PROJECT 2 - CITS 3401
#### Authors:


## <u>Introduction</u>

This project uses the mobile price classification dataset as its data source. The goal is to predict whether the price of a mobile phone is high or not.

<br />

### Tasks and Scope

#### 1) Data cleaning and analysis

  - Read through the table and the table column descriptions. Understand the meaning of each column in the table.
  - Distinguish the type of each attribute (e.g., nominal/categorical, numerical). You may need to discretise some attributes when completing Tasks 2, 3 or 4.
  - Determine whether an attribute is relevant to your target variable. You may remove attributes that are not helpful for Tasks 2, 3 or 4, and you might create separate data files for each of these tasks.
  - Identify inconsistent data and take actions using the knowledge you have learnt in this unit.

#### 2) Association rule mining
  - Select a subset of the attributes (or all the attributes) to mine interesting patterns. To rank the interestingness of the extracted rules, use support, confidence and lift.
  - Explain the top k rules (according to lift or confidence) that have "price_category" on the right-hand side, where k >= 1.
  - Explain the meaning of the k rules in plain English.
  - Given the rules, what recommendation would you give to a company that wants to design a high-price mobile phone (e.g., should the mobile phone be equipped with Bluetooth)?
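
As a reminder of how the three ranking measures relate, the following is a minimal sketch in plain Python. The transactions and item names (`ram=high`, `four_g=yes`, `price=high`) are invented for illustration and are not taken from the dataset; the project itself relies on `mlxtend` for the actual mining.

```python
# Toy transactions: each set holds hypothetical attribute=value items for one phone.
transactions = [
    {"ram=high", "four_g=yes", "price=high"},
    {"ram=high", "four_g=no", "price=high"},
    {"ram=low", "four_g=yes", "price=low"},
    {"ram=low", "four_g=no", "price=low"},
    {"ram=high", "four_g=yes", "price=low"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


lhs, rhs = {"ram=high"}, {"price=high"}
rule_support = support(lhs | rhs, transactions)
rule_confidence = confidence(lhs, rhs, transactions)
rule_lift = lift(lhs, rhs, transactions)
```

A lift above 1 means the left-hand side occurs together with the right-hand side more often than would be expected if the two were independent, which is why lift is a common ranking measure alongside confidence.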

#### 3) Classification
  - Use the "price_category" as the target variable and train two classifiers based on different machine learning algorithms (e.g. classifier 1 based on a decision tree; classifier 2 based on SVMs).
  - Evaluate the classifiers based on some evaluation metrics (e.g., accuracy). You may use 10-fold cross-validation for the evaluation.
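
To make the 10-fold procedure concrete, here is a small stdlib-only sketch of how the folds can be formed and the scores averaged. The `evaluate_fold` callback is a hypothetical placeholder for "train on these rows, score on those rows"; this version does not shuffle or stratify, which a real evaluation would normally do.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(evaluate_fold, n_samples, k=10):
    """Average the score returned by `evaluate_fold(train_idx, test_idx)` over k folds."""
    scores = [evaluate_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

In practice, scikit-learn's `cross_val_score` with `cv=10` performs the same loop for any estimator, so a hand-written splitter like this is mainly useful for understanding what the reported score means.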

#### 4) Clustering
  - Run a clustering algorithm of your choice and explain how the results can be interpreted with respect to the target variable.

#### 5) Data reduction
  - Perform numerosity reduction and perform attribute reduction.
  - Train the two classifiers in Task 3 on the reduced data.
  - Answer the question: "Does data reduction improve the quality of the classifiers"?

#### 6) Attribute selection
  - Select the top-10 most important attributes manually based on your understanding of the problem; select the top-10 most important attributes based on Information Gain.
  - Which attribute selection method is better and why?
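
For reference, Information Gain is the reduction in the entropy of the target after splitting on an attribute. The sketch below computes it for categorical attributes using only the standard library; the toy values (`ram_level`, `wifi`, `price`) are made up for illustration, and numeric attributes would need to be discretised first.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one attribute separates the classes, the other does not.
price = [1, 1, 0, 0]
ram_level = ["high", "high", "low", "low"]
wifi = ["yes", "no", "yes", "no"]

gain_ram = information_gain(ram_level, price)
gain_wifi = information_gain(wifi, price)
```

Ranking all attributes by this score and keeping the ten highest gives the Information Gain selection, which can then be compared with the manual top-10 list.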

<br />

### Tools, Libraries and Packages

- Python - Used throughout the project for data cleaning, data processing, and modelling.
- Weka - Used as a discrepancy check (to ensure we receive the same values).

#### Imports

In [2]:
pip install pandas-profiling[notebook] --user

Note: you may need to restart the kernel to use updated packages.


'C:\Users\Max' is not recognized as an internal or external command,
operable program or batch file.


In [5]:
pip install mlxtend --user

Note: you may need to restart the kernel to use updated packages.


'C:\Users\Max' is not recognized as an internal or external command,
operable program or batch file.


In [9]:
# Data analysis, manipulation, and profiling
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Association Rule Mining
from mlxtend.frequent_patterns import apriori, association_rules

# Training Setup
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Training Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Training Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Optimization
from sklearn.model_selection import GridSearchCV

<br />

## <u>Data Cleaning and Profiling</u>

There are many different ways to perform data cleaning and profiling. For this process (in Project 2) we will be using an IPython Notebook, for the following reasons:

- The analysts are proficient in Python;
- The report can be integrated with code for specific sections of the analysis;
- The processes / procedures are highly repeatable and easily automated using scripts;
- Data exploration and anomaly detection can easily be performed through a variety of visualizations (charts, graphs, tables, etc.).

The packages used are bundled with the Anaconda distribution, except for `pandas_profiling` (https://github.com/pandas-profiling/pandas-profiling), which is used for detailed exploratory analysis of the data.
Data profiling is important for measuring the quality of the data, which in turn greatly assists in identifying data anomalies/inconsistencies and, consequently, the necessary data transformations.
The data profiling reports will be generated with Pandas Profiling and SSIS. Portions of these reports will be referenced in our project discussion, with the data profiling reports attached as an Appendix.

<br />

In [12]:
raw_data = pd.read_csv("./data/raw/mobile_price.csv")
raw_metadata = pd.read_excel("./data/raw/ColumnDescription.xlsx", index_col="Column")

staging_data = raw_data.copy()
staging_data

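| Unnamed: 0 | id | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 842 | no | 2.2 | no | 1 | 0 | 7 | 0.6 | 188 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | no | 0 | yes | 0 |
| 1 | 1 | 1021 | yes | 0.5 | YES | 0 | 1 | 53 | 0.7 | 136 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | yes | 1 | no | 0 |
| 2 | 2 | 563 | yes | 0.5 | Yes | 2 | 1 | 41 | 0.9 | 145 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | yes | 1 | no | 0 |
| 3 | 3 | 615 | has | 2.5 | no | 0 | 0 | 10 | 0.8 | 131 | ... | 1216 | 1786 | 2769 | 16 | 8 | 11 | Yes | 0 | no | 0 |
| 4 | 4 | 1821 | yes | 1.2 | NO | 13 | 1 | 44 | 0.6 | 141 | ... | 1208 | 1212 | 1411 | 8 | 2 | 15 | Yes | 1 | no | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1995 | 794 | yes | 0.5 | yes | 0 | 1 | 2 | 0.8 | 106 | ... | 1222 | 1890 | 668 | 13 | 4 | 19 | yes | 1 | No | 0 |
| 1996 | 1996 | 1965 | yes | 2.6 | yes | 0 | 0 | 39 | 0.2 | 187 | ... | 915 | 1965 | 2032 | 11 | 10 | 16 | has | 1 | yes | 0 |
| 1997 | 1997 | 1911 | no | 0.9 | yes | 1 | 1 | 36 | 0.7 | 108 | ... | 868 | 1632 | 3057 | 9 | 1 | 5 | yes | 1 | no | 1 |
| 1998 | 1998 | 1512 | no | 0.9 | not | 4 | 1 | 46 | 0.1 | 145 | ... | 336 | 670 | 869 | 18 | 10 | 19 | Yes | 1 | yes | 0 |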
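
The preview shows the same yes/no flag spelled in several ways (`yes`, `YES`, `Yes`, `has`, `no`, `NO`, `No`, `not`). One possible way to bring these onto a single encoding is sketched below. Reading `has` as yes and `not` as no is an assumption made here and should be confirmed against the data source; `normalise_flag` is a hypothetical helper, not part of the original notebook.

```python
# Assumed mapping: "has" is read as yes and "not" as no; confirm before relying on it.
YES_VALUES = {"yes", "has", "1"}
NO_VALUES = {"no", "not", "0"}


def normalise_flag(value):
    """Map an inconsistently spelled yes/no value to 1 or 0; return None if unrecognised."""
    token = str(value).strip().lower()
    if token in YES_VALUES:
        return 1
    if token in NO_VALUES:
        return 0
    return None


# Spellings that appear in the preview above.
sample = ["no", "yes", "YES", "Yes", "has", "NO", "not", "No"]
cleaned = [normalise_flag(v) for v in sample]
```

Returning `None` for unrecognised values, rather than guessing, keeps unexpected spellings visible for manual review. With pandas, the helper could be applied to a column with something like `staging_data["blue"].map(normalise_flag)`.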


In [13]:
raw_metadata.to_dict()["Explaination"]

{'id': 'ID',
 'battery_power': 'Total energy a battery can store in one time measured in mAh',
 'blue': 'Has bluetooth or not',
 'clock_speed': 'speed at which microprocessor executes instructions',
 'dual_sim': 'Has dual sim support or not',
 'fc': 'Front Camera mega pixels ',
 'four_g': 'Has 4G or not. 1 = yes , 0 = no',
 'int_memory': 'internal Memory in Gigabytes',
 'm_dep': 'Mobile Depth in cm',
 'mobile_wt': 'Weight of mobile phone',
 'n_cores': 'Number of cores of processor',
 'pc': 'Primary Camera mega pixels',
 'px_height': 'Pixel Resolution Height',
 'px_width': 'Pixel Resolution Width',
 'ram': 'Random Access Memory in Mega Bytes',
 'sc_h': 'Screen Height of mobile in cm',
 'sc_w': 'Screen Width of mobile in cm',
 'talk_time': 'longest time that a single battery charge will last when you are',
 'three_g': 'Has 3G or not',
 'touch_screen': 'Has touch screen or not, 1 = yes, 0 = no',
 'wifi': 'Has wifi or not',
 'price_category': 'This is the target variable with indicating if


In [None]:
metadata_dict = raw_metadata.to_dict()["Explaination"]
raw_profile = ProfileReport(raw_data, explorative=True, orange_mode=True)

# Attach the column descriptions to the report as variable metadata
raw_profile.set_variable("variables.descriptions", metadata_dict)

raw_profile.to_file("./profile_reports/raw_data_profile.html")
raw_profile

### Initial (raw import data) Interpretation, and Cleaning Strategy

`id` -> The `id` field is unique, so it appears to be a good candidate for a primary key.
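
A primary-key candidate must be both unique and free of missing values. The check can be sketched in plain Python as follows; the helper names are hypothetical and the sample values are illustrative only.

```python
from collections import Counter


def is_candidate_key(values):
    """True if no value is missing (None) and no value is repeated."""
    return all(v is not None for v in values) and len(set(values)) == len(values)


def duplicate_keys(values):
    """Return the non-missing values that occur more than once, in sorted order."""
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)


ids_ok = [0, 1, 2, 3, 4]
ids_bad = [0, 1, 1, 3, 3]
```

On a pandas column, the `Series.is_unique` property together with a check for missing values gives the same answer.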