### Author: Vidhi Kokel
# Theoretical Questions
#### 🌈 Why is Data Mining a misnomer? What is another preferred name?
#### Data mining refers to extracting knowledge or patterns from large amounts of data, not to extracting (or mining) the data itself. Hence, the name is a misnomer. Other preferred names for data mining are:
1. Knowledge discovery (mining) in databases (KDD)
2. Knowledge extraction
3. Data/pattern analysis
4. Data archaeology
5. Data dredging
6. Information harvesting
7. Business intelligence

##### Source: https://www.geeksforgeeks.org/data-mining-process/

#### 🌈 What is the general knowledge discovery process? 
#### The knowledge discovery process (KDP), also called knowledge discovery in databases, seeks new knowledge in some application domain. It is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
##### Source: https://link.springer.com/chapter/10.1007/978-0-387-36795-8_2

#### 🌈 What is the difference between a data engineer and a data scientist/AI engineer?
1. The primary role of a data engineer is to design and develop a highly maintainable database management system. The primary role of a data scientist is to take the raw data from the database and use it to provide insights to improve the business.
2. A data engineer works mainly on the design and architecture of a database management system. In contrast, a data scientist works on applying analytical tools and modeling techniques to process the data.
3. A data engineer is responsible for transforming Big Data into a useful form for analysis. The data scientist does the actual analysis of this Big Data.
4. Data engineers serve as the wardens for the storage and movement of data. They design, build, test, and maintain databases to extract, load, and update data on database management systems. Data scientists retrieve this data using the queries specified by the data engineer. They then apply machine learning or deep learning models to the data to identify patterns and glean valuable insights.
#### Source: https://www.projectpro.io/article/data-engineer-vs-data-scientist-the-differences-you-must-know/430#:~:text=A%20data%20engineer%20works%20mainly,a%20useful%20form%20for%20analysis.

#### 🌈 In data mining, what is the difference between prediction and categorization?
#### Categorization/classification is the process of identifying which category a new observation belongs to, on the basis of a training data set containing observations whose category membership is known. Prediction is the process of estimating missing or unavailable numerical data for a new observation.
#### Source: https://www.differencebetween.com/difference-between-classification-and-vs-prediction/
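
The distinction can be illustrated with a small, dependency-free sketch. The toy data and the helper names `classify_1nn` and `predict_linear` are illustrative assumptions, not part of the cited source: the first returns a category from a fixed set, the second returns a number.

```python
# Toy training data (made up for illustration): product price -> known outputs.
prices = [2.0, 3.0, 10.0, 12.0]
labels = ["cheap", "cheap", "premium", "premium"]   # categorical target
shipping = [1.0, 1.5, 5.0, 6.0]                      # numeric target


def classify_1nn(x):
    """Classification: return the label of the closest training price."""
    nearest = min(range(len(prices)), key=lambda i: abs(prices[i] - x))
    return labels[nearest]


def predict_linear(x):
    """Prediction: fit y = a*x + b by least squares and return a number."""
    n = len(prices)
    mean_x = sum(prices) / n
    mean_y = sum(shipping) / n
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in zip(prices, shipping))
    sxx = sum((px - mean_x) ** 2 for px in prices)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a * x + b


print(classify_1nn(4.0))    # a category from a fixed set
print(predict_linear(4.0))  # a continuous value
```

The rating task later in this notebook treats the ratings 1 to 5 as categories, which is why classifiers are used there.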

#### 🌈 Why is data science/machine learning a bad idea in the context of information security?
#### Because adopting data science/machine learning practices introduces several security risks:
1. Adversarial samples - Slightly modifying an input, without any visible change, so that the model produces an unexpected outcome.
2. Backdoor attack - Introducing corrupt or bad samples into the training dataset, which can badly mislead the model.
3. Information leak - A model is evaluated solely on its testing performance metric before release/deployment, yet it may store more information than intended, and that information could be exposed to attackers.
4. Ethical issues - System designers choose the features, metrics, and analytics structures of the models that enable data mining. Thus, data-driven technologies, such as Artificial Intelligence, can replicate the preconceptions and biases of their designers.
5. Increased risk of data breach and fine
6. Increased uncertainty
7. Difficult to evaluate change in models
8. Difficult to verify against compliance
9. Responsibility, accountability, liability
10. Should be part of the risk management framework
#### Source: Course PPT Lecture-1
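
As a rough illustration of the first risk, the sketch below shows how a small change to every feature can flip the decision of a linear classifier. The weights, the input, and the class names are hypothetical values chosen for the example; real attacks on real models are more involved.

```python
# Hypothetical linear classifier: predicts "accept" when the score is positive.
weights = [0.9, -0.6, 0.4]
bias = -0.05


def score(x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(x):
    return "accept" if score(x) > 0 else "reject"


def perturb(x, eps):
    """Shift each feature by at most eps in the direction that lowers the score."""
    return [xi - eps * (1 if w > 0 else -1) for w, xi in zip(weights, x)]


original = [0.10, 0.05, 0.05]
adversarial = perturb(original, eps=0.03)

print(predict(original), score(original))
print(predict(adversarial), score(adversarial))
```

No feature moves by more than 0.03, yet the predicted class changes, because the shifts are aligned with the model's weights.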

#### 🌈 What is the CIA principle and how can we use it to assess the security/privacy aspect of AI systems/pipelines?
#### The CIA principle (also called the CIA triad) states that data should remain Confidential, keep its Integrity, and be Available.
#### To fight against confidentiality breaches, you can classify and label restricted data, enable access control policies, encrypt data, and use multi-factor authentication (MFA) systems. It is also advisable to ensure that everyone in the organization has the training and knowledge needed to recognize the dangers and avoid them.
#### To protect the integrity of your data, you can use hashing, encryption, digital certificates, or digital signatures. For websites, you can employ trustworthy certificate authorities (CAs) that verify the authenticity of your website so visitors know they are getting the site they intended to visit. 
#### Source: https://www.fortinet.com/resources/cyberglossary/cia-triad#:~:text=What%20is%20the%20Information%20Security,and%20methods%20for%20creating%20solutions.
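
For an AI pipeline, one way to apply the integrity idea is to record a hash of the training data when it is approved and to re-check it before each training run. The sketch below uses Python's standard `hashlib`; the byte strings and the helper names `sha256_digest` and `is_unmodified` are assumptions made for illustration.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, expected: str) -> bool:
    """True when the data still hashes to the digest recorded earlier."""
    return sha256_digest(data) == expected


# Hypothetical training file contents; in practice read the bytes from disk.
trusted = b"id,rating\n1,4\n2,5\n"
recorded_digest = sha256_digest(trusted)   # stored when the data was approved

tampered = b"id,rating\n1,4\n2,1\n"        # one label changed

print(is_unmodified(trusted, recorded_digest))
print(is_unmodified(tampered, recorded_digest))
```

A plain hash only detects changes; it does not prove who made the file, which is where the digital signatures mentioned above come in.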




# Wish.com Product Rating Prediction

In [1]:
#Imports the required libraries, models and metrics
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from pprint import pprint

In [2]:
# sample: returns a random sample of rows
# frac: optional parameter of sample that gives the fraction of the dataset to return
# e.g. df.sample(frac=0.5) returns a randomly selected 50% of the rows
# Here frac=1, so the whole dataset is returned, but with the rows shuffled
data = pd.read_csv('train_new.csv').sample(frac=1) #shuffle

# isin: Checks if the value of the selected column is from the values listed in the provided list.
# loc: Access a group of rows and/or columns by labels or a boolean array.
# Here we are selecting only those rows that have the ratings from 1 to 5 and excluding all others
data = data.loc[data['rating'].isin([1, 2, 3, 4, 5])]

# fillna: Fills the NA/NaN values with a specific method or value. Here we are filling those unknown values with 0
data = data.fillna(0)

# drop: Drops specified labels from rows or columns. Here we are dropping a few columns that are not important for model training
data = data.drop(['merchant_id', 'merchant_profile_picture', 'id', 'tags'], axis=1)

### Data Preprocessing

In [3]:
# Drop a few more columns that are unlikely to help predict the product rating (they are not informative or may hold duplicated values)
data = data.drop(['merchant_has_profile_picture', 'theme', 'crawl_month'], axis=1)

In [4]:
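# Create the absolute correlation matrix for the remaining numeric features
# (non-numeric columns are excluded explicitly, since recent pandas versions no longer drop them silently)
cor_matrix = data.select_dtypes(include='number').corr().abs()
# Select the upper triangle of the matrix, as the lower triangle is its mirror image
# (the builtin bool replaces np.bool, which was deprecated in NumPy 1.20)
upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
# Drop the columns whose correlation with an earlier column exceeds 0.85
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.85)]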
print(to_drop)
data = data.drop(to_drop, axis=1)

['rating_count', 'shipping_option_price']
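
To make the upper-triangle logic easier to follow, here is a dependency-free sketch of the same idea on made-up columns. The helper names `pearson` and `columns_to_drop` and the toy values are assumptions; the notebook itself relies on pandas and NumPy for this step.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def columns_to_drop(table, threshold=0.85):
    """Drop a column if it correlates strongly with any EARLIER column.

    Comparing only against earlier columns mirrors using the upper triangle
    (k=1) of the correlation matrix: each pair is checked once and the first
    column of a correlated pair is kept.
    """
    names = list(table)
    drop = []
    for j, later in enumerate(names):
        if any(abs(pearson(table[earlier], table[later])) > threshold
               for earlier in names[:j]):
            drop.append(later)
    return drop


# Made-up columns: "price_eur" is almost a rescaled copy of "price".
toy = {
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "price_eur": [0.9, 1.9, 2.7, 3.8, 4.4],
    "units_sold": [50.0, 10.0, 40.0, 20.0, 30.0],
}
print(columns_to_drop(toy))
```

Only the later column of a highly correlated pair is dropped, so one representative of each correlated group stays in the data.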




### Dataset Preparation

In [5]:
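# Separates the training dataset and validation dataset. Each row is assigned to training with probability 0.7, so training gets roughly 70% of the data and validation the remaining ~30%
msk = np.random.rand(len(data)) < 0.7
# .copy() makes tr and val independent DataFrames, so assigning to their columns later does not trigger pandas' SettingWithCopyWarning
tr = data[msk].copy()
val = data[~msk].copy()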
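
Because the mask draws one random number per row, the split sizes are only approximately 70/30 and change from run to run unless a seed is fixed. The following plain-Python sketch shows the mechanism; the seed value and the row count are arbitrary assumptions (in the notebook, calling `np.random.seed` before building the mask would play the same role).

```python
import random

rng = random.Random(42)          # fixed seed so the split is repeatable
rows = list(range(1000))         # stand-in for the DataFrame's rows

# One independent uniform draw per row; True means "goes to training".
mask = [rng.random() < 0.7 for _ in rows]

train = [r for r, keep in zip(rows, mask) if keep]
valid = [r for r, keep in zip(rows, mask) if not keep]

print(len(train), len(valid))    # close to 700/300, but usually not exact
```

Every row lands in exactly one of the two sets, so no row is lost or duplicated by the split.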

In [6]:
# Prints the training data
tr

Unnamed: 0,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,...,countries_shipped_to,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating
931,7.00,16,EUR,10000,0,4.0,0,0,0,0,...,45,50,0.0,0,CN,Shen Fashion Style,shenfashionstyle,"82 % avis positifs (1,702 notes)",1702,3.907168
630,4.94,5,EUR,10000,1,4.0,0,0,0,0,...,49,50,0.0,0,CN,Smart Home International Co.Ltd,smarthomeinternationalcoltd,"88 % avis positifs (55,670 notes)",55670,4.121484
962,8.00,68,EUR,8,0,5.0,0,0,0,0,...,43,50,0.0,0,CN,Lucky726,lucky726,80 % avis positifs (626 notes),626,3.916933
742,1.72,2,EUR,100,1,4.0,0,0,0,0,...,48,50,1.0,Quantité limitée !,CN,guangzhouweishiweifushiyouxiangongsi,广州唯适唯服饰有限公司,"83 % avis positifs (32,168 notes)",32168,3.884544
1040,12.00,11,EUR,20000,1,4.0,0,0,0,0,...,50,50,0.0,0,CN,redisland,redisland,"(59,903 notes)",59903,4.153665
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
848,11.00,38,EUR,5000,0,4.0,0,0,0,0,...,63,50,1.0,Quantité limitée !,CN,886806Blw,886806blw,"(30,225 notes)",30225,4.267891
245,5.83,8,EUR,5000,0,4.0,0,0,0,0,...,40,50,0.0,0,0,Maryswill,maryswill,"81 % avis positifs (91,631 notes)",91631,3.837937
470,2.81,3,EUR,100,1,4.0,0,0,0,0,...,35,50,0.0,0,CN,hellohorse,hellohorse,"(126,370 notes)",126370,4.146957
1035,2.69,3,EUR,100,0,5.0,0,0,0,0,...,43,50,0.0,0,CN,ZHOUkely,zhoukely,"91 % avis positifs (7,218 notes)",7218,4.247714


In [7]:
# Encodes the values of the categorical features of the training dataset
# dict_cat stores, per column name, the mapping from original value to integer code
dict_cat = {}

# columns that hold categorical (object-typed) values
cat_cols = tr.columns[tr.dtypes==object].to_list()

def cat_digit(col):
    # build the mapping: each distinct value gets an integer code
    encoded = col.astype('category').cat.codes
    # store the mapping so it can be reused on the validation and test sets
    dict_cat[col.name] = dict(zip(np.asarray(col), np.asarray(encoded)))
    return encoded

# for each categorical feature, apply cat_digit, which builds the mapping and transforms the data
# this is for the training set (where we build the mapping)
tr[cat_cols] = tr[cat_cols].apply(cat_digit)
tr



Unnamed: 0,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,...,countries_shipped_to,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating
931,7.00,16,0,10000,0,4.0,0,0,0,0,...,45,50,0.0,0,1,177,375,207,1702,3.907168
630,4.94,5,0,10000,1,4.0,0,0,0,0,...,49,50,0.0,0,1,183,398,479,55670,4.121484
962,8.00,68,0,8,0,5.0,0,0,0,0,...,43,50,0.0,0,1,124,281,180,626,3.916933
742,1.72,2,0,100,1,4.0,0,0,0,0,...,48,50,1.0,1,1,330,570,251,32168,3.884544
1040,12.00,11,0,20000,1,4.0,0,0,0,0,...,50,50,0.0,0,1,443,351,89,59903,4.153665
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
848,11.00,38,0,5000,0,4.0,0,0,0,0,...,63,50,1.0,1,1,4,5,60,30225,4.267891
245,5.83,8,0,5000,0,4.0,0,0,0,0,...,40,50,0.0,0,0,130,288,203,91631,3.837937
470,2.81,3,0,100,1,4.0,0,0,0,0,...,35,50,0.0,0,1,341,172,16,126370,4.146957
1035,2.69,3,0,100,0,5.0,0,0,0,0,...,43,50,0.0,0,1,238,551,581,7218,4.247714


In [8]:
# Lists all the categorical features from the training dataset
print('categorical features')
pprint(list(dict_cat.keys()))

categorical features
['currency_buyer',
 'product_color',
 'product_variation_size_id',
 'shipping_option_name',
 'urgency_text',
 'origin_country',
 'merchant_title',
 'merchant_name',
 'merchant_info_subtitle']


In [9]:
# Shows how the encoding has been done for the origin_country column
print('Lets see what the mapping for column origin_country :')
pprint(dict_cat['origin_country'])
print('It is a string to integer mapping')

Lets see what the mapping for column origin_country :
{0: 0, 'CN': 1, 'SG': 2, 'US': 3, 'VE': 4}
It is a string to integer mapping


In [10]:
# Then we use the mappings built from the training set to transform the validation set
val[cat_cols] = val[cat_cols].apply(lambda col: col.map(dict_cat[col.name]))
# Values that were not seen in the training set map to NaN; we replace them with -1
val = val.fillna(-1)
val



Unnamed: 0,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,...,countries_shipped_to,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating
534,5.00,5,0,1000,1,4.0,1,0,1,0,...,42,50,0.0,0,1.0,-1.0,-1.0,-1.0,13909,4.356819
816,5.89,5,0,100,1,4.0,0,0,0,0,...,30,50,1.0,1,1.0,-1.0,-1.0,-1.0,150,3.993333
408,8.00,7,0,10,1,5.0,0,0,0,0,...,33,50,0.0,0,1.0,-1.0,-1.0,-1.0,173992,4.326463
135,12.00,11,0,100,0,4.0,0,0,0,0,...,42,50,0.0,0,1.0,404.0,279.0,-1.0,1452,3.735537
524,12.00,11,0,1000,0,4.0,0,0,0,0,...,13,50,1.0,1,1.0,-1.0,-1.0,-1.0,2248,4.150356
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
779,9.00,9,0,1000,0,4.0,0,0,0,0,...,60,50,0.0,0,3.0,-1.0,-1.0,-1.0,4298,4.229409
232,8.00,47,0,100,1,5.0,1,0,1,0,...,43,50,0.0,0,1.0,-1.0,-1.0,-1.0,1032,4.577519
833,4.67,4,0,100,1,3.0,0,0,0,0,...,25,50,1.0,1,1.0,433.0,336.0,430.0,7497,4.079365
131,8.00,7,0,1000,1,3.0,0,0,0,0,...,25,50,1.0,1,1.0,-1.0,-1.0,-1.0,6623,4.072475
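
The fit-on-training, apply-to-validation pattern used above can be summarised in a few lines of plain Python. This is a simplified sketch: the helper names `build_mapping` and `encode` and the sample values are assumptions, and it assigns codes in sorted order, which is what pandas category codes typically do for sortable values.

```python
def build_mapping(values):
    """Assign integer codes to the sorted unique values seen in training."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, mapping, unseen=-1):
    """Replace each value by its code; values absent from the mapping get `unseen`."""
    return [mapping.get(value, unseen) for value in values]


train_countries = ["CN", "US", "CN", "SG"]      # made-up training column
valid_countries = ["US", "VE", "CN"]            # "VE" never appears in training

mapping = build_mapping(train_countries)
print(mapping)
print(encode(train_countries, mapping))
print(encode(valid_countries, mapping))
```

Columns with many distinct values, such as merchant names, will contain many unseen values in the validation set, which is why most of those cells show -1.0 in the table above.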


In [11]:
# Considers the rating column as the training and validating labels respectively
tr_y = tr['rating']
val_y = val['rating']
# Considers all other columns except rating as features for training the model and validating the trained model
tr_x = tr.drop('rating', axis=1)
val_x = val.drop('rating', axis=1)

### Model Training

#### Logistic Regression (Provided Sample)

In [12]:
# Trains the model using training features and labels
clf_regression = LogisticRegression().fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_regression = clf_regression.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_regression = f1_score(val_y, pred_val_regression, average='micro')
print(val_score_regression)

0.7483221476510066


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
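
When reading these scores, it helps to know that for a single-label multiclass problem like this one, the micro-averaged F1 score equals plain accuracy. The sketch below computes both by hand on made-up ratings; the function names are illustrative and the notebook itself uses `sklearn.metrics.f1_score`.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [4, 4, 5, 3, 4, 5]      # made-up ratings
y_pred = [4, 4, 4, 4, 4, 5]      # a model that mostly predicts one class

print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))
```

Each wrong prediction counts once as a false positive and once as a false negative, which is why the two numbers coincide. If one rating dominates the data, a model that mostly predicts that rating can still score well on this metric.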


#### Decision Tree

In [13]:
# Trains the model using training features and labels
from sklearn.tree import DecisionTreeClassifier
clf_decision_tree = DecisionTreeClassifier().fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_decision_tree = clf_decision_tree.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_decision_tree = f1_score(val_y, pred_val_decision_tree, average='micro')
print(val_score_decision_tree)

0.4563758389261745


In [14]:
# Modifying the default parameters of Decision Tree
# max_features="sqrt" makes the tree consider only a random subset of about sqrt(n_features) features at each split; no feature is removed outright, but the extra randomness can reduce overfitting
clf_decision_tree_tuned = DecisionTreeClassifier(max_features="sqrt").fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_decision_tree_tuned = clf_decision_tree_tuned.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_decision_tree_tuned = f1_score(val_y, pred_val_decision_tree_tuned, average='micro')
print(val_score_decision_tree_tuned)

0.5503355704697986
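
As a rough picture of what a square-root feature limit means, the sketch below draws the candidate features for a few splits. It is a simplified illustration, not scikit-learn's implementation; the helper name `candidate_features`, the seed, and the feature count are assumptions.

```python
import math
import random


def candidate_features(n_features, rng):
    """Pick the feature indices examined at one split under a "sqrt" rule."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
n_features = 36                       # hypothetical feature count

# Each split draws its own subset, so no feature is removed from the model.
for split in range(3):
    print(split, candidate_features(n_features, rng))
```

Since both the validation split and the feature subsets are random, the score of the tuned tree can vary noticeably between runs.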


#### SVM

In [15]:
# Trains the model using training features and labels
from sklearn.svm import SVC
clf_svm = SVC().fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_svm = clf_svm.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_svm = f1_score(val_y, pred_val_svm, average='micro')
print(val_score_svm)

0.7449664429530202


In [16]:
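# Modifying the default parameters of SVM
# Setting a small gamma: for the RBF kernel, a smaller gamma corresponds to a wider kernel (larger variance), so each training point influences a larger region and the decision boundary is smoother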
clf_svm_tuned = SVC(gamma=0.001).fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_svm_tuned = clf_svm_tuned.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_svm_tuned = f1_score(val_y, pred_val_svm_tuned, average='micro')
print(val_score_svm_tuned)

0.7919463087248322
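
The effect of gamma can be seen directly in the RBF kernel formula, `exp(-gamma * ||x - y||^2)`. The points below are made up; they stand in for unscaled features, where distances between rows are large.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


a = [0.0, 0.0]
b = [30.0, 40.0]        # made-up points, squared distance 2500

for gamma in (1.0, 0.01, 0.001):
    print(gamma, rbf_kernel(a, b, gamma))
```

With a large gamma the similarity between distant points collapses to zero, so each training point only influences its immediate neighbourhood. A small gamma keeps some similarity at larger distances, which may be why it helps on features that have not been scaled.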


#### Naive Bayes

In [17]:
# Trains the model using training features and labels
from sklearn.naive_bayes import GaussianNB
clf_naive_bayes = GaussianNB().fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_naive_bayes = clf_naive_bayes.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_naive_bayes = f1_score(val_y, pred_val_naive_bayes, average='micro')
print(val_score_naive_bayes)

0.5167785234899329


#### AdaBoost

In [18]:
# Trains the model using training features and labels
from sklearn.ensemble import AdaBoostClassifier
clf_ada = AdaBoostClassifier().fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_ada = clf_ada.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_ada = f1_score(val_y, pred_val_ada, average='micro')
print(val_score_ada)

0.7684563758389261


#### XGBoost

In [19]:
# Trains the model using training features and labels
import xgboost as xgb
xgbr = xgb.XGBClassifier()
clf_xgb = xgbr.fit(tr_x, tr_y)
# Predicts the labels for the validation dataset to evaluate the performance
pred_val_xgb = clf_xgb.predict(val_x)
# Evaluates the micro f1 score by comparing the predicted values and the actual values from the validation dataset
val_score_xgb = f1_score(val_y, pred_val_xgb, average='micro')
print(val_score_xgb)

0.7953020134228188
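
Depending on the installed XGBoost version, `XGBClassifier.fit` may expect class labels numbered from 0 rather than ratings from 1 to 5. This is an assumption about the environment, not something the output above shows. If an error about invalid classes appears, the labels can be encoded before fitting and decoded after predicting, as in this plain-Python sketch (the helper name `fit_label_codes` and the sample values are hypothetical; scikit-learn's `LabelEncoder` does the same job).

```python
def fit_label_codes(labels):
    """Map the sorted distinct labels to 0..K-1 and return both directions."""
    classes = sorted(set(labels))
    to_code = {label: code for code, label in enumerate(classes)}
    to_label = {code: label for label, code in to_code.items()}
    return to_code, to_label


ratings = [4.0, 5.0, 3.0, 4.0, 1.0, 2.0]          # made-up training labels
to_code, to_label = fit_label_codes(ratings)

encoded = [to_code[r] for r in ratings]            # what the model would be fit on
predicted_codes = [3, 4, 0]                        # pretend model output
decoded = [to_label[c] for c in predicted_codes]   # back to ratings for the submission

print(encoded)
print(decoded)
```

The decoding step matters for the submission file, which should contain ratings on the original 1 to 5 scale.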


### Test Data Preparation

In [20]:
# once you are happy with your local model, let's prepare a submission

test_data = pd.read_csv('test_new.csv').sample(frac=1) 
_id = test_data['id']
test_data = test_data.fillna(0)
test_data = test_data.drop(['merchant_id', 'merchant_profile_picture', 'id', 'tags'], axis=1)

# Dropping the columns identified during the pre-processing steps
test_data = test_data.drop(['merchant_has_profile_picture', 'theme', 'crawl_month'], axis=1)
test_data = test_data.drop(to_drop, axis=1)
test_data[cat_cols] = test_data[cat_cols].apply(lambda col: col.map(dict_cat[col.name]))

# again, values not seen in the training set are filled with -1
test_data = test_data.fillna(-1)

### Testing the trained model and generating submission

In [21]:
# Since XGB gives the best f1-score after multiple executions, we generate the final submission using XGB model
pred_test = clf_xgb.predict(test_data)
pred_df = pd.DataFrame(data={'id': np.asarray(_id), 'rating': pred_test})
pred_df.to_csv('pred_walkthrough.csv', index=False)