# Introduction

In this notebook ,we'll evaluate the results of our classification experiements .

In [299]:
# Used to manipule and analyse data
import pandas as pd

# Used for cleaning
import re

# Used to work with arrays
import numpy as np

# Used to interact with Kaggle (Kaggle API)
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

# Used for classification experiement
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Used to generate beautiful charts
import plotly.express as px

# Used to deal with class imbalance
import imblearn
from imblearn.over_sampling import SMOTE


import os

## Create a dataset for testing purposes 

### Authentification

In [300]:
api = KaggleApi()
api.authenticate()

List of getting started competitions **(16 competitions)**

In [301]:
competitions = api.competitions_list(category="gettingStarted")
competitions

[contradictory-my-dear-watson,
 gan-getting-started,
 store-sales-time-series-forecasting,
 tpu-getting-started,
 digit-recognizer,
 titanic,
 house-prices-advanced-regression-techniques,
 connectx,
 nlp-getting-started,
 spaceship-titanic,
 facial-keypoints-detection,
 street-view-getting-started-with-julia,
 word2vec-nlp-tutorial,
 data-science-london-scikit-learn,
 just-the-basics-the-after-party,
 just-the-basics-strata-2013]

List of Kernel references of the well documented notebooks that we have selected from Learn section on Kaggle

In [302]:
competitions = ['contradictory-my-dear-watson','gan-getting-started','store-sales-time-series-forecasting','tpu-getting-started','digit-recognizer','titanic','house-prices-advanced-regression-techniques','connectx','nlp-getting-started','spaceship-titanic','facial-keypoints-detection','word2vec-nlp-tutorial','data-science-london-scikit-learn','just-the-basics-the-after-party']

Create a dataframe where we will store data related to our notebooks

In [303]:
df = pd.DataFrame(columns = ['kernel_ref','path','title','competition','domain','technique'])

In [304]:
for comp in competitions :
    try :
        kernels_list = api.kernels_list(page=1,competition = str(comp),sort_by="voteCount")
        for k in kernels_list :
                df.loc[len(df)] = [str(k.ref),re.sub("^[^/]*/",'',str(k.ref))+".ipynb",str(k),str(comp),'','']         
    except Exception as e:
                        print(e)

Here is the list of all  the successfuly collected  notebooks data , we still need to attribute the corresponding domain & technique for each notebook :

In [305]:
df

Unnamed: 0,kernel_ref,path,title,competition,domain,technique
0,anasofiauzsoy/tutorial-notebook,tutorial-notebook.ipynb,Tutorial Notebook,contradictory-my-dear-watson,,
1,nkitgupta/text-representations,text-representations.ipynb,Text-Representations,contradictory-my-dear-watson,,
2,rohanrao/tpu-sherlocked-one-stop-for-with-tf,tpu-sherlocked-one-stop-for-with-tf.ipynb,TPU Sherlocked: One-stop for 🤗 with TF,contradictory-my-dear-watson,,
3,faressayah/text-analysis-topic-modelling-with-...,text-analysis-topic-modelling-with-spacy-gensi...,Text Analysis|Topic Modelling with spaCy & GENSIM,contradictory-my-dear-watson,,
4,vbookshelf/basics-of-bert-and-xlm-roberta-pytorch,basics-of-bert-and-xlm-roberta-pytorch.ipynb,Basics of BERT and XLM-RoBERTa - PyTorch,contradictory-my-dear-watson,,
...,...,...,...,...,...,...
259,orangecatman/scikit-learn-data-science-lon-wil...,scikit-learn-data-science-lon-william.ipynb,手把手教學使用_Scikit-Learn_中文教學_Data Science Lon_Wil...,data-science-london-scikit-learn,,
260,irinana/spam-detection-strata2013after-party,spam-detection-strata2013after-party.ipynb,SPAM_Detection Strata2013After-party,just-the-basics-the-after-party,,
261,meln1337/just-the-basics-notebook,just-the-basics-notebook.ipynb,Just The Basics Notebook,just-the-basics-the-after-party,,
262,nikosavgeros/classification-using-machine-and-...,classification-using-machine-and-deep-learning...,Classification using Machine and Deep Learning,just-the-basics-the-after-party,,


After a manual examination of competitions we can affect the following domains and techniques :

- contradictory-my-dear-watson : **nlp - Classification**
- gan-getting-started : **Computer Vision**
- store-sales-time-series-forecasting :* - **Regression**
- tpu-getting-started : **Computer Vision - Classification**
- digit-recognizer : **nlp - Classification**
- titanic : * - **Classification**
- house-prices-advanced-regression-techniques :* - **Regression**
- connectx : **Reinforcement learning** - * 
- nlp-getting-started : **nlp - Classification**
- spaceship-titanic :* - **Classification**
- facial-keypoints-detection : **Computer Vision - Regression**
- word2vec-nlp-tutorial : **nlp - Classification**
- data-science-london-scikit-learn : * - **Classification**
- just-the-basics-the-after-party : **nlp - Classification**


In [306]:
data = {'competition': competitions,
 'Domain': ['nlp', 'computer vision', 'no_domain', 'computer vision','nlp', 'no_domain', 'no_domain','reinforcement learning','nlp', 'no_domain', 'computer vision','nlp', 'no_domain','nlp'],
 'Technique':['classification','no_technique','regression','classification','classification','classification','regression','no_technique','classification','classification','regression','classification','classification','classification']}

df_comp_tag = pd.DataFrame.from_dict(data)
df_comp_tag

Unnamed: 0,competition,Domain,Technique
0,contradictory-my-dear-watson,nlp,classification
1,gan-getting-started,computer vision,no_technique
2,store-sales-time-series-forecasting,no_domain,regression
3,tpu-getting-started,computer vision,classification
4,digit-recognizer,nlp,classification
5,titanic,no_domain,classification
6,house-prices-advanced-regression-techniques,no_domain,regression
7,connectx,reinforcement learning,no_technique
8,nlp-getting-started,nlp,classification
9,spaceship-titanic,no_domain,classification


In [307]:
for i in df.index:
        df.loc[i]['competition','domain','technique'] = list(df_comp_tag[df_comp_tag['competition'] == df.loc[i]['competition']].reset_index(drop=True).loc[0])

In [309]:
df

Unnamed: 0,kernel_ref,path,title,competition,domain,technique
0,anasofiauzsoy/tutorial-notebook,tutorial-notebook.ipynb,Tutorial Notebook,contradictory-my-dear-watson,nlp,classification
1,nkitgupta/text-representations,text-representations.ipynb,Text-Representations,contradictory-my-dear-watson,nlp,classification
2,rohanrao/tpu-sherlocked-one-stop-for-with-tf,tpu-sherlocked-one-stop-for-with-tf.ipynb,TPU Sherlocked: One-stop for 🤗 with TF,contradictory-my-dear-watson,nlp,classification
3,faressayah/text-analysis-topic-modelling-with-...,text-analysis-topic-modelling-with-spacy-gensi...,Text Analysis|Topic Modelling with spaCy & GENSIM,contradictory-my-dear-watson,nlp,classification
4,vbookshelf/basics-of-bert-and-xlm-roberta-pytorch,basics-of-bert-and-xlm-roberta-pytorch.ipynb,Basics of BERT and XLM-RoBERTa - PyTorch,contradictory-my-dear-watson,nlp,classification
...,...,...,...,...,...,...
259,orangecatman/scikit-learn-data-science-lon-wil...,scikit-learn-data-science-lon-william.ipynb,手把手教學使用_Scikit-Learn_中文教學_Data Science Lon_Wil...,data-science-london-scikit-learn,no_domain,classification
260,irinana/spam-detection-strata2013after-party,spam-detection-strata2013after-party.ipynb,SPAM_Detection Strata2013After-party,just-the-basics-the-after-party,nlp,classification
261,meln1337/just-the-basics-notebook,just-the-basics-notebook.ipynb,Just The Basics Notebook,just-the-basics-the-after-party,nlp,classification
262,nikosavgeros/classification-using-machine-and-...,classification-using-machine-and-deep-learning...,Classification using Machine and Deep Learning,just-the-basics-the-after-party,nlp,classification


## Save Data

In [144]:
df.to_csv("test.csv")

In [310]:
df_coll = pd.read_csv("../data/ntb_list_clean.csv").drop(['Unnamed: 0'],axis=1)
df_coll

Unnamed: 0,title,subcategory,category
0,sudhirnl7/linear-regression-tutorial,linear regression,regression
1,goyalshalini93/car-price-prediction-linear-reg...,linear regression,regression
2,divan0/multiple-linear-regression,linear regression,regression
3,anthonypino/price-analysis-and-linear-regression,linear regression,regression
4,vivinbarath/simple-linear-regression-for-salar...,linear regression,regression
...,...,...,...
8592,fanbyprinciple/reinforcement-learning-on-opena...,reinforcement,reinforcement learning
8593,alexisbcook/exercise-one-step-lookahead,reinforcement,reinforcement learning
8594,alexisbcook/exercise-interactive-maps,reinforcement,reinforcement learning
8595,lbarbosa/connectx-deep-reinforcement-learning,reinforcement,reinforcement learning


Drop and count notebooks that already exist on the trainning set 

In [311]:
c1 =[]
c2 =[]

for i in df.index :

    if df.loc[i]['kernel_ref'] in list(df_coll['title']):
        c1.append(df.loc[i]['title'])
    else :
        c2.append(df.loc[i]['title'])

print(len(np.unique(c1)),len(np.unique(c2)))


78 173


78 notebooks need to be droped from our data :

In [312]:
for ntb in c1 :
    df.drop(df.index[df['title'] == ntb], inplace=True)

Here's finally the complete data :

In [313]:
df

Unnamed: 0,kernel_ref,path,title,competition,domain,technique
3,faressayah/text-analysis-topic-modelling-with-...,text-analysis-topic-modelling-with-spacy-gensi...,Text Analysis|Topic Modelling with spaCy & GENSIM,contradictory-my-dear-watson,nlp,classification
9,yihdarshieh/more-nli-datasets-hugging-face-nlp...,more-nli-datasets-hugging-face-nlp-library.ipynb,More NLI datasets - Hugging Face nlp library,contradictory-my-dear-watson,nlp,classification
10,pradeepmuniasamy/contradictory-my-dear-watson-...,contradictory-my-dear-watson-everything-you-ne...,"Contradictory, My Dear Watson- Everything you ...",contradictory-my-dear-watson,nlp,classification
17,rhtsingh/interpreting-text-models-with-bert-on...,interpreting-text-models-with-bert-on-tpu.ipynb,Interpreting text models with BERT on TPU,contradictory-my-dear-watson,nlp,classification
19,narendrageek/nlp-augmenter-5-fold-bert-translator,nlp-augmenter-5-fold-bert-translator.ipynb,"NLP Augmenter, 5 Fold BERT & Translator",contradictory-my-dear-watson,nlp,classification
...,...,...,...,...,...,...
259,orangecatman/scikit-learn-data-science-lon-wil...,scikit-learn-data-science-lon-william.ipynb,手把手教學使用_Scikit-Learn_中文教學_Data Science Lon_Wil...,data-science-london-scikit-learn,no_domain,classification
260,irinana/spam-detection-strata2013after-party,spam-detection-strata2013after-party.ipynb,SPAM_Detection Strata2013After-party,just-the-basics-the-after-party,nlp,classification
261,meln1337/just-the-basics-notebook,just-the-basics-notebook.ipynb,Just The Basics Notebook,just-the-basics-the-after-party,nlp,classification
262,nikosavgeros/classification-using-machine-and-...,classification-using-machine-and-deep-learning...,Classification using Machine and Deep Learning,just-the-basics-the-after-party,nlp,classification


In [314]:
DATA_PATH = "./data_/"
test = os.listdir(DATA_PATH)
L=[]
for item in test:
        L.append(item)

L

['1-house-prices-solution-top-1.ipynb',
 '10-simple-hacks-to-speed-up-your-data-analysis.ipynb',
 '2-15-loss-simple-split-trick.ipynb',
 '25-million-images-0-99757-mnist.ipynb',
 '93-f-score-bag-of-words-m-bags-of-popcorn-with-rf.ipynb',
 'a-beginner-s-approach-to-classification.ipynb',
 'a-complete-guide-to-decision-trees-ensembles.ipynb',
 'a-complete-guide-to-linear-regression.ipynb',
 'a-complete-guide-to-support-vector-machine.ipynb',
 'a-data-science-framework-to-achieve-99-accuracy.ipynb',
 'a-detailed-explanation-of-keras-embedding-layer.ipynb',
 'a-detailed-regression-guide-with-house-pricing.ipynb',
 'a-journey-through-titanic.ipynb',
 'a-real-disaster-leaked-label.ipynb',
 'a-simple-petals-tf-2-2-notebook.ipynb',
 'a-statistical-analysis-ml-workflow-of-titanic.ipynb',
 'a-study-on-regression-applied-to-the-ames-dataset.ipynb',
 'advanced-analysis-to-infinity-and-beyond.ipynb',
 'advanced-uses-of-shap-values.ipynb',
 'all-imputation-techniques-with-pros-and-cons.ipynb',
 'alp

Drop and count notebooks that that have failed to be pulled from Kaggle

In [315]:
c3 =[]
c4 =[]

for i in df.index :

    if df.loc[i]['path'] in L:
        c3.append(df.loc[i]['path'])
    else :
        c4.append(df.loc[i]['path'])

print(len(np.unique(c3)),len(np.unique(c4)))

153 15


15 notebooks need to be droped from our data :

In [316]:
for ntb in c4 :
    df.drop(df.index[df['title'] == ntb], inplace=True)

In [317]:
df.drop(df.index[df['path'] == 'scikit-learn-data-science-lon-william.ipynb'], inplace=True)
df

Unnamed: 0,kernel_ref,path,title,competition,domain,technique
3,faressayah/text-analysis-topic-modelling-with-...,text-analysis-topic-modelling-with-spacy-gensi...,Text Analysis|Topic Modelling with spaCy & GENSIM,contradictory-my-dear-watson,nlp,classification
9,yihdarshieh/more-nli-datasets-hugging-face-nlp...,more-nli-datasets-hugging-face-nlp-library.ipynb,More NLI datasets - Hugging Face nlp library,contradictory-my-dear-watson,nlp,classification
10,pradeepmuniasamy/contradictory-my-dear-watson-...,contradictory-my-dear-watson-everything-you-ne...,"Contradictory, My Dear Watson- Everything you ...",contradictory-my-dear-watson,nlp,classification
17,rhtsingh/interpreting-text-models-with-bert-on...,interpreting-text-models-with-bert-on-tpu.ipynb,Interpreting text models with BERT on TPU,contradictory-my-dear-watson,nlp,classification
19,narendrageek/nlp-augmenter-5-fold-bert-translator,nlp-augmenter-5-fold-bert-translator.ipynb,"NLP Augmenter, 5 Fold BERT & Translator",contradictory-my-dear-watson,nlp,classification
...,...,...,...,...,...,...
258,ziadhamadafathy/using-some-models-in-classific...,using-some-models-in-classification-accuracy-9...,"Using Some Models in classification, accuracy...",data-science-london-scikit-learn,no_domain,classification
260,irinana/spam-detection-strata2013after-party,spam-detection-strata2013after-party.ipynb,SPAM_Detection Strata2013After-party,just-the-basics-the-after-party,nlp,classification
261,meln1337/just-the-basics-notebook,just-the-basics-notebook.ipynb,Just The Basics Notebook,just-the-basics-the-after-party,nlp,classification
262,nikosavgeros/classification-using-machine-and-...,classification-using-machine-and-deep-learning...,Classification using Machine and Deep Learning,just-the-basics-the-after-party,nlp,classification


In [162]:
df.groupby('domain').count()


Unnamed: 0_level_0,kernel_ref,path,title,competition,technique
domain,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
computer vision,56,56,56,56,56
nlp,50,50,50,50,50
no_domain,59,59,59,59,59
reinforcement learning,14,14,14,14,14


In [163]:
df.groupby('technique').count()

Unnamed: 0_level_0,kernel_ref,path,title,competition,domain
technique,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
classification,103,103,103,103,103
no_technique,31,31,31,31,31
regression,45,45,45,45,45


In [318]:
df.to_csv("../data/test_data.csv")


### Analysing classification experiement 

In [169]:
data = pd.read_csv("../data/final_data.csv")
data

Unnamed: 0,num_feature_0,num_feature_1,num_feature_2,num_feature_3,num_feature_4,num_feature_5,num_feature_6,num_feature_7,num_feature_8,num_feature_9,...,num_feature_759,num_feature_760,num_feature_761,num_feature_762,num_feature_763,num_feature_764,num_feature_765,num_feature_766,num_feature_767,target_category
0,-0.347906,0.298659,0.306194,0.055175,-0.430964,-0.466804,-0.108935,0.225360,0.385942,0.492067,...,-0.174081,-0.516293,0.402670,-0.094396,0.222461,0.574193,-0.606644,-0.461940,0.539914,computer vision
1,-0.381965,0.317170,0.336794,0.081098,-0.454720,-0.551013,-0.051526,0.256912,0.381023,0.474694,...,-0.182501,-0.524450,0.344222,-0.027710,0.259213,0.608695,-0.722355,-0.420306,0.589060,clustering
2,-0.334516,0.283154,0.291431,0.043395,-0.385153,-0.494500,-0.111106,0.227195,0.380586,0.548708,...,-0.169292,-0.530989,0.416996,-0.104790,0.218367,0.636691,-0.612310,-0.446481,0.508087,computer vision
3,-0.275612,0.267364,0.266519,0.076538,-0.282673,-0.329046,-0.045134,0.221697,0.293198,0.421586,...,-0.149041,-0.485052,0.493056,-0.112996,0.220296,0.618480,-0.594641,-0.469342,0.497985,nlp
4,-0.243645,0.288036,0.266372,0.109039,-0.166584,-0.294326,-0.053085,0.150146,0.304883,0.368951,...,-0.142109,-0.482496,0.465995,-0.077088,0.226020,0.523076,-0.427591,-0.406878,0.511194,classification
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6255,-0.393783,0.277537,0.316518,0.040720,-0.391826,-0.480739,-0.062746,0.245819,0.256513,0.471789,...,-0.129478,-0.558991,0.332909,-0.120112,0.225102,0.562891,-0.788352,-0.403093,0.606365,clustering
6256,-0.350655,0.288547,0.306525,0.093222,-0.323761,-0.448132,-0.068621,0.184130,0.309040,0.422071,...,-0.132200,-0.475985,0.370314,-0.145366,0.221709,0.561444,-0.666458,-0.397126,0.554349,clustering
6257,-0.269415,0.342038,0.304815,0.037482,-0.235388,-0.309742,-0.080752,0.213444,0.319876,0.423119,...,-0.133684,-0.532163,0.415316,-0.049092,0.259647,0.605257,-0.576541,-0.392241,0.530491,nlp
6258,-0.324500,0.284065,0.302361,0.090343,-0.268171,-0.412675,-0.082972,0.214550,0.285261,0.424154,...,-0.130789,-0.487126,0.364878,-0.106732,0.233295,0.579414,-0.619685,-0.412788,0.527967,nlp


In [206]:
x = data.drop('target_category', axis = 1).copy()
y = data['target_category'].copy()

In [207]:
xtrain, xtest,  ytrain, ytest = train_test_split(x, y, test_size=0.2)

In [208]:
logr = LogisticRegression(max_iter=10000, class_weight='balanced', multi_class='auto')
scores = cross_val_score(logr, x, y, cv=5)
logr.fit(xtrain, ytrain)
yhat = logr.predict(xtest)

In [209]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(xtrain, ytrain)
ydt = clf.predict(xtest)

In [229]:
neigh = KNeighborsClassifier(n_neighbors=6)
neigh.fit(xtrain, ytrain)
yknn = neigh.predict(xtest)

In [230]:
data_analyse = pd.DataFrame(columns=['Real','Predicted_1','Predicted_2','Predicted_3'])
data_analyse['Real'] = ytest
data_analyse['Predicted_1'] = yhat
data_analyse['Predicted_2'] = ydt
data_analyse['Predicted_3'] = yknn
data_analyse


Unnamed: 0,Real,Predicted_1,Predicted_2,Predicted_3
5827,computer vision,computer vision,reinforcement learning,computer vision
1250,clustering,nlp,nlp,nlp
3300,clustering,clustering,nlp,nlp
2238,nlp,computer vision,computer vision,computer vision
58,regression,regression,regression,regression
...,...,...,...,...
621,nlp,nlp,computer vision,computer vision
2712,nlp,classification,classification,nlp
1074,clustering,clustering,computer vision,clustering
44,nlp,nlp,reinforcement learning,nlp


In [231]:
c1 =0
c2 =0
c3 =0
for i in data_analyse.index:
    if data_analyse.loc[i]['Real'] == data_analyse.loc[i]['Predicted_1']:
        c1+=1
    if data_analyse.loc[i]['Real'] == data_analyse.loc[i]['Predicted_2']:
        c2+=1
    if data_analyse.loc[i]['Real'] == data_analyse.loc[i]['Predicted_3']:
        c3+=1
print(c1,c2,c3)

984 541 609


In [232]:
data_analyse['Result_1'] = np.where(data_analyse['Real'] == data_analyse['Predicted_1'],"well_classified","miss_classified")
data_analyse['Result_2'] = np.where(data_analyse['Real'] == data_analyse['Predicted_2'],"well_classified","miss_classified")
data_analyse['Result_3'] = np.where(data_analyse['Real'] == data_analyse['Predicted_3'],"well_classified","miss_classified")

In [233]:
data_analyse


Unnamed: 0,Real,Predicted_1,Predicted_2,Predicted_3,Result_1,Result_2,Result_3
5827,computer vision,computer vision,reinforcement learning,computer vision,well_classified,miss_classified,well_classified
1250,clustering,nlp,nlp,nlp,miss_classified,miss_classified,miss_classified
3300,clustering,clustering,nlp,nlp,well_classified,miss_classified,miss_classified
2238,nlp,computer vision,computer vision,computer vision,miss_classified,miss_classified,miss_classified
58,regression,regression,regression,regression,well_classified,well_classified,well_classified
...,...,...,...,...,...,...,...
621,nlp,nlp,computer vision,computer vision,well_classified,miss_classified,miss_classified
2712,nlp,classification,classification,nlp,miss_classified,miss_classified,well_classified
1074,clustering,clustering,computer vision,clustering,well_classified,miss_classified,well_classified
44,nlp,nlp,reinforcement learning,nlp,well_classified,miss_classified,well_classified


In [216]:
fig = px.sunburst(data_analyse, path=['Real','Result_1'],values=np.ones(1252))
fig.show()

In [217]:
fig = px.sunburst(data_analyse, path=['Real','Result_2'],values=np.ones(1252))
fig.show()

In [218]:
fig = px.sunburst(data_analyse, path=['Real','Result_3'],values=np.ones(1252))
fig.show()

In [237]:
data_2 = pd.read_csv("../data/final_data.csv")
x_2 = data_2.drop('target_category', axis = 1).copy()
y_2 = data_2['target_category'].copy()
xtrain_2, xtest_2,  ytrain_2, ytest_2 = train_test_split(x_2, y_2, test_size=0.2)
logr_2 = LogisticRegression(max_iter=10000, class_weight='balanced', multi_class='auto')
scores_2 = cross_val_score(logr_2, x_2, y_2, cv=5)
logr_2.fit(xtrain_2, ytrain_2)
yhat_2 = logr_2.predict(xtest_2)
clf_2 = tree.DecisionTreeClassifier()
clf_2 = clf_2.fit(xtrain_2, ytrain_2)
ydt_2 = clf_2.predict(xtest_2)
neigh_2 = KNeighborsClassifier(n_neighbors=3)
neigh_2.fit(xtrain_2, ytrain_2)
yknn_2 = neigh_2.predict(xtest_2)
data_analyse_2 = pd.DataFrame(columns=['Real','Predicted_1','Predicted_2','Predicted_3'])
data_analyse_2['Real'] = ytest_2
data_analyse_2['Predicted_1'] = yhat_2
data_analyse_2['Predicted_2'] = ydt_2
data_analyse_2['Predicted_3'] = yknn_2


c1_2 =0
c2_2 =0
c3_2 =0
for i in data_analyse_2.index:
    if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_1']:
        c1_2+=1
    if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_2']:
        c2_2+=1
    if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_3']:
        c3_2+=1
print(c1_2,c2_2,c3_2)
data_analyse_2['Result_1'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_1'],"well_classified","miss_classified")
data_analyse_2['Result_2'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_2'],"well_classified","miss_classified")
data_analyse_2['Result_3'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_3'],"well_classified","miss_classified")
data_analyse_2


972 530 582


Unnamed: 0,Real,Predicted_1,Predicted_2,Predicted_3,Result_1,Result_2,Result_3
5556,4,4,4,2,well_classified,well_classified,miss_classified
5229,0,3,3,4,miss_classified,miss_classified,miss_classified
2054,2,2,2,2,well_classified,well_classified,well_classified
969,1,1,1,1,well_classified,well_classified,well_classified
1455,0,4,0,1,miss_classified,well_classified,miss_classified
...,...,...,...,...,...,...,...
5414,0,0,4,0,well_classified,miss_classified,well_classified
2235,1,1,1,4,well_classified,well_classified,miss_classified
6109,4,4,4,0,well_classified,well_classified,miss_classified
3007,1,1,0,1,well_classified,miss_classified,well_classified


In [238]:
integer_mapping = {l: i for i, l in enumerate(le.classes_)}
integer_mapping

{'classification': 0,
 'clustering': 1,
 'computer vision': 2,
 'nlp': 3,
 'regression': 4,
 'reinforcement learning': 5}

In [239]:
data_2

Unnamed: 0,num_feature_0,num_feature_1,num_feature_2,num_feature_3,num_feature_4,num_feature_5,num_feature_6,num_feature_7,num_feature_8,num_feature_9,...,num_feature_759,num_feature_760,num_feature_761,num_feature_762,num_feature_763,num_feature_764,num_feature_765,num_feature_766,num_feature_767,target_category
0,-0.347906,0.298659,0.306194,0.055175,-0.430964,-0.466804,-0.108935,0.225360,0.385942,0.492067,...,-0.174081,-0.516293,0.402670,-0.094396,0.222461,0.574193,-0.606644,-0.461940,0.539914,2
1,-0.381965,0.317170,0.336794,0.081098,-0.454720,-0.551013,-0.051526,0.256912,0.381023,0.474694,...,-0.182501,-0.524450,0.344222,-0.027710,0.259213,0.608695,-0.722355,-0.420306,0.589060,1
2,-0.334516,0.283154,0.291431,0.043395,-0.385153,-0.494500,-0.111106,0.227195,0.380586,0.548708,...,-0.169292,-0.530989,0.416996,-0.104790,0.218367,0.636691,-0.612310,-0.446481,0.508087,2
3,-0.275612,0.267364,0.266519,0.076538,-0.282673,-0.329046,-0.045134,0.221697,0.293198,0.421586,...,-0.149041,-0.485052,0.493056,-0.112996,0.220296,0.618480,-0.594641,-0.469342,0.497985,3
4,-0.243645,0.288036,0.266372,0.109039,-0.166584,-0.294326,-0.053085,0.150146,0.304883,0.368951,...,-0.142109,-0.482496,0.465995,-0.077088,0.226020,0.523076,-0.427591,-0.406878,0.511194,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6255,-0.393783,0.277537,0.316518,0.040720,-0.391826,-0.480739,-0.062746,0.245819,0.256513,0.471789,...,-0.129478,-0.558991,0.332909,-0.120112,0.225102,0.562891,-0.788352,-0.403093,0.606365,1
6256,-0.350655,0.288547,0.306525,0.093222,-0.323761,-0.448132,-0.068621,0.184130,0.309040,0.422071,...,-0.132200,-0.475985,0.370314,-0.145366,0.221709,0.561444,-0.666458,-0.397126,0.554349,1
6257,-0.269415,0.342038,0.304815,0.037482,-0.235388,-0.309742,-0.080752,0.213444,0.319876,0.423119,...,-0.133684,-0.532163,0.415316,-0.049092,0.259647,0.605257,-0.576541,-0.392241,0.530491,3
6258,-0.324500,0.284065,0.302361,0.090343,-0.268171,-0.412675,-0.082972,0.214550,0.285261,0.424154,...,-0.130789,-0.487126,0.364878,-0.106732,0.233295,0.579414,-0.619685,-0.412788,0.527967,3


### Dealing with class Imbalance

In [273]:
data_2 = pd.read_csv("../data/final_data.csv")
x_2 = data_2.drop('target_category', axis = 1).copy()
y_2 = data_2['target_category'].copy()
oversample = SMOTE()
x_2, y_2 = oversample.fit_resample(x_2, y_2)
xtrain_2, xtest_2,  ytrain_2, ytest_2 = train_test_split(x_2, y_2, test_size=0.2)
logr_2 = LogisticRegression(max_iter=10000, class_weight='balanced', multi_class='auto')
scores_2 = cross_val_score(logr_2, x_2, y_2, cv=5)
logr_2.fit(xtrain_2, ytrain_2)
yhat_2 = logr_2.predict(xtest_2)
clf_2 = tree.DecisionTreeClassifier()
clf_2 = clf_2.fit(xtrain_2, ytrain_2)
ydt_2 = clf_2.predict(xtest_2)
neigh_2 = KNeighborsClassifier(n_neighbors=3)
neigh_2.fit(xtrain_2, ytrain_2)
yknn_2 = neigh_2.predict(xtest_2)
data_analyse_2 = pd.DataFrame(columns=['Real','Predicted_1','Predicted_2','Predicted_3'])
data_analyse_2['Real'] = ytest_2
data_analyse_2['Predicted_1'] = yhat_2
data_analyse_2['Predicted_2'] = ydt_2
data_analyse_2['Predicted_3'] = yknn_2


c1_2 =0
c2_2 =0
c3_2 =0
for i in data_analyse_2.index:
    if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_1']:
        c1_2+=1
    if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_2']:
        c2_2+=1
    if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_3']:
        c3_2+=1
print(c1_2,c2_2,c3_2)
data_analyse_2['Result_1'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_1'],"well_classified","miss_classified")
data_analyse_2['Result_2'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_2'],"well_classified","miss_classified")
data_analyse_2['Result_3'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_3'],"well_classified","miss_classified")
data_analyse_2


1753 1330 1543


Unnamed: 0,Real,Predicted_1,Predicted_2,Predicted_3,Result_1,Result_2,Result_3
2788,classification,classification,classification,classification,well_classified,well_classified,well_classified
2277,regression,regression,reinforcement learning,computer vision,well_classified,miss_classified,miss_classified
6532,classification,classification,classification,classification,well_classified,well_classified,well_classified
1076,clustering,clustering,regression,clustering,well_classified,miss_classified,well_classified
7953,computer vision,computer vision,clustering,computer vision,well_classified,miss_classified,well_classified
...,...,...,...,...,...,...,...
5281,nlp,nlp,computer vision,nlp,well_classified,miss_classified,well_classified
1518,classification,classification,clustering,classification,well_classified,miss_classified,well_classified
1011,clustering,clustering,classification,clustering,well_classified,miss_classified,well_classified
1319,clustering,classification,classification,classification,miss_classified,miss_classified,miss_classified


In [274]:
print('Accuracy score:',accuracy_score(ytest, yhat))
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
print('F1 score:',f1_score(ytest, yhat, average='weighted'))

Accuracy score: 0.7859424920127795
0.76 accuracy with a standard deviation of 0.03
F1 score: 0.7862735097766143


In [275]:
print('Accuracy score:',accuracy_score(ytest_2, yhat_2))
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores_2.mean(), scores_2.std()))
print('F1 score:',f1_score(ytest_2, yhat_2, average='weighted'))

Accuracy score: 0.8153488372093023
0.83 accuracy with a standard deviation of 0.03
F1 score: 0.8149708981150121


In [276]:
fig = px.sunburst(data_analyse_2, path=['Real','Result_1'],values=np.ones(2150))
fig.show()

In [278]:
scores

array([0.76198083, 0.80191693, 0.76996805, 0.70846645, 0.75958466])

In [277]:
scores_2

array([0.79627907, 0.80921359, 0.79013495, 0.85714286, 0.87436017])

In [279]:

logr_2.score(xtest_2,ytest_2)

0.8153488372093023

Percentage of misclassified samples :

*  Classification ~ 25 %
*  Clustering ~ 22 %
*  Nlp ~ 21 %
*  Regression ~ 17 %
*  Reinforcement Learning ~ 13 %
*  Computer Vision ~ 9 %

In [293]:
def without_oversampling (path):
    data_2 = pd.read_csv(path)
    x_2 = data_2.drop('target_category', axis = 1).copy()
    y_2 = data_2['target_category'].copy()
    xtrain_2, xtest_2,  ytrain_2, ytest_2 = train_test_split(x_2, y_2, test_size=0.2)
    logr_2 = LogisticRegression(max_iter=10000, class_weight='balanced', multi_class='auto')
    scores_2 = cross_val_score(logr_2, x_2, y_2, cv=5)
    logr_2.fit(xtrain_2, ytrain_2)
    yhat_2 = logr_2.predict(xtest_2)
    clf_2 = tree.DecisionTreeClassifier()
    clf_2 = clf_2.fit(xtrain_2, ytrain_2)
    ydt_2 = clf_2.predict(xtest_2)
    neigh_2 = KNeighborsClassifier(n_neighbors=3)
    neigh_2.fit(xtrain_2, ytrain_2)
    yknn_2 = neigh_2.predict(xtest_2)
    data_analyse_2 = pd.DataFrame(columns=['Real','Predicted_1','Predicted_2','Predicted_3'])
    data_analyse_2['Real'] = ytest_2
    data_analyse_2['Predicted_1'] = yhat_2
    data_analyse_2['Predicted_2'] = ydt_2
    data_analyse_2['Predicted_3'] = yknn_2


    c1_2 =0
    c2_2 =0
    c3_2 =0
    for i in data_analyse_2.index:
        if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_1']:
            c1_2+=1
        if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_2']:
            c2_2+=1
        if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_3']:
            c3_2+=1
    print(c1_2,c2_2,c3_2,len(data_analyse_2))
    data_analyse_2['Result_1'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_1'],"well_classified","misclassified")
    data_analyse_2['Result_2'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_2'],"well_classified","misclassified")
    data_analyse_2['Result_3'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_3'],"well_classified","misclassified")
    

    fig = px.sunburst(data_analyse_2, path=['Real','Result_1'],values=np.ones(len(data_analyse_2)))
    fig.show()

    print('Accuracy score:',accuracy_score(ytest_2, yhat_2))
    print("%0.2f accuracy with a standard deviation of %0.2f" % (scores_2.mean(), scores_2.std()))
    print('F1 score:',f1_score(ytest_2, yhat_2, average='weighted'))

In [321]:
def with_oversampling(path):
    data_2 = pd.read_csv(path)
    x_2 = data_2.drop('target_category', axis = 1).copy()
    y_2 = data_2['target_category'].copy()
    oversample = SMOTE()
    xtrain_2, xtest_2,  ytrain_2, ytest_2 = train_test_split(x_2, y_2, test_size=0.2)
    xtrain_2, ytrain_2 = oversample.fit_resample(xtrain_2, ytrain_2)
    logr_2 = LogisticRegression(max_iter=10000, multi_class='auto')
    scores_2 = cross_val_score(logr_2, x_2, y_2, cv=5)
    logr_2.fit(xtrain_2, ytrain_2)
    yhat_2 = logr_2.predict(xtest_2)
    clf_2 = tree.DecisionTreeClassifier()
    clf_2 = clf_2.fit(xtrain_2, ytrain_2)
    ydt_2 = clf_2.predict(xtest_2)
    neigh_2 = KNeighborsClassifier(n_neighbors=3)
    neigh_2.fit(xtrain_2, ytrain_2)
    yknn_2 = neigh_2.predict(xtest_2)
    data_analyse_2 = pd.DataFrame(columns=['Real','Predicted_1','Predicted_2','Predicted_3'])
    data_analyse_2['Real'] = ytest_2
    data_analyse_2['Predicted_1'] = yhat_2
    data_analyse_2['Predicted_2'] = ydt_2
    data_analyse_2['Predicted_3'] = yknn_2


    c1_2 =0
    c2_2 =0
    c3_2 =0
    for i in data_analyse_2.index:
        if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_1']:
            c1_2+=1
        if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_2']:
            c2_2+=1
        if data_analyse_2.loc[i]['Real'] == data_analyse_2.loc[i]['Predicted_3']:
            c3_2+=1
    print(c1_2,c2_2,c3_2,len(data_analyse_2))
    data_analyse_2['Result_1'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_1'],"well_classified","misclassified")
    data_analyse_2['Result_2'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_2'],"well_classified","misclassified")
    data_analyse_2['Result_3'] = np.where(data_analyse_2['Real'] == data_analyse_2['Predicted_3'],"well_classified","misclassified")
    

    fig = px.sunburst(data_analyse_2, path=['Real','Result_1'],values=np.ones(len(data_analyse_2)))
    fig.show()

    print('Accuracy score:',accuracy_score(ytest_2, yhat_2))
    print("%0.2f accuracy with a standard deviation of %0.2f" % (scores_2.mean(), scores_2.std()))
    print('F1 score:',f1_score(ytest_2, yhat_2, average='weighted'))

In [295]:
without_oversampling("../data/domain_data.csv")

580 426 456 661


Accuracy score: 0.8774583963691377
0.86 accuracy with a standard deviation of 0.03
F1 score: 0.8791411829112372


Percentage of misclassified samples :

*  Reinforcement Learning ~ 16 %
*  Nlp ~ 13 %
*  Computer Vision ~ 8 %

In [323]:
with_oversampling("../data/domain_data.csv")

599 415 430 661


Accuracy score: 0.9062027231467473
0.88 accuracy with a standard deviation of 0.02
F1 score: 0.9064893418317701


Percentage of misclassified samples :

*  Nlp ~ 11 %
*  Reinforcement Learning ~ 6 %
*  Computer Vision ~ 6 %

In [297]:
without_oversampling("../data/technique_data.csv")

450 286 328 592


Accuracy score: 0.7601351351351351
0.76 accuracy with a standard deviation of 0.03
F1 score: 0.7619230834526061


Percentage of misclassified samples :

*  Regression ~ 20.6 %
*  Classification ~ 20.5 %
*  Clustering ~ 20.2 %

In [322]:
with_oversampling("../data/technique_data.csv")

475 294 326 592


Accuracy score: 0.8023648648648649
0.76 accuracy with a standard deviation of 0.02
F1 score: 0.8028864994666574


Percentage of misclassified samples :

*  Classification ~ 20 %
*  Clustering ~ 23 %
*  Regression ~ 19 %

In [320]:
print(classification_report(ytest, yhat))

                        precision    recall  f1-score   support

        classification       0.67      0.74      0.70       166
            clustering       0.85      0.68      0.76       234
       computer vision       0.82      0.90      0.86       248
                   nlp       0.86      0.80      0.83       334
            regression       0.74      0.77      0.75       205
reinforcement learning       0.64      0.83      0.72        65

              accuracy                           0.79      1252
             macro avg       0.76      0.79      0.77      1252
          weighted avg       0.79      0.79      0.79      1252

