# Introduction

This notebook describes our strategy for collecting suitable data for the documentation generation task. We rely essentially on Kaggle notebooks: we use the top-voted notebooks from the 'Getting Started' competitions on Kaggle to build our dataset, and we also add notebooks from the courses section. We chose these notebooks because they are written for learning purposes, so their code tends to be well documented.

--> What is Kaggle ?
Kaggle is an online community platform for data scientists and machine learning enthusiasts. Kaggle allows users to collaborate with other users, find and publish datasets, use GPU integrated notebooks, and compete with other data scientists to solve data science challenges. The aim of this online platform is to help professionals and learners reach their goals in their data science journey with the powerful tools and resources it provides.

## Importing libraries

In [1]:
# Used to interact with Kaggle (Kaggle API)
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

# Used to manipulate and analyse data
import pandas as pd

# Used for natural language processing
import nltk
from langdetect import detect, detect_langs, DetectorFactory

# Fix the seed so that langdetect returns reproducible results
DetectorFactory.seed = 0

Path to Data

In [17]:
DATA_PATH="../data_/"

## Authentication 

To use Kaggle's API to interact with Kaggle resources (competitions, datasets and kernels), you first need to authenticate using an API token. To do so, you can follow this [Guide](https://towardsdatascience.com/how-to-search-and-download-data-using-kaggle-api-f815f7b98080).

In [3]:
api = KaggleApi()
api.authenticate()
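
Before calling `api.authenticate()`, it can help to check that credentials are actually in place. The sketch below is only an illustration: the helper name `find_kaggle_credentials` is hypothetical, and it assumes the two conventional locations, the `KAGGLE_USERNAME` / `KAGGLE_KEY` environment variables and the default token file `~/.kaggle/kaggle.json`. Other setups (for example a custom configuration directory) are not covered.

```python
import os
from pathlib import Path


def find_kaggle_credentials(env=None, home=None):
    """Return where Kaggle credentials were found: 'env', 'file', or None.

    Hypothetical helper: it only inspects the two conventional locations
    and does not contact Kaggle.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Option 2: the token file downloaded from the Kaggle account page.
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "file"

    return None


if __name__ == "__main__":
    print("Credentials found in:", find_kaggle_credentials())
```

If the function returns `None`, authentication will most likely fail and the token needs to be set up first.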

Create the main dataframe where we will store the selected notebooks

In [4]:
df = pd.DataFrame(columns=['kernel_ref','title','competition'])
df

Unnamed: 0,kernel_ref,title,competition


List of kernel references for the well-documented notebooks that we selected from the Learn section on Kaggle

In [35]:
List_Kernels_Courses = ['dansbecker/basic-data-exploration','dansbecker/your-first-machine-learning-model','dansbecker/model-validation','dansbecker/underfitting-and-overfitting','dansbecker/random-forests','alexisbcook/missing-values','alexisbcook/categorical-variables','alexisbcook/pipelines','alexisbcook/cross-validation','alexisbcook/xgboost','alexisbcook/data-leakage','ryanholbrook/what-is-feature-engineering','ryanholbrook/mutual-information','ryanholbrook/creating-features','ryanholbrook/clustering-with-k-means','ryanholbrook/principal-component-analysis','ryanholbrook/target-encoding','ryanholbrook/stochastic-gradient-descent','ryanholbrook/overfitting-and-underfitting','ryanholbrook/dropout-and-batch-normalization','ryanholbrook/binary-classification','ryanholbrook/the-convolutional-classifier','ryanholbrook/convolution-and-relu','ryanholbrook/maximum-pooling','ryanholbrook/the-sliding-window','ryanholbrook/custom-convnets','ryanholbrook/data-augmentation','ryanholbrook/linear-regression-with-time-series','ryanholbrook/trend','ryanholbrook/time-series-as-features','ryanholbrook/hybrid-models','ryanholbrook/forecasting-with-machine-learning','dansbecker/partial-plots','dansbecker/shap-values','dansbecker/advanced-uses-of-shap-values']

Save the collected notebooks locally

In [None]:
for ntb in List_Kernels_Courses:
    try:
        api.kernels_pull(ntb, path=DATA_PATH)
    except Exception as e:
        # A pull can fail for reasons other than a missing notebook, so report the actual error
        print('Kaggle API exception : ', ntb, e)
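
The download loop above only prints failures. A possible variant, sketched here under assumptions, also creates the target folder and returns the references that failed so they can be retried or inspected later. The function name `pull_notebooks` and its `pull_fn` parameter are hypothetical; `pull_fn` is expected to accept `(ref, path=...)` the way `api.kernels_pull` is called in this notebook.

```python
import os


def pull_notebooks(pull_fn, kernel_refs, dest):
    """Download each kernel with pull_fn and return the refs that failed.

    pull_fn is any callable taking (ref, path=dest); in this notebook
    that role would be played by api.kernels_pull.
    """
    os.makedirs(dest, exist_ok=True)  # create the target folder if missing
    failed = []
    for ref in kernel_refs:
        try:
            pull_fn(ref, path=dest)
        except Exception as exc:  # keep going, but remember what went wrong
            failed.append((ref, str(exc)))
    return failed
```

In the notebook this might be called as `failed = pull_notebooks(api.kernels_pull, List_Kernels_Courses, DATA_PATH)`.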

List of getting started competitions **(16 competitions)**

In [5]:
competitions = api.competitions_list(category="gettingStarted")
competitions

[contradictory-my-dear-watson,
 gan-getting-started,
 store-sales-time-series-forecasting,
 tpu-getting-started,
 digit-recognizer,
 titanic,
 house-prices-advanced-regression-techniques,
 connectx,
 nlp-getting-started,
 spaceship-titanic,
 facial-keypoints-detection,
 street-view-getting-started-with-julia,
 word2vec-nlp-tutorial,
 data-science-london-scikit-learn,
 just-the-basics-the-after-party,
 just-the-basics-strata-2013]

Collect the **most voted** *notebooks* for each **getting-started** *competition*

In [6]:
for comp in competitions:
    try:
        # First page of kernels for this competition, sorted by vote count
        kernels_list = api.kernels_list(page=1, competition=str(comp), sort_by="voteCount")
        for k in kernels_list:
            # str(k) is stored as the notebook title
            df.loc[len(df)] = [str(k.ref), str(k), str(comp)]
    except Exception as e:
        print(e)
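
Appending to a dataframe one row at a time works for a few hundred notebooks. An alternative, sketched below, gathers plain dictionaries first and builds the dataframe once at the end. The function `collect_kernel_rows`, its `list_fn` parameter and the `pages` argument are hypothetical names for illustration; `list_fn` is assumed to accept the same keyword arguments as the `api.kernels_list` call above.

```python
def collect_kernel_rows(list_fn, competitions, pages=1):
    """Collect one dict per kernel, most voted first, for each competition.

    list_fn is any callable accepting page=, competition= and sort_by=
    keyword arguments and returning a list of kernel objects.
    """
    rows = []
    for comp in competitions:
        for page in range(1, pages + 1):
            try:
                kernels = list_fn(page=page, competition=str(comp), sort_by="voteCount")
            except Exception as exc:
                print(f"Skipping {comp} (page {page}): {exc}")
                break
            if not kernels:  # no more results for this competition
                break
            for k in kernels:
                rows.append({
                    "kernel_ref": str(k.ref),
                    "title": str(k),
                    "competition": str(comp),
                })
    return rows
```

The result could then be turned into a dataframe with `pd.DataFrame(rows, columns=['kernel_ref', 'title', 'competition'])`.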

Here is the list of all the successfully collected notebook titles:

In [7]:
df

Unnamed: 0,kernel_ref,title,competition
0,anasofiauzsoy/tutorial-notebook,Tutorial Notebook,contradictory-my-dear-watson
1,nkitgupta/text-representations,Text-Representations,contradictory-my-dear-watson
2,rohanrao/tpu-sherlocked-one-stop-for-with-tf,TPU Sherlocked: One-stop for 🤗 with TF,contradictory-my-dear-watson
3,faressayah/text-analysis-topic-modelling-with-...,Text Analysis|Topic Modelling with spaCy & GENSIM,contradictory-my-dear-watson
4,vbookshelf/basics-of-bert-and-xlm-roberta-pytorch,Basics of BERT and XLM-RoBERTa - PyTorch,contradictory-my-dear-watson
...,...,...,...
267,irinana/spam-detection-strata2013after-party,SPAM_Detection Strata2013After-party,just-the-basics-the-after-party
268,nikosavgeros/classification-using-machine-and-...,Classification using Machine and Deep Learning,just-the-basics-the-after-party
269,meln1337/just-the-basics-notebook,Just The Basics Notebook,just-the-basics-the-after-party
270,pythonkumar/xgboost-feat-imp-acc-93,XGBoost + Feat Imp Acc:93%,just-the-basics-the-after-party


## Data Cleaning 

Now let's remove the non-English notebooks, starting with a check on their titles.

We create a list of words related to the ML and DS domain, many of them abbreviations, so that the filter does not mistake them for non-English words.

In [8]:
custom_list = ['dataset', 'datasets', 'feature', 'transformer', 'transformers', 'using', 'detect', 'detecting', 'machine', 'intro', 'pca', 'connectx', 'xgboost', 'visualising', 'visualizing', 'gaussian', 'bayesian', 'score', 'scores', 'map', 'maps', 'ml', 'dl', 'algorithm', 'algorithms', 'feature', 'features','model','models','Hyperparameters','Hyperparameter','preprocessing','CNN','Keras','Tensorflow']

We decided to combine NLTK and langdetect to detect non-English notebooks.

In [9]:
# nltk.corpus.words needs the 'words' corpus, available via nltk.download('words')
words = set(nltk.corpus.words.words())
words.update(custom_list)

In [10]:
def clean_non_english(df, word_list):
    """Return the index labels of rows whose title looks non-English."""
    non_english = []
    for row in df.index:
        title = df.loc[row, 'title']
        try:
            # title = "Chaii EDA&Baseline 実況"
            # Keep tokens that are known words, plus non-alphabetic tokens (digits, punctuation)
            clean_title = " ".join(w for w in nltk.wordpunct_tokenize(title)
                                   if w.lower() in word_list or not w.isalpha())

            # Flag the title when more than 3 words were removed or nothing is left
            if ((len(title.split()) - len(clean_title.split())) > 3) or len(clean_title.split()) == 0:
                non_english.append(row)
        except Exception as e:
            print(e)
    return non_english
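
To make the heuristic concrete, here is a minimal standalone sketch of the same rule applied to single titles. It is an illustration under assumptions: `is_flagged` and `max_missing` are hypothetical names, the vocabulary is a toy set rather than the NLTK corpus, and the regular expression reproduces the word/punctuation split performed by `nltk.wordpunct_tokenize`. Because tokens are lowercased before the lookup, the vocabulary entries here are lowercase as well.

```python
import re


def is_flagged(title, vocabulary, max_missing=3):
    """Apply the title heuristic to one title and return True if it is flagged."""
    # Same split as nltk.wordpunct_tokenize: runs of word characters,
    # or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", title)
    # Keep known words and anything that is not purely alphabetic.
    kept = [t for t in tokens if t.lower() in vocabulary or not t.isalpha()]
    missing = len(title.split()) - len(kept)
    return missing > max_missing or len(kept) == 0


vocab = {"tutorial", "notebook", "keras"}  # toy vocabulary, not the NLTK corpus

print(is_flagged("Tutorial Notebook", vocab))                      # False
print(is_flagged("MLP simples com Keras para Iniciantes", vocab))  # True
print(is_flagged("CycleGAN", vocab))                               # True
```

The last call shows why manual review is still needed: a short English title made of a single unknown word is flagged because nothing is left after filtering.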

Get the indices of the titles detected as non-English

In [11]:
non_english = clean_non_english(df, words)
non_english

[39, 49, 63, 89, 94, 138, 210, 226]

We first need to examine these titles to validate the results

In [12]:
for i in non_english:
    print(i,df.loc[i]['kernel_ref'],df.loc[i]['title'])

39 upamanyumukherjee/cyclegan CycleGAN
49 andrej0marinchenko/hyperparamaters Hyperparamaters
63 ryanholbrook/tfrecords-basics TFRecords Basics
89 mauriciofigueiredo/mlp-simples-com-keras-para-iniciantes MLP simples com Keras para Iniciantes
94 mauriciofigueiredo/cnn-simples-com-keras-para-iniciantes CNN simples com Keras para Iniciantes
138 marsggbo/kaggle Êàø‰ª∑È¢ÑÊµãkaggleÂÖ•Èó®È°πÁõÆ
210 ritvik1909/facial-keypoint-detection-cnn-aug-tl Facial Keypoint Detection - CNN + Aug + TL
226 piterrudyy/julia Julia


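After examining them, we keep the first three flagged notebooks (indices 39, 49 and 63), whose titles are clearly English, and select the remaining five (indices 89, 94, 138, 210 and 226) for removal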

In [13]:
non_english_to_be_dropped = non_english[3:]

We then drop them directly from our dataframe

In [14]:
df.drop(non_english_to_be_dropped, inplace=True)
df = df.reset_index(drop=True)
df

Unnamed: 0,kernel_ref,title,competition
0,anasofiauzsoy/tutorial-notebook,Tutorial Notebook,contradictory-my-dear-watson
1,nkitgupta/text-representations,Text-Representations,contradictory-my-dear-watson
2,rohanrao/tpu-sherlocked-one-stop-for-with-tf,TPU Sherlocked: One-stop for 🤗 with TF,contradictory-my-dear-watson
3,faressayah/text-analysis-topic-modelling-with-...,Text Analysis|Topic Modelling with spaCy & GENSIM,contradictory-my-dear-watson
4,vbookshelf/basics-of-bert-and-xlm-roberta-pytorch,Basics of BERT and XLM-RoBERTa - PyTorch,contradictory-my-dear-watson
...,...,...,...
262,irinana/spam-detection-strata2013after-party,SPAM_Detection Strata2013After-party,just-the-basics-the-after-party
263,nikosavgeros/classification-using-machine-and-...,Classification using Machine and Deep Learning,just-the-basics-the-after-party
264,meln1337/just-the-basics-notebook,Just The Basics Notebook,just-the-basics-the-after-party
265,pythonkumar/xgboost-feat-imp-acc-93,XGBoost + Feat Imp Acc:93%,just-the-basics-the-after-party


A second check using langdetect

In [15]:
for i in df.index :
    if detect(df.loc[i]['title']) != 'en':
        print(i,df.loc[i]['kernel_ref'],df.loc[i]['title'])

0 anasofiauzsoy/tutorial-notebook Tutorial Notebook
14 qinhui1999/more-nli-datasets-xmlr-large More NLI datasets xmlr-large
18 jswxhd/nli-beginner-eda-bert-baseline NLI Beginner: EDA & Bert Baseline
22 ohseokkim/transfering-style Transfering Style!
28 nachiket273/cyclegan-pytorch CycleGAN_Pytorch
35 dapy15/monet-using-gan Monet using GAN
39 upamanyumukherjee/cyclegan CycleGAN
49 andrej0marinchenko/hyperparamaters Hyperparamaters
50 vanguarde/study-series-uplift-modeling [study series] Uplift modeling
51 mpwolke/vacaciones-en-ecuador-with-galearn Vacaciones en Ecuador with Galearn
52 hiro5299834/store-sales-ridge-voting-bagging-et-bagging-rf Store Sales: Ridge+Voting(Bagging(ET)+Bagging(RF))
56 romaupgini/guide-external-data-features-for-multivariatets 🌎Guide: External Data&Features for MultivariateTS
65 atamazian/fc-ensemble-external-data-effnet-densenet FC Ensemble External Data (EffNet+DenseNet)
73 allunia/differential-evolution Differential Evolution
82 poonaml/deep-neural-networ
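
As the output above shows, a language detector applied to short titles flags many English ones, so its result is best treated as a list of candidates for manual review. The sketch below wraps the check in a small function that also tolerates detection errors, since `langdetect.detect` can raise an exception on text without usable features (for example a title made only of digits or symbols). The names `non_english_titles`, `detect_fn` and `expected` are hypothetical; `detect_fn` stands for any callable that maps a string to a language code.

```python
def non_english_titles(titles, detect_fn, expected="en"):
    """Return (position, title, guess) for titles not detected as `expected`.

    detect_fn is any callable mapping a string to a language code; a
    failed detection is recorded with guess=None instead of stopping the loop.
    """
    flagged = []
    for i, title in enumerate(titles):
        try:
            guess = detect_fn(title)
        except Exception:
            guess = None  # detection failed, keep the title for manual review
        if guess != expected:
            flagged.append((i, title, guess))
    return flagged
```

In this notebook it might be called as `non_english_titles(df['title'], detect)`; the first element of each tuple is then the position within the column, which matches the index labels only right after `reset_index`.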

Drop the non-English ones after examination

In [16]:
df.drop([51,52,266], inplace=True)
df = df.reset_index(drop=True)
df

Unnamed: 0,kernel_ref,title,competition
0,anasofiauzsoy/tutorial-notebook,Tutorial Notebook,contradictory-my-dear-watson
1,nkitgupta/text-representations,Text-Representations,contradictory-my-dear-watson
2,rohanrao/tpu-sherlocked-one-stop-for-with-tf,TPU Sherlocked: One-stop for 🤗 with TF,contradictory-my-dear-watson
3,faressayah/text-analysis-topic-modelling-with-...,Text Analysis|Topic Modelling with spaCy & GENSIM,contradictory-my-dear-watson
4,vbookshelf/basics-of-bert-and-xlm-roberta-pytorch,Basics of BERT and XLM-RoBERTa - PyTorch,contradictory-my-dear-watson
...,...,...,...
259,rocklen/data-science-london-scikit-learn,🏢 Data Science London + Scikit-learn 🏢,data-science-london-scikit-learn
260,irinana/spam-detection-strata2013after-party,SPAM_Detection Strata2013After-party,just-the-basics-the-after-party
261,nikosavgeros/classification-using-machine-and-...,Classification using Machine and Deep Learning,just-the-basics-the-after-party
262,meln1337/just-the-basics-notebook,Just The Basics Notebook,just-the-basics-the-after-party


We now have a clean list of suitable notebooks. Let's download them using the Kaggle API:

In [19]:
for i in df.index:
    try:
        api.kernels_pull(df.loc[i, 'kernel_ref'], path=DATA_PATH)
    except Exception as e:
        # A pull can fail for reasons other than a missing notebook, so report the actual error
        print('Kaggle API exception : ', df.loc[i, 'title'], e)