## Notes
There's a pattern of workshop data from year 2017 to 2022, can be scraped by incrementing the year in the following url:
https://icml.cc/Conferences/2017/Schedule?type=Workshop to https://icml.cc/Conferences/2022/Schedule?type=Workshop
After which the format of the website changes and so new code will have to be written.

In [64]:
import requests
import csv
from bs4 import BeautifulSoup as bs

In [78]:
fields_title = ['title', 'year']

In [97]:
for year in range(17,23):
    
    URL = f'https://icml.cc/Conferences/20{year}/Schedule?type=Workshop'

    print(URL, year)
    
    req = requests.get(URL)
    soup = bs(req.text, 'html.parser')
      
    titles = soup.find_all('div',attrs = {'class','maincardBody'})
    
    with open('titles_icml', 'a') as f:
        
        write = csv.writer(f)
        
        #write.writerow(fields_title)
        for titleNumber in range(0,len(titles)):
            print(titles[titleNumber].text)
            
            write.writerow([titles[titleNumber].text, f'20{year}'])
  
    #print(titles[0].text)

https://icml.cc/Conferences/2017/Schedule?type=Workshop 17
ML on a budget: IoT, Mobile and other tiny-ML applications
Principled Approaches to Deep Learning
Video Games and Machine Learning
Learning to Generate Natural Language
Workshop on Human Interpretability in Machine Learning (WHI)
Implicit Generative Models
Lifelong Learning: A Reinforcement Learning Approach
Automatic Machine Learning (AutoML 2017)
Workshop on Computational Biology
Workshop on Visualization for Deep Learning
ICML Workshop on Machine Learning for Autonomous Vehicles 2017
Interactive Machine Learning and Semantic Information Retrieval
Picky Learners: Choosing Alternative Ways to Process Data.
Time Series Workshop
Human in the Loop Machine Learning
Reinforcement Learning Workshop
Private and Secure Machine Learning
Machine Learning for Music Discovery
Reliable Machine Learning in the Wild
Machine Learning in Speech and Language Processing
Reproducibility in Machine Learning Research
Deep Structured Prediction
http

In [75]:
print(titles[0].text)

My ML Workshop [EXAMPLE]


## Scraping abstracts
Annoyingly there's no obvious pattern in the website for the different abstracts, so a manual inspect of each html element must be performed

In [48]:
array17 = [[2017], [*range(1,22),930]]
array18 = [[2018], [*range(3280, 3353)]]
array19 = [[2019], [*range(3502, 3533)]]
array20 = [[2020],[*range(5715, 5749)]]
array21 = [[2021],[*range(8347, 8377)]]
array22 = [[2022], [*range(13446, 13479),21435]]

In [77]:
allArrays = [array17, array18, array19, array20, array21, array22]

### After manually inspecting each year to assertain the eventID we iterate over each year and list of eventIDs to produce a list of abstracts, sadly there are no abstracts for year 2018

In [81]:
for array in allArrays:
    
    for eventID in array[1]:
    
        #Change below later so can iterate through all arrays from 17-22
        year = array[0][0]
    
        URL = f'https://icml.cc/Conferences/{year}/Schedule?showEvent={eventID}'

        print(URL, year)
    
        req = requests.get(URL)
        soup = bs(req.text, 'html.parser')
      
        abstracts = soup.find_all('div',attrs = {'class','abstractContainer'})
    
        with open('abstracts_icml', 'a') as f:
        
            write = csv.writer(f)
        
            for abstractNumber in range(0,len(abstracts)):
                print(abstracts[abstractNumber].text)
            
                write.writerow([abstracts[abstractNumber].text, f'{year}'])

https://icml.cc/Conferences/2017/Schedule?showEvent=1 2017
Machine learning has achieved considerable successes in recent years and an ever-growing number of disciplines rely on it. However, this success crucially relies on human machine learning experts, who select appropriate features, workflows, machine learning paradigms, algorithms, and their hyperparameters. As the complexity of these tasks is often beyond non-experts, the rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge. We call the resulting research area that targets progressive automation of machine learning AutoML.

https://icml.cc/Conferences/2017/Schedule?showEvent=2 2017
In recent years, deep learning has revolutionized machine learning. Most successful applications of deep learning involve predicting single variables (e.g., univariate regression or multi-class classification). However, many real problems invo

### With both titles and abstracts for each year and workshop we now perform some simple data exploration, starting with BOW

In [8]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [9]:
data = pd.read_csv(r'abstracts_icml')
df = pd.DataFrame(data)
df.columns = ['abstract', 'year']

In [24]:
df

Unnamed: 0,abstract,year
0,"In recent years, deep learning has revolutioni...",2017
1,For details see:http://machlearn.gitlab.io/hit...,2017
2,Although dramatic progress has been made in th...,2017
3,Probabilistic models are a central implement i...,2017
4,Retrieval techniques operating on text or sema...,2017
...,...,...
145,We propose the 1st ICML Workshop on Safe Learn...,2022
146,As modern astrophysical surveys deliver an unp...,2022
147,This workshop proposal builds on the success o...,2022
148,Machine learning (ML) approaches can support d...,2022


In [10]:
df = df.dropna(axis=0)

In [33]:
df

Unnamed: 0,abstract,year
0,"In recent years, deep learning has revolutioni...",2017
1,For details see:http://machlearn.gitlab.io/hit...,2017
2,Although dramatic progress has been made in th...,2017
3,Probabilistic models are a central implement i...,2017
4,Retrieval techniques operating on text or sema...,2017
...,...,...
144,A long-standing objective of AI research has b...,2022
145,We propose the 1st ICML Workshop on Safe Learn...,2022
146,As modern astrophysical surveys deliver an unp...,2022
147,This workshop proposal builds on the success o...,2022


In [11]:
df2017 = df.loc[df['year'] == 2017]
df2019 = df.loc[df['year'] == 2019]
df2020 = df.loc[df['year'] == 2020]
df2021 = df.loc[df['year'] == 2021]
df2022 = df.loc[df['year'] == 2022]

In [12]:
from sklearn.feature_extraction import text 

stopwords = text.ENGLISH_STOP_WORDS.union(['machine learning','machine','ml','learning','workshop'])

CountVec17 = CountVectorizer(ngram_range=(1,2), stop_words = stopwords)
CountVec19 = CountVectorizer(ngram_range=(1,2), stop_words = stopwords)
CountVec20 = CountVectorizer(ngram_range=(1,2), stop_words = stopwords)
CountVec21 = CountVectorizer(ngram_range=(1,2), stop_words = stopwords)
CountVec22 = CountVectorizer(ngram_range=(1,2), stop_words = stopwords)

In [327]:
bow = CountVec.fit_transform(df2017['abstract'])

In [300]:
print(bowCount)

[[1 1 1 ... 1 1 1]]


In [328]:
bowDf = pd.DataFrame(bow.toarray(), columns = CountVec.get_feature_names_out())

In [302]:
bowDf

Unnamed: 0,00625,00625 2016,01783,01783 2016,01868,01868 2016,03801,03801 2016,06057,06057 2016,...,years witnessed,years workshop,york,york max,zahavy,zahavy daniel,zeming,zeming lin,zero,zero fatality
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Now lets explore the different topic as a function of year

In [7]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [106]:
ldaBow  = LDA(n_components=5, random_state=42) 
ldaBow.fit(bow)

In [107]:
ldaBow.transform(bow[:2])

array([[0.99422791, 0.00143745, 0.00145025, 0.00144628, 0.0014381 ],
       [0.86407728, 0.00121582, 0.1322501 , 0.00122719, 0.00122961]])

In [108]:
for idx, topic in enumerate(ldaBow.components_):
    print(f"Top 5 words in Topic #{idx}:")
    print([CountVec.get_feature_names()[i] for i in topic.argsort()[-5:]]) 
    print('')

Top 5 words in Topic #0:
['information', 'retrieval', 'community', 'human', 'systems']

Top 5 words in Topic #1:
['video', 'ai', 'arxiv', '2016', 'games']

Top 5 words in Topic #2:
['lifelong', 'series', 'time', 'implicit', 'models']

Top 5 words in Topic #3:
['search', 'prediction', 'ml', 'time', 'driving']

Top 5 words in Topic #4:
['large', 'community', 'reproducibility', 'papers', 'models']





## Interesting, now lets run LDA on the entire dataset

In [36]:
bow17 = CountVec17.fit_transform(df2017['abstract'])
bow19 = CountVec19.fit_transform(df2019['abstract'])
bow20 = CountVec20.fit_transform(df2020['abstract'])
bow21 = CountVec21.fit_transform(df2021['abstract'])
bow22 = CountVec22.fit_transform(df2022['abstract'])

In [37]:
ldaBow17  = LDA(n_components=5, random_state=42) 
ldaBow17.fit(bow17)
ldaBow19  = LDA(n_components=5, random_state=42) 
ldaBow19.fit(bow19)
ldaBow20  = LDA(n_components=5, random_state=42) 
ldaBow20.fit(bow20)
ldaBow21  = LDA(n_components=5, random_state=42) 
ldaBow21.fit(bow21)
ldaBow22  = LDA(n_components=5, random_state=42) 
ldaBow22.fit(bow22)

In [38]:
for idx, topic in enumerate(ldaBow17.components_):
    print(f"Top 5 words in Topic #{idx}:")
    print([CountVec17.get_feature_names()[i] for i in topic.argsort()[-5:]]) 
    print('')

Top 5 words in Topic #0:
['deep', 'implicit models', 'community', 'implicit', 'models']

Top 5 words in Topic #1:
['researchers', 'data', 'research', 'prediction', 'time']

Top 5 words in Topic #2:
['feedback', 'ai', 'arxiv', '2016', 'games']

Top 5 words in Topic #3:
['loop', 'privacy', 'systems', 'lifelong', 'human']

Top 5 words in Topic #4:
['collections', 'interactive', 'information retrieval', 'information', 'retrieval']





In [39]:
for idx, topic in enumerate(ldaBow19.components_):
    print(f"Top 5 words in Topic #{idx}:")
    print([CountVec19.get_feature_names()[i] for i in topic.argsort()[-5:]]) 
    
    print('')

Top 5 words in Topic #0:
['methods', 'research', 'social', 'models', 'researchers']

Top 5 words in Topic #1:
['theory', 'techniques', 'coding', 'community', 'systems']

Top 5 words in Topic #2:
['data', 'deep', 'time', 'real', 'rl']

Top 5 words in Topic #3:
['modeling', 'inference', 'probabilistic', 'data', 'models']

Top 5 words in Topic #4:
['deep', 'agents', 'tasks', 'multi', 'reinforcement']



In [52]:
for idx, topic in enumerate(ldaBow20.components_):
    print(f"Top 5 words in Topic #{idx}:")
    print([CountVec20.get_feature_names()[i] for i in topic.argsort()[-5:]]) 
    print('')

Top 5 words in Topic #0:
['problems', 'healthcare', 'systems', 'health', 'research']

Top 5 words in Topic #1:
['networks', 'neural networks', 'graph', 'neural', 'data']

Top 5 words in Topic #2:
['values', 'br', 'open', 'missing', 'data']

Top 5 words in Topic #3:
['generalization', 'models', 'data', 'methods', 'systems']

Top 5 words in Topic #4:
['based', 'ai', 'systems', 'models', 'data']



In [41]:
for idx, topic in enumerate(ldaBow21.components_):
    print(f"Top 5 words in Topic #{idx}:")
    print([CountVec21.get_feature_names()[i] for i in topic.argsort()[-5:]]) 
    print('')

Top 5 words in Topic #0:
['order', 'collaboration', 'human', 'data', 'methods']

Top 5 words in Topic #1:
['applications', 'ssl', 'tasks', 'data', 'systems']

Top 5 words in Topic #2:
['theoretical', 'researchers', 'new', 'causal', 'data']

Top 5 words in Topic #3:
['systems', 'distribution', 'privacy', 'data', 'models']

Top 5 words in Topic #4:
['researchers', 'systems', 'medical', 'models', 'climate']



In [42]:
for idx, topic in enumerate(ldaBow22.components_):
    print(f"Top 5 words in Topic #{idx}:")
    print([CountVec22.get_feature_names()[i] for i in topic.argsort()[-5:]]) 
    print('')

Top 5 words in Topic #0:
['bring', 'systems', 'methods', 'researchers', 'new']

Top 5 words in Topic #1:
['methods', 'training', 'privacy', 'models', 'data']

Top 5 words in Topic #2:
['language', 'data', 'models', 'approaches', 'knowledge']

Top 5 words in Topic #3:
['2022', 'decision', 'pre', 'exvo', 'models']

Top 5 words in Topic #4:
['science', 'ai', 'data', 'systems', 'community']



In [47]:
import pyLDAvis 
import pyLDAvis.sklearn 

pyLDAvis.enable_notebook()

In [63]:
display_data = pyLDAvis.sklearn.prepare(ldaBow22, bow22, CountVec22)

  by='saliency', ascending=False).head(R).drop('saliency', 1)
  from imp import reload
  from imp import reload
  from imp import reload
  from imp import reload


In [64]:
pyLDAvis.display(display_data)