<a href="https://colab.research.google.com/github/ahmedlila/Web-Scraping-Notebooks/blob/main/CS230%20-%20DL%20Projects%20Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Import Libraries Needed 

In [92]:
pip install validators



In [93]:
import validators
from validators import ValidationFailure
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from tabulate import tabulate
from collections import Counter
from nltk.stem import PorterStemmer
from IPython.display import HTML

### Helpers

In [94]:
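# Function source code: https://miguendes.me/how-to-check-if-a-string-is-a-valid-url-in-python
def is_string_an_url(url_string: str) -> bool:
    """Return True if url_string is a valid absolute URL, False otherwise."""
    result = validators.url(url_string)
    # validators.url returns True on success and a ValidationFailure object otherwise
    if isinstance(result, ValidationFailure):
        return False
    return result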

In [95]:
base_directory = "https://cs230.stanford.edu/past-projects/"
html = urlopen(base_directory)
bsObj = BeautifulSoup(html, features="html.parser")
project_links, project_names = list(), list()

# Each project title is a <strong> tag followed by sibling <a> links
for project_name in bsObj.find_all("strong"):
    project_name_text = project_name.get_text()
    link = project_name.find_next_siblings("a")

    if link:  # at least one sibling link exists
        # Take the first link and check whether it is already an absolute URL
        path = link[0].attrs['href']
        link_text = link[0].get_text()
        is_absolute = is_string_an_url(path)

        if link_text == 'report':  # select reports only
            # Relative paths are prefixed with the base directory
            project_links.append(path if is_absolute else base_directory + path)
            project_names.append(project_name_text)
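
Concatenating `base_directory + path` works when a relative link is a plain suffix, but the standard library's `urllib.parse.urljoin` also resolves `../` segments and leaves absolute URLs untouched. A minimal sketch of that alternative follows. The `resolve_link` helper and the relative `href` are hypothetical illustrations (the relative path is derived from the report URL shown in the test output, not read from the page itself):

```python
from urllib.parse import urljoin

BASE = "https://cs230.stanford.edu/past-projects/"

def resolve_link(href: str, base: str = BASE) -> str:
    """Return an absolute URL for href, resolving it against base if needed."""
    # urljoin returns href unchanged when it is already an absolute URL
    return urljoin(base, href)

# Hypothetical relative href, for illustration only
relative = "../projects_fall_2021/reports/102730335.pdf"
absolute = "http://cs230.stanford.edu/projects_fall_2021/reports/102730335.pdf"

print(resolve_link(relative))
print(resolve_link(absolute))
```

With this approach the separate validity check is only needed if malformed links should be skipped, since both relative and absolute links go through the same call.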

In [96]:
# TEST
project_links[3]

'http://cs230.stanford.edu/projects_fall_2021/reports/102730335.pdf'

In [97]:
# Create DataFrame
df = pd.DataFrame({'Project Name': project_names, 
                   'Project Link': project_links})

In [98]:
# Drop duplicates
df.drop_duplicates(inplace=True)
# TEST
df[df.duplicated()]

Unnamed: 0,Project Name,Project Link


In [99]:
df['Project Link'] = df['Project Link'].apply(lambda x: '<a href="{}">Paper Source</a>'.format(x))
HTML(df.head().to_html(escape=False))

Unnamed: 0,Project Name,Project Link
0,Classification of Medical Imagery using DL (?),Paper Source
1,In Learning we Truss: Structural Design Optimization Using Deep Learning,Paper Source
2,Predicting Regional US COVID Risk Using Publicly Available Satellite Images,Paper Source
3,Image Exposure Correction with DNN,Paper Source
4,Identifying presence or absence of seismic facies types in geological images using deep learning,Paper Source


In [None]:
# The 100 most frequent words in project names
Counter(" ".join(df["Project Name"]).split()).most_common(100)
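
Splitting on whitespace counts `Image`, `image`, and `Images:` as different words. A small sketch of a normalized count, assuming that lowercasing and a simple regex tokenizer are acceptable here. The `top_words` helper is hypothetical, and the short `titles` list stands in for `df["Project Name"]`:

```python
import re
from collections import Counter

def top_words(titles, n=100):
    """Count lowercase alphanumeric tokens across all titles."""
    tokens = []
    for title in titles:
        # Keep runs of letters and digits, so punctuation such as ':' is dropped
        tokens.extend(re.findall(r"[a-z0-9]+", title.lower()))
    return Counter(tokens).most_common(n)

# Small sample; in the notebook this would be df["Project Name"]
titles = [
    "3D object detection",
    "3D Object Detection from Point Cloud",
    "Image Exposure Correction with DNN",
]
print(top_words(titles, n=3))
```

Note that this tokenizer also splits on apostrophes, so a possessive such as `Alzheimer's` yields a stray `s` token; filtering very short tokens is one way to handle that.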

### Filter and Randomizer 

**Most frequent words you can search for:**
 - Classification
 - Recognition
 - Prediction
 - Identification
 - GANs
 - CNNs
 - RNNs
 - LSTM
 - Image
 - Video
 - Text
 - Sentiment
 - Stock
 - Facial
 - Medical
 - MRI
 - Automatic





#### <font color='green'>I- Project Filter</font>

In [101]:
def filter_projects(word: str) -> str:
    """
    Filter all projects by a word given by the user.

    Arguments:
    word -- string to search for, e.g. "detection".

    Returns:
    HTML table (string) of all projects whose name contains the stemmed word.
    """
    stem = PorterStemmer().stem(word.lower())  # stem the query once, not once per row
    mask = df['Project Name'].apply(lambda x: stem in x.lower())
    return df[mask].to_html(escape=False)

In [102]:
# TEST
HTML(filter_projects(word='3D'))

Unnamed: 0,Project Name,Project Link
42,Mesh: Generating 3D Renderings from 2D Images,Paper Source
138,Improving Generalization Results for 3D Point Cloud Data Reconstruction From 2D Images,Paper Source
147,OSRSNet: Real-time Object Recognition in 3D MMORPGs,Paper Source
249,3d point cloud completion,Paper Source
251,3D object detection,Paper Source
346,Generative Lattice Structures for 3D Printing,Paper Source
418,Activity Recognition Using 3D Human Pose Estimation with Deep Learning,Paper Source
470,Predicting protein inhibition sites through point-cloud atomic encodings and 3D deep convolutional neural networks (Healthcare),Paper Source
497,2D to 3D Animation Style Transfer,Paper Source
498,3D Object Detection from Point Cloud,Paper Source
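
Because the filter is a substring test, a short query can also match inside a longer word (for example, `text` would match a title containing `context`). If that matters, a whole-word match is one option. The sketch below uses only the standard library and does no stemming; `matches_word` is a hypothetical helper, and the second title is invented for illustration:

```python
import re

def matches_word(title: str, word: str) -> bool:
    """True if word occurs in title as a whole word, ignoring case."""
    # re.escape protects queries that contain regex metacharacters
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

print(matches_word("3D object detection", "3d"))
# Invented title: "text" appears only inside "Context"
print(matches_word("Context-aware captioning", "text"))
```

The trade-off is that whole-word matching gives up the benefit of stemming: a query of `detection` would no longer match a title that only contains `detecting`.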


#### <font color='green'>II- Project Randomizer</font>

In [103]:
def random_project(word: str, k: int = 3) -> str:
    """
    Return k random projects whose name contains the word given by the user.

    Arguments:
    word -- string to search for, e.g. "detection".
    k -- number of projects to return.

    Returns:
    HTML table (string) with at most k matching projects.
    """
    stem = PorterStemmer().stem(word.lower())  # stem the query once, not once per row
    mask = df['Project Name'].apply(lambda x: stem in x.lower())
    matches = df[mask]
    # Cap k at the number of matches so sample() does not raise ValueError
    return matches.sample(n=min(k, len(matches))).to_html(escape=False)

In [104]:
# TEST
HTML(random_project(word='MRI', k=2))

Unnamed: 0,Project Name,Project Link
1337,Alzheimer's Disease Levels Diagnosis from Brain MRI Images,Paper Source
909,Through Thick and Thin: MRI Super-Resolution Using a Generative Adversarial Network,Paper Source
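
`DataFrame.sample` draws a different subset on each call. For a repeatable pick, one option is to fix a seed; pandas accepts a `random_state` argument on `sample` for this purpose. The sketch below shows the same idea with the standard library only, so it runs without pandas. `pick_random` is a hypothetical helper, and the short `titles` list stands in for the filtered project names:

```python
import random

def pick_random(items, k=3, seed=None):
    """Return up to k items chosen without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    k = min(k, len(items))     # avoid ValueError when there are fewer than k items
    return rng.sample(items, k)

# Small sample standing in for the filtered project names
titles = [
    "3d point cloud completion",
    "3D object detection",
    "2D to 3D Animation Style Transfer",
]
print(pick_random(titles, k=2, seed=0))
```

Leaving `seed` as `None` keeps the draw random on every call, which matches the notebook's current behavior.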
