# First Step - Crawling 

In this notebook we crawl the Kaggle website.
We use Selenium because the site renders its pages with JavaScript.
The crawler fetches data from 3 main sources:
1. From each listing page of the website we fetch the link to every dataset.
2. From each dataset we fetch the following data:
    1. The dataset's title -> the title of the dataset
    2. The dataset's subtitle -> the subtitle of the dataset
    3. The dataset's author -> the name of the author who created the dataset
    4. The dataset's version -> the number of the latest version of the dataset
    5. The dataset's date of last update -> how long ago the author last updated the dataset
    6. The dataset's rating -> the number of upvotes the dataset has received
    7. The dataset's usability -> the quality of the dataset's documentation and tables
    8. The dataset's number of files -> the number of files the dataset contains
    9. The dataset's size -> the total size of all the files
    10. The dataset's views -> the number of people who visited the dataset's page
    11. The dataset's downloads -> the number of people who downloaded the dataset's files
    12. The dataset's notebooks -> the notebook count shown on the dataset's page
    13. The dataset's topics -> the number of open discussions about the dataset
3. From each dataset's author we fetch the following data:
    1. The author's experience -> the author's seniority on the site
    2. The author's following -> the number of people the author follows
    3. The author's followers -> the number of people who follow the author
    4. The author's discussions -> the number of discussions the author started or participated in
    5. The author's competitions -> the number of competitions the author has entered
    6. The author's location -> the geographic location of the author
    7. The author's code -> the number of code contributions the author has made to the site
    8. The author's datasets -> the number of datasets the author owns
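
Because the pages are rendered by JavaScript, an element may not exist yet at the moment a page load returns. The sketch below is a minimal, standard-library illustration of the polling idea behind an explicit wait: it retries a lookup until it succeeds or a timeout expires. `wait_for` and the `FakePage` stand-in are hypothetical names, not part of Selenium or of this notebook; Selenium ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_for(lookup, timeout=10.0, interval=0.5):
    """Call lookup() repeatedly until it returns without raising.

    Hypothetical helper: returns the lookup result, or re-raises the last
    error once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return lookup()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


class FakePage:
    """Stand-in for a page whose content appears only after a few polls."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def find_title(self):
        self.calls += 1
        if self.calls < self.ready_after:
            raise LookupError("element not rendered yet")
        return "US Public Food Assistance"


page = FakePage(ready_after=3)
title = wait_for(page.find_title, timeout=2.0, interval=0.01)
print(title, page.calls)
```

With a real driver, the lookup would be a callable such as `lambda: driver.find_element(By.XPATH, ...)`, which could stand in for a fixed `time.sleep(3)` pause.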
    

# Imports

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time

# Our main crawling function
### Parameters:
* num_of_page_start - the first listing page to crawl
* num_of_page_end - the last listing page to crawl (inclusive)
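
Both bounds are inclusive, since the function loops over `range(num_of_page_start, num_of_page_end+1)`. As a small illustration, the hypothetical helper `listing_urls` (not part of the notebook) lists the pages that a given pair of arguments would visit, using the same URL pattern as the crawler:

```python
def listing_urls(num_of_page_start, num_of_page_end):
    """Return the listing-page URLs for an inclusive page range."""
    return [
        f"https://www.kaggle.com/datasets?page={page}"
        for page in range(num_of_page_start, num_of_page_end + 1)
    ]


urls = listing_urls(2, 4)
for url in urls:
    print(url)
```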

In [5]:

def load_data_of_database(num_of_page_start,num_of_page_end):
    #
    # 1.we created a list for each type of data (column)
    #
    dataSetTitle = []
    dataSetSubTitle = []
    dataSetAuthorName = []
    dataSetVersion = []
    dataSetDateNum = []
    dataSetDateNumType = []
    dataSetRating = []
    dataSetUsability = []
    dataSetFileCount = []
    dataSetFileSize = []
    dataSetFileSizeType = []
    dataSetViewCount = []
    dataSetDownloadNum = []
    dataSetNotebookNum = []
    dataSetTopicNum = []
    dataSetAuthorDiscussionCount = []
    dataSetAuthorCompetitiveCount = []
    dataSetAuthorDatasetCount = []
    dataSetAuthorCodeCount = []
    dataSetAuthorFollowers = []
    dataSetAuthorFollowing = []
    dataSetAuthorLocation = []
    dataSetAuthorExperienceNum = []
    dataSetAuthorExperienceType = []
    
    for page in range(num_of_page_start,num_of_page_end+1):
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        #
        # 2.for each page we fetch the links of the datasets and put them into the 'links' list.
        #
        driver.get(f'https://www.kaggle.com/datasets?page={page}')
        time.sleep(3)
        data = driver.find_element(By.XPATH, "//ul[@class='km-list km-list--three-line']").find_elements(By.XPATH, './/li')
        # the first <a> inside the first <div> of each list item holds the dataset link
        links = []
        for li in data:
            links.append(li.find_elements(By.XPATH, './/div')[0].find_element(By.XPATH, './/a').get_attribute('href'))
        #
        #3. for each link we start to fetch the data from the dataset page
        #
        for link in links:
            driver.get(link)
            time.sleep(3)
            #####################################################THE DATASET DATA##############################
            title = driver.find_element(By.XPATH,"//h1[@class='dataset-header-v2__title']").text #The Dataset Title
            try:
                #The Dataset subTitle
                subtitle = driver.find_element(By.XPATH,"//h2[@class='dataset-header-v2__subtitle']").text
            except Exception:
                subtitle=None
            author = driver.find_element(By.XPATH,"//div[@class='dataset-header-v2__details']")
            # an XPath starting with '//' searches the whole page, so the lookup is made on the driver directly
            authorLink = driver.find_element(By.XPATH,"//a[@class='dataset-header-v2__owner-name']") #The link to the author page
            authorName = author.text.split("•")[0].split("\n")[0] # The Author Name
            try:
                date = author.text.split("•")[1].split("(")
                version = date[1].split(")")[0].split(" ")[1] # The version of the dataset
                date = date[0].split("  updated ")[1].split(" ")
                # the time since the dataset's last update: 'a'/'an' means 1, otherwise drop the plural 's' from the unit
                if date[0] in ('a', 'an'):
                    dateNum=1
                    dateNumType = date[1]
                else:
                    dateNum=date[0]
                    dateNumType=date[1][:-1]
            except Exception:
                dateNum=0
                dateNumType=None
                version=1
            data = driver.find_element(By.XPATH, "//ul[@class='horizontal-list']").find_elements(By.XPATH, './/li')
            rating = driver.find_element(By.XPATH,"//span[@role='button']").text #the number of Upvotes 
            usability = driver.find_element(By.XPATH,"//p[@data-test='rating']").text #the usability rank
            try:
                file = driver.find_elements(By.XPATH,"//h6")[1].find_element(By.XPATH,'..//p').text.split(" ")
                #number of files
                fileCount = driver.find_elements(By.XPATH,"//h6")[2].find_element(By.XPATH,'..//p').text.split(" ")[0]
                fileSize = file[0] #the size of all the files
                fileSizeType = file[1] # the size unit of the files -> kb, mb, gb, ...
            except Exception:
                fileCount=0
                fileSize=None
                fileSizeType =None
            viewC = data[0].text.split(" ")[0] #the number of people who viewed the dataset
            downloadNum=data[1].text.split(" ")[0] #the number of people who downloaded the dataset
            notebookNum=data[2].text.split(" ")[0] #the number of notebooks listed for the dataset
            topicNum = data[3].text.split(" ")[0] #the number of topics of the dataset
            #
            #4. switching to the author page to fetch the author's data
            #
            authorLink.click()
            time.sleep(2)
            #####################################################THE AUTHOR DATA##############################
            followData = driver.find_elements(By.XPATH,".//div[@class='profile__user-followers-item']")
            userData = driver.find_element(By.XPATH,".//div[@class='pageheader__nav-wrapper']")    
            try:
                #the number of datasets that the author owns
                authorDatasetCount = userData.find_element(By.XPATH,".//a[@title='datasets']").find_element(By.XPATH,".//span[@class='pageheader__link-count']").find_element(By.XPATH,".//span").text
                rankOfAuthor = driver.find_element(By.XPATH,".//a[@title='Progression']").find_elements(By.XPATH,".//p")[1].text # the rank of the author
                authorExperience = driver.find_element(By.XPATH,".//p[@class='profile__user-metadata']").find_element(By.XPATH,".//span").text.split(" ") #the experience of the author
                authorExperienceNum =authorExperience[0] #the number of days/months/years of experience
                authorExperienceType = authorExperience[1] #the unit of the experience -> days/months/years
            except Exception:
                authorDatasetCount = None
                rankOfAuthor =None
                authorExperienceNum = 0
                authorExperienceType =None
            try:
                #the number of competitions that the author has entered
                authorCompetitiveCount = userData.find_element(By.XPATH,".//a[@title='competitions']").find_element(By.XPATH,".//span[@class='pageheader__link-count']").find_element(By.XPATH,".//span").text
            except Exception:
                authorCompetitiveCount=0
            try:
                #the number of code contributions that the author made to the site
                authorCodeCount = userData.find_element(By.XPATH,".//a[@title='code']").find_element(By.XPATH,".//span[@class='pageheader__link-count']").find_element(By.XPATH,".//span").text
            except Exception:
                authorCodeCount=0
            try:
                #the number of discussions that the author has joined
                authorDiscussionCount = userData.find_element(By.XPATH,".//a[@title='discussion']").find_element(By.XPATH,".//span[@class='pageheader__link-count']").find_element(By.XPATH,".//span").text
            except Exception:
                authorDiscussionCount =0
            try:
                authorFollowers = followData[0].text.split("s")[1]
            except Exception:
                authorFollowers=0
            try:
                authorFollowing = followData[1].text.split("g")[1]
            except Exception:
                authorFollowing=0
            try:
                authorLocation = driver.find_element(By.XPATH,".//p[@class='profile__user-location']").text # the location of the author
            except Exception:
                authorLocation=None
                
            ##############################adding to the lists##########################
            #
            #5. adding the data of the current dataset into their list
            #
            dataSetTitle.append(title)
            dataSetSubTitle.append(subtitle)
            dataSetAuthorName.append(authorName)
            dataSetVersion.append(version)
            dataSetDateNum.append(dateNum)
            dataSetDateNumType.append(dateNumType)
            dataSetRating.append(rating)
            dataSetUsability.append(usability)
            dataSetFileCount.append(fileCount)
            dataSetFileSize.append(fileSize)
            dataSetFileSizeType.append(fileSizeType)
            dataSetViewCount.append(viewC)
            dataSetDownloadNum.append(downloadNum)
            dataSetNotebookNum.append(notebookNum)
            dataSetTopicNum.append(topicNum)
            dataSetAuthorDiscussionCount.append(authorDiscussionCount)
            dataSetAuthorCompetitiveCount.append(authorCompetitiveCount)
            dataSetAuthorDatasetCount.append(authorDatasetCount)
            dataSetAuthorCodeCount.append(authorCodeCount)
            dataSetAuthorFollowers.append(authorFollowers)
            dataSetAuthorFollowing.append(authorFollowing)
            dataSetAuthorLocation.append(authorLocation)
            dataSetAuthorExperienceNum.append(authorExperienceNum)
            dataSetAuthorExperienceType.append(authorExperienceType)
        # close this page's browser; a new one is started for the next page
        driver.quit()
    #
    #6. converting the lists into a DataFrame
    #
    df= pd.DataFrame({"Title":dataSetTitle,"SubTitle":dataSetSubTitle,"Version":dataSetVersion,"Date Num":dataSetDateNum,"Date Type":dataSetDateNumType,
        "Usability":dataSetUsability,"Rating":dataSetRating,"Views":dataSetViewCount,"Downloads":dataSetDownloadNum,
        "Notebooks":dataSetNotebookNum,"Topics":dataSetTopicNum,"Number Of Files":dataSetFileCount,"File Size":dataSetFileSize,
        "File Size Type":dataSetFileSizeType,"Author":dataSetAuthorName,"Location":dataSetAuthorLocation,
        "Experience Num":dataSetAuthorExperienceNum,"Experience Num Type":dataSetAuthorExperienceType,"Followers":dataSetAuthorFollowers,
        "Following":dataSetAuthorFollowing,"Owned Datasets":dataSetAuthorDatasetCount,"Code Helper":dataSetAuthorCodeCount,
        "Discussion":dataSetAuthorDiscussionCount,"Competitions":dataSetAuthorCompetitiveCount})
    #
    #7. exporting the DataFrame into a csv file
    #
    df.to_csv('All_Data_Stored_From_Kaggle.csv',index=False)
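
The version and date parsing inside the function relies on chained `split` calls. A regular expression can express the same extraction in one place. The sketch below assumes the header text looks like `"  updated 4 years ago (Version 2)"` after the author name; that shape is inferred from the `split` calls and is not verified against the live page. `parse_header_details` is a hypothetical helper, and unlike the crawler it converts the numbers to `int`.

```python
import re

# Assumed shape of the text after the author name, e.g.
# "  updated 4 years ago (Version 2)". This format is inferred, not verified.
HEADER_PATTERN = re.compile(
    r"updated\s+(?P<num>an?|\d+)\s+(?P<unit>[a-z]+?)s?\s+ago"
    r"(?:\s*\(Version\s+(?P<version>\d+)\))?"
)


def parse_header_details(text):
    """Return (version, date_num, date_unit) with the crawler's fallbacks."""
    match = HEADER_PATTERN.search(text)
    if match is None:
        # Same defaults the crawler uses when parsing fails.
        return 1, 0, None
    num = match.group("num")
    date_num = 1 if num in ("a", "an") else int(num)
    version = int(match.group("version") or 1)
    return version, date_num, match.group("unit")


print(parse_header_details("  updated 4 years ago (Version 2)"))
```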

# Starting the crawl on pages 2 - 160

In [3]:
load_data_of_database(2,160)
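
A crawl over 159 listing pages takes a long time, and the function writes the CSV only once, after the last page, so an exception midway loses everything collected so far. One possible mitigation, sketched here with the standard-library `csv` module, is to append each record to the file as soon as it is scraped. `append_row`, the three-column subset, and the file name `partial_crawl.csv` are hypothetical and chosen only for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["Title", "Author", "Version"]  # illustrative subset of the columns


def append_row(path, row, fields=FIELDS):
    """Append one record to a CSV file, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demonstration in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "partial_crawl.csv")
    append_row(path, {"Title": "Things on Reddit", "Author": "Aleksey Bilogur", "Version": 1})
    append_row(path, {"Title": "Marketing Analytics", "Author": "Jack Daoud", "Version": 1})
    with open(path, newline="", encoding="utf-8") as handle:
        saved = list(csv.DictReader(handle))

print(len(saved), saved[0]["Title"])
```

Appending per record trades a little extra file I/O for the ability to resume or inspect a partial crawl.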

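# The dataframe of all the data
Importing the data from the CSV file. The values are stored as scraped, so `Experience Num` still holds the literal `a` for authors whose experience reads "a year" (rows 2732 and 2735 below).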

In [4]:
df = pd.read_csv('All_Data_Stored_From_Kaggle.csv')
df

| | Title | SubTitle | Version | Date Num | Date Type | Usability | Rating | Views | Downloads | Notebooks | ... | Author | Location | Experience Num | Experience Num Type | Followers | Following | Owned Datasets | Code Helper | Discussion | Competitions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US Public Food Assistance | Where does it come from, who spends it, who ge... | 8 | 1 | year | 9.1 | 367 | 92967 | 15336 | 1771 | ... | JohnM | Fort Worth, Texas, United States | 7 | years | 1092 | 229 | 28 | 44 | 930 | 129 |
| 1 | Kepler Exoplanet Search Results | 10000 exoplanet candidates examined by the Kep... | 2 | 4 | year | 8.2 | 639 | 112406 | 9760 | 1460 | ... | NASA | | 0 | | 0 | 0 | | 0 | 0 | 0 |
| 2 | Things on Reddit | The top 100 products in each subreddit from 20... | 1 | 4 | year | 5.9 | 204 | 56658 | 8014 | 1513 | ... | Aleksey Bilogur | New York, New York, United States | 5 | years | 1602 | 30 | 44 | 230 | 675 | 1 |
| 3 | 18,393 Pitchfork Reviews | Pitchfork reviews from Jan 5, 1999 to Jan 8, 2017 | 1 | 5 | year | 7.1 | 364 | 65139 | 9916 | 1753 | ... | Nolan Conaway | New York, New York, United States | 5 | years | 10 | 0 | 5 | 9 | 6 | 1 |
| 4 | Animal Crossing New Horizons Catalog | A comprehensive inventory of ACNH items, villa... | 3 | 7 | month | 8.2 | 10356 | 145172 | 12857 | 12 | ... | Jessica Li | New York, New York, United States | 4 | years | 926 | 17 | 29 | 21 | 258 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2732 | Drug Seizues annually since 1970s | seizues of drugs from 1970s to pre covid period | 1 | 2 | month | 10.0 | 11 | 1414 | 191 | 3 | ... | Ram Jas Maurya | New Delhi, Delhi, India | a | year | 35 | 17 | 47 | 18 | 233 | 1 |
| 2733 | Fashion Anchor Cloth Pairs | Over 76k human-outfit item pairs for 5 categories | 5 | 1 | month | 6.9 | 9 | 341 | 44 | 0 | ... | Kritanjali Jain | Mumbai, Maharashtra, India | 2 | years | 27 | 3 | 6 | 14 | 8 | 1 |
| 2734 | Prediction of music genre | Classify music into genres | 1 | 3 | month | 10.0 | 37 | 13243 | 1603 | 9 | ... | gaoyuan | Wellington, Wellington, New Zealand | 5 | months | 1 | 0 | 1 | 1 | 1 | 3 |
| 2735 | Marketing Analytics | Practice Exploratory and Statistical Analysis ... | 1 | 1 | year | 10.0 | 278 | 106585 | 12168 | 67 | ... | Jack Daoud | Boston, Massachusetts, United States | a | year | 25 | 81 | 4 | 1 | 28 | 0 |
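
The scraped counts and units are stored as text, and the sample output shows `a` in `Experience Num` for authors with "a year" of experience. A possible cleaning step, sketched below in plain Python, maps `a`/`an` to 1 and strips the plural from the unit. `normalize_duration` is a hypothetical helper, not part of the notebook.

```python
def normalize_duration(num, unit):
    """Return (int_count, singular_unit) for a scraped duration pair.

    Hypothetical cleaning helper: "a"/"an" become 1, a trailing plural "s"
    is removed from the unit, and a missing unit yields (0, None).
    """
    # Anything that is not a non-empty string (None, NaN, "") counts as missing.
    if not isinstance(unit, str) or not unit.strip():
        return 0, None
    text = str(num).strip().lower()
    count = 1 if text in ("a", "an") else int(text)
    unit = unit.strip().lower()
    if unit.endswith("s"):
        unit = unit[:-1]
    return count, unit


# Values taken from the sample rows above.
samples = [("7", "years"), ("a", "year"), ("5", "months"), (0, None)]
for num, unit in samples:
    print((num, unit), "->", normalize_duration(num, unit))
```

The same function could be applied to the `Experience Num` and `Experience Num Type` columns before any numeric analysis.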
