# BASALT 2022 MineDojo Wiki Search Engine

## Background
I am working on potential solutions for the NeurIPS 2022 MineRL BASALT competition (https://www.aicrowd.com/challenges/neurips-2022-minerl-basalt-competition). The goal of this search engine is to quickly find the wiki articles most relevant to a given item. It scores every article with PySpark's TF-IDF implementation (HashingTF followed by IDF) and ranks the articles by the score of the search term. Queries run quickly on the 6,281 data files crawled here.

## Data Mining
To get the data for this script, I downloaded the MineDojo Wiki Dataset and crawled it for data.json files.

## Model Code

### Imports

In [1]:
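import json
import os

import pandas as pd
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

# Reuse the notebook's Spark session if one exists, otherwise start one
spark = SparkSession.builder.getOrCreate()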

### Methods

In [2]:
def fast_scandir(dirname):
    """Recursively collect the paths of all subfolders under dirname."""
    subfolders = [f.path for f in os.scandir(dirname) if f.is_dir()]
    for subfolder in list(subfolders):
        subfolders.extend(fast_scandir(subfolder))
    return subfolders

### Crawl Dataset

In [3]:
wikiPath = "/Volumes/Extreme SSD/Extra Datasets/wiki_full/"
subfolders = fast_scandir(wikiPath)

In [4]:
dataFiles = []
for folder in subfolders:
    dataFile = os.path.join(folder, "data.json")
    if os.path.exists(dataFile):
        dataFiles.append(dataFile)
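
The two cells above (the recursive `fast_scandir` helper and the existence check) can also be collapsed into a single pass with `os.walk` from the standard library. The sketch below is an alternative, not the code that produced the results in this notebook; `find_data_files` is a hypothetical helper name.

```python
import os

def find_data_files(root, filename="data.json"):
    """Return sorted paths of every `filename` found in a subfolder of `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Skip the root folder itself: the notebook only checks subfolders.
        if os.path.abspath(dirpath) == os.path.abspath(root):
            continue
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)
```

With the notebook's variables, `dataFiles = find_data_files(wikiPath)` would be the equivalent call. Unlike `os.scandir` with `is_dir()`, `os.walk` does not follow symbolic links by default, so the two approaches can differ on datasets that contain linked folders.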

In [5]:
print(len(dataFiles))
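dataDict = {"Location": [], "Text": []}
for file in dataFiles:
    # Page location relative to the dataset root, with folders joined by "--"
    relDir = os.path.relpath(os.path.dirname(file), wikiPath)
    location = "--".join(relDir.split(os.sep))
    fullText = ""
    with open(file, 'r', encoding="utf-8") as jsonFile:
        data = json.load(jsonFile)
    try:
        for text in data['texts']:
            fullText += text['text'] + "\n"
    except (KeyError, TypeError):
        print(file, "Failed to Load Texts")
    try:
        for image in data['images']:
            altText = image['alt_text']
            # Skip missing alt text, whether stored as JSON null or the string "null"
            if altText and altText != "null":
                fullText += altText + "\n"
    except (KeyError, TypeError):
        print(file, "Failed to Load Images")
    try:
        for table in data['tables']:
            # Append header and cell text only when there is any
            headerText = "\n".join(table['headers']['text'])
            if headerText:
                fullText += headerText + "\n"
            cellText = "\n".join(table['cells']['text'])
            if cellText:
                fullText += cellText + "\n"
    except (KeyError, TypeError):
        print(file, "Failed to Load Tables")
    dataDict["Location"].append(location)
    # Flatten each page to one line and drop commas to keep the CSV simple
    dataDict["Text"].append(fullText.replace("\n", " ").replace(",", ""))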

6281
/Volumes/Extreme SSD/Extra Datasets/wiki_full/.minecraft/data.json Failed to Load Texts
/Volumes/Extreme SSD/Extra Datasets/wiki_full/.minecraft/path/data.json Failed to Load Texts
/Volumes/Extreme SSD/Extra Datasets/wiki_full/Mods/TheGunMod/.GUN2_File_Format/data.json Failed to Load Texts


In [6]:
rawData = pd.DataFrame.from_dict(dataDict)
rawData.to_csv(wikiPath + "page_text.csv", index=False)

### Open Text Data with PySpark

In [7]:
rawData = spark.read.csv(wikiPath + "page_text.csv")
articles = rawData.toDF("Title", "Document")
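
`page_text.csv` is written by pandas with a header row, and pandas escapes a double quote inside a field by doubling it, while Spark's CSV reader defaults to `header=False` and a backslash escape character. That mismatch is a likely reason the `Location`/`Text` header shows up as a data row and many documents start with a stray `"` in the output that follows. A minimal sketch of a reader that accounts for both, assuming an active `SparkSession` is passed in; `read_page_text` is a hypothetical helper and was not used to produce the outputs in this notebook:

```python
def read_page_text(spark, csv_path):
    """Read the pandas-written CSV into a (Title, Document) DataFrame."""
    return (
        spark.read.csv(
            csv_path,
            header=True,  # first row holds the column names, not a wiki page
            quote='"',
            escape='"',   # pandas doubles embedded quotes instead of backslash-escaping them
        )
        .toDF("Title", "Document")
    )
```

Usage would be `articles = read_page_text(spark, wikiPath + "page_text.csv")`. Row counts, tokens, and scores further down would likely shift slightly if the data were read this way.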

In [8]:
articles.printSchema()

root
 |-- Title: string (nullable = true)
 |-- Document: string (nullable = true)



In [9]:
articles.show()

+--------------------+--------------------+
|               Title|            Document|
+--------------------+--------------------+
|            Location|                Text|
|  Launcher_2.1.1350x|Launcher 2.1.1350...|
|Bedrock_Edition_b...|"Bedrock Edition ...|
|Xbox_360_Edition_...|"Xbox 360 Edition...|
|     Launcher_1.6.19|Launcher 1.6.19 1...|
|            Badlands|"Badlands The bad...|
|      Banner_Pattern|"Banner Pattern B...|
|Reinforced_Deepslate|"Reinforced Deeps...|
|                Well|Well Well may ref...|
|Bedrock_Edition_1...|"Bedrock Edition ...|
|Bedrock_Dedicated...|Bedrock Dedicated...|
|            Breaking|"Breaking This ar...|
| Java_Edition_18w03b|"Java Edition 18w...|
|Xbox_One_Edition_...|"Xbox One Edition...|
|Bedrock_Dedicated...|Bedrock Dedicated...|
|    Pillager_Outpost|"Pillager Outpost...|
|     Water_Breathing|"Water Breathing ...|
|Java_Edition_1.10...|Java Edition 1.10...|
|Bedrock_Edition_b...|"Bedrock Edition ...|
|     Adam_Martinsson|"Adam Mart

### Clean Dataset

In [10]:
articles.filter(articles.Document.isNull()).count()

2

In [11]:
cleanedArticles = articles.filter(articles.Document.isNotNull())
cleanedArticles.filter(cleanedArticles.Document.isNull()).count()

0

In [12]:
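# Lowercase each document and split it on whitespace
tokenizer = Tokenizer(inputCol="Document", outputCol="words")
wordsData = tokenizer.transform(cleanedArticles)

# Hash each token into a fixed-size term-count vector (262144 buckets by default)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
featurizedData = hashingTF.transform(wordsData)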
featurizedData.show()

+--------------------+--------------------+--------------------+--------------------+
|               Title|            Document|               words|         rawFeatures|
+--------------------+--------------------+--------------------+--------------------+
|            Location|                Text|              [text]|(262144,[143985],...|
|  Launcher_2.1.1350x|Launcher 2.1.1350...|[launcher, 2.1.13...|(262144,[12524,27...|
|Bedrock_Edition_b...|"Bedrock Edition ...|["bedrock, editio...|(262144,[619,784,...|
|Xbox_360_Edition_...|"Xbox 360 Edition...|["xbox, 360, edit...|(262144,[3449,592...|
|     Launcher_1.6.19|Launcher 1.6.19 1...|[launcher, 1.6.19...|(262144,[12524,27...|
|            Badlands|"Badlands The bad...|["badlands, the, ...|(262144,[238,535,...|
|      Banner_Pattern|"Banner Pattern B...|["banner, pattern...|(262144,[702,8254...|
|Reinforced_Deepslate|"Reinforced Deeps...|["reinforced, dee...|(262144,[702,1512...|
|                Well|Well Well may ref...|[well, well
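
`Tokenizer` lowercases each document and splits it on whitespace, and `HashingTF` maps every token to one of 262,144 buckets (2^18, the default `numFeatures`) and counts occurrences per bucket. The pure-Python sketch below illustrates the idea only: it uses `zlib.crc32` as a stand-in hash, whereas Spark uses MurmurHash3, so the bucket indices will not match the ones printed above. `tokenize` and `hashing_tf` are hypothetical names.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size shown in the output above

def tokenize(text):
    """Lowercase and split on whitespace; punctuation stays attached to tokens."""
    return text.lower().split()

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Return {bucket_index: term_count} using the hashing trick."""
    counts = Counter()
    for token in tokens:
        # crc32 is only a stand-in; Spark's HashingTF uses MurmurHash3.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)

tokens = tokenize('"Badlands The badlands biome')
vector = hashing_tf(tokens)
print(tokens)
print(vector)
```

Two consequences are worth keeping in mind. A leading quote, as in `["bedrock, ...` above, stays part of the token, so `"bedrock` and `bedrock` are counted as different terms. And because the vector has a fixed size, unrelated words can occasionally share a bucket.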

### Starting IDF Algorithm

In [13]:
# Fit IDF weights over the whole corpus, then rescale each raw term-count vector
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

In [14]:
rescaledData.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|               Title|            Document|               words|         rawFeatures|            features|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|            Location|                Text|              [text]|(262144,[143985],...|(262144,[143985],...|
|  Launcher_2.1.1350x|Launcher 2.1.1350...|[launcher, 2.1.13...|(262144,[12524,27...|(262144,[12524,27...|
|Bedrock_Edition_b...|"Bedrock Edition ...|["bedrock, editio...|(262144,[619,784,...|(262144,[619,784,...|
|Xbox_360_Edition_...|"Xbox 360 Edition...|["xbox, 360, edit...|(262144,[3449,592...|(262144,[3449,592...|
|     Launcher_1.6.19|Launcher 1.6.19 1...|[launcher, 1.6.19...|(262144,[12524,27...|(262144,[12524,27...|
|            Badlands|"Badlands The bad...|["badlands, the
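
Spark's `IDF` estimator weights a term t as log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing t; the `features` column is the raw count multiplied by that weight. A pure-Python sketch of the same formula on a toy corpus follows. It works on words rather than hashed buckets, and `idf_weights` and `tf_idf` are hypothetical names.

```python
import math
from collections import Counter

def idf_weights(docs):
    """docs: list of token lists. Returns {term: log((m + 1) / (df + 1))}."""
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

def tf_idf(doc, weights):
    """Raw term count times IDF weight for each term in one document."""
    return {term: count * weights[term] for term, count in Counter(doc).items()}

# Toy corpus, made up for illustration only.
docs = [
    ["tree", "oak", "tree"],
    ["oak", "sapling"],
    ["oak", "biome"],
]
weights = idf_weights(docs)
scores = tf_idf(docs[0], weights)
print(scores)
```

Here `oak` appears in every document, so its weight is log(4/4) = 0 and it contributes nothing to a search. `tree` appears in one of three documents, so each occurrence is worth log(4/2), and the first document scores twice that.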

### Testing With Search Term

In [15]:
# One-row DataFrame holding the tokenized query; tokens must be lowercase to match the Tokenizer output
schema = StructType([StructField("words", ArrayType(StringType()))])

df = spark.createDataFrame([(["tree"],)], schema)
df.show()

+------+
| words|
+------+
|[tree]|
+------+



In [16]:
queryFeatures = hashingTF.transform(df)
queryFeatures.show()

+------+--------------------+
| words|         rawFeatures|
+------+--------------------+
|[tree]|(262144,[193711],...|
+------+--------------------+



In [17]:
featureVec = queryFeatures.select('rawFeatures').collect()
print(featureVec)

[Row(rawFeatures=SparseVector(262144, {193711: 1.0}))]


In [18]:
treeID = int(featureVec[0].rawFeatures.indices[0])
print(treeID)

193711


In [19]:
# Read the TF-IDF weight stored at the query term's hash index from each feature vector
termExtractor = udf(lambda x: float(x[treeID]), FloatType())

treeDF = rescaledData.withColumn('score', termExtractor(rescaledData.features))
treeDF.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
|               Title|            Document|               words|         rawFeatures|            features|score|
+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
|            Location|                Text|              [text]|(262144,[143985],...|(262144,[143985],...|  0.0|
|  Launcher_2.1.1350x|Launcher 2.1.1350...|[launcher, 2.1.13...|(262144,[12524,27...|(262144,[12524,27...|  0.0|
|Bedrock_Edition_b...|"Bedrock Edition ...|["bedrock, editio...|(262144,[619,784,...|(262144,[619,784,...|  0.0|
|Xbox_360_Edition_...|"Xbox 360 Edition...|["xbox, 360, edit...|(262144,[3449,592...|(262144,[3449,592...|  0.0|
|     Launcher_1.6.19|Launcher 1.6.19 1...|[launcher, 1.6.19...|(262144,[12524,27...|(262144,[12524,27...|  0.0|
|            Badlands|"Badlands The bad...|["badlands, the, ...|(262144,[238,535,...|(262144,[23

In [20]:
sortedResults = (
    treeDF.filter("score > 0")
    .orderBy("score", ascending=False)
    .select("Title", "score")
)
sortedResults.show(truncate=100)

+---------------------------------------+---------+
|                                  Title|    score|
+---------------------------------------+---------+
|                                   Tree|320.97998|
|                Tutorials--Tree_farming|229.72096|
|                                    Oak|94.405876|
|                        Tree--Structure|94.405876|
|                            Jungle_tree| 88.11215|
|                                  Birch| 78.67156|
|                                Sapling|62.937252|
|                                 Spruce|59.790386|
|                                 Acacia|56.643524|
|                     Biome--Before_1.18|47.202938|
|                     Configured_feature|37.762352|
|                               Dark_oak|31.468626|
|                            Azalea_tree|31.468626|
|          Java_Edition_removed_features|28.321762|
|          Tutorials--Superflat_survival|28.321762|
|       Tutorials--Best_biomes_for_homes|28.321762|
|           
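
Each score in the table above is the raw count of the `tree` bucket in that page multiplied by a single IDF weight, so the scores should all be near-integer multiples of one value. The check below uses scores copied from the table. The document count of 6,280 (6,281 files plus the header row, minus 2 null documents) and the document frequency of 269 are assumptions inferred from those numbers, not values printed by the notebook.

```python
import math

# Scores copied from the results table above.
scores = {
    "Tree": 320.97998,
    "Tutorials--Tree_farming": 229.72096,
    "Oak": 94.405876,
    "Jungle_tree": 88.11215,
    "Birch": 78.67156,
    "Sapling": 62.937252,
    "Dark_oak": 31.468626,
}

# Assumed inputs (not printed by the notebook):
# m = rows the IDF model was fitted on, df = rows containing the "tree" bucket.
m, df = 6280, 269
idf = math.log((m + 1) / (df + 1))  # Spark's smoothed IDF formula

# Implied raw count of the "tree" bucket in each page.
implied_counts = {title: score / idf for title, score in scores.items()}
for title, count in implied_counts.items():
    print(f"{title}: {count:.3f}")
```

Under these assumptions the Tree page contains the token `tree` about 102 times and the Oak page about 30 times. Only exact whitespace-delimited matches are counted, so forms such as `trees` or `tree.` fall into other buckets.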