<a href="https://colab.research.google.com/github/l-monninger/ahl/blob/main/exploration/final_project_ver0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Project Intro

Project explanation here

## 1. Setting

In [48]:
# Installation 

!pip install kaggle
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [49]:
# libraries
import json
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import * 

In [50]:
# Run this cell to mount your drive (you will be prompted to sign in)
from google.colab import drive
drive.mount('/content/drive')

!mkdir ~/.kaggle

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [51]:
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/

In [52]:
# start sparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

## 2. Data Wrangling


### 1) Arxiv Data - ahreum


#### a. Import Data

**Arxiv Data**
* id: ArXiv ID (can be used to access the paper, see below)
* submitter: Who submitted the paper
* authors: Authors of the paper
* title: Title of the paper
* comments: Additional info, such as number of pages and figures
* journal-ref: Information about the journal the paper was published in
* doi: [Digital Object Identifier](https://www.doi.org)
* abstract: The abstract of the paper
* categories: Categories / tags in the ArXiv system
* versions: A version history
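
As a rough illustration of how these fields fit together, the sketch below parses one JSON line shaped like a record in the snapshot and builds the abstract-page URL from its `id`. The sample record is abbreviated and only illustrative, and `abs_url` is a hypothetical helper, not part of any dataset tooling.

```python
import json

# One line of the snapshot file is one JSON object; this sample is
# abbreviated and only illustrative.
sample_line = '{"id": "0704.0001", "categories": "hep-ph", "authors_parsed": [["Balázs", "C.", ""]]}'

def abs_url(arxiv_id):
    """Hypothetical helper: URL of the abstract page for an arXiv id."""
    return f"https://arxiv.org/abs/{arxiv_id}"

record = json.loads(sample_line)
print(abs_url(record["id"]))          # https://arxiv.org/abs/0704.0001
print(record["categories"].split())   # ['hep-ph']
```

Reading with `primitivesAsString = True`, as in the loading cell, keeps ids such as `0704.0001` as strings, so the leading zero is preserved.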



In [None]:
!kaggle datasets download -d Cornell-University/arxiv
!unzip /content/arxiv.zip

Downloading arxiv.zip to /content
 99% 1.12G/1.14G [00:09<00:00, 112MB/s]
100% 1.14G/1.14G [00:09<00:00, 128MB/s]
Archive:  /content/arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [None]:
arxiv_df = spark.read.json("arxiv-metadata-oai-snapshot.json", multiLine = False, primitivesAsString = True)

In [None]:
arxiv_df.show()

+--------------------+--------------------+--------------------+-----------------+--------------------+--------------------+---------+--------------------+--------------------+--------------------+------------------+--------------------+-----------+--------------------+
|            abstract|             authors|      authors_parsed|       categories|            comments|                 doi|       id|         journal-ref|             license|           report-no|         submitter|               title|update_date|            versions|
+--------------------+--------------------+--------------------+-----------------+--------------------+--------------------+---------+--------------------+--------------------+--------------------+------------------+--------------------+-----------+--------------------+
|  A fully differe...|C. Bal\'azs, E. L...|[[Balázs, C., ], ...|           hep-ph|37 pages, 15 figu...|10.1103/PhysRevD....|0704.0001|Phys.Rev.D76:0130...|                null|    ANL-HEP

#### b. Authors Analysis

- Article count per author (counted over all authors of every paper, then joined back onto each paper by its first author)
- Collaboration flag (true/false)
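
The two steps above can be sketched in plain Python on a few made-up records. This is only an illustration of the logic, assuming `authors_parsed` holds `[last, first, suffix]` triples; the sample papers and the `join_name` helper are hypothetical.

```python
from collections import Counter

# Illustrative stand-ins for the `authors` / `authors_parsed` columns
papers = [
    {"authors": "Y. Zhang, G. Liu", "authors_parsed": [["Zhang", "Y.", ""], ["Liu", "G.", ""]]},
    {"authors": "Y. Zhang", "authors_parsed": [["Zhang", "Y.", ""]]},
    {"authors": "The CMS Collaboration", "authors_parsed": [["CMS Collaboration", "", ""]]},
]

def join_name(parts):
    """Join [last, first, suffix] into one string, mirroring concat_ws + trim."""
    return " ".join(parts).replace(",", "").strip()

# Count every author of every paper (the explode + groupBy step)
counts = Counter(join_name(a) for p in papers for a in p["authors_parsed"])

for p in papers:
    first = join_name(p["authors_parsed"][0])
    p["first_author"] = first
    p["article_count"] = counts[first]  # the left join on first author
    p["collab_tf"] = int("collaboration" in p["authors"].lower())

print([(p["first_author"], p["article_count"], p["collab_tf"]) for p in papers])
# [('Zhang Y.', 2, 0), ('Zhang Y.', 2, 0), ('CMS Collaboration', 1, 1)]
```

Note that counting by joined name merges different people who share a surname and initial, which is likely why names such as `Zhang Y.` dominate the counts.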

In [None]:
arxiv_df = arxiv_df.withColumn('first_author', regexp_replace(concat_ws(" ", element_at("authors_parsed", 1)), ",", ""))
arxiv_df = arxiv_df.withColumn("first_author", trim('first_author'))


In [None]:
author_flattened_ver = arxiv_df.selectExpr("*", "explode(authors_parsed) as authors_flat")
count_df = author_flattened_ver.withColumn('authors_join', regexp_replace(concat_ws(" ", "authors_flat"), ",", ""))
count_df = count_df.withColumn("authors_join", trim(count_df.authors_join))
count_df = count_df.groupBy("authors_join").agg(count('*').alias('article_counts_1st'))
count_df = count_df.orderBy('article_counts_1st', ascending= False)
count_df.show()

+-------------------+-------------+
|       authors_join|article_count|
+-------------------+-------------+
|           Zhang Y.|         2187|
|            Wang Y.|         1536|
|            Wang J.|         1433|
|              Li Y.|         1359|
|           Liu Yang|         1260|
|           Zhang L.|         1247|
|           Wang Wei|         1196|
|             Liu X.|         1195|
|             Gao Y.|         1181|
|  CMS Collaboration|         1175|
|  Taniguchi Takashi|         1165|
|            Wang Z.|         1139|
|     Watanabe Kenji|         1126|
|ATLAS Collaboration|         1073|
|            Yang Z.|         1038|
|             Sun L.|         1015|
|          Zhang Lei|          982|
|              Xu Z.|          936|
|        Krokovny P.|          936|
|          Zhang Wei|          935|
+-------------------+-------------+
only showing top 20 rows



In [None]:
count_df.filter(col('authors_join') == 'Zhang Y.').select('article_counts_1st').show()

+-------------+
|article_count|
+-------------+
|         2187|
+-------------+



In [None]:
arxiv_df.filter(col('first_author')=='Zhang Y.').show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+---------+--------------------+--------------------+-----------+--------------------+----------+
|            abstract|             authors|      authors_parsed|          categories|            comments|                 doi|       id|         journal-ref|             license|report-no|           submitter|               title|update_date|            versions|1st_author|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+---------+--------------------+--------------------+-----------+--------------------+----------+
|  We investigate ...|Y. Zhang, G. Liu,...|[[Zhang, Y., ], [...|cond-mat.mes-hall...|                null|                null|0708.3118|                null|              

In [None]:
arxiv_df = arxiv_df.join(count_df, arxiv_df.first_author == count_df.authors_join, "left")
arxiv_df.show(10)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+-------------+
|            abstract|             authors|      authors_parsed|          categories|            comments|                 doi|              id|         journal-ref|             license|           report-no|           submitter|               title|update_date|            versions|        first_author|        authors_join|article_count|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------

In [None]:
arxiv_df = arxiv_df.withColumn('collab_tf', when(lower(col('authors')).contains('collaboration'), 1).otherwise(0))

In [None]:
arxiv_df.groupBy('collab_tf').agg(count('*').alias('count')).show()

+---------+-------+
|collab_tf|  count|
+---------+-------+
|        1|  24873|
|        0|2206644|
+---------+-------+



#### c. Category Name Mapping

**Category name mapping**

- The category names were meant to be parsed from the official arXiv taxonomy page, "https://arxiv.org/category_taxonomy".
**Parsing that page is currently blocked, so we are trying another approach.**
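
Wherever the names end up coming from, the mapping itself can be sketched as a prefix lookup: the part of a category id before the first `.` identifies the archive. The small dictionary and the helper names below are assumptions for illustration only.

```python
# A small subset of archive prefixes, for illustration only
main_names = {
    "cs": "Computer Science",
    "hep-ph": "High Energy Physics - Phenomenology",
    "cond-mat": "Condensed Matter",
}

def main_category(cat_id):
    """Map e.g. 'cs.AI' -> 'Computer Science'; fall back to the raw id if unknown."""
    prefix = cat_id.split(".")[0]
    return main_names.get(prefix, cat_id)

def main_categories(categories):
    """Map a space-separated `categories` string to de-duplicated main names."""
    out = []
    for cat_id in categories.split():
        name = main_category(cat_id)
        if name not in out:
            out.append(name)
    return out

print(main_categories("cond-mat.mes-hall cond-mat.str-el cs.AI"))
# ['Condensed Matter', 'Computer Science']
```

Falling back to the raw id keeps unknown or retired categories visible instead of dropping them silently.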


In [None]:
arxiv_df.withColumn('categories_parsed', concat_ws(" ", "categories")).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+------------------+---------+--------------------+
|            abstract|             authors|      authors_parsed|          categories|            comments|                 doi|              id|         journal-ref|             license|           report-no|           submitter|               title|update_date|            versions|        first_author|        authors_join|article_counts_1st|collab_tf|   categories_parsed|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-

In [None]:
# Archive prefix -> main category name
cat_map_dict = {
  'cs': 'Computer Science',
  'econ': 'Economics',
  'eess': 'Electrical Engineering and Systems Science',
  'math': 'Mathematics',
  'astro-ph': 'Astrophysics',
  'cond-mat': 'Condensed Matter',
  'gr-qc': 'General Relativity and Quantum Cosmology',
  'hep-ex': 'High Energy Physics - Experiment',
  'hep-lat': 'High Energy Physics - Lattice',
  'hep-ph': 'High Energy Physics - Phenomenology',
  'hep-th': 'High Energy Physics - Theory',
  'math-ph': 'Mathematical Physics',
  'nlin': 'Nonlinear Sciences',
  'nucl-ex': 'Nuclear Experiment',
  'nucl-th': 'Nuclear Theory',
  'physics': 'Physics',
  'quant-ph': 'Quantum Physics',
  'q-bio': 'Quantitative Biology',
  'q-fin': 'Quantitative Finance',
  'stat': 'Statistics',
}

The parsing-based attempts are below.

In [None]:
import urllib.request as libreq

method_name = 'query'
parameters = 'id:all'
url = f"http://export.arxiv.org/api/{method_name}?{parameters}"

# Fetch the raw Atom feed; bind the response to its own name
# so that the URL string is not shadowed
with libreq.urlopen(url) as response:
  r = response.readlines()
print(r)


[b'<?xml version="1.0" encoding="UTF-8"?>\n', b'<feed xmlns="http://www.w3.org/2005/Atom">\n', b'  <link href="http://arxiv.org/api/query?search_query%3D%26id_list%3D%26start%3D0%26max_results%3D10" rel="self" type="application/atom+xml"/>\n', b'  <title type="html">ArXiv Query: search_query=&amp;id_list=&amp;start=0&amp;max_results=10</title>\n', b'  <id>http://arxiv.org/api/iQlyOhQ3W8t8kI903+StmrBZNWM</id>\n', b'  <updated>2023-04-15T00:00:00-04:00</updated>\n', b'  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:totalResults>\n', b'  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>\n', b'  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>\n', b'</feed>\n']


In [None]:
import requests
from lxml import etree

url = "https://arxiv.org/category_taxonomy"

response = requests.get(url)
# print(response)
categories = pd.DataFrame(columns=['cat_id', 'main_category','sub_category'])
if response.status_code == 200:
    parser = etree.HTMLParser()
    tree = etree.fromstring(response.content, parser=parser)

    for x, y in zip(tree.xpath("//*[@id='category_taxonomy_list']//h4/text()"), tree.xpath("//*[@id='category_taxonomy_list']//h4/span/text()")):
      cat_id = x.strip()
      sub_cat = y.strip("()").strip()
      big_cat = cat_id.split(".")[0]
      categories = pd.concat([categories, pd.DataFrame({'cat_id': [cat_id], 'main_category' : [big_cat], 'sub_category': [sub_cat]})], ignore_index=True)
    
    big_category = tree.xpath("//*[@id='category_taxonomy_list']/h2/text()") + tree.xpath("//*[@id='category_taxonomy_list']//h3/text()")
categories

Unnamed: 0,cat_id,main_category,sub_category


In [None]:
import urllib.request as libreq
import feedparser


# Base api query url
base_url = 'http://export.arxiv.org/api/query?'
search_query = 'search_query=all:electron' # search for electron in all fields

# perform a GET request using the base_url and query
response = libreq.urlopen(base_url+search_query).read()

# parse the response using feedparser
feed = feedparser.parse(response)

# print out feed information
print('Feed title: %s' % feed.feed.title)
print('Feed last updated: %s' % feed.feed.updated)
for entry in feed.entries:
  print('Primary Category: %s' % entry.tags[0]['term'])
  
  # Let's get all the categories
  all_categories = [t['term'] for t in entry.tags]
  print(all_categories)

Feed title: ArXiv Query: search_query=all:electron&amp;id_list=&amp;start=0&amp;max_results=10
Feed last updated: 2023-04-15T00:00:00-04:00
Primary Category: cond-mat.str-el
['cond-mat.str-el']
Primary Category: astro-ph
['astro-ph']
Primary Category: cond-mat.str-el
['cond-mat.str-el']
Primary Category: physics.space-ph
['physics.space-ph']
Primary Category: cond-mat.supr-con
['cond-mat.supr-con', 'cond-mat.mes-hall']
Primary Category: astro-ph
['astro-ph']
Primary Category: astro-ph
['astro-ph']
Primary Category: cond-mat.str-el
['cond-mat.str-el']
Primary Category: cond-mat.quant-gas
['cond-mat.quant-gas', 'cond-mat.mtrl-sci', 'cond-mat.str-el']
Primary Category: cond-mat.str-el
['cond-mat.str-el']


In [None]:
# reference (arXiv API example, ported to Python 3)

import urllib.request
import feedparser

# Base api query url
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = 'id:704.000' # search by arXiv id
start = 0                     # retrieve the first 5 results
max_results = 5

query = 'search_query=%s&start=%i&max_results=%i' % (search_query,
                                                     start,
                                                     max_results)

# Opensearch metadata such as totalResults, startIndex, 
# and itemsPerPage live in the opensearch namespace.
# Some entry metadata lives in the arXiv namespace.
# This is a hack to expose both of these namespaces in
# feedparser v4.1. It is guarded because newer feedparser
# releases may not expose _FeedParserMixin at the top level.
mixin = getattr(feedparser, '_FeedParserMixin', None)
if mixin is not None:
    mixin.namespaces['http://a9.com/-/spec/opensearch/1.1/'] = 'opensearch'
    mixin.namespaces['http://arxiv.org/schemas/atom'] = 'arxiv'

# perform a GET request using the base_url and query
response = urllib.request.urlopen(base_url+query).read()

# parse the response using feedparser
feed = feedparser.parse(response)

# print out feed information
print('Feed title: %s' % feed.feed.title)
print('Feed last updated: %s' % feed.feed.updated)

# print opensearch metadata
print('totalResults for this query: %s' % feed.feed.opensearch_totalresults)
print('itemsPerPage for this query: %s' % feed.feed.opensearch_itemsperpage)
print('startIndex for this query: %s'   % feed.feed.opensearch_startindex)

# Run through each entry, and print out information
for entry in feed.entries:
    print('e-print metadata')
    print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
    print('Published: %s' % entry.published)
    print('Title:  %s' % entry.title)
    
    # feedparser v4.1 only grabs the first author
    author_string = entry.author
    
    # grab the affiliation in <arxiv:affiliation> if present
    # - this will only grab the first affiliation encountered
    #   (the first affiliation for the first author)
    # Please email the list with a way to get all of this information!
    try:
        author_string += ' (%s)' % entry.arxiv_affiliation
    except AttributeError:
        pass
    
    print('Last Author:  %s' % author_string)
    
    # feedparser v5.0.1 correctly handles multiple authors, print them all
    try:
        print('Authors:  %s' % ', '.join(author.name for author in entry.authors))
    except AttributeError:
        pass

    # get the links to the abs page and pdf for this e-print
    for link in entry.links:
        if link.rel == 'alternate':
            print('abs page link: %s' % link.href)
        elif link.get('title') == 'pdf':
            print('pdf link: %s' % link.href)
    
    # The journal reference, comments and primary_category sections live under 
    # the arxiv namespace
    try:
        journal_ref = entry.arxiv_journal_ref
    except AttributeError:
        journal_ref = 'No journal ref found'
    print('Journal reference: %s' % journal_ref)
    
    try:
        comment = entry.arxiv_comment
    except AttributeError:
        comment = 'No comment found'
    print('Comments: %s' % comment)
    
    # Since the <arxiv:primary_category> element has no data, only
    # attributes, feedparser does not store anything inside
    # entry.arxiv_primary_category
    # This is a dirty hack to get the primary_category, just take the
    # first element in entry.tags.  If anyone knows a better way to do
    # this, please email the list!
    print('Primary Category: %s' % entry.tags[0]['term'])
    
    # Let's get all the categories
    all_categories = [t['term'] for t in entry.tags]
    print('All Categories: %s' % (', ').join(all_categories))
    
    # The abstract is in the <summary> element
    print('Abstract: %s' %  entry.summary)

In [None]:
categories['main_category'].unique()

array([], dtype=object)

In [None]:
big_category

['Computer Science',
 'Economics',
 'Electrical Engineering and Systems Science',
 'Mathematics',
 'Physics',
 'Quantitative Biology',
 'Quantitative Finance',
 'Statistics',
 'Astrophysics',
 'Condensed Matter',
 'General Relativity and Quantum Cosmology',
 'High Energy Physics - Experiment',
 'High Energy Physics - Lattice',
 'High Energy Physics - Phenomenology',
 'High Energy Physics - Theory',
 'Mathematical Physics',
 'Nonlinear Sciences',
 'Nuclear Experiment',
 'Nuclear Theory',
 'Physics',
 'Quantum Physics']

In [None]:
categories['main_category'] = categories['main_category'].apply(lambda x : cat_map_dict[x])

In [None]:
categories.set_index('cat_id', inplace = True)

In [None]:
categories_dict = categories.to_dict(orient='index')

{'cs.AI': {'main_category': 'Computer Science',
  'sub_category': 'Artificial Intelligence'},
 'cs.AR': {'main_category': 'Computer Science',
  'sub_category': 'Hardware Architecture'},
 'cs.CC': {'main_category': 'Computer Science',
  'sub_category': 'Computational Complexity'},
 'cs.CE': {'main_category': 'Computer Science',
  'sub_category': 'Computational Engineering, Finance, and Science'},
 'cs.CG': {'main_category': 'Computer Science',
  'sub_category': 'Computational Geometry'},
 'cs.CL': {'main_category': 'Computer Science',
  'sub_category': 'Computation and Language'},
 'cs.CR': {'main_category': 'Computer Science',
  'sub_category': 'Cryptography and Security'},
 'cs.CV': {'main_category': 'Computer Science',
  'sub_category': 'Computer Vision and Pattern Recognition'},
 'cs.CY': {'main_category': 'Computer Science',
  'sub_category': 'Computers and Society'},
 'cs.DB': {'main_category': 'Computer Science', 'sub_category': 'Databases'},
 'cs.DC': {'main_category': 'Computer

In [None]:
# make categories list first
# arxiv_df is a Spark DataFrame, so use withColumn + split rather than pandas-style apply
arxiv_df = arxiv_df.withColumn('categories_parsed', split(col('categories'), ' '))

In [None]:
def cat_main_name(list_x):
  # Map category ids to de-duplicated main category names;
  # ids missing from categories_dict are kept as-is
  out_list = []
  for a in list_x:
    try:
      if categories_dict[a]['main_category'] not in out_list:
        out_list.append(categories_dict[a]['main_category'])
    except KeyError:
      out_list.append(a)
  return out_list

def cat_sub_name(list_x):
  # Same as above, but for sub-category names
  out_list = []
  for a in list_x:
    try:
      if categories_dict[a]['sub_category'] not in out_list:
        out_list.append(categories_dict[a]['sub_category'])
    except KeyError:
      out_list.append(a)
  return out_list

In [None]:
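# Wrap the Python functions as UDFs so they can run on Spark columns
cat_main_name_udf = udf(cat_main_name, ArrayType(StringType()))
cat_sub_name_udf = udf(cat_sub_name, ArrayType(StringType()))
arxiv_df = arxiv_df.withColumn('cat_main_name', cat_main_name_udf('categories_parsed'))
arxiv_df = arxiv_df.withColumn('cat_sub_name', cat_sub_name_udf('categories_parsed'))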

#### d. Title & Abstract Key-Word Extraction
* by NLP

### 2) Patent Data - Liam

#### Loading the data
The Kaggle dataset is split across multiple files, so we'll pass a glob pattern to the loader to read everything under `train/train`.
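
To get a feel for what a pattern like `train/train/*/data*` selects, here is a small sketch using Python's standard `glob` module on a throwaway directory. The shard names are assumptions modeled on the unzip output, and Spark resolves the pattern itself, so this only illustrates the matching idea.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree; the file names are assumptions,
# not a listing of the real dataset.
root = tempfile.mkdtemp()
for sub, name in [("a", "data000000000000"), ("b", "data000000000000"), ("b", "notes.txt")]:
    os.makedirs(os.path.join(root, "train", "train", sub), exist_ok=True)
    open(os.path.join(root, "train", "train", sub, name), "w").close()

# "*" matches any single directory level, "data*" any file starting with "data"
pattern = os.path.join(root, "train", "train", "*", "data*")
matches = sorted(os.path.relpath(p, root) for p in glob.glob(pattern))
print(matches)
```

Only the two `data...` files are matched; `notes.txt` is skipped because it does not start with `data`.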

In [56]:
!kaggle datasets download -d jackchungchiehyu/big-patent
!unzip /content/big-patent.zip
!ls

big-patent.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/big-patent.zip
replace train/train/a/data000000000000? [y]es, [n]o, [A]ll, [N]one, [r]ename: big-patent.zip	drive  sample_data  spark-warehouse  train  val


In [57]:
patents_df = spark.read.json("train/train/*/data*")

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.9/dist-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: ignored

#### Data structure
The columns we use are `abstract`, `publication_number`, and `description`. They are all strings.

In [None]:
patents_df.show()
patents_df

#### Tokenizing
We'll start by getting non-stopwords and named entities.
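
As a minimal sketch of what "non-stopwords" means here, the snippet below filters a sentence with a tiny hand-written stopword set and a regular expression. Both are stand-ins chosen for illustration; the notebook itself relies on NLTK's tokenizer and English stopword list.

```python
import re

# Tiny illustrative stopword list, not NLTK's
STOPWORDS = {"a", "an", "the", "of", "for", "is", "and"}

def simple_tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

print(simple_tokens("A method for cooling the engine of a vehicle"))
# ['method', 'cooling', 'engine', 'vehicle']
```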

In [None]:
import nltk
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, ArrayType, FloatType

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon')
# ne_chunk also needs a POS tagger and the NE chunker models
# (resource names as used by NLTK 3.8; newer releases may name them differently)
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
stopwords = set(nltk.corpus.stopwords.words('english'))

def tokenize_content(content):
  # Lowercased alphabetic tokens with English stopwords removed
  tokens = nltk.word_tokenize(content)
  words = [
    token.lower() 
    for token in tokens 
      if token.isalpha() and token.lower() not in stopwords
  ]
  return words

tokenize_content_udf = udf(tokenize_content, ArrayType(StringType()))

def get_sentence_ne(sentence):
  # ne_chunk expects POS-tagged tokens, not a raw string
  chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
  return [ 
      " ".join(w for w, t in elt) 
      for elt in chunked if isinstance(elt, nltk.Tree) 
  ]

def get_ne(content):
  # Named entities from every sentence, flattened into one list
  sentences = nltk.sent_tokenize(content)
  sentence_nes = [get_sentence_ne(sentence) for sentence in sentences]
  return [ne for nes in sentence_nes for ne in nes]


get_ne_udf = udf(get_ne, ArrayType(StringType()))

#### Abstract
We'll work over the abstract first because it is smaller.

In [None]:
patents_df = patents_df.withColumn("abstract_tokens", tokenize_content_udf("abstract"))
patents_df.show()

In [None]:
patents_df = patents_df.withColumn("abstract_ne", get_ne_udf("abstract"))
patents_df.show()

#### Description
The description is the body of the patent. It's a larger body of text, so we'll work on it after the abstract.

In [None]:
patents_df = patents_df.withColumn("description_tokens", tokenize_content_udf("description"))
patents_df.show()

In [None]:
patents_df = patents_df.withColumn("description_ne", get_ne_udf("description"))
patents_df.show()

#### Counts
We'll extract the top 10 terms from the abstract as a counts column.
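
The counting step relies on `collections.Counter`. A small sketch on made-up tokens shows what `most_common` returns, and why the input should be a token list rather than the raw text:

```python
from collections import Counter

tokens = ["valve", "pump", "valve", "seal", "pump", "valve"]

# most_common(n) returns (token, count) pairs sorted by descending count
print(Counter(tokens).most_common(2))   # [('valve', 3), ('pump', 2)]
top = [token for token, _ in Counter(tokens).most_common(2)]

# Caution: a raw string is iterated character by character
char_top = Counter("abca").most_common(1)
print(char_top)                          # [('a', 2)]
```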

In [69]:
from collections import Counter

def get_top(tokens, *, num : int = 10):
  return [
      token
      for token, _ in Counter(tokens).most_common(num)
  ]

get_top_10_udf = udf(get_top, ArrayType(StringType()))

['e', 'b', 'a']


In [None]:
# Count over the token list; counting over the raw string would count characters
patents_df = patents_df.withColumn("top_tokens", get_top_10_udf("abstract_tokens"))
patents_df.show()

#### Sentence BERT
Sentence BERT is like word2vec for sentences: it maps each sentence to an embedding vector, so we can use it to get embeddings for our abstracts.
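
One simple way to turn per-sentence embeddings into a single vector for an abstract is to average them element-wise. The sketch below uses plain lists and made-up three-dimensional vectors, not real Sentence BERT output, and `mean_vector` is a hypothetical helper.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (one vector per sentence)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up "sentence embeddings", for illustration only
sentence_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
print(mean_vector(sentence_vectors))  # [2.0, 1.0, 1.0]
```

Averaging keeps the output dimension fixed no matter how many sentences an abstract has, which makes the result easy to store as one array column.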

In [None]:
!pip install torch
!pip install sentence-transformers

In [None]:
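from sentence_transformers import SentenceTransformer
sbert = SentenceTransformer('bert-base-nli-mean-tokens')

def encode_abstract(content):
  # Mean of the per-sentence embeddings, returned as a plain list of floats for Spark
  sentences = nltk.sent_tokenize(content)
  if not sentences:
    return None
  return sbert.encode(sentences).mean(axis=0).tolist()

# The embedding is a vector, so the UDF returns an array of floats
encode_abstract_udf = udf(encode_abstract, ArrayType(FloatType()))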

In [None]:
patents_df = patents_df.withColumn("abstract_encoding", encode_abstract_udf("abstract"))
patents_df.show()

We'll try the same thing with just the top tokens, treating them as if they were a single sentence.


In [None]:
def encode_top(top_tokens):
  # Encode the joined tokens as one sentence and return the whole embedding vector
  return sbert.encode(" ".join(top_tokens)).tolist()

encode_top_udf = udf(encode_top, ArrayType(FloatType()))

In [None]:
patents_df = patents_df.withColumn("top_encoding", encode_top_udf("top_tokens"))
patents_df.show()

## Modeling

### Knowledge Graph?

### NLP pre-processing

### Regression model 

## Interface - liam