## Resources
- LLM papers
- Extract and save PDFs: arXiv API
- PDF to text: https://github.com/karpathy/arxiv-sanity-preserver/blob/master/parse_pdf_to_text.py
- llama_index

## Playground

In [11]:
import arxiv

# Look up two papers by their arXiv IDs
search = arxiv.Search(id_list=["2303.18223", "1910.10683"])
paper = next(search.results())
print(paper.title)

A Survey of Large Language Models


In [19]:
for paper in search.results():
    print(paper.title)

A Survey of Large Language Models
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer


In [16]:
paper = next(search.results())

In [18]:
paper.title

'A Survey of Large Language Models'

In [10]:
paper.download_pdf()

'./2303.18223v11.A_Survey_of_Large_Language_Models.pdf'
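The path returned above appears to follow an `<id>v<version>.<title>.pdf` pattern. As a minimal sketch, assuming that pattern holds for other downloads too (it is inferred from this single output), the ID and version can be recovered from the file name. `parse_download_name` is a hypothetical helper, not part of the `arxiv` package.

```python
import re
from pathlib import Path


def parse_download_name(path: str):
    """Split '<id>v<version>.<title>.pdf' into (arxiv_id, version, title).

    Assumes the naming pattern seen in the download output above.
    """
    match = re.match(r"(\d+\.\d+)v(\d+)\.(.+)\.pdf$", Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    arxiv_id, version, title = match.groups()
    # Underscores in the file name stand in for spaces in the title
    return arxiv_id, int(version), title.replace("_", " ")


print(parse_download_name("./2303.18223v11.A_Survey_of_Large_Language_Models.pdf"))
```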

In [14]:
from arxiv import SortCriterion, SortOrder

In [27]:
search = arxiv.Search(
    query="llm",
    max_results=10,
    sort_by=SortCriterion.SubmittedDate,
    sort_order=SortOrder.Descending,
)

In [28]:
for result in search.results():
    print(result.title)

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
An Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures
MetRoBERTa: Leveraging Traditional Customer Relationship Management Data to Develop a Transit-Topic-Aware Language Model
Extrapolating Large Language Models to Non-English by Aligning Languages
LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
LLaMA-E: Empowering E-commerce Authoring with Multi-Aspect Instruction Following
No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT
CLEVA: Chinese Language Models EVAluation Platform
Adaptive Intellect Unleashed: The Feasibility of Knowledge Transfer in Large Language Models
Universal Fuzzing via Large Language Models


## Extract links from md

In [1]:
with open("assets/llm_survey.md") as file:
    item = file.read()

In [2]:
len(item)

87696

In [3]:
import markdown
import re
md_file = markdown.markdown(item)


In [4]:
# Escape the dot in "arxiv.org" so it only matches a literal "."
links_to_scrap = re.findall(r'https://arxiv\.org/abs/\d+\.\d+', md_file)

In [5]:
links_to_scrap

['https://arxiv.org/abs/2303.18223',
 'https://arxiv.org/abs/1910.10683',
 'https://arxiv.org/abs/2010.11934',
 'https://arxiv.org/abs/2104.12369',
 'https://arxiv.org/abs/2106.10715',
 'https://arxiv.org/abs/2110.08207',
 'https://arxiv.org/abs/2204.06745',
 'https://arxiv.org/abs/2203.13474',
 'https://arxiv.org/abs/2204.07705',
 'https://arxiv.org/abs/2205.05131',
 'https://arxiv.org/abs/2205.01068',
 'https://arxiv.org/abs/2207.04672',
 'https://arxiv.org/abs/2211.05100',
 'https://arxiv.org/abs/2210.02414',
 'https://arxiv.org/abs/2210.11416',
 'https://arxiv.org/abs/2211.01786',
 'https://arxiv.org/abs/2211.09085',
 'https://arxiv.org/abs/2211.01786',
 'https://arxiv.org/abs/2212.12017',
 'https://arxiv.org/abs/2304.01373',
 'https://arxiv.org/abs/2302.13971',
 'https://arxiv.org/abs/2303.17568',
 'https://arxiv.org/abs/2005.14165',
 'https://arxiv.org/abs/2201.08239',
 'https://arxiv.org/abs/2109.04650',
 'https://arxiv.org/abs/2107.03374',
 'https://arxiv.org/abs/2107.02137',
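The list above contains a repeated entry (`2211.01786` appears twice), so the same paper would be requested more than once. A minimal sketch of extracting the IDs and dropping repeats while keeping first-seen order; `unique_arxiv_ids` and the `sample` string are hypothetical and only illustrate the idea.

```python
import re

# New-style arXiv abstract links, e.g. https://arxiv.org/abs/2303.18223
ARXIV_ABS = re.compile(r"https://arxiv\.org/abs/(\d+\.\d+)")


def unique_arxiv_ids(text: str) -> list:
    """Return the arXiv IDs found in text, first occurrence only, in order."""
    # dict.fromkeys preserves insertion order and discards later duplicates
    return list(dict.fromkeys(ARXIV_ABS.findall(text)))


sample = (
    "[a](https://arxiv.org/abs/1910.10683) "
    "[b](https://arxiv.org/abs/2211.01786) "
    "[c](https://arxiv.org/abs/2211.01786)"
)
print(unique_arxiv_ids(sample))
```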
 

In [11]:
from llama_index import SimpleDirectoryReader, VectorStoreIndex

In [5]:
reader = SimpleDirectoryReader(input_dir="./textdata/")

In [8]:
docs = reader.load_data()

In [12]:
index = VectorStoreIndex.from_documents(docs)

In [15]:
from llama_index import ServiceContext, LLMPredictor, OpenAIEmbedding, PromptHelper
from llama_index.llms import OpenAI
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
embed_model = OpenAIEmbedding()
node_parser = SimpleNodeParser(
  text_splitter=TokenTextSplitter(chunk_size=512, chunk_overlap=100)
)
# prompt_helper = PromptHelper(
#   context_window=4096, 
#   num_output=256, 
#   chunk_overlap_ratio=0.1, 
#   chunk_size_limit=None
# )

service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  node_parser=node_parser,
)
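To make the `chunk_size=512, chunk_overlap=100` settings above concrete, here is a simplified sketch of overlapping windows over a token list. It is not llama_index's implementation: `chunk_tokens` is a hypothetical helper, and the real splitter works on tokenizer output and respects separators.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=100):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"tok{i}" for i in range(1000)]  # stand-in for real tokens
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 176]
```

Each window starts 412 tokens after the previous one, so neighbouring chunks repeat 100 tokens of context.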

In [17]:
query_engine = index.as_query_engine(service_context=service_context)

In [35]:
response = query_engine.query("alternatives to Chinchilla scaling laws?")

## Functions

In [16]:
import os
import re
from glob import glob

import arxiv
import markdown
from PyPDF2 import PdfReader
from tqdm import tqdm


def extract_arxivlinks_md(filename: str) -> list:
    """Return the arXiv IDs of all abs links found in a markdown file."""
    with open(filename) as file:
        item = file.read()
    md_file = markdown.markdown(item)
    links_to_scrap = re.findall(r'https://arxiv\.org/abs/\d+\.\d+', md_file)
    # Keep only the trailing ID, e.g. "2303.18223"
    return [link.split('/')[-1] for link in links_to_scrap]


def download_arxiv(id_list: list, dirpath="./"):
    """Download the PDF of every paper in id_list into dirpath."""
    search = arxiv.Search(id_list=id_list)
    print(f"Downloading {len(id_list)} papers..")
    # search.results() is a generator, so pass total for a proper progress bar
    for paper in tqdm(search.results(), total=len(id_list)):
        paper.download_pdf(dirpath)
    print("DONE..let's rock!")


def pdf_to_text(from_dir: str, to_dir: str):
    """Extract the text of each PDF in from_dir into a same-named file in to_dir."""
    pdfs = glob(f"{from_dir}/*.pdf")
    print("Converting pdf to texts...")
    for pdf_path in tqdm(pdfs):
        pdf_reader = PdfReader(pdf_path)
        text = ""
        for page in pdf_reader.pages:
            text = text + "\n\n" + page.extract_text()
        # Output name is the PDF file name without its ".pdf" extension
        name = os.path.join(to_dir, os.path.basename(pdf_path)[:-4])
        with open(name, "w") as out_file:
            out_file.write(text)

    print("Done...")


In [17]:
# filename = "assets/llm_survey.md"
# files = extract_arxivlinks_md(filename)
# download_arxiv(files[:3],dirpath="./pdfdata")
pdf_to_text("./pdfdata","./textdata")

Converting pdf to texts...


100%|████████████████████████████████████████████████████| 3/3 [00:07<00:00,  2.36s/it]

Done...



