In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
import utils

torch.cuda.empty_cache()

In [2]:
model = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [3]:
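def prompt_llama2(prompt):
    # Llama 2 chat layout, built line by line so that the source indentation
    # of this function does not leak extra spaces into the prompt.
    prompt_template = (
        "[INST] <<SYS>>\n"
        "You are a helpful, respectful and honest assistant.\n"
        "<</SYS>>\n\n"
        f"{prompt} [/INST]"
    )

    sequences = pipeline(
        prompt_template,
        do_sample=True,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        #max_length=2048,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        repetition_penalty=1.1,
    )
    result = sequences[0]['generated_text']
    # The pipeline returns prompt + completion; keep the text after the last [/INST].
    return result.split('[/INST]')[-1].strip()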

In [3]:
urls = ['https://cs.illinois.edu/research/areas/artificial-intelligence',
        'https://cs.illinois.edu/research/undergraduate-research',
        'https://cs.illinois.edu/corporate#research']
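The third URL ends in a `#research` fragment. Fragments are resolved on the client side and are not sent to the server, so a scraper that fetches this URL receives the whole `/corporate` page rather than only the research section. The sketch below separates the fetchable URL from the fragment using the standard library; `split_fragments` is a hypothetical helper name, and how `utils.extract_all_text` treats fragments is not shown in this notebook.

```python
from urllib.parse import urldefrag

urls = [
    'https://cs.illinois.edu/research/areas/artificial-intelligence',
    'https://cs.illinois.edu/research/undergraduate-research',
    'https://cs.illinois.edu/corporate#research',
]

def split_fragments(url_list):
    """Return (url_without_fragment, fragment) pairs for each URL."""
    return [tuple(urldefrag(u)) for u in url_list]

targets = split_fragments(urls)
for page_url, fragment in targets:
    # An empty fragment means the URL addresses the whole page.
    print(page_url, '| fragment:', repr(fragment))
```

If only the fragment's section is wanted, the scraper would have to locate the element with that id in the fetched HTML itself.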

### Prompt with new scraping method

In [4]:
text = utils.extract_all_text(urls[0])
print(text)



In [6]:
extraction_prompt = f"""Here is the text that I scraped from a webpage: {text}

I've scraped the text from each header and paragraph together into one large string.
Imagine you are a person reading this webpage. Which sections of the page are related to the main content of the page? 
Please list those sections and all of their respective text and output in markdown format:"""

In [7]:
output = prompt_llama2(extraction_prompt)

In [8]:
print(output)

Sure! As a person reading this webpage, the main content seems to be related to the Artificial Intelligence group at the University of Illinois, their research focus, strengths, and accomplishments. Here are the relevant sections and their corresponding text in markdown format:

### Main Content

The Artificial Intelligence group at Illinois is strong, diverse, and growing. It combines expertise in core strengths with promising new research directions.

* **Research Focus**: The group is studying theoretical foundations of deep and reinforcement learning, developing novel models and algorithms for deep neural networks, and exploring issues related to scalability, security, privacy, and fairness of learning systems.
* **Strengths and Impact**: The group's research has been recognized with various awards, and their faculty members have received numerous best paper awards.
* **Group Research**: The group is involved in two $20 million AI institutes recently funded by the National Science 

In [7]:
extraction_prompt_example = f"""Here is the text that I scraped from a webpage: {text}

I've scraped the text from each header and paragraph together into one large string.
Imagine you are a person reading this webpage. Which sections of the page are related to the main content of the page? 
Please list those sections with all of their respective text and output in markdown format:

For example:

### Section 1
[display all of the text corresponding to section 1, enclosed in square brackets]

### Section 2
[display all of the text corresponding to section 2, enclosed in square brackets]

and so on. Display information only present in the given text, and do not generate new information.
"""

In [10]:
output = prompt_llama2(extraction_prompt_example)

In [11]:
print(output)

Sure! I can help you with that.

The main content of the page appears to be the various research groups and their respective faculty members, along with some related news and awards.

Here are the relevant sections and their corresponding text in markdown format:

### Section 1: Research Groups

[List of research groups with their respective faculty members and areas of expertise]

### Section 2: Research Awards

[List of awards and honors received by the faculty members]

### Section 3: Group Research

[Description of startups founded by faculty members and their research efforts]

### Section 4: Faculty & Affiliate Faculty

[List of faculty members and their areas of expertise]

### Section 5: Related News

[List of recent news items related to the research being done by the faculty members]

So, the main content of the page is the research being done by the faculty members, their expertise, and their awards, along with the startups they have founded and the recent news related to th
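The prompt asks for text that is present in the page, yet the bracketed bodies above read like descriptions rather than quotes. One way to check this automatically is to split the output on its `###` headings and test whether each body occurs verbatim in the scraped text. This is a minimal sketch under that assumption; `split_sections` and `ungrounded_sections` are hypothetical helper names, and the exact-substring test is deliberately strict.

```python
import re

def split_sections(markdown):
    """Map each '### heading' to the text that follows it."""
    sections = {}
    heading = None
    for line in markdown.splitlines():
        match = re.match(r'^###\s+(.*)$', line)
        if match:
            heading = match.group(1).strip()
            sections[heading] = []
        elif heading is not None:
            sections[heading].append(line)
    return {h: '\n'.join(body).strip() for h, body in sections.items()}

def normalize(s):
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return ' '.join(s.split()).lower()

def ungrounded_sections(markdown, source_text):
    """Return headings whose body does not occur verbatim in source_text."""
    source = normalize(source_text)
    missing = []
    for heading, body in split_sections(markdown).items():
        body = normalize(body.strip('[]'))  # drop the enclosing square brackets
        if body and body not in source:
            missing.append(heading)
    return missing

# Short excerpts taken from the outputs in this notebook.
sample_output = """### Section 1: Research Groups

[List of research groups with their respective faculty members and areas of expertise]

### Strengths and Impact

[The AI group at Illinois is strong, diverse, and growing.]
"""
sample_source = (
    "Strengths and Impact\n"
    "The AI group at Illinois is strong, diverse, and growing. "
    "It combines expertise in core strengths with promising new research directions."
)
print(ungrounded_sections(sample_output, sample_source))
```

In a notebook session the same call would be `ungrounded_sections(output, text)`.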

### Instruction-tuned model

In [8]:
# `model` held the Llama 2 checkpoint name above; it is rebound to the Mistral model here.
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [9]:
messages = [
    {"role": "user", "content": extraction_prompt_example},
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to('cuda')
model.to('cuda')

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRM

In [14]:
# An explicit attention mask and pad token id avoid the generation warning.
generated_ids = model.generate(input_ids, attention_mask=torch.ones_like(input_ids),
                               pad_token_id=tokenizer.eos_token_id, max_new_tokens=2000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)


In [37]:
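# Decode only the newly generated tokens rather than slicing by a character offset.
new_tokens = generated_ids[0][input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))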

### Introduction

Artificial Intelligence (AI) is the study of systems that behave intelligently and recognize leaders in several key areas, including computer vision, machine listening, natural language processing, machine learning, and robotics.

### Strengths and Impact

The AI group at Illinois has grown, diversified, and has many key strengths and impact points, including research focus, research awards, group research, and research efficiencies.

### Research Focus

Each research focus in AI includes detailed descriptions of the latest research findings and innovations being made by the AI group at Illinois.

### Research Efforts and Groups

The AI group has several research efforts, including seminars, mailing lists, external speaker series, and faculty and affiliate research projects.

### Faculty

The AI group includes several faculty members, including both computer science and affiliated faculty, who are listed in a table. Each faculty member is associated with one or more r

### Evaluating extraction

In [15]:
import pandas as pd

In [26]:
df = pd.read_csv('extraction_dataset.csv')
df.head()

Unnamed: 0,url,text
0,https://cs.illinois.edu/research/areas/artific...,The study of systems that behave intelligently...
1,https://cs.illinois.edu/research/areas/data-an...,The rapid growth of big data creates unprecede...
2,https://cs.illinois.edu/academics/undergraduat...,Undergraduate majors course registration proce...
3,https://cs.illinois.edu/student-life/student-o...,"At Illinois, there's a lot of life outside the..."
4,https://cs.illinois.edu/academics/graduate,Graduate Degrees and Programs\nPhD Program\nAd...


In [27]:
text = utils.extract_all_text(df['url'][0])
print(text)



In [28]:
import spacy

In [29]:
#!python -m spacy download en_core_web_lg

In [30]:
nlp = spacy.load('en_core_web_lg')
text1 = 'Machine learning'
text2 = 'computer science'
doc1 = nlp(text1)
doc2 = nlp(text2)
print(doc1.similarity(doc2))

0.6181662612518235
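By default, spaCy's `Doc.similarity` is the cosine similarity of the two document vectors, and a document vector is the average of its token vectors. The sketch below reproduces that computation in plain Python to make the behaviour visible. The three-dimensional word vectors are made up for illustration and are not taken from `en_core_web_lg`.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, for illustration only.
toy_vectors = {
    'machine': [0.9, 0.1, 0.3],
    'learning': [0.7, 0.4, 0.2],
    'computer': [0.8, 0.2, 0.5],
    'science': [0.5, 0.6, 0.1],
}
doc_a = mean_vector([toy_vectors['machine'], toy_vectors['learning']])
doc_b = mean_vector([toy_vectors['computer'], toy_vectors['science']])
print(round(cosine(doc_a, doc_b), 3))
```

Because averaging discards word order and smooths over individual words, two long documents on the same topic tend to score close to 1 even when their content differs.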


In [31]:
print(df['text'][0])

The study of systems that behave intelligently, artificial intelligence includes several key areas where our faculty are recognized leaders: computer vision, machine listening, natural language processing, machine learning and robotics.


Strengths and Impact
The AI group at Illinois is strong, diverse, and growing. It combines expertise in core strengths with promising new research directions.

Research Focus

In machine learning, AI group faculty are studying theoretical foundations of deep and reinforcement learning; developing novel models and algorithms for deep neural networks, federated and distributed learning; as well as investigating issues related to scalability, security, privacy, and fairness of learning systems. Computer vision faculty are developing novel approaches for 2D and 3D scene understanding from still images and video; joint understanding of images and language; low-shot learning (recognition of rare or previously unseen categories); transfer learning and domain

In [32]:
doc1 = nlp(text)
doc2 = nlp(df['text'][0])
print(doc1.similarity(doc2))

0.999737512945093


In [39]:
similarities = []
texts = []
# Compare the freshly scraped text with the reference text for every row.
for url, reference in zip(df['url'], df['text']):
    text = utils.extract_all_text(url)
    texts.append(text)
    similarities.append(nlp(text).similarity(nlp(reference)))

In [40]:
similarities

[0.999737512945093,
 0.999476050226349,
 0.9999989907314845,
 0.9993266911337699,
 0.9999603687264453,
 0.9999581243969186,
 0.9983780650463049,
 0.9990800224476888,
 0.9992856083676531,
 0.8186458322848597,
 0.9999950804753748,
 0.9999482895921847,
 0.9999752192968125,
 0.9999986875875064,
 1.0000001585683593]
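Almost every score sits above 0.998, so the interesting rows are the outliers. A small helper can pick them out by index. The threshold of 0.9 is an assumption chosen for this list, not a recommended value, and the scores are copied from the output above so that the snippet runs on its own.

```python
similarities = [
    0.999737512945093, 0.999476050226349, 0.9999989907314845,
    0.9993266911337699, 0.9999603687264453, 0.9999581243969186,
    0.9983780650463049, 0.9990800224476888, 0.9992856083676531,
    0.8186458322848597, 0.9999950804753748, 0.9999482895921847,
    0.9999752192968125, 0.9999986875875064, 1.0000001585683593,
]

def flag_low_similarity(scores, threshold=0.9):
    """Return the indices of scores that fall below the threshold."""
    return [i for i, s in enumerate(scores) if s < threshold]

low_rows = flag_low_similarity(similarities)
print(low_rows)  # [9]
```

The last score is marginally above 1.0, which is floating-point rounding in the cosine computation rather than a meaningful value.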

In [43]:
df['url'][9]

'https://cs.illinois.edu/news/notes'

In [41]:
texts[9]

'                    News Notes            News Notes of our faculty, alumni, and students are below.\xa0See also our News and Media Mentions .                                          '
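The scraped string is padded with runs of spaces and contains a non-breaking space (`\xa0`). Normalizing whitespace before comparing makes the real differences easier to see, and a character-level ratio from `difflib` is more sensitive to them than averaged word vectors. This is a minimal sketch; `normalize_whitespace` and `char_ratio` are hypothetical helper names, and the padding in the sample string is shortened.

```python
import difflib

def normalize_whitespace(text):
    # str.split() with no argument splits on any whitespace run,
    # including non-breaking spaces, so joining with ' ' collapses them all.
    return ' '.join(text.split())

def char_ratio(extracted, reference):
    """Character-level similarity in [0, 1] after whitespace normalization."""
    a = normalize_whitespace(extracted)
    b = normalize_whitespace(reference)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

scraped = ('          News Notes            News Notes of our faculty, alumni, '
           'and students are below.\xa0See also our News and Media Mentions .        ')
intro = ('News Notes of our faculty, alumni, and students are below. '
         'See also our News and Media Mentions.')

print(repr(normalize_whitespace(scraped)))
print(char_ratio(scraped, intro))
```

After normalization the remaining differences are the duplicated page title and the stray space before the final period. `SequenceMatcher` is quadratic in the worst case, so it may be slow on very long pages.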

In [42]:
df['text'][9]

'News Notes of our faculty, alumni, and students are below. See also our News and Media Mentions.\n\n12/19/2023\nCS Professor Sheldon H. Jacobson says, in response to bills filed over the use of AI-manipulated media in the Florida 2024 legislative session, “Law and artificial intelligence are now starting to merge and confront each other. This is the Wild West … We’re just at the beginning.”  (Tallahassee Democrat)\n\n12/5/2023\nTuring laureate Joseph Sifakis has created an extension of CS professor Lui Sha’s Simplex architecture for trustworthy autonomous systems. Sha’s team and colleagues have worked on provably safe AI-enhanced CPS systems.\n\n11/29/2023\nCS Professor Sheldon H. Jacobson wrote an opinion piece on gerrymandering congressional districts in advance of the 2024 presidential election.\n(The Hill)\n\n11/14/2023\nIllinois CS Professor Sarita Adve will be honored for her "valuable contributions to the field of high-performance computing" and featured in the SC23 Conference 
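For this row the scrape returned only the page introduction, while the reference also holds the dated news items. A coverage-style measure states that gap directly: the share of reference tokens that also appear in the extracted text. The sketch below is one simple way to compute it, assuming whitespace tokenization is good enough for a first pass; `token_recall` is a hypothetical helper name and the strings are excerpts of the two outputs above.

```python
from collections import Counter

def tokens(text):
    # Lowercase and split on any whitespace (this includes non-breaking spaces).
    return text.lower().split()

def token_recall(extracted, reference):
    """Fraction of reference tokens, counted with multiplicity, found in extracted."""
    ref = Counter(tokens(reference))
    ext = Counter(tokens(extracted))
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to recover from an empty reference
    matched = sum(min(count, ext[tok]) for tok, count in ref.items())
    return matched / total

extracted = ('News Notes News Notes of our faculty, alumni, and students are below.'
             '\xa0See also our News and Media Mentions .')
reference = ('News Notes of our faculty, alumni, and students are below. '
             'See also our News and Media Mentions.\n\n12/19/2023\n'
             'CS Professor Sheldon H. Jacobson says, in response to bills filed over '
             'the use of AI-manipulated media in the Florida 2024 legislative session')

print(round(token_recall(extracted, reference), 2))
```

Unlike the averaged-vector score, this value drops sharply when large parts of the reference are missing from the scrape.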

In [44]:
texts[-1]

"                    AI4ALL              AI4ALL Discover AI  INTERESTED IN LEARNING ABOUT ARTIFICIAL INTELLIGENCE AND HOW IT INTERSECTS WITH A VARIETY OF FIELDS?\xa0 WANT TO LEARN ABOUT CAREER PATHS IN TECH?  Apply Now!  Application Deadline: February 10, 2023.  What is AI4ALL?  Discover AI is a\xa0 virtual program\xa0 that offers a hands-on introduction to computer science and artificial intelligence (AI), ethical issues surrounding AI implementation, and tech careers. There is no cost to participate. Students that complete the program\xa0 will emerge with\xa0 actionable next steps in pursuing an academic or career path in AI - and the opportunity to continue in the following semester to our second program, Apply AI.  AI4ALL programs are designed to bring together and highlight voices that have been historically excluded, and that will lead and shape the future of AI. We aim to serve the following students, especially those at the intersection of two or more of these identities:   Ind

In [48]:
list(df['text'])[-1]

"AI4ALL Discover AI\nINTERESTED IN LEARNING ABOUT ARTIFICIAL INTELLIGENCE AND HOW IT INTERSECTS WITH A VARIETY OF FIELDS?  WANT TO LEARN ABOUT CAREER PATHS IN TECH?\nApply Now!\n\nApplication Deadline: February 10, 2023.\n\nWhat is AI4ALL?\n\nDiscover AI is a  virtual program  that offers a hands-on introduction to computer science and artificial intelligence (AI), ethical issues surrounding AI implementation, and tech careers.  There is no cost to participate. Students that complete the program  will emerge with  actionable next steps in pursuing an academic or career path in AI - and the opportunity to continue in the following semester to our second program, Apply AI.\n\nAI4ALL programs are designed to bring together and highlight voices that have been historically excluded, and that will lead and shape the future of AI. We aim to serve the following students, especially those at the intersection of two or more of these identities:\n\nIndigenous Peoples, Black, Hispanic or Latinx, P