In [45]:
from transformers import AutoTokenizer
import transformers
import torch
import utils

torch.cuda.empty_cache()

In [2]:
model = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)
#tokenizer.pad_token_id = tokenizer.eos_token_id
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

Nov 08 13:51:24.822 [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).



In [3]:
def prompt_llama2(prompt):
    # Llama 2 chat format: system prompt inside <<SYS>> tags, user turn closed by [/INST].
    # Plain string literals keep the function's indentation out of the prompt.
    system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
    prompt_template = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{prompt} [/INST]"

    sequences = pipeline(
        prompt_template,
        do_sample=True,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        #max_length=2048,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        repetition_penalty=1.1,
        # Return only the newly generated text, so a passage that itself
        # contains "[/INST]" cannot confuse the answer extraction.
        return_full_text=False,
    )
    return sequences[0]['generated_text'].strip()

def count_tokens(prompt):
    tokens = tokenizer.tokenize(prompt)
    return len(tokens)

def shorten_tokens(prompt):
    tokens = tokenizer.tokenize(prompt)
    tokens = tokens[:4000]
    return tokenizer.convert_tokens_to_string(tokens)
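
`shorten_tokens` caps the passage at 4000 tokens, which leaves very little of Llama 2's 4096-token context window for the system prompt, the instructions, and the generated answer. Below is a minimal sketch of a budget-aware variant. The function name `truncate_to_budget`, its parameters, and both reserve sizes are hypothetical and would need tuning; the whitespace tokenizer in the usage lines is only a stand-in for the real tokenizer.

```python
from typing import Callable, List

LLAMA2_CONTEXT = 4096  # context window of the Llama 2 models


def truncate_to_budget(
    text: str,
    tokenize: Callable[[str], List[str]],
    detokenize: Callable[[List[str]], str],
    reserved_for_prompt: int = 300,   # assumed size of system prompt + instructions
    reserved_for_answer: int = 1000,  # assumed room left for the generated text
    context: int = LLAMA2_CONTEXT,
) -> str:
    """Trim `text` so that passage, prompt overhead, and answer fit in the context."""
    budget = context - reserved_for_prompt - reserved_for_answer
    if budget <= 0:
        raise ValueError("reserved tokens exceed the context window")
    tokens = tokenize(text)
    if len(tokens) <= budget:
        return text  # already short enough; skip the lossy round trip
    return detokenize(tokens[:budget])


# Stand-in tokenizer for illustration only: split on whitespace.
words = " ".join(f"w{i}" for i in range(5000))
short = truncate_to_budget(words, str.split, " ".join)
```

With the notebook's tokenizer, the call would presumably look like `truncate_to_budget(s, tokenizer.tokenize, tokenizer.convert_tokens_to_string)`.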

#### Get paragraphs for testing

In [32]:
from trafilatura import fetch_url, extract

urls = ['https://cs.illinois.edu/research/areas/artificial-intelligence',
        'https://cs.illinois.edu/research/undergraduate-research',
        'https://cs.illinois.edu/corporate#research']
paragraphs_dict = {}
paragraphs_dict_trafilatura = {}
for i, url in enumerate(urls):
    #p = utils.extract_paragraphs(url)
    #p = utils.extract_paragraphs_trafilatura(url)
    # Scraper 1: custom paragraph + list extraction from utils
    paragraphs_dict[i] = utils.extract_paragraphs_lists(url)
    # Scraper 2: trafilatura main-content extraction (a string, or None on failure)
    paragraphs_dict_trafilatura[i] = extract(fetch_url(url), favor_precision=True, deduplicate=True)

### Main Content Extraction

In [87]:
# test
prompt = f'Here is a webpage url: "{urls[0]}"\n\n'
prompt += """Can you extract the main text content from this webpage? Remove unnecessary information from the text, such as addresses, emails, and dates.
Try to retain as much of the original passage, and return in paragraph form.
"""
print(prompt)

Here is a webpage url: "https://cs.illinois.edu/research/areas/artificial-intelligence"

Can you extract the main text content from this webpage? Remove unnecessary information from the text, such as addresses, emails, and dates.
Try to retain as much of the original passage, and return in paragraph form.



In [82]:
out = prompt_llama2(prompt)
print(out)

Of course! I can do that for you. Here is the main content of the webpage you provided:

Artificial Intelligence (AI) is revolutionizing numerous fields such as healthcare, transportation, education, and entertainment. The field of AI has made significant progress in recent years due to advances in machine learning, natural language processing, computer vision, and robotics.

At the University of Illinois at Urbana-Champaign, we are committed to advancing the frontiers of AI research and applying it to real-world problems. Our faculty members have expertise in various areas of AI, including machine learning, computer vision, natural language processing, and human-computer interaction. We are actively engaged in interdisciplinary collaborations with other departments and institutes on campus, such as the Coordinated Science Laboratory, the Department of Computer Science, and the National Center for Supercomputing Applications.

Our research focuses on developing new algorithms, techniqu

#### With the paragraphs+list scraping method

In [88]:
prompt = ''
s = ''.join(paragraphs_dict[0]).replace('  ', ' ')
prompt += f'Here is a passage: "{s}"\n\n'
prompt += """
Remove unnecessary information from this passage, such as addresses, emails, and dates.
Try to retain as much of the original passage, and return in paragraph form.
"""
print(prompt)



Remove unnecessary information from this passage, such as addresses, emails, and dates.
Try to retain as much of the original passage, and return in paragraph form.



In [58]:
out = prompt_llama2(prompt)
print(out)

Sure! Here is the revised version of the provided passage after removing unnecessary information:

The study of systems that behave intelligently, artificial intelligence includes several key areas where our faculty are recognized leaders: computer vision, machine listening, natural language processing, machine learning, and robotics. Computer vision systems can understand images and videos, build extensive geometric and physical models of cities from video, or warn construction workers about nearby dangers. Natural language processing systems understand written and spoken language, and possibilities include automatic translation of text from one language to another or understanding text on Wikipedia to produce knowledge about the world. Machine listening systems understand audio signals, with applications like speech recognition, acoustic monitoring, or transcribing polyphonic music automatically. 

Crucial to modern artificial intelligence, machine learning methods exploit examples t

In [59]:
outputs_pl = []
for i in range(len(urls)):
    prompt = ''
    s = ''.join(paragraphs_dict[i]).replace('  ', ' ')
    prompt += f'Here is a passage: "{s}"\n\n'
    prompt += """
    Remove unnecessary information from this passage, such as addresses, emails, and dates.
    Try to retain as much of the original passage, and return in paragraph form.
    """
    out = prompt_llama2(prompt)
    outputs_pl.append(out)
    print(f"{i} done")

0 done
1 done
2 done


In [60]:
for i in range(len(urls)):
    s = ''.join(paragraphs_dict[i]).replace('  ', ' ')
    print(f'Original passage:\n{"="*30}\n{s}\n{"="*30}\n')
    
    out_i = outputs_pl[i].replace('\n', '')
    print(f'LLM extracted passage:\n{"="*30}\n{out_i}\n{"="*30}\n')
    print('-'*150)

Original passage:

LLM extracted passage:

------------------------------------------------------------------------------------------------------------------------------------------------------
Original passage:
Undergraduates at Illinois Computer Science are an important part of our world-renowned research. From summer programs to paid research positions with faculty, there are multiple ways for our students to contribute to high impact research early in their careers.https://illinois.zoom.us/j/83439167799?pwd=UXBpREVtOXJ0ZjIwdFFiMHhtWGdVQT09Thomas M. Siebel Center for Computer ScienceMy.CSYour path begins here.Your path begins here.B.S. in Computer ScienceB.S. in Mathematics & Computer ScienceB.S. in Statistics & Computer ScienceGuidelines for Forming Ph.D. CommitteePh.D. / M.S. Thesis Format Review GuidelinesB.S. in Computer ScienceB.S. in Mathematics & Computer ScienceB.S. in Statistics & Computer ScienceB.S. in Computer ScienceB.S. in Mathematics & Computer ScienceB.S. in Statisti
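
The original passage above repeats the same navigation labels many times and fuses neighbouring items without a separator. A minimal sketch of a cleanup step that could run before prompting is shown below. It assumes `utils.extract_paragraphs_lists` returns a list of strings, which the notebook does not show; `dedupe_items` is a hypothetical helper, and the sample list is modeled on the output above rather than taken from it.

```python
from typing import Iterable, List


def dedupe_items(items: Iterable[str]) -> List[str]:
    """Drop empty and repeated items, keeping first occurrences in order."""
    seen = set()
    kept = []
    for item in items:
        key = " ".join(item.split())  # normalize internal whitespace
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


# Illustrative input modeled on the repeated menu labels above.
scraped = [
    "Your path begins here.",
    "Your path begins here.",
    "B.S. in Computer Science",
    "B.S. in  Computer Science",
    "",
]
passage = " ".join(dedupe_items(scraped))
print(passage)
```

Joining with a space rather than an empty string keeps sentence boundaries visible to the model.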

#### With trafilatura scraping method

In [61]:
outputs_tr = []
for i in range(len(urls)):
    prompt = ''
    s = ''.join(paragraphs_dict_trafilatura[i]).replace('  ', ' ')
    prompt += f'Here is a passage: "{s}"\n\n'
    prompt += """
    Remove unnecessary information from this passage, such as addresses, emails, and dates.
    Try to retain as much of the original passage, and return in paragraph form.
    """
    out = prompt_llama2(prompt)
    outputs_tr.append(out)
    print(f"{i} done")

0 done
1 done
2 done


In [77]:
for i in range(len(urls)):
    s = ''.join(paragraphs_dict_trafilatura[i]).replace('  ', ' ')
    print(f'Original passage:\n{"="*30}\n{s}\n{"="*30}\n')
    
    out_i = outputs_tr[i].replace('\n', '')
    print(f'LLM extracted passage:\n{"="*30}\n{out_i}\n{"="*30}\n')
    print('-'*150)

Original passage:
Artificial Intelligence
The study of systems that behave intelligently, artificial intelligence includes several key areas where our faculty are recognized leaders: computer vision, machine listening, natural language processing, machine learning and robotics.
Strengths and Impact
The AI group at Illinois is strong, diverse, and growing. It combines expertise in core strengths with promising new research directions.
In machine learning, AI group faculty are studying theoretical foundations of deep and reinforcement learning; developing novel models and algorithms for deep neural networks, federated and distributed learning; as well as investigating issues related to scalability, security, privacy, and fairness of learning systems. Computer vision faculty are developing novel approaches for 2D and 3D scene understanding from still images and video; joint understanding of images and language; low-shot learning (recognition of rare or previously unseen categories); trans

#### Pass URL into prompt

In [64]:
outputs_url = []
for i, url in enumerate(urls):
    prompt = f'Here is a webpage url: "{url}"\n\n'
    prompt += """
    Can you extract the main text content from this webpage? Remove unnecessary information from the text, such as addresses, emails, and dates.
    Try to retain as much of the original text, and return in paragraph form.
    """
    out = prompt_llama2(prompt)
    outputs_url.append(out)
    print(f"{i} done")

0 done
1 done
2 done


In [66]:
for i in range(len(urls)):
    out_i = outputs_url[i].replace('\n', '')#.split(':')[1]
    print(f'LLM extracted passage:\n{"="*30}\n{out_i}\n{"="*30}\n')
    print('-'*150)

LLM extracted passage:
Of course! I can assist with that. Here is the main text content from the webpage "https://cs.illinois.edu/research/areas/artificial-intelligence":The Artificial Intelligence (AI) research area at the University of Illinois at Urbana-Champaign is one of the most vibrant and diverse in the world. Our faculty members are internationally recognized leaders in their respective fields, and our students have access to state-of-the-art facilities and resources. We focus on a wide range of AI subfields, including machine learning, natural language processing, computer vision, robotics, and human-computer interaction.Our research is driven by a deep commitment to advancing the scientific understanding of AI and its applications in various domains, such as healthcare, education, transportation, and entertainment. We collaborate closely with other departments and institutes across campus, such as the Department of Computer Science, the Department of Electrical and Computer 

#### Compare three types

In [78]:
for i in range(len(urls)):
    out_i = outputs_pl[i].replace('\n', '')
    print(f'Paragraph + List extracted passage:\n{"="*30}\n{out_i}\n{"="*30}\n')
    out_i = outputs_tr[i].replace('\n', '')
    print(f'trafilatura extracted passage:\n{"="*30}\n{out_i}\n{"="*30}\n')
    out_i = outputs_url[i].replace('\n', '')
    print(f'URL extracted passage:\n{"="*30}\n{out_i}\n{"="*30}\n')
    print('-'*150)

Paragraph + List extracted passage:

trafilatura extracted passage:

URL extracted passage:
Of course! I can assist with that. Here is the main text content from the webpage "https://cs.illinois.edu/research/areas/artificial-intelligence":The Artificial Intelligence (AI) research area at the University of Illinois at Urbana-Champaign is one of the most vibrant and diverse in the world. Our faculty members are internationally recognized leaders in their respective fields, and our students have access to state-of-the-art facilities and resources. We focus on a wide range of AI subfields, including machine learning, natural language processing, computer vision, robotics, and human-computer interaction.Our research is driven by a deep commitment to advancing the scientific understanding of AI and its applications in various domains, such as healthcare, education, transportation, and entertainment. We collaborate closely with other departments and institutes across campus, such as the Depar
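
The URL-only prompt gives the model nothing but the address string, since the pipeline as set up here does not fetch pages, so its output is worth checking against the scraped text. The following is a rough sketch of one way to do that: the share of word trigrams in an output that also occur in the source. `word_ngrams` and `grounded_fraction` are hypothetical helpers, and trigram overlap is only one possible proxy, not a measure used in this notebook.

```python
import re
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams; punctuation is ignored."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def grounded_fraction(source: str, output: str, n: int = 3) -> float:
    """Fraction of the output's n-grams that also appear in the source."""
    out = word_ngrams(output, n)
    if not out:
        return 0.0
    return len(out & word_ngrams(source, n)) / len(out)


source = "The AI group at Illinois is strong, diverse, and growing."
copied = "The AI group at Illinois is strong"
invented = "Our students have access to state-of-the-art facilities"
print(grounded_fraction(source, copied), grounded_fraction(source, invented))
```

A value near 1 suggests the output was lifted from the source; a value near 0 suggests the text came from somewhere else, such as the model's own prior.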

### Add citation to prompt

#### Using the LLM-extracted paragraphs + lists text

In [83]:
prompt = ''
for i in range(len(urls)):
    s = ''.join(outputs_pl[i]).replace('  ', ' ')
    prompt += f'Here is passage number {i+1}: "{s}"\n\n'
prompt += """
Can you combine these passages into one summary?
Before the end of each sentence in the summary, add a citation showing the original passage that the sentence came from.
For example, add "[Passage 1]" before the ending period of the sentence if the sentence came from the original Passage 1.
"""
print(prompt)

Here is passage number 1: "Sure! Here is the passage after removing unnecessary information:


The AI group at Illinois is strong, diverse, and growing, combining expertise in core strengths with promising new research directions. In machine learning, AI group faculty are studying theoretical foundations of deep and reinforcement learning, developing novel models and algorithms for deep neural networks, federated, and distributed learning, as well as investigating issues related to scalability, security, privacy, and fairness of learning systems. Computer vision faculty are developing novel approaches for 2D and 3D scene understanding from still images and video, joint understanding of images and language, low-shot learning, transfer learning, and domain adaptation. Natural language processing faculty are working on topics such as grounded language understanding, information extraction and text mining, and knowledge-driven natural language generation. Machine listening faculty are work
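
Passage 1 above still carries the model's introduction line ("Sure! Here is the passage after removing unnecessary information:"), which then becomes part of the input to the summary prompt. A minimal sketch of stripping it is shown below, assuming the introduction is a single opening line that ends with a colon. `strip_preamble` is a hypothetical helper and should be applied to the raw output, before newlines are removed.

```python
import re

# Assumed pattern: one short opening line that ends with a colon.
_PREAMBLE = re.compile(r"\A[^\n]{0,200}:[ \t]*\n+")


def strip_preamble(output: str) -> str:
    """Remove one leading introduction line ending in ':' if present."""
    return _PREAMBLE.sub("", output.lstrip(), count=1).strip()


raw = (
    "Sure! Here is the passage after removing unnecessary information:\n\n\n"
    "The AI group at Illinois is strong."
)
print(strip_preamble(raw))
```

The heuristic would also drop a genuine first line that happens to end with a colon, so it is worth spot-checking on real outputs.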

In [84]:
out = prompt_llama2(prompt)
print(out)

Of course! Here is a summary of the three passages, with citations added at the end of each sentence:

The University of Illinois' Computer Science Department offers various research opportunities for undergraduate students to contribute to impactful projects, develop their skills, and prepare for their future careers [Passage 2]. The department is one of the largest in the nation, providing hands-on learning experiences in the classroom and lab, and preparing students for leadership positions in academia and industry [Passage 3]. Faculty members are internationally recognized and bring their expertise to solve real-world problems, with research expenditures exceeding $30 million annually [Passage 3]. The Thomas M. Siebel Center for Computer Science features state-of-the-art facilities, including classrooms, research labs, and meeting spaces [Passage 3]. Undergraduate students can contribute to high-impact research early in their careers through programs like ISUR, IBM-ILLINOIS and ISU
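
Because the prompt asks for a `[Passage N]` tag before the end of every sentence, the summary can be checked mechanically. The sketch below lists sentences that carry no tag or cite a passage number that does not exist. The helper names are hypothetical, and the sentence splitter is deliberately naive: it would break on abbreviations such as "B.S.".

```python
import re
from typing import Dict, List

_CITATION = re.compile(r"\[Passage (\d+)\]")


def citations_by_sentence(summary: str) -> List[Dict[str, object]]:
    """Split after ., ! or ? and list the passage numbers each sentence cites."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    return [
        {"sentence": s, "passages": [int(n) for n in _CITATION.findall(s)]}
        for s in sentences
    ]


def uncited(summary: str, num_passages: int) -> List[str]:
    """Sentences with no citation, or citing a passage number out of range."""
    bad = []
    for row in citations_by_sentence(summary):
        nums = row["passages"]
        if not nums or any(not 1 <= n <= num_passages for n in nums):
            bad.append(row["sentence"])
    return bad


summary = (
    "Students can join research early [Passage 2]. "
    "The department is large [Passage 3]. "
    "This one has no source. "
    "Bad ref [Passage 7]."
)
print(uncited(summary, num_passages=3))
```

This only verifies that a tag is present and in range; whether the cited passage actually supports the sentence still needs a manual or overlap-based check.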

#### Using the LLM-extracted trafilatura text

In [93]:
prompt = ''
for i in range(len(urls)):
    s = ''.join(outputs_tr[i]).replace('  ', ' ')
    prompt += f'Here is passage number {i+1}: "{s}"\n\n'
prompt += """
Can you combine these passages into one passage?
Before the end of each sentence in the passage, add a citation showing the original passage that the sentence came from.
For example, add "[Passage 1]" before the ending period of the sentence if the sentence came from the original Passage 1.
"""

In [97]:
# out = prompt_llama2(prompt)
# print(out)

#### Using webpage URL extracted method

In [106]:
prompt = ''
for i in range(len(urls)):
    s = ''.join(outputs_url[i]).replace('  ', ' ')
    prompt += f'Here is passage number {i+1}: "{s}"\n\n'
prompt += """
Can you combine these passages into one summary?
Before the end of each sentence in the summary, add a citation showing the original passage that the sentence came from.
For example, add "[Passage 1]" before the ending period of the sentence if the sentence came from the original Passage 1.
"""

In [96]:
# out = prompt_llama2(prompt)
# print(out)
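
The three citation cells above build nearly identical prompts. A small helper, sketched below, would keep them consistent; `build_citation_prompt` and `CITATION_INSTRUCTIONS` are hypothetical names, and the wording follows the "summary" variant of the prompt used in this notebook.

```python
from typing import Sequence

CITATION_INSTRUCTIONS = (
    "Can you combine these passages into one summary?\n"
    "Before the end of each sentence in the summary, add a citation showing "
    "the original passage that the sentence came from.\n"
    'For example, add "[Passage 1]" before the ending period of the sentence '
    "if the sentence came from the original Passage 1.\n"
)


def build_citation_prompt(passages: Sequence[str]) -> str:
    """Number each passage from 1 and append the citation instructions."""
    parts = []
    for i, passage in enumerate(passages, start=1):
        cleaned = passage.replace("  ", " ")  # same cleanup as the cells above
        parts.append(f'Here is passage number {i}: "{cleaned}"\n\n')
    return "".join(parts) + "\n" + CITATION_INSTRUCTIONS


demo_prompt = build_citation_prompt(["A  b.", "C."])
print(demo_prompt)
```

Each variant then reduces to one call, for example `build_citation_prompt(outputs_tr)`.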