### Load model

In [1]:
from transformers import AutoTokenizer
import transformers
import torch

In [2]:
torch.cuda.empty_cache()

In [3]:
model = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

### Test prompts

In [4]:
prompt = 'Can you give a brief overview about the UIUC computer science department?'
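tokens = tokenizer.tokenize(prompt)
print(tokens)
# Compare the whitespace-separated word count with the subword token count.
print(len(prompt.split()), len(tokens))
# Round-trip check: the tokens should join back into the original prompt.
tokenizer.convert_tokens_to_string(tokens)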

['▁Can', '▁you', '▁give', '▁a', '▁brief', '▁over', 'view', '▁about', '▁the', '▁UI', 'UC', '▁computer', '▁science', '▁department', '?']
12 15


'Can you give a brief overview about the UIUC computer science department?'

In [5]:
prompt = 'Can you give a brief overview about the UIUC computer science department?'
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>> {prompt}[/INST]'''

sequences = pipeline(
    prompt_template,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1024,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)

In [8]:
result = sequences[0]['generated_text']
print(result.split('[/INST]')[1])

  Hello! I'd be happy to provide an overview of the University of Illinois at Urbana-Champaign (UIUC) Computer Science Department.

The UIUC Computer Science Department is a highly respected and well-established program that offers undergraduate and graduate degrees in computer science. The department is known for its strong research focus and has a diverse range of faculty expertise, including artificial intelligence, machine learning, data science, human-computer interaction, programming languages, and more.

Some key facts about the UIUC Computer Science Department include:

* Rankings: The department is consistently ranked among the top 10 public computer science programs in the country by US News & World Report.
* Research areas: The department has research groups in areas such as algorithms, computer vision, databases, distributed systems, human-computer interaction, natural language processing, operating systems, programming languages, and software engineering.
* Faculty: The de
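The prompt above wraps a system message in `<<SYS>>` markers and places the user message before `[/INST]`, and the reply is recovered by splitting on `[/INST]`. A minimal sketch of that wrapping and extraction follows. The helper names `build_llama2_prompt` and `extract_response` are hypothetical and do not appear in the notebook. The sketch runs without loading the model.

```python
# Hypothetical helpers for illustration; not part of the notebook.
def build_llama2_prompt(user_message: str, system_message: str) -> str:
    """Wrap a system and user message in the [INST] / <<SYS>> markers used above."""
    return f"[INST] <<SYS>>\n{system_message}\n<</SYS>> {user_message} [/INST]"


def extract_response(generated_text: str) -> str:
    """Return the text after the last [/INST] marker (or the whole text if absent)."""
    # rpartition returns ('', '', text) when the marker is missing,
    # so the last element is always the part we want.
    return generated_text.rpartition("[/INST]")[2].strip()


prompt = build_llama2_prompt("What is 2 + 2?", "You are a concise assistant.")
# Simulate the pipeline's 'generated_text', which echoes the prompt first.
fake_output = prompt + "  The answer is 4."
print(extract_response(fake_output))
```

Taking the text after the last marker, rather than index `[1]` of a split, keeps the extraction correct when the user message itself happens to contain `[/INST]`.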

In [5]:
def prompt_llama2(prompt):
    prompt_template=f'''[INST] <<SYS>>
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
    <</SYS>> {prompt} [/INST]'''
    
    sequences = pipeline(
        prompt_template,
        do_sample=True,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
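        # max_length counts prompt tokens plus generated tokens, so a prompt
        # longer than 1024 tokens leaves no room for a response.
        max_length=1024,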
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        repetition_penalty=1.1,
    )
    result = sequences[0]['generated_text']
    return result.split('[/INST]')[1].strip()

def count_tokens(prompt):
    tokens = tokenizer.tokenize(prompt)
    return len(tokens)

def shorten_tokens(prompt):
    # Keep only the first 3500 tokens (Llama 2's context window is 4096 tokens).
    tokens = tokenizer.tokenize(prompt)
    tokens = tokens[:3500]
    return tokenizer.convert_tokens_to_string(tokens)
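`shorten_tokens` caps a passage at a fixed 3500 tokens. The sketch below shows one way the cap could be derived from a token budget instead. The function name, its parameters, and the `WhitespaceTokenizer` stand-in are all assumptions made for illustration. The real tokenizer exposes the same two methods used here, `tokenize` and `convert_tokens_to_string`.

```python
class WhitespaceTokenizer:
    """Stand-in for the real tokenizer so the sketch runs without a model."""

    def tokenize(self, text):
        return text.split()

    def convert_tokens_to_string(self, tokens):
        return " ".join(tokens)


def truncate_to_budget(text, tokenizer, context_window=4096,
                       reserved_for_output=512, template_tokens=0):
    """Trim text so passage + template + reserved output fit the context window."""
    # reserved_for_output and template_tokens are illustrative defaults, not measured values.
    budget = context_window - reserved_for_output - template_tokens
    if budget <= 0:
        raise ValueError("no token budget left for the passage")
    tokens = tokenizer.tokenize(text)
    return tokenizer.convert_tokens_to_string(tokens[:budget])


# Tiny numbers so the effect is visible: 5 - 2 - 1 leaves room for 2 tokens.
demo = truncate_to_budget("a b c d e f", WhitespaceTokenizer(),
                          context_window=5, reserved_for_output=2, template_tokens=1)
print(demo)
```

With the notebook's tokenizer, `template_tokens` could be measured by calling `count_tokens` on the prompt template with an empty passage.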

In [7]:
out = prompt_llama2('What are the degree options for the UIUC computer science department?')

In [8]:
print(out)

Hello! As a helpful and respectful assistant, I'm happy to assist you with your question about the University of Illinois at Urbana-Champaign (UIUC) computer science department.

The UIUC computer science department offers several degree options at both the undergraduate and graduate levels. Here are some of the degree options available:

Undergraduate Degrees:

1. Bachelor of Science in Computer Science (BS CS): This is the most popular undergraduate degree offered by the department, and it provides a comprehensive education in computer science principles and practices.
2. Bachelor of Arts in Computer Science (BA CS): This degree program has a stronger focus on the liberal arts and humanities, and it is designed for students who want to combine their passion for computer science with other interests.
3. Minor in Computer Science: Students from other majors can also minor in computer science to gain knowledge and skills in this field.

Graduate Degrees:

1. Master of Science in Compute

### Scraping

In [6]:
import utils
import pickle

In [10]:
school_name = 'UIUC'
tab_queries = ['degree options', 'coursework', 'faculty']
queries = ['research areas', 'student organizations']
# passages = utils.get_query_passages(school_name, queries)

https://cs.illinois.edu/academics/undergraduate/degree-program-options

https://grainger.illinois.edu/academics/undergraduate/majors-and-minors/computer-science

https://cs.illinois.edu/academics/undergraduate/degree-program-options/cs-x-degree-programs

https://cs.illinois.edu/academics/courses

http://catalog.illinois.edu/courses-of-instruction/cs/

https://cs.illinois.edu/academics/undergraduate/degree-program-options/bs-computer-science

https://cs.illinois.edu/research

https://grainger.illinois.edu/research/undergraduate/get-involved/how-to

https://cs.illinois.edu/research/areas

https://cs.illinois.edu/about/people/all-faculty

https://ece.illinois.edu/about/directory/faculty

https://cs.illinois.edu/about/people/staff/office-department-head

https://cs.illinois.edu/student-life/student-organizations

https://grainger.illinois.edu/academics/student-organizations

https://cs.illinois.edu/student-life

In [11]:
# with open('passages.pkl', 'wb') as f:
#     pickle.dump(passages, f, protocol=pickle.HIGHEST_PROTOCOL)

In [12]:
with open('passages.pkl', 'rb') as f:
    passages = pickle.load(f)
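The cells above scrape once, pickle the result, and later load it from disk, with the scraping call commented out. A minimal sketch of folding both paths into one helper follows. `load_or_compute` is a hypothetical name, and `fake_scrape` stands in for the real scraping call.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Load a pickled result if it exists; otherwise compute it and cache it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result


calls = []


def fake_scrape():
    # Stand-in for the real scraper; records how often it is invoked.
    calls.append(1)
    return {"degree options": ["example passage"]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "passages_demo.pkl"
    first = load_or_compute(cache_file, fake_scrape)   # computes and writes the cache
    second = load_or_compute(cache_file, fake_scrape)  # reads the cache
print(first == second, len(calls))
```

In the notebook this could be called as `load_or_compute('passages.pkl', lambda: utils.get_query_passages(school_name, queries))`, reusing the commented-out call shown earlier.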

In [13]:
# The scraped text repeats sentences (an artifact of scraping this particular page).
# Splitting on '.' to inspect them also breaks decimals such as "3.0".
passages['degree options'][0].split('.')

['The University of Illinois offers numerous pathways to incorporate computing into undergraduate education and research',
 ' The Department of Computer Science in The Grainger College of Engineering is shaping the future of computing and data science backed by a legacy of excellence',
 ' The breadth of CS topics and applications offered is unmatched almost anywhere',
 ' When you complete, in the last two years of your undergraduate degree, any Illinois\nCS\n,\nCS + X\n, or\nCS minor\nwith a GPA of 3',
 '0 or better and a GPA of 3',
 '2 or better in CS courses,\nyou are guaranteed admission\nto our\nOnline MCS\nor\nMCS in Data Science (MCS-DS)\nprograms',
 ' Recent UIUC graduates may qualify for an\napplication fee waiver',
 ' Meet with a Computer Science (CS) advisor to learn more about what CS at Illinois has to offer and hear from a current CS student',
 ' We will discuss all the different opportunities and options of the different CS majors',
 " There will be time for Q & A's",
 ' 

In [14]:
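lst = passages['degree options'][0].split('.')
# set() drops duplicate sentences but does not preserve their order,
# and joining with '' drops the periods between sentences.
''.join(set(lst))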

"2 or better in CS courses,\nyou are guaranteed admission\nto our\nOnline MCS\nor\nMCS in Data Science (MCS-DS)\nprograms Meet with a Computer Science (CS) advisor to learn more about what CS at Illinois has to offer and hear from a current CS student The Department of Computer Science in The Grainger College of Engineering is shaping the future of computing and data science backed by a legacy of excellenceThe University of Illinois offers numerous pathways to incorporate computing into undergraduate education and research0 or better and a GPA of 3 Recent UIUC graduates may qualify for an\napplication fee waiver We will discuss all the different opportunities and options of the different CS majors The breadth of CS topics and applications offered is unmatched almost anywhere There will be time for Q & A's When you complete, in the last two years of your undergraduate degree, any Illinois\nCS\n,\nCS + X\n, or\nCS minor\nwith a GPA of 3"

In [15]:
joined_passage = ''.join(passages['faculty'])
joined_passage = ''.join(set(joined_passage.split('.')))
joined_passage

'illinoisRichard T (ECE)2232 Siebel Center, (217) 333-3426, Fax: (217) 333-3501, head@cs Herman M Prof Dieckamp Endowed Chair Emeritus in Engineering Adjunct Asst Gillies Professor in Computer Science Donald Bedu Cheng Professor Donald B Everitt Distinguished Professor Emeritus Adjunct Assistant Prof Gillies Chair in Computer ScienceWilliam L'
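The outputs above show two side effects of `''.join(set(text.split('.')))`: sentence order is scrambled, and decimals such as `3.0` are cut in half. The sketch below is one alternative that keeps first-seen order and splits only where sentence punctuation is followed by whitespace. `dedupe_sentences` is a hypothetical helper, not the notebook's method. It would still split after an initial such as "T." in a name.

```python
import re


def dedupe_sentences(text: str) -> str:
    """Remove repeated sentences while keeping the first occurrence in order."""
    # Split after '.', '!' or '?' only when whitespace follows, so "3.0" stays intact.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    seen = set()
    unique = []
    for sentence in sentences:
        # Normalize whitespace and case so near-identical repeats match.
        key = " ".join(sentence.split()).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence.strip())
    return " ".join(unique)


sample = ("GPA of 3.0 or better. Meet with an advisor. "
          "GPA of 3.0 or better. Meet with an advisor.")
print(dedupe_sentences(sample))
```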

### Keyword extraction for tabular pages

In [15]:
query = 'coursework'
print(f'---{query}---')
joined_passage = ''.join(passages[query])
joined_passage = ''.join(set(joined_passage.split('.')))
n_tokens = count_tokens(joined_passage)
prompt = f"""
Can you extract the major keyword from this passage? Here is the passage: {joined_passage}
"""

---coursework---


In [17]:
prompt = shorten_tokens(prompt)
out = prompt_llama2(prompt)
print(out)






In [16]:
url = 'https://cs.illinois.edu/about/people/department-faculty'
all_text = utils.extract_all_text(url)

In [17]:
count_tokens(all_text)

3762
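At 3762 tokens, the page leaves little room in a 4096-token context window once the prompt template and the response are added. One option is to split the token list into smaller windows and prompt on each window separately. The sketch below shows only the windowing step on a plain list. `chunk_tokens` and its parameters are hypothetical names.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With the notebook's tokenizer, each window from `tokenizer.tokenize(all_text)` could be turned back into text with `tokenizer.convert_tokens_to_string` before prompting. A small overlap reduces the chance that a name is cut at a window boundary.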

In [23]:
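query = 'faculty'
print(f'---{query}---')
# Note: joined_passage and n_tokens are computed but unused in this cell;
# the prompt below uses all_text (the full page text) instead.
joined_passage = ''.join(passages[query])
joined_passage = ''.join(set(joined_passage.split('.')))
n_tokens = count_tokens(joined_passage)
# all_text is 3762 tokens, which exceeds max_length=1024 in prompt_llama2;
# this is the likely reason the output is empty.
prompt = f"""
Can you extract the faculty names and titles from this passage? Here is the passage: {all_text}
"""
out = prompt_llama2(prompt)
print(out)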

---faculty---



In [19]:
prompt

'\nCan you extract a list of faculty names from this passage? Here is the passage:                   Alumni  Corporate  People  My.CS         University of Illinois at Urbana-Champaign   The Grainger College of Engineering  Computer Science          Search         Menu             Search        About    About  Rankings & Statistics  Contact Us & Office Locations  History Timeline  Accreditation  Values & Code of Conduct  CS CARES Committee  Contact CS CARES  Governance  Members  Resources    People  All Faculty  Department Faculty  Affiliate Faculty  Adjunct Faculty  Emeritus Faculty  Postdoctoral Researchers  Staff  Office of the Department Head  Communications & Engagement Team  Undergraduate Advising Office  Graduate Advising Office  Instructional Development Team  Business Office  Faculty Support Contacts  Facilities, Shipping and Receiving    Graduating PhD Students    Open Positions  Faculty Positions  Postdoctoral Positions  Future Faculty Fellows    Staff Positions  Choose Illi

### Classification/Relevancy for textual pages

In [20]:
query = 'research areas'
print(f'---{query}---')
joined_passage = ''.join(passages[query])
joined_passage = ''.join(set(joined_passage.split('.')))
prompt = f"""
Is this passage related to the computer science field, more specifically about {query}? Here is the passage: {joined_passage}
"""
out = prompt_llama2(prompt)
print(out)

---research areas---
Yes, the passage is related to the computer science field, specifically highlighting research areas. It discusses various technical conferences, workshops, and speaker series events hosted by the Department of Computer Science at the University of Illinois, with the goal of promoting conversations about important challenges and topics in the discipline. Additionally, it highlights the department's global reputation for developing revolutionary technology and groundbreaking research that addresses real-world problems.

The passage also encourages undergraduate students to get involved in research opportunities, providing several programs and matrices to help them identify and connect with faculty members working in their areas of interest. Overall, the passage emphasizes the department's commitment to innovation, excellence, and interdisciplinary collaboration in computer science research.
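The relevancy answer above is free text that happens to begin with "Yes". To use it as a filter, it has to be reduced to a boolean. A minimal sketch follows, assuming the model leads with "Yes" or "No". `parse_relevance` is a hypothetical helper that returns `None` when the answer starts with anything else.

```python
from typing import Optional


def parse_relevance(answer: str) -> Optional[bool]:
    """Map a free-text answer to True/False based on its first word, else None."""
    words = answer.split()
    if not words:
        return None
    first = words[0].strip(".,!:;").lower()
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


print(parse_relevance("Yes, the passage is related to the computer science field."))
print(parse_relevance("Sure! Here is my assessment of the passage."))
```

Returning `None` rather than guessing keeps ambiguous answers visible, so they can be re-prompted or reviewed by hand.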


### Summarization for textual pages

In [20]:
query = 'research areas'
print(f'---{query}---')
joined_passage = ''.join(passages[query])
joined_passage = ''.join(set(joined_passage.split('.')))
prompt = f"""
Can you summarize this passage into a few sentences? Here is the passage: {joined_passage}
"""
out = prompt_llama2(prompt)
print(out)

---research areas---
Sure! Here is a summary of the provided text in a few sentences:

The Grainger Engineering research opportunities are abundant and can serve as a launchpad for pursuing innovative ideas at a prestigious program. The Illinois Computer Science department has a global reputation for groundbreaking research in various fields, with faculty members contributing to diverse areas of the discipline. The department offers a range of research topics, providing undergraduates with opportunities to engage in semester-long and summer programs. Recordings of past events are also available for current students through Media Space.


In [21]:
# out2 = out.replace('\n\n', '\n')
# out2.split('\n')[1]
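The commented-out cell above appears to drop the model's opening line ("Sure! Here is a summary ...") by taking the second line of the output. A slightly more defensive sketch follows. It removes the first paragraph only when that paragraph ends with a colon. `strip_preamble` is a hypothetical name, and the colon rule is an assumption based on the outputs shown here.

```python
def strip_preamble(output: str) -> str:
    """Drop a leading 'Sure! Here is ...:' paragraph if one is present."""
    paragraphs = [p.strip() for p in output.split("\n\n") if p.strip()]
    if len(paragraphs) > 1 and paragraphs[0].endswith(":"):
        paragraphs = paragraphs[1:]
    return "\n\n".join(paragraphs)


raw = ("Sure! Here is a summary of the provided text in a few sentences:\n\n"
       "The department has a global reputation for research.")
print(strip_preamble(raw))
```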

In [22]:
prompt = f"""
What are the key ideas and keywords from this passage? Here is the passage: {joined_passage}
"""
out = prompt_llama2(prompt)
print(out)

Sure! Here are the key ideas and keywords from the passage:

1. Research opportunities: The passage highlights the abundance of research opportunities available to students at Grainger Engineering, particularly in the field of computer science.
2. Prestige: The program is described as "prestigious," emphasizing its reputation for excellence.
3. Global impact: The research conducted at Illinois Computer Science is said to address "real-world problems" and have a "global reputation" for revolutionary technology.
4. Faculty expertise: The passage highlights the breadth, depth, and quality of the faculty's research contributions in various areas of computer science.
5. Technical conferences and workshops: The department hosts events to benefit the academic and research communities, including speaker series with prominent leaders and experts.
6. Undergraduate involvement: The passage mentions several programs that allow undergraduates to get involved in research, including PURE, ISUR, and t
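The model returns the key ideas as a numbered list of `Title: description` items. A minimal sketch of turning that text into structured pairs follows, assuming each item sits on a single line in that shape. `parse_key_ideas` is a hypothetical helper, and lines that do not match, such as the opening "Sure!" line, are skipped.

```python
import re

# Matches lines such as "1. Research opportunities: The passage highlights ..."
ITEM_PATTERN = re.compile(r"^\s*\d+\.\s+([^:]+):\s*(.*)$")


def parse_key_ideas(output: str):
    """Extract (title, description) pairs from a numbered 'Title: text' list."""
    ideas = []
    for line in output.splitlines():
        match = ITEM_PATTERN.match(line)
        if match:
            ideas.append((match.group(1).strip(), match.group(2).strip()))
    return ideas


sample_output = (
    "Sure! Here are the key ideas and keywords from the passage:\n\n"
    "1. Research opportunities: Many are available to students.\n"
    "2. Prestige: The program is described as prestigious."
)
for title, description in parse_key_ideas(sample_output):
    print(title, "->", description)
```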

### Revision

In [69]:
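# `summaries` is not defined in the cells shown; it is assumed to be a list of
# summary strings collected from the summarization step above.
prompt = f"""
    These summaries are being used to build a wiki page describing the {school_name} Computer Science Department. 
    What information is missing and should be added? Here is the passage: {''.join(summaries)}
"""
out = prompt_llama2(prompt)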

In [71]:
print(out)

Based on the provided passage, here is what is missing and should be added to the UIUC Computer Science Department wiki page: 

1. Additional Research Areas: While the passage mentions a wide range of research topics, it would be beneficial to provide more specific details about the department's research areas. For example, the department may have strengths in areas like machine learning, data analytics, cybersecurity, human-computer interaction, etc. Additionally, the passage could highlight any interdisciplinary research collaborations between different departments or colleges within the university.  
2. More Information on Undergraduate Research Opportunities: Although the passage mentions undergraduate research programs like PURE, ISUR, and the Summer Research Program, it would be useful to provide more detailed information about each program. This could include eligibility criteria, application deadlines, and the types of research projects available to undergraduate students.  
3.