# GenAI for Humanists
*Capstone project - 2024W 136031-1 GenAI for Humanists*

***

## Generative AI for Automated Text Assessment and Tailored Practice in Spanish Language Learning - *Project Proposal, Felix Aufreiter (01251759)*

The primary goal of this project is to create a generative AI-driven pipeline that supports Spanish language learners in improving their writing skills. This system will automatically analyse students' written texts, identifying linguistic mistakes and assessing their overall proficiency according to the *Common European Framework of Reference for Languages (CEFRL)* and the available grading schemes of the *BMBWF*.
 
Based on this analysis, the system will generate personalised practice exercises that focus on the specific areas where each student struggles, ensuring targeted and effective learning. These new exercises will be accompanied by detailed solutions. Students will be provided with the possibility to use these newly created tasks to improve their grammar skills. 

Either OpenAI API or a suitable open-source LLM will be selected as the platform for developing this pipeline with detailed prompt engineering. This project should combine generative AI and humanistic education. The aim is to use generative AI to improve the personalised support of students and help educators/teachers to create personalised exercises.

### A short guide to the planned development process.
####  Data Collection
* Collect a dataset of Spanish language learner texts.
* As I work in adult and secondary education myself, I have access to authentic texts from students who currently have a Spanish level of around A1-A2. These texts could be used anonymously for the test of the prompts developed. 
#### Pipeline Development
* **Step 1:** Preprocess student submissions, if necessary with OCR provided by OpenAI API or an open-source model.
* **Step 2:** Detect specific errors (e.g., verb conjugations, word order, prepositions).
* **Step 3:** Use the CEFRL rating scheme to classify texts into proficiency levels. The available rating schemes of the Bundesministerium für Bildung, Wissenschaft und Forschung will be used to grade the texts in accordance with the official guidelines for austrian teachers (Bundesministerium für Bildung, Wissenschaft und Forschung, n.d.).
* **Step 4:** Provide an improved/corrected version of the text . 
* **Step 5:** Use generative AI to design practice materials targeting the identified errors. 
* **Step 6:** Convert the generated exercises to a pdf.
* **Step 7:** Saving the personalised files to a cloud folder with the option to share the newly created files. (This step will only be done if a real-world test is possible.)
#### Expected Outcomes
* Corrected texts with a CEFRL rating according to the learners’ level, corrected texts according to official grading schemes of the Austrian educational system.
* Exercises for the students (pdf with solutions for self-study).
#### Future Use
* Another project: Research Seminar 

## Inicial setup
* Step 1: importing OpenAI API-Key
* Step 2: importing packages
* Step 3: creating a function and test prompt to use **OpenAI API**
* 

In [4]:
!pip install openai python-dotenv



In [21]:
pip install python-docx

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2
Note: you may need to restart the kernel to use updated packages.


In [17]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file
_ = load_dotenv(find_dotenv())

# Access your API key
openai.api_key = os.getenv('OPENAI_API_KEY')

In [5]:
#Checking directory for '.env' + '.gitignore'
print("Current working directory:", os.getcwd())
print("Contents:", os.listdir(os.getcwd()))

Current working directory: C:\Users\faufr\Documents\GitHub\GenAI_project
Contents: ['.env', '.git', '.gitignore', '.ipynb_checkpoints', 'capstone_project.ipynb', 'README.md']


In [None]:
# CHECK your API key
print(os.environ.get("OPENAI_API_KEY"))

## Creating test-prompt

In [9]:

from openai import OpenAI
# Initialize OpenAI client
client = OpenAI(
    api_key=os.getenv('OPENAI_API_KEY')
)

def get_completion(prompt, model="gpt-4o-mini"):
    messages = [
        {
            "role": "system",
            "content": "You are a very skilled language teacher. Your answers are precise and you use easy language."
        },
        {
            "role": "user",
            "content": prompt
        }
    ]
    
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    return response.choices[0].message.content


In [11]:
prompt = "Who is 'The Undertaker' and what are his favorite wrestling moves?"
response = get_completion(prompt)
print(response)

The Undertaker is a famous professional wrestler in WWE (World Wrestling Entertainment). His real name is Mark Calaway, and he is known for his dark, mysterious character and impressive performances in the ring. He has been a part of wrestling for over 30 years and is considered one of the greatest wrestlers of all time.

Some of his favorite wrestling moves include:

1. **Tombstone Piledriver**: This is his signature finishing move, where he lifts his opponent upside down and then drops them on their head.

2. **Chokeslam**: He grabs his opponent by the throat and lifts them high before slamming them down to the mat.

3. **Last Ride**: This is a powerful move where he lifts his opponent onto his shoulders and then slams them down.

4. **Old School**: He walks along the top rope and then jumps down onto his opponent.

These moves are part of what makes The Undertaker a legendary figure in wrestling!


## Importing texts
* example texts to analyze with the LLM

In [25]:
#IMPORTING example text
from docx import Document

# Path to the Word document
file_path = './data/correo electronico.docx'

# Open and read the document
doc = Document(file_path)
print(doc)

# Combine all paragraphs into one string
document_text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
print(document_text)

<docx.document.Document object at 0x00000187AA941220>
Correo electronico 

De: adrian123@gmail.com
Para: padres123@gmail.com
Fecha: 12 de novembre de 2024
Asunto: me ciudad favorita

Queridos padres,

Estoy en mi ciuadad favorita, Barcelona. La ciudad es muy bonita y hay mucha gente viviendo aquí. Además, hace mucho calor aquí en Barcelona.

Hoy he visto la ciudad. Para el almuerzo probe tapas en un pequeno restaurante local. ! Fue delicioso! Me gusta de la playa y comer. Luego pasee por el parque y pasee por la playa de la noche. Mas tarde me relaje en la playa de la Barcelona.

Manana tengo planes emocionantes. Quiero visitar muchos mercados y restaurantes para probar mas comida tipica. Tambien espero encontar algunos Souvenirs unicos para llevar a casa.

Estoy disfrutando mucho de mi tiempo aqui y no puedo esperar para contarles mas.

Saludos,
Adrian




## Analyzing the data with an LLM

* The prompt was designed using the `CO-STAR framework`. (https://towardsdatascience.com/how-i-won-singapores-gpt-4-prompt-engineering-competition-34c195a93d41)
***
    * Context: Provide background information on the task
    * Objective: Define what the task is that you want the LLM to perform
    * Style: Specify the writing style you want the LLM to use
    * Tone: Set the attitude of the response
    * Audience: Identify who the response is intended for
    * Response: Provide the response format
***
* The aim of this prompt is to provide examples structure for the LLM to correct the provided text sample ("correo_electrónico" --> document_text). The LLM should use the CEFRL (Common Framework of Reference for Language) to set a bar for the expected quality of the written text which was delivered by a student.
* The actual structre of the feedback is derived from the offical Austrian guidelines for grading students' assignements released by the Austrian **Ministry for Education, Science and Research** (

In [29]:
prompt = f"""
Correct the mistakes of the provided text. The text was written by a student. The student should have a language level A2-B1 according
to Common European Framework of Reference for Languages: Learning, Teaching, Assessment CEFRL. Correct the mistakes and provide the adequate solutions.
Don't change the original input text. Only use this structure to correct the text:

Example 1: "Yo te llama José."
Corrected example 1: "Yo te llama José. (Yo me llamo José.)"
Example 2: "Las coches es caros."
Corrected example 2: "Las coches es caros. (Las coches son caros.)"
Example 3: "Hola María, yo te llama José."
Corrected example 3: "Hola María, yo te llama José. (Hola María, me llamo José.) Me gusta los caballo. (Me gustan los caballos.)"

Please correct this text and provide one sentence objective and simple with feedback and which grammar topics the student should improve.
The feedback and explanations have to be in german.
Feedback structure:
"Feedback: [insert feedback sentence here]
An diesem Grammatikthema/diesen Grammatikthemen solltest du noch arbeiten: [insert grammar topics here]
Beurteilungsraster A2:
- Erfüllung der Aufgabenstellung (EA): [insert chosen descriptions]
- Aufbau und Layout (AL): [insert chosen descriptions]
- Spektrum Sprachlicher Mittel (SSM): [insert chosen descriptions]
- Sprachrichtigkeit (SR): [insert chosen descriptions]"
```{document_text}```
"""

response = get_completion(prompt)
print(response)

Correo electronico 

De: adrian123@gmail.com  
Para: padres123@gmail.com  
Fecha: 12 de novembre de 2024  
Asunto: me ciudad favorita  

Queridos padres,  

Estoy en mi ciuadad favorita, Barcelona. (Estoy en mi ciudad favorita, Barcelona.) La ciudad es muy bonita y hay mucha gente viviendo aquí. Además, hace mucho calor aquí en Barcelona.  

Hoy he visto la ciudad. Para el almuerzo probe tapas en un pequeno restaurante local. (Para el almuerzo probé tapas en un pequeño restaurante local.) ! Fue delicioso! Me gusta de la playa y comer. (Me gusta la playa y comer.) Luego pasee por el parque y pasee por la playa de la noche. (Luego paseé por el parque y paseé por la playa de la noche.) Mas tarde me relaje en la playa de la Barcelona. (Más tarde me relajé en la playa de Barcelona.)  

Manana tengo planes emocionantes. (Mañana tengo planes emocionantes.) Quiero visitar muchos mercados y restaurantes para probar mas comida tipica. (Quiero visitar muchos mercados y restaurantes para probar má

## Ad data to the pipeline
* Beurteilungsraster
* templates for text types
* connect to cloud

https://www.reddit.com/r/LangChain/comments/18dloff/attaching_files_to_user_prompt_when_using/?rdt=42513

https://python.langchain.com/docs/tutorials/rag/

https://llamahub.ai/

* python analyzer: https://pypi.org/project/cefrpy/

In [32]:
# Langchain?
# Llamaindex? --> RAG?

## Llamaindex: RAG-pipeline

In [58]:
# installing packages

# custom selection of integrations to work with core

!pip install llama-index-core
!pip install llama-index-llms-openai
!pip install llama-index-llms-replicate
!pip install llama-index-embeddings-huggingface



ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index 0.12.0 requires llama-index-core<0.13.0,>=0.12.0, but you have llama-index-core 0.11.23 which is incompatible.
llama-index 0.12.0 requires llama-index-llms-openai<0.4.0,>=0.3.0, but you have llama-index-llms-openai 0.2.16 which is incompatible.
llama-index-readers-llama-parse 0.4.0 requires llama-index-core<0.13.0,>=0.12.0, but you have llama-index-core 0.11.23 which is incompatible.
llama-index-readers-file 0.4.0 requires llama-index-core<0.13.0,>=0.12.0, but you have llama-index-core 0.11.23 which is incompatible.
llama-index-question-gen-openai 0.3.0 requires llama-index-core<0.13.0,>=0.12.0, but you have llama-index-core 0.11.23 which is incompatible.
llama-index-question-gen-openai 0.3.0 requires llama-index-llms-openai<0.4.0,>=0.3.0, but you have llama-index-llms-openai 0.2.16 which is incompatib

Collecting llama-index-core<0.12.0,>=0.11.7
  Using cached llama_index_core-0.11.23-py3-none-any.whl (1.6 MB)
Installing collected packages: llama-index-core
  Attempting uninstall: llama-index-core
    Found existing installation: llama-index-core 0.12.13
    Uninstalling llama-index-core-0.12.13:
      Successfully uninstalled llama-index-core-0.12.13
Successfully installed llama-index-core-0.11.23
Collecting llama-index-llms-replicate

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llama-index 0.12.0 requires llama-index-llms-openai<0.4.0,>=0.3.0, but you have llama-index-llms-openai 0.2.16 which is incompatible.
llama-index-question-gen-openai 0.3.0 requires llama-index-llms-openai<0.4.0,>=0.3.0, but you have llama-index-llms-openai 0.2.16 which is incompatible.
llama-index-program-openai 0.3.1 requires llama-index-llms-openai<0.4.0,>=0.3.0, but you have llama-index-llms-openai 0.2.16 which is incompatible.
llama-index-postprocessor-cohere-rerank 0.2.1 requires llama-index-core<0.12.0,>=0.11.0, but you have llama-index-core 0.12.13 which is incompatible.
llama-index-multi-modal-llms-openai 0.3.0 requires llama-index-llms-openai<0.4.0,>=0.3.0, but you have llama-index-llms-openai 0.2.16 which is incompatible.
llama-index-llms-openai 0.2.16 requires llama-index-core<0.12.0,>=0.11.7, but you h


  Downloading llama_index_llms_replicate-0.4.0-py3-none-any.whl (3.2 kB)
Collecting llama-index-core<0.13.0,>=0.12.0
  Using cached llama_index_core-0.12.13-py3-none-any.whl (1.6 MB)
Installing collected packages: llama-index-core, llama-index-llms-replicate
  Attempting uninstall: llama-index-core
    Found existing installation: llama-index-core 0.11.23
    Uninstalling llama-index-core-0.11.23:
      Successfully uninstalled llama-index-core-0.11.23
Successfully installed llama-index-core-0.12.13 llama-index-llms-replicate-0.4.0


In [62]:
openai.api_key = os.getenv('OPENAI_API_KEY')

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/").load_data()
index = VectorStoreIndex.from_documents(documents)

In [66]:
# USE --> OpenAI-embedding --> COLLAB!--> vector store index!

In [None]:
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.replicate import Replicate
from transformers import AutoTokenizer

# set the LLM
llama2_7b_chat = "meta/llama-2-7b-chat:8e6975e5ed6174911a6ff3d60540dfd4844201974602551e10e9e87ab143d81e"
Settings.llm = Replicate(
    model=llama2_7b_chat,
    temperature=0.01,
    additional_kwargs={"top_p": 1, "max_new_tokens": 300},
)

# set tokenizer to match LLM
Settings.tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf"
)

# set the embed model
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(
    documents,
)

In [None]:
query_engine = index.as_query_engine()
query_engine.query("YOUR_QUESTION")

# To-do:
* add langchain? (add tools and other methods?)
* add MS Outlook Mail or MS Teams message?
* create a pdf out of the feedback?
* RAG with langchain: https://levelup.gitconnected.com/unlocking-llms-potential-with-rag-a-complete-guide-from-basics-to-advanced-techniques-b4557f268134


# Future use
* connect LLM to SQL-database (CONNECTION to project of RESEARCH SEMINAR): https://medium.com/dataherald/how-to-connect-llm-to-sql-database-with-langchain-sqlagent-48635fddaa74
* connect to database of "Aufgabenpool" tasks: https://aufgabenpool.at/srdp_lfs/index.php?id=sp
* langchain + sqlchain https://medium.com/dataherald/how-to-langchain-sqlchain-c7342dd41614