# **Problem Statement 1: Natural Language Processing (NLP)**





In [None]:

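import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the tokenizer models, stop-word list, and WordNet data used below
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')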

def preprocess_tokenize(text):
    '''
    Cleans a string of text and tokenizes it into sentences and words.

    This function performs the following operations, in order:
    1. Normalizes newline characters so the text is one continuous string
    2. Converts the text to lowercase
    3. Splits the text into sentences
    4. Tokenizes each sentence into words
    5. Drops punctuation tokens and English stop words
    6. Lemmatizes the remaining words

    Args:
    text (str): The input text to be processed

    Returns:
    tuple: A tuple containing two elements:
        - list of str: One processed sentence per input sentence, rejoined from its filtered words
        - list of list of str: The filtered, lemmatized words of each sentence
    '''
    # Replace each newline (and any whitespace around it) with a single space,
    # so that words on adjacent lines are not fused together
    cleaned_text = re.sub(r'\s*\n\s*', ' ', text)

    # Convert the cleaned text to lowercase to ensure uniformity
    lower_cleaned_text = cleaned_text.lower()

    # Tokenize the text into sentences
    sentences = sent_tokenize(lower_cleaned_text)

    # Define the set of stop words
    stop_words = set(stopwords.words('english'))

    # Initialize the WordNet lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Initialize lists to store processed sentences and words
    processed_sentences = []
    list_of_words = []

    # Process each sentence individually
    for sentence in sentences:
        # Tokenize the sentence into words
        words = word_tokenize(sentence)

        # Drop tokens that contain no letter or digit; this also catches
        # multi-character punctuation tokens such as '...', '``' and "''"
        words = [word for word in words if any(ch.isalnum() for ch in word)]

        # Remove stop words and apply lemmatization to the remaining words
        filtered_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
        list_of_words.append(filtered_words)

        # Join the filtered words back into a single string
        processed_sentence = ' '.join(filtered_words)
        processed_sentences.append(processed_sentence)

    # Return the list of processed sentences and the list of tokenized words
    return processed_sentences, list_of_words
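
Punctuation removal at the token level has a subtle edge. A test of the form `token in string.punctuation` is a substring check against one long string, so multi-character tokens such as `...` or the `` `` `` and `''` quote tokens that Treebank-style tokenizers emit pass through it. The sketch below compares that test with an alphanumeric check. The token list is hand-written for illustration and is an assumption, not real tokenizer output.

```python
import string

# Hand-written sample tokens (an assumption; not produced by a real tokenizer)
tokens = ["he", "said", "``", "wait", "...", "''", "and", "left", "."]

# Substring test against the string.punctuation constant
kept_by_membership = [t for t in tokens if t not in string.punctuation]

# Keep a token only if it contains at least one letter or digit
kept_by_alnum = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(kept_by_membership)
print(kept_by_alnum)
```

Only the single `.` is removed by the membership test, while the alphanumeric check leaves just the five words.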


# **Problem Statement 2: Text Generation**

I decided to use Microsoft's Phi-3.5-mini-instruct for text generation. Because Colab offers limited compute units, I load the model in 4-bit precision through the bitsandbytes integration in Hugging Face Transformers, with NF4 quantization and double quantization enabled.
I also give the model a one-shot prompt that shows what type of answer is expected.
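
To get a rough sense of why 4-bit loading matters on a small GPU, the weight memory can be estimated with simple arithmetic. This is a back-of-the-envelope sketch under stated assumptions: the 3.8B parameter count is the figure from the model card, and the per-parameter overhead of the quantization constants (about 0.5 bits, or about 0.127 bits with double quantization) is taken from the QLoRA paper. It ignores activations, the KV cache, and any layers that stay in higher precision, so the real footprint will be somewhat larger.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 3.8e9  # assumed parameter count for Phi-3.5-mini

fp16_gb = weight_memory_gb(N_PARAMS, 16)
# 4-bit weights plus ~0.5 bits per parameter for quantization constants
nf4_gb = weight_memory_gb(N_PARAMS, 4 + 0.5)
# Double quantization compresses the constants to ~0.127 bits per parameter
nf4_dq_gb = weight_memory_gb(N_PARAMS, 4 + 0.127)

print(f"fp16 weights:                  {fp16_gb:.2f} GB")
print(f"NF4 weights:                   {nf4_gb:.2f} GB")
print(f"NF4 + double quantization:     {nf4_dq_gb:.2f} GB")
```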

**Please change the runtime to T4 GPU, as bitsandbytes quantization requires a GPU.**
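
A quick check before loading the model can confirm that the runtime actually has a GPU attached. This is a minimal sketch; the helper name `cuda_gpu_available` is hypothetical, and it returns `False` when PyTorch is not installed.

```python
import importlib.util

def cuda_gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

if cuda_gpu_available():
    import torch
    print("GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected; switch the Colab runtime to T4 GPU before loading the model.")
```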

In [1]:
!pip install transformers --quiet
!pip install accelerate --quiet
!pip install -U bitsandbytes --quiet

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_name = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit NF4 quantization with double quantization; matrix multiplications run in bfloat16
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=nf4_config,
    device_map="auto",
    trust_remote_code=True,
)

In [None]:

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specializing in south indian food. When asked about any dish, you give out the ingredients as well as the recipe"},
    {"role": "user", "content": "How is karam dosa made?"},
    {"role": "assistant", "content": "Karam dosa is a spicy version of the traditional South Indian dosa. The term karam means spicy in Telugu, and this dish is particularly popular in the Andhra Pradesh region. Here's a high-level summary of how it's made: Ingredients: Dosa batter: Made from rice and urad dal (black gram). Karam (spicy) chutney: Made from ingredients like red chilies, garlic, onions, and tamarind. Oil or ghee: For cooking the dosa. Toppings: Some variations may include finely chopped onions, curry leaves, and coriander. Method: Prepare the batter: The dosa batter is prepared by soaking rice and urad dal, grinding them to a fine paste, and fermenting the mixture overnight. Make the karam chutney: Blend red chilies, garlic, onions, tamarind, and other spices to create a thick, spicy chutney. Cook the dosa: Heat a tawa (griddle), pour a ladleful of batter, and spread it evenly in a circular motion to form a thin dosa. Apply karam chutney: Once the dosa is partially cooked, spread a layer of the spicy chutney on top. Cook until crispy: Add a bit of oil or ghee around the edges and cook until the dosa is crispy and golden brown. Serve: Fold the dosa and serve hot, often with coconut chutney and sambar."},
    {"role": "user", "content": "How is rava dosa made?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: with do_sample=False the temperature setting has no effect, so it is omitted
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
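
The one-shot prompt follows a fixed pattern: a system instruction, one worked user/assistant exchange, and then the real question. A small helper can make that pattern reusable for other dishes. This is a sketch only; `build_one_shot_messages` is a hypothetical name, not part of Transformers, and the shortened example answer is a placeholder.

```python
from typing import Dict, List

def build_one_shot_messages(
    system_prompt: str,
    example_question: str,
    example_answer: str,
    question: str,
) -> List[Dict[str, str]]:
    """Build a chat message list containing one worked example before the real question."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_question},
        {"role": "assistant", "content": example_answer},
        {"role": "user", "content": question},
    ]

demo_messages = build_one_shot_messages(
    system_prompt="You are a helpful AI assistant specializing in south indian food.",
    example_question="How is karam dosa made?",
    example_answer="Ingredients: ... Method: ...",  # placeholder for the full worked answer
    question="How is rava dosa made?",
)

for message in demo_messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as `messages` and can be passed to the pipeline in the same way.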