# Pretraining LLMs from DeepLearning.ai

This notebook consists of generated text from

1. Base pretrained model (TinySolar)
2. Fine tuned model
3. Continued pretrained model

Code here has been modified to match my style and preferences.

In [4]:
from google.colab import drive
drive.mount('/content/drive')


import pandas as pd

#df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/DeepLearning_ai_Pretrain_LLMs/rentmudah.csv')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
path = '/content/drive/MyDrive/Colab_Notebooks/DeepLearning_ai_Pretrain_LLMs/requirements.txt'

!pip install -r {path} #insert -q for quiet install



In [8]:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import TextStreamer


def fix_torch_seed(seed=42):
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False

fix_torch_seed()

# 1. Base model TinySolar-248m-4k

In [26]:
model_path = "upstage/TinySolar-248m-4k"

#import model from huggingface
tiny_general_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map ="cpu",
    torch_dtype = torch.bfloat16)


# Set the autotokenizer for the imported model
tiny_general_tokenizer = AutoTokenizer.from_pretrained(model_path)




## Text generation -  Ok la

In [31]:
prompt = "I am a superhero and I've been thinking the purpose of life on earth"

def model_output(prompt,model,tokenizer):

  tiny_tokenizer = tokenizer

  inputs = tiny_tokenizer(prompt, return_tensors="pt")

  streamer = TextStreamer(tiny_general_tokenizer, skip_prompt=False, skip_special_tokens=True)

  output = tiny_general_model.generate(**inputs, streamer=streamer, max_new_tokens=128,do_sample=False,temperature=0.0,repetition_penalty=1.1)

model_output(prompt,tiny_general_model,tiny_general_tokenizer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


I am a superhero and I've been thinking the purpose of life on earth is to be able to live in peace.
I think that's what makes me so happy.
I love this quote from my friend, "The only thing you can do is make yourself happy."
It's true. You have to be happy to be happy. But if you are not happy, then you don't have any happiness.
You have to be happy to be happy.
But if you are unhappy, then you have to be happy.
And if you are happy, then you have to be happy.
If you are happy, then you have to be happy.
So if you


## Python sample output-> CRAP

In [33]:
prompt = "def find_max(numbers):"

model_output(prompt,tiny_general_model,tiny_general_tokenizer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


def find_max(numbers):
   """
   Returns the number of times a user has been added to the list.
   """
   num = len(list)
   if len(list) > 1:
       return len(list[0])
   else:
       return len(list)


def get_user_id(user_id, user_name):
   """
   Returns the user id for this user.
   """
   return user_id


def get_user_id_from_user(user_id, user_name):
   """
  


# 2. Fine tuned model

In [34]:
model_path = "upstage/TinySolar-248m-4k-code-instruct"

#import model from huggingface
tiny_finetuned_model = AutoModelForCausalLM.from_pretrained(model_path)
tiny_finetuned_tokenizer = AutoTokenizer.from_pretrained(model_path)



In [35]:
prompt = "def find_max(numbers):"

model_output(prompt,tiny_finetuned_model,tiny_finetuned_tokenizer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


def find_max(numbers):
   """
   Returns the number of times a user has been added to the list.
   """
   num = len(list)
   if len(list) > 1:
       return len(list[0])
   else:
       return len(list)


def get_user_id(user_id, user_name):
   """
   Returns the user id for this user.
   """
   return user_id


def get_user_id_from_user(user_id, user_name):
   """
  


# 3. Pretrained model

In [36]:
model_path = "upstage/TinySolar-248m-4k-py"

tiny_pretrained_model = AutoModelForCausalLM.from_pretrained(model_path)
tiny_pretrained_tokenizer = AutoTokenizer.from_pretrained(model_path)




config.json:   0%|          | 0.00/639 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/966 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [37]:
model_output(prompt,tiny_pretrained_model,tiny_pretrained_tokenizer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


def find_max(numbers):
   """
   Returns the number of times a user has been added to the list.
   """
   num = len(list)
   if len(list) > 1:
       return len(list[0])
   else:
       return len(list)


def get_user_id(user_id, user_name):
   """
   Returns the user id for this user.
   """
   return user_id


def get_user_id_from_user(user_id, user_name):
   """
  
