<a href="https://colab.research.google.com/github/eeshashetty/podgpt/blob/main/pod_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Mount Drive (to save checkpoints + generated texts)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!nvidia-smi

Thu Mar 30 20:55:59 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P8    12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install GPT2 Simple Model package - https://github.com/minimaxir/gpt-2-simple
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime

# Scraping
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

## Part 1 - Scraping

In [None]:
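# Collect episode links from the Freakonomics Radio listing pages on happyscribe

root = "https://www.happyscribe.com"
pages = ["https://www.happyscribe.com/public/freakonomics-radio", "https://www.happyscribe.com/public/freakonomics-radio?page=2"]
links = []
for page in pages:
  res = requests.get(page)
  res.raise_for_status()  # stop early if a listing page fails to load
  soup = BeautifulSoup(res.content, 'html.parser')

  # Each episode card is an <a class="hsp-card-episode"> whose href is appended to root
  for d in soup.select('a.hsp-card-episode'):
    links.append(root + d["href"])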

len(links)

28

In [None]:
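def get_transcript(link):
  """Download one episode page and return its transcript wrapped in GPT-2 start/end tokens."""
  res = requests.get(link)
  soup = BeautifulSoup(res.content, 'html.parser')
  txt = "<|startoftext|>"
  for para in soup.select('div.hsp-paragraph'):
    # Skip the first 11 characters of each paragraph (presumably a leading timestamp)
    txt += para.text[11:].strip() + "\n"
  txt += "<|endoftext|>"
  return txt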

In [None]:
transcripts = []
for link in tqdm(links):
  transcripts.append(get_transcript(link))

# Note: this path must match DATASET_PATH used for finetuning in Part 2
with open("/content/drive/MyDrive/Spring 2023/10-615 Art/dataset_freakonomics.txt", "w") as f:
  f.writelines(transcripts)

100%|██████████| 28/28 [00:34<00:00,  1.23s/it]
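
A quick sanity check on the saved text can confirm that every scraped episode ended up as one delimited document. The sketch below is not part of the original notebook: `count_documents` is a hypothetical helper, and it only assumes the `<|startoftext|>` / `<|endoftext|>` wrapping used in `get_transcript`.

```python
START, END = "<|startoftext|>", "<|endoftext|>"

def count_documents(text):
    """Return the number of delimited documents; raise if delimiters are unbalanced."""
    n_start = text.count(START)
    n_end = text.count(END)
    if n_start != n_end:
        raise ValueError(f"unbalanced delimiters: {n_start} starts, {n_end} ends")
    return n_start

# Tiny stand-in for the real dataset: two wrapped documents back to back
sample = START + "Hello.\n" + END + START + "World.\n" + END
print(count_documents(sample))
```

Run against the joined transcripts (for example `count_documents("".join(transcripts))`), the count should equal `len(links)`.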


## Part 2 - Finetuning

In [None]:
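# Setup Config for Fine Tuning
config = {
    "model": "124M", # which GPT-2 model size to finetune
    "steps": 1000, # number of training steps (batches), not epochs
    "run_name": "run-3-1000", # checkpoint folder name; change it for each new run
    "print_every": 100, # log the loss every N steps
    "sample_every": 200, # print a generated sample every N steps
    "save_every": 300 # save a checkpoint every N steps
}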
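
Since `steps` counts training steps rather than passes over the data, it can help to estimate how many times the model sees the dataset. The sketch below is a rough back-of-the-envelope calculation, not something the library reports. It uses the token count from the training log (266912) and assumes each step draws one 1024-token window. `approx_passes`, `tokens_per_sample`, and `batches_per_step` are hypothetical names, and the defaults are assumptions to adjust for the actual finetuning settings.

```python
def approx_passes(steps, dataset_tokens, tokens_per_sample=1024, batches_per_step=1):
    """Estimate how many dataset-sized chunks of tokens are seen during training.

    tokens_per_sample and batches_per_step are assumptions about the trainer,
    not values read from the library.
    """
    tokens_seen = steps * batches_per_step * tokens_per_sample
    return tokens_seen / dataset_tokens

# 1000 steps over the 266912-token dataset from the training log
passes = approx_passes(steps=1000, dataset_tokens=266912)
print(f"approximately {passes:.1f} passes over the dataset")
```

Under these assumptions, 1000 steps correspond to a little under four passes. If the trainer processes several batches per step, the figure scales up proportionally.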

In [None]:
# Download the GPT2 model selected in config
gpt2.download_gpt2(model_name=config["model"])

Fetching checkpoint: 1.05Mit [00:00, 517Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 1.06Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 576Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:36, 13.7Mit/s]
Fetching model.ckpt.index: 1.05Mit [00:00, 278Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 1.40Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 1.41Mit/s]


In [None]:
sess = gpt2.start_tf_sess()
DATASET_PATH = "/content/drive/MyDrive/Spring2023/10615/dataset_freakonomics.txt"
gpt2.finetune(sess,
              dataset=DATASET_PATH,
              model_name=config["model"],
              steps = config["steps"],
              restore_from="fresh",
              run_name=config["run_name"],
              print_every=config["print_every"],
              sample_every=config["sample_every"],
              save_every=config["save_every"]
              )

/content/drive/MyDrive/Spring2023/10615
Loading checkpoint models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:01<00:00,  1.80s/it]


dataset has 266912 tokens
Training...
[100 | 230.96] loss=2.86 avg=2.86
[200 | 457.02] loss=2.42 avg=2.64
 Do I need a cell?
Yeah. So, look, you could pick up a cell now if you want, by the way, but I don't think you need one. I'm just curious.
You've just written about the difficulty of replicating a famous process. Do you want to share your work? Also, please describe something that's been like a whole life, except now you're more familiar with the original.
A whole lot less familiar. Yeah.
I mean, there's this incredible passage in Book 2 about the growth of a bird, a big tree. And then the tree is taken from it and the trunk is pulled out onto the ground. It's like a basket of leaves and branches like from something you'd see in a movie. And the trunk isn't anything like anything you'd see in a movie, really. Because the original tree trunk does not contain very much nutrients and so you grow it like a child tree. And and so the thing about a famous tree goes back thousands of year

Instructions for updating:
Use standard file APIs to delete files with this prefix.


 but a little over two hundred and seventy thousand dollars a year.
Some of these firms are based in Massachusetts, others are based in California. The total tax burden in the U.S. is higher than anywhere else. But the reason that Massachusetts is so high is that Massachusetts is an incredibly regressive state, and that redistributes wealth incredibly quickly. How regressive is it to the extent that you've got a company that makes a lot of money in the state and then hundreds or thousands of others start producing hundreds of thousands of dollars in the next place?
Now, to figure it out, we had to do a much wider investigation.
We went back hundreds of cases where the actual tax code in that state is regressive even more than the one we're describing here. And tax advocates would say that's precisely the problem here. So we're saying, you know, have a lot of kids. Have lots of kids. And then you write your own rate of taxation based on that data.
Here's another side effect of that. We 

In [None]:
# # Uncomment these lines to load a previously finetuned checkpoint instead of training again
# sess = gpt2.start_tf_sess()
# gpt2.load_gpt2(sess, run_name = config["run_name"])

# Generate to a filepath
OUTPATH = "/content/drive/MyDrive/Spring2023/10615/example_1000_3.txt"
gpt2.generate_to_file(sess, 
                      destination_path = OUTPATH,
                      prefix="<|startoftext|>", # replace with a custom prompt to seed generation
                      truncate="<|endoftext|>",
                      run_name = config["run_name"],
                      include_prefix = False)
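
The combined effect of `prefix`, `truncate`, and `include_prefix` can be pictured with plain string handling. The sketch below is only an illustration of the intended result (drop the start token, cut at the first end token). `trim_sample` is a hypothetical helper and is not how gpt-2-simple implements these options internally.

```python
def trim_sample(text, prefix="<|startoftext|>", truncate="<|endoftext|>", include_prefix=False):
    """Strip the leading prefix (unless include_prefix) and cut at the first truncate token."""
    if not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]
    end = text.find(truncate)
    if end != -1:
        text = text[:end]
    return text

# Raw model output often runs past the end token into the next document
raw = "<|startoftext|>Welcome to the show.\n<|endoftext|><|startoftext|>Next"
print(repr(trim_sample(raw)))
```

Only the first generated document survives, without either delimiter, which is the form the file at `OUTPATH` is expected to have.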