# AIT500 - Assignment 1 - Part 2

## Objective

For this part of Assignment 1, we want to explore the feasibility of using the 6 Pokemon statistics to predict a Pokemon's type.

In *Assignment 1 - Part 1*, a scatter matrix was used to explore whether there were relationships between each of the 6 statistics for Pokemon of the same type (i.e. "Type 1").

With a large number of features, it becomes harder to see in a scatter matrix whether data points of the same class are similar.

In this part of the assignment, you will use the [t-SNE algorithm from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to reduce the 6 features to 3 dimensions, giving a different way to visualize whether Pokemon of the same type are similar (ex: will all Grass Pokemon cluster near each other in the visualization?).
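
To make the idea concrete, here is a minimal sketch of reducing 6 features to 3 dimensions with t-SNE. It uses synthetic stand-in data (three made-up clusters), not the Pokemon dataset, and the names `X_demo`, `labels_demo`, and `embedded_demo` are illustrative only. The TSNE parameters mirror the ones used in the provided code later in this notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in data (NOT the Pokemon dataset): 3 made-up classes,
# 20 samples each, 6 numeric features per sample.
rng = np.random.default_rng(42)
centers = np.array([[40.0] * 6, [80.0] * 6, [120.0] * 6])
X_demo = np.vstack([c + rng.normal(scale=5.0, size=(20, 6)) for c in centers])
labels_demo = np.repeat(["A", "B", "C"], 20)

# Reduce the 6 features to 3 dimensions.
# Note: perplexity must be smaller than the number of samples (60 here).
tsne_demo = TSNE(n_components=3, perplexity=15, random_state=42,
                 init="random", learning_rate=200)
embedded_demo = tsne_demo.fit_transform(X_demo)

print(embedded_demo.shape)  # one 3-dimensional point per input row
```

If samples of the same class are similar in the original 6-dimensional space, their 3-dimensional points will tend to land near each other, which is the behaviour the assignment asks you to look for.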

You will use Colab for this assignment.

## Assignment Instructions

1. Download the Pokemon dataset [here](https://www2.cs.arizona.edu/classes/)
1. Create a pandas dataframe called `df_pokemon_csc` from the dataset
1. Create a scatter matrix for the 6 Pokemon statistics
1. Use the provided TSNE code for dimensionality reduction
1. Use the provided plotly code to visualize different Pokemon on a 3D plot
  1. Describe your observations of the 3D plot: how easy is it to separate the different types of Pokemon?

1. Explore Using OpenAI Embeddings
  1. Read `Pokemon.csv` into a list
  1. Process each entry to exclude the columns `#`, `Type 1`, `Type 2`.  The first few rows should look like this
  ```
                      Bulbasaur,318,45,49,49,65,65,45,1,False
                        Ivysaur,405,60,62,63,80,80,60,1,False
                    Venusaur,525,80,82,83,100,100,80,1,False
      VenusaurMega Venusaur,625,80,100,123,122,120,80,1,False
  ```
  1. Use the OpenAI embedding API to create an embedding vector for each Pokemon using the model `text-embedding-ada-002`<br>
    [Embedding API reference](https://platform.openai.com/docs/api-reference/embeddings/create)
  1. Validate that an embedding was created in the `response` variable for each Pokemon (i.e. show that the number of embeddings in the `response` variable matches the number of Pokemon in the csv data)
  1. Extract the embedding vectors into an array called `X`
  1. Use the provided TSNE code to reduce the 1536-dimension vectors into 3 dimensions
  1. Use the provided plotly code to create an interactive 3D visualization ([reference here](https://plotly.com/python/3d-scatter-plots/#style-3d-scatter-plot))
  1. Show your calculation of the cost of using the OpenAI embeddings model (https://openai.com/pricing)
1. Explain in your own words 2 observations comparing the t-SNE plot using the 6 Pokemon statistics vs the t-SNE plot using embeddings generated by the LLM.  Ex: did the additional effort of using LLM embeddings provide useful information?
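
The "read into a list and exclude columns" step above could look roughly like the following sketch. It works on a two-row in-memory sample instead of the real `Pokemon.csv`, and it assumes the Kaggle column layout shown in `sample_csv`; the helper name `rows_without_excluded` is hypothetical.

```python
import csv
import io

# Two sample rows, assuming the Kaggle column layout.
# In the notebook you would open Pokemon.csv instead of using this string.
sample_csv = """#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
"""

excluded = {"#", "Type 1", "Type 2"}

def rows_without_excluded(csv_file):
    """Return one comma-joined string per data row, skipping the excluded columns."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first line holds the column names
    keep = [i for i, name in enumerate(header) if name not in excluded]
    return [",".join(row[i] for i in keep) for row in reader]

pokemon_rows = rows_without_excluded(io.StringIO(sample_csv))
for line in pokemon_rows:
    print(line)
```

Selecting columns by header name rather than by position keeps the sketch working even if the column order in your file differs.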

Once your lab is completed, download your notebook **with all cells evaluated** and upload it to the Classwork section of the Course Notebook.


### Pokemon Dataset Description

The Pokemon Dataset is a compilation of data for each Pokémon from the famous [video game](https://en.wikipedia.org/wiki/Pok%C3%A9mon_(video_game_series)) and [TV series](https://www.pokemon.com/us). It includes information like the Pokémon's type (Water, Fire, Grass, etc.), its various statistics (HP, Attack, Defense, etc.), and other characteristics.

There are many sources of Pokemon data online.  For this assignment we will use and compare Pokemon data from two sources:
1. [CSC120 Pokemon Dataset](https://www2.cs.arizona.edu/classes/cs120/fall17/ASSIGNMENTS/assg02/pokemon.html)
1. [Kaggle Pokemon Dataset](https://www.kaggle.com/datasets/abcsds/pokemon)




In [None]:
# increase Jupyter cell width

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Setup Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt # plotting library
import seaborn as sns
import plotly.express as px

import os
from PIL import Image

import json
import requests


import numpy as np
import re
from scipy.io.arff import loadarff

from sklearn.manifold import TSNE

os.getcwd()

In [None]:
try:
  import openai
  import tiktoken
except ModuleNotFoundError:
  !pip install openai # install openAI library to the notebook instance
  !pip install tiktoken # install tokenizer library
finally:
  import openai
  import tiktoken


In [None]:
# set pandas options
pd.set_option('display.max_columns', 50) # show more columns
pd.set_option('display.max_rows', 200) # show more rows
pd.set_option('max_colwidth', 400) # set wider columns

# Setup Google Drive access to API keys

In [None]:
#@markdown connect to drive for API keys stored in <br> `My Drive/Colab Notebooks/API_Keys`
import os, sys
from google.colab import drive
drive.mount('/content/mnt')
nb_path = '/content/notebooks'
if not os.path.islink(nb_path):  # avoid FileExistsError when this cell is re-run
  os.symlink('/content/mnt/My Drive/Colab Notebooks', nb_path)
sys.path.insert(0, nb_path)  # or append(nb_path)


In [None]:
!ls -al /content/notebooks

# Setup OpenAI API key

In [None]:
#@markdown This cell loads your `API_Keys/openai.txt` key as an environment variable for local use by Colab

#@markdown 1) create and save your OpenAI API key to a text file in your Google Drive Colab Notebooks folder here:  `Colab Notebooks/API_Keys/openai.txt`

SETUP_OPENAI=True #@param # True or False
if(SETUP_OPENAI):
  my_api_key_path = '/content/notebooks/API_Keys/openai.txt'
  with open(my_api_key_path,'r') as f:
    openai_key = f.readline().strip()  # strip the trailing newline so the key is sent cleanly

  os.environ['OPENAI_API_KEY'] = openai_key

  from openai import OpenAI
  client = OpenAI(
    api_key=os.environ['OPENAI_API_KEY'],  # this is also the default, it can be omitted
  )

In [None]:
#@markdown show list of models accessible with our API key
model_list = client.models.list()
model_list = [m.id for m in model_list]  # iterate over the returned Model objects

for m in model_list:
  print(m)


# Pokemon Dataset from CSC120

## Pokemon CSC120 - Step 1 - Download data

Save the raw file as `csc120_pokemon_data.csv`

In [None]:
# your code to download dataset here

In [None]:
save_csc120_pokemon_raw_filename = 'csc120_pokemon_data.csv'

In [None]:
# show first 10 lines of file
!head -n 10 {save_csc120_pokemon_raw_filename}

## Pokemon CSC120 - Step 2 - Load Pokemon data into dataframe

Load the data into a dataframe named `df_pokemon_csc`

In [None]:
# your code here


# Data Analysis

The description and meaning of each data field can be found in different online sources ([CSC120](https://www2.cs.arizona.edu/classes/cs120/fall17/ASSIGNMENTS/assg02/pokemon.html) or [Kaggle](https://www.kaggle.com/datasets/abcsds/pokemon)).

|Field|Description|
|-----|----------|
| # | ID for each pokemon |
| Name | Name of each pokemon |
| Type 1 | Each pokemon has a type, this determines weakness/resistance to attacks |
| Type 2 | Some pokemon are dual type and have 2 |
| Total | sum of all stats that come after this, a general guide to how strong a pokemon is |
| HP | hit points, or health, defines how much damage a pokemon can withstand before fainting |
| Attack | the base modifier for normal attacks (e.g. Scratch, Punch) |
| Defense | the base damage resistance against normal attacks |
| Sp. Atk | special attack, the base modifier for special attacks (e.g. fire blast, bubble beam) |
| Sp. Def | the base damage resistance against special attacks |
| Speed | determines which pokemon attacks first each round |
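
As a quick sanity check on the `Total` field described above, the following sketch recomputes it from the six statistics. The values are copied from the sample rows shown in the assignment instructions; the dictionary `sample_stats` is an illustrative stand-in for the dataframe you will load.

```python
# Stats for three Pokemon, copied from the sample rows in the instructions.
sample_stats = {
    "Bulbasaur": {"Total": 318, "HP": 45, "Attack": 49, "Defense": 49,
                  "Sp. Atk": 65, "Sp. Def": 65, "Speed": 45},
    "Ivysaur":   {"Total": 405, "HP": 60, "Attack": 62, "Defense": 63,
                  "Sp. Atk": 80, "Sp. Def": 80, "Speed": 60},
    "Venusaur":  {"Total": 525, "HP": 80, "Attack": 82, "Defense": 83,
                  "Sp. Atk": 100, "Sp. Def": 100, "Speed": 80},
}
stat_names = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Recompute Total as the sum of the six statistics and compare with the stored value.
totals_match = {}
for name, row in sample_stats.items():
    computed = sum(row[s] for s in stat_names)
    totals_match[name] = (computed == row["Total"])
    print(name, computed, totals_match[name])
```

Because `Total` is fully determined by the six statistics, it adds no new information when used as a feature alongside them.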




In [None]:
# create list of numerical fields for later use
numerical_fields = ['#', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation']

# create list of pokemon statistics fields for later use
stats_fields = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# create list of text fields for later use
text_fields = ['Name', 'Type 1', 'Type 2', 'Legendary']

# Pokemon TSNE visualization



In [None]:
# your code here to create a numpy array containing the 6 statistics

X = ...  # your code here

In [None]:
#@markdown Run this cell after creating X above
tsne = TSNE(n_components=3, perplexity=15, random_state=42, init="random", learning_rate=200)

vis_dims3 = tsne.fit_transform(X) # reduce 6-dimension stats into 3 dimensions for visualization

df = pd.DataFrame(vis_dims3, columns=['X','Y','Z'])
df.head(3)

In [None]:
#@markdown Run this cell after TSNE model is created above
df = pd.DataFrame(vis_dims3, columns=['X','Y','Z'])
df['Type 1'] = df_pokemon_csc['Type 1'].copy()
df['size'] = 0.03
fig = px.scatter_3d(df, x='X', y='Y', z='Z', size='size', color='Type 1', opacity=0.5, width=1000, height=1000,
                    title='Pokemon Stats Visualization by Type1 using tSNE Dimensionality Reduction')
fig.show()

# Pokemon LLM Embedding visualization w/ TSNE

Create a new column called `text_for_embedding_generation` containing a string created from Pokemon features you want to give to the text-embedding LLM to generate an embedding for each of the Pokemon


In [None]:
df_pokemon_csc.head(3)

In [None]:
# your code here to create a new column containing a string built from the Pokemon features you want to give to the text-embedding LLM to generate an embedding for each Pokemon
# Ex: you could concatenate some fields into a string
df_pokemon_csc['text_for_embedding_generation'] = ...  # your code here

In [None]:
# convert the column to a list of strings
pokemon_embedding_input = df_pokemon_csc['text_for_embedding_generation'].tolist()

In [None]:
#@markdown Run this cell after `pokemon_embedding_input` is created above.  Call OpenAI to create an embedding for each Pokemon
client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable

response = client.embeddings.create(
  model="text-embedding-ada-002",
  input=pokemon_embedding_input
)
print(f"Tokens used: prompt({response.usage.prompt_tokens}), total({response.usage.total_tokens})")
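
The token count reported by the API call is the basis for the cost calculation requested in the instructions. A minimal sketch of that arithmetic follows. Both numbers in it are assumptions for illustration: `assumed_usd_per_1k_tokens` is a placeholder price that you should replace with the current figure from the pricing page, and `example_total_tokens` is a made-up token count, not a real result. The helper name `embedding_cost_usd` is hypothetical.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Cost in USD for a given number of tokens at a per-1000-token price."""
    return total_tokens / 1000 * usd_per_1k_tokens

# ASSUMPTION: placeholder price per 1,000 tokens; check the pricing page for the real value.
assumed_usd_per_1k_tokens = 0.0001

# ASSUMPTION: made-up token count; in the notebook, use the total reported by the API response.
example_total_tokens = 12_000

cost = embedding_cost_usd(example_total_tokens, assumed_usd_per_1k_tokens)
print(f"Estimated cost: ${cost:.6f}")
```

Keeping the price as a parameter makes it easy to redo the calculation if the published price changes.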

In [None]:
#@markdown Run this cell as-is.  Convert to numpy array
X = np.array([e.embedding for e in response.data])
X.shape
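
The validation step from the instructions (one embedding per Pokemon, each with 1536 dimensions) can be expressed as a shape check. The sketch below runs on stand-in data: `X_fake` is filled with random numbers rather than real API output, and `check_embeddings` is a hypothetical helper name.

```python
import numpy as np

# Stand-in inputs and embeddings (random numbers, NOT real API output).
inputs_fake = ["pokemon one", "pokemon two", "pokemon three"]
rng = np.random.default_rng(0)
X_fake = rng.normal(size=(len(inputs_fake), 1536))

def check_embeddings(X, inputs, expected_dim=1536):
    """Return True when X holds exactly one expected_dim-sized vector per input."""
    return X.ndim == 2 and X.shape == (len(inputs), expected_dim)

print(check_embeddings(X_fake, inputs_fake))
```

In the notebook, the same check could be applied to `X` and `pokemon_embedding_input` once both exist.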

In [None]:
#@markdown Run this cell as-is.  This will create a t-SNE model from the high-dimensional text embeddings data
tsne = TSNE(n_components=3, perplexity=15, random_state=42, init="random", learning_rate=200)

vis_dims3 = tsne.fit_transform(X) # reduce the 1536-dimension embeddings into 3 dimensions for visualization

df = pd.DataFrame(vis_dims3, columns=['X','Y','Z'])
df.head(3)

In [None]:
#@markdown Run this cell as-is.  This will visualize the embeddings in 3D space.
df = pd.DataFrame(vis_dims3, columns=['X','Y','Z'])
df['Name'] = df_pokemon_csc['Name'].copy()
df['Type 1'] = df_pokemon_csc['Type 1'].copy()
df['size'] = 0.03
fig = px.scatter_3d(df, x='X', y='Y', z='Z', size='size', color='Type 1', opacity=0.5, width=1000, height=1000,
                    title='Pokemon Visualization by Type1 using tSNE Dimensionality Reduction'
                    )

fig.show()

# End