# AIT500 - Lab Test 2

## Objective

During Lab Test 2 you will practice components of the data science process covered in this course: data collection, data cleaning, data analysis, and using data to answer business questions.

Download this notebook into your own [Colab](https://colab.research.google.com/) environment before coding your answers.


## Lab Test 2 Instructions

### Part 1 - Cleaning bad data

You are a data engineer at **Fabulous AI Inc**.  The data science team has come to you for help ingesting a new dataset.  They encountered strange results during their analysis and believe the issue may be related to bad values in some of the data fields.

Identify as many bad fields as you can using the different diagnostic techniques in your toolbox.

The [bad data file can be found here](https://github.com/dora-lee/seneca-ait500-2024-winter/blob/main/lab_tests/nasa_meteorites_bad_data.csv).  The first line in the file contains a header row with column names that accurately describe the information contained in the file.

1. Perform data collection and create a dataframe called `meteorites_raw_df` from the raw data file
2. Identify fields that have bad data (or data that does not make sense)
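
The link above opens GitHub's HTML viewer page rather than the CSV itself, so passing it straight to `pd.read_csv` would typically fail to parse. Below is a minimal sketch of one way around this. It assumes the file is also served from `raw.githubusercontent.com` under the same path (the usual GitHub convention, not something stated in this notebook). `github_blob_to_raw` and `load_csv_from_github` are hypothetical helper names.

```python
import pandas as pd


def github_blob_to_raw(blob_url: str) -> str:
    """Convert a github.com 'blob' page URL into its raw-file equivalent."""
    # Assumption: GitHub serves the raw file at raw.githubusercontent.com
    # with the '/blob/' path segment removed.
    return (
        blob_url
        .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
        .replace("/blob/", "/", 1)
    )


def load_csv_from_github(blob_url: str) -> pd.DataFrame:
    """Read a CSV hosted on GitHub into a dataframe (needs network access)."""
    return pd.read_csv(github_blob_to_raw(blob_url))


BLOB_URL = (
    "https://github.com/dora-lee/seneca-ait500-2024-winter/"
    "blob/main/lab_tests/nasa_meteorites_bad_data.csv"
)
RAW_URL = github_blob_to_raw(BLOB_URL)
print(RAW_URL)
```

With network access, `meteorites_raw_df = load_csv_from_github(BLOB_URL)` would then create the dataframe requested in step 1.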


### Part 2 - Build MNIST prediction model with Generative AI

You are a student with tight deadlines.  One of your courses has a project that requires you to build a model that accurately predicts MNIST digits.

To help you save time, you plan to use ChatGPT 3.5 to give you code for a model that successfully predicts MNIST digits.

You are also aware that ChatGPT hallucinates, so to avoid incorrect code you plan to test every response from ChatGPT.

Note: you will run the code provided by ChatGPT without modification.  If the result is incorrect, revise your prompt.

1. Ask ChatGPT to show code that will download MNIST data into training and test dataframes
  1. Run the code ChatGPT provided (if the code is incorrect, change your prompt until it runs correctly)
2. Ask ChatGPT to show code that will train a scikit-learn classifier to accurately predict digits
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)
3. Ask ChatGPT to show code that will create a report showing precision, recall, f1-score, and support of the trained classifier
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)
4. Ask ChatGPT to show code that will create a confusion matrix for the performance of the classifier using test data
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)
5. Ask ChatGPT to show code that will display 5 randomly selected test digits, each with its actual and predicted class in the title of the image
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)


Hint: you will need to give ChatGPT explicit instructions on which variable names to use so that the code stays consistent as you progress from Q1 through Q5.




### Submitting your Lab Test 2 for Grading

1. Download your notebook **with all cells evaluated** to your PC
2. Upload the downloaded notebook to the Lab Test Classwork section of the Course Notebook


In [None]:
# increase Jupyter cell width

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Setup Google Drive access to API keys

In [None]:
#@markdown connect to drive for API keys stored in <br> `My Drive/Colab Notebooks/API_Keys`
import os, sys
from google.colab import drive
drive.mount('/content/mnt')
nb_path = '/content/notebooks'
if not os.path.islink(nb_path):  # avoid FileExistsError when the cell is re-run
  os.symlink('/content/mnt/My Drive/Colab Notebooks', nb_path)
sys.path.insert(0, nb_path)  # or append(nb_path)


Mounted at /content/mnt


# Setup Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt # plotting library
import seaborn as sns
import plotly.express as px

import os
from PIL import Image

import json
import requests


import numpy as np
import re
from scipy.io.arff import loadarff

from sklearn.manifold import TSNE

os.getcwd()

'/content'

In [None]:
# set pandas options
pd.set_option('display.max_columns', 50) # show more columns
pd.set_option('display.max_rows', 200) # show more rows
pd.set_option('display.max_colwidth', 400) # set wider columns

# Part 1 - Meteorites Dataset - Clean bad data

## Get data and create dataframe

In [None]:
# your code to get datafile here

In [None]:
# your code to create dataframe

meteorites_raw_df = ... # your code here

## Find one field with data that does not make sense

1. Use code to show how you found the bad column

1. Write a description to explain why some data in the column does not make sense.  

1. Explain how you could "clean" this bad column (there are many possible ways to clean data - use your best judgement and show your assumptions)

1. Use code to show the data cleaning step described above
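
The four steps above can be sketched on a small toy table. This is only an illustration of a few common checks (dtypes, missing values, numeric coercion, range checks), not the expected answer. The column names, the values, and the 2024 cut-off year are assumptions made for the example, and the real file may differ.

```python
import io

import pandas as pd

# Toy data standing in for the real file; column names and values are
# hypothetical and chosen only to illustrate the checks.
toy_csv = io.StringIO(
    "name,mass_g,year,lat\n"
    "Alpha,21.0,1880,50.7\n"
    "Beta,-5.0,1951,12.3\n"
    "Gamma,unknown,2301,95.0\n"
    "Delta,107000.0,,-33.1\n"
)
toy_df = pd.read_csv(toy_csv)

# 1) Column dtypes: a numeric field read as text hints at stray strings.
print(toy_df.dtypes)

# 2) Missing values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)

# 3) Coerce to numeric; entries that cannot be parsed become NaN.
mass = pd.to_numeric(toy_df["mass_g"], errors="coerce")
unparseable_mass = toy_df.loc[mass.isna() & toy_df["mass_g"].notna(), "mass_g"]
print(unparseable_mass.tolist())

# 4) Range checks based on what each field means physically
#    (assumed limits: mass >= 0, year <= 2024, latitude within [-90, 90]).
out_of_range = (mass < 0) | (toy_df["year"] > 2024) | ~toy_df["lat"].between(-90, 90)
bad_rows = toy_df[out_of_range]
print(bad_rows["name"].tolist())

# 5) One possible cleaning rule: replace impossible values with NaN
#    instead of dropping the whole row.
clean_df = toy_df.assign(
    mass_g=mass.where(mass >= 0),
    year=toy_df["year"].where(toy_df["year"] <= 2024),
    lat=toy_df["lat"].where(toy_df["lat"].between(-90, 90)),
)
print(clean_df)
```

Replacing bad values with NaN keeps the remaining fields of each row usable. Dropping rows or imputing values are equally defensible choices, as long as the assumption is stated.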

In [None]:
# 1) your code to find bad column here

In [None]:
# 2) your description here

In [None]:
# 3) explain your data cleaning approach here

In [None]:
# 4) your code to clean the bad column here

## Find a second field with data that does not make sense

1. Use code to show how you found the bad column

1. Write a description to explain why some data in the column does not make sense.  

1. Explain how you could "clean" this bad column (there are many possible ways to clean data - use your best judgement and show your assumptions)

1. Use code to show the data cleaning step described above
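
Numeric range checks do not catch every problem. Categorical and identifier fields usually need a different kind of inspection. Below is a minimal sketch using frequency counts and duplicate detection. The toy columns `id` and `fall` and the set of valid labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy values; the real file's columns may differ.
toy_df = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 4],
        "fall": ["Fell", "Found", "Found", "found ", "??"],
    }
)

# Frequency table: rare or oddly spelled categories stand out.
print(toy_df["fall"].value_counts(dropna=False))

# Normalise whitespace and case before comparing against expected labels.
normalised = toy_df["fall"].str.strip().str.title()
expected = {"Fell", "Found"}  # assumed set of valid labels
unexpected = sorted(set(normalised.dropna()) - expected)
print(unexpected)

# Identifier fields are normally unique; repeated ids deserve a closer look.
duplicate_ids = toy_df.loc[toy_df["id"].duplicated(keep=False), "id"].unique().tolist()
print(duplicate_ids)
```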

In [None]:
# 1) your code to find bad column here

In [None]:
# 2) your description here

In [None]:
# 3) explain your data cleaning approach here

In [None]:
# 4) your code to clean the bad column here

## Find a third field with data that does not make sense

1. Use code to show how you found the bad column

1. Write a description to explain why some data in the column does not make sense.  

1. Explain how you could "clean" this bad column (there are many possible ways to clean data - use your best judgement and show your assumptions)

1. Use code to show the data cleaning step described above

In [None]:
# 1) your code to find bad column here

In [None]:
# 2) your description here

In [None]:
# 3) explain your data cleaning approach here

In [None]:
# 4) your code to clean the bad column here

## Find a fourth field with data that does not make sense

1. Use code to show how you found the bad column

1. Write a description to explain why some data in the column does not make sense.  

1. Explain how you could "clean" this bad column (there are many possible ways to clean data - use your best judgement and show your assumptions)

1. Use code to show the data cleaning step described above

In [None]:
# 1) your code to find bad column here

In [None]:
# 2) your description here

In [None]:
# 3) explain your data cleaning approach here

In [None]:
# 4) your code to clean the bad column here

## Find a fifth field (or more!) with data that does not make sense

1. Use code to show how you found the bad column

1. Write a description to explain why some data in the column does not make sense.  

1. Explain how you could "clean" this bad column (there are many possible ways to clean data - use your best judgement and show your assumptions)

1. Use code to show the data cleaning step described above

In [None]:
# 1) your code to find bad column here

In [None]:
# 2) your description here

In [None]:
# 3) explain your data cleaning approach here

In [None]:
# 4) your code to clean the bad column here

# Part 2 Setup

## Install Libraries

In [None]:
try:
  import openai
  import tiktoken
except ModuleNotFoundError:
  !pip install openai==1.13.3 # install openAI library to the notebook instance
  !pip install tiktoken==0.6.0 # install tokenizer library
  import openai
  import tiktoken


# show installed versions (importlib.metadata replaces the deprecated pkg_resources)
from importlib.metadata import version

print(f"installed openai version {openai.__version__}")
print(f"installed tiktoken version: {version('tiktoken')}")


installed openai version 1.13.3
installed tiktoken version: 0.6.0


## Load Libraries

In [None]:
import matplotlib.pyplot as plt

import openai
import tiktoken


import pandas as pd
import os
from PIL import Image

import json
import requests

import seaborn as sns
import numpy as np
import re
from scipy.io.arff import loadarff

os.getcwd()

'/content'

## Setup Google Drive access to API keys

In [None]:
import google.colab.auth
google.colab.auth.authenticate_user()

In [None]:
#@markdown This cell mounts `API_Keys` directory containing your API keys for use with Colab

#@markdown connect to drive for API keys stored in `My Drive/Colab Notebooks/API_Keys` <br>

#@markdown 1) mount google drive `Colab Notebooks/API_Keys/` to local path `/content/notebooks`

import os, sys
from google.colab import drive
drive.mount('/content/mnt')
nb_path = '/content/notebooks'
if not os.path.islink(nb_path):  # avoid FileExistsError when the cell is re-run
  os.symlink('/content/mnt/My Drive/Colab Notebooks', nb_path)
sys.path.insert(0, nb_path)  # or append(nb_path)


Mounted at /content/mnt


## Setup Colab and Kaggle API Key

In [None]:
#@markdown This cell links your `API_Keys/kaggle.json` key for local use by Colab

#@markdown 1) create and save your kaggle API key as a json file on your Google Drive Colab Notebooks folder here:  `Colab Notebooks/API_Keys/kaggle.json`

SETUP_KAGGLE=False #@param True or False
if(SETUP_KAGGLE):
  !mkdir -p ~/.kaggle  # -p: no error if the directory already exists
  !ln -sf /content/notebooks/API_Keys/kaggle.json ~/.kaggle/
  !ls -al ~/.kaggle/


## Setup Colab and OpenAI API key

In [None]:
#@markdown This cell loads your `API_Keys/openai.txt` key as environment variable for local use by Colab

#@markdown 1) create and save your openai API key to a text file on your Google Drive Colab Notebooks folder here:  `Colab Notebooks/API_Keys/openai.txt`

SETUP_OPENAI=True #@param # True or False
if(SETUP_OPENAI):
  my_api_key_path = '/content/notebooks/API_Keys/openai.txt'
  with open(my_api_key_path,'r') as f:
    openai_key = f.readline().strip()  # strip the trailing newline so the key is valid

  os.environ['OPENAI_API_KEY'] = openai_key

  from openai import OpenAI
  client = OpenAI(
    api_key=os.environ['OPENAI_API_KEY'],  # this is also the default, it can be omitted
  )

In [None]:
#@markdown show list of models accessible with our API key
model_list = [m.id for m in client.models.list()]  # each item is a Model object with an `id`

for m in model_list:
  print(m, end=", ")


# Part 2 - Build MNIST prediction model with Generative AI

You are a student with tight deadlines.  One of your courses has a project that requires you to build a model that accurately predicts MNIST digits.

To help you save time, you plan to use ChatGPT 3.5 to give you code for a model that successfully predicts MNIST digits.

You are also aware that ChatGPT hallucinates, so to avoid incorrect code you plan to test every response from ChatGPT.

Note: you will run the code provided by ChatGPT without modification.  If the result is incorrect, revise your prompt.

1. Ask ChatGPT to show code that will download MNIST data into training and test dataframes
  1. Run the code ChatGPT provided (if the code is incorrect, change your prompt until it runs correctly)
2. Ask ChatGPT to show code that will train a scikit-learn classifier to accurately predict digits
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)
3. Ask ChatGPT to show code that will create a report showing precision, recall, f1-score, and support of the trained classifier
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)
4. Ask ChatGPT to show code that will create a confusion matrix for the performance of the classifier using test data
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)
5. Ask ChatGPT to show code that will display 5 randomly selected test digits, each with its actual and predicted class in the title of the image
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)


Hint: you will need to give ChatGPT explicit instructions on which variable names to use so that the code stays consistent as you progress from Q1 through Q5.
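
One way to keep variable names consistent is to state them in every prompt. The sketch below builds such a prompt as a plain string. `build_prompt` and the names `train_df`, `test_df`, and `clf` are hypothetical choices for illustration, so feel free to pick your own.

```python
def build_prompt(task: str, variable_names: dict[str, str]) -> str:
    """Compose a user prompt that states the task and fixes variable names."""
    lines = [task, "", "Use exactly these variable names:"]
    for name, meaning in variable_names.items():
        lines.append(f"- {name}: {meaning}")
    lines.append("Return only runnable Python code.")
    return "\n".join(lines)


# Hypothetical names; any consistent set works as long as every prompt reuses it.
shared_names = {
    "train_df": "training dataframe with pixel columns and a 'label' column",
    "test_df": "test dataframe with the same columns",
    "clf": "the trained scikit-learn classifier",
}

q2_prompt = build_prompt(
    "Train a scikit-learn classifier on MNIST digits that are already loaded.",
    shared_names,
)
print(q2_prompt)
```

The resulting string can be assigned to `my_prompt` in the cells below. Reusing `shared_names` for each question keeps later answers compatible with earlier ones.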

## Q1

1. Ask ChatGPT to show code that will download MNIST data into training and test dataframes
  1. Run the code ChatGPT provided (if the code is incorrect, change your prompt until it runs correctly)

In [None]:
my_context = "your prompt to set context for model"
my_prompt = "your prompt for chatgpt to answer the question"

model_name = "gpt-3.5-turbo"
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model=model_name,
  messages=[
    {"role": "system", "content": my_context},
    {"role": "user", "content": my_prompt}
  ]
)

print(completion.choices[0].message.content)

In [None]:
# paste code generated by chatgpt here and run the cell
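
Chat replies often wrap code in fenced blocks with extra prose around them. Before pasting, it can help to pull out just the code and confirm that it at least parses. The sketch below does this with the standard library. `extract_code_blocks` and `is_valid_python` are hypothetical helper names, and a successful parse only rules out syntax errors, not wrong results.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, written this way to keep the example readable

# Matches a fenced block, with an optional 'python' or 'py' language tag.
CODE_BLOCK_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code_blocks(reply: str) -> list[str]:
    """Return the contents of every fenced code block in a chat reply."""
    return [block.strip() for block in CODE_BLOCK_RE.findall(reply)]


def is_valid_python(source: str) -> bool:
    """Check that the text parses as Python; this does not prove it is correct."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Stand-in for completion.choices[0].message.content
sample_reply = (
    "Here is the code:\n"
    f"{FENCE}python\n"
    "total = sum(range(5))\n"
    "print(total)\n"
    f"{FENCE}\n"
    "Let me know if you need changes."
)
blocks = extract_code_blocks(sample_reply)
print(blocks[0])
print(is_valid_python(blocks[0]))
print(is_valid_python("print('unterminated"))
```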

## Q2

2. Ask ChatGPT to show code that will train a scikit-learn classifier to accurately predict digits
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)

## Q3

3. Ask ChatGPT to show code that will create a report showing precision, recall, f1-score, and support of the trained classifier
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)

## Q4

4. Ask ChatGPT to show code that will create a confusion matrix for the performance of the classifier using test data
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)
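
To judge whether a generated report and confusion matrix look plausible, it helps to know what correct output looks like. The sketch below is a reference only, not the expected answer. It uses scikit-learn's small bundled `load_digits` dataset (8x8 images) as a stand-in for MNIST, and the choice of `LogisticRegression` is an assumption made for brevity.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data: 1,797 8x8 digit images bundled with scikit-learn (not MNIST).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall, f1-score, and support on the test split.
print(classification_report(y_test, y_pred))

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

A healthy matrix has most of its counts on the diagonal. Large off-diagonal entries show which digits the classifier confuses.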


## Q5

5. Ask ChatGPT to show code that will display 5 randomly selected test digits, each with its actual and predicted class in the title of the image
  1. Run the code ChatGPT provided  (if the code is incorrect, change your prompt until it runs correctly)


# End