<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject #13: Building a Scene Recognition Model from Video Frames</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/">https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Frames of a Video

Visual images are an important part of all media, and Data Scientists often use images as data sources.  In this MicroProject, you will create a simple model to detect the amount of time spent in two different "scenes" we used when creating office-hour style videos for Data Science DISCOVERY.

In doing this, you will be building a **simple artificial intelligence (AI) algorithm** that will be able to predict the classification of unknown future data.  Let's nerd out! 🎉

<hr style="color: #DD3403;">

## Part 1: Loading a Video Frame

For this MicroProject, we already did a screen capture from the DISCOVERY video [*"Outliers Impact on Correlation (m6-02b)"*](https://www.youtube.com/watch?v=bd6hQ2UcIJc) that is used as part of our [DISCOVERY lecture covering Correlation](https://discovery.cs.illinois.edu/learn/Towards-Machine-Learning/Correlation/).  We captured one video frame every second and stored it as an image, and those files are available for you in the `frames` directory of this MicroProject.

To analyze these images, the `skimage` library is commonly used to load image data into Python.  Specifically, the `skimage.io.imread(filename)` function reads the file named `filename` and returns the pixel color for every pixel in the image.

To use the `imread` function, you will need to do one of the following:

1. Import all of `skimage` by using the import line `import skimage`.  After importing all of `skimage`, you will call the function using its fully qualified name: `skimage.io.imread(filename)`.

**OR**

2. Import only the `imread` function by using the import line `from skimage.io import imread`.  After importing only `imread`, you will call the function directly: `imread(filename)`

### Part 1.1: Read Pixel Data for `frames/frame_0001.jpg`

In the following cell, store the pixel color data from the file named `frames/frame_0001.jpg` in the variable `pixels` by using the `imread` function:


In [3]:
from skimage.io import imread
pixels = imread("frames/frame_0001.jpg")

In [4]:
### TEST CASE for Part 1: Loading a Video Frame
tada = "\N{PARTY POPPER}"

assert("pixels" in vars()), "Make sure you store the pixel data in the variable `pixels`."
assert(pixels.shape == (360, 640, 3)), "Make sure you are getting the pixel color data from the correct file."
assert(pixels[0][0][0] == 91), "Make sure you are getting the pixel color data from the correct file."

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


Here's the image you loaded into `pixels` for reference as we work on the next section:

![frame_0001](frames/frame_0001.jpg "frame_0001")


<hr style="color: #DD3403;">

## Part 2: Finding Average Image Colors

The **shape** of your data is `rows` by `columns` by `color values`, stored as a 3-dimensional array.  Here's a formatted view of your `pixels` data:

```py
[
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80], ... ],   # Row #1
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80], ... ],   # Row #2
  ...                                                    # ...
]
```

The current shape of `pixels` is 360 rows by 640 columns by 3 colors.  Each of the three color values represents one of the three color channels on a screen: red, green, and blue.
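
To make the rows-by-columns-by-colors layout concrete, here is a minimal sketch that uses a tiny synthetic array (2 rows by 3 columns, with made-up pixel values) in place of a real frame. The variable name `tiny` is an assumption for illustration and is not part of this MicroProject.

```python
import numpy as np

# A hypothetical 2-row by 3-column "image"; each pixel is [red, green, blue].
tiny = np.array([
    [[91, 83, 80], [91, 83, 80], [91, 83, 80]],            # Row #1
    [[162, 131, 110], [162, 131, 110], [162, 131, 110]],   # Row #2
], dtype=np.uint8)

print(tiny.shape)      # (rows, columns, color channels)
print(tiny[0][0])      # the top-left pixel's [r, g, b] values
print(tiny[0][0][0])   # the top-left pixel's red value
```

The same indexing applies to `pixels`, only with 360 rows and 640 columns.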

Using `pixels.mean()`, we find the average color by grouping **ALL** the color channels (combining blues and reds and greens together).  Try it out:


In [5]:
pixels.mean()

np.float64(72.18011863425926)

We can also find the average of each color channel instead of grouping them all together.  The expression `pixels.reshape(-1, 3).mean(axis=0)` returns the mean of each color channel across all pixels.  Check out the new mean output:

In [6]:
# pixels.reshape(-1, 3) arranges the `pixels` data into one row per pixel with
# three columns, preserving the original color channel data for each pixel, which
# is necessary to get the mean for each color channel:
pixels.reshape(-1, 3)

array([[ 91,  83,  80],
       [ 91,  83,  80],
       [ 91,  83,  80],
       ...,
       [162, 131, 110],
       [162, 131, 110],
       [162, 131, 110]], shape=(230400, 3), dtype=uint8)

We can combine it with `.mean(axis=0)` to find the average of each of the three color channels:

In [9]:
pixels.reshape(-1, 3).mean(axis=0)

array([88.65917535, 67.45620226, 60.4249783 ])
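
The difference between the overall mean and the per-channel mean is easier to see on a small example. The sketch below uses a hypothetical 2-by-2 image (the values and the name `tiny` are made up for illustration) so the averages can be checked by hand.

```python
import numpy as np

# A hypothetical 2x2 image with easy-to-average channel values.
tiny = np.array([
    [[10, 20, 30], [30, 40, 50]],
    [[50, 60, 70], [70, 80, 90]],
], dtype=np.uint8)

flat = tiny.reshape(-1, 3)        # shape (4, 3): one row per pixel
per_channel = flat.mean(axis=0)   # average down the rows: one value per channel
overall = tiny.mean()             # a single average across every value

print(flat.shape)     # (4, 3)
print(per_channel)    # [40. 50. 60.]
print(overall)        # 50.0
```

NumPy also accepts a tuple of axes, so `tiny.mean(axis=(0, 1))` gives the same per-channel result without reshaping.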

Finally, when a `list` or `array` is returned, we can spread the values into different variables where each variable takes one value from the list.  For example:

> ```py
> # Spreads the [2, 4], so x=2, and y=4:
> x, y = [2, 4]
> ```

We can spread any size list into as many variables as elements in the list:

> ```py
> # Spreads a list of five elements into five variables:
> a, b, c, d, e = [20, 30, 40, 50, 60]
> # a = 20; b = 30; c = 40; d = 50; and e = 60.
> ```

This can be helpful for the average pixel colors of an image:

> ```py
> # Spreads the list of three color channels into r, g, and b:
> r, g, b = pixels.reshape(-1, 3).mean(axis=0)
> ```

### Part 2.1: Finding the Average Color of One Image

Store `pixels`'s average red value in `r`, average green value in `g`, and average blue value in `b`:

In [10]:
r, g, b = pixels.reshape(-1, 3).mean(axis=0)

In [13]:
### TEST CASE for Part 2.1: Finding the Average Color of One Image
tada = "\N{PARTY POPPER}"

import math
assert("r" in vars()), "The average red value should be stored in the variable `r`."
assert("g" in vars()), "The average green value should be stored in the variable `g`."
assert("b" in vars()), "The average blue value should be stored in the variable `b`."
assert(math.isclose(r, 88.65917534722222, abs_tol=1)), f"Your average red value is incorrect (r={r})."
assert(math.isclose(g, 67.45620225694445, abs_tol=1)), f"Your average green value is incorrect (g={g})."
assert(math.isclose(b, 60.42497829861111, abs_tol=1)), f"Your average blue value is incorrect (b={b})."

print(f"{tada} All Tests Passed! {tada}")
print(f"- The image's average red color channel is {round(r)} / 255")
print(f"- The image's average green color channel is {round(g)} / 255")
print(f"- The image's average blue color channel is {round(b)} / 255")


🎉 All Tests Passed! 🎉
- The image's average red color channel is 89 / 255
- The image's average green color channel is 67 / 255
- The image's average blue color channel is 60 / 255


### Part 2.2: Finding the Average Color of All Images

The following code loops through every file in the `frames` directory -- this will include `frame_0001.jpg` (like you analyzed already) and also `frame_0002.jpg`, `frame_0003.jpg`, and all 300+ frames!

Create a DataFrame named `df` where each row is one frame with the following four columns:
- `frame`, the filename of the frame
- `r`, the average red color of the frame
- `g`, the average green color of the frame
- `b`, the average blue color of the frame

The structure of the code should be nearly identical to writing a simulation.  Instead of creating random variables for your real world data, your real world data will be the filename, and the average color values. 

- See: https://discovery.cs.illinois.edu/learn/Simulation-and-Distributions/Simple-Simulations-in-Python/

*(Hint: Start at the very beginning of this MicroProject and observe the steps taken to get the average colors of `frame_0001.jpg`.)*

In [16]:
import glob
import os
import pandas as pd

data = []
for frameFileName in glob.glob(os.path.join("frames", "*.jpg")): 
  # `frameFileName` contains the filename of the frame (ex: "frames/frame_0001.jpg").
  # Use `frameFileName` for `imread` to read the frame image data.
  pixels = imread(frameFileName)
  r, g, b = pixels.reshape(-1, 3).mean(axis=0)
  d = {"frame" : frameFileName, "r": r, "g": g, "b": b}
  data.append(d)


df = pd.DataFrame(data)
df

Unnamed: 0,frame,r,g,b
0,frames\frame_0001.jpg,88.659175,67.456202,60.424978
1,frames\frame_0002.jpg,88.697865,67.453529,60.475660
2,frames\frame_0003.jpg,88.028351,66.913845,60.064592
3,frames\frame_0004.jpg,88.825629,67.340347,60.491645
4,frames\frame_0005.jpg,88.211714,66.979661,59.983173
...,...,...,...,...
325,frames\frame_0326.jpg,7.470391,7.473355,7.479188
326,frames\frame_0327.jpg,7.469779,7.472743,7.478576
327,frames\frame_0328.jpg,7.480234,7.481519,7.487826
328,frames\frame_0329.jpg,7.480004,7.481289,7.487595


In [15]:
### TEST CASE for Part 2.2: Finding the Average Color of All Images
tada = "\N{PARTY POPPER}"

import math
assert("df" in vars()), "Make sure your DataFrame is named `df`."
assert(len(df) == 330), "Your DataFrame has the incorrect number of rows."
assert("r" in df), "Your `df` is missing the `r` column."
assert("g" in df), "Your `df` is missing the `g` column."
assert("b" in df), "Your `df` is missing the `b` column."
assert("frame" in df), "Your `df` is missing the `frame` column."
assert( abs( df[ df.frame.str.endswith("_0001.jpg") ]["r"].sum() - 88 ) < 1 ), "You have calculated the color averages incorrectly."

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 3: Create a Simple Classifier

In the DISCOVERY lecture videos, there are two primary "scenes" in the video:

1. **"Office Hours Studio Scene"**, where Karle and Wade are talking to each other and the audience

2. **"Notebook Scene"**, where the notebook is displayed

View the `frames` folder on your computer and find **at least three more frames** that are in the "Office Hours Studio Scene" and **at least three more frames** that are in the "Notebook Scene".  Add the frames you found to the lists below:

In [17]:
# List of at least four office hour frames by the filename's frame number:
office_hour_frames = [1, 2, 3, 4]

# List of at least four notebook frames by the filename's frame number:
notebook_frames = [30, 31, 32, 33]

### Part 3.1: Observing the Average Colors of Your Frames

The following code uses your sample frames to display the average color values for your selected frames:

In [18]:
import os

print("== Office Hour Frames ==")
print( df[ df["frame"].isin( [os.path.join("frames", f"frame_{frame:04d}.jpg") for frame in office_hour_frames] ) ] )
print()
print("== Notebook Frames ==")
print( df[ df["frame"].isin( [os.path.join("frames", f"frame_{frame:04d}.jpg") for frame in notebook_frames] ) ] )

== Office Hour Frames ==
                   frame          r          g          b
0  frames\frame_0001.jpg  88.659175  67.456202  60.424978
1  frames\frame_0002.jpg  88.697865  67.453529  60.475660
2  frames\frame_0003.jpg  88.028351  66.913845  60.064592
3  frames\frame_0004.jpg  88.825629  67.340347  60.491645

== Notebook Frames ==
                    frame           r           g           b
29  frames\frame_0030.jpg  237.225595  236.513451  236.777122
30  frames\frame_0031.jpg  237.253437  236.602648  236.892174
31  frames\frame_0032.jpg  237.195208  236.540660  236.820846
32  frames\frame_0033.jpg  237.115829  236.491884  236.751220
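
The `f"frame_{frame:04d}.jpg"` expression in the cell above pads each frame number with zeros to four digits so it matches the filenames on disk. Here is a small sketch of that step on its own; the helper name `frame_path` is hypothetical and not part of the notebook.

```python
import os

def frame_path(frame_number):
    # `:04d` formats the integer with at least four digits, padding with zeros.
    return os.path.join("frames", f"frame_{frame_number:04d}.jpg")

print(frame_path(1))
print(frame_path(30))
```

Because `os.path.join` uses your operating system's path separator, the result matches the `frame` column that `glob` produced on the same machine (for example, a backslash on Windows).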


### Part 3.2: Create Your Classifier Function

A **classifier function** is a function that takes data and gives a classification for that data.  Create a new function, `classifyFrame` that receives an `r`, `g`, and `b` value.

Using information from your frames above, have the function return the string `"office hour"` or `"notebook"` based on the values of `r`, `g`, and `b`. To do this, try to observe general trends in the color values for both kinds of frames.

**IMPORTANT**: Make sure your classifier can handle **ANY** input -- even frames you have not seen before!  For example, you might decide that you will call a frame an `"office hour"` frame if the sum of `r`, `g` and `b` is greater than 100 and otherwise it's a `"notebook"` scene.
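
One possible way to choose a cutoff is sketched below. It assumes overall brightness is what separates the two scenes, and it places the threshold halfway between one sample frame of each kind. The names `brightness`, `threshold`, and `classify_by_brightness` are hypothetical, and this is only one of many rules that could work.

```python
# Average colors copied from the sample frames shown in Part 3.1.
office_sample = (88.659175, 67.456202, 60.424978)        # frame_0001
notebook_sample = (237.225595, 236.513451, 236.777122)   # frame_0030

def brightness(r, g, b):
    # Mean of the three channel averages: a rough measure of how light a frame is.
    return (r + g + b) / 3

# Place the cutoff halfway between the two sample brightness values.
threshold = (brightness(*office_sample) + brightness(*notebook_sample)) / 2

def classify_by_brightness(r, g, b):
    # Darker frames look like the studio; brighter frames look like the notebook.
    if brightness(r, g, b) < threshold:
        return "office hour"
    return "notebook"

print(classify_by_brightness(*office_sample))
print(classify_by_brightness(*notebook_sample))
```

A rule built from only two sample frames may not generalize, so it is worth checking it against more frames before relying on it.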

In [25]:
def classifyFrame(r, g, b):
  # Return either "office hour" or "notebook" based on the values of `r`, `g`, and `b`.
  if r < 100 and g < 100 and b < 100:
      return 'office hour'
  return 'notebook'

Here's a function to test your classifier to make sure your function returns a valid result (feel free to edit this).
- Right now, focus on making sure your function returns a valid result (either `notebook` or `office hour`)
- In the next part, we'll work on making sure it is an accurate classifier.

In [26]:
# Testing the `classifyFrame` function:
classifyFrame(0, 0, 0)

'office hour'

In [28]:
### TEST CASE for Part 3.2: Create Your Classifier Function
tada = "\N{PARTY POPPER}"

r = classifyFrame(0, 0, 0)
assert(r == "notebook" or r == "office hour"), "Your classifier function is misclassifying frames."

r = classifyFrame(255, 255, 255)
assert(r == "notebook" or r == "office hour"), "Your classifier function is misclassifying frames."

r = classifyFrame(0, 255, 255)
assert(r == "notebook" or r == "office hour"), "Your classifier function is misclassifying frames."

r = classifyFrame(255, 255, 0)
assert(r == "notebook" or r == "office hour"), "Your classifier function is misclassifying frames."

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 4: Using Your Classifier!

Now that we have a classifier, we should run it on every frame and see if your classifier correctly classifies scenes in the video frames.

The following cell runs your `classifyFrame` classifier on every frame and adds a new column `scene`. Then, 20 random rows are displayed below.  This output can be checked against the actual images.
- You should see a mix of `office hour` and `notebook` values in your **scene** column.
- If your classifier is not correct, update your `classifyFrame` function in the previous section and re-run that cell, and then re-run this cell to use your updated `classifyFrame` function.

In [29]:
# The following cell runs your `classifyFrame` classifier on every frame and adds a new column `scene`:
df["scene"] = df.apply(lambda row: classifyFrame(row.r, row.g, row.b), axis=1)
df.sample(20)

Unnamed: 0,frame,r,g,b,scene
80,frames\frame_0081.jpg,230.721385,229.915091,230.48303,notebook
18,frames\frame_0019.jpg,88.00727,67.002426,59.679549,office hour
305,frames\frame_0306.jpg,89.403867,70.149223,62.83901,office hour
21,frames\frame_0022.jpg,87.7952,66.529692,59.443464,office hour
153,frames\frame_0154.jpg,230.078668,229.208051,226.855508,notebook
91,frames\frame_0092.jpg,230.462405,230.04694,230.497027,notebook
274,frames\frame_0275.jpg,243.495807,242.536293,241.133168,notebook
264,frames\frame_0265.jpg,243.797096,242.944306,241.363533,notebook
0,frames\frame_0001.jpg,88.659175,67.456202,60.424978,office hour
93,frames\frame_0094.jpg,230.418442,230.041072,230.527708,notebook


### Observing Your Results

In the next four cells, we display a frame and you'll run code to check what your classifier classified the frame as being!  Make sure to run the code for each frame.
- If your classifier is off, you will need to update your `classifyFrame` function and re-run both the `classifyFrame` cell and the cell at the top of this section to generate a new **scene** classification for each image.
- It may take several attempts to get this correct.  Building an AI can be difficult!

### Frame #0001: Office Hours

In [30]:
df[ df.frame.str.endswith("0001.jpg") ]

Unnamed: 0,frame,r,g,b,scene
0,frames\frame_0001.jpg,88.659175,67.456202,60.424978,office hour


![Frame 0001](frames/frame_0001.jpg)

### Frame #0081: Notebook

In [31]:
df[ df.frame.str.endswith("0081.jpg") ]

Unnamed: 0,frame,r,g,b,scene
80,frames\frame_0081.jpg,230.721385,229.915091,230.48303,notebook


![Frame 0081](frames/frame_0081.jpg)

### Frame #0191: Notebook

In [32]:
df[ df.frame.str.endswith("0191.jpg") ]

Unnamed: 0,frame,r,g,b,scene
190,frames\frame_0191.jpg,233.117088,232.354644,230.103359,notebook


![Frame 0191](frames/frame_0191.jpg)

### Frame #0306: Office Hours

In [33]:
df[ df.frame.str.endswith("0306.jpg") ]

Unnamed: 0,frame,r,g,b,scene
305,frames\frame_0306.jpg,89.403867,70.149223,62.83901,office hour


![Frame 0306](frames/frame_0306.jpg)

### Run the "Part 4" Test Case:


In [34]:
### TEST CASE for Part 4: Using Your Classifier
tada = "\N{PARTY POPPER}"
assert("scene" in df)
assert(len(df[ df.scene == "notebook" ]) > 100), "Your classifier function misclassified lots of frames. Take another look at how you are classifying `office hour` and `notebook` frames."
assert(len(df[ df.scene == "office hour" ]) > 75), "Your classifier function misclassified lots of frames. Take another look at how you are classifying `office hour` and `notebook` frames."
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) == len(df)), "Your classifier function misclassified lots of frames. Take another look at how you are classifying `office hour` and `notebook` frames."
assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 ), "Your classifier function misclassified an 'office hours' frame. Take another look at how you are classifying `office hour` and `notebook` frames."
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 ), "Your classifier function misclassified an 'office hours' frame. Take another look at how you are classifying `office hour` and `notebook` frames."
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 ), "Your classifier function misclassified a 'notebook' frame. Take another look at how you are classifying `office hour` and `notebook` frames."
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 ), "Your classifier function misclassified a 'notebook' frame. Take another look at how you are classifying `office hour` and `notebook` frames."
print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉
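
Because one frame was captured per second, counting the frames in each scene gives an estimate of the seconds spent in that scene. The sketch below shows the idea with a short, hypothetical list of labels standing in for the `scene` column.

```python
from collections import Counter

# Hypothetical labels standing in for the values of the `scene` column.
scenes = ["office hour", "office hour", "notebook", "notebook", "notebook", "office hour"]

# One frame was captured per second, so each count approximates seconds of video.
seconds_per_scene = Counter(scenes)

for scene, seconds in seconds_per_scene.items():
    print(f"{scene}: about {seconds} seconds")
```

On the real data, `df["scene"].value_counts()` produces the same kind of tally directly from the DataFrame.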


<hr style="color: #DD3403;">

## Part 5: Improve Your Classifier

There are some frames that are neither `office hour` nor `notebook` scenes.  How did your classifier do...?

Run the following cells to see how you classified some unexpected frames:

### Frame #0320: Data Science Duo Logo???

What did you classify the DUO logo as?  It's neither one, but we don't have that option!

In [35]:
df[ df.frame.str.endswith("0320.jpg") ]

Unnamed: 0,frame,r,g,b,scene
319,frames\frame_0320.jpg,221.227565,71.838433,54.457305,notebook


![Frame 0320](frames/frame_0320.jpg)

### Frame #0328: Video Credits

What did you classify the video credits as?  It's another tricky one!


In [36]:
df[ df.frame.str.endswith("0328.jpg") ]

Unnamed: 0,frame,r,g,b,scene
327,frames\frame_0328.jpg,7.480234,7.481519,7.487826,office hour


![Frame 0328](frames/frame_0328.jpg)

### Part 5.1: Building a Second Classifier

Create a second classifier -- `classifyFrame2` -- that returns either `"notebook"`, `"office hour"` or `"other"`.  Your classifier should correctly handle the "Data Science Duo" (ex: #0320) frames and the "Credit" frames (ex: #0328).
- Think of what might **uniquely identify these two special frames** that is different than the office hour or notebook scenes.  The average color values displayed with the images above may help you with some ideas.
- You must still correctly classify `"notebook"` and `"office hour"` frames, in addition to classifying `"other"` frames
- *Hint: It may be helpful to first identify if it's an `"other"` since these scenes may be most distinctive.*
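
One possible way to look for a distinguishing trait, offered as a sketch rather than the intended solution, is to compare each frame's overall brightness with the gap between its largest and smallest channel. The function name `describe` is hypothetical; the color values are the averages shown earlier in this notebook.

```python
# Average colors taken from the outputs shown earlier in this notebook.
samples = {
    "office hour (0001)": (88.659175, 67.456202, 60.424978),
    "notebook (0081)": (230.721385, 229.915091, 230.48303),
    "duo logo (0320)": (221.227565, 71.838433, 54.457305),
    "credits (0328)": (7.480234, 7.481519, 7.487826),
}

def describe(r, g, b):
    brightness = (r + g + b) / 3           # how light the frame is overall
    spread = max(r, g, b) - min(r, g, b)   # how far apart the channels are
    return brightness, spread

for name, rgb in samples.items():
    brightness, spread = describe(*rgb)
    print(f"{name}: brightness={brightness:.1f}, spread={spread:.1f}")
```

In these four samples, the logo frame has by far the largest spread and the credits frame is by far the darkest, which suggests two separate checks for `"other"`.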

In [37]:
def classifyFrame2(r, g, b):
  # Return either "office hour", "notebook", or "other" based on the values of `r`, `g`, and `b`.
  # orange duo logo
  if r > 200 and g < 100 and b < 100:
      return 'other'
  elif r < 10 and g < 10 and b < 10:
      return 'other'
  return classifyFrame(r, g, b)

### Part 5.2: Apply your `classifyFrame2` function

Using `classifyFrame2`, this code replaces the value in the column `scene` with the result of your `classifyFrame2` classification function.  The output of this cell shows the last frames of the video, where we expect many of the results to be `"other"`:

In [38]:
df["scene"] = df.apply(lambda row: classifyFrame2(row.r, row.g, row.b), axis=1)
df.tail(20)

Unnamed: 0,frame,r,g,b,scene
310,frames\frame_0311.jpg,89.052721,70.084679,62.528559,office hour
311,frames\frame_0312.jpg,89.577539,70.261745,62.89487,office hour
312,frames\frame_0313.jpg,89.365169,70.192526,62.526467,office hour
313,frames\frame_0314.jpg,89.24036,70.183095,62.777127,office hour
314,frames\frame_0315.jpg,89.053277,70.11523,62.639093,office hour
315,frames\frame_0316.jpg,227.706259,67.024722,48.608728,other
316,frames\frame_0317.jpg,233.53987,66.995486,47.76674,other
317,frames\frame_0318.jpg,227.335317,67.101398,48.556484,other
318,frames\frame_0319.jpg,221.847739,72.007214,54.58467,other
319,frames\frame_0320.jpg,221.227565,71.838433,54.457305,other


### Test Cases

You can re-run the sample image cells earlier in this notebook after running the code in Part 5.2, since that code updates the DataFrame using your new classifier.  If you need to make changes, remember to edit `classifyFrame2` (not the original `classifyFrame`).

In [39]:
### TEST CASE for Part 5: Update Your Classifier to Account for an Other Category
tada = "\N{PARTY POPPER}"
assert("scene" in df)
assert(len(df[ df.scene == "notebook" ]) > 100), "There should be many frames classified as 'notebook'. There are frames being misclassified."
assert(len(df[ df.scene == "office hour" ]) > 75), "You should have more frames classified as 'office hour'. There are frames being misclassified."
assert(len(df[ df.scene == "other" ]) >= 15), "Most of the last 20 frames should be classified as 'other'. Try adjusting the bounds of your classification parameters."
assert(len(df[ df.scene == "other" ]) <= 18)   # Okay to classify the intro screens as well, but not any others.
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) + len(df[ df.scene == "other" ]) == len(df))
assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 ), "Your classifier function misclassified an 'office hours' frame. Take another look at how you are classifying the four frame categories."
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 ), "Your classifier function misclassified an 'office hours' frame. Take another look at how you are classifying the four frame categories."
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 ), "Your classifier function misclassified a 'notebook' frame. Take another look at how you are classifying the four frame categories."
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 ), "Your classifier function misclassified a 'notebook' frame. Take another look at how you are classifying the four frame categories."
assert( len( df[ (df.frame.str.endswith("0317.jpg")) & (df.scene == "other") ] ) == 1 ), "Your classifier function misclassified an 'other' frame. Take another look at how you are classifying the four frame categories."
assert( len( df[ (df.frame.str.endswith("0325.jpg")) & (df.scene == "other") ] ) == 1 ), "Your classifier function misclassified an 'other' frame. Take another look at how you are classifying the four frame categories."
assert( len( df[ (df.frame.str.endswith("0328.jpg")) & (df.scene == "other") ] ) == 1 ), "Your classifier function misclassified an 'other' frame. Take another look at how you are classifying the four frame categories."
print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## You Built a Classifier! 🎉🎉

In this MicroProject, instead of training a model with existing data, you fine-tuned your own classifier function.  If you want to nerd out more, feel free to explore using k-means clustering on this data (or any classification algorithm) -- how does a classical AI algorithm do on this same problem?
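
As a starting point for that exploration, here is a hand-written sketch of the standard k-means loop: assign each point to its nearest center, then move each center to the mean of its points. It runs on six average colors copied from this notebook. The function names `nearest` and `k_means`, the choice of three clusters, and the hand-picked starting centers are all assumptions for illustration. k-means groups points without using labels, so the cluster numbers it returns are arbitrary.

```python
import math

# Average (r, g, b) values copied from frames shown in this notebook.
points = [
    (88.659175, 67.456202, 60.424978),      # frame_0001, office hour
    (89.403867, 70.149223, 62.83901),       # frame_0306, office hour
    (230.721385, 229.915091, 230.48303),    # frame_0081, notebook
    (233.117088, 232.354644, 230.103359),   # frame_0191, notebook
    (221.227565, 71.838433, 54.457305),     # frame_0320, duo logo
    (227.706259, 67.024722, 48.608728),     # frame_0316, duo logo
]

def nearest(point, centers):
    # Index of the center closest to `point` (Euclidean distance).
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def k_means(points, centers, rounds=10):
    for _ in range(rounds):
        # Assignment step: attach each point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(old)  # keep a center that attracted no points
        if new_centers == centers:
            break  # centers stopped moving, so the clusters are stable
        centers = new_centers
    return labels, centers

# Assumption: start from three hand-picked points; k-means usually uses random starts.
labels, centers = k_means(points, centers=[points[0], points[2], points[4]])
print(labels)
```

With these six points and starting centers, frames from the same scene end up with the same cluster number. On the full set of frames, a library implementation such as scikit-learn's `KMeans` would be a more practical choice.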

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉