# Lab 2 — Exploring Image Data
### by Eddie Diaz and Noah Barnard

## Chosen Dataset
https://www.v7labs.com/open-datasets/multiple-pose-human-body-database

## Business Understanding

This dataset was made by v7 Labs to train models that recognize various attributes of people in images. Models trained on it could predict pose, activity, gender, body type, and more. Third parties who build recognition software would benefit from such models because they allow software to identify the user, the user's features, and the activity the user is performing.

One particularly interesting task is having a machine gauge proper form at the gym. Although working out is crucial for long-term health, repeating incorrect movement patterns can be very detrimental. For instance, someone who performs deadlifts while bending their spine could develop chronic back problems over time. A technology-enabled tool that prevents these problems and offers corrective feedback could therefore prove highly beneficial to many people. Furthermore, given a dataset containing examples of both proper and improper form, even a model with only average performance should be able to pick up on incorrect patterns.

However, this task is beyond the scope of this assignment. For now, we want to build a simple algorithm that can determine whether a person is performing physical activity, which we feel is a good start. For the sake of simplicity, we will only use the first 1000 images, as the full dataset contains 25,000 pictures. As these images seem to be in no particular order, a simple cutoff at 1000 seems appropriate.
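
The cutoff described above could be sketched as follows. This is a minimal illustration rather than the code used in this lab: the helper name `first_n_files`, the choice to sort by file name, and the `./dataset` path in the usage note are all assumptions.

```python
import os


def first_n_files(folder, n=1000):
    """Return the names of the first n regular files in folder, sorted by name.

    Hypothetical helper: sorting by name is an assumption made here so the
    selection is reproducible, since os.listdir returns an arbitrary order.
    """
    names = sorted(
        f for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f))  # skip subdirectories
    )
    return names[:n]
```

For example, `subset = first_n_files("./dataset", n=1000)` would return up to 1000 file names.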

## Data Preparation

To begin working with the image data, we will first import our dependencies, load the images as NumPy arrays, and edit the images as necessary.

In [12]:
## import based on https://www.kaggle.com/code/lgmoneda/from-image-files-to-numpy-arrays/notebook

import os

import numpy as np
import pandas as pd
from IPython.display import display
from IPython.display import Image as _Imgdis
from PIL import Image

folder = "./dataset"

# Keep only regular files (skip any subdirectories)
onlyfiles = [f for f in os.listdir(folder) if os.path.isfile(os.path.join(folder, f))]

print("Working with {0} images".format(len(onlyfiles)))
print("Image examples: ")

# Display two sample images
for i in range(40, 42):
    print(onlyfiles[i])
    display(_Imgdis(filename=os.path.join(folder, onlyfiles[i]), width=240, height=320))

Working with 1000 images
Image examples: 
00047.jpg


<IPython.core.display.Image object>

00721.jpg


<IPython.core.display.Image object>

Now that we can display the images, let's convert them to NumPy arrays and store them in a pandas DataFrame for tabular use.

In [24]:
df = pd.DataFrame(columns=["data"])

# Load each image as a NumPy array and store it in a row indexed by file name
for file in onlyfiles:
    with Image.open(os.path.join(folder, file)) as img:
        df.loc[file] = {"data": np.asarray(img)}

print(df.head())

                                                        data
00132.jpg  [[[212, 235, 251], [212, 235, 251], [212, 235,...
00654.jpg  [[[255, 254, 252], [255, 254, 252], [255, 254,...
00640.jpg  [[[49, 47, 50], [50, 48, 51], [51, 49, 52], [5...
00898.jpg  [[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...
00126.jpg  [[[8, 13, 6], [8, 13, 6], [8, 13, 6], [8, 13, ...


As we can see, we have a pandas DataFrame indexed by file name, with a single `data` column that holds each image as a NumPy array. Now we just need to add a target column, and we will be ready to conduct our analysis.

In [27]:
# Assign a target of 1.0 to every image (one value per row)
df["target"] = np.ones(len(df))

print(df.head())

                                                        data  target
00132.jpg  [[[212, 235, 251], [212, 235, 251], [212, 235,...     1.0
00654.jpg  [[[255, 254, 252], [255, 254, 252], [255, 254,...     1.0
00640.jpg  [[[49, 47, 50], [50, 48, 51], [51, 49, 52], [5...     1.0
00898.jpg  [[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...     1.0
00126.jpg  [[[8, 13, 6], [8, 13, 6], [8, 13, 6], [8, 13, ...     1.0
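
PCA and most other estimators expect a 2-D table with one row per image, so the arrays in the `data` column would need a common shape first if the images differ in size. The sketch below shows one possible way to do that. The grayscale conversion, the 64×64 target size, and the helper name `to_feature_matrix` are illustrative assumptions, not steps taken in this lab.

```python
import numpy as np
from PIL import Image


def to_feature_matrix(arrays, size=(64, 64)):
    """Convert image arrays into an (n_images, width * height) float matrix.

    Hypothetical helper: grayscale conversion and the default size are
    assumptions chosen only to keep the example small.
    """
    rows = []
    for arr in arrays:
        img = Image.fromarray(np.asarray(arr, dtype=np.uint8))
        img = img.convert("L").resize(size)  # grayscale, then a common size
        # Flatten to one row and scale pixel values to the range [0, 1]
        rows.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
    return np.vstack(rows)
```

With the DataFrame built above, a call such as `X = to_feature_matrix(df["data"])` would give one flattened image per row.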


## Data Reduction

Now we are ready for our analysis, beginning with principal component analysis (PCA).
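
As a reminder of what PCA computes, the following is a minimal sketch built on NumPy's singular value decomposition. It assumes a feature matrix `X` with one flattened image per row. The function name `pca_svd` is hypothetical, and this is not necessarily the implementation used in the rest of this lab.

```python
import numpy as np


def pca_svd(X, n_components):
    """Project the rows of X onto their top principal components.

    Returns the projected scores, the principal axes (one per row), and the
    fraction of total variance explained by each kept component.
    """
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # principal axes
    scores = Xc @ components.T               # low-dimensional coordinates
    explained = (S ** 2) / np.sum(S ** 2)    # variance fraction per component
    return scores, components, explained[:n_components]
```

The explained-variance fractions are a convenient guide for choosing how many components to keep: components are returned in order of decreasing variance, so the leading ones carry the most information.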