# [Cog Sci 1] 7 plus-or-minus 2 billion

### Professor Paul Li 

_Estimated Time: 50 Minutes_

![brain-600.jpg](attachment:brain-600.jpg)

### Topics Covered 
Welcome! This lab will be an introduction to Big Data and the Human Brain as well as a gentle introduction to Jupyter Notebooks. By the end of this lab you will be able to: 
1. Define _Big Data_ and _dimensions_ in reference to a data set
2. Explain the high-level steps of simple computer facial recognition
3. Identify 3 reasons, rooted in the limits of human cognition, why we reduce the dimensions of big data during analysis
4. Identify 2 ethical limitations or consequences that can result from reducing the dimensions of a large data set

![big_data_rcs_case_1440x600_canvas.jpg](attachment:big_data_rcs_case_1440x600_canvas.jpg)

## Table of Contents 

__need to add in hyperlinks__

1. Jupyter Notebooks 
    - Running Cells
2. Big Data 
    - What is big data? 
    - tie back in with our data
3. Computer Vision: Recognizing Faces in an Image 
    - Overview of Facial Recognition 
    - How do these recognition systems work? 
    - How do humans recognize faces?
4. Dimensionality Reduction 
    - What is dimensionality reduction? 
    - Widget 
5. Big data can be too big! 
    - Why would we simplify data? 
    - What are the risks with simplifying data?

### Dependencies: 

In [4]:
#Import pandas, a data science library that will be used later 
import pandas as pd 

## Jupyter Notebooks 

A Jupyter Notebook is an online, interactive computing environment, composed of different types of __cells__. Cells are chunks of code or text that are used to break up a larger notebook into smaller, more manageable parts and to let the viewer modify and interact with the elements of the notebook.

Notice that the notebook consists of 2 different kinds of cells: **markdown** and **code**. A markdown cell (like this one) contains text, while a code cell contains expressions in Python, the programming language in this Notebook.

### The Data

In this lab we will be using a dataset of face images that were captured between April 1992 and April 1994 at AT&T Laboratories Cambridge. There are 10 different images of each of 40 distinct subjects. The images were taken at different times and under different lighting conditions, and the subjects did not all have the same facial expressions. All of the images, however, were taken against a dark background with the subjects facing the camera. It is important to note that no effort was made to create a diverse, unbiased population sample: the participants were not chosen randomly, and the images do not represent the wider population of the US at the time. Results from any algorithm trained on this data therefore cannot be extrapolated to the rest of the country. Some example pictures are in the cell below. Check out this [link](https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) for more information!


![ATT.png](attachment:ATT.png)

Let's load the data into the notebook. You don't have to know how the code is implemented; think of it as "opening" the data like you would open an Excel file.

In [11]:
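# Run this cell to load the data
from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces()

# Subject ID (0-39) for each image
targets = faces.target
# Flatten each 64 x 64 image into a single row of 4096 pixel values
data = faces.images.reshape((len(faces.images), -1))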

__output of fetching the data:__ 

- data : numpy array of shape (400, 4096)
Each row corresponds to a ravelled face image of original size 64 x 64 pixels.

- images : numpy array of shape (400, 64, 64)
Each row is a face image corresponding to one of the 40 subjects of the dataset.

- target : numpy array of shape (400, )
Labels associated with each face image. The labels range from 0 to 39 and correspond to the Subject IDs.
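
To see what "ravelled" means in practice, here is a minimal sketch that uses a tiny made-up array in place of the real face images (3 images of 4 x 4 pixels rather than 400 images of 64 x 64). The array `images` below is an illustrative stand-in, not the actual dataset; only the `reshape` call mirrors the loading cell.

```python
import numpy as np

# Stand-in for faces.images: 3 tiny 4 x 4 "images" instead of 400 images of 64 x 64
images = np.arange(3 * 4 * 4).reshape(3, 4, 4)

# Same call pattern as the loading cell: keep one row per image, flatten the rest
data = images.reshape((len(images), -1))

print(images.shape)  # (3, 4, 4)
print(data.shape)    # (3, 16)

# Row 0 of data holds the pixels of image 0, read row by row
print(np.array_equal(data[0], images[0].ravel()))  # True
```

With the real data the same step turns shape (400, 64, 64) into (400, 4096), so each face becomes one long row of pixel values.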

![COGSCI88.jpg](attachment:COGSCI88.jpg)

### Context 

This lab explores how cognitive limits encourage different techniques for simplifying information, such as computer programs breaking data down into parts. This is similar to how the nervous system breaks down the information that it receives from the environment. Our brains and _Big Data_ algorithms both need to simplify the massive amounts of data they receive in order to process it efficiently and output a result. You will learn about simple facial recognition and how it relates to how the brain processes images. 

### Running Cells 

"Running" a cell is similar to pressing 'Enter' on a calculator once you've typed in an expression; it computes all of the expressions contained within the cell.

To run a code cell, you can do one of the following:
- press __Shift + Enter__
- click __Cell -> Run Cells__ in the toolbar at the top of the screen.

You can navigate the cells by either clicking on them or by using your up and down arrow keys. Try running the cell below to see what happens. 

In [1]:
print("Hello, World!")

Hello, World!


The input of the cell consists of the text/code that is contained within the cell's enclosing box. Here, the input is an expression in Python that "prints" or repeats whatever text or number is passed in. 

The output of running a cell is shown in the line immediately after it. Notice that markdown cells have no output. 

## Big Data 

![Big-data-azzurro.jpg](attachment:Big-data-azzurro.jpg)

### What is _Big Data_?
The term _Big Data_ seems to be a buzzword everywhere lately, but what really is _Big Data_? _Big Data_ refers to an extremely large data set that may be analyzed to reveal patterns, trends, and associations, especially relating to human behavior. Most companies and organizations collect data on every transaction or interaction for each user or consumer, so the data can expand very quickly! 

Consider the image data that your brain deals with: we are constantly processing information from our eyes, and the amount grows with every passing second. This adds up to a lot of data to process. While the dataset that we will be using later in this notebook is not this big, it is still important to recognize that _Big Data_ is everywhere. Both computer programs and our brains deal with large amounts of data every day, and both have developed strategies for processing this information efficiently. 

### Dimensionality 

How is the size of a dataset measured? One measure is its dimensionality, which refers to how many attributes a dataset has. For example, if Berkeley had a dataset of students, with each row representing a student, some variables could be: residency, year, major, units, and GPA. A high-dimensional dataset is one with a very large number of dimensions, so calculations can become difficult and the number of features can even exceed the number of observations. Later in this lab, you will learn about how reducing the dimensionality of your data can affect an image. 
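
As a rough illustration of the student example, the sketch below builds a tiny table with pandas and reads off its dimensions. The records are made up for this example and are not real Berkeley data.

```python
import pandas as pd

# Made-up student records for illustration only (not real Berkeley data)
students = pd.DataFrame({
    "residency": ["in-state", "out-of-state", "international"],
    "year": [1, 3, 2],
    "major": ["Cognitive Science", "Statistics", "Undeclared"],
    "units": [15, 13, 16],
    "gpa": [3.4, 3.7, 3.1],
})

n_observations, n_dimensions = students.shape
print(n_observations)  # 3 rows, one per student
print(n_dimensions)    # 5 columns, so the dataset has 5 dimensions

# The face data has 400 rows and 4096 columns: more features than observations
print(4096 > 400)  # True
```

Each column is one dimension. The face data is high-dimensional in exactly the sense described above: every one of its 4096 pixels counts as a separate feature.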

## Computer Vision: Recognizing Faces in an Image

![computer-vision-in-flux-v1-web.jpg](attachment:computer-vision-in-flux-v1-web.jpg)

### Overview of Facial Recognition

Facial recognition systems are computer programs that analyze images of human faces in order to identify the individuals present in them. These systems can be used for general surveillance in a public setting with public video cameras, for personal security purposes, and in many other situations. A well-known example of facial recognition is the Face ID feature of the iPhone X. The phone only unlocks when you either enter a passcode or it recognizes your face. In general, these systems work by comparing selected facial features from a given image to faces that are already known within a database. 




__should I bring up ethics of facial recognition here?__

![1_qseidoEBfxVX6I2KIbHsRA.jpg](attachment:1_qseidoEBfxVX6I2KIbHsRA.jpg)

### How do these systems work? 

Facial recognition systems use computer algorithms to pick out distinctive features and details of a person's face. The computer looks for the ways that faces can be different such as the distance between eyes, shading on the face and many other features. Below are the steps of how common algorithms work: 

1. Image is captured
2. Eye locations are determined
3. Image is converted to grayscale and cropped
4. Image is converted to a template used by the program for facial comparison results
5. Image is searched and matched using an algorithm to compare the template to other templates on file 

source: https://www.eff.org/pages/face-recognition
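
Steps 3 to 5 above can be sketched in a few lines of code. This is a simplified illustration under several assumptions, not how any particular system is built: the "photos" are random arrays, the eye-location step is skipped (the crop box is given by hand), grayscale is a plain average of the color channels, and the helper names `to_template` and `best_match` are hypothetical.

```python
import numpy as np

def to_template(rgb_image, top, left, size):
    """Grayscale, crop, and flatten an RGB image into a 1-D template."""
    gray = rgb_image.mean(axis=2)                     # step 3: average the color channels
    cropped = gray[top:top + size, left:left + size]  # step 3: crop to the face region
    return cropped.ravel()                            # step 4: the template is a flat vector

def best_match(template, known_templates):
    """Step 5: return the index of the stored template closest to `template`."""
    distances = np.linalg.norm(known_templates - template, axis=1)
    return int(np.argmin(distances))

rng = np.random.default_rng(0)
# Three random 8 x 8 RGB "photos" standing in for faces already on file
photos = rng.random((3, 8, 8, 3))
known = np.array([to_template(p, top=2, left=2, size=4) for p in photos])

# A new, slightly noisy photo of person 1
new_photo = photos[1] + rng.normal(scale=0.01, size=photos[1].shape)
print(best_match(to_template(new_photo, 2, 2, 4), known))  # 1
```

The new photo is matched to person 1 because its template is far closer to that stored template than to the other two. Real systems use much richer templates, but the compare-against-a-database idea is the same.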

![Screen%20Shot%202019-07-11%20at%2012.26.22%20PM.png](attachment:Screen%20Shot%202019-07-11%20at%2012.26.22%20PM.png)

### How do humans recognize faces? 

Our ability to recognize faces quickly and efficiently is remarkable. But how do we actually do it? Your brain can identify items and faces within milliseconds, and the process behind this is complex. 


In early face recognition processing, the occipital lobe recognizes individual features of a face, such as the eyes, ears, nose etc. Then, the fusiform gyrus, an area of the brain that is involved when you look at a face, is activated. It is responsible for holistic information, meaning it puts all of the information from the occipital lobe together fo


When this area is damaged, individuals lose their ability to identify known faces. Scientists believe that the fusiform gyrus helps people to recognize faces as a whole category. 

source: https://www.brainblogger.com/2015/10/17/how-the-brain-recognizes-faces/

On a high level, facial recognition systems work by comparing facial features from a given image to features in a database. 




- Overview: what is computer facial recognition used for?
- Explain (high-level) how basic computer image recognition works. 
- Compare/contrast with how human vision systems recognize faces
    - E.g. humans recognize faces as wholes (fusiform face area) and objects as sums of component features (outlines, shapes, etc). Computers don’t differentiate; in basic computer vision, everything is an amalgam of components
- Explain eigenfaces and principal component analysis, without ever actually using the words “eigenface”, “eigen vector”, or “principal component analysis”
    - E.g. “computer looks for the ways that faces can be different (like shading) to identify different faces. It’s possible to identify the ways that are the best for distinguishing between different faces”
    - Use lots of example text and images




## Dimensionality Reduction

description of what dimensionality reduction is 

- https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/

helpful link for PCA and dimensionality reduction

Widget of dimensionality reduction

Interactive dimensionality reduction image widget: play with adding and subtracting different components to see when faces become recognizable to humans, and to see how the accuracy of the computer changes
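
The idea of adding and subtracting components can be sketched without the real face data. The example below is a minimal illustration on random numbers, assuming principal component analysis (the technique described in the linked article) computed with NumPy's `np.linalg.svd`. The array `X` and the helper `reconstruct` are made up for this sketch and are not the widget's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the face data: 50 "images" of 64 pixels each
X = rng.random((50, 64))

mean = X.mean(axis=0)
centered = X - mean
# Rows of Vt are the directions ("components") along which the images vary most
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

def reconstruct(k):
    """Rebuild every image using only the first k components."""
    weights = centered @ Vt[:k].T  # k numbers per image instead of 64
    return mean + weights @ Vt[:k]

# Fewer components means a coarser picture and a larger reconstruction error
for k in (1, 10, 50):
    error = np.mean((X - reconstruct(k)) ** 2)
    print(k, round(float(error), 4))
```

The printed error shrinks as more components are kept. With real faces, the same trade-off decides how many components are needed before a face becomes recognizable.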


look at that old jupyter notebook 

## Big Data can be too big! 

### Why would we simplify data?

- remove noise
- speed up computation
- make analysis more likely to be human understandable 
- tie in article 

### What are the risks with simplifying data?


- Takeaway: if we simplify data sets to account for “most” variation, algorithms trained on that data can then fail to work well for edge cases. In real life, “edge cases” often are minorities/people underrepresented in data
- Can potentially connect to lots of real-life examples:
    - Google Photos tagged dark-skinned people as “gorillas”
    - HP webcam face tracking only recognized light-skinned faces
    - NY Times on history of facial recognition controversy
- Changing the data may also introduce the data scientist's own bias into the dataset 
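
The takeaway about edge cases can be made concrete with a toy example. This is a sketch on made-up numbers, not a model of any of the real incidents listed: 95 "majority" points differ only in one feature, 5 "minority" points differ only in another, and the data is simplified down to the single direction that explains most of the variation. The helper `mean_error` is hypothetical.

```python
import numpy as np

# Toy data with 2 features. 95 "majority" points differ only in feature 0,
# 5 "minority" points differ only in feature 1.
majority = np.column_stack([np.linspace(-1, 1, 95), np.zeros(95)])
minority = np.column_stack([np.zeros(5), np.linspace(-1, 1, 5)])
X = np.vstack([majority, minority])

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
top = Vt[:1]  # the single direction that explains "most" of the variation

def mean_error(points):
    """Average squared error after squeezing points down to one component."""
    rebuilt = mean + (points - mean) @ top.T @ top
    return float(np.mean(np.sum((points - rebuilt) ** 2, axis=1)))

print(round(mean_error(majority), 3))  # 0.0
print(round(mean_error(minority), 3))  # 0.5
```

The simplified data describes the majority almost perfectly, while everything that distinguished the minority points from one another is thrown away.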



- https://www.eff.org/pages/face-recognition
- good article for downsides of facial recognition

## TO DO: 

- add in hyperlinks for table of contents
- finish introductory cells for each section 
- work more with the data for the PCA
- make some of the images smaller
- ask how to properly cite all of the images
- clean up language
- add a link below each image 


## Bibliography 

image 1: https://clalliance.org/blog/computing-brains-neuroscience-machine-intelligence-and-big-data-in-the-cognitive-classroom/

image 2: https://www.reply.com/en/topics/big-data-and-analytics/a-harmonised-big-data-management-model 

image 3: https://www.wollybi.com/en/2018/03/05/big-data-analytics-opportunities-in-the-digital-era/

image 4: http://data8.org/connector/Cognitive%20Science/ 

image 5: http://terminalcoders.blogspot.com/2017/03/at-face-database-in-png.html

image 6: https://www.weareworldquant.com/en/thought-leadership/understanding-images-computer-vision-in-flux/

image 7: https://medium.com/wobot-intelligence/how-facial-recognition-is-the-next-big-disruption-e3d4ac73666f

image 8: https://www.eff.org/pages/face-recognition 