# ML for Monsters

<img src="../img/darkbertscreenshot.png" alt="Alternative text" />


# Introduction: Goals of the Course


I am supposed to write here about what machine learning is, what it might or might not become for all of us, with all of us behind its back, on its case, under its skin, peeking through the hedges of its enclosure. I'm supposed to speak to its affects and potentials, illuminate its pitfalls, stab at its weak points. Instead, I look out the window at the St. Lawrence. I see a swarm of small insects with papery wings dancing under a tree in front of the pale peach dying light. I think of emerging programs of collective intelligence, growing out of top-down programs for collecting intelligence. Of how we can learn from those around us, new ways of seeing, of statistics as a practice of quiet observation. I see the water, brackish and stratified by the different textures of its varied currents. The wide water of the river slowly cleaning mud so thoroughly that in three years it will have forgotten being dumped out of a sewage plant, will have been transposed by completely new mud. Mudamorphosis.
 
The biggest question I am bringing to this course, that I am stuck on, that I hope we might stick to, stick it to, or even stick up, is: Do we fuck with machine learning? If so, how? Some of us already fuck with it and wonder why, some of us are curious if it has anything to do with us, or want to discover how exactly it already fucks us.

I've been coding statistical models of meaning for a decade, wading into murky waters following the siren song of the playful machine. I'm in deep. Deep in the mud of a river where more and more industrial waste gets dumped every day. I've been drawn in, drawn on by this suspicion of potential, barely visible in the far distance, more felt than seen. But as I approach, it recedes. I study meaning space so I can walk around in it. Not as a way to map the world but as a world unto itself, to be explored, to get lost in. I'm here because I need help, and I hope that if we all come at it from different directions, eventually we'll have it surrounded, or even if we don't, we'll have passed through each other.

Let's fuck around and find it.

## Three Perspectives on Machine Learning

In this course, we'll use natural language processing (NLP) as a context to explore the state of machine learning and AI from many perspectives: critical and constructive, philosophical and practical. 

**critique / AI genealogies:** how are logics of capture and (mis)representation encoded into machine learning models, and how are these models currently employed within apparatuses of control?

**combat / AI adversaries:** how can the function of ML models be exploited to get them to behave in unintended ways? How can we resist the push to be desired users?

**construct / AI poetics:** can we imagine using ML in ways that step outside the paradigm of representation, prediction, and control? What does partisan machine learning look like?

## Three Levels of Exploration

In order to develop these perspectives, we will cover (mostly language-related) machine learning technologies at three levels: 

- **mathematical:** familiarity and comfort with formalisms and concepts from statistics, linear algebra, and probability theory
    - information theory and probability
    - loss functions
    - backpropagation
- **operational:** ability to use tools in the ecosystem that implement these formalisms
    - text processing
    - language modeling (guessing the next word)
    - classification
- **socio-historical:** how did these foundational formalisms come to be so important in the field? how do the commercial interests and ideologies of machine learning developers affect the way problems are defined and approached?
    - applications (weaponized & radical)
    - social theory
    
These levels correspond to degrees of contextualization. 




## Synthesis

We welcome any mode in which you wish to engage with the material. If you want to compare the effect of different kinds of loss functions using pure math or a fake procedurally generated dataset, go for it! And teach us what you learn. If your project is a prototype for an app, or a proposal for a new kind of dataset and a plan to collect it, we will stay up late with you designing the pilot. If you want to write a philosophical essay on the impossibility of 'ML for social good', use us to workshop your ideas.

We want to push as far as we can in all of these directions. But we also insist on taking a holistic stance, and hope that you do too: the lenses we look through and the levels we look at necessarily interact.
We believe that in order to approach ML through any of these lenses or at any of these levels, and do it well, it is absolutely necessary to engage with the others.

To understand the math, it helps to understand the social contexts that led to its development, and to use tools that allow us to abstract away from the implementation details once we understand them. EXAMPLE

To understand and critique the current sociocultural landscapes of AI, we have to know how to interpret the methods used by applications, at the formal and toolchain level. To intervene in this landscape through misuse of models, deception of models, or application of ML to our own curiosities and problems, we need to be comfortable using the tools. EXAMPLE

To confidently apply tools to a variety of situations and new kinds of data, we need to develop firm intuitions about the underlying math. Conversely, thinking through sociological and epistemological questions raised by different modeling techniques will enable us to recognize what possibilities the ecosystem of tools opens up and what possibilities it forecloses. EXAMPLE

# Structure of the Course

Our classes will be a combination of

* Interactive lectures w/ coding exercises in Jupyter notebooks
* Slightly longer guided labs
* 4 discussion sections throughout the week anchored by readings
* Projects
    - solo or group
    - we'll get started thinking about these early
    - ex: make a Streamlit app that queries a large language model
    - ex: design a data collection pri
    
    
Over the course of a week we'll learn all the foundational math and techniques for building our own ML pipeline, including training our own language model that we use to train a neural net classifier.

### Critique

- dissect automated content moderation by implementing it in pytorch

- machine learning as pathological recapitulation of the past - Anne Dufourmantelle
- how labeled datasets define categories and objectives

- who decides what problems are interesting to work on, what problems deserve to be "solved", and why?
- limits (and affordances) of quantification, encoding, problematic of information
- empirical experimentation: how do the choices we make affect the construction of categories and norms?
    - e.g., how do our representations change when we use a different training corpus?
    - what about preprocessing techniques?
    - feature selection
- how do we measure the difference between desired and observed behaviors?

### Combat

- how is biopolitical power exercised in language models?
- prompt injection / prompt leaking
- evading the censor


### Construct
- text mashups with n-gram language models
- formulating the problem: how can we design tasks?
- autonomy: open source language models and ML tools
- building LLM apps

# What is NLP

> "Be able to solve problems that require deep understanding of text" - Greg Durrett CS 388 NLP

> "The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves." - Wikipedia

What do the people who wrote these definitions think it means to process language?

# What is NLP

"Be able to **solve problems** that require deep understanding of text" - Greg Durrett CS 388 NLP

"The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then **accurately extract information** and insights **contained** in the documents as well as **categorize and organize** the documents themselves." - Wikipedia

### Key concepts
* desired behavior (both us and the model)
* the meaning is in the data (conduit metaphor - Michael Reddy)
* language as categorization engine

# Symbolic AI
## a.k.a. Good Old Fashioned AI (GOFAI)

- rule-based logic
- ontologies


## Dartmouth Summer Research Project on Artificial Intelligence (1956)

- Participants : Claude Shannon, Marvin Minsky, & others
- Funded by Rockefeller Foundation
- Proposal: http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf

"every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it."

## Georgetown IBM experiment (1954)

<img src="../img/electronicbrain.png" alt="Alternative text" />

https://www.youtube.com/watch?v=aygSMgK3BEM
    
Russian - English "translation" system

> "this will be quite an adequate speed to deal with the whole output of the soviet union in just a few hours' computer time a week."

DoD funded project

What happened?

> "the spirit is willing but the flesh is weak."

Translate this to Russian, and then translate back

> "the vodka is good but the meat is rotten."

## ELIZA Rogerian psychotherapy (Joseph Weizenbaum, 1966)

<img src="../img/ELIZA_conversation.png" alt="Alternative text" />

## SHRDLU blocks world (1968-70)

<img src="../img/shrdlu.jpg" alt="Alternative text" />

Terry Winograd MIT AI lab dissertation

terminal connected to a robot arm that interacted with a 'blocks world'

https://hci.stanford.edu/~winograd/shrdlu/AITR-235.pdf



### The Perceptron

In 1969, Marvin Minsky and Seymour Papert's famous book *Perceptrons* showed that a single-layer perceptron cannot learn the XOR function: no single straight line separates the inputs where XOR is 1 from the inputs where it is 0.
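To see the XOR problem for ourselves, here is a minimal sketch of the classic perceptron learning rule in plain Python. The function name `train_perceptron`, the learning rate, and the epoch limit are our own choices for illustration, not anything taken from the book. We train the same single-layer unit on AND (which a line can separate) and on XOR (which it can't).

```python
def train_perceptron(examples, epochs=100, lr=1.0):
    """Perceptron learning rule for two binary inputs.

    Returns (weights, bias, converged). `converged` is True if one full
    pass over the data produced no mistakes.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in examples:
            # Threshold unit: fire if the weighted sum is positive
            prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - prediction
            if error != 0:
                # Nudge the line toward the misclassified example
                w[0] += lr * error * x1
                w[1] += lr * error * x2
                b += lr * error
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND learned:", and_converged)
print("XOR learned:", xor_converged)
```

However many epochs we allow, the XOR run never gets through a pass without a mistake, because a mistake-free pass would mean the unit had found a separating line.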

<img src="../img/xor.png" alt="Alternative text" width="400"/>



# AI Winter
<img src="../img/aiwinter.png" alt="Alternative text" />

# Statistical NLP (1990s - 2000s)

- Probabilistic rules are inferred through statistical regularities in corpora
- N-gram language models
- Logistic regression
- Designed features
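As a rough sketch of what an n-gram language model from this era does, the snippet below counts bigrams in a toy corpus and turns the counts into conditional probabilities. The corpus, the `<s>` / `</s>` boundary markers, and the function names are made up for illustration; real systems added smoothing and used far more data.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next word | previous word) from raw bigram counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {word: n / total for word, n in followers.items()}
    return model


def generate(model, rng, max_len=20):
    """Sample one word at a time until the end marker or max_len."""
    word, output = "<s>", []
    for _ in range(max_len):
        choices = model[word]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


corpus = [
    "the river cleans the mud",
    "the mud forgets the river",
    "the river forgets",
]
bigram_model = train_bigram_model(corpus)
print(bigram_model["the"])  # distribution over words that follow "the"
print(generate(bigram_model, random.Random(0)))
```

Everything the model "knows" is a table of which words followed which in the corpus, so changing the corpus changes the model.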

## Statistical Machine Translation (SMT)

<img src="../img/emstep5.png" alt="Alternative text" width="600" />

* What's the probability that 'maison' is translated as house?
* For example, the outcome might be: 'maison' is translated as 'house'
* Expectation Maximization (EM)
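Here is a small sketch of how EM can estimate word translation probabilities, assuming an IBM Model 1 style setup without a NULL word. The three-sentence French-English corpus and the function name `ibm_model1` are invented for illustration, and this is not necessarily the exact procedure behind the figure above. `t[(e, f)]` stands for the probability that foreign word `f` is translated as English word `e`.

```python
from collections import defaultdict


def ibm_model1(parallel_corpus, iterations=10):
    """Estimate t(e | f) with EM, IBM Model 1 style (no NULL word)."""
    pairs = [(f.split(), e.split()) for f, e in parallel_corpus]
    english_vocab = {e for _, es in pairs for e in es}
    # Start with no preference: every translation is equally likely
    t = defaultdict(lambda: 1.0 / len(english_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being translated as e
        total = defaultdict(float)  # expected count of f overall
        # E-step: share each English word among the foreign words in its sentence
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # M-step: re-estimate probabilities from the expected counts
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


toy_corpus = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("une fleur", "a flower"),
]
t = ibm_model1(toy_corpus)
print("t(house | maison) =", round(t[("house", "maison")], 3))
print("t(the   | maison) =", round(t[("the", "maison")], 3))
```

Nobody tells the model that 'maison' means 'house'. Because 'la' also appears with 'the' in a sentence without 'maison', each round of EM shifts more of the credit for 'the' onto 'la', leaving 'house' for 'maison'.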




## Logistic Regression
<img src="../img/logistic_regression.png" alt="Alternative text" width="500" />

* We have access to a number of known variables or 'features'
* Learn statistical regularities between the presence of those features and the outcomes we are interested in predicting.
* A lot of work goes into designing features.
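To make the role of designed features concrete, here is a minimal logistic regression sketch in plain Python. The two hand-picked word lists, the four training sentences, and all the function names are hypothetical. The point is only that the model never sees the text, just the numbers we decided to extract from it.

```python
import math


def featurize(text):
    """Hand-designed features: [count of 'positive' words, count of 'negative' words]."""
    positive = {"love", "great", "good"}
    negative = {"hate", "awful", "bad"}
    tokens = text.lower().split()
    return [sum(tok in positive for tok in tokens),
            sum(tok in negative for tok in tokens)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = p - label  # gradient of the log loss with respect to the logit
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict_proba(text, weights, bias):
    features = featurize(text)
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


texts = ["i love this great film", "what a good day",
         "i hate this awful film", "such a bad day"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

weights, bias = train_logistic_regression([featurize(t) for t in texts], labels)
print("learned weights:", weights)
print("P(positive | 'a good film') =", predict_proba("a good film", weights, bias))
print("P(positive | 'a bad film')  =", predict_proba("a bad film", weights, bias))
```

Swap the word lists in `featurize` and the same sentences land in different places in feature space, which is where most of the design work (and most of the politics) sits.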

# Features

* Each axis/dimension represents some feature of the data, which might be a document, might be a person. In practice, we represent the data along way more than two dimensions.
* The goal is to maximize the separation of your data in geometric space, according to the divisions you think are important. In this way the goal of our project influences what kind of **digital objects** we create. (Yuk Hui)
* Choose characteristics that are relevant to the task (same document, different features, depending on what we are trying to do)


### Drawing lines in meaning space
> Biopower appears to function by dividing people into those who must live and those who must die. As it proceeds on the basis of a split between the living and the dead, such power defines itself in relation to the biological field---of which it takes control and in which it invests itself. This control presupposes a distribution of human species into groups, a subdivision of the population into subgroups, and the establishment of a biological caesura between these subgroups. Foucault refers to this using the seemingly familiar term "racism."

Necropolitics, Achille Mbembe

## Distributional Semantics

Latent Semantic Indexing

<img src="../img/lsipatent.png" alt="Alternative text" />

Example term x document co-occurrence matrix from Scott Deerwester & Susan Dumais's patent for Bell Labs, filed 1988.

- Used to discover "implicit higher order structure"
- Important in information extraction
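As a sketch of the idea behind Latent Semantic Indexing, the snippet below builds a tiny term x document count matrix and reduces it with a truncated singular value decomposition. It assumes NumPy is installed. The four toy documents and the choice of `k = 2` latent dimensions are ours, not the patent's.

```python
import numpy as np

documents = [
    "human machine interface",
    "machine learning models",
    "river mud water",
    "water currents river",
]
vocab = sorted({word for doc in documents for word in doc.split()})

# Term x document matrix: one row per term, one column per document
A = np.array([[doc.split().count(term) for doc in documents] for term in vocab],
             dtype=float)

# Keep only the k strongest latent dimensions
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(S[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print("raw counts, doc 0 vs doc 1:  ", cosine(A[:, 0], A[:, 1]))
print("latent space, doc 0 vs doc 1:", cosine(doc_vectors[0], doc_vectors[1]))
print("latent space, doc 0 vs doc 2:", cosine(doc_vectors[0], doc_vectors[2]))
```

Documents 0 and 1 share only one word, so their raw count vectors are not very similar. After the reduction they collapse onto the same latent direction, while the river documents sit on another. That collapse is roughly what "implicit higher order structure" refers to.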

## NLP Pipelines

NLP systems were organized as pipelines. The idea was to first process a sentence in a way that extracts useful structural information about it: tagging parts of speech, parsing syntax, building semantic representations. The outputs of these 'upstream tasks' were then used as input representations for other systems that performed specific operations of interest, mostly machine translation.

### "Upstream" Tasks - Low-Level Features

Syntactic Processing
1. Tokenization
2. Linguistic Annotation
   - Part of Speech tagging
   - Syntactic parsing

   
Semantic Processing
1. Topic Modeling
2. Semantic similarity
3. Named Entity Recognition
4. Relation Extraction


### "Downstream" Tasks - Applications

The main areas of applied interest in NLP during this era were things like

- Machine Translation
- Information Retrieval / document querying

# Neural NLP

- Feature engineering no longer necessary
- Features are learned.
- http://www.r2d3.us/visual-intro-to-machine-learning-part-1/


## Multi Layer Perceptron (Bengio 2003)

<img src="../img/bengio2003neurallanguagemodels.png" alt="Alternative text" />

## Deep learning 

<img src="../img/deeplearning.png" alt="Alternative text" />


## Word2Vec (Google: Mikolov et al. 2013)

<img src="../img/word2vecgraph.png" alt="Alternative text" />


## Attention is all you need (Vaswani et al., 2017)


Attention allows the model to decide which words to 'attend' to during the generation process: based on the words it already knows about, which words are the most important for guessing the next word?
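A minimal sketch of that mechanism, assuming NumPy is installed: scaled dot-product attention with a causal mask, so each position can only look at itself and the words before it. In a real transformer the queries, keys, and values come from learned linear projections and there are many heads in parallel. Here, as a simplifying assumption, we feed the same random vectors in for all three.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position i attends only to positions <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # how well each query matches each key
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal = future words
    scores = np.where(mask, -np.inf, scores)               # future words get zero weight
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 "words", each an 8-dimensional vector (random stand-ins)
output, weights = causal_self_attention(X, X, X)
print(np.round(weights, 2))
```

Each row of `weights` says how much one position draws on each of the words it is allowed to see. The output for that position is the corresponding weighted average of the value vectors.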

<img src="../img/transformers.png" alt="Alternative text" />

## BERT (Google: Devlin et al., 2018)

<img src="../img/attention_alignment.png" alt="Alternative text" />

* enter the era of pretrained language models
* just the encoder.
* Pre-Train + Fine-tune Paradigm
* fine tuning updates the model parameters on domain specific data
* people are still training models on their specific datasets (SciBERT, DarkBERT, etc.)

## GPT

<img src="../img/astrofiziks.png" alt="Alternative text" />

<img src="../img/zero-shot.jpg" alt="Alternative text" />

# The Innovation... if you can call it that

<img src="../img/transformers_bert_gpt.png" alt="Alternative text" />


<img src="../img/gpt2-sizes-hyperparameters-3.png" alt="Alternative text" />

## Scale

Basically nothing has changed since 2018 except the addition of parameters

| Model   | Training Data | Parameters |
| ------- | ------------- | ---------- |
| GPT-2   | 40 GB (WebText) | 1.5 billion |
| GPT-3   | 45 TB (45K GB) of raw text before filtering | 175 billion |
| GPT-3.5 | Above + | |
| ChatGPT | Above + Conversation Data | |
| GPT-4   | Unknown | Unconfirmed reports: eight models with 220 billion parameters each |

And fine-tuning the model to be a good bot

* so it kind of feels like we're in the middle of an AI arms race
* there's been a lot of fearmongering about what the robot overlords will do if they aren't 'aligned' with 'human' values.
* there's also been a ton of valid critique of models perpetuating social injustices. They are racist, sexist.
* there have been calls to regulate "AI", aka the development of language models, on many sides
* unlikely given that AI is seen as a strategic technology
    - The Age of AI and our Human Future - Henry Kissinger + Eric Schmidt

## ACL 2023 Keynote - Geoffrey Hinton

- 'godfather of ai' - many are saying this. His work popularized backpropagation (Rumelhart et al., 1986)
-  "LLMs have subjective experience"

### The Future
* digital computers share memory; analog computers are faster (cheap transistors)
* flocks of analog language models without shared weights. they'll teach each other, and
* when the computer dies, the memory dies with it
* backpropagation won't serve us anymore, we need something new

## Frameworks: Supervised and Unsupervised

<img src="../img/supervisedunsupervised.png" alt="Alternative text" />

# Tasks

- Start with a decision problem: given an input (sentence, document, photo?) can I sort it into the right category?
- Build a dataset of a lot of examples of inputs with their **ground truth** labels
- Amazon Mechanical Turk 
- This dataset comes to define the problem

# Examples of Tasks

1. Search
2. Question Answering
3. Image captioning
4. Speech Recognition
5. Text/Document Classification
    - sentiment analysis
6. Machine Translation
7. Language Modeling - generating new text

# Surveillance Capitalism


### PredPol

<img src="../img/predpol.png" alt="Alternative text" />

Recently rebranded to https://geolitica.com/

### Moonshot: data to end violent extremism

* Partner of Google Jigsaw (https://jigsaw.google.com/the-current/)
    - Run by Jared Cohen (counterterrorism guy; working for state department now)
* "Working to end online harms" "around the globe"

Case Study

* Location: USA
* Sector: Private
* Client: Lloyd's (the oldest continuously active insurance marketplace in the world)

> "With millions of data points, we were able to design, train and test an algorithm that generates an overall and a daily risk score for locations across the US. These scores reflect the likelihood of threats of violence online translating into violence against people or property in the real world"

https://moonshotteam.com/resource/the-moonshot-threat-bulletin-june-2023-at-a-glance/

### Turnstile Hoppers
<img src="../img/nycsubway.png" alt="Alternative text" />
<img src="../img/fareevasion.png" alt="Alternative text" />

### Scanning inmate mail for gang affiliation
<img src="../img/pigeonlycorrections.png" alt="https://www.pigeonlycorrections.com/" />
<img src="../img/pigeonlypowered.png" alt="Alternative text" />

### Anything you don't say can and will be used against you

<img src="../img/semanticreconstruction.png" alt="Alternative text" />
<img src="../img/decodedstimulus.png" alt="Alternative text" />

## Applications: Supervised or Unsupervised?

* Machine translation
* Speech-segmentation
* Automated content moderation
* Auto-generated tags / topic labels
* Smart search results - Question answering
* Recommendation systems
* Sentiment analysis
* Fraud detection / loan approval
* Language modeling


## Semi-supervised

- In between supervised and unsupervised learning
- I have some data with gold labels
- I have a lot of data without labels
- Can I somehow extend insights from the labeled data by using unlabeled data?

Q: Why should this work?

Q: Assume that each gold label is a coherent group.
- Can unlabeled data give you insight on group boundaries?
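One way to poke at this question is self-training with a nearest-centroid classifier, sketched below in plain Python. This is just one simple semi-supervised recipe among many, and the points, labels, and function names are invented for illustration. We start from one gold example per label, let the unlabeled points vote with their positions, and see whether the boundary between groups moves.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(point, centroids):
    """Assign a point to the label of its nearest centroid."""
    return min(centroids, key=lambda label: sq_dist(point, centroids[label]))


def self_train(labeled, unlabeled, rounds=5):
    """Alternate between pseudo-labeling unlabeled points and updating centroids."""
    pseudo = {label: [] for label in labeled}
    for _ in range(rounds):
        centroids = {label: centroid(labeled[label] + pseudo[label]) for label in labeled}
        pseudo = {label: [] for label in labeled}
        for point in unlabeled:
            pseudo[classify(point, centroids)].append(point)
    centroids = {label: centroid(labeled[label] + pseudo[label]) for label in labeled}
    return centroids, pseudo


# One gold example per label, plus unlabeled points forming two loose clumps
gold = {"a": [(1.0, 1.0)], "b": [(5.0, 5.0)]}
unlabeled = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5),
             (4.0, 4.0), (4.0, 5.0), (5.0, 4.0)]

gold_centroids = {label: centroid(points) for label, points in gold.items()}
centroids, pseudo = self_train(gold, unlabeled)

query = (2.8, 2.8)
print("gold labels only:   ", classify(query, gold_centroids))
print("with unlabeled data:", classify(query, centroids))
```

The gold examples alone put the boundary halfway between them. The unlabeled points reveal where each clump actually sits, which moves the boundary and changes the answer for the query point. This only helps if the assumption in the question holds, that each label really is one coherent group.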



## Activity 

* Pick an industry that uses AI (hint: they all do)
* Search for some things that they use it for (use cases)
* Choose one application 
* Find out details about how this application is developed
* What are the data? supervised or unsupervised? is it regulated? updated?
