# Introduction to Feature Vectors, Data Matrices, and Probability

<font color=green>__Vector:__</font> A vector is a sequence of values. For our purposes, these values may be categorical data, discrete numeric data, or continuous numeric data. In our examples, we will often use integers to make it easy. Vectors should be handwritten with a little arrow over them or typed in bold.

<font color=green>__Review:__</font> What is a sequence? What are categorical, discrete, and continuous data? How do we plot discrete data? Categorical data? What kind of probability distribution function is associated with discrete data? With continuous data?

<font color=green>__Row vector:__</font> A row vector is written horizontally:<br>
  $\pmb{x_{1}} = \begin{bmatrix} 2 & -1 & 5 \end{bmatrix}$

<font color=green>__Column vector:__</font> A column vector is written vertically:<br>
$\pmb{x_{2}} = \begin{bmatrix} 2 \\ -1 \\ 5 \end{bmatrix}$
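As a minimal sketch, here is one way the two vectors above could be stored in NumPy. Using two-dimensional arrays is an assumption made here so that the row/column distinction shows up in the shape; the names `x1` and `x2` simply mirror the symbols above.

```python
import numpy as np

# A row vector: one row, three columns.
x1 = np.array([[2, -1, 5]])

# A column vector: three rows, one column.
x2 = np.array([[2], [-1], [5]])

print(x1.shape)  # (1, 3)
print(x2.shape)  # (3, 1)
```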

<font color=green>__Review:__</font> What is the relationship between $\pmb{x_{1}}$ and $\pmb{x_{2}}$ and how do we write it?

<font color=green>__Instance/observation:__</font> In machine learning (ML), an instance or observation is data that we have about a real-world object (e.g. a customer). What do we want to do with our instances? Have we met any instances today?

<font color=green>__Prediction:__</font> In ML, predictions often have nothing to do with the future. To predict means to estimate an unknown value, which need not be associated with time, such as whether a customer `will default/won't default` on their credit card.

<font color=green>__Machine learning:__</font> A program that implements a mathematical model and then uses data to optimize the model so that it makes the best possible predictions.

<font color=green>__Training data:__</font> The data that the ML program uses to optimize itself.

<font color=green>__Feature/attribute:__</font> Each type of data that we have about an instance is called a feature or attribute. For example, if our instances are customers, we may have a list of previous orders, a credit score, an age, and a gender for each of them. These are features/attributes of our instances. It is easiest to talk about attributes if we have a name for them (unlike the instances above), such as `Age`.

<font color=green>__Feature vector:__</font> We store the data for an instance in a feature vector. In principle, the order of the features in a feature vector doesn't matter, but implementation-wise, all of the feature vectors representing different instances must have their features in the same order.
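The ordering requirement can be sketched in plain Python. The feature names `CreditScore` and `NumOrders`, the helper `to_feature_vector`, and both customers are hypothetical and made up for illustration; only `Age` is named in the text above.

```python
# Hypothetical feature order, fixed once and reused for every instance.
FEATURES = ["Age", "CreditScore", "NumOrders"]

# Two made-up customers whose data arrives with keys in different orders.
customer_a = {"Age": 34, "CreditScore": 710, "NumOrders": 12}
customer_b = {"NumOrders": 3, "Age": 51, "CreditScore": 640}


def to_feature_vector(instance, features=FEATURES):
    """Return the instance's values in the agreed feature order."""
    return [instance[name] for name in features]


vec_a = to_feature_vector(customer_a)
vec_b = to_feature_vector(customer_b)

print(vec_a)  # [34, 710, 12]
print(vec_b)  # [51, 640, 3]
```

Because both vectors follow `FEATURES`, position 0 always means `Age`, position 1 always means `CreditScore`, and so on, no matter how the raw data was ordered.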

# Let's Look at Some Data

Suppose we are given the CSV file `TIA_1987_2016_with_dates.csv`, which contains meteorological data for Tucson International Airport and looks like this:
![image.png](attachment:image.png)

How do we get a `DataFrame` that looks like this?

![image.png](attachment:image.png)
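One possible approach, sketched under assumptions: `pandas.read_csv` can parse a date column and use it as the index. The column names `Date` and `MaxT` and all values below are made up for illustration (only `MinT` appears later in this notebook), and a small in-memory string stands in for the real file so the cell runs on its own.

```python
import io

import pandas as pd

# Made-up stand-in for the CSV file; the real columns and values may differ.
csv_text = """Date,MaxT,MinT
1987-01-01,60.0,35.0
1987-01-02,63.0,41.0
1987-01-03,58.0,44.0
"""

# With the real file, pass the path 'TIA_1987_2016_with_dates.csv'
# in place of the in-memory buffer.
df = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(df.head())
print(type(df.index))
```

If the date column in the real file has a different name or format, the `index_col` argument would need to change accordingly.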

The index of our frame is a sequence of instance identifiers.  What are our instances?

Each row of our frame represents an instance. What are the names of the attributes of our instances and where are they stored?

What are the values for the attributes for December 31<sup>st</sup>, 1987?

What is this (pandas-wise)?

<font color=green>__Data matrix:__</font> A data matrix is a sequence of feature vectors stored in a frame, one vector per row.
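A small sketch of how these ideas look in pandas, using a made-up three-day frame (the `MaxT` column and all values are assumptions for illustration):

```python
import pandas as pd

# Made-up values; only the column name MinT comes from this notebook.
df = pd.DataFrame(
    {"MaxT": [60.0, 63.0, 58.0], "MinT": [35.0, 41.0, 44.0]},
    index=pd.to_datetime(["1987-01-01", "1987-01-02", "1987-01-03"]),
)

# Selecting one index label returns that row as a Series: one feature vector.
row = df.loc[pd.Timestamp("1987-01-02")]
print(type(row))

# The shape of the frame is (number of instances, number of features).
n_instances, n_features = df.shape
print(n_instances, n_features)  # 3 2

# The underlying values form a 2-D array with one row per instance.
matrix = df.to_numpy()
print(matrix.shape)  # (3, 2)
```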

How many instances do we have?

What do they represent?

Let's look at January of 1987:

Let's just store this data in its own frame:
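A minimal sketch of one way to do this, assuming the frame has a `DatetimeIndex`: pandas supports partial string indexing, so a year-month string passed to `.loc` selects every row in that month. The frame below is made up and spans only four days so that the effect is easy to see.

```python
import pandas as pd

# Made-up daily frame that straddles the end of January 1987.
dates = pd.date_range("1987-01-30", "1987-02-02", freq="D")
df = pd.DataFrame({"MinT": [38.0, 41.0, 45.0, 43.0]}, index=dates)

# Select every row whose date falls in January 1987, and copy the result
# so that jan87 is an independent frame.
jan87 = df.loc["1987-01"].copy()

print(jan87)
```

Here `jan87` holds only the rows for January 30 and 31; with the full dataset the same expression would return all 31 days of the month.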

# Probability

Now we will review some jargon from the field of probability and relate it to our ML jargon using this data as an example.

<font color=green>__Random variable:__</font> A random variable is a quantity of interest that can take on one of at least two possible values.  What is/are the random variable(s) in our example? 

<font color=green>__Outcome:__</font> An outcome is a possible value that a random variable can take on. In common usage, an outcome is an actual result. In probability, an outcome is equivalent to a possible outcome in common usage. Is `MinT == 41.0` an outcome in our `jan87` example?

<font color=green>__Event:__</font> An event is a set of outcomes. In common usage, an event is something that actually happens. In probability, an event is equivalent to a possible event in common usage. Is `MinT >= 41.0` an event in our `jan87` example?

<font color=green>__Frequency of an event:__</font> The frequency of an event is the number of times it occurs in a dataset. What is the frequency of `MinT >= 41.0` in our `jan87` example?

<font color=green>__Relative Frequency of an event:__</font> The relative frequency of an event is the number of times it occurs in a dataset divided by the number of observations. What is the relative frequency of `MinT >= 41.0` in our `jan87` example?
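The two definitions above can be sketched with a boolean Series. The four-day `jan87` frame below is made up for illustration, so the numbers it produces are not the answers for the real data.

```python
import pandas as pd

# Made-up stand-in for the jan87 frame.
jan87 = pd.DataFrame(
    {"MinT": [35.0, 41.0, 44.0, 39.0]},
    index=pd.date_range("1987-01-01", periods=4, freq="D"),
)

# True on the days where the event MinT >= 41.0 occurred.
event = jan87["MinT"] >= 41.0

# Frequency: the number of observations in which the event occurred.
frequency = int(event.sum())

# Relative frequency: frequency divided by the number of observations.
relative_frequency = frequency / len(jan87)

print(frequency)           # 2
print(relative_frequency)  # 0.5
```

Since `True` counts as 1 and `False` as 0, `event.mean()` gives the same relative frequency in a single step.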

<font color=green>__Estimated probability of an event:__</font> The estimated probability of an event is the relative frequency of the event in our dataset or a subset of our dataset. In our example, the relative frequency of our example event is the exact probability that a randomly selected day in Jan 1987 had `MinT >= 41.0` at TIA. We could use it as an estimated probability that any day in any January at TIA has `MinT >= 41.0`. Do you see any problems with this? 

## <font color=blue> Worksheet</font>