# INFO 3401 Introduction to pandas

This may review content you have learned in other classes, but we want to start with the same foundation since pandas is so fundamental to everything else we will do in this class!

## Learning Objectives

* Learn about the pandas `Series` class

## Load libraries

In this class, it is important that we all use the same versions of the libraries. If your package versions differ from ours, you might get different results!

- Let's look at pandas [releases](https://github.com/pandas-dev/pandas)
- If you dig through the release notes, you can find major breaking [changes to the pandas API](https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html) in past versions

In [1]:
import pandas as pd
import numpy as np
import altair as alt

pd.options.display.max_columns = 100

print("Check versions !")
print(alt.__version__)
print(np.__version__)
print(pd.__version__)

Check versions !
4.1.0
1.21.0
1.3.2


## Fundamental data types

There are two fundamental data types in `pandas`: a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) and a [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series). 

Today we will learn about `Series`.
- A series is a collection of values
- Each value is referenced with a label, and the collection of labels is called the `index`. Supplying an index is optional: if you don't include one, pandas will reference each element in the series by its position number.
- Usually, everything in a series has the same type
- You can think of a series as a kind of array, with extra features (including indexes)
- Examples:
    - A series of the names of everyone who lives in your house (strings)
    - A series of the age of everyone in 3401 (ints)
- You can also think of a series as collecting a bunch of observations of a single variable. For instance, a series could collect observations of rolls of a die. That is a boring example from 2301 though (or your intro stats class). In 3401 the point is to take those same concepts and apply them to far more interesting things, such as a series of the tax rates of every company in the S&P 500, or a series of the CO2 emissions from the manufacture of each make and model of electric car.
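
The points above about default indexes and types can be sketched in a few lines. This is a minimal illustration rather than part of the original notebook: the names `ages` and `mixed` and the values in them are made up for the example.

```python
import pandas as pd

# Hypothetical values, invented purely for illustration
ages = pd.Series([19, 20, 22])

# No index was supplied, so pandas labels each element by position: 0, 1, 2
default_labels = list(ages.index)

# Every value is an int, so the Series gets an integer dtype
ages_dtype = ages.dtype

# Mixing ints, strings and floats is allowed, but the dtype
# falls back to the generic `object` type
mixed = pd.Series([19, "twenty", 22.5])
mixed_dtype = mixed.dtype

print(default_labels)
print(ages_dtype)
print(mixed_dtype)
```

Keeping one type per series is usually worth the effort, because numeric operations on an `object` series are slower and easier to get wrong.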

#### NFL QB passing yards for 2020

Here is an example of data that fits naturally in a Series:

| Player     | Passing yards  |
| ----------- | ----------- |
| Deshaun Watson     | 4823       |
| Patrick Mahomes   | 4740        |
| Tom Brady   | 4633        |
| Matt Ryan | 4581 | 

Aside: [pretty tables](https://www.markdownguide.org/extended-syntax/) in markdown. As part of your data scientist mindset in this class, you should be up for reading about markdown table syntax to modify as needed for your own projects.

We can turn the passing yards column of this table into a Series like this:

In [8]:
yards = [4823, 4740, 4633, 4581]
yards_series = pd.Series(yards)  # build the Series from the list defined above

We can reference each element in the series by its position number, like a row in a spreadsheet or an element of a numpy array

In [9]:
yards_series[0]

4823

But we want to be able to refer to each element by name. For that, you use an index.

In [11]:
yards = [4823, 4740, 4633, 4581]
index=["Deshaun Watson", "Patrick Mahomes", "Tom Brady", "Matt Ryan"]
yards_named = pd.Series(yards, index=index)
yards_named

Deshaun Watson     4823
Patrick Mahomes    4740
Tom Brady          4633
Matt Ryan          4581
dtype: int64

In [12]:
yards_named["Matt Ryan"] # much better

4581
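
A small sketch of the two lookup styles side by side may help here. It goes slightly beyond the cells above by using the `.loc` (label-based) and `.iloc` (position-based) accessors, which pandas provides for making the kind of lookup explicit. The variable names `players`, `by_label` and `by_position` are just for this example.

```python
import pandas as pd

yards = [4823, 4740, 4633, 4581]
players = ["Deshaun Watson", "Patrick Mahomes", "Tom Brady", "Matt Ryan"]
yards_named = pd.Series(yards, index=players)

# Look up a value by its label (the player's name)
by_label = yards_named.loc["Matt Ryan"]

# Look up a value by its position (3 is the fourth element, counting from 0)
by_position = yards_named.iloc[3]

# The labels and the values can each be pulled out on their own
labels = list(yards_named.index)
values = yards_named.to_list()

print(by_label, by_position)
print(labels)
print(values)
```

Both lookups return Matt Ryan's passing yards. Naming the accessor avoids ambiguity about whether a number like `3` refers to a label or a position.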

Question: what is the data type of `yards_named`?

[Your answer here]

Question: make a Series called `first_three` of the first three letters of the alphabet.

[Your answer here]

#### Flexing your data science mindset!
Recall that a uniform distribution assigns an equal probability to each outcome of an experiment. In other words, if we have a set of possible outcomes $\Omega$ (the sample space), then we will observe each outcome with probability $\frac{1}{\vert \Omega \vert}$. (Pause for "[LaTeX](https://en.wikipedia.org/wiki/LaTeX)" here.)

For instance, rolling a fair die follows a uniform distribution over the numbers 1 to 6, so each face comes up with probability $\frac{1}{6}$. Flex your data scientist mindset: review the [documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html) for the scientific computing library numpy, then simulate 10 rolls of a fair die and store the results in a pandas `Series`.

Notice that this example takes us from computational thinking to quantitative thinking and back (via hacking the docs!).

[Your group's code here]