# Data Structures and Operations

Doing data science requires us to understand the different "flavors" of data and how each one is stored in a computer.

<img src = '../images/supercomputer_lungs.jpg' width = 500>

## Lists

* A collection of items
* Can be different data types
* Ordered and changeable

In [56]:
food = ['coffee', 'bread', 'eggs', 'orange juice']

print(food)

['coffee', 'bread', 'eggs', 'orange juice']


In [57]:
food[1]

'bread'

In [None]:
food[3] = 'milk'

In [None]:
food
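The two cells above have no recorded output, so here is a minimal sketch of what "ordered and changeable" looks like in practice. The `append` call and the float value `2.5` are illustrative additions, not part of the original example.

```python
# Lists keep their order and can be changed in place.
food = ['coffee', 'bread', 'eggs', 'orange juice']

food[3] = 'milk'    # replace the item at index 3 (the fourth item)
food.append(2.5)    # illustrative: a list may hold mixed types, here a float

print(food)         # ['coffee', 'bread', 'eggs', 'milk', 2.5]
print(len(food))    # 5
```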

## Arrays 
* A list that contains a single data type

In [3]:
# Which lists are arrays?

grocery_list = ['eggs', 'butter', 'flour']

recipe_list = [1, 'cup', 'sugar']

age_list = [3, 8, 10, 38, 40]

number_list = [4, 'five', 'ten', 32]
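One way to check your answers is to ask Python whether every item in a list has the same type. The sketch below uses a hypothetical helper, `is_single_type`, that is not part of any library; it only applies the "single data type" definition given above.

```python
def is_single_type(items):
    """Return True if every item in the list has the same Python type."""
    return len({type(item) for item in items}) <= 1

grocery_list = ['eggs', 'butter', 'flour']
recipe_list = [1, 'cup', 'sugar']
age_list = [3, 8, 10, 38, 40]
number_list = [4, 'five', 'ten', 32]

print(is_single_type(grocery_list))  # True  (all strings)
print(is_single_type(recipe_list))   # False (an integer mixed with strings)
print(is_single_type(age_list))      # True  (all integers)
print(is_single_type(number_list))   # False (integers mixed with strings)
```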


### Arrays in Tables



#### Checkpoint: Where are the arrays?



<img src="../images/arrays_checkpoint.png" width = 300>

## NumPy

NumPy is a fundamental library for scientific computing in Python. Its basic building block is the *ndarray*, or "n-dimensional array".

Suppose we want to make a NumPy array of the following data:

<img src="../images/numpy_planets.png" width = 300>

#### 1D Arrays

A "1D array" would be a single column of this table. We can make one by passing the list of planets in the first column to the `array` function.

In [5]:
import numpy as np
planets = np.array(['mercury', 'venus', 'earth', 'mars', 'jupiter'])

In [6]:
planets

array(['mercury', 'venus', 'earth', 'mars', 'jupiter'], dtype='<U7')
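As a small sketch of what the array "knows" about itself, we can inspect a few of its attributes. The exact `dtype` string can vary by platform, so the comment below simply mirrors the output shown above.

```python
import numpy as np

planets = np.array(['mercury', 'venus', 'earth', 'mars', 'jupiter'])

print(planets.ndim)    # 1, the number of dimensions
print(planets.shape)   # (5,), five items along the single dimension
print(planets.dtype)   # a Unicode string type, shown above as <U7
print(planets[2])      # earth, indexing works just like a list
```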

#### 2D Arrays

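To make a "2D array", we can pass a "list of lists" (notice the nested brackets). Because an array holds a single data type, NumPy converts the numbers to strings when they are mixed with text: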

In [7]:
planets = np.array([['mercury', 'venus', 'earth', 'mars', 'jupiter'], 
                    [1,2,3,4,5], 
                    [0,0,1,2,79]])

In [8]:
planets

array([['mercury', 'venus', 'earth', 'mars', 'jupiter'],
       ['1', '2', '3', '4', '5'],
       ['0', '0', '1', '2', '79']], dtype='<U21')
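The quotation marks around the numbers in the output matter: they are now strings, not integers. A minimal sketch of how to confirm that, and how one row could be converted back to numbers, might look like this. The variable name `moons` is an assumption based on the planets table, not something defined earlier.

```python
import numpy as np

planets = np.array([['mercury', 'venus', 'earth', 'mars', 'jupiter'],
                    [1, 2, 3, 4, 5],
                    [0, 0, 1, 2, 79]])

print(planets.shape)     # (3, 5), three rows and five columns
print(planets[0, 2])     # earth, row 0 and column 2
print(planets[2, 4])     # 79, but stored as the string '79'

# Convert the last row back to integers so we can do arithmetic on it.
moons = planets[2].astype(int)
print(moons.sum())       # 82
```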

## Pandas

Pandas is a data analysis and manipulation tool. It has a lot of similarities to NumPy, but we tend to use it more for data in tables. As a rough rule of thumb, NumPy tends to be faster on smaller datasets (fewer than about 50K rows), while Pandas tends to do better on larger ones (more than about 500K rows).

A 1D array in Pandas is called a `Series`.

In [10]:
import pandas as pd

pandas_planets = pd.Series(['mercury', 'venus', 'earth', 'mars', 'jupiter'])

pandas_planets

0    mercury
1      venus
2      earth
3       mars
4    jupiter
dtype: object

Notice the difference compared to a NumPy 1D array. A Pandas `Series` attaches a "labeled index" to the array (the numbers 0 to 4 on the left). This is very useful when it comes to tracking data in a table.
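To see why labels are useful, here is a short sketch that looks up values by label. The second `Series`, `moons`, and its planet-name labels are illustrative assumptions built from the planets table.

```python
import pandas as pd

pandas_planets = pd.Series(['mercury', 'venus', 'earth', 'mars', 'jupiter'])

print(list(pandas_planets.index))   # [0, 1, 2, 3, 4], the default labels
print(pandas_planets.loc[2])        # earth, looked up by its label

# Illustrative: labels do not have to be numbers.
moons = pd.Series([0, 0, 1, 2, 79],
                  index=['mercury', 'venus', 'earth', 'mars', 'jupiter'])
print(moons['mars'])                # 2
```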

## Pandas DataFrames

A 2D array in Pandas is called a `DataFrame`. In addition to the labeled index for rows, we can also give each column a header.

<img src="../images/pandas_planets.png" width = 300>

This is done by passing a dictionary to the `DataFrame` constructor. A dictionary is a collection of "key-value pairs". In the case of a table, each key is a column header and each value is a list holding the data in that column.

In [11]:
pandas_planets = {'A': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
        'B': [1, 2, 3, 4, 5], 'C': [0, 0, 1, 2, 79]}

In [12]:
pandas_planets

{'A': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
 'B': [1, 2, 3, 4, 5],
 'C': [0, 0, 1, 2, 79]}
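Before turning the dictionary into a table, it may help to see how keys and values work on their own. This is a minimal sketch using the same dictionary as above.

```python
pandas_planets = {'A': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
                  'B': [1, 2, 3, 4, 5],
                  'C': [0, 0, 1, 2, 79]}

print(list(pandas_planets.keys()))   # ['A', 'B', 'C'], the future column headers
print(pandas_planets['A'])           # the list stored under key 'A'
print(pandas_planets['C'][4])        # 79, the fifth item of the list under 'C'
```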

In [15]:
planets_df = pd.DataFrame(pandas_planets)
planets_df

|   | A | B | C |
|---|---|---|---|
| 0 | Mercury | 1 | 0 |
| 1 | Venus | 2 | 0 |
| 2 | Earth | 3 | 1 |
| 3 | Mars | 4 | 2 |
| 4 | Jupiter | 5 | 79 |
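The headers `A`, `B`, and `C` are not very descriptive. As a sketch, we could rename them and then pull out a single column or cell. The names `Planet`, `Order`, and `Moons` are borrowed from the planets table and are one possible choice.

```python
import pandas as pd

planets_df = pd.DataFrame({'A': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
                           'B': [1, 2, 3, 4, 5],
                           'C': [0, 0, 1, 2, 79]})

# Give each column a more meaningful header.
planets_df = planets_df.rename(columns={'A': 'Planet', 'B': 'Order', 'C': 'Moons'})

print(planets_df.shape)                     # (5, 3), five rows and three columns
print(type(planets_df['Moons']).__name__)   # Series, one column is a Series
print(planets_df.loc[2, 'Planet'])          # Earth, row label 2 in column 'Planet'
```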


## Reading in Data from a .csv File

Rarely will you have to make ndarrays or DataFrames from scratch. Usually, the data is already packaged up in a file. The most common file format for raw data is a `.csv` file, short for "comma-separated values".

<img src="../images/planets_csv.png" width = 200>

In [18]:
planets = pd.read_csv('../data/planets.csv')
planets

|   | Planet | Order | Moons |
|---|---|---|---|
| 0 | Mercury | 1 | 0 |
| 1 | Venus | 2 | 0 |
| 2 | Earth | 3 | 1 |
| 3 | Mars | 4 | 2 |
| 4 | Jupiter | 5 | 79 |
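If you do not have `planets.csv` on hand, the sketch below reads the same kind of data from a string instead of a file. The CSV text is an assumption about what the file contains, reconstructed from the table above; `io.StringIO` simply lets `read_csv` treat a string as if it were a file.

```python
import io
import pandas as pd

# Assumed file contents: a header row, then one row per planet.
csv_text = """Planet,Order,Moons
Mercury,1,0
Venus,2,0
Earth,3,1
Mars,4,2
Jupiter,5,79
"""

planets = pd.read_csv(io.StringIO(csv_text))

print(list(planets.columns))    # ['Planet', 'Order', 'Moons']
print(planets.shape)            # (5, 3)
print(planets['Moons'].max())   # 79, the numbers are read in as integers
```

Unlike the 2D NumPy array earlier, each column keeps its own data type, so the `Moons` column can be used for arithmetic right away.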
