# Data Types and Data Structures

<center><img src = '../imgs/supercomputer_lungs.jpg' width = 500></center>

Doing data science means working with data, so it is critical to understand the different "flavors" of data and how data is organized in the computer's memory.

## Data Types

Notice `Hello, world!` is in quotes `" "`. This is because the phrase `Hello, world!` is a specific **data type** known as a **string** (text) and strings must be in quotes. By the way, single quotes `' '` can be used as well. They are the same as double quotes `" "` in Python. 

**What happens if you remove the quotes?**

In [2]:
print("Hello, world!")

Hello, world!
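As a quick check of the claim about quotes, the sketch below writes the same text with each quote style and compares the results. The variable names are made up for illustration.

```python
# The same text written with double quotes and with single quotes
double_quoted = "Hello, world!"
single_quoted = 'Hello, world!'

# Both values are strings, and Python treats them as equal
print(type(double_quoted))
print(double_quoted == single_quoted)
```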


There are many data types, but the following are the most common:

|data type| Description|Example|
|--------|-------------|--------|
|int|integer|`5`|
|float|number with a decimal|`3.14`|
|str|string of characters|`"What?"`|
|list|ordered collection|`["eggs", "milk", "juice"]`|

Data types are important because functions only accept arguments with certain data types. 

In [3]:
round("5.3")

TypeError: type str doesn't define __round__ method
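One way to avoid this error, sketched below, is to convert the string to a number with the built-in `float()` function before rounding. The variable names are illustrative only.

```python
text_value = "5.3"                 # a string, not a number
number_value = float(text_value)   # convert the string to a float

rounded = round(number_value)      # round() accepts a float
print(rounded)
```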

A useful function is `type()`, which tells you the data type of a value or of the data stored in a variable.

In [18]:
type("5.3")

str
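To see `type()` at work on each row of the table above, the following sketch loops over one example value per data type. The list name `examples` is an assumption made for this illustration.

```python
# One example value for each data type in the table
examples = [5, 3.14, "What?", ["eggs", "milk", "juice"]]

# Print each value next to the data type that type() reports
for value in examples:
    print(value, type(value))

# Collect just the type names for easy comparison
type_names = [type(value).__name__ for value in examples]
print(type_names)
```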

## Data Structures

Lists are an example of **structured data**: several pieces of data organized together. Lists are **ordered**, meaning each item is labeled with an **index**, a position number that starts at 0.

In [19]:
grocery_list = ["eggs", "milk", "juice"]   

**Print "milk" from `grocery_list`**

In [20]:
print(grocery_list[1])

milk
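"milk" sits at index 1 because Python counts positions starting from 0. The short sketch below shows a few other ways to use the index; the variable names are illustrative.

```python
grocery_list = ["eggs", "milk", "juice"]

first_item = grocery_list[0]     # index 0 is the first item
last_item = grocery_list[-1]     # negative indexes count from the end
position_of_milk = grocery_list.index("milk")  # look up an item's index

print(first_item, last_item, position_of_milk)
```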


### Arrays

Arrays are a special type of list that contains only one data type.

**Which list is an array?**

In [22]:
planet_list = ['mercury', 'venus', 'earth', 'mars', 'jupiter']

age_list = [3, 8, 12.0, 38, 40]

number_list = [4, 'five', 'ten', 32]

In [26]:
type(planet_list)

list
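Note that `type()` reports `list` no matter what the list contains. One way to check whether a list holds a single data type is sketched below. The helper `has_single_type` is hypothetical (it is not a built-in function) and it takes the definition of an array given above literally.

```python
def has_single_type(items):
    """Return True if every item in the list has the same data type."""
    return len({type(item) for item in items}) == 1

planet_list = ['mercury', 'venus', 'earth', 'mars', 'jupiter']
age_list = [3, 8, 12.0, 38, 40]
number_list = [4, 'five', 'ten', 32]

# Check each list in turn
for name, items in [("planet_list", planet_list),
                    ("age_list", age_list),
                    ("number_list", number_list)]:
    print(name, has_single_type(items))
```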

## Numpy

NumPy is a fundamental library for scientific computing. Its basic building block is the **ndarray**, or "n-dimensional array".

A common practice is to `import <library name> as <abbreviated name>`. This defines an abbreviation that we can use whenever we call a function from the library, instead of typing out the library's full name every time. The abbreviation is up to you, but most libraries have an agreed-upon abbreviation. For example, most people abbreviate NumPy as `np`.

**Import the Numpy library**:

In [12]:
import numpy as np
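Once the abbreviation is defined, every function from the library is called through it. As a small illustration (the list `values` is made up for this example):

```python
import numpy as np

values = [1, 2, 3, 4]

# np.mean is NumPy's mean function, reached through the np abbreviation
average = np.mean(values)
print(average)
```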

#### 1D Arrays

A list with one type of data is a **1D array**. However, for NumPy to treat it as an array data type, we need to pass it to NumPy's `array()` function:

In [24]:
planet_array = np.array(planet_list)
type(planet_array)

numpy.ndarray
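A NumPy array also records which single data type it holds in its `dtype` attribute. As a sketch, converting `age_list` from earlier (which mixes integers with one float) shows NumPy storing every value as a float:

```python
import numpy as np

age_list = [3, 8, 12.0, 38, 40]   # mostly integers, plus one float
age_array = np.array(age_list)

# The whole array shares one data type
print(age_array.dtype)
print(age_array)
```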

#### 2D Arrays

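A 2D array can be thought of as a list of lists, where each inner list is a row. A NumPy array stores only one data type, so if the rows contain different types, NumPy converts every value to a common type. In the example below, the numbers are converted to strings.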

**Let's put the following data into a Numpy 2D array:**

<img src="../imgs/numpy_planets.png" width = 300>

In [14]:
numpy_planets = np.array([['mercury', 'venus', 'earth', 'mars', 'jupiter'], 
                    [1,2,3,4,5], 
                    [0,0,1,2,79]])
print(numpy_planets)

[['mercury' 'venus' 'earth' 'mars' 'jupiter']
 ['1' '2' '3' '4' '5']
 ['0' '0' '1' '2' '79']]


In [15]:
type(numpy_planets)

numpy.ndarray
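The quotes around the numbers in the printed output show that NumPy stored them as strings. The sketch below, which rebuilds the same array so it can run on its own, looks at the array's shape, picks out one item by row and column, and converts the last row back to integers. The name `moons` is an assumption based on the dataset shown later in this section.

```python
import numpy as np

numpy_planets = np.array([['mercury', 'venus', 'earth', 'mars', 'jupiter'],
                          [1, 2, 3, 4, 5],
                          [0, 0, 1, 2, 79]])

print(numpy_planets.shape)    # (number of rows, number of columns)
print(numpy_planets[0, 2])    # row 0, column 2

# The last row holds strings, so convert it before doing arithmetic
moons = numpy_planets[2].astype(int)
print(moons.sum())
```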

## Pandas

Pandas is a data analysis and manipulation tool. It has a lot of similarities to NumPy, but we tend to use it more for data in tables. 

A 1D array in Pandas is called a `Series`.

In [3]:
import pandas as pd

pandas_planets = pd.Series(['mercury', 'venus', 'earth', 'mars', 'jupiter'])

type(pandas_planets)

pandas.core.series.Series
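Like a list, a `Series` labels each item with an index. A minimal sketch of looking at that index, selecting one item by position, and applying a string method to every item at once:

```python
import pandas as pd

pandas_planets = pd.Series(['mercury', 'venus', 'earth', 'mars', 'jupiter'])

# By default the index numbers the items starting from 0
print(pandas_planets.index)

# .iloc selects by position
third_planet = pandas_planets.iloc[2]
print(third_planet)

# .str applies a string method to every item in the Series
capitalized = pandas_planets.str.capitalize()
print(capitalized)
```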

### Pandas DataFrames

A 2D array in Pandas is called a `DataFrame`. 

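One way to create a `DataFrame` is to pass another structured data type, known as a dictionary (`dict`), to the `DataFrame` function. Each key becomes a column name and each list becomes that column's values.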

In [27]:
pandas_planets = {'A': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
        'B': [1, 2, 3, 4, 5], 'C': [0, 0, 1, 2, 79]}

In [28]:
planets_df = pd.DataFrame(pandas_planets)
planets_df

| |A|B|C|
|--------|-------------|--------|--------|
|**0**|Mercury|1|0|
|**1**|Venus|2|0|
|**2**|Earth|3|1|
|**3**|Mars|4|2|
|**4**|Jupiter|5|79|
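The column names `A`, `B`, and `C` are not very descriptive. As a sketch, the columns could be renamed to match the dataset used later in this section and then selected by name. The names `renamed_df` and `moons` are illustrative.

```python
import pandas as pd

planets_df = pd.DataFrame({'A': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
                           'B': [1, 2, 3, 4, 5],
                           'C': [0, 0, 1, 2, 79]})

# rename() returns a new DataFrame with the new column names
renamed_df = planets_df.rename(columns={'A': 'Planet', 'B': 'Order', 'C': 'Moons'})

# Selecting a single column by name gives back a Series
moons = renamed_df['Moons']
print(type(moons))
print(moons.max())
```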


### Numpy 2D Array vs. Pandas Dataframe

**Numpy 2D Array**

<img src="../imgs/numpy_planets.png" width = 300>

**Pandas Dataframe**

<img src="../imgs/pandas_planets.png" width = 300>
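Besides the row and column labels, one practical difference is how data types are handled. The sketch below builds both structures from the same three lists (the list names are illustrative) and prints their data types: the NumPy array has a single type for everything, while the DataFrame keeps a separate type for each column.

```python
import numpy as np
import pandas as pd

names = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter']
order = [1, 2, 3, 4, 5]
moons = [0, 0, 1, 2, 79]

numpy_planets = np.array([names, order, moons])
planets_df = pd.DataFrame({'A': names, 'B': order, 'C': moons})

# One data type for the whole array (the numbers became strings)
print(numpy_planets.dtype)

# One data type per column (the numbers stay numbers)
print(planets_df.dtypes)
```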

## Reading in Data from a .csv File

We will primarily use Pandas dataframes, but NumPy will be a valuable library as well. Fortunately, we don't need to build our 2D arrays and dataframes from scratch as we did above. Usually, we will bring in data from an existing file (also known as a **dataset**). One of the most common file formats for raw data is the `.csv` file, short for "comma-separated values", and Pandas makes it easy to read one into a dataframe:

In [7]:
planets = pd.read_csv('../data/planets.csv')
planets

| |Planet|Order|Moons|
|--------|-------------|--------|--------|
|**0**|Mercury|1|0|
|**1**|Venus|2|0|
|**2**|Earth|3|1|
|**3**|Mars|4|2|
|**4**|Jupiter|5|79|
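To make the `.csv` format itself concrete, the sketch below reads the same kind of data from a string instead of a file. The text in `csv_text` is an assumption about what a file like `planets.csv` might contain, reconstructed from the output shown here; the first line holds the column names and each following line is one row.

```python
import io
import pandas as pd

# Assumed file contents: values separated by commas, one row per line
csv_text = """Planet,Order,Moons
Mercury,1,0
Venus,2,0
Earth,3,1
Mars,4,2
Jupiter,5,79
"""

# io.StringIO lets read_csv treat the string as if it were a file
planets = pd.read_csv(io.StringIO(csv_text))
print(planets)
```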


A Pandas dataframe should look familiar: it is how data is laid out in spreadsheets like Excel and Google Sheets. The bold numbers on the left are known as the **index** and number the rows. The column names let us label and reference the individual columns, each of which is a 1D array (a `Series`).
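As a brief sketch of how the index and the column names work together, `.loc` selects a row by its index label and, optionally, a column by its name. The dataframe is rebuilt here from a dictionary so the example runs without the file.

```python
import pandas as pd

planets = pd.DataFrame({'Planet': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
                        'Order': [1, 2, 3, 4, 5],
                        'Moons': [0, 0, 1, 2, 79]})

# Select the whole row whose index label is 2
earth_row = planets.loc[2]
print(earth_row)

# Select a single value by index label and column name
earth_moons = planets.loc[2, 'Moons']
print(earth_moons)
```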