# Reading Data

* Python has many different ways to read data from external files.
* Between its standard library and third-party packages, Python can read almost any type of file you can think of, from simple text files to complex binary formats.
* In this class we are going to mainly use the package **`pandas`** to load external files.
* `pandas` loads data into an object called a `DataFrame` - think of it as a table.
* `DataFrames` are very useful, since there are lots of built-in methods that allow us to easily manipulate the data.

---

# The `pandas` package - `DataFrame`

In [0]:
import pandas as pd

##### Let us read in the file: `./Data/Planets.csv`

```
Name,a,
Mercury,0.3871,0.2056
Earth,0.9991,0.0166
Jupiter,5.2016,0.0490
Neptune,29.9769,0.0088
```

In [0]:
planet_table = pd.read_csv('./Data/Planets.csv')

In [0]:
planet_table

In [0]:
print(planet_table)

### Notice that each row has an `index` assigned to it.

## Renaming columns

In [0]:
planet_table.rename(columns={'Unnamed: 2': 'ecc'}, inplace=True)

planet_table
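
Where does `'Unnamed: 2'` come from? The header line `Name,a,` ends with a comma, so the third column has no name and `pandas` fills one in. Here is a minimal sketch of that, assuming an in-memory copy of the file text (`io.StringIO` stands in for `./Data/Planets.csv`, and `demo_table` is an illustrative name):

```python
import io
import pandas as pd

# Same text as Planets.csv, held in memory instead of read from disk
csv_text = """Name,a,
Mercury,0.3871,0.2056
Earth,0.9991,0.0166
Jupiter,5.2016,0.0490
Neptune,29.9769,0.0088
"""

demo_table = pd.read_csv(io.StringIO(csv_text))
print(list(demo_table.columns))   # the third name is filled in by pandas

# rename() returns a new table by default; assigning it back avoids inplace=True
demo_table = demo_table.rename(columns={'Unnamed: 2': 'ecc'})
print(list(demo_table.columns))
```

Assigning the result back does the same job as `inplace=True`; which one to use is mostly a matter of taste.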

In [0]:
planet_table['Name']

### Table columns are `Series`, which behave much like arrays

In [0]:
planet_table['Name'][0:2]

In [0]:
planet_table['Name'].size

In [0]:
planet_table['a'].mean()

In [0]:
planet_table['ecc'].sum()

### Sometimes you want just the data without the index (`.values`)

In [0]:
my_names = planet_table['Name'].values

my_names

In [0]:
my_names[0:2]
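
The `pandas` documentation recommends `.to_numpy()` over `.values` for the same job. A short sketch, assuming a small hand-built table in place of `planet_table` (`demo_table`, `name_array` and `a_array` are illustrative names):

```python
import pandas as pd

# Small hand-built table standing in for planet_table
demo_table = pd.DataFrame({'Name': ['Mercury', 'Earth', 'Jupiter', 'Neptune'],
                           'a': [0.3871, 0.9991, 5.2016, 29.9769]})

# to_numpy() returns a plain NumPy array with no index attached
name_array = demo_table['Name'].to_numpy()
a_array = demo_table['a'].to_numpy()

print(type(a_array))
print(name_array[0:2])
```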

## Sorting

In [0]:
planet_table.sort_values(['ecc'])

In [0]:
planet_table.sort_values(['ecc'], ascending=False)

#### The original table is unchanged

In [0]:
planet_table

#### To keep the sorted table, assign it to a new variable:

In [0]:
sorted_table = planet_table.sort_values(['ecc'])

In [0]:
sorted_table

### Notice that the rows kept their original index labels; the index has **NOT** been renumbered!

In [0]:
sorted_table['Name'][0]

### You can fix this by resetting the index

In [0]:
sorted_table = planet_table.sort_values(['ecc']).reset_index(drop=True)

In [0]:
sorted_table

In [0]:
sorted_table['Name'][0]
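
If you only need the first row of a sorted table, you can also ask for it by position instead of by label. A minimal sketch, assuming a hand-built table in place of `planet_table` (all names here are illustrative):

```python
import pandas as pd

demo_table = pd.DataFrame({'Name': ['Mercury', 'Earth', 'Jupiter', 'Neptune'],
                           'ecc': [0.2056, 0.0166, 0.0490, 0.0088]})

demo_sorted = demo_table.sort_values('ecc')   # index labels travel with the rows

first_by_label = demo_sorted['Name'].loc[0]      # row whose index label is 0
first_by_position = demo_sorted['Name'].iloc[0]  # row that is physically first

print(first_by_label)     # Mercury
print(first_by_position)  # Neptune
```

`.loc` looks up the label, `.iloc` counts rows from the top, so the two only agree when the index is in its default order.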

## Masking

In [0]:
mask1 = planet_table['a'] > 5

mask1

In [0]:
planet_table[mask1]

In [0]:
planet_table[mask1]['a']

In [0]:
mask2 = ((planet_table['a'] > 5) &
         (planet_table['ecc'] < 0.04))

planet_table[mask2]

In [0]:
masked_table = planet_table[mask2].reset_index(drop=True)
masked_table
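
A mask and a column name can also be combined in a single step with `.loc[rows, column]`. A small sketch, assuming a hand-built copy of the planet data (`demo_table`, `outer_a` and `selected` are illustrative names):

```python
import pandas as pd

demo_table = pd.DataFrame({'Name': ['Mercury', 'Earth', 'Jupiter', 'Neptune'],
                           'a': [0.3871, 0.9991, 5.2016, 29.9769],
                           'ecc': [0.2056, 0.0166, 0.0490, 0.0088]})

# Rows where a > 5, and only the 'a' column, in one selection
outer_a = demo_table.loc[demo_table['a'] > 5, 'a']

# Each condition needs its own parentheses when combined with & (and) or | (or)
both = (demo_table['a'] > 5) & (demo_table['ecc'] < 0.04)
selected = demo_table.loc[both, 'Name']

print(outer_a)
print(selected)
```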

## Adding a column to the Table
* Perihelion distance $= a(1-e)$, where $a$ is the semi-major axis and $e$ is the eccentricity (`ecc`)

In [0]:
perihelion = planet_table['a'] * (1.0 - planet_table['ecc'])

In [0]:
perihelion

In [0]:
planet_table['Perihelion'] = perihelion

In [0]:
planet_table
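
The column arithmetic works row by row. As a check, here is a sketch that computes the perihelion for two planets and compares the first with the same formula done by hand (the table is a hand-built stand-in, and `demo_table` and `by_hand` are illustrative names):

```python
import pandas as pd

demo_table = pd.DataFrame({'Name': ['Mercury', 'Earth'],
                           'a': [0.3871, 0.9991],
                           'ecc': [0.2056, 0.0166]})

# One multiplication per row: a * (1 - ecc)
demo_table['Perihelion'] = demo_table['a'] * (1.0 - demo_table['ecc'])

# The same formula for Mercury, written out with plain numbers
by_hand = 0.3871 * (1.0 - 0.2056)

print(demo_table)
print(by_hand)
```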

## Saving a table

In [0]:
planet_table.to_csv('./Data/NewPlanets2.csv', index=False)
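
What does `index=False` do? When `to_csv()` is called without a file name it returns the text it would have written, which makes the difference easy to see. A minimal sketch with a hand-built table (all names are illustrative):

```python
import pandas as pd

demo_table = pd.DataFrame({'Name': ['Mercury', 'Earth'],
                           'a': [0.3871, 0.9991]})

without_index = demo_table.to_csv(index=False)  # only the columns
with_index = demo_table.to_csv()                # default: index written as an extra first column

header_without = without_index.splitlines()[0]
header_with = with_index.splitlines()[0]

print(header_without)  # Name,a
print(header_with)     # ,Name,a
```

Leaving the index in the file means that reading it back gives you an extra unnamed column.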

## Different Delimiters

Because some people just want to see the world burn, they create datasets where the columns are separated by something other than a comma ","

##### `./Data/Planets_Ver2.txt`

```
Name:a:
Mercury:0.3871:0.2056
Earth:0.9991:0.0166
Jupiter:5.2016:0.0490
Neptune:29.9769:0.0088
```

In [0]:
planet_table_2 = pd.read_csv('./Data/Planets_Ver2.txt', delimiter = ":")

In [0]:
planet_table_2

##### `./Data/Planets_Ver3.txt`

```
Name a 
Mercury 0.3871 0.2056
Earth 0.9991 0.0166
Jupiter 5.2016 0.0490
Neptune 29.9769 0.0088
```

In [0]:
planet_table_3 = pd.read_csv('./Data/Planets_Ver3.txt', delimiter = " ")

In [0]:
planet_table_3

##### `./Data/Planets_Ver4.txt`

```
 Name   a 
    Mercury 0.3871  0.2056
 Earth 0.9991   0.0166
     Jupiter 5.2016  0.0490
 Neptune    29.9769    0.0088
```

In [0]:
planet_table_4 = pd.read_csv('./Data/Planets_Ver4.txt', delimiter = " ", skipinitialspace=True)

In [0]:
planet_table_4
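
Another option for ragged, space-padded files is the regular expression `sep=r"\s+"`, which treats any run of whitespace as a single delimiter. A sketch under one assumption: unlike the files above, the header in this in-memory sample names all three columns (the `ecc` name is added here for illustration):

```python
import io
import pandas as pd

# Ragged text in the style of Planets_Ver4.txt.
# Assumption: the header names all three columns ('ecc' is added for this sketch).
ragged_text = """ Name   a   ecc
    Mercury 0.3871  0.2056
 Earth 0.9991   0.0166
     Jupiter 5.2016  0.0490
"""

# sep=r"\s+" splits on one or more whitespace characters
ragged_table = pd.read_csv(io.StringIO(ragged_text), sep=r"\s+")

print(ragged_table)
```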

---

# Messy Data

* `pandas` is a good choice when working with messy data files.
* In the "real world" all data is messy.

##### Let us read in the file: `./Data/Mess.csv`

```
#######################################################
#
# Col 1 - Name
# Col 2 - Size (km)
#
#######################################################
"Sample 1",10
"",23
,
"Another Sample",
```

### This is not going to end well ...

In [0]:
messy_table = pd.read_csv('./Data/Mess.csv')

### Tell `pandas` about the comments:

In [0]:
messy_table = pd.read_csv('./Data/Mess.csv', comment = "#")

messy_table

#### `NaN` = Not a Number, the value `pandas` uses to mark missing data

## Not quite correct ...

### Turn off the header

In [0]:
messy_table = pd.read_csv('./Data/Mess.csv', comment = "#", header= None)

messy_table

### Add the column names

In [0]:
col_name = ["Name", "Size"]

messy_table = pd.read_csv('./Data/Mess.csv', comment = "#", header= None, names = col_name)

messy_table

### Deal with the missing data with `fillna()`

In [0]:
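# Assign the result back: fillna(..., inplace=True) on a selected column
# may not update the table in newer versions of pandas
messy_table['Name'] = messy_table['Name'].fillna("unknown")
messy_table['Size'] = messy_table['Size'].fillna(999.0)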

messy_table
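
The whole clean-up can also be written in one pass, since `fillna()` accepts a dictionary with one fill value per column. A sketch, assuming an in-memory copy of the messy text (`io.StringIO` stands in for `./Data/Mess.csv`, and `demo_messy` and `demo_filled` are illustrative names):

```python
import io
import pandas as pd

# Shortened copy of Mess.csv, held in memory instead of read from disk
messy_text = """# Col 1 - Name
# Col 2 - Size (km)
"Sample 1",10
"",23
,
"Another Sample",
"""

demo_messy = pd.read_csv(io.StringIO(messy_text), comment="#",
                         header=None, names=["Name", "Size"])

# One fill value per column; fillna() returns a new table
demo_filled = demo_messy.fillna({"Name": "unknown", "Size": 999.0})

print(demo_filled)
```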

----

# Lots of Data

* `pandas` will cut off the display of really long tables
* You can change this with `pd.set_option('display.max_rows', N)`, where `N` is the number of rows to show

In [0]:
star_table = pd.read_csv('./Data/NamedStars.csv')

In [0]:
star_table

In [0]:
pd.set_option('display.max_rows', 70)

In [0]:
star_table
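
`pd.set_option` changes the display for the rest of the session. To change it for just one block of code, `pd.option_context` can be used instead. A minimal sketch (the variable names are illustrative):

```python
import pandas as pd

rows_before = pd.get_option('display.max_rows')

# The setting applies only inside the with-block
with pd.option_context('display.max_rows', 70):
    rows_inside = pd.get_option('display.max_rows')

# Outside the block the previous setting is back
rows_after = pd.get_option('display.max_rows')

print(rows_before, rows_inside, rows_after)
```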