# Reading Data

* Python offers many ways to read data from external files.
* Python supports almost any type of file you can think of, from simple text files to complex binary formats.
* In this class we are mainly going to use the package **`pandas`** to load external files into `DataFrames`.

In [0]:
import pandas as pd
import numpy as np

##### Let us read-in the file: `./Data/Planets.csv`

```
Name,a,
Mercury,0.3871,0.2056
Earth,0.9991,0.0166
Jupiter,5.2016,0.0490
Neptune,29.9769,0.0088
```

In [0]:
planet_table = pd.read_csv('./Data/Planets.csv')

In [0]:
planet_table
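The header line of `Planets.csv` ends with a trailing comma, so the third column has no name. The sketch below rebuilds the same file contents in memory (`csv_text` and `demo_table` are illustrative names, not part of the class files) to show the placeholder label that `pandas` assigns.

```python
import io
import pandas as pd

# Same contents as ./Data/Planets.csv, held in memory so the sketch is self-contained
csv_text = """Name,a,
Mercury,0.3871,0.2056
Earth,0.9991,0.0166
Jupiter,5.2016,0.0490
Neptune,29.9769,0.0088
"""

demo_table = pd.read_csv(io.StringIO(csv_text))

# The unnamed third column is labeled by its position: 'Unnamed: 2'
print(list(demo_table.columns))
```

That placeholder label is the one a rename would need to target.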

## Renaming columns

In [0]:
planet_table.rename(columns={'Unnamed: 2': 'ecc'}, inplace=True)

planet_table

In [0]:
planet_table.loc[:, 'ecc']

In [0]:
planet_table['ecc']

### Sometimes you want just the data without the index (`.values`)

In [0]:
my_values = planet_table['ecc'].values

my_values

In [0]:
my_values[0:2]
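As a side note, the `pandas` documentation recommends `.to_numpy()` over `.values` for new code. A minimal sketch, using a small stand-in `Series` (`ecc` here is an illustrative name) rather than the real table:

```python
import numpy as np
import pandas as pd

# Stand-in for planet_table['ecc']
ecc = pd.Series([0.2056, 0.0166, 0.0490, 0.0088], name='ecc')

as_values = ecc.values      # the attribute used above
as_numpy = ecc.to_numpy()   # the method recommended by the pandas docs

# Both give a plain NumPy array with no index attached
print(type(as_numpy), as_numpy[0:2])
```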

## Adding a column - `insert`

`.insert(loc, column, value, allow_duplicates = False)`

#### perihelion distance [AU] = `semi_major axis * ( 1 - eccentricity )`

In [0]:
def find_perihelion(semi_major, eccentricity):
    result = semi_major * (1.0 - eccentricity)
    return result

#### Use `DataFrame` columns as arguments to the `find_perihelion` function

In [0]:
my_perihelion = find_perihelion(planet_table['a'], planet_table['ecc'])

In [0]:
my_perihelion
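The function works on whole columns because arithmetic on a `Series` is applied element by element. A small sketch with two planets typed in by hand (`a_demo` and `ecc_demo` are illustrative names), comparing one element against the same calculation on plain numbers:

```python
import pandas as pd

def find_perihelion(semi_major, eccentricity):
    return semi_major * (1.0 - eccentricity)

# Two-row stand-ins for the 'a' and 'ecc' columns
a_demo = pd.Series([0.3871, 0.9991])
ecc_demo = pd.Series([0.2056, 0.0166])

# One call handles every row at once
peri_demo = find_perihelion(a_demo, ecc_demo)

# The first element matches the scalar calculation for Mercury
mercury_by_hand = find_perihelion(0.3871, 0.2056)
print(peri_demo[0], mercury_by_hand)
```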

In [0]:
# Add column in position 1 (2nd column)

planet_table.insert(1, 'Perihelion', my_perihelion, allow_duplicates = False)

In [0]:
planet_table
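With `allow_duplicates = False` (the default), `.insert` refuses to add a column whose name already exists, so running the insert cell twice raises an error. A minimal sketch on a small stand-in table (`demo` is an illustrative name):

```python
import pandas as pd

demo = pd.DataFrame({'Name': ['Mercury', 'Earth'], 'a': [0.3871, 0.9991]})

# Position 1 puts the new column between 'Name' and 'a'
demo.insert(1, 'ecc', [0.2056, 0.0166])

# A second insert with the same name is rejected with a ValueError
duplicate_rejected = False
try:
    demo.insert(1, 'ecc', [0.0, 0.0])
except ValueError:
    duplicate_rejected = True

print(list(demo.columns), duplicate_rejected)
```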

## Removing a column - `drop`

In [0]:
planet_table.drop(columns='Perihelion', inplace = True)

In [0]:
planet_table

## Adding a column (quick) - always to the end of the table

In [0]:
planet_table['Perihelion'] = my_perihelion

In [0]:
planet_table

## Rearranging columns

In [0]:
planet_table.columns

In [0]:
my_new_order = ['a', 'Perihelion', 'Name', 'ecc']

In [0]:
planet_table = planet_table[my_new_order]

In [0]:
planet_table

## Adding a row - `.append`

* The new row has to be a `dictionary` or another `DataFrame`
* You almost always need to use: `ignore_index=True`

In [0]:
my_new_row = {'Name': 'Venus', 'a': 0.723, 'ecc': 0.007}

In [0]:
my_new_row

In [0]:
planet_table.append(my_new_row, ignore_index=True)

In [0]:
planet_table

In [0]:
planet_table = planet_table.append(my_new_row, ignore_index=True)

In [0]:
planet_table

#### `NaN` = Not a Number, the marker `pandas` uses for missing values
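Note that `DataFrame.append` was deprecated in `pandas` 1.4 and removed in `pandas` 2.0, so on a recent installation the `.append` cells raise an `AttributeError`. One possible replacement is `pd.concat`, sketched here on a small stand-in table (`demo_table` and its rounded values are illustrative):

```python
import pandas as pd

demo_table = pd.DataFrame({
    'Name': ['Mercury', 'Earth'],
    'a': [0.3871, 0.9991],
    'ecc': [0.2056, 0.0166],
    'Perihelion': [0.3075, 0.9825],
})

my_new_row = {'Name': 'Venus', 'a': 0.723, 'ecc': 0.007}

# Wrap the dictionary in a one-row DataFrame, then stack the two tables
demo_table = pd.concat([demo_table, pd.DataFrame([my_new_row])], ignore_index=True)

# The new row has no 'Perihelion' entry, so that cell becomes NaN
print(demo_table)
```

As with `.append`, `pd.concat` returns a new table, so the result has to be assigned back to a variable.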

----

# Reading (bad) Data

## Different Delimiters

Because some people just want to watch the world burn, they create datasets where the columns are separated by something other than a comma.

#### Bad - Using another delimiter like `:`

##### `./Data/Planets_Ver2.txt`

```
Name:a:
Mercury:0.3871:0.2056
Earth:0.9991:0.0166
Jupiter:5.2016:0.0490
Neptune:29.9769:0.0088
```

In [0]:
planet_table_2 = pd.read_csv('./Data/Planets_Ver2.txt', delimiter = ":")

In [0]:
planet_table_2

#### Worse - Using whitespace as a delimiter

##### `./Data/Planets_Ver3.txt`

```
Name a 
Mercury 0.3871 0.2056
Earth 0.9991 0.0166
Jupiter 5.2016 0.0490
Neptune 29.9769 0.0088
```

In [0]:
planet_table_3 = pd.read_csv('./Data/Planets_Ver3.txt', delimiter = " ")

In [0]:
planet_table_3

#### WORST! - Using inconsistent whitespace as a delimiter!

##### `./Data/Planets_Ver4.txt`

```
 Name   a 
    Mercury 0.3871  0.2056
 Earth 0.9991   0.0166
     Jupiter 5.2016  0.0490
 Neptune    29.9769    0.0088
```

In [0]:
planet_table_4 = pd.read_csv('./Data/Planets_Ver4.txt', delimiter = " ", skipinitialspace=True)

In [0]:
planet_table_4
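Another option for inconsistent whitespace is a regular-expression separator, `sep=r'\s+'`, which treats any run of spaces or tabs as a single delimiter. The sketch below reads the same data rows from memory. Skipping the header line and supplying the column names by hand (including `'ecc'`) are choices made for this illustration, since the header names only two of the three columns.

```python
import io
import pandas as pd

# Same rows as ./Data/Planets_Ver4.txt, held in memory
messy_text = """ Name   a
    Mercury 0.3871  0.2056
 Earth 0.9991   0.0166
     Jupiter 5.2016  0.0490
 Neptune    29.9769    0.0088
"""

# Skip the incomplete header line and name all three columns explicitly
demo_table_4 = pd.read_csv(io.StringIO(messy_text), sep=r'\s+',
                           skiprows=1, header=None,
                           names=['Name', 'a', 'ecc'])

print(demo_table_4)
```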

---

# Messy Data

* `pandas` is a good choice when working with messy data files.
* In the "real world" all data is messy.

##### Let us read-in the file: `./Data/Mess.csv`

```
#######################################################
#
# Col 1 - Name
# Col 2 - Size (km)
#
#######################################################
"Sample 1",10
"",23
,
"Another Sample",
```

### This is not going to end well ... (errors galore!)

In [0]:
messy_table = pd.read_csv('./Data/Mess.csv')

### Tell `pandas` about the comments:

In [0]:
messy_table = pd.read_csv('./Data/Mess.csv', comment = "#")

messy_table

## Not quite correct ...

### Turn off the header

In [0]:
messy_table = pd.read_csv('./Data/Mess.csv', comment = "#", header= None)

messy_table

### Add the column names

In [0]:
my_column_name = ['Name', 'Size']

messy_table = pd.read_csv('./Data/Mess.csv', comment = "#", header= None, names = my_column_name)

messy_table

### Deal with the missing data with `.fillna()`

In [0]:
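# Assign the result back; inplace=True on a selected column is deprecated in recent pandas
messy_table['Name'] = messy_table['Name'].fillna("unknown")
messy_table['Size'] = messy_table['Size'].fillna(999.0)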

messy_table
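`.fillna()` also accepts a dictionary that maps column names to fill values, which handles several columns in one call. A minimal sketch, reading a shortened copy of `Mess.csv` from memory (`mess_text` and `demo_mess` are illustrative names):

```python
import io
import pandas as pd

# Shortened copy of ./Data/Mess.csv, held in memory
mess_text = '''# Col 1 - Name
# Col 2 - Size (km)
"Sample 1",10
"",23
,
"Another Sample",
'''

demo_mess = pd.read_csv(io.StringIO(mess_text), comment='#',
                        header=None, names=['Name', 'Size'])

# One fill value per column, applied in a single call
demo_mess = demo_mess.fillna({'Name': 'unknown', 'Size': 999.0})

print(demo_mess)
```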

----

# Lots of Data

* `pandas` will cut off the display of really long tables
* You can change this with `pd.set_option('display.max_rows', n)`, where `n` is the number of rows to show

In [0]:
star_table = pd.read_csv('./Data/NamedStars.csv')

In [0]:
star_table

In [0]:
pd.set_option('display.max_rows', 70)

In [0]:
star_table
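`pd.set_option` changes the display for the rest of the session. To change it only temporarily, `pd.option_context` can be used in a `with` block. A small sketch, with a generated 100-row table (`long_table`) standing in for `star_table`:

```python
import pandas as pd

# Stand-in for a long table such as star_table
long_table = pd.DataFrame({'n': range(100)})

before = pd.get_option('display.max_rows')

# The setting applies only inside the with block
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')
    print(long_table)

# Afterwards the previous value is restored
after = pd.get_option('display.max_rows')
print(before, inside, after)
```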