<div class="alert alert-block alert-info">
<b>TBA:</b> jcash will improve text here

    This is still a draft let me know where there is confusion.
</div>

# Tutorial 07: Basic Data Files in Python

## Overview 
- File input and output
  - Simple text formats
  - csv, xls
- Accessing data
  - Numpy array slices (data-sci/ 1-numpy)
  - Pandas dataframe (data-sci/ 4-pandas)

 
In data science, it is very common for your Python code to access a file containing the data you want to analyze. Like most programming languages, there are a variety of techniques available in Python to work with file input and output (I/O for short). 

Luckily, most astronomical data is saved in formats that are easier to work with once you understand the functions available. In addition to built-in functions for I/O, there are useful functions in both the NumPy and Pandas packages that we will explore in this tutorial. 


A later tutorial will work with several other common file formats, including FITS files. 




In [1]:
import numpy as np
import pandas as pd

## 1.0) Data file formats

Before you can choose the best way to import your data file, you need to know more about the format of that data file. 


### ASCII files

ASCII files are generic plain-text files that can be read or produced by most applications, including word processors, spreadsheets, and text editors. There are three common ASCII data formats: .DAT, .CSV, and .TXT.

ASCII files can be viewed very easily in text editors and web browsers. You will want to look at the file contents (at least the first few lines) to understand the data better. 

Things to look for:
* Can I view the text?
* Is there a common format on each line of the file?
* What separates one piece of information from the next (space, comma, tab)?

### Spreadsheet files

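Spreadsheet programs save data in their own file formats, such as `.xls` and `.ods`. Unlike ASCII files, these cannot be viewed directly in a text editor. Section 4.0 of this tutorial describes how to read them.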

### Other data formats

There are many other file formats that can be used for storing data, and complete coverage of them is beyond the scope of this tutorial. A later tutorial in this series covers a common astronomical data file type called a .fits file. 

### Data Examples

Throughout this tutorial, we will show specific examples of opening data files with the various techniques. For these tutorials, all data files are contained in a directory `data/` stored alongside the tutorial files. You will need to ensure these data files are downloaded/uploaded with the Jupyter notebooks. 

If you are working on a Jupyter notebook server such as Anaconda on the Cloud or the Rubin Science Platform, you should be able to view the data files in the Jupyter server. 

When we say `filename`, that is a string which contains both the path to the file and the name of the file. 

If you download the entire tutorial directory with the `data/` directory, you can reference individual files with the syntax `./data/filename`.
For example, the first file we will look at is named "the-zen-of-python.txt". In that case the full filename with path would be written as 
`"./data/the-zen-of-python.txt"`


<blockquote> 
    
    **Caution**
    
    Depending on how you are opening this tutorial or running this code in Python, the kernel will have different rules for how you must specify the pathname for the datafile you are accessing. You may need to change the path definition in the various cells which refer to the data files you should use.
    
</blockquote>
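
If you are unsure whether a path is right, one option is to test it before opening the file. The sketch below uses the standard-library `pathlib` module, which this tutorial does not otherwise use. The temporary directory is a hypothetical stand-in for the tutorial folder, created only so that the example runs anywhere.

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in for the tutorial directory (an assumption for this sketch).
base_dir = Path(tempfile.mkdtemp())
(base_dir / "data").mkdir()
(base_dir / "data" / "the-zen-of-python.txt").write_text("Beautiful is better than ugly.\n")

# Build the filename from the directory and the name of the file.
filename = base_dir / "data" / "the-zen-of-python.txt"
missing = base_dir / "data" / "no-such-file.txt"

# exists() reports whether the path points at something on disk.
print(filename.exists())
print(missing.exists())
```

If `exists()` reports `False` for a file you expect to be there, compare the path with the output of `os.getcwd()` and adjust it.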

## 2.0)  Unformatted text files

If a file contains ASCII text but there is no standard format line by line, then you will probably need to read each line of the file into string variables. 

Depending on what you need from the file, you may use a variety of string functions and conditional statements to extract that information. 

### 2.1) Built-in open function

Within Python, one of the built-in functions is `open()`.

- The parameter you pass to the function is the **filename**
- Optionally you can specify the mode
    - The default if you do not specify the mode is 'r' for read access
    - Other common options are: 'w' to write to a file and 'a' to append to an existing file
- The output is a file object (not the contents of the file)
    - You still need to use other functions to read or write to that file
 
The **file object** is an instance of a Python class with a variety of methods available. 
We will look at several of these to help you understand the options.
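
To make the three modes concrete, here is a minimal sketch that writes, appends to, and then reads back a small file. The file name `modes-demo.txt` and the temporary directory are assumptions made for this example; they are not part of the tutorial data.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "modes-demo.txt")

f = open(path, 'w')        # 'w' creates the file (or overwrites an existing one)
f.write("first line\n")
f.close()

f = open(path, 'a')        # 'a' adds to the end of the existing file
f.write("second line\n")
f.close()

f = open(path)             # no mode given, so the default 'r' is used
contents = f.read()
f.close()

print(contents)
```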

As we move forward, we will use shorter versions of these calls. 

In [2]:
#Uncomment the line below by removing the # symbol to see the full help information on this function.
#help(open)

In [3]:
#Be sure to identify the location where your code is looking for the file.

#This will print your current working directory.

import os
print(os.getcwd())

/home/jcash/GitHub/SCSU-PAARE-python-intro-tutorials


In [4]:
#Specifying the filename. 
filename = "./data/the-zen-of-python.txt"

print(type(filename))

<class 'str'>


In [5]:
#Opening a file to create a file object.
fileobj = open("./data/the-zen-of-python.txt",'r')
print(type(fileobj))

<class '_io.TextIOWrapper'>


#### Checking a file

When you first work with a data file, you may need to check to see if it is readable before moving forward. 

In general, you will already know what type of file you have and can skip this step. 

In [6]:
#Testing if a file is readable.
print(fileobj.readable())

True


#### Reading the full file

You can read in the full file into one big string using the `.read()` function.


In [7]:
fileobj = open(filename)
result = fileobj.read()
print(type(result))
print(len(result))
result

<class 'str'>
856


"The Zen of Python, by Tim Peters\n\nBeautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.\nFlat is better than nested.\nSparse is better than dense.\nReadability counts.\nSpecial cases aren't special enough to break the rules.\nAlthough practicality beats purity.\nErrors should never pass silently.\nUnless explicitly silenced.\nIn the face of ambiguity, refuse the temptation to guess.\nThere should be one-- and preferably only one --obvious way to do it.\nAlthough that way may not be obvious at first unless you're Dutch.\nNow is better than never.\nAlthough never is often better than *right* now.\nIf the implementation is hard to explain, it's a bad idea.\nIf the implementation is easy to explain, it may be a good idea.\nNamespaces are one honking great idea -- let's do more of those!"

#### Reading in the file line by line

The `readlines()` method will give you a list where each item in the list is a string containing one line of the file.

You can then do things with each line of the file by indexing the list and using string operations.

In [8]:
fileobj = open(filename)
lines = fileobj.readlines()

print(type(lines))

<class 'list'>


In [9]:
#Shows the number of lines in the file.
len(lines)

21

In [10]:
#Printing out a single line by indexing.
print(lines[4])

Simple is better than complex.



In [11]:
#Print a subsection of the lines.
print(lines[4:6])

['Simple is better than complex.\n', 'Complex is better than complicated.\n']


In [12]:
#Iterating over the list of lines to test for a substring.
for line in lines:
    if "by" in line:
        print(line)

The Zen of Python, by Tim Peters



#### Splitting the lines into words

If you need to separate each line of the file into individual words, you can use string splitting on the list of lines. 


In [13]:
fileobj = open(filename)          #open the fileobject
lines = fileobj.readlines()       #read the lines into a list of lines
words = []                        #create an empty list to hold the words
for line in lines:                #go line by line
    words.append(line.split(' ')) # split each line at the spaces

print(words[0:4])

[['The', 'Zen', 'of', 'Python,', 'by', 'Tim', 'Peters\n'], ['\n'], ['Beautiful', 'is', 'better', 'than', 'ugly.\n'], ['Explicit', 'is', 'better', 'than', 'implicit.\n']]
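
Notice that splitting at `' '` leaves the newline character attached to the last word of each line. As a small variation (not part of the original cell), calling `split()` with no argument splits on any run of whitespace and drops the newline. The three lines below are copied from the file so that the sketch runs on its own.

```python
# The first three lines of the file, typed in directly for this sketch.
lines = [
    "The Zen of Python, by Tim Peters\n",
    "\n",
    "Beautiful is better than ugly.\n",
]

words = []
for line in lines:
    # split() with no argument splits on whitespace and removes the trailing newline
    words.append(line.split())

print(words)
```

A blank line becomes an empty list here instead of `['\n']`, which can make it easier to skip.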


**closing a file**

Notice that in the above examples, we had to open the file each time. Technically we should be closing the file in between these open calls. 


In [14]:
#check to see if a file object is closed
if not fileobj.closed:
    print('it is still open')

it is still open


In [15]:
#the syntax to close a file
fileobj = open(filename)
lines = fileobj.readlines()
fileobj.close()

**with statements to open a file**

Using a `with` statement allows Python to open the file, execute a section of code, and then properly close the file object without an explicit `close` call. For this reason, it is often the preferred method. 

The order of the syntax is a little different, but it contains the same information in a more compact format.

Since the variable for the file object is only used inside the `with` statement, it is often shortened to just `f` (just make sure you have not already used that variable name for something else).



In [16]:
filename = "./data/the-zen-of-python.txt"
with open(filename, 'r') as f:
    lines = f.readlines()
    words =[]
    for line in lines:
        words.append(line.split(' '))
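
To confirm that the `with` statement really closes the file, you can check the `closed` attribute inside and after the block. This is a minimal sketch; the scratch file `with-demo.txt` is an assumption made so the example does not depend on the tutorial data.

```python
import os
import tempfile

# Hypothetical scratch file for this sketch.
filename = os.path.join(tempfile.mkdtemp(), "with-demo.txt")

# Write one line so there is something to read back.
with open(filename, 'w') as f:
    f.write("Readability counts.\n")

with open(filename) as f:
    lines = f.readlines()
    print(f.closed)    # inside the block the file is still open

print(f.closed)        # after the block the file has been closed for us
```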

### 2.2) Advantages of the open function

The advantage of using the built-in `open()` function in Python is that it will work for any ASCII text file. 

- The file can contain any number of lines, and each line can be any length.
- The lines can contain any type of information.
- You can treat each line any way you need to in order to extract the information you want.

### 2.3) Disadvantages of the open function

Although the `open()` function is very powerful, there are often more efficient ways to access the data in a file that has a well-ordered structure. 

You do need to know your data before using these other methods, but once you do, you can pick the best one from those described in the rest of this tutorial. 

## 3.0) Column formatted text files 

If the data file contains ASCII text with a standard format on each line, we have some more efficient ways to read in and work with the data.

In particular, in data science we often have columns of data with each row containing the same number of columns. 

Below are examples of a few files that we will now be using. We summarize the first few lines of each file as raw text here just to give you a view of each file. 

We could use the open file method described in the previous section. 

It will read in the lines and put the strings into a list. 

To use the values as numbers, we would still need to: 
- Strip off the newline character,
- Split the lines into strings,
- Convert each string into a number.

This works as shown below (without a detailed explanation of each step), but takes a lot of code to do everything we need.

In [17]:
filename = "./data/syn.txt" 
with open(filename) as f:
    lines = f.readlines()

data =[]
for line in lines:
    temp = line.rstrip('\n')
    vals = temp.split(' ')
    values = []
    for val in vals:
        values.append(float(val))
    data.append(values)
    
data[0:2]

[[0.048962338191566035, 12.070034199388704],
 [0.47187691702469947, 12.64081926075754]]

### 3.1) Numpy loadtxt

**If** the data file contains ASCII text made up of numbers organized into columns, a more efficient method is the NumPy function `np.loadtxt`.

The general syntax is `data = np.loadtxt(filename, delimiter = None, skiprows = 0)`. 

If you do not specify a delimiter, it will assume whitespace.

If you do not specify a number of rows to skip at the start of the file, it will start with the first line.

Full documentation is given at:
https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html

#### Files with just numbers and spaces

In [18]:
#Here is the code to read in the syn.txt file. 
filename = "./data/syn.txt"     #Set the path to the file.
data = np.loadtxt(filename)     #Read in the file contents to a numpy array, no special options needed.

data[0:2]

array([[ 0.04896234, 12.0700342 ],
       [ 0.47187692, 12.64081926]])

In [19]:
#Here we can examine information about the data.
print(type(data))
print(len(data))
print(np.shape(data))
print(type(data[0][0]))

<class 'numpy.ndarray'>
500
(500, 2)
<class 'numpy.float64'>


As you can see above, the data is immediately accessible as a NumPy 2D array, with just a single line of code to read in the data from the file and format it as numbers. 
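
Once the data is in a 2D array, rows and columns can be pulled out with slices. The sketch below is self-contained: instead of opening `syn.txt`, it reads the first two rows shown above from an in-memory `io.StringIO` object, which `np.loadtxt` accepts in place of a filename.

```python
import io
import numpy as np

# The first two rows of syn.txt, held in memory so the sketch runs without the file.
text = io.StringIO(
    "0.048962338191566035 12.070034199388704\n"
    "0.47187691702469947 12.64081926075754\n"
)
data = np.loadtxt(text)

first_row = data[0]          # all columns of the first row
first_column = data[:, 0]    # every row, first column
second_column = data[:, 1]   # every row, second column

print(first_row)
print(first_column)
print(second_column)
```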

#### Files with a header

A header is a line or lines at the top of the data file containing information about the data. This information is very useful in understanding the data, but we need to be careful in how we read in the file. 

For `np.loadtxt` you can skip reading in these rows using the `skiprows=` parameter. 
- The default value is 0, so no lines are skipped
- Otherwise, it should be an integer equal to the number of lines to skip before reading the data.
- Lines which start with the `#` symbol are considered comments and skipped automatically. If you also use `skiprows`, those comment lines are included in the count.
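
The difference between a commented header and a plain header can be sketched as follows. The two small data sets are made up for this illustration and are read from memory with `io.StringIO` rather than from the tutorial files.

```python
import io
import numpy as np

# Made-up data with a header line that starts with '#': treated as a comment.
text_commented = "# time value\n1.0 2.0\n3.0 4.0\n"
data_auto = np.loadtxt(io.StringIO(text_commented))

# The same data with a plain header line: it has to be skipped explicitly.
text_plain = "time value\n1.0 2.0\n3.0 4.0\n"
data_skip = np.loadtxt(io.StringIO(text_plain), skiprows=1)

print(data_auto)
print(data_skip)
```

Both calls give the same 2 by 2 array. Without `skiprows=1`, the second call would fail when it tries to convert the word `time` to a float.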

For our example files, 
- syn.txt had no header
- GCN25560.txt has a single line header


In [20]:
#Example that skips the header line.
filename = './data/GCN25560.txt'
data = np.loadtxt(filename,skiprows=1)
data[0:2]

array([[2.45872537e+06, 5.11600000e+01, 1.69300000e+01, 1.00000000e-02],
       [2.45872537e+06, 5.98000000e+01, 1.72100000e+01, 2.00000000e-02]])

#### Files that are comma separated

By default, `np.loadtxt` assumes that the separator between the data columns is whitespace. 
If a data file has commas separating the values in the columns, we can still use the same method but we have to specify the delimiter.

These comma-separated-values files are often given the extension of `.csv` but can also have `.txt` or `.dat` extensions.


In [21]:
#Here is a comma delimited example with one header row.
filename = './data/galaxies.txt'
data = np.loadtxt(filename, delimiter =',', skiprows =1)

data[0:2]

array([[1.0000000e+00, 1.3337110e+02, 5.7598427e+01, 3.9515216e-02],
       [2.0000000e+00, 1.3368567e+02, 5.7480250e+01, 4.1055806e-02]])

### 3.2) Using Numpy if data are not all numbers

NumPy is most efficient when working with numbers. Furthermore, a NumPy array can hold only one data type. By default, `np.loadtxt` assumes that the data can all be converted to float values. 

If even one value in the data file is a non-numeric string, `np.loadtxt` will give an error when it tries to convert that string into a float value. 

We can work around this by specifically telling NumPy to use a string data type when working with the file. 

Below are examples of using `np.loadtxt` with the Moons_and_planets.csv file.
- The first cell shows the correct syntax to use to get strings
- The second cell shows the error you will get without the data type

In [22]:
#This is the correct call
file = "./data/Moons_and_planets.csv"
data2 = np.loadtxt(file,dtype="str",delimiter=',',skiprows=1)

print(type(data2))
data2[0:2]

<class 'numpy.ndarray'>


array([['Moon', 'Earth', '1737.1'],
       ['Phobos', 'Mars', '11.1']], dtype='<U13')

In [23]:
file = "./data/Moons_and_planets.csv"
#This call gives a ValueError because the text columns cannot be converted to floats.
#Compare it with the previous cell, which passes dtype="str".

data2 = np.loadtxt(file,delimiter=',',skiprows=1)

ValueError: could not convert string 'Moon' to float64 at row 0, column 1.

When the string data type is used, the NumPy array is an array of string values: both the words and the numbers are left as strings.
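
Since the numbers come back as strings, a numeric column has to be converted before you can do any math with it. One possible approach, sketched here with the header and first two rows of the file held in memory, is to slice out the column and call `astype(float)`.

```python
import io
import numpy as np

# Header plus the first two rows of Moons_and_planets.csv, typed in for this sketch.
text = (
    "# Name of Moon,Name of Planet,Diameter (km)\n"
    "Moon,Earth,1737.1\n"
    "Phobos,Mars,11.1\n"
)
table = np.loadtxt(io.StringIO(text), dtype="str", delimiter=',', skiprows=1)

names = table[:, 0]                      # still strings
diameters = table[:, 2].astype(float)    # third column converted to floats

print(names)
print(diameters)
```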


Additional information on using Numpy including a few additional functions and formats can be found at:

https://numpy.org/doc/stable/user/how-to-io.html

### 3.3) Reading in files with Pandas

Since data files may have mixed data types, NumPy is not the right choice for all data files. Several other packages focus on different ways to work with data files. **Pandas** is a very commonly used one. 

In their documentation they use the description: 

"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

Full documentation and user guides can be found at https://pandas.pydata.org/docs/

Advantages of pandas
- It deals with multiple data types easily
- The resulting data structure for numeric columns can be easily converted to numpy arrays
- There are functions and methods to work with the data in the table
- Any header information can be used to define column names (instead of just skipping the lines)

#### Pandas read csv

When working with data files with comma-separated values, you can use the `pd.read_csv()` function. 

The required parameter is the string filename, and the output is a pandas dataframe object.

It is common to use df in the output variable name to indicate that it is a DataFrame, but this is not required.

In [24]:
#Reading in the datafile
filename = './data/galaxies.csv'
gal_df = pd.read_csv(filename)

type(gal_df)

pandas.core.frame.DataFrame

We can use the same format to look at the first few lines of the data as we did with the numpy arrays, but we immediately see that the output is easier to read. 

The column names were taken from the header automatically, and the values are shown in the normal decimal place format instead of the scientific notation format we saw in the NumPy arrays. 


In [25]:
gal_df[0:2]

|   | # mangaid | objra | objdec | redshift |
|---|---|---|---|---|
| 0 | 1 | 133.3711 | 57.598427 | 0.039515 |
| 1 | 2 | 133.68567 | 57.48025 | 0.041056 |
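
With the column names in place, individual columns and rows can be accessed by name or by position. The following is a minimal sketch of a few common access patterns. It reads the header and two rows shown above from memory with `io.StringIO` so that it does not depend on the data file.

```python
import io
import pandas as pd

# Header and first two rows of the galaxies file, typed in for this sketch.
csv_text = (
    "# mangaid,objra,objdec,redshift\n"
    "1,133.3711,57.598427,0.039515\n"
    "2,133.68567,57.48025,0.041056\n"
)
gal_df = pd.read_csv(io.StringIO(csv_text))

column_names = gal_df.columns.tolist()    # names taken from the header line
redshift = gal_df["redshift"]             # one column, selected by name
ra_values = gal_df["objra"].to_numpy()    # the same idea, converted to a NumPy array
first_row = gal_df.iloc[0]                # one row, selected by position

print(column_names)
print(ra_values)
print(first_row["redshift"])
```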


#### Pandas with other column data

We can use the `pd.read_csv()` function even if the data has a different separator.

You will need to specify the `delimiter` keyword or the equivalent `sep` keyword (short for separator).

- A single space delimiter will use `sep=' '`
- A variable number of white spaces will use `sep=r'\s+'` (the `r` prefix makes it a raw string, so Python does not try to interpret the backslash itself)
- A comma would use `sep=','`, or you can just leave off `sep` and the comma will be assumed


In [26]:
#Using the syn.txt file
#Here we set the delimiter and also say no header
filename = './data/syn.txt'
df = pd.read_csv(filename, sep=' ', header=None)

df[0:2]

|   | 0 | 1 |
|---|---|---|
| 0 | 0.048962 | 12.070034 |
| 1 | 0.471877 | 12.640819 |


In [27]:
#Using the GCN25560.txt file, which has a header line and whitespace between columns
filename = './data/GCN25560.txt'
df = pd.read_csv(filename,sep=r'\s+')

df[0:2]

|   | JD | dt_minutes | ap_mag | Mag_err |
|---|---|---|---|---|
| 0 | 2458725.366 | 51.16 | 16.93 | 0.01 |
| 1 | 2458725.372 | 59.8 | 17.21 | 0.02 |


#### Pandas with mixed data types

While the Moons_and_planets data file was very hard to deal with using numpy, pandas has no difficulty with it at all. 


In [28]:
#Here we see that it easily handles text and numbers
filename = "./data/Moons_and_planets.csv"
df = pd.read_csv(filename)

df[0:2]

|   | # Name of Moon | Name of Planet | Diameter (km) |
|---|---|---|---|
| 0 | Moon | Earth | 1737.1 |
| 1 | Phobos | Mars | 11.1 |
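
Because pandas keeps a separate data type for each column, the numeric column can be used directly in comparisons. The sketch below shows one way to select rows by a condition. The header and two rows shown above are read from memory, and the threshold of 100 is only an example value.

```python
import io
import pandas as pd

# Header and first two rows of Moons_and_planets.csv, typed in for this sketch.
csv_text = (
    "# Name of Moon,Name of Planet,Diameter (km)\n"
    "Moon,Earth,1737.1\n"
    "Phobos,Mars,11.1\n"
)
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)    # shows the data type pandas chose for each column

# Keep only the rows where the numeric column is above the example threshold.
large = df[df["Diameter (km)"] > 100]
print(large)
```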


## 4.0) Working with spreadsheet files

While text and csv files are ASCII files that you can view directly, spreadsheet programs such as Microsoft Excel or LibreOffice Calc create files that are not ASCII. 

To read in the data files, we need to use different approaches.

### 4.1) convert to csv

If you only need to work with one spreadsheet file, it may be easier to open that spreadsheet with spreadsheet software and use the `Save As` options to save it out as a .csv file. Then you can use the methods described above to bring in the new .csv file into python.

### 4.2) Pandas and Excel

Excel is used commonly enough that pandas provides a function for Excel files: `pd.read_excel()`, which is used in place of `pd.read_csv()`.

The function call is very similar to the `read_csv` if you only have a single sheet in the spreadsheet file. 

If you have multiple sheets or only need to pull in a specific range of cells from the spreadsheet, there are keyword parameters to do this. 

We show only the simple example here.

In [29]:
filename = './data/galaxies.xlsx'

df = pd.read_excel(filename)

df[0:2]

|   | # mangaid | objra | objdec | redshift |
|---|---|---|---|---|
| 0 | 1 | 133.3711 | 57.598427 | 0.039515 |
| 1 | 2 | 133.68567 | 57.48025 | 0.041056 |


### 4.3) Pandas and OpenDocument

The OpenDocument format for a spreadsheet is given the extension `.ods`. 

LibreOffice is a cross-platform, free program that will work with `.ods` files. They can be saved out as either `.csv` or `.xlsx` files so that you can use the methods described above. 

Recent versions of Microsoft Excel will also read in a `.ods` file and allow you to save it in a different format. 


In the unlikely event that you need to work with these files in bulk, you will need additional packages that are not standard in the Anaconda distribution of Python. You would need to be able to install new Python packages to use these. 

Possible packages are:
- `odfpy`
- `pandas_ods_reader`

## 5.0) Writing Output files

### Simple ASCII text files of numpy arrays.

https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html#numpy.savetxt

In [30]:
filename = "outfile"
f = open(filename, 'w')
print("Filename is '{}'.".format(f.name))
if f.closed:
    print("File is closed.")
else:
    print("File isn't closed.")
f.close()   #close the file once we are done with it

Filename is 'outfile'.
File isn't closed.
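
For NumPy arrays, the `np.savetxt` function linked above handles the opening, writing, and closing in one call. The following is a minimal sketch. The output name `outfile.csv`, the temporary directory, and the column labels `x,y` are assumptions chosen for this example.

```python
import os
import tempfile
import numpy as np

# Hypothetical output location, so the sketch does not write into your working directory.
outname = os.path.join(tempfile.mkdtemp(), "outfile.csv")

data = np.array([[0.5, 12.0],
                 [1.5, 13.0]])

# Write the array as comma-separated text. The header line is written
# with a leading '#', so np.loadtxt treats it as a comment when reading.
np.savetxt(outname, data, delimiter=',', header="x,y")

# Read the file back to check the round trip.
restored = np.loadtxt(outname, delimiter=',')
print(restored)
```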


### Simple csv file from pandas
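
A pandas DataFrame can be written out with its `to_csv()` method. The sketch below is one possible way to do this. The small table, the column names `name` and `diameter_km`, and the output name `moons-out.csv` are all made up for the example.

```python
import os
import tempfile
import pandas as pd

# Hypothetical output location in a temporary directory.
outname = os.path.join(tempfile.mkdtemp(), "moons-out.csv")

# A small made-up table with one text column and one numeric column.
df = pd.DataFrame({
    "name": ["Moon", "Phobos"],
    "diameter_km": [1737.1, 11.1],
})

# index=False leaves the row numbers out of the file.
df.to_csv(outname, index=False)

# Read the file back to check what was written.
restored = pd.read_csv(outname)
print(restored)
```

Leaving out `index=False` would add the row index as an extra first column, which then shows up as an unnamed column when the file is read back in.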

# Assignments


## Exercise 1

In this exercise, you will be working with the file 'NGC5272.txt'

1) Use the built-in open method to read in the lines of the file
    - Don't forget to close the file when done
2) Print the first three lines of the file
   - Use this to determine if the file has a header line
   - Use this to determine what delimiter is used
3) Using numpy or pandas, read in the data
    - Print out the number of data rows
    - Print out the first five rows of data

In [31]:
# Step 1: use open and readlines to get the data


# Step 2: print out the first three lines


# Step 3: use either numpy or pandas to read the data


# Step 3b: print out the number of rows of data


# Step 3c: print out the first five rows of data




## Exercise 2  

1) Read in the `Moons_and_planets.dat` file
2) Count the number of moons for each planet above a threshold size of 100 km
    - use one of the solutions from the earlier tutorial on conditionals and control files as a guide
    - convert the diameters from strings into floats
    - use the diameters to limit what moons you count
3) Print out a statement for each planet with the name of the planet and the number of large moons it has

In [32]:
# Step 1: read in the data file


# Step 2a: Set up a loop to check the rows


# Step 2b: Loop over the moons and add to each counter when you find a large moon


# Step 3: Print out a statement for each planet with the planet and the count

