# Handling data with Python

## 1. Simple Variables

Basic data types:

* Integers
* Floats
* Strings
* Booleans

### 1.1 Numbers

In [None]:
one = 1 # As an integer

print(one)
print('This variable is a', type(one))

In [None]:
one = 1.00 # As a float

print(one)
print('This variable is a', type(one))

### 1.2 Strings

In [None]:
lapis = 'LAPIS'

print(lapis)

Indexing strings is an important skill.

In [None]:
lapis[-1]

In [None]:
print('[0] =>', 'LAPIS'[0])
print('[0:1] =>', 'LAPIS'[0:1])
print('[0:2] =>', 'LAPIS'[0:2])
print('[-1] =>', 'LAPIS'[-1])
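Slices follow the pattern `start:stop:step`, and the `stop` index is excluded. The sketch below shows a few more cases; the variable names (`word`, `first_two`, and so on) are introduced only for this illustration.

```python
word = 'LAPIS'  # illustrative variable holding the same text as above

first_two = word[:2]        # start defaults to 0; index 2 is excluded
last_two = word[-2:]        # negative indices count from the end
every_other = word[::2]     # a third number sets the step
reversed_word = word[::-1]  # a step of -1 walks the string backwards

print(first_two, last_two, every_other, reversed_word)
```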

You can use Unicode characters:

In [None]:
print('I %s LAPIS' % '\u2665')

You can check the length of a string with `len`:

In [None]:
len(lapis)

### 1.3 Booleans

Booleans are results of a logical expression:

In [None]:
print(type(lapis) == str)
print(type(lapis) == float)

You can also create a boolean variable:

In [None]:
a_boolean = True

print(a_boolean)
print(a_boolean is True)
print(a_boolean is False)

### 1.4 Converting between variable types

You can easily convert between types.

In [None]:
print(float(a_boolean))
print(int(a_boolean))
print(int(13.97))
print(str(a_boolean))
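Some conversions have details worth knowing. The sketch below, with variable names chosen only for illustration, shows that `int()` truncates toward zero rather than rounding, and that converting a string to a boolean does not parse its text.

```python
truncated = int(13.97)        # int() drops the fractional part
negative = int(-13.97)        # truncation is toward zero, not downward
rounded = round(13.97)        # round() goes to the nearest integer
parsed = float('3.5')         # strings that hold numbers can be parsed
text_is_true = bool('False')  # any non-empty string is truthy, even 'False'

print(truncated, negative, rounded, parsed, text_is_true)
```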

### 1.5 Operations

Basic math operations are simple:

In [None]:
two = one + one
zero = one - one
two = 2 * one
four = 2 ** 2
half = 1 / 2

print(two, 'is', one, '+', one)
print(zero, 'is', one, '-', one)
print(two, 'is', 2, '*', one)
print(four, 'is', 2, '**', 2)
print(half, 'is', one, '/', 2)

Boolean operations are my favorite.

In [None]:
# Is the variable one equal to 1?
print(one == 1) 

In [None]:
# The expression reads: (one >= 1) or ((one / two > 0) and (a_boolean is True))
# With the | and & operators, the parentheses around each comparison are required
print((one >= 1) | ((one / two > 0) & (a_boolean == True)))
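The cell above uses `|` and `&`, which bind more tightly than comparisons. The sketch below illustrates why the parentheses matter and how the keywords `and` and `or` behave differently. The variable names are only for this example.

```python
one, two = 1, 2

# With & each comparison must be wrapped in parentheses
with_parentheses = (one == 1) & (two == 2)

# Without them, & is applied first: one == (1 & two) == 2
without_parentheses = one == 1 & two == 2

# `and` has lower precedence than ==, so no parentheses are needed
with_keywords = one == 1 and two == 2

# `or` short-circuits: the division by zero is never evaluated
short_circuit = one >= 1 or one / 0 > 0

print(with_parentheses, without_parentheses, with_keywords, short_circuit)
```

For plain Python booleans, `and` and `or` are usually the safer choice. The `&` and `|` operators become useful for element-wise comparisons on NumPy arrays.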

In [None]:
# Things are different for strings

'A String' + ' Another String'

## Question time!
### What happens if I multiply a string by an integer or float?
### What happens if I multiply a float and a boolean?

## 2. Less simple variables

Python allows you to organize variables in lots of ways. Out of the box, the possibilities are:

* Lists
* Tuples
* Dictionaries

### 2.1 Lists and tuples

The most basic containers are tuples and lists.

In [None]:
a_tuple = ('a', 1, 1.0)
a_list = ['a', 1, 1.0]

print(a_tuple)
print(a_list)

You can have multidimensional tuples and lists, and even lists of tuples and tuples of lists.

In [None]:
a_list_of_lists = [[1,0,0],[0,1,0],[0,0,1]]
a_list_of_tuples = [(1,0,0),(0,1,0),(0,0,1)]
a_tuple_of_lists = ([1,0,0],[0,1,0],[0,0,1])
a_tuple_of_tuples = ((1,0,0),(0,1,0),(0,0,1))

print(a_list_of_lists)
print(a_list_of_tuples)
print(a_tuple_of_lists)
print(a_tuple_of_tuples)

Operations on lists and tuples work the same way.

In [None]:
print('Doubles the tuple:', 2 * a_tuple)
print('Doubles the list:', 2 * a_list)
print('Concatenates tuples:', a_tuple + (2, 1))
print('Concatenates lists:', a_list + [2, 1])

Indexing lists and tuples is like indexing strings.

In [None]:
print(a_list[0])
print(a_list[0:2])
print(a_list[-1])

print(a_tuple[0])
print(a_tuple[0:2])
print(a_tuple[-1])

##### But what is the difference?
Lists are mutable (they can be changed in place) and tuples are immutable!


In [None]:
# This will work
a_list[0] = 'a new element'
print(a_list)

In [None]:
# This will NOT work: tuples are immutable, so the assignment raises a TypeError
a_tuple[0] = 'a new element'
print(a_tuple)
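If you want the notebook to keep running past the failing assignment, you can catch the error. This is a minimal sketch; `message` and `new_tuple` are names made up for the example.

```python
a_tuple = ('a', 1, 1.0)

try:
    a_tuple[0] = 'a new element'
except TypeError as error:
    # Item assignment on a tuple raises a TypeError
    message = str(error)
    print('Tuples cannot be modified:', message)

# To "change" a tuple, build a new one from pieces of the old one
new_tuple = ('a new element',) + a_tuple[1:]
print(new_tuple)
```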

You can even create an empty list and append things to it as you go:

In [None]:
empty_list = []
print('This is an empty list:', empty_list)

In [None]:
empty_list.append('an element')
empty_list.append('another element')

# Now that the list is not empty anymore we should change its name
not_empty = empty_list

print('Not empty anymore:', not_empty)
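Note that assigning a list to a new name does not copy it: both names refer to the same object. The sketch below shows the difference between a second name and an actual copy; `a_copy` and `same_object` are illustrative names.

```python
empty_list = []
not_empty = empty_list          # a second name for the same list
empty_list.append('an element')

same_object = not_empty is empty_list  # both names point to one object

a_copy = list(empty_list)       # an independent copy
empty_list.append('another element')

print('Alias sees the change:', not_empty)
print('Copy does not:', a_copy)
```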

You can see the size of the list or tuple just like with strings.

In [None]:
len(not_empty)

### 2.2 Dictionaries

Dictionaries let you organize your data in fields with proper names.

In [None]:
a_dictionary = {'one' : 1.0,
                'two' : 2.0,
                'heart' : '\u2665',
                'a_list' : ['this', 'is', 'a', 'list']
                }

To see the names of the fields in a dictionary:

In [None]:
print(a_dictionary.keys())

To get access to the data:

In [None]:
print(a_dictionary['heart'])
print(a_dictionary['a_list'])
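Dictionaries can also be extended and queried safely. The sketch below uses a smaller dictionary than the one above, and the key names `'three'` and `'four'` are made up for the example.

```python
a_dictionary = {'one': 1.0, 'two': 2.0}

a_dictionary['three'] = 3.0            # assigning to a new key adds a field
has_four = 'four' in a_dictionary      # membership test on the keys
four = a_dictionary.get('four', 0.0)   # .get returns a default instead of raising KeyError
total = sum(a_dictionary.values())     # .values() gives the stored data

# .items() yields (key, value) pairs
for key, value in a_dictionary.items():
    print(key, '=>', value)
```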

### 2.3 Arrays

Arrays are data structures provided by the NumPy package. They make numerical computations much faster than plain Python lists!

In [None]:
import numpy as np

You can create arrays of random data by sampling from a probability distribution, which is useful for examples. Let's create this dataset from a Gaussian distribution.

In [None]:
fake_data = np.random.normal(size=100)

print(fake_data)

NumPy arrays (objects of type `np.ndarray`) have a series of methods:

In [None]:
print('Mean:', fake_data.mean())
print('Standard deviation:', fake_data.std())
print('Sum', fake_data.sum())
print('Maximum:', fake_data.max())
print('Index of the maximum value:', fake_data.argmax())
print('Minimum:', fake_data.min())
print('Index of the minimum value:', fake_data.argmin())

Let's play with some multidimensional data:

In [None]:
more_fake_data = np.zeros((10, 12)) # A matrix full of zeros
more_fake_data

In [None]:
more_fake_data[:,-1] = 2 * np.ones(10) # Setting the last column to a vector of twos
more_fake_data

In [None]:
more_fake_data[0, 0] = np.pi # Setting first element to pi
more_fake_data

##### Numpy allows looooooots of tricks

In [None]:
print('The data has shape', more_fake_data.shape)
print('Sum:', more_fake_data.sum())
print('Sum along each row (axis=1):', more_fake_data.sum(axis=1))
print('Sum along each column (axis=0):', more_fake_data.sum(axis=0))
print('Square roots: \n', np.sqrt(more_fake_data))
# Trapezoidal integration along the last axis; in NumPy >= 2.0 the function is named np.trapezoid
print('Integrals:', np.trapz(more_fake_data, dx=1))
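The `axis` argument is easier to follow on a tiny array. In this sketch, `small` is an illustrative 2x3 array: `axis=1` collapses the columns and leaves one value per row, while `axis=0` collapses the rows and leaves one value per column.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])  # 2 rows, 3 columns

row_sums = small.sum(axis=1)     # one sum per row: [6, 15]
column_sums = small.sum(axis=0)  # one sum per column: [5, 7, 9]

print('Shape:', small.shape)
print('Row sums:', row_sums)
print('Column sums:', column_sums)
```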

In [None]:
array_from_a_list = np.array(a_list_of_lists) # Creates array from a list
print('Array from a list:\n', array_from_a_list)

In [None]:
print('Median:', np.median(fake_data))
print('Percentile:', np.percentile(fake_data, 25))

In [None]:
aranged_data = np.arange(0, 100, 10) # Series of numbers spaced according to a certain step
print(aranged_data)

In [None]:
linear = np.linspace(0, 100) # Linearly spaced data (50 points by default)
log = np.logspace(0, 100) # Logarithmically spaced data, from 10**0 to 10**100

print(linear)
print(log)
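The arguments of `np.logspace` are exponents of 10, not the end values themselves, and both functions accept a `num` argument that sets the number of points. A small sketch, with values chosen only to keep the output short:

```python
import numpy as np

linear = np.linspace(0, 100, num=5)  # 0, 25, 50, 75, 100
log = np.logspace(0, 2, num=3)       # 10**0, 10**1, 10**2

default_length = len(np.linspace(0, 100))  # num defaults to 50

print(linear)
print(log)
print(default_length)
```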

##### In IPython and Jupyter, you can look up the documentation of any Python function or object with a "?"

In [None]:
np.sum?

## 3. Some quick coding lessons

### 3.1 For loops

Python lets you loop over almost anything with a for loop:

In [None]:
# Keys in a dictionary:
for key in a_dictionary.keys():
    print(key)

In [None]:
# Files in a directory:

import os

for file in os.listdir('./'):
    print(file)

In [None]:
# And numbers in a range:

for i in range(10):
    print('i=',i)
    i = i**2 + 1 
    print('i**2+1=', i)

Learn list comprehensions and never write an explicit loop again in your life!

In [None]:
a_cleverly_defined_list = [i**2 + 1 for i in range(10)]
print(a_cleverly_defined_list)
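List comprehensions can also filter and transform elements. The sketch below shows a condition at the end (a filter) and a conditional expression at the start (a choice per element); the list names are made up for the example.

```python
# Keep only the even numbers, then square them
even_squares = [i**2 for i in range(10) if i % 2 == 0]

# Choose a value for every element instead of filtering
labels = ['even' if i % 2 == 0 else 'odd' for i in range(4)]

print(even_squares)
print(labels)
```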

## Question time!

### See if you can get an array where each element is the number of characters in the name of each file in this directory!

### Extra: Can you do it with list comprehension?

### Extra: Do you think you can make a dictionary where the keys are the names of files and the contents are the numbers of characters in the name of each file?

### 3.2 Defining functions

Functions in Python are defined like this:

In [None]:
def a_function(a_number):
    another_number = a_number**3 + a_number**2
    return another_number

print(a_function(2))
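Functions can also take arguments with default values and carry a docstring. The following is a sketch that generalizes the function above; the name `a_polynomial` and the `power` parameter are hypothetical additions for illustration.

```python
def a_polynomial(a_number, power=3):
    """Return a_number**power + a_number**2."""
    return a_number**power + a_number**2

default_result = a_polynomial(2)          # uses power=3: 8 + 4
custom_result = a_polynomial(2, power=4)  # keyword argument: 16 + 4

print(default_result, custom_result)
```

The docstring is what `a_polynomial?` shows in Jupyter.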

### 3.3 Plotting

Plots are made with the matplotlib package.

In [None]:
import matplotlib.pyplot as plt

Create random data and plot a histogram:

In [None]:
x = np.random.normal(size=1000)

In [None]:
plt.hist(x)

Create a fake correlation (12x + 2 line + noise) and plot it:

In [None]:
y = 12 * (x + np.random.normal(size=1000)) + 2

In [None]:
plt.plot(x, y, '.', label='Data')
plt.plot(x, 12 * x + 2, label=r'$y=12x+2$')

plt.legend()

plt.xlabel('x')
plt.ylabel('y')

Sometimes you are dealing with so much data that you can't see any correlation when plotting points:

In [None]:
x = np.random.normal(loc=20, scale=3, size=10000)
y = (x + np.random.normal(size=10000, scale=3)) ** 2

plt.plot(x, y, '.')

You can make points transparent or smaller.

In [None]:
plt.plot(x, y, '.', alpha=0.5, markersize=1)

But there are better ways to do this, like the hexbin function.

In [None]:
plt.hexbin(x, y, gridsize=50, mincnt=1)
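A hexbin plot colors each cell by the number of points that fall inside it. The same counting idea can be sketched with rectangular bins and `np.histogram2d`. This is not what `hexbin` does internally, only an illustration of binning; the seeded generator `rng` is an assumption made so the example is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded generator, for reproducibility

x = rng.normal(loc=20, scale=3, size=10000)
y = (x + rng.normal(scale=3, size=10000)) ** 2

# Count how many points fall in each cell of a 50x50 rectangular grid
counts, x_edges, y_edges = np.histogram2d(x, y, bins=50)

print('Grid shape:', counts.shape)
print('Points counted:', counts.sum())
print('Most crowded cell holds', counts.max(), 'points')
```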

Or the kdeplot function of the seaborn package.

In [None]:
import seaborn as sns

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.kdeplot(x=x, y=y)

## 4. Taking a look at S-PLUS data

There are several ways to organize tables in Python. I tend to use Astropy tables.

In [None]:
from astropy.table import Table

catalog = Table.read('splus_laplata.txt', format='ascii')

catalog

A more popular option is Pandas.

In [None]:
import pandas as pd

# sep=r'\s+' splits columns on any whitespace (delim_whitespace is deprecated in recent pandas)
catalog = pd.read_table('splus_laplata.txt', sep=r'\s+')

catalog

It is nice to have at least some proficiency with both Astropy tables and Pandas dataframes. Only learn PyTables if these two don't do the trick.

|                | Intuitive | Memory management | Integration with astronomy tools | General purpose |
|----------------|-----------|-------------------|----------------------------------|-----------------|
| Astropy Tables | 10        | 3                 | 10                               | 0               |
| Pandas         | 9         | 7                 | 0                                | 10              |
| PyTables       | 0         | 10                | 0                                | 10              |

### Let's make a color-color diagram

In [None]:
plt.plot(catalog['uJAVA_auto']-catalog['r_auto'], catalog['g_auto']-catalog['i_auto'], '.')

What just happened?

The S-PLUS magnitudes are reported as 99 or -99 when the object is not detected or not observed in a given band.

In [None]:
no_missing_bands = catalog['nDet_auto'] == 12

no_missing_bands.sum()/len(catalog)
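The cell above relies on the `nDet_auto` column. The same kind of boolean mask can be built directly from the magnitudes. The sketch below uses a few invented magnitudes (not S-PLUS data) and assumes the missing-value flags are exactly 99 or -99.

```python
import numpy as np

# Invented magnitudes for illustration only; 99 and -99 mark missing values
g_mag = np.array([18.2, 99.0, 20.1, -99.0, 19.4])
i_mag = np.array([17.5, 18.0, 99.0, 17.0, 18.9])

# Keep only the objects with valid magnitudes in both bands
detected = (np.abs(g_mag) != 99) & (np.abs(i_mag) != 99)

color = g_mag[detected] - i_mag[detected]

print('Mask:', detected)
print('Fraction kept:', detected.sum() / len(detected))
print('g - i:', color)
```

Without the mask, subtracting a flagged magnitude produces colors near 80 or -80, which is what stretches the axes of the first plot.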

In [None]:
catalog = catalog[no_missing_bands]

plt.plot(catalog['uJAVA_auto']-catalog['r_auto'], catalog['g_auto']-catalog['i_auto'], '.', ms=0.5)

plt.xlabel('u-r')
plt.ylabel('g-i')

Let's plot galaxies, stars and quasars separately.

In [None]:
stars = catalog['class_2'] == 'STAR'
galaxies = catalog['class_2'] == 'GALAXY'
quasars = catalog['class_2'] == 'QSO'

plt.plot(catalog['uJAVA_auto'][stars]-catalog['r_auto'][stars], catalog['g_auto'][stars]-catalog['i_auto'][stars],
         '.', ms=2, color='green', label='Stars')
plt.plot(catalog['uJAVA_auto'][galaxies]-catalog['r_auto'][galaxies], catalog['g_auto'][galaxies]-catalog['i_auto'][galaxies], 
         '.', ms=2, color='red', label='Galaxies')
plt.plot(catalog['uJAVA_auto'][quasars]-catalog['r_auto'][quasars], catalog['g_auto'][quasars]-catalog['i_auto'][quasars],
         '.', ms=2, color='blue', label='Quasars')

plt.xlabel('u-r')
plt.ylabel('g-i')

plt.legend()

##### What a mess! Let's plot contours!

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.kdeplot(x=catalog['uJAVA_auto'][stars]-catalog['r_auto'][stars], y=catalog['g_auto'][stars]-catalog['i_auto'][stars],
            cmap='Greens', levels=5, label='Stars')
sns.kdeplot(x=catalog['uJAVA_auto'][galaxies]-catalog['r_auto'][galaxies], y=catalog['g_auto'][galaxies]-catalog['i_auto'][galaxies],
            cmap='Reds', levels=5, label='Galaxies')
sns.kdeplot(x=catalog['uJAVA_auto'][quasars]-catalog['r_auto'][quasars], y=catalog['g_auto'][quasars]-catalog['i_auto'][quasars],
            cmap='Blues', levels=5, label='Quasars')

plt.legend()

plt.xlim(-0.2, 4.3)
plt.ylim(-0.2, 2)

plt.xlabel('u-r')
plt.ylabel('g-i')

## Question time!

### What about plotting a histogram of redshifts for the different types of S-PLUS objects? How do you think we could overplot the three classes?