# Python for Data Science

In this notebook, we will explore the basics of using the Python programming language for data science. These exercises are implemented in IPython notebooks. [IPython](https://ipython.org/) is an interactive environment for Python programming commonly used for data science and machine learning.

This notebook is adapted from the [Exploratory Computing course](https://github.com/mbakker7/exploratory_computing_with_python) by Mark Bakker. 

In [None]:
# this code creates the data files used in the exercises
# you can ignore it
import numpy as np
tokyo_temperature = [6.0,6.0,9.0,14.5,19.0,22.5,26,27.5,24,18.5,13.0,8.0]
np.savetxt('tokyo_temperature.txt', tokyo_temperature, delimiter=',')
holland_temperature = [3.1,3.3,6.2,9.2,13.1,15.6,17.9,17.5,14.5,10.7,6.7,3.7]
np.savetxt('holland_temperature.txt', holland_temperature, delimiter=',')
newyork_temperature = [-1,0,4,10,16,21,24,23,19,13,7,2]
np.savetxt('newyork_temperature.txt', newyork_temperature, delimiter=',')
holland_seawater = [5.5,4.5,5.7,7.3,11.7,14.8,17.5,20.0,19.0,17.2,13.0,8.8]
np.savetxt('holland_seawater.txt', holland_seawater, delimiter=',')


## First Steps with Python

Python is a popular, open-source programming language used for both scripting applications and standalone programs. For example, you can use Python as a calculator. Position your cursor in the code cell below and hit `shift+enter`. The output should be 12!

In [None]:
6 * 2

When you are programming, you usually store your values in `variables`. These variables can also be used to perform arithmetic operations.

In [None]:
a = 6
b = 2
a * b

Here `a` and `b` are variables. Each variable has a `type`. In this case, they are both `integer` type. To write the value of a variable to the screen, use the `print` function.

In [None]:
print(a)
print(b)
print(a * b)
print(a / b)
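
As a small illustration of the point about types, the built-in `type` function reports the type of a value. The sketch below redefines `a` and `b` so that it runs on its own. Note that `/` returns a float even when both operands are integers.

```python
# redefine the variables so this cell runs on its own
a = 6
b = 2

# type() reports the type of a value
print(type(a))        # <class 'int'>
print(type(a * b))    # <class 'int'>
print(type(a / b))    # <class 'float'>, since / always returns a float
```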

You can add text to the `print` function by putting the text between quotes (either single or double quotes work). You can also insert variables into the string by putting the letter `f` directly before the opening quote (this makes it an f-string) and putting the variable name in `{}`.

In [None]:
print(f'Value of a is {a}')
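
The braces in an f-string can hold any expression, not only a variable name, and a format specifier after a colon controls how the value is displayed. A minimal sketch, where the value `a / 7` is chosen only because it has many decimals:

```python
a = 6
b = 2

# any expression may go between the braces
message = f'a times b is {a * b}'
print(message)

# :.2f rounds the displayed value to two decimals
ratio = a / 7
rounded = f'a / 7 rounded to two decimals is {ratio:.2f}'
print(rounded)
```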

A variable can be raised to a power by using `**`.

In [None]:
a ** b

Division works as well. 

(Note for Python 2 users: `1/3` gives zero in Python 2, as the division of two integers returns an integer. Use `1.0/3` instead.)

In [None]:
# for python2 use print('1/3 gives ', 1.0/3)
print(f'1/3 gives {1 / 3}')
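
Besides `/`, Python 3 offers floor division `//` and the remainder operator `%`. The short sketch below compares the three on the same pair of integers.

```python
true_division = 7 / 2     # 3.5
floor_division = 7 // 2   # 3, rounds down to the nearest integer
remainder = 7 % 2         # 1
negative_floor = -7 // 2  # -4, rounds toward negative infinity

print(true_division, floor_division, remainder, negative_floor)
```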

### Resources

* Python supports several data-types other than integer. Please read more on the official documentation [page](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex).
* String manipulation and printing are straightforward. Documentation can be found for [print](https://docs.python.org/3/tutorial/inputoutput.html) and [strings](https://docs.python.org/3/library/string.html).
* A comprehensive list of arithmetic and boolean operators is available [here](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex).

### Exercise 1: First Python Code

Compute the value of the polynomial $y=ax^2+bx+c$ at $x=-2$, $x=0$, and $x=2$ using $a=1$, $b=1$, $c=-6$.

Hint: Create variables for a,b,c,x. Assign different values and evaluate the expression for each.

In [None]:
# put your code here


[Answer to Exercise 1](https://colab.research.google.com/drive/1-nwmcGnc-Oj56iHmrZFsU6tyJHunbd13#scrollTo=vNFHznAs4vqq)

### More on Variables

Once you create a variable in a Python session, it will remain in memory, so you can use it in other cells as well. For example, variable `a`, defined earlier in this Notebook, still exists. 

In [None]:
print(f'a: {a}')

It is important to keep these points in mind:
* Variable names may be as long as you like. Using descriptive names helps in understanding the code. 
* Variable names cannot have spaces, nor can they start with a number. 
* Variable names are case sensitive. So the variable `avalue` is not the same as the variable `Avalue`. 
* The name of a variable cannot be a reserved word in the Python language. For example, it is not possible to create a variable `for = 7`, as `for` is a reserved word.
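
The naming rules above can be checked from within Python. The sketch below uses the standard-library `keyword` module and the string method `isidentifier`. The names being tested are made up for illustration.

```python
import keyword

# reserved words cannot be used as variable names
print(keyword.iskeyword('for'))       # True
print(keyword.iskeyword('avalue'))    # False

# isidentifier() checks the spelling rules (no spaces, no leading digit)
print('avalue'.isidentifier())        # True
print('2nd_value'.isidentifier())     # False
print('my value'.isidentifier())      # False
```

A name is usable as a variable only if `isidentifier()` is true and `keyword.iskeyword` is false.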

## Plotting and Arrays

Plotting is not part of standard Python. Luckily, a package exists to create beautiful graphics. The graphics package we will use is called `matplotlib`. To be able to use the plotting functions in `matplotlib`, we have to import it. We import the plotting part of `matplotlib` and call it `plt`. We also give a second command, which tells the notebook to show any graphs inside this Notebook and not in a separate window.

In [None]:
# import plotting part of matplotlib package as plt
import matplotlib.pyplot as plt

# this makes all matplotlib plots appear inside the notebook
%matplotlib inline

Packages have to be imported only once in a Python session. After the import, any plotting function may be called from any code cell as `plt.function`:

In [None]:
# use the plot function of plt
# this will create a graph and plot the contents of the list 
plt.plot([1, 2, 3, 2])

Let's try to plot $y$ vs $x$ for $x$ going from $-4$ to $4$ for the polynomial in the exercise above. To do that, we need to evaluate $y$ at a bunch of points. A sequence of values is called an array (for example an array of integers or floats). Array functionality is available in the package `numpy`. Let's import `numpy` and call it `np`, so that any function in the `numpy` package may be called as `np.function`:

In [None]:
# numpy package is used for array manipulation
# we import numpy as np 
import numpy as np

To create an array `x` consisting of 10 equally spaced points between `-4` and `4`, use the `linspace` command:

In [None]:
# create an array of 10 numbers from -4 to 4
x = np.linspace(-4, 4, 10)

# output the array here
print(x)
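
To see what "equally spaced" means here, the sketch below inspects the end points and the spacing of the same array. For contrast, it also uses `np.arange`, which takes a step size instead of a number of points and excludes the stop value.

```python
import numpy as np

x = np.linspace(-4, 4, 10)

# linspace includes both end points
print(x[0], x[-1])

# 10 points give 9 intervals, so the spacing is 8 / 9
spacing = x[1] - x[0]
print(spacing)

# arange uses a step size; the stop value (5) is not included
steps = np.arange(-4, 5, 1)
print(steps)
```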

In the above cell, `x` is an array of 10 floats (`-4.` is a float, `-4` is an integer).
If you type `np.linspace` and then a question mark:

`np.linspace?` 

and then hit `shift+enter`, a help window appears that explains the input arguments of the function.

In [None]:
np.linspace?

Let's plot $y$ using 100 $x$ values from $-4$ to $4$.

In [None]:
# initialize variables
a = 1
b = 1
c = -6

# define array of x values
x = np.linspace(-4, 4, 100)

# compute y for all x values
y = a * x ** 2 + b * x + c  

# plot y vs x
plt.plot(x, y)

Note that *one hundred* `y` values are computed in the simple line `y = a * x ** 2 + b * x + c`. When you perform mathematical operations, `numpy` arrays are treated in the same fashion as regular variables. The math is simply applied to every value in the array (and it runs much faster than doing every calculation separately).
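
To make the element-by-element behaviour concrete, the sketch below computes the same polynomial twice, once with a single array expression and once with an explicit loop, and checks that the results agree.

```python
import numpy as np

a, b, c = 1, 1, -6
x = np.linspace(-4, 4, 100)

# array version: one expression for all 100 values
y_array = a * x ** 2 + b * x + c

# loop version: one value at a time
y_loop = []
for value in x:
    y_loop.append(a * value ** 2 + b * value + c)

# both approaches give the same numbers
print(np.allclose(y_array, y_loop))  # True
```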

The `plot` function can take many arguments; its help box (`plt.plot?`) describes them in detail. `plot` can be used with one argument as `plot(y)`, which plots the `y` values along the vertical axis and enumerates the horizontal axis starting at 0. `plot(x, y)` plots `y` vs `x`, and `plot(x, y, formatstring)` plots `y` vs `x` using colors and markers defined in `formatstring`:
* It can be used to define the color, for example `'b'` for blue.
* It can be used to define the line type, for example `'-'` for a solid line.
* You can also define markers, for example `'o'` for circles.
* You can even combine them: `'r--o'` gives a red dashed line with circle markers.

`plot` also takes a large number of keyword arguments. A keyword argument is an optional argument that may be added to a function. For example, to plot a line with width 6 (note the format string):

In [None]:
# plot 3 points with the formatting string 'g:s'
# green dotted line with square markers
# use keywords linewidth, markersize
plt.plot([1, 2, 3], [2, 4, 3], 'g:s', linewidth=6, markersize=25)

Note: Keyword arguments should come after regular arguments. `plot(linewidth=6, [1, 2, 3], [2, 4, 3])` gives an error.
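
The ordering rule is enforced by Python itself, before any function is called. As a sketch, the built-in `compile` function can be used to check the offending call as a string without running it; `plot` does not even need to be defined.

```python
# the call is stored as text so that the notebook cell itself stays valid
source = "plot(linewidth=6, [1, 2, 3], [2, 4, 3])"

try:
    compile(source, '<example>', 'eval')
    is_valid = True
except SyntaxError as error:
    is_valid = False
    print(f'SyntaxError: {error.msg}')

print(is_valid)  # False
```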

* Names may be added along the axes with the `xlabel` and `ylabel` functions, e.g., `plt.xlabel('this is the x-axis')`.  
* A title can be added to the figure with the `plt.title` command. 
* Multiple curves can be added to the same figure by giving multiple plotting commands. They are automatically added to the same figure.
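
The three points above can be combined in one cell. The following is a minimal sketch. The sine and cosine curves and the label texts are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)

# two plot commands add two curves to the same figure
plt.plot(x, np.sin(x), 'b-')
plt.plot(x, np.cos(x), 'r--')

# axis labels and title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Two curves in one figure');
```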

### Exercise 2: First Graph

Plot $y=(x+2)(x-1)(x-2)$ for $x$ going from $-3$ to $3$ using a dashed red line. On the same figure, plot a blue circle for every point where $y$ equals zero. Set the size of the markers to 10 (you may need to read the help of `plt.plot` to find out how to do that). Label the axes as 'x-axis' and 'y-axis'. Add the title 'First nice Python figure of Your Name', where you enter your own name.

In [None]:
# put your code here


[Answer to Exercise 2](https://colab.research.google.com/drive/1-nwmcGnc-Oj56iHmrZFsU6tyJHunbd13#scrollTo=WZ4cpFLc47gf&line=10&uniqifier=1)

## Loading Data Files

Numerical data can be loaded from a data file using the `loadtxt` function of `numpy`; the command is `np.loadtxt`. You need to make sure the file is in the same directory as your notebook, or provide the full path. The filename (or path plus filename) needs to be between quotes.

In [None]:
# read the documentation to learn more
np.loadtxt?
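
As a small round-trip sketch, the cell below writes a few numbers to a text file with `np.savetxt` and reads them back with `np.loadtxt`. The file name `example_values.txt` and the values are hypothetical, and the file is written to a temporary folder so that nothing is left behind.

```python
import os
import tempfile

import numpy as np

values = [1.5, 2.0, 3.25]

with tempfile.TemporaryDirectory() as folder:
    # path plus filename, passed as a string
    path = os.path.join(folder, 'example_values.txt')
    np.savetxt(path, values)

    # loadtxt returns a numpy array of floats
    loaded = np.loadtxt(path)

print(loaded)
print(loaded.shape)  # (3,)
```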

### Exercise 3: Loading Data

You are provided with data files containing the monthly temperature of Tokyo, Holland, and New York City. The data are stored in `tokyo_temperature.txt`, `holland_temperature.txt`, and `newyork_temperature.txt`. Plot the temperature for each location against the number of the month (starting with 1 for January), all in a single graph. Add a legend by using the function `plt.legend`.

Hint: Load the text files like `np.loadtxt('tokyo_temperature.txt')`. Read the documentation of `plt.legend` to learn how to use it.

In [None]:
# put your code here


[Answer to Exercise 3](https://colab.research.google.com/drive/1-nwmcGnc-Oj56iHmrZFsU6tyJHunbd13#scrollTo=N2x1EMB24-Vp&line=14&uniqifier=1)

## Gallery of Graphs

The plotting package `matplotlib` allows you to make appealing graphs. Check out the [matplotlib gallery](http://matplotlib.org/gallery.html) to get an overview of many of the options. The following exercises use several of the matplotlib options.


### Exercise 4: Subplots and Markers

Load the monthly air temperature and seawater temperature for Holland. Create one figure with two plots above each other using the subplot command (use `plt.subplot?`). On the top graph, plot the air and sea temperature. Label the ticks on the horizontal axis as 'jan', 'feb', 'mar', etc., rather than 0,1,2,etc. Use `plt.xticks?` to find out how. In the bottom graph, plot the difference between the air and seawater temperature.

Hint: Filenames are `holland_seawater.txt, holland_temperature.txt`

In [None]:
# put your code here


[Answer to Exercise 4](https://colab.research.google.com/drive/1-nwmcGnc-Oj56iHmrZFsU6tyJHunbd13#scrollTo=TZjQfVfu6Loh&line=10&uniqifier=1)

### Exercise 5: Pie Chart

At the 2012 London Olympics, the top ten countries (plus the rest) receiving gold medals were `['USA', 'CHN', 'GBR', 'RUS', 'KOR', 'GER', 'FRA', 'ITA', 'HUN', 'AUS', 'OTHER']`. They received  `[46, 38, 29, 24, 13, 11, 11, 8, 8, 7, 107]` gold medals, respectively. 
* Make a pie chart (check `plt.pie?`) of the top 10 gold medal winners plus the others at the London Olympics. 
* Try some keyword arguments to make the plot look nice. You may want to give the command `plt.axis('equal')` to make the scales along the horizontal and vertical axes equal so that the pie actually looks like a circle rather than an ellipse. 
* There are four different ways to specify colors in matplotlib plotting; you may read about it [here](http://matplotlib.org/examples/pylab_examples/color_demo.html). The coolest way is to use the html color names. 
* Use the `colors` keyword in your pie chart to specify a sequence of colors. The sequence must be between square brackets like `['MediumBlue','SpringGreen','BlueViolet']`. The html names for the colors may be found, for example, [here](http://en.wikipedia.org/wiki/Web_colors).

In [None]:
# put your code here


[Answer to Exercise 5](https://colab.research.google.com/drive/1-nwmcGnc-Oj56iHmrZFsU6tyJHunbd13#scrollTo=HXom7Uzl_fTK&line=8&uniqifier=1)

### Exercise 6: Fill Between

Load the air and sea temperature, as used in Exercise 4, but this time make one plot of temperature vs month. 
* Use the `plt.fill_between` command to fill the space between the curve and the $x$-axis. 
* Specify the `alpha` keyword, which defines the transparency.
* Note that you need to specify the color using the `color` keyword argument.

In [None]:
# put your code here


[Answer to Exercise 6](https://colab.research.google.com/drive/1-nwmcGnc-Oj56iHmrZFsU6tyJHunbd13#scrollTo=2e_hOfOZ_zWr&line=11&uniqifier=1)

## Solutions

The solutions to the exercises are available here. Please refrain from looking at them before solving the exercise.

In [None]:
# solution to exercise 1

# create variables
a = 1
b = 1
c = -6

# assign value to x
x = -2

# evaluate y
y = a * x ** 2 + b * x + c

# print the value
print(f'y evaluated at x={x} is {y}')

# re-evaluate y
x = 0 
y = a * x ** 2 + b * x + c
print(f'y evaluated at x={x} is {y}')

# re-evaluate y
x = 2
y = a * x ** 2 + b * x + c
print(f'y evaluated at x={x} is {y}')

In [None]:
# solution to exercise 2

# variable for x data
x = np.linspace(-3, 3, 100)

# evaluate polynomial for x data
y = (x + 2) * (x - 1) * (x - 2)

# plot the polynomial
plt.plot(x, y, 'r--')

# plot the roots of the polynomial
plt.plot([-2, 1, 2], [0, 0, 0], 'bo', markersize=10)

# add labels and title
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('First Python Figure of MI Lab');

In [None]:
# solution to exercise 3

# load the temperature data from txt files
tokyo = np.loadtxt('tokyo_temperature.txt')
holland = np.loadtxt('holland_temperature.txt')
newyork = np.loadtxt('newyork_temperature.txt')

# create a variable for months
months = np.linspace(1, 12, 12)

# plot each city data
plt.plot(months, tokyo)
plt.plot(months, holland)
plt.plot(months, newyork)

# add labels to the plot
plt.xlabel('Number of the month')
plt.ylabel('Mean monthly temperature (Celsius)')

# limit the x-range from 1 to 12
plt.xlim(1, 12)

# add a legend with the location decided by matplotlib
plt.legend(['Tokyo','Holland','New York'], loc='best');

In [None]:
# solution to exercise 4

# load the variables
air = np.loadtxt('holland_temperature.txt') 
sea = np.loadtxt('holland_seawater.txt')

# create first subplot
plt.subplot(211)

# plot both files
plt.plot(air, 'b', label='air temp')
plt.plot(sea, 'r', label='sea temp')

# add legend, labels, limits
plt.legend(loc='best')
plt.ylabel('temp (Celsius)')
plt.xlim(0, 11)
plt.xticks([])

# create second subplot
plt.subplot(212)

# plot the difference
plt.plot(air-sea, 'ko')

# add labels and limits
plt.xticks(np.linspace(0, 11, 12),
           ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'])
plt.xlim(0, 11)
plt.ylabel('air - sea temp (Celsius)');

In [None]:
# solution to exercise 5

# create gold medals list
gold = [46, 38, 29, 24, 13, 11, 11, 8, 8, 7, 107]

# create countries list
countries = ['USA', 'CHN', 'GBR', 'RUS', 'KOR', 'GER', 'FRA', 'ITA', 'HUN', 'AUS', 'OTHER']

# generate pie plot with a list of colors argument
plt.pie(gold, labels=countries, colors=['Gold', 'MediumBlue', 'SpringGreen', 'BlueViolet'])

# make the axes of equal size
plt.axis('equal');

In [None]:
# solution to exercise 6

# load the data files
air = np.loadtxt('holland_temperature.txt') 
sea = np.loadtxt('holland_seawater.txt')

# use fill_between to fill with color under the curve
plt.fill_between(range(1, 13), air, color='b', alpha=0.3)
plt.fill_between(range(1, 13), sea, color='r', alpha=0.3)

# add labels, ticks and limits
# tick positions 1..12 match the x values used in fill_between
plt.xticks(np.arange(1, 13), ['jan', 'feb', 'mar', 'apr',
           'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])
plt.xlim(1, 12)
plt.ylim(0, 20)
plt.xlabel('Month')
plt.ylabel('Temperature (Celsius)');