<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/courses/ds4b-sdc25-1-intro/notebooks/s1-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preface

## Why Python?
1. General purpose language - add what you need
2. Portable (Linux, Windows, Mac)
3. Interactive
4. Free
5. Community and eco-system
6. Easy to use

## Working with Python
There are many possible workflows, so find your own! In this course we use Jupyter notebooks and Pandas:
* Python + Jupyter notebook + Pandas = A complete environment
* Interactive
* Encourages an iterative work process (as in research)
* Documentation, code and visualization in one - literate programming
* Reproducing results and figures

# Introduction

In this session, you will learn the basic Python syntax for data manipulation & analysis, including:

1. General syntax
2. Basic operations
3. Object & data types
4. Flow controls




In [None]:
import numpy as np # Basic library for all kind of numerical operations
import pandas as pd # Basic library for data manipulation in dataframes

# Basics

## The very basics

In [None]:
# Running a cell (Ctrl-Enter, Shift-Enter)
print('Hello world')

## Variables

In [None]:
i = 6
print(i, type(i))

In [None]:
x = 3.2
print(x, type(x))

In [None]:
s = 'Hello'
print(s, type(s))

In [None]:
t = 5 == 6
print(t, type(t))

In Python, we have different **types of variables**. The most common are **integers (int)**, **floats (float)**, **strings (str)** and **booleans (bool)**. Above we have assigned values to the variables i, x, s and t. We can check the type of a variable using the function **type()**.
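
As a small illustration of the four types above, the sketch below converts between them with the built-in constructors `int()`, `float()`, `str()` and `bool()`. The variable names are made up for this example.

```python
# Illustrative sketch: converting between the basic types (names are hypothetical)
n = int('42')        # str -> int
f = float(n)         # int -> float
label = str(3.2)     # float -> str
flag = bool(0)       # 0 converts to False, any other number to True

print(n, type(n))
print(f, type(f))
print(label, type(label))
print(flag, type(flag))

# isinstance() tests for a type and returns a boolean
print(isinstance(n, int))
```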

## Value assignment & evaluation

In [None]:
x = 3         # Assignment
print('We assigned x the value of ', x , '!')              # Evaluate the expression and print result

In [None]:
y = 4         # Assignment
y + 5         # Evaluation, y remains 4

In [None]:
z = x + 2*y  # Assignment
z             # Evaluation

In [None]:
a=2

In [None]:
# basic mathematical operations
print(x+y, x*y, x-y, x/y, a**2, x+y**2, (x+y)**2)

## Value comparison

Comparisons return boolean values: True or False

In [None]:
2==2  # Equality

In [None]:
2!=2  # Inequality

In [None]:
x <= y # less than or equal: "<", ">", and ">=" also work

In [None]:
(x | z) >= y # Bitwise OR of the integers x and z, then comparison (logical OR would be: x >= y or z >= y)

In [None]:
(x & z) >= y # Bitwise AND of the integers x and z, then comparison (logical AND would be: x >= y and z >= y)

In [None]:
x + z / 50 < y
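
The difference between the bitwise operators `|` and `&` and the logical keywords `or` and `and` is easy to miss. The sketch below uses the same values as the cells above (x = 3, y = 4, z = 11); the result names are made up for illustration.

```python
x, y, z = 3, 4, 11   # same values as in the cells above

# Bitwise operators work on the binary digits of the integers
bit_or = x | z       # 0b0011 | 0b1011 = 0b1011
bit_and = x & z      # 0b0011 & 0b1011 = 0b0011
print(bit_or, bit_and)

# Logical keywords combine two complete comparisons
either = (x >= y) or (z >= y)
both = (x >= y) and (z >= y)
print(either, both)
```

For combining conditions on plain numbers, `and` and `or` are usually what you want.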

## Special Constants, NA, NaN, Inf

In [None]:
print([1, None, 3])
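
The cell above only shows `None`. A minimal sketch of the other two constants, assuming NumPy's `np.nan` and `np.inf` are the ones meant here (variable names are hypothetical):

```python
import math
import numpy as np

missing = None           # Python's built-in "no value" object
not_a_number = np.nan    # a float marking a missing or undefined number
infinity = np.inf        # a float larger than any finite number

print(type(missing), type(not_a_number), type(infinity))

# NaN is not equal to anything, not even to itself, so test it with isnan()
print(not_a_number == not_a_number)
print(math.isnan(not_a_number), np.isnan(not_a_number))

# Infinity compares as you would expect
print(infinity > 1e308, -infinity < 0)
```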

## Importing
We import libraries, or only parts of them, all the time. Follow the common naming conventions when doing so (for example `np` for numpy and `pd` for pandas).

In [None]:
# https://docs.python.org/3/library/math.html
from math import sqrt

In [None]:
a = 2
b = 3

c = sqrt(a**2 + b**2)
print(c)

In [None]:
# Alternative without sqrt: raise to the power 1/2

c2 = (a**2 + b**2)**(1/2)
print(c2)
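
The three common import styles side by side, as a short sketch. All calls below exist in the standard `math` module and in NumPy; which style to prefer is a matter of taste and context.

```python
import math                  # whole module, used as math.sqrt
from math import sqrt, pi    # only selected names
import numpy as np           # whole library under its conventional alias

print(math.sqrt(16))
print(sqrt(16))
print(np.sqrt(16))
print(pi == math.pi)
```

Importing the whole module keeps it obvious where a function comes from, while `from ... import ...` saves typing for names you use often.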

## Functions
* Define a function
* Function name: pythagoras
* Arguments: a, b
* Indentation using tab (4 spaces) for the whole function
* `return` statement

In [None]:
def pythagoras(a, b):
    return sqrt(a**2 + b**2) # Notice the tab!

In [None]:
c = pythagoras(a, b)
print(c)

In [None]:
some_list = [(2,4),(6,7),(8,9),(1,6)]
pd.DataFrame(some_list)

In [None]:
[pythagoras(stuff[0],stuff[1]) for stuff in some_list]

In [None]:
# Add results as a new column
df = pd.DataFrame(some_list, columns=['a','b']) 
df['c'] = [pythagoras(stuff[0],stuff[1]) for stuff in some_list]
print(df)


**Best practice:** Adding documentation via
* Doc-string (""")
* Try placing the cursor at the function and press `<shift+tab>`

In [None]:
def pythagoras(a, b):
    """
    Computes the length of the hypotenuse of a right triangle
    
    Arguments
    a, b: the lengths of the two legs of the right triangle
    """
    
    return sqrt(a**2 + b**2)
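
Outside Jupyter, the same documentation can be read programmatically. The sketch below redefines `pythagoras` with a shortened doc-string so that it runs on its own; the built-in `help(pythagoras)` would show a formatted version of the same text.

```python
from math import sqrt

def pythagoras(a, b):
    """Computes the length of the hypotenuse of a right triangle."""
    return sqrt(a**2 + b**2)

# The doc-string is stored on the function object itself
print(pythagoras.__doc__)
print(pythagoras(3, 4))
```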

## Mini-assignment
* Construct a function that given two points $(x_1, y_1), (x_2, y_2)$ on a line computes the slope $a$ of the line
$$ y = ax + b$$
given by
$$ a = \frac{y_2- y_1}{x_2 - x_1}$$

In [None]:

def slope(x1, y1, x2, y2):
    """
    Computes the slope of a line given two points on it
    """
    return (y2 - y1) / (x2 - x1)


In [None]:
# test the function
slope(1, 2, 3, 4) 

# Flow Control (loops & friends)

Python is designed for readability, so indentation and line breaks are part of the syntax.


In [None]:
# If/else controls
x = 5 
y = 10

if (x==0):
  y = 0
  print('X is zero')
else:
  y = y/x  
  print(y)

In [None]:
# For loops
for i in range(1,x+1):
  print("OMG, i just counted to " + str(i))

In [None]:
# While loop
x = 5

while x > 0:
  print(x) 
  x = x-1

In [None]:
x = 1

while True: 
  print(x)
  x = x + 1
  if x > 7:
    break

In [None]:
even = [] # empty list
for i in range(10):
    even.append(i*2)
even

In [None]:
odd = []
for i in even:
    odd.append(i+1)
odd
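
The two loops above can also be written as list comprehensions, the same construct used earlier with `pythagoras`. This is a sketch of an equivalent formulation; `small_odd` is a made-up name that shows an optional filter condition.

```python
# Build the lists in one line each
even = [i * 2 for i in range(10)]
odd = [i + 1 for i in even]
print(even)
print(odd)

# A comprehension can also filter with a trailing "if"
small_odd = [i for i in odd if i < 10]
print(small_odd)
```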

### Mini-assignment

Write a function `KtoC` that translates Kelvin to Celsius

$$ C = K - 273.15 \quad \text{with} \quad C\geq - 273.15$$

The function returns `None` when $C < -273.15$

In [None]:
def KtoC(kelvin):
  """
  Translates Kelvin to Celsius.

  Args:
    kelvin: The temperature in Kelvin.

  Returns:
    The temperature in Celsius, or None if the temperature is below -273.15.
  """
  celsius = kelvin - 273.15
  if celsius < -273.15:
    return None
  else:
    return celsius

# Example Usage:
print(KtoC(298.15))  # Output: 25.0
print(KtoC(-100)) # Output: None
print(KtoC(0)) # Output: -273.15



# Object classes


## Vector

One-dimensional collection of values

In [None]:
# Numeric
v1 = [1,5,11,33] # [] initiate a list
v1

In [None]:
# String
v2 = ["hello","world"]
v2

In [None]:
# Boolean
v3 = [True, True, False, True]
v3

Evaluating elements in vectors

In [None]:
v1[0]

In [None]:
v1[1:3]

Manipulating vector elements

In [None]:
v1[2] = 1337
v1

When you combine vectors of different types, you obtain a list of lists (more on lists later) in which every element keeps its original type

In [None]:
v5 =[v1, v2, v3]
v5 
# Integers (numbers) are still numbers, not strings (text). Easy to see because they don't have ' '

Adding vectors concatenates them (it does not sum them element-wise)

In [None]:
v1 + v3

In [None]:
# Same for multiplication: the list is repeated, not multiplied element-wise
v1 * 2

**Element-wise operations:** To do numerical operations on vectors, use NumPy arrays (`numpy.array`). NumPy is a library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays. Here you can already see that Python comes from computer science rather than from statistics.

In [None]:
v1_array = np.array(v1)
v2_array = np.array(v2)
v3_array = np.array(v3)

In [None]:
v1_array

In [None]:
v1_array + 5

In [None]:
v1_array + v3_array

In [None]:
# Arrays of different size
v1_array + np.array([1,7])

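We cannot sum arrays of different sizes! (The exception is NumPy broadcasting, for example adding a single number as in `v1_array + 5` above.)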

In [None]:
# non-numerical arrays
v1_array + v2_array

We cannot sum numbers and words!
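
To summarise the behaviour seen above, here is a compact sketch contrasting lists and arrays. It uses a fresh array `v` (a made-up name) with the original values of `v1` and catches the size error instead of letting the cell fail.

```python
import numpy as np

v = np.array([1, 5, 11, 33])

print(v * 2)                        # array: every element is doubled
print(v + np.array([1, 1, 0, 1]))   # same length: element-wise sum
print([1, 5, 11, 33] * 2)           # plain list: repeated, not doubled

# Arrays whose shapes do not fit together raise a ValueError
try:
    v + np.array([1, 7])
except ValueError as err:
    print('ValueError:', err)
```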

**Mathematical operations over the vector:** For most maths you need to engage numpy or other modules (Python is not per se a maths language)

In [None]:
# np.sum also works directly on a plain list
np.sum(v1)

In [None]:
np.mean(v1)

In [None]:
# Standard deviation for population - DeltaDegreesOfFreedom = 0 by default
np.std(v1, ddof=0)

In [None]:
np.std(v1, ddof=1)

In [None]:
np.corrcoef(v1,v1)
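
The `ddof` argument in the two `np.std` calls above changes the divisor from N to N - ddof. The sketch below recomputes both versions by hand on a small made-up sample (`values` is hypothetical data, not part of the notebook) and compares them with NumPy.

```python
import numpy as np

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(values)
mean = sum(values) / n
squared_dev = sum((v - mean) ** 2 for v in values)

pop_std = (squared_dev / n) ** 0.5            # divisor N, matches ddof=0
sample_std = (squared_dev / (n - 1)) ** 0.5   # divisor N - 1, matches ddof=1

print(pop_std, np.std(values, ddof=0))
print(sample_std, np.std(values, ddof=1))
```

Use `ddof=1` when the data is a sample from a larger population, and `ddof=0` when it is the whole population.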

Also consider this cheat sheet

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

## Lists
* An indexable collection of variables (objects)
* C-style or 0-indexed

In [None]:
l = ['Caroline', 1.0, pythagoras]
type(l)

In [None]:
l

In [None]:
l[0]

In [None]:
type(l[0])

Common methods for lists

In [None]:
l.append(sqrt(2.0))
l

In [None]:
a = l.pop(2) # pop() removes the element at the given index from the list and returns it
a

In [None]:
l

In [None]:
l.pop(0)
l.append(100)
l.sort(reverse=True)

In [None]:
l

In [None]:
l[1] = 2
l

In [None]:
l.extend([6.0, 4])
l

## Tuples
* Immutable "lists"

In [None]:
t = (1.0, 4.0)
t, type(t)

In [None]:
t[1]

In [None]:
t[1] = 2

We cannot assign a new value to a single tuple element because tuples are immutable.
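
A sketch of the same failure caught with `try`/`except`, followed by the usual workaround of building a new tuple. The names `t_new`, `first` and `second` are made up for this example.

```python
t = (1.0, 4.0)

# Item assignment on a tuple raises a TypeError
try:
    t[1] = 2
except TypeError as err:
    print('TypeError:', err)

# Workaround: create a new tuple from the old one
t_new = (t[0], 2)
print(t_new)

# Tuples can be unpacked into separate variables
first, second = t_new
print(first, second)
```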

## Dictionaries
- Like lists with user-definable indices
- Can, like lists and tuples, contain a mix of different types of data.
- The indices can *also* be different kinds of data - unlike lists and tuples.

In [None]:
d = {'one': 1, 2: 1 + 1, 3.0: 'three'}
d

Useful methods

In [None]:
d.keys()

In [None]:
d.items()

In [None]:
some_value = d.pop(3.0)
d

In [None]:
some_value

In [None]:
d['four'] = 4
d

In [None]:
d.update({'five': 5.0, 6: 6.0})
d
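
Two further patterns that tend to come up right away are looping over a dictionary and looking up a key that may be missing. This is a sketch on a small made-up dictionary, using only the built-in `items()` and `get()` methods.

```python
d = {'one': 1, 2: 2, 'four': 4}   # hypothetical example dictionary

# Loop over key/value pairs
for key, value in d.items():
    print(key, '->', value)

# get() returns None (or a chosen default) instead of raising a KeyError
print(d.get('one'))
print(d.get('missing'))
print(d.get('missing', 0))

# Membership tests look at the keys
print('four' in d)
```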

## Data Frames

In Python, Data Frames are managed by Pandas, a very comprehensive library for data manipulation and analysis.

We will introduce it in more detail later, so here is only a brief overview:

In [None]:
# We construct the DF from a dictionary which is indicated by {'some_key':['some_values']}

df1 = pd.DataFrame(
    {'ID':range(1,5), # Python counts from 0 and the last value in a range is excluded
     'FirstName':["Jesper","Jonas","Pernille","Helle"],
     'Female':[False,False,True,True],
     'Age':[22,33,44,55]
})

In [None]:
# Python makes little use of factors (as known from R); as you can see, pandas inferred the column types from your input
df1.info()

In [None]:
df1.FirstName #dot notation

In [None]:
df1['FirstName'] #more traditional subsetting

In [None]:
df1.loc[:,'FirstName'] #more complex subsetting

In [None]:
df1.iloc[:,1] #index based

In [None]:
# Rows 1 and 2, columns 3 and 4 - the gender and age of Jesper & Jonas
df1.iloc[[0,1],[2,3]]


In [None]:
#Same thing
df1.loc[[0,1],['Female','Age']]

In [None]:
# Rows 1 and 3, all columns

df1.iloc[[0,2],:] # indices start at 0, so subtract 1 from the row number when coming from R to Python

In [None]:
# Find everyone over the age of 30 in the data (all columns are returned)
df1[df1.Age > 30]

In [None]:
# or "Query style" (There are always many ways of doing the same thing)
df1.query('Age > 30')
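
Building on the filter above, the sketch below returns only the names and combines two conditions. It rebuilds `df1` so that it runs on its own; `names`, `older_women` and `same` are made-up names. Note that conditions on columns are combined with `&` and need parentheses, unlike comparisons of plain numbers.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'ID': range(1, 5),
     'FirstName': ["Jesper", "Jonas", "Pernille", "Helle"],
     'Female': [False, False, True, True],
     'Age': [22, 33, 44, 55]
})

# Only the names of everyone over 30
names = df1.loc[df1['Age'] > 30, 'FirstName']
print(names)

# Two conditions combined element-wise with &
older_women = df1[(df1['Age'] > 30) & (df1['Female'])]
print(older_women)

# The same filter in query style
same = df1.query('Age > 30 and Female == True')
print(same)
```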