# Data Management (Spring/Summer 2018) at OSIPP, Osaka U
By Shuhei Kitamura

## Python basics

### Outline
1. Arithmetic operation
2. Run Python scripts
3. Variables and Objects
    - Variable assignment
    - Types
    - Concatenation and other operations
    - Lists
    - Dictionaries
    - Tuples
4. Functions
5. Methods
6. Modules and Packages
7. If Statements
8. Loops
9. NumPy
10. pandas
    - Get elements
    - Change contents
    - Computation
    - Treat missing values
    - Merge

## 1. Arithmetic operation
- Main operators are `+`, `-`, `*`, and `/`.
- Write `1 + 2` and execute (Shift + Enter or push "run cell" button). Next, do the same for `print(1 + 2)`.

In [3]:
print(3 + 5)

8


- If you want to write a comment, use `#`.
- Write `# print(1 + 2)` and execute.

In [4]:
#print(1 + 2)

- Compute 1 + 2 / 3 - 4 * 5 and print.

In [5]:
print(1 + 2/3-4*5)

-18.333333333333332


- Use `**` for power. Power is right associative, so `a ** b ** c` means `a ** (b ** c)`.
- Calculate $-2^3$, $3^{3^3}$, and $\frac{3}{3^3}$.

In [9]:
print(3*3/3)

3.0
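
The power exercise above can be checked with a short sketch. It uses only built-in operators; the variable names are illustrative and not part of the original exercise.

```python
# Unary minus binds less tightly than **, so -2 ** 3 means -(2 ** 3).
neg_power = -2 ** 3

# ** is right associative: 3 ** 3 ** 3 means 3 ** (3 ** 3) = 3 ** 27.
tower = 3 ** 3 ** 3

# Parentheses force the left-to-right grouping, which gives a different value.
left_grouped = (3 ** 3) ** 3

# ** is evaluated before /, so this is 3 / 27.
ratio = 3 / 3 ** 3

print(neg_power)     # -8
print(tower)         # 7625597484987
print(left_grouped)  # 19683
print(ratio)
```

Comparing `tower` with `left_grouped` shows why the grouping rule matters.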


- `%` for modulus and `//` for floor division.
- Calculate `7 % 2`. How about `7 // 2`?

In [13]:
print(7 // 2)

3


## 2. Run Python scripts
- Launch a text editor (e.g. Visual Studio Code). Create a new file.
<!--
![Alt text](https://github.com/urbsu/dm_tutorial/blob/master/sources/python7.PNG?raw=true)
-->
![Alt text](./sources/python7.png)
- Type `print(1 + 2)` in line 1, and `3 + 4` in line 2.
- Save the file in your repository. Type the file name (e.g., "mypython"), and set Python for the file type. The file extension becomes ".py".

- Check the current directory using `getcwd()` in the `os` package.
    - We will talk about packages later.
- If it is not your local repository, set the current directory by typing: `cd "C:\path_to_your_local_repository"` where you need to specify the path to your local repository.

In [1]:
import os
os.getcwd()

'C:\\Users\\User\\Documents\\dm-tutoria'

- Run a Python script by typing: `run mypython.py`. 
- Open "mypython.py". Change line 2 to `print(3 + 4)`, save, and close. 
- Execute `run mypython.py` again. Do you see the difference?

In [17]:
run mypython.py

3


- Let's push the new file to your remote repository. Start SourceTree.
- Stage "mypython.py".
- Commit and push (commit message = e.g. "Create mypython.py.").

## 3. Variables and Objects
- You make a variable and assign your "data" to it. "Data" are also called objects in Python.
- For example, `x = 1` creates a variable with a name `x`, and assigns object `1` to that variable.
    - In a fresh session where `x` has not been defined yet, try `x = x + 1`. You should get an error. Why?

#### - Variable assignment
- To assign an object to a variable, type like this:
```python
mysum = 1 + 2
print(mysum)
```
- If you do not need to define a variable, just type:
```python
print(1 + 2)
```

In [19]:
x = 1 + 6
print(x)

7


- You can make more than one variable at once.
- You can also print several values at once.
- Try:
```python
a, b = 1, 2
c = d = 1
print(a, b)
print(a); print(b)
print(c, d)
```

In [20]:
a, b = 1, 2
c=d=1
print(a, b)
print(a); print(b)
print(c, d)

1 2
1
2
1 1


#### - Types
- An object has a unique type. Major types are: numbers (floats, integers), strings, booleans, lists, tuples, and dictionaries.
- Use `type()` to check the type.
- Check the type of:
```python
myint = 1
myfloat = 1.0
mystr = "Hello"
mybool = True
```

In [34]:
type(True)

bool

- What are the types of `True` and `"True"`?

In [37]:
type("True")

str

- If a string includes a quote like `"This is a "quote""`, it returns an error. In this case, use `'This is a "quote"'` instead.
- Try both:

In [43]:
type('This is a "quote"')

str

- What are the outcomes of `1 + 2` and `1 + 2 == 3`? Print them. Also check the types.

In [51]:
print(1 + 2==4)

False


- You can change the type of an object by using `float()`, `int()`, `str()`, and `bool()`.
- Write `myint = 1`, then print. Convert to float, then print again.

In [65]:
myint = 1
str(myint)

'1'

- These functions return a new object representing the converted value, rather than changing the object itself.
    - This applies to *immutable* types, i.e., float, int, str, and tuple.
- To see this, type:
```python
print(1)
print(float(1))
print(type(float(1)))
```

In [66]:
print(1)
print(float(1))
print(type(float(1)))

1
1.0
<class 'float'>


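- `False` equals `0` and `bool(0)` is `False`. `True` equals `1` and `bool(1)` is `True`.
- Most other objects are also converted to `True`. Exceptions include `0`, empty strings, empty lists and other empty containers, and `None`.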
- Try:
```python
print(int(False),bool(0))
print(int(True),bool(1))
print(bool("Hello"))
```

- The type of `None` is `NoneType`. `bool(None)` is `False`.
- Type:
```python
print(type(None))
print(bool(None))
```
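
As a rough illustration of the conversion rules above, the following sketch applies `bool()` to a handful of objects. The two lists `falsy` and `truthy` are names chosen here for illustration.

```python
# Objects that bool() converts to False: zeros, empty containers, and None.
falsy = [0, 0.0, "", [], {}, (), None]

# Non-zero numbers and non-empty objects are converted to True,
# even the string "False" and a list that only contains 0.
truthy = [1, -1, "False", [0], {"a": 1}]

for obj in falsy + truthy:
    print(repr(obj), "->", bool(obj))

# Equality with 0 and 1 holds for the booleans themselves.
print(False == 0, True == 1)  # True True
```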

#### - Concatenation and other operations
- You can use `+` and `*` for strings and booleans.
- Try:
```python
print("He" + "l" * 2 + "o!") 
print(True + False) 
```
- What is the outcome of `True * 2`? Try it.

- You cannot include both numbers and strings at once with `+` operator.
- Try:
```python
print("I have " + savings + " USD in my account.")
```
- Next, try:
```python
print("I have " + str(savings) + " USD in my account.")  
```
- You often use the above form or the following form when you use a loop.
```python
print("I have ", savings, " USD in my account.")  
```

In [74]:
savings = 100
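
The two forms above can be compared side by side. The sketch below also shows an f-string, which is not used in the original text and assumes Python 3.6 or later.

```python
savings = 100

# Explicit conversion with str() before concatenating.
by_concat = "I have " + str(savings) + " USD in my account."

# f-string (Python 3.6+): the value is converted inside the braces.
by_fstring = f"I have {savings} USD in my account."

print(by_concat)
print(by_fstring)

# print() puts a space between its arguments, so the spaces inside the
# string literals are dropped here to avoid double spaces.
print("I have", savings, "USD in my account.")
```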

#### - Lists
- A list is another object type (like str, float, etc.).
- A list can contain any Python type including a list.
- Print the following list. Also, check the type.

In [6]:
mylist1 = ["sam", 1.75, "sara", 1.82] # mylist1 contains strings and numbers.
mylist2 = [["sam", 1.75], ["sara", 1.82]] # mylist2 contains lists, and each list contains a string and a number.
print(mylist1)

['sam', 1.75, 'sara', 1.82]


- An empty list can be made by writing, e.g., `mylist = []`.
- Make an empty list and check the type.

TypeError: 'mylist' is an invalid keyword argument for this function
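
A minimal sketch of the exercise. An error like the one shown can appear when the assignment is written inside a function call, for example `print(mylist = [])`, because Python then reads `mylist` as a keyword argument. That is a guess about the cause, not something recorded in the cell. Assigning first and inspecting afterwards avoids it.

```python
mylist = []          # assign the empty list first
print(mylist)        # []
print(type(mylist))  # <class 'list'>
print(len(mylist))   # 0
```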

- You can make more than one list at once.

In [None]:
list1, list2 = [1,2], [3,4]
print(list1, list2)

- Use brackets like `mylist[]` to subset a list. For a list of lists, use `mylist[][]`, etc. 
- A list is indexed from 0 (zero-based indexing)!!
- Get the first element in `mylist1` ("sam"). Also get "sam" in `mylist2`.

In [5]:
mylist1 = ["sam", 1.75, "sara", 1.82] 
mylist2 = [["sam", 1.75], ["sara", 1.82]]

- Arithmetic operation is possible.
- Calculate the sum of heights in `mylist`.

In [25]:
mylist = ["sam", 1.75, "sara", 1.82] 

- To slice a list, use `mylist[start:end]`, i.e., brackets and a colon. The rule is `mylist[inclusive:exclusive]`.
- You can also use `-` in `[:]`, which specifies the index from the end.
- Get the first and the second element in `mylist` using `[:]`.
- Get elements from the second to the end in `mylist` using `[:]`.
- Repeat the same exercises using `-`.

In [9]:
mylist = ["sam", 1.75, "sara", 1.82] 
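
One possible way to work through the slicing exercises is sketched below. The variable names are illustrative.

```python
mylist = ["sam", 1.75, "sara", 1.82]

# Start index is inclusive, end index is exclusive.
first_two = mylist[0:2]
from_second = mylist[1:]

# Negative indices count from the end of the list.
first_two_neg = mylist[:-2]
from_second_neg = mylist[-3:]

print(first_two)        # ['sam', 1.75]
print(from_second)      # [1.75, 'sara', 1.82]
print(first_two_neg)    # ['sam', 1.75]
print(from_second_neg)  # [1.75, 'sara', 1.82]
```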

- You can also subset and slice a string.
- Try:
```python
print("sara"[0])
print("sara"[0:2])
```

In [12]:
print("sara"[1])

a


- You can change the element of a list by typing `mylist[location] = new_value`.
- Change "sam" in `mylist` to "robert".

In [14]:
mylist = ["sam", 1.75, "sara", 1.82]
mylist[0] = "robert" # replace the first element in place

- You can add a list to another list using `+` operator.
- Add `["bob",1.85]` to `mylist1`.
- Try the same for `mylist2`.

In [None]:
mylist1 = ["sam", 1.75, "sara", 1.82] 
mylist2 = [["sam", 1.75], ["sara", 1.82]] 

- You cannot add a number to a list.
- Try adding some number to `mylist` using `+` operator.

In [None]:
mylist = ["sam", 1.75, "sara", 1.82] 
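
The two rules about `+` can be sketched together. The names `flat`, `nested`, and `message` are chosen here for illustration, and the `try`/`except` construct is not covered in this tutorial; it is used only so that the example keeps running after the error.

```python
mylist1 = ["sam", 1.75, "sara", 1.82]
mylist2 = [["sam", 1.75], ["sara", 1.82]]

# list + list creates a new, longer list.
flat = mylist1 + ["bob", 1.85]

# For a list of lists, wrap the new pair in another list.
nested = mylist2 + [["bob", 1.85]]

print(flat)
print(nested)

# list + number is not defined and raises a TypeError.
try:
    mylist1 + 1.85
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```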

- The `*` operator can be used for lists. What are the outcomes of `[1] * 4` and `["sara"] * 4`?
- Also try `[1 * 4]` and `["sara" * 4]`.

In [15]:
print([1] * 4)
print(["sara"] * 4)

[1, 1, 1, 1]
['sara', 'sara', 'sara', 'sara']

- To delete an element in the list, use `del(mylist[location])`.
- Delete the second element in `mylist`.

In [16]:
mylist = ["sam", 1.75, "sara", 1.82] 
del(mylist[1])
print(mylist)

['sam', 'sara', 1.82]


- You can check if an element is in the list using `in`.
- You can apply the same method for strings.
- Try:
```python
print("sara" in mylist1)
print("sara" in mylist2)
print("r" in "sara")
```

In [34]:
mylist1 = ["sam", 1.75, "sara", 1.82] 
mylist2 = [["sam", 1.75], ["sara", 1.82]] 
print("sara" in mylist1)
print("sara" in mylist2[1])
print("r" in "sara")

- When you manipulate a list, the object itself changes.
- This applies to *mutable* types, i.e., lists and dictionaries.
    - While float, int, str, and tuple are *immutable* types.
- Different objects are assigned:
```python
mylist1 = ["a", "b", "c"]
mylist2 = ["a", "b", "c"]
```
- The same object is assigned:
```python
myint1 = 1
myint2 = 1
```
- Let's see the difference.

In [17]:
mylist1 = ["a", "b", "c"]
print(mylist1)
mylist2 = mylist1
mylist2[1] = "d"
print(mylist1)

['a', 'b', 'c']
['a', 'd', 'c']


In [18]:
myint1 = 1
print(myint1)
myint2 = myint1
myint2 = 2 
print(myint1)

1
1


- To avoid this, use `[:]`.
- Repeat the same thing using `mylist2 = mylist1[:]`.

In [3]:
mylist1 = ['a', 'b', 'c']
mylist2 = mylist1[:]
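
The difference between assigning a list and copying it with `[:]` can be sketched as follows. The names `alias` and `copied` are illustrative. Note that `[:]` makes a shallow copy, so nested lists inside the list would still be shared.

```python
mylist1 = ["a", "b", "c"]

alias = mylist1      # a second name for the same list object
copied = mylist1[:]  # a new list with the same elements

copied[1] = "d"      # changes only the copy
alias[2] = "z"       # changes the original as well

print(mylist1)  # ['a', 'b', 'z']
print(copied)   # ['a', 'd', 'c']
print(alias is mylist1, copied is mylist1)  # True False
```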

#### - Dictionaries
- A dictionary is another object type.
- A dictionary contains a set of key-value pairs.
- You can access a value using the associated key.
- Keys have to be immutable types, i.e., numbers, strings, booleans, etc.
    - For example, a list cannot be a key.
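- You cannot access dictionary values by position (index); you always look them up by key!! (Since Python 3.7, dictionaries do keep the order in which keys were inserted.)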
- Get the height of "sara" in `mydict`.

In [19]:
mydict = {"sam":1.75, "sara":1.82}
mydict["sara"]

1.82

- It's possible to make a dictionary of dictionaries.
- You can access an element using `mydict[][]`, etc.
- Get the height of "sara" in `mydict`.

In [20]:
mydict = {"sam":{"height":1.75, "weight":80.0}, "sara":{"height":1.82, "weight":85.0}}
mydict["sara"]["height"]


1.82

- You can get all keys of a dictionary using `mydict.keys()`.
- Use `mykey in mydict` to check if `mydict` contains `mykey`.
- Check all keys in `mydict`.
- Check all keys in `mydict["sam"]`.
- Check if "height" is a key in `mydict`.
- Check if "height" is a key in `mydict["sam"]`.

In [1]:
mydict = {"sam":{"height":1.75, "weight":80.0}, "sara":{"height":1.82, "weight":85.0}}
print(mydict)

{'sam': {'height': 1.75, 'weight': 80.0}, 'sara': {'height': 1.82, 'weight': 85.0}}
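
A possible solution sketch for the four checks above. `outer_keys` and `inner_keys` are names introduced here for illustration.

```python
mydict = {"sam": {"height": 1.75, "weight": 80.0},
          "sara": {"height": 1.82, "weight": 85.0}}

outer_keys = list(mydict.keys())         # keys of the outer dictionary
inner_keys = list(mydict["sam"].keys())  # keys of the inner dictionary

print(outer_keys)
print(inner_keys)

# "in" looks only at the keys of the dictionary it is applied to.
print("height" in mydict)         # False
print("height" in mydict["sam"])  # True
```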


- You can add a key-value pair to the dictionary using `mydict[key] = value`.
- The same method also allows you to modify values.
- To delete a pair, use `del(mydict[key])`.
- Add "lisa" whose height is 1.65 to `mydict`.
- Delete "lisa" (and the associated height).

In [46]:
mydict = {"sam":1.75, "sara":1.82}
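
Adding, modifying, and deleting a pair can be sketched like this. The modified value for "sam" is made up for the example.

```python
mydict = {"sam": 1.75, "sara": 1.82}

mydict["lisa"] = 1.65  # new key: adds a pair
print(mydict)

mydict["sam"] = 1.76   # existing key: overwrites the value
print(mydict)

del mydict["lisa"]     # removes the key and its value
print("lisa" in mydict)  # False
```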

- If keys are not unique, only the last (rightmost) value is kept.

In [None]:
mydict = {"sam":1.75, "sara":1.82, "sam":1.95}
print(mydict)

#### - Tuples
- A tuple is similar to a list except that you cannot change its elements (tuples are immutable).
- Try to change the second element in `mytuple` to 4. You should get an error.

In [28]:
mytuple = (1, 2, 3)
print(mytuple[1])

2


- You can slice and subset a tuple.

In [None]:
mytuple = (1, 2, 3)
print(mytuple[1])
print(mytuple[0:2])
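
The sketch below shows what happens when you try to change a tuple, and one way to build a modified tuple instead. `newtuple` is an illustrative name, and `try`/`except` is used only to keep the example running after the error.

```python
mytuple = (1, 2, 3)

# Item assignment is not allowed for tuples.
try:
    mytuple[1] = 4
except TypeError as err:
    print("TypeError:", err)

# Build a new tuple from slices of the old one instead.
newtuple = mytuple[:1] + (4,) + mytuple[2:]
print(mytuple)   # (1, 2, 3)
print(newtuple)  # (1, 4, 3)
```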

## 4. Functions
- Functions return outputs given inputs. They also allow options if available.
- Useful built-in functions include: print(), float(), int(), str(), len(), max(), min(), round(), sorted(), etc. 
    - Some functions are used only for specific types.
    - (Strictly speaking, callables like float(), int(), str() are not functions. They are called class constructors.)
- Print the maximum height in `heights` using `max()`. Can you do the same for `mylist`?
- Print the length of `mylist` using `len()`.
- Print the length of "sam" in `mylist` using `len()`.

In [35]:
mylist = ["sam", 1.75, "sara", 1.82]
heights = [1.75, 1.82]
max(heights)
print(len(mylist))
print(len("sara"))
help(round)

4
4
Help on built-in function round in module builtins:

round(...)
    round(number[, ndigits]) -> number
    
    Round a number to a given precision in decimal digits (default 0 digits).
    This returns an int when called with one argument, otherwise the
    same type as the number. ndigits may be negative.



- To see a help file, type `?` before or after a function name, or write `help()`.
- Try `?round`, `round?`, and `help(round)`.

In [42]:
round(1.84,0)

2.0

- It returns `round(number[, ndigits]) -> number` and an explanation about the function. This means that the input is a number and the output is the rounded number. `ndigits` is an optional argument that specifies the precision in decimal digits.
- For example, `round(1.85, 1)` returns the closest float number with one decimal digit, i.e., 1.9. What about `round(1.84,1)`?
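
The effect of `ndigits` on both the value and the return type can be sketched as follows. The variable names are illustrative.

```python
no_digits = round(1.84)       # ndigits omitted: the result is an int
zero_digits = round(1.84, 0)  # ndigits given: the result stays a float
one_digit = round(1.84, 1)    # keep one decimal digit

print(no_digits, type(no_digits))      # 2 <class 'int'>
print(zero_digits, type(zero_digits))  # 2.0 <class 'float'>
print(one_digit)                       # 1.8
```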

- You can make a function easily. The following example uses `x` and `y` as inputs and returns `x + y`.

In [45]:
def mysum(x, y): 
    return x + y
print(mysum(2, 3))
print(mysum(10, 3))

5
13


- You can also use lambda functions.

In [46]:
mysum = lambda x, y: x + y
print(mysum(2, 3))

5


- Make a function which returns the type of an input.

In [47]:
def myfunc(x):
    return type(x)
print(myfunc("sara"))

<class 'str'>


## 5. Methods
- Methods are functions attached to an object.
- Useful methods include: index(), count(), append(), remove(), reverse(), sort(), capitalize(), upper(), replace(), etc.
- You call a method by writing `object.method()`.

In [48]:
mylist = ["sam", 1.75, "sara", 1.82]
heights = [1.82, 1.75]
print(mylist)
print(mylist.index("sara")) # returns the index of (the first) "sara"
print(mylist[2].index("a")) # returns the index of (the first) "a" in "sara"
print(mylist.count("sara")) # counts the number of "sara" in mylist
print(mylist[2].count('a')) # counts the number of "a"s in "sara"
mylist.append('ken'); mylist.append(1.95) # append
print(mylist)
mylist.remove('ken'); mylist.remove(1.95) # remove
print(mylist)
mylist.reverse() # reverse the order
print(mylist)
heights.sort() # sort works for numbers.
print(heights)
# mylist.sort() # This returns an error.

['sam', 1.75, 'sara', 1.82]
2
1
1
2
['sam', 1.75, 'sara', 1.82, 'ken', 1.95]
['sam', 1.75, 'sara', 1.82]
[1.82, 'sara', 1.75, 'sam']
[1.75, 1.82]


- Some methods are attached only to a specific object type.

In [1]:
mylist = ["sam", 1.75, "sara", 1.82]
heights = [1.82, 1.75]
# print(mylist.upper()) # This returns an error.
print(mylist[2].upper()) # same as 'sara'.upper()
print(mylist[2].capitalize()) # same as 'sara'.capitalize()
print(mylist[2].replace("s","k")) # replace 's' with 'k' in 'sara'
print(mylist) # None of these methods changes the original list.
mylist[2] = mylist[2].upper() # You have to assign the output to the same variable.
print(mylist)

SARA
Sara
kara
['sam', 1.75, 'sara', 1.82]
['sam', 1.75, 'SARA', 1.82]


- You can check methods using `dir()`.

In [49]:
mylist = ["sam", 1.75, "sara", 1.82]
print(dir(mylist))
print(dir(mylist[0]))
print(dir(mylist[1]))

['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'cap

- `?object.method`, `object.method?`, or `help(object.method)` to see a help file.
- Get the help file for `upper` for strings using `mylist` as the object.

In [50]:
mylist = ["sam", 1.75, "sara", 1.82]
?mylist[0].upper

## 6. Modules and Packages
- You can import useful modules using `import module`. 
    - It is like defining a `module` object and enabling functions attached to the module object.
- Separate module names with `,` to import multiple modules at once.

In [2]:
import math, numpy
print(math.radians(45)) # convert degrees to radians
print(numpy.radians(45)) # convert degrees to radians
# (Note that NumPy is not a module, but a package. NumPy should already be installed if you installed Python using Anaconda. If not, open Command Prompt (for Windows), then type `pip3 install numpy` (for Python 3) or `pip install numpy` (for Python 2) to install NumPy manually. Anaconda users can use `conda install numpy` to install and update NumPy.)

0.7853981633974483
0.7853981633974483


- Use `import module as myname` to define the module name as `myname`.

In [3]:
import math as m, numpy as np
print(m.radians(45))
print(np.radians(45))

0.7853981633974483
0.7853981633974483


- Alternatively, use `from module import function` to import a specific function, or use `from module import *` to import all functions from the module. The latter option is not always recommended. The reason:
```python
from math import *
from numpy import *
deg = arange(12.) * 30.
print(radians(deg))
```
vs.
```python
from numpy import *
from math import *
deg = arange(12.) * 30.
print(radians(deg))
```
- You will get an error in the second case because `radians` in that case is `math.radians`, which does not allow a numpy array. 

In [4]:
from math import *
from numpy import *
deg = arange(12.) * 30.
print(radians(deg))

[0.         0.52359878 1.04719755 1.57079633 2.0943951  2.61799388
 3.14159265 3.66519143 4.1887902  4.71238898 5.23598776 5.75958653]
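
The same name clash can be reproduced with the standard library alone. The sketch below uses `math` and `cmath` instead of NumPy (a substitution made here so the example needs no extra package) and imports one function explicitly rather than using `*`; the mechanism is the same: the later import wins.

```python
import math
import cmath

from math import sqrt   # sqrt refers to math.sqrt
from cmath import sqrt  # the same name now refers to cmath.sqrt

# The later import silently replaced the earlier one.
print(sqrt is cmath.sqrt)  # True
print(sqrt(-1))            # 1j

# math.sqrt does not accept negative numbers.
try:
    math.sqrt(-1)
except ValueError as err:
    print("ValueError:", err)
```

Using `import math` and writing `math.sqrt` keeps it visible which function is being called.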


- Separate function names with `,` to import multiple functions at once.

In [5]:
from math import radians, sin, cos
print(radians(45))

0.7853981633974483


- A package contains modules, functions, etc. There are 1,274,332 packages (files) (as of 2018/04/29) in [PyPI](https://pypi.org/).
- To install a package to your local machine, you can use either `pip install` or `conda install` in Command Prompt (for Windows).

- You can import a module/function from a package using `import package.module`, `from package import module` or even `from package import function`.

In [1]:
from numpy import ndarray # ndarray is a class (the array type) in the NumPy package.

## 7. If Statements
- Python uses colons and indents for if statements.

In [1]:
inp = "Hello"
if inp == "Hello": 
    print("World!") # indent + block

World!


- If statements can be complex like this.

In [3]:
inp = "Hello"
if inp == "Hell":
    print("No.")
else:
    if inp == "He":
        print("No.")
    else:
        print("World!")  

World!


- The above example can be simplified by using `elif`.

In [None]:
inp = "Hello"
if inp == "Hell":
    print("No.")
elif inp == "He":
    print("No.")
else:
    print("World!")

- Alternatively, you can even simplify them using a lambda function.

In [3]:
myfunc = lambda inp: "No." if inp == "Hell" else ("No." if inp == "He" else "World!")
print(myfunc("Hello"))

World!


- Relational operators include: `==`, `!=`, `>`, `<`, `>=`, and `<=`.
- You can also combine them with `and`, `or`, `not`, and `in`.

In [12]:
year = 2018
if year > 2017 and year < 2019:
    print('2018!')

year = "2017"
if not year == 2017: # be careful about the type
    print('Not 2017!')

mylist = ["sam", 1.75, "sara", 1.82]
if "sam" in mylist:
    print("sam in mylist!")

if "lisa" not in mylist:
    print("lisa not in mylist!")

2018!
Not 2017!
sam in mylist!
lisa not in mylist!


- The difference between `==` and `is` is that `==` tests for logical equality, while `is` tests for object identity.
- For `None`, you should use `is`. That's the rule.
<!-- see, https://www.python.org/dev/peps/pep-0008/#programming-recommendations -->

In [14]:
print(True is 1) # object identity
print(True == 1) # logical equality
x = 1 # immutable
y = 1
print(x is y, x == y)
x = 2
y = 5
print(x is y, x == y)
x = [1,2] # mutable
y = [1,2]
print(x is y, x == y)
x = [1,2]
y = x
print(x is y, x == y)

False
True
True True
False False
False True
True True


- Make a function which returns the type of an input if the option is 1, and returns the input itself otherwise.

In [16]:
myfunc = lambda inp, opt: type(inp) if opt == 1 else inp
print(myfunc("sara", 1), myfunc("sara", 0))

<class 'str'> sara

## 8. Loops
- You can use `for` or `while` for loops.
- `for` loops often use `range()`.

In [20]:
print("--- loop to 9 ---")
for item in range(10):
    print("Current number is ",item)

print("--- loop from 1 to 4 ---")
for item in range(1,5):
    print("Current number is ", item)
    
print("--- loop from 2 to 8 by an increment of 2 ---")
for item in range(2,9,2):
    print("Current number is ", item)

--- loop to 9 ---
Current number is  0
Current number is  1
Current number is  2
Current number is  3
Current number is  4
Current number is  5
Current number is  6
Current number is  7
Current number is  8
Current number is  9
--- loop from 1 to 4 ---
Current number is  1
Current number is  2
Current number is  3
Current number is  4
--- loop from 2 to 8 by an increment of 2 ---
Current number is  2
Current number is  4
Current number is  6
Current number is  8


- You can also loop over strings, lists, dictionaries, and tuples.

In [7]:
print("--- loop over strings ---")
mystr = "Hello World!"
for x in mystr:
    print(x)

print("--- loop over a list ---")
mylist = ["sam", 1.75, "sara", 1.82]
for x in mylist:
    print(x)

print("--- loop over a list of lists ---")
mylist_of_lists = [["sam", 1.75], ["sara", 1.82]]
for mylist in mylist_of_lists:
    for x in mylist:
        print(x) 

print("--- loop over a dictionary ---")        
mydict = {"sam":1.75, "sara":1.82}
for x in mydict:
    print(x)
    
print("--- loop over a tuple ---")        
mytuple = (1, 2, 3)
for x in mytuple:
    print(x)

--- loop over strings ---
H
e
l
l
o
 
W
o
r
l
d
!
--- loop over a list ---
sam
1.75
sara
1.82
--- loop over a list of lists ---
sam
1.75
sara
1.82
--- loop over a dictionary ---
sam
sara
--- loop over a tuple ---
1
2
3


- Loops can be used inside a list. This is called a list comprehension.

In [25]:
mylist = [x for x in range(8)]
print(mylist)

[0, 1, 2, 3, 4, 5, 6, 7]


- Make a loop inside a list to change an element in `mylist` to one if the element is zero.

In [21]:
mylist = [0, 2, 3, 4]
mylist[0] = 1
print(mylist)
computer_brands = ["Apple", "Asus", "Dell", "Samsung"]
for brands in computer_brands:
    print(brands)
developped_country = ['japan', 'usa', 'germany', 'haiti', 'france']
for country in developped_country:
    print(country)

[1, 2, 3, 4]
Apple
Asus
Dell
Samsung
japan
usa
germany
haiti
france
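
One way to solve the exercise with a loop inside a list is sketched below. It combines a list comprehension with the one-line `if ... else` form shown in the lambda examples earlier. The extra zero at the end of `mylist` is added here to show that every zero is replaced.

```python
mylist = [0, 2, 3, 4, 0]

# Keep each element as it is, except zeros, which become one.
replaced = [1 if x == 0 else x for x in mylist]

print(mylist)    # [0, 2, 3, 4, 0]  (the original list is unchanged)
print(replaced)  # [1, 2, 3, 4, 1]
```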


- To append elements in a list, use `append()` in a loop.

In [None]:
z = []
for i in range(0,10):
    z.append(i)
print(z)
for i in range(1,10):
    if i == 3:
        break
    print(i)
for i in range(1,10):
    if i == 3:
        continue
    print(i)

- You may often use `break` and `continue` in a `while` loop.
    - `break` means that you will exit the current loop.
    - `continue` means that you will skip the rest of the current iteration and go back to the start of the loop.

In [None]:
print("--- a loop till 10 ---")
cnt = 0
while cnt < 10:
    print(cnt)
    cnt += 1 # same as cnt = cnt + 1.

print("--- a loop till 10 except 5 ---")
cnt = 0
while cnt < 10:
    cnt += 1
    if cnt == 5:
        continue
    print(cnt)

print("--- a loop till 5 ---")    
cnt = 0
while cnt < 10:
    print(cnt)
    cnt += 1
    if cnt > 5:
        break

- For infinite loops, use `while True`.

In [None]:
print("--- infinite loop till 10 ---")
cnt = 0
while True:
    print(cnt)
    cnt += 1
    if cnt > 10:
        break      

- Make a function which returns "Five!" when the input is 5, and "." otherwise. Then use that function in a loop ranging from 1 to 9.

In [3]:
myfive = lambda x: "Five!" if x == 5 else "."
for x in range(1, 10):
    print(myfive(x))

.
.
.
.
Five!
.
.
.
.

## 9. NumPy
- NumPy (Numerical Python) is the fundamental package for numerical computation in Python.

In [6]:
import numpy as np

- Recall that lists do not allow element-by-element calculation.
- Such calculation is possible for NumPy arrays.
- Calculate bmi using 
```python
bmi = weight / height ** 2
print(bmi)
```
- Then do the same using `np_height` and `np_weight`.

In [6]:
height = [1.80, 1.76, 1.64] 
weight = [80, 75, 60]
np_height = np.array(height) # convert height to a NumPy array (use ?np.array to get a help file)
np_weight = np.array(weight)
bmi = np_weight / np_height**2
print(bmi)

[24.69135802 24.21229339 22.30814991]
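
For comparison, the sketch below shows what happens with plain lists and how the same element-by-element result could be obtained without NumPy. `bmi_list` is an illustrative name, and `zip()` is a built-in not introduced in this tutorial.

```python
height = [1.80, 1.76, 1.64]
weight = [80, 75, 60]

# Lists do not support element-by-element arithmetic.
try:
    bmi = weight / height ** 2
except TypeError as err:
    print("TypeError:", err)

# Without NumPy, pair up the elements and compute one by one.
bmi_list = [w / h ** 2 for w, h in zip(weight, height)]
print(bmi_list)
```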


- For lists, the `+` operator joins two lists into one longer list. For NumPy arrays, it adds the arrays element by element.
- Try:
```python
print(height + weight)
print(np_height + np_weight)
```

In [9]:
height = [1.80, 1.76, 1.64] 
weight = [80, 75, 60]
np_height = np.array(height) 
np_weight = np.array(weight)
print(np_height + np_weight)
print(height + weight)

[81.8  76.76 61.64]
[1.8, 1.76, 1.64, 80, 75, 60]


- A NumPy array can contain only one type. If there are multiple types, they are converted to a single type.
- Types are ordered: strings > floats > integers > booleans
- Check the type of the first element in `myarray1`, `myarray2`, and `myarray3`.

In [17]:
myarray1 = np.array([1, 1.0, "1", True]) # a list contains four types
myarray2 = np.array([1, 1.0, True]) # three types
myarray3 = np.array([1, True]) # two types
print(type(myarray1[0]))
print(type(myarray2[0]))

<class 'numpy.str_'>
<class 'numpy.float64'>
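
The conversion rule can also be inspected for the whole array at once through its `dtype` attribute, which is not used in the original cell. A minimal sketch, assuming NumPy is installed:

```python
import numpy as np

myarray1 = np.array([1, 1.0, "1", True])  # contains a string: all become strings
myarray2 = np.array([1, 1.0, True])       # contains a float: all become floats
myarray3 = np.array([1, True])            # integer and boolean: all become integers

for arr in (myarray1, myarray2, myarray3):
    print(arr, arr.dtype)
```

The exact integer type printed for `myarray3` depends on the platform.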


- NumPy arrays can be more than one dimensional.
    - You can check the number of dimensions of an array using `np.ndim(myarray)` and the shape of an array using `np.shape(myarray)`.
    - You can also write `myarray.ndim` and `myarray.shape`. These are attributes of the array, not methods, so no parentheses are needed.
- Two dimensional NumPy arrays are also called 2D NumPy arrays.
- Print the number of dimensions and the shape of `np_2d` and `np_3d`.
    - Try both `np.ndim(myarray)` and `myarray.ndim`.

In [15]:
np_2d = np.array([[1,2,3],[4,5,6]])
np_3d = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
print(np.ndim(np_2d))
print(np.ndim(np_3d))
print(np_2d.ndim)

2
3
2


- To access each element, use `myarray[]`, `myarray[][]`, etc.
    - You can also use `myarray[,]`.
- Get the second element in the first array in `np_2d`.
    - Try both `myarray[][]` and `myarray[,]`.

In [16]:
np_2d = np.array([[1,2,3],[4,5,6]])
print(np_2d[0][1])
print(np_2d[0,1])

2
2


- Useful functions include: `mean()`, `median()`, `std()`, `sum()`, `zeros()`, `ones()`, `round()`, `corrcoef()`.

In [17]:
list_2d = [[1,2,3],[4,5,6]]
np_2d = np.array(list_2d)
print(np.mean(np_2d)) # mean of all elements
print(np.mean(np_2d[:,0])) # mean of first column [1,4]
print(np.mean(np_2d[0,:])) # mean of first row [1,2,3]

myzeros = np.zeros((2, 3, 4)) # make an array with zeros
print(myzeros)

np_gender = np.array(["male","female","male","male","female"])
np_heights = np.array([191,160,185,190,170])
np_median_mheights = np.median(np_heights[np_gender == "male"]) # median of male heights
np_median_fheights = np.median(np_heights[np_gender == "female"]) # median of female heights
print("male median heights: ", np_median_mheights, " female median heights: ", np_median_fheights)

x = np.round(np.random.normal(1.75,0.20,50),2) # generate random values with mean = 1.75, std = 0.20, obs.= 50 and round them to two decimal places.
y = np.round(np.random.normal(65.32,15.0,50),2)
print(x)
print(y)
print(np.corrcoef(x,y)) # correlation of x and y

3.5
2.5
2.0
[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]]
male median heights:  190.0  female median heights:  165.0
[1.63 1.72 1.85 1.61 1.58 1.63 1.67 1.96 1.58 1.64 1.88 1.61 1.65 1.73
 1.67 2.07 1.85 1.63 1.85 1.55 1.53 1.38 2.13 1.93 1.88 1.84 1.78 1.69
 1.75 1.51 1.83 1.52 1.89 1.66 1.47 1.44 1.35 2.05 2.01 1.86 1.64 1.6
 1.97 1.67 1.69 1.61 1.86 1.7  2.14 1.72]
[ 88.8   77.52  45.64  49.31  74.36  74.4   58.48  65.82  31.18  39.02
  62.01  29.84  60.06  80.28  72.48  80.89  61.61  80.93 113.71  62.49
  65.09  94.61  81.54  61.18  46.74  80.87  74.23  81.88  89.    46.94
  76.33  84.22  55.19  71.73  64.1   47.75  51.84  61.14  45.04  57.
  86.3   53.65  48.71  61.89  77.52  98.03  73.85  47.62  47.3   60.37]
[[ 1.         -0.01155242]
 [-0.01155242  1.        ]]


- Some NumPy functions such as `np.sum()` are faster than generic functions such as `sum()` if data are already a NumPy array.

In [18]:
list_2d = [[1,2,3],[4,5,6]]
np_2d = np.array(list_2d)
print(np.sum(np_2d)) # sum of all elements using NumPy sum()
print(np.sum(list_2d))  # you can apply NumPy sum() for lists (not recommended)
print(sum([sum(x) for x in list_2d])) # sum of all elements using generic sum()

21
21
21


## 10. pandas
- pandas is a very useful Python package made for handling data. It's built on NumPy.
- (pandas should already be installed if you installed Python using Anaconda. If not, open Command Prompt, then type `pip3 install pandas` (for Python 3) or `pip install pandas` (for Python 2) to install pandas manually. Anaconda users can use `conda install pandas` to install and update pandas.)

In [1]:
import pandas as pd

- You can import csv data (i.e., comma separated data) using `pd.read_table()`. This function returns the data as a pandas DataFrame.
    - Be sure to include `sep=','` for csv files. The default is `sep='\t'`.
    - `pd.read_csv()` works the same way, but its default is `sep=','`.
    - For tsv data (i.e., tab separated data), you do not need to include any `sep` option. If you want, you can write `sep='\t'` (where `\` is a backslash).
    - If you add `index_col = 0` option, pandas recognizes that the first column is an index (identifier).

- The following commands read `cars.csv` and `cars.txt` stored in your local repository.
    - If your data are stored in another place, you need to specify the path.

In [8]:
mycars = pd.read_table('cars.csv', sep=',')
print(mycars)
print(type(mycars))
mycars = pd.read_csv('cars.csv', sep=',', index_col = 0)
print(mycars)
mycars = pd.read_table('cars.txt', index_col = 0)
print(mycars)

    id  cars_per_cap        country  drives_right
0   US           809  United States          True
1  AUS           731      Australia         False
2  JAP           588          Japan         False
3   IN            18          India         False
4   RU           200         Russia          True
5  MOR            70        Morocco          True
6   EG            45          Egypt          True
<class 'pandas.core.frame.DataFrame'>
     cars_per_cap        country  drives_right
id                                            
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
     cars_per_cap        country  drives_right
id                                            
US            809  United States          Tru

- To get the index, use `df.index` or `df.index.values`.
    - To check whether the index is unique, use `df.index.is_unique`.
- To get column names, use `df.columns`.
- To show values as a NumPy ndarray, use `df.values`.
- Get the index and column names of `mycars`. Check whether the index is unique.
- Check whether column names of `mycars` include "country".

In [4]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars.index)
print(mycars.columns)
print("country" in mycars.columns)
print(mycars.index.is_unique)
print(mycars.values)

Index(['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'], dtype='object', name='id')
Index(['cars_per_cap', 'country', 'drives_right'], dtype='object')
True
True
[[809 'United States' True]
 [731 'Australia' False]
 [588 'Japan' False]
 [18 'India' False]
 [200 'Russia' True]
 [70 'Morocco' True]
 [45 'Egypt' True]]


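- To check whether a column or the index contains duplicated values, use `df['colname'].duplicated()` or `df.index.duplicated()`. (Called on the whole DataFrame, `df.duplicated()` flags duplicated rows.)
    - It's useful to combine it with `.any()`, as in `df['colname'].duplicated().any()`.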
- Check whether the index of `mycars` contains any duplicated values.
- Check whether the country column of `mycars` contains any duplicated values.

In [11]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars["country"].duplicated().any())

False
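
- Since `mycars` has no duplicates, here is a minimal sketch on a hypothetical toy table (`toy` is not part of the course data) that shows what the checks return when duplicates exist.

```python
import pandas as pd

# Hypothetical toy data with a repeated index label and a repeated value.
toy = pd.DataFrame({"country": ["Japan", "India", "Japan"]},
                   index=["JAP", "IN", "JAP"])

dup_index = toy.index.duplicated()          # boolean array, True for repeats
dup_country = toy["country"].duplicated()   # boolean Series, True for repeats

print(dup_index.any())       # is any index label repeated?
print(dup_country.any())     # is any country repeated?
print(toy.index.is_unique)   # the opposite question for the index
```

By default only the second and later occurrences are flagged, so the first "Japan" is not marked as a duplicate.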


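- To get unique values in a column, use `df['colname'].unique()`. (`unique()` is a Series method; a DataFrame does not have it.)
    - This is different from `is_unique`, which is an attribute (no parentheses) that returns a boolean.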
- Get unique values in the country column of `mycars`.

In [27]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
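
- A minimal sketch of the difference, using a hypothetical Series (`countries`) instead of `cars.csv` so that it runs on its own:

```python
import pandas as pd

# Hypothetical toy column with a repeated value.
countries = pd.Series(["Japan", "India", "Japan", "Egypt"])

print(countries.unique())    # distinct values, in order of first appearance
print(countries.is_unique)   # False, because "Japan" appears twice
print(countries.nunique())   # number of distinct values
```

For the exercise above, the same call would be `mycars['country'].unique()`.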

### - Get elements
- To access elements, you can use `df[]`.
    - To slice rows, use `df[start:stop]`.
    - However, you cannot use a slice inside `df[]` to select columns.
- Instead, select a column by name with `df['column_name']` or `df.column_name`.
- `df[]` and `df[[]]` are different.
    - `df[]` returns a pandas Series (1D).
    - `df[[]]` returns a pandas DataFrame (2D, a dict-like container for Series objects).

In [31]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
# get columns
print(mycars['country']) 
print(mycars.country) 
print(mycars[['country']])
print(type(mycars['country']))
print(type(mycars[['country']]))
print(mycars[['country','drives_right']]) # get two columns
# print(mycars['country','drives_right']) # this returns an error
# get rows
print(mycars[0:2])

id
US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object
id
US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object
           country
id                
US   United States
AUS      Australia
JAP          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
           country  drives_right
id                              
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True
     cars_per_cap        country  drives_right
id                                            
US    

- To get the first few rows and the last few rows easily, use `df.head()` and `df.tail()`, respectively.

In [30]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars)
print(mycars.head(2)) # first 2 rows
print(mycars.tail(2)) # last 2 rows

     cars_per_cap        country  drives_right
id                                            
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
     cars_per_cap        country  drives_right
id                                            
US            809  United States          True
AUS           731      Australia         False
     cars_per_cap  country  drives_right
id                                      
MOR            70  Morocco          True
EG             45    Egypt          True


- You can use operators like `==`, `>` and `<=` for conditions.

In [34]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars[mycars['country'] == "Australia"])
print(mycars[mycars['country'] == "Japan"])

     cars_per_cap    country  drives_right
id                                        
AUS           731  Australia         False
     cars_per_cap country  drives_right
id                                     
JAP           588   Japan         False
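
- Conditions also work on numeric columns and can be combined with `&` (and), `|` (or), and `~` (not). The sketch below rebuilds `mycars` inline from the table printed above (an assumption, so that it runs without `cars.csv`); `many_cars` and `few_or_left` are illustrative names.

```python
import pandas as pd

# mycars rebuilt by hand from the printed table (assumed to match cars.csv).
mycars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
     "country": ["United States", "Australia", "Japan", "India",
                 "Russia", "Morocco", "Egypt"],
     "drives_right": [True, False, False, False, True, True, True]},
    index=pd.Index(["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"], name="id"),
)

# Rows with more than 500 cars per capita.
many_cars = mycars[mycars["cars_per_cap"] > 500]

# Rows with at most 70 cars per capita OR left-hand traffic.
# Each condition needs its own parentheses because | binds tighter than <=.
few_or_left = mycars[(mycars["cars_per_cap"] <= 70) | (~mycars["drives_right"])]

print(many_cars)
print(few_or_left)
```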


- Useful indexers for selection are `loc` and `iloc`.
    - `loc` selects by label (index and column names).
    - `iloc` selects by integer position.
- To select a row, use `loc[]` or `loc[[]]` (`iloc[]` or `iloc[[]]`). You get a Series and a DataFrame, respectively.
- To select columns, use `loc[:,[]]` (`iloc[:,[]]`).
- With these indexers, selection works much like it does in R.

In [90]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars)
# select rows
print(mycars.loc['RU'])
print(mycars.loc[['RU']])
print(type(mycars.loc['RU']))
print(type(mycars.loc[['RU']]))
#print(mycars.iloc[[4]])

# select columns
#print(mycars.loc[:,['country']])
#print(mycars.loc[:,['country','drives_right']]) # select more than one column
print(mycars.iloc[:,[1,2]])

# select rows and columns
#print(mycars.loc[['RU'],['country','drives_right']])
#print(mycars.iloc[[4],[1]])
#print(mycars.loc[:,['cars_per_cap']])

     cars_per_cap        country  drives_right
id                                            
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
cars_per_cap       200
country         Russia
drives_right      True
Name: RU, dtype: object
    cars_per_cap country  drives_right
id                                    
RU           200  Russia          True
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
           country  drives_right
id                              
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          T

- By combining these methods, you can select elements using more complex conditions.
    - You can also use `df.query()` for subsetting rows.
    - You can also use `df.filter()` for subsetting columns or rows by label.

In [89]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
# subsetting rows
#print(mycars.loc[(mycars['cars_per_cap'] > 600) & (mycars['drives_right'] == True), 'country']) 
# show the names of countries which satisfy the condition
#print(mycars.query("cars_per_cap > 600 & drives_right == True").country)  # using pandas' query
# subsetting columns
#print(mycars.loc[:,((mycars.columns != 'drives_right') & (mycars.columns != 'country'))])
drops = ['drives_right','country']
#print(mycars.loc[:,[x for x in mycars.columns if x not in drops]]) # using a list comprehension with an if clause
keeps = ['drives_right','country']
#print(mycars.filter(keeps)) # using pandas' filter
mycars.drop(['drives_right', 'cars_per_cap'], axis = 1) # returns a new DataFrame; mycars itself is unchanged
mycars.drop(columns = ['country']) # only this last expression is displayed below

     cars_per_cap  drives_right
id                             
US            809          True
AUS           731         False
JAP           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True


### - Change contents
- Delete a column/row:
    - Deletion is one option. Taking a subset is another.
    - To delete a column, use `del df['colname']` or `df.drop( , axis=1)`. (`axis=1` deletes columns; `axis=0`, the default, deletes rows.)
    - To delete a row, use `df.drop()`.
    - `del` modifies the DataFrame in place, whereas `df.drop()` returns a new DataFrame.
- Delete "cars_per_cap" in `mycars` using `del`.
- Delete the "country" column in `mycars` using `df.drop( , axis=1)`.
- Delete the "US" row in `mycars` using `df.drop()`.

In [8]:
import pandas as pd
import numpy as np
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
#del(mycars['cars_per_cap'])


In [4]:
del(mycars['country'])
print(mycars)

     cars_per_cap  drives_right
id                             
US            809          True
AUS           731         False
JAP           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True
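
- The two `df.drop()` exercises can be sketched as follows. `mycars` is rebuilt inline from the printed table (an assumption, so that the block runs without `cars.csv`); `no_us` and `no_country` are illustrative names.

```python
import pandas as pd

# mycars rebuilt by hand from the printed table (assumed to match cars.csv).
mycars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
     "country": ["United States", "Australia", "Japan", "India",
                 "Russia", "Morocco", "Egypt"],
     "drives_right": [True, False, False, False, True, True, True]},
    index=pd.Index(["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"], name="id"),
)

no_us = mycars.drop("US")                     # axis=0 (default): drop a row by label
no_country = mycars.drop("country", axis=1)   # axis=1: drop a column by name

print(no_us)
print(no_country)
print(mycars.shape)   # the original is untouched because drop() returns a copy
```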


- To add a new column with values, use `df['colname'] = values`.

In [94]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
#mycars['one'] = 1
#print(mycars)
mycars['GDP'] = [1, 2, 3, 4, 5, 6,7]
print(mycars)

     cars_per_cap        country  drives_right  GDP
id                                                 
US            809  United States          True    1
AUS           731      Australia         False    2
JAP           588          Japan         False    3
IN             18          India         False    4
RU            200         Russia          True    5
MOR            70        Morocco          True    6
EG             45          Egypt          True    7


- To sort, use `df.sort_values(by=column_names)`.
    - With `df.sort_index()`, you can sort data using the index.
- Sort `mycars` by "cars_per_cap."
- Sort `mycars` by "drives_right" and "country."
- Sort `mycars` by the index.

In [7]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
mycars.sort_values(by = 'country')

     cars_per_cap        country  drives_right
id                                            
AUS           731      Australia         False
EG             45          Egypt          True
IN             18          India         False
JAP           588          Japan         False
MOR            70        Morocco          True
RU            200         Russia          True
US            809  United States          True


In [65]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
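
- One possible solution to the three sorting exercises, as a sketch. `mycars` is rebuilt inline from the printed table (an assumption, so that the block runs without `cars.csv`); the result names are illustrative.

```python
import pandas as pd

# mycars rebuilt by hand from the printed table (assumed to match cars.csv).
mycars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
     "country": ["United States", "Australia", "Japan", "India",
                 "Russia", "Morocco", "Egypt"],
     "drives_right": [True, False, False, False, True, True, True]},
    index=pd.Index(["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"], name="id"),
)

by_cars = mycars.sort_values(by="cars_per_cap")               # ascending by default
by_two = mycars.sort_values(by=["drives_right", "country"])   # ties broken by country
by_index = mycars.sort_index()                                # sort by the id labels

print(by_cars)
print(by_two)
print(by_index)
```

Pass `ascending=False` to either method to reverse the order.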

- To transpose data, use `df.T`.

In [66]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars)
print(mycars.T)

     cars_per_cap        country  drives_right
id                                            
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
id                       US        AUS    JAP     IN      RU      MOR     EG
cars_per_cap            809        731    588     18     200       70     45
country       United States  Australia  Japan  India  Russia  Morocco  Egypt
drives_right           True      False  False  False    True     True   True


- To reshape, use `df.pivot()`.
<!--  - Alternatively, use `df.stack()` and `df.unstack()`. -->

In [4]:
col1 = ["a","a","b","b"]
col2 = [1,2,1,2]
col3 = [0.5,0.2,0.4,0.1]
col4 = [True,False,False,True]
mydf = pd.DataFrame({'id':col1,'time':col2,'varA':col3,'varB':col4}) # make a dataframe
mydf.set_index(['id','time']) # returns a new DataFrame; mydf is unchanged because the result is not assigned
print(mydf)
print(mydf.pivot(index='id', columns='time')) # reshape mydf from long to wide

  id  time  varA   varB
0  a     1   0.5   True
1  a     2   0.2  False
2  b     1   0.4  False
3  b     2   0.1   True
     varA        varB       
time    1    2      1      2
id                          
a     0.5  0.2   True  False
b     0.4  0.1  False   True


- Appendix: Use `df.melt()` (or `pd.melt()`) to reshape data from wide to long form.

In [10]:
col1 = ["a","a","b","b"]
col2 = [1,2,1,2]
col3 = [0.5,0.2,0.4,0.1]
col4 = [True,False,False,True]
mydf = pd.DataFrame({'id':col1,'time':col2,'varA':col3,'varB':col4}) 
mydf.set_index(['id','time']) # returns a new DataFrame; mydf is unchanged because the result is not assigned
print(mydf)
print(mydf.melt(id_vars=['id','time']))

  id  time  varA   varB
0  a     1   0.5   True
1  a     2   0.2  False
2  b     1   0.4  False
3  b     2   0.1   True
  id  time variable  value
0  a     1     varA    0.5
1  a     2     varA    0.2
2  b     1     varA    0.4
3  b     2     varA    0.1
4  a     1     varB   True
5  a     2     varB  False
6  b     1     varB  False
7  b     2     varB   True


- To rename column names, use `df.rename(columns = {old_name: new_name})`.
    - To rename multiple column names, use
```python
new_cols = ['new1','new2']
df.rename(columns=dict(zip(df.columns[[loc1,loc2]], new_cols)))
```
- To rename the index, use `df.rename(index = {old_name: new_name})`.
- Rename the "country" column to "ctry."
- Rename the "US" index to "USA."

In [108]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
#print(mycars)
mycars2 = mycars.rename(columns = {'country': 'ctry'})
print(mycars2)
#print(mycars.rename(index = {'US': 'USA'}))

     cars_per_cap           ctry  drives_right
id                                            
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
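
- The multi-column pattern with `zip` can be sketched as below, together with the index exercise. `mycars` is rebuilt inline from the printed table (an assumption); `positions`, `new_cols`, and the new names `cpc` and `right` are made up for illustration.

```python
import pandas as pd

# mycars rebuilt by hand from the printed table (assumed to match cars.csv).
mycars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
     "country": ["United States", "Australia", "Japan", "India",
                 "Russia", "Morocco", "Egypt"],
     "drives_right": [True, False, False, False, True, True, True]},
    index=pd.Index(["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"], name="id"),
)

positions = [0, 2]            # integer positions of the columns to rename
new_cols = ["cpc", "right"]   # hypothetical new names
mapping = dict(zip(mycars.columns[positions], new_cols))
print(mapping)

# Rename the selected columns and one index label in a single call.
renamed = mycars.rename(columns=mapping, index={"US": "USA"})
print(renamed)
```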


- To replace values, use `df.replace(old,new)`.
    - To replace multiple values, use `df.replace({old1:new1,old2:new2})`
    - To replace specific letters in a string, use `str.replace()`.

In [120]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
#print(mycars)
mycars.replace({'country':'pays', 'cars_per_cap':'GDP'}) # no change: replace() matches cell values, not column names
#mycars['country'] = mycars['country'].replace({"United States":"U.S.A."}) # replace United States with U.S.A.
#print(mycars.replace({'United States':'U.S.A.'})) # replace() returns a new DataFrame (with inplace=True it returns None)

#mycars['country'] = mycars['country'].str.replace("ia","IA") # replace ia with IA.
#print(mycars)

     cars_per_cap        country  drives_right
id                                            
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


### - Computation
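- Arithmetic between DataFrames is aligned on the index and column names. A row or column label that appears in only one of the two DataFrames gives NaN in the result.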

In [None]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
mycars['one'] = 1 # add a new column
print(mycars)
# take a subset
mycars_new = mycars.drop('US') # drop 'US' row
mycars_new = mycars_new.drop('one',axis=1) # drop 'one' column
# add mycars_new to mycars
sum_mycars = mycars + mycars_new 
print(sum_mycars)
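
- The alignment rule is easier to see on a small numeric example. This is a sketch with hypothetical frames `a` and `b`; `fill_value` is an optional argument of `DataFrame.add()` that treats a value missing on one side as the given number.

```python
import pandas as pd

# Hypothetical frames: b lacks row "r1" and column "y".
a = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["r1", "r2", "r3"])
b = pd.DataFrame({"x": [100, 200]}, index=["r2", "r3"])

total = a + b   # labels are matched first, then values are added
print(total)    # row r1 and column y are NaN because b has no such labels

filled = a.add(b, fill_value=0)   # treat a value missing on one side as 0
print(filled["x"])
```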

- To take column/row sums, use `df.sum()`.
- To take column/row means, use `df.mean()`.

In [None]:
mydf = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('abc'), index=['x','y','z']) # make a dataframe
mydf = mydf.replace(6, np.nan) # replace 6 with NaN
print(mydf)
print(mydf.sum()) # sum by column
print(mydf.sum(axis=1)) # sum by row
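
- A sketch of `df.mean()` on the same table, and of how NaN is handled. `skipna` is a standard argument of `sum()` and `mean()`; the result names are illustrative.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                    columns=list("abc"), index=["x", "y", "z"])
mydf = mydf.replace(6, np.nan)   # column a becomes 0, 3, NaN

col_means = mydf.mean()               # mean by column, NaN skipped
row_means = mydf.mean(axis=1)         # mean by row, NaN skipped
strict_sums = mydf.sum(skipna=False)  # NaN propagates instead of being skipped

print(col_means)
print(row_means)
print(strict_sums)
```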

- Missing values are skipped by `df.sum()` and `df.mean()`. In element-wise arithmetic, however, any operation involving NaN gives NaN.

In [6]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
mycars_nan = mycars.replace(45, np.nan) # replace Egypt's cars/cap (45) with NaN
print(mycars_nan)
sum_cpc = mycars.loc[:,['cars_per_cap']] + mycars_nan.loc[:,['cars_per_cap']] # sum cars/cap
print(sum_cpc)

     cars_per_cap        country  drives_right
id                                            
US          809.0  United States          True
AUS         731.0      Australia         False
JAP         588.0          Japan         False
IN           18.0          India         False
RU          200.0         Russia          True
MOR          70.0        Morocco          True
EG            NaN          Egypt          True
     cars_per_cap
id               
US         1618.0
AUS        1462.0
JAP        1176.0
IN           36.0
RU          400.0
MOR         140.0
EG            NaN


- To get a summary table, use `df.describe()`.

In [7]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars.describe())

       cars_per_cap
count      7.000000
mean     351.571429
std      345.595552
min       18.000000
25%       57.500000
50%      200.000000
75%      659.500000
max      809.000000


### - Treat missing values
- To check if data contain specific values, use `df.isin()`.
    - To check if data contain NaN, use `df.isna()`.
- It's useful to combine them with `.any()`.
- Check whether the country column in `mycars` contains "Japan."
- Check whether the country column in `mycars` contains NaN.

In [8]:
mycars = pd.read_table('cars.csv', sep=',', index_col = 0)
print(mycars['country'].isna().any())


False
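
- The "Japan" exercise can be sketched with `isin()`, which takes a list of values. `mycars` is rebuilt inline from the printed table (an assumption, so that the block runs without `cars.csv`); `has_japan` and `either` are illustrative names.

```python
import pandas as pd

# mycars rebuilt by hand from the printed table (assumed to match cars.csv).
mycars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
     "country": ["United States", "Australia", "Japan", "India",
                 "Russia", "Morocco", "Egypt"],
     "drives_right": [True, False, False, False, True, True, True]},
    index=pd.Index(["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"], name="id"),
)

has_japan = mycars["country"].isin(["Japan"]).any()   # does any row match?
print(has_japan)

# isin() also works as a row filter for several values at once.
either = mycars[mycars["country"].isin(["Japan", "Egypt"])]
print(either)
```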


- There is no single right way to deal with missing values.
    - drop: `df.dropna()`
    - replace: `df.fillna()`
- Drop a row/column if at least one value is missing.
    - To check only specific columns, use the `subset` argument.

In [11]:
mydata = pd.DataFrame([[3,4,5],[1,2,np.nan],[3,np.nan,np.nan]])
print(mydata)
print("Drop rows containing NaN.")
mydata1 = mydata.dropna() 
print(mydata1)
print("Drop rows if column 1 has NaN.")
mydata2 = mydata.dropna(subset=[1]) 
print(mydata2)
print("Drop columns containing NaN.")
mydata3 = mydata.dropna(axis=1) 
print(mydata3)

   0    1    2
0  3  4.0  5.0
1  1  2.0  NaN
2  3  NaN  NaN
Drop rows containing NaN.
   0    1    2
0  3  4.0  5.0
Drop rows if column 1 has NaN.
   0    1    2
0  3  4.0  5.0
1  1  2.0  NaN
Drop columns containing NaN.
   0
0  3
1  1
2  3


- Drop a row/column if all values are missing.

In [12]:
mydata1 = pd.DataFrame([[3,4,np.nan],[np.nan,2,np.nan],[np.nan,np.nan,np.nan]])
print(mydata1)
mydata2 = mydata1.dropna(how='all')
print(mydata2)

     0    1   2
0  3.0  4.0 NaN
1  NaN  2.0 NaN
2  NaN  NaN NaN
     0    1   2
0  3.0  4.0 NaN
1  NaN  2.0 NaN


- Replace NaN with another value.

In [None]:
mydata = pd.DataFrame([[3,4,np.nan],[np.nan,2,np.nan],[np.nan,np.nan,np.nan]])
print(mydata)
mydata_fillna = mydata.fillna(0) # replace with zero
print(mydata_fillna)
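
- `fillna()` also accepts a dict that maps column labels to fill values, so each column can get its own replacement. The sketch below fills column 0 with zero and column 1 with its mean; the choice of fill values is only an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

mydata = pd.DataFrame([[3, 4, np.nan], [np.nan, 2, np.nan], [np.nan, np.nan, np.nan]])

# Column 0 gets 0; column 1 gets its own mean (computed with NaN skipped).
filled = mydata.fillna({0: 0, 1: mydata[1].mean()})
print(filled)   # column 2 is not in the dict, so it stays NaN
```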

- To check if data contain `np.inf` or `-np.inf`, use `np.isinf()`.

In [None]:
df_inf = pd.DataFrame([1, 2, np.inf,-np.inf])
print(df_inf)
print(np.isinf(df_inf))
print(np.isinf(df_inf).any())
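
- Once infinite values are detected, one common option (an assumption here, not the only approach) is to convert them to NaN so that `dropna()` and `fillna()` can handle them. The column name `v` is made up for this sketch.

```python
import numpy as np
import pandas as pd

df_inf = pd.DataFrame({"v": [1, 2, np.inf, -np.inf]})

n_inf = np.isinf(df_inf["v"]).sum()   # count the infinite entries
print(n_inf)

# Turn both infinities into NaN, then drop those rows.
cleaned = df_inf.replace([np.inf, -np.inf], np.nan)
print(cleaned)
print(cleaned.dropna())
```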

### - Merge
- To merge datasets, use `pd.merge()`.

In [9]:
mydata1 = pd.DataFrame({'key': ["a","b","c","d"], 'data1': range(4)})
mydata2 = pd.DataFrame({'key': ["a","b","e"], 'data2': range(3)})
print(mydata1)
print(mydata2)
inner = pd.merge(mydata1, mydata2, on='key') # inner join (intersection)
outer = pd.merge(mydata1, mydata2, on='key', how='outer') # outer join (union)
right = pd.merge(mydata1, mydata2, on='key', how='right') # right join (keep right data)
left  = pd.merge(mydata1, mydata2, on='key', how='left') # left join (keep left data)
print(inner)
print(outer)
print(right)
print(left)

   data1 key
0      0   a
1      1   b
2      2   c
3      3   d
   data2 key
0      0   a
1      1   b
2      2   e
   data1 key  data2
0      0   a      0
1      1   b      1
   data1 key  data2
0    0.0   a    0.0
1    1.0   b    1.0
2    2.0   c    NaN
3    3.0   d    NaN
4    NaN   e    2.0
   data1 key  data2
0    0.0   a      0
1    1.0   b      1
2    NaN   e      2
   data1 key  data2
0      0   a    0.0
1      1   b    1.0
2      2   c    NaN
3      3   d    NaN
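
- To see where each row of a merged table came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column. A sketch with the same two tables:

```python
import pandas as pd

mydata1 = pd.DataFrame({'key': ["a", "b", "c", "d"], 'data1': range(4)})
mydata2 = pd.DataFrame({'key': ["a", "b", "e"], 'data2': range(3)})

# _merge is "both", "left_only", or "right_only" for each row.
outer = pd.merge(mydata1, mydata2, on='key', how='outer', indicator=True)
print(outer)

# Keys that exist only in the left table.
left_only = outer.loc[outer['_merge'] == 'left_only', 'key']
print(left_only.tolist())
```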


- You can merge even if key names are different.

In [121]:
mydata1 = pd.DataFrame({'key1': ["a","b","c","d"], 'data1': range(4)})
mydata2 = pd.DataFrame({'key2': ["a","b","e"], 'data2': range(3)})
print(mydata1)
print(mydata2)
outer = pd.merge(mydata1, mydata2, left_on='key1', right_on='key2', how='outer')  
print(outer)

   data1 key1
0      0    a
1      1    b
2      2    c
3      3    d
   data2 key2
0      0    a
1      1    b
2      2    e
   data1 key1  data2 key2
0    0.0    a    0.0    a
1    1.0    b    1.0    b
2    2.0    c    NaN  NaN
3    3.0    d    NaN  NaN
4    NaN  NaN    2.0    e


- You can use more than one key.

In [122]:
mydata1 = pd.DataFrame({'id': ["a","a","b","b"], 't': ["1","2","1","2"], 'data1': range(4)})
mydata2 = pd.DataFrame({'id': ["a","a"], 't': ["1","2"], 'data2': range(2)})
print(mydata1)
print(mydata2)
outer = pd.merge(mydata1, mydata2, on=['id', 't'], how='outer')  
print(outer)

   data1 id  t
0      0  a  1
1      1  a  2
2      2  b  1
3      3  b  2
   data2 id  t
0      0  a  1
1      1  a  2
   data1 id  t  data2
0      0  a  1    0.0
1      1  a  2    1.0
2      2  b  1    NaN
3      3  b  2    NaN


- What happens if both datasets have a non-key column with the same name but different contents? By default, pandas appends the suffixes `_x` and `_y`; use `suffixes` to choose your own.

In [None]:
mydata1 = pd.DataFrame({'key': ["a","b","c","d"], 'data': range(4)})
mydata2 = pd.DataFrame({'key': ["a","b","e"], 'data': range(3)})
print(mydata1)
print(mydata2)
outer = pd.merge(mydata1, mydata2, on='key', how='outer')
print(outer)
outer = pd.merge(mydata1, mydata2, on='key', how='outer', suffixes=('_mydata1','_mydata2')) # rename duplicated column names
print(outer)

- You can merge using indices.

In [None]:
mydata1 = pd.DataFrame({'key': ["a","b","c","d"], 'data1': range(4)})
mydata2 = pd.DataFrame({'data2': range(3)}, index=['a','b','e'])
print(mydata1)
print(mydata2)
outer = pd.merge(mydata1, mydata2, left_on='key', right_index=True, how='outer') # use 'key' column for the left data and the index for the right data 
print(outer)
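
- When the keys are in the index on both sides, `df.join()` is a shorter alternative to `pd.merge()`. The sketch below first moves `key` into the index of the left table; `joined` is an illustrative name.

```python
import pandas as pd

mydata1 = pd.DataFrame({'key': ["a", "b", "c", "d"], 'data1': range(4)})
mydata2 = pd.DataFrame({'data2': range(3)}, index=['a', 'b', 'e'])

# join() matches on the index of both tables.
joined = mydata1.set_index('key').join(mydata2, how='outer')
print(joined)
```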