
# Introduction

This notebook gives some reminders on the [Pandas DataFrames and Series structures](https://pandas.pydata.org/docs/getting_started/dsintro.html).

## Series


`Series` structures can be created from the following data types:
- scalar values (or a list of scalar values)
- native Python dictionaries
- NumPy arrays (called ndarrays); note that a `Series` itself is one-dimensional.

A series is a labelled vector of values. It usually holds the values taken by one variable across different observations (or individuals).

In [0]:
import pandas as pd

### from scalar values

In [0]:
# Program to Create series with scalar values  
data_points =[1, 3, 4, 5, 6, 2, 9]  # Numeric data 
  
# Creating series with default index values 
s = pd.Series(data_points) 


In [0]:
# predefined index values 
index =['a', 'b', 'c', 'd', 'e', 'f', 'g']  
  
# Creating series with predefined index values 
si = pd.Series(data_points, index) 

In [6]:
si

a    1
b    3
c    4
d    5
e    6
f    2
g    9
dtype: int64

In [7]:
si['f']  # direct indexing

2

### from a dictionary

A dictionary is a key-value mapping. Whereas the values of a list are indexed by their position, the values of a dictionary are indexed by a key, which can be of any hashable type (for instance a string). When a `Series` is built from a dictionary, the keys become its index.

In [8]:
# Program to Create Dictionary series 
dictionary ={'a':1, 'b':2, 'c':3, 'd':4, 'e':5}  
  
# Creating series of Dictionary type 
sd = pd.Series(dictionary)
sd

a    1
b    2
c    3
d    4
e    5
dtype: int64

### from an ndarray

In [0]:
import pandas as pd
# Create a Series from 2D data, given here as a nested list
nddata = [[2, 3, 4], [5, 6, 7]]

# Each inner list becomes one element of the Series (dtype: object)
snd = pd.Series(nddata)

In [10]:
nddata, nddata[0][1]

([[2, 3, 4], [5, 6, 7]], 3)

In [11]:
snd[1][1]

6
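
The cell above builds the Series from nested Python lists, so each element of the Series is itself a list. As a minimal sketch (the names `arr` and `s_arr` are illustrative and not part of the notebook), a Series built from a genuine one-dimensional NumPy array could look like this:

```python
import numpy as np
import pandas as pd

arr = np.array([2, 3, 4, 5, 6, 7])            # 1-D ndarray
s_arr = pd.Series(arr, index=list("abcdef"))  # one label per value

print(s_arr.dtype)  # numeric dtype inherited from the array
print(s_arr["b"])   # label-based access to a single value
```

Here the values keep a numeric dtype, so vectorised operations such as `s_arr.mean()` are available.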

## DataFrames

A DataFrame is a 2-dimensional data structure: the columns contain the variables, and the rows index their observations.

It can be built from the same kind of data as `Series`:
- one or more scalar vectors
- one or more dictionaries
- 2D-numpy ndarray

In [0]:
# Program to Create Data Frame with two dictionaries 
dict1 ={'a':1, 'b':2, 'c':3, 'd':4}        # Define Dictionary 1 
dict2 ={'a':5, 'b':6, 'c':7, 'd':8, 'e':9} # Define Dictionary 2 
data = {'first':dict1, 'second':dict2}  # Define Data with dict1 and dict2 
df = pd.DataFrame(data)  # Create DataFrame 

In [13]:
df  # note that the missing value is filled with a NaN by default (Not a Number)

,first,second
a,1.0,5
b,2.0,6
c,3.0,7
d,4.0,8
e,,9
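
The output above shows `first` as floats (`1.0`, `2.0`...) while `second` stays integer. A short sketch of why, rebuilding the same frame under the illustrative name `df_nan`: `NaN` is a floating-point value, so a column that contains one is stored as `float64`.

```python
import pandas as pd

dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
dict2 = {'a': 5, 'b': 6, 'c': 7, 'd': 8, 'e': 9}
df_nan = pd.DataFrame({'first': dict1, 'second': dict2})

print(df_nan.dtypes)           # 'first' is float64 because it holds a NaN
print(df_nan['first'].isna())  # True only for row 'e'
print(df_nan.fillna(0))        # one possible way to replace the missing value
```

Whether filling with 0 is appropriate depends on the data; `dropna` is the other common option.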


### from Series

A DataFrame can also be created from a set of series, for instance as follows:

In [0]:
# Program to create Dataframe of three series  
import pandas as pd 
  
s1 = pd.Series([1, 3, 4, 5, 6, 2, 9])           # Define series 1 
s2 = pd.Series([1.1, 3.5, 4.7, 5.8, 2.9, 9.3]) # Define series 2 
s3 = pd.Series(['a', 'b', 'c', 'd', 'e'])     # Define series 3 
  
  
Data ={'first':s1, 'second':s2, 'third':s3} # Define Data 
dfseries = pd.DataFrame(Data)              # Create DataFrame 

### from 2D-numpy ndarray

In [0]:
# Program to create a DataFrame from 2D data given as nested lists
import pandas as pd # Import Library
d1 = [[2, 3, 4], [5, 6, 7]] # 2D data 1 (nested list)
d2 = [[2, 4, 8], [1, 3, 9]] # 2D data 2 (nested list)
Data = {'first': d1, 'second': d2} # each inner list becomes one cell
df2d = pd.DataFrame(Data)    # Create DataFrame

In [15]:
df2d

,first,second
0,"[2, 3, 4]","[2, 4, 8]"
1,"[5, 6, 7]","[1, 3, 9]"
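
In the result above each cell holds a whole list, because the input was a dictionary of nested lists. As a hedged alternative sketch (the column and row labels below are made up for illustration), passing a 2D NumPy array directly gives one scalar per cell:

```python
import numpy as np
import pandas as pd

arr2d = np.array([[2, 3, 4], [5, 6, 7]])  # shape (2, 3)
df_arr = pd.DataFrame(arr2d, columns=["x", "y", "z"], index=["obs1", "obs2"])

print(df_arr)
print(df_arr.shape)  # rows and columns follow the array's shape
```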


## Tidy data

It is useful to organize a DataFrame as [_tidy data_](https://vita.had.co.nz/papers/tidy-data.pdf):

"A dataset is a collection of **values**, usually either numbers (if quantitative) or strings (if
qualitative). Values are organised in two ways. Every value belongs to a **variable** and an
**observation**. A variable contains all values that measure the same underlying attribute (like
height, temperature, duration) across units. An observation contains all values measured on
the same unit (like a person, or a day, or a race) across attributes."

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
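
As an illustration of these three rules, here is a small sketch on a hypothetical "wide" table (the people, years and heights are invented for the example). `melt` is one way to move from one column per year to one row per observation:

```python
import pandas as pd

# Hypothetical wide table: one row per person, one column per year
wide = pd.DataFrame({
    "person": ["ana", "bob"],
    "2019": [170, 180],
    "2020": [171, 180],
})

# Each (person, year) pair becomes one observation, i.e. one row
tidy = wide.melt(id_vars="person", var_name="year", value_name="height_cm")
print(tidy)
```

In the tidy version, `person`, `year` and `height_cm` are the variables, and each row is a single measurement.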


# Exercise 

We are going to play with an automobile dataset, in which each column gives a different feature of a car, such as body shape, motor type, price, etc.

You can download the data at [this url](https://github.com/annemariet/tutorials/blob/master/data/Automobile_data.csv). 




Load the dataframe and print the first 10 and last 10 lines:

- use the GitHub "Raw" button to get the link to the raw content
- pandas `read_csv` can read from URLs
- use the `head` and `tail` methods.

In [0]:
# load df
url = "https://raw.githubusercontent.com/annemariet/tutorials/master/data/Automobile_data.csv"
df = pd.read_csv(url, index_col=0)

In [21]:
df.columns

Index(['company', 'body-style', 'wheel-base', 'length', 'engine-type',
       'num-of-cylinders', 'horsepower', 'average-mileage', 'price'],
      dtype='object')

In [22]:
# head
df.head(10)

index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0
5,audi,sedan,99.8,177.3,ohc,five,110,19,15250.0
6,audi,wagon,105.8,192.7,ohc,five,110,19,18920.0
9,bmw,sedan,101.2,176.8,ohc,four,101,23,16430.0
10,bmw,sedan,101.2,176.8,ohc,four,101,23,16925.0
11,bmw,sedan,101.2,176.8,ohc,six,121,21,20970.0


In [24]:
# tail
df.tail(10)

index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
69,toyota,wagon,95.7,169.7,ohc,four,62,31,6918.0
70,toyota,wagon,95.7,169.7,ohc,four,62,27,7898.0
71,toyota,wagon,95.7,169.7,ohc,four,62,27,8778.0
79,toyota,wagon,104.5,187.8,dohc,six,156,19,15750.0
80,volkswagen,sedan,97.3,171.7,ohc,four,52,37,7775.0
81,volkswagen,sedan,97.3,171.7,ohc,four,85,27,7975.0
82,volkswagen,sedan,97.3,171.7,ohc,four,52,37,7995.0
86,volkswagen,sedan,97.3,171.7,ohc,four,100,26,9995.0
87,volvo,sedan,104.3,188.8,ohc,four,114,23,12940.0
88,volvo,wagon,104.3,188.8,ohc,four,114,23,13415.0


You can access the content of a column by indexing the dataframe with the column name, which returns a `pd.Series`. You can view the list of columns with `df.columns`.

In [25]:
d1=df['company']
d1

index
0     alfa-romero
1     alfa-romero
2     alfa-romero
3            audi
4            audi
         ...     
81     volkswagen
82     volkswagen
86     volkswagen
87          volvo
88          volvo
Name: company, Length: 61, dtype: object

In [44]:
len(df)

61

Which company sells the most expensive car?
- using boolean filtering, you can select the rows for which a predicate is true, e.g. `df[df["<column>"] == <value>]`.
- using `df.loc[<index>]`, you can select the rows at the given index labels.
- pandas offers both `max` and `idxmax` methods.

In [43]:
# answer
%timeit df.sort_values(by="price", ascending=False).head(1).company

The slowest run took 6.12 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 703 µs per loop


In [41]:
%timeit df[df["price"] == df["price"].max()]

The slowest run took 6.16 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 766 µs per loop


In [42]:
%timeit df.loc[df["price"].idxmax()]

1000 loops, best of 3: 297 µs per loop
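
The timed expressions above return a row or a DataFrame rather than the company name itself. As a minimal sketch on a small hypothetical table (its three rows reuse a few values visible in this notebook's outputs, not the full CSV), the `idxmax` route can be finished like this:

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame(
    {"company": ["bmw", "isuzu", "mercedes-benz"],
     "price": [41315.0, np.nan, 45400.0]},
    index=[14, 31, 47],
)

label = cars["price"].idxmax()  # index label of the highest price; NaN is skipped
top = cars.loc[label]           # the matching row, returned as a Series
print(label, top["company"])
```

`idxmax` returns an index label, not a position, which is why it pairs with `loc` rather than `iloc`.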


Print the details of all the Toyota cars




In [45]:
# answer
df[df["company"]=="toyota"]

index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
66,toyota,hatchback,95.7,158.7,ohc,four,62,35,5348.0
67,toyota,hatchback,95.7,158.7,ohc,four,62,31,6338.0
68,toyota,hatchback,95.7,158.7,ohc,four,62,31,6488.0
69,toyota,wagon,95.7,169.7,ohc,four,62,31,6918.0
70,toyota,wagon,95.7,169.7,ohc,four,62,27,7898.0
71,toyota,wagon,95.7,169.7,ohc,four,62,27,8778.0
79,toyota,wagon,104.5,187.8,dohc,six,156,19,15750.0


In [50]:
select_company = "toyota"
df.query(f"company == '{select_company}'")

index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
66,toyota,hatchback,95.7,158.7,ohc,four,62,35,5348.0
67,toyota,hatchback,95.7,158.7,ohc,four,62,31,6338.0
68,toyota,hatchback,95.7,158.7,ohc,four,62,31,6488.0
69,toyota,wagon,95.7,169.7,ohc,four,62,31,6918.0
70,toyota,wagon,95.7,169.7,ohc,four,62,27,7898.0
71,toyota,wagon,95.7,169.7,ohc,four,62,27,8778.0
79,toyota,wagon,104.5,187.8,dohc,six,156,19,15750.0


You can use `<Series>.value_counts()` to count the number of cars per company.

In [52]:
# counts
df.company.value_counts()

toyota           7
bmw              6
mazda            5
nissan           5
mercedes-benz    4
audi             4
volkswagen       4
mitsubishi       4
honda            3
alfa-romero      3
porsche          3
chevrolet        3
jaguar           3
isuzu            3
dodge            2
volvo            2
Name: company, dtype: int64

In [54]:
import numpy as np
np.unique(df.company.values, return_counts=True, )

(array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
        'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mitsubishi',
        'nissan', 'porsche', 'toyota', 'volkswagen', 'volvo'], dtype=object),
 array([3, 4, 6, 3, 2, 3, 3, 3, 5, 4, 4, 5, 3, 7, 4, 2]))

Find the most expensive car for each company. For this you can use the `groupby` method.

In [59]:
# code
df.groupby("company")["price"].max()

company
alfa-romero      16500.0
audi             18920.0
bmw              41315.0
chevrolet         6575.0
dodge             6377.0
honda            12945.0
isuzu             6785.0
jaguar           36000.0
mazda            18344.0
mercedes-benz    45400.0
mitsubishi        8189.0
nissan           13499.0
porsche          37028.0
toyota           15750.0
volkswagen        9995.0
volvo            13415.0
Name: price, dtype: float64

In [61]:
df.query("company == 'alfa-romero'")

index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
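
`groupby(...).max()` returns only the maximum price per company, not the car it belongs to. One possible way to recover the full rows, sketched on a hypothetical five-row extract (values copied from the outputs above), is to combine `groupby` with `idxmax`:

```python
import pandas as pd

cars = pd.DataFrame(
    {"company": ["alfa-romero", "alfa-romero", "alfa-romero", "audi", "audi"],
     "body-style": ["convertible", "convertible", "hatchback", "sedan", "wagon"],
     "price": [13495.0, 16500.0, 16500.0, 13950.0, 18920.0]},
    index=[0, 1, 2, 3, 6],
)

# Index label of the most expensive car in each company
best_idx = cars.groupby("company")["price"].idxmax()
print(cars.loc[best_idx])
```

Note that alfa-romero has two cars at the same top price; `idxmax` keeps only the first one it meets.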


In [63]:
df["price"].describe()

count       58.000000
mean     15387.000000
std      11320.259841
min       5151.000000
25%       6808.500000
50%      11095.000000
75%      18120.500000
max      45400.000000
Name: price, dtype: float64

Groupbys allow for a variety of aggregation functions. Can you compute the average mileage by company?

In [66]:
# code
df.groupby("company")["average-mileage"].mean().sort_values()

company
jaguar           14.333333
porsche          17.000000
mercedes-benz    18.000000
bmw              19.000000
audi             20.000000
alfa-romero      20.333333
volvo            23.000000
honda            26.333333
mazda            28.000000
toyota           28.714286
mitsubishi       29.500000
dodge            31.000000
nissan           31.400000
volkswagen       31.750000
isuzu            33.333333
chevrolet        41.000000
Name: average-mileage, dtype: float64

Sort all cars by decreasing price, using the `sort_values` method.

In [74]:
# code
df[["price", "company"]].sort_values(by="price", ascending=False)

index,price,company
47,45400.0,mercedes-benz
14,41315.0,bmw
46,40960.0,mercedes-benz
62,37028.0,porsche
15,36880.0,bmw
...,...,...
36,5195.0,mazda
16,5151.0,chevrolet
31,,isuzu
32,,isuzu
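
In the output above, the rows with a missing price end up last even though the sort is descending. A small sketch on hypothetical data (three rows with values taken from this notebook) shows the `na_position` parameter of `sort_values` and the option of dropping those rows first:

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {"company": ["chevrolet", "isuzu", "bmw"],
     "price": [5151.0, np.nan, 41315.0]}
)

# Default behaviour: NaN rows are placed last, whatever the sort direction
by_price = prices.sort_values(by="price", ascending=False)
# NaN rows first instead
nan_first = prices.sort_values(by="price", ascending=False, na_position="first")
# Or remove the rows without a price before sorting
no_nan = prices.dropna(subset=["price"]).sort_values(by="price", ascending=False)

print(by_price, nan_first, no_nan, sep="\n\n")
```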


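Pandas gives you access to merge and concatenate functions. Create two dataframes from the following dictionaries and merge them to get a single dataframe with 3 columns (Company, Price, horsepower) and 4 lines. Note that an outer merge also keeps Volvo, which has a price but no horsepower, and therefore returns 5 lines.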

In [0]:
Car_Price = {'Company': ['Toyota', 'Honda', 'BMV', 'Audi', "Volvo"], 'Price': [23845, 17995, 135925 , 71400, 45632]}
car_Horsepower = {'brand': ['Toyota', 'Honda', 'BMV', 'Audi'], 'horsepower': [141, 80, 182 , 160]}

In [88]:
# code
d1 = pd.DataFrame(Car_Price)
d2 = pd.DataFrame(car_Horsepower)
df12 = pd.merge(d1, d2, left_on="Company", right_on="brand", how="outer", )
df12

,Company,Price,brand,horsepower
0,Toyota,23845,Toyota,141.0
1,Honda,17995,Honda,80.0
2,BMV,135925,BMV,182.0
3,Audi,71400,Audi,160.0
4,Volvo,45632,,


In [85]:
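# d2 names its key column "brand": rename it so that both frames share the "Company" key
d1.merge(d2.rename(columns={"brand": "Company"}), on="Company", how="outer")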

,Company,Price,horsepower
0,Toyota,23845,141.0
1,Honda,17995,80.0
2,BMV,135925,182.0
3,Audi,71400,160.0
4,Volvo,45632,
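
The exercise asks for 4 lines, while the outer merges above return 5 because Volvo has no horsepower entry. One way to get 4 lines, sketched here under the assumption that only companies present in both tables should be kept, is an inner merge:

```python
import pandas as pd

Car_Price = {'Company': ['Toyota', 'Honda', 'BMV', 'Audi', "Volvo"],
             'Price': [23845, 17995, 135925, 71400, 45632]}
car_Horsepower = {'brand': ['Toyota', 'Honda', 'BMV', 'Audi'],
                  'horsepower': [141, 80, 182, 160]}

d1 = pd.DataFrame(Car_Price)
# Align the key column name so that a single "Company" column remains
d2 = pd.DataFrame(car_Horsepower).rename(columns={"brand": "Company"})

inner = d1.merge(d2, on="Company", how="inner")  # keeps only keys found in both frames
print(inner)
```

With an inner merge no missing value is introduced, so `horsepower` stays an integer column.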


Write a program to change the order of a pandas Series. Create a series indexed with 'A', 'B', 'C'... like this:
- A 1
- B 2
- C 3
- D 4
- E 5

Then reorder it using a new list of labels such as 'B', 'D', 'E'..., using `reindex`.

In [91]:
# code
s = pd.Series(index=list("ABCDE"), data=range(1, 6))
s

A    1
B    2
C    3
D    4
E    5
dtype: int64

In [92]:
s.reindex(["B", "D", "C", "E", "A"])

B    2
D    4
C    3
E    5
A    1
dtype: int64
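
`reindex` does more than permute: it conforms the Series to whatever list of labels it receives. A short sketch (the label 'Z' is a deliberately absent, made-up label) shows what happens with a subset of labels and with an unknown one:

```python
import pandas as pd

s = pd.Series(index=list("ABCDE"), data=range(1, 6))

partial = s.reindex(["B", "D", "E"])          # subset of the labels, in a new order
extended = s.reindex(["B", "Z"])              # 'Z' is not in s: its value is NaN
filled = s.reindex(["B", "Z"], fill_value=0)  # choose the value used for new labels

print(partial, extended, filled, sep="\n\n")
```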

# NumPy

The NumPy library (http://www.numpy.org/) is the go-to library for numerical analysis in Python.

In [0]:
import numpy as np

In [94]:
np.pi

3.141592653589793

## Arrays with numpy.array()

### Creation
You can create an array from a list (1-d vector), or a list of lists of the same lengths (2-d matrix).
You can also create empty arrays, arrays of zeros and ones, or random arrays of any given size and value type (int, float...). 

In [0]:
a = np.array([[1, 2, 3], [4, 5, 6]])

In [96]:
a

array([[1, 2, 3],
       [4, 5, 6]])

In [97]:
type(a), a.dtype


(numpy.ndarray, dtype('int64'))

In [0]:
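a[0, 1] = 1.2  # `a` has an integer dtype, so the value is truncated to 1

In [0]:
b = a.astype(float)  # converted copy; the `np.float` alias was removed in NumPy 1.24
b[0, 1] = 1.2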

In [105]:
a, b

(array([[1, 1, 3],
        [4, 5, 6]]), array([[1. , 1.2, 3. ],
        [4. , 5. , 6. ]]))


### Accessing elements

In [106]:
a[0,1]

1

In [107]:
a[1,2]

6

### numpy.arange()

`numpy.arange(a, b, step)` returns the values from a (included) to b (excluded), increasing by the given step (which defaults to 1).

In [108]:
m = np.arange(3, 15, 2)
m

array([ 3,  5,  7,  9, 11, 13])

Note the difference between `numpy.arange()` and native Python `range()`:

- `numpy.arange()` returns a numpy.ndarray.
- `range()` returns an object of type `range`, a lazy sequence that computes its values on demand instead of storing them.

In [109]:
type(m), m.dtype

(numpy.ndarray, dtype('int64'))

In [0]:
n = range(3, 15, 2)
type(n)

range

*A side-note on iterators*

Iterators are useful when you don't want to keep all the iterated values in memory: they let you "consume" items one at a time in a loop. Once an iterator has been consumed to the end, it does not restart in a new loop. A `range` object is not itself an iterator but an iterable: each `for` loop asks it for a fresh iterator, which is why it can be reused. Compare `x = range(10)` and `x = iter(range(10))` in the code below.

In [209]:
x = range(10)
for i in x:
  print("i=", i)

for j in x:
  print("j=", j)

i= 0
i= 1
i= 2
i= 3
i= 4
i= 5
i= 6
i= 7
i= 8
i= 9
j= 0
j= 1
j= 2
j= 3
j= 4
j= 5
j= 6
j= 7
j= 8
j= 9


In [210]:
x = iter(range(10))
for i in x:
  print("i=", i)

# x is already exhausted, so this second loop prints nothing
for j in x:
  print("j=", j)

i= 0
i= 1
i= 2
i= 3
i= 4
i= 5
i= 6
i= 7
i= 8
i= 9


`numpy.arange()` accepts non-integer inputs.

In [114]:
np.arange(0, 11*np.pi, np.pi)

array([ 0.        ,  3.14159265,  6.28318531,  9.42477796, 12.56637061,
       15.70796327, 18.84955592, 21.99114858, 25.13274123, 28.27433388,
       31.41592654])

In [121]:
np.arange(3, 9, 0.6)

array([3. , 3.6, 4.2, 4.8, 5.4, 6. , 6.6, 7.2, 7.8, 8.4])

In [115]:
11*np.pi

34.55751918948772

### numpy.linspace()
`numpy.linspace()` has a different API: it takes an interval [a, b] (b included) and a number of values rather than a step.

In [120]:
np.linspace(3, 9, 11)

array([3. , 3.6, 4.2, 4.8, 5.4, 6. , 6.6, 7.2, 7.8, 8.4, 9. ])
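
`arange` and `linspace` can describe the same grid in two ways. As a hedged sketch using two optional `linspace` parameters, `endpoint=False` excludes b as `arange` does, and `retstep=True` also returns the spacing that was used:

```python
import numpy as np

with_step = np.arange(3, 9, 0.6)                    # step given, 9 excluded
with_count = np.linspace(3, 9, 10, endpoint=False)  # 10 values, 9 excluded
print(np.allclose(with_step, with_count))           # same grid up to rounding

values, step = np.linspace(3, 9, 11, retstep=True)  # 11 values, 9 included
print(step)
```

With a non-integer step, the length of an `arange` result depends on floating-point rounding, so `linspace` is often the safer choice when the number of points matters.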

## Applying mathematical functions

`numpy` gives you a number of mathematical functions which are applied to numpy arrays element by element: `sin`, `cos`, `log`, `exp`...


In [123]:
x = np.linspace(-np.pi/2, np.pi/2, 10)
y = np.sin(x)
y

array([-1.        , -0.93969262, -0.76604444, -0.5       , -0.17364818,
        0.17364818,  0.5       ,  0.76604444,  0.93969262,  1.        ])

# Exercise

Create a 4x2 integer array (of type unsigned int16) and print the following attributes:
- the shape `shape`,
- the number of dimensions `ndim`,
- the size in bytes of each element `itemsize`.

Compare also with `nbytes` and `size`.

In [126]:
# code
a = np.ones((4, 2), dtype=np.uint16)
print("shape", a.shape)
print("dimensions", a.ndim)
print("element size", a.itemsize)

shape (4, 2)
dimensions 2
element size 2
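
The answer above leaves out the comparison with `nbytes` and `size`. A possible completion, reusing the same array:

```python
import numpy as np

a = np.ones((4, 2), dtype=np.uint16)

print("size", a.size)      # total number of elements: 4 * 2
print("nbytes", a.nbytes)  # total memory used by the elements, in bytes
print(a.nbytes == a.size * a.itemsize)
```

`size` counts elements, `itemsize` gives the bytes per element, and `nbytes` is their product.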


*side-note: vectors and row/column matrix in `numpy`*

NumPy treats arrays of shape (n,), (1, n) and (n, 1) differently.
Try:

In [211]:
a = np.ones(12)
b = np.ones((1, 12))
a.shape, b.shape

((12,), (1, 12))

In [212]:
a * b # broadcasts a to shape (1, 12) and multiplies element by element, fine

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [213]:
a @ b # matrix multiplication fails: a is treated as a (1, n) row, which cannot be multiplied by the (1, n) matrix b. Did we want the outer product (n, 1) x (1, n) or the scalar product (1, n) x (n, 1)?

ValueError: ignored

In [217]:
# 2 ways of doing the outer product.
v1 = a[:, np.newaxis] @ b
v2 = a.reshape((a.shape[0], 1)) @ b
np.all(v1 == v2)

True

In [219]:
# 2 ways of doing the scalar product
v1 = a @ b.flatten()
v2 = a @ b.reshape((b.shape[1],))
v1, v2

(12.0, 12.0)
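
NumPy also has dedicated functions for both products, which avoid the manual reshaping. A brief sketch with the same `a` and `b` (the name `col` is introduced here for illustration):

```python
import numpy as np

a = np.ones(12)       # shape (12,)
b = np.ones((1, 12))  # shape (1, 12)

outer = np.outer(a, b)        # both inputs are flattened first; result is (12, 12)
inner = np.dot(a, b.ravel())  # scalar product of two 1-D vectors

col = a.reshape(-1, 1)        # -1 lets NumPy infer the length: shape (12, 1)
print(outer.shape, inner, (col @ b).shape)
```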

- Create an array of size 5x2, with values ranging from 100 to 200 (excluded), such that the difference between two consecutive elements is 10. You can use `arange` and `reshape`.

In [137]:
#### code
a = np.arange(100, 200, 10).reshape((5, 2))
a

array([[100, 110],
       [120, 130],
       [140, 150],
       [160, 170],
       [180, 190]])

Given the following array, can you print the third column only?


In [0]:
import numpy
sampleArray = numpy.array([[11 ,22, 33], [44, 55, 66], [77, 88, 99]])

In [139]:
# code
sampleArray[:, 2]

array([33, 66, 99])

Given the following array, can you return only odd rows and even columns? (considering the mathematical numbering with row 1 at index 0). 


In [145]:
import numpy
sampleArray = numpy.array([[3 ,6, 9, 12], [15 ,18, 21, 24], 
[27 ,30, 33, 36], [39 ,42, 45, 48], [51 ,54, 57, 60]])
sampleArray

array([[ 3,  6,  9, 12],
       [15, 18, 21, 24],
       [27, 30, 33, 36],
       [39, 42, 45, 48],
       [51, 54, 57, 60]])

In [146]:
# code
sampleArray[0::2, 1::2]

array([[ 6, 12],
       [30, 36],
       [54, 60]])

Let A and B be two arrays of the same size; compute C such that $c_i = \sqrt{a_i + b_i}$.

In [0]:
import numpy
arrayOne = numpy.array([[5, 6, 9], [21 ,18, 27]])
arrayTwo = numpy.array([[15 ,33, 24], [4 ,7, 1]])

In [221]:
#### code
c = arrayOne + arrayTwo
print("A+B=\n", c)
print("C=\n", np.sqrt(c))
# note that the operations are element-wise: they happen as if iterating over each element of the arrays.


A+B=
 [[20 39 33]
 [25 25 28]]
C=
 [[4.47213595 6.244998   5.74456265]
 [5.         5.         5.29150262]]
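
The two arrays above have the same shape, so the sum is simply element-wise. Broadcasting proper comes into play when the shapes differ but are compatible. A small sketch (the arrays `row` and `col` are made up for the example):

```python
import numpy as np

arrayOne = np.array([[5, 6, 9], [21, 18, 27]])  # shape (2, 3)
row = np.array([1, 10, 100])                    # shape (3,)
col = np.array([[1], [2]])                      # shape (2, 1)

print(arrayOne + row)  # `row` is added to each of the two rows
print(arrayOne * col)  # each row is multiplied by its own factor
```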


In [222]:
# to go further: check the difference between multiply and matrix multiplication
arrayOne * arrayTwo

array([[ 75, 198, 216],
       [ 84, 126,  27]])

In [223]:
np.multiply(arrayOne, arrayTwo)

array([[ 75, 198, 216],
       [ 84, 126,  27]])

In [225]:
arrayOne.T @ arrayTwo

array([[159, 312, 141],
       [162, 324, 162],
       [243, 486, 243]])

In [226]:
np.dot(arrayOne.T, arrayTwo)

array([[159, 312, 141],
       [162, 324, 162],
       [243, 486, 243]])

In [229]:
arrayOne.T.dot(arrayTwo)

array([[159, 312, 141],
       [162, 324, 162],
       [243, 486, 243]])

Create a new integer array of size 8x3, with values ranging from 10 to 34 with step size=1. Split the array into 4 subarrays (using `split`).


In [156]:
# code
a = np.arange(10, 34).reshape(8, 3)
a

array([[10, 11, 12],
       [13, 14, 15],
       [16, 17, 18],
       [19, 20, 21],
       [22, 23, 24],
       [25, 26, 27],
       [28, 29, 30],
       [31, 32, 33]])

In [158]:
np.split(a, 4)

[array([[10, 11, 12],
        [13, 14, 15]]), array([[16, 17, 18],
        [19, 20, 21]]), array([[22, 23, 24],
        [25, 26, 27]]), array([[28, 29, 30],
        [31, 32, 33]])]
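
By default `np.split` cuts along axis 0 and requires an equal division. As a sketch of two variations on the same array: splitting along the columns with `axis=1`, and `np.array_split`, which accepts a number of parts that does not divide the length evenly:

```python
import numpy as np

a = np.arange(10, 34).reshape(8, 3)

cols = np.split(a, 3, axis=1)  # three blocks of shape (8, 1), one per column
uneven = np.array_split(a, 3)  # 8 rows cannot be split evenly into 3 parts

print([c.shape for c in cols])
print([u.shape for u in uneven])
```

Calling `np.split(a, 3)` on these 8 rows would raise a `ValueError` instead.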

Sort the array:
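- by its second row (reorder the columns so that the second row is increasing)
- by its second column (reorder the rows so that the second column is increasing)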

In [0]:
import numpy
sampleArray = numpy.array([[34,43,73],[82,22,12],[53,94,66]])

In [196]:
# Preliminary examples showing the different sort APIs
# 1. in-place with the array method
print(sampleArray)
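sampleArray.sort(axis=0)  # axis=0: sort down the rows, i.e. within each column
print("by rows\n", sampleArray)
sampleArray.sort(axis=1)  # axis=1: sort across the columns, i.e. within each row (of the already sorted array)
print("by columns\n", sampleArray)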

[[34 43 73]
 [82 22 12]
 [53 94 66]]
by rows
 [[34 22 12]
 [53 43 66]
 [82 94 73]]
by columns
 [[12 22 34]
 [43 53 66]
 [73 82 94]]


In [204]:
# 2. returning a new array with `np.sort` function
# note that the second sort happens on the original array, unlike in 1.

sampleArray = numpy.array([[34,43,73],[82,22,12],[53,94,66]])
print("original\n", sampleArray)
print("by rows\n", np.sort(sampleArray, axis=0)) # axis=0: sorts the values within each column
print("by columns\n", np.sort(sampleArray, axis=1)) # axis=1: sorts the values within each row

original
 [[34 43 73]
 [82 22 12]
 [53 94 66]]
by rows
 [[34 22 12]
 [53 43 66]
 [82 94 73]]
by columns
 [[34 43 73]
 [12 22 82]
 [53 66 94]]


In [207]:
# code answering the question: using argsort and `np.array` indexing power.
sortrow1index = sampleArray[1].argsort()
a = sampleArray[:, sortrow1index]

# we check that the original array didn't change
print("original:\n", sampleArray)
print("2nd row ordering:", sortrow1index)
print("sorted by 2nd row:\n", a)

original:
 [[34 43 73]
 [82 22 12]
 [53 94 66]]
2nd row ordering: [2 1 0]
sorted by 2nd row:
 [[73 43 34]
 [12 22 82]
 [66 94 53]]


In [208]:
print("sorted by 2nd column\n", sampleArray[sampleArray[:,1].argsort(),:])

sorted by 2nd column
 [[82 22 12]
 [34 43 73]
 [53 94 66]]


Given the following array, print the max along axis 0 and the min along axis 1.

In [0]:
import numpy
sampleArray = numpy.array([[34,43,73],[82,22,12],[53,94,66]])

In [187]:
# code
print(sampleArray.max(axis=0))  # maximum of each column
print(sampleArray.min(axis=1))  # minimum of each row

[82 94 73]
[34 12 53]


Given the following array, remove the second column and replace it with the new column values using `delete` and `insert`. Print the intermediate results.

In [0]:
import numpy
sampleArray = numpy.array([[34,43,73],[82,22,12],[53,94,66]]) 
newColumn = numpy.array([[10,10,10]]) 

In [195]:
# code
print("Original array:")
print(sampleArray)
a2 = np.delete(sampleArray, 1, axis=1)
print("After removing 2nd column:")
print(a2)
print("Replacing with new column:")
a3 = np.insert(a2, 1, newColumn, axis=1)
print(a3)

Original array:
[[34 43 73]
 [82 22 12]
 [53 94 66]]
After removing 2nd column:
[[34 73]
 [82 12]
 [53 66]]
Replacing with new column:
[[34 10 73]
 [82 10 12]
 [53 10 66]]
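
The exercise asks for `delete` and `insert`, which each return a new array. When the goal is only to overwrite a column, a simpler alternative (shown here as a sketch, not as the expected answer) is slice assignment on a copy:

```python
import numpy as np

sampleArray = np.array([[34, 43, 73], [82, 22, 12], [53, 94, 66]])

replaced = sampleArray.copy()  # work on a copy to keep the original intact
replaced[:, 1] = 10            # the scalar is broadcast over the whole column
print(replaced)
```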
