In [2]:
import numpy as np
import pandas as pd

# Important
## looping

Try to vectorize all your pandas operations. When that is not possible, you can use:
* iterrows() (good)
* apply() (better)

If an operator does not work (e.g. ** 2), there is usually a pandas method that does what you want (e.g. pow()).
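
As a rough illustration of the three approaches, the sketch below squares a column with `iterrows()`, with `apply()`, and with a vectorized expression. The frame and the column name `x` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric column named "x".
df = pd.DataFrame({"x": np.arange(5, dtype="float64")})

# 1. iterrows(): explicit Python loop over the rows.
squares_loop = [row["x"] ** 2 for _, row in df.iterrows()]

# 2. apply(): still calls a Python function per element, but is tidier.
squares_apply = df["x"].apply(lambda v: v ** 2)

# 3. Vectorized: the whole column is processed in one expression.
squares_vec = df["x"] ** 2

# Method form of the same operator, handy when chaining calls.
squares_pow = df["x"].pow(2)
```

All four give the same values; the vectorized forms are usually the fastest on large frames.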

## filtering values
passing_mark = 60

df['course_mark'] >= passing_mark  # Series of True/False
df_passed = df[df['course_mark'] >= passing_mark]  # better way: filter rows with the boolean Series
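
A minimal sketch of the same filter on a small, made-up marks table (the `student` column and its values are assumptions for illustration). It also shows how two conditions can be combined.

```python
import pandas as pd

# Hypothetical marks table.
df = pd.DataFrame({
    "student": ["ann", "bob", "cat", "dan"],
    "course_mark": [55, 60, 72, 91],
})
passing_mark = 60

mask = df["course_mark"] >= passing_mark   # boolean Series
df_passed = df[mask]                       # keep rows where mask is True

# Combine conditions with & (and), | (or), ~ (not); wrap each in parentheses.
df_mid = df[(df["course_mark"] >= passing_mark) & (df["course_mark"] < 90)]
```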

## SQL 
### pulling data
For resource efficiency it is better to filter at the database using SQL. If a query pulls a very large number of records (e.g. billions), you may not have enough RAM to hold the result, and running such a query is resource intensive.


pandas provides:
* read_sql()
* read_sql_query()
* read_sql_table()
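
A small sketch of filtering in the database rather than in pandas, using an in-memory SQLite database from the standard library. The table name `marks` and its columns are assumptions made up for the example.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical "marks" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student_id INTEGER, course_mark REAL)")
conn.executemany("INSERT INTO marks VALUES (?, ?)",
                 [(1, 55.0), (2, 60.0), (3, 72.0)])
conn.commit()

# Filter in SQL so that only the rows we need are loaded into memory.
query = "SELECT student_id, course_mark FROM marks WHERE course_mark >= ?"
df_passed = pd.read_sql_query(query, conn, params=(60,))
conn.close()
```

Passing the threshold through `params` keeps the value out of the SQL string itself.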

### SQL WHERE
Select a single row by its position (or label):

* ```loc``` gets rows (or columns) with particular labels from the index.
* ```iloc``` gets rows (or columns) at particular positions in the index (so it only takes integers).

```
df.iloc[1]          # selects the second row (positions start at 0)
df.iloc[1:3]        # slices work the same way as for a list
df.iloc[1:3, 0:2]   # [row selection, column selection]
df.iloc[1, 2]       # returns a single value
```
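
To make the difference between labels and positions concrete, here is a short sketch on a made-up frame whose index holds strings, so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string labels.
df = pd.DataFrame({"mark": [55, 60, 72]}, index=["ann", "bob", "cat"])

by_label = df.loc["bob", "mark"]      # label-based lookup
by_position = df.iloc[1, 0]           # second row, first column

# loc slices include the end label; iloc slices exclude the end position.
loc_slice = df.loc["ann":"bob"]       # 2 rows
iloc_slice = df.iloc[0:1]             # 1 row
```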
### SQL GROUP BY
Filter the columns before grouping:
~~~
df[['species','course_mark']].groupby('species').mean()
# multiple columns: pass the grouping keys as a list
df[['species','gender','course_mark']].groupby(['species','gender']).mean()
~~~
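
A runnable sketch of the two group-by forms on made-up data (the values are assumptions chosen so the means are easy to check by hand):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["cat", "cat", "dog", "dog"],
    "gender": ["f", "m", "f", "f"],
    "course_mark": [70.0, 80.0, 60.0, 90.0],
})

# One key: mean mark per species.
by_species = df.groupby("species")["course_mark"].mean()

# Several keys go in a list; the result is indexed by (species, gender).
by_both = df.groupby(["species", "gender"])["course_mark"].mean()
```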

### SQL JOIN
~~~
# student_id is the key and must be present in both tables
# how: the type of join ('left', 'right', 'outer', 'inner')
help(pd.DataFrame.merge)
pd.merge(df, df_more_info, on='student_id', how='inner')
~~~
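
The effect of `how` is easiest to see on two small tables that only partly overlap. The frames below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"student_id": [1, 2, 3], "course_mark": [55, 60, 72]})
df_more_info = pd.DataFrame({"student_id": [2, 3, 4],
                             "city": ["Oslo", "Lima", "Rome"]})

inner = pd.merge(df, df_more_info, on="student_id", how="inner")  # ids 2, 3
left = pd.merge(df, df_more_info, on="student_id", how="left")    # ids 1, 2, 3
outer = pd.merge(df, df_more_info, on="student_id", how="outer")  # ids 1, 2, 3, 4
```

Rows without a match on the other side get NaN in the columns that come from that side.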

Many methods available on Series are not available on DataFrame. This is not an issue: extract the desired column (a Series) from the DataFrame, apply the changes, and then insert it back into the DataFrame.
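
A minimal sketch of that extract, change, and reinsert pattern, using the `.str` accessor (which exists on Series but not on DataFrame). The `name` column is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["  ann ", "BOB"], "course_mark": [55, 60]})

# Work on the column (a Series), then assign the result back to the frame.
cleaned = df["name"].str.strip().str.title()
df["name"] = cleaned
```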

# Pandas and Numpy
## Data Type: Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)

* data can be many different things:

   * a Python dict
   * an ndarray
   * a scalar value


### ndarrays
Index optional

In [19]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])  # if no index is specified, a default one is created
# a user-defined index must match the length of the data
s  # notice that the Series has a dtype, like an ndarray

a    0.331381
b   -0.248782
c    1.364291
d   -0.330436
e    0.217402
dtype: float64

In [20]:
# data can be accessed by position (iloc) or by the user-defined labels
print(s.iloc[0])
print(s['b'])

0.33138124052782847
-0.2487822404762439


In [13]:
pd.Series(np.random.randn(5)) # default index

0   -1.571274
1   -0.475562
2   -0.616063
3    0.764194
4    0.890147
dtype: float64

###  Dictionary

In [21]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

### From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [22]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

#### Series is ndarray-like (from Numpy)

Series acts very similarly to a ndarray from NumPy and is a valid argument to most NumPy functions. Operations such as slicing will also slice the index.

In [26]:
s[::-1] # reversed array

e    0.217402
d   -0.330436
c    1.364291
b   -0.248782
a    0.331381
dtype: float64

In [27]:
s[s > s.median()] # WHERE clause

a    0.331381
c    1.364291
dtype: float64

In [28]:
s.iloc[[4, 3, 1]]  # select several elements by position

e    0.217402
d   -0.330436
b   -0.248782
dtype: float64

In [29]:
np.exp(s)

a    1.392891
b    0.779750
c    3.912950
d    0.718610
e    1.242844
dtype: float64

That being said, if you need an actual ndarray for NumPy, you must convert the Series from pandas to NumPy:

In [30]:
s.to_numpy()

array([ 0.33138124, -0.24878224,  1.36429146, -0.33043587,  0.2174025 ])

#### Series is dict-like (from Python)

A Series is also like a fixed-size dict: you can get and set values by index label.

In [31]:
'e' in s

True

In [33]:
s['e'] = 12  # assigning to a label that does not exist adds it; reading a missing label raises a KeyError
print(s)

a     0.331381
b    -0.248782
c     1.364291
d    -0.330436
e    12.000000
dtype: float64
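
The dict-like behaviour for missing labels can be sketched as follows, on a small made-up Series. `get()` mirrors `dict.get()` and returns a default instead of raising.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])

try:
    s["z"]                      # reading a missing label raises KeyError
    missing_raised = False
except KeyError:
    missing_raised = True

fallback = s.get("z", np.nan)   # .get() returns the default instead
s["z"] = 3.0                    # assigning a new label adds an entry
```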


### Vectorized operations (SUPER IMPORTANT)

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in Pandas. Series can also be passed into most NumPy methods expecting an ndarray

In [36]:
s+s

a     0.662762
b    -0.497564
c     2.728583
d    -0.660872
e    24.000000
dtype: float64

In [39]:
s * 2  # doesn't change s itself, only returns the result

a     0.662762
b    -0.497564
c     2.728583
d    -0.660872
e    24.000000
dtype: float64

In [38]:
np.exp(s)

a         1.392891
b         0.779750
c         3.912950
d         0.718610
e    162754.791419
dtype: float64

### Key difference between Series and ndarray (ALSO HUGE)
Operations between Series automatically align data based on the label. Thus, you can write computations without considering whether the Series involved have the same labels. Labels found in only one of the Series give NaN in the result.
This is great when you are missing data.

In [40]:
s1 = s[1:]
s2 = s[:-1]
s1 + s2

a         NaN
b   -0.497564
c    2.728583
d   -0.660872
e         NaN
dtype: float64
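
If NaN is not what you want for labels present on only one side, the `add()` method accepts a `fill_value`. A minimal sketch on a made-up Series with fixed values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s1 = s[1:]     # labels b, c
s2 = s[:-1]    # labels a, b

plain = s1 + s2                     # a and c become NaN
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0
```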

In [43]:
s = pd.Series(np.random.randn(5), name='something') # series can also have names
s

0    2.911133
1   -1.908267
2   -0.915137
3    0.001937
4   -0.775998
Name: something, dtype: float64

## DataFrames
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input.

Don't think of the dimensions the way you would for vectors in linear algebra. Think of them as x/y: rows and columns.

In [55]:
d = {'one': pd.Series([1., 2.], index=['0', '1']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} # the indexes do not match here, so the result is padded with NaN
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
0,1.0,
1,2.0,
a,,1.0
b,,2.0
c,,3.0
d,,4.0


In [56]:
# d is a dictionary whose values are pandas Series
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} # if no index is passed, the result index is the union of the Series indexes
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [57]:
pd.DataFrame(d, index=['d', 'b', 'a']) # organized with desired index and in a user defined order

Unnamed: 0,one,two
d,,4.0
b,2.0,2.0
a,1.0,1.0


In [58]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three']) # this time we select a column 'three' that doesn't exist
# pandas still copes by filling it with missing values

Unnamed: 0,two,three
d,4.0,
b,2.0,
a,1.0,


### Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [59]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [61]:
df['three'] = df['one'] * df['two'] # new column created from a computation on other columns
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,4.0
c,3.0,3.0,9.0
d,,4.0,


In [62]:
df['flag'] = df['one'] > 2 # boolean column creation
df

Unnamed: 0,one,two,three,flag
a,1.0,1.0,1.0,False
b,2.0,2.0,4.0,False
c,3.0,3.0,9.0,True
d,,4.0,,False


In [63]:
del df['two']
df

Unnamed: 0,one,three,flag
a,1.0,1.0,False
b,2.0,4.0,False
c,3.0,9.0,True
d,,,False


In [65]:
df['foo'] = 'bar' # a scalar value is repeated to fill every row
df

Unnamed: 0,one,three,flag,foo
a,1.0,1.0,False,bar
b,2.0,4.0,False,bar
c,3.0,9.0,True,bar
d,,,False,bar


In [67]:
df['one_trunc'] = df['one'][:2]
df

Unnamed: 0,one,three,flag,foo,one_trunc
a,1.0,1.0,False,bar,1.0
b,2.0,4.0,False,bar,2.0
c,3.0,9.0,True,bar,
d,,,False,bar,


In [68]:
list('abc')

['a', 'b', 'c']

In [70]:
df = pd.DataFrame(np.random.randn(8, 3), columns=list('ABC')) # the original example used a datetime index, which didn't work for us; that's fine because we don't need the index here
df * 5 + 2

Unnamed: 0,A,B,C
0,6.094645,4.909544,-1.289835
1,-2.138926,6.44183,8.443969
2,3.643625,7.424927,-3.155319
3,-1.44134,-0.231174,5.197328
4,3.337258,-4.427184,7.679923
5,16.713755,0.550422,-1.355849
6,-1.577155,5.849997,-4.015655
7,-4.815752,6.988152,9.880489


In [71]:
1 / df

Unnamed: 0,A,B,C
0,1.221107,1.718482,-1.519833
1,-1.208043,1.125662,0.775919
2,3.042056,0.921671,-0.969872
3,-1.452922,-2.240973,1.563806
4,3.738994,-0.777946,0.880294
5,0.339818,-3.44928,-1.489936
6,-1.397759,1.298702,-0.831165
7,-0.733595,1.002375,0.634478


In [72]:
df ** 4

Unnamed: 0,A,B,C
0,0.449764,0.114662,0.18742
1,0.469538,0.622828,2.758885
2,0.011677,1.385784,1.130166
3,0.224404,0.039651,0.167213
4,0.005117,2.730253,1.66529
5,74.991841,0.007065,0.202922
6,0.261982,0.351529,2.095326
7,3.452829,0.990555,6.170676


In [74]:
df # data frame remains unchanged

Unnamed: 0,A,B,C
0,0.818929,0.581909,-0.657967
1,-0.827785,0.888366,1.288794
2,0.328725,1.084985,-1.031064
3,-0.688268,-0.446235,0.639466
4,0.267452,-1.285437,1.135985
5,2.942751,-0.289916,-0.67117
6,-0.715431,0.769999,-1.203131
7,-1.36315,0.99763,1.576098


In [80]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)
print(df1,'\n',df2)

       a      b
0   True  False
1  False   True
2   True   True 
        a      b
0  False   True
1   True   True
2   True  False


### Bitwise operator recap
* & (AND): sets each bit to 1 if both bits are 1
* | (OR): sets each bit to 1 if at least one of the two bits is 1
* ^ (XOR): sets each bit to 1 if only one of the two bits is 1
* ~ (NOT): inverts all the bits

In [81]:
df1 & df2

Unnamed: 0,a,b
0,False,False
1,False,True
2,True,False


In [83]:
df1 | df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True


In [85]:
~df1 # element-wise NOT (inverse); unary minus is not supported on booleans in recent NumPy/pandas

Unnamed: 0,a,b
0,False,True
1,True,False
2,False,False


In [88]:
dft = pd.DataFrame({'A': np.random.rand(3),
                        'B': 1,
                        'C': 'foo',
                        'D': pd.Timestamp('20010102'),
                        'E': pd.Series([1.0] * 3).astype('float32'),
                        'F': False,
                        'G': pd.Series([1] * 3, dtype='int8')})
dft

Unnamed: 0,A,B,C,D,E,F,G
0,0.265622,1,foo,2001-01-02,1.0,False,1
1,0.936956,1,foo,2001-01-02,1.0,False,1
2,0.330205,1,foo,2001-01-02,1.0,False,1


For the most part, Pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns].
However, NumPy's types do not cover everything Pandas needs (for example dedicated string, categorical or nullable integer data), so Pandas extends NumPy's type system in a few places. The dtypes attribute lists the type of each column:

In [152]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

### Pandas has two ways of storing strings.

   * object dtype, which can hold any Python object, including strings.
   * StringDtype, which is dedicated to strings (introduced in pandas 1.0.0, released in 2020)

It is recommended to use StringDtype for strings because an object column can hide any data type inside.
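
A short sketch of the difference, on made-up values: an object column silently accepts mixed data, while a column created with `dtype="string"` is dedicated to text and keeps missing values as `<NA>`.

```python
import pandas as pd

# object dtype: holds arbitrary Python objects, so mixed data goes unnoticed.
mixed = pd.Series(["a", 1, 2.5])

# StringDtype: dedicated to text; None is stored as a missing value.
text = pd.Series(["a", "b", None], dtype="string")
upper = text.str.upper()   # the missing value stays missing
```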

In [90]:
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3      6
4    foo
dtype: object

In [91]:
dft['A'].dtype

dtype('float64')

## converting
You can use the astype() method to explicitly convert dtypes from one to another. These will by default return a copy, even if the dtype was unchanged (pass copy=False to change this behavior). 

In [93]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
df1

Unnamed: 0,A
0,0.141354
1,-1.177823
2,-0.648464
3,0.086542
4,0.322908
5,-0.217708
6,-0.679702
7,1.364404


In [94]:
df1.dtypes

A    float32
dtype: object

In [95]:
dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})
dft1 = dft1.astype({'a': bool, 'c': np.float64})  # np.bool was removed in NumPy 1.24; use the built-in bool
dft1

Unnamed: 0,a,b,c
0,True,4,7.0
1,False,5,8.0
2,True,6,9.0


In [96]:
dft1.dtypes # goes by columns

a       bool
b      int64
c    float64
dtype: object

## Attributes of Pandas objects

Pandas objects have a number of attributes enabling us to access metadata:

   * Shape: gives the axis dimensions of the object, consistent with ndarray

   * Axis labels:
       * Series: index (only axis)
       * DataFrame: index (rows) and columns


In [100]:
df = pd.DataFrame(np.random.randn(8, 3),
                      columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,1.315139,3.058814,0.43259
1,0.609064,0.393456,-1.267463
2,0.454433,0.179353,1.632422
3,1.865333,-0.59268,-0.846585
4,0.858333,0.351358,-0.233329
5,-1.181659,0.662493,0.591432
6,-0.516239,-1.437974,-2.050421
7,-0.520084,0.100148,0.366445


In [101]:
df.columns = [x.lower() for x in df.columns]
df

Unnamed: 0,a,b,c
0,1.315139,3.058814,0.43259
1,0.609064,0.393456,-1.267463
2,0.454433,0.179353,1.632422
3,1.865333,-0.59268,-0.846585
4,0.858333,0.351358,-0.233329
5,-1.181659,0.662493,0.591432
6,-0.516239,-1.437974,-2.050421
7,-0.520084,0.100148,0.366445


## Counting values in Series

The value_counts() Series method counts how often each distinct value occurs in a 1D array of values (a histogram of the values).

In [104]:
data = np.random.randint(0, 7, size=50)
data

array([0, 1, 3, 5, 0, 2, 4, 4, 6, 0, 3, 2, 6, 3, 6, 6, 3, 4, 1, 3, 3, 5,
       4, 6, 4, 0, 4, 1, 6, 3, 6, 4, 1, 4, 3, 2, 4, 5, 5, 6, 4, 1, 2, 3,
       3, 6, 1, 3, 5, 0])

In [105]:
s = pd.Series(data)
s.value_counts() # counts the values of a 1D object such as a Series; DataFrame.value_counts() (pandas 1.1+) counts unique rows instead

3    11
4    10
6     9
1     6
5     5
0     5
2     4
dtype: int64

Finding the most frequent values in a Series (mode)

In [106]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
s5

0    1
1    1
2    3
3    3
4    3
5    5
6    5
7    7
8    7
9    7
dtype: int64

In [107]:
s5.mode() # if several values are tied for most frequent, all of them are returned

0    3
1    7
dtype: int64

## Altering labels
### Reindexing

reindex() is the fundamental data alignment method in Pandas. It is used to implement nearly all other features relying on a label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

   * Reorders the existing data to match a new set of labels
   * Inserts missing value (NA) markers in label locations where no data for that label existed


In [5]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -1.758754
b   -0.279973
c   -1.009801
d   -1.634509
e   -0.761150
dtype: float64

In [6]:
s.reindex(['e', 'b', 'f', 'd']) # 'f' shows up as NaN because there was no label 'f' in s

e   -0.761150
b   -0.279973
f         NaN
d   -1.634509
dtype: float64

In [7]:
df = pd.DataFrame({
     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,1.166862,-2.819775,
b,-0.412584,-0.994425,0.30304
c,0.188266,-0.588971,-0.394805
d,,0.472863,0.383307


In [8]:
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one']) # in this example we change the rows and columns

Unnamed: 0,three,two,one
c,-0.394805,-0.588971,0.188266
f,,,
b,0.30304,-0.994425,-0.412584


### Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis

In [9]:
df = pd.DataFrame({
     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,0.419104,0.209293,
b,0.593056,-0.176225,-1.71162
c,-1.38788,-0.76399,0.699568
d,,-0.064691,0.378812


In [10]:
#rows
df.drop(['a', 'd'], axis=0) # doesn't delete the data from df; it returns a new object that can be assigned to a variable

Unnamed: 0,one,two,three
b,0.593056,-0.176225,-1.71162
c,-1.38788,-0.76399,0.699568


In [11]:
#columns
df.drop(['one'], axis=1)

Unnamed: 0,two,three
a,0.209293,
b,-0.176225,-1.71162
c,-0.76399,0.699568
d,-0.064691,0.378812


### Renaming

The rename() method allows us to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.
It creates a copy and doesn't modify the original object, so assign the result (e.g. df = df.rename(...)) if you want to keep it.


In [12]:
s

a   -1.758754
b   -0.279973
c   -1.009801
d   -1.634509
e   -0.761150
dtype: float64

In [13]:
s.rename(str.upper)

A   -1.758754
B   -0.279973
C   -1.009801
D   -1.634509
E   -0.761150
dtype: float64

In [16]:
s.index = s.index.rename('alpha')
s

alpha
a   -1.758754
b   -0.279973
c   -1.009801
d   -1.634509
e   -0.761150
Name: test, dtype: float64

rename() can also be used for DataFrames; see help(pd.DataFrame.rename), which contains examples.

In [142]:
df.rename(columns={'one': 'foo', 'two': 'bar'},
              index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

Unnamed: 0,foo,bar,three
apple,-0.725656,-2.594434,
banana,1.743822,-0.2053,0.675464
c,0.11789,0.864897,-0.191246
durian,,-0.529426,0.949863


In [143]:
# other examples
# df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
# df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')

## .dt and .str accessors
### .dt accessor

Series has an accessor to succinctly return datetime-like properties for the values of the Series, if it is a datetime/period-like Series. This will return a Series, indexed like an existing Series.

#### datetime

In [147]:
s = pd.Series(pd.date_range('20130101 09:10:12', periods=4)) # a Series of datetimes; the index stays 0..3 and the values advance by one day per row (the default frequency)
s

0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [150]:
print(s.dt.hour)
#print(s.dt.day)
#print(s.dt.minute)
#print(s.dt.second)

0    9
1    9
2    9
3    9
dtype: int64


In [151]:
stz = s.dt.tz_localize('US/Eastern')
stz

0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [161]:
s.dt.tz_localize('UTC').dt.tz_convert('Asia/Tokyo')

0   2013-01-01 18:10:12+09:00
1   2013-01-02 18:10:12+09:00
2   2013-01-03 18:10:12+09:00
3   2013-01-04 18:10:12+09:00
dtype: datetime64[ns, Asia/Tokyo]


If you want to list all of the time zones available in Python:

import pytz
pytz.all_timezones

See also: https://pvlib-python.readthedocs.io/en/stable/timetimezones.html

### .str accessor

Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods.

In [167]:
# to insert a missing value, use np.nan
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
                  dtype="string")
s
# list of all available string manipulations for series
# dir(pd.Series.str)

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string
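
A few of these string methods applied through the `.str` accessor, sketched on a shortened version of the Series above:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "Aaba", np.nan, "dog"], dtype="string")

lowered = s.str.lower()                 # element-wise lower-casing
lengths = s.str.len()                   # length of each string
has_a = s.str.contains("a", na=False)   # treat missing values as "no match"
```

Missing values are propagated by `lower()` and `len()`, while `contains()` lets you choose what they should map to through `na`.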

## Sorting

There are three types of sorting in Pandas:
1. Sorting by index labels
2. Sorting by column values
3. Sorting by a combination of both
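
The third kind is not shown further down, so here is a minimal sketch of it. It assumes the index has a name (`row_id` here, a made-up name), because `sort_values()` accepts index level names alongside column labels in `by`.

```python
import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [1, 2, 3, 4]},
                  index=pd.Index([3, 2, 1, 0], name="row_id"))

# "group" is a column and "row_id" is the index name; both can appear in `by`.
combined = df.sort_values(by=["group", "row_id"])
```

Rows are ordered by `group` first and ties are broken by the index value.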

### By index

The ```Series.sort_index()```
and ```DataFrame.sort_index()```
methods are used to sort a Pandas object by its index levels.

In [169]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,-0.155852,0.214913,
b,-0.78155,0.691551,-0.132224
c,1.693165,0.310386,0.157824
d,,-1.306033,0.304214


In [176]:
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                          columns=['three', 'two', 'one'])
unsorted_df

Unnamed: 0,three,two,one
a,,0.214913,-0.155852
d,0.304214,-1.306033,
c,0.157824,0.310386,1.693165
b,-0.132224,0.691551,-0.78155


In [172]:
unsorted_df.sort_index()

Unnamed: 0,three,two,one
a,,0.214913,-0.155852
b,-0.132224,0.691551,-0.78155
c,0.157824,0.310386,1.693165
d,0.304214,-1.306033,


In [180]:
# Sort DataFrame by index
unsorted_df.sort_index(ascending=False) # if the parameter is omitted, ascending defaults to True

Unnamed: 0,three,two,one
d,0.304214,-1.306033,
c,0.157824,0.310386,1.693165
b,-0.132224,0.691551,-0.78155
a,,0.214913,-0.155852


In [177]:
# Sort DataFrame by column names
unsorted_df.sort_index(axis=1)

Unnamed: 0,one,three,two
a,-0.155852,,0.214913
d,,0.304214,-1.306033
c,1.693165,0.157824,0.310386
b,-0.78155,-0.132224,0.691551


In [178]:
#Sort Series by index
unsorted_df['three'].sort_index()

a         NaN
b   -0.132224
c    0.157824
d    0.304214
Name: three, dtype: float64

### By values

The Series.sort_values() method is used to sort a Series by its values. The DataFrame.sort_values() method is used to sort a DataFrame by its column or row values.

In [20]:
df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                        'two': [1, 3, 2, 4],
                        'three': [5, 4, 3, 2]})
df1

Unnamed: 0,one,two,three
0,2,1,5
1,1,3,4
2,1,2,3
3,1,4,2


In [21]:
# Sort DataFrame by column "two"
df1.sort_values(by='two') # the 'by' argument is required for a DataFrame

Unnamed: 0,one,two,three
0,2,1,5
2,1,2,3
1,1,3,4
3,1,4,2


In [22]:
#Sort DataFrame by columns "one" and "two"
df1[['one', 'two', 'three']].sort_values(by=['one', 'two'], ascending=False) # the double brackets select a list of columns and return a DataFrame, which is then sorted by 'one' and then 'two'

Unnamed: 0,one,two,three
0,2,1,5
3,1,4,2
1,1,3,4
2,1,2,3


In [189]:
df1[['one','two']].sort_values(by=['one', 'two'])

Unnamed: 0,one,two
2,1,2
1,1,3
3,1,4
0,2,1


In [190]:
df1

Unnamed: 0,one,two,three
0,2,1,5
1,1,3,4
2,1,2,3
3,1,4,2


## Accessing values in a Series or DataFrame

In [197]:
df1.values[0]

array([2, 1, 5], dtype=int64)

In [202]:
s = pd.Series([1,2,3,4,5])
#help(pd.Series)
s[2]

3
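
For single values in a DataFrame, the `at` and `iat` accessors are label-based and position-based counterparts of the lookups above. A small sketch on a made-up frame with string row labels:

```python
import pandas as pd

df1 = pd.DataFrame({"one": [2, 1], "two": [1, 3]}, index=["x", "y"])

first_row = df1.to_numpy()[0]   # NumPy array holding the first row
by_label = df1.at["y", "two"]   # scalar access by row/column label
by_pos = df1.iat[1, 1]          # scalar access by row/column position
```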

In [26]:
dir(pd.DataFrame)
help(pd.DataFrame.merge)

Help on function merge in module pandas.core.frame:

merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) -> 'DataFrame'
    Merge DataFrame or named Series objects with a database-style join.
    
    The join is done on columns or indexes. If joining columns on
    columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
    on indexes or indexes on a column or columns, the index will be passed on.
    
    Parameters
    ----------
    right : DataFrame or named Series
        Object to merge with.
    how : {'left', 'right', 'outer', 'inner'}, default 'inner'
        Type of merge to be performed.
    
        * left: use only keys from left frame, similar to a SQL left outer join;
          preserve key order.
        * right: use only keys from right frame, similar to a SQL right outer join;
          preserve key order.
       