Most of this comes from the book **Python for Data Analysis** by *Wes McKinney*.

In [2]:
import numpy as np
import pandas as pd

Pandas has two main data structures: a Series (one-dimensional) and a DataFrame (two-dimensional).

### Series

In [7]:
data = pd.Series([4, 76, 3, -2, .5])
data

0     4.0
1    76.0
2     3.0
3    -2.0
4     0.5
dtype: float64

As we can see, an *Index* is created. It can be specified explicitly; otherwise pandas creates one automatically.

In [8]:
data.values #array representation of the Series

array([ 4. , 76. ,  3. , -2. ,  0.5])

In [9]:
data.index #gives the index info

RangeIndex(start=0, stop=5, step=1)

#### Give Index explicitly

In [34]:
new_data = pd.Series(data = [4, 3, 2, 1], index = ["a", "b", 1, 2])
new_data

a    4
b    3
1    2
2    1
dtype: int64

#### Accessing rows

In [35]:
new_data["b"]

3

In [36]:
#With [], an integer key is matched against the index labels here,
# so this returns the value labelled 2, not the element at position 2
# (use .loc / .iloc to be explicit, see below)
new_data[2]

1

In [37]:
new_data[new_data>1]

a    4
b    3
1    2
dtype: int64

#### Accessing data by position

In [26]:
new_data.iloc[1]

3

#### Accessing data by index name

In [27]:
new_data.loc[1]

2
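
The difference between label-based and position-based access is easiest to see when the labels are integers that do not match the positions. A minimal sketch, where the Series `s` and its labels are made up for illustration:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the value stored under label 0
by_position = s.iloc[0]  # the first element, whatever its label is

print(by_label, by_position)  # 20 10
```

Using `.loc` and `.iloc` explicitly avoids having to remember how plain `[]` resolves an integer key.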

#### Assigning data

In [38]:
new_data.iloc[3] = 69
new_data

a     4
b     3
1     2
2    69
dtype: int64

#### Series Operations

In [43]:
new_data.isnull() #pd.isnull(new_data)

a    False
b    False
1    False
2    False
dtype: bool

In [42]:
pd.notnull(new_data) #new_data.notnull()

a    True
b    True
1    True
2    True
dtype: bool

In [44]:
new_data + data

0     NaN
1    78.0
2    72.0
3     NaN
4     NaN
a     NaN
b     NaN
dtype: float64
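
The NaNs above appear because arithmetic between two Series aligns on index labels, and a label missing on either side produces a missing result. A small sketch of that behaviour, with two hypothetical Series `left` and `right`; `Series.add` with `fill_value` is one way to treat the missing side as a number instead:

```python
import pandas as pd

# Two hypothetical Series that share only the label "b"
left = pd.Series([1, 2], index=["a", "b"])
right = pd.Series([10, 20], index=["b", "c"])

aligned = left + right                  # NaN wherever a label exists on one side only
filled = left.add(right, fill_value=0)  # missing side is treated as 0

print(aligned)
print(filled)
```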

In [46]:
#Give the Series a name
new_data.name = 'example'
new_data

a     4
b     3
1     2
2    69
Name: example, dtype: int64

In [48]:
#Give the index a name
new_data.index.name = "index"
new_data

index
a     4
b     3
1     2
2    69
Name: example, dtype: int64

### DataFrame

In [11]:
#Data can also be given as a dictionary; the keys become the column names
data = {'day': ['Monday', 'Tuesday', 'Wednesday', 'Friday'],
       'value': [12, 34, 55, 69],
       'more': [True, False, True, False]}
df = pd.DataFrame(data)
df

Unnamed: 0,day,value,more
0,Monday,12,True
1,Tuesday,34,False
2,Wednesday,55,True
3,Friday,69,False


As we can see, an index is created automatically again and the DataFrame is displayed as a table. `.head()` returns the first five rows by default (this frame only has four).

In [169]:
df.head()

Unnamed: 0,day,value,more
0,Monday,12,True
1,Tuesday,34,False
2,Wednesday,55,True
3,Friday,69,False


#### Accessing Columns

In [124]:
df.more.iloc[2] #attribute notation, works only when the column name is a
                # valid Python identifier (no spaces, no leading digit, ...)

True

In [125]:
df['value'][2] #dict-like notation

55
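
To illustrate when attribute notation breaks down, here is a short sketch with a hypothetical frame whose column names are chosen to cause trouble: one contains a space and one shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one name has a space, one collides with DataFrame.count
frame = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4]})

prices = frame["unit price"]  # bracket notation always works
counts = frame["count"]       # the column
method = frame.count          # attribute access gives the method, not the column

print(prices.tolist(), counts.tolist(), callable(method))
```

Bracket notation is therefore the safer default whenever column names are not under your control.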

#### Creating new column

In [21]:
df['newcol'] = 1 #does not work with attribute notation
df

Unnamed: 0,day,value,more,newcol
0,Monday,12,True,1
1,Tuesday,34,False,1
2,Wednesday,55,True,1
3,Friday,69,False,1


In [23]:
df['newcol2'] = df['value']*2
df

Unnamed: 0,day,value,more,newcol,newcol2
0,Monday,12,True,1,24
1,Tuesday,34,False,1,68
2,Wednesday,55,True,1,110
3,Friday,69,False,1,138


### Dropping Items from an Axis

In [127]:
del df['newcol'] #also does not work with attribute access
df.columns

Index(['day', 'value', 'more'], dtype='object')

In [132]:
#The drop method works with multiple columns or rows (choose via axis)
cols = ['more']
new_df = df.drop(cols, axis = 1)
new_df

Unnamed: 0,day,value
0,Monday,12
1,Tuesday,34
2,Wednesday,55
3,Friday,69


In [134]:
#Delete a row
df.drop([0]) #can also give inplace = True for inplace mutation

Unnamed: 0,day,value,more
1,Tuesday,34,False
2,Wednesday,55,True
3,Friday,69,False
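
Note that `drop` returns a new object unless `inplace = True` is passed, so the original frame keeps all of its rows and columns. A minimal sketch on a small made-up frame, using the `index=` and `columns=` keywords as an alternative to `axis`:

```python
import pandas as pd

frame = pd.DataFrame({"day": ["Monday", "Tuesday"], "value": [12, 34]})

without_row = frame.drop(index=[0])          # drop by row label
without_col = frame.drop(columns=["value"])  # same as drop(["value"], axis=1)

# The original frame is unchanged
print(frame.shape, without_row.shape, without_col.shape)  # (2, 2) (1, 2) (2, 1)
```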


### Transpose the df

In [135]:
df.T

Unnamed: 0,0,1,2,3
day,Monday,Tuesday,Wednesday,Friday
value,12,34,55,69
more,True,False,True,False


In [106]:
df.index.name = 'random'
df.columns.name = 'features'
df

features,day,value,more
random,,,
0,Monday,12,True
1,Tuesday,34,False
2,Wednesday,55,True
3,Friday,69,False


The `.values` attribute returns a NumPy ndarray with the DataFrame's values:

In [107]:
df.values

array([['Monday', 12, True],
       ['Tuesday', 34, False],
       ['Wednesday', 55, True],
       ['Friday', 69, False]], dtype=object)

Also important to note: The index is immutable

In [117]:
#df.index[1] = 12 --> TypeError
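
Immutability only concerns changing single labels in place. As a sketch on a made-up frame, the snippet below shows the failing assignment and two ways around it: replacing the whole index, or using `rename` to get a new frame with selected labels changed.

```python
import pandas as pd

frame = pd.DataFrame({"value": [12, 34, 55]})

raised = False
try:
    frame.index[1] = 12  # Index objects do not support item assignment
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Replacing the whole index at once is allowed
frame.index = ["x", "y", "z"]

# rename returns a new frame with the given labels changed
renamed = frame.rename(index={"y": "why"})
print(renamed.index.tolist())  # ['x', 'why', 'z']
```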

### Re-Indexing: Gives a new index to data

In [118]:
df2 = pd.Series([1, 2, 3, 4], index = ["a", "b", "c", "d"])
df2.reindex(["c", "d", "a", "b"])

c    3
d    4
a    1
b    2
dtype: int64

Labels that do not exist in the old index get NaN by default. Re-indexing can fill these gaps instead, for example with `method = "ffill"` or `method = "bfill"`.

In [119]:
df2.reindex(["a", "b", "c", "x"])

a    1.0
b    2.0
c    3.0
x    NaN
dtype: float64

In [120]:
df2.reindex(["a", "b", "c", "x"], method = "ffill")

a    1
b    2
c    3
x    4
dtype: int64
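
The fill options can be compared side by side. The sketch below rebuilds the same Series (named `s` here) and reindexes it three ways; `fill_value` is an additional `reindex` parameter not used above that inserts a constant.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
new_labels = ["a", "b", "c", "x"]

constant = s.reindex(new_labels, fill_value=0)    # x gets the constant 0
forward = s.reindex(new_labels, method="ffill")   # x gets 4, the value of the last label before "x" ("d")
backward = s.reindex(new_labels, method="bfill")  # x stays NaN, no label comes after "x"

print(constant["x"], forward["x"], backward["x"])
```

The `method` options rely on the original index being sorted, which is the case here.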

Re-Indexing works for columns as well:

In [131]:
df.reindex(['value', 'day', 'more'], axis = 1)

Unnamed: 0,value,day,more
0,12,Monday,True
1,34,Tuesday,False
2,55,Wednesday,True
3,69,Friday,False


Re-Indexing both at the same time:

In [130]:
df.reindex(index = [0, 2, 3, 1], columns = ['value', 'day', 'more'])

Unnamed: 0,value,day,more
0,12,Monday,True
2,55,Wednesday,True
3,69,Friday,False
1,34,Tuesday,False


### Indexing, Selection and Filtering

In [138]:
new_data['a']

4

In [142]:
new_data['a':'b'] #includes endpoint (.loc)

index
a    4
b    3
Name: example, dtype: int64

In [141]:
df[1:3] #excludes endpoint (.iloc)

Unnamed: 0,day,value,more
1,Tuesday,34,False
2,Wednesday,55,True
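
The endpoint rule is worth spelling out with the explicit accessors. A small sketch on a hypothetical Series with string labels:

```python
import pandas as pd

s = pd.Series([4, 3, 2, 1], index=["a", "b", "c", "d"])

label_slice = s.loc["a":"c"]   # a, b, c: the end label is included
position_slice = s.iloc[0:2]   # a, b: the end position is excluded

print(label_slice.index.tolist(), position_slice.index.tolist())
```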


In [144]:
#giving new value to corresponding section
new_data['a':'b'] = 4
new_data

index
a     4
b     4
1     2
2    69
Name: example, dtype: int64

In [146]:
df[:2] #the first two rows

Unnamed: 0,day,value,more
0,Monday,12,True
1,Tuesday,34,False


In [149]:
df.value > 8

0    True
1    True
2    True
3    True
Name: value, dtype: bool

In [24]:
df.iloc[:2, 0][df.value == 12][df.day == 'Monday'] #each [] filters the previous result further

0    Monday
Name: day, dtype: object
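
Chaining several `[]` filters works, but the conditions can also be combined into a single boolean mask and passed to `.loc` together with the column to select. A minimal sketch on a frame like the one used earlier (the threshold and the excluded day are arbitrary):

```python
import pandas as pd

frame = pd.DataFrame({"day": ["Monday", "Tuesday", "Wednesday", "Friday"],
                      "value": [12, 34, 55, 69]})

# Parentheses are required because & binds tighter than the comparisons
mask = (frame["value"] > 20) & (frame["day"] != "Friday")
selected = frame.loc[mask, "day"]

print(selected.tolist())  # ['Tuesday', 'Wednesday']
```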

For more on this, see p. 144 of *Python for Data Analysis*.

### Function Application and Mapping

NumPy ufuncs also work with Pandas objects

In [170]:
df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df

Unnamed: 0,b,d,e
Utah,1.222139,-3.060116,1.194811
Ohio,-0.010086,-0.399486,0.922145
Texas,0.904392,-0.741873,0.366238
Oregon,-1.986555,-1.252987,-1.822857


In [161]:
np.abs(df) #element-wise absolute value
           # (the output below is from an earlier random draw, so it does not match the df above)

Unnamed: 0,b,d,e
Utah,0.57593,0.570305,0.381549
Ohio,0.581425,0.593045,1.008309
Texas,1.047858,0.547083,1.616207
Oregon,0.225991,0.29586,1.269392


A function that operates on a one-dimensional array can be applied to each row or column with the `.apply` method.

In [162]:
f = lambda x: x.max() - x.min() #Calculates the difference between max
                                # and min for one Series / axis

In [163]:
df.apply(f) #returns a Series with one result per column
            # (f is applied down the rows of each column)

b    1.623788
d    1.140129
e    2.624516
dtype: float64

In [171]:
df.apply(f, axis = 1) #calculation across columns

Utah      4.282255
Ohio      1.321631
Texas     1.646265
Oregon    0.733569
dtype: float64
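
The function passed to `.apply` does not have to return a scalar. As a sketch, the hypothetical helper `min_max` below returns a Series, so the result is a DataFrame with one row per statistic; a small integer frame is used so the numbers are easy to check.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list("bde"),
                     index=["Utah", "Ohio"])

def min_max(col):
    # Hypothetical helper: two statistics for one column
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = frame.apply(min_max)                                   # rows "min" and "max", one column per input column
spread = frame.apply(lambda row: row.max() - row.min(), axis=1)  # one value per row

print(summary)
print(spread)
```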

In [172]:
df.sum()

b    0.129890
d   -5.454462
e    0.660337
dtype: float64

In [173]:
df.sum(axis = 1)

Utah     -0.643167
Ohio      0.512572
Texas     0.528757
Oregon   -5.062398
dtype: float64

For element-wise calculations, `.applymap` is used on a DataFrame (`.map` on a Series). Since pandas 2.1, `.applymap` is deprecated in favor of `DataFrame.map`.

In [179]:
df.applymap(lambda x: '%.3f' % x)

Unnamed: 0,b,d,e
Utah,1.222,-3.06,1.195
Ohio,-0.01,-0.399,0.922
Texas,0.904,-0.742,0.366
Oregon,-1.987,-1.253,-1.823
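
A version-independent way to get the same element-wise formatting is to combine `.apply` with `Series.map`. This is only a sketch on a small made-up frame, not the approach used in the book:

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.22214, -0.01009], "d": [-3.06012, -0.39949]})

fmt = lambda x: "%.3f" % x

one_column = frame["b"].map(fmt)                     # element-wise on a single column
whole_frame = frame.apply(lambda col: col.map(fmt))  # same formatting on every column

print(whole_frame)
```

The formatted values are strings, so trailing zeros such as in `-3.060` are preserved.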


### Sorting and Ranking

In [181]:
df = pd.DataFrame(np.arange(8).reshape((2, 4)),
                  index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
df

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [187]:
df.sort_index(inplace = True)
df

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [186]:
df.sort_index(axis = 1, inplace = True)
df

Unnamed: 0,a,b,c,d
one,5,6,7,4
three,1,2,3,0


In [196]:
df.sort_values(by = 'a') #ascending

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [197]:
df.sort_values(by = 'c', ascending = False) #descending

Unnamed: 0,a,b,c,d
one,5,6,7,4
three,1,2,3,0


In [201]:
df.sort_values(by = ['a', 'b']) #sort by multiple columns

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4
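
With only two rows and no ties in `a`, the effect of the second sort key is not visible above. A small sketch with a made-up frame that has a tie, also showing that `ascending` accepts one flag per column:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 1, 0], "b": [2, 5, 9]}, index=["x", "y", "z"])

# Sort by a ascending; break ties in a by b descending
ordered = frame.sort_values(by=["a", "b"], ascending=[True, False])

print(ordered.index.tolist())  # ['z', 'y', 'x']
```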


The `.rank` method of a DataFrame / Series assigns ranks to the values. By default (`method = 'average'`), tied values each get the mean of the ranks they occupy.

In [222]:
ser = pd.DataFrame([[7, 2, 3, 5, 3, 6],
                   [2, 3, 4, 5, 6, 7]], ['a', 'b'])
ser = ser.T
ser

Unnamed: 0,a,b
0,7,2
1,2,3
2,3,4
3,5,5
4,3,6
5,6,7


In [215]:
ser.rank()

Unnamed: 0,a,b
0,6.0,1.0
1,1.0,2.0
2,2.5,3.0
3,4.0,4.0
4,2.5,5.0
5,5.0,6.0


In [216]:
ser.rank(method = 'first') #ties are ranked in order of appearance

Unnamed: 0,a,b
0,6.0,1.0
1,1.0,2.0
2,2.0,3.0
3,4.0,4.0
4,3.0,5.0
5,5.0,6.0


In [217]:
ser.rank(ascending = False)

Unnamed: 0,a,b
0,1.0,6.0
1,6.0,5.0
2,4.5,4.0
3,3.0,3.0
4,4.5,2.0
5,2.0,1.0


In [218]:
ser.rank(method = 'max') #tied values all get the highest rank of their group

Unnamed: 0,a,b
0,6.0,1.0
1,1.0,2.0
2,3.0,3.0
3,4.0,4.0
4,3.0,5.0
5,5.0,6.0


In [219]:
ser.rank(axis = 1) #Rank across columns works as well

Unnamed: 0,a,b
0,2.0,1.0
1,1.0,2.0
2,1.0,2.0
3,1.5,1.5
4,1.0,2.0
5,1.0,2.0
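
To compare the tie-breaking methods directly, the sketch below ranks column `a` from above with each method and collects the results in one frame. `'min'` and `'dense'` are further `rank` methods not shown above.

```python
import pandas as pd

s = pd.Series([7, 2, 3, 5, 3, 6])  # the value 3 appears at positions 2 and 4

methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})

print(ranks)
```

With `'dense'`, ranks increase by one between groups of distinct values, so the largest value here gets rank 5 rather than 6.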
