# Data Wrangling

<a href='https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html'>10 minutes to Pandas</a>

In [3]:
import pandas as pd
import numpy as np

## Object creation

Series creation

In [6]:
s = pd.Series([0, 1, 2, 3, 4])

In [35]:
s

0    0
1    1
2    2
3    3
4    4
dtype: int64
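
A Series can also carry its own labels and missing values. The sketch below is an illustration only; the values and the letter index are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values; np.nan marks a missing entry
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=list('abcdef'))

print(s.dtype)         # the missing value forces a floating-point dtype
print(s['c'])          # access by label
print(s.isna().sum())  # number of missing entries
```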

Dataframe creation employing several methods:

In [38]:
df = pd.DataFrame({
    'A': 1., # fills column A with 1.0
    'B': pd.Timestamp('20130102'), # fills column B with a date
    'C': pd.Series(1, index=list(range(4)), dtype='float32'), # creates a series to fill column C
    'D': np.array([3]*4, dtype='int32'), # creates and expands a list to fill column D
    'E': pd.Categorical(['test', 'train', 'test', 'train']), # creates categorical (not string) variable
    'F': 'foo', # fills with string variable
})

In [37]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
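
Since every column above was built from a different kind of value, each one ends up with its own dtype. A minimal sketch, reusing the same constructor, that inspects the result through `dtypes`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
    'F': 'foo',
})

# Scalars are broadcast to the length of the index; each column keeps its own dtype
print(df.dtypes)
print(df.shape)
```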


## Viewing data

In [40]:
A = pd.read_excel('A.xlsx')

In [41]:
A

Unnamed: 0,col0,col1,col2,col3,col4
0,AA,BA,CA,DA,EA
1,AB,BB,CB,DB,EB
2,AC,BC,CC,DC,EC
3,AD,BD,CD,DD,ED
4,AE,BE,CE,DE,EE


Display the index:

In [43]:
A.index

RangeIndex(start=0, stop=5, step=1)

Display the columns:

In [44]:
A.columns

Index(['col0', 'col1', 'col2', 'col3', 'col4'], dtype='object')

NumPy representation/equivalent of data<br>
<strong>
    NumPy arrays have one data type for the entire array<br>
    Pandas DataFrames hold one data type per column
</strong>

In [45]:
A.to_numpy()

array([['AA', 'BA', 'CA', 'DA', 'EA'],
       ['AB', 'BB', 'CB', 'DB', 'EB'],
       ['AC', 'BC', 'CC', 'DC', 'EC'],
       ['AD', 'BD', 'CD', 'DD', 'ED'],
       ['AE', 'BE', 'CE', 'DE', 'EE']], dtype=object)
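
The dtype difference matters when columns are of mixed types. The sketch below uses a small hypothetical frame (not the `A` table) to show how `to_numpy()` has to pick one dtype that fits every column.

```python
import pandas as pd

# Hypothetical frame with an integer, a float and a text column
mixed = pd.DataFrame({'n': [1, 2, 3], 'x': [0.5, 1.5, 2.5], 's': ['a', 'b', 'c']})
numeric = mixed[['n', 'x']]

print(mixed.dtypes)              # one dtype per column
print(numeric.to_numpy().dtype)  # int and float columns are upcast to a common float type
print(mixed.to_numpy().dtype)    # adding text forces a generic object array
```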

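Quick summary of each column (for text columns: count, number of unique values, most frequent value and its frequency):

In [46]: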
A.describe()

Unnamed: 0,col0,col1,col2,col3,col4
count,5,5,5,5,5
unique,5,5,5,5,5
top,AB,BD,CB,DD,EC
freq,1,1,1,1,1


### Transposing data

In [47]:
A.T

Unnamed: 0,0,1,2,3,4
col0,AA,AB,AC,AD,AE
col1,BA,BB,BC,BD,BE
col2,CA,CB,CC,CD,CE
col3,DA,DB,DC,DD,DE
col4,EA,EB,EC,ED,EE


### Sorting by axis:

Descending index sorting

In [48]:
A.sort_index(ascending=False)

Unnamed: 0,col0,col1,col2,col3,col4
4,AE,BE,CE,DE,EE
3,AD,BD,CD,DD,ED
2,AC,BC,CC,DC,EC
1,AB,BB,CB,DB,EB
0,AA,BA,CA,DA,EA


Descending column sorting

In [49]:
A.sort_index(axis=1, ascending=False)

Unnamed: 0,col4,col3,col2,col1,col0
0,EA,DA,CA,BA,AA
1,EB,DB,CB,BB,AB
2,EC,DC,CC,BC,AC
3,ED,DD,CD,BD,AD
4,EE,DE,CE,BE,AE


Descending sorting by both axes<br>
(Reverse the dataframe)

In [50]:
A.sort_index(axis=1, ascending=False).sort_index(axis=0, ascending=False)

Unnamed: 0,col4,col3,col2,col1,col0
4,EE,DE,CE,BE,AE
3,ED,DD,CD,BD,AD
2,EC,DC,CC,BC,AC
1,EB,DB,CB,BB,AB
0,EA,DA,CA,BA,AA
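
`sort_index` orders rows or columns by their labels. To order rows by the values stored in a column, pandas also offers `sort_values`. A short sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data, used only to illustrate sorting by values
scores = pd.DataFrame({'name': ['x', 'y', 'z'], 'score': [2, 3, 1]})

# Rows are reordered by the 'score' column; the original index labels travel with them
by_score = scores.sort_values(by='score', ascending=False)
print(by_score)
```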


## Data Selection
### Getting

In [51]:
A['col3']

0    DA
1    DB
2    DC
3    DD
4    DE
Name: col3, dtype: object

Selecting via []<br>
row slicing

In [52]:
A[1:3]

Unnamed: 0,col0,col1,col2,col3,col4
1,AB,BB,CB,DB,EB
2,AC,BC,CC,DC,EC


### Selection by label

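<strong>For details, see the pandas user guide on <a href='https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label'>selection by label</a></strong>

Selecting rows by a list of labels<br>
(the list goes inside the indexer; ```1[2]``` tries to subscript an integer and raises a ```TypeError```)

In [53]:
A.loc[[1, 2]]

Unnamed: 0,col0,col1,col2,col3,col4
1,AB,BB,CB,DB,EB
2,AC,BC,CC,DC,EC
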

Selecting on a single axis by label<br>
(unlike positional slicing, both the start and the end label are included):

```df.loc[start row : end row]```

In [56]:
A.loc[1:3]

Unnamed: 0,col0,col1,col2,col3,col4
1,AB,BB,CB,DB,EB
2,AC,BC,CC,DC,EC
3,AD,BD,CD,DD,ED
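
The two slicing styles treat the end of the slice differently. In the sketch below, `A` is rebuilt in code as a stand-in for the contents of `A.xlsx` (an assumption, so that the example runs without the file).

```python
import pandas as pd

# Stand-in for the A.xlsx table (assumption: same labels and values as shown above)
A = pd.DataFrame({f'col{i}': [c + r for r in 'ABCDE'] for i, c in enumerate('ABCDE')})

by_position = A[1:3]   # positional slice: rows at positions 1 and 2
by_label = A.loc[1:3]  # label slice: rows labelled 1, 2 and 3

print(list(by_position.index))
print(list(by_label.index))
```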


Selection of a row portion<br>
Data is displayed as a column<br>
Notice that the non-existent column ```col5``` is displayed as ```NaN```<br>
Passing a list with a missing label to ```.loc``` raises a ```KeyError``` in pandas 1.0 and later, so ```.reindex()``` is used instead

In [57]:
A.loc[3].reindex(['col3', 'col1', 'col5'])


col3     DD
col1     BD
col5    NaN
Name: 3, dtype: object
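
When missing labels should be skipped rather than filled with `NaN`, one option is to keep only the labels that actually exist. This is a sketch under assumptions: `A` is rebuilt in code as a stand-in for `A.xlsx`, and `requested` is a hypothetical list of wanted columns.

```python
import pandas as pd

# Stand-in for the A.xlsx table (assumption: same labels and values as shown above)
A = pd.DataFrame({f'col{i}': [c + r for r in 'ABCDE'] for i, c in enumerate('ABCDE')})

requested = ['col3', 'col1', 'col5']  # 'col5' does not exist in A

try:
    A.loc[3, requested]
except KeyError as err:
    # Raised on pandas 1.0 and later because of the missing label
    print('KeyError:', err)

# Keep only the requested labels that are present in the columns
existing = A.columns.intersection(requested)
picked = A.loc[3, existing]
print(picked)
```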

Selecting only first row<br>
(data is displayed as column)

In [58]:
A.loc[0]

col0    AA
col1    BA
col2    CA
col3    DA
col4    EA
Name: 0, dtype: object

Same row as ```A.loc[0]```<br>
but returned as a dataframe instead of a series

In [59]:
A.loc[:0]

Unnamed: 0,col0,col1,col2,col3,col4
0,AA,BA,CA,DA,EA


Selecting last row

In [60]:
A.loc[len(A)-1:]

Unnamed: 0,col0,col1,col2,col3,col4
4,AE,BE,CE,DE,EE


Selecting last 3 rows:

In [61]:
A.loc[len(A)-3:]

Unnamed: 0,col0,col1,col2,col3,col4
2,AC,BC,CC,DC,EC
3,AD,BD,CD,DD,ED
4,AE,BE,CE,DE,EE
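
The `len(A)-n` trick relies on the default integer labels 0, 1, 2 and so on. A sketch of a label-independent alternative using `tail`, with `A` rebuilt in code as a stand-in for `A.xlsx` and `B` as a hypothetical copy indexed by text labels:

```python
import pandas as pd

# Stand-in for the A.xlsx table (assumption: same labels and values as shown above)
A = pd.DataFrame({f'col{i}': [c + r for r in 'ABCDE'] for i, c in enumerate('ABCDE')})

# Hypothetical variant whose row labels are strings instead of 0..4
B = A.set_index('col0')

last_row = B.tail(1)  # last row, whatever the labels are
last_three = B.tail(3)

print(last_three)
```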


Selecting on a multi-axis by label<br>
(the non-existent ```col5``` is requested through ```.reindex()```, since ```.loc``` raises a ```KeyError``` for missing labels in pandas 1.0 and later):

In [62]:
A.reindex(columns=['col3', 'col5'])


Unnamed: 0,col3,col5
0,DA,
1,DB,
2,DC,
3,DD,
4,DE,


In [63]:
A.loc[1:2, ['col2', 'col4']]

Unnamed: 0,col2,col4
1,CB,EB
2,CC,EC


Selecting scalar/single value:

In [64]:
A.loc[1, 'col3']

'DB'

In [65]:
A.loc[1, ['col3']]

col3    DB
Name: 1, dtype: object

In [66]:
A.loc[1:1, ['col3']]

Unnamed: 0,col3
1,DB
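
The three calls above address the same cell but return different types, depending on whether each axis receives a scalar, a list or a slice. A sketch that makes this explicit, with `A` rebuilt in code as a stand-in for `A.xlsx`; `.at` is shown as the scalar-only accessor:

```python
import pandas as pd

# Stand-in for the A.xlsx table (assumption: same labels and values as shown above)
A = pd.DataFrame({f'col{i}': [c + r for r in 'ABCDE'] for i, c in enumerate('ABCDE')})

scalar = A.loc[1, 'col3']     # scalar row and column labels: a single value
series = A.loc[1, ['col3']]   # list on the column axis: a Series
frame = A.loc[1:1, ['col3']]  # slice plus list: a DataFrame
fast = A.at[1, 'col3']        # .at accepts only scalar labels

print(type(scalar).__name__, type(series).__name__, type(frame).__name__)
```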


### Selection by position

Selection of the last row<br>
displayed as a column

In [67]:
A.iloc[len(A)-1]

col0    AE
col1    BE
col2    CE
col3    DE
col4    EE
Name: 4, dtype: object

Selection of the last row<br>
displayed as a dataframe

In [68]:
A.iloc[len(A)-1:]

Unnamed: 0,col0,col1,col2,col3,col4
4,AE,BE,CE,DE,EE


By integer position (similar to NumPy)

syntax (the end positions are excluded):<br>
```df.iloc[start row : end row, start col : end col]```

In [235]:
A.iloc[2:4, 1:3]

Unnamed: 0,col1,col2
2,BC,CC
3,BD,CD


central data

In [69]:
A.iloc[2:3, 2:3]

Unnamed: 0,col2
2,CC


Selecting several non-consecutive columns

In [81]:
A

Unnamed: 0,col0,col1,col2,col3,col4
0,AA,BA,CA,DA,EA
1,AB,BB,CB,DB,EB
2,AC,BC,CC,DC,EC
3,AD,BD,CD,DD,ED
4,AE,BE,CE,DE,EE


In [83]:
A.iloc[1:4, [0, 3, 4]]

Unnamed: 0,col0,col3,col4
1,AB,DB,EB
2,AC,DC,EC
3,AD,DD,ED


In [84]:
A.iloc[[1,2], [1, 2, 4]]

Unnamed: 0,col1,col2,col4
1,BB,CB,EB
2,BC,CC,EC


slicing rows explicitly

In [86]:
A.iloc[0:2, :]

Unnamed: 0,col0,col1,col2,col3,col4
0,AA,BA,CA,DA,EA
1,AB,BB,CB,DB,EB


selecting columns explicitly

In [87]:
A.iloc[:, [1,3]]

Unnamed: 0,col1,col3
0,BA,DA
1,BB,DB
2,BC,DC
3,BD,DD
4,BE,DE
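
As with NumPy and Python lists, `iloc` also accepts negative positions counted from the end, which avoids computing `len(A)-1` by hand. A sketch with `A` rebuilt in code as a stand-in for `A.xlsx`:

```python
import pandas as pd

# Stand-in for the A.xlsx table (assumption: same labels and values as shown above)
A = pd.DataFrame({f'col{i}': [c + r for r in 'ABCDE'] for i, c in enumerate('ABCDE')})

last_row = A.iloc[-1]       # last row as a Series
last_rows = A.iloc[-3:]     # last three rows as a DataFrame
corner = A.iloc[-2:, -2:]   # bottom-right 2 x 2 block

print(corner)
```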


## Boolean indexing

In [92]:
array = np.random.randn(5,5)

In [93]:
df = pd.DataFrame(array)

In [94]:
df

Unnamed: 0,0,1,2,3,4
0,0.879036,-0.825262,0.634716,-0.481218,-1.220715
1,-0.638497,-0.111432,-0.599698,0.570246,0.736952
2,1.454193,-0.305721,1.280712,-0.499257,0.652761
3,0.222417,-0.111608,-0.797981,-3.426054,0.738839
4,-1.416393,-0.619235,-0.594305,0.649356,-0.411582
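
A boolean condition on a column or on the whole dataframe produces a mask that can be used for selection. The sketch below is an illustration under assumptions: it builds its own random frame with a seeded generator (`default_rng(0)`), so the values differ from the ones displayed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
df = pd.DataFrame(rng.standard_normal((5, 5)))

# Condition on one column: keeps only the rows where column 0 is positive
rows = df[df[0] > 0]

# Condition on the whole frame: shape is kept, non-matching cells become NaN
masked = df[df > 0]

print(rows)
print(masked)
```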


## Tips & tricks

Setting a time series as index using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html">date_range</a>

In [72]:
dates = pd.date_range('20130101', periods=6)

In [73]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
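
`date_range` can also be driven by an end date or by a different frequency. A brief sketch; the `'2D'` frequency and the variable names are illustrative choices only.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)                     # six consecutive days
by_end = pd.date_range(start='2013-01-01', end='2013-01-06')     # same range, given by its end date
every_other = pd.date_range('2013-01-01', periods=3, freq='2D')  # one date every two days

print(by_end.equals(dates))
print(every_other)
```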

Creates a 6 x 4 array filled with random numbers

In [95]:
array2 = np.random.randn(6, 4)

Converts the previously declared ```array2``` to a dataframe<br>
Notice that its values will change each time the above line is executed.

In [98]:
df2 = pd.DataFrame(array2)

In [99]:
df2

Unnamed: 0,0,1,2,3
0,-0.475478,-1.818202,-0.191779,0.308438
1,-0.537211,-1.141735,-0.81132,-0.714705
2,0.218059,-0.727422,-1.755766,-1.025506
3,0.612431,1.463061,-1.430404,0.084717
4,-0.111866,2.31828,0.47484,0.614675
5,0.969472,-0.327165,-0.747083,-0.487223


Setting up the previously declared date range as the index of the dataframe<br>
(the number of dates must match the number of rows)

In [77]:
df = df2.set_index(dates)

In [78]:
df

Unnamed: 0,0,1,2,3
2013-01-01,-0.475478,-1.818202,-0.191779,0.308438
2013-01-02,-0.537211,-1.141735,-0.81132,-0.714705
2013-01-03,0.218059,-0.727422,-1.755766,-1.025506
2013-01-04,0.612431,1.463061,-1.430404,0.084717
2013-01-05,-0.111866,2.31828,0.47484,0.614675
2013-01-06,0.969472,-0.327165,-0.747083,-0.487223


Renaming columns via dictionary

In [79]:
df = df.rename(columns = {0:'A', 1:'B', 2:'C', 3:'D'})

In [80]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.475478,-1.818202,-0.191779,0.308438
2013-01-02,-0.537211,-1.141735,-0.81132,-0.714705
2013-01-03,0.218059,-0.727422,-1.755766,-1.025506
2013-01-04,0.612431,1.463061,-1.430404,0.084717
2013-01-05,-0.111866,2.31828,0.47484,0.614675
2013-01-06,0.969472,-0.327165,-0.747083,-0.487223
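
The index and the column names can also be supplied when the dataframe is created, which replaces the separate `set_index` and `rename` steps. A minimal sketch that assumes the same dates and column names as above; the values are random and will differ from the table shown.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# Index and column labels are passed directly to the constructor
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df.head(2))
```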
