## Operations on the entire Dataframe

### Renaming the columns of the dataframe
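
The code for this step is not shown here, so the following is only a minimal sketch on a made-up dataframe (`df_all` and its column names are assumptions). Assigning a list to `.columns` replaces every column name at once:

```python
import pandas as pd

# Illustrative data; the names are invented for this sketch
df_all = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# The list must have exactly one entry per existing column
df_all.columns = ['first', 'second']
print(df_all.columns.tolist())
```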

#### Renaming a certain column
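
One possible way to rename a single column is `DataFrame.rename` with a mapping. The dataframe and the names `a`/`alpha` below are assumptions made for illustration:

```python
import pandas as pd

df_one = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# rename returns a new dataframe; columns missing from the mapping keep their names
df_one = df_one.rename(columns={'a': 'alpha'})
print(df_one.columns.tolist())
```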

### Adding a column to the dataframe

In [41]:
import pandas as pd
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
print(df)

   B  C
0  1  4
1  2  5
2  3  6


We insert the new column at the first position (`loc=0`) of the dataframe:

In [42]:
idx = 0
new_col = [7, 8, 9]  # can be a list, a Series, an array or a scalar   
df.insert(loc=idx, column='A', value=new_col)

### Add column with constant value

In [62]:
df = pd.DataFrame('x', index=range(4), columns=list('ABC'))

df['new'] = 'y'

print(df)

   A  B  C new
0  x  x  x   y
1  x  x  x   y
2  x  x  x   y
3  x  x  x   y


### Converting a Pandas Series to dict or list

In [43]:
>>> s = pd.Series([1, 2, 3, 4])
>>> s.to_dict()
{0: 1, 1: 2, 2: 3, 3: 4}
>>> s.to_list()
[1, 2, 3, 4]

### Rounding all values in dataframe
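
A short sketch of rounding, assuming two decimal places are wanted and using invented values. `DataFrame.round` returns a new dataframe and leaves non-numeric columns untouched:

```python
import pandas as pd

df_round = pd.DataFrame({'x': [1.23456, 2.34567], 'y': [0.98765, 3.14159]})

# Round every numeric value to 2 decimal places
rounded = df_round.round(2)
print(rounded)
```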

### Combine 2 or more dataframes

In [44]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

# pd.concat stacks the frames vertically and keeps each frame's index labels
frames = [df1, df2, df3]
result = pd.concat(frames)
print(result)

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11


Combine 2 dataframes by the index
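
A minimal sketch of an index-based combination using `DataFrame.join`. The frames `left` and `right` and their labels are made up, and an inner join is only one of the available `how` options:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['k0', 'k1', 'k2'])
right = pd.DataFrame({'B': ['B0', 'B2', 'B3']}, index=['k0', 'k2', 'k3'])

# Inner join on the index: only labels present in both frames survive
by_index = left.join(right, how='inner')
print(by_index)
```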

We can combine 2 dataframes by a certain common column
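
As an illustration, `pd.merge` with the `on` parameter matches rows through a shared column. The column name `key` and the data are assumptions for this sketch:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# Keep only the rows whose 'key' value appears in both frames
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
```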

Combine more than 2 dataframes on a certain column 
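
One common approach, sketched here with invented frames, is to fold a list of dataframes with `functools.reduce`, merging them pairwise on the shared column (assumed to be called `key`):

```python
from functools import reduce

import pandas as pd

frames_to_merge = [
    pd.DataFrame({'key': ['K0', 'K1'], 'A': [1, 2]}),
    pd.DataFrame({'key': ['K0', 'K1'], 'B': [3, 4]}),
    pd.DataFrame({'key': ['K0', 'K1'], 'C': [5, 6]}),
]

# Merge the first two frames, then merge the result with the third, and so on
combined = reduce(lambda l, r: pd.merge(l, r, on='key'), frames_to_merge)
print(combined)
```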

### Setting the index of a dataframe to a certain column
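
A small sketch with `DataFrame.set_index`, assuming a column called `name` should become the index (the data is invented):

```python
import pandas as pd

df_idx = pd.DataFrame({'name': ['falcon', 'lion'], 'max_speed': [389.0, 80.5]})

# The 'name' column is moved out of the columns and becomes the row index
df_idx = df_idx.set_index('name')
print(df_idx)
```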

### Serialize a dataframe
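
The format used in the original cell is not shown, so this sketch assumes pickle, via `DataFrame.to_pickle` and `pd.read_pickle`. The file name `df.pkl` is hypothetical and a temporary directory keeps the example self-contained:

```python
import os
import tempfile

import pandas as pd

df_ser = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'df.pkl')  # hypothetical file name
    df_ser.to_pickle(path)                  # write the dataframe to disk
    restored = pd.read_pickle(path)         # read it back

print(restored.equals(df_ser))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.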

### Transpose a dataframe (swap rows and columns)

In [45]:
import pandas as pd
d1 = {'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data=d1)
df1

   col1  col2
0     1     3
1     2     4


Let's transpose now:

In [46]:
df1_transposed = df1.transpose()
print(df1_transposed)

      0  1
col1  1  2
col2  3  4


### Reset index (Drop index)

In [47]:
import numpy as np
df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))

print(df)

         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN


In [48]:
df = df.reset_index()  # the old index becomes a regular column named 'index'
print(df)

    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN


## Operations on the different variables in the Dataframe

### pd.crosstab

The `pd.crosstab` function builds contingency tables, which are used to analyze the relationship between categorical variables.
For example, let's check whether a price reversal (`Reversed` = True) is more likely when there is divergence in the RSI indicator (`Divergence` = True):
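
The original data is not included in this section, so the sketch below uses a handful of made-up observations. The two variables are encoded as the strings `'TRUE'`/`'FALSE'`, which is an assumption. `margins=True` adds the totals, and `margins_name` labels them `Total`:

```python
import pandas as pd

# Made-up observations, not the data used in the original notebook
trades = pd.DataFrame({
    'Divergence': ['TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'FALSE'],
    'Reversed':   ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE'],
})

# Rows: Divergence categories; columns: Reversed categories; plus totals
counts = pd.crosstab(trades['Divergence'], trades['Reversed'],
                     margins=True, margins_name='Total')
print(counts)
```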

The 2x2 contingency table above shows counts (including the Total row and column counts). Now, if we want to calculate the proportions along each column:
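
A sketch of column-wise proportions with `normalize='columns'`, again on invented data with string-encoded categories (both are assumptions). Each column of the result sums to 1:

```python
import pandas as pd

# Made-up observations, not the data used in the original notebook
trades = pd.DataFrame({
    'Divergence': ['TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'FALSE'],
    'Reversed':   ['TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE'],
})

# Divide every count by the total of its column
props = pd.crosstab(trades['Divergence'], trades['Reversed'], normalize='columns')
print(props)
```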

### DataFrame.groupby

This Pandas method splits the data into subgroups according to the categories of a certain variable. An operation is then applied to each subgroup, and the results are combined into the final dataframe (the split-apply-combine pattern).

Let's see an example:

We split the dataframe into 2 subgroups depending on the values of the `Divergence` variable (TRUE/FALSE) and then calculate the mean of the variable `Trend length after (bars)` for each subgroup:
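
A minimal sketch of this split and aggregation. The column names come from the text, but the values and the string encoding of `Divergence` are made up:

```python
import pandas as pd

# Made-up values for illustration
trades = pd.DataFrame({
    'Divergence': ['TRUE', 'TRUE', 'FALSE', 'FALSE'],
    'Trend length after (bars)': [10, 20, 4, 6],
})

# One mean per Divergence category; groups are sorted by their label
mean_length = trades.groupby('Divergence')['Trend length after (bars)'].mean()
print(mean_length)
```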

Now, let's calculate the normalized value_counts for each category
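
One way to read this step, shown on invented data, is to call `value_counts(normalize=True)` on a second column within each group. Using `Reversed` as that second column is an assumption:

```python
import pandas as pd

# Made-up values for illustration
trades = pd.DataFrame({
    'Divergence': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE'],
    'Reversed':   ['TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE'],
})

# Share of each Reversed value inside every Divergence group
shares = trades.groupby('Divergence')['Reversed'].value_counts(normalize=True)
print(shares)
```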

### pd.cut

This Pandas function can be used to create bins based on the values of a continuous variable. In this way, we create discrete chunks that are used as ordinal categorical variables. For example:

In [49]:
pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)


([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], (6.533, 9.7], (0.19, 3.367]]
 Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] < (6.533, 9.7]],
 array([0.1905    , 3.36666667, 6.53333333, 9.7       ]))

And if we want to create our own cut points, we pass a sequence of bin edges (here a numpy array) instead of the number of bins:

In [50]:
cuts = np.array([0,2,4,6,10])
pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), cuts)

[(0, 2], (0, 2], (2, 4], (6, 10], (6, 10], (2, 4]]
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 10]]

### pd.melt

This function unpivots a dataframe from wide to long format: the columns are stacked into a pair of `variable`/`value` columns. Let's assume that we have the following dataframe:

In [51]:
df = pd.DataFrame(data = np.random.random(size=(4,4)), columns = ['A','B','C','D'])
df.head()

          A         B         C         D
0  0.799995  0.723880  0.441851  0.855194
1  0.715625  0.617369  0.359415  0.504778
2  0.694445  0.342369  0.081886  0.483613
3  0.921030  0.865819  0.886652  0.612787


Then, with pd.melt we can do the following:

In [52]:
pd.melt(df)

   variable     value
0         A  0.799995
1         A  0.715625
2         A  0.694445
3         A  0.921030
4         B  0.723880
5         B  0.617369
6         B  0.342369
7         B  0.865819
8         C  0.441851
9         C  0.359415
10        C  0.081886
11        C  0.886652
12        D  0.855194
13        D  0.504778
14        D  0.483613
15        D  0.612787


### drop function
This is used to drop a column from the dataframe. `drop` returns a new dataframe, so the original DF is not affected unless the result is assigned back to it (or `inplace=True` is passed).

Removing multiple columns:
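
A short sketch of both cases on a made-up dataframe (the names `df_drop`, `A`, `B` and `C` are assumptions). Passing a single label drops one column, while a list drops several:

```python
import pandas as pd

df_drop = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

without_a = df_drop.drop(columns='A')           # drop one column
without_ab = df_drop.drop(columns=['A', 'B'])   # drop several columns

# The original dataframe still has all three columns
print(df_drop.columns.tolist())
```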

### copy function
This function can be used in order to create a copy of a certain column in the DataFrame:

Now, let's duplicate a certain column in the DF (in this case we are adding a new column with these duplicated values)

Now, if we want to create a new variable with the RSI values divided by two:
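
The three steps above can be sketched as follows. The `RSI` column is named in the text, while the values and the new column names `RSI_dup` and `RSI_half` are hypothetical:

```python
import pandas as pd

df_copy = pd.DataFrame({'RSI': [30.0, 55.0, 72.0]})  # made-up RSI values

rsi_copy = df_copy['RSI'].copy()             # independent copy of the column
df_copy['RSI_dup'] = df_copy['RSI'].copy()   # duplicate stored as a new column
df_copy['RSI_half'] = df_copy['RSI'] / 2     # new variable: RSI divided by two

rsi_copy.iloc[0] = 0.0                       # changing the copy leaves df_copy intact
print(df_copy)
```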

### sorting the dataframe by a column of date type
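
A minimal sketch, assuming a column called `date` that holds date strings (the data is invented). Converting it with `pd.to_datetime` first makes `sort_values` order the rows chronologically rather than as text:

```python
import pandas as pd

df_dates = pd.DataFrame({
    'date': ['2019-03-01', '2018-12-25', '2019-01-15'],
    'value': [3, 1, 2],
})

df_dates['date'] = pd.to_datetime(df_dates['date'])  # strings -> datetime64
df_sorted = df_dates.sort_values(by='date')          # ascending by default
print(df_sorted)
```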

### filling the na (or n.a. or missing values)

First, it is useful to check which columns have NaN values:

There are different strategies for dealing with NaN values:
* Dropping the rows that contain missing values:

* Replacing the missing values with the median:

* Selecting the null values in a dataframe column

* Selecting the not null values in a dataframe column
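
The check and the strategies listed above can be sketched on a small invented dataframe (the names `df_na`, `A` and `B` are assumptions):

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'A': [1.0, np.nan, 3.0, 5.0], 'B': ['x', 'y', 'z', 'w']})

# Which columns contain at least one NaN?
cols_with_nan = df_na.columns[df_na.isna().any()].tolist()

dropped = df_na.dropna()                             # drop rows with missing values
filled = df_na.fillna({'A': df_na['A'].median()})    # replace NaN in A with the median
null_rows = df_na[df_na['A'].isnull()]               # rows where A is null
not_null_rows = df_na[df_na['A'].notnull()]          # rows where A is not null

print(cols_with_nan)
```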

### Changing the data type of an entire column
Example extracted from https://code.i-harness.com/en/q/f27a5e

In [53]:
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df.dtypes

one      object
two      object
three    object
dtype: object

In [54]:
df[['two', 'three']] = df[['two', 'three']].astype(float)

df.dtypes

one       object
two      float64
three    float64
dtype: object

### Working with dates
We illustrate here how to work with dates. First, let's convert a date string element in the `End of trend` column into a datetime object: 

Now, we can perform arithmetic operations with this date. For example, we can subtract a day:

Now, we can convert this date object back into a string:
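
These three steps can be sketched with the standard `datetime` module. The sample string and the format `'%Y-%m-%d %H:%M:%S'` are assumptions, since the actual format of the `End of trend` values is not shown:

```python
from datetime import datetime, timedelta

date_str = '2019-03-15 22:00:00'   # hypothetical value and format
fmt = '%Y-%m-%d %H:%M:%S'

end_of_trend = datetime.strptime(date_str, fmt)   # string -> datetime
day_before = end_of_trend - timedelta(days=1)     # subtract one day
back_to_str = day_before.strftime(fmt)            # datetime -> string
print(back_to_str)
```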

Now, let's work on the entire column. Let's convert the `End of trend` column to the datetime data type:
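
A sketch of the column-wide conversion with `pd.to_datetime`. The column name comes from the text, while the sample values are made up:

```python
import pandas as pd

trades = pd.DataFrame({'End of trend': ['2019-03-15 22:00:00', '2019-04-02 10:00:00']})

# Parse every string in the column into a datetime64 value
trades['End of trend'] = pd.to_datetime(trades['End of trend'])
print(trades.dtypes)
```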

### Deleting rows based on the value of one column
Let's imagine that we want to remove from the dataframe all records where the value of 'Divergence' is True. This can be done in the following way:
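
A minimal sketch using a boolean mask, assuming `Divergence` is stored as a boolean column (the data and the `id` column are invented). Keeping the rows where the mask is False is equivalent to deleting the rows where it is True:

```python
import pandas as pd

trades = pd.DataFrame({'id': [1, 2, 3, 4],
                       'Divergence': [True, False, True, False]})

# ~ negates the mask, so only rows with Divergence == False are kept
kept = trades[~trades['Divergence']]
print(kept)
```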

### Rolling on a certain column
Rolling applies an operation (here, a sum) over a sliding window of fixed size along a column. With the default settings, the first `window - 1` rows have no complete window, so their result is NaN.

In [55]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [0, 1, 2, 3, 4]})

df.rolling(4).sum()

      B
0   NaN
1   NaN
2   NaN
3   6.0
4  10.0


### pandas melt
This function is used to unpivot one or more columns, transforming a df from a wide to a long format

In [56]:
import pandas as pd

# creating a dataframe 
df = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'}, 
                   'Course': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'}, 
                   'Age': {0: 27, 1: 23, 2: 21}})
print(df)

     Name    Course  Age
0    John   Masters   27
1     Bob  Graduate   23
2  Shiela  Graduate   21


In [57]:
# Name is id_vars and Course is value_vars
pd.melt(df, id_vars=['Name'], value_vars=['Course'])

     Name variable     value
0    John   Course   Masters
1     Bob   Course  Graduate
2  Shiela   Course  Graduate


In [58]:
# multiple unpivot columns
pd.melt(df, id_vars=['Name'], value_vars=['Course', 'Age'])

     Name variable     value
0    John   Course   Masters
1     Bob   Course  Graduate
2  Shiela   Course  Graduate
3    John      Age        27
4     Bob      Age        23
5  Shiela      Age        21


In [59]:
# Names of 'variable' and 'value' columns can be customized
pd.melt(df, id_vars=['Name'], value_vars=['Course'],
        var_name='ChangedVarname', value_name='ChangedValname')

     Name ChangedVarname ChangedValname
0    John         Course        Masters
1     Bob         Course       Graduate
2  Shiela         Course       Graduate
