## Exploring the Dataframe

To list the index (row) labels of a dataframe, there are two equivalent options:

Method 1:

In [14]:
list(dfObj.index.values)

['name', 'age', 'city']

Method 2:

In [15]:
list(dfObj.index) 

['name', 'age', 'city']
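
As a minimal sketch of the two methods above, the snippet below builds a small toy dataframe (the `df_obj` name and its values are made up for illustration; only the index labels mirror the output shown) and also tries `Index.tolist()`, which should give the same list.

```python
import pandas as pd

# Hypothetical toy frame; only the index labels come from the output above.
df_obj = pd.DataFrame(
    {"value": ["Ana", 31, "Lisbon"]},
    index=["name", "age", "city"],
)

labels_from_values = list(df_obj.index.values)  # via the underlying array
labels_from_index = list(df_obj.index)          # iterating the Index directly
labels_from_tolist = df_obj.index.tolist()      # Index method, same result

print(labels_from_tolist)
```

All three variables should hold the same list of labels, so the choice is mostly a matter of taste.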

Good! Now we can inspect this new dataframe:

In [11]:
# returns a tuple with number of rows/columns
DF.shape

(86, 11)

To get basic information about the DataFrame:

In [12]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 11 columns):
End of trend                 86 non-null object
RSI                          86 non-null float64
Divergence                   86 non-null bool
Number of bounces            86 non-null int64
Trend length before(bars)    86 non-null int64
Currency Pair                86 non-null object
Direction                    86 non-null object
Entry Time-frame             86 non-null object
Reversed                     86 non-null bool
Trend length after (bars)    69 non-null float64
Ranging                      86 non-null bool
dtypes: bool(3), float64(2), int64(2), object(4)
memory usage: 5.7+ KB
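
The same facts reported by `shape` and `info()` can also be pulled out programmatically. The sketch below uses a tiny made-up dataframe (not the real `DF`) to show the row/column counts, the dtypes and the number of missing values per column, which is what the "69 non-null" line above hints at.

```python
import pandas as pd

# Hypothetical 3-row frame with one missing value, for illustration only.
df = pd.DataFrame(
    {
        "RSI": [60.0, 39.0, 54.0],
        "Trend length after (bars)": [132.0, float("nan"), 6.0],
    }
)

n_rows, n_cols = df.shape      # tuple unpacking of (rows, columns)
dtypes = df.dtypes             # one dtype per column
missing = df.isna().sum()      # count of NaN values per column

print(n_rows, n_cols)
print(missing)
```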


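The '+' in the memory usage figure above means it is a lower bound, because the contents of object columns are not measured by default. For an exact report on the memory usage, pass memory_usage='deep':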

In [13]:
DF.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 11 columns):
End of trend                 86 non-null object
RSI                          86 non-null float64
Divergence                   86 non-null bool
Number of bounces            86 non-null int64
Trend length before(bars)    86 non-null int64
Currency Pair                86 non-null object
Direction                    86 non-null object
Entry Time-frame             86 non-null object
Reversed                     86 non-null bool
Trend length after (bars)    69 non-null float64
Ranging                      86 non-null bool
dtypes: bool(3), float64(2), int64(2), object(4)
memory usage: 25.4 KB
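
To see where the difference between the two reports comes from, we can compare `DataFrame.memory_usage()` with and without `deep=True`. This is only a sketch on a made-up dataframe; the column is forced to `object` dtype so the behaviour matches the object columns of `DF`.

```python
import pandas as pd

# Hypothetical frame: one object column (strings) and one float column.
df = pd.DataFrame(
    {
        "pair": pd.Series(["EUR/USD", "USD/CAD", "AUD/USD"], dtype=object),
        "rsi": [60.0, 39.0, 54.0],
    }
)

shallow = df.memory_usage()        # object columns counted as pointers only
deep = df.memory_usage(deep=True)  # also counts the Python string objects

print(shallow)
print(deep)
```

The float column should report the same size in both cases, while the object column grows once the strings themselves are counted.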


We can also take a look at the first rows of the dataframe:

In [14]:
DF.head(3)  # only the first 3 rows are shown

| | End of trend | RSI | Divergence | Number of bounces | Trend length before(bars) | Currency Pair | Direction | Entry Time-frame | Reversed | Trend length after (bars) | Ranging |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23/04/2008 03:00 | 60.0 | True | 2 | 53 | EUR/USD | up | D | True | 132.0 | False |
| 1 | 06/05/2008 03:00 | 39.0 | False | 0 | 7 | EUR/USD | down (within uptrend) | D | False | NaN | False |
| 2 | 27/05/2008 03:00 | 54.0 | False | 0 | 5 | EUR/USD | up | D | True | 6.0 | True |


To get the column names:

In [15]:
DF.columns

Index(['End of trend', 'RSI', 'Divergence', 'Number of bounces',
       'Trend length before(bars)', 'Currency Pair', 'Direction',
       'Entry Time-frame', 'Reversed', 'Trend length after (bars)', 'Ranging'],
      dtype='object')

If we want to select a particular column from the dataframe ('RSI' for example), double brackets return a one-column DataFrame rather than a Series:

In [16]:
RSI=DF[['RSI']]

If we want to select two non-consecutive columns:

In [17]:
a=DF[['RSI','Ranging']]
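
The bracket style decides what comes back. As a small sketch on a made-up dataframe (column names borrowed from `DF`, values invented), single brackets give a Series and double brackets give a DataFrame:

```python
import pandas as pd

# Hypothetical frame reusing three column names from DF.
df = pd.DataFrame(
    {
        "RSI": [60.0, 39.0, 54.0],
        "Divergence": [True, False, False],
        "Ranging": [False, False, True],
    }
)

rsi_series = df["RSI"]           # single brackets -> pandas Series
rsi_frame = df[["RSI"]]          # double brackets -> one-column DataFrame
subset = df[["RSI", "Ranging"]]  # list of names -> DataFrame in that order

print(type(rsi_series).__name__, type(rsi_frame).__name__)
```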

### Selecting using .iloc and .loc
Extracted from:
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

#### .iloc<br>
Single selection:<br>
* Rows:<br>
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.<br>
data.iloc[1] # second row of data frame (Evan Zigomalas)<br>
data.iloc[-1] # last row of data frame (Mi Richan)<br>
* Columns:<br>
data.iloc[:,0] # first column of data frame (first_name)<br>
data.iloc[:,1] # second column of data frame (last_name)<br>
data.iloc[:,-1] # last column of data frame (id)<br>

Multiple selection:<br>
<br>
data.iloc[0:5] # first five rows of dataframe<br>
data.iloc[:, 0:2] # first two columns of data frame with all rows<br>
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.<br>
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame, i.e. positions 5-7 (county -> phone1)<br>
data.iloc[:, [0,1]] # all rows, first two columns selected with a list<br>

#### .loc<br>
Single selection:<br>
a=DF.loc[:,'Direction']<br>

Multiple selection:<br>
a=DF.loc[:,['Direction','RSI']]
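
The difference between the two accessors is easier to see when the index is not the default 0..n-1. The following sketch assumes a made-up dataframe with integer labels 10, 20, 30, 40: `.loc` works on labels and includes both ends of a slice, while `.iloc` works on positions and excludes the stop.

```python
import pandas as pd

# Hypothetical frame with a non-default integer index.
df = pd.DataFrame(
    {
        "Direction": ["up", "down", "up", "down"],
        "RSI": [60.0, 39.0, 54.0, 42.0],
    },
    index=[10, 20, 30, 40],
)

by_label = df.loc[10:30, "RSI"]  # labels 10, 20 and 30 (stop is included)
by_position = df.iloc[0:2, 1]    # positions 0 and 1 (stop is excluded)
both_cols = df.loc[:, ["Direction", "RSI"]]  # all rows, two columns by name

print(len(by_label), len(by_position))
```

Note that `df.loc[0]` would fail on this dataframe, since there is no row labelled 0, whereas `df.iloc[0]` returns the first row.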

### Objects returned by .iloc and .loc

* If only one row is selected with a scalar, we get a pandas Series:<br>
data.iloc[0]
* If we use a list selector, we get a DataFrame, even for a single row:<br>
data.iloc[[0]]
* If we select multiple rows with a slice, we get a DataFrame:<br>
data.iloc[0:5]
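
The three cases above can be checked directly with `type()`. This is a sketch on a small invented dataframe rather than the `data` object from the tutorial:

```python
import pandas as pd

# Hypothetical frame, just large enough to show the three cases.
df = pd.DataFrame({"RSI": [60.0, 39.0, 54.0], "Ranging": [False, False, True]})

one_row = df.iloc[0]       # scalar selector -> Series indexed by column names
one_row_df = df.iloc[[0]]  # list selector -> DataFrame with a single row
some_rows = df.iloc[0:2]   # slice -> DataFrame with two rows

for obj in (one_row, one_row_df, some_rows):
    print(type(obj).__name__, obj.shape)
```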

In [18]:
ix='RSI'

DF.loc[:,ix]

0     60.0
1     39.0
2     54.0
3     42.0
4     43.0
5     61.0
6     20.6
7     17.0
8     58.0
9     20.6
10    77.0
11    41.0
12    72.0
13    42.0
14    72.0
15    48.0
16    52.0
17    47.0
18    64.0
19    72.0
20    62.0
21    26.0
22    50.0
23    27.0
24    23.0
25    30.0
26    71.0
27    72.0
28    66.0
29    27.0
      ... 
56    32.0
57    57.0
58    38.0
59    73.0
60    25.0
61    68.0
62    40.0
63    59.0
64    39.0
65    29.0
66    28.0
67    59.0
68    24.0
69    57.0
70    43.0
71    54.0
72    51.0
73    28.0
74    41.0
75    26.0
76    28.0
77    39.0
78    30.0
79    39.0
80    49.0
81    20.0
82    45.0
83    27.0
84    32.0
85    31.0
Name: RSI, Length: 86, dtype: float64

### Setting the value of a certain cell in the dataframe
#### By index label:

In [19]:
import pandas as pd

df=pd.DataFrame(index=['A','B','C'], columns=['x','y'])
df.at['C', 'x'] = 10

#### By position:

In [20]:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],columns=['A', 'B', 'C'])
df.iat[1, 2] = 10
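
To confirm that both setters touch the cell we expect, here is a short sketch that reuses the dataframe from the cell above and reads the values back. The specific values written (99 and -1) are arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    columns=["A", "B", "C"],
)

df.at[1, "B"] = 99  # by label: row label 1, column label "B"
df.iat[2, 0] = -1   # by position: third row, first column

print(df)
```

`.at` and `.iat` only work on a single cell; for several cells at once, `.loc` and `.iloc` are the usual choice.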

### Logical selection

For example, if we want to select all records for which the 'Reversed' column is True:

In [21]:
reversed_true=DF.loc[DF['Reversed']==True]

And if we want to select rows where either one column or another is True, each comparison needs its own parentheses, because | binds more tightly than ==:

In [22]:
DF.loc[(DF['Reversed']==True) | (DF['Divergence']==True)]

| | End of trend | RSI | Divergence | Number of bounces | Trend length before(bars) | Currency Pair | Direction | Entry Time-frame | Reversed | Trend length after (bars) | Ranging |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23/04/2008 03:00 | 60.0 | True | 2 | 53 | EUR/USD | up | D | True | 132.0 | False |
| 2 | 27/05/2008 03:00 | 54.0 | False | 0 | 5 | EUR/USD | up | D | True | 6.0 | True |
| 4 | 16/06/2008 03:00 | 43.0 | False | 0 | 4 | EUR/USD | down | D | True | 20.0 | True |
| 5 | 15/07/2008 03:00 | 61.0 | False | 0 | 20 | EUR/USD | up | D | True | 41.0 | True |
| 7 | 12/9/2008 03:00 | 17.0 | False | 2 | 43 | EUR/USD | down | D | True | 8.0 | False |
| 8 | 23/9/2008 03:00 | 58.0 | False | 0 | 6 | EUR/USD | up (within downtrend) | D | True | 24.0 | False |
| 9 | 28/10/2008 03:00 | 20.6 | True | 5 | 70 | EUR/USD | down | D | True | 36.0 | False |
| 10 | 19/12/2008 03:00 | 77.0 | False | 0 | 13 | EUR/USD | up | D | True | 42.0 | False |
| 11 | 05/03/2009 03:00 | 41.0 | True | 0 | 56 | EUR/USD | down | D | True | 189.0 | False |
| 12 | 20/03/2009 03:00 | 72.0 | False | 0 | 11 | EUR/USD | up | D | True | 20.0 | False |
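
Boolean masks can be combined with `|` (or), `&` (and) and `~` (not). The sketch below uses an invented four-row dataframe with column names borrowed from `DF`. Since the columns are already boolean, the `==True` comparison is optional; any other comparison, such as `RSI > 50`, should be wrapped in parentheses before combining.

```python
import pandas as pd

# Hypothetical frame; values are made up to exercise each mask.
df = pd.DataFrame(
    {
        "Reversed": [True, False, True, False],
        "Divergence": [True, False, False, True],
        "RSI": [60.0, 39.0, 54.0, 42.0],
    }
)

either = df.loc[df["Reversed"] | df["Divergence"]]         # or
both = df.loc[df["Reversed"] & df["Divergence"]]           # and
not_reversed = df.loc[~df["Reversed"]]                     # negation
high_or_div = df.loc[(df["RSI"] > 50) | df["Divergence"]]  # parenthesised comparison

print(either.index.tolist())
```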


Now, if we want the counts (frequencies) for a certain categorical variable:

In [23]:
DF['Currency Pair'].value_counts()

EUR/USD    73
USD/CAD    11
AUD/USD     2
Name: Currency Pair, dtype: int64

And if we want proportions instead of counts:

In [24]:
DF['Currency Pair'].value_counts(normalize=True)

EUR/USD    0.848837
USD/CAD    0.127907
AUD/USD    0.023256
Name: Currency Pair, dtype: float64

And if we want percentages instead:

In [25]:
DF['Currency Pair'].value_counts(normalize=True)*100

EUR/USD    84.883721
USD/CAD    12.790698
AUD/USD     2.325581
Name: Currency Pair, dtype: float64
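
Counts and percentages can also be put side by side in one small table. In this sketch the Series is rebuilt from the counts printed above (73, 11 and 2) rather than read from the real `DF`, and the `summary` name is just for illustration.

```python
import pandas as pd

# Series reconstructed from the counts shown above, not the original data.
pairs = pd.Series(
    ["EUR/USD"] * 73 + ["USD/CAD"] * 11 + ["AUD/USD"] * 2,
    name="Currency Pair",
)

counts = pairs.value_counts()                               # absolute frequencies
percent = (pairs.value_counts(normalize=True) * 100).round(2)  # rounded percentages

# Combine both views; the two Series share the same index (the pair names).
summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
```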

Now, if we want to copy the entire dataframe:

In [26]:
newDF = DF.copy()

newDF.head(3)

| | End of trend | RSI | Divergence | Number of bounces | Trend length before(bars) | Currency Pair | Direction | Entry Time-frame | Reversed | Trend length after (bars) | Ranging |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23/04/2008 03:00 | 60.0 | True | 2 | 53 | EUR/USD | up | D | True | 132.0 | False |
| 1 | 06/05/2008 03:00 | 39.0 | False | 0 | 7 | EUR/USD | down (within uptrend) | D | False | NaN | False |
| 2 | 27/05/2008 03:00 | 54.0 | False | 0 | 5 | EUR/USD | up | D | True | 6.0 | True |
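
It is worth seeing why `.copy()` is needed at all: a plain assignment only creates a second name for the same dataframe. A minimal sketch, using a made-up one-column dataframe:

```python
import pandas as pd

# Hypothetical frame, for illustration only.
df = pd.DataFrame({"RSI": [60.0, 39.0, 54.0]})

alias = df               # no copy: both names point to the same object
independent = df.copy()  # deep copy by default, with its own data

independent.loc[0, "RSI"] = 0.0  # does not affect df
alias.loc[1, "RSI"] = -1.0       # changes df as well

print(df)
```

After running this, `df` should still hold 60.0 in its first row but -1.0 in its second.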
