# In this session, we learn the one data structure we will be working with the most: the DataFrame

## <font color="red"> DataFrame </font>

A DataFrame is Pandas' way of handling tabular data. 
You can think of it as a two-dimensional array with labeled rows and columns. In fact, each column of the "table" is a Series.
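
To make the "table of Series" idea concrete, here is a minimal sketch. The column names and values are invented for illustration and do not come from any data set used in this session.

```python
import pandas as pd

# Hypothetical data: two named columns of equal length
table = pd.DataFrame({"price": [10.0, 12.5, 9.0], "units": [3, 1, 4]})

# Pulling out a single column gives back a Series
price = table["price"]
print(type(table))
print(type(price))

# The column shares the row index of the table it came from
print(table.index.equals(price.index))
```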

## <font color="blue"> 1. Define a DataFrame </font>

In [None]:
#In this example, we first create a 4 by 3 array, then use this array to create a df object.

import pandas as pd
import numpy as np
a = np.random.rand(4,3)
print(type(a))
print(a)
df = pd.DataFrame(a)
print(df)

In [None]:
# We could add labels to each column -- this would be your "column headers"
df.columns = ['A','B','C']
df

In [None]:
# What about the rows? By default, the row index starts from 0. To change it, assign to the index attribute --
# Remember, a df is really a collection of Series that share one index. Each column is a Series, so every Series
# method works on every column
df.index =[1,2,3,4]
df

In [None]:
# We can also conveniently create a time series index, thanks to the date_range() function
df.index = pd.date_range('20200101',periods = 4)
df
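
The row and column labels do not have to be assigned afterwards. As a small sketch, the same kind of table can be built in one step by passing `index` and `columns` to the constructor. The array values below are assumed for illustration, chosen so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Assumed example values; any 4-by-3 array would do
values = np.arange(12).reshape(4, 3)

# Labels can be supplied when the DataFrame is created
dated = pd.DataFrame(
    values,
    index=pd.date_range("20200101", periods=4),  # four consecutive days
    columns=["A", "B", "C"],
)
print(dated)
```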

In [None]:
# To GET the column name:
print(df.columns)
# To GET the row index:
print(df.index)
# To get the values of the whole table:
print(df.values)

In [None]:
# Most of the time, we read Excel or CSV files as input and use a DataFrame to hold the table. Please refer to the IO tutorial file.
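
As a quick preview only, the sketch below reads CSV text into a DataFrame. The CSV content is invented, and an in-memory string stands in for a file on disk so the example runs anywhere.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "City,Revenue\nBoston,5000\nBerlin,7200\n"

# read_csv accepts a file path or any file-like object
sales = pd.read_csv(io.StringIO(csv_text))
print(sales)
```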

## <font color="blue"> 2. Reading Operations </font>

Just like a NumPy array, a DataFrame supports slicing.


In [None]:
# Select the first five rows (head() returns fewer if the table is shorter, as with our 4-row df)
df.head()

In [None]:
# You can also decide how many rows you want
df = pd.DataFrame(np.random.rand(20,20))
print(df)
df.head(6)

In [None]:
# You can also view the LAST five rows of the table:
df.tail()

In [None]:
# and you can also customize
df.tail(7)

In [None]:
# You can select a specific column
df.columns = np.arange(1,21)
print(df)
print(df[4])

In [None]:
# Remember that each COLUMN is a series, so you are actually returning a series
a = df[6]
type(a)

In [None]:
# However, if you retrieve multiple columns...
b = df[[6,9]]
print(b)
print(type(b))
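
The rule is about what goes inside the brackets, not how many columns come back. The following sketch, on an assumed deterministic table, shows that a one-element list still returns a DataFrame.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random table, columns labeled 1..5
grid = pd.DataFrame(np.arange(20).reshape(4, 5), columns=np.arange(1, 6))

one = grid[3]        # single label: a Series
wrapped = grid[[3]]  # one-element list: a DataFrame with one column
pair = grid[[2, 5]]  # list of labels: a DataFrame with two columns

print(type(one), type(wrapped), type(pair))
```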

In [None]:
# We have covered selecting columns. A slice inside [] selects rows instead (here the rows at positions 2 and 3)
print(df[2:4])

In [None]:
# Notice how rows and columns work differently -- this does NOT return a row; it returns the column labeled 4:
print(df[4])
# How do you retrieve the row at position 4 then? Use the iloc indexer (square brackets, not parentheses)
print(df.iloc[4])
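
With random numbers the difference is hard to verify by eye. Here is a small sketch on an assumed table whose values run from 0 to 23, so each result can be checked against its position.

```python
import numpy as np
import pandas as pd

# Assumed example: 6 rows, columns labeled 1..4, values 0..23
t = pd.DataFrame(np.arange(24).reshape(6, 4), columns=np.arange(1, 5))

col_4 = t[4]           # the column whose LABEL is 4
row_pos_4 = t.iloc[4]  # the row at POSITION 4 (the fifth row)
rows_2_3 = t[2:4]      # a slice inside [] selects rows by position

print(col_4.tolist())
print(row_pos_4.tolist())
print(rows_2_3)
```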

In [2]:
# Is the example above a little hard to read? Let's create another df and refresh our memory of the array object
import pandas as pd
import numpy as np
a = np.random.randint(1,100,30)  # randint already returns a NumPy array
b = a.reshape(5,-1)
print(b)

[[67 85 38 61 22 94]
 [44 57 77 38 93 53]
 [12 23 92 71 30 54]
 [65 65 86 71 40 42]
 [25 76 63 31 73 80]]


In [None]:
df2 = pd.DataFrame(b)
df2

In [None]:
# Now let's compare how you retrieve a row vs. a column
df2[3]

In [None]:
df2[3:5]

In [None]:
df2[[3,5]]

In [None]:
df2.iloc[2]

In [None]:
df2.iloc[2:4]

In [None]:
df2.iloc[[2,4]]

In [None]:
df2.iloc[2,4]

In [None]:
# Confusing? When iloc takes ONE item -- be it a number, a list, or a slice -- it interprets it as row position(s)
# However, if it takes TWO items separated by ",", it treats the first as the row position(s) and the second as the 
# column position(s)
df2.iloc[2:4,3:5]
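
The forms of `iloc` used above can be compared side by side. This is a sketch on an assumed 5-by-6 table filled with 0 to 29, so that the value at row r, column c is always 6*r + c.

```python
import numpy as np
import pandas as pd

# Deterministic 5-by-6 table so the results are easy to check by eye
d = pd.DataFrame(np.arange(30).reshape(5, 6))

row = d.iloc[2]                  # one integer: the row at position 2 (a Series)
rows = d.iloc[[2, 4]]            # one list: the rows at positions 2 and 4
cell = d.iloc[2, 4]              # two integers: a single value
block = d.iloc[2:4, 3:5]         # two slices: rows 2-3, columns 3-4
picked = d.iloc[[0, 2], [1, 5]]  # two lists: rows 0 and 2, columns 1 and 5

print(cell)
print(block)
print(picked)
```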

In [None]:
# To summarize:
# To retrieve whole columns, you just need df[column labels]
# Try retrieving columns 2 and 5 from df2:



In [None]:
# To retrieve rows, you need the iloc indexer: df.iloc[row positions]
# Try retrieving rows 2 and 4 from df2


In [None]:
# To select rows and columns at the same time, use iloc with TWO items: df.iloc[row positions, column positions]
# Try retrieving rows 1 and 3 at columns 2 and 4


------------------------------------------------------------------------------------------------------------------

In [3]:
# Let's now work with a real data set.
GBI = pd.read_excel('GBI.xlsx',sheet_name='Sales')
print(GBI)

       YEAR  MONTH  DAY  Customer   CustomerDescr    City Salesorg Country  \
0      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
1      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
2      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
3      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
4      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
...     ...    ...  ...       ...             ...     ...      ...     ...   
48379  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48380  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48381  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48382  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48383  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   

       OrderNumber  OrderItem  ... Division SalesQuantity UnitO

In [4]:
# How do we use a column NAME instead of a POSITION to retrieve a column?
# Let's retrieve the 'City' column:
GBI['City']

0        Boston
1        Boston
2        Boston
3        Boston
4        Boston
          ...  
48379    Bochum
48380    Bochum
48381    Bochum
48382    Bochum
48383    Bochum
Name: City, Length: 48384, dtype: object

In [5]:
# It is the same as using the loc indexer -- NOTE that the syntax for loc[], just like iloc[], is [row, column]
GBI.loc[:,'City']

0        Boston
1        Boston
2        Boston
3        Boston
4        Boston
          ...  
48379    Bochum
48380    Bochum
48381    Bochum
48382    Bochum
48383    Bochum
Name: City, Length: 48384, dtype: object

In [6]:
# ... or simply use attribute access (this works only when the column name is a valid Python identifier):
GBI.City

0        Boston
1        Boston
2        Boston
3        Boston
4        Boston
          ...  
48379    Bochum
48380    Bochum
48381    Bochum
48382    Bochum
48383    Bochum
Name: City, Length: 48384, dtype: object
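
The three spellings can be checked against each other on a small table. In this sketch the rows are invented, and "Unit Price" is a hypothetical column name chosen to show where attribute access stops working.

```python
import pandas as pd

# Hypothetical miniature of a sales table; "Unit Price" is an invented column
mini = pd.DataFrame(
    {"City": ["Boston", "Berlin"], "Unit Price": [250.0, 310.0]}
)

by_brackets = mini["City"]
by_loc = mini.loc[:, "City"]
by_attribute = mini.City
print(by_brackets.equals(by_loc) and by_loc.equals(by_attribute))

# A name containing a space cannot be written as an attribute,
# so it has to go through brackets or loc
unit_price = mini["Unit Price"]
print(unit_price)
```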

In [7]:
#You can also select unique values of a column
GBI.City.unique()

array(['Boston', 'München', 'Berlin', 'Bochum', 'Atlanta', 'Chicago',
       'Hannover', 'Leipzig', 'Stuttgart', 'Magdeburg', 'Irvine',
       'Hamburg', 'New York City', 'Detroit', 'Heidelberg',
       'Philadelphia', 'Grand Rapids', 'Seattle', 'Washington DC',
       'Denver', 'Palo Alto', 'Frankfurt', 'Anklam'], dtype=object)
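
Two related Series methods are often used together with `unique()`. The sketch below uses invented city values rather than the GBI data.

```python
import pandas as pd

# Invented city values for illustration
cities = pd.Series(["Boston", "Berlin", "Boston", "Bochum", "Berlin", "Boston"])

distinct = cities.unique()      # distinct values, in order of first appearance
n_distinct = cities.nunique()   # how many distinct values there are
counts = cities.value_counts()  # occurrences per value, most frequent first

print(distinct)
print(n_distinct)
print(counts)
```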

In [10]:
# Let's try to retrieve the first five rows of the City column (loc slices include the end label, so 0:4 gives five rows)
print(GBI.loc[0:4,'City'])

0    Boston
1    Boston
2    Boston
3    Boston
4    Boston
Name: City, dtype: object
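
Note that `0:4` returned five rows here. A label slice with `loc` includes its end point, while a position slice with `iloc` does not. A minimal sketch with invented values:

```python
import pandas as pd

# Invented values; the default index is 0, 1, 2, ...
s = pd.DataFrame(
    {"City": ["Boston", "Boston", "Berlin", "Bochum", "Atlanta", "Chicago"]}
)

by_label = s.loc[0:4, "City"]  # label slice: includes the row labeled 4
by_position = s.iloc[0:4, 0]   # position slice: stops before position 4

print(len(by_label), len(by_position))
```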


In [9]:
# Exercise: print the City name for the row labeled 100 in GBI (the 101st row, since the index starts at 0)
GBI.City[100]

'Atlanta'

In [11]:
# Selecting based on value: as with a NumPy array, you can use a Boolean condition to keep only the rows you want
print(GBI[GBI.Revenue>6000])


       YEAR  MONTH  DAY  Customer    CustomerDescr     City Salesorg Country  \
1      2007      1    1      5000   Beantown Bikes   Boston     UE00      US   
4      2007      1    1      5000   Beantown Bikes   Boston     UE00      US   
6      2007      1    1      5000   Beantown Bikes   Boston     UE00      US   
14     2007      1    1     15000    Bavaria Bikes  München     DS00      DE   
28     2007      1    1     16000    Capital Bikes   Berlin     DN00      DE   
...     ...    ...  ...       ...              ...      ...      ...     ...   
48207  2011     12   10     23000  Red Light Bikes  Hamburg     DN00      DE   
48215  2011     12   12     16000    Capital Bikes   Berlin     DN00      DE   
48216  2011     12   12     16000    Capital Bikes   Berlin     DN00      DE   
48346  2011     12   28     15000    Bavaria Bikes  München     DS00      DE   
48351  2011     12   28     15000    Bavaria Bikes  München     DS00      DE   

       OrderNumber  OrderItem  ... Divi
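
It can help to look at the filter in two steps: the comparison first produces a Boolean Series, and that Series then selects the rows. The sketch below uses invented rows; only the column names City, Country and Revenue mirror the sheet.

```python
import pandas as pd

# Invented rows; only the column names mirror the Sales sheet
orders = pd.DataFrame(
    {
        "City": ["Boston", "Berlin", "Bochum", "Boston"],
        "Country": ["US", "DE", "DE", "US"],
        "Revenue": [7500, 6400, 3200, 1800],
    }
)

# The comparison yields a Boolean Series, one True/False per row
mask = orders.Revenue > 6000
high = orders[mask]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
high_de = orders[(orders.Revenue > 6000) & (orders.Country == "DE")]

print(mask)
print(high)
print(high_de)
```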

## <font color="blue"> 3. Writing Operations </font>

In [12]:
# Sorting can be done by labels:
print(GBI.sort_index(axis=0, ascending = False))
print(GBI.sort_index(axis=1, ascending = False))
# axis=0 sorts by the row labels (the index); axis=1 sorts by the column LABELS

       YEAR  MONTH  DAY  Customer   CustomerDescr    City Salesorg Country  \
48383  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48382  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48381  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48380  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
48379  2011     12   31     19000        Fahrpott  Bochum     DN00      DE   
...     ...    ...  ...       ...             ...     ...      ...     ...   
4      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
3      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
2      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
1      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   
0      2007      1    1      5000  Beantown Bikes  Boston     UE00      US   

       OrderNumber  OrderItem  ... Division SalesQuantity UnitO
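
On a large table the effect of `axis` is easy to miss. Here is a small sketch with an invented frame whose row and column labels are deliberately out of order.

```python
import pandas as pd

# Invented frame with out-of-order row and column labels
f = pd.DataFrame(
    {"B": [1, 2, 3], "A": [4, 5, 6], "C": [7, 8, 9]},
    index=[2, 0, 1],
)

rows_desc = f.sort_index(axis=0, ascending=False)  # reorder rows by row label
cols_desc = f.sort_index(axis=1, ascending=False)  # reorder columns by column label

# sort_index returns a new object; the original keeps its order
print(rows_desc)
print(cols_desc)
print(f)
```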