# Pandas DataFrame
[Build a DataFrame](#build) | 
[Explore a DataFrame](#explore) | 
[Select elements of the DataFrame](#select) | 
[Remove column](#remove) | 
[Add a column](#addCol) | 
[Remove a row](#removeRow) | 
[Add a row](#addRow) | 
[Sorting](#sort) | 
[Applying functions to all (or subsets) of a DataFrame](#apply)

As mentioned in class, we will use the standard Pandas DataFrame commands rather than those proposed by the data8 book.
In addition to the guide provided here, you can find good tutorials on Pandas DataFrames at [Tutorialspoint](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm) and on the [official Pandas site](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).

In [None]:
import pandas as pd

## Build a DataFrame <a name = "build"></a>
There are several ways to build a DataFrame. For the time being, we will consider only one.

In [None]:
# Constructing a DataFrame from a data structure (dictionary)
# Note that all the lists in the dictionary must have the same length
cs1040={'year':[2016, 2016, 2017, 2017, 2018, 2018, 2019],
        'semester': ['Spring', 'fall', 'Spring', 'fall', 'Spring', 'fall', 'Spring'],
        'avg_grade': [3.0, 2.8, 3.2, 3.1, 3.4, 3.0, 3.5]}
cs1040_df = pd.DataFrame(data=cs1040)
cs1040_df

In [None]:
# Note that in the example above I did not define an index, so one was created automatically (the numbers 0, 1, 2, 3, ...)
# One can, however, define an index explicitly
players = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data=players, index=['rank1','rank2','rank3','rank4'])
print(df)

In [None]:
# You can also create an empty DataFrame:
df = pd.DataFrame()
print(df)
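
As a sketch of one of the other ways to build a DataFrame, the cell below starts from a list of dictionaries, one dictionary per row. The names `rows` and `rows_df` are made up for this illustration.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names
rows = [
    {'year': 2016, 'semester': 'Spring', 'avg_grade': 3.0},
    {'year': 2016, 'semester': 'fall', 'avg_grade': 2.8},
    {'year': 2017, 'semester': 'Spring', 'avg_grade': 3.2},
]
rows_df = pd.DataFrame(rows)
print(rows_df)
```

The dictionary form used earlier is organized by column, while this form is organized by row. Which one is more convenient depends on how the data arrives.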

## Explore a DataFrame <a name="explore"></a>

In [None]:
# Look at the beginning of the DataFrame (here, the first 2 rows)
cs1040_df.head(2)

In [None]:
# Look at the end of the DataFrame (the last 5 rows by default)
cs1040_df.tail()

In [None]:
# descriptive statistics for numerical columns
cs1040_df.describe()
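
A few attributes are also handy for a quick look at a table. The sketch below rebuilds the same table so that it runs on its own, and prints its size, column labels, and column types.

```python
import pandas as pd

cs1040 = {'year': [2016, 2016, 2017, 2017, 2018, 2018, 2019],
          'semester': ['Spring', 'fall', 'Spring', 'fall', 'Spring', 'fall', 'Spring'],
          'avg_grade': [3.0, 2.8, 3.2, 3.1, 3.4, 3.0, 3.5]}
cs1040_df = pd.DataFrame(data=cs1040)

print(cs1040_df.shape)    # (number of rows, number of columns)
print(cs1040_df.columns)  # the column labels
print(cs1040_df.dtypes)   # the data type of each column
```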

## Select elements of the DataFrame <a name="select"></a>

In [None]:
# Select one column
cs1040_df['avg_grade']

In [None]:
# Select several columns
cs1040_df[['avg_grade','year']]

In [None]:
# Select a set of consecutive rows (slice)
cs1040_df[2:6]

In [None]:
# Select all the rows that have "Spring" as the semester value (filter)
cs1040_df[cs1040_df['semester']=="Spring"]

In [None]:
# Another filter example
cs1040_df[cs1040_df['avg_grade']> 3.2]
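
Filters can also be combined. The sketch below rebuilds the table so that it runs on its own; the names `spring_high` and `fall_or_high` are made up for this illustration. Note that each condition needs its own parentheses, and that pandas uses `&` (and) and `|` (or) here rather than the Python keywords `and` and `or`.

```python
import pandas as pd

cs1040 = {'year': [2016, 2016, 2017, 2017, 2018, 2018, 2019],
          'semester': ['Spring', 'fall', 'Spring', 'fall', 'Spring', 'fall', 'Spring'],
          'avg_grade': [3.0, 2.8, 3.2, 3.1, 3.4, 3.0, 3.5]}
cs1040_df = pd.DataFrame(data=cs1040)

# Rows that are Spring semesters AND have an average grade above 3.2
spring_high = cs1040_df[(cs1040_df['semester'] == 'Spring') & (cs1040_df['avg_grade'] > 3.2)]
print(spring_high)

# Rows that are fall semesters OR have an average grade above 3.2
fall_or_high = cs1040_df[(cs1040_df['semester'] == 'fall') | (cs1040_df['avg_grade'] > 3.2)]
print(fall_or_high)
```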

In [None]:
# Another way to select all the rows that have "Spring" as the semester value
#
# Two steps:
# First index the table with the semester column
# And then locate those semesters that have value "Spring"
#
# This is what a table with the semester column as its index looks like
cs1040_df.set_index('semester')

In [None]:
# Here I create a new table that has the semester column as an index
# and then locate the rows that have "Spring" as the semester value
cs1040_df_indexed=cs1040_df.set_index('semester')
cs1040_df_indexed.loc['Spring']

In [None]:
# Note that neither table was changed by the selection
print(cs1040_df_indexed, cs1040_df, sep='\n\n')
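
`.loc` can also take a filter and a list of columns together, which selects rows and columns in one step without changing the index. A minimal sketch, with the table rebuilt so that it runs on its own (the name `spring_grades` is made up for this illustration):

```python
import pandas as pd

cs1040 = {'year': [2016, 2016, 2017, 2017, 2018, 2018, 2019],
          'semester': ['Spring', 'fall', 'Spring', 'fall', 'Spring', 'fall', 'Spring'],
          'avg_grade': [3.0, 2.8, 3.2, 3.1, 3.4, 3.0, 3.5]}
cs1040_df = pd.DataFrame(data=cs1040)

# First argument: which rows to keep; second argument: which columns to keep
spring_grades = cs1040_df.loc[cs1040_df['semester'] == 'Spring', ['year', 'avg_grade']]
print(spring_grades)
```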

## Remove column <a name="remove"></a>

In [None]:
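# Remove one column
# drop returns a new DataFrame; cs1040_df itself is not modified
cs1040_df.drop(['year'],axis=1)

In [None]:
# The original table still has the year column
cs1040_df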

#### You can also delete a column in place using the *del* statement

`del cs1040_df['year']`

#### or using the *pop* method, which also returns the removed column

`cs1040_df.pop('year')`
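
The difference between the three approaches can be sketched on a small throwaway table, so that `cs1040_df` is left alone. The names `demo_df`, `without_year`, and `popped` are made up for this illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({'year': [2016, 2017],
                        'semester': ['Spring', 'fall'],
                        'avg_grade': [3.0, 3.1]})

# drop returns a new DataFrame; demo_df still has the year column
without_year = demo_df.drop(columns=['year'])
print(list(demo_df.columns))

# pop removes the column from demo_df and returns it as a Series
popped = demo_df.pop('avg_grade')
print(popped)

# del removes the column from demo_df and returns nothing
del demo_df['semester']
print(list(demo_df.columns))
```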

## Add a column <a name = "addCol"></a>

In [None]:
# You can assign values directly
cs1040_df['Num of students']=[10,12,11,9,17,14,20]
cs1040_df

In [None]:
# Or you can calculate values
cs1040_df['%grade']=cs1040_df['avg_grade']/4*100
cs1040_df
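
If you would rather not modify the original table, the `assign` method returns a new DataFrame with the extra column. This is a minimal sketch on a small throwaway table; `demo_df`, `with_pct`, and the column name `pct_grade` are made up for this illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({'year': [2016, 2019],
                        'semester': ['Spring', 'Spring'],
                        'avg_grade': [3.0, 3.5]})

# assign builds a new DataFrame; demo_df keeps its original three columns
with_pct = demo_df.assign(pct_grade=demo_df['avg_grade'] / 4 * 100)
print(with_pct)
print(list(demo_df.columns))
```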

## Remove a row <a name="removeRow"></a>

In [None]:
# Remove the row with label 6
# drop returns a new DataFrame; cs1040_df itself is not modified
cs1040_df.drop(6)
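
As with columns, `drop` on rows leaves the original table untouched, so the result has to be stored if you want to keep it. A minimal sketch, with the table rebuilt so that it runs on its own (`shorter_df` and `shortest_df` are made-up names):

```python
import pandas as pd

cs1040 = {'year': [2016, 2016, 2017, 2017, 2018, 2018, 2019],
          'semester': ['Spring', 'fall', 'Spring', 'fall', 'Spring', 'fall', 'Spring'],
          'avg_grade': [3.0, 2.8, 3.2, 3.1, 3.4, 3.0, 3.5]}
cs1040_df = pd.DataFrame(data=cs1040)

# Store the result in a new variable to keep the shortened table
shorter_df = cs1040_df.drop(6)
print(len(cs1040_df), len(shorter_df))

# Several rows can be removed at once by passing a list of labels
shortest_df = cs1040_df.drop([0, 1])
print(shortest_df)
```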

## Add a row <a name="addRow"></a>

In [None]:
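# Build the new row as a one-row DataFrame and concatenate it to the table
# (DataFrame.append was removed in pandas 2.0; pd.concat also works in older versions)
myrow = {'year':[2015], 'semester':['Spring'], 'avg_grade':[4.0], 'Num of students':[18], '%grade':[100]}
df = pd.DataFrame(data=myrow)
cs1040_df = pd.concat([cs1040_df, df])
cs1040_df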

In [None]:
# Note the repeated labels: the new row has label 0, just like the first row
# Slicing by position still works
cs1040_df[5:8]

In [None]:
# This also works
cs1040_df[0:2]

In [None]:
# However, if I remove the row with label 0, every row carrying that label is removed
cs1040_df=cs1040_df.drop(0)
cs1040_df
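
One way to avoid repeated labels is to let pandas renumber the rows when the tables are combined, using `pd.concat` with `ignore_index=True`. A minimal sketch on two small throwaway tables; `old_df`, `new_row`, and `combined` are made-up names.

```python
import pandas as pd

old_df = pd.DataFrame({'year': [2016, 2016], 'semester': ['Spring', 'fall']})
new_row = pd.DataFrame({'year': [2015], 'semester': ['Spring']})

# ignore_index=True discards the old labels and numbers the rows 0, 1, 2, ...
combined = pd.concat([old_df, new_row], ignore_index=True)
print(combined)

# Every label is now unique, so dropping label 0 removes exactly one row
print(combined.drop(0))
```

For a table that already has repeated labels, `reset_index(drop=True)` renumbers the rows in the same way.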

## Sorting <a name="sort"></a>

In [None]:
# Tables can be sorted by the values in a column (axis=0) or in a row (axis=1)
cs1040_df.sort_values(by=['avg_grade'],axis=0)

In [None]:
# Descending order is also possible
cs1040_df.sort_values(by=['avg_grade'],axis=0, ascending=False)

In [None]:
# You can also sort by several columns (or rows) at once
cs1040_df.sort_values(by=['semester','avg_grade'],axis=0)

## Applying functions to all (or subsets) of a DataFrame <a name="apply"></a>

Note the unexpected order of the semester column above: every "Spring" row comes before every "fall" row.
This is due to the capital S of Spring (capital letters come before lowercase letters in Unicode, and therefore in UTF-8).
So I would like to capitalize all of the elements in the semester column.
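
The ordering can be checked in plain Python, without pandas. The built-in `ord` gives the numeric code of a character, and `sorted` compares strings by those codes.

```python
# The code of 'S' is smaller than the code of 'f', so 'S' sorts first
print(ord('S'), ord('f'))

# Plain Python sorting therefore puts 'Spring' before 'fall'
print(sorted(['fall', 'Spring']))

# Once both words start with a capital letter, the order is alphabetical
print(sorted(['Fall', 'Spring']))
```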

Python has [several methods](https://docs.python.org/3/library/stdtypes.html#index-30) for manipulating strings; we will use the *capitalize* method.

Below is a simple example:

In [None]:
a = "hello"
a.capitalize()

Now I can apply the capitalize method to all elements of the semester column, through the column's `.str` accessor:

In [None]:
cs1040_df['semester'].str.capitalize()

In [None]:
# At this point however, my table hasn't changed
cs1040_df

In [None]:
# In order to make it change, I have to assign the result of the operation back to the variable
cs1040_df['semester'] = cs1040_df['semester'].str.capitalize()
cs1040_df

In [None]:
# An alternative way to apply functions to multiple elements in tables (lambda functions)
#     *** Undo the previous runs (rebuild the table) to test this

capitalizer = lambda x: x.capitalize()
cs1040_df['semester']=cs1040_df['semester'].apply(capitalizer)

In [None]:
cs1040_df

In [None]:
cs1040_df.sort_values(by=['semester'],axis=0)
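
If you would rather leave the stored values as they are, `sort_values` also accepts a `key` function (available since pandas 1.1) that is applied to the column only for the purpose of sorting. A minimal sketch on a small throwaway table; `demo_df` and `sorted_df` are made-up names.

```python
import pandas as pd

demo_df = pd.DataFrame({'semester': ['Spring', 'fall', 'Spring', 'fall'],
                        'avg_grade': [3.0, 2.8, 3.2, 3.1]})

# The key lowercases the values for the comparison only;
# the semester column keeps its original spelling
sorted_df = demo_df.sort_values(by='semester', key=lambda col: col.str.lower())
print(sorted_df)
```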