In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%reload_ext postcell
%postcell register

In [None]:
%matplotlib inline

#### Generating sample data

In [None]:
# boilerplate code
student_scores_pd = pd.DataFrame(
    np.random.rand(10, 4) * 100,
    columns=['Assignment 1', 'Assignment 2', 'Test', 'Extra Credit'],
    index=['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Jon', 'Arya', 'Ned', 'Danny', 'That red lady'],
)
student_scores_pd['Test'] *= 10
#student_scores_pd['Test2'] = 'hello'
student_scores_pd['Extra Credit'] = np.power(student_scores_pd['Extra Credit'], 2)/100
student_scores_pd = student_scores_pd.round()

student_scores_pd

# Pandas - dataframe operations

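Much like series (and NumPy arrays), dataframes have _many_ built-in mathematical operations. Converting the dataframe to a NumPy array first and calling `sum` gives a single total over every element, whereas calling an aggregating function on the dataframe itself returns one value per column: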

In [None]:
student_scores_pd.to_numpy().sum()

In [None]:
student_scores_pd.sum()

However, if you want to add up all the scores for each student, you have to tell `sum` to aggregate across the other axis:

In [None]:
student_scores_pd.sum(axis=1)

In [None]:
student_scores_pd.sum(axis='columns')

In [None]:
student_scores_pd.sum(axis='rows')

#### Detour: `...axis=?`

`axis=0` is the same as `axis="rows"` (or `axis="index"`), and `axis=1` is the same as `axis="columns"`.

This is the rule which helps me the most:

`df.sum(axis='rows')` means you end up with a value for each **column**

`df.sum(axis='columns')` means you end up with a value for each **row**


Note that the default axis in most pandas aggregation functions is 0, or 'rows'. By default, then, a function collapses the rows within each column, which results in _one value per column_.
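
The rule can be checked on a tiny frame. This is only a minimal sketch; `toy_df` and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical 2x3 frame, used only to illustrate the axis rule
toy_df = pd.DataFrame({'a': [1, 2], 'b': [10, 20], 'c': [100, 200]},
                      index=['x', 'y'])

per_column = toy_df.sum(axis='rows')     # same as axis=0: one value per column
per_row = toy_df.sum(axis='columns')     # same as axis=1: one value per row

print(per_column)  # indexed by the column labels a, b, c
print(per_row)     # indexed by the row labels x, y
```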

In [None]:
student_scores_pd.mean()

In [None]:
student_scores_pd.mean(axis=0)

In [None]:
student_scores_pd.mean(axis='rows')

### Remove columns (and the 'axis' error related to it)

In [None]:
student_scores_pd

In [None]:
# Raises a KeyError: with the default axis=0, drop looks for a row labelled 'Test'
student_scores_pd.drop(['Test'])

In [None]:
student_scores_pd.drop(['Test'], axis=1)

Most functions don't _mutate_ or change the original dataframe (unless you use the `inplace=True` argument):

In [None]:
student_scores_pd

```python
student_scores_pd.drop(['Test'], axis=1, inplace=True)
```
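
To make the non-mutating behaviour concrete, here is a small sketch on a made-up frame (`toy_df` is hypothetical). It also uses the `columns=` keyword of `drop`, which may be easier to remember than `axis=1`.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy_df = pd.DataFrame({'Test': [1, 2], 'Quiz': [3, 4]})

# drop returns a new dataframe; toy_df itself is left unchanged
without_test = toy_df.drop(columns=['Test'])

# Reassigning the result is a common alternative to inplace=True
reassigned = toy_df.copy()
reassigned = reassigned.drop(['Test'], axis=1)

print(list(toy_df.columns))        # the original still has both columns
print(list(without_test.columns))  # only the remaining column
```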

### Common mathematical functions

pandas provides the standard mathematical and statistical functions you would expect in a data science package, such as `mean`, `sum`, `max`, and `min`. Element-wise functions such as `log` come from NumPy (`np.log`) and can be applied directly to a dataframe.

**Example** Given the `student_scores_pd` dataframe, rescale it so all values are between 0 and 1 using the Min-Max feature scaling formula at https://en.wikipedia.org/wiki/Normalization_(statistics):

Note: scikit-learn provides this transformation as `MinMaxScaler()`.

In [None]:
(student_scores_pd - student_scores_pd.min()) / (student_scores_pd.max() - student_scores_pd.min())

Is it scaling the whole dataset as a single matrix, or is it doing the right thing and scaling each column separately?

In [None]:
tmp_df = pd.DataFrame()
for c in student_scores_pd.columns:
    tmp_df[c] = (student_scores_pd[c] - student_scores_pd[c].min()) / (student_scores_pd[c].max() - student_scores_pd[c].min())

In [None]:
tmp_df

**Question** Why aren't the two datasets different?
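
A small sketch that may help with the question: when a dataframe and a series are combined with an arithmetic operator, pandas aligns the series' index with the dataframe's column labels, so each column is paired with its own value. `toy_df` below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with two columns on very different scales
toy_df = pd.DataFrame({'a': [0, 5, 10], 'b': [100, 150, 200]})

col_min = toy_df.min()   # Series indexed by column label
col_max = toy_df.max()

# The Series index is matched against the dataframe's column labels,
# so column 'a' uses a's min/max and column 'b' uses b's.
scaled = (toy_df - col_min) / (col_max - col_min)
print(scaled)
```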

### Simple transformations

In [None]:
student_scores_pd.hist(figsize=(20,10));

In [None]:
student_scores_pd.plot.scatter(x='Test', y='Extra Credit')

**Exercise** Looks like the `Extra Credit` column in `student_scores_pd` may need to be log scaled. Please do so.

In [None]:
%%postcell exercise_030_130_a

#type your answer here

### Sort a dataframe

In [None]:
student_scores_pd

We can sort a table by a column, in either ascending or descending order:

In [None]:
student_scores_pd.sort_values(by='Test', ascending=False)

In [None]:
student_scores_pd.sort_values(by=['Test', 'Assignment 1', 'Extra Credit'], ascending=False)
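
`ascending` can also be given as a list with one flag per sort column, which is useful when the columns should be sorted in different directions. A minimal sketch on a made-up frame (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with a tie in 'Test'
toy_df = pd.DataFrame(
    {'Test': [90, 90, 70], 'Assignment 1': [80, 50, 60]},
    index=['A', 'B', 'C'],
)

# Sort by Test descending; break ties by Assignment 1 ascending
sorted_df = toy_df.sort_values(by=['Test', 'Assignment 1'],
                               ascending=[False, True])
print(sorted_df)
```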

A dataframe can be sorted by index in the following manner:

In [None]:
student_scores_pd.sort_index()

**Exercise** Sort the dataframe `student_scores_pd` by Extra Credit, then Test scores
(try to guess how multiple columns might be sorted)

In [None]:
%%postcell exercise_030_130_d

#type your answer here

### 'Bucketize' a column

There are times when you need to convert a numeric column to a category or a factor. For example, client ages may be better represented by decade (20s, 30s, 40s) or as 'young', 'middle age', 'senior'. Algorithms to create such buckets can be quite involved; for our purposes, we will use `pd.cut`, which, when given an integer `bins`, splits the range of values into that many equal-width intervals.
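
When the bucket boundaries matter, `pd.cut` also accepts explicit bin edges and labels. The sketch below uses made-up ages and label names; it is an illustration under those assumptions, not part of the dataset used in this notebook.

```python
import pandas as pd

ages = pd.Series([23, 37, 41, 68], name='age')  # made-up values

# Integer bins: three equal-width intervals spanning the observed range
equal_width = pd.cut(ages, bins=3)

# Explicit edges and labels; right=False makes the bins
# [20, 40), [40, 60) and [60, 100)
labelled = pd.cut(ages, bins=[20, 40, 60, 100],
                  labels=['young', 'middle age', 'senior'], right=False)
print(labelled)
```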

In [None]:
student_scores_pd

In [None]:
student_scores_pd['Assignment 1']#.sort_values()

In [None]:
pd.cut(student_scores_pd['Assignment 1'], bins=3)

In [None]:
student_scores_example_pd = student_scores_pd.copy()
student_scores_example_pd['Assignment 1 segmented'] = pd.cut(student_scores_pd['Assignment 1'], bins=3)

In [None]:
student_scores_example_pd

**Exercise** Add the buckets created above to dataframe `student_scores_example_pd` and name the column `Assignment 1 segmented`

In [None]:
%%postcell exercise_030_130_b

student_scores_example_pd = student_scores_pd.copy()

#type your answer here

student_scores_example_pd

### Convert row values to columns (one-hot-encoding)

Statistical models like logistic regression need categorical columns to be 'one-hot-encoded'. The following example shows how a built-in function can be used, along with the knowledge of dataframes we have built up so far. Note that the scikit-learn library provides a similar tool in _OneHotEncoder_.

In [None]:
student_scores_pd['Gender'] = ['Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Private Info']
student_scores_pd

Convert the `Gender` column

In [None]:
student_scores_pd['Gender']

In [None]:
pd.get_dummies(student_scores_pd['Gender'])

Combine the original dataframe with the new columns (NOTE: we will learn about `concat` in a later lecture)

In [None]:
pd.concat([student_scores_pd, pd.get_dummies(student_scores_pd['Gender'])], axis=1)

Looks like we have to remove the `Gender` column ourselves

In [None]:
pd.concat([student_scores_pd, pd.get_dummies(student_scores_pd['Gender'])], axis=1).drop(['Gender'], axis=1)
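
As a possible shortcut, `pd.get_dummies` can take the whole dataframe plus a `columns=` list; it then encodes those columns and leaves out the originals in one step. A minimal sketch on a made-up frame (`toy_df` is hypothetical). Note that the new column names get a `Gender_` prefix by default, unlike the `concat` approach above.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
toy_df = pd.DataFrame({'Test': [1, 2, 3],
                       'Gender': ['Male', 'Female', 'Male']})

# Encode 'Gender' and drop the original column in a single call
encoded = pd.get_dummies(toy_df, columns=['Gender'])
print(list(encoded.columns))
```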

**Exercise** Change the 'Assignment 1 segmented' column in `student_scores_example_pd` to be one-hot-encoded, for consumption by a model

In [None]:
%%postcell exercise_030_130_c

#type your answer here