# Mapping from `datascience` to Pandas

Welcome! This notebook is an unofficial resource created by the Data Science division.

It serves as an introduction to working with Python's widely used Pandas library for students who have taken Data 8. The functions introduced will be analogous to those in Berkeley's `datascience` module, with examples provided for each.

We will cover the following topics in this notebook:
1. [Basics of Pandas](#basics)
    - [Importing and Loading Packages](#import)
<br>
<br>
2. [Dataframes: Working with Tabular Data](#dataframes)
    - [Creating a Dataframe](#creating)
    - [Accessing Values in Dataframe](#accessing)
    - [Manipulating Data](#manipulating)
<br>
<br>
3. [Visualizing Data](#visualizing)
    - [Histograms](#histograms)
    - [Line Plots](#line)
    - [Scatter Plots](#scatter)
    - [Bar Plots](#bar)

## 1. Basics <a id='basics'></a>

This notebook assumes familiarity with Python concepts, syntax and data structures at the level of Data 8. For a brief refresher on some Python concepts, refer to this [Python Basics Guide on GitHub](https://github.com/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb).

Python has a great ecosystem of data-centric packages, which makes it excellent for data analysis. Pandas is one of those packages, and it makes importing and analyzing data much easier. Pandas builds on packages like NumPy and matplotlib to give us a single, convenient place to do most of our data analysis and visualization work.

### 1.1 Importing and Loading Packages <a id='import'></a>

It is useful to import certain packages into our workspace for analysis and data visualization. But first, we may need to install these packages if they are not already present. In a notebook, we can do this by running `pip` as a shell command, which is what the leading `!` does:

In [None]:
# Install datascience, pandas, and numpy packages
!pip install datascience
!pip install pandas
!pip install numpy

Once we have installed the required packages, we do not need to reinstall them when we start or reopen a Jupyter notebook. We can simply import them using the `import` keyword. Since we import Pandas as `pd`, we need to prefix all Pandas functions with `pd.`, similar to how we prefix all NumPy functions with `np.` (such as `np.append()`).

In [None]:
# Run this cell to import the following packages
from datascience import * # Import the datascience package
import pandas as pd # Import the pandas library. pd is a common shorthand for pandas
import numpy as np # Import numpy for working with numbers

## 2. Dataframes: Working with Tabular Data <a id='dataframes'></a>

In the `datascience` module, we used `Table` to build our tables and used methods such as `select()`, `where()`, `group()` and `column()`. In this section, we will go over some basic commands for working with tabular data in Pandas.

### 2.1 Creating a Dataframe <a id='creating'> </a>

Pandas introduces a data structure, the dataframe (`DataFrame`), that represents data as a table with rows and columns.

In Python's `datascience` module that is used in Data 8, this is how we created tables from scratch by extending an empty table:

In [None]:
# datascience: Create a table
t = Table().with_columns([
     'letter', ['a', 'b', 'c', 'z'],
     'count',  [  9,   3,   3,   1],
     'points', [  1,   2,   2,  10],
 ])
t

In Pandas, we can use the function `pd.DataFrame` to initialize a dataframe from a dictionary or a list-like object. Refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for more information.

In [None]:
# pandas: Create a dataframe from a dictionary
df_from_dict = pd.DataFrame({ 'letter' : ['a', 'b', 'c', 'z'],
                      'count' : [  9,   3,   3,   1],
                      'points' : [  1,   2,   2,  10]
                      })
df_from_dict
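The "list-like object" route mentioned above can be sketched as follows. This is a minimal illustration that reuses the same made-up letter data; the variable names `rows` and `df_from_rows` are only for this example.

```python
import pandas as pd

# Each inner list is one row of the table; `columns` supplies the column labels.
rows = [
    ['a', 9, 1],
    ['b', 3, 2],
    ['c', 3, 2],
    ['z', 1, 10],
]
df_from_rows = pd.DataFrame(rows, columns=['letter', 'count', 'points'])
print(df_from_rows)
```

The dictionary form is organized by column, while this form is organized by row; both should produce the same dataframe.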

More often, we will need to create a dataframe by importing data from a .csv file. In `datascience`, this is how we read data from a csv:

In [None]:
# datascience: Create a table from a CSV file (data/baby.csv)
datascience_baby = Table.read_table('data/baby.csv')
datascience_baby

In Pandas, we use `pd.read_csv()` to read data from a CSV file. Sometimes, depending on the data file, we may need to specify the parameters `sep`, `header` or `encoding` as well. For a full list of parameters, refer to [this guide](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).

In [None]:
# pandas: Create a dataframe from a CSV file (data/baby.csv, relative to the current working directory)
baby = pd.read_csv('data/baby.csv')
baby.head() # Display first few rows of dataframe

In [None]:
# View summary of data
baby.describe()
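To see what `sep`, `header` and `encoding` do, here is a small sketch that reads a file which is not a plain comma-separated, UTF-8 file with a header row. The file contents are made up for illustration, and an in-memory buffer stands in for a real path or URL. The `names` parameter, which supplies column labels when the file has none, is an addition not covered above.

```python
import io
import pandas as pd

# Made-up file contents: semicolon-separated, no header row, Latin-1 encoded.
raw_bytes = "José;120\nAnaïs;113\n".encode("latin-1")

odd_df = pd.read_csv(
    io.BytesIO(raw_bytes),           # stands in for a file path or URL
    sep=";",                         # values are separated by semicolons
    header=None,                     # the first line is data, not column names
    names=["name", "birth_weight"],  # so we provide the column names ourselves
    encoding="latin-1",              # how the bytes should be decoded into text
)
print(odd_df)
```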

In [None]:
# pandas: Create a dataframe by loading CSV from URL
# https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/sat2014.csv
sat = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/sat2014.csv')
sat.head()

In [None]:
# View information about dataframe
print(sat.shape) # View dimensions (rows, cols)
print(sat.columns.values) # View column names

### 2.2 Accessing Values in Dataframe <a id='accessing'> </a>

In `datascience`, we can use `column()` to access values in a particular column as follows:

In [None]:
# Access column 'letter'. Returns array
t.column('letter')

In Pandas, each column of a dataframe is a Series. We can access a column as a Series by using square bracket notation.

In [None]:
# Return Series object
sat['State']

If we want a NumPy array of column values, we can use the `values` attribute of a Series object:

In [None]:
# Column values as array
sat['State'].values
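As a small sketch of the difference between a Series and an array, the example below uses a made-up two-row dataframe (the scores are not real data). It also shows `to_numpy()`, a Series method that is not used elsewhere in this notebook and that, to our understanding, is the spelling newer Pandas documentation prefers over `values`.

```python
import numpy as np
import pandas as pd

# Made-up values, only for illustration
scores_df = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                          'Combined': [1617, 1485]})

combined_series = scores_df['Combined']         # a pandas Series (keeps the row labels)
combined_values = combined_series.values        # a NumPy array of the column values
combined_array = combined_series.to_numpy()     # same contents, via a method call

print(type(combined_series))
print(type(combined_array))
```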

In `datascience`, we used `take` to access rows in the Table:

In [None]:
# Select first two rows using Python's slicing notation
t.take[0:2]

In Pandas, we can access rows and columns by their position using the `iloc` indexer. We specify the rows and columns we want with the following syntax: `df.iloc[<rows>, <columns>]`. For more information on indexing, refer to [this guide](https://pandas.pydata.org/pandas-docs/stable/indexing.html).

In [None]:
# Select first two rows using iloc
baby.iloc[0:2, :] 

In [None]:
# Specify row indices
baby.iloc[[1, 4, 6], :]

We can also access a specific value in the dataframe by passing in the row and column indices:

In [None]:
# Get value in second row, third column
baby.iloc[1, 2]
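The three `iloc` patterns above can be tried on a small dataframe where the results are easy to check by eye. This sketch rebuilds the letter table from earlier under the hypothetical name `letters_df` so that it runs on its own.

```python
import pandas as pd

letters_df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                           'count':  [9, 3, 3, 1],
                           'points': [1, 2, 2, 10]})

first_two = letters_df.iloc[0:2, :]        # rows 0 and 1, all columns
picked = letters_df.iloc[[1, 3], [0, 2]]   # rows 1 and 3, columns 0 and 2
single_value = letters_df.iloc[1, 2]       # second row, third column

print(first_two)
print(picked)
print(single_value)
```

Positions start at 0, and a slice such as `0:2` excludes its end position, just like Python list slicing.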

### 2.3 Manipulating Data <a id='manipulating'></a>

**Adding Columns**

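Adding a new column in `datascience` is done with the `with_column()` method, which returns a new table and leaves the original table unchanged:

In [None]:
# datascience: Add a new column (the result is a new table; t itself is not modified)
t.with_column('vowel', ['yes', 'no', 'no', 'no'])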

In Pandas, we can add a column by assigning a list to a new column name with bracket notation. Unlike `with_column()`, this modifies the dataframe in place:

In [None]:
# pandas: Add a new column
df_from_dict['newcol'] = [5, 6, 7, 8]
df_from_dict

We can also create a new column by performing an operation on an existing column:

In [None]:
# Add count * 2 to the dataframe
df_from_dict['doublecount'] = df_from_dict['count'] * 2
df_from_dict
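Bracket assignment changes the dataframe itself, whereas `with_column()` in `datascience` returns a new table. If we want that non-mutating behavior in Pandas, one option is the `assign()` method, sketched below. `assign()` is not covered elsewhere in this notebook, and the names `counts_df` and `with_double` are only for this example.

```python
import pandas as pd

counts_df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                          'count':  [9, 3, 3, 1]})

# assign() returns a new dataframe; counts_df keeps its original columns
with_double = counts_df.assign(doublecount=counts_df['count'] * 2)

print(with_double)
print(counts_df.columns.values)
```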

**Selecting Columns**

In `datascience`, we used `select()` to subset the dataframe by selecting columns:

In [None]:
# datascience: Select columns
t.select(['letter', 'points'])

In Pandas, we use double bracket notation to select columns. This returns a dataframe, whereas single bracket notation with one column name returns a Series.

In [None]:
# pandas: Select columns (double bracket notation for new dataframe)
df_from_dict[['count', 'doublecount']]
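The difference between single and double brackets is easiest to see by checking the type of each result. A minimal sketch, using a small made-up dataframe named `small_df`:

```python
import pandas as pd

small_df = pd.DataFrame({'count':  [9, 3, 3, 1],
                         'points': [1, 2, 2, 10]})

as_series = small_df['count']        # single brackets: a Series
as_dataframe = small_df[['count']]   # double brackets: a one-column dataframe

print(type(as_series))
print(type(as_dataframe))
```

The inner pair of brackets is simply a Python list of column names, so `small_df[['count', 'points']]` selects two columns in the same way.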

**Filtering Rows Conditionally**

In `datascience`, we used `where()` to select rows according to a given condition:

In [None]:
t.where('points', 2) # Rows where points == 2

In [None]:
t.where(t['count'] < 8) # Rows where count < 8

In Pandas, we can use the bracket notation to subset the dataframe based on a condition. We first specify a condition and then subset using the bracket notation:

In [None]:
# Series of booleans (one True/False value per row)
baby['Maternal Smoker'] == True

In [None]:
# Filter rows by condition 'Maternal Smoker' == True
baby[baby['Maternal Smoker'] == True]

In [None]:
# Filter with multiple conditions
df_from_dict[(df_from_dict['count'] < 8) & (df_from_dict['points'] > 5)]
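When combining conditions, Pandas uses the element-wise operators `&` (and), `|` (or) and `~` (not) rather than Python's `and`, `or` and `not`. Each condition needs its own parentheses, because these operators bind more tightly than comparisons such as `<`. A short sketch on the letter table, rebuilt here as `letters_df` so it runs on its own:

```python
import pandas as pd

letters_df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                           'count':  [9, 3, 3, 1],
                           'points': [1, 2, 2, 10]})

# Rows where both conditions hold
both = letters_df[(letters_df['count'] < 8) & (letters_df['points'] > 5)]

# Rows where at least one condition holds
either = letters_df[(letters_df['count'] > 8) | (letters_df['points'] > 5)]

# Rows where the condition does NOT hold
not_small = letters_df[~(letters_df['count'] < 8)]

print(both)
print(either)
print(not_small)
```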

**Renaming Columns**

In `datascience`, we used `relabeled()` to rename columns:

In [None]:
# datascience: Rename 'points' to 'other name'
t.relabeled('points', 'other name')

Pandas uses `rename()`, whose `columns` parameter takes a dictionary mapping each old column name to its replacement. It returns a new dataframe and leaves the original unchanged. (Passing `index = str`, as some examples do, is not required; it only converts the row labels to strings.)

In [None]:
# pandas: Rename 'points' to 'other name'
df_from_dict.rename(columns = {"points" : "other name"})
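Like `relabeled()`, `rename()` does not change the dataframe it is called on, so the result has to be stored if we want to keep it. A minimal sketch with a made-up dataframe named `points_df`:

```python
import pandas as pd

points_df = pd.DataFrame({'letter': ['a', 'b'],
                          'points': [1, 2]})

# rename() returns a new dataframe; points_df keeps its old column names
renamed_df = points_df.rename(columns={'points': 'other name'})

print(renamed_df.columns.values)
print(points_df.columns.values)
```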

**Sorting Dataframe by Column**

In `datascience` we used `sort()` to sort a Table according to the values in a column:

In [None]:
# datascience: Sort by count
t.sort('count')

In Pandas, we use `sort_values()` to sort by column. We use the `by` parameter to specify the column (or list of columns) we want to sort by and the optional parameter `ascending = False` if we want to sort in descending order:

In [None]:
# Pandas: Sort by count, descending
df_from_dict.sort_values(by = ['count'], ascending = False)
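Because `by` accepts a list, we can break ties in one column using another. The sketch below sorts the letter table by `count` in ascending order and, within equal counts, by `letter` in descending order; passing a list to `ascending` sets the direction for each column. The name `letters_df` is only for this example.

```python
import pandas as pd

letters_df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                           'count':  [9, 3, 3, 1]})

# Sort by count (ascending), then by letter (descending) among tied counts
sorted_df = letters_df.sort_values(by=['count', 'letter'],
                                   ascending=[True, False])
print(sorted_df)
```

Note that `sort_values()` returns a new dataframe and keeps the original row labels, so the index of the result appears out of order.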

**Grouping and Aggregating**

In `datascience`, we used `group()` and the `collect` argument to group a Table by a column and aggregate values in another column:

In [None]:
# datascience: Group by count and aggregate by sum
t.select(['count', 'points']).group('count', collect=sum)

In Pandas, we use `groupby()` to group the dataframe. This function returns a groupby object, on which we can then call an aggregation function to return a dataframe with aggregated values for other columns. For more information, refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)

In [None]:
# Select two columns for brevity
df_subset = df_from_dict[['count', 'points']]
df_subset

In [None]:
# pandas: Group by count and aggregate by sum
count_sums_df = df_subset.groupby(['count']).sum()
count_sums_df
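One difference from `datascience` is that the grouping column becomes the index (the row labels) of the result rather than a regular column. If we want it back as an ordinary column, one option is `reset_index()`, sketched below on a made-up two-column dataframe named `pairs_df`.

```python
import pandas as pd

pairs_df = pd.DataFrame({'count':  [9, 3, 3, 1],
                         'points': [1, 2, 2, 10]})

grouped = pairs_df.groupby('count').sum()   # 'count' is now the index
flat = grouped.reset_index()                # 'count' is a regular column again

print(grouped)
print(flat)
```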

**Pivot Tables**

In `datascience`, we used the `pivot()` function to build contingency tables:

In [None]:
# datascience: Create new table
cones_tbl = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'),
    'Color', make_array('pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'),
    'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)

cones_tbl

In [None]:
# Pivot on color and flavor
cones_tbl.pivot("Flavor", "Color")

We can also pass in the parameters `values` to specify the values in the table and `collect` to specify the aggregation function.

In [None]:
# Set parameters values and collect
cones_tbl.pivot("Flavor", "Color", values = "Price", collect = np.sum)

In Pandas, we use `pd.pivot_table()` to create a contingency table. The argument `columns` is similar to the first argument in `datascience`'s `pivot` function and sets the column names of the pivot table. The argument `index` is similar to the second argument in `datascience`'s `pivot` function and sets the first column of the pivot table or the keys to group on. For more information, refer to the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html)

In [None]:
# pandas: Create new dataframe
cones_df = pd.DataFrame({"Flavor" : ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'],
                         "Color" : ['pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'],
                         "Price" : [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})
cones_df

In [None]:
# Create the pivot table (by default, pivot_table aggregates numeric columns with the mean)
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"])

If a combination of groups has no data, Pandas will output `NaN` for that cell.

We can also specify parameters such as `values` (equivalent to `values` in `datascience`'s `pivot`) and `aggfunc` (equivalent to `collect` in `datascience`'s `pivot`):

In [None]:
# Additional arguments
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"], values = "Price", aggfunc=np.sum)
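In `datascience`, calling `pivot` with only two arguments counts the rows in each combination. To get counts in Pandas, one approach is to pass `aggfunc = "count"` and fill the empty combinations with 0 using `fill_value`. The sketch below also shows `pd.crosstab()`, a separate Pandas function for contingency tables that is not covered elsewhere in this notebook. The cones data is repeated so the example runs on its own.

```python
import pandas as pd

cones_df = pd.DataFrame({
    "Flavor": ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'],
    "Color":  ['pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'],
    "Price":  [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

# Count the rows in each (Color, Flavor) combination; empty combinations become 0
counts = pd.pivot_table(cones_df, index="Color", columns="Flavor",
                        values="Price", aggfunc="count", fill_value=0)

# crosstab builds the same table of counts directly from two columns
xtab = pd.crosstab(cones_df["Color"], cones_df["Flavor"])

print(counts)
print(xtab)
```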

**Joining/Merging**

In `datascience`, we used `join()` to join two tables based on shared values in columns. We specify the column name in the first table to match on, the name of the second table and the column name in the second table to match on.

In [None]:
# datascience: Create new table
ratings_tbl = Table().with_columns(
    'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
    'Stars', make_array(2.5, 3.5, 4)
)
ratings_tbl

In [None]:
# Join cones and ratings
cones_tbl.join("Flavor", ratings_tbl, "Kind")

In Pandas, we can use the `merge()` method to join two dataframes. The first argument is the second dataframe to join with. The parameters `left_on` and `right_on` specify the columns to match on in the left and right dataframes respectively. There are more parameters, such as `how`, which specifies what kind of join to perform (inner (the default), outer, left or right). For more information, refer to this [Kaggle Tutorial](https://www.kaggle.com/crawford/python-merge-tutorial/notebook).

In [None]:
# pandas: Create new ratings df
ratings_df = pd.DataFrame({"Kind" : ['strawberry', 'chocolate', 'vanilla'],
                           "Stars" : [2.5, 3.5, 4]})
ratings_df

In [None]:
# Merge cones and ratings
cones_df.merge(ratings_df, left_on = "Flavor", right_on = "Kind")
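To see what the `how` parameter changes, the sketch below merges the same cones and ratings data three ways and compares the number of rows. The data is repeated so the example runs on its own; the result names `inner`, `left` and `outer` are only for this illustration.

```python
import pandas as pd

cones_df = pd.DataFrame({
    "Flavor": ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'],
    "Price":  [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})
ratings_df = pd.DataFrame({"Kind":  ['strawberry', 'chocolate', 'vanilla'],
                           "Stars": [2.5, 3.5, 4]})

# Inner (default): only flavors present in both dataframes
inner = cones_df.merge(ratings_df, left_on="Flavor", right_on="Kind")

# Left: keep every cone; flavors without a rating get NaN in Stars
left = cones_df.merge(ratings_df, left_on="Flavor", right_on="Kind", how="left")

# Outer: keep everything from both sides, filling gaps with NaN
outer = cones_df.merge(ratings_df, left_on="Flavor", right_on="Kind", how="outer")

print(len(inner), len(left), len(outer))
```

Here bubblegum has no rating and vanilla has no cone, so the inner join drops both, the left join keeps bubblegum, and the outer join keeps both.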

## 3. Visualizing Data <a id='visualizing'> </a>

In `datascience`, we learned to plot data using histograms, line plots, scatter plots and bar charts. The corresponding methods were `hist()`, `plot()`, `scatter()` and `barh()`. Plotting methods in Pandas are very similar to those in `datascience`, since both build on the `matplotlib` library.

In this section, we will go through examples of such plots in Pandas.

### 3.1 Histograms <a id='histograms'></a>

In `datascience`, we used `hist()` to create a histogram. In this example, we will be using data from `baby.csv`. Recall that the baby data set contains data on a random sample of 1,174 mothers and their newborn babies. The column `Birth Weight` contains the birth weight of the baby, in ounces; `Gestational Days` is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.

In [None]:
# Import matplotlib for plotting
import matplotlib
%matplotlib inline

In [None]:
# datascience: Read in the data (data/baby.csv)
datascience_baby = Table.read_table('data/baby.csv')
datascience_baby

In [None]:
# datascience: Create a histogram
datascience_baby.hist('Birth Weight')

In Pandas, we use `hist()` to create histograms, just like in `datascience`. Refer to the [documentation](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.hist.html) for a full list of parameters.

In [None]:
# pandas: Create a histogram
baby.hist('Birth Weight')

### 3.2 Line Plots <a id='line'></a>

In `datascience`, we used `plot()` to create a line plot of numerical values. In this example, we will use census data and plot the 2014 population estimate against age.

In [None]:
# datascience: Line plot
# https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv
census_tbl = Table.read_table("https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv").select(['SEX', 'AGE', 'POPESTIMATE2014'])
children_tbl = census_tbl.where('SEX', are.equal_to(0)).where('AGE', are.below(19)).drop('SEX')
children_tbl.plot('AGE')

In Pandas, we can use `plot.line()` to create line plots. For a full list of parameters, refer to the [documentation](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.line.html).

In [None]:
# pandas: Line plot
# https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv
census_df = pd.read_csv("https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv")[["SEX", "AGE", "POPESTIMATE2014"]]
children_df = census_df[(census_df.SEX == 0) & (census_df.AGE < 19)].drop("SEX", axis=1)
children_df.plot.line(x="AGE", y="POPESTIMATE2014")

### 3.3 Scatter Plots <a id='scatter'></a>

In `datascience`, we used `scatter()` to create a scatter plot of two numerical columns:

In [None]:
# datascience: Read in the data
# https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/deflategate.csv
football_tbl = Table.read_table('https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/deflategate.csv')
football_tbl

In [None]:
# datascience: Scatter plot
football_tbl.scatter('Blakeman', 'Prioleau')

In Pandas, we use `plot.scatter()` to create a scatter plot. For a full list of parameters, refer to the [documentation](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.scatter.html).

In [None]:
# pandas: Read in the data
# https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/deflategate.csv
football_df = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/deflategate.csv')
# pandas: Scatter plot
football_df.plot.scatter(x="Blakeman", y="Prioleau")

### 3.4 Bar Plots <a id='bar'></a>

In `datascience`, we used `barh()` to create a horizontal bar chart:

In [None]:
# datascience: Horizontal bar chart
t.barh("letter", "points")

In Pandas, we use `plot.barh()` to create a horizontal bar chart. For a full list of parameters, refer to the [documentation](http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.plot.barh.html).

In [None]:
# pandas: Horizontal bar chart
df_from_dict.plot.barh(x='letter', y='points')
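Since Pandas plotting is built on `matplotlib`, the plotting methods used in this section return a matplotlib `Axes` object that we can customize further, for example by adding a title or axis label. The sketch below assumes `matplotlib` is installed. It selects the non-interactive `Agg` backend so that it also runs as a plain script; inside the notebook, `%matplotlib inline` plays that role and the `matplotlib.use` line can be dropped. The title and label text are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; omit when using %matplotlib inline
import pandas as pd

letters_df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                           'points': [1, 2, 2, 10]})

# plot.barh returns a matplotlib Axes object
ax = letters_df.plot.barh(x='letter', y='points')

# Customize the plot through the Axes
ax.set_title('Points per letter')
ax.set_xlabel('points')
```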

---

## Further Reading

Here is a list of useful Pandas resources:

- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [Dataquest Pandas Tutorial](https://www.dataquest.io/blog/pandas-python-tutorial/)
- [Pandas Cookbook](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/tree/master/cookbook/)