# Data: Past, Present, Future
## Exploratory Data Analysis: fun with `pandas`

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use("ggplot")

%matplotlib notebook

A data set for the Academy's consideration:

https://github.com/deedy/gradcafe_data

Let's use:

https://github.com/deedy/gradcafe_data/blob/master/all_uisc_clean.csv?raw=true


### Discussion

**How was this data produced?**

**What assumptions does it make?**

**What purpose was this data collected for?**

**What is being approximated?**

### Back to the Activity

'Cuz I'm nice, I'm gonna give you the names of the columns.

In [None]:
labels = [
    "row_id", "uni_name", "major", "degree", "season", "decision",
    "decision_method", "decision_date", "decision_timestamp", "ugrad_gpa",
    "gre_verbal","gre_quant","gre_writing", "is_new_gre", "gre_subject",
    "status", "post_date", "post_timestamp", "comments"
]

We can read the file directly from the web, without first downloading it with `wget`.

In [None]:
gre = pd.read_csv(
    "https://github.com/deedy/gradcafe_data/blob/master/all_uisc_clean.csv?raw=true",
    sep=",",
    names=labels
)

Note this throws a warning. Does anyone know what it means?
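
If the warning turns out to be pandas's `DtypeWarning` (a column whose values look like a mix of types), one common response is to tell `read_csv` which type to use instead of letting it guess. Here is a minimal sketch on a made-up three-row CSV. The data, and the choice of `gre_subject` as the troublesome column, are assumptions for illustration and have not been checked against the real file.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the third column mixes numbers, text, and blanks.
toy_csv = io.StringIO("1,3.7,unknown\n2,3.9,880\n3,3.5,\n")

toy = pd.read_csv(
    toy_csv,
    names=["row_id", "ugrad_gpa", "gre_subject"],
    dtype={"gre_subject": str},  # read this column as text, no guessing
)

print(toy.dtypes)
print(toy)
```

Reading a messy column as text keeps every value as typed, so you can decide later how to clean it.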

Let's take a looksee...

In [None]:
gre.head()

Note that you can show more or fewer rows by passing a number to `head`, e.g., `gre.head(10)`. What does the inclusion of this particular information tell us about the assumptions of this data set? (Has anyone ever been to the grad cafe?)

Now let's look at an individual column, similar to the way we did last week.

In [None]:
gre["ugrad_gpa"]

What new information do we learn from this? What should concern us? 

In [None]:
gre["ugrad_gpa"].mean()

Recall `describe()` from lab 2. Now we get to properly use it!

In [None]:
gre["ugrad_gpa"].describe()

What might we learn from considering these past three code blocks? Are there any obvious biases? 
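
To see why the mean alone can mislead, here is a small sketch on a made-up Series (these values are invented, not taken from the grad cafe data). It assumes one missing GPA and one GPA reported on a different scale, which are the kinds of things `describe()` can help surface.

```python
import numpy as np
import pandas as pd

# Hypothetical GPAs: one missing value, one reported on a 100-point scale.
gpa = pd.Series([3.5, 3.8, np.nan, 3.9, 95.0], name="ugrad_gpa")

n_present = gpa.count()       # counts only the non-missing entries
n_missing = gpa.isna().sum()  # how many entries are NaN
mean_gpa = gpa.mean()         # NaN is skipped, but the outlier is not
median_gpa = gpa.median()     # much less sensitive to the single outlier

print(n_present, n_missing, mean_gpa, median_gpa)
print(gpa.describe())
```

Comparing `count` with the length of the column, and `mean` with the `50%` row, is a quick check for missing values and outliers.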

In [None]:
gre.describe(include="all")

## Doing snazzy stuff with `groupby`
What do the following three cells do?

In [None]:
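# numeric_only=True skips the text columns; recent pandas raises a TypeError without it
gre.groupby(by="decision").mean(numeric_only=True)

In [None]:
gre.groupby(by="uni_name").mean(numeric_only=True)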

In [None]:
gre.groupby(by="uni_name")[["row_id"]].count() > 100
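
If `groupby` still feels like magic, it can help to run the same calls on a table small enough to check by hand. This is a sketch on invented data. The university names and numbers are hypothetical, and the threshold is 2 rather than 100 so the toy table has something to show.

```python
import pandas as pd

# Hypothetical six-row table with the same column names as the real data.
toy = pd.DataFrame({
    "uni_name": ["A", "A", "A", "B", "B", "C"],
    "decision": ["Accepted", "Rejected", "Accepted",
                 "Rejected", "Accepted", "Rejected"],
    "ugrad_gpa": [3.9, 3.2, 3.7, 3.4, 3.8, 3.0],
})

# Split rows by decision, then average the GPA within each group.
mean_by_decision = toy.groupby(by="decision")["ugrad_gpa"].mean()

# Count rows per university, then keep only the well-represented ones.
counts = toy.groupby(by="uni_name")["ugrad_gpa"].count()
well_represented = counts[counts >= 2]

print(mean_by_decision)
print(well_represented)
```

The comparison on the counts gives one `True`/`False` per group, which is what makes it usable as a filter.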

## Box plot goodness
We saw box plots in lab 2, but never got to use them on anything but simulated data. Now is our opportunity! (For information on the meaning of box plots, see lab 2.) 

In [None]:
fig, axes = plt.subplots()
gre["ugrad_gpa"].plot.box(ax=axes)

In [None]:
gre[["ugrad_gpa"]] <= 4

In [None]:
fig, axes = plt.subplots()
gre["ugrad_gpa"][gre["ugrad_gpa"] <= 4].plot.box(ax=axes)
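
The last two cells rely on boolean masks: a comparison produces a column of `True`/`False`, and indexing with that column keeps only the `True` rows. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical GPAs, two of them above the usual 4.0 ceiling.
gpa = pd.Series([3.2, 4.0, 4.3, 9.1, 3.8], name="ugrad_gpa")

mask = gpa <= 4   # one True/False per row
kept = gpa[mask]  # rows where the mask is True; original index labels survive

print(mask)
print(kept)
```

Note that the filtered rows keep their original index labels (0, 1, 4 here), so nothing is renumbered.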

In [None]:
# other things to do: do a time series, look at gpa over time, etc.
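
One possible take on the "gpa over time" idea, sketched on invented rows. It assumes `decision_timestamp` holds Unix time in seconds, which has not been verified against the real file, so check a few values before trusting it.

```python
import pandas as pd

# Hypothetical rows; the timestamps are assumed to be Unix seconds.
toy = pd.DataFrame({
    "decision_timestamp": [1293840000, 1325376000, 1325462400, 1356998400],
    "ugrad_gpa": [3.5, 3.7, 3.9, 3.6],
})

# Convert the raw numbers to real datetimes.
toy["decision_time"] = pd.to_datetime(toy["decision_timestamp"], unit="s")

# Average GPA per calendar year of the decision.
gpa_by_year = toy.groupby(toy["decision_time"].dt.year)["ugrad_gpa"].mean()

print(gpa_by_year)
```

On the real data, `gpa_by_year.plot()` would then give a rough time series.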

# Let's compare stuff
## Ye olde scatter plot

Pick two columns you want to plot as x and y.


In [None]:
fig, axes = plt.subplots()
gre.plot.scatter(x="ugrad_gpa", y="gre_verbal", ax=axes)

Ugh, something is not right in Denmark, my friend.

What's wrong here? What can we do?

In [None]:
gre_old = gre[gre["is_new_gre"] == 0]

In [None]:
gre_old.describe()

In [None]:
fig, axes = plt.subplots()
gre_old.plot.scatter(x="ugrad_gpa", y="gre_verbal", ax=axes)

In [None]:
gre_new = gre[gre["is_new_gre"] == 1]

In [None]:
gre_new.describe()

In [None]:
fig, axes = plt.subplots()
gre_new.plot.scatter(x="ugrad_gpa", y="gre_verbal", ax=axes)
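
The reason for the split: the old GRE scored verbal on a 200 to 800 scale and the revised test uses 130 to 170, so mixing them on one axis makes two separate clouds. A quick way to check this in a table is to look at the score range per group. This sketch uses invented scores and assumes `is_new_gre` is coded as 0/1, as the cells above suggest.

```python
import pandas as pd

# Hypothetical scores: two old-scale rows, two new-scale rows.
toy = pd.DataFrame({
    "is_new_gre": [0, 0, 1, 1],
    "gre_verbal": [620, 710, 158, 165],
})

# Smallest and largest verbal score within each version of the test.
ranges = toy.groupby("is_new_gre")["gre_verbal"].agg(["min", "max"])

print(ranges)
```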

In [None]:
gre_new_accepted = gre_new[gre_new["decision"] == "Accepted"]

In [None]:
gre_new_accepted.describe()

In [None]:
fig, axes = plt.subplots()
gre_new_accepted.plot.scatter(x="ugrad_gpa", y="gre_verbal", ax=axes)

## How to deal with weirdo GPAs > 4?

In [None]:
gre_new["ugrad_gpa"] <= 4

In [None]:
gre_new_normal_gpa = gre_new[gre_new["ugrad_gpa"] <= 4]

In [None]:
fig, axes = plt.subplots()
gre_new_normal_gpa.plot.scatter(x="ugrad_gpa", y="gre_verbal", ax=axes)
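
The filtering above happens in two steps (new GRE first, then GPA). The same result can be had in one step by combining masks with `&`. A sketch on made-up rows, assuming the same 0/1 coding of `is_new_gre`:

```python
import pandas as pd

# Hypothetical rows mixing old/new GRE and plausible/implausible GPAs.
toy = pd.DataFrame({
    "is_new_gre": [1, 1, 0, 1],
    "ugrad_gpa": [3.6, 4.5, 3.9, 3.8],
})

# Each condition needs its own parentheses; & means "both must hold".
keep = (toy["is_new_gre"] == 1) & (toy["ugrad_gpa"] <= 4)
subset = toy[keep]

print(subset)
```

The parentheses matter: `&` binds more tightly than `==` and `<=` in Python, so leaving them out changes what gets compared.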

Not too promising...

Let's explore *lots* of different potential relations at once.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
gre_new_just_scores = gre_new_normal_gpa[[
    "ugrad_gpa", "gre_verbal", "gre_quant", "gre_writing"
]]

In [None]:
scatter_matrix(
    gre_new_just_scores,
    alpha=0.2,
    figsize=(6, 6),
    diagonal="kde"
)
plt.show()
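
A scatter matrix is a picture; if you also want numbers, `DataFrame.corr()` gives the pairwise (Pearson, by default) correlation for the same columns. A sketch on invented scores, so the values here say nothing about the real data:

```python
import pandas as pd

# Hypothetical scores for five applicants.
toy = pd.DataFrame({
    "ugrad_gpa": [3.0, 3.2, 3.5, 3.7, 3.9],
    "gre_verbal": [150, 152, 155, 157, 160],
    "gre_quant": [165, 160, 168, 158, 166],
})

# One row and one column per variable; entries lie between -1 and 1.
corr = toy.corr()

print(corr.round(2))
```

Correlation only measures linear association, so it complements the plots rather than replacing them.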