In [1]:
cd ../

/Users/caitlynachen/Desktop/luxresearch/lux


This dataset contains a total of 1295 records of American colleges and their properties, collected by the [US Department of Education](https://collegescorecard.ed.gov/data/documentation/).

In [2]:
import pandas as pd
%load_ext autoreload
%autoreload 2
import lux

In [3]:
df = pd.read_csv("lux/data/college.csv")

In [4]:
df.setContext(["MedianEarnings"])

In [5]:
df

Button(description='Toggle Pandas/Lux', style=ButtonStyle())

Output()



We see that `ACTMedian` and `SATAverage` are very strongly correlated. This means that we could keep just one of the two columns and still retain about the same information. So let's drop the `ACTMedian` column.
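Before dropping a column on the strength of a plot, the correlation can also be checked numerically. The sketch below uses a small made-up table (`toy` and its values are illustrative only, not taken from `college.csv`) and plain pandas; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the college data; these values are invented for illustration.
toy = pd.DataFrame({
    "ACTMedian": [18, 21, 24, 27, 30, 33],
    "SATAverage": [880, 990, 1100, 1210, 1330, 1450],
})

# Pearson correlation between the two columns.
# A value close to 1.0 suggests the columns carry nearly the same information.
r = toy["ACTMedian"].corr(toy["SATAverage"])
print(round(r, 3))
```

A coefficient close to 1 supports keeping only one of the two columns.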

In [6]:
df = df.drop(columns=["ACTMedian"])
df

Button(description='Toggle Pandas/Lux', style=ButtonStyle())

Output()



From the Category tab, we see that there are very few records with a `PredominantDegree` of Certificate. In addition, there are not many colleges with a Private For-Profit funding model.
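The counts behind a bar chart like this can also be read off directly with pandas. The following is a minimal sketch on a made-up column (`toy` and the threshold of 2 are assumptions for illustration); the same `value_counts` call should work on `df["PredominantDegree"]` or `df["FundingModel"]`.

```python
import pandas as pd

# Invented sample; the real notebook would use the column from college.csv.
toy = pd.DataFrame({
    "PredominantDegree": [
        "Bachelor", "Bachelor", "Associate",
        "Bachelor", "Certificate", "Associate",
    ],
})

# Number of records per category, sorted from most to least frequent.
degree_counts = toy["PredominantDegree"].value_counts()

# Categories with fewer than 2 records (the threshold is arbitrary here).
rare = degree_counts[degree_counts < 2]

print(degree_counts)
print(rare.index.tolist())
```

Categories that show up in `rare` are candidates for a closer look, as done for Certificate in this walkthrough.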

In [6]:
df[df["PredominantDegree"]=="Certificate"].toPandas()

| Unnamed: 0 | Name | PredominantDegree | HighestDegree | FundingModel | Region | Geography | AdmissionRate | SATAverage | AverageCost | Expenditure | AverageFacultySalary | MedianDebt | AverageAgeofEntry | MedianFamilyIncome | MedianEarnings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1061 | Cleveland State Community College | Certificate | Associate | Public | Southeast | Small City | 1.0 | 910 | 10764 | 5111 | 5424 | 6859.5 | 25.6 | 24530.5 | 26000 |


There is only a single record for Certificate. Looking at the [webpage for programs offered at Cleveland State Community College](http://catalog.clevelandstatecc.edu/content.php?catoid=2&navoid=90), we see that the college offers a large number of associate degrees as well as certificates. So we decide that this record is more appropriately labelled as "Associate" in the `PredominantDegree` field.

In [7]:
df.loc[df["PredominantDegree"]=="Certificate","PredominantDegree"] = "Associate"

Taking a look at the subset of 9 colleges that are Private For-Profit, there aren't really any commonalities across them, so we can leave the data as is for now.

In [8]:
df[df["FundingModel"]=="Private For-Profit"]

LuxWidget(recommendations=[{'action': 'Correlation', 'description': 'Show relationships between two quantitati…



Back to looking at the entire dataset:

In [9]:
df

LuxWidget(recommendations=[{'action': 'Correlation', 'description': 'Show relationships between two quantitati…



You are interested in picking a college to attend and want to understand the `AverageCost` of attending different colleges and how that relates to other information in the dataset.

In [10]:
df.setContext(["AverageCost"])

  checkValExists = spec.attribute in uniqueVals and spec.value in uniqueVals[spec.attribute]


In [11]:
df

LuxWidget(current_view={'config': {'view': {'continuousWidth': 400, 'continuousHeight': 300}}, 'data': {'name'…



We see that there are a large number of colleges that cost around 20000 per year. We also see that Bachelor degree colleges and colleges in New England and large cities tend to have a higher `AverageCost` than their counterparts.

We are interested in the trend of `AverageCost` vs. `SATAverage`: there is a roughly upward relationship above an `AverageCost` of $30000, but below that the trend is less clear.

In [6]:
df.setContext(["AverageCost","SATAverage"])

In [7]:
df

Button(description='Toggle Pandas/Lux', style=ButtonStyle())

Output()



By adding `FundingModel`, we see that the cluster of points on the left can clearly be attributed to public colleges, whereas private colleges more or less follow a trend in which colleges with a higher `SATAverage` tend to have a higher `AverageCost`.
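One way to put a number on this visual impression is to compute the `SATAverage` vs. `AverageCost` correlation separately for each funding model. The sketch below does this on invented data (the `toy` values and the simplified "Public"/"Private" labels are assumptions, not the dataset's actual contents), using only pandas.

```python
import pandas as pd

# Invented data: flat costs for "Public", rising costs for "Private".
toy = pd.DataFrame({
    "FundingModel": ["Public"] * 4 + ["Private"] * 4,
    "SATAverage": [950, 1050, 1150, 1250, 950, 1050, 1150, 1250],
    "AverageCost": [15000, 14000, 16000, 15000, 25000, 32000, 40000, 47000],
})

# Pearson correlation of SATAverage and AverageCost within each group.
corr_by_model = {
    model: group["SATAverage"].corr(group["AverageCost"])
    for model, group in toy.groupby("FundingModel")
}

for model, r in corr_by_model.items():
    print(model, round(r, 2))
```

A much higher coefficient for one group than the other would be consistent with the pattern seen in the scatterplot, though on the real data the actual values would need to be checked.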