# Lecture 1 – Introduction

## Data 6, Summer 2022

This is a Jupyter notebook. In this class, we'll write all of our code in Jupyter notebooks.

Today, don't worry about how any of this works. Throughout the summer, we'll learn how each of these pieces works.

**Note: If you're having trouble loading any plots or maps, try using Google Chrome.**

In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objects as go

## California universities

Here, we'll load in data about all public universities in California. The data comes from [this Wikipedia article](https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_California).

In [None]:
# Load in the "california_universities.csv" file in the "data" folder
uni = Table.read_table('data/california_universities.csv')

# Remove irregular formatting
uni = uni.with_columns(
    'Enrollment', uni.apply(lambda s: int(s.replace(',', '')), 'Enrollment'),
    'Founded', uni.apply(lambda s: int(s.replace('*', '')), 'Founded')
)
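
To see what the cleanup above does to individual values, here is a minimal sketch in plain Python. The helper names (`clean_enrollment`, `clean_founded`) and the sample strings are made up for illustration and are not taken from the CSV file.

```python
# Hypothetical helpers that mirror the two lambdas in the cell above
def clean_enrollment(s):
    # Drop thousands separators, then convert to an integer
    return int(s.replace(',', ''))

def clean_founded(s):
    # Drop footnote asterisks, then convert to an integer
    return int(s.replace('*', ''))

# Made-up sample values, not real rows from the data
print(clean_enrollment('12,345'))  # 12345
print(clean_founded('1900*'))      # 1900
```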

Data is often stored in tables. In a few weeks, we'll become very familiar with how tables work. For now, let's just observe.

In [None]:
# Let's see what the table looks like
uni.show(5)

Let's start asking questions.

### What are the largest public universities in California?

In [None]:
# Largest universities - table format
uni.sort("Enrollment", descending=True).show(5)

In [None]:
# Can we visualize the sizes of each university?
uni.sort("Enrollment", descending=True).barh("Name", "Enrollment")

### What's the oldest public university in California? 🤔

In [None]:
# Oldest university - table format
uni.sort("Founded", descending=False).show(1)

In [None]:
# How can we visualize the ages of the universities?
uni_copy = uni.sort('Founded').with_columns('Total Universities', np.arange(1, uni.num_rows + 1))
uni_copy.plot('Founded', 'Total Universities')
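
The plot above works because sorting by founding year and numbering the rows 1, 2, 3, and so on gives a running total of universities founded up to each year. A minimal sketch of that idea in plain Python, using made-up years rather than the real data:

```python
# Made-up founding years (not the real data)
founded = [1901, 1857, 1947, 1868]

# Sort the years, then pair each with its position (1, 2, 3, ...),
# mirroring np.arange(1, uni.num_rows + 1) in the cell above
founded_sorted = sorted(founded)
running_total = list(range(1, len(founded_sorted) + 1))

for year, total in zip(founded_sorted, running_total):
    print(year, total)
```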

Let's add some spice.

In [None]:
# Just run me
fig = go.Figure()

fig.add_trace(
    go.Scatter(x = uni_copy.column('Founded'), 
               y = uni_copy.column('Total Universities'), 
               hovertext = uni_copy.column('Name'),
               mode = 'markers',
              )
)

fig.add_trace(
    go.Scatter(x = uni_copy.column('Founded'), 
               y = uni_copy.column('Total Universities'),
               line = dict(color = 'blue'),
              )
)

fig.update_layout(title = 'Total Number of Public Universities in California by Year',
                  xaxis_title = 'Year',
                  yaxis_title = 'Total Universities',
                  showlegend = False)

fig.show()

## Public Universities in California (and you!)

### Where are the public universities in California located?

First, we need some additional information:

In [None]:
# Load in the "uni_locations.csv" file in the "data" folder
uni_locations = Table.read_table('data/uni_locations.csv')
uni_locations

Let's combine some data.

In [None]:
# Join the `uni` and `uni_locations` tables
unis_with_location = uni.join("Name", uni_locations, "University")
unis_with_location
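
Conceptually, the join above matches each row of `uni` with the row of `uni_locations` whose `University` value equals its `Name`. Here is a rough sketch of that idea with plain dictionaries. The rows, coordinates, and the `join_rows` helper are all hypothetical, and the real `join` method may handle cases that this sketch does not.

```python
# Made-up rows standing in for the two tables
uni_rows = [
    {'Name': 'School A', 'Enrollment': 1000},
    {'Name': 'School B', 'Enrollment': 2000},
]
location_rows = [
    {'University': 'School B', 'Latitude': 34.0, 'Longitude': -118.0},
    {'University': 'School A', 'Latitude': 38.0, 'Longitude': -122.0},
]

def join_rows(left, left_key, right, right_key):
    # Index the right-hand rows by their key for quick lookup
    lookup = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[left_key])
        if match is None:
            continue  # rows without a match are dropped
        combined = dict(row)
        # Copy every right-hand column except the duplicate key column
        combined.update({k: v for k, v in match.items() if k != right_key})
        joined.append(combined)
    return joined

joined = join_rows(uni_rows, 'Name', location_rows, 'University')
print(joined)
```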

What if we want to plot these on a map?

We can use `plotly`, a library that adds interactive plotting tools to Python!

In [None]:
# Just run me

def bubble_plot(tbl, text, size=None, lat="Latitude", lon="Longitude", color=None, title=None, scale_factor=150):
    fig = go.Figure()
    
    if not color:
        color_arr = ['royalblue'] * tbl.num_rows
    else:
        color_arr = tbl.column(color)
        
    if not size:
        size_arr = [1 / scale_factor] * tbl.num_rows
    else:
        size_arr = tbl.column(size) / scale_factor

    fig = fig.add_trace(go.Scattergeo(
                            lat = tbl.column(lat), 
                            lon = tbl.column(lon),
                            text = tbl.column(text),
                            marker = dict(
                                size = size_arr,
                                sizemode = 'area',
                                color = color_arr
                            )
                        ))

    fig.update_geos(fitbounds="locations")
    fig.update_layout(
        geo = dict(
                scope = 'usa',
                landcolor = 'rgb(217, 217, 217)',
            ),
        title = title
    )
    
    return fig


In [None]:
# Call the `bubble_plot` function, passing in the proper arguments
fig = bubble_plot(unis_with_location, text="Name", size="Enrollment", title="Public Universities in California")
fig.show()

Can we add more information?

In [None]:
# Let's add a color column
unis_with_color = unis_with_location.with_column('Color', ['crimson'] * unis_with_location.num_rows)
unis_with_color

In [None]:
# Use the `bubble_plot` function to map the universities, this time specifying the bubble color
fig = bubble_plot(unis_with_color, text="Name", size="Enrollment", color="Color", title="Public Universities in California")
fig.show()

It would be nice if this were color-coded based on UC vs. CSU. We can do that!

In [None]:
# Just run me
def code_uc(name):
    if 'University of California' in name:
        return 'royalblue'
    else:
        return 'crimson'

In [None]:
# Apply the `code_uc` function to the 'Name' column to color-code the universities
uni_locations_separate = unis_with_color.with_column('Color', unis_with_color.apply(code_uc, 'Name'))
uni_locations_separate
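
The `apply` call above runs `code_uc` once for each value in the `Name` column and collects the results. A small sketch of the same idea using a list comprehension; the two names are samples chosen for illustration:

```python
def code_uc(name):
    # Same rule as the function defined earlier
    if 'University of California' in name:
        return 'royalblue'
    else:
        return 'crimson'

# Sample names for illustration
names = ['University of California, Berkeley', 'San Jose State University']

# One function call per name, results collected in order
colors = [code_uc(name) for name in names]
print(colors)  # ['royalblue', 'crimson']
```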

In [None]:
# Plot the color-coded universities on the map with the `bubble_plot` function
fig = bubble_plot(uni_locations_separate, text="Name", size="Enrollment", color="Color", title="UCs and CSUs")
fig.show()

Voilà!

### Where are you all from?

Using the responses from the welcome survey, let's use our knowledge of Python to plot the hometowns of the students in Data 6!

In [None]:
# Load in the "student_hometowns.csv" file from the "data" folder
hometowns = Table.read_table("data/student_hometowns.csv")
hometowns

In [None]:
# Plot the hometowns of Data 6 students using the `bubble_plot` function
fig = bubble_plot(hometowns, text="City", title="Where Data 6 Students Are From", scale_factor=0.02)
fig.show()
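
In the call above no `size` column is given, so `bubble_plot` falls back to a constant size of `1 / scale_factor` for every row. Here is a minimal sketch of that sizing rule on its own. `marker_sizes` is a hypothetical helper that mirrors the logic inside `bubble_plot`, and the enrollment numbers are made up.

```python
def marker_sizes(n_rows, values=None, scale_factor=150):
    # No size column: every marker gets the same size
    if values is None:
        return [1 / scale_factor] * n_rows
    # Otherwise each value is divided by the scale factor
    return [v / scale_factor for v in values]

# Hometown map: constant size with scale_factor=0.02
print(marker_sizes(3, scale_factor=0.02))

# University map: size proportional to (made-up) enrollment
print(marker_sizes(2, values=[30000, 15000]))
```

A smaller `scale_factor` produces larger bubbles, which is why the hometown map passes 0.02 instead of the default of 150.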

The end!