In [None]:
from datascience import *
import numpy as np

# Notice the new library
import matplotlib.pyplot as plt # pyplot is the main module in matplotlib for plotting in Python
plt.style.use('fivethirtyeight') # setting up styles
# inline lets us embed plots in our notebook as static images
%matplotlib inline 

## Importing Datasets

To import datasets into JupyterHub, follow these steps:

1) Download your dataset. (CSV is preferred for the datascience library.)

2) Click on File > Open... to open the directory your notebook is in.

3) Click Upload and upload your dataset.

4) Go back to the notebook, and import the dataset using the `Table().read_table("your_file_name.csv")` method.


In [None]:
# Let's practice importing the bechdel dataset: bechdel.csv
bechdel = ...
bechdel

In [None]:
# We can use attributes to learn more about the dataset: num_rows, labels, num_columns
bechdel...

In [None]:
## As a quick note: tabular data is how most data scientists work with their datasets.
# datascience is our main tool in this class. datascience is largely an educational tool,
# and a simplified version of the more industry-standard pandas library

import pandas as pd
bechdel_df = pd.read_csv("bechdel.csv")
bechdel_df.head()

# pandas has a lot of interesting syntax and requires decent knowledge of Python
# we'll have some workshops and demos on this later in the semester,
# but notice the similarities! 

## Basic Table Manipulations

The majority of data science will be taking these large datasets (in the form of tables) and performing manipulations on them to learn something interesting.

To manipulate tables, we will use methods, which are functions that act on a specific object -- in this case, a table. For example: `tbl.where("column", value)`

You should get comfortable using these methods. The nice thing about methods is that, instead of nesting them like functions:

`sum([max(1, 3), min(4, 6)])`,

we can string them together and read them from left to right. For example:

`tbl.where("column", value).sort("column2").column("column3")`

Today, we'll go over some of the basic table methods. This will be your best friend: http://data8.org/sp21/python-reference.html 
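
To see the difference between nesting and chaining without needing a table, here is a minimal sketch in plain Python. The list and the string are made up for illustration. The idea carries over to tables: each method returns a value, and the next method in the chain is called on that value.

```python
# Nested function calls are read from the inside out:
# set(...) runs first, then sorted(...), then len(...).
words = ["pear", "Apple", "fig", "pear"]
nested = len(sorted(set(words)))

# Chained method calls are read from left to right:
# strip, then lowercase, then replace.
title = "  the bechdel TEST  "
chained = title.strip().lower().replace("test", "dataset")

print(nested)
print(chained)
```

Reading the chain aloud ("strip it, lowercase it, replace a word") matches the order in which the steps actually happen, which is why long table manipulations are usually written this way.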

In [None]:
# To view a table, we can use tbl.show(). What's the datatype?
bechdel...

In [None]:
# That's a lot of columns! Let's clean it up.
# For this analysis, we only want the columns "year", "title", "clean_test", "binary"
clean_bechdel = ...
clean_bechdel

In [None]:
# Let's make it more readable. Change clean_test to "test" and binary to "result"
# Notice how we edit the table "in place"
clean_bechdel...
# tbl.sort


In [None]:
# Now, let's see what the first and last years in the dataset are. We can do this by sorting and looking at the table.
clean_bechdel...

In [None]:
# Let's focus on a single year. In that year, how many movies passed or failed the test? 
# We can use tbl.where to filter a table by some condition
clean_bechdel...

# What movies in 2010 failed the Bechdel test? Let's get the data as an array.
...

In [None]:
# I did some data cleaning and manipulations for you, and created this table "bechdel_by_year.csv"
# which shows, for each year, the proportion of movies in the dataset that passed/failed the Bechdel test
# You should be able to figure out how to do this transformation in a few weeks. 
props_by_year = ...
props_by_year

In [None]:
# (More on this next week) We can use tbl.group("col") to find the count of all unique values in a column
# Does this explain anything about the pattern we saw above? 
clean_bechdel.group("year").plot(0)

## Intro to Matplotlib workshop

So far, we've made a few graphs using the tools from the datascience library. It's easy; all we needed to do was tbl.plot("x", "y"), tbl.scatter("x", "y"), tbl.barh("categories"), etc.

But most people don't use datascience to graph! They use matplotlib, which is a popular visualization library used to graph data in Python. For our purposes, it's more useful because matplotlib gives us a lot more control over what we graph than the plotting methods in datascience do.

(Note: this is for your personal learning; you're allowed to use the datascience functions for everything in this class).


In [None]:
# If you have experience in R, you'll notice that there are /a lot/ of similarities in syntax.
# Recall that we imported matplotlib.pyplot as plt at the start

# Q1: Let's recreate the graph of how movies have gotten better (or worse)
# in terms of the Bechdel test over time.

x = ...
y = ...

# Building a plot: figure/figsize, plot, xlim/ylim, xlabel/ylabel, axhline/axvline, title
plt...

In [None]:
#Q2 (depending on time): Make a bar chart comparing the number of each category from "test" 
# from the clean_bechdel table in 2010

# Hint: plt.barh(categories, values), where both arguments are arrays
tests_2010 = ...
tests_2010

...

## Histograms

Histograms are very important tools for analyzing the distribution of our datasets. What are the center and spread of a variable in the sample? In this course, we can make them with tbl.hist("col") or with matplotlib.

Some general rules about histograms:

1) Histograms look like bar charts, but instead of categories separating the groups, we have a continuous range. The range will be split into categories by setting "bins" (ex. value a to b, b to c, and c to d) that values fall into. 

2) We are going to forever use density histograms, NOT count histograms. Density is a measure of % of data per x-unit. Density is better than counts because it scales based on the size of our bins and therefore gives a better idea of the distribution (see here: https://www.inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html)

3) Bins are inclusive of the left end but non-inclusive of the right end. In mathematical terms, a bin from A to B is the interval \[A, B). For example: if we have the bin [0, 10), the values 5 and 9.99999999 fall in this bin, but 10 does not.

4) Lastly, because of these rules, we can make the following conclusions:

(Percent of data in bin) = Area of bin = Height (in density) × Width of bin (B - A, where B > A)

and the areas of all the bins sum to 100 (percent). If you're confused about this, try doing the algebra/unit conversions.
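
If the algebra feels abstract, here is a small sketch that works through the rules above by hand in plain Python. The values and bin edges are made up for illustration (they are not the homework data), and the variable names are just placeholders.

```python
# Made-up values and bins, chosen only for illustration.
values = [1, 2, 2, 5, 7, 8, 9, 10, 15, 19]
bin_edges = [0, 5, 10, 20]  # bins [0, 5), [5, 10), [10, 20)

areas = []
for left, right in zip(bin_edges[:-1], bin_edges[1:]):
    # Left end included, right end excluded: the value 10 lands in [10, 20).
    count = sum(1 for v in values if left <= v < right)
    percent = 100 * count / len(values)  # percent of data in this bin
    width = right - left
    height = percent / width             # density: percent per x-unit
    areas.append(height * width)         # area of the bar recovers the percent
    print(f"[{left}, {right}): height = {height}, area = {height * width}")

print(sum(areas))
```

Notice that the first and last bins hold the same percent of the data, but the last bar is half as tall because its bin is twice as wide. That is the sense in which density "scales" with bin size.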

In [None]:
# Some made-up data. 
age_data =  Table().with_column("Ages", [3, 4, 8, 9, 14, 18, 20, 23, 24, 35])
age_data

In [None]:
# How we'll make histograms in homework
bins = ...
age_data.hist(...)

### A toy example: a binomial (coin-flip) and normal distribution

A binomial distribution describes the number of successes in *n* independent experiments, each of which has a binary success/failure outcome. The probability of a success in each experiment is *p*, and the probability of a failure is therefore *(1-p)*.

We'll learn about normal distributions later, but as a preview of what you'd see in a future data science course, all normal distributions follow this equation:

The *normal curve with mean $\mu$ and SD $\sigma$* is defined by

$$
f(x) ~ = ~ \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\big{(}\frac{x-\mu}{\sigma}\big{)}^2}, ~~~ -\infty < x < \infty
$$
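
As a quick sanity check on the formula, here is a sketch that evaluates it with Python's `math` module. The function name `normal_pdf` is our own choice, not something from a library. It checks two facts that follow from the equation: the curve is tallest at the mean, and the total area under it is (approximately, since we are summing thin rectangles) 1.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and SD sigma at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (math.sqrt(2 * math.pi) * sigma)

# At x = mu the exponent is 0, so the height is 1 / (sqrt(2 * pi) * sigma).
peak = normal_pdf(0, 0, 1)

# Add up thin rectangles of width 0.001 between -6 and 6.
# Almost all of the area lies in this range, so the sum should be close to 1.
step = 0.001
area = sum(normal_pdf(-6 + i * step, 0, 1) * step for i in range(12000))

print(peak, area)
```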

One cool thing is that if we repeat an experiment of *n* coin flips many times and count the heads each time, the distribution of those counts (a binomial distribution) will be approximately normal, as long as *n* is reasonably large. Let's demonstrate this property with simulated coin flips and matplotlib!
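
Before plotting anything, here is a rough sketch of the idea using only Python's standard library (the notebook itself uses `np.random`; the `random` module is used here only to keep the sketch dependency-free). The number of flips, the number of repetitions, and the seed are illustrative choices. If the counts are approximately normal, their mean should be near *np*, their SD near the square root of *np(1-p)*, and roughly two thirds of them should fall within one SD of the mean.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n_flips, p = 50, 0.5     # illustrative values
repetitions = 10_000

# Each entry is the number of heads in one experiment of n_flips flips.
heads_counts = [
    sum(rng.random() < p for _ in range(n_flips))
    for _ in range(repetitions)
]

mean = sum(heads_counts) / repetitions
sd = math.sqrt(sum((h - mean) ** 2 for h in heads_counts) / repetitions)
within_one_sd = sum(abs(h - mean) <= sd for h in heads_counts) / repetitions

print(mean, n_flips * p)                      # simulated vs. theoretical mean
print(sd, math.sqrt(n_flips * p * (1 - p)))   # simulated vs. theoretical SD
print(within_one_sd)                          # share of counts within one SD
```

The simulated numbers will not match the theoretical ones exactly, and they will change if you change the seed. The point is only that they land close.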

In [None]:
# We'll learn later that, if we perform an experiment many times,
# the distribution of our observed ("empirical") results will converge on the true ("probability") distribution
# This is a fundamental part of computational statistics/data science, because 
# we're going to use simulations to approximate distributions. 


# The np.random module; let's simulate 50 flips of a fair coin, many many times
heads = ...
heads

In [None]:
# Make a histogram of the results using matplotlib
bins = np.arange(min(heads) - 0.5, max(heads) + 1.5, 1)
# weird bins, but putting the edges at half-integers makes sure only 1 integer value falls within each bin
# and that the curve draws correctly; in other words, it centers each bar on its value
# (the end is max + 1.5 because np.arange excludes its endpoint, so the last edge is max + 0.5)
plt.figure(figsize = (10, 5)); 
plt.hist(heads, bins = bins, density = True);

# What do you notice? Center, spread, shape

In [None]:
# Draw the normal curve over the histogram with plt.plot(...)
mean = ...
std = ...

x = ... 
f_x = (1 / (np.sqrt(2 * np.pi) * std)) * np.e ** (-0.5 * ((x - mean) / std) ** 2) # the normal curve equation

...


## Next Week

For our next discussion, we will spend more time on Python coding fundamentals. We'll be going over (1) conditionals, or how to make decisions with your code using "Boolean" values and (2) how to write functions to make your code more versatile. We will also try to go over some more complicated table manipulations (group, pivot, join) which we can use to pull more interesting information from your dataset.

If you want to learn more about matplotlib, check out these resources:

https://matplotlib.org/users/index.html 

https://realpython.com/python-matplotlib-guide/

https://towardsdatascience.com/visualizations-with-matplotlib-part-1-c9651008b6b8
