# Python Libraries Interactive Notebook

Today, we will be learning about four fundamental Python libraries. These are widely used, and plenty of documentation can be found online. Don't be afraid to search Google/Stack Overflow!
- Numpy: https://docs.scipy.org/doc/numpy-dev/user/index.html
- Pandas: http://pandas.pydata.org/pandas-docs/stable/
- Matplotlib: https://matplotlib.org/2.1.2/index.html
- Seaborn: https://seaborn.pydata.org/

# Table of Contents

I. [Numpy](#1)<br>
II. [Pandas](#2)<br>
III. [Plot with Pandas](#2.5)<br>
IV. [Matplotlib](#3)<br>
V. [Seaborn](#4)

### Jupyter Notebook Recap

`To run a cell: select cell, press SHIFT + ENTER`

Only the value of the last line of a cell is displayed automatically

In [0]:
"this will NOT be displayed"
"this will be displayed"

If cells contain `...` we expect you to replace `...` with your code :)

## Import

`numpy`, `pandas`, `matplotlib`, and `seaborn` are made by other people! We need to `import` these modules in order to use them.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# %matplotlib inline
# sns.set()

# <font id="1" color="blue">Numpy</font>

<tr><td>
<img src="http://rickizzo.com/images/posts/2017-12-19/numpy.jpeg"/></td>

<td style="text-align:left">Numpy's main use is ```np.array```
<br><br>
Numpy arrays take less space than built-in lists and come with a **wide variety of useful functions.**</td></tr>

In [0]:
# make an array
a = np.array([2,3,4])
a

In [0]:
# make a 2-dimensional array (matrix)
matrix = np.array([ [1,2,3],
                    [4,5,6],
                    [7,8,9] ])
matrix

Linear Algebra!

In [0]:
# you can multiply matrices with np.dot
np.dot(matrix, a)

### Arithmetic with numpy!

**You can add/subtract/multiply/divide with numpy arrays!** You *cannot* do this with built-in python lists.

In [0]:
a + 5

In [0]:
a * -1

In [0]:
b = np.array([3, 2, 1])
a + b

If you try to perform operations on two arrays of different lengths, <span style="color:red">an error will occur.</span> Try running the following cell!

In [0]:
# Run me!
b + np.array([1, 2, 3, 4])
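
If you would like to see the error without stopping the notebook, here is a minimal sketch that catches it. It assumes only `numpy`; the variable `mismatch_raised` is a made-up name for illustration.

```python
import numpy as np

b = np.array([3, 2, 1])

# A single number is applied to every element, so this works.
print(b + 5)

# Two arrays with different lengths (3 vs. 4) cannot be paired up element by element.
try:
    b + np.array([1, 2, 3, 4])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

The error message reports the two shapes that could not be combined, which is usually the fastest clue for debugging.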

Use ```len( array )``` to find the length of an array.

In [0]:
len(b)

In [0]:
len(np.array([1, 2, 3, 4]))

Conditionals apply to every element of a numpy array as well. This will come in handy later!

In [0]:
a = np.array([1, 2, 3, 1, 1])
a == 1
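
As a preview of why this comes in handy, the True/False array can be used to pick out or count matching elements. This is a small sketch; the names `mask`, `ones_only`, and `num_ones` are just illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 1, 1])
mask = (a == 1)          # array of True/False, one per element

ones_only = a[mask]      # keep only the elements where mask is True
num_ones = np.sum(mask)  # True counts as 1, False counts as 0
print(ones_only, num_ones)
```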

### Essential array functions

Why do we use Numpy? **Numpy provides a multitude of useful functions for arrays.** We'll teach you a few (many more exist!)

<font color="blue">Exercise:</font> Search online how to find the mean of a numpy array.

In [0]:
x = np.array([1, 5, -7, 18, 1, -2, 4])

In [0]:
# Find the mean of array x
x_mean = np.mean(...)

Here's a list of some useful numpy functions. Remember, you can easily find info about these by searching Google or the numpy documentation!

In [0]:
np.sum(x)

In [0]:
np.min(x)

In [0]:
np.max(x)

In [0]:
np.median(x)

In [0]:
np.cumsum(x)

In [0]:
np.abs(x)

What do you think ```np.cumsum``` does? Note, numpy has a similar function ```np.cumprod```. Try it!

What do you think ```np.diff``` does?

In [0]:
np.diff(x)
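
Once you have made your guess, one way to check it is to compare `np.cumsum` and `np.diff` side by side. The sketch below reuses the array `x` from above; the other variable names are made up for illustration.

```python
import numpy as np

x = np.array([1, 5, -7, 18, 1, -2, 4])

running_total = np.cumsum(x)   # running_total[i] = x[0] + ... + x[i]
steps = np.diff(x)             # steps[i] = x[i + 1] - x[i]

# Differencing the running total gives back the original values (after the first).
recovered = np.diff(running_total)
print(running_total)
print(steps)
print(np.array_equal(recovered, x[1:]))
```

Note that `np.diff` returns one fewer element than its input, because it works on neighboring pairs.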

Two super useful functions in numpy are `np.arange` and `np.linspace`. They allow you to craft arrays with equidistant values:
* np.arange asks for [`start`], `stop`, and [`step`]
* np.linspace asks for `start`, `stop`, and `num`

In [0]:
np.arange(0, 100, 10)

In [0]:
np.linspace(0, 100, 15)
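
One difference that is easy to miss: `np.arange` leaves out `stop`, while `np.linspace` includes it by default. A short sketch, with illustrative variable names:

```python
import numpy as np

by_step = np.arange(0, 100, 10)      # steps of 10; stop (100) is NOT included
by_count = np.linspace(0, 100, 11)   # 11 evenly spaced points; stop (100) IS included

print(by_step)
print(by_count)
```

So `np.arange` is convenient when you know the step size, and `np.linspace` when you know how many points you want.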

### Numpy arrays vs. Python lists

Using an ```np.array``` is a little bit different from using a built-in Python list.

In [0]:
a = np.array([2, 3, 4])
b = [2, 3, 4]
print(a)
print(b)

#### Adding values to np.array is different

In [0]:
b.append("hello")
b

In [0]:
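# np.append returns a new array, so we reassign it to a
# (adding a string turns every element of the array into a string)
a = np.append(a, 'hello')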
a
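
To make the difference from `list.append` concrete, here is a small sketch showing that `np.append` leaves the original array alone and hands back a new one. The name `longer` is just for illustration.

```python
import numpy as np

a = np.array([2, 3, 4])
longer = np.append(a, 5)   # builds and returns a NEW array

print(a)        # the original array still has 3 elements
print(longer)   # the new array has 4 elements
```

This is why the cell above writes `a = np.append(a, ...)` instead of just calling `np.append`.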

#### For loops work the same way

In [0]:
c = np.array([1, 2, 3, 4, 5])
cumulative_product = 1

for element in c:
    cumulative_product *= element
    
cumulative_product

### <font color="blue">Numpy Exercises</font>

Use [`np.arange`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) to create an array called `arr1` that contains every odd number from 1 to 100, inclusive.

In [0]:
arr1 = np.arange(...)
arr1

Use `arr1` to create an array `arr2` of every number divisible by 4 from 1 to 200, inclusive.

In [0]:
arr2 = ...
arr2

Create the same array, but using [`np.linspace`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html) instead. Call this array `arr3`.

In [0]:
arr3 = np.linspace(...)
arr3

Print the following summary statistics for `arr3`: 

* minimum
* 1st quartile (Hint: See [`np.percentile()`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html))
* median
* mean
* standard deviation
* 3rd quartile
* max


In [0]:
print('Minimum: '            + str(...))
print('1st quartile: '       + str(...))
print('Median: '             + str(...))
print('Mean: '               + str(...))
print('Standard Deviation: ' + str(...))
print('3rd Quartile: '       + str(...))
print('Max: '                + str(...))

# <span id="2" style="color: blue">Pandas</span>

<tr><td><img width=200 src="https://c402277.ssl.cf1.rackcdn.com/photos/13100/images/featured_story/BIC_128.png?1485963152"/></td><td>

Pandas is all about tables!</td></tr>

A table is called a ['dataframe'](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) in Pandas. Consider the table `fruit_info`:



<table border="1" class="dataframe">
  <thead><tr><th>color</th><th>fruit</th></tr></thead>
<tr><td>red</td><td>apple</td></tr>
<tr><td>orange</td><td>orange</td></tr>
<tr><td>yellow</td><td>banana</td></tr>
<tr><td>pink</td><td>raspberry</td></tr>
</table>

## Pandas Series

Let's break this table down. DataFrames consist of columns called **```Series```**. Series act like numpy arrays.

_How to make a Series:_

1.   create a numpy ```array```
2.   call ```pd.Series(array, name="...")``` &nbsp;&nbsp; <font color="gray"># name can be anything</font>

<font color="blue">Exercise:</font> Make a Series that contains the colors from `fruit_info` and has `name='color'`


In [0]:
array = np.array(...)
color_column = pd.Series(...)
color_column

<font color="blue">Exercise:</font> Make another Series for the fruit column:

In [0]:
array = np.array(...)
fruit_column = pd.Series(...)
fruit_column

Combine your Series into a table!

`pd.concat([ series1, series2, series3, ... ], axis=1)`

Don't forget ```axis=1``` or you'll just make a giant Series.

In [0]:
fruit_info = pd.concat([color_column, fruit_column], axis=1)
fruit_info
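
To see what the axis argument changes, here is a minimal sketch on made-up data (the `letters` and `numbers` Series are hypothetical and not part of `fruit_info`).

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not one of the notebook's tables.
letters = pd.Series(np.array(['a', 'b', 'c']), name='letter')
numbers = pd.Series(np.array([1, 2, 3]), name='number')

side_by_side = pd.concat([letters, numbers], axis=1)  # 3 rows x 2 columns
stacked = pd.concat([letters, numbers])               # one long Series of length 6

print(side_by_side.shape)
print(stacked.shape)
```

With `axis=1` each Series becomes a column, and the Series names become the column headers.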

What if we were given the DataFrame and we want to extract the columns?

In [0]:
fruit_info['fruit'] # we get the fruit_column Series back!

### Dictionaries

Also, we can manually create tables by using a [python dictionary](https://www.python-course.eu/dictionaries.php). A dictionary has the following format:

```
d = { "name of column"   :  [  list of values  ],
      "name of column 2" :  [  list of values  ],
                        ...
                        ...
    }
```
    

In [0]:
d = { 'fruit' : ['apple', 'orange', 'banana', 'raspberry'],
      'color' : ['red', 'orange', 'yellow', 'pink']
    }

In [0]:
fruit_info_again = pd.DataFrame(d)
fruit_info_again

### Add Columns

Add a column to `table` labeled "new column" like so:

`table['new column'] = array`

In [0]:
fruit_info['inventory'] = np.array([23, 18, 50, 20])
fruit_info

<font color="blue">Exercise:</font> Add a column called ```rating``` that assigns your rating from 1 to 5 for each fruit :) 

In [0]:
fruit_info['rating'] = ...

fruit_info  # should now include a rating column

### Drop

<font color="blue">Exercise:</font> Now, use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `color` column.

In [0]:
fruit_info_without_color = ... # must include axis=1

fruit_info_without_color

## California Baby Names

Time to use a real dataset!

You can read a `.csv` file into pandas using `pd.read_csv( url )`.

Create a variable called `baby_names` that loads this data: `https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/ca_baby_names.csv`



In [0]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv")

Let's display the table. We can just type `baby_names` and run the cell but baby_names is HUGE! So, let's display just the first five rows with:

`DataFrame.head( # of rows )`

In [0]:
baby_names.head(5)

## Row, Column Selection

Follow the structure:

`table.loc[rows, columns]`

`table.loc[2:8, [ 'Name', 'Count']]`

The above code will select columns "Name" and "Count" from rows 2 **through** 8.

In [0]:
# Returns the name of our columns
baby_names.columns

In [0]:
baby_names.loc[2:8, ['Name', "Count"]]
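
The "through" matters: `.loc` slices by row label and includes the last label, unlike ordinary Python slicing. Here is a sketch on a made-up table (`toy` is hypothetical) comparing it with position-based `.iloc`.

```python
import pandas as pd

# Hypothetical table with the default 0, 1, 2, ... row labels.
toy = pd.DataFrame({'Name': ['Ann', 'Bo', 'Cy', 'Di', 'Ed'],
                    'Count': [5, 8, 13, 21, 34]})

by_label = toy.loc[1:3, ['Name', 'Count']]   # rows labeled 1, 2, AND 3
by_position = toy.iloc[1:3]                  # positions 1 and 2 only

print(len(by_label), len(by_position))
```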

<font color="blue">Exercise:</font> Return a table that includes rows 1000-1005 and only includes the column "Name".

In [0]:
baby_names.loc[...]

In [0]:
# Want to select EVERY row?
# Don't put anything before and after the colon :
baby_names.loc[:, ['Sex', 'Name']].head(4)

### Selecting an entire Column

Remember we can extract the column in the form of a **Series** using:

`table_name['Name of column']`

In [0]:
name_column = baby_names['Name']
name_column.head(5) # we can also use .head with Series!

### Selecting rows with a Boolean Array

Lastly, we can select rows based off of True / False data. Let's go back to the simpler `fruit_info` table.

In [0]:
fruit_info

In [0]:
# select row only if corresponding value in *selection* is True
selection = np.array([True, False, True, False])
fruit_info[selection]

## Filtering Data

So far we have selected data based off of row numbers and column headers. Let's work on filtering data more precisely.

`table[condition]`

In [0]:
condition = baby_names['Name'] == 'Carlo'
baby_names[condition].head(5)

The above code only selects rows that have Name equal to 'Carlo'. Change it to your name!

### Apply multiple conditions!

 `table[ (condition 1)  &  (condition 2) ]`
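
Here is a minimal sketch of the pattern on a made-up table, so it does not give away the exercise. The `pets` table and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical table; the column names are made up for illustration.
pets = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                     'age':  [2, 7, 9, 1]})

# Each condition needs its own parentheses. Use & (and) or | (or),
# not the Python keywords `and` / `or`.
old_cats = pets[(pets['kind'] == 'cat') & (pets['age'] > 5)]
print(old_cats)
```

Leaving out the parentheses is a common source of confusing errors, because `&` binds more tightly than `==` and `>`.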
 

 
<font color="blue">Class Exercise:</font> Select the names from the year 2000 that have a count greater than 3000.

In [0]:
result = ...
result.head(3)

### Thorough explanation:

Remember that calling `baby_names['Name']` returns a **Series** of all of the names.

Checking if values in the Series are equal to `Carlo` results in a Series of True/False values.

Then, we select rows based off of this boolean Series. Thus, we could also do:

In [0]:
names = baby_names['Name']
equalto_Carlo = (names == 'Carlo')  # equalto_Carlo is now a Series of True/False values!
baby_names[equalto_Carlo].head(5)

## Using Numpy with Pandas

How many rows does our `baby_names` table have?

In [0]:
len(baby_names)

That's a lot of rows! We can't just look at the table and understand it.

Luckily, **Numpy** functions treat pandas **Series** as np.arrays.

<font color="blue">Exercise:</font> What are the oldest and most recent years that we have data from in `baby_names`?
HINT: np.min, np.max

In [0]:
recent_year = ...
oldest_year = ...
(recent_year, oldest_year)

<font color="blue">Exercise:</font> How many babies were born in CA in 2015?

Hint: the 'Count' column refers to the number of occurrences of a baby name. How could we find the total number of babies? Now narrow that to only 2015.

In [0]:
baby_names_2015 = ...
baby_names_2015_counts = ...
number_baby_names_2015 = np.sum(...)
number_baby_names_2015

In [0]:
# Or to do it all in one operation:
...

### np.unique

In [0]:
# return an array with an element for each unique value in the Series/np.array
np.unique(baby_names['Sex'])
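
`np.unique` can also report how often each value appears, through its `return_counts` argument. A small sketch on made-up labels (the `labels` array is hypothetical):

```python
import numpy as np

# Hypothetical values standing in for a column of labels.
labels = np.array(['F', 'M', 'F', 'F', 'M'])

values, counts = np.unique(labels, return_counts=True)
print(values)   # sorted unique values
print(counts)   # how many times each one appears
```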

In [0]:
# demo
states = np.unique(baby_names['State']) # okay now we know this dataset only involves California babies.

<font color="blue">Class Exercise:</font> Find the number of different baby names in our dataset.

In [0]:
names = ...
number_unique_names = ...
number_unique_names

## Copy vs View

Depending on how you write your code, pandas might be returning a copy of the dataframe (i.e. a whole new dataframe, but just with the same values), or a view of the dataframe (i.e. the same dataframe itself).

In [0]:
carlos_fruits = fruit_info.copy()
carlos_fruits

Let's say Carlo is happy with those ratings. But Jun Seo loves bananas! Let's make a "new" dataframe and change the ratings accordingly:

In [0]:
junseos_fruits = carlos_fruits
junseos_fruits['rating'] = [3, 4, 9999, 5]
junseos_fruits

And taking a look back at Carlo's fruits:

In [0]:
carlos_fruits

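Wait, Carlo's banana rating shouldn't be that high! What happened is that `junseos_fruits = carlos_fruits` did not create a new dataframe: both names refer to the same dataframe, so a change made through one name shows up under the other. Then did our shenanigans affect the original fruit_info dataframe too?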

In [0]:
fruit_info

No, because when we called `carlos_fruits = fruit_info.copy()`, we asked pandas to forcibly create a brand new dataframe with identical values instead.
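
The difference can be checked directly with Python's `is` operator. This is a minimal sketch on a made-up table; `original`, `alias`, and `independent` are hypothetical names.

```python
import pandas as pd

# Hypothetical table, separate from fruit_info.
original = pd.DataFrame({'fruit': ['apple', 'banana'], 'rating': [3, 4]})

alias = original                 # same object, just a second name
independent = original.copy()    # new object with the same values

alias['rating'] = [1, 1]         # also changes `original`

print(alias is original)         # True
print(independent is original)   # False
print(original['rating'].tolist())
print(independent['rating'].tolist())
```

When in doubt about whether two variables share data, calling `.copy()` explicitly is the safe choice.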

### SettingWithCopyWarning

This is arguably one of the most frustrating warnings you will see while using pandas. 
TL;DR: Use .loc instead of square brackets to index into data.

Let's say Jun Seo strongly dislikes apples.

In [0]:
junseos_fruits[junseos_fruits['fruit'] == 'apple']

In [0]:
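# chained indexing: the assignment can land on a temporary copy,
# leaving junseos_fruits itself unchanged
junseos_fruits[junseos_fruits['fruit'] == 'apple']['rating'] = -100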
junseos_fruits

In [0]:
junseos_fruits['rating']

In [0]:
junseos_fruits['rating'][0] = -100
junseos_fruits

In [0]:
# .loc picks the row and the column in one step, so the assignment reaches junseos_fruits
junseos_fruits.loc[1, 'rating'] = 1738
junseos_fruits
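
The same `.loc` pattern also works with a condition in place of a row number. Here is a sketch on a made-up table (`ratings` is hypothetical and separate from `junseos_fruits`).

```python
import pandas as pd

# Hypothetical table, separate from junseos_fruits.
ratings = pd.DataFrame({'fruit': ['apple', 'orange', 'banana'],
                        'rating': [3, 4, 5]})

# One .loc call: rows chosen by the boolean condition, column chosen by name.
ratings.loc[ratings['fruit'] == 'apple', 'rating'] = -100
print(ratings)
```

Because the row selection and the column selection happen in a single call, pandas knows the assignment is meant for the original table.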

## [optional] Group By

We won't have time to go through this thoroughly in lab. However, we encourage you to look into this material if you want to go further. Feel free to ask us any questions!

In the previous section we calculated the number of baby names registered in 2015.

In [0]:
np.sum(baby_names[baby_names['Year'] == 2015]['Count'])

There are 107 years, though. If we wanted to know how many babies were born in California in each year, we would need something more efficient.

`groupby` to the rescue!

Groupby allows us to split our table into groups, each group having one similarity.

For example if we group by "Year" we would create 107 groups because there are 107 unique years.


<center>`baby_names.groupby('Year')`</center>


Now we have 107 groups but what do we do with them? We can apply the function `sum` to each group. This will sum the other numerical column, 'Count', which reduces each group to a single row: Year and sum.

Excellent tutorial: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

In [0]:
# this will apply sum to the "Count" column of each year group
# (selecting [['Count']] keeps text columns such as "Name" out of the sum)
yearly_data = baby_names.groupby('Year')[['Count']].sum()
yearly_data.head(5)
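
If the split-then-sum idea feels abstract, here is a sketch on a tiny made-up table where the result can be checked by hand. The `toy` table is hypothetical; it only borrows the column names used above.

```python
import pandas as pd

# Hypothetical miniature version of a names table.
toy = pd.DataFrame({'Year':  [2000, 2000, 2001, 2001, 2001],
                    'Name':  ['Ann', 'Bo', 'Cy', 'Ann', 'Bo'],
                    'Count': [10, 20, 30, 40, 50]})

# Split the rows by Year, then add up Count within each group.
per_year = toy.groupby('Year')['Count'].sum()
print(per_year)
```

The year 2000 group adds 10 + 20, and the 2001 group adds 30 + 40 + 50, giving one number per year.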


# <font id="2.5" color="blue">Plot with Pandas</font>

In [0]:
# %matplotlib inline

[Pandas.plot documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)

Pandas comes with a built-in `plot` method that can be very useful! `pandas.plot` actually uses `matplotlib` behind the scenes!

`yearly_data` contains the number of registered babies per year.

In [0]:
yearly_data.head()

## Line Graphs

In [0]:
yearly_data.plot(kind="line")  #kind='line' is optional

### Study: Name History

In [0]:
# don't worry about this function unless you want to learn about groupby
def your_name_history(name):
    # sum only the "Count" column so text columns are left out
    return baby_names[baby_names['Name'] == name].groupby('Year')[['Count']].sum()

In [0]:
table = your_name_history('John')

table.plot()

## Bar Graphs

We can modify our data before we graph it to analyze different things.

In [0]:
yearly_data.plot(kind="bar")
plt.axis('off')

<font color="blue">Class Exercise:</font> How could we graph only the 15 years after World War II (i.e. 1945-1960)?

Hint: create a table with only the desired years first

In [0]:
modified = yearly_data.loc[...]

modified.plot(...)

# <font color="blue" id="3">Matplotlib</font>


## Line Graphs
Use [`plt.plot()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html) to create line graphs! The required arguments are a list of x-values and a list of y-values.

In [0]:
np.random.seed(42) # To ensure that the random number generation is always the same
plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
plt.show()

In [0]:
%matplotlib inline

plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
# plt.show() no longer required

## Histograms
To explore other types of charts, let's load in a built-in dataset from Seaborn and first take a quick peek:

In [0]:
tips = sns.load_dataset('tips')
tips.head()

Histograms can be plotted in matplotlib using [`plt.hist()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html).
It takes one required argument: the values to count along the x-axis.

In [0]:
plt.hist(tips['total_bill'])
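
To get a feel for what a histogram computes before anything is drawn, here is a sketch using `np.histogram` on made-up numbers. The `values` array and the choice of 3 bins are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data standing in for a numeric column.
values = np.array([1, 2, 2, 3, 3, 3, 8, 9, 10])

# Split the range [min, max] into 3 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=3)
print(edges)   # 4 edges define 3 bins
print(counts)
```

`plt.hist` accepts a `bins` argument as well, so you can control how coarse or fine the bars are.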

## Scatterplots
Scatterplots can be made using [`plt.scatter()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html). It takes in two arguments: x-values and y-values.

In [0]:
plt.scatter(tips['total_bill'], tips['tip'])

In [0]:
plt.scatter(tips['total_bill'], tips['tip'])
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount')

In [0]:
plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

# Points with smoker == 'Yes'
plt.scatter(x=tips.loc[tips['smoker'] == 'Yes', 'total_bill'], 
            y=tips.loc[tips['smoker'] == 'Yes', 'tip'],
            label='Smoker', alpha=0.6)

# Points with smoker == 'No'
plt.scatter(x=tips.loc[tips['smoker'] == 'No', 'total_bill'], 
            y=tips.loc[tips['smoker'] == 'No', 'tip'],
            label='Non-Smoker', alpha=0.6)

plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount (by Smoking Habits)')
plt.legend()

## Exercises in Matplotlib
We'll do the exercises using a famous dataset: [the iris dataset](https://archive.ics.uci.edu/ml/datasets/iris).
First, let's load it in and take a look:

In [0]:
iris = sns.load_dataset('iris')
iris.head()

![Diagram labeling the petals and sepals of a flower](https://www.wpclipart.com/plants/diagrams/plant_parts/petal_sepal_label.png)

Let's also take a look at the different species:

In [0]:
iris['species'].unique()

<font color="blue">Exercise:</font> Create a basic scatterplot of the petal lengths versus the petal widths. Label your axes (use the documentation linked above to make them meaningful)!

In [0]:
...

<font color="blue">Exercise:</font> This time, create the same scatterplot, but assign a different color for each flower species.

In [0]:
plt.scatter(...)
plt.scatter(...)
plt.scatter(...)
plt.legend()

In [0]:
def plot_by_species(species, x, y):
    plt.scatter(x=iris.loc[iris['species'] == species, x],
                y=iris.loc[iris['species'] == species, y],
                label=species)

for species in iris['species'].unique():
    plot_by_species(species, 'sepal_length', 'sepal_width')

plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Sepal Width (by Species)')
plt.legend()

# <font id="4" color="blue"> Seaborn</font>

## Histogram
Back to the tips dataset to explore seaborn! First off is seaborn's take on the histogram, [`sns.distplot()`](https://seaborn.pydata.org/generated/seaborn.distplot.html#seaborn.distplot). By default, it shows a relative distribution and overlays a *kernel density estimate*; if you would like seaborn to just show a plain histogram, you can add the argument `kde=False`. Note that `distplot` is deprecated as of seaborn 0.11, where `sns.histplot()` and `sns.displot()` are the recommended replacements.

In [0]:
plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1)
sns.distplot(tips['total_bill'])

plt.subplot(1, 2, 2)
sns.distplot(tips['total_bill'], kde=False)

## Scatterplot
To create a scatterplot using seaborn, you can use [`sns.lmplot()`](https://seaborn.pydata.org/generated/seaborn.lmplot.html#seaborn.lmplot). It'll take x-values and y-values, and overlay a least-squares regression line with a shaded 95% confidence band.

Note: You can use pandas indexing, but check out the fancy ability to refer to columns by their names instead.

In [0]:
sns.lmplot(x='total_bill', y='tip', data=tips)

Let's do that same plot from earlier, where we colored the points by smoker. It's a lot easier in seaborn, since we only need to pass in an additional argument of `hue`:

In [0]:
sns.lmplot(x='total_bill', y='tip', hue='smoker', data=tips)

Cool. Do smokers' and non-smokers' generosities differ by time of day (lunch vs. dinner)? Let's try out the `row` and `col` (column) arguments:

In [0]:
sns.lmplot(x='total_bill', y='tip', row='time', col='smoker', data=tips)

## Seaborn Exercises
<font color="blue">Exercise:</font> Your turn! Create a histogram of the petal widths in the `iris` dataset.

In [0]:
...

<font color="blue">Exercise:</font> Now try to create a scatterplot of petal lengths versus petal widths, and color the points based on the species of flowers. Feel free to turn off the regression line using `fit_reg=False`.

In [0]:
...

That's the end of our workshop.

We hope you learned something. Keep this notebook handy for reference later!

## Hope to see you at our workshop next week: Python Modeling

<img width="120" src="https://dss.berkeley.edu/static/img/logo.jpg"/>