# [HIST 160] Introduction to Python & Graphing!

In this exercise, we will go over how to use Python and Matplotlib to generate graphs!

## Table of Contents
1 - [Python & Math](#python)<br>
2 - [Variables](#var)<br>
3 - [Functions & For Loops](#func)<br>
4 - [Data Structures](#data)<br>
5 - [Graphing Demographics](#graph)<br>
6 - [Other Graphs](#other)<br>


**Importing Dependencies:**

What are Dependencies? Dependencies in Python are packages that add extra functionality to Python, besides the language itself! For example, today we will be using a package called matplotlib to create beautiful graphs. In order to access its abilities, we import it here:

In [None]:
import math
import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

----

## Section 1: Python - Math  <a id='python'></a>

Python is the main programming language we'll use in the lecture. Although this lecture should give you the basics of how to use Python, please feel free to review one or more of the following materials on your own time to learn more about Python.

- **[Python Tutorial](https://docs.python.org/3.5/tutorial/)**: Introduction to Python from the creators of Python
- **[Composing Programs Chapter 1](http://composingprograms.com/pages/11-getting-started.html)**: This is more of an introduction to programming with Python, from CS 61A

<br>
**Mathematical Expressions**

In Python, we: 
- Add by using the '+' sign
- Subtract by using the '-' sign
- Multiply by using the '*' sign
- Exponentiate by using the '**' sign
- Divide by using the '/' sign
- Floor divide by using the '//' sign (8 // 3 = 2, 9 // 5 = 1)
- Take the remainder / modulus by using the '%' sign
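
Here is a small sketch of the operators from the list above in action. The numbers 7 and 2 are just picked for illustration, and the variable names are our own:

```python
# Each line uses one of the operators from the list above.
total = 7 + 2            # addition
difference = 7 - 2       # subtraction
product = 7 * 2          # multiplication
power = 7 ** 2           # exponentiation (7 squared)
quotient = 7 / 2         # division always gives a decimal number
floor_quotient = 7 // 2  # floor division drops the fractional part
remainder = 7 % 2        # what is left over after floor division

print(total, difference, product, power, quotient, floor_quotient, remainder)
```

Notice that `/` gives `3.5` while `//` gives `3`.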

*Exercise:* Take the product of three and three to the power of six, and subtract 168.

In [None]:
...

---
## Section 2: Variables  <a id='var'></a>
A variable is a name that refers to a value. In Python, we create a variable by assigning a value to a name with the `=` sign. Here are a few examples of variables and their assignments:

In [None]:
x = 2
y = 3
ab = "Hi!"

**Output and Printing**

Returning and printing are two different things: 
- Return: The value is handed back to our program. It is not necessarily printed, but it is stored away inside the computer if we bind it to a name!
- Printing: A value pops up on our screen!

We print using the **print** function and return a value from a function using the **return** statement.

Here is a good point for us to take a break, and talk about calling functions. In Python, we have numerous functions, like: 
- print: print(SOMETHING) will print out the SOMETHING to our screen
- sum: sum(VALUES) will sum up a lot of values together
- And more!

The most beautiful aspect about functions, actually, is that in Python, we can make our own functions, for anything we need to. We will discuss more about this a little later. 

For now, the most important thing to remember is that, to call a function, we write the function's name, like "print", and put parentheses after it, like "print()". Then, we put our *arguments* inside the parentheses, like "print('Hi!')". Arguments are the values we call the function with: for example, the 'SOMETHING' in our print function, or the 'VALUES' in our sum function.
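
To see the difference between returning and printing, here is a minimal sketch using the two functions named above. The list `values` and the names `total` and `result` are made up for this example:

```python
values = [1, 2, 3]

# sum() returns a value. Binding it to a name stores it for later use,
# but nothing shows up on the screen yet.
total = sum(values)

# print() shows its argument on the screen. It does not hand a useful
# value back to us: what it returns is the special value None.
result = print(total)
```

After running this, `total` holds the number we can keep working with, while `result` holds `None`.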

*Exercise:* Print the words 'Hello World!'

In [None]:
...

---
## Section 3: Functions, For Loops  <a id='func'></a>
A function is a block of organized, reusable code that is used to perform a single, related action. 

There are built-in Python functions, but you can also create your own! If I wanted to print, return, or do a mathematical operation without having to do it manually every time, I could just create a function and call it!

So, how do I create them? We create functions using `def`, with the following structure: 

`def function_name(arguments):
    [function procedures]
    return [output]`

How do I call them after making them? We first write the function name, then put the arguments in parentheses after it, like so: 


<center>`function_name(arguments)`</center>

In [None]:
def sum_multiply(a,b,c):
    """ Adds 'a' and 'b * c' """
    # YOUR CODE HERE


In [None]:
sum_multiply(1, 2, 3)
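
For one more illustration of `def`, `return`, and calling a function, here is a small sketch. The function `rate_per_thousand` and the numbers passed to it are hypothetical and only meant to show the pattern:

```python
def rate_per_thousand(events, population):
    """Return the number of events per 1,000 people."""
    rate = 1000 * events / population
    return rate

# Call the function and bind the returned value to a name.
example_rate = rate_per_thousand(150, 5000)
print(example_rate)
```

Because the function returns its result instead of printing it, we can store the result in `example_rate` and reuse it later.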

Something that also comes in very handy is the "for loop", which repeats a block of code once for each element in a collection. 

Say I have a list of [1, 2, 3, 4, 5] and I want to add one to each element. Doing this by hand, one element at a time, is **daunting**. But fear not, the **for loop** is here to save your day! 

Let's walk through an example together: 

In [None]:
lst = [1, 2, 3, 4, 5]

new_lst = []

for elem in lst: 
    new_elem = elem + 1
    new_lst.append(new_elem)

In [None]:
new_lst
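
A for loop can also build up a single value instead of a new list. As a rough sketch (the name `running_total` is our own), here is one way to add up every element by hand:

```python
lst = [1, 2, 3, 4, 5]

running_total = 0

# Visit each element once and add it to the total so far.
for elem in lst:
    running_total = running_total + elem

print(running_total)
```

This gives the same answer as calling `sum(lst)`, but it shows what happens on each pass through the loop.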

--- 
## Section 4: Data Structures <a id='data'></a>

**List:** Lists are basic data structures that store an ordered sequence of values. 

To define a list, assign `[]` or `list()` to a variable. We can index into a list by using `lst[i]` (positions start at 0), add one element to the end by calling `.append()`, or add several elements at once by calling `.extend()`.

Here is how to use lists: 

In [None]:
names = ['One', 'Two', 'Three', 'Four', 'Five']

In [None]:
names[0]

In [None]:
names[1 : 4]
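
The cells above cover indexing and slicing. Here is a short sketch of `.append()` and `.extend()` as well, using a shorter made-up list called `short_names` so we do not change `names`:

```python
short_names = ['One', 'Two', 'Three']

short_names.append('Four')           # add a single element to the end
short_names.extend(['Five', 'Six'])  # add several elements at once

first = short_names[0]     # positions start at 0
middle = short_names[1:4]  # positions 1, 2 and 3 (the end position is left out)
last = short_names[-1]     # negative positions count from the end

print(first, middle, last)
```

Note that a slice like `[1:4]` stops just before position 4.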

**Dictionary:** Dictionaries work more like a small database: each value is stored under a key that you define (often a string), and you look the value up by that key instead of by position.

To define a dictionary, assign `{}` or `dict()` to a variable.

In [None]:
number_dict = {'One' : 1,
               'Two' : 2, 
               'Three' : 3,
               'Four' : 4,
               'Five' : 5}

In [None]:
number_dict['One']

In [None]:
number_dict['Five']
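
Besides looking values up, we can add new entries and loop over a dictionary. A minimal sketch, assuming a smaller made-up dictionary called `small_dict`:

```python
small_dict = {'One': 1, 'Two': 2}

# Assigning to a new key adds an entry.
small_dict['Three'] = 3

# 'in' checks whether a key exists, without raising an error.
has_four = 'Four' in small_dict

# .items() gives each key together with its value.
for name, value in small_dict.items():
    print(name, value)
```

Checking with `in` first is handy, because asking for a key that does not exist (like `small_dict['Four']`) raises a `KeyError`.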

--- 
## Section 5: Graphing Demographics <a id='graph'></a>

![title](image.png)

In class we discussed the demographic transition - high birth and death rates give way to low birth and death rates, a pattern visually depicted above, and critical to our understanding of welfare programs, migration, inequality, and, really, practically any question concerning labor. Now we can go on to produce our very own display of this transition, by incorporating demographic data from a variety of countries.

--- 
## Section 6: Graph for Sweden Data <a id='graph'></a>

Now, let's get down to business and draw beautiful graphs, starting with the Sweden data. We can then move on to drawing graphs for everything else soon!

### Objective:
We discussed the demographic transition - high birth and death rates give way to low birth and death rates, a pattern visually depicted above. 

Our objective is to draw a graph like the one above, using a Python package called `matplotlib`, given a dataset containing the information. What does it mean for us? Instead of reading through columns of numbers, we will be able to see how birth rates and death rates have evolved over time. 

### Introduction to Graphing in Python
Graphing in Python is done using a package called `matplotlib`. 

`matplotlib` has a few cool features. We imported its plotting module as `plt`, so we call them like below: 
1. `plt.plot(x, y)` : This will plot out the graph of x and y data. 
2. `plt.xlabel(name)` : This will give the x-axis a label
3. `plt.ylabel(name)` : This will give the y-axis a label
4. `plt.title(name)` : This will give the title!
5. `plt.show()` : This will allow us to "see" what we plotted

In [None]:
plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
plt.title("Simple Graph!")
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.show()

#### 1. Using Pandas to Observe Data

Firstly, we load the Sweden data into a Pandas DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

*What does this do?* It takes the data from our Comma-Separated Values - CSV - file (think Excel files), and converts it into a format that Pandas, our primary data manipulation library, can read. 

In [None]:
sweden = pd.read_csv('data/sweden.csv')

Now, let us observe the first few rows of the data. To do this, we will use the `.head()` method.

In [None]:
sweden.head()

**Interesting!** It looks like we have Year, Population, Live Births, Deaths (and more!) columns. 

#### 2. Use Matplotlib to Draw *Beautiful* Graphs
Our goal is to create a graph that looks at the demographic changes in Sweden. The way we create graphs in Python is to also use a package: `Matplotlib`, which creates professional-looking graphs! 

*(There are more packages out there, from Plotly to Seaborn, which also create beautiful graphs. But, the easiest and the most "core" tool that data scientists use is Matplotlib, so we will work mostly with that in this session.)*

The question, then, is what should we draw a graph of? Should we draw a graph of: 
1. Population changes by year?
2. Percentage population changes by year? 
3. Absolute population changes by year?

We will start off with the easiest - population changes by year - and build our way up. 

If I tell you to draw a graph on Microsoft Excel, with year and population, what would you do? Our first instinct is to drag the columns, and click on the "Draw Graph" button. In Python, we have the power of selecting everything we want. 

Let's start by selecting what goes on the x-axis and y-axis! We can select a column by using `sweden['column name']` and then use `.values` to grab all the values of the column as an array.

In [None]:
x_axis = sweden['Year'].values

In [None]:
y_axis = sweden['Population'].values

In [None]:
plt.plot(x_axis, y_axis)
plt.show()

Oh no! What has happened there? Although our code should create a beautiful graph, it looks like something is not working right. 

Take a look back to when we called `.head()`. We need **integer** values to create our graph, but taking a cursory look at the data, is this correct? 

*Hint: It is not!*

Let's try to fix this issue then. We can re-import the data with one of the options that Pandas provides: `pd.read_csv(file, thousands=',')`. This tells Pandas to treat commas inside numbers as thousands separators.

In [None]:
sweden = pd.read_csv('data/sweden.csv', thousands=',')

We take a look at the data again. Does it look cleaner now? 

In [None]:
sweden.head()
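
To see what `thousands=','` changes, here is a small self-contained sketch. It does not use the real Sweden file: the two rows in `csv_text` are made up, and we read them from a string with `io.StringIO` so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical two-row file, written in the same style as our data:
# the population figures contain commas and are wrapped in quotes.
csv_text = 'Year,Population\n1750,"1,780,678"\n1751,"1,802,132"\n'

# Without the option, the Population column is read as text.
raw = pd.read_csv(io.StringIO(csv_text))

# With the option, the commas are removed and we get whole numbers.
clean = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(raw['Population'].dtype)    # a text type, not a number type
print(clean['Population'].dtype)  # an integer type
```

Checking `.dtype` like this is a quick way to confirm that a column really holds numbers before trying to plot it.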

In [None]:
x_axis = sweden['Year']

In [None]:
y_axis = sweden['Population']

In [None]:
plt.plot(x_axis, y_axis)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Population of Sweden by Year')
plt.grid(True)

plt.show()

We now define new variables for the birth and death rates per 1,000 people in Sweden by year. And, why don't we make their names a little more descriptive too?

In [None]:
birth_per_pop = 1000 * sweden['LiveBirths'] / y_axis
death_per_pop = 1000 * sweden['Deaths'] / y_axis

In [None]:
# Go bears for gold and blue!
plt.plot(x_axis, birth_per_pop, color='gold')
plt.plot(x_axis, death_per_pop, color='blue')
plt.xlabel('Year')
plt.ylabel('Birth/Death Rate')
plt.title('Birth/Death Rate of Sweden by Year')
plt.grid(True)

plt.show()
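
With two lines on one graph, it can be hard to remember which color is which. One option is to give each line a `label` and then call `plt.legend()`. Here is a minimal sketch, assuming a few made-up rates (these are not the real Sweden numbers):

```python
import matplotlib.pyplot as plt

# Made-up values, only to have something to draw.
years = [1900, 1950, 2000]
birth_rate = [27.0, 16.5, 10.2]
death_rate = [16.8, 10.0, 10.5]

# label= gives each line a name that the legend will display.
plt.plot(years, birth_rate, color='gold', label='Births per 1,000')
plt.plot(years, death_rate, color='blue', label='Deaths per 1,000')
plt.xlabel('Year')
plt.ylabel('Rate per 1,000 people')

# Draw a legend box built from the labels above.
legend = plt.legend()
plt.show()
```

The same two additions (`label=...` on each `plt.plot` call and one `plt.legend()` before `plt.show()`) should work for the graphs in the rest of this notebook.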

---
## Section 7: Other Graphs <a id='other'></a>

Now, let's use what we did for Sweden to draw graphs for everything else!

In [None]:
England_Wales = pd.read_csv('data/England_Wales.csv',  thousands=',')
Chile = pd.read_csv('data/Chile.csv',  thousands=',')
Russia = pd.read_csv('data/Russia.csv',  thousands=',')
Mauritius = pd.read_csv('data/Mauritius.csv', thousands=',')

First, we check what our data roughly looks like!

In [None]:
Chile.head()

We are also working with **real** data here. This means that a lot of the data we have is going to, unfortunately, be a little dirty, so we need to go through the process of finding and fixing the problems. For example, we had to tell Pandas that commas are thousands separators, or else the program would have read the numbers as strings!

Another issue that we face is that the column names may not necessarily be clean. Guess what - we have a method to find out what the column names are exactly encoded as!

In [None]:
Chile.columns.tolist()
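
Once we know the exact names, one option is to type them out, stray characters and all. Another option is to clean the names first. Here is a sketch of the second approach, assuming a small made-up DataFrame `messy` whose column names carry stray `\r` characters:

```python
import pandas as pd

# Hypothetical data with untidy column names.
messy = pd.DataFrame({'Year\r': [1900, 1901],
                      '\rPopulation\r': [100, 110]})

# .str.strip() removes whitespace (including '\r') from both ends of each name.
messy.columns = messy.columns.str.strip()

print(messy.columns.tolist())
```

After this, the columns can be selected with the clean names, like `messy['Year']`. The cells below keep the original names instead, so either approach works.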

Now, let's go through our process of finding the x and y-axis values, and also make them a little more descriptive this time!

In [None]:
year = Chile['Year\r'].astype(int)
pop = Chile['Population\r'].astype(int)
birth_per_pop = 10000 * Chile['LiveBirths\r'] / pop
death_per_pop = 10000 * Chile['Deaths'] / pop

In [None]:
# Go bears for gold and blue!
plt.plot(year, birth_per_pop, color='gold')
plt.plot(year, death_per_pop, color='blue')
plt.xlabel('Year')
plt.ylabel('Birth/Death Rate')

plt.title('Birth/Death Rate of Chile by Year')
plt.grid(True)

plt.show()

---
### Exercise: 
Let's now do the same for all the other countries: Russia and Mauritius

In [None]:
Russia = Russia.dropna()
Russia.columns.tolist()

In [None]:
Russia.head()
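
What does `.dropna()` do? It throws away every row that has a missing value. Here is a small sketch on a made-up DataFrame called `gaps` (not the real Russia data), followed by the `.astype(int)` conversion we use in the next cell:

```python
import numpy as np
import pandas as pd

# Hypothetical data: np.nan marks a missing value.
gaps = pd.DataFrame({'Year': [1990.0, 1991.0, np.nan],
                     'Population': [148.0, np.nan, np.nan]})

# Keep only the rows where nothing is missing.
cleaned = gaps.dropna()

# With the gaps removed, the years can be converted to whole numbers.
year_example = cleaned['Year'].astype(int)

print(len(gaps), len(cleaned))
```

Converting a column that still contains missing values with `.astype(int)` raises an error, which is why we drop those rows first.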

In [None]:
year = Russia['Year\r'].astype(int)
pop = Russia['\rPopulation\r'].astype(int)
birth_per_pop = 100 * Russia['LiveBirths'] / pop
death_per_pop = 100 * Russia['Deaths\r'] / pop

In [None]:
plt.plot(year, birth_per_pop, color='gold')
plt.plot(year, death_per_pop, color='blue')
plt.xlabel('Year')
plt.ylabel('Birth/Death Rate')
plt.title('Birth/Death Rate of Russia by Year')
plt.grid(True)

plt.show()

### Mauritius Data

In [None]:
Mauritius = Mauritius.dropna()

In [None]:
year = Mauritius['Year\r'].astype(int)
pop = Mauritius['Population\r'].astype(int)
birth_per_pop = 1000 * Mauritius['LiveBirths\r'].astype(float) / pop
death_per_pop = 1000 * Mauritius['Deaths\r'].astype(float) / pop

In [None]:
plt.plot(year, birth_per_pop, color='gold')
plt.plot(year, death_per_pop, color='blue')
plt.xlabel('Year')
plt.ylabel('Birth/Death Rate')
plt.title('Birth/Death Rate of Mauritius by Year')
plt.grid(True)

plt.show()

### [Advanced] England and Wales Data

This is very interesting because we are now going to work with **merging** datasets together, which is a truly powerful technique when we work with large datasets! 

First, as usual, let's import our datasets!

In [None]:
EW_Year = pd.read_csv('data/england_wales_year.csv',  thousands=',')
EW_Rates =  pd.read_csv('data/england_wales_rates.csv',  thousands=',')

Let's have a look at what each dataset contains. 

In [None]:
EW_Year

In [None]:
EW_Rates

Uh-oh! Do you notice the issue here? The two datasets have different start and end dates! Yikes!

To fix this, we are going to merge the two datasets together, keeping only the years that appear in both. This means we will end up with a smaller dataset than we would like, but we cannot reproduce the graph for years where part of the data is missing. 

Let's now merge the two datasets together.

In [None]:
England_Wales = EW_Year.merge(EW_Rates, on='Year', how='inner')
England_Wales
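
To see why the merged dataset is smaller, here is a minimal sketch of an inner merge on two tiny made-up tables. The names `left` and `right` and all the numbers are hypothetical:

```python
import pandas as pd

# Two small tables whose years only partly overlap.
left = pd.DataFrame({'Year': [1900, 1901, 1902],
                     'UK': [10, 11, 12]})
right = pd.DataFrame({'Year': [1901, 1902, 1903],
                      'BirthRate': [28.5, 28.1, 27.9]})

# how='inner' keeps only the years found in BOTH tables.
merged = left.merge(right, on='Year', how='inner')

print(merged)
```

The years 1900 and 1903 each appear in only one table, so they are dropped, and the result has one row for 1901 and one for 1902.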

Much better. Our dataset has reduced in size, but we have successfully converted the data into a form we can work with. Let's now use our previous skills to draw beautiful graphs for them. 

First, let's grab the column names. 

In [None]:
England_Wales.columns.tolist()

In [None]:
x_axis = England_Wales['Year'].astype(int)
y_axis = England_Wales['UK'].astype(int)
y_new = England_Wales['BirthRate']
y_new2 = England_Wales['DeathRate']

In [None]:
# Go bears for gold and blue!
plt.plot(x_axis, y_new, color='gold')
plt.plot(x_axis, y_new2, color='blue')
plt.xlabel('Year')
plt.ylabel('Birth/Death Rate')
plt.title('Birth/Death Rate of England/Wales by Year')
plt.grid(True)

plt.show()

In [None]:
England_Wales.describe()

In [None]:
England_Wales.mean()

In [None]:
England_Wales.std()

In [None]:
England_Wales.loc[England_Wales['BirthRate'].idxmax()]

In [None]:
England_Wales.loc[England_Wales['BirthRate'].idxmin()]
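
The last two cells find the rows with the highest and lowest birth rate. Here is a sketch of how that works in two steps, using a small made-up DataFrame called `rates`:

```python
import pandas as pd

# Hypothetical data, only to show the pattern.
rates = pd.DataFrame({'Year': [1900, 1901, 1902],
                      'BirthRate': [28.7, 28.5, 29.1]})

# Step 1: idxmax() gives the row label where the column is largest.
peak_label = rates['BirthRate'].idxmax()

# Step 2: .loc[label] pulls out that whole row.
peak_row = rates.loc[peak_label]

print(peak_label)
print(peak_row)
```

Swapping `idxmax()` for `idxmin()` finds the row with the smallest value in the same way.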

---
## What's Next?

Please help us understand your views of this module with this short survey: 
https://docs.google.com/forms/d/e/1FAIpQLSe54U3E64kYFWwQHSUpAvWYMuJOdKzbHDZjPa3nMUlHSSs0PQ/viewform

Data science is a fast-growing field with applications in almost every subject you can imagine. Students and researchers alike have used Jupyter notebooks and data-driven methods to do everything from completing a lower-division class problem set to presenting a graduate research project. 

If you'd like to learn more about how to incorporate data science into your academic career:

* [DATA-8](http://data8.org) is offered every semester and is a great introduction to coding and statistics. The website includes links to the textbook, syllabi, and past homeworks.
* Data Science [Connector Courses](https://data.berkeley.edu/education/connectors) teach applied data science in everything from literature to cancer research. They can be taken with or after DATA-8.
* The Berkeley Institute for Data Science ([BIDS](https://bids.berkeley.edu/)) hosts data science talks, research resources organized by field, and office hours for those interested in more in-depth data science research.
* [DLAB](http://dlab.berkeley.edu/) also offers workshops and consulting to help you hone your skills.