<img src="../dsi.png" style="height:128px;">

# Lesson 1: Why Data Science?

Welcome to the interactive Jupyter Notebook-based component of Data Science for India! Each week, we're going to practice what we covered in the worksheets with *programs* on the computer. We talked a lot about how computers are useful when working with data, so let's close out today's session with an example.

Today we're going to look at some data, and how we can *visualize* it! Along the way, we'll learn about Jupyter Notebooks and how to use them. In this notebook, you will also see some lines of *code*. Code is something that a machine can understand and *interpret*, or make sense of. 

Just like you know how to speak languages like English, Hindi, and many more, you'll soon know how to write code in a language called Python. Python is just one of the many languages computers can understand! For today, you won't need to write any code unless you are curious and want to play around (go ahead!). Instead, just look at a few of the cool things we can do with data science.

Run each "cell", or block of code, by pressing the keys "Shift" and "Enter" at the same time. 

In [None]:
# Below we have `import` statements, which help us set up the notebook! 
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

Some lines of code produce a value. In a Jupyter notebook, the value of the last line in a cell is displayed right below that cell.

Here, we're going to read a *table* of data about the quality of different bodies of water all over India. You might recognize some of these rivers and lakes. In fact, you might have even been there!


Each *row* (running across the table) represents a location where measurements were taken, and each *column* (running down the table) shows a different type of measurement. You might even recognize some of them from chemistry class.

In [None]:
water = Table.read_table("water_quality.csv")
# Data from: https://data.gov.in/catalog/status-water-quality-india-2008-and-2011
water

Of course, the Jupyter notebook won't show us *all* the data points. Let's see how many locations were part of the survey. 

To give us a little bit more information, let's look at the states where data was taken.

In [None]:
print("Data was taken at " + str(water.num_rows) + " locations" + " in " +
      str(water.group("State Name").num_rows) + " states.")
water.group("State Name")

Let's see what we can learn about the water temperature in each state. Because there were multiple sites in each state, we need a single value that represents the whole state.
So, we'll take the *average* of each measurement across a state's sites. The *average*, or *mean*, is the sum of a group of numbers divided by how many numbers there are; it represents the center of the group. We can then *visualize* these averages with a *bar graph*.
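
As a small sketch of what "taking the average for each state" means, the example below uses made-up temperature readings and plain Python rather than the `datascience` library. The state names and numbers are hypothetical; they are not values from `water_quality.csv`.

```python
# Hypothetical readings: (state, site temperature in degrees Celsius).
# These numbers are made up for illustration only.
readings = [
    ("State A", 20.0),
    ("State A", 24.0),
    ("State B", 18.0),
    ("State B", 19.0),
    ("State B", 23.0),
]

def average(values):
    """Return the mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Collect the readings that belong to each state.
by_state = {}
for state, temp in readings:
    by_state.setdefault(state, []).append(temp)

# One representative number per state.
state_averages = {state: average(temps) for state, temps in by_state.items()}
print(state_averages)
```

Each state ends up with one number, no matter how many sites it had.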

In [None]:
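# Keep only the rows whose average temperature (column 6) is not missing.
# A missing value (NaN) is never equal to itself, so `x == x` is False for it.
grouped = water.where(water.column(6), lambda x : x == x).group("State Name", np.average)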
temperatures = Table().with_columns("State", grouped.column(0),
                                   "Average of Minimum Site Temperatures", grouped.column(4),
                                   "Average of Maximum Site Temperatures", grouped.column(5),
                                   "Average of Averaged Site Temperatures", grouped.column(6))
temperatures.barh("State", "Average of Averaged Site Temperatures")

The table also contains information on the amount of coliform bacteria detected in the water. This is extremely important to know, because it shows which sites have more germs than others, and therefore where you're more likely to get sick from using the water without purifying it first. It also helps doctors and public health officials understand which diseases are being caused by contaminated water.

This table shows reports of the cases for certain diseases by state.

In [None]:
disease = Table.read_table("diseases.csv")
#Source: https://data.gov.in/keywords/acute-diarrhoeal-diseases
disease

Now, we'll need to analyze the data from both tables, so we'll do the following: *group* bacterial content by state (find the average), and *join* the two tables.
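
To make *group* and *join* concrete, here is a minimal plain-Python sketch with made-up numbers. The state names, bacteria counts, and case counts are all hypothetical, and the notebook itself does these steps with the `datascience` table methods `group` and `join` instead.

```python
# Hypothetical site measurements: (state, bacteria count). Made-up values.
sites = [("State A", 100), ("State A", 300), ("State B", 50)]
# Hypothetical disease reports: state -> number of cases. Made-up values.
cases = {"State A": 40, "State B": 10, "State C": 5}

# Group: gather all measurements from the same state, then average them.
measurements_by_state = {}
for state, count in sites:
    measurements_by_state.setdefault(state, []).append(count)
averages = {
    state: sum(values) / len(values)
    for state, values in measurements_by_state.items()
}

# Join: keep only the states that appear in both tables, and put
# their values side by side in one row.
joined = [
    (state, averages[state], cases[state])
    for state in averages
    if state in cases
]
print(joined)
```

In this sketch, "State C" has disease reports but no water measurements, so it does not appear in the joined result.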

In [None]:
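def clean(tbl, col_1, col_2):
    """Return tbl without the rows where col_1 or col_2 is missing (NaN)."""
    # A missing value (NaN) is never equal to itself, so `x == x` filters it out.
    cleaned_col_1 = tbl.where(tbl.column(col_1), lambda x : x == x)
    return cleaned_col_1.where(cleaned_col_1.column(col_2), lambda x: x == x)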


cleaned = clean(water, 24, 27)
bacteria = Table().with_columns("State Name", cleaned.column(3),
                    "Fecal Coliform", cleaned.column(24),
                    "Total Coliform", cleaned.column(27))
grouped = bacteria.group("State Name", np.average)
grouped

In [None]:
joined_avgs = grouped.join("State Name", disease, "State/UTs")
joined_avgs

The code below helps us make a scatter plot that compares the reported cases of each disease with the amount of bacteria in the water.

In [None]:
def make_plot(ax, tbl, xcol, ycol):
    if type(ycol) == int:
        lbl = tbl.labels[ycol]
    else:
        lbl = ycol
    ax.scatter(tbl.column(xcol), tbl.column(ycol), label=lbl)
    return lbl

In [None]:
fig, ax = plt.subplots()
ax.set_title('Disease Prevalence vs. Coliform Level')
labels = []
# Use `disease_label` as the loop variable so we don't overwrite the `disease` table.
for disease_label in joined_avgs.labels[4:14:2]:
    labels.append(make_plot(ax, joined_avgs, 'Fecal Coliform average', disease_label))
leg = ax.legend(loc='upper right', fancybox=True, shadow=True)
leg.get_frame().set_alpha(0.4)
# Map each legend marker to the scatter series it stands for.
# (The attribute is `legend_handles` in newer matplotlib and `legendHandles` in older versions.)
handles = getattr(leg, 'legend_handles', None) or leg.legendHandles
shown = {}
for legentry, axentry in zip(handles, ax.collections):
    legentry.set_picker(True)
    shown[legentry] = axentry


def onpick(event):
    # On a pick event, find the scatter series that matches the clicked
    # legend marker and toggle its visibility.
    legentry = event.artist
    series = shown[legentry]
    vis = not series.get_visible()
    series.set_visible(vis)
    # Fade the legend marker so we can see which series are hidden.
    if vis:
        legentry.set_alpha(1.0)
    else:
        legentry.set_alpha(0.2)
    fig.canvas.draw()


# Clicking only works with an interactive matplotlib backend;
# with `%matplotlib inline` the plot is a static image.
fig.canvas.mpl_connect('pick_event', onpick)

plt.show()

**Conclusion:**
Here, we learned about the average water temperature in each state and explored how the prevalence of a disease relates to the amount of coliform bacteria in the water. (Bacteria in water do contribute to some diseases.)

Data science isn't just good for looking at water quality; you can use it to explore so many other areas! Now, we're going to look at work that our team did at UC Berkeley, where they investigated how much the world has changed over the years. They did this in their introductory data science class, and by the end of this course, you'll be able to do some of it too!

Full project at: https://github.com/data-8/data8assets/tree/gh-pages/materials/sp17/project/project1

## Population Growth, Fertility, and Poverty over Time

This table contains data about the population (the number of people living in a place) of every country in the world! The column "geo" is a three-letter code for a country, "time" is a year, and "population_total" is the number of people recorded as living in that country in that year.


In [None]:
population = Table.read_table('population.csv')

#Source: https://github.com/open-numbers/ddf--gapminder--systema_globalis/raw/master/ddf--datapoints--population_total--by--geo--time.csv

population


Let's now look at just the data for India (code `ind`) and see how much the population has grown since the year 1800.

In [None]:
india = population.where("geo", are.equal_to("ind")).where("time", are.between(1800, 2017))

In [None]:
plt.plot(india.column("time"), india.column("population_total"))
plt.title("India's Population Growth")
plt.xlabel("Year")
plt.ylabel("Population (billions)")

As we can see, India's population is rising, and fast. Let's now look at the whole world's population.

In [None]:
total_pop = population.group("time", sum).drop("geo sum").where("time", are.between(1800, 2017))
plt.plot(total_pop.column(0), total_pop.column(1))
plt.title("World Population")
plt.xlabel("Year")
plt.ylabel("Population (billions)")

Let's take a closer look at this by looking at *fertility rates*, which measure the average number of children born per woman.

In [None]:
life_expectancy = Table.read_table('life_expectancy.csv')
child_mortality = Table.read_table('child_mortality.csv').relabeled(2, 'child_mortality_under_5_per_1000_born')
fertility = Table.read_table('fertility.csv')

In [None]:
Table().with_columns(
    '1960', fertility.where('time', 1960).column(2),
    '2010', fertility.where('time', 2010).column(2)
).hist(bins=np.arange(0, 10, 0.5), unit='child')
_ = plt.xlabel('Children per woman')
_ = plt.xticks(np.arange(10))


This figure shows us two overlaid *histograms*, one for 1960 and one for 2010, that show the *distributions* of total fertility rates for these two years among all 201 countries in the `fertility` table.
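
A histogram starts by sorting values into equal-width bins and counting how many fall into each one. The sketch below does that counting step by hand in plain Python, with the same bin width of 0.5 as the plot above. The fertility rates are made-up values for illustration, not numbers from `fertility.csv`, and the plotting function may rescale the bar heights after counting.

```python
# Hypothetical fertility rates (children per woman). Made-up values.
rates = [1.4, 1.9, 2.1, 2.3, 2.4, 3.7, 5.2, 5.4]
bin_width = 0.5

# Count how many values land in each bin [start, start + bin_width).
counts = {}
for rate in rates:
    start = int(rate // bin_width) * bin_width
    counts[start] = counts.get(start, 0) + 1

for start in sorted(counts):
    print(f"{start:.1f} to {start + bin_width:.1f}: {counts[start]}")
```

The bin with the most values becomes the tallest bar, which is how a histogram shows where the values cluster.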

Let's look at the 50 most populous countries in 2010 (to make this run faster).

In [None]:
# We first create a population table that only includes the 
# 50 countries with the largest 2010 populations. We focus on 
# these 50 countries only so that plotting later will run faster.
big_50 = population.where('time', 2010).sort(2, descending=True).take(np.arange(50)).column('geo')
population_of_big_50 = population.where('time', are.above(1959)).where('geo', are.contained_in(big_50))

def stats_for_year(year):
    """Return a table of the stats for each country that year."""
    p = population_of_big_50.where('time', year).drop('time')
    f = fertility.where('time', year).drop('time').where("geo", are.contained_in(big_50))
    c = child_mortality.where('time', year).drop('time').where("geo", are.contained_in(big_50))
    return p.join('geo', f).join('geo', c)

Here, we create a table called `pop_by_decade` with two columns called `decade` and `population`. It has a row for each `year` since 1960 that starts a decade. The `population` column contains the total population of all countries included in the result of `stats_for_year(year)` for the first `year` of the decade. For example, 1960 is the first year of the 1960s. You should see that these countries contain most of the world's population.

In [None]:
decades = Table().with_column('decade', np.arange(1960, 2011, 10))

def pop_for_year(year):
    return sum(stats_for_year(year).column("population_total"))

pop_by_decade = decades.with_column("population",decades.apply(pop_for_year, 'decade'))
pop_by_decade.set_format(1, NumberFormatter)

The `countries` table describes various characteristics of countries. The `country` column contains the same codes as the `geo` column in each of the other data tables (`population`, `fertility`, and `child_mortality`). The `world_6region` column classifies each country into a region of the world. Run the cell below to inspect the data.

In [None]:
countries = Table.read_table('countries.csv').where('country', are.contained_in(population.group('geo').column(0)))
countries.select('country', 'name', 'world_6region')

After this, we create a table called `region_counts` that describes the count of how many countries in each region appear in the result of `stats_for_year(1960)`.

For example, one row would have `south_asia` as its `world_6region` value and an integer as its `count` value: the number of large South Asian countries for which we have population, fertility, and child mortality numbers from 1960.
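
Counting how many rows fall into each group can be sketched with `collections.Counter` from Python's standard library. The country codes and the `region_x` label below are placeholders invented for illustration; the actual counts come from the `countries` table.

```python
from collections import Counter

# Hypothetical (country code, region) pairs. The codes and "region_x"
# are placeholders, not rows from countries.csv.
country_regions = [
    ("aaa", "south_asia"),
    ("bbb", "south_asia"),
    ("ccc", "region_x"),
    ("ddd", "south_asia"),
    ("eee", "region_x"),
]

# Count how many countries fall in each region.
example_counts = Counter(region for _, region in country_regions)
print(example_counts)
```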

In [None]:
top_50 = stats_for_year(1960)
region_counts = top_50.join("geo", countries, "country").group("world_6region").relabel("world_6region", "region")
region_counts

The following diagram compares total fertility rate and child mortality rate for each country in 1960. The area of each dot represents the population of the country, and the color represents its region of the world. Run the cell below. Do you think you can identify any of the dots?

In [None]:
from functools import lru_cache as cache

# This cache annotation makes sure that if the same year
# is passed as an argument twice, the work of computing
# the result is only carried out once.
@cache(None)
def stats_relabeled(year):
    """Relabeled and cached version of stats_for_year."""
    return stats_for_year(year).relabeled(2, 'Children per woman').relabeled(3, 'Child deaths per 1000 born')

def fertilty_vs_child_mortality(year):
    """Draw a color scatter diagram comparing child mortality and fertility."""
    with_region = stats_relabeled(year).join('geo', countries.select('country', 'world_6region'), 'country')
    with_region.scatter(2, 3, sizes=1, colors=4, s=500)
    plt.xlim(0,10)
    plt.ylim(-50, 500)
    plt.title(year)

fertilty_vs_child_mortality(1960)

Drag the slider to the right to see how countries have changed over time. You'll find that the great divide between countries like America and Britain and countries like India and China that existed in the 1960s has nearly disappeared. This shift in fertility rates is the reason that the global population is expected to grow more slowly in the 21st century than it did in the 19th and 20th centuries.
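
The slider relies on the caching that was set up earlier with `lru_cache`. As a minimal sketch of what that caching does, the hypothetical function `slow_square` below stands in for `stats_relabeled`, and a counter records how many times the real work actually runs.

```python
from functools import lru_cache

calls = 0  # How many times the function body actually runs.

@lru_cache(maxsize=None)
def slow_square(n):
    """Hypothetical stand-in for an expensive computation."""
    global calls
    calls += 1
    return n * n

first = slow_square(12)   # Computed: the body runs.
second = slow_square(12)  # Same argument: the answer comes from the cache.
third = slow_square(5)    # New argument: the body runs again.
print(first, second, third, calls)
```

The function is called three times, but the work is done only twice, because the repeated argument is answered from the cache.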

In [None]:
import ipywidgets as widgets

# This part takes a few minutes to run because it 
# computes 56 tables in advance: one for each year from 1960 to 2015.
Table().with_column('Year', np.arange(1960, 2016)).apply(stats_relabeled, 'Year')

_ = widgets.interact(fertilty_vs_child_mortality, 
                     year=widgets.IntSlider(min=1960, max=2015, value=1960))