lesson8/Lesson8_Correlation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../dsi.png\" style=\"height:128px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Lesson 8: Correlation\n",
    "\n",
    "In our last notebook, we learned how to use the power of programming to visualize data. But that's not all we can do with these Python libraries. Today we'll take our programming toolkit one step further and see how we can use the same libraries to calculate numerical information about our data after we've plotted it. Visualizations and numerical information together can tell us  detailed stories about our data and what it means.\n",
    "\n",
    "We'll begin by using matplotlib - the same library we used to make line graphs, bar graphs, and pie charts - to explore the process of visualizing data using scatter plots. We'll then see how we can use another powerful library called numpy to numerically investigate the relationships between variables in our data by computing correlation coefficients. Finally we'll also look into plotting lines of best fit as an introduction to the concept of regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import matplotlib\n",
    "matplotlib.use('Agg')\n",
    "from datascience import Table\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "plt.style.use('fivethirtyeight')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def checker(strong, positive):\n",
    "    if strong:\n",
    "        if positive:\n",
    "            print(\"Try Again!\")\n",
    "        else:\n",
    "            print(\"You are correct! You now know how to make and read scatterplots to analyze trends in data!\")   \n",
    "    else:\n",
    "        print(\"Try Again!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Scatterplots\n",
    "\n",
    "Remember from today's lesson that when we have data about two variables, the best way to visualize it is through something called a \"scatterplot.\" We can use our handy datascience library to quickly make lots of scatterplots.\n",
    "\n",
    "Let's start with the example of clasroom participation from the lesson. Below we'll create a new Table using that data and then use the datascience \".scatter('Column Name 1', 'Column Name 2')\" function to create a scatterplot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "classroom_participation = Table().with_columns([\n",
    "    'Student',['Sathvik','Anjali', 'Shreyan','Chaaru','Rishi','Divya'],\n",
    "    '1st Week', [3,1,1,2,2,3],\n",
    "    '12th Week', [7,2,4,5,6,5]\n",
    "])\n",
    "classroom_participation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "classroom_participation.scatter('1st Week', '12th Week')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Scatterplots are useful for illustrating \"trends\" in our data. As we can see from this scatterplot, students who tended to participate less in the first week also tended to participated less in the twelfth week - like Anjali for example, who participated the least out of all the students in the 1st and 12th weeks. Similarly, students who tended to particpate more in the first week also participated more in the twelfth week. This is exactly what we mean by a trend. \n",
    "\n",
    "This is an example of a \"weak positive\" correlation. It's weak because the data points don't form exactly a straight line, and it's positive because the data points rougly go from the bottom left to the top right in our scatterplot (positive slope). Now let's look at another example. Fill in the following cells to make a scatter plot of classroom absences vs. final grade in the class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "student_abscences = Table().with_columns([\n",
    "    'Student',['Rashi','Amit', 'Simaran','Shruti','Eesha','Arpit'],\n",
    "    'Absences', [0,1,2,3,4,7],\n",
    "    'Grades', [99,95,90,80,75,68]\n",
    "])\n",
    "student_abscences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "student_abscences.scatter(    ) # fill in this line to make the scatterplot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Is this data strongly or weakly correlated? And is the correlation positive or negative? Put your answers in the cell below and run it to find out if you're right!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "strong =     # put True here if you think it's a strong correlation and False if it's weak\n",
    "positive =     # put True here if you think the data is positively correlated, and False if it's negatively correlated\n",
    "\n",
    "checker(strong, positive)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Correlation Coefficients\n",
    "\n",
    "Scatterplots can tell us a lot about trends in our data, but just knowing whether our data is strongly or weakly correlated and whether the correlation is positive or negative isn't enough. You might ask yourself, \"How strong is  the correlation? Are some correlations stronger than others?\"\n",
    "\n",
    "This is where the idea of a correlation coefficient comes in. The correlation coefficient is a number, known as r, between -1 and 1 that can be calculated for any two variables. If r is negative the correlation is negative, and if it's positive the correlation is positive. And the \"magnitude\" of r (in other words, how close |r| is to 1) tells us how strong the correlation is.\n",
    "\n",
    "In the following cells, we'll walk you through an example of calculating the correlation coefficient for a dataset. For our dataset, let's take a look at some data we first saw all the way back in lesson 1. We'll use the datascience library to create a Table with fertility, population, life expectancy, and child mortality data for all the countries in the world for the year 2015."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "fertility_data = Table.read_table('fertility.csv').where('time', 2015).drop(\"time\")\n",
    "population_data = Table.read_table('population.csv').where('time', 2015).drop(\"time\")\n",
    "life_expectancy_data = Table.read_table('life_expectancy.csv').where('time', 2015).drop(\"time\")\n",
    "child_mortality_data = Table.read_table('child_mortality.csv').where('time', 2015).drop(\"time\")\n",
    "\n",
    "joined_data = fertility_data.join(\"geo\", population_data)\\\n",
    "                            .join(\"geo\", life_expectancy_data)\\\n",
    "                            .join(\"geo\", child_mortality_data)\n",
    "joined_data = joined_data.relabeled(\"children_per_woman_total_fertility\", \"fertility\")\\\n",
    "                        .relabeled(\"population_total\", \"population\")\\\n",
    "                        .relabeled(\"life_expectancy_years\", \"life expectancy\")\\\n",
    "                        .relabeled(\"child_mortality_0_5_year_olds_dying_per_1000_born\", \"child mortality\")\n",
    "joined_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Let's take a look at the \"child mortality\" and \"life expectancy\" columns. We'd expect these two be negatively correlated since the higher the risk of dying as a child, the lower your expected lifetime should be. We can see if this trend holds true by first plotting the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "joined_data.scatter(\"life expectancy\", \"child mortality\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Yup, in fact the data seems negatively correlated. How strong is the correlation exactly? Well to answer that let's compute the correlation coefficient, using the numpy library. First, let's get the two columns we want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "life_expectancy = joined_data.column(\"life expectancy\")\n",
    "child_mortality = joined_data.column(\"child mortality\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Remember that the next step in computing the correlation coefficient is calculating the mean and standard deviation of our two variables. Numpy lets us do this with the convenient \"np.mean\" and \"np.std\" functions. Here's how to use them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "avg_life_expectancy = np.mean(life_expectancy)\n",
    "stddev_life_expectancy = np.std(life_expectancy)\n",
    "avg_child_mortality = np.mean(child_mortality)\n",
    "stddev_child_mortality = np.std(child_mortality)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The next step is to transform the data by calculating z_x = (x - x_mean)/x_stddev and z_y = (y - y_mean)/y_stddev for each x and y value, then multiplying together to get z_x * z_y. We can use the numpy \"subtract,\" \"divide,\" and \"multiply\" functions for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "transformed_life_expectancy = np.divide(np.subtract(life_expectancy, avg_life_expectancy), stddev_life_expectancy)\n",
    "transformed_child_mortality = np.divide(np.subtract(child_mortality, avg_child_mortality), stddev_child_mortality)\n",
    "products = np.multiply(transformed_life_expectancy, transformed_child_mortality)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Finally, we add up the values in this array and divide by n, where n is the number of values, to get the final correlation coefficient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "correlation_coefficient = np.sum(products) / len(products)\n",
    "print(\"Correlation coefficient: \", correlation_coefficient)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "As you can see the correlation coefficient is negative and very close to -1, indicating that this is a strong negative correlation as expected. \n",
    "\n",
    "Now it's your turn. Let's take a look at another pair of variables: fertility and population. Fill in the following cells to make the scatterplot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "joined_data.scatter(    ) # fill this in to get a scatterplot of fertility vs population"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now guess whether the correlation is strong or weak, and if it's positive or negative. Then fill in the following cells to find out if you're right by computing the correlation coefficient!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "fertility = joined_data.column(    )\n",
    "population = joined_data.column(    )\n",
    "\n",
    "avg_fertility = \n",
    "stddev_fertility = \n",
    "avg_population = \n",
    "stddev_population = \n",
    "\n",
    "transformed_fertility = np.divide(np.subtract(    ),    )\n",
    "transformed_population = np.divide(np.subtract(    ),    )\n",
    "products = \n",
    "\n",
    "correlation_coefficient = np.sum(    ) / (    )\n",
    "print(\"This is the correlation coefficient: \", correlation_coefficient)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "A much quicker way to compute the correlation coefficient is to use the np.corrcoef() function. To check if you filled in the previous cells correctly and found the right value, run the next cell to see the right answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"This is the correlation coefficient: \", \n",
    "      np.corrcoef(joined_data.column(\"population\"), joined_data.column(\"fertility\"))[1,0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Line of Best Fit\n",
    "\n",
    "The last thing you learned about correlation is the idea of a line of best fit. This is a line that roughly \"fits\" the data by describing the ideal linear relationship between the data points. Luckily we can plot lines of best fit easily using the datascience library by adding an additional argument, \"fit_line = True\", to our call to the scatter function. Here's an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "joined_data.scatter(\"life expectancy\", \"child mortality\", fit_line=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "When the data is strongly correlated (correlation coefficient close to 1 or -1), the line closely fits most of the data points well. But when the data is not strongly correlated (correlation coefficient close to 0), the lie of best fit doesn't seem to fit the data at all. Let's see what the line of best fit looks like for population and fertility. Fill in the following cell to create a scatterplot of population and fertility with a best fit line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "joined_data.scatter(    ) # fill this in"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "In the next lesson, we'll learn how to actually mathematically find the equation for this line of best fit by using a technique known as \"regression.\""
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}