Run the code cell below with the ▶| button above to set up this notebook, or press `SHIFT-ENTER`:

In [None]:
!pip install --no-cache-dir -U -q folium
import pandas as pd
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import folium
from IPython.display import HTML, display
from folium import plugins
from matplotlib.pyplot import imread  # scipy.ndimage.imread was removed in SciPy 1.2
%matplotlib inline
from sklearn import linear_model
from scipy import stats
import os
import ipywidgets as widgets
from scripts.soc_module import *
import random
os.system("rm -rf images")
os.mkdir("images")

# Sociology 130AC Module: "The Neighborhood Project"

Welcome to the data science part of your project! You have gathered data from census tracts and entered it [here](https://docs.google.com/forms/d/e/1FAIpQLSeB-QAmszPXRX6Xp20tddMgivdlF3SW6pli5NdOvTBQZ7gt6A/viewform?usp=sf_link). Now it's time to explore our class data and quantify our observations using Python for basic data science. You do not need any prior programming knowledge to do this. The purpose of the data science module is not to teach you programming, but rather to show you the power of data science tools and how we can use them for data analysis.

# Part 1: Introduction to Python and Jupyter Notebooks

## 1. Cells, Arithmetic, and Code
In a notebook, each rectangle containing text or code is called a *cell*.

Cells (like this one) can be edited by double-clicking on them. This cell is a text cell, written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. You don't need to worry about Markdown today, but it's a fun and easy tool to learn.

After you edit a cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions.) You can also press `SHIFT-ENTER` to run any cell or progress from one cell to the next.

Other cells contain code in the Python programming language.  Running a code cell will execute all of the code it contains.

Try running this cell:

In [None]:
print("Hello, World!")

We will now quickly go through some very basic functionality of Python, which we'll be using throughout the rest of this notebook.

### 1.1 Arithmetic
Quantitative information arises everywhere in data science. In addition to representing commands to `print` out lines, expressions can represent numbers and methods of combining numbers. 

The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)

In [None]:
3.2500

We don't always need to say "`print`", because Jupyter automatically displays the value of the last line in a code cell. If you want to display more than one line, though, use "`print`" for each one.

In [None]:
print(3)
4
5

Many basic arithmetic operations are built into Python, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html). Use parentheses to specify the order of operations, which follows PEMDAS, just as you may have learned in school. Use parentheses for a happy new year!

In [None]:
1+(6*5-(6*3))**2*((2**3)/4*7)
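
To see how the order of operations plays out, here is a sketch that evaluates the same expression one piece at a time. The variable names (`left`, `right`, `squared`, `total`) are made up for this illustration.

```python
# Evaluate the expression piece by piece, innermost parentheses first.
left = 6 * 5 - (6 * 3)       # 30 - 18 = 12
right = (2 ** 3) / 4 * 7     # 8 / 4 = 2.0, then 2.0 * 7 = 14.0
squared = left ** 2          # exponents come before multiplication: 144
total = 1 + squared * right  # multiplication before addition: 1 + 2016.0

print(left, right, squared, total)  # 12 14.0 144 2017.0
```

The result is a decimal number (`2017.0`) because `/` always produces a decimal in Python 3, even when the division comes out even.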

### 1.2 Variables

We sometimes want to work with the result of some computation more than once. To be able to do that without repeating code everywhere we want to use it, we can store it in a variable with *assignment statements*, which have the variable name on the left, an equals sign, and the expression to be evaluated and stored on the right. In the cell below, `(3 * 11 + 5) / 2 - 9` evaluates to 10, and gets stored in the variable `result`.

In [None]:
result = (3 * 11 + 5) / 2 - 9
result

In [None]:
result

## 2. Functions

    
One important form of an expression is the call expression, which first names a function and then describes its arguments. The function returns some value, based on its arguments. Some important mathematical functions are:

| Function | Description                                                   |
|----------|---------------------------------------------------------------|
| `abs`      | Returns the absolute value of its argument                    |
| `max`      | Returns the maximum of all its arguments                      |
| `min`      | Returns the minimum of all its arguments                      |
| `round`    | Rounds its argument to the nearest integer                    |

Here are two call expressions that both evaluate to 3:

```python
abs(2 - 5)
max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))
```

These function calls first evaluate the expressions in the arguments (inside the parentheses), then evaluate the function on the results. `abs(2-5)` evaluates first to `abs(-3)`, then returns `3`.
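
The second call expression has several nested calls, so it can help to step through it from the inside out. In this sketch the intermediate variable names (`big`, `smallest`, `rounded`, `answer`) are made up for illustration.

```python
# Step through the second call expression from the inside out.
big = pow(2, 10)                   # 2 to the 10th power: 1024
smallest = min(big, -1 * big)      # min(1024, -1024) is -1024
rounded = round(2.8)               # nearest integer to 2.8 is 3
answer = max(rounded, smallest)    # max(3, -1024) is 3

print(answer)                      # 3
print(abs(2 - 5))                  # abs(-3) is 3
```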

A **statement** is a whole line of code. Some statements are just expressions, like the examples above, which can be broken down into subexpressions that are evaluated individually before the statement as a whole is evaluated.


### 2.1 Calling functions

The most common way to combine or manipulate values in Python is by calling functions. Python comes with many built-in functions that perform common operations.

For example, the `abs` function takes a single number as its argument and returns the absolute value of that number.  The absolute value of a number is its distance from 0 on the number line, so `abs(5)` is 5 and `abs(-5)` is also 5.

In [None]:
abs(5)

In [None]:
abs(-5)

Functions can be called as above, putting the arguments in parentheses after the function name, or by using "dot notation", where you write the value first, then a dot, then the function name, as in the cell immediately below.

In [None]:
nums = make_array(1,2,5) # an array of items, in this case, numbers
nums.max()

In [None]:
max(nums)

# Part 2: Your Data and Tables

We can read in the data you submitted through the survey by asking Google for the form information:

In [None]:
gdoc_key = "1D-TeB9S_qjuTJ-Jp_LBNAITAYN-uVxr1kRFM9Tg-K0Y"
spreadsheet_url = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv'.format(gdoc_key)
obs_data = pd.read_csv(spreadsheet_url)
obs_data.head()

In [None]:
obs_data['Census Tract'].value_counts().plot.barh()
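
`value_counts` tallies how many times each distinct value appears in a column, and `.plot.barh()` then draws those tallies as horizontal bars. Here is a minimal sketch on a made-up table; the tract numbers are invented for illustration and are not class data.

```python
import pandas as pd

# Hypothetical survey responses; these tract numbers are made up.
toy = pd.DataFrame({"Census Tract": [4005, 4006, 4005, 4007, 4005, 4006]})

# Count how many rows mention each tract, most frequent first.
counts = toy["Census Tract"].value_counts()
print(counts)
```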

In [None]:
def bar_chart_column(column_num):
    obs_data[obs_data.columns[column_num]].value_counts().plot.barh()
    plt.title(obs_data.columns[column_num])

slider = widgets.IntSlider(min=5,max=13,step=1,value=5)
display(widgets.interactive(bar_chart_column, column_num=slider))

In [None]:
obs_data.iloc[:, 5:14].mean().plot.barh()

In [None]:
download_images(obs_data)
!ls images

After uploading all of the data from the class, we will use a library called Folium to map your observations onto a map of the census tracts. Click on a marker below to see a pop-up of the data at a particular point. Try to find the census tract you visited and see if the data you collected is there! 

Next, click around census tracts near yours to see if the other students' observations are similar, and see if you can eyeball any trends. Check out other areas on the map and see if there are trends for tracts in specific areas. Do specific characteristics cluster in different areas? Which ones? Which characteristics seem to cluster together? What types of data do you think will correlate with socioeconomic characteristics like median income, poverty rate, and education? Why?

After you have made some predictions, we will compare our data with socioeconomic data from the U.S. Census for the different tracts we visited and see if we can find evidence to support them. From your data we have created some point scales that measure different aspects of a neighborhood. For example, we have made a scale called “social disorder” and another called “amenities” based on some of the data you collected from the coding sheet. We will compare your data to the census data. Let's get started!

In [None]:
OAKLAND_COORDINATES = (37.8044, -122.2711)  # center the map on Oakland
myMap = folium.Map(location=OAKLAND_COORDINATES, zoom_start=12)

for i, row in obs_data.iterrows():
    image_url = random.choice(row['Images'].replace("open?", "uc?").split(","))
    tract = str(row['Census Tract'])
    comment = row["Other thoughts or comments"]
    html = html_popup(title=tract, comment=comment, imgpath=image_url, data="hello")
    folium.Marker([37.8200, -122.2427], 
                  popup=folium.Popup(folium.IFrame(html=html, width=200, height=300), max_width=2650)).add_to(myMap)

lat_shift = 0.058
lon_shift = 0.090

oakland_lat = 37.8200
oakland_lon = -122.2427

data = imread('img/census-map.PNG')
myMap.add_child(plugins.ImageOverlay(data, opacity=0.9, \
        bounds =[[oakland_lat - lat_shift, oakland_lon - lon_shift], [oakland_lat + lat_shift, oakland_lon + lon_shift]]))

myMap

First, let's put all of the data into a table. Graphs are nice for visualization and for getting a general idea, but it's a lot easier to manipulate tables to get concrete results. This is the table of all of your data: the row index is the number of the census tract, and each column represents a variable you collected data about.

This is the data from American FactFinder.

In [None]:
#unemployment percent is for 16 years and older
#education: percent of people who have bachelors degree or higher
#income:household median income
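# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 keeps the first column as the row index
official_data = pd.read_csv("data/merged-census.csv", index_col=0)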
official_data

Let's analyze the data we are using to represent social disorder. We theorize that a higher number in a category represents a higher state of social disorder. Therefore, we will combine all of the data into a single social disorder number. We will then compare this number to the census data and look for patterns. It is easier to compare one social disorder number to the census data than 10 different variables.

The table below has one row per census tract and four columns: the social disorder points based on your survey, income, unemployment, and education level. Find your census tract and see whether its income, unemployment, and education level are what you expected based on your impressions of the neighborhood.

In [None]:
combined_data = pd.DataFrame(columns=[ "Social Disorder", "Income", "Unemployment", "Education"])

# Add up all of the points for each tract, then build a new table with the points and the official data
for tract in class_data.index.values:
    points = sum(class_data.loc[tract])
    income = official_data.loc[tract]["Household Median Income"]
    unemployment = official_data.loc[tract]["Unemployment %"]
    education = official_data.loc[tract]["Bachelor's Degree or higher %"]
    combined_data.loc[tract]= [points, income, unemployment, education]

combined_data
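
The loop above adds up each tract's points to get one social disorder number per tract. The same idea can be sketched on a toy table with a row-wise sum. The column names and values below are invented for illustration and are not taken from the class coding sheet.

```python
import pandas as pd

# Hypothetical 0/1 observations, indexed by made-up tract numbers.
toy_obs = pd.DataFrame(
    {"litter": [1, 0, 1], "graffiti": [1, 0, 0], "broken windows": [1, 1, 0]},
    index=[4005, 4006, 4007],
)

# Sum across each row (axis=1) to get one score per tract.
toy_scores = toy_obs.sum(axis=1)
print(toy_scores)
```

Each row collapses to a single number, so tract 4005 (three observations marked 1) gets a higher score than the other two tracts (one each).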

In [None]:
def f(variable_name, tract):
    x = combined_data[variable_name]
    
    plt.hist(x)
    plt.axvline(x=combined_data.loc[tract, variable_name], color = "RED")
    plt.xlabel(variable_name, fontsize=18)
    plt.show()
    
interactive_plot = widgets.interactive(f, variable_name=list(combined_data), tract=combined_data.index)
output = interactive_plot.children[-1]
output.layout.height = '350px'
interactive_plot

Let's first analyze income levels. We have sorted the data according to income level. Compare the income levels to the level of social disorder. Is there a correlation you can spot (as one increases or decreases, does the other do the same)?

Did you look at the whole table? A common mistake is to assume that because the top 10 results follow (or do not follow) a pattern, the rest do the same. Real-life data is often messy. Does the correlation continue throughout the whole table (that is, as income decreases, do the points decrease too), or is there no pattern? What does this mean about the data?


In [None]:
#sort by income
combined_data.sort_values("Income")

Now let's analyze education levels and unemployment using tables sorted by each. To sort by either, delete the hashtag on the line you want to sort by below. A hashtag marks a line as a comment, which means Python will not run it. When you remove the hashtag, the line will run. Put the hashtag back to turn the line into a comment again.

In [None]:
#combined_data.sort_values("Unemployment")
#combined_data.sort_values("Education")

Now do some exploring on your own. Here is the list of all of the census tracts, where every column is a type of data that you collected. Right now, the data is sorted by the column 'cigarette butts found' in descending order. To change how it is sorted, simply change the column name to the column you want to sort by, for example 'children playing'. Make sure the name of the column is in quotes! If you want ascending order, change descending to ascending. You can also change the number of results returned by changing the number inside the list command.
Play around with sorting different columns and attributes. 
What patterns do you observe?
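
Here is a minimal sketch of sorting a table by one column with pandas, using a made-up table (the tract numbers and counts are hypothetical stand-ins, not the class data). `sort_values` takes the column name in quotes, `ascending=False` puts the largest values first, and `head(n)` limits how many rows are shown.

```python
import pandas as pd

# Hypothetical counts for four made-up tracts.
toy = pd.DataFrame(
    {"cigarette butts found": [2, 9, 0, 5], "children playing": [3, 0, 4, 1]},
    index=[4005, 4006, 4007, 4008],
)

# Sort by one column, largest values first, and keep the top two rows.
top_two = toy.sort_values("cigarette butts found", ascending=False).head(2)
print(top_two)
```

Changing the column name to `"children playing"` or switching to `ascending=True` reorders the same rows differently.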

In [None]:
# This part of the project is not very useful right now because most of the data is coded as 1s or 0s, so there is little point in sorting.

Eyeballing patterns is not the same as a statistical measure of correlation; you must quantify it with numbers and statistics to support your ideas. Sorting tables is not a very statistical way to measure how strongly a variable correlates with the results. What does it mean for a variable "income" to match 7 out of the top 15 social disorder points? Does this carry over to the rest of the results? How well does it correlate?

We will now use a method called linear regression to draw the line of best fit through the data. The sign of the line's slope shows whether the two variables are positively or negatively correlated. The value "r squared" measures how close the data are to the fitted regression line: 0 means the variable explains none of the variability in the data, while 1 means it explains all of it.
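
To make the slope and r squared concrete, here is a sketch on made-up numbers (not the class or census data). It fits the line with NumPy's `polyfit` and computes r squared by hand as 1 minus the ratio of unexplained variation to total variation. For a straight-line fit, this should agree with `r_value**2` from `stats.linregress`.

```python
import numpy as np

# Made-up data for illustration only (not the class or census data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the best-fit line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# r squared = 1 - (unexplained variation) / (total variation).
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("slope:", slope)
print("r squared:", r_squared)
```

Here the slope is positive because `y` rises as `x` rises, and r squared is close to 1 because the points lie almost exactly on a line.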

We want to plot the change in social disorder points with respect to a certain variable (like education or income). Therefore, the Y axis will always be social disorder and the X axis will be the variable that we want to analyze. Right now, the x variable is set to "Income". The graph will give you a better sense of the whole data set than sorting columns like you did above. The R-squared value will give you an exact "goodness-of-fit" value for your model.

To change the X axis, you need to change the x variable from "Income" to either "Education" or "Unemployment" (or another one of the census variables). Make sure you don't delete the quotation marks and remember to capitalize the first letter!

Why is this a better method than just sorting tables? First, we are now comparing all of the data in the graph to the variable, rather than only what our eyes glance quickly over. It shows a more complete picture than just saying "There are some similar results in the top half of the sorted data". Second, the graph gives a more intuitive sense of whether your variable matches the data. You can quickly see if the data points line up with the regression line. Lastly, the r-squared value gives you a way to quantify how well the variable explains the data.

One of the beautiful things about computer science and statistics is that you do not need to reinvent the wheel. You don't need to know how to calculate the r_squared value, or draw the regression line; someone has already implemented it! You simply need to tell the computer to calculate it. However, if you are interested in these mathematical models, take a data science or statistics course!

In [None]:
def f(x_variable, y_variable):
    x = combined_data[x_variable]
    y = combined_data[y_variable]

    plt.scatter(x, y)
    plt.xlabel(x_variable, fontsize=18)
    plt.ylabel(y_variable, fontsize=18)
    plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color="r") #calculate line of best fit
    plt.show()
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y) #gets the r_value
    print("The r squared coefficient is: ", r_value**2)
    
interactive_plot = widgets.interactive(f, x_variable=list(combined_data), y_variable=list(combined_data))
output = interactive_plot.children[-1]
output.layout.height = '350px'
interactive_plot

To recap, we did three things to analyze our real-life data:

1) We observed the data on a map where we could click on various points and see the data. This gives an intuitive sense of how the data is spread out across different census tracts and what we can expect from our analyses. Having a physical picture of what the data could be representing is often a good first step, rather than jumping straight into the numbers.

2) We observed the data in table form. As nice as graphs are, it is not possible to sort a graph. Therefore, we put our data into table form, compressed it, and looked for correlations with known statistics. It is important to compare your collected results to known statistics. There is no point in collecting data if you cannot measure it against some standard.

3) Instead of eyeballing the table, we created a graph and computed the line of best fit. We also computed an r_squared value, which shows how "good" the line of best fit is. This gives a more accurate picture of all the data points than just looking at tables.

We hope you learned how to quantify your observations in a mathematical way and had some fun manipulating data!
