# Sociology 130AC Module: "The Neighborhood Project"

Context tile.

Welcome to the data science part of your project! You have gathered data from census tracts, and now it is time to quantify your observations using Python. You do not need any prior programming knowledge: the purpose is not to teach you programming, but to show you how programming can be applied to real-life problems.

First, we have to import some basic data science libraries. These libraries let us manipulate the data easily and create visualizations. If you are interested in what they are: NumPy is a scientific computing library, pandas is a data analysis library, and matplotlib is a data visualization library.

In [1]:
import pandas as pd
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import *
%matplotlib inline

After your data has been pooled with the data from the rest of the class, we will use a library called Folium to plot your observations on a map of San Francisco (location still to be confirmed). Click on a marker below to see a pop-up of the data at that point. Try to find the census tract you visited and see if the data you collected is there!

Next, click around the census tracts near yours to see whether the other students' observations are similar and whether you can eyeball any trends. Check out other areas on the map and look for trends around "good" neighborhoods and "bad" neighborhoods. Do "bad" and "good" neighborhoods show similar trends in the data? Which types of data do you think correlate most strongly with how safe a neighborhood is? On a larger scale, what do you think defines a good neighborhood or a bad neighborhood?

After you have made some predictions, we will analyze some important factors that shape neighborhoods. As you know, "good" and "bad" are very subjective terms, so it is more scientific to analyze factors that contribute to good neighborhoods, such as income, education, and employment rates. We have converted your data to a point scale and will compare it to the official data from American FactFinder. Let's get started!

In [69]:
import folium
from IPython.display import HTML, display

# Latitude and longitude of San Francisco, used as the map center
SF_COORDINATES = (37.7749, -122.4194)
myMap = folium.Map(location=SF_COORDINATES, zoom_start=12)

#Folium formats popup windows in html format
html_1="""
    <h3>Good Neighborhood</h3>
    <img
       src="http://images.glaciermedia.ca/polopoly_fs/1.1827657.1429305764!/fileImage/httpImage/image.jpg_gen/derivatives/original_size/vancouver-single-family-home-neighbourhood-street.jpg"
       style="width:180px;height:128px;"
       >
    <p>
       "This neighborhood seemed really nice to our group"
    </p>
    <p>
       # of cigarette butts: 0
    </p>
    """

html_2="""
    <h3>Bad Neighborhood</h3>
    <img
        src="http://static.lakana.com/nxsglobal/lasvegasnow/photo/2015/06/23/bad%20neighborhood_1435097179455_1625324_ver1.0_640_360.jpg"
        style="width:180px;height:128px;"
        >
    <p>
        "This neighborhood didn't seem safe"
    </p>
    <p>
        # of cigarette butts: 100
    </p>
    """


folium.Marker([37.803, -122.435], 
              popup=folium.Popup(folium.IFrame(html=html_1, width=200, height=300), max_width=2650)).add_to(myMap)
folium.Marker([37.765, -122.415], 
              popup=folium.Popup(folium.IFrame(html=html_2, width=200, height=300), max_width=2650)).add_to(myMap)

myMap

First, let's put all of the data into a table. Graphs are nice for getting a general idea, but it's a lot easier to manipulate tables to get concrete results. The table below should show five columns: the first column is the census tract, the second is the points it received based on your survey, the third is income, the fourth is employment, and the fifth is education level. Find your census tract and see whether the income, employment, and education level are what you expected based on your impressions of the neighborhood.

In [None]:
#put all of our data into one table
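As a minimal sketch of what this cell could look like, the five columns described above can be assembled with pandas. Every value below is a made-up placeholder, and the variable name `tracts`, the tract labels, and the units (income in dollars, employment and education as rates between 0 and 1) are assumptions for illustration only; the real table would be built from the class survey and the official data.

```python
import pandas as pd

# Hypothetical placeholder values for illustration only;
# these are NOT real survey results or census figures.
tracts = pd.DataFrame({
    "census tract": ["A", "B", "C", "D", "E"],          # assumed labels
    "points": [8, 3, 6, 9, 2],                          # survey point scale
    "income": [95000, 41000, 67000, 120000, 38000],     # assumed: dollars
    "employment": [0.96, 0.88, 0.93, 0.97, 0.85],       # assumed: rate from 0 to 1
    "education": [0.62, 0.21, 0.45, 0.71, 0.18],        # assumed: rate from 0 to 1
})

# In a notebook, leaving the table as the last expression displays it.
tracts
```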

Let's first analyze income levels. We have sorted the data by income level. Compare the income levels to the neighborhood points. Can you spot a correlation (as one increases or decreases, does the other do the same)?

Did you look at the whole table? A common mistake is to assume that because the top 10 results follow (or do not follow) a pattern, the rest do the same. Real-life data is often messy. Does the correlation continue throughout the whole table (that is, as income decreases, do the points decrease too), or is there no pattern? What does this mean about the data?

What about vice versa? Sort the table by points by changing the column name after "sort" from 'income' to 'points'. Is there still a correlation? Is it stronger or weaker? What does this imply? Does it imply anything?

In [None]:
#sort by income
#
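One possible way to do the sort, sketched here with pandas on a small made-up table (the values and the name `tracts` are hypothetical). If the notebook ends up using a `datascience` Table instead, the analogous call would be along the lines of `table.sort('income', descending=True)`.

```python
import pandas as pd

# Hypothetical placeholder values for illustration only.
tracts = pd.DataFrame({
    "census tract": ["A", "B", "C", "D", "E"],
    "points": [8, 3, 6, 9, 2],
    "income": [95000, 41000, 67000, 120000, 38000],
})

# Sort by income, highest first.
# Change "income" to "points" to sort by neighborhood points instead.
by_income = tracts.sort_values("income", ascending=False)
by_income
```

With these made-up numbers the points happen to fall in step with income; real data will rarely line up this neatly, which is why it matters to read the whole table.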

Now let's analyze education levels and employment. Sort the table by changing the sort column to either "education" or "employment" and try to find correlations between these factors and the neighborhood points.

In [None]:
#maybe we should just comment code and
# let them uncomment it to sort it??? tell them to take out the hashtag and put hashtags around the other ones?

Here is the list of all the census tracts, with one column for each type of data you collected. Right now, the data is sorted by the column 'cigarette butts found' in descending order. To change how it is sorted, replace the column name with the column you want to sort by, e.g. 'children playing'. Make sure the name of the column is in quotes! If you want ascending order, change descending to ascending. You can also change the number of results returned by changing the number inside the list command.

Play around with sorting different columns and attributes. Which column, when sorted, produces results most similar to the official data on the best and worst neighborhoods?

In [None]:
#table of data they collected. Sorted by cigarette butts and top 15 search results.
#was thinking we have examples of how to sort data? the other possibility I could think of was to 
#add more cells with the code written for each data sorted.
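A sketch of how the three knobs described above (column name, sort direction, number of rows) could be exposed, using pandas. The table contents and the names `observations`, `sort_column`, `descending`, and `rows_to_show` are hypothetical; only the two column names come from the text.

```python
import pandas as pd

# Hypothetical observation counts for illustration only.
observations = pd.DataFrame({
    "census tract": ["A", "B", "C", "D", "E"],
    "cigarette butts found": [0, 100, 12, 3, 57],
    "children playing": [6, 0, 4, 9, 1],
})

sort_column = "cigarette butts found"  # e.g. change to "children playing"
descending = True                      # set to False for ascending order
rows_to_show = 3                       # how many rows to display

# Sort by the chosen column, then keep only the first few rows.
top_rows = (
    observations
    .sort_values(sort_column, ascending=not descending)
    .head(rows_to_show)
)
top_rows
```

Keeping the settings in named variables at the top means students only edit three clearly labeled lines rather than the sorting call itself.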

As keen as your eyes are or aren't, a good analysis isn't based on just what you see; you must back it up with numbers and statistics. Sorting columns and counting matches is not a very rigorous measure of how strongly a variable correlates with the results. What does it mean for the variable "income" to match 7 out of the top 15 neighborhoods by points? Does the match hold for the rest of the results? How strong is it? We will now use a method called linear regression to draw the best-fit line through the data. The statistic "R-squared" measures how close the data is to the fitted regression line: 0 means the variable explains none of the variability in the data, while 1 means it explains all of it.
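For reference, this is the standard definition of R-squared. The symbols are introduced here only for illustration: $y_i$ are the observed values (the neighborhood points), $\hat{y}_i$ are the values predicted by the regression line, and $\bar{y}$ is the mean of the observed values.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

The numerator adds up how far each point falls from the line, and the denominator adds up how far each point falls from the overall average. When the line passes through every point the numerator is 0 and R-squared is 1. For a regression on a single variable, R-squared equals the square of the correlation coefficient between the two columns.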

Again, change the column names to choose which variable your linear regression model uses. The graph will give you a better sense of the data as a whole than sorting columns did above. The R-squared value will give you an exact "goodness-of-fit" value for your model.

Why is this a better method than just sorting tables? First, we are now comparing all of the data in the graph to the variable, rather than only what our eyes glance over quickly. It shows a more complete picture than just saying "There are some similar results in the top half of the sorted data". Second, the graph makes it more intuitive to see whether your variable matches the data: you can quickly see whether the data points line up with the regression line. Lastly, the R-squared value gives you a way to quantify how well the variable explains the data.

In [None]:
#linear regression model of x being a column and y being the neighborhoods.
#just to see how well they correlate. Could write all the code out for them
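A minimal sketch of a one-variable linear regression, assuming NumPy's `polyfit` for the least-squares fit and matplotlib for the graph. The numbers are made-up placeholders, and the choice of income as the x variable is only an example; any other column could be swapped in.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values for illustration only; not real survey or census data.
income = np.array([38000, 41000, 67000, 95000, 120000], dtype=float)  # x: one factor
points = np.array([2, 3, 6, 8, 9], dtype=float)                       # y: neighborhood points

# Fit the least-squares line: points is approximately slope * income + intercept.
slope, intercept = np.polyfit(income, points, deg=1)
predicted = slope * income + intercept

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((points - predicted) ** 2)
ss_tot = np.sum((points - points.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Scatter plot of the data with the best-fit line drawn over it.
fig, ax = plt.subplots()
ax.scatter(income, points, label="census tracts")
ax.plot(income, predicted, color="red",
        label=f"best-fit line (R-squared = {r_squared:.2f})")
ax.set_xlabel("income")
ax.set_ylabel("neighborhood points")
ax.legend()
```

To try a different factor, replace the `income` array (and the x-axis label) with another column.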

Let's improve our model even further. Right now we are choosing only one variable to try to explain our data. But in real life, more than one factor contributes to a good or bad neighborhood, so why are we trying to explain it using only one variable? No wonder it's difficult to get the whole picture! We will now use multiple regression with 2 (or 3) variables to make our model even more accurate. Try changing the column names to see how closely your model fits the data now!

In [None]:
#Multiple regression with two or three variables? Having a slider to adjust weights?
#not sure if we could do this if our table only has five rows. Does it really make sense? 
#maybe instead make them use linear regression on different columns other than neighborhood points?
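One way a two-variable regression could be sketched, assuming NumPy's `linalg.lstsq` for the least-squares fit. The arrays are made-up placeholders, and the choice of income and education as the two explanatory variables is only an example.

```python
import numpy as np

# Hypothetical values for illustration only; not real survey or census data.
income = np.array([38, 41, 67, 95, 120, 54, 82, 73], dtype=float)  # assumed: thousands of dollars
education = np.array([0.18, 0.21, 0.45, 0.62, 0.71, 0.40, 0.48, 0.55])
points = np.array([2, 3, 6, 8, 9, 4, 7, 6], dtype=float)

# Design matrix: a column of ones for the intercept,
# then one column per explanatory variable.
X = np.column_stack([np.ones_like(income), income, education])

# Least-squares fit: points is approximately X @ coefficients.
coefficients, *_ = np.linalg.lstsq(X, points, rcond=None)
predicted = X @ coefficients

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((points - predicted) ** 2)
ss_tot = np.sum((points - points.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("intercept, income weight, education weight:", coefficients)
print("R-squared:", round(r_squared, 3))
```

A caution when reading the result: adding another variable to a least-squares model can never lower R-squared, so with only a handful of census tracts a high value may say more about the small sample than about the neighborhoods.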