# Sociology 130AC Module: "The Neighborhood Project"

Welcome to the data science part of your project! You have gathered data from census tracts and entered it [here](https://docs.google.com/forms/d/e/1FAIpQLSeB-QAmszPXRX6Xp20tddMgivdlF3SW6pli5NdOvTBQZ7gt6A/viewform?usp=sf_link). Now it's time to explore our class data and quantify our observations using Python. You do not need any prior programming knowledge to do this. The purpose of this module is not to teach you programming, but to show you the power of data science tools and how we can use them for data analysis.

First, we have to import basic data science libraries. These libraries let us manipulate the data easily and create visualizations. If you are interested in what these are: NumPy is a scientific computing library, pandas is a data analysis library, and matplotlib is a data visualization library. Several of the other libraries, such as Folium, also help with visualization.

In [1]:
!pip install --no-cache-dir -U -q folium
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
from datascience import *
from ipywidgets import *
from IPython.display import HTML, display
from folium import plugins
from matplotlib.pyplot import imread  # scipy.ndimage.imread was removed in SciPy 1.2
%matplotlib inline
from sklearn import linear_model
from scipy import stats
import os
os.system("rm -rf images")
os.mkdir("images")

In [2]:
gdoc_key = "1D-TeB9S_qjuTJ-Jp_LBNAITAYN-uVxr1kRFM9Tg-K0Y"
spreadsheet_url = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv'.format(gdoc_key)
table = pd.read_csv(spreadsheet_url)
table

Unnamed: 0,Timestamp,Email Address,Name,Census Tract,Group members,"On a scale of 1 - 5, where 1 is ""None"" and 5 is ""A Lot"", are there empty beer or liquor bottles visible in streets, yards, or alleys? (Questions 6 on first page)","On a scale of 1 - 5, where 1 is ""None"" and 5 is ""A Lot"", are there cigarette or cigar butts or discarded cigarette packages on the sidewalk or in the gutters? (Question 7 on first page)","On a scale of 1 - 5, where 1 is ""None"" and 5 is ""A Lot"", are there condoms on the sidewalk, in the gutters, or street of block face? (Question 8 on first page)","On a scale of 1 - 5, where 1 is ""None"" and 5 is ""A Lot"", is there garbage, litter, or broken glass in the street or on the sidewalks? (Question 10 on first page)","On a scale of 1 - 5, where 1 is ""None"" and 5 is ""A Lot"", are there abandoned cars in the neighborhood? (Question 11 on first page)","On a scale of 1-5 where 1 is ""Friendly Responses / Greetings / Helpful"" and 5 is ""Treated with Suspicion"", how were you regarded by the people in the block face? (Question 13 on first page)","On a scale of 1 - 4, where 1 is ""Very well kept / good condition"" and 4 is ""Poor / badly deteriorated condition"", in general, how would you rate the condition of buildings on the block face? (includes residential buildings, recreational facilities, manufacturing plants, business / industrial headquarters, etc)","Is there graffiti or evidence of graffiti that has been painted over on buildings, signs, or walls? (Questions 22-23)","On a scale of 1 - 4, where 1 is ""No fencing"" and 4 is ""High mesh fencing with barbed wire or spiked tops"", is there fencing and what kind? (Question 12, but includes not just commercial / industrial properties, but all property)",Other thoughts or comments,Images
0,10/5/2017 20:25:08,chench@berkeley.edu,Chris,4213,test,1,3,2,3,2,3,2,Yes,2,test,https://drive.google.com/open?id=0B4-VwdHxL-jC...
1,10/6/2017 11:14:54,chench@berkeley.edu,henchtest2,4213,,1,3,2,3,2,3,4,No,2,test,https://drive.google.com/open?id=0B4-VwdHxL-jC...


In [3]:
for index, row in table.iterrows():
    census_tract = row["Census Tract"]
    urls = row["Images"].split(", ")
    for u in urls:
        fid = u.split("id=")[-1]
        os.system("curl -L -o images/{}.jpg 'https://drive.google.com/uc?export=download&id={}'".format(str(census_tract) + "---" + fid, fid))

After uploading all of the data from the class, we will use a library called Folium to map your observations onto a map of the census tracts. Click on a marker below to see a pop-up of the data at a particular point. Try to find the census tract you visited and see if the data you collected is there! 

Next, click around census tracts near yours to see if the other students' observations are similar and see if you can eyeball any trends. Check out other areas on the map and see if there are trends for tracts in specific areas. Do specific characteristics cluster in particular areas? Which ones? Which characteristics seem to cluster together? What types of data do you think will correlate with socioeconomic characteristics like median income, poverty rate, and education? Why?

After you have made some predictions, we will compare our data with socioeconomic data from the U.S. Census for the different tracts we visited and see if we can find evidence to support them. From your data we have created some point scales that measure different aspects of a neighborhood. For example, we have made a scale called “social disorder” and another called “amenities” based on some of the data you collected from the coding sheet. We will compare your data to the census data. Let's get started!

In [4]:
OAKLAND_COORDINATES = (37.8044, -122.2711)  # center the map on Oakland
myMap = folium.Map(location=OAKLAND_COORDINATES, zoom_start=12)

#Folium formats popup windows in html format
html_1="""
    <h3>Good Neighborhood</h3>
    <img
       src = http://images.glaciermedia.ca/polopoly_fs/1.1827657.1429305764!/fileImage/httpImage/image.jpg_gen/derivatives/original_size/vancouver-single-family-home-neighbourhood-street.jpg
       style="width:180px;height:128px;"
       >
    <p>
       "This neighborhood seemed really nice to our group"
    </p>
    <p>
       # of cigarette butts: 0
    </p>
    """

html_2="""
    <h3>Bad Neighborhood</h3>
    <img
        src = http://static.lakana.com/nxsglobal/lasvegasnow/photo/2015/06/23/bad%20neighborhood_1435097179455_1625324_ver1.0_640_360.jpg
        style="width:180px;height:128px;"
        >
    <p>
        "This neighborhood didn't seem safe"
    </p>
    <p>
        # of cigarette butts: 100
    </p>
    """


folium.Marker([37.8200, -122.2427], 
              popup=folium.Popup(folium.IFrame(html=html_1, width=200, height=300), max_width=2650)).add_to(myMap)
folium.Marker([37.7990, -122.2727], 
              popup=folium.Popup(folium.IFrame(html=html_2, width=200, height=300), max_width=2650)).add_to(myMap)

lat_shift = 0.058
lon_shift = 0.090

oakland_lat = 37.8200
oakland_lon = -122.2427

data = imread('./Census Map.PNG')
myMap.add_child(plugins.ImageOverlay(data, opacity=0.9, \
        bounds =[[oakland_lat - lat_shift, oakland_lon - lon_shift], [oakland_lat + lat_shift, oakland_lon + lon_shift]]))

myMap
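
The two popups above are written by hand. As a rough sketch of how the same HTML could be generated from a survey row instead, the hypothetical helper `popup_html` below builds the string from a title, a comment, and a dictionary of counts. The function name and its parameters are assumptions made for illustration; they are not part of Folium or of the class notebook. The example values are taken from the first row of the survey table.

```python
from html import escape


def popup_html(title, comment, counts):
    """Build the HTML string for one marker popup.

    title, comment and counts are hypothetical inputs: in practice they
    would come from one row of the class survey table.
    """
    # Escape user-entered text so stray < or & characters cannot break the popup.
    lines = [
        "<h3>{}</h3>".format(escape(title)),
        "<p>{}</p>".format(escape(comment)),
    ]
    # One paragraph per observed quantity, e.g. "cigarettes: 3".
    for name, value in counts.items():
        lines.append("<p>{}: {}</p>".format(escape(name), value))
    return "\n".join(lines)


# Values copied from the first row of the survey table shown earlier.
example = popup_html("Tract 4213", "test", {"cigarettes": 3, "garbage": 3})
print(example)
```

The returned string could then be passed as the `html` argument of `folium.IFrame`, in the same way `html_1` and `html_2` are used above.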

First let's put all of the data into a table. Visualizations like the map above are nice for getting a general idea, but it's a lot easier to manipulate tables to get concrete results. This is the table of all of your data: the row index is the number of the census tract, and each column represents a variable you collected data about.

In [5]:
#This is for me since I can't get the datascience to work.
# import pandas as pd
# import numpy as np
# from sklearn import linear_model
# from scipy import stats
# import matplotlib.pyplot as plt
# %matplotlib inline

#put all of our data into one table
class_data = pd.DataFrame({
                    "empty bottles": np.array([1,0,0,1,1]),
                     "cigarettes": np.array([0,1,1,0,0]),
                     "condoms": np.array([1,0,1,1,1]),
                     "garbage": np.array([1,0,0,1,1]),
                     "cars abandoned": np.array([1,1,0,0,1]),
                     "friendliness": np.array([3,2,4,1,5]),
                     "condition": np.array([1,0,3,1,4]),
                     "graffiti": np.array([1,1,0,0,0]),
                     "fence": np.array([1,2,4,4,1])
                    },index = [4201, 4202, 4203, 4204,4205])
class_data





Unnamed: 0,cars abandoned,cigarettes,condition,condoms,empty bottles,fence,friendliness,garbage,graffiti
4201,1,0,1,1,1,1,3,1,1
4202,1,1,0,0,0,2,2,0,1
4203,0,1,3,1,0,4,4,0,0
4204,0,0,1,1,1,4,1,1,0
4205,1,0,4,1,1,1,5,1,0


This is the data from American FactFinder.

In [6]:
#unemployment percent is for 16 years and older
#education: percent of people who have bachelors degree or higher
#income:household median income
official_data = pd.read_csv("Total Data.csv", index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
official_data


Census Tract,Unemployment %,Household Median Income,Bachelor's Degree or higher %
4005.0,6.5,76038,61.9
4006.0,14.9,60804,55.3
4007.0,10.5,39614,40.9
4010.0,9.2,44766,33.2
4014.0,16.2,28532,18.6
4027.0,8.3,40169,28.9
4028.0,14.9,17278,33.8
4030.0,12.6,17609,24.6
4031.0,7.0,59250,38.5
4033.0,8.9,57064,46.5


Let's analyze the data we are using to represent social disorder. If the number under a category is higher, we theorize that it represents a higher state of social disorder. Therefore, we will compress all of the data into a single social disorder number by adding up the points in each category. We will then compare this number to the census data and try to observe patterns. It is easier to compare one social disorder number to the census data than to compare each variable separately.

The table below should show five columns: the first column is the census tract, the second column is the social disorder points it received based on your survey, the third column is income, the fourth column is unemployment, and the fifth column is education level. Find your census tract and see if the income, unemployment, and education level are what you expected based on your thoughts about the neighborhood.

In [7]:
combined_data = pd.DataFrame(columns=[ "Social Disorder", "Income", "Unemployment", "Education"])

#add up all of the points for each tract, then build a new table with the points as well as the official data
for tract in class_data.index.values:
    points = sum(class_data.loc[tract])
    income = official_data.loc[tract]["Household Median Income"]
    unemployment = official_data.loc[tract]["Unemployment %"]
    education = official_data.loc[tract]["Bachelor's Degree or higher %"]
    combined_data.loc[tract]= [points, income, unemployment, education]

combined_data
    
    
    


Unnamed: 0,Social Disorder,Income,Unemployment,Education
4201,10.0,117083.0,8.0,75.9
4202,7.0,82206.0,8.9,69.4
4203,13.0,85119.0,2.9,73.1
4204,9.0,38689.0,6.9,83.8
4205,14.0,80404.0,9.7,62.6
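
The loop above adds up the points one tract at a time. As an alternative sketch (not the notebook's own method), pandas can compute the same row totals in one step with `DataFrame.sum(axis=1)`. The table is rebuilt here from the values shown earlier so the example runs on its own; the variable name `social_disorder` is introduced only for illustration.

```python
import pandas as pd

# Same toy survey table as in the earlier cell (tract number as the row index).
class_data = pd.DataFrame(
    {
        "empty bottles": [1, 0, 0, 1, 1],
        "cigarettes": [0, 1, 1, 0, 0],
        "condoms": [1, 0, 1, 1, 1],
        "garbage": [1, 0, 0, 1, 1],
        "cars abandoned": [1, 1, 0, 0, 1],
        "friendliness": [3, 2, 4, 1, 5],
        "condition": [1, 0, 3, 1, 4],
        "graffiti": [1, 1, 0, 0, 0],
        "fence": [1, 2, 4, 4, 1],
    },
    index=[4201, 4202, 4203, 4204, 4205],
)

# axis=1 sums across the columns, giving one social disorder score per tract.
social_disorder = class_data.sum(axis=1)
print(social_disorder)
```

The totals should match the "Social Disorder" column in the combined table above.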


In [8]:
def f(variable_name, tract):
    x = combined_data[variable_name]
    
    plt.hist(x)
    plt.axvline(x=combined_data.loc[tract, variable_name], color = "RED")
    plt.xlabel(variable_name, fontsize=18)
    plt.show()
    
interactive_plot = interactive(f, variable_name=list(combined_data), tract=combined_data.index)
output = interactive_plot.children[-1]
output.layout.height = '350px'
interactive_plot


Let's first analyze income levels. We have sorted the data according to income level. Compare the income levels to the level of social disorder. Is there a correlation you can spot (as one increases or decreases, does the other do the same)?

Did you look at the whole table? A common mistake is to assume that because the top 10 results do or do not follow a pattern, the rest behave the same way. Real-life data is often messy. Does the correlation continue throughout the whole table (that is, as income decreases, do the points decrease too), or is there no pattern? What does this mean about the data?


In [9]:
#sort by income
combined_data.sort_values("Income")

Unnamed: 0,Social Disorder,Income,Unemployment,Education
4204,9.0,38689.0,6.9,83.8
4205,14.0,80404.0,9.7,62.6
4202,7.0,82206.0,8.9,69.4
4203,13.0,85119.0,2.9,73.1
4201,10.0,117083.0,8.0,75.9


Now let's analyze education levels and unemployment using tables sorted by each. To sort by either one, delete the hashtag at the start of the line you want to run below. A hashtag marks the line as a comment, which means it will not run. When you remove the hashtag, the line will run. Put the hashtag back to turn the line into a comment again.

In [10]:
#combined_data.sort_values("Unemployment")
#combined_data.sort_values("Education")

Now do some exploring on your own. Here is the list of all of the census tracts, where every column is a type of data that you collected. Right now, the data is sorted by the column 'cigarette butts found' in descending order. To change how it is sorted, simply change the column name to the column you want to sort by, e.g. 'children playing'. Make sure the name of the column is in quotes! If you want ascending order, change descending to ascending. You can also change the number of results returned by changing the number inside the list command.
Play around with sorting different columns and attributes.
What patterns do you observe?

In [11]:
#This part of the project is not very useful right now because most of the data is 1s or 0s, so there is little point in sorting.
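
As a minimal sketch of the kind of sorting described above, the example below sorts a small table by one column in descending order and keeps the first few rows. It assumes the short column names used in `class_data` (here 'friendliness' and 'cigarettes'), and it uses `head(3)` as a stand-in for "the number of results"; the variable name `top` is hypothetical.

```python
import pandas as pd

# A few columns from the toy class table, indexed by census tract.
class_data = pd.DataFrame(
    {
        "cigarettes": [0, 1, 1, 0, 0],
        "friendliness": [3, 2, 4, 1, 5],
        "fence": [1, 2, 4, 4, 1],
    },
    index=[4201, 4202, 4203, 4204, 4205],
)

# Sort by one column, largest values first, and keep the top 3 rows.
# Use ascending=True for smallest-first, or change the column name in quotes.
top = class_data.sort_values("friendliness", ascending=False).head(3)
print(top)
```

Changing `"friendliness"` to another column name, or `3` to another number, changes what is shown without altering the original table.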

Eyeballing patterns is not the same as a statistical measure of a correlation; you must quantify it with numbers and statistics to support your claims. Sorting a table is not a very rigorous measure of how strongly a variable correlates with the results. What does it mean for a variable like "income" to match 7 out of the top 15 social disorder points? Does the match hold for the rest of the results? How strong is the correlation?

We will now use a method called linear regression to draw the best-fit line through the data. The slope of the line shows whether the two variables are positively or negatively correlated. The value "r squared" is a measure of how close the data is to the fitted regression line: 0 means the variable explains none of the variability of the data, while 1 means it explains all of the variability in the data.
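
To make the idea of "r squared" concrete, here is a small sketch that fits a straight line and computes r squared by hand, using plain Python so that every step is visible. The helper names `fit_line` and `r_squared` are made up for this illustration, and the numbers are the income and social disorder values from the combined table above. It uses the standard definition: one minus the ratio of the leftover (residual) variation to the total variation.

```python
# Income and social disorder values copied from the combined table above.
income = [117083.0, 82206.0, 85119.0, 38689.0, 80404.0]
disorder = [10.0, 7.0, 13.0, 9.0, 14.0]


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Share of the variation in y that the fitted line accounts for."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    # Variation left over after using the line to predict y.
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    # Total variation of y around its mean.
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(income, disorder)
print("slope:", slope)
print("r squared:", r_squared(income, disorder))
```

With only five tracts in this toy table, the result says very little; the point is only to show what the number measures.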

We want to plot the change in points with respect to a certain variable (like education or income). Therefore, the Y axis should be social disorder and the X axis should be the variable that we want to analyze. Right now, the x variable is set to "Income". The graph will give you a better sense of the whole data set than sorting columns like you did above. The R-squared value will give you an exact "goodness-of-fit" value for your model.

To change the X axis, you need to change the x variable from "Income" to either "Education" or "Unemployment" (or another one of the census variables). Make sure you don't delete the quotation marks and remember to capitalize the first letter!

Why is this a better method than just sorting tables? First, we are now comparing all of the data in the graph to the variable, rather than only the rows our eyes glance over quickly. It shows a more complete picture than just saying "There are some similar results in the top half of the sorted data". Second, the graph gives a more intuitive sense of whether your variable matches the data. You can quickly see if the data points line up with the regression line. Lastly, the r-squared value gives you a way to quantify how well the variable explains the data.

One of the beautiful things about computer science and statistics is that you do not need to reinvent the wheel. You don't need to know how to calculate the r_squared value or draw the regression line; someone has already implemented it! You simply need to tell the computer to calculate it. However, if you are interested in these mathematical models, take a data science or statistics course!

In [12]:
def f(x_variable, y_variable):
    x = combined_data[x_variable]
    y = combined_data[y_variable]

    plt.scatter(x, y)
    plt.xlabel(x_variable, fontsize=18)
    plt.ylabel(y_variable, fontsize=18)
    plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color="r") #calculate line of best fit
    plt.show()
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y) #gets the r_value
    print("The r squared coefficient is: ", r_value**2)
    
interactive_plot = interactive(f, x_variable=list(combined_data), y_variable=list(combined_data))
output = interactive_plot.children[-1]
output.layout.height = '350px'
interactive_plot


To recap, we did 3 things to analyze our real-life data:

1) We observed the data on a map where we could click on various points and see the data. This gives an intuitive sense of how the data is spread out across different census tracts and what we can expect from our analyses. Having a physical picture of what the data could be representing is often a better first step than jumping straight into the numbers.

2) We observed the data in table form. As nice as maps and graphs are, it is not possible to sort them. Therefore, we put our data into table form, compressed it into a single score, and looked for correlations with known statistics. It is important to compare your collected results to known statistics. There is no point in collecting data if you cannot measure it against some standard.

3) Instead of eyeballing the table, we created a graph and computed the line of best fit. We also got an r_squared value, which measures how "good" the line of best fit is. This represents all of the data points more accurately than just looking at tables.

We hope you learned how to quantify your observations in a mathematical way and had some fun manipulating data!
