<a href="https://colab.research.google.com/github/AbstractMonkey/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/module1-afirstlookatdata/LS_DS_111_A_First_Look_at_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School Data Science - A First Look at Data



## Lecture - let's explore Python DS libraries and examples!

The Python Data Science ecosystem is huge. You've seen some of the big pieces - pandas, scikit-learn, matplotlib. What parts do you want to see more of?

In [0]:
# TODO - we'll be doing this live, taking requests
# and reproducing what it is to look up and learn things

## Assignment - now it's your turn

Pick at least one Python DS library and, using its documentation and examples, reproduce something cool in this notebook. It's OK if you don't fully understand it or get it 100% working, but do put in the effort and look things up.

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
!apt-get -qq install libgeos-dev
!pip install -qq https://github.com/matplotlib/basemap/archive/master.zip
from mpl_toolkits.basemap import Basemap, cm

In [0]:
from google.colab import files

files.upload()

In [0]:
!ls

In [0]:
df = pd.read_csv('environmental-remediation-sites.csv')
df.head()
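Since the map below relies on the `Latitude` and `Longitude` columns, it can be worth checking how pandas parsed them before plotting. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative and not taken from the real CSV); `pd.to_numeric` with `errors='coerce'` turns unparseable entries into `NaN` so they can be dropped.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: coordinates read in as strings,
# with one unparseable entry and one missing value.
sample = pd.DataFrame({
    'Latitude':  ['42.65', '40.71', 'unknown', None],
    'Longitude': ['-73.75', '-74.00', '-76.15', '-78.88'],
})

# Columns holding strings show up with a non-numeric dtype.
print(sample.dtypes)

# Convert to floats; anything that cannot be parsed becomes NaN.
coords = sample[['Latitude', 'Longitude']].apply(pd.to_numeric, errors='coerce')

# Keep only rows where both coordinates are present.
clean = coords.dropna(subset=['Latitude', 'Longitude'])
print(clean)
```

On the real data, the same two steps could be applied to `df` before building the `lats` and `lons` lists, assuming the column names match.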

In [0]:
from matplotlib.colors import LinearSegmentedColormap

m = Basemap(projection='ortho', lon_0=-76.25, lat_0=42.5, resolution='h',
            llcrnrx=-550*550, llcrnry=-550*550,
            urcrnrx=+625*625, urcrnry=+625*625)

m.drawcoastlines()
m.drawcountries()
m.drawstates()

lats = df['Latitude'].tolist()
lons = df['Longitude'].tolist()

# ######################################################################
# Using the heatmap code from http://bagrow.com/dsv/heatmap_basemap.html
# on this dataset. Credit to James Bagrow, james.bagrow@uvm.edu
#
# ######################################################################
# bin the site locations (adapted from
# http://stackoverflow.com/questions/11507575/basemap-and-density-plots)
#
# compute appropriate bins to chop up the data:
db = 1 # bin padding
lon_bins = np.linspace(min(lons)-db, max(lons)+db, 20+1) # 20 bins
lat_bins = np.linspace(min(lats)-db, max(lats)+db, 20+1) # 20 bins
    
density, _, _ = np.histogram2d(lats, lons, [lat_bins, lon_bins])

# ######################################################################
# Turn the lon/lat of the bins into 2 dimensional arrays ready
# for conversion into projected coordinates
lon_bins_2d, lat_bins_2d = np.meshgrid(lon_bins, lat_bins)

# convert the bin mesh to map coordinates:
xs, ys = m(lon_bins_2d, lat_bins_2d) # will be plotted using pcolormesh
# #####################################################################

# define custom colormap, white -> nicered, #E6072A = RGB(0.9,0.03,0.16)
cdict = {'red':  ( (0.0,  1.0,  1.0),
                   (1.0,  0.9,  1.0) ),
         'green':( (0.0,  1.0,  1.0),
                   (1.0,  0.03, 0.0) ),
         'blue': ( (0.0,  1.0,  1.0),
                   (1.0,  0.16, 0.0) ) }
custom_map = LinearSegmentedColormap('custom_map', cdict)

# add histogram squares and a corresponding colorbar to the map,
# passing the colormap object directly (plt.register_cmap is no longer
# available in newer matplotlib releases):
plt.pcolormesh(xs, ys, density, cmap=custom_map)

cbar = plt.colorbar(orientation='horizontal', shrink=0.625, aspect=20, fraction=0.2,pad=0.02)
cbar.set_label('Number of brownfield sites',size=18)
#plt.clim([0,100])

# translucent blue scatter plot of the sites above the histogram:
x,y = m(lons, lats)
m.plot(x, y, 'o', markersize=5,zorder=6, markerfacecolor='#424FA4',markeredgecolor="none", alpha=0.33)


plt.gcf().set_size_inches(20,20)


plt.show()
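The binning step in the cell above can be tried in isolation, without Basemap. This is a small sketch on made-up coordinates (not from the dataset) showing that `np.histogram2d` returns one count per bin, that the first argument indexes rows, and that the `np.meshgrid` of the bin edges has one more point per axis than the density array, which is the layout `pcolormesh` accepts for cell corners.

```python
import numpy as np

# Made-up coordinates for illustration only.
lats = [42.1, 42.2, 42.9, 43.5, 40.7]
lons = [-76.0, -76.1, -75.2, -73.9, -74.0]

db = 1        # padding around the data, in degrees
n_bins = 4    # fewer bins than the notebook's 20, to keep the output small

# n_bins bins need n_bins + 1 edges.
lon_bins = np.linspace(min(lons) - db, max(lons) + db, n_bins + 1)
lat_bins = np.linspace(min(lats) - db, max(lats) + db, n_bins + 1)

# Latitude is passed first, so rows of `density` correspond to latitude bins
# and columns to longitude bins.
density, _, _ = np.histogram2d(lats, lons, [lat_bins, lon_bins])

# The mesh of bin edges marks the corners of each histogram cell.
lon_bins_2d, lat_bins_2d = np.meshgrid(lon_bins, lat_bins)

print(density.shape)       # (4, 4)
print(lon_bins_2d.shape)   # (5, 5)
print(density.sum())       # 5.0
```

Because the padding `db` extends the edges past the data on every side, every point falls into some bin and the counts sum to the number of points.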

In [0]:
# What are the most common contaminants?

df.Contaminants.value_counts().head(25)

In [0]:
# What counties have the most brownfield sites?

df.Counties.value_counts().head(10)

In [0]:
# Number of unique counties in this dataset. The result suggests that every
# county has at least one brownfield site, since New York has 62 counties
# (it is unclear where the 63rd value comes from).

len(df['Counties'].unique())
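One possible explanation for the extra county, offered here as an assumption rather than something checked against the data, is that `unique()` counts missing values as their own entry, or that the same county appears with two spellings. A minimal sketch on a made-up Series shows how both cases could be checked:

```python
import pandas as pd

# Made-up county column with a missing value and an inconsistent spelling.
counties = pd.Series(['Albany', 'Erie', 'erie ', None, 'Albany'])

# unique() keeps the missing value as a separate entry...
print(len(counties.unique()))    # 4

# ...while nunique() ignores missing values by default.
print(counties.nunique())        # 3

# Normalizing whitespace and case merges spelling variants.
normalized = counties.str.strip().str.title()
print(normalized.nunique())      # 2
```

On the real data, `df['Counties'].isna().sum()` would report how many rows have no county recorded.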

### Assignment questions

After you've worked on some code, answer the following questions in this text block:

1.  Describe in a paragraph of text what you did and why, as if you were writing an email to somebody interested but nontechnical.

I mapped out all 70,254 brownfield sites in New York State using the Python library basemap. This is useful because it shows visually where brownfield sites are most concentrated. You can use this map to spot trends and idiosyncrasies in the data, e.g. the line of brownfield sites that follows the Erie Canal, which suggests that polluting businesses have operated alongside it. Check it out as a kernel on Kaggle: https://www.kaggle.com/pkscary/brownfield-sites-mapped-over-new-york-state

2.  What was the most challenging part of what you did?

Learning how to process my data so that it is compatible with libraries like seaborn and basemap.

3.  What was the most interesting thing you learned?

I learned how important it is to make sure that you are using the correct function for the data type you are working with! I was surprised to see that, as I processed the data, it might be interpreted as a string, int, float, object, etc., depending on what I was doing.

4.  What area would you like to explore with more time?

I would love to map more datasets! Using mapping as a form of data visualization is fun.




## Stretch goals and resources

The following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub (and since this is the first assignment of the sprint, open a PR as well).

- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [scikit-learn documentation](http://scikit-learn.org/stable/documentation.html)
- [matplotlib documentation](https://matplotlib.org/contents.html)
- [Awesome Data Science](https://github.com/bulutyazilim/awesome-datascience) - a list of many types of DS resources

Stretch goals:

- Find and read blogs, walkthroughs, and other examples of people working through cool things with data science - and share with your classmates!
- Write a blog post (Medium is a popular place to publish) introducing yourself as somebody learning data science, and talking about what you've learned already and what you're excited to learn more about.